For this week’s practicum I looked at several different open source text mining tools, comparing the results with each other and with my expectations. The terms I chose were “Army,” “Navy,” and “Military.” I was looking to see if there was any significant change over time in the preference among the three, as well as in their overall usage.
Searching the terms gave similar results in the Google n-gram viewer and the NYT Chronicle, and showed patterns that one might expect:
All three terms follow parallel tracks (Army most used, followed by Military, with Navy less common), spiking at times of war. Towards the latter part of the 20th century, around 1960 in both viewers, the word military became more prevalent than army.
For the matching results from the NYT Chronicle, click the link below:
The NYT Chronicle is especially useful because it allows you to interact with the sources behind the results. For example, clicking on the peak in the use of the word “Army” will bring up the 6,281 articles that mention that word during 1862 (the same year shows 4,574 results for Military and 2,257 for Navy).
Putting the same terms into Bookworm and searching the Chronicling America files reveals similar trends but with small variations. There is a spike in both Army and Military (but not Navy) around 1840 that doesn’t appear in the Google n-gram viewer and isn’t covered by the NYT Chronicle (which starts in the 1860s). Also, unlike in the other two tools, the term military never overtakes army in usage, other than during a military-specific spike in the 1870s, when army and navy stay flat.
For the Bookworm results, click the link below:
Bookworm also lets you access the articles behind its Chronicling America ngram, revealing that the spike in 1840 is caused largely by articles from the Illinois Free Trader, with every article on the first three pages of results coming from that single journal. This spike in references to the military in Illinois may be related to the conflicts with Mormons that were occurring at that time.
The final text mining tool I looked at was Voyant, which, rather than searching a set corpus like the three tools above, will evaluate whatever text you put into it. For this experiment I used three files from the NY Gettysburg Monuments Commission (NY_Gettysburg_1.1, NY_Gettysburg_1.2, and NY_Gettysburg_2.1).
Voyant allows you to look at the text mining results in several different ways, and I experimented with bubblelines, collocate clusters, and correspondence analysis. These tools all helped visualize the data in different ways, but they provided much more complex results than the ngram viewers discussed above, and I frequently had trouble understanding what I was looking at and whether it meant anything.
The five most common words in the corpus were brigade, corps, regiment, new, and York, none of which are very surprising. In the bubbleline results, it appears that there are two sections in the 1.2 file where there is less discussion of military units, but other than that all of these words were relatively common throughout the files. It seems as if these files may be too similar to each other to show much contrast.
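To make the Voyant results a little less of a black box, here is a rough Python sketch of the two ideas behind them: a ranked word count (like the most-common-words list) and a count of one term across slices of a text (roughly what a bubbleline shows). This is not Voyant's actual code. The stopword list, the function names, and the sample sentence are all my own assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; Voyant uses its own, much longer list).
STOPWORDS = {"the", "of", "and", "a", "to", "in", "at", "was", "on", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, n=5):
    """Return the n most frequent non-stopword tokens with their counts."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return counts.most_common(n)


def segment_counts(text, term, segments=10):
    """Count a term in consecutive, roughly equal slices of the text.

    Returns at most `segments` counts, in document order. Low counts in a
    run of slices correspond to the sparse stretches on a bubbleline.
    """
    tokens = tokenize(text)
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    term = term.lower()
    return [tokens[i:i + size].count(term) for i in range(0, len(tokens), size)]


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the Gettysburg files.
    sample = "The brigade joined the corps. The regiment and the brigade advanced."
    print(top_words(sample, n=3))
    print(segment_counts(sample, "brigade", segments=2))
```

Running the same two functions over the full text of each Gettysburg file would be one way to check whether the quiet stretches in the 1.2 file show up as low counts for brigade, corps, and regiment.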