For this week’s practicum I looked at several different open source text mining tools, comparing the results with each other and with my expectations. The terms I chose were “Army,” “Navy,” and “Military.” I was looking to see if there was any significant change over time in the preference among the three, as well as in their overall usage.

Searching the terms gave similar results in the Google n-gram viewer and the NYT Chronicle, and showed patterns that one might expect:

All three terms run along parallel tracks (Army most used, followed by Military, with Navy less common), spiking at times of war. Towards the latter part of the 20th century, around 1960 for both viewers, the word military became more prevalent than army.

For the matching results from the NYT Chronicle, click the link below:

http://chronicle.nytlabs.com/?keyword=army.navy.military

The NYT Chronicle is especially useful because it allows you to interact with the sources behind the results. For example, clicking on the peak of the use of the word “Army” will bring up the 6,281 articles that mention that word during 1862 (the same year shows 4,574 results for military and 2,257 for Navy).
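To make the comparison between the three terms a little more concrete, here is a minimal Python sketch that turns the 1862 article counts quoted above into each term's share of the combined mentions. The function name `term_shares` is my own, and this is only a comparison of the three terms with each other. It is not how the Chronicle itself scales its graph, which would require the total number of articles published that year.

```python
def term_shares(counts):
    """Return each term's share of the combined mentions for one year."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when no term appears at all.
        return {term: 0.0 for term in counts}
    return {term: n / total for term, n in counts.items()}


# Article counts for 1862 as reported by the NYT Chronicle (see above).
counts_1862 = {"army": 6281, "military": 4574, "navy": 2257}

shares_1862 = term_shares(counts_1862)

# Print the terms from most to least used.
for term, share in sorted(shares_1862.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<9}{share:.1%}")
```

By this measure “Army” accounts for just under half of the combined mentions in 1862. Running the same calculation for each year would show when the order of the terms changes.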

 

Putting the same terms into Bookworm and searching the Chronicling America files reveals similar trends, but with small variations. There is a spike in both Army and Military (but not Navy) around 1840 that doesn’t appear in the Google n-gram viewer and isn’t covered by the NYT Chronicle (which starts in the 1860s). Also, unlike in the other two tools, the term military never overtakes army in usage, other than during a military-specific spike (army and navy stay flat) in the 1870s.

For the Bookworm results, click the link below:

Bookworm

Bookworm also lets you access the articles behind its Chronicling America ngram, revealing that the spike in 1840 is caused largely by articles from the Illinois Free Trader, with every article on the first three pages of results coming from that single journal. This spike in references to military in Illinois may be related to the conflicts with Mormons that were occurring at that time.

The final text mining tool I looked at was the Voyant Tool, which, rather than searching a set corpus like the three tools above, will evaluate whatever text you put into it. For this experiment I used three files from the NY Gettysburg Monuments Commission (NY_Gettysburg_1.1, NY_Gettysburg_1.2, and NY_Gettysburg_2.1).

Voyant allows you to look at the text mining results in several different ways, and I experimented with bubblelines, collocate clusters, and correspondence analysis. These tools all helped visualize the data in different ways, but they provide much more complex results than the ngram viewers discussed above, and I frequently had trouble understanding what I was looking at and whether it meant anything.

The five most common words in the corpus were brigade, corps, regiment, new, and York, none of which are very surprising. When looking at the bubbleline results it appears that there are two sections in the 1.2 file where there is less discussion of military units, but other than that all of these words were relatively common throughout the files. It seems as if these files may be too similar to get much in the way of contrast.
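For anyone curious what a “most common words” list involves under the hood, the following is a rough Python sketch of the basic idea: split the text into words, drop a few very common function words, and count what is left. The sample sentence and the short stopword list are made up for illustration (they are not taken from the Gettysburg files), and Voyant's own processing is certainly more elaborate than this.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "at", "was", "on", "by"}


def top_terms(text, n=5):
    """Return the n most frequent non-stopword terms as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Invented sample text standing in for a real document.
sample = (
    "The regiment of the New York brigade joined the corps. "
    "The brigade held the line and the regiment was relieved by the corps brigade."
)

for word, count in top_terms(sample, n=3):
    print(word, count)
```

Note that lowercasing splits “New York” into two separate entries, “new” and “york,” which is the same effect visible in the Voyant list above.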


This week’s readings were useful in describing the tools available in text mining and topic modeling, and also some of the important considerations in their use. These tools are closely related to the keyword searches we looked at last week, but a little bit more advanced. As with our readings on keyword searches, some of the most valuable elements of this week’s readings were less the nuts and bolts of the tools than some of the issues in their use and the discussion of the need for a proper understanding of their implications and methodology.

As Ted Underwood points out in his article, “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago,” it is important to consider searches and text mining as a “philosophical discourse” rather than just as tools. This means seriously considering how we approach their use, and understanding the implications of their use rather than just using them as a faster route to the same results we would seek with more traditional research methods. A large element of this is, as Underwood points out, that we need to understand how the algorithm behind the search engine is producing results (i.e., what does it define as relevant). Without this understanding, it is very easy to conduct searches that will never produce alternate theses, or keep trying new searches until ANY thesis eventually produces enough results to be judged “supported.”

Corollary to this is a methodological approach recommended both by Underwood and by Frederick Gibbs and Daniel Cohen: rather than using text mining to provide evidence for a defined thesis, it can be used instead as an open-ended investigation. By the use of text mining and “distant reading,” a volume of sources that would be impossible to compare using traditional methods can be studied in a way that reveals patterns otherwise undetectable. In Gibbs and Cohen’s words, this method can provide “signposts toward further explanation rather than conclusive evidence.” According to Robert Nelson, the same results can be achieved from topic modeling, which allows digital historians to “detect patterns within not a sampling but the entirety of an archive.”

Finally, the articles (or more properly digital projects) of Cameron Blevins and Micki Kaufman demonstrate the ability of text mining and topic modeling, in combination with other digital tools, to provide a visual demonstration of patterns and coherence drawn from a huge amount of data that would be difficult to research or comprehend using more traditional methods.

In a survey of three years’ worth of the Journal of Military History (2007-2009), there was only one explicit reference to a database search. Even this reference was not to a full text key word search, but rather to a study of library holdings using WorldCat.

In Kenneth P. Werrell’s article “Across the Yalu: Rules of Engagement and the Communist Air Sanctuary during the Korean War,” (Journal of Military History Vol. 72, No. 2 April 2008) Werrell argues that a study of the literature of the air war reveals both confusion and change over time in the evaluation of violations of the Chinese Air Sanctuary. To support this, he cites a WorldCat search that:

“shows that libraries have over 34,000 copies of the books cited in notes 1 through 34. Of this number, 20 percent tell of numerous violations, 35 percent note occasional or inadvertent violations, and 45 percent do not address the issue or claim there were no violations. Over 60 percent of the books in the first category were written by Hastings and MacDonald. Of the five most widely held books, those in over 2,000 libraries, one is in the first category, two in the second, and two in the third.”

This is the only database search that is explicitly mentioned in the period covered, and is an interesting example in that it is not central to his article, instead merely setting up the historiography he is writing against, and methodologically odd in its focus on the number of copies held versus simply individual titles.

In addition to this single explicit use of a database, there was at least one article that seemed to have made use of an unacknowledged source. An article in the April 2009 issue of the Journal of Military History cited a single newspaper article in all of its 135 citations (although it cited this article several times). While it is possible that the author scanned through multiple rolls of microfilm before discarding all but this one, it seems more likely that this article was found through a key word search and used to support his other sources.

Finally, there were a number of articles that made widespread use of digitized web sources outside of database and key word searches. Mark C. Jones, in his article “Experiment at Dundee: The Royal Navy’s 9th Submarine Flotilla and Multinational Naval Cooperation during World War II,” (Journal of Military History Vol. 72, No. 4 October 2008) uses several online sites to provide reference material (for example, he refers readers to http://www.unithistories.com/officers/RN_officersR.html for data on Royal Naval Officers and their assignments). Douglas C. Peifer, in his October 2007 article “The Past in the Present: Passion, Politics, and the Historical Profession in the German and British Pardon Campaigns,” goes even further. While he makes no reference to database searches, 21 of Peifer’s 76 citations make reference to digitized sources (with links and accessed dates), including both newspaper articles and government sources.

This limited use of databases, at least explicitly, reflects the continued focus of mainstream historians on traditional research methods (with notable exceptions such as Peifer and Jones). It also reflects a relatively low use of newspapers, the most commonly digitized sources, within the field of military history. Finally, there may be, and likely are, additional instances within the articles surveyed of the unacknowledged use of databases, as historians largely continue to cite hard copies even when accessing online versions, often through database searches (certainly some of the articles and dissertations cited were accessed through JSTOR, ProQuest, or similar databases).

While my reaction to last week’s readings was to focus on some of the negative aspects of digitization, this week’s readings opened up some benefits of digitization and new media that I had not previously considered. The use of new media has often been recognized as, at least to some extent, democratizing history by allowing a wider audience access and the ability to interact with history, not the least of this being through the widespread digitization of primary sources. Several of this week’s readings, however, highlight other ways in which the use of digitization, databases, and keyword searches has broken down barriers in research practices.

In his article on the role of digital history in re-formatting historical knowledge, Timothy Hitchcock points out the role of digitization and key word searches in decreasing the hard lines between disciplines with the result that “historical conferences are becoming more literary, and literature conferences are becoming significantly more historical.” As Hitchcock points out, the ease of keyword searches makes it possible for historians and other scholars to quickly and profitably engage with the corpus of work available digitally in other disciplines that is related to their work. This provides increased context on all sides without requiring significant additional inputs of time and research, greatly improving the end product through tangential research that would have been prohibitive in the analog age.

This idea is similar to the idea of “side-glances” as described by Lara Putnam in her article covering the intersection of transnationalism and digital history. For Putnam, the rise of digitized sources, and especially key word searches, has been instrumental in helping historians discover connections across national boundaries. What once would have been an impossible effort to chase down intriguing leads and scattered connections is now the matter of minutes of work at a computer. This digitization “makes whole new realms of connection possible,” with the end result of a sea change in the way historians, and especially transnational historians, do research.

A large part of this is digitization’s role in making what Putnam describes as “fishing expeditions” cost-reasonable. In the pre-digital age researchers were forced by the constraints of time and money to focus on archive visits to locations where they KNEW they could find usable material; with the digitization and searchability of increased numbers of archival sources these expeditions are now possible and even imperative.

The use of key word searches has also reduced the institutional bias that both Putnam and Hitchcock noted in the use of traditional archives. Archives, predominantly speaking, are the product of organizations and states, and have thus embedded into the sources most valued by historians a bias towards the perspectives of institutions and the organization of history around nation-state boundaries. The use of digitized key word searches allows historians to see past this by developing connections not visible in analog archives.

This is not to say that key word searches and databases are without their pitfalls, as noted in many of both this week’s and last week’s readings. Some of these issues, as pointed out by Patrick Spedding, are not related to the failures of OCR or other technology, but to the nature of the sources themselves. Taking the example of his own research on 18th century references to condoms, Spedding points out the usefulness of peripheral text searches, using a combination of terms related to the original search topic to turn up sources unmarked by the original term. While Spedding used this mainly to get around the 18th century reticence in mentioning sexual topics, this technique can also be applied to help alleviate some of the errors caused by OCR mistranslations, although at the risk of creating many more false positives while uncovering a small number of false negatives.
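As a rough illustration of what a peripheral search might look like in practice, the sketch below flags any document that contains at least a minimum number of terms related to a topic, whether or not the main search term itself appears. Everything here is hypothetical: the function name `peripheral_search`, the three sample documents, and the list of related terms are my own inventions for the sake of the example, not Spedding's actual method or data.

```python
import re


def peripheral_search(documents, related_terms, min_hits=2):
    """Return (doc_id, matched_terms) for documents that contain at least
    min_hits distinct related terms, whether or not the primary term appears."""
    results = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        matched = sorted(words & related_terms)
        if len(matched) >= min_hits:
            results.append((doc_id, matched))
    return results


# Invented sample documents; none of them names the topic directly.
documents = {
    "doc1": "The officers attended the infantry and cavalry school at the fort.",
    "doc2": "The school board met to discuss the new building.",
    "doc3": "Instruction in tactics was given to officers of the cavalry.",
}

# Hypothetical terms that tend to appear around the topic of interest.
related = {"officers", "cavalry", "infantry", "tactics", "instruction"}

hits = peripheral_search(documents, related, min_hits=2)
for doc_id, matched in hits:
    print(doc_id, matched)
```

The `min_hits` setting is where the trade-off described above shows up: lowering it catches more of the sources a direct search would miss, but it also lets in more false positives.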

For all of the benefits deriving from the “digital turn” and the digitization and searchability of sources, it remains a tool for historians like all of their other research techniques. And like their other tools, historians must recognize the drawbacks and flaws inherent in the use of key word searches, while making the most of the new opportunities they provide.

This week’s practicum is focused on the use of OCR, both utilizing it personally and evaluating its effectiveness on some of the sources available on the web.

To start, I utilized Google OCR to digitize a scanned image from the Pinkerton National Detective Agency (Image #4). Before putting it through the OCR I rotated it to be upright and cropped the image to exclude the margins of the page, in the hopes of eliminating extraneous objects that might confuse the OCR. This did help improve the OCR’s performance a little bit, but the program still produced a very high percentage of errors. Importantly, most of these errors were not single character errors that left a recognizable word, but rather turned many of the words into complete gibberish unrecognizable to the naked eye, let alone a key word search. Some of the issue may stem from the considerable bleed through of words from the backside of the page, and the irregular format of the Q&A seems also to have confused the OCR. There is also a surprising number of errors mistaking letters for punctuation and vice versa.

Here is the resulting Google Doc: Pinkerton Image #4 Google Doc

The next element of the practicum was evaluating the OCR performance for several newspaper pages on the Chronicling America project of the Library of Congress. To start with, I searched Virginia papers in the year range 1861 to 1865 with the key word “Stonewall.”

The first page that showed up was: The Soldiers’ journal. (Rendezvous of Distribution, Va.), October 05, 1864, Image 3

The image was pretty clear, and the text easily readable with only a few stray marks on the page. The OCR is actually pretty decent, with only an error every couple of sentences. With most of these errors it is readily apparent what the original was, even without referring back to the image (Bull Hun for Bull Run, Conterville for Centerville, Bth July for 8th July). The most common error seems to be, for whatever reason, the substitution of “o” for “e” when it occurs in the middle of a word. Looking back at the image it does appear that the “e” in this typeface has a pretty tight opening, making the OCR’s errors a little more understandable, although for most of these you would think the dictionary would have corrected the character confusion. There is also an issue with recognizing numbers, although what dictionary would favor “Bth” over “8th”?

The second page is: The Abingdon Virginian. (Abingdon [Va.]), January 30, 1863, Image 1

This image is a little bit rougher, with much smaller print that is hard to read, some uneven inking, and an obvious crease. Perhaps unsurprisingly, given how hard it is to read even for human eyes, the OCR on this one is significantly worse. There is quite a high percentage of errors, and there seems to be less of a pattern to them. There are significantly more letters mistaken for punctuation and special characters, and the column breaks seem to have been a particular problem, with a string of loose letters, mostly “i” and “j,” inserted at the edges of columns. Finally, the column with the crease mark seems to have a higher percentage of errors than the others.

The third page was: The Soldiers’ journal. (Rendezvous of Distribution, Va.), September 28, 1864, Image 7

The image on this is clear and easily readable with the naked eye, but there is some noticeable bleed through from the backside of the page, and the entire image is canted a little. The OCR is pretty decent, with an error rate similar to the first page, although the errors seem less consistent in this one. There are occasional random “j”s at the end of lines, possibly from the line the paper uses to separate its columns. There seems to have been a specific problem in recognizing numbers when paired with letters: almost every case of 8th or 1st has been translated into Bth or some other pure letter gibberish.

The final element of the practicum was to evaluate the OCR of a primary source for my research area. For this I looked at the Google e-book version of the Annual Report of the Secretary of War for 1867, published by the Government Printing Office. The plain text of the OCR for this document is very high quality, and is actually a little bit easier to read than the image. There are very few errors, and it even manages to capture the words that are italicized. The only consistent error I could note is that “8”s are frequently transposed into “S”s, especially when in the middle of dates (so 1860 becomes 1S60).

 

After reviewing these different OCR results, it seems clear to me that there can be a very great degree of variety in the quality of OCR output, and therefore in the effectiveness of keyword searches. For example, the most likely keyword search looking for the Pinkerton paper I digitized would be a search for the “Thaw matter,” which somehow got translated by the OCR as “@1151! matth”. A keyword search would thus obviously have missed at least this page of the file. These differing levels of quality are likely due to the quality of the OCR program used (presumably Google uses a more advanced program for its ebooks than the open source OCR it provides on Google Drive) but also, as reflected by the differing results on the Chronicling America papers, due to the quality of the original image fed into the program. When doing research it is probably worth taking a look at the OCR text of each database in order to get a better idea of how effective or spotty keyword search results will be, and tailoring your research and search methods based on this evaluation.
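One way to see how much (and how little) a more forgiving search can help is to try approximate matching against OCR text. The sketch below uses `difflib.get_close_matches` from Python's standard library to find words in the OCR output that are close to, but not exactly, the keyword. The function name `fuzzy_find` is my own, and the sample line is assembled from the error examples noted above (“Bull Hun,” “Conterville,” “Bth July”) rather than quoted from the newspaper page. I am not suggesting this is how any of these databases actually search.

```python
import difflib
import re


def fuzzy_find(keyword, ocr_text, cutoff=0.8):
    """Return OCR tokens whose similarity to keyword is at least cutoff."""
    tokens = re.findall(r"\w+", ocr_text.lower())
    # De-duplicate while keeping the first-seen order.
    unique_tokens = list(dict.fromkeys(tokens))
    return difflib.get_close_matches(
        keyword.lower(), unique_tokens, n=10, cutoff=cutoff
    )


# Illustrative line built from the OCR errors described above.
ocr_line = "The army fell back from Bull Hun to Conterville on the Bth July"

# A single wrong letter in a long word is easy to recover.
print(fuzzy_find("Centerville", ocr_line))

# A short token like "8th" is not matched at the default cutoff...
print(fuzzy_find("8th", ocr_line))

# ...and lowering the cutoff finds "bth" but also pulls in unrelated words.
print(fuzzy_find("8th", ocr_line, cutoff=0.6))

# Complete gibberish, like the Pinkerton result, is beyond rescue.
print(fuzzy_find("Thaw", "@1151! matth"))
```

The pattern matches what I saw in the practicum: single character errors in long words are recoverable, short tokens can only be caught by accepting more false positives, and a page like the Pinkerton image stays invisible to any keyword search.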

The most interesting aspect, for me, of this week’s readings is the negative side of the digitalization of sources. To some extent, this negative side is a reflection of the treatment of digitalized sources as the equivalent of their analog predecessors. As several of the articles we read point out, many digital scholars consider the medium to be intrinsic to the meaning and interpretation of digital records just as much as physical materials. Based on this, according to Marlene Manoff “if print and electronic versions are different objects, we should not treat them as if they are interchangeable.”

Unfortunately, that is how many people see them, with detrimental results. Manoff especially decries the tendency of libraries to view these different versions as interchangeable, resulting in such practices as the elimination of print holdings (especially of periodicals) once digital versions are available, as well as the cancelation of current print subscriptions.

This view of digital and print as interchangeable, in addition to the rather abstract concepts of the materiality of digital collections, has more concrete results on scholarship in the digital age. Specifically, as pointed out by Ian Milligan (as well as in our class discussion last week), scholars are increasingly tending to access journals and other sources through online databases while continuing to cite the hard copies. Beyond simply representing a misleading practice, this practice conceals a reliance on databases and key word searches that may miss key sources. As Milligan points out using the example of the artistic woodwork strike of 1973, key word searches can overlook some entries due to the lack of 100% accuracy in OCR renditions of print sources, resulting in “false negatives.”

More broadly, the increasing prevalence of key word searches, while vastly opening up the available sources and the scope of research available to the average scholar, has to a large and mostly unremarked extent sacrificed context. This sacrifice is further concealed by the continuing prevalence of citing sources as if the hard copy was consulted. By navigating via key word only to those articles directly dealing with the topic at hand, rather than combing through an entire date range of coverage, there is much less of an ability to get a feeling for the context of the times and perhaps critical tangential coverage (not to mention the occurrence of “archival serendipity”). This almost by definition makes it more difficult for the historian to interpret the vastly more numerous but much more narrowly selected sources made available by digitalization and key word search ability.

While it is unlikely that even the most tradition-bound historian would choose to completely ignore the vastly increased research capability provided by digitalization, it also seems incumbent upon the profession to mitigate the negative consequences of the primacy of digital searches, many of which seem to be little acknowledged today (certainly no historian to my knowledge has yet been called on the carpet for citing hard copies after reading the article on JSTOR).

Today’s project is to assess the existing digital history of my research topic. My research interest is in 19th century American Military History, specifically the post-Civil War professionalization of the Army. That seems a little broad for an internet search, but Google’s fast so let’s see what it produces before I narrow it down to something more specific and searchable.

1st Google Search: “19th Century American Military History”

Perhaps unsurprisingly, the first several entries are Wikipedia articles. There is a timeline of Military History from www.militaryhistory.about.com; it seems pretty simplistic and pretty Western-focused, but could be useful for general background information. There are also several book lists from Barnes & Noble and similar sites. Further down in the results is the 19th Century page for Military History Online, which actually includes a reasonable set of what appear to be somewhat scholarly articles (at least they have footnotes), although they seem to be member contributions rather than peer-reviewed entries. None appear to be particularly related to my specific research interests, although several appear interesting. The last link on the first page of results is the Early 19th Century Online Bookshelf of the U.S. Army Center of Military History. For the 19th Century this includes two digitalized general reference books and one digitalized archival source: “the Regulation for the Uniform and Dress of the United States Army, 1839.” Searching through the Center of Military History also pulls up Online Bookshelves for the Late 19th Century (7 digitized secondary sources, 1 digitized archival source) and the War with Spain (5 digitized secondary sources and 7 digitized archival sources).

Let’s see what adding “digital” to the search does.

2nd Google Search: “19th Century Military Digital History”

This search is completely dominated by www.digitalhistory.uh.edu articles (slogan: “using new technologies to enhance teaching and research”). While the articles lack footnotes, the .edu address is at least moderately reassuring. Going to their homepage, it seems to be focused primarily on K-12, but they do have some digitized primary documents as well as some exhibits (although none are related to my area). They also, quite usefully, have a link for how to cite them.

So it seems there is a decent amount of 19th Century military history available digitally, but so far nothing specific to my research interests, so I will try a more focused search.

3rd Google Search: “Leavenworth School System”

Well that was not super useful, and probably to be predicted. The results were all links to current Leavenworth school districts.

4th Google Search: “Army Leavenworth Schools”

Strike 2. Results of this search are split between sponsored links of for-profit schools trying to sell online degrees to service members and the websites for the current Fort Leavenworth schools (Command and General Staff College, Combined Arms Center, etc.).

This topic may be a little too esoteric and have too much overlap with current institutions to return a useful Google search result, so I’ll have to look deeper into likely digital archival sources. The most logical place to start is the U.S. Army Military History Institute. All of the USAMHI’s finding aids are available online, as well as a catalog search for their secondary sources and articles. The USAMHI also has several digital collections, including digitized army regulations, field manuals, and general orders. They also have a selection of manuscripts and printed material that have been digitized. Finally, there are key word searchable databases for their non-digitized manuscript holdings. The digitized files are mostly more recent documents, but the finding aids are useful for planning visits, especially as they have been digitized in such a way that a key word search will bring you to specific entries in each document.

A search through the National Archives site reveals similar results: the digitized elements of the archives are not relevant to my research, but all the finding aids and other research planning tools are available to do at least a majority of the leg work in identifying archival sources before showing up to the physical archive.

Finally, while none have shown up during the course of these searches, I know from prior research that many of the printed materials from this period have been digitized by Google Books and other sites such as HathiTrust. So if a printed source is referred to while researching, a quick Google search of its full title will often turn up a digitized copy. Unfortunately, since many of the documents I work with have titles like Report of the Board of officers appointed in pursuance of the act of Congress approved June 6, 1872, for the purpose of selecting a breech-system for the muskets and carbines of the military service, together with their report upon the subject of trowel-bayonets, they rarely turn up in more generic key word searches.

The first thing that struck me about this week’s readings was just how wide and debated the concept (field? area? discipline?) of digital history is. There seems to be no universally accepted definition, and as Michael Frisch points out, it risks “meaning too much or too little.” This problem is not unique to digital history, as I have come across it in other aspects of my studies. The problem with lacking a definition is that your concept can quickly be stretched to the point where it loses any value as a category.

That said, digital history is both new enough and dynamic enough to make a single definition difficult if not impossible. Several of the articles we read this week, however, proposed useful ways of thinking about digital history. The most important I think, and certainly one of the most repeated, is the idea of thinking of digital history as both a field and a method. Thus, while digital history is distinct enough, and requires enough specialization among its practitioners, to be rightfully considered its own field, it also provides an array of tools and methods that historians in other fields can apply singly or more systematically to their own studies. This mirrors, but is much more profound than, Douglas Seefeldt and William G. Thomas’ distinction between digitalization projects and true digital history products. More importantly perhaps, this recognition of digital history as simultaneously a field and a method highlights the need, raised by Amy Murrell Taylor, to avoid an over-focus on the tools of digital history to the extent that “mastering technology becomes the end rather than the means to a bigger end of producing innovative history.” Digital history as a field instead focuses on thinking about how we can create truly new history.

The second thing that struck me about this week’s readings was how little I knew about Digital History, despite being a full year into my graduate studies. Other than one rather tangential reference to the Valley of the Shadow project and a vague awareness of the existence of the Center for History and New Media, I’d managed to complete 27 credits of graduate history work with little connection to digital history, or at least the more advanced elements of it (I have used some digitalized sources in my research). This seems particularly amazing because some of the readings this week date from 10 or even 15 years ago, when I was an undergrad. To me, this highlights another central issue of many of these readings: where digital history fits into the larger field of academic history. Despite the possibility, and prediction in some cases, that digital history would completely change the entire academy, it seems that to date the mainstream of academic history has managed to remain relatively unchanged and ignorant of the role and uses of digital history. This seems, to me, to stem less from any inherent flaws in digital methods and practices (although Dan Cohen and Roy Rosenzweig do an excellent job of summarizing these in their chapter on the “Promises and Perils of Digital History”) than from successful stonewalling by traditional historians. This is likely the result of a perhaps understandable concern over change, a concern made especially powerful by the fact that the senior and most influential scholars in the field are, by definition as senior, the product of the earlier traditional scholarship and thus have little motivation to alter the system. Until the role of digital scholarship in hiring and tenure decisions is more firmly established, digital history will likely remain peripheral, and a valuable but risky route for scholars.
It is important to note this is only true for students seeking traditional tenure track careers; digital history has already more than proven its worth in less traditional fields such as public history.