For this week’s practicum I created a map using Google Maps Engine showing the campaigns of the 1st Michigan Cavalry Regiment. The main takeaway, for me, was the sheer level of drudgery involved in a project like this. Using as my base data the service record of the regiment provided on the National Park Service site, I input every service entry for the regiment from January 1864 through its mustering out in 1866. Despite the user-friendly interface of Google Maps Engine, this was a lot of work for two reasons. First, without an importable CSV, there was the pure data entry of putting in each of the events. Second, the data was not “clean,” and I had to spend a substantial amount of time trying to find locations for each of these events, as not all came up in a simple Google Maps search. I had some success using Google and Wikipedia (Wikipedia was especially useful, as the entries for several, but not all, of the battles had lat/long data that could be easily pasted into the map engine), but even my end result is only a partial solution, as finding the exact location of each of these events would require weeks of research.
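
As an aside, here is a minimal sketch (in Python, which I did not actually use for this practicum) of the kind of importable CSV that would have spared most of that data entry, assuming a simple name/description/latitude/longitude layout of the sort Google Maps Engine and My Maps will accept. The events and coordinates below are rough placeholders, not verified locations.

```python
import csv

# Illustrative events only; coordinates are approximate placeholders.
events = [
    # (name, date, event type, latitude, longitude)
    ("Battle of Yellow Tavern", "1864-05-11", "battle", 37.67, -77.46),
    ("Grand Review of the Armies", "1865-05-23", "review", 38.90, -77.04),
    ("Movement to Fort Leavenworth", "1865-06-01", "movement", 39.35, -94.92),
]

with open("first_michigan_cavalry.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "description", "latitude", "longitude"])
    for name, date, event_type, lat, lon in events:
        writer.writerow([name, f"{event_type}, {date}", lat, lon])
```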

For my map, I created layers for each year that can be displayed together or independently. Generally speaking, I marked each battle with a flag icon, each non-battle military event (“expedition,” “reconnaissance,” “demonstration,” etc.) with a horse icon, and occupations or encampments with a house symbol. The Grand Review got its own icon of two men walking in step (intended as hikers, but it worked well enough for my purpose). For this classification, I assumed that anything not labeled otherwise was a battle. Movements with specified start and end points are shown as lines (such as the “Movement to Fort Leavenworth”), and I included two polygons, for Sheridan’s Shenandoah Campaign and the Expedition into Loudoun and Fauquier Counties, to show the rough areas of operations. Each icon is named using the title given by the NPS, and its description contains the date of the operation, which allows viewers to toggle which field is displayed using the label dropdown menu.

 

 

Two aspects of this week’s readings that stood out as particularly illuminating for the possibilities of digital history were the concept of an “interactive scholarly work” and the importance of scale. The idea of an interactive scholarly work has been inherent in some of the other tools we’ve looked at, but it is especially salient in mapping projects. An interactive scholarly work is more than a static display of visual information; it allows users to interact with the material and develop their own research agenda. In some cases this can produce citable evidence, but like many other digital tools it is often best used to raise questions for further research or exploration.

The projects we looked at in this week’s readings, “Visualizing Emancipation,” “ORBIS,” and “Digital Harlem,” all allow the user to interact with and display various data and connections, using layered searches to reveal relationships that would be difficult to pick out through traditional means or through the spatialization provided by map images. The better interactive scholarly works also tie their digital presentation closely to rigorous scholarly research. ORBIS provides all of its background data and sources, making it “not just a site, but also an online scholarly presentation,” according to Scott Dunn. Digital Harlem supplements its interactive map with blog posts that explore various connections and ideas that the map reveals. This, combined with several longer articles published in connection with the project, allows Digital Harlem to “bridge the gap between digital and more traditional research,” according to Nicholas Grant.

Another thread running through this week’s readings is the ability of these digital mapping projects to convey scale in a way impossible to do in print media. Edward Ayers and Scott Nesbit discuss this in connection with the concept of “deep contingency,” in which different aspects of social life interacted in unpredictable ways across different scales (local, regional, national, military, etc.) to affect individual actions and decisions.

Digital Harlem also deals closely with the effects of scale: mapping black life (and white presence) onto real estate maps, at a level of detail well beyond what is typically described in text, reveals and changes the way Harlem looks. Working at this finer scale and including ALL of the available evidence provides a deeper and different picture. This picture is inherently digital, as it occurs at a scale that would be impossible to convey in print and can only be fully explored interactively, with the ability to zoom in and out.

Interactivity and scale, therefore, are essential to digital mapping projects. The data available, in both amount and complexity, are impossible to display statically; their full potential can only be unlocked through the digital medium and an interactive user interface. However, tying this digitized and democratized history back to its scholarly foundations is key to establishing both the credibility of these tools and their usefulness for further research. The best digital mapping projects, including those we looked at this week, are therefore both interactive websites and online scholarly publications.

 

Working with several of the open source visualization tools available allowed me to see some of the possibilities visualization provides for mapping data and exposing connections. The data we worked with, units and battles of the Civil War, was relatively simple, but even with this small sample set and its uncomplicated relationships the visualizations helped reveal connections faster than studying a table of raw data.

However, one thing I realized even from this limited project is that the compilation and organization of the raw data is the biggest part of any visualization project. Both Palladio and RAW provide fairly user-friendly interfaces for uploading data, but that data has to be properly formatted for the program (and most visualization tools have different formatting requirements). Besides the formatting, which can become fairly obnoxious in and of itself (Gephi especially requires some pretty extensive work to get data properly input), a digital historian first has to compile the data itself, which, unless your professor is kind enough to provide it already organized and formatted, can take a significant amount of time and effort. Given the relative accessibility of Palladio and RAW, any visualization project will likely involve far more time spent compiling and organizing data than interfacing with the actual visualization tools.
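
To give a concrete sense of that reformatting chore, here is a minimal sketch of turning a table with one row per unit and a semicolon-delimited list of battles into the one-pair-per-row layout that tools like Palladio and RAW handle most easily. The file and column names are hypothetical; only the reshaping pattern matters.

```python
import csv

# Reshape "Unit, Battles" rows (battles separated by semicolons) into a tidy
# edge list with one unit-battle pair per row.
with open("units_raw.csv", newline="") as infile, \
     open("unit_battle_edges.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)           # expects columns: Unit, Battles
    writer = csv.writer(outfile)
    writer.writerow(["Unit", "Battle"])
    for row in reader:
        for battle in row["Battles"].split(";"):
            writer.writerow([row["Unit"].strip(), battle.strip()])
```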

Uploading the data into Palladio should be quick and easy, but since I’m a Windows user it wasn’t. Originally the data uploaded fine, but wouldn’t display in the graph screen. I was able to get it to work by switching from Internet Explorer to Google Chrome.

Palladio

I thought the Palladio visualization was the most intuitive in showing both the connections between units that saw service in the same battle and the relative amount of combat seen by the various regiments. There aren’t as many options in Palladio as in RAW, at least not for the limited data set we’re working with, although the addition of latitude and longitude would allow us to map the locations of the battles these units fought in.

RAW allows more options for visualization, but once again, with this limited data set only a few are useful. The ones that worked best were the Alluvial Diagram, Circle Packing, Cluster Dendrogram, and Circular Dendrogram.

Alluvial Diagram:

Alluvial

Circle Packing:

Circle Packing

 

Cluster Dendrogram:

Cluster Dendrogram

Circle Dendrogram:

Circle Dendrogram

Of these, the Alluvial Diagram does the best job of illustrating overlapping battles, while the other three are only really useful for highlighting which units participated in more battles. Still, RAW provides more visualization options than Palladio.

Finally, there is Gephi. Gephi may be the exception to the rule I started this post with, that data compilation and organization is the most time-consuming part of visualization. With Gephi, figuring out Gephi is the longest part. I was never able to get it to fully work, despite having the data already set up and the very helpful tutorial provided by Elena Friot.

While not made as explicit as in last week’s readings on text mining, an important theme of this week’s articles on visualization and networks is the dual role of these tools. While they are usually used to provide a visualization of data, they can also be used in an exploratory mode to reveal connections not normally apparent, and thus serve as a starting point for inquiry rather than merely as evidence for an argument or a conclusion. Like many of the other tools we have learned about, they also have flaws that must be accounted for if they are not to lead the novice digital historian astray.

To start with the traditional use of networks, I think David Staley’s two-element definition of visual images, quoted by John Theibault in his article on “Visualization and Historical Argument,” is an extremely simple and useful way to think about the use of these networked images. Visual images can be a stand-alone organization of meaningful information, much like the digital history projects on the Houston Daily Post and Kissinger’s memcons and telcons, or this week’s “Mapping the Republic of Letters.” Alternatively, and perhaps most familiar to us as historians, they can be employed as a supplement to written accounts, bolstering textual arguments or providing additional evidence. Both of these uses are primarily about displaying data, but while the stand-alone visualizations in our readings have been finished projects, it is easy to imagine how they could also be used to expose new questions and paths of inquiry.

Theibault does a valuable job of exploring the possibilities that digital tools open up for visualization, but one point he made struck me as especially surprising, as well as illustrative of the radically expansive possibilities and democratization made possible by new media. This was the simple point that you can use color extensively in digital history, while it is prohibitively expensive in print media. It seems like a minor point, but when one considers how important color is in visual images, it clearly illustrates how even small factors create large changes in the new world of digital media.

This does not mean, however, that visual networks and other visual tools are without their pitfalls. As Johanna Drucker argues persuasively, we have an innate tendency to accept these images as substantiated fact carrying their own intrinsic proof, in a way we would never do for a textual argument. Instead, we must treat visual images like interpretive arguments: by evaluating the evidence and methodology that underlie them and making our own determination of whether the end result is supported and convincing. To this danger, Scott Weingart adds several additional criticisms. For Weingart, “network structures are deceitful.” They must be evaluated closely to ensure that what the historian is attempting to show by applying his data to a network structure actually matches the structure used. Central to this are networks’ lack of memory (i.e., they can only show connections, not how those connections were used) and their difficulty in showing multimodality. Weingart’s final call is for historians to ensure these visual networks are employed only when appropriate, going back to Staley’s second definition of using visual images to supplement written arguments.

While these criticisms must be considered, none of these historians is calling for the abandonment of the use of visual networks. Rather, like the other tools we have learned about, they must be approached with a complete understanding of their methodology and implications in order to be properly applied. One component of this, not as explicitly covered in the readings, is the ability to use these visualizations as a research tool rather than simply as evidence or a final product.

For this week’s practicum I looked at several different open source text mining tools, comparing the results with each other and with my expectations. The terms I chose were “Army,” “Navy,” and “Military.” I was looking to see if there was any significant change over time in the preference among the three, as well as in their overall usage.

Searching the terms gave similar results in the Google Ngram Viewer and the NYT Chronicle, showing patterns that one might expect:

All three terms track in parallel (Army most used, followed by military, with Navy less common), spiking at times of war. Toward the latter part of the 20th century, around 1960 in both viewers, the word military became more prevalent than army.

For the matching results from the NYT Chronicle, click the link below:

http://chronicle.nytlabs.com/?keyword=army.navy.military

The NYT Chronicle is especially useful because it allows you to interact with the sources behind the results. For example, clicking on the peak of the use of the word “Army” will bring up the 6,281 articles that mention that word during 1862 (the same year shows 4,574 results for military and 2,257 uses of Navy).

 

Putting the same terms into Bookworm and searching the Chronicling America files reveals similar trends, but with small variations. There is a spike in both Army and Military (but not Navy) around 1840 that doesn’t appear in the Google Ngram Viewer and isn’t covered by the NYT Chronicle (whose data starts in the 1860s). Also, unlike in the other two, the term military never overtakes army in usage, apart from a military-specific spike (army and navy stay flat) in the 1870s.

For the Bookworm results, click the link below:

Bookworm

Bookworm also lets you access the articles behind its Chronicling America ngram, revealing that the spike in 1840 is driven largely by articles from the Illinois Free Trader, with every article on the first three pages of results coming from that single journal. This spike in references to military in Illinois may be related to the conflicts with Mormons that were occurring at that time.
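
The same kind of drill-down could also be done programmatically against the Library of Congress’s Chronicling America search API, separately from Bookworm’s interface. The sketch below is a rough illustration only: the andtext and format parameters follow the site’s documented search interface, but the date-range parameters and the JSON field names are recollections rather than verified usage, so treat them as assumptions.

```python
import requests

# Rough check of which newspapers drive the 1840 spike in "military".
# Parameter and field names are assumptions based on the public search API.
params = {
    "andtext": "military",
    "format": "json",
    "date1": "1840",
    "date2": "1840",
    "dateFilterType": "yearRange",
    "rows": 20,
}
resp = requests.get("https://chroniclingamerica.loc.gov/search/pages/results/", params=params)
resp.raise_for_status()
data = resp.json()

print("total hits:", data["totalItems"])
for item in data["items"]:
    print(item["date"], "-", item["title"])
```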

The final text mining tool I looked at was Voyant Tools, which, rather than searching a set corpus like the three tools above, will evaluate whatever text you put into it. For this experiment I used three files from the NY Gettysburg Monuments Commission (NY_Gettysburg_1.1, NY_Gettysburg_1.2, and NY_Gettysburg_2.1).

Voyant allows you to look at the text mining results in several different ways, and I experimented with bubblelines, collocate clusters, and correspondence analysis. These tools all helped visualize the data in different ways, but they produce much more complex results than the ngram viewers discussed above, and I frequently had trouble understanding what I was looking at and whether it meant anything.

The five most common words in the corpus were brigade, corps, regiment, new, and York, none of which is very surprising. Looking at the bubbleline results, it appears that there are two sections in the 1.2 file where there is less discussion of military units, but otherwise all of these words were relatively common throughout the files. It seems these files may be too similar to produce much in the way of contrast.


This week’s readings were useful in describing the tools available for text mining and topic modeling, as well as some of the important considerations in their use. These tools are closely related to the keyword searches we looked at last week, but a little more advanced. As with our readings on keyword searches, some of the most valuable elements of this week’s readings were less the nuts and bolts of the tools than the issues involved in their use and the need for a proper understanding of their implications and methodology.

As Ted Underwood points out in his article, “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago,” it is important to consider searches and text mining as a “philosophical discourse” rather than just as tools. This means seriously considering how we approach their use and understanding the implications of that use, rather than treating them as a faster route to the same results we would seek with more traditional research methods. A large element of this, as Underwood points out, is that we need to understand how the algorithm behind the search engine produces results (i.e., what it defines as relevant). Without this understanding, it is very easy to conduct searches that will never produce alternate theses, or to keep trying new searches until ANY thesis eventually produces enough results to be judged “supported.”

A corollary to this is a methodological approach recommended by both Underwood and Frederick Gibbs and Daniel Cohen: rather than using text mining to provide evidence for a predefined thesis, it can be used as an open-ended investigation. Through text mining and “distant reading,” a volume of sources that would be impossible to compare using traditional methods can be studied in ways that reveal otherwise undetectable patterns. In Gibbs and Cohen’s words, this method can provide “signposts toward further explanation rather than conclusive evidence.” According to Robert Nelson, the same results can be achieved with topic modeling, which allows digital historians to “detect patterns within not a sampling but the entirety of an archive.”

Finally, the articles (or, more properly, digital projects) of Cameron Blevins and Micki Kaufman demonstrate the ability of text mining and topic modeling, in combination with other digital tools, to provide a visual demonstration of patterns and coherence drawn from an amount of data that would be difficult to research or comprehend using more traditional methods.

In a survey of three years’ worth of the Journal of Military History (2007-2009), there was only one explicit reference to a database search. Even this reference was not to a full-text keyword search, but rather to a study of library holdings using WorldCat.

In his article “Across the Yalu: Rules of Engagement and the Communist Air Sanctuary during the Korean War” (Journal of Military History, Vol. 72, No. 2, April 2008), Kenneth P. Werrell argues that a study of the literature on the air war reveals both confusion and change over time in the evaluation of violations of the Chinese air sanctuary. To support this, he cites a WorldCat search that:

“shows that libraries have over 34,000 copies of the books cited in notes 1 through 34. Of this number, 20 percent tell of numerous violations, 35 percent note occasional or inadvertent violations, and 45 percent do not address the issue or claim there were no violations. Over 60 percent of the books in the first category were written by Hastings and MacDonald. Of the five most widely held books, those in over 2,000 libraries, one is in the first category, two in the second, and two in the third.”

This is the only database search explicitly mentioned in the period covered. It is an interesting example, both because it is not central to his article, merely setting up the historiography he is writing against, and because it is methodologically odd in its focus on the number of copies held rather than on individual titles.

In addition to this single explicit use of a database, there was at least one article that seemed to have made use of an unacknowledged database search. An article in the April 2009 issue of the Journal of Military History cited a single newspaper article among all of its 135 citations (although it cited that article several times). While it is possible that the author scanned through multiple rolls of microfilm before discarding all but this one piece, it seems more likely that the article was found through a keyword search and used to support his other sources.

Finally, there were a number of articles that made widespread use of digitized web sources outside of database and keyword searches. Mark C. Jones, in his article “Experiment at Dundee: The Royal Navy’s 9th Submarine Flotilla and Multinational Naval Cooperation during World War II” (Journal of Military History, Vol. 72, No. 4, October 2008), uses several online sites to provide reference material (for example, he refers readers to http://www.unithistories.com/officers/RN_officersR.html for data on Royal Navy officers and their assignments). Douglas C. Peifer, in his October 2007 article “The Past in the Present: Passion, Politics, and the Historical Profession and British Pardon Campaign,” goes even further. While he makes no reference to database searches, 21 of Peifer’s 76 citations reference digitized sources (with links and access dates), including both newspaper articles and government sources.

This limited use of databases, at least explicitly, reflects the continued focus of mainstream historians on traditional research methods (with notable exceptions such as Peifer and Jones). It also reflects a relatively low use of newspapers, the most commonly digitized sources, within the field of military history. Finally, there may be, and likely are, additional instances within the articles surveyed of the unacknowledged use of databases, as historians largely continue to cite hard copies even when accessing online versions, often through database searches (certainly some of the articles and dissertations cited were accessed through JSTOR, ProQuest, or similar databases).

While my reaction to last week’s readings was to focus on some of the negative aspects of digitization, this week’s readings opened up some benefits of digitization and new media that I had not previously considered. The use of new media has often been recognized as, at least to some extent, democratizing history by allowing a wider audience access to and the ability to interact with history, not least through the widespread digitization of primary sources. Several of this week’s readings, however, highlight other ways in which digitization, databases, and keyword searches have broken down barriers in research practices.

In his article on the role of digital history in re-formatting historical knowledge, Timothy Hitchcock points to the way digitization and keyword searches have softened the hard lines between disciplines, with the result that “historical conferences are becoming more literary, and literature conferences are becoming significantly more historical.” As Hitchcock points out, the ease of keyword searching makes it possible for historians and other scholars to quickly and profitably engage with the digitally available corpus of related work in other disciplines. This provides increased context on all sides without requiring significant additional investments of time and research, greatly improving the end product through tangential research that would have been prohibitive in the analog age.

This idea is similar to the “side-glances” described by Lara Putnam in her article on the intersection of transnationalism and digital history. For Putnam, the rise of digitized sources, and especially keyword searches, has been instrumental in helping historians discover connections across national boundaries. What once would have been an impossible effort to chase down intriguing leads and scattered connections is now a matter of minutes of work at a computer. This digitization “makes whole new realms of connection possible,” with the end result of a sea change in the way historians, and especially transnational historians, do research.

A large part of this is digitization’s role in making what Putnam describes as “fishing expeditions” cost-reasonable. In the pre-digital age, researchers were forced by the constraints of time and money to focus their archive visits on locations where they KNEW they could find usable material; with the digitization and searchability of ever more archival sources, these expeditions are now possible and even imperative.

The use of keyword searches has also reduced the institutional bias that both Putnam and Hitchcock note in the use of traditional archives. Archives are predominantly the product of organizations and states, which embeds in the sources most valued by historians a bias toward the perspectives of institutions and toward organizing history around nation-state boundaries. Digitized keyword searches allow historians to see past this by developing connections not visible in analog archives.

This is not to say that keyword searches and databases are without their pitfalls, as noted in many of both this week’s and last week’s readings. Some of these issues, as pointed out by Patrick Spedding, are not related to the failures of OCR or other technology, but to the nature of the sources themselves. Taking the example of his own research into eighteenth-century references to condoms, Spedding points out the usefulness of peripheral text searches: using a combination of terms related to the original search topic to turn up sources unmarked by the original term. While Spedding used this mainly to get around eighteenth-century reticence about sexual topics, the technique can also help alleviate some of the errors caused by OCR mistranscriptions, although at the risk of creating many more false positives while uncovering a small number of false negatives.
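
As a toy illustration of what a peripheral search might look like in practice, the sketch below ranks pages by how many related terms they contain rather than looking for the target term itself. The related-term list, the sample pages, and the scoring are entirely invented for illustration; this is not Spedding’s actual method.

```python
import re

# Score a text by counting occurrences of terms related to the real topic,
# so that pages which never use the target word can still surface.
related_terms = ["sheath", "armour", "machine", "preservative"]

def peripheral_score(text: str) -> int:
    text = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        for term in related_terms
    )

pages = {
    "page_12": "a gentleman's armour, lately advertised as a certain preservative",
    "page_47": "shipping news and market prices for the week",
}
for name, text in sorted(pages.items(), key=lambda p: peripheral_score(p[1]), reverse=True):
    print(name, peripheral_score(text))
```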

For all of the benefits deriving from the “digital turn” and the digitization and searchability of sources, it remains a tool for historians like all of their other research techniques. And like their other tools, historians must recognize the drawbacks and flaws inherent in the use of key word searches, while making the most of the new opportunities they provide.

This week’s practicum is focused on the use of OCR, both utilizing it personally and evaluating its effectiveness on some of the sources available on the web.

To start, I used Google’s OCR to digitize a scanned image from the Pinkerton National Detective Agency (Image #4). Before putting it through the OCR I rotated it upright and cropped the image to exclude the margins of the page, in the hope of eliminating extraneous objects that might confuse the OCR. This did help improve the OCR’s performance a little, but the program still produced a very high percentage of errors. Importantly, most of these errors were not single-character errors that left a recognizable word; instead they turned many of the words into complete gibberish, unrecognizable to the naked eye, let alone a keyword search. Some of the issues may stem from the considerable bleed-through of words from the back side of the page, and the irregular format of the Q&A also seems to have confused the OCR. There is also a surprising number of errors mistaking letters for punctuation and vice versa.

Here is the resulting Google Doc: Pinkerton Image #4 Google Doc
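
For anyone who would rather script the same prepare-then-recognize workflow instead of relying on Google Drive’s built-in OCR, here is a rough sketch using Pillow and the open-source Tesseract engine via pytesseract. The rotation angle, crop box, and file names are placeholders rather than the values from my actual run.

```python
from PIL import Image
import pytesseract  # open-source Tesseract wrapper, standing in for Google's OCR

# Straighten, crop, and greyscale the scan before recognition, then save the text.
page = Image.open("pinkerton_image_4.jpg")
page = page.rotate(-90, expand=True)        # stand the page upright (angle is a placeholder)
page = page.crop((150, 200, 2300, 3200))    # trim the margins (pixel box is illustrative)
page = page.convert("L")                    # greyscale can soften bleed-through

text = pytesseract.image_to_string(page)
with open("pinkerton_image_4.txt", "w") as f:
    f.write(text)
```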

The next element of the practicum was evaluating the OCR performance for several newspaper pages on the Chronicling America project of the Library of Congress. To start, I searched Virginia papers in the year range 1861 to 1865 with the keyword “Stonewall.”

The first page that showed up was: The Soldiers’ journal. (Rendezvous of Distribution, Va.), October 05, 1864, Image 3

The image was pretty clear, and the text easily readable with only a few stray marks on the page. The OCR is actually pretty decent, with only an error every couple of sentences. With most of these errors it is readily apparent what the original was, even without referring back to the image (“Bull Hun” for Bull Run, “Conterville” for Centerville, “Bth July” for 8th July). The most common error seems to be, for whatever reason, the substitution of “o” for “e” when it occurs in the middle of a word. Looking back at the image, the “e” in this typeface does have a pretty tight opening, making the OCR’s errors a little more understandable, although for most of these you would think the dictionary would have corrected the character confusion. There is also an issue with recognizing numbers, although what dictionary would favor “Bth” over “8th”?

The second page is: The Abingdon Virginian. (Abingdon [Va.]), January 30, 1863, Image 1

This image is a little rougher, with much smaller print that is hard to read, some uneven inking, and an obvious crease. Perhaps unsurprisingly, given how hard it is to read even for human eyes, the OCR on this one is significantly worse. There is quite a high percentage of errors, and there seems to be less of a pattern to them. There are significantly more letters mistaken for punctuation and special characters, and the column breaks seem to have been a particular problem, with strings of stray letters, mostly “i” and “j,” inserted at the edges of columns. Finally, the column with the crease seems to have a higher percentage of errors than the others.

The third page was: The Soldiers’ journal. (Rendezvous of Distribution, Va.), September 28, 1864, Image 7

The image on this one is clear and easily readable with the naked eye, but there is some noticeable bleed-through from the back side of the page, and the entire image is canted slightly. The OCR is pretty decent, with an error rate similar to the first page, although the errors seem less consistent here. There are occasional random “j”s at the ends of lines, possibly from the rule the paper uses to separate its columns. There also seems to have been a specific problem recognizing numbers when paired with letters; almost every instance of 8th or 1st has been rendered as Bth or some other pure-letter gibberish.

The final element of the practicum was to evaluate the OCR of a primary source in my research area. For this I looked at the Google e-book version of the Annual Report of the Secretary of War for 1867, published by the Government Printing Office. The plain text of the OCR for this document is very high quality, and is actually a little easier to read than the image. There are very few errors, and it even manages to capture the words that are italicized. The only consistent error I could find is that “8”s are frequently rendered as “S”s, especially in the middle of dates (so 1860 becomes 1S60).

 

After reviewing these different OCR results, it seems clear to me that there can be a great degree of variety in the quality of OCR output, and therefore in the effectiveness of keyword searches. For example, the most likely keyword search for the Pinkerton page I digitized would be a search for the “Thaw matter,” which somehow got rendered by the OCR as @1151! matth. A keyword search would thus obviously have missed at least this page of the file. These differing levels of quality are likely due in part to the quality of the OCR program used (presumably Google uses a more advanced program for its e-books than the open-source OCR provided on Google Drive), but also, as reflected by the differing results on the Chronicling America papers, to the quality of the original image fed into the program. When doing research, it is probably worth taking a look at the OCR text in each database in order to get a better idea of how effective or spotty keyword search results will be, and tailoring your research and search methods based on that evaluation.
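
One low-tech way to tailor keyword searching to dirty OCR is to accept near matches rather than exact ones. The sketch below uses Python’s standard-library difflib for fuzzy word matching; the cutoff value and the sample OCR string (borrowed from the Pinkerton example above) are illustrations only.

```python
import difflib

# Return OCR words that are close to the search term, so that near-miss OCR
# errors ("matth" for "matter") still surface; badly garbled words will not.
def fuzzy_hits(term, ocr_text, cutoff=0.7):
    words = ocr_text.lower().split()
    return difflib.get_close_matches(term.lower(), words, n=10, cutoff=cutoff)

ocr_text = "the @1151! matth was referred to the agency for investigation"
print(fuzzy_hits("matter", ocr_text))  # finds "matth" despite the OCR error
print(fuzzy_hits("thaw", ocr_text))    # "@1151!" is too garbled even for fuzzy matching
```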

The most interesting aspect, for me, of this week’s readings is the negative side of the digitization of sources. To some extent, this negative side is a reflection of the treatment of digitized sources as the equivalent of their analog predecessors. As several of the articles we read point out, many digital scholars consider the medium to be just as intrinsic to the meaning and interpretation of digital records as it is to physical materials. Based on this, according to Marlene Manoff, “if print and electronic versions are different objects, we should not treat them as if they are interchangeable.”

Unfortunately, that is how many people see them, with detrimental results. Manoff especially decries the tendency of libraries to view the different versions as interchangeable, resulting in such practices as the elimination of print holdings (especially of periodicals) once digital versions are available, as well as the cancellation of current print subscriptions.

This view of digital and print as interchangeable, beyond the rather abstract concepts of the materiality of digital collections, has more concrete consequences for scholarship in the digital age. Specifically, as pointed out by Ian Milligan (as well as in our class discussion last week), scholars are increasingly tending to access journals and other sources through online databases while continuing to cite the hard copies. Beyond being misleading, this practice conceals a reliance on databases and keyword searches that may miss key sources. As Milligan points out using the example of the Artistic Woodwork strike of 1973, keyword searches can overlook some entries because OCR renditions of print sources are less than 100% accurate, resulting in “false negatives.”

More broadly, the increasing prevalence of keyword searches, while vastly opening up the sources available and the scope of research possible for the average scholar, has, to a large and mostly unremarked extent, sacrificed context. This sacrifice is further concealed by the continuing practice of citing sources as if the hard copy had been consulted. By navigating via keyword only to those articles directly dealing with the topic at hand, rather than combing through an entire date range of coverage, the researcher has much less ability to get a feel for the context of the times or to stumble on critical tangential coverage (not to mention the loss of “archival serendipity”). This almost by definition makes it more difficult for the historian to interpret the vastly more numerous but much more narrowly selected sources made available by digitization and keyword searching.

While it is unlikely that even the most tradition-bound historian would choose to completely ignore the vastly increased research capability provided by digitization, it also seems incumbent upon the profession to mitigate the negative consequences of the primacy of digital searches, many of which seem to be little acknowledged today (certainly no historian, to my knowledge, has yet been called on the carpet for citing hard copies after reading the article on JSTOR).