Building an exhibit with Omeka this week demonstrated some of the same patterns that were apparent last week working with Google Maps Engine. While I already had data on my computer from a previous research project on the U.S. Army Ordnance Department's treatment of breech-loading arms in the Civil War, the most time-consuming portion of the entire exercise was going through the hundreds of images and finding the ones that would be useful in telling a story in exhibit form.

The next most time-consuming part was entering the metadata for each item, which combined several decisions about how to categorize it with the drudgery of entering it. I found myself wishing there was a way to add metadata to a group of items rather than entering it individually for each one, since other than the title and description most of the entries remained the same for every item. Working with a set of Ordnance Department records and correspondence from the National Archives (no copyright concerns!), I decided to list “U.S. Army Ordnance Department” as the creator and use the date of each individual record for the date. I listed the record group under source to help users locate the originals if they choose, but failed to go the extra step of listing the full data from my own record keeping under “identifier,” as it does not conform to a standardized system. For the title of each item I used how I would cite it in a scholarly work, leaving a more detailed description for the captions.
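As a rough sketch of the kind of batch entry I was wishing for, a short script could fill in the repeated Dublin Core fields once and write a spreadsheet for something like Omeka's CSV Import plugin, leaving only the title, date, and file to vary per item. The filenames, titles, and field values below are placeholders rather than my actual items, and the column layout would need to match however the plugin is configured.

```python
import csv

# Hypothetical sketch: fill in the Dublin Core fields that never change once,
# then write a spreadsheet a tool like Omeka's CSV Import plugin could read.
# Filenames, titles, and field values are placeholders, not my actual items.
items = [
    ("Ripley to Stanton, 11 Dec. 1861", "1861-12-11", "ripley_stanton_1861.jpg"),
    ("Dyer to Grant, 3 Mar. 1865", "1865-03-03", "dyer_grant_1865.jpg"),
]

shared = {
    "Creator": "U.S. Army Ordnance Department",
    "Source": "National Archives, Record Group 156",
    "Rights": "Public domain",
}

with open("omeka_items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Date", "File"] + list(shared))  # header row
    for title, date, filename in items:
        # Only title, date, and file change per item; the shared values repeat.
        writer.writerow([title, date, filename] + list(shared.values()))
```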

I chose gallery as the layout for my exhibit, as it seemed to make the most sense for a simple exhibit like this. I organized the items chronologically, which seemed the most reasonable given the purpose of the exhibit in showing change over time and the contrast between Ripley and Dyer.

Below is the link for my Omeka exhibit on the Ordnance Department and Breech-loaders during the Civil War:

 

Ordnance Department and the Civil War


Working with several of the open-source visualization tools available allowed me to see some of the possibilities visualization provides for mapping data and exposing connections. The data we worked with, units and battles of the Civil War, was relatively simple, but even with this small sample set and uncomplicated relationships the visualizations helped reveal connections faster than studying a table of raw data.

However, one thing I realized even from this limited project is that the compilation and organization of that raw data is the biggest part of any visualization project. Both Palladio and RAW provide fairly user-friendly interfaces for uploading data, but that data has to be properly formatted for the program (and most visualization tools have different formatting requirements). Besides the formatting, which can become fairly obnoxious in and of itself (Gephi especially requires some pretty extensive work to get data properly input), a digital historian first has to compile the data, which, unless your professor is kind enough to provide it to you already organized and formatted, can take a significant amount of time and effort. Even with the relative accessibility of Palladio and RAW, any visualization project will likely involve far more time spent compiling and organizing data than interfacing with the actual visualization tools.
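To give a sense of the formatting work involved, here is a minimal sketch of the kind of reshaping these tools tend to want: turning a table that lists each battle with its units packed into a single cell into a one-row-per-unit-per-battle edge list. The filenames, column names, and semicolon delimiter are assumptions for illustration, not the actual course files.

```python
import csv

# Hypothetical reshaping step: the source table lists each battle once, with all
# participating units packed into a single semicolon-delimited cell; most
# visualization tools want one row per unit-battle pair instead.
# Filenames, column names, and the delimiter are assumptions, not the course files.
with open("battles.csv", newline="", encoding="utf-8") as src, \
     open("unit_battle_edges.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["Unit", "Battle", "Year"])
    for row in reader:
        for unit in row["Units"].split(";"):
            writer.writerow([unit.strip(), row["Battle"], row["Year"]])
```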

Uploading the data into Palladio should be quick and easy, but since I’m a Windows user it wasn’t. Originally the data uploaded fine, but wouldn’t display in the graph screen. I was able to get it to work by switching from Internet Explorer to Google Chrome.

Palladio

I thought the Palladio visualization was the most intuitive in showing both the connections between multiple units that saw service in the same battle and the relative amount of combat seen by the various regiments. There aren’t as many options in Palladio as in RAW, at least not for the limited data set we’re working with, although the addition of latitude and longitude would allow us to map the locations of the battles these units fought in.

RAW allows more options for visualization, but once again, with only this limited data set, only a few are useful. The ones that worked the best were the Alluvial Diagram, Circle Packing, Cluster Dendrogram, and Circular Dendrogram.

Alluvial Diagram:

Alluvial

Circle Packing:

Circle Packing

 

Cluster Dendrogram:

Cluster Dendrogram

Circular Dendrogram:

Circular Dendrogram

Of these, the Alluvial Diagram does the best job of illustrating overlapping battles, while the other three are only really useful in highlighting which units participated in more battles. Still, RAW provides more visualization choices than Palladio.

Finally, there is Gephi. Gephi may be the exception to the rule I started this post with, that data compilation and organization is the most time-consuming part of visualization. With Gephi, figuring out Gephi is the longest part. I was actually never able to get it to fully work, despite having the data already set up and the very helpful tutorial provided by Elena Friot.

For this week’s practicum I looked at several different open-source text mining tools, comparing the results with each other and with my expectations. The terms I chose were “Army,” “Navy,” and “Military.” I was looking to see if there was any significant change over time in the preference among the three as well as in their overall usage.

Searching the terms gave similar results in the Google Ngram Viewer and the NYT Chronicle, and showed patterns that one might expect:

All three terms follow parallel tracks (Army most used, followed by military, with Navy less common), spiking at times of war. Towards the latter part of the 20th century, around 1960 in both viewers, the word military became more prevalent than army.

For the matching results from the NYT Chronicle, click the link below:

http://chronicle.nytlabs.com/?keyword=army.navy.military

The NYT Chronicle is especially useful because it allows you to interact with the sources behind the results. For example, clicking on the peak of the use of the word “Army” will bring up the 6,281 articles that mention that word during 1862 (the same year shows 4,574 results for military and 2,257 for Navy).

 

Putting the same terms into Bookworm and searching the Chronicling America files reveals similar trends but with small variations. There is a spike in both Army and Military (but not Navy) around 1840 that doesn’t appear on the Google Ngram Viewer and isn’t covered by the NYT Chronicle (whose data start in the 1860s). Also, unlike in the other two, the term military never overtakes army in usage, other than during a military-specific spike (army and navy stay flat) in the 1870s.

For the Bookworm results, click the link below:

Bookworm

Bookworm also lets you access the articles behind its Chronicling America ngram, revealing that the spike in 1840 is caused largely by articles from the Illinois Free Trader, with every article on the first three pages of results coming from that single paper. This spike in references to military in Illinois may be related to the conflicts with Mormons that were occurring at that time.

The final text mining tool I looked at was Voyant Tools, which, rather than searching a set corpus like the three tools above, will evaluate whatever text you put into it. For this experiment I used three files from the NY Gettysburg Monuments Commission (NY_Gettysburg_1.1, NY_Gettysburg_1.2, and NY_Gettysburg_2.1).

Voyant allows you to look at the text mining results in several different ways, and I experimented with bubblelines, collocate clusters, and correspondence analysis. These tools all helped visualize the data in different ways, but they provide much more complex results than the ngram viewers discussed above, and I frequently had trouble understanding what I was looking at and whether it meant anything.

The five most common words in the corpus were brigade, corps, regiment, new, and York, none of which is very surprising. Looking at the bubbleline results, it appears there are two sections in the 1.2 file where there is less discussion of military units, but other than that all of these words were relatively common throughout the files. It seems these files may be too similar to offer much in the way of contrast.
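A simple frequency count along the lines of what Voyant shows in its summary panel could be reproduced with a few lines of Python run over the same three files. This is only a rough approximation: the .txt extensions, the tokenizer, and the tiny stopword list are my own simplifications, not Voyant's actual processing.

```python
import re
from collections import Counter

# Rough approximation of Voyant's most-frequent-words panel, run over the same
# three files. File extensions, tokenization, and the stopword list are my own
# simplifications, not what Voyant actually does.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "was", "on", "at", "by", "with"}

counts = Counter()
for name in ["NY_Gettysburg_1.1.txt", "NY_Gettysburg_1.2.txt", "NY_Gettysburg_2.1.txt"]:
    with open(name, encoding="utf-8") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
        counts.update(w for w in words if w not in STOPWORDS)

print(counts.most_common(5))  # expect words like brigade, corps, regiment, new, york
```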


In a survey of three years’ worth of the Journal of Military History (2007-2009), there was only one explicit reference to a database search. Even this reference was not to a full-text keyword search, but rather a study of library holdings using WorldCat.

In “Across the Yalu: Rules of Engagement and the Communist Air Sanctuary during the Korean War” (Journal of Military History, Vol. 72, No. 2, April 2008), Kenneth P. Werrell argues that a study of the literature of the air war reveals both confusion and change over time in the evaluation of violations of the Chinese Air Sanctuary. To support this, he cites a WorldCat search that:

“shows that libraries have over 34,000 copies of the books cited in notes 1 through 34. Of this number, 20 percent tell of numerous violations, 35 percent note occasional or inadvertent violations, and 45 percent do not address the issue or claim there were no violations. Over 60 percent of the books in the first category were written by Hastings and MacDonald. Of the five most widely held books, those in over 2,000 libraries, one is in the first category, two in the second, and two in the third.”

This is the only database search explicitly mentioned in the period covered, and it is an interesting example in that it is not central to his article, merely setting up the historiography he is writing against, and methodologically odd in its focus on the number of copies held rather than simply on individual titles.

In addition to this single explicit use of a database, there was at least one article that seemed to have made use of an unacknowledged search. An article in the April 2009 issue of the Journal of Military History cited a single newspaper article across all of its 135 citations (although it cited this article several times). It is possible that the author scanned through multiple rolls of microfilm before discarding all but this one, but it seems more likely that the article was found through a keyword search and used to support his other sources.

Finally, there were a number of articles that made widespread use of digitized web sources outside of database and keyword searches. Mark C. Jones, in his article “Experiment at Dundee: The Royal Navy’s 9th Submarine Flotilla and Multinational Naval Cooperation during World War II” (Journal of Military History, Vol. 72, No. 4, October 2008), uses several online sites to provide reference material (for example, he refers readers to http://www.unithistories.com/officers/RN_officersR.html for data on Royal Navy officers and their assignments). Douglas C. Peifer, in his October 2007 article “The Past in the Present: Passion, Politics, and the Historical Profession and British Pardon Campaign,” goes even further. While he makes no reference to database searches, 21 of Peifer’s 76 citations point to digitized sources (with links and access dates), including both newspaper articles and government sources.

This limited use of databases, at least explicitly, reflects the continued focus of mainstream historians on traditional research methods (with notable exceptions such as Peifer and Jones). It also reflects a relatively low use of newspapers, the most commonly digitized sources, within the field of military history. Finally, there may be, and likely are, additional instances of unacknowledged database use within the articles surveyed, as historians largely continue to cite hard copies even when accessing online versions, often through database searches (certainly some of the articles and dissertations cited were accessed through JSTOR, ProQuest, or similar databases).

This week’s practicum is focused on the use of OCR, both utilizing it personally and evaluating its effectiveness on some of the sources available on the web.

To start, I used Google OCR to digitize a scanned image from the Pinkerton National Detective Agency (Image #4). Before putting it through the OCR I rotated it upright and cropped the image to exclude the margins of the page, in the hope of eliminating extraneous objects that might confuse the OCR. This did improve the OCR’s performance a little, but the program still produced a very high percentage of errors. Importantly, most of these were not single-character errors that left a recognizable word; rather, they turned many of the words into complete gibberish, unrecognizable to the naked eye, let alone to a keyword search. Some of the trouble may stem from the considerable bleed-through of words from the back side of the page, and the irregular format of the Q&A also seems to have confused the OCR. There is also a surprising number of errors mistaking letters for punctuation and vice versa.

Here is the resulting Google Doc: Pinkerton Image #4 Google Doc
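For anyone wanting to script the same preprocessing rather than doing it by hand in Google Drive, here is a minimal sketch using pytesseract as a stand-in OCR engine; it is not what Google uses, and the filename, rotation angle, and crop box are placeholders.

```python
from PIL import Image
import pytesseract

# A local stand-in for the manual steps above: rotate the scan upright, crop away
# the page margins, then run OCR. The filename, rotation angle, and crop box are
# placeholders, and pytesseract is not the engine Google uses.
img = Image.open("pinkerton_image_4.jpg")
img = img.rotate(-90, expand=True)                             # scan was on its side
img = img.crop((100, 100, img.width - 100, img.height - 100))  # trim the margins
img = img.convert("L")                                         # grayscale can cut bleed-through noise

text = pytesseract.image_to_string(img)
print(text)
```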

The next element of the practicum was evaluating the OCR performance for several newspaper pages from the Chronicling America project of the Library of Congress. To start, I searched Virginia papers in the year range 1861 to 1865 with the keyword “Stonewall.”

The first page that showed up was: The Soldiers’ journal. (Rendezvous of Distribution, Va.), October 05, 1864, Image 3

The image was pretty clear, and the text easily readable with only a few stray marks on the page. The OCR is actually pretty decent, with only an error every couple of sentences. With most of these errors it is readily apparent what the original word was, even without referring back to the image (Bull Hun for Bull Run, Conterville for Centerville, Bth July for 8th July). The most common error seems to be, for whatever reason, the substitution of “o” for “e” when it occurs in the middle of a word. Looking back at the image, it does appear that the “e” in this typeface has a pretty tight opening, making the OCR’s errors a little more understandable, although for most of these you would think the dictionary would have corrected the character confusion. There is also an issue with recognizing numbers, although what dictionary would favor “Bth” over “8th”?

The second page is: The Abingdon Virginian. (Abingdon [Va.]), January 30, 1863, Image 1

This image is a little rougher, with much smaller print that is hard to read, some uneven inking, and an obvious crease. Perhaps unsurprisingly, given how hard it is to read even for human eyes, the OCR on this one is significantly worse. There is quite a high percentage of errors, and there seems to be less of a pattern to them. Significantly more letters are mistaken for punctuation and special characters, and the column breaks seem to have been an especial problem, with strings of stray letters, mostly “i” and “j,” inserted at the edges of columns. Finally, the column with the crease mark seems to have a higher percentage of errors than the others.

The third page was: The Soldiers’ journal. (Rendezvous of Distribution, Va.), September 28, 1864, Image 7

The image on this one is clear and easily readable with the naked eye, but there is some noticeable bleed-through from the back side of the page, and the entire image is canted a little. The OCR is pretty decent, with an error rate similar to the first page, although the errors seem less consistent. There are occasional random “j”s at the ends of lines, possibly from the rule the paper uses to separate its columns. There also seems to have been a specific problem recognizing numbers when paired with letters; almost every instance of 8th or 1st has been rendered as Bth or some other purely alphabetic gibberish.

The final element of the practicum was to evaluate the OCR of a primary source from my research area. For this I looked at the Google e-book version of the Annual Report of the Secretary of War for 1867, published by the Government Printing Office. The plain text of the OCR for this document is very high quality, and is actually a little easier to read than the image. There are very few errors, and it even manages to capture the words that are italicized. The only consistent error I could note is that “8”s are frequently rendered as “S”s, especially in the middle of dates (so 1860 becomes 1S60).

 

After reviewing these different OCR results, it seems clear to me that there can be a great deal of variation in the quality of OCR output, and therefore in the effectiveness of keyword searches. For example, the most likely keyword search looking for the Pinkerton paper I digitized would be a search for the “Thaw matter,” which the OCR somehow rendered as @1151! matth. A keyword search would thus obviously have missed at least this page of the file. These differing levels of quality are likely due to the quality of the OCR program used (presumably Google uses a more advanced program for its e-books than the OCR it provides on Google Drive), but also, as reflected by the differing results on the Chronicling America papers, to the quality of the original image fed into the program. When doing research it is probably worth taking a look at the OCR text of each database in order to get a better idea of how effective or spotty keyword search results will be, and tailoring your research and search methods based on this evaluation.
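One way to do that tailoring, sketched below, is to loosen a search pattern so it tolerates the specific character confusions seen above (o/e swaps, 8 read as B, and so on). The substitution table and sample sentence are my own illustrations; a real database search would of course use its own query syntax rather than a local regex.

```python
import re

# Sketch of loosening a search term to tolerate common OCR confusions noted above
# (o/e swaps, 8 read as B, l as 1, etc.). The substitution table and the sample
# sentence are illustrations, not any database's actual search behavior.
CONFUSIONS = {"e": "[eo]", "o": "[oe]", "8": "[8B]", "b": "[b8]", "l": "[l1i]", "s": "[s5]"}

def loose_pattern(term: str) -> str:
    # Replace each character with a class of its common OCR look-alikes.
    return "".join(CONFUSIONS.get(ch, re.escape(ch)) for ch in term.lower())

pattern = re.compile(loose_pattern("stonewall"), re.IGNORECASE)
ocr_text = "Gen. Stonowall Jackson's brigade held the line near Bull Hun."
print(bool(pattern.search(ocr_text)))  # True: matches "Stonowall" despite the OCR error
```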

Today’s project is to assess the existing digital history of my research topic. My research interest is in 19th century American Military History, specifically the post-Civil War professionalization of the Army. That seems a little broad for an internet search, but Google’s fast so let’s see what it produces before I narrow it down to something more specific and searchable.

1st Google Search: “19th Century American Military History”

Perhaps unsurprisingly, the first several entries are Wikipedia articles. There is a timeline of military history from www.militaryhistory.about.com; it seems pretty simplistic and pretty Western-focused, but could be useful for general background information. There are also several book lists from Barnes & Noble and similar sites. Further down in the results is the 19th Century page of Military History Online, which actually includes a reasonable set of what appear to be somewhat scholarly articles (at least they have footnotes), although they seem to be member contributions rather than peer-reviewed entries. None appear to be particularly related to my specific research interests, although several look interesting. The last link on the first page of results is the Early 19th Century Online Bookshelf of the U.S. Army Center of Military History. For the early 19th century this includes two digitized general reference books and one digitized archival source: “the Regulation for the Uniform and Dress of the United States Army, 1839.” Searching through the Center of Military History also pulls up Online Bookshelves for the Late 19th Century (7 digitized secondary sources, 1 digitized archival source) and the War with Spain (5 digitized secondary sources and 7 digitized archival sources).

Let’s see what adding “digital” to the search does.

2nd Google Search: “19th Century Military Digital History”

This search is completely dominated by articles from www.digitalhistory.uh.edu (slogan: “using new technologies to enhance teaching and research”). While the articles lack footnotes, the .edu address is at least moderately reassuring. Going to their homepage, the site seems to be focused primarily on K-12, but they do have some digitized primary documents as well as some exhibits (although none are related to my area). They also, quite usefully, have a link explaining how to cite them.

So it seems there is a decent amount of 19th Century military history available digitally, but so far nothing specific to my research interests, so I will try a more focused search.

3rd Google Search: “Leavenworth School System”

Well, that was not super useful, and probably to be expected. The results were all links to current Leavenworth school districts.

4th Google Search: “Army Leavenworth Schools”

Strike 2. The results of this search are split between sponsored links from for-profit schools trying to sell online degrees to service members and the websites of the current Fort Leavenworth schools (Command and General Staff College, Combined Arms Center, etc.).

This topic may be a little too esoteric, and have too much overlap with current institutions, to return a useful Google search result, so I’ll have to look deeper into likely digital archival sources. The most logical place to start is the U.S. Army Military History Institute. All of the USAMHI’s finding aids are available online, as well as a catalog search for its secondary sources and articles. The USAMHI also has several digital collections, including digitized Army regulations, field manuals, and general orders. It also has a selection of manuscripts and printed material that have been digitized. Finally, there are keyword-searchable databases for its non-digitized manuscript holdings. The digitized files are mostly more recent documents, but the finding aids are useful for planning visits, especially as they have been digitized in such a way that a keyword search brings you to specific entries in each document.

A search through the National Archives site reveals similar results: the digitized elements of the archives are not relevant to my research, but all the finding aids and other research planning tools are available to do at least a majority of the legwork in identifying archival sources before showing up to the physical archive.

Finally, while none have shown up during the course of these searches, I know from prior research that many of the printed materials from this period have been digitized by Google Books and other sites such as HathiTrust. So while researching, if a printed source is referred to, a quick Google search of its full title will often turn up a digitized copy. Unfortunately, since many of the documents I work with have titles like Report of the Board of officers appointed in pursuance of the act of Congress approved June 6, 1872, for the purpose of selecting a breech-system for the muskets and carbines of the military service, together with their report upon the subject of trowel-bayonets, they rarely turn up in more generic keyword searches.

The first practicum for my new CLIO wired class on digital history involved establishing a personal domain online (this website and blog). Using Reclaim Hosting and WordPress, the process was surprisingly simple and intuitive, although there is still so much of the site to explore and understand.

This site represents a relatively recent and limited foray into the online world for me…a current Google search of my name reveals just how limited that presence is…

Due to an unfortunate confluence of my last name and consumer products, the first result (as well as many of the subsequent ones) is for “Affordable Brand Name Furniture.” I am likewise absent from the image results, which are heavily dominated by Ben & Jerry’s ice cream.

Facebook pages for others sharing my name are the fourth result, but due to my privacy settings and the way I structured my profile name, my page does not appear. As I primarily use Facebook for social purposes and not professional connections, I’d like to keep it this way.

Finally, the sixth result leads to my LinkedIn profile, which I need to update with a photo. I should probably also update it to emphasize my current studies over my prior military service.

As it stands, my online presence is pretty sparse. Moving forward I need to update my LinkedIn profile, and I will be adding a (professionally focused) Twitter account later today. I’ll be keeping my Facebook private and personal, so the focus there will be on keeping it out of simple searches rather than integrating it with the rest of my online presence. Finally, I need to look at adding an Academia.edu profile, although that will be more of a long-term project as I build my CV. Next year, when I begin work at the United States Military Academy, I will also have a page through the History Department there. This website will continue to grow, and hopefully it will eventually become the “home base” of my online presence as I learn more about how to use it.