This week’s practicum focuses on OCR: both using it myself and evaluating how well it performs on some of the sources available on the web.
To start, I used Google OCR to digitize a scanned image from the Pinkerton National Detective Agency (Image #4). Before putting it through the OCR I rotated it to be upright and cropped the image to exclude the margins of the page, in the hopes of eliminating extraneous objects that might confuse the OCR. This did improve the OCR’s performance a little, but the program still produced a very high percentage of errors. Importantly, most of these were not single-character errors that left a recognizable word; rather, they turned many of the words into complete gibberish, unrecognizable to the naked eye, let alone to a keyword search. Some of the problem may stem from the considerable bleed-through of words from the back of the page, and the irregular format of the Q&A also seems to have confused the OCR. There is also a surprising number of errors mistaking letters for punctuation and vice versa.
Here is the resulting Google Doc: Pinkerton Image #4 Google Doc
The next element of the practicum was evaluating the OCR performance for several newspaper pages on the Chronicling America project of the Library of Congress. To start with, I searched Virginia papers in the year range 1861 to 1865 with the keyword “Stonewall.”
The first page that showed up was: The Soldiers’ journal. (Rendevous of Distribution, Va.), October 05, 1864, Image 3
The image was pretty clear, and the text easily readable, with only a few stray marks on the page. The OCR is actually pretty decent, with only an error every couple of sentences. With most of these errors it is readily apparent what the original was, even without referring back to the image (Bull Hun for Bull Run, Conterville for Centerville, Bth July for 8th July). The most common error seems to be, for whatever reason, the substitution of “o” for “e” when it occurs in the middle of a word. Looking back at the image, the “e” in this typeface does appear to have a pretty tight opening, making the OCR’s errors a little more understandable, although for most of these you would think the dictionary would have corrected the character confusion. There is also an issue with recognizing numbers, although what dictionary would favor “Bth” over “8th”?
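To make the point about dictionary correction concrete, here is a minimal sketch of how a confusion-aware dictionary check could work. It is not how the Chronicling America OCR actually operates; the confusion pairs and the tiny word list are assumptions I made up from the errors noted above.

```python
from itertools import product

# Hypothetical confusion table built from the errors seen on this page:
# each OCR output character maps to the characters it may really have been.
CONFUSIONS = {"o": "oe", "H": "HR", "B": "B8"}

# A tiny stand-in word list (an assumption for illustration); a real check
# would need a full dictionary plus period place names.
VOCAB = {"bull", "run", "centerville", "8th", "july"}


def candidates(token):
    """Yield every spelling reachable by undoing the listed confusions."""
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct(token):
    """Return the token if it is a known word, else the first known candidate.

    Falls back to the unchanged token when no candidate is in the word list.
    """
    if token.lower() in VOCAB:
        return token
    for cand in candidates(token):
        if cand.lower() in VOCAB:
            return cand
    return token


if __name__ == "__main__":
    for word in ["Bull", "Hun", "Conterville", "Bth", "July"]:
        print(word, "->", correct(word))
```

The number of candidates doubles with every ambiguous character in a token, so this only suits short confusion tables, and it can only repair errors that leave most of the word intact.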
The second page is: The Abingdon Virginian. (Abingdon [Va.]), January 30, 1863, Image 1
This image is a little bit rougher, with much smaller print that is hard to read, some uneven inking, and an obvious crease. Perhaps unsurprisingly, given how hard it is to read even for human eyes, the OCR on this one is significantly worse. There is quite a high percentage of errors, and there seems to be less of a pattern to them. Significantly more letters are mistaken for punctuation and special characters, and the column breaks seem to have been a particular problem, with strings of loose letters, mostly “i” and “j,” inserted at the edges of columns. Finally, the column with the crease mark seems to have a higher percentage of errors than the others.
The third page was: The Soldiers’ journal. (Rendevous of Distribution, Va.), September 28, 1864, Image 7
The image on this one is clear and easily readable with the naked eye, but there is some noticeable bleed-through from the back of the page, and the entire image is canted a little. The OCR is pretty decent, with an error rate similar to the first page, although the errors seem less consistent in this one. There are occasional random “j”s at the ends of lines, possibly from the line the paper uses to separate its columns. There also seems to have been a specific problem in recognizing numbers when paired with letters: almost every case of 8th or 1st has been rendered as Bth or some other pure-letter gibberish.
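My comparisons of error rates between pages are eyeball estimates. One way to put a number on them would be to hand-transcribe a short passage and compute a character error rate against the OCR text. The sketch below is only an illustration of that idea using a standard edit distance; the function names are my own, and I did not actually run this on the pages above.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, truth):
    """Edit distance divided by the length of the hand transcription."""
    if not truth:
        raise ValueError("the reference transcription must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)


if __name__ == "__main__":
    print(char_error_rate("Bull Hun", "Bull Run"))
```

A single wrong character in an eight-character phrase gives a rate of 1/8, so short samples are noisy; a few full sentences per page would give a fairer comparison.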
The final element of the practicum was to evaluate the OCR of a primary source for my research area. For this I looked at the Google e-book version of the Annual Report of the Secretary of War for 1867, published by the Government Printing Office. The plain text of the OCR for this document is very high quality, and is actually a little bit easier to read than the image. There are very few errors, and it even manages to capture the words that are italicized. The only consistent error I could note is that “8”s are frequently misread as “S”s, especially in the middle of dates (so 1860 becomes 1S60).
After reviewing these different OCR results, it seems clear to me that the quality of OCR output can vary enormously, and with it the effectiveness of keyword searches. For example, the most likely keyword search for the Pinkerton paper I digitized would be a search for the “Thaw matter,” which somehow got rendered by the OCR as “@1151! matth.” A keyword search would thus obviously have missed at least this page of the file. These differing levels of quality are likely due in part to the quality of the OCR program used (presumably Google uses a more advanced program for its e-books than the OCR it provides on Google Drive), but also, as the differing results on the Chronicling America papers show, to the quality of the original image fed into the program. When doing research it is probably worth taking a look at the OCR text of each database to get a better idea of how effective or spotty keyword search results will be, and then tailoring your research and search methods to that evaluation.
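One way to tailor a search to spotty OCR is to match approximately rather than exactly. The sketch below uses Python’s standard difflib module to slide a window across the text and keep phrases that are close to the keyword. The function name and the 0.8 threshold are my own assumptions, not features of any of the databases discussed here.

```python
from difflib import SequenceMatcher


def fuzzy_find(keyword, text, threshold=0.8):
    """Return (phrase, score) pairs for word windows that resemble the keyword.

    The window is as many words long as the keyword, and the score is
    difflib's similarity ratio between 0 and 1 (threshold is an assumption).
    """
    words = text.split()
    size = len(keyword.split())
    hits = []
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        score = SequenceMatcher(None, keyword.lower(), window.lower()).ratio()
        if score >= threshold:
            hits.append((window, score))
    return hits


if __name__ == "__main__":
    # A one-character OCR slip is still found...
    print(fuzzy_find("Bull Run", "the battle of Bull Hun was"))
    # ...but gibberish like the Pinkerton output is not.
    print(fuzzy_find("Thaw matter", "re the @1151! matth of"))
```

This recovers near misses like “Bull Hun,” but it cannot rescue a phrase the OCR has turned into gibberish, which is why looking at the raw OCR text first still matters.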