Using OCR to Build a Digital Humanities Corpus


Nick Wolf | February 8, 2017

Get this presentation: https://tinyurl.com/dhweek2017-ocr


What is OCR and What Can We Do with It?

  • OCR = Optical Character Recognition
  • A system that analyzes an image of writing glyph by glyph and turns it into a document of machine-readable characters
  • High-performing OCR depends on machine learning: you supervise your computer in recognizing images of characters, including unusual fonts, non-English-language texts, etc.
  • Many options, but for highest accuracy (especially on older documents) use ABBYY or Tesseract

What is OCR and What Can We Do with It?

  • Aim is to produce a machine-readable file, whether plain text or html/xml (if possible, with bounding boxes)
  • Preserve metadata elements important to DH inquiry (e.g. document structural elements, like page number)
  • Preserve textual ornamentation (e.g. bold, italics, indents, capitalization) that may assist with identifying word tokens
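
To make "html/xml with bounding boxes" concrete, here is a minimal sketch that pulls each word and its bounding box out of an hOCR file, the HTML flavor that Tesseract can emit. The function name hocr_words and the sample line are illustrative, and the attribute layout (single-quoted attributes, a title beginning with "bbox x0 y0 x1 y1") is an assumption about the OCR tool's output that may differ between versions.

```shell
# Print "x0 y0 x1 y1 word" for each ocrx_word span in an hOCR stream on stdin.
# Assumption: attributes are single-quoted and the title starts with "bbox".
# Limitation: words wrapped in <strong> or <em> markup are skipped.
hocr_words() {
  grep -o "<span class='ocrx_word'[^>]*>[^<]*</span>" |
    sed -E "s/.*title='bbox ([0-9]+ [0-9]+ [0-9]+ [0-9]+)[^']*'[^>]*>([^<]*)<\/span>/\1 \2/"
}

# Illustrative sample line (not taken from the test materials).
sample="<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>This</span>"
printf '%s\n' "$sample" | hocr_words
# prints: 36 92 96 116 This
```

The four numbers are the left, top, right, and bottom pixel coordinates of the word on the page image, which is what lets you map a token back to its position on the page.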

Input Formats

  • PDF or TIFF. For Tesseract, TIFF only.
  • Input images at 300 dpi with good contrast.
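
As a rough check on the 300 dpi target, a scan's effective resolution is its pixel count along one edge divided by the physical length of that edge in inches. The sketch below works that arithmetic; the function names scan_dpi and meets_300dpi are made up for illustration, and the page size is an assumed example.

```shell
# Effective resolution: pixels along one edge / physical length in inches.
# $1 = pixel count, $2 = length in inches (may be fractional).
scan_dpi() {
  awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.0f\n", px / inches }'
}

# Succeeds (exit status 0) when the scan reaches at least 300 dpi.
meets_300dpi() {
  [ "$(scan_dpi "$1" "$2")" -ge 300 ]
}

# Example: a page 8.5 inches wide scanned to 2550 pixels wide.
scan_dpi 2550 8.5
# prints: 300
```

A scan of the same page that is only 1275 pixels wide works out to 150 dpi, which falls short of the target.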

Test Materials

Download this shared folder as a zip. It contains the following files.
  • nypl-twain-typescript.jpg : A page from Mark Twain's Autobiographical Account of a Stay in England, available at NYPL.
  • masses079.pdf : An issue of The Masses (1917), a periodical available in NYU's digital library holdings.
  • huerta.tiff : A page from Leslie's Illustrated Weekly Newspaper (1915), available at NYPL.
  • 1875-parl.pdf : A few pages from the 1871 English/Welsh Census, available from ProQuest's Parliamentary Papers database.

Option One: ABBYY FineReader

Basic specs:

  • Mac and Windows versions. Currently, Windows interface much better.
  • 30-day Trial License available (great for one-off projects)
  • Discount license ($118.99 currently) for teachers/students/researchers
  • Two Mac versions on scanner computer stations in Digital Studio at Bobst; licensed version here in 617, trial version in 5th-Floor Research Commons labs.



Step One: Load Document

Generally, select "Image or PDF File to Other Formats"

Step Two: Initial Pass by ABBYY

Hands off! Let ABBYY try to recognize every page first

Step Three: The Windows FineReader Dashboard

When asked if you want to save, select no...we need to refine the output considerably.

Text Area Types

  • Text Area (light green)
  • Picture Area (red)
  • Table Area (blue)

Step Four: Best Workflow Order of Operations

  1. Work page by page to first DELETE all unwanted text/picture/table areas.
  2. ADJUST the size of the existing areas to capture everything you need on the page, preserving the original text/picture/table boxes.
     Why? Order matters! The output text follows the order of the text/picture/table boxes in the first window, and any boxes you add get tacked on to the end of the page. So preserve the automatically created order first by deleting and editing what is already there.
  3. ADD text areas as needed, then adjust their order.

Step Four: Best Workflow Order of Operations

4. Click "Read" on top menu from time to time to generate a new output text. This allows ABBYY to continue to use its embedded tools to guess at words

5. Use the bottom detail window to adjust the location of rows and columns in tables. Select your table area using the selection pointer, then click on a table row/column line to delete, add, or move separators.

6. Implement the pattern trainer by following the "Creating and Training a User Pattern" tutorial here

Step Four: Best Workflow Order of Operations

7. FINALLY you may want to walk through the right-hand text editor window and correct mistakes in blue. However, note that ABBYY marks in blue both things that are wrong and things that only might be wrong. Don't waste time eliminating blue markup.

Step Five: Save and Export

  1. Save the ABBYY project bundle. Go to FILE >> Save FineReader Document and save the project from time to time.
  2. When ready to export, hit the "Save" icon in the top menu bar and select an output format. Note also the dropdown options under the "Document Layout" section. Opt in or out to keep things like pictures, page numbers, headers, footers, line breaks, hyphens.
  3. Best options: RTF (if you want the exact layout on the page, especially line breaks) or TXT (if you just want text and line breaks).

Option Two (the free option): Tesseract 3.x

  • Find documentation here
  • Install by first loading Homebrew on Mac (if you don't have it already). From terminal, type:
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  • Then type: brew install tesseract
  • Works on .tiff (uncompressed) files. Use tools like ImageMagick and Ghostscript to make conversions.
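
The conversions mentioned above might look like the following sketch. It assumes Ghostscript (gs) and ImageMagick 6 (convert; ImageMagick 7 names the command magick) are installed. The helper names run, pdf_to_tiff, and img_to_tiff are made up for illustration, and the DRY_RUN switch simply prints each command instead of executing it, so you can inspect it first.

```shell
# Print the command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# PDF -> one uncompressed grayscale TIFF per page at 300 dpi (Ghostscript).
pdf_to_tiff() {
  run gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=none -r300 \
    "-sOutputFile=${1%.pdf}-%03d.tiff" "$1"
}

# Raster image -> uncompressed TIFF tagged as 300 dpi (ImageMagick).
# Note: -density only sets the resolution tag; it does not add detail.
img_to_tiff() {
  run convert "$1" -units PixelsPerInch -density 300 -compress none "${1%.*}.tiff"
}

# Preview the commands for two of the test files without running them.
DRY_RUN=1 pdf_to_tiff masses079.pdf
DRY_RUN=1 img_to_tiff nypl-twain-typescript.jpg
```

Drop the DRY_RUN=1 prefix to perform the conversions. The PDF call writes masses079-001.tiff, masses079-002.tiff, and so on, one file per page.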


Running Tesseract

  • Runs in the command line, but don't be intimidated...basic command is: tesseract input-image-location output-text-location
  • To output to an html file with bounding boxes, use tesseract input-image-location output-text-location hocr
  • Batch OCR (create the output folder first): for item in *.tiff; do tesseract "$item" "output_folder_name/$item"; done
  • For more help, check out this tutorial from Washington University in St. Louis
  • Training must also be done through command line, and that is a little harder. See tutorial at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract.
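
The batch command above can be expanded into a small script along these lines. This is a sketch rather than a definitive workflow: the function names ocr_name and batch_ocr are made up for illustration, and it assumes tesseract is on your PATH. It strips the .tiff extension so that outputs are named huerta.txt rather than huerta.tiff.txt, and it quotes filenames so that names containing spaces survive.

```shell
# Strip directory and extension: scans/huerta.tiff -> huerta
ocr_name() {
  name="${1##*/}"
  printf '%s\n' "${name%.*}"
}

# OCR every .tiff in a folder.
# $1 = input folder, $2 = output folder,
# $3 = optional "hocr" for HTML output with bounding boxes.
batch_ocr() {
  mkdir -p "$2"
  for item in "$1"/*.tiff; do
    [ -e "$item" ] || continue   # skip if the folder holds no .tiff files
    tesseract "$item" "$2/$(ocr_name "$item")" ${3:+"$3"}
  done
}

# Show the output base name that would be used for one input file.
ocr_name scans/huerta.tiff
# prints: huerta
```

To run it on the current folder, call batch_ocr . ocr_output for plain text, or batch_ocr . ocr_output hocr for HTML with bounding boxes.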

Happy OCRing. Questions?


Email me: nicholas.wolf@nyu.edu

Get this presentation: guides.nyu.edu/data_management/resources

Make an appointment: guides.nyu.edu/appointment