Nick Wolf | February 8, 2017
Get this presentation: https://tinyurl.com/dhweek2017-ocr
Nick's ORCID: 0000-0001-5512-6151
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
nypl-twain-typescript.jpg: A page from Mark Twain's Autobiographical Account of a Stay in England, available at NYPL.
masses079.pdf: An issue of The Masses (1917), a periodical available in NYU's digital library holdings.
huerta.tiff: A page from Leslie's Illustrated Weekly Newspaper (1915), available at NYPL.
1875-parl.pdf: A few pages from the 1871 English/Welsh Census, available from ProQuest's Parliamentary Papers database.
Basic specs:
1. Generally, select "Image or PDF File to Other Formats".
2. Hands off! Let ABBYY try to recognize every page first.
3. When asked if you want to save, select no. We need to refine the output considerably.
Why? Order matters! The output text will follow the order of the text/picture/table boxes in the first window. Any boxes you add get tacked on to the end of the page. So preserve the naturally created order of boxes first, deleting and editing what is already there.
4. Click "Read" on the top menu from time to time to generate a new output text. This allows ABBYY to continue to use its embedded tools to guess at words.
5. Use the bottom detail window to adjust the location of rows and columns in tables. Select your table area using the selection pointer, then click on a table row/column line to delete, add, or move separators.
6. Implement the pattern trainer by following the "Creating and Training a User Pattern" tutorial.
7. FINALLY, you may want to walk through the right-hand text editor window and correct mistakes in blue. However, note that ABBYY marks in blue both things that are wrong and things that might be wrong. Don't waste time eliminating blue markup.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install tesseract
tesseract input-image-location output-text-location
tesseract input-image-location output-text-location hocr
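The hocr option writes an HTML file that records each recognized word along with its position on the page. As a rough sketch of how that file might be inspected, the hypothetical helper below (hocr_words is not part of Tesseract) lists each word next to its confidence score. It assumes the file marks words as ocrx_word spans carrying an x_wconf value, as recent Tesseract versions do. Words wrapped in extra markup such as bold or italic tags will show up with an empty word column.

```shell
# Hypothetical helper: print "confidence word" for each word in an hOCR file.
# Assumes words appear as <span class='ocrx_word' ... x_wconf NN'>word</span>.
hocr_words() {
  grep -o "<span class=['\"]ocrx_word['\"][^>]*>[^<]*" "$1" |
    sed -E 's/.*x_wconf ([0-9]+)[^>]*>(.*)/\1 \2/'
}
```

Calling hocr_words output-text-location.hocr prints one word per line. Sorting that output numerically (for example by piping it to sort -n) puts the lowest-confidence words first, which may help decide where hand correction is most needed.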
for item in *.tiff; do tesseract "$item" "output_folder_name/$item"; done
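The batch loop above expects the output folder to exist already and names each result after the full image filename, extension included. The sketch below is one possible way to wrap the same loop in a reusable function. The name ocr_folder and its two arguments are illustrative, not part of Tesseract. It creates the output folder, copes with spaces in filenames, and drops the .tiff extension from the output names.

```shell
# Hypothetical wrapper around the batch loop.
# Usage: ocr_folder INPUT_DIR OUTPUT_DIR
ocr_folder() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1        # make sure the output folder exists
  for item in "$in_dir"/*.tiff; do
    [ -e "$item" ] || continue           # nothing to do if no .tiff files match
    base=$(basename "$item" .tiff)       # huerta.tiff -> huerta
    # Tesseract appends .txt to the output name itself
    tesseract "$item" "$out_dir/$base" || echo "failed: $item" >&2
  done
}
```

For example, ocr_folder . output_folder_name processes every .tiff file in the current folder and reports any image that Tesseract could not read without stopping the rest of the batch.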
Email me: nicholas.wolf@nyu.edu
Get this presentation: guides.nyu.edu/data_management/resources
Make an appointment: guides.nyu.edu/appointment