What is OCR:
OCR stands for optical character recognition, a technique for converting a pixelated image of text into machine-readable words and characters so that the text is easier to read and interpret. Some old books are available only as scans, in which some words are hard to make out and which are hard to preserve in digital form. In this independent study with Prof. Naiman Jill, we explored ways to improve the readability and correctness of OCR output so that it reads better to humans. An example of an OCR’d page is shown in the figure above.
The main issues we found in OCR data are missing characters, missing punctuation and line feeds, and grammar errors. For the OCR engine, we chose Google's Tesseract OCR engine, used through the Python wrapper “pytesseract” (https://pypi.org/project/pytesseract/).
Our exploration covered several methods and modules: running a spell checker (https://github.com/filyp/autocorrect), using GROBID to parse PDFs into a structured format, building our own dictionary of topic-specific words, and quantifying errors with the word error rate (WER, the share of words that differ from the ground truth) and the character error rate (CER, the share of characters that differ from the ground truth).
Our goal is to digitize the old scanned papers and convert them to text, and possibly to host their figures and captions at AIE (http://www.astroexplorer.org/).
The first method we tried was to run a spelling check on the OCR text with the autocorrect package (https://github.com/filyp/autocorrect). Because OCR text generally contains small spelling errors or missing characters, a spell checker can often detect these errors and replace them with the most common English words. However, some errors are less obvious, and a single word can contain several wrong letters. A spell checker can also change the meaning of a sentence, which is not acceptable for scientific writing.
Our second approach was to produce GROBID output, which is more structured and carries metadata. GROBID, short for GeneRation Of BIbliographic Data, is a tool that parses vector PDF documents and saves the output in a structured notation similar to TEI/XML.
OCR’d pages sometimes have a hyphen at the end of a line where a word is wrapped onto the next line, whereas the GROBID-parsed text can keep the whole word together. Spell checkers sometimes treat the two halves as spelling errors and try to fix them, but with the GROBID information we can join the two parts back into one word.
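The re-joining step can be sketched as follows. This is a minimal illustration, not our actual pipeline: the function name `join_hyphenated` and the `known_words` set are hypothetical, and we assume the whole words have already been collected (for example from the GROBID-parsed text) into a set of lowercase strings.

```python
import re

# A word fragment, a hyphen at the end of a line, and the fragment that
# starts the next line, e.g. "spectro-\nscopy".
_LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-\s*\n\s*([A-Za-z]+)")


def join_hyphenated(ocr_text, known_words):
    """Join words split across a line break when the joined form is known.

    known_words is an assumed set of lowercase whole words, for example
    gathered from the GROBID-parsed text of the same document.
    """

    def _join(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined
        # Unknown joined form: leave the original text untouched, since the
        # hyphen may be a real one (as in "well-known").
        return match.group(0)

    return _LINE_END_HYPHEN.sub(_join, ocr_text)


sample = "The spectro-\nscopy of a well-\nknown star"
print(join_hyphenated(sample, {"spectroscopy"}))
```

Only the split whose joined form appears in the known-word set is merged; the other line-end hyphen is kept, which avoids removing hyphens that belong to the word.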
Another approach was to build a personal dictionary of frequent words in astronomy texts. Here we used the Python module “pyenchant” (the Python wrapper for the C library Enchant) and its PyPWL personal word list, which let us build our own dictionary from the GROBID-parsed data from vector PDFs.
We piped in tokenized words, as produced by GROBID, from other similar scientific documents to build our dictionary. The first attempt used about a million words (approx. 100 thousand after removing non-alphabetic characters and repeated words that differ only in tense). The dictionary was built surprisingly quickly, and its word suggestions fit the scientific domain better than those of our first spell checker method. However, the module slows down sharply as the input words get longer. By our rough estimate, it takes less than one second to process common words shorter than 4-5 letters, while it runs longer and gives worse suggestions for longer words (10-20 seconds for 10-letter words).
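As a rough illustration of the word-list preparation, the sketch below collects unique alphabetic tokens from a few texts using only the Python standard library. The function name `build_vocabulary` and the sample sentences are made up for this example, and the sketch does not merge words that differ only in tense, which our real list did.

```python
import re
from collections import Counter

# Runs of ASCII letters; digits and punctuation are dropped.
_TOKEN = re.compile(r"[A-Za-z]+")


def build_vocabulary(texts, min_count=1):
    """Return a sorted list of unique lowercase alphabetic tokens.

    min_count is a hypothetical knob for dropping rare tokens, which are
    often OCR noise rather than real words.
    """
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in _TOKEN.findall(text))
    return sorted(word for word, n in counts.items() if n >= min_count)


docs = ["The redshift of the galaxy.", "Galaxy redshift survey, 1998"]
print(build_vocabulary(docs))
```

Writing the resulting words to a text file, one word per line, gives a list that can then be handed to a personal word list.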
On further research, we found that the personal dictionary of the “enchant” module is based on word edit distance (Levenshtein distance), which is the minimum number of edits needed to turn one word into another. This may explain why the dictionary takes longer for longer words.
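To make the edit-distance idea concrete, here is a minimal sketch of Levenshtein distance and a naive suggestion routine built on it. It is not the code that enchant or pyenchant runs; the function names `levenshtein` and `suggest`, the `max_distance` cutoff, and the tiny word list are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first."""
    scored = [(levenshtein(word, candidate), candidate)
              for candidate in vocabulary]
    return [candidate for dist, candidate in sorted(scored)
            if dist <= max_distance]


print(levenshtein("kitten", "sitting"))
print(suggest("galxy", ["galaxy", "gas", "gravity"]))
```

In this naive form, each comparison fills a table whose size is the product of the two word lengths, and the comparison is repeated for every dictionary entry, so longer input words and larger dictionaries both add work.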
The next task used the HathiTrust and Gutenberg datasets to quantify error rates (A Prototype Gutenberg-HathiTrust Sentence-level Parallel Corpus for OCR Error Analysis: Pilot Investigations, Ming). This approach helped us quantify how well OCR preserves information. We took the text of the same document from both HathiTrust (images, PDFs, OCRed data) and Gutenberg (human-typed text, used as the ground truth). We computed both the word error rate and the character error rate on these datasets.
The outcome is that the OCR data has a high chance of containing character-level errors, which is what the character error rate (CER) measures.
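The two metrics can be sketched as below, using the common definition of an error rate as edit distance divided by the length of the ground truth. This is a minimal illustration under that assumption, not necessarily the exact computation we ran; the function names and the sample strings are made up, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(previous[j] + 1,             # deletion
                               current[j - 1] + 1,          # insertion
                               previous[j - 1] + (r != h)))  # substitution
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word (whitespace split)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


truth = "the quick brown fox"   # stands in for the Gutenberg text
ocr = "the qu1ck brown f0x"     # stands in for the HathiTrust OCR text
print("WER:", wer(truth, ocr))
print("CER:", cer(truth, ocr))
```

In this toy pair, two wrong characters spoil two of the four words, so the WER is much higher than the CER; a single character error is enough to count a whole word as wrong.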
Our last approach, which we did not quite finish, was to use a Python module named ocr-post-correction to train and test models for spelling and grammar correction (https://github.com/shrutirij/ocr-post-correction). As the procedure, we split the documents into training and test sets. Because this is supervised machine learning, we used the HathiTrust text as the training input and the Gutenberg text as the training target. We then intended to run the model on held-out HathiTrust data to see how well the method performs. Unfortunately, we were not able to run the full training procedure because GPU training was not fully supported by the current codebase on Google Colab.
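The split itself can be sketched as follows, assuming the OCR sentences and the ground-truth sentences are already aligned one-to-one. This is only an illustration of the train/test split; it does not reproduce the input format or the training code of ocr-post-correction, and the function name `split_pairs`, the 20% test fraction, and the seed are hypothetical choices.

```python
import random


def split_pairs(ocr_sentences, truth_sentences, test_fraction=0.2, seed=0):
    """Shuffle aligned (OCR, ground truth) pairs and split into train/test.

    Each pair is (model input, expected output), e.g. a HathiTrust sentence
    and the matching Gutenberg sentence.
    """
    if len(ocr_sentences) != len(truth_sentences):
        raise ValueError("inputs must be aligned one-to-one")
    pairs = list(zip(ocr_sentences, truth_sentences))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]


ocr = ["sentence %d with err0rs" % i for i in range(10)]
truth = ["sentence %d with errors" % i for i in range(10)]
train, test = split_pairs(ocr, truth)
print(len(train), "training pairs,", len(test), "test pairs")
```

Keeping each OCR sentence paired with its ground truth before shuffling ensures that no sentence ends up in both sets and that inputs and targets stay aligned.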
Conclusion and Future Work
In our semester of independent research on OCR and post-OCR correction methods, we worked with several types of corrector modules in Python for minimizing errors in OCRed text. These methods help make OCR data more correct, but each leaves room for improvement. To digitize a larger corpus of documents, we still need to implement and compare different deep learning models and methods for OCR correction.