Twitter, Facebook, WeChat, and similar platforms are common social media. People communicate for many reasons, whether professional or emotional. As humans, we can read an excerpt of text, decipher the feeling embedded in it, and react accordingly. But can machine learning algorithms also “decipher” the feelings embedded in text?
OCR stands for optical character recognition, a technique that converts a pixelated image of text into machine-readable words and characters, improving readability and interpretability. In this independent study with Prof. Jill Naiman, we explored different ways to improve the readability and correctness of OCR output so that it reads better to humans. An example of an OCR’d page is shown in the figure above.
Optical Character Recognition (OCR) is one method of extracting text from an image, but its accuracy suffers from many issues, such as spelling and punctuation errors.
My study aims to investigate the types of OCR errors produced by the Tesseract OCR engine and to quantify its accuracy on historical astrophysics documents under different optimizations of the OCR process.
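One common way to quantify OCR accuracy is the character error rate (CER): the edit distance between the OCR output and a ground-truth transcription, normalized by the length of the ground truth. The sketch below is a minimal, self-contained illustration of that metric; the function name and the example strings are hypothetical and not taken from the study itself.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the OCR output (hypothesis)
    and the ground-truth text (reference), normalized by reference length."""
    m, n = len(reference), len(hypothesis)
    # dist[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dist[m][n] / max(m, 1)

# Hypothetical example of a classic OCR confusion: "1" misread as "l"
# and vice versa (3 substitutions over 11 reference characters).
print(char_error_rate("galaxy m101", "ga1axy ml0l"))  # → 3/11 ≈ 0.273
```

A CER of 0 means the OCR output matches the ground truth exactly; comparing CER before and after each optimization of the OCR pipeline gives a simple measure of improvement.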