Text recognition

The task of turning a picture of a document (a scan, a photo, or a microfilm frame, for example) into machine-readable, searchable text. When the source is printed, this is usually called optical character recognition (OCR); when it is handwritten, it is called handwritten-text recognition (HTR), which is harder because every hand is different.

For language work this is typically the digitization step: getting old manuscripts, field notes, dictionaries, and records out of boxes and into a form that can be searched, edited, and stored. Tools like Transkribus specialize in it, and the results usually need somewhere to live afterward (for example a Mukurtu CMS archive).

Modern text recognition is built on machine learning. That is why it works best when training data exists for the script and language in question, and why it can struggle with low-resource languages, unusual orthographies, or a single distinctive handwriting. Increasingly, general-purpose vision-language models can do the same job, though a tool built specifically for the task is often more accurate and more controllable.

Supported by National Science Foundation Award 2542375.