Automatic correction of OCR

A milestone: I have begun automatically correcting the OCR errors in the 46 volumes of Danby Pickering’s ‘Statutes At Large’, and have uploaded the improved text to GitHub.

Given the quantity of text I’m dealing with – the Pickering series alone amounts to over fifteen million words – correcting each volume ‘by hand’ is obviously impractical. Bulk ‘find and replace’ is faster, but working through one error at a time across 46 volumes is still too slow to be practical.

Such repetitive tasks are grist to the digital mill. So, using a list of common OCR errors, augmented with others I’ve found, and a one-line bash script, automatic improvement of the texts has commenced.
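
To give a flavour of the approach, here is a minimal sketch in Python. It is not the bash one-liner actually used, which isn't reproduced here; the function name `correct_line` and the three sample entries in `CORRECTIONS` are made up for illustration, standing in for the much longer list of errors.

```python
# Hypothetical sample entries; the real list of OCR errors is far longer.
CORRECTIONS = {
    "tbe": "the",
    "aud": "and",
    "wbich": "which",
}


def correct_line(line: str, corrections: dict[str, str]) -> str:
    """Replace each space-separated token that exactly matches a known error."""
    tokens = line.split(" ")
    # Tokens not found in the list are passed through unchanged.
    return " ".join(corrections.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(correct_line("tbe king aud tbe lords", CORRECTIONS))
```

Because the whole list is applied in a single pass over each line, adding a newly spotted error costs one more entry in the table rather than another trip through every volume.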

The results are a clear improvement, but the texts still aren’t great: many spelling errors remain. Because I used spaces as separators, words with punctuation attached (a word followed by a comma, say) went uncorrected. The many problems arising from layout are still to be faced.
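
One possible way to reach words with punctuation attached is to match on word boundaries rather than on spaces. The following is only a sketch of that idea in Python, not something I have run over the volumes; `CORRECTIONS`, `PATTERN` and `correct_text` are illustrative names, and the two entries are sample errors rather than the real list.

```python
import re

# Hypothetical sample entries standing in for the full list of OCR errors.
CORRECTIONS = {"tbe": "the", "aud": "and"}

# A single alternation of every known error, anchored at word boundaries
# so that "tbe," and "(tbe)" match but "audit" does not.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(error) for error in CORRECTIONS) + r")\b"
)


def correct_text(text: str) -> str:
    """Replace every whole-word occurrence of a known error."""
    return PATTERN.sub(lambda match: CORRECTIONS[match.group(1)], text)


if __name__ == "__main__":
    print(correct_text("By tbe king, aud tbe lords; aud (tbe) commons."))
```

Two caveats: the match is case-sensitive, so a capitalised ‘Tbe’ would still slip through, and `\b` treats hyphens and apostrophes as boundaries, which may or may not be wanted for words broken across lines.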

But this is an important step forward.
