Updates, January and February 2018

Over the past two months I have taken a look at the volumes of statutes published from 1820 on, that is, with a modern typeface and without the long s that OCR software interprets in a multitude of ways.

Overall, the standard of text generated from the digitized PDFs is good to very good. Part of this may simply be because the books are not as old, and so are better printed, on better paper, and less worn and torn than the older volumes. But the typeface is certainly more amenable to being OCR’d, and the raw text is generally quite readable. The major problem is the recognition of the page layout, which with the statutes means that the side annotations get integrated into the body of the text. Certainly, I have corrected some of the lists of legislation far faster than I could for the pre-1820 texts.

Consequently, I’m considering concentrating on these volumes, although the eighteenth century is where most of my interests lie. But apart from sorting the tables, this is a decision I shall put off.

Also this month:

The usual run of automatic corrections; find improved text on GitHub.

Added tables of statutes for 1703, 1713, 1790, and 1866 to 1878. Again, find them on GitHub.

New acts: the famous 1918 Representation of the People Act, in honour of its centenary; the notorious Buggery Act of Henry VIII; the 1706 Escape from Prisons Act; and the Repeal of the South Sea Bubble Act.

Added a bibliography of volumes of statutes in the series ‘A Collection of Public General Statutes’, with links to the relevant Google Books page, for 1837 to 1869.

And finally, a blog post on ways of checking and correcting OCR’d text.

There will be a pause until after Easter, whilst work and PhD take priority. This is very much a one-person side project, without any funding, and as such has to take second (and third) place to other demands.

On automatic correction of OCR output

Although this project began because I found many historical questions led to statutory source material, it has taken a technical turn into creating reliable and useful texts of the laws. Whilst I wasn’t surprised to find that the raw OCR of the eighteenth and early nineteenth century publications was foul, I had hoped it could be knocked into reasonable shape simply by correcting obvious, predictable errors, such as the long s being interpreted as an f.

This turned out to be true to a certain extent. I’m running a fairly simple bash script that takes a list of errors and their corrections, and one by one works through each word of the OCR’d text of circa 90 volumes published before 1820, and the results are promising. The errors are much more diverse than I presumed, but are still fairly uniform. For example, the combination of long s followed by h, as in parish, is often read as lh, lii, jh, and so on.
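
As a rough illustration of this kind of word-by-word lookup, here is a minimal sketch in Python rather than bash. The correction pairs and the function name are made up for the example; they are not the project's actual list or script.

```python
# Hypothetical correction pairs: OCR misreading -> intended word.
CORRECTIONS = {
    "parilh": "parish",
    "parilii": "parish",
    "parijh": "parish",
    "fuch": "such",
}


def correct_words(text, corrections):
    """Replace each space-separated token that appears in the corrections table."""
    return " ".join(corrections.get(token, token) for token in text.split(" "))


sample = "in every parilh where fuch rates are made"
print(correct_words(sample, CORRECTIONS))
```

Because the lookup only fires on exact tokens, anything not in the table passes through untouched, which keeps the approach safe but makes the list long.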

A bigger problem is when the s interpreted as f produces another English word, such as ‘lame’ or ‘fame’ for ‘same’. For this I have used the same script to check for phrases. ‘Day’ makes sense preceded by ‘same’, so correcting nonsense phrases like ‘lame day’ and ‘fame day’ is quite safe. And as the statutes are quite formulaic, with many repeated phrases, this approach is well suited to them. Even better, the more words are corrected, the more apparent these phrases become. With the word ‘act’ corrected from its very many misreadings, one can start correcting the phrase ‘act parted’ into ‘act passed’.
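
The phrase idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the phrase table and the function name are hypothetical, and it uses regular expressions rather than the bash script described here.

```python
import re

# Hypothetical phrase-level corrections: the misread word is only
# replaced when it appears in a context where the reading is nonsense.
PHRASES = {
    "lame day": "same day",
    "fame day": "same day",
    "act parted": "act passed",
}


def correct_phrases(text, phrases):
    """Replace whole-phrase matches, allowing any whitespace between words."""
    for wrong, right in phrases.items():
        words = (re.escape(word) for word in wrong.split())
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        text = re.sub(pattern, right, text)
    return text


print(correct_phrases("on the lame day the act parted", PHRASES))
```

A standalone ‘lame’ is left alone, so a phrase such as ‘the lame king’ would survive unchanged.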

Another approach is to think in terms of parts of words. Given that the verb ‘establish’, often rendered as eftablifh, has a number of derivatives – established, establishing, disestablishment and so on – it makes sense to correct the stem of the word, rather than check for each variant.
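
A sketch of stem correction in Python, assuming a hypothetical table of misread stems. It is plain substring replacement, so each stem has to be distinctive enough not to occur inside unrelated words.

```python
# Hypothetical stem corrections: one entry covers every derivative.
STEMS = {
    "eftablifh": "establish",
}


def correct_stems(text, stems):
    """Replace a misread stem wherever it occurs, whatever ending follows it."""
    for wrong, right in stems.items():
        text = text.replace(wrong, right)
    return text


print(correct_stems("eftablifhed, eftablifhing and eftablifhment", STEMS))
```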

All to the good, but this is a big body of text. There are something like 14 million words in Pickering’s collection of the statutes alone. And that means there are going to be a lot of mistakes and, more importantly, a lot of types of mistake. The long s alone has at least three common misreadings, as f, j, and l, and even more when it is taken in conjunction with the letter that follows it.

Working out how to tackle this has been gratifyingly interesting. There are all sorts of technical ways of doing this: by looking at the texts as individual words, as stems or lemmas of words, as collections of phrases, or as strings of characters. There are also some deeper, mathematical ways of thinking about this, which would avoid having to compile a near-infinite list of possible errors while also guarding against false positives in any eighteenth-century text. For example, ‘the lame king’ is not to be found in the statutes, but no doubt turns up in some novel of the time.

It should, for example, be possible to search the statutes for every string close to, but not identical with, the phrase ‘the authority aforesaid’ and correct it, without having to produce a list of every possible variant. Such a more subtle process should be quicker than the ‘brute force’ method I am currently using.
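
One way such a search might look, sketched in Python with the standard library's difflib. This is an assumption about how it could be done, not a method the project uses: the function name is hypothetical and the 0.8 similarity threshold is an arbitrary starting point that would need tuning against real text.

```python
from difflib import SequenceMatcher


def fuzzy_correct(text, target, threshold=0.8):
    """Replace runs of words that closely resemble `target` with `target`."""
    words = text.split()
    size = len(target.split())
    result = []
    i = 0
    while i < len(words):
        window = words[i:i + size]
        candidate = " ".join(window)
        similarity = SequenceMatcher(None, candidate, target).ratio()
        if len(window) == size and candidate != target and similarity >= threshold:
            # Close to, but not identical with, the target: correct it.
            result.append(target)
            i += size
        else:
            result.append(words[i])
            i += 1
    return " ".join(result)


print(fuzzy_correct("enacted by tlie authority aforefaid that",
                    "the authority aforesaid"))
```

The trade-off is the one described above: a lower threshold catches more variants without listing them, but raises the risk of ‘correcting’ a legitimate phrase that merely looks similar.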

This is leaving aside the other causes of errors: those caused by the quality of the digitization, the quality of the printing and the markings of readers in the volumes digitized, and most problematic for this project, the mis-recognition of the layout of the pages. The convention of annotating laws with marginal notes – and these notes are not part of the statute itself – complicates the page design, and the raw OCR often integrates the comments into the main body of the text. On reflection, I should have taken more care of that when putting the books through the OCR machine, but that comes with a considerable cost in time. There may be ways of automating the detection of such errors.

Work on error correction continues, with the pleasant side effect that it is a fascinating problem, and not mere drudgery. In the meantime, I have a growing set of lists of automatic correction pairs on GitHub. These have been split into categories: place names, Latin, phrases, and English words. Depending on the text being corrected, some will be relevant and others not. Note that because of the script I am using (which I hope to publish soon), spaces in phrases and split words are escaped with a backslash, as in ‘authority\ aforesaid’.


Automatic correction of OCR

A milestone: I have begun automatically correcting the OCR errors in the 46 volumes of Danby Pickering’s ‘Statutes at Large’, and have uploaded the improved text to GitHub.

Given the quantity of text I’m dealing with – the Pickering series alone amounts to over fifteen million words – correcting each volume ‘by hand’ is obviously impractical. Bulk ‘find and replace’ is an improvement, but still not fast enough to be practical.

Such repetitive tasks are grist to the digital mill. So, using this list of common OCR errors, augmented with others I’ve found, and a one-line bash script, automatic improvement of the texts has commenced.

The results are obviously an improvement. Nevertheless, the texts still aren’t great. There are still many spelling errors. As I used spaces as separators, words with punctuation attached are uncorrected. The many problems arising from layout are still to be faced.
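
The punctuation problem can be illustrated with a short Python sketch. It is a hypothetical alternative, not the one-line bash script mentioned above: instead of splitting on spaces, it matches runs of letters, so a trailing comma or full stop no longer hides the word from the lookup. The correction pairs are invented for the example.

```python
import re

# Hypothetical correction pairs.
CORRECTIONS = {"parifh": "parish", "faid": "said", "fuch": "such"}


def correct_tokens(text, corrections):
    """Look up each run of letters, leaving punctuation and spacing untouched."""
    def fix(match):
        word = match.group(0)
        return corrections.get(word, word)

    return re.sub(r"[A-Za-z]+", fix, text)


sample = "the faid parifh, and fuch."
# Splitting on spaces misses 'parifh,' and 'fuch.' because of the punctuation.
print(" ".join(CORRECTIONS.get(t, t) for t in sample.split(" ")))
# Matching letter runs corrects them.
print(correct_tokens(sample, CORRECTIONS))
```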

But this is an important step forward.

Introducing The Statutes Project

The aim of the Statutes Project is quite simple: to put the majority of historic English legislation online in accessible, useful formats, readable by humans and machines alike, with accompanying metadata, without any financial, technical or legal obstacles to use or adaptation.

The simplicity of this statement masks the many difficulties: finding the laws, digitizing them, turning page images into clean, correct text, and so on. And doing so without having an entire life devoured by spell checking and hand correction.

The many volumes of statutes compiled through the last three centuries, coupled with mass digitization projects such as those run by Google Books and the Internet Archive, along with optical character recognition and text correction tools, do at least allow for the hope that usable, if not perfect, texts can be produced with a minimum of effort.

The focus will be on the late seventeenth, eighteenth and early nineteenth centuries, the ‘long eighteenth century’ that is central to my own historical studies. Expect a concentration on matters relating to debt and debtors; that is the subject of my PhD.

This blog is more a notebook than a full archive of legislation, although that is the long-term hope. It will cover the technical side more than the theoretical, although that won’t be absent. When there’s a sufficient corpus, quantitatively and qualitatively, there will be some preliminary attempts at analysis, little games aiming to investigate the possibilities.

Future posts will discuss the project in more detail, covering the source volumes, the software, textual analysis, dissemination, and undoubtedly the many trials and tribulations produced by a simple idea rashly executed.