Over the past two months I have taken a look at the volumes of statutes published from 1820 on, that is, with a modern typeface and without the long s that OCR software interprets in a multitude of ways.
Overall, the standard of text generated from the digitized PDFs is good to very good. Part of this may simply be because the books are not as old, and are therefore better printed, on better paper, and less worn than older volumes. But the typeface is certainly more amenable to being OCR’d, and the raw text is generally quite readable. The major problem is the recognition of the page layout, which with the statutes means that the side annotations get integrated into the body of the text. Even so, I have been able to correct some of the lists of legislation far faster than for the pre-1820 texts.
Consequently, I’m considering concentrating on these volumes, although the eighteenth century is where most of my interests lie. But apart from sorting the tables, this is a decision I shall put off.
Also this month:
The usual run of automatic corrections; find improved text on GitHub.
Added tables of statutes for 1703, 1713, 1790, and 1866 to 1878. Again, find them on GitHub.
New acts: the famous 1918 Representation of the People Act, in honour of its centenary; the notorious Buggery Act of Henry VIII; the 1706 Escape from Prisons Act; and the Repeal of the South Sea Bubble Act.
Added a bibliography of volumes of statutes in the series ‘A Collection of Public general Statutes’, with links to the relevant Google Books page, for 1837 to 1869.
And finally, a blog post on ways of checking and correcting OCR’d text.
There will be a pause until after Easter, whilst work and my PhD take priority. This is very much a one-person side project, without any funding, and as such has to take second (and third) place to other demands.