Bibliographies of Collections of British Statutes

I have now finished compiling the bibliographies of the several collections of British legislation I have used for this project. Each entry, due to the magic of Zotero, should have a link to the digitized version of the book, and each bibliography a link to the OCR’d text I am currently correcting, hosted on Github.

These bibliographies are not complete, both in that there are other collections for which I have not made lists, and in that those I have made do not list all the volumes. I have concentrated on books that are freely available online and that I have used to generate the OCR’d texts I am correcting. Given time, I may well expand this, but for the moment it provides at least one volume covering the period from Magna Carta until 1878. After that date, far fewer volumes are freely available, and so for all intents and purposes, the project stops there. But note that legislation for the twentieth and twenty-first centuries is available via Matthew Williams’s marvelous datasets.

Not every law is to be found in full in these volumes. Some are abbreviated, giving the preamble, perhaps a few clauses, and a summary. Some are omitted entirely. Very few private, personal and local acts are given. And very annoyingly, volume 43 part 2 of Pickering’s Statutes at Large is the sole missing part of that long and useful series.

All this notwithstanding, I think these bibliographies will be of great utility to anyone wanting to track down historic laws.

Go to the Index Page.

Updates, January and February 2018

Over the past two months I have taken a look at the volumes of statutes published from 1820 on, that is, with a modern typeface and without the long s that OCR software interprets in a multitude of ways.

Overall, the standard of text generated from the digitized PDFs is good to very good. Part of this may simply be due to the books not being as old, and therefore printed better, on better paper and being less worn and torn, than older volumes. But the typeface is certainly more amenable to being OCR’d, and the raw text is generally quite readable. The major problem is the recognition of the page layout, which with the statutes means that the side annotations get integrated into the body of the text. Certainly, the speed with which I have corrected some of the lists of legislation is far greater than for the pre-1820 texts.

Consequently, I’m considering concentrating on these volumes, although the eighteenth century is where most of my interests lie. But apart from sorting the tables, this is a decision I shall put off.

Also this month:

The usual run of automatic corrections; find improved text on Github.

Added tables of statutes for 1703, 1713, 1790, and 1866 to 1878. Again, find them on Github.

New acts: the famous 1918 Representation of the People act, in honour of its centenary; the notorious Buggery Act of Henry VIII, the 1706 Escape from Prisons act and the Repeal of the South Sea Bubble Act.

Added a bibliography of volumes of statutes in the series ‘A Collection of Public general Statutes’, with links to the relevant Google Books page, for 1837 to 1869.

And finally, a blog post on ways of checking and correcting OCR’d text.

There will be a pause until after Easter, whilst work and PhD take priority. This is very much a one-person side project, without any funding, and as such has to take second (and third) place to other demands.

On automatic correction of OCR output

Although this project began because I found many historical questions led to statutory source material, it has taken a technical turn into creating reliable and useful texts of the laws. Whilst I wasn’t surprised to find that the raw OCR of the eighteenth and early nineteenth century publications was foul, I had hoped it could be knocked into reasonable shape simply by correcting obvious, predictable errors, such as the long s being interpreted as an f.

This turned out to be true to a certain extent. I’m running a fairly simple bash script that takes a list of errors and their corrections, and one by one works through each word of the OCR’d text of circa 90 volumes published before 1820, and the results are promising. The errors are much more diverse than I presumed, but are still fairly uniform. For example, the combination of long s followed by h, as in parish, is often read as lh, lii, jh, and so on.
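As a rough illustration of the word-by-word approach, here is a minimal sketch in Python rather than bash. It is not the script described above, which is not reproduced here. The `CORRECTIONS` table and the `correct_words` function are hypothetical names, and the sample misreadings are built from the parish example.

```python
# Hypothetical error -> correction pairs, built from the 'parish' example.
# These are illustrative only, not the project's actual lists.
CORRECTIONS = {
    "parilh": "parish",
    "parilii": "parish",
    "parijh": "parish",
    "fuch": "such",
}


def correct_words(text, corrections):
    """Replace every space-separated token that exactly matches a known error."""
    tokens = text.split(" ")
    return " ".join(corrections.get(token, token) for token in tokens)


raw = "every fuch parilh and parijh"
print(correct_words(raw, CORRECTIONS))
```

Because tokens are compared whole, a misreading with punctuation attached (such as `parilh,`) is left alone by this sketch.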

A bigger problem is when the s interpreted as f produces another English word, such as ‘lame’ or ‘fame’ for same. For this I have used the same script to check for phrases. Day makes sense preceded by same, so correcting nonsense phrases like lame day and fame day is quite safe. And as the statutes are quite formulaic, with many repeated phrases, this approach is quite suited to them. Even better, as more words are corrected, the more these phrases are made apparent. With the word ‘act’ corrected from the very many misreadings, one can start correcting the phrase ‘act parted’ into ‘act passed.’
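The phrase-level idea can be sketched as follows. This is an assumption-laden Python illustration, not the bash script itself: `PHRASE_CORRECTIONS` and `correct_phrases` are hypothetical names, and the pairs are taken from the examples in this post.

```python
import re

# Hypothetical phrase pairs, taken from the examples in the text.
PHRASE_CORRECTIONS = {
    "lame day": "same day",
    "fame day": "same day",
    "act parted": "act passed",
}


def correct_phrases(text, phrases):
    """Replace whole phrases only, so a lone 'fame' or 'lame' is left untouched."""
    for wrong, right in phrases.items():
        # \b anchors the match at word boundaries on both sides.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text


raw = "on the lame day the fame of the act parted"
print(correct_phrases(raw, PHRASE_CORRECTIONS))
```

Here ‘fame of’ survives because only the two-word phrase is matched, which is the safety the phrase approach is meant to provide.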

Another approach is to think in terms of parts of words. Given that the verb ‘establish’, often rendered as eftablifh, has a number of derivatives – established, establishing, disestablishment and so on – it makes sense to correct the stem of the word, rather than check for each variant.
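A minimal sketch of stem correction, in Python and under the assumption that a plain substring replacement is acceptable for the text in question; `STEM_CORRECTIONS` and `correct_stems` are hypothetical names.

```python
# Hypothetical stem pair from the 'establish' example in the text.
STEM_CORRECTIONS = {"eftablifh": "establish"}


def correct_stems(text, stems):
    """Replace a misread stem wherever it occurs inside a longer word."""
    for wrong, right in stems.items():
        text = text.replace(wrong, right)
    return text


print(correct_stems("eftablifhed and eftablifhing", STEM_CORRECTIONS))
print(correct_stems("difeftablifhment", STEM_CORRECTIONS))
```

One stem entry covers ‘established’ and ‘establishing’ at once. The second call shows the limit of the idea: the stem is repaired, but a misread prefix such as ‘dif’ for ‘dis’ still needs its own rule.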

All to the good, but this is a big body of text. There’s something like 14 million words in Pickering’s collection of the statutes alone. And that means there’s going to be a lot of mistakes, and more importantly, a lot of types of mistakes. The long s alone has at least 3 types of common misreading, as f, j, and l, and even more when it gets taken in conjunction with its following letter.

Working out how to tackle this has been gratifyingly interesting. There are all sorts of technical ways of doing this: by looking at the texts as individual words, as stems or lemmas of words, as a collection of phrases, or as strings of characters. There are also some deeper, mathematical ways of thinking about this, which would avoid having to compile a near-infinite list of possible errors, each of which must not produce false positives in any eighteenth century text. For example, the lame king is not to be found in the statutes, but no doubt turns up in some novel of the time.

It should, for example, be possible to search the statutes for every string close to, but not identical with, the phrase ‘the authority aforesaid’ and correct it, without having to produce a list of every possible variant. Such a more subtle process should be quicker than the ‘brute force’ method I am currently using.
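One way this might be approached is approximate string matching. The sketch below uses `SequenceMatcher` from Python's standard `difflib` module to compare each run of three words against the target phrase. The function name `fuzzy_correct` and the threshold of 0.85 are assumptions chosen for illustration, not tested values, and this is not a method the project currently uses.

```python
from difflib import SequenceMatcher


def fuzzy_correct(text, target, threshold=0.85):
    """Replace any run of words close to, but not identical with, the target."""
    target_words = target.split()
    n = len(target_words)
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        window = " ".join(words[i:i + n])
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they differ.
        ratio = SequenceMatcher(None, window, target).ratio()
        full_window = len(words) - i >= n
        if full_window and window != target and ratio >= threshold:
            out.extend(target_words)
            i += n
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


raw = "by tbe authority aforefaid be it enacted"
print(fuzzy_correct(raw, "the authority aforesaid"))
```

No list of variants is needed, but the threshold decides everything: set too low, it will rewrite phrases that were correct all along. The sketch also ignores punctuation attached to words, which a real run would have to handle.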

This is leaving aside the other causes of errors: those caused by the quality of the digitization, the quality of the printing and the markings of readers in the volumes digitized, and most problematic for this project, the mis-recognition of the layout of the pages. The convention of annotating laws with marginal notes – and these notes are not part of the statute itself – complicates the page design, and the raw OCR often integrates the comments into the main body of the text. On reflection, I should have taken more care of that when putting the books through the OCR machine, but that comes with a considerable cost in time. There may be ways of automating the detection of such errors.

Work on error correction continues, with the pleasant collateral that it is a fascinating problem, and not mere drudgery. In the meantime, I have a growing set of lists of automatic correction pairs on Github. These have been split into certain categories: place names, Latin, phrases, as well as English words. Depending on the text being corrected, some will be relevant and others not. Note that because of the script I am using (which I hope to publish soon), spaces in phrases and split words are escaped with a backslash, as in ‘authority\ aforesaid’.
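How an escaped entry might be read back is sketched below in Python. The layout assumed here, one error and one correction per line separated by an unescaped space, is a guess for the sake of illustration; the actual format of the lists is not described in this post. `parse_pair` is a hypothetical helper.

```python
import re


def parse_pair(line):
    """Split 'error correction' on unescaped whitespace, then unescape spaces.

    Assumes two fields per line; that layout is an assumption, not a
    documented format.
    """
    # (?<!\\) means: split only on whitespace NOT preceded by a backslash.
    fields = re.split(r"(?<!\\)\s+", line.strip())
    if len(fields) != 2:
        raise ValueError("expected exactly two fields: " + repr(line))
    error, correction = (field.replace("\\ ", " ") for field in fields)
    return error, correction


print(parse_pair(r"authority\ aforefaid authority\ aforesaid"))
print(parse_pair("fuch such"))
```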

 

Updates, November and December 2017

Work carried out in the last two months of 2017:

Transcribed two missing pages, listing statutes of the reign of Edward the Third, of volume 2 of Pickering’s Statutes at Large.

Started removing Latin and French text from the earlier volumes of Pickering’s Statutes. English translations are given in the books, and the foreign versions serve to complicate the OCR correction process.

Github reorganization: I have split the Butterworths volumes into two groups on the basis of whether they used the long s or not.

Tables of statutes added to Github: just one, for 1756.

Many individual statutes added to this site.

Uploaded the first item to a new folder of miscellaneous items, namely a collection of statutes relating to Kingston Upon Hull.

And of course, automatic correction of many common mis-transcriptions in the Pickering, Ruffhead and Butterworths ‘long-s’ volumes.

To do in 2018, given I have other pressing commitments: to concentrate on the legislation of the parliament of Great Britain, from union with Scotland (1707) to union with Ireland (1800); to produce a full set of tables for these years; and to experiment with visualizing these tables.

Updates, August and September 2017

The last two months have seen: continuing automated correction of the OCR-generated text of Pickering’s Statutes At Large, and some of the Butterworths-published volumes (1807 to 1819, in other words those using the ‘long s’). The bash script I have written for this is improving, and I hope to release it soon on Github (under a free license of course).

A side effect of hunting down erroneous OCR is the production of lists of such mistranscriptions. I have started to put those on Github; used with the forthcoming script this will constitute an easy way of improving raw OCR of eighteenth century books.

I have started a page collecting volumes of historic American state legislation, mainly colonial, but with some post-revolutionary laws.

SSL has been enabled for the site, courtesy of a free certificate via my hosts Evohosting and Let’s Encrypt! I will be making all URLs secure by default at some point in the future; this should not break any pages you have bookmarked. Until then, simply starting any of them with ‘https://’ will call up the secure version.

New laws added to the site, including: the 1807 Abolition of Slavery Act; from 1740, encouragement of mariners; and Hogarth’s act for protecting copyright in engravings of 1735.

There will now be a hiatus until November, whilst I concentrate upon writing my PhD thesis.

May, June and July 2017 updates

Work on the Statutes Project in the last three months has mainly consisted of running bash scripts on the OCR of Pickering’s Statutes At Large, to correct the more obvious errors. It’s slowly getting into reasonable shape. You can find the latest plain text on the Statutes Github repository.

More tables of acts have been added: they now run from 1716 to 1736, with some others up to 1760. The aim is to have a complete set covering the reigns of Georges one and two by autumn. Then some text mining can begin. Again, find them on Github.

Various individual acts have been added to this website, including the Licensing Act of 1737, the Irish Dependency Act, its repeal and a clarification of the repeal. Plus the Equal Franchise Act of 1928, to accompany the recent election and its surrounding ballyhoo.

Also added is the 1661 Tumultuous Petitioning Act, taken from the Ruffhead edition of the Statutes at Large, as it does not appear in the Danby Pickering series I have been concentrating on. Consequently, it looks a little different, as the two versions have different standards and protocols.

On the agenda for the next couple of months: more automatic OCR correction, and more tables.

 

 

 

For the word “man” there shall be substituted the word “person”

It is election day today here in the United Kingdom, with the attendant exhortations to vote.

Among these exhortations was a tweet from the UK Parliamentary Archives:

Which infuriated me. Firstly, there isn’t a link to the document, even though it is on their website. Secondly, because the digitization on their website is, as I put it in an intemperate tweet, ‘Badly digitised, low resolution, illegible, incomplete. Insulting.’

Let’s expand on this. It is badly digitized, probably just a photograph rather than a proper scan. It’s a low resolution image, so running it through an OCR reader produced far worse text than one would expect from a twentieth century document. It’s not just software that can’t read it; it’s difficult for a human to read as well. And finally, it is incomplete, reproducing only the first three pages of text.

And all this means it is insulting. It’s showing off a possession, not actually sharing or allowing others to read and understand it. This is made worse given the subject: a historic piece of legislation, that finally gave women the vote on the same terms as men, is being used as a boast, reducing the long struggle for this essential right to a scrap of property you can glimpse but not enjoy.

Furthermore, it reduces democracy to one, electoral, dimension. Democracy is not just about voting. It is also about checks and balances, the separation of powers, and the rule of law. The blasé maxim ‘Ignorance of the law is no excuse’ has to be matched with a commitment that the law be easily available to all. That includes laws like this one, that although repealed have established fundamental principles that survive to this day.

Consequently, I’ve rushed to the British Library to transcribe the act, and published the full text.

Note that I have not found this act in the commercial law archives Hein Online or Lexis Nexis. It is available via Justis, but behind a paywall. On a happier note, hunting for this text has led me to Matthew Williams’s fantastic plain text archive of U.K. legislation, 1900-2015. I will be writing more about this amazing resource when I’ve dug deeper into it.

Automatic correction of OCR

A milestone: I have begun automatically correcting the OCR errors in the 46 volumes of Danby Pickering’s ‘Statutes At Large’, and have uploaded the improved text to Github.

Given the quantity of text I’m dealing with – the Pickering series alone amounts to over fifteen million words – correcting each volume ‘by hand’ is obviously impractical. Bulk ‘find and replace’ is an improvement, but still not fast enough to be practical.

Such repetitive tasks are grist to the digital mill. So, using this list of common OCR errors, augmented with others I’ve found, and a one-line bash script, automatic improvement of the texts has commenced.

The results are obviously an improvement. Nevertheless, the texts still aren’t great. There are still many spelling errors. As I used spaces as separators, words with punctuation attached are uncorrected. The many problems arising from layout are still to be faced.
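The punctuation problem can be illustrated with a small Python sketch (not the one-line bash script itself). Matching runs of letters with a regular expression, instead of splitting on spaces, lets a trailing comma or full stop stay in place while the word is corrected. The `CORRECTIONS` pairs and the function name are hypothetical.

```python
import re

# Hypothetical error -> correction pairs, for illustration only.
CORRECTIONS = {"fuch": "such", "aforefaid": "aforesaid"}


def correct_ignoring_punctuation(text, corrections):
    """Look up each run of letters on its own, leaving punctuation untouched."""

    def fix(match):
        word = match.group(0)
        return corrections.get(word, word)

    return re.sub(r"[A-Za-z]+", fix, text)


raw = "fuch perfons, as aforefaid."
print(correct_ignoring_punctuation(raw, CORRECTIONS))
```

With space-separated tokens, ‘aforefaid.’ would not match the entry ‘aforefaid’ and would be skipped; here it is corrected and the full stop is kept. Words absent from the list, such as ‘perfons’, are still left as they are.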

But this is an important step forward.

March and April 2017 Updates

Work on the Statutes Project in March and April 2017:

0: Numerous corrections to Pickering’s series of Statutes at Large. Latest versions to be found, as ever, on Github.

1: More tables of statutes uploaded to Github. Currently, there are tables for public acts 1716 to 1736, with just 1721 missing. This I’ll upload shortly.

2: More legislation collected, to the point that the menus are getting unwieldy and I’ll have to do some reorganizing. Acts added include:
The Murder Act of 1751, giving the corpses of the hanged to the surgeons (and occasioning many a riot).
The Regency Act 1729, allowing the Queen to govern whilst George the Second went off to Hanover.
The Septennial Act, extending the life of a parliament to seven years. A quite undemocratic act, had there been any meaningful suffrage.

On the to do list for May 2017: due to the demands of my PhD, I’ll be working on the insolvent debtor relief acts from 1649 to 1813 over the next month; consequently, those texts will be corrected and added.

February 2017 Updates

Work on the Statutes Project in February 2017:

0: Numerous corrections to the OCR of the Pickering and Ruffhead editions of the Statutes At Large, uploaded to Github. Still a long way from readable, but getting there.

1: A new series OCR’d, or at least half a series. The Statutes of the Realm was the most academic, comprehensive and careful collection of acts, the text generally taken from the statute rolls themselves. Consequently, it is a typographical nightmare, and the OCR is worse than for the – admittedly less reliable – series of the Statutes At Large. I have put on Github the text for two volumes (numbers 3 and 5) found on Google, and, thanks to the University of Southampton waiving their No Derivatives license, the text for volumes 6 to 11 from the British Parliamentary Publications set, digitized by Soton, on archive.org.

2: I have also started extracting the tables of acts from the OCR’d volumes, and uploading them to Github. The idea is to create a reliable list of legislation enacted, with the long title of each act. Given the length of titles, this will constitute a corpus of sufficient size for text mining and distant reading (I hope). It also constitutes the first step in creating metadata for this project.

3: Laws collected from around the web:

1536 27 Henry 8 c.19: An act limiting an order for Sanctuaries and Sanctuary persons.

The 1918 Parliament (Qualification of Women) Act: Allowing women to sit in Parliament, and the shortest statute at a mere 27 words, when preamble and short title clause are put aside.

4: And also: a short post on James I’s laws on sanctuary, over at my Alsatia blog.

Planned for March: More acts collected from round the web, and more tables of statutes.