Building upon Google Books.

Some months ago I finished compiling a Chronological Bibliography of British and U.K. statutes – volumes of statutes organized by regnal year or years. This is an easier way of locating (British) laws than via the other bibliographies I’ve compiled. Each link is to an openly accessible, public domain book, the majority digitized by Google, and hosted on either Google Books or the Internet Archive. In the course of searching for these I’ve been able to extend the coverage up to 1920, 10 & 11 George 5. In a very few cases, this is because I had overlooked volumes; but mostly, it is because I raised an issue via the Google Books Inquiry form.

Through this, I was able to request that the full content of out-of-copyright volumes available in ‘snippet view’ be made available. And in the vast majority of cases – just one refusal, and one request unresolved – the full text has been made available, and promptly so.

Without Google Books, and the similar Internet Archive, this project, based on nearly 200 volumes of British statutes, would not be possible. It would be just too difficult and time-consuming for a single person to approach and negotiate with however many organisations and libraries, obtain hundreds of books and digitize them, before getting to the stage I am at, of correcting the OCR’d text. This vast, free-to-access library of out-of-copyright and out-of-print volumes can be a foundation on which to build all sorts of historical resources, investigations and analyses.

Against this, of course, is a whole series of problems relating to how Google Books was conceived and run: as an industrial process, on a huge scale, producing a vast reservoir of data, aiming simply to get enough right, the maximum return from the smallest possible investment. This is Google Books’ literal ‘darker’ side: precarious and poorly paid workers, frequently women, frequently black.

A direct consequence of this labour-intensive, high-tempo factory system is the poor curation. There’s the notoriously poor metadata – a veritable train wreck – attached to the books; the hideous OCR, although there has been some automated correction of it; the many poor scans, distorted and obscured; the worn, worn-out books indiscriminately put through the production line.

Even worse than all these specific flaws is just how opaque the library is as a whole. There seems to be no way to comprehend it as an archive, no way to know what is in it, no way to extract subsets of books or their metadata. Even something as simple as listing all the titles in their archive for a year or decade isn’t possible. Given that search is Google’s forte, this obscurity has to be deliberate; the public-facing library is fundamentally a side product of a big (linguistic) data haul, a negotiation with the libraries that provide the books, and a swerve round the publishers that hold copyrights. (And I wonder if the release of a list of the half million titles recently added to Google from the British Library has been contractually forbidden, perhaps under clause 4.7, restricting automated access. It’s impossible that there isn’t such a manifest, and one has been released for the Microsoft-digitized volumes held by the B.L. Of course it is possible the B.L. just doesn’t want to release it.) By contrast, the Internet Archive goes to great lengths to allow deep searches and bulk downloads of their holdings. That they take in Google’s scanned books frees those books from these obstacles.

The limitations and restrictions of Google Books may well dissuade the building of projects upon it. Really, it is just a large repository of page images in PDFs without much support. But if one accepts its limitations and expects no more, it is still useful. Projects like this one can curate a subset of interrelated documents within certain parameters. Even if there is considerable work to be done, a significant part has been done. And it is better that the creation of historical archives is made by historians than corporations.



The post Peterloo ‘Six Acts’

2019 is the bicentenary of the Peterloo massacre, when a pro-reform demonstration in Manchester was attacked by Yeomanry and Hussars, resulting in as many as 18 protestors being killed and up to 700 more injured. (Figures are disputed: these are taken from the Peterloo Massacre website.)

If the historical event itself is well-known, the ramifications and repercussions are perhaps less so. It became a national event, with pamphlets recounting the bloodshed and condemning the government widely circulated, protests and demonstrations in support of the victims held nationwide, and reports of imminent uprising sent from all over to the Home Office.

(For an interesting way of presenting the fallout, see the Peterloo 1819 news Twitter account; a remarkable and comprehensive tracking of events.)

In response, the Government passed a series of laws – the ‘Six Acts’, as they became known – at the end of 1819, a legislative program against the democratic movement. These statutes firstly strengthened the state’s local presence by giving exceptional powers to the Justices of the Peace. The J.P.s could act in neighbouring jurisdictions, issue warrants to raid houses and oblige public meetings to be authorized. Legal procedure was quickened, to the detriment of the accused. Rights of assembly and organization were limited, public meeting and military drilling alike (giving the state a monopoly over the latter).

The last two statutes dealt with publications: the Seditious Libel Act permitting the seizure of works critical of state and church and punishing repeat offenders with banishment and transportation, and the Stamp Duties Act taxing printed works, to make them too expensive for their plebeian and proletarian audience.

Notwithstanding the few concessions wrung out of the government by the Whig opposition, these acts offer both an anatomy of, and a program against, the radical movement. They treat it as geographically diffuse, present all over the country, so local authorities are given powers to oppose it. Each locale is a point for gathering people together to communicate with each other, so meetings and ‘military’ associating are repressed. The locales are connected with each other, made national, through the medium of print, so publications are taxed and seized.

The acts also describe the government of the day: as fundamentally repressive and based in the final instance on brute military force, the violence of which provoked the subsequent movement.

Although the drilling provisions lasted longest in law (until 2008), and time limits were set on the seizure and meetings acts, the tax on print was the most repressive measure. However, it led to the ‘War of the Unstamped’, the refusal of publishers and vendors to pay the duty, and their willingness to go to prison for their pains. The stamp on newspapers was lowered to a penny in 1836, then abolished in 1855, thirty-six years after its passing.

If the Peterloo massacre can be fixed to a time and place, its consequences, of which these statutes are just a few, were very directly felt for years afterwards.

60 George 3 & 1 George 4 c.1: The Unlawful Drilling Act

60 George 3 & 1 George 4 c.2: The Seizure of Arms Act

60 George 3 & 1 George 4 c.4: The Misdemeanours Act

60 George 3 & 1 George 4 c.6: The Seditious Meetings Act

60 George 3 & 1 George 4 c.8: Blasphemous and Seditious Libels

60 George 3 & 1 George 4 c.9: The Newspaper and Stamp Duties Act

Standardizing Statutes

I have just added the 1689 act ‘Absence of King William’ to the statutes text section.

I took the text from Wikisource, which in turn transcribed it from the Statutes of the Realm collection, volume 6. It is also available from British History Online, which has transcribed three volumes of that series.

Statutes of the Realm is the most complete collection of pre-Union legislation available; it was commissioned to collect all the laws up to the union with Scotland, without regard to whether an act was in force or not. The act is not included in either Pickering’s or Ruffhead’s ‘Statutes At Large’ series, presumably because it had long since expired at the time those were published, and those collections were more pragmatically focused.

The text I’ve posted is different from the other transcriptions, in that I have standardized it. The Statutes of the Realm sought fidelity to the original manuscripts, reconciling the originals and the inrolled copies, noting their differences, omissions, and discrepancies, and strictly following original spellings. This makes for difficult, interrupted reading for humans; similarly, it is an obstacle to ‘distant reading’, that is, the digital analysis of large volumes of text.

Consequently, with the help of a simple line of code and a short, hand compiled list of obsolete spellings, the version I publish is readable both for people and machines.

All the changes to the text are quite minor: replacing antiquated and inconsistent spellings with regular, modern ones, often just removing a superfluous last letter (Regal for Regall, public for publick, etc.). The list of standardization couples is available on GitHub. It’s short, just 52 pairs, but it’s a start. I haven’t uploaded a script to utilise them yet, mainly because just one line is adequate:

while read n k; do sed -i.bak "s/\b$n\b/$k/g" target/*.txt; done < word-standardization-couples.txt

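This should produce corrected versions of the texts in the folder called target (insert your own path). Note that sed rewrites the *.txt.bak backup on every pass of the loop, so by the end each backup holds the text as it stood before the final substitution rather than the untouched original; keep a separate copy of the originals before running it.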

Note this has been tested on Lubuntu 18.04 and Mac OS High Sierra; other operating systems are available.
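
The same correction can also be sketched as a single pass. What follows is a minimal sketch rather than a tested script, assuming GNU sed (for the \b word boundary), a list of space-separated pairs, one per line, and words free of characters that sed treats specially; the function name standardize is my own invention. It collects every substitution into one sed script and runs sed once, so each *.txt.bak is written only once and holds the untouched original.

```shell
#!/bin/sh
# Sketch: apply every spelling pair in a single sed run, so that each
# .bak file keeps the original, unmodified text.
# Assumes GNU sed (for \b) and a list of "old new" pairs, one per line.

standardize() {
    pairs=$1    # e.g. word-standardization-couples.txt
    shift       # remaining arguments are the text files to correct

    script=$(mktemp) || return 1

    # Turn each "old new" pair into one sed substitution command.
    # The "|| [ -n ... ]" also catches a last line without a newline.
    while read -r old new || [ -n "$old" ]; do
        [ -n "$old" ] || continue           # skip blank lines
        printf 's/\\b%s\\b/%s/g\n' "$old" "$new"
    done < "$pairs" > "$script"

    # One invocation: the originals are saved once as *.bak.
    sed -i.bak -f "$script" "$@"
    status=$?

    rm -f "$script"
    return "$status"
}

# Usage (paths are placeholders):
# standardize word-standardization-couples.txt target/*.txt
```

Running sed once per file rather than once per pair is also quicker on a large folder, though with only 52 pairs the difference is small.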

There is obviously a great deal more to say about manipulating texts in this way, covering matters ethical, academic, technical, and typographical. For the moment I leave all that aside, but it is worth noting these issues.

A Chronological Bibliography

Following an exchange on Twitter with the Victorian Commons project, I have rejigged part of my first listing of volumes of statutes, and published a chronological bibliography of nineteenth-century law.

This will make it easier to locate the texts of laws in the editions held by Google Books and the Internet Archive, as long as you know the correct calendar and regnal years for an act.
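
Matching the two kinds of year is simple arithmetic: the first regnal year begins on the day of accession, so regnal year N begins on the anniversary that falls in the calendar year of accession plus N minus 1. A minimal sketch, with a function name and monarch keys of my own choosing, and covering only the reigns within this bibliography:

```shell
#!/bin/sh
# Sketch: calendar year in which a given regnal year begins.
# Covers only the monarchs of the 1806-1920 volumes; the function
# name and the monarch keys are illustrative, not an established tool.

regnal_start_year() {
    monarch=$1
    year=$2
    case $monarch in
        george3)  accession=1760 ;;   # acceded 25 October 1760
        george4)  accession=1820 ;;   # 29 January 1820
        william4) accession=1830 ;;   # 26 June 1830
        victoria) accession=1837 ;;   # 20 June 1837
        edward7)  accession=1901 ;;   # 22 January 1901
        george5)  accession=1910 ;;   # 6 May 1910
        *) echo "unknown monarch: $monarch" >&2; return 1 ;;
    esac
    # Regnal year 1 opens in the accession year, year 2 a year later, etc.
    echo $((accession + year - 1))
}
```

For example, regnal_start_year george3 60 prints 1819, and regnal_start_year george5 10 prints 1919. As a parliamentary session often straddles two regnal years, hence citations such as 10 & 11 George 5, the result gives only the year in which the regnal year opens, not the date of any particular act.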

At the moment, this bibliography covers the years 1806 to 1908, but many later nineteenth century volumes are missing. These will be added as they are located, and when I have time.


Statutes in the Parliament.UK Digital Archive

I have recently found a new digital archive of English, British and U.K. statutes, at the Parliament.UK website.

It appears to have around 1,200 items of legislation, some of which are professionally photographed manuscripts, and some of which are PDFs. The vast majority are of local acts; there are only 56 (at the time of writing) public statutes available. The reproductions of the rolls and manuscripts are of high quality, and hosted externally on a system called ‘CollectionsBase.’ There is a download button in the bottom left-hand corner, which, with the ‘Gallery’ view (top right corner) allows all the pages of a document to be downloaded in .jpg format.

Unfortunately, the system for items hosted on their own site is less usable. I have not found a single PDF file with the extension .pdf, even though the links to these documents describe them as PDFs and carry that extension. This can cause problems with displaying the document, whether through the browser or using a desktop app, and creates work for the user in that every PDF downloaded needs to be renamed. Many local acts have the pseudo extension .local, though I have also found .South, .Western, and .Clydebank. I presume these are due to the use of multiple full stops in the file names; the processing software seems to have truncated the name at the first of them.
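
Renaming every download by hand soon becomes tedious. As a minimal sketch, assuming the downloads sit together in one folder and that genuine PDFs begin with the usual %PDF- signature, the function below (fix_pdf_names is a name of my own) appends .pdf to any such file that lacks it. It leaves the pseudo extension in place, so nothing of the downloaded name is lost.

```shell
#!/bin/sh
# Sketch: append .pdf to downloaded files whose first bytes show that
# they really are PDFs. The function name is hypothetical.

fix_pdf_names() {
    for f in "$@"; do
        [ -f "$f" ] || continue                 # regular files only
        case $f in *.pdf) continue ;; esac      # already named correctly
        [ -e "$f.pdf" ] && continue             # never overwrite a file

        # A PDF file begins with the five-byte signature "%PDF-".
        if [ "$(head -c 5 "$f" 2>/dev/null)" = "%PDF-" ]; then
            mv -- "$f" "$f.pdf"
        fi
    done
}

# Usage (the folder is a placeholder):
# fix_pdf_names downloads/*
```

Checking the signature rather than trusting the name means that anything in the folder that is not a PDF is left alone.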

Furthermore, it is difficult to navigate the catalogue other than with the search function. This means that it is difficult to know what is generally available, such as which enclosure acts are there, how many there are, and what proportion they constitute of the total legislation passed.

However, there are ways of finding all the public and private acts using the search function. These links are on the site, but I had difficulty finding them:

Find all digitized public acts.

Find all digitized private acts.

In total, right now there are over 5,000 digitized documents. Find them all here.

Witchcraft Acts

Prompted in the first place by Hallowe’en, and then getting interested in the subject, I have put up the texts of the major statutes concerning witchcraft in the British Isles. For England, Great Britain, and the United Kingdom, these are:

1541-2: 33 Henry 8 c.8: The Act against Conjurations, Witchcraft, Sorcery and Enchantments

1563: 5 Elizabeth 1 c.16: An Act against Conjurations, Inchantments and Witchcraft

1580-1: 23 Elizabeth 1 c.2: Against seditious words and rumours (This is included because it has clauses on prophesying the Queen’s life span.)

1604: 1 James 1 c.12: An Act against Witchcraft

1735: 9 George 2 c.5: The Witchcraft Act

1821: 1 & 2 George 4 c.17: Repeal of the Irish Witchcraft Act

1951: 14 & 15 George 6 c. 33: Fraudulent Mediums Act

Also, I’ve added two acts from Ireland, and one from Scotland, from the legislatures previous to their respective acts of union. For Ireland, 1586: 28 Elizabeth 1 c. 2: An Act against Witchcraft and Sorcerie, repealed by the 1821 act above, and 10 Charles 1 s.2 c. 19: An act for the trial of murders, &c., as it mentions murders through bewitchment. And for Scotland, 1563: Mary c.73: Anentis Witchcraft.

Updates, October to December 2018.

Work on the Statutes project for the last three months of 2018:

The big news is that I now have a complete set of volumes of statutes for the nineteenth century, courtesy of the Institute of Historical Research allowing me to photograph their copies. The OCR’d text, messy but undergoing correction, can be found on Github.

There is also now a complete set of tables of public acts for the Parliament of the United Kingdom of Great Britain and Ireland, 1801 to 1921. Again, find them on Github.

Laws added: the utterances of an oaf required the addition of the Statute of Praemunire; Hallowe’en led me to add some witchcraft acts from 1541, 1563, and 1604, and Bonfire Night was marked with James I’s diktat for the Observance of November the 5th. Topical stuff, eh?

Also added: 1739 County Rates Act and 1838 Public Records Act.

A new section has been created for private, local and personal acts; the first text in it is the Lancashire Sessions Act of 1798.

And the usual round of automated OCR corrections.


Digitization of the missing late c19th volumes

Although there are many digitized collections of statutes available online, and indeed many digitizations of the same publication, I have not found a number of volumes from the last two decades of the nineteenth century.

Happily, I have now been able to digitize these volumes myself, courtesy of the Institute of Historical Research, who very kindly allowed me to photograph their copies.

I copied them using an iPhone and a selfie stick designed by Sussex University Humanities Lab. Although SHL are developing a whole workflow for DIY scanning and OCRing documents through a modern smartphone, I simply took pictures, and later ran them through ABBYY FineReader, as I have been doing with the digital volumes downloaded from Google Books and Internet Archive.

The whole procedure took a full work day, which I think quite quick given the size and number of the volumes; once I got into the rhythm, and with the apparatus holding firm, I averaged about one volume an hour, photographing two pages at a time.

The text of these volumes can be found on GitHub; some automated correcting has been carried out, but it is still all pretty raw, especially the tables. No doubt there will be pages I have inadvertently photographed twice, photographed poorly, or accidentally omitted, but by and large I think the quality is as good as can be expected. As with all the other volumes I have OCR’d, the text is public domain.

Once again, my thanks to the IHR for access to their books and a desk at which to copy them, and to Sussex Humanities Lab for the selfie sticks. Without such help, ‘unofficial’, grassroots, lone scholar projects such as this one would not be able to develop their potential.

Tables of Statutes of the United Kingdom, 1801 to 1921.

I have now completed tables of the full, long titles of public statutes passed by the parliament of the United Kingdom of Great Britain and Ireland, from the Act of Union in 1801 up to 1921, when Ireland was divided and the south achieved independence. They can be found on github.  All these tables are public domain, and can be reused for any purpose and in any way one wishes.

I am currently working on generating tables of abbreviated titles of private and local acts for this period, using the annotated lists of local acts and private acts produced by

This will be quicker than working through the full titles in the volumes of statutes for this period, although at the cost of less detail. (Tables giving full titles will be produced eventually as I work on correcting the OCR of the scanned volumes, but this will take some time.)

Once the private and local tables have been created, I will produce a more convenient package of these lists, easy to download and suitable for searching and text mining.

Updates: August and September 2018

Work on the Statutes Project done over the last two months:

A blog post: on a satirical law against make-up and adornments, sometimes taken as real, that I’ve dated back to 1785.

New tables: There is now a complete run of tables of public acts spanning 1807 to 1912, hosted on Github.

New acts added: three on the preservation of historical monuments from 1882, 1892 and 1900; the Corruption of Blood Act, 1814; and the Transportation Act, 1718.

And the usual round of automatic corrections to the OCR’d text of the collections of statutes. Whilst still very messy, the text is readable for those volumes in a modern font, and approaching readability for those in old, ‘long-s’ typefaces. Find them on Github.