Category Archives: Digitization

Building upon Google Books.

Some months ago I finished compiling a Chronological Bibliography of British and U.K. statutes – volumes of statutes organized by regnal year or years. This is an easier way of locating (British) laws than via the other bibliographies I’ve compiled. Each link is to an openly accessible, public domain book, the majority digitized by Google, and hosted on either Google Books or the Internet Archive. In the course of searching for these I’ve been able to extend the coverage up to 1920, 10 & 11 George 5. In a very few cases, this is because I had overlooked volumes; but mostly, it is because I raised an issue via the Google Books Inquiry form.

Through this, I was able to request that the full content of out-of-copyright volumes available in ‘snippet view’ be made available. And in the vast majority of cases – just one refusal, and one request unresolved – the full text has been made available, and promptly so.

Without Google Books, and the similar Internet Archive, this project, based on nearly 200 volumes of British statutes, would not be possible. It would be just too difficult and time-consuming for a single person to approach and negotiate with however many organisations and libraries, obtain hundreds of books and digitize them, before getting to the stage I am at, of correcting the OCR’d text. This vast, free-to-access library of out-of-copyright and out-of-print volumes can be a foundation on which to build all sorts of historical resources, investigations and analyses.

Against this, of course, is a whole series of problems relating to how Google Books was conceived and run: as an industrial process, on a huge scale, producing a vast reservoir of data, aiming simply to get enough right, the maximum return from the smallest possible investment. This is Google Books’ literal ‘darker’ side: precarious and poorly paid workers, frequently women, frequently black.

A direct consequence of this labour-intensive, high-tempo factory system is the poor curation. There’s the notoriously poor metadata – a veritable train wreck – attached to the books; the hideous OCR, although there has been some automated correction of it; the many poor scans, distorted and obscured; the worn, worn-out books indiscriminately put through the production line.

Even worse than all these specific flaws is just how opaque the library is as a whole. There seems to be no way to comprehend it as an archive, no way to know what is in it, no way to extract subsets of books or their metadata. Even something as simple as listing all the titles in their archive for a year or decade isn’t possible. Given that search is Google’s forte, this obscurity has to be deliberate; the public-facing library is fundamentally a side product of a big (linguistic) data haul, a negotiation with the libraries that provide the books, and a swerve round the publishers that hold copyrights. (And I wonder if the absence of a list of the half million titles recently added to Google from the British Library is because its release has been contractually forbidden, perhaps under clause 4.7, restricting automated access. It’s impossible that there isn’t such a manifest, and one has been released for the Microsoft-digitized volumes held by the B.L. Of course it is possible the B.L. just doesn’t want to release it.) By contrast, the Internet Archive goes to great lengths to allow deep searches and bulk downloads of their holdings. That they take in Google’s scanned books frees them from these obstacles.

The limitations and restrictions of Google Books may well dissuade the building of projects upon it. Really, it is just a large repository of page images in PDFs without much support. But if one accepts its limitations and expects no more, it is still useful. Projects like this one can curate a subset of interrelated documents within certain parameters. Even if there is considerable work to be done, a significant part has been done. And it is better that the creation of historical archives is made by historians than by corporations.

Update, 8 October 2021: StoryTracer has published a step-by-step guide to requesting that Google Books release public domain books.


Standardizing Statutes

I have just added the 1689 act ‘Absence of King William’ to the statutes text section.

I took the text from Wikisource, which in turn transcribed it from the Statutes of the Realm collection, volume 6. It is also available from British History Online, which has transcribed three volumes of that series.

Statutes of the Realm is the most complete collection of pre-Union legislation available; it was commissioned to collect all the laws up to the union with Scotland, without regard to whether an act was in force or not. The act is not included in either Pickering’s or Ruffhead’s ‘Statutes At Large’ series, presumably because it had long since expired at the time those were published, and those collections were more pragmatically focused.

The text I’ve posted is different from the other transcriptions, in that I have standardized it. The Statutes of the Realm sought fidelity to the original manuscripts, reconciling the originals and the inrolled copies, noting their differences, omissions, and discrepancies, and strictly following original spellings. This makes for difficult, interrupted reading for humans; similarly, it is an obstacle to ‘distant reading’, that is, the digital analysis of large volumes of text.

Consequently, with the help of a simple line of code and a short, hand-compiled list of obsolete spellings, the version I publish is readable both for people and machines.

All the changes to the text are quite minor: replacing antiquated and inconsistent spellings with regular, modern ones, often just removing a superfluous last letter (Regal for Regall, public for publick, etc.). The list of standardization couples is available on GitHub. It’s short, just 52 pairs, but it’s a start. I haven’t uploaded a script to utilise them yet, mainly because just one line is adequate:

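sed -i.bak "$(while read n k; do printf 's/\\b%s\\b/%s/g\n' "$n" "$k"; done < word-standardization-couples.txt)" target/*.txt

This builds a single sed script from the list and applies it in one pass, producing corrected versions of the texts in the folder called target (insert your own path), with the originals kept as *.txt.bak. The single pass matters: running sed -i.bak once per pair would overwrite the backup every time, so the .bak files would end up holding the last-but-one version rather than the originals.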

Note this has been tested on Lubuntu 18.04 and Mac OS High Sierra; other operating systems are available.

There is obviously a great deal more to say about manipulating texts in this way, covering matters ethical, academic, technical, and typographical. For the moment I leave all that aside, but it is worth noting these issues.

On automatic correction of OCR output

Although this project began because I found many historical questions led to statutory source material, it has taken a technical turn into creating reliable and useful texts of the laws. Whilst I wasn’t surprised to find that the raw OCR of the eighteenth and early nineteenth century publications was foul, I had hoped it could be knocked into reasonable shape simply by correcting obvious, predictable errors, such as the long s being interpreted as an f.

This turned out to be true to a certain extent. I’m running a fairly simple bash script that takes a list of errors and their corrections, and one by one works through each word of the OCR’d text of circa 90 volumes published before 1820, and the results are promising. The errors are much more diverse than I presumed, but are still fairly uniform. For example, the combination of long s followed by h, as in parish, is often read as lh, lii, jh, and so on.
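As a rough illustration of this list-driven approach (not the script I am running, which is not yet published), the sketch below turns each error/correction pair into a whole-word sed substitution and applies them all to standard input. The sample pairs are built from the misreadings of parish mentioned above; the function names are made up for the example, and \b for word boundaries assumes GNU sed.

```shell
# Hypothetical sample list: one "misreading correction" pair per line.
pairs='parilh parish
parilii parish
parijh parish'

# Turn each pair into a whole-word substitution (\b is a GNU sed extension).
make_sed_script() {
  while read -r wrong right; do
    printf 's/\\b%s\\b/%s/g\n' "$wrong" "$right"
  done
}

# Apply every substitution to standard input in a single pass.
correct_words() {
  sed "$(printf '%s\n' "$pairs" | make_sed_script)"
}

printf 'the parilh church and the parijh clerk\n' | correct_words
```

Because the match is on whole words, a longer form such as parilhes is left alone and would need its own entry in the list.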

A bigger problem is when the s interpreted as f produces another English word, such as ‘lame’ or ‘fame’ for same. For this I have used the same script to check for phrases. Day makes sense preceded by same, so correcting nonsense phrases like lame day and fame day is quite safe. And as the statutes are quite formulaic, with many repeated phrases, this approach is quite suited to them. Even better, as more words are corrected, the more these phrases are made apparent. With the word ‘act’ corrected from the very many misreadings, one can start correcting the phrase ‘act parted’ into ‘act passed.’
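A minimal sketch of the phrase idea, using only the two examples given here: the ambiguous word is corrected only when its neighbour confirms the reading, so a legitimate ‘fame’ elsewhere is left untouched. The function name is invented for the example, and \b again assumes GNU sed.

```shell
# Correct a misread word only when the neighbouring word confirms the reading.
correct_phrases() {
  sed -E \
    -e 's/\b[lf]ame day\b/same day/g' \
    -e 's/\bact parted\b/act passed/g'
}

printf 'on the lame day the act parted; his fame grew\n' | correct_phrases
```

One limitation of working line by line like this: a phrase that is split across a line break in the OCR output will not be caught.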

Another approach is to think in terms of parts of words. Given that the verb ‘establish’, often rendered as eftablifh, has a number of derivatives – established, establishing, disestablishment and so on – it makes sense to correct the stem of the word, rather than check for each variant.
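Sketched in the same way, stem correction simply drops the word boundaries, so one rule covers every derivative. The two stems below are examples only (the second assumes parish misread with an f), and the function name is made up.

```shell
# No \b here: the stem is corrected wherever it occurs inside a longer word.
correct_stems() {
  sed \
    -e 's/eftablifh/establish/g' \
    -e 's/parifh/parish/g'
}

printf 'eftablifhed by law, for eftablifhing the parifhes\n' | correct_stems
```

The trade-off is that a stem rule matches anywhere, so the stem has to be long or odd enough that it cannot occur inside a correctly spelled word.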

All to the good, but this is a big body of text. There’s something like 14 million words in Pickering’s collection of the statutes alone. And that means there’s going to be a lot of mistakes, and more importantly, a lot of types of mistakes. The long s alone has at least 3 types of common misreading, as f, j, and l, and even more when it gets taken in conjunction with its following letter.

Working out how to tackle this has been gratifyingly interesting. There’s all sorts of technical ways of doing this: by looking at the texts as individual words, as stems or lemmas of words, as a collection of phrases, as strings of characters. There’s also some deeper, mathematical ways of thinking about this, which would remove the need to compile a near-infinite list of possible errors that do not run afoul of false positives for any eighteenth century text. For example, the lame king is not to be found in the statutes, but no doubt turns up in some novel of the time.

It should, for example, be possible to search the statutes for every string close to, but not identical with, the phrase ‘the authority aforesaid’ and correct it, without having to produce a list of every possible variant. Such a more subtle process should be quicker than the ‘brute force’ method I am currently using.
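One way this might look, offered as a sketch rather than something I have run over the statutes: slide a window the length of the phrase across each line, measure the edit (Levenshtein) distance between the window and the target, and replace anything that is close but not identical. The function name fuzzy_fix and the threshold of 4 edits are assumptions made for the example; the right threshold would need testing against false positives. Note that it also collapses runs of spaces.

```shell
# fuzzy_fix TARGET MAX: replace any run of words within MAX edits of TARGET.
fuzzy_fix() {
  awk -v target="$1" -v max="$2" '
    function min3(a, b, c) { if (b < a) a = b; if (c < a) a = c; return a }

    # Levenshtein distance between s and t, kept as two rows of the table.
    function lev(s, t,    i, j, m, n, cost, prev, cur) {
      m = length(s); n = length(t)
      for (j = 0; j <= n; j++) prev[j] = j
      for (i = 1; i <= m; i++) {
        cur[0] = i
        for (j = 1; j <= n; j++) {
          cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
          cur[j] = min3(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        for (j = 0; j <= n; j++) prev[j] = cur[j]
      }
      return prev[n]
    }

    BEGIN { k = split(target, unused, " ") }   # k = words in the target phrase

    {
      n = split($0, w, " ")
      out = ""; i = 1
      while (i <= n) {
        piece = w[i]; step = 1
        if (i + k - 1 <= n) {
          cand = w[i]
          for (j = 1; j < k; j++) cand = cand " " w[i + j]
          # Set trailing punctuation aside so it survives the replacement.
          tail = ""
          if (match(cand, /[.,;:]+$/)) {
            tail = substr(cand, RSTART)
            cand = substr(cand, 1, RSTART - 1)
          }
          d = lev(cand, target)
          if (d > 0 && d <= max) { piece = target tail; step = k }
        }
        out = (i == 1) ? piece : out " " piece
        i += step
      }
      print out
    }'
}

printf 'enacted by tbe autliority aforefaid, that\n' |
  fuzzy_fix 'the authority aforesaid' 4
```

No list of variants is needed, only the correct phrase and a tolerance. Whether this is actually quicker than the brute force method would have to be measured: the distance is computed for every window of every line.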

This is leaving aside the other causes of errors: those caused by the quality of the digitization, the quality of the printing and the markings of readers in the volumes digitized, and most problematic for this project, the mis-recognition of the layout of the pages. The convention of annotating laws with marginal notes – and these notes are not part of the statute itself – complicates the page design, and the raw OCR often integrates the comments into the main body of the text. On reflection, I should have taken more care of that when putting the books through the OCR machine, but that comes with a considerable cost in time. There may be ways of automating the detection of such errors.

Work on error correction continues, with the pleasant collateral that it is a fascinating problem, and not mere drudgery. In the meantime, I have a growing set of lists of automatic correction pairs on GitHub. These have been split into certain categories: place names, Latin, phrases, as well as English words. Depending on the text being corrected, some will be relevant and others not. Note that because of the script I am using (which I hope to publish soon), spaces in phrases and split words are escaped with a backslash, as in ‘authority\ aforesaid’.
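The backslash convention fits the way the shell’s read builtin splits a line into fields. A small sketch, assuming a list is read with a plain while read loop (the sample pair and the function name are made up for illustration):

```shell
# One pair per line; spaces inside a phrase are escaped with a backslash.
# (Made-up sample pair, in the format described above.)
list='autliority\ aforefaid authority\ aforesaid'

# Without -r, read treats "\ " as a literal space rather than a field
# separator, so each phrase arrives whole in its own variable.
show_pairs() {
  while read wrong right; do
    printf '[%s] -> [%s]\n' "$wrong" "$right"
  done
}

printf '%s\n' "$list" | show_pairs
```

This prints the two phrases in brackets, each with its internal space restored. Adding -r to read would switch that handling off and split the line at every space, which is why the lists are only safe to use with scripts that read them this way.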


For the word “man” there shall be substituted the word “person”

It is election day today here in the United Kingdom, with the attendant exhortations to vote.

Among these exhortations was a tweet from the UK Parliamentary Archives:

Which infuriated me. Firstly, there isn’t a link to the document, even though it is on their website. Secondly, the digitization on their website is, as I put it in an intemperate tweet, ‘Badly digitised, low resolution, illegible, incomplete. Insulting.’

Let’s expand on this. It is badly digitized, probably just a photograph rather than a proper scan. It’s a low resolution image, so running it through an OCR reader produced far worse text than one would expect from a twentieth century document. It’s not just software that can’t read it; it’s difficult for a human to read as well. And finally, it is incomplete, reproducing only the first three pages of text.

And all this means it is insulting. It’s showing off a possession, not actually sharing or allowing others to read and understand it. This is made worse given the subject: a historic piece of legislation, that finally gave women the vote on the same terms as men, is being used as a boast, reducing the long struggle for this essential right to a scrap of property you can glimpse but not enjoy.

Furthermore, it reduces democracy to one, electoral, dimension. Democracy is not just about voting. It is also about checks and balances, the separation of powers, and the rule of law. The blasé maxim ‘Ignorance of the law is no excuse’ has to be matched with a commitment that the law be easily available to all. That includes laws like this one, that although repealed have established fundamental principles that survive to this day.

Consequently, I’ve rushed to the British Library to transcribe the act, and published the full text.

Note that I have not found this act in the commercial law archives Hein Online or Lexis Nexis. It is available via Justis, but behind a paywall. On a happier note, hunting for this text has led me to Matthew William’s fantastic plain text archive of U.K. legislation, 1900-2015. I will be writing more about this amazing resource when I’ve dug deeper into it.