Category Archives: Archives

Building upon Google Books.

Some months ago I finished compiling a Chronological Bibliography of British and U.K. statutes – volumes of statutes organized by regnal year or years. This is an easier way of locating (British) laws than via the other bibliographies I’ve compiled. Each link is to an openly accessible, public domain book, the majority digitized by Google, and hosted on either Google Books or the Internet Archive. In the course of searching for these I’ve been able to extend the coverage up to 1920, 10 & 11 George 5. In a very few cases, this is because I had overlooked volumes; but mostly, it is because I raised an issue via the Google Books Inquiry form.

Through this, I was able to request that the full content of out-of-copyright volumes available in ‘snippet view’ be made available. And in the vast majority of cases – just one refusal, and one request unresolved – the full text has been made available, and promptly so.

Without Google Books, and the similar Internet Archive, this project, based on nearly 200 volumes of British statutes, would not be possible. It would be just too difficult and time-consuming for a single person to approach and negotiate with however many organisations and libraries, and to obtain hundreds of books and digitize them, before getting to the stage I am at, of correcting the OCR’d text. This vast, free-to-access library of out-of-copyright and out-of-print volumes can be a foundation on which to build all sorts of historical resources, investigations and analyses.

Against this, of course, is a whole series of problems relating to how Google Books was conceived and run: as an industrial process, on a huge scale, producing a vast reservoir of data, aiming simply to get enough right, the maximum return from the smallest possible investment. This is Google Books’ literal ‘darker’ side: precarious and poorly paid workers, frequently women, frequently black.

A direct consequence of this labour-intensive, high-tempo factory system is the poor curation. There’s the notoriously poor metadata – a veritable train wreck – attached to the books; the hideous OCR, although there has been some automated correction of it; the many poor scans, distorted and obscured; the worn, worn-out books indiscriminately put through the production line.

Even worse than all these specific flaws is just how opaque the library is as a whole. There seems to be no way to comprehend it as an archive, no way to know what is in it, no way to extract subsets of books or their metadata. Even something as simple as listing all the titles in the archive for a year or decade isn’t possible. Given that search is Google’s forte, this obscurity has to be deliberate; the public-facing library is fundamentally a side product of a big (linguistic) data haul, a negotiation with the libraries that provide the books, and a swerve round the publishers that hold copyrights. (And I wonder if releasing a list of the half million titles recently added to Google from the British Library has been contractually forbidden, perhaps under clause 4.7, restricting automated access. It’s impossible that there isn’t such a manifest, and one has been released for the Microsoft-digitized volumes held by the B.L. Of course it is possible the B.L. just doesn’t want to release it.) By contrast, the Internet Archive goes to great lengths to allow deep searches and bulk downloads of its holdings. That it takes in Google’s scanned books frees those books from these obstacles.

The limitations and restrictions of Google Books may well dissuade people from building projects upon it. Really, it is just a large repository of page images in PDFs without much support. But if one accepts its limitations and expects no more, it is still useful. Projects like this one can curate a subset of interrelated documents within certain parameters. Even if there is considerable work to be done, a significant part has been done. And it is better that historical archives are created by historians than by corporations.



Statutes in the Parliament.UK Digital Archive

I have recently found a new digital archive of English, British and U.K. statutes, on the Parliament.UK website.

It appears to have around 1,200 items of legislation, some of which are professionally photographed manuscripts, and some of which are PDFs. The vast majority are local acts; there are only 56 (at the time of writing) public statutes available. The reproductions of the rolls and manuscripts are of high quality, and hosted externally on a system called ‘CollectionsBase.’ There is a download button in the bottom left-hand corner, which, together with the ‘Gallery’ view (top right corner), allows all the pages of a document to be downloaded in .jpg format.

Unfortunately, the system for items hosted on their own site is less usable. I have not found a single PDF file with the extension .pdf, even though the links to these documents claim them to be PDFs and carry that extension. This can cause problems with displaying the document, whether through the browser or using a desktop app, and creates work for the user in that every PDF downloaded needs to be renamed. Many local acts have the pseudo extension .local, though I have also found .South, .Western, and .Clydebank. I presume these are due to the use of multiple full stops in the file names; the processing software seems to have truncated the name at the first of them.
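The renaming chore described above can be automated. What follows is only a minimal sketch in Python, not anything provided by the archive: it assumes the downloads sit together in one folder, and it identifies PDFs by looking for the standard ‘%PDF-’ signature near the start of the file rather than trusting the name. The function names are my own, and the folder name in the usage note is hypothetical.

```python
from pathlib import Path

# Signature that begins a PDF file; checked in the first 1024 bytes
# because some files carry a little leading junk before it.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path: Path) -> bool:
    """Return True if the PDF signature appears near the start of the file."""
    with path.open("rb") as fh:
        return PDF_MAGIC in fh.read(1024)


def fix_extensions(folder: Path) -> list[Path]:
    """Append '.pdf' to every PDF in `folder` whose name lacks that extension.

    The pseudo extension (.local, .South, ...) is kept as part of the name,
    so no information from the original file name is lost.
    """
    renamed = []
    for path in sorted(folder.iterdir()):
        # Skip sub-folders and files that are already named correctly.
        if not path.is_file() or path.suffix.lower() == ".pdf":
            continue
        # Leave anything that is not actually a PDF untouched.
        if not looks_like_pdf(path):
            continue
        target = path.with_name(path.name + ".pdf")
        # Never overwrite an existing file.
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling fix_extensions(Path("downloads")) on a folder of downloaded acts would, under these assumptions, turn a file ending in .local into one ending in .local.pdf and leave everything else alone. Appending rather than replacing the pseudo extension is a deliberate choice, since the truncated fragment may be the only clue to the original title.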

Furthermore, it is difficult to navigate the catalogue other than with the search function. This means that it is difficult to know what is generally available, such as how many enclosure acts there are, and what proportion they constitute of the total legislation passed.

However, there are ways of finding all the public and private acts using the search function. These links are on the site, but I had difficulty finding them.

Find all digitized public acts.

Find all digitized private acts.

In total, right now there are over 5,000 digitized documents. Find them all here.