LORLS | Extracting Harvard citations from Word documents

Over the years, one question that we’ve had pop up occasionally from academics and library staff is whether we could import reading lists from existing Microsoft Word documents. Many academics have produced course handouts including reading material in Word format and some still do, even though we’ve had a web-based reading list system at Loughborough for over a decade now, and a VLE for a roughly similar period.

We’ve always had to say no in the past, because Microsoft Word’s proprietary binary format was very difficult to process (especially on the non-Microsoft platforms we use to host our systems) and we had other, more important development tasks. Also we thought that extracting the variety of citation/bibliography formats that different academics use would be a nightmare.

However, with the new LUMP-based LORLS now well bedded in at Loughborough and Microsoft basing the document format of newer versions of Word on XML, we thought we’d revisit the idea and spend a bit of time to see what we could do.

Microsoft Office Word 2007 was introduced as part of the Office 2007 suite using a default file format based on XML, called Office Open XML Format, or OpenXML for short. A Word 2007 document is really a compressed ZIP archive containing a directory structure populated with a set of XML documents conforming to Microsoft’s published XML schemas, as well as any media files required for the documents (images, movies, etc). Most academics are now using versions of Microsoft Word that generate files in this format, which can be identified easily by looking for the “.docx” filename extension.

The XML documents inside the zipped .docx archive contain the text of the document, styling information and properties about the document (i.e. who created it and when). There’s actually quite a lot of structural information stored as well, which Microsoft explain how to process in order to work out how different parts of the document are related to each other. Some of this is rather complex, but for a simple “proof of concept” what we needed was the actual document text structure. By default this lives in a file called “word/document.xml” inside the ZIP archive.
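As a rough illustration of the unpacking step, here is a minimal sketch in Python (not the Perl used for LORLS) that pulls “word/document.xml” out of a .docx archive with the standard zipfile module. The helper names make_sample_docx and read_document_xml are made up for this example, and the sample archive is a stand-in built in memory that holds only the one file, so Word itself would not open it.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_sample_docx(paragraphs):
    """Build a tiny in-memory ZIP that mimics a .docx (hypothetical helper).

    Only word/document.xml is written, which is enough to exercise the
    unpacking step but is not a complete Word document.
    """
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(escape(text))
        for text in paragraphs
    )
    xml = '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(
        W_NS, body
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


def read_document_xml(docx_bytes):
    """Return the main document part of a .docx archive as a string."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


sample = make_sample_docx(["Reading list", "Week 1"])
document_xml = read_document_xml(sample)
```

For a real file on disk, zipfile.ZipFile can be given the path to the .docx directly instead of an in-memory buffer.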

The document.xml file contains an XML element called <w:body></w:body> that encapsulates the actual document text. Individual paragraphs are then themselves held in <w:p></w:p> elements and these are then further broken down based on styling applied, whether there are embedded hyperlinks in the paragraph, etc, etc. Looking through a few sample reading lists in .docx format gave us a good feel for what sort of structures we’d find. Processing the .docx OpenXML using Perl would be possible using the Archive::Any module to unpack the ZIP archive and then the XML::Simple module to process the XML data held within into Perl data structures.
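To make that structure concrete, the sketch below (again in Python rather than Perl, using xml.etree.ElementTree in place of XML::Simple) walks the paragraphs and their styled runs and records whether each run is italic or underlined. The sample XML is hand-written for illustration, and the function name paragraph_runs is an assumption of this example. It treats the mere presence of an italic or underline element as "on", which is a simplification of how Word toggles run properties.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by WordprocessingML elements such as w:p and w:r.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hand-written sample: one paragraph split into three runs, the middle
# one underlined (as a Word document might store a book title).
SAMPLE_XML = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">WILLS, H., (1985), </w:t></w:r>
      <w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>Pillboxes: A Study Of U.K. Defences 1940</w:t></w:r>
      <w:r><w:t>, Leo Cooper, London.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_runs(document_xml):
    """Return one list of runs per paragraph, each run with its styling flags."""
    root = ET.fromstring(document_xml)
    body = root.find(W + "body")
    paragraphs = []
    for paragraph in body.iter(W + "p"):
        runs = []
        # iter() also finds runs nested inside hyperlink elements.
        for run in paragraph.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            props = run.find(W + "rPr")
            # Simplification: presence of the element counts as "on".
            italic = props is not None and props.find(W + "i") is not None
            underline = props is not None and props.find(W + "u") is not None
            runs.append({"text": text, "italic": italic, "underline": underline})
        paragraphs.append(runs)
    return paragraphs


paragraphs = paragraph_runs(SAMPLE_XML)
plain_text = "".join(run["text"] for run in paragraphs[0])
```

Joining the run text gives the paragraph as a reader would see it, while the flags keep the styling hints that help separate a title from the rest of a citation.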

The next issue was how do we find the citations held inside the Word documents and turn them into Structural Units in LORLS? We decided to aim to import Harvard style citations and this is where we hit the first major problem: not everyone seems to agree on what a Harvard style bibliographic reference should look like. For example some Harvard referencing texts say that author names in books should be capitalised, publication dates should follow in brackets and titles underlined like this:

WILLS, H., (1985), Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

whereas other sources don’t say anything about author capitalisation or surname/firstname/initial ordering but want the title in italics, and no brackets round the publication date:

Henry Wills, 1985, Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

When you start to look at real lists of citations from academics it becomes clear that many aren’t even consistent within a single document, which makes things even more tricky. Some of the differences may be down to simple mistakes, but others may be due to cutting and pasting between other documents with similar, but not quite the same, Harvard citation styles.

The end result of this is that we need to do a lot of pattern matching and also accept that we aren’t going to get a 100% hit rate (at least not straight away!). Luckily the LORLS back end is written in Perl and that is a language just dripping with pattern matching abilities – especially its powerful regular expression processor. So for our proof of concept we took some representative OpenXML format Word .docx files from one of our academic departments and then used them to refine a set of regular expressions designed to extract Harvard-esque citations from each paragraph, trying to work out what type of work each one is (book, article, etc) based on the ordering of parts of the citation and the use of italics and/or underlining.
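To give a flavour of this kind of pattern matching, here is a deliberately simple sketch in Python. It is not one of the regular expressions used in LORLS: the pattern, the function name parse_book_citation and the returned field names are all invented for this example. It only handles a book-shaped citation (authors, year with or without brackets, title, publisher, place) and works on plain text, so it splits the tail on commas instead of using the italics and underlining hints described above.

```python
import re

# Hypothetical pattern: authors, then a year (optionally bracketed),
# then everything else up to an optional final full stop.
CITATION = re.compile(
    r"""
    ^\s*
    (?P<authors>.+?)                          # shortest text before the year
    ,?\s*\(?                                  # optional comma and opening bracket
    \b(?P<year>(?:1[5-9]|20)\d{2})[a-z]?\b    # years 1500-2099, optional letter suffix
    \)?[,.]?\s+                               # optional closing bracket and separator
    (?P<rest>.+?)                             # title, publisher and place
    \.?\s*$
    """,
    re.VERBOSE,
)


def parse_book_citation(paragraph):
    """Return a dict of citation parts, or None if the text does not fit."""
    match = CITATION.match(paragraph)
    if match is None:
        return None
    # Assumes the last two comma-separated parts are publisher and place;
    # a title containing ", " is handled, but unusual orderings are not.
    parts = match.group("rest").rsplit(", ", 2)
    if len(parts) != 3:
        return None
    title, publisher, place = parts
    return {
        "type": "book",
        "authors": match.group("authors"),
        "year": match.group("year"),
        "title": title,
        "publisher": publisher,
        "place": place,
    }


bracketed = parse_book_citation(
    "WILLS, H., (1985), Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London."
)
unbracketed = parse_book_citation(
    "WILLS, H., 1985, Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London."
)
not_a_citation = parse_book_citation("Week 1: Introduction to coastal defences")
```

Because the authors group is matched lazily, the first plausible year wins, so the “1940” inside the title is not mistaken for the publication date. A single pattern like this would miss plenty of real-world variations, which is why a set of patterns refined against real documents is needed.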

The initial proof of concept was a command line script that was given the name of a .docx document and would then spit out a stream of document text interspersed with extracted citations. Once we’d got the regular expressions tweaked to the point where our test set of documents were generating 80-90% hit rates, we took this code and turned it into a CGI script that could then be used as an API call to extract citations from an uploaded document and return a list of potential hits in JSONP format.

One thing to note about file uploading to an API is that browser security models don’t necessarily make this terribly easy – a simple HTML FORM file upload will try to refresh the page rather than do an AJAX style client-server transaction. The trick, as used by a wide variety of web sites now, is either to use a hidden IFRAME as the target of the file upload output or to do the upload with some XHR scripting. Luckily packages such as jQuery now come with support for this, so we don’t need to do too much heavy lifting to make it work from the Javascript front end client. Using JSONP makes this a bit easier as well, as it’s much easier to handle the JSON format in Javascript than if we returned XML.

The JSONP results returned from our OpenXML processing CGI script provide structures containing the details extracted from each work. This script does NOT actually generate LUMP Structural Units; instead we just return what type we think each work is (book, etc) and then some extracted citation information relevant to that type of work.
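As an illustration of what such a response could look like, the sketch below wraps a list of extracted works in a JSONP callback. It is written in Python rather than as the Perl CGI script, and the field names, the callback name handleCitations and the function to_jsonp are assumptions made for this example, not the actual LORLS API.

```python
import json
import re

# Only allow callback names that look like (dotted) Javascript identifiers,
# so that the caller cannot inject arbitrary script into the response.
SAFE_CALLBACK = re.compile(
    r"[A-Za-z_$][A-Za-z0-9_$]*(?:\.[A-Za-z_$][A-Za-z0-9_$]*)*"
)


def to_jsonp(callback, citations):
    """Serialise the extracted works as a JSONP response body."""
    if not SAFE_CALLBACK.fullmatch(callback):
        raise ValueError("unsafe JSONP callback name")
    return "{}({});".format(callback, json.dumps(citations))


# Hypothetical result for a single matched book.
works = [
    {
        "type": "book",
        "authors": "WILLS, H.",
        "year": "1985",
        "title": "Pillboxes: A Study Of U.K. Defences 1940",
        "publisher": "Leo Cooper",
        "place": "London",
    }
]
response_body = to_jsonp("handleCitations", works)
```

The client then receives a call to its own function with the list of candidate works as the argument, and can show them to the user before creating anything in a reading list.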

The reasons for not inserting Structural Units immediately are threefold. Firstly, because we can’t be sure we’ve got the pattern matching working 100%, it makes sense to let the user see what we think the matched citations are and select the ones they want to import into LORLS. Secondly, we already have API calls to allow Structural Units to be created and edited, so we probably shouldn’t reinvent the wheel here – the client already knows how to talk to those to make new works appear in reading lists. Lastly, by not actually dealing with LUMP Structural Units, we’ve got a more general purpose CGI script – it could be used in systems that have nothing to do with LORLS or LUMP, for example.

So that’s the current state of play. We’re gathering together more Word documents from academics with Harvard style bibliographies in them so that we can test our regular expressions against them, and Jason will be looking at integrating our test Javascript .docx uploader into CLUMP. Hopefully the result will be something the academics and library staff will be happy to see at last!