Posts tagged Importing citations

The continuing battle of parsing Word documents for reading list material

At this year’s Meeting The Reading List Challenge (MTLRC) workshop, Gary Brewerton (my boss) showed the delegates one of our LORLS features: the ability to suck citation data out of Word .docx documents.  We’ve had this for a few years and it is intended to allow academics to take existing reading lists that they have produced in Word and import them relatively easily into our electronic reading lists system. The nice front end was written by my colleague Jason Cooper, but I was responsible for the underlying guts that the APIs call to try parsing the Word document and turn it into structured data that LORLS can understand and use. We wrote it originally based on suggestions from a few academics who already had reading lists with Harvard style references in them, and they used it to quickly populate LORLS with their data.

Shortly after the MTRLC workshop, Gary met with some other academics who also needed to import existing reading lists into LORLS.  He showed them our existing importer and, whilst it worked, it left quite alot entries as “notes”, meaning it couldn’t parse them into structured data.  Gary then asked me to take another look at the backend code and see if I could improve its recognition rate.

I had a set of “test” lists donated by the academics of varying lengths, all from the same department.  With the existing code, in some cases less than 50% of the items in these documents were recognised and classified correctly.  Of those, some were misclassified (eg book chapters appearing as books).

The existing LORLS .docx import code used Perl regular expression pattern matching alone to try to work out what sort of work a citation referred to this.  This worked OK with Word documents where the citations were well formed.  A brief glance through the new lists showed that lots of the citations were not well formed.  Indeed the citation style and layout seemed to vary from item to item, probably because they had been collected over a period of years by a variety of academics.  Whilst I could modify some of the pattern matches to help recognise some of the more obvious cases, it was clear that the code was going to need something extra.

That extra turned out to be Z39.50 look ups.  We realised that the initial pattern matches could be quite generalised to see if we could extract out authors, titles, publishers and dates, and then use those to do a Z39.50 look up.  Lots of the citations also had classmarks attached, so we’d got a good clue that many of the works did exist in the library catalogue.  This initial pattern match was still done using regular expressions and needed quite a lot of tweaking to get recognition accuracy up.  For example spotting publishers separated from titles can be “interesting”, especially if the title isn’t delimited properly.  We can spot some common cases, such as publishers located in London, New York or in US states with two letter abbreviations.  It isn’t fool proof, but its better than nothing.

However using this left us with only around 5% of the entries in the documents classified as unstructured notes when visual checking indicated that they were probably citations. These remaining items are currently left as notes, as there are a number of reasons why the code can’t parse them.

The number one reason is that they don’t have enough punctuation and/or formatting in the citation to allow regular expression to determine which parts are which. In some cases the layout doesn’t even really bear much relation to any formal Harvard style – the order of authors and titles can some time switch round and in some cases it isn’t clear where the title finishes and the publisher ends. In a way they’re a good example to students of the sort of thing they shouldn’t have in their own referencing!

The same problems encountered using the regular expressions would happen with a formal parser as these entries are effectively “syntax errors” if treating Harvard citations as a sort of grammar.  This is probably about the best we’ll be able to do for a while, at least until we’ve got some deep AI that can actually read the text and understand it, rather that just scan for patterns.

And if we reach that point LORLS will probably become self aware… now there’s a scary thought!

Extracting Harvard citations from Word documents

Over the years, one question that we’ve had pop up occasionally from academics and library staff is whether we could import reading lists from existing Microsoft Word documents. Many academics have produced course handouts including reading material in Word format and some still do, even though we’ve had a web based reading list system at Loughborough for over a decade now, and a VLE for a roughly similar period.

We’ve always had to say no in the past, because Microsoft Word’s proprietary binary format was very difficult to process (especially on the non-Microsoft platforms we use to host our systems) and we had other, more important development tasks.  Also we thought that extracting the variety of citation/bibliography formats that different academics use would be a nightmare.

However with the new LUMP based LORLS now well bedded in at Loughborough and Microsoft basing the document format of newer versions of Word on XML, we thought we’d revisit the idea and spend a bit of time to see what we could do.

Microsoft Office Word 2007 was introduced as part of the Office 2007 suite using a default file format based on XML, called Office Open XML Format, or OpenXML for short.  A Word 2007 document is really a compressed ZIP archive containing a directory structure populated with a set of XML documents conforming to Microsoft’s published XML schemas, as well as any media files required for the documents (images, movies, etc).  Most academics are now using versions of Microsoft Word that generate files in this format, which can be identified easily by looking for the “.docx” filename extension.

The XML documents inside the ZIPed .docx archive contain both the text of the document, styling information and properties about the document (ie who created it and when).  There’s actually quite a lot of structural information stored as well, which Microsoft explain how to process in order to work out how different parts of the document are related to each other.  Some of this is rather complex, but for a simple “proof of concept” what we needed was the actual document text structure.  By default this lives in a file called “word/document.xml” inside the ZIP archive.

The document.xml file contains an XML element called <w:body></w:body> that encapsulates the actual document text.  Individual paragraphs are then themselves held in <w:p></w:p> elements and these are then further broken down based on styling applied, whether there are embedded hyperlinks in the paragraph, etc, etc.  Looking through a few sample reading lists in .docx format gave us a good feel for what sort of structures we’d find.  Processing the .docx OpenXML using Perl would be possible using the Archive::Any module to unpack the ZIP archive and then the XML::Simple module to process the XML data held within into Perl data structures.

The next issue was how do we find the citations held inside the Word documents and turn them into Structural Units in LORLS?  We decided to aim to import Harvard style citations and this is where we hit the first major problem: not everyone seems to agree on what a Harvard style bibliographic reference should look like.  For example some Harvard referencing texts say that author names in books should be capitalised, publication dates should follow in brackets and titles underlined like this:

WILLS, H., (1985), Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

whereas other sources don’t say anything about author capitalisation or surname/firstname/initial ordering but want the title in italics, and no brackets round the publication date:

Henry Willis, 1985, Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

When you start to look at real lists of citations from academics it becomes clear that many aren’t even consistent within a single document, which makes things even more tricky.  Some of the differences may be down to simple mistakes, but others may be due to cutting and pasting between other documents with similar, but not quite the same, Harvard citation styles.

The end result of this is that we need to do a lot of pattern matching and also accept that we aren’t going to get a 100% hit rate (at least not straight away!).  Luckily the LORLS back end is written in Perl and that is a language just dripping with pattern matching abilities – especially its powerful regular expression processor. So for our proof of concept we took some representative OpenXML format Word .docx files from one of our academic departments and then used them to refine a set of regular expressions designed to extract Harvard-esque citations in paragraph, trying to work out what type of work it is (book, article, etc) based on the ordering of parts of the citation and the use of italics and/or underlining.

The initial proof of concept was a command line script that was given the name of a .docx document and would then spit out a stream of document text interspersed with extracted citations.  Once we’d got the regular expressions tweaked to the point where our test set of documents were generating 80-90% hit rates, we took this code and turned it into a CGI script that could then be used as an API call to extract citations from an uploaded document and return a list of potential hits in JSONP format.

One thing to note about file uploading to an API is that browser security models don’t necessarily make this terribly easy – a simple HTML FORM  file upload will try to refresh the page rather than do an AJAX style client-server transaction.  The trick, as used by a wide variety of web sites now, is to use a hidden IFRAME as the target of the file upload output or some XHR scripting.  Luckily packages such as jQuery now come with support for this, so we don’t need to do too much heavy lifting to make it work from the Javascript front end client.  Using JSONP makes this a bit easier as well, as its much easier to handle the JSON format in Javascript that if we returned XML.

The JSONP results returned from our OpenXML processing CGI script provide structures containing the details extracted from each work.  This script does NOT actually generate LUMP Structural Units; instead we just return what type we think each work is (book, etc) and then some extracted citation information relevant to that type of work.

The reasons for not inserting Structural Units immediately are three fold.  Firstly because we can’t be sure we’ve got the pattern matching working 100% it makes sense to allow the user to see what we think the matched citations are and allow them to select the ones they want to import into LORLS.  Secondly, we already have API calls to allow Structural Units to be created and edited, so we probably shouldn’t reinvent the wheel here – the client already knows how to talk to those to make new works appear in reading lists.  Lastly by not actually dealing with LUMP Structural Units, we’ve got a more general purpose CGI script – it could be used in systems that have nothing to with LORLS or LUMP for example.

So that’s the current state of play.  We’re gathering together more Word documents from academics with Harvard style bibliographies in them so that we can test our regular expressions against them, and Jason will be looking at integrating our test Javascript .docx uploader into CLUMP.  Hopefully the result will be something the academics and library staff will be happy to see at last!


Go to Top