Extracting Harvard citations from Word documents

Over the years, one question that we’ve had pop up occasionally from academics and library staff is whether we could import reading lists from existing Microsoft Word documents. Many academics have produced course handouts including reading material in Word format and some still do, even though we’ve had a web based reading list system at Loughborough for over a decade now, and a VLE for a roughly similar period.

We’ve always had to say no in the past, because Microsoft Word’s proprietary binary format was very difficult to process (especially on the non-Microsoft platforms we use to host our systems) and we had other, more important development tasks.  Also we thought that extracting the variety of citation/bibliography formats that different academics use would be a nightmare.

However with the new LUMP based LORLS now well bedded in at Loughborough and Microsoft basing the document format of newer versions of Word on XML, we thought we’d revisit the idea and spend a bit of time to see what we could do.

Microsoft Office Word 2007 was introduced as part of the Office 2007 suite using a default file format based on XML, called Office Open XML Format, or OpenXML for short.  A Word 2007 document is really a compressed ZIP archive containing a directory structure populated with a set of XML documents conforming to Microsoft’s published XML schemas, as well as any media files required for the documents (images, movies, etc).  Most academics are now using versions of Microsoft Word that generate files in this format, which can be identified easily by looking for the “.docx” filename extension.

The XML documents inside the ZIPed .docx archive contain both the text of the document, styling information and properties about the document (ie who created it and when).  There’s actually quite a lot of structural information stored as well, which Microsoft explain how to process in order to work out how different parts of the document are related to each other.  Some of this is rather complex, but for a simple “proof of concept” what we needed was the actual document text structure.  By default this lives in a file called “word/document.xml” inside the ZIP archive.

The document.xml file contains an XML element called <w:body></w:body> that encapsulates the actual document text.  Individual paragraphs are then themselves held in <w:p></w:p> elements and these are then further broken down based on styling applied, whether there are embedded hyperlinks in the paragraph, etc, etc.  Looking through a few sample reading lists in .docx format gave us a good feel for what sort of structures we’d find.  Processing the .docx OpenXML using Perl would be possible using the Archive::Any module to unpack the ZIP archive and then the XML::Simple module to process the XML data held within into Perl data structures.

The next issue was how do we find the citations held inside the Word documents and turn them into Structural Units in LORLS?  We decided to aim to import Harvard style citations and this is where we hit the first major problem: not everyone seems to agree on what a Harvard style bibliographic reference should look like.  For example some Harvard referencing texts say that author names in books should be capitalised, publication dates should follow in brackets and titles underlined like this:

WILLS, H., (1985), Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

whereas other sources don’t say anything about author capitalisation or surname/firstname/initial ordering but want the title in italics, and no brackets round the publication date:

Henry Willis, 1985, Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

When you start to look at real lists of citations from academics it becomes clear that many aren’t even consistent within a single document, which makes things even more tricky.  Some of the differences may be down to simple mistakes, but others may be due to cutting and pasting between other documents with similar, but not quite the same, Harvard citation styles.

The end result of this is that we need to do a lot of pattern matching and also accept that we aren’t going to get a 100% hit rate (at least not straight away!).  Luckily the LORLS back end is written in Perl and that is a language just dripping with pattern matching abilities – especially its powerful regular expression processor. So for our proof of concept we took some representative OpenXML format Word .docx files from one of our academic departments and then used them to refine a set of regular expressions designed to extract Harvard-esque citations in paragraph, trying to work out what type of work it is (book, article, etc) based on the ordering of parts of the citation and the use of italics and/or underlining.

The initial proof of concept was a command line script that was given the name of a .docx document and would then spit out a stream of document text interspersed with extracted citations.  Once we’d got the regular expressions tweaked to the point where our test set of documents were generating 80-90% hit rates, we took this code and turned it into a CGI script that could then be used as an API call to extract citations from an uploaded document and return a list of potential hits in JSONP format.

One thing to note about file uploading to an API is that browser security models don’t necessarily make this terribly easy – a simple HTML FORM  file upload will try to refresh the page rather than do an AJAX style client-server transaction.  The trick, as used by a wide variety of web sites now, is to use a hidden IFRAME as the target of the file upload output or some XHR scripting.  Luckily packages such as jQuery now come with support for this, so we don’t need to do too much heavy lifting to make it work from the Javascript front end client.  Using JSONP makes this a bit easier as well, as its much easier to handle the JSON format in Javascript that if we returned XML.

The JSONP results returned from our OpenXML processing CGI script provide structures containing the details extracted from each work.  This script does NOT actually generate LUMP Structural Units; instead we just return what type we think each work is (book, etc) and then some extracted citation information relevant to that type of work.

The reasons for not inserting Structural Units immediately are three fold.  Firstly because we can’t be sure we’ve got the pattern matching working 100% it makes sense to allow the user to see what we think the matched citations are and allow them to select the ones they want to import into LORLS.  Secondly, we already have API calls to allow Structural Units to be created and edited, so we probably shouldn’t reinvent the wheel here – the client already knows how to talk to those to make new works appear in reading lists.  Lastly by not actually dealing with LUMP Structural Units, we’ve got a more general purpose CGI script – it could be used in systems that have nothing to with LORLS or LUMP for example.

So that’s the current state of play.  We’re gathering together more Word documents from academics with Harvard style bibliographies in them so that we can test our regular expressions against them, and Jason will be looking at integrating our test Javascript .docx uploader into CLUMP.  Hopefully the result will be something the academics and library staff will be happy to see at last!

 

Converting structural units from one type to another

With LORLS6, we’ve moved away from the older LORLS “one size fits all” approach for the resources we hold in reading lists to having structural units, each of which is of a particular type.  Those types can be books, book chapters, journals, journal articles, reading lists, modules, departments and, er, well more or less anything the system’s administrator wants them to be.  This gives the system some great flexibility, especially seeing as the structural unit types also have sets of valid parent/child links between different types (so that departments can appear under an institution, but not under a book for example).  It also means that we can provide validation for what goes into each type, and treat different types of resource in different ways.

All well and good.  But want if you need to change the type of a structural unit?  Why would you need to do that?  Well, it might be that someone has mistakenly created a load of resources as books when they should be journal articles.  Or you’ve introduced a new organisational unit between an institution and a department (call it, oooh, say a school) and need to change some departments into the new type.  Until now you either needed to delete the existing structural unit and then create a new one of the new type, or get down and dirty with the database to tweaking the relationships at a SQL level (we’ve done the latter at Loughborough for example, for the “introducing schools” reason given above).

However we thought it might be an idea to provide an API to allow this conversion to be done programmatically, and potentially have a nice webby user interface wrapped round it so that end users/librarians can do this themselves.  So enter stage left: the ConvertSU API call.

This new call normally takes two parameters: su_id and target_sut_id.  The former is the ID of the structural unit we’re diddling around with, whilst the latter is the ID of the structural type that we’re planning on changing it to.  Behind the scenes there’s a rather large Perl data structure that tells the script which structural unit types can be converted to which other types, and also what the mapping is for individual data elements.

Now there are some things to note about this:

  1. Not all conversions from one type to another are possible, or even desirable.  For example we don’t want to turn books into departments usually.  If the script doesn’t have a valid conversion detailed in its data structure, it fails the mapping, leaving the structural unit untouched.
  2. Even if we have a conversion between one structural unit type and another, not all the data types that are valid in the original structural unit type are going to be valid in the new one, so that in general the conversion process is lossy with some data elements being dropped from the structural unit.  If we do have to drop a data element like this, we’ve decided to pop the details into a new Librarian Note, so at least the library staff can see what has been lost (handy if someone fumbles and converts a structural unit they then decide that they didn’t mean to!).
  3. If we do have a conversion mapping from one data type in the old structural unit type to a new data type in the new structural unit type, we might still need to tweak the data in the related data element(s) as we do the conversion.  Part of this might be that we need to amalgamate elements of a repeatable data type in the original structural unit type into a non-repeatable data type in the new structural unit type.  We need to give some options for how this is done (do we take the first data element we find?   Or concatenate all the data elements of this type together?)
  4. There are some situations where even if the data types are both repeatable/non-repeatable we might want to do some post processing of the element value.  For example we might want to turn the contents of an ISBN field into an ISSN, but only if the ISBN looks a bit like an ISSN to start with (if for example we had some dodgy imports from LORLS5… cough, cough! 🙂 ).  To allow for this, we permit Perl regular expression substitutions to be specified in the conversion data structure in the script – with enough evilness in hand Perl regexp can do most data conversions, even if there are considered by some to be write only coding!

The ConvertSU API also has an additional parameter called dump_rules which if set to a non-zero value will turn the rules data structure in the Perl script into an XML structure and return that to the client without doing a conversion.  This allows the client to see what types are convertible to other types, and so restrain from providing users with conversion options that don’t work.

So far the ConvertSU API is sitting on our development server and seems to be able to happily convert book chapters to articles and back again.  Jason is now going to step up to the plate to make use of this new API in the development version of the CLUMP client (and no doubt find a load of bugs that I’ve left in the alpha code for him! 😉 ).  Assuming there’s no show stoppers, we can then release it to our production system and see how our users get on with it, before slipping it into the LUMP/CLUMP GPL distribution.

In the fullness of time, we might want to abstract the conversion rules data structure out of the script and into the database.  That would allow us to produce admin front ends to permit the rule set to be edited without folk having to fire up emacs or vi and edit the ConvertSU Perl script.  We’ve not taken that route yet because we wanted to make sure that the basic logic worked and was useful, and also we’d prefer that any changes that result in a change to the underlying database schema be rolled into LORLS7.  In other words database scheme updates are a major version number change.  We’ve already got a couple of other schema changes on the cards for the future – for example over the summer we did a spot of work on logging guest views into a new table in the database so that we can provide academics and librarians with feedback on how often the students are looking at the reading lists (and even possibly what they are looking at).

There’s no time scale on LORLS7 deployment though before anyone asks! 🙂

Open source seminar

Jason and I attended an open source seminar from PTFS Europe last week. Conveniently for us the seminar was being hosted by the Department of Information Science at Loughborough University so all we had to do was depart the office, go up a single flight of stairs and we were there.

The were presentations from PTFS about open source in general was well as specific case studies about implementing Koha at Halton and Staffordshire University. The afternoon included a session about other products which PTFS are hoping to support including VuFind, Cufts, Godot and reading lists!

The original publicity for the event indicated they were considering adopting List8D but this isn’t the case as they’re developing their own system, after all in their own words “developing a reading list system isn’t exactly rocket science!” The system will be a hosted solution and so would be a direct competitor to Talis Aspire.

We look forward to seeing how their reading list system develops over the coming months.

Improving performance

As we seemed to have reached a stable period for our codebase I decided it was time to start digging into LUMP’s code to see where we could make some performance increases.

I took a standard performance improvement approach of installing a profiler (NYTProf) on LUMP’s development server and configuring apache to use it for Perl CGI scripts.  After a short period of profiling three places where the performance could be improved were identified.

The first improvement was the removal of a system call (using backticks) to the whoami program to identify a users login.  Normally this wouldn’t make much difference but the call was in a sub procedure that would be called 30 or 40 times from the FastGetStructuralUnit script.  The multiple times it had to spawn a new process to run whoami pushed this line to the top of the profiler’s output.  A quick change from `whoami` to getlogin() and the line disappeared from the profiler’s list of time consuming lines.

The second improvement was tucked away in the BaseSQL module.  The MySQL call to get the ID of the last inserted item was regularly appearing in the profilers output as a slow call.  Upon digging in to the code I saw the line

$sql = "select LAST_INSERT_ID() from ".$self->{_table};

This is a very instinctive line to code when you want to get the last ID of an item inserted into a specific table (almost always inserted by the previous SQL call). The problem is that LAST_INSERT_ID() returns the ID of the last row inserted in the current session and not of the specified table. Calling it in the way that we were was a big waste of time. Here is an example to demonstrate why it is waste of time.

mysql> select count(LAST_INSERT_ID()) from data_element;
1933973

By calling it in the way we were MySQL would pass back a row, consisting of the LAST_INSERT_ID() value, for every row in the table we specified. The solution was simply to change the line to


mysql> select count(LAST_INSERT_ID());
1

This change actually knocked a second of some of our slower API’s running time and also reduced the load on the MySQL server.

The third factor that stood out as slowing down most API calls was the module being used to build the XML response, XML::Mini. XML::Mini is a very powerful module for processing very complex XML documents. We didn’t need to use much of its potential as our XML structure was deliberately kept simple. An hour later we had produced XMLify which would take the same hash reference we passed to XML::Mini and produce compatible output. By using our own XMLify sub procedure we managed to reduce the run time of our large FastGetStructuralUnit calls by over 1 second.

Meeting the Reading List Challenge workshop

On July 14th 2011 a workshop was held at Loughborough University entitled “Meeting The Reading List Challenge”.  42 people attended and, after a couple of presentations on reading lists in the morning, the afternoon was spent in group discussions looking at various aspects of reading list design and implementation.

The groups were each asked two questions, and each question was asked of two groups.  The questions were:

  1. What makes a perfect read list? And how can an academic keep it relevant?
  2. Who should be involved in the development of a reading list and what are their roles?
  3. Who do you want to view a reading list and who don’t you want to see it?
  4. How do you get your whole institution engaged with reading lists?
  5. Is there a formula that describes the relationship between reading list content and library stock?
  6. What other systems does a resource/reading list management system need interact with and why?

You can see the posters made from the results of the discussion online.

After the workshop, Gary, Jason and myself sat down and had a think about how some of the things that had come out of the discussions could be implemented in LORLS, and if they were things that we might find useful at Loughborough.  As a result we’ve got a list of some new things to investigate and potentially implement:

  1. Produce a report that is emailed to library staff and/or academics that flags when a new edition of an existing work is available.
  2. Report back to academics on the usage that their reading list is getting.  As we don’t ask the students to log into our LORLS installation, this will have to be anonymous usage information, either from the webserver or from data recorded by the API.
  3. Look at options for purchasing formulae to assist library staff in placing orders for works.  These formulae would be based on various facets such as the number of reading lists a work is on, how many students are on the corresponding modules, the importance attached to the work by the academic(s), the cost of the work, etc.  We might even factor in some simple machine learning so that past purchasing decisions can help inform the system about the likely outcome of future ones.
  4. Importing works from existing bibliographic management tools, especially from RIS/Refworks format.
  5. Provide the students with an ability to rate items and/or lists.  This would provide academics with feedback on how useful the students found the works on the reading lists and might also help the purchasing decisions.
  6. Do some work on the back end to get cookies, Shibboleth SSO and JSON(P) supported to provide a more integrated system.
  7. Sending suggestion emails to academics when new works are added to library stock that cover similar topics as ones already on their reading lists.
  8. Do some W3C accessibility and mobile web support testing.
  9. Introduce a ‘tickstamp’ data type that is set with the current date/time when someone ticks a check box.  This could then help support workflow for the librarians (ie a list of check boxes that have to be ticked off for each list and/or item).

We’re not at the stage of attaching time scales to the development of any of these, and indeed we might find that we don’t actually implement all of them.  However this list does give an idea of where we’re looking to take LORLS now that we have v6 out in production use at Loughborough.

LORLS v6 unleashed

In the early hours of yesterday morning LORLS v6 slipped its keepers (Jon and Jason) and escaped into the wild. LORLS v6 is described as flexible open source resource/reading list management system. Alongside LORLS v6 its three children (LUMP, CLUMP and BibGrab) also successfully made their breaks for freedom.

Members of the public are advised to check the following safety guidelines before approaching the beast.

Momentous events

Well, OK maybe they’re not that momentous but…

A couple of months ago we (Jason and I) met up with Ian Corns of Talis Aspire fame and had a bit of a catch-up session. Much has changed at Talis: their Library Management System division has been sold off, what was Talis Aspire is now called Talis Aspire Campus Edition, they are launching talisaspire.com and Ian has a new job title (which is no laughing matter :-)).

We also bemoaned the lack of any reading list events happening this Summer. So in light of that we were particularly pleased when the Department of Information Science at Loughborough University (i.e. them upstairs) decided to host a workshop on “Meeting the reading list challenge” especially as Ian and myself will be giving a presentation on reading list systems at the event.

As we’re now involved in helping to organise the event I thought advertising it here might be a good idea!

Meeting the reading list challenge: A workshop
Department of Information Science, Loughborough University
Thursday 14th July 2011, 10:30am – 3:00pm

Do you know what resources your academics are recommending to students? How easy do your students find it to locate these key resources?

These issues (and many others) will be discussed at this forthcoming workshop.

Your host for the day will be Dr Ann O’Brien from the Department of Information Science, Loughborough University. The morning session will consist of presentations on “What is a reading list?” and “A magical mystery tour of resource/reading list management systems” given by Gary Brewerton, Project Manager for LORLS (Loughborough online reading list system) and Ian Corns,Customer Liaison Manager for Talis Aspire.

A free buffet lunch will be provided after which there will be wide ranging discussions on topics such as: what makes a good list? How do you engage with academic staff? And, what roles does the library actually have with regard to reading lists? There will also be opportunities for you to ask questions of those present.

This is a free event. If you would like to attend please email Sue Manuel (s.manuel@lboro.ac.uk) to reserve a place stating your name, institution and any specific dietary requirements.

We look forward to hearing what others have to say about reading lists and associated systems on the day.

Twitter hash tag for event: #mtrlc

Finding things that aren’t there

Jason noticed a bug in LUMP last week (just one I hear you cry? 🙂 ).  When non-admin/non-library staff created a structural unit for something like a book, there wouldn’t be any “Held by Library” flag data elements created, as these have a data type that lives in a data type group that the academics don’t have edit rights to.  This means that some of the backend scripts that were looking for works with this flag set to “N” were missing some records that don’t have the flag at all.

The long term fix for this was relatively easy – I just got EditStructuralUnit to create data elements filled in with the default value if such a thing exists in the data type table for any data type groups associated with the structural unit type of the structural unit being created (and breathe… honestly it makes more sense than it sounds like it does!).  It does this even if the user creating the structural unit doesn’t normally have edit rights to data types in some data type groups.  This should be quite safe to do as we’re creating a new structural unit so having defaulted data elements where possible should make library staffs’ lives easier.

However there was still a gotcha – we needed to go and find all the structural units that were missing their “Held by Library” data elements and create a suitable data element filled in with the default (in our case “N” as we assume we don’t hold works until proven otherwise).  Now I could have knocked out a Perl script with several selects and subqueries in it, but I decided to exercise the old SQL muscle and see if I could do it straight into the database, preferably in a single SQL statement.  So after some cogitating, here’s is the resulting SQL:

insert into data_element (structural_unit_id, data_type_id, value)
select su.id, dt.id, dt.default_value
from structural_unit as su
inner join data_type_group as dtg
on su.structural_unit_type_id = dtg.structural_unit_type_id
inner join data_type as dt
on dt.data_type_group_id = dtg.id
and dt.name = "Held by Library"
left outer join data_element as de
on de.data_type_id = dt.id
and de.structural_unit_id = su.id
where su.structural_unit_type_id in (1)
and de.id is null;

Hmm, that’s some serious joining going on there!  The basic idea is that we’re doing normal inner joins to link structural units, data type groups and data types together, and then a left outer join to spot data_elements.  As the left outer join has the data element columns set to NULL where the data element doesn’t match in a row, we use that in the where clause to limit the output to just the missing structural unit and data type information (plus a limitation on the types of structural unit we’re interested in – in this case just books).  This is then used directly in an insert.

Phew!  Seems to work though – we found 434 books on the live system that this corrects.  Relatively quick as well – less than 20 seconds on our production database. If anyone can see any better way to do this (or issues with this one!) please leave a comment.

Still, similar SQL can now be used on other structural unit types and other data types as well.  But I think I need a walk round the library first… 😉

Improving usability via popups

An area of CLUMP that we felt needed some work on was the length of time it could take for a list to reload after a user had gone into an item. Having discussed it for a while we decided to try using a popup for leaf nodes rather than actually moving into them.

Identifying if something is a leaf node or not is quite easy, we just need to see if its structural unit type can have any children. If it can then it isn’t a leaf node and we treat it as normal. If it can’t have any children then it is a leaf node and rather than putting in a link to move into the item we put in a link that displays it in a popup box.

The popup boxes have made a great improvement to the usability of CLUMP for both students and staff. When viewing large lists there is no longer any need for users to wait for a reading list to reload just because they decided to look at an item’s details.

Speeding up liststats

The LibrarianCentre contains a script called liststats that trawls back though editing events in the LUMP database and tries to determine how many lists in each department have been edited by which types of user.  Our library staff use the resulting statistics for a variety of tasks, such as distributing workload amongst teams and planning future loads.

The trouble is that liststats can be a bit slow, especially on a busy production database.  Today I’ve been doing some tweaking to speed it up, and got some pretty good results (mostly by creating some new methods to find the maximum priority a user has, either associated with a particular SU ID or in general).

All well and good, but it still takes some time to run so I decided to be flash and put a progress bar on the page whilst the script is running.  Hmmm… but the liststats script is just a plain old bog standard CGI script written in Perl, rather than one of Jason’s whizzy AJAX Javascript jobs.  A bit of hacking reveals that you can run Javascript whilst a page is loading though… so you can do a bit of sneaky Javascript and CSS to give us a nice graphical progress bar (with a textual incrementing percentage completed for the folk with no images/CSS/screen readers).  Cool – worked fine on our test server.

Then I moved it to our live server and all my progress bar loveliness disappeared.  What?!  Turns out that whilst Apache 2.x does unbuffered output by default (unlike some of the older Apache 1.x web servers of yesteryear that needed nph- scripts to do it), if you’ve got on the fly compression turned on the buffering is turned back on.  Rats.  Luckily there’s a handy way of turning it back off for particular scripts (so I don’t cock up Jason’s AJAX calls that benefit from compression, espeically on IE with its limited connection limit):

SetEnvIfNoCase Request_URI liststats.test no-gzip dont-vary

This just needs to sit somewhere appropriate in the HTTPd config files – on our CentOS box its in the included ssl.conf config file fragment.  Everything works again – hoorah!

Go to Top