The continuing battle of parsing Word documents for reading list material

At this year’s Meeting The Reading List Challenge (MTLRC) workshop, Gary Brewerton (my boss) showed the delegates one of our LORLS features: the ability to suck citation data out of Word .docx documents.  We’ve had this for a few years and it is intended to allow academics to take existing reading lists that they have produced in Word and import them relatively easily into our electronic reading lists system. The nice front end was written by my colleague Jason Cooper, but I was responsible for the underlying guts that the APIs call to try parsing the Word document and turn it into structured data that LORLS can understand and use. We wrote it originally based on suggestions from a few academics who already had reading lists with Harvard style references in them, and they used it to quickly populate LORLS with their data.

Shortly after the MTRLC workshop, Gary met with some other academics who also needed to import existing reading lists into LORLS.  He showed them our existing importer and, whilst it worked, it left quite alot entries as “notes”, meaning it couldn’t parse them into structured data.  Gary then asked me to take another look at the backend code and see if I could improve its recognition rate.

I had a set of “test” lists donated by the academics of varying lengths, all from the same department.  With the existing code, in some cases less than 50% of the items in these documents were recognised and classified correctly.  Of those, some were misclassified (eg book chapters appearing as books).

The existing LORLS .docx import code used Perl regular expression pattern matching alone to try to work out what sort of work a citation referred to this.  This worked OK with Word documents where the citations were well formed.  A brief glance through the new lists showed that lots of the citations were not well formed.  Indeed the citation style and layout seemed to vary from item to item, probably because they had been collected over a period of years by a variety of academics.  Whilst I could modify some of the pattern matches to help recognise some of the more obvious cases, it was clear that the code was going to need something extra.

That extra turned out to be Z39.50 look ups.  We realised that the initial pattern matches could be quite generalised to see if we could extract out authors, titles, publishers and dates, and then use those to do a Z39.50 look up.  Lots of the citations also had classmarks attached, so we’d got a good clue that many of the works did exist in the library catalogue.  This initial pattern match was still done using regular expressions and needed quite a lot of tweaking to get recognition accuracy up.  For example spotting publishers separated from titles can be “interesting”, especially if the title isn’t delimited properly.  We can spot some common cases, such as publishers located in London, New York or in US states with two letter abbreviations.  It isn’t fool proof, but its better than nothing.

However using this left us with only around 5% of the entries in the documents classified as unstructured notes when visual checking indicated that they were probably citations. These remaining items are currently left as notes, as there are a number of reasons why the code can’t parse them.

The number one reason is that they don’t have enough punctuation and/or formatting in the citation to allow regular expression to determine which parts are which. In some cases the layout doesn’t even really bear much relation to any formal Harvard style – the order of authors and titles can some time switch round and in some cases it isn’t clear where the title finishes and the publisher ends. In a way they’re a good example to students of the sort of thing they shouldn’t have in their own referencing!

The same problems encountered using the regular expressions would happen with a formal parser as these entries are effectively “syntax errors” if treating Harvard citations as a sort of grammar.  This is probably about the best we’ll be able to do for a while, at least until we’ve got some deep AI that can actually read the text and understand it, rather that just scan for patterns.

And if we reach that point LORLS will probably become self aware… now there’s a scary thought!

Extending the Structural Unit Types available

One additional Structural Unit Type (SUT) that we have been asked for is Audio Visual (AV) material, e.g. CDs, DVDs, Film, etc.  While we’ve manually added an AV SUT to our local instance we didn’t have an easy way to extend this to other instances of LORLS.

So to tackle this we have put together a quick Perl script that can be run from the command line which adds in the new AV SUT. If your LORLS install doesn’t have an AV Material SUT and you would like to add it then here are the instructions to do so:

  1. Back up your LORLS install (Don’t forget the database as this will be altered)
  2. Download the latest extendSUTs script (e.g. wget “”)
  3. Make the script executable (e.g. chmod +x extendSUTs)
  4. Run the script (e.g. ./extendSUTs –database=<database> –user=<database user>)
  5. When prompted enter the database user’s password
  6. If the script fails due to missing the Term::ReadKey Perl module then install it and try the script again (RedHat/CentOS should just need to run “sudo yum install perl-TermReadKey”)
  7. Once the script has run open a new browser session and try adding a new AV Material entry to a test list.


Integrating LORLS with Koha

At Loughborough we use Ex Libris’s Aleph as our Library Management System (LMS). Our LORLS reading list system also makes use of Aleph in a number of ways:

  1. To provide bibliographic information via Z39.50, allowing people to quickly add works to reading lists by simply entering the ISBN,
  2. In batch jobs to check which works the library holds, again using Z39.50,
  3. To find how many items for a work are held and how many are available for loan which is displayed on a work’s detail pop up in CLUMP.  This has been done using the Ex Libris X Server product (not to be confused with X11 servers!),
  4. To build up lists of current and past loans, used for purchase prediction, high demand reporting and usage tracking.  This is mostly done using custom local Perl scripts that interrogate Aleph’s Oracle database directly and then provide an XML API that LORLS code calls.

We’ve recently had another site say they are interested in using LORLS, but they use the Koha LMS.  They asked us what would be needed to allow LORLS to integrate with Koha?  Luckily Koha is open source, so we could just download a copy and install it on a new virtual machine to have a play with and see what was needed.

Looking up bibliographic information was pretty simple to support.  Koha has a built in Z39.50 server, so all we needed to do was to tweak the file on our dev server to point the Z3950Hostname(), Z3950Port() and Z3950DBName() to point to our new VM, the port that Koha’s Z39.50 server is running on and the new Z39.50 database name (which appears to default to “biblios”).  That seemed to work a treat.

Get item holdings for works was a bit more involved.  Koha obviously doesn’t have Ex Libris’s X server, so we needed another way to get similar data.  Luckily Koha does implement some of the Digital Library Federation Integrated Library System – Discovery Interface recommendations.  One of these Application Programming Interfaces (APIs) is called GetRecords() and, given a set of system record identifiers, will return an XML document with just the sort of information we need (eg for each item linked to a work we get things like item type, whether it can be loaned, what its due date is if it is on loan, if it is damaged, lost, etc).

Unfortunately LORLS doesn’t know anything about Koha’s system record identifiers, and the GetRecords() ILS-DI API doesn’t appear to allow searches based on control numbers such as ISBNs.  To get the system record numbers we can however fall back on Z39.50 again, which Koha uses to implement the ILS-DI SRU type interfaces.  Searching for a Bib-1 attribute of type 1007 with the value set to be the ISBN gets us some nice USMARC records to parse.  We need to look at the 999 $c field in the MARC record as this appears to be the Koha system record identifier.

Just to make things interesting we discovered accidentally that in Koha you can end up with more than one work for the same ISBN (and each work can then have multiple physical items).  I guess this is a flexibility feature in some way, but it means that we need to make sure that we get all the system record identifiers that match our ISBN from LORLS and then pass all of these to the Koha GetRecords() API.  Luckily the API call can take a whole set of system record identifiers in one go, so this isn’t too much of a problem.

One thing we do have to have in the code though is some way of distinguishing between loan categories (long loan, short loan, week loan, reference, etc).  In Koha you can create an arbitrary number of item types which can correspond to loan categories, to which you can then assign things like differing loan rules. We slipped in:

  • BK – normal long loan book (the default in Koha it seems)
  • REFBOOK – a book that can’t be loaned.
  • SL BOOK – a short loan book (usually loaned for less than a day – our “high demand” stock),
  • WL BOOK – a book that can be loaned for a week (effectively moderately in demand works).

Our code currently then has these hard coded in order to return the same sort of holdings Perl structure that Aleph did. Extra item types assigned in Koha will need to been inserted into this code – we might have to think of a “nice” way of doing this if folk make lots of these changes on a regular basis but I suspect item types are one of those things that are configured when an LMS is setup and rarely, if ever, tweaked again.

In the Perl module I created a new method called Koha_ILS_DI_Holdings() to implement the new Koha item holdings and availability code.  The existing Holdings() method was renamed to Aleph_Holdings() and a new Holdings() method implemented that checks a new Holdings() method in for the name of a holdings retrieval algorithm to use (currently now either “Koha:ILS-DI” which selects the new code, or anything else defaulting back to the old Aleph_Holdings() method).  This means that if someone else comes along with XYZ Corp LMS that uses some other whacky way of retrieving holdings availability we can simply write a new method in and add another if clause to the Holdings() method to allow a quick change to select it.  The advantage of this is that other code in LORLS that uses the Holdings() method from doesn’t have to be touched – it is insulated from the messy details of which implementation is in use (ooh, object oriented programming at work!).

This appears to work on our test set up and it means we can just ship a new and file to the folk with Koha and see how they get on with it.  Our next trick will be getting loan history information out of Koha – this may take a bit more work and its really replacing functionality that we’ve mostly implemented solely for Loughborough (using our custom scripts as even Aleph didn’t provide a usable API for what we needed).  It doesn’t appear at first glance that Koha’s ILS-DI APIs cover this use case – I guess we’re a bit odd in being interested in work loan histories!

Trouble installing? Disable SELinux

We are currently working on a new distribution of LORLS (That’s right version 7 is coming soon) and to test the installer’s ability to update an existing installation we needed a fresh v6 install to test on.   So I dropped into a fresh virtual machine we have dedicated specifically to this kind of activity, downloaded the version 6 installer and ran through the installation only to find that, while it had created the database and loaded the initial test data just fine, it hadn’t installed any of the system files.

So for the next 3 hours I was scouring apache’s logs, checking the usual culprits for these sort of issues and debugging the code.  One of the first things I did was check the SELinux configuration and it was set to permissive, which means that it doesn’t actually block anything just warns the user.  This lead me to discount SELinux as the cause of the problem.

After 3 hours of debugging I finally reached the stage of having a test script that would work when run by a user but not when run by apache.  The moment that I had this output I realised that while SELinux may be configured to be permissive, it will only pick up this change when the machine is restarted.  So I manually tried disabling SELinunx (as root use the command ‘echo 0 > /selinux/enforce’) and then tried the installer again.

Needless to say the installer worked fine after this, so if you are installing LORLS and find that it doesn’t install the files check that SELinux is disabled.

Big changes under the hood and a couple of minor ones above

A few weeks ago we pushed out an update to our APIs on our live instance of LORLS and this morning we switched over to a new version of our front end (CLUMP).  The changes introduced in this new version are the following

  • Collapsible sub-headings
  • Improved performance
  • Better support for diacritics

Collapsible sub-headings

Following a suggestion from an academic we have added a new feature for sub-headings.  Clicking on a sub-heading will now collapse all the entries beneath it.  To expand the section out again the user simple needs to click on the sub heading again.  This will be beneficial to both academics maintaining large lists and students trying to navigate them.

Improved performance

In our ever present quest to improve the performance, both actual and perceptual, we decided to see if using JSON instead of XML would help.  After a bit of experimentation we discovered that using JSON and JSONP would both reduced the quantity of JavaScript code in CLUMP’s routines and significantly improved the performance.

Adjusting Jon’s APIs in the back end (LUMP) to return in either XML, JSON or JSONP format was quite easy once the initial code had been inserted in the LUMP module’s respond routine.  Then it was simply a matter of adding 4 lines of code to each API script.

Switching CLUMP to using JSONP was a lot more time consuming.  Firstly every call had to have all it’s XML parsing code removed and then the rest of the code in the routine needed to be altered to use the JavaScript object received from the API.  This resulted in both nicer to code/read JavaScript and smaller functions.

Secondly a number of start up calls had been synchronous, so the JavaScript wouldn’t continue executing until the response from the server had been received and processed.  JSONP calls don’t have a synchronous option.  The solution in the end was to use a callback from the function that processes the JSONP response from the server and with a clever bit of coding this actually enabled us to make a number of calls in parallel and continue only after all of them had completed.  Previously the calls were made one after the other, each having to wait for the preceding call to have been completed before it could start.  While this only saved about half a second on the start up of CLUMP, it made a big difference to the user perception of the systems performance.

Better support for diacritics

This was actually another beneficial side-effect of switching to JSONP over XML for most of our API calls.  In Internet Explorer it was discovered that some UTF-8 diacritic characters in the data would break its XML parser, but because JSONP doesn’t use XML these UTF-8 characters are passed through and displayed fine by the browser.  Of course we do sometimes find some legacy entries in a reading list, created many years ago in a previous version of LORLS, which are in the Latin-1 character set rather than UTF-8, but even these don’t break the JavaScript engine (though they don’t necessarily display the character that they should).


Item Ratings

One of the things on the LORLS “to do” list from last summer’s Meeting the Reading Lists Challenge workshop was to have ratings on lists and/or items. Gary and I had a chat about this earlier today and decided that if we’re going to do it, it would probably be better just on items rather than lists. That way students are commenting on how useful they found individual books, articles, etc rather than the academics reading list as a whole, so there would probably be more acceptance from academics.

I’ve thus produced two new API calls to support this:

  • GetSURating – get the ratings for particular structural units and/or users. If you give it a “suid” parameter with a comma separated list of structural unit IDs it will try to find ratings for those. You can also ask for rating set by one or more users by specifying the “user_id” parameter with a comma separate list of user IDs (the latter mostly because I thought we might want to allow folk in the future to see which books they’d rated, a bit like LibraryThing does). The script normally returns some XML with the ratings for each matched SUID (good and bad). You can give it a “details” parameter set to ‘Y’ in which case it will just splurge out XML with all the matching records in (including creation/modification times, etc so we could do fancy time based rating analysis).
  • Editing/EditSURating – create/edit a rating. Needs to have the structural unit ID send in via the “suid” parameter and the rating itself (either “Good” or “Bad”) via the “rating” parameter. No user_id parameter as it takes the logged in user as the person to create the rating from. It returns a summary of the current ratings for the structural unit after updating for this user. If you’re not logged in it does nothing.

Each user can click on ratings for a particular structural unit as many times as they like, but they’ll only have one active record. That means that you can rate something as “Bad” at first, then re-read it later and decide that you were wrong and its actually “Good” and re-rate it. Your old “Bad” rating is replaced by the “Good” rating.

We’ve already been talking about how we can make use of the data once the students start rating items. For example we could have a graph or scatter chart in the academic’s dashboard showing them four quadrants: rarely borrowed items that aren’t liked, rarely borrowed items that are liked, heavily borrowed items that aren’t liked and heavily borrowed items that are liked.  This would provide some feedback to academics on how useful the students found the material on their reading lists, and would also potentially supply some useful information to library staff.  You could imagine that a very expensive book that the library has put off buying a copy of but which is heavily liked by people who’ve acquired/seen copies elsewhere might persuade library staff to order a copy for instance.

I’ve got a proof of concept up and running now on our test/dev server.  This shows thumbs up/down on the bibiographic details popup in CLUMP for leaf items (books, journals, etc). As it requires a change to the LUMP database schema (in fact a whole new user_item_rating table), this isn’t going to be a LORLS v6.x thing but instead a LORLS v7 feature.

Oh crumbs, I’ve started working on the next version of LORLS already! 🙂

Extracting Harvard citations from Word documents

Over the years, one question that we’ve had pop up occasionally from academics and library staff is whether we could import reading lists from existing Microsoft Word documents. Many academics have produced course handouts including reading material in Word format and some still do, even though we’ve had a web based reading list system at Loughborough for over a decade now, and a VLE for a roughly similar period.

We’ve always had to say no in the past, because Microsoft Word’s proprietary binary format was very difficult to process (especially on the non-Microsoft platforms we use to host our systems) and we had other, more important development tasks.  Also we thought that extracting the variety of citation/bibliography formats that different academics use would be a nightmare.

However with the new LUMP based LORLS now well bedded in at Loughborough and Microsoft basing the document format of newer versions of Word on XML, we thought we’d revisit the idea and spend a bit of time to see what we could do.

Microsoft Office Word 2007 was introduced as part of the Office 2007 suite using a default file format based on XML, called Office Open XML Format, or OpenXML for short.  A Word 2007 document is really a compressed ZIP archive containing a directory structure populated with a set of XML documents conforming to Microsoft’s published XML schemas, as well as any media files required for the documents (images, movies, etc).  Most academics are now using versions of Microsoft Word that generate files in this format, which can be identified easily by looking for the “.docx” filename extension.

The XML documents inside the ZIPed .docx archive contain both the text of the document, styling information and properties about the document (ie who created it and when).  There’s actually quite a lot of structural information stored as well, which Microsoft explain how to process in order to work out how different parts of the document are related to each other.  Some of this is rather complex, but for a simple “proof of concept” what we needed was the actual document text structure.  By default this lives in a file called “word/document.xml” inside the ZIP archive.

The document.xml file contains an XML element called <w:body></w:body> that encapsulates the actual document text.  Individual paragraphs are then themselves held in <w:p></w:p> elements and these are then further broken down based on styling applied, whether there are embedded hyperlinks in the paragraph, etc, etc.  Looking through a few sample reading lists in .docx format gave us a good feel for what sort of structures we’d find.  Processing the .docx OpenXML using Perl would be possible using the Archive::Any module to unpack the ZIP archive and then the XML::Simple module to process the XML data held within into Perl data structures.

The next issue was how do we find the citations held inside the Word documents and turn them into Structural Units in LORLS?  We decided to aim to import Harvard style citations and this is where we hit the first major problem: not everyone seems to agree on what a Harvard style bibliographic reference should look like.  For example some Harvard referencing texts say that author names in books should be capitalised, publication dates should follow in brackets and titles underlined like this:

WILLS, H., (1985), Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

whereas other sources don’t say anything about author capitalisation or surname/firstname/initial ordering but want the title in italics, and no brackets round the publication date:

Henry Willis, 1985, Pillboxes: A Study Of U.K. Defences 1940, Leo Cooper, London.

When you start to look at real lists of citations from academics it becomes clear that many aren’t even consistent within a single document, which makes things even more tricky.  Some of the differences may be down to simple mistakes, but others may be due to cutting and pasting between other documents with similar, but not quite the same, Harvard citation styles.

The end result of this is that we need to do a lot of pattern matching and also accept that we aren’t going to get a 100% hit rate (at least not straight away!).  Luckily the LORLS back end is written in Perl and that is a language just dripping with pattern matching abilities – especially its powerful regular expression processor. So for our proof of concept we took some representative OpenXML format Word .docx files from one of our academic departments and then used them to refine a set of regular expressions designed to extract Harvard-esque citations in paragraph, trying to work out what type of work it is (book, article, etc) based on the ordering of parts of the citation and the use of italics and/or underlining.

The initial proof of concept was a command line script that was given the name of a .docx document and would then spit out a stream of document text interspersed with extracted citations.  Once we’d got the regular expressions tweaked to the point where our test set of documents were generating 80-90% hit rates, we took this code and turned it into a CGI script that could then be used as an API call to extract citations from an uploaded document and return a list of potential hits in JSONP format.

One thing to note about file uploading to an API is that browser security models don’t necessarily make this terribly easy – a simple HTML FORM  file upload will try to refresh the page rather than do an AJAX style client-server transaction.  The trick, as used by a wide variety of web sites now, is to use a hidden IFRAME as the target of the file upload output or some XHR scripting.  Luckily packages such as jQuery now come with support for this, so we don’t need to do too much heavy lifting to make it work from the Javascript front end client.  Using JSONP makes this a bit easier as well, as its much easier to handle the JSON format in Javascript that if we returned XML.

The JSONP results returned from our OpenXML processing CGI script provide structures containing the details extracted from each work.  This script does NOT actually generate LUMP Structural Units; instead we just return what type we think each work is (book, etc) and then some extracted citation information relevant to that type of work.

The reasons for not inserting Structural Units immediately are three fold.  Firstly because we can’t be sure we’ve got the pattern matching working 100% it makes sense to allow the user to see what we think the matched citations are and allow them to select the ones they want to import into LORLS.  Secondly, we already have API calls to allow Structural Units to be created and edited, so we probably shouldn’t reinvent the wheel here – the client already knows how to talk to those to make new works appear in reading lists.  Lastly by not actually dealing with LUMP Structural Units, we’ve got a more general purpose CGI script – it could be used in systems that have nothing to with LORLS or LUMP for example.

So that’s the current state of play.  We’re gathering together more Word documents from academics with Harvard style bibliographies in them so that we can test our regular expressions against them, and Jason will be looking at integrating our test Javascript .docx uploader into CLUMP.  Hopefully the result will be something the academics and library staff will be happy to see at last!


Converting structural units from one type to another

With LORLS6, we’ve moved away from the older LORLS “one size fits all” approach for the resources we hold in reading lists to having structural units, each of which is of a particular type.  Those types can be books, book chapters, journals, journal articles, reading lists, modules, departments and, er, well more or less anything the system’s administrator wants them to be.  This gives the system some great flexibility, especially seeing as the structural unit types also have sets of valid parent/child links between different types (so that departments can appear under an institution, but not under a book for example).  It also means that we can provide validation for what goes into each type, and treat different types of resource in different ways.

All well and good.  But want if you need to change the type of a structural unit?  Why would you need to do that?  Well, it might be that someone has mistakenly created a load of resources as books when they should be journal articles.  Or you’ve introduced a new organisational unit between an institution and a department (call it, oooh, say a school) and need to change some departments into the new type.  Until now you either needed to delete the existing structural unit and then create a new one of the new type, or get down and dirty with the database to tweaking the relationships at a SQL level (we’ve done the latter at Loughborough for example, for the “introducing schools” reason given above).

However we thought it might be an idea to provide an API to allow this conversion to be done programmatically, and potentially have a nice webby user interface wrapped round it so that end users/librarians can do this themselves.  So enter stage left: the ConvertSU API call.

This new call normally takes two parameters: su_id and target_sut_id.  The former is the ID of the structural unit we’re diddling around with, whilst the latter is the ID of the structural type that we’re planning on changing it to.  Behind the scenes there’s a rather large Perl data structure that tells the script which structural unit types can be converted to which other types, and also what the mapping is for individual data elements.

Now there are some things to note about this:

  1. Not all conversions from one type to another are possible, or even desirable.  For example we don’t want to turn books into departments usually.  If the script doesn’t have a valid conversion detailed in its data structure, it fails the mapping, leaving the structural unit untouched.
  2. Even if we have a conversion between one structural unit type and another, not all the data types that are valid in the original structural unit type are going to be valid in the new one, so that in general the conversion process is lossy with some data elements being dropped from the structural unit.  If we do have to drop a data element like this, we’ve decided to pop the details into a new Librarian Note, so at least the library staff can see what has been lost (handy if someone fumbles and converts a structural unit they then decide that they didn’t mean to!).
  3. If we do have a conversion mapping from one data type in the old structural unit type to a new data type in the new structural unit type, we might still need to tweak the data in the related data element(s) as we do the conversion.  Part of this might be that we need to amalgamate elements of a repeatable data type in the original structural unit type into a non-repeatable data type in the new structural unit type.  We need to give some options for how this is done (do we take the first data element we find?   Or concatenate all the data elements of this type together?)
  4. There are some situations where even if the data types are both repeatable/non-repeatable we might want to do some post processing of the element value.  For example we might want to turn the contents of an ISBN field into an ISSN, but only if the ISBN looks a bit like an ISSN to start with (if for example we had some dodgy imports from LORLS5… cough, cough! 🙂 ).  To allow for this, we permit Perl regular expression substitutions to be specified in the conversion data structure in the script – with enough evilness in hand Perl regexp can do most data conversions, even if there are considered by some to be write only coding!

The ConvertSU API also has an additional parameter called dump_rules which if set to a non-zero value will turn the rules data structure in the Perl script into an XML structure and return that to the client without doing a conversion.  This allows the client to see what types are convertible to other types, and so restrain from providing users with conversion options that don’t work.

So far the ConvertSU API is sitting on our development server and seems to be able to happily convert book chapters to articles and back again.  Jason is now going to step up to the plate to make use of this new API in the development version of the CLUMP client (and no doubt find a load of bugs that I’ve left in the alpha code for him! 😉 ).  Assuming there’s no show stoppers, we can then release it to our production system and see how our users get on with it, before slipping it into the LUMP/CLUMP GPL distribution.

In the fullness of time, we might want to abstract the conversion rules data structure out of the script and into the database.  That would allow us to produce admin front ends to permit the rule set to be edited without folk having to fire up emacs or vi and edit the ConvertSU Perl script.  We’ve not taken that route yet because we wanted to make sure that the basic logic worked and was useful, and also we’d prefer that any changes that result in a change to the underlying database schema be rolled into LORLS7.  In other words database scheme updates are a major version number change.  We’ve already got a couple of other schema changes on the cards for the future – for example over the summer we did a spot of work on logging guest views into a new table in the database so that we can provide academics and librarians with feedback on how often the students are looking at the reading lists (and even possibly what they are looking at).

There’s no time scale on LORLS7 deployment though before anyone asks! 🙂

Improving performance

As we seemed to have reached a stable period for our codebase I decided it was time to start digging into LUMP’s code to see where we could make some performance increases.

I took a standard performance improvement approach of installing a profiler (NYTProf) on LUMP’s development server and configuring apache to use it for Perl CGI scripts.  After a short period of profiling three places where the performance could be improved were identified.

The first improvement was the removal of a system call (using backticks) to the whoami program to identify a users login.  Normally this wouldn’t make much difference but the call was in a sub procedure that would be called 30 or 40 times from the FastGetStructuralUnit script.  The multiple times it had to spawn a new process to run whoami pushed this line to the top of the profiler’s output.  A quick change from `whoami` to getlogin() and the line disappeared from the profiler’s list of time consuming lines.

The second improvement was tucked away in the BaseSQL module.  The MySQL call to get the ID of the last inserted item was regularly appearing in the profilers output as a slow call.  Upon digging in to the code I saw the line

$sql = "select LAST_INSERT_ID() from ".$self->{_table};

This is a very instinctive line to code when you want to get the last ID of an item inserted into a specific table (almost always inserted by the previous SQL call). The problem is that LAST_INSERT_ID() returns the ID of the last row inserted in the current session and not of the specified table. Calling it in the way that we were was a big waste of time. Here is an example to demonstrate why it is waste of time.

mysql> select count(LAST_INSERT_ID()) from data_element;

By calling it in the way we were MySQL would pass back a row, consisting of the LAST_INSERT_ID() value, for every row in the table we specified. The solution was simply to change the line to

mysql> select count(LAST_INSERT_ID());

This change actually knocked a second of some of our slower API’s running time and also reduced the load on the MySQL server.

The third factor that stood out as slowing down most API calls was the module being used to build the XML response, XML::Mini. XML::Mini is a very powerful module for processing very complex XML documents. We didn’t need to use much of its potential as our XML structure was deliberately kept simple. An hour later we had produced XMLify which would take the same hash reference we passed to XML::Mini and produce compatible output. By using our own XMLify sub procedure we managed to reduce the run time of our large FastGetStructuralUnit calls by over 1 second.

LORLS v6 unleashed

In the early hours of yesterday morning LORLS v6 slipped its keepers (Jon and Jason) and escaped into the wild. LORLS v6 is described as flexible open source resource/reading list management system. Alongside LORLS v6 its three children (LUMP, CLUMP and BibGrab) also successfully made their breaks for freedom.

Members of the public are advised to check the following safety guidelines before approaching the beast.

Go to Top