LORLS | regular expressions

At this year’s Meeting the Reading List Challenge (MTRLC) workshop, my boss Gary Brewerton demonstrated one of the features we have in LORLS: the ability to ingest a Word document that contains Harvard(ish) citations. Our script reads in an Office Open XML (.docx) format Word document and spits out some structured data ready to import into a LORLS reading list. The idea behind this is that academics still create reading lists in Word, despite us having had an online system for 15 years now. The easier we make it for them to get these Word documents into LORLS, the more likely it is that we’ll actually get to see the data. We’ve had this feature for a while now, and it’s one of those bits of code that we revisit every so often when we come across new Word documents that it doesn’t handle as well as we’d like.

The folk at MTRLC seemed to like it, and Gary suggested that I yank the core of the import code out of LORLS, bash it around a bit and then make it available as a standalone program for people to play with, including sites that don’t use LORLS. So that’s what I’ve done – you can download the single script from:

https://lorls.lboro.ac.uk/WordImporter/WordImporter

The code is, as with the rest of LORLS, written in Perl. It makes heavy use of regular expression pattern matching and Z39.50 lookups to do its work. It is intended to run as a CGI script, so you’ll need to drop it on a machine with a web server. It also uses some Perl modules from CPAN that you’ll need to make sure are installed:

Data::Dumper
Algorithm::Diff
Archive::Any
XML::Simple
JSON
IO::File
ZOOM
CGI
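
To give a flavour of the regular expression approach, here is a minimal sketch in Python (not the Perl from the script itself). It pulls paragraph text out of a .docx file, which is a zip archive holding word/document.xml, and tries one pattern against each paragraph. The function names, the pattern and the single citation shape it handles ("Surname, I. (Year) Title. Place: Publisher.") are all assumptions made for illustration; the real importer copes with far more variation and follows up with Z39.50 lookups.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical pattern for one simple citation shape:
#   Surname, I. (Year) Title. Place: Publisher.
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s*"          # everything before the bracketed year
    r"\((?P<year>\d{4})[a-z]?\)\s*"  # (2015) or (2015a)
    r"(?P<title>[^.]+)\.\s*"         # title, up to the first full stop
    r"(?:(?P<place>[^:.]+):\s*(?P<publisher>[^.]+)\.?)?\s*$"
)


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx file.

    `source` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Word often splits one sentence over several runs, so join them.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs


def parse_citation(text):
    """Return a dict of citation fields, or None if the text does not match."""
    match = CITATION_RE.match(text)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items() if value}
```

Calling parse_citation() on each string returned by docx_paragraphs() gives a dictionary for paragraphs that look like citations and None for headings and other notes, which is roughly the split between structured data and leftover text that an importer has to make.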

The code has been developed and run under Linux (specifically Debian Jessie and then CentOS 6) with the Apache web server. It doesn’t do anything terribly exciting with CGI though, so it should probably run OK on other platforms as long as you have a working Perl interpreter and the above modules installed. As distributed, it looks at the public Bodleian Library Z39.50 server in Oxford, but you’ll probably want to point it at your own library system’s Z39.50 server (the variable names are pretty self-explanatory in the code!).

The script gives a couple of options for output. The first is RIS format, which is a citation interchange format that quite a few systems accept. It also has the option of JSON output if you want to suck the data back into your own code. If you opt for JSON format you can also include a callback function name so that you can use JSONP (and thus make use of this script in JavaScript code running in web browsers).
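
As a rough illustration of the two output styles, the Python sketch below turns one citation record into RIS and wraps JSON in a JSONP callback. It is not the script's own Perl: the record layout, the function names and the callback check are assumptions for the example, and only a handful of the standard RIS tags (TY, AU, PY, TI, CY, PB, ER) are shown.

```python
import json
import re

# Only allow callback names that look like plain JavaScript identifiers
# (optionally dotted), so the JSONP wrapper cannot carry arbitrary script.
CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$.]*")


def to_ris(record):
    """Format one citation record (a dict) as a RIS entry for a book."""
    lines = ["TY  - BOOK"]
    for author in record.get("authors", []):
        lines.append("AU  - " + author)  # one AU line per author
    for tag, key in (("PY", "year"), ("TI", "title"),
                     ("CY", "place"), ("PB", "publisher")):
        if record.get(key):
            lines.append(f"{tag}  - {record[key]}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines) + "\n"


def to_json_response(records, callback=None):
    """Return plain JSON, or JSONP if a callback name is supplied."""
    body = json.dumps(records)
    if callback is None:
        return body
    if not CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe callback name")
    return f"{callback}({body});"
```

With JSONP the browser loads the response as a script, so a call such as to_json_response(records, "handleCitations") produces a line of JavaScript that hands the data to a function the page has already defined. Restricting the callback name is a common precaution when a CGI script echoes a user-supplied value back into executable output.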

We hope you find this script useful. And if you do feel up to tweaking and improving it, we’d love to get patches and fixes back!


Word document importer