gplus-profile-picture

Jon Knight

This user hasn't shared any profile information

Posts by Jon Knight

Word document importer

At this year’s Meeting the Reading List Challenge (MTRLC) workshop, my boss Gary Brewerton demonstrated one of the features we have in LORLS: the ability to ingest a Word document that contains Harvard(ish) citations. Our script reads in a Office Open XML.docx format Word document and spits out some structured data ready to import into a LORLS reading list.  The idea behind this is that academics still create reading lists in Word, despite us having had an online system for 15 years now. Anything we can do to make getting these Word documents into LORLS easier for them, the more likely it is that we’ll actually get to see the data. We’ve had this feature for a while now, and its one of those bits of code that we revisit every so often when we come across new Word documents that it doesn’t handle as well as we’d like.

The folk at MTRLC seemed to like it, and Gary suggested that I yank the core of the import code out of LORLS, bash it around a bit and then make it available as a standalone program for people to play with, including sites that don’t use LORLS.  So that’s what I’ve done – you can download the single script from:

https://lorls.lboro.ac.uk/WordImporter/WordImporter

The code is, as with the rest of LORLS, written in Perl. It makes heavy use of regular expression pattern matching and Z39.50 look ups to do its work.  It is intended to run as a CGI script, so you’ll need to drop it on a machine with a web server.  It also uses some Perl modules from CPAN that you’ll need to make sure are installed:

  • Data::Dumper
  • Algorithm::Diff
  • Archive::Any
  • XML::Simple
  • JSON
  • IO::File
  • ZOOM
  • CGI

The code has been developed and run under Linux (specifically Debian Jessie and then CentOS 6) with the Apache web server.  It doesn’t do anything terribly exciting with CGI though, so it should probably run OK on other platforms as long as you have  working Perl interpreter and the above modules installed. As distributed its looks at the public Bodleian Library Z39.50 server in Oxford, but you’ll probably want to point it at your own library system’s Z39.50 server (the variable names are pretty self-explanatory in the code!).

This script gives a couple of options for output.  The first is RIS format, which is an citation interchange format that quite a few systems accept.  It also has the option of JSON output if you want to suck the data back into your own code.  If you opt for JSON format you can also include a callback function name so that you can use JSONP (and thus make use of this script in Javascript code running in web browsers).

We hope you find this script useful.  And if you do feel up to tweaking and improving it, we’d love to get patches and fixes back!

The continuing battle of parsing Word documents for reading list material

At this year’s Meeting The Reading List Challenge (MTLRC) workshop, Gary Brewerton (my boss) showed the delegates one of our LORLS features: the ability to suck citation data out of Word .docx documents.  We’ve had this for a few years and it is intended to allow academics to take existing reading lists that they have produced in Word and import them relatively easily into our electronic reading lists system. The nice front end was written by my colleague Jason Cooper, but I was responsible for the underlying guts that the APIs call to try parsing the Word document and turn it into structured data that LORLS can understand and use. We wrote it originally based on suggestions from a few academics who already had reading lists with Harvard style references in them, and they used it to quickly populate LORLS with their data.

Shortly after the MTRLC workshop, Gary met with some other academics who also needed to import existing reading lists into LORLS.  He showed them our existing importer and, whilst it worked, it left quite alot entries as “notes”, meaning it couldn’t parse them into structured data.  Gary then asked me to take another look at the backend code and see if I could improve its recognition rate.

I had a set of “test” lists donated by the academics of varying lengths, all from the same department.  With the existing code, in some cases less than 50% of the items in these documents were recognised and classified correctly.  Of those, some were misclassified (eg book chapters appearing as books).

The existing LORLS .docx import code used Perl regular expression pattern matching alone to try to work out what sort of work a citation referred to this.  This worked OK with Word documents where the citations were well formed.  A brief glance through the new lists showed that lots of the citations were not well formed.  Indeed the citation style and layout seemed to vary from item to item, probably because they had been collected over a period of years by a variety of academics.  Whilst I could modify some of the pattern matches to help recognise some of the more obvious cases, it was clear that the code was going to need something extra.

That extra turned out to be Z39.50 look ups.  We realised that the initial pattern matches could be quite generalised to see if we could extract out authors, titles, publishers and dates, and then use those to do a Z39.50 look up.  Lots of the citations also had classmarks attached, so we’d got a good clue that many of the works did exist in the library catalogue.  This initial pattern match was still done using regular expressions and needed quite a lot of tweaking to get recognition accuracy up.  For example spotting publishers separated from titles can be “interesting”, especially if the title isn’t delimited properly.  We can spot some common cases, such as publishers located in London, New York or in US states with two letter abbreviations.  It isn’t fool proof, but its better than nothing.

However using this left us with only around 5% of the entries in the documents classified as unstructured notes when visual checking indicated that they were probably citations. These remaining items are currently left as notes, as there are a number of reasons why the code can’t parse them.

The number one reason is that they don’t have enough punctuation and/or formatting in the citation to allow regular expression to determine which parts are which. In some cases the layout doesn’t even really bear much relation to any formal Harvard style – the order of authors and titles can some time switch round and in some cases it isn’t clear where the title finishes and the publisher ends. In a way they’re a good example to students of the sort of thing they shouldn’t have in their own referencing!

The same problems encountered using the regular expressions would happen with a formal parser as these entries are effectively “syntax errors” if treating Harvard citations as a sort of grammar.  This is probably about the best we’ll be able to do for a while, at least until we’ve got some deep AI that can actually read the text and understand it, rather that just scan for patterns.

And if we reach that point LORLS will probably become self aware… now there’s a scary thought!

A LORLS virtual machine image

We’ve been thinking recently about how to make it easier for people to try out the LORLS code. This includes ourselves – we sometimes want to spin up a new instance of LORLS for testing some feature or helping another site debug their installation. Normally that would mean doing an operating system, Perl module and then LORLS installation on a new machine (physical or virtual) before it could be used.

With the spread of virtual machine (VM) infrastructure, and the fact that many Universities now use VMs widely, we thought it might be worth making a “LORLS in a box” VM appliance image that people could grab and then use for testing, demos or as the basis of their own installation. The VM image would have LORLS pre-installed along with all the basic Perl modules required in place, and the test data that we use in our sandbox demo.

To that end, here’s a first cut of a LORLS in a box VM image. This is a ZIP archive containing just two files – an  Open Virtualization Format (OFV) package describing the VM and a Virtual Machine Disk (VMDK) file system image.  Be warned that its quite large (over 900MB!) as it has a full operating system disc image included – you might want to have a cup of tea and biscuit handy whilst it downloads. The image was built using the Virtual Box OSE VM platform and is based on a CentOS 6 Linux base, and it should be able to be imported into other VM infrastructures such as VMware.

Once you’ve imported the LORLs-in-a-box VM into your VM instructure and started it up, you’ll eventually be presented with a login on the console.  This should also show you the IP address that the VM has picked up and the URL that you can use to get to the CLUMP web front end.  By default the networking in the VM is using a bridged interface that picks up an IPv4 address via DHCP.  This is fine for testing and development, though if you’re using this image as the basis of a production system you’ll probably want to nip in and change this to a static IP address.

To login to the system there’s a “lorls” user with the password “lorls4you” (both without the quotes).  This user can then act as the superuser by using sudo.  The MySQL server on the machine has a root user password of “LUMPyStuff!” (again quoteless) should you wish to go in and tinker with the database directly.  You probably want to change all these passwords (and Linux root password) as soon as you can as everyone now knows them!  You’ll also most probably want to edit /usr/local/LUMP/LUMP.pm file to point at your own site’s Z39.50 server, etc.

Koha loan history script

We’ve been looking at intergrating the Koha LMS with the LORLS reading list management system, as we know of sites using Koha that are interested in LORLS (and also because we’re interested in seeing how Koha works!).  After getting the basic integration for looking up works using Koha’s Z39.50 server and finding out item holdings/availability working last week, the next thing to tackle was to get loan histories out of Koha.

We use loan histories in LORLS’s purchase predictor code. We need to be able to grab an XML feed of both current and old loan issues, which is then used to work out what the peak number of concurrent loans have been made for an item and thus whether it has had sufficient demand to warrant purchasing additional copies.

For the current loans we need to know the date and time they were issued and for old issues we want both the issue date/time and the return date/time. For both current and old loan issues we also want to know the type (status) of the item (“long_loan”, “short_loan”, etc) and which department the borrower of the loan came from. The latter is so that we can apportion purchasing costs between different departments for the cases where multiple modules include the same books on their reading lists.

The item status is fairly easy to do in Koha – we’d already created item types in Koha and these can easily be mapped in the Perl code that implements the XML API into the long_loan, short_loan, week_loan, etc status format our purchase predictor code already expects.  Indeed if we wanted to we could make the items types in Koha just be “long_loan”, “short_loan” and “week_loan” so no mapping would be required, but a mapping function adds a bit of flexibility.

The borrowers’ department is a bit more involved.  It appears that in Koha this would be an “extended attribute” which needs to be enabled (it doesn’t appear to be on by default).  I created an extended borrower attribute type of called DEPT, and then entered some of Loughborough’s department codes as a controlled vocabulary for it.  In real life these would have to be slipped into Koha as part of a regularly (probably daily) borrower upload from our central reservation systems, which is roughly how we do it with our production Aleph LMS.  In our test environment I just added the extended attribute value manual to a couple of test users so that we could play with the code.

At the end of this posting you’ll find the resulting Perl code for creating this simple XML feed, which Koha sites might find handy even if they don’t use LORLS. One interesting thing to note in Koha is that the isbn field of the biblioitems table appears to contain more than one ISBN, separated by white space and vertical bar character (” | “). This means that you need to do a “like” match on the ISBN. This was a little unexpected and took me a while to track down what was wrong with my SQL when I had a simple “=” rather than a “like” in the select statements! The separation of biolios (works), biblioitems (manifestations) and items (er, items) is nicely done though.


#!/usr/bin/perl

use strict;
use lib '/usr/share/koha/lib';
use CGI;
use DBI;
use C4::Context;

$| = 1;
my $q = new CGI;
my $isbn = $q->param('isbn');
my $dept_code = $q->param('dept_code');
my $return_period = $q->param('return_period') || 365;

print STDOUT "Content-type: text/xmlnn";
print STDOUT "<loan_history>n";

my $status_map = {
    'BK' => 'long_loan',
    'SL BOOK' => 'short loan',
    'WL BOOK' => 'week loan',
    'REFBOOK' => 'reference',
};

my $dbh = C4::Context->dbh;
my $total_number = 0;

my $sql =
    'select items.itemnumber, issues.issuedate, biblioitems.itemtype, ' .
    '       borrower_attributes.attribute ' .
    'from biblioitems, items, issues, borrowers, borrower_attributes, ' .
    '     borrower_attribute_types ' .
    'where biblioitems.isbn like ' . $dbh->quote("%$isbn%") . ' and ' .
    '      biblioitems.biblioitemnumber = items.biblioitemnumber and ' .
    '      items.itemnumber = issues.itemnumber and ' .
    '      issues.borrowernumber = borrowers.borrowernumber and ' .
    '      borrowers.borrowernumber = borrower_attributes.borrowernumber and '.
    '      borrower_attributes.code = borrower_attribute_types.code and ' .
    '      borrower_attribute_types.description = "Department" and ' .
    '      borrower_attributes.attribute = ' . $dbh->quote($dept_code);

my $currentloan = $dbh->prepare($sql);
$currentloan->execute;
while (my ($id, $issuedate, $status, $bor_type) = $currentloan->fetchrow_array) {
    $status = $status_map->{$status};
    print STDOUT "  <loan>n";
    print STDOUT "    <issue_date>$issuedate</issue_date>n";
    print STDOUT "    <status>$status</status>n";
    print STDOUT "    <dept_code>$bor_type</dept_code>n";
    print STDOUT "    <number>1</number>n";
    print STDOUT "    <current>Y<current>n";
    print STDOUT "  </loan>n";
$total_number++;
}
$currentloan->finish;

my($sec,$min,$hour,$mday,$mon,$year,$wday,$yday) = gmtime(time-($return_period * 24 * 60 * 60));
$year += 1900;
$mon++;
my $target_date = sprintf("%04d%02d%02d",$year,$mon,$mday);

$sql =
    'select items.itemnumber, old_issues.issuedate, old_issues.returndate, ' .
    '       biblioitems.itemtype, borrower_attributes.attribute ' .
    'from biblioitems, items, old_issues, borrowers, borrower_attributes, ' .
    '     borrower_attribute_types ' .
    'where biblioitems.isbn like ' . $dbh->quote("%$isbn%") . ' and ' .
    '      biblioitems.biblioitemnumber = items.biblioitemnumber and ' .
    '      items.itemnumber = old_issues.itemnumber and ' .
    '      old_issues.returndate > ' . $dbh->quote($target_date) . ' and ' .
    '      old_issues.borrowernumber = borrowers.borrowernumber and ' .
    '      borrowers.borrowernumber = borrower_attributes.borrowernumber and '.
    '      borrower_attributes.code = borrower_attribute_types.code and ' .
    '      borrower_attribute_types.description = "Department" and ' .
    '      borrower_attributes.attribute = ' . $dbh->quote($dept_code);
my $pastloan = $dbh->prepare($sql);

$pastloan->execute;
while (my ($id, $issuedate, $return_date, $status, $bor_type) = $pastloan->fetchrow_array)
{
    $status = $status_map->{$status};
    print STDOUT "  <loan>n";
    print STDOUT "    <issue_date>$issuedate</issue_date>n";
    print STDOUT "    <return_date>$returndate</return_date>n";
    print STDOUT "    <status>$status</status>n";
    print STDOUT "    <dept_code>$bor_type</dept_code>n";
    print STDOUT "    <number>1</number>n";
    print STDOUT "    <current>N<current>n";
    print STDOUT "  </loan>n";
    $total_number++;
}
$pastloan->finish;

print STDOUT " <total_loan_count>$total_number</total_loan_count>n";

print STDOUT "</loan_history>n";

Integrating LORLS with Koha

At Loughborough we use Ex Libris’s Aleph as our Library Management System (LMS). Our LORLS reading list system also makes use of Aleph in a number of ways:

  1. To provide bibliographic information via Z39.50, allowing people to quickly add works to reading lists by simply entering the ISBN,
  2. In batch jobs to check which works the library holds, again using Z39.50,
  3. To find how many items for a work are held and how many are available for loan which is displayed on a work’s detail pop up in CLUMP.  This has been done using the Ex Libris X Server product (not to be confused with X11 servers!),
  4. To build up lists of current and past loans, used for purchase prediction, high demand reporting and usage tracking.  This is mostly done using custom local Perl scripts that interrogate Aleph’s Oracle database directly and then provide an XML API that LORLS code calls.

We’ve recently had another site say they are interested in using LORLS, but they use the Koha LMS.  They asked us what would be needed to allow LORLS to integrate with Koha?  Luckily Koha is open source, so we could just download a copy and install it on a new virtual machine to have a play with and see what was needed.

Looking up bibliographic information was pretty simple to support.  Koha has a built in Z39.50 server, so all we needed to do was to tweak the LUMP.pm file on our dev server to point the Z3950Hostname(), Z3950Port() and Z3950DBName() to point to our new VM, the port that Koha’s Z39.50 server is running on and the new Z39.50 database name (which appears to default to “biblios”).  That seemed to work a treat.

Get item holdings for works was a bit more involved.  Koha obviously doesn’t have Ex Libris’s X server, so we needed another way to get similar data.  Luckily Koha does implement some of the Digital Library Federation Integrated Library System – Discovery Interface recommendations.  One of these Application Programming Interfaces (APIs) is called GetRecords() and, given a set of system record identifiers, will return an XML document with just the sort of information we need (eg for each item linked to a work we get things like item type, whether it can be loaned, what its due date is if it is on loan, if it is damaged, lost, etc).

Unfortunately LORLS doesn’t know anything about Koha’s system record identifiers, and the GetRecords() ILS-DI API doesn’t appear to allow searches based on control numbers such as ISBNs.  To get the system record numbers we can however fall back on Z39.50 again, which Koha uses to implement the ILS-DI SRU type interfaces.  Searching for a Bib-1 attribute of type 1007 with the value set to be the ISBN gets us some nice USMARC records to parse.  We need to look at the 999 $c field in the MARC record as this appears to be the Koha system record identifier.

Just to make things interesting we discovered accidentally that in Koha you can end up with more than one work for the same ISBN (and each work can then have multiple physical items).  I guess this is a flexibility feature in some way, but it means that we need to make sure that we get all the system record identifiers that match our ISBN from LORLS and then pass all of these to the Koha GetRecords() API.  Luckily the API call can take a whole set of system record identifiers in one go, so this isn’t too much of a problem.

One thing we do have to have in the code though is some way of distinguishing between loan categories (long loan, short loan, week loan, reference, etc).  In Koha you can create an arbitrary number of item types which can correspond to loan categories, to which you can then assign things like differing loan rules. We slipped in:

  • BK – normal long loan book (the default in Koha it seems)
  • REFBOOK – a book that can’t be loaned.
  • SL BOOK – a short loan book (usually loaned for less than a day – our “high demand” stock),
  • WL BOOK – a book that can be loaned for a week (effectively moderately in demand works).

Our code currently then has these hard coded in order to return the same sort of holdings Perl structure that Aleph did. Extra item types assigned in Koha will need to been inserted into this code – we might have to think of a “nice” way of doing this if folk make lots of these changes on a regular basis but I suspect item types are one of those things that are configured when an LMS is setup and rarely, if ever, tweaked again.

In the StructuralUnit.pm Perl module I created a new method called Koha_ILS_DI_Holdings() to implement the new Koha item holdings and availability code.  The existing Holdings() method was renamed to Aleph_Holdings() and a new Holdings() method implemented that checks a new Holdings() method in LUMP.pm for the name of a holdings retrieval algorithm to use (currently now either “Koha:ILS-DI” which selects the new code, or anything else defaulting back to the old Aleph_Holdings() method).  This means that if someone else comes along with XYZ Corp LMS that uses some other whacky way of retrieving holdings availability we can simply write a new method in StructuralUnit.pm and add another if clause to the Holdings() method to allow a quick LUMP.pm change to select it.  The advantage of this is that other code in LORLS that uses the Holdings() method from StructuralUnit.pm doesn’t have to be touched – it is insulated from the messy details of which implementation is in use (ooh, object oriented programming at work!).

This appears to work on our test set up and it means we can just ship a new StructuralUnit.pm and LUMP.pm file to the folk with Koha and see how they get on with it.  Our next trick will be getting loan history information out of Koha – this may take a bit more work and its really replacing functionality that we’ve mostly implemented solely for Loughborough (using our custom scripts as even Aleph didn’t provide a usable API for what we needed).  It doesn’t appear at first glance that Koha’s ILS-DI APIs cover this use case – I guess we’re a bit odd in being interested in work loan histories!

Leveraging LORLS data for prospective student engagement

When we think of reading lists we often try to look at it from the point of view of one or more of our user communities. We’ve got academics who want to have a quick and easy way to add and remove material from lists that they can then use to guide their students’ learning. We’ve got the students themselves who want pointers to works that they’ll need for reports and projects, preferably with as many online resources as possible. And we’ve got the librarians who need to use reading lists to help manage the book stocks and ensure that library budgets are spent wisely.

However today I’ve been hacking on some code for a different user community: prospective students. There’s actually really two camps here as well: people who are thinking about applying for a course at the University and those that have already applied and are waiting to be accepted and join their course in the next academic year. In either camp though there are individuals who would like to get guidance on the sort of material they will be expected to read when they come to the University to study. Our librarians already get the occasional request come in along these lines so we know there is a latent interest in this. And this is where the reading list data base can come in….

The LORLS database at Loughborough holds quite a lot of data on module reading lists., with a pretty high coverage rate of modules across all the schools and departments on campus. As part of academics providing information about each work on their list we’ve asked them to, where possible, make judgements as to the level of recommendation each work has. For example they may say that one book is absolutely essential, another comes highly recommended if you can get it and a third is OK as an optional reading source to provide a fuller understanding of a topic. Obviously only the really core texts for a module are marked as essential, even on really large reading lists.

My new program makes use of these academic supplied recommendations to find out what books are judged as essential on foundation and first year undergraduate modules within a department. Whilst not all of these modules in all departments will be compulsory for all students, they do give a good idea of what the core reading is likely to be for students in their initial year on campus. Indeed having a list of essential works to flick through in a local library or bookshop (or via extracts on Google Books!) before they turn up may help some prospective students make quicker decisions about any module options they are offered.

The code is pretty Loughborough centric so it isn’t likely to appear in the LORLS distributions directly: it has to wander over to our central information systems front end to get a list of modules for a department’s first year (as some departments may share modules on joint honours courses). In the future we might want to get more granularity by allowing prospective students to specify not only the department they are interested in but the programme of study itself (by name and/or UCAS code for example). However we get the list of modules though, the next step is purely based on running through the LORLS database to find the works that have essential recommendation levels at the moment.

Of course that list of works might be extended if this is actually implemented as a production service. For example if a department doesn’t have any works in any foundation or first year module marked as “essential” (some don’t!), we might want the code to fall back to looking at recommended works that are heavily borrowed (so linking in with our Aleph LMS’s borrower history as we’ve done in the past for our high demand reports and purchase prediction code). Or the University might want to add some non-academic pastoral support texts to the list (time and money management techniques, study skills, etc). Lots of options to consider there.

Once we’ve got the list of books that can be used as suggestions to the prospective students, we have some more options in how to present them. At the moment the code implements two options. The first of these is to display a simple HTML page containing the citations. Each work’s ISBN is hyperlinked to Google Books so that the prospective student (or their friends and family) can see cover art, read some extracts and potentially buy a copy online.

At the bottom of this list we also provide the second option: a link to allow the student to add the works found to a new virtual bookshelf on Goodreads.com. From there they can look at reviews, rankings, share with their friends, and do all the other social reading things that Goodreads permits. We called the new shelf “loughborough-wishlist” as it may well be the books they’d like to buy (or have others buy for them!).

This second option is a specific implementation of something we’ve been thinking about more generally for a while: getting our LORLS data linked in with various third party social media and sharing services that the users are already using. We picked Goodreads.com because the API is clean and provides exactly what we need. We’d like to do it with other services such as LibraryThing.com, Google Books, Amazon wishlists or even Facebook (which does have a Books API surprisingly).

We are now waiting for our library colleagues to evaluate the results before we consider going live with this system. We’ll probably want to pass it by the academic departments before it gets sent out to prospective students so that they have a chance to check what, if any, recommendations have been made for readings during the first year that will show up to prospective students.

However, what this program does demonstrate though is that once we have a suitably large body of data in a library system we can start to look at it in new ways and help new user communities. In this case the more that can be done to help engage prospective students and ease their transition into being a student, the better the chance that they’ll make the right choice and pick Loughborough as the place for their academic career.

Prototyping Goodreads integration

Part of my day job is developing and gluing together library systems. This week I’ve been making a start on doing some of this “gluing” by prototyping some code that will hopefully link our LORLS reading list management system with the Goodreads social book reading site. Now most of our LORLS code is written in either Perl or JavaScript; I tend to write the back end Perl stuff that talks to our databases and my partner in crime Jason Cooper writes the delightful, user friendly front ends in JavaScript. This means that I needed to get a way for a Perl CGI script to take some ISBNs and then use them to populate a shelf in Goodreads. The first prototype doesn’t have to look pretty – indeed my code may well end up being a LORLS API call that does the heavy lifting for some nice pretty JavaScript that Jason is far better at producing than I am!

Luckily, Goodreads has a really well thought out API, so I lunged straight in. They use OAuth 1.0 to authenticate requests to some of the API calls (mostly the ones concerned with updating data, which is exactly what I was up to) so I started looking for a Perl OAuth 1.0 module on CPAN. There’s some choice out there! OAuth 1.0 has been round the block for a while so it appears that multiple authors have had a go at making supporting libraries with varying amounts of success and complexity.

So in the spirit of being super helpful, I thought I’d share with you the prototype code that I knocked up today. Its far, far, far from production ready and there’s probably loads of security holes that you’ll need to plug. However it does demonstrate how to do OAuth 1.0 using the Net::OAuth::Simple Perl module and how to do both GET and POST style (view and update) Goodreads API calls. Its also a great way for me to remember what the heck I did when I next need to use OAuth calls!

First off we have a new Perl module I called Goodreads.pm. Its a super class of the Net::OAuth::Simple module that sets things up to talk to Goodreads and provides a few convenience functions. Its obviously massively stolen from the example in the Net::OAuth::Simple perldoc that comes with the module.

#!/usr/bin/perl

package Goodreads;

use strict;
use base qw(Net::OAuth::Simple);

sub new {
    my $class = shift;
    my %tokens = @_;

    return $class->SUPER::new( tokens => %tokens,
                               protocol_version => '1.0',
                               return_undef_on_error => 1,
                               urls => {
                                 authorization_url => 'http://www.goodreads.com/oauth/authorize',
                                 request_token_url => 'http://www.goodreads.com/oauth/request_token',
                                 access_token_url => 'http://www.goodreads.com/oauth/access_token',
                               });
}

sub view_restricted_resource {
    my $self = shift;
    my $url = shift;
    return $self->make_restricted_request($url, 'GET');
}

sub update_restricted_resource {
    my $self = shift;
    my $url = shift;
    my %extra_params = @_;
    return $self->make_restricted_request($url, 'POST', %extra_params);
}

sub make_restricted_request {
    my $self = shift;
    croak $Net::OAuth::Simple::UNAUTHORIZED unless $self->authorized;

    my( $url, $method, %extras ) = @_;

    my $uri = URI->new( $url );
    my %query = $uri->query_form;
    $uri->query_form( {} );

    $method = lc $method;

    my $content_body = delete $extras{ContentBody};
    my $content_type = delete $extras{ContentType};

    my $request = Net::OAuth::ProtectedResourceRequest->new(
        consumer_key => $self->consumer_key,
        consumer_secret => $self->consumer_secret,
        request_url => $uri,
        request_method => uc( $method ),
        signature_method => $self->signature_method,
        protocol_version => $self->oauth_1_0a ?
            Net::OAuth::PROTOCOL_VERSION_1_0A :
            Net::OAuth::PROTOCOL_VERSION_1_0,
        timestamp => time,
        nonce => $self->_nonce,
        token => $self->access_token,
        token_secret => $self->access_token_secret,
        extra_params => { %query, %extras },
    );
    $request->sign;
    die "COULDN'T VERIFY! Check OAuth parameters.n"
        unless $request->verify;

    my $request_url = URI->new( $url );

    my $req = HTTP::Request->new(uc($method) => $request_url);
    $req->header('Authorization' => $request->to_authorization_header);
    if ($content_body) {
        $req->content_type($content_type);
        $req->content_length(length $content_body);
        $req->content($content_body);
    }

    my $response = $self->{browser}->request($req);
    return $response;
}
1;

Next we have the actual CGI script that makes use of this module. This shows how to call the Goodreads.pm (and thus Net::OAuth::Simple) and then do the Goodreads API calls:

#!/usr/bin/perl

use strict;
use CGI;
use CGI::Cookie;
use Goodreads;
use XML::Mini::Document;
use Data::Dumper;

my %tokens;
$tokens{'consumer_key'} = 'YOUR_CONSUMER_KEY_GOES_IN_HERE';
$tokens{'consumer_secret'} = 'YOUR_CONSUMER_SECRET_GOES_IN_HERE';

my $q = new CGI;
my %cookies = fetch CGI::Cookie;

if($cookies{'at'}) {
    $tokens{'access_token'} = $cookies{'at'}->value;
}
if($cookies{'ats'}) {
    $tokens{'access_token_secret'} = $cookies{'ats'}->value;
}
if($q->param('isbns')) {
    $cookies{'isbns'} = $q->param('isbns');
}
my $oauth_token = undef;
if($q->param('authorize') == 1 && $q->param('oauth_token')) {
    $oauth_token = $q->param('oauth_token');
} elsif(defined $q->param('authorize') && !$q->param('authorize')) {
    print $q->header,
    $q->start_html,
    $q->h1('Not authorized to use Goodreads'),
    $q->p('This user does not allow us to use Goodreads');
    $q->end_html;
    exit;
}

my $app = Goodreads->new(%tokens);

unless ($app->consumer_key && $app->consumer_secret) {
    die "You must go get a consumer key and secret from Appn";
}

if ($oauth_token) {
    if(!$app->authorized) {
        GetOAuthAccessTokens();
    }
    StartInjection();
} else {
    my $url = $app->get_authorization_url(callback => 'https://example.com/cgi-bin/good-reads/inject');
    my @cookies;
    foreach my $name (qw(request_token request_token_secret)) {
        my $cookie = $q->cookie(-name => $name, -value => $app->$name);
        push @cookies, $cookie;
    }
    push @cookies, $q->cookie(-name => 'isbns',
                              -value => $cookies{'isbns'} || '');
    print $q->header(-cookie => @cookies,
                     -status=>'302 Moved',
                     -location=>$url,
                     );
}

exit;

sub GetOAuthAccessTokens {
    foreach my $name (qw(request_token request_token_secret)) {
        my $value = $q->cookie($name);
        $app->$name($value);
    }
    ($tokens{'access_token'},
     $tokens{'access_token_secret'}) =
        $app->request_access_token(
            callback => 'https://example.com/cgi-bin/goodreads-inject',
            );
}

sub StartInjection {
    my $at_cookie = new CGI::Cookie(-name=>'at',
                                    -value => $tokens{'access_token'});
    my $ats_cookie = new CGI::Cookie(-name => 'ats',
                                     -value => $tokens{'access_token_secret'});
    my $isbns_cookie = new CGI::Cookie(-name => 'isbns',
                                       -value => '');
    print $q->header(-cookie=>[$at_cookie,$ats_cookie,$isbns_cookie]);
    print $q->start_html;

    my $user_id = GetUserId();
    if($user_id) {
        my $shelf_id = LoughboroughShelf(user_id => $user_id);
        if($shelf_id) {
            my $isbns = $cookies{'isbns'}->value;
            print $q->p("Got ISBNs list of $isbns");
            AddBooksToShelf(shelf_id => $shelf_id,
                            isbns => $isbns,
                           );
        }
    }

    print $q->end_html;
}

sub GetUserId {
    my $user_id = 0;
    my $response = $app->view_restricted_resource(
                       'https://www.goodreads.com/api/auth_user'
                   );
    if($response->content) {
        my $xml = XML::Mini::Document->new();
        $xml->parse($response->content);
        my $user_xml = $xml->toHash();
        $user_id = $user_xml->{'GoodreadsResponse'}->{'user'}->{'id'};
    }
    return $user_id;
}

sub LoughboroughShelf {
    my $params;
    %{$params} = @_;

    my $shelf_id = 0;
    my $user_id = $params->{'user_id'} || return $shelf_id;

    my $response = $app->view_restricted_resource('https://www.goodreads.com/shelf/list.xml?key=' .  $tokens{'consumer_key'} . '&user_id=' . $user_id);
    if($response->content) {
        my $xml = XML::Mini::Document->new();
        $xml->parse($response->content);
        my $shelf_xml = $xml->toHash();
        foreach my $this_shelf (@{$shelf_xml->{'GoodreadsResponse'}->{'shelves'}->{'user_shelf'}}) {
            if($this_shelf->{'name'} eq 'loughborough-wishlist') {
                $shelf_id = $this_shelf->{'id'}->{'-content'};
                last;
            }
        }
        if(!$shelf_id) {
            $shelf_id = MakeLoughboroughShelf(user_id => $user_id);
        }
    }
    print $q->p("Returning shelf id of $shelf_id");
    return $shelf_id;
}

sub MakeLoughboroughShelf {
    my $params;
    %{$params} = @_;

    my $shelf_id = 0;
    my $user_id = $params->{'user_id'} || return $shelf_id;

    my $response = $app->update_restricted_resource('https://www.goodreads.com/user_shelves.xmluser_shelf[name]=loughborough-wishlist',
        );
    if($response->content) {
        my $xml = XML::Mini::Document->new();
        $xml->parse($response->content);
        my $shelf_xml = $xml->toHash();
        $shelf_id = $shelf_xml->{'user_shelf'}->{'id'}->{'-content'};
        print $q->p("Shelf hash: ".Dumper($shelf_xml));
    }
    return $shelf_id;
}

sub AddBooksToShelf {
    my $params;
    %{$params} = @_;

    my $shelf_id = $params->{'shelf_id'} || return;
    my $isbns = $params->{'isbns'} || return;
    foreach my $isbn (split(',',$isbns)) {
        my $response = $app->view_restricted_resource('https://www.goodreads.com/book/isbn_to_id?key=' . $tokens{'consumer_key'} . '&isbn=' . $isbn);
        if($response->content) {
            my $book_id = $response->content;
            print $q->p("Adding book ID for ISBN $isbn is $book_id");
            $response = $app->update_restricted_resource('http://www.goodreads.com/shelf/add_to_shelf.xml?name=loughborough-wishlist&book_id='.$book_id);
        }
    }
}

You’ll obviously need to get a developer consumer key and secret from the Goodreads site and pop them into the variables at the start of the script (no, I’m not sharing mine with you!). The real work is done by the StartInjection() subroutine and the subordinate subroutines that it then calls once the OAuth process has been completed. By this point we’ve got an access token and its associated secret so we can act as whichever user has allowed us to connect to Goodreads as them. The code will find this user’s Goodreads ID, see if they have a bookshelf called “loughborough-wishlist” (and create it if they don’t) and then add any books that Goodreads knows about with the given ISBN(s). You’d call this CGI script with a URL something like:

https://example.com/cgi-bin/goodreads-inject?isbns=9781565928282

Anyway, there’s a “works for me” simple example of talking to Goodreads from Perl using OAuth 1.0. There’s plenty of development work left in turning this into production level code (it needs to be made more secure for a start off, and the access tokens and secret could be cached in a file or database for reuse in subsequent sessions) but I hope some folk find this useful.

Reading lists and high demand items

Having a reading list management system has a number of advantages, including providing the Library with better management information to help manage its stock.  At Loughborough we’ve been working a couple of scripts that tie LORLS into our Aleph library management system.  These are pretty “bespoke” and probably won’t form part of the normal LORLS distribution as they rely on quite a bit on the local “information landscape” here, but it might be useful to describe how they could help.

A good example is our new high demand items report.  This script (or rather interlinked set of scripts as it runs on several machines to gather data from more than one system) tries to provide library staff with a report of items that have recently been in high demand, and potentially provide hints to both them and academics about reading lists that might benefit from these works being added.

The first part of the high demand system goes off to our Aleph server to peer into both requests for items and loan information.  We want to look for works that have more than a given number of hold requests placed on them in a set time period and/or more than a threshold number of simultaneous loans made.  These parameters are tunable and the main high demand script calls out via an XML API to another custom Perl script that lives on Aleph server that does the actual Oracle SQL jiggery-pokery.  This is actually quite slow – its trawling through a lot of data potentially (especially if the period of time that the query is applied over is quite long – we tended to limit queries to less than a week usually otherwise the whole process can take many hours to run).

This script returns a list of “interesting” items that match our high demand criteria and the users that have requested and/or borrowed them.  The main script can then use this user information to look up module membership.  It does this using LDAP queries to our Active Directory servers, as every module appears as a group in the AD with users being group members.  We can then look for groups of users interested in a particular item that share module membership. If we get more than a given (again tunable) number we check if this work is already mentioned on the reading list for the module and, if not, we can optionally make a note about the work’s ISBN, authors, title, etc into a new table.  This can then be used in the CLUMP academic dashboard to show suggestions about potentially useful extra works that might be suitable for the reading list.

The main script then marshals the retrieved data and generates an HTML table out of it, one row for each work.  We include details of ISBN, author, title, shelfmark, number of active reservations, maximum concurrent loans, the different types of loan, an average price estimate (in case extra copies may be required) and what reading lists (if any) the work is already on.  We can also optionally show the librarians a column for the modules that users trying to use this work are studying, as well as two blank columns for librarians to make their own notes in (when printed out – some still like using up dead trees!).

The main script can either be run as a web based CGI script or run from the command line.  We mostly use the latter (via cron) to provide emailed reports, as the interactive CGI script can take too long to run as mentioned earlier.  It does demonstrate how reading lists, library management systems and other University registration systems can work together to provide library staff with new information views on how stock is being used and by who.

Purchasing Prediction

For a while now we’ve been feverishly coding away at the top secret LORLS bunker on a new library management tool attached to LORLS: a system to help predict which books a library should buy and roughly how many will be required.  The idea behind this was that library staff at Loughborough were spending a lot of time trying to work out what books needed to be bought, mostly based on trawling through lots of data from systems such as the LORLS reading list management system and the main Library Management System (Aleph in our case).  We realised that all of this work was suitable for “mechanization” and we could extract some rules from the librarians about how to select works for purchase and then do a lot of the drudgery for them.

We wrote the initial hack-it-up-and-see-if-the-idea-works-at-all prototype about a year ago.  It was pretty successful and was used in Summer 2012 to help the librarians decide how to spend a bulk book purchasing budget windfall from the University.  We wrote up our experiences with this code as a Dlib article which was published recently.  That article explains the process behind the system, so if you’re interested in the basis behind this idea its probably worth a quick read (or to put it in other terms, I’m too lazy to repeat it here! 😉 ).

Since then we’ve stepped back from the initial proof of concept code and spent some time re-writing it to be more maintainable and extensible.  The original hacked together prototype was just a single monolithic Perl script that evolved and grew as we tried ideas out and added features.  It has to call out to external systems to find loan information and book pricing estimates with bodges stuffed on top of bodges.  We always intended it to be the “one we threw away” and after the summer tests it became clear that the purchase prediction could be useful to the Library staff and it was time to do version 2.  And do it properly this time round.

The new version is modular… using good old Perl modules! Hoorah! But we’ve been a bit sneaky: we’ve a single Perl module that encompasses the purchase prediction process but doesn’t actually do any of it.  This is effectively a “wrapper” module that a script can call and which knows about what other purchase prediction modules are available and what order to run them in.  The calling script just sets up the environment (parsing CGI parameters, command line options, hardcoded values or whatever is required), instantiates a new PurchasePrediction object and repeatedly calls that object’s DoNextStep() method with a set of input state (options, etc) until the object returns an untrue state.  At that point the prediction has been made and suggestions have, hopefully, been output.  Each call to DoNextStep() runs the next Perl module in the purchase prediction workflow.

The PurchasePrediction object comes with sane default state stored within it that provides standard workflow for the process and some handy information that we use regularly (for example ratios of books to students in different departments and with different levels of academic recommendation in LORLS.  You tend to want more essential books than you do optional ones obviously, whilst English students may require use of individual copies longer than Engineering students do).  This data isn’t private to the PurchasePredictor module  though – the calling script could if it wished alter any of it (maybe change the ratio of essential books per student for the Physics department or even alter the workflow of the purchase prediction process).

The workflow that we currently follow by default is:

  1. Acquire ISBNs to investigate,
  2. Validate those ISBNs,
  3. Make an initial estimate of the number of copies required for each work we have an ISBN for,
  4. Find the cost associated with buying a copy of each work,
  5. Find the loan history of each work,
  6. Actually do the purchasing algorithm itself,
  7. Output the purchasing suggestions in some form.

Each of those steps is broken down into one or more (mostly more!) separate operations, with each operation then having its own Perl module to instantiate it.  For example to acquire ISBNs we’ve a number of options:

  1. Have an explicit list of ISBNs given to us,
  2. Find the ISBNs of works on a set of reading list modules,
  3. Find the ISBNs of works on reading lists run by a particular department (and optionally stage within that department – ie First Year Mech Eng)
  4. Find ISBNs from reading list works that have been edited within a set number of days,
  5. Find all the ISBNs of books in the entire reading list database
  6. Find ISBNs of works related to works that we already have ISBNs for (eg using LibraryThing’s FRBR code to find alternate editions, etc).

Which of these will do anything depends on the input state passed into the PurchasePredictor’s DoNextStep() method by the calling script – they all get called but some will just return the state unchanged whilst others will add new works to be looked at later by subsequent Perl modules.

Now this might all sound horribly confusing at first but it has big advantages for development and maintenance. When we have a new idea for something to add to the workflow we can generate a new module for it, slip it into the list in the PurchasePredictor module’s data structure (either by altering this on the fly in a test script or, once the new code is debugged, altering the default workflow data held in the PurchasePredictor module code) and then pass some input to the PurchasePredictor module’s DoNextStep() method that includes flags to trigger this new workflow.

For example until today the workflow shown in the second list above did not include the step “Find ISBNs from reading list works that have been edited within a set number of days”; that got added as a new module written from scratch to a working state in a little over two hours.  And here’s the result, rendered as an HTML table:

Purchase Predictor pp2 HTML showing suggested purchases for works in LORLS edited in the last day.

Purchase Predictor pp2 HTML showing suggested purchases for works in LORLS edited in the last day.

As you can see the purchase predictor HTML output in this case tries to fit in with Jason’s new user interface, which it can do easily as that’s all encompassed in Perl modules as well!

There’s still lots more work we’ve can do with purchase prediction.  As one example, the next thing on my ‘To Do’ list is to make an output module that generates emails for librarians so that it can be run as batch job by cron every night.  The librarians can then sip their early morning coffee whilst pondering which book purchasing suggestions to follow up.  The extensible modular scheme also means we’re free to plug in different actual purchasing algorthims… maybe even incorporating some machine learning, with feedback provided by the actual purchases approved by the librarians against the original suggestions that the system made.

I knew those undergraduate CompSci Artificial Intelligence lectures would come in handy eventually… 🙂

Item Ratings

One of the things on the LORLS “to do” list from last summer’s Meeting the Reading Lists Challenge workshop was to have ratings on lists and/or items. Gary and I had a chat about this earlier today and decided that if we’re going to do it, it would probably be better just on items rather than lists. That way students are commenting on how useful they found individual books, articles, etc rather than the academics reading list as a whole, so there would probably be more acceptance from academics.

I’ve thus produced two new API calls to support this:

  • GetSURating – get the ratings for particular structural units and/or users. If you give it a “suid” parameter with a comma separated list of structural unit IDs it will try to find ratings for those. You can also ask for rating set by one or more users by specifying the “user_id” parameter with a comma separate list of user IDs (the latter mostly because I thought we might want to allow folk in the future to see which books they’d rated, a bit like LibraryThing does). The script normally returns some XML with the ratings for each matched SUID (good and bad). You can give it a “details” parameter set to ‘Y’ in which case it will just splurge out XML with all the matching records in (including creation/modification times, etc so we could do fancy time based rating analysis).
  • Editing/EditSURating – create/edit a rating. Needs to have the structural unit ID send in via the “suid” parameter and the rating itself (either “Good” or “Bad”) via the “rating” parameter. No user_id parameter as it takes the logged in user as the person to create the rating from. It returns a summary of the current ratings for the structural unit after updating for this user. If you’re not logged in it does nothing.

Each user can click on ratings for a particular structural unit as many times as they like, but they’ll only have one active record. That means that you can rate something as “Bad” at first, then re-read it later and decide that you were wrong and its actually “Good” and re-rate it. Your old “Bad” rating is replaced by the “Good” rating.

We’ve already been talking about how we can make use of the data once the students start rating items. For example we could have a graph or scatter chart in the academic’s dashboard showing them four quadrants: rarely borrowed items that aren’t liked, rarely borrowed items that are liked, heavily borrowed items that aren’t liked and heavily borrowed items that are liked.  This would provide some feedback to academics on how useful the students found the material on their reading lists, and would also potentially supply some useful information to library staff.  You could imagine that a very expensive book that the library has put off buying a copy of but which is heavily liked by people who’ve acquired/seen copies elsewhere might persuade library staff to order a copy for instance.

I’ve got a proof of concept up and running now on our test/dev server.  This shows thumbs up/down on the bibiographic details popup in CLUMP for leaf items (books, journals, etc). As it requires a change to the LUMP database schema (in fact a whole new user_item_rating table), this isn’t going to be a LORLS v6.x thing but instead a LORLS v7 feature.

Oh crumbs, I’ve started working on the next version of LORLS already! 🙂

Jon Knight's RSS Feed
Go to Top