Archive for September, 2011
Converting structural units from one type to another
With LORLS6, we’ve moved away from the older LORLS “one size fits all” approach for the resources we hold in reading lists to having structural units, each of which is of a particular type. Those types can be books, book chapters, journals, journal articles, reading lists, modules, departments and, er, well more or less anything the system’s administrator wants them to be. This gives the system some great flexibility, especially seeing as the structural unit types also have sets of valid parent/child links between different types (so that departments can appear under an institution, but not under a book for example). It also means that we can provide validation for what goes into each type, and treat different types of resource in different ways.
All well and good. But want if you need to change the type of a structural unit? Why would you need to do that? Well, it might be that someone has mistakenly created a load of resources as books when they should be journal articles. Or you’ve introduced a new organisational unit between an institution and a department (call it, oooh, say a school) and need to change some departments into the new type. Until now you either needed to delete the existing structural unit and then create a new one of the new type, or get down and dirty with the database to tweaking the relationships at a SQL level (we’ve done the latter at Loughborough for example, for the “introducing schools” reason given above).
However we thought it might be an idea to provide an API to allow this conversion to be done programmatically, and potentially have a nice webby user interface wrapped round it so that end users/librarians can do this themselves. So enter stage left: the ConvertSU API call.
This new call normally takes two parameters: su_id and target_sut_id. The former is the ID of the structural unit we’re diddling around with, whilst the latter is the ID of the structural type that we’re planning on changing it to. Behind the scenes there’s a rather large Perl data structure that tells the script which structural unit types can be converted to which other types, and also what the mapping is for individual data elements.
Now there are some things to note about this:
- Not all conversions from one type to another are possible, or even desirable. For example we don’t want to turn books into departments usually. If the script doesn’t have a valid conversion detailed in its data structure, it fails the mapping, leaving the structural unit untouched.
- Even if we have a conversion between one structural unit type and another, not all the data types that are valid in the original structural unit type are going to be valid in the new one, so that in general the conversion process is lossy with some data elements being dropped from the structural unit. If we do have to drop a data element like this, we’ve decided to pop the details into a new Librarian Note, so at least the library staff can see what has been lost (handy if someone fumbles and converts a structural unit they then decide that they didn’t mean to!).
- If we do have a conversion mapping from one data type in the old structural unit type to a new data type in the new structural unit type, we might still need to tweak the data in the related data element(s) as we do the conversion. Part of this might be that we need to amalgamate elements of a repeatable data type in the original structural unit type into a non-repeatable data type in the new structural unit type. We need to give some options for how this is done (do we take the first data element we find? Or concatenate all the data elements of this type together?)
- There are some situations where even if the data types are both repeatable/non-repeatable we might want to do some post processing of the element value. For example we might want to turn the contents of an ISBN field into an ISSN, but only if the ISBN looks a bit like an ISSN to start with (if for example we had some dodgy imports from LORLS5… cough, cough! 🙂 ). To allow for this, we permit Perl regular expression substitutions to be specified in the conversion data structure in the script – with enough evilness in hand Perl regexp can do most data conversions, even if there are considered by some to be write only coding!
The ConvertSU API also has an additional parameter called dump_rules which if set to a non-zero value will turn the rules data structure in the Perl script into an XML structure and return that to the client without doing a conversion. This allows the client to see what types are convertible to other types, and so restrain from providing users with conversion options that don’t work.
So far the ConvertSU API is sitting on our development server and seems to be able to happily convert book chapters to articles and back again. Jason is now going to step up to the plate to make use of this new API in the development version of the CLUMP client (and no doubt find a load of bugs that I’ve left in the alpha code for him! 😉 ). Assuming there’s no show stoppers, we can then release it to our production system and see how our users get on with it, before slipping it into the LUMP/CLUMP GPL distribution.
In the fullness of time, we might want to abstract the conversion rules data structure out of the script and into the database. That would allow us to produce admin front ends to permit the rule set to be edited without folk having to fire up emacs or vi and edit the ConvertSU Perl script. We’ve not taken that route yet because we wanted to make sure that the basic logic worked and was useful, and also we’d prefer that any changes that result in a change to the underlying database schema be rolled into LORLS7. In other words database scheme updates are a major version number change. We’ve already got a couple of other schema changes on the cards for the future – for example over the summer we did a spot of work on logging guest views into a new table in the database so that we can provide academics and librarians with feedback on how often the students are looking at the reading lists (and even possibly what they are looking at).
There’s no time scale on LORLS7 deployment though before anyone asks! 🙂
Improving performance
As we seemed to have reached a stable period for our codebase I decided it was time to start digging into LUMP’s code to see where we could make some performance increases.
I took a standard performance improvement approach of installing a profiler (NYTProf) on LUMP’s development server and configuring apache to use it for Perl CGI scripts. After a short period of profiling three places where the performance could be improved were identified.
The first improvement was the removal of a system call (using backticks) to the whoami program to identify a users login. Normally this wouldn’t make much difference but the call was in a sub procedure that would be called 30 or 40 times from the FastGetStructuralUnit script. The multiple times it had to spawn a new process to run whoami pushed this line to the top of the profiler’s output. A quick change from `whoami` to getlogin() and the line disappeared from the profiler’s list of time consuming lines.
The second improvement was tucked away in the BaseSQL module. The MySQL call to get the ID of the last inserted item was regularly appearing in the profilers output as a slow call. Upon digging in to the code I saw the line
$sql = "select LAST_INSERT_ID() from ".$self->{_table};
This is a very instinctive line to code when you want to get the last ID of an item inserted into a specific table (almost always inserted by the previous SQL call). The problem is that LAST_INSERT_ID() returns the ID of the last row inserted in the current session and not of the specified table. Calling it in the way that we were was a big waste of time. Here is an example to demonstrate why it is waste of time.
mysql> select count(LAST_INSERT_ID()) from data_element;
1933973
By calling it in the way we were MySQL would pass back a row, consisting of the LAST_INSERT_ID() value, for every row in the table we specified. The solution was simply to change the line to
mysql> select count(LAST_INSERT_ID());
1
This change actually knocked a second of some of our slower API’s running time and also reduced the load on the MySQL server.
The third factor that stood out as slowing down most API calls was the module being used to build the XML response, XML::Mini. XML::Mini is a very powerful module for processing very complex XML documents. We didn’t need to use much of its potential as our XML structure was deliberately kept simple. An hour later we had produced XMLify which would take the same hash reference we passed to XML::Mini and produce compatible output. By using our own XMLify sub procedure we managed to reduce the run time of our large FastGetStructuralUnit calls by over 1 second.