It's been a while since the last update, and that's mostly because we've been banging our heads against a speed issue with importing the old LORLSv5 reading lists into LUMP.

The first cut of the importer seemed to work, but it barfed due to lack of disc space on my workstation before completing the run. For dev/testing the rest of the API that was fine, as we had enough data to play with. It was only when we installed a new virtual hosting server and I created myself a dedicated test virtual server with oodles of RAM and disc space that we discovered that the import would work… but it would take around two weeks to do four years' worth of data. Ah, not good. Especially as the virtual host is supposed to be relatively big and fast (OK, it's a virtual server, so we can't count spindles directly as the filesystem is stuffed inside another file on the RAID array on the host, but it should still be plenty fast enough for LUMP to run, otherwise folk with older server hardware are going to be stuffed).

We’ve tried a number of different options to help sort this out. These have included:

Search caching in Perl
Tweaking the BaseSQL module to allow a Perl hash-based cache to be turned on and off (and the status of caching checked). This is then used by some of the higher-layer modules that encapsulate a few of the tables (StructuralUnit, DataElement, DataType, DataTypeGroup) to see if a search matches a previous one and, if caching is turned on, return the results immediately from the Perl hash without hitting the database. Any update on the table in question invalidates the cache. Reading the cached copy is much faster than accessing the database, so this can be a big win, especially on tables that are updated relatively infrequently. Unfortunately we do quite a bit of updating on the StructuralUnit and DataElement tables.
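As a rough sketch of the idea (the method and attribute names here are made up for illustration, not the actual BaseSQL code), the cache is just a package-level hash keyed on the search parameters, checked before the database is touched and thrown away whenever the table is written to:

    package BaseSQL;
    use strict;

    my %search_cache;      # hypothetical cache of previous search results
    my $caching_on = 0;

    sub CacheOn     { $caching_on = 1 }
    sub CacheOff    { $caching_on = 0; %search_cache = (); }
    sub CacheStatus { return $caching_on }

    sub Search {
        my ($self, %params) = @_;
        my $key = join '|', map { "$_=$params{$_}" } sort keys %params;
        if ($caching_on && exists $search_cache{$key}) {
            return $search_cache{$key};            # no SQL round trip needed
        }
        my $results = $self->_search_db(%params);  # the real SELECT
        $search_cache{$key} = $results if $caching_on;
        return $results;
    }

    # Called after any INSERT/UPDATE/DELETE touches the table
    sub InvalidateCache { %search_cache = () }

The hash lookup is effectively free compared with a round trip to the database, which is why it pays off on tables that are mostly read but not so much on ones we keep writing to.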
A reload() method
Quite a lot of the time we create a new Perl object on a database table to do (for example) a search and then later have to do another new() on the same object to instantiate it with an existing row from the table (based on the id field). Every call to new() reinterrogates the database to find out the fields for the table concerned and then recreates the Perl object from scratch. However, the fields are unlikely to change from call to call (certainly during an import), so this is just wasted time. A reload() method has been added so that you can instantiate the object from a known row in the database via the id field without regenerating the whole Perl object or re-querying the database for the available fields. This results in a slight but noticeable saving.
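The shape of it is roughly this (a sketch using DBI directly, with made-up attribute names rather than the real module internals):

    # Reuse the field list discovered by the original new() and just refetch
    # the row by id, rather than rebuilding the object from scratch.
    sub reload {
        my ($self, $id) = @_;
        my $fields = join ', ', @{ $self->{fields} };   # cached at new() time
        my $row = $self->{dbh}->selectrow_hashref(
            "SELECT $fields FROM $self->{table} WHERE id = ?",
            undef, $id
        );
        $self->{data}{$_} = $row->{$_} for keys %$row;  # repopulate in place
        return $self;
    }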
Adding noatime and nodiratime mount options
Normally on a Linux ext3 filesystem (which is what we're running), the access times of files and directories are updated each time they are accessed. Of course that means that every SQL SELECT is effectively also a write on the filesystem. The noatime and nodiratime options to mount (slipped into /etc/fstab) turn this behaviour off. You don't even have to reboot for it to come into effect: mount -o remount / is your friend! This should remove another disk-related bottleneck during the import.
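For reference, the change is just a couple of extra options on the root filesystem's line in /etc/fstab (the device name below is only an example), followed by a remount:

    # /etc/fstab
    /dev/sda1   /   ext3   defaults,noatime,nodiratime   1 1

    # pick up the new options without rebooting
    mount -o remount /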
Tweaking the ACL CopyRights() method
Every time a row is inserted into the structural_unit table for a new SU, several rows get stuck into the data_element and access_control_list tables. Looking at the latter, during the import many of these rows are created as a result of the CopyRights() method on the AccessControlList Perl object. This method allows several fancy options, such as overwriting existing ACLs and cascading ACLs from a parent to all its children. Neither of these directly applies in the case of building a new SU and copying the access rights from its direct parent, yet we still had to do a load of SQL to support them. Therefore a new parameter called "new" was added to the method to indicate that the ACLs were being copied for a new SU, which allowed some short cutting. One part of this short cutting was to use a single INSERT INTO...SELECT FROM... SQL construct. This should be fast because it just copies data internally in the database engine (using a temporary table, as both the source and target tables are the same in our case) and doesn't need to have results sent to/from the Perl script. This appears to be quite a big win: performance with this and the previous two tweaks now hits 1000+ SUs and associated data being created every 10 minutes or so.
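Something along these lines (a sketch with guessed column names and assuming MySQL-style temporary tables, not the exact CopyRights() code): the parent SU's ACL rows get copied into a temporary table, repointed at the new SU and inserted back, without a result set ever coming out to the Perl script.

    # Copy the parent SU's ACL rows to the new SU entirely inside the
    # database; the temporary table is needed because the source and the
    # target are the same access_control_list table.
    $dbh->do('CREATE TEMPORARY TABLE acl_copy AS
                SELECT * FROM access_control_list WHERE su_id = ?',
             undef, $parent_su_id);
    $dbh->do('UPDATE acl_copy SET su_id = ?', undef, $new_su_id);
    $dbh->do('INSERT INTO access_control_list (su_id, usergroup_id, rights)
                SELECT su_id, usergroup_id, rights FROM acl_copy');
    $dbh->do('DROP TEMPORARY TABLE acl_copy');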

Whilst these tweaks look to be making bulk importing of data from LORLSv5 more manageable, it remains to be seen whether they keep the performance of interactive editing and access acceptable. Hopefully now I can get back to making a non-Moodle front end and we'll see!