Jon Knight
Posts by Jon Knight
Librarian Centre CGIs – departmental breakdowns
Today was a bit of a change of pace in LUMP development – it was time to whizz up a backroom script for the librarians, allowing them to see how many modules, reading lists and different types of items were used by departments in a particular year. On the whole this was an “old skool” CGI script that doesn’t use the LUMP XML API – just straight calls to the underlying Perl modules that wrap round the LUMP database. This is both quick to run and quick to code, and for a simple statistics script is fine (no need for flashy JavaScript for those really). Gary is pondering other statistics scripts we might want to produce, both derived from the existing LORLS ones and also some new ideas. He’s going to chat to some of our librarians about those first though – no point wasting time writing code to produce information that nobody is interested in!
Installer hits a sticky patch
You know, I loved CPAN until I started writing the LUMP installer, but now I'm not so keen!
It looks like our cunning plan of allowing people to install the required modules to run LUMP from CPAN using a point-n-dribble web front end might have come unstuck for a couple of reasons. The major one is having Perl modules that use XS rely on C code/libraries/etc that might not actually be installed on the target machine. For example we need to use the ZOOM Perl module for Z39.50 access. This in turn makes use of the Yaz toolkit, which you either need to build from source or get an RPM for… and there aren't any RPMs in the standard CentOS repos (you can pull CentOS RPMs from Index Data's web site but not get them via yum). Similar requirements for Expat, although at least that has yum'ed packages in CentOS. I hadn't noticed either of these because on my Debian Lenny based workstation there were already handy packages in place and the Perl modules that I'd been "needing" were pure Perl. I only stumbled across the issue when Jason created a "sandbox" virtual machine running CentOS and I tried to run the installer on there. Even if I could guess the package manager in use on the machine (yum, apt, etc) I'd need to ask the Admin for the system root password and then have to deal with the out-of-repo code such as Yaz. At this point the law of diminishing returns kicked in and we decided we'd wasted enough time trying to make a user friendly installer; it wasn't going to be easy to get round these problems.
Still, all is not lost: at least the installer code can spot which modules are missing so that you can install them by hand. In fact I've left in the code to install private CPAN modules – we might as well do as much as possible for the prospective Admin even if we do then need to tell them to break out the C compiler and build some support software. And of course the installer can still do everything else once the required Perl modules are in place – create the DB, import test data, and retrieve and configure the various scripts required.
I also accidentally stumbled across another problem today: DBD::mysql doesn't support column filters in the column_info() DBI method. I was using that to check whether the LUMP database schema on the target machine, if present, was out of date compared to the schema embedded in the installer. This is a bit of a pain but relatively simple to code round. The DBI manual page implies that if a DBD module can't do filtering you might have to handle it yourself: DBD::mysql just barfs with an error message, so I've had to wrap the call in eval() checks just to be on the safe side.
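A minimal sketch of the kind of defensive check involved (the table and column names here are only illustrative, not the real LUMP schema): wrap the filtered column_info() call in an eval and, if the driver refuses to play, fall back to fetching all the columns and filtering them ourselves in Perl.

use strict;
use DBI;

# Hypothetical connection details - not the real LUMP config.
my $dbh = DBI->connect('DBI:mysql:database=lump', 'lumpuser', 'secret',
                       { RaiseError => 1, PrintError => 0 });

my @cols;
eval {
    # Ask the driver to filter down to one column of one table...
    my $sth = $dbh->column_info(undef, undef, 'structural_unit', 'id');
    @cols = @{ $sth->fetchall_arrayref({}) } if $sth;
};
if ($@ or not @cols) {
    # ...and if it barfs (or returns nothing), pull every column for the
    # table and do the filtering ourselves.
    eval {
        my $sth = $dbh->column_info(undef, undef, 'structural_unit', '%');
        @cols = grep { $_->{COLUMN_NAME} eq 'id' }
                @{ $sth->fetchall_arrayref({}) };
    };
}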
I've also just noticed that I've hard coded the Kerberos TGT realm! Oops – we'll need to change that, especially as we're just moving ActiveDirectory trees here at Loughborough anyway. We'll probably also need to add in an override ability for folk with no AD or other Kerberized authentication infrastructure – that shouldn't take too much work as it's just a little tweak inside the LUMP.pm password validation method. Hopefully in a week or so we'll have a little demo system on our sandbox VM that we can then let people play with – a bit like the demo system that we already run for LORLS v5 (our "groundhog day" install, which gets blown away every morning in the wee hours and replaced with a fresh copy… in fact I used the LORLSv5 to LUMP importer to convert that database into our new format for this new sandbox machine as another useful test).
Installer progress…
Writing a web driven installer for LUMP in Perl is proving to be, er, "interesting". Today has seen some useful progress and it's probably worth noting some of the tricks used (mostly because they're rather esoteric and bug hunting might be fun in a few years' time when I've forgotten why I did what I did!).
The first thing I had to get going was bootstrapping the required CPAN modules as documented in my previous post. This is now working OK. A few gotchas to note:
- When you fork off a child process from a CGI script that you want to be “long lived” (ie outlive the parent process that has returned some HTML to the user), you need to close down STDOUT, STDIN and STDERR, otherwise Apache hangs around waiting for the child to finish. This can actually be handy, as you can then reopen file handles with those names pointing elsewhere (such as to a session file) and record the output of other tools that the child calls upon (such as the very chatty CPAN::Shell).
- CPAN::Shell doesn’t appear to return a useful error status, so programmatically you can’t tell if your module install worked or not (the verbose output is designed to tell a human). To find out if a module is installed, you have to try to “use” that module again – once installed the use statement should work.
- Talking of use statements, it's worth noting that these are best done inside eval()s, as that allows you to capture errors (as in the case of a module that has failed to install). You have to be careful whether you have compile or run time binding though, especially if you're eval()ing stuff with variables in it that you're planning on instantiating during the run. perl -cw is your friend as usual, warning you about dodgy compile time issues. A rough sketch pulling these tricks together follows this list.
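As a rough illustration of the forking and module-checking tricks above (the session log path and the module name are just placeholders, not what the real installer uses):

use strict;
use CPAN;

my $session_log = '/tmp/lump_install_session.log';   # placeholder path

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: detach from Apache so the parent CGI can return straight away.
    close(STDIN);
    close(STDOUT);
    close(STDERR);
    open(STDOUT, '>', $session_log);
    open(STDERR, '>&', \*STDOUT);

    # CPAN::Shell chatters away to STDOUT/STDERR, which now land in the log.
    CPAN::Shell->install('Data::Dumper');

    # CPAN::Shell doesn't return a useful status, so test with a string eval.
    eval "use Data::Dumper;";
    print STDERR ($@ ? "INSTALL FAILED: $@\n" : "INSTALL OK\n");
    exit(0);
}
# Parent: carries on and returns HTML to the user; a later CGI request can
# poll the session log to report progress.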
Once I could successfully install CPAN modules into a private LUMP modules directory, the next thing to consider was the database. The installer CGI script asks the user for the database name, database account and database password to use, and then checks to see if it can gain access. For most noobs this won't work as they won't have a database/account/password, so the CGI script traps that error and asks if they'd like it to create the database for them. This requires them supplying the database root password, and then (assuming it's right) the script creates the database and grants SELECT/INSERT rights to the named user identified by the given password.
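That creation step boils down to something along these lines (the names, host and passwords are obviously placeholders; MySQL-of-the-era syntax):

use strict;
use DBI;

my ($db, $user, $pass, $rootpass) = ('lump', 'lumpuser', 'secret', 'r00t');  # placeholders

# Connect as root without selecting a database, then create the database and
# grant the application user the limited rights the installer wants it to have.
my $dbh = DBI->connect('DBI:mysql:host=localhost', 'root', $rootpass,
                       { RaiseError => 1 });
$dbh->do("CREATE DATABASE $db");
$dbh->do("GRANT SELECT, INSERT ON $db.* TO '$user'\@'localhost' " .
         "IDENTIFIED BY '$pass'");
$dbh->do("FLUSH PRIVILEGES");
$dbh->disconnect;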
The CGI script then checks that this user can connect to the database – if they can’t then something has gone wrong with the basic creation and we have to tell them to fiddle by hand. Assuming that is OK, it then changes back to the database root user to do the schema creation (as the user only has SELECT/INSERT privs). This however called for another hack: we’ve kept the database schema and index definitions in SQL text files. During development we’ve just source’d these in with the mysql command line client – but this won’t work from Perl’s DBI interface and I don’t want to rely on finding the mysql command line client and system()’ing out to it.
So my next trick was another Perl script: the installer_maker. This is a simple filter script that reads a “template” installer script in and writes out a proper installer with a couple of lines in the template replaced by some Perl array structures – one for the table schema and one for the indexes. These arrays simply contain one SQL statement in each element that the installer_maker has ripped out of the LUMP schema and indexes SQL text files. Bingo! We can now carry on tweaking the schema/indexes in the files we’re used to in development, whilst not having to remember to do lots of code twiddling to keep the installer in sync – we just need to run installer_maker before making a distribution. Hopefully that should result in less chance of our dev schemas getting out of step with released production schemas and the bugs that could result from that.
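The guts of installer_maker are roughly along these lines (the file names and the marker lines are illustrative rather than the real ones):

use strict;

# Read an SQL file and return one statement per array element.
sub sql_statements {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Can't read $file: $!";
    local $/;                       # slurp the whole file
    my $sql = <$fh>;
    close($fh);
    $sql =~ s/--[^\n]*//g;          # strip SQL comments
    return grep { /\S/ } map { s/^\s+|\s+$//g; $_ } split /;/, $sql;
}

my @schema  = sql_statements('schema.sql');
my @indexes = sql_statements('indexes.sql');

open(my $in,  '<', 'installer.template') or die $!;
open(my $out, '>', 'installer')          or die $!;
while (my $line = <$in>) {
    # Replace marker lines in the template with generated Perl arrays.
    # (q{} quoting assumes the SQL doesn't contain unbalanced braces.)
    if ($line =~ /^##SCHEMA##/) {
        print $out "my \@schema = (\n",
              join(",\n", map { "  q{$_}" } @schema), "\n);\n";
    } elsif ($line =~ /^##INDEXES##/) {
        print $out "my \@indexes = (\n",
              join(",\n", map { "  q{$_}" } @indexes), "\n);\n";
    } else {
        print $out $line;
    }
}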
So it's looking promising now, and from a user perspective it's probably far friendlier than the command line Perl installer from LORLS. The next thing to write is a routine in the installer that can check the schema and indexes in a live database against the schema/indexes we have now got embedded in the installer. If they differ in some way we need to tell the admin running the installer and let them decide whether to ignore the difference (bad idea probably), tweak it by hand or (preferably) let the installer tweak their schema/indexes. Hopefully in the long term this will be a useful place to hang LUMP schema/index upgrades on – folk could use the installer not only to load new versions of the Perl code but also to update their database structure. That'll need some serious testing though… 🙂
LORLS’s Installer
We’re getting to the point with LUMP and CLUMP where Jason and I are thinking about installers. If we were just going to use the code at Loughborough this would be a moot point – we’d just copy the various directories around and tweak under the hood to get the thing to work as we want, just as we have done during development.
However it's not going to be just us that use it (or at least hopefully it won't be!) so we need to think of a nice way for people to install it. The previous versions of LORLS have a Perl command line script that I wrote years ago that asks lots of questions and then does its funky thang. It works, but it could be friendlier, especially for newbie admins. I've also seen a number of PHP based packages that allow you to do most of the installation via a web browser, which is rather nice for new admins, so I thought, "Hey, why not make an installer CGI script that assumes very little about the system: just that Perl is there, that a MySQL database server is running, that the CGI.pm module is available (it has been part of the Perl distribution since 5.4 so should already be there on most machines installed in the 21st century), and that the script itself can be executed as a CGI?" Why not indeed…
Obviously the prospective LUMP admin needs to install (or have installed for him) Perl, MySQL and a webserver configured to execute CGI scripts in the directory that we tell them to unpack LUMP into, but once that’s done surely we can then check to make sure that all the other modules are present, install them if they aren’t, set up the MySQL database ready for LUMP to use and then configure all the LUMP scripts so that they are ready to go, all with a nice point and drool web interface?
Checking if a Perl module is available is relatively easy, thanks to Perl's eval. For example, say we want to check if Data::Dumper is installed. That can be done using some code such as:
eval "use Data::Dumper;";   # string eval, so the use happens at run time
if($@) {
    # Module missing
} else {
    # Module present, and available for use
}
Shimples!
However things start to get “interesting” if the module isn’t installed. I thought this would be easy, as the Comprehensive Perl Archive Network (CPAN) has a nice module that I’ve used for years to interactively install and update Perl modules with – things like:
perl -MCPAN -e 'install Data::Dumper'
It has a programmatic option as well as an interactive side, so we’re sorted right? Well no, we’re not. Unfortunately the programmatic side really just generates the stream of stuff you see when you run the interactive side. If you include something like:
my $result = CPAN::Shell->install('Data::Dumper');
in your CGI script, eventually you'll get a result in the web browser of a load of unformatted raw text from this command interspersed with your active HTML. The $result variable on the other hand will stay completely empty, with no indication as to whether the installation has worked or not. Pants – not what we want here.
The long and short of it is, to get round this "feature" in CPAN::Shell it seems that you have to do a bit of fork() action. In other words, fork off a child process to run the CPAN::Shell method in and then, back in the parent, capture its STDOUT stream into a variable which can then be scanned with a regexp or two looking for signs of success in the output text. Not terribly clean, but it works.
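A minimal sketch of that fork-and-capture dance, using Perl's piped-open idiom (the "success" regexp is a guess at what CPAN's chatter looks like rather than anything definitive):

use strict;
use CPAN;

sub install_module {
    my ($module) = @_;

    # open with '-|' forks: the child's STDOUT is wired to $fh in the parent.
    my $pid = open(my $fh, '-|');
    die "Can't fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: run the noisy CPAN::Shell install and exit.
        open(STDERR, '>&', \*STDOUT);   # fold STDERR into the same stream
        CPAN::Shell->install($module);
        exit(0);
    }

    # Parent: slurp the child's output and grep it for signs of success.
    my $output = do { local $/; <$fh> };
    close($fh);

    return ($output =~ /--\s*OK\s*$/m) ? 1 : 0;   # assumption about CPAN's wording
}

print install_module('Data::Dumper') ? "worked\n" : "failed\n";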
There's another "gotcha" as well: the web server is unlikely to be running as root (or at least it shouldn't be!) and so the CGI script can't install Perl modules into the system library directories. This is a more minor pain: you can tell CPAN::Shell that it should do a local installation to a place that the user executing it can write to. So that's another requirement for running this CGI script: create a directory that's readable and writable by the user running the web server (usually called something like apache or http) but which isn't part of the web server document root. In other words, if the web server document root is /var/www/html, we might want to put this LUMP specific CPAN directory tree in /var/www/LUMPCPAN where it can't be seen by web browsers. You have to hack up a MyConfig.pm to put in this directory and then point @INC and $ENV{'PERL5LIB'} towards it, but that can be done automagically by the CGI installer script once the directory exists.
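Roughly what that local CPAN configuration might look like (the exact config keys CPAN wants vary between versions, so treat this as a sketch rather than a drop-in MyConfig.pm):

# /var/www/LUMPCPAN/CPAN/MyConfig.pm - private CPAN config for the installer
$CPAN::Config = {
    'cpan_home'         => '/var/www/LUMPCPAN',
    'build_dir'         => '/var/www/LUMPCPAN/build',
    'keep_source_where' => '/var/www/LUMPCPAN/sources',
    # Install into the private tree instead of the system directories.
    'makepl_arg'        => 'INSTALL_BASE=/var/www/LUMPCPAN',
    'mbuildpl_arg'      => '--install_base /var/www/LUMPCPAN',
    'urllist'           => [ 'http://www.cpan.org/' ],
};
1;

…and then the CGI scripts that want to use those modules need something like:

use lib '/var/www/LUMPCPAN/lib/perl5';
$ENV{'PERL5LIB'} = '/var/www/LUMPCPAN/lib/perl5';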
Now some readers (we do have readers, right?) might be wondering why I don't use one of the fancy pants CPAN modules such as local::lib or Module::Install to do this, rather than tackling CPAN::Shell head on. Well it's a chicken and egg problem really: I wanted to have the minimum requirements for the installer script to run and, if I'd asked the user to install a load of libraries and modules to make using CPAN easier, I might as well have just given them a list of modules to install. Which actually I have done anyway, just in case they want to install them system wide and/or not trust my installer script to get them right. But nevertheless I'd like to give the newbies (and also folk who have sluggish server admins, if they don't run their own boxes) the option of installing via the CGI installer script. More work for me, but hopefully less for them.
So, it's a bit of a kludge at the moment but it seems to be the way to go unless anyone else can come up with a better scheme. Once I'm happy it can get the required modules in place, my next trick will be to ask for database server details (database name, username, password, server, port) and then check if the database already exists, or create a fresh one. In the former case I'm also planning on checking the schema in the database against the one in this LUMP distribution and then offering to update it if needs be, and allowing fresh databases to either be empty, have some simple test data in them or copy data from another database. Hopefully I can also bolt the importer from older LORLS versions in here so that there's no command line interaction required at all. With a bit of luck all of those should be a lot less hassle than CPAN::Shell has proved to be.
I hope… 🙂
Copying Structural Units
There’s now a new API cgi script – CopySU. This lets you both copy and link SUs to new parents. For example you might want to copy an existing book within the same reading list if there is another volume with very similar details, or you might want to link a single book SU into more than one reading list. In the latter case a change made to the book SU is reflected in all the reading lists it is linked to, irrespective of the list that it was “edited in”.
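From a client's point of view this is just another XML API call over HTTP; something along these lines, although the URL and parameter names here are purely hypothetical guesses rather than CopySU's real interface:

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Hypothetical parameters: the SU to copy/link and the target parent SU.
my $resp = $ua->post('http://example.org/cgi-bin/LUMP/CopySU', {
    su_id         => 12345,       # placeholder SU ID
    new_parent_id => 67890,       # placeholder parent (eg a reading list SU)
    mode          => 'link',      # or 'copy' - guessed values
});
print $resp->content if $resp->is_success;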
Jason now has to munge this into CLUMP so that the copy buttons in there run with this new functionality, as at the moment they are just placeholders that generate a new child SU. Seems to work from the command line/by hand so fingers crossed, eh? 🙂
Speeding things up (or at least trying to)
Jason has been doing a load of work in the last few months on getting the CLUMP AJAX front end going and it's looking rather nice. However Gary and Jason had decided that on longer reading lists the performance wasn't good enough. A 635 item list took over 3 minutes to completely render when Jason pulled the Reading List SU and then did separate XML API calls to retrieve each individual item (the advantage of this over just using a GetStructuralUnit API call with multiple levels is that the users could be presented with some of the data asynchronously as it arrived, rather than having to wait for the back end to process the whole lot).
So the question was: could it be made faster? One option was trying to create a single “mega” SQL select to get the data but that could almost immediately be discounted as we pull different bits of data from different tables for different things in GetStructuralUnit (ie the child SU info and SU/SUT basic info couldn’t easily be munged in with the data element/data type/data type group stuff). So we did two separate selects to get the basic data, ignoring the ACL stuff. These were both subsecond responses on the MySQL server.
Now obviously turning the SQL results into XML has a bit of an overhead, as does the client-server comms, but nowhere near enough to cause this slow down. This pointed the finger firmly in the direction of the ACLs. Every time we pull an SU (or its children) we need to do a load of ACL checks to make sure that the user requesting the data is allowed to see/edit the SU and/or the data type group. When we added ACLs back into the new “fast” XML API the 635 element list took just under two minutes to render. So we’d shaved a third off the time by crafting the SQL rather than using the Perl objects behind the XML API, but it was still a bit slow.
Gary then came up with a bright idea: why not allow FastGetStructuralUnit (the new whizzy version of the GetStructuralUnit XML API CGI script) to accept more than one SU ID at once? That way Jason's CLUMP AJAX front end could request the outer reading list SU quickly, and then fill in the items in blocks of several at once. We implemented this and had a play around with different numbers of items in the blocks. Five seemed quite good – this was fast enough to fill in the first screenful to keep the user occupied, and it managed to get the whole 635 item list rendered in the browser in just over a minute, cutting the original rendering time by about two thirds.
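The batching boils down to chopping the item SU IDs into chunks and firing one FastGetStructuralUnit request per chunk; a rough Perl illustration of the idea (the URL and parameter names are assumptions, and the real CLUMP code does this asynchronously in JavaScript):

use strict;
use LWP::UserAgent;

my $ua    = LWP::UserAgent->new;
my @items = (101 .. 735);          # pretend these are the 635 item SU IDs
my $block = 5;                     # the block size that seemed quite good

while (my @chunk = splice(@items, 0, $block)) {
    # Hypothetical form: several su_id values in one request.
    my $resp = $ua->post('http://example.org/cgi-bin/LUMP/FastGetStructuralUnit',
                         [ map { ( su_id => $_ ) } @chunk ]);
    # ...hand the returned XML for these five items to the renderer...
}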
Jason can now also try out more advanced ideas in the future, such as dynamically altering the number of items requested in the blocks based on response time and on whether the user is scrolling down to them or not. With lists under 100 items we're getting a sub-10 second rendering time, so that's hopefully going to be fast enough for the majority of users… and may even encourage some academics with long and unwieldy reading lists to split them up into smaller sub-lists.
More ACL fun
After fiddling with ACLs for the last few weeks we had to reload the access_control_list table with the new style ACLs. Last week I'd written a "fixacls" Perl script to do this and left it running. It was still running a week later when I came in this morning, and was only about half way through the >88K SUs we've got. A quick back of the envelope calculation showed that with the new ACL format we'd be looking at north of 2 million ACLs for this many SUs.
So, time to find a way of speeding it up. I decided to try something I'd seen phpBB do: write all the rows for a table as a single multi-row SQL INSERT statement. The database can then read all the data in and generate the index entries in one go, which should be much faster than doing it row by row with individual SQL statements.
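Generating that sort of file from Perl is just string assembly; a small hedged sketch (the example rows and the column names are made up for illustration, not the real access_control_list layout):

use strict;

# Example rows only - the real fixacls script derives these from the SU tree.
my @acls = (
    { usergroup_id => 1, su_id => 100, dtg_id => undef, view => 1, edit => 0 },
    { usergroup_id => 2, su_id => 100, dtg_id => 7,     view => 1, edit => 1 },
);

open(my $out, '>', 'acl.sql') or die "Can't write acl.sql: $!";
print $out "INSERT INTO access_control_list (usergroup_id, su_id, dtg_id, view, edit) VALUES\n";
print $out join(",\n",
    map {
        sprintf("(%d, %d, %s, %d, %d)",
                $_->{usergroup_id}, $_->{su_id},
                (defined $_->{dtg_id} ? $_->{dtg_id} : 'NULL'),
                $_->{view}, $_->{edit})
    } @acls);
print $out ";\n";
close($out);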
The ACL generation script took just over 6 hours (I didn’t time it exactly unfortunately, but it was ticking along at about 300000 ACLs per hour, with just over 2 million ACLs in total for the 88882 SUs we’ve got).
Reading this into the database was actually quite quick:
mysql> source /home/cojpk/LUMP/trunk/ImportScripts/acl.sql
Query OK, 2266017 rows affected (51.98 sec)
Records: 2266017 Duplicates: 0 Warnings: 0
Well, it was once I'd upped max_allowed_packet in /etc/my.cnf to 100MB from 1MB! Looking promising now. A similar technique might be useful for a revamp of the LORLS import script – certainly the ACL fixup can now be run as part of that to cut down on the time required. Still need to tweak the fixacls script to deal with LORLS reading lists that were marked as unpublished, but that should be a quick and easy hack.
Access Control Lists
Been head scratching about access control lists, which have proved more complicated than originally anticipated. The trouble comes from working out how to do inheritance (so that we don't need thousands of similar ACLs differing only in their SUID) and also how to set the initial ACLs for a SU that has been newly created.
Current thinking is as follows:
The existing access_control_list table will be left as is, but will be used in a slightly different way. To find out if a user has rights to something, we'll first search the table for all rows whose user group the user is a member of, keeping only those with the maximum priority. The guest usergroup has a priority of 0, general registered users 1, module support staff 170, module tutors 180, module library assistants 190, module librarians 200 and sysadmins 255.
If no rows match, game over: the user doesn't have any sort of access. If rows are returned, we then look at the SUID field, and potentially the DTG_ID field if we've had one specified. Rows with no SUID or DTG_ID are the lowest precedence, then ones with just a SUID provided, then ones with both SUID and DTG_ID at the top. At this point we can then look at the view and edit fields to see what folk are allowed to do with them.
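In SQL-via-DBI terms the lookup might end up looking something like this (the table and column names are guesses based on the description above, so purely illustrative):

use strict;
use DBI;

# Illustrative only: guessed schema, placeholder credentials and IDs.
my $dbh = DBI->connect('DBI:mysql:database=lump', 'lumpuser', 'secret',
                       { RaiseError => 1 });
my ($user_id, $su_id, $dtg_id) = (42, 12345, 7);

my ($view, $edit) = $dbh->selectrow_array(q{
    SELECT acl.view, acl.edit
      FROM access_control_list acl
      JOIN usergroup ug       ON ug.id = acl.usergroup_id
      JOIN user_usergroup uug ON uug.usergroup_id = ug.id
     WHERE uug.user_id = ?
       AND (acl.su_id  = ? OR acl.su_id  IS NULL)
       AND (acl.dtg_id = ? OR acl.dtg_id IS NULL)
     ORDER BY ug.priority DESC,              -- highest priority group first
              (acl.su_id  IS NOT NULL) DESC, -- then SUID-specific rows
              (acl.dtg_id IS NOT NULL) DESC  -- then SUID+DTG_ID rows
     LIMIT 1
}, undef, $user_id, $su_id, $dtg_id);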
We might also need to allow folk to edit ACLs. We’ll let them edit ACLs of any group with a lower priority than the one that they are in and that mentions a SUID and/or DTG_ID that they have edit rights to. The only exception is Sysadmins – we can edit our own ACLs and can also edit ACLs that have a NULL (ie wildcard) SUID or DTG_ID. We will also disallow new ACLs to be created that have the same <usergroup_id, su_id, dtg_id> triple so that we don’t have two rules with different edit/view options at the end of the access validation algorithm.
When a new SU is created, we’ll need to add some rights to it. These initial rights will have user groups inherited from the parent SU, but this is where things get tricky: as we’ve discovered we can’t just copy the ACLs, because child SUs will in general be of a different SUT than their parents.
To get round this we’ve proposed two new tables. The first of these is “acl_defaults”. This links user group priorities with SUT_IDs, DTG_IDs and default “view” and “edit” settings. The second table is less important and more for UI “gloss” when creating/editing default ACLs – it is merely a list of priorities (as its ID field) and a name – so that we know that (for example) priority 170 is assigned to user groups for Support Staff.
The process when a new SU is created is as follows: for each usergroup that has an ACL in the parent SU take that usergroup’s priority field and then look up the row(s) in the acl_defaults table that match that combined with the SUT_ID for the new SU. These rows are then used to create new rows in the access_control_list table with the specific SUID for the new SU filled in (and if specified in acl_defaults, the value for DTG_ID as well).
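That step maps fairly naturally onto a single INSERT…SELECT, roughly like this (again, guessed table and column names, wrapped in DBI the way the rest of LUMP does its SQL):

use strict;
use DBI;

# Guessed schema: acl_defaults(priority, sut_id, dtg_id, view, edit) and
# access_control_list(usergroup_id, su_id, dtg_id, view, edit).
my $dbh = DBI->connect('DBI:mysql:database=lump', 'lumpuser', 'secret',
                       { RaiseError => 1 });

my ($new_su_id, $parent_su_id, $new_sut_id) = (99999, 12345, 3);  # placeholders

$dbh->do(q{
    INSERT INTO access_control_list (usergroup_id, su_id, dtg_id, view, edit)
    SELECT DISTINCT ug.id, ?, d.dtg_id, d.view, d.edit
      FROM access_control_list parent_acl
      JOIN usergroup ug   ON ug.id = parent_acl.usergroup_id
      JOIN acl_defaults d ON d.priority = ug.priority
                         AND d.sut_id   = ?
     WHERE parent_acl.su_id = ?
}, undef, $new_su_id, $new_sut_id, $parent_su_id);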
Client Performance
In the last month we've been getting to grips with client side coding. This has thrown up a few XML API calls that we decided we wanted, or that needed tweaking (such as checking authentications without returning anything from the database, listing all the structural unit types available and allowing substring searches in FindSUID). It has also given us pause for thought on the performance of the system.
The old LORLS database was pretty lean and mean: the front end code knows about the structure of the data in the database and there are relatively few tables to handle. LUMP is more complex and thus more searches have to be done. Also, by having an XML API we're that bit more removed and there's more middleware code to run. We could just say "hang it" and ignore the XML API for a CGI based client, which might speed things up. However we've been trying to "eat our own dogfood" and use the APIs we're making available for others.
Some of the speed hacks from the imports won't work with the clients – for example the CGI scripts that implement the XML API are short lived, so the caching hacks wouldn't work even if the clients all did tend to ask for the same things (which will in fact be the case for things like the Institution, Faculty, Department, etc information). One avenue that we could pursue to help with this is mod_perl in Apache, so that the CGI scripts are turned into much longer lived mod_perl scripts that could then benefit from caching.
We've currently ended up with a sort of three pronged approach to client code:
- Jimll is writing a CGI based LUMP client that makes minimal use of JavaScript (mostly for the Google Book Search fancy pants bit really),
- Jason has a basic PHP Moodle client to demonstrate embeddability,
- Jason is also writing an AJAXy client side tool.
All of these use the XML API. I guess we should also have a non-XML API client as a benchmark as well – the CGI based LUMP client could do that relatively easily in fact. Something to think about if performance really does turn out to be a dog.
We’ve also been considering the usefulness of the Display Format stuff held in the database. The CGI based LUMP client and the PHP Moodle client both use this. However the AJAX client retrieves raw XML documents from the server and then renders them itself. This might be a bit faster, but it does mean that the client becomes tied to the structure of data in the database (ie add a new Structural Unit Type and you might need to dick about with your client code(s) as well).
Speeding up the imports
A while since the last update and that’s mostly because we’ve been banging our heads against a speed issue for the import of the old LORLSv5 reading lists into LUMP.
The first cut of the importer seemed to work, but had barfed due to lack of disc space on my workstation before completing the run. For dev/testing the rest of the API that was fine though, as we had enough data to play with. It was only when we installed a new virtual hosting server and I created myself a dedicated test virtual server with oodles of RAM and disc space that we discovered that the import would work… but it would take around two weeks to do four years' worth of data. Ah, not good. Especially as the virtual host is supposed to be relatively big and fast (OK, it's a virtual server so we can't count spindles directly as the filesystem is stuffed inside another file on the RAID array on the host, but it should still be plenty fast enough for LUMP to run, otherwise folk with older server hardware are going to be stuffed).
We’ve tried a number of different options to help sort this out. These have included:
- Search caching in Perl
- Tweaking the BaseSQL module to allow a Perl hash based cache to be turned on and off (and the status of caching checked). This is then used by some of the higher layer modules that encapsulate a few of the tables (StructuralUnit, DataElement, DataType, DataTypeGroup) to see if a search matches a previous search and, if caching is turned on, return the results immediately from the Perl hash without hitting the database. Any update on the table in question invalidates the cache. Reading the cached copy is much faster than accessing the database and so this can be a big win, especially on tables where there are relatively infrequent updates. Unfortunately we do quite a bit of updating on the StructuralUnit and DataElement tables. There's a rough sketch of the idea after this list.
- A reload() method
- Quite a lot of the time we create new Perl objects on a database table to do (for example) a search and then later have to do another new() on the same object to instantiate it with an existing row from the table (based on the id field). Every new() method reinterrogates the database to find out the fields for the table concerned and then recreates the Perl object from scratch. However the fields are unlikely to change from call to call (certainly during an import) so this is just wasted time. A reload() method has been added so that you can instantiate the object from a known row in the database via the id field without having the whole Perl object regenerated or the data queried for the fields available. This results in a slight but noticeable saving.
- Adding noatime and nodiratime mount options
- Normally on a Linux ext3 filesystem (which is what we're running), the access times of files and directories are updated each time they are accessed. Of course that means that every SQL SELECT is effectively also a write on the filesystem. The noatime and nodiratime directives to mount (slipped into /etc/fstab) turn this behaviour off. You don't even have to reboot for it to come into effect – the mount -o remount / command is your friend! This should remove another disk related bottleneck during the import.
- Tweaking the ACL CopyRights() method
- Every time a row is inserted into the structural_unit table for a new SU, several rows get stuck into the data_element and access_control_list tables. Looking at the latter, during the import many of these rows are created as a result of the CopyRights() method on the AccessControlList Perl object. This method allows several fancy options such as overwriting existing ACLs and cascading ACLs from a parent to all its children. Neither of these directly applies in the case of building a new SU and copying the access rights from its direct parent, yet we still had to do a load of SQL to support them. Therefore a new parameter called "new" was added to the method to indicate that the ACLs were being copied for a new SU, which allowed some short cutting. One part of this short cutting was to use a single INSERT INTO...SELECT FROM... SQL construct. This should be fast because it is just copying internally in the database engine (using a temporary table, as both the source and target tables are the same in our case) and doesn't need to have results sent to/from the Perl script. This appears to be quite a big win – performance with this and the previous two tweaks now hits 1000+ SUs and associated data being created every 10 minutes or so.
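The search cache mentioned in the first bullet is essentially just a hash keyed on the search parameters; a stripped-down sketch of the idea (method and field names are illustrative rather than BaseSQL's real internals):

package CachedTable;    # illustrative stand-in for a BaseSQL-ish module
use strict;

my %cache;              # search key -> result set
my $caching_on = 1;

sub cache_on  { $caching_on = 1 }
sub cache_off { $caching_on = 0; %cache = () }

sub search {
    my ($self, $dbh, $where, @binds) = @_;
    my $key = join("\0", $where, @binds);

    # Return the cached rows if we've seen this exact search before.
    return $cache{$key} if $caching_on && exists $cache{$key};

    my $rows = $dbh->selectall_arrayref(
        "SELECT * FROM structural_unit WHERE $where", { Slice => {} }, @binds);
    $cache{$key} = $rows if $caching_on;
    return $rows;
}

sub update {
    my ($self, $dbh, $sql, @binds) = @_;
    %cache = ();        # any update invalidates the whole cache
    return $dbh->do($sql, undef, @binds);
}

1;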
Whilst these tweaks look to be making bulk importing of data from LORLSv5 more manageable, it remains to be seen whether they make the performance of interactive editing and access acceptable. Hopefully now I can get back to making a non-Moodle front end and we'll see!