{"id":1863,"date":"2015-07-27T15:56:11","date_gmt":"2015-07-27T15:56:11","guid":{"rendered":"https:\/\/copyright.lboro.ac.uk\/lorls\/?p=1863"},"modified":"2015-07-27T15:56:11","modified_gmt":"2015-07-27T15:56:11","slug":"word-document-importer","status":"publish","type":"post","link":"https:\/\/blog.lboro.ac.uk\/lorls\/lorls\/word-document-importer","title":{"rendered":"Word document importer"},"content":{"rendered":"<p>At this year&#8217;s <a href=\"http:\/\/blogs.lboro.ac.uk\/mtrlc\/\" target=\"_blank\">Meeting the Reading List Challenge (MTRLC) workshop<\/a>, my boss Gary Brewerton demonstrated one of the features we have in LORLS: the ability to ingest a Word document that contains Harvard(ish) citations. Our script reads in an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Office_Open_XML\" target=\"_blank\">Office Open XML (.docx) format<\/a> Word document and spits out some structured data ready to import into a LORLS reading list. \u00a0The idea behind this is that academics still create reading lists in Word, despite us having had an online system for 15 years now. The easier we can make it for them to get these Word documents into LORLS, the more likely it is that we&#8217;ll actually get to see the data. We&#8217;ve had this feature for a while now, and it&#8217;s one of those bits of code that we revisit every so often when we come across new Word documents that it doesn&#8217;t handle as well as we&#8217;d like.<\/p>\n<p>The folk at MTRLC seemed to like it, and Gary suggested that I yank the core of the import code out of LORLS, bash it around a bit and then make it available as a standalone program for people to play with, including sites that don&#8217;t use LORLS. 
\u00a0So that&#8217;s what I&#8217;ve done &#8211; you can download the single script from:<\/p>\n<p><a href=\"https:\/\/lorls.lboro.ac.uk\/WordImporter\/WordImporter\" target=\"_blank\">https:\/\/lorls.lboro.ac.uk\/WordImporter\/WordImporter<\/a><\/p>\n<p>The code is, as with the rest of LORLS, written in <a href=\"https:\/\/www.perl.org\/\" target=\"_blank\">Perl<\/a>. It makes heavy use of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Regular_expression\" target=\"_blank\">regular expression<\/a> pattern matching and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Z39.50\" target=\"_blank\">Z39.50<\/a> lookups to do its work. \u00a0It is intended to run as a CGI script, so you&#8217;ll need to drop it on a machine with a web server. \u00a0It also uses some Perl modules from <a href=\"http:\/\/www.cpan.org\/\" target=\"_blank\">CPAN<\/a> that you&#8217;ll need to make sure are installed:<\/p>\n<ul>\n<li>Data::Dumper<\/li>\n<li>Algorithm::Diff<\/li>\n<li>Archive::Any<\/li>\n<li>XML::Simple<\/li>\n<li>JSON<\/li>\n<li>IO::File<\/li>\n<li>ZOOM<\/li>\n<li>CGI<\/li>\n<\/ul>\n<p>The code has been developed and run under Linux (specifically <a href=\"https:\/\/www.debian.org\/\" target=\"_blank\">Debian Jessie<\/a> and then <a href=\"https:\/\/www.centos.org\/\" target=\"_blank\">CentOS 6<\/a>) with the <a href=\"http:\/\/httpd.apache.org\/\" target=\"_blank\">Apache web server<\/a>. \u00a0It doesn&#8217;t do anything terribly exciting with CGI though, so it should probably run OK on other platforms as long as you have a working Perl interpreter and the above modules installed. As distributed, it looks at the public <a href=\"http:\/\/www.bodleian.ox.ac.uk\/bdlss\/olis-ils\/z3950\" target=\"_blank\">Bodleian Library Z39.50 server in Oxford<\/a>, but you&#8217;ll probably want to point it at your own library system&#8217;s Z39.50 server (the variable names are pretty self-explanatory in the code!).<\/p>\n<p>This script gives a couple of options for output. 
\u00a0The first is <a href=\"https:\/\/en.wikipedia.org\/wiki\/RIS_(file_format)\" target=\"_blank\">RIS format<\/a>, which is a citation interchange format that quite a few systems accept. \u00a0It also has the option of <a href=\"http:\/\/json.org\/\" target=\"_blank\">JSON<\/a> output if you want to suck the data back into your own code. \u00a0If you opt for JSON format you can also include a callback function name so that you can use <a href=\"https:\/\/en.wikipedia.org\/wiki\/JSONP\" target=\"_blank\">JSONP<\/a> (and thus make use of this script in JavaScript code running in web browsers).<\/p>\n<p>We hope you find this script useful. \u00a0And if you do feel up to tweaking and improving it, we&#8217;d love to get patches and fixes back!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At this year&#8217;s Meeting the Reading List Challenge (MTRLC) workshop, my boss Gary Brewerton demonstrated one of the features we have in LORLS: the ability to ingest a Word document that contains Harvard(ish) citations. 
Our script reads in a Office Open XML.docx format Word document and spits out some structured data ready to import into [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[3],"tags":[104,107,105,93,106],"class_list":["post-1863","post","type-post","status-publish","format-standard","hentry","category-lorls","tag-import","tag-meeting-the-reading-list-challenge","tag-regular-expressions","tag-word-documents","tag-z39-50","count-0","even alt","author-cojpk","last"],"_links":{"self":[{"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/posts\/1863","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/comments?post=1863"}],"version-history":[{"count":8,"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/posts\/1863\/revisions"}],"predecessor-version":[{"id":1871,"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/posts\/1863\/revisions\/1871"}],"wp:attachment":[{"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/media?parent=1863"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/categories?post=1863"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lboro.ac.uk\/lorls\/wp-json\/wp\/v2\/tags?post=1863"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}