Sunday, 12 April 2009

Automatic Back-Up of ONS files: Google Spreadsheets, JCAMP-DX, Flickr

As many of you know, we have been heavily dependent upon publicly editable Google Spreadsheets for storing results and calculations relating to our Open Notebook Science projects. We have recently integrated automated processing of NMR files in JCAMP-DX format to calculate solubility data, using web services called directly from within the Spreadsheets.

That represents a lot of distributed technology that is susceptible to network or server problems. Andy Lang, who wrote the web services that currently calculate the solubility, has enabled the recall of previously calculated values via a quick database look-up. While this substantially reduces server load by avoiding lengthy calculations, it does mean that the final numbers do not exist in the Spreadsheets themselves.

In addition to these concerns, every time I give a talk to a group of librarians, the issues of archiving and curation of new forms of scholarship are raised. These are valid concerns, and I've been trying to work with several groups to deal with the problem in as automatic a way as possible.

We had initially considered a spidering service that would automatically follow every file linked to the ONS wikis and download the documents on a daily basis. This has turned out to be problematic because many of the links don't terminate directly on files, but rather on user interfaces. For example, a typical link to a Google Spreadsheet does not lead to a simple HTML page that can be copied, but rather to an interface for adding data and setting up calculations.

It turns out we can use a semi-automated approach that gets us to where we want to be but requires a bit more manual work. Google Spreadsheets can be exported as Excel spreadsheets, which store the results of web service calculations as simple values and include the link to the web service as a cell comment. All calculations within the Spreadsheet are also retained in this way. The trick is to "publish" the spreadsheet using the advanced option of exporting as an Excel file. The published export then becomes available at a simple URL.
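To make that last step concrete, here is a minimal Python sketch of fetching such a published export. The URL is a placeholder for whatever address Google generates when you publish with the Excel option; the file is saved byte-for-byte with no interpretation:

    import urllib.request

    # Placeholder for the published-as-Excel URL that Google generates
    XLS_URL = "https://spreadsheets.google.com/pub?key=EXAMPLE_KEY&output=xls"

    def download_file(url, dest_path):
        """Fetch the published export and write the bytes to disk unchanged."""
        with urllib.request.urlopen(url) as response:
            data = response.read()
        with open(dest_path, "wb") as f:
            f.write(data)

    download_file(XLS_URL, "solubility_backup.xls")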


Now, the only manual step left in the process is to copy these URLs to another BackUp Google Spreadsheet. Andy has created a little executable that steps through a list of these URLs and creates a backup on any Windows computer under a C:\ONS directory. It is then simply a question of setting up a Windows Scheduler task to run once a day and call the executable. All the files are named with the date as the first part of the name for easy sorting.
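I don't know the internals of Andy's executable, but the behavior described works out to something like this Python sketch (the URL list here is a placeholder; in practice it would come from the BackUp Spreadsheet):

    import datetime
    import os
    import urllib.request

    BACKUP_DIR = r"C:\ONS"  # same location Andy's executable uses

    # Placeholder entries; the real list lives in the BackUp Spreadsheet
    URLS = [
        ("https://spreadsheets.google.com/pub?key=EXAMPLE&output=xls", "solubility.xls"),
        ("http://example.com/spectra/sample123.jdx", "sample123.jdx"),
    ]

    def run_backup():
        os.makedirs(BACKUP_DIR, exist_ok=True)
        stamp = datetime.date.today().isoformat()  # date first for easy sorting
        for url, name in URLS:
            dest = os.path.join(BACKUP_DIR, stamp + "_" + name)
            with urllib.request.urlopen(url) as response:
                with open(dest, "wb") as f:
                    f.write(response.read())

    if __name__ == "__main__":
        run_backup()

Scheduling a script like this daily on Windows could be done with something along the lines of schtasks /create /sc DAILY /tn ONSBackup /tr "python C:\ONS\backup.py" - that command line is my assumption, not the exact call Andy's setup uses.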



Besides Google Spreadsheets backed up as Excel files, spectral JCAMP-DX files and Flickr images can be processed in the same way. In both cases the user must specify the JDX, DX, or JPG file directly. On Flickr you have to click through a few pages to reach the download page for a given image, but once you have that URL it works fine.

Andy has labeled this version 0.1 for good reason. It does exactly what we want, but there are a few caveats:

1) Any error in specifying a file will abort the rest of the back-up. Future versions should tolerate errors, with appropriate reporting of problems, perhaps by email (see the sketch after this list).

2) Files don't necessarily have the correct extensions. For example, backed-up Wikispaces pages have to be renamed with an HTML extension to be viewed in a browser. Note that Wikispaces has its own sophisticated back-up system that will put the entire wiki, with all files directly uploaded onto the wiki, into a single ZIP file in either HTML or WIKITEXT format. Of course this will not include files residing outside the wiki, like Google Spreadsheets. Still, I think there is no harm in including the wiki pages in the list of files to be backed up by Andy's system.
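On the first caveat, error tolerance is a small change to the download loop: catch the failure, note it, and keep going. A hedged Python sketch of the idea (the reporting step is just a print here; smtplib could send the list by email instead):

    import urllib.request

    def backup_all(url_list):
        """Try every download; record failures instead of aborting the run."""
        failures = []
        for url, dest in url_list:
            try:
                with urllib.request.urlopen(url) as response:
                    with open(dest, "wb") as f:
                        f.write(response.read())
            except OSError as exc:  # covers URLError, HTTPError, and disk errors
                failures.append((url, str(exc)))  # note the problem, keep going
        return failures

    # Report rather than halt the rest of the back-up
    for url, err in backup_all([("http://example.com/missing.jdx", "missing.jdx")]):
        print("FAILED " + url + ": " + err)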

Going forward, there are two types of collaboration that could help a lot:

1) Librarians who would be willing to archive UsefulChem and ONSChallenge files. Right now these amount to just a few megabytes a day, but this will increase as we continue to add to the list. To be reasonable about space, I could see a protocol of keeping only one back-up per week or month for dates more than 30 days in the past (see the pruning sketch after this list); I believe this is roughly what the Internet Archive does. With multiple libraries maintaining archives, it would be unambiguous what was known at what time.

2) Someone who knows how Google creates URLs for downloadable XLS exports would be mighty helpful. The same goes for Flickr and JPG exports. Even just writing a script to spider all HTML pages linked to the wikis and blogs would save a lot of manual labor (see the spider sketch below). The nice thing is that the results of the spidering code would just have to be dumped into the Back-Up Google Spreadsheet - which already backs itself up conveniently.
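On the retention protocol in point 1, a pruning pass could lean on the date-first file names. A hedged sketch, assuming the dates are written in ISO format (the exact naming scheme is an assumption) and keeping Monday's copy as the weekly survivor:

    import datetime
    import os

    BACKUP_DIR = r"C:\ONS"

    def prune_old_backups(keep_daily_days=30):
        """Keep every backup from the last 30 days; for older dates keep
        only Monday's copy. Assumes names like 2009-04-12_solubility.xls."""
        today = datetime.date.today()
        for name in os.listdir(BACKUP_DIR):
            try:
                stamp = datetime.date.fromisoformat(name[:10])
            except ValueError:
                continue  # not a date-prefixed backup file; leave it alone
            age_days = (today - stamp).days
            if age_days > keep_daily_days and stamp.weekday() != 0:
                os.remove(os.path.join(BACKUP_DIR, name))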
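And on point 2, a bare-bones link spider needs little more than the standard library. This sketch collects every href on a page (the seed URL is just an example); the resulting list is exactly what would get dumped into the Back-Up Spreadsheet:

    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Accumulate the href attribute of every anchor tag encountered."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for attr_name, value in attrs:
                    if attr_name == "href" and value:
                        self.links.append(value)

    def links_on_page(url):
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        collector = LinkCollector()
        collector.feed(html)
        return collector.links

    # Example seed; a real run would start from each wiki's page list
    for link in links_on_page("http://usefulchem.wikispaces.com/"):
        print(link)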
