Open Citations – Indexing PubMed Central OA data

As part of our work on the Open Citations extensions project, I have recently been doing one of my favourite things – namely indexing large quantities of data then exploring it.

On this project we are interested in the PubMed Central Open Access subset, and more specifically in what we can do with the citation data contained within the records in that subset. Because those records are open access, that citation data is public and freely available.

We are building a pipeline that will enable us to easily import data from the PMC OA and from other sources such as arXiv, so that we can do great things with it like explore it in a facetview, manage and edit it in a bibserver, visualise it, and stick it in the rather cool related-work prototype software. We are building on the earlier work of both the original Open Citations project, and of the Open Bibliography projects.

Work done so far

We have spent a few weeks getting to understand the original project software and clarifying some of the goals the project should achieve. We have also put together a design for a processing pipeline that takes the data from source right through to where we need it, in the shape that we need it. For the facetview / bibserver work, this means getting it into a wonderful elasticsearch index.

While Martyn continues work on the bits and pieces for managing the pipeline as a whole and pulling data from arXiv, I have built an automated and threadable toolchain. It unpacks the data from the compressed file format it arrives in from the US National Institutes of Health, parses the XML, converts it into BibJSON, and then bulk loads it into an elasticsearch index. This has gone quite well.
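To give a feel for the parse, convert and bulk-load steps, here is a minimal sketch rather than the actual pipeline code. It assumes NLM/JATS-style element names (article-meta, article-id, contrib), a BibJSON-like shape with title, author and identifier keys, and the newline-delimited body format used by the elasticsearch bulk API. The function names and the sample XML are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Tiny made-up article in the NLM/JATS style (not a real PMC record).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmid">123</article-id>
      <article-id pub-id-type="pmc">2000001</article-id>
      <title-group><article-title>An example article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Smith</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""


def nlm_to_bibjson(xml_text):
    """Convert one article's XML into a BibJSON-like dict (assumed field names)."""
    root = ET.fromstring(xml_text)
    meta = root.find("./front/article-meta")
    if meta is None:
        raise ValueError("no front/article-meta element found")
    record = {"type": "article", "identifier": [], "author": []}
    title = meta.find("./title-group/article-title")
    if title is not None:
        # itertext() keeps text that sits inside inline markup such as <italic>.
        record["title"] = "".join(title.itertext()).strip()
    for aid in meta.findall("article-id"):
        record["identifier"].append(
            {"type": aid.get("pub-id-type"), "id": (aid.text or "").strip()}
        )
    for contrib in meta.findall("./contrib-group/contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        name = contrib.find("name")
        if name is None:
            continue
        given = name.findtext("given-names", default="")
        surname = name.findtext("surname", default="")
        record["author"].append({"name": " ".join(p for p in (given, surname) if p)})
    return record


def to_bulk_body(records, id_type="pmc"):
    """Build a bulk API body: one action line, then one source line, per record."""
    lines = []
    for rec in records:
        doc_id = next(
            (i["id"] for i in rec["identifier"] if i["type"] == id_type), None
        )
        action = {"index": {"_id": doc_id}} if doc_id else {"index": {}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(rec))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The string returned by to_bulk_body is what would be POSTed to the index's _bulk endpoint. Batching the records into chunks before sending keeps each request to a manageable size.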

To fully browse what we have so far, check out http://occ.cottagelabs.com.

For the code: https://github.com/opencitations/OpenCitationsCorpus/tree/master/pipeline.

The indexing process

Whilst the toolchain is capable of running threaded, the server we are using only has 2 cores and I was not sure to what extent they would be utilised, so I ran the process single-threaded. It took five hours and ten minutes to build an index of the PMC OA subset, and we now have over 500,000 records. We can full-text search them and facet browse them.
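As a rough illustration of what "threadable" can mean here, the sketch below runs the same per-archive step either in a plain loop or across a thread pool. It is not the project's code: process_archive is a hypothetical stand-in for the unpack, parse and convert work on one file, and the workers parameter is my own naming.

```python
from concurrent.futures import ThreadPoolExecutor


def process_archive(path):
    # Hypothetical stand-in for "unpack, parse, convert" on one archive.
    return {"source": path, "records": 1}


def run_toolchain(paths, workers=1):
    """Process every archive, single-threaded by default."""
    if workers <= 1:
        return [process_archive(p) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(process_archive, paths))
```

In CPython, threads mostly help with the waiting parts of such a job (disk and network), since pure-Python parsing is limited by the global interpreter lock. On a 2-core server the gain from threading may therefore be modest.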

Some things of particular interest that I learnt: I have an article in the PMC OA! Also, PMIDs are not always 8 digits long; they appear in fact to be incremental from 1.
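The practical consequence of the PMID observation is that any check on identifiers should not insist on a fixed width. A minimal sketch, assuming PMIDs are written as positive integers without leading zeros (the function name is hypothetical):

```python
import re

# Any positive integer, however many digits; no fixed 8-digit width assumed.
_PMID_PATTERN = re.compile(r"[1-9][0-9]*")


def is_plausible_pmid(value):
    """Return True if the string looks like a PMID under the assumption above."""
    return _PMID_PATTERN.fullmatch(value.strip()) is not None
```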

What next

At the moment no effort is made to create record objects for the citations we find within these records. However, plugging that into the toolchain is now relatively straightforward.
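One way this could look, as a sketch only: walk the reference list of an article and emit a stub record per reference. It assumes NLM/JATS-style ref, article-title and pub-id elements. The stub shape (type, ref_id, identifier) and the function name are my own invention for illustration, not something the toolchain currently produces.

```python
import xml.etree.ElementTree as ET

# Made-up reference list: one structured citation, one unstructured.
SAMPLE_REFS = """<article><back><ref-list>
<ref id="B1"><element-citation>
<article-title>A cited paper</article-title>
<pub-id pub-id-type="pmid">42</pub-id>
</element-citation></ref>
<ref id="B2"><mixed-citation>Unstructured reference text.</mixed-citation></ref>
</ref-list></back></article>"""


def citation_stubs(xml_text):
    """Return one stub record per <ref> element found in the article."""
    root = ET.fromstring(xml_text)
    stubs = []
    for ref in root.iter("ref"):
        stub = {"type": "citation", "ref_id": ref.get("id"), "identifier": []}
        title = ref.find(".//article-title")
        if title is not None:
            stub["title"] = "".join(title.itertext()).strip()
        for pub_id in ref.iter("pub-id"):
            stub["identifier"].append(
                {"type": pub_id.get("pub-id-type"), "id": (pub_id.text or "").strip()}
            )
        stubs.append(stub)
    return stubs
```

References that carry an identifier such as a PMID can later be matched against records already in the index. The unstructured ones would need further parsing before they are useful.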

The full pipeline is of course still in progress, and so this work will need a wee bit of wiring into it.

Improve parsing. There are probably improvements to the parsing that we can make too, so one of the next tasks will be to look at a few choice records and decide how better to parse them. The best way to look at the records for now is to use a browser like Firefox or Chrome with the JSONView plugin installed. Go to occ.cottagelabs.com, have a bit of a search, then click the small blue arrows at the start of a record you are interested in to see it in full JSON straight from the index. Some further analysis on a few of these records would be a great next step, and should allow for improvements to both the data we can parse and to our representation of it.
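For the record analysis, a quick first pass could be to count how often each field is actually populated across a sample of records, which points at the fields the parser tends to miss. This is a minimal sketch that assumes the records are plain dicts as they come back from the index; the function name is hypothetical.

```python
from collections import Counter


def field_coverage(records):
    """Return the fraction of records in which each field has a non-empty value."""
    counts = Counter()
    for rec in records:
        for key, value in rec.items():
            # Treat None, empty strings and empty containers as "not populated".
            if value not in (None, "", [], {}):
                counts[key] += 1
    total = len(records)
    return {key: counts[key] / total for key in counts} if total else {}
```

A field that shows up in only a small fraction of records is a good candidate for a closer look at the source XML.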

Finish visualisations. Now that we have a good test dataset to work with, the various bits and pieces of visualisation work will be pulled together and put up on display somewhere soon. These, in addition to the search functionality already available, will enable us to answer the questions set as representative of project goals earlier in January (thanks David for those).