When I joined Microsoft in 2005 to create an ‘eScience’ research program with universities, Turing Award winner Jim Gray became a colleague as well as a friend. I had first met Jim in 2001 and spent the next four years having great debates with him about eScience. Roughly speaking, eScience is about using advanced computing technologies to assist scientists in dealing with an ever-increasing deluge of scientific data. Although Jim was a pioneer of relational databases and transaction processing for the IT industry, he had recently started working with scientists to demonstrate the value of database technologies on their large datasets and to use them to ‘stress test’ Microsoft’s SQL Server product. With astronomer Alex Szalay from Johns Hopkins University, Jim and some of Alex’s students built one of the first Web Services for scientific data. The data was from the Sloan Digital Sky Survey (SDSS) – which is something like the astronomical equivalent of the human genome project. Although the tens of terabytes in the SDSS now seem a quite modest amount of data, the Sloan survey was the first high-resolution survey of more than a quarter of the night sky. After the first phase of operation, the final SDSS dataset included 230 million celestial objects detected in 8,400 square degrees of imaging and spectra of 930,000 galaxies, 120,000 quasars, and 225,000 stars. Since there are only around 10,000 professional astronomers, publishing the data on the SkyServer web site http://cas.sdss.org/dr7/en/ constituted a new model of scholarly communication – one in which the data is published before it has all been analyzed. The public availability of such a large amount of astronomy data led to one of the first really successful ‘citizen science’ projects. GalaxyZoo, http://www.galaxyzoo.org/, asked the general public for help in classifying a million galaxy images from the SDSS.
More than 50 million classifications were received by the project during its first year, and more than 150,000 people participated. Jim’s SkyServer and the Sloan Digital Sky Survey pioneered not only open data and a new paradigm for publication but also a crowd-sourcing framework for genuine citizen science.
Jim also worked with David Lipman and colleagues at the National Center for Biotechnology Information, NCBI, a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The NIH had established a policy on open access that required that
‘all investigators funded by the NIH submit … to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication.’
The NIH’s PubMed Central deposit policy was initially voluntary, but a mandatory requirement was signed into law by George W. Bush in late 2007. The compliance rate then improved dramatically, and the NIH have now taken the further step of announcing that, sometime in 2013, they ‘will hold processing of non-competing continuation awards if publications arising from grant awards are not in compliance with the Public Access Policy.’
PubMed Central is a freely accessible database of full-text research papers in the biomedical and life sciences. The clear benefits of such an open access archive of peer-reviewed papers are summarized on the NIH website http://publicaccess.nih.gov/FAQ.htm#753
‘Once posted to PubMed Central, results of NIH-funded research become more prominent, integrated and accessible, making it easier for all scientists to pursue NIH’s research priority areas competitively. PubMed Central materials are integrated with large NIH research data bases such as Genbank and PubChem, which helps accelerate scientific discovery. Clinicians, patients, educators, and students can better reap the benefits of papers arising from NIH funding by accessing them on PubMed Central at no charge. Finally, the Policy allows NIH to monitor, mine, and develop its portfolio of taxpayer funded research more effectively, and archive its results in perpetuity.’
Jim’s work with NCBI was to help them develop a ‘portable’ version of the repository software, pPMC, that could be deployed at sites in other countries. In the UK, the Wellcome Trust, a major funder of biomedical research, had adopted an open access policy similar to that of the NIH. With assistance from NCBI, Wellcome collaborated with the British Library and JISC to deploy the portable version of the PubMed Central archive software. The UK PubMed Central repository was established in 2007. Just last year, this was enlarged and re-branded as Europe PubMed Central http://europepmc.org/ since this service is now also supported by funding agencies in Italy and Austria and by the European Research Council. PMC Canada was launched in 2009.
NCBI were also responsible for developing two XML-based Document Type Definitions, or DTDs:
‘The Publishing DTD defines a common format for the creation of journal content in XML. The Archiving DTD also defines journal articles, but it has a more open structure; it is less strict about required elements and their order. The Archiving DTD defines a target content model for the conversion of any sensibly structured journal article and provides a common format in which publishers, aggregators, and archives can exchange journal content.’
These DTDs have now been adopted by NISO, the National Information Standards Organization, and form the basis for NISO’s Journal Article Tag Suite or JATS http://jats.nlm.nih.gov/index.html
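To give a feel for what such a common XML format makes possible, here is a minimal sketch in Python. It assumes a small hand-written fragment that uses JATS-style element names (article, front, article-meta, contrib and so on); the journal, title and author names are made up for illustration, the fragment is not validated against the DTD, and the summarize helper is hypothetical rather than part of any NCBI tool.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative fragment using JATS-style element names.
# The content is invented and the fragment is NOT validated against the DTD.
SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Roe</surname><given-names>John</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""


def summarize(xml_text):
    """Pull journal, title and author names out of a JATS-style article.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    meta = root.find("front/article-meta")
    journal = root.findtext(
        "front/journal-meta/journal-title-group/journal-title"
    )
    title = meta.findtext("title-group/article-title")
    authors = []
    # Select only contributors explicitly marked as authors.
    for contrib in meta.findall("contrib-group/contrib[@contrib-type='author']"):
        given = contrib.findtext("name/given-names")
        surname = contrib.findtext("name/surname")
        authors.append(f"{given} {surname}")
    return {"journal": journal, "title": title, "authors": authors}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because every article marked up this way puts its title and contributors in the same places, the same few lines would work across content from any publisher or archive that follows the tag suite, which is the point of having a shared exchange format.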
As is now well known, Jim Gray was lost at sea at the end of January 2007. A few weeks before this tragic event, Jim had given a talk to the National Research Council’s Computer Science and Telecommunications Board. With Gordon Bell’s encouragement, two colleagues and I edited a collection of articles about Jim’s vision of a ‘Fourth Paradigm’ of data-intensive scientific research (http://research.microsoft.com/en-us/collaboration/fourthparadigm/default.aspx). The collection also included a write-up of Jim’s last talk, in which he talked about not one, but two revolutions in research. The first revolution was the Fourth Paradigm; the second was what he called ‘The Coming Revolution in Scholarly Communication’. In this section, Jim talked about the pioneering efforts towards open access for NIH-funded life sciences research with NCBI’s full-text repository PubMed Central. But he believed that the Internet could do much more than just make available the full text of research papers:
‘In principle, it can unify all the scientific data with all the literature to create a world in which the data and the literature interoperate with each other (Figure 3). You can be reading a paper by someone and then go off and look at their original data. You can even redo their analysis. Or you can be looking at some data and then go off and find out all the literature about this data. Such a capability will increase the “information velocity” of the sciences and will improve the scientific productivity of researchers. And I believe that this would be a very good development!’
I include his Figure 3 below:
After talking about open access, overlay journals, peer review and publishing data, Jim goes on to discuss the role that ontologies and semantics will play on the road from data to information to knowledge. As a specific example, he talks about Entrez, a wonderful cross-database search tool supported by the NCBI:
‘The best example of all of this is Entrez, the Life Sciences Search Engine, created by the National Center for Biotechnology Information for the NLM. Entrez allows searches across PubMed Central, which is the literature, but they also have phylogeny data, they have nucleotide sequences, they have protein sequences and their 3-D structures, and then they have GenBank. It is really a very impressive system. They have also built the PubChem database and a lot of other things. This is all an example of the data and the literature interoperating. You can be looking at an article, go to the gene data, follow the gene to the disease, go back to the literature, and so on. It is really quite stunning!’
This was Jim’s vision for the future of scientific research – an open access world of full text publications and data, a global digital library that can truly accelerate the progress of science. Of course, the databases at NCBI are all carefully curated and marked up using the NLM DTDs. Outside NCBI’s walled garden, in the wild world, we have a plethora of different archives, repositories and databases – and replicating the success of a federated search tool like Entrez will be difficult. Yet this is the vision that inspires me. And it is this vision that leads me to support the open access movement for reasons that go beyond the blunt economic fact that the university library system can no longer afford what publishers are offering.
To be continued …
Originally posted in 2013