A Journey to Open Access – Part 4

Part 4: Open Access in the UK: The Finch Report and RCUK’s Open Access Policy

In the UK, the JISC organization has long pioneered the exploration of different models of open access and, in particular, the role of institutional repositories.  Although JISC’s future is now somewhat uncertain because of the recent change in its funding status to that of a charity, JISC is seen internationally as a major innovator in the use of advanced ICT in higher education. In Europe, only the Dutch SURF organization can match the breadth and originality of JISC programs. Such an innovative ‘applied research’ funding agency is lacking in the US—although the role of JISC is partially met by organizations such as the Mellon Foundation.

Until 2006, I was Chair of the JISC Committee in Support of Research. Our Committee was able to fund many innovative projects and initiatives, including the pilot study that led to the adoption of the Internet2 Shibboleth authentication system by UK universities, the establishment of the Digital Curation Centre (DCC) in Edinburgh, a test-bed ‘lambda network’ for high data-rate transfers and an experimental text mining service offered by the National Centre for Text Mining (NaCTeM) in Manchester. In April 2005, my committee produced a leaflet explaining the basics of ‘Open Access’. I particularly remember having to insist that the author of the leaflet, one Alma Swan, now well known to the Open Access community, should put the section on ‘Green Open Access’ via repositories before the section on ‘Gold Open Access’ journals.

Other committees of JISC also funded a large number of projects exploring different aspects of open access repositories. From 2002 to 2005, the JISC FAIR Program—Focus on Access to Institutional Repositories—funded projects like the SHERPA project at Nottingham and the TARDis project at Southampton. From 2006 to 2007, the JISC Digital Repositories Program funded another 20 projects, including the OpenDOAR project—a directory of academic open access repositories—and the EThOS project—to build a national e-thesis service. JISC also funded a Repository and Preservation Program, which included the PRESERV project at Southampton that looked at preservation issues for e-prints. All of this preamble is intended to show that the UK has had a vibrant and active ‘research repository community’ for over a decade. The ROAR site currently lists 250 UK university repositories. It is unfortunate that the ‘Working Group on Expanding Access to Published Research Findings’—better known as the Finch Committee—seems to have chosen to ignore much of this seminal work.

The UK government has adopted an explicit commitment to openness and transparency. In the context of research, this has been interpreted as making the results of ‘publicly funded research’ open, accessible and exploitable. The government’s belief is that open access to research results will drive innovation and growth as well as increasing the public’s trust in research. With such a laudable intent, the government set up the Finch Committee to explore how best the UK could ‘expand access to published research findings’. Unfortunately for the outcome, conventional scholarly publishers were the best represented stakeholder group on the committee, which consisted of five publishers, four researchers or university administrators, three funders and two librarians. The majority of the ‘Finch Report’ recommendations were accepted by Minister David Willetts, and a version of them was promulgated by the combined Research Councils organization, RCUK—roughly equivalent to the NSF—in July 2012. The RCUK policy can be summarized as follows (quoting Peter Suber’s SPARC Open Access Newsletter, issue #165):

  • RCUK-funded authors ‘must’ publish in RCUK-compliant journals. A journal is RCUK-compliant if it offers a suitable gold option or a suitable green option. It need not offer both.
  • To offer a suitable gold option, a journal must provide immediate (un-embargoed) OA to the version of record from its own website, under a CC-BY license, and must allow immediate deposit of the version of record in an OA repository, also under a CC-BY license. It may but need not levy an article processing charge (APC).
  • To offer a suitable green option, a journal must allow deposit of the peer-reviewed manuscript (with or without subsequent copy-editing or formatting) in an OA repository not operated by the publisher.

To compensate the publishers—or, in the view of the Finch Committee, give them time to move their business models to accommodate the new open access world—the Finch Report advocates increasing funding to publishers ‘during a transition period’ by establishing ‘publication funds within individual universities to meet the costs of APCs’. In addition, the report also explicitly deprecates the use of institutional repositories by effectively relegating them to only providing ‘effective routes to access for research publications including reports, working papers and other grey literature, as well as theses and dissertations’.

Peter Suber, a very balanced advocate for open access, has given a detailed critique of these recommendations—as well as enumerating several erroneous assumptions made by the group about open access journals and repositories (see issue #165 of the SPARC Open Access Newsletter). Let me highlight some key points that he makes—with which I am in entire agreement.

First and foremost, we should all applaud the group for its robust statement in favor of open access:

the principle that the results of research that has been publicly funded should be freely accessible in the public domain is a compelling one, and fundamentally unanswerable.

Similarly, the Finch Committee are equally forthright about their intent to induce change in the scholarly publishing industry:

Our recommendations and the establishment of systematic and flexible arrangements for the payment of APCs will stimulate publishers to provide an open access option in more journals.

Minister David Willetts endorsed this goal and told the Publishers Association that:

To try to preserve the old model is the wrong battle to fight.

Let me be clear: these statements represent huge progress for the Open Access movement in the UK. The Government is to be commended on its stance on openness. Unfortunately, I feel that the Finch Committee missed an opportunity by not supporting mandated green open access repositories in addition to gold OA.

A major problem with the Finch and RCUK endorsements of gold OA as the preferred route to open access—and their explicit deprecation of green OA—is that the proposed interim settlement is unreasonably generous to the publishers at the expense of the UK Research Councils and HEFCE-funded UK universities. By giving publishers the choice of being paid for gold OA or offering an unpaid green OA option, it is clear that publishers will cancel their green option and opt to pick up more money by introducing a gold option. Their shareholders would demand no less. Even the majority of OA publishers who currently charge no APC—contrary to the assumptions of the Finch Group—will be motivated to pick up the money on the table. Similarly, publishers who now only offer Toll Access via subscriptions will be quite happy to pick up more money by offering a gold OA option in addition to their subscription charges.

As I made clear in Part 2 of this series of articles on open access, the serials crisis means that universities are already unable to afford the subscriptions to Toll-Access (TA) journals that the publishers are offering. To offer them more money to effect some change that they should have initiated over a decade ago seems to me to make no sense. Instead of making generous accommodations for the interests of publishers, the Finch Group should have looked at the problem purely from the point of view of what was in the public interest. Now that publishers receive articles in electronic form, and research papers can be disseminated via the web at effectively zero cost, what have publishers done in the last fifteen years or more to adapt their business models to these new realities? The answer is that they have raised journal prices by far more than the rise in the cost of living. It is this rise in subscription costs that has resulted in subscription cancellations – not competition caused by the availability of articles in green open access repositories.

Despite green OA approaching the 100% level in physics, both the American Physical Society and the Institute of Physics have said publicly that they have seen no cancellations they can attribute to arXiv and green OA. Similarly, the Nature Publishing Group has said openly that ‘author self-archiving [is] compatible with subscription business models’. The American Association for the Advancement of Science (AAAS)—who publish ‘Science’—also ‘endorse the green-mandating NIH policy’. There is much concern in the Finch Report for scholarly society publishers. In fact, a survey in December 2011 showed that 530 scholarly societies currently publish over 600 OA journals. While it is true that some societies use subscription prices to subsidize other member activities, this need not be the case. Now that we have the web, the monopoly endowed by ownership of a printing press is gone forever. Just ask the music industry or the news media.

Let me give three anecdotal examples of the serials crisis:

  • In 2007, the University of Michigan’s libraries cancelled about 2,500 journal subscriptions because of budget cuts and the increasing costs of the subscriptions.
  • In 2008, Professor Stuart Shieber of Harvard explained ‘that cumulative price increases had forced the Harvard library to undertake “serious cancellation efforts” for budgetary reasons’.
  • In 2009–2011, the UC San Diego Libraries continued to cancel journal subscriptions because of budget cuts and increasing costs of subscriptions. Around 500 titles ($180,000 worth) were cancelled in FY 2009/10, and about the same number were projected to be cancelled in FY 2010/11. The Libraries also closed many of their satellite libraries.

In fact, any research university library around the world will have a similar story to tell. When even such a relatively wealthy university as Harvard has problems with journal subscription increases, surely it is time to take note!

The transitional period envisaged by Finch and RCUK is projected to cost the UK Research Councils and universities a minimum of £37M over the next two years. This is money that will have to come out of hard-pressed Research Council budgets and already reduced university HEFCE funding. Instead of continuing to listen to the special pleading of publishers, what is needed now is some leadership from RCUK. They need to put in place a sensible policy that does not unduly ‘feather-bed’ the publishers and that is also affordable for UK universities. Instead of being overly concerned with the risks of open access to commercial publishers, RCUK should remember its role as a champion of the public interest.

What should RCUK do now? In my opinion, RCUK could make a very small but significant change in its open access policy and adopt a rights-retention green OA mandate that requires ‘RCUK-funded authors to retain certain non-exclusive rights and use them to authorize green OA’. In the words of Peter Suber, this would ‘create a standing green option regardless of what publishers decide to offer on their own.’ In addition, RCUK should recommend that universities follow the Open Access Policy Guidelines of Harvard, set out by its Office of Scholarly Communication. Under this policy, Harvard authors are required to deposit a full text version of their paper in DASH, the Harvard open access repository, even in the case where the publisher does not permit open access and the author has been unable to obtain a waiver from the publisher.

The scholarly publishers have had plenty of time to read the writing on the wall. For more than fifteen years they have shown themselves unwilling to adjust to the new reality. It seems manifestly unreasonable to give them significantly more money and more time to do what they should have been exploring fifteen years ago. By insisting on a green option, RCUK will help generate the required and inevitable changes to the scholarly publishing business and get a fairer deal for both academia and the tax-paying public.

In this short overview I have omitted many subtleties and details—such as embargo times, ‘libre green’, CC-BY licenses and other flavors of green OA. Peter Suber’s SPARC Open Access Newsletter #165 and his book on Open Access (MIT Press Essential Knowledge Series, 2012) give a much more complete discussion with detailed references.

Also, in the interests of full disclosure, I should stress that I am not ‘anti-publisher’ and have been an editor for the Wiley journal, ‘Concurrency and Computation: Practice and Experience’ (CCP&E), for many years. In fact, it is ironic that my university, Southampton, could not afford to subscribe to CCP&E even though it was essential reading for my research group of over 30 researchers. From this experience, and from my time as Dean of Engineering, I came to believe that the unsustainable, escalating costs of journal subscriptions together with the advent of the web have irrevocably changed what we require from the scholarly publishing industry. And, after working with many different research disciplines during my time as the UK’s e-Science Director, and now at Microsoft Research, I have seen at first hand the inefficiencies of the present system and the large amount of unnecessary ‘re-inventing the wheel’ that goes on in the name of original research. Because of this, I passionately believe that open access to full text research papers and to the research data can dramatically improve the efficiency of scientific research. And the world surely needs to solve some major health and environmental challenges!

To be continued…


A Journey to Open Access – Part 3

Part 3: Jim Gray and the Coming Revolution in Scholarly Communication

When I joined Microsoft in 2005 to create an ‘eScience’ research program with universities, Turing Award winner Jim Gray became a colleague as well as a friend. I had first met Jim in 2001 and spent the next four years having great debates about eScience. Roughly speaking, eScience is about using advanced computing technologies to assist scientists in dealing with an ever-increasing deluge of scientific data. Although Jim was a pioneer of relational databases and transaction processing for the IT industry, he had recently started working with scientists to demonstrate the value of database technologies on their large datasets and to use them to ‘stress test’ Microsoft’s SQL Server product. With astronomer Alex Szalay from Johns Hopkins University, Jim and some of Alex’s students built one of the first Web Services for scientific data. The data was from the Sloan Digital Sky Survey (SDSS) – which is something like the astronomical equivalent of the human genome project. Although the tens of terabytes of the SDSS now seems a quite modest amount of data, the Sloan survey was the first high resolution survey of more than a quarter of the night sky. After the first phase of operation, the final SDSS dataset included 230 million celestial objects detected in 8,400 square degrees of imaging and spectra of 930,000 galaxies, 120,000 quasars, and 225,000 stars. Since there are only around 10,000 professional astronomers, publishing the data on the SkyServer website constituted a new model of scholarly communication – one in which the data is published before it has all been analyzed.

The public availability of such a large amount of astronomical data led to one of the first really successful ‘citizen science’ projects: Galaxy Zoo asked the general public for help in classifying a million galaxy images from the SDSS. More than 50 million classifications were received by the project during its first year, and more than 150,000 people participated. Jim’s SkyServer and the Sloan Digital Sky Survey pioneered not only open data and a new paradigm for publication but also a crowd-sourcing framework for genuine citizen science.
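
To give a feel for what ‘publishing the data before it has all been analyzed’ means in practice, here is a minimal sketch of the kind of SQL query that a SkyServer-style service lets anyone run over the SDSS catalogue. The endpoint URL and the request parameter names below are placeholder assumptions, not the actual SkyServer interface; the table and column names (PhotoObj, ra, dec and the ugriz magnitudes) do follow the published SDSS schema.

  # A sketch of querying an SDSS catalogue through a SkyServer-style SQL search
  # service. The endpoint URL and the 'cmd'/'format' parameter names are
  # assumptions for illustration; consult the SkyServer documentation for the
  # real interface. Table and column names follow the SDSS schema.
  import urllib.parse
  import urllib.request

  SQL_ENDPOINT = "https://skyserver.example.org/SearchTools/SqlSearch"  # placeholder

  QUERY = """
  SELECT TOP 10 objID, ra, dec, u, g, r, i, z
  FROM PhotoObj
  WHERE ra BETWEEN 180.0 AND 180.1
    AND dec BETWEEN 0.0 AND 0.1
  """

  params = urllib.parse.urlencode({"cmd": QUERY, "format": "csv"})
  with urllib.request.urlopen(f"{SQL_ENDPOINT}?{params}") as response:
      print(response.read().decode("utf-8"))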

Jim also worked with David Lipman and colleagues at the National Center for Biotechnology Information, NCBI, a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The NIH had established a policy on open access requiring that

‘all investigators funded by the NIH submit… to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication.’

The NIH’s PubMed Central deposit policy was initially voluntary, but a mandatory version was signed into law by George W. Bush in late 2007. Compliance then improved dramatically, and the NIH have now taken the further step of announcing that, sometime in 2013, they ‘will hold processing of non-competing continuation awards if publications arising from grant awards are not in compliance with the Public Access Policy.’

PubMed Central is a freely accessible database of full-text research papers in the biomedical and life sciences. The clear benefits of such an open access archive of peer-reviewed papers are summarized on the NIH website.

‘Once posted to PubMed Central, results of NIH-funded research become more prominent, integrated and accessible, making it easier for all scientists to pursue NIH’s research priority areas competitively. PubMed Central materials are integrated with large NIH research data bases such as Genbank and PubChem, which helps accelerate scientific discovery. Clinicians, patients, educators, and students can better reap the benefits of papers arising from NIH funding by accessing them on PubMed Central at no charge. Finally, the Policy allows NIH to monitor, mine, and develop its portfolio of taxpayer funded research more effectively, and archive its results in perpetuity.’

Jim’s work with NCBI was to help them develop a ‘portable’ version of the repository software, pPMC, that could be deployed at sites in other countries. In the UK, the Wellcome Trust, a major funder of biomedical research, had adopted a similar open access policy to the NIH’s. With assistance from NCBI, Wellcome collaborated with the British Library and JISC to deploy the portable version of the PubMed Central archive software. The UK PubMed Central repository was established in 2007. Just last year, this was enlarged and re-branded as Europe PubMed Central, since this service is now also supported by funding agencies in Italy and Austria and by the European Research Council. PMC Canada was launched in 2009.

NCBI were also responsible for developing two XML-based Document Type Definitions, or DTDs:

‘The Publishing DTD defines a common format for the creation of journal content in XML. The Archiving DTD also defines journal articles, but it has a more open structure; it is less strict about required elements and their order. The Archiving DTD defines a target content model for the conversion of any sensibly structured journal article and provides a common format in which publishers, aggregators, and archives can exchange journal content.’

These DTDs have now been adopted by NISO, the National Information Standards Organization, and form the basis for NISO’s Journal Article Tag Suite or JATS.
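
To give a flavour of the tag structure that these DTDs (and their JATS successor) define, here is a minimal sketch that builds a skeletal article fragment. The element names follow the familiar JATS/NLM conventions (article, front, article-meta, article-title), but the fragment is purely illustrative and is not a validated instance of either DTD.

  # A minimal, illustrative sketch of JATS/NLM-style article markup. This is
  # not a validated instance of the Publishing or Archiving DTD; it only shows
  # the general shape of the tag structure.
  import xml.etree.ElementTree as ET

  article = ET.Element("article", {"article-type": "research-article"})
  front = ET.SubElement(article, "front")

  journal_meta = ET.SubElement(front, "journal-meta")
  ET.SubElement(journal_meta, "journal-title").text = "An Example Journal"

  article_meta = ET.SubElement(front, "article-meta")
  title_group = ET.SubElement(article_meta, "title-group")
  ET.SubElement(title_group, "article-title").text = "A Hypothetical Open Access Paper"

  body = ET.SubElement(article, "body")
  section = ET.SubElement(body, "sec")
  ET.SubElement(section, "p").text = "The full text of the article goes here."

  print(ET.tostring(article, encoding="unicode"))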

As is now well-known, Jim Gray was lost at sea at the end of January 2007. A few weeks before this tragic event, Jim had given a talk to the National Research Council’s Computer Science and Telecommunications Board. With Gordon Bell’s encouragement, two colleagues and I edited a collection of articles about Jim’s vision of a ‘Fourth Paradigm’ of data-intensive scientific research. The collection also included a write-up of Jim’s last talk, in which he talked about not one but two revolutions in research. The first revolution was the Fourth Paradigm itself; the second was what he called ‘The Coming Revolution in Scholarly Communication’. In this part of the talk, Jim described the pioneering efforts towards open access for NIH-funded life sciences research with NCBI’s full-text repository PubMed Central. But he believed that the Internet could do much more than just make available the full text of research papers:

‘In principle, it can unify all the scientific data with all the literature to create a world in which the data and the literature interoperate with each other (Figure 3). You can be reading a paper by someone and then go off and look at their original data. You can even redo their analysis. Or you can be looking at some data and then go off and find out all the literature about this data. Such a capability will increase the “information velocity” of the sciences and will improve the scientific productivity of researchers. And I believe that this would be a very good development!’

I include his Figure 3 below:

[Figure 3: the scientific data and the literature interoperating]

After talking about open access, overlay journals, peer review and publishing data, Jim goes on to discuss the role that ontologies and semantics will play on the road from data to information to knowledge. As a specific example, he talks about Entrez, a wonderful cross-database search tool supported by the NCBI:

‘The best example of all of this is Entrez, the Life Sciences Search Engine, created by the National Center for Biotechnology Information for the NLM. Entrez allows searches across PubMed Central, which is the literature, but they also have phylogeny data, they have nucleotide sequences, they have protein sequences and their 3-D structures, and then they have GenBank. It is really a very impressive system. They have also built the PubChem database and a lot of other things. This is all an example of the data and the literature interoperating. You can be looking at an article, go to the gene data, follow the gene to the disease, go back to the literature, and so on. It is really quite stunning!’
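
As a small, concrete illustration of the kind of literature-to-data linking Jim describes, the sketch below uses NCBI’s public Entrez E-utilities (the esearch and elink endpoints) to find PubMed records and then follow their links into the Gene database. The particular search term and target database are illustrative choices, not part of Jim’s talk.

  # A minimal sketch of literature/data cross-linking via NCBI's Entrez
  # E-utilities. The search term and the PubMed -> Gene link are illustrative
  # choices only.
  import json
  import urllib.parse
  import urllib.request

  EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

  def esearch(db, term, retmax=3):
      """Search an Entrez database and return a list of record IDs."""
      params = urllib.parse.urlencode(
          {"db": db, "term": term, "retmax": retmax, "retmode": "json"})
      with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{params}") as resp:
          return json.load(resp)["esearchresult"]["idlist"]

  def elink(dbfrom, db, ids):
      """Follow Entrez links from one database to another (e.g. PubMed to Gene)."""
      params = urllib.parse.urlencode(
          {"dbfrom": dbfrom, "db": db, "id": ",".join(ids), "retmode": "json"})
      with urllib.request.urlopen(f"{EUTILS}/elink.fcgi?{params}") as resp:
          return json.load(resp)

  if __name__ == "__main__":
      pmids = esearch("pubmed", "BRCA1 AND breast cancer")   # the literature...
      gene_links = elink("pubmed", "gene", pmids)            # ...linked to the data
      print("PubMed IDs:", pmids)
      print(json.dumps(gene_links, indent=2)[:500])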

This was Jim’s vision for the future of scientific research – an open access world of full text publications and data, a global digital library that can truly accelerate the progress of science. Of course, the databases at NCBI are all carefully curated and marked up using the NLM DTDs. Outside NCBI’s walled garden, out in the wild, we have a plethora of different archives, repositories and databases – and replicating the success of a federated search tool like Entrez will be difficult. Yet this is the vision that inspires me. And it is this vision that leads me to support the open access movement for more than just the blunt economic fact that the university library system can no longer afford what publishers are offering.

To be continued …


A Journey to Open Access – Part 2

Part 2: University Research Management and Institutional Repositories

University Deans are required to do many things for their university, including taking some responsibility for the research output of their Faculty. Each year, capturing all forms of research deliverables – journal papers, technical reports, conference and workshop proceedings, presentations, and doctoral and Master’s theses – is a necessary and important chore. This is especially important in the UK – where the research funds allocated to each department by the Government are explicitly linked to the quality of its research over a four- or five-year period.

First as Chair of the Electronics and Computer Science Department, and then as Dean of Engineering at the University of Southampton, I was responsible for two of these ‘Research Assessment’ cycles in the UK. It was during the preparation of these research returns that I encountered an interesting problem: the University library could no longer afford to subscribe to all the journals in which our 200 engineering faculty members – plus a similar number of postdocs and graduate students – chose to publish. This meant that just assembling the published copies of all the publications of all research staff and students became a much less straightforward exercise. The reason for this problem is well-known to librarians – it is the so-called ‘serials crisis’. This crisis is dramatically illustrated below in a graph that shows the relative growth of serial expenditures at ARL Libraries versus the consumer price index over the past twenty-five years.

[Figure: Serial Expenditures and CPI Trends in ARL Libraries, 1986–2011]

These are typical expenditure curves for all university libraries – and the University of Southampton was no exception. It is for this reason that the University Library sends out a questionnaire each year asking staff which journals they would least mind cancelling! Yet the serials crisis is a curious sort of crisis in that most research staff are simply unaware of any problem. They feel free to publish in whatever journal is most appropriate for their research and see no reason to restrict their choice to the journals that the University can afford to subscribe to.

The Research Assessment exercise in the UK is intended to measure ‘research impact’ and this is judged in a number of ways. One form of research impact that can easily be measured is the number of citations by other researchers to each paper. In order to garner citations, a research paper needs to be accessible and read by other researchers. Not all researchers – and certainly not the general public whose taxes have usually helped fund the research – have access to all research journals. Physicists have solved this accessibility problem by setting up arXiv – a repository for un-refereed, pre-publication ePrints. The US National Library of Medicine has solved the accessibility problem in a different fashion. The full text of all research papers produced from research funded by the National Institutes of Health is required to be deposited in the PubMed Central (PMC) repository after publication in a journal, usually after some ‘reasonable’ embargo period of 6 to 12 months. Similar open access policies have now been adopted by other funders of biomedical research such as the Wellcome Trust and the Bill and Melinda Gates Foundation.

The repositories PMC and arXiv are examples of subject-specific, centralized research repositories. However, it is my firm belief that each research university needs to establish and maintain its own open access ‘institutional repository’ covering all the fields of research pursued by the university. At Southampton, in the Electronics and Computer Science Department, with colleagues Les Carr, Wendy Hall and Stevan Harnad, we established a Departmental Repository to capture the full text versions of all the research output of the Department to assist us in monitoring and assessing our research impact. A graduate student in the Department, Rob Tansley, worked with Les Carr and Stevan Harnad to develop, in 2000, the EPrints open source repository software. Rob went on to work for Hewlett-Packard Laboratories in the US and wrote the DSpace repository software in collaboration with MIT. The EPrints and DSpace repository software are now used by many hundreds of universities around the world. For a list of repositories and software see: http://roar.eprints.org/

As Dean of Engineering, I tried to use the example of the EPrints repository in Electronics and Computer Science as a model for the entire Engineering Faculty. By the time I left Southampton, this had only partially been implemented, but I was enormously pleased to see that by 2006 the University had mandated that all research papers from all departments must be deposited in the ‘ePrints Soton’ repository. In 2008, this was extended to include PhD and MPhil theses. For more details of Southampton’s research repository, well managed by the University Library, see: http://www.southampton.ac.uk/library/research/eprints/

There is much more that can be said about this ‘Green’ route to Open Access via deposit of the full text of research papers in Institutional Repositories. For a balanced account, I recommend Peter Suber’s recent book on ‘Open Access’ published by MIT Press, to be available under Open Access 12 months after publication. Peter describes the different varieties of Open Access – such as green/gold, gratis/libre – and also issues of assigning ‘permission to publish’ to publishers versus assigning copyright (https://mitpress.mit.edu/books/open-access). In addition, the Open Archives Initiative maintains two community-supported repository standards: OAI-PMH for harvesting repository metadata and OAI-ORE for aggregating resources from different sites into compound digital objects (http://www.openarchives.org/); a minimal harvesting example is sketched below. Also relevant is the Confederation of Open Access Repositories or COAR, whose website states:

COAR, the Confederation of Open Access Repositories, is a young, fast growing association of repository initiatives launched in October 2009, uniting and representing 90 institutions worldwide (Europe, Latin America, Asia, and North America). Its mission is to enhance greater visibility and application of research outputs through global networks of Open Access digital repositories.
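
To make the OAI-PMH piece of this concrete, here is a minimal harvesting sketch. The verb and metadataPrefix parameters and the XML namespaces are part of the standard protocol; the repository base URL, however, is a placeholder assumption – substitute the OAI endpoint of any repository listed in ROAR or OpenDOAR.

  # A minimal OAI-PMH harvesting sketch. The base URL is a placeholder; the
  # protocol parameters (verb, metadataPrefix) and namespaces are standard.
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  BASE_URL = "https://repository.example.ac.uk/cgi/oai2"  # hypothetical endpoint
  OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
  DC_NS = "{http://purl.org/dc/elements/1.1/}"

  def list_records(base_url, metadata_prefix="oai_dc"):
      """Yield (identifier, title) pairs for the records a repository exposes."""
      params = urllib.parse.urlencode(
          {"verb": "ListRecords", "metadataPrefix": metadata_prefix})
      with urllib.request.urlopen(f"{base_url}?{params}") as resp:
          tree = ET.parse(resp)
      for record in tree.iter(f"{OAI_NS}record"):
          identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
          title = record.findtext(f".//{DC_NS}title")
          yield identifier, title

  if __name__ == "__main__":
      for identifier, title in list_records(BASE_URL):
          print(identifier, "-", title)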

Why is all this important? It is important because the present scholarly communication model is no longer viable. While many journal publishers perform a valuable service in arranging peer review and in publishing high-quality print and online journals, the unfortunate truth is that universities can no longer afford the costs of the publishers’ present offerings. For example, it was not possible for me as Dean to establish a new research area in the Faculty and have the library purchase the relevant new journals. In such an unsustainable situation, it is obvious that we need to arrive at a more affordable scholarly publishing model. However, instead of just waiting for such a model to magically emerge, university librarians need to be proactive and take up their key role as the guardians of the intellectual output of their university’s researchers.

It is the university library that has both the resources and the expertise to maintain the university’s institutional research repository. And this is not just an academic exercise. Managing the university’s research repository will surely become a major part of the university’s ‘reputation management’ strategy. Studies of arXiv have shown there to be a significant citation advantage for papers first posted in arXiv and subsequently published in journals, compared to papers just published in journals (arXiv:0906.5418). Similarly, it is likely that versions of research papers that are made freely available through an institutional repository will also acquire a citation advantage – although this conclusion is currently controversial. Nevertheless, like it or not, universities will increasingly be evaluated and ranked on the published information they make available on the Web. For example, the Webometrics Ranking of World Universities takes account of the ‘visibility and impact’ of web publications and includes both an ‘openness’ and an ‘excellence’ measure for research repositories and citations (http://www.webometrics.info/). I am pleased to see that Southampton features in 32nd place in Europe and 119th in their World rankings.

To be continued…


A Journey to Open Access

Part 1: Green open access for over 20 years

My education into open access began over 40 years ago, when I was a practicing theoretical high energy physicist. This was in the 1970s – the days of typewriters – when we typed up our research papers, made 100 xerox copies and submitted the original to Physical Review, Nuclear Physics or whatever journal we wanted. The copies were sent round to our ‘peer’ high energy physics research groups around the world and were known as ‘preprints’. While the paper copy to the journal was undergoing refereeing, these preprints allowed researchers to immediately build upon and refer to work done by other researchers prior to publication. This was the preprint tradition in the fast moving field of high energy physics. When papers were accepted for publication, the references to preprints that had since been published were usually updated in the published version. It has always baffled me – now that I work in the field of computer science, which if anything is even faster moving than high energy physics – that there is no similar tradition. In computer science, it can take several years for a paper to get published in a journal – by which time the paper really only serves an archival purpose and as evidence for tenure committees. In contrast to the physics preprint system, the computer science community uses refereed workshop publications to provide a rapid – or at least more rapid – publication vehicle.

With the widespread availability of the Internet, and with the advent of the World Wide Web, theoretical physicist Paul Ginsparg set up a web site to save high energy physicists both the postage and the trouble of circulating preprints. The electronic version of the preprint – inevitably called an e-Print – is typically submitted to a journal and simultaneously posted to the arXiv website (http://arxiv.org/). This is now the standard method of scholarly communication for a very large fraction of the physics, astronomy and mathematics communities.

‘arXiv is the primary daily information source for hundreds of thousands of researchers in physics and related fields. Its users include 53 physics Nobel laureates, 31 Fields medalists and 55 MacArthur fellows, as well as people in countries with limited access to scientific materials. The famously reclusive Russian mathematician Grigori Perelman posted the proof for the 100-year-old Poincaré Conjecture solely in arXiv.’

Reference: http://phys.org/news142785151.html#jCp

The arXiv repository is now over 20 years old, has a submission rate of over 7,000 e-Prints per month, and makes full text versions of over half a million research papers available free both to researchers and to the general public. More than 200,000 articles are downloaded from arXiv each week by about 400,000 users. Most, but not all, of the e-Prints are eventually published in a journal, and this amounts to a sort of post-publication ‘quality stamp’. The apparent drawback of multiple, slightly different versions of a paper turns out not to be a serious problem in practice: citation counts for high energy physicists usually include either the e-Print version or the published version. A detailed study of the arXiv system by Anne Gentil-Beccot, Salvatore Mele and Travis C. Brooks is published as ‘Citing and Reading Behaviours in High-Energy Physics. How a Community Stopped Worrying about Journals and Learned to Love Repositories’. The paper is, of course, available as arXiv:0906.5418.

In the terminology of today, arXiv represents a spectacularly successful example of ‘Green Open Access’. This is the situation in which researchers continue to publish in refereed, subscription-based journals but also self-archive versions of their papers either in subject-based repositories – as for arXiv and the high energy physics community – or in institutionally-based repositories. In certain fields – such as the bio-medical area with the US PubMed Central repository – these full-text versions may only be available to the public after an embargo period of 6 or 12 months. The alternative open access model – so-called ‘Gold Open Access’ – is one in which researchers or their funders pay the journal publishers to make the full text version of the paper freely available.

Why should you care? The research described in the papers was typically funded by a grant from a Government funding agency – think NSF or NIH in the USA or RCUK in the UK. The research papers are reviewed by researchers whose salary generally also comes from the public purse. The publishers organize the review process and publish the journals – and then restrict access to these papers to those who can afford to pay a subscription to their journals. Since the research was both funded and reviewed by researchers supported by public money raised by taxes, it seems not unreasonable to demand that the general public should be allowed access to this research without having to pay an additional access fee. Now that we have the Web and the technology to make perfect digital copies of documents at zero cost, it is clear that the old rules, in which publishers controlled dissemination through the printing press, need to change – just as they have for music and journalism. No one begrudges publishers some reward for their efforts at quality control and at supporting a prestigious ‘branded journal’ like Nature. But, as will be seen in the next post, the central issue for universities is now one of affordability of their present journal offerings. Subscription fees to journals have risen much faster than inflation over the last 15 years or more and now constitute an unreasonable ‘tax’ on scarce research funds, one that goes to the shareholders of the publishing companies.

To be continued …


ICT and Education: MOOCs and all that

Master, Wardens, Mr. Alderman, Freemen and Guests, it is a great honour to be here in London with the Company of Educators. The Franklin lecture tonight highlighted some significant challenges for the UK Higher Education sector; I am afraid that I would like to say a few words about yet another challenge that may be facing the sector. I was at the University of Southampton for 25 years and then led a research initiative for the UK Research Councils for 5 years. During my career at the university, I heard many times about how Information and Communication Technology, or ICT, was going to transform education. I think that this time it may be for real.

First a word about technology. The two major ICT trends in the past 30 years relevant to the education sector are Moore’s Law and the Internet. Gordon Moore, one of the founders of Intel, observed that the number of transistors on a chip was doubling every 18 months. This has led to an amazing industry in which computers have got faster and cheaper for over 30 years. Now it is literally possible for people to store their whole life – photos, videos, emails, documents – on silicon. The second major trend is the growth of the Internet from a small academic research network to a network of networks that literally circles the planet. There are now over 2 billion Internet users and, importantly, the ‘non-PC’ Internet – connectivity of smart phones and many other devices – is the fastest growing segment. The future ‘Internet of Things’, when you will be able to interrogate your fridge from 30,000 feet in a plane, is not far away.

Now a word about timing. It is important to introduce technology ‘at the right time’. For example, Microsoft produced tablet computers many years before Apple without achieving significant widespread take-up. It was the continued miniaturization of silicon technology coupled with the increasingly widespread availability of WiFi and broadband connectivity that has led to the remarkable success of the iPad. The connectivity needed to be there so that users could easily use a tablet to browse the Web, to watch videos, to look at shared photos or to scan their email. I think a similar convergence of technologies might now have arrived in education. It is certainly true that the issue of MOOCs – an acronym for Massive Open Online Courses – is now obsessing the US HE sector. Let me explain in more detail.

The term ‘MOOC’ was coined by two academics, David Cormier and Bryan Alexander, in reference to a course given in 2008 at the University of Manitoba in Canada. The course had 25 local, tuition-paying students, but it also had 2,300 registered students from around the world who could participate in the course for free. The course content was made available in various ways – on the Web, via RSS feeds, blogs and wikis, for example. However, the online course that really set universities in America buzzing was one given at Stanford University in the autumn of 2011 by Peter Norvig and Sebastian Thrun. Let me quote Sebastian Thrun’s reaction to giving this course:

‘One of the most amazing things I’ve ever done in my life is to teach a class to 160,000 students. In the Fall of 2011, Peter Norvig and I decided to offer our class “Introduction to Artificial Intelligence” to the world online, free of charge.

We spent endless nights recording ourselves on video, and interacting with tens of thousands of students. Volunteer students translated some of our classes into over 40 languages; and in the end we graduated over 23,000 students from 190 countries. In fact, Peter and I taught more students AI, than all AI professors in the world combined.

This one class had more educational impact than my entire career. Now that I saw the true power of education, there is no turning back. It’s like a drug. I won’t be able to teach 200 students again, in a conventional classroom setting.’

There are now other examples. At MIT in the spring of 2012, an online version of their traditional ‘Circuits and Electronics’ course had 155,000 registered students, of whom some 7,000 completed the course. Anant Agarwal commented that this was ‘as many as would take the course in 40 years at MIT’. Another online course at Stanford, ‘Introduction to Databases’, had 60,000 registered students with over 6,000 finishing. Jennifer Widom, in an essay about this experience – ‘From 100 students to 100,000’ – talks about the ‘flipped classroom’. This is when traditional lectures are replaced by self-paced study of short videos covering individual topics, and the scheduled class time is devoted to much more interactive activities with the lecturer’s involvement.

In the USA, the stunning success of these massive, free, online courses has led to something like a land rush in the online education space. Academics Sebastian Thrun and David Evans have set up a company called Udacity with the support of $5M in venture capital funding. Two other academics at Stanford, Daphne Koller and Andrew Ng, have set up Coursera with $15M VC funding. In an effort to provide a non-profit alternative, MIT and Harvard have joined forces to set up EdX with $60M in funding promised from the two universities. Berkeley has now joined the EdX consortium. Will these developments present a serious challenge to second-tier universities in the struggle to extract money from students for their courses? Or will this all prove to be a ‘Dot-Edu’ analog of the Dot-Com bubble?

There are many questions still to be answered. To raise just a few:

  • What is the business model for the for-profit start-ups Udacity and Coursera?
  • What is the business model for the non-profit EdX?
  • What sort of qualification do students who finish the courses obtain?
  • How do universities make their brand visible to the world – or are the individual lecturers the important ‘brand name’?
  • How do you maintain quality yet automate the assessment and grading of 100,000 online student assignments?

In spite of all these unanswered or partially answered questions, MOOCs are arousing much enthusiasm – and great trepidation – in American universities.

Online education has been with us for many years and in the UK, the Open University has been one of the pioneers. So why could MOOCs be a game-changer at this time? I think there are several possible factors that indicate that this time it could be different. Some of them are:

  • The success of the Khan Academy in demonstrating that short 10-minute segments can be a more effective way of teaching than traditional 60-minute university lectures.
  • The Cloud now allows both the volume of the course content and the number of students to scale in a cost-effective and friction-free way.
  • The Internet and Web 2.0 technologies now allow students to spontaneously form online, self-help groups.
  • Automated testing and grading technologies are now beginning to make possible a genuine personalized learning experience.

At this moment, the long-term impact of MOOCs is unclear. However, there is certainly the possibility that, with increases in university tuition fees in many countries, we could see a dramatic change in the HE sector. I have not seen much interest in MOOCs yet in the UK and it may be that a wise policy is ‘to wait and see’. An alternative view could be that UK universities are missing the MOOC boat. One of the few business management books that I liked when I was Dean of Engineering at Southampton was a book by Andrew Grove, former CEO of Intel: ‘Only the Paranoid Survive’ …

I now would like to conclude by asking you all to raise your glasses to the Company of Educators and its Master Martin Cross.

Tony Hey


Exploring the Rebalancing World at PopTech 2011

I recently attended the annual PopTech Conference, where, along with more than 700 other attendees, I experienced a wide variety of new technologies, social innovations, and all-around creative ideas. PopTech takes place each Fall in the small town of Camden, Maine, and is one of the more unusual “science” conferences that I have attended. In fact, it’s not a science conference per se—rather, it’s a venue for cross-disciplinary innovation, bringing together an eclectic mix of scientists, technologists, corporate and civic leaders, as well as representatives of the arts and humanities, all coalescing around the goal of creating a global network of innovators.

This year’s conference was organized around the theme, “The World Rebalancing”—the idea that a new global era is arising out of the “connected and converging revolutions in technology, economics, ecology, energy, geopolitics, and culture,” unleashing new opportunities and a new geography of innovation. Can anyone doubt this is true? Not if you think about the advances being made in China and India—or the novel collaborative possibilities brought about by the digital revolution. What’s great about PopTech is the sense that all these changes are—or at least can be—positives. Where reactionary folks want to freeze the world in its 20th-century patterns, the optimistic folks behind PopTech see endless potential for innovation and advancement in the rebalancing world. This year’s PopTech videos have now gone online and they are a wonderful collection of high-quality talks—see  http://poptech.org/world_rebalancing_videos.

Nowhere is the optimism of the PopTech team more obvious than among the new PopTech Fellows. As you might expect, given its determined multidisciplinary stance, PopTech supports two fellowship programs: Social Innovation Fellows and Science and Public Leadership Fellows. Microsoft Research is one of the sponsors of the latter program, along with National Geographic, the National Science Foundation, the Doris Duke Foundation, and the Rita Allen Foundation. Fellows from the Class of 2011 will profit from year-long training and skills development in communications, public engagement, and leadership. The program helps Fellows develop world-class communication skills and provides them with significant opportunities to raise public awareness of their work through a variety of media. For me, the most exciting talks at this year’s PopTech were the short talks given by the new Science Fellows. Without exception, they all gave wonderfully stimulating presentations on their specific science topics and fully justified their selection. So, welcome, new Fellows. Brimming with optimism from PopTech, I’m anxious to see where you and this rebalancing act lead us.

—Tony Hey


An Understated Problem: Distance from the Data

When we discuss the fourth paradigm, we talk a lot about the challenges of data size. Big data size: deluge, flood, tsunami, and landslide. However, as my colleague Catharine van Ingen recently reminded me, there is something else going on. Something that is more subtle, or at least more hidden—treated, sometimes, almost as dirty laundry. What follows is a recap of our discussion.

So, what is this seldom-discussed issue? It’s this: in many fields of research, the distance between the scientist doing the analysis and the actual data acquisition and observation is growing. That distance accumulates because of instrumentation, computation, and—perhaps most intriguing—data reuse. The first two factors are relatively straightforward and not uncommon in the experience of many scientists. The third factor has become more prevalent with the growth of digital data and large-scale and/or cross-discipline science.

Instrumentation adds distance because instruments don’t always measure the science variable of interest. A simple example is when the investigator must convert, say, a voltage reading to a temperature measurement. Instruments also aggregate data, if only because of the response time of the instrument. A detailed understanding of the instrument is also often necessary to make the conversion, due to calibration, drift, and spikes.
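
As a toy illustration of that kind of conversion, here is a minimal sketch; the sensor coefficients, the calibration offset and the spike threshold are all invented for the example rather than taken from any real instrument.

  # A toy sketch of the instrument-to-science-variable step: converting raw
  # voltage readings to temperatures. The linear coefficients, calibration
  # offset and spike threshold below are invented purely for illustration.
  from dataclasses import dataclass

  @dataclass
  class ThermistorCalibration:
      gain_deg_c_per_volt: float = 25.0   # assumed sensor sensitivity
      offset_deg_c: float = -5.0          # assumed drift-corrected calibration offset
      max_plausible_deg_c: float = 60.0   # crude spike filter

  def voltages_to_temperatures(volts, cal):
      """Convert raw voltages to temperatures, masking implausible spikes as None."""
      temps = []
      for v in volts:
          t = cal.gain_deg_c_per_volt * v + cal.offset_deg_c
          temps.append(t if abs(t) <= cal.max_plausible_deg_c else None)
      return temps

  print(voltages_to_temperatures([0.8, 1.2, 9.9], ThermistorCalibration()))
  # -> [15.0, 25.0, None]; the 9.9 V reading is masked as a spike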

Computation adds distance when it involves specialized algorithms for deriving scientific variables from the raw data. An example is the stellar object cross match in astronomy. That’s the classification algorithm that matches an extracted star candidate with observed location and properties, such as magnitude and color, to a known star, known galaxy, or to the unknown. The extraction algorithm is tightly tied to the telescope; the classification algorithm is relatively loosely coupled. The classification algorithm also requires specialized knowledge that is not necessary for many of the subsequent analyses on the cross-matched data.
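
A minimal sketch of the positional part of such a cross match might look like the following; the toy catalogue and the angular tolerance are invented for the example, and a real pipeline would of course also use magnitude and color.

  # A toy positional cross match: classify each candidate as the nearest known
  # object within an angular tolerance, or as "unknown". The catalogue and the
  # tolerance are invented; real classification also uses magnitude and color.
  import math

  CATALOGUE = {  # (ra_deg, dec_deg) -> object id
      (150.001, 2.200): "known star A",
      (150.500, 2.450): "known galaxy B",
  }
  TOLERANCE_ARCSEC = 2.0

  def angular_separation_arcsec(ra1, dec1, ra2, dec2):
      """Small-angle approximation to the separation between two sky positions."""
      d_ra = (ra1 - ra2) * math.cos(math.radians(0.5 * (dec1 + dec2)))
      d_dec = dec1 - dec2
      return math.hypot(d_ra, d_dec) * 3600.0

  def cross_match(ra, dec):
      best_id, best_sep = "unknown", TOLERANCE_ARCSEC
      for (cat_ra, cat_dec), obj_id in CATALOGUE.items():
          sep = angular_separation_arcsec(ra, dec, cat_ra, cat_dec)
          if sep < best_sep:
              best_id, best_sep = obj_id, sep
      return best_id

  print(cross_match(150.0012, 2.2001))  # within tolerance of "known star A"
  print(cross_match(151.0000, 3.0000))  # far from everything -> "unknown"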

Computation also adds distance when the science variables of interest arise from a high-level statistical analysis, as is the case for DSTs (data summary tapes) in particle physics. This statistical analysis operates not at the level of an individual event or observation, but rather applies to specific filtered groups of events. (Note that the events here are, in turn, created by reconstruction algorithms, subjecting them to the first computational factor described above.)

The data reuse factor is harder to capture. Essentially, distance due to reuse can come into play when datasets are assembled, as in the example of the Fluxnet dataset. Reuse also enters the picture when the data being analyzed were originally compiled for a different scientific purpose, as when carbon sequestration analysis uses soil measurements that were gathered for agricultural crop management.

Moreover, reuse problems can definitely arise when the data analyst is an “armchair user.” Because of the specialized knowledge involved, we can’t expect that all data users will understand the full details of the instrument and computational data distances. But some knowledge and care in handling the data—such as attention to quality flags that identify suspect data or sanity checks on data-range bounds—are often necessary. To someone who is involved in producing long-distance data, armchair users can seem almost willfully ignorant, using the data without regard to how that data was created. It’s doubtful that such cavalier users were possible before the fourth paradigm era, since cutting and pasting data from a publication at least implied that the user had scanned the publication.
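
A minimal sketch of the kind of care in handling mentioned above follows; the quality-flag codes and the plausible temperature bounds are invented for the example rather than drawn from any particular dataset.

  # A toy example of the checks an "armchair user" can at least apply when
  # reusing someone else's data: honour the quality flags and sanity-check the
  # value ranges. The flag meanings and bounds are invented for illustration.
  GOOD_FLAGS = {0}                  # assume 0 means a good, directly measured value
  PLAUSIBLE_RANGE = (-60.0, 60.0)   # assumed physical bounds, in degrees Celsius

  def usable(record):
      """Keep a record only if its quality flag is good and its value is plausible."""
      value, flag = record["temp_c"], record["qc_flag"]
      return flag in GOOD_FLAGS and PLAUSIBLE_RANGE[0] <= value <= PLAUSIBLE_RANGE[1]

  records = [
      {"temp_c": 21.4, "qc_flag": 0},   # good
      {"temp_c": 19.8, "qc_flag": 2},   # gap-filled or suspect -> dropped
      {"temp_c": 999.0, "qc_flag": 0},  # instrument error code -> dropped
  ]
  print([r for r in records if usable(r)])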

The further we are from the actual instrumentation, the more we all play the children’s party game of telephone, where the message becomes more and more garbled as it’s passed from person to person. In these days of data-intensive science, we all need to be alert to this problem of the distance between the data and the science analysis.

We’ll get better at the game by continuing to play. Meanwhile, we’re a long way from that apple falling on Newton’s head.

—Tony Hey
