Exploring the Rebalancing World at PopTech 2011

I recently attended the annual PopTech Conference, where, along with more than 700 other attendees, I experienced a wide variety of new technologies, social innovations, and all-around creative ideas. PopTech takes place each Fall in the small town of Camden, Maine, and is one of the more unusual “science” conferences that I have attended. In fact, it’s not a science conference per se—rather, it’s a venue for cross-disciplinary innovation, bringing together an eclectic mix of scientists, technologists, corporate and civic leaders, as well as representatives of the arts and humanities, all coalescing around the goal of creating a global network of innovators.

This year’s conference was organized around the theme, “The World Rebalancing”—the idea that a new global era is arising out of the “connected and converging revolutions in technology, economics, ecology, energy, geopolitics, and culture,” unleashing new opportunities and a new geography of innovation. Can anyone doubt this is true? Not if you think about the advances being made in China and India—or the novel collaborative possibilities brought about by the digital revolution. What’s great about PopTech is the sense that all these changes are—or at least can be—positives. Where reactionary folks want to freeze the world in its 20th-century patterns, the optimistic folks behind PopTech see endless potential for innovation and advancement in the rebalancing world. This year’s PopTech videos have now gone online and they are a wonderful collection of high-quality talks—see  http://poptech.org/world_rebalancing_videos.

Nowhere is the optimism of the PopTech team more obvious than among the new PopTech Fellows. As you might expect, given its determined multidisciplinary stance, PopTech supports two fellowship programs: Social Innovation Fellows and Science and Public Leadership Fellows. Microsoft Research is one of the sponsors of the latter program, along with National Geographic, the National Science Foundation, the Doris Duke Foundation, and the Rita Allen Foundation. Fellows from the Class of 2011 will profit from year-long training and skills development in communications, public engagement, and leadership. The program helps Fellows develop world-class communication skills and provides them with significant opportunities to raise public awareness of their work through a variety of media. For me, the most exciting talks at this year’s PopTech were the short talks given by the new Science Fellows. Without exception, they all gave wonderfully stimulating presentations on their specific science topics and fully justified their selection. So, welcome, new Fellows. Brimming with optimism from PopTech, I’m eager to see where you and this rebalancing act lead us.

—Tony Hey

Posted in Uncategorized | Leave a comment

An Understated Problem: Distance from the Data

When we discuss the fourth paradigm, we talk a lot about the challenges of data size. Big data size: deluge, flood, tsunami, and landslide. However, as my colleague Catharine van Ingen recently reminded me, there is something else going on. Something that is more subtle, or at least more hidden, and sometimes treated almost as dirty laundry. What follows is a recap of our discussion.

So, what is this seldom-discussed issue? It’s this: in many fields of research, the distance between the scientist doing the analysis and the actual data acquisition and observation is growing. That distance accumulates because of instrumentation, computation, and (perhaps most intriguing) data reuse. The first two factors are relatively straightforward and not uncommon in the experience of many scientists. The third factor has become more prevalent with the growth of digital data and large-scale and/or cross-discipline science.

Instrumentation adds distance because instruments don’t always measure the science variable of interest. A simple example is when the investigator must convert, say, a voltage reading to a temperature measurement. Instruments also aggregate data, if only because of the response time of the instrument. A detailed understanding of the instrument is also often necessary to make the conversion, due to calibration, drift, and spikes.
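To make the voltage-to-temperature example concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the linear calibration, the gain and offset values, the spike threshold, and the names `LinearCalibration`, `despike`, and `block_average` are hypothetical and not taken from any real instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCalibration:
    """Hypothetical linear sensor calibration: T = gain * V + offset."""
    gain_c_per_volt: float
    offset_c: float

    def to_celsius(self, volts: float) -> float:
        return self.gain_c_per_volt * volts + self.offset_c


def despike(readings, max_step):
    """Drop any reading that jumps more than max_step from the last kept one."""
    kept = []
    for value in readings:
        if kept and abs(value - kept[-1]) > max_step:
            continue  # treat the jump as a spike and skip it
        kept.append(value)
    return kept


def block_average(readings, size):
    """Average consecutive blocks, mimicking an instrument's response time."""
    blocks = [readings[i:i + size] for i in range(0, len(readings), size)]
    return [sum(block) / len(block) for block in blocks]


# Assumed raw voltages, including one obvious spike (2.90 V).
raw_volts = [0.50, 0.51, 2.90, 0.52, 0.53]
calibration = LinearCalibration(gain_c_per_volt=100.0, offset_c=-50.0)

clean_volts = despike(raw_volts, max_step=0.5)
temperatures = [calibration.to_celsius(v) for v in block_average(clean_volts, 2)]
print(temperatures)
```

Even in this toy version, the analyst who receives only `temperatures` no longer sees the spike that was removed, the averaging window, or the calibration constants. That is the distance in question.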

Computation adds distance when it involves specialized algorithms for deriving scientific variables from the raw data. An example is the stellar object cross match in astronomy. That’s the classification algorithm that matches an extracted star candidate, with its observed location and properties such as magnitude and color, to a known star, a known galaxy, or the unknown. The extraction algorithm is tightly tied to the telescope; the classification algorithm is relatively loosely coupled. The classification algorithm also requires specialized knowledge that is not necessary for many of the subsequent analyses on the cross-matched data.
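For readers who have not seen a cross match, the following Python sketch shows the positional core of the idea under strong simplifying assumptions. It matches on sky position only (a real cross match would also weigh properties such as magnitude and color), it uses a brute-force search, and the catalog entries, the tolerance, and the function names are all hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(candidate, catalog, max_sep_deg):
    """Return the name of the nearest catalog entry within max_sep_deg, else None.

    candidate is (ra_deg, dec_deg); catalog is a list of (name, ra_deg, dec_deg).
    """
    best_name, best_sep = None, max_sep_deg
    for name, ra, dec in catalog:
        sep = angular_separation_deg(candidate[0], candidate[1], ra, dec)
        if sep <= best_sep:
            best_name, best_sep = name, sep
    return best_name


# Made-up catalog and a one-arcsecond matching tolerance.
catalog = [("known_star_A", 10.0, 20.0), ("known_galaxy_B", 10.5, 20.0)]
one_arcsec = 1.0 / 3600.0

print(cross_match((10.0001, 20.0001), catalog, one_arcsec))  # near known_star_A
print(cross_match((50.0, -30.0), catalog, one_arcsec))       # no match: "the unknown"
```

The choice of tolerance and the handling of ambiguous matches are exactly the kind of specialized decisions that downstream users of the cross-matched data rarely see.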

Computation also adds distance when the science variables of interest arise from a high-level statistical analysis, as is the case for DSTs (data summary tapes) in particle physics. This statistical analysis operates not at the level of an individual event or observation, but rather applies to specific filtered groups of events. (Note that the events here are, in turn, created by reconstruction algorithms, subjecting them to the first computational factor described above.)

The data reuse factor is harder to capture. Essentially, distance due to reuse can come into play when datasets are assembled from many sources, as with the Fluxnet dataset. Reuse also enters the picture when the data being analyzed were originally compiled for a different scientific purpose, as when carbon sequestration analysis uses soil measurements that were gathered for agricultural crop management.

Moreover, reuse problems can definitely arise when the data analyst is an “armchair user.” Because of the specialized knowledge involved, we can’t expect that all data users will understand the full details of the instrument and computational data distances. But some knowledge and care in handling the data (such as attention to quality flags that identify suspect data, or sanity checks on data-range bounds) are often necessary. To someone who is involved in producing long-distance data, armchair users can seem almost willfully ignorant, using the data without regard to how the data were created. It’s doubtful that such cavalier users were possible before the fourth paradigm era, since cutting and pasting data from a publication at least implied that the user had scanned the publication.
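The kind of minimal care described above can be sketched in a few lines of Python. The flag convention (0 means good), the sentinel value, the plausible range, and the names `Observation` and `screen` are assumptions made up for this illustration; real datasets document their own conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    value: float
    quality_flag: int  # hypothetical convention: 0 = good, nonzero = suspect


def screen(observations, low, high):
    """Split observations into usable, flagged-suspect, and out-of-range lists."""
    usable, flagged, out_of_range = [], [], []
    for obs in observations:
        if obs.quality_flag != 0:
            flagged.append(obs)
        elif not (low <= obs.value <= high):
            out_of_range.append(obs)
        else:
            usable.append(obs)
    return usable, flagged, out_of_range


# Made-up air temperatures in degrees Celsius; -9999.0 is an assumed
# "missing" sentinel that slipped through with a good flag.
observations = [
    Observation(12.5, 0),
    Observation(13.1, 2),
    Observation(-9999.0, 0),
    Observation(14.0, 0),
]

usable, flagged, out_of_range = screen(observations, low=-40.0, high=60.0)
print([obs.value for obs in usable])
print(len(flagged), "flagged,", len(out_of_range), "out of range")
```

Reporting how many values were rejected, and why, keeps a little of the producer's knowledge attached to the data as it moves further from the instrument.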

The further we are from the actual instrumentation, the more we all play the children’s party game of telephone, where the message becomes more and more garbled as it’s passed from person to person. In these days of data-intensive science, we all need to be alert to this problem of the distance between the data and the science analysis.

We’ll get better at the game by continuing to play. Meanwhile, we’re a long way from that apple falling on Newton’s head.

—Tony Hey

Autism and Community Science

Autism. The diagnosis didn’t even exist until the 1940s, though the disorder (well, actually a continuum of related disorders) undoubtedly did. Today, the Centers for Disease Control and Prevention estimates that as many as 1 in every 110 children in the United States will be diagnosed with some form of what have come to be called autism spectrum disorders, or ASDs. That’s a considerable jump from the level of diagnoses as recently as the 1980s.

This seeming epidemic of ASD has raised a host of questions, not the least of which is the cause of the apparent explosion in autism rates. Does it stem from a true increase in affected individuals, or is it the product of changing diagnostic standards? If the former, what could be behind the increased incidence? For that matter, what is the underlying cause of autism, regardless of whether the incidence is growing or not? Is it exclusively genetic, or are there environmental co-factors? Also, how should ASD be treated, and what is the prognosis for those with ASD?

As always, data lies at the heart of answering these questions. Which brings me to the subject of this blog (long way around, I know!): the National Database for Autism Research (NDAR), a community-wide resource that was established by the National Institutes of Health. Under the direction of Dr. Michael F. Huerta, NDAR has assembled a massive collection of autism information. As described on the program’s website, “NDAR is an extensible, scalable informatics platform for ASD relevant data at all levels of biological and behavioral organization (molecules, genes, neural tissue, behavioral, social and environmental interactions) and for all data types (text, numeric, image, time series, etc.).”

NDAR is designed not just to curate these data but also to facilitate the sharing of information, tools, and methodologies and to foster collaboration across the entire ASD community. It builds on the broad, common use of informatics platforms in the autism research community and promotes common data definitions and standards. It engages investigators, funding sources, and platform operators through workshops, meetings, talks, webinars, and tutorials. NDAR represents a new model of 21st-century biomedical research, combining three overlapping methodologies for doing science—high-volume data collection, computation and informatics, and collaborating laboratories. Michael Huerta calls this community science.

This community science endeavor has built a growing federation of partners who have agreed to share their data and adopted NDAR standards, including global unique identifiers, data definitions, validation tools, and an authentication scheme. Dr. Huerta estimates that the federation is now on track to provide most of the autism community with access to these tools and data. Its efforts have resulted in harmonized technical and policy considerations and have generated great enthusiasm among researchers, funders, and advocacy groups.

NDAR is truly pointing the way toward the future of collaborative, data-intensive biomedical research—community science at its best.

Tony Hey

VOI, Anyone?

Everyone who’s ever drawn up a budget is familiar with the acronym ROI (return on investment). But have you heard of VOI? No? Well, neither had I, until I was involved in the National Science Foundation-Office of Cyberinfrastructure’s Task Force on Data and Visualization.

VOI stands for value of information, and like ROI, it’s a very useful tool for economic analysis. Sure, we all know that information has some value, but how can we quantify it? What, for instance, is the economic value of knowing the mean April temperature in Minneapolis over the past 50 years?

The U.S. Geological Survey (USGS) has taken the lead in developing a framework for measuring VOI. For the USGS, the question was this: What is the VOI of the Land Use/Land Cover maps that have been generated by Landsat’s moderate resolution land imagery (MRLI)? This is an enormous store of data, dating back to the 1970s, but how valuable is it? Do its economic benefits justify its cost?

To answer that question, the USGS is conducting a test project that uses archival MRLI data to observe historical crop rotation patterns in 35 counties in Iowa. The USGS is then correlating this information on agricultural land use with data from wells, to estimate how changes in planting patterns affect the chemical composition of the groundwater. This study will enable the researchers to develop models to forecast the impact of planting decisions on water quality.

These forecasts will help determine when it is cost-effective for policymakers to get involved in planting decisions that affect groundwater. For example, they will add a new economic perspective on the recent shift to more corn production in Iowa, which has been driven in large part by biofuel initiatives. Corn requires heavy use of nitrogen fertilizer, which eventually shows up as nitrates in the groundwater. A better understanding of the impact of increased corn production on the quality of the groundwater will go a long way toward providing a more complete picture of the cost/benefit ratio of the biofuel-driven shifts in land use.

The VOI of the MRLI maps will be determined by the economic impact of land-use decisions that balance pollution hazards against agricultural needs. The USGS is aiming for a VOI that will maximize agricultural production while lowering the costs of treating contaminated groundwater.
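One textbook way to put a number on the value of information is the decision-theoretic "expected value of perfect information": the expected net benefit when the best action can be chosen knowing the true state, minus the expected net benefit of the best single action chosen without that knowledge. The Python sketch below illustrates that definition only. It is not the USGS framework, and the states, actions, probabilities, and payoffs are invented for the example.

```python
def expected_value_of_perfect_information(priors, payoffs):
    """Compute EVPI.

    priors:  {state: probability}, with probabilities summing to 1
    payoffs: {action: {state: net benefit}}
    """
    # Best single action when the state is unknown.
    without_info = max(
        sum(priors[state] * by_state[state] for state in priors)
        for by_state in payoffs.values()
    )
    # Best action for each state, weighted by how likely that state is.
    with_info = sum(
        priors[state] * max(by_state[state] for by_state in payoffs.values())
        for state in priors
    )
    return with_info - without_info


# Hypothetical numbers: is the local groundwater vulnerable to nitrate runoff?
priors = {"vulnerable": 0.25, "resilient": 0.75}
payoffs = {
    "plant_corn":   {"vulnerable": -20.0, "resilient": 100.0},
    "rotate_crops": {"vulnerable": 60.0,  "resilient": 70.0},
}

print(expected_value_of_perfect_information(priors, payoffs))
```

With these made-up figures, the best uninformed choice is worth 70 on average and the informed choice is worth 90, so the information is worth 20 in the same units. Real information is never perfect, so a figure like this is an upper bound rather than an estimate.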

While VOI should not be the only consideration in undertaking a data-intensive project, it is certainly a valid one in today’s world of limited budgets. After all, we all want to be sure that we’re getting the greatest socioeconomic benefit from our research dollars.

Data Esperanto

And the Lord said, Behold, the people is one, and they have all one language; and this they begin to do; and now nothing will be restrained from them, which they have imagined to do. Go to, let us go down, and there confound their language, that they may not understand one another’s speech. (Genesis 11:6–7, King James Version)

As you probably recognize, these verses are from the biblical account of the Tower of Babel, which provided a divine explanation for the profusion of languages and cultures in the ancient world. It also recounts an archetypal engineering failure: the abandonment of a great tower due to the inability to collaborate and share information.

Today we still struggle with overcoming such differences, even in the world of astrophysics, where it is assumed that all practitioners speak the lingua franca of mathematics. Case in point: the Nearby Supernova Factory—better known as the SNfactory—an experiment that involves intense collaboration between American and French institutions.

The SNfactory is designed to collect reams of data on Type Ia supernovae, the subcategory of extraordinarily bright, remarkably uniform objects whose consistent peak luminosity makes them useful as “standard candles” for measuring the rate at which the universe is expanding—measurements that provide insight into the mysterious Dark Energy that accelerates this expansion.

Happily, the SNfactory anticipated the challenges of diverse languages and cultural proclivities and created Sunfall (SUperNova Factory AssembLy Line), a system of well-planned data curation and management, to overcome them. Sunfall brought together an interdisciplinary team of astrophysicists, computer scientists, and software engineers to design a collaborative scientific data management and visual analytics system that integrates software tools and provides distributed, remote access to the supernova catalog database. It features an interactive, visual interface and a real-time chat system that promote collaboration and efficient decision-making.

As described in the recent report of the National Science Foundation Office of Cyberinfrastructure (NSF-OCI) Task Force on Data and Visualization, Sunfall has demonstrated how cyberinfrastructure can yield a significant return on investment, “both in terms of financial resources and scientific productivity.” In fact, the NSF-OCI report notes that Sunfall “reduced false supernovae identification by 40%; it improved scanning times by 70%; and it reduced labor for search and scanning from 6–8 people working four hours per day to one person working one hour per day.” The report further observes that Sunfall paid for itself within 18 months and that it enabled new scientific discoveries: a substantial return on investment on all counts.

So the ancients might have abandoned their engineering efforts in the face of linguistic and cultural diversity, but thanks to well-designed cyberinfrastructure, we can effectively and frugally collaborate across such anthropological divides.

Data and Dementia

As any fan of the late Berton Roueché knows, data is the key to solving medical mysteries. In his 50-year career at The New Yorker, medical writer Roueché chronicled the detective work of epidemiologists as they hunted down the data crucial for understanding and treating all manner of maladies. Even now, some of the medical mysteries he covered are said to inspire the writers of the popular TV series, House.

As true today as when Roueché wrote his first medical stories in the 1940s, data is the currency of diagnosis. Symptoms, histories, lab tests: all are data, and out of this collection emerges a diagnosis. Usually, but not always. Some diseases defy diagnosis, and of these, Alzheimer’s disease ranks among the worst. This incurable disease robs its victims of their cognitive functions, eventually stripping them of everything that made them who they were. Anyone who’s ever watched this merciless process in a friend or loved one knows the utter sadness of Alzheimer’s. Compounding the tragedy is the inability to diagnose the condition definitively while the patient is alive. Only post-mortem examination of brain tissue can reveal the plaques and tangles that are the definitive markers of the disease.

Today, however, rapid data gathering and sharing is giving new hope to Alzheimer’s researchers. As reported by the National Science Foundation Office of Cyberinfrastructure (NSF-OCI) Task Force on Data and Visualization, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is compiling and disseminating data from a variety of diagnostic approaches, working to identify biomarkers that could lead to early diagnosis and, with it, a hope for effective treatments.

A partnership of the private sector and the National Institutes of Health, ADNI has taken a multi-pronged approach, combining data from numerous groups of volunteer subjects and an array of diagnostic methods: spinal fluid analysis, MRI images, and PET scans among them. The data, which comes from 800 volunteers spread across 14 different medical centers, is combined, compared, and then rapidly shared with the neuroscience community. In fact, the data is publicly available within a week in most cases.

The richness of the data and the speed of its availability have energized neuroscientists worldwide, as evidenced by the tens of thousands of downloads from the ADNI website and dozens of papers based on the ADNI data. Around the globe, researchers are using these data in a race to find biomarkers that will identify potential Alzheimer’s victims—a breakthrough that could lead to diagnosis a decade or more in advance of symptoms, and with it, the chance for treatments that would halt the disease process before its insidious effects take hold.

This data-intensive research shows that not only is it important to collect and analyze information, but also to share it as widely and quickly as possible. And back to Roueché: If you haven’t read The Medical Detectives, do yourself a favor and pick up a copy.

Rich Data and Serendipitous Uses

Scientific research frequently yields unexpected benefits. Silly Putty, for example, was the byproduct of World War II research for potential rubber substitutes and this bouncy substance has delighted children for generations. Of perhaps more scientific gravitas, the cosmic microwave background radiation was first detected during experiments at Bell Labs on building antennae to pick up radio waves bounced off satellites—a discovery that helped advance the Big Bang theory and proved the final nail in the coffin for Fred Hoyle’s “continuous creation” steady-state alternative.

So it should come as no surprise that data-intensive science is producing its share of serendipitous discoveries. The report of the NSF-OCI (National Science Foundation-Office of Cyberinfrastructure) Task Force on Data and Visualization describes a few examples drawn from data-intensive research in oceanography.

Long the domain of ship-based observations, oceanography now encompasses observatory-based research and a worldwide network of scientists from myriad disciplines. These efforts measure regular oceanic processes and aim to understand our planet’s climate, geodynamics, and marine ecosystems. For example, scientists at Rutgers University’s Coastal Ocean Observation Lab are collecting high-frequency radar data on ocean surface waves and currents, with an eye to answering specific questions, such as the impact on the marine food chain of the Hudson River’s flows into the Atlantic Ocean.

However, the accumulated data are also being used by the U.S. Coast Guard to facilitate life-saving ocean rescues. Taking advantage of Rutgers’ highly accurate, real-time data on ocean circulation patterns, the Coast Guard can more precisely define the search area for survivors of boating and aircraft accidents. Similarly, the New Jersey Board of Public Utilities is utilizing the Rutgers data to plan offshore wind farms, and the Department of Homeland Security is exploring the data’s potential for detecting ships that suspiciously have not reported their location.

These examples show how large data sets can yield unexpected benefits, and, as the Task Force report states, they provide “an argument for funding and building robust systems to manage and store the data.” Unfortunately, much of the current Rutgers data has to be discarded due to the lack of capacity for storage, curation, and management. The report rather depressingly observes that the existing research culture often fails to encourage best practices in data management and sharing, thereby impeding the discovery of new uses for these data.

To remedy this situation, the Task Force offered these key recommendations:

  • Introduce new funding models that have specific data-sharing expectations and support researchers in meeting data-management and data-sharing requirements imposed by research sponsors.
  • Create new citation models in which data and software tool providers are credited with their data contributions and establish metrics that recognize open-access policies and sharing.

These recommendations are so commonsensical that it’s hard for me to imagine anyone objecting to them. After all, improved sea rescues and heightened security from terrorists seem like rather nice byproducts. And who knows when rich data sets might even give rise to the next Silly Putty?
