An Understated Problem: Distance from the Data

When we discuss the fourth paradigm, we talk a lot about the challenges of data size. Big, as in deluge, flood, tsunami, and landslide. However, as my colleague Catharine van Ingen recently reminded me, there is something else going on: something more subtle, or at least more hidden, treated at times almost as dirty laundry. What follows is a recap of our discussion.

So, what is this seldom-discussed issue? It’s this: in many fields of research, the distance between the scientist doing the analysis and the actual data acquisition and observation is growing. That distance accumulates because of instrumentation, computation, and—perhaps most intriguing—data reuse. The first two factors are relatively straightforward and not uncommon in the experience of many scientists. The third factor has become
more prevalent with the growth of digital data and large-scale and/or cross-discipline science.

Instrumentation adds distance because instruments don’t always measure the science variable of interest. A simple example is when the investigator must convert, say, a voltage reading to a temperature measurement. Instruments also aggregate data, if only because of the response time of the instrument. A detailed understanding of the instrument is also often necessary to make the conversion, due to calibration, drift, and spikes.
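To make the voltage-to-temperature example concrete, here is a minimal sketch of such a conversion. The gain, offset, and drift numbers are hypothetical stand-ins for what a real instrument's calibration certificate would supply; the point is that even this "simple" conversion quietly embeds calibration and drift knowledge the downstream analyst never sees.

```python
def voltage_to_temperature(volts, gain=100.0, offset=-50.0,
                           drift_per_day=0.0, days_since_cal=0.0):
    """Convert a raw voltage reading to degrees Celsius.

    gain/offset are hypothetical linear calibration constants, and the
    drift term models slow sensor aging since the last calibration.
    """
    corrected = volts - drift_per_day * days_since_cal
    return gain * corrected + offset

# A 0.75 V reading taken 10 days after calibration, assuming 1 mV/day drift:
temp_c = voltage_to_temperature(0.75, drift_per_day=0.001, days_since_cal=10)
# temp_c == 24.0
```

Leave out the drift correction, or use last year's gain, and the "temperature" silently changes; nothing in the final dataset records that choice.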

Computation adds distance when it involves specialized algorithms for deriving scientific variables from the raw data. An example is the stellar object cross match in astronomy. That’s the classification algorithm that matches an extracted star candidate with observed location and properties, such as magnitude and color, to a known star, known galaxy, or to
the unknown. The extraction algorithm is tightly tied to the telescope; the classification algorithm is relatively loosely coupled. The classification algorithm also requires specialized knowledge that is not necessary for many of the subsequent analyses on the cross-matched data.
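As a toy illustration of the classification step (not any particular survey's pipeline), a positional cross match can be sketched as a nearest-neighbor search within an angular tolerance; real cross-match algorithms also weigh magnitude, color, and positional uncertainty. The catalog entries and tolerance below are invented for the example.

```python
import math

def cross_match(candidate, catalog, tol_deg=0.001):
    """Match an extracted source position to the nearest catalog entry.

    candidate: (ra, dec) in degrees; catalog: dict of name -> (ra, dec).
    Returns the name of the catalog object within tol_deg of the
    candidate (small-angle, flat-sky approximation), or "unknown".
    """
    best_name, best_sep = "unknown", tol_deg
    ra0, dec0 = candidate
    for name, (ra, dec) in catalog.items():
        # Approximate angular separation on a small patch of sky.
        dra = (ra - ra0) * math.cos(math.radians(dec0))
        ddec = dec - dec0
        sep = math.hypot(dra, ddec)
        if sep <= best_sep:
            best_name, best_sep = name, sep
    return best_name

catalog = {"star_a": (10.0000, 20.0000), "galaxy_b": (10.0500, 20.0500)}
cross_match((10.0001, 20.0001), catalog)   # -> "star_a"
cross_match((11.0, 21.0), catalog)         # -> "unknown"
```

Note how much is buried even here: the tolerance, the flat-sky approximation, the tie-breaking rule. An analyst working with the cross-matched table sees none of it.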

Computation also adds distance when the science variables of interest arise from a high-level statistical analysis, as is the case for DSTs (data summary tapes) in particle physics. This statistical analysis operates not at the level of an individual event or observation, but rather applies to specific filtered groups of events. (Note that the events here are, in turn,
created by reconstruction algorithms, subjecting them to the first computational factor described above.)

The data reuse factor is harder to capture. Essentially, distance due to reuse can come into play when datasets are assembled, such as the example of the Fluxnet dataset. Reuse also enters the picture when the data being analyzed were originally compiled for a different scientific purpose, as when carbon sequestration analysis uses soil measurements that were gathered for agricultural crop management.

Moreover, reuse problems can definitely arise when the data analyst is an “armchair user.” Because of the specialized knowledge involved, we can’t expect that all data users will understand the full details of the instrument and computational data distances. But some knowledge and care in handling the data—such as attention to quality flags that identify suspect data or sanity checks on data-range bounds—are often necessary. To someone who is involved in producing long-distance data, armchair users can seem almost willfully ignorant, using the data without regard to how that data was created. It’s doubtful that
such cavalier users were possible before the fourth paradigm era, since cutting and pasting data from a publication at least implied that the user had scanned the publication.
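The minimum care I have in mind looks something like the following sketch, where the flag names, sentinel value, and temperature bounds are illustrative rather than from any particular dataset: honor the producer's quality flags, and apply a plausibility check on the value range before analysis.

```python
# Each record: (value, quality_flag). Flags and bounds are illustrative.
readings = [
    (21.4, "good"),
    (-999.0, "missing"),   # a common sentinel for missing data
    (18.9, "good"),
    (57.2, "suspect"),     # flagged as suspect by the data producer
]

TEMP_MIN, TEMP_MAX = -40.0, 50.0  # assumed plausible range for this site

def usable(value, flag):
    """Keep only producer-approved readings that pass a range sanity check."""
    return flag == "good" and TEMP_MIN <= value <= TEMP_MAX

clean = [v for v, f in readings if usable(v, f)]
# clean == [21.4, 18.9]
```

The armchair user who averages all four raw values, -999 sentinel and all, is exactly the failure mode the data producers worry about.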

The further we are from the actual instrumentation, the more we all play the children’s party game of telephone, where the message becomes more and more garbled as it’s passed from person to person. In these days of data-intensive science, we all need to be alert to this problem of the distance between the data and the science analysis.

We’ll get better at the game by continuing to play. Meanwhile, we’re a long way from that apple falling on Newton’s head.

—Tony Hey


One Response to An Understated Problem: Distance from the Data

  1. All of which suggests we should add a distance dimension to digital preservation of data; currently digital preservation tends to be more concerned with the temporal dimension.

    Another possible effect of distance is on the quality and impact of research. An article last year in PLoS ONE found that “Despite the positive impact of emerging communication technologies on scientific research, our results provide striking evidence for the role of physical proximity as a predictor of the impact of collaborations.” (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0014279) Although this research has been criticised (http://www.nature.com/nature/journal/v470/n7332/full/470039c.html), your identification of the distance factor in science suggests there may be more to this result.
