Data increasingly drives innovation in virtually every area of inquiry. Whether the data helps to reveal the existence of the “God particle,” the discovery of a new planet, the behavior of crowds, or the spread of disease, it is key to discovery and innovation. Data is also a national priority around the world. In the United States, White House initiatives are focusing on public access to research data, big data, and government open data.
Yet as data—big, small, or in-between—becomes more valuable to science and society, its longevity may increasingly be at risk. This is because there has been inadequate planning and support for the hardware, software, organizational, and human infrastructure that enables its access and use. In particular, we need infrastructure to support the stewardship of research data (for access and use now), and its preservation (for access and use in the future). Without sufficient fiscal, community, and organizational support, research data may disappear, potentially never to be restored.
So why is it so hard to support the infrastructure needed for valuable research data? Why is research data infrastructure such a hard sell?
To understand the problem, we can contrast the environment in which we conduct data-driven research with the environment in which we build research data infrastructure. Consider the following questions:
What is Newsworthy?
In the research arena, new discoveries and results are newsworthy, such as the discovery of the Higgs Boson and more accurate models for predicting earthquakes. These provide new insights, expand our horizons, and may ultimately create opportunities for innovative products and services.
In contrast, infrastructure is primarily newsworthy when it fails: for example, when valuable data such as social security numbers are lost or damaged, or when data systems break down. In particular, what is newsworthy is the “bad news.” Well-running, reliable infrastructure for data stewardship and preservation is expected, unremarkable, and not on the radar for recognition.
What’s the Value Proposition?
Research drives the innovation needed to advance science and society; it is critical for success. This is an important value proposition that enables investment in research, even when the results are not immediately applicable. In contrast, infrastructure enables research and innovation. This is a much weaker value proposition, making it harder to prioritize data infrastructure investments over higher-profile or more urgent efforts.
What’s the Funding Model?
Almost all of our research funding is time-limited: Government agencies and foundations typically provide funding for a set period of time (often 3 to 5 years with or without limited renewal), and long-term research programs are typically funded by a variety of sources. In particular, research funding is generally not “continuous” from a single grant, agency, or source and almost all researchers’ funding pipelines experience ebbs and flows.
In contrast, infrastructure funding must be continuous and long-term with no gaps. Systems must keep running and administrators must migrate valuable data smoothly over time from old technologies to new ones. Gaps in funding may put valuable data at risk. If we don’t pay the light bill, the lights go out. If we don’t pay the data bill, we are at risk of losing our data.
Who is Responsible?
All across the world, research agencies provide critical support for research communities. In the United States, researchers are funded by the National Science Foundation, National Institutes of Health, Department of Energy, National Endowment for the Humanities, and other agencies. Researchers are also supported by non-profit organizations as well as research funding programs from the private sector.
Who is responsible for funding research data infrastructure? Typically this responsibility is shared among research agencies, institutions, non-profits and others but often as a lesser priority than the mainstream mission of these organizations. There is no “National Research Infrastructure Foundation” with a mission to build and maintain infrastructure that enables research, particularly data-driven research. Moreover, in many research-funding organizations, support for infrastructure must compete with funding for research without expansion of funding levels. This is particularly challenging for digital research data, which may require near-constant care to be accessible for current and future use.
Data Infrastructure Matters
With news coverage that is primarily bad, a weaker value proposition, a more challenging funding model, and no group whose mainstream mission it is to plan and coordinate the data infrastructure needed for the research community, it is no surprise that we do not have a comprehensive plan for research data stewardship and preservation.
Nonetheless, good planning and strategic investment are critical to avoid falling behind. With focus and commitment, a national research data stewardship and preservation plan can be developed and used to drive new discovery and innovation. Without such a plan, we may be left behind wondering what we could have done with data that has disappeared.
Francine Berman is an IEEE Fellow, Hamilton Distinguished Professor of Computer Science at Rensselaer Polytechnic Institute, and Chair of Research Data Alliance/United States.
Many thanks to Myron Gutmann for thoughtful comments on an earlier draft of this piece.