CHAPTER 1
THE DATA PROBLEM
On July 20, 1969, Neil Armstrong climbed out of his spacecraft and placed his feet on the moon. The landing was broadcast live all over the world and was a significant event in both scientific and human history. Today, we can still watch the grainy video of the moon landing, but we cannot watch the original, higher-quality footage or examine some of the data from this mission. This is because much of the data from early space exploration is lost forever.
Among the lost data are the original Apollo 11 tapes containing high-quality video footage of the moon landing. Their loss first came to light in 2006 (Macey 2006), and NASA personnel spent the next three years searching for the tapes across multiple continents before concluding that they were likely wiped and reused for data storage sometime in the 1970s (NASA 2009; O'Neal 2009; Pearlman 2009). Other data from this era fared better but at the cost of significant time and money. The Lunar Orbiter Image Recovery Project (LOIRP 2014), for example, spent years and well over half a million dollars recovering images taken of the moon by the five Lunar Orbiter spacecraft missions preparing for the moon landing in 1969 (Wood 2009; Turi 2014). The project required finding specialized and obsolete hardware to read the original magnetic tapes, reconstructing how to process the raw data into high-quality images, decoding the labeling scheme on each of the tapes, and doing all of this with little to no documentation. Only the cultural importance of the data on these tapes, such as the first image of the earth as seen from the moon, made such efforts worthwhile.
The story of this momentous occasion in scientific history ends with an all-too-common example of failing to plan for data management. Almost 50 years later, researchers are still inadvertently destroying data or having trouble finding data that still exists. A recent study of biology data, for example, found that data disappears at a rate of 17% per year after publishing the results (Vines et al. 2014). Another estimate says that 31% of all PC users have suffered complete data loss due to events outside of their control; this corresponds to 6% of PCs losing data in any given year (Anon 2014a). Unfortunately, very few of us have significant resources – as with the lunar data projects – to recover our own data when something happens to it. Lost, misplaced, and even difficult-to-understand data represents a real cost in terms of time and money. Fortunately, there are practices you can use to make it easier to find and use your data when you need it; those practices are collectively called "data management".
At its most essential, data management is about taking care of your data better so that you don't experience small frustrations when actively working with your data, like having trouble finding documentation for a particular dataset, or bigger problems after a project ends, like lost data. Having well-managed data means that you can find a particular dataset, will have all of the notes you need, can prevent a security breach, can easily use a co-worker's data, and can manage the chaos of an ever-growing number of digital files. Basically, many of the little headaches that researchers often encounter around data during the research process can be prevented through good data management. Just as you need to periodically clean your home, so too should you do regular upkeep on your data.
The good news is that dealing with your digital research data does not have to be difficult, though it is different from managing analog content. This book will show you many practices you can use to take better care of your research data. The ultimate goal is for you to be able to easily find and use your data when needed, whether it is historic 50-year-old data or the critical dissertation data you collected last week.
1.1 WHY IS EVERYONE TALKING ABOUT DATA MANAGEMENT?
"Data management" is a relatively new term within research, arising in the mid-2000s with funder requirements for both data management and data sharing. Such mandates gained momentum in the UK with the 2011 Common Principles on Data Policy from Research Councils UK (Research Councils UK 2011) and in the United States with the National Science Foundation's data management plan requirement in 2011 (NSF 2013). Data management and sharing policies are now becoming commonplace in science, with recent adoption by journals such as Science (Science/AAAS 2014), Nature (Nature Publishing Group 2006), and PLOS (Bloom 2013). The overall trend is for increased data management but let's examine why this trend exists in the first place.
We cannot discuss the rise in data management requirements without examining its partner, data sharing. The two concepts often pair together in addressing similar problems in the scientific process, such as limited resources, reproducibility issues, and advancing science at a faster rate (Borgman 2012). The pairing also occurs because well-managed data requires less preparation for sharing. Taken as a whole, most of the reasons why you are now required to manage and share your data are external, though there are many personal benefits to having well-managed research data, which we will examine throughout the book.
One of the main reasons behind the implementation of data management and sharing requirements relates to money. The rise of data management requirements roughly coincided with the global economic recession of the late 2000s when many research funding groups faced smaller budgets. With limited resources, funders want to be sure that researchers are making the best use of those resources, for example, by preventing the common occurrence of losing data at the end of a project (Vines et al. 2014). Public funders face additional pressure to make research products like articles and data available to the public who support the research; the current default is that these resources are locked behind paywalls, or are not even made available in the first place. By requiring data management and sharing, funders can not only stem the loss of important data but also provide accountability to those ultimately paying for the research. As an added benefit, any data reuse – either by the original researcher or other researchers – means that the same amount of money will result in more research because data usually costs more to collect than to reuse. Therefore, many research funders see data management and sharing requirements as advantageous.
Another key reason for data management and sharing policies is the prevalence of digital data in scientific research. Research data is digital on a scale never seen before, which opens up a whole new set of possibilities in scientific research. First, digital data is shareable in a way not easily done with physical samples and paper-based measurements. It's simple to copy and paste digital values, attach a file to an email, or upload a dataset to the web, meaning it's easy to share research data. We are discussing data sharing so much more because it's actually possible to share data on a global scale. We also generate more data than ever before. The world created an estimated 1.8 zettabytes (1.8 × 10²¹ bytes) of digital content in 2011, a number which is expected to be 50 times bigger in 2020 (EMC 2011; Mearian 2011). Researchers are seeing a similar increase in not only their own data but an added availability of external data. This changes the types of analysis scientists can do. You can now perform meta-analysis or correlate your data with third-party data you would otherwise not be able to collect. Researchers lacking funding or from less developed countries are now able to take part in cutting-edge research because of shared data. Basically, by sharing research data, we open up scientific research to many new types of analysis and can advance scientific research at a faster rate.
In spite of all the benefits of digital data, digital data is fragile. If you do not care for your data, many things can go wrong. Storage devices become corrupt, files are lost, software becomes out-of-date, and media become obsolete. Most people have digital files from ten years ago that they cannot use. However, this does not have to be the fate of your research data. Data is a valuable research product that should be treated with care, and data management requirements are one way to make that happen.
Finally, data management and sharing policies arose in response to recent reproducibility crises in several scientific disciplines. For example, prominent psychology researcher Diederik Stapel prompted a reproducibility crisis in his field when it came to light that he committed widespread data fabrication (Bhattacharjee 2013); Stapel amassed over 50 retractions as a result (Oransky 2013a). In economics, a graduate student, Thomas Herndon, proved that the seminal paper supporting economic austerity policies was fundamentally flawed after examining the raw dataset behind the paper (Alexander 2013). In medical research, a study of cancer researchers found that half of the survey respondents had trouble reproducing published results at some point in time (Mobley et al. 2013). Clearly, there is a reproducibility crisis in scientific research as these stories represent just a few highlights of reproducibility issues in recent years. Adding to this is the fact that it can be difficult to tell from an article alone whether a study is reproducible because "a scientific publication is not the scholarship itself, it is merely advertising of the scholarship" (Buckheit and Donoho 1995). The creation of data management and sharing policies is one response to this reproducibility crisis, as these policies help ensure that data is available for review should questions arise about the research. Misconduct investigations are also starting to look at data management. For example, investigations leading to the high-profile retractions of two STAP (stimulus-triggered acquisition of pluripotency) stem cell papers in 2014 "found inadequacies in data management, record-keeping and oversight" (Anon 2014b). With a growing number of retractions in recent years (Fanelli 2013; Steen et al. 2013), good data management increasingly needs to be part of a good defense against charges of fabrication or even more political attacks on high-profile research.
All of these issues – limited funding, the ease of sharing digital data, the availability of new types of analysis, the fragility of digital data, and reproducibility issues within scientific research – coincided to provide an optimal environment for the creation of data management and sharing policies. Along with new policy requirements, they will continue to fuel the drive toward better data management in scientific research.
1.2 WHAT IS DATA MANAGEMENT?
While many researchers were introduced to the concept of data management through a funder's requirement to write a data management plan, there's actually a lot more to data management than planning. Moreover, it's the data management you do after writing the plan that really helps in your research. This section covers what "data management" actually entails, but first we need to define what is meant by the "data" portion of "data management".
1.2.1 Defining data
Defining research data is challenging because data by its very nature is heterogeneous. Research fields are diverse and even specific subfields use a huge variety of data types. So instead of limiting ourselves to one definition of data – which likely doesn't cover everything – let's explore several definitions.
In the United States, research data created under federal funding falls under the definition of data in OMB Circular A-81:
Research data means the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples). Research data also do not include:
(i) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
(ii) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study. (White House Office of Management and Budget 2013)
This definition is very broad, covering anything necessary to validate research findings, but is helpful in that it outlines what definitely is not research data. These exclusions are particularly useful when complying with data sharing requirements because they tell you what you are not required to share.
More globally, the Organisation for Economic Co-operation and Development (OECD), consisting of 34 member nations, provides a similar definition in their "Principles and Guidelines for Access to Research Data from Public Funding":
"Research data" are defined as factual records (numerical scores, textual records, images, and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings. A research data set constitutes a systematic, partial representation of the subject being investigated.
This term does not cover the following: laboratory notebooks, preliminary analyses, and drafts of scientific papers, plans for future research, peer reviews, or personal communication with colleagues or physical objects (e.g. laboratory samples, strains of bacteria and test animals such as mice). (Organisation for Economic Co-operation and Development 2007)
This report focuses on the sharing of digital datasets so this definition of data skews toward digital content. In actuality, physical samples can be research data and, in some cases, fall under data sharing requirements (see Chapter 10).
You may also see data defined by type. Just as social science data often falls into one of two categories – quantitative or qualitative – so too does scientific data fall into specific groups. For scientific research data, those categories are:
Observational data
Experimental data
Simulation data
Compiled data
Observational data results from monitoring events, often at a specific time and place, and yields data such as species counts and weather measurements. Scientists produce experimental data in highly controlled environments so that similar conditions will always result in similar data; examples of experimental data are spectra of chemical reaction products and the measurements coming from the Large Hadron Collider. The third category of scientific data is simulation data, which results from computer models of scientific systems. Global warming simulations and optimized protein folding pathways represent two types of simulation data. The final category, compiled data, applies when you amass data from other sources for secondary use, such as performing meta-analysis or building a database containing a variety of data on one topic. While not perfect, most of the content we consider to be scientific research data fits into one of these four categories.
For this book, we'll use a broad definition of research data: data is anything you perform analysis upon. This means that data can be spreadsheets of numbers, images and video, text, or another type of content necessary for your research. Data can also be physical samples or paper-based measurements, though analog content usually has fairly established management practices. In the end, it's simply too much to try to define every possible type of data and a broad definition of data allows you, the researcher, to be generous in identifying content that you need to manage better.
1.2.2 Defining data management
If you've ever gotten halfway through a project and thought "why didn't I write down that information?" or "where did I put that file?" or "why didn't I back up my data?" then you could benefit from data management. Data management is the compilation of many small practices that make your data easier to find, easier to understand, less likely to be lost, and more likely to be usable during a project or ten years later. Data management is fundamentally about taking care of one of the most important things you create during the research process: your data.
Data management involves many practices. We will examine these more in Chapter 2. Briefly, data management includes data management planning, documenting your data, organizing your data, improving analysis procedures, securing sensitive data properly, having adequate storage and backups during a project, taking care of your data after a project, sharing data effectively, and finding data for reuse in a new project. Such a wide range of practices means that data management is something you do before the start of a research project, during the project, and after the project's completion.
Excerpted from "Data Management for Researchers"
by Kristin Briney.
Copyright © 2015 Kristin Briney.
Excerpted by permission of Pelagic Publishing.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.