Library Guides: Research Data Management : Data description and metadata

Overview

Properly managing data does not necessarily equate to sharing or publishing those data. But it is a good idea to complete the data lifecycle through sharing and publishing.

Types of Data

RESEARCH DATA DEFINED

One definition of research data is “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings” (OMB Circular A-110). Across all agencies, including the NSF, research data does not mean summary statistics or tables; rather, it means the data on which summary statistics and tables are based.

RESEARCH DATA EXAMPLES

Documents (text, Word), spreadsheets, print outs
Laboratory notebooks, field notebooks, diaries
Questionnaires, transcripts, codebooks
Audio, video
Photographs, films, x-rays, negatives
Protein or genetic sequences
Spectra, spectroscope data
Test responses
Slides, artifacts, specimens, samples
Collection of digital objects acquired and generated during the process of research
Database contents (video, audio, text, images)
Models, algorithms, scripts, code, software
Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
Methodologies and workflows
Standard operating procedures and protocols
Computers and computer data storage devices
Synthetic compounds
Organisms, cell lines, viruses, cell products
Cloned coordinates, plants, animals

CC BY 4.0 Springer Nature

EXCLUSIONS

Some kinds of data might not be sharable due to the nature of the items themselves, or to ethical and privacy concerns. As defined by the OMB, this refers to:

Preliminary analyses
Drafts of scientific papers
Plans for future research
Peer reviews
Communications with colleagues
Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published or similar information which is protected under law
Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study

Additionally, research data management is not records management for projects or university business data, and therefore does not cover such items as:

Correspondence (electronic mail and paper-based correspondence)
Project files
Grant applications
Ethics applications
Technical reports
Research reports
Signed consent forms
Results of compliance reviews (export controls and human subjects)
Software to read proprietary vendor data formats

File Naming and Versioning

OVERVIEW

Plan the directory structure and file naming conventions before creating your data. Plan for version tracking of datasets and documents. Use project-specific conventions or disciplinary standards or best practices. The following are general best practices.

ORGANIZATIONAL TIPS

Decide upon a convention and stick to it. Always include the same information.
Consider organizing directories or folders by date, date/time, place, instrument, project, type of data, variable name, or a combination of these, using a hierarchical directory structure.
The same applies to filenames. If you organize directories by place and date/time, then filenames might be organized by type of data and variable name.
Test your organizational structure on team members before implementing. Does it make sense to them too or is there confusion?
Consider organizational structures that will help you later decide which data are the most important to deposit and make publicly accessible to others
Consider what structure will make it easier to programmatically walk through your data
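
As an illustration of the last tip, the following is a minimal Python sketch of walking a hierarchical data directory. It assumes a hypothetical layout of root/place/date/file; the function name index_data_files and the layout itself are illustrative choices, not a Mines requirement. A consistent structure is what makes this kind of loop possible.

```python
from pathlib import Path


def index_data_files(root):
    """Walk a project tree and record the directory levels of each file.

    Assumes a hypothetical layout: <root>/<place>/<date>/<file>.
    """
    records = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) != 3:
            continue  # skip files outside the assumed place/date/file layout
        place, date, filename = parts
        records.append({"place": place, "date": date, "file": filename})
    return records
```

Calling index_data_files("my_project") on such a tree returns one dictionary per data file, which can then be filtered by place or date when deciding what to deposit.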

DIRECTORY AND FILE NAMING CONVENTIONS

When using date information, use the YYYY-MM-DD format over other formats
Keep file and folder names shorter than 32 characters.
Include relevant information like unique identifiers, project name, grant numbers or research data names
Try to name runs of an experiment sequentially
Use software application-specific 3-letter file extensions and lowercase them: mov, tif, wrl
When using sequential numbering, use leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100 (001, 010, 100).
Avoid special characters: & , * % # ; ( ) ! @ $ ^ ~ ‘ { } [ ] ? < > –
Use only one period, placed before the file extension (e.g. name_paper.doc and NOT name.paper.doc OR name_paper..doc)
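
Several of these conventions can be checked automatically. The following Python sketch is one possible approach, not an official tool: the function names check_filename and sequential_name are hypothetical, the character set is copied from the list above as printed, and the ASCII hyphen is assumed to be allowed because the YYYY-MM-DD date format needs it.

```python
from datetime import date

# Characters to avoid, as printed in the list above (assumption: the
# ASCII hyphen is permitted, since YYYY-MM-DD dates require it).
SPECIAL_CHARACTERS = set("&,*%#;()!@$^~‘{}[]?<>–")


def check_filename(name):
    """Return a list of convention problems found in a file name."""
    problems = []
    if len(name) >= 32:
        problems.append("name is 32 characters or longer")
    if name.count(".") != 1:
        problems.append("name should contain exactly one period")
    else:
        extension = name.rsplit(".", 1)[1]
        if extension != extension.lower():
            problems.append("extension should be lowercase")
    found = sorted(set(name) & SPECIAL_CHARACTERS)
    if found:
        problems.append("special characters: " + " ".join(found))
    return problems


def sequential_name(prefix, day, run, total_runs, extension):
    """Build a run name with an ISO date and enough leading zeros."""
    width = len(str(total_runs))  # e.g. 100 runs -> 3 digits
    return f"{prefix}_{day.isoformat()}_run{run:0{width}d}.{extension.lower()}"
```

For example, sequential_name("density", date(1998, 6, 30), 7, 100, "csv") builds a name with a three-digit run number, and check_filename("name.paper.doc") reports the extra period.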

Metadata Standards

OVERVIEW

Metadata is structured and descriptive information about an item or object. It is a standardized way to explain the who, what, where, when and how of data creation and methods. Metadata and other documentation enable the researcher to understand their data in detail and enable other researchers to discover, use and properly cite the item or object. Metadata standards have been created to facilitate the description of research data using a defined set of elements, and some disciplines have preferred metadata standards.

Data repositories may have specific metadata standard requirements that must be met in order to deposit data. If you intend to deposit your data in a subject- or discipline-specific repository, check their deposit and metadata requirements before including the repository in your data management plan.

COMMON METADATA STANDARDS

Dublin Core (used by the Mines institutional repository): a general standard, can be adapted for specific disciplines
FGDC (Federal Geographic Data Committee): used by many Federal agencies for geospatial data; some tools are available
MODS (Metadata Object Description Schema): richer than Dublin Core and can be used for a variety of purposes
PREMIS (preservation metadata)
METS (Metadata Encoding and Transmission Standard): includes descriptive, technical, rights and some preservation fields
DIF (Directory Interchange Format): for earth science data

Metadata Creation

OVERVIEW

As the custodian of the primary data, the researcher should ensure project data are properly documented in order to facilitate current use and enable future discovery and sharing. As early as you can, document your data and your data organization protocol, even before data collection begins; doing so will make data documentation easier and reduce the likelihood that you will forget aspects of your data later in the research project.

The following is a list of elements and aspects of your research project and data that should be documented, regardless of discipline. At minimum, this information should be stored in a readme.txt file or the equivalent, together with the data. The Mines Research Support Services group uses this documentation to create the required metadata for the Mines institutional repository.

GENERAL INFORMATION

– Elements marked with an * are required by the Mines institutional repository.
– See the Deposit with Mines page to understand the submittal process

*Title: name of the dataset or research project that produced it
*Creator: names and addresses of the organization or people who created the data
Identifier: number used to identify the data, even if it is just an internal project reference number
*Researcher identifier: a unique and persistent digital identifier that distinguishes you from every other researcher or author; requires registration with ResearcherID or ORCID
*Abstract: a concise description or summary of the dataset
*Subject: keywords or phrases describing the subject or content of the data; these are additional search terms that are not listed in the abstract
*Funders: name of the organizations or agencies who funded the research
*Award: the grant number(s) if the data was generated from work on a grant
*Rights: any known intellectual property rights held for the data (copyright)
Publication citations: any citations that describe or use the data
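
The general information elements above can be kept in a simple structure and written out as a readme.txt. The Python sketch below is only an illustration: the element names and required flags are taken from the list above, while the function names missing_required and render_readme and the "Name: value" layout are assumptions, not a format prescribed by the Mines repository.

```python
# (element name, required by the Mines institutional repository?)
# Order and flags follow the GENERAL INFORMATION list above.
GENERAL_ELEMENTS = [
    ("Title", True),
    ("Creator", True),
    ("Identifier", False),
    ("Researcher identifier", True),
    ("Abstract", True),
    ("Subject", True),
    ("Funders", True),
    ("Award", True),
    ("Rights", True),
    ("Publication citations", False),
]


def missing_required(record):
    """List required elements that are absent or blank in the record."""
    return [
        name
        for name, required in GENERAL_ELEMENTS
        if required and not str(record.get(name, "")).strip()
    ]


def render_readme(record):
    """Render the filled-in elements as 'Name: value' lines (assumed layout)."""
    lines = []
    for name, _required in GENERAL_ELEMENTS:
        value = str(record.get(name, "")).strip()
        if value:
            lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"
```

Running missing_required on a draft record before deposit gives a quick checklist of what still needs to be gathered.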

DATA CHARACTERISTICS

*Access information: if you deposited the data in a repository external to Mines, describe where and how the data can be accessed by other researchers
*Access restrictions: if there are restrictions on making the data openly accessible, indicate the nature of the restrictions and how long they need to be in place
*Language: language(s) of the intellectual content
*Dates: key dates (and times) associated with the data, including: funding period; project start and end date; release date; time period covered by the data (coverage); and other dates associated with the data lifespan, e.g., maintenance cycle, update schedule, date of last update
*Date of publication: the date the data was made available, created or compiled as an entity for use by others
*Location: spatial coverage of the data or sampling site information
Methodology: how the data was generated, including equipment or software used, experimental protocol, other things one might include in a lab notebook
*Data processing: during your research, record information on how the data has been altered or processed
Sources: citations to material for data derived from other sources, including details of where the source data is held and how it was accessed
Unit of analysis: the major entity that is being analyzed in the study
*Type: the dominant kinds of data; choose from Collection, Event, Image, Moving Image, Physical Object, Software, Sound, Text

FILE CHARACTERISTICS

Count: total number of files
*Size: how much space the dataset requires on a computer server
*File names: list of all data files associated with the project, with their names and file extensions (e.g. ‘NWPalaceTR.WRL’, ‘stone.mov’)
*File formats: format(s) of the data, e.g. FITS, SPSS, HTML, JPEG, and any software required to read the data
*File structure: organization of the data file(s) and the layout of the variables, when applicable
*Variable list: list of variables in the data files, when applicable
*Code lists: explanation of codes or abbreviations used in either filenames or the variables in the data files (e.g. ‘999 indicates a missing value in the data’)
*Versions: date/time stamp for each file, and use a separate ID for each version
Checksums: to test if the files have changed over time
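
Count, size, file names and checksums can be collected in one pass. The following is a minimal Python sketch; the guide does not name a checksum algorithm, so the use of SHA-256 here is an assumption, and the function names sha256_of, build_manifest and changed_files are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root to its size in bytes and its checksum."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = {
                "size_bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            }
    return manifest


def changed_files(old_manifest, new_manifest):
    """Names from the old manifest that are now missing or have a new checksum."""
    return sorted(
        name
        for name, entry in old_manifest.items()
        if name not in new_manifest
        or new_manifest[name]["sha256"] != entry["sha256"]
    )
```

The file count is len(manifest) and the total size is the sum of the size_bytes values. Storing a manifest alongside the data and rebuilding it later lets changed_files show whether any file has changed over time.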

Metadata Examples

OVERVIEW

To deposit in the Mines Repository, metadata (a description of the item/object) is required. The following example is for a dataset and is typical of the information that needs to be gathered in order to make a deposit.

Author: Lauenroth, William K.

Title: SGS-LTER Bouteloua gracilis Removal Experiment Vegetation Data (ARS #155) on the Central Plains Experimental Range, Nunn, Colorado, USA 1997-2008

Keywords: populations ; blue grama ; population dynamics ; density ; plants ; disturbance

Abstract: Six sites approximately 6 km apart were selected at the Central Plains Experimental Range in 1997. Within each site, there was a pair of adjacent ungrazed and moderately summer grazed (40-60% removal of annual above ground production by cattle) locations. Grazed locations had been grazed from 1939 to present and ungrazed locations had been protected from 1991 to present by the establishment of exclosures. Within grazed and ungrazed locations, all tillers and root crowns of B. gracilis were removed from two treatment plots (3 m x 3 m) with all other vegetation undisturbed. Two control plots were established adjacent to the treatment plots. Plant density was measured annually by species in a fixed 1m x 1m quadrant in the center of treatment and control plots. For clonal species, an individual plant was defined as a group of tillers connected by a crown (Coffin & Lauenroth 1988, Fair et al. 1999). Seedlings were counted as separate individuals. In the same quadrant, basal cover by species, bare soil, and litter were estimated annually using a point frame. A total of 40 points were read from four locations halfway between the center point and corners of the 1m x 1m quadrant. Density was measured from 1998 to 2005 and cover from 1997 to 2006. All measurements were taken in late June/early July.

Award: NSF Grant Number DEB-0823405.

Publisher: Colorado State University. Libraries

Date: 1997 – 2008

Type: Dataset ; Text ; Still image ; Metadata

Language: English

Spatial: The Short Grass Steppe site encompasses a large portion of the Colorado Piedmont Section of the western Great Plains. The extent is defined as the boundaries of the Central Plains Experimental Range (CPER). The CPER has a single ownership and land use (livestock grazing). The PNG is characterized by a mosaic of ownership and land use. Ownership includes federal, state or private and land use consists of livestock grazing or row-crops. There are NGO conservation groups that exert influence over the area, particularly on federal lands.

Referenced by: Munson, Seth M. (2009), Plant community and ecosystem change on conservation reserve program lands in northeastern Colorado. (Unpublished doctoral dissertation). Colorado State University. http://hdl.handle.net/10217/76822

Referenced by: Munson, S. M. and Lauenroth, W. K. (2009), Plant population and community responses to removal of dominant species in the shortgrass steppe. Journal of Vegetation Science, 20: 224–232. http://dx.doi.org/10.1111/j.1654-1103.2009.05556.x

Contributor: University of Wyoming. Dept. of Botany.

Rights: Data sets are open. Please include tag line in report or manuscript: Data sets were provided by the Shortgrass Steppe Long Term Ecological Research group, a partnership between Colorado State University, United States Department of Agriculture, Agricultural Research Service, and the U.S. Forest Service Pawnee National Grassland. Significant funding for these data was provided by the National Science Foundation Long Term Ecological Research program.
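
Because the Mines institutional repository uses Dublin Core, a record like the one above could be expressed as simple Dublin Core XML. The Python sketch below is illustrative only: the mapping from the example's field labels to Dublin Core elements is an assumption (the repository's actual mapping may differ), the wrapping "metadata" element and the function name to_dublin_core are hypothetical, and fields with no obvious simple Dublin Core element, such as Award, are skipped.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Assumed mapping from the example's labels to simple Dublin Core elements.
FIELD_TO_DC = {
    "Author": "creator",
    "Title": "title",
    "Keywords": "subject",
    "Abstract": "description",
    "Publisher": "publisher",
    "Date": "date",
    "Type": "type",
    "Language": "language",
    "Spatial": "coverage",
    "Contributor": "contributor",
    "Rights": "rights",
}
# Fields whose values are separated by semicolons in the example.
MULTI_VALUED = {"Keywords", "Type"}


def to_dublin_core(record):
    """Serialize a dict of example fields as simple Dublin Core XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field, value in record.items():
        element_name = FIELD_TO_DC.get(field)
        if element_name is None:
            continue  # no assumed mapping (e.g. Award): leave it out
        values = value.split(";") if field in MULTI_VALUED else [value]
        for item in values:
            item = item.strip()
            if item:
                child = ET.SubElement(root, f"{{{DC_NS}}}{element_name}")
                child.text = item
    return ET.tostring(root, encoding="unicode")
```

Passing {"Author": "Lauenroth, William K.", "Keywords": "populations ; blue grama"} produces one dc:creator element and two dc:subject elements.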