Table of Contents
Data Management Implementation Plan
Data Management Units
Data Collection
Data Documentation
Quality Control
File Organization
Formats
Storage
Backup
Internal Data Sharing Workflow
Data Use
Protection for Sensitive and Confidential Data
Management of Physical Samples
Data Publication
Data Archival
Roles and Responsibilities
This document provides additional guidance for writing a Data Management Plan Implementation Document. The Implementation template can be found here, and an example of a fictitious document using the template can be found here.
About Data Sharing Agreements: they can also be called Data Use Agreements. These are written documents that clarify the ownership, rights, and responsibilities regarding the data created during a research project involving several institutions or companies. Talk with the Research Office if you think you need to create one with your collaborators, or if you need help interpreting an existing Data Sharing Agreement.
About funder policies: DMPTool.org maintains a compilation of links to funder policies that can help you find information on the requirements, if any, that your funder has regarding data management. https://dmptool.org/public_templates
The goal of defining these data management units is to be able to refer to them easily throughout the document when the need arises.
About the amount of data: an estimate of the expected amount of data is useful because managing large volumes of data is harder than managing just a few MB. This should be an approximate amount, or a range.
No extra guidance.
The rule of thumb is that a person familiar with the project’s field of research, but not with the project itself, should be able to look at the project’s files and understand the data, what has been done to the data, and why, without having to ask anybody. This can be achieved with documentation and with file organization.
To learn more about data documentation visit the OSU metadata and data documentation Lib Guide.
A few tips:
It may be useful to design different levels of quality control. For example:
Level zero (L0): Data as downloaded directly from an instrument or model. These data are often in binary format, impossible for a human to understand or inspect unless processed by a program. These programs tend to be proprietary and may or may not perform operations on the data. This data level may not exist. For example: binary files coming from a temperature sensor permanently installed in a stream.
Level one (L1): Raw data in a format that is understandable by a human. No corrections have been made to these data. For example, a CSV file obtained after running the programs supplied by the company that manufactured the instrument.
Level two (L2): Verified data that have undergone quality control, including but not limited to:
Level two data are the best data a researcher could use. Level two data should not include corrections that depend on the researcher’s subjective judgment. When quality control is not necessary, L1 and L2 data may be identical.
Level three (L3): L2 data that have been analyzed to answer specific research questions. Typically, this is the data used to create figures in a publication. For example, the results of a principal component analysis of three years of temperature data, published as a figure in a peer-reviewed article.
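A minimal sketch of how an L1-to-L2 step might look in practice. The thresholds and function names here are hypothetical and purely illustrative; a real project would use instrument- and site-specific criteria and would document each flag rather than silently dropping points.

```python
def qc_flag(temperature_c, lower=-5.0, upper=45.0):
    """Return 'pass' if a reading falls in a plausible range, else 'fail'.
    The range is a made-up example, not a real calibration limit."""
    if lower <= temperature_c <= upper:
        return "pass"
    return "fail"

# Raw (L1) readings, including two physically implausible values.
l1_data = [12.4, 13.1, -40.0, 14.2, 99.9]

# Verified (L2) subset: only readings that pass the objective range check.
l2_data = [t for t in l1_data if qc_flag(t) == "pass"]
```

Keeping the check in a named function with explicit parameters makes the quality control procedure reproducible and easy to document alongside the data.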
The same rule of thumb applies to file organization: a person familiar with the project’s field of research, but not with the project itself, should be able to navigate the project’s files and understand the data, and what has been done to it, without having to ask anybody.
Best practices for file naming:
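As an illustrative sketch (the naming scheme below is hypothetical, not a project requirement), a descriptive file name can encode the project, site, an ISO-style date, and a version number, avoiding spaces and special characters so that files sort chronologically and parse predictably:

```python
from datetime import date

def build_filename(project, site, day, version, extension="csv"):
    """Compose a file name of the form project_site_YYYYMMDD_vN.ext.
    YYYYMMDD dates sort chronologically; underscores avoid spaces."""
    return f"{project}_{site}_{day.strftime('%Y%m%d')}_v{version}.{extension}"

name = build_filename("streamtemp", "siteA", date(2024, 6, 1), 2)
```

Generating names from one function, rather than typing them by hand, keeps the convention consistent across every member of the team.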
Data standards
Formats that are better for long-term preservation are platform independent (accessible from Linux, Mac, and Windows), open (not proprietary), and character based (not binary). There can be exceptions to all of these for the right reasons. For example, some data standards that are widely used in some disciplines, like netCDF, save data in binary format.
See eCommons: Cornell’s Digital Repository, Recommended file formats, for a table of existing formats for different types of content and their suitability for long-term preservation.
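A small sketch of what “open and character based” means in practice, using Python’s standard csv module (the column names are made up): the same table saved as CSV, with a header row, can be opened in any text editor on any platform with no proprietary software.

```python
import csv
import io

# Hypothetical tabular data to be preserved in an open, character-based format.
rows = [
    {"timestamp": "2024-06-01T00:00:00Z", "temperature_c": 12.4},
    {"timestamp": "2024-06-01T01:00:00Z", "temperature_c": 12.1},
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["timestamp", "temperature_c"])
writer.writeheader()    # a header row makes the file partly self-describing
writer.writerows(rows)

csv_text = buffer.getvalue()
```

The resulting text is human readable as-is, which is exactly the property that binary or proprietary formats lack.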
Rationale: Setting expectations about how and when datasets will be shared internally will minimize conflict during the project.
Datasets will be shared internally [specify when researchers are expected to share their datasets. Some examples: as soon as possible after the data is collected/at the end of the sampling season/6 months after it is collected/in January of each year/when a researcher of the Project requests it].
Datasets will be shared internally with [who? Some examples: all the members of the team/members of the team approved by the IRB/the data manager of the project/the researcher who requested the dataset].
Datasets will be shared internally in the format [is there an expected format? For example Excel, CSV, SPSS, or…].
Datasets will be shared internally [at which quality level? For example: after a quality control level has been assigned to each point following the schema in X / after following the protocol X for quality control/at any quality control level, as long as the documentation clarifies the quality control procedures that have been followed /only if all the data points have been subject to the whole quality control process outlined in X].
Datasets will be shared internally accompanied by [which documentation? For example: a readme file outlining at least the methods followed for data collection, the quality control procedures that have been followed, and a data dictionary/documentation using the template X/documentation using the metadata template X].
Datasets will be shared internally by [how are the datasets going to be delivered? For example: by e-mail/by depositing them in Box/Google drive/external hard drive/shared drive/website].
When a member of the Project uses a dataset shared by another member of the team [how will the use be notified? For example: a courtesy e-mail will be sent to the contact person/no notification will be needed at this stage/the member of the Project using the shared data will record their name in a log].
When a new version of a dataset is generated, the other members of the Project who may want to use the dataset will be notified by [example: sending a general e-mail to the whole group/documenting the new version in the documentation file and sending individual e-mails to the members of the team who are known to be using the dataset].
[Include other useful workflow details if necessary. For example, there may be details in the data management plan that can be outlined or expanded here: when will the datasets be made publicly available? Who will decide when to make a dataset available if several researchers are working with it?]
Rationale: Most of the data management responsibilities outlined in the final section require a lot of time and effort. Often, datasets are shared among members of the same project, and the use of these datasets improves or makes possible scholarly outcomes such as the publication of articles, book chapters, conference presentations, proceedings, etc. It is necessary to have a common understanding of how to acknowledge the roles of data managers, data creators, and data analysts in the research process. These roles may not be appropriate as manuscript authors, but there are many other options. Acknowledging these roles is not a legal matter (no law requires it), but it is an ethical one. Responsible conduct of research involves acknowledging other people’s roles in managing data. Acknowledging these roles may also have an impact on the careers of the researchers involved.
[Decide which procedures you will follow to acknowledge data management roles, and whether there are any preferred methods. This template lists the options in order: options that follow best practices are noted at the beginning, while practices that we discourage are noted at the end. We use “data management” here as a general term, but consider replacing it with more specific roles. For example, you may want to consider offering co-authorship on data publications to the researchers involved in data collection and data quality control, and adding the researchers involved in instrumentation maintenance to the acknowledgements]
All members of the Project involved in roles related to data management will be acknowledged in some way. Specifically:
Members of the Project who were involved in data management [change to a more specific role] will be offered co-authorship of papers that make use of their data. Co-authorship will require participation in the interpretation of the data, writing or critical review of the manuscript, and approval of the final manuscript. [if the group defines authorship using a specific set of criteria, include a link to those criteria here. A few examples of current definitions of authorship can be found at https://publicationethics.org/resources/discussion-documents/what-constitutes-authorship-june-2014]. The offer of co-authorship may be accepted or declined.
Datasets will be published separately from the research, in a repository or as an article in a data journal [change if there are more discipline-specific options]. Members of the Project with a significant data management contribution will be listed as co-authors on the data publication. Every member of the Project who makes use of the published datasets will cite the dataset and list it in the reference list of their publications.
When possible, publications will be made in journals that use the CRediT authorship taxonomy (http://docs.casrai.org/CRediT) or similar. The roles of each of the members of the Project involved in data management will be documented using the appropriate roles.
Rationale: Data management takes time and effort. To avoid overlooking any important data management action, it should be clear to all members of the team who is responsible for each one.
Principal Investigator (PI): leads the Project. The PI is usually designated by the funder. If there is no funder, or the funder does not designate a principal investigator, the PI will be the person providing leadership to the Project.
Faculty Investigator: they actively perform research on all or a part of the research Project. They may provide active mentorship to students.
Team member: they contribute to the scientific development or execution of a study in a substantive, measurable way (research/postdoctoral fellows, technicians, associates and consultants).
Student: member of the Project pursuing a degree (undergraduate, master’s, PhD, or other).
DMP Implementation: responsible for ensuring that the Data Management Plan and the Internal Data Sharing Plan move from planning into implementation; ensuring that any practices, responsibilities, and policies outlined in the plans are followed; ensuring that new members of the Project receive data management training; and keeping the Data Management Plan and the Internal Data Sharing Plan up to date, making sure that all members of the Project understand and are prepared to apply the changes.
Responsibility of: [complete with one of the roles defined above]
Access control: responsible for regulating access to data based on the roles of the authorized user, whether from the project or not. Access is the ability to perform a specific task, such as view, create, or modify a file. Responsible for granting access to data by members outside of the project when requested during the duration of the project.
Responsibility of: [complete with one of the roles defined above]
Protection of sensitive and protected data: responsible for complying with applicable laws and regulations, institutional policies, and ethical principles governing human subjects research and sensitive and protected data.
Responsibility of: [complete with one of the roles defined above]
Software creation and maintenance: responsible for the creation, design, and installation of software products (e.g. code writing) and maintenance of the system (software updates, error correction, enhancement of existing features).
Responsibility of: [complete with one of the roles defined above]
Instrumentation maintenance: responsible for conducting tasks related to instruments such as installation, calibration, testing, and performing maintenance of instrumentation equipment.
Responsibility of: [complete with one of the roles defined above]
Data collection/data generation: responsible for data collection and creation (researching, locating, identifying, and measuring), data entry, information processing (transcription and manipulation), and data generation (prototypes, models, and databases).
Responsibility of: [complete with one of the roles defined above]
Data organization: responsible for maintaining the data in an organized data structure so that it is easy to find (i.e. folder structure, file naming conventions). Responsible for saving the data in the appropriate formats.
Responsibility of: [complete with one of the roles defined above]
Metadata generation: responsible for generating metadata (data descriptions) and documentation, using the metadata standards or templates specified in the Data Management Plan.
Responsibility of: [complete with one of the roles defined above]
Quality control: responsible for performing quality assurance and quality control. This involves testing, reviewing, and cleansing data; calibration; error correction; data remediation; and documenting the quality control applied to the data points.
Responsibility of: [complete with one of the roles defined above]
Data analysis: responsible for various activities related to data analysis, such as examining, analyzing, sorting, aggregating, transforming, modeling, visualizing, validating, and presenting data to answer research questions.
Responsibility of: [complete with one of the roles defined above]
Archiving and preservation: responsible for ensuring long-term archiving, storage, preservation of, and access to the data (and associated metadata), e.g. in a repository or managed internally.
Responsibility of: [complete with one of the roles defined above]