CIAT, like most research institutions, generates a lot of research data and would like to publish it. Data publishing involves the digital dissemination of research data and supporting information, such as metadata, documentation, models, code and unpublished reports, in such a way that it is persistently and uniquely identifiable.
In the past, research data was not viewed as a central component of scholarly communication, nor was it counted as a core research output in the way that peer-reviewed journal articles are. In recent years, however, there has been a growing emphasis on publishing research data as a principal research output. The Data Citation Principles [1] state that “Data should be considered legitimate, citable products of research”.
Why Publish Research Data
Here at CIAT, we believe that all our research outputs, including research data, should be shared openly and as widely as possible. We believe this contributes to our mission of reducing hunger and poverty, and improving human nutrition in the tropics through our research. For research data, this means publishing it as open access so that it can be used and reused in ways that we may not have anticipated in the initial research. However, there are many other benefits to publishing data, as illustrated in this infographic.
Before You Publish Data:
Researchers or data authors are responsible for ensuring that data is ready for publishing. Before publishing, data authors should ensure that:
- The dataset has been cleaned, verified for correctness and is suitable for the intended use.
- The dataset is well structured.
- The dataset is well documented. This includes rich metadata, documentation on the methodology, a codebook or variable descriptions and data collection tools like questionnaires where applicable.
- They have considered privacy, confidentiality and security related issues, to determine whether the dataset can be published or not and whether it needs to undergo an anonymization or de-identification process.
- The dataset uses reusable file formats. These include open standard file formats such as CSV, or proprietary formats that have become a de facto standard, such as the ESRI shapefile.
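Part of the checklist above can be automated. The following is a minimal sketch, not an official CIAT tool: it inspects a list of file names and flags formats that may not be reusable, and checks that some documentation file is present. The extension list, the documentation names and the function `check_dataset_folder` are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: illustrative lists only, not an official list of accepted formats.
REUSABLE_EXTENSIONS = {".csv", ".txt", ".json", ".shp", ".shx", ".dbf", ".prj"}
DOCUMENTATION_NAMES = {"readme", "codebook"}


def check_dataset_folder(file_names):
    """Return a list of human-readable problems found in a dataset's file list."""
    problems = []
    paths = [Path(name) for name in file_names]

    # Flag data files whose format is not on the reusable list.
    for path in paths:
        if path.stem.lower() in DOCUMENTATION_NAMES:
            continue  # documentation files are checked separately below
        if path.suffix.lower() not in REUSABLE_EXTENSIONS:
            shown = path.suffix or "(none)"
            problems.append(f"{path.name}: format {shown} may not be reusable")

    # Require at least one documentation file such as a README or codebook.
    stems = {path.stem.lower() for path in paths}
    if not stems & DOCUMENTATION_NAMES:
        problems.append("no README or codebook found")

    return problems


if __name__ == "__main__":
    for problem in check_dataset_folder(["yield_trials.xlsx", "plots.shp"]):
        print(problem)
```

A check like this only covers file formats and the presence of documentation. Data cleaning, privacy review and metadata quality still need a person to judge them.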
Where to Publish Data:
Researchers have several options on where to publish research data. The choice is usually between a reputable peer reviewed data journal or an approved data repository.
The key consideration when selecting where to publish data is to ensure that the published data will adhere to the FAIR data principles; that is, that the data will be Findable, Accessible, Interoperable and Reusable. This means the data will need to have persistent identifiers, be citable through a data citation, be machine and human readable as far as possible, and have a license that guarantees openness and reusability.
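The four requirements listed above can be checked against a dataset's metadata record before it is submitted. The sketch below is only an illustration: the field names (`persistent_identifier`, `citation`, `file_formats`, `license`) and the function `missing_fair_elements` are hypothetical, and real repositories define their own metadata schemas.

```python
# Assumption: hypothetical field names chosen for illustration only.
REQUIRED_FIELDS = {
    "persistent_identifier": "a persistent identifier such as a DOI",
    "citation": "a data citation",
    "file_formats": "machine- and human-readable file formats",
    "license": "a license that guarantees openness and reusability",
}


def missing_fair_elements(record):
    """Return descriptions of the required elements that are absent or empty."""
    return [
        description
        for field, description in REQUIRED_FIELDS.items()
        if not record.get(field)
    ]


if __name__ == "__main__":
    # Placeholder values; this is not a real dataset record.
    draft_record = {
        "persistent_identifier": "doi:10.0000/example",
        "citation": "Author (Year). Dataset title. Repository.",
        "file_formats": ["csv"],
        "license": "",
    }
    for description in missing_fair_elements(draft_record):
        print("Missing:", description)
```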
The options CIAT recommends for data publishing are:
- Peer reviewed Data Journals
Research data may be published as a “Data Paper”, which describes the dataset: the purpose of collection, the methodology used during collection and the data content. A data paper does not focus on reporting the hypothesis, analysis and conclusions drawn from the data.
Examples of data journals include Scientific Data, Geoscience Data Journal and Data in Brief. The CIAT library is compiling a list of relevant data journals. The University of Edinburgh maintains a very good list of data journals.
- Subject specific repositories
Researchers can also publish data in repositories that accept data for particular disciplines. Examples include GBIF for biodiversity data and PANGAEA for geoscientific and environmental data. Please contact the CIAT data and information management team if you need assistance in selecting a data repository. Re3data.org maintains a list of research data repositories by subject.
- Institutional and CRP repositories
CIAT, other CGIAR centers and the CGIAR research programs (CRPs) all maintain repositories that can be used to publish datasets and that meet all the requirements for publishing FAIR data. For CIAT researchers, datasets that are not published in the first two categories above, or that underpin journal articles, are published in our institutional repository on Dataverse.
- General purpose repositories
When dealing with intellectual property issues concerning published data, it is advisable to ensure that no ambiguity exists about the rights of others to distribute and reuse the data. Researchers are therefore always advised to apply a data license that will guarantee the openness and reusability of the data. We recommend that one of the following two licenses be applied to all datasets, as these licenses also ensure attribution back to the author and the center.
Depending on the circumstances, non-commercial versions of these licenses may be applied as well.
Types of Data to Publish
CIAT encourages researchers to publish two kinds of datasets:
- Primary data used in the production of a publication.
- Unpublished datasets that span an entire research project and that are described by:
  - Materials and methods.
  - Proper documentation, including a clear description of the variables, the data acquisition tools and, if the data was transformed from its raw format, the software code used.
When to Publish Data
According to CIAT's open access and open data policy, data and datasets should be published within 12 months of an appropriate project milestone, such as the end of data collection or the end of the project. Datasets used in publications should be published within 6 months of article publication.
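The two time limits can be turned into concrete dates. The following is a minimal sketch under stated assumptions: the policy text does not say how months are counted, so this example counts calendar months and clamps to the last day of a shorter month. The function names `add_months` and `publication_deadline` are hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    Assumption: if the target month is shorter, the day is clamped to its
    last day (for example, 31 August plus 6 months gives the end of February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def publication_deadline(reference_date, used_in_article=False):
    """Latest publication date for a dataset.

    reference_date: the project milestone, or the article publication date
    when used_in_article is True.
    """
    months = 6 if used_in_article else 12
    return add_months(reference_date, months)


if __name__ == "__main__":
    # Data collection ended on 15 March 2023: 12-month limit applies.
    print(publication_deadline(date(2023, 3, 15)))
    # Article published on 31 August 2023: 6-month limit applies.
    print(publication_deadline(date(2023, 8, 31), used_in_article=True))
```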
Examples of Data Publishing Workflows for Publication Data
One question that we frequently receive from researchers is whether publishing data, and receiving a citation and a permanent identifier such as a DOI, can be considered “prior publication” by the journals to which they may later submit a research paper that uses the data. Many journals, including those from Nature, Science, Elsevier, PLOS and SAGE, allow work based on previously published datasets. However, we advise researchers to always verify with the target journal in their publication plan before publishing the datasets. If the target journal does consider published data to be prior publication, we help researchers publish the replication data after the article has been published.
Restrictions to Publishing Data
Not all data is suitable for publishing, and exceptions to the open data policy exist. Data is usually unsuitable for publishing because of one of the following issues:
- Privacy – Information that identifies an individual.
- Confidentiality – Information that should not be shared.
- Security – Release of the data would pose a threat to someone or something.
[1] Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014. https://www.force11.org/group/joint-declaration-data-citation-principles-final