LibGuides: Open Research Handbook: Open research data

Open research data

According to The Open Definition, open data 'can be freely used, modified, and shared by anyone for any purpose'. But permission alone is not enough if there is no means to find, access and use the data. Open data also have to be:

explicitly identified and formally entered on the online public record, so that they can be accurately cited and discovered;
accessible, so that they can be opened, read and processed;
presented and documented in such a way that they can be understood and used.

These usability conditions are expressed in the FAIR Data Principles, according to which data must be Findable, Accessible, Interoperable and Re-usable. The FAIR Principles were first set out in 2016 by a group of stakeholders from academia, industry, funding agencies, and scholarly publishers. The Principles put specific emphasis on the ability of machines to automatically find and use data and/or related metadata, in addition to supporting re-use by individuals.

Since they were first published, the FAIR Principles have achieved widespread acceptance, and have been adopted as standards for management of data, development of infrastructure and delivery of services.

Open data, to be open to the fullest extent, must also be FAIR. (But note that FAIR data do not have to be open: restricted-access data may be FAIR, providing the metadata describing them are openly accessible.)

Data should be made available in accordance with the FAIR Data Principles.

FAIR Data Principles

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

To be Re-usable:

R1. (meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards

What does FAIR mean in practice? Let us consider the FAIR Principles in more detail.

Findable and Accessible

It is quite common for articles reporting findings based on collection and analysis of primary data to say something like this: 'Data supporting these findings can be supplied on request'. Permission to access the data is given, but are they findable and accessible?

The data are not on public record - there is no explicit description of the data or formal citation, so the dataset cannot be precisely identified. The existence of the data is not independently verified, and there is no guarantee they and information about them will continue to exist and be available. Access to the data depends on an applicant being able to locate the author (who may have moved on, retired or died), on the author being willing to supply the data in a timely fashion, on the author being able to match the data supplied to the data previously described and now requested, and on the data being retrievable by the author and in an uncorrupted state. Given this chain of dependencies, the probability of the data becoming undiscoverable and inaccessible steadily increases as a function of time elapsed. A study published in 2014 found that the odds of being able to access data associated with published studies fell by 17% per year, with broken email addresses and obsolete storage devices being the principal causes of access failure.
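To get a feel for what a 17% annual fall in odds implies, the following is a minimal sketch of the arithmetic. It assumes the decline compounds at a constant rate each year, and the starting odds of 4 to 1 are a purely hypothetical figure chosen for illustration, not a value taken from the study. Note that the figure applies to odds, not to probability, so the sketch converts between the two.

```python
def odds_after(initial_odds: float, years: int, annual_fall: float = 0.17) -> float:
    """Odds of data being accessible after `years`, assuming a constant yearly fall."""
    return initial_odds * (1.0 - annual_fall) ** years


def odds_to_probability(odds: float) -> float:
    """Convert odds (4.0 means 4 to 1 in favour) to a probability between 0 and 1."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    initial_odds = 4.0  # hypothetical starting point: 4 to 1 in favour of access
    for years in (0, 5, 10, 20):
        odds = odds_after(initial_odds, years)
        probability = odds_to_probability(odds)
        print(f"{years:>2} years: odds {odds:.2f}, probability {probability:.0%}")
```

Running the script prints the implied odds and probability at each interval, which makes it easy to try other starting assumptions.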

For data to be Findable and Accessible, sufficient information needs to be published that they can be explicitly identified, located and accessed; this information, and the datasets themselves, need to be persistent over time; and the means of providing access needs to be organisationally managed and procedurally defined, so they are not at the mercy of a single point of failure.

Interoperable

This means that information about the data has to be published in machine-readable formats, i.e. as an online structured metadata record using standard vocabularies or ontologies to record metadata elements. Machine-readability should extend to the data object itself as far as possible. This may include storing data in open and editable file formats with semantic encoding, and not proprietary formats with non-transparent or non-semantic encoding. For example: graphical and tabular data are often made available as supplementary information alongside journal articles in PDF format. This is a near-useless format for storage of structured quantitative data, because it does not enable the data to be easily extracted, edited and analysed. Another example: although Microsoft Excel files can be opened in any number of software applications, and can be exported to XML format, embedded formatting and formulas may be lost in translation. It is preferable to preserve data in open formats, such as CSV for tabular data, which are universally accessible and do not contain embedded components that may fail to function in some applications.
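As an illustration of what a machine-readable metadata record might look like, here is a minimal sketch using JSON. The field names are loosely modelled on the kind of descriptive fields that repository metadata schemas typically ask for; they do not follow any particular standard, and every value (including the DOI) is a hypothetical placeholder. In practice the repository you deposit with will determine the schema and vocabularies used.

```python
import json

# Hypothetical record: every value below is a placeholder for illustration only.
record = {
    "identifier": {"type": "DOI", "value": "10.1234/example-dataset"},
    "title": "Example rainfall observations",
    "creators": [{"name": "Researcher, A."}],
    "publisher": "Example Data Repository",
    "publication_year": 2024,
    "resource_type": "Dataset",
    "licence": "CC-BY-4.0",
    "formats": ["text/csv"],
}

# Assumed minimum set of fields for this sketch, not an official requirement.
REQUIRED_FIELDS = (
    "identifier", "title", "creators", "publisher",
    "publication_year", "resource_type", "licence",
)


def missing_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


def to_json(metadata: dict) -> str:
    """Serialise the record as JSON so that software can parse it."""
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json(record))
    print("Missing fields:", missing_fields(record))
```

Because the record is structured rather than free text, a program can check it for completeness or harvest it into a search index without human interpretation.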

Re-usable

Even supposing a data file is made retrievable and technically accessible, it may still be unusable.

Imagine a table of values, with rows for participants in a study and columns for variables. Can the user unambiguously define the variable from the column header? Are the units of measurement specified? How are missing values recorded? Have values been rounded up, or averaged from several measurements? Are these all the data, or have anomalies and outliers been removed, and if so, by what criteria? What were the protocols followed to collect the data? What instruments were used? What additional contextual variables might be relevant? (e.g. where were the data collected? at what time of year? time of day? weather? was the subject fed or fasting?) What research question were the data collected to answer, and how have the data been analysed?

We can see that surrounding the raw data are various levels of information which enable you and other users to make sense of the data in different ways. To you much of this information will be tacit and may not need to be written down, but for another user, with no experience of the research context or methods, the information has to be made explicit.

Data must therefore be provided with sufficient information and supporting documentation to enable them to be understood and used. Further guidance on documentation and metadata is provided on the research data management web pages.
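One common form of supporting documentation is a data dictionary that defines each column, its unit, and how missing values are coded. The following is a minimal sketch, assuming a small CSV table held in memory; the column names, units and the `NA` missing-value code are all hypothetical choices made for the example.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column in the data file.
DATA_DICTIONARY = {
    "participant_id": {"description": "Randomly assigned participant identifier", "unit": None},
    "body_mass": {"description": "Body mass at enrolment, single measurement", "unit": "kg"},
    "fasting": {"description": "Whether the participant had fasted beforehand", "unit": None},
}
MISSING_VALUE_CODE = "NA"  # assumed convention for recording missing values

# Hypothetical data file contents, held in memory for the example.
SAMPLE_CSV = """participant_id,body_mass,fasting
P001,72.5,yes
P002,NA,no
"""


def undocumented_columns(csv_text: str, dictionary: dict) -> list[str]:
    """Return column headers that have no entry in the data dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [name for name in (reader.fieldnames or []) if name not in dictionary]


def count_missing(csv_text: str, missing_code: str = MISSING_VALUE_CODE) -> dict[str, int]:
    """Count, per column, the cells that hold the documented missing-value code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            if row.get(name) == missing_code:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    print("Undocumented columns:", undocumented_columns(SAMPLE_CSV, DATA_DICTIONARY))
    print("Missing values per column:", count_missing(SAMPLE_CSV))
```

A check like this can be run before deposit to confirm that every column a user will encounter is explained somewhere in the documentation.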

Permission to use the data for specific purposes must also be given by means of a licence. See the section on licensing for more information about this.

To be made open and FAIR, data should be deposited in a data repository. This is a service that exists to preserve and provide access to research data. A data repository is a future-proofed vehicle for ensuring that data remain accessible and usable over the long term. It is preferable to sharing data as supplementary files alongside a published article, or via cloud-based file storage services, or maintaining data in private storage and sharing on request only.

A data repository should not be confused with cloud-based services that provide file storage and sharing, such as Google Drive or the Open Science Framework. A data repository performs a number of specific functions:

It actively preserves data, e.g. replicating and validating data files, migrating to preservation formats;
It publishes metadata to enable online discovery;
It assigns persistent unique identifiers (e.g. DOIs) to datasets and makes them citable;
It quality-controls datasets and enhances metadata, e.g. by applying standard vocabularies (not all repositories do this);
It manages access to data so that they can be used by other people;
It applies licence notices, to make terms of use and attribution requirements clear.
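To make the idea of validating data files more concrete, here is a minimal sketch of a checksum comparison, a technique often used to detect whether a file has changed or become corrupted since it was deposited. It is an illustration only and does not describe how any particular repository works; the file name and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        data_file = Path(folder) / "observations.csv"  # hypothetical file name
        data_file.write_text("site,rainfall_mm\nA,12.4\n", encoding="utf-8")
        recorded = sha256_of(data_file)  # checksum stored at the time of deposit
        print("Intact after deposit:", is_unchanged(data_file, recorded))
        data_file.write_text("site,rainfall_mm\nA,99.9\n", encoding="utf-8")
        print("Intact after change:", is_unchanged(data_file, recorded))
```

Files kept in ordinary cloud storage or on a personal drive are rarely checked in this way, which is one reason a managed repository is the safer long-term home for data.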

Examples of data repositories include: disciplinary data centres and their component databases, such as NERC data centres and the databases of the European Bioinformatics Institute; institutional data repositories, such as the University of Reading Research Data Archive; and general-purpose data sharing services, such as Zenodo and figshare.

As a general rule, you should use data repositories specific to your research domain or the data type, where these are available. These are community places of resort and provide subject-specialist curation, e.g. quality-controlling submissions and enhancing metadata. They include services funded and supported by Research Councils, which you may be required or encouraged to use, and repositories recommended by funders and publishers.

Funders

If your funder supports or recommends a particular data repository, you should use this. Some Research Councils directly fund data centres and expect researchers to offer their data to these. The main data centres are:

Archaeology Data Service (AHRC and NERC);
UK Data Service (ESRC);
NERC data centres (CEDA Archive, Environmental Information Data Centre, and others).

Only NERC requires its researchers to offer their data to the relevant NERC data centre. ESRC requires data to be deposited in either its ReShare repository (part of the UK Data Service), or an appropriate responsible digital repository, such as an institutional repository.

BBSRC contributes funding to a number of international bioscience data sharing resources, including the molecular biology databases of the European Bioinformatics Institute. The Wellcome Trust also maintains a list of approved data repositories.

Publishers

Although most journals accept the submission of supplementary data alongside a journal article, it is generally better to use a dedicated data repository and reference the data from an article. Many publishers now prefer this and recommend discipline-specific repositories, for example Springer Nature and PLOS.

The University Research Data Archive

In the absence of a suitable external service you can use the University's Research Data Archive. Research data in non-digital formats and digital data that cannot be made accessible or require controlled access should also be registered in the University Archive. The Archive can provide a mechanism to regulate access to controlled data under data sharing agreement where this is necessary.

General-purpose data sharing services

You can also use general-purpose data sharing services, such as Zenodo (funded by the EC), Dataverse (managed by Harvard University), figshare (a commercial service, but free to individual users), and Dryad Digital Repository (a non-profit service that charges a small fee for deposits, which may be waived for authors published with some journals). These will not provide the quality control that a specialist or institutional data repository offers, but they are mostly free, quick and easy to use.

It is often assumed that if data have been collected from human subjects, or contain confidential or sensitive information, they cannot be made openly available or even shared outside of a project. In fact, data collected from research participants can often be made openly available following appropriate redaction; and data that cannot be shared openly may be shared with authorised users under a controlled access procedure, which some data repositories operate.

Most data collected from human subjects can be anonymised for sharing. This applies to both quantitative and qualitative data. A valid reason for restricting access to such data would obtain only if it is not possible to anonymise the data (biometric data, for example) or if the risk of causing harm or distress by disclosure is significant. The UK Data Service provides very good guidance on anonymisation of both quantitative and qualitative data.

Even data containing personal or confidential information may be shared under certain conditions, with appropriate consent. Some data repositories, e.g. the UK Data Service ReShare repository and the European Genome-phenome Archive, can manage controlled access to sensitive/confidential data. The University Research Data Archive can also offer a restricted access option. Contact us if you wish to discuss this.

If data collected from human subjects have been fully anonymised, you do not need consent to share them, but it is good practice to inform your research participants how the data you collect from them will be used. Your information sheet should address this and your consent form should specifically allow the participant to indicate they have understood your intentions and agree to data sharing, by checking a statement such as this:

'I understand that the data collected from me in this study will be preserved and made available in anonymised form, so that they can be consulted and re-used by others.'

Do not state, in your ethical approval application or in the information you provide to participants, that the data you collect will not be disclosed outside of the project, or will be destroyed by a given date (such as 'no later than 3 years after the end of the project'). This is a hostage to fortune. It will prevent you from sharing your data at a later date. It is the case that personal data should be destroyed when no longer required, and it is acceptable to tell your participants this. But you will still want to retain an anonymised dataset that can be archived and shared indefinitely. It is in any case wise to avoid making a specific commitment to destroy personal data by a given time: personal data can be legally retained as long as a valid reason for their retention exists. For example, you may wish to retain details of participants on an internal database to enable you to undertake follow-up studies.

Be aware that in order for publicly-disclosed data to be fully anonymised, any means of linking them to participant records stored internally should be destroyed. If you have used pseudonymous identifiers in your dataset which are linked to separately-held internal participant records, the linked identifiers in the public dataset should be replaced by random identifiers.
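The replacement of linked identifiers with random ones can be sketched as follows. This is a minimal illustration under stated assumptions: the rows are held as a list of dictionaries, the identifier column is called `participant_id` (a hypothetical name), and replacing identifiers is only one part of anonymisation, not a substitute for reviewing the other variables in the dataset.

```python
import secrets


def replace_identifiers(rows: list[dict], id_field: str = "participant_id") -> list[dict]:
    """Return copies of rows with each pseudonymous ID swapped for a random one.

    The old-to-new mapping exists only inside this function and is discarded
    when it returns, so the public IDs cannot be traced back to internal records.
    """
    mapping: dict[str, str] = {}
    used: set[str] = set()
    released = []
    for row in rows:
        old_id = row[id_field]
        if old_id not in mapping:
            new_id = secrets.token_hex(8)
            while new_id in used:  # extremely unlikely, but guarantees uniqueness
                new_id = secrets.token_hex(8)
            mapping[old_id] = new_id
            used.add(new_id)
        # Rows from the same participant keep the same new ID as each other.
        released.append({**row, id_field: mapping[old_id]})
    return released


if __name__ == "__main__":
    internal_rows = [  # hypothetical data
        {"participant_id": "P001", "score": 12},
        {"participant_id": "P002", "score": 15},
        {"participant_id": "P001", "score": 14},
    ]
    for row in replace_identifiers(internal_rows):
        print(row)
```

The sketch preserves row order; if the order of rows could itself reveal who a participant is, the rows should also be reordered before release.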

Researchers who will be collecting data from research participants should consult the University guidance on Data Protection and Research (which includes a Data Protection Checklist for Researchers, and sample information sheet and consent form) and Research Ethics.

Where data are collected from commercial organisations, or where research is conducted in partnership with companies, it may be assumed that data cannot be shared. This is not necessarily the case.

Not all information provided by commercial organisations is commercially confidential, and companies may be willing for data provided by them to be made openly available - with redaction if appropriate.

Open publication of data is not necessarily an enemy of commercial objectives, and in fact may promote them. Corporations that are open with their data, and shown to be associated with prestigious research organisations, derive reputational benefits. Being open can be a valuable strategy for building trust and a basis for long-term collaboration. Many successful commercial businesses are based on Open Source software business models. In some areas - for example, the pharmaceutical industry - the transformative potential of Open Research is already being actively discussed and explored.

Research agreements between the University and commercial partners are made on the basis that the University conducts publicly-funded research for public benefit and is committed to making the results of its research publicly available. Contracts include provisions for making the results of research known through publication. While these contracts may also include restrictions on disclosure of privileged information, publication of data and other material arising in the research is usually possible, providing notice is given to all parties and approval is given.

It is acceptable to restrict access to data if they are commercially confidential or there is a commercial pathway for the research, for example involving an identified industrial partner. If IP protection may be sought, it should be possible to release data once protection has been confirmed.

If you need to understand what is permitted under your contract with a commercial partner, you can contact your Contracts Manager. For queries relating to commercial exploitation of IP and any related restrictions on data sharing, contact the IP Manager.

Some datasets can be substantive research outputs in their own right. This may be the case, for example, with environmental observations or survey data, which are by their very nature unique and irreplaceable.

If you have produced a valuable open dataset in the course of your work, you can gain wider exposure for the dataset and receive academic credit as its producer by publishing a data paper. This is a peer-reviewed article, published in an academic journal, which describes a dataset that has been created in a research context.

A data paper can be an effective means of advertising a valuable dataset and encouraging others to make use of it and cite it. A data paper is also a citable output in its own right, and is a means to ensure that proper recognition is given to those who were involved in creating the dataset. A data paper can also provide prospective users of the data with valuable information about how and why the dataset was created, how it has been used, and how it might be used or further developed.

Bear in mind that the primary purpose of the data paper is to promote re-use, and many journals will require the dataset described to be available under an open licence. Some standard licence restrictions, such as non-commercial terms of use, may be unacceptable.

An example of a data paper published by University members is provided below.

There are various journals that will publish data papers, including dedicated data journals and 'mixed' journals, which will publish both data papers and conventional research articles.

Examples of dedicated data journals are: Data in Brief, Earth System Science Data, Journal of Open Archaeology Data, Nature Scientific Data, Open Health Data, and Polar Data Journal.

These are examples of journals/platforms that will also publish data papers: F1000Research, GigaScience, PLOS ONE and Wellcome Open Research.

A new, long-term daily satellite-based rainfall dataset for operational monitoring in Africa
An example of a data paper describing a dataset created by University staff, published in the journal Nature Scientific Data.
TAMSAT Daily Rainfall Estimates (Version 2.0)
The TAMSAT dataset described and cited in the above paper. The dataset has been deposited in the University of Reading Research Data Archive.

Training resources

Research Data Management and Sharing
This MOOC will provide learners with an introduction to research data management and sharing.
Research Data MANTRA
A free online course in data management from the University of Edinburgh.

Useful links

Research data management website
The University's research data management web pages provide guidance on all aspects of research data management, including use of data repositories for data preservation and sharing.
re3data.org (Registry of Research Data Repositories)
A global registry of research data repositories, searchable by subject, content type and country.
FAIRsharing.org
A registry of data and metadata standards, databases that implement or recommend the standards, and data policies. A key FAIR data resource.
Digital Curation Centre
The Digital Curation Centre (DCC) is an internationally-recognised centre of expertise in digital curation with a focus on building capability and skills for research data management. The DCC provides expert advice and practical help to research organisations wanting to store, manage, protect and share digital research data.
