LibGuides: Research data management: Data preservation and sharing

Preservation and sharing

Data that support research findings should be preserved, and made accessible wherever possible, by deposit in a suitable data repository no later than publication of related research findings, or, in the case of research students, availability of the awarded thesis in CentAUR. Supporting datasets should be referenced from related publications.

Data should be as open as possible, as closed as necessary. Almost all data can be made accessible to others outside the original project on some basis. While open data should be the presumed default in the absence of any reason to restrict access, there may be valid commercial, legal and ethical reasons why access to some data needs to be restricted. There are controlled-access repositories that can manage such datasets.

The data you preserve and make accessible to others are part of the legacy of the research, and in many cases will be necessary to validate the findings you place on the public record. It is important, therefore, that the data are of good quality, preserved according to appropriate standards, and are made accessible and re-usable.

The basis of effective open data sharing is described by the FAIR Data Principles, according to which data should be Findable, Accessible, Interoperable, and Re-usable. In most cases these principles can be complied with by depositing data in a data repository. Various data repository options may be available to you, including disciplinary data centres and data type-specific databases, the University's Research Data Archive, and general-purpose data sharing services.

The data deposit process should begin towards the end of the project, as results are being finalised and publications prepared. PhD students should deposit data before they submit their thesis for examination, so that the dataset can be cited from the thesis. Deposit of data in a data repository requires preparation, and time for this should be allowed.

It is important to reference and link to datasets from related publications, by means of what is called a data access statement or data availability statement. This is required by funders, and also now by many publishers.

Open data

'Open means anyone can freely access, use, modify, and share for any purpose' (The Open Definition)

Data should be as 'open as possible, as closed as necessary'. This is the expectation of this University's Research Data Management Policy, and it reflects widely-accepted standards of research transparency. Funders and publishers express similar expectations in their data sharing policies.

Making available is not the same as making open. Open content is content that has been licensed by or on behalf of the rights-holder as free to use, modify and share for any purpose. Research data should always be licensed when they are made available, whether the licence used is an open licence or one defining more restricted permission. See the Licences and Licensing data tabs for more information.

A large proportion of research data is suitable for being made openly available. This includes much data collected from participants, providing they have been anonymised.

Restricted data

Some data may not be suitable for making openly available. This may be the case for example where data have been collected from participants and there is a higher risk of identification and/or harm, for one or more of the following reasons:

the data are inherently identifying, e.g. some biometric data or observational video data in which individuals are clearly shown;
they were collected from a small and distinctive sample (e.g. sufferers from a rare medical condition in a defined geographic area);
the nature of the information is very personal, even though the data are anonymised (as may be the case with detailed biographical interviews);
the risk of harm in the event of identification is great, even though the likelihood of identification is small (e.g. where the subject matter is sensitive, for example relating to medical history or political activity in a country where such activity may be dangerous).

This does not mean data cannot be made available. Data can be made available under restricted licences and using controlled-access repositories. For example, the UK Data Service ReShare repository has a 'safeguarded' option for higher-risk anonymised data, and the University's Research Data Archive offers a restricted dataset option for very high-risk data and data containing identifiable/confidential information. Data managed on such terms would only be made available in confidence to authorised researchers under a data access agreement. See the tab on Controlled-access repositories for more information.

Data sharing should be thought of as a formal process akin to publication. This entails a number of requirements:

Data must be explicitly identified and formally entered on the online public record, so that they can be accurately cited and discovered;
Data must be accessible, so that they can be opened, read and processed;
Data must be presented and documented in such a way that they can be understood and used;
Permission to use the data must be formally granted by means of a licence.

These usability conditions are expressed in the FAIR Data Principles, according to which data must be Findable, Accessible, Interoperable and Re-usable. These Principles have been widely adopted as standards for management of data, development of infrastructure and delivery of services. They put special emphasis on the ability of machines to automatically find and use data and/or related metadata, in addition to supporting re-use by individuals.

Open data, to be open to the fullest extent, must also be FAIR. But FAIR data do not have to be open: restricted-access data may be FAIR, providing the metadata describing them are openly accessible.

It is important to think about making data FAIR from the outset of your research, as this may affect how you collect and document your data, the formats you store the data in, how you preserve and share the data, and how they are licensed for re-use.

To be made FAIR, data should be deposited in a data repository. This is a service that exists to preserve and provide access to research data. It is a future-proofed vehicle for ensuring that data remain accessible and usable over the long-term.

Using a data repository is preferable to sharing data as supplementary files alongside a published article, or via cloud-based file storage and sharing services (such as the Open Science Framework), or maintaining data in private storage and sharing on request only. None of these ways of sharing data is fully FAIR.

A data repository performs a number of specific FAIR functions:

It actively preserves data for long-term viability, e.g. replicating and validating data files, migrating to preservation formats;
It publishes machine-readable metadata to enable online discovery;
It assigns persistent unique identifiers (e.g. DOIs) to datasets and makes them citable;
It quality-controls datasets and enhances metadata, e.g. by applying standard vocabularies (not all repositories do this);
It manages online access to data so that they can be used by other people;
It applies licence notices, to make terms of use and attribution requirements clear.
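
As an illustration of what machine-readable metadata can look like, the following minimal sketch builds a descriptive record for a dataset and checks that it is complete. The field names, the REQUIRED_FIELDS tuple and all example values (including the DOI) are assumptions made for illustration; real repositories follow their own metadata schemas.

```python
import json

# Hypothetical set of descriptive fields; real repositories define their own schemas.
REQUIRED_FIELDS = (
    "identifier",
    "title",
    "creators",
    "publisher",
    "publication_year",
    "licence_url",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def to_json(record):
    """Serialise the record as JSON so that software can read it."""
    return json.dumps(record, indent=2, sort_keys=True)


# Illustrative values only: the DOI, title and names below are placeholders.
example_record = {
    "identifier": "https://doi.org/10.5555/example-dataset",
    "title": "Example survey dataset",
    "creators": ["Smith, John"],
    "publisher": "Example Data Repository",
    "publication_year": 2024,
    "licence_url": "https://creativecommons.org/licenses/by/4.0/",
}

if __name__ == "__main__":
    print("Missing fields:", missing_fields(example_record))
    print(to_json(example_record))
```

A record of this kind carries the identifier, the descriptive information and the licence terms together, which is what allows both people and software to find, cite and re-use the dataset.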

You are unlikely to need to preserve and share all the data you collect or create in the course of your research. You will therefore need to select data of value, and dispose of the remainder. Bear in mind the following considerations when selecting data for preservation and sharing.

Validating published findings

What data will be required to validate the research findings that are placed on the public record, i.e. through publication in a research article or inclusion in a PhD thesis? Test data, results of failed experiments, and data from faulty instruments need not be included. Data at intermediate stages of processing will often be unnecessary, providing the raw data are preserved and you have documented any processing used to generate the final results. It may also be useful to preserve your data in a final processed format, especially if the effort required to reproduce them would be considerable. Bear in mind that code files used to generate, process and analyse data may form part of the material required to validate results.

The data you share should be your raw data, or as near as possible, at the individual record level (with appropriate anonymisation if required) and in an appropriate format for use and analysis. It is not enough to share only summary or aggregate values without the raw source data, as the results of processing alone are not sufficient to enable others to validate or reproduce your results.

Volume

There are practical limits to the preservability and shareability of some data. Some research may generate large volumes of data, at the scale of hundreds of gigabytes (GB) or several terabytes (TB). Examples of such research might include large-scale high-resolution imaging and video recording, and computer simulations of complex systems, where raw output can run to TB. Many data repositories will not have the capacity to handle very large datasets. Storage, preservation and transfer of data at these scales present both technical and financial challenges, to the extent that the cost of meaningful preservation and sharing of such data outputs may be in excess of any possible benefit. In the case of computer simulations in particular, it may be less important to preserve individual outputs than the model code and input parameters, by means of which a set of results can be reproduced.
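
The point about simulations can be sketched in code. The example below is a toy illustration, not a real model: run_simulation, the parameter names and the file name parameters.json are all hypothetical. It shows that if the input parameters (including the random seed) are saved alongside the model code, the output can be regenerated rather than stored.

```python
import json
import os
import random
import tempfile


def run_simulation(params):
    """Toy stand-in for a model: a seeded random walk of params['steps'] steps."""
    rng = random.Random(params["seed"])  # a fixed seed makes the run repeatable
    position = 0.0
    trajectory = []
    for _ in range(params["steps"]):
        position += rng.gauss(0.0, params["step_size"])
        trajectory.append(position)
    return trajectory


def save_parameters(params, path):
    """Write the input parameters to a small JSON file for archiving."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle, indent=2, sort_keys=True)


def load_parameters(path):
    """Read archived input parameters back from the JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    params = {"seed": 42, "steps": 5, "step_size": 0.5}  # hypothetical inputs
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "parameters.json")
        save_parameters(params, path)
        rerun = run_simulation(load_parameters(path))
    print("Reproduced:", rerun == run_simulation(params))
```

Exact reproduction of this kind generally also depends on using the same software versions, so it is worth recording these with the parameters.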

Even where it is not desirable or possible to deposit high-volume data outputs in a data repository, you may still wish to retain them, for your own ongoing use, and/or in order to be able to share them with others on request. The University can offer options for such use cases. See the tab on Where to deposit data.

Data that cannot be shared

Are there any legal/ethical/contractual restrictions on what data can be shared? In many cases, this is unlikely to mean that data cannot be shared at all. Data may need to be redacted, e.g. to remove confidential or commercially-privileged information, or access to them may need to be restricted in some way.

As a general rule, you would be expected to preserve anonymised data only. For example, you may preserve anonymised transcripts, but dispose of original interview audio recordings; you may preserve anonymised quantitative data from an observation study, but would not record data by means of which individual participants might be identified.

Where confidential information or personal data cannot be removed from data (as may be the case with biometric data, for example), or where the risk of causing harm or distress by disclosure is significant, it may be possible to preserve data on a controlled-access basis. Some data repositories, e.g. the UK Data Service ReShare repository and the European Genome-phenome Archive, can manage controlled access to sensitive/confidential data. The University's Research Data Archive can also offer a restricted access option. See the Research Data Archive section for more information.

Computer code written to generate, process, analyse and validate research data is part of the data produced by the research, and falls within the scope of the University's Research Data Management Policy. Principles of data management should be applied to code, and code written in support of research findings should be preserved and shared wherever possible. Our Publishing research software guide (PDF) provides guidance on best practice in software code sharing.

An online code repository is often used to manage and publish code. A code repository provides various management features, including version control, code review, bug tracking, documentation, and user support, and allows the user to publish code releases. The University provides a GitLab code repository service; other popular platforms are GitHub and Bitbucket. A code repository is a good solution for managing code that is under ongoing development or for building a community of developers and users. But code repository platforms do not guarantee long-term preservation of the code or issue DOIs, and links to code repositories are not version-specific.

Any version of code that supports published results (e.g. model code used to generate output data, or code written for purposes of statistical analysis) should be archived to a public data repository, so that it is preserved as the version relevant to the reported results, and can be cited by DOI from the related publication. Small scripts specific to a dataset can be archived in a data repository alongside the data. Code that may exist as an output in its own right, e.g. model code, may be better archived as a standalone item. GitHub provides an easy-to-use function for archiving code files to the Zenodo digital repository. Code files can also be deposited in the University's Research Data Archive, or any other general-purpose repository.

A licence is an official authorisation to make use of specified material. As well as telling users what they are and are not allowed to do with the material, a licence also provides protection to the creators and owners of intellectual property. An accompanying rights statement asserts legal ownership of the licensed item and the right of its creator(s) to be recognised as such. The attribution condition that is common to many open licences is the legal basis of your right to be credited as the creator of the licensed material. Many licences also include formal disclaimers of liability for any harm or damage that may arise from someone else's use of the material.

Open licences

An open licence makes an item free to access, use, modify and share by anyone for any purpose. Examples of open licences include:

Creative Commons licences for creative works (including research publications and datasets);
Open Source licences for software source code;
Licences for specific types of work, such as the Open Data Commons licences for databases;
Government open data licences, such as the UK Open Government Licence for public sector materials;
Public Domain Dedications, such as the Creative Commons CC0 Public Domain Dedication: strictly speaking, this is a rights waiver, not a licence, but it is generally considered as a type of open licence.

The Creative Commons licence suite includes versions with Non-Commercial and No-Derivatives terms. These and any licences with similar terms are less open licences, because of the restrictions they place on re-use. But if material cannot be made available under a more open licence, it is still wise to publish under a standard licence. The Creative Commons Attribution-NonCommercial (CC BY-NC) licence still grants broad permission for use in research and teaching and other non-commercial activities.

The Open Definition provides a list of conformant open licences for creative works (including publications and datasets). The Open Source Initiative lists Open Source licences for software.

The University does not prescribe use of any particular open licences for data or software, as the most appropriate licence will depend on the nature of the material and related requirements.

Creative Commons Attribution (CC BY) is widely used for the licensing of datasets (as well as Open Access publications and other materials), and is a good choice that will suit most requirements. It is the default licence recommended by the University's Research Data Archive. Other licences may be used or preferred by some repositories. For example, by default NERC data centres release primary data from NERC-funded research under the Open Government Licence; the Dryad Digital Repository releases data only under the Creative Commons Zero Public Domain Dedication.

Licences for restricted data

More customised and restrictive licences may be used where data have been deposited in a controlled-access repository. Examples include:

The UK Data Service end-user licence for safeguarded data requires data to be used in confidence for non-commercial research and learning purposes only. Under the licence users are not permitted to distribute the data to other parties or to seek to identify individuals from the data.
The University's Research Data Archive data access agreement for restricted datasets imposes similar restrictions on parties to the agreement.

When making data available to others outside the research team, you should observe two rules:

Always make the data available under licence, so that it is clear to any person wishing to access and use them who owns the data, and on what terms they can be used;
Make the data available under the most open licence possible, which allows the widest possible scope for re-use and redistribution.

Data should be made available under an open licence, unless there is good reason to license them on a more restrictive basis, for example, to prohibit commercial re-use of data in which a commercial partner has an interest.

How to license IP

A licence to make use of intellectual property is issued by or on behalf of the intellectual property rights-holders. The first thing therefore is to establish who owns the intellectual property, and your right or authorisation to issue the licence. The IPR in primary data tab of the Managing data section provides guidance on identifying the rights-holders in data or software. Rights-holders are typically the University (for IP created by University employees), students (in the absence of any contract or assignment agreement indicating otherwise), or third parties involved in research, such as commercial partners, collaborator organisations, or studentship sponsors.

If the material has been created by multiple authors, or multiple parties have interests in it, you should ensure that any proposed release under a specific licence is agreed by all concerned beforehand, as once it has been applied to material a licence cannot be revoked. Where ownership of research data resides with the University, researchers are authorised under the Research Data Management Policy to make data and source code available under an open licence, providing no commercial, legal or ethical restrictions apply.

To license material, you should clearly mark it with both a rights statement and a licence statement. These combined statements make clear to any prospective user who is the owner of intellectual property rights in the licensed material, and the terms on which the material can be used.

The rights and licence statements should be included in the public information recorded about the material (such as a metadata record in a data repository, or the landing page of a software code repository), as well as in the material itself and/or its primary documentation (such as a readme file or user manual). You do not necessarily have to mark all individual files with these statements, providing item-level statements are clearly visible. Licence statements should include the URL to the full legal code of the licence used (the URL can be embedded in text or a licence logo image).

Most data repositories will include rights and licence statements in the metadata record for an item. A repository will usually enable you to specify rights and licence information when you deposit the dataset. The University's Research Data Archive provides a licence picker tool for uploaded files, with various standard licences and the option to upload your own licence. The licence information displays both in the file metadata and on the item record.

It is important that the rights statement identify all owners of intellectual property in the material. For example, the rights statement for a dataset created by a member of University staff jointly with student John Smith must identify the University and John Smith as rights-holders (assuming the student has not assigned his IP to any other party under contract).

Examples of combined rights and open licence statements are:
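
A minimal sketch of how a combined rights and licence statement might be composed is given here. The wording of the statement, the function names and the year are assumptions made for illustration, not prescribed text; the rights-holders follow the joint staff and student example described above, and the URL points to the legal code of the Creative Commons Attribution 4.0 licence.

```python
def join_names(names):
    """Join rights-holder names as 'A', 'A and B' or 'A, B and C'."""
    names = list(names)
    if not names:
        raise ValueError("at least one rights-holder is required")
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def rights_and_licence_statement(year, rights_holders, licence_name, licence_url, item="dataset"):
    """Compose a rights statement followed by a licence statement with its URL.

    The sentence pattern is a hypothetical template, not official wording.
    """
    holders = join_names(rights_holders)
    return (
        f"© {year} {holders}. "
        f"This {item} is licensed under a {licence_name}: {licence_url}"
    )


if __name__ == "__main__":
    # Illustrative values: the year is a placeholder.
    print(
        rights_and_licence_statement(
            2024,
            ["University of Reading", "John Smith"],
            "Creative Commons Attribution 4.0 International Licence",
            "https://creativecommons.org/licenses/by/4.0/",
        )
    )
```

The same text can then be placed in the repository metadata record and in the readme file, so that the statement travels with the material.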

Licensing software

In most cases short scripts and segments of code written to perform standard operations, e.g. for purposes of data processing, statistical analysis or data visualisation, can be archived alongside data, under the same licence as the dataset (for example, a Creative Commons Attribution licence). This is best suited for situations where the code is likely to have little independent use value, and any re-use is likely to be solely for the purpose of validating results, e.g. by re-running analyses described in a paper.

Where re-use of source code in new contexts or further development is anticipated, for example if substantial original software has been developed, or source code has been written in the context of an ongoing project or established community, it will be appropriate to release the code under an Open Source licence, with the caveat that where existing code has been modified any licence for the modified code must be in accordance with the licence terms for the original code.

There are a number of popular Open Source licences for software, which are listed by the Open Source Initiative, and there is a useful licence picker tool at choosealicense.com. Another useful resource, tl;drLegal, provides plain English summaries of many Open Source licences. For detailed guidance on software licensing, consult our guide to Publishing research software (PDF).

The University does not prescribe the use of specific repositories, and there may be a variety of options open to you. As a general rule we recommend your first choice should be a relevant domain repository where there is one available; alternatively, you can in most cases use the University's Research Data Archive; as a third choice, general-purpose data sharing services may be used.

Most repositories are free to use. Where there is an archiving charge for a data repository, this can usually be recovered from grant funding.

Domain repositories (specific to discipline or data type)

Data should be deposited in a data repository specific to your research discipline or the data type, where one is available. These are community services and provide subject-specialist curation. They include repositories recommended by various funders and publishers. Some have the capacity to accept large volumes of data.

These are some examples of recommended repositories. They are free to use except where otherwise specified.

Biological data: the European Bioinformatics Institute hosts repositories for different types of genetic data, imaging data, and general biological study data. There are no size limits.
Social science/human subject data: the UK Data Service ReShare repository is the research data repository of the UK's national social science data service, funded by ESRC. It has a broad social science/human subject data scope, including biomedical data, and provides in addition to open data archiving a safeguarded data option, suitable for higher-risk anonymised data. You do not have to be funded by ESRC to use the repository.
Environmental data: the NERC Environmental Data Service includes the CEDA Archive for weather and climate data, the Environmental Information Data Centre, and others. NERC-funded researchers are expected to use these. Researchers not funded by NERC may deposit data in scope of a data centre's collection policy, but may be charged to do so.
Archaeological data: the Archaeology Data Service is a national resource for archaeological data. Deposits are chargeable.
Neuroimaging data: OpenNeuro supports the archiving of a range of neuroimaging data, including MRI, PET, MEG and EEG.

Many publishers recommend discipline-specific repositories, especially in the sciences, for example Springer Nature and PLOS. The Wellcome Trust also maintains a list of approved data repositories.

You can search for data repositories by discipline in re3data.org and FAIRsharing.

University of Reading Research Data Archive

In the absence of a suitable external service, staff and research students can use the University's Research Data Archive. This is free to University members and provides both open data archiving and a restricted dataset option for data containing confidential information which can be shared only on a strictly controlled basis.

The Archive has a 20 GB limit for deposits, but other services with more capacity may be options where needed. If no alternative data repository is available for a high-volume dataset, it is an option for a modest charge to archive it offline with DTS and create a linked metadata record for the dataset in the Research Data Archive, so that it can be cited and access to the data can be requested.

General-purpose data sharing services

You can also use general-purpose data sharing services, such as Zenodo (funded by the EC), and Figshare (a commercial service that is free to individual users). These will not provide the quality control that a specialist or institutional data repository offers, but they are free, quick and easy to use.

Figshare+ can be used to share datasets up to several TB in scale for a one-off charge. (The standard Figshare service is free to use for deposits up to 20 GB.) Zenodo accepts deposits of up to 50 GB for free, and up to 200 GB on a one-off basis.
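
The size limits quoted in this guide can be checked against a dataset before you choose a service. The following is a minimal sketch: the service labels and the decimal definition of a gigabyte are assumptions, the limits are only those stated in this guide, and services may change their limits, so check the current terms before depositing.

```python
from pathlib import Path

GB = 10**9  # decimal gigabyte; an assumption, as services may count size differently

# Deposit limits in GB as quoted in this guide; labels are illustrative.
SIZE_LIMITS_GB = [
    ("University Research Data Archive", 20),
    ("Figshare (standard, free)", 20),
    ("Zenodo (free)", 50),
    ("Zenodo (one-off larger deposit)", 200),
]


def dataset_size_bytes(folder):
    """Total size in bytes of all files under a dataset folder."""
    return sum(path.stat().st_size for path in Path(folder).rglob("*") if path.is_file())


def services_within_limit(size_bytes):
    """Return the services whose quoted limit can accommodate the dataset."""
    size_gb = size_bytes / GB
    return [name for name, limit_gb in SIZE_LIMITS_GB if size_gb <= limit_gb]


if __name__ == "__main__":
    # A 30 GB dataset exceeds the 20 GB limits but fits within Zenodo's.
    print(services_within_limit(30 * GB))
```

A dataset that fits none of these limits would need one of the chargeable or specialist high-volume options.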

Some data may not be suitable for public access, for a number of reasons:

they contain confidential information that cannot be easily removed (such as biometric data or video/image data);
the data have been de-identified but still present a higher risk of re-identification and harm to the data subjects;
the data contain information that is confidential for other reasons, e.g. commercially-confidential information;
there is a legitimate interest in retaining the data in their identifiable form, because removing the identifiable elements would significantly diminish their value.

A number of repositories exist that can manage sensitive data falling into one or more of these categories under controlled access procedures. Such a procedure may require a prospective data user to make an application to consult a specific dataset, which can be approved or rejected by the data owner or a nominated data steward. Access would be granted under a special licence or data access agreement. Access to personal data will also be subject to consent from the data subject, so this would need to be considered at the planning and recruitment stage of the research. See the University's guide to Data Protection and Research for more information.

Repositories that provide controlled-access procedures include:

the UK Data Service ReShare repository, which has a 'safeguarded data' option for higher-risk anonymised data
the European Genome-phenome Archive
The University's Research Data Archive, which provides a restricted dataset option. Restricted datasets can be securely preserved and made accessible to authorised researchers affiliated to a research organisation, subject to approval by a Data Access Committee (including the PI of the original study or a nominated representative), and under the terms of a Data Access Agreement between the University and the recipient organisation.

Some research can generate large volumes of data, at the scale of hundreds of gigabytes (GB) or terabytes (TB), such as computational modelling and various kinds of experimental imaging. If you need to archive these data, practical and cost limitations may constrain your options, as some data repositories have size limits. But repositories designed to handle large-scale datasets do exist, notably:

NERC's CEDA Archive routinely manages climate and weather datasets at the TB scale. It is primarily for use by NERC-funded researchers, but it may accept non-NERC-funded data that are within scope. It may charge where NERC is not the funder.
The European Bioinformatics Institute provides repositories for genetic, imaging and general biological study data, which can accept large volumes of data at no charge.
The free-to-use data sharing service Zenodo accepts deposits of up to 50 GB with a maximum of 100 files, and will accept a one-off deposit of up to 200 GB.
Figshare Plus can be used to share datasets up to several TB in scale for a one-off charge, which could be costed into a grant. (The standard Figshare service is free to use for deposits up to 20 GB.)
Some research facilities that support the generation of high-volume data, such as the ISIS Neutron and Muon Source, provide an archive facility for raw data collected on their instruments. In this case you would not need to archive the data yourself as this will be done as part of facility operational procedures.

Bear in mind that you may not need to archive or maintain all of the raw data collected or generated in a project. See What data should you share?

Archiving high-volume data outside a repository

If there is no suitable data repository for a high-volume dataset, we recommend you consider the following solutions, in the order presented. If combined with creation of a metadata record in the Research Data Archive describing the dataset and the means by which it can be accessed, this can enable compliance with the University's data sharing requirements.

The DTS Offline Data Archive provides a cost-effective, long-term storage solution for the archiving of digital data in a secure environment. This service is designed to archive research data that need to be preserved for extended periods but do not require immediate, active access. It is suitable for NFS (Linux) or SMB (Windows) datastores.
University cloud storage offers free high-volume storage. OneDrive accounts provide staff users with 5 TB of storage as standard; Teams sites provide up to 25 TB storage. These services are not designed as long-term storage solutions, and are not optimal for storage and use of high volumes of data. Data stored in OneDrive would be accessible only as long as the account-holder is a member of the University, so should be backed up to another location where continued access by others is required.
External hard drives provide inexpensive storage solutions, but you should consider backing up the data in at least one separate location. The hard drive should be stored securely on site and accessible by at least two people. Data would need to be migrated to new media periodically, e.g. every five years.

If data are stored by these means, you are advised to observe the following principles:

Ensure the data are accessible to/retrievable by at least two people, and that there is a handover policy, so that if someone leaves the University, responsibility is transferred and the data continue to be retrievable. It is advisable to have a designated steward for archived data within a research group or department, who maintains a register of archived datasets, their locations, and responsible owners.
Basic measures should be taken to ensure the integrity and usability of the data. Data files should be write-protected, so that once archived they cannot be further modified. If possible, checksums should be generated for all data files. There should be some documentation of the data, including a file listing or manifest, so that they can be navigated and understood.
If the data support published results, a metadata record should be created in the University's Research Data Archive describing the data and the means by which they can be accessed. This will enable the data to be cited by DOI from related publications, and provide a means by which others can request access to them. If a request to access the data is received, this can be granted by inviting the requester to view the data on site (if this is feasible), or by arranging (at their expense) to send a copy of the relevant data.
When data are deleted, any local register and metadata record in the Research Data Archive must be updated accordingly.
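The integrity measures described above (a file manifest, checksums, and write-protection) can be scripted. The following is a minimal sketch in Python, not a University-provided tool: the function names, the manifest file name MANIFEST.sha256 and the choice of SHA-256 are all assumptions made for illustration. Each manifest line holds a checksum, two spaces, and a relative path, which is the layout used by the common sha256sum utility.

```python
import hashlib
import stat
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest_name: str = "MANIFEST.sha256") -> Path:
    """Write a file listing with one 'checksum  relative/path' line per file."""
    manifest = data_dir / manifest_name
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        if path == manifest:
            continue  # the manifest does not list itself
        rel = path.relative_to(data_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(data_dir: Path, manifest_name: str = "MANIFEST.sha256") -> list[str]:
    """Return the relative paths that are missing or no longer match the manifest."""
    failed = []
    text = (data_dir / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = data_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            failed.append(rel)
    return failed


def write_protect(data_dir: Path) -> None:
    """Remove write permission from every file under data_dir (manifest included)."""
    no_write = ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    for path in data_dir.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & no_write)
```

A typical sequence would be to call write_manifest on the archive folder, then write_protect, and to run verify_manifest after any later migration to new media. An empty list from verify_manifest indicates that every listed file is present and unchanged.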

The Research Data Service can advise on and support you in archiving data using the principles outlined above.

Non-digital data should be digitised for long-term preservation wherever possible. If for any reason this is not possible or desirable, they should be archived following the principles for high-volume data. There should be clear documented ownership and local management of the data. If the data are necessary to support published research findings, a record should be published in the University's Research Data Archive describing the data and the means by which they can be accessed, so that they can be cited from the related publication.

Research outputs that rely on supporting data, code and other materials should provide information about where and how these materials can be accessed. This is a requirement specified by UKRI in its Common principles on research data and Open Access Policy, as well as by other funders of research. Many publishers ask for articles to be accompanied by a data availability statement.

The statement will usually appear either at the head of the article or in the end matter, often in the Acknowledgements section. Your journal's guidance for authors should indicate how to provide your data access statement.

We also recommend that you include a full citation to the dataset in your main references list. An example citation is provided further down this page.

You must bear this requirement in mind when preparing your research outputs. In order to be able to cite your data from the output, you will have to first deposit the dataset in your chosen data repository.

These general principles apply when providing a data availability statement (examples are provided further down this page):

If data are held in a data repository, the name of the data repository they are stored in should be provided, as well as any unique persistent identifier (e.g. the DOI) or accession number for the dataset.
If there are legal, ethical or commercial reasons why some or all data cannot be made openly available, any restrictions should be specified in the data access statement.
If data have been provided in full in the article or as supplementary information, this should be stated.
A direction to contact the author for access to data would not normally be considered an acceptable data access statement.

Below there are examples of data access statements covering a variety of different scenarios. In these examples a dummy DOI is used; this will not resolve.

Open data

Data supporting the results reported in this paper are openly available from the University of Reading Research Data Archive at https://doi.org/10.17864/1947.000999.

All data supporting this study are provided as supplementary information accompanying this paper.

All data are provided in full in the results section of this paper.

Secondary analysis of existing data

This study was a re-analysis of data that are publicly available from the British Atmospheric Data Centre at [DOI]. Data derived through the re-analysis undertaken in this study are available from the University of Reading Research Data Archive at https://doi.org/10.17864/1947.000999.

Ethical restrictions

Interview transcripts are held under safeguards by the UK Data Service and may be accessed by authorised researchers, subject to registration, at [DOI].

Because of the sensitive nature of the research, interviewees did not consent to the retention or sharing of their data.

Commercial restrictions

Supporting data are subject to IP protection and will be available from the University of Reading Research Data Archive at https://doi.org/10.17864/1947.000999 after a temporary embargo period.

Research data are commercially confidential, but can be made available to bona fide researchers subject to a data access agreement. Details of the data and how to request access are available at the University of Reading Research Data Archive: https://doi.org/10.17864/1947.000999.

No new data created

No new data were created in this study.

Standard data citation (include in reference list)

The standard citation format for a dataset is:

Creator(s) (PublicationYear): Title. Publisher. Resource Type. Identifier

For example:

Smith, John and Jones, David (2015): Electricity pylons of the UK, 1928-2005. University of Reading. Dataset. https://doi.org/10.17864/1947.000999.
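If you keep dataset metadata in a script or spreadsheet, a citation in this format can be assembled programmatically. The sketch below is illustrative only: the class and field names are hypothetical, and the way three or more creators are joined (commas, then "and" before the last) is an assumption, since the example above shows only two creators. The output has no trailing full stop after the identifier.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """Elements of the citation pattern:
    Creator(s) (PublicationYear): Title. Publisher. Resource Type. Identifier
    """

    creators: list[str]
    year: int
    title: str
    publisher: str
    identifier: str
    resource_type: str = "Dataset"

    def format(self) -> str:
        # Assumption: the last creator is joined with "and", earlier ones with commas.
        if len(self.creators) > 1:
            names = ", ".join(self.creators[:-1]) + " and " + self.creators[-1]
        else:
            names = self.creators[0]
        title = self.title.rstrip(".")  # avoid a doubled full stop after the title
        return (
            f"{names} ({self.year}): {title}. "
            f"{self.publisher}. {self.resource_type}. {self.identifier}"
        )


citation = DatasetCitation(
    creators=["Smith, John", "Jones, David"],
    year=2015,
    title="Electricity pylons of the UK, 1928-2005",
    publisher="University of Reading",
    identifier="https://doi.org/10.17864/1947.000999",  # dummy DOI from the example above
)
print(citation.format())
```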

Preparing data for sharing

You should put as much care into preparing a dataset as you would any other research output. A deposit in a data repository can be delayed and in some cases rejected if, for example, you have not prepared and documented your data to an appropriate standard, or correctly identified intellectual property rights in a dataset and obtained relevant permissions, or established an ethical basis for sharing of data collected from participants, or anonymised a dataset where this is required.

This guide takes you through the main things to consider and address before you deposit a dataset in a data repository. It will help you to address critical requirements and produce a good quality, appropriately documented dataset.

For a more detailed version of this guide, download Preparing for data sharing (PDF).

It is important to define your dataset and identify its contents, as this will also determine what preparation is necessary. Refer to What data should you share? for more guidance on defining your dataset.

Check your preferred repository's guidance on depositing data and note any requirements it may have. Repositories may have content and metadata requirements for certain types of data, require submission of data in specific formats, and place limitations on the volume of data that can be deposited. Some repositories may also charge for deposit of data (although most do not). If you have not identified the repository in which you will deposit data, refer to our guidance on choosing a data repository.

If data have been collected from research participants, check that you have documented consent for data sharing. It is acceptable to disclose data obtained from human subjects without consent if the data have been fully anonymised, but it is good practice to inform participants of your intention to do this. It is not acceptable to disclose even anonymised data if in your consent procedure you stated that the data would not be disclosed, or would be destroyed at a given time. Identifiable data can be disclosed under a controlled access procedure, providing that participants have consented to participate in the study on the understanding that data would be shared in this way. The University provides a sample consent form including statements suitable for open data sharing and sharing of data subject to safeguards.

If you are depositing data collected from participants in the Research Data Archive, you will be required to submit your participant information sheet(s) and unsigned copies of any consent form(s) used alongside your data files, so that we can confirm you have a basis for data sharing. These documents will be stored alongside the dataset for administrative purposes. Access to them will be restricted, meaning they will not form part of the dataset available for users to download.

It is important to understand who is a creator of your dataset – as well as who is not – because intellectual property rights and permission to distribute the data will be associated with its creators. Creators of datasets also have the moral right to be identified as such. Datasets may be the work of many hands, and it is not always easy to clearly distinguish their creators from other people who contributed to the work of the project.

According to the Copyright, Designs and Patents Act 1988 it is ‘the selection or arrangement of the contents of the database’ that constitutes the creative act which attracts copyright. Creators are those who have had a direct creative role in the selection and arrangement of data in the dataset. This is not the same as being involved in the design of the research or in the original data collection. In most cases, a project PI or student supervisor will not be a creator of the dataset, unless they had a direct authorial hand in its creation. Technicians, contractors and others involved in the collection of data are not usually creators of a dataset, unless they had creative input into the selection and arrangement of the data points.

Anyone who does not meet the definition of a Creator but has contributed to the production of the dataset can still be acknowledged for their contribution in the dataset documentation. The Research Data Archive includes a Contributors field in its metadata schema.

You must clearly identify rights-holders, because your authorisation to deposit the dataset depends on their permission. By depositing data you are also distributing them, and doing this without the authorisation of the rights-holder will be a breach of copyright.

Owners of intellectual property rights (IPR) in the data will be associated with the creators of the dataset.

In general, an employer will own IPR created by its employees: the University is ordinarily the rights-holder in IP created by members of staff. Research contracts generally allow ownership of ‘arising IP’ (i.e. created under the contract) to reside with the originating institution.

Students registered with the University own the IP they create by default, but this may not be the case if they are funded under a third-party sponsorship agreement (excluding public funders such as Research Councils, which do not assign student IP to other parties), or if they have assigned their IP to the University. A sponsorship agreement will include Intellectual Property clauses stating which party has ownership of arising IP. Ownership of IP created by a student at another institution will be subject to that institution's IP policy and any relevant agreements.

If a dataset has multiple creators, it may also have multiple rights-holders, which may include the University, students in their own right, and collaborating and partner organisations. There is more guidance on IPR in primary data/software in the Managing data section.

You may need to investigate any applicable research contracts or studentship agreements to establish what parties hold rights in a dataset. Students and/or their supervisors should have copies of any contracts relating to their research programmes. If you need to locate a copy of a contract, contact your Contracts Manager.

Where datasets incorporate secondary data, the owners of these data will also have the rights to determine how and on what terms their data are distributed by you.

IP should always be published under a licence, so that ownership of the IP and terms of use are clear to others. In accordance with the University's Research Data Management Policy you are expected to share data under an open licence wherever possible. The most widely used open data licences are the Creative Commons Attribution (CC BY) licence, which permits re-use of the data provided proper attribution is made, and the Creative Commons Zero Public Domain Dedication (CC0), a waiver of all rights in the work.

In order to license the data you must be the data owner or authorised to assign a licence on behalf of the data owner, so the choice of licence may be subject to the permission of other parties. For example: a third-party co-creator with commercial interests may request the application of a non-commercial licence; if the dataset incorporates third-party materials these may be made available with the third party's permission on an ‘All Rights Reserved’ basis.

Data held under a controlled access policy (such as UK Data Service safeguarded data and restricted datasets in the Research Data Archive) will be made available under special licence terms. The Data Access Agreement for restricted datasets deposited in the Research Data Archive allows data to be used, subject to authorisation, in confidence for non-commercial research and learning purposes only. The Agreement will be made between the University and the organisation to which the authorised user is affiliated.

As a general rule we recommend you use the Creative Commons Attribution licence for open data, and this is the default applied to uploaded files in the Research Data Archive. More restrictive licences should only be used if there is a justification for doing so, for example, to protect commercial or other confidential interests.

We provide guidance on licences and licensing. Guidance on licence options for software can be found in our Guide to publishing research software.

You must ensure that you have permission to archive and distribute the dataset from: the creators; the rights-holders; parties with contractual rights regarding publication of research outputs; secondary data owners.

Creators

Creators of datasets have the moral right in copyright law to be identified as such. Individuals also have the moral right not to have a work falsely attributed to them as an author. You must therefore ensure that the dataset is archived with the knowledge and permission of its creators.

Rights-holders

Where the employer is a University or publicly-funded research organisation, permission to publish the data can be inferred from their policy position on research data, which is, certainly in the case of universities, to promote the public sharing of data supporting research outputs wherever possible. Other parties, including students, industrial studentship sponsors and commercial research partners, will need to give written consent to publication of the dataset.

Parties to contracts

Research and studentship contracts have Publication clauses, which generally grant other parties the right to be notified of and have the opportunity to approve or delay any intended publication. This right exists irrespective of who owns the IP created under the contract. The standard notice period is 30 days. Persons to whom notices should be sent will be identified in the agreement (usually towards the end).

Secondary data owners

If your dataset incorporates IP from existing sources, you may need to seek permission to distribute the dataset. If data have been obtained from a public resource such as a website or a data repository, you should check the source for any terms of use or licence information. If you have incorporated government or research data, these may well have been made available under open licences that permit redistribution, providing acknowledgement of the source is given. If you cannot find any information in the published source, or the data have been obtained from a non-public source, you may need to contact the data owner directly. We provide guidance on using secondary data.

Seeking permission

Permission should be requested in writing. Email is acceptable. Research contracts and sponsorship agreements will nominate a contact for each party, to whom any notices under the contract can be directed. In the case of studentship agreements, notices would usually be sent to the student's supervisors at the University and the sponsor organisation.

When contacting other parties for permission to archive and distribute data, it is important to identify the data unambiguously, and to be clear how the data will be made available, and on what terms they will be licensed for use. While you should always seek to license the dataset on the most open terms, other parties may legitimately require more restrictive licensing. For example, a commercial partner may not be willing to distribute a dataset under terms that permit re-use for commercial purposes.

Depositing data in a repository is not simply a matter of transferring the files from your active storage location into the repository. Your data will need to be tidied up, put into order, and documented. When forming the dataset, consider the following:

Identify all the files that will compose it. These might include: raw data files (in the initial collection format); processed data files (e.g. cleaned data; raw data saved to another format; statistical analyses and visualisations); programming code (e.g. analysis scripts); documentation.
Ensure the data are stored in suitable formats for preservation, for example by saving tabular data in an open format such as CSV. You may need to check file format requirements specified by your chosen repository. Guidance is provided on suitable file formats for preservation in the Research Data Archive.
Make sure your data files are well-formed and readable. Poorly-presented data are harder to read, more likely to contain errors, and will inspire less trust. Check the data for errors. Apply consistent style and formatting, and spellcheck your text. Ensure relevant information is clearly presented in data files, e.g. variable names and units of measurement, missing value codes, etc. Present actual values; avoid encoded content, such as formulae in spreadsheets and colour formatting.
Redact data as necessary. Data collected from research participants may need to be anonymised. There is guidance on anonymisation provided by the UK Data Service. Other kinds of information may also need to be removed or obscured, such as commercially-confidential information. Link-coded data, where data records are identified by a unique code which is linked to identifiable participant information held in a separate table, are in data protection law still personal data. For a dataset to be anonymised, and suitable for sharing as open data, you will need to remove any means of linking data records to identifiable participants, e.g. by destroying all documented records of the link, or by replacing linked IDs in the dataset with unlinked IDs.
If the dataset is composed of multiple files, make sure they are organised in a logical fashion. You can upload zip files to some repositories, including the Research Data Archive, which would allow you to organise files within a folder structure.
Use appropriate and consistent file names, which are descriptive of the file contents, formatted without spaces or special characters, and not longer than 32 characters. We provide guidance on file naming.
Check the size of the dataset and make sure it does not exceed any size limitations specified by your chosen data repository. The Research Data Archive allows the deposit of datasets up to 20 GB free of charge and recommends that individual files be no larger than 4 GB. If you have a large dataset and/or a large number of files, it may be easier for both you and prospective users of the data to use an archive format to package/compress the files. Zip and tar.gz are good choices, as they provide lossless compression.
You could ask a colleague to review your dataset. A pair of eyes unfamiliar with the data may spot mistakes and things you have overlooked. Remember that the people reading your data will not have your experience of the research context.
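Some of the checks in this list, namely the file naming and size recommendations, are mechanical and can be run as a script before deposit. The following Python sketch is offered as an illustration rather than an official checker. It rests on several assumptions that are not stated in the guidance: that only letters, digits, hyphens, underscores and full stops count as acceptable characters; that the 32-character limit includes the file extension; and that 1 GB is taken as 10^9 bytes. The function and constant names are hypothetical.

```python
import re
from pathlib import Path

MAX_NAME_LENGTH = 32             # from the file naming recommendation above
MAX_FILE_BYTES = 4 * 10**9       # recommended per-file limit (assumes 1 GB = 10^9 bytes)
MAX_DATASET_BYTES = 20 * 10**9   # free-of-charge dataset limit (same assumption)

# Assumption: these are the only characters treated as "not special".
ALLOWED_NAME = re.compile(r"[A-Za-z0-9._-]+")


def name_problems(name: str) -> list[str]:
    """Return a list of naming problems for a single file name."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    if not ALLOWED_NAME.fullmatch(name):
        problems.append("contains spaces or special characters")
    return problems


def check_dataset(data_dir: Path) -> list[str]:
    """Return human-readable warnings about file names and sizes in data_dir."""
    report = []
    total = 0
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(data_dir).as_posix()
        size = path.stat().st_size
        total += size
        for problem in name_problems(path.name):
            report.append(f"{rel}: file name {problem}")
        if size > MAX_FILE_BYTES:
            report.append(f"{rel}: larger than the recommended 4 GB per file")
    if total > MAX_DATASET_BYTES:
        report.append("dataset exceeds 20 GB in total")
    return report
```

Calling check_dataset(Path("my_dataset")) on a prepared folder returns an empty list when nothing is flagged. A script of this kind cannot judge whether names are descriptive or whether the data are well-formed, so it complements rather than replaces a review by a colleague.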

Every dataset should have at least a basic manual or user guide. This should include the following:

citation metadata for the dataset (creators, title, publication year);
identification of the rights-holder(s) with licence statements;
a brief description of the dataset. This might include summary information about what and how much data were collected, the research context in which they were collected, the purpose for which they were collected, and the instruments and methods used;
information about the project in which data were collected, with any external funding details;
a description of the contents of the dataset, e.g. as a file listing;
key interpretative information, e.g. a full definition of variables and units used, such as a codebook or data dictionary;
details of the methods and instruments used to collect, process and analyse the data, and relevant supporting information, such as analysis scripts;
references to any secondary data sources used;
references to related publications. If a publication is in process, as much information as possible should be provided to enable identification of the published item, e.g. authors, provisional title, journal (if known), year and status (in preparation, under review, in press).

For deposits in the University's Research Data Archive, a README template (txt) is provided, which can be used to record basic documentation. Documentation can be saved in PDF, Word or another text format as preferred.
