Prepare data for deposit & sharing
What data to keep?
Data can be deposited and archived locally or shared in a public data repository. Because archiving can be costly and storage space may be limited, researchers should carefully identify which data to preserve. Consider the following:
- Do the data support published research?
- Are the data likely to be reused?
- Are the data unique or historically significant?
- Are there funder or institutional requirements?
- Are the data difficult to reproduce?
- Are there any ethical issues to consider?
- Are the data in support of a patent application?
Further reading: Examples of data that should be kept by discipline (Stanford University).
Best practices when preparing data for deposit and sharing
File formats
Choose file formats suitable for long-term storage, preferably non-proprietary formats, to overcome access issues caused by software obsolescence. See: Best practices for file formats and Recommended formats for sharing, reuse, and preservation.
Documentation
Add documentation alongside your data to make the data understandable and reusable. See: Data documentation and description.
Ownership and privacy
If sharing data, make sure that:
- You or your organization own the data. See: Data licences.
- All ethical requirements are followed. See: Collect data ethically and Research with Indigenous communities.
Data integrity
If keeping a local copy, guard against bit rot (the gradual corruption of data on storage media) through refreshment (copy the data to a new drive every 2 to 5 years) and replication (maintain 3 copies of the data, on 2 different forms of storage, with 1 copy in an external location).
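One way to confirm that a refreshed or replicated copy still matches the original is to compare checksums of the files in each copy. The guidance above does not prescribe a tool, so the following is only a minimal sketch using Python's standard library; the function names (sha256_of, build_manifest, compare_copies) are hypothetical and chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original: Path, copy: Path) -> list:
    """List relative paths that are missing from, or differ between, two copies."""
    a, b = build_manifest(original), build_manifest(copy)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling compare_copies(Path("primary"), Path("backup")) on two hypothetical folders returns an empty list when every file matches. Any path it does return is a file to investigate before the older copy is retired.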
Preparing sensitive data for sharing
Consent forms are key to data sharing
Some data cannot be shared for legal or ethical reasons. However, if sharing the dataset is required, ensure that the intention to share has been stated in consent forms and cleared with the Research Ethics Unit. Find out more about collecting data ethically.
De-identification allows sharing of sensitive data
De-identification is the process used to remove identifying data. Identifiers can be direct, which point directly to an individual, or indirect, which point to an individual when combined with other data.
Examples of direct and indirect identifiers
De-identification guidance
- De-identification guide (Portage Network)
- De-identification guidelines for structured data (Information and Privacy Commissioner of Ontario)
Methods of data de-identification
- Anonymization (removing identifiers altogether)
- Pseudonymization (replacing identifiers with pseudonyms or other identifiers)
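The two methods above can be sketched on a single record. This is an illustration under stated assumptions, not a complete de-identification procedure: the field names (name, email, participant_id) and the functions anonymize and pseudonymize are hypothetical, and the keyed hash (HMAC-SHA-256) is just one possible way to generate pseudonyms. Removing direct identifiers does not by itself deal with indirect identifiers, which can still point to an individual when combined with other data.

```python
import hashlib
import hmac

# Hypothetical direct identifiers; adjust to the dataset at hand.
DIRECT_IDENTIFIERS = {"name", "email"}


def anonymize(record: dict) -> dict:
    """Anonymization step: drop direct identifiers altogether."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def pseudonymize(record: dict, id_field: str, secret_key: bytes) -> dict:
    """Pseudonymization step: replace one identifier with a keyed-hash token.

    The same input and key always give the same token, so records about
    one participant stay linkable without exposing the original value.
    """
    value = str(record[id_field]).encode("utf-8")
    token = hmac.new(secret_key, value, hashlib.sha256).hexdigest()[:16]
    out = dict(record)  # leave the caller's record unchanged
    out[id_field] = token
    return out


record = {
    "participant_id": "P001",
    "name": "Example Person",
    "email": "person@example.org",
    "age_group": "30-39",
}
shareable = pseudonymize(anonymize(record), "participant_id", b"keep-this-key-private")
```

In this sketch the secret key must be stored separately from the shared data, since anyone holding it could recompute the tokens from known identifiers.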
Protecting sensitive species data
- Current best practices for generalizing sensitive species occurrence data (Global Biodiversity Information Facility)
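One common form of generalization is to reduce the precision of occurrence coordinates before sharing. The sketch below illustrates that idea only and is not a restatement of the linked guidance. The function name generalize_occurrence is hypothetical, the record keys are assumed to follow Darwin Core naming (decimalLatitude, decimalLongitude, dataGeneralizations), and the appropriate precision depends on the species and the risk involved.

```python
def generalize_occurrence(record: dict, decimals: int = 1) -> dict:
    """Return a copy of the record with coordinates rounded to fewer decimals.

    The change is noted in the record so that data users know the
    coordinates were deliberately generalized.
    """
    out = dict(record)  # leave the caller's record unchanged
    out["decimalLatitude"] = round(float(record["decimalLatitude"]), decimals)
    out["decimalLongitude"] = round(float(record["decimalLongitude"]), decimals)
    out["dataGeneralizations"] = (
        f"Coordinates rounded to {decimals} decimal place(s)"
    )
    return out


# Hypothetical occurrence record, for illustration only.
occurrence = {
    "scientificName": "Example species",
    "decimalLatitude": 45.497312,
    "decimalLongitude": -73.578955,
}
shared = generalize_occurrence(occurrence, decimals=1)
```

Recording what was done lets users of the shared data judge whether the reduced precision suits their analysis, while the full-precision original is kept under restricted access.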
See also:
- Can I share my data? Decision tree (Portage Network)
- Data Deposit & Access section of the Human Participant Research Data Risk Matrix (p. 8) (Sensitive Data Expert Group of the Portage Network)
- Anonymisation: managing data protection risk code of practice (UK Information Commissioner's Office)
- Anonymisation (UK Data Service)
- McGill Data Anonymization Workshop Series 2023: recordings and slides from a workshop series providing theoretical and practical knowledge about data anonymization and de-identification of sensitive data to promote and facilitate data deposit and data sharing.
Help and resources
Research data management consultations are available for Concordia faculty, students, and staff. Find out more about how librarians on the Library's RDM team can provide guidance. This service is part of Concordia's Institutional Research Data Management Strategy.