Preparing your datasets for deposit

What is a dataset?

A dataset is a structured collection of data organised and stored together. Data within a dataset are typically related in some way. It can include different types or formats of data and may consist of one or many files.

To decide if you have one dataset or many, think about how you will describe the data and what the supporting documentation will consist of. If the metadata would be similar or identical for each candidate dataset, you probably have a single dataset. For example, if you have repeated the same experiment/monitoring event at several locations, this is likely to be one dataset. Similarly, if you have carried out the same experiment/monitoring event over several years, this too could be one dataset, rather than several.

As part of the deposit process we will agree with you the format and structure of your data and a handover date. Ensuring that your data are correct and well-formatted will help to speed up the process.

Format

Data provided to the EIDC should normally be in a non-proprietary format (e.g. CSV files rather than an Excel workbook).

We maintain a list of acceptable formats; however, the list is not exhaustive and we will consider other formats on a case-by-case basis.
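
For example, if your data are currently held in an Excel workbook, each worksheet can be exported as a CSV file. A minimal sketch in Python, assuming the pandas library (with openpyxl for reading .xlsx) is available; the file and sheet names are hypothetical:

  import pandas as pd

  # Read one worksheet from a proprietary Excel workbook...
  df = pd.read_excel("windermere_chemistry.xlsx", sheet_name="2011_survey")

  # ...and write it out as a plain, non-proprietary CSV file
  df.to_csv("windermere_chemistry_2011.csv", index=False, encoding="utf-8")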

Filenames

  • Try to keep filenames short
  • Do not use spaces. Replace them with underscores or hyphens (e.g. windermere_chemistry_data, windermere-chem-data) or use camel case (e.g. windermereChemistryData)
  • Other than hyphens and underscores, avoid non-alphanumeric characters ($*@%"\/?*<>#^¾ etc.); a filename-cleaning sketch follows the examples below
  • If possible, filenames should be meaningful and reflect the content
  • If you have multiple, related files, be consistent and use a naming convention

Examples

1486Xiuytr.csv
  This doesn't tell us anything about the data

Site location data from the UK Butterfly Monitoring Scheme collected during 2011.csv
  This is very long and contains spaces

ukbmsLocationData2011.csv
  This is descriptive, short and contains no spaces or special characters
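
If you need to tidy up existing filenames in bulk, a small script can apply these rules consistently. A minimal sketch in Python; the example filename is hypothetical and the rules simply mirror the guidance above:

  import re

  def clean_filename(name: str) -> str:
      # Split off the extension, replace spaces with underscores and
      # drop anything other than letters, digits, underscores and hyphens
      stem, dot, ext = name.rpartition(".")
      stem = re.sub(r"\s+", "_", stem.strip())      # spaces -> underscores
      stem = re.sub(r"[^A-Za-z0-9_-]", "", stem)    # drop special characters
      return f"{stem}{dot}{ext}"

  print(clean_filename("Site location data (2011).csv"))  # Site_location_data_2011.csv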

Variables

  • Variable names should be unique, short and (preferably) meaningful.
  • Avoid spaces and special characters (e.g. -, $, *, @, /) in variable names. Best practice is to use only alphanumeric characters and underscores (_); a renaming sketch follows the examples below.
  • Remove any variables which are not important for re-using the data (e.g. created for admin or internal purposes).

Examples

Sample ID
  Contains a space; use Sample_ID instead

Count of individual perch
  Contains spaces and is unnecessarily long; use Perch_count instead

Binomial/Latin_name
  Contains a non-standard character (/); use Binomial_name instead

Soil temperature ° C
  Contains spaces and a non-standard character (°); use Soil_temp instead
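
A small renaming step can strip spaces and special characters from existing variable names (shortening overly long names is still a manual decision). A minimal sketch in Python, assuming pandas; the column names are the examples above:

  import re
  import pandas as pd

  def clean_variable_name(name: str) -> str:
      name = re.sub(r"[^A-Za-z0-9_\s]", "", name)   # drop special characters
      return re.sub(r"\s+", "_", name.strip())      # spaces -> underscores

  df = pd.DataFrame(columns=["Sample ID", "Count of individual perch", "Soil temperature ° C"])
  df = df.rename(columns=clean_variable_name)
  print(list(df.columns))  # ['Sample_ID', 'Count_of_individual_perch', 'Soil_temperature_C']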

Codelists and abbreviations

Using codes and abbreviations in the data is often very useful. However, if you do use them you must ensure:

  • they are unique (within the dataset) and used consistently
  • they are all described in the accompanying metadata
  • any explanations you provide in the metadata are actually applicable to the data. For example, do not state "T = trace" in the metadata if the code T never occurs in the data.
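
A quick cross-check between the data and the documented codelist helps catch both problems. A minimal sketch in Python, assuming pandas; the codelist, file name and column name are hypothetical:

  import pandas as pd

  codelist = {"T": "trace", "P": "present", "A": "absent"}   # as described in the metadata

  df = pd.read_csv("ukbmsLocationData2011.csv")
  codes_in_data = set(df["abundance_code"].dropna().unique())

  undocumented = codes_in_data - codelist.keys()   # used in the data but not described
  unused = set(codelist) - codes_in_data           # described but never used (e.g. "T = trace")

  print("Codes missing from the metadata:", undocumented)
  print("Codes documented but absent from the data:", unused)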

Missing data/nulls

  • It is preferable to identify nulls or missing data as blanks.  However, depending on the format of the data this isn't always possible.  Alternative ways of identifying nulls are to use codes such as NaN or N/A.
  • Numerical values such as -999999 may also be acceptable but should be avoided if possible.
  • Zeros (0) should NEVER be used to identify nulls as zero is a meaningful data value.
  • Whatever method you use to identify nulls, it should be applied consistently throughout the dataset and must be documented in the accompanying metadata.
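
A minimal sketch in Python, assuming pandas, showing one way to handle nulls consistently; the file name, column name and null markers are hypothetical examples:

  import pandas as pd

  # Treat the agreed null markers consistently when reading the data
  df = pd.read_csv("soil_temperatures.csv", na_values=["NA", "NaN", "-999999"])

  # Zeros are left untouched: 0 is a meaningful value, not a null
  print(df["Soil_temp"].isna().sum(), "missing values detected")

  # When writing out, represent nulls as blanks (the preferred option)
  df.to_csv("soil_temperatures_clean.csv", index=False, na_rep="")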

Tabular data

Structure

  • We normally expect tabular data to be formatted with variables (mass, temperature, concentration, etc.) arranged in columns and observations in rows.

Headings

  • Variable names should be in the first row (and only the first row). Data should follow in row 2.
  • Remove superfluous information in heading rows (a sketch for skipping such rows follows the illustrations below).

 

[Illustration: a badly structured CSV]

 

[Illustration: a well-structured CSV]
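
If a file arrives with extra descriptive rows above the real column headings, they can be skipped when reading the data so that the cleaned file has variable names in row 1 and data from row 2. A minimal sketch in Python, assuming pandas; the file names and the number of rows to skip are hypothetical:

  import pandas as pd

  # Skip the two superfluous rows above the real column headings...
  df = pd.read_csv("windermere_chemistry_2011_raw.csv", skiprows=2)

  # ...and write a clean file with headings in row 1 and data from row 2
  df.to_csv("windermere_chemistry_2011.csv", index=False)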

Multiple tables

  • Never include more than one table within a single spreadsheet; doing so makes it far more difficult for a machine to read the data.
  • Each table should be separated into its own file (a sketch follows the illustration below).

 

[Illustration: a badly structured CSV]
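
If your tables currently sit in separate worksheets of one workbook, each can be written out to its own file. A minimal sketch in Python, assuming pandas (with openpyxl); the workbook name is hypothetical:

  import pandas as pd

  # Read every worksheet (sheet_name=None returns a dict of DataFrames)...
  sheets = pd.read_excel("survey_results.xlsx", sheet_name=None)

  # ...and write each table to its own CSV file
  for name, table in sheets.items():
      table.to_csv(f"survey_results_{name}.csv", index=False)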

Anonymity and data security

  • Ensure that data are anonymised where needed and cannot be linked to any identifiable person
  • Consider anonymising site location data where this is necessary for the safety of the site, equipment or future research
  • Where data are derived from existing data, check if permission needs to be obtained from the data owner

Quality

  • When converting data for deposit, ensure that all data and metadata are correct after conversion
  • Confirm that the level of detail in the data is consistent with the stated access and licensing agreements
  • Complete all internal consistency checks BEFORE offering your data for deposit
  • Resolve any data issues and ensure data are complete BEFORE deposit, to minimise the risk of further deposit(s) being necessary
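
A minimal sketch in Python, assuming pandas, of the kind of internal consistency checks that can be scripted before deposit; the file, columns and plausible ranges are hypothetical examples:

  import pandas as pd

  df = pd.read_csv("windermere_chemistry_2011.csv")

  assert df["Sample_ID"].notna().all(), "missing sample identifiers"
  assert df["Sample_ID"].is_unique, "duplicate sample identifiers found"
  assert df["Soil_temp"].dropna().between(-20, 50).all(), "soil temperature outside plausible range"
  print("Basic consistency checks passed")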

If you have any queries or are unsure about the suitability of your dataset(s) for deposit, we'll be happy to discuss this with you. Please contact us.