Open data formats
When depositing your data, the format used should ideally be an open, non-proprietary format in common usage by the research community. This is to ensure that the data is available to the widest possible audience without the need for access to restrictive software.
Data that is stored in a proprietary format should be converted to an open format. The format chosen for the conversion will depend on the original file format (see table below). For example, the preferred format for depositing tabular data is comma-separated values (.csv). This format has proven to be robust and future-proof, and it can be read by a wide variety of common software tools and applications.
Proprietary formats may be considered if there is no alternative or if conversion would result in data loss. These will be dealt with on a case-by-case basis, so if in doubt, just ask us.
Preferred formats
The following are our preferred formats for deposit:
Data type | Preferred format |
---|---|
Tabular data | Comma-separated (CSV) or tab-delimited (TAB) |
Database tables | CSV, TAB |
Spatial raster data | GeoTIFF |
Spatial vector data | Geopackage, SpatiaLite |
Images | PNG, JPEG |
Movies | MPEG, MP4, MOV, AVI |
Sound | MP3, WAV |
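Before depositing tabular data, it can help to confirm that a CSV or tab-delimited file parses cleanly. The following is a minimal sketch, not an EIDC tool: the function name `check_csv` and the rule it applies (every row has the same number of fields as the header) are assumptions made for illustration.

```python
import csv
import io


def check_csv(text, delimiter=","):
    """Return (header, problems) for delimited text.

    `problems` is a list of (line_number, field_count) pairs for rows
    whose field count differs from the header's. Blank lines are
    reported as rows with zero fields.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        # Empty input: there is no header row to compare against.
        return [], [(0, 0)]
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
    return header, problems
```

Pass `delimiter="\t"` to check a tab-delimited file. An empty `problems` list only shows that the file is structurally consistent; it says nothing about whether the values themselves are correct.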
Recommended open format conversions
The table below outlines common data formats and the conversion we recommend when depositing data into the EIDC. Please note:
- This list is periodically updated and is not exhaustive
- If a format is not listed, it does not mean we will not accept it
- If conversion of data would result in data loss we may accept proprietary formats
If in doubt, please contact us for advice.
Original format | Format for deposit | Notes |
---|---|---|
comma separated values (csv) | No change required | To be acceptable, files must open in commonly available software that reads CSV (e.g. OpenOffice Calc, MS Excel, Numbers) |
tab delimited values (e.g. .txt, .tab, .dat) | No change required | Must open in commonly available software e.g. Notepad++ |
text (.txt) | No change required | |
Spreadsheets (e.g. Excel, Numbers) | Convert to csv file(s) | |
Database tables (e.g. Access, Oracle, SQL) | Convert to csv file(s) | |
Geopackage | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS |
SpatiaLite | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS |
Shapefile | No change is required if shapefile is the original format. However, if the original data is not a shapefile we do not recommend converting data into this format for deposit (see notes below). | Must open in commonly used GIS e.g. QGIS, ArcGIS |
kml | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS |
GeoJSON | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS |
File geodatabase (.gdb folder) | Convert to Geopackage or SpatiaLite | |
Personal geodatabase (.mdb) | Convert to Geopackage or SpatiaLite | |
GeoTiff data | No change required | Must open in common GIS e.g. QGIS, ArcGIS. Some raster datasets may be accompanied by files containing ancillary information that some software applications can use. Examples include *.hdr and *.vtr |
Esri Grid (ARC/INFO grid) | Convert to GeoTiff | |
ASCII grid | Conversion to GeoTiff is preferred. However, we will accept original data files | Must open in common GIS e.g. QGIS, ArcGIS |
.R | No change required | We will also accept other R file formats accompanying R code (e.g., .rdm, .rds) |
.rds | No change required but consider converting to .csv if it does not come with accompanying R code and contains only tabular data | |
NetCDF | No change required | Must open in two netCDF-capable applications without additional transformation |
SAS | Convert to csv file(s) | |
miniSEED | No change required | |
Minitab | Convert to csv file(s) | |
NASA Ames | Convert to netCDF or csv file(s) | |
MATLAB binary file | Convert to csv file(s) | |
STL | No change required | Must open in commonly available mesh rendering software e.g. Meshlab |
FASTA | No change required | We recommend that sequence data be deposited in a specialist repository. During the deposit process, we will advise you on the best place to store this data. |
WEAP | No change required for model files | Must open in freely available WEAP software. Documentation must be provided, specifying how to access WEAP software and make users aware of the licensing terms under which it is available. If at any time, the EIDC becomes aware that WEAP software is not freely available to run existing model files, WEAP resources will be deprecated. |
PLINK | No change required | Ensure any supporting documentation links to the PLINK source and cites it appropriately (see https://www.cog-genomics.org/plink/1.9/general_usage#cite) |
Digital terrain elevation data/DTED1 (.dt1) | No change required BUT see notes | Must be deposited with a script to convert it to a non-proprietary format e.g. ASCII or NetCDF. Can be opened using NetCDF or ArcGIS |
LAS (.las) LiDAR point cloud data or .LAZ | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS. .LAZ (LASzip) is a compressed version of .LAS |
Apache Parquet | No change required if data volumes are large | Contact us to discuss your requirements |
Nexus (.nex or .nxs) | No change required | The extensible NEXUS file format is widely used in bioinformatics. It stores information about taxa, morphological and molecular characters, distances, genetic codes, assumptions, sets, trees, etc. It is an open format |
Variant Call Format (.vcf) | No change required | .vcf is an open standard text file format that opens in Excel and basic text editors |
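As an illustration of the "Convert to csv file(s)" recommendation for database tables, the sketch below exports one table to a CSV file using only the Python standard library. It assumes a SQLite database purely because `sqlite3` ships with Python; Access, Oracle and other systems need their own drivers. The function name `export_table_to_csv` is hypothetical.

```python
import csv
import sqlite3


def export_table_to_csv(conn, table, out_path):
    """Write every row of `table` to `out_path` as CSV with a header row.

    Returns the list of column names that was written as the header.
    """
    # Quote the identifier so unusual table names cannot break the query.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    columns = [col[0] for col in cursor.description]
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        # NULL values arrive as None and are written as empty fields.
        writer.writerows(cursor)
    return columns
```

Because NULLs become empty fields, the supporting documentation should say how missing values are represented. Exporting each table to its own file matches the "csv file(s)" wording in the table above.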
Notes on shapefiles
Shapefiles have a number of limitations:
- They do not support NULL values. Nulls may be represented as zeros, which is very problematic for quantitative data
- The maximum length of attribute names is 10 characters so longer names will be truncated
- The maximum number of attributes is 255
- Floating-point numbers are stored as text and may contain rounding errors
- The file size cannot exceed 2 GB
Although they are not ideal, if your original data is generated and stored as shapefile(s) we will accept them. However, we do not recommend converting data into shapefiles. If you need to convert spatial vector data, consider the SpatiaLite format.
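To see why the 10-character limit on attribute names matters, the sketch below reports which names would be shortened and which would become indistinguishable afterwards. It assumes plain truncation to the first 10 characters; real GIS software may resolve clashes differently (for example by adding a numeric suffix), and the function name `truncated_name_problems` is hypothetical.

```python
def truncated_name_problems(names, limit=10):
    """Report attribute names affected by a length limit.

    Returns (truncated, collisions):
      truncated  maps each over-long name to its first `limit` characters
      collisions maps each shortened form to the original names that
                 share it, when more than one name does
    """
    truncated = {name: name[:limit] for name in names if len(name) > limit}
    seen = {}
    for name in names:
        seen.setdefault(name[:limit], []).append(name)
    collisions = {short: full for short, full in seen.items() if len(full) > 1}
    return truncated, collisions
```

For example, `mean_temperature_2019` and `mean_temperature_2020` both shorten to `mean_tempe`, so the two columns could no longer be told apart by name.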
Supporting documentation
It is EIDC policy that supporting documentation will be made available with the data as separate, linked document(s).
One of the main reasons for separating supporting documentation from data is that the EIDC is committed to a programme of review and improvement of metadata in order to make resources easier to find and easier to re-use. The data, conversely, must remain unchanged. Providing supporting documentation separately from data also permits users to make an informed decision about whether the data resource meets their requirements prior to actually downloading a copy of the data itself.
Original format | Preferred format for deposit | Notes |
---|---|---|
Rich-text documentation (e.g. Microsoft Word (doc, docx), Apple Pages (.pages), OpenOffice (odt)) | .docx, .odt | |
Portable Document Format (pdf) | .docx, .odt | Metadata in pdf format cannot easily be maintained, therefore it is not our preferred choice. However, if there are no options to convert the pdf, we will accept documents in that format. |
xls, xlsx, csv, etc | csv | CSV files provide high maintainability, longevity and ease of access. |
Plain text | txt | Text files' limitations (i.e. lack of formatting) mean that they are rarely the best option for providing good quality, readable documentation. However, their advantages (small file size, longevity and ease of access) mean that in some instances they are a highly appropriate format. |
If your supporting documentation is in a format other than those listed above, please contact us for advice.