Suitable formats for data and supporting documentation

Open data formats

When depositing your data, the format used should ideally be an open, non-proprietary format in common usage by the research community. This is to ensure that the data is available to the widest possible audience without the need for access to restrictive software.

Data that is stored in a proprietary format should be converted to an open format. The format chosen for the conversion will depend on the original file format (see table below). For example, the preferred format for depositing tabular data is comma-separated values (.csv). This format has proven to be robust and future-proof, and it can be used in a wide variety of common software tools and applications.
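As a rough illustration of what a well-formed CSV file looks like in practice, the sketch below uses Python's standard csv module to check that every row has the same number of columns as the header row. The function name check_csv_shape and the sample data are hypothetical and are not part of any EIDC tooling; passing this check does not by itself mean a file will be accepted.

```python
import csv
import io


def check_csv_shape(text):
    """Return (n_columns, bad_lines) for a string of CSV text.

    bad_lines holds the line numbers (as reported by csv.reader.line_num)
    of rows whose column count differs from the header row.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    bad_lines = []
    for row in reader:
        if len(row) != len(header):
            bad_lines.append(reader.line_num)
    return len(header), bad_lines


# Hypothetical sample: the last row is missing its temperature value.
sample = "site,date,temperature_c\nA1,2020-01-01,4.2\nA2,2020-01-02\n"
n_columns, bad_lines = check_csv_shape(sample)
print(n_columns, bad_lines)
```

For a file on disk, the same function could be given the result of reading the file as text, for example with open(path, newline="", encoding="utf-8").read().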

Proprietary formats may be considered if there is no alternative or if conversion would result in data loss. These will be dealt with on a case-by-case basis, so if in doubt, just ask us.

Preferred formats

The following are our preferred formats for deposit:

Tabular data | Comma-separated (CSV) or tab-delimited (TAB)
Database tables | CSV, TAB
Spatial raster data | GeoTIFF
Spatial vector data | Geopackage, SpatiaLite
Images | PNG, JPEG
Movies | MPEG, MP4, MOV, AVI
Sound | MP3, WAV

Recommended open format conversions

The table below outlines common data formats and the conversion we recommend when depositing data into the EIDC. Please note:

  • This list is periodically updated and is not exhaustive
  • If a format is not listed, it does not mean we will not accept it
  • If conversion of data would result in data loss we may accept proprietary formats

If in doubt, please contact us for advice.

Original format | Format for deposit | Notes
Comma-separated values (csv) | No change required | To be acceptable, files must open in commonly available software that reads CSV (e.g. OpenOffice Calc, MS Excel, Numbers)
Tab-delimited values (e.g. .txt, .tab, .dat) | No change required | Must open in commonly available software e.g. Notepad++
Text (.txt) | No change required |
Spreadsheets (e.g. Excel, Pages) | Convert to csv file(s) |
Database tables (e.g. Access, Oracle, SQL) | Convert to csv file(s) |
Geopackage | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS
SpatiaLite | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS
Shapefile | No change is required if shapefile is the original format. However, if the original data is not a shapefile we do not recommend converting data into this format for deposit (see notes below). | Must open in commonly used GIS e.g. QGIS, ArcGIS
kml | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS
GeoJSON | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS
File geodatabase (.gdb folder) | Convert to Geopackage or SpatiaLite |
Personal geodatabase (.mdb) | Convert to Geopackage or SpatiaLite |
GeoTiff data | No change required | Must open in common GIS e.g. QGIS, ArcGIS. Some raster datasets may be accompanied by files containing ancillary information that some software applications can use. Examples include *.hdr and *.vtr
Esri Grid (ARC/INFO grid) | Convert to GeoTiff |
ASCII grid | Conversion to GeoTiff is preferred. However, we will accept original data files | Must open in common GIS e.g. QGIS, ArcGIS
.R | No change required | We will also accept other R file formats accompanying R code (e.g., .rdm, .rds)
.rds | No change required but consider converting to .csv if it does not come with accompanying R code and contains only tabular data |
NetCDF | No change required | Must open in two netCDF-capable applications without additional transformation
SAS | Convert to csv file(s) |
miniSEED | No change required |
Minitab | Convert to csv file(s) |
NASA Ames | Convert to netCDF or csv file(s) |
MATLAB binary file | Convert to csv file(s) |
STL | No change required | Must open in commonly available mesh rendering software e.g. Meshlab
FASTA | No change required | We recommend that sequence data be deposited in a specialist repository. During the deposit process, we will advise you on the best place to store this data.
WEAP | No change required for model files | Must open in freely available WEAP software. Documentation must be provided, specifying how to access WEAP software and making users aware of the licensing terms under which it is available. If at any time the EIDC becomes aware that WEAP software is not freely available to run existing model files, WEAP resources will be deprecated.
PLINK | No change required | Ensure any supporting documentation links to the PLINK source and cites it appropriately (see https://www.cog-genomics.org/plink/1.9/general_usage#cite)
Digital terrain elevation data/DTED1 (.dt1) | No change required BUT see notes | Must be deposited with a script to convert it to a non-proprietary format e.g. ASCII or NetCDF. Can be opened using NetCDF or ArcGIS
LAS (.las) LiDAR point cloud data, or .LAZ | No change required | Must open in commonly used GIS e.g. QGIS, ArcGIS. .LAZ (LASzip) is a compressed version of .LAS
Apache Parquet | No change required if data volumes are large | Contact us to discuss your requirements
Nexus (.nex or .nxs) | No change required | The extensible NEXUS file format is widely used in bioinformatics. It stores information about taxa, morphological and molecular characters, distances, genetic codes, assumptions, sets, trees, etc. It is an open format
Variant Call Format (.vcf) | No change required | .vcf is an open standard text file format that opens in Excel and basic text editors
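
The recommendation to convert database tables to csv file(s) can be sketched as follows. This is a minimal example using Python's standard sqlite3 and csv modules with an in-memory SQLite database standing in for the source database; the table name, column names and the helper table_to_csv are hypothetical. Other database systems would need their own connection library, and this is not an EIDC-provided tool.

```python
import csv
import io
import sqlite3


def table_to_csv(conn, table, out):
    """Write every row of `table` to the file-like object `out` as CSV."""
    # Quote the identifier so unusual table names are handled safely.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out, lineterminator="\n")
    # Header row comes from the column names reported by the cursor.
    writer.writerow([col[0] for col in cursor.description])
    # NULL values arrive as None, which csv.writer writes as empty fields.
    writer.writerows(cursor)


# Hypothetical example data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, depth_m REAL, note TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("A1", 0.5, "clear"), ("A2", 1.25, None)],
)

buffer = io.StringIO()
table_to_csv(conn, "samples", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, `out` could be a file opened with open("samples.csv", "w", newline="", encoding="utf-8"). Because NULLs become empty fields in this sketch, it is worth stating in the supporting documentation how missing values are represented.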

Notes on shapefiles

Shapefiles have a number of limitations:

  • They do not support NULL values. Nulls may be represented as zeros which is very problematic for quantitative data
  • The maximum length of attribute names is 10 characters so longer names will be truncated
  • The maximum number of attributes is 255
  • Floating-point numbers are stored as text and may contain rounding errors
  • The file size cannot exceed 2GB
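
To see how the attribute limits above might affect a particular dataset, the following sketch checks a list of attribute names against the 10-character and 255-attribute limits. The function name shapefile_warnings and the example names are hypothetical, and the sketch assumes names are simply cut to their first 10 characters; real GIS software may shorten or rename attributes differently.

```python
MAX_NAME_LENGTH = 10   # longer attribute names are truncated
MAX_ATTRIBUTES = 255   # maximum number of attributes


def shapefile_warnings(attribute_names):
    """Return a list of warning strings for the given attribute names."""
    warnings = []
    if len(attribute_names) > MAX_ATTRIBUTES:
        warnings.append(
            f"{len(attribute_names)} attributes exceeds the limit of {MAX_ATTRIBUTES}"
        )
    by_short_name = {}
    for name in attribute_names:
        # Assumption: truncation keeps only the first 10 characters.
        short = name[:MAX_NAME_LENGTH]
        if short != name:
            warnings.append(f"'{name}' would be truncated to '{short}'")
        by_short_name.setdefault(short, []).append(name)
    # Names that become identical after truncation can no longer be told apart.
    for short, names in by_short_name.items():
        if len(names) > 1:
            warnings.append(f"{names} would all collide as '{short}'")
    return warnings


# Hypothetical attribute names from a vector dataset.
issues = shapefile_warnings(
    ["site_id", "mean_temperature_2019", "mean_temperature_2020"]
)
for issue in issues:
    print(issue)
```

In this example the two temperature attributes both shorten to the same 10-character name, which illustrates why converting data with long attribute names into shapefiles can lose information.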

Although they are not ideal, if your original data is generated and stored as shapefile(s) we will accept them. However, we do not recommend converting data into shapefiles. If you need to convert spatial vector data, consider the SpatiaLite format.

Supporting documentation

It is EIDC policy that supporting documentation will be made available with the data as separate, linked document(s).

One of the main reasons for separating supporting documentation from data is that the EIDC is committed to a programme of review and improvement of metadata in order to make resources easier to find and easier to re-use. The data, conversely, must remain unchanged. Providing supporting documentation separately from data also permits users to make an informed decision about whether the data resource meets their requirements prior to actually downloading a copy of the data itself.

Original format | Preferred format for deposit | Notes
Rich-text documentation (e.g. Microsoft Word (doc, docx), Apple Pages (.pages), OpenOffice (odt)) | .docx, .odt |
Portable Document Format (pdf) | .docx, .odt | Metadata in pdf format cannot easily be maintained, therefore it is not our preferred choice. However, if there are no options to convert the pdf, we will accept documents in that format.
xls, xlsx, csv, etc | csv | csvs provide high maintainability, longevity and ease of access.
Plain text | txt | Text files' limitations (i.e. lack of formatting) mean that they are rarely the best option for providing good quality, readable documentation. However, their advantages (small file size, longevity and ease of access) mean that in some instances they are a highly appropriate format.

If your supporting documentation is in a format other than those listed above, please contact us for advice.