Digital preservation policy

Introduction

This document sets out the digital preservation policy of the Environmental Information Data Centre (EIDC), detailing its scope, objectives and drivers. The EIDC is a Natural Environment Research Council (NERC) data centre hosted by the UK Centre for Ecology & Hydrology (UKCEH). The aim is to ensure the longevity of the digital information assets held by the Data Centre in a sustainable way by addressing the factors that risk making them unusable and/or inaccessible.

Scope

The scope of this policy is limited to the Data Centre's data collections. It covers all research data and associated project management information in all formats including:

  • NERC owned data
  • Data generated by NERC funded grants
  • Third party data (data we hold on an in-licensed basis)
  • Data generated by non-NERC funded grants that is deemed to be of significant long-term value to the terrestrial and freshwater scientific community

Objectives

The primary aim of the EIDC is to act as a NERC Data Centre for the terrestrial and freshwater sciences community. The EIDC takes in data resources and assumes responsibility for the long-term preservation and accessibility of data in digital form. This policy outlines the key actions, and the rationale behind them, necessary to ensure that the data held by the EIDC are permanently accessible in a form that is fit for purpose for all the end users of the service provided.

The specific aims of the preservation policy are:

  • provide reliable instances of data resources to the designated user community;
  • maintain the integrity and quality of the data resources;
  • ensure all data resources are protected;
  • ensure the relevant level of information security is applied to each data resource;
  • instil good practice in active preservation management;
  • improve the speed and efficiency with which information is preserved and retrieved;
  • develop and maintain systems of low-cost storage, with appropriate location and with regular review.

Requirements

As the main function of the EIDC is to acquire, maintain and manage related digital resources of value to the terrestrial and freshwater sciences community, and to promote and disseminate these resources as widely and effectively as possible, the EIDC has developed a series of requirements which it strives to ensure are followed as closely as possible:

  • the data resources it accessions are accompanied by sufficient documentation to enable their re-use for analytical and research purposes;
  • the data resources are checked and validated according to strict data and documentation ingestion procedures;
  • the data resources are catalogued according to appropriate metadata standards;
  • the data resources, documentation and metadata are kept in conditions suitable for long-term archival storage;
  • the authenticity, integrity and reliability of data resources preserved for future use are retained;
  • the basic preservation actions undertaken by the EIDC are of a uniformly high standard regardless of the perceived value of any data resource.

As a core activity in the EIDC, preservation does not exist in isolation. It needs to take account of:

  • the aims and objectives of the EIDC;
  • NERC's strategic and operational plans;
  • the NERC Data Policy;
  • the NERC Information Security policy;
  • the needs of the users of the EIDC;
  • archival theory and practice;
  • the place of the EIDC within local, national and international frameworks.

Legal and regulatory framework and other policies

There are numerous legal and regulatory policies that impact on the management of data held by the EIDC. This policy helps the EIDC meet its legislative and accountability requirements and the expectations of its user community. The EIDC must have legal rights to preserve any digital content kept in its archives and will not ingest materials that have unclear ownership or unresolved rights issues.

In preserving its data collection the EIDC follows:

Other policies that cover business information and have an impact on digital preservation include:

The organisation will follow the broad guidance given in standards and best practice guidance to support the level of preservation required.

These include:

Roles and responsibilities

Staff at the EIDC are employed by the UK Centre for Ecology & Hydrology (UKCEH), which is spread over four locations. Accountability for preservation and re-use falls to:

  • UKCEH Head of Environmental Data Science: Owner of the EIDC Digital Preservation function.
  • EIDC manager: Strategic alignment of digital preservation, optimisation of available resources.
  • Environmental Data Science staff: Special expertise in digital preservation, covering the development and implementation of digital preservation policy, strategy and associated workflows.
  • Data Centre Management group: A group of core data centre staff plus the EIDC manager.
  • EIDC Data Centre Operatives: Ingest, store, help deliver and preserve the data, and provide guidance to users.
  • Data creators and depositors (internally and externally): Appropriate actions to safeguard the data they create and deposit, including the use of appropriate file formats and provision of descriptive and technical metadata.
  • Systems and Network Support: Ensure infrastructure used for storing digital records and materials is fit for purpose. Apply general data security.
  • All staff: All EIDC staff are accountable to their line managers for compliance with this policy and with related policies, standards and guidelines.

Model

The EIDC broadly follows the guidance provided in the Open Archival Information System (OAIS) reference model in its approach to long-term preservation. The stages involved in the curation of digital assets are listed below, with the equivalent OAIS functional entities in parentheses:

Identification (Administration)
Preparation (Administration)
Data transfer (Ingest, Archival storage, Data management, Administration)
Access (Access, Administration)
Administration/Housekeeping (Data management, Administration)
Data centre management (Preservation planning)

The EIDC has established a set of defined processes and guidance to manage all stages. An issue tracking system (JIRA), content management system and a bespoke discovery metadata catalogue are used to track and document all work.

Identification

Data identification is used to quickly determine whether or not the data resource is suitable for ingestion by ascertaining:

  • If the data is within the scope of the EIDC's remit
  • If the data is of sufficient quality
  • If the data is documented sufficiently to aid reusability

Depositors are provided with guidance as to what is expected from them including:

The identification stage helps to improve data quality, comprehensibility and accessibility by enforcing minimum standards of quality at the point of deposit. It ultimately helps reduce the time required for the subsequent ingestion stages, as data are submitted at a standard which requires less processing.

Preparation

The preparation stage involves negotiating service agreements (submission agreements) with the depositor.

Data transfer

Data transfer involves the receipt of data and associated metadata from a depositor. This has a close correspondence to the OAIS Submission Information Package (SIP).

At this stage data is examined to ensure it is uncorrupted, complete and what it purports to be. Data and metadata are stored in an appropriate secure place on the Storage Area Network and checksums are generated. This is equivalent to the OAIS Archival Information Package (AIP).
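As an illustration only, the generation of checksums at this stage could be sketched as follows in Python. The choice of SHA-256, the function names and the manifest layout (relative file path mapped to digest) are assumptions made for this example; they do not describe the EIDC's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: Path) -> dict[str, str]:
    """Map each file path (relative to dataset_dir) to its checksum.

    The manifest format is hypothetical and chosen for illustration.
    """
    return {
        p.relative_to(dataset_dir).as_posix(): sha256_of_file(p)
        for p in sorted(dataset_dir.rglob("*"))
        if p.is_file()
    }
```

Reading files in chunks keeps memory use constant regardless of file size, which matters for large datasets.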

Discovery metadata is finalised and published in the EIDC data catalogue which facilitates discovery and access to the data. A Digital Object Identifier (DOI) is then assigned to a data resource. The published metadata is also harvested by additional catalogues, such as the NERC data service (https://data-search.nerc.ac.uk) and Find Open Data (https://data.gov.uk), further promoting the deposited data resources.

The process provides an unbroken audit trail of actions to ensure the authenticity and integrity of data resources.

Access

The access process makes the data resource available to download. This is equivalent to the OAIS Dissemination Information Package (DIP).

The EIDC does not alter data that has been deposited and so in the vast majority of cases the DIP will be identical to the AIP. However, in some cases only part of a dataset is requested – in these cases, DIPs are created by subsetting the AIP. This is done 'on the fly' without altering the AIP in any way.

Usually, the access process follows immediately from the data transfer process. However, occasionally data may be embargoed (subject to the terms agreed with funders) in which case there can be a delay of several months/years between data transfer and access.

Administration and housekeeping

Day to day administration of the data centre ensures that metadata and data are continually monitored and maintained.

Data integrity - The integrity of all data held by the EIDC is ensured by continually monitoring the checksums generated at the data transfer stage. Any change to the data triggers an alert which can then be investigated and rectified by data centre staff.
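A minimal sketch of such a fixity check is given below, in Python. It assumes that a manifest mapping each relative file path to a SHA-256 digest was recorded at the data transfer stage; the hash algorithm, the function names and the report structure are hypothetical and are not taken from the EIDC's systems.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_report(dataset_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare files on disk with the checksums recorded at ingest.

    Returns the files that are missing, unexpected or changed; any
    non-empty list would be grounds for raising an alert.
    """
    current = {
        p.relative_to(dataset_dir).as_posix(): file_checksum(p)
        for p in dataset_dir.rglob("*")
        if p.is_file()
    }
    shared = manifest.keys() & current.keys()
    return {
        "missing": sorted(manifest.keys() - current.keys()),
        "unexpected": sorted(current.keys() - manifest.keys()),
        "changed": sorted(n for n in shared if manifest[n] != current[n]),
    }
```

Run on a schedule, a check of this kind would report an empty result for an unaltered dataset, so any entry in the report can be passed to data centre staff for investigation.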

Version control/change procedures - Data ingested into the EIDC will normally be given a DOI which precludes any alterations/updates being made to the data. If a dataset is later found to be in error, a new version must be deposited with the EIDC. Online access to the erroneous version will be rescinded but it remains available on request.

Withdrawal - The EIDC, in line with NERC, has a minimum retention period of ten years, after which data is periodically reviewed and potentially discarded. However, data which has been given a DOI (the majority of data held by the EIDC) will be kept in perpetuity.
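The retention rule above can be expressed as a small decision function. The Python sketch below is illustrative only: the function name, its parameters and the handling of 29 February are assumptions, and the sketch only indicates when a resource becomes eligible for review, not whether it should be discarded.

```python
from datetime import date

MIN_RETENTION_YEARS = 10  # minimum retention period stated in this policy


def retention_review_due(ingested: date, has_doi: bool, today: date) -> bool:
    """Return True if a data resource has become eligible for retention review.

    Resources with a DOI are kept in perpetuity and are never eligible.
    """
    if has_doi:
        return False
    try:
        earliest = ingested.replace(year=ingested.year + MIN_RETENTION_YEARS)
    except ValueError:
        # Ingested on 29 February and the target year is not a leap year:
        # assume the period ends on 1 March instead.
        earliest = date(ingested.year + MIN_RETENTION_YEARS, 3, 1)
    return today >= earliest
```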

Data centre management

The data centre is managed using a series of defined, documented procedures supported by a task management system (JIRA).

The Data Centre Management group meets monthly to prioritise deposit requests, deal with problems arising in the day-to-day running of the data centre, plan future development priorities, and deal with non-conformance issues and any other issues related to data centre management.

The EIDC manager and the UKCEH Head of Environmental Data Science are responsible for ensuring that the data centre is resourced appropriately.

Preservation planning

Strategy

The preservation strategy of the EIDC aims to maintain a flexible preservation system that can evolve to meet the demands of changing technology and developing user expectations. The EIDC has chosen to implement a preservation strategy based upon open and available file formats. The same ingestion procedure is used for all data resources and no judgement is made on the scholarly value of the datasets once they have been identified as suitable for deposit with the EIDC. All datasets accepted for deposit must be accompanied by supporting documentation of sufficient quality to enable re-use over the long-term.

To reduce the risk of obsolescence, files are only accepted in a non-proprietary format. Each dataset within the preservation system follows a consistent directory structure for storage and this is enforced by automated checks.
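By way of example, an automated check of this kind might look like the Python sketch below. The required directory names and the list of accepted file extensions are hypothetical placeholders, since this policy does not specify them; the function name is likewise an assumption.

```python
from pathlib import Path

# Hypothetical values for illustration; the real structure and the list of
# accepted formats are not defined in this policy.
REQUIRED_DIRS = ("data", "metadata", "supporting-documents")
ACCEPTED_SUFFIXES = {".csv", ".txt", ".json", ".xml", ".nc"}


def check_dataset_layout(dataset_dir: Path) -> list[str]:
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    for name in REQUIRED_DIRS:
        if not (dataset_dir / name).is_dir():
            problems.append(f"missing directory: {name}")
    for path in sorted(dataset_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() not in ACCEPTED_SUFFIXES:
            relative = path.relative_to(dataset_dir).as_posix()
            problems.append(f"format not accepted: {relative}")
    return problems
```

Returning a list of problems, rather than stopping at the first failure, lets a depositor or operative correct everything in a single pass.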

Online storage for the data centre repository is provided by an IBM Storage Area Network (SAN) administered by a dedicated in-house IT infrastructure team. The environmental conditions affecting the storage media are tightly controlled to reduce vulnerability. Data is backed up to Enterprise Tape Libraries continuously. The backups are managed using IBM Tivoli Storage Manager (TSM). A copy of the archive is kept securely off site and forms a key component of the EIDC's disaster recovery and business continuity procedures, providing for rapid recovery of data and infrastructure under commonly anticipated threats (e.g. technical failure, human error). The system also ensures the safety of the data in the event of a more serious incident if, for example, the buildings housing the data centre and/or major IT infrastructure were to be rendered inoperable.

Monitoring and review

The preservation policy of the EIDC is monitored and reviewed in light of changing technologies on an annual basis to ensure timely updates. The EIDC Manager initiates the review process in association with the Environmental Data Science staff. Implementation of the preservation policy is monitored via the EIDC's non-conformance process. This policy is available to all staff and members of the public via the EIDC website. Queries concerning the preservation policy should be directed to eidc@ceh.ac.uk

IT Architecture

The preservation of the EIDC's data relies on an IT infrastructure that is fit for purpose and is continually monitored and periodically reviewed to ensure timely upgrades in both hardware and software.

In order to ensure resilience, the preservation system consists of on-site and off-site storage. Adequate storage capacity for all holdings is maintained. All servers in the Archive are protected by power surge protection systems. Disaster recovery procedures are in place.

Network security

The EIDC is committed to taking all necessary precautions to ensure the physical safety and security of all data resources it preserves. The repository rooms are equipped with a security-protected card access system.

All data are available as Microsoft Windows file shares in an Active Directory managed by the UKCEH. Access to the network data storage area is restricted to the minimum viable number of staff and ability to access/view/move data resources held by the EIDC is controlled via permissions. Access to the SAN and file shares is controlled with Check Point Firewalls and external access is only available via a Secured VPN with multi-factor authentication using digital tokens. In addition, Carbon Black Protection is used to monitor the SAN and all endpoints; it blocks any malicious activity that could potentially corrupt or damage data.

Cooperation

The EIDC has established productive working relationships with the other four NERC data centres (British Oceanographic Data Centre, Centre for Environmental Data Analysis, National Geoscience Data Centre and Polar Data Centre), including liaison in determining the most appropriate data centre for resource curation.

Funding and resource planning

The EIDC is and always has been dependent on funding from NERC to carry out its activities. The EIDC is currently funded via the NERC Data Centre Commissioning project which began in April 2018 and runs for five years.

Resource management for preservation of digital resources includes:

  • technical infrastructure, including equipment purchases, maintenance and upgrades, software/hardware obsolescence monitoring, network connectivity etc.
  • financial plan, including strategy and financing the EIDC and commitment to long-term funding
  • staffing infrastructure, including recruitment, induction and ongoing staff training

The EIDC makes every effort to remain up-to-date with any relevant technological advance to ensure continued access to its collection. The EIDC also implements a programme of continual improvement in how users interact with the data centre, for example, improved deposit and request functions for users.
