Ngā hōputu kōnae mō ngā whakawhiti matihiko
File formats for digital transfers
As you know, we are now accepting digital transfers. However, as your organisation looks further into the process, some of you may be wondering whether you need to change your file formats before you transfer them to us.
The short answer is: no - we take them all! Our Digital Preservation Analyst, Jan Hutař, provides some more information and explains how this is possible below.
We are now open for born-digital transfers. This means any public office willing to transfer born-digital records can get in touch with us, plan the transfer, go through the Digital Transfer Initiation document, and if everything is looking good, make the transfer happen.
Why all formats?
We currently do not prescribe or require certain file formats. Once your records team has signed off the records to be transferred using your Disposal Authority, we can accept any and all file formats an agency holds. Born-digital transfers are ingested and stored in the Government Digital Archive. The Government Digital Archive uses Rosetta, a long-term preservation system by ExLibris, for managing and preserving digital records (both born-digital and digitised).
Rosetta is able to manage and preserve any file format: new, old, obsolete, or bespoke. Rosetta is based on the Open Archival Information System (OAIS) reference model (ISO 14721:2012). The ingest process involves a number of steps, including file format identification, file format validation and metadata extraction, which allow Rosetta to accept most file formats.
File format identification
File format identification is a crucial step in this process. We do not ingest files into the Government Digital Archive if the file format is unknown, except in a few rare cases where the format is investigated after ingest. The identified format is recorded as metadata, and all the following tasks, most importantly the preservation actions, are based on it. There is no way you can preserve a file without knowing what it is. Knowledge about file formats enables us to work with the file and to assess and assign the risks associated with its format. These risks can include obsolescence, dependence on bespoke applications, and dependence on certain hardware or software. We are also able to conduct searches based on file format, create sets, and perform preservation actions on those sets.
For format identification, Rosetta uses the tool DROID, which is maintained by The National Archives of the UK (TNA). DROID uses the PRONOM database of file formats (also managed by TNA). Every single file in Rosetta goes through the DROID identification process. When the file format is identified, a format identifier is assigned to the file and kept in its metadata. At the moment, the PRONOM database contains almost 2000 file format records. If a file format is unknown to DROID/PRONOM, the result of format identification is “format not identified”. Rosetta then stops the ingest process and one of our technical analysts decides on the next steps. The usual scenario is to research the file format. If it is confirmed as a new file format not listed in PRONOM, a “signature” is developed for it and submitted as a proposal to TNA for inclusion in PRONOM.
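To make the flow above more concrete, here is a minimal Python sketch of signature-based identification followed by the "stop the ingest if unknown" decision. It is not DROID or Rosetta code. The signature table, the identifiers such as `example/pdf`, and the `identify` and `ingest` functions are all assumptions made up for illustration, and real PRONOM signatures are considerably richer than a few leading bytes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical signature table: (format id, format name, leading bytes).
# The ids are illustrative placeholders, NOT real PRONOM identifiers.
SIGNATURES = [
    ("example/pdf", "Portable Document Format", b"%PDF-"),
    ("example/png", "Portable Network Graphics", b"\x89PNG\r\n\x1a\n"),
    ("example/gif", "Graphics Interchange Format", b"GIF8"),
]


@dataclass
class IdentificationResult:
    format_id: Optional[str]  # None when no signature matched
    format_name: str

    @property
    def identified(self) -> bool:
        return self.format_id is not None


def identify(data: bytes) -> IdentificationResult:
    """Return the first format whose signature matches the start of the file."""
    for format_id, name, magic in SIGNATURES:
        if data.startswith(magic):
            return IdentificationResult(format_id, name)
    return IdentificationResult(None, "format not identified")


def ingest(filename: str, data: bytes) -> dict:
    """Record the format id in metadata, or stop so an analyst can investigate."""
    result = identify(data)
    if not result.identified:
        return {"file": filename, "status": "stopped", "reason": result.format_name}
    return {
        "file": filename,
        "status": "ingested",
        "metadata": {
            "format_id": result.format_id,
            "format_name": result.format_name,
        },
    }
```

Calling `ingest("report.pdf", b"%PDF-1.4 ...")` in this sketch yields a record whose metadata carries the format identifier, while a file with no matching signature comes back with status `stopped`, mirroring the point at which an analyst would step in.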
Format validation
All the results of file format identification and of format validation against the official format specification (when possible), together with all the detailed technical characteristics extracted from the file, are kept in metadata and can be used for management and preservation. From this point of view, the more Rosetta knows about the files being ingested, the better the chance of accurate file management and, more importantly, preservation in the future.
From what is said above, it should be clear that we really are able to accept and ingest any file format into the Government Digital Archive. At the moment, we don’t want agencies to do format migration just for the sake of transfer to our archives. We prefer the original version (format) of the records, i.e. the form (format) in which they are kept, for example, in an EDRMS. If the file formats used by the agencies are standard, widely used formats, this certainly makes transfer, ingest, and the subsequent management and preservation in the Government Digital Archive easier.
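As a rough illustration of how stored technical metadata can support management, the Python sketch below keeps one metadata record per file and groups files into sets by format identifier. The `TechnicalMetadata` class, its field names, the `build_sets` function and the format ids are hypothetical and are not Rosetta's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TechnicalMetadata:
    """Hypothetical per-file record of what was learned during ingest."""
    filename: str
    format_id: str                 # result of format identification
    valid: Optional[bool]          # None when no validator exists for the format
    properties: Dict[str, str] = field(default_factory=dict)  # extracted details


def build_sets(records: List[TechnicalMetadata]) -> Dict[str, List[str]]:
    """Group filenames by format id so actions can target a whole set."""
    sets: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        sets[record.format_id].append(record.filename)
    return dict(sets)


records = [
    TechnicalMetadata("minutes.doc", "example/doc", True, {"pages": "3"}),
    TechnicalMetadata("plan.pdf", "example/pdf", True),
    TechnicalMetadata("memo.doc", "example/doc", None),
]
format_sets = build_sets(records)
```

Here `format_sets["example/doc"]` would list both `.doc` files, which is the kind of set a later preservation action could be applied to in one go.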
Keeping the original files
The original file gives us more information than a file migrated into a different format during its lifecycle at the agency, for example, the last date of modification, the creation date or the application used for creation. If the record arrives in an obsolete file format, we still ingest it into the Government Digital Archive. Preservation actions are then needed to get the content of the record into a current file format. But even after the new version has been created, the original files are kept in perpetuity. For every preservation action such as format migration, we make sure the intellectual content of the record is not changed, and all actions on the file are documented in its metadata. After all, we are not preserving the files and their formats; we are preserving their content.
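The idea of keeping the original while documenting each preservation action can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Record`, `FileVersion` and `migrate` names, the event fields and the format ids are hypothetical, the converter is supplied by the caller, and checking that the intellectual content is unchanged is a separate step not shown here.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class FileVersion:
    format_id: str  # placeholder id, not a real PRONOM identifier
    data: bytes

    @property
    def sha256(self) -> str:
        return hashlib.sha256(self.data).hexdigest()


@dataclass
class Record:
    versions: List[FileVersion]          # versions[0] is always the original
    events: List[dict] = field(default_factory=list)


def migrate(record: Record, target_format: str,
            convert: Callable[[bytes], bytes]) -> FileVersion:
    """Add a derivative in a new format; never replace or delete the original."""
    original = record.versions[0]
    derivative = FileVersion(target_format, convert(original.data))
    record.versions.append(derivative)
    # Document the action in the record's metadata.
    record.events.append({
        "action": "format migration",
        "when": datetime.now(timezone.utc).isoformat(),
        "source_format": original.format_id,
        "source_sha256": original.sha256,
        "target_format": target_format,
        "target_sha256": derivative.sha256,
    })
    return derivative
```

After a call such as `migrate(record, "example/new", some_converter)`, the record holds both versions plus an event describing what was done, so the original remains available and the history of actions stays traceable.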
Originally published on the Records Toolkit blog 27 February 2018