Te wāhi nui ki ngā tapekemoka
The importance of checksums
As we increasingly work in the digital world there are new areas of work for information and records managers. We need to understand our organisation’s capability to participate in these. We are working towards being able to accept born-digital records and fundamental to this is the ability to produce checksums.
In the light of the recent interest in checksums, this is a revised version of the blog originally published in 2016.
A checksum is a string of numbers and letters that acts as a fingerprint for a file, against which later comparisons can be made to detect errors in the data. Checksums are important because we use them to check the integrity of files.
Our digital preservation policy uses the UNESCO definition of integrity.
“Digital content is information encapsulated in one or more digital objects. Within this context, integrity of a digital object is the quality of its content remaining ‘uncorrupted and free of unauthorized and undocumented changes’.”
National Library of Australia/UNESCO. (2003). Guidelines for the Preservation of Digital Heritage.
Checksums are useful in three situations: when moving files from one environment to another, for example to validate them after a migration; when regularly checking the integrity of files managed in a system, where you expect the file content to remain unchanged over time; and when working with files, to identify uniquely what we are working with.
Checksums bridge the gap between your organisation and permanent preservation in our archive during transfer or deposit. The file you extract from your Content Management System must remain identical to the copy held there. We will verify that unchanged state when we store the file in the digital repository. An exception procedure is triggered if anything unexpected has happened. The use of checksums is also relevant for local authorities managing digital protected records.
The procedure that produces a checksum is called checksum generation. It uses one of a number of checksum functions, or algorithms. These algorithms usually output a significantly different value for even the tiniest change to the data. Checksums therefore allow us to detect corruption introduced during transmission. They can also indicate when a file has been tampered with; an important byproduct of integrity is security.
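To illustrate how a small change to the data produces a very different checksum, here is a minimal Python sketch. It assumes the SHA-256 algorithm from Python's standard hashlib module; the sample text and the function name checksum are made up for illustration and are not part of any transfer procedure.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal checksum of data using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


# Two made-up records that differ by a single character.
original = b"Minutes of the council meeting, 12 March"
altered = b"Minutes of the council meeting, 13 March"

print(checksum(original))
print(checksum(altered))

# The two checksums do not match, which reveals the change.
print(checksum(original) == checksum(altered))  # False
```

Passing "md5" or "sha1" as the second argument would generate a checksum with one of those algorithms instead.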
We need to monitor checksums throughout the transfer or deposit lifecycle. There are two important points at which we must verify integrity. The first is when we receive the files and their checksums from your organisation and compare those checksums with new ones that we generate. The second is when we deposit the files into the permanent repository and check them against the original transfer sent to us by your organisation. Once the files are in our repository, we will continue to monitor the checksums to ensure the files remain unchanged in perpetuity.
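The first comparison described above, checking supplied checksums against newly generated ones, could be sketched in Python as follows. This is an illustration only and not the process our repository uses: the function names, the status labels and the choice of SHA-256 are all assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(folder: Path, supplied: dict) -> dict:
    """Compare supplied checksums with freshly generated ones.

    supplied maps each file name to the checksum sent with the transfer.
    Returns a status per file name: 'ok', 'changed' or 'missing'.
    """
    results = {}
    for name, expected in supplied.items():
        path = folder / name
        if not path.is_file():
            results[name] = "missing"
        elif file_checksum(path) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "changed"
    return results
```

Any file reported as changed or missing is the kind of unexpected result that would call for an exception procedure.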
Open source tools
Checksums can be generated and validated with many tools. Below is a list of some open source tools for your convenience:
TOOL: Free Commander
Operating System: Win
Generate: Yes
Validate: Yes
TOOL: Double Commander
Operating System: Win, Linux, MacOS
Generate: Yes
Validate: Yes
TOOL: DROID
Operating System: Win, Linux, MacOS
Generate: Yes
Validate: No
TOOL: AVPreserve Fixity
Operating System: Win, MacOS
Generate: Yes
Validate: Yes
TOOL: Checksum-comparator
Operating System: Win, Linux
Generate: No
Validate: Yes
TOOL: Spreadsheet (LibreOffice)
Operating System: Win, Linux, MacOS
Generate: No
Validate: Yes
TOOL: SHA1SUM, MD5SUM commands
Operating System: Linux
Generate: Yes
Validate: Yes (used from the command line)
TOOL: Online MD5 generator
Operating System: Win, Linux, MacOS
Generate: Yes
Validate: No
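If none of these tools is available, a checksum list can also be produced with a few lines of Python. The sketch below is an assumption-laden example rather than a recommended tool: the function name write_manifest, the default of SHA-256 and the file layout are our own choices. Each line is written as the checksum, two spaces, then the file path, which is broadly the layout printed by the md5sum and sha1sum commands.

```python
import hashlib
from pathlib import Path


def write_manifest(folder: Path, manifest: Path, algorithm: str = "sha256") -> int:
    """Write one 'checksum  relative/path' line per file; return the file count."""
    lines = []
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        if path.resolve() == manifest.resolve():
            continue  # do not list the manifest itself
        # read_bytes loads the whole file; fine for a sketch, slow for huge files
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(folder).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, write_manifest(Path("transfer"), Path("transfer/manifest.sha256")) would list every file under a hypothetical transfer folder.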
Further reading
Digital Preservation Coalition – Fixity and Checksums. Contains further reading and links to other tools.
Digital Preservation Coalition – Which checksum algorithm should I use? (PDF 468 KB). Guidance on which checksum algorithm to use, depending on a number of factors.
Capability assessment
To assess your own capability, here are some questions for you and/or your organisation:
Does your organisation use checksums and if so what type?
Has your organisation used checksums in any other scenario e.g. for de-duplication?
Would your organisation be able to create a checksum comparison list like the one described?
We are very interested to hear your questions about checksums and your practices in working with them, and will use these to produce further relevant information.
Originally published on the Records Toolkit blog on 22 June 2017