Te whakapūranga ā-tukutuku
Web archiving
Learn more about web archiving including when you should do it, and the benefits and risks associated with different approaches.
Web archiving and the Public Records Act
Websites created by public offices and local authorities (public sector organisations), and the information they contain, are public or local authority records. As such, they need to be managed in accordance with the Public Records Act 2005 (the Act) from creation to authorised disposal.
Managing active websites and archiving websites when they are no longer in current use are different activities. This guidance focuses on web archiving. To find out more about managing website content and website activity, read our guide on managing websites as records.
When to archive websites
You should archive websites during times of change, for example when your organisation intends to:
develop a new website
make major updates or redevelopments to an existing website, or
decommission a whole website or part of a website.
Some archiving processes and systems can capture web records that others cannot. For example, if the content management system (CMS) you use to manage web records does not provide rollback functionality, you may need to take a snapshot of the website at risk-determined intervals. You might also need to keep a copy of the entire website so you can refer to its associated records for recordkeeping and business purposes.
Websites may also be archived to preserve them long term for cultural or historical purposes. Websites captured for this reason may not always be managed by the creating organisation.
Harvesting and snapshots
Harvesting is the process of capturing a whole website or specified parts of a website, usually with commercial tools such as external site crawlers. Crawlers are software packages or pieces of code that index or copy websites in a methodical manner, then save the selected elements as static pages to disk. The resulting data is a snapshot of the site at a known point in time.
There are benefits and risks to using this approach.
The context of the web information is preserved, but creation and rollback information and metadata may not be available.
The ‘look and feel’ of the website may be preserved, but harvesting produces a static version of information that may originally have been presented in a dynamic or personalised manner.
Harvesting may capture only public-facing pages, not intranet pages or secured content.
Some content, such as multimedia formats, may not be captured if it is located on a different server from the HTML pages.
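To make the harvesting process and its limits more concrete, here is a minimal sketch of a crawler written with the Python standard library. It is an illustration only, not the tool used by any particular harvesting programme. The names `harvest`, `fetch_html` and `LinkParser` are hypothetical. The sketch assumes a small site of plain HTML pages, follows only links on the same host (so content hosted on other servers is not captured) and skips any page it cannot retrieve (so secured content is missed).

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher (an assumption): download a page and decode it as text."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def harvest(start_url, fetch=fetch_html, max_pages=50):
    """Crawl one host breadth-first and return a timestamped snapshot."""
    host = urlparse(start_url).netloc
    queue = [start_url]
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable or secured pages are not captured
        pages[url] = html  # the static copy of the page
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target, _fragment = urldefrag(urljoin(url, href))
            # Only same-host links are followed in this sketch.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "pages": pages,
    }
```

The returned dictionary records when the capture happened and the static HTML of each page. Writing those pages to disk, and capturing images, scripts or dynamically generated content, would need further work that this sketch does not attempt.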
National Library web harvesting
The National Library primarily collects New Zealand websites and individual documents published on websites under legal deposit legislation (National Library of New Zealand Act 2003, Part 4). This includes annual harvests of public sector organisation websites.
Selective harvesting priorities are outlined in the New Zealand and Pacific Published Collections collecting plan. An annual harvest of the New Zealand internet is also undertaken. This includes websites that are not necessarily part of the selective web harvesting programme, such as schools and tertiary education institutions. For more information about the National Library’s web harvesting programme, see their website.
It’s important to note that for public sector organisations, the National Library’s web harvesting programme does not constitute authorised transfer or disposal under the Act. Also, because large, complex websites are difficult to harvest, there is no guarantee that a harvest will meet other legislative recordkeeping requirements, such as those of the Official Information Act 1982, the Local Government Official Information and Meetings Act 1987, and the Privacy Act 2020. Your organisation should assess the risk of being unable to access vital web records and consider using other systems, such as a CMS, to maintain a usable copy of a discontinued website until it can legally be disposed of.
Transaction logs
Transactional logging is the recording of actions performed on a web page, piece of information or artefact. Almost all CMS products can record these actions, which are known as transactions. A transaction may be the creation of a new page, the publishing of a new content item, or the submission of a form. Collated lists of transactions are called transaction logs. They are often saved to a database table or text file within the application that generated the transaction.
Benefits and risks to using this approach include:
it can be easily set up with most database-driven applications, but has limited accessibility
it captures raw information, but context is often lost.
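As an illustration of what a transaction log can look like, the following minimal sketch appends one JSON record per transaction to a text file. It is not taken from any specific CMS. The function names and the fields `action`, `target` and `user` are assumptions chosen for the example.

```python
import json
from datetime import datetime, timezone


def log_transaction(log_path, action, target, user):
    """Append one transaction to a text-file log, one JSON object per line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,   # for example "create_page" or "submit_form"
        "target": target,   # the page or content item acted upon
        "user": user,       # who performed the action
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return entry


def read_transactions(log_path):
    """Read the whole log back as a list of dictionaries."""
    with open(log_path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]
```

Each record shows that an action happened, but not what the page looked like at the time. This is the loss of context noted above.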
Digital preservation approaches
Normalisation and format migration are digital preservation strategies that can be used for web archiving. Normalisation usually involves converting or ‘normalising’ web records to one file format (either a static document format such as PDF or an HTML document) or using an emulation tool to access or read obsolete or uncommon formats. Format migration involves updating web records in old file formats to newer versions when they are at risk of becoming obsolete.
Both strategies can be automated but quality control is always important.
Normalisation creates homogeneous, easier-to-manage collections and means that users only need to know how to use a few file types.
Format migration means that files can be accessed in current IT environments, but careful consideration must be given to migration pathways to avoid loss of data and functionality.
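The sketch below shows one way the planning step of an automated normalisation run might look, with a basic quality control check built in. It does not convert any files. The mapping in `NORMALISATION_TARGETS` is hypothetical and is not a recommended set of formats. The function name `plan_normalisation` is also an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from source format to normalised format.
NORMALISATION_TARGETS = {
    ".doc": ".pdf",
    ".docx": ".pdf",
    ".htm": ".html",
    ".html": ".html",
}


def plan_normalisation(paths, targets=NORMALISATION_TARGETS):
    """Pair each file with its normalised name; flag files with no known pathway."""
    plan = []
    needs_review = []
    for path in paths:
        source = PurePosixPath(path)
        target_suffix = targets.get(source.suffix.lower())
        if target_suffix is None:
            # No conversion pathway defined: leave for manual assessment.
            needs_review.append(path)
        else:
            plan.append((path, str(source.with_suffix(target_suffix))))
    return plan, needs_review
```

Files with no defined pathway are returned separately rather than silently skipped, so a person can decide how to handle them.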
For further advice about digital preservation approaches to web archiving, contact us.