Archives in the cloud — Exploring machine learning to transform our digital services for agencies
From February to July 2022, Te Rua Mahara o te Kāwanatanga Archives New Zealand led work on a proof of concept which showed that machine learning tools have the potential to auto-classify digital public records and surface information of interest to Māori.
Why we worked on the proof of concept
Public offices like government agencies and Crown entities create huge amounts of digital information and data in their day-to-day work. Some of this information has long-term significance to Aotearoa New Zealand and will eventually be transferred to our custody and form part of the lasting memory of the government. The roles of Archives and the Chief Archivist are set out in the Public Records Act 2005.
We designed our current systems for sorting, maintaining, and ensuring the accessibility of this information with paper records in mind. Public offices now hold huge stores of digital information and data, from databases with millions of emails to legacy systems and shared drives full of content. Agencies need to appraise this information in line with policy documents called disposal authorities. Disposal authorities determine how long information will be kept and what will happen to it, usually either destruction or transfer to us. It is no longer possible for people to sort through all this information manually. Unless we find new approaches to appraisal, disposal, and searching for information within our archives, gaps in the memory of government are inevitable.
We wanted to see if machine learning tools and hyperscale cloud capabilities can help to sort this information and solve other information and archival challenges that have arisen in the digital era. We received funding from the Digital Government Partnership Innovation Fund to carry out a proof of concept (PoC) from February to July 2022. The PoC aimed to test if it was possible to use these tools to:
streamline the appraisal process, specifically whether auto-classification could determine the appropriate disposal authority to apply to information and records.
identify material of importance to communities, specifically whether available tools could identify and surface information of interest to Māori.
Agencies we worked with
We worked with the Ministry of Justice and the Ministry for Primary Industries, technology partners Microsoft and AWS, and information management experts on the PoC. The aim was to test if machine learning and cloud computing tools could classify data in line with disposal authorities (the rules for keeping or disposing of information). We also tested whether these tools could surface information of interest to Māori.
We worked together in a nimble and iterative way to develop this PoC and to work through issues like:
where the data would live
how to keep it safe
what key outcomes we wanted to test in this small-scale experiment.
Within the limited time available, both Microsoft and AWS successfully developed solutions using their suites of tools that could auto-classify records and find Māori subject headings within records. With further training, the models would become more accurate, and further refinement and consultation could help ensure the relevance and accuracy of the Māori records identified.
Read the AWS report on Document Understanding Proof of Concept — Archives New Zealand (PDF 1.2MB)
The potential of these technologies is huge, and we want to continue developing processes and approaches that help us address the challenges we face and grasp opportunities. For future work in this area, we’ll need to get the right resources in place, work alongside Māori, and ensure that the wider processes are fit for purpose and in line with the Algorithm Charter. We also need to think about the wider information context across government. For example, it is likely that, for any large-scale project to be successful, we’ll need to rethink how we develop disposal authorities, and build an all-of-government ontology and make it available.
Next steps for the proof of concept
Our proposed next step to build on this PoC is to continue work on approaches to auto-classification of digital records and information under general disposal authorities. We expect that auto-classification approaches will have a significant positive impact on information managers and agencies more broadly, as they will allow them to carry out their work more efficiently.