Filling the Digital Preservation Gap


See all projects | About this work | Jisc Research Data Spring



From the project website:

The project aims to “explore the potential of the digital preservation solution Archivematica to help manage research data…”

From Spotlight Data: Jisc RDS Software Projects

"The project has reviewed and scoped whether Archivematica can preserve research data in the long-term.
Key project outputs:
  1. Project reports for phases 1 and 2 (available via Figshare)
  2. A number of sponsored developments (both enhancements and new functionality) within Archivematica, specifically designed to make Archivematica more suitable for use as a component for preserving research data (available within a future release of Archivematica for all users)
  3. A number of presentations, posters, blog posts and a podcast about the project to raise awareness about our work (available via SlideShare)
  4. New research data file format signatures under development at The National Archives (soon to be available to all users of the PRONOM database)"

From a synthesis of the projects in the context of the OAIS model, by Jen Mitcham of the Filling the Digital Preservation Gap project

"We are not looking at digital preservation software or tools that a researcher will interact with, but with the help of Archivematica are looking at among other things the OAIS Ingest entity (how we process the data as it arrives in the digital archive) and the Preservation Planning entity (how we monitor preservation risks and react to them)."

Further information

Hyperlinks to further information on the project

Potential to enhance

Is there potential to leverage non-preservation focused developments to enhance preservation capabilities?

The development tasks outlined by the project focused on the Archivematica software are striking in their wider applicability and have clearly been designed to facilitate wider interoperability rather than quick solutions relating to the technology in use at York and Hull. Testing and refinement of these developments would therefore appear to continue to be a very worthwhile target for the subsequent phase of work.


Is there potential for collaboration and/or exploiting existing/parallel work beyond the project consortiums?

Page 6 of the intermediate report outlines an impressive list of engagement and collaboration. The obvious potential to feed into the Jisc RDM Shared Service should also be noted here.

Considerations going forward

What are the key considerations (with regard to preservation) for taking forward the work beyond the current phase?

File format identification is one of the key areas of development within Filling the DP Gap and has been identified as a further target looking ahead to phase 3. The project identified some key questions in a blog post, and while noting the point that there are not necessarily correct answers and that solving the issues may well go beyond this project’s remit, it seems worthwhile providing some input on them here.

What should happen if you ingest data that can't be identified? Should you get notification of this? Should you be offered the option to try other file id methods/tools for those non-identified files?

A lack of any file format identification information indicates uncertainty. Any obvious gap in our understanding of the data in a repository represents a preservation risk of some kind. It therefore seems sensible to highlight data that has been ingested without identification, as is suggested. Whether this is flagged up during ingest, perhaps even pushing non-identified items into a holding area, or is simply recorded as an informational note for investigation at a later date, will depend on local policy. York’s intention (as described in the project report) seems to favour a largely automated workflow, which might better suit the latter approach.

It should be remembered, however, that file format identification is not necessarily a precise art, and that the concept of a file format is in practice usually more fluid than it is sometimes perceived to be (see, for example, The Network is the Format: PDF and the Long-term Use of Digital Content). There are also formats for which the DROID/PRONOM approach of identification via matching of magic numbers in file headers is largely ineffective, for example text-based formats such as source code. Other tools and approaches will therefore need to be included as they become available, and existing approaches and workflows need a degree of flexibility and pragmatism.
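The limitation for text-based formats can be illustrated with a minimal sketch of header matching. The signature table and the function name identify_by_magic are assumptions made for illustration only; this is a simplified stand-in and not the PRONOM signature format or the DROID matching logic.

```python
# Illustrative table of well-known leading byte sequences ("magic numbers").
# Simplified stand-in: this is NOT how PRONOM signatures are expressed.
MAGIC_SIGNATURES = {
    "PDF": b"%PDF-",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "ZIP container": b"PK\x03\x04",
}


def identify_by_magic(data: bytes) -> list:
    """Return the names of all signatures that match the start of the data."""
    return [
        name
        for name, magic in MAGIC_SIGNATURES.items()
        if data.startswith(magic)
    ]


# A PDF header begins with a fixed byte sequence and is matched.
print(identify_by_magic(b"%PDF-1.4\n"))

# Source code has no fixed header bytes, so nothing matches.
print(identify_by_magic(b"import numpy as np\nprint(np.pi)\n"))
```

The second call returns an empty list: a plain-text file such as source code carries no distinguishing header, which is why other approaches (file extensions, content heuristics, or curator knowledge) may be needed alongside header matching.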

Should we allow the curator/digital archivist to over-ride file identifications - eg - "I know this isn't really xxxx format so I'm going to record this fact" (and record this manual intervention in the metadata) Can you envisage ever wanting to do this?
Where a file is not identified at all, should you have the option to add a manual identification? If there is no Pronom id for a file (because it isn't yet in Pronom) how would you record the identification? Would it simply be a case of writing "MATLAB file" for example? How sustainable is this?

Given the points made above, the often conflicting or unclear results of automatic file format identification, and the cases where curators or depositors already know the file types, these manual interventions seem likely scenarios. Preservation software seems to have been slow to reflect this reality, perhaps partly due to the focus on automated ingest. The ability to go back and make manual adjustments (with event metadata, as noted) at a later date may well be the most useful approach here. How these adjustments are recorded has not been widely discussed within the community. Where possible, identification of the creating software may offer one possibility.
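As a starting point for that discussion, the following is a minimal sketch of how a manual identification might be recorded as an event alongside the automated result rather than overwriting it. The class and field names (IdentificationEvent, FileRecord, format_label and so on) are hypothetical; they do not reflect Archivematica's data model or any metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class IdentificationEvent:
    format_label: str  # registry id if one exists, else free text ("MATLAB file")
    method: str        # "automatic" or "manual"
    agent: str         # name of the tool or of the curator
    note: str = ""     # reason for the intervention, creating software, etc.
    timestamp: str = field(default_factory=_now)


@dataclass
class FileRecord:
    path: str
    events: List[IdentificationEvent] = field(default_factory=list)

    def current_identification(self) -> Optional[IdentificationEvent]:
        """The most recent event is treated as the current identification."""
        return self.events[-1] if self.events else None

    def override(self, label: str, curator: str, note: str) -> None:
        """Append a manual identification; earlier events are kept as history."""
        self.events.append(IdentificationEvent(label, "manual", curator, note))


record = FileRecord("results/run42.mat")
record.events.append(IdentificationEvent("unidentified", "automatic", "example-tool"))
record.override("MATLAB file", "curator@example.org", "Created with MATLAB per depositor")
print(record.current_identification().format_label)
```

Keeping every event, rather than replacing the stored value, means the original automated result and the reason for the override both remain available for later review.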

Where a tool gives more than one possible identification should you be allowed to select which identification you trust or should the metadata just keep a record of all the possible identifications?

This depends a little on how the metadata will in fact be used, and perhaps raises a counterpoint to the previous two questions. Developing the risk assessment and preservation planning process that file format information is meant to inform will help firm up the answers to these questions. It’s important to note that identification will change over time (hopefully for the better) as the tools and their coverage improve. So decisions of this nature also need to be made in the context of what could be annual re-runs of file format identification over entire repository collections. How much manual effort is necessary and/or useful, and how will this be treated when automated format identification is repeated at a later date?
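One possible treatment, sketched below under stated assumptions, is to keep every candidate identification from the latest automated run, preserve any manual choice, and flag the file for review only when the new run disagrees with that choice. The record layout (candidates, manual_choice, needs_review) and the function merge_rerun are hypothetical and are offered as a prompt for policy discussion, not as a recommended implementation.

```python
def merge_rerun(record: dict, new_candidates: list) -> dict:
    """Combine a stored record with the results of a repeated identification run.

    All candidates from the new run are recorded. A manual choice is never
    overwritten; instead the record is flagged when the new run returns
    results that do not include the curator's choice.
    """
    updated = dict(record)
    updated["candidates"] = sorted(set(new_candidates))
    manual = record.get("manual_choice")
    updated["needs_review"] = (
        manual is not None
        and bool(new_candidates)
        and manual not in new_candidates
    )
    return updated


stored = {"path": "data/run42.dat", "candidates": [], "manual_choice": "MATLAB file"}

# Improved tool coverage now suggests something different: flag for review.
print(merge_rerun(stored, ["HDF5"]))

# The new run includes the curator's choice: no review needed.
print(merge_rerun(stored, ["MATLAB file", "HDF5"]))
```

Under this approach manual effort is spent only where automated and manual identifications diverge, which may keep repeated collection-wide runs manageable.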

How should you share info around file formats/file identifications with the wider digital preservation community? What is the best way to contribute to file format registries such as Pronom?

This touches on an ongoing challenge within the wider digital preservation community, which has several facets: the shape of the current problem, multiple format identification tools, several sources of file format magic, and the community's dependence on the outstanding contributions of the UK National Archives (TNA) to DROID and PRONOM. Greater community input, and perhaps more importantly ownership (as TNA cannot be expected to own this problem and deliver the solution for everyone else), are challenges that the DPC is keen to take forward. Growing the format signature base without breaking existing signatures certainly requires real in-depth experience, something that TNA has taken on and excelled at. But there may be a way of getting more input from the community that takes some pressure off TNA and helps get more signatures into PRONOM. It might, for example, be possible to collate example files (without associated signatures) on the OPF Format Registry, where draft signatures could also be created when effort is available, before these are made ready for inclusion in PRONOM by TNA. The idea of a submission button for unidentified files could be a really useful way of encouraging community contributions, and it would be good to see this explored in phase 3.

Uptake and sustainability

What steps should be taken to ensure effective uptake and sustainability of the work within the digital preservation community?

The community engagement on the project has been strong, but there will certainly be more opportunities given the wide applicability of the results. One example of material that could be shared more widely is the detailed FAQ for exploring the suitability of Archivematica for RDM (noting further useful detail in the phase 1 report).

Project website sustainability checklist

A brief checklist ensuring the project work can be understood and reused by others in the future.

Task | Score
Clear project summary on one page, hyperlink heavy | 2
Project start/end dates | 2
Clear licensing details for reuse | 2
Clear contact details | 1
Source code online and referenced from website | N/A

2=present, 1=partial, 0=missing

Key recommendations

  • Substantial potential value in testing and refining Archivematica developments, most of which have wide applicability to other organisations applying Archivematica for RDM (and beyond)
  • Suggest advancing discussion on community engagement and cohesion of file format identification support, which DPC can help to facilitate.
  • Further dissemination of valuable results would be worthwhile
  • Suggest adding contact details to project website

