SCAPE Platform Business Activity
Digital collections at this organisation have grown to a sufficient size that assessment of the data for preservation risks has becoming virtually impossible. Execution time of a simple file format identification process is taking several months to complete. The proposed business activity will expand the web archiving department's Hadoop cluster, connect the organisation's digital repository to the cluster and apply the SCAPE Platform to enable all repository data to be characterised and assessed on a more frequent and practical basis.
The main activities are:
- Purchase new hardware and expand existing cluster
- Implement SCAPE Platform connector to enable movement of content from the digital repository to the cluster
- Implement SCAPE Platform tools, specifically Nanite, to enable file format identification, metadata extraction and characterisation of collection data
- Perform frequent re-assessment of data using the latest tools and file format signatures available
The web archiving department have agreed to allow digital preservation to utilise their existing Hadoop cluster in return for expanding the cluster with some additional hardware.
The organisation is struggling to assess it's digital collections due to the length of time it takes to assess the growing numbers of files. Without any file format analysis, the potential for unknown preservation risks is significant. The business case therefore suggests leveraging (and expanding) an existing Hadoop cluster for running preservation processes. Web archiving departments have been early adopters of Hadoop technology in libraries and archives, and so this scenario has been used by some SCAPE Project partners.