SCAPE Platform business case template

From wiki.dpconline.org

Use this to help develop a convincing business case for the use of the SCAPE Platform in your organisation.

About this business case template

This template presents generic business benefits, digital preservation risks and costs for applying the SCAPE Platform in a production environment, followed by an example of how these benefits might be tailored for, and presented in, a specific business case. Elements from this case study are mirrored in relevant parts of the Toolkit (for example the Benefits from this case study are mirrored on the DPBCT benefits template page). This template was developed by the SCAPE Project.

How to use this template

The sections on benefits, risks and costs can be reused by organisations who would like to create a business case for the use of the SCAPE Platform, but they must be tailored to that organisation's particular needs, aims and contextual situation. The SCAPE Platform Business Case Example shows how the generic business benefits and risks can be adapted to meet the specific needs of a (theoretical) organisation. Developing benefits and risks requires careful analysis, adaptation, use of language and prioritisation as described elsewhere within the Digital Preservation Business Case Toolkit. The Step by step guide to building a business case is a good place to start.

About the SCAPE Platform

The "SCAPE Platform" is an umbrella term for the suite of SCAPE and non-SCAPE tools that enable preservation processes to be run in parallel on a high performance cluster. The term also represents the ethos of leveraging high performance computing to make preservation actions achievable at scale. The SCAPE Platform is centred on the Apache Hadoop Framework and includes new tools and approaches such as ToMaR, Nanite and repository connectors.

This technology enables preservation processes, such as identifying the format of a file, to be run on a high performance computing cluster. This is important when the digital collections to be preserved are large, and execution time on a single computer could take months or even years.

Business cases relating to the SCAPE Platform might focus on:

  • Building a Hadoop cluster which runs the SCAPE Platform on which preservation processes can be performed
  • Connecting a digital repository to an existing cluster within the organisation on which the SCAPE Platform can be applied to perform preservation processes

SCAPE Platform benefits

Use this as a starting point for the benefits of using the SCAPE Platform at your organisation. Note that there may be benefits more specific to your organisation or circumstances that are not covered here, so brainstorm your own as well.

SCAPE Platform benefit summary

This section provides a summary of generic business benefits for using the SCAPE Platform.

Direct benefits:

  • Scales preservation technology to meet the demands of preserving very large digital objects (gigabyte+) and/or very large collections (terabyte+)
  • Dramatically speeds up preservation processes
    • Reduces execution time
    • Reduces execution costs
    • Enables previously long and impractical preservation processes to be executed (such as deeper characterisation) [1]
    • Makes the refinement of preservation processes more practical [2]
  • Makes the analysis of the results of preservation processes more manageable and responsive
  • Enables advances in tools to be exploited by making repeated runs of preservation processes viable

Indirect benefits:

  • Improves working efficiencies by providing facilities to manage preservation workflows and their results [3]
  • Leverages existing investments by better utilising existing infrastructure for preservation purposes [4]
  • Parallel computing can fulfil needs in a variety of areas beyond preservation, bringing wider value to an organisation

Notes:

  • [1] Deeper characterisation may otherwise require such lengthy execution time that it would not be viable
  • [2] If the execution time is reasonable, re-runs to address bugs or to refine the process for edge cases become more realistic. See The SCAPE Platform in use
  • [3] In particular, use of Taverna
  • [4] The SCAPE Platform provides the capability for a digital preservation department to exploit an existing organisational Hadoop cluster

SCAPE Platform benefits by SCAPE dimensions of scalability

This section describes generic business benefits for using the SCAPE Platform in the context of the four SCAPE Project dimensions of scalability. It provides a different perspective on the SCAPE Platform business benefits described above.

Number of objects

Preserving digital collections consisting of large numbers of digital objects, possibly totalling many terabytes of data, poses significant challenges. Even simple processes, such as identifying file formats or calculating checksums, necessitate accessing large amounts of data and performing a vast amount of intensive computation. The SCAPE Platform tackles both these issues, resulting in dramatically faster execution of digital preservation processes.

Size of objects

Preserving large digital objects (such as video or audio) can be difficult due to the time taken to execute often simple processes (such as calculating checksums) and the inability of software applications to handle very large files. The SCAPE Platform reduces execution time by orders of magnitude and enables key preservation processes to be executed even on digital objects totalling multiple gigabytes.

Complexity of objects and Heterogeneity of collections

Preservation tools are constantly being enhanced as the community's understanding of digital preservation improves and knowledge of new formats (and the finer details of existing formats) grows. Being able to re-run preservation processes on large collections therefore becomes a useful capability. This is very dependent on the length of time it takes to execute a process on a large collection. For example, re-running a format identification process when a new set of file format signatures becomes available is clearly desirable, particularly for heterogeneous collections such as web archives. But it's not practical if the process takes weeks or possibly even months to complete.

Greater computing power enables deeper characterisation to be applied to collections, as long as execution times remain viable. This is essential for complex objects that may have embedded formats, that may be dependent on external content, or may contain a host of other preservation risks that can only be detected via thorough (and therefore typically computationally expensive) characterisation.

For more on developing and articulating your business benefits see the DPBCT sections on Benefits, Stakeholder analysis, Who is going to be affected? and How do I make the case for what I want to do?.

Justification for the benefits

The SCAPE Platform is designed so that the performance of preservation processes can be scaled to the needs of the situation: adding nodes to a SCAPE Platform cluster increases performance. Evaluation work conducted by SCAPE illustrates the potential savings that can be made in the execution of typical preservation processes. For example, benchmarking of web content characterisation showed considerably higher performance when running on the SCAPE Platform on a small test cluster.

SCAPE Platform cost elements

Use this to identify key cost elements that should be considered in your business case.

  • Capital cost elements and setup activities
    • Hardware, installation and testing of a cluster, where one is not already available
    • Access costs or contributions in kind for access to a cluster, where one is available for use at your organisation
    • Installation and testing of SCAPE Platform components
  • Operational activities
    • Execution and maintenance of preservation processes running on the SCAPE Platform
    • Analysis, planning and exploitation of executed preservation processes
    • Hadoop cluster IT support
  • Staffing and specific technology skills needed:
    • Expertise in parallel computing, Hadoop and the other SCAPE Platform components

For more on identifying and understanding the costs of your business activity see the DPBCT sections on Costs, Institutional readiness and What resources are we focussing on?.

SCAPE Platform digital preservation risks

Use this to understand the key digital preservation risks of relevance to a business case focusing on the SCAPE Platform.

The SCAPE Platform primarily provides a mechanism to enable preservation processes to be run on a high performance computing cluster. As such, it does not in itself determine which specific preservation risks are mitigated, as this will depend on the nature of the data to be preserved. At a high level, relevant preservation risks might include:

  • Rare, problematic, unknown or obsolete file formats
    • Necessitating file format identification, characterisation and/or file format validation
  • Quality issues, badly constructed or invalid files
    • Necessitating characterisation, assessment and file format validation
  • Bit preservation issues
    • Necessitating generation and validation of checksums
  • Mitigation of risks and/or curation activities
    • Necessitating file format migration, generation of access copies, enhancement or extraction of metadata
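As a minimal illustration of the checksum generation and validation mentioned in the list above, the following Python sketch computes a SHA-256 digest in fixed-size chunks, so that very large files need not fit in memory, and compares it with a previously recorded value. The function names and the chunk size are illustrative assumptions. This is not part of the SCAPE Platform, which would distribute such work across a Hadoop cluster.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_digest):
    """Return True if the file's current digest matches the recorded one."""
    return sha256_of(path) == expected_digest.lower()
```

In practice the recorded digest would be stored at ingest and `verify_fixity` run periodically. On a cluster, each file (or batch of files) would be one independent unit of work.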

For more on identifying and understanding the digital preservation risks that your business activity is targeting see the DPBCT sections on Digital preservation risks, Understand your collection and Why are we writing a business case?.

The SCAPE Platform in use

The UK Web Archive was an early adopter of the SCAPE Platform. Andrew Jackson, Technical Lead of Web Archiving, describes how the UK Web Archive uses the SCAPE Platform and the benefits they get from using it.

We are currently indexing 30TB of compressed web content, 2.5 billion resources, and are also doing format identification and extraction of PDF/A violations using Apache Preflight. This is fairly slow going, in that it takes about a week to perform the scan. We have a 78 node cluster, and that means that it would at the very least take something like a year and a half to do the same thing on a single machine.

One of the critical advantages in shorter turn-around times is that being able to repeat means you get better at it. You can keep up with the rate that the preservation tools improve, and re-run the analysis periodically. Also, when dealing with very complex datasets, you often find that some subset of them hits an edge case in your code, but because the collection is so large, you have to code for more edge cases or you’d lose a big chunk of content. A faster turnaround time makes this kind of debugging vastly easier.

During the indexing, you can do more sophisticated analysis, both because you have a lot more RAM and the framework. The mere fact that it’s running on multiple machines means that you have more RAM available per document as well as more CPU time. This makes more intensive preservation analysis more plausible than if you were to parallelize the I/O only. For example, we can imagine actually running an embedded browser and rendering some significant fraction of the collection in order to look for access problems (although we don’t do that at present).
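The single-machine estimate quoted above can be reproduced with a back-of-the-envelope calculation. The Python sketch below assumes ideal linear scaling, that is, each node contributes an equal share of the work and there is no coordination overhead. That assumption is ours, made only to show where "a year and a half" comes from, and the function name is hypothetical.

```python
def single_machine_weeks(cluster_weeks, nodes):
    """Estimate single-machine run time, assuming perfectly linear scaling."""
    return cluster_weeks * nodes


# Figures from the quote: about one week on a 78 node cluster.
weeks = single_machine_weeks(cluster_weeks=1, nodes=78)
years = weeks / 52  # approximate number of weeks in a year
print(f"{weeks:.0f} weeks, about {years:.1f} years")
```

The same function can be used to sanity-check figures for your own collection when drafting the benefits section of a business case, bearing in mind that real clusters rarely scale perfectly linearly.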

Also see this blog post from Per Møldrup-Dalum on using Nanite at the Danish State and University Library

For more on the relevance and value of understanding the context to your organisation's business case, see the DPBCT section on External context.

SCAPE Platform business case example

This example business case applies the SCAPE Platform benefits and risks (see above) to a particular (theoretical) organisational situation. It shows how they could be tailored to the needs of an organisation and the likely concerns and interests of stakeholders. It comprises key sections from the DPBCT Template for building a business case followed by explanatory discussion notes.

SCAPE Platform executive summary

An example Executive Summary.

Digital collections at this organisation have grown at such a rapid pace that our existing infrastructure is struggling to cope. Long term digital preservation of our collections and ensuring access for our users remain key strategic priorities. But as the capacity of our repository goes beyond 50TB we are unable to maintain the standards of digital preservation assessment defined in our organisational policy. This business activity will leverage and expand existing infrastructure in order to provide an efficient and cutting edge digital preservation capability.

Discussion

Discussion notes explaining the approach in developing the Executive Summary example, above.

This summary focusses on the changing circumstances that leave the organisation unable to meet its policy requirements and at risk of not living up to key strategic priorities. The efficiency of utilising an existing Hadoop cluster (rather than building something completely new) is hinted at. The detail of the solution, including the complex technologies of which it is composed, is deliberately left out.

For more on summarising your business case and delivering your key messages succinctly see the DPBCT sections on Executive summary and How do I make the case for what I want to do?.

SCAPE Platform business activity

An example Business Activity description.

Digital collections at this organisation have grown to such a size that assessment of the data for preservation risks has become virtually impossible. A simple file format identification process is taking several months to complete. The proposed business activity will expand the web archiving department's Hadoop cluster, connect the organisation's digital repository to the cluster and apply the SCAPE Platform to enable all repository data to be characterised and assessed on a more frequent and practical basis.

The main activities are:

  • Purchase new hardware and expand existing cluster
  • Implement SCAPE Platform connector to enable movement of content from the digital repository to the cluster
  • Implement SCAPE Platform tools, specifically Nanite, to enable file format identification, metadata extraction and characterisation of collection data
  • Perform frequent re-assessment of data using the latest tools and file format signatures available

The web archiving department have agreed to allow digital preservation to utilise their existing Hadoop cluster in return for expanding the cluster with some additional hardware.

Discussion

Discussion notes explaining the approach in developing the Business Activity example, above.

The organisation is struggling to assess its digital collections due to the length of time it takes to assess the growing numbers of files. Without any file format analysis, the potential for unknown preservation risks is significant. The business case therefore suggests leveraging (and expanding) an existing Hadoop cluster for running preservation processes. Web archiving departments have been early adopters of Hadoop technology in libraries and archives, and so this scenario has been used by some SCAPE Project partners.

For more on developing and articulating your business activity see the DPBCT sections on Business activity, How do I make the case for what I want to do? and Why are we writing a business case?.

SCAPE Platform benefits

An example of business benefits.

Using the SCAPE Platform will reduce execution time of preservation processes from months to days. This will enable us to meet the requirements of our organisational preservation policy to assess all content in our repository at least twice a year. This is not currently the case, putting our digital collections at significant risk, and endangering our strategic commitments to collection longevity and access. By applying the latest technology offered by the SCAPE Platform, we will have sufficient capability to not only execute our existing preservation processes efficiently, but to increase their coverage and depth. In doing so we will follow current best practice guidance to ensure the survival and use of our digital collections now and in the future.

Discussion

Discussion notes explaining the approach in developing the Business Benefits example, above.

The broader list of detailed SCAPE Platform benefits is distilled to the issues critical to this organisation: meeting the policy requirements and ensuring strategic commitments are met.

For more on developing and articulating your business benefits see the DPBCT sections on Benefits, Stakeholder analysis and Who is going to be affected?.

SCAPE Platform implementation risks

An example of implementation risks alongside mitigation for those risks.

  • Adopting cutting edge technology is risky
    • Web archiving department already has confidence and experience with Hadoop technology
    • Some SCAPE components less well tested in production environments. Early visits to SCAPE Project partners will be used to identify and avoid typical pitfalls
  • Insufficient Hadoop cluster IT support skills
    • IT department already familiar with maintenance of cluster
  • Insufficient Hadoop skills in the digital preservation department
    • Exploit skills in web archiving department
    • Utilise training courses early in project to bring key team members up to speed
  • Shared use of cluster may lead to clashes with web archiving department needs
    • Coordinate planning of cluster use in advance with web archiving department

Acknowledgements

This business case template was created by the SCAPE Project with the support of the European Union under FP7 ICT-2009.4.1
