Talk:Pre-ingest: Difference between revisions

From wiki.dpconline.org


What do you think?
== In the pre-ingest phase, things need to work out well before the SIP can enter the next phase of the OAIS ingest. ==
I have worked for several years at the Audit Organization of the Dutch Ministry of Finance. At the Dutch Tax Office I was part of the audit of some large conversion projects. One was the ingest of an old digital archive (with records burned on Plasmon optical discs and stored in an old Dutch coal mine) into the mainframe digital archive system. Another was the ingest of 400 databases into a large new invoice system. But because the auditor was not involved in the pre-ingest phase of these projects, it was too late to say anything about the integrity of the data that was transformed from one system to the other.
Datasets need to be carefully examined, and this takes a lot of time. Before a Submission Information Package (SIP) can be ingested, a lot has to be done. There are many risks involved, and many steps need special attention. The OAIS ingest phase needs a data conversion plan that serves as an organizing tool (for management, producer, client, host system, project, quality assurance, auditor, etc.) in managing ingest projects. For a SIP to be completed successfully, its handling and management need to be well organized. The separation between the pre-ingest and ingest phases is not sharp enough in OAIS. Before a SIP can be taken in, precautionary measures need to be taken in the pre-ingest phase, and every step needs to be well documented.
The most relevant topics to cover in the pre-ingest phase, so that risks can be minimized, are:
Policy, integrity, management, backup, availability, software tooling, testing, quality control, file conversion.
POLICY, PLANNING AND DOCUMENTATION 
If no measures are taken (and recorded) for the acquisition of a SIP, the organization runs the risk that the ingest process gets out of control, or that the desired result is ultimately not achieved and the project cannot be completed on time and within budget.
Organizational arrangements are needed to ensure that, throughout the conversion process, management and project management can be brought in or can adjust course in a proper manner. Procedural and technical measures must be taken so that the commitments made during the conversion process are recorded accurately, completely and in a timely manner.
INTEGRITY OF THE DATA
The risk is that the SIP needs to be modified. Selecting the right conversion tools and applying proper controls are necessary. If the data is not available in time, this can lead to erroneous information about the data. When conversion is needed, there must be assurance that all data is accurately, completely and promptly included in the definitive database of the archive institution after the conversion. In my view, all datasets and other SIP material (or at least the metadata) need to be changed in one way or another to make ingest possible.
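As an illustration of what such assurance could look like in practice, here is a minimal sketch of a reconciliation check between source records and converted records. It is not taken from any of the projects mentioned above. The function name `reconcile`, the record layout and the `id` key are hypothetical, and it assumes the conversion is meant to preserve field values unchanged.

```python
from typing import Iterable, Mapping


def reconcile(
    source: Iterable[Mapping[str, object]],
    converted: Iterable[Mapping[str, object]],
    key: str,
) -> dict[str, list]:
    """Compare source and converted records by a shared key field.

    Assumption: every record has a unique, sortable value under `key`,
    and a correct conversion leaves all field values unchanged.
    """
    src = {row[key]: row for row in source}
    dst = {row[key]: row for row in converted}
    return {
        # Records that were in the source but did not arrive (completeness).
        "missing": sorted(src.keys() - dst.keys()),
        # Records that appear only after conversion.
        "unexpected": sorted(dst.keys() - src.keys()),
        # Records present on both sides whose content differs (accuracy).
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


if __name__ == "__main__":
    before = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "5.50"}]
    after = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "5.05"}]
    print(reconcile(before, after, key="id"))
```

A report with three empty lists would be the evidence an auditor could ask for. Where the conversion deliberately changes values, the comparison would need a mapping rule instead of plain equality.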
MANAGEMENT AND ACCESS TO ENVIRONMENTS
The organization should comply with legal requirements related to privacy-sensitive data. If unauthorized persons gain access to the SIP data, privacy-sensitive data may become public. The organization then fails to meet the legal requirements and runs the risk of claims.
Organizational measures are needed to ensure that, during the conversion process, only authorized persons have access to the SIP data and to the conversion tools and functionalities.
BACK-UP AND FALLBACK
If no decisive action has been taken to create backups of the database at all stages of the SIP process, and there is no fallback scenario, the planned ingest process and the continuity of data processing can be in jeopardy. This may mean that operational and informational use of the application does not take place, or does not take place in time. Procedural and technical measures must be taken to ensure that the data is safeguarded at all times during and after the pre-ingest and ingest phases.
Log information from both phases needs to be archived and passed on (as metadata) to the designated community, so that they have assurance that the information was handled correctly during both phases and can be TRUSTED!
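One possible way to make such a log checkable by others is a hash chain, in which every entry stores the hash of the entry before it. This is a technique suggested here for illustration only, not something OAIS or PAIMAS prescribes. The function names and the entry fields (`actor`, `action`, `timestamp`) in the sketch below are hypothetical.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Hash an entry body in a stable (key-sorted) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log: list[dict], actor: str, action: str, timestamp: str) -> dict:
    """Append an entry that is chained to the previous entry's hash."""
    body = {
        "actor": actor,
        "action": action,
        "timestamp": timestamp,
        "previous_hash": log[-1]["entry_hash"] if log else "",
    }
    entry = {**body, "entry_hash": _digest(body)}
    log.append(entry)
    return entry


def verify_log(log: list[dict]) -> bool:
    """Return True only if no entry was altered, removed or reordered."""
    previous = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["previous_hash"] != previous or entry["entry_hash"] != _digest(body):
            return False
        previous = entry["entry_hash"]
    return True


if __name__ == "__main__":
    log: list[dict] = []
    append_event(log, "producer", "files delivered", "2015-11-14T02:16:00Z")
    append_event(log, "archive", "virus scan passed", "2015-11-14T03:00:00Z")
    print(verify_log(log))
```

A member of the designated community who receives the log as metadata can rerun `verify_log` without having to trust the system that produced it.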
AVAILABILITY OF THE NEW ENVIRONMENT
What about the situation where the datasets in the SIP have to wait before they are taken into the archival system?
When the required functions are not available during the data transfer and after the ingest conversion, and cannot be tested properly, the database may not be sufficiently usable. As a result, the continuity of data processing is in danger. Procedural and technical measures must be taken to ensure that the new information packages (for the benefit of the entire ingest process) are available for the final ingest.
TOOLING SOFTWARE
A SIP conversion is of such a specific nature that standard conversion software cannot simply be used. Standard software, or a standard function of Preservica, can certainly help, but every SIP project is special, needs customization and is therefore time-consuming. If the software used for ingest tasks such as conversion, adding metadata and transfer does not operate properly, the organization runs the risk that the integrity of the data is no longer assured and/or that the ingest process is unnecessarily time-consuming.
Organizational and technical measures must be taken to ensure that the software used functions properly throughout the entire ingest process.
TESTING
Incomplete or incorrect testing can lead to a converted database being put into the production environment even though it does not comply with the set requirements. This puts integrity, confidentiality, security and manageability at stake. Within the pre-ingest process one should work with a predetermined test plan. In particular, the plan should specify who will conduct the testing, who is responsible, and when, where and how testing will take place.
QUALITY CONTROL
If insufficient attention is paid to compliance with the planning, execution and control cycle, the risk is that the ingest process becomes unmanageable and uncontrollable. A designated quality control officer, who oversees compliance with this cycle, should monitor the pre-ingest process. Fewer problems can then be expected in the ingest of a SIP.
FILE CONVERSION
If the quality of the data conversion and the checks carried out in the pre-ingest phase have not demonstrated the reliability of the SIP, the data should not enter the ingest phase. The results of the pre-ingest process and of the checks carried out must be recorded in a file conversion document.
Summary:
In the pre-ingest phase, things need to work out well before the SIP can enter the next phase of the OAIS ingest.
Points to consider:
OAIS includes a Negotiate Submission Agreement function. A detailed description of what constitutes a Submission Agreement can be found in the related standard, the Producer-Archive Interface Methodology Abstract Standard (PAIMAS). PAIMAS is also a tool for handling the complex OAIS framework better. It helps to manage the risks in the pre-ingest process.
Ronald van der Steen, Erfgoedinspectie

Revision as of 02:16, 14 November 2015

Reply from Hervé L'Hours, UKDA

The original OAIS release included recommendations for future development including a more detailed approach to the handling of appraisal and custody transfer to the repository (now PAIMAS).

OAIS uses the term ‘Pre-Ingest’ once, in the discussion under “4.1.5 The repository shall have an ingest process which verifies each SIP for completeness and correctness.”: “If an inventory of files was provided by a producer as part of pre-ingest negotiations, one would expect checks to be carried out against that inventory.”
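To make the quoted check concrete, here is a minimal sketch of comparing a delivered directory against a producer's inventory. The inventory format (relative path mapped to a SHA-256 digest) and the function names are assumptions made for this example. Neither OAIS nor PAIMAS prescribes them.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_against_inventory(root: Path, inventory: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under `root` with a producer-supplied inventory.

    Assumption: `inventory` maps POSIX-style relative paths to the
    SHA-256 digests agreed during pre-ingest negotiations.
    """
    found = {
        p.relative_to(root).as_posix(): p
        for p in root.rglob("*")
        if p.is_file()
    }
    return {
        # Listed by the producer but not delivered.
        "missing": sorted(set(inventory) - set(found)),
        # Delivered but not listed by the producer.
        "unexpected": sorted(set(found) - set(inventory)),
        # Delivered and listed, but the content does not match.
        "altered": sorted(
            name
            for name in set(inventory) & set(found)
            if sha256_of(found[name]) != inventory[name].lower()
        ),
    }
```

Under these assumptions, a deposit would only count as complete and correct when all three lists in the report are empty.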

PAIMAS mentions the Pre-Ingest only once as an alternate term for the first of its four phases: “The Preliminary Phase, also known as a pre-ingest or pre-accessioning phase, includes the initial contacts between the Producer and the Archive and any resulting feasibility studies, preliminary definition of the scope of the project, a draft of the SIP definition and finally a draft Submission Agreement.”

But the more recent (2012) Interface Specification (651x1r1-ProducerArchiveInterfaceSpec-PAIS-RedBook-Feb2012.pdf) doesn’t mention it at all.

The use of the term more colloquially has grown among curation professionals to refer to the repository processes from first contact with a potential ‘depositor’ to the arrival of an approved ‘deposit’ in the repository (carefully avoiding the term SIP here). But it is also used/misused at times to denote all pre-repository stages of the digital object lifecycle.

Does this use/misuse indicate that:

1. The term deserves formalisation as an important phase of the repository process?
2. The OAIS would benefit from more clearly placing itself within the full digital object lifecycle?

Topics important to pre-ingest (“processes from first contact with a potential ‘depositor’ to the arrival of an approved ‘deposit’”) include:

  • Contact management
  • Formal identification of depositor contacts
  • Confirmation that the offered data collection:
    • meets the repository’s collections development criteria (appropriate subject matter, IPR and rights status etc)
    • is of suitable technical quality for use and preservation (low risk file formats, sufficient supporting metadata etc)
    • has been appropriately validated in terms of risk (virus detection to appropriate anonymisation of human subjects)
    • is suitably structured to support the needs of the designated community (from structural metadata, to file naming to including ‘pre-repository’
  • Standard procedures for custody transfer

Barbara, I think the “want to distinguish the ‘raw material’ received from the producer from the material that is being processed to become a SIP” is tricky. I think the OAIS authors would suggest that this is the SIP, though of course we know that we’re collecting relevant metadata well before a deposit is received. I think the OAIS specifically identifies the three states SIP/DIP/AIP as covering everything, with the SIP exactly as deposited and then recording of all actions to create the AIP and DIP. But we also know that some organisations may receive sample data, or multiple submissions, before the ‘deposit process’ is closed. Here at the UKDA we ‘declare’ the existence of a SIP once all relevant QA is complete; until that point we just have (potentially) multiple ‘deposits’ in the acquisition process.

So I agree about Pre-Ingest, and I understand the idea of a PSP, but I’m not convinced that OAIS would ever consider an additional ‘object’ like that. I think that they’d suggest that any metadata you collect during the Pre-Ingest process becomes part of the AIP i.e. that you begin creating the AIP even before the SIP/deposit arrives.

I agree about the ‘raw’ data stuff though. I’d particularly like to be able to request deposit of ‘versions’ of the data that have been used to support ‘pre-repository’ publications. We know that people might use our DOIs for publications which are actually based on Pre-Repository (so pre-QA) versions of the data during the production process. We want to support this but also to distinguish these versions from our own ‘better’ (or at least different) version. One solution to this might be to offer pre-repository DOI minting to researchers and then to design the SIP to make sure we can receive multiple data versions.

What do you think?