Talk:Pre-ingest: Difference between revisions

From wiki.dpconline.org
''Reply from Hervé L'Hours, UKDA''



Latest revision as of 13:59, 10 August 2016

The original OAIS release included recommendations for future development including a more detailed approach to the handling of appraisal and custody transfer to the repository (now PAIMAS).

OAIS uses the term ‘Pre-Ingest’ once, in the discussion under “4.1.5 The repository shall have an ingest process which verifies each SIP for completeness and correctness.”: “If an inventory of files was provided by a producer as part of pre-ingest negotiations, one would expect checks to be carried out against that inventory.”
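
As a rough illustration of the inventory check mentioned in that quotation, the sketch below compares a received deposit with a producer-supplied inventory. It assumes the inventory is a mapping from relative file path to SHA-256 checksum; that format, and the function names `sha256_of` and `check_against_inventory`, are hypothetical and not taken from OAIS or PAIMAS.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_against_inventory(deposit_dir, inventory):
    """Compare the files in deposit_dir with the producer's inventory.

    inventory is assumed to map relative POSIX paths to expected
    SHA-256 hex digests (an illustrative format, not a standard one).
    """
    root = Path(deposit_dir)
    received = {
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    }
    listed = set(inventory)
    return {
        # Listed by the producer but not received.
        "missing": sorted(listed - received),
        # Received but never listed by the producer.
        "unexpected": sorted(received - listed),
        # Present on both sides, but the content differs.
        "mismatched": sorted(
            name
            for name in listed & received
            if sha256_of(root / name) != inventory[name].lower()
        ),
    }
```

An empty report (all three lists empty) would mean the deposit matches what was agreed during pre-ingest negotiations; anything else is a reason to go back to the producer before the deposit is treated as a SIP.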

PAIMAS mentions the Pre-Ingest only once as an alternate term for the first of its four phases: “The Preliminary Phase, also known as a pre-ingest or pre-accessioning phase, includes the initial contacts between the Producer and the Archive and any resulting feasibility studies, preliminary definition of the scope of the project, a draft of the SIP definition and finally a draft Submission Agreement.”

But the more recent (2012) Interface Specification (651x1r1-ProducerArchiveInterfaceSpec-PAIS-RedBook-Feb2012.pdf) doesn’t mention it at all.

Colloquial use of the term has grown among curation professionals to refer to the repository processes from first contact with a potential ‘depositor’ to the arrival of an approved ‘deposit’ in the repository (carefully avoiding the term SIP here). But it is also used/misused at times to denote all pre-repository stages of the digital object lifecycle.

Does this use/misuse indicate that:

  1. The term deserves formalisation as an important phase of the repository process?
  2. The OAIS would benefit from more clearly placing itself within the full digital object lifecycle?

Topics important to pre-ingest (“processes from first contact with a potential ‘depositor’ to the arrival of an approved ‘deposit’”) include:

  • Contact management
  • Formal identification of depositor contacts
  • Confirmation that the offered data collection:
    • meets the repository’s collections development criteria (appropriate subject matter, IPR and rights status etc)
    • is of suitable technical quality for use and preservation (low risk file formats, sufficient supporting metadata etc)
    • has been appropriately validated in terms of risk (virus detection to appropriate anonymisation of human subjects)
    • is suitably structured to support the needs of the designated community (from structural metadata, to file naming to including ‘pre-repository’
  • Standard procedures for custody transfer

Barbara, I think the “want to distinguish the ‘raw material’ received from the producer from the material that is being processed to become a SIP” is tricky. I think the OAIS authors would suggest that this is the SIP, though of course we know that we’re collecting relevant metadata well before a deposit is received. I think the OAIS specifically identifies the three states SIP/DIP/AIP as covering everything, with the SIP exactly as deposited and then a record of all actions taken to create the AIP and DIP. But we also know that some organisations may receive sample data, or multiple submissions, before the ‘deposit process’ is closed. Here at the UKDA we ‘declare’ the existence of a SIP once all relevant QA is complete; until that point we just have (potentially) multiple ‘deposits’ in the acquisition process.

So I agree about Pre-Ingest, and I understand the idea of a PSP, but I’m not convinced that OAIS would ever consider an additional ‘object’ like that. I think that they’d suggest that any metadata you collect during the Pre-Ingest process becomes part of the AIP i.e. that you begin creating the AIP even before the SIP/deposit arrives.

I agree about the ‘raw’ data stuff though. I’d particularly like to be able to request deposit of ‘versions’ of the data that have been used to support ‘pre-repository’ publications. We know that people might use our DOIs for publications which are actually based on pre-repository (so pre-QA) versions of the data during the production process. We want to support this but also to distinguish these versions from our own ‘better’ (or at least different) version. One solution to this might be to offer pre-repository DOI minting to researchers and then to design the SIP to make sure we can receive multiple data versions.

What do you think?

Hlhours (talk) 10:31, 18 November 2015 (UTC)

In the pre-ingest phase things need to work out well before the SIP can enter the next phase of the OAIS ingest.

Recently, I was working at the Audit Organization of the Dutch Government. At the Dutch Tax office I participated in the audit of some large conversion projects. One concerned the ingest of an old digital archive (with records burned on Plasmon optical discs that were physically stored in an old Dutch coal mine) into the mainframe digital archive application. Another concerned the ingest of 400 databases into a large new invoice system. But because the auditor was not involved in the pre-ingest phase of these projects, it was too late to assess the integrity of the data that had been transformed from one system to the other.

Datasets need to be carefully examined and this takes a lot of time. Before a Submission Information Package (SIP) can be ingested, a lot has to be done. There are many risks involved and many steps need special attention. The OAIS Ingest phase needs a data conversion plan that serves as an organizational tool (for management, producer, client, host system, project, quality assurance, auditor, etc.) in managing the ingest projects. For the successful completion of a SIP, its handling and management need to be well organized. The separation between the pre-ingest and ingest phases is not sharp enough in OAIS. Before a SIP can be taken in, precautionary measures need to be taken in the pre-ingest phase and every step needs to be well documented.

The most relevant topics to include in pre-ingest, so that risks can be minimized, are: policy, integrity, management, back-up, availability, tooling software, testing, quality control, file conversion and metadata conversion.

POLICY, PLANNING AND DOCUMENTATION

If no measures are taken (and recorded) for the acquisition of a SIP, the organization runs the risk that the ingest process gets out of control, or that ultimately the desired result is not achieved and the project cannot be completed on time and within budget. Organizational arrangements are needed to ensure that, throughout the conversion process, management and project management can be brought in or can make adjustments in a proper manner. Procedural and technical measures should also ensure that the commitments made throughout the conversion process are recorded accurately, completely and in a timely manner.

INTEGRITY OF THE DATA

The risk is that the SIP needs mutations. Selecting the right conversion tools and proper controls is necessary. Problems with the timely availability of the data can lead to erroneous information concerning the data. When conversion is needed, there must be assurance that all data is accurately, completely and promptly included in the definitive database of the Archive Institution after the conversion. In my view all datasets and other SIP material need to be changed in one way or another (or at least the metadata does) to make ingest possible.

MANAGEMENT AND ACCESS TO ENVIRONMENTS

The organization should comply with legal requirements related to privacy-sensitive data. If unauthorized persons gain access to the SIP data, it may mean that privacy-sensitive data becomes public. The organization then fails to meet the legal requirements and runs the risk of claims. Organizational measures should ensure that, during the conversion process, only authorized persons have access to the SIP data and to the conversion tools and functionalities.

BACK-UP AND FALLBACK

If no decisive action has been taken to create backups of the database at all stages of the SIP process and there is no fallback scenario, the planned ingest process and the continuity of data processing can be in jeopardy. This may mean that operational and informational use of the application cannot occur, or does not occur on time. Procedural and technical measures must be taken to ensure that the data remains safeguarded at all times during and after the pre-ingest and ingest phases. Log information from both phases needs to be archived and given (as metadata) to the designated community, so that they have assurance that the information was handled correctly during both phases and can be TRUSTED!

AVAILABILITY OF THE NEW ENVIRONMENT

What about the situation in which the datasets in the SIP have to wait before they are taken into the archival system? When the required functions are not available during the data transfer and after the ingest conversion, and cannot be tested in the correct manner, the database may not be sufficiently usable. As a result, the continuity of data processing is in danger. Procedural and technical measures should be taken to ensure that the new information packages (for the benefit of the entire ingest range) are available for the ultimate ingest.

TOOLING SOFTWARE

A SIP conversion is such a specific process that standard conversion software alone does not suffice. Standard software or standard functionality of preservation systems (like Preservica) can help, but every SIP project is special and needs customization, and is therefore time consuming. If the software used for ingest tasks such as conversion, adding metadata and transfer does not operate properly, the organization runs the risk that the integrity of the data is no longer assured and/or that the ingest process is unnecessarily time consuming. Organizational and technical measures should be taken to ensure that the software in use functions properly during the entire ingest process.

TESTING

Incomplete or incorrect testing can lead to a converted database being implemented in the production environment even though it does not comply with the set requirements. This puts integrity, confidentiality, security and manageability at stake. Within the pre-ingest process one should work with a predetermined test plan. The plan should state in particular who will conduct the testing, who is responsible, and when, where and how testing will take place.

QUALITY CONTROL

If there is insufficient attention to compliance with the planning, execution and control cycle, the risk is that the conversion process becomes unmanageable and uncontrollable. A designated quality control officer, who oversees compliance with the planning, execution and control cycle, should monitor the pre-ingest process. Fewer problems can then be expected when ingesting SIPs.

FILE CONVERSION

If the quality of the data conversion and the checks carried out in the pre-ingest phase have not proven the reliability of the SIP, then the data should not enter the ingest phase. The results of the pre-ingest process and of the checks carried out must be recorded in a file conversion document.
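
The rule above (no proven reliability, no ingest, and everything recorded) could be sketched roughly as follows. The check names, the `CheckResult` class, the `conversion_record` function and the layout of the record are hypothetical; the comment does not prescribe what a file conversion document looks like.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CheckResult:
    name: str        # hypothetical label, e.g. "checksums verified"
    passed: bool
    note: str = ""


def conversion_record(sip_id, results):
    """Build a record of the pre-ingest checks and the go/no-go decision."""
    failed = [r.name for r in results if not r.passed]
    return {
        "sip_id": sip_id,
        "checks": [asdict(r) for r in results],
        "failed": failed,
        # No checks at all counts as "reliability not proven",
        # so the SIP is held back in that case too.
        "release_to_ingest": bool(results) and not failed,
    }


record = conversion_record(
    "example-sip-001",  # made-up identifier
    [
        CheckResult("checksums verified", True),
        CheckResult("test plan executed", False, "two test cases still open"),
    ],
)
print(json.dumps(record, indent=2))
```

The design choice here is that the record is produced whether or not the SIP is released, so a refused SIP leaves the same audit trail as an accepted one.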

Summary

In the pre-ingest phase things need to work out well before the SIP can enter the next phase of the OAIS ingest.

Things to consider: OAIS describes a Negotiate Submission Agreement function. A detailed description of what constitutes a Submission Agreement can be found in the OAIS-related standard, the Producer-Archive Interface Methodology Abstract Standard (PAIMAS). PAIMAS is also a tool for better handling the complex OAIS framework.

Treat the SIP like the most valuable asset of a bank: it must be kept forever!

December 2015, Ronald van der Steen

Flexible ingest?

The counterpoint to this discussion so far is the concept of Minimal Effort Ingest: "In Minimal Effort Ingest, we postpone the QA of data and metadata until after the data has been ingested and even further, if resources are not available. This approach makes it possible to secure the incoming data quickly." Although the approach has been characterised by delaying certain processes to a date after ingest, the essence of the concept is that ingest can be flexible to the needs of a particular dataset. I'm not sure if this muddies the water further, but it perhaps raises the need for clarification of what we mean by the terms for various workflow stages.

PRWheatley (talk) 12:04, 9 December 2015 (UTC)

Flexible (pre-)ingest (and preservation) continued

Adding to the comment from Paul Wheatley, I would say that in general QA is hard to fit into a definite location within the OAIS processes.

Namely, having worked together with many preservation institutions I've seen situations where:

  • focus on pre-ingest: an institution requires full QA to be done by the submitting institution, accepts the (clearly defined) SIP and ingests it more or less in the same form as an AIP (i.e. the SIP-to-AIP conversion + QA during ingest is minimal);
  • focus on repository: an institution receives data in whatever form the submitter is able to submit it, the data is loosely packaged as an SIP, ingested as an AIP and only then, while creating new revisions of the AIP, QA is done and the package updated.
  • and of course - anything in between is possible as well.

As an example, we ourselves let the data providers do some of the QA during pre-ingest but large chunks of it still take place during ingest: so we are somewhere in the middle of the extremes.

As such it seems to me that we need to talk more about the different tasks repositories need to do before data is fit for preservation, but not necessarily concentrate on whether these tasks are done during pre-ingest, ingest or preservation (or all of them). That in turn would require the new revision of OAIS to allow describing such "loose" functional (sub-)components, which you can plug into any of the core OAIS processes depending on the needs of your institution and the capabilities of the data provider.
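
One way to picture such "loose" components is a set of QA tasks that are defined once and then assigned to whichever phase suits the institution. The sketch below is only an illustration under that assumption: the task names, the package fields (`files`, `title`) and the two plans are made up, and nothing here comes from OAIS itself.

```python
from typing import Callable, Dict, List

Package = Dict[str, object]
Task = Callable[[Package], List[str]]  # a task returns a list of problems


def check_has_files(package: Package) -> List[str]:
    return [] if package.get("files") else ["package contains no files"]


def check_has_title(package: Package) -> List[str]:
    return [] if package.get("title") else ["descriptive metadata lacks a title"]


# The tasks are defined once, independently of any phase.
TASKS: Dict[str, Task] = {
    "has_files": check_has_files,
    "has_title": check_has_title,
}

# Two hypothetical institutions plug the same tasks into different phases.
PRE_INGEST_FOCUSED = {
    "pre-ingest": ["has_files", "has_title"],
    "ingest": [],
    "preservation": [],
}
REPOSITORY_FOCUSED = {
    "pre-ingest": [],
    "ingest": ["has_files"],
    "preservation": ["has_title"],
}


def run_phase(phase: str, plan: Dict[str, List[str]], package: Package) -> List[str]:
    """Run every task the plan assigns to this phase and collect the problems."""
    problems: List[str] = []
    for task_name in plan.get(phase, []):
        problems.extend(TASKS[task_name](package))
    return problems
```

With a package that has files but no title, the first plan reports the missing title during pre-ingest, while the second accepts the package and only reports it once the preservation phase runs.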

KuldarAas (talk) 13:58, 10 August 2016 (UTC)