SIPs

Introduction

We introduce an extension to the existing SIP model from Complex Objects 1.0. The file structure remains unchanged: a ZIP file with files organised in folders, containing a top-level METS file that describes the files in the ZIP. The structure of the METS has been changed to align with the METS standard and to be expressive enough to describe Record Trees.

MediaHaven can already export a complete record tree together with a METS XML file.

Retrieving the created records

When the top-level intellectual object is created, it gets a metadata field Structural.Relations.ContainedBy that contains the fragment ID of the SIP. On the SIP, the inverse relation Structural.Relations.Contains contains the fragment ID of the top-level intellectual object.

Record Trees

Complex Objects 2.0 can now describe any Record Tree that can be modelled in MediaHaven.

Example 1

  • Newspaper

    • Newspaper page 1

      • TIFF image

        • Original representation

      • JP2 image

        • Original representation

    • Newspaper page 2

      • TIFF image

        • Original representation

      • JP2 image

        • Original representation

    • ALTO XML for the entire newspaper

Example 2

  • Dossier

    • Document 1 (e-mail)

      • Original representation (e-mail)

      • Email attachment 1

        • Original representation

      • Email attachment 2

        • Original representation

    • Document 2

      • Original representation

METS

The SIP contains a metadata sidecar file in the METS format that describes the entire SIP. Because the METS file can therefore be quite large, it is stored as an additional representation under the original representation of the SIP. When the original representation is removed as part of the lifecycle of the SIP, this additional representation is removed as well.

SIP

The structure of the SIP is unchanged from the old Complex Objects 1.0.

Requirements

Unchanged from Complex Objects Reference

The Complex Ingest workflow imposes the following requirements, which go beyond well-formed XML and validation against the provided XSDs.

  1. Every file in the archive must be referenced by the METS. If files are not referenced or if a referenced file is missing, the entire archive is rejected.

  2. The MD5 checksums provided in the XML are compared against the calculated MD5 checksums. The entire archive is rejected if one check fails.

  3. The file paths used in the METS are paths relative to the root of the accompanying archive.
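The requirements above can be checked locally before uploading. The following is a minimal sketch, not the validation MediaHaven itself performs. It assumes the paths and MD5 checksums have already been read from the METS into a dictionary (the hypothetical `expected_md5` argument), that the sidecar is named `mets.xml` (also hypothetical), and that the sidecar itself does not need to be referenced.

```python
import hashlib
import zipfile


def check_archive(zip_path, expected_md5, sidecar_name="mets.xml"):
    """Return a list of problems; an empty list means all checks passed.

    expected_md5 maps archive-relative paths to hex MD5 digests, i.e. the
    information one would read from the METS (hypothetical input).
    sidecar_name is an assumed name for the METS file in the ZIP root.
    """
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries and the sidecar itself (assumption).
        present = {
            info.filename
            for info in archive.infolist()
            if not info.is_dir() and info.filename != sidecar_name
        }
        referenced = set(expected_md5)

        # Requirement 1: referenced and present files must match exactly.
        for path in sorted(referenced - present):
            problems.append(f"missing from archive: {path}")
        for path in sorted(present - referenced):
            problems.append(f"not referenced by METS: {path}")

        # Requirement 2: compare the provided MD5 with the calculated MD5.
        for path in sorted(present & referenced):
            digest = hashlib.md5()
            with archive.open(path) as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected_md5[path].lower():
                problems.append(f"MD5 mismatch: {path}")
    return problems
```

Because the dictionary keys are compared directly with the names inside the ZIP, they must be relative to the archive root, as requirement 3 states.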

Rules

Lifecycle

See also MediaHaven Record Status for an overview of statuses.

| Status | Meaning | Storage: what happens to the original representation of the SIP? |
| --- | --- | --- |
| New | The SIP has been created but not yet picked up. | |
| Processing | The SIP workflow has started and the SIP is being processed. | |
| Published (< 25.2), Accepted (since 25.2) | The SIP and all its content have been successfully ingested. In older versions the status was Published; it has been replaced with Accepted since 25.2. | Before 24.3: the SIP was logically deleted at this point. From 24.3: the SIP is no longer logically deleted; its original representation is permanently deleted instead. |
| Rejected | The SIP itself contains invalid (meta)data, or one of the files from the SIP was rejected during ingest. | Same behaviour as Accepted. |

Rejections

A SIP is rejected for any of the following reasons, which concern the ZIP file itself or the embedded METS XML.

| # | Message | Meaning |
| --- | --- | --- |
| 1 | Root directory in SIP does not contain any XML file | The METS XML is not found inside the SIP. |
| 2 | Root directory in SIP contains multiple XML files | The system cannot unambiguously determine the METS XML because multiple XML files are present in the root directory. |
| 3 | Corrupt ZIP | The file is not a valid ZIP file. |
| 4 | Could not list files in ZIP | The ZIP file has become corrupted or truncated. |
| 5 | Could not extract sidecar file from SIP | The sidecar file is corrupted. |
| 6 | Could not open sidecar file | The sidecar file is corrupted. |
| 7 | Invalid METS sidecar XML | The METS XML is not well-formed XML, or it does not comply with the METS XSD. |
| 8 | Sip does not contain the following N files described in METS | One or more files described by the METS XML are not physically present in the ZIP file. |
| 9 | Sip has the following extra N files not described in METS | One or more files not described by the METS XML are present in the ZIP file. |
| 10 | File A contains an unrecognised use X only Y are supported | The attribute USE contains an invalid value. See METS. |
| 11 | Sip has the following file X whose MD5 Y is already present in organisation Z | When duplicate files are not allowed (the default), a validation error is raised even before the file is extracted. |
| 12 | Sip has the object %s/%s whose ExternalId %s is already present (ArchiveStatus: processing or finished) | The metadata field Administrative.ExternalId must be unique (at most 1 non-deleted object with the same value). |
| 13 | ExternalId MUST BE unique within the METS, duplicated ExternalId(s) are | The METS XML contains multiple instances of the same value for Administrative.ExternalId, which would violate the unique constraint. |
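Some of these rejections can be anticipated before upload. The sketch below illustrates two of them under stated assumptions: locating the single XML file in the ZIP root (messages 1 and 2) and finding duplicated Administrative.ExternalId values (message 13). The function names are hypothetical, the case-insensitive `.xml` match is an assumption about how the sidecar is recognised, and the list of ExternalId values is assumed to have been extracted from the METS beforehand.

```python
import zipfile
from collections import Counter


def find_root_xml(zip_path):
    """Return the name of the single XML file in the ZIP root, or raise ValueError."""
    # zipfile.BadZipFile is raised here if the file is not a valid ZIP.
    with zipfile.ZipFile(zip_path) as archive:
        root_xml = [
            name
            for name in archive.namelist()
            # Entries in the root directory have no "/" in their name.
            if "/" not in name and name.lower().endswith(".xml")
        ]
    if not root_xml:
        raise ValueError("Root directory in SIP does not contain any XML file")
    if len(root_xml) > 1:
        raise ValueError("Root directory in SIP contains multiple XML files")
    return root_xml[0]


def duplicated_external_ids(external_ids):
    """Return the sorted ExternalId values that occur more than once."""
    counts = Counter(external_ids)
    return sorted(value for value, count in counts.items() if count > 1)
```

For example, `duplicated_external_ids(["a", "b", "a"])` returns `["a"]`, which would trigger message 13 if those values came from one METS file.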

Beyond this, the SIP can be rejected because individual objects are rejected during ingest:

| # | Check | Meaning |
| --- | --- | --- |
| 1 | Duplicate MD5 | When the MD5 is not provided in the METS XML, the dynamic calculation during the ingest can still raise a validation error. |
| 2 | Virus scanning | A virus was detected. |
| 3 | Format | See Formats. The format is blacklisted. The format could not be determined, and the system forbids this. |
| 4 | Validation | A Validation Module determined that the object is not valid. |
| 5 | Transformation | If configured as strict, failed transformations lead to rejection. |

Limits (since 25.2)

For advanced installations, these limits can have different (higher) values.

The following limits apply:

| Limit | Value |
| --- | --- |
| Total size | 250 GB |
| Number of files in SIP | 10000 |

The SIP will be rejected if these limits are exceeded.
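A pre-upload check against these limits could look like the sketch below. It rests on several assumptions the text does not settle: that "Total size" means the sum of the uncompressed file sizes, that 1 GB is 10^9 bytes, and that the METS sidecar counts towards the number of files. The function name and its parameters are hypothetical; the defaults mirror the values in the table and can be raised for installations with higher limits.

```python
import zipfile

MAX_TOTAL_BYTES = 250 * 10**9  # assumption: 1 GB = 10^9 bytes
MAX_FILES = 10_000


def exceeded_limits(zip_path, max_total_bytes=MAX_TOTAL_BYTES, max_files=MAX_FILES):
    """Return the names of the limits that the ZIP exceeds (empty list if none)."""
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries are not files and are left out of both counts.
        files = [info for info in archive.infolist() if not info.is_dir()]

    # Assumption: the limit applies to the summed uncompressed sizes.
    total_bytes = sum(info.file_size for info in files)

    exceeded = []
    if total_bytes > max_total_bytes:
        exceeded.append("total size")
    if len(files) > max_files:
        exceeded.append("number of files")
    return exceeded
```

A value exactly equal to a limit is treated as allowed, since the text only states that exceeding a limit leads to rejection.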