Complex Objects Overview

Complex objects are an advanced ingest workflow in MediaHaven that is not available to all customers.

Introduction

The complex objects workflow was conceived as a means of submitting a collection of related files in a single submission (SIP). An example is a particular edition of a newspaper, with a set of files for each digitized page: a high-resolution TIFF, an ALTO OCR file and a JPEG2000 access copy. This collection of files is submitted as a single ZIP archive. In addition, a metadata file in the METS format describes the entire collection. The METS metadata file lists the files contained in the archive and provides for each file its relative path in the archive, an MD5 checksum, and optional descriptive metadata and events. Depending on the variant of the workflow, the METS metadata file is provided as a sidecar file alongside the archive or is contained within the archive itself.

Step-by-step

The steps below are very similar to those of the standard workflow in MediaHaven. The difference is that several steps are repeated for each unpacked file, while the step that would physically store the unpacked file on the long term storage is skipped.

  • 1) Parse & validate (using XSD) the sidecar METS file

  • 2) Unpack the archive into a temporary folder providing access to the files within

Perform steps 3 to 7 for each unpacked file in the archive and for the archive itself:

  • 3) Calculate the MD5

  • 4) Compare the calculated MD5 with the MD5 in the sidecar XML file

  • 5) Extract the technical and descriptive metadata (includes the file type)

  • 6) Merge the delivered metadata (only the part linked with this file) with the extracted metadata

  • 7) Create the file in MediaHaven and save the metadata

  • 8) Copy the archive to the long term storage
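
Steps 2 to 4 above can be sketched in Python as follows. This is a minimal illustration, not MediaHaven's implementation: the function names `md5_of` and `verify_archive` are hypothetical, and the expected checksums are passed in as a plain dictionary, whereas the real workflow reads them from the METS file.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(archive: Path, archive_md5: str, file_md5s: dict[str, str]) -> list[str]:
    """Return the names whose calculated MD5 does not match the expected MD5.

    `file_md5s` maps relative paths in the archive to expected checksums
    (hypothetical input; in the workflow these values come from the METS).
    """
    mismatches = []
    # The archive itself is checked as well as the files it contains.
    if md5_of(archive) != archive_md5.lower():
        mismatches.append(archive.name)
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(tmp)  # step 2: unpack into a temporary folder
        for rel_path, expected in file_md5s.items():
            target = Path(tmp) / rel_path
            # steps 3 and 4: calculate the MD5 and compare it;
            # a file that is absent from the archive also counts as a failure
            if not target.is_file() or md5_of(target) != expected.lower():
                mismatches.append(rel_path)
    return mismatches
```

An empty result means every checksum matched. Any entry in the list corresponds to the situation in which the archive would be rejected.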

Validation

On Confluence you can find a sample METS file and the XSD for validating the METS you create: Complex Objects Reference.

Detailed description of the METS format

The sections below describe in detail how the information (MD5, metadata) is structured in the METS file. See Complex Objects Reference for detailed information.

Files in the SIP

The section of the METS tagged fileSec is called the file section. It lists all files in the archive as well as the archive itself. For each file, the following information is given:

  1. Relative path of the file in the SIP

  2. MD5 checksum of the file

  3. Link to descriptive metadata in the METS

  4. Intended use of the file (i.e. storage strategy)
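
The four items above can be read from a file section with a few lines of Python. The sketch below is an assumption-laden illustration: the sample XML follows generic METS conventions (`CHECKSUM` on `file`, the path in `xlink:href` of `FLocat`) and is not copied from the MediaHaven reference, the checksums are placeholders, and the names `FileEntry` and `read_file_section` are hypothetical. Because the exact element carrying `ADMID` and `USE` may differ, the code looks on the `file` element first and then on its enclosing `fileGrp`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

METS_NS = "http://www.loc.gov/METS/"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample with placeholder checksums.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="PRESERVATION" ADMID="METADATA-SIP">
      <mets:file ID="FILE-SIP" CHECKSUM="00000000000000000000000000000001" CHECKSUMTYPE="MD5">
        <mets:FLocat LOCTYPE="URL" xlink:href="newspaper.zip"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="VIRTUAL" ADMID="METADATA-ALTO">
      <mets:file ID="FILE-ALTO" CHECKSUM="00000000000000000000000000000002" CHECKSUMTYPE="MD5">
        <mets:FLocat LOCTYPE="URL" xlink:href="page1/alto.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


@dataclass
class FileEntry:
    path: str            # relative path of the file in the SIP
    md5: Optional[str]   # MD5 checksum of the file
    admid: list[str]     # IDs of the linked metadata sections
    use: Optional[str]   # intended use (storage strategy), if given


def read_file_section(mets_xml: str) -> list[FileEntry]:
    """Return one FileEntry per file, in the order used in the METS."""
    root = ET.fromstring(mets_xml)
    entries = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        for file_el in group.findall(f"{{{METS_NS}}}file"):
            locat = file_el.find(f"{{{METS_NS}}}FLocat")
            if locat is None:
                continue  # no path given, nothing to describe
            admid = file_el.get("ADMID") or group.get("ADMID") or ""
            entries.append(FileEntry(
                path=locat.get(XLINK_HREF, ""),
                md5=file_el.get("CHECKSUM"),
                admid=admid.split(),
                use=file_el.get("USE") or group.get("USE"),
            ))
    return entries
```

Calling `read_file_section(SAMPLE_METS)` yields two entries, one for the archive and one for the ALTO file, each carrying its path, checksum, metadata IDs and `USE` value.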

Metadata

For the SIP and for each file in the SIP, the attribute ADMID of the tag fileGrp or fileSec optionally provides a whitespace-separated list of IDs, each referring to the sourceMD tag with the matching ID. These sourceMD tags contain wrapped metadata in the standard metadata format shown above. In the example below, one ADMID attribute refers via the ID METADATA-SIP to a metadata section that provides a title and description for the archive. Similarly, another ADMID attribute refers via the ID METADATA-ALTO to metadata about an ALTO XML file.
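
Resolving an ADMID value to its metadata sections amounts to splitting on whitespace and looking up each ID. The following Python sketch illustrates this under stated assumptions: the sample uses the generic METS nesting (`amdSec`, `sourceMD`, `mdWrap`, `xmlData`), the wrapped `title` and `description` elements are stand-ins for the actual MediaHaven metadata format, and the helper names `index_source_md` and `resolve_admid` are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical sample; the wrapped metadata elements are placeholders.
SAMPLE_AMD = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:amdSec>
    <mets:sourceMD ID="METADATA-SIP">
      <mets:mdWrap MDTYPE="OTHER">
        <mets:xmlData>
          <title>Example title</title>
          <description>Example description</description>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:sourceMD>
    <mets:sourceMD ID="METADATA-ALTO">
      <mets:mdWrap MDTYPE="OTHER">
        <mets:xmlData>
          <title>Example ALTO title</title>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:sourceMD>
  </mets:amdSec>
</mets:mets>
"""


def index_source_md(root: ET.Element) -> dict[str, ET.Element]:
    """Map each sourceMD ID to its element."""
    return {
        el.get("ID"): el
        for el in root.iter(f"{{{METS_NS}}}sourceMD")
        if el.get("ID")
    }


def resolve_admid(admid, index: dict[str, ET.Element]) -> list[ET.Element]:
    """Return the sourceMD elements named in a whitespace-separated ADMID.

    A missing ADMID yields an empty list; an unknown ID raises KeyError.
    """
    return [index[md_id] for md_id in (admid or "").split()]
```

For example, `resolve_admid("METADATA-SIP METADATA-ALTO", index_source_md(ET.fromstring(SAMPLE_AMD)))` returns both sections in the order given, and the title of the first can then be read with `find(".//title").text`.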

Storage strategy

Through the attribute USE, the file section (fileSec) specifies how each file is treated and stored. If the strategy does not apply to you, you can omit the attribute USE.

Possible values are:

  • PRESERVATION: Extract metadata, MIME type and media type, and preserve the file on the long term storage

  • VIRTUAL: Extract metadata, MIME type and media type but do not store the file on the long term storage (the file is packed in the archive which is preserved on the long term storage).

  • FIXITY: Only check the MD5 for this file but do not create a container in MediaHaven nor store it on the long term storage (the file is packed in the archive which is preserved on the long term storage).

By using PRESERVATION for the archive and VIRTUAL for each file in the archive, MediaHaven stores the archive on the long term storage while the files in the SIP are available as metadata-only entities. If you set PRESERVATION for an unpacked file, you override this and the individual file is also stored on the long term storage (in addition to the archive).

Depending on your plugin for the MediaHaven Ingest workflow, more USE values might be available.
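
The three documented USE values can be summarized as a small lookup table. The Python sketch below only restates the descriptions above. The names `Use`, `Plan` and `plan_for` are hypothetical, plugin-specific values are not modelled, and the behaviour for an omitted USE attribute is left out because it is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    PRESERVATION = "PRESERVATION"
    VIRTUAL = "VIRTUAL"
    FIXITY = "FIXITY"


@dataclass(frozen=True)
class Plan:
    check_md5: bool                   # compare the calculated MD5 with the METS
    extract_metadata: bool            # metadata, MIME type and media type
    create_record: bool               # create an entity in MediaHaven
    store_on_long_term_storage: bool  # store this file individually


PLANS = {
    Use.PRESERVATION: Plan(True, True, True, True),
    Use.VIRTUAL: Plan(True, True, True, False),
    Use.FIXITY: Plan(True, False, False, False),
}


def plan_for(use_value: str) -> Plan:
    """Return the plan for a documented USE value.

    Raises ValueError for any other value, including plugin-specific ones.
    """
    return PLANS[Use(use_value)]
```

With this table, the common setup reads as `plan_for("PRESERVATION")` for the archive and `plan_for("VIRTUAL")` for each file inside it.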

Fixity

In this format, the fixity information is particularly thorough because an MD5 checksum is provided for:

  1. The archive itself

  2. Every single file in the archive

If one of the calculated MD5 checksums does not match, the archive is rejected.

Requirements

Some extra requirements are added to improve reliability.

  1. Every file in the archive must be referenced by the METS. If files are not referenced or if a referenced file is missing, the archive is rejected. This ensures that the METS unambiguously describes the archive.

  2. The METS namespace must be used explicitly (with a prefix) instead of through a default namespace declaration.

  3. The order in which the files are ingested into MediaHaven is the order of the files described in the METS, not the order in the archive.
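
Requirement 1 can be checked before submission by comparing the contents of the ZIP archive with the paths referenced in the METS. The Python sketch below is a pre-flight check under stated assumptions, not the validation MediaHaven itself performs: `check_references` is a hypothetical helper, and `referenced` is assumed to hold only the relative paths of files inside the archive (not the entry for the archive itself).

```python
import zipfile


def check_references(archive, referenced):
    """Compare the files in a ZIP archive with the paths referenced in the METS.

    Returns a tuple (unreferenced, missing):
    - unreferenced: files present in the archive but not referenced
    - missing: referenced paths that are absent from the archive
    Either list being non-empty means the archive would be rejected.
    """
    with zipfile.ZipFile(archive) as zf:
        # Directory entries are not files and need no reference.
        in_archive = {info.filename for info in zf.infolist() if not info.is_dir()}
    referenced_set = set(referenced)
    unreferenced = sorted(in_archive - referenced_set)
    missing = sorted(referenced_set - in_archive)
    return unreferenced, missing
```

If the METS file is delivered inside the archive rather than as a sidecar, whether it must reference itself is not covered by this sketch and should be checked against the Complex Objects Reference.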