Complex Objects Overview
Complex objects are an advanced form of ingest workflow in MediaHaven that is not available to all customers.
Introduction
The complex objects workflow was conceived as a means of submitting a collection of related files in a single submission (SIP). For example, a submission of a particular edition of a newspaper contains a series of files for each digitized page: per page, a high resolution TIFF, an ALTO OCR file and a JPEG2000 access copy. This collection of files is submitted as a single ZIP archive. In addition, a metadata file structured in the METS format describes the entire collection. The METS metadata file lists the files contained in the archive and provides for each file its relative path in the archive, an MD5 checksum, and optional descriptive metadata and events. Depending on the variant of the workflow, the METS metadata file is provided as a sidecar file alongside the archive or is contained within the archive itself.
Step-by-step
The steps below closely follow the steps of the standard workflow in MediaHaven. The difference is that several steps are repeated for each unpacked file, while the step that would physically store the unpacked file on the long term storage is skipped.
1) Parse & validate (using XSD) the sidecar METS file
2) Unpack the archive into a temporary folder providing access to the files within
Perform steps 3 to 7 for each unpacked file within the archive and for the archive itself
3) Calculate the MD5
4) Compare the calculated MD5 with the MD5 in the sidecar XML file
5) Extract the technical and descriptive metadata (includes the file type)
6) Merge the delivered metadata (only the part linked with this file) with the extracted metadata
7) Create the file in MediaHaven and save the metadata
8) Copy the archive to the long term storage
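Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not MediaHaven's implementation: the function names `md5_of_file` and `unpack_and_checksum`, the `"<archive>"` key, and the idea that the relative paths have already been read from the METS file are all hypothetical.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Step 3: calculate the MD5 of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unpack_and_checksum(archive_path, relative_paths):
    """Steps 2 and 3: unpack the ZIP and checksum the listed files.

    relative_paths is the list of paths taken from the METS (assumed input);
    the result keeps that order. The archive itself is checksummed as well,
    under the hypothetical key "<archive>".
    """
    checksums = {"<archive>": md5_of_file(archive_path)}
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(tmp)  # step 2: temporary folder
        for rel_path in relative_paths:
            checksums[rel_path] = md5_of_file(Path(tmp) / rel_path)
    return checksums
```

The temporary folder is removed when the function returns, which mirrors the point that unpacked files are not stored individually. A path listed in the METS but absent from the archive raises `FileNotFoundError` in this sketch.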
Validation
On Confluence you can find a sample METS file and the XSD for validating the METS you create: Complex Objects Reference.
Detailed description of the METS format
The items below describe how the information (MD5, metadata) is structured in the METS file. See Complex Objects Reference for more information.
Files in the SIP
The section in the METS tagged fileSec is called the file section and provides a list of all files in the archive and the archive itself. For each file, the following information is given:
Relative path of the file in the SIP
MD5 checksum of the file
Link to descriptive metadata in the METS
Intended use of the file (i.e. storage strategy)
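As an illustration, the file section could be read with Python's standard library as sketched below. The sample fragment uses the standard METS element and attribute names (fileSec, fileGrp, file, FLocat, CHECKSUM, xlink:href). The exact layout MediaHaven expects is defined in Complex Objects Reference and may differ, so placing USE and ADMID on fileGrp is an assumption, and the file name and checksum are placeholder values.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
NS = {"mets": METS_NS}

# Placeholder sample; not taken from the MediaHaven reference files.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="VIRTUAL" ADMID="METADATA-ALTO">
      <mets:file ID="FILE-1" CHECKSUMTYPE="MD5"
                 CHECKSUM="0123456789abcdef0123456789abcdef">
        <mets:FLocat LOCTYPE="URL" xlink:href="page1/page1.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml):
    """Return one dict per file entry, in document order."""
    root = ET.fromstring(mets_xml)
    entries = []
    for group in root.findall("mets:fileSec/mets:fileGrp", NS):
        for file_el in group.findall("mets:file", NS):
            locat = file_el.find("mets:FLocat", NS)
            entries.append({
                "path": locat.get(f"{{{XLINK_NS}}}href"),    # relative path
                "md5": file_el.get("CHECKSUM"),              # MD5 checksum
                "admid": (group.get("ADMID") or "").split(), # metadata links
                "use": group.get("USE"),                     # storage strategy
            })
    return entries
```

The sketch only looks at top-level fileGrp elements and does not handle nested groups.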
Metadata
For the SIP and each file in the SIP, the attribute ADMID of the tag fileGrp or fileSec optionally provides a whitespace-separated list of IDs, each referring to the sourceMD tag with the matching ID. These sourceMD tags contain wrapped metadata in the standard metadata format shown above. In the example below, one attribute ADMID refers via the ID METADATA-SIP to a metadata section that provides a title and description for the archive. Similarly, another attribute ADMID refers via the ID METADATA-ALTO to metadata about an ALTO XML file.
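The ID lookup can be sketched as follows. This is an illustrative helper, not part of MediaHaven: the function name `resolve_admid` is hypothetical, the sample places the sourceMD tags in a standard METS amdSec, and the wrapped metadata inside each sourceMD is left out because its format is defined elsewhere.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Reduced sample: the wrapped metadata inside each sourceMD is omitted.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:amdSec>
    <mets:sourceMD ID="METADATA-SIP"/>
    <mets:sourceMD ID="METADATA-ALTO"/>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp ADMID="METADATA-SIP METADATA-ALTO"/>
  </mets:fileSec>
</mets:mets>
"""


def resolve_admid(root, admid):
    """Map each ID in a whitespace-separated ADMID value to its sourceMD.

    IDs without a matching sourceMD element map to None.
    """
    by_id = {el.get("ID"): el for el in root.iter(f"{{{METS_NS}}}sourceMD")}
    return {ref: by_id.get(ref) for ref in admid.split()}


root = ET.fromstring(SAMPLE_METS)
group = root.find(f"{{{METS_NS}}}fileSec/{{{METS_NS}}}fileGrp")
linked = resolve_admid(root, group.get("ADMID"))
```

Calling `split()` without arguments treats any run of spaces, tabs or newlines as a single separator, which matches a whitespace-separated list.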
Storage strategy
Through the attribute USE, the section fileSec specifies how to treat and store the file. If the strategy does not apply to you, you can omit the attribute USE.
Possible values are:
PRESERVATION: Extract metadata, MIME type and media type, and preserve the file on the long term storage.
VIRTUAL: Extract metadata, MIME type and media type, but do not store the file on the long term storage (the file is packed in the archive which is preserved on the long term storage).
FIXITY: Only check the MD5 for this file but do not create a container in MediaHaven nor store it on the long term storage (the file is packed in the archive which is preserved on the long term storage).
By using PRESERVATION for the archive and VIRTUAL for each file in the archive, MediaHaven stores the archive on the long term storage while the files in the SIP are available as metadata-only entities. If you set PRESERVATION for an unpacked file, you override this behaviour and the individual file is also stored on the long term storage (in addition to the archive).
Depending on the plugin used for your MediaHaven ingest workflow, additional USE values might be available.
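The three documented values can be summarised as a small lookup table. The sketch below is only a reading aid: the names `Use`, `Actions` and `actions_for` are hypothetical. It assumes the MD5 check applies to every value, as the step-by-step list suggests. It returns None when USE is omitted or plugin-specific, because the behaviour in those cases is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    PRESERVATION = "PRESERVATION"
    VIRTUAL = "VIRTUAL"
    FIXITY = "FIXITY"


@dataclass(frozen=True)
class Actions:
    check_md5: bool          # compare calculated MD5 with the METS
    extract_metadata: bool   # metadata, MIME type and media type
    create_record: bool      # create an entity in MediaHaven
    store_file: bool         # store this file on the long term storage


ACTIONS = {
    Use.PRESERVATION: Actions(True, True, True, True),
    Use.VIRTUAL: Actions(True, True, True, False),
    Use.FIXITY: Actions(True, False, False, False),
}


def actions_for(use_attribute):
    """Return the Actions for a USE value, or None if it is not documented."""
    try:
        return ACTIONS[Use(use_attribute)]
    except ValueError:
        # Omitted USE (None) or a plugin-specific value: not covered here.
        return None
```

For the typical setup described above, the archive would map to `actions_for("PRESERVATION")` and each file inside it to `actions_for("VIRTUAL")`.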
Fixity
In this format, fixity is verified at two levels, because an MD5 checksum is provided for:
The archive itself
Every single file in the archive
If one of the calculated MD5 checksums does not match, the archive is rejected.
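The rejection rule can be sketched as a comparison between two mappings. This is a minimal sketch under stated assumptions: `ArchiveRejected` and `check_fixity` are hypothetical names, both mappings are assumed to contain an entry for the archive itself as well as for every file, and comparing the hexadecimal digests case-insensitively is a choice made for this example.

```python
class ArchiveRejected(Exception):
    """Raised when at least one calculated MD5 differs from the METS."""


def check_fixity(expected, calculated):
    """Compare MD5 checksums from the METS with the calculated ones.

    Both arguments map a name (the archive or a relative file path) to a
    hexadecimal MD5 string. A missing or different checksum is a mismatch.
    """
    mismatches = [
        name
        for name, md5 in expected.items()
        if calculated.get(name, "").lower() != md5.lower()
    ]
    if mismatches:
        raise ArchiveRejected("MD5 mismatch for: " + ", ".join(mismatches))
```

A single mismatch is enough to reject the whole archive; the sketch collects all mismatches first so that the error message names every affected entry.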
Requirements
Some extra requirements are added to improve reliability.
Every file in the archive must be referenced by the METS. If files are not referenced or if a referenced file is missing, the archive is rejected. This ensures that the METS unambiguously describes the archive.
The METS namespace must be used with an explicit prefix instead of a default namespace declaration.
The files are ingested into MediaHaven in the order in which they are described in the METS, not in their order in the archive.
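The first requirement amounts to comparing two sets of paths. The sketch below shows one way to run that check before submitting. The function name `check_references` and its `ignore` parameter are hypothetical; `ignore` is only there for the variant in which the METS file sits inside the archive, since how that file is treated is not covered here.

```python
import io
import zipfile


def check_references(archive, mets_paths, ignore=()):
    """Return (unreferenced, missing) as two sorted lists of paths.

    unreferenced: files in the ZIP that the METS does not mention.
    missing: paths mentioned in the METS that are not in the ZIP.
    """
    with zipfile.ZipFile(archive) as zf:
        # Directory entries end with "/" and are not files.
        in_archive = {name for name in zf.namelist() if not name.endswith("/")}
    in_archive -= set(ignore)
    referenced = set(mets_paths)
    return sorted(in_archive - referenced), sorted(referenced - in_archive)


# Usage with an in-memory archive and placeholder file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("page1/page1.xml", b"<alto/>")
    zf.writestr("notes.txt", b"not in the METS")

unreferenced, missing = check_references(
    buffer, ["page1/page1.xml", "page1/page1.tif"]
)
```

An archive meets the requirement only when both lists are empty. In this usage example, `notes.txt` is unreferenced and `page1/page1.tif` is missing, so the archive would be rejected.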