Release Candidate
Profile: Newspaper
The newspaper profile supports the ingest of digitised newspaper content. It shows how to deal with multiple media files (in formats such as TIFF, ALTO XML and PDF) and the relationships between them and their metadata. It also allows to use the MODS metadata schema for descriptive metadata, which is the default for describing newspaper content.
Permalink: https://data.hetarchief.be/id/sip/1.1/newspaper
Example Directory structure
root_directory
│──manifest-md5.txt
│──bagit.txt
│
└──data
│──mets.xml
│──metadata
| |──descriptive (at least one of both files must be present)
| | └──dc.xml
| | └──mods.xml
| └──preservation
| └──premis.xml
│
└──representations
│──representation_1
│ │──mets.xml
│ │──data
│ | |──file_1.tiff
│ │ └──...
│ │
│ └──metadata
│ └──preservation
│ └──premis.xml
│
│──representation_2
│ │──mets.xml
│ │──data
│ | |──file_1.xml
│ │ └──...
│ │
│ └──metadata
│ └──preservation
│ └──premis.xml
│
└──representation_3
│──mets.xml
│──data
| └── file_1.pdf
│
└──metadata
└──preservation
└── premis.xml
Requirements
General
- A newspaper SIP MUST contain exactly one newspaper edition.
- The newspaper edition MUST be digitised per page, i.e. each TIFF and/or ALTO XML file contained in their respective representation directories MUST represent exactly one page.
- There MUST be exactly one IE present in the SIP, i.e. the newspaper edition.
- There MUST be preservation metadata at the package level in the
preservation/premis.xml
file. - There MUST be preservation metadata at the representation level in the respective
preservation/premis.xml
files. - Preservation metadata in the SIP MUST be limited to the PREMIS metadata schema.
- There MAY be descriptive metadata at the representation level (e.g. information about the representations, such as a title or a description).
Package METS
- The
csip:CONTENTINFORMATIONTYPE
attribute MUST be set tohttps://data.hetarchief.be/id/sip/1.1/newspaper
. - The
mets/dmdSec/mdRef/@MDTYPE
attribute MUST be set toDC
orMODS
.
Package Descriptive Metadata
- Either a
descriptive/mods.xml
or adescriptive/dc.xml
descriptive metadata file MUST be present at the package level. In case they both occur, thedescriptive/dc.xml
file is ignored. - The
descriptive/dc.xml
file MUST follow the DCTERMS metadata schema in accordance with the basic profile requirements. - The
descriptive/mods.xml
file MUST follow the MODS metadata schema (v3.7.). - The
descriptive/mods.xml
file MUST contain a shared identifier with thepreservation/premis.xml
to indicate which PREMIS object is being described in thedescriptive/mods.xml
file. - The MODS metadata in
descriptive/mods.xml
MUST be limited to the elements and attributes outlined below.
Element | mods:mods |
---|---|
Name | MODS root element |
Description | This root element MUST contain the XML schema namespace of MODS. It MUST NOT contain any other XML schema namespaces besides MODS. |
Cardinality | 1..1 |
Obligation | MUST |
Attribute | mods:mods/@version |
---|---|
Name | MODS version attribute |
Description | This attribute indicates which version of MODS is being used. It MUST be set to 3.7 to indicate conformance with MODS v3.7. |
Datatype | String |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:titleInfo[not(@*)]/mods:title |
---|---|
Name | MODS title element |
Description | This element contains the title of the newspaper. Its parent element ( <mods:titleInfo/> ) MUST NOT contain any attributes in order to differentiate from other optional <mods:titleInfo/> elements which, if present, MUST contain @type attributes to indicate e.g. alternative titles for the newspaper. |
Datatype | String |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:identifier[not(@*)] |
---|---|
Name | MODS identifier element |
Description | A unique identifier for the newspaper edition. This identifier MUST be shared with the relevant PREMIS object in the preservation/premis.xml file.This metadata element MUST NOT contain any attributes. |
Datatype | ID |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:typeOfResource |
---|---|
Name | MODS type of resource element |
Description | This element indicates which type of resource is being described. Its value MUST be set to newspaper edition . |
Datatype | String |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:abstract |
---|---|
Name | A summary of the content of the newspaper. |
Description | This element contains a summary of the content of the newspaper. |
Datatype | String |
Cardinality | 0..1 |
Obligation | SHOULD |
Element | mods:mods/mods:subject/mods:topic |
---|---|
Name | A term or phrase representing the primary topic(s) on which the newspaper is focused. |
Description | This element contains a summary of the content of the newspaper. |
Datatype | String |
Cardinality | 0..1 |
Obligation | MAY |
Element | mods:mods/mods:name[@type="personal"] |
---|---|
Name | Name of a person |
Description | The name of a person associated with the newspaper. |
Cardinality | 0..1 |
Obligation | SHOULD |
Element | mods:mods/mods:name[@type="personal"]/namePart[@type="family"] |
---|---|
Name | Family name of a person |
Description | The family name of a person associated with the newspaper. |
Datatype | String |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:name[@type="personal"]/namePart[@type="given"] |
---|---|
Name | Given name of a person |
Description | The given name of a person associated with the newspaper. |
Datatype | String |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:name[@type="personal"]/role/roleTerm[@type="text"] |
---|---|
Name | Role of a person |
Description | Designates the relationship (role) of the person to the newspaper. |
Datatype | String |
Cardinality | 0..1 |
Obligation | MAY |
Element | mods:mods/mods:originInfo/mods:dateCreated[@encoding="edtf"] |
---|---|
Name | MODS creation date element |
Description | This element contains the date the newspaper edition was created. Its value MUST be EDTF-compliant, as indicated by the @encoding attribute which MUST be set to edtf . |
Datatype | EDTF |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:originInfo/mods:dateIssued[@encoding="edtf"] |
---|---|
Name | MODS issuance date element |
Description | This element contains the date the newspaper edition was issued. Its value MUST be EDTF-compliant, as indicated by the @encoding attribute which MUST be set to edtf . |
Datatype | EDTF |
Cardinality | 1..1 |
Obligation | MUST |
Element | mods:mods/mods:relatedItem[@type="series"]/mods:identifier[@type="abraham_id"] |
---|---|
Name | Abraham ID |
Description | This element contains the Abraham identifier taken from the Abraham Belgian Newspaper Catalog. Note that an Abraham identifier refers to newspaper titles rather than newspaper editions; multiple editions can therefore share the same Abraham identifier. This element MUST contain the @type attribute, with its value set to abraham_id . The @type attribute of its parent element (i.e. <mets:relatedItem/> ) MUST be set to series . |
Datatype | ID |
Cardinality | 0..1 |
Obligation | SHOULD |
Element | mods:mods/mods:relatedItem[@type="series"]/mods:identifier[@type="abraham_uri"] |
---|---|
Name | Abraham URI |
Description | This element contains the Abraham URI taken from the Abraham Belgian Newspaper Catalog. Note that an Abraham URI refers to newspaper titles rather than newspaper editions; multiple editions can therefore share the same Abraham URI. This element MUST contain the @type attribute, with its value set to abraham_uri . The @type attribute of its parent element (i.e. <mets:relatedItem/> ) MUST be set to series . Note that the Abraham URI contains the Abraham identifier. |
Datatype | URI |
Cardinality | 0..1 |
Obligation | SHOULD |
Element | mods:mods/mods:note[@type="license"] |
---|---|
Name | License element |
Description | This element MAY be used to add any licensing info needed. It MUST contain the @type attribute, with its value set to license . |
Datatype | String |
Cardinality | 0..* |
Obligation | MAY |
Package Preservation Metadata
- A preservation metadata file
preservation/premis.xml
MUST be present at the package level. - The
preservation/premis.xml
file MUST follow the PREMIS metadata schema (v3.0.). - If the SIP contains ALTO XML files, the
preservation/premis.xml
file MUST contain a PREMIS event of typetranscription
to link the TIFF and ALTO XML files. With this event, the representation containing the TIFF files MUST receive the PREMIS linking object rolesource
and the representation containing the ALTO XML files MUST receive the PREMIS linking object roleoutcome
. See the section about PREMIS events and example 1 below for more information about the structure of PREMIS events. - If the SIP contains a PDF file (which SHOULD contain all pages of the newspaper edition, cf. supra, the
preservation/premis.xml
file MUST contain a PREMIS event of typecreation
to link the TIFF and ALTO XML files to the PDF file. With this event, the two representations containing the TIFF and the ALTO XML files MUST receive the PREMIS linking object rolesource
and the representation containing the PDF file MUST receive the PREMIS linking object roleoutcome
. See the section about PREMIS events and example 1 below for more information about the structure of PREMIS events.
Example 1: a PREMIS transcription event (linking the TIFF and ALTO XML files)
<premis:premis version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:premis="http://www.loc.gov/premis/v3" xsi:schemaLocation="http://www.loc.gov/premis/v3 https://www.loc.gov/standards/premis/premis.xsd">
[...]
<premis:event>
<premis:eventIdentifier>
<premis:eventIdentifierType>UUID</premis:eventIdentifierType>
<premis:eventIdentifierValue>uuid-34ae79f8-a8e7-4768-a269-4d6d895662d6</premis:eventIdentifierValue>
</premis:eventIdentifier>
<premis:eventType>transcription</premis:eventType>
<premis:eventDateTime>2022-02-16T10:01:15.014+02:00</premis:eventDateTime>
<premis:eventDetailInformation>
<premis:eventDetail>Generate ALTO XML from TIFF via OCR</premis:eventDetail>
</premis:eventDetailInformation>
<premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifierType>UUID</premis:linkingObjectIdentifierType>
<premis:linkingObjectIdentifierValue>uuid-d8fd6dde-53a5-4614-823c-32f64588efe6</premis:linkingObjectIdentifierValue>
<premis:linkingObjectRole>source</premis:linkingObjectRole>
</premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifierType>UUID</premis:linkingObjectIdentifierType>
<premis:linkingObjectIdentifierValue>uuid-1fca6190-a4bd-4773-8529-272b9e7d536a</premis:linkingObjectIdentifierValue>
<premis:linkingObjectRole>outcome</premis:linkingObjectRole>
</premis:linkingObjectIdentifier>
</premis:event>
[...]
</premis:premis>
Example 2: a PREMIS creation event (linking the TIFF, ALTO XML and PDF files)
<premis:premis version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:premis="http://www.loc.gov/premis/v3" xsi:schemaLocation="http://www.loc.gov/premis/v3 https://www.loc.gov/standards/premis/premis.xsd">
[...]
<premis:event>
<premis:eventIdentifier>
<premis:eventIdentifierType>UUID</premis:eventIdentifierType>
<premis:eventIdentifierValue>uuid-16a5c827-e513-4ec5-ad75-f75c7b9bde1f</premis:eventIdentifierValue>
</premis:eventIdentifier>
<premis:eventType>creation</premis:eventType>
<premis:eventDateTime>2022-02-16T10:01:15.014+02:00</premis:eventDateTime>
<premis:eventDetailInformation>
<premis:eventDetail>Generate PDF from ALTO XML and TIFF</premis:eventDetail>
</premis:eventDetailInformation>
<premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifierType>UUID</premis:linkingObjectIdentifierType>
<premis:linkingObjectIdentifierValue>uuid-d8fd6dde-53a5-4614-823c-32f64588efe6</premis:linkingObjectIdentifierValue>
<premis:linkingObjectRole>source</premis:linkingObjectRole>
</premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifierType>UUID</premis:linkingObjectIdentifierType>
<premis:linkingObjectIdentifierValue>uuid-1fca6190-a4bd-4773-8529-272b9e7d536a</premis:linkingObjectIdentifierValue>
<premis:linkingObjectRole>source</premis:linkingObjectRole>
</premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifier>
<premis:linkingObjectIdentifierType>UUID</premis:linkingObjectIdentifierType>
<premis:linkingObjectIdentifierValue>uuid-3d371b39-90af-4655-91e9-d93c55f25da1</premis:linkingObjectIdentifierValue>
<premis:linkingObjectRole>outcome</premis:linkingObjectRole>
</premis:linkingObjectIdentifier>
</premis:event>
[...]
<premis:premis>
Representation METS
- If the files in a representation each correspond with a single page (e.g. the TIFF and ALTO XML files, since each of these files MUST correspond to a single page), the corresponding
<div/>
elements in the structural map MUST contain an@ORDER
attribute that indicates the sequence of the pages of the newspaper edition. Additionally, each<div/>
element that corresponds to a file representing a page MUST have a@TYPE
attribute that is set topage
. See example 3 below for more information.
Example 3: the structural map of a representation METS, with @TYPE
and @ORDER
attributes
<mets xmlns="http://www.loc.gov/METS/" xmlns:csip="https://DILCIS.eu/XML/METS/CSIPExtensionMETS" xmlns:sip="https://DILCIS.eu/XML/METS/SIPExtensionMETS" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" OBJID="representation_1" TYPE="Textual works - Print" PROFILE="https://earksip.dilcis.eu/profile/E-ARK-SIP.xml">
[...]
<structMap ID="uuid-04647bb4-f524-435b-b4bf-5fe7a926b9d4" TYPE="PHYSICAL" LABEL="CSIP">
<div ID="uuid-74e4335c-1d24-42bc-bbd0-864bd216d99c" LABEL="representation_1">
<div ID="uuid-60d4a0db-769c-42a9-8ef8-c395bb555803" LABEL="Metadata">
<div ID="uuid-e96b8688-e811-4dd8-83dc-81ae263b9c2a" LABEL="preservation">
<fptr FILEID="uuid-4482555d-aed7-4066-a211-44429a60a49a" />
</div>
</div>
<!-- order attributes for page order -->
<div ID="uuid-41bacec1-1d6c-467a-8020-7114115562a8" LABEL="Representations">
<div ID="uuid-47e52361-8508-4ae1-ad8c-0e1f5382065e" TYPE="page" ORDER="1">
<fptr FILEID="uuid-9850cb03-b1fd-4661-a4fb-e3dfcf25e9e5" />
</div>
<div ID="uuid-47e52361-8508-4ae1-ad8c-0e1f5382065e" TYPE="page" ORDER="2">
<fptr FILEID="uuid-3309e853-bf0f-4d19-ae6a-5e14911e3662" />
</div>
<div ID="uuid-eebd6f2a-f06e-4c5f-9c52-fd58e784eaff" TYPE="page" ORDER="3">
<fptr FILEID="uuid-4ef96979-4abf-4af0-8156-d04fdd2ff7c3" />
</div>
</div>
</div>
</structMap>
[...]
</mets>
Representation Preservation Metadata
- If ALTO XML files are present in the SIP, the
preservation/premis.xml
files of the representation containing the TIFF files and of the representation containing the ALTO XML files MUST contain a PREMIS relationship to establish a link between the two.- In the case of the representation with the TIFF files, this PREMIS relationship MUST be of type
derivation
and of subtypeis source of
. The@valueURI
attribute of the<premis:relationshipType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipType/der
. The@valueURI
attribute of the<premis:relationshipSubType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipSubType/iso
. Finally, a<premis:relatedEventIdentifier/>
element MUST be present that refers to the relevant event (in this case a transcription event) defined in thepreservation/premis.xml
file of the package level. This is shown in example 4 below. - In the case of the representation with the ALTO XML files, this PREMIS relationship MUST be of type
derivation
and of subtypehas source
. The@valueURI
attribute of the<premis:relationshipType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipType/der
. The@valueURI
attribute of the<premis:relationshipSubType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipSubType/hss
. Finally, a<premis:relatedEventIdentifier/>
element MUST be present that refers to the relevant event (in this case a transcription event) defined in thepreservation/premis.xml
file of the package level. This is shown in example 5 below.
- In the case of the representation with the TIFF files, this PREMIS relationship MUST be of type
- If a PDF file is present in the SIP, the
preservation/premis.xml
files of all three representations (i.e. of the TIFF files, of the ALTO XML file and of the PDF file) MUST contain a PREMIS relationship to establish a link between the three.- In the case of the representations with the TIFF and ALTO XML files, this PREMIS relationship MUST be of type
derivation
and of subtypeis source of
. The@valueURI
attribute of the<premis:relationshipType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipType/der
. The@valueURI
attribute of the<premis:relationshipSubType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipSubType/iso
. Finally, a<premis:relatedEventIdentifier/>
element MUST be present that refers to the relevant event (in this case a transcription event) defined in thepreservation/premis.xml
file of the package level. This is similar to example 4 shown below. - In the case of the representation with the PDF file, this PREMIS relationship MUST be of type
derivation
and of subtypehas source
. The@valueURI
attribute of the<premis:relationshipType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipType/der
. The@valueURI
attribute of the<premis:relationshipSubType>
element MUST be set tohttp://id.loc.gov/vocabulary/preservation/relationshipSubType/hss
. Finally, a<premis:relatedEventIdentifier/>
element MUST be present that refers to the relevant event (in this case a transcription event) defined in thepreservation/premis.xml
file of the package level. This is similar to example 5 below, the difference being that the relationship will mostly entail multiple<premis:relatedObjectIdentifier/>
elements since the PDF is derived from all TIFF and ALTO XML files together.
- In the case of the representations with the TIFF and ALTO XML files, this PREMIS relationship MUST be of type
Example 4: a PREMIS is source of
relationship
<premis:premis version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:premis="http://www.loc.gov/premis/v3" xsi:schemaLocation="http://www.loc.gov/premis/v3 https://www.loc.gov/standards/premis/premis.xsd">
[...]
<premis:relationship>
<premis:relationshipType authority="relationshipType" authorityURI="http://id.loc.gov/vocabulary/preservation/relationshipType" valueURI="http://id.loc.gov/vocabulary/preservation/relationshipType/der">derivation</premis:relationshipType>
<premis:relationshipSubType authority="relationshipSubType" authorityURI="http://id.loc.gov/vocabulary/preservation/relationshipSubType" valueURI="http://id.loc.gov/vocabulary/preservation/relationshipSubType/iso">is source of</premis:relationshipSubType>
<premis:relatedObjectIdentifier>
<premis:relatedObjectIdentifierType>UUID</premis:relatedObjectIdentifierType>
<premis:relatedObjectIdentifierValue>uuid-3d2dfddb-6348-43b7-865d-ba9a12ef5c79</premis:relatedObjectIdentifierValue>
</premis:relatedObjectIdentifier>
<premis:relatedEventIdentifier>
<premis:relatedEventIdentifierType>UUID</premis:relatedEventIdentifierType>
<premis:relatedEventIdentifierValue>uuid-34ae79f8-a8e7-4768-a269-4d6d895662d6</premis:relatedEventIdentifierValue>
</premis:relatedEventIdentifier>
</premis:relationship>
[...]
</premis:premis>
Example 5: a PREMIS has source
relationship
<premis:premis version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:premis="http://www.loc.gov/premis/v3" xsi:schemaLocation="http://www.loc.gov/premis/v3 https://www.loc.gov/standards/premis/premis.xsd">
[...]
<premis:relationship>
<premis:relationshipType authority="relationshipType" authorityURI="http://id.loc.gov/vocabulary/preservation/relationshipType" valueURI="http://id.loc.gov/vocabulary/preservation/relationshipType/der">derivation</premis:relationshipType>
<premis:relationshipSubType authority="relationshipSubType" authorityURI="http://id.loc.gov/vocabulary/preservation/relationshipSubType" valueURI="http://id.loc.gov/vocabulary/preservation/relationshipSubType/hss">has source</premis:relationshipSubType>
<premis:relatedObjectIdentifier>
<premis:relatedObjectIdentifierType>UUID</premis:relatedObjectIdentifierType>
<premis:relatedObjectIdentifierValue>uuid-1711cd43-19d2-4d89-9259-17443fc7d75f</premis:relatedObjectIdentifierValue>
</premis:relatedObjectIdentifier>
<premis:relatedEventIdentifier>
<premis:relatedEventIdentifierType>UUID</premis:relatedEventIdentifierType>
<premis:relatedEventIdentifierValue>uuid-34ae79f8-a8e7-4768-a269-4d6d895662d6</premis:relatedEventIdentifierValue>
</premis:relatedEventIdentifier>
</premis:relationship>
[...]
</premis:premis>
Validation
The XML files that are required by this profile can be validated using the following XML schema definitions:
File | Format | XML Schema |
mets.xml | METS v1.12.1 | mets.xsd |
premis.xml | PREMIS v3.0 | premis-v3-0.xsd |
mods.xml | MODS v3.8 | mods-3-8.xsd |
dc.xml (if no MODS) | Dublin Core (custom schema) | dc_basic.xsd depends on: edtf.xsd, dcterms.xsd, dcmitype.xsd, dc.xsd |
Use Cases
Some use cases that implement this profile are: