Data Entities
Table of contents
- Referencing files and folders from the Root Data Entity
- Core Metadata for Data Entities
- Web-based Data Entities
The primary purpose for RO-Crate is to gather and describe a set of Data entities in the form of:
- Files
- Directories
- Web resources
The data entities can be further described by referencing contextual entities such as persons, organizations and publications.
Referencing files and folders from the Root Data Entity
Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property. Directory hierarchies MAY be represented with nested Dataset Data Entities, or the Root Dataset MAY refer to files anywhere in the hierarchy using hasPart.
Data Entities representing files MUST have "File" as a value for @type. File is an RO-Crate alias for http://schema.org/MediaObject. The term File here is liberal, and includes “downloadable” resources where @id is an absolute URI.
Data Entities representing directories MUST be of "@type": "Dataset". The term directory here includes HTTP file listings where @id is an absolute URI; however, “external” directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate). It follows that the RO-Crate Root is itself a data entity.
Data Entities can also be other types, for instance an online database. These SHOULD have a @type of CreativeWork (or one of its subtypes) and typically have an @id which is an absolute URI.
In all cases, @type MAY be an array in order to also specify a more specific type, e.g. "@type": ["File", "ComputationalWorkflow"].
There is no requirement to represent every file and folder in an RO-Crate as Data Entities in the RO-Crate JSON-LD.
Example linking to a file and folders
<RO-Crate root>/
| ro-crate-metadata.json
| cp7glop.ai
| lots_of_little_files/
| | file1
| | file2
| | ...
| | file54
An example RO-Crate JSON-LD for the above would be as follows:
{ "@context": "https://w3id.org/ro/crate/1.2-DRAFT/context",
"@graph": [
{
"@type": "CreativeWork",
"@id": "ro-crate-metadata.json",
"conformsTo": {"@id": "https://w3id.org/ro/crate/1.2-DRAFT"},
"about": {"@id": "./"}
},
{
"@id": "./",
"@type": [
"Dataset"
],
"hasPart": [
{
"@id": "cp7glop.ai"
},
{
"@id": "lots_of_little_files/"
}
]
},
{
"@id": "cp7glop.ai",
"@type": "File",
"name": "Diagram showing trend to increase",
"contentSize": "383766",
"description": "Illustrator file for Glop Pot",
"encodingFormat": "application/pdf"
},
{
"@id": "lots_of_little_files/",
"@type": "Dataset",
"name": "Too many files",
"description": "This directory contains many small files, that we're not going to describe in detail."
}
]
}
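The linking rule above (every file or folder described in the JSON-LD must be reachable from the Root Data Entity through hasPart) can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name unlinked_data_entities is hypothetical, it assumes the metadata descriptor uses the @id ro-crate-metadata.json as in the example, and it treats every File or Dataset in the graph as a data entity.

```python
def unlinked_data_entities(crate):
    """Return the @ids of File/Dataset entities not reachable from the root via hasPart."""
    entities = {entity["@id"]: entity for entity in crate["@graph"]}
    # Assumption: the descriptor is named as in the example and points to the root via "about".
    root_id = entities["ro-crate-metadata.json"]["about"]["@id"]

    def as_list(value):
        # JSON-LD values may be absent, a single value, or an array.
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    # Depth-first traversal over hasPart, starting at the Root Data Entity.
    reachable = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in reachable:
            continue
        reachable.add(current)
        for part in as_list(entities.get(current, {}).get("hasPart")):
            stack.append(part["@id"])

    def is_file_or_dataset(entity):
        types = as_list(entity.get("@type"))
        return "File" in types or "Dataset" in types

    return sorted(
        entity_id
        for entity_id, entity in entities.items()
        if is_file_or_dataset(entity) and entity_id not in reachable
    )
```

An empty result means every described file and folder is linked, directly or through nested Dataset entities. Files that are simply not described in the JSON-LD are not reported, which matches the rule that not every file needs a Data Entity.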
Adding detailed descriptions of encodings
The above example provides a media type for the file cp7glop.ai, which is useful as it may not be apparent from the extension alone that the file is readable as a PDF file. To add more detail, encodings SHOULD be linked using a PRONOM identifier to a Contextual Entity with a @type array containing WebPage and Standard.
{
"@id": "cp7glop.ai",
"@type": "File",
"name": "Diagram showing trend to increase",
"contentSize": "383766",
"description": "Illustrator file for Glop Pot",
"encodingFormat": ["application/pdf", {"@id": "https://www.nationalarchives.gov.uk/PRONOM/fmt/19"}]
},
{
"@id": "https://www.nationalarchives.gov.uk/PRONOM/fmt/19",
"name": "Acrobat PDF 1.5 - Portable Document Format",
"@type": ["WebPage", "Standard"]
}
If there is no PRONOM identifier (and typically no media type string), then a contextual entity with a different URL as its @id MAY be used, e.g. the documentation page of a software’s file format. The contextual entity SHOULD NOT include Standard in its @type if the page does not sufficiently document the format. The @type SHOULD be WebPage, or MAY be WebPageElement to indicate a section of the page.
For example:
{
"@id": "traj.trr",
"@type": "File",
"name": "Trajectory",
"description": "Trajectory of molecular dynamics simulation using GROMACS",
"contentSize": "45512",
"encodingFormat": {"@id": "https://manual.gromacs.org/documentation/2021/reference-manual/file-formats.html#trr"}
},
{
"@id": "https://manual.gromacs.org/documentation/2021/reference-manual/file-formats.html#trr",
"@type": "WebPageElement",
"name": "GROMACS trajectory of a simulation (trr)"
}
If there is no web-accessible description for a file format it SHOULD be described locally in the dataset, for example in a Markdown file:
{
"@id": "some-file.some_extension",
"@type": "File",
"name": "Some file",
"description": "A file in a non-standard format",
"contentSize": "120",
"encodingFormat": ["text/plain", {"@id": "some_extension.md"}]
},
{
"@id": "some_extension.md",
"@type": ["File", "CreativeWork"],
"name": "Description of some_extension text-based file format",
"encodingFormat": "text/markdown"
}
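As the examples above show, encodingFormat can hold a media type string, a reference to a file format contextual entity, or an array mixing both. A consumer therefore has to normalize the value before using it. The following is a small sketch of one way to do that; the helper name split_encoding_format is hypothetical and not defined by RO-Crate.

```python
def split_encoding_format(value):
    """Split an encodingFormat value into (media type strings, format entity @ids)."""
    # A single string or a single {"@id": ...} reference is treated as a one-item array.
    items = value if isinstance(value, list) else [value]
    media_types = [item for item in items if isinstance(item, str)]
    format_ids = [
        item["@id"] for item in items if isinstance(item, dict) and "@id" in item
    ]
    return media_types, format_ids
```

For the cp7glop.ai entity this yields the media type application/pdf and the PRONOM URL separately, so the URL can then be looked up in the @graph to find the format's name and types.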
File format profiles
Some generic file formats like application/json may be specialized using a profile document that defines expectations for the file’s content as required by some applications. This is expressed by using conformsTo to reference a contextual entity with types CreativeWork and Profile:
{
"@id": "attributes.csv",
"@type": "File",
"encodingFormat": ["text/csv", {"@id": "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/18"}],
"conformsTo": {"@id": "https://docs.ropensci.org/dataspice/#create-spice"}
},
{
"@id": "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/18",
"@type": "WebPage",
"name": "Comma-Separated Values format (CSV)"
},
{
"@id": "https://docs.ropensci.org/dataspice/#create-spice",
"@type": ["CreativeWork", "Profile"],
"name": "dataspice CSV profile"
}
Profiles expressed in formal languages (e.g. XML Schema for validation) can have their own encodingFormat and conformsTo to indicate their file format.
The Metadata File Descriptor ro-crate-metadata.json is not a data entity, but is described with conformsTo to an implicit contextual entity for the RO-Crate specification, a profile of JSON-LD. RO-Crates themselves can be specialized using Profile Crates, specified with conformsTo on the root data entity.
Core Metadata for Data Entities
The sections below outline the properties that Data Entities, when present, MUST have to be minimally valid.
Encoding file paths
Note that all @id identifiers must be valid URI references. Care must be taken to express any relative paths using the / separator and correct casing, and to escape special characters like space (%20) and percent (%25). For instance, a File Data Entity from the Windows path Results and Diagrams\almost-50%.png becomes "@id": "Results%20and%20Diagrams/almost-50%25.png" in the RO-Crate JSON-LD.
In this document the term URI includes international IRIs; the RO-Crate Metadata File is always UTF-8 and international characters in identifiers SHOULD be written using native UTF-8 characters (IRIs); however, traditional URL encoding of Unicode characters with % MAY appear in @id strings. Example: "@id": "面试.mp4" is preferred over the equivalent "@id": "%E9%9D%A2%E8%AF%95.mp4".
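The escaping rules above can be sketched in Python as follows. This is an illustration only: the function name windows_path_to_id is hypothetical, the input is assumed to be a relative Windows path (no drive letter), and the choice to leave non-ASCII characters unescaped follows the stated preference for native UTF-8 identifiers.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def _escape_segment(segment):
    # Percent-encode ASCII characters that are not URI-safe (space, %, #, ...),
    # but keep non-ASCII characters as native UTF-8, as preferred for IRIs.
    return "".join(ch if ord(ch) > 127 else quote(ch, safe="") for ch in segment)


def windows_path_to_id(path):
    """Convert a relative Windows path into a relative URI path usable as an @id."""
    # PureWindowsPath splits on both backslash and slash, on any operating system.
    parts = PureWindowsPath(path).parts
    return "/".join(_escape_segment(part) for part in parts)
```

Escaping each path segment separately keeps the / separators intact while still encoding a literal percent sign as %25. Casing is passed through unchanged, so the input must already match the casing of the file on disk.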
File Data Entity
A File Data Entity MUST have the following properties:
- @type: MUST be File, or an array where File is one of the values.
- @id: MUST be either a URI Path relative to the RO-Crate root, or an absolute URI.
Additionally, File entities SHOULD have:
- name giving a human readable name (not necessarily the filename)
- description giving a longer description, e.g. the role of this file within this crate
- encodingFormat indicating the IANA media type as a string (e.g. "text/plain") and/or a reference to a file format contextual entity
- conformsTo to a contextual entity of type Profile that indicates a profile of the encoding format
- contentSize with the size of the file in bytes
RO-Crate’s File is an alias for the schema.org type MediaObject; any of its properties MAY also be used (adding contextual entities as needed). Files on the web SHOULD also use identifier, url, subjectOf, and/or mainEntityOfPage.
Directory File Entity
A Dataset (directory) Data Entity MUST have the following properties:
- @type: MUST be Dataset, or an array where Dataset is one of the values.
- @id: MUST be either a URI Path relative to the RO-Crate root, or an absolute URI. The @id SHOULD end with /
Additionally, Dataset entities SHOULD have:
- name giving a human readable name (not necessarily the directory name)
- description giving a longer description, e.g. the content of this directory
- hasPart listing directly contained data entities
Any of the properties of schema.org Dataset MAY additionally be used (adding contextual entities as needed). Directories on the web SHOULD also provide distribution.
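The MUST and SHOULD properties listed above for File and Dataset entities lend themselves to a simple check. The following is a minimal sketch, assuming that MUST violations are reported as errors and missing SHOULD properties as warnings; the function name check_data_entity and the message wording are hypothetical. It covers only File and Dataset entities, and it does not verify that the @id is a valid URI reference.

```python
FILE_SHOULD = ("name", "description", "encodingFormat", "conformsTo", "contentSize")
DATASET_SHOULD = ("name", "description", "hasPart")


def check_data_entity(entity):
    """Return (errors, warnings) for a File or Dataset data entity."""
    errors, warnings = [], []

    # @type may be a single string or an array of strings.
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]

    entity_id = entity.get("@id")
    has_id = isinstance(entity_id, str) and entity_id != ""
    if not has_id:
        errors.append("@id is missing")

    if "File" in types:
        recommended = FILE_SHOULD
    elif "Dataset" in types:
        recommended = DATASET_SHOULD
        if has_id and not entity_id.endswith("/"):
            warnings.append("Dataset @id should end with /")
    else:
        errors.append("@type must include File or Dataset")
        recommended = ()

    # Missing SHOULD properties are only reported as warnings.
    warnings.extend(f"{prop} is recommended" for prop in recommended if prop not in entity)
    return errors, warnings
```

Keeping errors and warnings separate mirrors the distinction between MUST and SHOULD: a crate with warnings is still minimally valid.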
Web-based Data Entities
While one use-case of RO-Crates is to describe files contained within the RO-Crate root directory, RO-Crates can also gather resources from the web identified by absolute URIs instead of relative URI paths, i.e. Web-based data entities.
Using Web-based data entities can be important particularly where a file can’t be included in the RO-Crate root because of licensing concerns, large data sizes, privacy, or where it is desirable to link to the latest online version.
Example of an RO-Crate including a File Data Entity external to the RO-Crate root (file entity https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf):
{ "@context": "https://w3id.org/ro/crate/1.2-DRAFT/context",
"@graph": [
{
"@type": "CreativeWork",
"@id": "ro-crate-metadata.json",
"conformsTo": {"@id": "https://w3id.org/ro/crate/1.2-DRAFT"},
"about": {"@id": "./"}
},
{
"@id": "./",
"@type": [
"Dataset"
],
"hasPart": [
{
"@id": "survey-responses-2019.csv"
},
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf"
}
]
},
{
"@id": "survey-responses-2019.csv",
"@type": "File",
"name": "Survey responses",
"contentSize": "26452",
"encodingFormat": "text/csv"
},
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
"@type": "File",
"name": "RO-Crate specification",
"contentSize": "310691",
"description": "RO-Crate specification",
"encodingFormat": "application/pdf"
}
]
}
Additional care SHOULD be taken to improve persistence and long-term preservation of web resources included in an RO-Crate as they can be more difficult to archive or move along with the RO-Crate root, and may change intentionally or unintentionally leaving the RO-Crate with incomplete or outdated information.
File Data Entities with an @id URI outside the RO-Crate Root SHOULD at the time of RO-Crate creation be directly downloadable by a simple retrieval (e.g. HTTP GET), permitting redirections and HTTP/HTTPS authentication. For instance, in the example above, https://zenodo.org/record/3541888 and https://doi.org/10.5281/zenodo.3541888 cannot be used as @id, as retrieving these URLs gives an HTML landing page rather than the desired PDF indicated by encodingFormat.
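One way to test whether a URL is directly downloadable is to compare the media type the server reports with the entity's encodingFormat. The sketch below uses Python's standard urllib. The function names are hypothetical, and it assumes the server answers HEAD requests and sends an accurate Content-Type header, which is not guaranteed in practice.

```python
from urllib.request import Request, urlopen


def media_type_matches(content_type_header, expected_media_type):
    """Compare a Content-Type header value with a media type, ignoring parameters."""
    # "text/html; charset=utf-8" is reduced to "text/html" before comparing.
    actual = content_type_header.split(";", 1)[0].strip().lower()
    return actual == expected_media_type.strip().lower()


def is_directly_downloadable(url, expected_media_type, timeout=10.0):
    """Issue a HEAD request (redirects are followed) and check the reported media type."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
    return media_type_matches(content_type, expected_media_type)
```

Under these assumptions, a landing page would typically report text/html and fail the comparison against application/pdf, while the direct file URL would pass.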
As files on the web may change, the timestamp property sdDatePublished SHOULD be included to indicate when the absolute URL was accessed, and derived metadata like encodingFormat and contentSize were considered to be representative:
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
"@type": "File",
"name": "RO-Crate specification",
"contentSize": "310691",
"encodingFormat": "application/pdf",
"sdDatePublished": "2020-04-09T13:09:21+01:00"
}
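A tool that creates crates can record the retrieval time automatically. The following sketch formats a timezone-aware timestamp as an ISO 8601 string for sdDatePublished. The helper name with_retrieval_timestamp is hypothetical, and defaulting to the current time in UTC is an assumption rather than a requirement.

```python
from datetime import datetime, timedelta, timezone


def with_retrieval_timestamp(entity, when=None):
    """Return a copy of the entity with sdDatePublished set to an ISO 8601 timestamp."""
    # Default to the current time in UTC. A caller-supplied datetime should be
    # timezone-aware, otherwise the resulting string carries no UTC offset.
    moment = when if when is not None else datetime.now(timezone.utc)
    stamped = dict(entity)
    stamped["sdDatePublished"] = moment.isoformat(timespec="seconds")
    return stamped


# Usage: the same instant as in the example above, at a UTC offset of +01:00.
example = with_retrieval_timestamp(
    {"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf", "@type": "File"},
    when=datetime(2020, 4, 9, 13, 9, 21, tzinfo=timezone(timedelta(hours=1))),
)
```

An ISO 8601 timestamp carries either a numeric offset such as +01:00 or the Z suffix for UTC, never both.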
Embedded data entities that are also on the web
File Data Entities may already have a corresponding web presence, for instance a landing page that describes the file, including persistent identifiers (e.g. DOI) resolving to an intermediate HTML page instead of the downloadable file directly.
These can be included for File Data Entities as additional metadata, regardless of whether the File is included in the RO-Crate Root directory or exists on the Web, by using the properties:
- identifier for formal identifier strings such as DOIs
- url with a string URL corresponding to a download link (if not available, a download landing page) for this file
- subjectOf to a CreativeWork (or WebPage) that mentions this file or its content (but also other resources)
- mainEntityOfPage to a CreativeWork (or WebPage) that primarily describes this file (or its content)
{
"@id": "survey-responses-2019.csv",
"@type": "File",
"name": "Survey responses",
"encodingFormat": "text/csv",
"url": "http://example.com/downloads/2019/survey-responses-2019.csv",
"subjectOf": {"@id": "http://example.com/reports/2019/annual-survey.html"}
},
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
"@type": "File",
"name": "RO-Crate specification",
"encodingFormat": "application/pdf",
"identifier": "https://doi.org/10.5281/zenodo.3541888",
"url": "https://zenodo.org/record/3541888"
}
Directories on the web; dataset distributions
A Directory File Entity or Dataset whose identifier is an absolute URL on the web can be harder to download than a File because it consists of multiple resources. It is RECOMMENDED that such directories have a complete listing of their content in hasPart, enabling download traversal.
Alternatively, a common mechanism to provide downloads of a reasonably sized directory is as an archive file in formats such as application/zip or application/gzip, described as a DataDownload.
{
"@id": "lots_of_little_files/",
"@type": "Dataset",
"name": "Too many files",
"description": "This directory contains many small files, that we're not going to describe in detail.",
"distribution": {"@id": "http://example.com/downloads/2020/lots_of_little_files.zip"}
},
{
"@id": "http://example.com/downloads/2020/lots_of_little_files.zip",
"@type": "DataDownload",
"encodingFormat": ["application/zip", {"@id": "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/263"}],
"contentSize": "82818928"
}
Similarly, the RO-Crate root entity may also provide a distribution URL, in which case the download SHOULD be an archive that contains the RO-Crate Metadata file.
In all cases, consumers should be aware that a DataDownload is a snapshot that may not reflect the current state of the Dataset or RO-Crate.