RO-Index: A survey of Research Object usage

For this study we aim to build RO-Index, a broad and comprehensive corpus of Research Objects found “in the wild”. The proposed methodology follows multiple strands to find the “breeding grounds” of research objects and further describes how Research Objects are selected for inclusion, along with post-processing to build the corpus.

The corpus of Research Objects will primarily be distributed as Open Data, including:

Research Objects that cannot be redistributed (e.g. unknown license) will only be examined for aggregates.

A brief set of qualitative and quantitative analytics will then be performed across the overall corpus, in particular to address research questions like:

Introduction

Protocol

Finding Research Objects

One goal of this work is to determine what kinds of artifacts can, in practice, be considered a research object. For the purpose of building a corpus we need both inclusion and exclusion criteria.

The foundational articles on the RO concept are [1] and its workshop predecessor [2]. The Research Object community has maintained lists of initiatives and Research Object profiles which provide curated, although potentially biased, collections of Research Object approaches and implementations.

Declared Research Object usage

In order to determine potential sources of Research Objects we will start with these community lists, but expand based on a literature review, following academic citations of the aforementioned Research Object articles to find repositories, tools and communities that may conceptually claim to have or make “research objects”. This is a broad interpretation, but it does not extend to general datasets or packaging formats. The list may be expanded by a literature search for “Research Object”, the RO vocabularies and standard URLs.

Each of the citing articles will then be assessed to see if they have openly accessible research objects that are possible to identify, and ideally retrieve, by building a programmatic crawler. Ideally such access would use an open harvesting protocol like OAI-PMH or ResourceSync, but it is predicted that in the majority of cases custom crawler code will need to be developed per repository, in addition to manual harvesting of identifiers for smaller collections and individual Research Objects.
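For repositories that do expose OAI-PMH, harvesting identifiers could look roughly like the following minimal sketch. The endpoint `https://example.org/oai`, the sample response and the function names are illustrative assumptions, not part of the study's crawler; only the `ListIdentifiers` verb, the `metadataPrefix` and `resumptionToken` arguments and the XML namespace come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListIdentifiers request URL for a (hypothetical) OAI-PMH endpoint."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, next resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    next_token = token.text if token is not None and token.text else None
    return ids, next_token


# Hand-written sample response, for illustration only
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header><identifier>oai:example.org:2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

ids, token = parse_list_identifiers(SAMPLE)
first_url = list_identifiers_url("https://example.org/oai")
next_url = list_identifiers_url("https://example.org/oai", resumption_token=token)
```

A crawler would fetch `first_url`, parse the page, and keep requesting `next_url` until no resumption token is returned. Repositories without such a protocol would still need the custom code described above.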

Keyword searches

In addition to this “self-claimed” research object usage we will search in more general repositories by developing a list of keywords like “research object”, “robundle” or the RO vocabulary URLs. We will search in at least:

These searches are predicted to yield duplicates, but will be used to find potentially new Research Object sources or free-standing instances.

Archives with manifests

Finally we will broadly consider Open Data repositories of file archives (e.g. ZIP, tar.gz) to inspect for the presence of a manifest-like file (e.g. /manifest.rdf). For practical reasons this search will be restricted to a smaller selection of public repositories and formats, e.g. Zenodo (20k *.zip Datasets), FigShare (“zip” Datasets), Mendeley Data (“zip” File Sets).

It is predicted that most of the archive files will not contain such a manifest; they can therefore be inspected “on the fly” by the crawler without intermediate storage, to first build a short-list of archives that contain a manifest-like file. These can then be downloaded in full for further inspection. File-name matching will also inspect sub-directories, e.g. to detect nested/data/manifest.xml, but will classify these archives differently from direct matches.
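The distinction between direct and nested matches can be sketched as below. This is a minimal illustration, not the study's crawler: the set `MANIFEST_NAMES` and the labels `"direct"`, `"nested"` and `"none"` are assumptions made for the example, and the archive is built in memory purely for demonstration.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical manifest-like file names; the list used in the study may differ
MANIFEST_NAMES = {"manifest.rdf", "manifest.xml", "manifest.json"}


def classify_archive(names):
    """Classify an archive as 'direct', 'nested' or 'none' from its member names."""
    found_nested = False
    for name in names:
        path = PurePosixPath(name.lstrip("/"))
        if path.name.lower() not in MANIFEST_NAMES:
            continue
        if len(path.parts) == 1:
            return "direct"  # manifest at the archive root
        found_nested = True  # manifest inside a sub-directory
    return "nested" if found_nested else "none"


def member_names(zip_bytes):
    """List member names of a ZIP held in memory (no intermediate file on disk)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


# Build a small demonstration archive in memory
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("nested/data/manifest.xml", "<manifest/>")
    archive.writestr("nested/data/results.csv", "a,b\n1,2\n")

label = classify_archive(member_names(buffer.getvalue()))
```

Because only member names are needed, the same classification could be applied to a listing obtained without downloading the full archive.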

Candidate sources

We may contact the provider or maintainer to expand on these questions if the answers are unclear from public information. However, we are not conducting a formal survey, as our main interest lies in the machine-readable information from the research objects themselves.

We will finally form a shortlist of sources for further harvesting, considering:

Handling personally identifiable information

Research Objects may, by their nature, contain information about people and their research activities. It is therefore important that our data collection, processing and potential re-distribution are consistent with the General Data Protection Regulation (GDPR). To this end we will evaluate:

Evaluating this may require retrieving research objects in the first place, but particular care will be taken to classify Research Objects and their sources according to the above evaluation in order to filter information that can progress to be part of the Open Data RO-Index corpus. This forms a staged inclusion list:

Note: In the above, “tend to” will be determined manually by inspecting a smaller subset of typically 10 research objects. The selection will aim to approximate a simple random subset, but may need to be expanded to take into account the overall diversity of ROs at the source, e.g. date, authors, subsystem, formats. The identifiers of the ROs of this subset will be recorded, along with a description of how the subset was selected.
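One way to make such a subset selection reproducible is a seeded simple random sample, recorded together with its parameters. The sketch below is an assumption about how this could be done; the function name, the seed value and the identifier strings are hypothetical.

```python
import random


def select_subset(identifiers, size=10, seed=2019):
    """Draw a reproducible simple random subset and describe how it was drawn."""
    pool = sorted(set(identifiers))  # stable order, duplicates removed
    rng = random.Random(seed)        # fixed seed so the draw can be repeated
    chosen = rng.sample(pool, min(size, len(pool)))
    return {
        "method": "simple random sample without replacement",
        "seed": seed,
        "population": len(pool),
        "selected": sorted(chosen),
    }


# Hypothetical identifiers standing in for the ROs found at one source
candidates = ["ro-%03d" % n for n in range(1, 201)]
record = select_subset(candidates)
```

The returned dictionary holds both the selected identifiers and the description of the selection. Any manual expansion for diversity (e.g. date, authors, formats) would need to be documented separately.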

The inclusion list may be further restricted based on findings from further processing (e.g. a repository is found to distribute sensitive data).

It is worth noting that compliance with open licenses like Creative Commons Attribution 4.0 (CC-BY) or Apache License 2.0 requires attribution to be propagated (if present). Attribution may sometimes take the form of a URL, identifier, project or organization which does not directly identify a person.

Data for any excluded Research Objects will only be kept for the purpose and duration of this study on computer infrastructure managed by The University of Manchester. Data from excluded Research Objects will only be used for non-person-identifiable aggregated results (e.g. number of CSV files) and broad categorization (e.g. vocabularies used in metadata).

The identifiers from category 1, metadata from category 3 and data from category 4 will be shared in the public Open Data repository Zenodo according to Zenodo’s policies. Metadata from category 3 and 4 above may be exposed for programmatic querying (e.g. SPARQL) or converted to other formats. No additional linking with internal and external data sources will be performed, although the collected Research Objects may already contain such links (e.g. https://orcid.org/ identifiers of authors); an exception to this rule is that linking will be permitted to detect duplicate Research Objects across multiple sources, and to access resources clearly aggregated as part of the Research Object.

For GDPR purposes the Data Controller is The University of Manchester. Data subjects may contact info@esciencelab.org.uk with any enquiries, such as to request access to data about themselves, or to request update or removal of personally identifiable information.

Pre-identified data sources

Proto-research objects

ORE-based research objects

Software/container-based research objects

2nd generation ROs

Manifest formats

A key characteristic of a Research Object is the presence of a manifest that describes and relates the content. However, multiple potential formats and conventions have emerged for how to serialize such a manifest. (..)

Proposed data gathering workflow

Prototype workflow

A prototype workflow is being developed using Common Workflow Language [23]. Figure 1 shows how a community sub-section of the Zenodo repository is being inspected to list the filenames contained within its downloadable ZIP files.

The workflow and its components have been tested with the reference implementation cwltool [25], which can provide rich provenance captured in CWLProv research objects [26].

Estimating bandwidth and storage requirements

In order to estimate the bandwidth and storage requirements for executing the above prototype workflow across the whole of Zenodo, a shell script approach was used to retrieve and analyse the JSON for each Zenodo record, which includes information on downloadable filename, extension and file size.
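The kind of per-record analysis described above could be sketched as follows. The field names `files`, `key` and `size` are assumptions about the layout of the record JSON, and the sample record is hand-written for illustration; both should be checked against real records before use.

```python
import json
import os


def file_entries(record):
    """Yield (filename, extension, size in bytes) for each file in a record.

    Assumes a 'files' list whose items carry 'key' (filename) and 'size';
    these field names are an assumption, not taken from the text.
    """
    for entry in record.get("files", []):
        name = entry.get("key", "")
        extension = os.path.splitext(name)[1].lower().lstrip(".")
        yield name, extension, entry.get("size", 0)


# Hand-written sample record, for illustration only
SAMPLE_RECORD = """
{"id": 1, "files": [
  {"key": "bundle.ZIP", "size": 1234},
  {"key": "README.md", "size": 20}
]}
"""

rows = list(file_entries(json.loads(SAMPLE_RECORD)))
zip_rows = [row for row in rows if row[1] == "zip"]
total_zip_bytes = sum(size for _, _, size in zip_rows)
```

Summing `size` over all ZIP entries across records gives the kind of total that the bandwidth and storage estimates depend on.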

To retrieve these it was deemed necessary to use undocumented parts of the Zenodo API. From the Zenodo source code it was identified that the REST template https://zenodo.org/api/records/{pid_value} could be used with pid_value as the numeric part of the OAI-PMH identifier, e.g. for oai:zenodo.org:1310621 the Zenodo JSON can be retrieved at https://zenodo.org/api/records/1310621.
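The mapping from OAI-PMH identifier to record URL is mechanical and can be expressed as a small helper. The function name and the validation rules below are illustrative assumptions; the prefix, the URL template and the example identifier are those given above.

```python
OAI_PREFIX = "oai:zenodo.org:"
RECORD_TEMPLATE = "https://zenodo.org/api/records/{pid_value}"


def record_url(oai_identifier):
    """Map an OAI-PMH identifier such as oai:zenodo.org:1310621 to its record URL."""
    if not oai_identifier.startswith(OAI_PREFIX):
        raise ValueError("not a Zenodo OAI-PMH identifier: %r" % oai_identifier)
    pid_value = oai_identifier[len(OAI_PREFIX):]
    if not pid_value.isdigit():
        # The text describes pid_value as the numeric part of the identifier
        raise ValueError("non-numeric record id: %r" % pid_value)
    return RECORD_TEMPLATE.format(pid_value=pid_value)


url = record_url("oai:zenodo.org:1310621")
```

Since this part of the API is described as undocumented, the template may change without notice; rejecting unexpected identifiers early makes such a change easier to spot.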

The JSON API supports content negotiation; the content types supported as of 2019-09-20 include:

Using these (currently) undocumented parts of the Zenodo API thus avoids the need for HTML scraping while also giving individual complete records that are suitable to redistribute as records in a filtered dataset.

This preliminary exploration will be adapted into the reproducible CWL workflow. Below is a bash transcript. Execution time was about 3 days from a server on the University of Manchester network using a single 1 GBps network link. The script does:

Downloading all will take at least 4 days assuming 73.9 MBit/s, as measured using wget of a 9 GB file. It is predicted that the actual download time may be doubled because of the effect of latency on shorter downloads, which will not be able to saturate the link before the download is complete. For comparison, downloading the JSON files, each about 1 kB, took 3 days, so a realistic estimate is 7 days to download.
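The arithmetic behind this kind of estimate can be made explicit. The sketch below assumes that MBit/s means 10^6 bits per second and that the transfer is sustained at the measured rate. The total size is not restated here, so the volume is only back-calculated from the 4 days and 73.9 MBit/s given above and should be read as a derived illustration, not a measured figure.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes, mbit_per_s):
    """Days needed to move total_bytes at a sustained rate in Mbit/s (10^6 bit/s)."""
    bytes_per_second = mbit_per_s * 1_000_000 / 8
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


def bytes_in(days, mbit_per_s):
    """Inverse of transfer_days: bytes that can be moved in the given number of days."""
    return days * SECONDS_PER_DAY * mbit_per_s * 1_000_000 / 8


# Volume implied by 4 days at the measured 73.9 Mbit/s (about 3.2e12 bytes)
implied_bytes = bytes_in(4, 73.9)

# Time for the 9 GB test file at the same rate, in minutes (about 16)
nine_gb_minutes = transfer_days(9e9, 73.9) * 24 * 60
```

Under these assumptions 4 days corresponds to roughly 3.2 TB. Doubling the time for latency-bound small downloads does not change the volume, only the effective throughput.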

It was found that only a small subset of downloads are over 30 MB. Keeping all the ZIP files smaller than 30 MB will require about 300 GB.
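A size threshold like this can be applied directly to a list of (URL, size) pairs. The following is a minimal sketch; the decimal interpretation of 30 MB, the function name and the example entries are assumptions made for illustration.

```python
THRESHOLD_BYTES = 30 * 1000 * 1000  # 30 MB, assuming decimal megabytes


def split_downloads(files, threshold=THRESHOLD_BYTES):
    """Partition (url, size) pairs into files below and at-or-above the threshold."""
    small = [f for f in files if f[1] < threshold]
    large = [f for f in files if f[1] >= threshold]
    return small, large


# Hypothetical download list, for illustration only
downloads = [
    ("https://example.org/a.zip", 5_000_000),
    ("https://example.org/b.zip", 29_999_999),
    ("https://example.org/c.zip", 30_000_000),
    ("https://example.org/d.zip", 4_000_000_000),
]

small, large = split_downloads(downloads)
storage_needed = sum(size for _, size in small)  # bytes to keep on disk
```

Summing the sizes of the small subset gives the storage that would be needed if only those files are kept.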

Therefore the suggestion is to split the download list into two subsets, a) with many small ZIP files which are kept b) large ZIP files which are processed by streami

Conclusions/Discussion

Data (and Software) Availability

Author contributions

Competing interests

Grant Information

References

1. Why linked data is not enough for scientists
Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, … Carole Goble
Future Generation Computer Systems (2013-02) https://doi.org/bgmqrb
DOI: 10.1016/j.future.2011.08.004

2. Why Linked Data is Not Enough for Scientists
Sean Bechhofer, John Ainsworth, Jiten Bhagat, Iain Buchan, Philip Couch, Don Cruickshank, David De Roure, Mark Delderfield, Ian Dunlop, Matthew Gamble, … Shoaib Sufi
2010 IEEE Sixth International Conference on e-Science (2010-12) https://doi.org/cv5tzk
DOI: 10.1109/escience.2010.21

3. COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project
Frank T Bergmann, Richard Adams, Stuart Moodie, Jonathan Cooper, Mihai Glont, Martin Golebiewski, Michael Hucka, Camille Laibe, Andrew K Miller, David P Nickerson, … Nicolas Le Novère
BMC Bioinformatics (2014-12) https://doi.org/gb8wc5
DOI: 10.1186/s12859-014-0369-z · PMID: 25494900 · PMCID: PMC4272562

4. ROHub — A Digital Library of Research Objects Supporting Scientists Towards Reproducible Science
Raúl Palma, Piotr Hołubowicz, Oscar Corcho, José Manuel Gómez-Pérez, Cezary Mazurek
Communications in Computer and Information Science (2014) https://doi.org/gf5m6p
DOI: 10.1007/978-3-319-12024-9_9

5. Research Object Bundle 1.0
Stian Soiland-Reyes, Matthew Gamble, Robert Haines
Zenodo (2014-11-05) https://doi.org/gf5m6k
DOI: 10.5281/zenodo.12586

6. Reproducible big data science: A case study in continuous FAIRness
Ravi Madduri, Kyle Chard, Mike D’Arcy, Segun C. Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric Deutsch, Cory Funk, Ben Heavner, Matthew Richards, … Ian Foster
PLOS ONE (2019-04-11) https://doi.org/gf5m6s
DOI: 10.1371/journal.pone.0213013 · PMID: 30973881 · PMCID: PMC6459504

7. Datacrate Submisssion To The Workshop On Research Objects
Peter Sefton
Zenodo (2018-07-15) https://doi.org/gf5twt
DOI: 10.5281/zenodo.1445817

8. A lightweight approach to research object data packaging
Eoghan Ó Carragáin, Carole Goble, Peter Sefton, Stian Soiland-Reyes
Zenodo (2019-06-20) https://doi.org/gf5twv
DOI: 10.5281/zenodo.3250687

9. Ro-Combine-Archive
Stian Soiland-Reyes, Matthew Gamble
Zenodo (2014-04-28) https://doi.org/gf5m6t
DOI: 10.5281/zenodo.10439

10. Applying linked data approaches to pharmacology: Architectural decisions and implementation
Gray Alasdair J.G., Groth Paul, Loizou Antonis, Askjaer Sune, Brenninkmeijer Christian, Burger Kees, Chichester Christine, Evelo Chris T., Goble Carole, Harland Lee, … Williams Antony J.
Semantic Web (2014) https://doi.org/gf5m6j
DOI: 10.3233/sw-2012-0088

11. Preserving Reproducibility: Provenance and Executable Containers in DataONE Data Packages
Bryce Mecum, Matthew B. Jones, Dave Vieglais, Craig Willis
2018 IEEE 14th International Conference on e-Science (e-Science) (2018-10) https://doi.org/gf5m6q
DOI: 10.1109/escience.2018.00019

12. DataLab
Yang Zhang, Fangzhou Xu, Erwin Frise, Siqi Wu, Bin Yu, Wei Xu
Proceedings of the 2nd International Workshop on BIG Data Software Engineering - BIGDSE ’16 (2016) https://doi.org/gf7hv6
DOI: 10.1145/2896825.2896830

13. The GBIF Integrated Publishing Toolkit: Facilitating the Efficient Publishing of Biodiversity Data on the Internet
Tim Robertson, Markus Döring, Robert Guralnick, David Bloom, John Wieczorek, Kyle Braak, Javier Otegui, Laura Russell, Peter Desmet
PLoS ONE (2014-08-06) https://doi.org/f6n8jm
DOI: 10.1371/journal.pone.0102623 · PMID: 25099149 · PMCID: PMC4123864

14. Natural history specimens collected and/or identified and deposited.
Claudia Baider
Zenodo (2019-09-12) https://doi.org/gf75d8
DOI: 10.5281/zenodo.3405730

15. Specification of the Crystallographic Binary File (CBF/imgCIF)
H. J. Bernstein, A. P. Hammersley
International Tables for Crystallography (2006-10-01) https://doi.org/b27mm3
DOI: 10.1107/97809553602060000729

16. CWL Viewer: the common workflow language viewer
Mark Robinson, Stian Soiland-Reyes, Michael R. Crusoe, Carole Goble
F1000Research (2017) https://doi.org/cbq2
DOI: 10.7490/f1000research.1114375.1

17. A workflow PROV-corpus based on taverna and wings
Khalid Belhajjame, Jun Zhao, Daniel Garijo, Aleix Garrido, Stian Soiland-Reyes, Pinar Alper, Oscar Corcho
Proceedings of the Joint EDBT/ICDT 2013 Workshops on - EDBT ’13 (2013) https://doi.org/gf5m6r
DOI: 10.1145/2457317.2457376

18. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
Farah Zaib Khan, Stian Soiland-Reyes, Richard O. Sinnott, Andrew Lonie, Carole Goble, Michael R. Crusoe
Zenodo (2019-07-15) https://doi.org/gf5tg8
DOI: 10.5281/zenodo.1208477

19. W2Share Case Study: Workflow Research Object (Wro)
Lucas Carvalho, Claudia Bauzer Medeiros
Zenodo (2018-10-18) https://doi.org/gf5m6m
DOI: 10.5281/zenodo.1465897

20. CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)
Stian Soiland-Reyes
Mendeley (2018-12-04) https://doi.org/gf5m6h
DOI: 10.17632/6wtpgr3kbj.1

21. The Scientific Filesystem
Vanessa Sochat
GigaScience (2018-03-13) https://doi.org/gdwq7f
DOI: 10.1093/gigascience/giy023 · PMID: 29718213 · PMCID: PMC5952957

22. ActivePapers: a platform for publishing and archiving computer-aided research
Konrad Hinsen
F1000Research (2015-07-14) https://doi.org/gfrkvv
DOI: 10.12688/f1000research.5773.3 · PMID: 26064469 · PMCID: PMC4448745

23. Common Workflow Language, v1.0
Peter Amstutz, Michael R. Crusoe, Nebojša Tijanić, Brad Chapman, John Chilton, Michael Heuer, Andrey Kartashov, Dan Leehr, Hervé Ménager, Maya Nedeljkovich, … Luka Stojanovic
Figshare (2016) https://doi.org/gf6ppg
DOI: 10.6084/m9.figshare.3115156.v2

24. Web Linking
M. Nottingham
RFC Editor (2017-10) https://doi.org/gf8jcd
DOI: 10.17487/rfc8288

25. common-workflow-language/cwltool: cwltool 1.0.20190815141648
Peter Amstutz, Michael R. Crusoe, Farah Zaib Khan, Stian Soiland-Reyes, Manvendra Singh, Kapil Kumar, Anton Khodak, John Chilton, Thomas Hickman, Boysha, … Ryan Spangler
Zenodo (2019-08-15) https://doi.org/gf7s2s
DOI: 10.5281/zenodo.3369238

26. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
Farah Zaib Khan, Stian Soiland-Reyes, Richard O. Sinnott, Andrew Lonie, Carole Goble, Michael R. Crusoe
Zenodo (2019-05-23) https://doi.org/gf7s2r
DOI: 10.5281/zenodo.3196309

27. Hypertext Transfer Protocol (HTTP/1.1): Range Requests
R. Fielding, Y. Lafon, J. Reschke (editors)
RFC Editor (2014-06) https://doi.org/gf8jcb
DOI: 10.17487/rfc7233

28. DataCite Metadata Schema Documentation for the Publication and Citation of Research Data v4.0
DataCite Metadata Working Group
DataCite e.V. (2016) https://doi.org/gf8jcf
DOI: 10.5438/0012

29. JavaScript Object Notation (JSON) Text Sequences
N. Williams
RFC Editor (2015-02) https://doi.org/gf8jcc
DOI: 10.17487/rfc7464