Abstract. The volume of research is increasing along with the steadily increasing digitization of research and the advent of open science. This puts pressure on research information systems, which try to work with various research output types (e.g. publications and datasets) and the related information. The issue is even more acute for Social Sciences and Humanities (SSH). Compared to e.g. the natural sciences, SSH is often underrepresented in research databases across the various forms of research output. One solution is national Current Research Information Systems (CRIS), which aim to provide a realistic and disciplinarily balanced picture of the research outputs produced by various research organizations in a given country. To achieve this, metadata for research outputs need to be consistent and, above all, interoperable. One component of this is the use of persistent identifiers (PIDs). This paper presents a case of interoperability of SSH publications, datasets, and infrastructures. Linking research outputs, funding decisions, actors, and organizations with PIDs is the starting point of the Finnish Research.fi portal. We present and discuss the advancements that PIDs provide for research information management from the SSH point of view.
Corresponding author: email@example.com
The volume of research is increasing along with the steadily increasing digitization of research and the advent of open science. This puts pressure on research information systems, which try to work with various research output types (e.g. publications and datasets) and the related information. For Social Sciences and Humanities (SSH), the issue is even more acute. Compared to e.g. the natural sciences, SSH is often underrepresented in research databases across the various forms of research output. This shortcoming is most evident when evaluating the coverage of SSH in commercial databases (e.g. Web of Science and Scopus), which only provide partial coverage of SSH outputs (1). Research suggests that, at best, around 70% of SSH publications are available in these databases, with the lowest estimates in the range of 20-30% (2). This may result in skewed analysis or representation of research activities in the SSH field.
Though the problem is acute, a lot of progress has been made in this area in recent years. One track of solutions is national CRIS systems (such as CRIStin in Norway and NARCIS in the Netherlands), which aim to provide a realistic and disciplinarily balanced picture of the diverse array of research outputs produced by various research organizations in a given country. To achieve this, metadata for research outputs need to be consistent and, above all, interoperable. So while common metadata schemas such as CERIF (3,4) are necessary, the core of the solution is using persistent identifiers (PIDs) to enable unambiguous references to research outputs.
For publications, there exist international and mostly commercial databases such as Scopus, national efforts, and disciplinary collections. The ability to build on the results of other researchers and the reproducibility of research are basic pillars of science, so the methods for discovering the work of others should be well established. In general, as a part of the publishing process, the datasets used in papers should be attached or linked to a publication, so that it is possible to validate or recreate the results from the data used. Any tools, software or methods used for analysis should also be clearly identified so that an identical analysis can be recreated if need be. Many journals now prefer that the data related to publications are deposited in data repositories rather than attached to publications. Open science also advocates making research outputs publicly available. While there are discipline-agnostic data repositories such as Zenodo or Dryad, disciplinary repositories are the most common, as a researcher looking to reuse someone else's datasets is likely to be interested in finding a very specific dataset belonging to a precisely defined discipline. Typically this exploration of datasets is based on metadata elements (e.g. attributes of data, data format) which are discipline-specific and thus not available in general repositories. Data repositories may also be further delimited in scope by geographic boundaries, such as the Finnish Social Science Data Archive (FSD). Other research information types are also relevant for researchers and CRIS systems. Positive funding decisions are both a key merit and an early description of what researchers plan to work on. Source code and software might be outputs themselves or at least key components in carrying out a given analysis.
This paper presents a case of interoperability of SSH outputs, namely publications, datasets, and infrastructures. Linking these research outputs, funding decisions, actors, and organizations with PIDs is the starting point of the Finnish Research.fi portal. Its first version was launched in June 2020. We present and discuss the advancements that PIDs provide for research information management from the SSH point of view.
The Research.fi portal connects different types of research outputs by using PIDs. The PID Graph concept, introduced by Fenner and Aryani (5), proposes the use of federated RESTful APIs, meaning that the graph metadata is not stored in one location but distributed between different data providers. A resolving PID can be used to retrieve additional metadata from the linked data providers. Research.fi is a national portal that currently handles data integration via other, more traditional mechanisms. Other efforts on this theme are ongoing, such as the OpenAIRE Research Graph (6), Research Graph and Scholix.
Figure 1. An excerpt from Research.fi showing different searchable research information types.
Research.fi aims to connect publications with related funding decisions, research infrastructures and datasets used in a given study. The portal content can be browsed from different viewpoints, and the linked data can be navigated through interconnected links. From a service design perspective, the central node in this graph is the researcher. One of the key use cases for the portal design was finding people with expertise in a particular research topic. However, several graph connections do not directly involve researchers. A user can view data from different perspectives, such as:
Datasets resulting from a given infrastructure
Datasets and publications generated with funding from a given funding decision
Funding awarded to a given organisation each year
A given researcher’s publications, funding decisions, datasets, etc.
The initial release of Research.fi includes the research information types shown in Figure 1. Additional types are being incorporated. Like other concurrent efforts, we face the hurdle of finding information about how these different types of data items relate to each other. Research information systems have for some time made it possible to enter information on e.g. the projects relating to datasets, but researchers frequently do not provide such voluntary information. Their interest typically lies in providing information that affects their own or their host institutions' funding.
The simplest and most explicit way to establish a connection between two research objects is to link their PIDs. For example, a connection from a publication to an author is made when the author gives their ORCID identifier when submitting a paper to a journal. Once the PID-to-PID connections are established, both national data providers and international research data aggregators, such as OpenAIRE, have an interest in utilising the connection-enriched data.
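The idea of PID-to-PID links and the browsing perspectives listed earlier can be illustrated with a minimal in-memory sketch. It is not how Research.fi is implemented; the class name `PidGraph`, the type labels and every identifier in it are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class PidGraph:
    """Undirected graph whose nodes are PIDs tagged with an information type."""

    def __init__(self):
        self.kinds = {}                  # PID -> type label, e.g. "dataset"
        self.edges = defaultdict(set)    # PID -> PIDs it is linked to

    def add(self, pid, kind):
        self.kinds[pid] = kind

    def link(self, pid_a, pid_b):
        # Only link PIDs that are already registered, so every edge is resolvable.
        for pid in (pid_a, pid_b):
            if pid not in self.kinds:
                raise KeyError(f"unknown PID: {pid}")
        self.edges[pid_a].add(pid_b)
        self.edges[pid_b].add(pid_a)

    def linked(self, pid, kind=None):
        """Return PIDs linked to `pid`, optionally restricted to one type."""
        found = self.edges.get(pid, set())
        return sorted(p for p in found if kind is None or self.kinds[p] == kind)


# All identifiers below are made up for illustration.
graph = PidGraph()
graph.add("orcid:example-researcher", "researcher")
graph.add("doi:10.0000/example-article", "publication")
graph.add("urn:example:dataset-1", "dataset")
graph.add("funding:example-decision", "funding decision")

graph.link("doi:10.0000/example-article", "orcid:example-researcher")
graph.link("doi:10.0000/example-article", "funding:example-decision")
graph.link("urn:example:dataset-1", "funding:example-decision")

# Outputs generated with funding from a given funding decision.
funded = graph.linked("funding:example-decision")
funded_datasets = graph.linked("funding:example-decision", kind="dataset")
```

Because each link is stored in both directions, the same structure answers queries from any starting point, whether that is a researcher, a funding decision or a dataset.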
The Fairdata services support the research process and data management via multiple service components, each having its own primary function. As the name suggests, the services aim to be as FAIR-compliant as possible; that is, they take into account the FAIR principles for data management, which have gained wide support in the data repository field in recent years.
All Fairdata service components come together around Metax – a Metadata Warehouse service. Metax acts as a metadata repository for research datasets and additional metadata stored in Fairdata services. Metax also stores harvested metadata from various national and institutional data repositories in order to provide a national metadata database on datasets produced by Finnish research organisations.
Metax does not have a user interface but can be accessed via a REST API and the OAI-PMH protocol, both by individual users and by service providers interacting with the metadata. Metax has its own internal data model, but is interoperable in many ways with common data model standards (e.g. DataCite, DDI) used in data repositories. It has mappers and refiners available for metadata exchange between those formats. Metax is discipline agnostic with respect to dataset metadata, and is limited in its ability to support the various discipline-specific metadata schema needs. This is by design. As an aggregating metadata warehouse, the primary requirement for Metax is the ability to reach the highest level of interoperability regarding metadata flows between services.
FSD is a CoreTrustSeal-certified trusted research data repository. It serves the international research community as an expert organisation that curates and preserves digital research data collected primarily to study society, population, and cultural phenomena. It runs an infrastructure service for data deposit, discovery, and sharing. It offers information services and data management support throughout the data lifecycle and facilitates easy and responsible reuse of data. FSD is the national service provider for the European SSH research infrastructure CESSDA ERIC (Consortium of European Social Science Data Archives European Research Infrastructure Consortium). FSD’s services are available for researchers, teachers, students, and everyone else interested in issues regarding research data.
FSD’s data curators produce detailed metadata descriptions of all datasets that are ingested. Full metadata records are available in machine-actionable formats. Primarily, the metadata descriptions are harvestable through the Kuha2 OAI-PMH metadata server.
OAI-PMH is a metadata collection protocol based on a client-server architecture, and is widely used for harvesting XML-based metadata over HTTP. Subsequent harvesting runs can use modification timestamps to reduce the resources used for synchronization, and the OAI-PMH repository can also track deleted records to explicitly notify harvesters that a record should be removed. Through Kuha2, the metadata is harvestable in DDI Codebook, OAI Dublin Core, and EAD3 formats. Using OAI-PMH's selective harvesting, the FSD metadata records can be grouped according to study series, metadata language and kind of data. Each group can be harvested individually. Access to different metadata levels and language translations is available as well.
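As a rough illustration of incremental and selective harvesting, the sketch below builds a `ListRecords` request and separates changed records from deleted ones in a response. The verb, the `metadataPrefix`, `set` and `from` arguments, the `status="deleted"` header attribute and the XML namespace come from the OAI-PMH 2.0 specification. The base URL, the set name `language:en` and the sample response are assumptions made up for this example and do not describe the actual Kuha2 endpoint.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, set_spec=None, from_date=None):
    """Build a ListRecords request, optionally limited to a set and a start date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # selective harvesting of one group
    if from_date:
        params["from"] = from_date      # only records changed since the last run
    return base_url + "?" + urlencode(params)


def split_headers(xml_text):
    """Return (changed, deleted) record identifiers found in a response."""
    root = ET.fromstring(xml_text)
    changed, deleted = [], []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            changed.append(identifier)
    return changed, deleted


# Hypothetical endpoint and set name; a hand-written response stands in for HTTP.
url = list_records_url("https://oai.example.org/oai", "oai_dc",
                       set_spec="language:en", from_date="2021-01-01")

SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier>
      <datestamp>2021-01-02</datestamp></header></record>
    <record><header status="deleted"><identifier>oai:example:2</identifier>
      <datestamp>2021-01-03</datestamp></header></record>
  </ListRecords>
</OAI-PMH>"""

changed, deleted = split_headers(SAMPLE)
```

A real harvester would also follow the `resumptionToken` that the protocol uses for paging through large result lists, which is omitted here for brevity.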
We demonstrate what kind of information can be retrieved, annotated and returned to the repository for a single dataset originally deposited in the FSD repository, with an integration pipeline to Fairdata Metax and Research.fi. The flow of records and metadata includes the following steps:
At the ingest phase a dataset and contextualising materials are assessed for suitability for archiving. This includes providing the basic descriptive metadata. The dataset then enters the curation workflow. A curated dataset is published in the FSD Data Catalogue (repository platform) with detailed descriptive, provenance and technical metadata.
An internal study number and a PID (URN) based on it are assigned to the dataset when it enters the curation workflow. Initially, the PID resolves only to a list of upcoming data releases.
Once the dataset is published, the PID assigned to it resolves to the FSD catalogue landing page as shown in Figure 2. This is when the so-called master metadata is formed, and it is always available via this PID pointing to the FSD catalogue.
Information on related publications, with their PIDs, is included in the metadata record during the initial curation stage and later, when researchers report on publications based on that dataset.
Once a dataset metadata record is made available it is updated to the OAI-PMH endpoint (Kuha2) once a day by querying FSD's internal metadata repository and synchronizing the changes made to published metadata records.
Metax harvests the dataset record’s metadata from Kuha2 in DDI 2.5 format and transfers it to the Metax database.
To accommodate the different data models behind the FSD and Metax services, a mapping is made between the DDI 2.5 format and the Metax data model, and an automated refiner takes care of the metadata conversion between the two services. The original PID (URN) of the dataset is preserved and kept as the “primary identifier” for the dataset in Metax as well. Metax also automatically records in the dataset record that it has been harvested from an external source. The harvest also carries over any other PIDs (e.g. related research publications, used infrastructures, funding decisions) that have been added while curating the dataset at FSD.
Metax exposes the dataset record to Research.fi
The dataset record as a whole is exposed to a RabbitMQ queue which Research.fi listens to. All dataset records that have been created, updated or removed are then transferred to the Research.fi database. This includes all the PIDs connected to the dataset, which makes it possible to create connections to e.g. research publications or grants already present in Research.fi. If these are not yet present, they can potentially be added.
Figure 2. Landing page for a specific dataset in the FSD data catalogue service
Research.fi transfers the record to its database and the dataset metadata is shown in the Research.fi portal
All metadata and connected research outputs are still available
A researcher can make new connections to other research outputs via Research.fi
Any annotations to this specific dataset that are already available in Research.fi via other service providers are appended automatically
Once again, the metadata has gone through a refiner to ensure all relevant metadata for this dataset can be shown in Research.fi. Some detailed metadata are left out, as those are not currently of relevance to Research.fi. These include e.g. geographical information that might have been present in the original dataset. However, all this is accessible via the PID and landing page assigned in the original repository.
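The two refining steps in the flow above can be sketched as plain functions over dictionaries. This is a simplified illustration under stated assumptions, not the actual Metax or Research.fi refiners: the field names (`urn`, `geographic_coverage`, `related_pids`, `harvested_from`), the function names and the sample record are all hypothetical.

```python
# Fields the portal is assumed to display; everything else is dropped.
PORTAL_FIELDS = ("primary_identifier", "title", "related_pids", "harvested_from")


def refine_to_warehouse(source_record, source_name):
    """Map a flattened source record to a warehouse record."""
    return {
        # The original PID is kept unchanged as the primary identifier.
        "primary_identifier": source_record["urn"],
        "title": source_record["title"],
        "geographic_coverage": source_record.get("geographic_coverage"),
        # PIDs added during curation travel with the record.
        "related_pids": list(source_record.get("related_pids", [])),
        # Mark the record as harvested from an external source.
        "harvested_from": source_name,
    }


def refine_to_portal(warehouse_record):
    """Keep only the fields the portal shows; details stay in the source."""
    return {k: warehouse_record[k] for k in PORTAL_FIELDS if k in warehouse_record}


# Made-up record for illustration.
source = {
    "urn": "urn:example:dataset-1",
    "title": "Example survey 2020",
    "geographic_coverage": "Finland",
    "related_pids": ["doi:10.0000/example-article"],
}

warehouse = refine_to_warehouse(source, "FSD")
portal = refine_to_portal(warehouse)
```

Since the primary identifier survives both steps, anything left out of the portal record, such as the geographic coverage here, remains reachable by resolving that PID to the original landing page.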
Research.fi, with its vast metadata on other aspects of research (e.g. publications, infrastructures, funding decisions), can then be used by the researcher to make connections to these once the original dataset record has reached Research.fi. This means that if the information on e.g. a funding decision is available in Research.fi and the incoming dataset record includes a PID for it, the linkage between the two is made automatically. This is one of the most pivotal use cases for Research.fi, as it allows PID graph data to be formed automatically even though the original source systems of each record had no previous interaction.
The dataset metadata can be viewed in the Research.fi portal as shown in Figure 3. All related outputs are also linked using individual PIDs.
Both the user interface and the API of Research.fi, which is in development as of 2021, make it possible to follow and use these linkages between research outputs and other concepts (e.g. funding decisions, organizations, departments). This makes the PID graph a reality between seemingly disconnected source systems under Research.fi.
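The automatic linkage can be sketched as an event handler that receives created, updated or removed records and links each one to records that are already known by PID. The message transport (RabbitMQ in the flow described here) is left out so that the example stays self-contained. The class `PortalIndex`, the record layout and the `pending` bookkeeping for PIDs that have not arrived yet are assumptions of this sketch, not a description of the Research.fi implementation.

```python
class PortalIndex:
    """Keeps records by PID and links them when both ends are known."""

    def __init__(self):
        self.records = {}    # PID -> record
        self.links = set()   # frozenset({pid_a, pid_b})
        self.pending = {}    # not-yet-seen PID -> PIDs that reference it

    def handle(self, event, record):
        pid = record["pid"]
        if event == "removed":
            self.records.pop(pid, None)
            self.links = {link for link in self.links if pid not in link}
            return
        # "created" and "updated" are treated alike; updates only add links here.
        self.records[pid] = record
        for other in record.get("related_pids", []):
            if other == pid:
                continue
            if other in self.records:
                self.links.add(frozenset((pid, other)))
            else:
                self.pending.setdefault(other, set()).add(pid)
        # Complete links that earlier records were waiting for.
        for waiting in self.pending.pop(pid, set()):
            if waiting in self.records:
                self.links.add(frozenset((pid, waiting)))


# Made-up identifiers; the funding decision comes from a different source system.
index = PortalIndex()
index.handle("created", {"pid": "funding:example-decision"})
index.handle("created", {
    "pid": "urn:example:dataset-1",
    "related_pids": ["funding:example-decision", "doi:10.0000/example-article"],
})
links_after_dataset = set(index.links)

# The publication arrives later and is linked without any further input.
index.handle("created", {"pid": "doi:10.0000/example-article"})
links_after_publication = set(index.links)
```

The dataset and the funding decision are linked as soon as the dataset arrives, although neither source system knew about the other. The link to the publication is completed once that record is ingested.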
Figure 3. Dataset from FSD harvested to Metax and pushed to Research.fi
Research.fi uses PIDs for publications, other research outputs, researchers, infrastructures, and funding. As a first step, the original dataset record at FSD can be enriched with the newly discovered connection metadata.
In the future, these annotations will be made available via the Research.fi API.
To support the groundwork of annotating and describing datasets and other outputs done by curators and researchers, Research.fi aims to make all PIDs available via a machine-actionable API. The API would provide PIDs and information on how these are connected to other outputs and concepts. When such an API is used by annotation tools, it makes the curation and description of a dataset easier and adds value by making the linkages present at the record level from the initial creation of the master metadata. Tackling these issues at a higher aggregation level is also more cost-efficient overall, and it removes from smaller services and annotation tools the burden of building and maintaining a reliable and broad PID catalogue of various research outputs, PID types and other concepts. This has been demonstrated by Research.fi, which makes the PID graph possible by bridging many otherwise disconnected services.
We have shown the flow of information from the original annotation of a dataset and its incorporation in a geographically bounded and discipline-specific repository, FSD. The dataset metadata was subsequently harvested into a more generic, domain-agnostic research data metadata hub, Metax, which is part of the Fairdata services. The contents of this generic metadata hub were in turn harvested into the aggregating metadata database of Research.fi, which aims to gather all different types of research outputs, funding decisions, research infrastructures and other concepts in one place.
Besides the obvious use case, citation of other papers, the metadata of some publications point to datasets or infrastructures used as a basis for the publication and acknowledge funding decisions that enabled the work. Likewise, data repositories may offer similar possibilities for providing metadata on how research outputs and concepts connect to each other. However, in many cases these references are made only as descriptive texts, which are not machine-actionable. To create a PID graph that connects all these research information types, each reference must be given a persistent identifier as the target of the connection. It is, however, unrealistic to expect a user or even a curator to know and (correctly) specify such identifiers. Therefore, a service is needed to supply PIDs (and possibly some metadata behind each PID) for various research information types. The Research.fi service described in this paper can potentially provide such a service, e.g. by supplying annotation tools with the PIDs of funding decisions to be connected to publications or datasets. While the Research.fi service aims to portray a complete picture of Finnish research outputs, it is limited to research carried out in Finland. OpenAIRE has a broader geographic scope, but might consequently struggle to reach a comparable level of completeness or have other issues with the quality of metadata related to e.g. reference data, ontologies and classifications.
It is imperative that a given research information type is annotated with connections to other research outputs in the context of the original creation or publishing of the outputs. The person specifying this information at that point is most likely to be aware of all these connections. Getting people to return to already described research outputs and retroactively annotate them with linking information is not a safe bet. To address the issue of referring to a yet unpublished research output, such as an unpublished dataset, FSD assigns a PID as soon as a dataset is initially ingested for archiving, even though it is not yet published and does not provide any machine-actionable metadata at the time. This approach makes it possible to provide metadata on linkages to unpublished research outputs for e.g. a publication being submitted to a journal. We find this solution highly relevant to the principle of creating linkages in the context of initial metadata entry for a given research information type.
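The early PID assignment can be pictured as a resolver with two states: before publication the PID resolves to a general page of upcoming releases, and after publication it resolves to the landing page of the dataset. The sketch below only illustrates that behaviour. The class `PidRegistry`, the `urn:example:` prefix and both URLs are hypothetical and do not reflect the actual URN syntax or resolver used by FSD.

```python
UPCOMING_URL = "https://repository.example.org/upcoming"  # hypothetical page


class PidRegistry:
    """Mints a PID at ingest and changes its resolution target on publication."""

    def __init__(self):
        self._minted = set()
        self._landing_pages = {}   # PID -> landing page URL once published

    def mint(self, study_number):
        # The PID is derived from the internal study number at ingest.
        pid = f"urn:example:{study_number}"
        self._minted.add(pid)
        return pid

    def publish(self, pid, landing_page):
        if pid not in self._minted:
            raise KeyError(f"unknown PID: {pid}")
        self._landing_pages[pid] = landing_page

    def resolve(self, pid):
        if pid not in self._minted:
            raise KeyError(f"unknown PID: {pid}")
        # Unpublished datasets resolve to the list of upcoming releases.
        return self._landing_pages.get(pid, UPCOMING_URL)


registry = PidRegistry()
pid = registry.mint("study-0001")
before = registry.resolve(pid)     # citable before the data is released

registry.publish(pid, "https://repository.example.org/catalogue/study-0001")
after = registry.resolve(pid)      # same PID, now pointing at the landing page
```

The identifier cited in a submitted article stays the same throughout, so a link recorded before the release does not need to be revisited afterwards.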
Shortcomings also exist in referencing some research information types, such as software and source code. There are widely used repositories for source code, such as GitHub and Software Heritage, but practices and PIDs for citing source code or software are not adequately established. This should be addressed, as software is currently underrepresented as a research output type that earns researchers merit.
Not all datasets and software will be openly accessible, due to e.g. their sensitive nature or intellectual property rights. However, it should usually be possible to share the metadata. The repositories addressed in this paper handle both approaches: actual data deposition, and providing metadata only for datasets where the actual data is under restricted access.
The research graph approach needs information on how the different research outputs are connected to each other - the key being the PIDs. A lack of PID information is the biggest problem in building comprehensive research information systems based on linked data. In the case presented here, there are many ways forward. The most straightforward way to gain research output connection data is to annotate the dataset with information regarding outputs, at the time when the dataset is first deposited. Then, in an aggregator service, research outputs and concepts may “learn” about previously unconnected datasets coming from seemingly disconnected services.
Many researchers do not deposit datasets until they have exhausted them for their own publication needs. Therefore, there are cases where a PID for the dataset is not available at the time when an article is published. We presented a solution that prevents this: a repository can assign PIDs to datasets that are being ingested but are not yet published. Such outputs may also receive connection information at a later date by proxy, e.g. when both the publication and the dataset are referenced with a PID in the reporting of the project that funded both. To ensure linking between two known research outputs with PIDs, a research infrastructure such as FSD can access Research.fi publications via the API. Making the linkages of data and research outputs visible is important to such repository infrastructures as well. The person depositing a dataset in the repository is likely to be the party who knows which funding decision to acknowledge for funding the work, which research infrastructures were used to collect the data, which subsets the data are composed of, or what source code is related to that dataset. An API is being built for Research.fi to supply services with the PIDs of varied research item types, in support of richer and better-quality metadata.
1. Sīle L, Guns R, Sivertsen G, Engels T. European Databases and Repositories for Social Sciences and Humanities Research Output. figshare; 2017 Jul. https://doi.org/10.6084/m9.figshare.5172322.v2
2. Petr M, Engels TCE, Kulczycki E, Dušková M, Guns R, Sieberová M, et al. Journal article publishing in the social sciences and humanities: A comparison of Web of Science coverage for five European countries. Bornmann L, editor. PLoS One. 2021 Apr 8;16(4):e0249879. https://doi.org/10.1371/journal.pone.0249879
3. Jeffery K, Houssos N, Jörg B, Asserson A. Research information management: the CERIF approach. Int J Metadata, Semant Ontol. 2014;9(1):5. https://doi.org/10.1504/IJMSO.2014.059142
4. Jörg B. CERIF: The Common European Research Information Format Model. Data Sci J. 2010 Jul 24;9:CRIS24–31. https://doi.org/10.2481/dsj.CRIS4
5. Fenner M, Aryani A. Introducing the PID Graph — FREYA [Internet]. 2019 [cited 2020 Oct 13]. Available from: https://www.project-freya.eu/en/blogs/blogs/the-pid-graph https://doi.org/10.5438/jwvf-8a66
6. Manghi P, Bardi A. The OpenAIRE Research Graph - Opportunities and challenges for science. In: International Open Science Conference (OS). Berlin, Germany; 2019. https://doi.org/10.5281/ZENODO.2600275