Abstract. In human subject research, various data about the studied individuals are collected. Through re-identification and statistical inferences, this data can be exploited for interests other than the ones the subjects initially consented to. Such exploitation must be avoided to maintain trust with the researched population. We argue that keeping data-access policies up-to-date and building accountability on research data processing can reflect subjects’ consent and mitigate data misuse. With accountability in mind, we are building Lohpi: a decentralized system for research data sharing with up-to-date access policies. We demonstrate our initial prototype with timely delivery of policy changes along with minimal access control overhead.
Corresponding author: email@example.com
ICT-centered research methodologies are being quickly and widely adopted in the social sciences and humanities, fueled by advances in big-data systems, knowledge extraction, and machine-learning methods. Some researchers have raised concerns about this rapid adoption of new and unfamiliar technologies, as they bring new challenges to research, in particular with regard to privacy and compliance with laws and regulations. If ICT-centered research methodologies are not implemented correctly, the researcher may not obtain the required ethics approval and may fail to establish the trust needed to recruit the volunteer participants that fields such as epidemiology, sports science, psychology, the social sciences, and the humanities heavily rely on.
Recent regulations such as the GDPR require explicit informed consent from participating individuals (hereinafter referred to as subjects) for collecting and processing their data. Researchers and institutions must ensure that subjects' sensitive data are handled meticulously. The World Health Organization states that an ethics committee must protect subjects from any anticipated harm.
Perhaps the most common techniques for processing data in compliance with these laws are anonymization and aggregation, both often recommended by ethics committees. However, weaknesses in known methods have led to multiple privacy violations. Advances in statistical inference and re-identification attack methodologies have made it relatively easy to identify individuals discussed in a study. Differential privacy is often hailed as one of the most advanced solutions for protecting an individual's privacy in a dataset. However, it is difficult to apply differential privacy in every possible scenario. Yang et al. and Garfinkel et al. have highlighted issues with differential privacy: existing protocols assume a relatively simple data model with a centralized database, and misunderstandings about randomness and noise, limited access to micro-data, and accuracy are among the raised concerns.
Kroll et al. argue the need for global visibility into data usage to test the next generation of privacy-enhancing technologies. Researchers argue that building accountability around the applicable laws and the dynamic privacy-risk landscape is the way forward. Subjects' perception of privacy might change over time and depends on the purpose for which data are collected. Although data-analysis techniques, such as statistical inference, can blur the lines between sensitive and non-sensitive data, the problems of informed consent, individual privacy, harm, and data re-identification are evident in big-data computing. Inspired by Shneiderman, we argue for auditing, independent oversight, and trustworthy certification for research data sharing and processing.
In this paper, we present Lohpi: a system for safe and accountable research data sharing, enabled by a secure network substrate for distributing and applying up-to-date access policies. Lohpi takes a decentralized approach in which research institutions process data on their internal computing infrastructure and maintain control of valuable data assets. The key contribution of Lohpi is our compliant data-analytics framework that encapsulates and manages distributed data assets. Data-access policies reside as meta-code stored at the file-system level, along with the data they govern, and are updated using gossip-based communication. We present our initial results and discuss future work.
Data-driven research in the social sciences and humanities relies heavily on the voluntary participation of subjects. Metrics from the Dataverse project show that more than 21% of datasets relate to the social sciences and 5% come from medicine, health, and life sciences. The subjects of these studies contributed different types of data. Personally identifiable information (PII), such as contact information, can potentially identify an individual and is typically anonymized to safeguard a subject's privacy. The collected data remain publicly available in repositories such as Dataverse. However, multiple data sources can be linked without a subject's knowledge or consent, which may result in re-identification of the subject. Protecting data shared on a global scale has been identified as one of the key challenges in the era of Big Data.
Typically, research projects concerning humans require regulatory approval from an ethics committee, a data protection officer (DPO), or an institutional review board (IRB). We collected data from annual reports of the Norwegian ethics committee (REK) and identified two key metrics: new projects and project changes. Figure 1 shows the growing number of changes to existing projects. We contacted REK to understand what is considered a project change. A project change includes changes to the people who have access to the collected data (new researchers), newly discovered risks for subjects (new threats), and even changes in the conditions for dispensation from professional secrecy requirements (new laws). These changes require approval by a governing body.
Data collected for a specific research context are often used beyond their initially specified goals. Moreover, data without any access control can be exploited by third parties. A dataset downloaded from a public repository might not reflect the current state of the data-sharing policies approved by an ethics committee. We are not aware of any mechanisms that an ethics committee can use to verify compliant data processing by researchers. Anonymized data available in repositories such as Dataverse can potentially be re-identified and misused. Consequently, we have built Lohpi as a platform for compliant data usage among researchers, which may also identify a rogue researcher.
The FAIR Guiding Principles are becoming an established standard for managing research data. The principles can be applied to data assets to make them Findable, Accessible, Interoperable, and Reusable. Holub et al. proposed an extension to the FAIR data-management principles that accounts for the privacy risks associated with research data. They map the flow of data from participants to research data repositories and highlight the trust and privacy aspects. A research project is considered compliant if consent is obtained either from the participants or from an ethics committee. Holub et al. also highlight the following competing interests in human data use: (a) protection of individuals' privacy, (b) reuse of data, and (c) complex ownership and economic interests; they conclude that anonymization cannot always protect individuals' privacy when data are shared. Instead, they advocate compliance checks on research data before the data are shared. By checking data usage against approved policies at any stage of a project, Lohpi extends this notion of compliance to the entire lifetime of the data. Note that Lohpi is designed not to limit collaboration among researchers. On the contrary, by building on accountability and oversight of research data processing, we conjecture that trust between researchers and the public can be improved. That may lead to increased participation in fields such as the social sciences and humanities, which rely heavily on public participation.
In this section, we present our vision for Lohpi, referring to Figure 2, which depicts a typical research data collection process with this system. A principal investigator (PI) or a team of co-PIs formulates a project protocol outlining research data collection and processing. The protocol details the data that will be collected and how it will be processed and stored. It also describes collaborators and the measures taken to protect subjects' privacy while processing and sharing data. The project protocol is sent to an ethics committee (or some other regulatory unit). The ethics committee reviews the protocol and ensures that the data collection, processing, and sharing within the scope of the research project comply with applicable laws and regulations. The committee also reviews potential threats to the subjects' privacy and the measures put in place to safeguard it. Approval ensures that these measures are transformed into a verifiable data-access policy. It also means that, at any time, a competent authority or a subject can request a compliance report on the project's data.
Ausloos et al. argue that defining policies for responsible data-driven research should be an iterative process among the stakeholders. The data-access policy approved by the ethics committee is attached to the data collected in the project. The PIs retain their data assets, which are governed by the approved policies. These individually owned data assets are connected through gossip communication, and the network as a whole provides a platform for researchers to perform analyses that comply with the applicable laws and the subjects' consents. Any changes to these policies, whether the revocation of a consent or a newly added collaborator, are disseminated through the network via a gossip protocol. The stakeholders have oversight over the data they are responsible for and can iterate over compliant data-access policies as the threat models change.
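To make the dissemination model concrete, the following toy simulation shows push-style gossip of a policy update through a network of nodes. It is a sketch only, not Lohpi's Fireflies-based implementation; the function name and parameters are our own.

```python
import random

def gossip_rounds(n, fanout, seed=None):
    """Simulate push-gossip dissemination of a policy update.

    n:      total number of data-storage nodes (node 0 gets the
            update first, standing in for the policy store's multicast)
    fanout: number of random peers each informed node pushes to per round
    Returns the number of rounds until every node holds the update.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        newly = set()
        for _node in informed:
            # each informed node forwards the update to `fanout` random peers
            for peer in rng.sample(range(n), fanout):
                newly.add(peer)
        informed |= newly
        rounds += 1
    return rounds
```

With a modest fan-out, the number of rounds grows slowly with network size, which is what makes gossip attractive for the timely delivery of policy changes.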
Even though the data collected through a project might not change, the policies might. Lohpi enables subjects to have their requirements translated into verifiable policies. Researchers might not be familiar with the laws that apply to their work and might be breaching them unknowingly. Lohpi enables compliant data processing and sharing, which can prevent such breaches by keeping the policies up-to-date. The applicable laws endorsed by the ethics committee and the subjects' consents regarding their data form an agreement for data processing. Lohpi keeps this agreement enforced by continuously monitoring and updating data-access policies. A breach of trust can damage the relationship with the subjects. A compliant data-processing environment enabled by Lohpi mitigates such risks and improves the relationship between researchers and the researched population.
We now present an overview of Lohpi. Lohpi is intended to operate as a permissioned system: one that requires prior approval before being used by research organizations that cooperate on a potentially large portfolio of research projects. Institutions host their project data at storage nodes, typically located either on secure campus infrastructure or in a public cloud. These nodes form a data-storage substrate that runs our secure Fireflies overlay-network protocol. This decentralized network of nodes stores the research data along with their most recent data-access policies. Figure 3 shows an overview of Lohpi.
Researchers interested in accessing data are required to authenticate with one or more institutions. Lohpi allows institutions to join the network and integrate their identity-management systems based on OpenID. The stakeholders can issue policy changes that are propagated to the data-storage network as gossip messages over the underlying Fireflies network. The compliance engine enables researchers to analyze the data and stakeholders to perform audits. As argued earlier, audits can give stakeholders a clearer picture of data use. A policy change is stored at the policy store, which also propagates these changes into the data-storage network as gossip messages.
We now briefly explain the components of Lohpi (see Figure 3). Subjects are the researched population that contributes data about themselves in a project. An ethics committee, also known as a REC, IRB, or DPO, is charged with ensuring that a research project complies with all laws, regulations, and ethical standards. Therefore, throughout this paper, we show the functions of Lohpi in the context of an ethics committee concerned with research data sharing and processing. A data-storage node stores data from one or more studies. The nodes are managed by the institutions themselves, with the Lohpi communication substrate running on them. They can be hosted on an institution's own infrastructure or on a public cloud platform such as Microsoft Azure, Amazon Web Services, or Google Cloud. The nodes form a data-storage network based on Fireflies and use TLS-based secure communication between them. The policy store stores and propagates policies for the research data held in the data-storage network. It also keeps the history of policies in a git-like manner, and it can probe the data-storage network for configuration issues or communication losses. The compliance engine performs audits requested by the stakeholders. For brevity, the components are discussed only briefly in this paper; a detailed description is available in .
Instead of a centralized access-control mechanism, each node has embedded data-access control. The policies for this control are updated via gossip messages; once the policies reach the target node, they are encoded into the file system. In Section 5, we show the overhead of this access control. Lohpi is designed to support compliant processing by a benign user (researcher); a highly knowledgeable attacker, or someone with physical access to the computer network, can bypass these mechanisms. In addition to providing compliant access to researchers, Lohpi allows stakeholders to request compliance reports. These reports can be predefined to obtain a holistic view of data usage. Subjects may be interested in seeing what their data are being used for and in updating their policies.
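As a rough illustration of policies living next to the data they govern, the sketch below checks a per-file policy before serving a read. The JSON sidecar file and its field names are hypothetical stand-ins for Lohpi's meta-code mechanism, not its actual on-disk format.

```python
import json
from pathlib import Path

def read_with_policy(path, requester, purpose):
    """Read a data file only if its sidecar policy permits it.

    Illustrative only: here a JSON sidecar (<file>.policy.json) stands
    in for Lohpi's file-system-level policy encoding. The field names
    `allowed_researchers` and `allowed_purposes` are hypothetical.
    """
    policy = json.loads(Path(str(path) + ".policy.json").read_text())
    if requester not in policy.get("allowed_researchers", []):
        raise PermissionError(f"{requester} is not permitted to access {path}")
    if purpose not in policy.get("allowed_purposes", []):
        raise PermissionError(f"purpose '{purpose}' not covered by consent")
    return Path(path).read_bytes()
```

Because a gossiped policy change simply rewrites the sidecar, a revoked consent or removed collaborator takes effect on the very next read.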
A key property of Lohpi is the reliable dissemination of policy updates. We therefore evaluate the propagation of updates as gossip messages, issued by an ethics committee and introduced into Lohpi by the policy store. We designed a set of micro-benchmarks that measure the time required to propagate a data-access-policy change under different conditions.
Let α be the fraction of the data-storage network that must receive a gossip message for it to be considered successful, and let δ be the number of nodes to which the policy store multicasts the update directly. For example, if the policy store multicasts the message to one node, δ = 1. We begin by simulating growth in the total number of nodes, n, in Lohpi. We assign a static value to α and let the policy store introduce policy updates. For a policy update to be considered successful, the policy store must receive acknowledgments from ⌈α · n⌉ different nodes. We measure the time elapsed from when the policy store multicasts the message to the set of δ nodes until it has received the acknowledgments. We arbitrarily set the message size to 512 KiB. We take each measurement at least three times to estimate the uncertainty, which we plot using error bars. After recording the first set of readings, we double the value of δ and take a further set of readings.
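The success criterion above, acknowledgments from a fixed fraction of the network, can be written as a one-line helper. This is our reading of the setup; the function name is ours, and the two-thirds default mirrors the acceptance level used later in the evaluation.

```python
import math

def acks_required(n, alpha=2/3):
    """Number of acknowledgments the policy store waits for before
    declaring a policy update successfully propagated.

    n:     total number of nodes in the data-storage network
    alpha: fraction of the network that must confirm receipt
           (2/3 matches the acceptance level in the evaluation)
    """
    return math.ceil(alpha * n)
```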
We also evaluate the overhead added by the access controls. First, we establish a baseline by reading multiple files from the file system without the access controls introduced by Lohpi. After enabling the access controls, we perform the same read operations and measure the time. We also measure the time required to read a large chunk of 1 GiB of data.
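The overhead measurement can be reproduced in spirit with a small timing harness like the one below. The function `time_reads` and its parameters are illustrative and are not the benchmark code used in the paper; the optional `check` callable stands in for Lohpi's per-file policy check.

```python
import time

def time_reads(path, check=None, repeats=3):
    """Time `repeats` full reads of `path`, optionally running an
    access-control `check` before each read.

    Returns one sample per repeat, so that a mean and error bars
    can be computed as in the paper's evaluation.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        if check is not None:
            check(path)  # stands in for Lohpi's file-system policy check
        with open(path, "rb") as f:
            f.read()
        samples.append(time.perf_counter() - start)
    return samples
```

Comparing the samples with and without `check` enabled gives the relative access-control overhead for a given file size.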
Figure 4 shows the time required to reach at least two-thirds of the data-storage network. We observe that the time required to reach the acceptance level grows exponentially with the number of nodes (n) in the network. We also observe that increasing the value of δ propagates the message faster through the network; however, the gains are not significant at lower values of δ. Only with δ = 32 do we start to observe significant gains. The variability in the time also increases with the size of the network. Failures in the network or at nodes can increase the propagation time, but this can be mitigated.
We also evaluated the access-control overhead for file read operations. The results (Figure 5) show that the overhead is significant (15%) when files are smaller than 64 KiB. As the file size grows, the overhead becomes negligible.
Dataverse is a centralized repository where researchers can deposit their data and add custom licenses. Once a dataset has been downloaded from Dataverse, there are no mechanisms to restrict its further sharing through other means, such as FTP or a USB drive. Woolley et al. introduced the Automatable Discovery and Access Matrix (ADA-M), which allows stakeholders to confidently track, manage, and interpret applicable legal and ethical requirements. ADA-M metadata profiles allow an ethics committee to evaluate and approve information models linked to a dataset. ADA-M facilitates the responsible sharing outlined in a profile and allows the custodian to check accesses against regulatory parameters. However, the authors do not describe any functionality for issuing updates to a profile. Alter et al. presented the Data Tags Suite (DATS), which can describe data access, use conditions, and consent information. DATS provides a metadata-exchange format without any compliance-checking mechanisms. Havelange et al. developed a blockchain-based smart contract that attaches license requirements to a dataset. The datasets are encrypted, and an ADA-M profile is attached to each dataset. A researcher accepts the contract and receives a token to decrypt the dataset, and the researcher's data accesses are checked against the ADA-M profile for compliance. However, this approach requires each researcher, dataset provider, and supervisory authority to have a node on the Ethereum blockchain network, and the authors do not provide any evaluation of their work.
Our prototype implementation demonstrates that it is possible to propagate updated policies in close to real time. We conjecture that even with a larger distributed storage network, policy changes can be propagated within minutes. We also conjecture that transparency in research data processing can increase trust in research institutions. Adapting protection mechanisms to newly discovered threats in order to protect the individuals involved in research can help sustain public trust. With OpenID, Lohpi can integrate with the existing authentication services used at various institutions.
While Lohpi's approach is not centralized, we conjecture that it can provide abstractions for an ethics committee that periodically measure compliance on the data-storage network and mitigate privacy risks. An expressive policy language like Guardat can be realized using meta-code. We are also interested in building a tool for expressing a research-data-usage protocol, both to streamline ethics-committee approvals and to verify compliance against an approved protocol, and in making existing research data available on Lohpi.
We presented a distributed infrastructure to support compliant data analytics for human subject research. We demonstrated that a distributed gossiping network can ensure the timely delivery of policy changes. The architecture can scale even when a research project spans multiple regulatory bodies. In Lohpi, a data storage node can run on the public cloud or on-campus hardware. Institutions can easily join the network without the need to move their research data.
This work was funded in part by the Research Council of Norway, project numbers 263248 and 275516. We thank Katja Pauline Czerwinska for her assistance with the graphics.