Research project
Project SENSYN
Making sensitive data reusable through synthetic data generation
- Duration
- 2024
- Contact
- Katharina Krüsselmann
- Funding
- NWO Open Science Fund
About the project
Data sharing is an important pillar of open science, yet privacy regulations hinder the sharing of highly sensitive data. Project SENSYN addresses this issue by introducing and stimulating the use of synthetic datasets as a solution, through guidelines and a proof-of-concept study using Dutch homicide data.
Description of the project
Making data findable, accessible, interoperable and reusable (FAIR) is a pillar of Open Science. Yet privacy regulations hinder the implementation of FAIR principles in research fields working with sensitive data, such as crime, finance, or public health. Sharing individual-level data is crucial not only for collaboration between research institutions, for enabling replication studies, and for increasing the transparency and accountability of research processes, but also for informing policymaking and for engaging other stakeholders and the general public in research.
To overcome the problem of sharing sensitive data, differentially private synthetic datasets have emerged in recent years as a successful tool for making detailed, highly sensitive data more accessible, in particular amongst data scientists. A synthetic dataset is an artificial dataset that has the same statistical properties as the original dataset, but because each data point is simulated, it cannot be traced back to individuals protected under privacy regulations. In the context of Open Science, synthetic datasets have been praised for opening new possibilities for the replication of studies and for researchers to share their data under FAIR principles. However, the use of this novel technique is not yet widespread, particularly in the social sciences. In addition, very little attention has been directed at the possibilities of using synthetic datasets to open up research processes and findings to non-academic audiences.
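As a rough illustration of the idea, the sketch below generates differentially private synthetic values from a noisy histogram. It is a minimal example under stated assumptions, not the method used by Project SENSYN: the function names, the age bins, the privacy parameter `epsilon` and the input values are all made up for illustration, and the privacy guarantee assumes that neighbouring datasets differ by adding or removing one record.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two exponentials with mean `scale` follows Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def synthesize_ages(ages, bins, epsilon, n_synthetic, seed=0):
    """Draw synthetic ages from a Laplace-noised histogram of the real ages.

    Hypothetical helper: `bins` must be fixed in advance (not derived from the data).
    """
    rng = random.Random(seed)

    # 1. Count real records per bin.
    counts = Counter()
    for age in ages:
        for lo, hi in bins:
            if lo <= age < hi:
                counts[(lo, hi)] += 1
                break

    # 2. One record changes one count by 1, so Laplace noise with
    #    scale 1/epsilon is added to each count; negatives are clipped to zero.
    noisy = [max(0.0, counts[b] + laplace_noise(1 / epsilon, rng)) for b in bins]
    if sum(noisy) == 0:
        noisy = [1.0] * len(bins)  # fall back to uniform weights

    # 3. Sample bins in proportion to the noisy counts, then a value inside each bin.
    chosen = rng.choices(bins, weights=noisy, k=n_synthetic)
    return [rng.randrange(lo, hi) for lo, hi in chosen]


# Made-up input values, for illustration only.
example_ages = [19, 23, 24, 31, 35, 38, 42, 47, 55, 61]
example_bins = [(0, 20), (20, 40), (40, 60), (60, 100)]
synthetic_ages = synthesize_ages(example_ages, example_bins, epsilon=1.0, n_synthetic=50)
```

Every synthetic value is drawn from the noisy histogram rather than copied from a real record, which is what prevents tracing it back to an individual. A smaller `epsilon` adds more noise and therefore gives stronger privacy at the cost of statistical fidelity.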
With Project SENSYN, a team of social scientists and data scientists aims to encourage and assist the use of synthetic datasets (a) amongst researchers handling personal and highly sensitive data and (b) for making data accessible to the general public. On the one hand, the use of synthetic data is encouraged through a proof-of-concept synthetic dataset built from highly sensitive homicide data from the Netherlands. On the other hand, the project will result in accessible protocols and guidelines for other researchers, as well as online workshops and videos on the use of synthetic data. All outputs of the project, including the synthetic dataset, will be displayed on an interactive, public online information hub.
Project SENSYN consists of three specific tasks:
1. Evaluation of existing ways of creating synthetic data in the light of highly sensitive data.
Several methods exist for generating synthetic datasets. Yet not all generation methods adhere to data privacy regulations or are suitable for all types of data. In this first step, common generation methods are evaluated based on (a) adherence to data privacy regulations, (b) utility of the data for research and other purposes and (c) accessibility to researchers with no technical understanding of synthetic data generation.
2. Generation of an interactive, accessible online platform with a synthetic dataset
To demonstrate the use of differentially private synthetic datasets, an interactive and accessible online platform will be built that showcases the synthetic dataset based on the Dutch Homicide Monitor. In addition to accessing the detailed, disaggregated dataset, users of the platform will be able to search for information, such as the mean age of homicide victims for a specific time period. Users will also be informed about the (dis)advantages of this synthetic dataset and instructed on the use of the data through accessible, short videos aimed at non-academic audiences. The platform will be built with the Streamlit framework, and the code will be shared through GitHub, thereby enabling others to adapt this process.
3. Creation of guidelines for (a) other researchers and (b) non-academic audiences, including step-by-step guidelines, roadmaps and videos.
Throughout the first two steps, the project team will document their findings for researchers working with highly sensitive data, to encourage and enable the adoption of FAIR principles through the generation of synthetic data. This documentation will result in a number of guidelines that summarize different generation techniques and detail which data is suitable, which data privacy regulations need to be accounted for and how data needs to be prepared. For non-academic audiences, short videos will be produced to explain how synthetic datasets can be used to engage in and inform public debates. These guidelines will be made accessible as well. Guidelines for academic audiences will be published on GitHub and through Leiden University’s website. Guidelines for non-academic audiences will be published on the online platform.
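The kind of query described under task 2 (the mean age of victims in a given period), combined with a very simple utility check of the sort task 1 calls for, could be sketched as follows. This is an illustration only: the field names `year` and `victim_age`, the helper names and all records are hypothetical and do not reflect the actual structure of the Dutch Homicide Monitor or the project's code.

```python
from statistics import mean


def mean_victim_age(records, start_year, end_year):
    """Mean victim age for cases with start_year <= year <= end_year, or None if no case matches."""
    ages = [r["victim_age"] for r in records if start_year <= r["year"] <= end_year]
    return mean(ages) if ages else None


def utility_gap(original, synthetic, start_year, end_year):
    """Absolute difference between the original and synthetic answer to the same query."""
    real = mean_victim_age(original, start_year, end_year)
    synth = mean_victim_age(synthetic, start_year, end_year)
    if real is None or synth is None:
        return None
    return abs(real - synth)


# Made-up records, for illustration only.
original_records = [
    {"year": 2010, "victim_age": 30},
    {"year": 2012, "victim_age": 40},
    {"year": 2018, "victim_age": 50},
]
synthetic_records = [
    {"year": 2010, "victim_age": 32},
    {"year": 2012, "victim_age": 42},
    {"year": 2018, "victim_age": 47},
]

print(mean_victim_age(synthetic_records, 2010, 2015))
print(utility_gap(original_records, synthetic_records, 2010, 2015))
```

A small gap for the queries users are likely to run suggests the synthetic dataset retains its utility for those queries. In a Streamlit app, the two year bounds could come from an input widget such as `st.slider`, with only the synthetic records ever exposed to users.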
The project is still running. Check this website again later this year for more information about upcoming online workshops and for links to the guidelines on synthetic datasets and to the online information hub.