Gjalt-Jorn Peters’ website - You (or your university) don’t own your data

This is a Mastodon thread. The original thread is available here:

This is a brief thread/blog post about ownership of data. I created it because of efficiency (basically, I find myself explaining this a lot). Apparently, there are a lot of misunderstandings around data ownership. Here, I hope to correct some misunderstandings.

The TL;DR: the only data you own are your personal data as per the GDPR. If you process others’ personal data, you don’t own them, but just temporarily process them. Data that are not personal data (i.e. that are not about a person) are about the world and cannot be owned. All data fall within one of these two categories, and as such, only personal data are somebody’s property. Your institution (or you) cannot own anonymous data.

Data about the world are facts, and facts cannot be owned: they are defined as existing in the public domain. For example, you cannot claim to own the fact that the earth revolves around the sun. For more information see https://copyrightalliance.org/faqs/whats-not-protected-by-copyright-law/.

If you conduct a qualitative study, audio or video recordings can easily be considered personal data (not always, but it’s generally best — and easiest — to assume they will be). This means you, as a temporary data processor, can only do with the data what you agreed to with the participants whose data you’re processing (except in some edge cases where your grounds for processing the data are not consent).

Once the data are transcribed into a text file, the data will often still enable identifying the participant who was interviewed (with reasonable effort; arguably, anything can be identified when infinite resources are thrown at it). That makes everything in the transcript about that participant personal data as per the GDPR, and so you cannot own those data.

In most cases, it’s possible to anonymize such data. Campbell, Javorka, Engleton, Fishwick, Gregory and Goodman-Williams describe a systematic approach to this end in https://doi.org/k7fm.

For qualitative data, it is possible that the interviewees could claim the copyright for the narrative they provided. That would still prevent you from making the data public, even after anonymization. Therefore, it’s important to provide for this in your informed consent, which then serves three purposes: the ethical informed consent as per most ethics committees’ requirements; the consent as per the GDPR; and the agreement to deposit the anonymized data in the public domain.

To make it easy to create such informed consent forms, Szilvia Zorgo, James Green and I created an example informed consent insert that covers this. It is freely available at https://behaviorchange.eu/files/open-consent.odt.

Especially in the social sciences, quantitative data are much easier. This is because of the low quality of our “measurement instruments” (the scare quotes reflecting the question of whether it’s reasonable to consider this measurement).

The large measurement error, and often even larger transient measurement error (https://doi.org/mkqn), are usually problematic, decimating effect sizes and requiring considerable sample sizes — but when it comes to identifiability, they are a boon.

A spreadsheet with answers to, for example, three measurement instruments of psychological constructs contains no personal data (the low reliability of these instruments means that even if people were to complete the same questions again and you’d use those results to try to identify them, you wouldn’t be able to locate their row in the spreadsheet since their answers will differ).

Therefore, for quantitative data, the best approach is to collect anonymous data: data that cannot be linked to the persons providing the data. In this respect, the concept of k-anonymity comes in handy (see https://en.wikipedia.org/wiki/K-anonymity).

K-anonymity means that there are always at least k participants that share any combination of data points that occurs in the dataset.
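As a minimal sketch of that definition, the k of a dataset can be computed as the size of the smallest group of rows that share the same combination of values. The function name `k_anonymity`, the column names, and the toy rows below are all hypothetical and only serve as illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    combination of values on the given columns (0 for an empty dataset)."""
    combos = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(combos.values()) if combos else 0


# Hypothetical toy data, not taken from any real study.
participants = [
    {"age_group": "21-30", "gender": "female", "region": "south"},
    {"age_group": "21-30", "gender": "female", "region": "south"},
    {"age_group": "31-40", "gender": "male", "region": "north"},
    {"age_group": "31-40", "gender": "male", "region": "north"},
    {"age_group": "31-40", "gender": "male", "region": "south"},
]

# One participant has a unique combination of all three columns, so k is 1.
k_all = k_anonymity(participants, ["age_group", "gender", "region"])

# Without the region column, the smallest group has two members, so k is 2.
k_without_region = k_anonymity(participants, ["age_group", "gender"])
```

In this toy example, a single row with a unique combination is enough to bring k down to 1, which is why coarsening or dropping a column can raise k.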

For example, suppose you collected age, gender, and region. If you store age as date of birth, your sample would need to be pretty huge to no longer have unique dates of birth. If you store age in years at the time of data collection, you achieve k-anonymity with smaller datasets; and if you store age in decades at the time of data collection (0-10, 11-20, 21-30, etc), dozens of participants suffice.
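The effect of coarsening age can be sketched as follows. The helper names, the collection date, and the four dates of birth are made up for illustration; the bands follow the 0-10, 11-20, 21-30 scheme mentioned above.

```python
from collections import Counter
from datetime import date


def age_in_years(birth: date, collected: date) -> int:
    """Age in completed years at the time of data collection."""
    years = collected.year - birth.year
    if (collected.month, collected.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_band(age: int) -> str:
    """Map an age in years to the bands 0-10, 11-20, 21-30, and so on."""
    if age <= 10:
        return "0-10"
    lower = ((age - 1) // 10) * 10 + 1
    return f"{lower}-{lower + 9}"


def smallest_group(values) -> int:
    """Size of the smallest group of identical values."""
    return min(Counter(values).values())


# Hypothetical collection date and dates of birth.
collected = date(2024, 1, 15)
births = [date(1990, 3, 2), date(1990, 11, 20), date(1988, 6, 5), date(1988, 9, 9)]

ages = [age_in_years(birth, collected) for birth in births]
bands = [age_band(age) for age in ages]

k_by_birth_date = smallest_group(births)  # every date of birth is unique
k_by_year = smallest_group(ages)          # two participants per age in years
k_by_band = smallest_group(bands)         # all four fall in the same band
```

With these four made-up participants, the smallest group grows from 1 (dates of birth) to 2 (age in years) to 4 (age bands), without adding a single participant.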

The same goes for gender: if you measure gender in three categories (non-binary, female, and male), you achieve k-anonymity with smaller sample sizes than if you provide an open text field where participants can specify their gender (tempting to say “how they identify”, but that might be confusing in this context).

Of course, you can always include such an open question and then have an anonymization procedure where you categorize the answers into a smaller number of categories, for example using the {gendercoder} R package (unfortunately not on CRAN: see https://docs.ropensci.org/gendercoder).
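To give an idea of what such a categorization step could look like, here is a minimal sketch in Python. It does not use or reproduce the {gendercoder} package or its dictionaries: the lookup table and the function `recode_gender` are hypothetical and deliberately tiny, and a real procedure would need a much larger, carefully reviewed dictionary.

```python
from typing import Optional

# Hypothetical lookup table for illustration only.
CATEGORY_BY_ANSWER = {
    "female": "female",
    "f": "female",
    "woman": "female",
    "male": "male",
    "m": "male",
    "man": "male",
    "non-binary": "non-binary",
    "nonbinary": "non-binary",
    "nb": "non-binary",
}


def recode_gender(answer: str) -> Optional[str]:
    """Map a free-text answer to one of three categories.

    Unrecognized answers return None, so they can be reviewed by hand
    instead of being guessed at.
    """
    return CATEGORY_BY_ANSWER.get(answer.strip().lower())


answers = ["  Woman ", "NB", "male", "something else"]
recoded = [recode_gender(answer) for answer in answers]
```

Returning `None` for anything outside the table is a design choice in this sketch: it keeps the original open answers out of the published dataset while leaving the decision about unusual answers to a human.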

Whether participants can be identified also depends on your sampling frame. If you sample from “people in the Netherlands who use drugs and are between 20 and 40 years of age”, your sampling frame is much larger than if you sample from “primary school teachers in Valkenburg” (a village in the south of the Netherlands that has four primary schools; also see https://nltimes.nl/2023/12/01/valkenburg-among-10-trendiest-destinations-worldwide-bookingcom-survey).

After all, the larger the group you sample from, the harder it is to identify any group member given a set of attributes. There are many more people with red hair among “people in the Netherlands who use drugs and are between 20 and 40 years of age” than among “primary school teachers in Valkenburg”.
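The arithmetic behind this can be sketched in a few lines. The frame sizes and the share of people with the attribute below are invented purely for illustration and are not real statistics about either group.

```python
def expected_matches(frame_size: int, attribute_share: float) -> float:
    """Expected number of people in the sampling frame who share an
    attribute, assuming the attribute is equally common in every frame."""
    return frame_size * attribute_share


# Hypothetical numbers, chosen only to show the contrast.
large_frame = 100_000  # a large national sampling frame
small_frame = 40       # a handful of teachers in one village
share = 0.02           # assumed share of people with the attribute

matches_large = expected_matches(large_frame, share)  # about 2000 people
matches_small = expected_matches(small_frame, share)  # fewer than one person
```

When the expected number of matches drops to around one or below, a single attribute may already be enough to single somebody out, which is the situation the small sampling frame illustrates.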

For more details on anonymization, see https://behaviorchange.eu/posts/2019-09-a-poor-person-s-guide-to-open-sciencing-gdpr-compliant-data-management.html.

Once anonymized, a dataset contains no personal data, so it only contains facts about the world. Therefore, you or your university cannot claim to own those data.

(And if it had contained personal data, you or your university still wouldn’t be able to claim to own those data; personal data are only owned by the relevant person, and at most can be temporarily processed by others.)

There is one exception to this relatively simple state of affairs. A person or university (or other organization) can invoke what we’ll call “database right”: this is an instrument created to make it worthwhile for commercial organizations to invest in compiling large datasets (e.g. the yellow pages, https://en.wikipedia.org/wiki/Yellow_pages).

This means you claim to own a specific configuration of data that otherwise exists in the public domain. While this is a perfectly legal instrument, invoking it is incompatible with open science principles, and in the case of the Netherlands, with the Dutch Code of Conduct on Research Integrity, which mandates transparency. As such, this exception remains, um, academic 😬

I hope this helps to clarify why you never need to engage in discussions with your university administrators, or with external organizations, about “who owns the data”. The data cannot be owned by anybody other than the person who provided the data, and then only if the data are personal data as per the GDPR (i.e. identifiable data).

As a final note: the fact that data exist in the public domain and cannot be owned does not mandate publication of the data. An organization is under no obligation to make data they collect public, even if they cannot claim to own the data. There are no obstacles to publishing the data, but that doesn’t mean organizations are forced to do so.

For researchers, the imperative to make data public follows instead from scientific ethical integrity principles, as well as the implicit contract entered into with participants: that the data will be used for scientific research (which includes the ability for close scrutiny by other researchers and replicability of any analyses done on the data).

Researchers and universities have to make all data they collect public (well, not all: following the principle “as open as possible, as closed as necessary”), not because the legislator forces them, but because they adhere to standards of scientific integrity.