How to anonymise/pseudonymise your open data in R

Anonymising data, especially if it includes personal details like addresses or voting patterns, is very important if you want to share your data or results. Under GDPR, personal data collected in the E.U. generally has to be either anonymised or pseudonymised when it is used for purposes other than those originally intended[note]The benefits of doing so are detailed in Lindquist (2018).[/note]. One of the main points of open science is to give other researchers the opportunity to explore the data you’ve collected and run tests you might not have thought of or were not in a position to do (Meyer, 2018). But there’s more to making your data anonymous than just replacing variable names. Several R packages give researchers options for doing this during data analysis and data sharing.

Data analysis

digest: create a hash for data ID. If simply removing the identifying variables from the data frame isn’t an option, adding a new variable with a unique ID is a good alternative. Using the digest package, you can create an anonymous hash to stand in for each participant, and keep a separate key to work out which participant is which. Another method of doing this is explored here.
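As a rough sketch of the idea, the code below hashes each participant's name with digest() and splits the result into a private key and a shareable data frame. The data frame, its column names and the salt value are all made up for illustration, and adding a salt is my own assumption rather than something the package requires.

```r
library(digest)

# Hypothetical participant data; names and scores are invented for illustration
participants <- data.frame(
  name  = c("Alice Smith", "Bob Jones", "Carol White"),
  score = c(12, 18, 15),
  stringsAsFactors = FALSE
)

# Assumption: a secret salt that is stored separately and never shared
salt <- "replace-with-a-long-random-secret"

# digest() hashes one object at a time, so apply it to each name in turn.
# serialize = FALSE hashes the string itself rather than its R serialisation.
participants$id <- vapply(
  participants$name,
  function(x) digest(paste0(salt, x), algo = "sha256", serialize = FALSE),
  character(1),
  USE.NAMES = FALSE
)

# Key linking each hash back to a participant: keep this private
key <- participants[, c("id", "name")]

# Data frame with the identifying column dropped, suitable for sharing
shared <- participants[, c("id", "score")]
```

Only `shared` would be published; `key` and `salt` stay with the researcher. Without a salt, anyone who can guess a participant's name could hash it themselves and look for a match.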

bcrypt: a stronger way of hashing participant IDs, as it protects against brute force attacks[note]When someone tries to break through encryption by submitting many different passwords in the hope of getting it right by luck.[/note]. This process is explained in more detail here.
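A minimal sketch of how this might look with the bcrypt package's hashpw(), gensalt() and checkpw() functions is below. The participant IDs are hypothetical and the work factor of 10 is an arbitrary choice for illustration. Because each hash gets its own random salt, hashing the same ID twice gives different results, so a key table is still needed.

```r
library(bcrypt)

# Hypothetical participant IDs, made up for illustration
ids <- c("P001", "P002", "P003")

# hashpw() produces a salted hash; log_rounds sets the work factor
# (higher values are slower to compute and so slower to brute force)
hashed <- vapply(
  ids,
  function(x) hashpw(x, gensalt(log_rounds = 10)),
  character(1),
  USE.NAMES = FALSE
)

# Private key table linking the original IDs to their hashes
key <- data.frame(id = ids, hash = hashed, stringsAsFactors = FALSE)

# checkpw() tests whether a candidate ID matches a stored hash
matches_first <- checkpw("P001", key$hash[1])
matches_other <- checkpw("P002", key$hash[1])
```

The deliberate slowness is the point: each guess costs an attacker noticeably more time than it would with a fast hash.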

Data sharing

synthpop: create a synthetic set of results based on the raw scores with minimal distortion. Using the parameters from the original data, simulated data can be created which gives readers a good understanding of what the results were without revealing any personal information.
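To give a feel for the workflow, here is a short sketch using syn() and the SD2011 example data that ships with synthpop. The choice of four columns and the seed value are arbitrary assumptions for illustration; your own data frame would take the place of `original`.

```r
library(synthpop)

# SD2011 is an example survey data set bundled with synthpop;
# a small subset of its columns keeps the example quick
original <- SD2011[, c("sex", "age", "edu", "income")]

# Fit models to the original data and draw a synthetic data set from them.
# The seed is arbitrary and only makes the result reproducible.
synthetic <- syn(original, seed = 2018)

# The synthetic data frame is stored in the $syn element
head(synthetic$syn)

# Compare the distributions of the synthetic and original variables
compare(synthetic, original)
```

The synthetic rows do not correspond to real participants, but it is still worth checking the comparison output before sharing to see how closely the distributions match.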

gganonymise: replace labels and text in graphs generated using ggplot2 with pseudonymous alternatives.

Is anonymised data anonymous?

There has been a lively debate about whether shared data which has been ‘anonymised’ actually protects the identities of those involved. Tests using large-scale data sets have found that it is possible to match records across data sets and thus identify participants (Kondor, Hashemian, Montjoye, & Ratti, 2018). That study used very large and detailed data sets which included location tracking and purchase history, so research which collects less data is less likely to be re-identifiable. Even so, re-identification only requires a small amount of data to be feasible (Kondor, Hashemian, Montjoye, & Ratti, 2018). Therefore, great care must be taken to protect the identities of your participants. For a summary of how to practise open science ethically, I highly recommend Meyer (2018). For how to share data whilst complying with GDPR, see Hintze (2017), who explores this issue in greater detail.
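One simple way to see how little data re-identification can need is to count how many participants are unique on just a few background variables. The sketch below uses base R only; the data frame and its columns are invented for illustration, and this is a quick check rather than a full disclosure risk assessment.

```r
# Hypothetical shared data: no names, only a few background variables
shared <- data.frame(
  age_group = c("18-24", "18-24", "25-34", "25-34", "65+"),
  postcode  = c("AB1", "AB1", "AB1", "CD2", "CD2"),
  gender    = c("F", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Variables an outsider might plausibly already know about someone
vars <- c("age_group", "postcode", "gender")

# Paste each row's values into a single string describing its combination
combo <- do.call(paste, c(shared[vars], sep = "|"))

# For every row, count how many rows share that combination
group_size <- as.vector(table(combo)[combo])

# Rows with a group size of 1 are unique and so easiest to single out
at_risk <- sum(group_size == 1)
```

In this toy example three of the five rows are unique on age group, postcode and gender alone, even though no names are included.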

References

[zotpress items=”{5421944:J95SRB7Q}” style=”apa”]

[zotpress items=”{5421944:FLWTNJ7D}” style=”apa”]

[zotpress items=”{5421944:QBVY3IZS}” style=”apa”]

[zotpress items=”{5421944:CFDG8RUU}” style=”apa”]
