Easy-to-use GDPR guide for Data Scientists. Part 2/2

As a successful Data Scientist, what can and can't I do to be GDPR compliant? Amazon Web Services (AWS) vs on-premise. De-identification vs Anonymization. Anonymization: removing, masking or suppression, generalization, k-anonymization, scrambling, blurring. Pseudonymization: tokenization, hashing, encryption, key deletion or crypto-shredding.

Navigate to Part 1

Disclaimer: I do not represent my current/previous employers on my personal Medium blog.

Download Jupyter Notebook with source code…

De-identification vs Anonymization

De-identification is also called data anonymization when it is applied to metadata or to general data that could identify a person.

Data anonymization is the process of either encrypting or removing personally identifiable information from datasets, so that the people whom the data describe remain anonymous.

Removing

Anonymization in general, and PII removal in particular, are out of GDPR scope.
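A minimal sketch of removal, assuming records are plain Python dicts. The field names in `PII_FIELDS` are hypothetical examples, not an exhaustive list of PII:

```python
# Hypothetical set of direct identifiers to drop; adjust to your own schema.
PII_FIELDS = {"name", "email", "phone"}


def remove_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of the record without the listed PII fields."""
    return {key: value for key, value in record.items() if key not in pii_fields}


customer = {"name": "Jane Doe", "email": "jane@example.com", "country": "PL", "orders": 7}
print(remove_pii(customer))  # {'country': 'PL', 'orders': 7}
```

Keep in mind that dropping direct identifiers may not be enough on its own if the remaining fields, taken together, can still single out one person.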

Masking or suppression

Example:
6619 5829 2470 5074 credit card number
to:
**** **** **** 5074.

With masking, we hide part of the data while keeping its format, so records stay recognizable without revealing the actual identities.
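The credit card example above can be sketched as follows. The function name and its `visible` and `mask_char` parameters are illustrative choices, not part of any library:

```python
def mask_card_number(card_number, visible=4, mask_char="*"):
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_card_number("6619 5829 2470 5074"))  # **** **** **** 5074
```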

Anonymization in general and PII masking/suppression are out of GDPR scope.

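Note: Be careful when masking e-mail addresses. *@gmail.com is GDPR compliant because the domain is shared by a huge number of users, whereas *@korniichuk.com (www.korniichuk.com) is not GDPR compliant because a personal domain can still identify its owner.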

Generalization

Anonymization in general and generalization are out of GDPR scope.
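Generalization usually means replacing a precise value with a broader category. A minimal sketch, assuming ages are integers and postcodes are strings; the bucket size and the number of kept characters are arbitrary illustrative defaults:

```python
def generalize_age(age, bucket_size=10):
    """Replace an exact age with a range such as '30-39'."""
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size - 1}"


def generalize_postcode(postcode, keep=2):
    """Keep only the first `keep` characters of a postcode and pad with '*'."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


print(generalize_age(34))            # 30-39
print(generalize_postcode("00950"))  # 00***
```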

K-anonymization

Anonymization in general and k-anonymization are out of GDPR scope.
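A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The check below is only a sketch: the records and the choice of `age` and `postcode` as quasi-identifiers are made up for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


people = [
    {"age": "30-39", "postcode": "00***", "diagnosis": "flu"},
    {"age": "30-39", "postcode": "00***", "diagnosis": "asthma"},
    {"age": "40-49", "postcode": "02***", "diagnosis": "flu"},
]

print(is_k_anonymous(people, ["age", "postcode"], k=2))      # False
print(is_k_anonymous(people[:2], ["age", "postcode"], k=2))  # True
```

The third record is alone in its group, so the full dataset fails the check for k=2 until that record is generalized further or suppressed.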

Scrambling

Example:
6619 5829 2470 5074 credit card number
to:
4596 5482 6077 2019.

Note: The process can sometimes be reversible.
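A minimal sketch of scrambling, assuming it means shuffling the digits of the value while keeping the separators in place (the example above is such a permutation of the original digits). The helper name is hypothetical:

```python
import random


def scramble_digits(value, rng=None):
    """Shuffle the digits of `value` while leaving separators in place."""
    rng = rng or random.Random()
    digits = [ch for ch in value if ch.isdigit()]
    rng.shuffle(digits)
    shuffled = iter(digits)
    return "".join(next(shuffled) if ch.isdigit() else ch for ch in value)


card = "6619 5829 2470 5074"
scrambled = scramble_digits(card)
print(scrambled)  # differs between runs; 4596 5482 6077 2019 is one possible result
```

The output still contains exactly the same digits as the input, and anyone who learns the permutation or the random seed can undo it, which is one reason the process can be reversible.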

Anonymization in general and scrambling are out of GDPR scope.

I cannot recommend scrambling as an anonymization technique!

Blurring

Example:
6619 5829 2470 5074 credit card number
to:
[image: the same number, blurred]

Note: The process can sometimes be reversible.

Anonymization in general and PII blurring are out of GDPR scope.

I cannot recommend blurring as an anonymization technique!

Pseudonymization

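Pseudonymization replaces identifying fields with artificial identifiers, so that the data can no longer be attributed to a specific person without additional information that is kept separately. By holding the de-identified data separately from the “additional information,” the GDPR permits data handlers to use personal data more liberally without fear of infringing on the rights of data subjects. This is because the data only becomes identifiable when both elements are held together.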

You can use Faker and Mimesis Python libs for pseudonymization.

Faker is a Python package that generates fake data for you.

Note: If the generator is seeded from the input value, the resulting string is always the same for the same input, so that analytical correlations are still possible.

I cannot recommend pseudonymization with Faker/Mimesis Python libs!

Tokenization

Note: The resulting token is always the same for the same input, so that analytical correlations are still possible.
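A minimal sketch of tokenization, assuming a toy in-memory vault. The `TokenVault` class is hypothetical; a real vault would be a separately secured, persistent service:

```python
import secrets


class TokenVault:
    """Toy in-memory token vault: maps each value to a random token and back."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same input always yields the same token.
        if value not in self._token_by_value:
            token = secrets.token_hex(8)
            while token in self._value_by_token:  # extremely unlikely collision
                token = secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        # Only a holder of the vault can map a token back to the original value.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("6619 5829 2470 5074")
print(token == vault.tokenize("6619 5829 2470 5074"))  # True
```

Because the token is random rather than derived from the value, it reveals nothing about the input without access to the vault.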

I can recommend tokenization as a pseudonymization technique.

Hashing

A hash function maps an input to a fixed-size value that cannot be directly inverted. However, if the range of input values to the hash function is known, they can be replayed through the hash function in order to derive the correct value for a particular record.

Hash functions are subject to brute force attacks. Pre-computed tables can also be created to allow for the bulk reversal of a large set of hash values.
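The replay attack described above can be sketched in a few lines. The scenario is hypothetical: a column of 4-digit PINs hashed with plain SHA-256, where the attacker knows the input range.

```python
import hashlib


def sha256_hex(value):
    """Unsalted, unkeyed SHA-256 of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical leaked value: the hash of a 4-digit PIN.
leaked_hash = sha256_hex("5074")

# The attacker knows PINs have 4 digits, so the whole input range can be replayed.
lookup_table = {sha256_hex(f"{pin:04d}"): f"{pin:04d}" for pin in range(10_000)}

print(lookup_table[leaked_hash])  # 5074
```

The pre-computed `lookup_table` reverses every hash in the column at once, which is the bulk reversal mentioned above.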

The use of a salted-hash function (where a random value, known as the “salt”, is added to the attribute being hashed) can reduce the likelihood of deriving the input value, but calculating the original attribute value hidden behind the result of a salted hash function may still be feasible with reasonable means.

A keyed-hash function with a stored key uses a secret key as an additional input (this differs from a salted hash function, as the salt is commonly not secret).

Note: The resulting hash is always the same for the same input, so that analytical correlations are still possible.
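A minimal sketch of a keyed hash, assuming HMAC-SHA256 from the Python standard library as the keyed-hash function. Generating the key in code is only for illustration; in practice it would be stored separately from the data.

```python
import hashlib
import hmac
import secrets

# Assumption: in practice the key lives in a key management system, not in code.
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(value, key=SECRET_KEY):
    """Return the HMAC-SHA256 of `value` under `key` as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


first = pseudonymize("jane@example.com")
second = pseudonymize("jane@example.com")
other_key = pseudonymize("jane@example.com", key=secrets.token_bytes(32))

print(first == second)     # True: same input and key give the same pseudonym
print(first == other_key)  # False: a different key gives an unrelated pseudonym
```

Without the key, an attacker cannot replay candidate inputs through the function, which is what separates this from a plain or salted hash.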

I can strongly recommend a keyed-hash function as a pseudonymization technique.

Encryption

Assuming that a state-of-the-art encryption scheme was applied, decryption is only possible with knowledge of the key.

Encryption can be symmetric or asymmetric. Today’s standard symmetric encryption scheme is AES. The weakness of symmetric encryption is that the secret key must be present on the encryption side, so whoever encrypts the data can also decrypt it.

Asymmetric encryption schemes such as RSA or schemes based on elliptic-curve cryptography (ECC) use two keys: public and private. Data is encrypted with the public key, and even if an attacker obtains this public key, they are not able to decrypt the protected data.

I cannot recommend encryption as a pseudonymization technique!

Key deletion or crypto-shredding

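You can upgrade the tokenization, keyed-hash function, and encryption techniques with key deletion: once the secret key (or the token mapping) is deleted, the pseudonyms can no longer be linked back to the original values.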

Assuming a state-of-the-art algorithm, it is computationally hard for an attacker to decrypt the data or replay the function once the key is not available, as it would imply testing every possible key.
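A minimal sketch of crypto-shredding, assuming a keyed hash (HMAC-SHA256) with one secret key per data subject. The `ShreddableKeyStore` class is hypothetical and keeps keys in memory only for illustration:

```python
import hashlib
import hmac
import secrets


class ShreddableKeyStore:
    """Toy key store: one secret key per data subject, deletable on request."""

    def __init__(self):
        self._keys = {}

    def pseudonymize(self, subject_id, value):
        # Create the subject's key on first use, then reuse it.
        key = self._keys.setdefault(subject_id, secrets.token_bytes(32))
        return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

    def shred(self, subject_id):
        # Deleting the key is the crypto-shredding step.
        self._keys.pop(subject_id, None)


store = ShreddableKeyStore()
before = store.pseudonymize("subject-1", "jane@example.com")
store.shred("subject-1")
after = store.pseudonymize("subject-1", "jane@example.com")

print(before == after)  # False: the old pseudonym cannot be recomputed
```

After `shred`, a fresh key is generated on the next call, so the old pseudonym can no longer be reproduced from the original value. This only holds if every copy of the key, including backups, is actually deleted.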

I cannot recommend this pseudonymization technique!

Python Developer and Artificial Intelligence Engineer