Easy-to-use GDPR guide for Data Scientists. Part 2/2

6 min read · Apr 16, 2019

As a successful Data Scientist, what can I do and what can I not do to be GDPR compliant? Amazon Web Services (AWS) vs on-premise. De-identification vs Anonymization. Anonymization: removing, masking or suppression, generalization, k-anonymization, scrambling, blurring. Pseudonymization: tokenization, hashing, encryption, key deletion or crypto-shredding.

Navigate to Part 1

Disclaimer: I do not represent my current/previous employers on my personal Medium blog.

Download Jupyter Notebook with source code…

De-identification vs Anonymization

De-identification is the process used to prevent a person’s identity from being connected with information.

De-identification is also known as data anonymization, when applied to metadata or general data about identification.

Data anonymization is the process of either encrypting or removing personally identifiable information from datasets, so that the people whom the data describe remain anonymous.

Removing

Removing personally identifying information (PII) is a form of anonymization.

Anonymization in general and PII removal are out of GDPR scope.
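
A minimal sketch of removal in plain Python, assuming records are held as dictionaries. The field names in `PII_FIELDS` are hypothetical and chosen only for illustration. Which fields count as PII depends on your dataset.

```python
# Hypothetical field names, used for illustration only.
PII_FIELDS = {"name", "email", "phone"}


def remove_pii(records, pii_fields=PII_FIELDS):
    """Return copies of the records with the PII fields dropped."""
    return [
        {key: value for key, value in record.items() if key not in pii_fields}
        for record in records
    ]


records = [
    {
        "name": "Alice Example",
        "email": "alice@example.com",
        "phone": "555-0100",
        "age": 34,
        "city": "Warsaw",
    },
]
cleaned = remove_pii(records)
print(cleaned)
```

The original records are left untouched, so the cleaned copy can be shared while the source stays under access control.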

Masking or suppression

Masking or suppression is an anonymization technique that hides an important/unique part of the data with random characters or other data.

Example:
6619 5829 2470 5074 credit card number
to:
**** **** **** 5074.

With masking, we identify data without manipulating actual identities.
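
The credit card example above can be sketched as follows. The function name `mask_card_number` and its parameters are assumptions made for this illustration, not part of any library.

```python
def mask_card_number(card_number, visible=4, mask_char="*"):
    """Replace every digit except the last `visible` ones with mask_char.

    Non-digit characters such as spaces are kept in place.
    """
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_card_number("6619 5829 2470 5074"))  # **** **** **** 5074
```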

Anonymization in general and PII masking/suppression are out of GDPR scope.

Note: Be careful when masking e-mail addresses. *@gmail.com is GDPR compliant, but *@korniichuk.com (www.korniichuk.com) is not, because a personal domain can still point to a specific person.

Generalization

Generalization is an anonymization technique that replaces individual field values with a broader category.

Anonymization in general and generalization are out of GDPR scope.
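
As a rough illustration, generalization can mean turning an exact age into an age band, or keeping only the leading characters of a postal code. Both helper functions below are hypothetical, and the bucket size and prefix length are arbitrary assumptions.

```python
def generalize_age(age, bucket_size=10):
    """Replace an exact age with a range such as '30-39'."""
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size - 1}"


def generalize_postal_code(postal_code, keep=2):
    """Keep the first `keep` characters and replace the rest with '*'."""
    return postal_code[:keep] + "*" * (len(postal_code) - keep)


print(generalize_age(34))               # 30-39
print(generalize_postal_code("02134"))  # 02***
```

Wider buckets protect individuals better but leave less detail for analysis, so the right granularity depends on the dataset.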

K-anonymization

K-anonymization is an anonymization technique that combines generalization and masking/suppression so that each record cannot be distinguished from at least k-1 other records.

Anonymization in general and k-anonymization are out of GDPR scope.
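
One common way to read k-anonymity is that every combination of quasi-identifier values appears in at least k records. The sketch below only measures k for data that has already been generalized. It does not perform the generalization itself, and the field names are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all quasi-identifier fields.

    Assumes `records` is non-empty.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "diabetes"},
]
k = k_anonymity(records, ["age", "zip"])
print(k)  # 2
```

If the result is lower than the target k, the usual next step is to generalize further or suppress the records in the smallest groups.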

Scrambling

Scrambling is an anonymization technique that mixes or obfuscates characters.

Example:
6619 5829 2470 5074 credit card number
to:
4596 5482 6077 2019.

Note: The process can sometimes be reversible.
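
To see why reversibility is a concern, here is a sketch that scrambles digits with a seeded shuffle. This is one possible scrambling scheme chosen for illustration, not necessarily the one used for the example above. Anyone who knows the seed can rebuild the permutation and undo it.

```python
import random


def _permutation(length, seed):
    """Build the seeded permutation of positions used by both functions."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order


def scramble_digits(card_number, seed):
    """Shuffle the digits of card_number, keeping spaces in place."""
    digits = [ch for ch in card_number if ch.isdigit()]
    order = _permutation(len(digits), seed)
    shuffled = iter([digits[i] for i in order])
    return "".join(next(shuffled) if ch.isdigit() else ch for ch in card_number)


def unscramble_digits(scrambled, seed):
    """Invert scramble_digits when the same seed is known."""
    digits = [ch for ch in scrambled if ch.isdigit()]
    order = _permutation(len(digits), seed)
    original = [None] * len(digits)
    for position, source_index in enumerate(order):
        original[source_index] = digits[position]
    restored = iter(original)
    return "".join(next(restored) if ch.isdigit() else ch for ch in scrambled)


card = "6619 5829 2470 5074"
scrambled = scramble_digits(card, seed=42)
print(scrambled)
print(unscramble_digits(scrambled, seed=42) == card)  # True
```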

Anonymization in general and scrambling are out of GDPR scope.

I cannot recommend scrambling as an anonymization technique!

Blurring

Data blurring is an anonymization technique that uses an approximation of data values to render their meaning obsolete and/or make it impossible to identify individuals.

Example:
6619 5829 2470 5074 credit card number to:

Note: The process can sometimes be reversible.

Anonymization in general and PII blurring are out of GDPR scope.

I cannot recommend blurring as an anonymization technique!

Pseudonymization

GDPR defines pseudonymization as the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.

By holding the de-identified data separately from the “additional information,” the GDPR permits data handlers to use personal data more liberally without fear of infringing on the rights of data subjects. This is because the data only becomes identifiable when both elements are held together.

You can use Faker and Mimesis Python libs for pseudonymization.

Faker is a Python package that generates fake data for you.

Note: The resulting string is always the same for the same input, so that analytical correlations are still possible.

I cannot recommend pseudonymization with Faker/Mimesis Python libs!

Tokenization

Tokenization is a pseudonymization technique that substitutes a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value.

Note: The resulting token is always the same for the same input, so that analytical correlations are still possible.
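
A minimal in-memory sketch of tokenization, assuming a simple lookup table that maps each value to a random token. The `TokenVault` class is hypothetical. A real deployment would keep this table in a separate, access-controlled store, since the table is the "additional information" needed for re-identification.

```python
import secrets


class TokenVault:
    """Map sensitive values to random tokens and back."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return the token for value, creating one on first use."""
        if value not in self._token_by_value:
            token = secrets.token_hex(16)  # random, unrelated to the value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        """Look up the original value; raises KeyError for unknown tokens."""
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("6619 5829 2470 5074")
print(token == vault.tokenize("6619 5829 2470 5074"))  # True
print(vault.detokenize(token))  # 6619 5829 2470 5074
```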

I can recommend tokenization as a pseudonymization technique.

Hashing

Hashing is a pseudonymization technique that returns a fixed-size output from an input of any size and cannot be directly reversed.

However, if the range of input values to the hash function is known, they can be replayed through the hash function in order to derive the correct value for a particular record.

Hash functions are subject to brute force attacks. Pre-computed tables can also be created to allow for the bulk reversal of a large set of hash values.
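
The replay attack described above can be sketched with a small input space. Here the assumption is that the hashed attribute is a 4-digit code, so all 10,000 candidates fit into a pre-computed table. SHA-256 is used only as an example of an unkeyed hash.

```python
import hashlib


def sha256_hex(value):
    """Return the unsalted, unkeyed SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_lookup_table(candidates):
    """Pre-compute hash -> input for every candidate input value."""
    return {sha256_hex(candidate): candidate for candidate in candidates}


# The "pseudonymized" value as it would appear in a dataset.
leaked_hash = sha256_hex("4821")

# The attacker hashes every possible 4-digit code once...
table = build_lookup_table(f"{n:04d}" for n in range(10_000))

# ...and then reverses any hash from the dataset with a single lookup.
print(table[leaked_hash])  # 4821
```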

The use of a salted-hash function (where a random value, known as the “salt”, is added to the attribute being hashed) can reduce the likelihood of deriving the input value, but calculating the original attribute value hidden behind the result of a salted hash function may still be feasible with reasonable means.

A keyed-hash function with a stored key uses a secret key as an additional input (this differs from a salted hash function as the salt is commonly not secret).

Note: The resulting hash is always the same for the same input, so that analytical correlations are still possible.
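
A keyed hash can be sketched with HMAC-SHA256 from the Python standard library. The function name `pseudonymize` is an assumption for this example, and the key is generated inline only for demonstration. In practice the key would be stored separately from the data.

```python
import hashlib
import hmac
import secrets


def pseudonymize(value, key):
    """Return the HMAC-SHA256 hex digest of value under the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: a 32-byte random key, kept apart from the dataset.
key = secrets.token_bytes(32)

first = pseudonymize("alice@example.com", key)
second = pseudonymize("alice@example.com", key)
print(first == second)  # True: same input and key give the same pseudonym
```

Without the key, an attacker cannot pre-compute a lookup table over the input range, which is the main advantage over a plain or salted hash.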

I can strongly recommend a keyed-hash function as a pseudonymization technique.

Encryption

Encryption with a secret key is a pseudonymization technique. In this case, the holder of the key can trivially re-identify each data subject through decryption of the dataset because the personal data are still contained in the dataset, albeit in an encrypted form.

Assuming that a state-of-the-art encryption scheme was applied, decryption can only be possible with the knowledge of the key.

Symmetric vs asymmetric encryption. Today’s standard symmetric encryption scheme is AES. The weakness of symmetric encryption is that the secret key must be present on the encryption side.

Asymmetric encryption schemes such as RSA or ECC use two keys: public and private. Data is encrypted with the public key, and even if an attacker obtains this public key, they cannot decrypt the protected data.

I cannot recommend encryption as a pseudonymization technique!

Key deletion or crypto-shredding

The key deletion or crypto-shredding pseudonymization technique may be equated to selecting a random number as a pseudonym for each attribute in the database and then deleting the correspondence table.

You can upgrade the tokenization, keyed-hash function, and encryption techniques this way by deleting the token table or the secret key.

Considering a state-of-the-art algorithm, it will be computationally hard for an attacker to decrypt or replay the function, as it would imply testing every possible key, given that the key is not available.
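
A rough sketch of the idea, applied to a keyed hash. The `ShreddablePseudonymizer` class is hypothetical. Setting the key to `None` only illustrates the workflow and does not securely wipe memory. Real crypto-shredding relies on the key living in one controlled place, such as a key management service, so that deleting it there is final.

```python
import hashlib
import hmac
import secrets


class ShreddablePseudonymizer:
    """Keyed-hash pseudonymizer whose key can be deleted on purpose."""

    def __init__(self):
        self._key = secrets.token_bytes(32)

    def pseudonymize(self, value):
        """Return the HMAC-SHA256 pseudonym, or fail if the key is gone."""
        if self._key is None:
            raise RuntimeError("key has been shredded")
        return hmac.new(
            self._key, value.encode("utf-8"), hashlib.sha256
        ).hexdigest()

    def shred(self):
        """Drop the key so existing pseudonyms can no longer be replayed."""
        self._key = None


pseudonymizer = ShreddablePseudonymizer()
stored_pseudonym = pseudonymizer.pseudonymize("alice@example.com")
pseudonymizer.shred()

try:
    pseudonymizer.pseudonymize("alice@example.com")
    replay_possible = True
except RuntimeError:
    replay_possible = False
print(replay_possible)  # False
```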

I cannot recommend this pseudonymization technique!

Sources