In Series V, we covered the basics of data anonymisation and pseudonymisation. In this Series VI, we go deeper into both concepts and look at some measures for achieving them.
To reiterate, pseudonymisation is a method of substituting identifiable data with a reversible, consistent value. Anonymisation is the destruction of identifiable data.
Some basic guidelines for the process of anonymisation are as follows:
- remove all direct identifiers
- remove indirect identifiers that are not essential for reusing the data
- remove indirect identifiers with a high disclosure risk, such as usual or known characteristics
- reduce the level of detail of the indirect identifier
A combination of indirect identifiers may also lead to the identification of a respondent, for instance in research about people with hearing and speech impairments in a specific village. In such cases, it is advisable to report a broader geographic unit, such as the state instead of the precise village or town, so that the privacy of the individual is respected.
Another example is the combination of age in days and date of exam, which together reveal the respondent's exact date of birth. In research concerning school classes, participating children may thus be identified. In this case, either the exam date can be reduced to the year, or the age can be coarsened to months or years.
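The age and exam date example can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function name `generalise_record` and the sample values are assumptions, and it coarsens both fields at once even though the text suggests that coarsening either one may be enough.

```python
from datetime import date


def generalise_record(age_in_days: int, exam_date: date) -> dict:
    """Coarsen two indirect identifiers so that, combined, they no longer
    pin down an exact date of birth."""
    return {
        # Whole years only. 365.25 days per year is an approximation.
        "age_years": int(age_in_days // 365.25),
        # Keep only the year of the exam and drop the day and month.
        "exam_year": exam_date.year,
    }


# Hypothetical respondent, for illustration only.
record = generalise_record(age_in_days=4100, exam_date=date(2019, 3, 14))
print(record)
```

After this step, the record only says that the respondent was 11 years old at an exam held sometime in 2019.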
Make sure you do not share the following direct identifiers with others or archive them in a public archive:
The Indian PDPB2019 does not apply to anonymised data. However, pseudonymised data falls fully within the scope of the Bill and must be treated with the same levels of consideration in terms of collection, security, processing and deletion.
Whatever is not considered “anonymised” is data that retains some element by which an individual can be identified. The concept of anonymisation relates to personal data, but there are other categories of data that are neither personal data nor anonymised data. This category may include data that identifies a company, or business data that carries no personal identity and therefore falls outside the definitions of both personal data and anonymised data. Such data is also outside the scope of the PDPB2019.
One common practice used for pseudonymisation is a process called tokenisation. This assigns a logical token to each unique name and requires access to additional information to re-identify the data:
| Name of Author | Token/Pseudo Name | Anonymised |
Here, with the pseudonymised data, we may not know the identity of the data subject, but we can correlate entries with specific subjects (records 1 and 4, 2 and 7, and 3 and 8 each refer to the same person). If we have access to the token lookup tables, we can get back to the real identity. With the anonymised data, however, we only know that there are 8 records, and there is no method to re-identify the data.
With anonymisation, we must also be concerned about “indirect re-identification”. To return to our author example above, an analysis of the writing style of our anonymous authors might allow us to identify them indirectly. We might not be able to identify the name, but we might be able to tell that specific books were written by the same person because of their unique writing style.
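The difference between the two columns can be sketched as follows. This is an illustrative sketch only: the `Tokeniser` class, the `TKN-` prefix, the `REDACTED` marker and the placeholder author names are all assumptions, chosen so that the eight records repeat in the same pattern the paragraph describes. They are not the values from the original table.

```python
import secrets


class Tokeniser:
    """Hypothetical token vault: one consistent, random token per unique name."""

    def __init__(self):
        self._token_by_name = {}
        self._name_by_token = {}  # the lookup table needed for re-identification

    def tokenise(self, name: str) -> str:
        if name not in self._token_by_name:
            token = "TKN-" + secrets.token_hex(4)
            while token in self._name_by_token:  # regenerate on a (rare) collision
                token = "TKN-" + secrets.token_hex(4)
            self._token_by_name[name] = token
            self._name_by_token[token] = name
        return self._token_by_name[name]

    def reidentify(self, token: str) -> str:
        # Only possible for someone who holds the lookup table.
        return self._name_by_token[token]


def anonymise(names):
    """Destroy the identifier: every record gets the same marker."""
    return ["REDACTED" for _ in names]


# Placeholder names arranged so records 1 and 4, 2 and 7, 3 and 8 match.
authors = ["Author A", "Author B", "Author C", "Author A",
           "Author D", "Author E", "Author B", "Author C"]

tokeniser = Tokeniser()
tokens = [tokeniser.tokenise(name) for name in authors]
anonymised = anonymise(authors)

for name, token, anon in zip(authors, tokens, anonymised):
    print(name, token, anon)
```

The token column still lets an analyst see which records belong together, and the holder of the lookup table can reverse it. The anonymised column supports neither.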
For instance, directly identifiable data such as audio or video files are hard to anonymise (without losing their scientific value) and should in general not be published in open access.
“Pseudonymisation is effectively only a security measure. It does not change the status of the data as personal data. The Indian PDPB2019 makes it clear that pseudonymised personal data remains personal data and within the scope of the Bill.”
Additionally, where clinical trial data has had all identifiers removed, it can only be considered anonymised data if it is impossible to re-identify the trial subjects, even when the data is cross-referenced against supporting documentation.
While there may be incentives for some organisations to process data in anonymised form, this technique may devalue the data so that it is no longer useful for some purposes. Therefore, before anonymisation, consideration should be given to the purposes for which the data is to be used.
A variety of methods are available depending on the degree of risk and the intended use of the data.
A directory replacement method involves modifying the name of individuals integrated within the data, while maintaining consistency between values, such as “postcode + city”.
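One possible reading of directory replacement is sketched below: real names are swapped for entries drawn from a directory of substitutes, and linked fields such as postcode and city are replaced together as a pair so that they stay consistent with each other. The class `DirectoryReplacer`, the field names and all directory entries are hypothetical.

```python
class DirectoryReplacer:
    """Swap real values for entries from substitute directories, consistently."""

    def __init__(self, substitute_names, substitute_places):
        # Iterators hand out each directory entry once. next() raises
        # StopIteration if a directory is too small for the data set.
        self._names = iter(substitute_names)
        self._places = iter(substitute_places)
        self._name_map = {}
        self._place_map = {}

    def replace(self, record: dict) -> dict:
        name = record["name"]
        place = (record["postcode"], record["city"])
        if name not in self._name_map:
            self._name_map[name] = next(self._names)
        if place not in self._place_map:
            # Postcode and city are replaced as one unit so they never mismatch.
            self._place_map[place] = next(self._places)
        new_postcode, new_city = self._place_map[place]
        return {"name": self._name_map[name],
                "postcode": new_postcode,
                "city": new_city}


# Fictitious directories and records, for illustration only.
replacer = DirectoryReplacer(
    substitute_names=["Substitute 1", "Substitute 2"],
    substitute_places=[("990001", "Placeholder Town X"),
                       ("990002", "Placeholder Town Y")],
)
records = [
    {"name": "Respondent One", "postcode": "000001", "city": "City A"},
    {"name": "Respondent Two", "postcode": "000001", "city": "City A"},
    {"name": "Respondent One", "postcode": "000002", "city": "City B"},
]
replaced = [replacer.replace(r) for r in records]
for row in replaced:
    print(row)
```

Because the mapping is remembered, the same respondent always receives the same substitute name, and two respondents from the same place still share a place after replacement.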
Scrambling techniques involve a mixing or obfuscation of letters. The process can sometimes be reversible. For example: Sameer could become Meesar.
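A reversible scramble can be sketched with a keyed shuffle of character positions. This is an assumption about how such a technique might be built, not a description of any particular product. The function names and the key are hypothetical, and Python's `random` module is not cryptographically secure, so this shows the idea only.

```python
import random


def _permutation(length: int, key: str) -> list:
    """Derive a repeatable ordering of positions from the key."""
    indices = list(range(length))
    random.Random(key).shuffle(indices)
    return indices


def scramble(text: str, key: str) -> str:
    """Rearrange the letters of text in the order given by the key."""
    order = _permutation(len(text), key)
    return "".join(text[i] for i in order)


def unscramble(scrambled: str, key: str) -> str:
    """Undo scramble() by putting each letter back at its source position."""
    order = _permutation(len(scrambled), key)
    original = [""] * len(scrambled)
    for position, source_index in enumerate(order):
        original[source_index] = scrambled[position]
    return "".join(original)


mixed = scramble("Sameer", key="demo-key")
print(mixed)
print(unscramble(mixed, key="demo-key"))
```

Anyone who knows the key can reverse the scramble, which is why the process is only sometimes a safe choice. Short names are also easy to guess from their letters alone.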
A masking technique hides part of the data with random characters or other data, for example pseudonymisation by masking identities or important identifiers. The advantage of masking is that the data remains recognisable and usable without exposing actual identities.
Personalised anonymisation is another popular method. It allows users to apply their own anonymisation technique. Custom anonymisation can be carried out using scripts or an application.
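As a minimal sketch of masking, the function below hides all but the last few characters of a value. The function name `mask`, its parameters and the sample phone number are assumptions for illustration. It uses a fixed mask character rather than random ones, which is one of several possible choices.

```python
def mask(value: str, visible_suffix: int = 4, mask_char: str = "X") -> str:
    """Hide everything except the last `visible_suffix` characters."""
    visible_suffix = max(visible_suffix, 0)
    hidden = max(len(value) - visible_suffix, 0)
    return mask_char * hidden + value[hidden:]


# Hypothetical phone number, for illustration only.
masked_phone = mask("9876543210")
print(masked_phone)  # XXXXXX3210
```

The masked value keeps its length and its final digits, so staff can still confirm a record with a customer without seeing the full identifier.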
Data blurring uses an approximation of data values to render their meaning obsolete and/or render the identification of individuals impossible.
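Blurring can be sketched as replacing an exact value with a band or a rounded figure. The helper names, band width and rounding step below are assumptions. The right level of approximation depends on the risk and on how the data will be used.

```python
def blur_to_band(value: int, width: int) -> str:
    """Replace an exact non-negative integer with the band that contains it."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


def blur_to_step(value: int, step: int) -> int:
    """Round a non-negative integer to the nearest multiple of `step`."""
    return ((value + step // 2) // step) * step


# Hypothetical values, for illustration only.
age_band = blur_to_band(37, width=10)
rounded_income = blur_to_step(48750, step=1000)
print(age_band)        # 30-39
print(rounded_income)  # 49000
```

The blurred values still support aggregate analysis, such as counts per age band, while making it harder to match a record to one person.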
Data masking versus data encryption: A comparison of two pseudonymisation methods
Distinct from data masking, data encryption translates data into another form, or code, so that only people with access to a secret key (formally called a decryption key) or password can read it.
Data masking is a more widely applicable solution as it enables organizations to maintain the usability of their customer data.
By Sameer Mathur, Founder & CEO, SM Consulting
President, Delhi-NCR Chapter of the Foundation of Data Protection Professionals in India
With inputs from Mr Vijayashankar Nagaraj Rao