According to an estimate by the World Economic Forum, by 2025 we will be creating 463 exabytes of data every single day! Apart from the tweets, status updates, and emails that we send out every day, there’s a huge amount of data that our smart devices are capturing every minute. Businesses harvest this data to improve customer services and business decisions.
Although regulations such as the GDPR (General Data Protection Regulation) in the European Union and the CPRA (California Privacy Rights Act) in the US require businesses to protect the privacy of the customer data they hold, concern around data privacy keeps growing as companies continue to sell ‘anonymized’ data to third parties and data breaches continue unabated.
To allay their customers’ privacy concerns, businesses hide behind the shield of data anonymization. They often downplay the risk of individuals being re-identified, arguing that their datasets are anonymized and hence incomplete. Once data has been sampled and anonymized, it no longer falls within the ambit of data protection regulations.
Businesses can sell anonymized data to third parties, who can then use it in any manner they deem fit. As the Avast incident highlighted, once sold, this data can be pieced together with datasets obtained from breaches and other sources to accurately re-identify an individual.
Machine learning tools can re-identify individuals in anonymized datasets
Multiple studies have repeatedly shown that despite the various anonymization techniques businesses use, it is possible to build tools that reverse engineer anonymized data and reveal an individual’s true identity. For instance, students at the Harvard John A. Paulson School of Engineering and Applied Sciences recently created a tool to comb through the trove of consumer data exposed by numerous breaches. They compiled stolen data from multiple sources and searched for confidential information on each person. Their research revealed that while it is difficult to identify a person from a single dataset, multiple databases used collectively can reveal even the most personal details of an individual. It underscores the grave threat that anonymized data, combined with repeated data breaches, poses to individual privacy.
This is not the first time researchers have shown that data anonymization does not guarantee privacy. In 2019, researchers in the UK were able to accurately identify 99.98% of Americans in an anonymized dataset; they needed only a machine learning model and 15 characteristics to do so. An MIT study needed only four vague data points to accurately identify 90% of users in an anonymized credit card dataset. A German study used anonymized brake-pedal usage data to identify the correct driver with 90% accuracy.
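The mechanism behind these findings is easy to demonstrate. The toy sketch below (all records are invented for illustration) counts how many records in an “anonymized” dataset become unique once an attacker combines a few quasi-identifiers such as ZIP code, birth year, and gender:

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "02138", "birth_year": 1985, "gender": "F"},
    {"zip": "02138", "birth_year": 1985, "gender": "M"},
    {"zip": "02139", "birth_year": 1990, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "F"},
    {"zip": "02140", "birth_year": 1978, "gender": "M"},
]

def unique_fraction(records, attrs):
    """Fraction of records whose combination of `attrs` values is unique."""
    combos = Counter(tuple(r[a] for a in attrs) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[a] for a in attrs)] == 1)
    return unique / len(records)

# A single attribute identifies no one; combining three already
# singles out most records in this tiny dataset.
print(unique_fraction(records, ["gender"]))                       # 0.0
print(unique_fraction(records, ["zip", "birth_year", "gender"]))  # 0.6
```

On real population-scale data the same effect holds: each extra attribute splits the crowd further, which is why the studies above needed so few characteristics.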
Data anonymization is no silver bullet
Anonymized data does not truly safeguard customer identity. Machine learning algorithms make it possible to trace individuals even in anonymized datasets. Because different businesses anonymize different identifiers, stitching together the identifiers from multiple data sources can reveal a person’s true identity. This makes data breaches all the more valuable to cyber criminals, who use them to supplement the databases they already possess.
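The “stitching together” step is a simple linkage attack. In this hypothetical sketch (all names and values are invented), an “anonymized” health dataset is joined with a breached customer list on the quasi-identifiers the two sources share:

```python
# Hypothetical datasets for illustration only.
anonymized_health = [
    {"zip": "02138", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "02140", "birth_year": 1978, "gender": "M", "diagnosis": "diabetes"},
]

breached_customers = [
    {"name": "Alice Smith", "zip": "02138", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Jones", "zip": "02140", "birth_year": 1978, "gender": "M"},
]

# Attributes present in both datasets, even though each was
# "anonymized" separately.
QUASI_IDS = ("zip", "birth_year", "gender")

def link(anon, known):
    """Re-identify anonymized records by matching quasi-identifier tuples."""
    index = {tuple(k[q] for q in QUASI_IDS): k["name"] for k in known}
    matches = {}
    for rec in anon:
        key = tuple(rec[q] for q in QUASI_IDS)
        if key in index:
            matches[index[key]] = rec["diagnosis"]
    return matches

print(link(anonymized_health, breached_customers))
# {'Alice Smith': 'asthma', 'Bob Jones': 'diabetes'}
```

Neither dataset on its own names a patient’s diagnosis, yet the join does, which is precisely why selling “anonymized” data remains risky.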
The ramifications of confidential user information reaching cyber criminals can be devastating. It is therefore imperative that the rules around data anonymization be reviewed and tightened. On the personal front, individuals should change their passwords frequently and refrain from reusing passwords across multiple accounts.
This article was written by Neetu Katyal, Content and Marketing Consultant.
She can be reached on LinkedIn.