Friday, 14 August 2009

Data Anonymisation to prevent Data Leakage

With data leaks constantly in the news, I thought I would write a quick blog post about data anonymisation. The problem seems to be that people think it's perfectly acceptable to walk around with sensitive information on mobile devices and removable media. The solution, according to common thought, is to encrypt those devices. This is a solution that should be adopted, but after the more fundamental problem has been addressed. It should not be possible or necessary to store raw sensitive data on mobile devices or removable media!

Assuming that you need the data for business intelligence purposes and that the IT department can't or won't (for some good reason) allow this to be done online through a secure connection, then you must anonymise the data first and then encrypt it. Why do you need to know the names, addresses and credit card numbers of your customers when on the road TK Maxx? Why do you need the names, addresses, dates of birth, national insurance numbers, salaries and bank details of your employees when away from the office UPS? I'm afraid, that the only reason I can think of to have the non-anonymised data is for fraudulent purposes (please send me a comment if you can think of a legitimate reason).

Drawing from Pierangela Samarati's session at IPICS, I'll give a very brief overview of data anonymisation. There are two basic techniques to anonymise data: generalisation and suppression. With generalisation, we use a more general value in place of the specific value, e.g. birth year rather than birth date, postal district rather than full postcode (KT1 rather than KT1 2EE), credit card issuer rather than full credit card number (1234 56** **** **** rather than 1234 5678 9012 3456), etc. Alternatively, we can suppress the sensitive information by removing it totally.

Now there is a whole academic discipline surrounding data anonymity and how to achieve k-anonymity that I won't go into here. I'll just look at what the above means to data such as a normal company might want to use for business intelligence reasons, rather than surveys and data gathering purposes. In this sense, we are trying to protect the privacy of our customers, employees, etc., above all else, rather than have the minimum anonymity possible for the data set. I will use the following table to illustrate the anonymisation.

NameDoBPostcodeCC No.
Alice02/02/64KT1 1AB1234 5678 9012 3456
Bob16/02/64KT1 1BC1234 5678 9012 3467
Charlie08/04/64KT1 1CD1234 6778 9012 3478
David02/04/66KT1 1DE1234 6778 9012 3489
Edgar04/04/66KT1 2AB1234 6778 9012 3490

There are many schemes for anonymising this data, but I'm going to concentrate on Attribute Generalisation combined with Attribute Suppression. This basically means that we will generalise each value at the attribute level (i.e. the same level of generalisation will be applied to all values). Secondly, we will suppress any attribute that uniquely identifies someone. Using minimal generalisation we would get the following table. ('-' denotes suppressed data and '*' denotes generalised data)

NameDoBPostcodeCC No.
-**/02/64KT1 1**1234 5678 9012 34**
-**/02/64KT1 1**1234 5678 9012 34**
--KT1 1**1234 6778 9012 34**
-**/04/66KT1 1**1234 6778 9012 34**
-**/04/66-1234 6778 9012 34**

We have had to suppress Charlie's birthday, because she was the only one born in April 1964. Similarly, Edgar is the only one who lives in KT1 2**. However, we haven't achieved anonymisation here. If we know Charlie was born in April 1964, then this date doesn't appear in the table and only one date is suppressed, so we know her tuple in the table. Similarly, if we know Edgar lives at KT1 2AB, then we know that his is the last tuple. The credit card details should be generalised more than this as well, as others may store the last four digits of a credit card number, so it may be possible to cross reference. Also, why do we need their credit card number for business intelligence? Surely issuer is good enough? So, we can do the following.

NameDoBPostcodeCC No.
-**/**/64KT1 ***1234 56** **** ****
-**/**/64KT1 ***1234 56** **** ****
-**/**/64KT1 ***1234 67** **** ****
-**/**/66KT1 ***1234 67** **** ****
-**/**/66KT1 ***1234 67** **** ****

This gives us a full count of customers, their geographic locations, age and credit card issuer. I suggest that this is enough information to cover most queries that you may wish to run for business intelligence purposes and, therefore, the maximum that should ever be stored on a mobile device or removable media. This data should also be encrypted.

Of course, this doesn't solve all problems. What if you know Edgar was born in 1966? You now know his credit card issuer, which enables you to launch a directed phishing attack on him. Data Anonymisation can fail in the face of attack, particularly when there is external knowledge, which you have no control over. The moral is, don't store sensitive data on mobile devices or removable media. If this really isn't possible to avoid, then you must anonymise it first and encrypt it.


Post a Comment

Welcome to the RLR UK Blog

This blog is about network and information security issues primarily, but it does stray into other IT related fields, such as web development and anything else that we find interesting.

Tag Cloud

Twitter Updates

    follow me on Twitter

    Purewire Trust