Many companies track customer data using some form of anonymisation: the personal data is processed so that all directly identifying details are removed. This benefits both the user, who has to worry less about their privacy, and the company, which has to worry less about data protection legislation. However, from a theoretical perspective, most forms of anonymisation do not really deliver anonymity.

You have probably been asked by an installer or web application whether you would mind the anonymous collection of details that could help improve the application. Anonymous here means that your private details are not transmitted, but usage details do get sent to the vendor. For example, your DVD player could report that a certain DVD has been played by a 26-year-old male living in Amsterdam, without at any point transmitting details that would directly reveal it is you.

Entropy and Information Theory: The Mathematics Behind Anonymity
A well-known concept within information theory is Shannon entropy. This value expresses the expected amount of information within a certain message, commonly expressed in bits. For example, the gender of a person has an entropy of one bit, as there are two possible values. Please note that, for the sake of simplicity, this example assumes that 50% of the population is male and 50% is female, and leaves everyone else out of consideration.
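As a minimal sketch of the entropy calculation described above, the snippet below computes Shannon entropy in bits for a discrete distribution. The function name `shannon_entropy` and the 90/10 split are illustrative assumptions, not figures from any real dataset.

```python
import math

def shannon_entropy(probabilities):
    """Return the expected information, in bits, of a discrete distribution."""
    # Outcomes with probability 0 contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# The simplified 50/50 gender example from the text: exactly one bit.
gender_bits = shannon_entropy([0.5, 0.5])

# A hypothetical attribute with a 90/10 split carries less than one bit on average.
skewed_bits = shannon_entropy([0.9, 0.1])

print(gender_bits, skewed_bits)
```

The more evenly the values of an attribute are spread over the population, the more bits an observer learns, on average, from seeing that attribute.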

Using these information-theoretic concepts, one can work out how unique a given combination of attributes is within a group. Returning to the example from the introduction, there are probably quite a few 26-year-old males living in Amsterdam. However, if Amsterdam were replaced by a small village, there would be far fewer people to choose from. In other words, the former description carries much less identifying information than the latter: the fewer people a description matches, the more bits it reveals about who the person is.
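The comparison between Amsterdam and a small village can be sketched numerically. The snippet below measures how many bits a description reveals as the base-2 logarithm of the ratio between the whole population and the number of people matching the description. All population figures are made-up placeholders chosen for illustration, and `bits_revealed` is a hypothetical helper name.

```python
import math

def bits_revealed(population, matching):
    """Bits of identifying information in a description that `matching`
    out of `population` people satisfy."""
    return math.log2(population / matching)

# Hypothetical figures, for illustration only.
POPULATION = 17_000_000   # assumed size of the whole country
CITY_MATCHES = 8_000      # assumed number of 26-year-old males in the city
VILLAGE_MATCHES = 2       # assumed number of 26-year-old males in the village

bits_to_identify = math.log2(POPULATION)  # bits needed to single out one person
city_bits = bits_revealed(POPULATION, CITY_MATCHES)
village_bits = bits_revealed(POPULATION, VILLAGE_MATCHES)

print(f"needed to identify one person: {bits_to_identify:.1f} bits")
print(f"revealed by the city description: {city_bits:.1f} bits")
print(f"revealed by the village description: {village_bits:.1f} bits")
```

Under these assumed numbers, the village description leaves only one further bit to be learned before the person is uniquely identified, whereas the city description leaves many more.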

The Paradox of Anonymous Data Collection
When evaluating the privacy of a system, one wants to look at the unlinkability of items. Two items are unlinkable when an observer cannot tell whether they are related or not. When most items in a database are unlinkable, it becomes very hard to trace them back to personal details.

However, the ability to find these links is probably one of the reasons the company started collecting the data to begin with. In other words, full anonymity would prevent the party collecting the data from performing the statistical analysis on which it wants to base the optimisation of its business, application or marketing strategy. This creates a paradox: respecting privacy tends to make the collected data a lot less useful. Of course, statistical analysis at a more aggregate level remains possible either way.

The Long Tail of Anonymous Databases
Another very interesting phenomenon is the so-called long tail effect. This refers to the fact that a lot of cases tend to fall in the long, thin tail of a statistical distribution. For example, when you request a list of ten watched movies from each of your friends, there is a very high chance that every list contains at least one very obscure movie. This obscure entry makes it much easier to identify your friends on the basis of those supposedly anonymous lists of watched movies.

In practice, this effect is why linking two databases can result in a complete loss of anonymity. When the databases are combined, the obscure entries make it much easier to match the records that belong to one person, thereby revealing the combined details of this person. This can be enough to identify the person the data belongs to.
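To make the linking step concrete, here is a toy sketch under stated assumptions: one "anonymous" database keyed by pseudonyms, one public database keyed by names, and a matching rule that links a name to a pseudonym when they share a title that occurs in exactly one anonymous record. All names, titles and the `link_records` helper are invented for illustration; real linkage attacks use more robust statistical matching.

```python
from collections import Counter

# Hypothetical "anonymised" viewing histories, keyed by pseudonym.
anonymous_views = {
    "user_a": {"Blockbuster One", "Blockbuster Two", "Obscure Film X"},
    "user_b": {"Blockbuster One", "Blockbuster Two", "Obscure Film Y"},
    "user_c": {"Blockbuster One", "Obscure Film Z"},
}

# Hypothetical public reviews, keyed by real name.
public_reviews = {
    "Alice": {"Blockbuster One", "Obscure Film Y"},
    "Bob": {"Blockbuster One"},
}

def link_records(anonymous, public):
    """Link names to pseudonyms via titles seen in only one anonymous record."""
    title_counts = Counter(t for titles in anonymous.values() for t in titles)
    rare_titles = {t for t, count in title_counts.items() if count == 1}
    links = {}
    for name, titles in public.items():
        # Pseudonyms that share at least one rare title with this person.
        candidates = [pid for pid, seen in anonymous.items()
                      if titles & seen & rare_titles]
        if len(candidates) == 1:
            links[name] = candidates[0]
    return links

links = link_records(anonymous_views, public_reviews)
print(links)  # {'Alice': 'user_b'}
```

In this toy data, the popular titles link nobody, while a single obscure title is enough to attach a real name to the full record behind a pseudonym.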

Anonymisation: It Does Not Really Exist
Sure, one can try to anonymise data. However, as long as one wants to maintain a fine-grained level of information for statistical analysis, the data can probably be de-anonymised using the same statistical techniques. For this reason, collections of “anonymised” data still call for adequate protection, and the methods used to make data anonymous deserve a decent amount of scepticism.
