Mathematical Musing – The Myth of Anonymity in America

I recently read an article discussing how just 10 digits would be enough to end privacy as we know it.  The article is a bit alarmist, but makes some interesting points that I’d like to discuss here.

Firstly, the article claims that a 10-digit code is sufficient to uniquely identify every person alive on earth.  Where do they get that figure?  It comes from counting what mathematicians call permutations with repetition: the ordered sequences that can be built from a set of numbers or objects when each one may be reused.  Consider sports jerseys.

A sports jersey has room for two digits, each of which can be any digit from 0 to 9.  There are, therefore, 10 options for each place, giving us a total of 100 possible jersey numbers, from 00 to 99.  What we have just used here is something called the Fundamental Counting Principle (also known as the rule of product).  Essentially, given a number of slots to fill and a number of choices for each slot, the total number of outcomes is equal to the product of the numbers of choices for each slot.  Since a jersey has two slots with ten choices each, the total number of outcomes is 10*10 = 10^2 = 100.
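To make the counting principle concrete, here is a minimal Python sketch that simply lists every possible jersey and counts them.  The function name `enumerate_codes` is my own invention for illustration; it leans on the standard library's `itertools.product`, which generates every way of filling the slots.

```python
from itertools import product

DIGITS = "0123456789"

def enumerate_codes(slots):
    """Return every string of `slots` digits, from all zeros to all nines."""
    return ["".join(combo) for combo in product(DIGITS, repeat=slots)]

jerseys = enumerate_codes(2)
print(len(jerseys))              # 100
print(jerseys[0], jerseys[-1])   # 00 99
```

Listing the outcomes only works for small cases, of course; the point of the rule of product is that we can get the count without listing anything.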

If we instead have a string of ten digits, each place having ten options, the total number of outcomes is 10*10*10*10*10*10*10*10*10*10 = 10^10 = 10,000,000,000, or ten billion.  Considering the world’s population is still less than 7 billion (though it is predicted to reach that mark in 2011), a quantity of 10 billion identification numbers would be more than enough to assign one to every living human being.  The idea of having your entire persona and identity reduced to a string of numbers is a frightening thing to many people.
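The same reasoning can be run backwards: given a population, how many digits do we need?  The sketch below is only an illustration (the helper name `digits_needed` is hypothetical), and it uses the rough 7 billion figure from above as its input.

```python
def digits_needed(population, base=10):
    """Smallest number of slots such that base**slots >= population."""
    slots = 0
    capacity = 1
    while capacity < population:
        capacity *= base   # each extra slot multiplies the number of codes
        slots += 1
    return slots

WORLD_POPULATION = 7_000_000_000  # approximate figure used in the text

print(digits_needed(WORLD_POPULATION))   # 10
print(10 ** 10 - WORLD_POPULATION)       # 3000000000 codes left over
```

Nine digits top out at one billion codes, which is why the tenth digit is the one that matters here.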

Of course, we in the United States already have a system that pretty much does this.

Your social security number is made up of nine digits, meaning there are 10^9 = 1 billion different numbers that could be assigned to individuals and be uniquely identifiable with them.  Originally, people did not get a number until they turned 14 because that was the time they started working.  In 1986, the tax laws were altered so that children over the age of 5 years without one could not be claimed as dependents on income tax forms (and, according to the authors of Freakonomics, “seven million dependents suddenly vanished from the tax rolls … generat[ing] nearly $3 billion in a single year”).  Now, children are assigned such numbers at birth.

Your social security number isn’t quite that simple, however.  The number is divided into three sections.  The first three digits are the area number and are assigned by geographical region.  According to the Social Security Administration, there are no social security numbers with an area number higher than 772, nor are there numbers with an area number of 666 or 000.  The next two digits are the group code and are assigned in a particular order, with only 00 being invalid.  The last four digits are the serial number, which also cannot be 0000.

Moreover, social security numbers from 987-65-4320 to 987-65-4329 are reserved for use in advertisements, and due to a rather amusing source of confusion, the number 078-05-1120 is also no longer in use.

Even with these rules in place, however, there are still roughly 760 million valid social security numbers that could be assigned (applying the rules above directly gives 771 * 99 * 9,999 - 1 = 763,213,670).  Given that the nation’s population only fairly recently surpassed 300 million, this is more than enough.
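For anyone who wants to check the arithmetic, here is a rough Python sketch that applies the rule of product to the restrictions exactly as I have listed them.  The function name is made up for illustration, and the exact total depends on precisely which exclusions are counted and how, so a tally made under slightly different assumptions may differ by a few million while still rounding to about 760 million.

```python
def count_valid_ssns():
    """Count social security numbers allowed by the rules listed in the text."""
    # Area numbers 001-772, skipping 666 (000 is excluded by starting at 1).
    areas = sum(1 for area in range(1, 773) if area != 666)
    groups = 99       # group codes 01-99 (00 is invalid)
    serials = 9999    # serial numbers 0001-9999 (0000 is invalid)
    total = areas * groups * serials
    # 078-05-1120 is retired.  The advertising block 987-65-432x needs no
    # extra subtraction because area 987 is already above the 772 cutoff.
    return total - 1

print(count_valid_ssns())   # 763213670
```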

What I personally find far more troublesome, however, is a different point the news article makes, citing a paper by Latanya Sweeney of the School of Computer Science at Carnegie Mellon University, published in the International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.  (The full article is available as a pdf.)

By examining 1990 US Census data, Dr. Sweeney found that fully 87% of people in the United States can be uniquely identified by their gender, date of birth, and zip code alone.  Think for a moment.  How many online forms have you filled out where you have recorded that exact information, often along with even more details?

Dr. Sweeney goes on to discuss how databases can be cross-checked to reveal even more information about individuals.  She contacted the Massachusetts Group Insurance Commission, which is responsible for purchasing health insurance for state employees, and obtained the “anonymous” medical information of 135,000 such employees.  The records included ethnicities, diagnoses, procedures, and charges for visits, as well as the zip code, gender, and date of birth of each individual.  For $20, she also obtained a copy of the voter registration list for Cambridge, Massachusetts.  In addition to zip code, gender, and date of birth, each voter record included the person’s name, address, and political affiliation.

By linking the two databases via those three pieces of information, she was able to identify the health records of the then-governor of Massachusetts, William Weld, because:

“According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.”
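The linking step itself is mechanically very simple.  Below is a minimal Python sketch of the idea, not Dr. Sweeney’s actual procedure: every record, name, zip code, and diagnosis in it is a made-up placeholder, and the function name `link` is hypothetical.  It indexes the “voter list” by (zip, sex, date of birth) and attaches a name to a “medical” record only when exactly one voter matches.

```python
from collections import defaultdict

# Hypothetical "anonymous" medical records: no names, but quasi-identifiers remain.
medical = [
    {"zip": "00001", "sex": "M", "dob": "1950-03-14", "diagnosis": "diagnosis A"},
    {"zip": "00002", "sex": "F", "dob": "1972-11-02", "diagnosis": "diagnosis B"},
    {"zip": "00001", "sex": "F", "dob": "1985-06-30", "diagnosis": "diagnosis C"},
]

# Hypothetical voter list: names plus the same three fields.
voters = [
    {"name": "Voter One",   "zip": "00001", "sex": "M", "dob": "1950-03-14"},
    {"name": "Voter Two",   "zip": "00002", "sex": "F", "dob": "1972-11-02"},
    {"name": "Voter Three", "zip": "00002", "sex": "F", "dob": "1972-11-02"},
    {"name": "Voter Four",  "zip": "00001", "sex": "M", "dob": "1985-06-30"},
]

def link(medical_rows, voter_rows, keys=("zip", "sex", "dob")):
    """Pair each medical record with a voter when the key fields match exactly one voter."""
    index = defaultdict(list)
    for voter in voter_rows:
        index[tuple(voter[k] for k in keys)].append(voter)

    matches = []
    for row in medical_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:   # unambiguous: the record is re-identified
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches

print(link(medical, voters))   # [('Voter One', 'diagnosis A')]
```

In this toy data the second medical record stays ambiguous because two voters share its three fields, and the third finds no match at all.  Only the combination that belongs to exactly one voter gives its owner away.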

Why does this happen?  Why can such a small amount of information so specifically identify people?  It goes back to the Fundamental Counting Principle.  As we introduce more slots to fill with information, or increase the number of things that can fill each slot, the total number of combinations increases dramatically.  With 365 possible birthdays per year across, say, 80 years of age, two choices for gender, and 5 valid zip codes for the Cambridge, MA area, there are 365*80*2*5 = 292,000 different combinations (292,200 if we factor in leap years).  With a 1990 population of 95,802, there are roughly three times as many combinations as residents, so it shouldn’t be surprising that many combinations apply to just one individual, especially at the older end of the age spectrum.
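To get a feel for how often a combination ends up with a single owner, here is a back-of-the-envelope Python sketch.  It rests on a big assumption of mine that is not in either paper: that residents are spread evenly and independently over all 292,000 combinations.  Real populations are lumpier than that, so treat the result as an illustration rather than an estimate.  The function names are hypothetical.

```python
def count_combinations(days_per_year=365, years=80, genders=2, zip_codes=5):
    """Rule of product applied to the slots described in the text."""
    return days_per_year * years * genders * zip_codes

def expected_unique_fraction(population, cells):
    """Chance that a given person shares their combination with nobody else,
    ASSUMING everyone lands in a uniformly random combination."""
    return (1 - 1 / cells) ** (population - 1)

cells = count_combinations()
fraction = expected_unique_fraction(95_802, cells)

print(cells)                 # 292000
print(round(fraction, 2))    # 0.72
```

Under that crude model, roughly seven in ten residents would be the only person with their particular combination of birth date, gender, and zip code.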

So what can you do?  A follow-up article (pdf) written by Philippe Golle at the Palo Alto Research Center examined the same ideas as Dr. Sweeney’s paper using the 2000 census data.  For one, he found a lower figure for unique identifiability, 63%, though for just about everybody that information narrows the entire population down to only around five people.  He also found that being in your late teens or early twenties greatly increases your anonymity, because such individuals often live in a college environment with thousands of other similarly aged people.  Because most students at a college were born within the same small range of years, there’s a much higher chance that you’ll share an exact birth date with many other people, and since you all share the same zip code, you will be indistinguishable from them using that information alone.

This fact, Golle concludes, is one that people can use to their advantage:

“Finally, those willing to sacrifice truthfulness for optimal anonymity should claim, when asked for their age and ZIP code, to be a 21-year-old male from Camp Pendleton, California (ZIP code 92054); or, if female, to be a 19-year-old from College Station, Texas (ZIP code 77840). They will share these characteristics with respectively 4,099 other males and 3,744 other females.”
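The quantity behind both studies is the size of the group of people who share your exact combination of details.  Here is a small Python sketch of how one might compute that from a list of records.  The data is a made-up toy example and the function names are hypothetical; this is not how either paper handled the census tables.

```python
from collections import Counter

# Hypothetical (zip, sex, date of birth) records, one per person.
people = [
    ("00001", "M", "1950-03-14"),
    ("00002", "F", "1990-05-20"),
    ("00002", "F", "1990-05-20"),
    ("00002", "F", "1990-05-20"),
    ("00003", "M", "1988-09-09"),
]

def anonymity_sets(records):
    """Map each combination to the number of people who share it."""
    return Counter(records)

def unique_fraction(records):
    """Share of people whose combination belongs to them alone."""
    sizes = anonymity_sets(records)
    unique = sum(1 for record in records if sizes[record] == 1)
    return unique / len(records)

print(anonymity_sets(people)[("00002", "F", "1990-05-20")])   # 3
print(unique_fraction(people))                                 # 0.4
```

The three people who share a combination hide behind one another, which is exactly the effect Golle describes for college towns and military bases.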

With the 2010 census right around the corner, it will be interesting to see how these figures change further.  We are becoming an increasingly connected society, and privacy on the Internet has become, for many, a major concern.  To bring the point back to the original article, we don’t need to be assigned ten-digit codes in order to be identified.  Most of us voluntarily give up enough information on a daily basis for that anyway.

Questions? Comments?
