Big data collection makes it hard for you to remain anonymous

How anonymous is "anonymous" in today's digital world?

Not the hacktivist collective -- this is about how anonymous average people are when the data they generate is vacuumed up by everybody from marketers to websites, law enforcement, researchers, government and more.

Is Big Data collection, even with personally identifiable information (PII) stripped out or encrypted, still vulnerable to "re-identification" techniques that pinpoint individuals to the point that intrusive surveillance is possible, or already going on?

Or can "de-identification" leave individuals comfortably faceless in an ocean of data that is just being used to spot trends, track the spread of disease, specify high-crime areas or other things that will improve the economic well-being or health of a population?

Don't expect a unanimous answer from IT and privacy experts. The debate about it is ongoing.

Among those on one side are the authors of a June 2014 white paper sponsored by the Information and Privacy Commissioner (IPC) of Ontario, Canada and the Information Technology & Innovation Foundation (ITIF) titled "Big Data and Innovation, Setting the Record Straight: De-identification Does Work," who argue that privacy advocates and their media enablers should chill out.

Lead authors Daniel Castro and Ann Cavoukian decry what they call "misleading headlines and pronouncements in the media" that, they say, suggest anyone with even a moderate amount of expertise and the right technology tools can expose people whose data has been anonymized.

The fault for the spread of this "myth," they say, is not with findings presented by researchers in primary literature, but "a tendency on the part of commentators on that literature to overstate the findings."

They contend that de-identification, done correctly, is close to bulletproof, reducing the chance of a person being identified to less than 1% -- far less than the risk of simply taking out trash containing documents that might have PII in them.

They also argue that unwarranted fear of a loss of anonymity may undermine "advancements in data analytics (that) are unlocking opportunities to use de-identified datasets in ways never before possible ... (to) create substantial social and economic benefits."

But they do acknowledge that, to be effective, "creating anonymized datasets requires statistical rigor, and should not be done in a perfunctory manner."

And that, according to Pam Dixon, executive director of the World Privacy Forum (WPF), is the problem. She and others contend that outside of the controlled environment of academic research, both anonymity and privacy are essentially dead.

Dixon doesn't quarrel with the white paper's contention that de-identification can be effective, but said that "in the wild," not all datasets are going to be rigorously anonymized.

"In the real world, people aren't going to do that all the time," she said. "To actually get true anonymity in big data, you have to go to an extraordinarily broad aggregate level.

"If you're talking just about data collected for statewide or citywide trends, then it can be de-identified because it's not talking about individuals. But if you're talking how many had the flu in Boston, and any kind of ZIP code data is available, that's different," she said.
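Dixon's distinction between citywide trends and ZIP code-level data can be illustrated with a short sketch. The Python below is only an illustration: the records, the field names and the threshold k are invented for this example and do not come from Dixon or the white paper. It counts how many people share each combination of indirect identifiers. Any combination shared by fewer than k people narrows the field to a handful of individuals, or to just one.

```python
from collections import Counter

# Hypothetical flu records with names already stripped out.
records = [
    {"zip": "02108", "age_band": "30-39", "sex": "F"},
    {"zip": "02108", "age_band": "30-39", "sex": "F"},
    {"zip": "02108", "age_band": "30-39", "sex": "F"},
    {"zip": "02116", "age_band": "60-69", "sex": "M"},
    {"zip": "02116", "age_band": "20-29", "sex": "F"},
]

def risky_groups(rows, fields, k):
    """Return the combinations of `fields` shared by fewer than k rows."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}

# With ZIP code, age band and sex kept, some records stand alone.
fine_grained = risky_groups(records, ["zip", "age_band", "sex"], k=2)

# At the citywide level (no fields kept), everyone is in one large group.
citywide = risky_groups(records, [], k=2)

print("Groups smaller than k with ZIP-level detail:", fine_grained)
print("Groups smaller than k at citywide level:", citywide)
```

In this made-up sample, the two records in ZIP code 02116 are each unique once age band and sex are attached, while the citywide count exposes nobody. Real de-identification work involves much more than a count like this, but the trade-off between detail and anonymity is the same one Dixon describes.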

Joseph Lorenzo Hall, chief technologist at the Center for Democracy & Technology, agrees that while rigorous de-identification is demonstrably effective, the world of data collection does not always meet the ideal. One reason for that, he said, is that truly impregnable de-identification makes data much less useful.

"The essential feature of these sets of data that make re-identification feasible is that records of behavior from the same individual are linked to one another," he said. "That's a big part of the benefit for keeping these records.

"The big problem is public release of data sets that have been poorly anonymized and sharing between private parties data sets that they consider to not contain personal information, when they definitely contain some sort of persistent identifier that could be trivially associated with an individual."

And while some data collection is clearly aimed at the economic well-being or health of people, Hall notes that plenty more is not. "Many retail establishments use Wi-Fi tracking that uses your device's MAC address (a persistent network identifier) to track you through the store," he said.

"This is why Apple has begun randomizing these addresses as announced to the network."

Paul O'Neil, senior information security adviser at IDT911 Consulting, has much the same view. "If de-identification is done properly, then yes, it can work," he said. "But that is a much bigger 'if' than most people realize."

Raul Ortega, head of global presales at Protegrity, also notes how uneven the protection of data is. "Credit card protection is improving, while there is very little being done to de-identify the hordes of PII data that exist in every company," he said.

Part of the problem, say legal experts, may be one of semantics, which leads to public confusion. "We need to be clear what we mean when we call data anonymous," said Kelsey Finch, policy counsel at the Future of Privacy Forum (FPF).

She said only data that has had both direct and indirect identifiers removed should be called "anonymous," while data that still has indirect identifiers should be termed "pseudonymous."

"Very often, advertising companies that track and profile users' cookies or mobile device identifiers call that data anonymous," she said. "However, these same data are often considered personal by privacy advocates, because they can be linked over time to an individual."

Heidi Wachs, special counsel in the Privacy and Information Governance Practice of Jenner & Block, agreed. "I think the word 'anonymous' gets thrown around a lot without a true understanding of how information is collected and shared," she said. "So much of what we do every day online can be traced back to an IP address or a device ID. Even when our names aren't being collected in conjunction with online activity, there is often some form of identifier that uniquely identifies us."
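The gap between "anonymous" and "pseudonymous" that Finch and Wachs describe can be shown in a few lines of Python. This is a sketch under stated assumptions, not a description of how any particular ad network works: the device IDs, page paths and the `pseudonym` helper are hypothetical. It replaces each device ID with a truncated SHA-256 digest, so no name or raw identifier is stored, yet every visit from the same device still lands in the same profile.

```python
import hashlib
from collections import defaultdict

def pseudonym(device_id: str) -> str:
    """Replace a device ID with a short SHA-256 digest (stable per device)."""
    return hashlib.sha256(device_id.encode("utf-8")).hexdigest()[:12]

# Hypothetical log of (device ID, page visited). No names are collected.
events = [
    ("AA:BB:CC:00:11:22", "/flu-symptoms"),
    ("DE:AD:BE:EF:00:01", "/sports"),
    ("AA:BB:CC:00:11:22", "/pharmacies-near-02108"),
]

# Group activity by pseudonym: the same device always maps to the same key,
# so its behavior can still be linked over time.
profiles = defaultdict(list)
for device_id, page in events:
    profiles[pseudonym(device_id)].append(page)

for key, pages in profiles.items():
    print(key, pages)
```

The hashed key hides the original identifier from a casual reader, but it is still a persistent identifier, which is why data like this is better described as pseudonymous than anonymous.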

Indeed, data collectors don't need names to treat individuals differently. In 2012, the travel website Orbitz generated headlines about pitching higher-priced hotel rooms to users of Macintosh computers, since the company's data collection showed them to be wealthier or willing to pay a premium.

And plenty of data collection doesn't come with even an implied promise of anonymity. Examples include highway toll collection readers and ubiquitous security cameras. Ortega notes that there is "face recognition, video that is taken without you knowing, and exercise tracker data about where you run or where you go to work out and when, etc. With the Internet of Things, there will be more data like this collected about us."

O'Neil notes that social media sites have "the most precious datasets" to marketers, and may not have rigorous protection. "Are those companies in the advertising initiatives following the same security best practices as others?" he asked. "Meanwhile, your personal data is traded and moved back and forth like high-frequency stocks between dozens of data aggregators."

Another conundrum is that the more data is de-identified, the less useful it becomes, and there are some cases where people don't expect anonymity, but they do expect their information to be protected.

"There are times when the cost of making data anonymous may be outweighed by the benefits we can reap from higher quality data," Finch said. "We also have to consider situations where we don't want perfect anonymity -- if you're a patient in a clinical study, for example, and a researcher notices a potentially dangerous abnormality in your de-identified records, it would be important they have some way to re-identify you."

Ortega agrees. "You cannot protect data with a 100% chance of being completely secure unless you lock it up in a safe and throw away the key. That would not be good for analysis," he said.

If there is a consensus among experts, it is that most collectors can and should do a better job of protecting the privacy of individuals, either through rigorous anonymization or other privacy protections.

"Data minimization plays a part here," Wachs said. "In any given data set, were all of the data elements necessary to accomplish a specific goal? Or was data just being collected because it could, or people were willing to offer it?"

She said organizations should consider security and privacy risks before they even begin data collection. Among questions to ask are: "Who might want to steal this data? What could they do with it if they were successful? What is the minimum data set required to accomplish the goal? How can this data set be most effectively secured?"

Hall advocates wider use of techniques like RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), which allows "statistics to be collected on the population of client-side strings with strong privacy guarantees for each client, and without linkability of their reports," according to the researchers who developed it.

That, he said, "can result in win-win in terms of collection of data and analysis with few implications for privacy."
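The idea behind this kind of collection can be sketched with classic randomized response, the decades-old survey technique that RAPPOR builds on. The Python below is not RAPPOR itself, which adds further encoding and randomization steps. It is a simplified illustration, and the population size, the 30% rate and the function names are assumptions made for this example. Each person flips a coin: on heads they answer truthfully, on tails they report the result of a second coin flip. No single report reveals much about the person who sent it, but the noise can be subtracted out across the whole population.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the truth half the time; otherwise report a fresh coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(reports) -> float:
    """Undo the noise: P(report yes) = 0.5 * p + 0.25, so p = 2 * (rate - 0.25)."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 in which 30% would truthfully say "yes".
population = [i < 3000 for i in range(10000)]

reports = [randomized_response(answer, rng) for answer in population]
estimate = estimate_true_rate(reports)

print(f"Estimated share answering yes: {estimate:.3f}")
```

The collector ends up with an estimate close to the true share while every individual keeps plausible deniability about their own answer, which is the kind of "win-win" Hall refers to.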

But the bottom line is that there is not a way to guarantee anonymity. "Even if we applied today's cutting edge anonymization techniques across the board," Finch said, "five years from now new technologies and new data sets could potentially make that data re-identifiable."


Taylor Armerding
