Cellphone Data Better than Fingerprints at Identifying Users
The routes people regularly travel may identify them even better than fingerprints do. In fact, mobile location data is so distinctive that nearly every cellphone user can be recognized from a random sample of just four places they have been over the past year, researchers say.
These findings, based on an investigation of cellphone data, reveal that defending privacy may be more difficult than previously thought because even mostly anonymous data could be used to identify individuals.
Modern electronics, such as mobile phones, raise privacy concerns because many of these devices now can record personal data, such as where a person has been. Such knowledge could be used to reconstruct an individual's movements and reveal information about their behavior that they might not want others to know. For instance, this information could yield details about where competitors are going for customers, which church a person attends, or whether they visited an abortion clinic.
However, despite the potentially delicate nature of this information, it could also offer many benefits to both the public and scientists. To address privacy concerns, the data are usually presented in data sets that provide locations over time but remove names, home addresses, phone numbers and other obvious identifiers.
"When you have personalized services that allow you to use your phone to search for pizza near you, this is what this data allows you to do," said Yves-Alexandre de Montjoye, a computational social scientist at MIT. "Knowing where traffic is right now is also based on mobile-phone data, on where users of mobile phones are concentrated."
"From a scientific perspective, we've already made amazing discoveries from this data. For instance, we can use it to study the spread of malaria, since the mosquitoes that carry the disease mainly travel long distances via humans," de Montjoye added. "We can use it to understand human behavior on large scales."
Still, it was uncertain precisely how anonymous these data sets really were. For instance, if the pattern of a person's behavior is unique enough, knowledge from these data sets could, with assistance from other publicly available data, help reveal his or her identity. In one study, a medical database was combined with a list of voters to successfully extract the health record of the governor of Massachusetts.
To learn more about the information hidden in cellphone data, scientists analyzed information from about 1.5 million cellphone users in an unnamed European country from April 2006 to June 2007. Each time a person initiated or received a call or text message, the location of the connecting antenna in the mobile-phone network was recorded.
Each cellphone was given a unique, randomly generated identification number so that researchers could track its user's movement over time. The data were otherwise anonymous: researchers had no other information connecting that number to the phone's owner.
On average, each user had 114 interactions per month with the network of nearly 6,000 antennae. In total, the scientists had about 170 million such interactions as reference points to investigate.
Analyzing these cellphone data sets revealed that people often moved in regular, unique patterns. The researchers were able to successfully identify 95 percent of cellphone users just by knowing four randomly chosen reference points from each user.
Why was the success rate so high? People usually spend most of their time at a few places, such as their home or office. Therefore, a random selection will likely pick at least one of these locations, which are usually enough to identify the person.
"Four is a really low number, a surprisingly low number," de Montjoye told TechNewsDaily. "It just shows you how much there is to learn from cellphone data."
At most, 11 reference points were needed to identify people by their routes. In comparison, to identify someone by a fingerprint, at least 12 points are generally needed.
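To make the test described above concrete, here is a minimal Python sketch of the general idea: pick a few points at random from one user's record and check whether any other user's record also contains all of them. It is an illustration under stated assumptions, not the researchers' code. The `traces` layout (a pseudonymous ID mapped to a set of `(antenna, hour)` pairs), the function names and the toy records are all hypothetical.

```python
import random

# Hypothetical toy records: pseudonymous ID -> set of (antenna_id, hour) points.
toy_traces = {
    "u1": {(1, 8), (2, 9), (3, 18), (1, 22)},
    "u2": {(1, 8), (2, 9), (4, 18), (1, 22)},
    "u3": {(5, 7), (6, 12), (7, 19), (5, 23)},
}

def matching_users(traces, known_points):
    """Return the IDs of all users whose record contains every known point."""
    known = set(known_points)
    return [uid for uid, trace in traces.items() if known <= trace]

def uniqueness(traces, p=4, seed=0):
    """Fraction of users singled out by p points drawn at random from their own record."""
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for uid, trace in traces.items():
        if len(trace) < p:
            continue  # too few points to draw a sample of size p
        eligible += 1
        sample = rng.sample(sorted(trace), p)
        # The user is identified if nobody else shares all p sampled points.
        if matching_users(traces, sample) == [uid]:
            unique += 1
    return unique / eligible if eligible else 0.0

print(matching_users(toy_traces, [(1, 8)]))           # ['u1', 'u2']: one point is ambiguous
print(matching_users(toy_traces, [(1, 8), (3, 18)]))  # ['u1']: a second point settles it
print(uniqueness(toy_traces, p=4))                    # 1.0 on this toy data
```

On real records, running `uniqueness` for increasing values of `p` would show how quickly the share of identifiable users grows with each extra known point.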
These findings held true even given how coarse the location and time data were: the records only placed cellphone users within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour. Even when the locations were as imprecise as somewhere amid 15 adjacent cell towers, or the times as imprecise as a 15-hour span, researchers could identify cellphone users by their routes about half of the time.
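The effect of blurring the data can be sketched in the same spirit. The Python example below is a simplified illustration, not the study's method. It assumes that integer division of an antenna ID stands in for merging adjacent towers (real adjacency would come from geography), it compares whole records rather than a few sampled points, and the names `coarsen` and `users_with_unique_trace` and the toy records are hypothetical.

```python
from collections import Counter

# Hypothetical toy records: pseudonymous ID -> set of (antenna_id, hour) points.
toy_traces = {
    "u1": {(0, 8), (2, 18)},
    "u2": {(0, 8), (3, 18)},
    "u3": {(6, 9), (7, 20)},
}

def coarsen(traces, antenna_group=1, hour_window=1):
    """Blur each point by merging antennas into groups and hours into windows."""
    return {
        uid: {(antenna // antenna_group, hour // hour_window)
              for antenna, hour in trace}
        for uid, trace in traces.items()
    }

def users_with_unique_trace(traces):
    """Return the sorted IDs of users whose whole record matches no other user's."""
    counts = Counter(frozenset(trace) for trace in traces.values())
    return sorted(uid for uid, trace in traces.items()
                  if counts[frozenset(trace)] == 1)

print(users_with_unique_trace(toy_traces))                            # ['u1', 'u2', 'u3']
print(users_with_unique_trace(coarsen(toy_traces, antenna_group=2)))  # ['u3']
```

In this toy case, merging antennas in pairs makes two of the three users indistinguishable, which is the kind of protection coarsening is meant to offer. The study's point is that on real data this protection erodes far more slowly than one might hope.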
In principle, anonymized cellphone data could be used to confirm a person’s identity by cross-referencing the data with information such as a person's home or work address, or geo-localized tweets or pictures.
"It seems really hard to anonymize this kind of data," de Montjoye said.
Similar relationships might exist for other types of data. "I would not be surprised if a similar result — maybe requiring more points — would, for example, extend to Web browsing," said César Hidalgo, a researcher at MIT. "The probability that two people would have the same exact trajectory, whether it's walking or browsing, is almost nil."
Although these findings may seem to raise concerns about violations of privacy, the researchers hope their formula will instead provide a way for researchers and policy analysts to think more carefully about the privacy safeguards needed for this information. For example, rather than handing out anonymized data sets containing location data to anyone, caretakers of this information might ask companies to specify the calculations they want to run on the data, then return only the answers for the companies to pass on to their customers.
"Both César and I deeply believe that we all have a lot to gain from this data being used," de Montjoye said. "This formula is something that could be useful to help the debate and decide, 'OK, how do we balance things out, and how do we make it a fair deal for everyone to use this data?'"
The scientists detailed their findings online March 25 in the journal Scientific Reports.