A basic principle underlying the digital economy is that users trade their data for convenience. For instance, Google provides search, email, and about a zillion other things for users in return for their personal information: what they like, where they go, who they talk to. Facebook does the same thing only more so, and so do other social media sites and many big corporations.
We users still don’t know what those companies do with their vast hoards of information. We assume they sell it to advertisers or, like Google, use it to subtly direct our attention toward products that match our preferences. But we certainly can’t tell how they analyze it, or what can be learned from it. New research just published by England’s prestigious University of Cambridge may give us a glimpse. And the results should give one pause, because a lot more can be found out than anyone suspected.
In cooperation with Microsoft, the British researchers analyzed data from 58,000 Facebook users who volunteered to participate in the study. They looked at a single, broadly public signal: Facebook Likes, through which users indicate preferences and affinities for just about anything. They discovered that the subtle “digital traces” left in this data could predict individual attributes and personality traits with a surprisingly high degree of accuracy.
Here are a few of the things they could deduce about users, and how often the prediction was correct:
- Race (Caucasian vs. African American): 95%
- Gender: 93%
- Sexual orientation (gay men): 88%
- Politics (Democrat vs. Republican): 85%
- Religion (Christianity vs. Islam): 82%
- Sexual orientation (lesbian women): 75%
- Smokes cigarettes: 73%
- Drinks alcohol: 70%
- Single vs. in a relationship: 67%
- Uses drugs: 65%
- Parents still together at age 21: 60%
Note that some of this information is far from obvious, like whether a person’s parents were still together when he or she turned 21. As for sexual orientation, only about 5% of gay users clicked “Like” on gay marriage. Yet the researchers’ algorithms were able to infer these highly personal traits from patterns in everything else those users Liked. In other words, users might not always announce what they are, but their music and other enthusiasms spoke loudly for them.
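To make the mechanics concrete: the study is generally described as compressing the huge, sparse user-by-Like matrix with singular value decomposition and then fitting plain regression models on the resulting components. Below is a minimal sketch of that kind of pipeline in Python with scikit-learn. The data is entirely synthetic, and the matrix sizes, the invented trait, and the signal strength are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a Likes-to-traits pipeline (assumed setup: SVD-reduced
# Like vectors fed to logistic regression; all data here is synthetic).
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one row per user, one column per
# Like page; 1 means the user clicked "Like" on that page.
n_users, n_likes = 5000, 2000
likes = sparse_random(n_users, n_likes, density=0.01,
                      format="csr", random_state=0)
likes.data[:] = 1.0

# Hypothetical binary trait to predict (say, smoker vs. non-smoker),
# loosely correlated with a handful of Likes so the model has signal.
signal = np.asarray(likes[:, :25].sum(axis=1)).ravel()
trait = (signal + rng.normal(0, 0.5, n_users) > 0.2).astype(int)

# Step 1: compress the sparse user-by-Like matrix into a short dense
# vector of "taste dimensions" per user.
components = TruncatedSVD(n_components=100,
                          random_state=0).fit_transform(likes)

# Step 2: fit an ordinary logistic regression on those components.
X_train, X_test, y_train, y_test = train_test_split(
    components, trait, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Studies like this typically report AUC: the probability that the model
# ranks a random positive case above a random negative one.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC on held-out users: {auc:.2f}")
```

The point worth noticing is how ordinary the machinery is: no exotic model, just compression plus regression, which is why a recipe like this scales cheaply to hundreds of millions of users.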
They also tested for other psychological traits, including intelligence, emotional stability, openness to change, and extroversion. Surprisingly, the model’s estimates of these latent traits proved almost as accurate as standard psychological tests. And they turned up a few puzzlers, too. It turns out that people with high IQs tend to “like” curly fries, and non-smokers are drawn to “that spider is more afraid of you”. Alas, there doesn’t seem to be any hard data on what liking LOLcats means.
While 60% accuracy is not worth staking one’s life on, it is more than enough to be valuable to advertisers. The researchers also suggest that their results could lead to new modes of psychological assessment on an unprecedented scale, since the algorithms could be applied to hundreds of millions of people without their knowledge. And it’s important to remember that this is just public data, willingly surrendered by users. Web search and browsing histories might reveal almost as much. What might be discovered from the information Facebook keeps private?
It’s also important to note that all this information was extracted automatically, by machines using mathematical models; no humans sat down and looked through the data themselves. This, too, has important implications, because algorithms can be wrong. More and more automated systems rely on big databases and mathematical models to do everything from deciding who can drive, receive Medicare benefits, vote, or win government contracts, to determining whether someone is a deadbeat dad. And there have been cases where each of these systems has gone wrong, often with tragic real-world results.
One study by an African-American professor at Harvard found that Google’s search results were “inadvertently racist,” serving more ads about arrest records for searches on black-identifying names than on white ones. Sometimes the harm may even be deliberate, as when gay men who acted on a Google recommendation to download gay social-networking software were also steered toward a sex-offender tracking app.
Clearly, there are some big issues involved here. And the situation is only likely to get worse as humongous amounts of data accumulate that can only really be examined and weighed by machines. Human safeguards need to be in the loop, because it is people who are responsible for, and affected by, these decisions. But it may take a while, and a few disasters, before that happens.