
Saturday 5 April 2014

Privacy and Big Data

This has been a hard blog post to write. I’ve drafted it several times, each time pausing before posting and then dismissing the draft a few days later in frustration.

It’s hard to tread the right line when discussing privacy. On the one hand, it’s easy to come across as some form of rabid revolutionary. On the other, the significance and enormity of the implications are all too easy to play down. I hope I’ve got the tone right, because it’s an important topic.

My interest in this topic started with a series of talks I gave about Big Data and Cognitive Computing. For those not of a technical persuasion, these terms mean:

  • Big Data is the technology that allows massive volumes of data to be collected and processed economically.
  • Cognitive Computing is an emerging technology that aims to build computer systems that better emulate the human brain, in some ways appearing to ‘think’ like we do. By necessity, a computer that appears to think must know a lot, so Big Data is an underpinning technology in cognitive systems.

I had some great feedback on the talks, but time and again the discussion they provoked rapidly focused on the issue of privacy. Or rather, the risk that these technologies will rob us of our privacy. The reaction was consistent and strong: people are concerned about a loss of privacy. They view it as something of importance and do not think it’s OK to give away their data. This reaction got me thinking and researching. What I found perplexed me and led to this blog post.

I guess in the technical world we’ve become used to the fact that our data is being exploited by others. But the penny dropped for me only recently that most people do not realise how much data they are giving away or the way it’s being used. When I point this out to them, they are almost always shocked. It’s worrying that private data should be collected and used without the knowledge or permission of those whom the data is about.

So, what have I noticed that concerns me and my audiences?


Security Agencies

The revelations around national security agencies spying on their own public are well documented.

If we heard that governments were opening all of our snail-mail and reading it, there would be justifiable outrage. But from what we hear of their interception of electronic mail, browsing history and messaging, there seems to be little real difference. In fact, I might consider electronic surveillance worse than the interception of physical mail because of its insidious nature.

Technology has made this possible on a vast scale and with near invisibility. But just because it is possible doesn’t automatically mean it is right. It’s always possible to argue that collecting more data means security agencies can better target their efforts; but does this mean there should be no limits on their data collection powers? I don’t think so.

Of course I expect security agencies to spy on those suspected of wrong-doing. But to collect data indiscriminately on all of us brings uncomfortable parallels with the East German Stasi. In 1978 a young Englishman, Timothy Garton Ash, moved to Berlin. Fifteen years later, after the fall of communism, he returned to look at the file the Stasi had compiled on him. It contained a meticulous record of his life in Berlin, a story he recounts in ‘The File’. Is this our future as well?

I dislike this idea, not because of some idealistic world-view, but because I think it has the potential to be a threat to even the most innocent.

There’s a psychological phenomenon called “Confirmation Bias”: we all have a tendency to interpret information in a way that confirms our existing beliefs. We might think we’re objective, but our brains are wired in a way that often makes us far from it. We tend to ignore facts that refute our beliefs, focusing instead on those that appear to support them.

Confirmation Bias can be particularly dangerous when there is a lot of information at hand, because it becomes easier to assign guilt to innocent people by selectively choosing fragments of information that support a rogue theory. If you think this unlikely, it’s worth studying some real case studies of wrongful imprisonment of innocent individuals precisely because of confirmation bias.

Security agencies even appear to have “private” webcam images from millions of citizens. Exposing yourself to a webcam arguably might not be the best idea in the world, but it’s certainly not illegal.

If you capture enough private data then it becomes inevitable that you’re going to find something potentially incriminating on quite a large proportion of the population – whether that be pornographic webcam images, tittle-tattle about others, private admissions of minor motoring offences, or downloads of illegally ripped music. The low-level transgressions that many otherwise innocent citizens commit are probably quite extensive.

“We are all capable of believing things which we know to be untrue, and then, when we are finally proved wrong, impudently twisting the facts so as to show that we were right.” George Orwell

Even completely innocent activities can, when a culprit is being sought, take on new and sinister implications. A photograph of a smiling man with a bowl of mince and a meat cleaver might be an image of an amateur chef, or a sinister picture of a mass-murderer. Support of a fringe politician might be a sign of innocent eccentricity, or of a political extremist with violent tendencies. If you suspect guilt, then these items might be used to confirm it – but they might just as easily be of no consequence whatsoever.

Having too much information risks making potential criminals of us all. It is not safe to assume “I’ve got nothing to fear because I’ve done nothing wrong”.

“If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him.” Cardinal de Richelieu

As I write, shocking revelations about the corrupt practices around the dreadful Stephen Lawrence case are emerging. It appears that something more than just individual rogue officers has been at fault over a period of decades. With such a backdrop, it’s hard not to be suspicious of assurances from those in positions of power.

As citizens, it is our duty to ask what data is being collected, under what circumstances it can be used, by whom and most importantly with what independent oversight. For without that independent oversight, how are we to believe anything we are told?

That is not to say that targeted data collection, with appropriate oversight and controls, can’t be justified. But who controls this and decides what is appropriate and what is not? So far the revelations I’ve read do not appear to have been balanced by any form of democratic oversight that ensures what is being done is appropriate. The answers to questions about oversight appear confused and murky. In such circumstances we have to assume the oversight is ineffective, or at least that there is no oversight we have any visibility of or influence over. If oversight isn’t visible and we can’t influence it, then it’s not effective.


Public Health records

There’s an initiative in the UK, called care.data, to pool anonymised public health records. The idea is that this data can then be analysed for the benefit of citizens. On the surface this seems a noble endeavour.

However, there are a number of issues with this initiative, namely:

  • Whilst the data is anonymised, it’s possible people might be re-identified by matching those anonymised records with other data.
  • It’s proposed that data will be sold to certain qualifying third parties, for example drug companies. This raises the question of who will have access to our (anonymised) health records and what they might do with them (see previous point).
  • Although there was a half-hearted attempt to inform the public, this was so badly managed that it ended up being via a leaflet distributed with junk mail. Most of us dumped it in the bin without noticing what it was.
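The re-identification risk in the first point is worth making concrete. A minimal sketch follows, using entirely invented records: the “anonymised” health data has names removed but keeps quasi-identifiers (postcode district, birth year, sex), and a second, public dataset shares those same fields. This is the same mechanism as Latanya Sweeney’s well-known linkage of US hospital data to voter rolls; nothing here comes from the actual care.data scheme.

```python
# Toy illustration of re-identification by record linkage.
# All records are invented for the sake of the example.

# "Anonymised" health records: names removed, quasi-identifiers kept.
health_records = [
    {"postcode": "SW1A 1", "birth_year": 1956, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "M1 4",   "birth_year": 1982, "sex": "F", "diagnosis": "asthma"},
]

# A separate, public dataset (think: an electoral roll) carrying
# the same quasi-identifiers alongside real names.
public_register = [
    {"name": "John Smith", "postcode": "SW1A 1", "birth_year": 1956, "sex": "M"},
    {"name": "Jane Jones", "postcode": "M1 4",   "birth_year": 1982, "sex": "F"},
]

def reidentify(health, register):
    """Join the two datasets on their shared quasi-identifiers."""
    matches = []
    for h in health:
        key = (h["postcode"], h["birth_year"], h["sex"])
        candidates = [p for p in register
                      if (p["postcode"], p["birth_year"], p["sex"]) == key]
        if len(candidates) == 1:  # a unique match de-anonymises the record
            matches.append((candidates[0]["name"], h["diagnosis"]))
    return matches

print(reidentify(health_records, public_register))
# [('John Smith', 'diabetes'), ('Jane Jones', 'asthma')]
```

A unique combination of even a few quasi-identifiers is often enough to pin an “anonymous” record to a named individual, which is why simply stripping names is not real anonymisation.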

However, the theoretical concerns around care.data ended up being overtaken by real-life events when it emerged that similar hospital data had been sold to a consulting company, which had then uploaded it to Google’s cloud servers in the USA for analysis.

Now I work for an IT company that strictly forbids the use of such third-party cloud services because they are deemed a privacy risk to its business. And given recent NSA revelations, we have to assume that this data, now that it exists in the US, is in that government’s hands. I have no idea what they might do with it, but it’s available.

There’ve been other troubling revelations about how our health data is being treated. For example, this same hospital data was sold to life insurance actuaries so that they could calculate the probability of death given a particular hospital procedure. The purpose, of course, was to “better” calculate health insurance premiums. The custodian of the data, the Health and Social Care Information Centre, has admitted that this particular revelation broke its rules and it is “investigating”.

“patients’ medical records contain secrets, and we owe them our highest protection. Where we use them – and we have used them, as researchers, for decades without a leak – this must be done safely, accountably, and transparently.” Ben Goldacre

As Ben Goldacre so ably argues, this saga of institutional incompetence (for that’s what it surely is) is so troubling because the value of using analytics on our health data for public good is so great. To see this value so comprehensively undermined by a loss of public trust is indeed upsetting. Let’s be honest here: if public trust is lost, such initiatives rapidly become politically impossible.


Social Media

My third area of concern over privacy is the way that various Social Networking and large Internet companies are grabbing our data and using it without our knowledge or permission.

Without knowledge or permission is the critical fact here. It’s OK for our data to be used if we give informed consent. But I would contend that very, very few people really understand what data they are giving away on the internet. Nobody is making the collection and use of data either explicit or obvious to users; presumably because they don’t want to scare people off.

Let’s look at some real examples of what I’m talking about:

  • What proportion of Google Mail customers realise that Google is reading the content of all their emails? It’s using that data to build a profile of you and decide what adverts to serve you. Why is this any different from the Post Office opening your letters and deciding what junk mail to include with your post?
  • Who realises that Google stores all of your web searches, and Facebook all of your Graph Searches, again for profiling purposes? All of those embarrassing medical problems you googled are retained for posterity - did you know that?
  • Does anyone realise that if they leave their Facebook or Google+ account logged in, then Facebook or Google can track their movement across the web? They know which websites you visit and build a profile of you based on them. This might explain some of the adverts you see appearing on your screen.
  • Are you conscious that if you take a photograph with your smartphone it probably includes embedded tags giving the latitude/longitude of the location it was taken at? Take a picture at home and send it to someone and you’ve just told them precisely where you live.
  • Who really understands that every computer on the Internet has an IP address that identifies it, and that when you visit any site your IP address is provided to that site? Further, it’s trivial to map an IP address to an approximate location. But our web browsers also disclose other information about our computer - the operating system, the browser version, installed plugins, etc. Only so many people in a given geographic area will have an identical computer configuration. An IP address, along with some intelligent browser fingerprinting, can uniquely identify who we are before we’ve consciously divulged any information.
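The fingerprinting point in the last bullet can be quantified. Each browser attribute carries “bits” of identifying information: if a fraction f of users share your value, seeing it contributes -log2(f) bits. The sketch below uses entirely hypothetical frequencies (real figures would come from measurement studies such as the EFF’s Panopticlick) and assumes the attributes are independent, which overstates things slightly but shows the principle:

```python
import math

# Hypothetical fractions of web users sharing each attribute value.
# These numbers are invented for illustration; real studies measure
# them empirically across millions of browsers.
attribute_frequency = {
    "user_agent":  0.015,   # this exact browser/OS version string
    "timezone":    0.20,    # this UTC offset
    "screen_size": 0.05,    # this resolution and colour depth
    "plugins":     0.002,   # this exact list of installed plugins
}

def surprisal_bits(freq):
    """Identifying information (in bits) carried by one attribute value."""
    return -math.log2(freq)

# Assuming independence, the bits from each attribute simply add up.
total_bits = sum(surprisal_bits(f) for f in attribute_frequency.values())
print(f"combined fingerprint: {total_bits:.1f} bits")
# combined fingerprint: 21.7 bits
```

Roughly 22 bits narrows a visitor to about one in 3 million people; add an IP-derived location and a few more attributes and the fingerprint is, for practical purposes, unique.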

Who reads those impenetrable “terms of service” where the real privacy rules of those free services are described? Here’s an extract from Google.

“Your Content in our Services: When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide licence to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes that we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.”

Do you still want to store files in Google Drive knowing that you’re giving away such liberal rights? But how many people actually read and understand this stuff? I’d wager very, very few.

There’s nothing illegal about any of these things. If you agreed to T&Cs without reading them, it’s your lookout. But are these big internet companies really doing the right thing by hiding what they do in long legalese they know that hardly anybody reads and even fewer understand?

Do these companies not have a duty of care to be more open and actively educate us? Their products are so polished, surely they could put a little effort into better communicating their privacy implications? Maybe they could even make it clear what their business model is - i.e. “We’re offering this service to you for free, but in return we will collect the information you provide to us and use it to target adverts.”

By being so obscure in their approach, are these companies doing what’s right or just what they think they can get away with? For I can only assume the reluctance to be open is because they are scared of how customers might react - there can be no other logical reason.


The threat of data centralisation

There’s also a good reason why we need to be aware and alert to how our personal data is collected and used. The more this data is collected and centralised, the higher the risk from the “bad guys”. Even if you’re OK with what Google or Facebook does, when you put lots of interesting data together you present an attractive target for hackers. Many high profile and respectable companies have been hacked and had data like this stolen. It’s never safe to assume that because the organisation hosting the data is “respectable” that nothing untoward will happen.


What’s legal or what’s right?

This is the crux of my concern across security and government agencies and commercial companies. They all appear to be doing what they think they can get away with. Little of what I’ve mentioned here seems to be “what’s right” by citizens. For if it were right, we’d be informed about it and we’d probably be asked permission. There would certainly be effective democratic institutions where we could influence what was happening.

Instead, the collection and analysis of this data is done surreptitiously and in an underhand manner. If you’re a security agency then you need to be underhand to a certain degree - but do you really need to collect data on all of us? Do you really need to gather embarrassing personal factoids on people who will never present a threat? And if you’re not dealing with security data then you have no excuse at all. We might live in a democracy, but when all major political parties support the status quo, how are we to influence this?

Just because something might be possible and legal, doesn’t mean it should be done.

As some of the protagonists of the banking crisis have found, society has a habit of finding ways to punish miscreants, even if they haven’t actually broken any laws. “Fred the Shred” (the CEO of one of the major UK banks impacted by the financial crisis) might not be in jail, but he was subjected to a sustained campaign of public humiliation by the UK press; it’s rumoured that he’s no longer invited to polite dinner parties, his pension is much reduced and he was humiliatingly stripped of his knighthood.


But does privacy matter anymore?

When even Mark Zuckerberg takes to phoning the President of the United States to “express frustration” with its snooping on private individuals, you know that someone is rattled.

The more revelations we see, the more people start to withdraw. They stop using Facebook, or they reduce the information they share. My daughter’s school puts a great deal of effort into educating its students on the perils of “over-sharing”. Despite Facebook’s success, I know more people who don’t use it (or who use it minimally) than those who are active users. I was told recently of a bank that studied the proportion of its customers with social media profiles; it estimated that 60% didn’t have one, and amongst those who did, the vast majority were inactive. I suspect the silent majority are more concerned about privacy than those within the “tech bubble” realise.

People have a limit to what they will tolerate. If that limit is breached then consequences result. I’ve seen some argue that privacy no longer exists and we should give up trying to pretend it does. My personal view is that this is a broken argument; I’ve never met anyone who really believes it. I’ve met a few people who work in the technology industry who espouse it, but is this really their opinion, or are they acting as a corporate mouthpiece to further their careers? In other words, if the commercial interests of their employer were deeply intertwined with the protection of privacy, would they argue differently? I think they might.

Certainly the reaction to my Big Data and Cognitive Computing talks has been unanimous; nobody argued against the need for privacy in the subsequent discussions. Everyone expressed concern.

I am forced to conclude that privacy is still important. At least it should be our choice as citizens; nobody has the right to take it away without our agreement. Governments and some large internet companies seem to be trying to hide what they are doing and surreptitiously make the removal of privacy the norm.

Most of us care about the risk of identity theft, and we care that we might live in a surveillance society. Most of us are inherently private people and want our medical records protected with the utmost respect. Life is not just about the efficiencies that might be gained by relaxing privacy. We care about privacy; it’s part of what makes us individuals.


What should be done?

At this point you might be thinking we need to stop the bus. However, I’m personally of the view that it’s impossible to stop change. The emergence of Big Data and Cognitive Computing technologies is impossible to rewind, and there are benefits to the technology for sure. But without a better balancing of the privacy risks, there’s a real risk of public dissent. People can stop using social media. People can protest on the streets and force a new political reality. The press can mount very effective campaigns that bring about change. Once a tipping point is reached, things start happening. Despite what some think, the little people run things.

The collection and processing of private data on a large scale is a relatively new practice, so it’s perhaps not surprising that we should first enter a “Wild West” phase. But if this is to be anything other than a flash in the pan, there needs to be much more user control and transparency around how our data is used.

It’s not good enough for commercial companies and government agencies to continue the “trust us, we have your best interests at heart” line; from what I can see, we have good reason not to trust them. There are enough examples of those in positions of power abusing the trust they’ve been given for us to be suspicious.

So what should these data-gathering organisations do?

They need to:

  1. Build a culture that respects customers, that does what is right and not what is expedient. When “doing the right thing” might sometimes be at odds with perceived short-term revenue objectives, such a culture needs to be robust. Robust cultures can only be driven from the very top, so strong leadership that sets the right tone is paramount. A leadership dominated by immediate commercial pressures, or one driven from lower down an organisation, is unlikely to be strong enough to counterbalance those pressures. And it’s about leading by doing, not through words. History is filled with examples of companies who’ve said one thing and done another. Employees pick up on unspoken cues and take the action they think is wanted. If employees don’t see the leadership bolstering a company’s policy with specific and continuous actions, they might start to think the policy is just lip-service and not important.

  2. Consider setting up independent ethics committees to oversee the use of private data. I do mean independent. An ethics committee made up of those whose job is processing that data won’t have the distance to make the right judgement calls. We need ethics committees that are able to say “no” even when that might be against immediate commercial interests. Some level of independent scrutiny might help to counter-balance the strength of expediency that’s so prevalent in both commercial and governmental organisations.

  3. Be open about what you are doing and, wherever possible, obtain informed consent. Those who process data that any reasonable person might consider to be ‘private’ need to make it very clear to the user what data is being collected and how it will be used. Very few internet companies are doing this today.

  4. Give users clear options that give them control of their data and of how it will be used. Default settings should favour privacy over sharing, and organisations should sell the benefit of sharing to persuade users to change those settings. It’s not enough to provide hidden-away controls, or verbose privacy policies that no regular person will ever read. Policies and controls need to be accessible, clear, obvious and written in plain English.

  5. Take security very, very seriously. Too many large organisations have had private data stolen by hackers because they didn’t put a high enough focus on protecting their customers’ data. Those who process private data have a moral and legal duty to keep that data secure. In today’s world, top-class security skills are required to discharge that responsibility. That costs money, but not doing it costs more. Reputational damage alone from a security breach can be enormous, and the current bill for Target’s recent hack runs to $61m and is still rising.

The veteran UK politician Tony Benn was famous for his 5 questions about power and democracy. If knowledge (and by implication data) is power, asking these questions of those who collect and process our private data might be illuminating.

  1. What power have you got?
  2. Where did you get it from?
  3. In whose interests do you exercise it?
  4. To whom are you accountable?
  5. How can we get rid of you?

I think it’s reasonable that those who own data should have influence over those who process it. That means I want it to be easier to understand what data Facebook or Google are collecting on me and how they use it. It means I want to know how government security agencies are democratically accountable and how I can influence those who oversee them. It means I want control over how my health data is processed. Resolution of these issues would be a sign that we’ve exited the “Wild West” phase of data processing, but I fear that in 2014 we are still firmly in the centre of the gold rush.
