The Lybbaverse / john-wilbanks  

Big data democratization: a new open framework


Providing public easement—the new “open”

Most of the definitions in “open” systems have revolved around intellectual property, and of those, the vast majority of successful ones around copyright. The definitions are used to measure licenses, which pass or fail. But data doesn’t exist in a similar intellectual property regime. And data itself is hard to define—especially given that in a computational context, everything is eventually a one or a zero.

Instead of thinking about an “open” system in the traditional sense, we recommend thinking of it as providing an easement, or a public path, to data. In this model, there are three layers of data:

Private: A trusted organization might be the steward of all information going into, residing in, and leaving a system. It owns some of that data outright, along with content, algorithms, technologies, etc., that it has produced on its own or according to partnership agreements.

Trusted: A subset of all data, content, etc., is either the non-exclusive or exclusive property of a partner, according to whatever agreements the lead organization and the partner arrive at. As steward of this property (and in some cases partial owner of that property) the lead organization is obliged to ensure integrity of and access to this property in the system. This property can be used for a range of purposes that may include biomedical research, creation of patents, etc.

Public: The public at large will have access (and in some cases a right) to select data and content in the knowledge base. Contributors to the knowledge base may choose to stipulate a “public good” clause in their agreement with the lead organization, requiring that the information be made accessible to certain kinds of research initiatives under particular circumstances. In addition, patients may want access to information tied to their personal health record, for instance in cases when they (or someone on their behalf) input healthcare data into the system in order to obtain benefits from the system.
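To make the three layers concrete, here is a minimal sketch in Python. It assumes a strict ordering in which a requester cleared for one layer can also read everything more open than it; that ordering, the `Layer` and `Record` names, and the sample records are illustrative assumptions, not part of any existing system. Real agreements (public-good clauses, a patient's right to their own record) would need finer-grained rules than this.

```python
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    """The three layers, ordered from most open to most restricted (an assumption)."""
    PUBLIC = 0
    TRUSTED = 1
    PRIVATE = 2


@dataclass(frozen=True)
class Record:
    name: str
    layer: Layer


def visible_records(records, clearance):
    """Return the records a requester with the given clearance may read."""
    return [r for r in records if r.layer <= clearance]


# Hypothetical records, one per layer.
records = [
    Record("aggregate outcomes summary", Layer.PUBLIC),
    Record("partner assay results", Layer.TRUSTED),
    Record("internal algorithms", Layer.PRIVATE),
]

public_view = [r.name for r in visible_records(records, Layer.PUBLIC)]
partner_view = [r.name for r in visible_records(records, Layer.TRUSTED)]
```

In this sketch the public sees only the first record, a partner sees the first two, and only the steward sees all three.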

Data Democratization: Principles

We can develop rudimentary principles and heuristics for democratizing big data. These principles become important in cases where personal healthcare information blends with information from the knowledge base. For instance, it is widely acknowledged that the accuracy and timeliness of a diagnosis increase dramatically according to the amount of information the physician has about the patient.

1.     The Access Principle: Data democratization is fundamentally about access to data, not solely control of data. Whether the data is held by a government, a corporation, or a nonprofit, if it relates to a specific individual and is stored at the individual level, that individual should have the right to access and download a copy of it.

2.     The Annotation Principle: We live in a world in which individuals with access to their own data can make gifts of data, and likely over time even investments of data. So when data about specific individuals is provided to individuals, it should carry sufficient annotation so that it can be reused computationally by third parties.

3.     The Transaction Principle: If the company, government, or nonprofit is gathering data about specific individuals in order to subsidize goods or services, that transaction should be specified clearly to individuals at sign up. Users should consent to the service after being offered a clear, layperson-readable summary of the deal. Transactional data relationships are fine—hiding them is not.

4.     The Export Principle: Individuals should be able to export all data about themselves from systems in order to migrate to new systems. All walled gardens must have doorways.
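As a rough illustration of what the Annotation and Export principles could mean in practice, the sketch below bundles one person's data with enough annotation (units, collection date, source) for a third party to reuse it computationally. The field names, the `export_record` function, and the sample values are all hypothetical; no existing export format is implied.

```python
import json


def export_record(subject_id, measurements):
    """Serialize one individual's measurements, with their annotation, as JSON."""
    bundle = {
        "subject_id": subject_id,
        "schema_version": "0.1",  # hypothetical version label
        "measurements": measurements,
    }
    return json.dumps(bundle, indent=2, sort_keys=True)


# Made-up sample data: every value carries its unit, date, and source.
measurements = [
    {
        "name": "LDL cholesterol",
        "value": 162,
        "unit": "mg/dL",
        "collected": "2012-06-01",
        "source": "hypothetical lab",
    },
]

exported = export_record("subject-001", measurements)

# A third party can load the export without guessing at units or provenance.
round_trip = json.loads(exported)
```

The point is less the format than the habit: a bare number is hard to reuse, while a number with its unit, date, and source can travel.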

These are not revolutionary principles on their own. Indeed, many of them are implemented in one way or another, at varying levels of completeness, by most major data aggregators. But most aggregators practice selective implementation. One can download Facebook data, but not email contacts, for example.

Annotation is noticeable primarily in its absence. And the terms of use for sites or apps that collect user-level data run to absurd lengths, require graduate-level education to parse, and intentionally hide the fundamental transactional nature of the services. The goal in a “big data democratization” definition is to knit these practices together to allow us to understand the policies of data collectors and providers.

Think of it as the Good Housekeeping seal of approval for big data. For example, foundations making investments in grants can use the principles to understand the outputs in which they invest, and individuals can view the democratization “score” of a service before they decide if it’s worth it. We have long leveraged the social power of quality trademarks, and we can bring the same power of normative scoring into the data world.
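One very simple way to picture such a score is a checklist over the four principles. The sketch below assumes equal weighting and a yes/no judgment per principle, which is a simplification of ours; the function names and the sample service are hypothetical.

```python
PRINCIPLES = ("access", "annotation", "transaction", "export")


def democratization_score(practices):
    """Count how many of the four principles a service satisfies (0 to 4)."""
    return sum(1 for p in PRINCIPLES if practices.get(p, False))


def unmet_principles(practices):
    """List the principles a service does not yet satisfy."""
    return [p for p in PRINCIPLES if not practices.get(p, False)]


# A made-up service that offers access and export but little else.
example_service = {
    "access": True,
    "annotation": False,
    "transaction": False,
    "export": True,
}

score = democratization_score(example_service)
gaps = unmet_principles(example_service)
```

A funder or an individual could read the list of gaps as easily as the number, which may be the more useful half of the seal.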

Democratization will not likely emerge from the dominant players in data collection and aggregation. It will emerge first from those who already agree with the ideas and principles, many of whom lack access to expensive attorneys, designers, programmers, ideation experts, and more. It will also emerge from those in the funding community who buy into the ideas but lack the experience to distinguish something that is truly democratic from something that simply bears the trimmings of democracy.

Big Data Democratization: Tools

It is vital to focus on tools that embed the previously described principles and can be distributed as a package just as software is distributed. We might even provide a data democratization kit that contains the following, to be adopted by and adapted to multiple ecosystems over time:

Open Designs

• Interaction designs for data sharing

• Operational processes and practices for democratized systems

Open Code

• Open source software for data return to individuals

• Cloud services for applications-based data return to individuals

• Data safe deposit boxes for democratized data

• Cryptographically sound techniques for creating user IDs about individuals

Open Law

• Trademarks that can be community-applied to good, bad, and indifferent data providers

• Terms of use for data collection

• Privacy policy for data distribution

• Consent practices for data collection

• Informed consent with IRB needed in health data collection

• Informed consent with IRB probably needed to make a data gift

• Basic checkbox consent with layperson text probably sufficient elsewhere for data collection

• Selected existing open copyright licenses

• Patent policies for group projects
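As one possible reading of the Open Code item on cryptographically sound user IDs, the sketch below derives a stable pseudonymous ID from an identifier with a keyed hash (HMAC-SHA256) from Python's standard library. This is an assumption about what such a technique might look like, not a description of any kit that exists; the `make_user_id` name and the example addresses are hypothetical, and key storage and rotation are left out entirely.

```python
import hashlib
import hmac
import secrets


def make_user_id(secret_key: bytes, identifier: str) -> str:
    """Derive a stable pseudonymous ID from an identifier using HMAC-SHA256."""
    # Normalize so trivial variations of the same identifier map to one ID.
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# The steward would generate this key once and keep it secret.
key = secrets.token_bytes(32)

id_a = make_user_id(key, "alice@example.org")
id_b = make_user_id(key, " Alice@Example.org ")
id_c = make_user_id(key, "bob@example.org")
```

Here `id_a` and `id_b` match while `id_c` differs. Without the key, an outsider cannot recompute an ID from a guessed email address, which is the advantage of a keyed hash over a plain one.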

Unreasonable people unite: Lybba Fellow John Wilbanks at TEDGlobal 2012


The following was reblogged from TED Blog

John Wilbanks arrives on stage with some bad news, some good news and a task. But first, he says, let’s be honest. We all get sick. We don’t always die, but quite reasonably we do try to find out what’s going on.

In the late 1800s, Dr. Carlos Finlay had a hypothesis. He thought yellow fever was not transmitted by chance or dirty clothes but by mosquitoes. He was laughed at as the Mosquito Man, but his theory was vindicated some 20 years later. How? By volunteers who moved to Cuba, lived in tents and agreed to be infected with the disease. They knew full well that they might die as a result, and they knew what they had signed up for thanks to a document known as informed consent. “We should be very proud of this as a society,” Wilbanks says. Informing participants “makes us different from the Nazis.”

But times have changed, and informed consent has become a millstone around the neck of medical advancement. “What we think of as health are now interactions of choices and environment, and clinical methods are not good at studying that,” he says. “Those are based on person to person interaction.” And now we live in a networked world.

As such, the data collected on diseases such as prostate cancer or Alzheimer’s descends into a silo from which it is impossible to extract. “It cannot be networked, it cannot be integrated, it cannot be used by people who aren’t credentialed,” says Wilbanks. That means a physicist couldn’t use it to try out a good idea. A computer scientist would have to get credentials to use the data to test a hypothesis. “Computer scientists aren’t patient,” he says. “They don’t file paperwork.” The inference: what lateral thinking are we missing through this well-intended departmentalization? “The tool to protect us from harm is protecting us from innovation.”


It’s a life-threatening problem. 45% of men in the United States develop cancer; 38% of women. 1 in 4 men die; 1 in 5 women. 1,500 people a day die from cancer while the U.S. spends $226 billion on the disease each year. Wilbanks has first-hand knowledge: his sister is a cancer survivor, as is his mother-in-law. “Cancer sucks,” he says bluntly. And let’s be honest, privacy leaves the room when cancer enters. So when Wilbanks shares with survivors that the tool designed to protect them is in fact preventing their data from being put toward the development of a potential cure, “the reaction is not ‘thank you, God, for protecting my privacy,’” he says. “It’s outrage that we have this information and we can’t use it.”

But there’s some good news. Our world includes “digital exhaust,” a phenomenon he thinks of as the dust trail kicked up by his son running in the woods. That means we can track ourselves on our own. He shows an iPhone app, Eatery, through which we can monitor and share details of the food we eat.

Nowadays we can get our genes read; before too long that’ll be our whole genome. Wilbanks shares details of his own scan. He carries a 32% risk of prostate cancer and 14% risk of Alzheimer’s. When he got his report, doctors advised him not to tell anyone. “Will that help anyone cure me?” he asked. “No one could tell me yes. I live in a web world where when you share things beautiful things happen.”

So he didn’t stop there. He got his bloodwork back and started to share it. “I have bad cholesterol,” he says, and bad liver results (the result, he claims, of a good, wine-filled dinner party the night before the test). But, he adds, “look at how non-computable the information is!” Indeed, the printout he throws up onscreen looks like a throwback to the days of the first dot-matrix printers.

Wilbanks’ proposal is a medical commons, a way for people to gather this medical data and share it freely. People are neurotic about privacy and keeping control of their data. “Some of us like to share as control.” And he believes we live in an age where people agree with him. He mentions a study run at Vanderbilt University in Tennessee. “It’s not the most science-positive state in America,” he says. “Only 5% wanted out. People like to share if given the opportunity and choice.” And not using this data to understand health issues through mathematical analysis “is like having a giant set of power tools but leaving them not plugged in while using hand saws.”

“This is the world’s first fully digital, self-contributed, unlimited in scope, global in participation, ethically approved clinical study,” he says. “It’s a way to reach behind and grab that dust, to extract medical records and donate them, to have them syndicated to mathematicians who’ll do big data research.” To sign up, all you have to be is over 14 years old, “willing to sign a contract to say you’re not going to be a jerk. Oh, you have to solve a Captcha too.” That’s it. “If you don’t like those terms, don’t come in.”

Here’s the other thing about systems: it doesn’t take that many people to make big advances. “It didn’t take that many to make or keep up Wikipedia,” he says. “We need a small number of unreasonable people working together.” And let’s move on from using the word “patient.” “I don’t like being patient when systems are broken, and healthcare is broken. I’m not talking about politics but the way scientists approach healthcare. I don’t want to be patient.”

Now for his task. “Try when you get home to get your data,” he challenged. “You’ll be shocked, offended, outraged at how hard it is to get it.” So try, and share it if you feel like it. He’s looking forward to seeing how many unreasonable people will join him. “After all,” he concludes: “It doesn’t take all of us. It takes all of some of us.”

Photos: James Duncan Davidson