Big data democratization: a new open framework


Providing public easement—the new “open”

Most definitions of “open” systems have revolved around intellectual property, and the vast majority of the successful ones around copyright. These definitions are used to measure licenses, which pass or fail. But data doesn’t exist in a similar intellectual property regime. And data itself is hard to define, especially given that in a computational context, everything is eventually a one or a zero.

Instead of thinking about an “open” system in the traditional sense, we recommend thinking of it as providing an easement, or a public path, to data. In this model, there are three layers of data:

Private: A trusted organization might be the steward of all information going into, residing in, and leaving a system. It owns some of that data outright, along with content, algorithms, technologies, etc., that it has produced on its own or under partnership agreements.

Trusted: A subset of all data, content, etc., is either the non-exclusive or exclusive property of a partner, according to whatever agreements the lead organization and the partner arrive at. As steward of this property (and in some cases partial owner of that property) the lead organization is obliged to ensure integrity of and access to this property in the system. This property can be used for a range of purposes that may include biomedical research, creation of patents, etc.

Public: The public at large will have access (and in some cases a right) to select data and content in the knowledge base. Contributors to the knowledge base may choose to stipulate a “public good” clause in their agreement with the lead organization, requiring that the information be made accessible to certain kinds of research initiatives under particular circumstances. In addition, patients may want access to information tied to their personal health record, for instance when they (or someone on their behalf) input healthcare data into the system in order to obtain benefits from it.
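To make the three layers concrete, here is a minimal sketch of how a system might decide who can read a record. The names (Layer, Role, Record, can_read) and the policy table are our own illustrative assumptions, not part of any existing system, and real partnership agreements would call for much finer-grained rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    PRIVATE = "private"   # held by the steward (lead) organization
    TRUSTED = "trusted"   # property shared with partners under agreement
    PUBLIC = "public"     # accessible to the public at large


class Role(Enum):
    STEWARD = "steward"
    PARTNER = "partner"
    PUBLIC = "public"


# Hypothetical policy: the layers each role may read.
READABLE = {
    Role.STEWARD: {Layer.PRIVATE, Layer.TRUSTED, Layer.PUBLIC},
    Role.PARTNER: {Layer.TRUSTED, Layer.PUBLIC},
    Role.PUBLIC: {Layer.PUBLIC},
}


@dataclass(frozen=True)
class Record:
    record_id: str
    layer: Layer
    subject_id: Optional[str] = None  # the patient the record is about, if any


def can_read(role: Role, record: Record, requester_id: Optional[str] = None) -> bool:
    """Return True if the requester may read the record."""
    # Assumption: a patient can always read records tied to their own health record.
    if record.subject_id is not None and record.subject_id == requester_id:
        return True
    return record.layer in READABLE[role]
```

In this sketch, a member of the public cannot read a trusted record in general, but can read one whose subject_id matches their own identifier.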

Data Democratization: Principles

We can develop rudimentary principles and heuristics for democratizing big data. These principles become important in cases where personal healthcare information blends with information from the knowledge base. For instance, it is widely acknowledged that the accuracy and timeliness of a diagnosis increase dramatically with the amount of information the physician has about the patient.

1.     The Access Principle: Data democratization is fundamentally about access to data, not solely control of data. Whether the data is held by a government, a corporation, or a nonprofit, if it relates to a specific individual and is stored at the individual level, that individual should have the right to access and download a copy of it.

2.     The Annotation Principle: We live in a world in which individuals with access to their own data can make gifts of data, and likely over time even investments of data. So when data about specific individuals is provided to individuals, it should carry sufficient annotation so that it can be reused computationally by third parties.

3.     The Transaction Principle: If a company, government, or nonprofit is gathering data about specific individuals in order to subsidize goods or services, that transaction should be specified clearly to individuals at sign-up. Users should consent to the service after being offered a clear, layperson-readable summary of the deal. Transactional data relationships are fine; hiding them is not.

4.     The Export Principle: Individuals should be able to export all data about themselves from systems in order to migrate to new systems. All walled gardens must have doorways.
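As a rough illustration of how the Annotation and Export Principles might fit together, the sketch below bundles every record about one person into a self-describing JSON document. The function name export_for and the field names (unit, source, collected_on) are hypothetical; a real system would more likely adopt an established health data vocabulary.

```python
import json


def export_for(subject_id, records):
    """Build a self-describing JSON export of every record about one person."""
    own = [r for r in records if r["subject_id"] == subject_id]
    bundle = {
        "subject_id": subject_id,
        "record_count": len(own),
        # Each record keeps its annotations (unit, source, collection date)
        # so a third party can interpret the value without guessing.
        "records": [
            {
                "name": r["name"],
                "value": r["value"],
                "unit": r.get("unit"),
                "source": r.get("source"),
                "collected_on": r.get("collected_on"),
            }
            for r in own
        ],
    }
    return json.dumps(bundle, indent=2, sort_keys=True)
```

Missing annotations appear as null in the output rather than being silently dropped, which makes gaps in annotation visible to whoever receives the export.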

These are not revolutionary principles on their own. Indeed, many of them are implemented in one way or another, at varying levels of completeness, by most major data aggregators. But most aggregators practice selective implementation. One can download Facebook data, but not email contacts, for example.

Annotation is noticeable primarily in its absence. And the terms of use for sites or apps that collect user-level data run to absurd lengths, require graduate-level education to parse, and intentionally hide the fundamental transactional nature of the services. The goal in a “big data democratization” definition is to knit these practices together to allow us to understand the policies of data collectors and providers.

Think of it as the Good Housekeeping seal of approval for big data. For example, foundations making investments in grants can use the principles to understand the outputs in which they invest, and individuals can view the democratization “score” of a service before they decide if it’s worth it. We have long leveraged the social power of quality trademarks, and we can bring the same power of normative scoring into the data world.
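A democratization “score” could be computed in many ways. The following is a minimal sketch that assumes the simplest possible one: each of the four principles counts equally, and the score is the fraction a service satisfies. The function name and the equal weighting are our assumptions, and a real seal of approval would need a far more careful rubric.

```python
PRINCIPLES = ("access", "annotation", "transaction", "export")


def democratization_score(assessment):
    """Return the fraction of the four principles a service satisfies (0.0 to 1.0).

    `assessment` maps a principle name to True or False; principles that are
    missing from the mapping are treated as not satisfied.
    """
    unknown = set(assessment) - set(PRINCIPLES)
    if unknown:
        raise ValueError(f"unknown principles: {sorted(unknown)}")
    met = sum(1 for p in PRINCIPLES if assessment.get(p, False))
    return met / len(PRINCIPLES)
```

A service that offers access and export but neither annotation nor a clear statement of the transaction would score 0.5 under this scheme.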

Democratization will not likely emerge from the dominant players in data collection and aggregation. It will emerge first from those who already agree with the ideas and principles, many of whom lack access to expensive attorneys, designers, programmers, ideation experts, and more. It will also emerge from those in the funding community who buy into the ideas but lack the experience to distinguish something that is truly democratic from something that simply bears the trimmings of democracy.

Big Data Democratization: Tools

It is vital to focus on tools that embed the previously described principles and can be distributed as a package just as software is distributed. We might even provide a data democratization kit that contains the following, to be adopted by and adapted to multiple ecosystems over time:

Open Designs

• Interaction designs for data sharing

• Operational processes and practices for democratized systems

Open Code

• Open source software for data return to individuals

• Cloud services for applications-based data return to individuals

• Data safe deposit boxes for democratized data

• Cryptographically sound techniques for creating user IDs about individuals
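We do not prescribe a specific technique for creating user IDs. As one possible sketch, a keyed hash (HMAC-SHA256 from the Python standard library) can turn an identifier such as an email address into a stable pseudonymous ID that cannot be recomputed without the secret key. The function names and the normalization step are illustrative assumptions.

```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random 256-bit secret key; the data steward must keep it private."""
    return secrets.token_bytes(32)


def pseudonymous_id(identifier: str, key: bytes) -> str:
    """Derive a stable pseudonymous ID from an identifier such as an email address."""
    # Normalize so that trivial variations map to the same ID.
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()
```

The same identifier and key always yield the same ID, so records can be linked across datasets held under one key, while a different key produces unrelated IDs.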

Open Law

• Trademarks that can be community-applied to good, bad, and indifferent data providers

• Terms of use for data collection

• Privacy policy for data distribution

• Consent practices for data collection

• Informed consent with IRB needed in health data collection

• Informed consent with IRB probably needed to make a data gift

• Basic checkbox consent with layperson text probably sufficient elsewhere for data collection

• Selected existing open copyright licenses

• Patent policies for group projects

  1. Eden

    This is a great article and a great idea!

  2. I think this is very interesting and fresh! :-) As I understand it, you have visualized a shared knowledge dimension (or dimensions). As far as personal health information is concerned, have you also visualized a “patient centered starting point”, one that starts from personal control that should always be there, or do you see it more as a network of data sources that do not necessarily go via the “patient”? I mean this as a principle, not as a rule, of course.