Data overload, the semantic web, and future data-sharing incentives
Michael Kinsley’s “Cut This Story!” is a must-read for anyone who writes (or reads) in 2010. The article, in this month’s Atlantic, offers the latest diagnosis of the decline of print media: the excessive wordiness of traditional newspaper articles. Long, comma-laden sentences, once useful for providing context in the slow, print-only world, now only distract from the modern, networked reader’s main goal: extracting useful bits of newsworthy information.
Kinsley writes of the print-journalism trade, but in his description of the modern information-seeker, he identifies a general problem as yet unsolved by internet technology. People and software are becoming better and faster at searching through data–articles, databases, images–but accessing that data is still like using a glorified card catalog: entering a search term, finding a set of web pages that might be relevant, and then wading through content to find an answer. The goal of computer scientists is to turn the internet into an intelligent research librarian, able itself to sift through information to answer questions directly. The so-called “semantic web” promises to transform the way we interact with information, especially in fields like the life sciences, where the research community has been side-swiped by a flood of interdisciplinary data coming from new technologies. However, making the semantic web a reality will require a revolution in the way scientists and others share data that is currently proprietary, and, even among government-funded academics, the incentives for opening up such data are still missing.
Many talk about the “disruptions” to media, commerce and communication caused by the online, computerized world. Compared to the advanced information network represented by the semantic web ideal, however, our current internet seems fairly tame. Today, new knowledge is still mainly transmitted in media forms that would have been largely familiar in 1990, except in quantity and access speed. In science, freely available, web-only journals are a recent innovation, but the fundamental medium of the journal article has evolved little since the middle of the 20th century, if not before.
Simply put, the semantic web is (or will be) “semantic” because information content is tagged in such a way as to organize and connect it with other related information. The intent is to embed these content descriptions in a standardized way that is both intelligible to humans and also efficiently interpreted by computers. It is meant to be a broad movement and guiding philosophy for anyone who wants to put information online and make it useful for future “semantic” applications (such as intelligent, interpretive search engines of the future).
For science, the semantic web promises to overcome the barriers to communicating the overwhelming abundance of data produced by modern technologies, such as ultra-high-throughput DNA sequencing. Ideally, it will provide a way for scientists to knit together large amounts of data from different projects so that other researchers can query it, asking questions and receiving intelligent answers. And, because well-designed semantic tags should slice through jargon, the semantic web will be ideal for nonspecialists–for instance, non-scientists who wish to investigate pharmaceutical interactions or climate research.
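To make the idea of tagged, connected information a little more concrete, here is a minimal sketch in Python. It stores a few statements as subject–predicate–object triples and answers a question by chaining them together. Everything in it is an assumption made for illustration: the gene, protein, drug and project names are invented, and the helper functions are hypothetical. Real semantic-web data is typically expressed in standards such as RDF and queried with SPARQL; this plain-Python version only illustrates the principle.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Invented example statements, as if contributed by two separate projects.
TRIPLES: List[Triple] = [
    ("geneA", "encodes", "proteinA"),              # hypothetical project one
    ("proteinA", "interacts_with", "proteinB"),    # hypothetical project one
    ("drugX", "inhibits", "proteinB"),             # hypothetical project two
    ("drugY", "inhibits", "proteinC"),             # hypothetical project two
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return every triple that agrees with the fields that were given.

    A field left as None acts as a wildcard.
    """
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def drugs_affecting_partners_of(gene: str, triples: List[Triple]) -> List[str]:
    """Follow the chain gene -> protein -> interaction partner -> inhibiting drug."""
    drugs: List[str] = []
    for _, _, protein in match(triples, subject=gene, predicate="encodes"):
        for _, _, partner in match(
            triples, subject=protein, predicate="interacts_with"
        ):
            for drug, _, _ in match(triples, predicate="inhibits", obj=partner):
                drugs.append(drug)
    return drugs


if __name__ == "__main__":
    print(drugs_affecting_partners_of("geneA", TRIPLES))
```

The point of the sketch is that no single statement answers the question; the answer appears only when statements from different sources share the same well-defined vocabulary and can be linked by software.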
Tim Berners-Lee, the internet pioneer and semantic web visionary, wrote in 2001 of the enormous change in the scientific publishing model represented by the semantic web. An important consideration is that semantically tagged content is only as useful as the careful knowledge curation that goes into building it:
The concept of machine-understandable documents does not imply some magical artificial intelligence allowing machines to comprehend human mumblings. It relies solely on the machine’s ability to solve well-defined problems by performing well-defined operations on well-defined data.
Thus, building this new information network requires experts to input their data into pre-established semantic frameworks, where new information about molecular interactions or genetic traits can be read by database software itself. At the least, this process would add a new computer-readable component to traditional journal articles. More likely, it would provide a way to share information that would eventually side-step journal publication completely. Berners-Lee predicted that the semantic web could be based, at first, on traditional articles but would evolve toward a decentralized model in which results are shared outside the context of a research paper:
Papers that include this new mark-up language will be found by new and better search engines, so users will be able to issue significantly more precise queries. More importantly, experimental results can themselves be published on the web, outside the context of a research paper. So a scientist can design and run an experiment, and create an emerging web page containing the information that he or she wants to share with trusted colleagues.
As traditional publication loses importance, new questions about data integrity arise: how do you peer-review data incorporated into the semantic web, or otherwise ensure trust? Just as significantly, the shift adds layers of complexity to assigning attribution for research data. Academic scientists judge each other on the quality and quantity of the journal articles they produce; journal publication is the standard for scientific success. If science is to benefit from the futuristic information environments promised by the semantic web, which relies on the prompt and full disclosure of data analysis, individual investigators will need sufficient incentives to contribute.
New incentives or mandates will be necessary to overcome the increasingly burdensome task of preparing highly complex research findings for machine-readable databases. Formatting findings to be machine-readable might be feasible in the case of a small-scale study of the interactions or regulation of a single molecule or complex. But what about experiments that use new technologies and are based on tens of thousands (if not millions) of new pieces of information, such as the analysis of a whole transcriptome? Who is going to curate this information for the semantic web?
There is an even more fundamental issue regarding the full release of raw data, and it must be resolved before data-intensive findings can be incorporated into the semantic web. Currently, biomedical researchers often amass data from a number of sources–both small-scale and large-scale–before summarizing, analyzing and communicating it in the form of the journal article. In many fields, it is common for a research paper to be based on gigabytes (if not terabytes) of data, but in the journal article this data, after extensive computer processing, is condensed into several graphical figures and text. Since they don’t fit into the traditional format of a journal article, the original raw data become de facto proprietary–in some cases merely because there is no established clearinghouse for them. Such a situation is becoming increasingly common–just last week, I was attempting to compare new deep-sequencing data to similar, large datasets summarized in a recent publication. The recent (government-funded) publication, however, had no link to its original data. Sure, I could email the lead author to get access to the data, but then they would know what I’m working on before I get a chance to publish.
Not providing a way to access the primary data behind a publication runs counter to the policies of most funding agencies (e.g. the UK’s Wellcome Trust), but in the situations I’ve encountered, it was not likely that the authors or journal were acting particularly malevolently. Even in 2010, it is logistically difficult to post many gigabytes of data on public servers. Moreover, most scientists are accustomed to submitting journal articles with 3-5 figures containing gels and microscopic images (and perhaps video) as the data supporting their claims, relying on their peers to trust their analysis. Submitting raw digital data (whether from deep sequencing or high-resolution microscopy) is a new paradigm for accountability that diminishes, if only superficially, the author’s authority.
One way to ensure that new forms of data are available (and eventually incorporated into semantic-based information structures) may be a centralized government mandate. The past few years have seen new rules requiring government-funded research to be freely available to the community within a certain time-frame after publication. The Obama administration is currently drafting new requirements for the public release of government-funded research data, as part of its general initiative to open access to government data. However, the current federal effort is still focused on ensuring public access to traditional publications, an effort that began a few years ago at the NIH’s PubMed Central.
As the U.S. Office of Science and Technology Policy puts it in a recent invitation to comment on publication access policies: “Access demands not only availability, but also meaningful usability.” Organization of new research knowledge into semantic-web repositories represents the future of “meaningful usability”. Will government be able to mandate that scientists contribute data to the semantic web, when that effort simultaneously disrupts the traditional publication process on which reputations are based? Getting the research community to take full advantage of the semantic web and related knowledge-organizing technologies will require a total rethinking of the incentives for knowledge sharing–one that replaces the rewards currently conferred when a peer-reviewed article is accepted for publication.
