<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>yalepatents.org &#187; bioinfomatics</title>
	<atom:link href="http://yalepatents.org/tag/bioinfomatics/feed/" rel="self" type="application/rss+xml" />
	<link>http://yalepatents.org</link>
	<description>Discussing Yale, intellectual property reform and biotech industry in New Haven and Connecticut.</description>
	<lastBuildDate>Mon, 28 Jun 2010 13:15:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Data overload, the semantic web, and future data-sharing incentives</title>
		<link>http://yalepatents.org/2010/01/12/data-overload-the-semantic-web-and-future-data-sharing-incentives/</link>
		<comments>http://yalepatents.org/2010/01/12/data-overload-the-semantic-web-and-future-data-sharing-incentives/#comments</comments>
		<pubDate>Wed, 13 Jan 2010 02:45:42 +0000</pubDate>
		<dc:creator>Joseph B. Franklin</dc:creator>
				<category><![CDATA[News & Commentary]]></category>
		<category><![CDATA[bioinfomatics]]></category>
		<category><![CDATA[open-access]]></category>
		<category><![CDATA[open-source]]></category>
		<category><![CDATA[semantic web]]></category>

		<guid isPermaLink="false">http://yalepatents.org/?p=710</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Data+overload%2C+the+semantic+web%2C+and+future+data-sharing+incentives&amp;rft.aulast=Franklin&amp;rft.aufirst=Joseph&amp;rft.subject=News+%26amp%3B+Commentary&amp;rft.source=yalepatents.org&amp;rft.date=2010-01-12&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://yalepatents.org/2010/01/12/data-overload-the-semantic-web-and-future-data-sharing-incentives/&amp;rft.language=English"></span>
Michael Kinsley&#8217;s &#8220;Cut This Story!&#8221; is a must-read for anyone who writes (or reads) in 2010.  The article, in this month&#8217;s Atlantic, offers the latest diagnosis on the decline of print media: the excessive wordiness of traditional newspaper articles.  Long, comma-laden sentences, once useful for providing context in the slow, print-only world, now only distract [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Data+overload%2C+the+semantic+web%2C+and+future+data-sharing+incentives&amp;rft.aulast=Franklin&amp;rft.aufirst=Joseph&amp;rft.subject=News+%26amp%3B+Commentary&amp;rft.source=yalepatents.org&amp;rft.date=2010-01-12&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://yalepatents.org/2010/01/12/data-overload-the-semantic-web-and-future-data-sharing-incentives/&amp;rft.language=English"></span>
<p>Michael Kinsley&#8217;s <a href="http://www.theatlantic.com/doc/201001/short-writing" target="_blank">&#8220;Cut This Story!&#8221;</a> is a must-read for anyone who writes (or reads) in 2010.  The article, in this month&#8217;s <em>Atlantic</em>, offers the latest diagnosis of the decline of print media: the excessive wordiness of traditional newspaper articles.  Long, comma-laden sentences, once useful for providing context in the slow, print-only world, now only distract from the modern, networked reader&#8217;s main goal: extracting useful bits of newsworthy information.</p>
<p>Kinsley writes of the print-journalism trade, but in his description of the modern information-seeker, he identifies a general problem yet unsolved by internet technology.  People and software become better and faster at searching through data&#8211;articles, databases, images&#8211;but accessing that data is still like using a glorified card catalog: entering a search term, finding a set of web pages that might be relevant, and then wading through content to find an answer.  The goal of computer scientists is to turn the internet into an intelligent research librarian, able itself to sift through information to answer questions directly.  The so-called &#8220;<a href="http://www.w3.org/2001/sw/" target="_blank">semantic web</a>&#8221; promises to transform the way we interact with information, especially in fields like the life sciences, where the research community has been side-swiped by a flood of interdisciplinary data coming from new technologies.  However, making the semantic web a reality is going to require a revolution in the way scientists and others share data that is currently proprietary, and, even among government-funded academics, the incentives for opening up such data are still missing.<span id="more-710"></span></p>
<p>Many talk about the &#8220;disruptions&#8221; to media, commerce and communication caused by the online, computerized world.  Compared to the advanced information network represented by the semantic web ideal, however, our current internet seems fairly tame.  Today, new knowledge is still mainly transmitted in media forms that would have been largely familiar in 1990, except in quantity and access speed.  In science, freely available, web-only journals are a recent innovation, but the fundamental medium of the journal article has evolved little since the <a href="http://jcb.rupress.org/cgi/content/abstract/2/4/417?maxtoshow=&amp;HITS=10&amp;hits=10&amp;RESULTFORMAT=1&amp;author1=palade%2C+g&amp;andorexacttitle=and&amp;andorexacttitleabs=and&amp;andorexactfulltext=and&amp;searchid=1&amp;FIRSTINDEX=0&amp;sortspec=date&amp;tdate=1/31/1960&amp;resourcetype=HWCIT" target="_blank">middle of the 20th century</a>, if not before.</p>
<p>Simply put, the <a href="http://semanticweb.org/wiki/Main_Page" target="_blank">semantic</a> web is (or will be) &#8220;semantic&#8221; because information content is tagged in such a way as to organize and connect it with other related information.  The intent is to embed these content descriptions in a standardized way that is both intelligible to humans and also efficiently interpreted by computers.  It is meant to be a broad movement and guiding philosophy for anyone who wants to put information online and make it useful for future &#8220;semantic&#8221; applications (such as intelligent, interpretive search engines of the future).</p>
<p>For science, the semantic web promises to overcome the barriers to communicating the overwhelming abundance of data that is produced by modern technologies, such as ultra-high-throughput DNA sequencing.  Ideally, it will provide a way for scientists to knit together large amounts of data from different projects so that other researchers will be able to query it, asking questions and receiving intelligent answers.  And, because well-designed semantic tags should slice through jargon, the semantic web will be ideal for nonspecialists&#8211;for instance, non-scientists who wish to investigate pharmaceutical interactions or climate research.</p>
<p>Tim Berners-Lee, the internet pioneer and semantic web visionary, <a href="http://www.nature.com/nature/journal/v410/n6832/full/4101023b0.html" target="_blank">wrote</a> in 2001 of the enormous change in the scientific publishing model represented by the semantic web.  An important consideration is that semantically-tagged content is only as useful as the careful knowledge curation that goes into building it:</p>
<blockquote><p>The concept of machine-understandable documents does not imply some magical artificial intelligence allowing machines to comprehend human mumblings. It relies solely on the machine&#8217;s ability to solve well-defined problems by performing well-defined operations on well-defined data.</p></blockquote>
<p>Thus, building this new information network requires experts to input their data into pre-established semantic frameworks, where new information about molecular interactions or genetic traits can be read by database software itself.  At the least, this process would add a new computer-readable component to traditional journal articles.  More likely, it would provide a way to share information that would eventually side-step journal publication completely.  Berners-Lee predicted that the semantic web could be based, at first, on traditional articles but would evolve into a decentralized model that side-steps the current process of journal publication:</p>
<blockquote><p>Papers that include this new mark-up language will be found by new and better search engines, so users will be able to issue significantly more precise queries. More importantly, experimental results can themselves be published on the web, outside the context of a research paper. So a scientist can design and run an experiment, and create an emerging web page containing the information that he or she wants to share with trusted colleagues.</p></blockquote>
<p>As traditional publication loses importance, new questions about data integrity arise: how do you peer-review data incorporated into the semantic web, or otherwise ensure trust?  Just as significantly, it adds layers of complexity on assigning attribution to research data.  Academic scientists judge each other on the quality and quantity of the journal articles they produce.  In the scientific world, journal publication is <em>the</em> standard for scientific success.  If science is to benefit from the futuristic information environments promised by the semantic web, which relies on the prompt and full disclosure of data analysis, it is going to require sufficient incentives for individual investigators to contribute.</p>
<p>New incentives or mandates will be necessary to overcome the increasingly burdensome task of preparing highly complex research findings for machine-readable databases.  Formatting findings to be machine-readable might be feasible in the case of a small-scale study of the interactions or regulation of a single molecule or complex.  But what about experiments using new technologies that are based on tens of thousands (if not millions) of new pieces of information, such as the analysis of a whole transcriptome?  Who is going to curate this information for the semantic web?</p>
<p>There is an even more fundamental issue regarding the full release of raw data, and it must be resolved before data-intensive findings may be incorporated into the semantic web.  Currently, biomedical researchers often amass data from a number of sources&#8211;both small-scale and large-scale&#8211;before summarizing, analyzing and communicating it in the form of the journal article.  In many fields, it is common for a research paper to be based on gigabytes (if not terabytes) of data, but in the journal article this data, after extensive computer processing, is condensed into several graphical figures and text.  Since they don&#8217;t fit into the traditional format of a journal article, the original raw data become <em>de facto</em> proprietary&#8211;in some cases merely because there is no established clearinghouse for it.  Such a situation is becoming increasingly common&#8211;just last week, I was attempting to compare new deep-sequencing data to similar, large datasets summarized in a recent publication.  The recent (government-funded) publication, however, had no link to its original data.  Sure, I could email the lead author to get access to the data, but then they would know what I&#8217;m working on before I get a chance to publish.</p>
<p>Not providing a way to access the primary data behind a publication runs counter to the policies of most funding agencies (e.g. the UK&#8217;s <a href="http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm" target="_blank">Wellcome Trust</a>), but in the situations I&#8217;ve encountered, it was not likely that the authors or journal were acting particularly malevolently.  Even in 2010, it is logistically difficult to post many gigabytes of data on public servers.  Moreover, most scientists are accustomed to submitting journal articles with 3-5 figures containing gels and microscopic images (and perhaps video) as the data supporting their claims, relying on their peers to trust their analysis.  Submitting raw digital data (whether from deep sequencing or high-resolution microscopy) is a new paradigm for accountability that diminishes, if only superficially, the author&#8217;s authority.</p>
<p>One way to ensure that new forms of data are available (and eventually incorporated into semantic-based information structures) may be a centralized government mandate.  The past few years have seen new rules requiring government-funded research to be freely available to the community within a certain time-frame after publication.  The Obama administration is currently <a href="http://edocket.access.gpo.gov/2009/E9-29322.htm" target="_blank">drafting new requirements</a> for the public release of government-funded research data, as part of its general initiative to open access to government data.  However, the current federal effort is still focused on ensuring public access to traditional publications, an effort that began a few years ago at the NIH&#8217;s <a href="http://www.pubmedcentral.nih.gov/about/faq.html" target="_blank">PubMed Central</a>.</p>
<p>As the U.S. Office of Science and Technology Policy puts it in a <a href="http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf">recent invitation to comment</a> on publication access policies: &#8220;Access demands not only availability, but also meaningful usability.&#8221;  Organization of new research knowledge into semantic-web repositories represents the future of &#8220;meaningful usability&#8221;.  Will government be able to mandate that scientists contribute data to the semantic web, when that effort simultaneously disrupts the traditional publication process on which reputations are based?  To get the research community to take full advantage of the semantic web and related knowledge-organizing technologies will require a total rethinking of the incentives for knowledge sharing&#8211;one that replaces the rewards currently conferred when a peer-reviewed article is accepted for publication.</p>
]]></content:encoded>
			<wfw:commentRss>http://yalepatents.org/2010/01/12/data-overload-the-semantic-web-and-future-data-sharing-incentives/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next-generation bioinformatics: an open-source education for biologists</title>
		<link>http://yalepatents.org/2009/12/23/next-generation-bioinformatics-an-open-source-education-for-biologists/</link>
		<comments>http://yalepatents.org/2009/12/23/next-generation-bioinformatics-an-open-source-education-for-biologists/#comments</comments>
		<pubDate>Wed, 23 Dec 2009 14:22:09 +0000</pubDate>
		<dc:creator>Joseph B. Franklin</dc:creator>
				<category><![CDATA[News & Commentary]]></category>
		<category><![CDATA[bioinfomatics]]></category>
		<category><![CDATA[foss]]></category>
		<category><![CDATA[open-source]]></category>

		<guid isPermaLink="false">http://yalepatents.org/?p=583</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Next-generation+bioinformatics%3A+an+open-source+education+for+biologists&amp;rft.aulast=Franklin&amp;rft.aufirst=Joseph&amp;rft.subject=News+%26amp%3B+Commentary&amp;rft.source=yalepatents.org&amp;rft.date=2009-12-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://yalepatents.org/2009/12/23/next-generation-bioinformatics-an-open-source-education-for-biologists/&amp;rft.language=English"></span>
I entered my PhD program in cell biology with some knowledge about molecular biology, a dose of scientific ambition and a shiny new Powerbook G4.  For the first half of my PhD research, the laptop mostly idled on my desk, waiting patiently for me to take breaks from my bench-work to check email, search for [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Next-generation+bioinformatics%3A+an+open-source+education+for+biologists&amp;rft.aulast=Franklin&amp;rft.aufirst=Joseph&amp;rft.subject=News+%26amp%3B+Commentary&amp;rft.source=yalepatents.org&amp;rft.date=2009-12-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://yalepatents.org/2009/12/23/next-generation-bioinformatics-an-open-source-education-for-biologists/&amp;rft.language=English"></span>
<p>I entered my PhD program in cell biology with some knowledge about molecular biology, a dose of scientific ambition and a shiny new Powerbook G4.  For the first half of my PhD research, the laptop mostly idled on my desk, waiting patiently for me to take breaks from my bench-work to check email, search for technical literature or assemble figures for a presentation.</p>
<p>It was after a year or two of such &#8220;wet-lab&#8221; work&#8211;biochemistry and microscopy&#8211;that my colleagues and I encountered pressing biological questions that, we hoped, might be answered using a new technology: &#8220;deep&#8221; DNA sequencing.  This process, much-anticipated in the past few years, allows for the simultaneous, parallel sequencing of millions of short lengths of DNA.  A single run produces gigabytes of pure data; analyzing and displaying that data in useful ways required me to revisit my relationship with my computer.  Beyond being my robotic secretary, research assistant and conduit to YouTube-style procrastination, my computer suddenly became the fulcrum of my research.<span id="more-583"></span></p>
<p>In short, I was introduced to the world of bioinformatics&#8211;and by that I don&#8217;t simply mean the protocols and theory of digital data <a href="http://yalepatents.org/wp-content/uploads/2009/12/photo.jpg"><img class="alignright size-medium wp-image-596" style="margin-top: 25px; margin-bottom: 25px;" title="laptop" src="http://yalepatents.org/wp-content/uploads/2009/12/photo-225x300.jpg" alt="A well-worn notebook computer." width="144" height="192" /></a>analysis applied to biology.  Computational biology is, significantly, a field composed of real computer geeks more aligned, culturally, with the modern silicon-based tech sector than with the traditions of biological science.  As a young biologist, communicating with bioinformaticians finally opened my eyes to technological and economic trends in software and networking.</p>
<p>Leaders in the field have long been <a href="http://www.oreillynet.com/lpt/a/1511" target="_blank">devotees of the open-source model of software distribution</a>, and have developed a software ecosystem based on collaborative communities of developers.  Accessing the collective knowledge of these communities is as easy as joining an email list or online forum.  For the purposes of even advanced data analysis, basic Perl scripting, command-line Linux administration and the willingness to find information online are the only costs of entry.  Some of the most useful and active open-source bioinformatics environments include <a href="http://www.bioperl.org/wiki/Main_Page" target="_blank">BioPerl</a> and <a href="http://bioconductor.org/" target="_blank">Bioconductor</a> (for the <a href="http://www.r-project.org/" target="_blank">R statistical software</a>).  There are many smaller, single-purpose tools (such as the fast alignment programs that have grown out of the sequencing revolution) that function on the same premise: an online community of users contributing advice and improvements based on access to source code.</p>
<p>Academic biomedical scientists, which I use as a catch-all term to encompass researchers in fields like molecular and cell biology, immunology and infectious disease research, do not usually consider themselves technologically conservative.  However, the reality is that many experiments rely on decades-old technology&#8211;gel electrophoresis, microscopy, automated cell-sorting.  After all, time-tested techniques have been carefully honed through generations of scientists, and their results are easily judged during peer-review.</p>
<p>My point is that the inherent conservatism of scientists (no different from their peers elsewhere in academia) predisposes them to ignorance of the newer, technology-enabled forms of collaborative research represented by bioinformatics&#8211;and advanced information technology in general.  Obviously, developing code to analyze data is a much different process than experimenting on living systems in the lab&#8211;not to mention trying to convert biological knowledge to improvements in drugs and diagnostics.  However, as a younger generation of biologists <a href="http://www.nytimes.com/2009/10/12/technology/12data.html" target="_blank">responds to the computational demands of processing enormous piles of data</a>, it will be interesting to see whether the <a href="http://en.wikipedia.org/wiki/Free_and_open_source_software" target="_blank">FOSS</a> ethos of academic bioinformatics pollinates the more traditional world of the wet-lab.</p>
]]></content:encoded>
			<wfw:commentRss>http://yalepatents.org/2009/12/23/next-generation-bioinformatics-an-open-source-education-for-biologists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
