<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Fuzzier Logic &#187; statistics</title>
	<atom:link href="http://blog.fuzzierlogic.com/archives/tag/statistics/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.fuzzierlogic.com</link>
	<description>Logic. Just a bit woolier.</description>
	<lastBuildDate>Tue, 22 Nov 2011 09:21:38 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Graphing protein databases</title>
		<link>http://blog.fuzzierlogic.com/archives/425</link>
		<comments>http://blog.fuzzierlogic.com/archives/425#comments</comments>
		<pubDate>Thu, 11 Nov 2010 15:24:57 +0000</pubDate>
		<dc:creator>Simon</dc:creator>
				<category><![CDATA[science]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[Visualisation]]></category>
		<category><![CDATA[graphs]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://blog.fuzzierlogic.com/?p=425</guid>
		<description><![CDATA[<p>I&#8217;m giving a lecture next week to the Bioinformatics Masters students here about protein structure prediction. As part of the introduction to this topic, I have a traditional &#8216;data explosion&#8217; slide, to illustrate the gap between the quantity of protein sequence data available versus the number of solved protein structures in the PDB (hence the [...]]]></description>
			<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="425">
<p>I&#8217;m giving a lecture next week to the Bioinformatics Masters students here about protein structure prediction. As part of the introduction to this topic, I have a traditional &#8216;data explosion&#8217; slide, to illustrate the gap between the quantity of protein sequence data available versus the number of solved protein structures in the PDB (hence the need for bioinformatics to help fill the gap, by good prediction algorithms). When I last gave this talk (scarily, 4 years ago), this slide was just text, a description of the present size of UniProt &amp; the PDB.</p>
<p>Since 2006 my lecturing style has progressed somewhat: I don&#8217;t like to have slides with just words on them anymore, so I wanted to replace this slide rather than just update the numbers. Graphs of the growing sizes of the databases are easy to find online, but to my mind the real story here is the gap in size between the 2 databases (UniProt &amp; PDB), and whether it is growing (or whether protein structure determination methods are catching up). This graph doesn&#8217;t (to my knowledge) exist, so, inspired by <a title="BioStar" href="http://biostar.stackexchange.com/questions/3029/locations-of-plots-of-quantities-of-publicly-available-biological-data" target="_blank">this question on BioStar</a>, I set out to draw it.</p>
<p>The first task is to retrieve the size of each database at particular dates. For the PDB this is simple, because they distribute a CSV file of this information. You can get it too: it&#8217;s <a title="PDB Stats" href="http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total" target="_blank">linked to here</a>. For UniProt, it was non-obvious where to find this information. Every time there&#8217;s a new release, the webpage documenting that release gives the size of UniProt at the point of release (and its components, SwissProt and TrEMBL), but it is hard to find these pages for any release that is not current. So my approach was to download the history of UniProt from their FTP server, and use BioPython to calculate the size of each release:</p>
<pre class="brush: python; title: ; notranslate">
import os
from Bio import SwissProt

def main():
    # Each subdirectory of data/ holds one UniProt release
    for release in sorted(os.listdir(&quot;data&quot;)):
        print(numbers(release))

def count_entries(path):
    # Count records as they stream past, rather than holding a whole release in memory
    with open(path) as handle:
        return sum(1 for record in SwissProt.parse(handle))

def numbers(release):
    directory = os.path.join(&quot;data&quot;, release)
    with open(os.path.join(directory, &quot;reldate.txt&quot;)) as h:
        lines = h.readlines()
    date = lines[1].rstrip() #more processing required to return just date
    sprot_size = count_entries(os.path.join(directory, &quot;uniprot_sprot.dat&quot;))
    trembl_size = count_entries(os.path.join(directory, &quot;uniprot_trembl.dat&quot;)) #and the same for trembl
    return (date, sprot_size, trembl_size)

if __name__ == &quot;__main__&quot;:
    main()
</pre>
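<p>The date line returned by numbers() still needs trimming, and the plotting step wants one DATE,DATABASE,SIZE row per database. Here&#8217;s a minimal sketch of that tidying up. It assumes the line ends in a date written like 02-Nov-2010; the helper names release_date and to_rows, and the example line itself, are made up for illustration, so check them against a real reldate.txt first:</p>
<pre class="brush: python; title: ; notranslate">
from datetime import datetime

def release_date(line):
    # Assumption: the last word on the line is a date such as 02-Nov-2010
    # (%b relies on English month abbreviations in the current locale)
    raw = line.rstrip().split()[-1]
    return datetime.strptime(raw, '%d-%b-%Y').date().isoformat()

def to_rows(result):
    # result is the (date line, sprot_size, trembl_size) tuple from numbers()
    date_line, sprot_size, trembl_size = result
    date = release_date(date_line)
    return [
        '%s,SwissProt,%d' % (date, sprot_size),
        '%s,TrEMBL,%d' % (date, trembl_size),
    ]

# The date line is an invented example of the assumed format;
# the two sizes are the SwissProt and TrEMBL figures quoted in this post
example = ('UniProtKB/Swiss-Prot Release 2010_11 of 02-Nov-2010', 522019, 12347303)
for row in to_rows(example):
    print(row)
</pre>
<p>Writing the date in ISO form (year-month-day) means R&#8217;s as.Date can read it without being told a format.</p>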
<p>It was only once I was coming to the end of this process (slow, because we&#8217;re dealing with 16 releases of UniProt: 150GB of data) that I found <a href="http://www.expasy.org/sprot/relnotes/" target="_blank">this page</a>, which was fairly hidden away, but gives the sizes of SwissProt over the last 25 years. Curses! So much effort seemingly gone to waste. However, there doesn&#8217;t appear to be a corresponding page for TrEMBL, which is much larger (being a conceptual translation of EMBL), and I wanted these numbers too, to illustrate the full scope of the problem. So my effort was not in vain.</p>
<p>Now that we have all the numbers in an appropriate format (DATE,DATABASE,SIZE), we can draw some graphs. For this I use the ggplot2 library and R, which seems to be de rigueur for pretty visualisations these days. Here&#8217;s some code:</p>
<pre class="brush: r; title: ; notranslate">
library(ggplot2)
# The scale arguments follow current ggplot2 (date_labels=, labels=);
# older releases spelled these format= and formatter=
pdb &lt;- read.table(&quot;/path/to/data/pdb.txt&quot;, sep=&quot;,&quot;)
colnames(pdb) &lt;- c(&quot;Year&quot;, &quot;Database&quot;, &quot;value&quot;)
pdb$Year &lt;- as.Date(pdb$Year)
png(&quot;/path/to/graphs/uniprot_graphs/pdb.png&quot;, bg=&quot;transparent&quot;, width=800, height=600)
qplot(Year, value, data=pdb, geom=&quot;line&quot;, color=I(&quot;red&quot;)) + scale_x_date(date_labels=&quot;%Y&quot;) + scale_y_continuous(&quot;Entries&quot;, labels=scales::comma)
dev.off()

spdb &lt;- read.table(&quot;/path/to/data/sp_pdb.txt&quot;, sep=&quot;,&quot;)
colnames(spdb) &lt;- c(&quot;Year&quot;, &quot;Database&quot;, &quot;value&quot;)
spdb$Year &lt;- as.Date(spdb$Year)
png(&quot;/path/to/graphs/sp_pdb.png&quot;, bg=&quot;transparent&quot;, width=800, height=600)
qplot(Year, value, data=spdb, geom=&quot;line&quot;, group=Database, color=Database) + scale_x_date(date_labels=&quot;%Y&quot;) + scale_y_continuous(&quot;Entries&quot;, labels=scales::comma)
dev.off()

all &lt;- read.table(&quot;/path/to/data/all.txt&quot;, sep=&quot;,&quot;)
colnames(all) &lt;- c(&quot;Year&quot;, &quot;Database&quot;, &quot;value&quot;)
all$Year &lt;- as.Date(all$Year)
png(&quot;/path/to/graphs/all.png&quot;, bg=&quot;transparent&quot;, width=800, height=600)
qplot(Year, value, data=all, geom=&quot;line&quot;, group=Database, color=Database) + scale_x_date(date_labels=&quot;%Y&quot;) + scale_y_log10(&quot;Entries&quot;, breaks=c(10^4,10^5,10^6,10^7))
dev.off()
</pre>
<p>This very simple R produces 3 plots, all of which are informative in different ways.</p>
<p style="text-align: center;"><a href="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/pdb1.png"><img class="aligncenter size-full wp-image-439" title="PDB" src="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/pdb1.png" alt="PDB" width="560" height="420" /></a></p>
<p style="text-align: center;">
<p>Plot 1 is a simple restatement of the <a href="http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total" target="_blank">PDB graph</a>, which I produced just so all my graphs would look the same. It&#8217;s a pretty standard exponential curve (though admittedly the numbers are slightly smaller than the numbers you may be used to seeing on such plots).</p>
<p style="text-align: center;"><a href="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/sp_pdb1.png"><img class="aligncenter size-full wp-image-440" title="SwissProt vs PDB" src="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/sp_pdb1.png" alt="SwissProt vs PDB" width="560" height="420" /></a></p>
<p>Plot 2 compares the size of SwissProt with the size of the PDB. I&#8217;m extremely happy with this one, as it shows precisely what I wanted it to: SwissProt being much larger than the PDB, and marching away at an increasing rate. For the record, the most recent sizes of the PDB and SwissProt in the graph are 68,998 and 522,019 respectively (compared with when I last gave the protein structure lecture: 40,132 &amp; 241,365).</p>
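<p>To put numbers on &#8216;marching away&#8217;, here is a quick sketch of the arithmetic using only the four sizes quoted above. The labels and the gap function are just names chosen for this example, and it assumes Python 3 division:</p>
<pre class="brush: python; title: ; notranslate">
# Sizes quoted in the post, as (PDB, SwissProt)
sizes = {
    'last lecture': (40132, 241365),
    'latest': (68998, 522019),
}

def gap(pdb, sprot):
    # Absolute gap, and how many SwissProt entries exist per PDB entry
    return sprot - pdb, sprot / pdb

for label, (pdb, sprot) in sizes.items():
    difference, ratio = gap(pdb, sprot)
    print('%s: gap of %d entries, %.1f sequences per structure' % (label, difference, ratio))
</pre>
<p>That gives a gap of 201,233 entries (about 6.0 sequences per structure) when I last gave the lecture, against 453,021 entries (about 7.6 per structure) for the latest figures, so the gap has grown in both absolute and relative terms.</p>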
<p style="text-align: center;"><a href="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/all1.png"><img class="aligncenter size-full wp-image-441" title="TrEMBL vs SwissProt vs PDB" src="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/all1.png" alt="TrEMBL vs SwissProt vs PDB" width="560" height="420" /></a></p>
<p>The final plot is just to scare people. It includes TrEMBL, and had to be plotted on a log10 scale, because TrEMBL is another order of magnitude larger than SwissProt (12,347,303 sequences).</p>
<p><strong>Addendum</strong> &#8211; further to all this, the problem of the gap between sequence and structure is actually more stark than presented here. Although the PDB today (11/11/10) contains 69,162 structures, they are highly redundant, and there are only 39,724 unique sequences of known structure.</p>
<!-- kcite active, but no citations found -->
</div> <!-- kcite-section 425 -->]]></content:encoded>
			<wfw:commentRss>http://blog.fuzzierlogic.com/archives/425/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/pdb1-150x150.png" />
		<media:content url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/pdb1.png" medium="image">
			<media:title type="html">PDB</media:title>
			<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/pdb1-150x150.png" />
		</media:content>
		<media:content url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/sp_pdb1.png" medium="image">
			<media:title type="html">SwissProt vs PDB</media:title>
			<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/sp_pdb1-150x150.png" />
		</media:content>
		<media:content url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/all1.png" medium="image">
			<media:title type="html">TrEMBL vs SwissProt vs PDB</media:title>
			<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2010/11/all1-150x150.png" />
		</media:content>
	</item>
		<item>
		<title>&#8220;Peer review does not guarantee quality&#8221;</title>
		<link>http://blog.fuzzierlogic.com/archives/278</link>
		<comments>http://blog.fuzzierlogic.com/archives/278#comments</comments>
		<pubDate>Fri, 11 Sep 2009 11:14:13 +0000</pubDate>
		<dc:creator>Simon</dc:creator>
				<category><![CDATA[Radio]]></category>
		<category><![CDATA[podcast]]></category>
		<category><![CDATA[publishing]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://blog.fuzzierlogic.com/?p=278</guid>
		<description><![CDATA[<p>I am still catching up on my podcast backlog after my 2 week holiday in August. The excellent <a title="BBC Radio 4 - 'More or Less'" href="http://news.bbc.co.uk/1/hi/programmes/more_or_less/default.stm" target="_blank">&#8216;More or Less&#8217;</a> provided the gem of a quote in the title during a discussion about meta-analyses.</p> <p><a title="Prof Senn Homepage" href="http://www.senns.demon.co.uk/home.html" target="_blank">Professor Stephen Senn</a> was explaining why [...]]]></description>
			<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="278">
<p>I am still catching up on my podcast backlog after my 2 week holiday in August. The excellent <a title="BBC Radio 4 - 'More or Less'" href="http://news.bbc.co.uk/1/hi/programmes/more_or_less/default.stm" target="_blank">&#8216;More or Less&#8217;</a> provided the gem of a quote in the title during a discussion about meta-analyses.</p>
<p><a title="Prof Senn Homepage" href="http://www.senns.demon.co.uk/home.html" target="_blank">Professor Stephen Senn</a> was explaining why careless mathematics can distort the results of a meta-analysis (things like including a prior meta-analysis amongst your data sets can lead to double-counting &#8211; see <a title="Overstating the evidence – double counting in meta-analysis and related problems" href="http://www.biomedcentral.com/1471-2288/9/10" target="_blank">this paper</a>). The presenter, <a title="Tim Harford" href="http://www.bbc.co.uk/radio4/people/presenters/tim-harford/" target="_blank">Tim Harford</a>, suggested that surely this is a problem easily fixed. A reader spots an error in a published meta-analysis, contacts the journal and a correction ensues. A suggestion that was quickly knocked back by Prof Senn. The problem, as he sees it, is that we have no culture of correction; that peer reviewed results are considered irreproachable.</p>
<p>Doesn&#8217;t peer review offer some guarantee of quality?, suggests Harford. &#8220;Peer review is of minimal value&#8221; is the response to this, &#8220;&#8230;checkability is what really guarantees quality&#8221;. Senn goes on to suggest that scientists sign an undertaking to provide raw original data to anyone who requests it.</p>
<p>This was the clearest argument I&#8217;ve heard, not against peer review, but for the availability of raw data, and for post-publication quality control on a grand scale.</p>
<p>This multi-eyes approach to quality checking, post-publication, is <a title="PLoS One - About" href="http://www.plosone.org/static/information.action" target="_blank">familiar from somewhere</a>&#8230;</p>
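<p>As an aside, the double-counting problem is easy to see with a toy example. This is only a sketch: it uses simple fixed-effect, inverse-variance pooling (one common way to combine studies, and not necessarily the method discussed on the programme), and the two trials and their numbers are invented purely for illustration:</p>
<pre class="brush: python; title: ; notranslate">
import math

def pool(studies):
    # Fixed-effect inverse-variance pooling of (estimate, standard error) pairs
    weights = [1.0 / se ** 2 for _, se in studies]
    estimate = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))

# Invented numbers, purely for illustration
trial_a = (0.30, 0.10)
trial_b = (0.10, 0.10)

correct = pool([trial_a, trial_b])                  # each trial counted once
double_counted = pool([trial_a, trial_b, correct])  # pooled result fed back in as if it were a new study

print('correct:        estimate %.2f, standard error %.3f' % correct)
print('double counted: estimate %.2f, standard error %.3f' % double_counted)
</pre>
<p>Both runs give the same estimate (0.20), but the double-counted version reports a standard error of 0.050 rather than 0.071, so the result looks more certain than the data justify.</p>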
<div id="attachment_279" class="wp-caption alignright" style="width: 310px"><a href="http://blog.fuzzierlogic.com/wp-content/uploads/2009/09/Minard.png"><img class="size-medium wp-image-279" title="Napoleon's March" src="http://blog.fuzzierlogic.com/wp-content/uploads/2009/09/Minard-300x143.png" alt="Charles Minard's 1869 chart showing the losses in men, their movements, and the temperature of Napoleon's 1812 Russian campaign." width="300" height="143" /></a><p class="wp-caption-text">Charles Minard&#39;s 1869 chart showing the losses in men, their movements, and the temperature of Napoleon&#39;s 1812 Russian campaign.</p></div>
<p>The same edition of the show had a section on data visualisation, and brought the &#8216;Napoleon&#8217;s March&#8217; graphic to my attention. I had not previously been aware of this &#8216;infographic&#8217;, produced in the mid-19th century.</p>
<!-- kcite active, but no citations found -->
</div> <!-- kcite-section 278 -->]]></content:encoded>
			<wfw:commentRss>http://blog.fuzzierlogic.com/archives/278/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2009/09/Minard-150x150.png" />
		<media:content url="http://blog.fuzzierlogic.com/wp-content/uploads/2009/09/Minard.png" medium="image">
			<media:title type="html">Napoleon&#8217;s March</media:title>
			<media:description type="html">Charles Minard's 1869 chart showing the losses in men, their movements, and the temperature of Napoleon's 1812 Russian campaign.</media:description>
			<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2009/09/Minard-150x150.png" />
		</media:content>
	</item>
		<item>
		<title>Randomness, statistics and understanding</title>
		<link>http://blog.fuzzierlogic.com/archives/254</link>
		<comments>http://blog.fuzzierlogic.com/archives/254#comments</comments>
		<pubDate>Wed, 24 Jun 2009 14:25:35 +0000</pubDate>
		<dc:creator>Simon</dc:creator>
				<category><![CDATA[Review]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[book]]></category>
		<category><![CDATA[randomness]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://blog.fuzzierlogic.com/?p=254</guid>
		<description><![CDATA[<p>So here I am, sitting in a statistics workshop, having finished all the exercises ahead of time, musing on how much easier all this stuff is once you understand where it all comes from. This made me think that I have found this workshop more understandable and simpler to tackle because I have pretty much [...]]]></description>
			<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="254">
<p>So here I am, sitting in a statistics workshop, having finished all the exercises ahead of time, musing on how much easier all this stuff is once you understand where it all comes from. This made me think that I have found this workshop more understandable and simpler to tackle because I have pretty much finished reading a marvellous little book called <a title="Amazon.co.uk - The Drunkard's Walk" href="http://www.amazon.co.uk/Drunkards-Walk-Randomness-Rules-Lives/dp/0141026472/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1245852119&amp;sr=8-1" target="_blank">&#8216;The Drunkard&#8217;s Walk&#8217;</a> by Leonard Mlodinow.</p>
<div id="attachment_256" class="wp-caption alignleft" style="width: 310px"><a href="http://www.flickr.com/photos/sixmilliondollardan/3193613357/"><img class="size-medium wp-image-256" title="drunk_walk" src="http://blog.fuzzierlogic.com/wp-content/uploads/2009/06/drunk_walk-300x199.jpg" alt="Photo from http://www.flickr.com/photos/sixmilliondollardan/3193613357/" width="300" height="199" /></a><p class="wp-caption-text">Photo from http://www.flickr.com/photos/sixmilliondollardan/3193613357/</p></div>
<p>Mlodinow aims to educate the reader about randomness and statistics, by way of history and illustrative example, and he succeeds admirably. The book is a walk through mathematics from the Greeks and Romans, by way of the Renaissance, to Einstein and the modern day. Each important advance toward the modern-day study of statistics is illustrated with excellent examples and anecdotes, many of them personal to the author. There is the <a title="Wikipedia - The Monty Hall Problem" href="http://en.wikipedia.org/wiki/Monty_Hall_problem" target="_blank">Monty Hall problem</a>, and the anomaly of <a title="Wikipedia - Jeanne Calment" href="http://en.wikipedia.org/wiki/Jeanne_Calment">Jeanne Calment</a>, who reverse-mortgaged her apartment to a 47-year-old lawyer when she was 90, only to outlive him (and he died aged 77). Even the author&#8217;s own (false) positive AIDS test makes for an intriguing case study, and illustrates the importance of understanding prior probabilities when reporting the results of a test.</p>
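<p>The point about prior probabilities is easier to feel with numbers in front of you. Here is a small sketch of Bayes&#8217; theorem applied to a screening test. The prevalence, sensitivity and false positive rate below are illustrative assumptions of mine, not figures taken from this post, and the function name posterior is likewise just chosen for the example:</p>
<pre class="brush: python; title: ; notranslate">
from fractions import Fraction

def posterior(prior, sensitivity, false_positive_rate):
    # Bayes' theorem: probability of having the condition given a positive test
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Illustrative assumptions:
# 1 in 10,000 people tested has the condition, the test never misses a real case,
# and it wrongly flags 1 in 1,000 people who do not have the condition
p = posterior(Fraction(1, 10000), Fraction(1), Fraction(1, 1000))
print(p, float(p))
</pre>
<p>With these assumed numbers a positive result means a chance of 1000 in 10999, or roughly 1 in 11, of actually having the condition, even though the test is wrong for only 1 in 1,000 healthy people. The rarity of the condition (the prior) dominates.</p>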
<p>The setting of all this stuff in context has really helped my brain with the basic concepts, and even without this current course, I feel like I&#8217;ve got a much better grip on statistics in general. A remarkable claim for a popular science book. I look forward to the remaining 30 or so pages.</p>
<!-- kcite active, but no citations found -->
</div> <!-- kcite-section 254 -->]]></content:encoded>
			<wfw:commentRss>http://blog.fuzzierlogic.com/archives/254/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2009/06/drunk_walk-150x150.jpg" />
		<media:content url="http://blog.fuzzierlogic.com/wp-content/uploads/2009/06/drunk_walk.jpg" medium="image">
			<media:title type="html">drunk_walk</media:title>
			<media:description type="html">Photo from http://www.flickr.com/photos/sixmilliondollardan/3193613357/</media:description>
			<media:thumbnail url="http://blog.fuzzierlogic.com/wp-content/uploads/2009/06/drunk_walk-150x150.jpg" />
		</media:content>
	</item>
	</channel>
</rss>

