Quantifying the margin of error in high-throughput data interpretation
In the interpretation of high-throughput data, we make statements about the number of biological entities. However, the number of genes in a higher eukaryote, of protein domains or folds, or of protein complexes is the product of many parameter choices. Because these entities have soft boundaries, we have to make assumptions about their nature that cannot be quantified by a confidence interval or other established quantifiers. It would be very useful to label the results with a statement about their validity, expressing the confidence of the researchers, similar to the evidence codes used for annotations in Gene Ontology.
A statement in a publication could read:
The number of protein-coding genes in Alosa fallax is 37.387[SEN]. We predict that these proteins form 235.345 splice forms[I50K], arranging themselves into 920 protein complexes[IA<]. 4 proteins are involved in fin formation[I5] and 3.454 in cell cycle regulation[I5].
Here are the abbreviations:
[I5] Inferred from Perl script. 5 lines of Perl can't be wrong.
[I50K] Inferred from major calculation. 50.000 lines of C++ can't be wrong.
[SEN] Contribution of the senior author.
[REV] Stinking reviewers didn't like our numbers. Have to put them into the acknowledgments.
[IA>] As high as we could get it to meet your expectations
[IA<] As low as we could get it to meet your expectations
[2*] Could be twice as much but who am I
[DUD] Spot on, dude. Seriously!
spitshine - 2005-08-15 08:09
Precisely the point I've been trying to make.
Automated data extraction is great, but it's only as good as the source material, and there's a lot of crap out there. Many statements in the text of a peer-reviewed journal article are only weakly supported by the data shown. Without some way of separating the authoritative statements from the more speculative ones, automated data extraction will be no better than manually slogging through the literature, as measured by the number of highly likely relationships you've learned at the end of the day.