Identifying names of proteins in the biomedical literature

Text mining the biomedical literature has matured from a peculiar idea into a respectable niche in bioinformatics. In a letter to the editor of Bioinformatics, Blaschke, Valencia and others, themselves pioneers in the field, summarize recent advances. They note that the crucial task of relating gene and protein names to normalized identifiers now produces reliable results, even if these still lag somewhat in precision behind other fields of computational linguistics. I am positively surprised by the high numbers: identifying the correct sequence of a gene from a paper is sometimes complicated even when done "manually", and even when working with fully sequenced bacterial genomes and extensive sets of synonyms. Hopefully, such parsers will make it to the desktop soon.
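The normalization task described in the letter — mapping the many synonyms of a gene or protein to one stable identifier — can be illustrated with a toy dictionary-based tagger. All names and identifiers below are invented for illustration; real text-mining systems handle tokenization, spelling variants and ambiguous names far more carefully:

```python
# Toy dictionary-based gene-name normalizer: finds known synonyms in
# free text and maps each to a single stable identifier. The synonym
# table and the "GENE:..." identifiers are purely illustrative.
SYNONYMS = {
    "p53": "GENE:0001", "TP53": "GENE:0001", "tumor protein 53": "GENE:0001",
    "dnaK": "GENE:0002", "Hsp70": "GENE:0002",
}

def tag_gene_names(text):
    """Return sorted (surface form, identifier) pairs found in the text."""
    hits = []
    for name, ident in SYNONYMS.items():
        if name in text:  # naive substring match; real systems tokenize
            hits.append((name, ident))
    return sorted(hits)

sentence = "Expression of dnaK (also known as Hsp70) parallels that of p53."
print(tag_gene_names(sentence))
# → [('Hsp70', 'GENE:0002'), ('dnaK', 'GENE:0002'), ('p53', 'GENE:0001')]
```

The hard part, which this sketch ignores entirely, is disambiguation: many short gene symbols collide with ordinary English words or with symbols from other organisms, which is exactly why the precision figures in the letter are noteworthy.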

Seen from a high-level perspective, the way we work with biomedical information is absurd. It would be simpler, and useful, if the names of proteins and genes were identified as such in the publication using some simple markup, much like references to other publications; one could extend reference managers such as EndNote or BibTeX for the purpose. Clearly, we need mnemonic names for spots on a gel, phenotypes or ORFs in a sequenced genome for readability, but why don't we disambiguate them in a consistent manner now that the majority of important genome sequences are at hand? Other "stable" constituents of a paper - strains, protocols, chemicals, antibodies - could be referenced in the same manner, linking directly to resources such as the Registry of Standard Biological Parts at MIT or to suppliers' catalogs (provided stable identifiers are maintained). After all, the methods sections already name the suppliers of reagents - would it not be a logical extension to submit all such information?
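Such inline disambiguation could look something like the following sketch — a purely hypothetical markup, analogous to how reference managers tag citations; the element names and identifiers are invented:

```xml
<!-- Hypothetical inline markup for a methods section -->
<p>Cultures of <strain id="DSM:20231">S. aureus COL</strain> were probed
with an antibody against <protein id="UniProt:P99999">cytochrome c</protein>;
the <gene id="EcoCyc:EG10241">dnaK</gene> locus served as a control.</p>
```

The rendered text would stay perfectly readable, while the identifiers would make the paper unambiguous to both databases and text-mining pipelines.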

Even if the publishers support these advancements, text mining methods will still be required to identify such information in the tomes of fundamental biological knowledge compiled over the last 100 years, now that they are slowly being digitized. This will remain an interesting field for years to come.

W3C launches a Semantic Web Health Care and Life Sciences Interest Group

The W3C announced the launch of its Semantic Web Health Care and Life Sciences Interest Group today.

The Semantic Web Health Care and Life Sciences Interest Group is designed to improve collaboration, research and development, and innovation adoption in the health care and life science industries. Aiding decision-making in clinical research, Semantic Web technologies will bridge many forms of biological and medical information across institutions.

What reads like the output of a buzzword generator might substantially improve the current (Babylonian) state of biological databases. Greg over at Nodalpoint recently summarized how RDF, SPARQL et al. could help with the integration of bioinformatics resources, including the practical problems and the support through larger communities (or lack thereof).
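The appeal of the RDF model is easy to sketch: every statement, from any source, reduces to a (subject, predicate, object) triple, and a query is just a pattern over the pooled triples. Here is a miniature plain-Python imitation of that idea (the predicate names are invented; a real deployment would use an RDF library and SPARQL endpoints rather than this toy):

```python
# A miniature triple store: data from different sources pools into
# (subject, predicate, object) statements that one queries by pattern.
triples = {
    ("uniprot:P0A6Y8", "hasName", "DnaK"),
    ("uniprot:P0A6Y8", "foundIn", "taxon:E.coli"),
    ("pdb:1DKG", "describes", "uniprot:P0A6Y8"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard,
    roughly like a variable in a SPARQL basic graph pattern."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# "What do we know about this protein record?" — one pattern, and it
# does not matter which database each triple originally came from.
print(query(s="uniprot:P0A6Y8"))
```

The flexibility the post talks about comes from exactly this: adding a new data source means adding triples, not redesigning a schema.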

Interest in such techniques has grown substantially of late, and I would not be surprised if they finally deliver what bioinformaticians have been waiting for (and trying to achieve with other approaches): integrating the vast amounts of biological data in a flexible way.

One challenge that the W3C cannot address is the availability of the information in a stable and usable form: the EBI can allocate the resources to offer UniProt in XML, but a small lab performing a high-throughput screen usually lacks the necessary skills and will not put its data online in a highly structured form unless failing to do so hinders publication. On the other hand, I don't expect journals to raise their standards for publications soon - after all, the format of the information is a lesser part of the value of a research publication, and the abundance of data is matched only by the abundance of possible standards.

Funding for the BIND data base running out

The home page of the BIND database now carries a grave statement from its PI, Chris Hogue, explaining that the last dollar was spent on November 16th and that all the team can still do is maintain the status quo of the web servers. BIND is a database of protein-protein interactions hosted by the Mount Sinai Hospital in Toronto, Canada.
Many smaller databases face the same problem - the postdoc or grad student on the project leaves, and funding is hardly ever available for the development of a database, let alone for its maintenance. Competition from larger institutions such as the NCBI or the EBI often spoils many years of development effort.

However, BIND is not a small database: it lists more than 100 programmers and curators and received CDN $29 million in public funding in 2003. I have to admit that I never quite understood why one would need such a major effort for a database of protein-protein interactions, and I can imagine why a project of this size is challenged by other scientists.

BIND is not the first major database to run out of funding. The situation was similar for the GDB, which lost funding in 1998; it was transferred from Johns Hopkins University in Baltimore to The Hospital for Sick Children in Toronto and subsequently to RTI. The most important database to run out of funding was probably Swiss-Prot in 1999(?), which managed both to commercialize its data (thanks to the New Economy biotechs floating at that time) and to attract support from the EBI.
Chris Hogue has selected a few editorials and papers covering the situation in the media and promises to continue reporting on the situation in his blog.

[Via public rambling]

Science 2.0

Most blogs of the tiny bioinformatics blogosphere recently featured a link to the Science 2.0/Brainstorming page of the OpenWetWare wiki, and I mentioned OWW earlier myself. My excuse for the redundancy: the site is a great collection of ideas and frontiers we face in publishing outside the classical peer-reviewed scheme, and one can imagine the participating labs realizing some of them.

Google and the genes

Monday's Washington Post presents a chapter entitled "Googling Your Gene" from a forthcoming book called "The Google Story" by David Vise and Mark Malseed [Amazon].
It is about the most brainless, badly researched fad I have read recently, featuring Craig Venter (presented as the guy who "sequenced the human genome" and "made it available to the public"), researchers performing simulations in cyberspace, and scholars relying solely on Google as if specialized search engines did not exist.
A company like Google could probably contribute massively to biological research, and the compute time in Google Compute is already going into protein folding. There is great potential for Google in the field of genetic information, but the issue is a sensitive one, particularly for personalized medicine. If I were working in Google's PR department, I would make sure the book never appears in print if the other chapters are of the same quality.

Modesty in science

The observation that the selector sequences are complementary to the docking site occurred through a combination of staring at the sequences for months and sheer luck.

Nice final words for the methods section of a pure-bioinformatics, single-author Cell paper on DSCAM, the enigmatic gene that codes for 38,016 potential proteins via alternative splicing.
[Pubmed, Cell]

Scientific publishing as seen from the editors chair

To many younger scientists, the review process appears to be a heavily biased, unfair game played by a set of clandestine rules (established scientists feel the same way but prefer to play Dr. Seen-it-all). In the current issue of EMBO Journal, Pernille Rørth, its current executive editor, presents a detailed, personal view of her work and of challenges such as the choice between full-time and part-time editors. Recommended, even if EMBO Journal is not your target journal for submissions.

A need for non-peer reviewed scientific communication (interrupted)

Sigh. I was drafting a post enthusiastically sketching the need for non-peer-reviewed scientific communication using web technologies when a friend of mine notified me that work carried out in our department was featured on Slashdot. I was not involved in the project now published in PLoS Pathogens other than donating blood once (my poor neutrophils ...), but I got curious and paused: I went to find out how a technologically and scientifically open and informed group of people would take on the publication. Since Bacillus anthracis was studied, I was already expecting little serious consideration of the actual work. I guess most readers missed the point that we study host-pathogen interaction and that the discoveries were made more on the host side than on the bug side. I was still very disappointed with the responses and scrapped my sketch.
I don't want to start bashing Slashdot - it's an interesting place with a fairly informed audience. But I wonder how a "serious" open scientific discussion would work, and I am now less surprised about the strict rules that, for instance, PLoS imposes on scientific comments.
What do we need to do to ensure high-quality discussion? Moderation alone won't work in the very diverse fields of the modern sciences: you would need editors highly skilled in each field, and would thus practically reintroduce peer review.
Also, the short-lived comment wave does not stimulate a real discussion. Science blogs and related community pages, in particular those hosting longer discussions, must develop longer threads and longer thinking to deliver something worth reading, and live with small numbers of informed readers. I wonder whether the scarcity of life-science blogs about science (as opposed to life in the lab or creationism) is caused in part by the need for slower but more thorough formats that blogs and newsgroups currently do not serve.


