Identifying names of proteins in the biomedical literature
Text mining the biomedical literature has matured from a peculiar idea into a respectable niche in bioinformatics. In a letter to the editor of Bioinformatics, Blaschke, Valencia and others, themselves pioneers in the field, summarize recent advances. They note that the crucial task of relating gene and protein names to normalized identifiers is now producing reliable results, even if they are still somewhat lacking in precision compared to other fields of computational linguistics. I am pleasantly surprised by the high numbers. Identifying the correct sequence of a gene from a paper is sometimes still complicated even if you do it "manually", despite working with fully sequenced bacterial genomes and an extensive set of synonyms. Hopefully, such parsers will make it to the desktop soon.
If you think about this from a high-level perspective, the way we work with biomedical information seems absurd. It would be simpler and more useful if the names of proteins and genes were identified as such in the publication using some simple markup, as is done for references to other publications; one could extend reference managers like EndNote or BibTeX. Clearly, we need mnemonic names for spots on a gel, phenotypes or ORFs in a sequenced genome for readability, but why don't we disambiguate them in a consistent manner now that the majority of important genome sequences are at hand? Other "stable" constituents of a paper (strains, protocols, chemicals, antibodies) could also be referenced in this manner, linking directly to resources such as the Registry of Standard Biological Parts at MIT or to catalogs (if stable identifiers are maintained). After all, methods sections often name the supplier of reagents; would it not be a logical extension to submit all of this information?
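To make the markup idea a little more concrete, here is a minimal sketch in Python of how a citation-style gene tag could be resolved against a table of identifiers. Everything in it is an assumption for illustration: the `\gene{key}` syntax (modelled loosely on `\cite{key}`), the function name `resolve_gene_markup`, and the registry, whose gene names and `EXAMPLE:` identifiers are made-up placeholders rather than real database accessions.

```python
import re

# Hypothetical registry mapping a citation-style key to a display name and a
# stable identifier. Names and identifiers are made-up placeholders.
GENE_REGISTRY = {
    "exaA": {"name": "ExaA", "id": "EXAMPLE:0001"},
    "exaB": {"name": "ExaB", "id": "EXAMPLE:0002"},
}

# Matches the assumed tag syntax, e.g. \gene{exaA}, and captures the key.
GENE_TAG = re.compile(r"\\gene\{([^}]+)\}")


def resolve_gene_markup(text, registry):
    """Replace each gene tag with 'name [identifier]'.

    Returns the rendered text and the identifiers in order of appearance.
    Tags with unknown keys are left untouched so they can be reviewed.
    """
    found = []

    def render(match):
        entry = registry.get(match.group(1))
        if entry is None:
            return match.group(0)  # keep unresolved tags visible
        found.append(entry["id"])
        return f"{entry['name']} [{entry['id']}]"

    return GENE_TAG.sub(render, text), found


sentence = r"Deletion of \gene{exaA} abolished binding to \gene{exaB}."
rendered, identifiers = resolve_gene_markup(sentence, GENE_REGISTRY)
print(rendered)
print(identifiers)
```

The point of the sketch is only that the author disambiguates once, at writing time, so that a reader or a program never has to guess which gene was meant.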
However, even if publishers support these advancements, text mining methods will still be required to identify such information in the tome of fundamental biological knowledge that we have compiled over the last 100 years, now that it is slowly being digitized. This will remain an interesting field for years to come.
spitshine - 2005-11-24 17:27