Abstract
To access and utilize the rich information contained in biomedical literature, the ability to recognize and normalize gene mentions referenced in the literature is crucial. In this paper, we focus on improvements to the accuracy of gene normalization in cases where species information is not provided. Gene names are often ambiguous, in that they can refer to the genes of many species. Therefore, gene normalization is a difficult challenge.
We define "gene normalization" (GN) as a series of tasks involving three issues: gene name recognition, species assignation and species-specific gene normalization. Each issue can affect overall performance, though the most important is species assignation, because correctly identifying the species decreases the ambiguity of orthologous genes. We propose an integrated method that handles the three issues with three modules: the gene name recognition (GNR) module, the species assignation (SA) module and the species-specific gene normalization (SGN) module.
Figure 1. Architecture of the gene normalization method.
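To make the division of labour between the three modules concrete, the following is a toy sketch only; it is not the authors' implementation. The lexicon, the species cue words, the voting heuristic and all function names are assumptions introduced for illustration, and a real system would use full gene and species thesauri and contextual evidence.

```perl
use strict;
use warnings;

# Hypothetical toy lexicon: lower-cased gene name -> { NCBI taxonomy ID -> Entrez Gene ID }.
# The IDs are shown for illustration only.
my %GENE_LEXICON = (
    'estrogen receptor' => { 9606 => 2099, 10090 => 13982 },
    'esr1'              => { 9606 => 2099, 10090 => 13982 },
);

# Hypothetical species cue words -> NCBI taxonomy ID (9606 human, 10090 mouse).
my %SPECIES_CUES = (human => 9606, patients => 9606, mouse => 10090, mice => 10090);

# GNR module (toy): report every lexicon name that occurs in the text.
sub recognize_gene_names {
    my ($text) = @_;
    my $lower = lc $text;
    return grep { index($lower, $_) >= 0 } sort keys %GENE_LEXICON;
}

# SA module (toy): pick the species with the most cue words in the text.
sub assign_species {
    my ($text) = @_;
    my %votes;
    for my $word (lc($text) =~ /[a-z]+/g) {
        $votes{ $SPECIES_CUES{$word} }++ if exists $SPECIES_CUES{$word};
    }
    my ($best) = sort { $votes{$b} <=> $votes{$a} || $a <=> $b } keys %votes;
    return $best;    # undef when no cue word is present
}

# SGN module (toy): map each recognized name to the ID of the assigned species.
sub normalize_genes {
    my ($text) = @_;
    my $species = assign_species($text);
    return () unless defined $species;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { exists $GENE_LEXICON{$_}{$species} ? $GENE_LEXICON{$_}{$species} : () }
           recognize_gene_names($text);
}

print join(',', normalize_genes('Esr1 expression was reduced in knockout mice.')), "\n";
```

The sketch shows why species assignation matters most: the same gene name resolves to different identifiers depending on the species chosen in the second step.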
We participated in the GN task of the BioCreative III (BC3) competition with an integrated method, based on our previous work, that handles intra-species gene ambiguity. The results show that our method worked well, ranking at the top level of performance among all teams. The method makes full use of the gene and species information available in context and of a gene/species thesaurus.
In experiments, the proposed model attained the top threshold average precision (TAP-k) scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) when tested against 50 articles that had been selected for their difficulty and the most divergent results from pooled team submissions. The second highest scores were obtained with the full test set of articles (TAP-k score of 0.4591 for k=5, 10).
Table 1. Performance statistics on the BC3 test and training data

| Corpus | Data set | TAP-5 | TAP-10 | TAP-20 | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|
| test data (1st run) | 50 (gold standard) | 0.3254 | 0.3538 | 0.3535 | 53.85% | 39.44% | 45.53% |
| test data (2nd run) | 50 (gold standard) | 0.3216 | 0.3435 | 0.3435 | 55.54% | 39.07% | 45.87% |
| test data (3rd run) | 50 (gold standard) | 0.3297 | 0.3514 | 0.3514 | 56.23% | 39.72% | 46.56% |
| test data (1st run) | 50 (silver standard) | 0.3567 | 0.3600 | 0.3600 | 58.94% | 38.95% | 46.90% |
| test data (2nd run) | 50 (silver standard) | 0.3291 | 0.3291 | 0.3291 | 58.60% | 37.64% | 45.84% |
| test data (3rd run) | 50 (silver standard) | 0.3382 | 0.3382 | 0.3382 | 59.46% | 38.35% | 46.62% |
| test data (1st run) | 507 (silver standard) | 0.4591 | 0.4591 | 0.4591 | 71.79% | 44.69% | 55.09% |
| test data (2nd run) | 507 (silver standard) | 0.4323 | 0.4323 | 0.4323 | 72.08% | 42.70% | 53.64% |
| test data (3rd run) | 507 (silver standard) | 0.4327 | 0.4327 | 0.4327 | 72.41% | 42.82% | 53.82% |
| Training data | 32 (gold standard) | 0.4703 | 0.4969 | 0.4969 | 63.82% | 67.71% | 65.70% |
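
As a quick consistency check on Table 1, the F-measure column can be recomputed from the precision and recall columns. The sketch below assumes the F-measure is the balanced harmonic mean of precision and recall (F1), which is an assumption the reported numbers are consistent with; the function name `f_measure` is illustrative.

```perl
use strict;
use warnings;

# Balanced F-measure: harmonic mean of precision and recall (both as fractions).
sub f_measure {
    my ($precision, $recall) = @_;
    return 0 if $precision + $recall == 0;    # avoid division by zero
    return 2 * $precision * $recall / ($precision + $recall);
}

# First-run, gold-standard row of Table 1: precision 53.85%, recall 39.44%.
printf "%.2f%%\n", 100 * f_measure(0.5385, 0.3944);
```

For that row the script prints 45.53%, matching the value reported in the table.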
Table 2. Performance on the gene normalization task of the top 4 performing teams in the BioCreative III competition. Gold (50) = gold standard, 50 selected articles; Silver (507) = silver standard, all 507 articles; Silver (50) = silver standard, 50 selected articles.

| Team | Run | Gold (50) TAP-5 | Gold (50) TAP-10 | Gold (50) TAP-20 | Silver (507) TAP-5 | Silver (507) TAP-10 | Silver (507) TAP-20 | Silver (50) TAP-5 | Silver (50) TAP-10 | Silver (50) TAP-20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Kuo et al. (Team 74) | 1st run | 0.2137 | 0.2509 | 0.2509 | 0.3820 | 0.3820 | 0.3820 | 0.4873 | 0.4873 | 0.4873 |
| Kuo et al. (Team 74) | 2nd run | 0.2083 | 0.2480 | 0.2480 | 0.3855 | 0.3855 | 0.3855 | 0.4871 | 0.4871 | 0.4871 |
| Kuo et al. (Team 74) | 3rd run | 0.2099 | 0.2495 | 0.2495 | 0.3890 | 0.3890 | 0.3890 | 0.4916 | 0.4916 | 0.4916 |
| Our method (Team 83) | 1st run | 0.3254 | 0.3538 | 0.3535 | 0.3567 | 0.3600 | 0.3600 | 0.4591 | 0.4591 | 0.4591 |
| Our method (Team 83) | 2nd run | 0.3216 | 0.3435 | 0.3435 | 0.3291 | 0.3291 | 0.3291 | 0.4323 | 0.4323 | 0.4323 |
| Our method (Team 83) | 3rd run | 0.3297 | 0.3514 | 0.3514 | 0.3382 | 0.3382 | 0.3382 | 0.4327 | 0.4327 | 0.4327 |
| Liu et al. (Team 98) | 1st run | 0.2835 | 0.3012 | 0.3103 | 0.3343 | 0.3535 | 0.3629 | 0.3818 | 0.3899 | 0.3875 |
| Liu et al. (Team 98) | 2nd run | 0.2909 | 0.3079 | 0.3087 | 0.3354 | 0.3543 | 0.3634 | 0.3790 | 0.3878 | 0.3868 |
| Liu et al. (Team 98) | 3rd run | 0.3013 | 0.3183 | 0.3303 | 0.3710 | 0.4116 | 0.4672 | 0.4086 | 0.4511 | 0.4648 |
| Lai et al. (Team 101) | 1st run | 0.1896 | 0.2288 | 0.2385 | 0.3590 | 0.3859 | 0.3859 | 0.4289 | 0.4289 | 0.4289 |
| Lai et al. (Team 101) | 2nd run | 0.1672 | 0.2150 | 0.2418 | 0.3239 | 0.3945 | 0.4132 | 0.4294 | 0.4408 | 0.4408 |
| Lai et al. (Team 101) | 3rd run | 0.1812 | 0.2141 | 0.2425 | 0.3258 | 0.4109 | 0.4109 | 0.4536 | 0.4536 | 0.4536 |
Web Service
We provide a web service version of cross-species gene normalization that lets users detect Entrez Gene identifiers in PubMed Central articles. It is an XML-RPC service, intended primarily for Perl programs.
Pre-Processing
First, install Perl with the RPC::XML and Data::Dumper modules using PPM.
Second, prepare the input, which can be an article in XML format or an article ID. Collect the XML-format article or its PMID/PMCID from the NCBI PubMed Central search engine or the PubMed search engine.
Third, convert the input file encoding to UTF-8.
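The third step can be done in Perl itself with the core Encode module. The following is a minimal sketch, assuming the source encoding of the file is known (ISO-8859-1 is used purely as an example); the function names and file names are placeholders, not part of the service.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Re-encode a byte string from a known source encoding into UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes, $source_encoding) = @_;
    my $text = decode($source_encoding, $bytes);    # bytes -> Perl characters
    return encode('UTF-8', $text);                  # characters -> UTF-8 bytes
}

# Convert a whole file; both paths are supplied by the caller.
sub convert_file_to_utf8 {
    my ($in_path, $out_path, $source_encoding) = @_;
    open(my $in, '<:raw', $in_path) or die "Cannot read $in_path: $!";
    local $/;                                       # slurp mode: read the file at once
    my $bytes = <$in>;
    close($in);
    open(my $out, '>:raw', $out_path) or die "Cannot write $out_path: $!";
    print {$out} to_utf8_bytes($bytes, $source_encoding);
    close($out);
}
```

A call such as `convert_file_to_utf8('article_latin1.nxml', 'Input_filename.nxml', 'ISO-8859-1')` would then produce a UTF-8 copy ready to be sent to the service. Files that are already UTF-8 need no conversion.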
Input
1. PMID
2. PMCID
3. PMID article (XML format)
4. PMCID article (XML format)
5. Just a paragraph (free format)
Example: [Example1] [Example2] [Example3] [Example4] [Example5]
Input example
-------------------------------------------------------------------------------------------------------------
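#!perl
use strict;
use warnings;
use Data::Dumper;
use RPC::XML;
use RPC::XML::Client;

# Connect to the gene normalization XML-RPC server.
my $conn = RPC::XML::Client->new('http://ikmbio.csie.ncku.edu.tw:8080');

# Read the whole input article. Change Input_filename.nxml to your own input file name.
my $xml_format = "";
open(my $read_file, '<', 'Input_filename.nxml') or die "Cannot open input file: $!";
while (my $line = <$read_file>) {
    $xml_format .= $line;
}
close($read_file);

# Remove newlines and space characters from the article text before sending it.
$xml_format =~ s/\n//g;
$xml_format =~ s/ //g;

# Allow long-running requests (timeout in seconds).
$conn->useragent->timeout(60000);

# Call the annotation method and print the decoded response.
my $req  = RPC::XML::request->new('Fulltext.getAnnotation', $xml_format);
my $resp = $conn->simple_request($req);
print Data::Dumper::Dumper($resp);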
-------------------------------------------------------------------------------------------------------------
Output
1. Entrez ID
2. Confidence of the Entrez ID
3. Rank of the Entrez ID
Output example
Example:
-------------------------------------------------------------------------------------------------------------
$VAR1 = {
    'normalizations' => [
        {
            'Entity inference' => 'esr 1|estrogen receptor',
            'BOW inference' => 'estrogen',
            'Confidence' => '1',
            'rank' => '1',
            'Entrez ID' => '2099'
        },
        {
            'Entity inference' => 'erbb 2|h er 2|neu',
            'BOW inference' => '',
            'Confidence' => '0.8',
            'rank' => '2',
            'Entrez ID' => '2064'
        },
    ]
};
-------------------------------------------------------------------------------------------------------------
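Once the response has been decoded into a Perl data structure like the one in the output example, the identifiers can be filtered and ordered in a few lines. This is a minimal sketch that assumes the response always has the shape shown in the example ('normalizations' holding a list of hashes with 'Entrez ID', 'Confidence' and 'rank' keys); the function name and the confidence threshold are illustrative choices.

```perl
use strict;
use warnings;

# Response shaped like the output example (values copied from it, order shuffled).
my $resp = {
    'normalizations' => [
        { 'Entrez ID' => '2064', 'Confidence' => '0.8', 'rank' => '2' },
        { 'Entrez ID' => '2099', 'Confidence' => '1',   'rank' => '1' },
    ],
};

# Return the Entrez IDs whose confidence is at least $min_confidence, ordered by rank.
sub ranked_entrez_ids {
    my ($response, $min_confidence) = @_;
    my @kept = grep { $_->{'Confidence'} >= $min_confidence }
               @{ $response->{'normalizations'} || [] };
    return map { $_->{'Entrez ID'} } sort { $a->{'rank'} <=> $b->{'rank'} } @kept;
}

print join(', ', ranked_entrez_ids($resp, 0.5)), "\n";
```

With a threshold of 0.5 both identifiers are kept and printed in rank order (2099, 2064); raising the threshold to 0.9 keeps only 2099.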
Other resources
Species name regular expression. (link)
References
- Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hong-Jie Dai, Naoaki Okazaki, Han-Cheol Cho, Martin Gerner, Illes Solt, Shashank Agarwal, Feifan Liu, Dina Vishnyakova, Patrick Ruch, Martin Romacker, Fabio Rinaldi, Sanmitra Bhattacharya, Padmini Srinivasan, Hongfang Liu, Manabu Torii, Sergio Matos, David Campos, Karin Verspoor, Kevin M. Livingston, and W. John Wilbur (2010) "The Gene Normalization Task in BioCreative III". BMC Bioinformatics, special issue on BioCreative III. (accepted)
- Chih-Hsuan Wei, Hung-Yu Kao (2010) "Cross-species gene normalization by species inference". BMC Bioinformatics, special issue on BioCreative III. (accepted)
- Chih-Hsuan Wei, Hung-Yu Kao (2010) "Represented indicator measurement and corpus distillation on focus species detection". In IEEE International Conference on Bioinformatics & Biomedicine, pp. 657-662.
- Chih-Hsuan Wei, I-Chin Huang, Yi-Yu Hsu, Hung-Yu Kao (2009) "Normalizing Biomedical Name Entities by Similarity-Based Inference Network and De-ambiguity Mining". In Ninth IEEE International Conference on Bioinformatics and Bioengineering Workshop: Semantic Biomedical Computing, Taichung, Taiwan, pp. 461-466.