News
[2013 Jan] Users can now access GenNorm results for all PubMed abstracts via PubTator.
[2012 Aug] GenNorm offline (version 1.2)
[2010 Nov] GenNorm online (version 1.0)
Offline processing (version 1.2)
Major modifications:
1. GenNorm v1.2 integrates SR4GN for species recognition and for assigning species to gene mentions.
2. This version maps the normalization results to individual gene mentions (mention level rather than document level).
3. The algorithm and the gene and species dictionaries have been updated (August 2012).
Download: [GenNorm v1.2] [Dictionary]
Please cite this resource as shown below:
- Wei C-H, Kao H-Y, Lu Z (2012) SR4GN: a species recognition software tool for gene normalization. PLoS ONE, 7(6):e38460
- Wei C-H, Kao H-Y (2011) Cross-species gene normalization by species inference. BMC Bioinformatics, special issue on BioCreative III, 12(Suppl 8):S5
Online processing (version 1.0)
An online version of GenNorm is provided as an XML-RPC service. It is intended to be called from a Perl environment.
Setup and install
First, install Perl with the RPC::XML and Data::Dumper modules using PPM.
Second, prepare the input, which can be an article in XML format or an article ID. Obtain the XML article or its PMID/PMCID from the NCBI PubMed Central or PubMed search engine.
Third, convert the input file to UTF-8 encoding.
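The UTF-8 conversion in the third step can be done in Perl itself. The following is a minimal sketch, assuming the file is currently encoded as ISO-8859-1; that encoding, the helper name convert_to_utf8 and the output file name are illustrative assumptions, not part of GenNorm.

```perl
#!perl
use strict;
use warnings;

# Re-encode a text file as UTF-8.
# $from_encoding is an assumption: pass the encoding your file really uses.
sub convert_to_utf8 {
    my ($in_path, $out_path, $from_encoding) = @_;
    open(my $in, "<:encoding($from_encoding)", $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $out_path;
}

# Example call (hypothetical file names):
# convert_to_utf8('Input_filename.nxml', 'Input_filename.utf8.nxml', 'ISO-8859-1');
```

Writing to a separate output file keeps the original untouched, so the conversion can be repeated if the source encoding was guessed wrongly.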
Input
1. PMID
2. PMCID
3. PMID article (XML format)
4. PMCID article (XML format)
5. A single paragraph (free format)
Example: [Example1] [Example2] [Example3] [Example4] [Example5]
Input example
-------------------------------------------------------------------------------------------------------------
#!perl
use strict;
use warnings;
use Data::Dumper;
use RPC::XML;
use RPC::XML::Client;

my $conn = RPC::XML::Client->new('http://ikmbio.csie.ncku.edu.tw:8080');

# Replace Input_filename.nxml with the name of your input file.
my $xml_format = "";
open(my $read_file, '<', 'Input_filename.nxml') or die "Cannot open input file: $!";
while (my $line = <$read_file>) {
    $xml_format .= $line;
}
close $read_file;

$xml_format =~ s/\n//g;    # strip newlines
$xml_format =~ s/ //g;     # strip spaces

$conn->useragent->timeout(60000);    # LWP timeout, in seconds
my $req  = RPC::XML::request->new('Fulltext.getAnnotation', $xml_format);
my $resp = $conn->simple_request($req);
# simple_request returns undef on failure and sets $RPC::XML::ERROR.
defined $resp or die "Request failed: $RPC::XML::ERROR\n";
print Data::Dumper::Dumper($resp);
-------------------------------------------------------------------------------------------------------------
Output
1. Entrez ID
2. Confidence of the Entrez ID
3. Rank of the Entrez ID
Output example
Example:
-------------------------------------------------------------------------------------------------------------
$VAR1 = {
  'normalizations' => [
    {
      'Entity inference' => 'esr 1|estrogen receptor',
      'BOW inference' => 'estrogen',
      'Confidence' => '1',
      'rank' => '1',
      'Entrez ID' => '2099'
    },
    {
      'Entity inference' => 'erbb 2|h er 2|neu',
      'BOW inference' => '',
      'Confidence' => '0.8',
      'rank' => '2',
      'Entrez ID' => '2064'
    },
  ]
};
------------------------------------------------------------------------------------------------------------
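Once the response has been received, the normalizations can be read as an ordinary Perl data structure. The sketch below hard-codes the sample output shown above so that it runs without contacting the service; summarize_normalizations and the confidence threshold are hypothetical helpers for illustration, and the sketch assumes the live response has the same shape as this sample.

```perl
#!perl
use strict;
use warnings;

# Sample response copied from the output example; a live call would
# obtain this structure from $conn->simple_request($req) instead.
my $resp = {
    'normalizations' => [
        {
            'Entity inference' => 'esr 1|estrogen receptor',
            'BOW inference'    => 'estrogen',
            'Confidence'       => '1',
            'rank'             => '1',
            'Entrez ID'        => '2099',
        },
        {
            'Entity inference' => 'erbb 2|h er 2|neu',
            'BOW inference'    => '',
            'Confidence'       => '0.8',
            'rank'             => '2',
            'Entrez ID'        => '2064',
        },
    ],
};

# Return "rank<TAB>Entrez ID<TAB>confidence" strings, ordered by rank,
# keeping only entries whose confidence is at least $min_confidence.
sub summarize_normalizations {
    my ($response, $min_confidence) = @_;
    my @entries = sort { $a->{'rank'} <=> $b->{'rank'} }
                  @{ $response->{'normalizations'} || [] };
    my @rows;
    for my $entry (@entries) {
        next if $entry->{'Confidence'} < $min_confidence;
        push @rows, join("\t", $entry->{'rank'}, $entry->{'Entrez ID'}, $entry->{'Confidence'});
    }
    return @rows;
}

print "$_\n" for summarize_normalizations($resp, 0.5);
```

Raising the threshold (for example to 0.9) keeps only the top-ranked Entrez ID in this sample.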
Introduction
To access and utilize the rich information contained in biomedical literature, the ability to recognize and normalize gene mentions referenced in the literature is crucial. In this paper, we focus on improvements to the accuracy of gene normalization in cases where species information is not provided. Gene names are often ambiguous, in that they can refer to the genes of many species. Therefore, gene normalization is a difficult challenge.
We define "gene normalization" (GN) as a series of tasks involving three issues: gene name recognition, species assignation and species-specific gene normalization. Each issue can affect overall performance, but species assignation matters most, because correctly identifying the species reduces the ambiguity among orthologous genes. We propose an integrated method that handles these issues with three modules: the gene name recognition (GNR) module, the species assignation (SA) module and the species-specific gene normalization (SGN) module.
Figure 1. Architecture of the gene normalization method.
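To make the three-module flow concrete, here is a toy sketch of how GNR, SA and SGN could be chained. It is not GenNorm's implementation: the tiny lexicons, the cue-word voting and all sub names are hypothetical, the mouse identifier is a placeholder rather than a real gene ID, and names are matched by plain substring search.

```perl
#!perl
use strict;
use warnings;

# Hypothetical toy lexicons. Cue word => NCBI Taxonomy ID.
my %SPECIES_CUES = (human => '9606', patient => '9606', mouse => '10090', mice => '10090');
# Gene name => { taxonomy ID => gene identifier }. The same name is
# ambiguous across species until a species has been assigned.
my %GENE_LEXICON = (
    'estrogen receptor' => { '9606' => '2099', '10090' => 'MOUSE_ID_PLACEHOLDER' },
);

# GNR: return the lexicon names that occur in the text.
sub recognize_gene_names {
    my ($text) = @_;
    my $lower = lc($text);
    return grep { index($lower, $_) >= 0 } sort keys %GENE_LEXICON;
}

# SA: pick the species whose cue words occur most often (undef if none).
sub assign_species {
    my ($text) = @_;
    my %votes;
    for my $word (lc($text) =~ /([a-z]+)/g) {
        $votes{ $SPECIES_CUES{$word} }++ if exists $SPECIES_CUES{$word};
    }
    my ($best) = sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
    return $best;
}

# SGN: map each recognized name to the identifier for the assigned species.
sub normalize_genes {
    my ($text) = @_;
    my $taxon = assign_species($text);
    return () unless defined $taxon;
    return map  { [ $_, $taxon, $GENE_LEXICON{$_}{$taxon} ] }
           grep { exists $GENE_LEXICON{$_}{$taxon} }
           recognize_gene_names($text);
}

for my $hit (normalize_genes('Estrogen receptor expression in human breast tissue')) {
    print join("\t", @$hit), "\n";
}
```

The point of the sketch is the ordering: the same gene name resolves to a different identifier depending on what the species step decides, which is why errors in species assignation propagate to the final result.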
We participated in the GN task of the BioCreative III (BC3) competition by adopting an integrated method based on our previous work to handle intra-species gene ambiguity. Results demonstrated that our method worked well, ranking at the top level of performance among all teams. Our proposed method makes sufficient use of gene/species information in context and of a thesaurus of gene/species.
In experiments, the proposed model attained the top threshold average precision (TAP-k) scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) when tested against the 50 articles that had been selected for their difficulty and for having the most divergent results among the pooled team submissions. On the full test set of 507 articles, it obtained the second highest scores (TAP-k of 0.4591 for k=5 and 10).
Table 1. Performance statistics on the BC3 test and training data
| Corpus | Data set | TAP-5 | TAP-10 | TAP-20 | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|
| test data (1st run) | 50 (gold standard) | 0.3254 | 0.3538 | 0.3535 | 53.85% | 39.44% | 45.53% |
| test data (2nd run) | 50 (gold standard) | 0.3216 | 0.3435 | 0.3435 | 55.54% | 39.07% | 45.87% |
| test data (3rd run) | 50 (gold standard) | 0.3297 | 0.3514 | 0.3514 | 56.23% | 39.72% | 46.56% |
| test data (1st run) | 50 (silver standard) | 0.3567 | 0.3600 | 0.3600 | 58.94% | 38.95% | 46.90% |
| test data (2nd run) | 50 (silver standard) | 0.3291 | 0.3291 | 0.3291 | 58.60% | 37.64% | 45.84% |
| test data (3rd run) | 50 (silver standard) | 0.3382 | 0.3382 | 0.3382 | 59.46% | 38.35% | 46.62% |
| test data (1st run) | 507 (silver standard) | 0.4591 | 0.4591 | 0.4591 | 71.79% | 44.69% | 55.09% |
| test data (2nd run) | 507 (silver standard) | 0.4323 | 0.4323 | 0.4323 | 72.08% | 42.70% | 53.64% |
| test data (3rd run) | 507 (silver standard) | 0.4327 | 0.4327 | 0.4327 | 72.41% | 42.82% | 53.82% |
| Training data | 32 (gold standard) | 0.4703 | 0.4969 | 0.4969 | 63.82% | 67.71% | 65.70% |
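The F-measure column is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The page does not state the formula, so the sketch below assumes that definition; f_measure is a hypothetical helper name. Because the published precision and recall are rounded, a recomputed value can differ from the table in the last digit for some rows.

```perl
#!perl
use strict;
use warnings;

# Harmonic mean of precision and recall (both given in percent).
sub f_measure {
    my ($precision, $recall) = @_;
    return 0 if $precision + $recall == 0;    # avoid division by zero
    return 2 * $precision * $recall / ($precision + $recall);
}

# First gold-standard run: precision 53.85%, recall 39.44%.
printf "%.2f%%\n", f_measure(53.85, 39.44);    # prints 45.53%
```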
Table 2. Performance on the gene normalization task of the top 4 teams in the BioCreative III competition
| Team | Run | Gold standard (50 selected articles) | | | Silver standard (50 selected articles) | | | Silver standard (All 507 articles) | | |
| | | TAP-5 | TAP-10 | TAP-20 | TAP-5 | TAP-10 | TAP-20 | TAP-5 | TAP-10 | TAP-20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Kuo et al. (Team 74) | 1st run | 0.2137 | 0.2509 | 0.2509 | 0.3820 | 0.3820 | 0.3820 | 0.4873 | 0.4873 | 0.4873 |
| Kuo et al. (Team 74) | 2nd run | 0.2083 | 0.2480 | 0.2480 | 0.3855 | 0.3855 | 0.3855 | 0.4871 | 0.4871 | 0.4871 |
| Kuo et al. (Team 74) | 3rd run | 0.2099 | 0.2495 | 0.2495 | 0.3890 | 0.3890 | 0.3890 | 0.4916 | 0.4916 | 0.4916 |
| Our method (Team 83) | 1st run | 0.3254 | 0.3538 | 0.3535 | 0.3567 | 0.3600 | 0.3600 | 0.4591 | 0.4591 | 0.4591 |
| Our method (Team 83) | 2nd run | 0.3216 | 0.3435 | 0.3435 | 0.3291 | 0.3291 | 0.3291 | 0.4323 | 0.4323 | 0.4323 |
| Our method (Team 83) | 3rd run | 0.3297 | 0.3514 | 0.3514 | 0.3382 | 0.3382 | 0.3382 | 0.4327 | 0.4327 | 0.4327 |
| Liu et al. (Team 98) | 1st run | 0.2835 | 0.3012 | 0.3103 | 0.3343 | 0.3535 | 0.3629 | 0.3818 | 0.3899 | 0.3875 |
| Liu et al. (Team 98) | 2nd run | 0.2909 | 0.3079 | 0.3087 | 0.3354 | 0.3543 | 0.3634 | 0.3790 | 0.3878 | 0.3868 |
| Liu et al. (Team 98) | 3rd run | 0.3013 | 0.3183 | 0.3303 | 0.3710 | 0.4116 | 0.4672 | 0.4086 | 0.4511 | 0.4648 |
| Lai et al. (Team 101) | 1st run | 0.1896 | 0.2288 | 0.2385 | 0.3590 | 0.3859 | 0.3859 | 0.4289 | 0.4289 | 0.4289 |
| Lai et al. (Team 101) | 2nd run | 0.1672 | 0.2150 | 0.2418 | 0.3239 | 0.3945 | 0.4132 | 0.4294 | 0.4408 | 0.4408 |
| Lai et al. (Team 101) | 3rd run | 0.1812 | 0.2141 | 0.2425 | 0.3258 | 0.4109 | 0.4109 | 0.4536 | 0.4536 | 0.4536 |
References
- Sofie Van Landeghem, Jari Bjorne, Chih-Hsuan Wei, Kai Hakala, Sampo Pyysalo, Sophia Ananiadou, Hung-Yu Kao, Zhiyong Lu, Tapio Salakoski, Yves Van de Peer, Filip Ginter (2013) "EVEX 1.0: A bibliome-wide resource of biomolecular event extraction and taxonomic classification". PLoS ONE, 8(4):e55814. doi:10.1371/journal.pone.0055814
- Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao, Zhiyong Lu (2012) "Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts". Database (Oxford), bas041
- Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu (2012) "PubTator: A PubMed-like interactive curation system for document triage and literature curation". Proceedings of the BioCreative 2012 Workshop, Washington DC, USA, pp. 145-150.
- Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu (2012) "SR4GN: a species recognition software tool for gene normalization". PLoS ONE, 7(6):e38460
- Chih-Hsuan Wei, Hung-Yu Kao (2011) "Cross-species gene normalization by species inference". BMC Bioinformatics, special issue on BioCreative III, 12(Suppl 8):S5
- Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hong-Jie Dai, Naoaki Okazaki, Han-Cheol Cho, Martin Gerner, Illes Solt, Shashank Agarwal, Feifan Liu, Dina Vishnyakova, Patrick Ruch, Martin Romacker, Fabio Rinaldi, Sanmitra Bhattacharya, Padmini Srinivasan, Hongfang Liu, Manabu Torii, Sergio Matos, David Campos, Karin Verspoor, Kevin M. Livingston, W. John Wilbur (2011) "The Gene Normalization Task in BioCreative III". BMC Bioinformatics, special issue on BioCreative III, 12(Suppl 8):S2
- Chih-Hsuan Wei, Hung-Yu Kao (2010) "Represented indicator measurement and corpus distillation on focus species detection". IEEE International Conference on Bioinformatics & Biomedicine, pp. 657-662.
- Chih-Hsuan Wei, I-Chin Huang, Yi-Yu Hsu, Hung-Yu Kao (2009) "Normalizing Biomedical Name Entities by Similarity-Based Inference Network and De-ambiguity Mining". Ninth IEEE International Conference on Bioinformatics and Bioengineering Workshop: Semantic Biomedical Computing, Taichung, Taiwan, pp. 461-466.