GenNorm (Cross-species gene normalization by species inference)

Our current research:

tmVar: A mutation recognition software tool

SR4GN: A species recognition software tool for gene normalization.

PubTator: A PubMed-like interactive curation system for document triage and literature curation.

AutoPat: A gene regulation information extraction system.

News

[2013 Sep] GenNorm offline (version 1.3)
[2013 Jan] Users can now access GenNorm results for all PubMed abstracts via PubTator & gene2pubtator
[2012 Aug] GenNorm offline (version 1.2)
[2010 Nov] GenNorm online (version 1.0)


Offline processing (version 1.3)

Major modifications:

1. Modified the input and output formats to be compatible with BioC.

Download: [GenNorm v1.3] [Dictionary, updated July 2014]
* Put the downloaded dictionary in the GenNorm dictionary folder.
* AIIA-GMT is no longer available. Please use BANNER instead, and contact us for details.

Please cite this resource as shown below:

  • Wei C-H, Kao H-Y, Lu Z (2012) SR4GN: a species recognition software tool for gene normalization. PLoS ONE, 7(6):e38460
  • Wei C-H, Kao H-Y (2011) Cross-species gene normalization by species inference. BMC Bioinformatics, special issue on BioCreative III, 12(Suppl 8):S5
Contact: Chih-Hsuan Wei @ NCBI

 


Offline processing (version 1.2)

Major modifications:

1. GenNorm v1.2 incorporates SR4GN for species recognition and for species assignment of gene mentions.
2. This version can map the normalization results to gene mentions (from document level to mention level).
3. The algorithm and the gene and species dictionaries have been updated (2012 Aug).


Online processing (version 1.0): no longer available.

An online version of GenNorm was provided as an XML-RPC service, intended to be called from a Perl environment.

Setup and install

First, install Perl with the RPC::XML and Data::Dumper modules using PPM.
Second, prepare the input, which can be an XML-format article or an article ID. Obtain the XML article or its PMID/PMCID from the NCBI PubMed Central search engine or the PubMed search engine.
Third, convert the input file to UTF-8 encoding.
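
The third step can be done with any text tool. As a minimal sketch in Python (not part of GenNorm), the helper below re-encodes a file as UTF-8. The function name convert_to_utf8 and the default source encoding latin-1 are assumptions for illustration; set source_encoding to whatever encoding the downloaded article actually uses.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode a text file as UTF-8.

    source_encoding is an assumption: it must match the real
    encoding of the input file, which this sketch cannot detect.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst
```

For example, convert_to_utf8("Input_filename.nxml", "Input_filename.utf8.nxml") writes a UTF-8 copy and leaves the original untouched.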

Input

1. PMID
2. PMCID
3. PMID article (XML format)
4. PMCID article (XML format)
5. Just a paragraph (free format)

Example: [Example1] [Example2] [Example3] [Example4] [Example5]

Input example

-------------------------------------------------------------------------------------------------------------
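#!perl
use strict;
use warnings;

use Data::Dumper;
use RPC::XML;
use RPC::XML::Client;

my $conn = RPC::XML::Client->new('http://ikmbio.csie.ncku.edu.tw:8080');

# Replace Input_filename.nxml with the name of your own input file.
my $input_file = 'Input_filename.nxml';
my $xml_format = '';

# Read the whole input file into a single string.
open(my $read_file, '<', $input_file) or die "Cannot open $input_file: $!";
while (my $line = <$read_file>) {
    $xml_format .= $line;
}
close $read_file;

# Remove all newlines and spaces before sending.
$xml_format =~ s/\n//g;
$xml_format =~ s/ //g;

# Set a generous request timeout (LWP::UserAgent takes seconds).
$conn->useragent->timeout(60000);
my $req  = RPC::XML::request->new('Fulltext.getAnnotation', $xml_format);
my $resp = $conn->simple_request($req);
die "Request failed: $RPC::XML::ERROR" unless defined $resp;

print Data::Dumper::Dumper($resp);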
-------------------------------------------------------------------------------------------------------------
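
Since the online service is no longer available, it can still be useful to see what such a call looks like on the wire. The following Python sketch (Python is used here instead of Perl so that it runs with the standard library alone) serializes a Fulltext.getAnnotation request with xmlrpc.client and decodes it again, without contacting any server. The sample article text and the helper name build_request are illustrative assumptions.

```python
import xmlrpc.client


def build_request(article_xml):
    """Serialize a Fulltext.getAnnotation call as an XML-RPC request body."""
    return xmlrpc.client.dumps((article_xml,), methodname="Fulltext.getAnnotation")


# Hypothetical input; a real call would pass the article XML or an ID.
payload = build_request("<article>BRCA1 in human cells</article>")

# Decode the payload again to check the method name and the parameter.
params, method = xmlrpc.client.loads(payload)
print(method)
print(params[0])
```

The payload is a plain XML document with a methodCall element, which is what RPC::XML::Client sends over HTTP in the Perl example.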

Output

1. Entrez ID
2. Confidence of the Entrez ID
3. Rank of the Entrez ID

Output example

Example:
-------------------------------------------------------------------------------------------------------------
$VAR1 = {
   'normalizations' => [
      {
            'Entity inference' => 'esr 1|estrogen receptor',
            'BOW inference' => 'estrogen',
            'Confidence' => '1',
            'rank' => '1',
            'Entrez ID' => '2099'
      },  
      {
            'Entity inference' => 'erbb 2|h er 2|neu',
            'BOW inference' => '',
            'Confidence' => '0.8',
            'rank' => '2',
            'Entrez ID' => '2064'
      }, 
   ]
};
------------------------------------------------------------------------------------------------------------
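
A result of this shape is easy to post-process. The Python sketch below copies the two records from the output example into a dictionary and lists the Entrez IDs in rank order, optionally dropping low-confidence entries. The function name ranked_ids and the min_confidence parameter are assumptions for illustration, not part of the service.

```python
# The two records from the output example, as a Python dictionary.
response = {
    "normalizations": [
        {"Entity inference": "esr 1|estrogen receptor", "BOW inference": "estrogen",
         "Confidence": "1", "rank": "1", "Entrez ID": "2099"},
        {"Entity inference": "erbb 2|h er 2|neu", "BOW inference": "",
         "Confidence": "0.8", "rank": "2", "Entrez ID": "2064"},
    ]
}


def ranked_ids(resp, min_confidence=0.0):
    """Return (Entrez ID, confidence) pairs ordered by rank."""
    rows = [r for r in resp["normalizations"]
            if float(r["Confidence"]) >= min_confidence]
    rows.sort(key=lambda r: int(r["rank"]))
    return [(r["Entrez ID"], float(r["Confidence"])) for r in rows]


print(ranked_ids(response))                      # [('2099', 1.0), ('2064', 0.8)]
print(ranked_ids(response, min_confidence=0.9))  # [('2099', 1.0)]
```

Note that all values arrive as strings, so confidence and rank are converted before comparing or sorting.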


Introduction

To access and utilize the rich information contained in biomedical literature, the ability to recognize and normalize gene mentions referenced in the literature is crucial. In this paper, we focus on improvements to the accuracy of gene normalization in cases where species information is not provided. Gene names are often ambiguous, in that they can refer to the genes of many species. Therefore, gene normalization is a difficult challenge.

We define "gene normalization" (GN) as a series of tasks involving several issues: gene name recognition, species assignation, and species-specific gene normalization. Every issue can affect overall performance, though the most important is species assignation, because correct identification of the species decreases the ambiguity of orthologous genes. We propose an integrated method that handles these three issues with three modules: the gene name recognition (GNR) module, the species assignation (SA) module, and the species-specific gene normalization (SGN) module.

Figure 1. Architecture of the gene normalization method (flowchart).
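
To make the data flow between the three modules concrete, here is a deliberately simplified Python sketch. It is not GenNorm's algorithm: the tiny lexicons, the cue-word counting used for species assignation, and all function names are assumptions for illustration. Only the two Entrez IDs (2099 and 2064) come from the output example on this page; 9606 and 10090 are the NCBI Taxonomy IDs for human and mouse.

```python
import re

# Hypothetical toy lexicons; GenNorm's real dictionaries are far larger.
GENE_NAMES = {"esr1": "ESR1", "estrogen receptor": "ESR1",
              "erbb2": "ERBB2", "neu": "ERBB2"}
SPECIES_CUES = {"human": "9606", "mouse": "10090", "mice": "10090"}
GENE_IDS = {("ESR1", "9606"): "2099", ("ERBB2", "9606"): "2064"}


def recognize_genes(text):
    """GNR: find gene mentions by whole-word dictionary lookup."""
    lowered = text.lower()
    return sorted({symbol for name, symbol in GENE_NAMES.items()
                   if re.search(r"\b" + re.escape(name) + r"\b", lowered)})


def assign_species(text, default="9606"):
    """SA: pick the species whose cue words occur most often."""
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in SPECIES_CUES:
            tax_id = SPECIES_CUES[word]
            counts[tax_id] = counts.get(tax_id, 0) + 1
    return max(counts, key=counts.get) if counts else default


def normalize(text):
    """SGN: map each (gene, species) pair to an ID, or None if unknown."""
    species = assign_species(text)
    return {gene: GENE_IDS.get((gene, species)) for gene in recognize_genes(text)}


print(normalize("The estrogen receptor and ERBB2 are overexpressed in human tumours."))
```

The same gene name resolves to a different identifier, or to none in this toy lexicon, once the species changes, which is why the SA step has the largest effect on the final result.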

We participated in the GN task of the BioCreative III (BC3) competition by adopting an integrated method based on our previous work to handle intra-species gene ambiguity. The results showed that our method worked well, ranking among the top-performing teams. The method makes thorough use of the gene and species information in context and of a gene/species thesaurus. In experiments, it attained the top threshold average precision (TAP-k) scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) when tested against the 50 articles that had been selected for their difficulty and for having the most divergent results among the pooled team submissions. It obtained the second highest scores on the full test set of articles (TAP-k score of 0.4591 for k=5, 10).

Table 1. Performance statistics on the BC3 test and training data

Corpus               Data set               TAP-5   TAP-10  TAP-20  Precision  Recall  F-measure
Test data (1st run)  50 (gold standard)     0.3254  0.3538  0.3535  53.85%     39.44%  45.53%
Test data (2nd run)  50 (gold standard)     0.3216  0.3435  0.3435  55.54%     39.07%  45.87%
Test data (3rd run)  50 (gold standard)     0.3297  0.3514  0.3514  56.23%     39.72%  46.56%
Test data (1st run)  50 (silver standard)   0.3567  0.3600  0.3600  58.94%     38.95%  46.90%
Test data (2nd run)  50 (silver standard)   0.3291  0.3291  0.3291  58.60%     37.64%  45.84%
Test data (3rd run)  50 (silver standard)   0.3382  0.3382  0.3382  59.46%     38.35%  46.62%
Test data (1st run)  507 (silver standard)  0.4591  0.4591  0.4591  71.79%     44.69%  55.09%
Test data (2nd run)  507 (silver standard)  0.4323  0.4323  0.4323  72.08%     42.70%  53.64%
Test data (3rd run)  507 (silver standard)  0.4327  0.4327  0.4327  72.41%     42.82%  53.82%
Training data        32 (gold standard)     0.4703  0.4969  0.4969  63.82%     67.71%  65.70%
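
The F-measure column in Table 1 is consistent with the usual balanced F-measure, the harmonic mean of precision and recall. As a quick check, assuming that definition (the page does not state the formula), the Python sketch below recomputes two of the first-run values from their precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Inputs and result share the same unit (here: percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First run on the 50 gold-standard articles.
print(round(f_measure(53.85, 39.44), 2))  # 45.53
# First run on the 507 silver-standard articles.
print(round(f_measure(71.79, 44.69), 2))  # 55.09
```

Because the published precision and recall are themselves rounded, recomputed values can differ from the table in the last digit for some rows.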

Table 2. Performance on the gene normalization task of the top 4 performing teams in the BioCreative III competition

                                Gold standard            Silver standard          Silver standard
                                (50 selected articles)   (50 selected articles)   (all 507 articles)
Team                   Run      TAP-5   TAP-10  TAP-20   TAP-5   TAP-10  TAP-20   TAP-5   TAP-10  TAP-20
Kuo et al. (Team 74)   1st run  0.2137  0.2509  0.2509   0.3820  0.3820  0.3820   0.4873  0.4873  0.4873
Kuo et al. (Team 74)   2nd run  0.2083  0.2480  0.2480   0.3855  0.3855  0.3855   0.4871  0.4871  0.4871
Kuo et al. (Team 74)   3rd run  0.2099  0.2495  0.2495   0.3890  0.3890  0.3890   0.4916  0.4916  0.4916
Our method (Team 83)   1st run  0.3254  0.3538  0.3535   0.3567  0.3600  0.3600   0.4591  0.4591  0.4591
Our method (Team 83)   2nd run  0.3216  0.3435  0.3435   0.3291  0.3291  0.3291   0.4323  0.4323  0.4323
Our method (Team 83)   3rd run  0.3297  0.3514  0.3514   0.3382  0.3382  0.3382   0.4327  0.4327  0.4327
Liu et al. (Team 98)   1st run  0.2835  0.3012  0.3103   0.3343  0.3535  0.3629   0.3818  0.3899  0.3875
Liu et al. (Team 98)   2nd run  0.2909  0.3079  0.3087   0.3354  0.3543  0.3634   0.3790  0.3878  0.3868
Liu et al. (Team 98)   3rd run  0.3013  0.3183  0.3303   0.3710  0.4116  0.4672   0.4086  0.4511  0.4648
Lai et al. (Team 101)  1st run  0.1896  0.2288  0.2385   0.3590  0.3859  0.3859   0.4289  0.4289  0.4289
Lai et al. (Team 101)  2nd run  0.1672  0.2150  0.2418   0.3239  0.3945  0.4132   0.4294  0.4408  0.4408
Lai et al. (Team 101)  3rd run  0.1812  0.2141  0.2425   0.3258  0.4109  0.4109   0.4536  0.4536  0.4536

References

  • Jui-Chen Hsiao, Chih-Hsuan Wei, Hung-Yu Kao (2014) Gene name Disambiguation using Multi-scope Species Detection, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 11(1):55-62, doi: 10.1109/TCBB.2013.139
  • Sofie Van Landeghem, Jari Bjorne, Chih-Hsuan Wei, Kai Hakala, Sampo Pyysalo, Sophia Ananiadou, Hung-Yu Kao, Zhiyong Lu, Tapio Salakoski, Yves Van de Peer, Filip Ginter (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization, PLoS ONE 8(4): e55814. doi:10.1371/journal.pone.0055814
  • Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu (2012) SR4GN: a species recognition software tool for gene normalization, PLoS ONE, 7(6):e38460
  • Chih-Hsuan Wei, Hung-Yu Kao (2011) Cross-species gene normalization by species inference. BMC Bioinformatics, special issue on BioCreative III, 12(Suppl 8):S5
  • Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hong-Jie Dai, Naoaki Okazaki, Han-Cheol Cho, Martin Gerner, Illes Solt, Shashank Agarwal, Feifan Liu, Dina Vishnyakova, Patrick Ruch, Martin Romacker, Fabio Rinaldi, Sanmitra Bhattacharya, Padmini Srinivasan, Hongfang Liu, Manabu Torii, Sergio Matos, David Campos, Karin Verspoor, Kevin M. Livingston, and W. John Wilbur (2011) The Gene Normalization Task in BioCreative III. BMC Bioinformatics, special issue on BioCreative III, 12(Suppl 8):S2
  • Chih-Hsuan Wei, Hung-Yu Kao (2010) Represented indicator measurement and corpus distillation on focus species detection. in IEEE International Conference on Bioinformatics & Biomedicine, pp. 657-662.
  • Chih-Hsuan Wei, I-Chin Huang, Yi-Yu Hsu, Hung-Yu Kao (2009) Normalizing Biomedical Name Entities by Similarity-Based Inference Network and De-ambiguity Mining, in Ninth IEEE International Conference on Bioinformatics and Bioengineering Workshop: Semantic Biomedical Computing, Taichung, Taiwan, pp. 461-466.
Designed by Chih-Hsuan Wei (2010)