
Subnetwork of word co-occurrence proximity (with 34 words) for a specific document from the first BioCreative competition. The red nodes denote the words retrieved from a specific GO annotation (0007266: Rho, protein, signal, transduce). The blue nodes denote the words that co-occur very frequently with at least one of the red nodes: the co-occurrence neighborhood of the GO words. The green nodes denote the additional words discovered by our network algorithm, as described in Verspoor et al. (2005).
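As a rough illustration of the idea behind such a network, the following minimal sketch computes a Jaccard-style co-occurrence proximity between words and then collects the neighborhood of a set of seed words. The proximity formula, the 0.5 threshold, the toy paragraphs, and the function names `cooccurrence_proximity` and `neighborhood` are assumptions made for illustration; they are not taken from Verspoor et al. (2005).

```python
from itertools import combinations


def cooccurrence_proximity(units):
    """Return {(w1, w2): proximity} with a Jaccard-style measure over text units.

    Assumption: proximity = (# units containing both words) /
    (# units containing either word). This is an illustrative choice only.
    """
    # Map each word to the set of unit indices in which it occurs.
    occurrences = {}
    for idx, text in enumerate(units):
        for word in set(text.lower().split()):
            occurrences.setdefault(word, set()).add(idx)

    proximity = {}
    for w1, w2 in combinations(sorted(occurrences), 2):
        shared = len(occurrences[w1] & occurrences[w2])
        if shared:
            total = len(occurrences[w1] | occurrences[w2])
            proximity[(w1, w2)] = shared / total
    return proximity


def neighborhood(proximity, seeds, threshold=0.5):
    """Non-seed words whose proximity to at least one seed meets the threshold."""
    found = set()
    for (w1, w2), p in proximity.items():
        if p >= threshold:
            if w1 in seeds and w2 not in seeds:
                found.add(w2)
            elif w2 in seeds and w1 not in seeds:
                found.add(w1)
    return found


# Hypothetical toy "document" split into three text units.
paragraphs = [
    "rho protein signal transduction",
    "rho gtpase signal cascade",
    "kinase cascade membrane",
]
prox = cooccurrence_proximity(paragraphs)
neighbors = neighborhood(prox, seeds={"rho", "signal"}, threshold=0.5)
print(sorted(neighbors))
```

In this toy setting the seed words play the role of the red nodes and the returned neighbors the role of the blue nodes; discovering further (green) words would require an additional network step not sketched here.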
Much of the research presently conducted in the biomedical domain relies on the induction of correlations and interactions from data. Because we ultimately want to increase our knowledge of the biochemical and functional roles of genes and proteins in organisms, there is a clear need to integrate the associations and interactions among biological entities that have been reported and have accumulated in the literature and databases. Biomedical literature mining is an important informatics methodology for large-scale information extraction from repositories of textual documents, as well as for integrating information available in various domain-specific databases and ontologies, ultimately leading to knowledge discovery. It helps us tap into the biomedical collective knowledge and uncover relationships and interactions buried in the literature and databases, including those that can be inferred from global information but are unreported in individual experiments. Our approach to literature mining is based on bottom-up, data-driven or bio-inspired methods, which we have applied to automatic discovery, classification and annotation of protein-protein and drug-drug interactions, pharmacokinetic data, protein sequence family and structure prediction, functional annotation of transcription data, enzyme annotation publications, and so on. Examples of these are shown below, together with links to additional resources and publications.

PPI task - Decision structure on the protein-protein interaction article test data of BioCreative II, as produced by our Variable Trigonometric Threshold model. Abi-Haidar, A. et al. (2008)
Protein-Protein Interaction Discovery (PPI): Until now, literature mining has been applied essentially to help annotate and characterize molecular entities such as genes and proteins. In the next few years the field is expected to move toward aiding the discovery and automatic annotation of relationships among such entities, e.g. protein-protein and gene-disease interactions. Indeed, the BioCreative challenges II, II.5, and III, in which we participated ([Abi-Haidar et al, 2008], [Kolchinsky et al, 2010], [Lourenço et al, 2011]), include a series of tasks on the extraction of protein-protein interaction information from the literature. As the field moves to uncovering relations rather than entities, our complex network approach to biomedical literature mining [Verspoor et al, 2005], which we tried in the first BioCreative competition, makes all the more sense. Additionally, since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. We were among the most competitive teams in the PPI tasks of BioCreative II, II.5 and III. See our PIARE (Protein Interaction Abstract Relevance Evaluator) web tool for classification of documents relevant for protein-protein interaction, as well as supplementary materials for publications.

Estimated PK clearance parameter data from literature. Wang, Z., et al. (2009)
Estimation of pharmacokinetic numerical data from literature and drug-drug interaction extraction: Our objective is to mine drug-specific (e.g. Midazolam (MDZ)) pharmacokinetic (PK) clearance data (systemic and oral) from the literature. We achieved 88% precision and 92% recall (F-score = 90%), outperforming a support vector machine (F-score of 68.1%). Further investigation of 7 other drugs showed comparable performance [Wang et al, 2009]. This is a collaboration with the group of Dr. Lang Li at Indiana University’s Medical School. Recently, we received funding for a project on “Drug-Drug Interaction Prediction from Large-scale Mining of Literature and Patient Records” from Indiana University Collaborative Research Grants 2011.
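The reported F-score is consistent with the stated precision and recall if it is read as the balanced F1 measure (the harmonic mean of the two), which is an assumption here since the text does not name the variant. A minimal check, with the hypothetical helper `f_score`:

```python
def f_score(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above for the PK clearance extraction task.
f1 = f_score(0.88, 0.92)
print(f"F-score: {f1:.1%}")  # prints "F-score: 90.0%"
```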

proteins voting in proportion to their cosine similarity to the target protein. Maguitman, A. et al (2006)
Protein Family Prediction (PFP): Since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. We have been working on the large-scale validation of bibliome algorithms, and proposed a method that predicts a protein’s Pfam family correctly 76% of the time and, 89% of the time, issues a prediction in which the correct family is among the top 5 [Maguitman et al, 2006].
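The voting scheme mentioned in the figure caption, in which proteins vote in proportion to their cosine similarity to the target protein, can be sketched as follows. This is a minimal illustration and not the published method: the keyword-weight dictionaries, the placeholder family labels, and the function names `cosine` and `rank_families` are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_families(target, labeled, top_k=5):
    """Rank families by summed cosine-similarity votes from labeled proteins."""
    votes = defaultdict(float)
    for vector, family in labeled:
        # Each labeled protein votes for its family, weighted by similarity.
        votes[family] += cosine(target, vector)
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    return [family for family, _ in ranked[:top_k]]


# Hypothetical literature-derived keyword vectors and placeholder family labels.
labeled_proteins = [
    ({"kinase": 1.0, "atp": 1.0, "phosphorylation": 1.0}, "family_A"),
    ({"kinase": 1.0}, "family_A"),
    ({"zinc": 1.0, "dna": 1.0}, "family_B"),
    ({"atp": 1.0, "transport": 1.0}, "family_C"),
]
target_protein = {"kinase": 1.0, "atp": 1.0}
ranking = rank_families(target_protein, labeled_proteins)
print(ranking)
```

Returning a ranked list rather than a single label mirrors the two figures quoted above: the first entry corresponds to the single best prediction, and the first five entries to the top-5 criterion.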

PSP task - Our combined method performs significantly better than either the original structure prediction or keyword-based prediction methods alone. Rechtsteiner, A., et al. (2006)
Protein Structure Prediction (PSP): Literature-mining prediction is comparable to the best ab-initio methods in the absence of sequence homology. Combining text mining with an ab-initio method leads to a 35% improvement over the ab-initio method alone. See [Rechtsteiner et al, 2006].
![Picture 13 Rechtsteiner, A. [2005]. PhD Dissertation.](http://cnets.indiana.edu/wp-content/uploads/Picture-131-300x271.png)
Rechtsteiner, A. (2005). PhD Dissertation.
Characterizing gene regulation: SVD (“eigen-clustering”) of microarray data produces sets of co-expressed genes, which were then characterized with annotations automatically extracted from the literature [Rechtsteiner, 2005].
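To make the SVD step concrete, the sketch below decomposes a small, made-up expression matrix with NumPy and groups genes by the singular component on which they project most strongly. The data, the row-centering, and the assignment rule are assumptions chosen for illustration; the dissertation's actual procedure may differ.

```python
import numpy as np

# Hypothetical expression matrix: 6 genes (rows) x 4 conditions (columns).
# Genes 0-2 follow one expression pattern, genes 3-5 another.
expression = np.array([
    [2.0, 2.1, -2.0, -1.9],
    [1.9, 2.0, -2.1, -2.0],
    [2.1, 1.9, -1.9, -2.1],
    [-1.0, 1.0, -1.0, 1.0],
    [-1.1, 0.9, -0.9, 1.1],
    [-0.9, 1.1, -1.1, 0.9],
])

# Center each gene's profile so the SVD captures variation across conditions.
centered = expression - expression.mean(axis=1, keepdims=True)

# Thin SVD: rows of Vt are the dominant expression patterns across conditions,
# and U * s gives each gene's coordinates along those patterns.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
gene_coords = U * s

# Assumed grouping rule: assign each gene to whichever of the two leading
# components carries its largest absolute coordinate.
clusters = np.argmax(np.abs(gene_coords[:, :2]), axis=1)
print(clusters.tolist())
```

Each resulting gene set could then be annotated with terms extracted from the literature, which is the step described in the text above.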
Project Members

Luis M. Rocha, PI

Jon Duke

Lang Li

Predrag Radivojac

Hagit Shatkay

Analia Lourenco

Ana Maguitman

Al Abi-Haidar

Michael Conover

Mohsen JafariAsbagh

Jasleen Kaur

Artemy Kolchinsky

Azadeh Nematzadeh

Andreas Rechtsteiner

Tiago Simas

Zhiping (Paul) Wang
Funding
Project partially funded by
- Indiana University Collaborative Research Grants 2011. Project title: “Drug-Drug Interaction Prediction from Large-scale Mining of Literature and Patient Records”.
- Fundação Luso-Americana para o Desenvolvimento (Portugal) and National Science Foundation (USA), 2012-2014. Project title: “Network Mining For Gene Regulation And Biochemical Signaling.” (171/11)
Selected Project Publications
- Wu, Hengyi, S. Karnik, A. Subhadarshini, Z. Wang, S. Philips, X. Han, C. Chiang, L. Liu, M. Boustani, L.M. Rocha, S.K. Quinney, D.A. Flockhart and L. Li [2013]. “An Integrated Pharmacokinetics Ontology and Corpus for Text Mining”. BMC Bioinformatics. 14:35. DOI:10.1186/1471-2105-14-35. (Highly Accessed)
- A. Kolchinsky, A. Lourenço, L. Li, L.M. Rocha [2013]. “Evaluation of linear classifiers on articles containing pharmacokinetic evidence of drug-drug interactions”. Pacific Symposium on Biocomputing, 2013. 18:409-420.
- Zhiping Wang [2012]. “Biomedical Literature Mining for Pharmacokinetics Numerical Parameter Collection.” Ph.D. Dissertation (Computer Science Program), Indiana University.
- A. Lourenço, M. Conover, A. Wong, A. Nematzadeh, F. Pan, H. Shatkay, and L.M. Rocha [2012]. “Correction: A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature”. BMC Bioinformatics, 13: 180.
- S. Karnik, A. Subhadarshini, Z. Wang, L.M. Rocha, and L. Li [2011]. “Extraction of drug-drug interactions using all paths graph kernel.” In: Proceedings of the 1st Challenge task on Drug-Drug Interaction Extraction. I. Segura-Bedmar, P. Martínez, D. Sánchez. September 7th, 2011, Huelva, Spain, pp. 83-88.
- A. Lourenço, M. Conover, A. Wong, A. Nematzadeh, F. Pan, H. Shatkay, and L.M. Rocha [2011]. “A Linear Classifier Based on Entity Recognition Tools and a Statistical Approach to Method Extraction in the Protein-Protein Interaction Literature”. BMC Bioinformatics. 12(Suppl 8):S12
- M. Krallinger, et al [2011]. “The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text”. BMC Bioinformatics. 12(Suppl 8):S3.
- A. Lourenço, M. Conover, A. Wong, F. Pan, Alaa Abi-Haidar, A. Nematzadeh, H. Shatkay, and L.M. Rocha [2010]. “Testing Extensive Use of NER tools in Article Classification and a Statistical Approach for Method Interaction Extraction in the Protein-Protein Interaction Literature”. Proceedings of the BioCreative III Workshop 2010, Bethesda, Maryland, September 13-15, 2010.
- A. Kolchinsky, A. Abi-Haidar, J. Kaur, A.A. Hamed and L.M. Rocha [2010]. “Classification of protein-protein interaction full-text documents using text and citation network features.” IEEE/ACM Transactions On Computational Biology And Bioinformatics, 7(3):400-411. DOI: doi.ieeecomputersociety.org/10.1109/TCBB.2010.55
- Z. Wang, S. Kim, S.K. Quinney, Y. Guo, S.D. Hall, L.M. Rocha, and L. Li [2009]. “Literature mining on pharmacokinetics numerical data: A feasibility study”. Journal of Biomedical Informatics. 42 (4): 726-735.
- A. Lourenço; R.C. Carreira; D. Glez-Peña; J.R. Méndez; S.A. Carneiro; L.M. Rocha; F. Díaz; E.C. Ferreira; I.P. Rocha; F. Fdez-Riverola; M. Rocha [2009]. “BioDR: Semantic Indexing Networks for Biomedical Document Retrieval.” Expert Systems with Applications, 37, 3444–3453.
- A. Abi-Haidar, J. Kaur, A. Maguitman, P. Radivojac, A. Rechtsteiner, K. Verspoor, Z. Wang, and L.M. Rocha [2008]. “Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks”. Genome Biology. 9(Suppl 2):S11
- Maguitman, A. G., Rechtsteiner, A., Verspoor, K., Strauss, C.E., Rocha, L.M. [2006]. “Large-Scale Testing Of Bibliome Informatics Using Pfam Protein Families”. In: Pacific Symposium on Biocomputing 11:76-87.
- Rechtsteiner, A., Luinstra, J., Rocha, L.M., Strauss, C.E. [2006]. “Use of Text Mining for Protein Structure Prediction and Functional Annotation in Lack of Sequence Homology”. In: Joint BioLINK and Bio-Ontologies Meeting 2006 (ISMB Special Interest Group).
- Verspoor, K., J. Cohn, C. Joslyn, S. Mniszewski, A. Rechtsteiner, L.M. Rocha, T. Simas [2005]. “Protein Annotation as Term Categorization in the Gene Ontology using Word Proximity Networks”. BMC Bioinformatics, 6(Suppl 1):S20. doi:10.1186/1471-2105-6-S1-S20
- Andreas Rechtsteiner [2005]. Multivariate Analysis of Gene Expression Data and Functional Information: Automated Methods for Functional Genomics. PhD Dissertation, Systems Science Program, Portland State University.
- Wall, Michael E., Andreas Rechtsteiner, and Luis M. Rocha [2003]. “Singular Value Decomposition and Principal Component Analysis”. In: A Practical Approach to Microarray Data Analysis. D. P. Berrar, W. Dubitzky, and M. Granzow (Eds.). Kluwer Academic Publishers, pp. 91-109.