Subnetwork of word co-occurrence proximity (with 34 words) for a specific document from the first BioCreative competition. The red nodes denote the words retrieved from a specific GO annotation (0007266: Rho, protein, signal, transduce). The blue nodes denote the words that co-occur very frequently with at least one of the red nodes: the co-occurrence neighborhood of the GO words. The green nodes denote the additional words discovered by our network algorithm, as described in (Verspoor et al, 2005).

Much of the research presently conducted in the biomedical domain relies on the induction of correlations and interactions from data. Because we ultimately want to increase our knowledge of the biochemical and functional roles of genes and proteins in organisms, there is a clear need to integrate the associations and interactions among biological entities that have been reported and have accumulated in the literature and databases. Biomedical literature mining is an important informatics methodology for large-scale information extraction from repositories of textual documents, as well as for integrating information available in various domain-specific databases and ontologies, ultimately leading to knowledge discovery. It helps us tap into the biomedical collective knowledge and uncover relationships and interactions buried in the literature and databases, including those that can be inferred from global information but are unreported in individual experiments. Our approach to literature mining is based on bottom-up, data-driven or bio-inspired methods, which we have applied to automatic discovery, classification and annotation of protein-protein and drug-drug interactions, pharmacokinetic data, protein sequence family and structure prediction, functional annotation of transcription data, and enzyme annotation publications, among others. Examples of these are shown below, together with links to additional resources and publications.

PPI task: Decision structure on the protein-protein interaction article test data of BioCreative II, as produced by our Variable Trigonometric Threshold model. Abi-Haidar, A. et al. (2008)

Protein-Protein Interaction Discovery (PPI): Until now, literature mining has been applied essentially to help annotate and characterize molecular entities such as genes and proteins. In the next few years the field is expected to move toward aiding the discovery and automatic annotation of relationships among such entities, e.g. protein-protein and gene-disease interactions. Indeed, the BioCreative challenges II, II.5, and III, in which we participated ([Abi-Haidar et al, 2008], [Kolchinsky et al, 2010], [Lourenco et al, 2011]), include a series of tasks on extraction of protein-protein interaction information from the literature. As the field moves to uncovering relations rather than entities, our complex network approach to biomedical literature mining [Verspoor et al, 2005], which we tried on the first BioCreative competition, makes all the more sense. Additionally, since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. We were among the most competitive teams in the PPI tasks of BioCreative II, II.5 and III. See our PIARE (Protein Interaction Abstract Relevance Evaluator) web tool for classification of documents relevant for protein-protein interaction, as well as supplementary materials for publications.
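
To make the network approach mentioned above more concrete, the following is a minimal sketch of a word co-occurrence proximity network and of the neighborhood of a set of annotation words. It is an illustration under stated assumptions, not the algorithm of [Verspoor et al, 2005]: the Jaccard-style proximity measure, the threshold, the function names and the toy paragraphs are all assumptions introduced here.

```python
from itertools import combinations


def cooccurrence_proximity(units):
    """Return {(w1, w2): proximity} for word pairs that co-occur at least once.

    Assumed measure (Jaccard-style): number of text units containing both
    words divided by the number of text units containing either word.
    """
    # Map each word to the set of text units (e.g. paragraphs) it appears in.
    occurrences = {}
    for idx, words in enumerate(units):
        for word in set(words):
            occurrences.setdefault(word, set()).add(idx)

    proximity = {}
    for w1, w2 in combinations(sorted(occurrences), 2):
        both = len(occurrences[w1] & occurrences[w2])
        if both:
            either = len(occurrences[w1] | occurrences[w2])
            proximity[(w1, w2)] = both / either
    return proximity


def neighborhood(proximity, seeds, threshold):
    """Non-seed words whose proximity to at least one seed word is >= threshold."""
    seeds = set(seeds)
    found = set()
    for (w1, w2), value in proximity.items():
        if value >= threshold:
            if w1 in seeds and w2 not in seeds:
                found.add(w2)
            elif w2 in seeds and w1 not in seeds:
                found.add(w1)
    return found


# Hypothetical toy document, already split into tokenized paragraphs.
paragraphs = [
    ["rho", "protein", "signal", "gtpase"],
    ["rho", "gtpase", "actin"],
    ["protein", "signal", "kinase"],
    ["actin", "cytoskeleton"],
]
prox = cooccurrence_proximity(paragraphs)
go_words = ["rho", "protein", "signal"]  # seed words from an annotation
neighbors = neighborhood(prox, go_words, threshold=0.5)
print(sorted(neighbors))
```

In this toy setting the seed words play the role of the words retrieved from a GO annotation, and the returned set plays the role of their co-occurrence neighborhood. The step that discovers additional words through the network is not reproduced here.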

Estimated PK clearance parameter data from literature. Wang, Z., et al. (2009)

Estimation of pharmacokinetics numerical data from literature and Drug-Drug interaction extraction: Our objective is to mine drug-specific (e.g. Midazolam (MDZ)) pharmacokinetic (PK) clearance data (systemic and oral) from the literature. We achieved a precision of 88% and a recall of 92%, for an F-score of 90%. This outperforms a support vector machine (F-score of 68.1%). Further investigation on 7 other drugs showed comparable performance [Wang et al, 2009]. This is a collaboration with Indiana University’s Medical School and the group of Dr. Lang Li. Recently, we received funding for a project on “Drug-Drug Interaction Prediction from Large-scale Mining of Literature and Patient Records” from the Indiana University Collaborative Research Grants 2011.
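
As a quick check of the figures above, the reported F-score is consistent with the balanced F-score (harmonic mean of precision and recall). The sketch below assumes that this balanced form is the one reported; the function name is introduced here for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text above.
f1 = f_score(0.88, 0.92)
print(f"F-score = {f1:.3f}")  # prints "F-score = 0.900"
```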


proteins voting in proportion to their cosine similarity to the target protein. Maguitman, A. et al (2006)

Protein Family Prediction (PFP): Since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. We have been working on the large-scale validation of bibliome algorithms, and proposed a method that predicts a protein’s Pfam family correctly 76% of the time and ranks the correct family among its top 5 predictions 89% of the time [Maguitman et al, 2006].
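
The figure caption above describes proteins voting in proportion to their cosine similarity to the target protein. A minimal sketch of such similarity-weighted voting follows. It is not the implementation of [Maguitman et al, 2006]: the sparse term-weight vectors, the family labels, and the function names are hypothetical and chosen only to illustrate the idea.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_families(target, labeled, top_n=5):
    """Rank families by the summed cosine similarity of their proteins to the target."""
    votes = defaultdict(float)
    for vector, family in labeled:
        # Each labeled protein votes for its family, weighted by its similarity.
        votes[family] += cosine(target, vector)
    ranked = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical literature-derived term vectors for proteins with known families.
labeled = [
    ({"kinase": 2.0, "phosphorylation": 1.0}, "FamilyA"),
    ({"kinase": 1.0, "atp": 1.0}, "FamilyA"),
    ({"zinc": 2.0, "dna": 1.0}, "FamilyB"),
]
target = {"kinase": 1.0, "phosphorylation": 1.0, "dna": 0.5}
ranking = rank_families(target, labeled)
print(ranking[0][0])  # family with the largest similarity-weighted vote
```

Returning a ranked list rather than a single label makes it possible to score both the top prediction and whether the correct family appears among the top 5.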


PSP task: Our novel combined method performs significantly better than either the original structure prediction or keyword-based prediction methods alone. The keyword method performs well even though the literature comes from sequences with little (BLAST) detectable sequence homology. Rechtsteiner, A., et al. (2006)

Protein Structure Prediction (PSP): Literature-mining prediction is comparable to the best ab-initio methods in the absence of detectable sequence homology. Combining text mining with an ab-initio method leads to a 35% improvement over the ab-initio method alone. See [Rechtsteiner et al, 2006].

Rechtsteiner, A. [2005]. PhD Dissertation.

Characterizing gene regulation: SVD (“eigen-clustering”) of microarray data produces sets of co-expressed genes, which were then characterized with annotations automatically extracted from the literature [Rechtsteiner, 2005].
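
To illustrate how an SVD can yield sets of co-expressed genes, the sketch below approximates the leading singular vectors of a small expression matrix and groups genes by the sign and size of their loading. The details are assumptions made for illustration, not the procedure of [Rechtsteiner, 2005]: power iteration stands in for a full SVD routine, and the loading threshold, gene names and expression values are hypothetical.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration.

    Rows are genes and columns are conditions. Returns (u, sigma, v), where u
    holds one loading per gene and v is the corresponding "eigengene".
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Start from the first unit vector: a uniform start vector would be
    # orthogonal to every row of row-centered data.
    v = [1.0] + [0.0] * (n_cols - 1)
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]  # A v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return u, sigma, v


# Hypothetical row-centered expression values: 4 genes x 4 conditions.
genes = ["geneA", "geneB", "geneC", "geneD"]
expression = [
    [2.0, 2.0, -2.0, -2.0],
    [1.5, 2.5, -2.0, -2.0],
    [-2.0, -2.0, 2.0, 2.0],
    [0.1, -0.1, 0.1, -0.1],
]
loadings, sigma, eigengene = leading_singular_triplet(expression)

threshold = 0.3  # assumed cut-off on the absolute loading
up_group = {g for g, x in zip(genes, loadings) if x > threshold}
down_group = {g for g, x in zip(genes, loadings) if x < -threshold}
print(sorted(up_group), sorted(down_group))
```

Each resulting gene set could then be passed to a literature-mining step that attaches annotations to the group, which is the characterization stage described above.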


Project Members

Luis M. Rocha, PI

Jon Duke

Lang Li

Predrag Radivojac

Hagit Shatkay

Analia Lourenco

Ana Maguitman

Al Abi-Haidar

Michael Conover

Mohsen JafariAsbagh

Jasleen Kaur

Artemy Kolchinsky

Azadeh Nematzadeh

Andreas Rechtsteiner

Tiago Simas

Zhiping (Paul) Wang


Funding

Project partially funded by

  • Indiana University Collaborative Research Grants 2011. Project title: “Drug-Drug Interaction Prediction from Large-scale Mining of Literature and Patient Records”.
  • Fundação Luso-Americana para o Desenvolvimento (Portugal) and National Science Foundation (USA), 2012-2014. Project title: “Network Mining For Gene Regulation And Biochemical Signaling.” (171/11)


Selected Project Publications