Mining literature to improve biological knowledge extraction by microarray transcriptional profiling
Department of Clinical and Biological Sciences, Turin University.
Department of Genetics, General and Molecular Biology, Naples.
Department of Mathematics and Information Science, Italy.
CINECA, Italy.
R.A. Calogero, G. Iazzetti, S. Motta, G. Pedrazzi, S. Rago, E. Rossi, R. Turra
Data Mining applications in biological fields
• On sequence databases / molecular structures
Protein structure prediction, homology search, genomic sequence analysis, gene identification and mapping, gene expression microarrays, …
• On biomedical literature databases
Identification and classification of biological terms, identification of keywords and concepts, clustering, supervised classification, …
Biomedical literature analysis: state of the art
Two different approaches:
• Information Extraction
Application of Natural Language Processing techniques to produce structured representations (templates). Entities and relations must be defined before they are extracted from the texts. Syntactic and semantic analysis drive the extraction.
• Text Mining
Identification of word patterns inside the document corpus. No entities are defined in advance, which allows new concepts and new relations to be identified. No semantic analysis is involved.
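As an illustration of what "word patterns inside the document corpus" can mean in the simplest case, the sketch below counts how many documents contain each pair of words and keeps the pairs that recur. It is not the method used in this work: the toy corpus, the stopword set and the threshold of two documents are assumptions chosen only for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; real input would be lemmatized abstracts.
corpus = [
    "glioma cell proliferation index",
    "astrocyte cell proliferation pattern",
    "glioma grade and proliferation index",
]

def cooccurrence_counts(documents, stopwords=frozenset({"and"})):
    """Count how many documents contain each unordered pair of words."""
    counts = Counter()
    for text in documents:
        # Use a sorted set so each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        words = sorted(set(text.lower().split()) - stopwords)
        counts.update(combinations(words, 2))
    return counts

pair_counts = cooccurrence_counts(corpus)

# Pairs seen in at least two documents are candidate word patterns.
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 2}
```

No entity is declared beforehand: any pair of words can surface, which is the property that distinguishes this approach from template-based Information Extraction.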
Text Mining - The KDD process
Databases, Web sites, …
→ documents selection → Target documents
→ grammatical analysis and lemmatization → Transformed documents
→ meta-information extraction → Keywords
→ text mining → Patterns
→ interpretation and validation of results → Knowledge
>>> 35:TOYOTA: Avalon Receives Top Score in Frontal Offset Crash Tests
Toyota Motor Corp.'s Avalon received the top score -- a "good" rating earning a "best pick" -- in the 40 mile per hour frontal offset crash tests on new or updated vehicles. The tests were conducted by the Insurance Institute for Highway Safety, a nonprofit group funded by automobile insurers. Nissan Motor Co. Ltd.'s Maxima midsize sedan and Infiniti I30 luxury sedan, the Nissan Sentra small car and Mazda Motor Corp.'s Mazda MPV minivan all scored "average" marks. Isuzu Motors Ltd.'s Rodeo sport utility, also sold by Honda Motor Co. Ltd. as the Honda Passport, earned a "poor" rating due to high crash forces recorded on the crash dummy's head, indicating an increased likelihood of injury. In the crash tests, the vehicles were driven into a deformable barrier at 40 mph, with the driver's side of the vehicle taking the impact. The tests measured the potential for injury to the head, neck, chest and foot areas, and the risk of intrusion into the passenger compartment.
SUBJECTS: Japan; Safety; Passenger Vehicles
SOURCE: Reuters, June 21, 2000; Japan; English
tn.5.26.35 SOURCE Reuters
tn.5.26.35 DATE 6/21/2000
tn.5.26.35 MONTHYEAR 2000_06
tn.5.26.35 SUBJECTS Japan
tn.5.26.35 SUBJECTS Passenger_Vehicles
tn.5.26.35 SUBJECTS Safety
tn.5.26.35 STATE Japan
tn.5.26.35 LANGUAGE English
tn.5.26.35 ORG2 TOYOTA
tn.5.26.35 NN area
tn.5.26.35 NN automobile
tn.5.26.35 NN average
tn.5.26.35 NN barrier
tn.5.26.35 NN car
tn.5.26.35 NN chest
tn.5.26.35 NN compartment
tn.5.26.35 NN crash
tn.5.26.35 NN driver
tn.5.26.35 NN dummy
tn.5.26.35 NN foot
tn.5.26.35 NN force
tn.5.26.35 NN group
tn.5.26.35 NN head
tn.5.26.35 NN hour
tn.5.26.35 NN impact
tn.5.26.35 NN injury
tn.5.26.35 NN insurer
tn.5.26.35 NN intrusion
tn.5.26.35 NN likelihood
tn.5.26.35 NN luxury
tn.5.26.35 NN mark
tn.5.26.35 NN mile
tn.5.26.35 NN neck
tn.5.26.35 NN offset
tn.5.26.35 NN passenger
tn.5.26.35 NN potential
tn.5.26.35 NN rating
tn.5.26.35 NN risk
tn.5.26.35 NN safety
tn.5.26.35 NN score
tn.5.26.35 NN sedan
tn.5.26.35 NN side
tn.5.26.35 NN sport
tn.5.26.35 NN test
tn.5.26.35 NN utility
tn.5.26.35 NN vehicle
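The tagged output above is a flat list of records of the form "document id, tag, value", where a tag such as SUBJECTS or NN can repeat for the same document. A minimal sketch of how such records could be read back into a per-document structure follows; the function name `parse_records` and the nested-dictionary layout are assumptions made for this example, not part of the original tool.

```python
from collections import defaultdict

# A few records copied from the tagged output: "<doc id> <tag> <value>".
records = """\
tn.5.26.35 SOURCE Reuters
tn.5.26.35 DATE 6/21/2000
tn.5.26.35 SUBJECTS Japan
tn.5.26.35 SUBJECTS Passenger_Vehicles
tn.5.26.35 ORG2 TOYOTA
tn.5.26.35 NN crash
tn.5.26.35 NN vehicle
"""

def parse_records(text):
    """Group values by document id and tag; a tag may occur several times."""
    docs = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields so values may contain spaces.
        doc_id, tag, value = line.split(maxsplit=2)
        docs[doc_id][tag].append(value)
    return docs

docs = parse_records(records)
toyota = docs["tn.5.26.35"]
```

Here `toyota["NN"]` holds the lemmatized keywords of the article and `toyota["SUBJECTS"]` its metadata, so the two kinds of information can be mined together or separately.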
Tagging
Documents selection
Gene or protein
1,400,000 abstracts
The process
1) Identification of the different parts of a document
20000219 NN astrocyte
20000219 NN brain
20000219 NN case
20000219 NN cell
20000219 NN control
20000219 NN disease
20000219 NN distribution
20000219 NN expression
20000219 NN frequency
20000219 NN glioma
20000219 NN grade
20000219 NN index
20000219 NN lesion
20000219 NN pattern
20000219 NN process
20000219 NN proliferation
20000219 NN rat
20000219 NN specimen
20000219 NN staining
20000219 NN subset
20000219 NN tumor
20000219 AD Department of Neurosurgery, Shiga University of Medical Science, Ohtsu, Japan
ABNORMAL  GAS      MIN            SEX
ACT       GEL      MINOR          SIDE
ACTS      GREAT    NET            SKIN
AIR       HAND     NON            SMOOTH
ALPHA     HER      NONE           SPIN
ARM       HIS      OLD            STEP
BEST      HOMOLOG  OUT            SUB
BETA      HOW      PAST           TERM
BIS       III      POINT          TRANSCRIPTIONAL
CONTACT   KEY      POLE           TYPE1
DELTA     KILLER   POLY           TYPE-I
DYE       KIT      PRE            UPSTREAM
EARLY     LACK     PRO            WHITE
END       LARGE    PROTEINKINASE  WHO
FACT      LIGHT    RAY
FAR       MAP      RED
FAT       MEN      RING
FISH      MET      SALT
GAMMA     MICE     SEA
GAP       MID      SET
For these aliases we ran a constrained search or no search at all
KIT <near/6> (protein <or> gene <or> product)
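The query above keeps an ambiguous alias only when a disambiguating word appears close to it. The sketch below emulates that constraint in plain Python, assuming that `<near/6>` means "within six tokens, in either order" and that matching is case-insensitive; the exact semantics of the search engine used by the authors are not given in the slides, and the function name `near_match` is hypothetical.

```python
import re

# Context words taken from the query: (protein <or> gene <or> product).
CONTEXT_WORDS = {"protein", "gene", "product"}

def near_match(text, alias, context=CONTEXT_WORDS, window=6):
    """Return True if `alias` occurs within `window` tokens of a context word."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    alias = alias.lower()
    alias_pos = [i for i, tok in enumerate(tokens) if tok == alias]
    context_pos = [i for i, tok in enumerate(tokens) if tok in context]
    # Accept the alias if any occurrence is close enough to any context word.
    return any(abs(i - j) <= window for i in alias_pos for j in context_pos)

hit = near_match("Mutations in the KIT gene were found in tumor specimens.", "KIT")
miss = near_match("The staining kit was supplied with buffer and dye.", "KIT")
```

With these two sentences `hit` is true and `miss` is false. The sketch matches exact tokens only, so plural forms such as "genes" would need lemmatization first.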
Terms recognition - open problems
• Non-standardised terminology (different conventions)
• Open vocabulary (new terms are continually added)
• Use of abbreviations, upper case/lower case, names that describe the function, …
• Synonyms
• Term class cross-over (e.g., proteins named after the related DNA)
• Prepositions and conjunctions (ambiguity in the interpretation of dependence)
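Of the problems listed above, only the surface variation (upper case/lower case, hyphens, spacing) lends itself to a purely mechanical treatment. The sketch below groups such variants under a common key; the normalization rule and the sample terms are assumptions chosen for illustration, and synonyms or class cross-over would still require a curated dictionary.

```python
import re

def normalize_term(term):
    """Map a surface form to a key: lower case, no spaces, hyphens or underscores."""
    return re.sub(r"[\s_-]+", "", term.lower())

def group_variants(terms):
    """Collect the surface forms that share the same normalized key."""
    groups = {}
    for term in terms:
        groups.setdefault(normalize_term(term), []).append(term)
    return groups

# Hypothetical surface forms as they might appear in different abstracts.
variants = ["Protein Kinase", "PROTEINKINASE", "protein-kinase", "Kit", "KIT"]
groups = group_variants(variants)
```

The three spellings of "protein kinase" end up under one key and the two spellings of "KIT" under another. Aggressive normalization of this kind can also merge terms that should stay distinct, so it is best combined with a context constraint.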