A Technical Introduction to the Semantic Search Engine SeMedico
Erik Fäßler
Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Jena, Germany
http://www.julielab.de
Talk in the semester project "Entwicklung einer Suchmaschine für Alternativmethoden zu Tierversuchen" (Development of a Search Engine for Alternative Methods to Animal Experiments)
January 12, 2018, Humboldt-Universität zu Berlin
• Lucene generates index terms via “text analysis”
  – Tokenization, case folding, synonym enrichment, stemming
  – ElasticSearch does the same on the document text sent to it
• How to integrate UIMA?
• First idea: Create a Lucene UIMA analyzer, but
  – Moves a lot of processing requirements into the ElasticSearch cluster
  – Requires loading dictionaries and machine learning models
  – Takes memory away from Lucene and ElasticSearch
  – Overall: Diminishes search performance
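The “text analysis” steps named above (tokenization, case folding, synonym enrichment, stemming) can be illustrated with a toy sketch. This is not Lucene's or ElasticSearch's implementation: the synonym table, the suffix-stripping rule, and the function names `tokenize`, `stem`, and `analyze` are assumptions made up for illustration.

```python
import re

# Hypothetical synonym table; a real system would load a curated dictionary.
SYNONYMS = {"tumor": ["neoplasm"], "mtor": ["frap1"]}


def tokenize(text):
    """Split text into (token, start, end) triples on alphanumeric runs."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def stem(term):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def analyze(text):
    """Return index terms as (term, start, end, position_increment) tuples."""
    terms = []
    for raw, start, end in tokenize(text):
        folded = raw.lower()                          # case folding
        terms.append((stem(folded), start, end, 1))   # stemming
        for synonym in SYNONYMS.get(folded, []):      # synonym enrichment
            # Increment 0 places the synonym at the same position as the token.
            terms.append((stem(synonym), start, end, 0))
    return terms


for term in analyze("Phosphorylated mTOR in tumor cells"):
    print(term)
```

Running such a chain inside the search cluster is cheap for simple steps like these. The concern raised on the slide is that UIMA components with dictionaries and machine learning models would make this stage far heavier.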
ElasticSearch III
• JULIE Lab ElasticSearch plugin to specify index terms exactly, without ES-internal analysis
  – https://github.com/JULIELab/elasticsearch-mapper-preanalyzed
• Employs the JSON format created for the Solr JsonPreAnalyzedParser
  – https://lucene.apache.org/solr/guide/6_6/working-with-external-
• The JSON is created by a CAS consumer that is currently internal to JULIE Lab
ElasticSearch IV
Preanalyzed Format
{"v":"1",
"str":"Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR (p-mTOR), phosphorylated p70S6K (p-p70S6K), phosphorylated 4E-binding protein 1 (p-4E-BP1), and Ki-67 using 105 surgically resected ESCC correlated with treatment outcome.",
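A minimal sketch of how such a preanalyzed document could be assembled outside the search cluster. The slide only shows the `"v"` and `"str"` keys. The `"tokens"` array with `"t"` (term), `"s"` (start offset), `"e"` (end offset), and `"i"` (position increment) follows the Solr JsonPreAnalyzedParser format as I understand it and should be checked against the Solr documentation. The helper `to_preanalyzed` and the term `concept:mtor` are hypothetical.

```python
import json


def to_preanalyzed(text, tokens):
    """Serialize a text and its precomputed tokens as a preanalyzed JSON string.

    tokens: iterable of (term, start, end, position_increment) tuples.
    """
    return json.dumps(
        {
            "v": "1",     # format version, as on the slide
            "str": text,  # the original text value
            "tokens": [
                {"t": term, "s": start, "e": end, "i": inc}
                for term, start, end, inc in tokens
            ],
        },
        ensure_ascii=False,
    )


text = "phosphorylated mTOR (p-mTOR)"
tokens = [
    ("phosphorylated", 0, 14, 1),
    ("mtor", 15, 19, 1),
    # Hypothetical concept term stacked on the same position (increment 0).
    ("concept:mtor", 15, 19, 0),
]
print(to_preanalyzed(text, tokens))
```

Because the tokens are computed upstream (for example by a UIMA pipeline), the search cluster can index the listed terms as given instead of running its own analysis on the text.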
Conclusion
[Figure: system overview diagram. Labels: Doc, MEDLINE, JULIE Lab Server, PostgreSQL, CR, AE, AE, AE, CO, ElasticSearch, Concept Database, SeMedico Web Application (Java Servlet), Frontend (Tapestry/JavaScript), NCBI Gene]
http://www.semedico.org/