Text Mining - University of Manchester · studentnet.cs.manchester.ac.uk/pgr/2012/CDTSeminar/... · 2012-10-22
Text Mining
Sophia Ananiadou Sophia.Ananiadou@manchester.ac.uk
National Centre for Text Mining www.nactem.ac.uk
NaCTeM - www.nactem.ac.uk
• The 1st publicly funded national text mining centre in the world
• Location: Manchester Interdisciplinary Biocentre
• Phase I - Biology (2005-2008)
• Phase II - Biology, Medicine, Social Sciences (2008-2011)
• Phase III - Medicine, Biology (2012-2016)
Sophia Ananiadou John McNaught
Text Mining Research Group
• Sophia Ananiadou sophia.ananiadou@manchester.ac.uk (MIB)
• John McNaught John.McNaught@manchester.ac.uk (MIB)
• Goran Nenadic GN@cs.man.ac.uk IT building, IT308
The problem with information overload and knowledge discovery
• Humans cannot easily:
– Keep up-to-date with all relevant literature
– Find relevant and precise information
– Synthesize information from many diverse sources
– Exploit the mass of information to generate hypotheses
– Discover new knowledge
S.Ananiadou
What is text mining?
• Extracts and discovers knowledge hidden in text
• Information access
• Knowledge discovery
• Semantic search, semantic metadata
– identifying concepts
– extracting facts/relations
– discovering implicit links
The Need for Text Mining
§ Full Papers
§ Abstracts
§ Clinical trials
§ Reports, discharge summaries
§ EHR
§ Textbooks, monographs
§ Grey content, online discussion forums
MEDLINE
• 2005: ~14M
• 2009: ~18M
• 2011: 21.2M (1/10/11)
Overwhelming information in textual, unstructured format
A new paradigm of sharing information and knowledge
Information Retrieval Databases
Semantic Web
Text Mining, NLP
Disciplines Merging Knowledge sharing
From Text to Knowledge: tackling the data deluge through text mining
Unstructured Text (implicit knowledge)
Structured content (explicit knowledge)
Information extraction
Semantic metadata
Knowledge Discovery
Information Retrieval
Text mining steps
• Information Retrieval yields all relevant texts
– Gathers, selects, filters documents that may prove useful
– Finds what is known
• Information Extraction extracts facts & events of interest to user
– Finds relevant concepts, facts about concepts
– Finds only what we are looking for
• Data Mining discovers unsuspected associations
– Combines & links facts and events
– Discovers new knowledge, finds new associations
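The three text mining steps (retrieval, extraction, mining) can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the corpus sentences, the `TERMS` list and the function names `retrieve`, `extract` and `mine` are invented here and are not NaCTeM tools. Real systems would use proper indexing, named entity recognition and statistical association measures.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; the sentences are invented for illustration only.
DOCS = [
    "Insulin lowers glucose in diabetes patients.",
    "Caffeine raises glucose in some patients.",
    "Albumin levels drop when insulin is low in diabetes.",
    "The seminar starts at noon.",
]

# Hypothetical term list standing in for a real lexicon or ontology.
TERMS = {"insulin", "glucose", "diabetes", "caffeine", "albumin"}


def retrieve(docs, query_terms):
    """Information retrieval: keep documents that mention any query term."""
    return [d for d in docs if any(t in d.lower() for t in query_terms)]


def extract(doc):
    """Information extraction: return the known terms found in one document."""
    tokens = {w.strip(".,").lower() for w in doc.split()}
    return sorted(tokens & TERMS)


def mine(term_lists):
    """Data mining: count how often each pair of terms shares a document."""
    pairs = Counter()
    for terms in term_lists:
        pairs.update(combinations(terms, 2))
    return pairs


relevant = retrieve(DOCS, TERMS)           # step 1: finds what is known
facts = [extract(d) for d in relevant]     # step 2: finds what we look for
associations = mine(facts)                 # step 3: links facts together
print(associations.most_common(1))
```

In this toy data the pair (diabetes, insulin) is counted twice, so it surfaces as the strongest association; the unrelated seminar sentence is filtered out at the retrieval step.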
• Extraction of terms and named entities (names of people, organisations, diseases, genes, etc.)
• Discovery of concepts allows semantic annotation and enrichment of documents
• Going a step further: extracting facts, events from text
• And even further… opinions, attitudes, certainty, contradictions…
– meta-knowledge
Impact of NLP-based text mining
• Improves clustering, classification of documents
• Improves information access by going beyond index terms, enabling semantic querying
• Enables even more advanced text mining applications
• Linking text with pathways
Structured Knowledge
From Text to Knowledge: NLP and Knowledge Extraction
Lexicons and ontologies
Knowledge Extraction
Tools
Text Annotation Tools
Who needs this stuff?
• Semantic Web community: we provide the semantics
• Computational Biology: we link text with networks/pathways
• Ontology: we populate ontologies from text, linking with Protégé
• Database curators: automatic update using evidence from text…
How TM is embedded in applications
• Semantic search from full papers, abstracts
• Hypothesis generator: mining direct and indirect associations
• Supporting systematic reviews
• Developing clinical trial recommender systems
• Extracting bioprocesses for cancer research
• Enriching, curating pathways with literature evidence
• Annotation environment for curators…
Which User Communities?
• Pharma
• Health/Medicine
• Finance
• Social sciences
• Digital Economy, Digital Libraries
• Google, IBM, Microsoft: all investing in text mining
• Everyone needs text mining to solve their knowledge management problems!
Text Mining: Layers upon layers
[Figure: layers of sophistication, running from a general solution to a highly customised solution with improved accuracy]
Layers (top to bottom): Interactions, Facts, Terms, Entities, POS, Words
Capabilities: simple keyword search à la Google™, term identification, named entity recognition, metadata extraction, information extraction, indicative summarisation, informative summarisation, database curation, Q&A services, hypothesis generation
Callouts: semantically annotate names, addresses, organisations or proteins; who, what, when and where?; enhance searching by looking for related keywords and phrases; choose between different meanings - ‘a dog lead’ or ‘a lead balloon’?; what does this do?
Retrieving related concepts
MEDLINE (21 million abstracts) FACTA+
diabetes diabetes
216,000 documents relevant to diabetes
Insulin, albumin, …
Diabetes is …
… when insulin is …
… lower albumin level
http://refine1-nactem.mc.man.ac.uk/facta/ Tsuruoka, Y. et al. (2008) Bioinformatics 24(21)
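Retrieving concepts related to a query term can be approximated by ranking co-occurring concepts. The sketch below uses pointwise mutual information over a handful of invented abstracts; it is a generic co-occurrence technique chosen for illustration, and the slides do not state that FACTA+ ranks results this way. The data and the function name `related_concepts` are assumptions.

```python
import math
from collections import Counter

# Each abstract is reduced to the set of concepts it mentions (invented data).
ABSTRACTS = [
    {"diabetes", "insulin"},
    {"diabetes", "insulin", "albumin"},
    {"diabetes", "albumin"},
    {"asthma", "albumin"},
    {"asthma", "steroid"},
]


def related_concepts(query, abstracts):
    """Rank concepts co-occurring with `query` by pointwise mutual information."""
    n = len(abstracts)
    # How many abstracts mention each concept.
    freq = Counter(c for a in abstracts for c in a)
    # How many abstracts mention each concept together with the query.
    co = Counter(c for a in abstracts if query in a for c in a if c != query)
    scores = {
        c: math.log2((co[c] / n) / ((freq[query] / n) * (freq[c] / n)))
        for c in co
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = related_concepts("diabetes", ABSTRACTS)
for concept, score in ranking:
    print(f"{concept}: {score:.2f}")
```

Here insulin ranks above albumin for the query "diabetes" because albumin also appears in abstracts that do not mention diabetes, which lowers its score.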
Extracting snippets of information
… However, further decreases in branched-chain amino acid levels indicate that caffeine might promote deeper fatigue than placebo
Extracting indirect associations
E-cadherin is associated with Parkinson’s disease via CASS4, SNAIL3, transcription factor EB, etc.
Directly associated concepts
Query: E-cadherin and GENIA:Negative_regulation
E-cadherin often appears with cancers
Indirectly associated concepts
Query: E-cadherin and GENIA:Negative_regulation
E-cadherin is indirectly associated with nervous system disorders (e.g., Alzheimer’s disease, Parkinson’s disease, epilepsy)
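One simple way to think about indirect association is a two-step link: the query concept co-occurs with an intermediate concept, which in turn co-occurs with a target that the query never co-occurs with directly. The sketch below encodes only the links named on these slides as a tiny graph; the function name `indirect_associations` and the graph representation are assumptions, not a description of how FACTA+ is implemented.

```python
from collections import defaultdict

# Direct associations as undirected concept pairs, taken from the slide
# examples. A real system would derive these from co-occurrence statistics.
DIRECT = [
    ("E-cadherin", "cancer"),
    ("E-cadherin", "CASS4"),
    ("E-cadherin", "SNAIL3"),
    ("CASS4", "Parkinson's disease"),
    ("SNAIL3", "Parkinson's disease"),
]


def indirect_associations(source, pairs):
    """Map each indirectly linked target to the intermediates that connect it."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    direct = graph[source]
    found = defaultdict(set)
    for mid in direct:
        for target in graph[mid]:
            # Keep only targets that are not already directly associated.
            if target != source and target not in direct:
                found[target].add(mid)
    return {target: sorted(mids) for target, mids in found.items()}


links = indirect_associations("E-cadherin", DIRECT)
print(links)
```

The result lists Parkinson's disease as reachable via CASS4 and SNAIL3, while cancer is excluded because it is already a direct association.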
Project: TM for cancer genomics
• Enhancing FACTA+ to deal with cancer genomics
• Mutations oncogenes
• Relations between treatments, genes, drugs
• Research into Information Extraction (named entity, relation, event mining)
• Collaboration with Medical School
Event extraction (EE)
Information extraction with
Ø Typed associations of arbitrary numbers of participants (n-ary)
Ø Events (processes / reactions) can participate in other events (recursive)
Ø Explicit identification of roles that participants play (Theme, Cause, ...)
Many resources, methods and applications introduced since 2009
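The three properties of events (n-ary, recursive, with explicit roles) map naturally onto a small data structure. The following is a minimal sketch: the class names `Entity`, `Argument` and `Event`, the `depth` helper and the example sentence are all hypothetical, chosen only to show how an event can fill a role in another event.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Entity:
    text: str
    type: str


@dataclass
class Argument:
    role: str                        # explicit role, e.g. "Theme" or "Cause"
    filler: Union[Entity, "Event"]   # an event may fill a role (recursion)


@dataclass
class Event:
    type: str
    trigger: str
    arguments: List[Argument] = field(default_factory=list)  # any number (n-ary)

    def depth(self) -> int:
        """Nesting depth: 1 for a flat event, more when events are arguments."""
        nested = [a.filler.depth() for a in self.arguments
                  if isinstance(a.filler, Event)]
        return 1 + max(nested, default=0)


# Invented sentence: "Protein B inhibits the expression of Protein A"
protein_a = Entity("Protein A", "Protein")
protein_b = Entity("Protein B", "Protein")
expression = Event("Gene_expression", "expression",
                   [Argument("Theme", protein_a)])
regulation = Event("Negative_regulation", "inhibits",
                   [Argument("Theme", expression),
                    Argument("Cause", protein_b)])

print(regulation.depth())
print([a.role for a in regulation.arguments])
```

The regulation event has depth 2 because its Theme is itself an event, which is the recursive case described on the slide.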
Project: extracting intentions
• Extracting information from full papers
• Classify facts according to the authors’ intentions
• http://www.nactem.ac.uk/meta-knowledge/
• Negation, speculation, contradiction
Nuances of language
• Argumentation, rhetorical intent, meta-knowledge
• Speculation
– Probable, possible, …
– Suggest, indicate, …
– May, might, would, …
• Manner: slightly, rapidly, greatly, …
• Polarity (negative, positive): no, never, …
• Such knowledge required for: discourse analysis, opinion mining, …
• If not taken into account, then results can be invalid and misleading
• Collaborative project with publishing company.
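The cue words listed above can be used for a very rough first pass at spotting meta-knowledge in a sentence. The sketch below is a plain word-list lookup, not the annotation scheme or classifier used in the project; the dictionary `CUES`, the dimension name "negation" and the function `tag_cues` are assumptions. It matches exact word forms only, so inflected cues such as "suggests" are missed.

```python
import re

# Cue words taken from the slide; the grouping into a dictionary is ours.
CUES = {
    "speculation": {"probable", "possible", "suggest", "indicate",
                    "may", "might", "would"},
    "manner": {"slightly", "rapidly", "greatly"},
    "negation": {"no", "never"},
}


def tag_cues(sentence):
    """Return the cue words found in the sentence, grouped by dimension."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {dim: [t for t in tokens if t in words]
            for dim, words in CUES.items()}


example = ("However, further decreases in branched-chain amino acid levels "
           "indicate that caffeine might promote deeper fatigue than placebo")
tags = tag_cues(example)
print(tags)
```

For the caffeine snippet shown earlier, this flags "indicate" and "might" as speculation cues, which is exactly the kind of hedging that makes a naive fact extractor misleading if ignored.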
Meta-knowledge annotation
Certainty level
Polarity
Analysis
Manner
Source
Public Health reviews
Unsupervised methods for Public Health Search
• Building on the clinical trials project
• Extracting information from literature
• Unsupervised methods + machine learning
• Summarisation
• Cooperation with Public Health (NICE: National Institute for Health and Clinical Excellence)
Finding evidence from full text
• In context of UKPMC
• Beyond full text search and pattern matching
• Deeply analyse documents off-line
• Index relationships
• Key off search term to dynamically generate, from indexed relationships, questions that have known answers
– Not auto-completion…
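Generating questions with known answers from indexed relationships can be sketched as follows. Everything here is an assumption made for illustration: the triples, the placeholder names ("compound X", "enzyme Z"), the tiny `PARTICIPLE` table and the function names. The slides do not describe how the UKPMC service stores or phrases its questions.

```python
from collections import defaultdict

# Hypothetical subject-verb-object triples, as if extracted off-line.
TRIPLES = [
    ("GO", "produce", "compound X"),
    ("GO", "produce", "compound Y"),
    ("enzyme Z", "activate", "GO"),
]

# Minimal lookup for passive question wording (illustrative only).
PARTICIPLE = {"produce": "produced", "activate": "activated"}


def build_index(triples):
    """Index each term by the role it plays and the verb, pointing at answers."""
    index = defaultdict(lambda: defaultdict(set))
    for subj, verb, obj in triples:
        index[subj][("subject", verb)].add(obj)
        index[obj][("object", verb)].add(subj)
    return index


def questions_for(term, index):
    """Generate questions about `term` whose answers are already indexed."""
    out = {}
    for (role, verb), answers in index.get(term, {}).items():
        if role == "subject":
            question = f"What is {PARTICIPLE[verb]} by {term}?"
        else:
            question = f"What {verb}s {term}?"
        out[question] = sorted(answers)
    return out


index = build_index(TRIPLES)
for question, answers in questions_for("GO", index).items():
    print(question, answers)
```

Because every question is built from an indexed relationship, each one is guaranteed to have at least one answer, which is what distinguishes this from auto-completion of arbitrary query strings.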
http://labs.ukpmc.ac.uk/evf
Fewer hits, now we click on a question
Known answers to “what is produced by GO”
We can find out more facts by investigating a document
Extracted subject-verb-object triples
Verbs are “domain verbs of interest”
Deep analysis reveals “hidden” subjects (passives undone)
Biomedical causality recognition
• Discovering new facts and connections
• Enriching existing pathways
• Creating new pathways
[Figure: Raw text, Named entities, Events, Causality, Pathways; relation label CAUSES]
Twitter analysis using text mining tools
• Twitter is:
– one of the most popular social media
– a new means of mass communication
– accessible to all
• The load of information is immense, thus automatic analysis is essential.
• In this project, the student will:
• use the text mining tools of NaCTeM (e.g. topic extraction, summarisation)
• exploit patterns and trends in Twitter feeds concerning specific topics or events
• Statistical analysis based on text mining analytics, noisy data
• Attempt to answer questions about the nature of Twitter, for example:
– the way Twitter influences human behaviour
– whether Twitter strengthens positive or negative emotions about an event
– whether it can motivate people to participate in a public protest
– whether it can agitate or allay panic during extreme natural phenomena such as floods, earthquake sequences and typhoons, etc.
Opinion and trend analysis using text mining
• Synthesis of multiple views about a topic, issue or product.
• Sources: reviews, newswire articles, blogs, and social media, such as Facebook, Twitter, Google+ and Myspace
• These sources are opinion repositories and logs of trends and lifestyle
• Opinion and trend analysis cuts across:
– information retrieval
– text mining
– automatic summarisation
– sentiment analysis.
• Research in this area includes:
– learning the semantic orientation and emotional stress of words
– scoring the sentiment of documents
– analysing opinions and attitudes etc.
John McNaught
Text Mining Research Group and
NaCTeM (Deputy Director) John.McNaught@manchester.ac.uk
It’s your PhD, not mine
• If you want me to supervise you in an area of interest to me, then I expect you to come up with at least a rough idea for a research proposal
– You’ll be more interested in working on something you “own”
– Whom would a top restaurant be more interested in employing?
• A cook who could show he was good at buying ready-made meals?
• Or a chef who could show he was capable of inventing a novel dish?
Proposals welcome in areas such as:
• Text mining
– Information extraction
• named entity recognition, relation extraction, fact or event extraction
– Opinion mining (sentiment mining)
– Presentation of complex text mining results to users, interaction aspects, search aspects
• Issues in resource building for NLP/TM
– Lexicons, terminologies, annotated corpora
Proposals welcome…
• Mapping between the language of experts and the language of non-experts
– Many non-experts attempt to use/understand specialised sources (health problems, …)
• Writing aids
– TM is applied post-creation of document, no author present
• Ambiguity greatest problem
– Why not create semantic metadata as author constructs document, resolve ambiguities, propose extracted events, link document to knowledge sphere?
Proposals welcome…
• If you have domain or language expertise
– Proposals can be oriented towards that domain or language
– Although finding appropriate resources (lexica, corpora, language processing tools) may be a severe issue where NLP/TM is underdeveloped or nascent for some language
• (so that might give further ideas)
• TM is of interest also to those in humanities, social sciences, law, ..., so plenty of scope for topics in such domains (e.g. linking historical personages and historical events)
Some projects PhD students of mine have worked or are working on
• Arabic named entity recognition
– Hard because no capitalisation, lack of diacritics in MSA, ambiguity of names with common nouns
• Opinion mining for Arabic
• Machine learning of template extraction rules
– To help grammar rule writers
• Automatic generation of semantic clusters from definitions
– To help with “tip of tongue” phenomenon and with communication among experts from different domains
• Lexical simplification for accessibility and low-literacy support
Information on NaCTeM
• All our services are here: http://www.nactem.ac.uk/services.php
• Our tools are here: http://www.nactem.ac.uk/software.php
• Our publications: http://www.nactem.ac.uk/aigaion2/index.php?/publications
Possible projects
Identification of conflicting information in biological literature
• Aim: finding statements that express some degree of difference/conflict, e.g.
Protein A is highly expressed in T-cells
T-cells show reduced expression of Protein A
• Build on previous work (completed PhD)
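To make the aim concrete, here is a deliberately naive sketch that flags two statements as conflicting when they mention the same entities but use cue words pointing in opposite directions. The cue lists `UP` and `DOWN`, the function names and the rule itself are assumptions made for illustration; this is not the method of the completed PhD mentioned above, and real conflict detection needs parsing, normalisation and context handling.

```python
import re

# Hypothetical cue words for the direction of an expression change.
UP = {"highly", "increased", "elevated"}
DOWN = {"reduced", "decreased", "low"}


def expression_direction(sentence):
    """Return 'up', 'down', or None when the cues are absent or mixed."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if tokens & UP and not tokens & DOWN:
        return "up"
    if tokens & DOWN and not tokens & UP:
        return "down"
    return None


def in_conflict(s1, s2, entities):
    """True when both sentences mention all entities with opposite directions."""
    same_topic = all(e.lower() in s1.lower() and e.lower() in s2.lower()
                     for e in entities)
    d1, d2 = expression_direction(s1), expression_direction(s2)
    return same_topic and d1 is not None and d2 is not None and d1 != d2


a = "Protein A is highly expressed in T-cells"
b = "T-cells show reduced expression of Protein A"
c = "Protein A is elevated in T-cells"
entities = ["Protein A", "T-cells"]

print(in_conflict(a, b, entities))
print(in_conflict(a, c, entities))
```

The slide's two example statements are flagged as conflicting, while the pair that agrees on direction is not.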
Possible projects
Support for logical modelling in systems biomedicine
• Aim: extract information to construct quantitative computational models of metabolic functions or diseases
– involves literature mining and data integration, but also some mathematical skills (e.g. logical models and simulations)
– one modelling project already running in a similar area
• Multi-disciplinary supervisory team (from Life Sciences)
Possible projects
Clinical and health-care text mining
– Aim: enable clinical decision support by extracting and aggregating textual health data
– Extraction and structuring of patient-specific information from health-care records, literature and patient-generated sources
• combining text mining, ontologies and data analytics
– Multi-disciplinary supervisory teams (local hospitals: Christie, Children’s Hospital, Hope)
Possible projects
Integrated data and text mining
• Aim: combine data that comes from multi-modal sources, e.g. structured and unstructured
– e.g. integration of clinical/experimental data
• Many challenging questions to be asked:
– how to combine different types of data, weights etc.
– defining kernel-based similarity methods to be used in machine learning
• Requires good maths and computing skills
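One common way to combine structured and unstructured data in a kernel method is a weighted sum of one kernel per data type. The sketch below pairs an RBF kernel on numeric measurements with a cosine kernel on bag-of-words text; the record layout (`values`, `note`), the example records and the default weight are invented for illustration and are not a method proposed in these slides.

```python
import math
from collections import Counter


def rbf_kernel(x, y, gamma=0.5):
    """Similarity of two numeric vectors (structured data)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def text_kernel(s, t):
    """Cosine similarity of bag-of-words counts (unstructured data)."""
    a, b = Counter(s.lower().split()), Counter(t.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())
                     * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined_kernel(p, q, weight=0.5):
    """Weighted sum of the two kernels; `weight` sets the share of structured data."""
    return (weight * rbf_kernel(p["values"], q["values"])
            + (1 - weight) * text_kernel(p["note"], q["note"]))


# Hypothetical records mixing measurements with a free-text note.
r1 = {"values": [1.0, 0.2], "note": "patient reports fatigue"}
r2 = {"values": [0.9, 0.3], "note": "patient reports mild fatigue"}
r3 = {"values": [5.0, 4.0], "note": "routine check no complaints"}

print(combined_kernel(r1, r2))
print(combined_kernel(r1, r3))
```

A weighted sum with non-negative weights keeps the result a valid kernel when both parts are valid kernels, so it can be passed to any kernel-based learner. Choosing the weight is one of the open questions the slide raises.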
Contact
• Goran Nenadic email: GN@cs.man.ac.uk IT building, IT308 http://gnode1.mib.man.ac.uk
• Small scale pilot projects around these topics will be available