Text Mining - University of Manchester · studentnet.cs.manchester.ac.uk/pgr/2012/CDTSeminar/... · 2012-10-22

Text Mining

Sophia Ananiadou Sophia.Ananiadou@manchester.ac.uk

National Centre for Text Mining www.nactem.ac.uk

NaCTeM - www.nactem.ac.uk
•  The 1st publicly funded national text mining centre in the world
•  Location: Manchester Interdisciplinary Biocentre
•  Phase I - Biology (2005-2008)
•  Phase II - Biology, Medicine, Social Sciences (2008-2011)
•  Phase III - Medicine, Biology (2012-2016)

Sophia Ananiadou John McNaught

Text Mining Research Group

•  Sophia Ananiadou sophia.ananiadou@manchester.ac.uk (MIB)

•  John McNaught John.McNaught@manchester.ac.uk (MIB)

•  Goran Nenadic GN@cs.man.ac.uk IT building, IT308

The problem with information overload and knowledge discovery

•  Humans cannot easily:
–  Keep up-to-date with all relevant literature
–  Find relevant and precise information
–  Synthesize information from many diverse sources
–  Exploit the mass of information to generate hypotheses
–  Discover new knowledge

S.Ananiadou

What is text mining?

•  Extracts and discovers knowledge hidden in text

•  Information access
•  Knowledge discovery
•  Semantic search, semantic metadata
–  identifying concepts
–  extracting facts/relations
–  discovering implicit links

The Need for Text Mining

§  Full Papers

§  Abstracts

§  Clinical trials

§  Reports, discharge summaries

§  EHR

§  Textbooks, monographs

§  Grey content, online discussion forums

MEDLINE
•  2005: ~14M
•  2009: ~18M
•  2011: 21.2M (1/10/11)

Overwhelming information in textual, unstructured format

A new paradigm of sharing information and knowledge

Information Retrieval Databases

Semantic Web

Text Mining, NLP

Disciplines Merging Knowledge sharing

From Text to Knowledge: tackling the data deluge through text mining

Unstructured Text (implicit knowledge)

Structured content (explicit knowledge)

Information extraction

Semantic metadata

Knowledge Discovery

Information Retrieval


Text mining steps

•  Information Retrieval yields all relevant texts
–  Gathers, selects, filters documents that may prove useful
–  Finds what is known
•  Information Extraction extracts facts & events of interest to user
–  Finds relevant concepts, facts about concepts
–  Finds only what we are looking for
•  Data Mining discovers unsuspected associations
–  Combines & links facts and events
–  Discovers new knowledge, finds new associations
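The three steps above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the three-sentence corpus, the term dictionary and the function names `retrieve`, `extract_terms` and `mine_cooccurrences` are hypothetical, and real systems rely on indexing, trained recognisers and statistical association measures rather than substring matching and raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; real input would be abstracts or full papers.
DOCS = [
    "Insulin resistance is a hallmark of type 2 diabetes.",
    "Lower albumin levels were observed in patients with diabetes.",
    "Caffeine intake was measured in healthy volunteers.",
]

# Hypothetical term dictionary standing in for a term/named-entity recogniser.
TERMS = {"insulin", "albumin", "diabetes", "caffeine"}


def retrieve(docs, query):
    """Information retrieval (toy): keep documents that mention the query."""
    return [d for d in docs if query.lower() in d.lower()]


def extract_terms(doc):
    """Information extraction (toy): dictionary terms found in one document."""
    words = {w.strip(".,;").lower() for w in doc.split()}
    return sorted(words & TERMS)


def mine_cooccurrences(docs):
    """Data mining (toy): count how often two terms share a document."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(extract_terms(doc), 2):
            counts[pair] += 1
    return counts


relevant = retrieve(DOCS, "diabetes")
pairs = mine_cooccurrences(relevant)
print(pairs)
```

Each stage narrows or enriches the output of the previous one: retrieval selects documents, extraction turns them into terms, and mining links the terms across documents.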


•  Extraction of terms and named entities (names of people, organisations, diseases, genes, etc)

•  Discovery of concepts allows semantic annotation and enrichment of documents

•  Going a step further: extracting facts, events from text

•  And even further… opinions, attitudes, certainty, contradictions…
–  meta-knowledge

Impact of NLP-based text mining

•  Improves clustering, classification of documents

•  Improves information access by going beyond index terms, enabling semantic querying

•  Enables even more advanced text mining applications

•  Linking text with pathways

Structured Knowledge

From Text to Knowledge: NLP and Knowledge Extraction

Lexicons and ontologies

Knowledge Extraction

Text Annotation Tools


Who needs this stuff?

•  Semantic Web community: we provide the semantics

•  Computational Biology: we link text with networks/pathways

•  Ontology: we populate ontologies from text, linking with Protégé

•  Database curators: automatic update using evidence from text….

•  Semantic search from full papers, abstracts
•  Hypothesis generator: mining direct and indirect associations
•  Supporting systematic reviews
•  Developing clinical trial recommender systems
•  Extracting bioprocesses for cancer research
•  Enriching, curating pathways with literature evidence
•  Annotation environment for curators….

How TM is embedded in applications

Which User Communities?
•  Pharma
•  Health/Medicine
•  Finance
•  Social sciences
•  Digital Economy, Digital Libraries
•  Google, IBM, Microsoft: all investing in text mining
•  Everyone needs text mining to solve their knowledge management problems!

Text Mining: Layers upon layers

[Figure: layers of sophistication in text mining, running from a general solution to a highly customised solution with improved accuracy.
Levels: Words, Entities, Interactions.
Techniques and applications shown: simple keyword search à la Google™; term identification; named entity recognition; information extraction; metadata extraction; database curation; indicative summarisation; informative summarisation; Q&A services.
Callouts: "Enhance searching by looking for related keywords and phrases"; "Choose between different meanings - ‘a dog lead’ or ‘a lead balloon’?"; "Semantically annotate names, addresses, organisations or proteins"; "Who, What, When and Where?"; "What does this do?"; "Generate hypotheses".]

Retrieving related concepts

MEDLINE (21 million abstracts) FACTA+

diabetes diabetes

216,000 documents relevant to diabetes

Insulin, albumin, …

Diabetes is …

… when insulin is …

… lower albumin level

http://refine1-nactem.mc.man.ac.uk/facta/ Tsuruoka, Y. et al. (2008) Bioinformatics 24(21)

Click!

… However, further decreases in branched-chain amino acid levels indicate that caffeine might promote deeper fatigue than placebo

Extracting snippets of information

Extracting indirect associations

E-cadherin is associated with Parkinson’s disease via CASS4, SNAIL3, transcription factor EB, etc.
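One common reading of an indirect association is that two concepts never co-occur directly but share intermediate concepts. The sketch below assumes that reading; it does not describe how FACTA+ actually computes or ranks its results. The table `DIRECT` is invented for illustration: the intermediate names come from the slide, but "Concept X" and "Concept Y" are placeholders and none of the links should be read as biological evidence.

```python
# Hypothetical table: concept -> concepts it directly co-occurs with.
DIRECT = {
    "E-cadherin": {"CASS4", "SNAIL3", "transcription factor EB"},
    "Parkinson's disease": {"CASS4", "SNAIL3", "transcription factor EB"},
    "Concept X": {"CASS4"},
    "Concept Y": {"Concept Z"},
}


def indirect_associations(direct, source):
    """Return {target: shared intermediates} for targets that are not
    directly linked to `source` but share at least one neighbour with it."""
    neighbours = direct.get(source, set())
    result = {}
    for target, target_neighbours in direct.items():
        if target == source or target in neighbours:
            continue  # skip the source itself and its direct associations
        shared = neighbours & target_neighbours
        if shared:
            result[target] = sorted(shared)
    return result


links = indirect_associations(DIRECT, "E-cadherin")
for target, via in links.items():
    print(f"E-cadherin ~ {target} via {', '.join(via)}")
```

A real hypothesis generator would also need to score each candidate link, since shared neighbours alone produce many spurious pairs.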


Directly associated concepts

Query: E-cadherin and GENIA:Negative_regulation

E-cadherin often appears with cancers

Indirectly associated concepts

Query: E-cadherin and GENIA:Negative_regulation

E-cadherin is indirectly associated with nervous system disorders (e.g., Alzheimer’s disease, Parkinson’s disease, epilepsy)

Project: TM for cancer genomics

•  Enhancing FACTA+ to deal with cancer genomics
•  Mutations, oncogenes
•  Relations between treatments, genes, drugs
•  Research into Information Extraction (named entity, relation, event mining)
•  Collaboration with Medical School

Event extraction (EE)

Information extraction with
•  Typed associations of arbitrary numbers of participants (n-ary)
•  Events (processes / reactions) can participate in other events (recursive)
•  Explicit identification of roles that participants play (Theme, Cause, ...)
Many resources, methods and applications introduced since 2009
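The three properties of events (n-ary, recursive, explicit roles) map naturally onto a small data structure. The sketch below is one possible representation, not the format used by any particular event extraction system; the class names, the example sentence "Protein B inhibits the expression of Protein A" and the `depth` helper are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Entity:
    text: str
    entity_type: str


@dataclass
class Event:
    event_type: str
    trigger: str
    # (role, participant) pairs: any number of participants (n-ary),
    # and a participant may itself be an Event (recursive).
    arguments: List[Tuple[str, Union[Entity, "Event"]]] = field(default_factory=list)

    def depth(self) -> int:
        """Nesting depth: 1 for a flat event, more when events are nested."""
        nested = [p.depth() for _, p in self.arguments if isinstance(p, Event)]
        return 1 + max(nested, default=0)


# Hypothetical sentence: "Protein B inhibits the expression of Protein A."
expression = Event(
    "Gene_expression", "expression",
    [("Theme", Entity("Protein A", "Protein"))],
)
regulation = Event(
    "Negative_regulation", "inhibits",
    [("Cause", Entity("Protein B", "Protein")), ("Theme", expression)],
)
print(regulation.depth())
```

Here the regulation event takes another event as its Theme, which is what distinguishes event extraction from flat binary relation extraction.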


Project: extracting intentions

•  Extracting information from full papers
•  Classify facts according to the authors’ intentions
•  http://www.nactem.ac.uk/meta-knowledge/
•  Negation, speculation, contradiction

Nuances of language
•  Argumentation, rhetorical intent, meta-knowledge
•  Speculation
–  Probable, possible, …
–  Suggest, indicate, …
–  May, might, would, …
•  Manner: slightly, rapidly, greatly, …
•  Polarity (negative, positive): no, never, …
•  Such knowledge required for: discourse analysis, opinion mining, …
•  If not taken into account, then results can be invalid and misleading
•  Collaborative project with publishing company.
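The cue words listed above can be turned into a very rough tagger. This is only a sketch: the cue sets are copied from the slide (with "not" added as an assumption), the function name `tag_cues` is hypothetical, and exact word matching ignores inflection ("suggests"), scope and context, all of which a real meta-knowledge annotator has to handle.

```python
import re

# Cue words from the slide; "not" is an added assumption.
SPECULATION_CUES = {"probable", "possible", "suggest", "indicate",
                    "may", "might", "would"}
NEGATION_CUES = {"no", "never", "not"}
MANNER_CUES = {"slightly", "rapidly", "greatly"}


def tag_cues(sentence):
    """Return the speculation, polarity and manner cues found in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    negated = any(t in NEGATION_CUES for t in tokens)
    return {
        "speculation": [t for t in tokens if t in SPECULATION_CUES],
        "polarity": "negative" if negated else "positive",
        "manner": [t for t in tokens if t in MANNER_CUES],
    }


# Snippet quoted earlier in these slides.
snippet = ("further decreases in branched-chain amino acid levels indicate "
           "that caffeine might promote deeper fatigue than placebo")
tagged = tag_cues(snippet)

# Hypothetical second sentence, invented for illustration.
other = tag_cues("The treatment did not rapidly reduce symptoms.")
print(tagged)
print(other)
```

Even this crude pass shows why the cues matter: the snippet reports a possibility, not an established fact, and an extraction system that drops "might" would overstate the finding.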


Meta-knowledge annotation

Certainty level

Polarity

Analysis

Manner

Source


Public Health reviews


Unsupervised methods for Public Health Search

•  Building on the clinical trials project
•  Extracting information from literature
•  Unsupervised methods + machine learning
•  Summarisation
•  Cooperation with Public Health (NICE: National Institute for Health and Clinical Excellence)

Finding evidence from full text

•  In context of UKPMC
•  Beyond full text search and pattern matching
•  Deeply analyse documents off-line
•  Index relationships
•  Key off search term to dynamically generate from indexed relationships questions that have known answers
–  Not auto-completion…

http://labs.ukpmc.ac.uk/evf

Fewer hits, now we click on a question

Known answers to “what is produced by GO”

We can find out more facts by investigating a document

Extracted subject-verb-object triples

Verbs are “domain verbs of interest”
Deep analysis reveals “hidden” subjects (passives undone)

Biomedical causality recognition
•  Discovering new facts and connections
•  Enriching existing pathways
•  Creating new pathways

CAUSES

Named entities Events Causality Pathways Raw

Twitter analysis using text mining tools

•  Twitter is:
–  one of the most popular social media
–  a new means of mass communication
–  accessible to all
•  The load of information is immense, thus automatic analysis is essential.
•  In this project, the student will:
    •  use the text mining tools of NaCTeM (e.g. topic extraction, summarisation)
    •  exploit patterns and trends in Twitter feeds concerning specific topics or events
    •  carry out statistical analysis based on text mining analytics, noisy data
    •  attempt to answer questions about the nature of Twitter, for example:
        –  the way Twitter influences human behaviour
        –  whether Twitter strengthens positive or negative emotions about an event
        –  whether it can motivate people to participate in a public protest
        –  whether it can agitate or allay panic during extreme natural phenomena such as floods, earthquake sequences and typhoons, etc.

Opinion and trend analysis using text mining

•  Synthesis of multiple views about a topic, issue or product.
•  Sources: reviews, newswire articles, blogs, and social media, such as Facebook, Twitter, Google+ and MySpace
•  These sources are opinion repositories and logs of trends and lifestyle
•  Opinion and trend analysis cuts across:
–  information retrieval
–  text mining
–  automatic summarisation
–  sentiment analysis.
•  Research in this area includes:
–  learning the semantic orientation and emotional stress of words
–  scoring the sentiment of documents
–  analysing opinions and attitudes etc.

John McNaught

Text Mining Research Group and

NaCTeM (Deputy Director) John.McNaught@manchester.ac.uk

It’s your PhD, not mine

•  If you want me to supervise you in an area of interest to me, then I expect you to come up with at least a rough idea for a research proposal
–  You’ll be more interested in working on something you “own”
–  Whom would a top restaurant be more interested in employing?
    •  A cook who could show he was good at buying ready-made meals?
    •  Or a chef who could show he was capable of inventing a novel dish?

Proposals welcome in areas such as:

•  Text mining
–  Information extraction
    •  named entity recognition, relation extraction, fact or event extraction
–  Opinion mining (sentiment mining)
–  Presentation of complex text mining results to users, interaction aspects, search aspects
•  Issues in resource building for NLP/TM
–  Lexicons, terminologies, annotated corpora

Proposals welcome…
•  Mapping between the language of experts and the language of non-experts
–  Many non-experts attempt to use/understand specialised sources (health problems, …)
•  Writing aids
–  TM is applied post-creation of document, no author present
    •  Ambiguity greatest problem
–  Why not create semantic metadata as author constructs document, resolve ambiguities, propose extracted events, link document to knowledge sphere?

Proposals welcome…
•  If you have domain or language expertise
–  Proposals can be oriented towards that domain or language
–  Although finding appropriate resources (lexica, corpora, language processing tools) may be a severe issue where NLP/TM is underdeveloped or nascent for some language
    •  (so that might give further ideas)
•  TM is of interest also to those in humanities, social sciences, law, ..., so plenty of scope for topics in such domains (e.g. linking historical personages and historical events)

Some projects PhD students of mine have worked on or are working on

•  Arabic named entity recognition
–  Hard because no capitalisation, lack of diacritics in MSA, ambiguity of names with common nouns
•  Opinion mining for Arabic
•  Machine learning of template extraction rules
–  To help grammar rule writers
•  Automatic generation of semantic clusters from definitions
–  To help with “tip of tongue” phenomenon and with communication among experts from different domains
•  Lexical simplification for accessibility and low-literacy support

Information on NaCTeM

•  All our services are here: http://www.nactem.ac.uk/services.php
•  Our tools are here: http://www.nactem.ac.uk/software.php
•  Our publications: http://www.nactem.ac.uk/aigaion2/index.php?/publications

Possible projects

Identification of conflicting information in biological literature

• Aim: finding statements that express some degree of difference/conflict, e.g.

Protein A is highly expressed in T-cells
T-cells show reduced expression of Protein A

• Build on previous work (completed PhD)
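The pair of example statements above can be flagged by a very simple rule: two sentences conflict when they mention the same entities but report opposite directions of expression. The sketch below assumes exactly that rule and nothing more; the cue phrases in `UP` and `DOWN`, the function names and the extra B-cell sentence are invented for illustration, and this is not the method of the completed PhD mentioned on the slide.

```python
# Hypothetical cue phrases for the direction of an expression statement.
UP = ("highly expressed", "increased expression", "overexpressed")
DOWN = ("reduced expression", "decreased expression", "not expressed")


def direction(sentence):
    """Return 'up', 'down' or None depending on the cue phrase found."""
    s = sentence.lower()
    if any(cue in s for cue in DOWN):
        return "down"
    if any(cue in s for cue in UP):
        return "up"
    return None


def mentions(sentence, names):
    """Set of known entity names that appear in the sentence."""
    return frozenset(n for n in names if n.lower() in sentence.lower())


def find_conflicts(sentences, names):
    """Pairs of sentences about the same entities with opposite directions."""
    seen = {}
    conflicts = []
    for s in sentences:
        d = direction(s)
        key = mentions(s, names)
        if d is None or len(key) < 2:
            continue  # need a direction and at least two entities
        for prev_sentence, prev_direction in seen.get(key, []):
            if prev_direction != d:
                conflicts.append((prev_sentence, s))
        seen.setdefault(key, []).append((s, d))
    return conflicts


SENTENCES = [
    "Protein A is highly expressed in T-cells",
    "T-cells show reduced expression of Protein A",
    "Protein A is highly expressed in B-cells",  # invented, no conflict
]
conflicts = find_conflicts(SENTENCES, ["Protein A", "T-cells", "B-cells"])
print(conflicts)
```

The hard part of the real task is everything this rule skips: differing experimental conditions, degrees of difference, and statements whose conflict is only implied.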

Possible projects

Support for logical modelling in systems biomedicine
•  Aim: extract information to construct quantitative computational models of metabolic functions or diseases
–  involves literature mining and data integration, but also some mathematical skills (e.g. logical models and simulations)
–  one modelling project already running in a similar area
•  Multi-disciplinary supervisory team (from Life Sciences)

Possible projects

Clinical and health-care text mining
–  Aim: aid clinical decision support by extracting and aggregating textual health data
–  Extraction and structuring of patient-specific information from health-care records, literature and patient-generated sources
    •  combining text mining, ontologies and data analytics
–  Multi-disciplinary supervisory teams (local hospitals: Christie, Children’s Hospital, Hope)

Possible projects

Integrated data and text mining
•  Aim: combine data that comes from multi-modal sources, e.g. structured and unstructured
–  e.g. integration of clinical/experimental data
•  Many challenging questions to be asked:
–  how to combine different types of data, weights etc
–  defining kernel-based similarity methods to be used in machine learning
•  Requires good maths and computing skills
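One standard way to combine structured and unstructured data in a kernel method is a weighted sum of one kernel per modality, since a non-negative weighted sum of valid kernels is itself a valid kernel. The sketch below assumes that approach; it is not necessarily what the project would use. The two records, the choice of an RBF kernel for numeric values and a cosine kernel for bag-of-words text, and the parameters `gamma` and `weight` are all illustrative assumptions.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel on numeric feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def bag_of_words(text):
    """Word counts for a whitespace-tokenised, lower-cased text."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def cosine_kernel(u, v):
    """Cosine similarity between two bag-of-words dictionaries."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def combined_kernel(a, b, weight=0.5):
    """Weighted sum of a structured-data kernel and a text kernel.
    Each record is a (numeric vector, free text) pair."""
    k_struct = rbf_kernel(a[0], b[0])
    k_text = cosine_kernel(bag_of_words(a[1]), bag_of_words(b[1]))
    return weight * k_struct + (1 - weight) * k_text


# Hypothetical records: (measurements, clinical note).
record_a = ([5.4, 1.2], "high glucose and low albumin")
record_b = ([5.1, 1.0], "low albumin reported")

k_ab = combined_kernel(record_a, record_b)
k_ba = combined_kernel(record_b, record_a)
print(round(k_ab, 3))
```

The `weight` parameter is exactly the "weights" question raised in the bullet above: how much each modality should count is usually tuned on held-out data or learned, as in multiple kernel learning.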

Contact

•  Goran Nenadic, email: GN@cs.man.ac.uk, IT building, IT308, http://gnode1.mib.man.ac.uk
•  Small scale pilot projects around these topics will be available
