Transcript
Page 1: Extracting Knowledge from Pydata London 2015

[email protected]

Jointly embedding text and knowledge graphs for information extraction

Armando Vieira

Data Scientist @dataAI and @Stratified Medical

Page 2

Summary

Why do machines struggle to “understand” text?

The challenges of discovering new knowledge in text

Deep Learning to the rescue

Words as distributed vectors

Combining text with knowledge graphs

Page 3

Wouldn't it be great if...

We could extract “knowledge” expressed in text into a machine-readable format?

Page 4

Or if...

We could transform all biomedical information into an automated drug discovery process?

Page 5

NLP: the traditional way

Page 7

Why is understanding text so hard for a machine?

The nightmare of verbs

Nested structures

Syntax is doable; semantics is hard

Other challenges (negations, …)

Long-range interactions

Page 8

Deep learning to the rescue

Page 9

How distributed representations solve the curse of dimensionality problem
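The contrast can be made concrete with a toy example (all words and vectors below are made up): one-hot vectors need one dimension per vocabulary word and treat every pair of distinct words as equally unrelated, while low-dimensional dense vectors expose graded similarity.

```python
# One-hot vs. distributed (dense) representations: a toy illustration.
import math

vocab = ["king", "queen", "apple", "banana"]

def one_hot(word):
    """|V|-dimensional vector with a single 1 -- no notion of similarity."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-crafted 3-d dense vectors: similar words get similar coordinates.
dense = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.1],
    "apple":  [0.1, 0.1, 0.9],
    "banana": [0.1, 0.2, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Any two distinct one-hot vectors are orthogonal: similarity is always 0.
print(cosine(one_hot("king"), one_hot("queen")))   # 0.0
# Dense vectors capture graded similarity at far lower dimensionality.
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["apple"]))  # True
```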

Page 12

Distributed representations are powerful

Page 14

The Skip-gram algorithm

IDEA: words that occur together are semantically related (Mikolov et al., 2013)
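The skip-gram setup can be sketched in a few lines: each word predicts its neighbors within a window, so the training data is just (center, context) pairs. A minimal pair generator (toy corpus, illustrative window size):

```python
# Generating skip-gram training pairs: for each center word, emit the
# words in a symmetric window around it.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=1))
# e.g. first pairs: ('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...
```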

Page 15

But it's not the end of the story

The nightmare of verbs

Nested relation structures

Syntax is doable; semantics is hard

Other challenges (negations, …)

Long-range correlations

Page 16

Neural Embeddings

Credit: Omer Levy

Page 17

Mikolov et al. (2013)

Page 20

What does each similarity term mean?

Observe the joint features with explicit representations!

(Example joint features from the slide: uncrowned, Elizabeth, majesty, Katherine, second, impregnate, …)

Page 21

Words as vector operations
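For instance, the classic vec(king) − vec(man) + vec(woman) ≈ vec(queen) analogy can be sketched with hand-made 3-d vectors (real embeddings are learned and have hundreds of dimensions):

```python
# Analogies as vector arithmetic on toy embeddings.
# Dimensions are roughly (royalty, femaleness, concreteness); all values are made up.
import math

vecs = {
    "king":  [0.9, 0.1, 1.0],
    "queen": [0.9, 0.9, 1.0],
    "man":   [0.1, 0.1, 1.0],
    "woman": [0.1, 0.9, 1.0],
    "apple": [0.2, 0.5, 0.1],  # distractor
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("man", "king", "woman"))  # -> 'queen'
```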

Page 22

Gensim implementation in Python

Page 24

How to train the embedding?
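One standard answer is skip-gram with negative sampling (SGNS): push a context word's vector toward its center word and push a few randomly sampled "negative" words away. A toy sketch of a single SGD update follows, not the optimized word2vec implementation; dimensions and learning rate are illustrative:

```python
# One SGD step of skip-gram with negative sampling (SGNS), in plain Python.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(center_vec, context_vec, negative_vecs, lr=0.1):
    """Move the context vector toward the center word and the negative
    samples away from it; returns updated (center, context, negatives)."""
    grad_center = [0.0] * len(center_vec)
    # Positive pair: maximize log sigmoid(center . context).
    g = 1.0 - sigmoid(sum(c * o for c, o in zip(center_vec, context_vec)))
    new_context = [o + lr * g * c for o, c in zip(context_vec, center_vec)]
    grad_center = [gc + g * o for gc, o in zip(grad_center, context_vec)]
    # Negative samples: minimize sigmoid(center . negative).
    new_negs = []
    for neg in negative_vecs:
        gn = -sigmoid(sum(c * n for c, n in zip(center_vec, neg)))
        new_negs.append([n + lr * gn * c for n, c in zip(neg, center_vec)])
        grad_center = [gc + gn * n for gc, n in zip(grad_center, neg)]
    new_center = [c + lr * gc for c, gc in zip(center_vec, grad_center)]
    return new_center, new_context, new_negs

random.seed(0)
center = [random.uniform(-0.5, 0.5) for _ in range(5)]
context = [random.uniform(-0.5, 0.5) for _ in range(5)]
negs = [[random.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(2)]
c2, o2, n2 = sgns_step(center, context, negs)
```

After the step, the context vector's dot product with the center word has increased and each negative's has decreased, which is exactly the SGNS objective.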

Page 25

Advantages

• Efficient coding of words and relations

• Capture both local and global semantics

• Easy to parallelize

• Completely unsupervised

• Can easily handle ambiguity

Page 26

Limitations of word embeddings

• They are (bi)linear models

• Perform poorly on infrequent words

• Cannot incorporate external knowledge

Page 27

Knowledge graphs

Page 32

Why is it hard to expand knowledge?

• Sparsely connected

• Highest-degree nodes are sometimes irrelevant

• Some relation types are too vague

• Hard to integrate local and global (contextual) information

Page 33

Combining text and graphs

Page 34

What’s inside a knowledge graph?

Page 35

Idea: combine KG and text corpus

Page 36

The algorithm

Chang Xu et al
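The details are in the paper, but the flavor of the relational term can be illustrated with a TransE-style translation score, a common way to encode a fact (head, relation, tail) as vec(head) + vec(relation) ≈ vec(tail). All entities and vectors below are toys, not the authors' exact model:

```python
# A TransE-style triple score, the kind of relational term used when
# regularizing word embeddings with a knowledge graph.
import math

def transe_score(h, r, t):
    """Lower score = more plausible triple (L2 distance of h + r from t)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy 2-d embeddings chosen so that london + capital_of lands on uk.
entity = {"london": [0.1, 0.9], "uk": [0.4, 1.0], "paris": [0.9, 0.2]}
relation = {"capital_of": [0.3, 0.1]}

good = transe_score(entity["london"], relation["capital_of"], entity["uk"])
bad = transe_score(entity["paris"], relation["capital_of"], entity["uk"])
print(good < bad)  # True: (london, capital_of, uk) scores better
```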

Page 37

Data

Wikipedia 2014
• 3.5 billion word tokens
• Vocabulary size: 2 million

Freebase
• 44 million topics
• 2.4 billion facts
• > 1,500 relation types

Page 38

Results

Page 40

Beating humans in IQ test?

Analogy 1 Isotherm is to temperature as isobar is to: A) atmosphere; B) wind; C) pressure; D) latitude; E) current.

Analogy 2 Identify two words (one from each set of brackets) that form a connection (analogy) when paired with the words in capitals: CHAPTER (book, verse, read), ACT (stage, audience, play).

Classification Which is the odd one out? (i) calm, (ii) quiet, (iii) relaxed, (iv) serene, (v) unruffled.

Synonym Which word is closest to IRRATIONAL? (i) intransigent, (ii) irredeemable, (iii) unsafe, (iv) lost, (v) nonsensical.

Antonym Which word is most opposite to MUSICAL? (i) discordant, (ii) loud, (iii) lyrical, (iv) verbal, (v) euphonious.

Page 41

On average, yes!

Huang et al., June 2015

Page 42

Resources

http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/ Chris Moody

https://levyomer.wordpress.com Omer Levy

Page 43

How about biomedical data?

Relatively little data (25 million documents)

Complex interactions between entities

Fat-tailed distributions

Incorporate constraints from Physics, Chemistry & Biology

Non-linearities: complex manifold

Page 44

From here… Neuroinflammation is the local reaction of the brain to infection, trauma, toxic molecules or protein aggregates. The brain resident macrophages, microglia, are able to trigger an appropriate response involving secretion of cytokines and chemokines, resulting in the activation of astrocytes and recruitment of peripheral immune cells. IL-1β plays an important role in this response; yet its production and mode of action in the brain are not fully understood and its precise implication in neurodegenerative diseases needs further characterization. Our results indicate that the capacity to form a functional NLRP3 inflammasome and secretion of IL-1β is limited to the microglial compartment in the mouse brain. We were not able to observe IL-1β secretion from astrocytes, nor do they express all NLRP3 inflammasome components. Microglia were able to produce IL-1β in response to different classical inflammasome activators, such as ATP, nigericin or alum. Similarly, microglia secreted IL-18 and IL-1α, two other inflammasome-linked pro-inflammatory factors. Cell stimulation with α-synuclein, a neurodegenerative disease-related peptide, did not result in the release of active IL-1β by microglia, despite a weak pro-inflammatory effect. Amyloid-β peptides were able to activate the NLRP3 inflammasome in microglia and IL-1β secretion occurred in a P2X7 receptor-independent manner. Thus, microglia-dependent inflammasome activation can play an important role in the brain and especially in neuroinflammatory conditions.

Page 45

To here

If protein A interacts with gene G in cell type C, what other proteins related to A may interact with gene G in cell type C1?

If chemical Q attaches to target T at protein P, what chemicals may attach to target T1 at protein P1?
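Both questions are link-prediction queries over the graph: given (head, relation, ?), rank candidate tails by how well the embeddings complete the triple. A toy sketch using a TransE-style distance (all entity names, relations, and vectors below are hypothetical):

```python
# Answering a link-prediction query (head, relation, ?) by ranking
# candidate tails with a translation-based score over toy embeddings.
import math

def score(h, r, t):
    """TransE-style plausibility: smaller ||h + r - t|| = better."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))

entities = {
    "protein_A": [0.2, 0.5],
    "gene_G":    [0.7, 0.8],
    "gene_H":    [0.1, 0.1],
}
relations = {"interacts_with": [0.5, 0.3]}

def complete(head, rel, candidates):
    """Rank candidate tails for the query (head, rel, ?), best first."""
    h, r = entities[head], relations[rel]
    return sorted(candidates, key=lambda t: score(h, r, entities[t]))

print(complete("protein_A", "interacts_with", ["gene_G", "gene_H"]))
# -> ['gene_G', 'gene_H']
```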

Page 46

Looking for new knowledge

We are not really trying to understand language

Rather

To extract and “validate” novel knowledge.