Top Banner
ELUTE Essential Libraries and Utilities of Text Engineering Tian-Jian Jiang
28

ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Jan 03, 2016

Download

Documents

Kelley Hardy
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

ELUTEEssential Libraries and Utilities of

Text Engineering

Tian-Jian Jiang

Page 2: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Why bother?Since we already have...

Page 3: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

(lib)TaBE•Traditional Chinese Word Segmentation

•with Big5 encoding

•Traditional Chinese Syllable-to-Word Conversion

•with Big5 encoding

•for bo-po-mo-fo transcription system

Page 4: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

1999 - 2001?1999 - 2001?1999 - 2001?1999 - 2001?

Page 5: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

How about...

Page 6: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

libchewing•Now hacking for

•UTF-8 encoding

•Pinyin transcription system

•and looking for

•an alternative algorithm

•a better dictionary

Page 7: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

We got a problem•23:43 < s*****> 到底 (3023) + 螫舌 (0) 麼(2536) 東西 (6024) = 11583

•23:43 < s*****> 到底是 (829) + 什麼東西 (337) = 1166

•23:43 < s*****> 到底螫舌麼東西 大勝 到底這什麼東西

•00:02 < s*****> k***: 「什麼」會被「什麼東西」排擠掉

•00:02 < s*****> k***: 結果是 20445 活生生的被 337 幹掉 :P

Page 8: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Word Segmentation Review

Page 9: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Heuristic Rules*

• Maximum matching -- Simple vs. Complex: 下雨天真正討厭

• 下雨 天真 正 討厭 vs. 下雨天 真正 討厭

• Maximum average word length

• 國際化

• Minimum variance of word lengths

• 研究 生命 起源

• Maximum degree of morphemic freedom of single-character word

• 主要 是 因為

* Refer to MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/

Page 10: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Graphical Models• Markov chain family

• Statistical Language Model (SLM)

• Hidden Markov Model (HMM)

• Exponential models

• Maximum Entropy (ME)

• Conditional Random Fields (CRF)

• Applications

• Probabilistic Context-Free Grammar (PCFG) Parser

• Head-driven Phrase Structure Grammar (HPSG) Parser

• Link Grammar Parser

Page 11: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

What is a language model?

Page 12: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

A probability distributionover surface patterns of

texts.

Page 13: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

The Italian Who Went to Malta

•One day ima gonna Malta to bigga hotel.

•Ina morning I go down to eat breakfast.

•I tella waitress I wanna two pissis toasts.

•She brings me only one piss.

•I tella her I want two piss. She say go to the toilet.

•I say, you no understand, I wanna piss onna my plate.

•She say you better no piss onna plate, you sonna ma bitch.

•I don’t even know the lady and she call me sonna ma bitch!

Page 14: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

P(“I want to piss”) > P(“I want two pieces”)

For that Malta waitress,

Page 15: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Do the Math• Conditional probability:

• Bayes’ theorem:

• Information theory:

• Noisy channel model

• Language model: P(i)

)(

)()|(

BP

BAPBAP

Noisy channelp(o|i)

DecoderI O Î

)(

)()|(

)(

)()|(

AP

BPBAP

AP

ABPABP

)|()(maxarg)(

)|()(maxarg)|(maxarg

^

iopipop

iopipoipI

iii

Page 16: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Shannon’s Game

•Predict next word by history

•Maximum Likelihood Estimation

•C(w1…wn) : Frequency of n-gram w1…wn

)(

)()|(

11

111

n

nnn wwP

wwPwwwP

)(

)()|(

11

111MLE

n

nnn wwC

wwCwwwP

Page 17: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Once in a Blue Moon

• A cat has seen...

• 10 sparrows

• 4 barn swallows

• 1 Chinese Bulbul

• 1 Pacific Swallow

• How likely is it that next bird is unseen?

Page 18: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

(1+1) / (10 + 4 + 1 + 1)

Page 19: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

But I’ve seen a moonand I’m blue

• Simple linear interpolation

• PLi(wn|wn-2 , wn-1) = λ1P1(wn) + λ2P2(wn|wn-1 ) + λ3P2(wn|wn-1 , wn-2)

• 0 ≤λi ≤ 1, Σiλi = 1

• Katz’s backing-off

• Back-off through progressively shorter histories.

• Pbo(wi|wi-(n-1)…wi-1) =

•kwwC

wwC

wwCd ini

ini

iniww ini

)( if ,)(

)()1( )1(

1)1(

)1(

1)1(

otherwise. ),|( 1)2(bo1)1( iniiww wwwPini

Page 20: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Good Luck!• Place a bet remotely on

a horse race within 8 horses by passing encoded messages.

• Past bet distribution

• horse 1: 1/2

• horse 2: 1/4

• horse 3: 1/8

• horse 4: 1/16

• the rest: 1/64

Foreversoul: http://flickr.com/photos/foreversouls/CC: BY-NC-ND

Page 21: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

3 bits? No, only 2!

0, 10, 110, 1110, 111100, 111101, 111110, 111111

Page 22: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Alright, let’s ELUTE

Page 23: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Bi-gram MLE Flow Chart

Permute candidates

right_gramIn LM?

hasleft_gram?

bi_gramIn LM?

left_gramIn LM?

temp_score =LogProb(right_gram)

temp_score =LogProb(bi_gram)

temp_score =LogProb(left_gram) +BackOff(right_gram)

temp_score =LogProb(Unknown) +BackOff(right_gram)

temp_score =LogProb(Unknown)

Update scores

temp_score +=previous_score

have 2 grams?

Yes

Yes

Yes

Yes Yes

No

No

No

No No

Page 24: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

INPUT input_syllables; len = Length(input_syllables); Load(language_model);scores[len + 1]; tracks[len + 1]; words[len + 1];FOR i = 0 TO len

scores[i] = 0.0; tracks[i] = -1; words[i] = "";FOR index = 1 TO len

best_score = 0.0; best_prefix = -1; best_word = "";FOR prefix = index - 1 TO 0

right_grams[] = Homophones(Substring(input_syllabes, prefix, index - prefix));FOR EACH right_gram IN right_grams[]

IF right_gram IN language_modelleft = tracks[prefix];IF left >= 0 AND left != prefix

left_grams[] = Homophones(Substring(input_syllables, left, prefix - left));FOR EACH left_gram IN left_grams[]

temp_score = 0.0;bigram = left_gram + " " + right_gram;IF bigram IN language_model

bigram_score = LogProb(bigram);temp_score += bigram_score;

ELSE IF left_gram IN language_modelbigram_backoff = LogProb(left_gram) + BackOff(right_gram);temp_score += bigram_backoff;

ELSEtemp_score += LogProb(Unknown) + BackOff(right_gram);

temp_score += scores[prefix];Scoring

ELSEtemp_score = LogProb(right_gram);Scoring

ELSEtemp_score = LogProb(Unknown) + scores[prefix];Scoring

scores[index] = best_score; tracks[index] = best_prefix_index; words[index] = best_prefix;IF tracks[index] == -1

tracks[index] = index - 1;boundary = len; output_words = "";WHILE boundary > 0

output_words = words[boundary] + output_words;boundary = tracks[boundary];

RETURN output_words;

SUBROUTINE ScoringIF best_score == 0.0 OR temp_score > best_score

best_score = temp_score;best_prefix = prefix;best_word = right_gram;

Bi-gram Syllable-to-Word

Page 25: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Show me the…

Page 26: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

William’s Requests

Page 27: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

And My Suggestions

• Convenient API

• Plain text I/O (in UTF-8)

• More linguistic information

• Algorithm: CRF

• Corpus: we need YOU!

• Flexible to different applications

• Composite, Iterator, and Adapter Patterns

• IDL support

• SWIG

• Open Source

• Open Corpus, too

Page 28: ELUTE E ssential L ibraries and U tilities of T ext E ngineering Tian-Jian Jiang.

Thank YOU