Cleaning Social IME Dictionary Yoh Okuno #IME2011
Page 1: Cleaning Social IME Dictionary

Cleaning Social IME Dictionary

Yoh Okuno

#IME2011  

Page 2: Cleaning Social IME Dictionary

About the presenter

• Name: Yoh Okuno

• Software Engineer at Yahoo! Japan

• Interests: NLP, Machine Learning, Data Mining

• Skills: C/C++, Python, Hadoop, and English.

• Website: http://yoh.okuno.name/

Page 3: Cleaning Social IME Dictionary

What is Social IME?

• The most popular “Cloud-based” Japanese input method (230k unique users per month)

http://www.social-ime.com/

Page 4: Cleaning Social IME Dictionary

Shared  Dictionary  of  Social  IME

•  Noisy  &  Crazy  →  Needs  cleaning!  

shared  with  all  users

Page 5: Cleaning Social IME Dictionary

Character alignment

• Align pairs of Kana and Kanji characters monotonically and detect alignment failures

• Techniques from statistical machine translation

• Used m2m-aligner because of its features

http://code.google.com/p/m2m-aligner/

Before alignment:

四季多彩  しきたさい
西都原  さいとばる
iPhone  あいふぉん

After alignment:

四|季|多|彩|  し|き|た|さい|
西|都|原|  さい|と|ばる|
i|Ph|o|n|e|  あい|ふ|ぉ|ん|_|
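The aligned lines above can be turned back into segment pairs with a few lines of Python. This is a minimal sketch that assumes the pipe-delimited layout exactly as printed on this slide, with `_` marking an empty (null) segment. The function name `parse_alignment` is hypothetical and not part of m2m-aligner.

```python
def parse_alignment(source, target, null="_"):
    """Pair up pipe-delimited segments such as '西|都|原|' and 'さい|と|ばる|'.

    Assumes the layout shown on the slide: every segment is followed by '|',
    and the null symbol stands for "no characters on this side".
    """
    src = source.rstrip("|").split("|")
    tgt = target.rstrip("|").split("|")
    if len(src) != len(tgt):
        raise ValueError("source and target have different segment counts")
    # Replace the null symbol with an empty string on either side.
    return [("" if s == null else s, "" if t == null else t)
            for s, t in zip(src, tgt)]


pairs = parse_alignment("i|Ph|o|n|e|", "あい|ふ|ぉ|ん|_|")
for kanji_side, kana_side in pairs:
    print(repr(kanji_side), "->", repr(kana_side))
```

In the iPhone example, the final "e" is paired with the null symbol, so it carries no reading of its own.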

Page 6: Cleaning Social IME Dictionary

Training m2m-aligner

• Train on 3 datasets

– Mozc’s dictionary (1.5M words)

– unidic (230k words)

– alt-cannadic (400k words) → most suitable

• Just run 2 commands
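Before training, each dictionary entry has to be split into characters. The sketch below assumes that m2m-aligner reads one pair per line, with tokens separated by spaces and the two sides separated by a tab. That input layout is an assumption and should be checked against the tool's README. The helper name `to_m2m_line` is hypothetical.

```python
def to_m2m_line(surface, reading):
    """Format one (surface, reading) pair as a single training line.

    Assumed layout: space-separated characters, a tab between the two sides.
    """
    return " ".join(surface) + "\t" + " ".join(reading)


entries = [("四季多彩", "しきたさい"), ("西都原", "さいとばる")]
for surface, reading in entries:
    print(to_m2m_line(surface, reading))
```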

Page 7: Cleaning Social IME Dictionary

Training results

• Three files are generated

Alignment:

Error:

Model:

Page 8: Cleaning Social IME Dictionary

Applying  m2m-­‐aligner

• Apply to 5 datasets

– Social IME shared dictionary (93k words)

– Mined from Wikipedia (169k words)

– Crawled MS-IME dictionary (18k words)

– Manually corrected MS-IME dictionary (92k words)

– Hatena keyword (315k words)

Page 9: Cleaning Social IME Dictionary

Mining  words  from  Wikipedia

grep like “[一-龠]+([ぁ-んヴー]+)”
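The grep-like pattern can be sketched in Python as follows. It assumes the parentheses in the pattern are the literal brackets that follow a headword in Wikipedia text, and it accepts both full-width and ASCII brackets. That reading of the pattern, and the names `READING_PATTERN` and `mine_pairs`, are assumptions made for illustration.

```python
import re

# A run of Kanji followed by a Kana reading in parentheses.
# Both full-width （ ） and ASCII ( ) brackets are accepted (an assumption).
READING_PATTERN = re.compile(r"([一-龠]+)[（(]([ぁ-んヴー]+)[）)]")


def mine_pairs(text):
    """Return (kanji, reading) candidates found in a piece of text."""
    return READING_PATTERN.findall(text)


sample = "宮崎県の西都原（さいとばる）は古墳群で知られる。"
print(mine_pairs(sample))
```

Candidates mined this way are still noisy, which is why they are passed through the aligner afterwards.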

Page 10: Cleaning Social IME Dictionary

Crawling MS-IME user dictionary

Page 11: Cleaning Social IME Dictionary

Hatena  keyword

Page 12: Cleaning Social IME Dictionary

Applied  results

•  Run:    

• Results:

Dataset   Social IME   Wikipedia   MS-IME   MS-IME2   hatena
Size      93k          169k        18k      97k       314k
Align     48k          137k        16k      86k       235k
Error     45k          32k         2k       10k       78k
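To compare the datasets, the share of entries that failed to align can be computed from the Size and Error rows. This is a small sketch using the rounded figures from the table (in thousands of words), so the resulting rates are approximate.

```python
# (size, error) in thousands of words, copied from the results table.
results = {
    "Social IME": (93, 45),
    "Wikipedia": (169, 32),
    "MS-IME": (18, 2),
    "MS-IME2": (97, 10),
    "hatena": (314, 78),
}

# Fraction of entries for which alignment failed.
rates = {name: error / size for name, (size, error) in results.items()}

for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.0%}")
```

Social IME has by far the highest failure rate, with nearly half of its entries rejected by the aligner.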

Page 13: Cleaning Social IME Dictionary

Alignment examples

• Not perfect, but precise enough for practical use

From Social IME:

From  Wikipedia:

“ゃ,ゅ,ょ,っ”  should  be  combined  with  the  previous  character
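One way to act on this observation is to split readings into units before alignment, attaching each small Kana to the character in front of it. The following is a minimal sketch of that idea, not something the aligner does by itself. The names `SMALL_KANA` and `kana_units` are hypothetical.

```python
# Small Kana that should not stand alone as an alignment unit.
SMALL_KANA = set("ゃゅょっ")


def kana_units(reading):
    """Split a reading into units, gluing small Kana to the previous unit."""
    units = []
    for ch in reading:
        if ch in SMALL_KANA and units:
            units[-1] += ch  # attach to the preceding character
        else:
            units.append(ch)
    return units


print(kana_units("きょうと"))
print(kana_units("がっこう"))
```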

Page 14: Cleaning Social IME Dictionary

Error  examples  (from  Social  IME)

• Error analysis is the most interesting part!

Abbreviations:

Emoticons (顔文字):

Personal Information:

Page 15: Cleaning Social IME Dictionary

Error examples (from Hatena)

Length limit (16 chars):

Chinese  /  Korean  /  old  Japanese  words:

Semantic  translation:

Page 16: Cleaning Social IME Dictionary

12/29  Released!!

Page 17: Cleaning Social IME Dictionary

Conclusion

• Described how to clean the Social IME, Wikipedia, and MS-IME dictionaries using m2m-aligner

• Released the cleaned dictionary today!

• Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on.

Page 18: Cleaning Social IME Dictionary

Any questions?