Cleaning Social IME Dictionary Yoh Okuno #IME2011
Jun 11, 2015
About the presenter
• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, and English.
• Website: http://yoh.okuno.name/
What is Social IME?
• The most popular “cloud-based” Japanese input method (230k unique users per month)
http://www.social-ime.com/
Shared Dictionary of Social IME
• Noisy & crazy → needs cleaning!
• Shared with all users
Character alignment
• Align pairs of Kana and Kanji characters monotonically and detect alignment failures
• Uses techniques from statistical machine translation
• Used m2m-aligner because of its functions
http://code.google.com/p/m2m-aligner/
四季多彩 しきたさい → 四|季|多|彩| し|き|た|さい|
西都原 さいとばる → 西|都|原| さい|と|ばる|
iPhone あいふぉん → i|Ph|o|n|e| あい|ふ|ぉ|ん|_|
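The aligned form shown above pairs each source segment with a reading segment, using "|" as the delimiter. A minimal Python sketch of reading one such aligned pair back into segment pairs might look like the following. The function name `parse_alignment` is hypothetical, the fields are assumed to be separated by whitespace, and treating "_" as an empty segment is an assumption based on the iPhone example.

```python
def parse_alignment(line):
    """Split one 'src|seg|... tgt|seg|...' line into (source, target) segment pairs."""
    # Assumption: the source and target fields are separated by whitespace.
    source_field, target_field = line.split()
    # Each field ends with a trailing '|', so strip it before splitting.
    source = source_field.rstrip("|").split("|")
    target = target_field.rstrip("|").split("|")
    if len(source) != len(target):
        raise ValueError("segment counts differ: %r" % line)
    # '_' is left untouched here; it appears to mark an empty segment.
    return list(zip(source, target))


pairs = parse_alignment("西|都|原| さい|と|ばる|")
```

Here `pairs` holds one tuple per aligned segment, so 西 is paired with さい, 都 with と, and 原 with ばる.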
Training m2m-aligner
• Train on 3 datasets
– Mozc’s dictionary (1.5M words)
– unidic (230k words)
– alt-cannadic (400k words) → most suitable
• Just run 2 commands
Trained results
• Three files are generated
Alignment:
Error:
Model:
Applying m2m-aligner
• Apply to 5 datasets
– Social IME shared dictionary (93k words)
– Mined from Wikipedia (169k words)
– Crawled MS-IME dictionary (18k words)
– Manually corrected MS-IME dictionary (92k words)
– Hatena keyword (315k words)
Mining words from Wikipedia
• grep with a pattern like “[一-龠]+([ぁ-んヴー]+)”
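As a rough illustration of this mining step, the same pattern can be expressed as a Python regular expression. This is only a sketch: it assumes the reading appears in parentheses right after a run of Kanji, accepts both half-width and full-width parentheses, and uses hypothetical names (`PAIR_PATTERN`, `mine_pairs`) that are not from the original slides.

```python
import re

# A run of Kanji followed by a Kana reading in half- or full-width parentheses.
PAIR_PATTERN = re.compile(r"([一-龠]+)[(（]([ぁ-んヴー]+)[)）]")


def mine_pairs(text):
    """Return (kanji, reading) tuples found in the given text."""
    return PAIR_PATTERN.findall(text)


sample = "西都原（さいとばる）は宮崎県にある。四季多彩(しきたさい)"
mined = mine_pairs(sample)
```

For this sample, `mined` contains the pairs (西都原, さいとばる) and (四季多彩, しきたさい). The Kanji run 宮崎県 is skipped because no parenthesized reading follows it.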
Crawling MS-IME user dictionary
Hatena keyword
Applied results
• Run:
• Results:
Dataset   Social IME   Wikipedia   MS-IME   MS-IME2   Hatena
Size      93k          169k        18k      97k       314k
Align     48k          137k        16k      86k       235k
Error     45k          32k         2k       10k       78k
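To compare the datasets, it can help to turn these counts into error rates (Error divided by Size). The short Python sketch below does that arithmetic using the rounded figures from the results, in thousands of words. The variable names are illustrative only.

```python
# (size, errors) in thousands of words, taken from the rounded results above.
results = {
    "Social IME": (93, 45),
    "Wikipedia": (169, 32),
    "MS-IME": (18, 2),
    "MS-IME2": (97, 10),
    "Hatena": (314, 78),
}


def error_rate(size, errors):
    """Percentage of entries that failed to align."""
    return 100.0 * errors / size


rates = {name: round(error_rate(size, errors), 1)
         for name, (size, errors) in results.items()}
```

With these figures, about 48% of the Social IME entries fail to align, compared with roughly 10 to 11% for the two MS-IME datasets, about 19% for Wikipedia, and about 25% for Hatena.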
Alignment examples
• Not perfect, but precise enough for practical use
From Social IME:
From Wikipedia:
“ゃ,ゅ,ょ,っ” should be combined with the previous character
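One possible way to apply this rule as a post-processing step is sketched below in Python: any aligned pair whose reading starts with one of these small Kana is merged into the preceding pair. This is a hypothetical sketch, not something the aligner does itself. The function name `merge_small_kana` and the キャベツ example are made up for illustration.

```python
SMALL_KANA = "ゃゅょっ"


def merge_small_kana(pairs):
    """Merge pairs whose reading starts with a small Kana into the previous pair."""
    merged = []
    for source, target in pairs:
        if merged and target and target[0] in SMALL_KANA:
            prev_source, prev_target = merged[-1]
            merged[-1] = (prev_source + source, prev_target + target)
        else:
            # The first pair is kept as-is because there is nothing to merge into.
            merged.append((source, target))
    return merged


# Hypothetical alignment of キャベツ / きゃべつ, one character per segment.
fixed = merge_small_kana([("キ", "き"), ("ャ", "ゃ"), ("ベ", "べ"), ("ツ", "つ")])
```

In this example the second pair is folded into the first, so `fixed` starts with (キャ, きゃ) and the remaining pairs are unchanged.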
Error examples (from Social IME)
• Error analysis is the most interesting part!
Abbreviations:
Emoticons (顔文字):
Personal Information:
Error examples (from Hatena)
Length limit (16 chars):
Chinese / Korean / old Japanese words:
Semantic translation:
12/29 Released!!
Conclusion
• Described how to clean the Social IME, Wikipedia, and MS-IME dictionaries using m2m-aligner
• Released cleaned dictionary today!
• Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on.
Any questions?