LINGVIST | Language learning meets AI STATISTICAL METHODS IN LANGUAGE LEARNING Machine Learning Estonia Meetup 2017-02-28

Lingvist - Statistical Methods in Language Learning

Apr 12, 2017

Page 1: Lingvist - Statistical Methods in Language Learning

LINGVIST | Language learning meets AI

STATISTICAL METHODS IN LANGUAGE LEARNING
Machine Learning Estonia Meetup
2017-02-28

Page 2: Lingvist - Statistical Methods in Language Learning

LINGVIST | Language learning meets AI | lingvist.com | @lingvist

Last time…

“We have very little Machine Learning” - paraphrasing Ahti

Page 3: Lingvist - Statistical Methods in Language Learning

Let's fix it!

Page 4: Lingvist - Statistical Methods in Language Learning

At the same time in the Marketing team…

Page 5: Lingvist - Statistical Methods in Language Learning

Lingvist Intro

• Foreign language learning application
• We are obsessed with learning speed
• Currently free to use
• Web, iOS, Android versions
• 16 courses (language pairs) publicly available:
  ET-EN, ET-FR, RU-EN, RU-FR, EN-DE, EN-ES, EN-FR, EN-RU, AR-EN, DE-EN, FR-EN, ES-EN, JA-EN, PT-EN, ZH-Hant-EN, ZH-Hans-EN

Homepage: lingvist.com

Page 6: Lingvist - Statistical Methods in Language Learning

Page 7: Lingvist - Statistical Methods in Language Learning

You are expected to type in the correct answer

Page 8: Lingvist - Statistical Methods in Language Learning

If you don't know, we show the correct answer

Page 9: Lingvist - Statistical Methods in Language Learning

Well done!

Page 10: Lingvist - Statistical Methods in Language Learning

We use statistics to…

• Prepare the course material
• Predict what learners already know
• Choose optimal repetition intervals during learning
• Analyze common mistakes learners make (and help them avoid these)

We also use conversion, retention, and engagement statistics to drive most product decisions, but I will not talk about that today.

Page 11: Lingvist - Statistical Methods in Language Learning

Course material preparation

Page 12: Lingvist - Statistical Methods in Language Learning

Frequency based vocabulary

Objective:
• Teach vocabulary based on frequency
• Quickly reach a level that is practically useful
• French: ~2000 words cover ~80% of the words in a typical text

Solution:
• Acquire a big text corpus
• Parse and tag (noun, verb, …) all words
• Build a word list in frequency order
• Adjust the ranking (down-rank pronouns, articles, …)
• Review and adjustments by linguists
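The frequency-list steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the `FUNCTION_WORDS` set, and the `penalty` factor are made up, and a real pipeline would use a PoS tagger rather than a hard-coded word set.

```python
import re
from collections import Counter

# Hypothetical set of function words to down-rank (not from the talk).
FUNCTION_WORDS = {"le", "la", "et"}

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common()

def coverage(freq, top_n):
    """Share of running words covered by the top_n most frequent words."""
    total = sum(count for _, count in freq)
    covered = sum(count for _, count in freq[:top_n])
    return covered / total if total else 0.0

def adjusted_ranking(freq, function_words=FUNCTION_WORDS, penalty=0.1):
    """Re-rank words after scaling function-word counts by an assumed penalty."""
    def score(item):
        word, count = item
        return count * penalty if word in function_words else count
    return [word for word, _ in sorted(freq, key=score, reverse=True)]

# Toy corpus, for illustration only.
corpus = "le chat mange le poisson et le chien mange la viande"
freq = frequency_list(corpus)
ranking = adjusted_ranking(freq)
print(freq[:2], round(coverage(freq, 2), 2), ranking[:3])
```

On a real corpus, `coverage(freq, 2000)` is the kind of figure behind the "~2000 words cover ~80%" claim.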

Page 13: Lingvist - Statistical Methods in Language Learning

Sample sentence extraction

Objective:
• Sentences should represent typical context
• Manual production is very time consuming

Solution:
• Extract candidate sentences/phrases from the text corpus
• Rank sentences based on a set of criteria
• Linguists choose the most suitable ones
• Sentences are edited for consistency and completeness

Page 14: Lingvist - Statistical Methods in Language Learning

Sample sentence ranking

Ranking criteria:
• C1. Sentence length
• C2. Complete sentence
• C3. Previously learned words in course
• C4. Natural sequence of words (“fast car” vs “brave car”)
• C5. Contains relevant context words (“go home”)
• C6. Thematically consistent (“flower” and “bloom”)

The total score is a weighted sum of the sub-scores.
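A weighted sum over the criteria could look roughly like the sketch below. The weights, the `length_score` shape, and the candidate sub-scores are all hypothetical; the talk does not give the actual values or how each sub-score is computed.

```python
# Hypothetical weights for criteria C1-C6; the talk does not give values.
WEIGHTS = {"C1": 1.0, "C2": 2.0, "C3": 1.5, "C4": 1.0, "C5": 1.0, "C6": 0.5}

def length_score(sentence, ideal=8, tolerance=6):
    """C1 sketch: 1.0 at the assumed ideal word count, falling linearly to 0."""
    n_words = len(sentence.split())
    return max(0.0, 1.0 - abs(n_words - ideal) / tolerance)

def total_score(sub_scores, weights=WEIGHTS):
    """Weighted sum of sub-scores; missing criteria count as 0."""
    return sum(w * sub_scores.get(name, 0.0) for name, w in weights.items())

def rank_candidates(candidates):
    """candidates: list of (sentence, sub_scores) pairs; best first."""
    return sorted(candidates, key=lambda c: total_score(c[1]), reverse=True)

# Made-up candidates with hand-assigned sub-scores for C2 and C5.
candidates = [
    ("going to the", {"C1": length_score("going to the"), "C2": 0.0, "C5": 0.0}),
    ("I go home.", {"C1": length_score("I go home."), "C2": 1.0, "C5": 1.0}),
]
best_sentence = rank_candidates(candidates)[0][0]
print(best_sentence)
```

Keeping each criterion as a separate sub-score in [0, 1] makes it easy to re-tune the weights without touching the extraction code.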

Page 15: Lingvist - Statistical Methods in Language Learning

A sample of extracted sample sentences

Page 16: Lingvist - Statistical Methods in Language Learning

Dr. Haystack

• The English corpus used was ~3.7 billion words
• There are no conversational corpora of the required size
• The number of criteria leads to “the curse of dimensionality”
• Words are rarely used in a context that linguists consider a good example
• Harder than finding a needle in a haystack

Page 17: Lingvist - Statistical Methods in Language Learning

Predicting what user already knows

Page 18: Lingvist - Statistical Methods in Language Learning

Predicting what user already knows

Objective:
• We have many users with previous knowledge of the language
• If we could predict what they already know...
  - then we can exclude these words
  - save time
  - avoid boredom
• We have had a placement test feature for about a year
  - prediction is based on word frequencies
  - but this correlation is not high and we miss many known words
  - it still has a big positive impact on user retention
  - can we do better?

Page 19: Lingvist - Statistical Methods in Language Learning

Predicting what user already knows

User     wait  doubt  letter  between  son  wait  | Target word: wonder
User 1   1     1      1       0        1    0     | 0
User 2   1     0      1       0        1    1     | 1
User 3   0     0      0       1        1    1     | 1

How?
• We don't teach new words - we ask first
• What a person already knows is valuable information

Training the models:
• Take all first answers from the learning history (correct answer = user already knows the word)
• Train a model per word to predict knowledge of that word
• Rank words by their predictive power
• Train a second model for each word using a fixed set of the most predictive words as inputs
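The ranking step can be illustrated with a deliberately simplified stand-in. Instead of the per-word models from the talk, this sketch scores each word by its mean absolute phi correlation with all other words' first answers and keeps the top `k` as the fixed input set. The data and the scoring choice are assumptions for illustration, not the method actually used.

```python
from math import sqrt

def phi(xs, ys):
    """Phi coefficient between two equal-length 0/1 sequences (0.0 if undefined)."""
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def most_predictive_words(first_answers, k):
    """first_answers: dict word -> list of 0/1 first answers, one per user."""
    words = list(first_answers)
    power = {}
    for word in words:
        scores = [abs(phi(first_answers[word], first_answers[target]))
                  for target in words if target != word]
        power[word] = sum(scores) / len(scores)
    return sorted(words, key=lambda w: power[w], reverse=True)[:k]

# Made-up first answers of six users (1 = answered correctly on first exposure).
first_answers = {
    "wait":   [1, 1, 0, 1, 0, 0],
    "doubt":  [1, 1, 0, 1, 0, 0],
    "letter": [1, 0, 1, 0, 1, 0],
    "wonder": [1, 1, 0, 1, 0, 1],
}
top_words = most_predictive_words(first_answers, k=3)
print(top_words)
```

With a real classifier, the same ranking could come from per-model feature importances; the selected words would then be the inputs of the second model for every target word.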

Page 20: Lingvist - Statistical Methods in Language Learning

Predicting what user already knows

• 5000 models for each course (one model for each word in the course)
• User answers the most predictive words (up to 50 words)
• For each word in the course, feed the answers as input
• Get the prediction for each word
• Include or exclude the word in the course based on the prediction
• Still include a small % of the excluded words (for validation)
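The include/exclude decision with a validation sample might be sketched as follows. `predict_known` is a hypothetical callable returning the predicted probability that the user knows a word; the threshold and the validation share are assumed values, not figures from the talk.

```python
import random

def build_course(words, predict_known, threshold=0.5,
                 validation_share=0.05, rng=None):
    """Split words into taught, skipped, and validation (predicted known but shown anyway)."""
    rng = rng or random.Random()
    included, excluded, validation = [], [], []
    for word in words:
        if predict_known(word) < threshold:
            included.append(word)      # predicted unknown: teach it
        elif rng.random() < validation_share:
            validation.append(word)    # predicted known, shown anyway to check the model
        else:
            excluded.append(word)      # predicted known: skip
    return included, excluded, validation

# Made-up probabilities standing in for the per-word model outputs.
words = ["wait", "doubt", "letter", "wonder"]
probabilities = {"wait": 0.9, "doubt": 0.2, "letter": 0.8, "wonder": 0.4}
included, excluded, validation = build_course(
    words, probabilities.get, rng=random.Random(0))
print(included, excluded, validation)
```

The answers to the validation words are what later provide true-positive and false-positive counts for the placement model.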

Page 21: Lingvist - Statistical Methods in Language Learning

Predicting what user already knows

Averages of performance metrics:

RU-EN course               Random Forest,      Random Forest,
                           first 4000 words    first 2000 words
Accuracy                   0.74                0.72
Precision for “known”      0.67                0.72
Recall for “known”         0.69                0.72
Precision for “unknown”    0.52                0.52
Recall for “unknown”       0.54                0.57
Training samples           2440                4959
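For reference, the metrics in the table are defined as in the sketch below. The labels are made-up toy data (1 = known, 0 = unknown) and do not reproduce the numbers above.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class given as `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels for five (user, word) pairs: 1 = known, 0 = unknown.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, positive=1))  # "known"
print(precision_recall(y_true, y_pred, positive=0))  # "unknown"
```

Reporting both classes separately matters here because wrongly excluding an unknown word and wrongly teaching a known word have different costs for the learner.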

Page 22: Lingvist - Statistical Methods in Language Learning

Predicting what user already knows

Challenges:
• Distribution of samples is heavily skewed towards the beginning of the course
• Dataset is biased due to the current placement test implementation:
  - we excluded a word if we predicted that the user knows it
  - so we have little data about true positives and false positives
• Model performs worse for some language pairs
• Order of the words in the course influences the model

Page 23: Lingvist - Statistical Methods in Language Learning

Predicting optimal repetition interval

Page 24: Lingvist - Statistical Methods in Language Learning

Predicting optimal repetition interval

Page 25: Lingvist - Statistical Methods in Language Learning

Predicting optimal repetition interval

Based on:
• Forgetting curve: exponential decay, Hermann Ebbinghaus ~1885
• Spaced repetition: C. A. Mace ~1932

Forgetting curve parameters are:
• highly individual (depend on the person)
• highly contextual (depend on what is being learned)

Challenge:
Measure or estimate forgetting curve parameters
• for this particular person
• for this particular word or skill
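As a rough illustration, assume the simplest one-parameter exponential form p(t) = exp(-t / s), where s is a "stability" parameter for one person and one word. The talk does not state the exact parametrization, so both the form and the numbers below are assumptions.

```python
import math

def recall_probability(minutes, stability):
    """Assumed forgetting curve: probability of a correct answer after `minutes`."""
    return math.exp(-minutes / stability)

def interval_for(target_probability, stability):
    """Invert the curve: the interval at which recall drops to the target probability."""
    return -stability * math.log(target_probability)

stability = 600.0                       # hypothetical value, in minutes
interval = interval_for(0.85, stability)
print(round(interval, 1), round(recall_probability(interval, stability), 2))
```

Once s is estimated for a person and word, choosing the repetition interval reduces to this inversion; estimating s is the hard part.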

Page 26: Lingvist - Statistical Methods in Language Learning

Predicting optimal repetition interval

Objective:
• Given a target word with its learning history (3x, 1/10/50 min, wrong/correct/wrong)
• Predict the interval after which the user answers correctly with the desired probability (~80-90%)

Method:
• Take the user's learning history (all answers and preceding histories)
• Calculate each history's distance to our target word
• Choose up to ~100 learning histories most similar to the target word
• Fit the curve through the next repetition intervals and answers
• Calculate the interval for the desired probability that the user answers correctly
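Selecting the most similar histories could be sketched as a plain nearest-neighbour search. The feature choice (number of answers, log of the last interval, last answer correct) and the unweighted Euclidean distance are assumptions; the talk mentions additional parameters that are not listed.

```python
import math

def features(history):
    """history: (n_answers, last_interval_minutes, last_correct)."""
    n_answers, last_interval, last_correct = history
    return (float(n_answers), math.log(last_interval), 1.0 if last_correct else 0.0)

def distance(a, b):
    """Euclidean distance between two histories in the assumed feature space."""
    return math.dist(features(a), features(b))  # math.dist needs Python 3.8+

def most_similar(target, histories, k=100):
    """Return up to k histories closest to the target history."""
    return sorted(histories, key=lambda h: distance(target, h))[:k]

# Illustrative histories: (number of answers, last interval in minutes, last correct).
target = (3, 50, True)
histories = [(2, 6, False), (3, 4, True), (3, 30, True), (4, 180, False), (12, 20160, True)]
nearest = most_similar(target, histories, k=2)
print(nearest)
```

Taking the logarithm of the interval keeps a two-week gap from dominating the distance; in practice the features would likely also need scaling or weighting.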

Page 27: Lingvist - Statistical Methods in Language Learning

Word      # answers   Last interval   Last correct   + N parameters   Next interval   Next correct
voiture   3           50 min          Yes            …                ???             80-90%
reste     2           6 min           No                              4 min           Yes
reste     3           4 min           Yes                             1 hr            Yes
voyage    3           30 min          Yes                             3 hrs           No
voyage    4           3 hrs           No                              2 hrs           Yes
…         …
devriez   12          2 wk            Yes                             10 wk           No

Clustering similar histories

Page 30: Lingvist - Statistical Methods in Language Learning

Curve fitting
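One way to picture the curve-fitting step: given (next interval, answered correctly) pairs from similar histories, pick the decay parameter that makes those outcomes most likely, then read off the interval for the desired probability. The one-parameter curve exp(-t / s), the grid search, and the data are assumptions for illustration, not the fitting procedure from the talk.

```python
import math

def log_likelihood(stability, observations):
    """Log-likelihood of (interval_minutes, correct) pairs under p = exp(-t / s)."""
    total = 0.0
    for interval, correct in observations:
        p = math.exp(-interval / stability)
        p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite
        total += math.log(p) if correct else math.log(1 - p)
    return total

def fit_stability(observations, lo=1.0, hi=1e6, steps=400):
    """Grid search over log-spaced candidates for the most likely stability."""
    candidates = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    return max(candidates, key=lambda s: log_likelihood(s, observations))

def next_interval(observations, target_probability=0.85):
    """Interval at which the fitted curve predicts the target probability."""
    return -fit_stability(observations) * math.log(target_probability)

# Made-up outcomes of similar histories: (next interval in minutes, correct?).
observations = [(4, True), (60, True), (120, True), (180, False)]
print(round(fit_stability(observations)), round(next_interval(observations)))
```

A grid search is slow but hard to get wrong; with SciPy available, a bounded scalar minimizer on the negative log-likelihood would do the same job.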

Page 31: Lingvist - Statistical Methods in Language Learning

Mistake classification

Page 32: Lingvist - Statistical Methods in Language Learning

Mistake classification

• Extract all wrong answers
• Classify wrong answers: typos, wrong grammar form, synonyms, false friends, …
• Sort by most common mistakes
• … and figure out what we can do about it
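A minimal rule-based sketch of the classification and counting steps follows. The category rules, the edit-distance threshold of 2, and the lookup sets for other forms and synonyms are assumptions; false friends would need a similar per-language-pair lookup.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify(answer, expected, other_forms=(), synonyms=()):
    """Assign a wrong answer to one coarse category."""
    if answer in other_forms:
        return "wrong grammar form"
    if answer in synonyms:
        return "synonym"
    if edit_distance(answer, expected) <= 2:
        return "typo"
    return "other"

# Made-up wrong answers: (expected word, what the user typed).
wrong_answers = [("voiture", "voitrue"), ("voiture", "voitures"),
                 ("voiture", "auto"), ("voiture", "voitrue")]
mistakes = Counter(
    (expected, answer,
     classify(answer, expected, other_forms={"voitures"}, synonyms={"auto"}))
    for expected, answer in wrong_answers
)
most_common_mistake = mistakes.most_common(1)[0]
print(most_common_mistake)
```

Grammar forms are checked before the typo rule because an inflected form is often within one or two edits of the expected word.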

Page 33: Lingvist - Statistical Methods in Language Learning

Reducing mistakes

• Improve the sample sentence
• Give hints to the user
• Allow the user to try again

Page 34: Lingvist - Statistical Methods in Language Learning

Concluding remarks

Page 35: Lingvist - Statistical Methods in Language Learning

Some learnings

• Deterministic history leads to biases
• Adding some randomization is good for discovery
• Each language pair is analyzed separately (RU-EN vs FR-EN)
• Noise (typos, bad samples, etc.) must be accounted for

Page 36: Lingvist - Statistical Methods in Language Learning

Technology

• Python (3.x)
• NumPy, SciPy, Pandas - statistics, clustering, calculations
• scikit-learn - machine learning (Random Forest, Multinomial Bayes, feature extraction)
• Gensim - distributional semantics (CBOW, word2vec, skip-gram, …)
• Semspaces - functions for working with semantic spaces
• NLTK, FreeLing, Stanford NLP - parsing, PoS tagging, pre-processing

Page 37: Lingvist - Statistical Methods in Language Learning

THANK YOU!
Credits go to the team, mistakes are mine!