Automatic English Text Correction
Tatiana Al-Chueyr Martins (@tati_alchueyr)
Bratislava, 12 March 2016
PyCon SK 2016
tati.__doc__
● Brazilian
● Lives in London (United Kingdom)
● Pythonista and open source activist
● Computer Engineer from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
○ Backend & DevOps leader of the CTX Team
help(EF)
● EF: Education First
● International education company
○ Language training
○ Educational travel
○ Academic degree programmes
● Founded in 1965 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family
help(EF.CTX)
● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform
CTX.__team__
● CTX Team
● Team travel
● Malta, November 2015
CTX.backend
● Rafael Cunha de Almeida
● and I
● trying to master Italian cuisine
● London, February 2016
● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)
objective
● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools which can help deal with this challenge
the challenge
The challenge
Assessing (evaluating) students’ activities and exercises can be:
● Laborious
● Repetitive
● Slow
● In other words... painful!
https://classteaching.files.wordpress.com/2013/10/marking-pile.gif
English Text Correction
hi ,
my name is crystal.im nine years old. im form china,im live in jiang xi xing yu .
there are tow people in my family: my mother, my father.
my mother is thirty-six years old, my father is thirty-seven years old
EFCamDAT - C219811
Kinds of mistakes highlighted: capitalization, spelling, verb tense
There are “only” 89 writings left to assess this week...
The challenge
Implement algorithms and tools which can help teachers assess essays written in English
Example of a feature available in several applications (including LibreOffice, Google Apps, MS Word):
● Highlight (potential) mistakes while the user types in a text area
The challenge
● Input:
○ English text
● Output:
○ List of items containing:
■ Position in text
■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling, etc.)
■ Proposal of correction
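The output described above could be modelled as a small record per potential mistake. This is only a sketch: the class name `PotentialMistake`, its field names and the helper `describe` are hypothetical and are not taken from the scripts presented later.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PotentialMistake:
    """One item of the output list (hypothetical structure)."""
    start: int                        # index of the first flagged character
    end: int                          # index one past the last flagged character
    kind: str                         # e.g. "spelling", "capitalization", "article"
    suggestion: Optional[str] = None  # proposed correction, if there is one


def describe(text: str, items: List[PotentialMistake]) -> List[str]:
    """Render each item as a human-readable line."""
    lines = []
    for item in items:
        flagged = text[item.start:item.end]
        proposal = item.suggestion if item.suggestion is not None else "?"
        lines.append(f"{item.kind}: {flagged!r} -> {proposal!r} at {item.start}")
    return lines


text = "there are tow people in my family"
items = [
    PotentialMistake(0, 5, "capitalization", "There"),
    PotentialMistake(10, 13, "spelling", "two"),
]
for line in describe(text, items):
    print(line)
```

Storing character offsets rather than word indexes is a design choice made here so that a user interface could highlight the exact span.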
the dataset
The dataset
EFCamDAT
● 551,036 written essays
○ 2,897,788 sentences
○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers
The dataset
Examples of essay topics
● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature
The dataset
Examples of learners' nationalities
● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...
The dataset
EFCamDAT
● EF-Cambridge Open Language Database
● Partnership between:
○ University of Cambridge (Department of Theoretical and Applied Linguistics)
■ EF-Research Unit
○ EF Education First
● Data collected from Englishtown
○ EF learning environment (online English school)
The dataset
Types of mistakes annotated
● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentence
● C: capitalization
● D: delete
● EX: expression or idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NSW: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order
The dataset
● 10 most common mistakes [chart]
The dataset
How to get it?
● https://corpus.mml.cam.ac.uk/efcamdat1/access.php
Licence:
● Non-commercial research use
● Commercial use subject to a separate agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf
The dataset
Once you’ve registered
● You can filter the dataset
● You can export the dataset to an XML file
a bunch of Python scripts
A bunch of (Python) scripts
Disclaimer
Code developed using the Extreme Go Horse methodology during hack days
The scripts are a proof of concept (POC) and lack:
- Proper automated tests
- Proper code design & API
- Documentation
https://gist.github.com/banaslee/4147370
A bunch of (Python) scripts
What do they do?
1. Fix the XML files
2. Convert the XML files into good-looking JSON files
3. Implement heuristics to identify some common English mistakes
○ For now: spelling, capitalization and articles
4. Analyse how effective the heuristics are
A bunch of (Python) scripts
How to download them?
● https://github.com/ef-ctx/righter
Licence
● Apache License, Version 2.0
Hands on
Mistakes identification
We wrote functions that apply heuristics and rules to detect mistakes related to:
1. Spelling
2. Capitalization
3. Article
Efficiency
To check how well the heuristics perform, we:
● Wrote a few unit tests
● Evaluated, before committing any change, how close we got to the teachers' annotations, using:
○ Precision
○ Recall
○ F-score
● Printed a side-by-side comparison of what the teacher annotated and what the algorithm identified
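A minimal sketch of such an evaluation for a single essay follows. It assumes that each mistake is represented as a (position, kind) pair and that a detection counts only when it matches an annotation exactly; the function name and the convention of returning 0.0 for empty sets are assumptions, not necessarily what the scripts do.

```python
def precision_recall_fscore(annotated, detected):
    """Compare the teacher's annotations with the algorithm's detections."""
    annotated, detected = set(annotated), set(detected)
    true_positives = len(annotated & detected)
    # Precision: share of detections that the teacher also annotated
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of the teacher's annotations that were detected
    recall = true_positives / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


teacher = {(0, "capitalization"), (10, "spelling"), (25, "spelling")}
algorithm = {(0, "capitalization"), (10, "spelling"), (40, "spelling"), (50, "article")}
precision, recall, fscore = precision_recall_fscore(teacher, algorithm)
print(f"precision={precision:.4f} recall={recall:.4f} f-score={fscore:.4f}")
```

Here two of the four detections are correct (precision 0.5) and two of the three annotations are found (recall of about 0.67).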
Efficiency
https://en.wikipedia.org/wiki/Precision_and_recall
Efficiency
F-Score
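For reference, the standard definitions are given below, assuming the balanced F1 variant of the F-score (the slide does not say which variant was used). TP is the number of mistakes flagged by both teacher and algorithm, FP the number flagged only by the algorithm, and FN the number flagged only by the teacher.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
```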
Spelling
Spelling: heuristics
1. Remove Unicode symbols (e.g. —)
2. Transform diacritics (e.g. é -> e)
○ This is particularly important for names
3. Remove punctuation (e.g. !, ?, .)
4. Check if the word:
○ Is in the dictionary (case-insensitive)
○ Has digits
○ Is in the names file (created with domain-specific names, e.g. Englishtown)
5. If none of these is true, the word is probably misspelled
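The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: `DICTIONARY` and `NAMES` are tiny stand-ins for the real word list and names file, all function names are hypothetical, and punctuation is only stripped from the ends of each token.

```python
import string
import unicodedata

# Tiny stand-ins for the real dictionary and names file (assumptions)
DICTIONARY = {"my", "name", "is", "i", "am", "nine", "years", "old", "cafe"}
NAMES = {"englishtown", "crystal"}


def normalise(word):
    """Steps 1-3: fold diacritics, drop other non-ASCII symbols and punctuation."""
    # NFKD splits "é" into "e" plus a combining accent
    decomposed = unicodedata.normalize("NFKD", word)
    # Dropping non-ASCII characters removes the accent and symbols such as dashes
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Only leading and trailing punctuation is removed (simplification)
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(word):
    """Steps 4-5: a word passes if it is known, a name, or contains digits."""
    cleaned = normalise(word)
    if not cleaned:
        return False  # the token consisted only of symbols
    lowered = cleaned.lower()
    if lowered in DICTIONARY or lowered in NAMES:
        return False
    if any(char.isdigit() for char in cleaned):
        return False
    return True


def find_misspellings(text):
    return [word for word in text.split() if is_probably_misspelled(word)]


print(find_misspellings("my name is Crystal, i am nine yaers old!"))  # ['yaers']
```

A dictionary lookup like this cannot catch real-word errors: "tow" and "form" from the sample essay are valid English words, so they would pass.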
Spelling: results
Summary:
● total essays: 85,629
● mean precision: 0.7128 (std: 0.3580)
● mean recall: 0.6535 (std: 0.4212)
Spelling: precision and recall per learner level
Spelling: F-score per nationality
Capitalization
Capitalization: heuristics
1. Check if the word starts a sentence
○ Split on punctuation (!, ., ?, etc.)
2. Check if the word is a known capitalized word
○ First person (I)
○ Day of the week
○ Month
○ Language (e.g. English, Spanish, French, etc.)
○ Country
○ Names (selected from the corpus to match context-specific names)
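A minimal sketch of these two checks is shown below. The set `ALWAYS_CAPITALISED` is a small illustrative sample rather than the real lists of days, months, languages, countries and names, and the function name is hypothetical.

```python
import re

# Illustrative sample only; the real word lists are much longer (assumption)
ALWAYS_CAPITALISED = {"i", "monday", "sunday", "march", "english",
                      "spanish", "french", "china", "slovakia"}
SENTENCE_END = {".", "!", "?"}


def find_capitalisation_mistakes(text):
    """Return (position, word) pairs for words that should start with a capital."""
    mistakes = []
    starts_sentence = True
    # Tokens are either words (optionally with apostrophes) or sentence-ending marks
    for match in re.finditer(r"[A-Za-z]+(?:'[A-Za-z]+)*|[.!?]", text):
        token = match.group()
        if token in SENTENCE_END:
            starts_sentence = True
            continue
        needs_capital = starts_sentence or token.lower() in ALWAYS_CAPITALISED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        starts_sentence = False
    return mistakes


print(find_capitalisation_mistakes("hi. my name is crystal. i am from china"))
```

Note that "crystal" is not flagged here because it is missing from the sample list, and that splitting on every full stop misfires on abbreviations.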
Capitalization: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
Capitalization: precision and recall per learner level
Capitalization: F-score per nationality
Articles
Articles: heuristics
1. Flag uses of "a" before words starting with a vowel
2. Flag uses of "an" before words starting with a consonant
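Both rules can be sketched with a single regular expression. The function name is hypothetical, and the check looks only at the first letter of the following word, so it is a spelling-based approximation of a rule that really depends on pronunciation.

```python
import re

VOWELS = set("aeiou")


def find_article_mistakes(text):
    """Return (position, article, suggestion) for each suspicious a/an."""
    mistakes = []
    # An article, whitespace, then the first letter of the following word
    pattern = re.compile(r"\b(an?)\s+([A-Za-z])", flags=re.IGNORECASE)
    for match in pattern.finditer(text):
        article = match.group(1)
        next_starts_with_vowel = match.group(2).lower() in VOWELS
        uses_an = article.lower() == "an"
        # "an" should go with a vowel and "a" with a consonant
        if uses_an != next_starts_with_vowel:
            suggestion = "a" if uses_an else "an"
            mistakes.append((match.start(1), article, suggestion))
    return mistakes


print(find_article_mistakes("I have a apple and an book, a dog and an egg"))
```

Because only the spelling is inspected, correct phrases such as "an hour" or "a university" would be flagged by this sketch.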
Articles: results
Summary:
● total items: 47,054
● average precision: 0.9724 (std: 0.1602)
● average recall: 0.0718 (std: 0.2463)
Articles: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
Articles: precision and recall per learner level
Articles: F-score per nationality
Overview
Mistakes identification per learner level
● Efficiency of current heuristics
ideas
Next steps
● Clean up code
● Spelling
○ Use probabilistic models
■ http://norvig.com/spell-correct.html
● Capitalization
○ POS-tagging to identify names of people, organizations, places
● Articles
○ POS-tagging
○ Deal with plurals
○ Define heuristics for dealing with definite articles (the)
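As an illustration of the probabilistic idea for spelling, here is a minimal sketch in the spirit of the essay linked above: generate every string one edit away from the word and pick the known candidate with the highest frequency. `WORD_COUNTS` is a toy table invented for this example; a real model would count words in a large corpus.

```python
from collections import Counter
import string

# Toy frequencies invented for illustration (assumption)
WORD_COUNTS = Counter({"two": 50, "tow": 2, "from": 80, "form": 30, "people": 40})


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    return {word for word in words if word in WORD_COUNTS}


def correction(word):
    """Most frequent known candidate, preferring the word itself if it is known."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda candidate: WORD_COUNTS[candidate])


print(correction("peopel"))
print(correction("frm"))
```

Unlike the dictionary lookup, this approach also proposes a correction, which relates to the later item about an algorithm for proposing the correct forms. It still leaves real-word errors such as "tow" untouched.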
Next steps
● Add to the user interface of EF Class
● Collect feedback from end users (teachers)
● Algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing) so that end users can tell us whether the suggestions are good or not, and we can learn from them
Ideas
PyCon SK is not over...
Join the Coding Dojo tomorrow! (13/03)
● Sunday (13/03)
● 9:00 - 12:00
● Organizer:
○ Rodolfo Carvalho
http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo
https://www.youtube.com/watch?v=vqnwQ3oVM1M