Automatic English Text Correction
Tatiana Al-Chueyr Martins (@tati_alchueyr)
Bratislava, 12 March 2016, PyCon SK 2016

Apr 14, 2017

Page 1: Automatic English text correction

Automatic English Text Correction@tati_alchueyr

Automatic English Text Correction
Tatiana Al-Chueyr Martins (@tati_alchueyr)

Bratislava, 12 March 2016

PyCon SK 2016

Page 2: Automatic English text correction

tati.__doc__

● Brazilian
● Lives in London (United Kingdom)
● Pythonista and open source activist
● Computer engineer, graduated from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
  ○ Backend & DevOps leader of the CTX Team

Page 3: Automatic English text correction

help(EF)

● EF: Education First
● International education company
  ○ Language training
  ○ Educational travel
  ○ Academic degree programmes
● Founded in 1965 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family

Page 4: Automatic English text correction

help(EF.CTX)

● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform

Page 5: Automatic English text correction

CTX.__team__

● CTX Team
● Team travel
● Malta, November 2015

Page 6: Automatic English text correction

CTX.backend

● Rafael Cunha de Almeida and I, trying to master Italian cuisine
● London, February 2016

● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)

Page 7: Automatic English text correction

objective

● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools that can help deal with this challenge

Page 8: Automatic English text correction

the challenge

Page 9: Automatic English text correction

The challenge

Assessing (evaluating) students’ activities and exercises can be:

● Laborious
● Repetitive
● Slow
● In other words... painful!

https://classteaching.files.wordpress.com/2013/10/marking-pile.gif

Page 10: Automatic English text correction

English Text Correction

hi ,

my name is crystal.im nine years old. im form china,im live in jiang xi xing yu .

there are tow people in my family: my mother, my father.

my mother is thirty-six years old, my father is thirty-seven years old

EFCamDat - C219811

Page 14: Automatic English text correction

English Text Correction

hi ,

my name is crystal.im nine years old. im form china,im live in jiang xi xing yu .

there are tow people in my family: my mother, my father.

my mother is thirty-six years old, my father is thirty-seven years old

EFCamDat - C219811

capitalization

spelling

verb tense

There are “only” 89 writings left to assess this week...

Page 15: Automatic English text correction

The challenge

Implement algorithms and tools that can help teachers assess essays written in English

Example of such a feature, already available in several applications (including LibreOffice, Google Apps, MS Word):

● Highlight (potential) mistakes while the user types in a text area

Page 16: Automatic English text correction

The challenge

● Input:
  ○ English text
● Output:
  ○ List of items containing:
    ■ Position in text
    ■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling)
    ■ Proposed correction
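
The output described above could be represented as follows. This is only a sketch: the `Mistake` class and its field names are hypothetical, not the format used by the scripts presented later, and the sample text is a fragment of the learner essay shown earlier.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Mistake:
    """One potential mistake found in a text (hypothetical structure)."""
    start: int                 # index of the first character of the flagged span
    end: int                   # index one past the last character of the span
    kind: str                  # e.g. "spelling", "capitalization", "article"
    suggestion: Optional[str]  # proposed correction, or None if unknown


text = "im form china"
mistakes = [
    Mistake(start=0, end=2, kind="spelling", suggestion="I'm"),
    Mistake(start=3, end=7, kind="spelling", suggestion="from"),
    Mistake(start=8, end=13, kind="capitalization", suggestion="China"),
]

# Show each flagged span next to its description.
for m in mistakes:
    print(text[m.start:m.end], asdict(m))
```

Character offsets are one possible choice for "position in text"; word indexes would work as well, as long as the teacher annotations use the same unit.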

Page 17: Automatic English text correction

the dataset

Page 18: Automatic English text correction

The dataset

EFCamDAT

● 551,036 written essays
  ○ 2,897,788 sentences
  ○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers

Page 19: Automatic English text correction

The dataset

Examples of essay topics

● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature

Page 20: Automatic English text correction

The dataset

Examples of learners’ nationalities

● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...

Page 21: Automatic English text correction

The dataset

EFCamDAT

● EF-Cambridge Open Language Database
● Partnership between:
  ○ University of Cambridge (Department of Theoretical and Applied Linguistics)
    ■ EF-Research Unit
  ○ EF Education First
● Data collected from Englishtown
  ○ EF learning environment (online English school)

Page 22: Automatic English text correction

The dataset

Types of mistakes annotated

● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentence
● C: capitalization
● D: delete
● EX: expression of idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NWS: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order
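
As a sketch of how these codes can be used, the snippet below keeps a few of them in a dictionary and counts how often each one occurs. The list of annotations is made up for illustration and does not come from the dataset.

```python
from collections import Counter

# A subset of the error codes listed above.
ERROR_CODES = {
    "AR": "article",
    "C": "capitalization",
    "PU": "punctuation",
    "SP": "spelling",
    "VT": "verb tense",
    "WC": "word choice",
}

# Made-up annotation codes from a handful of essays, for illustration only.
annotations = ["SP", "C", "SP", "AR", "C", "SP", "VT"]

# Count the codes and print the most frequent ones with a readable label.
counts = Counter(annotations)
for code, n in counts.most_common(3):
    print(f"{ERROR_CODES[code]:<15} {n}")
```

Running the same count over every annotation in the corpus is one way to obtain a ranking of the most common kinds of mistake.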

Page 23: Automatic English text correction

The dataset

● 10 most common mistakes

Page 24: Automatic English text correction

The dataset

How to get it?

● https://corpus.mml.cam.ac.uk/efcamdat1/access.php

Licence:

● Non-commercial research use
● Commercial use subject to prior agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf

Page 27: Automatic English text correction

The dataset

Once you’ve registered, you can:

● Filter the dataset
● Export the dataset to an XML file

Page 29: Automatic English text correction

a bunch of Python scripts

Page 30: Automatic English text correction

A bunch of (Python) scripts

Disclaimer

The code was developed using the Extreme Go Horse methodology during hackdays

The scripts are a proof of concept (POC) and lack:

- Proper automated tests
- Proper code design & API
- Documentation

https://gist.github.com/banaslee/4147370

Page 31: Automatic English text correction

A bunch of (Python) scripts

What do they do?

1. Fix the XML files
2. Convert the XML files into good-looking JSON files
3. Implement heuristics to identify some common English mistakes
   ○ For now: spelling, capitalization and articles
4. Analyse how efficient the heuristics are
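
Step 2 above might look roughly like the sketch below. The XML layout is an assumption: the element and attribute names (`writings`, `writing`, `learner`, `level`, `nationality`, `text`) are hypothetical and are not taken from the real export, and the function is not the one from the repository.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical export: the element and attribute names below are assumptions,
# not the real EFCamDat schema.
SAMPLE_XML = """
<writings>
  <writing id="1" level="1">
    <learner nationality="cn"/>
    <text>im form china</text>
  </writing>
</writings>
"""


def writings_to_json(xml_string):
    """Convert every <writing> element into a plain dict and dump as JSON."""
    root = ET.fromstring(xml_string)
    essays = []
    for writing in root.iter("writing"):
        essays.append({
            "id": writing.get("id"),
            "level": int(writing.get("level")),
            "nationality": writing.find("learner").get("nationality"),
            "text": writing.findtext("text", default="").strip(),
        })
    return json.dumps(essays, indent=2, sort_keys=True)


print(writings_to_json(SAMPLE_XML))
```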

Page 32: Automatic English text correction

A bunch of (Python) scripts

How to download them?

● https://github.com/ef-ctx/righter

Licence

● Apache version 2.0

Page 33: Automatic English text correction

Hands on

Page 34: Automatic English text correction

Mistakes identification

We wrote functions that apply heuristics and rules to detect mistakes related to:

1. Spelling
2. Capitalization
3. Article

Page 35: Automatic English text correction

Efficiency

To check their efficiency:

● We wrote a few unit tests
● Before committing any change, we evaluated how close we got to the teachers’ annotations, using:
  ○ Precision
  ○ Recall
  ○ F-score
● We printed a side-by-side comparison of what the teacher annotated and what the algorithm identified
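
A minimal sketch of such an evaluation, assuming mistakes are compared as sets of word positions. The function name, the sample positions and the choice of returning 0.0 when a set is empty are assumptions made for this example.

```python
def precision_recall_f1(expected, found):
    """Compare two sets of flagged items (e.g. word positions).

    expected: what the teacher annotated; found: what the heuristic flagged.
    """
    expected, found = set(expected), set(found)
    true_positives = len(expected & found)
    # Convention assumed here: an empty set yields a score of 0.0.
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: the teacher marked words 2, 5 and 9; the heuristic flagged 2, 5 and 7.
teacher = {2, 5, 9}
algorithm = {2, 5, 7}
p, r, f = precision_recall_f1(teacher, algorithm)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Side-by-side view of both sets.
for position in sorted(teacher | algorithm):
    print(position,
          "teacher" if position in teacher else "-",
          "algorithm" if position in algorithm else "-")
```

Precision drops when the heuristic flags things the teacher did not mark; recall drops when it misses things the teacher did mark.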

Page 36: Automatic English text correction

Efficiency

https://en.wikipedia.org/wiki/Precision_and_recall

Page 37: Automatic English text correction

Efficiency

F-Score
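
For reference, the standard definitions are given below, where TP, FP and FN are the numbers of true positives, false positives and false negatives. The transcript does not say which F-score variant was used; the balanced F1 is assumed here.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
\]
```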

Page 38: Automatic English text correction

Spelling

Page 39: Automatic English text correction

Spelling: heuristics

1. Remove unicode symbols (e.g. —)
2. Transform diacritics (e.g. é -> e)
   ○ This is particularly important for names
3. Remove punctuation (e.g. !, ?, .)
4. Check if the word:
   ○ Is in the dictionary (case insensitive)
   ○ Has digits
   ○ Is in the names file (created with domain-specific names, e.g. Englishtown)
5. If none of these is true, the word is probably misspelled
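
The five steps above can be sketched as follows. This is not the code from the repository: the tiny `DICTIONARY` and `NAMES` sets stand in for a real word list and names file, and punctuation is only stripped from the edges of each word, which is a simplification.

```python
import string
import unicodedata

# Tiny stand-ins for a real dictionary and the domain-specific names file.
DICTIONARY = {"my", "name", "is", "from", "there", "are", "two", "people", "cafe"}
NAMES = {"englishtown", "crystal"}


def normalize(word):
    """Steps 1-3: drop symbols, strip diacritics, remove punctuation."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped
    # together with any other non-ASCII symbol.
    decomposed = unicodedata.normalize("NFKD", word)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Simplification: punctuation is removed at the edges only.
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(word):
    """Steps 4-5: a word is suspicious if no check accepts it."""
    cleaned = normalize(word)
    if not cleaned:
        return False  # nothing left after cleaning, e.g. a lone dash
    lowered = cleaned.lower()
    if lowered in DICTIONARY or lowered in NAMES:
        return False
    if any(ch.isdigit() for ch in cleaned):
        return False
    return True


sentence = "my name is crystal. there are tow people from Englishtown \u2014 caf\u00e9!"
print([w for w in sentence.split() if is_probably_misspelled(w)])
```

With a full English dictionary, "tow" would be accepted as a valid word, which shows a limit of this kind of lookup: it cannot catch a misspelling that happens to be another real word.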

Page 40: Automatic English text correction

Spelling: results

Summary:

● total essays: 85,629

● mean precision: 0.7128 (std: 0.3580)

● mean recall: 0.6535 (std: 0.4212)
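
A summary of this shape can be produced by scoring each essay separately and then aggregating. The sketch below assumes that this is how the figures were obtained; the per-essay values are made up, and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev

# Made-up per-essay precision values, for illustration only.
precision_per_essay = [1.0, 0.5, 0.0, 0.75]

summary = {
    "total essays": len(precision_per_essay),
    "mean precision": mean(precision_per_essay),
    "std": pstdev(precision_per_essay),  # population std; an assumption
}
print(summary)
```

Per-essay scores often sit at the extremes (0 or 1) when an essay has only one or two annotated mistakes, which is one plausible reason for a large standard deviation.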

Page 41: Automatic English text correction

Spelling: precision and recall per learner level

Page 42: Automatic English text correction

Spelling: F-score per nationality

Page 43: Automatic English text correction

Capitalization

Page 44: Automatic English text correction

Capitalization: heuristics

1. Check if the word starts a sentence
   ○ Split on punctuation (!, ., ?, etc.)
2. Check if the word is a known capitalized word
   ○ First person (I)
   ○ Day of the week
   ○ Month
   ○ Language (e.g. English, Spanish, French)
   ○ Country
   ○ Names (selected from the corpus to match context-specific names)
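
A minimal sketch of these two checks, assuming a simple regular-expression tokenizer and a very small word list. Both are placeholders chosen for this example, not the lists used in the project.

```python
import re

# Small illustrative word list; real lists would be much longer.
ALWAYS_CAPITALIZED = {
    "i", "monday", "sunday", "january", "march",
    "english", "spanish", "french", "china", "brazil",
}
SENTENCE_END = {".", "!", "?"}


def capitalization_mistakes(text):
    """Return (position, word) pairs that should start with a capital letter."""
    mistakes = []
    sentence_start = True
    # Tokens are either words or one of the sentence-ending marks.
    for match in re.finditer(r"[A-Za-z']+|[.!?]", text):
        token = match.group()
        if token in SENTENCE_END:
            sentence_start = True
            continue
        needs_capital = sentence_start or token.lower() in ALWAYS_CAPITALIZED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        sentence_start = False
    return mistakes


print(capitalization_mistakes("hi. my name is crystal. i am from china."))
```

The name "crystal" is not flagged because it is missing from the word list, which illustrates why a list of context-specific names matters for this heuristic.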

Page 45: Automatic English text correction

Capitalization: results

Summary:

● total essays: 76,980

● mean precision: 0.5714 (std: 0.4005)

● mean recall: 0.5550 (std: 0.4472)

Page 46: Automatic English text correction

Capitalization: precision and recall per learner level

Page 47: Automatic English text correction

Capitalization: F-score per nationality

Page 48: Automatic English text correction

Articles

Page 49: Automatic English text correction

Articles: heuristics

1. Flag “a” used before a word that starts with a vowel

2. Flag “an” used before a word that starts with a consonant
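
These two rules can be sketched with a regular expression. The function name and the returned tuple layout (position, found, suggested) are choices made for this example.

```python
import re

VOWELS = "aeiou"


def article_mistakes(text):
    """Flag "a" before a vowel letter and "an" before a consonant letter."""
    mistakes = []
    for match in re.finditer(r"\b(a|an)\s+([A-Za-z]+)", text, flags=re.IGNORECASE):
        article, next_word = match.group(1).lower(), match.group(2)
        starts_with_vowel = next_word[0].lower() in VOWELS
        if article == "a" and starts_with_vowel:
            mistakes.append((match.start(), f"a {next_word}", f"an {next_word}"))
        elif article == "an" and not starts_with_vowel:
            mistakes.append((match.start(), f"an {next_word}", f"a {next_word}"))
    return mistakes


print(article_mistakes("I ate a apple and an banana in a hour."))
```

Because the rule looks at letters rather than sounds, "a hour" is not flagged even though "an hour" is correct. Missing articles and misuse of "the" are out of its reach as well.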

Page 50: Automatic English text correction

Articles: results

Summary:

● total items: 47,054

● average precision: 0.9724 (std: 0.1602)

● average recall: 0.0718 (std: 0.2463)

Page 51: Automatic English text correction

Articles: results

Summary:

● total essays: 76,980

● mean precision: 0.5714 (std: 0.4005)

● mean recall: 0.5550 (std: 0.4472)

Page 52: Automatic English text correction

Articles: precision and recall per learner level

Page 53: Automatic English text correction

Articles: F-score per nationality

Page 54: Automatic English text correction

Overview

Page 55: Automatic English text correction

Mistakes identification per learner level

● Efficiency of current heuristics

Page 56: Automatic English text correction

ideas

Page 57: Automatic English text correction

Next steps

● Clean up code
● Spelling
  ○ Use probabilistic models
    ■ http://norvig.com/spell-correct.html
● Capitalization
  ○ POS-tagging to identify names of people, organizations, places
● Articles
  ○ POS-tagging
  ○ Deal with plurals
  ○ Define heuristics for dealing with definite articles (the)
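
As an illustration of the probabilistic idea for spelling, here is a sketch loosely in the spirit of the linked article, not a copy of it: candidates one edit away from a word are ranked by how frequent they are. The word counts are made up, and `suggestions` is a hypothetical helper.

```python
from collections import Counter

# Made-up word frequencies; a real model would count words in a large corpus.
WORD_COUNTS = Counter({"two": 50, "tow": 1, "town": 20, "from": 80, "form": 30})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def suggestions(word, limit=3):
    """Known words within one edit of the input, most frequent first."""
    candidates = {w for w in edits1(word) | {word} if w in WORD_COUNTS}
    return sorted(candidates, key=WORD_COUNTS.get, reverse=True)[:limit]


print(suggestions("tow"))
```

Ranking by frequency lets the checker prefer "two" over "tow" and "from" over "form", the two real-word slips in the sample essay, although deciding when to override a valid word would still need context.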

Page 58: Automatic English text correction

Next steps

● Add to the user interface of EF Class
● Collect feedback from end users (teachers)
● Algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing), so that end users can tell us whether the suggestions are good or not, and we can learn from them

Page 59: Automatic English text correction

Ideas

Page 60: Automatic English text correction

PyCon SK is not over...

Page 61: Automatic English text correction

Join the Coding Dojo tomorrow! (13/03)

● Sunday (13/03)
● 9:00 - 12:00
● Organizer:
  ○ Rodolfo Carvalho

http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo
https://www.youtube.com/watch?v=vqnwQ3oVM1M

Page 62: Automatic English text correction

Questions?

Thanks :)

@tati_alchueyr