Bringing Data Science to the Speakers of Every Language Robert Munro, PhD CEO, Idibon ACM Conference on Knowledge Discovery and Data Mining (KDD), 2014
Page 1: Bringing Data Science to the Speakers of Every Language

Bringing Data Science to the Speakers of

Every Language

Robert Munro, PhD

CEO, Idibon

ACM Conference on Knowledge Discovery and Data Mining

(KDD), 2014

Page 2: Bringing Data Science to the Speakers of Every Language

idibon

About me: technology and global development

CEO, Idibon

Global text analytics in

50+ languages

Working with leaders in

industry & social good

Industry: CTO / CIO

Energy infrastructure in Liberia and

Sierra Leone

Global epidemic tracking

Crowdsourcing and natural language

processing for disaster response

Other

Ph.D. in NLP from Stanford

Bicycled 20+ countries

Page 3: Bringing Data Science to the Speakers of Every Language

Recommendations for language

processing for social good

Look beyond English

Inherent benefit in understanding and

supporting speakers of every language

Employ people in those languages

Crowdsourced workers speak 100s of

languages, and want to use them

Embrace the variation

You can’t rely on consistent spellings, but

you can learn to model the diversity

Page 4: Bringing Data Science to the Speakers of Every Language

How many languages are in the connected world?

[Chart: # of languages (y-axis) by Year (x-axis). Labels shown: 5, 5, 5, 5, 5, 5, 4.5, 4; 50, 1500, 5000, 2000, 1400, 720, 540, 500]

How has this changed?

Page 7: Bringing Data Science to the Speakers of Every Language

How many languages are in the connected world?

[Chart: # of languages (y-axis) by Year (x-axis). Labels shown: 5, 5, 5, 5, 5, 5, 4.5, 4; 50, 1500, 5000, 2000, 1400, 720, 540, 500]

Putting a phone in the hands of everyone on the planet is the easy part

Understanding everyone is going to be more complicated

Page 8: Bringing Data Science to the Speakers of Every Language

Every human

communication this year

Source: Ethnologue, Nationalencyklopedin

Page 9: Bringing Data Science to the Speakers of Every Language

7% of our communications are

digital; most are still direct

spoken language

Page 10: Bringing Data Science to the Speakers of Every Language

If every online picture were worth a

thousand words, pictures would double

social media.

Every

picture

Page 11: Bringing Data Science to the Speakers of Every Language

Every 3 months, the world's text

messages exceed the word count

of every book.

Every

book. Ever.

Source: Google Books

Page 13: Bringing Data Science to the Speakers of Every Language

Print communication is smaller

than anything shown.

Ditto any one social network.

Page 14: Bringing Data Science to the Speakers of Every Language

The Twitter “firehose” is about

the size of the dot above the “i” in

English.

Beyond the processing

capacity of most organizations.

Might not be a representative

sample of all human activity for

your area of interest.

Page 15: Bringing Data Science to the Speakers of Every Language

There are more than 6,000 other

languages.

Only the top 1% are shown.

Page 16: Bringing Data Science to the Speakers of Every Language

No language from the Americas

made the cut.

Quechua

Page 17: Bringing Data Science to the Speakers of Every Language

Email spam would be larger than

every block except spoken

Mandarin (官话).

Source: Mashable

Page 18: Bringing Data Science to the Speakers of Every Language

Short messages (SMS and IM)

make up 2% of the world’s

communications.

The largest and most

linguistically diverse form of

written communication that has

ever existed.

# PhDs focused on processing

large volumes of short messages

in low resource languages?

1

Page 19: Bringing Data Science to the Speakers of Every Language

If the Facebook “like” were a one-word

language, it would be in the top 5% of

languages by word count.

Page 20: Bringing Data Science to the Speakers of Every Language

Your browser probably won't

show Sundanese script

(ᮘᮘᮘᮘᮘᮘᮘ)

Page 21: Bringing Data Science to the Speakers of Every Language

Sundanese speakers outnumber the

populations of New York, London,

Tokyo and Moscow.

Combined.

Page 22: Bringing Data Science to the Speakers of Every Language

You misread “Sundanese” as

“Sudanese”, which is a variety of

Arabic.

We have a blind spot for knowing

about the existence of languages.

Page 23: Bringing Data Science to the Speakers of Every Language

This is the breakdown of

languages that most of our data is

moving towards

Page 24: Bringing Data Science to the Speakers of Every Language

January 12, 2010

An earthquake struck Haiti on January 12, 2010

Most local services failed, but most cell-towers remained functional.

Page 25: Bringing Data Science to the Speakers of Every Language

Messages start streaming in

Page 27: Bringing Data Science to the Speakers of Every Language

Mission 4636

Message translated,

categorized & geolocated

Location is refined &

actionable items are identified

“Fanm gen tranche pou fè yon

pitit nan Delmas 31”

Undergoing children delivery

Delmas 31

18.495746829274168, -72.31849193572998

Emergency

(18.4957, -72.3185)
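As a rough illustration of the record this workflow produces, the Python sketch below holds one message with its translation, category and coordinates, and rounds the coordinates to the four decimal places shown on the slide. The class and field names are assumptions made for illustration, not Mission 4636's actual data model, and the longitude is taken as negative (west), matching the rounded pair on the slide.

```python
from dataclasses import dataclass


@dataclass
class ProcessedMessage:
    """One incoming SMS after translation, categorization and geolocation.

    Field names are hypothetical; they mirror the labels on the slide.
    """
    original: str      # message as sent, in Haitian Kreyòl
    translation: str   # volunteer's English translation
    category: str      # e.g. "Emergency"
    place: str         # place name mentioned in the message
    lat: float
    lon: float

    def rounded_location(self, digits: int = 4) -> tuple:
        """Coordinates rounded for display to responders."""
        return (round(self.lat, digits), round(self.lon, digits))


msg = ProcessedMessage(
    original="Fanm gen tranche pou fè yon pitit nan Delmas 31",
    translation="Undergoing children delivery",
    category="Emergency",
    place="Delmas 31",
    lat=18.495746829274168,
    lon=-72.31849193572998,
)
print(msg.rounded_location())  # (18.4957, -72.3185)
```

Keeping the original text next to the translation lets a later reviewer refine the location or category without losing what the sender actually wrote.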

Page 28: Bringing Data Science to the Speakers of Every Language

Global collaboration

2,000 volunteers, with the work then transferred to paid workers in Haiti

Page 29: Bringing Data Science to the Speakers of Every Language

Lopital Sacre-Coeur ki nan vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la.

“Sacre-Coeur Hospital which located in this village of Okap is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.”

Page 32: Bringing Data Science to the Speakers of Every Language

Local knowledge

Workers collaborating to find locations:

Dalila: I need Thomassin Apo please

Apo: Kenscoff Route: Lat: 18.495746829274168, Long:-72.31849193572998

Apo: This Area after Petion-Ville and Pelerin 5 is not on Google Map. We have no streets name

‘here’ = anywhere

Feedback from responders:

"just got emergency SMS, child delivery, USCG are acting, and, the GPS coordinates of the location we got from someone of your team were 100% accurate!"

[Figure labels: (18.4957, -72.3185); Apo; Dalila; Haiti responders]

The ability for someone to make a real-time difference at any other place in the world:

Apo: I know this place like my pocket

Dalila: thank God u was here

Page 33: Bringing Data Science to the Speakers of Every Language

How do we automate processing the

world’s data?

Page 34: Bringing Data Science to the Speakers of Every Language

English

Generations of standardization in spelling and simple

morphology

Whole words suitable as features for NLP systems

Most other languages

Relatively complex morphology

Less (observed) standardization of spelling

More dialectal variation

Page 35: Bringing Data Science to the Speakers of Every Language

Haitian Kreyòl

No standard (wide-spread) spellings

More or less French spellings

More or less phonetic spellings

Frequent words (esp pronouns) are shortened and

compounded

Regional slang / abbreviations

Page 36: Bringing Data Science to the Speakers of Every Language

Haitian Kreyòl

mèsi, mesi,

mèci, merci

Cap-Haïtien

Kapayisyen
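One way to see how French-style and phonetic spellings can be compared is to fold each variant to a rough key. The Python sketch below does this for the four spellings of "thank you" on this slide. The function name `fold_spelling` and its rules are assumptions for illustration only; they are not a description of Kreyòl orthography or of the method used in the talk.

```python
import re
import unicodedata


def fold_spelling(word: str) -> str:
    """Map a spelling variant to a rough comparison key.

    The rules are illustrative assumptions, not a description of
    Haitian Kreyòl orthography.
    """
    # Decompose accented letters and drop the combining marks: "è" -> "e".
    decomposed = unicodedata.normalize("NFD", word.lower())
    key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # French-style "c" and phonetic "s" are treated alike: "merci" -> "mersi".
    key = key.replace("c", "s")
    # Drop an "r" directly before "s": "mersi" -> "mesi".
    key = re.sub(r"r(?=s)", "", key)
    return key


variants = ["mèsi", "mesi", "mèci", "merci"]
keys = {fold_spelling(v) for v in variants}
print(keys)  # {'mesi'}
```

Hand-written rules like these do not stretch to pairs such as Cap-Haïtien and Kapayisyen, which is one reason to model the variation from data rather than enumerate it.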

Page 37: Bringing Data Science to the Speakers of Every Language

The extent of the subword variation

>30 spellings of ‘odwala’ (‘patient’) in Chichewa

>50% of the variants of ‘odwala’ occur only once in the data used here

Affixes and incorporation

‘kwaodwala’ -> ‘kwa + odwala’

‘ndiodwala’ -> ‘ndi odwala’ (official ‘ngodwala’ not present)

Phonological/Orthographic

‘odwara’ -> ‘odwala’

‘ndiwodwala’ -> ‘ndi (w) odwala’
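The four variants listed above can be analysed with a very small pattern. The Python sketch below covers only the prefixes, the glide and the r/l alternation shown on this slide. The names `analyze`, `PREFIXES` and `PATTERN` are hypothetical, and the approach is a hand-written stand-in rather than the learned model described in the talk.

```python
import re

STEM = "odwala"
PREFIXES = ("kwa", "ndi")  # only the two prefixes shown on the slide

# Optional prefix, optional glide "w", then the stem.
PATTERN = re.compile(r"({})?w?({})".format("|".join(PREFIXES), STEM))


def analyze(word):
    """Return (prefix, stem) for a variant of 'odwala', or None if no match."""
    normalized = word.lower().replace("r", "l")  # odwara -> odwala
    match = PATTERN.fullmatch(normalized)
    if match is None:
        return None
    return (match.group(1) or "", match.group(2))


for variant in ["kwaodwala", "ndiodwala", "odwara", "ndiwodwala"]:
    print(variant, "->", analyze(variant))
```

With more than 30 attested spellings, a fixed pattern like this would miss most of them; it is meant only to make the two kinds of variation concrete.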

Page 38: Bringing Data Science to the Speakers of Every Language

Chichewa

The word odwala (‘patient’) in 600

text-messages in Chichewa and the

English translations

Page 39: Bringing Data Science to the Speakers of Every Language

Modeling the variation gives accurate results

1) Normalize spellings

2) Segment

3) Identify predictors

ndimmafuna manthwala (‘I currently need medicine’)

1) ndimafuna mantwala

2) ndi-ma-fun-a man-twala

3) ndi -fun man-twala (“I need medicine”)

Category = “Request for aid”

ndi kufuni mantwara (‘my want of medicine’)

1) ndi kufuni mantwala

2) ndi-ku-fun-i man-twala

3) ndi -fun man-twala (“I need medicine”)

Category = “Request for aid”

1 in 5 classification errors with raw messages

1 in 20 classification errors post-processing. Improves with scale.
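The three steps on this slide (normalize, segment, identify predictors) can be sketched end to end for the two example messages. The Python below is a toy under stated assumptions: the prefix and stem lists are taken only from the segmentations shown on the slide, the normalization rules are hand-written guesses, and all function names are hypothetical. The system in the talk models this variation from data rather than from a fixed lexicon.

```python
import re

# Toy lexicon taken only from the two example messages; a real system
# would learn these from data.
PREFIXES = ("ndi", "man", "ku", "ma")
STEMS = ("fun", "twala")
PREDICTORS = {"ndi", "fun", "man", "twala"}


def normalize(token):
    """Step 1: collapse spelling variants (illustrative rules only)."""
    token = token.lower().replace("r", "l").replace("th", "t")
    return re.sub(r"(.)\1", r"\1", token)  # ndimma... -> ndima...


def split_prefixes(text):
    """Greedily peel known prefixes off the front of `text`."""
    parts = []
    while text:
        for prefix in PREFIXES:
            if text.startswith(prefix):
                parts.append(prefix)
                text = text[len(prefix):]
                break
        else:
            parts.append(text)  # unknown remainder, kept whole
            text = ""
    return parts


def segment(token):
    """Step 2: split a token around the first known stem it contains."""
    for stem in STEMS:
        index = token.find(stem)
        if index != -1:
            suffix = token[index + len(stem):]
            return split_prefixes(token[:index]) + [stem] + ([suffix] if suffix else [])
    return [token]


def predictors(message):
    """Step 3: keep only the morphemes used as features."""
    morphemes = []
    for token in message.split():
        morphemes.extend(segment(normalize(token)))
    return [m for m in morphemes if m in PREDICTORS]


print(predictors("ndimmafuna manthwala"))  # ['ndi', 'fun', 'man', 'twala']
print(predictors("ndi kufuni mantwara"))   # ['ndi', 'fun', 'man', 'twala']
```

Both messages reduce to the same predictors, so a classifier trained on one has evidence for the other even though they share no whole word as written.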

Page 40: Bringing Data Science to the Speakers of Every Language

Comparison with English

[Chart: Accuracy: Micro-f (y-axis) against Percent of training data (x-axis)]

Page 41: Bringing Data Science to the Speakers of Every Language

Taking it to the world

Page 42: Bringing Data Science to the Speakers of Every Language

The benefits of understanding everyone

Human diseases eradicated in the last 75 years: smallpox

Increase in air travel in the last 75 years:

Page 43: Bringing Data Science to the Speakers of Every Language

Reports of ‘strange new illnesses’ pre-date official records

H1N1 (Swine Flu): months (10% of world infected)

HIV: decades (35 million infected)

H5N1 (Bird Flu): weeks (>50% fatal)

Page 44: Bringing Data Science to the Speakers of Every Language

…but the reports are in 1000s of languages

90% of ecological diversity

90% of linguistic diversity

в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа

مزيد من انفلونزا الطيور في مصر

香港现1例H5N1禽流感病例曾游上海南京等地

Page 45: Bringing Data Science to the Speakers of Every Language

Crowdsourcing, big data, and expert analysts

Most information is in plain language:

в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа

مزيد من انفلونزا الطيور في مصر

香港现1例H5N1禽流感病例曾游上海南京等地

Multiple skill and processing strategies required.

Page 46: Bringing Data Science to the Speakers of Every Language

Digital Disease Discovery

Big Data (machine learning: extraction, filtering & prioritization)

Global monitoring (safer world)

Analysts (several domain experts)

Crowdsourcing (thousands of native-language speakers)

Reports (millions per day: many languages, much noise)

Page 47: Bringing Data Science to the Speakers of Every Language

The impact of scalable monitoring

Found historical signals that pre-dated Google Flu Trends by 3 weeks, CDC by 5 … on CNN

How can we filter/model media-driven amplification?

“I’m Jacqui Jeras with today’s cold and flu report ... across the mid-Atlantic states, a little bit of an increase”

January 4, 2008 CNN Weather

Page 48: Bringing Data Science to the Speakers of Every Language

The impact of scalable monitoring

Tracked Ebola in Uganda 5 days before the World Health Organization.

“we were able to pull in much richer data from a larger number of sources, so we knew not just how many people were infected, but what kind of transport they took when they went from their village to the hospital in the nearest main town.”

Robert Munro, UN General Assembly on “Big Data and Global Development”

Age + gender + village + went to hospital.

This is personally identifying & could lead to persecution.

Open data would have a negative impact.
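The privacy concern can be made concrete by counting how many records share each combination of those fields. The Python sketch below does that over a handful of invented records. The data, field names and function name are all assumptions for illustration, not figures from the Uganda work.

```python
from collections import Counter

# Invented toy records, for illustration only.
records = [
    {"age": 34, "gender": "F", "village": "A", "went_to_hospital": True},
    {"age": 34, "gender": "F", "village": "A", "went_to_hospital": True},
    {"age": 61, "gender": "M", "village": "A", "went_to_hospital": True},
    {"age": 19, "gender": "F", "village": "B", "went_to_hospital": False},
]

QUASI_IDENTIFIERS = ("age", "gender", "village", "went_to_hospital")


def unique_combinations(rows, fields=QUASI_IDENTIFIERS):
    """Return the field combinations that match exactly one record."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


singled_out = unique_combinations(records)
print(len(singled_out), "of", len(records), "records are unique on these fields")
```

Any combination that matches a single record points to one person even with names removed, which is the risk the slide raises about releasing such data openly.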

Page 49: Bringing Data Science to the Speakers of Every Language

The impact of scalable monitoring

Tracked E. coli outbreak in Germany 2 days before ECDC.

How do we motivate information processing?

Margins are small: only for-profit big data and crowdsourcing can have a sustained impact

Page 50: Bringing Data Science to the Speakers of Every Language

Idibon’s current work

Hurricane Sandy

Idibon’s CTO ran FEMA’s Aerial

Damage Assessments.

We have >1,000,000 manual

tags on communications.

MIT Humanitarian Response Lab

Identifying reports about supply-line

interruptions.

Research data from a

combination of crowdsourcing

and natural language processing

Page 51: Bringing Data Science to the Speakers of Every Language

Recommendations for language

processing for social good

Look beyond English

Inherent benefit in understanding and

supporting speakers of every language

Employ people in those languages

Crowdsourced workers speak 100s of

languages, and want to use them

Embrace the variation

You can’t rely on consistent spellings, but

you can learn to model the diversity

Page 52: Bringing Data Science to the Speakers of Every Language

Further reading

Munro, Robert. 2013. Crowdsourcing and the Crisis-Affected Community: lessons learned and looking forward from Mission 4636. Journal of Information Retrieval 16(2).

Munro, Robert. 2012. Processing short message communications in low-resource languages. PhD Thesis, Stanford University. Stanford, CA

Munro, Robert and Christopher Manning. 2012. Short message communications: users, topics, and in-language processing. Second Annual Symposium on Computing for Development (ACM DEV 2012), Atlanta.

Munro, Robert, Lucky Gunasekara, Stephanie Nevins, Lalith Polepeddi and Evan Rosen. 2012. Tracking Epidemics with Natural Language Processing and Crowdsourcing. Spring Symposium for Association for the Advancement of Artificial Intelligence (AAAI), Stanford.

Munro, Robert. 2011. Subword and spatiotemporal models for identifying actionable information in Haitian Kreyol. Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), Portland.

Munro, Robert and Christopher Manning. 2010. Subword Variation in Text Message Classification. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles, CA.

Page 53: Bringing Data Science to the Speakers of Every Language

Thank you

Robert Munro, PhD

CEO, Idibon