Lecture 7: Learning from Massive Datasets
October 2013
Machine Learning for Language Technology
Marina Santini, Uppsala University, Department of Linguistics and Philology

Lecture 7: Learning from Massive Datasets

Jan 20, 2015



Marina Santini

In this lecture we explore how big datasets can be used with the Weka workbench, and what other issues are currently under discussion in the real world, e.g. big data applications, predictive linguistic analysis, new platforms and new programming languages.
Transcript
Page 1: Lecture 7: Learning from Massive Datasets

Lecture 7: Learning from Massive Datasets

October 2013

Machine Learning for Language Technology

Marina Santini, Uppsala University

Department of Linguistics and Philology

Page 2: Lecture 7: Learning from Massive Datasets

Lect. 7: Learning from Massive Datasets

2

Outline

- Watch the pitfalls
- Learning from massive datasets
- Data Mining, Text Mining – Text Analytics, Web Mining, Big Data
- Programming Languages and Frameworks for Big Data
- Big Textual Data & Commercial Applications
- Events, MeetUps, Coursera

Page 3: Lecture 7: Learning from Massive Datasets


Practical Machine Learning

Page 4: Lecture 7: Learning from Massive Datasets


Data Mining

Data mining is the extraction of implicit, previously unknown and potentially useful information from data (Witten and Frank, 2005).

Page 5: Lecture 7: Learning from Massive Datasets


Watch out! Machine Learning is not just about:

1. Finding data and blindly applying learning algorithms to it
2. Blindly comparing machine learning methods, without considering:
   1. Model complexity
   2. Representativeness of the training data distribution
   3. Reliability of the class labels

Remember: practitioners’ expertise counts!

Page 6: Lecture 7: Learning from Massive Datasets


Massive Datasets: Space and Time

Three ways to make learning feasible (the old way):
- Use a small subset
- Parallelization
- Process the data in chunks

The new way:
- Develop new algorithms with lower computational complexity
- Increase background knowledge
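The "data chunks" idea can be sketched in a few lines: a model whose parameters can be updated incrementally sees one chunk at a time, so the full dataset never has to fit in memory (Weka, for instance, offers updateable classifiers such as NaiveBayesUpdateable). The toy centroid classifier and the data below are invented for illustration.

```python
# Sketch: making learning feasible on massive data by processing it in
# chunks, never holding the full dataset in memory. A toy centroid
# classifier is updated incrementally; data and class names are invented.

def update_centroids(centroids, counts, chunk):
    """Fold one chunk of (features, label) pairs into running class means."""
    for features, label in chunk:
        c = centroids.setdefault(label, [0.0] * len(features))
        counts[label] = counts.get(label, 0) + 1
        n = counts[label]
        for i, x in enumerate(features):
            c[i] += (x - c[i]) / n   # incremental mean update
    return centroids, counts

def predict(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

# Stream two chunks instead of loading everything at once.
chunk1 = [([1.0, 1.0], "A"), ([9.0, 9.0], "B")]
chunk2 = [([1.2, 0.8], "A"), ([8.8, 9.2], "B")]
centroids, counts = {}, {}
for chunk in (chunk1, chunk2):
    update_centroids(centroids, counts, chunk)

print(predict(centroids, [1.1, 1.0]))  # point near class A's centroid
```

The same update rule works whether the chunks arrive from disk, a network stream, or parallel workers whose partial counts are merged afterwards.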

Page 7: Lecture 7: Learning from Massive Datasets


Domain Knowledge

Metadata:
- Semantic relations
- Causal relations
- Functional dependencies

Page 8: Lecture 7: Learning from Massive Datasets


Text Mining

- Actionable information
- Comprehensible information
- Problems

Text Analytics

Page 9: Lecture 7: Learning from Massive Datasets


Definition: Text Analytics

A set of NLP techniques that provide some structure to textual documents and help identify and extract important information.

Page 10: Lecture 7: Learning from Massive Datasets


Set of NLP (Natural Language Processing) techniques

Common components of a text-analytics package are:
- Tokenization
- Morphological Analysis
- Syntactic Analysis
- Named Entity Recognition
- Sentiment Analysis
- Automatic Summarization
- Etc.
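The first of these components, tokenization, can be sketched with a single regular expression. Real packages (NLTK, GATE, etc.) use much richer rules for abbreviations, clitics, URLs and so on; this only illustrates the idea.

```python
# A minimal sketch of the first stage of a text-analytics pipeline:
# tokenization with a simple regex that separates word tokens from
# punctuation marks.
import re

def tokenize(text):
    """Split text into word tokens and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
```

Even this toy version already shows why tokenization is non-trivial: an input like "doesn't" is split into three tokens, which a real morphological analyser would have to handle.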

Page 11: Lecture 7: Learning from Massive Datasets


NLP at Coursera (www.coursera.org)

Page 12: Lecture 7: Learning from Massive Datasets


NLP is pervasive. Ex: spell-checkers in

- Google Search
- Google Mail
- Facebook
- Office Word
- […]

Page 14: Lecture 7: Learning from Massive Datasets


Sentiment Analysis

Page 15: Lecture 7: Learning from Massive Datasets


Text Analytics Products and Frameworks

Commercial products:
- Attensity
- Clarabridge
- Temis
- Lexalytics
- Texify
- SAS
- SPSS
- IBM Cognos
- etc.

Open-source frameworks:
- GATE
- NLTK
- UIMA
- OpenNLP
- etc.

Page 16: Lecture 7: Learning from Massive Datasets


However… (I)

NLP tools and applications (both commercial and open source) are not perfect. Research is still very active in all NLP fields.

Page 17: Lecture 7: Learning from Massive Datasets


Ex: Syntactic Parser Connexor

What about parsing a tweet?

“My son, Ky/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair “ (Twitter Tutorial 1: How to Tweet Well)

Page 18: Lecture 7: Learning from Massive Datasets


Why NLP and Text Analytics for Text Mining?

Why is it important to know that a word is a noun, a verb, or the name of a brand?

Broadly speaking (think of these as features for a classification problem!):

- Nouns and verbs (a.k.a. content words): nouns are important for topic detection; verbs are important if you want to identify actions or intentions.
- Adjectives: sentiment identification.
- Function words (a.k.a. stop words): important for authorship attribution, plagiarism detection, etc.
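As a minimal sketch of "parts of speech as classification features", the toy lexicon below maps a handful of words to tags and counts the content-word categories in a sentence; such counts can then feed any classifier. The lexicon, tags and sentence are invented for illustration; a real system would use a trained tagger.

```python
# Sketch: turning part-of-speech information into features for a text
# classifier. The tiny tag lexicon is hand-made; real systems use a
# trained POS tagger from an NLP toolkit.
LEXICON = {
    "pizza": "NOUN", "lunch": "NOUN", "son": "NOUN",
    "asked": "VERB", "told": "VERB", "had": "VERB",
    "great": "ADJ", "terrible": "ADJ",
}

def pos_features(tokens):
    """Count content-word categories; the counts act as feature values."""
    feats = {"NOUN": 0, "VERB": 0, "ADJ": 0}
    for tok in tokens:
        tag = LEXICON.get(tok.lower())
        if tag in feats:
            feats[tag] += 1
    return feats

print(pos_features("Told him that I had great pizza for lunch".split()))
```

High noun counts would point a topic detector at "food", while the adjective count is the kind of signal a sentiment classifier exploits.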

Page 19: Lecture 7: Learning from Massive Datasets


However… (II)

At present, the main pitfall of many NLP applications is that they are not flexible enough to:

- Completely disambiguate language
- Identify how language is used in different types of documents (a.k.a. genres)

For instance, in tweets language is used differently than in emails, and the language used in emails differs from the language used in academic papers, etc.

Often, tweaking NLP tools for different types of text, or solving language ambiguity in an ad-hoc manner, is time-consuming, difficult and unrewarding…

Page 20: Lecture 7: Learning from Massive Datasets


What for?

- Text summarization
- Document clustering
- Authorship attribution
- Automatic metadata extraction
- Entity extraction
- Information extraction
- Information discovery
- ACTIONABLE INTELLIGENCE

Page 21: Lecture 7: Learning from Massive Datasets


Actionable Textual Intelligence

Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Intelligence

Actionable Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. !!! must give the power to make decisions and to act straightaway !!!
6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!

Page 22: Lecture 7: Learning from Massive Datasets


Big Data

BIG DATA [Wikipedia]: Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.

Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Page 23: Lecture 7: Learning from Massive Datasets


Big Unstructured TEXTUAL Data

“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e‐mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages.” [DM Review Magazine, February 2003 Issue]

ECONOMIC LOSS!

Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice.

Page 24: Lecture 7: Learning from Massive Datasets


Simple search is not enough…

Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms.

”A search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization", from "Managing Unstructured Data in the Organization", Prentice Hall 2008, pp. 1–13]
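The felony example in the quote can be sketched directly: a plain term search matches only the literal word, while a search expanded with a small hand-built taxonomy also finds the specific felonies. The documents and the taxonomy below are invented for illustration.

```python
# Sketch: simple single-term search vs. search expanded with domain
# knowledge (a hand-built taxonomy of felony types).
TAXONOMY = {"felony": ["arson", "murder", "embezzlement", "vehicular homicide"]}

docs = [
    "The suspect was charged with felony theft.",
    "Investigators confirmed the fire was arson.",
    "The accountant admitted to embezzlement.",
]

def simple_search(term, docs):
    """Hit only documents containing the literal term."""
    return [d for d in docs if term in d.lower()]

def expanded_search(term, docs):
    """Also hit documents containing any narrower term from the taxonomy."""
    terms = [term] + TAXONOMY.get(term, [])
    return [d for d in docs if any(t in d.lower() for t in terms)]

print(len(simple_search("felony", docs)))    # only the literal mention
print(len(expanded_search("felony", docs)))  # also arson and embezzlement
```

The taxonomy plays the same role as the background knowledge discussed earlier in the lecture: it lets the system connect surface forms the user never typed.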

Page 25: Lecture 7: Learning from Massive Datasets


Programming languages and frameworks for big data

Page 26: Lecture 7: Learning from Massive Datasets


R

R is a statistical programming language. It is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years. (Wikipedia)

http://www.r-project.org/

Page 27: Lecture 7: Learning from Massive Datasets


Page 28: Lecture 7: Learning from Massive Datasets


MeetUps: R in Stockholm

Page 29: Lecture 7: Learning from Massive Datasets


Can R help out?

Can R help overcome NLP shortcomings and open a new direction in order to extract useful information from Big TEXTUAL Data?

Page 31: Lecture 7: Learning from Massive Datasets


Companion website by Stefan Th. Gries

BNC = British National Corpus (PoS-tagged)

Page 32: Lecture 7: Learning from Massive Datasets


BNC

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.

The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

Page 33: Lecture 7: Learning from Massive Datasets


R & the BNC: Excerpt from Google Books

Page 34: Lecture 7: Learning from Massive Datasets


What about Big Textual Data?

- Non-standardized language
- Non-standard texts
- Electronic documents of all kinds, e.g. formal, informal, short, long, private, public, etc.

Page 35: Lecture 7: Learning from Massive Datasets


Non-distributed systems

Open source:
- R
- Scala (also distributed systems)
- RapidMiner
- Weka
- …

Commercial:
- SPSS
- SAS
- MATLAB
- …

The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.

Page 36: Lecture 7: Learning from Massive Datasets


From The Economist: The Big Data scenario

Page 37: Lecture 7: Learning from Massive Datasets


Commercial applications for Big Textual Data

- Recorded Future: web intelligence (anticipating emerging threats, future trends, competitors’ actions, etc.)
- Gavagai: large-scale textual analysis (prediction and future trends)

Page 38: Lecture 7: Learning from Massive Datasets


Thanks to Staffan Truffe’ for the following slides

Page 39: Lecture 7: Learning from Massive Datasets


Size

Page 40: Lecture 7: Learning from Massive Datasets


In a few pictures…

Page 41: Lecture 7: Learning from Massive Datasets


Metrics, structure and time

Page 42: Lecture 7: Learning from Massive Datasets


Metric

Page 43: Lecture 7: Learning from Massive Datasets


Structure

Page 44: Lecture 7: Learning from Massive Datasets


Time

Page 45: Lecture 7: Learning from Massive Datasets


Facts

Page 46: Lecture 7: Learning from Massive Datasets


Pipeline

Page 47: Lecture 7: Learning from Massive Datasets


Multi-Language

Page 48: Lecture 7: Learning from Massive Datasets


Text Analytics

Page 49: Lecture 7: Learning from Massive Datasets


Predictions

Page 50: Lecture 7: Learning from Massive Datasets


Gavagai

- Jussi Karlgren (PhD on stylistics in information retrieval)
- Magnus Sahlgren (PhD thesis on distributional semantics)
- Fredrik Olsson (PhD thesis on active learning)

(co-workers at SICS)

The indeterminacy of translation is a thesis propounded by the 20th-century American analytic philosopher W. V. Quine. Quine uses the example of the word "gavagai" uttered by a native speaker of the unknown language Arunta upon seeing a rabbit. A speaker of English could do what seems natural and translate this as "Lo, a rabbit." But other translations would be compatible with all the evidence he has: "Lo, food"; "Let's go hunting"; "There will be a storm tonight" (these natives may be superstitious)… (Wikipedia)

Page 51: Lecture 7: Learning from Massive Datasets


Ethersource presented. Thanks to F. Olsson for the following slides.

Page 52: Lecture 7: Learning from Massive Datasets


Associations

Page 53: Lecture 7: Learning from Massive Datasets


Language is flux

Page 54: Lecture 7: Learning from Massive Datasets


Learning from use

Page 55: Lecture 7: Learning from Massive Datasets


Scope

Page 56: Lecture 7: Learning from Massive Datasets


Architecture

Page 57: Lecture 7: Learning from Massive Datasets


Web vs printed world

Page 58: Lecture 7: Learning from Massive Datasets


Noise…

Page 59: Lecture 7: Learning from Massive Datasets


Multi-linguality

Page 60: Lecture 7: Learning from Massive Datasets


SICS

Watch the videos!

Page 61: Lecture 7: Learning from Massive Datasets


Big Data MeetUp, Stockholm

Page 62: Lecture 7: Learning from Massive Datasets


BIG DATA communities

Page 63: Lecture 7: Learning from Massive Datasets


Future Directions in Machine Learning for Language Technology

- Deluge of data
- Little linguistic analysis in the realm of big-data real-world platforms and applications
- Top-down systems cannot efficiently deal with the irregularity and unpredictability of big textual data
- Data-driven systems can make it. However, …we know that computers are not at ease with natural languages used by humans, unless they learn how to learn the linguistic structure underlying natural language from data…

Page 64: Lecture 7: Learning from Massive Datasets


For a data-driven approach…

Annotated datasets, which are needed for completely supervised machine learning, are costly, time-consuming and require specialist expertise.

Is complete supervision even thinkable when we talk about tera-, peta- or yottabytes? How big should the training set then be?

Alternative solutions:
- Semi-supervised methods (a combination of labelled and unlabelled data)
- Weakly supervised methods (human-constructed rules are typically used to guide the unsupervised learner)

Unsupervised learning results still cannot compete with supervised learning in many tasks…
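One common semi-supervised method, self-training, can be sketched as follows: a model trained on a few labelled examples labels the unlabelled pool itself, and its confident guesses are added to the training data. The toy word-vote "model" and all the data below are invented; real systems use proper classifiers and careful confidence thresholds.

```python
# Sketch of self-training (one semi-supervised scheme): confident
# predictions on unlabelled data become extra training examples.

def train(labelled):
    """Count which words appear in which class (a toy word-class model)."""
    model = {}
    for text, label in labelled:
        for word in text.split():
            model.setdefault(word, {}).setdefault(label, 0)
            model[word][label] += 1
    return model

def predict(model, text):
    """Return (best_label, score); score = number of word votes."""
    votes = {}
    for word in text.split():
        for label, n in model.get(word, {}).items():
            votes[label] = votes.get(label, 0) + n
    if not votes:
        return None, 0
    best = max(votes, key=votes.get)
    return best, votes[best]

labelled = [("good great film", "pos"), ("bad awful film", "neg")]
unlabelled = ["great great acting", "awful plot"]

# One self-training round: adopt confident predictions as new labels.
model = train(labelled)
for text in unlabelled:
    label, score = predict(model, text)
    if label is not None and score >= 1:
        labelled.append((text, label))

model = train(labelled)
print(predict(model, "great plot")[0])
```

The retrained model now knows words ("acting", "plot") that never appeared in the hand-labelled data, which is exactly the payoff, and the risk, of letting the learner label its own training set.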

Page 65: Lecture 7: Learning from Massive Datasets


A new way to explore: Incomplete Supervision

Relies on partially labelled data:

”Human experts — or possibly a crowd of laymen — annotate text with some linguistic structure related to the structure that one wants to predict. This data is then used for partially supervised learning with a statistical model that exploits the annotated structure to infer the linguistic structure of interest.” (p. 4)

Page 66: Lecture 7: Learning from Massive Datasets


Example

”…it is possible to construct accurate and robust part-of-speech taggers for a wide range of languages, by combining (1) manually annotated resources in English, or some other language for which such resources are already available, with (2) a crowd-sourced target-language specific lexicon, which lists the potential parts of speech that each word may take in some context, at least for a subset of the words.

Both (1) and (2) only provide partial information for the part-of-speech tagging task. However, taken together they turn out to provide substantially more information than either taken alone.“ (pp. 4–6)

Oscar Täckström, “Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision”, PhD Thesis, Uppsala University, 2013 (http://soda.swedish-ict.se/5513/)
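The role of the crowd-sourced lexicon in (2) can be illustrated with a toy sketch: merely listing each word's possible tags already prunes the space of candidate taggings, before any statistical model is applied. The three-word lexicon below is invented, and the cross-lingual transfer and statistical learning from the thesis are omitted.

```python
# Sketch: a tag lexicon as partial supervision. It does not pick the
# correct tagging, but it shrinks the search space a statistical model
# must then disambiguate.
LEXICON = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "runs": {"NOUN", "VERB"},   # ambiguous in the lexicon
}

def allowed_taggings(sentence):
    """Enumerate every tag sequence the lexicon permits."""
    seqs = [[]]
    for word in sentence:
        seqs = [s + [t] for s in seqs for t in sorted(LEXICON[word])]
    return seqs

print(allowed_taggings(["the", "dog", "runs"]))
```

With, say, 12 possible tags per word, an unconstrained 3-word sentence has 12³ = 1728 taggings; the lexicon cuts this example down to 2, which is why partial information is so valuable to the learner.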

Page 67: Lecture 7: Learning from Massive Datasets


Conclusions

This course is an introduction to “Machine Learning for Language Technology”.

You get a flavour of the problems we come across when devising models that enable machines to analyse and make sense of natural human language.

The next big, big, big step is to bring as much linguistic awareness as possible into big data.

Page 68: Lecture 7: Learning from Massive Datasets


Reading: Witten and Frank (2005), Ch. 8

Page 69: Lecture 7: Learning from Massive Datasets


Thanks for your attention!