The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig and Fernando Pereira Google 2011. 10. 24 Eun-Sol Kim


Feb 24, 2016

Page 1: The Unreasonable Effectiveness of Data

The Unreasonable Effectiveness of Data

Alon Halevy, Peter Norvig and Fernando Pereira
Google

2011. 10. 24
Eun-Sol Kim

Page 2: The Unreasonable Effectiveness of Data

• The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve.

- Eugene Wigner, The Unreasonable Effectiveness of Mathematics in the Natural Sciences

• Essentially, all models are wrong, but some are useful.

- George Box

Page 3: The Unreasonable Effectiveness of Data

Two approaches to AI

• GOFAI (Good Old-Fashioned Artificial Intelligence)
– Based on logic
– Symbolic AI

• SML (Statistical Machine Learning)
– Based on empirical data (sensor data or databases)
– Inductive inference: generalize from data to rules, then predict on future data

Page 4: The Unreasonable Effectiveness of Data

• Scene completion using millions of photographs
- Hays et al., CMU, SIGGRAPH 2007

Page 5: The Unreasonable Effectiveness of Data

The power of data

Page 6: The Unreasonable Effectiveness of Data

Learning from Text at Web Scale

• Brown Corpus
– 1 million English words
– Complete sentences, no spelling errors, no grammatical errors

• Google's trillion-word corpus
– A million times larger than the Brown Corpus
– Frequency counts for all sequences up to 5 words long
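
The frequency counts mentioned above can be illustrated on a toy scale. The following is a minimal sketch, assuming whitespace tokenization and an in-memory counter; the function name `ngram_counts` and the sample sentence are made up for illustration and say nothing about how the web-scale corpus was actually built.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every contiguous token sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy input (an assumption for this sketch, not data from the slides).
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[("the", "cat")])   # bigram count
print(counts[("the",)])         # unigram count
```

At web scale the same counting would have to be distributed and pruned by a frequency threshold; this sketch only shows what "counts for all sequences up to 5 words long" means.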

Page 7: The Unreasonable Effectiveness of Data

Some lessons of web-scale learning

1. Use available large-scale data rather than annotated data

– We can find useful semantic relationships automatically from the statistics of search queries and the corresponding results, or from the accumulated evidence of web-based text patterns, without annotated data.
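
One way to picture "accumulated evidence of web-based text patterns" is to count how often a simple lexical pattern links two words. The sketch below assumes the pattern "X such as Y" with a single word on each side; the regular expression, the function name `harvest_is_a`, and the sentences are illustrative assumptions, not the method described in the slides.

```python
import re
from collections import Counter

# Hypothetical pattern: "<class> such as <instance>", one word on each side.
PATTERN = re.compile(r"\b(\w+) such as (\w+)")

def harvest_is_a(sentences):
    """Accumulate (instance, class) evidence from unannotated sentences."""
    pairs = Counter()
    for sentence in sentences:
        for cls, inst in PATTERN.findall(sentence.lower()):
            pairs[(inst, cls)] += 1
    return pairs

# Toy sentences standing in for web text (an assumption for this sketch).
sentences = [
    "Cities such as Paris attract many visitors.",
    "He has lived in cities such as Paris and Rome.",
    "Fruits such as mango are sold here.",
]
pairs = harvest_is_a(sentences)
print(pairs.most_common())
```

A single match is weak evidence; the point of using large-scale data is that repeated matches across many independent pages make the relationship more reliable.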

Page 8: The Unreasonable Effectiveness of Data

2. Memorization is a good policy

- Memorizing specific phrases is more effective than learning general patterns.

- Machine translation example: large memorized phrase tables that give candidate mappings between specific source- and target-language phrases.

- For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
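
The phrase-table idea can be sketched as a dictionary lookup. This is a minimal illustration, assuming a hand-written Spanish-to-English table and greedy longest-match decoding; real statistical translation systems keep several scored candidates per phrase and choose among them with a language model, which is omitted here. All names and entries are hypothetical.

```python
# Hypothetical toy phrase table, keyed by tuples of source tokens.
PHRASE_TABLE = {
    ("muchas", "gracias"): "thank you very much",
    ("por", "favor"): "please",
    ("gracias",): "thanks",
}

def translate(tokens, table, max_len=3):
    """Greedy longest-match lookup of memorized phrases."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                out.append(table[phrase])
                i += n
                break
        else:
            # No memorized phrase starts here: pass the token through.
            out.append(tokens[i])
            i += 1
    return " ".join(out)

result = translate("muchas gracias , por favor".split(), PHRASE_TABLE)
print(result)  # thank you very much , please
```

Note that the two-word entry wins over the one-word entry for "gracias": the more specific memorized phrase is preferred over the more general one.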

Page 9: The Unreasonable Effectiveness of Data

Two conventional approaches to NLP

• Deep approach
– Hand-coded grammars and ontologies
– Complex networks of relations

• Statistical approach
– Learning n-gram statistics from large corpora

Page 10: The Unreasonable Effectiveness of Data

New approaches to NLP

• Combination of the two conventional approaches
– Statistical relational learning
  • Represent relations between objects with rules (first-order logic)
  • Model built by statistical learning

Page 11: The Unreasonable Effectiveness of Data

Semantic interpretation

• Semantic web
– A convention for formal representation languages that lets software services interact with each other

• Semantic interpretation
– Deals with imprecise, ambiguous natural languages
– Embodied in human cognitive and cultural processes whereby linguistic expression elicits expected responses and expected changes in cognitive states

Page 12: The Unreasonable Effectiveness of Data

The challenges for achieving accurate semantic interpretation

• Interpreting the content
– Requires methods to infer relationships between column headers or mentions of entities in the world

• Web-scale data might be an important part of the solution
– Hundreds of millions of independently created tables
– Tables represent structured data
– With tables, we can resolve semantic heterogeneity
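
As a rough picture of how many independently created tables could help with semantic heterogeneity, the sketch below proposes two column headers as possible synonyms when they never appear in the same table but share several neighbouring headers. This heuristic, the `min_overlap` threshold, and all names are assumptions made for illustration; the slides do not specify an algorithm.

```python
from collections import defaultdict
from itertools import combinations

def header_contexts(tables):
    """Map each header to the set of headers it appears alongside."""
    ctx = defaultdict(set)
    for headers in tables:
        for a, b in combinations(set(headers), 2):
            ctx[a].add(b)
            ctx[b].add(a)
    return ctx

def synonym_candidates(tables, min_overlap=2):
    """Pairs of headers that never co-occur but share enough context."""
    ctx = header_contexts(tables)
    out = []
    for a, b in combinations(sorted(ctx), 2):
        if b in ctx[a]:
            continue  # headers in the same table are unlikely synonyms
        shared = ctx[a] & ctx[b]
        if len(shared) >= min_overlap:
            out.append((a, b, sorted(shared)))
    return out

# Toy schemas standing in for web tables (an assumption for this sketch).
tables = [
    ["company", "ceo", "revenue"],
    ["company_name", "ceo", "revenue"],
    ["city", "population", "country"],
]
candidates = synonym_candidates(tables)
print(candidates)
```

With only three tables the evidence is anecdotal; the argument in the slides is that hundreds of millions of tables make such co-occurrence statistics informative.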

Page 13: The Unreasonable Effectiveness of Data

Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.