When is A=B? Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch

Apr 01, 2015

Page 1:

When is A=B?

Donald Kossmann
Systems Group, ETH Zurich

http://systems.ethz.ch

Page 2:

Acknowledgments

Page 3:

Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)

Page 4:

Reality: We all are insane!

• When do you start believing that your paper is not worth publishing?

Page 5:

Speculations on IT Trends

• Big Data: Automating Experience
– Logic -> Statistics
– Open World Semantics

• Hybrid Systems: Get best of humans & machines
– to err is human

• Systems
– DNA, Quantum: trade energy for precision
– Distributed systems: design for failure
– Intel’s SCC: non-cache-coherent processors

Page 6:

Speculations on IT Trends

• Big Data: Automating Experience
– Logic -> Statistics
– Open World Semantics

• Hybrid Human & Machine Systems
– to err is human

• Systems
– DNA HW: trade energy consumption for precision
– Distributed systems: design for failure

Computers are becoming insane!

Page 7:

Implications

• We need to model insanity
– (too crazy for this talk)
– (will use Mechanical Turk to simulate craziness)

• We need to revisit algos & complexity theory
– focus of this talk

Page 8:

Traditional Complexity Theory

• Cost is a function of input

• Example: sorting in O(N * log N)

[Figure: Algo/Problem maps input to cost]

Page 9:

“Modern” Complexity Theory

• Cost is a function of input, quality, error rate

• Example: sorting is O(???)

[Figure: Algo/Problem maps input, quality, and error to cost]

Page 10:

Alternative Complexity Theory

• Quality is a function of input, budget, error rate

• Example: sorting is O(???)

[Figure: Algo/Problem maps input, budget, and error to quality]

Page 11:

Agenda

• Case Study: Entity Resolution, Joins
– when is A=B?

• Case Study: Sorting
– when is A<B?

Page 12:

Problem Statement

• You are the director of the Louvre
– you have gazillions of unknown paintings
– you have a bunch of students that guess: p(A) = p(B)?

• You would like to group the paintings by painter
– minimize cost (work of students)
– minimize errors (#paintings in wrong room)

• Assumption: There is a ground truth!
– (Many problems have no ground truth; e.g., grouping the best paintings.)

Page 13:

Naïve Algorithm

• Step 1: select two random paintings

• Step 2: ask students to compare them

• Step 3: goto Step 1 until done

• How can we do better???
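
The three steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: `painter_of` (the hidden ground truth), the simulated student with a fixed `error_rate`, and the choice to record each answer as +1 or -1 on an edge are all hypothetical and not taken from the talk.

```python
import random
from collections import defaultdict


def collect_votes(painter_of, rounds, error_rate=0.2, seed=0):
    """Naive loop: pick random pairs and record noisy same-painter votes.

    painter_of maps painting -> painter (assumed ground truth).
    Returns a dict mapping frozenset({a, b}) -> net votes
    (yes votes minus no votes), i.e. a weighted votes graph.
    """
    rng = random.Random(seed)
    paintings = sorted(painter_of)
    votes = defaultdict(int)
    for _ in range(rounds):
        # Step 1: select two random paintings
        a, b = rng.sample(paintings, 2)
        # Step 2: ask a (simulated) student, who errs with probability error_rate
        truth = painter_of[a] == painter_of[b]
        answer = (not truth) if rng.random() < error_rate else truth
        votes[frozenset((a, b))] += 1 if answer else -1
        # Step 3: loop; "until done" is replaced here by a fixed number of rounds
    return votes


truth = {"A": "Monet", "B": "Monet", "C": "Monet", "D": "Degas"}
for pair, weight in collect_votes(truth, rounds=50).items():
    print(sorted(pair), weight)
```

The net vote counts produced this way play the role of the edge weights in the votes graph on the following slides.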

Page 14:

Votes Graph

[Figure: votes graph over paintings A, B, C, D]

• Is A = B?

Page 15:

Votes Graph

[Figure: votes graph over paintings A, B, C, D]

• Is A = B? YES!

Page 16:

Votes Graph

[Figure: votes graph over paintings A, B, C, D]

Page 17:

Votes Graph

[Figure: votes graph over paintings A, B, C, D]

• Is B = C?
• Is A = D?

Page 18:

Votes Graph

[Figure: votes graph over paintings A, B, C, D]

• Is B = C? YES!
• Is A = D? NO!

Page 19:

Votes Graph

[Figure: votes graph over paintings A, B, C, D]

• Is B = C? ???

Page 20:

Votes Graph

[Figure: votes graph over paintings A, B, C, D with edge weights 50, 30, -100, and -1]

• Is B = C? YES!

Page 21:

Decision Functions

• Input: Votes graph (with weights), two nodes

• Output: Yes, No, Do-not-know

• Desired Properties:
– Consistency: do not invent anything
– Convergence: do not always punt
– Reflexivity, Symmetry, Transitivity, Anti-transitivity

Page 22:

Min-Max Function

• Compute pScore, nScore
– take all positive, negative paths
– score of path: minimum of weights of edges (AND)
– pScore = maximum of score of all positive paths (OR)
– nScore = maximum of score of all negative paths (OR)

• Make decision based on quorum (e.g., q=3)
– Yes: pScore – nScore > q
– No: nScore – pScore > q
– Do-not-know: otherwise
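
A minimal sketch of the Min-Max decision function, for small graphs only. Several things here are assumptions rather than statements from the talk: a positive path is taken to be a simple path with no negative edge, a negative path a simple path with exactly one negative edge (one way to read anti-transitivity), and path scores use absolute edge weights. The function names and the placement of the weights 50, 30, -100, -1 on particular edges are likewise hypothetical, since the original figure is not recoverable.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected votes graph from (node, node, weight) triples."""
    graph = defaultdict(dict)
    for u, v, weight in edges:
        graph[u][v] = weight
        graph[v][u] = weight
    return graph


def path_scores(graph, src, dst):
    """Return (pScore, nScore) by enumerating simple paths (exponential time)."""
    best = [0, 0]  # best[k]: best score over paths with k negative edges

    def walk(node, visited, bottleneck, negatives):
        if node == dst:
            best[negatives] = max(best[negatives], bottleneck)  # OR over paths
            return
        for nxt, weight in graph[node].items():
            if nxt in visited:
                continue
            count = negatives + (1 if weight < 0 else 0)
            if count > 1:  # assumed: two negative edges say nothing
                continue
            # AND along the path: minimum of absolute edge weights
            walk(nxt, visited | {nxt}, min(bottleneck, abs(weight)), count)

    walk(src, {src}, float("inf"), 0)
    return best[0], best[1]


def min_max(graph, a, b, quorum=3):
    """Decide whether a = b: 'yes', 'no', or 'do-not-know'."""
    p_score, n_score = path_scores(graph, a, b)
    if p_score - n_score > quorum:
        return "yes"
    if n_score - p_score > quorum:
        return "no"
    return "do-not-know"


# Assumed edge layout for the four-node example graph.
votes = build_graph([("A", "B", 50), ("A", "C", 30),
                     ("B", "C", -1), ("C", "D", -100)])
print(path_scores(votes, "B", "C"), min_max(votes, "B", "C"))  # (30, 1) yes
print(path_scores(votes, "A", "D"), min_max(votes, "A", "D"))  # (0, 30) no
```

Enumerating all simple paths keeps the sketch short and easy to check by hand; a widest-path (bottleneck) search would be the natural replacement on larger graphs.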

Page 23:

Min/Max with Conflicts

[Figure: votes graph over paintings A, B, C, D with edge weights 50, 30, -100, and -1]

• Is B = C? YES
• pScore = 30
• nScore = 1

• Is A = D? NO
• pScore = 0
• nScore = 30

Page 24:

Naïve Algorithm V2.0

• Step 1: select two random paintings, p1, p2

• Step 2: if (MinMax(p1, p2) == Do-not-know)
    ask students to compare them
  else
    return MinMax(p1, p2)

• Step 3: goto Step 1 until done
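
The V2.0 loop can be sketched with the decision function passed in as a parameter. `decide` and `ask_students` are hypothetical callables introduced for illustration (`decide` stands in for MinMax and returns "yes", "no", or "do-not-know"), and the fixed number of `rounds` is an assumed stand-in for "until done".

```python
import random


def resolve_pairs(paintings, decide, ask_students, rounds, seed=0):
    """Only spend student work on pairs the decision function cannot settle.

    Returns (answers, asked): settled pairs and the number of crowd questions.
    """
    rng = random.Random(seed)
    answers = {}
    asked = 0
    for _ in range(rounds):
        p1, p2 = rng.sample(paintings, 2)          # Step 1
        verdict = decide(p1, p2)                   # Step 2
        if verdict == "do-not-know":
            ask_students(p1, p2)                   # costs one unit of work
            asked += 1
        else:
            answers[frozenset((p1, p2))] = verdict  # inferred for free
    return answers, asked                          # Step 3: loop until budget is spent


# Toy usage: a decision function that already knows A = B and nothing else.
known = {frozenset(("A", "B")): "yes"}
answers, asked = resolve_pairs(
    ["A", "B", "C"],
    decide=lambda x, y: known.get(frozenset((x, y)), "do-not-know"),
    ask_students=lambda x, y: None,
    rounds=20,
)
print(len(answers), asked)
```

Keeping the decision function pluggable makes it easy to compare Min-Max against alternatives while holding the sampling loop fixed.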

Page 25:

Min/Max and Transitivity?

[Figure: votes graph over A, B, C, D, E with edge weights 5, 5, -2, 5, and 3]

A = D? YES
• pScore = 5
• nScore = 2

D = E? YES
• pScore = 3
• nScore = 0

A = E? Do-not-know
• pScore = 3
• nScore = 2

Page 26:

When is A=E?

[Figure: votes graph over A, B, C, D, E with edge weights 5, 5, -2, 5, and 3]

Compute “A=E”: Need at least 5 votes for success.
Compute “D=E”: In best case, only 2 more votes needed.

Page 27:

When is A=E?

[Figure: votes graph over A, B, C, D, E with edge weights 5, 5, -2, 5, and 3]

Crowdsource A=E: Need at least 5 votes for success.
Crowdsource D=E: In best case, only 2 votes needed.

Many more surprises like that!!!
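
One way to arrive at these vote counts, as a worked sketch. It assumes an edge layout that is not recoverable from the transcript (A-B = 5, B-C = 5, C-D = 5, A-D = -2, D-E = 3) and a decision rule "Yes when pScore - nScore > q" with q = 2; both are assumptions chosen because they reproduce the scores and answers listed for this graph. Here w is the number of yes-votes collected directly on A=E, and k is the number of additional yes-votes collected on D=E.

```latex
\begin{align*}
\text{Ask } A{=}E \text{ directly:}\quad
  & \mathit{pScore} = \max(3,\, w), \qquad \mathit{nScore} = \min(2, 3) = 2 \\
  & \max(3,\, w) - 2 > 2 \iff w \ge 5 \\[4pt]
\text{Ask } D{=}E \text{ again:}\quad
  & \mathit{pScore} = \min(5, 5, 5,\, 3 + k), \qquad \mathit{nScore} = \min(2,\, 3 + k) = 2 \\
  & \min(5,\, 3 + k) - 2 > 2 \iff k \ge 2
\end{align*}
```

Under these assumptions, strengthening the weakest edge on the existing positive path is cheaper than building a new direct edge, because a direct edge has to overcome the negative evidence on its own.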

Page 28:
Page 29:
Page 30:
Page 31:
Page 32:

Related Work & Alternatives

• R. Fagin, E. Wimmers: A formula for incorporating weights into scoring rules. 2000.

• M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011.

• Huge body of work on ER in DB, II communities.

• Other decision function: MinCuts!

Page 33:

Summary

• Getting A=B right more important than algorithm
– Naïve algo with Min/Max >> Correlation Clustering

• Result of A=B depends on C, D, …
– sounds trivial, but has nasty implications
– need a decision function: new cost/precision tradeoffs
– Some trad. algos (e.g., CC) do not work

• Complexity: Still unknown!
– interesting future work

Page 34:

Agenda

• Case Study: Entity Resolution, Joins
– when is A=B?

• Case Study: Sorting
– when is A<B?

Page 35:

Revisit Sorting Algos

• How do traditional sorting algorithms behave
– Quicksort
– Bubblesort

• Look at new sorting algorithms based on graph
– PageRank
– Min/Max
– Schulze method

• Focus on Quicksort vs. Bubblesort here
– Just give a glimpse of what can happen

Page 36:

Quicksort: Effect of built-in transitivity

• Sort the following sequence
  Neutral, Painful, Good, Excellent, Bad

• Use “Good” as pivot element for partitioning
  Fumble “Painful < Good” comparison
  Excellent, Painful, Good, Neutral, Bad

• One bad comparison propagates to three misclassifications
– quality of result can become arbitrarily bad
– difficult to extend QSort algo with safety net.
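
The example above can be replayed in a few lines. This is a sketch under assumptions: the ranking Excellent > Good > Neutral > Bad > Painful, the best-first output order, and the helper names (`make_better`, `quicksort`, `inversions`) are illustrative choices, not taken from the talk.

```python
# Assumed ground-truth ranking (higher is better).
RANK = {"Excellent": 4, "Good": 3, "Neutral": 2, "Bad": 1, "Painful": 0}


def make_better(fumbled_pairs):
    """Comparator that answers wrongly on the given pairs, correctly elsewhere."""
    def better(a, b):
        truth = RANK[a] > RANK[b]
        return (not truth) if frozenset((a, b)) in fumbled_pairs else truth
    return better


def quicksort(items, better, pivot=None):
    """Best-first quicksort; each element is compared to the pivot only."""
    if len(items) <= 1:
        return list(items)
    pivot = pivot if pivot in items else items[0]
    rest = [x for x in items if x != pivot]
    above = [x for x in rest if better(x, pivot)]
    below = [x for x in rest if not better(x, pivot)]
    return quicksort(above, better) + [pivot] + quicksort(below, better)


def inversions(order):
    """Number of pairs that appear in the wrong relative order."""
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if RANK[order[i]] < RANK[order[j]]
    )


sequence = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
fumble = make_better({frozenset(("Painful", "Good"))})
result = quicksort(sequence, fumble, pivot="Good")
print(result)              # ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad']
print(inversions(result))  # 3
```

Because Painful is never compared with Neutral or Bad after landing on the wrong side of the pivot, the single wrong answer is silently extended to those pairs by transitivity.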

Page 37:

Results (20% error, uniform)

[Figure: Quality (%) versus Cost (number of iterations of algorithm) for QuickSort and BubbleSort]

Page 38:

Summary

• Some algos implicitly exploit transitivity
– difficult to control cost/quality tradeoff
– might result in a poor result for specific application

• QuickSort >> Bubblesort no longer true
– depends on error and quality expectation
– there are better and worse ways to exploit transitivity depending on budget and error behavior
– confirms observations of “A=B” study
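
A small harness for exploring this tradeoff yourself. It is a sketch, not the setup behind the talk's results: the uniform per-comparison `error_rate`, the use of bubble-sort passes as the budget, and pairwise-order agreement as the quality metric are all assumptions.

```python
import random


def noisy_less(a, b, error_rate, rng):
    """Answer 'a < b?' and flip the answer with probability error_rate."""
    truth = a < b
    return (not truth) if rng.random() < error_rate else truth


def bubble_sort(items, less, passes):
    """Run a fixed number of bubble-sort passes (the budget)."""
    out = list(items)
    for _ in range(passes):
        for i in range(len(out) - 1):
            if less(out[i + 1], out[i]):
                out[i], out[i + 1] = out[i + 1], out[i]
    return out


def quality(order):
    """Fraction of pairs in correct ascending order (assumes distinct values)."""
    n = len(order)
    correct = sum(
        1 for i in range(n) for j in range(i + 1, n) if order[i] < order[j]
    )
    return correct / (n * (n - 1) // 2)


def run(error_rate, passes, n=20, seed=0):
    """Quality reached on a shuffled input for a given budget and error rate."""
    rng = random.Random(seed)
    data = list(range(n))
    rng.shuffle(data)
    less = lambda a, b: noisy_less(a, b, error_rate, rng)
    return quality(bubble_sort(data, less, passes))


for passes in (5, 10, 20, 40):
    print(passes, run(error_rate=0.2, passes=passes))
```

Plugging a quicksort with the same noisy comparator into `run` would allow a side-by-side comparison under whatever error model is of interest.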

Page 39:

Related Work on Sorting

• Ludwig Busse et al.: The information content in sorting algorithms. 2012.

• M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011.

• Qurk (MIT) & Deco (Stanford) projects. 2011-2013.

• …

Page 40:

Conclusion & Future Work

• Computers are becoming insane
– because they automate more of the insane world
– because we are hitting the limits of trad. computing
– consequence: quality becomes a major metric

• Adding “quality” has dramatic implications
– need to revisit algorithms to become fault-tolerant
– need to revisit complexity: totally open
– need to revisit debugging and testing: totally open