How to Fake Data if you must Department of Statistics Rachel Fewster.

Post on 28-Mar-2015

How to Fake Data if you must

Department of Statistics

Rachel Fewster

Who wants to fake data?

• Electoral finance returns…

• Toxic emissions reports…

• Business tax returns…

Land areas of world countries: real or fake?

[Figure: two tally charts of the first digits 1 to 9 of country land areas, one real and one fake]

This one seems more even…

This one has as many 1s as 5-9s put together!

This one is right!

Real land areas of world countries

[Figure: tally chart of the first digits 1 to 9 of the real land areas]

11 of them begin with digits 1 – 4…

Only 5 begin with digits 5 – 9…

Friday’s Newspaper:

[Figure: tally chart of the first digits 1 to 9 of 34 numbers from the newspaper]

10 out of 34 numbers began with a 1…

None out of 34 began with a 9!

The Curious Case of the Grimy Log-books

• In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…

The books always seemed grubby on the first pages…

… but clean on the last pages

The first pages are for numbers beginning with digits 1 and 2…

The last pages are for numbers beginning with digits 8 and 9…

People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9.

Why?

Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!

Newcomb’s Law

American Journal of Mathematics, 1881

30% of numbers begin with a 1 !!

< 5% of numbers begin with a 9 !!

The First Digits…

Over 30% of numbers begin with a 1

Only about 5% of numbers begin with a 9

Numbers beginning with a 1

Numbers beginning with a 9

There is the same “opportunity” for numbers to begin with 9 as with 1 …

but for some reason they don’t!

Chance of a number starting with digit d = log10((d + 1)/d)

0.301 = log10(2/1)

0.176 = log10(3/2)

0.125 = log10(4/3)
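As a quick check of the figures on this slide, here is a minimal Python sketch that tabulates log10((d + 1)/d) for every first digit d from 1 to 9. The names `benford_probability` and `benford_table` are illustrative choices, not anything from the talk.

```python
import math

def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d (1 to 9): log10((d + 1) / d)."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10((d + 1) / d)

# One entry per possible first digit.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"{d}: {p:.3f}")
```

The nine chances add up to log10(10) = 1, which is a handy sanity check on the formula.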

Reactions to Newcomb’s law

Nothing!

…for 57 years!

Enter Frank Benford: 1938

Physicist with the General Electric Company

Assembled over 20,000 numbers and counted their first digits!

‘A study as wide as time and energy permitted.’

Populations

Numbers from newspapers

Drainage rates of rivers

Numbers from Reader’s Digest articles

Street addresses of American Men of Science

About 30% begin with a 1

About 5% begin with a 9

Benford gave the ‘law’ its name…

…but no explanation.

Anomalous numbers!!

“…The logarithmic law applies to outlaw numbers that are without known relationship, rather than to those that follow an orderly course; and so the logarithmic relation is essentially a Law of Anomalous Numbers.”

Explanations for Benford’s Law

• Numbers from a wide range of data sources have about 30% of 1’s, down to only 5% of 9’s.

• Benford called these ‘outlaw’ or ‘anomalous’ numbers. They include street addresses of American Men of Science, populations, areas, numbers from magazines and newspapers.

• Benford’s ‘orderly’ numbers don’t follow the law – like atomic weights and physical constants

What is the explanation?

Popular Explanations

• Scale Invariance

• Base Invariance

• Complicated Measure Theory

• Divine choice

• Mystery of Nature

These two (scale invariance and base invariance) say that IF there is a universal law, it must be Benford’s.

They don’t explain why there should be a law to start with!

Complicated Measure Theory

In a nutshell… If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford’s Law

That’s why THIS works well

It doesn’t explain why street addresses of American Men of Science works well!

It doesn’t really explain WHAT will work well, nor why

The Key Idea…

If a hat is covered evenly in red and white stripes…

… it will be half red and half white.

Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon

The red stripes and the white stripes even out over the shape of the hat

If the red stripes cover half the base, they’ll cover about half the hat

What if the red stripes cover 30% of the base?

[Figure: the hat over a base marked from 0 to 6, with a red stripe from each whole number n to n + 0.3]

Then they’ll cover about 30% of the hat.

What if the red stripes cover precisely fraction 0.301 of the base?

0.301 = log10(2/1)

[Figure: the hat over a base marked from 0 to 6, with a red stripe from each whole number n to n + 0.301]

Then they’ll cover fraction ~0.301 of the hat.
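The stripe argument can be tried out numerically. The sketch below assumes a particular hat shape (a triangle over the base 0 to 6, peaking at 3), which is an illustrative choice and not the hat in the photo; the function names `hat` and `red_fraction` are likewise made up for this example. It slices the hat into thin vertical strips and adds up the area of those that sit on a red stripe.

```python
def hat(x: float) -> float:
    """Assumed hat outline: a triangle over the base 0..6 with its peak at 3."""
    return max(0.0, 3.0 - abs(x - 3.0))

def red_fraction(stripe_width: float, steps_per_unit: int = 1000) -> float:
    """Fraction of the hat's area lying on red stripes from n to n + stripe_width."""
    dx = 1.0 / steps_per_unit
    red = 0.0
    total = 0.0
    for i in range(6 * steps_per_unit):
        x = (i + 0.5) * dx            # midpoint of a thin vertical slice
        area = hat(x) * dx            # area of that slice
        total += area
        if x - int(x) < stripe_width: # the slice sits on a red stripe
            red += area
    return red / total

for width in (0.5, 0.3, 0.301):
    print(width, round(red_fraction(width), 3))
```

Swapping in a different smooth, wide shape for `hat` is an easy way to see how closely the red share follows the fraction of the base.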

Think of X as a random number…

We want the probability that X has first digit = 1

Let the ‘hat’ be a probability density curve for X

Then AREAS on the hat give PROBABILITIES for X

Pr(1 < X < 5) = 0.95

Area = 0.95 from 1 to 5

Total area = 1

In the same way ….

If the red stripes somehow represent the X values with first digit = 1, and the red stripes have area ~ 0.301, then Pr(X has first digit 1) ~ 0.301.

So X values with first digit = 1 somehow lie on a set of evenly spaced stripes?

Write X in Scientific Notation:

X = r × 10^n

r is between 1 and 10

n is an integer

For example…

124 = 1.24 × 10^2

76 = 7.6 × 10^1

For the first digit of X, only r matters!

X has first digit 1 exactly when 1 ≤ r < 2

124 = 1.24 × 10^2 : 1 ≤ r < 2

76 = 7.6 × 10^1 : r > 2
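The scientific-notation step can be sketched in a few lines of Python. This is a minimal illustration, assuming X is positive; the helper names `scientific_parts` and `first_digit` are hypothetical and chosen only for this example.

```python
import math

def scientific_parts(x: float) -> tuple:
    """Split a positive x into (r, n) with x = r * 10**n and 1 <= r < 10."""
    if x <= 0:
        raise ValueError("x must be positive")
    n = math.floor(math.log10(x))
    r = x / 10 ** n
    # Guard against floating-point rounding right at a power of ten.
    if r >= 10:
        r, n = r / 10, n + 1
    elif r < 1:
        r, n = r * 10, n - 1
    return r, n

def first_digit(x: float) -> int:
    """Only r matters for the first digit: it is the whole-number part of r."""
    r, _ = scientific_parts(x)
    return int(r)

for x in (124, 76):
    r, n = scientific_parts(x)
    print(f"{x} = {r:g} x 10^{n}, first digit {first_digit(x)}")
```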

Take logs to base 10…

log(X) = log(r × 10^n)

Or in other words…

log(X) = log(r) + n

r is between 1 and 10

n is an integer

X has first digit 1 when 1 ≤ r < 2

i.e. when log(1) ≤ log(r) < log(2)

i.e. when 0 ≤ log(r) < 0.301

X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n

X values with first digit = 1 satisfy:

n = 0 : 0 ≤ log(X) < 0.301 (X from 1 to 2)
n = 1 : 1 ≤ log(X) < 1.301 (X from 10 to 20)
n = 2 : 2 ≤ log(X) < 2.301 (X from 100 to 200)

and so on!

STRIPES!!

[Figure: the hat over a base marked from 0 to 6 on the log scale, with a red stripe from each whole number n to n + 0.301]

The ‘hat’ is the probability density curve for log(X)
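The stripe rule is easy to test against the obvious way of reading off a first digit. The sketch below is only a spot check under stated assumptions: it uses half-integers (1.5, 2.5, …) so that no test value lands exactly on a stripe edge such as 2 or 20, where floating-point rounding could blur the comparison. The function names are made up for this example.

```python
import math

def on_red_stripe(x: float) -> bool:
    """True when log10(x) lies between n and n + log10(2) for some integer n."""
    log_x = math.log10(x)
    return log_x - math.floor(log_x) < math.log10(2)

def starts_with_one(x: float) -> bool:
    """Read the first digit straight off the written-out number."""
    return str(x)[0] == "1"

# Half-integers from 1.5 up to 99999.5, none of them on a stripe edge.
values = [k + 0.5 for k in range(1, 100_000)]
mismatches = [x for x in values if on_red_stripe(x) != starts_with_one(x)]
print(len(mismatches))
```

If the two rules agree, the list of mismatches is empty.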

So X values with first digit=1 DO lie on evenly spaced stripes, on the log scale!

The PROBABILITY of getting first digit 1 is the AREA of the red stripes, which is approximately the fraction of the base they cover, = 0.301.

We’ve done it!

We’ve shown that we really should expect the first digit to be 1 about 30% of the time!

Intuitively…

The log scale distorts: small numbers (e.g. 100) are stretched out; larger numbers (e.g. 900) are bunched up.

The first digit corresponds to regularly spaced stripes on the log scale.

So the smallest numbers (first digit = 1) are stretched out, and get the highest probability!

When is this going to work?

We need a lot of stripes to balance out big ones and little ones! We get one stripe every integer… So we need a lot of integers!

The distribution of X needs to be WIDE on the log scale!

Here log(X) ranges from 0 to 6… So X ranges from 1 to 10^6 on the usual scale!

1 .. 2 .. Miss a few ... 999,999 .. 1,000,000

These are Benford’s ‘Outlaw Numbers’!

All we need is a distribution that is:
• WIDE (4 – 6 orders of magnitude or more)
• Reasonably SMOOTH …

Then the red stripes will even out to cover about 30% of the total area.
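A small simulation shows what WIDE buys you. The sketch below assumes log10(X) follows a bell curve centred at 3, which is an illustrative choice and not taken from the talk. With a spread of 1.5 the curve covers roughly six orders of magnitude; with a spread of 0.05 it covers a fraction of one. The function name `share_of_leading_ones` and both spread values are assumptions made for this example.

```python
import math
import random

def share_of_leading_ones(spread: float, size: int = 100_000, seed: int = 1) -> float:
    """Draw log10(X) from a bell curve centred at 3; return the share of X with first digit 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(size):
        log_x = rng.gauss(3.0, spread)
        # First digit is 1 exactly when log10(X) sits on a red stripe.
        if log_x - math.floor(log_x) < math.log10(2):
            hits += 1
    return hits / size

wide = share_of_leading_ones(spread=1.5)     # many stripes under the hat
narrow = share_of_leading_ones(spread=0.05)  # less than one stripe under the hat
print(f"wide: {wide:.3f}  narrow: {narrow:.3f}")
```

The wide case should land close to 0.301. The narrow case should not, because the hat then sits over a single stripe edge and the stripes have no chance to even out.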

In Real Life…

World Populations: From 50 for the Pitcairn Islands… To 1.3 x 10^9 for China…

Wide (9 integers => 9 stripes)

First digits very good fit to Benford!

Electorate populations? From 583,000 to 773,000 in California:

Of course not! All the first digits are 5, 6, or 7…

The hat has less than one stripe! Benford doesn’t work here.

But naturally occurring populations are a different story!

Cities in California:

- from 94 in the city of Vernon…
- to 3.9 million in Los Angeles…

Yes! It’s Benford!

Wide enough (5 integers => 5 stripes)

Powerball Jackpots?

- from $10 million to $365 million…

Not bad!

Orders of magnitude only 1.5 …

… but sometimes you just hit lucky!

Data with kind permission from www.lottostrategies.com
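To compare these examples on one footing, the width of each range on the log scale can be worked out from its smallest and largest values. This is a rough sketch: the endpoints are the ones quoted on the slides, the function name `orders_of_magnitude` is made up for this example, and a width computed from the endpoints alone need not match the stripe counts quoted on the slides.

```python
import math

def orders_of_magnitude(smallest: float, largest: float) -> float:
    """Width of the range on the log10 scale."""
    return math.log10(largest / smallest)

# (smallest, largest) as quoted on the slides.
ranges = {
    "World populations": (50, 1.3e9),
    "California electorates": (583_000, 773_000),
    "California cities": (94, 3.9e6),
    "Powerball jackpots": (10e6, 365e6),
}

for name, (low, high) in ranges.items():
    print(f"{name}: {orders_of_magnitude(low, high):.1f} orders of magnitude")
```

The electorates come out at well under one order of magnitude, which is why the hat there has less than one stripe.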

Your tax return….?

If you plan to fake data, you should first check whether it ought to be Benford!

BUT the IRD has a few other tricks up its sleeve too….

To find out more:

• A Simple Explanation of Benford’s Law by R. M. Fewster, The American Statistician, to appear. PDF from www.stat.auckland.ac.nz/~fewster/benford.html

• Judy Paterson’s CMCT course, Term 1 2009: Centre for Mathematical Content in Teaching

Thanks for listening!
