Benford’s Law… Is it magic? Gaetan “Guy” Lion, July 2010

Benford's law

Jan 14, 2015

Education

Gaetan Lion

Explanation and applications of Benford's Law.
Page 1: Benford's law

Benford’s Law… Is it magic?

Gaetan “Guy” Lion

July 2010

Page 2: Benford's law

What is the probability that a country’s population figure starts with any given first digit: 1, 2, 3, 4, 5, 6, 7, 8, or 9?

The intuitive answer is that each of the nine digits is equally likely: 1/9 = 11.1%...

[Chart: Countries population. Frequency of first digit. X axis: First digit (1 to 9). Y axis: Frequency (0% to 12%).]

Page 3: Benford's law

… The Correct Answer

First digit   Frequency
1             28.4%
2             14.9%
3             13.5%
4              9.9%
5              9.0%
6              9.0%
7              5.4%
8              6.3%
9              3.6%
Total        100.0%

[Chart: Countries population. Frequency of first digit. X axis: First digit (1 to 9). Y axis: Frequency (0% to 30%). Series: Actual, Speculated.]
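Tabulating first-digit frequencies like those in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the author's method: the helper names `first_digit` and `first_digit_frequencies` are hypothetical, and the sample values are made up because the country data behind the slide is not included in the transcript.

```python
from collections import Counter

def first_digit(x):
    """Return the first significant digit (1-9) of a nonzero number."""
    # Scientific notation puts the first significant digit first,
    # e.g. 38100 -> '3.810000000000000e+04'.
    return int(f"{abs(x):.15e}"[0])

def first_digit_frequencies(values):
    """Share of values starting with each digit 1..9 (zeros are skipped)."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Made-up figures for illustration only; NOT the country data behind the slide.
sample = [1_340_000, 38_100, 7_250_000, 112, 19_800_000, 2_450, 0.0413, 96_000]
freqs = first_digit_frequencies(sample)
for d in range(1, 10):
    print(d, f"{freqs[d]:.1%}")
```

The same two helpers work on any list of positive or negative numbers, since only the first significant digit matters.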

Page 4: Benford's law

Countries populations follow Benford’s Law

Chi Square P value for the hypothesis that the two distributions are the same: 0.8

[Chart: Countries population. Frequency of first digit. X axis: First digit (1 to 9). Y axis: Frequency (0% to 35%). Series: Population, Benford.]

Page 5: Benford's law

Benford’s Law

Benford’s Law states that in lists of numbers from many real-world data sources, the frequency of each first digit is given by this equation: Log(1 + 1/First Digit), where Log is the base-10 logarithm. This results in the frequency distribution shown below, which differs from a uniform distribution.

Benford's Law distribution: LOG(1+1/First Digit)

First Digit   Frequency
1             30.1%
2             17.6%
3             12.5%
4              9.7%
5              7.9%
6              6.7%
7              5.8%
8              5.1%
9              4.6%
Total        100.0%

[Chart: Benford's Law vs Uniform distribution. X axis: First Digit (1 to 9). Y axis: Frequency (0% to 35%). Series: Benford, Uniform.]
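The Benford frequencies on this slide can be reproduced directly from the formula. The sketch below assumes base-10 logarithms; the function name `benford_first_digit` is only an illustrative choice.

```python
import math

def benford_first_digit(d):
    """Expected share of numbers whose first digit is d (1..9), base-10 log."""
    return math.log10(1 + 1 / d)

benford = {d: benford_first_digit(d) for d in range(1, 10)}
uniform = 1 / 9   # the "speculated" share if every digit were equally likely

for d, p in benford.items():
    print(f"{d}: Benford {p:6.1%}   uniform {uniform:6.1%}")
```

The nine shares add up to exactly 100% because the sum of Log((d+1)/d) for d = 1 to 9 collapses to Log(10) = 1.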

Page 6: Benford's law

When does this law work?

The data should cross at least one scale (or order of magnitude), as shown below:

Scale     Range
Scale 1   1 to 9
Scale 2   10 to 99
Scale 3   100 to 999
Scale 4   1,000 to 9,999
Etc…      Etc…

Preferably, the sample should have more than 100 observations.
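A rough screen for these two conditions could look like the sketch below. It reads "crossing a scale" as the data falling into at least two of the scales listed above, which is an interpretation rather than something the slide spells out. The function names and default thresholds are hypothetical.

```python
import math

def orders_of_magnitude_spanned(values):
    """Count how many scales (1-9, 10-99, 100-999, ...) the positive values touch."""
    scales = {math.floor(math.log10(v)) for v in values if v > 0}
    return len(scales)

def looks_suitable(values, min_scales=2, min_sample=100):
    """Rough screen: enough observations, spread over enough scales."""
    positives = [v for v in values if v > 0]
    return (len(positives) > min_sample
            and orders_of_magnitude_spanned(positives) >= min_scales)

print(looks_suitable(list(range(1, 201))))   # 200 values across three scales
print(looks_suitable([5] * 500))             # large sample, but a single scale
```

Passing this screen does not guarantee a Benford fit; it only rules out data sets that are clearly too small or too narrow.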

Page 7: Benford's law

Demographic data follows Benford’s Law very closely

The U.S. has over 3,000 counties. All the demographic measures shown follow Benford’s Law pretty closely. This very large sample renders the Chi Square Goodness of fit test very (if not excessively) rigorous.

U.S. Census 2000 of counties population

First digit          Benford  Population  Births  Deaths  Natural increase  International migration  Domestic migration  Net migration
1                    30.1%    30.9%       30.3%   28.6%   31.2%             35.2%                    29.8%               29.8%
2                    17.6%    17.9%       16.5%   16.7%   17.6%             18.7%                    17.8%               18.4%
3                    12.5%    12.6%       13.7%   13.0%   12.1%             13.0%                    13.0%               13.0%
4                     9.7%     9.8%       10.4%   10.0%    9.2%              8.2%                     9.6%                9.5%
5                     7.9%     6.7%        7.8%    9.0%    7.6%              6.6%                     8.3%                7.9%
6                     6.7%     6.6%        6.3%    7.6%    6.5%              5.9%                     6.7%                7.2%
7                     5.8%     5.4%        5.8%    5.6%    6.5%              5.9%                     6.5%                5.5%
8                     5.1%     5.5%        4.6%    4.9%    4.8%              3.7%                     4.4%                4.5%
9                     4.6%     4.6%        4.6%    4.6%    4.5%              2.9%                     4.0%                4.1%

Chi Square P value            0.37        0.24    0.08    0.41              0.00                     0.30                0.48
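Why a very large sample makes the Chi Square test so demanding can be illustrated with the Population column above. The sketch holds the observed shares fixed and varies the sample size. The sample sizes are hypothetical, since the slide only says "over 3,000". The P value uses the closed-form chi-square tail for 8 degrees of freedom (9 digits minus 1), so no statistics library is needed.

```python
import math

def chi_square_p_value(observed_shares, n):
    """P value of a chi-square goodness-of-fit test against Benford's Law.

    observed_shares: first-digit shares for digits 1..9; n: sample size.
    With 9 categories there are 8 degrees of freedom, and for an even
    number of degrees of freedom the chi-square tail has a closed form.
    """
    expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
    stat = n * sum((o - e) ** 2 / e for o, e in zip(observed_shares, expected))
    half = stat / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(4))

# "Population" column from the table above, expressed as shares.
population = [0.309, 0.179, 0.126, 0.098, 0.067, 0.066, 0.054, 0.055, 0.046]

for n in (300, 3000, 30000):   # hypothetical sample sizes
    print(n, round(chi_square_p_value(population, n), 3))
```

The chi-square statistic grows in proportion to the sample size, so the same small deviations that look harmless with a few hundred observations are rejected with tens of thousands.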

Page 8: Benford's law

NYSE Stocks volume

This captures the first digit frequency of volume of over 2,000 NYSE stocks on June 21st. The fit is excellent both visually and statistically.

First Digit          Benford   NYSE Volume
1                    30.1%     30.8%
2                    17.6%     16.4%
3                    12.5%     13.6%
4                     9.7%      9.8%
5                     7.9%      8.0%
6                     6.7%      6.4%
7                     5.8%      5.6%
8                     5.1%      5.2%
9                     4.6%      4.2%

Chi Square P value             0.73

[Chart: NYSE Stocks' Volume on June 21. X axis: First digit (1 to 9). Y axis: First digit frequency (0% to 35%). Series: Benford, Volume.]

Page 9: Benford's law

PG&E SmartMeter test

First Digit          Benford   Analog   SmartMeter
1                    30.1%     33.0%    33.0%
2                    17.6%     22.0%    22.0%
3                    12.5%     12.1%    12.1%
4                     9.7%      9.9%     9.9%
5                     7.9%      4.4%     4.4%
6                     6.7%      5.5%     5.5%
7                     5.8%      5.5%     5.5%
8                     5.1%      3.3%     3.3%
9                     4.6%      4.4%     4.4%

Chi Square P value             0.90     0.90

This captures 91 observations between April and July 2010 of analog vs SmartMeter kWh consumption readings. Both the visual and statistical fit are pretty good.

[Chart: Benford vs PG&E kWh meters. X axis: First digit (1 to 9). Y axis: Frequency of first digit (0% to 35%). Series: Benford, PG&E.]

Page 10: Benford's law

Tennis pros ATP points

[Chart: ATP points. X axis: First Digit (1 to 9). Y axis: 0% to 35%. Series: Benford, ATP.]

The ATP point totals of the first 1,600 professional tennis players closely follow Benford’s Law. Because the sample is large, the associated P value is small.

Page 11: Benford's law

Even when it is not supposed to work… It kind of does.

I compared Bernie Madoff’s monthly returns with those of his closest competitor (GATEX). Although those data sets were not suited to Benford’s Law, the visual fit was surprisingly good.

[Chart: Benford's Law test: Madoff vs GATEX and S&P 500 distribution of monthly returns first digit. X axis: First Digit of monthly return (1 to 9). Y axis: Frequency (0% to 45%). Series: Benford, Madoff, GATEX.]

Page 12: Benford's law

Is Benford’s Law magic?

[Image label: Bacteria >]

No, a simple rule is that there are more small things than large things in the universe…

Page 13: Benford's law

… a simple explanation…

The general principle is that there are more small observations than large ones. There are probably nearly twice as many 1s as 2s, three times as many 1s as 3s, etc… Applying this principle throughout gives a frequency distribution that is close to Benford’s Law.

[Chart: First Digit frequency Benford's Law vs Simple rule. X axis: First digit (1 to 9). Y axis: First digit frequency (0% to 40%). Series: Benford, Simple.]

Digit   Benford log(1+1/d)   Simple rule   Simple proportion 1/d
1       30.1%                35.3%         1.00
2       17.6%                17.7%         0.50
3       12.5%                11.8%         0.33
4        9.7%                 8.8%         0.25
5        7.9%                 7.1%         0.20
6        6.7%                 5.9%         0.17
7        5.8%                 5.0%         0.14
8        5.1%                 4.4%         0.13
9        4.6%                 3.9%         0.11
Total                                      2.83

We would need a sample > 1,000 to reach statistical significance at the 0.05 level that those two distributions are different.
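The Simple rule column can be rebuilt by giving each digit a weight of 1/d and dividing by the sum of the weights (about 2.83). The sketch below does that and prints it next to Benford’s Law; the variable names are illustrative only.

```python
import math

digits = range(1, 10)
weights = {d: 1 / d for d in digits}      # 1, 1/2, 1/3, ... ("Simple proportion")
total = sum(weights.values())             # about 2.83
simple = {d: w / total for d, w in weights.items()}
benford = {d: math.log10(1 + 1 / d) for d in digits}

for d in digits:
    print(f"{d}: Benford {benford[d]:5.1%}   Simple rule {simple[d]:5.1%}")
```

The largest gap between the two is on digit 1 (about 35.3% vs 30.1%); for the other digits the two rules stay within about a percentage point of each other.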

Page 14: Benford's law

Extending Benford’s Law beyond first digit

Benford’s Law is not limited to the first digit. You can use as many leading digits as you want using the formula: Log(1+1/Digits). For instance, the frequency of numbers that start with 367 = Log(1+1/367) = 0.12%.
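The multi-digit formula can be checked with a one-line function. This is a small sketch assuming base-10 logarithms; `benford_leading` is a hypothetical name.

```python
import math

def benford_leading(prefix):
    """Expected share of numbers whose leading digits equal `prefix` (e.g. 367)."""
    if prefix < 1:
        raise ValueError("prefix must be a positive integer")
    return math.log10(1 + 1 / prefix)

print(f"{benford_leading(367):.2%}")   # 0.12%, as on the slide

# The ten two-digit prefixes 60..69 together make up the first-digit share for 6.
print(sum(benford_leading(p) for p in range(60, 70)), benford_leading(6))
```

The second line shows that the extended formula is consistent with the first-digit one: adding up the shares of all prefixes that begin with a given digit gives back that digit’s share.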

Page 15: Benford's law

Benford vs Simple rule for first two digits

When dealing with first two digits (10 – 99), Benford’s Law and the Simple Rule have indistinguishable distributions. You would need samples > 700,000 to reach statistical significance at the 0.05 level that the two distributions are different.

[Chart: 1st two Digits distribution Benford's Law vs Simple rule. X axis: First two digits (10 to 98). Y axis: First two digits frequency (0% to 5%). Series: Benford, Simple.]

Page 16: Benford's law

Time series growing by 2% per period

A time series growing by 2% per period over 116 periods replicates Benford’s Law frequency distribution almost exactly. This makes sense. Going from 1 to 2 is a 100% increase, whereas going from 2 to 3 is only a 50% increase, etc… A steadily growing series therefore stays on a leading 1 for more periods than on any other digit, so there will be a lot more 1s than other digits.

First digit frequencies

First digit   Benford (Expected)   Actual (Observed)
1             30.1%                30.2%
2             17.6%                17.2%
3             12.5%                12.9%
4              9.7%                 9.5%
5              7.9%                 7.8%
6              6.7%                 6.9%
7              5.8%                 6.0%
8              5.1%                 4.3%
9              4.6%                 5.2%

Chi Square 1.00

[Chart: Difference between one digit and the next. X axis: First Digit (1 to 9). Y axis: 0% to 120%. Bar labels: 100%, 50%, 33%, 25%, 20%, 17%, 14%, 12%.]
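The growth experiment is easy to replay. The slide does not give the starting value, so the sketch below assumes the series starts at 1 and is recorded after each of the 116 periods. Under that assumption the first-digit counts come out as 35, 20, 15, 11, 9, 8, 7, 5 and 6, which match the Observed column.

```python
from collections import Counter

def first_digit(x):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

# Assumption: the series starts at 1 and is recorded after each of 116 periods.
series = [1.02 ** k for k in range(1, 117)]
counts = Counter(first_digit(v) for v in series)

for d in range(1, 10):
    print(d, counts[d], f"{counts[d] / len(series):.1%}")
```

After 116 periods the series has grown by a factor of a little under 10, so it covers almost exactly one full scale, which is why the fit is so close.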

Page 17: Benford's law

Math properties of Benford’s Law

• Scale invariance: if a set of numbers closely follows Benford’s Law (BL), multiplying the numbers by any constant will create another set of numbers that also follows Benford’s Law. See the “Ones Scaling Test” on the next slide.

• Base invariance: if a set of numbers follows BL, re-expressing it using a different base (Log, natural log, etc…) will create another set of numbers that also follows BL.

Page 18: Benford's law

The Ones Scaling Test

Looking at tax return numbers that followed BL closely, someone used the Ones Scaling Test to see if the number of “1s” would remain the same if multiplied by a constant. In this case, they multiplied the set of numbers by 1.01 and did that 696 times. This corresponds to multiplying the numbers progressively up to a factor of about 1,000, as 1.01^696 ≈ 1,000.

As shown, across all iterations the number of 1s remained very stable around the BL predicted level of 30.1%.

Source: “The Scientist and Engineer’s Guide to Digital Signal Processing,” Steve Smith, PhD.
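The Ones Scaling Test can be imitated on synthetic data. The tax return numbers are not available here, so the sketch below uses a stand-in: 3,000 random values spread evenly on a log scale over five orders of magnitude, which follow Benford’s Law closely. The seed, the sample size and the variable names are all illustrative assumptions.

```python
import random

def first_digit(x):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

random.seed(0)
# Synthetic stand-in for the tax data (an assumption, not the original data set).
values = [10 ** random.uniform(0, 5) for _ in range(3000)]

shares_of_ones = []
for _ in range(697):                       # original data plus 696 rescalings
    ones = sum(1 for v in values if first_digit(v) == 1)
    shares_of_ones.append(ones / len(values))
    values = [v * 1.01 for v in values]    # multiply the whole set by 1.01

print(f"share of 1s: min {min(shares_of_ones):.3f}, max {max(shares_of_ones):.3f}")
```

If the data follows BL, the share of 1s should hover near 30.1% at every step; a data set that does not follow BL would typically show a share that swings as the numbers are rescaled.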

Page 19: Benford's law

What can we do with Benford’s Law?

Quite a bit it turns out!

Page 20: Benford's law

A few Benford’s Law applications…

• Investigating political elections integrity;

• Checking tax returns for fraud;

• Uncovering accounting fraud;

• Detecting false insurance claims.

Page 21: Benford's law

Iran Election

Mahmoud Ahmadinejad's vote totals have more '2s' and fewer '1s' than expected. Roukema speculates Iranian officials replaced 1s by 2s. So, for instance, in some town where he received 1,954 votes, they would report his having received 2,954 votes.

Source: Nate Silver. fivethirtyeight.com

Page 22: Benford's law

Franken Vote count

“…This hugely violates Benford's Law -- there are not nearly enough totals beginning in 1 and too many beginning in numbers like 5, 6 and 7. The odds of these anomalies having occurred by chance are greater than a quadrillion to one against… the reason this pattern emerges is because precinct sizes in Minnesota are not truly random. There is a large number of precincts in Minnesota that are designed to serve between 1,000 and 2,000 voters; since Franken won about 42 percent of the votes statewide, this leads to a relatively high number of instances where his vote totals are in the high single digits (672, 704, 588, etc.)”

Source: Nate Silver. fivethirtyeight.com

Senator

Page 23: Benford's law

Inspector Clouseau demonstrates how to run a fraud investigation

Page 24: Benford's law

Detecting fraud (an example). Step 1

                     2009 Q4 (483 checks)    2010 Q1 (522 checks)
First Digit          Benford    09 Q4        Benford    10 Q1
1                    145        155          157        146
2                     85         76           92         78
3                     60         57           65         67
4                     47         51           51         52
5                     38         36           41         40
6                     32         27           35         60
7                     28         30           30         28
8                     25         27           27         25
9                     22         25           24         26
Total                483        483          522        522

Chi Square P value   0.84                    0.06

A company issued 483 checks in 2009 Q4; that quarter was audited and everything checked out. It also issued 522 checks in 2010 Q1. A fraud investigator notes that the 2009 Q4 pattern fits Benford’s Law very closely (P value 0.84). He notes that the fit deteriorated in 2010 Q1 (P value 0.06).
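The comparison in Step 1 can be sketched as a chi-square goodness-of-fit test on the two columns of check counts. The slide does not say how its P values were computed, so this is only one reasonable setup: expected counts are taken from the unrounded Benford formula, the total is taken from each column (the 09 Q4 column as transcribed adds up to 484 rather than 483), and the test uses 8 degrees of freedom. The P values it prints need not match the slide’s figures.

```python
import math

def benford_chi_square(observed_counts):
    """Chi-square statistic and P value for first-digit counts (digits 1..9)."""
    n = sum(observed_counts)
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))
    # 9 digits give 8 degrees of freedom; for an even number of degrees of
    # freedom the chi-square tail probability has a closed form.
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(4))
    return stat, p_value

q4_2009 = [155, 76, 57, 51, 36, 27, 30, 27, 25]   # 09 Q4 column
q1_2010 = [146, 78, 67, 52, 40, 60, 28, 25, 26]   # 10 Q1 column

for label, counts in (("2009 Q4", q4_2009), ("2010 Q1", q1_2010)):
    stat, p = benford_chi_square(counts)
    print(f"{label}: chi-square {stat:.2f}, P value {p:.3f}")
```

Whatever the exact settings, the direction is the same as on the slide: 2009 Q4 sits comfortably close to Benford, while 2010 Q1 fits much worse, driven almost entirely by the excess of checks starting with 6.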

Page 25: Benford's law

Step 2. Focus on the difference

[Chart: Benford vs 2010 Q1. X axis: first digit (1 to 9). Y axis: 0% to 35%. Series: Benford, 10 Q1.]

As shown, the company has issued many more checks starting with the ‘6’ digit than expected (60 vs 35 for BL).

Page 26: Benford's law

Step 3. Focus on the 6s first two digits

We have 28 checks out of 522 starting with the two digits 66 vs 3.4 expected per Benford’s Law. This calls for further investigation.

Checks: 522 (values are # of checks)

First 2 dig.   Benford   10 Q1
60             3.7        3
61             3.7        4
62             3.6        5
63             3.6        2
64             3.5        5
65             3.5        4
66             3.4       28
67             3.4        4
68             3.3        3
69             3.3        2
Total          35        60

[Chart: 1st two Digits distribution Benford's Law vs Simple rule. X axis: First two digits (10 to 98). Y axis: First two digits frequency (0% to 5%). Series: Benford, Simple.]

Page 27: Benford's law

Step 4. Focus on the 66s to three digits

Carrying this analysis to the first three digits, we see an unusual number of checks starting with ‘666’ and ‘668.’ Later, we find that the checks starting with ‘666’ were legitimate ones that four employees wrote to pay for a monthly service that cost $5.95 per month plus tax or $6.66 with tax. Meanwhile, 9 of the 10 checks starting with ‘668’ were fraudulent ones.

Values are # of checks.

First 3 dig.   Benford   10 Q1
660            0.3        1
661            0.3        1
662            0.3        1
663            0.3        0
664            0.3        0
665            0.3        1
666            0.3       12
667            0.3        0
668            0.3       10
669            0.3        2
Total          3.4       28
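Steps 3 and 4 follow the same pattern: compute the Benford expectation for each leading-digit prefix and look for prefixes with far more checks than expected. A minimal sketch follows. The function name and the flagging threshold (at least 5 checks above expectation) are hypothetical choices for illustration, not part of the original analysis.

```python
import math

def flag_prefixes(observed, total_checks, min_excess=5.0):
    """Return prefixes whose observed count exceeds the Benford expectation
    by at least `min_excess` checks (an illustrative threshold)."""
    flagged = []
    for prefix, count in observed.items():
        expected = total_checks * math.log10(1 + 1 / prefix)
        if count - expected >= min_excess:
            flagged.append(prefix)
    return flagged

# Observed 2010 Q1 counts from the two tables above (522 checks in total).
two_digit = {60: 3, 61: 4, 62: 5, 63: 2, 64: 5, 65: 4, 66: 28, 67: 4, 68: 3, 69: 2}
three_digit = {660: 1, 661: 1, 662: 1, 663: 0, 664: 0, 665: 1,
               666: 12, 667: 0, 668: 10, 669: 2}

print(flag_prefixes(two_digit, 522))     # [66]
print(flag_prefixes(three_digit, 522))   # [666, 668]
```

A flag is only a pointer for where to look: in the example, the ‘666’ checks turned out to be legitimate and the ‘668’ checks mostly fraudulent.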

Page 28: Benford's law

Replicating Clouseau’s success

• The NY District Attorney’s Office applied the same methodology to uncover 103 checks out of 784 that were not authentic;

• The State of Arizona uncovered a $2 million check fraud in 1993;

• The State of North Carolina uncovered a $4.8 million procurement fraud over 2002 – 2005.

Page 29: Benford's law

The Key

• Benford’s Law helps you find “the needle in the haystack” within the data;

• This does not mean all anomalies are fraudulent. But, it helps in finding the ones that are.