YOUNG INDIA FELLOWSHIP
Statistics Course Group Project
Members :
Abhishek Chopra
Adhiraj Sarmah
Kshitij Garg
Mahesh Jakhotia
Tulasi Prasad Chaudhary
7/25/2011
The group project is based on a real case study of the Atlanta primary school test papers. Growing pressure on teachers to improve their classes' test performance has led to malpractice. Our task is to develop methodologies for detecting fraud, if any, in the following case.
Contents
1) Problem Statement
2) Logical Analysis
3) Inference
4) Our Interpretation of the Cheating Process
5) Statistical Approaches
6) ANOVA
7) Pictorial Method
8) The Wilcoxon Rank Sum Test
9) Appendix
a. Table A.1: Division of questions into groups based on approach 1 used in the ANOVA test
b. Table A.2: Class A Results
c. Table A.3: Class B Results
d. Table A.4: Class A Results
e. Table A.5: Class B Results
GROUP PROJECT STATISTICS – FRAUD DETECTION
Problem Statement: We have been given two data sets from two different classrooms, and we are required to determine, through strategy and analysis, whether there was teacher fraud in one or both of the classrooms.
There can be 4 different scenarios:
1) Both A & B have been tampered with.
2) Neither A nor B has been tampered with.
3) A is fraudulent, B is not.
4) B is fraudulent, A is not.
We have summarized our thought process in this document and demonstrated it through the Excel sheets attached in the folder. We have used various approaches to derive the solution; each methodology has its own assumptions and its own pros and cons.
Logical Analysis:
STEP – 1: We calculate the total number of correct answers for every question in both classes. Since we took a student-wise, question-wise view and assigned each correct answer the value '1', this total also gives the number of students who answered each question correctly in each class.
STEP – 2: We then take the total number of correct answers for the entire class and divide it by the total number of students, to arrive at the mean number of correct answers per student for both classes.
STEP – 3: Using the analysis above, we plot line graphs for both classes, with questions on the X-axis and class performance on the Y-axis. This provides a broad perspective on whether there is any evidence of fraud.
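Steps 1–3 can be sketched in a few lines of code. This is only an illustration: the matrices below are randomly generated stand-ins for the two answer sheets (the real data lives in the attached Excel sheets), and the shapes (20 students, 50 questions) are assumptions, not the actual class sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two answer sheets: one row per student,
# one column per question, 1 = correct, 0 = incorrect.
class_a = rng.integers(0, 2, size=(20, 50))
class_b = rng.integers(0, 2, size=(20, 50))

# STEP 1: total number of correct answers per question (column sums);
# because answers are 0/1, this is also the number of students who
# answered each question correctly.
correct_per_question_a = class_a.sum(axis=0)
correct_per_question_b = class_b.sum(axis=0)

# STEP 2: mean number of correct answers per student.
mean_per_student_a = class_a.sum() / class_a.shape[0]
mean_per_student_b = class_b.sum() / class_b.shape[0]

# STEP 3 would plot correct_per_question_* against the question number
# (questions on the X-axis, class performance on the Y-axis), e.g. with
# matplotlib's plt.plot(correct_per_question_a).
print(mean_per_student_a, mean_per_student_b)
```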
# We found that in Class A, questions 30 to 36 clearly show an anomaly.
STEP – 4: We decided to focus on the anomaly region. We analyzed questions 30-36 and looked for abnormal patterns in them in both classes.
# For Class A, there was a very clear pattern of identical, uniformly correct answers to questions 30-36 among a particular set of 16 students, which was not the case in Class B.
STEP – 5: We calculated the average score (i.e. the average number of correct answers) for each of these 16 students in Class A, including questions 30-36. The mean score of these 16 students was 46%.
For Class B, the mean score of all the students was 38%.
STEP – 6: We recalculated the average score (i.e. the average number of correct answers) for each of these 16 students in Class A, EXCLUDING questions 30-36. The mean score of the 16 students DECREASED to 42% (a decrease of 4 percentage points).
For Class B, the mean score of all the students INCREASED to 40% (an increase of 2 percentage points).
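Steps 5–6 amount to comparing the same mean with and without the suspect columns. The sketch below uses a made-up matrix (25 students, 50 questions, with 16 students artificially given all-correct answers on questions 30-36 to mimic tampering); none of these numbers are the project's actual data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical answer matrix for Class A (sizes are assumptions,
# not the actual project data): rows = students, columns = questions 1..50.
class_a = rng.integers(0, 2, size=(25, 50))

suspect = np.arange(29, 36)      # 0-based columns for questions 30-36
class_a[:16, suspect] = 1        # simulate tampering: 16 students all correct

keep = np.setdiff1d(np.arange(50), suspect)
sixteen = class_a[:16]

# STEP 5: mean fraction of correct answers including questions 30-36.
mean_with = sixteen.mean()

# STEP 6: the same mean excluding questions 30-36. Because the suspect
# block was forced to all-correct, dropping it lowers the mean, exactly
# the signature the project observed in Class A.
mean_without = sixteen[:, keep].mean()

print(round(mean_with, 3), round(mean_without, 3))
```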
INFERENCE: We can therefore say that the set of questions 30 to 36 shows reasonable grounds to believe that some form of cheating/tampering took place with respect to these questions.
Our interpretation of the Cheating Process
1) From questions 30 to 36, the graphs show a consistent jump for 16 students relative to the other students and to the average trend. This amounts to 16 students x 6 questions = 96 answers that have probably been tampered with.
2) The reasons to choose that particular set of questions (from 30 to 36) could be:
a) Since it is given that the level of difficulty increases with the question number, it is logical to assume that more students would answer the first half of the questions correctly than the second half, because the difficulty is lower at the beginning. In the same way, the second half of the questions would be expected to show fewer correct answers, as the difficulty is higher.
b) It would therefore be logically smart on the teacher's part to tamper in the second half of the questions, since most students would be expected to answer the first half correctly anyway. Even within the second half, it would be smarter to avoid tampering with the last few questions: they are the most difficult, so an increased number of correct answers there would be easily detected. It would thus be logical to choose questions somewhere near the beginning of the second half, well before the last few questions.
3) Choosing a consecutive set of questions for editing also reduces the time needed to alter the answers manually, which matters given the limited time generally available to an invigilator or teacher. And 96 altered answers are enough to raise the class average by a significant amount, an increase of 4 percentage points, as we later found from our analysis.
Statistical Approaches used:
1) ANOVA method: We first divided the classes into groups and applied ANOVA to see whether the groups have the same distribution. If one of the groups did not, we could conclude that that group's data had been tampered with, as it disturbed the distribution of the whole class. We used two approaches to divide the data into groups. Later, we used the Tukey method to find the groups whose means deviated.
2) Pictorial distribution: A graph was plotted with the questions on the X-axis and the class performance on the Y-axis. When we analyzed the Class A graph, we found that between questions 30-36 the plot was flat and the results were higher than the performance on the other questions. On a pictorial basis, we can conclude that fraud occurred in these questions.
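The "flat and elevated" region the graph reveals can also be flagged numerically, as a long run of identical per-question counts. This is a sketch on invented counts for a hypothetical 40-question paper, with a plateau planted at questions 30-36 to mimic the Class A pattern; it is not the project's actual data.

```python
# Hypothetical per-question correct counts; difficulty rises, so counts
# fall with question number, except for a planted plateau at 30-36.
counts = [18, 17, 17, 16, 15, 15, 14, 14, 13, 13,
          12, 12, 11, 11, 10, 10,  9,  9,  8,  8,
           7,  7,  6,  6,  5,  5,  4,  4,  3, 16,
          16, 16, 16, 16, 16, 16,  2,  2,  1,  1]

# Find the longest run of identical consecutive counts: on a paper of
# increasing difficulty, a long flat run is exactly the anomaly the
# line graph shows visually.
best_start, best_len = 0, 1
start, run = 0, 1
for i in range(1, len(counts)):
    if counts[i] == counts[i - 1]:
        run += 1
    else:
        start, run = i, 1
    if run > best_len:
        best_start, best_len = start, run

# Report the 1-based question number where the run begins and its length.
print(best_start + 1, best_len)
```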
3) The Wilcoxon Rank Sum Test: If we want to use the samples without normality assumptions, we can use the rank-sum approach (suited to non-normal distributions) discussed in section 9.2 of the textbook. Since the other tests rest on a number of mathematical assumptions that the given data do not satisfy, this approach, which requires weaker assumptions, is appropriate.
Approach 1: ANOVA Approach
To compare the means and distributions of several groups, ANOVA is preferred to multiple t-tests because ANOVA yields a single test statistic for comparing all the means, so the overall risk of a Type I error can be controlled. If we ran many t-tests, each at a given alpha level, we could not know the overall risk of a Type I error; certainly, the more tests one runs, the greater the risk of a false-positive conclusion somewhere among them.
We first divided Class A's questions into groups according to toughness level, measured by the cumulative number of right answers given by the students. For example, the total number of correct answers for the class is 445; we divided the questions into eight groups of equal cumulative area (445/8 ≈ 56), so the cumulative sum of scores in each group is about 56.
The data was divided into eight groups, as shown in appendix Table A.1. The ANOVA test was applied to these groups to determine whether their means were the same or different.
Test Hypothesis
H0: μ1 = μ2 = … = μ8
Ha: the means are not all the same (indicating that one or more of the groups has been tampered with, causing its mean to deviate from the others)