Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Aug 18, 2015

TECSI FEA USP
Page 1: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Text Mining and Continuous Assurance

Kevin Moffitt

Page 2: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Continuous Assurance

• Allows for the automated and frequent review of business data

• Current focus is on structured data

– General ledgers

– Financial statements

– XBRL

• However, we cannot ignore the information found in unstructured data

– Textual data, for example the narrative portion of financial disclosures

• Up to 85% of the data in financial disclosures is in the form of text

Page 3: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Text Mining

• Many methods for extracting data from text

• One popular method is to use dictionaries/word lists

• E.g., a dictionary to identify positive language in business documents…

SATISFIES

PREEMINENT

REWARDED

BENEFITTING

SOLVING

COLLABORATIONS

BOOST

TREMENDOUS

GREATEST

PERFECTLY

DELIGHTING

COMPLIMENTING

EXCITING

REBOUNDED

CONCLUSIVE

ASSURE

INNOVATED

ENJOYING

CREATIVE

GREATLY
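A minimal sketch of the dictionary method, assuming a tiny word list copied from the slide above and a simple regex tokenizer. The names `POSITIVE_WORDS`, `tokenize`, and `positive_word_rate` are illustrative, not from the talk.

```python
import re
from collections import Counter

# Hypothetical mini word list: a few of the positive words shown on the slide.
POSITIVE_WORDS = {
    "satisfies", "preeminent", "rewarded", "boost", "tremendous",
    "greatest", "exciting", "rebounded", "creative", "greatly",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def positive_word_rate(text, word_list=POSITIVE_WORDS):
    """Return (hit counts, total words, hits per 1,000 words)."""
    tokens = tokenize(text)
    hits = Counter(t for t in tokens if t in word_list)
    total = len(tokens)
    rate = 1000.0 * sum(hits.values()) / total if total else 0.0
    return hits, total, rate


sample = "Sales rebounded and margins improved greatly. Results were not exciting."
hits, total, rate = positive_word_rate(sample)
print(dict(hits), total, rate)
```

Note that "exciting" is counted as positive even though the sample says "not exciting": each word is scored without regard to its neighbors.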

Page 4: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Drawbacks of Dictionary Method

• Single words

– Context Free

– Naïve

Page 5: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Lexical Bundles

• Frequent multi-word sequences in a given corpus (e.g. financial reports, history journals, biology journals)

• More context in phrases than in individual words

• Criteria for identifying lexical bundles

– Sequences of four or more words

– Occur in at least 15% of the unique documents

– Occur at a rate of at least 20 times per million words

Example Lexical Bundles from Annual Reports

the fair value of

be adversely affected by

as a percentage of

assets and liabilities and
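The criteria above can be sketched as a simple n-gram filter. This is an illustration under stated assumptions, not the procedure used in the study: the tokenizer, the function names, and the toy documents are hypothetical, and the sketch handles one sequence length `n` at a time (call it again with n=5, 6, … for longer bundles).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """All contiguous n-word sequences, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_bundles(docs, n=4, min_doc_share=0.15, min_per_million=20.0):
    """Return {bundle: (rate per million words, share of documents)}."""
    counts = Counter()    # total occurrences of each n-gram
    doc_freq = Counter()  # number of documents containing each n-gram
    total_words = 0
    for doc in docs:
        tokens = tokenize(doc)
        total_words += len(tokens)
        grams = ngrams(tokens, n)
        counts.update(grams)
        doc_freq.update(set(grams))
    bundles = {}
    for gram, count in counts.items():
        rate = 1_000_000 * count / total_words
        share = doc_freq[gram] / len(docs)
        if rate >= min_per_million and share >= min_doc_share:
            bundles[gram] = (rate, share)
    return bundles


# Toy corpus (hypothetical sentences); a real corpus would be full MD&A texts.
docs = [
    "the fair value of the assets was estimated",
    "changes in the fair value of derivatives",
    "revenue increased as a percentage of sales",
]
print(lexical_bundles(docs, n=4, min_doc_share=0.5))
```

On a corpus this small every 4-gram clears the per-million threshold, so only the document-share threshold does any filtering here; on a realistic corpus both thresholds matter.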

Page 6: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Lexical Bundles

• Research objective: use lexical bundles to discriminate between fraudulent and non-fraudulent financial reports

Page 7: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Research Questions

• RQ1: What are the most frequently used lexical bundles in the Management Discussion and Analysis (MD&A) sections of fraudulent and non-fraudulent annual reports?

• RQ2: Which lexical bundles are used at a considerably different rate in fraudulent and non-fraudulent MD&As?

• RQ3: Can lexical bundles be used to classify fraudulent and non-fraudulent MD&As at a rate greater than chance?

Page 8: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Sample Selection

• Identified 101 fraudulent annual reports (10-Ks) from a set of SEC investigations

• Analyzed the Management Discussion and Analysis (MD&A) section of the 10-K

– Gives investors a view of the company from management’s perspective

– Contains some of the least structured language in the 10-K

– Most-read part of the 10-K

Page 9: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Sample Selection

Sample selection criteria for fraudulent 10-Ks

Criterion                                                                  Count
Companies identified as fraudulent by searching through AAERs               141
Disqualified because fraud did not involve 10-Ks                            (20)
Disqualified because 10-K was not available from the EDGAR DB               (10)
Disqualified because 10-K did not contain management discussion section     (10)
Final count of qualifying fraudulent 10-Ks used in the sample               101

Page 10: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Sample Selection—Types of Fraud

Type of Fraud                                                   Companies
Overstatement of revenues                                       44
Combination of overstating revenue and understating expenses    25
Disclosure issue                                                10
Overstatement of inventory                                      6
Other income increasing effects                                 6
Understatement of provisions for loan-loss reserves             5
Other                                                           5

Page 11: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Sample Selection – Non-Fraudulent sample

• 101 Matching Non-Fraudulent 10-Ks were identified

Page 12: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Results

Page 13: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Lexical Bundle Identification

• 560 Lexical Bundles were identified

Page 14: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Creative Accounting

Lexical Bundle                          Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
in process research and development     199                               76                                   160%
goodwill and other intangible assets    121                               82                                   47%
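A small sketch of the arithmetic behind the table columns. The slides do not state the formula for "% difference"; the version below assumes it is the percent by which the larger rate exceeds the smaller one, which is an assumption. Because the slide shows rounded rates, recomputing from them lands close to, but not exactly on, the percentages shown. The function names are illustrative.

```python
def per_million(count, total_words):
    """Occurrences per million words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1_000_000 * count / total_words


def percent_difference(rate_a, rate_b):
    """Percent by which the larger rate exceeds the smaller (assumed convention)."""
    low, high = sorted((rate_a, rate_b))
    if low <= 0:
        raise ValueError("both rates must be positive")
    return 100.0 * (high - low) / low


# Rounded rates from the first row of the table above: fraud 199, non-fraud 76.
print(round(percent_difference(199, 76), 1))

# Hypothetical raw counts: 5 occurrences in a 250,000-word corpus.
print(per_million(5, 250_000))
```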

Page 15: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Big Bath Charges

• Wholesale aggressive restructuring to improve cost and expense structure for the future

– Disposition of long-lived assets

Lexical Bundle                          Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
disposition of long lived assets and    49                                21                                   139%

Page 16: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Fair Value Accounting

• Subjective method for assigning value to an asset

– Change value of assets

– Understate debt obligations

– Misrepresent foreign currency exchange adjustments

Lexical Bundle                  Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
the fair value of               257                               171                                  50%
in foreign currency exchange    41                                21                                   97%

Page 17: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Lexical Bundles Used More Frequently in Non-Fraudulent MD&As

• Conservative language for accounting practices

Lexical Bundle                    Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
to continue as a going concern    15                                91                                   513%
disclosures about market risk     85                                115                                  36%
material impact on the            38                                52                                   35%

Page 18: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Principal Component Analysis

• Variable reduction procedure

– Combines correlated variables into principal components

• Principal components

– First component accounts for maximum amount of total variance in the observed variables

– Components are uncorrelated

• Components are made up of correlated variables

– Overlapping lexical bundles are combined

Correlated bundles transformed into one principal component

4-word bundles          (5-word overlaps)             6-word component
there can be no         there can be no assurance     there can be no assurance that
can be no assurance     can be no assurance that
be no assurance that
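To illustrate why overlapping bundles collapse into one component, here is a minimal pure-Python sketch that finds only the first principal component of a small correlation matrix by power iteration. It is not the procedure used in the study, which would rely on a statistics package and extract all components. The per-document rates are made up for illustration, and the function names are hypothetical.

```python
import math


def column_standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n = len(rows)
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        out_cols.append([(x - mean) / sd for x in col])
    return [list(r) for r in zip(*out_cols)]


def correlation_matrix(z):
    """Correlation matrix of already-standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]


def first_component(matrix, iterations=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue


bundles = ["there can be no", "can be no assurance",
           "be no assurance that", "the fair value of"]
# Hypothetical bundle rates: one row per document, one column per bundle.
rates = [
    [10, 9, 8, 4],
    [2, 1, 2, 4],
    [6, 6, 5, 1],
    [0, 1, 0, 4],
    [8, 7, 7, 4],
    [4, 4, 3, 1],
]
loadings, variance = first_component(correlation_matrix(column_standardize(rates)))
for name, loading in zip(bundles, loadings):
    print(f"{name:22s} {loading:+.2f}")
print("variance explained by component 1:", round(variance / len(bundles), 2))
```

In this toy data the three overlapping "no assurance" bundles rise and fall together across documents, so they load heavily on the first component, while the unrelated bundle loads close to zero.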

Page 19: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Principal Component Analysis

• 560 Lexical Bundles were reduced to 88 principal components

Page 20: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Component 1

principles generally accepted in

accounting principles generally accepted

generally accepted in the

accepted in the united

with accounting principles generally

affect the reported amounts

reported amounts of assets

that affect the reported

to make estimates and

factors that could cause

actual results to differ

results to differ materially

of assets and liabilities

actual results may differ

to differ materially from

differ materially from those

forward looking statements this

in the united states

allowance for doubtful accounts

are expected to be

company believes that the

Page 21: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Component 1

with accounting principles generally accepted in the united states

that affect the reported amounts of assets and liabilities

are expected to be

company believes that the

to make estimates and

factors that could cause

forward looking statements this

allowance for doubtful accounts

actual results to

actual results may differ materially from those

“GAAP and expected results”

Page 22: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Component 2

have a material adverse

material adverse effect on

a material adverse effect

adverse effect on the

business financial condition and

could have a material

effect on the company's

can be no assurance

be no assurance that

there can be no

assurance that the company

of one or more

the company will be

no assurance that the

of the company's products

that the company will

and will continue to

Page 23: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Component 2

could have a material adverse effect on the company's

there can be no assurance that the company will be

business financial condition and

of one or more

of the company's products

and will continue to

“Could be bad”

Page 24: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Classification Results

• Discriminant Analysis

– 71% of cross-validated cases were correctly classified

Discriminating factor (PC)                 Beta
Impact and exposure                        .464
Material difference                        -.421
Common stock and adverse effects           .412
Going concerns                             .363
New product introductions                  .339
Price and offsets                          .335
COGS and change in accounting principle    .330
Fair market value                          .313
Exercise of stock options                  .298
Number of Factors                          -.287

Page 25: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Confusion Matrix

                  Predicted Class
Actual Class      Fraudulent    Non-Fraudulent
Fraudulent        70            31
Non-Fraudulent    28            73

Page 26: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Confusion Matrix Results

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

$FPR = \frac{FP}{FP + TN}$

$TPR = \frac{TP}{TP + FN}$

$\text{Precision} = \frac{TP}{TP + FP}$

Precision = .714

True Positive Rate = .693

False Positive Rate = .277

Accuracy = .708

                  Predicted Class
Actual Class      Fraudulent    Non-Fraudulent
Fraudulent        70 (TP)       31 (FN)
Non-Fraudulent    28 (FP)       73 (TN)
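As a check on the figures above, the four measures can be recomputed from the confusion matrix counts. The function name `confusion_metrics` is illustrative; the formulas are the standard definitions.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard classification measures from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Counts from the slide: 70 TP, 31 FN, 28 FP, 73 TN.
metrics = confusion_metrics(tp=70, fn=31, fp=28, tn=73)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.714
# true_positive_rate: 0.693
# false_positive_rate: 0.277
# accuracy: 0.708
```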

Page 27: Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Conclusion

• Lexical bundles have more contextual meaning than unigrams

– Results are easier to interpret

• Lexical bundles may be used to classify documents

• Lexical bundle analysis can be used in any type of textual dataset

• This process and other text mining processes can be integrated into continuous assurance solutions

– Rapid identification of suspicious documents