Probability & Statistical Inference Lecture 1

MSc in Computing (Data Analytics)


Lecture Outline

Introduction
General Info
Questionnaire
Introduction to Statistics
Statistics at work
The Analytics Process
Descriptive Statistics & Distributions
Graphs and Visualisation

Introduction

Name: Aoife D’Arcy
Email: [email protected]
Bio: Managing Director and Chief Consultant at the Analytics Store, with degrees in statistics, computer science, and financial & industrial mathematics. With over 12 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, I have developed particular expertise in risk analytics, fraud analytics, and customer insight analytics.

Lecture Notes: Will be available online on www.comp.dit.ie/bmacnamee and later on webcourses

Course Outline

Week      Topic
1         Introduction to Statistics
2 & 3     Probability Theory
4         Introduction to SAS Enterprise Guide
5         Probability Distributions
6         Confidence Intervals
7 & 8     Hypothesis testing
9         Assignment
10 - 12   Regression Analysis
13        Revision

Exam & Assignment

Exam
The end of term exam accounts for 60% of the overall mark.

Assignment
The assignment is worth 40% of the overall mark.
The assignment will be handed out in week 5.
Week 9’s class will be dedicated to working on the assignment.

Software
SAS Enterprise Guide will be the software used during the course.

Recommended Reading

Applied Statistics and Probability for Engineers, Douglas C. Montgomery, John Wiley & Sons
Probability and Statistics for Engineers and Scientists, R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Pearson Education
Modelling Binary Data, David Collett, Chapman & Hall
Probability and Random Processes, G. Grimmett & D. Stirzaker, Oxford University Press
Statistical Inference, George Casella, Brooks/Cole

Questionnaire

Section 1: Statistics are everywhere

We are bombarded with Statistics

http://www.irishtimes.com/newspaper/frontpage/2012/0918/1224324122326.html
http://www.irishtimes.com/newspaper/world/2012/0914/1224324008884.html
http://www.independent.ie/business/world/survey-names-oslo-the-worlds-priciest-city-ireland-ranks-27th-3229426.html

The internet is full of interesting statistics

http://www.usatoday.com/news/politics/twitter-election-meter


Statistics can be misleading An ad claimed:

“9 Out of 10 Dentists prefer Colgate” What is wrong with this statement?

Consider these complaints about airlines published in US News and World Report on February 5, 2001

Can we conclude that United Airlines has the worst customer service?

Statistics in Everyday Life

With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently.

Question: Is it good that statistics are used so much, and what happens when statistics are misused?

Statistics can be misleading

Misinterpreted Statistics can be Devastating

In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow.

He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8543 × 8543).

What is wrong with this assumption?
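As a starting point for the question above, the arithmetic behind the "1 in 73 million" figure can be reproduced in a few lines. This is only a sketch of the calculation quoted on the slide; the variable names are illustrative. Note that squaring the single-death probability applies the multiplication rule P(A and B) = P(A) × P(B), which holds only when the two events are independent.

```python
from fractions import Fraction

# Probability of a single cot death quoted in the trial (from the slide)
p_single = Fraction(1, 8543)

# Squaring applies P(A and B) = P(A) * P(B), which is valid
# only if the two deaths are independent events.
p_double_if_independent = p_single * p_single

print(p_double_if_independent.denominator)  # 72982849, roughly 73 million
```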

Video

http://www.youtube.com/watch?v=4TKbIidbyhk&feature=fvwrel

Challenges

As an Analytics practitioner you will face a number of challenges:

Create insight from all available data (and there is lots of it)
Interpret statistics correctly
Communicate statistically driven insight in a way that is clearly understood

Objective of this course

Give you a set of statistical skills that allow you, as an analytics practitioner, to turn data into insight!

The Analytics Process & Statistics

Section Overview Statistics and Analytics Introduction to CRISP

Data Analytics Is Multidisciplinary

[Figure: disciplines contributing to data analytics: Databases, Statistics, Pattern Recognition, KDD, Machine Learning, AI, Neurocomputing, Predictive Analytics, Data Warehousing]

Analytics Process

Data Insight Business Decision

Analytics Is A Lot Of Things

[Figure: analytics techniques plotted by Degree of intelligence (x-axis) against Competitive advantage (y-axis), listed here from highest to lowest]

Predictive Analytics
Optimization: What’s the best that can happen?
Predictive modelling: What will happen next?
Forecasting/extrapolation: What if these trends continue?
Statistical analysis: Why is this happening?

Access & reporting
Alerts: What actions are needed?
Query/drill down: Where exactly is the problem?
Ad hoc reports: How many, how often, where?
Standard reports: What happened?

For this course we will concentrate on Statistical Analysis


CRISP-DM Evolution

Over 200 members of the CRISP-DM SIG worldwide
DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.
CRISP-DM 2.0 is due…

Complete information on CRISP-DM is available at: http://www.crisp-dm.org/

CRISP-DM

Features of CRISP-DM:

Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues as well as technical analysis
Framework for guidance
Experience base: Templates for Analysis

[Figure: the CRISP-DM cycle, with Data at the centre surrounded by the phases Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment]

Phases & Generic Tasks

Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

Business Understanding

Generic tasks: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.


Data Understanding

Generic tasks: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.


Data Preparation

Generic tasks: Select Data, Clean Data, Construct Data, Integrate Data, Format Data

The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.


Modelling

Generic tasks: Select Modeling Technique, Generate Test Design, Build Model, Assess Model

In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.


Evaluation

Generic tasks: Evaluate Results, Review Process, Determine Next Steps

Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.


Deployment

Generic tasks: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project

Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

CRISP-DM

Business Understanding

Data Understanding

Data Preparation

Modelling

Evaluation

Deployment.

Crisp – DM – Areas covered in this course

Business Understanding

Data Understanding

Data Preparation

Modelling

Evaluation

Deployment

Section 2: Descriptive Statistics & Distributions

Topics

1. Introduction to Statistics
2. The Basics
3. Measures of location: Mean, Median & Mode.
4. Measures of location & Skew.
5. Measures of dispersion: range, standard deviation (variance) & interquartile range.

Introduction to Statistics

According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data.

There are two main branches of Statistics:

The branch of statistics devoted to the organisation, summarization and the description of data sets is called Descriptive Statistics.

The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.

Process of Data Analysis

[Diagram: Population, Representative Sample, Sample Statistic, with arrows labelled Describe and Make Inference]

A statistical population is a data set that is our target of interest.

A sample is a subset of data selected from the target population.

If your sample is not representative then it is referred to as being biased.

Types of Data: Numeric Data

Numeric data can be of two types:

Continuous Data: Data is continuous if it has an interval of real numbers for its range
    Example: The number of centimetres of rain that fell in March
Discrete Data: Data is defined as discrete if it has a finite (or countable) range
    Example: The number of correct answers in a 10 question quiz

Types of Data: Categorical Data

Data that is broken into discrete categories is referred to as categorical data. Categorical data has two main types:

Nominal: A nominal variable has a discrete number of categories or levels with no logical order
    Gender: Male, Female
    Working Status: Employed, Unemployed, Home-maker, Student, Retired
Ordinal: An ordinal variable has a discrete number of categories or levels with a logical order
    Income Level: Low, Medium, High
    Places in a race: 1st, 2nd, 3rd, 4th, 5th, 6th

Class Task

Task: Classify the type of data in each of the following examples:

The profit margin made from customers of an online clothing company
The type of interest rate you can be charged on a mortgage, i.e. Fixed rate, Adjustable rate
The number of dependents associated with a loan applicant

Let’s Start at the Very Beginning

When learning to read and write we start with A-B-C, when starting to count we start with 1-2-3, and of course the Von Trapp Family Singers started with Do-Re-Mi!

When learning statistics you start with the arithmetic mean, or simple average.

The Arithmetic Mean

The table below shows the total medals won and gold medals won by each country in the last 5 Olympic games

Year   Canada        China         Germany*      Russia**      United Kingdom   United States
       Total  Gold   Total  Gold   Total  Gold   Total  Gold   Total  Gold      Total  Gold
1992   18     7      54     16     82     33     112    45     20     5         108    37
1996   22     3      50     16     65     20     63     26     15     1         101    44
2000   14     3      59     28     56     13     88     32     28     11        92     37
2004   12     3      63     32     49     13     92     27     30     9         103    36
2008   18     3      100    51     41     16     72     23     47     19        110    36
Mean   17     4      65     28     58     19     85     30     28     9         102    38

• * Germany combines East and West Germany prior to reunification
• ** Russia or The Soviet Union
• Data source http://www.databaseolympics.com/index.htm

Arithmetic Mean – The Formula

The formula for calculating the sample arithmetic mean of n data points x1, x2 ..... xn:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$\bar{x}$ is referred to as x-bar
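The formula can be checked against the medal table with a short Python sketch. It uses the United States "Total" column (1992 to 2008) from the table above; the variable names are illustrative. The table reports its means as whole numbers, so the unrounded value printed here differs slightly from the table's entry.

```python
# United States total medals, 1992-2008, taken from the medal table
us_totals = [108, 101, 92, 103, 110]

n = len(us_totals)
x_bar = sum(us_totals) / n  # sum of the x_i divided by n

print(x_bar)  # 102.8
```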

Attributes of the Arithmetic Mean

It is straightforward to calculate
It is easy to interpret the mean
It gives us a good estimate of where a set of numbers is centred
    This is referred to as the central tendency of a sample
It is sensitive to outliers

Other Measures of Central Tendency

Median: The middle value of an ordered set of values, i.e. 50% higher and 50% lower

Mode: The most commonly occurring value in a distribution

Calculating the Median

Year   Medals
1964   90
1968   107
1972   94
1976   94
1980   0
1984   174
1988   94
1992   108
1996   101
2000   92
2004   103
2008   110

Sort the data

Medals (Sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0

Median = 97.5
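The median calculation above can be sketched in Python. With 12 values there is no single middle value, so the median is taken as the average of the two middle values of the sorted data (94 and 101). The variable names are illustrative; `statistics.median` from the standard library follows the same convention.

```python
from statistics import median

# Medal totals for 1964-2008, as listed on the slide
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
mid = len(ordered) // 2

# Even number of values: average the two middle ones
manual_median = (ordered[mid - 1] + ordered[mid]) / 2

print(manual_median)   # 97.5
print(median(medals))  # 97.5
```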

Calculating the Mode

Count the frequency of each value in the medal data:

Medals   Count
174      1
110      1
108      1
107      1
103      1
101      1
94       3
92       1
90       1
0        1

Mode = 94
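Counting frequencies, as on the slide, can be sketched with `collections.Counter` from the Python standard library. The medal list is the slide's data and the variable names are illustrative.

```python
from collections import Counter

# Medal totals for 1964-2008, as listed on the slide
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

# Count how often each value occurs
counts = Counter(medals)

# The mode is the value with the highest count
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 94 3
```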

When to Use Each Central Tendency Value?

Question: When and why would you use the median over the mean?

Let’s Look at the Variation in our Data

[Figure: histogram titled "Distribution of the Total Olympic Medals won by any Country from 1964 - 2008". x-axis bins: 0, 1 - 24, 24 - 48, 48 - 74, 74 - 97, 97 - 121, 121 - 146, 146 - 170; y-axis: Count, 0 to 20. Annotations: Central Tendency / Location and Spread/Variation]

Measures of Spread or Variation

Range
Variance
Standard Deviation
Inter-quartile Range

Calculating the Range

The range is calculated by subtracting the minimum value in a data set from the maximum value

The main advantage of using the range is the ease with which it is calculated

The major disadvantage of the range is that it is highly sensitive to outliers
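A small Python sketch, using the medal totals from the median example, shows both the calculation and the sensitivity to outliers. Dropping the single smallest and largest values is done here purely for illustration and is not a procedure taken from the lecture.

```python
# Medal totals for 1964-2008, as used in the median example
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

# Range = maximum - minimum
full_range = max(medals) - min(medals)

# Illustration only: remove the smallest (0) and largest (174) values
trimmed = sorted(medals)[1:-1]
trimmed_range = max(trimmed) - min(trimmed)

print(full_range)     # 174
print(trimmed_range)  # 20
```

Two extreme values account for most of the range, which is why it is usually reported alongside a more robust measure.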

Calculating the Variance

As an example of Variance consider the following data:

OBS    Data   Mean   Deviation   (Deviation)^2
1      3      5      -2          4
2      4      5      -1          1
3      8      5      3           9
Sum    15     15     0           14
Mean   5      5      0           4.67

Variance – The Formula

Square the deviations around the mean before summing. For n data points x1, x2 ..... xn:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Divide by n-1 (?) to get the average of squared deviations:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
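The formula can be applied to the three-point example (3, 4, 8) in a short Python sketch. It computes both the n-1 version given by the formula and the divide-by-n average of squared deviations (the 4.67 in the worked table), so the two can be compared. The variable names are illustrative.

```python
data = [3, 4, 8]

n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations around the mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)

sample_variance = sum_sq_dev / (n - 1)  # the formula above (divide by n-1)
mean_sq_dev = sum_sq_dev / n            # plain average of squared deviations

print(sum_sq_dev)             # 14.0
print(sample_variance)        # 7.0
print(round(mean_sq_dev, 2))  # 4.67
```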

Standard Deviation – The Formula

Take the square root of the variance. The value is in the original unit

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
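Continuing with the three-point example (3, 4, 8), a minimal Python sketch of the standard deviation: the manual calculation follows the formula above, and `statistics.stdev` from the standard library (which also divides by n-1) is printed for comparison. The variable names are illustrative.

```python
import math
from statistics import stdev

data = [3, 4, 8]

n = len(data)
x_bar = sum(data) / n

# Sample variance (divide by n-1), then take the square root
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s_squared)

print(round(s, 4))            # 2.6458
print(round(stdev(data), 4))  # 2.6458
```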

Standard Deviation

Question: Why might it be useful to have the value in the original unit?

Percentiles

The nth percentile is a value that has a proportion $n/100$ of the sample taking values at or lower than it, and a proportion $(100 - n)/100$ taking values larger than it

Example: if your grade in an industrial engineering class was located at the 84th percentile, then 84% of the grades were equal to or lower than your grade and 16% were higher

Inter-quartile Range

The median is the 50th percentile

The 25th percentile and the 75th percentile are called the lower quartile and upper quartile respectively (or 1st and 3rd)

The difference between the lower and upper quartile is called the inter-quartile range

Quartiles Example

Sort the data

Medals (Sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0

25th Percentile = 1st Quartile = 93

50th Percentile = Median = 97.5

75th Percentile = 3rd Quartile = 107.5

Inter-quartile Range = 107.5 – 93 = 14.5
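The quartiles in this example can be reproduced with a short Python sketch. It assumes the "median of each half" convention, which matches the slide's figures; other conventions exist, and library functions such as `statistics.quantiles` interpolate differently and can return slightly different quartiles for the same data. The variable names are illustrative.

```python
from statistics import median

# Medal totals for 1964-2008, as listed on the slide
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
half = len(ordered) // 2

# Assumed convention: quartiles are the medians of the lower and upper halves
lower_half = ordered[:half]
upper_half = ordered[-half:]

q1 = median(lower_half)
q2 = median(ordered)
q3 = median(upper_half)
iqr = q3 - q1

print(q1, q2, q3)  # 93.0 97.5 107.5
print(iqr)         # 14.5
```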

Proportions

The proportion, p, is the fraction of items in a population that belong to a certain class, for example:

The proportion of your customers that are female
The proportion of voters that will vote for Labour in the next election

A proportion is calculated as:

$$p = \frac{C}{N}$$

where C is the number of items in a population of size N that belong to the class of interest
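As a quick illustration of the formula, the sketch below uses made-up numbers: a hypothetical customer base of 400, of whom 120 belong to the class of interest. Neither figure comes from the lecture.

```python
# Hypothetical figures, for illustration only
N = 400  # population size
C = 120  # number of items in the class of interest

p = C / N

print(p)  # 0.3
```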

Skew – The Shape of a Distribution

There are a number of ways of describing the shape of a distribution.

We will consider only one – skew.

Skew is a measure of how asymmetric a distribution is.

Symmetric Distributions = skew is zero

Positive Skew

There are a few very large data points which create a 'tail' going to the right (i.e. up the number line)

Note: No axis of symmetry here - skew > 0 (i.e. it is positive)

Example: Lifetime of people, house prices

Negative Skew

There are a few very small data points which create a 'tail' going to the left (i.e. down the number line)

Note: No axis of symmetry here - skew < 0 (i.e. it is negative)

Examples: Examination Scores, reaction times for drivers

Skew & Measures of Location - Symmetry

Mean, Median & Mode are the same and are found in the middle

[Stacked dot plot]
            6
            6
          5 6 7
        4 5 6 7 8
      3 4 5 6 7 8 9

Mean = 102/17 = 6
Median = 6
Mode = 6

Positive Skew

[Figure: distribution curve marking the Mode, Median and Mean]

[Stacked dot plot]
        6
        6
      5 6 7
      5 6 7 8 9
      5 6 7 8 9 10 11

Mean = 121/17 = 7.12
Median = 7
Mode = 6

In general: Mode < Median < Mean

Negative Skew

[Figure: distribution curve marking the Mode, Median and Mean]

[Stacked dot plot]
                  6
                  6
                5 6 7
            3 4 5 6 7
        1 2 3 4 5 6 7

Mean = 83/17 = 4.88
Median = 5
Mode = 6

In general: Mode > Median > Mean
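The three dot-plot examples (symmetric, positive skew, negative skew) can be checked with a short Python sketch. The data lists are read off the dot plots row by row, and the helper name `location` is illustrative rather than something from the lecture.

```python
from statistics import median, mode

# Values read off the three stacked dot plots, row by row
symmetric = [6, 6, 5, 6, 7, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9]
positive = [6, 6, 5, 6, 7, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11]
negative = [6, 6, 5, 6, 7, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7]

def location(values):
    """Return (mean rounded to 2 d.p., median, mode) for a list of numbers."""
    mean = sum(values) / len(values)
    return round(mean, 2), median(values), mode(values)

print(location(symmetric))  # (6.0, 6, 6)
print(location(positive))   # (7.12, 7, 6)
print(location(negative))   # (4.88, 5, 6)
```

The mean is pulled furthest towards the tail, the median less so, and the mode stays at the peak, which gives the orderings stated on the slides.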

Section 3: Graphs and Visualisation

Graphical Displays

A way of letting people get a 'picture' of relationships in the data set.

The simpler the better should be a rule in graphical display.

People can remember pictures better.

A good graph should show something that is not easy to ‘see’ using tables.

Bar Charts

Used to display categorical data or discrete data with a modest number of values.
A Bar is drawn to represent each category.
The Bar height represents the frequency or % in each category.
Allows for visual comparison of relative frequencies.
Need to draw up a frequency distribution table first.

Core Statistical Plots

[Figure: histogram titled "Points Scored by any Team in Six Nations Championship 2000 - 2011". x-axis bins: 42, 65.375, 88.75, 112.125, 135.5, 158.875, 182.25, 205.625, More; y-axis: 0 to 25]

Core Statistical Plots

Comparisons: Column Charts, Box Plots
Correlations: Scatter Plots
Trends (time): Line Charts
Proportions: Pie Chart, Column Chart

Some Hans Inspiration to Finish Up

http://www.youtube.com/watch?v=fTznEIZRkLg
