MSc in Computing (Data Analytics) Probability & Statistical Inference Lecture 1

Feb 25, 2016

Page 1: Probability & Statistical Inference Lecture 1

MSc in Computing (Data Analytics)

Probability & Statistical Inference Lecture 1

Page 2: Probability & Statistical Inference Lecture 1

Lecture Outline
Introduction
  General Info
  Questionnaire
Introduction to Statistics
  Statistics at work
  The Analytics Process
  Descriptive Statistics & Distributions
  Graphs and Visualisation

Page 3: Probability & Statistical Inference Lecture 1

Introduction
Name: Aoife D’Arcy
Email: [email protected]
Bio: Managing Director and Chief Consultant at the Analytics Store, with degrees in statistics, computer science, and financial & industrial mathematics. With over 12 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, she has developed particular expertise in risk analytics, fraud analytics, and customer insight analytics.
Lecture Notes: Will be available online at www.comp.dit.ie/bmacnamee and later on webcourses

Page 4: Probability & Statistical Inference Lecture 1

Course Outline
Week      Topic
1         Introduction to Statistics
2 & 3     Probability Theory
4         Introduction to SAS Enterprise Guide
5         Probability Distributions
6         Confidence Intervals
7 & 8     Hypothesis testing
9         Assignment
10 - 12   Regression Analysis
13        Revision

Page 5: Probability & Statistical Inference Lecture 1

Exam & Assignment
Exam: The end of term exam accounts for 60% of the overall mark.
Assignment: The assignment is worth 40% of the overall mark. It will be handed out in week 5, and week 9’s class will be dedicated to working on it.

Page 6: Probability & Statistical Inference Lecture 1

Software
SAS Enterprise Guide will be used during the course.

Page 7: Probability & Statistical Inference Lecture 1

Recommended Reading
Douglas C. Montgomery, Applied Statistics and Probability for Engineers, John Wiley & Sons
R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Probability and Statistics for Engineers and Scientists, Pearson Education
David Collett, Modelling Binary Data, Chapman & Hall
G. Grimmett & D. Stirzaker, Probability and Random Processes, Oxford University Press
George Casella, Statistical Inference, Brooks/Cole

Page 8: Probability & Statistical Inference Lecture 1

Questionnaire

Page 9: Probability & Statistical Inference Lecture 1

Section 1: Statistics are everywhere

Page 10: Probability & Statistical Inference Lecture 1

We are bombarded with Statistics

http://www.irishtimes.com/newspaper/frontpage/2012/0918/1224324122326.html
http://www.irishtimes.com/newspaper/world/2012/0914/1224324008884.html
http://www.independent.ie/business/world/survey-names-oslo-the-worlds-priciest-city-ireland-ranks-27th-3229426.html

Page 11: Probability & Statistical Inference Lecture 1

The internet is full of interesting statistics

http://www.usatoday.com/news/politics/twitter-election-meter

Page 12: Probability & Statistical Inference Lecture 1

Statistics can be misleading
An ad claimed: “9 Out of 10 Dentists prefer Colgate”. What is wrong with this statement?
Consider these complaints about airlines published in US News and World Report on February 5, 2001.
Can we conclude that United Airlines has the worst customer service?

Page 13: Probability & Statistical Inference Lecture 1

Statistics in Everyday Life
With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently.
Question: Is it good that statistics are used so much, and what happens when statistics are misused?

Page 14: Probability & Statistical Inference Lecture 1

Statistics can be misleading

Page 15: Probability & Statistical Inference Lecture 1

Misinterpreted Statistics can be Devastating

In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow.

He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8543 × 8543).

What is wrong with this assumption?
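One way to approach the question: multiplying the two probabilities treats the two deaths as independent events. The sketch below reproduces the "1 in 73 million" figure and then shows how the result changes if the second event is more likely once the first has occurred. The conditional probability of 1 in 100 is a purely hypothetical value chosen for illustration, not a figure from the case.

```python
from fractions import Fraction

# Probability of a single cot death quoted at the trial
p_single = Fraction(1, 8543)

# Multiplying the probabilities assumes the two deaths are independent
p_two_independent = p_single * p_single
print(p_two_independent.denominator)  # 72982849, i.e. about 1 in 73 million

# HYPOTHETICAL: if shared genetic or environmental factors made a second
# death far more likely after a first one, P(second | first) would be larger.
# The value 1/100 is an assumption used only to show the effect.
p_second_given_first = Fraction(1, 100)
p_two_dependent = p_single * p_second_given_first
print(p_two_dependent.denominator)  # 854300
```

The general rule is P(A and B) = P(A) × P(B | A); it reduces to P(A) × P(B) only when A and B are independent.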

Page 17: Probability & Statistical Inference Lecture 1

Challenges
As an Analytics practitioner you will face a number of challenges:
Create insight from all available data (and there is lots of it)
Interpret statistics correctly
Communicate statistically driven insight in a way that is clearly understood

Page 18: Probability & Statistical Inference Lecture 1

Objective of this course

Give you a set of statistical skills that allow you, as an analytics practitioner, to turn data into insight!

Page 19: Probability & Statistical Inference Lecture 1

The Analytics Process & Statistics

Page 20: Probability & Statistical Inference Lecture 1

Section Overview
Statistics and Analytics
Introduction to CRISP

Page 21: Probability & Statistical Inference Lecture 1

Data Analytics Is Multidisciplinary
[Figure: overlapping disciplines that contribute to data analytics: Databases, Statistics, Pattern Recognition, KDD, Machine Learning, AI, Neurocomputing, Predictive Analytics, Data Warehousing]

Page 22: Probability & Statistical Inference Lecture 1

Analytics Process

Data -> Insight -> Business Decision

Page 23: Probability & Statistical Inference Lecture 1

Analytics Is A Lot Of Things
[Figure: levels of analytics plotted by competitive advantage (vertical axis) against degree of intelligence (horizontal axis), listed here from highest to lowest]
Predictive Analytics
  Optimization: What’s the best that can happen?
  Predictive modelling: What will happen next?
  Forecasting/extrapolation: What if these trends continue?
  Statistical analysis: Why is this happening?
Access & reporting
  Alerts: What actions are needed?
  Query/drill down: Where exactly is the problem?
  Ad hoc reports: How many, how often, where?
  Standard reports: What happened?

Page 24: Probability & Statistical Inference Lecture 1

For this course we will concentrate on Statistical Analysis

[Figure: the levels-of-analytics chart from the previous slide, repeated: competitive advantage against degree of intelligence, from Standard reports up to Optimization]

Page 25: Probability & Statistical Inference Lecture 1

CRISP-DM Evolution
Over 200 members of the CRISP-DM SIG worldwide:
DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.
CRISP-DM 2.0 is due…
Complete information on CRISP-DM is available at: http://www.crisp-dm.org/

Page 26: Probability & Statistical Inference Lecture 1

CRISP-DM
Features of CRISP-DM:
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues as well as technical analysis
Framework for guidance
Experience base: templates for analysis

Page 27: Probability & Statistical Inference Lecture 1

[Figure: the CRISP-DM cycle, with Data at the centre and six phases around it: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 28: Probability & Statistical Inference Lecture 1

Phases & Generic Tasks
Phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment
Business Understanding
Generic tasks: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Page 29: Probability & Statistical Inference Lecture 1

Phases & Generic Tasks
Data Understanding
Generic tasks: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

Page 30: Probability & Statistical Inference Lecture 1

Phases & Generic Tasks
Data Preparation
Generic tasks: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.

Page 31: Probability & Statistical Inference Lecture 1

Phases & Generic Tasks
Modelling
Generic tasks: Select Modelling Technique, Generate Test Design, Build Model, Assess Model
In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

Page 32: Probability & Statistical Inference Lecture 1

Phases & Generic Tasks
Evaluation
Generic tasks: Evaluate Results, Review Process, Determine Next Steps
Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Page 33: Probability & Statistical Inference Lecture 1

Phases & Generic Tasks
Deployment
Generic tasks: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

Page 34: Probability & Statistical Inference Lecture 1

CRISP-DM
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment

Page 35: Probability & Statistical Inference Lecture 1

CRISP-DM: Areas covered in this course

Business Understanding

Data Understanding

Data Preparation

Modelling

Evaluation

Deployment

Page 36: Probability & Statistical Inference Lecture 1

Section 2: Descriptive Statistics & Distributions

Page 37: Probability & Statistical Inference Lecture 1

Topics
1. Introduction to Statistics
2. The Basics
3. Measures of location: Mean, Median & Mode.
4. Measures of location & Skew.
5. Measures of dispersion: range, standard deviation (variance) & interquartile range.

Page 38: Probability & Statistical Inference Lecture 1

Introduction to Statistics
According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data.
There are two main branches of Statistics:
The branch of statistics devoted to the organisation, summarization and description of data sets is called Descriptive Statistics.
The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.

Page 39: Probability & Statistical Inference Lecture 1

Process of Data Analysis
[Figure labels: Population, Representative Sample, Sample Statistic, Describe, Make Inference]
A statistical population is a data set that is our target of interest.
A sample is a subset of data selected from the target population.
If your sample is not representative then it is referred to as being biased.
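A minimal sketch of the population, sample and sample statistic relationship, using only Python's standard library. The population values are made-up numbers generated for illustration (an assumption, not course data), and the seed is fixed so the draw is repeatable.

```python
import random
import statistics

# HYPOTHETICAL population: 1,000 made-up values (illustration only)
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(1000)]

# A simple random sample gives every member the same chance of selection,
# which is one way of aiming for a representative sample
sample = rng.sample(population, 50)

# Describe: summarise the sample with a sample statistic (here the mean)
sample_mean = statistics.mean(sample)

# Make inference: use the sample statistic as an estimate of the
# population value, which is normally unknown in practice
population_mean = statistics.mean(population)
print(f"sample mean = {sample_mean:.2f}, population mean = {population_mean:.2f}")
```

A biased sample, for example one drawn only from the largest values, would give a sample mean that systematically misses the population mean.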

Page 40: Probability & Statistical Inference Lecture 1

Types of Data: Numeric Data
Numeric data can be of two types:
Continuous Data: Data is continuous if it has an interval of real numbers for its range
  Example: The number of centimetres of rain that fell in March
Discrete Data: Data is defined as discrete if it can take only a finite or countable set of distinct values
  Example: The number of correct answers in a 10 question quiz

Page 41: Probability & Statistical Inference Lecture 1

Types of Data: Categorical Data
Data that is broken into discrete categories is referred to as categorical data.
Categorical data has two main types:
Nominal: A nominal variable has a discrete number of categories or levels with no logical order
  Gender: Male, Female
  Working Status: Employed, Unemployed, Home-maker, Student, Retired
Ordinal: An ordinal variable has a discrete number of categories or levels with a logical order
  Income Level: Low, Medium, High
  Places in a race: 1st, 2nd, 3rd, 4th, 5th, 6th

Page 42: Probability & Statistical Inference Lecture 1

Class Task
Task: Classify the type of data in each of the following examples:
The profit margin made from customers of an online clothing company
The type of interest rate you can be charged on a mortgage, i.e. Fixed rate, Adjustable rate
Number of dependents associated with a loan applicant

Page 43: Probability & Statistical Inference Lecture 1

Let’s Start at the Very Beginning
When learning to read and write we start with A-B-C, when starting to count we start with 1-2-3 and of course the von Trapp Family Singers started with Do-Re-Mi!
When learning statistics you start with the arithmetic mean or a simple average

Page 44: Probability & Statistical Inference Lecture 1

The Arithmetic Mean

The table below shows the total medals won and gold medals won by each country in the last 5 Olympic games.

Year   Canada        China         Germany*      Russia**      United Kingdom   United States
       Total  Gold   Total  Gold   Total  Gold   Total  Gold   Total  Gold      Total  Gold
1992   18     7      54     16     82     33     112    45     20     5         108    37
1996   22     3      50     16     65     20     63     26     15     1         101    44
2000   14     3      59     28     56     13     88     32     28     11        92     37
2004   12     3      63     32     49     13     92     27     30     9         103    36
2008   18     3      100    51     41     16     72     23     47     19        110    36
Mean   16.8   3.8    65.2   28.6   58.6   19.0   85.4   30.6   28.0   9.0       102.8  38.0

* Germany combines East and West Germany prior to reunification
** Russia or The Soviet Union
Data source: http://www.databaseolympics.com/index.htm

Page 45: Probability & Statistical Inference Lecture 1

Arithmetic Mean – The Formula
The formula for calculating the sample arithmetic mean of n data points x_1, x_2, ..., x_n:

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

\bar{x} is referred to as x-bar
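The formula can be sketched directly in Python. The example uses the United Kingdom total-medal column from the Olympic table on the previous slide; the variable names are illustrative.

```python
# Total medals won by the United Kingdom at the 1992-2008 games
uk_totals = [20, 15, 28, 30, 47]

n = len(uk_totals)           # number of data points
x_bar = sum(uk_totals) / n   # sum of the x_i divided by n
print(x_bar)                 # 28.0
```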

Page 46: Probability & Statistical Inference Lecture 1

Attributes of the Arithmetic Mean
It is straightforward to calculate
It is easy to interpret the mean
It gives us a good estimate of where a set of numbers is centred
  This is referred to as the central tendency of a sample
It is sensitive to outliers

Page 47: Probability & Statistical Inference Lecture 1

Other Measures of Central Tendency

Median: The middle value of an ordered set of values, i.e. 50% higher and 50% lower

Mode: The most commonly occurring value in a distribution

Page 48: Probability & Statistical Inference Lecture 1

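Calculating the Median
Year   Medals
1964   90
1968   107
1972   94
1976   94
1980   0
1984   174
1988   94
1992   108
1996   101
2000   92
2004   103
2008   110
Sort the data
Medals (sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0
With 12 values, the median is the average of the 6th and 7th sorted values: (101 + 94) / 2
Median = 97.5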
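The sort-then-take-the-middle procedure can be sketched as follows, using the medal totals from this slide. The manual calculation is shown next to the standard library's `statistics.median` as a cross-check.

```python
import statistics

# Medal totals for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
n = len(ordered)

# With an even number of values, average the two middle values
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
print(manual_median)               # 97.5
print(statistics.median(medals))   # 97.5
```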

Page 49: Probability & Statistical Inference Lecture 1

Calculating the Mode

Medals by year (1964 to 2008): 90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110
Count frequencies
Medals   Count
174      1
110      1
108      1
107      1
103      1
101      1
94       3
92       1
90       1
0        1
Mode = 94
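Counting frequencies and picking the most common value can be sketched with `collections.Counter`, again using the medal totals from the slide.

```python
from collections import Counter

# Medal totals for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

# Build the frequency table: value -> number of occurrences
counts = Counter(medals)

# The mode is the value with the highest count
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 94 3
```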

Page 50: Probability & Statistical Inference Lecture 1

When to Use Each Central Tendency Value?

Question: When and why would you use the median over the mean?

Page 51: Probability & Statistical Inference Lecture 1

Let’s Look at the Variation in our Data

[Histogram: Distribution of the Total Olympic Medals won by any Country from 1964 - 2008. Horizontal axis bins: 0, 1 - 24, 24 - 48, 48 - 74, 74 - 97, 97 - 121, 121 - 146, 146 - 170. Vertical axis: Count, from 0 to 20.]

Page 52: Probability & Statistical Inference Lecture 1

Let’s Look at the Variation in our Data
[Histogram: Distribution of the Total Olympic Medals won by any Country from 1964 - 2008. Horizontal axis bins: 0, 1 - 24, 24 - 48, 48 - 74, 74 - 97, 97 - 121, 121 - 146, 146 - 170. Vertical axis: Count, from 0 to 20. Annotations: Central Tendency / Location; Spread/Variation.]

Page 53: Probability & Statistical Inference Lecture 1

Measures of Spread or Variation
Range
Variance
Standard Deviation
Inter-quartile Range

Page 54: Probability & Statistical Inference Lecture 1

Calculating the Range
The Range is calculated by subtracting the minimum value in a data set from the maximum value
The main advantage to using the range is the ease with which it is calculated
The major disadvantage of the range is that it is highly sensitive to outliers
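A short sketch of the range and its sensitivity to outliers, using the 1964-2008 medal totals from the median slides. Dropping the two extreme values (0 and 174) is done here only to illustrate the effect.

```python
# Medal totals for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

# Range = maximum value minus minimum value
full_range = max(medals) - min(medals)          # 174 - 0

# Remove the two extreme values to see how much they drive the range
trimmed = [m for m in medals if m not in (0, 174)]
trimmed_range = max(trimmed) - min(trimmed)     # 110 - 90

print(full_range, trimmed_range)  # 174 20
```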

Page 55: Probability & Statistical Inference Lecture 1

Calculating the Variance
As an example of Variance consider the following data:
OBS    Data
1      3
2      4
3      8
Sum    15
Mean   5

Page 56: Probability & Statistical Inference Lecture 1

Calculating the Variance
As an example of Variance consider the following data:
OBS    Data   Mean   Deviation
1      3      5      -2
2      4      5      -1
3      8      5      3
Sum    15     15     0
Mean   5      5      0

Page 57: Probability & Statistical Inference Lecture 1

Calculating the Variance
As an example of Variance consider the following data:
OBS    Data   Mean   Deviation   (Deviation)^2
1      3      5      -2          4
2      4      5      -1          1
3      8      5      3           9
Sum    15     15     0           14
Mean   5      5      0           4.67

Page 58: Probability & Statistical Inference Lecture 1

Variance – The Formula
Square the deviations around the mean before summing. For n data points x_1, x_2, ..., x_n:

\sum_{i=1}^{n} (x_i - \bar{x})^2

Divide by n-1 (?) to get the average of squared deviations:

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
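The worked example (data 3, 4, 8) can be checked in a few lines of Python. The sketch computes the sum of squared deviations and then divides by both n and n - 1, since the table on the earlier slide reports the mean of the squared deviations (divide by n) while the formula above divides by n - 1.

```python
import statistics

data = [3, 4, 8]
n = len(data)
x_bar = sum(data) / n                                   # 5.0

# Square each deviation from the mean, then sum
squared_deviations = [(x - x_bar) ** 2 for x in data]   # [4.0, 1.0, 9.0]
sum_sq = sum(squared_deviations)                        # 14.0

print(round(sum_sq / n, 2))   # 4.67, dividing by n
print(sum_sq / (n - 1))       # 7.0, the sample variance s^2

# Cross-check against the standard library's sample variance
print(statistics.variance(data) == sum_sq / (n - 1))    # True
```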

Page 59: Probability & Statistical Inference Lecture 1

Standard Deviation – The Formula
Take the square root of the variance. The value is in the original unit.

s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
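Continuing with the same three data points (3, 4, 8), a minimal sketch of the standard deviation as the square root of the sample variance:

```python
import math
import statistics

data = [3, 4, 8]
n = len(data)
x_bar = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # 7.0

# Standard deviation: square root of the variance, in the original unit
s = math.sqrt(s_squared)
print(round(s, 3))  # 2.646
```

`statistics.stdev(data)` from the standard library gives the same sample standard deviation.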

Page 60: Probability & Statistical Inference Lecture 1

Standard Deviation

Question: Why might it be useful to have the value in the original unit?

Page 61: Probability & Statistical Inference Lecture 1

Percentiles
The nth percentile is a value that has a proportion n/100 of the sample taking values at or lower than it, and a proportion (100 - n)/100 taking values larger than it
Example: if your grade in an industrial engineering class was located at the 84th percentile, then 84% of the grades were equal to or lower than your grade and 16% were higher

Page 62: Probability & Statistical Inference Lecture 1

Inter-quartile Range
The median is the 50th percentile
The 25th percentile and the 75th percentile are called the lower quartile and upper quartile respectively (or 1st and 3rd)
The difference between the lower and upper quartile is called the inter-quartile range

Page 63: Probability & Statistical Inference Lecture 1

Quartiles Example
Medals by year (1964 to 2008): 90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110
Sort the data
Medals (sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0
25th Percentile = 1st Quartile = 93
50th Percentile = Median = 97.5
75th Percentile = 3rd Quartile = 107.5
Inter-quartile Range = 107.5 – 93 = 14.5
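The quartile values on this slide match the "median of each half" convention: split the sorted data in two and take the median of the lower and upper halves. The sketch below assumes that convention. Other conventions, such as the interpolation used by `statistics.quantiles`, can give slightly different quartiles for the same data.

```python
import statistics

# Medal totals for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
half = len(ordered) // 2

# Median-of-halves convention (an assumption that matches the slide's values)
lower_half = ordered[:half]    # [0, 90, 92, 94, 94, 94]
upper_half = ordered[-half:]   # [101, 103, 107, 108, 110, 174]

q1 = statistics.median(lower_half)   # 93.0
q2 = statistics.median(ordered)      # 97.5
q3 = statistics.median(upper_half)   # 107.5
iqr = q3 - q1                        # 14.5
print(q1, q2, q3, iqr)
```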

Page 64: Probability & Statistical Inference Lecture 1

Proportions
The proportion, p, is the fraction of items in a population that belong to a certain class, for example:
The proportion of your customers that are female
The proportion of voters that will vote for Labour in the next election
A proportion is calculated as:

p = \frac{C}{N}

where C is the number of items in a population of size N that belong to the class of interest

Page 65: Probability & Statistical Inference Lecture 1

Skew – The Shape of a Distribution
There are a number of ways of describing the shape of a distribution.

We will consider only one – skew.

Skew is a measure of how asymmetric a distribution is.

Page 66: Probability & Statistical Inference Lecture 1

Symmetric Distributions: skew is zero

Page 67: Probability & Statistical Inference Lecture 1

Positive Skew
There are a few very large data points which create a 'tail' going to the right (i.e. up the number line)
Note: No axis of symmetry here - skew > 0 (i.e. it is positive)
Example: Lifetime of people, house prices

Page 68: Probability & Statistical Inference Lecture 1

Negative Skew
There are a few very small data points which create a 'tail' going to the left (i.e. down the number line)
Note: No axis of symmetry here - skew < 0 (i.e. it is negative)
Examples: Examination Scores, reaction times for drivers

Page 69: Probability & Statistical Inference Lecture 1

Skew & Measures of Location - Symmetry
Mean, Median & Mode are the same and are found in the middle

      6
      6
    5 6 7
  4 5 6 7 8
3 4 5 6 7 8 9

Mean = 102/17 = 6
Median = 6
Mode = 6

Page 70: Probability & Statistical Inference Lecture 1

Positive Skew

  6
  6
5 6 7
5 6 7 8 9
5 6 7 8 9 10 11

Mean = 121/17 = 7.12
Median = 7
Mode = 6
In general: Mode < Median < Mean

Page 71: Probability & Statistical Inference Lecture 1

Negative Skew

          6
          6
        5 6 7
    3 4 5 6 7
1 2 3 4 5 6 7

Mean = 83/17 = 4.88
Median = 5
Mode = 6
In general: Mode > Median > Mean
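The three small data sets from the symmetry, positive skew and negative skew slides can be checked with Python's `statistics` module. The sketch simply recomputes the mean, median and mode for each so the ordering of the three measures can be compared.

```python
import statistics

# The 17 values shown on each of the three slides
datasets = {
    "symmetric": [6, 6, 5, 6, 7, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9],
    "positive skew": [6, 6, 5, 6, 7, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11],
    "negative skew": [6, 6, 5, 6, 7, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
}

summary = {}
for name, data in datasets.items():
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)
    summary[name] = (mean, median, mode)
    print(f"{name}: mean={mean:.2f} median={median} mode={mode}")

# symmetric: mean=6.00 median=6 mode=6
# positive skew: mean=7.12 median=7 mode=6
# negative skew: mean=4.88 median=5 mode=6
```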

Page 72: Probability & Statistical Inference Lecture 1

Section 3: Graphs and Visualisation

Page 73: Probability & Statistical Inference Lecture 1

Graphical Displays
A way of letting people get a 'picture' of relationships in the data set.

The simpler the better should be a rule in graphical display.

People can remember pictures better.

A good graph should show something that is not easy to ‘see’ using tables.

Page 74: Probability & Statistical Inference Lecture 1

Bar Charts
Used to display categorical data or discrete data with a modest number of values.
A Bar is drawn to represent each category.
The Bar height represents the frequency or % in each category.
Allows for visual comparison of relative frequencies.
Need to draw up a frequency distribution table first.
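A minimal sketch of the "frequency distribution table first" step, followed by a text-only bar for each category. The survey responses are made-up values for illustration (an assumption, not course data), using the Working Status categories from the categorical data slide.

```python
from collections import Counter

# HYPOTHETICAL survey responses (made-up data for illustration only)
responses = [
    "Employed", "Student", "Employed", "Retired", "Unemployed",
    "Employed", "Student", "Home-maker", "Employed", "Retired",
]

# Step 1: frequency distribution table (category -> count)
frequency = Counter(responses)
total = len(responses)

# Step 2: one bar per category; bar length represents the frequency
for category, count in frequency.most_common():
    percent = 100 * count / total
    print(f"{category:<12}{count:>3}{percent:>7.1f}%  {'#' * count}")
```

The same frequency table could be passed to a plotting tool to draw a graphical bar chart; the text bars here only show the idea.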

Page 75: Probability & Statistical Inference Lecture 1

Core Statistical Plots
[Histogram: Points Scored by any Team in Six Nations Championship 2000 - 2011. Horizontal axis bin edges: 42, 65.375, 88.75, 112.125, 135.5, 158.875, 182.25, 205.625, More. Vertical axis: count, from 0 to 25.]

Page 76: Probability & Statistical Inference Lecture 1

Core Statistical Plots
Comparisons: Column Charts, Box Plots

Page 77: Probability & Statistical Inference Lecture 1

Core Statistical Plots
Correlations: Scatter Plots
Trends (time): Line Charts

Page 78: Probability & Statistical Inference Lecture 1

Core Statistical Plots
Proportions: Pie Chart, Column Chart

Page 79: Probability & Statistical Inference Lecture 1

Some Hans Inspiration to Finish Up
http://www.youtube.com/watch?v=fTznEIZRkLg