SQA Statistics David Young Department of Mathematics and Statistics, University of Strathclyde Royal Hospital for Sick Children, Yorkhill NHS Trust

Jan 11, 2016


Julia Anthony
Transcript
Page 1

SQA Statistics
David Young
Department of Mathematics and Statistics, University of Strathclyde
Royal Hospital for Sick Children, Yorkhill NHS Trust

Page 2

Introduction/Overview
• Course content
• Computer labs – usernames and software (Minitab, Excel, R)
  www.minitab.com
• Different approach than AH Statistics
• Emphasis on applications but theory needed to apply the correct approach – not mathematically intensive
• Applications of statistics to your own field of expertise e.g. Geography, Modern Studies, Psychology, Science
• http://personal.strath.ac.uk/david.young/SQA/
• Lab sessions – opportunity to talk to staff from other disciplines

Page 3

What is Statistics?
The American Heritage Dictionary defines statistics as …
"A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of data."

Page 4

Introduction
• what is statistics and why do we need it?
• statistics is the science of collecting, analysing, presenting and interpreting data
• it enables the objective evaluation of research questions of interest
• it provides the means to weigh up how much evidence the collected data provide for and against the research hypothesis of interest

Page 5

Examples of Research Hypotheses
• Can the symptom of oral dryness (xerostomia) in terminal cancer patients be relieved using a mucin-containing oral spray?
• Is a new painkilling drug more effective than the best alternative currently on the market?
• How common is the problem of aggression towards hospital staff?
• How does deprivation impact life expectancy in Scotland?
• How is tourism impacting wildlife in the Galapagos Islands?

Page 6

Statistics and Medical Research
• statistics plays an increasingly important role in research
• it is not possible, for example, to have a new drug treatment approved for use without solid, statistical evidence to support claims of efficacy and safety
• over the last few decades, many new statistical methods have been developed which have particular relevance for researchers in different fields e.g. psychology, clustering, big data
• these methods can be applied routinely using statistical software packages

Page 7

Importance of Statistics
• all researchers should understand some basic statistical concepts to ensure …
• appropriate study design and data collection
• application of the correct method of statistical analysis when using software
• accurate and honest reporting of data gathered
• adequate understanding of claims made by other researchers when reviewing reports

Page 8

Basic Terminology
• data: values of the variable(s) of interest recorded on one or more individuals e.g. age, gender, duration of illness
• population: the large group of individuals under study (e.g. all cancer patients)
• sample: a subset of the population of interest ‘selected’ as being representative of the population as a whole

Page 9

Sampling
• draw useful conclusions about a population of interest
• ‘target population’ too large – practical problems in terms of time, money, staff, resources
• study a sample of the population of interest
• use the sample of individuals to infer something useful about the underlying population
• vast amounts of data can be collected which should be summarised in a useful way – graphically or numerically
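The idea of studying a sample in place of the whole population can be sketched in a few lines of Python. The population here is simulated and entirely hypothetical (10,000 ages drawn from a bell-shaped distribution); it is not data from the course, and the sample size of 100 is an arbitrary choice.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated ages (an assumption, not course data).
rng = random.Random(2016)  # fixed seed so the sketch is repeatable
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Study a simple random sample of 100 individuals instead of everyone.
sample = rng.sample(population, 100)

# Use the sample summary to infer something about the population.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a random sample of this size, the sample mean would normally be expected to land close to the population mean, which is what makes inference from the sample reasonable.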

Page 10

Sampling
[Diagram: a SAMPLE is drawn from the POPULATION; inference runs from the sample back to the population]

Page 11

Types of Data
categorical data
• nominal – the data can be classified into a number of specific categories with no particular ordering e.g. eye colour, blood type, marital status, gender
• ordinal – the data can be classified into a number of specific categories which can be placed in some order of importance e.g. pain scores (mild, moderate, severe or unbearable), deprivation category score within Glasgow (ranges 1–7 from affluent to poor, classified by postcode)

Page 12

Types of Data
numerical data
• discrete – data are recorded as a whole number and usually only take specific values e.g. age in years, number of cigarettes smoked in a day, number of children
• continuous – data are recorded to the precision of the measuring instrument and usually take any value within a certain range e.g. height, weight, blood pressure

Page 13

Problem 1
Which of the following are categorical variables?
(a) gender
(b) no. of children
(c) diastolic blood pressure
(d) diagnosis
(e) height

Page 14

Problem 2
Which of the following are continuous numerical variables?
(a) blood type
(b) peak expiratory flow rate
(c) age last birthday
(d) exact age
(e) family size

Page 15

Graphical Summaries
• enable checking of underlying assumptions required for any formal analysis (e.g. are data normally distributed?)
• help to identify patterns or outliers
• give an idea of relationships between variables
• interpretation of patterns is subjective
• beware of data dredging

Page 16

Pie Chart

Page 17

Bar Chart

Page 18

Scatter Plot

Page 19

Histogram

Page 20

Types of Histogram

Page 21

Types of Histogram

Page 22

Types of Histogram

Page 23

Types of Histogram

Page 24

Numerical Summaries
• measures of location
– mean, median, mode
• mean (average value)
– if there are n observations x1, x2, …, xn in a data set then the mean is calculated as …

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

– which can be written mathematically as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
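As a minimal sketch, the mean formula translates directly into Python. The function name `mean` and the four data values are illustrative assumptions, not part of the course material.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by n."""
    if not values:
        raise ValueError("mean requires at least one observation")
    return sum(values) / len(values)


# Hypothetical data, chosen only to make the arithmetic easy to follow.
data = [2, 4, 6, 8]
print(mean(data))  # (2 + 4 + 6 + 8) / 4 = 5.0
```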

Page 25

Measures of Location
• median (‘middle value’): to calculate the median …
– arrange the data in ascending order
– if n is odd then the median is the middle value
– if n is even then the median is the average of the ‘two middle’ values
• e.g. 15 17 17 19 24 36 42 has median value 19
• e.g. 1 2 5 7 7 8 has a median value (5+7)/2=6
• the median is often denoted as Q2
• mode (most common value)
– in the two examples above the modal values are 17 and 7 respectively
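The steps for the median and mode can be sketched as follows, using the two data sets from this slide. The function names are illustrative; Python's built-in `statistics.median` and `statistics.mode` would give the same answers here.

```python
from collections import Counter


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)  # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    """Most common value; if several values tie, the first one encountered is returned."""
    return Counter(values).most_common(1)[0][0]


odd_example = [15, 17, 17, 19, 24, 36, 42]
even_example = [1, 2, 5, 7, 7, 8]

print(median(odd_example), mode(odd_example))    # 19 17
print(median(even_example), mode(even_example))  # 6.0 7
```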

Page 26

Measures of Spread
• measures of spread
– variance, standard deviation, range, inter-quartile range
• variance
– the variance of n observations x1, x2, …, xn is given by …

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

where $\bar{x}$ is the mean of the observations
• standard deviation
– s (i.e. the square root of the variance) is called the standard deviation and has the same units of measurement as the data
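A short sketch of the sample variance and standard deviation with the n − 1 divisor. The function names are illustrative, and the data are the six values 1 2 5 7 7 8 used in the earlier median example.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))


data = [1, 2, 5, 7, 7, 8]  # mean is 5, squared deviations sum to 42
print(sample_variance(data))      # 42 / 5 = 8.4
print(round(sample_sd(data), 3))  # 2.898
```

Python's `statistics.variance` and `statistics.stdev` use the same n − 1 divisor.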

Page 27

Measures of Spread
• range
– the difference between the maximum and minimum values in a data set
• inter-quartile range
– arrange the data in ascending order and calculate the upper (Q3) and lower (Q1) quartiles
– the quartiles, along with the median, divide the data set into four equal parts (i.e. quarters)
– the inter-quartile range is Q3-Q1
• unlike the range, the inter-quartile range is largely unaffected by outliers
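A sketch of the range and inter-quartile range, using the seven values from the earlier median example. Quartile conventions differ between software packages; the `"exclusive"` method of `statistics.quantiles` is an assumption made here, so other tools may report slightly different quartiles for the same data.

```python
import statistics


def data_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)


def iqr(values):
    """Inter-quartile range Q3 - Q1 (quartile convention is an assumption)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1


data = [15, 17, 17, 19, 24, 36, 42]
print(data_range(data), iqr(data))  # 27 19.0

# Replace the largest value with an extreme one: the range changes, the IQR does not.
with_outlier = [15, 17, 17, 19, 24, 36, 420]
print(data_range(with_outlier), iqr(with_outlier))  # 405 19.0
```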

Page 28

Example 1
• the following numbers refer to the length of time spent in hospital (days) for 7 patients after a particular operation …
  2 2 3 2 15 1 3
• are these data categorical or numerical?
• calculate the mean, median and mode of these data
• compute the range, inter-quartile range and standard deviation of these data
• are any of the data values unusual?
• is it reasonable to assume that these data are normally distributed?
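One way to check hand calculations for this example is with Python's `statistics` module. This is a sketch, not the course's worked solution; the variable names are illustrative, and the quartile convention (`method="exclusive"`) is an assumption, so other software may report slightly different quartiles.

```python
import statistics

stay_days = [2, 2, 3, 2, 15, 1, 3]  # length of stay (days) from Example 1

mean = statistics.mean(stay_days)      # 28 / 7 = 4
median = statistics.median(stay_days)  # middle of 1 2 2 2 3 3 15
mode = statistics.mode(stay_days)

data_range = max(stay_days) - min(stay_days)
q1, _q2, q3 = statistics.quantiles(stay_days, n=4, method="exclusive")
iqr = q3 - q1
sd = statistics.stdev(stay_days)       # uses the n - 1 divisor

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, IQR={iqr}, SD={sd:.2f}")
```

The single stay of 15 days pulls the mean (4) well above the median (2) and inflates the standard deviation, which is worth bearing in mind for the last two questions.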

Page 29

Example 2
• the following numbers are the ages of children admitted to an outpatient department with burns …
• 7 6 6 9 10 8 7 7 7 7 6
• are these data categorical or numerical?
• calculate the mean, median and mode of these data
• compute the range, inter-quartile range and standard deviation of these data
• are any of the data values unusual?
• is it reasonable to assume that these data are normally distributed?
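For the question about unusual values, one common convention (an assumption here, not something stated in the slides) is the boxplot rule: flag values lying more than 1.5 × IQR below Q1 or above Q3. A minimal sketch, with a hypothetical helper name and an assumed quartile method:

```python
import statistics


def flag_unusual(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (boxplot convention, assumed)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [x for x in values if x < lower or x > upper]


ages = [7, 6, 6, 9, 10, 8, 7, 7, 7, 7, 6]  # Example 2
stays = [2, 2, 3, 2, 15, 1, 3]             # Example 1

print(flag_unusual(ages))   # []
print(flag_unusual(stays))  # [15]
```

A flagged value is only a prompt to look more closely at the data; whether it is genuinely unusual is a judgement about the context.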

Page 30

Relationship between Mean and Median
[Figure: distribution of length of stay (days), horizontal axis from 30 to 120; Mean = 66.47, Median = 53.00]

Page 31

Choice of Summary Statistics
• for data which are normally distributed (or symmetric), the mean and standard deviation are the appropriate measures of location and spread respectively
• when data are skewed, the mean and standard deviation are influenced by outliers and the appropriate measures of location and spread for reporting this type of data are the median and inter-quartile range
• note that for normally distributed data (and symmetric data) the mean and median will be approximately the same

Page 32

Problem 3
When a distribution is skewed to the right …
(a) the median is greater than the mean
(b) the median is equal to the mean
(c) the tail on the left is shorter than the tail on the right
(d) the standard deviation is less than the variance
(e) the majority of observations are less than the mean

Page 33

Age and Weight Minitab Example

Descriptive Statistics: Age, Weight

Variable   N    Mean   StDev  Minimum      Q1  Median      Q3  Maximum
Age       20  34.150   3.602   27.000  31.250  34.500  36.750   42.000
Weight    20   65.65   12.04    52.00   56.00   63.00   74.25    92.00
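The range, inter-quartile range and the gap between mean and median can be read off the Minitab output directly. The sketch below copies the reported values into a dictionary; the layout and the helper name `derived` are illustrative assumptions.

```python
# Values copied from the Minitab descriptive statistics output.
summary = {
    "Age": {"mean": 34.150, "minimum": 27.000, "q1": 31.250,
            "median": 34.500, "q3": 36.750, "maximum": 42.000},
    "Weight": {"mean": 65.65, "minimum": 52.00, "q1": 56.00,
               "median": 63.00, "q3": 74.25, "maximum": 92.00},
}


def derived(stats):
    """Range, inter-quartile range and mean-minus-median for one variable."""
    return {
        "range": stats["maximum"] - stats["minimum"],
        "iqr": stats["q3"] - stats["q1"],
        "mean_minus_median": stats["mean"] - stats["median"],
    }


for name, stats in summary.items():
    d = derived(stats)
    print(f"{name}: range={d['range']:.2f}, IQR={d['iqr']:.2f}, "
          f"mean - median={d['mean_minus_median']:.2f}")
```

A mean noticeably above the median, as for Weight, may hint at right skew; the histogram and boxplot are the natural place to check.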

Page 34

Histogram of Age Variable

Page 35

Boxplot of Age Variable

Page 36

Histogram of Weight Variable

Page 37

Boxplot of Weight Variable

Page 38

The Normal Distribution
• the normal (or Gaussian) distribution is the most important of all the distributions since it has a wide range of practical applications
• the characteristic bell-shape of this distribution describes many distributions which occur in practice
• the distribution is symmetric about the mean
• set proportions of the area under the curve fall within set limits from the mean …
  68.3% lies within one standard deviation of the mean
  95.4% lies within two standard deviations of the mean
  99.7% lies within three standard deviations of the mean
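These proportions can be checked numerically from the cumulative distribution function of the standard normal distribution. The sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * area_within(k):.2f}%")
```

Because the calculation is done in standard deviation units, the same proportions hold for any normal distribution, whatever its mean and standard deviation.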

Page 39

Histogram of Normal Data

Page 40

Uses of the Normal Distribution
• many physical measurements are closely approximated by the normal distribution, particularly measurements in which there is natural variation (e.g. some biological measurements like height and weight are approximately normally distributed)
• data which are normally distributed can be more easily analysed by using parametric methods of statistical testing (see later)

Page 41

Example
• In a published paper it is reported that the mean age of 500 injecting drug users in Glasgow is 28 years with a standard deviation of 5 years.
• Assuming that the authors have reported the correct summary statistics for the data they have gathered, what can you assume about the distribution of the ages?
• What do the reported summary statistics tell you about the location and spread of the ages of injecting drug users in Glasgow?
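If the ages are assumed to be approximately normally distributed (an assumption suggested by the choice of mean and standard deviation as summaries, not something the paper is known to state), the reported values translate into intervals as sketched below. The helper name `interval` is illustrative.

```python
from statistics import NormalDist

# Assumption: ages are approximately normal with the reported mean and SD.
ages = NormalDist(mu=28, sigma=5)
n_users = 500


def interval(k):
    """Limits k SDs either side of the mean, and the share of ages expected inside."""
    low = ages.mean - k * ages.stdev
    high = ages.mean + k * ages.stdev
    share = ages.cdf(high) - ages.cdf(low)
    return low, high, share


for k in (1, 2, 3):
    low, high, share = interval(k)
    print(f"{low:.0f} to {high:.0f} years: about {100 * share:.1f}% "
          f"(roughly {round(share * n_users)} of {n_users})")
```

Under this assumption, most of the 500 users would be expected to be between about 18 and 38 years old.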

Page 42

Interpretation

Page 43

Problem 4
In general, which of the following statements is FALSE?
(a) the sample mean is more sensitive to extreme values than the median
(b) the sample inter-quartile range is more sensitive to extreme values than the standard deviation
(c) the sample standard deviation is a measure of spread around the sample mean
(d) the sample standard deviation is a measure of central tendency around the median
(e) if a distribution is normal, then the mean will be equal to the median

Page 44

Problem 5
If diastolic blood pressure is normally distributed …
(a) there would be fewer observations below the mean than above it
(b) the standard deviation would be equal to the mean
(c) the majority of observations would be less than two standard deviations from the mean
(d) the standard deviation would estimate the spread of blood pressure measurements
(e) the mean will be equal to the median