Basic Concepts Manual

Jun 03, 2018

    Basic Concepts Reference Manual:

    A gentle overview

    Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University

    These reference manuals have been developed to assist students in the basics of statistical computing, sort of a "Statistical Computing for Dummies." It is not our intention to use this manual to teach statistical concepts [1] but rather to demonstrate how to utilize previously taught statistical and data analysis concepts the way that professionals and practitioners apply them: through the able assistance of computing. Proficiency in software allows students to focus more on the interpretation of the output and on the application of results rather than on the mathematical computations.

    We should pause here and strongly make the point that computers should serve as a medium of expediency of calculation, not as a substitution for the ability to execute a calculation.

    In the Basic Concepts manual, we present statistical concepts, context for their use, and formulas where appropriate. We provide exercises to execute these concepts by hand. Then, in each subsequent manual, the concepts are applied in a consistent manner using each of the five major statistical computing packages: Excel, SPSS, Minitab, R and SAS.

    [1] Readers of this manual are assumed to have completed some introductory statistics course. For individuals wishing to review statistical concepts, we recommend Introduction to Stats by DeVeaux, Velleman and Bock.

    R

    R is a command-driven programming environment for executing statistical analysis. Unlike all of the other software packages we have discussed, which are proprietary, R is an open-source program that is free and readily available via download from the internet. R is becoming quite popular in quantitative analysis in many fields including statistics, social science research (Psychology, Sociology, Education, etc.), marketing research, business intelligence, etc. R is an implementation of the S programming language, which was originally developed at Bell Labs in the 1970s.

    For product information regarding R, please visit: http://cran.r-project.org/

    Organization of the Manuals

    After a brief review of the most common, and we believe essential, statistical/data analysis concepts that every college-educated person, regardless of discipline, should know, we will then explain how each of these concepts is executed in Excel (2010), SPSS (v.18), Minitab (v.16), SAS (v. 9.2), and R.

    We have taken a software-oriented approach rather than a statistical concept-oriented approach, because it is the software application rather than the statistical concepts that represents the focus of this document. For example, our first concept is descriptive statistics. Rather than explaining descriptive statistics through each package and then moving into the second analysis concept, we focus on all of the concepts in Excel, and then move to a focus on all of the concepts in SPSS, etc. Yes, we understand that from the reader's perspective this may be a bit monotonous. After you finish your Ph.D. in Statistics, you can write your manual your way.

    Throughout each manual, we have used screenshots from the various packages, and have developed easy-to-follow examples using a common dataset.

    At the end of each manual, we have included a section titled "Lagniappe". This word derives from the New World Spanish la ñapa, "the gift". The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a purchase.

    Our lagniappe for our readers includes the extra and interesting things that we have learned to do with each of these software programs that might not be easily found or well known. A little extra information at no extra cost!

    Overview of Dataset

    Throughout these manuals, we will use a common dataset taken from a small manufacturing company: the WidgeOne company.

    The WidgeOne dataset:

    • An Excel file: WidgeOne.xls
    • Both qualitative and quantitative variables
    • 23 variables total
    • Three sheets in one workbook
        o Plant_Survey
        o Employees
        o Attendance
    • 40 observations

    VARIABLE      MEANING                                VARIABLE TYPE    SHEET
    EMPID         Employee ID                            Qualitative      ALL
    PLANT         Plant ID                               Qualitative      Plant_Survey
    GENDER        Gender                                 Qualitative      Plant_Survey
    POSITION      Job Type                               Qualitative      Plant_Survey
    JOBSAT        Job Satisfaction (1-10)                Quantitative     Plant_Survey
    YRONJOB       Years in current job                   Quantitative     Plant_Survey
    JOBGRADE      Job Level (1-10)                       Quantitative     Plant_Survey
    SOCREL        HR Social Relationship Score (0-10)    Quantitative     Plant_Survey
    PRDCTY        HR Productivity Rating (out of 100)    Quantitative     Plant_Survey
    Last Name     Employee Last Name                     Qualitative      Employees
    First Name    Employee First Name                    Qualitative      Employees
    JAN           Attendance in January (%)              Quantitative     Attendance

    Here is a screen shot taken of WidgeOne.xls:

    Data Analysis and Statistical Concepts

    As former practitioners who used statistics on an almost daily basis in our professions in finance, marketing, engineering, manufacturing and medicine, we have developed our TOP 6 list of the most common and most useful applications of Statistics and Data Analysis.

    After a brief explanation of each concept, examples will be provided for how to execute these concepts by hand (with a calculator). We cannot emphasize strongly enough that the calculation of the concepts needs to be mastered and fully understood before they can be effectively outsourced to a software application.

    Types of variables

    There are two distinct types of variables: quantitative and qualitative. Quantitative variables measure how much of something (the quantity) a unit possesses. For example, in the WidgeOne data set, the quantitative variable YRONJOB measures how many years each employee has spent in their current job. Quantitative variables are also known as continuous variables.

    Qualitative variables identify whether an observation belongs to a group. In the WidgeOne data set, Gender is a qualitative variable: it represents whether each employee is classified as male or female. Qualitative variables can certainly have number values, such as 0 for male and 1 for female, but these numbers are still gender groups and absolutely cannot be treated as a quantitative value. If an employee has a 1, it indicates that the employee is female; it does not mean that the employee has more gender than someone with a 0. Qualitative variables are also known as categorical variables.

    There are two types of qualitative variables: nominal and ordinal. As the name implies, the values of nominal variables carry information about the name of the group they belong to, such as gender and plant. A special case of nominal variables is the identifier variable. Identifier variables (you guessed it) serve as a way to identify each observation and carry no other useful information. For purposes of analysis, these are treated as neither quantitative nor qualitative.

    Ordinal variables, also as the name implies, have a natural inherent order and measure how much of something a subject possesses. Ordinal variables would look like "a little, some, a lot" or "small, medium, large". Things start to get a little fuzzy here. An ordinal variable can sometimes be treated as quantitative (measuring the quantity) only if we know how much more each category is than the one preceding it.

    The formula for the calculation of a mean is:

    $$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

    where $X_i$ = every observation in the dataset and $N$ = the number of observations in the dataset.

    We know how everyone LOVES formulas with Greek letters!

    FUN MANUAL CALCULATION!!

    Using the WidgeOne.xls dataset, calculate the mean years that men in the Norcross plant (n=10) have been in their current job (YRONJOB). The answer is on the next page. Don't cheat: do it first to make sure that you understand how to calculate this foundational concept by hand.

    Did you get 9.66? Well done.
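For readers who want to check their hand arithmetic, the mean formula can be sketched in a few lines of Python. This is only an illustration: the tenure values below are hypothetical placeholders, not the actual Norcross records from WidgeOne.xls, so the result will not be 9.66. The function name `mean_by_formula` is also our own invention.

```python
from statistics import fmean


def mean_by_formula(values):
    """Sum every observation X_i and divide by the number of observations N."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean is undefined for an empty dataset")
    return sum(values) / n


# Hypothetical YRONJOB values, for illustration only (not from WidgeOne.xls).
years_on_job = [4.0, 6.5, 8.0, 9.5, 12.0]

result = mean_by_formula(years_on_job)
print(result)                # 8.0
print(fmean(years_on_job))   # 8.0 (the standard library agrees)
```

Working through the sum by hand first (4.0 + 6.5 + 8.0 + 9.5 + 12.0 = 40.0, divided by 5) is still the point of the exercise; the code only confirms the answer.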

    A second measurement of central tendency of a dataset is the median. The median is, literally, the middle of the dataset:

    • It is the central value of an array of numbers sorted in ascending (or descending) order;
    • 50% of the observations lie below the median and 50% of the observations lie above the median;
    • It represents the second quartile (Q2);
    • It is unique.

    As with the mean, the median is used when the data is ratio scale (quantitative). However, unlike the mean, the median can accommodate extreme values.

    FUN MANUAL CALCULATION!!

    Take the men in the Norcross plant (n=10) again, and determine the median years they have spent in their current job.

    The answer is on the next page. Did you cheat last time? You can redeem yourself by doing this one by hand.

    Did you get 9.5? Well done.

    The mean and the median are pretty close: 9.66 and 9.50, respectively. But which one is right? Which one should be reported as the central tendency or the most representative value of the years on the job for the men in the Norcross plant? Mathematically they are both correct, but which one is best?

    The mean is the best measure of central tendency for quantitative variables under these circumstances:

    • The distribution of the variable in question is unimodal.
    • The distribution is also symmetric.

    In fact, both the mean and the median require that the distribution of the variable be unimodal. Otherwise, they are both typically misleading and even incorrect.

    What is unimodal, you ask? When referring to the shape of the distribution (which we are), unimodal means there is only one maximum (only one hump).

    The following graphic is an example of a unimodal distribution (this is a histogram of 100 men's heights):

    [Figure: histogram. Horizontal axis: Height (in inches), 62 to 76; vertical axis: Frequency, 0 to 20.]

    And here is a bimodal (two hump) distribution (this is a histogram of 200 people's heights):

    [Figure: histogram. Horizontal axis: Height (in inches), 52 to 76; vertical axis: Frequency, 0 to 40.]

    The mean and median height for both of these groups is around 63 inches. You can see that this is an accurate measure of central tendency for the population in the first graphic, but it is certainly misleading for the population in the second graphic, where there are actually two locations of central tendency. This is why the mean and the median are only appropriate for unimodal distributions!

    For the mean to be an appropriate measure of central tendency, the data has to be symmetric as well as unimodal. The data has a symmetric distribution when the first half of the distribution is a mirror image of the second half. The unimodal graphic of the men's heights is (roughly) symmetric:

    [Figure: histogram. Horizontal axis: Height (in inches), 62 to 76; vertical axis: Frequency, 0 to 20.]

    If a distribution is not symmetric, then it is referred to as skewed. Data can be right or left skewed.

    Here is an example of left skewed data:

    [Figure: histogram. Horizontal axis: Generic Variable, 37.5 to 82.5; vertical axis: Frequency, 0 to 16.]

    When the data is symmetric, the mean and the median should be pretty close, in which case you would use the mean as the measure of central tendency. If the median and mean are not close, there is evidence that the distribution is skewed. Consider the men in Norcross again. What if employee 082 had 30 years with the company instead of 14 years? How would the mean and median be affected? The mean would increase to 11.26 while the median would remain the same at 9.50 (do this by hand to convince yourself of this concept). Go back and look at the formula for the mean and think about why the mean was so heavily affected, while the median was not. A boxplot will provide further evidence of symmetry (more on boxplots later).
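The effect of a single extreme value can also be seen in a small sketch. The ten tenures below are hypothetical (they are not the actual Norcross records), but they mimic the situation described: swapping one 14-year tenure for 30 years adds 16 to the sum, which moves the mean by 16/10 = 1.6 years, while the two middle values, and therefore the median, do not change.

```python
from statistics import mean, median

# Hypothetical tenures (years) for ten employees; illustrative values only,
# not the actual Norcross records from WidgeOne.xls.
tenures = [5, 6, 7, 8, 9, 10, 11, 12, 14, 14]

# Replace one 14-year tenure with an extreme value of 30 years.
with_extreme = tenures[:-1] + [30]

mean_before, mean_after = mean(tenures), mean(with_extreme)
median_before, median_after = median(tenures), median(with_extreme)

print(mean_before, mean_after)      # 9.6 11.2
print(median_before, median_after)  # 9.5 9.5
```

Every observation enters the sum in the mean formula, so one large value pulls the mean toward it. The median only depends on the values in the middle of the sorted list.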

    Steps in Identifying the Best Measure of Central Tendency

    o Ensure that the variable is indeed quantitative (i.e., can be measured with continuous numbers).
    o Generate and inspect a histogram of the variable and identify its modality (is it unimodal?). Inspect the histogram for approximate symmetry and possible outliers.
    o Generate and inspect a boxplot. Discuss further evidence of approximate symmetry and the existence of possible outliers.
    o Compare and contrast the mean and median as a final piece of evidence of symmetry (or non-symmetry).

    Your Final Decision

    o When data are unimodal and symmetric, the mean is the best measure of central tendency.
    o When data are unimodal and non-symmetric (skewed), the median is the best measure of central tendency.
    o When data are non-unimodal, one should use neither the mean nor the median, but instead present a qualitative description of the shape and modality of the distribution.

    A third measurement of central tendency is the mode. The mode is the most frequently occurring value in a dataset:

    • There can be multiple modes;
    • It is not influenced by extreme observations;
    • It can be used with both qualitative and quantitative data.

    Go back to the WidgeOne.xls dataset and the men in the Norcross plant. What is the mode for their years on the job? Did you get 14 years? Great! This is a measurement of central tendency. But 14 years is different (a lot different) from 9.66 and 9.50 years. Is it correct?

    Technically yes, this would be mathematically correct, but not the most appropriate measurement to report as the central tendency of the dataset. Typically, the mode is considered to be the weakest of the three measurements of central tendency for quantitative data and is ONLY used if the mean or median is not available. When would that be?

    Calculate the mean and median gender of the dataset. Go ahead. We will wait.

    It can't be done. When the data in question is qualitative (e.g., gender, plant, position), the ONLY measurement of central tendency that is available is the mode.
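As a rough sketch of finding the mode of a qualitative variable, the snippet below rebuilds the PLANT and GENDER columns from the counts reported in the contingency table later in this manual (23 Dallas and 17 Norcross employees; 20 females and 20 males). The lists are reconstructed from those totals, not read from WidgeOne.xls.

```python
from statistics import multimode

# Plant membership rebuilt from the reported counts: 23 Dallas, 17 Norcross.
plants = ["Dallas"] * 23 + ["Norcross"] * 17
plant_modes = multimode(plants)
print(plant_modes)   # ['Dallas']

# Gender is split 20/20 in the same table, so it has two modes.
genders = ["Female"] * 20 + ["Male"] * 20
gender_modes = multimode(genders)
print(gender_modes)  # ['Female', 'Male']
```

The second result illustrates the point made above that there can be multiple modes.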

    The standard deviation tells us, roughly, how far a typical observation lies from the mean, in the units of the variable. If this number is large, the data is very spread out (i.e., the observations are different). If this number is small, the data is very compact (i.e., the observations are very similar).

    FUN MANUAL CALCULATION!!

    Refer back to the WidgeOne.xls dataset. Calculate the standard deviation of the number of years on the job for the men in Norcross (n=10). Remember that the mean was 9.66 years.

    The answer is on the next page. Don't cheat: do it first to make sure that you understand how to calculate this foundational concept by hand.

    Did you get 3.30? Well done.

    What does this number MEAN? 3.30 what? It means that the standard deviation of the dataset is 3.30 years. The average deviation (in either direction) of each individual's tenure is 3.30 years from the mean of 9.66. Relative to the mean, we would consider this data to be fairly compact, meaning that the data is not very spread out (this will be seen more clearly in the next section when a graphical representation is created).

    You may recall from your earlier Statistics course(s) a second statistical calculation that provides a second measurement of dispersion: the variance. The variance is simply the square of the standard deviation. Although variance is an important concept to statisticians, it is not typically used by practitioners. This is because variance is not very user friendly in terms of interpretation. In the case of the men in Norcross, the variance would be reported as 10.88 years squared.
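The relationship between the two measures can be checked with a short sketch. Two assumptions are worth stating loudly: the tenures are hypothetical (not the Norcross records), and the sample versions of the formulas (dividing by n - 1) are used, since the page of this manual that states the formula is not reproduced here.

```python
from math import isclose
from statistics import stdev, variance

# Hypothetical tenures in years (illustrative only, not WidgeOne.xls records).
tenures = [5, 6, 7, 8, 9, 10, 11, 12, 14, 14]

sd = stdev(tenures)      # sample standard deviation (divides by n - 1), in years
var = variance(tenures)  # sample variance, in years squared

# The variance is the square of the standard deviation.
print(isclose(sd ** 2, var))         # True
print(round(sd, 2), round(var, 2))   # 3.17 10.04
```

Note the units: the standard deviation is in years, which is easy to interpret, while the variance is in years squared, which is the interpretation problem described above.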

    There is another application of the term "variance" that has a more generic meaning and is heavily used by practitioners. It is the difference, either in absolute numbers or percentages, of each observation from some base value.

    For example, it is common for individuals to refer to a "budget variance", where this number would be the actual number minus the budgeted number:

    Project #    Budget Hours    Actual Hours    Variance    Variance %
    123          150             175             +25         +17%

    Remember, when calculating the variance percentage in this context, you take the difference (175 - 150) divided by the budgeted number (150), not the actual number (many professionals make this mistake once).
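The budget variance calculation for project 123 can be sketched as follows. The function name `budget_variance` is our own, introduced only for illustration; the numbers come from the table above.

```python
def budget_variance(budget, actual):
    """Return (variance, variance as a fraction of the BUDGETED amount)."""
    if budget == 0:
        raise ValueError("variance % is undefined when the budget is zero")
    difference = actual - budget
    return difference, difference / budget


# Project 123: 150 budgeted hours, 175 actual hours.
diff, pct = budget_variance(150, 175)
print(f"{diff:+d} hours, {pct:+.0%}")  # +25 hours, +17%
```

Dividing by the actual hours instead (25 / 175) would give about 14%, which is the common mistake the text warns about.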

    Another method of representing the dispersion of a dataset is to provide the frequency counts for observations across specified ranges.

    FUN MANUAL CALCULATION!!

    Using the WidgeOne.xls dataset, determine the number of individuals with job tenure (YRONJOB) in the following categories:

    • Less than 5 years
    • 5-10 years
    • More than 10 years

    Here is how your answer should appear:

    Category              Frequency    Relative Frequency    Cumulative Frequency
    Less than 5 years     9            22.50%                22.50%
    5-10 years            16           40.00%                62.50%
    More than 10 years    15           37.50%                100.00%
    Total                 40           100.00%

    It is important to note that the categories are mutually exclusive (no observation can occur in two categories simultaneously) and collectively exhaustive (every observation is accommodated).
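A minimal sketch of how the relative and cumulative columns follow from the raw counts is given below. The counts (9, 16, 15) are the ones reported in the frequency table; the helper `tenure_category` is a hypothetical name, and its boundary handling (exactly 10 years counts as "5-10 years") is our reading of the category labels.

```python
def tenure_category(years):
    """Assign a YRONJOB value to exactly one of the three tenure categories."""
    if years < 5:
        return "Less than 5 years"
    if years <= 10:
        return "5-10 years"
    return "More than 10 years"


# Frequency counts reported in the table (40 employees in total).
counts = {
    "Less than 5 years": 9,
    "5-10 years": 16,
    "More than 10 years": 15,
}

total = sum(counts.values())
rows = []
cumulative = 0
for category, count in counts.items():
    cumulative += count
    rows.append((category, count, count / total, cumulative / total))

for category, count, relative, cum in rows:
    print(f"{category:<20}{count:>4}{relative:>10.2%}{cum:>10.2%}")
```

Because every value of `years` reaches exactly one `return`, the categories are mutually exclusive and collectively exhaustive by construction.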

    This representation of the dispersion of the data is referred to as a frequency table and is the most common and one of the most useful representations of data.

    In this instance, we converted a quantitative variable into a qualitative variable for the purposes of developing a frequency table. We do this frequently to take a different kind of look at a quantitative variable.

    If we had a qualitative variable that we wanted to better understand, we would generate the appropriate measurement of central tendency (Mode) and the measurement of dispersion (frequencies) through the application of a frequency table.

    What you need to know: Measurements of dispersion provide information regarding how spread-out or compact the data is. Typically this is communicated through the computation of the standard deviation AND some display of the frequency counts of the observations across specified categories. If the data is qualitative, the only measurement of dispersion comes from the frequency table.

    Concept 3: Visualization of Univariate Data

    Typically, data analysis includes BOTH the computational analysis and some visual representation of the analysis. Many recipients of your work will never look at your actual calculations, only your tables and graphs (remember the reference above to the "great statistical unwashed"?). As a result, visual representation of your analysis should receive the same amount of attention and dedication as your computational analysis.

    Edward Tufte has published several books and articles on the topic of the visualization of data. We recommend his seminal work The Visual Display of Quantitative Information as an excellent reference on the topic. See https://www.edwardtufte.com/ .

    When developing a visual representation of a single variable, the most common tools include Histograms, Pie Charts, Bar Charts, Box Plots and Stem and Leaf Plots. Each of these will be discussed briefly in turn.

    Histograms: Histograms visually communicate the shape, central tendency and dispersion of the dataset. For this reason, Histograms are heavily used in conjunction with the measurements of central tendency and the measurements of dispersion to describe a particular variable (like we did while discussing central tendency). Histograms are used with QUANTITATIVE DATA. For all of the packages that we will discuss below, you can simply reference the quantitative variable directly and a Histogram will be generated.

    The following histogram was generated using Minitab:

    [Figure: "Histogram of Widge One Employee Job Tenure". Horizontal axis: Years on Job, 0 to 18; vertical axis: Frequency, 0 to 6.]

    Note in this graphic that the left axis represents the actual frequency counts and the horizontal axis represents the job tenure of the employees. From this graphic, it is easy to see that the data is (roughly) normally distributed with a mean, median and mode somewhere around 9 years.

    Pie Charts: Pie charts can be useful for displaying the relative frequency of observations by category, if used properly. They can be used to visualize ordinal data, but bar charts are more appropriate to show the inherent order.

    Consider these two guidelines:

    o Use 5 or fewer slices; if more than 5 slices are needed, use a table;
    o Order the relative frequencies in ascending (or descending) order.

    Using the same Job Tenure data, the associated pie chart, generated using Minitab, would look like this:

    [Figure: pie chart, "Job Tenure of Widge One Employees". Categories: 5 to 10 Years (40.0%), Less than 5 Years (22.5%), More than 10 Years (37.5%).]

    It should probably be noted at this point that approximately 8% of all men and 0.5% of all women are colorblind. Although colorblindness comes in many different forms, the most common forms involve the colors red, green, yellow and brown. Individuals who are colorblind cannot reliably tell these colors apart. Therefore, when constructing pie charts or any other type of colored visual representation of your analysis, avoid placing these colors adjacent to each other.

    Bar Charts: Bar Charts ARE NOT Histograms! Bar Charts are intended to represent the frequency counts of QUALITATIVE data. The plant information from WidgeOne.xls would look like this:

    [Figure: "Bar Chart of Plant Employees". Horizontal axis: Plant (Norcross, Dallas); vertical axis: Count, 0 to 25.]

    This bar chart was developed using Minitab.

    Bar Charts and Pie Charts are the primary tools used to display qualitative data, but keep in mind that, for ordinal data, bar charts are more appropriate than pie charts. Bar charts are able to illustrate the natural order of the data whereas a pie chart cannot. When using bar charts as a visual of ordinal data, be sure to display the correct order of the data. Remember, when constructing graphical displays of qualitative data, most software packages will order the values in alphabetical order, not the natural order. Oftentimes you will have to go in and change it (don't worry, we will show you how).

    Stem and Leaf Plots: Stem and leaf plots, like histograms, provide a visual representation of the shape of the data and the central tendency of the dataset. Here is the stem and leaf plot for the Job Tenure variable:

     2   0  01
     7   0  22233
    12   0  44555
    16   0  6777
    (8)  0  88888999
    16   1  0000111
     9   1  2333
     5   1  4445
     1   1  7

    When reading this stem and leaf plot, the first column is a running count of the observations from the nearest end of the data (the count in parentheses marks the row that contains the median), the second column is the stem, and the digits to the right are the leaves. For example, the last row has a stem of 1 and a single leaf of 7. This means that there is one observation with 17 years on the job. The 1 in the first column of that row indicates that only one employee has 17.x years on the job.
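To make the stem/leaf idea concrete, here is a rough sketch that rebuilds a simplified stem and leaf display. The whole-year tenures are our reading of the Job Tenure plot in this section (tens digit as the stem, units digit as the leaf), not values taken from WidgeOne.xls directly. Unlike the plot shown in the manual, this sketch prints one row per stem rather than splitting each stem into several rows, and it shows a plain row count rather than a running count. The function name `stem_and_leaf` is our own.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem); list units digits (leaves)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(int(value), 10)
        rows[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in rows.items()}


# Whole-year tenures as read from the Job Tenure plot (40 employees).
years = ([0, 1] + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 3 + [6] + [7] * 3
         + [8] * 5 + [9] * 3 + [10] * 4 + [11] * 3 + [12] + [13] * 3
         + [14] * 3 + [15] + [17])

plot = stem_and_leaf(years)
for stem, leaves in plot.items():
    # Row count, then stem, then leaves.
    print(f"{len(leaves):>2}  {stem} | {leaves}")
```

Turned on its side, each row of leaves is the bar of a histogram, which is why the two displays convey the same shape.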

    Boxplots: The last tool described in this manual for visualizing univariate data is the boxplot. The boxplot builds on the information displayed in a stem-and-leaf plot and focuses particular attention on the symmetry of the distribution and incorporates numerical measures of central tendency and location.

    Prior to creating a boxplot, you need to be familiar with the concept of quartiles. The boxplot incorporates the median, the mean and the four quartiles of a variable. The quartiles of a dataset are the points below which 25%, 50% (the same as the median), 75% and 100% (the max value) of the data lie. Quartiles are typically written as Q1, Q2, Q3, Q4, respectively. The data that lies between Q1 and Q3 is referred to as the Interquartile Range or IQR. This is the center 50% of the dataset.
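The quartiles can be sketched with Python's standard library. The data below are hypothetical, and the `method="inclusive"` choice is an assumption on our part: statistical packages use slightly different interpolation rules for quartiles, so Excel, SPSS, Minitab, SAS and R may not all report identical values for the same data.

```python
from statistics import quantiles

# Hypothetical, already-sorted values (illustrative only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
q4 = max(data)   # this manual refers to the maximum value as Q4
iqr = q3 - q1    # width of the box that holds the middle 50% of the data

print(q1, q2, q3, q4, iqr)  # 3.0 5.0 7.0 9 4.0
```

Here the IQR is reported as a single number (the distance from Q1 to Q3); the observations that fall inside that span are the center 50% of the dataset.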

    Below is the boxplot for the Job Tenure variable from WidgeOne.xls.

    [Figure: "Boxplot of Job Tenure". Vertical axis: Years on Job, 0 to 18; the IQR box and the median line are labeled.]

    From this boxplot, you can see that Q1 begins at 5, Q2 (also the median) begins at 8 (the actual median of the dataset is 8.35), Q3 begins at 11 and the highest value of the dataset is 17.0. Notice that the distance from the median line to the top of the IQR box is roughly the same as the distance from the median line to the bottom of the IQR box. From this, we would conclude that this dataset is relatively symmetric.

    As previously mentioned while discussing central tendency, box plots are an excellent tool to examine the symmetry of the data and identify potential outliers.

    The following graphic is a box plot of data with a right-skewed distribution:

    [Figure: boxplot. Vertical axis: Generic Variable, 20 to 70.]

    You can tell that the distribution is right skewed because the inner-box distances from the median line are not equal and the upper vertical line is longer than the lower.

    The following graphic is a boxplot of a left-skewed distribution:

    [Figure: boxplot. Vertical axis: Generic Variable, 40 to 80.]

    The opposite is true for the boxplot above. We can see that the distribution of the generic variable is left-skewed.

    What you need to know: Many individuals who are analytically very strong often place insufficient emphasis on graphics and visual representations of data. Many individuals who are not strong analytically, but need analysis to support their decision-making, often place an overemphasis on graphics and visualization. Individuals who can execute both well will go far. Histograms, Stem and Leaf Plots and Boxplots are used with QUANTITATIVE DATA. Bar Charts, Pie Charts and Column Charts are used with QUALITATIVE DATA.

    Concept 4: Organization/Visualization of Multivariate Data

    Frequently, we need to understand and report the relationships between and among variables within a dataset. When developing visual representations of multiple variables, the most common tools include Contingency Tables (qualitative and quantitative data), Stacked Bar Charts (qualitative data), 100% Stacked Bar Charts (qualitative data), and Scatter Plots (quantitative data). Each of these will be discussed briefly in order.

    Contingency Tables: One of the most common and useful methods of displaying the relationships between two or more variables is the contingency table. This table is highly versatile and easily constructed. As an example, let's take the GENDER and PLANT variables from the WidgeOne.xls dataset. A contingency table of these two variables would look like this:

    Counts of Employees by Gender and Plant

    Gender    Dallas    Norcross    Total
    Female        13           7       20
    Male          10          10       20
    Total         23          17       40

    This table displays the number of females and males at each plant.
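    To show how such a table is tallied, here is a minimal sketch in Python. The employee records are rebuilt from the counts reported above, since the actual rows of WidgeOne.xls are not reproduced in this manual; the variable names are our own.

```python
from collections import Counter

# One (gender, plant) pair per employee, rebuilt from the reported counts.
records = (
    [("Female", "Dallas")] * 13
    + [("Female", "Norcross")] * 7
    + [("Male", "Dallas")] * 10
    + [("Male", "Norcross")] * 10
)

cell_counts = Counter(records)                           # count per (gender, plant) cell
row_totals = Counter(gender for gender, _ in records)    # count per gender
column_totals = Counter(plant for _, plant in records)   # count per plant

genders = ["Female", "Male"]
plants = ["Dallas", "Norcross"]

# Print the table with row and column totals.
print(f"{'Gender':<8}" + "".join(f"{p:>10}" for p in plants) + f"{'Total':>10}")
for g in genders:
    cells = "".join(f"{cell_counts[(g, p)]:>10}" for p in plants)
    print(f"{g:<8}{cells}{row_totals[g]:>10}")
totals = "".join(f"{column_totals[p]:>10}" for p in plants)
print(f"{'Total':<8}{totals}{len(records):>10}")
```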


    We could also display this table as percentages rather than as frequencies. In the following contingency table, the percentages are given as a percentage of each gender (row percentages). Specifically, the interpretation of the first cell would be: "Of all of the female employees, 65% work in Dallas."

    WidgeOne Employees by Gender and Plant (row percentages)

    Gender         Dallas    Norcross      Total
    Female         65.00%      35.00%    100.00%
    Male           50.00%      50.00%    100.00%
    Grand Total    57.50%      42.50%    100.00%

    The percentages could easily be reversed to represent the percentage of individuals at each plant (column percentages):

    WidgeOne Employees by Gender and Plant (column percentages)

    Gender     Dallas    Norcross      Total
    Female     56.52%      41.18%     50.00%
    Male       43.48%      58.82%     50.00%
    Total     100.00%     100.00%    100.00%

    In this version of the table, the first cell now communicates: "Of all of the Dallas employees, 56.52% are female."


    Finally, we can also represent the data as overall percentages:

    WidgeOne Employees by Gender and Plant (overall percentages)

    Gender    Dallas    Norcross      Total
    Female    32.50%      17.50%     50.00%
    Male      25.00%      25.00%     50.00%
    Total     57.50%      42.50%    100.00%

    In this version of the table, the first cell now communicates: "Of all employees, 32.50% are females in Dallas."

    Before moving on, please ensure that you fully understand the differences across these three tables. They are subtle, but important.
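    The only thing that changes across the three percentage tables is the denominator. The sketch below, which uses the counts reported earlier and a helper function of our own naming, recomputes the first cell under each choice.

```python
counts = {
    ("Female", "Dallas"): 13,
    ("Female", "Norcross"): 7,
    ("Male", "Dallas"): 10,
    ("Male", "Norcross"): 10,
}


def percentages(cell_counts, basis):
    """Convert cell counts to percentages.

    basis="row"     -> percent of each gender (each row sums to 100)
    basis="column"  -> percent of each plant (each column sums to 100)
    basis="overall" -> percent of all employees (all cells sum to 100)
    """
    grand_total = sum(cell_counts.values())
    result = {}
    for (gender, plant), count in cell_counts.items():
        if basis == "row":
            denominator = sum(c for (g, _), c in cell_counts.items() if g == gender)
        elif basis == "column":
            denominator = sum(c for (_, p), c in cell_counts.items() if p == plant)
        elif basis == "overall":
            denominator = grand_total
        else:
            raise ValueError(f"unknown basis: {basis}")
        result[(gender, plant)] = round(100 * count / denominator, 2)
    return result


first_cell = ("Female", "Dallas")
for basis in ("row", "column", "overall"):
    print(basis, percentages(counts, basis)[first_cell])
```

    The same count of 13 women in Dallas is divided by 20 (all women), by 23 (all Dallas employees), or by 40 (all employees), which is why the three tables answer three different questions.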


    Both gender and plant are categorical variables. We could incorporate a quantitative variable into this table, such as job tenure:

    Mean Job Tenure (years) of Employees by Gender and Plant

    Gender         Dallas    Norcross    Total
    Female           8.85        6.94     8.19
    Male             7.13        9.66     8.40
    Grand Total      8.10        8.54     8.29

    This table now provides information about the average job tenure for each gender and each plant, and for each gender at each plant. For example, the first cell now communicates: "The females in Dallas have an average job tenure of 8.85 years."
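    The totals in this table are not simple averages of the cells beside them; each is a mean over all of the employees in that row or column. As a rough check, the sketch below weights each reported cell mean by the corresponding cell count, assuming the means were computed on the same 40 employees as the counts table. Because the cell means are themselves rounded, the results agree with the table only to within rounding.

```python
# Cell counts and cell means copied from the tables in this section.
counts = {
    ("Female", "Dallas"): 13,
    ("Female", "Norcross"): 7,
    ("Male", "Dallas"): 10,
    ("Male", "Norcross"): 10,
}
mean_tenure = {
    ("Female", "Dallas"): 8.85,
    ("Female", "Norcross"): 6.94,
    ("Male", "Dallas"): 7.13,
    ("Male", "Norcross"): 9.66,
}


def weighted_mean(cells):
    """Mean tenure across the given cells, weighting each cell by its head count."""
    total_years = sum(counts[cell] * mean_tenure[cell] for cell in cells)
    total_people = sum(counts[cell] for cell in cells)
    return total_years / total_people


female = weighted_mean([cell for cell in counts if cell[0] == "Female"])
dallas = weighted_mean([cell for cell in counts if cell[1] == "Dallas"])
overall = weighted_mean(list(counts))

print(f"Female: {female:.2f}  Dallas: {dallas:.2f}  Overall: {overall:.2f}")
```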

    These contingency tables were created using MS Excel.


    Stacked Bar Charts: Stacked bars are a convenient way to display percentages or proportions, such as might be done in a pie chart, for multiple variables. For example, the proportion of each gender at each plant would be displayed like this in a stacked bar chart:

    This graphic is fine. However, when the group sizes differ, particularly by a lot, stacked bar charts are less informative. It is difficult to understand how the groups compare. For example, the difference in the number of Dallas and Norcross employees is not dramatic, but even here it is difficult to discern which plant has a greater proportion of men.

    [Figure: Bar Chart of Gender by Plant. Stacked bars for Plant (Dallas, Norcross); stacks: Gender (Female, Male); y-axis: Count, 0 to 25.]


    100% Stacked Bar Charts: To solve this problem, we can apply a 100% stacked bar chart. This visualization tool simply rescales the populations of interest, like the two plants, so that each is evaluated out of a total of 100%. You can almost think of 100% Stacked Bar Charts as side-by-side pie charts.

    Compare this graphic to the first Stacked Bar Graph. They are different. They communicate subtly different messages.

    [Figure: 100% Bar Chart of Gender by Plant. Stacked bars for Plant (Dallas, Norcross); stacks: Gender (Female, Male); y-axis: Percent, 0 to 100. Percent within levels of Plant.]


    Scatter Plots: What if we wanted to better understand whether there is a meaningful relationship between two quantitative variables, such as the possible relationship between job tenure and productivity?

    This question can be addressed using a scatter plot, where one quantitative variable is plotted on the y-axis and the second quantitative variable is plotted on the x-axis:

    [Figure: scatter plot titled "Is Job Tenure Related to Productivity?"; x-axis: Job Tenure, 0 to 20; y-axis: Productivity, 70.00 to 100.00.]

    If two variables are related, we would expect to see some pattern within the scatter plot, such as a line. If job tenure and productivity were positively related, then we would expect the points to cluster around an upward-sloping line moving from the SW corner to the NE corner. This would indicate that as job tenure goes up, productivity goes up. If job tenure and productivity were negatively related, then we would expect the points to cluster around a downward-sloping line moving from the NW corner to the SE corner. This would indicate that as job tenure goes up, productivity goes down.

    In this scatter plot, neither of these linear patterns (or any other pattern) is reflected. This cloud is referred to as a Null Plot. As a result, we would conclude that job tenure and productivity are not related.

    We can derive additional information from this scatter plot. Specifically, we can determine the best fit line in the form y = mx + b. This is the linear equation that minimizes the sum of the squared distances between the predicted values and the actual values, where y = the predicted value of an employee's productivity and x = the actual number of years of an employee's job tenure: y = -0.5715x + 89.318. This equation generates an R² value of 0.1124, where this value represents the proportion of the variance of the dependent variable (productivity) that can be explained by the independent variable (job tenure); here, about 11%.
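    For readers curious about where m, b and R² come from, the following is a minimal sketch of an ordinary least-squares fit. The tenure and productivity values are hypothetical, because the underlying data points are not listed in this manual, so the printed equation will not match the one above. The function names are our own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    m = sxy / sxx                      # slope
    b = mean_y - m * mean_x            # intercept
    residual_ss = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - residual_ss / syy  # share of variance in y explained by x
    return m, b, r_squared


# Hypothetical job tenure (years) and productivity values, for illustration only.
tenure = [1, 3, 5, 8, 12, 15, 18]
productivity = [90, 86, 91, 84, 88, 80, 83]

m, b, r2 = fit_line(tenure, productivity)
print(f"y = {m:.4f}x + {b:.3f}, R^2 = {r2:.4f}")


def predicted_productivity(years):
    """Prediction from the equation reported in the text: y = -0.5715x + 89.318."""
    return -0.5715 * years + 89.318


print(f"Predicted productivity at 10 years: {predicted_productivity(10):.3f}")
```

    With an R² as low as the 0.1124 reported above, predictions from the fitted line should be read with caution, which is consistent with the Null Plot interpretation.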

    Detailed explanations of these concepts are outside of the scope of this document, but they are heavily used in Statistics and form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.

    What you need to know: Stacked Bar Charts are used to display the counts within groupings of qualitative variables. When those groupings are of different sizes, a 100% Stacked Bar Chart is preferred. You can think of 100% Stacked Bar Charts as side-by-side Pie Charts. Scatter plots are used to communicate whether a relationship exists between two quantitative variables.


    A side-by-side box plot also has the same requirements: the box plots should be built on a quantitative variable and grouped by a qualitative variable. Let's use the same two variables again, YRONJOB and Plant:

    Nice! Now we can see that the Dallas plant employees have a larger range of job tenures, and that the median job tenure at the Norcross plant is larger than the median job tenure at the Dallas plant.

    Both the side-by-side histogram and boxplot were generated using Minitab.

    [Figure: Boxplots of Widge One Employee Job Tenure by Plant. Panel variable: Plant (Dallas, Norcross); y-axis: Years on Job, 0 to 18.]


    Concept 5: Random Number Generation and Simple Random Sampling

    The statistical concepts covered up to this point would really fall under the heading of Data Analysis or Basic Descriptive Statistics. These concepts enable us to describe or represent a given dataset to other people and are employed once the data have been gathered. They represent a critical, albeit simple, set of analytical tools. Now let's take a step back: what if the data NEEDS to be gathered?

    Entire disciplines exist in the areas of experimental design and sampling. Although the scope of this document does not include an examination of these areas, we will address a foundational concept of these areas: random number generation to support simple random sampling using statistical software. Humans are woefully deficient in our ability to generate truly random numbers. In fact, human random number generation is so NOT random that computer programs have been written that accurately predict the random numbers that humans will select.

    Randomly generated numbers can be forced to follow a particular probability distribution and/or fall between an established minimum and maximum value. We will be generating numbers which follow a uniform distribution, where every number has the same probability of occurrence. This is the most common execution of random number generation. It should be noted that random numbers could follow any probability distribution (e.g., normal, binomial, Poisson, etc.).

    One of the primary rationales for generating a string of random numbers is to select a sample of observations for analysis. Often, researchers do not have the time, the access, or the money to analyze every element in a dataset. Assigning a random number to every element in a dataset and then selecting, for example, the first 50 elements when sorted based upon the random number, is a statistically valid method of sampling. When a uniform distribution is used to generate these random numbers, this process is referred to as simple random sampling, where every element has a 1/n probability of selection. Simple random sampling using random number generation is a very common execution used by analysts to select a subset of a population of elements for analysis.
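    The procedure just described (assign a uniform random number to every element, sort, keep the first 50) can be sketched in a few lines of Python. The population of 500 employee IDs, the seed value and the function name are assumptions made for illustration.

```python
import random


def simple_random_sample(elements, sample_size, seed=None):
    """Assign a uniform random number to each element, sort by it, keep the first few."""
    rng = random.Random(seed)                                   # seeded for repeatability
    keyed = [(rng.random(), element) for element in elements]   # uniform on [0, 1)
    keyed.sort(key=lambda pair: pair[0])                        # sort by the random number
    return [element for _, element in keyed[:sample_size]]


# Hypothetical population of 500 employee IDs.
population = list(range(1, 501))

sample = simple_random_sample(population, 50, seed=2018)
print("Sample size:", len(sample))
print("First five selected IDs:", sample[:5])
```

    Python's standard library also offers `random.sample(population, 50)`, which draws a simple random sample directly; the longer version above is shown only because it mirrors the sort-by-random-number steps in the text.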


    Concept 6: Confidence Intervals

    As stated previously, Concepts 1-4 fall under the heading of Descriptive Statistics, where the analyst has access to the entire dataset and is simply providing a description or visual representation of the central tendency or the dispersion of the dataset. Concept 5, Random Number Generation, is an important tool that analysts use to subset a dataset or assign elements for survey or additional analysis. When a sample is analyzed for the purposes of better understanding a population, the process is referred to as Inferential Statistics². Here is a brief comparison of Descriptive Statistics and Inferential Statistics:

                   Descriptive Statistics              Inferential Statistics
    Dataset        Population (entire dataset)         Sample from a Population
    Accuracy       100% accurate (assuming             Some Margin of Error will be
                   calculations were done correctly)   expected
    Confidence     100%                                Typically 90%, 95% or 99%
    Example        Measurements of Central Tendency    Confidence Intervals around a
                                                       Population parameter
    Preference?    ALWAYS Preferred!                   Never preferred, but is accepted as
                                                       a trade-off for cost and/or time.

    2 Inferential statistics is based on the Central Limit Theorem. Readers are assumed to have a working knowledge of this theorem. For a refresher on the Central Limit Theorem, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.


    Concept 6, Confidence Intervals, is therefore different from the first four concepts reviewed in this manual, because we are moving from descriptive statistics to inferential statistics.

    Simply stated, a confidence interval is an estimate of some unknown population parameter (usually the mean), based on sample statistics, where the acceptable margin of error and/or confidence level is pre-established.

    The formula used to estimate a two-sided confidence interval for a population mean is

    $\bar{X} \pm Z \cdot \frac{s_X}{\sqrt{n}}$, where

    $\bar{X}$ = the sample mean;

    $Z$ = the number of standard deviations, using the sampling distribution and the Central Limit Theorem, associated with the established confidence level:

    90% confidence = 1.645

    95% confidence = 1.96

    99% confidence = 2.575

    $s_X$ = the sample standard deviation;

    $n$ = the number of elements in the sample.
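    As a worked illustration of this formula, the sketch below computes a 95% confidence interval. The sample mean of 8.29 years and the sample size of 40 come from the job tenure table in Concept 4; the sample standard deviation of 4.0 years is a hypothetical value chosen only for the example, and the function name is our own.

```python
import math

# Z values as listed in this manual.
Z_VALUES = {90: 1.645, 95: 1.96, 99: 2.575}


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=95):
    """Two-sided confidence interval for a population mean: X-bar +/- Z * s / sqrt(n)."""
    margin_of_error = Z_VALUES[confidence] * sample_sd / math.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error


# The standard deviation of 4.0 is hypothetical.
low, high = mean_confidence_interval(sample_mean=8.29, sample_sd=4.0, n=40, confidence=95)
print(f"95% CI for mean job tenure: {low:.2f} to {high:.2f} years")
```

    Raising the confidence level to 99% widens the interval, and increasing n narrows it, because n appears under the square root in the denominator.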


    The formula used to estimate a two-sided confidence interval for a population proportion is

    $p \pm Z \cdot \sqrt{\frac{pq}{n}}$, where

    $p$ = the sample proportion;

    $q$ = 1 - p;

    $Z$ = same as above;

    $n$ = same as above.

    In both formulas, the expression after the ± sign is referred to as the Margin of Error.
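    The proportion formula can be sketched the same way. For illustration only, we treat the 40 employees in the contingency tables as a sample and estimate the proportion who work in Dallas (23 of 40); the function name is our own.

```python
import math

# Z values as listed in this manual.
Z_VALUES = {90: 1.645, 95: 1.96, 99: 2.575}


def proportion_confidence_interval(p, n, confidence=95):
    """Two-sided confidence interval for a population proportion: p +/- Z * sqrt(p*q/n)."""
    q = 1 - p
    margin_of_error = Z_VALUES[confidence] * math.sqrt(p * q / n)
    return p - margin_of_error, p + margin_of_error


p = 23 / 40  # sample proportion of employees in Dallas (57.5%)
low, high = proportion_confidence_interval(p, n=40, confidence=95)
print(f"95% CI for proportion in Dallas: {low:.1%} to {high:.1%}")
```

    With only 40 observations the margin of error is wide (roughly 15 percentage points), which illustrates the trade-off between sample size and precision.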

    FUN MANUAL CALCULATION!!

    Let's assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of employees in Norcross, GA and Dallas, TX. Let's also assume that the HR department at WidgeOne has been charged with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire


    Explanatory and response variables

    The main objective of multivariate analysis is to assess the relationship between two or more variables. A common type of relationship that we examine in statistics is the cause-effect relationship. The variables play two different roles in this relationship: the explanatory role and the response role. The response variable is the outcome of interest that is being researched. The explanatory variable is hypothesized to explain or influence the response variable. For example, research studies investigating lung cancer often specify survival status (whether an individual is alive after 20 years) as the response variable and smoking status (whether an individual used smoking tobacco and, if so, what amount) as the explanatory variable.

    There are specific locations that are traditionally designated for the explanatory and response variables in the analysis methods we've discussed. The following table summarizes the proper locations of these variables for each of these analyses.

    Method of Analysis                 Location of Explanatory Variable    Location of Response Variable
    Stratified Analysis                1 or more columns                   Rows
    Stratified Confidence Intervals    1 or more columns                   Rows
    Contingency Table                  Rows                                Columns
    Grouped Histogram                  Different Panels                    X-axis
    Stacked Bar Charts                 Bars                                Stacks
    Side-by-Side Boxplots              X-axis                              Y-axis
    Scatterplot                        X-axis                              Y-axis