Statistic a 1

Jun 03, 2018

Ancuta Caliment
  • 8/11/2019 Statistic a 1

    The term inference refers to a key concept in statistics in which we draw a conclusion from available evidence. The purpose of descriptive statistics is to summarize or display data so we can quickly obtain an overview. Inferential statistics allows us to make claims or conclusions about a population based on a sample of data from that population. A population represents all possible outcomes or measurements of interest. A sample is a subset of a population.

    We use the term population in statistics to represent all possible measurements or outcomes that are of interest to us in a particular study. The term sample refers to a portion of the population that is representative of the population from which it was selected. Data is simply defined as the value assigned to a specific observation or measurement. Data that is used to describe something of interest about a population is called a parameter.

    For instance, let's say that the population of interest is my wife's three-year-old preschool class and my measurement of interest is how many times the little urchins use the bathroom in a day. If we average the number of trips per child, this figure would be considered a parameter because the entire population was measured. However, if we want to make a statement about the average number of bathroom trips per day per three-year-old in the country, then Debbie's class could be our sample. We can consider the average that we observe from her class a statistic if we assume it could be used to estimate the average for all three-year-olds in the country.

    Data that describes a characteristic about a population is known as a parameter. Data that describes a characteristic about a sample is known as a statistic. Information is data that is transformed into useful facts that can be used for a specific purpose, such as making a decision.

    We classify the sources of data into two broad categories: primary and secondary. You can obtain primary data in many ways, such as direct observation, surveys, and experiments.

    Direct observation: Focus groups are a direct observational technique where the subjects are aware that data is being collected. Businesses use focus groups to gather information in a group setting controlled by a moderator. The subjects are usually paid for their time and are asked to comment on specific topics.

    Experiments: This method is more direct than observation because the subjects will participate in an experiment designed to determine the effectiveness of a treatment. An example of a treatment could be the use of a new medical drug. Two groups would be established. The first is the experimental group who receive the new drug, and the second is the control group who think they are getting the new drug but are in fact getting no medication. The reactions from each group are measured and compared to determine whether the new drug was effective.
    The benefit of experiments is that they allow the statistician to control factors that could influence the results, such as gender, age, and education of the participants. The concern about collecting data through experiments is that the response of the subjects might be influenced by the fact that they are participating in a study. The design of experiments for a statistical study is a very complex topic and goes beyond the scope of this book.

    Surveys: This technique of data collection involves directly asking the subject a series of questions.
    The questionnaire needs to be carefully designed to avoid any bias or confusion for those participating. Concerns also exist about the influence the survey will have on the participant's responses. Research has shown that the manner in which the questions are asked can affect the responses a person provides on a questionnaire. A question posed in a positive tone will tend to evoke a more positive response and vice versa. A good strategy is to test your questionnaire with a small group of people before releasing it to the general public.

    Another way to classify data is by one of two types: quantitative or qualitative.

    Types of measurement scales:
    A nominal level of measurement deals strictly with qualitative data. Observations are simply assigned to predetermined categories. One example is gender of the respondent, with the categories being male and female. This data type does not allow us to perform any mathematical operations, such as adding or multiplying. We also cannot rank-order this list in any way from highest to lowest. This type is considered the lowest level of data and, as a result, is the most restrictive when choosing a statistical technique to use for the analysis.

    You can use numbers at the nominal level of measurement. Even in this case, the rules of the nominal scale still remain. An example would be zip codes or telephone numbers, which can't be added or placed in a meaningful order of greater than or less than. Even though the data appears to be numbers, it's handled just like qualitative data.

    On the food chain of data, ordinal is the next level up. It has all the properties of nominal data with the added feature that we can rank-order the values from highest to lowest. An example is if you were to have a lawnmower race. Let's say the finishing order was Scott, Tom, and Bob. We still can't perform mathematical operations on this data, but we can say that Scott's lawnmower was faster than Bob's. However, we cannot say how much faster. Ordinal data does not allow us to make measurements between the categories and to say, for instance, that Scott's lawnmower is twice as good as Bob's (it's not).
    Ordinal data can be either qualitative or quantitative. An example of quantitative data is rating movies with 1, 2, 3, or 4 stars. However, we still may not claim that a 4-star movie is 4 times as good as a 1-star movie.

    Moving up the scale of data, we find ourselves at the interval level, which is strictly quantitative data.

    Relative Frequency Distribution
    Rather than display the number of observations in each class, this method calculates the percentage of observations in each class by dividing the frequency of each class by the total number of observations.

    Cumulative Frequency Distribution
    Cumulative frequency distributions indicate the percentage of observations that are less than or equal to the current class. It totals the percentages of each class as you move down the column. In the cell phone example, John used his phone 8 times or fewer on 84 percent of the days in the month.
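As a rough sketch of both calculations, the Python snippet below builds relative and cumulative frequencies from a small table. The class labels and counts are hypothetical stand-ins, not the original cell phone table, and the variable names are my own.

```python
from itertools import accumulate

# Hypothetical class frequencies (calls per day -> number of days).
# These are illustrative values, not the table from the original example.
frequencies = {"0-2": 5, "3-5": 12, "6-8": 8, "9-11": 5}
total = sum(frequencies.values())

# Relative frequency: class frequency divided by total observations.
relative = {label: count / total for label, count in frequencies.items()}

# Cumulative frequency: running share of observations at or below each class.
running_counts = accumulate(frequencies.values())
cumulative = {label: count / total
              for label, count in zip(frequencies, running_counts)}

for label in frequencies:
    print(f"{label:>5} {frequencies[label]:>4} "
          f"{relative[label]:>7.1%} {cumulative[label]:>7.1%}")
```

The last cumulative value should always come out to 100 percent, which is a quick sanity check on any table like this.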

    Graphing a Frequency Distribution: the Histogram
    A histogram is simply a bar graph showing the number of observations in each class as the height of each bar.
    - The first thing we need to do is open Excel to a blank sheet and enter our data in Column A starting in Cell A1.
    - Next, enter the upper limits to each class in Column B starting in Cell B1.
    - Go to the Tools menu at the top of the Excel window and select Data Analysis.
    - The Chart Wizard allows me more control over the final appearance.

    Statistical Flower Power: the Stem and Leaf Display
    The major benefit of this approach is that all the original data points are visible on the display.

    The stem in the display is the first column of numbers, which represents the first digit of the golf scores. The leaf in the display is the second digit of the golf scores, with 1 digit for each score. Because there were 5 scores in the 70s, there are 5 digits to the right of 7.

    Here, the stem labeled 7 (5) stores all the scores between 75 and 79. The stem 8 (0) stores all the scores between 80 and 84.
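To make the stem and leaf idea concrete, here is a minimal Python sketch of the basic display (one stem per tens digit). The golf scores are hypothetical, since the original display did not survive, and `stem_and_leaf` is a helper name I made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) with ones digits as leaves."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)

# Hypothetical golf scores, not the data behind the original display.
scores = [81, 76, 92, 79, 88, 75, 83, 90, 71, 78]
display = stem_and_leaf(scores)

for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

With these made-up scores, five values fall in the 70s, so five leaves appear to the right of stem 7, and every original score can be read back off the display.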

    Charting a Frequency Distribution

    Bar Charts
    Bar charts are a useful graphical tool when you are plotting individual data values next to each other. The histogram that we visited earlier in the chapter is actually a special type of bar chart that plots frequencies rather than actual data values.

    How do I choose between a pie chart and a bar chart? If your objective is to compare the relative size of each class to one another, use a pie chart. Bar charts are more useful when you want to highlight the actual data values.

    Line Charts
    Line charts are used to help identify patterns between two sets of data. They prove very useful when you are interested in exploring patterns between two different types of data. They are also helpful when you have many data points and want to show all of them on one graph.

    Because the line connecting the data points seems to have an overall upward trend, my suspicions hold true. It seems the more showers our waterlogged darlings take, the higher

    Measures of Central Tendency
    There exist two broad categories of descriptive statistics that are commonly used. The first, measures of central tendency, describes the center point of our data set with a single value. It's a valuable tool to help us summarize many pieces of data with one number. The second category, measures of dispersion, describes how far individual data values have strayed from the mean.

    The mean or average is the most common measure of central tendency and is calculated by adding all the values in our data set and then dividing this result by the number of observations.

    A weighted mean allows you to assign more weight to certain values and less weight to others.
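A weighted mean is easy to sketch in Python: multiply each value by its weight, add the products, and divide by the total weight. The scores and weights below are hypothetical, and `weighted_mean` is an illustrative helper rather than anything from the original text.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical example: three exam scores where the first counts the most.
scores = [80, 90, 70]
weights = [5, 3, 2]
result = weighted_mean(scores, weights)
print(result)
```

The plain mean of these three scores is 80, while the weighted mean is 81, because the scores with larger weights pull the result toward themselves.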

    Mean of Grouped Data from a Frequency Distribution - example:

    The mean of a frequency distribution where data is grouped into classes is only an approximation to the mean of the original data set from which it was derived. This is true because we make the assumption that the original data values are at the midpoint of each class, which is not necessarily the case. The true mean of the 30 original data values in the cell phone example is only 4.5 calls per day rather than 4.6.
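The midpoint assumption can be sketched as follows. The classes and frequencies are hypothetical (they are not the cell phone table, so the result will not be 4.6), and `grouped_mean` is a name chosen for illustration.

```python
def grouped_mean(classes):
    """Approximate the mean from (lower, upper, frequency) classes.

    Every observation in a class is assumed to sit at the class midpoint.
    """
    total = sum(freq for _, _, freq in classes)
    weighted_sum = sum(((lower + upper) / 2) * freq
                       for lower, upper, freq in classes)
    return weighted_sum / total

# Hypothetical classes: (lower limit, upper limit, frequency).
classes = [(0, 2, 5), (3, 5, 12), (6, 8, 8), (9, 11, 5)]
approx_mean = grouped_mean(classes)
print(round(approx_mean, 2))
```

Because each class is collapsed to its midpoint, the result only approximates the mean of the raw data, which is the point the paragraph above makes.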

    According to the mean of the frequency distribution in the cell phone example, John averages 4.6 calls per day on his cell phone.

    The median is the value in the data set for which half the observations are higher and half the observations are lower. We find the median by arranging the data values in ascending order and identifying the halfway point.

    When there is an even number of data points, the median will be the average of the two center points. Using our example with the video games, we rearrange our data set in ascending order: 3 4 4 4 5 6 7 7 9 17

    Because we have an even number of data points (10), the median is the average of the two center points. In this case, that will be the values 5 and 6, resulting in a median of 5.5 hours of video games per week.

    If you think all the data in your data set is relevant, then the mean is your best choice. This measurement is affected by both the number and magnitude of your values. However, very small or very large values can have a significant impact on the mean, especially if the size of the sample is small. If this is a concern, perhaps you should consider using the median. The median is not as sensitive to a very large or small value.
    Consider the following data set from the original video game example:
    3 4 4 4 5 6 7 7 9 17

    The number 17 is rather large when compared to the rest of the data. The mean of this sample was 6.6, whereas the median was 5.5. If you think 17 is not a typical value that you would expect in this data set, the median would be your best choice for central tendency.
    The poor lonely mode has limited applications. It is primarily used to describe data at the nominal scale, that is, data that is grouped in descriptive categories such as gender. If 60 percent of our survey respondents were male, then the mode of our data would be male.
    In Excel, Data Analysis - Descriptive Statistics reports all three: mean, median, mode.
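The same three measures can be checked outside Excel. The sketch below uses Python's standard `statistics` module on the video game data from the text. Dropping the value 17 is my own addition, included only to show how much more the mean moves than the median.

```python
import statistics

# Hours of video games per week, from the example in the text.
hours = [3, 4, 4, 4, 5, 6, 7, 7, 9, 17]

mean_hours = statistics.mean(hours)      # sum of values / number of values
median_hours = statistics.median(hours)  # average of the two center points
mode_hours = statistics.mode(hours)      # most frequent value

# Illustration only: remove the unusually large value and recompute.
without_17 = [h for h in hours if h != 17]
mean_without = statistics.mean(without_17)
median_without = statistics.median(without_17)

print(mean_hours, median_hours, mode_hours)
print(round(mean_without, 2), median_without)
```

Removing 17 shifts the mean by more than a full hour but moves the median by only half an hour, which is why the median is the safer summary when an extreme value looks atypical.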

    Measures of Dispersion

    Range is the simplest measure of dispersion and is calculated by finding the difference between the highest value and the lowest value in the data set.
    7 9 8 11 4 - range = 11 - 4 = 7

    However, the limitation is that it only relies on two data points to describe the variation in the sample.
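A short Python sketch of the range, using the five values from the text. The second data set is hypothetical and is there only to show the limitation: it has the same range even though its values are spread very differently.

```python
def data_range(values):
    """Difference between the highest and lowest value."""
    return max(values) - min(values)

data = [7, 9, 8, 11, 4]          # values from the text
clustered = [4, 4, 4, 4, 11]     # hypothetical: same extremes, different spread

print(data_range(data))
print(data_range(clustered))
```

Both calls return 7, so the range alone cannot tell these two samples apart.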

    Standard deviation is simply the square root of the variance. Just as with the variance, there is a standard deviation for both the sample and population. To calculate the standard deviation, you must first calculate the variance and then take the square root of the result.
    The standard deviation is actually a more useful measure than the variance because the standard deviation is in the units of the original data set.
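As a sketch of the sample versus population distinction, Python's `statistics` module provides both versions: `variance` and `stdev` divide by n - 1, while `pvariance` and `pstdev` divide by N. Reusing the five values from the range example is my own choice for illustration.

```python
import math
import statistics

data = [7, 9, 8, 11, 4]  # reused from the range example

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by N

sample_sd = statistics.stdev(data)           # square root of sample variance
population_sd = statistics.pstdev(data)      # square root of population variance

print(round(sample_var, 2), round(sample_sd, 3))
print(round(population_var, 2), round(population_sd, 3))
```

The sample figures are always a little larger than the population figures for the same data, because the denominator is smaller.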

    Calculating the Standard Deviation of Grouped Data

    The Empirical Rule: Working with Standard Deviation
    The values of many large data sets tend to cluster around the mean or median so that the data distribution in the histogram resembles a bell-shaped, symmetrical curve. When this is the case, the empirical rule tells us that approximately 68 percent of the data values will be within one standard deviation from the mean.
    For example, suppose that the average exam score for my large statistics class is 88 points and the standard deviation is 4.0 points and that the distribution of grades is bell-shaped around the mean. Because one standard deviation above the mean would be 92 (88 + 4) and one standard deviation below the mean would be 84 (88 - 4), the empirical rule tells me that approximately 68 percent of the exam scores will fall between 84 and 92 points.
    According to the empirical rule, if a distribution follows a bell shape (a symmetrical curve centered around the mean), we would expect approximately 68, 95, and 99.7 percent of the values to fall within one, two, and three standard deviations around the mean respectively.
    In general, we can use the following equation to express the range of values within k standard deviations around the mean: $\mu \pm k\sigma$.
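The exam example extends naturally to two and three standard deviations. The sketch below applies the mean plus-or-minus k standard deviations formula to the stated mean of 88 and standard deviation of 4. The function and dictionary names are illustrative.

```python
# Approximate share of values within k standard deviations (empirical rule).
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}

def interval(mean, sd, k):
    """Return (mean - k*sd, mean + k*sd)."""
    return (mean - k * sd, mean + k * sd)

exam_mean, exam_sd = 88, 4  # values from the exam example in the text

for k, percent in EMPIRICAL_PERCENT.items():
    low, high = interval(exam_mean, exam_sd, k)
    print(f"about {percent}% of scores between {low} and {high}")
```

These intervals only make sense if the scores really are roughly bell-shaped. For other shapes, Chebyshev's theorem gives a weaker but safer bound.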

    Chebyshev's Theorem

    Chebyshev's theorem is a mathematical rule similar to the empirical rule except that it applies to any distribution rather than just bell-shaped, symmetrical distributions. Chebyshev's theorem states that for any number k greater than 1, at least $(1 - 1/k^2) \times 100$ percent of the values will fall within k standard deviations from the mean. Using this equation, we can state the following:
    - At least 75 percent of the data values will fall within two standard deviations from the mean by setting k = 2 into Chebyshev's equation.
    - At least 88.9 percent of the data values will fall within three standard deviations from the mean by setting k = 3 into the equation.
    - At least 93.75 percent of the data values will fall within four standard deviations from the mean by setting k = 4 into the equation.
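The three bounds above come straight from the formula, as this small Python sketch shows. The function name is my own.

```python
def chebyshev_min_percent(k):
    """Minimum percent of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k greater than 1")
    return (1 - 1 / k**2) * 100

for k in (2, 3, 4):
    print(k, round(chebyshev_min_percent(k), 2))
```

Note that k does not have to be a whole number. For instance, k = 1.5 gives a bound of roughly 55.6 percent.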

    Example:

    This table supports Chebyshev's theorem, which predicts that at least 75 percent of the values will fall within two standard deviations from the mean. From the data set, we can observe that 95 percent actually fall between 20.3 and 49.1 home runs (38 out of 40). The same explanation holds true for three and four standard deviations around the mean.

    Measures of Relative Position describe the percentage of the data below a certain point.

    Quartiles divide the data set into four equal segments after it has been arranged in ascending order.

    Approximately 25 percent of the data points will fall below the first quartile, Q1. Approximately 50 percent of the data points will fall below the second quartile, Q2. And, you guessed it, 75 percent should fall below the third quartile, Q3.

    1) Step 1: Arrange your data in ascending order.
    2) Step 2: Find the median of the data set. This is Q2.
    3) Step 3: Find the median of the lower half of the data set. This is Q1.
    4) Step 4: Find the median of the upper half of the data set. This is Q3.

    Interquartile range - the IQR measures the spread of the center half of our data set. It is simply the difference between the third and first quartiles, as follows: IQR = Q3 - Q1. The interquartile range is used to identify outliers, which are the black sheep of our data set. These are extreme values whose accuracy is questioned and can cause unwanted distortions in statistical results. Any values that are more than Q3 + 1.5 IQR or less than Q1 - 1.5 IQR should be discarded.

    Example: 10 42 45 46 51 52 58 73
    Since there are eight data values, Q1 will be the median of the first four values (the midpoint between the second and third values). Q1 = (42 + 45)/2 = 43.5
    Likewise, Q3 will be the median of the last four values (the midpoint between the sixth and seventh values). Q3 = (52 + 58)/2 = 55. IQR = Q3 - Q1 = 55 - 43.5 = 11.5
    Any values greater than Q3 + 1.5 IQR = 72.25 or less than Q1 - 1.5 IQR = 26.25 should be considered outliers; therefore the values 10 and 73 would be outliers in this data set.
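The four quartile steps and the 1.5 IQR rule can be sketched in Python as below, using the eight values listed in the example. One assumption is mine: when the number of values is odd, the middle value is left out of both halves. Other conventions exist and give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)                      # Step 1: ascending order
    half = len(data) // 2
    lower = data[:half]
    # Assumption: for an odd count, the middle value belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)  # Steps 3, 2, 4

def find_outliers(values):
    """Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [10, 42, 45, 46, 51, 52, 58, 73]  # values listed in the example
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, q3 - q1)
print(find_outliers(data))
```

Working from the listed values, the halves are (10, 42, 45, 46) and (51, 52, 58, 73), so the quartiles are the midpoints of each half.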

    The values for variance and standard deviation reported by Excel are for a sample. If your data set represents a population, you need to recalculate the results using N in the denominator rather than n - 1.

    Probability topics
    Experiment. The process of measuring or observing an activity for the purpose of collecting data. An example is rolling a pair of dice.
    Outcome. A particular result of an experiment. An example is rolling a pair of threes with the dice.
    Sample space. All the possible outcomes of the experiment. The sample space for our experiment is the numbers {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12}. Statistics people like to put {} around the sample space values.
    Event. One or more outcomes that are of interest for the experiment and which is/are a subset of the sample space. An example is rolling a total of 2, 3, 4, or 5 with two dice.

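    Classical Probability refers to a situation when we know the number of possible outcomes of the event of interest and can calculate the probability of that event with the following equation:
    P[A] = (number of possible outcomes that constitute Event A) / (total number of possible outcomes in the sample space)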
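For the dice event mentioned earlier (rolling a total of 2, 3, 4, or 5), classical probability can be worked out by counting. One caution that is my own addition: the eleven totals are not equally likely, so the sketch below counts the 36 equally likely ordered pairs of faces instead.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
rolls = list(product(range(1, 7), repeat=2))

# The totals that can occur, matching the sample space in the text.
totals = sorted({sum(roll) for roll in rolls})

# Event A: the two dice total 2, 3, 4, or 5.
event = [roll for roll in rolls if sum(roll) in {2, 3, 4, 5}]
p_event = Fraction(len(event), len(rolls))

print(totals)
print(len(event), len(rolls), p_event)
```

Ten of the 36 pairs satisfy the event, so the probability is 10/36, or 5/18.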

    We use subjective probability when classical and empirical probabilities are not available. Under these circumstances, we rely on experience and intuition to estimate the probabilities.

    Basic Properties of Probability - one event
    If P[A] = 1, then Event A must occur with certainty.
    If P[A] = 0, then Event A will certainly not occur.
    The probability of Event A must be between 0 and 1.
    The sum of all the probabilities for the events in the sample space must be equal to 1.
    The complement to Event A is defined as all the outcomes in the sample space that are not part of Event A and is denoted as A'. Using this definition, we can state the following: P[A] + P[A'] = 1 or P[A] = 1 - P[A'].

    The Intersection of Events
    Example: phone calls by child and type of call.
    Contingency tables show the actual or relative frequency of two types of data at the same time. In this case, the data types are child and type of call.

    Event A = the next phone call will come from Christin.
    Event B = the next phone call will involve a crisis.
    P[A] = 20/50 = 0.4

    What about the probability that the next phone call will come from Christin and will involve a crisis? This event is known as the intersection of Events A and B and is described by A ∩ B. The number of phone calls from our contingency table that meet both criteria is 14, so: P[A and B] = P[A ∩ B] = 14/50 = 0.28

    A contingency table indicates the number of observations that are classified according to two variables. The intersection of Events A and B represents the number of instances where Events A and B occur at the same time (that is, the same phone call is both from Christin and a crisis). The probability of the intersection of two events is known as a joint probability.

    The union of Events A and B represents the number of instances where either Event A or B occur (that is, the number of calls that were either from Christin or were a crisis).
    P[A or B] = P[A ∪ B] = 34/50 = 0.68
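The phone call numbers can be tied together with a short Python sketch. The counts 50, 20, 14, and 34 come from the text. The number of crisis calls (28) is not stated there and is derived here from the other three, so treat it as an inference.

```python
from fractions import Fraction

total_calls = 50
n_a = 20          # calls from Christin (Event A)
n_a_and_b = 14    # calls from Christin that involve a crisis
n_a_or_b = 34     # calls from Christin or involving a crisis

# Derived, not stated in the text: calls that involve a crisis (Event B).
n_b = n_a_or_b - n_a + n_a_and_b

p_a = Fraction(n_a, total_calls)
p_b = Fraction(n_b, total_calls)
p_a_and_b = Fraction(n_a_and_b, total_calls)
p_a_or_b = Fraction(n_a_or_b, total_calls)

print(float(p_a), float(p_a_and_b), float(p_a_or_b))
# Calls counted in both A and B must be subtracted once to avoid double counting.
print(p_a + p_b - p_a_and_b == p_a_or_b)
```

The final check prints True: adding the two simple probabilities and subtracting the joint probability reproduces the union.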

    Classical probability requires knowledge of the underlying process in order to count the number of possible outcomes of the event of interest.
    Empirical probability relies on historical data from a frequency distribution to calculate the likelihood that an event will occur.
    The law of large numbers states that when an experiment is conducted a large number of times, the empirical probabilities of the process will converge to the classical probabilities.

    The intersection of Events A and B represents the number of instances where Events A and B occur at the same time.
    The union of Events A and B represents the number of instances where either Event A or B occur.

    Conditional Probability
    We define conditional probability as the probability of Event A knowing that Event B has already occurred.
    Example: the following table shows the outcomes of our last 20 matches, along with the type of warm-up before we started keeping score.

    Without any additional information, the simple probability of each of these events is as follows: P[A] = 9/20 = 0.45, P[B] = 13/20 = 0.65, P[A'] = 11/20 = 0.55, P[B'] = 7/20 = 0.35
    Simple or prior probabilities are always based on the total number of observations. In the previous example, it is 20 matches.
    Knowing this piece of info, what is the probability that Debbie will win the match? This is the conditional probability of Event A given that Event B has occurred. Looking at the previous table, we can see that Event B has occurred 13 times. Because Debbie has won 4 of those matches (A), the probability of A given B is calculated as follows: P[A/B] = 4/13 = 0.31
    We can also calculate the probability that Debbie will win given that Event B has not occurred: P[A/B'] = 5/7 = 0.71
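Here is a minimal Python sketch of the two conditional probabilities, using only the counts quoted in the text (20 matches, 13 with Event B, 4 wins with B, 5 wins without B). The variable names are illustrative.

```python
from fractions import Fraction

total_matches = 20
n_b = 13             # matches where Event B occurred
n_a_and_b = 4        # Debbie's wins among those 13
n_not_b = total_matches - n_b   # 7 matches where B did not occur
n_a_and_not_b = 5    # Debbie's wins among those 7

# Conditional probability: restrict attention to the matches where B occurred.
p_a_given_b = Fraction(n_a_and_b, n_b)
p_a_given_not_b = Fraction(n_a_and_not_b, n_not_b)

# Equivalent form: joint probability divided by the probability of the condition.
p_b = Fraction(n_b, total_matches)
p_a_and_b = Fraction(n_a_and_b, total_matches)

print(round(float(p_a_given_b), 2), round(float(p_a_given_not_b), 2))
print(p_a_and_b / p_b == p_a_given_b)
```

The denominator is the number of matches where the condition holds, not the full 20, which is what separates a conditional probability from a simple one.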

    Conditional probabilities are also known as posterior probabilities. Conditional probabilities are very useful for determining the probabilities of compound events as you will see in the following sections.

    Independent versus Dependent Events
    Events A and B are said to be independent of each other if the occurrence of Event B has no effect on the probability of Event A. Using conditional probability, Events A and B are independent of one another if: P[A/B] = P[A]
    If Events A and B are not independent of one another, then they are said to be dependent events.
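The independence test can be sketched by comparing P[A/B] with P[A]. The first check uses the match figures from the conditional probability example (P[A] = 9/20 and P[A/B] = 4/13). The coin flip case is a hypothetical contrast that I added.

```python
from fractions import Fraction
from itertools import product

def is_independent(p_a_given_b, p_a):
    """Events are independent when knowing B leaves P[A] unchanged."""
    return p_a_given_b == p_a

# Match example from the text: P[A/B] = 4/13, P[A] = 9/20.
match_events_independent = is_independent(Fraction(4, 13), Fraction(9, 20))

# Hypothetical contrast: two fair coin flips.
# A = first flip is heads, B = second flip is heads.
flips = list(product("HT", repeat=2))
b_outcomes = [f for f in flips if f[1] == "H"]
p_a = Fraction(sum(f[0] == "H" for f in flips), len(flips))
p_a_given_b = Fraction(sum(f[0] == "H" for f in b_outcomes), len(b_outcomes))
coin_events_independent = is_independent(p_a_given_b, p_a)

print(match_events_independent, coin_events_independent)
```

The match events come out dependent, because 4/13 differs from 9/20, while the two coin flips come out independent. Exact fractions are used so the comparison is not thrown off by rounding.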