Role of Biostatistics in dental research
Part I Paper 2
RESEARCH METHODOLOGY AND BIOSTATISTICS
Overall objective:
The student is able to apply the basic concepts of statistics
and principles of scientific enquiry in planning and evaluating the
results of dental practice, participate in and conduct
descriptive, exploratory and survey studies in dentistry, and evaluate
and apply the results of research studies in health, dental medicine and
related fields in the practice of dentistry.
Behavioral objective:-
The student is able to
Design a study, identifying a population and methods of
selection of the sample required
Present data in appropriate tables, graphs and diagrams
Calculate averages, variation, linear correlation and
regression.
Calculate the confidence intervals and simple tests of
significance using normal, t, F and χ² distributions.
Compute commonly used vital and health statistics and estimate
population using arithmetic progression methods.
Construct instruments for eliciting data through questioning,
observation and measurement methods and techniques.
Quantify, analyze, describe and interpret data.
Critique dental studies.
Select and write a clear statement of a researchable
problem.
Search and analyze the literature for facts and theory relating
to the problem.
Identify and state relevant assumptions and methods of selection of
the sample required.
Make recommendations based on the findings for application to
nursing and further research.
Prepare and write a scientific report of the study.
Methods of Teaching: -
Lectures and discussion with power point presentations
Seminars and practical with power point presentations
Methods of evaluation:
Regular attendance, Seminars, written test and dissertation
Suggested practical:
Each student will select and present critique of dental
study.
Survey and assess selected studies in dentistry, with particular
reference to the research process. Individually selected problems
are presented at each step of the research process for independent
evaluation and discussion.
QUESTION PATTERN
Time: 1 hour
Short Answer          5 x 2     10 Marks
Short Note            5 x 6     30 Marks
Internal Assessment             10 Marks
UNIT
DESCRIPTION
I
1.1 Introduction and overview of Biostatistics
1.2 Scope of Biostatistics
1.3 Biostatistics in Dentistry
1.4 Applying study result to patient care
II
2.1 Review of descriptive statistics
(Central tendency, dispersion, plotting)
2.2 Correlation and regression
III
3.1 Testing of statistical hypothesis
3.2 Statistical inference with mean, proportion and normal
deviate
3.3 Sampling distributions (t, F, χ²)
IV
4.1 ANOVA (one & two way classification)
4.2 Non-Parametric tests
a). Sign test
b). Wilcoxon signed-rank test
c). Mann-Whitney U test
d). Wald-Wolfowitz runs test
e). Kruskal-Wallis test
V
5.1 Concept of research & research process
5.2 Principle and various methods of research process
5.3 Utilization of research: the results section of a research
report and conclusions
5.4 Checklist for reading the literature
STATISTICS
Different authors have defined statistics differently from time to
time. A definition should lay down the meaning, scope and
limitations of the subject. Statistics is used in two senses, viz.,
singular and plural.
In the plural sense it denotes numerical facts, whereas in the
singular sense it denotes statistical methods.
Among these authors, C. E. Croxton and D. J. Cowden give a
precise definition of statistics, and Prof. Horace Secrist gives
the best definition.
According to C. E. Croxton and D. J. Cowden,
A branch of mathematics that deals with the collection,
classification, analysis and interpretation of numerical data is
called statistics.
From this definition, the main divisions of statistics are,
i. Collection of Data,
ii. Classification of data,
iii. Analysis of Data,
iv. And interpretation of numerical data.
Statistics may also be defined as a field of study concerned with
(1) The collection, organization, summarization, and analysis
of data, and
(2) The drawing of inferences about a body of data when only a
part of the data is observed.
Simply put, we may say that data are numbers, numbers contain
information, and the purpose of statistics is to investigate and
evaluate the nature and meaning of this information.
Statistics is the science of compiling, classifying, and
tabulating numerical data and expressing the results in a
mathematical or graphical form.
According to Prof. Horace Secrist,
The aggregate of facts affected to a marked extent by a
multiplicity of causes, numerically expressed, enumerated or
estimated according to a reasonable standard of accuracy, collected
in a systematic manner for a predetermined purpose and placed in
relation to each other is called statistics.
This definition gives the characteristics of the statistics. The
characteristics of statistics are,
It is an aggregate of facts.
It is affected to a marked extent by a multiplicity of causes.
It is numerically expressed.
It should be enumerated or estimated.
It should be collected in a systematic manner for a
predetermined purpose.
It should be collected with a reasonable standard of accuracy.
The facts should be placed in relation to each other.
BIOSTATISTICS
The tools of statistics are employed in many
fields: business, education, psychology, agriculture, and economics,
to mention only a few. When the data analyzed are derived from the
biological sciences and medicine, we use the term biostatistics to
distinguish this particular application of statistical tools and
concepts.
Biostatistics is that branch of statistics concerned with
mathematical facts and data relating to biological events.
Medical statistics is a further specialty of biostatistics, in which
the mathematical facts and data relate to health, preventive
medicine and disease.
Essential Features of Statistics
The essential features of statistics are evident from the various
definitions of statistics:
a) Principles and methods for the collection,
presentation, analysis and interpretation of numerical data of
different kinds:
i. Observational data, quantitative data
ii. Data that have been obtained by a repetitive operation
iii. Data affected to a marked degree by a multiplicity of
causes
b) The science and art of dealing with variation in such a way
as to obtain reliable results.
c) Controlled objective methods whereby group trends are
abstracted from observations on many separate individuals.
d) The science of experimentation which may be regarded as
mathematics applied to experimental data.
The objective of dental science is primarily to improve the oral
health of an individual and hence relevant knowledge has to be
obtained by observation of groups of individuals. The treatment of
a patient with best course of action depends on the overall oral
hygiene or health status.
Fundamental processes involved in the organization of oral
health care services are:
Acquisition of information i.e., monitoring data, from
independent study and systematic enquiry (scientific research)
Dissemination of information e.g., by teaching, demonstrating,
writing, publishing.
Application of knowledge and skill i.e., provision of health
care and related services such as environmental control (e.g.,
fluoride adjustment, regulation of harmful substances, etc) and
manufacturing of health products.
Judgment or evaluation by the application of professional
ethics, laws, regulations, policies, guidelines, criteria and
standards.
Administration, i.e., the management of personnel, facilities,
materials, funds and other resources to facilitate the four processes
outlined above.
Uses of Biostatistics:
1) To define normalcy
2) To test whether the difference between two populations,
regarding a particular attribute is a real or a chance
occurrence.
3) To study the correlation or association between two or more
attributes in the same population.
4) To evaluate the efficacy of vaccines, sera, etc. by controlled
studies.
5) To locate, define and measure the extent of morbidity and
mortality in the community.
6) To evaluate the achievements of public health programs.
7) To fix priority in public health programs.
Uses of Biostatistics in dental science:
1) To assess the state of oral health in the community and to
determine the availability and utilization of dental care
facilities.
2) To indicate the basic factors underlying the state of oral
health by diagnosing the community, and to find solutions to such
problems.
3) To determine the success or failure of specific oral health programs
or to evaluate the program action.
4) To promote health legislation and in creating administrative
standards for oral health.
Role / Importance / Applications / Uses of Biostatistics in
dental research:
To maintain patient records.
To keep track of a patient's previous treatment and the next or further
treatment procedure.
Records kept over a long period show the previous
treatment procedures and help in planning the current treatment.
When a new drug is launched in the market, biostatistical analysis
indicates whether it is more effective than other drugs.
Statistical analysis indicates which drug is most commonly used,
on average, for a particular treatment or for all treatments.
To estimate the number of patients visiting in future (weekly,
monthly or yearly).
To know which age group or sex has more dental
problems.
Dental problems vary by area, culture, habits and water supply, and also
between villages, districts, cities and countries.
Dental complaints vary with age, sex, area, culture, habits,
etc.
To compare two or more treatments, drugs, surgeons, or times
taken for the same complaint, and to decide which is better or whether
there is no difference.
When any one of several drugs may be used for a treatment, to test
whether their effects are the same or not.
To compare and estimate treatment time, cure level, etc.
To compare and estimate students' intelligence.
To estimate when a person will develop a dental complaint, or when
a patient under treatment will be cured.
To analyze people's dental knowledge, i.e., what percentage have
very poor / poor / average / good / very good knowledge.
Patient records also give the patient's family history: past,
present and future.
To do the basic calculations: total number of patients visiting,
average number of patient visits by age, sex, treatment, complaint,
finished and ongoing treatment, etc.
To show whether the patient details recorded are sufficient or need to be improved.
To test whether the difference before and after treatment is significant.
The number of patient visits varies department-wise; if so, why?
To analyze the data and draw inferences.
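The basic calculations mentioned in the list above (total visits, counts by group, averages) can be sketched in a few lines of Python. This is only an illustration: the records are invented, and the field names (age, sex, complaint) are assumptions, not a prescribed record format.

```python
from collections import Counter
from statistics import mean

# Hypothetical patient visit records, invented for illustration only.
visits = [
    {"age": 12, "sex": "F", "complaint": "caries"},
    {"age": 35, "sex": "M", "complaint": "gingivitis"},
    {"age": 47, "sex": "F", "complaint": "caries"},
    {"age": 29, "sex": "M", "complaint": "caries"},
    {"age": 63, "sex": "F", "complaint": "tooth loss"},
]

# Total number of patient visits.
total_visits = len(visits)

# Number of visits broken down by sex and by complaint.
visits_by_sex = Counter(v["sex"] for v in visits)
visits_by_complaint = Counter(v["complaint"] for v in visits)

# Average age of the visiting patients.
average_age = mean(v["age"] for v in visits)

print("Total visits:", total_visits)
print("Visits by sex:", dict(visits_by_sex))
print("Visits by complaint:", dict(visits_by_complaint))
print("Average age:", average_age)
```

The same pattern extends to any other recorded characteristic (department, treatment status, and so on) by counting or averaging over a different field.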
Applications of Biostatistics in patient care / applying study
results in patient care:
A patient record gives an overview of the patient's
treatment and the further steps required.
To choose a suitable treatment or method for the patient.
To know the maximum, minimum, and average value of any
patient characteristic.
To know whether a characteristic varies from patient to patient and, if so,
for what reason.
Previous analyses show which disease affects which type of
population (age, sex, area, culture, etc.). These analyses are very
helpful in giving instructions to prevent or take care of the
disease.
When several drugs are available in the market, we
select a suitable drug according to patient co-operation, cost, time,
or any one or all of these considerations.
To find what percentage of patients is cured by a particular treatment.
If the cure level is very low, medical research is advised
to develop the treatment further, and statistical analysis is used to test
whether the newly developed treatment is effective.
To estimate patient cure time, next visit, number of visits for a
particular treatment, etc.
To estimate the number of patients in future.
Statistical inference identifies which particular disease causes major
problems or most affects regular life, so that
further steps can be taken to prevent or control such diseases.
Why do we need Statistics?
The objectives of this paper are twofold:
(1) To teach the student to organize and summarize data, and
(2) To teach the student how to reach decisions about a large body
of data by examining only a small part of the data.
The concepts and methods necessary for achieving the first
objective are presented under the heading of descriptive statistics
and the second objective is reached through the study of what is
called inferential statistics.
Need for quantifying the data: As per the definition of
statistics (i.e., a branch of mathematics that deals with the
collection, classification, analysis and interpretation of
numerical data), it mainly deals with numerical data. Hence,
statistics can be applied only when we have numerical data.
But in many situations the researcher cannot get purely numerical data
(i.e., the data will be a mixture of numerical and qualitative
characteristics).
So, to draw valid conclusions from qualitative characteristics,
it is essential to convert the qualitative information into
quantitative form by giving ranks or scale values.
While conducting an
oral health examination, the investigator makes observations
according to his judgment of the situation. This depends on his
skill, knowledge, experience and temperament.
Grading of plaque scores or malocclusion or the quality of diet
of an individual are situations, which are influenced by the
particular investigator who makes the observations. If the same
observer repeats the observation on the same case after some time
lapse, he may or may not agree with his previous assessment.
Similarly, if more than one investigator observes the same
individual, all of them may not agree in their assessment. The
variability in measurement can be handled using statistics.
Epidemiology and biostatistics are sister sciences or
disciplines. The former collects facts relating to groups of
population in places, times and situations, while the latter converts
all facts into figures and at the end translates them back into facts,
interpreting the significance of the results. Facts are
qualitative in nature and do not admit several kinds of statistical
treatment, and hence have to be converted into figures for
statistical analysis.
Both the sciences of epidemiology and biostatistics deal with
facts-figures-facts, which is termed quantitative
methodology.
In community dentistry, the approach is primarily through
epidemiology and the social or behavioral sciences, all of which
require intensive studies, by collecting facts, which are
qualitative, and later expressing them as figures, which are
quantitative.
Example:
The oral health worker is interested in knowing how many people
have good oral hygiene or otherwise, the circumstances in which
problems occur, the age at which various upsets take place,
whether they are equally distributed between the sexes, which group is
at risk of developing diseases leading to mortality, and which areas,
rural or urban, are more or less affected by the diseases.
As most of these events are counted, they are the foundations of
dentistry. And because these numbers come in with variation between
people or from place to place or from time to time, statistics
finds its role in dentistry.
Data:
The raw material of statistics is data. For our purposes we may
define data as numbers. The two kinds of numbers that we use in
statistics are numbers that result from the taking, in the usual
sense of the term, of a measurement, and those that result from the
process of counting.
Example: When a nurse weighs a patient or takes a patient's
temperature, a measurement consisting of a number, such as 150
pounds or 100 degrees Fahrenheit, is obtained.
Quite a different type of number is obtained when a hospital
administrator counts the number of patients, perhaps 20, discharged
from the hospital on a given day. Each of the three numbers is a
datum, and the three taken together are data.
Variable
If, as we observe a characteristic, we find that it takes on
different values in different persons, places, or things, we label
the characteristic a variable.
Example: Diastolic blood pressure, heart rate, heights of adult
males
Random variable
Whenever we determine the height, weight, or age of an
individual, the result is frequently referred to as a value of the
respective variable. When the values obtained arise as a result of
chance factors, so that they cannot be exactly predicted in
advance, the variable is called a random variable.
Example: Adult height. When a child is born, we cannot predict
exactly his or her height at maturity. Attained adult height is the
result of numerous genetic and environmental factors. Values
resulting from measurement procedures are often referred to as
observations or measurements.
Population
The average person thinks of a population as a collection of
entities, usually people. A population or collection of entities
may, however, consist of animals, machines, places, or cells.
For our purposes, we define a population of entities as the
largest collection of entities in which we have an interest at a
particular time. If we take a measurement of some variable on each
of the entities in a population, we generate a population of values
of that variable. We may, therefore, define a population of values
as the largest collection of values of a random variable in which
we have an interest at a particular time. Populations may be finite
or infinite. If a population of values consists of a fixed number
of these values, the population is said to be finite. If, on the other
hand, a population consists of an endless succession of values, the
population is an infinite one.
Example: Suppose we are interested in the weights of all the children
enrolled in a certain county elementary school system; our
population consists of all these weights. If our interest lies only
in the weights of first grade students in the system, we have a
different population: the weights of first grade students enrolled in
the school system. Hence populations are determined or defined by
our sphere of interest.
Sample
A sample may be defined simply as a part of a population.
Suppose our population consists of the weights of all the
elementary school children enrolled in a certain county school
system. If we collect for analysis the weights of only a fraction
of these children, we have only a part of our population of
weights, that is, we have a sample.
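The relationship between a population of values and a sample can be illustrated with a small sketch. The weights below are simulated rather than real, the population size of 500 and sample size of 50 are arbitrary, and drawing the sample at random is an assumption of this sketch (the definition above only requires that a sample be a part of the population).

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so that the sketch is repeatable

# Hypothetical finite population: simulated weights (kg) of 500 school children.
population = [round(random.gauss(30, 5), 1) for _ in range(500)]

# A sample is a part of the population: here, 50 weights drawn at random
# without replacement.
sample = random.sample(population, 50)

print("Population size:", len(population), "mean weight:", round(mean(population), 2))
print("Sample size:", len(sample), "mean weight:", round(mean(sample), 2))
```

Comparing the two printed means shows how a summary computed from the sample approximates, but does not exactly equal, the corresponding population value.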
TYPES OF VARIABLE
(1). Quantitative variable
A quantitative variable is one that can be measured in the usual
sense. Measurements made on quantitative variables convey
information regarding amount.
Example: Weights of preschool children, ages of patients.
(2). Qualitative variable
Some characteristics are not capable of being measured in the
sense that height, weight, and age are measured. Many
characteristics can be categorized only.
Example: When an ill person is given a medical diagnosis, or an
object is said to possess or not to possess some characteristic of
interest. In such cases measuring consists of categorizing.
(3). Discrete random variable
Variables may be characterized further as to whether they are
discrete or continuous.
A discrete random variable is characterized by gaps or
interruptions in the values that it can assume. These gaps or
interruptions indicate the absence of values between particular
values that the variable can assume.
Example: The number of daily admissions to a general hospital is
a discrete random variable since the number of admissions each day
must be represented by a whole number, such as 0, 1, 2, or 3. The
number of admissions on a given day cannot be a number such as 1.5,
2.432, or 3.9009.
The number of decayed, missing, or filled teeth per child in an
elementary school is another example of discrete random
variable.
(4). Continuous random variable
A continuous random variable does not possess the gaps or
interruptions characteristic of a discrete random variable. A
continuous random variable can assume any value within a specified
relevant interval of values assumed by the variable.
Example: Height, weight and age of an individual; water fluoride level.
SCALES OF MEASUREMENT OF DATA
It is necessary to express the data measurements clearly, either
in units or as categories. Each level of measurement forms a scale of
measurement, which is defined by the degree of accuracy and
sophistication of the measuring device.
Measurement: This may be
defined as the assignment of numbers to objects or events according
to a set of rules. The various measurement scales result from the
fact that measurement may be carried out under different sets of
rules.
The following scales are commonly used:
i). Nominal scale: (By name, label, and tag)
The lowest
measurement scale is the nominal scale. As the name implies, it
consists of naming observations or classifying them into various
mutually exclusive and collectively exhaustive categories.
Example: Dichotomies such as
Outcome of cancer: Dead, alive
Goals of RCT: Achieved, not achieved
ii). Ordinal scale: (With implicit order of relationship)
Whenever observations are not only different from
category to category but can be ranked according to some criterion,
they are said to be measured on an ordinal scale.
Example: OHI score: Poor, Fair, Good
Student's intelligence: Above average, Average, Below average
iii). Interval scale: (Number between characters)
The interval scale is a
more sophisticated scale than the nominal and ordinal scales in that
with this scale it is not only possible to order measurements, but
also the distance between any two measurements is known. The
interval scale, unlike the nominal and ordinal scales, is a truly
quantitative scale. We know, say, that the difference between
a measurement of 20 and a measurement of 30 is equal to the
difference between measurements of 30 and 40. The ability to do
this implies the use of a unit distance and a zero point, both of
which are arbitrary.
Example: Temperature in degrees Celsius or Fahrenheit. (Age of the
patient, BP and water fluoride level are often treated on this scale,
although they have a true zero and strictly belong to the ratio scale.)
iv). Ratio scale: (Relative magnitude)
The highest level of
measurement is the ratio scale. This scale is characterized by the
fact that equality of ratios as well as equality of intervals may
be determined. Fundamental to the ratio scale is a true zero
point.
Example: Gingival bleeding per 1000 people, Height by
weight
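The difference between an arbitrary zero (interval scale) and a true zero (ratio scale) can be checked with a short calculation. The sketch below uses temperature and body weight as illustrative cases; the particular values are chosen only for the example.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between the two temperature units."""
    return c * 9 / 5 + 32


# Interval scale: temperature has an arbitrary zero point.
low_c, high_c = 20.0, 40.0
low_f, high_f = celsius_to_fahrenheit(low_c), celsius_to_fahrenheit(high_c)

ratio_c = high_c / low_c  # ratio computed in Celsius
ratio_f = high_f / low_f  # ratio of the same two temperatures in Fahrenheit

# Ratio scale: weight has a true zero, so ratios survive a change of unit.
KG_TO_LB = 2.20462  # approximate conversion factor
w1_kg, w2_kg = 30.0, 60.0
ratio_kg = w2_kg / w1_kg
ratio_lb = (w2_kg * KG_TO_LB) / (w1_kg * KG_TO_LB)

print("Temperature ratio in Celsius:", ratio_c)
print("Temperature ratio in Fahrenheit:", round(ratio_f, 3))
print("Weight ratio in kg:", ratio_kg)
print("Weight ratio in lb:", round(ratio_lb, 3))
```

The temperature ratio changes with the unit, so a statement such as "twice as hot" has no meaning on an interval scale, whereas "twice as heavy" holds in any unit of weight.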
RELIABILITY OF DATA
Reliability is checked by testing the findings or results from
the data. If the agency has used proper methods to collect the
data, the statistics may be relied upon.
Reliability indicates consistent results in repeated
observations. Many factors determine the reliability of data. The major
factors are:
Inherent variation, such as reagents used after a long
lapse of time, a weighing machine that does not return to the zero mark,
etc.
Intra-observer variation, as when the same person makes repeated
measurements, e.g., BP recordings, MP smear examination, pulse rate
recording, etc.
Variable fluctuations, such as replies given by respondents according to
their capability of understanding the questions and replying.
Inter-observer variation, as when many people and many instruments are
used for recording.
VALIDITY OF DATA
Data obtained by measurement should measure what they are supposed
to measure. The concept of validity depends on the specific situation
at data collection.
Example: Oral interview on abortion practice is not valid
Infertility of no issues is not valid
Fever in non malaria area is not valid
Validity is measured by sensitivity and specificity. Sensitivity
is the proportion of true positive observations correctly identified by a
test. Specificity is the proportion of true negative observations correctly
identified by a test.
Notation for test validation of measurements of data:
                                    True picture (e.g. Disease)
Test result (e.g. Screening test)       +         -       Total
            +                           a         b       (a+b)
            -                           c         d       (c+d)
          Total                       (a+c)     (b+d)   (a+b+c+d)
Sensitivity = Number positive on both the test result and the true
picture / total number positive on the true picture = a / (a + c)
Specificity = Number negative on both the test result and the true
picture / total number negative on the true picture = d / (b + d)
Positive predictive value = Number positive on both the test
result and the true picture / total number positive on the test
result = a / (a + b)
Negative predictive value = Number negative on both the test
result and the true picture / total number negative on the test
result = d / (c + d)
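The four measures defined above can be computed directly from the cell counts a, b, c and d of the two-by-two table. The function name and the counts used below are hypothetical, chosen only to illustrate the arithmetic.

```python
def screening_measures(a, b, c, d):
    """Validity measures from a 2 x 2 screening table.

    a: test positive, disease present   b: test positive, disease absent
    c: test negative, disease present   d: test negative, disease absent
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
    }


# Hypothetical counts for 200 screened persons, invented for illustration.
measures = screening_measures(a=40, b=10, c=20, d=130)

for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts, 40 of the 60 diseased persons test positive and 130 of the 140 healthy persons test negative, which is what the sensitivity and specificity values summarize.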
SOURCES OF DATA
The performance of statistical activities is motivated by the
need to answer a question. For example, clinicians may want answers
to questions regarding the relative merits of competing treatment
procedures. Administrators may want answers to questions regarding
such areas of concern as employee morale or facility utilization.
When we determine that the appropriate approach to seeking an
answer to a question will require the use of statistics, we begin
to search for suitable data to serve as the raw material for our
investigation. Before data collection, the type of data should be
decided, that is, primary data or secondary data. The choice of
data depends on:
Nature and scope of the study,
Availability of finance and time,
The degree of accuracy needed,
Nature of the investigation (individual or government study).
Generally, for most surveys primary data are preferable.
The main sources of data are
1). Routinely kept records
2). Surveys
3). Experiments
Data can be collected through either
a). Primary source
b). Secondary source
1. Routinely kept records
It is difficult to imagine any type of organization that does
not keep records of the day-to-day transactions of its activities. OP
(outpatient) medical records, for example, contain information on
patients' habits, while other routine records contain data on the
business activities of the facility. When the
need for data arises, we should look for them first among routinely
kept records.
2. Surveys
If the data needed to answer a question are not
available from routinely kept records, the logical source may be a
survey. Suppose, for example, that the administrator of a clinic
wishes to obtain information regarding the mode of transportation
used by patients to visit the clinic. If admission forms do not
contain a question on mode of transportation, we may conduct a
survey among patients to obtain this information.
3. Experiments
Frequently the data needed to answer a question are available
only as the result of an experiment. A nurse may wish to know which
of several strategies is best for maximizing patient compliance.
The nurse might conduct an experiment in which the different
strategies of motivating compliance are tried with different
patients. Subsequent evaluation of the responses to the different
strategies might enable the nurse to decide which is most
effective.
a). Primary Source
The first hand information that is collected for the first time
by the investigator for the purpose of his study is called primary
data.
This is first hand information.
This data is original in character.
The primary data collection methods: To collect primary data,
six methods are commonly used. They are,
1. Direct personal investigation
2. Oral health examination
3. Indirect oral investigation
4. Questionnaire method
5. Local correspondent method
6. Enumeration method
(1). Direct personal investigation:
In this method, the investigator personally meets the informants and
collects the information by asking them questions. The persons from
whom the information is collected are called informants. This method
is intensive rather than extensive. The investigator must be a keen
observer and tactful and courteous in behavior.
Suitability:
This method can be employed, when
High accuracy is needed.
The coverage area is small.
The confidential data is needed.
The intensive study is needed. And
Sufficient time is available.
Merits:
Original (first hand) data is collected.
The collected data are highly reliable.
The high degree of accuracy can be achieved.
Due to personal approach response will be more.
Correct information can be extracted from the informant.
Cross-examination is possible.
Misinterpretation on the informant's part can be avoided.
Demerits:
This method is not advisable when the coverage area is large and
time and finance are limited.
Possibility of bias is more.
Untrained investigator cannot bring good result.
It is expensive and time consuming.
(2). Oral health examination:
When information is needed on the oral diseases, this method
provides more valid information than health interview. It is
conducted by dentists, technicians, and the trained investigator.
This method cannot be considered for an extensive study because it
is expensive and also one has to consider providing treatment to people
suffering from certain diseases.
(3). Indirect oral investigation:
If the informant is unwilling (reluctant) to provide
information, this method can be used. In this method the
investigator does not meet the actual informant. Instead, the
investigator meets witnesses, third parties or friends who
are in touch with the informant. The investigator interviews people
who are directly or indirectly connected to the informant and
collects the information.
For example: When information relating to gambling,
drinking or smoking habits is sought, the informant will not provide it
and may not respond to the study at all. In such situations the
investigator has to approach friends, neighbors, etc., of the
actual informant to collect the information. Usually the police
department adopts this method.
Example: Police department, riots, alliance, etc.
Merits:
It is simple and convenient method.
It is suitable when the investigation area is large.
It saves time, money and labor factors.
The information is unbiased.
Adequate information can be collected.
Demerits:
The result depends on third parties and may reflect their prejudice.
To get adequate information, a large number of persons may have to be
interviewed.
An interview with an unsuitable person will spoil the result.
Bad information will spoil the result.
(4). Questionnaire method:
In this method, a separate questionnaire consisting of a list of
questions for the enquiry is prepared. There are two ways to collect
information through this method,
(1). Mailed questionnaire
(2). Direct questionnaire
(i). Mailed questionnaire method
The questionnaire is sent to the informants with a request to
extend their co-operation by filling in the questionnaire and
returning it with correct replies. To get a quick and better
response, the postal expense is borne by the investigator. After
the questionnaires are received back, analysis is carried out. The
research workers of state and central governments adopt this
method.
(ii). Direct questionnaire method
The investigator directly meets the informants and collects the
information by asking questions from the questionnaire.
Suitability:
This method is advisable, if,
The coverage area is wide.
There is a legal compulsion to supply information, so that
non-response risk is eliminated.
Merits:
This method is the most economical compared with other
methods.
This method of data collection covers a wide area and reduces
money, time and labor.
Bias is less since the data is collected directly from the
respondents.
Demerits:
There is no direct contact between the investigator and
respondent.
The accuracy and reliability are less.
This method is suitable among literate people only.
There is the possibility of delay in receiving
questionnaire.
The people may furnish wrong information.
Asking supplementary questions is not possible.
Framing questionnaire:
In the mailed questionnaire method, the questionnaire is the
medium of communication between the investigator and the informant.
Hence, the success of the investigation depends on the questionnaire.
So the questionnaire must be designed with adequate skill,
efficiency and experience.
Characteristics of Good questionnaire:
Number of questions should be minimum
Questions should be short and simple to understand.
Questions should be arranged in logical order.
Questions may have multiple-choice answers.
Personal questions are to be avoided.
The questions that require calculations are to be avoided.
Questions of sensitive and personal type should be avoided.
The wording of the questionnaire should not hurt the feelings of
respondents.
Instructions for filling in the questionnaire must be given.
Questionnaire should look attractive.
Pre-test: After the questionnaire is prepared, a pre-test is to
be done.
The pre-test is the process of refining and validating the questionnaire by
collecting information from a small number of relevant respondents
with the framed questionnaire, with a view to overcoming its
shortcomings. If any
shortcoming is found in the questionnaire, it is corrected
in the questionnaire. After the required changes are incorporated,
a pilot study is carried out.
Pilot study: Whenever the investigator has
to deal with a large survey, he should not plunge into it directly. After the
pre-test is over, a pilot study is carried out to overcome the
shortcomings of the analysis. This is a small-scale survey with a
small number of persons. The data collected through the pilot study
are analyzed. If any technical difficulty in the analysis is found,
the questionnaire is altered. The main survey is undertaken if
the pilot study does not reveal any analytical difficulties. (See
Figure 1.)
(5). Local Correspondents Method:
In this method instead of collecting the information by the
researcher, local agents are appointed to collect the information.
They collect the information from the informant and the collected
data is sent to the actual researcher or investigator. The data
collection is done according to the local correspondent's own taste.
Newspaper agencies, magazines, etc. adopt this method.
Suitability:
If the data is required regularly from the wide area, this
method can be used.
Merits:
Extensive information is collected.
This is the cheapest and most economical method.
Information will be collected regularly.
Demerits:
Information may be biased.
Degree of accuracy cannot be maintained.
Data may be of duplicate nature.
(6). Enumerator method:
In this method, a number of enumerators are selected and trained
to collect the data. They are provided with the questionnaires and
trained to fill them in. They meet the informants with the
questionnaire and collect the data by filling in the
questionnaire. The enumerator explains the object and purpose of the
study to the informant.
Merits:
Intensive information is collected.
This method yields reliable and accurate results.
This method is helpful even if the informants are illiterate,
because the enumerator records the information.
Due to personal contact, the non-response is less.
Demerits:
This method requires more money and time.
Personal bias of the enumerator may lead to wrong conclusions.
b). Secondary Source
Second-hand information, that is, information collected from
already existing sources for the study, is called
secondary data. That is, the researcher gets the required
information from data that has already been collected by someone
else for that person's own purpose. The sources of secondary data are,
Published sources:
The data published by various governments and by local and
international agencies are published data.
International publications:
IMF, IBRD, ICAFE, UNO, etc., publish data at regular time
intervals.
Central and state governments:
Departments of the union and state governments regularly publish
data. Other publications include the RBI Bulletin, the Census of India,
the Indian Trade Journal, etc.
Semi-official publications:
The semi-government institutions such as district, panchayat,
municipal and corporation bodies publish statistical data.
Research institutions publication:
The research institutions such
as the Indian Statistical Institute (ISI), the Indian Agricultural
Statistics Research Institute (IASRI), etc., publish data.
Journals and newspapers:
Some journals like Indian Finance, Commerce, etc., publish
current and important material on statistics and socio-economic
problems.
Unpublished sources:
There are various unpublished data sources. Various government
and private offices maintain them. They also include the data collected by
researchers in universities or research institutions.
Precautions in Using Secondary Source
Secondary data may not be reliable, and data collected long
ago may be inadequate for the present purpose. So before using secondary data
in the analysis, some precautions must be taken.
The precaution steps are,
Suitability of data:
The available data should be suitable for the present study. This
characteristic is to be examined by the investigator himself. The
data should be coherent with the scope of the present analysis.
Adequacy of data:
After suitability is established, the data must also be adequate for
the study. That is, enough data must be available from the source
to carry out the analysis.
Reliability of data:
Reliability is checked by testing the findings or results from
the data. If the agency has used proper methods to collect the
data, the statistics may be relied upon.
COLLECTION OF DATA
The first and foremost step of the research
process is data collection. Before the statistical investigation,
the researcher has to know the nature, objective and scope of
investigation, time and type of investigation and the desired
degree of study.
The two types of investigation are
Census/complete enumeration method.
Sampling method.
Census Method
A data collection method that collects
information from each and every unit of the population is called the
census method. That is, in this method the data is collected from
all the population units. For example, to study the average height of
the students of a particular college, the investigator has to
measure the height of every student in that college.
Population: The collection of individual items about which the
study of the investigation is concerned is called as
population.
Merits:
The data is collected from all the items of study. Hence, bias
is minimized and the data is more accurate and reliable.
The highest accuracy can be maintained.
Results drawn from the data collected through this method are
more representative and true.
Demerits:
When the coverage area is wide, this method is not suitable,
because it will take more money, time and energy.
The cost is high; hence only an organization that possesses huge
finance and manpower can adopt this method.
If the population size is infinite, this method is not
suitable.
If the study involves a destructive type product, this method is not
suitable.
Destructive type product: A product that cannot be used after
its initial use is called a destructive type product.
Type of population: The two types of population are,
Hypothetical
Existent population.
The collection of concrete objects or persons under
investigation constitutes the existent population. The existent
population may be finite or infinite. An existent population that
consists of a countable number of individuals or objects is called a
finite population.
An existent population that consists of an uncountable number of
individuals or objects is called an infinite population. E.g., in the
study of the economic level of the students of a particular college, the
totality of the students of that college is the population, and it can be counted. Hence it
is a finite population. E.g., in the study of the characteristic
pattern of stars in the sky, all the stars in the sky constitute
the population. But they are, for practical purposes, uncountable. Hence it is an infinite
population.
The collection of non-concrete objects, which exist only in
imagination and are uncountable, constitutes the hypothetical or
theoretical population. For example, in the study of the pattern of
results of the coin tossing experiment, the researcher cannot list
all the concrete results. He can only imagine the results as heads and
tails.
Hence the results of the coin tossing experiment constitute the
hypothetical population.
Sampling Method
The method or technique that is adopted to select the sample
from the population is called the sampling method.
Sample: A finite subset or small part of the population that
reflects the characteristics of the population and is used to make valid
inferences regarding the entire population is called a
sample.
Objectives:
To get more information about the population with minimum effort
time and cost.
To estimate the population parameters through its statistic.
To obtain the degree of precision of the drawn result through
its statistic.
To draw valid conclusion about the population.
To give desired result with required precision with the given
minimum cost.
To identify the true representative of the population.
Merits:
It is more economical, i.e., it saves time, money and energy
because of the limited number of investigation units.
It helps to achieve a high degree of accuracy.
It helps to get reliable results for the population.
It serves as an alternative to the census method.
It helps to organize and administer the survey easily.
If only an approximate result is required, this method can
be used.
Demerits:
Careful planning must be followed; otherwise the result will be
incorrect and biased.
The result depends on the investigator. The attitude of the
personnel will affect the result.
There is a possibility of large errors.
Hence,
The sample must be a true representative of the population.
Experienced personnel have to be employed for the fieldwork.
The sample size must be adequate.
The coverage area should be small.
The two types of sampling methods are,
Probability sampling
Non-probability sampling.
Probability sampling: The sampling method that follows some
standard procedure and selects the units with pre-defined
probability is called probability sampling.
The five types of probability sampling method are,
1). Simple (Equal) Random (chance) Sampling.
2). Stratified Random Sampling.
3). Systematic Random Sampling.
4). Cluster Sampling.
5). Multistage Sampling.
(1). Simple random sampling: A sampling procedure that is used to
select the sample from the population in such a way that each
population unit has an equal and independent chance of being
included in the sample is called simple random sampling.
This is the simplest method to select the sample. This method is
applicable when the population is of homogeneous nature. A simple
random sample can be selected in two ways.
(i). Lottery method:
In this method, all the population units are numbered or named.
Then the numbers or the names are written on different slips or
cards of same size and shape so that a card is not distinguished
from others.
These cards are placed in a box and shuffled well so that no
particular card gets any preference in selection. From that box
sample is selected one by one, till the desired number of units are
selected.
The only drawback of this method is that it is not suitable if the
population size is very large.
(ii). Random number table method:
In this method the sample is selected from the population by
making use of a random number table. A table that contains random
digits arranged in row and column format is known as a random number
table.
Selection process:
A random number table is an arrangement of numbers (for example,
five-digit numbers) in row and column format.
The selection may proceed row-wise or column-wise.
Assign numbers to the population units.
Decide the sample size.
Count the number of digits in the population size, i.e., k.
Read out numbers with k digits from the random number table.
If the number read is greater than the population size, ignore
it and select the next number.
If the number read is less than or equal to the population size, include the
corresponding population unit in the sample.
Continue this process until the required number of sample units is
selected.
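The selection process above can be sketched in Python. This is only an illustration under stated assumptions: a seeded pseudo-random generator stands in for the printed random number table, the function name draw_with_k_digit_numbers is hypothetical, and readings of zero or of an already selected unit are skipped (the notes do not say how to treat them).

```python
import random


def draw_with_k_digit_numbers(population_size, sample_size, seed=0):
    """Select unit numbers 1..population_size by reading k-digit random numbers."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    rng = random.Random(seed)        # stands in for the printed table
    k = len(str(population_size))    # number of digits in the population size
    selected = []
    while len(selected) < sample_size:
        number = rng.randrange(10 ** k)   # one k-digit reading, 0 to 10**k - 1
        if number == 0 or number > population_size:
            continue                      # no such unit: ignore and read on
        if number in selected:
            continue                      # assumption: no unit is taken twice
        selected.append(number)
    return selected


sample = draw_with_k_digit_numbers(population_size=250, sample_size=10)
print(sample)
```

With a population of 250 units, k is 3, so three-digit readings from 251 to 999 are discarded and reading continues until 10 distinct units are found.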
Several standard random number tables are available.
Among them are,
L.H.C. Tippett's random number table: 10,400 four-digit
numbers.
Fisher and Yates' random number table: 15,000 digits arranged in
two-digit numbers.
Kendall and Babington Smith's random number table: 25,000 four-digit
numbers.
RAND Corporation's random number table: 200,000 five-digit
numbers.
Merits:
There is less chance for personal bias.
As the sample size increases; the selected sample will be more
representative one.
Sampling errors can be measured.
This method saves money, time and labor.
Demerits:
This method requires a complete list of the population. But in many
enquiries it is not available.
As the sample size decreases, the sample will not represent the
population.
If the population units are of heterogeneous nature, this method
cannot be employed.
(2). Stratified random sampling: A sampling method that selects the
sample from a heterogeneous population by dividing the population
into homogeneous sub-groups called strata is called stratified
random sampling.
Since the population is of heterogeneous nature, the population
is divided into strata that are of homogeneous nature. From
each stratum a number of sample units is selected, and together
these units constitute the sample.
The two types of stratified random sampling method are,
(i). Proportional method: If the sample is selected from each
stratum in proportion to its size, then the sample is selected by the
proportional method.
(ii). Optimum method: If the sample is selected from each stratum
by considering the cost, then the sample is selected by the optimum
allocation method. That is, based on the cost, the sample is
selected.
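As a rough illustration of the proportional method, the sample size for each stratum can be computed as n multiplied by (stratum size / population size). The sketch below is not part of the original notes: the function name, the stratum labels and the sizes are made up, and giving leftover units to the strata with the largest fractional parts is an assumption used only to make the allocations add up to n.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * size / total
             for name, size in stratum_sizes.items()}
    # Round every share down first.
    allocation = {name: int(share) for name, share in exact.items()}
    shortfall = sample_size - sum(allocation.values())
    # Assumption: leftover units go to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:shortfall]:
        allocation[name] += 1
    return allocation


# Made-up strata: students of a college grouped by year of study.
strata = {"first year": 500, "second year": 300, "third year": 200}
allocation = proportional_allocation(strata, sample_size=100)
print(allocation)
```

Within each stratum the allotted number of units would then be drawn by simple random sampling.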
Merits:
The sample selected by this method is more representative of the
population.
It ensures greater accuracy.
For a heterogeneous population this method is more
reliable.
Demerits:
The process of dividing the population into strata requires more
time, money and experience.
If the stratification is not proper, then sampling bias will
prevail in the sample.
(3). Systematic sampling: A probability sampling method that
selects the sample by making use of an up-to-date, complete list of
population units is called systematic sampling. In this method,
only the first sampling unit is selected at random,
so it is also known as quasi-random sampling. After the
first unit is selected, the remaining units of the sample are
automatically determined by the random start and the sampling interval.
If a complete and up-to-date list of population units is
available, then this method can be used.
Selection procedure:
Assume that we have to select n units from N population
units.
Arrange the items in numerical, alphabetical, geographical
or any other order.
Find the sampling interval k = N / n such that nk = N.
Select the random start i such that 1 ≤ i ≤ k.
Select the i-th, (i+k)-th, (i+2k)-th, ..., (i+(n-1)k)-th
units to constitute the systematic sample.
Hence the random start determines the whole sample.
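The selection procedure above can be sketched as follows. The function name systematic_sample and the list of 100 numbered units are hypothetical, a seeded generator is used so the example is repeatable, and the sketch assumes that N is an exact multiple of n, as the procedure does.

```python
import random


def systematic_sample(units, n, seed=0):
    """Pick n units at a fixed interval k = N / n after a random start."""
    N = len(units)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    k = N // n                   # sampling interval
    rng = random.Random(seed)
    i = rng.randint(1, k)        # random start, 1 <= i <= k
    # Units in positions i, i+k, ..., i+(n-1)k, counting from 1.
    return [units[i - 1 + j * k] for j in range(n)]


units = list(range(1, 101))      # an ordered list of N = 100 units
sample = systematic_sample(units, n=10)
print(sample)
```

Only one random choice is made (the start i); every later unit follows from it, which is why the method is called quasi-random.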
Merit:
This method is simple and operationally more convenient.
Time and work involved in selection procedure is less.
Demerit:
This sample may not represent the population.
If the population size is not a multiple of the sample size, one cannot
get the required number of sampling units.
(4). Cluster sampling: A probability sampling method that
groups the population units into
groups called clusters (groupings of similar objects), and selects the
sampling units through the selection of clusters, is known as
cluster sampling.
Cluster sampling resembles stratified random sampling. The
difference is that in cluster sampling all the units of the selected
clusters constitute the sample, whereas in stratified sampling the
sampling units are selected from within each stratum.
Merits:
It introduces flexibility in sampling method.
It is suitable in large-scale survey, where the list preparation
is difficult.
Demerits:
It is less accurate than other methods.
(5). Multistage Sampling: When the available
resources allow us to concentrate on only a limited number of units for study,
multistage sampling helps a lot. In the national sample survey,
multistage sampling is used. For a total health care programme, the
questions of which village, which house and which person are answered
in this type of sampling.
I stage-Village selection
II stage-Household selection
III stage-Person selection
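The three stages above can be sketched in Python. Everything in this sketch is hypothetical: the function name, the nested dictionary of made-up villages, houses and persons, and the choice of simple random selection at every stage (the notes do not say how units are chosen within a stage).

```python
import random


def multistage_sample(villages, n_villages, n_households, seed=0):
    """Select villages, then households within them, then one person per household."""
    rng = random.Random(seed)
    chosen = []
    for village in rng.sample(sorted(villages), n_villages):            # stage I
        households = villages[village]
        for household in rng.sample(sorted(households), n_households):  # stage II
            person = rng.choice(households[household])                  # stage III
            chosen.append((village, household, person))
    return chosen


# Made-up frame: 4 villages, 5 houses in each, 3 persons in each house.
villages = {
    f"village {v}": {
        f"house {h}": [f"person {v}-{h}-{p}" for p in range(1, 4)]
        for h in range(1, 6)
    }
    for v in range(1, 5)
}
selected = multistage_sample(villages, n_villages=2, n_households=3)
for row in selected:
    print(row)
```

Note that a list of households is needed only for the selected villages, which is where the saving in cost comes from.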
The advantages are a reduction in cost and the ability to
concentrate the available resources on the selected samples. An increase in sampling
error is expected, since the variation among the final
units within a group will be less than the variation between groups.
Unequal sizes at different stages may pose analytical
difficulties.
Another Example:
I stage - Urine sugar positive cases are selected by screening
tests.
II stage - All positive cases from stage I are subjected to PPBS, and
those who are above the critical level of PPBS are selected.
III stage - Among those above the critical level of PPBS, retinoscopy for
diabetic retinopathy is done and positive retinopathy cases are
selected.
Non-probability sampling: The sampling method that does not
follow any standard procedure and selects the units with unknown
probability is called non-probability sampling. This method is
directly opposite to the probability sampling method.
The three types of non-probability sampling methods are,
1. Judgment or purposive sampling.
2. Convenience sampling.
3. Quota sampling.
Judgment/purposive sampling: The sampling method that selects
the sample units to achieve a specific purpose is called
judgment or purposive sampling. In this method the sampler's
choice plays a major role in selecting the sampling units.
For example, to study the cultural activity of the students
in a particular college, the sampler has to select the students who
are interested in cultural activity. Only then does the study yield
a valid conclusion. Otherwise the sample does not reflect the
population characteristic, i.e., the cultural skill of the college. Hence he
has to find the students who are involved in that activity; from
them the investigator has to collect the information.
Merits:
It is a simple method.
The sample collected is more representative.
This method can be adopted for public policy, to make decisions,
etc.
Demerits:
Due to the sampler's interest, the sample may not be a true representative
of the population.
It is difficult to correct sampling errors.
The estimates will not be accurate.
Quota sampling: This method is similar to stratified random
sampling.
In this method the population is divided into various quotas and
then the sample is selected from each quota. The sample size per
quota is fixed by personal judgment. This is also known as stratified
purposive sampling.
Merits:
This method saves money and time.
Demerits:
The result depends on the investigators.
Personal bias is possible.
Since sample selection is not based on random sampling, sampling
errors cannot be estimated.
Convenience sampling: The sampling method that selects the
sample units based on the convenience of the investigator is called
convenience sampling. If
The universe is not clearly defined.
Sample unit is not clear.
Complete list is not available.
Then this method can be used.
Demerits:
This sample is not true representative of population
The results are biased.
But this method can be used for pilot study.
Applications of Sampling Designs
1. Identification of predisposing factors, precipitating factors
and perpetuating factors which influence health and disease.
2. Evaluation of health programmes.
3. Impact studies.
4. Coverage surveys.
5. Planning, administration and implementation of
activities.
6. Forecasting the future.
7. Environmental studies.
8. Evaluation of health status.
PRESENTATION OF DATA
After the data collection is over, the researcher has raw data
(i.e., the information prior to proper arrangement is known as
raw data). They are huge and confusing. As such, the researcher
cannot carry out analysis on them and they will not furnish any useful
information. So, to condense the data and present them in a compact
manner, we go for presentation of data. Presentation of data has
three main types. They are,
1. Classification,
2. Tabulation, and
3. Graphical representation.
Classification: The process of arranging the data into sequences
and groups according to their common characteristics and separating
them into different but related parts is called as
classification.
Objects:
The raw data are classified,
To condense the mass of data.
To present the data in simpler form.
To differentiate the similarity and dissimilarity among the
data.
To facilitate comparison and statistical treatment.
To bring out relation.
To facilitate further analysis.
To eliminate the unnecessary data.
Rules for classification:
The classes should be rigidly defined, i.e., there should not be
any ambiguity in their rules.
The classes should not overlap, i.e., each item of data must have
its place in only one class.
The classification must be flexible enough to adjust to new
situations.
The items included in the total and sub-totals of class and subclass
must be the same.
Types of classification:
Geographical classification: Classifying the data based on the
area of its occurrence such as states, districts, Taluks etc., is
called as geographical classification.
Chronological classification: Classifying the data based on the
time of its occurrence such as decades, Years, Months, etc., is
called as chronological classification.
Quantitative classification: Classifying the data based on some
characteristics that is capable of quantitative measurement like
age, price, weight etc., is called as quantitative
classification.
Qualitative classification: Classifying the data based on the
qualitative characteristics such as sex, honesty, literacy, etc.,
is called as qualitative classification.
That is, presence or absence of the characteristic is presented
in this type of classification.
Tabulation: The systematic arrangement of numerical data in the
form of rows and columns in accordance with some characteristics is
called as tabulation.
Objects:
To simplify complex data.
To clarify characteristics of data.
To facilitate comparison.
To detect errors and omissions in the data.
To facilitate statistical processing.
The parts of table are:
1. Table number,
2. Title,
3. Head note,
4. Caption,
5. Stubs,
6. Body of table,
7. Foot-note,
8. Source-note.
The table number is used for identification and future reference of
the table. For reference and explanation the columns may also
have numbers.
Each table has to be given a suitable title. Suitable in the
sense that it must describe the content of the table.
Head note is a statement about the table that is placed below
the table title within brackets. Usually the units of measurement of the
table entries are given here, such as in millions, in crores, etc.
The headings of the columns are called captions. They must be
brief and self-explanatory. A caption may have sub-headings.
The row headings are called stubs.
The most important part of the table, which contains the numerical
information, is called the body of the table. To provide any explanation
about the items in the table, a footnote is used.
Types of tabulation:
1. One-way tabulation,
2. Two way tabulation, and
3. Manifold tabulation.
One-way Table: The table that displays information on a single
variable is called as one-way table or univariate table. The
variable may be discrete or categorical.
Two-way Table: The table that displays information on categories
of a single variable over the categories of another variable is
known as two-way table or bi-variate table.
Manifold table: The table that shows information on the categories of
more than two variables is known as a manifold table.
Frequency Distribution: A tabulation type that summarizes the
raw data in the form of table along with variable values or
variable class intervals and their corresponding frequencies is
known as Frequency table. It may be one-way or two-way or manifold
type.
Moreover, Frequency table
1) Organizes the data in a compact manner without loss of
essential information.
2) Describes how the total frequency is distributed over different
classes or discrete points.
There are three types of frequency tables. They are,
1. Discrete frequency table.
2. Continuous frequency table.
3. Relative frequency table.
Discrete Frequency table: A Frequency table that shows the
distribution of frequencies at different distinct values of
variable is known as discrete frequency table.
Procedure to form discrete frequency table:
1. Draw a table with three columns, namely variable, tally marks
and frequency.
2. Take the first observation.
3. Write down the observation in the variable column and put a
tally mark (|) against the written observation in the tally mark
column.
4. Take the next observation.
5. Check whether the observation is already entered in the variable
column or not.
6. If it is entered, put another tally mark against the written
observation. Else, go to step 3.
Repeat steps 4 to 6 until all the
observations are entered in the table.
7. Count the number of tally marks for each value of the variable and
put the totals in the frequency column.
8. The resultant table is called the discrete frequency
table.
9. If any row already has four tally marks, then the next
occurrence of that value is marked by putting a cross mark over
the four bars. This facilitates counting.
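The tallying procedure above can be sketched in Python, where collections.Counter does the work of the tally marks. The function name and the data (number of children in ten families) are made up purely for illustration.

```python
from collections import Counter


def discrete_frequency_table(observations):
    """Return rows of (value, tally marks, frequency), sorted by value."""
    counts = Counter(observations)   # one tally per occurrence of each value
    return [(value, "|" * counts[value], counts[value])
            for value in sorted(counts)]


# Made-up observations: number of children in ten families.
children_per_family = [2, 1, 3, 2, 0, 2, 1, 3, 2, 1]
table = discrete_frequency_table(children_per_family)
for value, tally, frequency in table:
    print(f"{value:>8}  {tally:<6}  {frequency}")
```

The frequencies in the last column add up to the total number of observations, which is a useful check on any hand-made tally.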
Continuous Frequency table: A Frequency table that shows the
distribution of frequencies over different class intervals of
values is known as continuous frequency table.
Procedure to form Continuous frequency table:
1. Draw a table with three columns, namely variable, tally marks
and frequency.
2. Find the smallest and largest observations in the data
set.
3. Decide the class interval.
4. Write down the class limits with equal class intervals under
the heading variable.
5. Take the first observation.
6. Decide in which class it falls.
7. Put a tally mark (|) against the variable class in the tally
mark column.
8. Take the next observation.
9. Repeat steps 6 to 8 until all the
observations are entered in the table.
10. Count the number of tally marks for each variable class and put
the totals in the frequency column.
11. The resultant table is called the continuous frequency
table.
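A minimal sketch of the procedure above is given below. The function name, the heights and the class limits are made up, and the sketch assumes the exclusive method of classification, in which an observation equal to an upper class limit is counted in the next class (the notes do not specify this).

```python
def continuous_frequency_table(observations, lower, width, n_classes):
    """Count observations in equal-width classes [lower, lower + width), ..."""
    frequencies = [0] * n_classes
    for x in observations:
        index = int((x - lower) // width)   # class whose interval contains x
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} falls outside the class limits")
        frequencies[index] += 1
    return [((lower + j * width, lower + (j + 1) * width), frequencies[j])
            for j in range(n_classes)]


# Made-up observations: heights of ten students in cm.
heights_cm = [152, 167, 158, 171, 163, 149, 175, 160, 166, 155]
table = continuous_frequency_table(heights_cm, lower=145, width=10, n_classes=4)
for (low, high), frequency in table:
    print(f"{low}-{high}: {frequency}")
```

Here the smallest and largest observations (149 and 175) decide the range to be covered, and four classes of width 10 starting at 145 cover it.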
Relative Frequency Table: A frequency distribution in which the
frequencies are expressed as fraction or percentage of total number
of observations is known as relative frequency distribution.
It is noted that, the sum of relative frequency is equal to one
when the frequencies are expressed as fractions and the total is
100 when the frequencies are expressed as percentage.
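The two totals noted above can be checked with a short sketch. The class labels and frequencies are made up, the function name is hypothetical, and exact fractions are used so that the sum comes out to exactly one.

```python
from fractions import Fraction


def relative_frequencies(frequencies):
    """Express each frequency as a fraction of the total number of observations."""
    total = sum(frequencies.values())
    return {label: Fraction(count, total)
            for label, count in frequencies.items()}


# Made-up class frequencies.
frequencies = {"145-155": 2, "155-165": 4, "165-175": 3, "175-185": 1}
relative = relative_frequencies(frequencies)
percentages = {label: float(share * 100) for label, share in relative.items()}

print(sum(relative.values()))      # total of the fractions
print(sum(percentages.values()))   # total of the percentages
```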
Graphical representation:
Classification and tabulation are used to present the data in
a neat, concise, systematic and understandable manner. But when a
large amount of information extends over a large number of
columns, it is difficult to understand the significance of the data. Hence,
statisticians introduced diagrams and
graphs.
Classification is the process of grouping data into
homogeneous groups or categories. Tabulation is the process of
presenting the classified data in tabular form. The process of
highlighting the salient features of a study through graphs and
charts is called graphical representation. This type of
presentation makes the data easy to understand. Moreover, attractive graphs
and charts can be understood at a glance even by a layman.
Merits:
Diagrams are attractive and create interest in the minds of
readers.
Diagrams are easily understandable even to the layman.
In interpretation, a diagram saves much time,
i.e., human beings may not like to go through numerical figures, but
they may like to go through diagrams.
Diagrams make data simple,
i.e., diagrams are remembered at a glance and readers can
easily understand the pattern of the data.
A diagram facilitates comparison of two or more sets of
data.
Diagrams reveal more information than data in a table.
Limitations:
Diagrams cannot be used for further analysis.
Diagrams show approximate values only.
They expose only limited facts,
i.e., all details cannot be presented in the form of
diagrams.
Construction of diagrams needs some intelligence and
experience.
Diagrams supplement tabulation; they are not an alternative to it.
Rules for making diagrams:
Every diagram must be given a suitable title in bold
letters.
The title conveys the main fact depicted by the diagram.
Sub-headings may also be given.
Title should be brief and self-explanatory.
To allow comparison, the diagram must be drawn accurately and
neatly.
Each diagram should be numbered for further reference.
The type of diagram should be selected according to the nature
of data.
When many items are shown in the diagram through different
patterns such as dots, crossed lines, etc., an index must be given.
The diagram must be simple enough to be understood by the layman.
There are two types of graphical representation. They are,
1. Graphs,
a. Frequency curves,
b. Frequency polygon, and
c. Ogives.
i. Less than ogives, and
ii. More than Ogives.
2. Charts/ Diagrams.
a. Bar chart,
i. Simple bar chart,
ii. Multiple bar chart,
iii. Stacked bar chart, and
iv. Percentage bar chart.
b. Pie chart, and
c. Histogram.
One-dimensional diagram: A diagram that is drawn for a single
data set is called a one-dimensional diagram. The bar and pie
diagrams belong to this type.
Bar chart: The visual representation of (qualitative or
categorical or discrete numerical) data by bars is called a bar chart. The
height of each bar is proportional to the frequency. The bars may be
horizontal or vertical. The distances between the bars are kept
uniform. Bar charts are drawn only for single discrete quantitative
or categorical variables.
The types of bar diagrams are
Simple bar chart.
Multiple bar chart,
Stacked bar chart.
Simple bar chart: The bar diagram that is drawn for a single set
of categorical or numerical data is called a simple bar
diagram.
Multiple bar chart: The bar diagram that is drawn for a single
variable with more than one phenomenon is called a multiple bar
diagram. This facilitates comparison. The bars for the categories of a
single variable are drawn side by side. The differentiation is
shown by different colors or patterns such as lines, dots, etc.
Stacked bar chart: A type of bar diagram that is drawn for a
single variable with any number of (categorical or numerical)
categories is called a stacked bar diagram. In this diagram the
categories of the categorical variable are placed on the bar by dividing
the bar into portions.
Percentage bar chart: A percentage bar diagram is a kind of
stacked bar chart drawn for the percentage frequencies of
categorical variables, with all bars of equal height. The division of
the bars into categories is made
according to the percentages. In this case the bars are all of equal height,
100%, whereas in the stacked bar diagram the heights of the bars are
unequal; that is, the bars are proportional to the frequencies of the
base variable's categories.
Pie diagram: The graphical representation of a single variable's
categories in circle form is called a pie diagram. In this graph the
circle is divided into various slices based on the frequencies.
This type of diagram can be understood at a
glance. Each slice is sized by taking the whole data as equal to
360 degrees.
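Since the whole data corresponds to 360 degrees, the angle of each slice is (category frequency / total frequency) x 360. The sketch below only computes the angles; the function name and the category counts are made up for illustration.

```python
def pie_angles(frequencies):
    """Angle in degrees of each slice, taking the whole data as 360 degrees."""
    total = sum(frequencies.values())
    return {label: 360 * count / total for label, count in frequencies.items()}


# Made-up counts of 100 patients by main complaint.
angles = pie_angles({"caries": 45, "gingivitis": 30,
                     "malocclusion": 15, "other": 10})
for label, angle in angles.items():
    print(f"{label}: {angle} degrees")
```

The angles of all the slices add up to 360 degrees, which serves as a check before the diagram is drawn.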
Relative Frequency Histogram: A histogram constructed with the
help of relative frequencies rather than absolute frequencies is
known as relative frequency histogram.
Histogram: A bar diagram where the bars are constructed
continuously (without leaving space between bars) on the class
intervals in such a way that the heights of the bars are proportional to
the frequencies of the respective classes is known as a histogram.
Frequency polygon: The graph formed by plotting the frequencies
against the mid points of continuous frequency distribution and
joining the points by straight lines is known as Frequency
polygon.
This can also be obtained from the histogram by joining the top
mid points of bars with straight lines.
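The points of a frequency polygon can be computed as sketched below: each class is represented by its mid point, (lower limit + upper limit) / 2, paired with its frequency. The function name, class limits and frequencies are made up, and the sketch stops at the list of points rather than drawing the graph.

```python
def frequency_polygon_points(class_limits, frequencies):
    """Pair the mid point of each class with its frequency."""
    if len(class_limits) != len(frequencies) + 1:
        raise ValueError("need one more class limit than frequencies")
    mid_points = [(low + high) / 2
                  for low, high in zip(class_limits, class_limits[1:])]
    return list(zip(mid_points, frequencies))


# Made-up continuous frequency distribution.
limits = [145, 155, 165, 175, 185]
frequencies = [2, 4, 3, 1]
points = frequency_polygon_points(limits, frequencies)
print(points)
```

Joining these points by straight lines gives the polygon; joining them by a smooth free-hand curve gives the frequency curve.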
Frequency Curve:
The graph that is formed by plotting the frequencies against the
mid points of a continuous frequency distribution and joining the
points by a free-hand curve is known as a frequency curve.
This can also be obtained from the histogram by joining the top
mid points of the bars with a free-hand curve.
Ogives:
The graph obtained by plotting the cumulative frequencies
against the class limits of a continuous frequency distribution is
known as an Ogive.
The two types of Ogives are,
1. Less than Ogive.
2. More than Ogive.
Less than Ogive:
The graph obtained by plotting the less than cumulative
frequencies against the upper class limits of a continuous frequency
distribution and joining the points with a smooth curve is known as
the less than Ogive.
More than Ogive:
The graph obtained by plotting the more than cumulative
frequencies against the lower class limits of a continuous frequency
distribution and joining the points with a smooth curve is known as
the more than Ogive.
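The points to be plotted for the two Ogives can be sketched as follows. This is an illustration only: the class limits and frequencies are hypothetical, and the variable names are assumptions made for the example.

```python
from itertools import accumulate

# Hypothetical continuous frequency distribution (illustrative values only).
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [4, 9, 15, 8, 4]

# Less than Ogive: cumulative frequency up to each upper class limit.
less_than_cf = list(accumulate(frequencies))
less_than_points = [(upper, cf)
                    for (_, upper), cf in zip(class_limits, less_than_cf)]

# More than Ogive: cumulative frequency from each lower class limit upward.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]
more_than_points = [(lower, cf)
                    for (lower, _), cf in zip(class_limits, more_than_cf)]

print(less_than_points)
print(more_than_points)
```

Joining each list of points with a smooth curve gives the corresponding Ogive.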
DATA ANALYSIS
The process of drawing or obtaining representative measures
from a large mass of raw data is called data analysis. Statistical
methods are used to carry out the analysis; hence it is called
statistical data analysis.
The three types of data analysis are
Univariate data analysis.
Bivariate data analysis.
Multivariate data analysis.
Univariate data analysis:
Analyzing or drawing a representative measure for a
one-dimensional data set (it may be raw, grouped or ungrouped) is
called univariate data analysis. That is, the characteristics of a
single data set are studied. The four types of univariate data
analysis tools are,
1. Measures of Central Tendency,
2. Measures of Dispersion,
3. Skewness, and
4. Kurtosis.
Bivariate data analysis:
Analyzing or obtaining a representative measure for two sets
of variables by considering both variables simultaneously is
called bivariate data analysis. The variables may be
quantitative or qualitative.
The two types of bivariate measures are,
Associative measure and
Functional measure
Associative measure: The measure that is used to measure the
inter-relationship between the two types of variables is called
associative measure.
The two types of associative measures are,
Correlation and
Chi-square association
Chi-square association: The bivariate method that is used to
measure the relationship between two qualitative variables is
called chi square association method. This method tests whether the
two qualitative variables are dependent or independent.
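As a rough illustration of the chi-square association method, the sketch below computes the Pearson chi-square statistic for a 2 x 2 table. The counts are hypothetical, and the statistic is written out by hand rather than taken from a statistics library.

```python
# Hypothetical 2 x 2 table of counts (illustrative values only):
# rows = level of the first qualitative variable,
# columns = level of the second qualitative variable.
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total x column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi_square, 3), degrees_of_freedom)
```

The computed statistic is compared with the tabulated chi-square value for the given degrees of freedom (3.841 at the 5% level for 1 degree of freedom); a larger computed value suggests that the two variables are dependent.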
Functional measure: The process of finding relationship between
the two sets of variables in the form of equation is called
functional measure. In this case, variables can be classified as
dependent and independent. The statistical method that finds the
functional relation of two sets of variables is known as regression
analysis.
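A minimal sketch of a functional measure is the least-squares line y = a + b x. The text names regression analysis without giving formulas, so the least-squares estimates used here are an assumption about the intended method, and the paired data are hypothetical.

```python
# Hypothetical paired observations (illustrative values only).
x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of cross-products and squares of deviations from the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

# Least-squares estimates for the line y = intercept + slope * x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"y = {intercept:.2f} + {slope:.2f} x")
```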
Multivariate analysis:
The simultaneous study of several related and equally important
random variables is called multivariate data analysis. That is,
multivariate tools are used to deal with a larger number of variables
under study.
Multivariate analysis is classified into
Dependence analysis and
Interdependence analysis
Dependence analysis:
The method of studying the association between two sets viz.
dependent and independent variables is called dependence analysis.
That is, the relationship between the dependent set and independent
set is analyzed by this dependence analysis.
The five dependence analysis methods are,
Multiple regression,
Discriminant analysis,
Logit analysis,
Multivariate analysis of variance and
Canonical correlation.
Interdependence analysis:
The method of analyzing mutual association across all the
variables is called interdependence analysis. In this analysis no
distinction is made between dependent and independent variables.
The five interdependence methods are,
Principal component analysis
Factor analysis
Cluster analysis
Log linear models and
Multidimensional scaling
Factor analysis: A data reduction technique that studies the
inter-relationship among a set of variables by introducing a new set
of variables that are fewer in number than the original set of
variables is called factor analysis.
Profile analysis: The graphical method of comparing a number of
ordinal variables across different groups is called profile
analysis. That is, the pattern of opinion common to the groups on the
ordinal variables is studied.
Friedman test: A non-parametric statistical method that is
applied to ranked data to assess the agreement in ranking among
the respondents about the various factors is called the
Friedman test.
Kendall's W test: This procedure is similar to the Friedman test. The
merit of this method is that it provides Kendall's coefficient of
concordance, which represents the amount of agreement among the
respondents.
Logistic regression: The statistical method that is used to study a
dichotomous response variable, which is explained by a number of
explanatory variables, is called logistic regression. (The
explanatory variables may be ordinal, interval or ranking data.)
The assumptions for logistic regression are,
The response variable is binary.
The log-odds (logit) of the response is a linear function of the
explanatory variables.
DESCRIPTIVE STATISTICS
Measures of Central Tendency:
A measure of central tendency is a single representative measure
that describes the characteristics of the entire mass of data.
There are three types of measures of central tendency. They
are
Mean (Arithmetic Mean, Weighted Mean, Geometric Mean, Harmonic Mean),
Median,
Mode.
The characteristics of good average are:
It should be precisely (rigidly) defined.
It should be
Easy to understand.
Easy (Simple) to compute.
Based on all observation.
Capable of further analysis.
Its definition should be in the form of mathematical
formula.
It should not be influenced by extreme values.
It should have sampling stability. (Least affected by sampling
fluctuations)
Merits of averages:
It facilitates quick understanding of complex data:
The purpose of an average is to represent a group of values in a
simple and concise manner. That is, an average condenses the mass
of data into a single figure.
It facilitates comparison.
It helps in drawing conclusions about the universe from a sample.
It helps in decision-making.
It establishes mathematical relationships.
Mean: A single representative figure for a mass of data, obtained
by adding together all the values and dividing the sum by the total
number of observations, is called the mean. That is, if the series
x1, x2, x3, ..., xn has n observations, then the mean value of this
series is
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
This is the most widely used measure of central tendency.
Properties:
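1. The sum of the deviations taken from the arithmetic mean is zero.
(i.e.,) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
2. The sum of the squared deviations taken from the mean is smaller
than that taken from any other value; that is, it is a minimum.
(i.e.,) $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2$,
where A is any value and $\bar{x}$ is the mean of the observations.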
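The two properties of the arithmetic mean can be checked numerically. The data below are hypothetical, and the helper name sum_of_squares_about is an assumption introduced for this illustration.

```python
# Hypothetical observations (illustrative values only).
values = [4, 8, 15, 16, 23, 42]

mean = sum(values) / len(values)

# Property 1: the deviations from the mean add up to zero.
sum_of_deviations = sum(v - mean for v in values)


def sum_of_squares_about(a):
    """Sum of squared deviations of the observations from the value a."""
    return sum((v - a) ** 2 for v in values)


# Property 2: the sum of squares is smallest when taken about the mean.
print(mean, sum_of_deviations)
for a in (10, 17, mean, 19, 30):
    print(a, sum_of_squares_about(a))
```

In the printed list the smallest sum of squares appears in the row where a equals the mean.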
Merits:
It is easy to understand and calculate.
It is used in further calculations.
It is based on all the items.
It provides a good basis for comparison.
It is a more stable measure.
It is considered a good or ideal average.
Demerits:
Mean is unduly affected by extreme values.
It is unrealistic.
It may lead to wrong conclusion.
It is not useful for studying the qualitative characters.
It is not a suitable measure in the case of a highly skewed
distribution.
It gives greater importance to the bigger values and less
importance to the smaller values in the series.
It cannot be calculated for a frequency distribution with open-end
classes.
Median: A measure of location that divides the ordered series of
values into two equal parts is called the median. That is, one part
of the data set contains the items less than the median and the
other part contains the items greater than the median, and the
number of observations on both sides is equal.
1). For ungrouped data:
a. Arrange the observations in either ascending or descending
order of magnitude.
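b. Find the number of observations in the data set (i.e., n).
c. If n is odd, then the median of the data set is the
$\left(\frac{n+1}{2}\right)$th observation.
d. If n is even, then the median of the data set is the mean of the
$\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations.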
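The steps for ungrouped data can be sketched as a small function. The function name and the sample values are assumptions made for illustration; for an even number of observations the usual convention of averaging the two middle observations is used.

```python
def median_ungrouped(observations):
    """Median of raw (ungrouped) data, following steps a to d."""
    ordered = sorted(observations)          # step a: arrange in order
    n = len(ordered)                        # step b: number of observations
    middle = n // 2
    if n % 2 == 1:
        # step c: n odd, the ((n + 1) / 2)th observation (1-based position)
        return ordered[middle]
    # step d: n even, mean of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_ungrouped([7, 3, 9, 5, 1]))
print(median_ungrouped([7, 3, 9, 5]))
```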
2). For grouped data: (Discrete frequency distribution)
1. Form the cumulative frequencies.
2. Find N/2, where N is the total frequency.
3. Find the cumulative frequency just greater than N/2.
4. The observation (x value) that corresponds to that frequency
is the median of the set of observation.
3). For grouped data: (Continuous frequency distribution)
1. Form the cumulative frequencies.
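2. Find N/2, where N is the total frequency.
3. Find the cumulative frequency just greater than N/2.
4. Find its corresponding class; it is the median class.
5. Find the median by using the formula
$\text{Median} = l + \frac{\frac{N}{2} - c}{f} \times h$
where l is the lower limit of the median class, f is the frequency of
the median class, c is the cumulative frequency of the class preceding
the median class, and h is the width of the median class.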
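The procedure for a continuous frequency distribution can be sketched as below. The sketch assumes the standard interpolation formula Median = l + ((N/2 - c) / f) x h, with l the lower limit of the median class, c the cumulative frequency before that class, f its frequency and h the class width. The data are hypothetical and equal class widths are assumed.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of a continuous frequency distribution with equal class widths."""
    half = sum(frequencies) / 2             # N / 2
    cumulative = 0                          # cumulative frequency so far (c)
    for lower, f in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative
        # frequency is just greater than N / 2.
        if cumulative + f > half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency distribution is empty")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [4, 9, 15, 8, 4]
median_value = grouped_median(lower_limits, 10, frequencies)
print(round(median_value, 2))
```

Here N/2 = 20, the median class is 20-30, and the interpolation gives 20 + (20 - 13) / 15 x 10.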
Merits:
It is easy to understand and compute.
It is quite rigidly defined.
It eliminates the effect of extreme items.
It is amenable to further process.
The median can be calculated even for qualitative phenomena.
Its value generally lies in the distribution.
It can be calculated for frequency distribution with open-end
class interval.
This can be located graphically.
Demerits:
If the series is of irregular nature, median cannot be
computed.
It ignores the extreme values.
In the case of a continuous distribution, or of an even number of
observations, the median is estimated rather than calculated exactly.
It is not based on all observations.
It is not amenable to algebraic treatments.
It is affected by the fluctuations of sampling.
It cannot be calculated directly for a continuous frequency
distribution with inclusive-type class intervals. To calculate the
median, the class intervals have to be converted into exclusive-type
class intervals by applying the correction value to both the limits
(upper and lower).
Mode: The single value that appears more frequently than any other
observation in the data set is called the mode.
1). For ungrouped data:
i). Count the frequency of each observation.
ii). The observation that has occurred the greatest number of times
is the mode of that data set.
2). For grouped data: (Discrete frequency distribution)
i). From the frequency distribution identify the highest
frequency.
ii). The observation corresponding to the highest frequency is
the mode of the distribution.
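3). For grouped data: (Continuous frequency distribution)
i). From the frequency distribution identify the highest
frequency.
ii). The class interval corresponding to the highest frequency
is the modal class.
iii). Find the mode by using the formula
$\text{Mode} = l + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h$
where l is the lower limit of the modal class, f_1 is the frequency of
the modal class, f_0 and f_2 are the frequencies of the classes
preceding and succeeding the modal class, and h is the width of the
modal class.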
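The steps for the mode of a continuous frequency distribution can be sketched as follows. The sketch assumes the standard formula Mode = l + ((f1 - f0) / (2 f1 - f0 - f2)) x h, where f1 is the frequency of the modal class and f0, f2 are the frequencies of its neighbouring classes. The data are hypothetical, class widths are assumed equal, and if two classes tie for the highest frequency the first one is taken.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode of a continuous frequency distribution with equal class widths."""
    i = frequencies.index(max(frequencies))     # position of the modal class
    f1 = frequencies[i]
    # A missing neighbour (first or last class) is treated as frequency 0.
    f0 = frequencies[i - 1] if i > 0 else 0
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined for a flat distribution")
    return lower_limits[i] + (f1 - f0) / denominator * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [4, 9, 15, 8, 4]
mode_value = grouped_mode(lower_limits, 10, frequencies)
print(round(mode_value, 2))
```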
Merits:
It is easy to understand and calculate.
It is not affected by extreme values.
It is simple and precise.
It can be located by mere inspection.
It can be determined by the graphic method.
Its value can be determined for a distribution with open-end class
intervals.
Demerits:
It is ill-defined (if two observations occur an equal number of
times, as in a bi-modal distribution, the mode cannot be determined
uniquely).
It is not amenable to further mathematical treatment.
It is not based on all observations.
It is difficult to compute when there are both positive and
negative data in the series.
It is stable only when the sample size is large.
If there are both positive and negative values, or if one or
more observations are zero, we cannot find the mode of the
distribution.
Comparison of Measures of Central Tendency Tools:
Characteristics             Mean              Median                Mode
Precise definition          Given             Given                 Not given
Procedure understanding     Easy              Easy                  Easy
Calculation                 Easy              Easy                  Easy
Observations utilization    All observations  Not all observations  Not all observations
Further treatment           Amenable          Not amenable          Not amenable
Sampling fluctuations       Least affected    Much affected         Much affected
Effect of extreme values    Much affected     Not affected          Not affected
From the comparison table of the measures of central tendency
it is noted that, among these tools, the mean holds most of the
characteristics of an ideal average. Hence, the mean is considered a
good or ideal average.
Measures of dispersion:
The statistical tool that measures the variation or the
scatter of values about their representative (central) value is
called dispersion.
Properties of a good measure of variation are,
It should be easy to calculate and understand.
It should be rigorously defined.
It should be based on all observations and amenable to further
treatment.
It must have sampling stability.
It should not be affected by extreme values.
The types of measures of dispersion are,
Range,
Variance and Standard Deviation,
Mean deviation.
Range: The simplest measure of dispersion, calculated by
subtracting the minimum value from the maximum value of the data
set, is called the range.
i.e., Range = maximum value - minimum value.
Standard deviation: The most widely used measure of dispersion,
defined as the positive square root of the arithmetic mean of the
squared deviations of the values from their arithmetic mean, is called
the standard deviation. The population standard deviation is denoted
by $\sigma$.
The deviations are squared so that the negative and positive
deviations do not cancel each other.
The formula for calculating the standard deviation is
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
where N = population size and $\mu$ = population mean.
If we have a sample, then the sample standard deviation (s) is
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
where n = sample size and $\bar{x}$ = sample mean.
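The range and the two forms of the standard deviation can be compared on a small data set. The values are hypothetical. The sketch assumes the divisor N for a population and n - 1 for a sample, which is also what Python's statistics.pstdev and statistics.stdev use.

```python
import math
import statistics

# Hypothetical observations (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)          # maximum value - minimum value

mean = sum(data) / len(data)
squared_deviations = sum((x - mean) ** 2 for x in data)

# Treating the data as a whole population: divide by N.
population_sd = math.sqrt(squared_deviations / len(data))

# Treating the data as a sample: divide by n - 1.
sample_sd = math.sqrt(squared_deviations / (len(data) - 1))

print(data_range, population_sd, round(sample_sd, 3))

# Cross-check against the standard library implementations.
print(statistics.pstdev(data), round(statistics.stdev(data), 3))
```

The sample standard deviation is always slightly larger than the population form computed from the same values, because of the smaller divisor.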
Merits:
It is rigorously defined.
Its value is always definite.
It is based on all observation of data.
It is amenable for further analysis.
It is less affected by sampling fluctuations.
It serves as a basis for the coefficient of correlation, sampling
and statistical inference.
It is the most appropriate measure of the variability of a
distribution.
As the best measure of dispersion, it possesses most of the
characteristics of an ideal measure of dispersion.
Demerits:
It is not easy to understand and calculate.
It gives more weight to extreme values by squaring them.
It cannot be used directly to compare the dispersion of data sets
measured in different units or having very different means.
Co-efficient of variation or relative measure: This is a measure
of relative variation rather than absolute variation. In order to
decide which of two distributions is more variable, we compare
their coefficients of variation; the distribution with the greater CV
is said to be more variable. The coefficient of variation expresses
the standard deviation as a percentage of the mean. The formula is
given by
$CV = \frac{\sigma}{\mu} \times 100$ (where $\sigma$ is the population
standard deviation and $\mu$ is the population mean) (or)
$CV = \frac{s}{\bar{x}} \times 100$ (where s is the sample standard
deviation and $\bar{x}$ is the sample mean)
To compare the variability of data sets, find the co-efficient of
variation of each. The data set with the greater co-efficient
of variation has more variability (that is, it is less precise / less
consistent / less homogeneous).
Uses of coefficient of variation (C.V):
(i). The standard deviation is useful as a measure of variation
within a given set of data. When one desires to compare the
dispersion in two sets of data, however, comparing the two standard
deviations may lead to fallacious results.
(ii). It is used when the two variables being compared are measured
in different units.
Example
We may wish to know, for a certain population, whether serum
cholesterol levels, measured in milligrams per 100ml, are more
variable than body weight, measured in pounds.
(iii). Although the same unit of measurement is used, the two
means may be quite different.
Example
If we compare the standard deviation of weights of first grade
children with the standard deviation of weights of high school
freshmen, we may find that the latter standard deviation is
numerically larger than the former, because the weights themselves
are larger, not because the dispersion is greater.
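The weight comparison in the last example can be sketched numerically. The means and standard deviations below are hypothetical summary figures chosen only to illustrate the point, and the function name is an assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Hypothetical summary figures, weights in pounds (illustrative only).
cv_first_graders = coefficient_of_variation(sd=5.0, mean=50.0)
cv_freshmen = coefficient_of_variation(sd=10.0, mean=125.0)

print(f"First grade children: CV = {cv_first_graders:.1f}%")
print(f"High school freshmen: CV = {cv_freshmen:.1f}%")
```

With these figures the freshmen have the larger standard deviation (10 against 5) but the smaller coefficient of variation, so relative to their mean their weights are less variable.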
PROBABILITY DISTRIBUTIONS
The relationship between the values of a random variable and the
probabilities of their occurrence may be summarized by means of a
device called a probability distribution. A probability
distribution may be expressed in the form of a table, a graph, or a
formula. Knowledge of the probability distribution of a random
variable provides the clinician researcher with a powerful tool for
summarizing and describing a set of data and for reaching
conclusions about a population of data on the basis of a sample of
data drawn from the population.
There are two types of probability distribution:
(1) Discrete
(2) Continuous
Probability distribution of a discrete random variable
The probability distribution of a discrete random variable is a
table, graph, formula, or other device used to specify all possible
values of the random variable along with their respective
probabilities.
The following are two essential properties of a probability
distribution of a discrete variable:
1. $0 \le P(X = x) \le 1$
2. $\sum P(X = x) = 1$
The following are examples of discrete probability
distributions
1. Binomial
2. Poisson
THE BINOMIAL DISTRIBUTION
The binomial distribution is one of the most widely encountered
probability distributions in applied statistics. The distribution
is derived from a process known as a Bernoulli trial, named in
honor of the Swiss mathematician James Bernoulli (1654-1705), who
made significant contributions in the field of probability,
including, in particular, the binomial distribution. When a random
process or experiment, called a trial, can result in only one of
two mutually exclusive outcomes, such as dead or alive, sick or
well, male or female, the trial is called a Bernoulli trial.
The Bernoulli process: A sequence of Bernoulli trials forms a
Bernoulli process under the following conditions.
1. Each trial results in one of two mutually exclusive outcomes.
One of the possible outcomes is denoted (arbitrarily) as a success,
and the other is denoted a failure.
2. The probability of a success, denoted by p, remains constant
from trial to trial. The probability of a failure, 1-p, is denoted
by q.
3. The trials are independent; that is, the outcome of any
particular trial is not affected by the outcome of any other
trial.
Example 1:
We are interested in being able to compute the probability of x
successes in n Bernoulli trials. For example, suppose that in a
certain population 52% of all recorded births are males. We
interpret this to mean that the probability of a recorded male
birth is 0.52. If we randomly select five birth records from
this population, what is the probability that exactly three of the
records will be for male births?
Solution: Suppose the five birth records selected result in this
sequence of sexes
MFMMF
In coded form (1 = male, a success; 0 = female, a failure) we would
write this as
10110
The probability of a success is denoted by p = 0.52,
and the probability of a failure is denoted by q = 1 - p = 1 - 0.52
= 0.48.
The probability of the above sequence of outcomes is found by
means of the multiplication rule to be
P(1, 0, 1, 1, 0) = pqppq = q^2 p^3
Three successes and two failures could occur in any of the
following sequences (sequence 1 is the one considered above):
Number   Sequence   Probability
1        10110      pqppq = q^2 p^3
2        11100      pppqq = q^2 p^3
3        10011      pqqpp = q^2 p^3
4        11010      ppqpq = q^2 p^3
5        11001      ppqqp = q^2 p^3
6        10101      pqpqp = q^2 p^3
7        01110      qpppq = q^2 p^3
8        00111      qqppp = q^2 p^3
9        01011      qpqpp = q^2 p^3
10       01101      qppqp = q^2 p^3
We may now answer our original question: what is the
probability, in a random sample of size 5, drawn from the specified
population, of observing three successes (record of a male birth)
and two failures (record of a female birth)?
The answer to the question is
10 (0.48)^2 (0.52)^3 = 10 (0.2304)(0.140608) = 0.32 (approximately)
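The count of ten sequences and the resulting probability in Example 1 can be checked by direct enumeration. This is only a verification sketch; the coding 1 = male birth (success) and 0 = female birth (failure) follows the example.

```python
from itertools import product

p = 0.52        # probability of a recorded male birth (success)
q = 1 - p       # probability of a recorded female birth (failure)

# All sequences of five births containing exactly three successes.
sequences = [s for s in product((1, 0), repeat=5) if sum(s) == 3]

# Each such sequence has probability p^3 q^2; add them up.
probability = sum(p ** sum(s) * q ** (5 - sum(s)) for s in sequences)

for s in sequences:
    print("".join(str(outcome) for outcome in s))
print(len(sequences), round(probability, 4))
```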
General formula:
$f(x) = \binom{n}{x} p^x q^{n-x}$, for x = 0, 1, 2, ..., n
This expression is called the binomial distribution,
where f(x) = P(X = x)
n = number of trials
x = number of successes (the value taken by the random variable X)
p = probability of a success
q = probability of a failure = 1 - p
This distribution satisfies the properties of a discrete probability
distribution:
1. $f(x) \ge 0$ for all real values of x. This follows from the fact
that n and p are both nonnegative and, hence, $\binom{n}{x}$, $p^x$ and
$(1-p)^{n-x}$ are all nonnegative and, therefore, their
product is greater than or equal to zero.
2. $\sum f(x) = 1$. This is seen to be true if we recognize that
$\sum \binom{n}{x} q^{n-x} p^x$ is equal to $[(1-p) + p]^n = 1^n = 1$.
Example 2:
Suppose that it is known that 30% of a certain population is
immune to some disease. If a random sample of size 10 is selected
from this population, what is the probability that it will contain
exactly four immune persons?
Solution:
We take the probability of an immune person to be 0.3, i.e. p = 0.3
and q = 1 - p = 1 - 0.3 = 0.7. With n = 10 and x = 4, the binomial
distribution gives
$f(4) = \binom{10}{4} (0.3)^4 (0.7)^6 = 210 (0.0081)(0.117649) \approx 0.2001$
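The probability asked for in Example 2 can be evaluated with a short function for the binomial probability f(x) = C(n, x) p^x q^(n-x). The function name binomial_pmf is an assumption made for this illustration; math.comb is the standard library routine for the number of combinations.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n Bernoulli trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


# Example 2: n = 10 persons sampled, p = 0.3 immune, x = 4 immune persons.
prob_four_immune = binomial_pmf(4, 10, 0.3)
print(round(prob_four_immune, 4))

# The probabilities over x = 0, 1, ..., n should add up to 1.
total_probability = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
print(round(total_probability, 6))
```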
The Binomial Parameters
The binomial distribution has two parameters, n and p. They are
parameters in the sense that they are sufficient to specify a
binomial distribution. The binomial distribution is really a family
of distributions with each possible value of n and p designating a
different member of the family. The mean and variance of the
binomial distribution are $\mu = np$ and $\sigma^2 = np(1-p)$, respectively.
Strictly speaking, the binomial distribution is applicable in
situations where sampling is from an infinite population or from a
finite population with replacement. Since in actual practice
samples are usually drawn without replacemen