Chuck Humphrey, University of Alberta Atlantic DLI Training, 2008 DLI Orientation: Concepts A Framework for Thinking about Data and Statistics
Dec 31, 2015

Peter Ellis
Chuck Humphrey, University of Alberta
Atlantic DLI Training, 2008

DLI Orientation: Concepts
A Framework for Thinking about Data and Statistics

Outline

• Data and statistics: what are we talking about?
• Key concepts for data and statistics
• Statistics are about definitions
• Framework for numeric information
• DLI and standard data products
  • E-tables and databases
  • Aggregate data
  • Public use microdata
  • Spatial data
• Continuum of access
• Levels of service

What are we talking about?

Numeric information

Statistics
• numeric facts/figures
• created from data, i.e., already processed
• presentation-ready

Data
• numeric files created and organized for analysis/processing
• requires processing
• not display-ready
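The distinction above can be sketched in a few lines of Python. This is only an illustration: the records, field names and values are made up, and `count_smokers` is a hypothetical helper. The input is data (one record per respondent, not display-ready) and the output is statistics (a small presentation-ready tabulation).

```python
from collections import Counter

# Hypothetical data file: one record per respondent (all values are made up).
records = [
    {"region": "Alberta", "sex": "Female", "smoker": True},
    {"region": "Alberta", "sex": "Male", "smoker": False},
    {"region": "Alberta", "sex": "Male", "smoker": True},
    {"region": "Nova Scotia", "sex": "Female", "smoker": False},
    {"region": "Nova Scotia", "sex": "Male", "smoker": True},
]


def count_smokers(rows):
    """Process the data into statistics: smokers counted by (region, sex)."""
    table = Counter()
    for row in rows:
        if row["smoker"]:
            table[(row["region"], row["sex"])] += 1
    return dict(table)


smoker_table = count_smokers(records)

# Display the statistics as a simple table.
for (region, sex), n in sorted(smoker_table.items()):
    print(f"{region:12} {sex:7} {n}")
```

The same records could be tabulated in other ways (by sex only, by region only), which is why a table is only one view of the data.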

Statistics

Six dimensions or variables in this table. The cells in the table are the estimated numbers of smokers.

[Figure: example table annotated with its dimensions. Geography: Region. Time: Periods. Unit of Observation: Smokers. Attributes: Education, Age, Sex.]

Defining some key concepts

Statistics are based on a few key underlying concepts and knowing the definition of these concepts is useful in interpreting statistics.

Referring to the previous slide, what does Statistics Canada mean by “geography”?

Statistics Canada uses the concept of location to describe geography. “The concept of location is that of a physical place where the activity of a statistical unit occurs and for which data are collected.”

Concept: unit of observation

The concept of location refers to “statistical units.” These are the units of observation for which data are collected and that statistics describe or summarize.

Statistical units for business surveys include the enterprise, the company, the establishment and the location.

Statistical units for social surveys include the census family, the economic family and the household.

“There are two primary sources for social statistics: one is administrative records, which generally collect information from the files of individuals; the other is from censuses and surveys where the unit of observation is the household and individuals within the household.”

Concept: universe

Universe describes characteristics of the unit of observation used in the selection of those from whom data are collected. This concept is closely associated with the sample design employed in selecting members of the unit of observation. The universe includes all members of the unit of observation, while the sample consists of just those members from whom data are collected.

Statistics Canada uses “target population” to describe each survey’s universe.

Concept: sample weights

With the exception of some administrative databases, Statistics Canada employs probabilistic sampling methods to select members in the unit of observation from its universe.

Typically, not every member in the unit of observation has the same probability of being selected. Consequently, Statistics Canada determines a sample weight that it includes with the data file to correct for the sample design and to provide population estimates.
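The effect of a sample weight can be sketched as follows. This is a minimal illustration, assuming each weight is read as the number of members of the universe that one respondent stands for; the records, weights and the helpers `weighted_total` and `weighted_proportion` are hypothetical and do not reproduce any Statistics Canada file or estimation procedure.

```python
# Hypothetical sample: each record carries a sample weight (made-up values).
sample = [
    {"smoker": True, "weight": 250.0},
    {"smoker": False, "weight": 400.0},
    {"smoker": True, "weight": 150.0},
    {"smoker": False, "weight": 200.0},
]


def weighted_total(rows, flag):
    """Population estimate: sum the weights of the cases where `flag` is true."""
    return sum(r["weight"] for r in rows if r[flag])


def weighted_proportion(rows, flag):
    """Estimated share of the population for which `flag` is true."""
    total_weight = sum(r["weight"] for r in rows)
    return weighted_total(rows, flag) / total_weight


estimated_smokers = weighted_total(sample, "smoker")
estimated_share = weighted_proportion(sample, "smoker")

# Ignoring the weights describes the sample, not the population.
unweighted_share = sum(r["smoker"] for r in sample) / len(sample)

print(estimated_smokers, estimated_share, unweighted_share)
```

In this made-up sample half of the respondents smoke, but the weighted share is lower because the smokers carry smaller weights. That gap is what the weight corrects for.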

Unit of observation & universe

Together, the unit of observation and the universe describe the objects from whom data are collected and to whom generalizations and descriptions are being made in statistical displays.

All statistical tables are based on a specific unit of observation. Because table headings don’t explicitly say, “the unit of observation is,” one is left to infer this information from the table. With well-designed tables, the unit of observation should be obvious.

Let’s look at the characteristics of a well-defined table.

[Figure: example table annotated with its parts. Title; Producer; Date; Variables: Average Tuition, Discipline, Academic Year, Province; Statistical Metric: Dollars; Footnote; Unit of Observation; Universe.]

Statistics are about definitions

Statistics are about definitions

Each characteristic or variable that is measured or recorded about the unit of observation must be clearly defined. Statistics Canada has definitions for some of the more frequently used concepts and variables on its website under “Definitions, data sources and methods.”

The Census Dictionary is an important source for definitions of the concepts and variables in each Census.

Definitions use classifications

The definitions for concepts and variables use classification systems to assign categories or values to the properties of the concepts. For example, Region in this table consists of Canada and the ten provinces.
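As a rough illustration of a classification assigning categories to a variable, the Region example above can be written as a fixed list of allowed values. The names `REGION_CATEGORIES` and `classify_region` are hypothetical, and the province names follow current usage rather than any particular table.

```python
# Region classification from the example: Canada plus the ten provinces.
REGION_CATEGORIES = (
    "Canada",
    "Newfoundland and Labrador",
    "Prince Edward Island",
    "Nova Scotia",
    "New Brunswick",
    "Quebec",
    "Ontario",
    "Manitoba",
    "Saskatchewan",
    "Alberta",
    "British Columbia",
)


def classify_region(value):
    """Accept a value only if it is a category of the Region classification."""
    if value not in REGION_CATEGORIES:
        raise ValueError(f"{value!r} is not a category of Region")
    return value


print(classify_region("Alberta"))
```

A value outside the classification, such as a city name, is rejected, which is the sense in which the definition of a variable depends on its classification.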

Definitions use classifications

Some classifications are based on standards while others are based on convention or practice.

For example, Standard Geography classifications

Classifications involve categories

[Figure: table with the categories of two classifications highlighted. Sex: Total, Male, Female. Periods: 1994-1995, 1996-1997.]

Definitions and metadata

All of the definitions and information that describe the unit of observation, the universe, the sampling method, the concepts and the variables are critical to understanding both the data and the statistics derived from the data.

We used to talk about codebooks and about the User’s Guide and Data Dictionary when speaking of data documentation. Now we refer to this documentation as metadata, which has been expanded to include documentation throughout the life cycle of a survey. The Data Documentation Initiative 3.0 standard is being used to organize this information.

Unit of observation and data

The unit of observation also defines an important structural characteristic of data files. A record in a data file represents the information for one member of the unit of observation.

Data

Stories are told through statistics

The National Population Health Survey in the previous example had over 80,000 respondents in its 1996-97 sample, and the Canadian Community Health Survey in 2005 had over 130,000 cases. How do we tell the stories about each of these respondents?

We create summaries of these life experiences using statistics.

Summary

• Statistics are derived from observational, experimental or simulated data.
• A table is a format for displaying statistics and presents a summary or one view of the data.
• Tables are structured around geography, time and attributes of the unit of observation.
• Statistics are dependent on definitions and classification systems.
• Statistics summarize individual stories into common or general stories.

Framework for Numeric Information

Numeric Information

Numeric Information

Where does DLI fit in this scheme?

DLI and standard data products

DLI licence, article 1: “via the Data Liberation Initiative (DLI), Statistics Canada will offer my educational institution, timely access, on a subscription basis, to standard Statistics Canada data products, such as public use microdata files (non-identifiable datasets containing characteristics pertaining to surveyed units), standard files and databases (containing aggregate data as defined and determined by Statistics Canada) and geography files, in available electronic formats.”

Standard data products used to be all products for sale in the Online Catalogue. Now Statistics Canada refers to standard electronic products, which also include e-publications and e-tables, some of which are now free.

Dissemination policy

In 2004, Statistics Canada introduced a new policy stipulating that all standard electronic products will be available either through the Depository Services Program (DSP) or DLI. This means that libraries in the academic community belonging to both the DSP and DLI should have access to all standard electronic products.


Standard product definitions

This next section provides definitions for e-tables, databases, aggregate data and public use microdata and presents some examples of each.

E-tables: these are tables in an electronic dissemination format (e.g., Beyond 2020 or Excel). Tables are displays for presenting the statistical results of a data analysis and provide one view of the data expressed through the selection of variables representing geography, time and attributes of the unit of observation.

DLI e-table examples

• The Canadian Centre for Justice Statistics Beyond 2020 tables
  Only tables in Beyond 2020 format; no public use microdata
• Survey of Household Spending
  Excel tables and a public use microdata file

Databases and aggregate data

Databases consist of file structures for storing aggregate data that can be either viewed as e-tables or retrieved as aggregate data. For example, CANSIM (a large database of time series) can be used for either purpose.

Aggregate data consist of statistics that are organized in a data structure and stored in a database or in a data file. These files are used for input into statistical analysis software.

Databases and aggregate data

The data structure of an aggregate file is based on tabulations organized by one or more of these factors: time, geography, or social characteristics.

Time series aggregate data

Time series: each line of the data file represents tabulations for a specific period of time. For example, a file of annual statistics from 1976 to 2005 would have 30 lines, one line for each year.
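The line count in the example above follows from counting both end years (2005 - 1976 + 1 = 30). A small sketch of such a file's structure, with placeholder values standing in for the published statistics and hypothetical field names:

```python
# One line per year, 1976 to 2005 inclusive.
years = list(range(1976, 2006))

# Placeholder values only; a real file would hold the tabulated statistics.
series = [{"year": year, "value": None} for year in years]

line_count = len(series)
print(line_count, series[0]["year"], series[-1]["year"])
```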

Geo-referenced aggregate data

Geo-referenced data: each line of the data file represents a spatial unit within which summary statistics have been tabulated. The spatial unit with which each line of data is associated is identified through a geo-code.

Using Beyond 2020, Census basic tabulations and profile series can be output for use with GIS software.

Correspondingly, Census boundary files are available through DLI that use codes from the Standard Geographic Classification system.

A Postal Code Conversion File (PCCF) exists to locate postal codes within Census geography.
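Linking geo-referenced aggregate data to boundary information amounts to a join on the geo-code. The sketch below is an assumption-laden illustration: the two-digit codes are Standard Geographical Classification province codes, but the counts are made up, the dictionary only stands in for a boundary file's attribute table (no geometry), and `join_on_geo_code` is a hypothetical helper.

```python
# Aggregate rows keyed by geo-code (counts are made-up placeholders).
aggregate_rows = [
    {"geo_code": "10", "count": 120},
    {"geo_code": "12", "count": 340},
    {"geo_code": "13", "count": 275},
]

# Stand-in for the attribute table of a boundary file: geo-code -> unit name.
boundary_units = {
    "10": "Newfoundland and Labrador",
    "12": "Nova Scotia",
    "13": "New Brunswick",
}


def join_on_geo_code(rows, boundaries):
    """Attach the boundary unit's name to each row; report codes with no match."""
    joined, unmatched = [], []
    for row in rows:
        name = boundaries.get(row["geo_code"])
        if name is None:
            unmatched.append(row["geo_code"])
        else:
            joined.append({**row, "name": name})
    return joined, unmatched


joined_rows, unmatched_codes = join_on_geo_code(aggregate_rows, boundary_units)
for row in joined_rows:
    print(row["geo_code"], row["name"], row["count"])
```

GIS software performs the same kind of match between a data table and a boundary layer, so unmatched codes are worth checking before mapping.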

An Example from E-STAT

Geo-referenced aggregate data

“Small area statistics” are a special category of aggregate data. These data files consist of statistics for small geographic areas usually calculated from a population or manufacturing census or an administrative database with enough cases to create accurate summaries for small areas.

Cross-classified aggregate data

Aggregate data, where each line in the file represents characteristics of the unit of observation, are also known as “cross-classified” tables.

These data are often analyzed in the absence of a public use microdata product. For example, no public use microdata exist for vital statistics. Consequently, the cross-classified data for age and sex by cause of death are an important data source for researchers.
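A cross-classified file can be pictured as one line per combination of categories, with a count attached. The sketch below uses made-up categories and counts (not vital statistics figures) and a hypothetical helper, `margin`, to show how such a file can be analyzed by summing over the dimensions that are not of interest.

```python
from collections import defaultdict

# One line per combination of (age group, sex, cause); counts are made up.
cross_classified = [
    ("Under 65", "Female", "Cause A", 10),
    ("Under 65", "Male", "Cause A", 15),
    ("65 and over", "Female", "Cause A", 40),
    ("65 and over", "Male", "Cause A", 35),
    ("Under 65", "Female", "Cause B", 5),
    ("Under 65", "Male", "Cause B", 8),
    ("65 and over", "Female", "Cause B", 20),
    ("65 and over", "Male", "Cause B", 22),
]


def margin(rows, position):
    """Sum the counts over all other dimensions for the one at `position`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[position]] += row[-1]
    return dict(totals)


by_sex = margin(cross_classified, 1)
by_cause = margin(cross_classified, 2)
print(by_sex)
print(by_cause)
```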

Cross-classified aggregate data

Microdata

These are raw data organized in a file where each line represents one member of a specific unit of observation and the information on each line consists of the values of variables.
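The structure described above (lines are cases, the entries on each line are values of variables) can be seen by reading a tiny file. The file content and variable names here are invented for illustration and do not come from any Statistics Canada product.

```python
import csv
import io

# Hypothetical microdata file: a header of variable names, then one line per case.
MICRODATA_TEXT = """case_id,age_group,sex,smoker
1,25 to 44,Female,yes
2,45 to 64,Male,no
3,25 to 44,Male,yes
"""

# Each parsed row is one case; each key is a variable.
cases = list(csv.DictReader(io.StringIO(MICRODATA_TEXT)))
variables = list(cases[0].keys())

print(len(cases), "cases;", "variables:", variables)
```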

There are different types of microdata files: master files, share files, public use files and synthetic files.

Confidential microdata

Master files: these files contain the fullness of detail captured about each case of the unit of observation. This detail is specific enough that the identity of a case can often be disclosed easily. Therefore, these files are treated as confidential.

Master files from the social data in Statistics Canada are available to the research community through the Research Data Centre Network.

Confidential microdata

Share files: these are confidential files in which the participants in the survey have signed a consent form permitting Statistics Canada to allow access to their information for approved research.

These files consist of a subset of the cases in the master file.

Access to share files may be granted to specific government departments without the need for their researchers to work within a Research Data Centre.

Public use microdata

These microdata are specially prepared to minimize the possibility of disclosing or identifying any of the individuals in the file. The original data from the master file are edited to create a public use microdata file.

Public use microdata files are only available for select social surveys that undergo a review by the Data Release Committee, an internal Statistics Canada committee. There are no ‘enterprise’ public use microdata files.

Public use microdata

Steps in anonymizing microdata
• Remove all personal identification information (names, addresses, etc.);
• Include only gross levels of geography;
• Collapse detailed information into a smaller number of general categories;
• Cap the upper range of values of variables with rare cases;
• Suppress the values of a variable; or
• Suppress entire cases.
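The steps listed above can be sketched on a toy master file. Everything specific here is an assumption made for illustration: the field names, the age cut points, the income cap and the suppression rule are hypothetical and are not Statistics Canada's actual disclosure-control rules.

```python
DIRECT_IDENTIFIERS = ("name", "address")  # hypothetical field names
SUPPRESSED_VARIABLES = ("occupation",)    # hypothetical variable to blank out
INCOME_CAP = 100_000                      # hypothetical top-code threshold


def age_group(age):
    """Collapse exact age into a few general categories (made-up cut points)."""
    if age < 25:
        return "Under 25"
    if age < 65:
        return "25 to 64"
    return "65 and over"


def anonymize(record):
    """Return a public-use version of one master-file record."""
    # Remove personal identification information.
    public = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Include only gross geography: keep the province, drop the city.
    public.pop("city", None)
    # Collapse detailed information into general categories.
    public["age_group"] = age_group(public.pop("age"))
    # Cap the upper range of a variable with rare high values.
    public["income"] = min(public["income"], INCOME_CAP)
    # Suppress the values of a variable.
    for variable in SUPPRESSED_VARIABLES:
        public[variable] = None
    return public


def make_public_file(master, suppress_case=lambda record: False):
    """Anonymize every record, dropping entire cases flagged for suppression."""
    return [anonymize(r) for r in master if not suppress_case(r)]


# Made-up master file.
master_file = [
    {"name": "Person A", "address": "1 Example St", "city": "Halifax",
     "province": "Nova Scotia", "age": 70, "income": 250_000, "occupation": "Judge"},
    {"name": "Person B", "address": "2 Example St", "city": "Edmonton",
     "province": "Alberta", "age": 30, "income": 45_000, "occupation": "Clerk"},
    {"name": "Person C", "address": "3 Example St", "city": "Moncton",
     "province": "New Brunswick", "age": 50, "income": 5_000_000, "occupation": "Owner"},
]

public_file = make_public_file(
    master_file, suppress_case=lambda record: record["income"] > 1_000_000
)
for row in public_file:
    print(row)
```

Each step trades detail for protection, which is why a public use file supports less fine-grained analysis than the master file it was derived from.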

Public use microdata

Almost all public use microdata files are derived using cross-sectional samples, that is, samples where the data have been collected from respondents at one point in time.

Longitudinal samples, where data are collected from the same individuals two or more times, are difficult to anonymize while still retaining any useful information.

Synthetic microdata

In an attempt to provide the research community with a version of the microdata that is like the master file but does not contain real cases, some author divisions are exploring the use of synthetic microdata files. Theoretically, these files return results close to the real data in the master file without the risk of disclosure.

Synthetic files are different from “dummy” files which have no data but rather have only the variable structure to allow the testing of syntax.

Spatial data

Statistics Canada provides spatial data files for each of the different geographic levels at which it disseminates Census results. These files are available as digital boundary files or cartographic boundary files.
• Digital Boundary Files depict the full extent of the geographical areas and extend into bodies of water.
• Cartographic Boundary Files depict the geographical areas using only the major land mass of Canada and its coastal islands.

These files are only available on the DLI FTP site.

Continuum of access

It is one thing to know about the variety of Statistics Canada products that exist, but access to this information is a separate issue.

The following model describes the various dissemination channels through which Statistics Canada provides access.

Continuum of access

Think of the variety of channels as constituting a continuum along which levels of access are provided.

There are three characteristics that make up this continuum:
• Cost: which runs from free to expensive;
• Restrictions or conditions: which run from open or no restrictions to very restricted; and
• Type of information: which runs from statistics to data.

Continuum of access

STC continuum of access

[Figure: access channels placed along a continuum running from open, free statistics to restricted, expensive data. Channels shown: Depository Services Program, Remote Job Submission, Statistics Canada Website, Data Liberation Initiative, Custom Tabulations, Research Data Centres.]

Levels of data service

There are several models for organizing local services to support DLI materials. Thinking of these models in terms of levels of service is helpful in identifying a model appropriate to your institution’s resources and priorities.

Levels of data service

A. Retrieve data upon request and pass directly to patron. May or may not catalogue DLI titles.
B. Subscribe to a data extraction service and offer as part of your electronic resources. May or may not catalogue titles.
C. Integrate into access services, include DLI in electronic resources, your catalogue and your website.
D. Add reference services to help patrons find data.
E. Add data consulting services (help with manipulating and formatting data) to both access and reference services.