Lecture 18: Data Quality Issues
Ch. 14
Introduction
• Spatial data and analysis standards are important because of the range of organizations producing and using spatial data, and the amount of data transferred between these organizations.
• There are several types of standards:
– Data standards
– Interoperability standards
– Analysis standards
– Professional and certification standards
Introduction (continued)
• National and international standards organizations are important in defining and maintaining geospatial standards:
– Federal Geographic Data Committee (FGDC), which focuses on the national spatial data infrastructure (www.fgdc.gov)
– International Spatial Data Standards Commission, which is a clearing house and gateway for international standards
– Open Geospatial Consortium (OGC), which is developing interoperability standards. Web Map Service (WMS) standards are an example.
GIS Certification
• What kind of certification is available?
• Two primary options:
– Geographic Information Systems Professional (GISP) is based on your work and volunteering experience.
– ESRI Technical Certifications are test-based.
• The third option is a university-based certification.
The Geospatial Competency Model
GIS Professional Certification
URISA is the founding member of the GIS Certification Institute, the organization that administers professional certification for the field and is dedicated to advancing the industry.
The minimum number of points needed to become a certified GIS Professional, as detailed in the three point schedules, is 150. Thus, all applicants are expected to document achievements valued at a minimum of 150 points. To ensure that applicants have a broad foundation, specific minimums in each of the three achievement categories must be met or exceeded. These minimums are as follows:
– Education: 30 points
– Experience: 60 points
– Contributions: 8 points
The additional 52 points can be counted from any of the three categories.
A Sample of University Certificates
• UMM – undergraduate
• USM undergrad/grad
• UM – graduate
• Penn State
• University of Denver
• University of Southern California
• George Mason University
Spatial Data Standards
• Data – measurements and observations
• Data quality – a measure of the fitness for use of data for a particular task (Chrisman, 1994).
• It is the responsibility of the user to ensure that the data are fit for the task.
• Metadata – data about the data
Spatial Data Standards
• Spatial Data Standards – methods for structuring, describing and delivering spatially-referenced data.
• Media Standards – the physical form of the data (CD/download etc).
• Format Standards – specify data file components and structures. These standards aid in data transfer.
• Spatial Data Accuracy Standards – document the quality of the positional and attribute accuracy.
• Document Standards – define how we describe spatial data.
GIS Is Not Perfect
A GIS cannot perfectly represent the world for many reasons, including:
• The world is too complex and detailed.
• The data structures or models (raster, vector, or TIN) used by a GIS to represent the world are not discriminating or flexible enough.
• We make decisions (how to categorize data, how to define zones) that are not always fully informed or justified, and are always biased.
• It is impossible to make a perfect representation of the world, so uncertainty is inevitable.
• Uncertainty degrades the quality of a spatial representation.
Concepts Related to Data Quality
• Related to individual data sets:
– Errors – flaws in data.
– Accuracy – the extent to which an estimated value approaches the true value.
– Precision – the recorded level of detail of your data.
– Bias – the systematic variation of the data from reality.
• Personal bias
• Instrument bias
Concepts Related to Data Quality
• Related to source data:
– Resolution – the smallest feature in the data set that can be displayed.
– Generalization – simplification of objects in the real world to produce scale models and maps.
Resolution and generalization of raster datasets
Figure 10.3 Scale-related generalization
Data Sets Used for Analysis
Must be:
– Complete – spatially and temporally
– Compatible – same scale, units of measure, measurement level
– Consistent – both within and between data sets
– And applicable for the analysis being performed
Sources of Error (Uncertainty) in GIS
A Conceptual View of Uncertainty
Real World -> Conception -> Source Data, Measurements & Representation -> Data Conversion and Analysis -> Result
(error propagation runs through every stage)
Uncertainty in The Conception of Geographic Phenomena
Many spatial objects are not well defined or their definition is to some extent arbitrary, so that people can reasonably disagree about whether a particular object is x or not. There are at least four types of conceptual uncertainty:
• Spatial uncertainty
• Vagueness
• Ambiguity
• Regionalization problems
Spatial uncertainty
• Spatial uncertainty occurs when objects do not have a discrete, well-defined extent.
• They may have indistinct boundaries.
• They may have impacts that extend beyond their boundaries.
• They may simply be statistical entities.
• The attributes ascribed to spatial objects may also be subjective.
Vagueness (obscureness)
• Vagueness occurs when the criteria that define an object as x are not explicit or rigorous.
• For example:
– In a land cover analysis, how many oaks (or what proportion of oaks) must be found in a tract of land to qualify it as oak woodland?
– What incidence of crime (or resident criminals) defines a high crime neighborhood?
Ambiguity
Ambiguity occurs when y is used as a substitute, or indicator, for x because x is not available.
• The link between direct indicators and the phenomena for which they substitute is straightforward and fairly unambiguous.
• Indirect indicators tend to be more ambiguous and opaque.
• Of course, indicators are not simply direct or indirect; they occupy a continuum. The more indirect they are, the greater the ambiguity.
Regionalization problems
• Regional geography is largely founded on the creation of a mosaic of zones that make it easy to portray spatial data distributions.
• A uniform zone is defined by the extent of a common characteristic, such as climate, landform, or soil type.
• Functional zones are areas that delimit the extent of influence of a facility or feature, for example how far people travel to a shopping center or the geographic extent of support for a football team.
• Regionalization problems occur because zones are artificial.
Uncertainty in the measurement of geographic phenomena
Error occurs in physical measurement of objects. This error creates further uncertainty about the true nature of spatial objects.
• Physical measurement error
• Digitizing error
• Error caused by combining data sets with different lineages
Physical measurement error
Instruments and procedures used to make physical measurements are not perfectly accurate. For example, a survey of Mount Everest might find its height to be 8,850 meters, with an accuracy of plus or minus 5 meters.
• In addition, the earth is not a perfectly stable platform from which to make measurements. Seismic motion, continental drift, and the wobbling of the earth's axis cause physical measurements to be inexact (e.g., GPS error, remote sensing error).
Digitizing error
• A great deal of spatial data has been digitized from paper maps.
• Digitizing, or the electronic tracing of paper maps, is prone to human error.
– Lines may be drawn too far, not far enough, or missed entirely. Errors caused by digitizing mistakes can be partially, but not completely, fixed by software.
– Additional error occurs because adjacent data digitized from different maps may not align correctly. This problem can also be partially corrected through a software technique called rubbersheeting.
Digitizing Error
Any digitized map requires considerable post-processing:
• Check for missing features
• Connect lines
• Remove spurious polygons
Some of these steps can be automated.
Error caused by combining data sets with different lineages
• Data sets produced by different agencies or vendors may not match because different processes were used to capture or automate the data.
– For example, buildings in one data set may appear on the opposite side of the street in another data set.
– Error may also be caused by combining sample and population data or by using sample estimates that are not robust at fine scales.
– "Lifestyle" data are derived from shopping surveys and provide business and service planners with up-to-date socioeconomic data not found in traditional data sources like the census. Yet the methods by which lifestyle data are gathered and aggregated to zones or are compared to census data may not be scientifically rigorous.
Uncertainty in the representation of geographic phenomena
• Representation is closely related to measurement.
• Representation is not just an input to analysis, but sometimes also the outcome of it. For this reason, we consider representation separately from measurement.
– The world is infinitely complex, but computer systems are finite.
– Representation is all about the choices that are made in capturing knowledge about the world.
• Uncertainty in the earth model: ellipsoid models, datum, projection types
• Uncertainty in the raster data model (structure)
• Uncertainty in the vector data model (structure)
Uncertainty in the raster data structure
• The raster structure partitions space into square cells of equal size (also called pixels).
• Spatial objects x, y, and z emerge from cell classification, in which Cell A1 is classified as x, Cell A2 as y, Cell A3 as z, and so on, until all cells are evaluated.
• A spatial object x can be defined as a set of contiguous cells classified as x.
• Commonly, a cell is not purely one thing or another, but might contain some x, some y, and maybe a bit of z within its area.
• These impure cells are termed mixed pixels or "mixels."
• Because a cell can hold only one value, a mixel must be classified as if it were all one thing or another. Therefore, the raster structure may distort the shape of spatial objects.
Error in raster
• Because of the distortions due to flattening, cells in a raster can never be perfectly equal in size on the Earth’s surface.
• When information is represented in raster form, all detail about variation within cells is lost, and each cell is given a single value. Common rules for assigning that value are:
– Largest share
– Central point (e.g., USGS DEM)
– Mean value (e.g., remote sensing imagery)
[Figure: three example cells containing the values 8 and 6 in different proportions, with the value assigned under each rule]
Mean value for the three example cells:
8 × (1/6) + 6 × (5/6) = 6.33
8 × (3/4) + 6 × (1/4) = 7.5
8 × (1/7) + 6 × (6/7) = 6.29
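The largest-share and mean-value rules can be sketched in a few lines. This is a minimal illustration, assuming each cell is described by the fraction of its area covered by each value. The function names are made up for this example, and the central-point rule is left out because it needs the cell geometry.

```python
def mean_value(shares):
    """Area-weighted mean; shares maps each value to its fraction of the cell."""
    return sum(value * fraction for value, fraction in shares.items())

def largest_share(shares):
    """Value that covers the largest fraction of the cell."""
    return max(shares, key=shares.get)

# The three example cells: values 8 and 6 in different proportions.
cells = [
    {8: 1/6, 6: 5/6},
    {8: 3/4, 6: 1/4},
    {8: 1/7, 6: 6/7},
]
for shares in cells:
    print(largest_share(shares), round(mean_value(shares), 2))
```

The two rules can give noticeably different cell values for the same mixed pixel, which is one source of representation error in raster data.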
Figure 10.8 Problems with remotely sensed imagery: (left) example of a satellite image with cloud cover (A), shadows from topography (B), and shadows from cloud cover (C); (right) an urban area showing a building leaning away from the camera. Source: Ian Bishop (left) and Google UK (right)
Uncertainty in the vector data structure
• Socioeconomic data (facts about people, houses, and households) are often best represented as points.
• For various reasons (to protect privacy, to limit data volume), data are usually aggregated and reported at a zonal level, such as census tracts or ZIP Codes.
• This distorts the data in two ways:
– First, it gives them a spatially inappropriate representation (polygons instead of points);
– Second, it forces the data into zones whose boundaries may not respect natural distribution patterns.
Map representation error
Map scale        Ground distance, accuracy, or resolution (corresponding to 0.5 mm map distance)
1:1,250          0.625 m
1:2,500          1.25 m
1:5,000          2.5 m
1:10,000         5 m
1:24,000         12 m
1:50,000         25 m
1:100,000        50 m
1:250,000        125 m
1:1,000,000      500 m
1:10,000,000     5 km
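Each row of the table is the scale denominator multiplied by the 0.5 mm map distance. A small sketch of that arithmetic follows; the function name `ground_distance_m` is an assumption made for illustration.

```python
def ground_distance_m(scale_denominator, map_distance_mm=0.5):
    """Ground distance in metres represented by a map distance at a given scale."""
    return scale_denominator * map_distance_mm / 1000.0

for denominator in (1_250, 24_000, 1_000_000):
    print(f"1:{denominator:,} -> {ground_distance_m(denominator):g} m")
```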
Uncertainty in the data conversion and analysis of geographic phenomena
Uncertainties in data lead to uncertainties in the results of analysis. Data conversion and spatial analysis methods can create further uncertainty:
• Data conversion error
• Georeferencing and resampling
• Projection and datum conversions
• The ecological fallacy
• The modifiable areal unit problem (MAUP)
• Classification errors
• The ecological fallacy
The ecological fallacy is the mistake of assuming that an overall characteristic of a zone is also a characteristic of any location or individual within the zone.
• The Modifiable Areal Unit Problem (MAUP)
The results of data analysis are influenced by the number and sizes of the zones used to organize the data. The Modifiable Areal Unit Problem has at least three aspects:
1. The number, sizes, and shapes of zones affect the results of analysis.
2. The number of ways in which fine-scale zones can be aggregated into larger units is often great.
3. There are usually no objective criteria for choosing one zoning scheme over another.
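Both problems can be seen in a toy example. The sketch below uses invented data: eight fine-scale units in a row, each with a population and a case count, aggregated under two hypothetical zoning schemes. It is only meant to show the effect, not to model any real census geography.

```python
# Eight fine-scale units in a row: (population, cases) per unit. Invented data.
units = [(100, 5), (100, 25), (100, 20), (100, 10),
         (100, 30), (100, 0), (100, 15), (100, 15)]

def zone_rates(units, zoning):
    """Aggregate units into zones and return the case rate of each zone."""
    rates = []
    for zone in zoning:
        population = sum(units[i][0] for i in zone)
        cases = sum(units[i][1] for i in zone)
        rates.append(cases / population)
    return rates

zoning_a = [(0, 1), (2, 3), (4, 5), (6, 7)]       # pairs starting at unit 0
zoning_b = [(0,), (1, 2), (3, 4), (5, 6), (7,)]   # same units, boundaries shifted

print(zone_rates(units, zoning_a))
print(zone_rates(units, zoning_b))
```

Under the first scheme every zone has the same rate, while the second scheme shows strong variation from the same underlying data (MAUP). Reading a zone's rate as the rate of each unit inside it would also be wrong, since the unit rates range from 0 to 0.30 (ecological fallacy).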
Ecological Fallacy Example
Source: http://www.gistutor.com/concepts/24-intermediate-concept-tutorials/57-ecological-fallacy-in-gis.html
MAUP Example
Source: http://www.indiana.edu/~gisci/courses/g438/lectures/gis_census.html
Classification error and quality check
Selecting ROIs
[Figure: image with ROIs labeled Alfalfa, Cotton, Grass, and Fallow]
Classification Result
Background: ETM+, 7/15/01
Top image: IKONOS, Oct. 2000
Confusion Matrix
Rows: classification results. Columns: ground truth.

Class                   Grass  Alfalfa  Cotton  Chili  Fallow (corn)  Total  User accuracy (%)
Grass                     110       22       0      0              0    132               83.3
Alfalfa                     5      105       0      0              0    110               95.5
Cotton                      0        0     945      5              0    950               99.5
Chili                       0        0      50     42              0     92               45.7
Fallow                      0        0       0      0            484    484              100
Total                     115      127     995     47            484   1768
Producer accuracy (%)    95.7     82.7    95.0   89.4            100

Correctly classified pixels (sum of the diagonal) = 110 + 105 + 945 + 42 + 484 = 1686
Overall accuracy = 1686 / 1768 = 95.4%
Kappa index = [1686 − (132×115 + 110×127 + 950×995 + 92×47 + 484×484) / 1768] / [1768 − (132×115 + 110×127 + 950×995 + 92×47 + 484×484) / 1768] = 92.4%
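The accuracy measures can be recomputed directly from the matrix counts. The sketch below assumes rows are classification results and columns are ground truth, and uses the usual chance-agreement form of the kappa index; variable names are chosen for this example only.

```python
classes = ["Grass", "Alfalfa", "Cotton", "Chili", "Fallow"]
# Rows: classification result; columns: ground truth (counts from the matrix).
matrix = [
    [110,  22,   0,  0,   0],
    [  5, 105,   0,  0,   0],
    [  0,   0, 945,  5,   0],
    [  0,   0,  50, 42,   0],
    [  0,   0,   0,  0, 484],
]

n = sum(map(sum, matrix))
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
diagonal = [matrix[i][i] for i in range(len(classes))]

user_acc = [d / r for d, r in zip(diagonal, row_totals)]
producer_acc = [d / c for d, c in zip(diagonal, col_totals)]
overall = sum(diagonal) / n

# Kappa: agreement beyond what the row and column totals would give by chance.
chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n
kappa = (sum(diagonal) - chance) / (n - chance)

for name, u, p in zip(classes, user_acc, producer_acc):
    print(f"{name:8s} user {u:6.1%}  producer {p:6.1%}")
print(f"overall {overall:.1%}, kappa {kappa:.3f}")
```

Working from the raw counts rather than from rounded percentages makes it easy to check each reported figure against its row or column total.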
Bases of Confusion Matrix
• Producer accuracy is the probability that the classifier has labeled a pixel as Class A given that the ground truth is Class A.
• User accuracy is the probability that a pixel is Class A given that the classifier has labeled it as Class A.
• Overall accuracy is the total classification accuracy.
• Kappa index (another parameter for overall accuracy) is a more useful index for evaluating accuracy because it discounts the agreement expected by chance.
– Errors of commission are pixels that belong to another class but are labeled as belonging to the class.
– Errors of omission are pixels that belong to the ground truth class but that the classifier has failed to assign to it.
Error Propagation
Real World -> Conception -> Measurement & Representation -> Data Conversion and Analysis -> Result
(error propagation runs through every stage)
• The errors in the input will propagate to the output of the operation.
• Error propagation measures the impact of error (uncertainty) in data on the results of GIS operations.
Finding and Modeling Errors
• Checking for errors
– Visual inspection during data editing and cleaning.
– Attributes can be checked by using annotation, line colors and patterns.
– Double digitizing
– Statistical analysis may identify extreme values of attributes.
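One simple statistical check for extreme attribute values is the interquartile-range rule. The sketch below is an assumption about how such a check might look, not a prescribed method; the spot heights and the function name `flag_extremes` are invented.

```python
import statistics

def flag_extremes(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical spot heights (m); 1520 looks like a keying error for 152.
heights = [148, 150, 151, 152, 153, 155, 149, 1520, 150]
print(flag_extremes(heights))
```

A flagged value is only a candidate error; it still has to be checked against the source, since genuine extremes do occur.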
Finding and Modeling Errors
• Error modeling
– 1. Epsilon modeling
• Based on a method of line generalization, and adapted by Blakemore.
• It places an error band around a digitized line, describing the probable distribution of error.
• Error distribution is subject to debate:
– Normal curve
– Piecewise quartile distribution
– Bimodal
• The epsilon band can be used in analyses to improve the confidence of the user in the result.
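The idea of an error band around a digitized line can be sketched as a distance test. This is a minimal illustration, assuming a fixed band half-width `epsilon` and planar coordinates; it only checks whether a point falls inside the band and does not model any particular error distribution. The function names and the sample line are hypothetical.

```python
import math

def distance_to_segment(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def within_epsilon_band(p, line, epsilon):
    """True if p lies inside the band of half-width epsilon around the polyline."""
    return any(distance_to_segment(p, a, b) <= epsilon
               for a, b in zip(line, line[1:]))

boundary = [(0, 0), (10, 0), (10, 10)]   # hypothetical digitized line
print(within_epsilon_band((5, 0.3), boundary, epsilon=0.5))
print(within_epsilon_band((5, 3.0), boundary, epsilon=0.5))
```

A point that falls inside the band is one whose position relative to the line cannot be stated with confidence, so any result that depends on it should be treated as uncertain.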
Figure 10.17 Point-in-polygon categories of containment. Source: Blakemore (1984)
Finding and Modeling Errors
• Error modeling
– 2. Monte Carlo simulation – used in overlays.
• Simulates input data error by adding random noise to the line coordinates of the map data.
• Each input is assumed to be characterized by an estimate of positional error.
• This changes the shape of the line.
• The process is repeated multiple times and the randomized data are put through the GIS analyses.
• Output:
– A number
– A map
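The steps above can be sketched for a single polygon. The example assumes independent Gaussian noise with standard deviation `sigma` on every vertex coordinate, which is a simplifying assumption, and uses polygon area as a stand-in for the GIS analysis. The parcel and function names are hypothetical.

```python
import random
import statistics

def polygon_area(vertices):
    """Planar polygon area by the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def simulate_area(vertices, sigma, runs=1000, seed=0):
    """Perturb every vertex with Gaussian noise and collect the resulting areas."""
    rng = random.Random(seed)
    areas = []
    for _ in range(runs):
        noisy = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                 for x, y in vertices]
        areas.append(polygon_area(noisy))
    return statistics.mean(areas), statistics.stdev(areas)

parcel = [(0, 0), (100, 0), (100, 100), (0, 100)]  # hypothetical 100 m square
mean_area, sd_area = simulate_area(parcel, sigma=1.0)
print(f"area = {mean_area:.0f} +/- {sd_area:.0f} m^2")
```

The spread of the simulated areas is the "number" output; repeating the same idea per cell or per feature would give the "map" output.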
Figure 10.18 Simulating effects of DEM error and algorithm uncertainty on derived stream networks
Managing GIS Error
• To manage errors we must track and document them.
• The concepts introduced earlier (accuracy, precision, resolution, generalization, bias, compatibility, completeness, and consistency) provide a checklist of quality indicators.
• These should be documented for each data layer.
Managing GIS Error
• Data quality information can be used to create a data lineage: a record of the data history that presents essential information about the development of the data.
• This record becomes the metadata.
Living with uncertainty
• Uncertainty is inevitable and easier to find.
• Use metadata to document the uncertainty.
• Use sensitivity analysis to find the impacts of input uncertainty on output.
• Rely on multiple sources of data.
• Be honest and informative in reporting the results of GIS analysis.
• The US Federal Geographic Data Committee lists five components of data quality: attribute accuracy, positional accuracy, logical consistency, completeness, and lineage (for details, see www.fgdc.gov).
Basics of FGDC
• Federal Geographic Data Committee (FGDC) metadata answers the who, what, where, when, how and why questions of geospatial data.
• The data structure and elements defined for FGDC metadata are described fully in the “Content Standard for Digital Geospatial Metadata” (CSDGM).
SEVEN SECTIONS OF FGDC
The Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) organizes a metadata record into seven main sections:
– Identification Information
– Data Quality Information
– Spatial Data Organization Information
– Spatial Reference Information
– Entity and Attribute Information
– Distribution Information
– Metadata Reference Information
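A quick way to use the seven sections as a checklist is sketched below. The section names follow the list above, but the dictionary layout, the field names inside it, and the function `missing_sections` are only illustrative and are not the CSDGM element structure.

```python
# Section names follow the list above; the record layout itself is only
# illustrative and is not the CSDGM element structure.
CSDGM_SECTIONS = [
    "Identification Information",
    "Data Quality Information",
    "Spatial Data Organization Information",
    "Spatial Reference Information",
    "Entity and Attribute Information",
    "Distribution Information",
    "Metadata Reference Information",
]

def missing_sections(record):
    """Return the main sections that are absent or empty in a metadata record."""
    return [name for name in CSDGM_SECTIONS if not record.get(name)]

# Hypothetical, partly completed record for a roads layer.
record = {
    "Identification Information": {"title": "County roads", "scale": "1:24,000"},
    "Data Quality Information": {"positional accuracy": "+/- 12 m"},
    "Spatial Reference Information": {"horizontal datum": "NAD83"},
}
print(missing_sections(record))
```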
Identification Information
Source: http://www.maine.gov/megis/policies/megisfgdc.rtf
• What is the name of the dataset?
• What is the subject or theme of the information included?
• What is the scale of the dataset?
• What are the attributes of the dataset?
• Where is the geographic location of the dataset?
• Who developed the dataset?
• Who provided the source material for the dataset?
• Who will publish the dataset?
• When were the features of the dataset identified?
• How are the features of the dataset depicted?
• Why was the dataset created?
• Are there restrictions on accessing or using the data?
• Are external files available that are related to the dataset?
Data Quality Information
• How reliable are the data?
• What are their limitations or inconsistencies?
• What is the positional and attribute accuracy?
• Is the dataset complete?
• Were the consistency and content of the data verified?
• Where can the sources of the data be located?
• What processes were applied to these sources and by whom?
Spatial Data Organization
• What spatial data model was used to encode the spatial data?
• How many and what kind of spatial objects are included in the dataset?
• Are methods other than coordinates, such as street addresses, used to encode locations?
Spatial Reference
• Are coordinate locations encoded using longitude and latitude?
• What map projection is used?
• What horizontal datum and/or vertical datum are used?
• What parameters should be used to convert the data to another coordinate system?
Entity and Attribute Information
• What geographic information (roads, houses, elevation, temperature, etc.) is described?
• How is this information coded?
• What do the codes mean?
• What source was used for defining the attributes or codes, e.g., the Cowardin classification?
Distribution
• From whom can the data be obtained?
• What formats are available?
• What media are available?
• Are the data available online?
• What is the price of the data?
Metadata Reference
• When were the metadata compiled, and by whom?
• When was the metadata record created?
• Who is the responsible party?
• When were the metadata last updated?