TDWG- Lisbon Oct 2003 Data Cleaning Tools and Methodologies Arthur D. Chapman Australia / Brazil Centro de Referência em Informação Ambiental

Page 1

TDWG- Lisbon Oct 2003

Data Cleaning Tools and Methodologies

Arthur D. Chapman

Australia / Brazil

Centro de Referência em Informação Ambiental

Page 2

Background

• ERIN/CRIA

• speciesLink

• FAPESP/Biota

Page 3

Species Data

• Museum/Herbarium

• Observation

• Survey

Page 4

Data Error

• Names

• Geocode

• Altitude

• Collectors

• Dates

Page 5

Data quality - fitness for use

Page 6

Methods for geocode validation

• Internal Database Checks

• Outliers in Geographic Space - GIS

• Outliers in Environmental Space - Models

• Statistical outliers

Page 7

Internal Database Checks

• Internal inconsistencies

• Checking one field against another
  – Text location vs geocode

• Checking one database against another
  – Gazetteers
  – DEM
  – Collectors
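
As an illustration of checking a text location against a geocode with a gazetteer, the following is a minimal sketch rather than any tool's actual method. The gazetteer contents (approximate, illustrative coordinates), the record field names, and the 50 km tolerance are all assumptions made for this example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical gazetteer: locality name -> approximate (lat, lon)
GAZETTEER = {
    "Campinas": (-22.91, -47.06),
    "Canberra": (-35.28, 149.13),
}

def check_record(record, tolerance_km=50.0):
    """Compare a record's geocode with the gazetteer position of its named locality.

    Returns 'ok', 'mismatch', or 'unknown locality'. The tolerance is an assumption.
    """
    place = GAZETTEER.get(record["locality"])
    if place is None:
        return "unknown locality"
    distance = haversine_km(record["lat"], record["lon"], place[0], place[1])
    return "ok" if distance <= tolerance_km else "mismatch"

# A dropped minus sign on the latitude is caught as a mismatch
print(check_record({"locality": "Campinas", "lat": 22.91, "lon": -47.06}))
```

A "mismatch" only says that the text location and the geocode disagree; either field may be the one in error.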

Page 8

Geographic outliers - GIS

• Country, State, named district, etc.
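
A check of a point against its stated country, state or district can be sketched with a simple point-in-polygon test. This is an assumption-laden illustration: the two rectangular "states", the field names, and the result strings are hypothetical, and a real GIS would use proper boundary files and projections.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. polygon is a list of (lon, lat) vertices, first vertex not repeated."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical boundaries: two adjacent rectangular "states"
STATES = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

def check_state(record):
    """Return 'ok', 'falls in <state>', 'outside all polygons', or 'unknown state'."""
    polygon = STATES.get(record["state"])
    if polygon is None:
        return "unknown state"
    if point_in_polygon(record["lon"], record["lat"], polygon):
        return "ok"
    for name, other in STATES.items():
        if name != record["state"] and point_in_polygon(record["lon"], record["lat"], other):
            return f"falls in {name}"
    return "outside all polygons"

print(check_state({"state": "A", "lon": 15, "lat": 5}))  # falls in B
```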

Page 9

Geographic outliers - GIS

Page 10

Geographic Outliers - GIS

• Collectors – location vs date
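
One way to check a collector's locations against dates is to order that collector's records by date and flag consecutive pairs that imply an implausible daily travel distance. The sketch below assumes a record layout of (date, lat, lon) and a 100 km per day limit; both are illustrative choices, not values from the presentation.

```python
import math
from datetime import date

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def implausible_moves(records, max_km_per_day=100.0):
    """records: list of (date, lat, lon) tuples for ONE collector.

    Returns the consecutive (earlier, later) pairs whose implied speed
    exceeds max_km_per_day (an assumed limit).
    """
    ordered = sorted(records, key=lambda r: r[0])
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        days = max((cur[0] - prev[0]).days, 1)  # same-day records count as one day
        distance = haversine_km(prev[1], prev[2], cur[1], cur[2])
        if distance / days > max_km_per_day:
            flagged.append((prev, cur))
    return flagged

trip = [
    (date(1820, 3, 1), -23.0, -47.0),
    (date(1820, 3, 2), -23.1, -47.1),
    (date(1820, 3, 3), 23.2, -47.2),  # latitude sign dropped
]
for earlier, later in implausible_moves(trip):
    print(earlier[0], "->", later[0])
```

A flagged pair shows only that two records are inconsistent with each other; deciding which one holds the error (location or date) still needs a human check.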

Page 11

Environmental Outliers

• Cumulative Frequency Curves
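
A cumulative frequency curve for one climate parameter can be sketched as follows. The tail cut-offs (2.5 and 97.5 per cent) and the temperature values are assumptions chosen for illustration.

```python
def cumulative_frequency(values):
    """Return (value, cumulative fraction) pairs, sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

def tail_records(values, lower=0.025, upper=0.975):
    """Values whose cumulative fraction falls in either tail of the curve."""
    return [v for v, f in cumulative_frequency(values) if f <= lower or f > upper]

# Illustrative annual mean temperatures (C) for one species
temps = [14.1, 14.8, 15.0, 15.2, 15.3, 15.5, 15.9, 16.0, 16.2, 24.7]
print(tail_records(temps))  # [24.7]
```

With any percentile rule the highest value always lands in the upper tail, so tail records are candidates for checking, not confirmed errors.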

Page 12

Acacia orites - 19 records - 9 Temperature parameters

[Figure: Reverse Jack-knife. Axis: Temperature (C), 0 to 35. Parameters shown: tann, tmncm, tmxwm, tspan, tclq, twmq, twetq, tdryq.]

Page 13

Outliers in climate space

(T=0.95(√n)+0.2)

where ‘n’ is the number of records
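
As a worked example of the threshold above, the sketch below evaluates T for a given number of records. The slide gives only the threshold, so the outlier statistic that is compared with T is left as an input here; the function names are hypothetical.

```python
import math

def critical_value(n):
    """Threshold T = 0.95 * sqrt(n) + 0.2, where n is the number of records."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 0.95 * math.sqrt(n) + 0.2

def exceeds_threshold(statistic, n):
    """Compare a pre-computed outlier statistic with T.

    How the statistic is computed is not defined on the slide, so it is
    passed in by the caller (an assumption of this sketch).
    """
    return statistic > critical_value(n)

# 19 records, as in the Acacia orites example
print(round(critical_value(19), 3))  # 4.341
```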

Page 14

FloraMap

• CIAT (Colombia)

• PCA

• Cluster Analysis

• $US100

• Modelling

• 10-minute grids

Page 15

Principal Components Analysis - FloraMap

Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data.

A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.

Page 16

Cluster Analysis - FloraMap

Image from FloraMap (Jones and Gladkov 2001) showing use of Cluster Analysis to identify an outlier in Rauvolfia littoralis specimen data.

A. Cluster Analysis. B. Principal Components Analysis. C. Mapped specimen. D. Climate profile. E. Specimen record.

Page 17

Diva-GIS

• Free

• Simple GIS

• Modelling (BIOCLIM/Domain)

• Data Cleaning Tools

Page 18

Diva-GIS – Coordinate Check

Using Diva-GIS to check coordinates by comparing a file of point specimen records (red) against a polygon of Bolivian provinces. Input dialogue box is shown at A, where it can be seen that “STATE” in the point file has been set to the equivalent “DEPARTMENT” in the polygon file (Hijmans et al. 2003).

Page 19

Points outside Polygon – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing point records that fall outside all polygons in the Bolivian provinces polygon file. The highlighted record shows the linking between the results dialogue box and the mapped record.

Page 20

Mismatched Provinces – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing point records that do not match set relationships between the specimen point file and the polygon of Bolivian provinces. In the highlighted record, the geocoding on the specimen record causes it to fall in the wrong province.

Page 21

Assign Coordinates – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing point records with geocodes automatically assigned. A. Unambiguous geocodes found by the program and assigned. B. Ambiguous geocodes identified. C. Appropriate geocodes not found.
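
The three outcomes described above (unambiguous, ambiguous, not found) can be sketched as a gazetteer lookup. This is not Diva-GIS code: the gazetteer entries, place names, coordinates, and status strings are all made up for illustration.

```python
# Hypothetical gazetteer: lower-cased place name -> list of candidate (lat, lon)
GAZETTEER = {
    "villa ejemplo": [(-17.5, -63.2)],
    "san ejemplo": [(-16.4, -64.9), (-19.1, -65.3)],
}

def assign_geocode(locality, gazetteer):
    """Return (status, result) for a locality name.

    'assigned'  -> the single matching coordinate pair
    'ambiguous' -> the list of candidate coordinate pairs
    'not found' -> None
    """
    candidates = gazetteer.get(locality.strip().lower(), [])
    if len(candidates) == 1:
        return ("assigned", candidates[0])
    if len(candidates) > 1:
        return ("ambiguous", candidates)
    return ("not found", None)

for name in ["Villa Ejemplo", "San Ejemplo", "Puerto Desconocido"]:
    print(name, assign_geocode(name, GAZETTEER)[0])
```

Ambiguous results are returned in full so that a person can choose between the credible alternatives instead of the program guessing.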

Page 22

Multiple possibilities – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing alternate geocodes for a record where use of the Gazetteer has produced a number of credible alternatives.

Page 23

Cumulative Frequency Curves - Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5 percentile.

Page 24

Bioclimatic Envelope – Diva-GIS

Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Bioclimatic Envelope from BIOCLIM to identify outliers in climate space. In this case the percentile cut off is set at 95. Red points on the envelope correspond with red points on the map, green points in the envelope correspond with yellow points on the map.
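
A percentile envelope over several climate variables can be sketched as below. This is an illustration only, not the BIOCLIM or Diva-GIS implementation: it assumes that a cut-off of 95 means the central 95 per cent of each variable (2.5 trimmed from each tail), uses linear interpolation for percentiles, and the record fields and values are invented.

```python
import math

def percentile(sorted_vals, p):
    """Linear-interpolation percentile of a sorted list, p in [0, 100]."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return sorted_vals[lo]
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def envelope(records, variables, cutoff=95.0):
    """Lower and upper bound per variable for the central `cutoff` per cent."""
    tail = (100.0 - cutoff) / 2.0
    bounds = {}
    for var in variables:
        vals = sorted(r[var] for r in records)
        bounds[var] = (percentile(vals, tail), percentile(vals, 100.0 - tail))
    return bounds

def outside_envelope(records, variables, cutoff=95.0):
    """Map record id -> list of variables on which it falls outside the envelope."""
    bounds = envelope(records, variables, cutoff)
    flagged = {}
    for r in records:
        out = [v for v in variables if not bounds[v][0] <= r[v] <= bounds[v][1]]
        if out:
            flagged[r["id"]] = out
    return flagged

# Invented records: annual mean temperature (C) and annual rainfall (mm)
records = [
    {"id": 1, "tann": 14.0, "rain": 900},
    {"id": 2, "tann": 15.0, "rain": 950},
    {"id": 3, "tann": 15.2, "rain": 1000},
    {"id": 4, "tann": 15.4, "rain": 1050},
    {"id": 5, "tann": 15.6, "rain": 1100},
    {"id": 6, "tann": 25.0, "rain": 1150},
]
print(sorted(outside_envelope(records, ["tann", "rain"])))
```

Because the bounds are percentiles of the same data, the most extreme records always fall outside; the envelope ranks candidates for review and does not by itself prove a geocoding error.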

Page 25

ANUCLIM

• $AUD1000 (with data files)

• Modelling (BIOCLIM / ESOCLIM)

• Cumulative Frequency Curves

• Parameter Extremes

Page 26

Cumulative Frequency - ANUCLIM

Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the species accumulation curve with an identified outlier (labelled “bad”). Information from the “bad” record is displayed at the top of the log file (from Houlder et al. 2000).

Page 27

Parameter extremes - ANUCLIM

Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the parameter extremes (top) and associated species accumulation curve (bottom) (from Houlder et al. 2000).

Page 28

Statistical Tests

• Outliers in Latitude

• Outliers in Altitude

• Outliers in collectors' range per day or week
  – Especially 17th, 18th and 19th century collections
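
The slide does not name a specific statistical test. As one possible stand-in for latitude or altitude, the sketch below uses Tukey's interquartile-range fences; the 1.5 multiplier and the altitude values are assumptions made for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Invented altitudes (m) for one species; the last value looks suspicious
altitudes = [120, 135, 140, 150, 155, 160, 170, 180, 1850]
print(iqr_outliers(altitudes))  # [1850]
```

The same function could be applied to latitudes; as with the other checks, a flagged value is a prompt to re-examine the record, not proof of an error.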

Page 29

Thank You…

Questions?