N "711 I V c... L. c.:.. AgRISTARS Domestic Crops and Land Cover DC-l2-04264 JSC-17829 A Joint Program for Agriculture and Resources Inventory Surveys Through Aerospace Remote Sensing August 1982 EVALUATION OF SMALL AREA CROP ESTIMATION TECHNIQUES USING LANDSAT- AND GROUND-DERIVED DATA -yo L. Amis, M. V. Martin, W. G. McGuire, and S. ~. Shen ~.~~lockheed EngineeringandMa~t Services Company, Inc. lyndon B. Johnson Space Center Houston. Texas 77058
100
Embed
711 VcL.c.:.. I AgRISTARS - USDA - National Agricultural ... · AgRISTARS Domestic Crops andLand Cover ... Space and Life Sciences Directorate ... 5.2.5 STATISTICS FOR DIRECT PROPORTION
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
N "711 I V c...L. c.:..
AgRISTARS
Domestic Crops and Land Cover
DC-l2-04264JSC-17829
A Joint Program forAgriculture andResources InventorySurveys ThroughAerospaceRemote Sensing
August 1982
EVALUATION OF SMALL AREA CROP ESTIMATION TECHNIQUESUSING LANDSAT- AND GROUND-DERIVED DATA
-yo L. Amis, M. V. Martin, W. G. McGuire, and S. ~. Shen
~.~~lockheed EngineeringandMa~tServices Company, Inc.
lyndon B. Johnson Space CenterHouston. Texas 77058
)
I. Fleoort ~. I 2. a.o..mm-.l "-ion 11Io. 3. A_tNnt's CnlIOg No.
DC-L2-04264' JSC-17829C. Tit'. and Subtit •• 5. Aeoon Daft
August 1982Evaluation of Small Area Crop Estimation Techniques Using I.••••cwmiftg Org.mmion Cod.Landsat- and Ground-Derived Data7. AuthOrlsl e. PwfOl""';'. Or,IntDtioft ,,~ No.
M. L. /lmis, M. V. Ma rt in, w. G. McGuire, and S. S. Shen LEMSCO-1759710. Work Unit No.
9. Perform,"9 O'9l"iZlltlOn ~ and Ador-.
Lockheea Engineering and Management Services Company, Inc. 11. Coft1rKt or GrMt No.1830 NASA Road 1Houston, Texas 77258 HAS 9-15800
13. Tv•• tJf Aepor'f N PwiocI eo-..12. s_,", ~ NMle and Addr-. Final Report, 1980-81National Aeronautics and Space Administration
1~. Sool_iftt ~ Cod.Lyndon B. Johnson Space CenterHouston, Texas 77058 Techni cal Monitor: R. Heydorn
15. SuDCII_t.,-y Nom
The Agriculture and Resources Inventory Surveys Through Aerospace Remote Sensing is a joint programof the u.S. Department of Agriculture. the National Aeronautics and Space Administration, the NationalOceanic and Atmospheric Administration (U.S. Department of Commerce). the Agency for InternationalOevelooment (U.S. Deoartment of State). and the U.S. Deoartment of the Interior.
II. AbItrKt
Tnis aocument describes the studies completed in fiscal year 1981 in support of theclustering/classification and preprocessing activities of the Domestic Crops and Land Coverproject of the Agriculture and Resources Inventory Surveys Through Aerospace Remote Sensingprogram. The theme throughout the study was the improvement of subanalysis district(usually county level) crop hectarage estimates, as reflected in the following three objec-tives: (1) to evaluate the current U.S. Department of Agriculture Statistical ReportingService regression approach to crop area estimation as applied to the problem of obtainingsuoanalysis district estimates, (2) to develop and test alternative approaches to subanal-ySis district estimation, and (3) to develop and test preprocessing techniques for use inimproving subanalysis district estimates.
'7. K•••• Words (SU9llllt8d tJy Authorlsll 18. Distribution Sta~
19. SIcur.ty o..if. lof this reoortl 20. s.curity Claaif. lof tI'lis ~I 21. No. of PI9ft 22. ""ce.
Unclassified Unclassified 145
I
)
DC-L2-04264JSC-17829
EVALUATION OF SMALL AREA CROP ESTIMATION TECHNIQUESUSING LANDSAT- AND GROUND-DERIVED DATA
Job Order 71-352
This report describes the activities of the Domestic Cropsand Land Cover project of the AgRISTARS program.
PREPARED BYM. L. Amis, M. V. Martin, W. G. McGuire, and S. S. Shen
APPROVED BY
--re:o It'~T. C. Minter, Jr., ManagerSupporting Research Department
LOCKHEED ENGINEERING AND MANAGEMENT SERVICES COMPANY, INC.Under Contract NAS 9-15800
ForEarth Resources Research Division
Space and Life Sciences DirectorateNATIONAL AERONAUTI CS AND SPACE ADMINISTRATION
LYNDON B. JOHNSON SPACE CENTERHOUSTON, TEXAS
August 1982LEMSCO-17597
I
)
PREFACE
The Agriculture and Resources Inventory Surveys Through Aerospace RemoteSensing is a multiyear program of research, development, evaluation, and appli-cation of aerospace, remote sensing for agricultural resources, which began infiscal year 1980. This program is a cooperative effort of the U.S. Departmentof Agriculture, the National Aeronautics and Space Administration, the NationalOceanic and Atmospheric Administration (U.S. Department of Commerce), theAgency for International Development (U.S. Department of State), and theU.S. Department of the Interior.
The work which is the subject of this document was performed by the EarthResources Research Division, Space and Life Sciences Directorate, Lyndon B.Johnson Space Center, National Aeronautics and Space Administration andLockheed Engineering and Management Services Company, Inc. The tasks performedby Lockheed Engineering and Management Services Company, Inc., wereaccomplished under Contract NAS 9-15800.
v
ISection1. INTRODUCTION
CONTENTS
Page
1.1 OBJECTIVES ••..•.•.•••.•••..•.••..•.••....•••.•.•••••••••••••••• 1-11.2 DISCUSSION OF OBJECTIVES ••••••••.••••••••.••••••••••••••••••••• 1-21.3 DESCRIPTION OF THE DATA SET •••••••••••••••••••••••••••••••••••• 1-3
2. A BRIEF DERIVATION OF THE ESTIMATORS AND ASSUMPTIONS •••••••••••••••• 2-12.1 EDITOR SUBANALYSIS DISTRICT REGRESSION ESTIMATOR ••••••••••••••• 2-12.2 THE CARDENAS FAMILY OF ESTIMATORS •••••••••••••••••••••••••••••• 2-32.3 THE CLASSY-BASED DIRECT PROPORTION ESTIMATORS •••••••••••••••••• 2-52.3.1 MAXIMUM LIKELIHOOD APPROACH •••••••••••••••••••••••••••••••••• 2-72.3.2 LEAST SQUARES APPROACH ••••••••••••••••••••••••••••••••••••••• 2-9
~ 3. THE PREPROCESSING ALGORITHMS •••••••••••••••••••••••••••••••••••••••• 3-13.1 XSTAR: AN ALGORITHM TO CORRECT LANDSAT DATA FOR THE
EFFECTS OF HAZE AND SUN ANGLE •••••••••••••••••••••••••••••••••• 3-13.2 ATCOR: AN ALGORITHM TO CORRECT LANDSAT DATA FOR THE
EFFECTS OF HAZE, SUN ANGLE, AND BACKGROUND REFLECTANCE ••••••••• 3-43.3 MLEST: A DISTRIBUTION MATCHING ALGORITHM •••••••••••••••••••••• 3-5
4.1 INTRODUCTION •.....•••••••••..•..•.•..•...••••...•.••.••.•.••••• 4-14.2 FORMULATION OF GROUPS FOR TRAINING AND TESTING ••••••••••••••••• 4-24.3 QUESTIONS ADDRESSED IN THE EVALUATION STUDIES •••••••••••••••••• 4-44.4 PREPROCESSING •......•................•...........•...........•. 4-54.5 STATISTICAL EVALUATION APPROACH ..........•.•••..••••••••.•.•••• 4-64.6 EVALUATION OF PREPROCESSORS ••.•••.••.•••.•••••••••••••••••.•••• 4-9
vii
5.1.4 ESTIMATION RESULTS FOR SOIL STRATUM 4 ••••••••••••••••••••••••5.2 RESULTS OF THE CARDENAS REGRESSION AND
CARDENAS RATIO ESTIMATION PROCEDURES •••••••••••••••••••••••••••5.2.1 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION •••••••••. 5-20
Section5. STUDY RESULTS .••••..•..••••••.•.•••..•...•••••.••......•••••.•.•••••
5.1 CURRENT SUBANALYSIS DISTRICT REGRESSION ESTIMATOR ••••••••••••••5.1.1 EXPLANATION OF GRAPHS AND TABLES •••••••••••••••••••••••••••••5.1.2 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES BY COUNTy •••••••5.1.3 BEHRENS-FISHER TEST ••••••••••••••••••••••••••••••••••••••••••
5.2.2 BEHRENS-FISHER TEST ••••••••••••••••••••••••••••••••••••••••••5.2.3 F-TESTS OF VARIANCE ••••••••••••••••••••••••••••••••••••••••••5.2.4 RESULTS OF THE CLASSY-BASED DIRECT
PROPORTION ESTIMATION PROCEDURE ••••••••••••••••••••••••••••••5.2.5 STATISTICS FOR DIRECT PROPORTION ESTIMATORS ••••••••••••••••••5.2.6 RELATIVE BIASES JF ALTERNATIVE COUNTY ESTIMATORS •••••••••••••5.3 STUDY RESULTS: PR~PROCESSING ••••••••••••••••••••••••••••••••••5.3.1 HOTELLING'S T2 TEST RESULTS ••••••••••••..••••••••••••••••••••5.3.2 ATCOR HAZE lEVELS ••.•....•.•..•........•...........•.•..•...•5.3.3 COMPARISON OF REGRESSION LINES •.•••••••••••••••••••••••••••••
6. CONCLUSIONS AND RECOMMENDATIONS ••••••.••••••••••••••••••••••••••••••7 • REF ERE NC E S ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
Table Page5-1 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENT
USDA PROCEDURE FOR BEADLE COUNTy •••••••••••••••••••••••••••••••••• 5-25-2 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENT
USDA PROCEDURE FOR CLARK COUNTy ••••••••••••••••••••••••••••••••••• 5-35-3 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENT
USDA PROCEDURE FOR CODINGTON COUNTy ••••••••••••••••••••••••••••••• 5-45-4 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENTUSDA PROCEDURE FOR HAMLIN COUNTy •••••••••••••••••••••••••••••••••• 5-55-5 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENT
USDA PROCEDURE FOR KINGSBURY COUNTy ••••••••••••••••••••••••••••••• 5-65-6 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENT
USOA PROCEDURE FOR SPINK COUNTy ••••••••••••••••••••••••••••••••••• 5-75-7 BEHRENS-FISHER T-TEST OF MEAN ESTIMATES ••••••••••••••••••••••••••• 5-18~~)5-8 CONFIDENCE INTERVAL FOR ESTIMATED BIAS: CURRENT REGRESSION
EST I MATOR ••••••••••••••••••••••••••••••••••••••••••••••••••••••••• 5-195-9 THEORETICAL AND EMPIRICAL VARIANCE ESTIMATES USING CURRENTUSDA PROCEDURE FOR SOIL STRATUM 4••••••••••••••••••••••••••••••••• 5-215-10 BEHRENS-FISHER TEST OF MEAN ESTIMATES FOR SOIL STRATUM 4 •••••••••• 5-225-11 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENAS
REGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR RANGELAND •••••••••••• 5-235-12 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENASREGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR FLAX ••••••••••••••••• 5-245-13 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENASREGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR HAY CUT •••••••••••••• 5-255-14 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENAS
ESTIMATOR AND RATIO ESTIMATOR FOR ALFALFA ••••••••••••••••••••••••• 5-265-15 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENAS
REGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR GRASS •••••••••••••••• 5-275-16 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENAS
REGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR OATS ••••••••••••••••• 5-28)
;x
Table Page
\\
5-17 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENASREGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR WHEAT •••••••••••••••• 5-29
5-18 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENASREGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR CORN ••••••••••••••••• 5-30
5-19 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENASREGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR SUNFLOWERS ••••••••••• 5-31
5-20 BEHRENS-FISHER T-TEST OF MEAN ESTIMATES: CARDENAS REGRESSIONESTIMATOR •••••••••••••••••••••••••••••••••••••••••••••••••••••••••
5-21 BEHRENS-FISHER T-TEST OF MEAN ESTIMATES: CARDENAS RATIO.ESTIMATOR •••••••••••••••••••••••••••••••••••••••••••••••••••••••••
5-22 CONFIDENCE INTERVAL FOR ESTIMATED BIAS: CARDENAS REGRESSION
5-33
5-34
EST I MATOR ••••••••••••••••••••••••••••••••••••••••••••••••••••••••• 5- 35
5-23 CONFIDENCE INTERVAL FOR ESTIMATED BIAS: CARDENAS RATIOESTIMATOR ••••••••••••••••••••••••••••••••••••••••••••••••••••••••• 5-36
5-24 F-TESTS OF VARIANCE ••••••••••••••••••••••••••••••••••••••••••••••• 5-385-25 BIAS, MEAN SQUARED ERROR, AND F-RATIO USING THE MAXIMUMLIKELIHOOD APPROACH ••••••••••••••••••••••••••••••••••••••.••••••••5-26 BIAS, MEAN SQUARED ERROR, AND F-RATIO USING THE LEAST SQUARES
5-415-27 RELATIVE BIAS OF ALTERNATIVE COUNTY ESTIMATORS •••••••••••••••••••• 5-425-28 EDITOR WITHOUT PREPROCESSING •••••••••••••••••••••••••••••••••••••• 5-445-29 EDITOR WITH XSTAR PREPROCESSING - SINGLE HAZE CORRECTION USED
FOR BOTH ANALYSIS DISTRICT SAMPLE AND COUNTy •••••••••••••••••••••• 5-455-30 EDITOR WITH XSTAR PREPROCESSING - ANALYSIS DISTRICT AND COUNTY
SEPARATELY CORRECTED FOR HAZE ••••••••••••••••••••••••••••••••••••• 5-465-31 EDITOR WITH ATCOR PREPROCESSING ••••••••••••••••••••••••••••••••••• 5-475-32 EDITOR WITH MLEST PREPROCESSING ••••••••••••••••••••••••••••••••••• 5-485-33 EDITOR WITH MLEST PREPROCESSING WITH TRUE PROPORTIONS ••••••••••••• 5-495-34 STRATUM 12 HOTELLING'S T2 RESULTS OF 25 SEGMENTS IN BEADLE
cou NT Y •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• 5 - 51
5-35 STRATUM 12 HUTELLING'S T2 RESULTS OF 20 SEGMENTS IN KINGSBURYCOUNTY •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• 5-51
x
~ Table Page
5-36 CROP PROPORTIONS OF 75 SEGMENTS IN ANALYSIS DISTRICT •••••••••••••• 5-545-37 CROP PROPORTIONS OF 25 SEGMENTS IN BEADLE COUNTy •••••••••••••••••• 5-545-38 CROP PROPORTIONS OF 20 SEGMENTS IN KINGSBURY COUNTy ••••••••••••••• 5-555-39 MLEST TRANSFORMATION MATRIX A AND VECTOR B FOR BEADLE AND
KINGSBURY COUNTIES •••••••••••••••••••••••••••••••••••••••••••••••• 5-565-40 ATCOR-MEASURED HAZE LEVELS •••••••••••••••••••••••••••••••••••••••• 5-575-41 F-TEST FOR HOMOGENEITY OF VARIANCES ••••••••••••••••••••••••••••••• 5-595-42 EQUALITY OF TRAIN AND TEST REGRESSION LINES ••••••••••••••••••••••• 5-59
)
)xi
,) FIGURES
Fi gure Page5-1 Variance versus I(C) for rangeland in Beadle County •••••••••••••••• 5-85-2 Variance versus I(C) for sunflowers in Beadle County ••••••••••••••• 5-95-3 Variance versus I(C) for corn in Beadle County ••••••••••••••••••••• 5-105-4 Vari ance versus I(C) for wheat in Beadle County •••••••••••••••••••• 5-115-5 Variance versus I(C) for oats in Beadle County ••••••••••••••••••••• 5-125-6 Vari ance versus I(C) for grass in Beadle County •••••••••••••••••••• 5-135-7 Variance versus I(C) for alfalfa in Beadle County •••••••••••••••••• 5-145-8 Vari ance versus I(C) for hay cut in Beadle County •••••••••••••••••• 5-155-9 Variance versus I(C) for flax in Beadle County ••••••••••••••••••••• 5-16
)x;;;
)
AgRISTARS
CRDDC/LC
ERIMFYMSEMSSSRSSSEUSDA
ABBREVIATIONS AND ACRONYMS
Agriculture and Resources Inventory Surveys Through AerospaceRemote SensingCrop Report Distri ctDomestic Crops and Land CoverEnvironmental Research Institute of Michiganfiscal yearmean squared errormultispectral scannerStat ist ica 1 Report ing Servi cesum of squared errorU.s. Department of Agriculture
xv
1. INTROOUCTIUN
1.1 OBJECTIVESA major objective of the Statistical Reporting Service (SRS) of theU.S. Department of Agriculture (USDA) is the generation, with measurable pre-cision, of accurate area estimates for crops and other land cover types. Theareas of interest are national, regional, state, and various substate areasSUCh as crop reporting districts (CRO's), groups of counties, and individualcounties; currently, regression estimation is the method used, with landsatclassification results as the auxiliary variable of the estimator, and ground-observed data or ground truth from SRS operational surveys as the primaryvariable of the estimator. The ground truth is obtained by interviewing farmoperators located in randomly selected areas of land called SRS segments. Theregression estimator is defined over an analysis district, which is an area(usually a group of contiguous counties) in which the landsat acquisitionsused for estimation are the same for every point in the area. The area is"large" in the sense that it contains a sufficient number of ~S segments toreliably calculate regression coefficients.
This report documents the work done during fiscal year (FY) 1981 in the clus-tering and/or classification and preprocessing activities of the DomesticCrops and land Cover (DC/lC) project of the Agriculture and ResourcesInventory Surveys Through Aerospace Remote Sensing (AgRISTARS) program. Theobjectives of the research undertaken were threefold:1. To evaluate the current ~S regression approach to crop area estimation
when the area of interest is a single county or a small group of countiescalled a subanalysis district.
2. To develop and test new approaches to subanalysis district estimation.3. To develop and test preprocessing techniques for use in improving sub-
analysis district estimation.
1-1
)
1.2 DISCUSSION OF OBJECTIVESA subanalysis district is a subarea (usually a county) of an analysis districtin which there is an insufficient number of SRS segments to reliably calculateregression coefficients.
The regression estimator can produce unbiased estimates with measurable preci-sion for analysis districts; however, when the estimator developed over ananalysis district is applied to a subanalysis district, it can be biased. Theintent of the evaluation proposed in the first objective was to examinebiasness and the applicability of an SRS-formulated estimator of the variance.The study consisted of empirically estimating the bias and variance of thesubanalysis district estimator using a repeated sampling method. Reliableestimates of bias were thought to be possible because of an abundance ofground truth in some subareas. The empirical estimate of,the variance wouldbe compared to the formula-derived estimate, and, if possible, an improvedsubanalysis district variance estimator would be suggested.
An alternative regression approach developed by Manual Cardenas (ref. 1) wasevaluated. The Cardenas family of estimators (section 2.2) was derived par-ticularly for the case of small area estimation. Under certain assumptions,expressions for bias and variance of the estimators had been derived. Anotherclass of estimators, referred to as direct proportion estimators, were alsostudied. These estimators did not depend on classification, but they estima-ted the posterior probability of a pixel belonging to a crop class. It washoped that this approach would reduce bias, as well as variance, at the countylevel.
The focus of the preprocessing task was to effect some preliminary assessmentof various preprocessing algorithms, which were developed in other studies toremove or reduce the variations in multispectral data resulting from changesin spectral signatures caused by sun angle, atmospheric conditions (includingthe presence of aerosals and water vapor), and background reflectance.
1-2
1.3 DESCRIPTION OF THE DATA SETThe data set used was from a six-county area in South Dakota, which comprisedapproximately 40 percent of one Landsat scene and which was previously used bythe USDA in a soil study. The original data set included data from252 segments; each segment was 65 hectares (160 acres, or one-fourth squaremile) in area and had been chosen independently from 10 soil strata.Ground-truth data for these segments and registered Landsat data for twodates, July 26 and August 25, 1979, were supplied by the USDA. In itsestimation procedure, the USDA typically uses 259-hectare (1-square-mi1e)segments randomly selected from land use strata. Because some soil stratawere oversamp1ed, resamp1ing of the segments was necessary to more closelysatisfy the requirements of this study. After resamp1ing, 200 segments wereavailable for the data set. These segments contained nine crop types that hadsufficient numbers of pure pixels to train the classifier. There was somedoubt concerning the sufficiency of the South Dakota data for estimating biasand variance using repeated sampling methods (see section 5).
1-3
2. A BRIEF DERIVATION OF THE ESTIMATORS AND ASSUMPTIONS
2.1 EDITOR SUBANALYSIS DISTRICT REGRESSION ESTIMATORA subanalysis district regression estimator was proposed by Huddleston and Ray(ref. 2), and it is the one referred to throughout this document as thecurrent county-level estimator. It is, essentially, an analysis districtregression estimator applied to a subarea of that analysis district; that is,regression coefficients are estimated using samples from the analysisdistrict, whereas the mean being estimated is from a subpopulation of theanalysis district. If the subpopulation is a set C of c counties (a sub- "analysis district) then the separate form of the regression estimate of thetotal hectarage for Cis:
(1)
whereNk,c = the total number of area-frame units (segments) in the kth stratum for
the set C of c counties)
= the total number of strata for the set C of c counties= the average hectarage per sample unit from the ground survey for the
kth stratum for the crop of interest
bk = the estimated regression coefficient for the kth stratum whenregressing ground-truth hectarage on classified pixels for the nksampl e uni ts
2-1
the average number of pixels per area-frame unit for all units in thekth stratum for the set C of c counties that have been classified intothe crop of interest
xk = the average number of pixels per sample unit in the kth stratum thathave been classified into the crop of interest
=
The estimated variance of YREG,c has been proposed to be
v(VREG,C)Lc
2 (\- n~s2 (nk - 1)= l: Nk,c nk - 2 •k=l k,y nk
)(1 - r~) +.L +
(Xk,c - xk)2I(C) nk nk
~ (Xk· - Xk)2J=l J
whereNk = the total number of area-frame units in the kth stratum
nk = the number of sample units in the kth stratum
S2 = the sample variance for the reported hectarage for the kth stratumk,y
n ( - )2= ± y kj - Y kj=l nk - 1
r; = the sample coefficient of determination for the kth stratum
)
2-2
(2)
I(C) = 1 if C is a subset of the regression domain= 0 if C is the entire regression domain
When I(C) = 1, the above variance formula is derived by treating the part of Ccontained in the kth stratum as a single (fictitious) segment in which thenumber of pixels classified as the crop of interest is Xk • This is,cequivalent to assuming that there is no variation at all for the actual seg-ments in C. If there is such variation, then it is believed that the varianceformula overestimates the variability of the subanalysis district regressionestimator. Comparing the empirical variances with those obtained from thevariance formula appears to substantiate this belief. For all of the majorcrops and for almost all of the minor crops, the empirical estimate ofvariance tends to be much closer to the formula variance for I(C) = 0 than forI(C) = 1, with most of the empirically observed values of I(C) falling in theinterval [0, .1]. These results are found in section 5.
2.2 THE CARDENAS FAMILY OF ESTIMATORSOne of the problems encountered in estimating crop hectarage in a subanalysisdistrict is that there may be few or no sample segments with which to obtainunbiased estimates of the mean hectarage per segment in the subanalysis dis-trict. Consider, for example, the six-county South Dakota area, and let Ckdenote one of the counties. If Ykh is the population mean hectarage per seg-ment of a crop in land-use stratum h and in Ck, then the total for county kwould be
(3)
where~ = denotes the summation over all strata in county kh e:C k
Mkh = the total number of segments in the hth stratum within county k
2-3
~ An unbiased estimate of the Ykh may not be possible if few sample segmentsbelong to Ck; however, the analysis district does presumably contain suffi-cient sample segments to estimate Yh, the population mean crop hectarage persegment in stratum h. Thus, if the assumption that Ykh = Yh were made, thetotal for county k would be estimated by
(4)
where
=.LNhV* 1: tihVih ' an unbiased estimate of Yhh nh i=1
tihthe hth stratumYih = 1: y. h .It.h ' the sample mean per segment of the area inj=1 1 J 1
within countytih = the number of segments in the sample of the hth stratum withi n county i
) nh = the number of counties in the sample of the hth stratum
Nh = the number of counties in the hth stratum.
Recognizing that the above assumption is not satisfied in general, Cardenas,Craig, and Blanchard (ref. 1) defined a family of county-level estimatorsusing the classified pixels in each county and stratum as the auxiliary data.The family of estimators (referred to herein as the Cardenas family of esti-mators) for the kth county is given by
(5)
whereXkh = the mean number of pixels classified as the crop in question for the hth
stratum within county kXh = the mean number of pixels classified as the crop in question for the hth
) stratum
2-4
If Xkh is greater (less) than Xh, then the mean area estimate should beincreased (decreased) by an amount proportional to this difference. Itfollows that Bh should be positive.
If classification is such that Yihj = Axihj' where A is some constant, thenusing Bh = Yhtrh y!elds an unbiased estimator (referred to as the Cardenasratio estimator), Yrk of Yk•
Using a method similar to least squares estimation, estimates
yields an unbiased estimator (referred to as the Cardenas regression estima-.•.tor), Ysk of Yk when Yihj = a + bhXihj' where a and bh are constants •
.•. .•.The variances for Yrk and Ysk were derived by Cardenas et ale (ref. 1). Ifthe assumption is made that the within-county variance is equal for all coun-ties, then unbiased estimates of the variances were also given by Cardenaset a1•
2.3 THE CLASSY-BASED DIRECT PROPORTION ESTIMATORSOne of the objectives of this study is to develop improved county-level croparea estimators. This may be achieved by modeling the county-level probabil-ity distribution as if it came from a mixture of distributions.
The general mixture model is given bym
f(x) = L a.p(xli). 1 11=
where
2-5
(6 )
p{xji) =
a·1the probability density for distribution i
= the proportion of distribution i in the mixturem = the number of distributions in the mixturef{x) = the mixture probability density for a spectral value x
Applying the CLASSY clustering algorithm (ref. 3) to the unlabeled county-level data, it is possible to estimate m, p{xli), and ai for i = 1,···,m. Theproblem which remains is how to associate a crop label with each of the dis-tributions, p{xli). This distribution labeling problem is the subject of asignificant amount of ongoing research. Lennington and Terrell (ref. 4)described a maximum likelihood estimator for the proportion of a given distri-bution composed of a specific class. Chittineni (ref. 5) presented thismaximum likelihood result and a similar result based on a probability ofcorrect labeling criterion. Heydorn, Lennington, and Myers (ref. 6) presenteda least squares, or regression, approach to this same problem. In each ofthese approaches the model is
)lTk = t 8·k
a. + €i=1 1 1
wherelTk = the proportion of crop type k in the county of interest~k = a set of "fitting" coefficientsai = the mixture proportions described previously€ = error
(7)
)
Heydorn, Lennington, and Myers (ref. 6) have pointed out that this approachmay be considered a generalization of stratified proportion estimation.Chittineni (ref. 5) observed that if the 8ls are restricted to either 0 or 1(true distribution labeling), then the maximization problem may be solvedexactly for the case of two or three subcrop types using an exhaustive searchstrategy.
2-6
"-All of these techniques for estimating the 8ik coefficients require that a )small subset of labeled pixels be available. One way to select this subset oflabeled pixels is to choose pixels from only those segments within the countyof interest. This technique may not be feasible if the number of segments inthe county of interest is small. Therefore, it seems appropriate to choosepixels also from segments within the county and adjacent to the county.
Not all of the approaches to obtaining estimates of the 8ik were evaluated.The chosen candidates were the maximum likelihood approach and the leastsquares, or regression, approach, both of which will now be discussed in moredeta il•
2.3.1 MAXIMUM LIKELIHOOD APPROACHSuppose that the CLASSY clustering algorithm is applied to approximate themultivariate mixture density of the data in the county of interest. Thisresults in a set of multivariate normal densities, p(xli), i = l,···,m, and aset of prior probabilities, ai' i = l,···,m. Now, suppose that there is a setof data points, Xj' j = l,···,n, and let the random variable e be the classlabel which takes on values of i = l,···,c. The joint probability of observingdata point Xj associated with label e = i may then be formulated as follows.
mp(x.,e = i) = L a.P(x.,e = t Ii)J i=l 1 J
m= L a.P(e = tlx.,i)P(x./i) (8 )
i=1 1 J J
Assume that p(e = tIXj,i) = p(e = tli) = 8ti, which means that the labeledrandom variable e is conditionally independent of the observation Xj; i.e.,given that one is sampling from distribution i, no further information aboutthe class label is conveyed by knowing Xj.
2-7
Under this assumption, the proportion of class t may be estimated as
(9)
)
and ati may be interpreted as the proportion of distribution i that iscomposed of class t.
Now, a maximum likelihood approach may be used to estimate ati, assuming thatall ai and P{xjli) are given.
cGiven a random sample of N (= L Nt) 1abel ed data points from the county of
t=1interest, the likelihood function is
(10)
where x. , jt = 1, ···,Nt represents those data points labeled as coming fromJtclass t.
Under this mixture model, the likelihood function L may be written
c Nt mL = II II L a.at·p{x. Ii)t=1 jt=l i=1 1 1 Jt ( 11)
cTo maximize L subject to the constraints L ati = 1 for i = 1,"·,m is equiv-
t=1alent to maximizing the following function
t ~.(t a9, i - ~i=1 1 9,=1 I (12)
)where Tl., = 1, ·",m, is the Lagrange multipl ier.
1
2-8
aFMaximizing with respect to 6ti, a solution of ~ = 0 is given byti
where
(13)
(14)
Therefore, ati can be approximated using a fixed-point iteration scheme begin-ning with a . = 1 t = 1 ••• c i = 1,···,m. Once the solution of ao1' is
tl C' " , '"obtained, the proportion of class t can be estimated as
(15)
2.3.2 LEAST SQUARES APPROACHSuppose again that the CLASSY clustering algorithm has been applied to approx-imate the multivariate mixture density of the data in the county of interest.This results in a set of multivariate normal densities, p(xli), and priorprobabilities ai' i = 1, ···,m. The model considered in this case is aregression model where ati are just constants to be estimated, viz,
mp(e = tlx.) = I: at·p(i/x.) + g
J i=1 1 J
wherep(e = tlxj) = the posterior probability that Xj belongs to crop type tp(i IXj) = the posterior probability that Xj belongs to distribution i
= error
(16)
Now, the standard least squares techniques may be used to estimate ati• Thecriterion function to be minimized is
2-9
where
mK = IP{e =tlx.) - L a .P{ilx.)IFJ i=1 tl J
(17)
U-IIF ='V f{_)2dF, and F is the cumulative distribution of the mixture density_
To minimize K is equivalent to minimizing
2 m 2K = a p{e = tlx.) - L a oP{ilxo)uFJ i=1 t 1 J
j-~(e =1Ix.) - f. e .P(1Ix.02if. a.P(x.IJdX. (18)J i=1 tl Jj ~=1 1 J J J
Minimizing with respect to ati' the solution ism
a 0 = L: q·kE[p{e = tlxo)P{klx.)]t 1 k= 1 1 J J
where qik is the ikth element of the inverse of the matrixH = E[P{i IXj)P{klxj)].
;
Given a random sample of labeled data points and associated labels(xj,e = tj), j = 1,-" ,n, where tj € {l, ••• ,c}, ati ~an be estimated by
m ..• 1 nati = L qik n ?: lj/t(tJo)P{klxJ.)k=1 J=1
where
11ift=to_ J
o otherwi se= the ikth element of H-1 and the ikth element of H is
1 nn .L P{ilxJo)P{klxJo)J=l
2-10
(19)
The proportion of class 1 is then estimated by
2-11
(20)
3. THE PREPROCESSING ALGORITHMS
3.1 XSTAR: AN ALGORITHM TO CORRECT LANDSAT DATA FOR THE EFFECTS OF HAZE ANDSUN ANGLEThe XSTAR preprocessing algorithm is based on the Environmental ResearchInstitute of Michigan (ERIM) radiative transfer model for an atmosphere withno absorption.
Letting primes denote a desired standard condition, the optical thickness foreach multispectral scanner (MSS) channel I is represented as follows:
where•L RI = the Rayleigh optical thickness in channel I
(21)
aIY' = the aerosol optical thickness in each channel so that y' is a scalarmeasuring the amount of haze in the atmosphere in a hypotheticalspectral band for which aI = 1
aI = a function of the channel, independent of atmospheric haze
For Landsat-2 data, channels 1 through 4,
,\i
1.26801.0445
g, =.9142.7734
(22)
The values for aI were calculated from the estimated Landsat in-band opticalthickness for an atmosphere with a horizontal visual range of 23 kilometers(14.38 miles), which is relatively clear.
Similarly, for an observed condition, the optical thickness is(23 )
3-1
However, the Rayleigh optical thickness is independent of atmospheric haze;so,
and (24 )
The change in optical thickness from the standardized condition to be observedis then measured by y.
If XI is the observed and Xi is the standardized Landsat radiance value inchannel I, and assuming that other variables in the radiative transfer equa-tion are restricted so that only atmospheric optical thickness is significant(ref. 8), a correction equation is obtained:
(25 )
In general, both Xi and P{aIA) are functions of scanner geometry, illuminationand viewing geometry, optical thickness, and the background albedo of thestandardized conditions.
Excluding the higher order terms, represented by the polynomial functionP{afY)'
aIYX' = e XI I
or (26 )
Then X* specifies a point, or an origin, in the signal space relative to whichthe remainder of the signal space expands or contracts according to the effectof each multiplicative factor. For a sun angle of 39°,
)
61.9X* = 66.2
83.233.9
3-2
(27)
To apply the XSTAR preprocessing algorithm to Landsat data, Y, a measurementof the amount of correction required, must be found. It is assumed thatLandsat data distributions lie in a two-dimensional hyperplane in four-dimensional data space and the hyperplane position shifts with atmospherichaze. The XSTAR algorithm uses the tasseled-cap yellowness direction y* as ameasure of the component of the shift, which is perpendicular to the usualorientation of the plane. For the standardized condition, the average signal
,.value measure in the Y direction is
y* = -11.2082 Landsat counts (28)
A value for y that will shift the mean :ignal value CfI) is calculated so thatthe mean corrected signal value in the Y direction will be Y*.
Y* -_ ~ ~ a IY U 6 X ( 1 a IY) x*J"Ye - -I + - e I I1=1 Uo(29 )
with un = cos 39° and Uo = the cosine of the sun angle at time of acquisition.
aIYIf e is expanded and third and higher order terms are ignored,
4 2 [u~a = L a -xI=1 I U 0 I
4 ~~ XiJY I (30)b = L aI -XI-1=1 Uo
4 L~'Jc = L -x IYI - y*1=1 Uo
Then -b [ nJ (31)Y"'a1-1--;TFor extremely hazy conditions, 1 - 2ac/b2 may be negative and the square root;s set to zero; i.e.,
Y = .:!a
3-3
3.2 ATCOR: AN ALGORITHM TO CORRECT LANDSAT DATA FOR THE EFFECTS OF HAZE, SUNANGLE, AND BACKGROUND REFLECTANCE
The ATCOR algorithm is designed to simulate the effects of target reflectancePI' sun angle e, haze level TH' and average reflectance of adjacent areas PIon the radiance of a target as measured by Landsat in a channel I, and tocorrect for them. ATCOR assumes that radiances measured by the sensor can bemodeled by
(32)
where LI is the response in band I and AI and BI are coefficients for channel Iwhich depend on VI' eO' and TH•
An atmospheric model was developed for use with ATCOR; the VandeHust methodwas then used to compute, for a range of wavelengths, the radiances gatheredby the MSS for a range of values for VI' eO' TH, and PI. These values arerepresented by a table in ATCOR.
) Generally, eO is known, but PI and TH are not.PI can be calculated from the table.
However, if TH is known, then
)
The ATCOR program estimates TH, computes VI' and interpolates using the tablesfor AI~I,eO,TH) and BI(~I,eO,TH) to find the correction coefficients whichcan be used to make the desired corrections.
The atmospheric model consists of two homogeneous layers: a Rayleigh scatteringmolecular layer on top and a Mie scattering haze layer next to the Earth1ssurface. Most haze is present in this region. The method used to determine thelevel of haze present actually estimates the total effect of all aerosols in theatmosphere and does not distinguish between haze and cirrus clouds. However,because the model assumes that this contribution is from haze particles in thelower atmosphere, the correction is less than optimal. Water vapor and othergaseous absorption are neglected.
3-4
The ATCOR program assumes that it is possible to obtain an estimate for the ~actual reflectance of the darkest pixels in a Landsat image and that the pres-ence of haze will brighten the corresponding measurement at the sensor. Theprocedure for obtaining this estimate is discussed in reference 7. Theatmospheric model indicates that the effect of haze is greatest in channel 1.The average channel 1 value for the darkest pixel in each scan line is computed(Xmin). The reflectance of the darkest target is known or is set to a defaultvalue. From these values, the haze level which causes such a change between theactual or default (darkest) reflectance and the observed ~in is interpolated inthe table. That value is the estimate for TH•
AI and BI may then be obtained from the table and the correction applied.
If primes denote the desired standard sun angle, haze level, and averagebackground reflectance, then:
(33)
where
lJ - cos 6.o -Xi is the new radiance value for pixel X, channel!.
3.3 MLEST: A DISTRIBUTION MATCHING ALGORITHMThe MLEST algorithm is a statistical approach to finding an affinetransformation of the form
\,;
Y = AX + B (34)
which transforms clusters of normal distributions in the MSS signal space froma training area in a manner which best describes the clustering of distribu-tions in a recognition area.
3-5
The objective of this approach is to model atmospheric and background effectsusing a maximum likelihood algorithm to develop a transformation matrix A anda vector B, in which the matrix A is not restricted to a diagonal matrix.This allows the estimated changes in a single MSS channel to be expressed as aweighted sum of the ensemble of channels rather than as a scalar transforma-tion of only the data in that particular channel. This transformation is ableto correct for haze differences and for any other affine transformationspresent in the data, regardless of origin. The primary advantages of MLESTover the XSTAR and ATCOR algorithms are that the nondiagonal terms of thetransformation are included, and it is not necessary to make assumptions aboutminimum haze pixels.
The following procedure is used to evaluate the performance of the MLESTalgorithm.1. Obtain unlabeled clustering statistics for a training area. The overall
probability density function for accomplishing this is
where
MP (XJo) = L: a. p (X . Ii)
i=1 1 J(35)
)
M is the number of clusters in the training area,Xj is the jth pixel in the training areaa. is the proportion of the ith distribution in the training area
1
2. Use these statistics and the MSS channel data from the recognition area asinputs to the MLEST algorithm. The MLEST algorithm estimates an affinetransformation of the training statistics and the a priori clusterprobabilities which maximize the likelihood function.
3. Transform labeled statistics from the training area using the computedaffine transformation:
3-6
..• A
J!iR = AlJ· + B-ITA A AI
L:iR= AI:. A1T
wherethe subscripts Rand T refer to the recognition and training areas,respectively •.•.A is the estimated transformation matri x..•B is the estimated transformation vectorlJ· is the mean vector for the ith distribution-1
L:. is the covariance matrix for the ith distribution1
4. Use the transformed and labeled statistics to classify and label thepixels in the recognition area.
3-7
(36)
4. EXPERIMENT DESIGN DESCRIPTION
4.1 INTRODUCTIONThe approach to be used was to estimate empirically the bias and variance ofthe estimator by repeated sampling. In order to implement this approach, itwas necessary to determi ne the appropri ate number of segments from the analy-sis district needed for training the classifier, that is, to determine thesize of a training group. In addition, the number of training groups to beused had to be determined, and, if possible, these training groups madepairwise disjoint. Each training group would be used in the following ways:1. A classifier would be trained using the segments in the training group.2. The segments in the training group would be classified.3. The regression coefficients for an estimator would be estimated using the
ground-truth hectares and the number of classified pixels in the traininggroup segments.
4. A given subanalysis district would be classified, and an estimate would beobtained of crop area in the subanalysis district, using the regressionestimator in number 3.
The estimates of crop area obtained from the training groups would be used tocalculate a sample estimate of variance and a mean estimate of crop area.The sample estimate of variance would be compared to the formula-obtainedvariance; and, as a measure of bias, the mean hectarage estimate would becompared to the direct expansion estimate, based only on the ground-truthsegment data from the subanalysis district being estimated.
There was some question about the sufficiency of the South Dakota data forestimating bias and variance using the repeated sampling method justdescribed. For comparison purposes, such a procedure should use repeatedindependent selections of segments for training; that is, the training groupsshould not overlap. A preliminary test study explored the issue of requiringtraining groups large enough for classification accuracy while at the same
4-1
\/
,time needing nonoverlapping training groups for the empirical estimation ofbias and variance and for the use of subsequent statistical tests. This studyis described in section 4.2.
4.2 FORMULATION OF GROUPS FOR TRAINING AND TESTINGGiven the requirements imposed on the training groups by the repeated samplingmethod, a preliminary study was made to determine the appropriate size of thetraining groups for reliably estimating the mean and variance of the estima-tor. Some problems were anticipated and are now described.
The 252 65-hectare (one-fourth-square-mile) segments were obtained by samplingwithin-soil strata instead of land-use strata. Resampling, which wasnecessary because some strata were oversampled, reduced the number ofavailable segments to 200. Ideally, a large number of independently selectedand nonoverlapping groups of segments should be used with repeated sampling todo the empirical estimation. Because classification was also carried out in
.) this study, each nonoverlapping group had to contain a sufficiently largenumber of segments to train the classifier. If the number of availablesegments is fixed, the number of segments within each nonoverlapping groupdecreases as the number of groups increases. Thus, if there were enoughnonoverlapping groups to do the empirical estimation, these groups might notcontain enough segments to adequately train the classifier. On the otherhand, if there were enough segments in each nonoverlapping group to do thetraining, there might not be enough groups to do the empirical estimation.Therefore, it was apparent that, in order to have enough segments in eachgroup to obtain acceptable classification performance and enough groups toconduct the empirical estimation of the variance, the use of overlappinggroups was unavoidable. The training groups were determined with theseconstraints in mind. The county-level estimators were based on the require-ments that a simple random sample of segments would be chosen within eachland-use stratum and that the CLASSY clustering algorithm assumes a simplerandom sample from the population; thus, each of the soil strata that was
4-2
oversampled was resampled so that the new sample size for each stratum wouldbe proportional to the area in that particular stratum. (Simple randomsampling of a population is nearly equivalent to stratified random samplingwith proportional allocation.) After resampling. 200 segments were left inthe six-county area. These 200 segments were used to train the classifier.which was to be used as a benchmark in evaluating any other classifiersobtained in the repeated sampling process.
Previous experience in the FY 19BO DC/LC project indicated that seventy-five65-hectare (one-fourth-square-mile) segments probably contained a sufficientnumber of pixels to train a classifier. Thus. the 200 segments were randomlypartitioned into B sets containing 25 segments each and were denotedSit i = It"""tB• The training groups were formed by combining three sets at atime so that the intersection of any two training groups would be at most oneset of 25 segmentst and each would be used in exactly three different traininggroups. The collection of training groups used in this study is as follows:
Some of the advantages of combining the partitions to obtain training groupsinstead of using simple random sampling are:1. The maximum number of overlapping segments in any two groups can be
controlled.2. Each segment is chosen the same number of times, whereas in simple random
sampling some segments may never be chosen and some could be chosen morethan the others.
Each of the training groups was used to train a classifier. Then the entiresix-county area was classified and county-level estimates were obtained. Thevariability of the eight classifiers was examined, and the performances werecompared with the benchmark of training on all 200 segments.
4-3
The criteria on which the collection of training groups was accepted were:1. Individual classification performances did not depart significantly from
t he benchmark.2. The number of groups was large enough to provide reliable empirical
estimation.
4.3 QUESTIONS ADDRESSED IN THE EVALUATION STUDIESThe evaluation study for the current county-level estimator addressed twoquestions.
First, when the value I(C) = 1 is used in the variance formula, the resultingnumber was believed to be an overestimate of the variability of the currentcounty-level estimator. Recall from section 2.1 that I(C) = 1 is used when-ever C is a proper subset of the regression domain and is equivalent toassuming that there is no variation at all for the segments in C (the county).If C is the entire regression domain, then I(C) = 0; and the estimator issimply the current analysis district regression estimator. Obtaining anempirical estimate of the variance and comparing it to the formula varianceusing different values for I(C) would be a means of examining the aboveassumption and also of estimating a more realistic value for I(C).
The second question was whether or not the current county-level estimator wasan unbiased estimator of the total crop hectarage for a county. To answerthis question decisively would require knowing the true crop hectarage for acounty, and this information was not available. Instead, the standard used
'"for comparison was the direct expansion estimator for the county, Y = N x y,where N is the total number of possible segments in the county and y is thesample mean crop hectarage per segment for the given county.
The particular alternative county-level estimators that were evaluated were ofinterest because of the approaches that were taken in hectarage estimation atthe subanalysis district level. The Cardenas family of estimators compares
) the average number of crop pixels per segment in a given stratllTlto the
4-4
average number of crop pixels per segment in the given county in that stratumand adjusts the mean area estimate by an amount proportional to thisdifference. Therefore, it was desirable to compare the performance of two ofthe members of this family to the current county-level estimator as well as tocompare the performances of these two Cardenas estimators. In addition, theonly variance formula available for this family makes the assumption that, forall counties, the within-county variances are equal. To compare the empiricalestimate of variance to the formula variance as an indication of the validityof this assumption was also a desirable objective.
The two direct proportion estimators offered the possibility of estimatingcrop hectarage in a county using only the county Landsat data and relativelyfew ground-truth pixels. Another advantage was that the classification ofeach pixel is not done directly as is necessary for the regression type esti-mators. For making comparisons, however, this was also a disadvantage in thisstudy. For each county, one estimate of a crop proportion was obtained ratherthan eight estimates using eight classifier training groups. Questions re-garding the size of bias and variance were answered by using the proportionsand variances generated by the simple random sample approach (section 5.3) asthe standard.
4.4 PREPROCESSINGThe objective of the preprocessing study was to see if candidate preprocessingalgorithms applied to analysis district Landsat imagery have the capability forimproving crop area estimates at the county level when few (or no) trainingsegments are available from that county. Three preprocessing algorithms werechosen for study based on results of the Large Area Crop Inventory Experiment:XSTAR, ATCOR, and MLEST (see section 3). The XSTAR and ATCOR algorithms arehaze-correction models which transform the analysis district and the county tobe estimated to correct for the presence of haze and/or background effects andto make them look spectrally similar to the classifier. The MLEST algorithmtakes distributions present in the analysis district and estimates an affineshift correction which matches them to distributions from the county. A
4-5
)
transformation is obtained which may then be used on the statistics in theclassifier before classifying the county.
Ideally, sample segment data chosen from the analysis district (a six-countyarea) would be used as the training set on which to develop the regressionestimator, and an entire county would be used as the test area. However,ground truth was available only for sample segments in the county. In thisstudy, we tested whether preprocessing improves the estimates for a samplefrom a county, rather than whether it improves an actual county estimate.
In order to address the worst possible case, two test areas (sample segmentsfrom Beadle and Kingsbury counties, South Dakota) that did not overlap thetraining set were chosen. This effort not to duplicate sample segments fromthe training set in the county was made for two reasons: First, to achievedistinct test and training groups for the F-test and the Hotelling T2 test;and, second, to provide the IIworstllcase, where no sample segments from thearea of interest were available for tra~~ing. This selection also fulfilledthe requirement that sample segments from surrounding areas be available fortraining; the surrounding segments in this case were other training groupsegments that were i~ Beadle and/or Kingsbury Counties.
After estimates for the county samples were obtained by the USDA EDITOR systemand by the USDA EDITOR with MLEST, ATCOR, and XSTAR preprocessing, a compari-son was made to see which method produced estimates closer to the true ground-truth proportions.
The purpose was to ascertain if one of the preprocessing methods had in someway made the regression estimator, which was developed over the analysisdistrict, appropriate and accurate at the county level.
4.5 STATISTICAL EVALUATION APPROACHIt was apparent from the preliminary analysis that the overlap among sometraining groups would vitiate any statistical tests requiring that assumptionsof independence of random variables be satisfied. Tnis was accepted as a
4-6
necessary flaw in order to have enough training groups with enough segments ineach group to adequately train the classifier.
In evaluating the conjecture that the formula for the variance of the currentsubanalysis district regression estimator overestimates the variability ofthis estimator if I(C) = 1 is used, the following approach was taken: Thearithmetic means over the eight training groups were plotted against thecorresponding values of I(C) = 0 and I(C) = 1 for each crop. The linecontaining these two points expresses the linear relationship existing between,.I(C~ and the variance of Yc' as is evident from the formula for the varianceof Y. This line can be used to approximate the value of I(C) associated withcthe empirical estimate of variance.
The Behrens-Fisher test was used to investigate the bias of the currentcounty-level estimator. Whenever a sample of segments is randomly selectedfrom a county, the direct expansion estimator N x y is an unbiased estimatorfor the total county hectarage of a given crop. Likewise, each individual Yi ']is unbiased for the mean number of hectares per segment. Similarly, if thecurrent county-level estimator is unbiased for the total county hectarage of agiven crop, then this estimator divided by the number of segments in thecounty (N) is unbiased for the mean number of hectares per segment for a givencrop. The Behrens-Fisher test indicates whether the current county-levelestimator, when divided by N, systematically overestimates or underestimatesthe mean number of hectares per segment.
For the Cardenas ratio and regression estimators, the proportions of each cropin each county, as well as a "coefficient of variation" for each crop, arepresented in tables. These figures are calculated for each training group.This "coefficient of variation" is, for each county, the ratio of the squareroot of formula variance for the crop divided by the average number ofhectares of the crop. In addition, a sample coefficient of variation wasobtained using the estimates of a crop from the eight training groups assamples. These summary statistics for each estimator are presented by crop.
4-7
) The sample variances of the Cardenas estimators were compared, and the samplevariance of each Cardenas estimator was compared to the sample variance of thecurrent county-level regression estimator using an F-test. This was doneknowing that the independence assumptions were not satisfied. Indeed, notonly did some of the training groups overlap, but also all three estimatorshave the same y-variable, namely the ground truth hectarage per segment.However, it was believed that a comparison of the sample variances wouldindicate whether or not they were significantly different.
The Behrens-Fisher t-test described previously was the test for bias of theCardenas ratio and regression estimators.
The bias and the mean squared error (MSE) of the direct proportion estimatorswere calculated and recorded as summary statistics. Recall that the procedureis to cluster a county and obtain proportions, Qi' of each distribution in themixture model. Then, 500 labeled pixels are chosen randomly from segments
.) within the county to estimate St ' or the proportion of crop R. in distribu-i c
r tion 1. The proportion of crop 1., 1T~, is L Q.B~ , where c is the number ofA. • 1 1 A..
1= 1di stributions present. In addition, the crop proportions taken from the 500pixels are computed to obtain a third estimate, called the simple random sampleestimate. This procedure is repeated 50 times, each time choosing 500 labeledpixels randomly from the segments within the county. The average number oflabeled pixels available in each county is about 6700, a large enough numberthat the 50 repetitions can be considered independent. The proportion of eachcrop, determined using all the labeled pixels in a county, is considered thetrue proportion of that crop. In order to estimate the bias, the 50 estimatesare averaged and the mean compared to the proportion in the labeled pixels.The MSE is the sample variance of the 50 proportion estimates. An F-ratio iscomputed for each of the direct proportion estimators. This is the ratio ofthe sample variance of the direct proportion estimator to the sample varianceof the simple random sample estimates over the 50 repetitions. Independenceproblems are again present, since each of the three estimates (maximumlikelihood, least squares, and simple random sample estimates) is obtained overthe same set of 500 pixels.
4-8
4.6 EVALUATION OF PREPROCESSORSThe comparison of the performance of each preprocessor with the USDA EDITOR wasdone using the Hotelling T2 test, which compares the mean difference betweenground truth and the regression estimate per segment for both methods.Accepting the null hypothesis would indicate that there is no significantdifference between estimates produced by the two methods.
There remains some question as to whether the regression equation developedfrom the analysis district should be used on the county in evaluating theperformance of the MLEST algorithm, since a new (transformed) classifier isused on that county. Although an improvement in classification was obtainedusing the MLEST algorithm on the county data, a corresponding improvement inestimation might not occur using the regression lines developed on the analysisdistrict if these regression lines are not appropriate for the county. So inaddition to the Hotelling T2 test for the other preprocessors, two other testswere made: one to compare regression estimators for the USDA EDITOR and MLEST,which were developed on the county; and one to compare estimates from the USDA )EDITOR using the training regression lines and MLEST using the countyregression lines. This issue is discussed in further detail in section 5,Study Results.
If the results of the Hotelling T2 test show that one or more of the preproc-essing procedures produce estimates that are not significantly different fromthose produced by the USDA EDITOR alone, it is necessary to examine the meanvector to determine if the results of the preprocessing procedure are better,worse, or mixed. If the estimates using the preprocessor are closer to groundtruth for every crop than those of the EDITOR alone, then the results using thepreprocessing procedure are considered better; if they are further from groundtruth for every crop than those of the EDITOR alone, the results using thepreprocessing procedure are considered worse. If the estimates using thepreprocessor are closer to ground truth for some crops and further for others,it may be concluded that one procedure is not better than the other.
4-9
~)
)
In order to attempt to detect the presence or absence of haze or other differ-ences between the test and training areas, a two-sided F-test for homogeneityof variances and an F-test for equality of analysis district and countyregression lines is done for each crop. These tests are discussed in moredetail in section 5.
4-10
5. STUDY RESULTS
5.1 CURRENT SUBANALYSIS DISTRICT REGRESSION ESTIMATOR5.1.1 EXPLANATION OF GRAPHS AND TABLESFigures 5-1 through 5-9 contain plots of variance versus I(C) for the currentcounty regression estimator. For each crop, the formula variance usingI(C) = 1 is computed for each training group, and an average is obtained.Similarly, the formula variance using I(C) = 0 is computed for each traininggroup, and an average is obtained. These two numbers determine the lineassociated with each crop. The empirical estimate of variance is then usedwith this linear relationship to produce an empirical estimate of I(C).
Although these plots have been produced for only one county, other dataexhibited later provide a similar result: for the majority of crops in eachcounty, the empirically estimated values of I(C) are around zero. This tendsto confirm the statement that the formula variance provides an estimate whichgreatly overestimates the variance of the current county-level estimator.
5.1.2 THEORETICAL AND EMPIRICAL VARIANCE ~STIMATES BY COUNTYTables 5-1 through 5-6 present the preceding graphical results quantitativelyby county. The averages across training groups of the theoretical andempirical variance estimates for each crop are given. The empirically observedvalue of I(C) is also given, and it was determined by observing that in theformula I(C) is linearly related to the variance estimate. The averages of thevariance estimates with I(C) = 1 and also with I(C) = 0 provide two pointsdetermining the line representing this linear relationship. By using this factand the empirical estimate of variance, one can obtain a corresponding value ofI(C). For the majority of crops in each county, these values of I(C) are closeto zero.
5-1
~I;
111IN
TABLE 5-1.- THEORETICAL AND EMPIRICAL VARIANCE ESTIMATESUSING CURRENT USDA PROCEDURE FOR BEADLE COUNTY
Figure 5-8.- Variance versus I(C) for hay cut in Beadle County.
1
I I\
\..-1'----Variance
325 X 106jI
300 I•I
275 I•II
250 I·I225 I200
175 •III
lTl 150 I•I I~ I71 I125 •
II
100 I•III
75 I•III
50 I•
25o _. •
-.014 0 1. I (C)
Figure 5-9.- Variance versus I(C) for flax in Beadle County.
5.1.3 BEHRENS-FISHER TESTTable 5-7 contains the results of the Behrens-Fisher test described in sec-tion 4.5. This test is used as a guide in assessing the bias of the currentcounty-level estimator. The corresponding confidence intervals for theestimated biases are in table 5-8.
This test, a significance test for the difference between the means of twonormal populations, assumes that the two population variances are not thesame. For a fixed crop and county, the eight estimates of hectarage associ-ated with the eight training groups are considered as eight observations of arandom variable YI•
For the same crop and county, the n sample segments of the 200 that fall inthat county can each be thought of as providing an estimate of the mean hec-tarage per segment of that crop. By multiplying each of these n numbers bythe total possible segments in the county, n estimates of the hectarage of thecrop are obtained. Treating these as n observations of a random variable Y2'this two-sample test can then be applied to test for the equality of meansassociated with the random variables YI and Y2• The following should be keptin mind in interpreting the test results: First, the eight observations of YIare regression estimates based on means from training groups, some of whichoverlap; and second, the n observations of Y2 arise from individual segmentsand thus produce a large sample variance. This variance will occur as part ofthe denominator in the test statistic, and it will likely produce a numberwhich will fall within the interval determined by the critical values. Thiswould imply that the hypothesis of equal means would not be rejected as oftenas might be expected, given that the efficacy of Y2 as an estimator of thetrue population mean is suspect.
The other possibility for estimating the true mean was to use the segmentsfalling within a county to obtain the direct expansion estimate of the truemean. This number would then be a constant C against which the mean from thedistribution of VI could be tested for equality. The difficulty with thispossibility is that C is treated as the true mean, when in reality, it is only
5-17
\)
U1I•.....
ex>
· I''--''TABLE 5-7.- BEHRENS-FISHER T-TEST OF MEAN ESTIMATES*
Spink 2 ±2.04 ±2.03 ±2.01 ±2.03 ±2.10 ±2.04 ±2.13 ±2.08 ±2. 08*The hypothesis is that the population mean of the current county-level estimator equals the populationmean of the direct expansion estimator.
tHypothesis rejected.**No crop present.
U1I~
\.0
TA8LE 5-~.- CONFlUENCE INTERVAL FOR ESTIMATED BIAS: CURRENT REGRESSION ESTIMATOR[95% confidence]
County Rangeland Sunflowers Corn Wheat Oats Grass Alfal fa Hay cut Flax
an unbiased estimate of that mean, and it has a considerable amount ofvariance associated with it.
A decision to use the two-sample test was made, and, insofar as the mean of Y2can be considered the true population mean, the test results indicate thatthere is not enough statistical evidence to show that Y1, the current county-level estimator, is biased.
5.1.4 ESTIMATION RESULTS FOR SOIL STRATUM 4The current subanalysis district estimator was used to obtain crop hectarageestimates for soil stratum 4 for each of the eight training groups. (TheCardenas estimators were not evaluated on soil stratum 4 because their userequires knowing all of the land use stratum and soil stratum intersectionmeans, which were not available.) In an analysis similar to that which wasconducted for the six counties, an empirically derived value for I(C) wascalculated. These results are shown in table 5-9. Again, the empiricallyobserved values of I(C) cluster close to 0, with hay cut being the onlyexception. This gives additional credence to the conjecture that the varianceformula with I(C) = 1 produces overestimates.
The Behrens-Fisher test described in section 5.1.3 was used to ascertain ifthe current estimator produced biased crop hectarage estimates on soilstratum 4. Table 5-10 gives the results of the Behrens-Fisher tests. No non-zero ground truth was present for flax or grass in the sample of 20 segmentsfrom soil stratum 4. Of the remaining seven crops, there was not enough sta-tistical evidence to reject the null hypothesis of equal means. (A signifi-cant outcome for a crop would imply that the estimate for that crop is biased.)
5.2 RESULTS OF THE CARDENAS REGRESSION AND CARDENAS RATIO ESTIMATIONPROCEDURES
5.2.1 COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATIONTables 5-11 through 5-19 give, by crop, the proportion estimates and the"coefficients of variation" that were obtained for each training group for
)
5-20
<.nINI-'
TABLE 5-9.- THEORETICAL AND EMPIRICAL VARIANCE ESTIMATESUSING CURRENT USDA PROCEDURE FOR SOIL STRATUM 4
*The hypothesis is that the population mean of the Huddleston-Ray subanalysisdistrict estimator equals the population mean of the direct expansionestimator.
tNo crop present in sample from soil stratum 4.
5-22
TABLE 5-11. - COUNTY CROP PROPORTION AND COEFfI CIENTS OF VARIATIONS FOR CARDENAS~EGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR RANGELAND
(a) Cardenas regression estimator
r.r ,IIN , '( (,111 lJ140 (q It\ (,~••I b~'fi (,J4 '> (jjlJll I,~o I Mt:."!~P,WPO~ T ION CI(OP >iECrAI(t:.~lif{. Tbkl S C. v.
11",t.lll" II.\hM II. '.'l? rl.Bt> o.~().., II.1 11 U.JJ4 ,I. Uh'l U.III1/) ().~47 1~IU!ll.~"'tl II. ,l,h 11.',79 II.lli o. 2~ I 11.""0 II.J~b 11.4'-1,) II•.?b':> o.bol 27J.1'i~1'1~'I(LAH •... II.".~4 II .11.1) 1l.2..•R 0.~41 II.Zc..;~ 0.222 0.3~4 11.301 0.tt>8 6061614tl7114 U.lt2 o.~JJ II.1;'>1 11.121. II.140 O. l"~ o. I.,•• 1).IMb 0.1 bO 11331/94,.,3.JoCIIIlI"'",(IN O. Ibll \J.-it I (J • .?i9 o.~~~ n.3~9 0.11'1 11.20) 11.,1] 0.l74 46114lh~l~b 0.15" O.SH 0.171 1I.IH2 II.1 11 0.on7 00113 11.4'1, 0.0;11 S60o~3 46.84~\M1LI'~ 0.1.' ] II. IIJj o·rd 0.2')1 ll.240 U.llb~ (l·I'Jl ll.jOrj 0.)11 l3 9'~AIllIlJu O.3~~ U.14'1 II. 6~ 0.411 fl.5uo O. b1~ 0.3tl~ U.4117 0.:J1~ 1189 4011.04/< 1")(,S'ill"'Y 0.;>11>1 11.111 0.141 U.)ll II.Jllb 1I.1Jh lI.b?l 0.J94 0.J03 640;<'/Jt 11/ttW U. IA I U.l.tt! 0.1011 n.~o2 1l.1.111 o .12~ 0.521 0.231 O.~Ob 1tl61111j~••0..,PI'I" u. 11"> U .1111. o. ,>1+11 o •1.2Ii IJ.211#:l 0.236 O.Z':SO ll.l!'+ 0.241 fl67b9.I'>'J>1I~ O. ,>r',] 0.l4 'I O. I'j'j 0.102 0.1.10 0.112 O.lbl 0.1 4 0.2b4 524b23bIj8.C,tt<..r,
IN(...J (b) Cardenas ratio estimator
CllIhlTY (,I ~J I>14b C;17H 1>241 (jl"~fS b3'+~ hJbl1 (,';61 ME4N PI-tOPOI(TIU.., C~O~ IiEClAl-(t:.Stttrf AH •.•..• c. V.
TABLE 5-15.- COUNTY CROP PROPORTION AND COEFFICIENTS OF VARIATION FOR CARDENASREGRESSION ESTIMATOR AND RATIO ESTIMATOR FOR GRASS
(a) Cardenas regression estimator
COUNJY li1 ,>3 6140 lilH r.;J41 h?~ti tJ14!> 113bii b5h7 ~It:/llll "~OPU~T ION C~llP tiECTA'~t::SHf( r At~ts c. v.
ItlAlJl1: O.u.,A 0.0,+1 11.044 0.0':>'1 II.U,'I O.IIIJb U.1I4~ U.llbM 0.0':') !11b4]2211ti o • 'J,+Ii d.l1b 11.1•..•7 u.t32 0.1 4 O.J!:J!> 0.2d O.t"lb 0.339 ) 910~t1".llCl 0\1o(~ 0.l1.4 o • wH (). ()~;4 1l.04H U.III.t> 0.0'/'1 lI.lIll 1I.03tl 0.o1l7 l~tl8q(,4li704 o.")••~ o. 00 (I.2?4 O.13lj 0.114 0.2)'1 Q.t!lIt 0.112 0.4A6 0':)25087.64CUIJ (1"'.TllN o. ~"7. ().11-).'1 o.n~~ 0.04t 0.U~4 O.o':>ti O.L1bJ O.O:lb 0.ub4 !08~n1""12b O. 1l<J o. HI 0.2(' o.lld 0.110 0.23';) O.)HJ 11.)24 0,"'J9 4 46 b31.9dHAIl! JN 1l.11~4 0.tl~4 0.0<,1' 0.0';)') 0."4" 0.117:1 0.07':> u. o~:>l 0.061:1 IiliJ4) )0 nil 0.261 o. HJ 0.2u7 0.)1i" 0.1:01 0.2'11i O.lh) o .If)2 0.2bb 6379492.1lt< I'·jh•../IU •.••Y u. lid II.I ~ 1 0.0" •• (1.)21 il.llli O.Dl o.OJ! -.0)3 0.Ofi6 ~d2171.1]4 fill O.2'N O.~ "> 0.h'4 (I.4)J t).:')]b 0.!))4 0.1 t.l 0.241 0.036 )3 258158.29SfJ 1"1( 0.0"<; 0.113'+ 1I.0?h U.ill1 0.1117 O.OYI O. 131. 1I.~08 g:n~ B094:i~I}>iI';) 0.J47 O.lbl O. 1 ,tH 0.221 0.t41 0.326 11.419 0.516 26 d 11 36. 12<.n
IN (b) Cardenas ratio estimator-....j
CIIu."TY u 1i! I 6140 Gl1ri lJ7.47 u2':i/i G34!:J GJbli 1;';)67 HE 0\1\1 PkOPOH fI ON CHUP HECTA~ESt11:('rA~tS C. v.HE"'II E 1I.11~2 lI.flfij 0.04'1 O.Ob! O.tl.,~ o.o~u 0.07~ 0.045 0.~68 ~1817
each county. For example, in Beadle County the proportion estimate of range-land using the training group G123 is .366 of the total hectares in the threestrata over which the estimate was obtained. The corresponding "coefficientof variation" is .356 and is defined as the ratio of the square root of thevariance, calculated by the formula, and the average of the hectarageestimates across the eight training groups. The next to the last columncontains, for each county, the mean proportion estimate of rangeland and thesample coefficient of variation. In the last column are the estimatesrepresented as hectares.
By comparing the "coefficients of variation" that were computed using theformula variance for a training group to the sample coefficient of variation,
..•one can see that the variance formula of Y, which was derived under theassumption that the within-county variance is equal for all counties, seems tounderestimate the true variance.
"") 5.2.2 BEHRENS-FISHER TESTIn tables 5-20 and 5-21, the same two sample tests used to evaluate the biasof the current county regression estimator was used to test for bias in thetwo Cardenas estimators. The corresponding confidence intervals for theestimated biases are in tables 5-22 and 5-23. The same caution encouraged inexamining the results in the first application of the test is advised herealso. Because the hypothesis for the Cardenas ratio estimator is rejected for10 crop-county combinations and the hypothesis for the Cardenas regressionestimator is rejected for 8 combinations, both estimators can probably beconsidered biased by this test.
5.2.3 F-TESTS OF VARIANCEThe two-sided F-test was used to provide some idea of how the variance of thecurrent county-level estimator compared to the variances of the two Cardenasestimators and how the variances of the Cardenas estimators compared to eachother. These tests cannot be appealed to unequivocally because one of the
~ assumptions for performing the test is that the samples are from independent
5-32
UlI
WW
TABLE 5.20.- BEHRENS-FISHER T-TEST OF MEAN ESTIMATES: CARDENAS REGRESSION ESTIMATOR*
*The hypothesis is that the population mean of the current county-level estimator equals population meanof direct expansion estimator.t~pothes1s rejected.**No crop present.
populations. The overlapping training groups obviate· satisfying theindependence criterion; however, keeping in mind this limitation, the resultsin table 5-24 indicate that whenever there is a significant difference invariances, the Cardenas regression estimator and the Cardenas ratio estimatorappear to have larger variances than the current regression estimator.Additionally, there seems to be no significant difference in variances of theCardenas regression and the Cardenas ratio estimators.
5.2.4 RESULTS OF THE CLASSY-BASED DIRECT PROPORTION ESTIMATION PROCEDUREThe CLASSY-based direct proportion estimation procedure using the maximumlikelihood approach or the least squares approach can be outlined as follows:1. Apply the CLASSY clustering algorithm (noncrop specific to the county of
interest, sampled 1/64) to estimate m, p(xli), and ai.2. Randomly choose 500 labeled pixels from the segments within the county of
interest.3. Use the maximum likelihood approach described in section 2.3.1 to estimate )
ati•4. Use least squares approach described in section 2.3.2 to estimate ati.
m5. Compute the proportion of class t as wt = ~1 ai ati
6. Repeat steps 2, 3, 4, and 5, 50 times to estimate the bias and MSE of wt•
The procedure was also carried out without using the maximum likelihoodapproach or the least squares approach; the 500 labeled pixels were used tocompute the proportion wt directly. This is referred to as the simple randomsample approach. The number of labeled pixels in each county are as follows:Beadle, 8442; Clark, 7264; COdington, 5430; Hamlin, 3588; Kingsbury, 6086; andSpink, 9480.
H2: Variance Cardenas ratio estimator equals variance current regression estimator.H3: Variance Cardenas regression estimator equals variance Cardenas ratio estimator.
tJ1ypothesls rejected.
5.2.5 STATISTICS FOR DIRECT PROPORTION ESTIMATORSIn tables 5-25 and 5-26 are some statistics that were calculated for each ofthe direct proportion estimators. For each crop, the bias, mean squarederror, and F-ratio are listed. The bias is the difference between the averageof the 50 proportion estimates of a crop given by the particular estimator andthe proportion of ground truth of that crop in the sample segments that are inthe county.
The mean squared error (MSE) is the average of the squared errors over the 50runs. The F-ratio is the ratio of the variance of the 50 estimates given bythe direct proportion estimator to the variance of the 50 estimates obtained,each using a simple random sample of 500 labeled pixels to compute theproportion directly.
There seem to be no significant differences in variances of the CLASSY direct-proportion estimators using the simple random sample approach and either themaximum likelihood or the least squares approach.
5.2.6 RELATIVE BIASES OF ALTERNATIVE COUNTY ESTIMATORSTable 5-27 lists the relative biases of each of the alternative county-levelestimators considered. The relative bias is defined by the equation:Relative bias = estimator proportion of cro~ - true proportion of croptrue proportl0n of crop
The true proportion of a crop in a county is declared to be the proportion ofthat crop determined by all of the ground-truth pixels in the sample ofsegments from that county. It can be seen that the Cardenas estimatorsproduce relative biases consistently larger than those of the direct propor-tion estimators, which appear to have about a lO-percent relative bias.
5.3 STUDY RESULTS: PREPROCESSINGThe preprocessing study was to determine if any of three candidate preproces-sors might improve crop area estimation at a county level by correcting for
5-39
UlI.~o
I "
~
TABLE 5-25.- BIAS, MEAN SQUARED ERROR, AND F-RATIOUSING THE MAXIMUM LIKELIHOOD APPROACH
*Cardenas regression estimator.tCardenas ratio esti.ator.-MaxilllUl likelfhood direct proportion estilUtor.ttLeast squares direct proportion estimator.
~ IHuddleston-Ray esti••tor.
5-42
differences between the larger area, for which the estimator was developed,and the smaller county area, for which estimates are desired.
The ATCOR and XSTAR algorithms transform both the analysis district sample,which is used in training the classifier, and the county data to make themspectrally more alike by correcting for atmospheric and background differ-ences. Not only is a classifier developed that is different from the one thatwas developed using EDITOR alone, but a different regression estimator is alsoobtained. The MLEST algorithm, however, provides a transformation that isapplied to the training statistics from the analysis district. This trans-formation changes the classifier to better fit distributions that are presentin the county. Thus, the transformed MLEST classifier is used on a particularcounty, but is inappropriate for the analysis district; hence, the sameregression equation is developed on the analysis district by both EDITOR andEDITOR with MLEST. Any differences between EDITOR in county-level estimatesand EDITOR with MLEST preprocessing would be caused by the classifier, becausethe regression equations are the same.
Tables 5-28 through 5-33 present the classification results for the 75 segmentsamples from the analysis district that were used to train the classifier anddevelop the regression estimators for EDITOR and for EDITOR with eachpreprocessor. Also included are similar results for the two county samples,which were both disjoint from the training set.
5.3.1 HOTELLING'S T2 TEST RESULTSIt has been assumed in this study that there are some fundamental atmosphericand background differences between the six-county analysis district and theindividual county to be estimated. Hence, one single overall correction by apreprocessor of the entire six-county area might transform the spectral spacein some useful manner (i.e., reduce the effects of haze)t but it would notcorrect for the difference between individual counties and the analysisdistrict. Such differences would still remain, and no real classificationimprovement was obtained when USing XSTAR in this manner. The remaining studyused XSTAR and ATCOR in a manner which would have corrected for such
5-43
)
)
)
TABLE 5-28.- EDITOR WITHOUT PREPROCESSING(a) Classification results for 75 segments
*Percentage of correct classification.tThe prior probabilities in the MlEST-adjusted classifiers
for each county were the actual crop proportions in thecounties, rather than the corp proportions in the analysisdistrict.
5-49
differences (separately correcting the analysis district and the county),provided that the algorithms were effective in haze correction.
Although the preprocessing methods did not improve classification for allcrops for any county, MLEST (without prior knowledge of crop proportions inthe county) produced improved or similar classification results in the BeadleCounty test site for every crop but rangeland. However, MLEST did so poorlyin classifying rangeland, which constituted 33 percent of the county sample,that the overall PCC was significantly lower than for EDITOR alone. Similarresults were found for XSTAR in the same county when the entire analysisdistrict was corrected at once.
Tables 5-34 and 5-35 show the results of comparing crop-area estimates thatare obtained from each preprocessing procedure with those from EDITOR. TheHotelling T2 test was used in the comparison. Rejecting the null hypothesiswould indicate that the preprocessing method produced regression estimateswhich were significantly different from those produced without preprocessing.
In Beadle County, EDITOR with ATCOR produced regression estimates that weresignificantly different from those produced by EDITOR alone, but theseestimates were better for some crops and worse for others (mixed). EDITORwith XSTAR also produced mixed results in Kingsbury County. Mixed results areundesirable because the preprocessing procedure should produce the same esti-mates as EDITOR when little or no haze or background difference is present andbetter estimates for all crops when heavy haze is present. The inconsistencyof the XSTAR and ATCOR algorithms indicates that they are either insufficientin their compensation for large differences or that they are unreliable whenlittle difference is present. This conclusion is supported by resultspUblished in 1981 by Dave (ref. 8), which suggested that XSTAR corrects mainlyfor sun angle rather than for haze and that XSTAR contains assumptions whichare at variance with findings from the Dave study. The XSTAR and ATCORalgorithms will not be included in the FY 1982 study.
5-50
')TABLE 5-34.- STRATUM 12 HOTELLING'S T2 RESULTS OF
25 SEGMENTS IN BEADLE COUNTY[NULL HYPOTHESIS: EDITOR with preprocessing prOdUCeS]crop area estimates which are the same as those from
EDITOR without preprocessing.
Test Calculated Rejectl Resu ltsr2 accept IiO
EDITOR with no preprocessing 12.47 Acceptvs. EnITOR with XSTAR *(I)EDITOR with no preprocfssing 25.31 Acceptvs. EOITOR with XSTAR (II)EOITOR wi~h no preproc~ssin9 52.79 Reject Mixedvs. EDITOR with ATCOR (II)EOITOR with no preprocessing 20.483 Accl!ptvs. EDITOR with ML£STEOITOR with no preprocessing 42.79 Acceptvs. EDITOR with ML£ST (bothregression estimators devel-oped on Readle County)
) EDITOR with no preprocessin9 150.97 Reject Mixed,(regression estimator devel- mostlyoped on analysis district) vs. betterEDITOR with MLEST (regressionestimator developed on county)Critical value T2 (6, 13, .O5) 49.23
TABLE 5-35.- STRATUM 12 HOTELLING'S T2 RESULTS OF20 SEGMENTS IN KINGSBURY COUNTY
Test Calculated Acceptl Resultsr2 reject IiO
EDITOR with no preprocessing ill.89 Reject Mixedvs. EDITOR with XSTAR (II)EDITOR with no preprocessing 56.24 Acceptvs. EDITOR with ATCOREDITOR with no preprocessing 26.91 AcceptV5. EDITOR with MLESTCritical value T2 (8, 14, .O5) 86.08
)t*I. entire 6-county area transformed at once.
II. training and county areas transformed separately.
5-51
The MLEST procedure consistently produced regression estimates that were notsignificantly different from those produced by EDITOR alone. These regressionestimates were obtained both from the analysis district and from the county.The FY 1980 study showed that a regression developed on a training set wassignificantly different from a regression developed on an independent set.The only case in which EDITOR with MLEST produced estimates that were betterthan those produced by EDITOR alone was when estimates from the EDITORanalysis district regression estimator were compared with estimates producedby an estimator developed on the counties where MLEST had transformed theclassifier. Because the counties were in effect an independent test set(disjoint from the training segments), such a difference in the regressionestimates was probably similar to the training and test estimator differencesfound in the FY 1980 study and was not due to the use of MLEST~
Below is listed the mean vector used in calculating the Hotelling T2 statisticwhen EDITOR with MLEST was used with an estimator that was developed from seg-ments in Beadle County to estimate crop areas for those segments, and EDITORalone used an estimator developed from sample segments from the whole analysisdistrict.
In this equation, n is...stratum 12 (14), Ye isestimate for segment isegment i.
the number of sample segments from Beadle County in..the estimate for segment i from EDITOR alone, y is themfrom EDITOR with MLEST, and y is the ground truth for
5-52
)
)
When the ~C was positive, then MlEST with EDITOR provided estimates which,rop A
on the average, were closer to ground truth than EDITOR alone; when the ~Cropwas negative, EDITOR provided the better estimates. However, this improvementis probably caused by the different regression equations used, and not by theuse of MLEST.
This improvement highlights the fact that in production use with no testsegments available, EDITOR with MLEST would use the same regression equationsas would EDITOR alone. However, with the use of MLEST, the classifier used toclassify the county is not the same as the one used to develop the regressionequations. It would be expected that this would affect the estimates made onthe county. Any classification improvement caused by the use of MLEST mustovercome such degradation in estimator performance.
It should be possible to take advantage of potential improvements if testsegments were always available to produce a new estimator or if some othertype of estimator were used. Unfortunately, test segments are usually notavailable, since all available segments are needed to train the classifieradequately.
In order to better evaluate if differences between the crop proportions in theanalysis district and the county were affecting the performance of the MLESTclassifier, the crop proportions that are listed in tables 5-36 through 5-38were provided to the MLEST classifier to see if any classification improvementwould be obtained.
In Beadle County, two crops are not present at all, and in both counties cropproportions vary widely between analysis district and county for some crops.Classification results using these priors from the county were worse for onecounty and better for the other when compared to use of MLEST with priors fromthe training data. Differences in crop proportions between the analysisdistrict and county must then be larger before estimates of the countyproportions would improve an MLEST classifier.
5-53
TABLE 5-36.- CROP PROPORTIONS OF75 SEGMENTS IN ANALYSIS DISTRICT
Although MLEST performed consistently, it never produced estimates that werestatistically different from those produced by EDITOR. But there is a ques-tion of whether there was actually any difference in the haze level betweenthe analysis district and the two counties. Several methods were used toattempt to answer this question.
Table 5-39 lists the MLEST transformation matrix and vector used to transformthe training statistics before classifying each county.
Although the diagonal elements of the A matrices were all close to 1, neitherthe off-diagonal elements nor the values for the B vector were close to zero.This transformation is not close enough to an identity transformation to saythat there is no difference in the distributions of the analysis district andthe county; there may be some difference, but not very much.
5.3.2 ATCOR HAZE LEVELSTable 5-40 displays the haze levels measured by ATCOR for the two acquisitionsfor the analysis district and county samples.
5-55
TABLE 5-39.- MLEST TRANSFORMATION MATRIX A AND VECTOR BFOR BEADLE AND KINGSBURY COUNTIES
Analysis district. Haze level No. of segmentsG123 training group: 75
Acqui sit ion 1 0.177Acquisition 2 0.236
Beadle County test group: 25Acqui siti on 1 0.250Acquisition 2 0.257
Kingsbury County test group: 20Acqui sit ion 1 0.113Acquisition 2 0.209
*Haze levels are measured on a scale of 0.000 (no haze)to 1.000 (heavy haze).
5-57
Unfortunately, although the haze levels that were measured are useful for com-parison, it is not known at what haze level a classifier will first have poorperformance caused by haze (table 5-40). Note that there is little differencebetween the analysis district and either county for acquisition 2, but thereis some difference in Beadle County for acquisition 1. XSTAR does not producea haze diagnostic.
5.3.3 COMPARISON OF REGRESSION LINESThe third method used to look at the presence or absence of haze was tocompare regression lines obtained from the training area to those from theBeadle County test area.
Tables 5-41 and 5-42 show the tests for homogeneity of variances and the testfor equality of regression lines. The county regression lines were developedon only 14 segments, whereas the analysis district regression lines weredeveloped on 42 segments (both sets from stratum 12 only). As can be seen,homogeneity of variances was rejected for six of seven crops, and the regres- ')sion lines were not the same for the remaining crop. Any attempt to drawconclusions about haze level from these tests is limited.
The fourth and final attempt to reach an understanding of differences involvesan observation of XSTAR results. If classification results from XSTAR withEDITOR (when the whole six-county analysis district was corrected at once) hadbeen worse than those when the analysis district and the county were correctedseparately, then it would have been concluded that there was some differencebetween the county and the analysis district that was not being corrected,although the overall average haze level may have been reduced. In fact, thatwas not the case. The classification results when the whole area wascorrected at once were actually better. The conclusion drawn from thisattempt to measure haze and other differences between the county and analysisdistrict is that there is probably some difference present, but it is not alarge difference.