A GENERAL THEORY FOR EVALUATING JOINT DATA
INTERACTION WHEN COMBINING DIVERSE DATA SOURCES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF GEOLOGICAL AND
ENVIRONMENTAL SCIENCES
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Evgenia I. Polyakova
April 2008
UMI Number: 3313643
Copyright 2008 by
Polyakova, Evgenia I.
All rights reserved.
UMI Microform 3313643
Copyright 2008 by ProQuest LLC.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
We should also note here that the validity of the conditional independence assumption will strongly depend on the support of the unknown A. If the support of the unknown A is larger than those of the conditioning data, such an assumption of conditional independence can be justified.

Conveniently, another approach to obtain the joint data probability P(D_1, ..., D_n) in equation (2.5) is to assume full data independence. Assuming that the data events D_1, ..., D_n are jointly independent leads to:

    P(D_1, ..., D_n) = ∏_{i=1}^{n} P(D_i)    (2.7)

Hence, under both the assumption of conditional independence given A = a and the assumption of data independence, the sought-after fully conditional probability is written as:

    P(A = a | D_1, ..., D_n) / P(A = a) = ∏_{i=1}^{n} [ P(A = a | D_i) / P(A = a) ]    (2.8)

that is, the updating ratio associated with all data equals the product of the elementary updating ratios.
2.1.2 Heteroscedasticity
All the probabilities and conditional probabilities presented thus far are data-values and unknown-value dependent: P(A = a | D_i = d_i, i = 1, ..., n), written concisely as P(A | D_i, i = 1, ..., n). Critically, the spread (e.g. variance) of such probabilities, which relates to uncertainty, is data-values dependent, a situation we will refer to as "heteroscedastic". Conversely, independence of that spread from the data values and from the unknown value is referred to as homoscedasticity.
2.1. CONDITIONAL INDEPENDENCE 15
The roots of the word homoscedasticity, or invariability of the error variance, come from regression theory. It has been treated in some detail by many statistics and economics texts ([19], [38], [42], [72]). Most often the assumption of homoscedasticity is made as a matter of pure convenience, as it reduces considerably the modeling requirements. However, as has been pointed out by Downs [19], it is the study of heteroscedasticity which "may provide the only available evidence of interacting variables". Such interaction between data for any given unknown may change the naive assessment made from an association of individual data ignoring their interaction. Just like conditional independence, the homoscedasticity assumption should be documented and made with caution, not accepted blindly as a matter of pure convenience. Unfortunately, the assumption of homoscedasticity is much too often taken for granted, heteroscedasticity being seen as an illness to cure [19].

Some examples of homoscedasticity are:

• In regression theory and traditional geostatistics, the (regression) kriging weights are homoscedastic in that they depend only on the variogram/covariance model and the spatial geometry of the data, but are independent of the data values. Critically, the kriging (estimation) error variance is also data-values independent: an assumption, or a result, that is contrary to what is observed in practice.

For example, consider the two identical data configurations with different data values shown in Figure 2.1. The two geometric configurations of Figure 2.1 are identical: in both cases the two data D_1 and D_2 are located at the same distance from the unknown A. However, in Figure 2.1 (2), the much different data values (D_1 = 1%, D_2 = 15%) would most likely carry a greater error potential in the estimation of the unknown A than in Figure 2.1 (1), where the unknown A is surrounded by two consistently small values (D_1 = 1%, D_2 = 1.5%).
Figure 2.1: Spatial geometry of the data for two different data-value combinations. (1) unknown A is surrounded by two points with small data values; (2) unknown A is surrounded by the same two points but with very different data values, which can potentially lead to greater error.

16 CHAPTER 2. A REVIEW OF EXISTING MODELS

• The in-built assumption of homoscedasticity in regression has led to much effort to justify it [19], most notably by calling on the properties of the Gaussian random function. Indeed, a characteristic property of such a multivariate Gaussian distribution is that all conditional distributions are Gaussian, fully characterized by the conditional mean, which identifies the linear regression estimate, i.e. kriging, and the conditional variance, which is homoscedastic and identifies the non-conditional error variance or kriging variance:

    E{[A − A*]² | D_i = d_i, i = 1, ..., n} = E{[A − A*]²} = σ²_K    (2.9)

If one accepts without question the multivariate Gaussian distribution model, then the homoscedastic assumption need not call for any further discussion.
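The homoscedasticity of kriging can be made concrete with a small sketch. Assuming a simple kriging system with a hypothetical exponential covariance model and the two-data configuration of Figure 2.1, the weights and the kriging variance come out identical whichever data values are observed:

```python
# Sketch: simple kriging weights and variance depend only on the data
# geometry and the covariance model, never on the data values.
# Covariance model and coordinates are hypothetical.
import numpy as np

def simple_kriging(coords, target, values, cov=lambda h: np.exp(-h / 10.0)):
    # Data-to-data covariance matrix and data-to-unknown covariance vector
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = cov(d)
    k = cov(np.linalg.norm(coords - target, axis=1))
    w = np.linalg.solve(K, k)          # kriging weights: geometry only
    var = cov(0.0) - w @ k             # kriging (error) variance
    est = w @ values                   # estimate (zero mean assumed, for brevity)
    return w, var, est

coords = np.array([[0.0, 0.0], [4.0, 0.0]])
target = np.array([2.0, 0.0])
w1, var1, _ = simple_kriging(coords, target, np.array([1.0, 1.5]))   # Fig 2.1 (1)
w2, var2, _ = simple_kriging(coords, target, np.array([1.0, 15.0]))  # Fig 2.1 (2)
print(np.allclose(w1, w2), np.isclose(var1, var2))  # True True
```

The two calls differ only in the data values, yet return the same weights and the same error variance, which is exactly the homoscedasticity criticized above.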
• The homoscedasticity of data errors.

Assume the availability of n data events D_1, ..., D_n that inform the unknown A, with corresponding error terms R_1, ..., R_n. These n data can then be modeled as:

    D_i = f_i(A) + R_i(A)    (2.10)

The measurement D_i is seen as a physical deterministic function f_i of the unknown A plus a random error or deviation R_i [48]. One can argue that the model D_i = f_i(A) + R_i(A) is absolutely general as long as the distribution of the errors R_i is accepted as dependent on the variable A. Hence, for A = a, the data remain random, D_i = f_i(a) + R_i(a), with the actual datum value d_i corresponding to a particular realization r_i (unknown) of the error random variable: d_i = f_i(a) + r_i. Then:

    P(D_i = d_i | A = a) = P(R_i = r_i | A = a)  ∀i

and:

    P(D_i = d_i, i = 1, ..., n | A = a) = P(R_i = r_i, i = 1, ..., n | A = a)    (2.11)
The joint data likelihood calls for the equally difficult-to-get joint likelihood of the n error RVs. Therefore several simplifying hypotheses are made, often without further justification. The errors R_i are assumed:

1. conditionally independent given A = a,
2. with (homoscedastic) distribution independent of A.

Under these two hypotheses, the joint data likelihood (2.11) becomes:

    P(D_i = d_i, i = 1, ..., n | A = a) = ∏_{i=1}^{n} P(R_i = r_i | A = a) = ∏_{i=1}^{n} P(R_i = r_i)    (2.12)

Lastly, a third hypothesis of Gaussian error distributions is commonly made.

We argue that errors are often directly related to the unknown A: a change in the unknown value should also be reflected in the distribution of the error term in equation (2.10). In geostatistics, one particular form of such heteroscedasticity is the commonly observed "proportional effect" [49], which refers to an increase in the spatial variance in areas with a greater local mean. In such cases, Var(R_i) is directly affected by the specific unknown value a [50].
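A heteroscedastic error model of type (2.10) is easy to simulate. The sketch below (all numbers hypothetical) draws measurements D = a + R(a) with an error standard deviation proportional to the unknown value a, mimicking the proportional effect:

```python
# Illustration of a heteroscedastic error model D_i = f_i(A) + R_i(A):
# the error spread grows with the unknown value a (proportional effect).
# The 20% relative error and sample size are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def measure(a, n=100_000, rel_err=0.2):
    # Error standard deviation proportional to the unknown value a
    return a + rng.normal(0.0, rel_err * a, size=n)

low, high = measure(1.0), measure(10.0)
print(low.std() < high.std())  # True: Var(R_i) depends on A = a
```

An estimator built under the homoscedastic hypothesis would assign the same error variance to both cases.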
2.1.3 Bayesian networks
Bayesian networks are often used to make a set of variables and their dependencies visually explicit. One example of such a network is the bi-directional Bayesian network [63], which is used to represent a joint multivariate probability distribution. For example, consider the tri-variate distribution of variables A, D_1, and D_2 shown in Figure 2.2. The graph of Figure 2.2 depicts all possible joint combinations of the variables, with the dependencies between these variables represented by the bi-directional arrows. Traditionally these dependencies are modeled by covariance-related measures of similarity. The Bayesian graph of Figure 2.2 considers not only the data dependence between the two data (nodes D_1 and D_2) but also that between the two data taken jointly (node D_1D_2). As seen in this figure, Bayesian nets are necessarily data-values dependent, requiring that dependencies be remodeled for each new data-value combination {a', d_1', d_2'} different from {a, d_1, d_2}.

To obtain the fully conditioned probability P(A = a | D_1 = d_1, D_2 = d_2) one would need to consider all the dependencies (bi-directional arrows) between the unknown A and the data events D_1, D_2 and, most critically, the joint data event D_1D_2. For example, the joint probability P(A, D_1, D_2) is derived as:

    P(A, D_1, D_2) = P(A) P(D_1 | A) P(D_2 | A, D_1)    (2.13)

If one assumes conditional independence of the data D_1 and D_2 given the third variable A, equation (2.13) simplifies into:

    P(A, D_1, D_2) = P(A) P(D_1 | A) P(D_2 | A)    (2.14)

Figure 2.2: Graphical representation of joint dependencies between variables A, D_1, and D_2.

This simplification is shown in Figure 2.3, with the resulting net requiring less modeling effort than that of Figure 2.2: most arrows starting from the joint data event D_1D_2 are no longer shown (not needed).
In climate studies, such a bi-directional relationship is called a "feedback". Positive feedbacks work to enhance the effect of the original forcing; negative feedbacks decrease or remove it. For example, the ice-albedo feedback [7] is the mechanism in which warming temperatures (D_1) lead to a reduction of ice and snow coverage (D_2), decreasing albedo (i.e. the reflection coefficient of the Earth's surface) and resulting in further snow and ice retreat, more absorption of heat, and warming of the air. Thus, the temperature (D_1) impacts the ice/snow cover (D_2). In return, the ice/snow cover (D_2) influences the temperature (D_1). Based on this polar amplification concept, high latitudes are the areas where global warming is expected to be most pronounced.

Figure 2.3: Graphical representation under conditional independence between variables D_1 and D_2 given A.

An example of a simplified bi-directional graph is shown in Figure 2.4 (1). In this graph, the variables B and C interact, affecting each other. However, at times the relationship between the variables can take a simpler form where only one variable B influences the other variable C; in a Bayesian network such a form of dependence is represented by a uni-directional arrow, as in Figure 2.4 (2). For example, a change of incoming radiation (B) may result in a change of ocean circulation (C) via a change of its thermal structure. However, ocean circulation has no impact on incoming radiation. Hence the relationship between radiative forcing and ocean circulation may be considered a uni-directional relationship.
Figure 2.4: Graphical representations of bi-directional (1) and uni-directional (2) Bayesian nets.

As another example, consider the joint probability P(A = a, B = b, C = c) using the three different joint representations of uni-directional Bayesian nets shown in Figure 2.5. In this Figure, the leftmost graph (1) represents the situation in which data event B is independent of both A and C, while data event C is dependent on A. The middle graph (2) represents the uni-directional dependence of data event C on data events B and A. Finally, graph (3) of Figure 2.5 is more complex, since the data event B also influences A.

The joint probability P(A = a, B = b, C = c) can be written for each of the three uni-directional graphs:

(1). P(A = a, B = b, C = c) = P(C = c | A = a) P(A = a) P(B = b)

(2). P(A = a, B = b, C = c) = P(C = c | A = a, B = b) P(A = a) P(B = b)

Figure 2.5: Graphical interpretation of the joint probability P(A, B, C) based on different sets of relationships between the three variables A, B, and C.
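Factorizations (1) and (2) can be evaluated directly once the marginal and conditional tables are specified. The sketch below uses hypothetical binary tables and checks that each factorization defines a valid joint distribution:

```python
# The factorizations (1) and (2) read off the graphs of Figure 2.5,
# evaluated on hypothetical marginal/conditional probability tables.
pA = {1: 0.3, 0: 0.7}                 # P(A = a)
pB = {1: 0.6, 0: 0.4}                 # P(B = b), independent of A here
pC_A = {(1, 1): 0.8, (1, 0): 0.2,     # P(C = c | A = a), keyed (a, c)
        (0, 1): 0.1, (0, 0): 0.9}
pC_AB = {(1, 1, 1): 0.9, (1, 1, 0): 0.1, (1, 0, 1): 0.5, (1, 0, 0): 0.5,
         (0, 1, 1): 0.3, (0, 1, 0): 0.7, (0, 0, 1): 0.1, (0, 0, 0): 0.9}
                                      # P(C = c | A = a, B = b), keyed (a, b, c)

def joint_graph1(a, b, c):
    # Graph (1): C depends on A only; B independent of both
    return pC_A[(a, c)] * pA[a] * pB[b]

def joint_graph2(a, b, c):
    # Graph (2): C depends on both A and B
    return pC_AB[(a, b, c)] * pA[a] * pB[b]

# Each factorization defines a valid joint distribution (sums to 1)
for joint in (joint_graph1, joint_graph2):
    total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print(round(total, 10))  # 1.0
```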
simplify the computational cost associated with Bayesian nets [8], [10], [76]. This reduces considerably the effort to simulate the required dependencies. One such simplification is shown in Figure 2.6, where the two data events D_1 and D_2 are assumed to be conditionally independent relative to the third variable A. At the same time, the variable A is assumed independent of the variable B. In this Figure, the node A is referred to as the parent node, while nodes D_1 and D_2 represent its children. Assuming conditional independence then amounts to ignoring an important link (arrow) between the two conditioning data children D_1 and D_2. However, more often than not the data interact with each other. Such a link can be critical in modeling the joint probabilities.

A possible way to avoid the reliance of Bayesian nets on the assumption of conditional independence is to use a global representation (proxy image) of the joint distribution of all variables involved. Such a global representation provides a possible image of the joint interaction between data and unknown. In geostatistics such a representation of the joint distribution is called a "training image". This concept was introduced back in 1992, when Guardiano and Srivastava [35]

Figure 2.6: Graphical representation of conditional independence between variables D_1 and D_2 given A, and data independence between variables A and B.
and Journel [47] proposed to use a training image to represent the "type of heterogeneities that the geologists expect to be present in the actual subsurface reservoir" [70]. Such an image can be borrowed directly from a physical outcrop or could be obtained by computer simulation of the physics that govern data interaction and their relation with the unknown [70], [74]. For example, a training image could be obtained from an unconditional realization generated by an object-based algorithm [36]. Geologist expertise combined with massive modern computer power allows the generation of such an image. By scanning this image, one can retrieve directly all the required conditional probabilities as observed proportions, without any call for conditional independence.
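The scanning idea can be sketched in a few lines. The training image below is a synthetic stand-in (a smoothed random binary field, not a real geological image), and the single-pixel data template is hypothetical; the point is that P(A | B), P(A | C), and P(A | B, C) are all read off as observed proportions, with no independence assumption:

```python
# Sketch of the training-image idea: scan a synthetic binary image and
# read conditional probabilities directly as observed proportions.
# The image, template, and data events are all hypothetical.
import numpy as np

rng = np.random.default_rng(1)
ti = (rng.random((200, 200)) < 0.3).astype(int)          # stand-in image
# smooth horizontally to mimic EW-elongated sand bodies
ti = ((ti + np.roll(ti, 1, axis=1) + np.roll(ti, 2, axis=1)) >= 2).astype(int)

# Template: unknown A at u; datum B one pixel west; datum C one pixel north
A = ti[1:-1, 1:-1]
B = ti[1:-1, :-2]
C = ti[:-2, 1:-1]

mask = (B == 1) & (C == 1)                # joint data event B = 1, C = 1
p_joint = A[mask].mean()                  # P(A = 1 | B = 1, C = 1)
p_B = A[B == 1].mean()                    # P(A = 1 | B = 1)
p_C = A[C == 1].mean()                    # P(A = 1 | C = 1)
print(round(p_B, 3), round(p_C, 3), round(p_joint, 3))
```

Every reported value is a proportion over template replicates of the scanned image, so the joint B, C interaction given A is captured exactly as the image depicts it.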
As an example, consider the assessment of an unsampled event A from data events B and C where:

- A is the presence/absence of a subsurface channel sand at an unsampled location
- B indicates the presence/absence of sand data at nearby well locations
- C is the result of a seismic survey whose analysis leads to an indirect indication about channel occurrence [70]

Figure 2.7: Example of dual training images depicting the interaction between two data types B and C. Left: (1) facies map B for sand/no sand data. Right: (2) seismic signature C for seismic data.

A binary sand/no sand training image such as that of Figure 2.7(1) would give a concept of the spatial distribution of sand (here EW channels). Computer-based simulation of the seismic survey would provide the seismic signature of the training image (Figure 2.7(2)). The joint availability of the two related training images shown in Figure 2.7 allows retrieving all corresponding training probabilities of the type P(A|B), P(A|C), P(A|B, C), and thus evaluating the data B, C interaction given A.
Link to Markov chains

A commonly used model to represent a discrete-time (or 1D) stochastic process is that of a Markov chain [57]. The Bayesian net shown in Figure 2.6 can be
2.2. PROBABILITY COMBINATION ALGORITHMS 25
seen as a special case of such a chain. In a Markov process, any previous state is assumed irrelevant for predicting the probability of subsequent states given the current state.

The expression (2.42) is a ratio of data log-likelihoods. It is important to note that this τ_2 interaction weight is data-value and unknown-value dependent.

The distance x of expression (2.40) can then be re-written as:

    x = x_1 · ∏_{i=2}^{n} [ P(D_i | nonA, D_1, ..., D_{i-1}) / P(D_i | A, D_1, ..., D_{i-1}) ]    (2.43)

where τ_1 = 1.
38 CHAPTER 2. A REVIEW OF EXISTING MODELS
Generalizing then equation (2.41) to all (n − 1) weights leads to:

    P(D_i | nonA, D_1, ..., D_{i-1}) / P(D_i | A, D_1, ..., D_{i-1}) = [ P(D_i | nonA) / P(D_i | A) ]^{τ_i}

with the key result:

    τ_i(d_1, ..., d_n; a) = log[ P(D_i = d_i | A = nona, D̄_{i-1} = d̄_{i-1}) / P(D_i = d_i | A = a, D̄_{i-1} = d̄_{i-1}) ] / log[ P(D_i = d_i | A = nona) / P(D_i = d_i | A = a) ]  ∈ [−∞, +∞]    (2.44)-(2.45)

where D̄_{i-1} = d̄_{i-1} denotes the joint event {D_1 = d_1, ..., D_{i-1} = d_{i-1}}.

Substituting expression (2.44) into (2.43) for i = 3, ..., n, we get the Bordley-Journel expression:

    x / x_0 = ∏_{i=1}^{n} ( x_i / x_0 )^{τ_i}    (2.46)
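The tau model with the exact weights (2.45) reproduces the fully conditioned probability exactly. The sketch below builds a hypothetical joint distribution over binary A, D_1, D_2, computes τ_2 from (2.45) with τ_1 = 1, and checks that the tau-combined distance recovers P(A | D_1, D_2) obtained by direct conditioning:

```python
# Verify the tau model x/x0 = prod (x_i/x0)^tau_i on a synthetic binary
# joint distribution. All joint probabilities are hypothetical.
import math

# Joint probabilities P(A=a, D1=d1, D2=d2), keyed (a, d1, d2); sum = 1
p = {(1, 1, 1): 0.18, (1, 1, 0): 0.12, (1, 0, 1): 0.06, (1, 0, 0): 0.14,
     (0, 1, 1): 0.04, (0, 1, 0): 0.16, (0, 0, 1): 0.10, (0, 0, 0): 0.20}

def P(**ev):
    # Marginal/joint probability of any subset of {a, d1, d2} outcomes
    return sum(v for (a, d1, d2), v in p.items()
               if all({'a': a, 'd1': d1, 'd2': d2}[k] == x for k, x in ev.items()))

a, d1, d2 = 1, 1, 1
x0 = P(a=0) / P(a=1)                       # prior distance
x1 = P(a=0, d1=d1) / P(a=1, d1=d1)         # elementary distance, datum D1
x2 = P(a=0, d2=d2) / P(a=1, d2=d2)         # elementary distance, datum D2

# Exact tau_2 (eq. 2.45): conditional over elementary log likelihood ratios
num = math.log((P(a=0, d1=d1, d2=d2) / P(a=0, d1=d1)) /
               (P(a=1, d1=d1, d2=d2) / P(a=1, d1=d1)))
den = math.log((P(a=0, d2=d2) / P(a=0)) / (P(a=1, d2=d2) / P(a=1)))
tau2 = num / den

x = x0 * (x1 / x0) ** 1.0 * (x2 / x0) ** tau2     # tau model, tau_1 = 1
p_tau = 1.0 / (1.0 + x)
p_exact = P(a=1, d1=d1, d2=d2) / P(d1=d1, d2=d2)  # direct conditioning
print(round(p_exact, 6), round(p_tau, 6))         # the two values match
```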
Interpretation of the tau expression

The denominator of the τ-expression (2.45) measures how datum D_i = d_i discriminates the outcome A = a from nona. The numerator measures the same, but in the presence of the previous data D̄_{i-1} = d̄_{i-1} = {D_1 = d_1, ..., D_{i-1} = d_{i-1}}. Thus the ratio (of ratios) τ_i indicates how the discrimination power of D_i = d_i is changed by knowledge of the previous data D̄_{i-1} = d̄_{i-1} taken all together. Critically, this weight is specific, as mentioned before, to the ordering of the n data events D_1, ..., D_n, and is data-value and unknown-value dependent.

Consider the following specific values for the tau weights:

• τ_i = 1

This condition is satisfied when the two ratios in expression (2.45) are equal:

    P(D_i = d_i | A = nona, D̄_{i-1} = d̄_{i-1}) / P(D_i = d_i | A = a, D̄_{i-1} = d̄_{i-1}) = P(D_i = d_i | A = nona) / P(D_i = d_i | A = a)    (2.47)

When τ_i = 1, the ability of the datum (or data event) D_i = d_i to discriminate a from nona is unchanged by knowledge of the previous (i − 1) data events d̄_{i-1} = {D_1 = d_1, ..., D_{i-1} = d_{i-1}}. Relation (2.47) entails the following equality of log ratios:

    log[ P(D_i = d_i | A = nona, D̄_{i-1} = d̄_{i-1}) / P(D_i = d_i | A = a, D̄_{i-1} = d̄_{i-1}) ] = log[ P(D_i = d_i | A = nona) / P(D_i = d_i | A = a) ]    (2.48)
Note that the tau model with unit tau weights is less constraining than the assumption of conditional independence. While data conditional independence given both the unknown event A and its complement nonA leads to unit tau weights, the reverse is not true: unit tau weights need not imply any data conditional independence. It suffices that the two likelihoods in equation (2.48) be multiplied by a same constant factor; any such factor different from one would also result in τ_i = 1.

• τ_i = 0

A zero tau interaction weight occurs when the numerator log[ P(D_i = d_i | A = nona, D̄_{i-1} = d̄_{i-1}) / P(D_i = d_i | A = a, D̄_{i-1} = d̄_{i-1}) ] of expression (2.45) is equal to 0, leading to:

    P(D_i = d_i | A = nona, D̄_{i-1} = d̄_{i-1}) = P(D_i = d_i | A = a, D̄_{i-1} = d̄_{i-1})    (2.49)

In the presence of the previously used data D̄_{i-1}, the datum D_i is non-informative in that it does not discriminate event a from nona. Note, however, that considering a different data sequence might result in a τ_i weight different from 0. In such a case, the datum D_i does add valuable information about the unknown event A = a.

• τ_i > 1

1. If x_i > x_0, that is, if datum D_i by itself increases the distance to event A = a occurring as compared to the prior distance x_0, then the interaction factor τ_i > 1 makes that increase even greater.

2. Similarly, if x_i < x_0, that is, if datum D_i by itself decreases the prior distance to event A = a occurring, then the interaction factor τ_i > 1 makes that decrease even greater.

• If τ_i < 1, the previous conclusions are reversed.
In summary, Krishnan has provided a solution to the difficult task of obtaining the exact conditional probability P(A | D_1, ..., D_n). This is done through relation (2.46) by decomposing the problem into two simpler tasks:

• obtaining the information content through the individually conditioned probabilities P(A | D_i), with i = 1, ..., n;

• deriving the multiple-point joint data interaction tau parameters, whose exact expressions are known.

The tau interaction weights, in addition to being dependent on the specific ordering of the n data events D_1, ..., D_n, are data-values and unknown-value dependent. While such a form of dependence allows for a more comprehensive representation of the fully conditioned probability P(A | D_1, ..., D_n), it is too complex to be used in practice. This calls for approximations built from Krishnan's exact tau parameter expression. We argue that while sequence-dependent interaction weights are important in some applications, most often it is the global representation of such interaction that is desirable. Krishnan's derivation fails to provide a measure of such global data interaction.

Moreover, the τ_i weights are likely to exhibit an unstable behavior versus data values. When the information is non-discriminating, the denominator of expression (2.45) tends toward log 1 = 0, leading to an infinite tau weight τ_i → ∞, hence creating an inference problem.

Krishnan [50] did note that the inference of the interaction τ_i weights is quite difficult and that the behavior of these tau weights is not fully understood. He pointed out the need for further analysis, starting with synthetic data sets. These data sets should not only help develop a better understanding of the tau interaction parameters, but also lead to further theoretical developments.
Chapter 3
The nu representation
The overview presented in the previous Chapter pointed to the need to remain alert against any simplifying but potentially crippling hypothesis when it comes to data dependence and interaction. The need to consider data jointly, rather than one or two at a time, was also brought out. This is particularly critical when dealing with spatially distributed phenomena, where patterns of similar data carry valuable information beyond that carried by each datum individually.

This chapter builds the theoretical basis of the nu expression, which is a sister of the tau model proposed by Bordley [6] and Journel [48] and further developed by Krishnan [50]. In his thesis Krishnan gave the exact expression of the tau weights and showed them to be directly related to the data interaction associated with any specific sequence of data. With these weights the original tau model leads to an exact analytical solution to the problem of probabilistic data integration.

The major contribution of the nu expression proposed in the present thesis is the derivation of a single, data-sequence-independent, interaction parameter ν_0. The derivation of this ν_0 parameter and its estimation rely on the original idea of Journel's paper [48] that ratios of probabilities are more stable than the probabilities themselves, a well-proven engineering paradigm. The exact ν_0 expression given hereafter is too complex to be practical. However, the availability of such an exact expression leads to avenues for its approximation. In this chapter we propose two such approximations.
3.1 Derivation of the nu representation
3.1.1 The nu expression
Consider an unknown event A informed by n data events D_1, ..., D_n. These data have been evaluated for their individual information content related to the unknown event A through the elementary probabilities P(A | D_i)¹. The challenge is then to recombine the prior probability P(A) and the n single-event conditional probabilities P(A | D_i) into the posterior probability P(A | D_1, ..., D_n) while accounting for possible interaction among the data. The nu representation provides an exact expression for such recombination.

One of the well-proven paradigms for engineering approximation is the permanence of ratios: rates of increments are typically more stable than the increments themselves [48], [62]. Using this key idea, define the following distances to the unknown event A, prior to and after observing any single data event:

    x_0 = (1 − P(A)) / P(A) = P(Ā) / P(A) ∈ [0, ∞]: prior distance to A occurring, with Ā = nonA;

    x_i = (1 − P(A | D_i)) / P(A | D_i) = P(Ā | D_i) / P(A | D_i) ∈ [0, ∞]: updated distance knowing datum D_i.

x_i equals zero if P(A | D_i) = 1, and equals infinity if P(A | D_i) = 0.

The updated distance knowing jointly the n data is then:

¹Throughout this paper the short notation P(A | D_i) is used for the (a, d_i) values-specific expression P(A = a | D_i = d_i).
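The distance transform and its inverse are one-liners. The sketch below uses hypothetical probability values to illustrate the prior and elementary distances just defined:

```python
# The prior and elementary distances of Section 3.1.1, with the inverse
# map back to a probability. The probability values are hypothetical.
def distance(p):
    # x = (1 - P) / P in [0, +inf]; 0 when P = 1, +inf when P = 0
    return float('inf') if p == 0 else (1.0 - p) / p

def prob(x):
    # Inverse map: P = 1 / (1 + x)
    return 1.0 / (1.0 + x)

x0 = distance(0.4)   # prior distance to A, with P(A) = 0.4
x1 = distance(0.9)   # updated distance knowing D1, with P(A|D1) = 0.9
print(round(x0, 4), round(x1, 4), round(prob(x1), 4))  # 1.5 0.1111 0.9
```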
The conditional probability is immediately retrieved as:
P*(A = 0\B = 0,C = 0) = - (3.14)
Figure 3.1 gives the scatter plots of these two estimates versus the reference true probability (3.12). The x-axis relates to the 10,000 exact P(A = 0 | B = 0, C = 0) values and the y-axis to the approximations of this exact probability based on the ν_0 = 1 model (left graph) and on conditional independence (right graph).

The approximation of conditional independence leads to 585 illicit probabilities (greater than 1), that is, approximately 585/10,000 ≈ 6% of all estimated probabilities. The most severe case leads to the estimated probability P(A = 0 | B = 0, C = 0) = 14.5. Of course, in any application any such violation of the laws of probability would be corrected. In most cases, such correction would involve changing the elementary single datum-conditioned probabilities.

The probabilities estimated under the ν_0 = 1 model are always licit by definition when working with binary data. Also, observe the high correlation coefficient (0.83) of the ν_0 = 1 model results with the reference. This robustness of the ν_0 = 1 results is attributed to the original paradigm stating that ratios of probabilities are likely to be more stable than the probabilities themselves.

Table 3.2 gives the summary statistics of the 10,000 reference probabilities and the two sets of 10,000 approximations. Not only does the ν_0 = 1 model lead to licit probabilities better correlated with the reference probabilities, but, critically, it also reproduces the sample statistics of the reference much better than the model based on the conditional independence assumption. That assumption tends to over-compound the information of the two data, hence generating a positive bias (overestimation).
3.3. APPROXIMATIONS BASED ON THE NU DERIVATION 59
Figure 3.1: The scatterplots for the ν_0 = 1 model (left) and the conditional independence estimator (right) versus the reference. The illicit probabilities are shown above the red line. Beware of the different y-axis scaling of the two panels.

              reference    ν_0 = 1    conditional independence
mean            0.50         0.50         0.55
variance        0.056        0.038        0.150

Table 3.2: Summary statistics: means and variances of the 10,000 conditional probabilities P(A|B, C) and their approximations.

              reference    ν_0 = 1
mean            0.50         0.50
variance        0.0560       0.0562

Table 3.3: Summary statistics: means and variances of the 10,000 conditional probabilities P(A|B, C) and their transformed estimator.

To ensure that conditional independence does hold, one might consider transforming the eight original joint probabilities p_k of Table 3.1 for each of the 10,000 realizations. Such a transform amounts to tampering with actual observations (the p_k's) to fit convenient models and is generally not recommended.
One such transformation, which ensures conditional independence given both A and nonA, is:

    p_3^trans = p_1 (p_3 + p_4) / (p_1 + p_2)
    p_4^trans = p_2 (p_3 + p_4) / (p_1 + p_2)
    p_7^trans = p_5 (p_7 + p_8) / (p_5 + p_6)
    p_8^trans = p_6 (p_7 + p_8) / (p_5 + p_6)    (3.15)

This transformation makes the two approximations, given by conditional independence (given A and nonA) and by the ν_0 = 1 model, identical.
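A sketch of transformation (3.15), under the assumption that p_1, ..., p_4 index the four (B, C) outcomes within A (rows B = 0, 1; columns C = 0, 1) and p_5, ..., p_8 the same outcomes within nonA; the input probabilities are hypothetical:

```python
# Transformation (3.15): re-allocate the joint probabilities so that B and
# C are conditionally independent given A and given nonA. Indexing
# convention assumed as stated above; input values are hypothetical.
def make_ci(p):
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    return [p1, p2,
            p1 * (p3 + p4) / (p1 + p2), p2 * (p3 + p4) / (p1 + p2),
            p5, p6,
            p5 * (p7 + p8) / (p5 + p6), p6 * (p7 + p8) / (p5 + p6)]

p = [0.10, 0.15, 0.05, 0.20, 0.12, 0.08, 0.18, 0.12]   # sums to 1
q = make_ci(p)

# Within the A block, P(B, C | A) must now factor as P(B | A) P(C | A)
pa = sum(q[:4])
assert abs(q[0] - (q[0] + q[1]) * (q[0] + q[2]) / pa) < 1e-12
print(round(sum(q), 6))  # 1.0: total probability is preserved
```

The transform keeps P(B | A) and the total mass within each block intact; only the split of the B = 1 row over C is altered, which is precisely the "tampering with observations" criticized in the text.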
Table 3.3 gives the summary statistics of the reference conditional probability P(A|B, C) and the estimator of that reference based on transformation (3.15). The mean of the 10,000 resulting approximations is equal to the mean of the exact probability (0.5). However, as can be seen from Figure 3.2, the resulting estimator based on that transform is poorly correlated with the reference true probability, with only a 0.41 coefficient of correlation. Note that the correlation coefficient of the ν_0 = 1 model applied to the original non-transformed data was 0.83.

Figure 3.2: The scatterplot of the estimator of the fully conditional probability P(A = 0 | B = 0, C = 0) based on transformed probabilities (y-axis) versus the reference (x-axis). The correlation coefficient between them is 0.41.
Instead of tampering with "data", i.e. changing the elementary single-datum conditioned probabilities P(A|B) or P(A|C), Journel [48] suggested standardizing the two conditional independence-based estimates P**(A = a | B, C) and P**(A = nona | B, C) as follows. Consider the conditional probability P(A | B, C) given by:

    P(A | B, C) = P(B, C, A) / [ P(B, C, A) + P(B, C, Ā) ] = P(B, C | A) P(A) / [ P(B, C | A) P(A) + P(B, C | Ā) P(Ā) ]

Assuming conditional independence given A and nonA leads to:

    P(A | B, C) = P(B | A) P(C | A) P(A) / [ P(B | A) P(C | A) P(A) + P(B | Ā) P(C | Ā) P(Ā) ]

That is:

    P(A | B, C) = S(A) / [ S(A) + S(Ā) ] ∈ [0, 1]    (3.16)

where S(A) = P(B | A) P(C | A) P(A) and S(Ā) = P(B | Ā) P(C | Ā) P(Ā).
Were expression (3.16) applied to the estimation of the complement event Ā = nonA, the following probability would be obtained:

    P(Ā | B, C) = S(Ā) / [ S(A) + S(Ā) ]

which ensures that:

    P(A | B, C) + P(Ā | B, C) = 1    (3.17)

Note that neither expression (3.16) nor (3.17) corresponds to conditional independence.
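The standardized estimator (3.16) is direct to implement. The sketch below uses hypothetical likelihoods and prior, and checks the closure property (3.17) by evaluating (3.16) for the event and for its complement:

```python
# Journel's standardization (3.16): normalize the two conditional-
# independence-based scores so the two complementary probabilities sum
# to one. Likelihood and prior values are hypothetical.
def standardized(pB_A, pC_A, pA, pB_nA, pC_nA):
    s_a = pB_A * pC_A * pA                 # S(A)
    s_na = pB_nA * pC_nA * (1.0 - pA)      # S(nonA)
    return s_a / (s_a + s_na)

p = standardized(0.7, 0.8, 0.4, 0.3, 0.5)  # P(A | B, C)
q = standardized(0.3, 0.5, 0.6, 0.7, 0.8)  # same formula for nonA
print(round(p + q, 6))  # 1.0: closure relation (3.17) holds
```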
The ν_0 = 1 model can be shown to identify the standardized expression (3.16). Indeed:

    P(A | B, C) = 1 / (1 + x) = 1 / (1 + x_1 x_2 / x_0) = S(A) / [ S(A) + S(Ā) ]    (3.18)

In summary, the no-data-interaction ν_0 = 1 model represents a significant contribution beyond the traditional hypothesis of data conditional independence. Conditional independence given A and nonA does lead to the ν_0 = 1 model, but there are many patterns of data dependence that also lead to the same ν_0 = 1 model.

Notwithstanding these advantages, the ν_0 = 1 model can be restrictive in some applications. It is thus important to consider the case when the global interaction weight ν_0 is different from one, as shown by the stockbrokers example below.
The stockbrokers case and ν_0 ≠ 1

Consider an uncertain decision to be made about buying a particular stock (A = 1). The prior probability is uninformative: P(A = 1) = 0.5, hence x_0 = 1.

1. Two stockbrokers (D_1, D_2) strongly advise to buy that stock:
P(A = 1 | D_1 = 1) = P(A = 1 | D_2 = 1) = 0.9, with x_1 = x_2 = 0.1/0.9 = 1/9 ≈ 0.11.

The likelihood of having the second broker advising a buy (D_2 = 1) in the presence of (A = 1, D_1 = 1) is much greater than in the presence of (A = 0, D_1 = 1),
Figure 3.3: Four training classes and their respective representative scores.
be exportable to the application field, much more so than the conditional probabilities. For example, one would not export to an actual subsurface hydrocarbon field direct porosity or permeability measurements taken from an analog outcrop. Instead, one may retain the more stable permeability ratio K_v(u)/K_h(u), with K_v being the vertical permeability at location u and K_h the horizontal permeability at that same location u.
In any particular application, we suggest that
1. the n elementary distances Xi, i. e. the n elementary single datum-conditioned
probabilities P(A\Di = di) be evaluated directly using the actual data values
di. As we mentioned before, this problem has received many solutions. For
3.3. APPROXIMATIONS BASED ON THE NU DERIVATION 71
example, numerous literature sources propose algorithms for obtaining conditional
probability via neural networks [3], [40], [41]. In geostatistics, one could
consider an indicator algorithm for modeling the elementary conditional distribution
functions [34]. Obtaining such elementary, single datum-conditioned,
probabilities is not in the scope of this thesis.
2. the single weight ν₀ modeling the joint data interaction be borrowed from a
proxy experiment (or training set) where the relations x/x₀ versus xᵢ/x₀ are known,
thus providing proxy values for ν₀.
The difficulty with borrowing such a proxy weight ν₀ is that this weight is (a; dᵢ, i =
1, . . . , n)-values dependent. The proposed ν₀ classification accounts for such data values
dependence, although approximating it through summaries (scores) of these data
values. That heteroscedasticity has its positive side: the ν₀ weights measuring joint
data interaction do depend on data values, as opposed to the homoscedastic kriging
variance and regression weights.
An example of the classified ν₀ approach
As an example of the classified ν₀ approach, consider obtaining the posterior conditional
probability P(A = 1|D, B), where D and B are two data events informing the unknown
A. For example, A could be a binary variable indicating the presence of sand at the
location u of a potential reservoir site. Data event D could be an indicator of sand
facies at nearby well locations, and data event B the indicator of sand facies at
more remote wells. Assume then the availability of the following prior, pre-posterior
probabilities, and training image:
• P(A = 1): prior probability of A occurring. Such probability could be obtained
from historic data. Note, this prior is common to both data events D and B.
• P(A = 1|D) and P(A = 1|B): probabilities of A occurring given the information
provided by data events D and B taken separately. This is equivalent (through
Bayes' relation) to knowing the respective likelihood functions P(D|A = 1) and
72 CHAPTER 3. THE NU REPRESENTATION
P(B|A = 1) of observing data event D or B given the unknown A = 1. For
example:

P(A = 1|D) = P(A = 1, D)/P(D) = P(D|A = 1) P(A = 1) / P(D)
           = P(D|A = 1) P(A = 1) / Σ_{a=0,1} P(D|A = a) P(A = a),    (3.20)

and similarly for P(A = 1|B).
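The Bayes inversion in (3.20) can be checked numerically; a minimal sketch, where the likelihood values are hypothetical illustration numbers rather than values from this case study:

```python
# Bayes relation (3.20): updating the prior P(A = 1) by a data event D.
# The likelihood values below are hypothetical illustration numbers.
def bayes_update(prior_a1, lik_d_a1, lik_d_a0):
    """P(A=1|D) = P(D|A=1)P(A=1) / sum_a P(D|A=a)P(A=a)."""
    num = lik_d_a1 * prior_a1
    return num / (num + lik_d_a0 * (1.0 - prior_a1))

p_a1_given_d = bayes_update(prior_a1=0.5, lik_d_a1=0.9, lik_d_a0=0.1)
print(round(p_a1_given_d, 2))  # 0.9
```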
• the training image of Figure 3.4. Such a training image is a synthetic representation
of the interaction between data events D and B and between them and the
unknown A. In practice a training image could be obtained from an outcrop or
built using a process-based simulation algorithm [70], [74]. The training image
allows retrieving the data values-dependent global interaction parameter ν₀ of
equation (3.5).
Next consider the templates defining the data events D and B as shown in
Figure 3.5 (1) and (2):
— the closest data event D comprises 4 data locations 10 meters
away from the unknown A(u). These 4 data are located at the corners of
a square centered at location u.
— the second data event B also comprises 4 data locations with the same
geometry, but located 15 meters away from the unknown u.
When conditioning only to either D or B alone, there are 2⁴ = 16 possible
combinations of binary data values to consider. When conditioning jointly to
both D and B data events, there are 2⁸ = 256 possible combinations of binary
data values. Note that ideally a training image such as that shown in Figure
3.4 should be large enough to depict all possible 256 data value combinations.
The training image provides replicates of the (D, B) joint data event and the
corresponding A value. That training image thus provides all probabilities of
Figure 3.4: Training image depicting the interactions between data and unknown (proportions 0.72 and 0.28).

Figure 3.5: Data events definitions.
the type P(A|B), P(A|D), and P(A|B, D), and consequently proxy values of the
ν₀ data interaction parameter, as defined in equation (3.5).
Regarding the inference issue, if the training image is not large and "rich"
enough to display enough replicates of all data events (taken jointly) found
in the actual field, one can reduce these data events through a few summary
statistics or scores. For example, the 256 (D, B) data values combinations of
this example could be summarized by two scores S₁ and S₂, where:
1. Score S₁ is the arithmetic average of the (4+4) = 8 data values
2. Score S₂ could be a measure of east-west connectivity calculated on the same
8 data values, as suggested by Zhang [74].
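The two scores above can be sketched in code. S₁ is the plain average; for S₂, Zhang's actual connectivity score is not reproduced here, so a simple stand-in is used: the fraction of east-west corner pairs whose two values are both sand (an assumption for illustration, not the published measure):

```python
# Two summary scores for an 8-value binary data event (corners of the
# D and B templates). S2 is a simple stand-in for Zhang's east-west
# connectivity score: the fraction of east-west corner pairs that are
# both sand (1). Corner ordering (NW, NE, SW, SE) is assumed.
def scores(d_vals, b_vals):
    vals = list(d_vals) + list(b_vals)
    s1 = sum(vals) / len(vals)            # arithmetic average of 8 values
    ew_pairs = [(0, 1), (2, 3)]           # (NW,NE) and (SW,SE) pairs
    pairs = [(v[i], v[j]) for v in (d_vals, b_vals) for i, j in ew_pairs]
    s2 = sum(a == b == 1 for a, b in pairs) / len(pairs)
    return s1, s2

s1, s2 = scores((1, 1, 0, 0), (1, 1, 1, 1))
print(s1, s2)  # 0.75 0.75
```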
That is, the 256 data value combinations of the data template of Figure 3.5(2)
have been summarized by only two scores S₁ and S₂. Figure 3.6 shows schematically
such dimension reduction, where the two scores S₁ and S₂ are plotted on the
x and y axes, respectively. Each pair (S₁, S₂) on the score map (Figure 3.6,
Figure 3.6: Training image (left) is summarized by the distribution of two summary scores shown on the score map (right).
right) corresponds to a particular training data occurrence with the configuration
shown in Figure 3.5(2).
Further, using a traditional classification technique such as k-means partitioning,
the score space is divided into clusters or classes of similar score values.
For example, Figure 3.6 shows nine such classes. For each such class, we can
retrieve a prototype ν₀ value by, for example, taking the average or median
of that class's training ν₀ values. These nine prototype ν₀ values are likely all
different from the value 1; they allow us to step away from the assumption of
no-data-interaction of the ν₀ = 1 model.
In the application phase, it is a simple task to find the training class closest to the
actual conditioning data scores and retrieve that class's prototype ν₀ value to combine
the elementary probabilities. The classified ν₀ paradigm is general in that:
1. The actual conditioning data event can be quite complex. In the example of
Figure 3.5 the conditioning data events D and B comprise 4 data points each.
In an actual application, the joint conditioning data event might comprise many
more than eight data points. Particularly important are the actual data score
values retained to find the closest training class and retrieve its prototype ν₀
value.
2. The actual conditioning scores need not match exactly any of the training class
scores. In other words, the actual conditioning data event does not need to have
exact replicates in the training image: it suffices to find the training class with
the closest set of score values.
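The application-phase lookup described above can be sketched as a nearest-centroid search in score space; the class centers and prototype ν₀ values below are hypothetical placeholders, not values from the case study:

```python
# Application phase of the classified nu0 approach (a sketch): find the
# training class whose score prototype is closest (Euclidean) to the
# actual conditioning-data scores, then retrieve that class's nu0.
# Class centers and nu0 prototypes below are hypothetical.
def closest_class(actual_scores, class_centers):
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(class_centers, key=lambda c: dist2(actual_scores, class_centers[c]))

class_centers = {"c1": (0.2, 0.1), "c2": (0.5, 0.4), "c3": (0.9, 0.8)}
nu0_prototypes = {"c1": 1.0, "c2": 1.3, "c3": 1.8}

c = closest_class((0.55, 0.45), class_centers)
print(c, nu0_prototypes[c])  # c2 1.3
```

The retrieved prototype ν₀ would then be plugged into equation (3.5) to combine the elementary probabilities.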
Of course, the set of scores retained should be chosen so that it reflects the main
characteristics of any specific joint conditioning data set. With too many scores, the
training image available may not offer enough replicates to fill in the score space
reliably.
Chapter 4
Application to binary data
The purpose of this chapter is to illustrate the nu model with applications to binary
data sets. A binary data set consists of two values coded as either zero or one. We will
sometimes refer to the category zero as mud/no sand and to the category one as sand,
following petroleum engineering convention. The reference binary data sets presented
in this work are assumed exhaustively known. Such reference data sets provide the
exact fully conditioned proportions and allow checking any approximation, including
those resulting from the ν₀ = 1 model and the classified ν₀ approach, against traditional
estimators based on data independence and conditional independence. Various
important parameters controlling data interaction are investigated. Particular focus
is given to the dependence of data interaction on data values. This heteroscedastic
dependence makes the inference of an accurate ν₀-model more difficult. The levels
of heteroscedasticity of the tau and nu parameters are compared; we expect the nu
weights to be more stable versus data values and hence easier to infer.
4.1 An elementary case study
4.1.1 Equilateral configuration
To investigate how the nu and tau parameters relate to data interaction, the following
simple experiment is proposed. It involves one unknown A located at the center of an
equilateral triangle and three data I₁, I₂, I₃ located at its three apices (Figure 4.1).
Figure 4.1: Spatial locations of the three data I₁, I₂, I₃ and the unknown A. The data-to-unknown distance is 5.77 and the data-to-data distance is 10.0.
All four variables are binary (0,1) and were generated by truncation of a simulated
Gaussian field.
More precisely, 100,000 unconditional joint realizations of the four corresponding Gaussian
random variables Z(u) are generated by LU decomposition of their 4×4 covariance
matrix (program LUSIM in [11]).
The isotropic covariance model used to build the covariance matrix is:

C(h) = exp(−h/r), with practical range 3r.
That range 3r is made variable from one set of 100,000 realizations to another set of
equal size. Each set allows one to study data dependence and interaction in evaluating
the central value A.
All standard Gaussian realizations, denoted z(u), are truncated at the median value
z = 0 to generate joint realizations of the four binary indicator variables:

i(u_α) = 1 if z(u_α) > 0, and i(u_α) = 0 otherwise,

where u₀ is the location of the central value A to be evaluated, and u_α, α = 1, 2, 3
are the three data locations.
The four variables being binary, there are a total of 2⁴ = 16 possible joint combinations
of their values. A joint probability of occurrence p_k, k = 1, . . . , 16, is assigned
to each of these 16 joint combinations (Table 4.1). Note that Σ_{k=1}^{16} p_k = 1.
      A  I1  I2  I3           A  I1  I2  I3
p1    0   0   0   0    p9    1   0   0   0
p2    0   0   0   1    p10   1   0   0   1
p3    0   0   1   0    p11   1   0   1   0
p4    0   0   1   1    p12   1   0   1   1
p5    0   1   0   0    p13   1   1   0   0
p6    0   1   0   1    p14   1   1   0   1
p7    0   1   1   0    p15   1   1   1   0
p8    0   1   1   1    p16   1   1   1   1

Table 4.1: Probability notation for the 16 joint occurrences.
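The simulation experiment above can be reproduced in miniature; the sketch below uses numpy's Cholesky factorization as a stand-in for GSLIB's LUSIM, takes the distances of Figure 4.1, and picks the practical range 3r = 20 as one arbitrary member of the range sweep:

```python
# Monte Carlo sketch of the LUSIM experiment: simulate the four correlated
# Gaussians (A, I1, I2, I3) by Cholesky decomposition of their covariance
# matrix, truncate at the median z = 0, and tabulate the 16 joint
# proportions p_k of Table 4.1.
import numpy as np

rng = np.random.default_rng(0)
r = 20.0 / 3.0                              # practical range 3r = 20
d = np.array([[0.00, 5.77, 5.77, 5.77],     # pairwise distances for
              [5.77, 0.00, 10.0, 10.0],     # (A, I1, I2, I3), Figure 4.1
              [5.77, 10.0, 0.00, 10.0],
              [5.77, 10.0, 10.0, 0.00]])
cov = np.exp(-d / r)                        # isotropic exponential covariance
L = np.linalg.cholesky(cov)
z = rng.standard_normal((100_000, 4)) @ L.T
ind = (z > 0).astype(int)                   # truncate at the median

codes = ind @ np.array([8, 4, 2, 1])        # bit code A*8 + I1*4 + I2*2 + I3
p = np.bincount(codes, minlength=16) / len(codes)
print(round(p.sum(), 6))  # 1.0
```

Under the positive spatial correlation, the concordant combinations (all zeros, all ones) are more frequent than the 1/16 that independence would give.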
The prior or marginal probability associated with any of the four binary variables is
p₀ = 0.5, corresponding to the prior distance:

x₀ = (1 − p₀)/p₀ = 1, with p₀ = P(A = 1) = 0.5.
Figure 4.2: Conditional probabilities. Concordant data case: A = I₁ = I₂ = I₃ = 1. The ν₀ = 1 model outperforms the model based on the conditional independence assumption, as seen from the ν₀ = 1 model values being closer to the reference probability.
The 16 probabilities p_k are set equal to the corresponding 16 proportions of joint
occurrence calculated from each set of 100,000 simulated realizations of the four variables
A, I₁, I₂, I₃. From such a consistent set of probabilities of joint occurrence,
all conditional probabilities can be retrieved and plotted versus the practical range 3r
(Figure 4.2).
For example, the probability that A = 1 given that all three indicator data are 1 is
This ratio is none other than expression (3.4) of the distance x under data conditional
independence given A = 1 and A = 0, i.e. expression (4.5) entails the ν₀ = 1 model.
However, the ν₀ = 1 model is not necessarily based on the two previous assumptions of
conditional independence (given A = 1 and A = 0); in that regard the ν₀ = 1 model
is a less restrictive hypothesis, that of no-data-interaction.
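That conditional independence given both A = 1 and A = 0 entails the ν₀ = 1 model can be verified numerically. The sketch below builds a hypothetical p-table (Table 4.1 indexing) under conditional independence with likelihoods P(Iⱼ = 1 | A = 1) = 0.9 and P(Iⱼ = 1 | A = 0) = 0.1, and checks that the ν₀ = 1 combination reproduces the exact fully conditioned probability:

```python
# p-table built under conditional independence given both A = 1 and A = 0
# (hypothetical likelihoods 0.9 and 0.1). Index = A*8 + I1*4 + I2*2 + I3,
# matching Table 4.1 (p_k <-> index k-1).
def marginal_a1(p):
    return sum(p[8:])                # codes 8..15 have A = 1

def cond_a1_single(p, j):            # P(A = 1 | I_j = 1), j in {1, 2, 3}
    bit = {1: 4, 2: 2, 3: 1}[j]
    num = sum(p[k] for k in range(8, 16) if k & bit)
    den = sum(p[k] for k in range(16) if k & bit)
    return num / den

def nu0_1_estimate(p):               # combine the singles with nu0 = 1
    p0 = marginal_a1(p)
    x0 = (1 - p0) / p0
    x = x0
    for j in (1, 2, 3):
        pj = cond_a1_single(p, j)
        x *= ((1 - pj) / pj) / x0
    return 1 / (1 + x)

def exact(p):                        # P(A=1 | I1=I2=I3=1) = p16/(p8+p16)
    return p[15] / (p[7] + p[15])

p = []
for a in (0, 1):
    q = 0.9 if a == 1 else 0.1       # P(I_j = 1 | A = a)
    for i1 in (0, 1):
        for i2 in (0, 1):
            for i3 in (0, 1):
                pr = 0.5
                for i in (i1, i2, i3):
                    pr *= q if i else 1 - q
                p.append(pr)

print(round(exact(p), 5), round(nu0_1_estimate(p), 5))  # both 0.99863
```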
Figure 4.3: Data values-dependent error associated with the ν₀ = 1 model. The largest error is attributed to the cases when all three data I₁, I₂, I₃ are concordant (the cases [1,1,1] and [0,0,0]), deviating most from the assumption of no-data-interaction.
From Figure 4.2, the ν₀ = 1 model is seen to provide a better approximation than the
conditional independence hypothesis (given A = 1), increasingly better as the correlation
range 3r increases. The ν₀ = 1 model corresponds to a hypothesis of no-data-interaction
which becomes increasingly poorer as the correlation within the data and between the
data and the unknown increases. In the case of concordant data (I₁ = I₂ = I₃ = 1) used
to evaluate the probability of A = 1, ignoring data interaction by assuming ν₀ = 1
leads to over-compounding the three individual probabilities
P(A = 1 | I_k = 1), an overestimation increasing with the correlation range.
Interestingly, the conditional independence approximation (given A = 1) leads to an
underestimation of the exact fully conditioned probability P(A = 1 | I₁ = I₂ = I₃ = 1).
Dependence on Data Values
To evaluate how the ν₀ = 1 approximation fares depending on the set of three data
values, Figure 4.3 plots the error
Figure 4.4: The sequence-dependent νᵢ weights for the data concordant case A = I₁ = I₂ = I₃ = 1. The first weight ν₁ is equal to 1 by definition. The third weight ν₃ reflects the greatest interaction. All three weights increase with the correlation range since all data are concordant.
[P*_{ν₀=1}(A = 1 | i₁, i₂, i₃) − P(A = 1 | i₁, i₂, i₃)]

for the 2³ = 8 possible sets of data values (I₁ = i₁, I₂ = i₂, I₃ = i₃).
As expected, ignoring data interaction leads to increasing errors as the correlation
range increases, with overestimation when two or more of the data are valued
1 and underestimation in the other cases. Also, the largest error occurs when the
three data values are concordant, (1, 1, 1) or (0, 0, 0), the cases which contradict most
the assumption of no-data-interaction.
Exact Nu Weights
Availability of the exhaustive set of 16 joint probabilities p_k allows calculation of the
exact (νᵢ, ν₀) weights as defined by expression (3.5). Figure 4.4 shows the three data
sequence-dependent νᵢ weights calculated for the case A = I₁ = I₂ = I₃ = 1. Three data
I₁, I₂, I₃ produce 3! = 6 possible data sequences. However, because of the equilateral
data configuration (Figure 4.1) associated with an isotropic correlation, the data
sequence does not matter here. The first datum in any sequence always receives a
unit weight ν₁ = 1. As the correlation range increases, the interaction between the
first two data increases, leading to an increasing second nu weight ν₂. The third weight
ν₃ reflects the even greater interaction between the first two data and the last one.
Note that data interaction, and hence the nu weights, are data values-dependent; that
interaction is maximal here when all data are concordant, I₁ = I₂ = I₃ = 1, and it
increases with the correlation range.
Figure 4.5: The single sequence-independent ν₀ weight: (1) with only two concordant data A = Iᵢ = Iⱼ = 1 and I_k = 0, with i ≠ j ≠ k (solid line), and (2) with all data concordant, A = I₁ = I₂ = I₃ = 1 (circled line). The interaction is greatest when all three data I₁, I₂, I₃ = 1 are concordant. This interaction increases with the correlation range.
Figure 4.5 gives the single, data sequence-independent, exact ν₀ weight for the case
when all data are concordant, A = I₁ = I₂ = I₃ = 1 (solid curve marked by circles),
and for the case with only two concordant data, A = Iᵢ = Iⱼ = 1 with i ≠ j = 1, 2, 3
Figure 4.6: The averaged error associated with the data values-dependent ν₀ model and with the ν₀ = 1 model. The data values-dependent ν₀ model shows a significant improvement, reflected in smaller errors.
(solid curve). Note that in this case it does not matter which two of the three data are
concordant, because of the equilateral data configuration (Figure 4.1). In the presence of
concordant values A = I₁ = I₂ = I₃ = 1, the strong data interaction is expressed
through an exact ν₀ value increasingly different from 1 as the data become more dependent
on one another. With only two concordant data, the ν₀ weight still increases
with the range. However, as expected, this increase is less dramatic than for the case
with three concordant data.
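The exact ν₀ weight plotted above is read directly off its definition. A minimal sketch, with hypothetical probability values (the true values would come from the p_k table or the training image):

```python
# Exact nu0 from its definition (expression (3.5)):
#   x/x0 = nu0 * prod_i (x_i/x0)   =>   nu0 = (x/x0) / prod_i (x_i/x0),
# with distances x = (1 - P)/P. Probability values are hypothetical.
def distance(prob):
    return (1.0 - prob) / prob

def nu0(prior, singles, joint):
    x0 = distance(prior)
    prod = 1.0
    for pi in singles:
        prod *= distance(pi) / x0
    return (distance(joint) / x0) / prod

# three concordant data, each alone giving P(A=1|I_i=1) = 0.8; the redundant
# data jointly raise the probability only to 0.9 instead of compounding
v = nu0(prior=0.5, singles=(0.8, 0.8, 0.8), joint=0.9)
print(round(v, 3))  # 7.111
```

A value ν₀ ≠ 1 here expresses that the joint distance is larger than the no-data-interaction product would predict, i.e. the three redundant data should not be fully compounded.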
Our inference paradigm consists of two steps. We first evaluate the single datum-conditioned
probabilities P(A = 1 | Iⱼ = 1), j = 1, 2, 3, using the actual data from the
actual field under study. We then use some training image or expert catalog to obtain
the data values-dependent ν₀ weight and export it to the actual field under study to
combine the previous single datum-conditioned probabilities.
For example, assume that from some prior expertise (perhaps built from experiments
on training data sets similar to that used in this study) we have access to the following
ν₀ weight function:

ν₀ = 1, whatever the data values, for any small range 3r < 6;
ν₀ = 1 + (…)/12000, an increasing function of the practical range, for 3r > 6.
This function is assumed applicable only when two or more data are valued 1. The
error graphs (Figure 4.3) are re-calculated using this improved ν₀-model. The results
of Figure 4.6 show a significant reduction of the error and demonstrate that the
worth and practicality of the nu/tau approach depends on the ability to go beyond the
approximation ν₀ = 1.
4.1.2 Non-equilateral configuration
For this second example, a non-equilateral configuration of three data was retained to
observe the impact of data locations on data interaction. Figure 4.7 shows the data
configuration, and Table 4.2 gives the corresponding Euclidean distances.
The study built around this data configuration is similar to that done for the equilateral
case. Figure 4.8 shows the conditional probabilities associated with the case
A = I₁ = I₂ = I₃ = 1. Note that data values concordance represents an unfavorable
case for any independence-related approximation.
For this example we also included one more estimator for comparison with the results
based on the ν₀ = 1 model. This estimator considers a hypothesis of data independence
combined with the hypothesis of conditional independence given A = 1.
      A       I1      I2      I3
I1    10.63   0.00    21.40   3.61
I2    11.18   21.40   0.00    22.83
I3    11.66   3.61    22.83   0.00

Table 4.2: Distances between data-to-unknown and data-to-data.
Figure 4.7: Non-equilateral data configuration.
We will call that combination of hypotheses "full independence". The resulting
approximation is written:

P*(A | I₁, I₂, I₃) = P(A, I₁, I₂, I₃) / P(I₁, I₂, I₃)
                  = P(A) P(I₁ | A) P(I₂ | A, I₁) P(I₃ | A, I₁, I₂) / P(I₁, I₂, I₃)
The numerator per conditional independence given A = 1 is written:
Figure 4.8: Conditional probabilities for the non-equilateral case with A = I₁ = I₂ = I₃ = 1. The estimate based on the full independence assumption (line marked by points) leads to a large over-compounding of the concordant information. The conditional independence estimate (line marked by plus signs) gives a probability that is less than 0.5 for small (< 21) ranges. The ν₀ = 1 model (dash-dotted line) provides consistently better results.
P(A) P(I₁ | A) P(I₂ | A) P(I₃ | A). The denominator per data independence is written:
P(I₁) P(I₂) P(I₃). Thus,

P*(A | I₁, I₂, I₃) = P(A) P(I₁ | A) P(I₂ | A) P(I₃ | A) / [P(I₁) P(I₂) P(I₃)]
Figure 4.9: Checking the consistency relation. Case I₁ = I₂ = I₃ = 1. The ν₀ = 1 model produces licit probabilities. The estimates based on data independence assumptions (both conditional and full independence) do not follow the general law of probabilities, which requires the probabilities to sum to 1.
Or, equivalently:

P*(A | I₁, I₂, I₃) / P(A) = [P(A | I₁)/P(A)] · [P(A | I₂)/P(A)] · [P(A | I₃)/P(A)]    (4.7)
For example, in terms of the p_k's of Table 4.1, the probability P(A = 1 | I₁ = 1) is
obtained as:

P(A = 1 | I₁ = 1) = Σ_{k=13}^{16} p_k / (Σ_{k=13}^{16} p_k + Σ_{k=5}^{8} p_k),

and P(A = 1) = Σ_{k=9}^{16} p_k.
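The consistency check of Figure 4.9 can be reproduced in miniature. The sketch below builds a hypothetical exchangeable p-table (the probability of a joint outcome depends only on its count of ones, a symmetric case with data dependence), then sums the estimates of P(A = 1 | ·) and P(A = 0 | ·) at I₁ = I₂ = I₃ = 1 for the three estimators:

```python
# Hypothetical exchangeable p-table; index = A*8 + I1*4 + I2*2 + I3.
w = {0: 0.2, 1: 0.05, 2: 1 / 30, 3: 0.05, 4: 0.2}
p = [w[bin(k).count("1")] for k in range(16)]
A, I1, I2, I3 = 3, 2, 1, 0                   # bit positions

def P(event):                    # probability of a partial assignment
    return sum(p[k] for k in range(16)
               if all((k >> s) & 1 == v for s, v in event))

def est_nu0_1(a):                # the nu0 = 1 combination
    x0 = (1 - P([(A, a)])) / P([(A, a)])
    x = x0
    for j in (I1, I2, I3):
        pj = P([(A, a), (j, 1)]) / P([(j, 1)])
        x *= ((1 - pj) / pj) / x0
    return 1 / (1 + x)

def est_cond_indep(a):           # conditional independence given A = a
    num = P([(A, a)])
    for j in (I1, I2, I3):
        num *= P([(A, a), (j, 1)]) / P([(A, a)])
    return num / P([(I1, 1), (I2, 1), (I3, 1)])

def est_full_indep(a):           # "full independence", expression (4.7)
    r = P([(A, a)])
    for j in (I1, I2, I3):
        r *= (P([(A, a), (j, 1)]) / P([(j, 1)])) / P([(A, a)])
    return r

for est in (est_nu0_1, est_cond_indep, est_full_indep):
    print(est.__name__, round(est(1) + est(0), 4))
# only the nu0 = 1 estimate sums to 1
```

On this table the conditional independence sums fall short of 1 and the full independence sums exceed 1 (the latter even yielding an illicit P* > 1 for A = 1), while the ν₀ = 1 estimates sum to 1 by construction.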
From Figure 4.8, we observe that the estimate (4.7) based on "full independence"
leads to a large over-compounding of the concordant information I_k = 1. Conditional
independence (4.4) gives an estimate which is less than the prior probability (0.5)
at small ranges; this represents a severe error since all three individual probabilities
P(A = 1 | Iⱼ = 1) are above the prior. Again the ν₀ = 1 approximation (4.3) provides
Figure 4.10: Approximation errors for the eight data value configurations. The conditional probability estimated through the ν₀ = 1 model (solid lines) has more stable and smaller errors; the conditional independence assumption (lines marked with stars) leads to the largest errors.
Figure 4.9 shows the sum of the two estimates P*(A | I₁, I₂, I₃) + P*(nonA | I₁, I₂, I₃)
for the three sets of approximations. That sum should be equal to 1. It appears
that only the estimate associated with ν₀ = 1 verifies that consistency relation for all
ranges. The two independence-based estimates (4.4) and (4.7) are not self-consistent
(over A and nonA), particularly the estimate based on conditional independence. This
consistency represents a valuable in-built property of the ν₀ = 1 approximation in
presence of data dependence.
The approximation errors, defined as:

[P*_{ν₀=1}(A = 1 | i₁, i₂, i₃) − P(A = 1 | i₁, i₂, i₃)]
[P*_{CI}(A = 1 | i₁, i₂, i₃) − P(A = 1 | i₁, i₂, i₃)]
[P*_{FI}(A = 1 | i₁, i₂, i₃) − P(A = 1 | i₁, i₂, i₃)]
Figure 4.11: Error linked to ν₀ = 1 (non-equilateral case). The errors attributed to the ν₀ = 1 model are small and stable, attesting that this model is the best among those presented.
Figure 4.12: Error linked to the "full independence" hypothesis (non-equilateral case). The largest error is attributed to the case when all three data are equal to 1. This is the case where the consequence of the wrong assumption of full independence is most severe.
Figure 4.13: Error linked to conditional independence (non-equilateral case). The errors are large and unstable. The positive errors, associated with overestimation of the true conditional probability (A = 1), are higher than the negative errors associated with underestimation.
for each of the eight data values combinations when estimating A = 1 are plotted in
Figure 4.10. Again, the conditional probability estimated through the ν₀ = 1 assumption
has more stable and smaller errors; the conditional independence assumption leads
to the largest errors. Figures 4.11, 4.12, and 4.13 give the errors specific to each estimate
with indication of the three data values. Beware of the different ordinate axis scaling.
The errors associated with the ν₀ = 1 estimate (4.3) are small and centered around
zero (Figure 4.11). That error is smallest when the two close-by data are different
(I₁ ≠ I₃), corresponding to data values less conflicting with the underlying no-data-interaction
hypothesis. The ν₀ = 1 model appears to downplay the contribution of
the isolated I₂ datum value: in Figure 4.11 the two error curves for I₂ = 0 and I₂ = 1
are similar for any given combination of the I₁, I₃ data values. The smallest errors
for the ν₀ = 1 model are related to cases of non-concordant data values, particularly
non-concordant I₁ and I₃ values, i.e. 001, 011, and 110.
Figure 4.12 shows the errors for the "full independence" estimate. The error is largest
Figure 4.14: Bias (error) averaged over all data values combinations (non-equilateral case). The full independence and ν₀ = 1 models provide reasonably unbiased estimates, while conditional independence leads to severe overestimation.
for the case of data I₁ = I₂ = I₃ = 1 concordant with the outcome A = 1 being
evaluated. In such a case, the assumption of data independence is most invalid. The
most significant result is the large error associated with the conditional independence
estimate; see Figure 4.13. The errors are much larger and more unstable than for
the other two estimates. Also, the positive errors associated with overestimation of
the true conditional probability that A = 1 are much higher than the negative errors
associated with underestimation, leading to an overall bias.
Figure 4.14 shows the bias, or error averaged over the eight data value combinations,
when estimating the probability that A = 1. On average, the ν₀ = 1 model (4.3) and
the "full independence" model (4.7) provide reasonably unbiased estimates, while the
estimate based on conditional independence leads to a severe overestimation of the
reference posterior probability.
4.2 A 3D case study
The applicability of the ν₀ inference paradigm is now tested using a large 3D reference
binary data set where all conditional probabilities involved in the tau and nu expressions
(2.32) and (3.6) are known, including the exact fully data-conditioned probability
P(A = a | Dᵢ = dᵢ, i = 1, . . . , n). Various approximations of that reference probability
can be evaluated. The heteroscedasticity of the ν₀, νᵢ and τᵢ weights, i.e. their level
of dependence on the data values (dᵢ, i = 1, . . . , n), can be evaluated. The greater that
heteroscedasticity, the more difficult the inference of these data interaction parameters
in practice.
4.2.1 The reference data set
We start by generating a reasonably large 3D non-conditional realization of a Gaussian
field using the sequential Gaussian simulation code sgsim of the GSLIB software [11].
This 3D field is of size 100×100×50, comprising 500,000 nodes. The variogram model
used is spherical with a small nugget (10%), an isotropic horizontal range equal to 50
pixel units, and a shorter vertical range equal to 20 pixel units. This Gaussian field is
then truncated at its upper quartile value, yielding the reference binary indicator field
shown in Figure 4.15.
Denote that reference field by S : {A(u) = 0 or 1, u ∈ S}, with P(A(u) = 1) =
0.25. In this work, we will sometimes refer to the binary data valued 1 as sand;
conversely, the binary data valued 0 will be referred to as non-sand or mud. We
borrowed this convention from petroleum engineering, where the location of channel
sand is of great interest. Figure 4.16 gives the reference indicator variograms
in the x, y, z directions calculated from indicator data of the top 35 layers of S;
the reason for excluding the bottom 15 layers will become apparent shortly.
Those indicator variograms reflect the horizontal-to-vertical anisotropy of the original
Gaussian field.
Figure 4.15: Reference binary image generated by truncating a continuous Gaussian realization at its upper quartile (P(A = 1) = 0.25; the A = 0 category, of proportion 0.75, is shown in black).
Figure 4.16: Exhaustive indicator variograms, calculated over the 35 top layers. EW is the east-west direction and NS is north-south direction.
Figure 4.17: Data events definition: (1) conditioning to one data event D; (2) conditioning to two data events D, B; (3) conditioning to three data events D, B, C.
4.2.2 The estimation configuration
Consider the evaluation of the conditional probability of an unsampled value A(u) = 1,
given any combination of the following three multiple-point data events (Figure 4.17
(1), (2), (3)):
• the closest data event D comprises four data locations at the level just below
that of A(u). These four data are at the corners of a square centered on the
projection of location u on their level (Figure 4.17 (1));
• the next closest data event B also comprises four data locations with the same
geometry as for data event D, but located five levels below that of A(u) (Figure
4.17 (2));
• the furthest away data event C again comprises four data locations, but located
15 levels below that of A(u) (Figure 4.17 (3)).
If the unsampled location u of A(u) spans only the eroded field
S₀ = {x = 11, . . . , 90; y = 11, . . . , 90; z = 16, . . . , 50}, then each value A(u) can be
evaluated by any of the 3 data events D, B, C. From here on, all statistics will refer
to that "common denominator" field S₀, comprising 224,000 nodes. Over that central
field S₀, the marginal statistic for the event A = 1 being assessed is P(A) = 0.274.
The definition of an "eroded" field S₀ common to all data configurations entails that
the spatial averages of conditional probabilities (proportions) remain the same no
matter the conditioning data event retained. For example, if conditioning is only to
the sole D-data event:

P(A|D) = (1/|S₀|) Σ_{u∈S₀} P(A(u) = 1 | D = d(u)) = P(A = 1) = 0.274
where the data event D can take 2⁴ = 16 possible combinations of data values. When
conditioning jointly to the two D and B data events:

P(A|D, B) = (1/|S₀|) Σ_{u∈S₀} P(A(u) = 1 | D = d(u), B = b(u)) = P(A = 1) = 0.274

where the data event (D, B) can take 2⁸ = 256 possible combinations of binary data
values.
When conditioning jointly to the three data events D, B and C:

P(A|D, B, C) = (1/|S₀|) Σ_{u∈S₀} P(A(u) = 1 | D = d(u), B = b(u), C = c(u)) = P(A = 1) = 0.274

where the data event (D, B, C) can take 2¹² = 4096 possible combinations of data
values. Because S₀ is not that large, |S₀| = 224,000 nodes, not all 2¹² data values
4.2. A 3D CASE STUDY 99
combinations are present in S₀; this does not affect, however, the previous equality:
P(A = 1|D, B, C) = P(A = 1) = 0.274.
Note also that the nu representation (3.6) does not restrict us to the point
support of the unknown A (as in this example). The unknown event A can similarly be
defined as a data event, provided we can find enough replicates of such a data event
in our reference binary data set.
4.2.3 Conditional probabilities and estimates
As an example, Figure 4.18(1) gives the S₀-volume of the 224,000 exact probability
values P(A = 1|D, B, C), which are valued in the interval [0, 1] with mean 0.274 and
variance 0.067. Again, the mean is equal to that of the reference binary values. The
histogram of the probability values is given in Figure 4.18(2). We will use this field
as the comparison tool for future analysis. Similar figures and statistics are available
for all the following conditional probabilities, although not all are given here:
• single data event-conditioned: P(A(u) = 1|D), P(A(u) = 1|B), P(A(u) = 1|C)
• two data events-conditioned: P(A(u) = 1|D, B), P(A(u) = 1|D, C),
P(A(u) = 1|B, C)
• all three data events-conditioned: P(A(u) = 1|D, B, C)
• the estimated probability P*(A(u) = 1|D, B, C) using the ν₀ = 1 model (3.6)
to combine the previous single data event-conditioned probabilities.
Under the ν₀ = 1 model, at each location u the estimate is:
\[ P^*(A(u) = 1 \mid D, B, C) = P^*(A(u) = 1 \mid D = d(u),\ B = b(u),\ C = c(u)) = \frac{1}{1 + x^*(u)}, \]
with the estimated distance x*(u) such that:
\[ \frac{x^*(u)}{x_0} = \frac{x_D(u)}{x_0} \cdot \frac{x_B(u)}{x_0} \cdot \frac{x_C(u)}{x_0} \]
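The combination above is simple to implement. The sketch below (a minimal version; the elementary probabilities passed in are hypothetical illustration values, only the prior P(A = 1) = 0.274 comes from the case study) turns single data-event conditioned probabilities into distances, multiplies their standardized ratios, and converts back to a probability:

```python
def distance(p):
    """Distance x = P(A != 1) / P(A = 1) associated with a probability p."""
    return (1.0 - p) / p

def nu0_estimate(p_prior, p_singles, nu0=1.0):
    """Combine single data-event conditioned probabilities under the nu model.

    p_prior   : marginal probability P(A = 1)
    p_singles : elementary probabilities P(A = 1 | D_i)
    nu0       : global interaction parameter (nu0 = 1 means no data interaction)
    """
    x0 = distance(p_prior)
    ratio = nu0
    for p in p_singles:
        ratio *= distance(p) / x0
    x_star = x0 * ratio           # x*(u)/x0 = nu0 * prod_i x_i(u)/x0
    return 1.0 / (1.0 + x_star)   # always a licit probability in [0, 1]

# Hypothetical elementary probabilities for the three data events D, B, C:
p = nu0_estimate(0.274, [0.5, 0.4, 0.3])
assert 0.0 < p < 1.0
```

Note that a single datum carrying no information (P(A = 1|D) equal to the prior) leaves the estimate at the prior, as it should.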
Figure 4.18: (1) The reference eroded data set S0 (N = 224,000, m = 0.274, σ² = 0.067), (2) its histogram, and (3) the binary reference field with the prior P(A = 1) = 0.274. The mean of the eroded data set is equal to that of the reference binary values. The size of the probability map (1) is 80 × 80 × 35 = 224,000 nodes. This probability map will be used as the comparison tool for future analysis.
where:
\[ x_0 = \frac{1 - P(A = 1)}{P(A = 1)} = \frac{1 - 0.274}{0.274} = 2.65 \ \text{is the marginal distance;} \]
\[ x_D(u) = \frac{P(A(u) \neq 1 \mid D = d(u))}{P(A(u) = 1 \mid D = d(u))} \ \text{is the distance to } A(u) = 1 \text{ updated by the data event } D = d(u). \]
The distance x_D(u) varies from one location u to another. It is obtained by scanning the reference image S0 with the template definition of Figure 4.17(1) for the proportion of D-replicates identifying the data values combination d(u) which also features at their upper center (one level above) a value A(u) = 1. Note that our estimation paradigm assumes that all elementary conditional probabilities P(A|D), P(A|B), P(A|C) are known. This analysis addresses only the problem of combining these elementary probabilities into an estimate of the fully conditioned probability P(A|D, B, C)
while accounting for data interaction. Similarly, from the training image one can
retrieve the other two elementary distances x_B(u) and x_C(u).
The ν₀ = 1 model then provides an estimate of the fully conditioned probability, P*(A(u) = 1|D, B, C) (Figure 4.19(1)).
Figure 4.19: (1) The estimate of the fully conditioned probability P(A | D,B,C) using the ν₀ = 1 model, (2) its histogram, and (3) the reference binary field with the prior P(A = 1) = 0.274. The spatial mean and variance of the estimated probabilities using the ν₀ = 1 model are greater than the corresponding statistics of the reference case, leaving room for improvement.
This estimate is necessarily valued in the interval [0, 1]: its spatial mean is 0.288 and
spatial variance is 0.098. Its histogram is given in Figure 4.19(2). The histogram and
scattergram of the error defined as
P*(A(u) = 1|D, B, C) - P(A(u) = 1|D, B, C)
are shown in Figure 4.20.
The spatial variance and the spatial mean of the estimated probabilities using the ν₀ = 1 model are greater than the corresponding statistics of the exact conditional
Figure 4.20: (1) Histogram of the error P*(A | D,B,C) − P(A | D,B,C) (P*: m = 0.288, σ² = 0.098; P: m = 0.274, σ² = 0.067) and (2) the corresponding scatterplot of P*(A | D,B,C) based on the ν₀ = 1 model versus the reference P(A | D,B,C).
probability of Figure 4.18, leading to a positive mean error of 0.014. One would expect smoothing (smaller spatial variance) from an estimation. Note that the ν₀ = 1 model corresponds to an approximation of no-data-interaction, which is a poor assumption in the presence of the two well-correlated data events D and B. This ignorance of data interaction results in over-compounding of the individual single-datum conditioned probabilities, leading to an overestimation of the fully conditioned probability and an associated greater variance.
4.2.4 Ordering the data values combinations
The statistics presented in Figures 4.19 and 4.20 pool together the 224,000 estimated
conditional probabilities over So, irrespective of the actual conditioning data values.
Recall that there are 4 × 3 = 12 binary indicator data grouped four by four into the three data events D, B, and C; therefore there is only a total of 2^12 = 4096 possible data values combinations.
To study the heteroscedasticity of the nu and tau parameters, that is, their dependence on data values, we should first rank or classify the 4096 possible data values combinations, then plot the ν_i, τ_i, ν₀ parameters versus data values combinations and observe their data values dependence. Note that the lesser that data dependence, particularly of the single parameter ν₀, the easier would be its inference in practice; this would justify our paradigm of separating individual data event contribution and data interaction.
Out of the total of 4096 possible data values combinations, 96 were not found in S0, and of the remaining 4,000 only 931 combinations were found with at least 10 replicates. To ensure statistical significance we retain only the latter. These 931 data
values combinations were ranked along the abscissa axes of Figures 4.21 and 4.22
with increasing proportion of binary data valued 1, starting at abscissa 1 with 12
binary data all valued 0 (which may be interpreted as "no sand" event) and ending
at abscissa 931 with all 12 data valued 1 (which may be interpreted as all "sand").
The combinations with the same proportion of binary data valued 1 were then ranked
by physical distance to the unknown event A. From the template definition (Figure
4.17(3)), the data event D is closest to the unknown event A, followed by data event
B; then by data event C which is the furthest from that unknown A.
The proportion of binary data valued 1 increases toward the higher abscissa values, and since we are evaluating the probability of the event A = 1, we expect an increase in data interaction; hence any hypothesis of no-data-interaction or data independence would become worse.
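The ordering just described can be sketched as follows (a minimal version that ranks by the proportion of data valued 1 only; the secondary tie-breaking by physical distance to the unknown A is omitted):

```python
from itertools import product

# Enumerate all 2^12 possible binary data value combinations for the
# 12 conditioning data (three 4-point data events D, B, C) and rank them
# by increasing proportion of data valued 1, as along the abscissa of
# Figures 4.21 and 4.22.
combos = sorted(product([0, 1], repeat=12), key=sum)

assert len(combos) == 2 ** 12     # 4096 combinations
assert sum(combos[0]) == 0        # "no sand": all 12 data valued 0
assert sum(combos[-1]) == 12      # all "sand": all 12 data valued 1
```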
The next section discusses the advantages of the nu model versus the tau model for the above cases and in a general setting.
Figure 4.21: Sequence-dependent interaction parameters τ₃ (red) and ν₃ (blue) versus data value combination id, for data sequences (1) DBC/BDC (σ²(ν) = 0.48, σ²(τ) = 18.6), (2) DCB/CDB (σ²(ν) = 2.85, σ²(τ) = 80.2), and (3) CBD/BCD (σ²(ν) = 2.63, σ²(τ) = 5.24).
4.2.5 Heteroscedasticity of the tau and nu weights
It follows from expressions (3.5) and (2.45) that:
• for any data sequence: ν₁ = τ₁ = 1
• the ν_i and τ_i parameters are data sequence-dependent. For example, the last parameter, ν₃ or τ₃, is not the same whether it applies to the sequence BDC or the sequence CDB (Figures 4.21(1), (2)). However, this last parameter remains unchanged from sequence BCD to sequence CBD (Figure 4.21(3)), or from sequence BDC to sequence DBC (Figure 4.21(1)). Indeed, the last parameter
Figure 4.22: Exact ν₀ parameter for the 931 data value combinations. The ν₀ = 1 model is excellent for the first 600 data value combinations, as can be seen from the ν₀ values being close to 1.
(ν₃ or τ₃) measures the data interaction between the last data event (D in Figure 4.21(3)) and the undifferentiated ensemble of all previous data (BC or CB).
• the single global data interaction parameter ν₀ = ν₁ν₂ν₃ = ν₂ν₃ is data sequence-independent. The greater |1 − ν₀|, the larger the global data interaction.
Figure 4.21 gives the (τ₃, ν₃) parameter values applied to the last data event in the data sequence, as calculated from their exact expressions (3.5) and (2.45) using the exhaustive proportions read from the reference field S0. The following observations
can be made:
• the tau parameter is more unstable than the corresponding nu parameter, as seen from the higher variability of the tau series compared to the nu series (Figure 4.21). This is due to the denominator of the tau expression (2.45) becoming close to log 1 = 0 whenever a datum is little or non-informative in discriminating A = a from A ≠ a, as is the case for the furthest-away data event C.
• the ν₃ parameter accounting for interaction of the last data event in the sequence is smallest when applied to the non-informative remote data event C (Figure 4.21(1)) and largest when applied to the two most informative closer data events D or B (Figures 4.21(2) and 4.21(3)).
• the ν₃ parameter increases along the abscissa, indicating that the data interaction |1 − ν₃| is data values-dependent and that data interaction increases as more of the elementary binary indicator data are valued 1; note that the event being assessed is A = 1. This is particularly notable when ν₃ applies to D, the most informative data event (Figure 4.21(3)).
Figure 4.22 gives the ν₀ global data interaction parameter as calculated from the exact ν-expressions (3.5) with ν₀ = 1 · ν₂ · ν₃. This ν₀ value is seen to be data value-dependent, increasing as the three data events D, B, C become more redundant in assessing the probability of event A = 1 by displaying a greater proportion of elementary binary data valued 1 (higher abscissa values). However, for all but the last 300 data values combinations out of the total 931 retained, the approximation ν₀ = 1 appears quite robust, i.e. essentially data value-independent (homoscedastic). For the last 300 data values combinations, a quadratic model of the type
\[ \nu_0 = 1 + \lambda (p - p_c)^2, \quad \forall\, p > p_c \]
would provide a good approximation of the data value dependence of that single global correction parameter ν₀, where:
• p is the proportion of sand in the two closest data events D and B pooled together;
• p_c is a threshold proportion below which the ν₀ = 1 model would be applied, above which the quadratic model would be applied;
• λ > 0 is a fitting parameter.
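A direct transcription of this piecewise model (λ and p_c are fitting parameters to be calibrated on the training set; the numeric values below are illustrative only):

```python
def nu0_quadratic(p, p_c, lam):
    """Quadratic model for the global interaction parameter:
    nu0 = 1 + lam * (p - p_c)^2 for p > p_c, else nu0 = 1 (no data interaction).

    p   : proportion of sand in the pooled closest data events D and B
    p_c : threshold proportion below which the nu0 = 1 model applies
    lam : non-negative fitting parameter
    """
    if p <= p_c:
        return 1.0
    return 1.0 + lam * (p - p_c) ** 2

# Illustrative values: below the threshold the no-interaction model holds;
# above it nu0 grows quadratically with the sand proportion.
assert nu0_quadratic(0.25, p_c=0.5, lam=4.0) == 1.0
assert nu0_quadratic(0.75, p_c=0.5, lam=4.0) == 1.25
```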
With the previous quadratic approximation, the dimension 3 × 4 = 12 for the data value dependence of ν₀ has been reduced to 2 (the two parameters λ and p_c). In a real application, S0 would be a training image built to mimic the actual data interaction. A study of data interaction would be developed on that training data set, resulting in some approximation of the global ν₀ parameter, say:
\[ \nu_0 = \varphi(s_j,\ j = 1, \ldots, n), \quad \text{with } n \text{ small}, \]
where φ is a function of a few easily accessible statistics s_j summarizing the possibly much larger space of variability of conditioning data events and values. That function ν₀ = φ(·) is then exported to the study of combining the various single data event-conditioned probabilities. These single data event-conditioned probabilities should not be read from the training set; only the interaction parameter ν₀, or equivalently the function φ, is to be borrowed from the training set.
4.2.6 Independence-based estimates
To evaluate comparatively the performance of the ν₀ = 1 (no-data-interaction) model, from the same S0 reference field we calculate estimates of the fully conditioned probability P(A|D,B,C) stemming from two common approaches calling for data (conditional) independence. The expressions for these two estimators were given in Section 2.1.
The "conditional independence" (CI) estimator is written:
\[ \frac{P_{CI}(A \mid D, B, C)}{P(A)} = \frac{P(A \mid D)}{P(A)} \cdot \frac{P(A \mid B)}{P(A)} \cdot \frac{P(A \mid C)}{P(A)} \cdot \frac{P(D)P(B)P(C)}{P(D, B, C)} \quad (4.8) \]
The "full independence" (FI) estimator is written:
\[ \frac{P_{FI}(A \mid D, B, C)}{P(A)} = \frac{P(A \mid D)}{P(A)} \cdot \frac{P(A \mid B)}{P(A)} \cdot \frac{P(A \mid C)}{P(A)} \quad (4.9) \]
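The danger of illicit probabilities under full independence is easy to demonstrate numerically. The sketch below (hypothetical elementary probabilities, not taken from the case study) applies expression (4.9):

```python
def fi_estimate(p_a, p_singles):
    """'Full independence' estimator (4.9):
    P_FI(A | D1,...,Dn) = P(A) * prod_i [ P(A | Di) / P(A) ].
    Nothing constrains the result to [0, 1] when the data actually interact."""
    est = p_a
    for p in p_singles:
        est *= p / p_a
    return est

# Three individually informative but redundant data events over-compound:
p_fi = fi_estimate(0.274, [0.8, 0.8, 0.8])
assert p_fi > 1.0   # an illicit probability, to be truncated in practice
```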
The two sets of estimated probabilities P_CI(A = 1|D,B,C) and P_FI(A = 1|D,B,C), given by expression (4.8) for conditional independence and expression (4.9) for full independence, are retrieved from the reference set S0. These and the ν₀ = 1 model estimated probability P*(A = 1|D,B,C) are plotted against the S0-exact probability P(A = 1|D,B,C) in Figure 4.23.
Figure 4.23: Scatterplots of estimated probabilities P*(A | D,B,C) versus the reference P(A | D,B,C): (1) for the estimate based on the ν₀ = 1 model (no order relation violations, best correlation), (2) for the estimate based on the conditional independence assumption (fewer order relation violations, poorest correlation), (3) for the estimate based on the full independence assumption (most order relation violations, second best correlation).
Although there is clearly data interaction (essentially between data events D, B, and A), the no-data-interaction model ν₀ = 1 (Figure 4.23(1)) gives reasonable results with the largest correlation coefficient ρ = 0.82 and with estimated values necessarily valued in the interval [0, 1]. The full independence approximation (Figure 4.23(3)) may appear at first sight to give equivalently good results (ρ = 0.70), but expression (4.9) does not guarantee that the resulting estimate P_FI(A|D,B,C) lies in the interval [0, 1] whenever there is actual data dependence and interaction: a large number of these probability estimates are valued above 1. Assuming independence between data might lead to severe violations such as probabilities greater than 1. In practice, these violations need to be corrected, for example by setting all illicit probabilities to 1. However, such artificial correction may add to the overall bias of the estimates. For
example, the conditional independence estimator (Figure 4.23(2)) has fewer order violations than the estimator based on the full independence assumption (Figure 4.23(3)), yet its correlation coefficient with the reference case (ρ = 0.36) is considerably lower compared to the ν₀ = 1 model (ρ = 0.82) or the full independence model (ρ = 0.70).
Using the no-data-interaction ν₀ = 1 model calls for considering distances which are ratios of conditional probabilities. In the presence of departure from the data independence hypothesis, one is better off approximating ratios of probabilities (the ν₀ = 1 model), which are generally more stable than the probabilities themselves. This was the original point made by Journel [48]. However, much better results than those provided by the ν₀ = 1 model (as will be shown in the next section) can be obtained with little additional effort by modeling the heteroscedastic variability of ν₀ using a training/calibration data set mimicking the actual data interaction. No matter how approximate that training model of data interaction is, it is likely to be better than a blanket and wrong hypothesis of no-data-interaction, or worse, of data conditional independence. We consider this approach hereafter.
4.2.7 The classified ν₀ approach
The proposed classified ν₀ approach described in Section 3.3.2 can be summarized in two phases: a training phase and an application phase.
In the training phase we need to:
1. build a training data set mimicking (even only roughly) the actual data interaction. From that set, retrieve the training data values-dependent ν₀-values, called proxy ν₀-values
2. reduce each set of training data values to a few summary statistics or filter scores. Based on these scores, classify the proxy values ν₀. Each class is identified by a single (average or median) ν₀-value, called a "class ν₀-prototype"
The application phase then consists of returning to the actual study field and:
1. Finding the class closest to the actual conditioning data scores.
2. Retrieving that class "prototype" value ν₀.
3. Using that ν₀ value to combine the elementary probabilities. These elementary probabilities must be evaluated from the actual study field, not from the training data set.
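A minimal sketch of the two phases, assuming a summary score equal to the average proportion of data valued 1 (function names and the toy data are illustrative, not the thesis implementation):

```python
from collections import defaultdict

def build_prototypes(training_combos, proxy_nu0):
    """Training phase: classify the proxy nu0 values by a summary score
    (here the average proportion of data valued 1) and reduce each class
    to a single prototype, the class average."""
    classes = defaultdict(list)
    for combo, nu0 in zip(training_combos, proxy_nu0):
        score = sum(combo) / len(combo)
        classes[score].append(nu0)
    return {s: sum(v) / len(v) for s, v in classes.items()}

def lookup_nu0(prototypes, combo):
    """Application phase: find the class closest to the actual conditioning
    data score and return its prototype nu0 (the elementary probabilities
    themselves come from the actual study field, not the training set)."""
    score = sum(combo) / len(combo)
    closest = min(prototypes, key=lambda s: abs(s - score))
    return prototypes[closest]

# Toy training set: three 2-point data combinations with proxy nu0 values.
protos = build_prototypes([(0, 0), (0, 1), (1, 1)], [1.0, 2.0, 3.0])
assert lookup_nu0(protos, (1, 0)) == 2.0   # score 0.5 -> middle class
```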
In the following example, the training data set is the reference data set (an ideal
case). Later in this section, the more realistic and less favorable case of a training set
different from the reference one will be considered.
For demonstration purposes, consider then the classified ν₀ approach applied to the reference data set shown in Figure 4.15. The goal is to estimate P(A = 1|D,B,C). Each of the three conditioning data events D, B, and C comprises 4 binary data points (refer to Figure 4.17 for the geometry of the three data events). There are 931 possible data value combinations for which we can reliably estimate such probability.
Consider as data summary (score) the single statistic defined as the average sand proportion (i.e. the average of the 3 × 4 binary data, where sand is defined by the binary data valued 1). That statistic can take only twelve possible values corresponding to the 12 classes of data events. Each class prototype ν₀ value is the average of the proxy ν₀ values falling into that class. In Figure 4.24, the prototype ν₀ values are shown in red for each of the 12 classes. The mean of these 12 proxy ν₀ values is equal to 2.21, which indicates a significant deviation from the assumption of no data interaction (i.e. the ν₀ = 1 model). For each set of actual data values we look for the closest training class and use the corresponding prototype ν₀ value (instead of ν₀ = 1)
for building the fully conditioned probability P(A = 1|D,B,C). An important remark: the uncertainty of these prototype ν₀ interaction weights is different for each of the 12 classes. For example, the variance of the proxy ν₀ weights for the last class, as seen on the right of Figure 4.24, is much larger than the variance of the proxy ν₀ weights for the classes in the middle of this figure. To account for such uncertainty, we can then consider evaluating the lower and upper quantiles (e.g. the 0.1-quantile and 0.9-quantile) of the proxy ν₀ weights for each class, and then evaluating the fully conditional probability P(A = 1|D,B,C) for these two quantiles.
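The class uncertainty can be bracketed with empirical quantiles of the proxy ν₀ weights of each class, for instance (a hand-rolled linear-interpolation quantile; the sample values are hypothetical):

```python
def empirical_quantile(values, q):
    """Linear-interpolation empirical q-quantile (0 <= q <= 1) of a sample,
    used here to bracket the proxy nu0 weights of one class."""
    v = sorted(values)
    pos = q * (len(v) - 1)
    lo_i = int(pos)
    hi_i = min(lo_i + 1, len(v) - 1)
    return v[lo_i] + (pos - lo_i) * (v[hi_i] - v[lo_i])

proxy_nu0 = [1.1, 0.9, 1.4, 2.0, 1.0]   # hypothetical proxy weights of one class
lo = empirical_quantile(proxy_nu0, 0.1)
hi = empirical_quantile(proxy_nu0, 0.9)
assert lo <= empirical_quantile(proxy_nu0, 0.5) <= hi
```

Evaluating P(A = 1|D,B,C) once with `lo` and once with `hi` then gives the two bounding probability estimates described above.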
Figure 4.24: Exact ν₀ values versus average sand values defined over the three data events D, B, C. The average ν₀ values and their statistics are shown in red.
Comparison of the classified ν₀ approach with the ν₀ = 1 model performance is shown in Figure 4.25.
The left graph shows a 0.82 coefficient of correlation between the reference true probability and the ν₀ = 1 model for the 931 data value combinations. We observe only a small increase in that correlation when using the classified ν₀ approach, with ρ = 0.85.
Linear correlation, however, is not a fair measure of comparison between these two
Figure 4.25: Scattergram of the ν₀ = 1 model (left) and the classified ν₀ model (right) relative to the reference probability. The correlation coefficient of the classified ν₀ approach with the reference case (ρ = 0.85) is somewhat improved compared to that of the ν₀ = 1 model (ρ = 0.82).
models, as it measures only linear dependence. The significant improvement brought by the classified ν₀ approach can be observed in the reproduction of the reference statistics for the 931 data value combinations retained; see Table 4.3.
Because of data over-compounding, the ν₀ = 1 model overestimates the reference spatial mean and variance of the 931 exact probabilities. The classified ν₀ approach reproduces the statistics of the reference case much better. The class-dependent ν₀ model provides a significant improvement which is not fully reflected by the correlation coefficient.
             reference    ν₀ = 1    classified ν₀ model
mean           0.44        0.52          0.41
variance       0.04        0.07          0.04

Table 4.3: Summary statistics: means and variances of the reference conditional probabilities and the approximations stemming from the nu representation, for the 931 data value combinations.
Experiments with a different training set
To further test the robustness of the previous results, consider two different data sets. The first one provides the information content (the actual data), and the other offers training data from which data interaction is borrowed. For this, ten independent Gaussian realizations were truncated at their respective upper quartiles, generating ten independent binary fields similar to that shown in Figure 4.15. The means and variances of the ten eroded realizations S0 are given in Table 4.4.
Table 4.4: Means and variances of the 10 independent realizations S0.
We can now consider different combinations of these ten realizations for retrieval of the various conditional probabilities:
• information content: we obtained the individually conditioned probabilities P(A|B), P(A|C), P(A|D) from realization i = 1, …, n = 10.
• proxy ν₀ values: for training we then used any realization j ≠ i.
There is a total of n(n − 1) = 90 possible combinations of the pair (actual vs. training) realizations that can be used for the approximation of P(A = 1|D,B,C) using the classified ν₀ approach. These estimates are then compared to the results of the ν₀ = 1 model.
Figure 4.26 shows the histograms of the means of the 90 reference P(A = 1|D,B,C) values and their estimators based on the classified ν₀ approach and on the ν₀ = 1 model. The average of these 90 reference mean values is 0.399 (Figure 4.26, left). The respective averages for the classified ν₀ approach and for the ν₀ = 1 model are 0.386 and 0.458 (Figure 4.26, right and center respectively). The ν₀ = 1 model leads to significant overestimation by over-compounding the individual probabilities. The classified ν₀ approach reproduces well the mean value of the reference case. This similarity is a highly desirable property, as it indicates that the classified ν₀ approach is unbiased.
Figure 4.27 shows the histograms of the variances of the 90 reference P(A = 1|D,B,C) values and their estimators based on the classified ν₀ approach and on the ν₀ = 1 model. The average of these 90 reference variance values is 0.041 (Figure 4.27, left). The respective averages for the classified ν₀ approach and for the ν₀ = 1 model are 0.041 and 0.076 (Figure 4.27, right and center respectively). The classified ν₀ approach reproduces almost exactly the variance of the reference case.
             reference    ν₀ = 1    classified ν₀
mean           0.399       0.458        0.386
variance       0.041       0.076        0.041

Table 4.5: The average means and variances of P(A = 1|D,B,C) over the 90 combinations.
Table 4.5 summarizes the average means and variances of the 90 reference fully conditioned probabilities P(A = 1|D,B,C) and the same statistics based on the ν₀ = 1
Figure 4.26: The histograms of the means of the 90 reference P(A = 1|D,B,C) values (left), and their estimators based on the ν₀ = 1 model (center) and the classified ν₀ approach (right).
and the classified ν₀ approaches. The ν₀ = 1 model significantly over-compounds the elementary probabilities, leading to a significant overestimation (bias). In contrast, the classified ν₀ approach accounts better for the joint data interaction, and thus decreases the over-compounding of information content and hence reduces the overall bias.
Figure 4.27: The histograms of the variances of the 90 reference P(A = 1|D,B,C) values (left), and their estimators based on the ν₀ = 1 model (center) and the classified ν₀ approach (right).
Chapter 5
Application to non-binary data
As was shown in Chapter 4, in the presence of actual data dependence, the ν₀ = 1 model significantly outperforms the traditional estimators based on any data independence hypothesis. Estimators defined by the independence assumptions can lead to illicit probabilities, e.g. greater than one. The ν₀ = 1 model guarantees licit probabilities regardless of the level of data dependence. In this chapter, we generalize the nu model to the case of non-binary variables, with extensive testing using a ternary variable data set.
5.1 A single constraint
Consider the evaluation of the posterior probability P(A = k|D) for all k = 1, …, K, where k is a particular outcome of the unknown A. For example, category k could indicate the presence/absence of a channel sand.
The conditioning information D is constituted of n elementary data events D_i:
\[ D = \bigcap_{i=1}^{n} D_i \]
Using the notations of Chapter 3, the distances to the event A = k are written:
\[ x_0^{(k)} = \frac{P(A \neq k)}{P(A = k)}, \quad x_i^{(k)} = \frac{P(A \neq k \mid D_i)}{P(A = k \mid D_i)},\ i = 1, \ldots, n, \quad x^{(k)} = \frac{P(A \neq k \mid D)}{P(A = k \mid D)} \quad (5.1) \]
The fully conditioned posterior probability P(A = k|D) is then:
\[ P(A = k \mid D) = \frac{1}{1 + x^{(k)}} \quad (5.2) \]
These posterior probabilities must verify the law of total probability, whatever the data set D, i.e. for all x_0^(k), x_i^(k), and x^(k):
\[ \sum_{k=1}^{K} P(A = k \mid D) = \sum_{k=1}^{K} \frac{1}{1 + x^{(k)}} = 1 \quad (5.3) \]
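The consistency requirement (5.3) can be checked numerically. Using, for illustration, the three marginal distances x0^(k) of the ternary case study of Section 5.2 (the third value below, 3.0161, is computed from the prior proportion 0.2490):

```python
def posteriors(distances):
    """Posterior probabilities P(A = k | D) = 1 / (1 + x^(k)), expression (5.2)."""
    return [1.0 / (1.0 + x) for x in distances]

# Marginal distances of the ternary case study: the corresponding
# probabilities are the prior proportions and must sum to 1 (expression 5.3).
probs = posteriors([0.5733, 3.0161, 7.6655])
assert abs(sum(probs) - 1.0) < 1e-3
```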
For each category k, the nu expression from Chapter 3 is written:
\[ \frac{x^{(k)}}{x_0^{(k)}} = \nu_0^{(k)} \prod_{i=1}^{n} \frac{x_i^{(k)}}{x_0^{(k)}}, \quad \text{with} \quad \nu_0^{(k)} = \prod_{i=2}^{n} \nu_i^{(k)} \]
The sequence-dependent interaction parameter ν_i^(k) is written as:
\[ \nu_i^{(k)} = \frac{P(D_i \mid A \neq k,\ D_1, \ldots, D_{i-1}) \,/\, P(D_i \mid A \neq k)}{P(D_i \mid A = k,\ D_1, \ldots, D_{i-1}) \,/\, P(D_i \mid A = k)} \]
Note that the prior distances should verify the constraints:
These estimates verify the consistency relation \( \sum_{k=1}^{3} P(A = k \mid D_1, D_2) = 1.00 \). Without the proposed constraint (5.11) such consistency would not be possible, leading to an order violation problem and possibly to a biased estimate of the fully conditional posterior probability P(A = k|D_1, D_2).
5.2 Large non-Gaussian ternary case study
In Chapter 3, we suggested the data combination paradigm in which the posterior probability P(A = k|D_1, …, D_n) is obtained by completely separating the single-datum information content, through the n elementary probabilities P(A = k|D_i), i = 1, …, n, from the data interaction, through the nu interaction weights ν₀^(k), k = 1, …, K. The elementary probabilities should be evaluated from the actual data. The interaction weights can be obtained from a training data set providing proxies, or replicates, of the data interaction. The constraint (5.11) on the K interaction weights ν₀^(k) ensures that the resulting K fully conditioned posterior probabilities P*(A = k|D_1, …, D_n) are all licit probability estimates.
The applicability of the proposed ν₀^(k) inference paradigm is now tested using a large 3D reference ternary data set where all conditional probabilities involved in the nu expression (3.6) are known, including the exact fully data-conditioned probability P(A = k|D_i = d_i, i = 1, …, n). Various approximations of that reference probability can then be evaluated.
5.2.1 The reference data set
We start by generating a reasonably large 3D non-Gaussian field using the training image generator code [56] of the SGEMS software [66]. This code generates various geological structures using a non-iterative, unconditional Boolean simulation [36]. For this data set, we generated a ternary image with three mutually exhaustive categories: category 1 for mud, category 2 for channel sand, and category 3 for fractures. This 3D field is of size 100 × 100 × 50, comprising 500,000 nodes and yielding the reference categorical field shown in Figure 5.1.
Denote that reference field by S: {A(u) = 1, 2, 3, u ∈ S}, with P(A(u) = 1) = 0.67, P(A(u) = 2) = 0.23, and P(A(u) = 3) = 0.10, such that \( \sum_{k=1}^{3} P(A(u) = k) = 1.00 \); u denotes the location coordinates vector.
Figure 5.2 gives the reference indicator variograms in the x, y, z directions calculated
from indicator data from the top 35 layers of S for each of three categories k = 1,2,3;
the reason for excluding the bottom 15 layers will become apparent soon hereafter.
Those indicator variograms reflect the horizontal-to-vertical anisotropy of the original
categorical field.
Figure 5.1: Reference categorical image generated using a training image generator (the representation of the two categories A = 2 and A = 3 does not reflect their proportions).
5.2.2 The estimation configuration
Consider the evaluation of the conditional probability of an unsampled value A(u)=k,
given any combination of the following three multiple-point data events (Figure 5.3).
As seen from Figure 5.3, the closest data event D comprises four data locations at the
level just below that of A(u). These four data are at the corners of a square centered
on the projection of A(u) on their level. The next closest data event B comprises
also four data locations with the same geometry as for data event D, but located five
levels below that of A(u). The furthest away data event C again comprises four data
locations but located 15 levels below that of A(u).
If the unsampled location u of A(u) spans only the eroded field S0 = {x = 11, …, 90; y = 11, …, 90; z = 16, …, 50}, then each value A(u) can be evaluated by any of the three data events D, B, C. From here on, all statistics will refer to that "common denominator" field S0 comprising 224,000 nodes. Over that
Figure 5.2: Exhaustive indicator variograms in the x, y and z directions (horizontal EW, horizontal NS, vertical), calculated over the 35 top layers for k = 1, 2, 3; the indicator means are p = 0.636 (k = 1), p = 0.249 (k = 2), and p = 0.115 (k = 3).
Figure 5.3: Data events definition: (1) one data event D, (2) two data events D, B, (3) three data events D, B, C.
central field S0, the marginal statistics (prior proportions) are: P(A = 1) = 0.636, P(A = 2) = 0.249, and P(A = 3) = 0.115, with \( \sum_{k=1}^{3} P(A = k) = 1 \).
Figure 5.4: (1) Spatial distribution, (2) histogram of the conditional proportions P(A(u) = 1|D, B, C) defined over the reference eroded volume S0, and (3) the reference categorical field with respective proportions.
• two data events-conditioned: P(A(u) = k|D, B), P(A(u) = k|D, C), P(A(u) = k|B, C), for all k.
• all three data events-conditioned: P(A(u) = k|D, B, C).
• the estimated probability P*(A(u) = k|D, B, C) using the ν₀ model (3.6) to combine the previous single data-event conditioned probabilities.
Again, the mean of these proportions will be equal to that of the reference categorical field, as expected, since for all the above proportions we scan the same S0-volume.
When the model ν₀^(k) = 1 is used, at each location u the estimate of the fully conditioned posterior probability is:
\[ P^*(A(u) = k \mid D = d(u),\ B = b(u),\ C = c(u)) = \frac{1}{1 + x^{*(k)}(u)} \quad (5.18) \]
with the estimated distance x*(k)(u) being such that:
\[ \frac{x^{*(k)}(u)}{x_0^{(k)}} = \frac{x_D^{(k)}(u)}{x_0^{(k)}} \cdot \frac{x_B^{(k)}(u)}{x_0^{(k)}} \cdot \frac{x_C^{(k)}(u)}{x_0^{(k)}} \]
where \( x_0^{(k)} = \frac{1 - P(A = k)}{P(A = k)} \) is the marginal distance, with
\[ x_0^{(1)} = \frac{1 - 0.6356}{0.6356} = 0.5733, \quad x_0^{(2)} = \frac{1 - 0.2490}{0.2490} = 3.0161, \quad x_0^{(3)} = \frac{1 - 0.1154}{0.1154} = 7.6655 \]
and \( x_D^{(k)}(u) = \frac{P(A(u) \neq k \mid D = d(u))}{P(A(u) = k \mid D = d(u))} \) is the distance to A(u) = k updated by the sole data event D = d(u).
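These marginal distances follow directly from the prior proportions of the eroded field; a quick numerical check:

```python
def marginal_distance(p):
    """Marginal distance x0^(k) = (1 - P(A = k)) / P(A = k)."""
    return (1.0 - p) / p

# Prior proportions of the eroded ternary field S0:
priors = {1: 0.6356, 2: 0.2490, 3: 0.1154}
x0 = {k: marginal_distance(p) for k, p in priors.items()}

assert abs(x0[1] - 0.5733) < 1e-3
assert abs(x0[3] - 7.6655) < 1e-3
```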
The distance x_D^(k)(u) is obtained by scanning the reference image S0 with the template definition shown in Figure 5.3(1) for the proportion of D-replicates identifying the data values combination d(u). Our estimation paradigm assumes that all elementary conditional probabilities P(A = k|D), P(A = k|B), P(A = k|C) are known. This study addresses only the problem of combining these elementary probabilities into an estimate of the fully conditioned probability P(A = k|D, B, C) while accounting for data interaction.
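The combination recipe of (5.18) can be sketched in a few lines of code. This is a minimal illustration of the nu expression x = ν0 · (x_D x_B x_C)/(x_0)^2; the function names are our own and the code is a sketch, not taken from this study:

```python
def distance(p):
    """Distance associated with a probability: x = (1 - p) / p."""
    return (1.0 - p) / p

def nu_model(p_marginal, p_singles, nu0=1.0):
    """Combine single-event conditioned probabilities (e.g. P(A|D), P(A|B), P(A|C))
    with the marginal P(A) via x = nu0 * prod(x_i) / x0**(n-1),
    then return the combined estimate P* = 1 / (1 + x)."""
    x0 = distance(p_marginal)
    x = nu0
    for p in p_singles:
        x *= distance(p)
    x /= x0 ** (len(p_singles) - 1)
    return 1.0 / (1.0 + x)

# With nu0 = 1 (no data interaction) and every elementary probability equal to
# the marginal, the combined estimate falls back to the marginal itself.
p = nu_model(0.6356, [0.6356, 0.6356, 0.6356])
```

Note that for a single category with ν0 = 1 the estimate always lies in [0, 1], consistent with the behaviour reported below for Figure 5.5.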
Similarly, from the training image one can retrieve the other two elementary distances x_B^(k)(u) and x_C^(k)(u). The ν0 = 1 model (5.18) then provides an estimate of the fully conditioned probability P*(A(u) = k|D, B, C). For example, the spatial distribution and histogram of the estimates P*(A(u) = 1|D, B, C) using the ν0 = 1 model are shown in Figure 5.5(1).
These estimates are necessarily valued in the interval [0, 1]: their spatial mean is 0.635 with spatial variance 0.010. Their histogram is given in Figure 5.5(2). Comparing Figure 5.5 to Figure 5.4 shows that the bias of the ν0 = 1 estimates relative to
Figure 5.5: (1) Spatial distribution, (2) histogram of the conditional probabilities P*(A(u) = 1|D, B, C) estimated with the model ν0 = 1, and (3) categorical reference field with respective proportions.
the reference probabilities is small (compare their spatial mean of 0.635 and 0.636).
Note, however, that the estimates based on the ν0 = 1 model have the lesser spatial variance (0.010 < 0.084) due to the estimation smoothing effect.
The histogram and scattergram of the error, defined as

P*(A(u) = 1|D, B, C) − P(A(u) = 1|D, B, C),

are shown in Figure 5.6. The correlation between the local probability estimate and the actual true proportion is low (ρ = 0.34), leaving room for finding a better data interaction parameter ν0^(k) different from 1.
Note that the ν0 = 1 model cannot be extended to all k = 1, 2, 3. For example, we calculated the estimates P*(A = 2|D, B, C) and P*(A = 3|D, B, C) using the nu parameter values ν0^(2) = 1 and ν0^(3) = 1. Figure 5.7 shows the histogram of the sum
[Figure 5.6 plot area: (1) histogram of the ν0 model error with mean 0.0002 and variance 0.0749; (2) scatterplot of estimate (m = 0.635, σ² = 0.010) versus reference (m = 0.636, σ² = 0.084).]
Figure 5.6: (1) Histogram of error P*(A = 1|D, B, C) − P(A = 1|D, B, C) and (2) the corresponding scatterplot of estimate P*(A = 1|D, B, C) versus reference P(A = 1|D, B, C).
138 CHAPTER 5. APPLICATION TO NON-BINARY DATA
[Figure 5.7 plot area: histogram of the summed estimates over 224,000 values; mean 1.0003, maximum 1.0962, upper quartile 1.0029, median 0.9996, lower quartile 0.9996, minimum 0.9207.]
Figure 5.7: Histogram of Σ_{k=1}^{3} P*(A = k | D, B, C). The ν0^(k) = 1 model cannot be extended to all categories k since the mean of Σ_{k=1}^{3} P*(A = k | D, B, C) is greater than 1, which contradicts the general law of probabilities.
of the estimates Σ_{k=1}^{3} P*(A(u) = k|D, B, C) resulting from the model ν0^(k) = 1 for all k. The spatial mean is equal to 1.00034, which is slightly greater than one. Out of the 224,000 estimated values, 104,392 (that is, about a half) were outside the required interval [0, 1].
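The violation just described is easy to reproduce: applying the single-category combination independently for each k does not force the K estimates to sum to one. The sketch below uses hypothetical input values of our own choosing, purely to illustrate the mechanism:

```python
def nu_estimate(p_marginal, p_singles, nu0=1.0):
    """Single-category nu-model estimate P* = 1 / (1 + x),
    with x = nu0 * prod((1 - p) / p) / x0**(n-1)."""
    x0 = (1.0 - p_marginal) / p_marginal
    x = nu0
    for p in p_singles:
        x *= (1.0 - p) / p
    x /= x0 ** (len(p_singles) - 1)
    return 1.0 / (1.0 + x)

# Three categories; each data event's probabilities sum to 1 across k,
# yet the per-category combined estimates need not.
marginals = [0.6356, 0.2490, 0.1154]
by_event = [[0.7, 0.2, 0.1],            # P(A = k | D), hypothetical
            [0.5, 0.3, 0.2],            # P(A = k | B), hypothetical
            [0.6356, 0.2490, 0.1154]]   # P(A = k | C), set to the marginals
estimates = [nu_estimate(marginals[k], [ev[k] for ev in by_event])
             for k in range(3)]
total = sum(estimates)  # generally differs from 1
```

Each individual estimate is a valid probability, but their sum is not constrained to 1; this is exactly the inconsistency that the constrained ν0^(3) of Section 5.2.4 is designed to remove.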
In the previous runs, the single data event-conditioned probabilities P(A(u) = k|D), P(A(u) = k|B), and P(A(u) = k|C) were set equal to the corresponding proportions over S0. Each data event D, B or C can take 3^4 = 81 possible combinations of data values. This small number of possible data value combinations ensures that most likely all such 81 combinations are present in the training image with a number of replicates greater than 10. This in turn ensures that the spatial estimates P*(A = k|D, B, C), such as the one given in Figure 5.5 based on the ν0 = 1 model,
are statistically significant. However, the statistical significance of the fully conditioned proportion P(A = k|D, B, C) shown in Figure 5.4 could be questioned. The statistics shown in Figures 5.5 and 5.6 pool together the 224,000 estimated conditional probabilities over S0, irrespective of the actual values of the conditioning data and the corresponding number of replicates. Note that there are 4 × 3 = 12 categorical (K = 3) indicator data grouped four by four into the three data events D, B, and C; therefore there is a total of K^12 = 3^12 = 531,441 possible data value combinations. Out of these 531,441 possible combinations, 475,271 (89%) were not found in S0, and of the remaining 56,170 only 195 combinations were found with at least 10 replicates. To ensure statistical significance we retain only these 195 data combinations for further analysis.
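The replicate filtering described above amounts to counting how often each 12-value data combination occurs over the scanned volume and retaining only the well-informed ones. A minimal sketch (the function name and toy data are our own, purely illustrative):

```python
from collections import Counter

def significant_combinations(combos, min_replicates=10):
    """Count replicates of each data-value combination (here a tuple of the
    12 concatenated indicator values of D, B, C) and keep only combinations
    observed at least `min_replicates` times."""
    counts = Counter(map(tuple, combos))
    return {c: n for c, n in counts.items() if n >= min_replicates}

# With K = 3 categories and 12 indicator data there are 3**12 possible combinations.
n_possible = 3 ** 12  # 531,441
```

In the case study this filter reduces the 56,170 combinations actually observed in S0 to the 195 used for all subsequent comparisons.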
5.2.4 Determining the ν0^(k) to ensure consistent probabilities
As was shown in Figure 5.7, the model ν0^(k) = 1 cannot be extended to all k = 1, 2, 3 since it may lead to inconsistent probabilities. To ensure consistency, the K weights ν0^(k) must verify the single but non-linear relation (5.11). Consider the model ν0^(1) = ν0^(2) = 1, where the two first nu parameters are set to 1, indicating no data interaction when evaluating A = 1 and A = 2. The third weight ν0^(3) is determined using the
The CI estimate (5.21) and the nu-model estimated probability P*(A = k|D, B, C) as given by (5.20) are plotted against the S0-exact proportion P(A = k|D, B, C) in Figure 5.8 for each category k = 1, 2, 3.
In the presence of strong data interaction (essentially between the two close-by data events D, B and the unknown event A), the no-data-interaction model ν0^(1) = ν0^(2) = 1 significantly outperforms the estimator based on the conditional independence assumption. That can be seen from Figure 5.8 (bottom), which shows that for the 195 retained data value combinations the correlation coefficients based on ν0^(1) = 1 and ν0^(2) = 1 are both equal to 0.67, while for the CI estimator these coefficients are only 0.14 and 0.34 for k = 1 and k = 2, respectively (Figure 5.8, top panels). Also, for k = 1, the CI estimator leads to illicit probabilities, i.e. probabilities greater than one (Figure 5.8, top left panel). This inconsistency comes from the fact that, out of the 195 retained data value combinations, the category k = 1 is more likely to be present in the training image than the other two categories, which makes the conditional independence assumption more likely to be invalid when evaluating P(A = 1|D, B, C) than for the other two categories. For k = 3, the constrained ν0^(3) model also significantly outperforms the estimator defined by conditional independence, with a correlation coefficient equal to 0.43 versus 0.18 (Figure 5.8, top and bottom right panels).
Table 5.1 also shows that the estimator defined by the nu model allows for a better reproduction of the spatial mean and variance of the reference across the 195 data value combinations. The conditional independence estimator tends to underestimate the spatial mean and to overestimate the spatial variance.
"S" a. i u S 0.8
5 0.6 s o 53 °-4
k=l k=2 k=3
o u
• ;
f
H
f * 1
4 * : • •
: .;
••
•.! •* * :• " * - • •
••,*.r . • v< • t * •
• . *. * •
' P=0-;14 :
*. . . . . ,«.* . . . . . .?
* • * » • • A.. , « V , * * •
.*V-"rV/ • » • .. •
. . ** . . * .i».... . ..
*
p40.34
.....; ? . ^ w ; . •.^•..•| ,.
0.3 0.4 0.5 0.6 0.7 0.8
P=o.i8 ';
J | • i i i
• • * •
• • . *
V*,-A
. . j . . . ; i . r 1 ' i M • i '
0.1 0.2 0.3 0.4 0.5 0.1 0.15 0.2 0.25 0 J 0.35
0.8 -
del
o *=«-— II -> 0.5-
0.4 -
9:
,
•
w .
=o,67 ; : ; , : .: :. : :•
</ •: .-.'^r-" ' . ». . : « . • • : . . •
• . • . . . f
« ' • * ' : '•• • * • « - • • • - . " - • »
.: «, . : . : •
•• i . 1 • i 1 1 1 i 0.3 0.4 0.5 0.6 0.7 0.8
reference
0.5 -
0.4 -
0 .3 -
0.2-
0.1 •
p=0.67
' . • . . . . •:
- . - . , v ; • : . - • •
- p=0.43
0.1 0.2 0.3 0.4 0.5
reference
1 1 i . 1 1 1 1 . 1 1 1 1 1 1 1 1 1 1 .
0.1 0.15 0.2 0.25 0.3 0.35
reference
Figure 5.8: Top: the reference proportion P(A = k|D, B, C) (x-axis) versus the estimated proportion P(A = k|D, B, C) based on the conditional independence assumption (y-axis) for k = 1, 2, 3. Bottom: the reference P(A = k|D, B, C) (x-axis) versus the estimated P(A = k|D, B, C) based on the ν0^(k) model (y-axis) for k = 1, 2, 3.
k = 1                      mean    variance
reference                  0.61    0.0161
ν0^(1) = 1                 0.66    0.0078
conditional independence   0.54    0.0574

k = 2                      mean    variance
reference                  0.24    0.0117
ν0^(2) = 1                 0.22    0.0073
conditional independence   0.18    0.0174

k = 3                      mean    variance
reference                  0.15    0.0030
ν0^(3)                     0.12    0.0002
conditional independence   0.096   0.0023

Table 5.1: Summary statistics for k = 1, 2, 3: spatial means and variances of reference conditional proportions and of approximations defined by the nu model and by the conditional independence assumption.
The ν0^(k) model based on equation (5.20) ensures consistency of the estimated probabilities.

Figure 5.9: Each axis represents the scores m^(1), m^(2), m^(3), respectively. Note, the available 195 data value combinations cluster into only 11 points in the 3D m^(k) space. The resulting classification of these 11 points is shown by different geometric shapes.
classes of similar scores by performing, for example, a k-means algorithm [59] which partitions the points into classes. Each training class prototype ν0^(k) value is the average of the proxy ν0^(k) values falling into that class. Figure 5.9 shows such a classification. Each axis represents the scores m^(1), m^(2), m^(3), respectively. Note, the available 195 data value combinations cluster into only 11 points in the 3D m^(k) space. The resulting classification of these 11 points is shown by different geometric shapes. In this case, there are four classes marked by stars, triangles, diamonds and squares, respectively.
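The classification step above can be sketched with a plain k-means followed by per-class averaging of the proxy nu values. The function names and toy points below are our own illustrative assumptions, not the 11 score points of Figure 5.9:

```python
import random

def assign(points, centers):
    """Label each point with the index of its nearest center (squared Euclidean)."""
    return [min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points]

def kmeans_labels(points, k, iters=50, seed=0):
    """Plain k-means on the 3D score points m = (m1, m2, m3); returns class labels."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign(points, centers)

def class_prototype_nu(proxy_nu, labels, k):
    """Prototype nu value per class = average of the proxy nu values in that class."""
    protos = {}
    for c in range(k):
        vals = [v for v, lab in zip(proxy_nu, labels) if lab == c]
        if vals:
            protos[c] = sum(vals) / len(vals)
    return protos
```

A new set of conditioning data is then mapped to its nearest class center and inherits that class's prototype ν0 value.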
For each set of actual data values (i.e. of conditioning data values), we then look for the training class (out of four possible) that is closest to the actual data statistics, and use the corresponding class prototype ν0^(k) value (instead of ν0^(k) = 1) for building the fully conditioned probability P(A = k|D, B, C), to ensure consistent probabilities. Had we used the original exact training ν0^(k) values without any classification and consequent averaging, the consistency relation (5.23) would have been met exactly.
Figure 5.10: Scattergram of reference proportion P(A = k|D, B, C) along the x-axis versus estimate P*(A = k|D, B, C) based on the classified ν0^(k) model for k = 1 (left), k = 2 (center), k = 3 (right) along the y-axis. The highest correlations are attributed to the categories k = 1 and k = 2. The smallest correlation is attributed to category k = 3, which has less spatial structure than the other two categories.
5.2.6 Inference robustness
To test the robustness of the proposed inference paradigm, we now consider a training data set different from the reference data set: the reference data set provides the conditioning data and the exact conditional proportions, while the training data set provides the proxy interaction nu parameter values.

For this purpose we draw 50 new realizations, S0^(l), l = 1, ..., 50, utilizing the same image generator code [56] previously used to draw the reference data set. These 50 training images are again of size 100 × 100 × 50. The averages of the 50 eroded realizations' (S0^(l)) means are given in Table 5.3 for k = 1, 2, 3. The corresponding reference means over S0 are also shown in that table.
k = 1                                mean    variance
reference                            0.61    0.0161
classified ν0^(1) model              0.60    0.0107

k = 2                                mean    variance
reference                            0.24    0.0117
classified ν0^(2) model              0.22    0.0073

k = 3                                mean    variance
reference                            0.15    0.0030
constrained classified ν0^(3)        0.19    0.0011

Table 5.2: Summary statistics: spatial means and variances of reference conditional probabilities and of estimates based on a classified nu representation.
The single datum event-conditioned probabilities P(A|B), P(A|C), P(A|D) are retrieved from the reference data set S0 shown in Figure 5.4. The proxy ν0^(k) values are retrieved from all 50 training realizations (S0^(l)) pooled into a single inference pool.
        training   reference
k = 1   0.665      0.636
k = 2   0.222      0.240
k = 3   0.113      0.115

Table 5.3: The average means of the 50 eroded training data sets for k = 1, 2, 3. For comparison, the right column shows the reference means.
In Table 5.4, we compare the spatial mean and variance of the 195 reference probability values P(A|D, B, C) to the corresponding spatial statistics of the estimate based on the nu representation. We define the classified proxy ν0^(k) model as the model based on the 50 training images, all different from the reference data set. In Table 5.5, the correlation coefficients of the 195 reference probability values P(A|D, B, C) with the estimate based on the nu representation are given for k = 1, 2, 3.
k = 1                                  mean    variance
reference                              0.61    0.0161
classified proxy ν0^(1) model          0.61    0.0092

k = 2                                  mean    variance
reference                              0.24    0.0117
classified proxy ν0^(2) model          0.22    0.0072

k = 3                                  mean    variance
reference                              0.15    0.0030
constrained classified proxy ν0^(3)    0.17    0.0005

Table 5.4: Summary statistics: spatial means and variances of the 195 reference conditional probabilities and of the estimates built from a nu representation.
Comparing these two tables with Figure 5.10 and Table 5.2, the classified proxy ν0^(k) model appears quite robust since the two tables provide very similar results. For example, where Table 5.4 tends to overestimate (underestimate) the spatial statistics, a similar trend can be observed in Table 5.2. This allows us to conclude that, no matter how approximate the training image is, it significantly improves on the results provided by the ν0 = 1 model, since such a training image provides insight into the data interaction seen in the study field.
                           k = 1   k = 2   k = 3
classified proxy ν0^(k)    0.69    0.66    0.52

Table 5.5: Correlations of the 195 reference proportion values P(A|D, B, C) with estimates based on a classified nu representation.
The key lesson learned from these case studies is that we must check the assumptions underlying any model (whether it is a no-data-interaction model or a model based on data independence). The applicability of each model ultimately depends on the physics of the data. For example, if conditional independence is inappropriate (as is often the case in geology-related applications), it should not be imposed for mere convenience, as doing so might lead to large bias and various order relation violations.
Chapter 6
Summary and conclusions
6.1 Summary of major theoretical developments
This thesis addresses the problem of integrating diverse data sources while accounting for the interaction between these data. We consider n data events D_1, ..., D_n that inform the same unknown event A. Each of the n + 1 events can be very complex, involving multiple locations in time and/or space. As an example, the unknown A could be indicative of the presence of channel sand connecting two wells. Data event D_1 is the indicator of facies at these two wells. Data event D_2 is the result of a seismic survey providing "soft" probabilities of the presence of channel at or around the same two wells.
In this thesis, we assume that each of the n + 1 events had been previously processed, providing the following probabilities:

1. the prior probability P(A = a), available from historic data, and

2. the datum-specific conditional probabilities P(A = a|D_i = d_i). Each of these probabilities captures the specific information about the unknown event A brought by the datum event D_i taken alone. This step is crucial. Many algorithms exist to process the information brought by a single individual data event into such conditioned probabilities P(A = a|D_i = d_i), e.g. indicator kriging [34], various
regressions including neural networks [3], [40], and [41], among others. However, the task of obtaining the probabilities P(A = a|D_i = d_i) is outside the scope of this thesis.
The goal of this thesis is to combine the prior probability P(A = a) and the n individually conditioned probabilities P(A = a|D_i = d_i) into an estimate or model of the fully conditioned probability P(A = a|D_1 = d_1, ..., D_n = d_n):