Nat. Hazards Earth Syst. Sci., 14, 2605–2626, 2014
www.nat-hazards-earth-syst-sci.net/14/2605/2014/
doi:10.5194/nhess-14-2605-2014
© Author(s) 2014. CC Attribution 3.0 License.

Bayesian network learning for natural hazard analyses

K. Vogel1,*, C. Riggelsen2, O. Korup1, and F. Scherbaum1

1Institute of Earth and Environmental Sciences, University of Potsdam, Germany
2Pivotal Software Inc., Palo Alto, USA
*Invited contribution by K. Vogel, recipient of the Outstanding Student Poster (OSP) Award 2012.

Correspondence to: K. Vogel ([email protected])

Received: 14 August 2013 – Published in Nat. Hazards Earth Syst. Sci. Discuss.: 22 October 2013
Revised: 26 June 2014 – Accepted: 19 August 2014 – Published: 29 September 2014
Abstract. Modern natural hazards research requires dealing with several uncertainties that arise from limited process knowledge, measurement errors, censored and incomplete observations, and the intrinsic randomness of the governing processes. Nevertheless, deterministic analyses are still widely used in quantitative hazard assessments despite the pitfall of misestimating the hazard and any ensuing risks.

In this paper we show that Bayesian networks offer a flexible framework for capturing and expressing a broad range of uncertainties encountered in natural hazard assessments. Although Bayesian networks are well studied in theory, their application to real-world data is far from straightforward, and requires specific tailoring and adaptation of existing algorithms. We offer suggestions as to how to tackle frequently arising problems in this context and mainly concentrate on the handling of continuous variables, incomplete data sets, and the interaction of both. By way of three case studies from earthquake, flood, and landslide research, we demonstrate the method of data-driven Bayesian network learning, and showcase the flexibility, applicability, and benefits of this approach.

Our results offer fresh and partly counterintuitive insights into well-studied multivariate problems of earthquake-induced ground motion prediction, accurate flood damage quantification, and spatially explicit landslide prediction at the regional scale. In particular, we highlight how Bayesian networks help to express information flow and independence assumptions between candidate predictors. Such knowledge is pivotal in providing scientists and decision makers with well-informed strategies for selecting adequate predictor variables for quantitative natural hazard assessments.
1 Introduction
Natural hazards such as earthquakes, tsunamis, floods, landslides, or volcanic eruptions have a wide range of differing causes, triggers, and consequences. Yet the art of predicting such hazards essentially addresses very similar issues in terms of model design: the underlying physical processes are often complex, while the number of influencing factors is large. The single and joint effects of the driving forces are not always fully understood, which introduces a potentially large degree of uncertainty into any quantitative analysis. Additionally, observations that form the basis for any inference are often sparse, inaccurate, and incomplete, adding yet another layer of uncertainty. For example, Merz et al. (2013) point out the various sources of uncertainty (scarce data, poor understanding of the damaging process, etc.) in the context of flood damage assessments, while Berkes (2007) calls attention to the overall complexity of human–environment systems, as well as the importance of understanding underlying uncertainties to improve resilience. Similarly, Bommer and Scherbaum (2005) discuss the importance of capturing uncertainties in seismic hazard analyses to balance between investments in provisions of seismic resistance and possible consequences in the case of insufficient resistance.

Nevertheless, deterministic approaches are still widely used in natural hazards assessments. Such approaches rarely provide information on the uncertainty related to parameter estimates beyond the use of statistical measures of dispersion such as standard deviations or standard errors about empirical means. However, uncertainty is a carrier of information to the same extent as a point estimate, and ignoring it or dismissing it as simply an error may entail grave consequences. Ignoring uncertainties in quantitative hazard appraisals may have disastrous effects, since it often leads to over- or underestimates of certain event magnitudes. Yet deterministic approaches persist as the state of the art in many applications. For example, tsunami early warning systems evaluate pre-calculated synthetic databases and pick out the scenario that appears closest to a given situation in order to estimate its hazard (Blaser et al., 2011). Recently developed models for flood damage assessments use classification approaches, where the event under consideration is assigned to its corresponding class, and the caused damage is estimated by taking the mean damage of all observed events belonging to the same class (Elmer et al., 2010). In seismic hazard analysis the usage of regression-based ground motion models is common practice, restricting the model to the chosen functional form, which is defined based on physical constraints (Kuehn et al., 2009).

Published by Copernicus Publications on behalf of the European Geosciences Union.
In this paper we consider Bayesian networks (BNs), which we argue are an intuitive, consistent, and rigorous way of quantifying uncertainties. Straub (2005) underlines the large potential of BNs for natural hazard assessments, heralding not only the ability of BNs to model various interdependences but also their intuitive format: the representation of (in)dependences between the involved variables in a graphical network enables improved understanding and direct insights into the relationships and workings of a natural hazard system. The conditional relationships between dependent variables are described by probabilities, from which not only the joint distribution of all variables but any conditional probability distribution of interest can be derived. BNs thus support quantitative analyses of specific hazard scenarios or process-response chains.

In recent years, BNs have been used in avalanche risk assessment (e.g., Grêt-Regamey and Straub, 2006), tsunami early warning (e.g., Blaser et al., 2009, 2011), earthquake risk management (e.g., Bayraktarli and Faber, 2011), probabilistic seismic hazard analysis (e.g., Kuehn et al., 2011), and earthquake-induced landslide susceptibility (e.g., Song et al., 2012). Aguilera et al. (2011) give an overview of applications of BNs in the environmental sciences between 1990 and 2010, and conclude that the potential of BNs remains underexploited in this field. This is partly because, even though BNs are well studied in theory, their application to real-world data is not straightforward. Handling of continuous variables and incomplete observations remains the key problem. This paper aims to overcome these challenges. Our objective is to briefly review the technique of learning BNs from data, and to suggest possible solutions to implementation problems that derive from the uncertainties mentioned above. We use three examples of natural hazard assessments to discuss the demands of analyzing real-world data, and highlight the benefits of applying BNs in this regard.
In our first example (Sect. 3), we develop a seismic ground motion model based on a synthetic data set, which serves to showcase some typical BN properties. In this context we demonstrate a method to deal with continuous variables without any prior assumptions on their distributional family. In Sect. 4 we use data that were collected after the 2002 and 2005/2006 floods in the Elbe and Danube catchments, Germany, to learn a BN for flood damage assessments. This example is emblematic of situations where data are incomplete, and requires a treatment of missing observations, which can be challenging in combination with continuous variables. Our final example in Sect. 5 deals with a regional landslide susceptibility model for Japan, where we investigate how the same set of potential predictors of slope stability may produce nearly equally well performing, though structurally different, BNs that reveal important and often overlooked variable interactions in landslide studies. This application further illustrates the model uncertainty related to BN learning.

Figure 1. The BN for the burglary example. The graph structure illustrates the dependence relations of the involved variables: the alarm can be triggered by a burglary or an earthquake. An earthquake might be reported in the radio newscast. The joint distribution of all variables can be decomposed into the product of its conditionals accordingly: P(B, E, A, R) = P(B)P(E)P(A|B, E)P(R|E).
2 Bayesian networks (BNs)
The probabilistic framework of BNs relies on the theorem formulated by Reverend Thomas Bayes (1702–1761), and expresses how to update probabilities in light of new evidence (McGrayne, 2011). By combining probability theory with graph theory, BNs depict probabilistic dependence relations in a graph: the nodes of the graph represent the considered random variables, while (missing) edges between the nodes illustrate the conditional (in)dependences between the variables. Textbooks often refer to the burglary alarm scenario for a simple illustration of BNs (Pearl, 1988). In this example, the alarm of your home may not only be triggered by burglary but also by earthquakes. Moreover, earthquakes have a chance to be reported in the news. Figure 1 shows the dependence relations of these variables as captured by a BN. Now, imagine you get a call from your neighbor notifying you that the alarm went off. Supposing the alarm was triggered by burglary, you drive home. On your way home you hear the radio reporting a nearby earthquake. Even though burglaries and earthquakes may be assumed to occur independently, the radio announcement changes your belief in the burglary, as the earthquake "explains away" the alarm. BNs
Table 1. Conditional probabilities in the burglary example, giving the conditional probabilities for earthquake (e), burglary (b), alarm (a), and earthquake reported (r). The parameters that define the conditional distributions correspond for discrete variables to the conditional (point) probabilities. Note that the conditional probability values for no earthquake (ē), no burglary (b̄), etc. can be derived from the fact that the conditionals sum up to 1.

θ_e = p(e) = 0.001          θ_{a|e,b} = p(a|e, b) = 0.98
θ_b = p(b) = 0.01           θ_{a|ē,b} = p(a|ē, b) = 0.95
θ_{r|e} = p(r|e) = 0.95     θ_{a|e,b̄} = p(a|e, b̄) = 0.95
θ_{r|ē} = p(r|ē) = 0.001    θ_{a|ē,b̄} = p(a|ē, b̄) = 0.03
offer a mathematically consistent framework to conduct and specify reasonings of this kind. A detailed introduction to BNs is provided in Koller and Friedman (2009) and Jensen and Nielsen (2001), while Fenton and Neil (2012) offers easy and intuitive access. In this paper we restrict ourselves to several key aspects of the BN formalism.
2.1 Properties and benefits
Applying BNs to natural hazard assessments, we define the specific variables of the hazard domain to be the nodes in a BN. In the following we denote this set of random variables as X = {X1, ..., Xk}. The dependence relations between the variables are encoded in the graph structure, generating a directed acyclic graph (DAG). The directions of the edges define the flow of information, but do not necessarily indicate causality. As we shall see in subsection "Learned ground motion model" of Sect. 3.2, it may prove beneficial to direct edges counterintuitively in order to fulfill regularization constraints. The set of nodes from which edges are directed to a specific node, Xi, is called the parent set, XPa(i), of Xi (see Fig. 2). Table 2 summarizes the notations used in this paper.
Apart from the graph structure, a BN is defined by conditional probabilities that specify the dependence relations encoded in the graph structure. The conditional probability distribution for each variable, Xi, is given conditioned on its parent set: p(Xi|XPa(i)). For simplification we restrict ourselves here to discrete variables, for which θ is the set of conditional (point) probabilities for each combination of states for Xi and XPa(i): θ = {θ_{xi|xPa(i)} = p(xi|xPa(i))}. The conditional probabilities for the burglary BN example are given in Table 1. For continuous variables, the design of the parameters depends on the family of distributions of the particular densities p(·|·).

Given the BN structure (DAG) and parameters (θ), it follows from the axioms of probability theory that the joint distribution of all variables can be factorized into a product of conditional distributions:

P(X|DAG, θ) = ∏_{i=1}^{k} p(Xi|XPa(i)).   (1)
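Equation (1) can be checked numerically for the burglary network: multiplying the four conditionals of Table 1 for every combination of states yields a properly normalized joint distribution. A minimal sketch, with states coded 0/1:

```python
from itertools import product

# conditionals of the burglary BN (Table 1)
p_e = {1: 0.001, 0: 0.999}
p_b = {1: 0.01, 0: 0.99}
p_r_given_e = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.001, 0: 0.999}}
p_a1_given_eb = {(1, 1): 0.98, (0, 1): 0.95, (1, 0): 0.95, (0, 0): 0.03}

def joint(b, e, a, r):
    """Eq. (1) for the burglary DAG: P(B)P(E)P(A|B,E)P(R|E)."""
    pa = p_a1_given_eb[(e, b)] if a else 1.0 - p_a1_given_eb[(e, b)]
    return p_b[b] * p_e[e] * pa * p_r_given_e[e][r]

# summing the product of conditionals over all 16 states gives 1
total = sum(joint(*s) for s in product((0, 1), repeat=4))
```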
Figure 2. Illustration of a parent set in a BN. XPa(i) is the parent set of Xi.
Further, applying Bayes' theorem, P(A|B) = P(A, B)/P(B) = P(B|A)P(A)/P(B), each conditional probability of interest can be derived. In this way a BN is characterized by many attractive properties that we may profit from in a natural hazard setting, including the following:
– Property 1 – graphical representation: the interactions of the variables of the entire "system" are encoded in the DAG. The BN structure thus provides information about the underlying processes and the way various variables communicate and share "information" as it is propagated through the network.

– Property 2 – use prior knowledge: the intuitive interpretation of a BN makes it possible to define the BN based on prior knowledge; alternatively it may be learned from data, or even a combination of the two (cast as a Bayesian statistical problem) by posing a prior BN and updating it based on observations (see below for details).

– Property 3 – identify relevant variables: by learning the BN from data we may identify the variables that are (according to the data) relevant; "islands" or isolated single unconnected nodes indicate potentially irrelevant variables.

– Property 4 – capture uncertainty: uncertainty can easily be propagated between any nodes in the BN; we effectively compute or estimate probability distributions rather than single-point estimates.

– Property 5 – allow for inference: instead of explicitly modeling the conditional distribution of a predefined target variable, the BN captures the joint distribution of all variables. Via inference, we can express any given or all conditional distribution(s) of interest, and reason in any direction (including forensic and inverse reasoning): for example, for a given observed damage we may infer the likely intensity of the causing event. A detailed example for reasoning is given in Sect. 4.3.
Table 2. Summary of notations used in this paper.

Notation – Meaning
Xi – a specific variable
xi – a realization of Xi
X = {X1, ..., Xk} – set of the considered variables
XPa(i) – parent set of Xi
xPa(i) – a realization of the parent set
X−Y – all variables but Y
DAG – directed acyclic graph (graph structure)
p(Xi|XPa(i)) – conditional probability of a variable conditioned on its parent set
θ_{xi|xPa(i)} – parameter that defines the probability for xi given xPa(i)
θ = {θ_{xi|xPa(i)}} – set of model parameters that defines the conditional distributions
Θ – random variable for the set of model parameters
BN: (DAG, θ) – Bayesian network, defined by the pair of structure and parameters
d – discrete/discretized data set that is used for BN learning
d^c – (partly) continuous data set that is used for BN learning
Λ – discretization that bins the original data d^c into d
XMB(i) – set of variables that form the Markov blanket of Xi (Sect. 4.2)
Ch(i) – variable indices of the children of Xi (Sect. 4.2)
Note that inference in BNs is closed under restriction, marginalization, and combination, allowing for fast (close to immediate) and exact inference.

– Property 6 – use incomplete observations: during predictive inference (i.e., computing a conditional distribution), incomplete observations of data are not a problem for BNs. By virtue of the probability axioms, it merely impacts the overall uncertainty involved.
In the following we will refer to these properties 1–6 in order to clarify what is meant. For "real-life" modeling problems, including those encountered in natural hazard analysis, adhering strictly to the BN formalism is often a challenging task. Hence, the properties listed above may seem unduly theoretical. Yet many typical natural hazard problems can be formulated around BNs by taking advantage of these properties. We take a data-driven stance and thus aim to learn BNs from collected observations.
2.2 Learning Bayesian networks
Data-based BN learning can be seen as an exercise in finding a BN which, according to the decomposition in Eq. (1), could have been "responsible for generating the data". For this we traverse the space of BNs (Castelo and Kocka, 2003) looking for a candidate maximizing a fitness score that reflects the "usefulness" of the BN. This should, however, be done with careful consideration of the issues always arising in the context of model selection, i.e., over-fitting, generalization, etc. Several suggestions for BN fitness scoring are derived from different theoretical principles and ideas (Bouckaert, 1995). Most of them are based on the maximum likelihood estimation for different DAG structures according to Eq. (1).
In this paper we opt for a Bayesian approach to learn BNs (note that BNs are not necessarily to be interpreted from a Bayesian statistical perspective). Searching for the most probable BN, (DAG, θ), given the observed data, d, we aim to maximize the BN MAP (Bayesian network maximum a posteriori) score suggested by Riggelsen (2008):

P(DAG, Θ|d) ∝ P(d|DAG, Θ) P(Θ, DAG),   (2)

where the left-hand side is the posterior and the right-hand side the product of likelihood and prior. The likelihood term decomposes according to Eq. (1). The prior encodes our prior belief in certain BN structures and parameters. This allows us to assign domain-specific prior preferences to specific BNs before seeing the data (Property 2) and thus to compensate for sparse data, artifacts, bias, etc. In the following applications we use a non-informative prior, which nevertheless fulfills a significant function. Acting as a penalty term, the prior regularizes the DAG complexity and thus avoids over-fitting. Detailed descriptions of the prior and likelihood terms are given in Appendix A1 and Riggelsen (2008).
The following section illustrates the BN formalism "in action" and will also underscore some theoretical and practical problems along with potential solutions in the context of BN learning. We will learn a ground motion model, which is used in probabilistic seismic hazard analysis, as a BN; the data are synthetically generated. Subsequently, we consider two other natural hazard assessments where we learn BNs from real-world data.
Table 3. Variables used in the ground motion model and the corresponding distributions used for the generation of the synthetic data set which is used for BN learning.

Xi – Description – Distribution [range]

Predictors:
M – moment magnitude of the earthquake – U[5, 7.5]
R – source-to-site distance – Exp[1 km, 200 km]
SD – stress released during the earthquake – Exp[0 bar, 500 bar]
Q0 – attenuation of seismic wave amplitudes in deep layers – Exp[0 s^−1, 5000 s^−1]
κ0 – attenuation of seismic wave amplitudes near the surface – Exp[0 s, 0.1 s]
VS30 – average shear-wave velocity in the upper 30 m – U[600 m s^−1, 2800 m s^−1]

Ground motion parameter:
PGA – horizontal peak ground acceleration – according to the stochastic model (Boore, 2003)
3 Seismic hazard analysis: ground motion models
When it comes to decision making on the design of high-risk facilities, the hazard arising from earthquakes is an important aspect. In probabilistic seismic hazard analysis (PSHA) we calculate the probability of exceeding a specified ground motion for a given site and time interval. One of the most critical elements in PSHA, often carrying the largest amount of uncertainty, is the ground motion model. It describes the conditional probability of a ground motion parameter, Y, such as (horizontal) peak ground acceleration, given earthquake- and site-related predictor variables, X−Y. Ground motion models are usually regression functions, where the functional form is derived from expert knowledge and the ground motion parameter is assumed to be lognormally distributed: ln Y = f(X−Y) + ε, with ε ~ N(0, σ²). The definition of the functional form of f(·) is guided by physical model assumptions about the single and joint effects of the different parameters, but also contains some ad hoc elements (Kuehn et al., 2011). Using the Bayesian network approach, no prior knowledge is required per se, but if present it can be accounted for by encoding it in the prior term of Eq. (2). If no reliable prior knowledge is available, we work with a non-informative prior, and the learned graph structure provides insight into the dependence structure of the variables and helps in gaining a better understanding of the underlying mechanism (Property 1). Modeling the joint distribution of all variables, X = {X−Y, Y}, the BN implicitly provides the conditional distribution P(Y|X−Y, DAG, Θ), which gives the probability of the ground motion parameter for specific event situations needed for the PSHA (Property 5).
3.1 The data
The event situation is described by the predictor variables X−Y = {M, R, SD, Q0, κ0, VS30}, which are explained in Table 3. We generate a synthetic data set consisting of
Figure 3. When working with continuous variables, we have to make assumptions about the functional form of the probability distributions (gray), e.g., (a) exponential, (b) normal, and (c) uniform. Thus we restrict the distributions to certain shapes that may not match reality. In contrast, using a discrete multinomial distribution (black), each continuous distribution can be approximated and we avoid prior restrictions on the shape. Rather, the shape is learned from the data by estimating the probability for each interval.
10 000 records. The ground motion parameter, Y, is the horizontal peak ground acceleration (PGA). It is generated by a so-called stochastic model, which is described in detail by Boore (2003). The basic idea is to distort the shape of a random time series according to physical principles and thus to obtain a time series with properties that match the ground-motion characteristics. The predictor variables are either uniformly (U) or exponentially (Exp) distributed within a particular interval (see Table 3).
The stochastic model does not have good analytical properties, and its usage is non-trivial and time consuming. Hence, surrogate models, which describe the stochastic model in a more abstract sense (e.g., regressions), are used in PSHA instead. We show that BNs may be seen as a viable alternative to the classical regression approach. However, before doing so, we need to touch upon some practical issues arising when learning BNs from continuous data.

For continuous variables we need to define the distributional family for the conditionals p(·|·) and thus make assumptions about the functional form of the distribution. To avoid such assumptions and "let the data speak", we discretize the continuous variables, thus allowing for
Figure 4. Representation of the dependency assumptions in the discretization approach: the dependency relations of the variables are captured by their discrete representations (gray-shaded area). A continuous variable, Xc_i, depends only on its discrete counterpart, Xi.
completely data-driven and distribution-free learning (see Fig. 3). In the following subsection we describe an automatic discretization, which is part of the BN learning procedure and takes the dependences between the single variables into account. However, the automatic discretization does not necessarily result in a resolution that matches the requirements for prediction purposes or decision support. To increase the potential accuracy of predictions, we approximate, once the network structure is learned, the continuous conditionals with mixtures of truncated exponentials (MTE), as suggested by Moral et al. (2001). More on this follows in Sect. 3.3.
3.2 Automatic discretization for structure learning
The range of existing discretization procedures differs in their course of action (supervised vs. unsupervised, global vs. local, top-down vs. bottom-up, direct vs. incremental, etc.), their speed, and their accuracy. Liu et al. (2002) provide a systematic study of different discretization techniques, while Hoyt (2008) concentrates on their usage in the context of BN learning. The choice of a proper discretization technique is anything but trivial, as the different approaches result in different levels of information loss. For example, a discretization conducted as a pre-processing step to BN learning does not account for the interplay of the variables and often misses information hidden in the data. To keep the information loss small, we use a multivariate discretization approach that takes the BN structure into account. The discretization is defined by a set of interval boundary points for all variables, forming a grid. All data points of the original continuous (or partly continuous) data set, d^c, that lie in the same grid cell correspond to the same value in the discretized data set, d. In a multivariate approach, the "optimal" discretization, denoted by Λ, depends on the structure of the BN and the observed data, d^c. Similar to Sect. 2.2, we again cast the problem in a Bayesian framework, searching for the combination of (DAG, θ, Λ) that has the highest posterior probability given the data,
Figure 5. For the discretization approach each multivariate continuous distribution (a) is characterized by a discrete distribution that captures the dependence relations (b) and a continuous uniform distribution over each grid cell (c). For exemplification assume we consider two dependent, continuous variables: $X^c_1$ and $X^c_2$. (a) shows a possible realization of a corresponding sample. According to Monti and Cooper (1998) we now assume that we can find a discretization, such that the resulting discretized variables $X_1$ and $X_2$ capture the dependence relation between $X^c_1$ and $X^c_2$. This is illustrated by (b), where the shading of the grid cells corresponds to their probabilities (which are defined by $\theta$). A darker color means that we expect more realizations in this grid cell. Further, we say that, within each grid cell, the realizations are uniformly distributed, as illustrated in (c).
$$\underbrace{P(\mathrm{DAG}, \Theta, \Lambda \mid d^c)}_{\text{posterior}} \;\propto\; \underbrace{P(d^c \mid \mathrm{DAG}, \Theta, \Lambda)}_{\text{likelihood}} \;\underbrace{P(\mathrm{DAG}, \Theta, \Lambda)}_{\text{prior}}. \qquad (3)$$
Let us consider the likelihood term: expanding on an idea by Monti and Cooper (1998), we assume that all communication/flow of information between the variables can be captured by their discrete representations (see Fig. 4) and is defined by the parameters $\theta$. Thus only the distribution of the discrete data $d$ depends on the network structure, while the distribution of the continuous data $d^c$ is, for given $d$, independent of the DAG (see Figs. 4 and 5). Consequently the likelihood for observing $d^c$ (for a given discretization, network structure and parameters) can be written as

$$P(d^c \mid \mathrm{DAG}, \Theta, \Lambda) = P(d^c \mid d, \Lambda)\, P(d \mid \mathrm{DAG}, \Theta, \Lambda) \qquad (4)$$

and Eq. (3) decomposes into

$$P(\mathrm{DAG}, \Theta, \Lambda \mid d^c) \propto \underbrace{P(d^c \mid d, \Lambda)}_{\text{continuous data}} \;\underbrace{P(d \mid \mathrm{DAG}, \Theta, \Lambda)}_{\text{likelihood (discrete)}} \;\underbrace{P(\mathrm{DAG}, \Theta, \Lambda)}_{\text{prior}}.$$
Nat. Hazards Earth Syst. Sci., 14, 2605–2626, 2014
www.nat-hazards-earth-syst-sci.net/14/2605/2014/
K. Vogel et al.: Bayesian network learning for natural hazard
analyses 2611
Figure 6. Theoretic BN for the ground motion model. It captures the known dependences of the data-generating model.
The likelihood (discrete) term is now defined as for the separate BN learning for discrete data (Sect. 2.2), and we use a non-informative prior again. For the continuous data, we assume that all continuous observations within the same interval defined by $\Lambda$ have the same probability (Fig. 5). More information about the score definition can be found in Appendix A1, and technical details are given in Vogel et al. (2012, 2013). In the following we discuss the BN and discretization learned from the synthetic seismic data set.
Learned ground motion model
Since we generated the data ourselves, we know which (in)dependences the involved variables should adhere to; this is expected to be reflected in the BN DAG we learn from the synthetic data (Properties 1, 3). Due to data construction, the predictor variables M, R, SD, Q0, κ0, and VS30 are independent of each other and PGA depends on the predictors. Figure 6 shows the dependence structure of the variables. The converging edges at PGA indicate that the predictors become conditionally dependent for a given PGA. This means that, for a given PGA, they carry information about each other; for example, for an observed large PGA value, a small stress drop indicates a close distance to the earthquake. The knowledge about the dependence relations gives the opportunity to use the seismic hazard application for an inspection of the BN learning algorithm regarding the reconstruction of the dependences from the data, which is done in the following.

The network that we found to maximize $P(\mathrm{DAG}, \Theta, \Lambda \mid d^c)$ for the 10 000 synthetic seismic data records is shown in Fig. 7. The corresponding discretization that was found is plotted in Fig. 8, which shows the marginal distributions of the discretized variables. The learned BN differs from the original one, mainly due to regularization constraints, as we will explain in the following: as mentioned in Sect. 2, the joint distribution
Figure 7. BN for the ground motion model learned from the generated synthetic data. It captures the most dominant dependences. Less distinctive dependences are neglected for the sake of parameter reduction.
of all variables can be decomposed into the product of the conditionals according to the network structure (see Eq. 1). For discrete/discretized variables, the number of parameters needed for the definition of $p(X_i \mid X_{\mathrm{Pa}(i)})$ in Eq. (1) corresponds to the number of possible state combinations for $(X_i, X_{\mathrm{Pa}(i)})$. Taking the learned discretization shown in Fig. 8, the BN of the data-generating process (Fig. 6) is defined by 3858 parameters, 3840 needed alone for the description of $p(\mathrm{PGA} \mid M, R, \mathrm{SD}, Q_0, \kappa_0, V_{S30})$. A determination of that many parameters from 10 000 records would lead to a strongly over-fitted model. Instead we learn a BN that compromises between model complexity and its ability to generate the original data. The BN learned under these requirements (Fig. 7) consists of only 387 parameters and still captures the most relevant dependences.
Figure 9 shows the ln PGA values of the data set plotted against the single predictors. A dependence on stress drop (SD) and distance (R) is clearly visible. These are also the two variables with remaining converging edges on PGA, revealing that, for a given PGA, SD contains information about R and vice versa. The dependences between PGA and the remaining predictors are much less distinctive, such that the conditional dependences between the predictors are negligible and the edges can be reversed for the benefit of parameter reduction. The connection to VS30 is neglected completely, since its impact on PGA is of minor interest compared to the variation caused by the other predictors.

Note that the DAG of a BN actually maps the independences (not the dependences) between the variables. This means that each (conditional) independence statement encoded in the DAG must be true, while encoded dependence relations need not hold per se (see Fig. 10 for explanation). In turn this implies that each dependence holding for the data should be encoded in the DAG. The learning approach
Figure 8. Marginal distribution of the variables included in the ground motion model, discretized according to the discretization learned for the BN in Fig. 7. The number of intervals per variable ranges from 2 to 8.
applied here fulfills the task quite well, detecting the relevant dependences, while keeping the model complexity at a moderate level.

The model complexity depends not only on the DAG but also on the discretization. A complex DAG will enforce a small number of intervals, and a large number of intervals will only be chosen for variables with a strong influence on other variables. This effect is also visible for the learned discretization (Fig. 8). PGA is split into eight intervals, distance and stress drop into four and five, respectively, and the other variables consist of only two to three intervals.
3.3 Approximation of continuous distributions with mixtures of truncated exponentials (MTEs)

A major purpose of the ground motion model is the prediction of the ground motion (ln PGA) based on observations of the predictors; hence, although the BN captures the joint distribution (Property 5) of all involved variables, the focus in this context is on a single variable. The accuracy of the prediction is limited by the resolution of the discretization learned for the variable. For the BN shown above, the discretization of the target variable into eight intervals enables a quite precise approximation of the continuous distribution, but this is not the case per se. Complex network structures and smaller data sets used for BN learning lead to a coarser discretization of the variables. To enable precise estimates, we may search for alternative approximations of the continuous conditional distributions (or at least some of them, in particular those of the primary variable(s) of interest) once the BN has been learned.
Moral et al. (2001) suggest using MTEs for this purpose, since they allow for the approximation of a variety of functional shapes with a limited number of parameters (Langseth and Nielsen, 2008) and they are closed under the operations used for BN inference: restriction, combination, and marginalization (Langseth et al., 2009). The basic idea is to approximate conditional distributions $p(X_i \mid X_{\mathrm{Pa}(i)})$ with a combination/mixture of truncated exponential distributions. For this purpose the domain of $(X_i, X_{\mathrm{Pa}(i)})$ is partitioned into hypercubes $D_1, \ldots, D_L$, and the density within each hypercube, $D_l$, is defined such that it follows the form

$$p^{\downarrow D_l}\bigl(X_i \mid X_{\mathrm{Pa}(i)}\bigr) = a_0 + \sum_{j=1}^{J} a_j\, e^{\,b_j X_i + c_j^{T} X_{\mathrm{Pa}(i)}}. \qquad (5)$$
The determination of the hypercubes and the number of exponential terms in each hypercube, as well as the estimation of the single parameters, is done according to the maximum likelihood approach described in Langseth et al. (2010). In the following we show how the MTE approximation improves the BN prediction performance compared to the usage of the discretized variables, and we compare the results to those from a regression approach.
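As a toy illustration of Eq. (5), the density on one hypercube with a single continuous parent and two exponential terms can be evaluated directly. All coefficients below are made up for illustration, not fitted values:

```python
# Sketch: evaluating an MTE density of the form of Eq. (5) on one
# hypercube D_l, for one variable x_i with a single continuous parent
# x_pa and J = 2 exponential terms. Coefficients are illustrative.
import math

def mte_density(x_i, x_pa, a0, terms):
    """terms: list of (a_j, b_j, c_j); returns a0 + sum_j a_j * exp(b_j*x_i + c_j*x_pa)."""
    return a0 + sum(a * math.exp(b * x_i + c * x_pa) for a, b, c in terms)

val = mte_density(0.5, 0.2, a0=0.1,
                  terms=[(0.5, -1.0, 0.3), (0.2, 0.8, -0.5)])
print(round(val, 3))  # 0.692
```

Fitting replaces these hand-picked coefficients with maximum likelihood estimates per hypercube, as in Langseth et al. (2010).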
Prediction performance
We conduct a 10-fold cross validation to evaluate the prediction performance of the BN compared to the regression
Figure 9. The individual panels show the dependences between the predictor variables M, R, SD, Q0, κ0, and VS30 and the target variable ln PGA by plotting the data used to learn the BN for ground motion modeling.
approach: the complete data set is divided into 10 disjoint subsamples, of which one is defined as a test set in each trial while the others are used to learn the model (regression function or BN). The functional form of the regression function is determined by expert knowledge based on the description of the Fourier spectrum of seismic ground motion and follows the form

$$f(X_{\vec{Y}}) = a_0 + a_1 M + a_2 M \ln \mathrm{SD} + (a_3 + a_4 M) \ln\sqrt{a_5^2 + R^2} + a_6 \kappa R + a_7 V_{S30} + a_8 \ln \mathrm{SD},$$

with $\kappa = \kappa_0 + t^*$, $t^* = R/(Q_0 V_{sq})$ and $V_{sq} = 3.5$ km s$^{-1}$.
We compare the regression approach in terms of prediction performance to the BN with discretized variables and with MTE approximations. For this purpose we determine the conditional density distributions of ln PGA given the predictor variables for each approach and consider how much probability it assigns to the real ln PGA value in each observation. For the regression approach the conditional density follows a normal distribution, $\mathcal{N}(f(X_{\vec{Y}}), \sigma^2)$, while it is defined via the DAG and the parameters $\theta$ using the BN models. Table 4a shows for each test set the conditional density value of the observed ln PGA averaged over the individual records. Another measure for the prediction performance is the mean squared error of the estimates for ln PGA (Table 4b). Here the point estimate for ln PGA is defined as the mean value of the conditional density. For example, in the regression model the estimate corresponds to $f(x_{\vec{Y}})$.
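The evaluation protocol can be sketched as follows; an ordinary least-squares fit stands in for the actual BN and regression learners, and the data are synthetic placeholders:

```python
# Sketch of the 10-fold cross-validation loop: split the records into
# 10 disjoint subsamples, hold one out per trial, fit on the rest, and
# record the mean squared error on the held-out records.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

folds = np.array_split(rng.permutation(100), 10)
mse = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(100), test_idx)
    # stand-in learner: ordinary least squares on the training folds
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    resid = y[test_idx] - X[test_idx] @ coef
    mse.append(np.mean(resid ** 2))

print(f"avg MSE over 10 trials: {np.mean(mse):.3f}")
```

Averaging the per-trial scores, as in Table 4, keeps the comparison between models on identical train/test splits.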
Even though the discretization of ln PGA is relatively precise using the discrete BNs (eight intervals in each trial, except for the first trial, where ln PGA is split into seven intervals), the MTE approximation of the conditional distributions improves the prediction performance of the BN. Still, it does not entirely match the precision of the regression function. However, the prediction performances are on the same order of magnitude, and we must not forget that the success of the regression approach relies on the expert knowledge used to define its functional form, while the structure of the BN is learned in a completely data-driven manner. Further, the regression approach profits in this example from the fact that the target variable (ln PGA) is normally distributed, which is not necessarily the case for other applications. Focusing on the prediction of the target variable, the regression approach also does not have the flexibility of the BN, which is designed to capture the joint distribution of all variables and thus allows for inference in all directions (Property 5),
Table 4. Results of a 10-fold cross validation to test the prediction performance of the BN (with discrete and MTE approximations of the conditional distributions) and the regression approach. (a) contains the calculated conditional densities for the observed ln PGA values averaged over each trial. (b) contains the mean squared error of the predicted ln PGA for each trial.
(a) Averaged conditional density

Trial | BN (discrete) | BN (MTE) | Regression
1     | 0.237 | 0.320 | 0.331
2     | 0.240 | 0.297 | 0.329
3     | 0.239 | 0.298 | 0.331
4     | 0.218 | 0.255 | 0.323
5     | 0.216 | 0.260 | 0.339
6     | 0.222 | 0.257 | 0.339
7     | 0.215 | 0.252 | 0.332
8     | 0.243 | 0.317 | 0.330
9     | 0.212 | 0.249 | 0.328
10    | 0.243 | 0.315 | 0.331
Avg.  | 0.229 | 0.282 | 0.331

(b) Mean squared error

Trial | BN (discrete) | BN (MTE) | Regression
1     | 1.021 | 0.749 | 0.663
2     | 1.197 | 0.963 | 0.680
3     | 1.082 | 0.821 | 0.673
4     | 1.262 | 0.951 | 0.723
5     | 1.201 | 0.851 | 0.629
6     | 1.298 | 1.059 | 0.625
7     | 1.297 | 1.077 | 0.672
8     | 1.149 | 0.713 | 0.701
9     | 1.343 | 1.161 | 0.692
10    | 1.169 | 0.841 | 0.666
Avg.  | 1.202 | 0.919 | 0.672
as exemplified in Sect. 4.3. Additional benefits of BNs, like their ability to make use of incomplete observations, will be revealed in the following sections, where we investigate real-world data.
4 Flood damage assessment
In the previous section we dealt with a fairly small BN (a few variables/nodes) and a synthetic data set. In this section we go one step further and focus on learning a larger BN from real-life observations on damage caused to residential buildings by flood events. Classical approaches, so-called stage–damage functions, relate the damage for a certain class of objects to the water stage or inundation depth, while other characteristics of the flooding situation and the flooded object are rarely taken into account (Merz et al., 2010). Even though it is known that the flood damage is influenced by a
Figure 10. The graph structure of a BN dictates how the joint distribution of all variables decomposes into a product of conditionals. Thus for a valid decomposition each independence assumption mapped into the BN must hold. Usually this applies to a variety of graphs; e.g., the complete graph is always a valid independence map as it does not make any independence assumption. (a) and (b) show two valid BN structures and the corresponding decompositions, (a) P(B)P(E)P(A|B,E)P(R|E) and (b) P(B)P(A|B)P(E|A,B)P(R|E), for the burglary example. The independence assumptions made in both BNs hold; however, (b) does not capture the independence between earthquakes and burglaries. An independence map that maps all independences (a) is called a perfect map, yet perfect maps do not exist for all applications. Furthermore, for parameter reduction it might be beneficial to work with an independence map that differs from the perfect map.
variety of factors (Thieken et al., 2005), stage–damage functions are still widely used. This is because the number of potential influencing factors is large and the single and joint effects of these parameters on the degree of damage are largely unknown.
4.1 Real-life observations
The data collected after the 2002 and 2005/2006 flood events in the Elbe and Danube catchments in Germany (see Fig. 11) offer a unique opportunity to learn about the driving forces of flood damage from a BN perspective. The data result from computer-aided telephone interviews with flood-affected households, and contain 1135 records for which the degree of damage could be reported. The data describe the flooding and warning situation, building and household characteristics, and precautionary measures. The raw data were supplemented by estimates of return periods, building values, and loss ratios, as well as indicators for flow velocity, contamination, flood warning, emergency measures, precautionary measures, flood experience, and socioeconomic factors. Table 5 lists the 29 variables allocated to their domains. A detailed description of the derived indicators and the survey is given by Thieken et al. (2005) and Elmer et al. (2010).

In Sect. 3.2 we dealt with the issue of continuous data when learning BNs; here we will apply the methodology presented there. However, in contrast to the synthetic data from the previous section, many real-world data sets are, for different reasons, lacking some observations for various variables. For the data set at hand, the percentage of missing values is below 20 % for most variables, yet for others it reaches almost 70 %. In the next subsection we show how we deal with the missing values in the setting of the automatic discretization described in Sect. 3.2 when learning BNs.
Table 5. Variables used in the flood damage assessment and their corresponding ranges. C: continuous; O: ordinal; N: nominal.

Variable | Scale and range | Percentage of missing data

Flood parameters
Water depth | C: 248 cm below ground to 670 cm above ground | 1.1
Inundation duration | C: 1 to 1440 h | 1.6
Flow velocity indicator | O: 0 = still to 3 = high velocity | 1.1
Contamination indicator | O: 0 = no contamination to 6 = heavy contamination | 0.9
Return period | C: 1 to 848 years | 0

Warning and emergency measures
Early warning lead time | C: 0 to 336 h | 32.3
Quality of warning | O: 1 = receiver of warning knew exactly what to do to 6 = receiver of warning had no idea what to do | 55.8
Indicator of flood warning source | N: 0 = no warning to 4 = official warning through authorities | 17.4
Indicator of flood warning information | O: 0 = no helpful information to 11 = many helpful information | 19.1
Lead time period elapsed without using it for emergency measures | C: 0 to 335 h | 53.6
Emergency measures indicator | O: 1 = no measures undertaken to 17 = many measures undertaken | 0

Precaution
Precautionary measures indicator | O: 0 = no measures undertaken to 38 = many efficient measures undertaken | 0
Perception of efficiency of private precaution | O: 1 = very efficient to 6 = not efficient at all | 2.9
Flood experience indicator | O: 0 = no experience to 9 = recent flood experience | 68.6
Knowledge of flood hazard | N (yes/no) | 32.7

Building characteristics
Building type | N: 1 = multifamily house, 2 = semi-detached house, 3 = one-family house | 0.1
Number of flats in building | C: 1 to 45 flats | 1.2
Floor space of building | C: 45 to 18 000 m² | 1.9
Building quality | O: 1 = very good to 6 = very bad | 0.6
Building value | C: EUR 92 244 to 3 718 677 | 0.2

Socioeconomic factors
Age of the interviewed person | C: 16 to 95 years | 1.6
Household size, i.e., number of persons | C: 1 to 20 people | 1.1
Number of children (< 14 years) in household | C: 0 to 6 | 10.1
Number of elderly persons (> 65 years) in household | C: 0 to 4 | 7.6
Ownership structure | N: 1 = tenant; 2 = owner of flat; 3 = owner of building | 0
Monthly net income in classes | O: 11 = below EUR 500 to 16 = EUR 3000 and more | 17.6
Socioeconomic status according to Plapp (2003) | O: 3 = very low socioeconomic status to 13 = very high socioeconomic status | 25.5
Socioeconomic status according to Schnell et al. (1999) | O: 9 = very low socioeconomic status to 60 = very high socioeconomic status | 31.7

Flood loss
rloss – loss ratio of residential building | C: 0 = no damage to 1 = total damage | 0
Figure 11. Catchments investigated for the flood damage assessment and location of communities reporting losses from the 2002, 2005, and 2006 floods in the Elbe and Danube catchments (Schroeter et al., 2014).
4.2 Handling of incomplete records
To learn the BN, we again maximize the joint posterior for the given data (Eq. 3). This requires the number of counts for each combination of states for $(X_i, X_{\mathrm{Pa}(i)})$, considering all variables, $i = 1, \ldots, k$ (see Appendix A1). However, this is only given for complete data, and for missing values it can only be estimated by using expected completions of the data. We note that a reliable and unbiased treatment of incomplete data sets (no matter which method is applied) is only possible for missing data mechanisms that are ignorable according to the missing (completely) at random (M(C)AR) criteria as defined in Little and Rubin (1987), i.e., the absence/presence of a data value is independent of the unobserved data. For the data sets considered in this paper, we assume the MAR criterion to hold and derive the predictive function/distribution based on the observed part of the data in order to estimate the part which is missing.
In the context of BNs a variety of approaches has been developed to estimate the missing values (so-called "imputation"). Most of these principled approaches are iterative algorithms based on expectation maximization (e.g., Friedman, 1997, 1998) or stochastic simulations (e.g., Tanner and Wong, 1987). In our case we already have to run several iterations of BN learning and discretization, each iteration requiring the estimation of the missing values. Using an iterative approach for the missing value prediction will thus easily become infeasible. Instead we use a more efficient albeit approximate method, using the Markov blanket predictor developed by Riggelsen (2006).
The idea is to generate a predictive function which enables the prediction of a missing variable $X_i$ based on the observations of its Markov blanket (MB), $X_{\mathrm{MB}(i)}$. The Markov blanket identifies the variables that directly influence $X_i$, i.e., the parents and children of $X_i$, as well as the parents of $X_i$'s children. An example is given in Fig. 12. Assuming the MB is fully observed, it effectively blocks influence from all other variables, i.e., the missing value depends only on its MB. When some of the variables in the MB are missing, it does not shield off $X_i$. However, for predictive approximation purposes, we choose to always ignore the impact from outside the MB. Hence, the prediction of $X_i$ based on the observed data reduces to a prediction based on the observations of the
Figure 12. Illustration of a Markov blanket (gray-shaded nodes) on a blood group example: let us assume that I do not know my blood group for some reason, but I know the genotypes of my relatives. The genotypes of my parents provide information about my own blood group specification – in the pictured example they restrict the list of opportunities to the four options AB, A0, B0 and BB – just as the genotype of my child reveals information, excluding BB from the list of possible options. Considering the genotype of the father/mother of my child alone does not provide any information about my blood type (our blood groups are independent from each other), but together with the information about our child it again restricts the list of opportunities, leaving only AB and A0 as possible options (conditioned on our child, our blood groups become dependent). All these variables (blood type of my parents, my children, and the parents of my children) provide direct information about the considered variable (my blood type) and form its Markov blanket. If I know the values of the Markov blanket, further variables do not provide any additional information. For example, knowing the genotypes of my parents, the knowledge about my grandparents does not deliver any further information about myself (the information is "blocked" by my parents). Yet, if the blood type of my parents is unknown, the information about my grandparents can "flow" and provides new insights.
MB and factorizes according to the DAG in Fig. 13a:

$$P\bigl(X_i \mid X_{\mathrm{MB}(i)}, \theta, \mathrm{DAG}\bigr) \propto \theta_{X_i \mid X_{\mathrm{Pa}(i)}} \prod_{j \in \mathrm{Ch}(i)} \theta_{X_j \mid X_{\mathrm{Pa}(j)}}, \qquad (6)$$
where $\mathrm{Ch}(i)$ are the variable indices for the children of $X_i$. Thus the prediction of $X_i$ requires, according to Eq. (6), inference in the BN (albeit very simple) where correct estimates of $\theta$ are assumed. These in general cannot be given without resorting to iterative procedures. To avoid this we define a slightly modified version of the predictive function, for which we define all variables that belong to the MB of $X_i$ to be the parents of $X_i$ in a modified DAG′ (see Fig. 13 for illustration). Thus $X^{\mathrm{DAG}'}_{\mathrm{Pa}(i)}$ corresponds to $X^{\mathrm{DAG}}_{\mathrm{MB}(i)}$. The resulting DAG′ preserves all dependences given in DAG and can
Figure 13. (a) The Markov blanket of $X_i$ comprises its parents and children, as well as the parents of its children. The prediction of missing values is based on the observations of the variables in the Markov blanket. To avoid inference that requires unknown parameters, the subgraph of DAG that spans the Markov blanket (a) is modified by directing all edges towards $X_i$, receiving the DAG′ pictured in (b).
alternatively be used for the prediction of $X_i$,

$$P\bigl(X_i \mid X^{\mathrm{DAG}'}_{\mathrm{Pa}(i)}, \theta^{\mathrm{DAG}'}, \mathrm{DAG}'\bigr) \stackrel{\mathrm{def}}{=} \theta^{\mathrm{DAG}'}_{X_i \mid X_{\mathrm{Pa}(i)}}. \qquad (7)$$
For this predictive distribution we need to estimate the parameters $\theta^{\mathrm{DAG}'}_{X_i \mid X_{\mathrm{Pa}(i)}}$. Note that more parameters are required for the newly derived predictive distribution, but now at least all influencing variables are considered jointly and an iterative proceeding can be avoided. The parameters are estimated with a similar-cases approach, which is described in Appendix A2. A detailed description for the generation of the predictive distribution is given in Riggelsen (2006) and Vogel et al. (2013).
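A minimal sketch of such a count-based predictive distribution for a missing discrete value given observed MB states is shown below; it is a simplification of the actual similar-cases estimator of Appendix A2, and the data are made up for illustration:

```python
# Sketch: predicting a missing discrete X from observed Markov blanket
# variables A and B by counting similar cases (Laplace-smoothed).
# This is an illustrative simplification, not the estimator of App. A2.
import numpy as np

records = np.array([  # columns: X, A, B; rows = complete observations
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
])

def predictive(a, b, data, alpha=1.0):
    """Return P(X | A=a, B=b) over X in {0, 1}, smoothed by alpha."""
    mask = (data[:, 1] == a) & (data[:, 2] == b)
    counts = np.bincount(data[mask, 0], minlength=2) + alpha
    return counts / counts.sum()

print(predictive(0, 1, records))  # favors X = 0: [0.75 0.25]
```

The smoothing term keeps the prediction defined even when no record matches the observed MB configuration exactly.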
It is worth noting that, as the MBs of variables change during the BN learning procedure, the prediction of missing values (depending on the MB) needs to be updated as well.
4.3 Results
Coming back to the flood damage data, we have three variables with more than one-third of the observations missing: flood experience (69 % missing), warning quality (56 % missing) and lead time elapsed without emergency measures (54 % missing). In a first "naive" application (Vogel et al., 2012), no special attention was paid to a proper treatment of missing values; the missing values were simply randomly imputed, resulting in the isolation of two variables (flood experience and lead time elapsed) in the network; no connection to any other variable was learned (Fig. 14a). With application of the Markov blanket predictor, the situation changes and a direct connection from the relative building damage, rloss, to flood experience is found, as well as a connection between warning source and elapsed lead time (Fig. 14b). These relations, especially the first one, match with experts' expectations and speak for an improvement in the learned BN structure.
www.nat-hazards-earth-syst-sci.net/14/2605/2014/ Nat. Hazards
Earth Syst. Sci., 14, 2605–2626, 2014
Figure 14. BNs learned for flood damage assessments, showing the effect of the applied missing value estimator. The algorithm used to learn (a) replaces missing values randomly, while the one used to learn (b) applies the Markov blanket predictor for the estimation of missing values. Nodes with a bold frame belong to the Markov blanket of relative building loss and are thus assumed to have a direct impact on the caused flood damage.
Using the graphical representation (Property 1), as mentioned in Sect. 2.1, the learned DAG (Fig. 14b) gives insight into the dependence relations of the variables. It reveals a number of direct links connecting the damage-describing variable with almost all subdomains. This supports the demand for improved flood damage assessments that take several variables into account (Merz et al., 2010). Moreover, the DAG shows which variables are the most relevant for the prediction of rloss. The domains "precaution" and "flood parameters" in particular are densely connected to building damage and should be included in any damage assessment (Property 3).
Existing approaches for flood damage assessments usually consider fewer variables, and the employment of a large number of variables is often considered disadvantageous, since complete observations for all involved variables are rare.
Figure 15 compares the densities of the relative building loss for a good precaution (precautionary measures indicator > 14) and a bad precaution (precautionary measures indicator ≤ 14) in a general case (Fig. 15a: all other variables are unknown and summed out) and for a specific flood event (Fig. 15b: 7.5 m ≤ water depth < 96.5 m; 82 h ≤ duration < 228 h; 1 ≤ velocity). We may appreciate how a good precaution increases the chance for no or only small building losses.
Similar investigations may support the identification of efficient precautionary measures, not only in the context of flood events but also for natural hazards in general. They may also help to convince authorities or private persons to undertake the suggested precautions. Using the flexibility of BNs and their ability to model specific situations, BNs may thus contribute to a better communication between scientists and non-scientific stakeholders. BNs can also be used for forensic reasoning, i.e., we can turn around the direction of reasoning in the example just considered and ask what a likely state of precaution is for a given observed damage in a specific or general event situation. Forensic reasoning might be of
interest, for instance, for insurance companies.
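The two directions of reasoning mentioned above can be made concrete on a toy model. The probabilities below are invented for illustration and are not the paper's flood model; the sketch only shows how a predictive query P(loss | precaution) and a forensic query P(precaution | loss) relate via Bayes' rule on a minimal two-node BN, Precaution → Loss.

```python
# Illustrative two-node BN with assumed (made-up) probabilities.
p_prec = {"good": 0.4, "bad": 0.6}          # prior on precaution state
p_loss_given_prec = {                        # CPT of Loss given Precaution
    "good": {"low": 0.8, "high": 0.2},
    "bad":  {"low": 0.3, "high": 0.7},
}

def forensic(loss):
    """Invert the reasoning direction: P(precaution | loss) by Bayes' rule."""
    joint = {pr: p_prec[pr] * p_loss_given_prec[pr][loss] for pr in p_prec}
    z = sum(joint.values())                  # normalizing constant P(loss)
    return {pr: v / z for pr, v in joint.items()}

post = forensic("high")
# An observed high loss shifts the posterior towards bad precaution.
print(post)
```

In a full BN the same inversion is carried out by standard inference algorithms that sum out all unobserved variables; the principle is identical.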
5 Landslides
So far we assumed the existence of a unique model that explains the data best. In practical problems, however, there may be many models almost as good as the best, i.e., ones that explain the data similarly well. This results in an uncertainty about which BN structure to use. We consider this problem in our last application, where we apply BN learning to landslides, which are another ubiquitous natural hazard in many parts of the world.
A key theme in many landslide studies is the search for those geological, hydroclimatological, topographic, and environmental parameters that sufficiently predict the susceptibility to slope failure in a given region. A wide range of multivariate data analysis techniques has been proposed to meet this challenge. Amongst the more prominent methods are logistic regression, artificial neural networks, and Bayesian weights of evidence. The popularity of such methods is only matched by their seeming success: a recent review of 674 scientific papers on the topic indicates that most reported success rates are between 75 and 95 % (Korup and Stolle, 2014), where in the majority of studies the success rate is defined as the percentage of correctly (true positives and true negatives) identified locations that were subject to slope instability in the past. This raises the question as to why landslides still continue to cause massive losses despite this seemingly high predictive accuracy. Moreover, success rates do not show any significant increase over the last 10 years regardless of the number of landslide data or predictors used (Korup and Stolle, 2014). An often overlooked key aspect in these analyses is the potential for correlated or interacting predictor candidates. Few studies have stringently explored whether this likely limitation is due to physical or statistical (sampling) reasons.
5.1 Data
The landslide data are taken from an inventory of ∼ 300 000 digitally mapped landslide deposit areas across the Japanese islands (Korup et al., 2014). These landslides were mapped systematically, mostly from stereographic image interpretation of air photos, and compiled by the National Research Institute for Earth Science and Disaster Prevention NIED (http://lsweb1.ess.bosai.go.jp/gis-data/index.html). The dominant types of failure in this database are deep-seated slow-moving earthflows and more rapid rockslides. The mapped size range of the deposits from these landslides spans from 10² to 10⁷ m² footprint area and is distinctly heavy tailed (Korup et al., 2012). Many of the landslide deposits are covered by vegetation. Individual deposits do not carry any time-stamp information, and so the inventory contains both historic and prehistoric slope failures, likely containing landslides up to several thousands of years old. Smaller rockfalls or soil slips are not included. Similarly, the inventory contains no data on specific trigger mechanisms (such as earthquakes, rainfall, or snowmelt), the dominant type of materials mobilized, or absolute age information for the bulk of individual landslides. In this context, the data nicely reflect common constraints that scientists encounter when compiling large landslide databases from remote sensing data covering different time slices. Yet this type of inventory is frequently used as a key input for assessing and mapping regional landslide susceptibility with a number of statistical techniques, including BNs. However, data-driven learning of BNs containing landslide information has, to the best of our knowledge, not been attempted before.

We have compiled a number of geological, climatic, and topographic metrics for individual catchments throughout the Japanese islands to test their influence on the average fraction of landslide-affected terrain that we computed within a 10 km radius. Most of our candidate predictors (Table 7) have been used in modified form in other studies (Korup et al., 2014). While all of these candidate predictors may be physically related to slope instability, our choice of predictors is intentionally arbitrary in order to learn more about their effects on BN learning and structure. The final data set used for the BN learning consists of landslide and predictor data that we averaged at the scale of 553 catchments that are up to 10³ km² large, and that we sampled randomly from the drainage network across Japan. This averaging approach produced ∼ 0.4 % missing data in the subset, and aptly simulates further commonly encountered constraints on the quality of large landslide inventories.
5.2 Uncertainty in BN structure
Ideally, a given model should adequately encapsulate natural phenomena such as the causes and triggers of slope instability. However, there may be several equally well poised, but competing, models because of the intrinsic uncertainty tied to the governing processes. In practice we also face other limitations that prevent us from focusing on one single best model. The finite number of observations we have at our disposal for learning, and the fact that it is unclear which relevant predictor variables to consider for landslide prediction, imply that several models may be justifiable. This is a general problem when attempting to formally model natural systems. In our case this means that several BNs might explain the data (almost) equally well, i.e., they receive a similar score according to Eq. (2).
An additional source of uncertainty stems from the structure learning algorithm used to maximize the score defined in Eq. (2) or – for continuous variables – in Eq. (3). For infinite data sets the algorithm terminates, according to Meek's conjecture, in the (unique) optimal equivalence class of DAGs
Table 7. Variables used in the landslide model.

Name – Definition [Unit]
Mean elevation – Average of elevation values within catchment boundaries [m]
Catchment area – Log-transformed catchment area [a.u.]
Catchment perimeter – Total length of catchment divides [m]
Mean local topographic relief – Maximum elevation difference in a 10 km radius [m]
Mean annual precipitation^a – Based on interpolated rainfall station data (reference period 1980–2010) [mm]
Mean coefficient of variation of annual precipitation^a – Based on interpolated rainfall station data, with standard deviation divided by mean (reference period 1980–2010) [1]
Mean coefficient of variation of monthly precipitation^a – Based on interpolated rainfall station data, with standard deviation divided by mean (reference period 1980–2010) [1]
Mean surface uplift 2001–2011^b – GPS-derived accumulated surface uplift 2001–2011 [m]
Mean surface uplift 2010–2011^b – GPS-derived accumulated surface uplift 2010–2011 [m]
Mean fraction of 10 % steepest bedrock channels – Average fraction of 10 % steepest channels per unit length of bedrock-river drainage network in a 10 km radius, based on an arbitrarily set reference concavity θ = 0.45 [1]
Mean bedrock channel steepness – Average of channel steepness index per reach length, based on an arbitrarily set reference concavity θ = 0.45 [1]
Regionalized river sinuosity – Average bedrock-channel sinuosity weighted by drainage network length in a 10 km radius, calculated as the flow length of a given channel segment divided by its shortest vertex distance [1]
Fraction of volcanic rocks^c – Fraction of catchment area underlain by volcanic rocks [1]
Fraction of lakes – Fraction of catchment area covered by lakes [1]
Fraction of plutonic rocks^c – Fraction of catchment area underlain by plutonic rocks [1]
Fraction of sedimentary rocks^c – Fraction of catchment area underlain by sedimentary rocks [1]
Fraction of accretionary complex rocks^c – Fraction of catchment area underlain by accretionary complex rocks [1]
Fraction of metamorphic rocks^c – Fraction of catchment area underlain by metamorphic rocks [1]
Median area of landslide-affected terrain – Fraction of landslide terrain per unit catchment area within a 10 km radius, calculated using an inventory of mostly prehistoric landslide-deposit areas [1]

^a Calculated using data provided by the Japan Meteorological Agency (JMA, http://www.jma.go.jp/jma/indexe.html). ^b Calculated from secular high-precision leveling data (Kimura et al., 2008). ^c Calculated using the seamless digital geological map of Japan (1:200 000) available from the Geological Survey of Japan (https://gbank.gsj.jp/seamless).
(Chickering, 2002), but this does not necessarily hold for finite data sets, incomplete observations and a search space extended by the discretization. The algorithm for the traversal of the BN hypothesis space contains stochastic elements and may get stuck in local optima, providing slightly different results for different runs.
To analyze this random behavior, we run the BN learning and discretization algorithm 10 times on the same set of landslide data. We do not expect to end up with the same BN in each trial, as the constraints to meet Meek's conjecture are not fulfilled. Instead, we are more interested in documenting how strongly the results differ from each other.
Figure 16 gives a summarized representation of the BN DAG structures. The frequency with which an edge between two variables is learned is encoded by the width of the corresponding arrow. Despite the differences in DAG structures, all learned BNs seem to model the data-generating process almost equally well, which can be gathered from the score obtained by Eq. (3): for the BNs learned, we observed scores between −64 364.42 and −64 253.98. This is a promising result, since it indicates that, even though the algorithm gets stuck in local maxima, the quality of the results does not differ significantly. This supports the assumption that the quality of the learned BN is not seriously affected by random effects of the learning algorithm. Multiple runs of the algorithm on other data sets confirm this assumption.
In the literature on BN learning (and on model learning based on data in general), ideas of how to handle several competing, but all justifiable, BNs have been investigated. Friedman et al. (1999) use bootstrap sampling to learn BNs from different variations of the data set. Based on those they develop a confidence measure on features of a network (e.g., the presence of an edge or the membership of a node in a certain Markov blanket). A Bayesian approach is presented by Friedman and Koller (2000) and Riggelsen (2005), who approximate the Bayesian posterior on the DAG space using a Markov chain Monte Carlo approach. An adaptation of these methods for the extended MAP score introduced in this paper is left for future work.
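The bootstrap confidence idea of Friedman et al. (1999) can be sketched in a few lines. The `learn_structure` function below is a deliberately trivial stand-in (a correlation-threshold rule on a two-variable toy data set), not any learner from the paper; it exists only so the resampling loop runs end to end.

```python
# Sketch of bootstrap edge confidence: resample the data with replacement,
# relearn a structure on each replicate, and report how often a feature
# (here: the edge x -> y) appears. Learner and data are toy assumptions.
import random

def learn_structure(data):
    """Toy stand-in learner: connect x -> y if x and y agree in > 80 %
    of the records."""
    agree = sum(1 for x, y in data if x == y) / len(data)
    return {("x", "y")} if agree > 0.8 else set()

def edge_confidence(data, n_boot=200, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        if ("x", "y") in learn_structure(sample):
            hits += 1
    return hits / n_boot

# Strongly coupled toy data: the edge is recovered in most replicates.
data = [(1, 1)] * 90 + [(0, 1)] * 10
print(edge_confidence(data))
```

Features that survive most bootstrap replicates (edges, Markov-blanket memberships) can then be reported with a confidence value instead of committing to one single learned structure.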
5.3 Results
Despite (or rather thanks to) the DAG structural differences, we can glean some instructive insights from the learned BNs.
Fig. 17. Illustration for the calculation of s(·) used for the parameter estimation in DAG′. The graph on the left shows a DAG′ for the estimation of C conditioned on A and B. The three variables take the values t and f. An exemplary data set is given in the table on the right, together with the contribution of each record to s(C = t, (A = t, B = f)).
Figure 16. Summary of 10 learned network structures modeling landslide susceptibility, all based on the same data set. Arrow widths between the variables are scaled to the number of times they occur in the learned BNs. Likewise, we color-coded the variables according to the frequency with which they occur as part of the Markov blanket of fraction of landslide-affected terrain (circular node shape), where darker hues indicate more frequent occurrences.
The fact that we can learn something about the landslide-affected terrain from several BN structures indicates that the different predictors are highly interacting, and that a missed link between two variables can often be compensated for by other interactions. To understand which variables are most relevant for the prediction of landslide-affected terrain, we coded the variables in Fig. 16 according to the frequency with which they occur as part of the target variable's Markov blanket, where darker hues indicate more frequent occurrences.
Perhaps the most surprising aspect of the learned BNs is that only a few of the predictors that have traditionally been invoked to explain landslide susceptibility are duly represented in the Markov blanket. These include mean annual precipitation (part of the MB in each run) – including some derivatives such as precipitation variability (either annual or mont