Learning Similarity Metrics for Numerical Simulations
Georg Kohl¹, Kiwon Um¹, Nils Thuerey¹
Abstract
We propose a neural network-based approach that computes a stable and generalizing metric (LSiM) to compare data from a variety of numerical simulation sources. We focus on scalar time-dependent 2D data that commonly arises from motion and transport-based partial differential equations (PDEs). Our method employs a Siamese network architecture that is motivated by the mathematical properties of a metric. We leverage a controllable data generation setup with PDE solvers to create increasingly different outputs from a reference simulation in a controlled environment. A central component of our learned metric is a specialized loss function that introduces knowledge about the correlation between single data samples into the training process. To demonstrate that the proposed approach outperforms existing metrics for vector spaces and other learned, image-based metrics, we evaluate the different methods on a large range of test data. Additionally, we analyze generalization benefits of an adjustable training data difficulty and demonstrate the robustness of LSiM via an evaluation on three real-world data sets.
1. Introduction
Evaluating computational tasks for complex data sets is a fundamental problem in all computational disciplines. Regular vector space metrics, such as the L2 distance, were shown to be very unreliable (Wang et al., 2004; Zhang et al., 2018), and the advent of deep learning techniques with convolutional neural networks (CNNs) made it possible to more reliably evaluate complex data domains such as natural images, texts (Benajiba et al., 2018), or speech (Wang et al., 2018). Our central aim is to demonstrate the usefulness of CNN-based evaluations in the context of numerical simulations. These simulations are the basis for a wide range of applications, from blood flow simulations to aircraft design. Specifically, we propose a novel learned simulation metric (LSiM) that allows for a reliable similarity evaluation of simulation data.

¹Department of Informatics, Technical University of Munich, Munich, Germany. Correspondence to: Georg Kohl.
Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
Potential applications of such a metric arise in all areas where numerical simulations are performed or similar data is gathered from observations. For example, accurate evaluations of existing and new simulation methods with respect to a known ground truth solution (Oberkampf et al., 2004) can be performed more reliably than with a regular vector norm. Another good example is weather data, for which complex transport processes and chemical reactions make in-place comparisons with common metrics unreliable (Jolliffe & Stephenson, 2012). Likewise, the long-standing, open questions of turbulence (Moin & Mahesh, 1998; Lin et al., 1998) can benefit from improved methods for measuring the similarity and differences in data sets and observations.
In this work, we focus on field data, i.e., dense grids of scalar values, similar to images, which were generated with known partial differential equations (PDEs) in order to ensure the availability of ground truth solutions. While we focus on 2D data in the following to make comparisons with existing techniques from imaging applications possible, our approach naturally extends to higher dimensions. Every sample of this 2D data can be regarded as a high-dimensional vector, so metrics on the corresponding vector space are applicable to evaluate similarities. These metrics, in the following denoted as shallow metrics, are typically simple, element-wise functions such as L1 or L2 distances. Their inherent problem is that they cannot compare structures on different scales or take contextual information into account.
Many practical problems require solutions over time and need a vast number of non-linear operations that often result in substantial changes of the solutions even for small changes of the inputs. Hence, despite being based on known, continuous formulations, these systems can be seen as chaotic. We illustrate this behavior in Fig. 1, where two smoke flows are compared to a reference simulation. A single simulation parameter was varied for these examples, and a visual inspection shows that smoke plume (a) is more similar to the reference. This matches the data generation process: version (a) has a significantly smaller parameter change than (b), as shown in the inset graph on the right. LSiM robustly predicts the ground truth distances, while the L2 metric labels plume (b) as more similar. In our work, we focus on retrieving the relative distances of simulated data sets. Thus, we do not aim to retrieve the absolute parameter change but a relative distance that preserves the ordering with respect to this parameter.

arXiv:2002.07863v2 [cs.LG] 23 Jun 2020

Figure 1. Example of field data from a fluid simulation of hot smoke with normalized distances for different metrics. Our method (LSiM, green) best approximates the ground truth distances (GT, gray) determined by the data generation method, i.e., version (a) is closer to the ground truth data than (b). An L2 metric (red) erroneously yields a reversed ordering.
Using existing image metrics based on CNNs for this problem is not optimal either: natural images only cover a small fraction of the space of possible 2D data, and numerical simulation outputs are located in a fundamentally different data manifold within this space. Hence, there are crucial aspects that cannot be captured by purely learning from photographs. Furthermore, we have full control over the data generation process for simulation data. As a result, we can create arbitrary amounts of training data with gradual changes and a ground truth ordering. With this data, we can learn a metric that is not only able to directly extract and use features but also encodes interactions between them. The central contributions of our work are as follows:
• We propose a Siamese network architecture with feature map normalization, which is able to learn a metric that generalizes well to unseen motion and transport-based simulation methods.
• We propose a novel loss function that combines a correlation loss term with a mean squared error to improve the accuracy of the learned metric.
• In addition, we show how a data generation approach for numerical simulations can be employed to train networks with general and robust feature extractors for metric calculations.
Our source code, data sets, and final model are available at https://github.com/tum-pbs/LSIM.
2. Related Work
One of the earliest methods to go beyond using simple metrics based on Lp norms for natural images was the structural similarity index (Wang et al., 2004). Despite improvements, this method can still be considered a shallow metric. Over the years, multiple large databases for human evaluations of natural images were presented, for instance, CSIQ (Larson & Chandler, 2010), TID2013 (Ponomarenko et al., 2015), and CID:IQ (Liu et al., 2014). With this data and the discovery that CNNs can create very powerful feature extractors that are able to recognize patterns and structures, deep feature maps quickly became established as a means for evaluation (Amirshahi et al., 2016; Berardino et al., 2017; Bosse et al., 2016; Kang et al., 2014; Kim & Lee, 2017). Recently, these methods were improved by predicting the distribution of human evaluations instead of directly learning distance values (Prashnani et al., 2018; Talebi & Milanfar, 2018b). Zhang et al. compared different architectures and levels of supervision, and showed that metrics can be interpreted as a transfer learning approach by applying a linear weighting to the feature maps of any network architecture to form the image metric LPIPS v0.1. Typical use cases of these image-based CNN metrics are computer vision tasks such as detail enhancement (Talebi & Milanfar, 2018a), style transfer, and super-resolution (Johnson et al., 2016). Generative adversarial networks also leverage CNN-based losses by training a discriminator network in parallel to the generation task (Dosovitskiy & Brox, 2016).
Siamese network architectures are known to work well for a variety of comparison tasks such as audio (Zhang & Duan, 2017), satellite images (He et al., 2019), or the similarity of interior product designs (Bell & Bala, 2015). Furthermore, they yield robust object trackers (Bertinetto et al., 2016), algorithms for image patch matching (Hanif, 2019), and descriptors for fluid flow synthesis (Chu & Thuerey, 2017). Inspired by these studies, we use a similar Siamese neural network architecture for our metric learning task. In contrast to other work on self-supervised learning that utilizes spatial or temporal changes to learn meaningful representations (Agrawal et al., 2015; Wang & Gupta, 2015), our method does not rely on tracked keypoints in the data.
While correlation terms have been used for learning joint representations by maximizing the correlation of projected views (Chandar et al., 2016) and are popular for style transfer applications via the Gram matrix (Ruder et al., 2016), they have not been used for learning distance metrics. As we demonstrate below, they can yield significant improvements in terms of the inferred distances.
Similarity metrics for numerical simulations are a topic of ongoing investigation. A variety of specialized metrics have been proposed to overcome the limitations of Lp norms, such as the displacement and amplitude score from the area of weather forecasting (Keil & Craig, 2009) as well as permutation-based metrics for energy consumption forecasting (Haben et al., 2014). Turbulent flows, on the other hand, are often evaluated in terms of aggregated frequency spectra (Pitsch, 2006). Crowd-sourced evaluations based on the human visual system were also proposed to evaluate simulation methods for physics-based animation (Um et al., 2017) and for comparing non-oscillatory discretization schemes (Um et al., 2019). These results indicate that visual evaluations in the context of field data are possible and robust, but they require extensive (and potentially expensive) user studies. Additionally, our method naturally extends to higher dimensions, while human evaluations inherently rely on projections with at most two spatial and one time dimension.
3. Constructing a CNN-based Metric
In the following, we explain our considerations when employing CNNs as evaluation metrics. For a comparison that corresponds to our intuitive understanding of distances, an underlying metric has to obey certain criteria. More precisely, a function m : I × I → [0, ∞) is a metric on its input space I if it satisfies the following properties ∀x, y, z ∈ I:
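$$m(x,y) \geq 0 \qquad \text{(non-negativity)} \tag{1}$$
$$m(x,y) = m(y,x) \qquad \text{(symmetry)} \tag{2}$$
$$m(x,y) \leq m(x,z) + m(z,y) \qquad \text{(triangle inequality)} \tag{3}$$
$$m(x,y) = 0 \iff x = y \qquad \text{(identity of indiscernibles)} \tag{4}$$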
Properties (1) and (2) are crucial, as distances should be symmetric and have a clear lower bound. Eq. (3) ensures that direct distances cannot be longer than a detour. Property (4), on the other hand, is not really useful for discrete operations, as approximation errors and floating point operations can easily lead to a distance of zero for slightly different inputs. Hence, we focus on a relaxed, more meaningful definition m(x,x) = 0 ∀x ∈ I, which leads to a so-called pseudometric. It allows for a distance of zero for different inputs but must still assign a distance of zero to identical inputs.
We realize these requirements for a pseudometric with an architecture that follows popular perceptual metrics such as LPIPS: The activations of a CNN are compared in latent space, accumulated with a set of weights, and the resulting per-feature distances are aggregated to produce a final distance value. Fig. 2 gives a visual overview of this process.
To show that the proposed Siamese architecture by construction qualifies as a pseudometric, the function

$$m(x,y) = m_2\big(m_1(x), m_1(y)\big)$$

computed by our network is split into two parts: m₁ : I → L to compute the latent space embeddings x̃ = m₁(x), ỹ = m₁(y) from each input, and m₂ : L × L → [0, ∞) to compare these points in the latent space L. We chose operations for m₂ such that it forms a metric ∀x̃, ỹ ∈ L. Since m₁ always maps to L, this means m has the properties (1), (2), and (3) on I for any possible mapping m₁, i.e., only a metric on L is required. To achieve property (4), m₁ would need to be injective, but the compression of typical feature extractors precludes this. However, if m₁ is deterministic, m(x,x) = 0 ∀x ∈ I is still fulfilled, since identical inputs result in the same point in latent space and thus a distance of zero. More details for this proof can be found in App. A.
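The construction can be checked numerically with a minimal sketch. It assumes a toy, deterministic embedding `m1` (a channel sum followed by 2×2 average pooling, a hypothetical stand-in for a CNN feature extractor, not the paper's base network) and the Euclidean distance as `m2`. The composition satisfies properties (1) to (3) and m(x,x) = 0, while two different inputs can still receive a distance of zero.

```python
import numpy as np


def m1(x: np.ndarray) -> np.ndarray:
    """Hypothetical deterministic, non-injective embedding from I to L.

    Sums over channels and averages 2x2 spatial blocks, so information is
    lost, mimicking the compression of a real feature extractor.
    """
    s = x.sum(axis=0)                                   # (H, W)
    h, w = s.shape
    return s.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)).ravel()


def m2(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance: a metric on the latent space L."""
    return float(np.linalg.norm(a - b))


def m(x: np.ndarray, y: np.ndarray) -> float:
    """Siamese composition m(x, y) = m2(m1(x), m1(y)); m1 is shared by both inputs."""
    return m2(m1(x), m1(y))


rng = np.random.default_rng(0)
x, y, z = (rng.random((3, 4, 4)) for _ in range(3))

assert m(x, y) >= 0                              # (1) non-negativity
assert np.isclose(m(x, y), m(y, x))              # (2) symmetry
assert m(x, y) <= m(x, z) + m(z, y) + 1e-12      # (3) triangle inequality
assert m(x, x) == 0.0                            # relaxed identity m(x, x) = 0

# Swapping two channels yields a different input with the same embedding,
# so property (4) fails: m is a pseudometric, not a metric.
x_swapped = x[[1, 0, 2]]
print(np.array_equal(x, x_swapped), m(x, x_swapped))   # False 0.0
```

Any metric could replace the Euclidean `m2` here; the argument only needs m₂ to be a metric on L and m₁ to be deterministic.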
3.1. Base Network
The sole purpose of the base network (Fig. 2, in purple) is to extract feature maps from both inputs. The Siamese architecture implies that the weights of the base network are shared for both inputs, meaning all feature maps are comparable. We experimented with the feature extracting layers from various CNN architectures, such as AlexNet (Krizhevsky et al., 2017), VGG (Simonyan & Zisserman, 2015), SqueezeNet (Iandola et al., 2016), and a fluid flow prediction network (Thuerey et al., 2018). We considered three variants of these networks: using the original pre-trained weights, fine-tuning them, or re-training the full networks from scratch. In contrast to typical CNN tasks where only the result of the final output layer is further processed, we make use of the full range of extracted features across the layers of a CNN (see Fig. 2). This implies a slightly different goal compared to regular training: while early features should be general enough to allow for extracting more complex features in deeper layers, this is not their sole purpose. Rather, features in earlier layers of the network can directly participate in the final distance calculation and can yield important cues.

[Figure 2 diagram: Input 1 and Input 2 (RGB inputs) → base network (shared weights) → feature map normalization → element-wise latent space difference → channel aggregation (weighted average, 1 learned weight per feature map) → spatial aggregation (average) → layer aggregation (summation) → distance output. Output shapes: feature maps and difference maps are sets of 3rd order tensors, average maps a set of 2nd order tensors, layer distances d1, d2, d3 a set of scalars, and the result d a scalar.]

Figure 2. Overview of the proposed distance computation for a simplified base network that contains three layers with four feature maps each in this example. The output shape for every operation is illustrated below the transitions in orange and white. Bold operations are learned, i.e., contain weights influenced by the training process.
We achieved the best performance for our data sets using a base network architecture with five layers, similar to a reduced AlexNet, that was trained from scratch (see App. B.1). This feature extractor is fully convolutional and thus allows for varying spatial input dimensions, but for comparability to other models we keep the input size constant at 224 × 224 for our evaluation. In separate tests with interpolated inputs, we found that the metric still works well for scaling factors in the range [0.5, 2].
3.2. Feature Map Normalization
The goal of normalizing the feature maps (Fig. 2, in red) is to transform the extracted features of each layer, which typically have very different orders of magnitude, into comparable ranges. While this task could potentially be performed by the learned weights, we found the normalization to yield improved performance in general.
Let G denote a 4th order feature tensor with dimensions (gb, gc, gx, gy) from one layer of the base network. We form a series G0, G1, … for every possible content of this tensor across our training samples. The normalization only happens in the channel dimension, so all following operations accumulate values along the dimension of gc while keeping gb, gx, and gy constant, i.e., are applied independently of the batch and spatial dimensions. The unit length normalization proposed by Zhang et al., i.e.,

$$\operatorname{norm}_{\text{unit}}(G) = \frac{G}{\lVert G \rVert_2},$$
only considers the current sample. In this case, ‖G‖2 is a 3rd order tensor with the Euclidean norms of G along the channel dimension. Effectively, this results in a cosine distance, which only measures angles of the latent space vectors. To consider the vector magnitude, the most basic idea is to use the maximum norm of other training samples, and this leads to a global unit length normalization

$$\operatorname{norm}_{\text{global}}(G) = \frac{G}{\max\left(\lVert G_0 \rVert_2, \lVert G_1 \rVert_2, \ldots\right)}.$$
Now, the magnitude of the current sample can be compared to other feature vectors, but this is not robust, since the largest feature vector could be an outlier with respect to the typical content. Instead, we individually transform each component of a feature vector with dimension gc to a standard normal distribution. This is realized by subtracting the mean and dividing by the standard deviation of all features element-wise along the channel dimension as follows:

$$\operatorname{norm}_{\text{dist}}(G) = \frac{1}{\sqrt{g_c - 1}} \, \frac{G - \operatorname{mean}(G_0, G_1, \ldots)}{\operatorname{std}(G_0, G_1, \ldots)}.$$
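The three normalization variants can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: features use the (gb, gc, gx, gy) layout, the mean and standard deviation are taken per channel over all training samples and spatial positions, and `norm_global` uses a single maximum norm per layer. The synthetic Gaussian features only serve to show that `norm_dist` yields channel vectors with an average magnitude close to one.

```python
import numpy as np

CH = 1  # channel axis of a (gb, gc, gx, gy) feature tensor


def fit_stats(train_feats: np.ndarray):
    """Precompute fixed statistics of one layer over training features.

    Assumption for this sketch: mean/std are per channel, pooled over samples
    and spatial positions; the global variant uses one scalar max norm.
    """
    mean = train_feats.mean(axis=(0, 2, 3), keepdims=True)
    std = train_feats.std(axis=(0, 2, 3), keepdims=True)
    max_norm = float(np.linalg.norm(train_feats, axis=CH).max())
    return mean, std, max_norm


def norm_unit(g: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Unit length per channel vector; only the current sample is considered."""
    return g / (np.linalg.norm(g, axis=CH, keepdims=True) + eps)


def norm_global(g: np.ndarray, max_norm: float) -> np.ndarray:
    """Divide by the largest channel-vector norm observed during training."""
    return g / max_norm


def norm_dist(g, mean, std, eps=1e-10):
    """Standardize each channel component, then scale by 1 / sqrt(gc - 1)."""
    gc = g.shape[CH]
    return (g - mean) / (std + eps) / np.sqrt(gc - 1)


rng = np.random.default_rng(1)
gc = 64
train = rng.normal(loc=3.0, scale=5.0, size=(200, gc, 8, 8))   # synthetic stand-in features
mean, std, max_norm = fit_stats(train)

sample = rng.normal(loc=3.0, scale=5.0, size=(1, gc, 8, 8))
mags = np.linalg.norm(norm_dist(sample, mean, std), axis=CH)
print(round(float(mags.mean()), 2))   # close to 1 for any gc
```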
These statistics are computed via a preprocessing step over the training data and stay fixed during training, as we did not observe significant improvements with more complicated schedules such as keeping a running mean. The magnitude of the resulting normalized vectors follows a chi distribution with k = gc degrees of freedom, but computing its mean

$$\sqrt{2}\,\frac{\Gamma\big((k+1)/2\big)}{\Gamma\big(k/2\big)}$$

is expensive¹, especially for larger k. Instead, the mode of the chi distribution,

$$\sqrt{g_c - 1},$$

which closely approximates its mean, is employed to achieve a consistent average magnitude of about one independently of gc. As a result, we can measure angles for the latent space vectors and compare their magnitude in the global length distribution across all layers.
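The following sketch compares the chi mean with the mode √(k − 1) for a few channel counts. It evaluates the Γ ratio via `math.lgamma`, because evaluating Γ directly overflows double precision for k beyond roughly 340; the printed ratio of mean to mode approaches one as k grows, which is why dividing by √(gc − 1) gives an average magnitude of about one.

```python
import math


def chi_mean(k: int) -> float:
    """Mean of the chi distribution, sqrt(2) * Gamma((k + 1) / 2) / Gamma(k / 2).

    Uses log-gamma, since the direct Gamma ratio overflows doubles for large k.
    """
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2) - math.lgamma(k / 2))


def chi_mode(k: int) -> float:
    """Mode of the chi distribution, sqrt(k - 1), valid for k >= 1."""
    return math.sqrt(k - 1)


for k in (16, 64, 256, 512):
    # channel count, mean, mode, and their ratio
    print(k, round(chi_mean(k), 4), round(chi_mode(k), 4), round(chi_mean(k) / chi_mode(k), 4))
```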
3.3. Latent Space Differences
Computing the difference of two latent space representations x̃, ỹ ∈ L that consist of all extracted features from the two inputs x, y ∈ I lies at the core of the metric. This difference operator in combination with the following aggregations has to ensure that the metric properties above are upheld with respect to L. Thus, the most obvious approach, an element-wise difference x̃i − ỹi ∀i ∈ {1, 2, …, dim(L)}, is not suitable, as it invalidates non-negativity and symmetry. Instead, exponentiation of an absolute difference via |x̃i − ỹi|^p yields an Lp metric on L when combined with the correct aggregation and a p-th root. We use |x̃i − ỹi|^2 to compute the difference maps (Fig. 2, in yellow), as we did not observe significant differences for other values of p.
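A small worked example illustrates why signed differences fail and why the p-th root matters. The helper names are hypothetical; the sketch sums over all latent elements, which corresponds to the plain summation aggregation discussed in Section 3.4.

```python
import numpy as np


def signed_diff_sum(a: np.ndarray, b: np.ndarray) -> float:
    """Naive aggregation of signed differences: can be negative and is antisymmetric."""
    return float(np.sum(a - b))


def lp_latent_distance(a: np.ndarray, b: np.ndarray, p: float = 2.0) -> float:
    """Sum of |a_i - b_i|^p over all latent elements, followed by the p-th root."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def lp_without_root(a: np.ndarray, b: np.ndarray, p: float = 2.0) -> float:
    """Same aggregation, but the p-th root is omitted."""
    return float(np.sum(np.abs(a - b) ** p))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])
print(signed_diff_sum(a, b), signed_diff_sum(b, a))          # 1.0 -1.0
print(lp_latent_distance(a, b) == lp_latent_distance(b, a))  # True

# Without the root, the triangle inequality breaks on three collinear points.
x, z, y = np.array([0.0]), np.array([1.0]), np.array([2.0])
print(lp_without_root(x, y) <= lp_without_root(x, z) + lp_without_root(z, y))           # False
print(lp_latent_distance(x, y) <= lp_latent_distance(x, z) + lp_latent_distance(z, y))  # True
```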
Considering the importance of comparing the extracted features, this simple feature difference does not seem optimal. Rather, one can imagine that improvements in terms of comparing one set of feature activations could lead to overall improvements for derived metrics. We investigated replacing these operations with a pre-trained CNN-based metric for each feature map. This creates a recursive process or “meta-metric” that reformulates the initial problem of learning input similarities in terms of learning feature space similarities. However, as detailed in App. B.3, we did not find any substantial improvements with this recursive approach. This implies that once a large enough number of expressive features is available for comparison, the in-place difference of each feature is sufficient to compare two inputs.

¹Γ denotes the gamma function, the generalization of the factorial.
3.4. Aggregations
The subsequent aggregation operations (Fig. 2, in green) are applied to the difference maps to compress the contained per-feature differences along the different dimensions into a single distance value. A simple summation in combination with an absolute difference |x̃i − ỹi| above leads to an L1 distance on the latent space L. Similarly, we can show that average or learned weighted average operations are applicable too (see App. A). In addition, using a p-th power for the latent space difference requires a corresponding root operation after all aggregations to ensure the metric properties with respect to L.
To aggregate the difference maps along the channel dimension, we found the weighted average proposed by Zhang et al. to work very well. Thus, we use one learnable weight to control the importance of a feature. The weight is a multiplier for the corresponding difference map before summation along the channel dimension, and is clamped to be non-negative. A negative weight would mean that a larger difference in this feature produces a smaller overall distance, which is not helpful. For regularization, the learned aggregation weights utilize dropout during training, i.e., are randomly set to zero with a probability of 50%. This ensures that the network cannot rely on single features only but has to consider multiple features for a more stable evaluation.
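The difference and aggregation steps can be combined into one distance function. This NumPy sketch mirrors the simplified network of Fig. 2 (three layers with four feature maps each) and assumes pre-normalized features, p = 2 with a final square root, a weighted channel average, spatial averaging, and layer summation. The inverted-dropout scaling of the surviving weights by 1/(1 − 0.5) is an assumption borrowed from common deep learning frameworks; the text only states that weights are set to zero with a probability of 50%.

```python
import numpy as np


def layer_distance(fx, fy, w, training=False, rng=None, p_drop=0.5):
    """Distance contribution of one layer.

    fx, fy: normalized feature maps of one layer, shape (gc, gx, gy)
    w: one learnable aggregation weight per feature map, shape (gc,)
    """
    diff = (fx - fy) ** 2                                   # difference maps (p = 2)
    w = np.maximum(w, 0.0)                                  # clamp weights to be non-negative
    if training:
        keep = rng.random(w.shape) >= p_drop                # dropout on the weights
        w = w * keep / (1.0 - p_drop)                       # inverted-dropout scaling (assumption)
    channel = np.mean(w[:, None, None] * diff, axis=0)      # weighted channel average -> (gx, gy)
    return float(channel.mean())                            # spatial aggregation: average


def distance(feats_x, feats_y, weights, **kwargs):
    """Layer aggregation by summation, then the square root that matches p = 2."""
    total = sum(layer_distance(fx, fy, w, **kwargs)
                for fx, fy, w in zip(feats_x, feats_y, weights))
    return float(np.sqrt(total))


# Three layers with four feature maps each, as in the simplified Fig. 2.
rng = np.random.default_rng(2)
shapes = [(4, 16, 16), (4, 8, 8), (4, 4, 4)]
fx = [rng.normal(size=s) for s in shapes]
fy = [rng.normal(size=s) for s in shapes]
weights = [np.array([0.5, 1.0, -0.3, 2.0]) for _ in shapes]   # -0.3 is clamped to 0

print(distance(fx, fy, weights), distance(fx, fx, weights))
print(distance(fx, fy, weights, training=True, rng=rng))      # stochastic during training
```

Because all weights are clamped to be non-negative, the result stays a (pseudo)metric on the latent space: a zero weight simply ignores that feature.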
For spatial and layer aggregation, functions such as a summation or averaging are sufficient and generally interchangeable. We experimented with more intricate aggregation functions, e.g., by learning a spatial average or determining layer importance weights dynamically from the inputs. When the base network is fixed and the metric only has very few trainable weights, this did improve the overall performance. But with a fully trained base network, the feature extraction seems to take over these aspects automatically, making a more complicated aggregation unnecessary.
4. Data Generation and Training
Similarity data sets for natural images typically rely on changing already existing images with distortions, noise, or other operations and assigning ground truth distances according to the strength of the operation. Since we can control the data creation process for numerical simulations directly, we can generate large amounts of simulation data with increasing dissimilarities by altering the parameters used for the simulations. As a result, the data contains more information about the nature of the problem, i.e., which changes of the data distribution should lead to increased distances, than data created by applying modifications as a post-process.
4.1. Data Generation
Given a set of model equations, e.g., a PDE from fluid dynamics, typical solution methods consist of a solver that, given a set of boundary conditions, computes discrete approximations of the necessary differential operators. The discretized operators and the boundary conditions typically contain problem-dependent parameters, which we collectively denote with p0, p1, …, pi, … in the following. We only consider time-dependent problems, and our solvers start with initial conditions at t0 to compute a series of time steps t1, t2, … until a target point in time (tt) is reached. At that point, we obtain a reference output field o0 from one of the PDE variables, e.g., a velocity.
[Figure 3 diagram: initial conditions with parameters [p0 p1 ⋯ pi] are fed to a finite difference solver with time discretization over the steps t1, t2, …, tt, yielding the reference output o0; rerunning with [p0 p1 ⋯ pi + Δi] up to [p0 p1 ⋯ pi + n·Δi] yields o1, …, on (increasing parameter change, decreasing output similarity); for each run j, noise fields noise_{j,1}(s), noise_{j,2}(s), …, noise_{j,t}(s) are injected at the time steps.]

Figure 3. General data generation method from a PDE solver for a time-dependent problem. With increasing changes of the initial conditions for a parameter pi in ∆i increments, the outputs decrease in similarity. Controlled Gaussian noise is injected in a simulation field of the solver. The difficulty of the learning task can be controlled by scaling ∆i as well as the noise variance v.
For data generation, we incrementally change a single parameter pi in n steps ∆i, 2·∆i, …, n·∆i to create a series of n outputs o1, o2, …, on. We consider a series obtained in this way to be increasingly different from o0. To create natural variations of the resulting data distributions, we add Gaussian noise fields with zero mean and adjustable variance v to an appropriate simulation field such as a velocity. This noise allows us to generate a large number of varied data samples for a single simulation parameter pi. Furthermore, v serves as an additional parameter that can be varied in isolation to observe the same simulation with different levels of interference. This is similar in nature to numerical errors introduced by discretization schemes. These perturbations enlarge the space covered by the training data, and we found that training networks with suitable noise levels improves robustness, as we will demonstrate below. The process for data generation is summarized in Fig. 3.
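To make the procedure concrete, here is a toy version of Fig. 3 for a periodic 1D advection-diffusion problem, stacked over time into a 2D field (similar in spirit to the Adv data described in Section 6.1). The solver, its upwind discretization, the choice of the advection velocity as the varied parameter pi, and all numerical values are hypothetical; in this sketch every run, including the reference, receives its own noise realization in the velocity field.

```python
import numpy as np


def advect_diffuse_1d(u0, velocity, nu, steps, rng, noise_var=0.0, dt=0.1, dx=1.0):
    """Toy periodic 1D advection-diffusion solver (hypothetical stand-in).

    Explicit first-order upwind advection plus central diffusion. Gaussian noise
    with zero mean and variance noise_var is added to the velocity field at every
    time step. Returns all time steps stacked into a 2D array (steps, len(u0)).
    """
    u = np.asarray(u0, dtype=float).copy()
    rows = []
    for _ in range(steps):
        vel = np.full_like(u, velocity)
        if noise_var > 0.0:
            vel += rng.normal(0.0, np.sqrt(noise_var), size=u.shape)
        back = (u - np.roll(u, 1)) / dx                    # backward difference
        fwd = (np.roll(u, -1) - u) / dx                    # forward difference
        advection = vel * np.where(vel > 0.0, back, fwd)   # upwind selection
        diffusion = nu * (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2
        u = u + dt * (diffusion - advection)
        rows.append(u.copy())
    return np.stack(rows)


def generate_sequence(p=0.5, delta=0.05, n=10, steps=64, size=64, nu=0.05,
                      noise_var=1e-3, seed=0):
    """Reference o0 (parameter p) and outputs o1..on with parameter p + j * delta."""
    rng = np.random.default_rng(seed)
    x = np.arange(size)
    u0 = np.exp(-0.5 * ((x - size / 4) / 3.0) ** 2)         # Gaussian bump
    outputs = [advect_diffuse_1d(u0, p + j * delta, nu, steps, rng, noise_var)
               for j in range(n + 1)]
    gt = np.arange(n + 1) / n                               # normalized parameter change vs. o0
    return outputs, gt


outputs, gt = generate_sequence()
l2 = [float(np.linalg.norm(o - outputs[0])) for o in outputs]
print(np.round(l2, 3))   # L2 distances to o0 for j = 0..n
```

For real PDEs and larger ∆i or v, the L2 distances would no longer grow monotonically; the toy setting is deliberately mild.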
As PDEs can model extremely complex and chaotic behaviour, there is no guarantee that the outputs always exhibit increasing dissimilarity with the increasing parameter change. This behaviour is what makes the task of similarity assessment so challenging. Even if the solutions are essentially chaotic, their behaviour is not arbitrary but rather governed by the rules of the underlying PDE. For our data set, we choose the following range of representative PDEs: We include a pure Advection-Diffusion model (AD), and Burgers' equation (BE), which introduces an additional viscosity term. Furthermore, we use the full Navier-Stokes equations (NSE), which introduce a conservation of mass constraint. When combined with a deterministic solver and a suitable parameter step size, all these PDEs exhibit chaotic behaviour at small scales, and the medium to large scale characteristics of the solutions shift smoothly with increasing changes of the parameters pi.
The noise amplifies the chaotic behaviour to larger scales and provides a controlled amount of perturbations for the data generation. This lets the network learn about the nature of the chaotic behaviour of PDEs without overwhelming it with data where patterns are not observable anymore. The latter can easily happen when ∆i or v grow too large and produce essentially random outputs. Instead, we specifically target solutions that are difficult to evaluate in terms of a shallow metric. We heuristically select the smallest v and a suitable ∆i such that the correlation between the ground truth ordering of several random output samples and their ordering with respect to the L2 difference drops below 0.8. For the chosen PDEs, v was small enough to avoid deterioration of the physical behaviour, especially due to the diffusion terms, but different means of adjusting the difficulty may be necessary for other data.
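The difficulty heuristic can be sketched as a search over candidate noise variances. Assumptions in this sketch: the correlation is Spearman's rank correlation between L2 distances to o0 and the ground truth ordering, it is averaged over a few random seeds, ∆i is kept fixed, and `toy_generate` is a hypothetical stand-in for the PDE solver runs.

```python
import numpy as np


def ranks(a) -> np.ndarray:
    """Ranks 0..len(a)-1; assumes no ties (true for continuous random data here)."""
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r


def l2_ordering_correlation(generate, noise_var, n_samples=8) -> float:
    """Mean Spearman correlation between L2 distances to o0 and the ground truth."""
    scores = []
    for seed in range(n_samples):
        outputs, gt = generate(noise_var, seed)
        l2 = [np.linalg.norm(o - outputs[0]) for o in outputs]
        scores.append(np.corrcoef(ranks(l2), ranks(gt))[0, 1])
    return float(np.mean(scores))


def select_noise_variance(generate, candidates, threshold=0.8):
    """Smallest candidate variance for which the L2 ordering correlation drops below threshold."""
    for v in sorted(candidates):
        if l2_ordering_correlation(generate, v) < threshold:
            return v
    return None


def toy_generate(noise_var, seed, n=10, size=256, delta=0.02):
    """Hypothetical stand-in for noisy simulation runs drifting linearly away from o0."""
    rng = np.random.default_rng(seed)
    pattern = rng.normal(size=size)
    outputs = [j * delta * pattern + rng.normal(0.0, np.sqrt(noise_var), size=size)
               for j in range(n + 1)]
    return outputs, np.arange(n + 1) / n


CANDIDATES = [1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
v = select_noise_variance(toy_generate, CANDIDATES)
print(v, round(l2_ordering_correlation(toy_generate, v), 3))
```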
4.2. Training
For training, the 2D scalar fields from the simulations were augmented with random flips, 90° rotations, and cropping to obtain an input size of 224 × 224 every time they are used. Identical augmentations were applied to each field of one given sequence to ensure comparability. Afterwards, each input sequence is collectively normalized to the range [0, 255]. To allow for comparisons with image metrics and provide the possibility to compare color data and full velocity fields during inference, the metric uses three input channels. During training, the scalar fields are duplicated to each channel after augmentation. Unless otherwise noted, networks were trained with a batch size of 1 for 40 epochs with the Adam optimizer using a learning rate of 10⁻⁵. To evaluate the trained networks on validation and test inputs, only a bilinear resizing and the normalization step are applied.
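The augmentation and normalization steps can be written as a short NumPy function. This sketch assumes one random horizontal flip (which, combined with the 90° rotations, covers all eight orientations), a single random crop per sequence, and min-max normalization over the whole sequence; the field sizes in the usage example are hypothetical.

```python
import numpy as np


def augment_sequence(fields, rng, crop=224):
    """Augment one sequence o0..on identically, then normalize and replicate channels.

    fields: list of 2D arrays with identical shape, at least crop x crop.
    Returns an array of shape (len(fields), 3, crop, crop) with values in [0, 255].
    """
    flip = rng.random() < 0.5                      # one draw shared by the whole sequence
    k = int(rng.integers(4))                       # number of 90-degree rotations
    h, w = fields[0].shape
    if k % 2 == 1:                                 # 90/270 degree rotations swap the axes
        h, w = w, h
    top = int(rng.integers(h - crop + 1))
    left = int(rng.integers(w - crop + 1))
    out = []
    for f in fields:
        f = np.fliplr(f) if flip else f
        f = np.rot90(f, k)
        out.append(f[top:top + crop, left:left + crop])
    seq = np.stack(out).astype(np.float64)
    lo, hi = seq.min(), seq.max()
    seq = (seq - lo) / (hi - lo + 1e-12) * 255.0   # collective normalization of the sequence
    return np.repeat(seq[:, None], 3, axis=1)      # duplicate the scalar field to 3 channels


rng = np.random.default_rng(4)
fields = [rng.random((256, 288)) for _ in range(11)]   # o0..o10, hypothetical sizes
batch = augment_sequence(fields, rng)
print(batch.shape)   # (11, 3, 224, 224)
```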
5. Correlation Loss Function
The central goal of our networks is to identify relative differences of input pairs produced via numerical simulations. Thus, instead of employing a loss that forces the network to only infer given labels or distance values, we train our networks to infer the ordering of a given sequence of simulation outputs o0, o1, …, on. We propose to use the Pearson correlation coefficient (see Pearson, 1920), which yields a value in [−1, 1] that measures the linear relationship between two distributions. A value of 1 implies that a linear equation with positive slope describes their relationship perfectly. We compute this coefficient for a full series of outputs such that the network can learn to extract features that arrange this data series in the correct ordering. Each training sample of our network consists of every possible pair from the sequence o0, o1, …, on and the corresponding ground truth distance distribution c ∈ [0, 1]^(0.5(n+1)n) representing the parameter change from the data generation. For a distance prediction d ∈ [0, ∞)^(0.5(n+1)n) of our network for one sample, we compute the loss with:
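$$L(c, d) = \lambda_1 (c - d)^2 + \lambda_2 \left( 1 - \frac{(c - \bar{c}) \cdot (d - \bar{d})}{\lVert c - \bar{c} \rVert_2 \, \lVert d - \bar{d} \rVert_2} \right) \tag{5}$$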
Here, the mean of a distance vector is denoted by c̄ and d̄ for ground truth and prediction, respectively. The first part of the loss is a regular MSE term, which minimizes the difference between predicted and actual distances. The second part is the Pearson correlation coefficient, subtracted from one such that minimizing the loss maximizes the correlation. As this formulation depends on the length of the input sequence, the two terms are scaled with λ1 and λ2 to adjust their relative influence. For the training, we chose n = 10 variations for each reference simulation. If n should vary during training, the influence of both terms needs to be adjusted accordingly. We found that scaling both terms to a similar order of magnitude worked best in our experiments.
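A forward evaluation of Eq. (5) for one training sample can be sketched as follows. It assumes that the ground truth for the pair (oi, oj) is the normalized parameter change |i − j|/n, that the MSE term is averaged over all pairs, and that λ1 = λ2 = 1; these choices are illustrative, not the values used for LSiM. A training setup would compute the same expression in an autodiff framework.

```python
import itertools

import numpy as np


def pair_ground_truth(n: int):
    """All pairs (i, j), i < j, of o0..on and their ground truth distances.

    Assumption: the target for a pair is the normalized parameter change |i - j| / n.
    """
    pairs = list(itertools.combinations(range(n + 1), 2))
    c = np.array([abs(i - j) / n for i, j in pairs])
    return pairs, c


def correlation_loss(c, d, lam1=1.0, lam2=1.0, eps=1e-10) -> float:
    """Eq. (5): lam1 * MSE + lam2 * (1 - Pearson correlation of c and d)."""
    mse = np.mean((c - d) ** 2)
    cc, dc = c - c.mean(), d - d.mean()
    pearson = np.dot(cc, dc) / (np.linalg.norm(cc) * np.linalg.norm(dc) + eps)
    return float(lam1 * mse + lam2 * (1.0 - pearson))


pairs, c = pair_ground_truth(10)
print(len(pairs))                       # 55 = 0.5 * (n + 1) * n for n = 10
print(correlation_loss(c, 2 * c + 1))   # correct ordering, wrong values: mostly the MSE term
print(correlation_loss(c, 1 - c))       # reversed ordering: correlation term close to 2
```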
[Figure 4 plot: correlation on all test data (x-axis, 0.62 to 0.74) of LSiM (ours) and AlexNetfrozen for the loss functions MSE, cross cor., Pearson cor., MSE + cross cor., and proposed (y-axis).]

Figure 4. Performance comparison on our test data of the proposed approach (LSiM) and a smaller model (AlexNetfrozen) for different loss functions on the y-axis.
In Fig. 4, we investigate how the proposed loss function compares to other commonly used loss formulations for our full network and for a pre-trained network where only the aggregation weights are learned. The performance is measured via Spearman's rank correlation of predicted against ground truth distances on our combined test data sets. This is comparable to the All column in Tab. 1 and described in more detail in Section 6.2. In addition to our full loss function, we consider a loss function that replaces the Pearson correlation with a simpler cross-correlation (c · d) / (‖c‖2 ‖d‖2). We also include networks trained with only the MSE or only the correlation terms for each of the two variants.
A simple MSE loss yields the worst performance for both evaluated models. Using any correlation-based loss function for the AlexNetfrozen metric (see Section 6.2) improves the results, but there is no major difference between them due to the limited number of only 1152 trainable weights. For LSiM, the proposed combination of MSE loss with the Pearson correlation performs better than using cross-correlation or only an isolated Pearson correlation. Interestingly, combining cross-correlation with MSE yields worse results than cross-correlation by itself. This is caused by the cross-correlation term influencing absolute distance values, which potentially conflicts with the MSE term. For our loss, the Pearson correlation only handles the relative ordering while the MSE deals with the absolute distances, leading to better inferred distances.
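The difference between the two correlation terms is easy to see numerically: the Pearson correlation is invariant to adding a constant offset to the predicted distances, whereas the cross-correlation (c · d) / (‖c‖2 ‖d‖2) is not. The following sketch, with hypothetical distance values, illustrates why a cross-correlation term also pulls on absolute values and can therefore interfere with the MSE term.

```python
import numpy as np


def pearson(c: np.ndarray, d: np.ndarray) -> float:
    """Pearson correlation: centers both vectors before the cosine similarity."""
    cc, dc = c - c.mean(), d - d.mean()
    return float(cc @ dc / (np.linalg.norm(cc) * np.linalg.norm(dc)))


def cross_correlation(c: np.ndarray, d: np.ndarray) -> float:
    """Normalized cross-correlation without centering (a plain cosine similarity)."""
    return float(c @ d / (np.linalg.norm(c) * np.linalg.norm(d)))


c = np.linspace(0.1, 1.0, 10)   # ground truth distances
d = 0.5 * c                     # prediction with the correct ordering
for offset in (0.0, 1.0, 10.0):
    # Pearson stays at 1, cross-correlation decreases as the offset grows
    print(offset, round(pearson(c, d + offset), 4), round(cross_correlation(c, d + offset), 4))
```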
6. Results
In the following, we will discuss how the data generation approach was employed to create a large range of training and test data from different PDEs. Afterwards, the proposed metric is compared to other metrics, and its robustness is evaluated with several external data sets.
6.1. Data Sets
We created four training (Smo, Liq, Adv, and Bur) and two test data sets (LiqN and AdvD) with ten parameter steps for each reference simulation. Based on two 2D NSE solvers, the smoke and liquid simulation training sets (Smo and Liq) add noise to the velocity field and feature varied initial conditions such as fluid position or obstacle properties, in addition to variations of buoyancy and gravity forces. The two other training sets (Adv and Bur) are based on 1D solvers for AD and BE, concatenated over time to form a 2D result. In both cases, noise was injected into the velocity field, and the varied parameters are changes to the field initialization and forcing functions.
For the test data sets, we substantially change the data distribution by injecting noise into the density instead of the velocity field for AD simulations to obtain the AdvD data set, and by including background noise for the velocity field of a liquid simulation (LiqN). In addition, we employed three more test sets (Sha, Vid, and TID) created without PDE models to explore the generalization for data far from our training data setup. We include a shape data set (Sha) that features multiple randomized moving rigid shapes, a video data set (Vid) consisting of frames from random video footage, and TID2013 (Ponomarenko et al., 2015) as a perceptual image data set (TID). Below, we additionally list a combined correlation score (All) for all test sets apart from TID, which is excluded due to its different structure. Examples for each data set are shown in Fig. 5, and generation details with further samples can be found in App. D.
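To make the 1D-to-2D construction of Adv-style data more concrete, here is a minimal sketch under stated assumptions: a toy explicit finite-difference solver for 1D advection-diffusion on a periodic grid, whose successive time steps are stacked row by row into a 2D array. Variations are produced by increasing one parameter (here a hypothetical source strength) in equal steps, with optional noise injected into the velocity. The scheme, parameter names, and values are illustrative only and are not the generation scripts from the LSiM repository.

```python
import numpy as np


def advect_diffuse_1d(u0, velocity, diffusion, source, steps=64, dt=0.1, dx=1.0):
    """Explicit upwind advection + central diffusion on a periodic 1D grid.

    Returns a (steps, n) array: each row is one time step, so stacking the
    1D states over time yields a 2D space-time image.
    """
    u = u0.copy()
    rows = []
    for _ in range(steps):
        # upwind difference (valid for positive velocity) and central second difference
        advection = -velocity * (u - np.roll(u, 1)) / dx
        diffusion_term = diffusion * (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2
        u = u + dt * (advection + diffusion_term + source)
        rows.append(u.copy())
    return np.stack(rows)


def make_variations(n=64, num_steps=10, step_size=0.02, noise=0.01, seed=0):
    """Reference (index 0) plus num_steps variations in equal parameter steps."""
    rng = np.random.default_rng(seed)
    x = np.arange(n)
    u0 = np.exp(-0.5 * ((x - n / 4) / 4.0) ** 2)       # initial Gaussian bump
    profile = np.exp(-0.5 * ((x - n / 2) / 6.0) ** 2)  # fixed spatial source shape
    outputs = []
    for i in range(num_steps + 1):
        velocity = 0.5 + noise * rng.standard_normal(n)  # noise injected into velocity
        source = i * step_size * profile                 # parameter changed in equal steps
        outputs.append(advect_diffuse_1d(u0, velocity, 0.1, source))
    return outputs  # assumed ground truth distance of outputs[i] to outputs[0]: i


sims = make_variations()
print(len(sims), sims[0].shape)  # → 11 (64, 64)
```

Without noise the scheme is affine in the source term, so the L2 distance to the reference grows linearly with the step index; the injected velocity noise is what makes the ordering harder to recover.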
6.2. Performance Evaluation
To evaluate the performance of a metric on a data set, we first compute the distances from each reference simulation to all corresponding variations. Then, the predicted and the ground truth distance distributions over all samples are combined and compared using Spearman's rank correlation coefficient (see Spearman, 1904). It is similar to the Pearson correlation, but it uses ranking variables instead, i.e., it measures monotonic relationships of distributions.
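The evaluation protocol can be sketched in a few lines of NumPy. Spearman's coefficient is computed as the Pearson correlation of ranks, with tied values (such as the repeated ground truth steps of different references) receiving their average rank, which is also how scipy.stats.spearmanr treats ties. The data layout, the use of the parameter step index as ground truth distance, and the L2 example metric are assumptions for illustration.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    # last rank of each unique value minus half the width of its tie group
    return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]


def spearman(a, b):
    """Spearman's rho as the Pearson correlation of (average) ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def evaluate_metric(metric, data):
    """data: list of (reference, [variation_1, ..., variation_n]) pairs.

    Predicted and ground truth distances of all samples are pooled
    before correlating them.
    """
    predicted, ground_truth = [], []
    for reference, variations in data:
        for step, variation in enumerate(variations, start=1):
            predicted.append(metric(reference, variation))
            ground_truth.append(step)
    return spearman(predicted, ground_truth)


def l2(x, y):
    return float(np.linalg.norm(x - y))


rng = np.random.default_rng(0)
data = []
for _ in range(3):
    ref = rng.standard_normal((32, 32))
    direction = rng.standard_normal((32, 32))
    data.append((ref, [ref + 0.1 * i * direction for i in range(1, 11)]))
print(evaluate_metric(l2, data))
```

Pooling across references matters: a metric must not only order the variations of one reference correctly but also produce comparable distance magnitudes across references.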
The top part of Tab. 1 shows the performance of the shallow metrics L2 and SSIM as well as the LPIPS metric (Zhang et al., 2018) for all our data sets. The results clearly show that shallow metrics are not suitable to compare the samples in our data set and only rarely achieve good correlation values. The perceptual LPIPS metric performs better in general and matches or outperforms our method on the image data sets TID and Vid. This is not surprising as LPIPS is specifically trained for such images. For most of the simulation data sets, however, it performs significantly worse than for the image content. The last row of Tab. 1 shows the results of our LSiM model with a very good performance across all data sets and no negative outliers. Note that although it was not trained with any natural image content, it still performs well for the image test sets.
Figure 5. Samples from our data sets. For each subset, the reference is on the left, and three variations in equal parameter steps follow. From left to right and top to bottom: Smo (density, velocity, and pressure), Adv (density), Liq (flags, velocity, and levelset), Bur (velocity), LiqN (velocity), AdvD (density), Sha, and Vid.
Table 1. Performance comparison of existing metrics (top block), experimental designs (middle block), and variants of the proposed method (bottom block) on validation and test data sets, measured in terms of Spearman's rank correlation coefficient of ground truth against predicted distances. Bold+underlined values show the best performing metric for each data set, bold values are within a 0.01 error margin of the best performing, and italic values are 0.2 or more below the best performing. On the right, a visualization of the combined test data results is shown for selected models.
Validation data sets: Smo, Liq, Adv, Bur. Test data sets: TID, LiqN, AdvD, Sha, Vid, and the combined score All.

Metric | Smo | Liq | Adv | Bur | TID | LiqN | AdvD | Sha | Vid | All
L2 | 0.66 | 0.80 | 0.74 | 0.62 | 0.82 | 0.73 | 0.57 | 0.58 | 0.79 | 0.61
SSIM | 0.69 | 0.73 | 0.77 | 0.71 | 0.77 | 0.26 | 0.69 | 0.46 | 0.75 | 0.53
LPIPS v0.1 | 0.63 | 0.68 | 0.68 | 0.72 | 0.86 | 0.50 | 0.62 | 0.84 | 0.83 | 0.66

AlexNet_random | 0.63 | 0.69 | 0.69 | 0.66 | 0.82 | 0.64 | 0.65 | 0.67 | 0.81 | 0.65
AlexNet_frozen | 0.66 | 0.70 | 0.69 | 0.71 | 0.85 | 0.40 | 0.62 | 0.87 | 0.84 | 0.65
Optical flow | 0.62 | 0.57 | 0.36 | 0.37 | 0.55 | 0.49 | 0.28 | 0.61 | 0.75 | 0.48
Non-Siamese | 0.77 | 0.85 | 0.78 | 0.74 | 0.65 | 0.81 | 0.64 | 0.25 | 0.80 | 0.60
Skip_from scratch | 0.79 | 0.83 | 0.80 | 0.74 | 0.85 | 0.78 | 0.61 | 0.78 | 0.83 | 0.71

LSiM_noiseless | 0.77 | 0.77 | 0.76 | 0.72 | 0.85 | 0.62 | 0.58 | 0.86 | 0.82 | 0.68
LSiM_strong noise | 0.65 | 0.65 | 0.67 | 0.69 | 0.84 | 0.39 | 0.54 | 0.89 | 0.82 | 0.64
LSiM (ours) | 0.78 | 0.82 | 0.79 | 0.75 | 0.86 | 0.79 | 0.58 | 0.88 | 0.81 | 0.73
[Plot: correlation on the combined test data (All) for L2, SSIM, LPIPS, Opt. Flow, Non-Siam., Skip, and LSiM, grouped as Shallow, Image-based, Experimental, and Proposed.]
The middle block of Tab. 1 contains several interesting variants (more details can be found in App. B): AlexNet_random and AlexNet_frozen are small models, where the base network is the original AlexNet with pre-trained weights. AlexNet_random contains purely random aggregation weights without training, whereas AlexNet_frozen only has trainable weights for the channel aggregation and therefore lacks the flexibility to fully adjust to the data distribution of the numerical simulations. The random model performs surprisingly well in general, pointing to the strengths of the underlying Siamese CNN architecture.
Recognizing that many PDEs include transport phenomena, we investigated optical flow (Horn & Schunck, 1981) as a means to compute motion from field data. For the Optical flow metric, we used FlowNet2 (Ilg et al., 2016) to bidirectionally compute the optical flow field between two inputs and aggregate it to a single distance value by summing all flow vector magnitudes. On the data set Vid, which is similar to the training data of FlowNet2, it performs relatively well, but in most other cases it performs poorly. This shows that computing a simple warping from one input to the other is not enough for a stable metric, although it seems like an intuitive solution. A more robust metric needs knowledge of the underlying features and their changes to generalize better to new data.
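The aggregation used for the Optical flow metric can be sketched independently of the flow estimator. Below, the forward and backward flow fields are assumed to be given as (H, W, 2) arrays, for example from an optical flow network such as FlowNet2; computing them is outside the scope of the sketch, and the function name is hypothetical.

```python
import numpy as np


def flow_distance(flow_forward, flow_backward):
    """Sum of all flow vector magnitudes in both directions."""
    magnitudes_fw = np.linalg.norm(flow_forward, axis=-1)
    magnitudes_bw = np.linalg.norm(flow_backward, axis=-1)
    return float(magnitudes_fw.sum() + magnitudes_bw.sum())


# Toy example: a uniform shift by one pixel forward and back.
h, w = 4, 5
forward = np.zeros((h, w, 2))
forward[..., 0] = 1.0
backward = -forward
print(flow_distance(forward, backward))  # → 40.0
```

Since such a distance only reflects how far content appears to move, it carries no information about changes that a warp cannot express, which is in line with the weak results outside of Vid.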
To evaluate whether a Siamese architecture is really beneficial, we used a Non-Siamese architecture that directly predicts the distance from both stacked inputs. For this purpose, we employed a modified version of AlexNet that reduces the weights of the feature extractor by 50% and of the remaining layers by 90%. As expected, this metric works well on the validation data but has huge problems with generalization, especially on TID and Sha. In addition, even simple metric properties such as symmetry are no longer guaranteed because this architecture does not have the inherent constraints of the Siamese setup. Finally, we experimented with multiple fully trained base networks. As re-training existing feature extractors only provided small improvements, we used a custom base network with skip connections for the Skip_from scratch metric. Its results already come close to the proposed approach on most data sets.
The last block in Tab. 1 shows variants of the proposed approach trained with varied noise levels. This inherently changes the difficulty of the data. Hence, LSiM_noiseless was trained with relatively simple data without perturbations, whereas LSiM_strong noise was trained with strongly varying data. Both cases decrease the capabilities of the trained model on some of the validation and test sets. This indicates that the network needs to see a certain amount of variation at training time in order to become robust, but overly large changes hinder the learning of useful features (also see App. C).
6.3. Evaluation on Real-World Data
To evaluate the generalizing capabilities of our trained metric, we turn to three representative and publicly available data sets of captured and simulated real-world phenomena, namely buoyant flows, turbulence, and weather. For the first, we make use of the ScalarFlow data set (Eckert et al., 2019), which consists of captured velocities of buoyant scalar transport flows. Additionally, we include velocity data from the Johns Hopkins Turbulence Database (JHTDB)
Figure 6. Examples from three real-world data repositories used for evaluation, visualized via color-mapping. Each block features four different sequences (rows) with frames in equal temporal or spatial intervals. Left: ScalarFlow – captured buoyant volumetric transport flows using the z-slice (top two) and z-mean (bottom two). Middle: JHTDB – four different turbulent DNS simulations. Right: WeatherBench – weather data consisting of temperature (top two) and geopotential (bottom two).
(Perlman et al., 2007), which represents direct numerical simulations of fully developed turbulence. As a third case, we use scalar temperature and geopotential fields from the WeatherBench repository (Rasp et al., 2020), which contains global climate data on a Cartesian latitude-longitude grid of the earth. Visualizations of this data via color-mapping the scalar fields or velocity magnitudes are shown in Fig. 6.
[Plot: average correlation of L2, SSIM, LPIPS, and LSiM (ours) on ScalarFlow, JHTDB, and WeatherBench.]
Figure 7. Spearman correlation values for multiple metrics on data from three repositories. Shown are mean and standard deviation over different temporal or spatial intervals used to create sequences.
For the results in Fig. 7, we extracted sequences of frames with fixed temporal and spatial intervals from each data set to obtain a ground truth ordering. Six different interval spacings for every data source are employed, and all velocity data is split by component. We then measure how well different metrics recover the original ordering in the presence of the complex changes of content, driven by the underlying physical processes. The LSiM model outlined in previous sections was used for inference without further changes.
Every metric is separately evaluated (see Section 6.2) for the six interval spacings with 180–240 sequences each. For ScalarFlow and WeatherBench, the data was additionally partitioned by z-slice or z-mean and temperature or geopotential, respectively, leading to twelve evaluations. Fig. 7 shows the mean and standard deviation of the resulting correlation values. Despite never being trained on any data from these data sets, LSiM recovers the ordering of all three cases with consistently high accuracy. It yields averaged correlations of 0.96 ± 0.02, 0.95 ± 0.05, and 0.95 ± 0.06 for ScalarFlow, JHTDB, and WeatherBench, respectively. The other metrics show lower means and higher uncertainty. Further details and results for the individual evaluations can be found in App. E.
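The sequence-based evaluation can be sketched as follows: frames at a fixed spacing are cut from a longer time series, the first frame of each sequence serves as reference, and the position within the sequence provides the ground truth ordering; distances of all sequences with the same spacing are pooled before computing Spearman's coefficient. The toy frames (a slowly drifting Gaussian blob), the spacings, the sequence length, and the stride are placeholders, not the settings used for ScalarFlow, JHTDB, or WeatherBench.

```python
import numpy as np


def rank_average(v):
    """Ranks starting at 1; tied values share their average rank."""
    v = np.asarray(v, dtype=float)
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]


def spearman(a, b):
    return float(np.corrcoef(rank_average(a), rank_average(b))[0, 1])


def extract_sequences(frames, spacing, length, stride):
    """Cut sequences of `length` frames that are `spacing` steps apart.

    frames: array of shape (T, H, W); a new sequence starts every `stride` steps.
    """
    span = spacing * (length - 1)
    return [frames[start:start + span + 1:spacing]
            for start in range(0, len(frames) - span, stride)]


def ordering_score(metric, sequences):
    """Pool distances to the first frame of each sequence and correlate them
    with the frame position, which acts as the ground truth ordering."""
    predicted, ground_truth = [], []
    for seq in sequences:
        for position in range(1, len(seq)):
            predicted.append(metric(seq[0], seq[position]))
            ground_truth.append(position)
    return spearman(predicted, ground_truth)


def l2(a, b):
    return float(np.linalg.norm(a - b))


# Toy time series: a Gaussian blob drifting slowly to the right.
t = np.arange(200)[:, None, None]
yy, xx = np.mgrid[0:32, 0:32]
frames = np.exp(-((xx - 8 - 0.05 * t) ** 2 + (yy - 16) ** 2) / 20.0)

for spacing in (1, 2, 4):
    seqs = extract_sequences(frames, spacing, length=6, stride=10)
    print(spacing, len(seqs), round(ordering_score(l2, seqs), 3))
```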
7. Conclusion
We have presented the LSiM metric to reliably and robustly compare outputs from numerical simulations. Our method significantly outperforms existing shallow metric functions and provides better results than other learned metrics. We demonstrated the usefulness of the correlation loss, showed the benefits of a controlled data generation environment, and highlighted the stability of the obtained metric for a range of real-world data sets.
Our trained LSiM metric has the potential to impact a wide range of fields, including the fast and reliable accuracy assessment of new simulation methods, robust optimizations of parameters for reconstructions of observations, and guiding generative models of physical systems. Furthermore, it will be highly interesting to evaluate other loss functions, e.g., mutual information (Bachman et al., 2019) or contrastive predictive coding (Hénaff et al., 2019), and combinations with evaluations from perceptual studies (Um et al., 2019). We also plan to evaluate our approach for an even larger set of PDEs as well as for 3D and 4D data sets. Turbulent flows in particular are a highly relevant and interesting area for future work on learned evaluation metrics.
Acknowledgements
This work was supported by the ERC Starting Grant realFlow (StG-2015-637014). We would like to thank Stephan Rasp for preparing the WeatherBench data and all reviewers for helping to improve this work.
References
Agrawal, P., Carreira, J., and Malik, J. Learning to see by moving. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 37–45, 2015. doi:10.1109/ICCV.2015.13.
Amirshahi, S. A., Pedersen, M., and Yu, S. X. Image Quality Assessment by Comparing CNN Features between Images. Journal of Imaging Science and Technology, 60(6), 2016. doi:10.2352/J.ImagingSci.Technol.2016.60.6.060410.
Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. CoRR, abs/1906.00910, 2019. URL http://arxiv.org/abs/1906.00910.
Bell, S. and Bala, K. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics, 34(4):98:1–98:10, 2015. doi:10.1145/2766959.
Benajiba, Y., Sun, J., Zhang, Y., Jiang, L., Weng, Z., and Biran, O. Siamese networks for semantic pattern similarity. CoRR, abs/1812.06604, 2018. URL http://arxiv.org/abs/1812.06604.
Berardino, A., Balle, J., Laparra, V., and Simoncelli, E. Eigen-Distortions of Hierarchical Representations. In Advances in Neural Information Processing Systems 30 (NIPS 2017), volume 30, 2017. URL http://arxiv.org/abs/1710.02266.
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., and Torr, P. H. S. Fully-Convolutional Siamese Networks for Object Tracking. In Computer Vision - ECCV 2016 Workshops, PT II, volume 9914, pp. 850–865, 2016. doi:10.1007/978-3-319-48881-3_56.
Bosse, S., Maniry, D., Mueller, K.-R., Wiegand, T., and Samek, W. Neural Network-Based Full-Reference Image Quality Assessment. In 2016 Picture Coding Symposium (PCS), 2016. doi:10.1109/PCS.2016.7906376.
Chandar, S., Khapra, M. M., Larochelle, H., and Ravindran, B. Correlational neural networks. Neural Computation, 28(2):257–285, 2016. doi:10.1162/NECO_a_00801.
Chu, M. and Thuerey, N. Data-Driven Synthesis of Smoke Flows with CNN-based Feature Descriptors. ACM Transactions on Graphics, 36(4):69:1–69:14, 2017. doi:10.1145/3072959.3073643.
Dosovitskiy, A. and Brox, T. Generating Images with Perceptual Similarity Metrics based on Deep Networks. In Advances in Neural Information Processing Systems 29 (NIPS 2016), volume 29, 2016. URL http://arxiv.org/abs/1602.02644.
Eckert, M.-L., Um, K., and Thuerey, N. Scalarflow: A large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning. ACM Transactions on Graphics, 38(6), 2019. doi:10.1145/3355089.3356545.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.
Haben, S., Ward, J., Greetham, D. V., Singleton, C., and Grindrod, P. A new error measure for forecasts of household-level, high resolution electrical energy consumption. International Journal of Forecasting, 30(2):246–256, 2014. doi:10.1016/j.ijforecast.2013.08.002.
Hanif, M. S. Patch match networks: Improved two-channel and Siamese networks for image patch matching. Pattern Recognition Letters, 120:54–61, 2019. doi:10.1016/j.patrec.2019.01.005.
He, H., Chen, M., Chen, T., Li, D., and Cheng, P. Learning to match multitemporal optical satellite images using multi-support-patches Siamese networks. Remote Sensing Letters, 10(6):516–525, 2019. doi:10.1080/2150704X.2019.1577572.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi:10.1109/CVPR.2016.90.
Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S. M. A., and van den Oord, A. Data-efficient image recognition with contrastive predictive coding. CoRR, abs/1905.09272, 2019. URL http://arxiv.org/abs/1905.09272.
Horn, B. K. and Schunck, B. G. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981. doi:10.1016/0004-3702(81)90024-2.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017. doi:10.1109/CVPR.2017.243.
Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and
Kim, J. and Lee, S. Deep Learning of Human Visual Sensitivity in Image Quality Assessment Framework. In 30TH IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1969–1977, 2017. doi:10.1109/CVPR.2017.213.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017. doi:10.1145/3065386.
Larson, E. C. and Chandler, D. M. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1), 2010. doi:10.1117/1.3267105.
Lin, Z., Hahm, T. S., Lee, W., Tang, W. M., and White, R. B. Turbulent transport reduction by zonal flows: Massively parallel simulations. Science, 281(5384):1835–1837, 1998. doi:10.1126/science.281.5384.1835.
Liu, X., Pedersen, M., and Hardeberg, J. Y. CID:IQ - A New Image Quality Database. In Image and Signal Processing, ICISP 2014, volume 8509, pp. 193–202, 2014. doi:10.1007/978-3-319-07998-1_22.
Moin, P. and Mahesh, K. Direct numerical simulation: a tool in turbulence research. Annual review of fluid mechanics, 30(1):539–578, 1998. doi:10.1146/annurev.fluid.30.1.539.
Oberkampf, W. L., Trucano, T. G., and Hirsch, C. Verification, validation, and predictive capability in computational engineering and physics. Applied Mechanics Reviews, 57:345–384, 2004. doi:10.1115/1.1767847.
Pearson, K. Notes on the History of Correlation. Biometrika, 13(1):25–45, 1920. doi:10.1093/biomet/13.1.25.
Perlman, E., Burns, R., Li, Y., and Meneveau, C. Data exploration of turbulence simulations using a database cluster. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–11, 2007. doi:10.1145/1362622.1362654.
Pitsch, H. Large-eddy simulation of turbulent combustion. Annu. Rev. Fluid Mech., 38:453–482, 2006. doi:10.1146/annurev.fluid.38.050304.092133.
Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., and Kuo, C. C. J. Image database TID2013: Peculiarities, results and perspectives. Signal Processing-Image Communication, 30:57–77, 2015. doi:10.1016/j.image.2014.10.009.
Prashnani, E., Cai, H., Mostofi, Y., and Sen, P. Pieapp: Perceptual image-error assessment through pairwise preference. CoRR, abs/1806.02067, 2018. URL http://arxiv.org/abs/1806.02067.
Rasp, S., Dueben, P., Scher, S., Weyn, J., Mouatadid, S., and Thuerey, N. Weatherbench: A benchmark dataset for data-driven weather forecasting. CoRR, abs/2002.00469, 2020. URL http://arxiv.org/abs/2002.00469.
Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015. URL http://arxiv.org/abs/1505.04597.
Ruder, M., Dosovitskiy, A., and Brox, T. Artistic style transfer for videos. In Pattern Recognition, pp. 26–36, 2016. doi:10.1007/978-3-319-45886-1_3.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. URL http://arxiv.org/abs/1409.1556.
Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. doi:10.2307/1412159.
Talebi, H. and Milanfar, P. Learned Perceptual Image Enhancement. In 2018 IEEE International Conference on Computational Photography (ICCP), 2018a. doi:10.1109/ICCPHOT.2018.8368474.
Talebi, H. and Milanfar, P. NIMA: Neural Image Assessment. IEEE Transactions on Image Processing, 27(8):3998–4011, 2018b. doi:10.1109/TIP.2018.2831899.
Thuerey, N., Weissenow, K., Mehrotra, H., Mainali, N., Prantl, L., and Hu, X. Well, how accurate is it? A study of deep learning methods for reynolds-averaged navier-stokes simulations. CoRR, abs/1810.08217, 2018. URL http://arxiv.org/abs/1810.08217.
Um, K., Hu, X., and Thuerey, N. Perceptual Evaluation of Liquid Simulation Methods. ACM Transactions on Graphics, 36(4), 2017. doi:10.1145/3072959.3073633.
Um, K., Hu, X., Wang, B., and Thuerey, N. Spot the Difference: Accuracy of Numerical Simulations via the Human Visual System. CoRR, abs/1907.04179, 2019. URL http://arxiv.org/abs/1907.04179.
Wang, X. and Gupta, A. Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2794–2802, 2015. doi:10.1109/ICCV.2015.320.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi:10.1109/TIP.2003.819861.
Wang, Z., Zhang, J., and Xie, Y. L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks. In 2018 11TH International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 444–448, 2018. doi:10.1109/ISCSLP.2018.8706597.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586–595, 2018. doi:10.1109/CVPR.2018.00068.
Zhang, Y. and Duan, Z. IMINET: Convolutional Semi-Siamese Networks for Sound Search by Vocal Imitation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 304–308, 2017. doi:10.1109/TASLP.2018.2868428.
Zhu, Y. and Bridson, R. Animating sand as a fluid. In ACM SIGGRAPH 2005 Papers, pp. 965–972, New York, NY, USA, 2005. doi:10.1145/1186822.1073298.
Appendix: Learning Similarity Metrics for Numerical Simulations
This supplemental document contains an analysis of the proposed metric design with respect to properties of metrics in general (App. A) and details on the used network architectures (App. B). Afterwards, material that deals with the data sets is provided. It contains examples and failure cases for each of the data domains and analyzes the impact of the data difficulty (App. C and D). Next, the evaluation on real-world data is described in more detail (App. E). Finally, we explore additional metric evaluations (App. F) and give an overview of the used notation (App. G).
The source code for using the trained LSiM metric and re-training the model from scratch is available at https://github.com/tum-pbs/LSIM. This includes the full data sets and the corresponding data generation scripts for the employed PDE solvers.
A. Discussion of Metric Properties
To analyze if the proposed method qualifies as a metric, it is split into two functions $m_1: \mathcal{I} \to \mathcal{L}$ and $m_2: \mathcal{L} \times \mathcal{L} \to [0, \infty)$, which operate on the input space $\mathcal{I}$ and the latent space $\mathcal{L}$. Through flattening elements from the input or latent space into vectors, $\mathcal{I} \simeq \mathbb{R}^a$ and $\mathcal{L} \simeq \mathbb{R}^b$, where $a$ and $b$ are the dimensions of the input data and all feature maps, respectively, and both values have a similar order of magnitude. $m_1$ describes the non-linear function computed by the base network combined with the following normalization and returns a point in the latent space. $m_2$ uses two points in the latent space to compute a final distance value; thus it includes the latent space difference and the aggregation along the spatial, layer, and channel dimensions. With the Siamese network architecture, the resulting function for the entire approach is
$$m(x, y) = m_2(m_1(x), m_1(y)).$$
The identity of indiscernibles mainly depends on $m_1$ because, even if $m_2$ itself guarantees this property, $m_1$ could still be non-injective, which means it can map different inputs to the same point in latent space, $\tilde{x} = \tilde{y}$ for $x \neq y$. Due to the complicated nature of $m_1$, it is difficult to make accurate predictions about its injectivity. Each base network layer of $m_1$ recursively processes the result of the preceding layer with various feature extracting operations. Here, the intuition is that significant changes in the input should produce different feature map results in one or more layers of the network. As very small changes in the input lead to zero valued distances predicted by the CNN (i.e., an identical latent space for different inputs), $m_1$ is in practice not injective. In an additional experiment, the proposed architecture was evaluated on about 3500 random inputs from all our data sets, where the CNN received one unchanged and one slightly modified input. The modification consisted of multiple pixel adjustments by one bit (on 8-bit color images) in random positions and channels. When adjusting only a single pixel in the 224 × 224 input, the CNN predicts a zero valued distance on about 23% of the inputs, but in all experiments we never observed an input where seven or more changed pixels resulted in a distance of zero.
In this context, the problem of numerical errors is important because even two slightly different latent space representations could lead to a result that seems to be zero if the difference vanishes in the aggregation operations or is smaller than the floating point precision. On the other hand, an automated analysis to find points that have a different input but an identical latent space image is a challenging problem and left as future work.
The evaluation of the base network and the normalization is deterministic, and hence $\forall x: m_1(x) = m_1(x)$ holds. Furthermore, we know that $m(x, x) = 0$ if $m_2$ guarantees that $\forall m_1(x): m_2(m_1(x), m_1(x)) = 0$. Thus, the remaining properties, i.e., non-negativity, symmetry, and the triangle inequality, only depend on $m_2$, since for them the original inputs are not relevant, but their respective images in the latent space. The resulting structure with a relaxed identity of indiscernibles is called a pseudometric, where $\forall \tilde{x}, \tilde{y}, \tilde{z} \in \mathcal{L}$:
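$$m_2(\tilde{x}, \tilde{y}) \geq 0 \tag{6}$$
$$m_2(\tilde{x}, \tilde{y}) = m_2(\tilde{y}, \tilde{x}) \tag{7}$$
$$m_2(\tilde{x}, \tilde{y}) \leq m_2(\tilde{x}, \tilde{z}) + m_2(\tilde{z}, \tilde{y}) \tag{8}$$
$$m_2(\tilde{x}, \tilde{x}) = 0 \tag{9}$$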
Notice that $m_2$ has to fulfill these properties with respect to the latent space but not the input space. If $m_2$ is carefully constructed, the metric properties still apply, independently of the actual design of the base network or the feature map normalization.
A first observation concerning $m_2$ is that if all aggregations were sum operations and the element-wise latent space difference was the absolute value of a difference operation, $m_2$ would be equivalent to computing the L1 norm of the difference vector in latent space:
$$m_2^{\text{sum}}(\tilde{x}, \tilde{y}) = \sum_{i=1}^{b} |\tilde{x}_i - \tilde{y}_i|.$$
Similarly, adding a square operation to the element-wise distance in the latent space and computing the square root at the very end leads to the L2 norm of the latent space difference vector. In the same way, it is possible to use any Lp norm with the corresponding operations:
$$m_2^{\text{sum}}(\tilde{x}, \tilde{y}) = \Big( \sum_{i=1}^{b} |\tilde{x}_i - \tilde{y}_i|^p \Big)^{1/p}.$$
In both cases, this forms the metric induced by the corresponding norm, which by definition has all desired properties (6), (7), (8), and (9). If we change all aggregation methods to a weighted average operation, each term in the sum is multiplied by a weight $w_i$. This is even possible with learned weights, as they are constant at evaluation time if they are clamped to be positive as described above. Now, $w_i$ can be attributed to both inputs by distributivity, meaning each input is element-wise multiplied with a constant vector before applying the metric, which leaves the metric properties untouched. The reason is that it is possible to define new vectors in the same space, equal to the scaled inputs. This renaming trivially provides the correct properties:
$$m_2^{\text{weighted}}(\tilde{x}, \tilde{y}) = \sum_{i=1}^{b} w_i |\tilde{x}_i - \tilde{y}_i| \overset{w_i > 0}{=} \sum_{i=1}^{b} |w_i \tilde{x}_i - w_i \tilde{y}_i|.$$
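Accordingly, doing the same with the Lp norm idea is possible, and each $w_i$ just needs a suitable adjustment before distributivity can be applied, keeping the metric properties once again:
$$\begin{aligned}
m_2^{\text{weighted}}(\tilde{x}, \tilde{y}) &= \Big( \sum_{i=1}^{b} w_i |\tilde{x}_i - \tilde{y}_i|^p \Big)^{1/p} \\
&= \Big( \sum_{i=1}^{b} w_i \, |\tilde{x}_i - \tilde{y}_i| \, |\tilde{x}_i - \tilde{y}_i| \cdots |\tilde{x}_i - \tilde{y}_i| \Big)^{1/p} \\
&= \Big( \sum_{i=1}^{b} w_i^{1/p} |\tilde{x}_i - \tilde{y}_i| \, w_i^{1/p} |\tilde{x}_i - \tilde{y}_i| \cdots w_i^{1/p} |\tilde{x}_i - \tilde{y}_i| \Big)^{1/p} \\
&\overset{w_i > 0}{=} \Big( \sum_{i=1}^{b} |w_i^{1/p} \tilde{x}_i - w_i^{1/p} \tilde{y}_i|^p \Big)^{1/p}.
\end{aligned}$$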
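The rewriting step for the weighted Lp case can be confirmed numerically: with positive weights $w_i$, the weighted sum equals the plain Lp distance between the rescaled vectors $w_i^{1/p} \tilde{x}_i$ and $w_i^{1/p} \tilde{y}_i$. The random vectors and the values of p are arbitrary choices for this check.

```python
import numpy as np


def weighted_lp(x, y, w, p):
    """(sum_i w_i |x_i - y_i|^p)^(1/p)"""
    return float(np.sum(w * np.abs(x - y) ** p) ** (1.0 / p))


def rescaled_lp(x, y, w, p):
    """Plain Lp distance after scaling both inputs by w_i^(1/p)."""
    s = w ** (1.0 / p)
    return float(np.sum(np.abs(s * x - s * y) ** p) ** (1.0 / p))


rng = np.random.default_rng(1)
b = 50
x, y = rng.standard_normal(b), rng.standard_normal(b)
w = rng.random(b) + 0.01  # positive weights

for p in (1, 2, 3.5):
    print(p, np.isclose(weighted_lp(x, y, w, p), rescaled_lp(x, y, w, p)))  # True for each p
```

The non-integer p = 3.5 shows that the identity $w_i |a|^p = |w_i^{1/p} a|^p$ does not depend on writing the power as a product of p factors.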
With these weighted terms for $m_2$, it is possible to describe all used aggregations and latent space difference methods. The proposed method deals with multiple higher order tensors instead of a single vector. Thus, the weights $w_i$ additionally depend on constants such as the direction of the aggregations and their position in the latent space tensors. But it is easy to see that mapping a higher order tensor to a vector and keeping track of additional constants still retains all properties in the same way. As a result, the described architecture by design yields a pseudometric that is suitable for comparing simulation data in a way that corresponds to our intuitive understanding of distances.
B. Architectures
The following sections provide details regarding the architecture of the base network and some experimental designs.
B.1. Base Network Design
Fig. 8 shows the architecture of the base network for the LSiM metric. Its purpose is to extract features from both inputs of the Siamese architecture that are useful for the further processing steps. To maximise the usefulness and to avoid feature maps that show overly similar features, the chosen kernel size and stride of the convolutions are important. Starting with larger kernels and strides means the network has a big receptive field and can consider simple, low-level features in large regions of the input. For the two
[Figure 8 diagram: input 3 × 224 × 224; feature maps of layers 1 to 5: 32 × 55 × 55, 96 × 26 × 26, 192 × 12 × 12, 128 × 12 × 12, and 128 × 12 × 12; operations: 12x12 convolution with stride 4 + ReLU, 4x4 MaxPool with stride 2, 5x5 convolution with stride 1 + ReLU, 3x3 convolution with stride 1 + ReLU.]
Figure 8. Proposed base network architecture consisting of five layers with up to 192 feature maps that are decreasing in spatial size. It is similar to the feature extractor from AlexNet as identical spatial dimensions for the feature maps are used, but it reduces the number of feature maps for each layer by 50% to have fewer weights.
[Figure 9 plots: for each base network layer (1 to 5), mean and std. dev. of feature map weights and unused feature maps in %; left panel AlexNet_frozen, right panel LSiM (ours).]
Figure 9. Analysis of the distributions of learned feature map aggregation weights across the base network layers. Displayed is a base network with pre-trained weights (left) in comparison to our method for fully training the base network (right). Note that the percentage of unused feature maps for most layers of our base network is 0%.
following layers, the large strides are replaced by additional MaxPool operations that serve a similar purpose and reduce the spatial size of the feature maps.
For the three final layers, only small convolution kernels and
strides are used, but the number of channels is significantly
larger than before. These deep feature maps typically contain
high-level structures, which are most important to distinguish
complex changes in the inputs. Keeping the number of trainable
weights as low as possible was an important consideration for this
design to prevent overfitting to certain simulation types and to
increase generality. We explored a range of weight counts by using
the same architecture and only scaling the number of feature maps in
each layer. The final design shown in Fig. 8, with about 0.62
million weights, worked best for our experiments.
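As a rough plausibility check of the stated weight count, the trainable parameters of the convolutions can be tallied directly. This sketch assumes that the kernel sizes from the Fig. 8 legend are assigned as 12x12 (Layer 1), 5x5 (Layer 2), and 3x3 (Layers 3 to 5), that every convolution has one bias per output channel, and that the metric adds one learned aggregation weight per feature map (cf. Fig. 9). These are assumptions, not details given in this section.

```python
# Sketch: tally trainable weights of the base network (assumed kernel sizes).
LAYERS = [  # (kernel size, input channels, output channels)
    (12, 3, 32),    # Layer 1
    (5, 32, 96),    # Layer 2
    (3, 96, 192),   # Layer 3
    (3, 192, 128),  # Layer 4
    (3, 128, 128),  # Layer 5
]


def conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Filter weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


extractor = sum(conv_weights(k, c_in, c_out) for k, c_in, c_out in LAYERS)
# Assumption: one learned aggregation weight per feature map
aggregation = sum(c_out for _, _, c_out in LAYERS)
print(extractor, aggregation, extractor + aggregation)  # 625728 576 626304
```

Under these assumptions, the total of roughly 626 thousand weights is consistent with the figure of about 0.62 million quoted above.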
In the following, we analyze the contributions of the per-layer
features of two different metric networks to highlight differences
in how the features are utilized for the distance estimation task.
In Fig. 9, our LSiM network yields a significantly smaller standard
deviation in the learned weights that aggregate the feature maps of
all five layers, compared to a pre-trained base network. This means
that all feature maps contribute similarly to establishing the
distances, and the aggregation merely fine-tunes the relative
importance of each feature. In addition, almost all features
receive a weight greater than zero, and as a result, more features
contribute to the final distance value.
Employing a fixed pre-trained feature extractor, on the other hand,
shows a very different picture: although the mean across the
different network layers is similar, the contributions of different
features vary strongly, which is visible in the significantly larger
standard deviation. Furthermore, 2-10% of the feature maps in each
layer receive a weight of zero and hence were deemed not useful at
all for establishing the distances. This illustrates the usefulness
of a targeted network in which all features contribute to the
distance inference.
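The quantities in Fig. 9 are simple per-layer statistics of the learned aggregation weights. Below is a minimal sketch of how they can be computed. The weight values are made up to mirror the qualitative contrast described above (similar mean, different spread and share of unused maps) and are not results from the paper.

```python
# Sketch of the per-layer statistics shown in Fig. 9 (made-up example weights).
from statistics import mean, pstdev


def layer_stats(weights: list[float]) -> tuple[float, float, float]:
    """Mean, population std. dev., and percentage of exactly-zero weights."""
    unused = 100.0 * sum(1 for w in weights if w == 0.0) / len(weights)
    return mean(weights), pstdev(weights), unused


example_layers = {
    1: [0.10, 0.12, 0.08, 0.10],  # evenly used feature maps
    2: [0.00, 0.25, 0.05, 0.10],  # same mean, larger spread, one unused map
}
for layer, weights in example_layers.items():
    m, s, u = layer_stats(weights)
    print(f"layer {layer}: mean={m:.3f} std={s:.3f} unused={u:.1f}%")
```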
B.2. Feature Map Normalization
In the following, we analyze how the different feature map
normalizations discussed in Section 3.2 of the main paper affect the
performance of our metric. We compare using no normalization,
$\mathrm{norm}_{\mathrm{none}}(G) = G$; the unit length
normalization via division by the norm of a feature vector,
$\mathrm{norm}_{\mathrm{unit}}(G) = G / \|G\|_2$, proposed by Zhang
et al.; a global unit length normalization,
$\mathrm{norm}_{\mathrm{global}}(G) = G / \max(\|G_0\|_2, \|G_1\|_2, \dots)$,
that considers the norm of all feature vectors in the entire
training set; and the proposed normalization to a scaled chi
distribution
$$\mathrm{norm}_{\mathrm{dist}}(G) = \frac{1}{\sqrt{g_c - 1}} \, \frac{G - \mathrm{mean}(G_0, G_1, \dots)}{\mathrm{std}(G_0, G_1, \dots)}.$$
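The four normalizations can be written out directly. The sketch below operates on plain Python lists and assumes, as in LPIPS, that a feature vector holds the $g_c$ channel values at one spatial position. The statistics used by norm_global and norm_dist are fitted once on a training set (here a tiny, made-up one) and then kept fixed. Helper names such as fit_global and fit_dist are hypothetical.

```python
# Sketch of the four feature normalizations compared in Fig. 10.
import math

Vector = list[float]


def l2(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def norm_none(g: Vector) -> Vector:
    return list(g)


def norm_unit(g: Vector) -> Vector:
    n = l2(g)
    return [x / n for x in g]


def fit_global(train: list[Vector]) -> float:
    return max(l2(v) for v in train)  # largest norm in the training set


def norm_global(g: Vector, max_norm: float) -> Vector:
    return [x / max_norm for x in g]


def fit_dist(train: list[Vector]) -> tuple[Vector, Vector]:
    """Per-component mean and population standard deviation."""
    g_c, n = len(train[0]), len(train)
    means = [sum(v[k] for v in train) / n for k in range(g_c)]
    stds = [math.sqrt(sum((v[k] - means[k]) ** 2 for v in train) / n)
            for k in range(g_c)]
    return means, stds


def norm_dist(g: Vector, means: Vector, stds: Vector) -> Vector:
    scale = 1.0 / math.sqrt(len(g) - 1)  # inverse mode of chi with g_c dof
    return [scale * (x - m) / s for x, m, s in zip(g, means, stds)]


train = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
print(norm_unit([3.0, 4.0]))  # [0.6, 0.8]
print(norm_global([3.0, 4.0], fit_global(train)))
print(norm_dist([3.0, 4.0], *fit_dist(train)))
```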
Fig. 10 shows a comparison of these normalization methods on the
combined test data. Using no normalization is significantly
detrimental to the performance of the metric, as succeeding
operations cannot reliably compare the features. A unit length
normalization of a single sample is already a major improvement,
since the following operations now have a predictable range of
values to work with. This corresponds to a cosine distance, which
only measures the angles between the feature vectors and entirely
neglects their length.
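The link between unit length normalization and the cosine distance follows from expanding the squared distance of two unit vectors, $\|a/\|a\|_2 - b/\|b\|_2\|_2^2 = 2 - 2\cos\theta$. Here is a short numerical check with arbitrary example vectors:

```python
# Check: after unit normalization, squared distance = 2 - 2 cos(theta).
import math


def unit(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def sq_dist(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a, b = [1.0, 2.0, 2.0], [2.0, 0.0, 1.0]
print(sq_dist(unit(a), unit(b)), 2 - 2 * cosine(a, b))
# Scaling a does not change the result: its length information is discarded
print(sq_dist(unit([10 * x for x in a]), unit(b)))
```

The last value is unchanged when `a` is scaled by 10, which is exactly the loss of length information described above.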
[Figure 10 plot: correlation on all test data (x-axis, 0.60 to 0.74) for norm_none, norm_unit, norm_global, and norm_dist.]
Figure 10. Performance on our test data for different feature map
normalization approaches.
Using the maximum norm across all training samples (computed in a
pre-processing step and fixed for training) introduces additional
information, as the network can now compare magnitudes as well.
However, this comparison is not stable, since the maximum norm can be
an outlier with respect to the typical content of the corresponding
feature.
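A toy example with made-up feature vectors illustrates this instability of norm_global. Every vector is divided by the largest norm seen in training, so a single outlier compresses the normalized magnitudes of all typical samples.

```python
# Made-up example: norm_global divides by the largest training-set norm,
# so one outlier rescales every typical feature vector.
import math


def l2(v: list[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


typical = [[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]]  # norms 5, 1, 10
with_outlier = typical + [[300.0, 400.0]]       # adds a vector with norm 500

for train in (typical, with_outlier):
    max_norm = max(l2(v) for v in train)
    scaled = [round(l2(v) / max_norm, 3) for v in typical]
    print(max_norm, scaled)
# 10.0 [0.5, 0.1, 1.0]
# 500.0 [0.01, 0.002, 0.02]
```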
The proposed normalization forms a chi distribution by individually
transforming each component of the feature vector to a standard
normal distribution. Afterwards, scaling with the inverse mode of
the chi distribution (for $k$ degrees of freedom the mode is
$\sqrt{k-1}$, hence the factor $1/\sqrt{g_c - 1}$) leads to a
consistent average magnitude close to one. It results in the best
performing metric, since both the length and the angle of the
feature vectors can be reliably compared by the following
operations.
B.3. Recursive “Meta-Metric”
Since comparing the feature maps is a central operation of the
proposed metric calculations, we experimented with replacing it with
an existing CNN-based metric. In theory, this would allow for a
recursive, arbitrarily deep network that repeatedly invokes itself:
first, the extracted representations of the inputs are used, then
the representations extracted from the previous representations,
and so on. In practice, however, using more than one recursion step
is currently not feasible due to increasing computational
requirements in addition to vanishing gradients.
Fig. 11 shows how our computation method can be modified for a
CNN-based latent space difference instead of an element-wise
operation. Here, we employ LPIPS (Zhang et al., 2018). There are two
main differences compared to the proposed method. First, the LPIPS
latent space difference creates a single distance value for a pair
of feature maps instead of a spatial feature difference. As a
result, the following aggregation is a single learned average
operation, and spatial or layer aggregations are no longer
necessary. We also performed experiments with a spatial LPIPS
version here, but due to memory limitations, these were not
successful. Second, the convolution operations in LPIPS have a lower
limit for the spatial resolution, and some feature maps of our base
network are quite small (see Fig. 8). Hence, we upscale the feature
maps below the required spatial size of 32×32 using nearest
neighbor interpolation.
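The nearest neighbor upscaling of small feature maps can be sketched for a single 2D map as follows. The index convention (source index = floor(target index × input size / output size)) is one common choice and an assumption here, as is the hypothetical helper name ensure_min_size. The minimum size of 32 is taken from the text.

```python
# Sketch: nearest neighbor upscaling of a 2D feature map stored as a list of rows.
MIN_SIZE = 32  # minimum spatial size required by the LPIPS variant


def upscale_nearest(fmap: list[list[float]], out_h: int, out_w: int) -> list[list[float]]:
    in_h, in_w = len(fmap), len(fmap[0])
    # Each output pixel copies the input pixel at the scaled-down index
    return [[fmap[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
            for i in range(out_h)]


def ensure_min_size(fmap: list[list[float]]) -> list[list[float]]:
    h, w = len(fmap), len(fmap[0])
    if h >= MIN_SIZE and w >= MIN_SIZE:
        return fmap  # large enough, e.g. the 55x55 maps of Layer 1
    return upscale_nearest(fmap, max(h, MIN_SIZE), max(w, MIN_SIZE))


small = [[float(12 * r + c) for c in range(12)] for r in range(12)]  # a 12x12 map
big = ensure_min_size(small)
print(len(big), len(big[0]), big[0][:4])  # 32 32 [0.0, 0.0, 0.0, 1.0]
```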
On our combined test data, such a metric with a fully trained base
network achieves a performance comparable to AlexNet_random or
AlexNet_frozen.
B.4. Optical Flow Metric
In the following, we describe our approach to compute a metric via
optical flow (OF). For an efficient OF evaluation, we employed a
pre-trained network (Ilg et al., 2016). From an OF network
$f: I \times I \to \mathbb{R}^{i_{max} \times j_{max} \times 2}$
with two input data fields $x, y \in I$, we get the flow vector
field $f^{xy}(i, j) = (f^{xy}_1(i, j), f^{xy}_2(i, j))^T$, where $i$
and $j$ denote the locations, and $f_1$ and $f_2$ denote the
components of the flow vectors. In addition, we have a second flow
field $f^{yx}(i, j)$ computed from the reversed input ordering. We
can now define a function $m: I \times I \to [0, \infty)$:
$$m(x, y) = \sum_{i=0}^{i_{max}} \sum_{j=0}^{j_{max}} \sqrt{\left(f^{xy}_1(i, j)\right)^2 + \left(f^{xy}_2(i, j)\right)^2} + \sqrt{\left(f^{yx}_1(i, j)\right)^2 + \left(f^{yx}_2(i, j)\right)^2}.$$
Intuitively, this function computes the sum over the magnitudes of
all flow vectors in both vector fields. With this definition, it
follows directly that $m(x, y)$ fulfills the metric properties of
non-negativity and symmetry (see Eq. (6) and (7)): every magnitude is
non-negative, and swapping $x$ and $y$ only swaps the two terms.
Under the assumption that identical inputs create a zero flow field,
a relaxed identity of indiscernibles holds as well (see Eq. (9)).
Compared to the proposed approach, however, there is no guarantee
for the triangle inequality, so $m(x, y)$ only qualifies as a
pseudo-semimetric.
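Once the two flow fields are available, evaluating $m(x, y)$ is a plain sum of vector magnitudes. The sketch below takes the flow fields as nested lists of (f1, f2) tuples with made-up values. Producing them with FlowNet2 is not reproduced here.

```python
# Sketch of m(x, y): sum of flow magnitudes in the forward and backward fields.
import math

Flow = list[list[tuple[float, float]]]


def flow_magnitude_sum(flow: Flow) -> float:
    return sum(math.hypot(f1, f2) for row in flow for f1, f2 in row)


def m(flow_xy: Flow, flow_yx: Flow) -> float:
    return flow_magnitude_sum(flow_xy) + flow_magnitude_sum(flow_yx)


forward = [[(3.0, 4.0), (0.0, 0.0)], [(1.0, 0.0), (0.0, 2.0)]]      # made-up f^{xy}
backward = [[(-3.0, -4.0), (0.0, 0.0)], [(-1.0, 0.0), (0.0, -2.0)]]  # made-up f^{yx}
print(m(forward, backward))  # 16.0
print(m(backward, forward) == m(forward, backward))  # True
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
print(m(zero, zero))  # 0.0
```

Symmetry holds by construction, and a zero flow field gives a zero distance, matching the relaxed identity of indiscernibles. Nothing in the code enforces the triangle inequality.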
Fig. 12 shows flow visualizations on data examples produced by
FlowNet2. The metric works relatively well for inputs that are
similar to the training data of FlowNet2, such as the shape data
example in the top row. For data that provides some outline, e.g.,
the smoke simulation example in the middle row or also liquid data,
the metric does
[Figure 11 diagram: Input 1 and Input 2 (RGB inputs) each pass through the base network, feature map normalization, and spatial extension (extended feature maps: sets of 3rd order tensors); the LPIPS latent space difference yields feature differences as sets of scalars; aggregation via a weighted average with 1 learned weight per feature map produces the distance output (result: scalar).]
Figure 11. Adjusted distance computation for an LPIPS-based latent
space difference. To provide sufficiently large inputs for LPIPS,
small feature maps are spatially enlarged with nearest neighbor
interpolation. In addition, LPIPS creates scalar instead of spatial
differences, leading to a simplified aggregation.
[Figure 12 panels: Reference, Variation, Flow streamlines, Flow direction, Flow magnitude.]
Figure 12. Outputs from FlowNet2 on data examples. The flow
streamlines are a sparse visualization of the resulting flow field
and indicate the direction of the flow by their orientation and its
magnitude by their color (darker being larger). The two
visualizations on the right show the dense flow field and are
color-coded to show the flow direction (blue/yellow: vertical,
green/red: horizontal) and the flow magnitude (brighter being
larger).
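[Figure 13 diagram: 2 stacked inputs (6×224×224) pass through a feature extractor identical to the base network (128×12×12), followed by adaptive MaxPool (128×6×6), flatten (4608), and fully connected layers (128, 128, 1) with dropout and ReLU, ending in a sigmoid that outputs the predicted distance.]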
Figure 13. Non-Siamese network architecture with the same feature
extractor used in Fig. 8. It uses both stacked inputs and directly
predicts the final distance value from the last set of feature maps
with several fully connected layers.
not work as well but still provides a reasonable flow field.
However, for full spatial examples such as the Burgers' or
Advection-Diffusion cases (see bottom row), the network is no longer
able to produce meaningful flow fields. The results are often a very
uniform flow with similar magnitude and direction.
B.5. Non-Siamese Architecture
To compute a metric without the Siamese architecture outlined above,
we use a network structure with a single output, as shown in
Fig. 13. Thus, instead of having two identical feature extractors and
combining their feature maps, here the distance is directly
predicted from the stacked inputs by a single network with about
1.24 million weights. After using the same feature extractor as
described in Section B.1, the final set of feature maps is spatially
reduced with an adaptive MaxPool operation. Next, the result is
flattened, and three consecutive fully connected layers process the
data to form the final prediction. Here, the last activation
function is a sigmoid instead of ReLU. The reason is that a ReLU
would clamp every negative intermediate value to a zero distance,
while a sigmoid compresses the intermediate value to a small
distance, which is more meaningful than directly clamping it.
In terms of metric properties, this architecture only provides
non-negativity (see Eq. (6)) due to the final sigmoid function. All
other properties cannot be guaranteed without further constraints.
This is the main disadvantage of a non-Siamese network. These issues
could be alleviated with specialized training data or by manually
adding constraints to the model, e.g., to obtain some amount of
symmetry (see Eq. (7)) and at least a weakened identity of
indiscernibles (see Eq. (9)).
[Figure 14 diagram: encoder feature maps of size 3×224×224 (input), 32×55×55, 64×26×26, 128×12×12, and 128×12×12; decoder feature maps combined via skip connections with channel concatenation (128 + 64, 64 + 64, and 32 + 32 channels); operations: 12x12 convolution with stride 4 + ReLU, 4x4 MaxPool with stride 2, 5x5 convolution with stride 1 + ReLU, 3x3 convolution with stride 1 + ReLU, 3x3 transposed convolution with stride 1 + ReLU, 5x5 transposed convolution with stride 2 + ReLU, 3x3 transposed convolution with stride 2 + ReLU.]
Figure 14. Network architecture with skip connections for better
information transport between feature maps. Transposed convolutions
are used to upscale the feature maps in the second half of the
network to match the spatial size of earlier layers for the skip
connections.
However, compared to a Siamese network that guarantees them by
design, these extensions are clearly sub-optimal. As a result of the
missing properties, this network has significant problems with
generalization. While it performs well on the training data, the
performance noticeably deteriorates for several of the test data
sets.
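As a hypothetical illustration of the kind of constraint mentioned above (not something the paper implements), symmetry can be enforced on any non-Siamese predictor by averaging both input orders. The toy predictor below is a made-up stand-in for the network of Fig. 13.

```python
# Hypothetical: symmetrize a non-Siamese predictor f(x, y) by averaging orders.
import math


def toy_predictor(x: list[float], y: list[float]) -> float:
    # Deliberately asymmetric: the two inputs are weighted differently
    z = sum(2.0 * a - b for a, b in zip(x, y))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid output, as in Fig. 13


def symmetrized(f, x: list[float], y: list[float]) -> float:
    return 0.5 * (f(x, y) + f(y, x))


x, y = [1.0, 0.0], [0.0, 0.0]
print(toy_predictor(x, y), toy_predictor(y, x))  # differ
print(symmetrized(toy_predictor, x, y) == symmetrized(toy_predictor, y, x))  # True
print(symmetrized(toy_predictor, x, x))  # nonzero: identity is not enforced
```

The wrapper makes the output symmetric, but the distance of an input to itself is still nonzero. The identity of indiscernibles is therefore not obtained this way, which is in line with the argument that such add-ons remain sub-optimal compared to a Siamese design.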
B.6. Skip Connections in Base Network
As explained above, our base network primarily serves as a feature
extractor to produce activations that are employed to evaluate a
learned metric. Many state-of-the-art methods employ networks with
skip connections (Ronneberger et al., 2015; He et al., 2016; Huang
et al., 2017), as experiments have shown that these connections help
to preserve information from the inputs. In our case, the
classification “output” of a network such as AlexNet plays no actual
role. Rather, the features extracted along the way are crucial.
Hence, skip connections should not improve the inference task for
our metrics.
To verify that this is the case, we have included tests with a base
network (see Fig. 14) similar to the popular UNet architecture
(Ronneberger et al., 2015). For our experiments, we kept the early
layers closely in line with the feature extractors that worked well
for the base network (see Section B.1). Only the layers in the
decoder part have an increased spatial feature map size to
accommodate the skip connections. As expected, this network can be
used to compute reliable metrics for the input data without
negatively affecting the performance. However, the improvements that
skip connections bring to regular inference tasks do not translate
into improvements for the metric calculations.
C. Impact of Data Difficulty
[Figure 15 plot: correlation (y-axis, 0.0 to 1.0) over the difficulty of training data (scaled noise strength, x-axis, 0x to 25x) for LSiM_reduced on all test data, L2 on training data, and LPIPS on training data.]
Figure 15. Impact of increasing data difficulty for a reduced
training data set. Evaluations on training data for L2 and LPIPS,
and the test performance of models trained with the different
reduced data sets (LSiM_reduced) are shown.
We shed more light on the aspect of noise levels and data difficulty
via six reduced data sets that consist of a smaller amount of Smoke
and Advection-Diffusion data with differently scaled noise strength
values. Results are shown in Fig. 15. Increasing the noise level
creates more difficult data, as shown by the dotted and dashed plots
representing the performance of the L2 and the LPIPS metric on each
data set. Both roughly follow an exponentially decreasing function.
Each point on the solid line plot is the test result of a reduced
LSiM model trained on the data set with the corresponding noise
level. Apart from the data, the entire training setup was identical.
This shows that the training process is very robust to the noise, as
the result on the test data only slowly decreases for very high
noise levels. Furthermore, small amounts of noise improve the
generalization compared to the model that was trained without any
noise. This is somewhat expected, as a model that never saw noisy
data during training cannot learn to extract features that are
robust with respect to noise.
D. Data Set Details
In the following sections, the generation of each data set is
described. For each figure showing data samples (consisting of a
reference simulation and several variants with a single changing
initial parameter), the leftmost image is the reference and the
images to the right show the variants in order of increasing
parameter change. For Figs. 16, 17, 18, and 19, the first subfigure
(a) demonstrates that medium and large scale characteristics behave
in a very non-chaotic manner for simulations without any added
noise. These samples are only included for illustrative purposes and
are not used for training. The second and third subfigures (b) and
(c) in each case show the training data of LSiM, where the large
majority of data falls into category (b) of normal samples that
follow the generation ordering, even with more varying behaviour.
Category (c) is a small fraction of the training data, and the shown
examples are specifically picked to show how the chaotic behaviour
can, in the worst case, override the ordering intended by the data
generation. Occasionally, category (d) is included to show how
normal data samples from the test set differ from the training data.
D.1. Navier-Stokes Equations
These equations describe the general behaviour of fluids with
respect to advection, viscosity, pressure, and mass conservation.
Eq. (10) defines the conservation of momentum, and Eq. (11)
constrains the conservation of mass:
$$\frac{\partial u}{\partial t} + (u \cdot \nabla) u = -\frac{\nabla P}{\rho} + \nu \nabla^2 u + g, \tag{10}$$
$$\nabla \cdot u = 0. \tag{11}$$
In this context, $u$ is the velocity, $P$ is the pressure the fluid
exerts, $\rho$ is the density of the fluid (usually assumed to be
constant), $\nu$ is the kinematic viscosity coefficient that
indicates the thickness of the fluid, and $g$ denotes the
acceleration due to gravity. With this PDE, three data sets were
created using a smoke and a liquid solver. For all data, 2D
simulations were run until a certain step, and useful data fields
were exported afterwards.
SMOKE
For the smoke data, a standard Eulerian fluid solver using a
preconditioned pressure solver based on the conjugate gradient
method and a semi-Lagrangian advection scheme was employed.
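A semi-Lagrangian advection step traces each grid point backwards along the velocity and interpolates the advected quantity at the departure point. Below is a minimal 1D sketch, assuming a constant velocity, periodic boundaries, and linear interpolation. The solver used for the smoke data is 2D and more elaborate.

```python
# Minimal 1D semi-Lagrangian advection step (constant velocity, periodic domain).
import math


def semi_lagrangian_step(q: list[float], u: float, dt: float, dx: float) -> list[float]:
    n = len(q)
    out = []
    for i in range(n):
        x_dep = i - u * dt / dx      # departure point in grid units
        i0 = math.floor(x_dep)
        t = x_dep - i0               # linear interpolation weight
        a, b = q[i0 % n], q[(i0 + 1) % n]
        out.append((1 - t) * a + t * b)
    return out


q = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(semi_lagrangian_step(q, u=1.0, dt=1.0, dx=1.0))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(semi_lagrangian_step(q, u=0.5, dt=1.0, dx=1.0))  # [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]
```

When the departure point falls between grid cells, the interpolation smears the sharp peak. This is the numerical dissipation that the FLIP-based liquid solver described below is meant to reduce.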
The general setup for every smoke simulation consists of a
rectangular smoke source at the bottom with a fixed additive noise
pattern to provide smoke plumes with more details. Additionally,
there is a downward-directed, spherical force field area above the
source, which divides the smoke into two major streams along it. We
chose this solution over an actual obstacle in the simulation in
order to avoid overfitting to a clearly defined black obstacle area
inside the smoke data. Once the simulation reaches a predefined time
step, the density, pressure, and velocity fields (separated by
dimension) are exported and stored. Some example sequences can be
found in Fig. 16. With this setup, the following initial conditions
were varied in isolation:
• Smoke buoyancy in x- and y-direction
• Strength of noise added to the velocity field
• Amount of force in x- and y-direction provided by the force field
• Orientation and size of the force field
• Position of the force field in x- and y-direction
• Position of the smoke source in x- and y-direction
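Varying one initial condition in isolation amounts to copying a reference parameter set and changing a single field in increasing steps. The following is a hypothetical sketch. The SmokeParams fields, their default values, the step size, and the number of variants are illustrative assumptions, not the settings used for the data set.

```python
# Hypothetical sketch: variants that differ from a reference in one parameter only.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SmokeParams:  # illustrative fields and values, not the paper's settings
    buoyancy_x: float = 0.0
    buoyancy_y: float = -4e-3
    noise_strength: float = 0.1
    source_x: float = 0.5


def variants(ref: SmokeParams, field: str, step: float, count: int) -> list[SmokeParams]:
    """Reference followed by `count` variants with increasing change of `field`."""
    base = getattr(ref, field)
    return [replace(ref, **{field: base + k * step}) for k in range(count + 1)]


seq = variants(SmokeParams(), "source_x", step=0.02, count=3)
for p in seq:
    print(p.source_x, p.buoyancy_y)  # only source_x changes
```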
Overall, 768 individual smoke sequences were used for training, and
the validation set contains 192 sequences with different
initialization seeds.
LIQUID
For the liquid data, a solver based on the fluid implicit particle
(FLIP) method (Zhu & Bridson, 2005) was employed. It is a hybrid
Eulerian-Lagrangian approach that replaces the semi-Lagrangian
advection scheme with particle-based advection to reduce numerical
dissipation. Still, this method is not optimal, as we experienced
problems such as mass loss, especially for larger noise values.
The simulation setup consists of a large breaking dam and several
smaller liquid areas for more detailed splashes. After the dam hits
the simulation boundary, a large, single drop of liquid is created
in the middle of the domain that hits the already moving liquid
surface. Then, the extrapolated level set values, binary indicator
flags, and the velocity fields (separated by dimension) are saved.
Some examples are shown in Fig. 17. The list of varied parameters
includes:
• Radius of the liquid drop
• Position of the drop in x- and y-direction
• Amount of additional gravity force in x- and y-direction
• Strength of noise added to the velocity field
(a) Data samples generated without noise: tiny output changes
following generation ordering
(b) Normal training data samples with noise: larger output
changes but ordering still applies
(c) Outlier data samples: noise can override the generation
ordering by chance
Figure 16. Various smoke simulation examples using one component
of the velocity (top rows), the density (middle rows), and the
pressure field (bottom rows).
(a) Data samples generated without noise: tiny output changes
following generation ordering
(b) Normal training data samples with noise: larger output
changes but ordering still applies
(c) Outlier data samples: noise can override the generation
ordering by chance
(d) Data samples from test set with additional background
noise
Figure 17. Several liquid simulation examples using the binary
indicator flags (top rows), the extrapolated level set values
(middle rows), and one component of the velocity field (bottom rows)
for the training data and only the velocity field for the test
data.
(a) Data samples generated without noise: tiny output changes
following generation ordering
(b) Normal training data samples with noise: larger output
changes but ordering still applies
(c) Outlier data samples: noise can override the generation
ordering by chance
(d) Data samples from test set with additional background
noise
Figure 18. Various examples from the Advection-Diffusion
equation using the density field.
(a) Data samples generated without noise: tiny output changes
following generation ordering
(b) Normal training data samples with noise: larger output
changes but ordering still applies
(c) Outlier data samples: noise can override the generation
ordering by chance
Figure 19. Different simulation examples from Burgers' equation
using the velocity field.
The liquid training set consists of 792 sequences and the validation
set of 198 sequences with different random seeds. For the liquid
test set, additional background noise was added to the velocity
field of the simulations, as displayed in Fig. 17(d). Because this
only alters the velocity field, the extrapolated level set values
and binary indicator flags are not used for this data set, leading
to 132 sequences.
D.2. Advection-Diffusion and Burgers' Equation
For these PDEs, our solvers only discretize and solve the
corresponding equation in 1D. Afterwards, the different time steps
of the solution process are concatenated along