NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics
Algorithms for Dynamical Fermions
A D Kennedy, School of Physics, The University of Edinburgh
Outline
Monte Carlo integration
Importance sampling
Markov chains
Detailed balance, Metropolis algorithm
Symplectic integrators
Hybrid Monte Carlo
Pseudofermions
RHMC
QCD Computers
Monte Carlo methods: I
Monte Carlo integration is based on the identification of probabilities with measures
There are much better methods of carrying out low dimensional quadrature
All other methods become hopelessly expensive for large dimensions
In lattice QFT there is one integration per degree of freedom
We are approximating an infinite dimensional functional integral
Monte Carlo methods: II
Generate a sequence of random field configurations $\phi_1, \phi_2, \ldots, \phi_t, \ldots, \phi_N$ chosen from the probability distribution
$$P(\phi_t)\,d\phi_t = \frac{1}{Z}\,e^{-S(\phi_t)}\,d\phi_t$$
Measure the value of $\Omega$ on each configuration and compute the average
$$\bar\Omega \equiv \frac{1}{N}\sum_{t=1}^{N}\Omega(\phi_t)$$
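As a minimal illustration of this procedure (not from the lectures), a Python sketch with a single degree of freedom and an assumed quadratic action S(φ) = φ²/2, for which P(φ) is a unit Gaussian that we can sample directly:

```python
import random

def monte_carlo_average(omega, sample, n):
    """Average an observable over n configurations drawn from P."""
    return sum(omega(sample()) for _ in range(n)) / n

random.seed(42)
# For S(phi) = phi^2/2 the distribution P(phi) = exp(-S)/Z is a unit Gaussian,
# which we can sample directly; the exact answer is <phi^2> = 1.
estimate = monte_carlo_average(lambda phi: phi * phi,
                               lambda: random.gauss(0.0, 1.0),
                               100_000)
print(estimate)  # close to 1, with error O(1/sqrt(N))
```

For a lattice QFT the sampler would be a Markov chain rather than a direct draw, which is the subject of the following slides.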
Central Limit Theorem: I
Distribution of values for a single sample $\omega = \Omega(\phi)$:
$$P_\Omega(\omega) \equiv \int d\phi\; P(\phi)\,\delta\bigl(\omega - \Omega(\phi)\bigr)$$
Law of Large Numbers:
$$\lim_{N\to\infty}\bar\Omega = \langle\Omega\rangle$$
Central Limit Theorem:
$$\bar\Omega = \langle\Omega\rangle + O\!\left(\sqrt{C_2/N}\right)$$
where the variance of the distribution of $\Omega$ is
$$C_2 = \bigl\langle\Omega^2\bigr\rangle - \langle\Omega\rangle^2$$
The Laplace–DeMoivre Central Limit Theorem is an asymptotic expansion for the probability distribution of $\bar\Omega$
Central Limit Theorem: II
Generating function for connected moments (cumulants):
$$W(k) \equiv \ln\int d\omega\; P_\Omega(\omega)\,e^{ik\omega} = \sum_{n=0}^{\infty}\frac{(ik)^n}{n!}\,C_n$$
The first few cumulants are
$$C_0 = 0,\qquad C_1 = \langle\Omega\rangle,\qquad C_2 = \langle\Omega^2\rangle - \langle\Omega\rangle^2,$$
$$C_3 = \langle\Omega^3\rangle - 3\langle\Omega^2\rangle\langle\Omega\rangle + 2\langle\Omega\rangle^3,$$
$$C_4 = \langle\Omega^4\rangle - 4\langle\Omega^3\rangle\langle\Omega\rangle - 3\langle\Omega^2\rangle^2 + 12\langle\Omega^2\rangle\langle\Omega\rangle^2 - 6\langle\Omega\rangle^4$$
Note that this is an asymptotic expansion
Central Limit Theorem: III
Distribution of the average of N samples:
$$\bar P(\bar\omega) \equiv \int d\omega_1\cdots d\omega_N\; P_\Omega(\omega_1)\cdots P_\Omega(\omega_N)\;\delta\!\left(\bar\omega - \frac{1}{N}\sum_{t=1}^{N}\omega_t\right)$$
Connected generating function:
$$\bar W(k) \equiv \ln\int d\bar\omega\;\bar P(\bar\omega)\,e^{ik\bar\omega} = \ln\int d\omega_1\cdots d\omega_N\; P_\Omega(\omega_1)\cdots P_\Omega(\omega_N)\,\exp\!\left(\frac{ik}{N}\sum_{t=1}^{N}\omega_t\right)$$
$$= \ln\left[\int d\omega\; P_\Omega(\omega)\,e^{ik\omega/N}\right]^{N} = N\,W\!\left(\frac{k}{N}\right) = \sum_{n=1}^{\infty}\frac{(ik)^n}{n!}\,\frac{C_n}{N^{\,n-1}}$$
Central Limit Theorem: IV
Take the inverse Fourier transform to obtain the distribution $\bar P$:
$$\bar P(\bar\omega) = \frac{1}{2\pi}\int dk\; e^{-ik\bar\omega}\,e^{\bar W(k)}$$
$$= \frac{1}{2\pi}\int dk\; e^{-ik\bar\omega}\exp\!\left[ikC_1 - \frac{k^2 C_2}{2N} - \frac{ik^3 C_3}{3!\,N^2} + \frac{k^4 C_4}{4!\,N^3} + \cdots\right]$$
$$= \exp\!\left[-\frac{C_3}{3!\,N^2}\frac{d^3}{d\bar\omega^3} + \frac{C_4}{4!\,N^3}\frac{d^4}{d\bar\omega^4} - \cdots\right]\sqrt{\frac{N}{2\pi C_2}}\;\exp\!\left[-\frac{N(\bar\omega - C_1)^2}{2C_2}\right]$$
Central Limit Theorem: V
Re-scale to show convergence to the Gaussian distribution: $\bar P(\bar\omega)\,d\bar\omega = F(\xi)\,d\xi$, where $\xi \equiv (\bar\omega - C_1)\sqrt{N/C_2}$ and
$$F(\xi) = \frac{e^{-\xi^2/2}}{\sqrt{2\pi}}\left[1 - \frac{C_3\,(3\xi - \xi^3)}{6\sqrt{N}\,C_2^{3/2}} + O\!\left(\frac{1}{N}\right)\right]$$
Deterministic evolution of the probability distribution: $P: Q \mapsto PQ$
Convergence of Markov Chains: I
Define a metric on the space of (equivalence classes of) probability distributions:
$$d(Q_1,Q_2) \equiv \int dx\,\bigl|Q_1(x) - Q_2(x)\bigr|$$
Prove that $d(PQ_1,PQ_2) \le (1-\alpha)\,d(Q_1,Q_2)$ with $\alpha > 0$, so the Markov process P is a contraction mapping
The sequence Q, PQ, P²Q, P³Q, … is therefore Cauchy
The space of probability distributions is complete, so the sequence converges to a unique fixed point $Q_\infty = \lim_{n\to\infty}P^nQ$
Convergence of Markov Chains: II
$$d(PQ_1,PQ_2) = \int dx\,\bigl|(PQ_1)(x)-(PQ_2)(x)\bigr| = \int dx\,\left|\int dy\,P(x\leftarrow y)\,Q_1(y) - \int dy\,P(x\leftarrow y)\,Q_2(y)\right|$$
Let $Q(y) \equiv Q_1(y) - Q_2(y)$ and split it into its positive and negative parts, $Q = Q_+ - Q_-$ with $Q_\pm \ge 0$
Using $|a-b| = a + b - 2\min(a,b)$,
$$d(PQ_1,PQ_2) = \int dx\,\left|\int dy\,P(x\leftarrow y)\,Q(y)\right| = \int dx\int dy\,P(x\leftarrow y)\bigl[Q_+(y)+Q_-(y)\bigr] - 2\int dx\,\min_{\pm}\int dy\,P(x\leftarrow y)\,Q_\pm(y)$$
Convergence of Markov Chains: III
Since both $Q_1$ and $Q_2$ are normalised, $\int dy\,[Q_1(y)-Q_2(y)] = 0$, so
$$\int dy\,Q_+(y) = \int dy\,Q_-(y) = \tfrac12\int dy\,\bigl|Q_1(y)-Q_2(y)\bigr| = \tfrac12\,d(Q_1,Q_2)$$
and, since P conserves probability, $\int dx\,P(x\leftarrow y) = 1$
Defining
$$0 < \alpha \equiv \int dx\,\inf_y P(x\leftarrow y)$$
we have
$$\int dx\,\min_{\pm}\int dy\,P(x\leftarrow y)\,Q_\pm(y) \;\ge\; \alpha\cdot\tfrac12\,d(Q_1,Q_2)$$
and therefore
$$d(PQ_1,PQ_2) \;\le\; (1-\alpha)\,d(Q_1,Q_2)$$
Markov Chains: II
Use Markov chains to sample from Q
Suppose we can construct an ergodic Markov process P which has distribution Q as its fixed point
Start with an arbitrary state (“field configuration”)
Iterate the Markov process until it has converged (“thermalized”)
Thereafter, successive configurations will be distributed according to Q
But in general they will be correlated
To construct P we only need relative probabilities of states
Do not know the normalisation of Q
Cannot use Markov chains to compute integrals directly
We can compute ratios of integrals
Markov Chains: III
How do we construct a Markov process with a specified fixed point Q?
Fixed point condition:
$$Q(x) = \int dy\; P(x\leftarrow y)\,Q(y)$$
Detailed balance:
$$P(y\leftarrow x)\,Q(x) = P(x\leftarrow y)\,Q(y)$$
Integrate w.r.t. y to obtain the fixed point condition
Sufficient but not necessary for a fixed point
Metropolis algorithm:
$$P(x\leftarrow y) = \min\!\left(1,\;\frac{Q(x)}{Q(y)}\right)$$
Consider the cases $Q(x) \ge Q(y)$ and $Q(x) < Q(y)$ separately to obtain the detailed balance condition
Other choices are possible, e.g.,
$$P(x\leftarrow y) = \frac{Q(x)}{Q(x) + Q(y)}$$
Sufficient but not necessary for detailed balance
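A minimal sketch of the Metropolis choice in Python (the quartic target Q(x) ∝ exp(−x⁴) is an arbitrary assumption for illustration); note that only the ratio Q(x′)/Q(x) is ever needed, so the unknown normalisation of Q cancels:

```python
import random, math

def metropolis_chain(log_q, x0, n_steps, step=0.5, rng=random):
    """Random-walk Metropolis: accept x -> x' with probability min(1, Q(x')/Q(x))."""
    x, chain = x0, []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # Only the ratio Q(x')/Q(x) appears: the normalisation Z of Q cancels.
        if rng.random() < min(1.0, math.exp(log_q(x_new) - log_q(x))):
            x = x_new
        chain.append(x)
    return chain

random.seed(1)
# Hypothetical unnormalised target Q(x) = exp(-x^4); Z is never needed.
chain = metropolis_chain(lambda x: -x ** 4, 0.0, 50_000)
mean = sum(chain) / len(chain)
print(mean)  # near 0: the target is symmetric about the origin
```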
Markov Chains: IV
Composition of Markov steps
Let P₁ and P₂ be two Markov steps which have the desired fixed point distribution
They need not be ergodic
Then the composition of the two steps P2P1 will also have
the desired fixed point
And it may be ergodic
This trivially generalises to any (fixed) number of steps
For the case where P₁ is not ergodic but (P₁)ⁿ is, the terminology weakly and strongly ergodic is sometimes used
Markov Chains: V
This result justifies “sweeping” through a lattice performing single site updates
Each individual single site update has the desired fixed point because it satisfies detailed balance
The entire sweep therefore has the desired fixed point, and is ergodic
But the entire sweep does not satisfy detailed balance
Of course it would satisfy detailed balance if the sites were updated in a random order
But this is not necessary
And it is undesirable because it puts too much randomness into the system
Markov Chains: VI
Coupling from the Past
Propp and Wilson (1996)
Use a fixed set of random numbers
Flypaper principle: if states coalesce they stay together forever
– Eventually, all states coalesce to some state φ with probability one
– Any state from t = −∞ will coalesce to φ, which is therefore a sample from the fixed point distribution
Markov Chains: VII
Suppose we have a lattice
That is, a partial ordering with a largest and smallest element
And an update step that preserves it
Then once the largest and smallest states have coalesced all the others must have too
[Figure: Hasse diagram of a small partial ordering with elements a, b, c, d, e, f, g]
Markov Chains: VIII
A non-trivial example of where this is possible is the ferromagnetic Ising model
$$H = \beta\sum_{\langle i,j\rangle} s_i\,s_j,\qquad \beta \ge 0$$
The ordering is on spin configurations: A ≥ B iff $A_s \ge B_s$ for every site s
The update is a local heatbath
Autocorrelations: I
Exponential autocorrelations
The unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1
All its other eigenvalues must lie within the unit circle
In particular, the largest subleading eigenvalue satisfies $|\lambda_{\max}| < 1$
This corresponds to the exponential autocorrelation time
$$N_{\exp} \equiv -\frac{1}{\ln|\lambda_{\max}|}$$
Autocorrelations: II
Integrated autocorrelations
Consider the autocorrelation of some operator Ω
Without loss of generality we may assume $\langle\Omega\rangle = 0$
$$\bigl\langle\bar\Omega^2\bigr\rangle = \frac{1}{N^2}\left\langle\left(\sum_{t=1}^{N}\Omega_t\right)^{\!2}\right\rangle = \frac{1}{N^2}\left[\sum_{t=1}^{N}\bigl\langle\Omega_t^2\bigr\rangle + 2\sum_{t=1}^{N}\sum_{t'=t+1}^{N}\bigl\langle\Omega_t\,\Omega_{t'}\bigr\rangle\right]$$
Define the autocorrelation function
$$C_\Omega(\ell) \equiv \frac{\bigl\langle\Omega_{t+\ell}\,\Omega_t\bigr\rangle}{\bigl\langle\Omega^2\bigr\rangle}$$
so that
$$\bigl\langle\bar\Omega^2\bigr\rangle = \frac{\bigl\langle\Omega^2\bigr\rangle}{N}\left[1 + 2\sum_{\ell=1}^{N-1}\left(1-\frac{\ell}{N}\right)C_\Omega(\ell)\right]$$
Autocorrelations: III
The autocorrelation function must fall faster than the exponential autocorrelation
$$C_\Omega(\ell) \le e^{-\ell/N_{\exp}}$$
Define the integrated autocorrelation function
$$A_\Omega \equiv \sum_{\ell=1}^{\infty} C_\Omega(\ell)$$
For a sufficiently large number of samples
$$\bigl\langle\bar\Omega^2\bigr\rangle = \frac{\bigl\langle\Omega^2\bigr\rangle}{N}\bigl(1 + 2A_\Omega\bigr) + O\!\left(\frac{N_{\exp}}{N^2}\right)$$
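A sketch of estimating A_Ω numerically, using an AR(1) chain (an assumption chosen because its autocorrelation function C(ℓ) = ρ^ℓ is known exactly, so that A = ρ/(1−ρ)):

```python
import random

def autocorrelation(chain, max_lag):
    """Estimate C(l) = <x_{t+l} x_t> / <x^2> for lags 1..max_lag."""
    n = len(chain)
    mean = sum(chain) / n
    x = [v - mean for v in chain]
    c0 = sum(v * v for v in x) / n
    return [sum(x[t] * x[t + l] for t in range(n - l)) / (n - l) / c0
            for l in range(1, max_lag + 1)]

random.seed(7)
# AR(1) process x_{t+1} = rho*x_t + noise has C(l) = rho^l exactly,
# so the integrated autocorrelation is A = rho/(1-rho) = 4 for rho = 0.8.
rho, x, chain = 0.8, 0.0, []
for _ in range(100_000):
    x = rho * x + random.gauss(0.0, 1.0)
    chain.append(x)
A = sum(autocorrelation(chain, 50))  # truncated sum over lags 1..50
print(A)  # close to 4
```

In practice the truncation window must be chosen with care, since the estimator noise grows with the number of lags summed.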
Hybrid Monte Carlo: I
In order to carry out Monte Carlo computations including the effects of dynamical fermions we would like to find an algorithm which
Updates the fields globally
Because single link updates are not cheap if the action is not local
Takes large steps through configuration space
Because small-step methods carry out a random walk which leads to critical slowing down with a dynamical critical exponent z = 2
Does not introduce any systematic errors
Here z relates the autocorrelation to the correlation length of the system, $A_\Omega \propto \xi^z$
Hybrid Monte Carlo: II
A useful class of algorithms with these properties is the (Generalised) Hybrid Monte Carlo (HMC) method
Introduce a "fictitious" momentum p corresponding to each dynamical degree of freedom q
The action S of the underlying QFT plays the rôle of the potential in the "fictitious" classical mechanical system
This gives the evolution of the system in a fifth dimension, "fictitious" or computer time
Find a Markov chain with fixed point exp[−H(q,p)] where H is the "fictitious" Hamiltonian ½p² + S(q)
This generates the desired distribution exp[−S(q)] if we ignore the momenta p (i.e., the marginal distribution)
Hybrid Monte Carlo: III
The HMC Markov chain alternates two Markov steps
Molecular Dynamics Monte Carlo (MDMC)
(Partial) Momentum Refreshment
Both have the desired fixed point
Together they are ergodic
MDMC: I
If we could integrate Hamilton’s equations exactly we could follow a trajectory of constant fictitious energy
This corresponds to a set of equiprobable fictitious phase space configurations
Liouville's theorem tells us that this also preserves the functional integral measure dp dq, as required
Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test
With acceptance probability min[1, exp(−δH)]
MDMC: II
We build the MDMC step out of three parts
Molecular Dynamics (MD), an approximate integrator $U(\tau): (q,p)\mapsto(q',p')$ which is exactly
Area preserving: $\det U_* = \det\!\left[\dfrac{\partial(q',p')}{\partial(q,p)}\right] = 1$
Reversible: $F\,U(\tau)\,F\,U(\tau) = 1$
A momentum flip $F: p \mapsto -p$
A Metropolis accept/reject step
The composition of these gives
$$\begin{pmatrix}q\\p\end{pmatrix} \mapsto F\,U(\tau)\begin{pmatrix}q\\p\end{pmatrix}\theta\bigl(e^{-\delta H}-y\bigr) + \begin{pmatrix}q\\p\end{pmatrix}\theta\bigl(y-e^{-\delta H}\bigr)$$
with y being a uniformly distributed random number in [0,1]
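Putting the pieces together, a toy HMC sketch in Python for an assumed action S(q) = q²/2, using full momentum refreshment each trajectory and PQP leapfrog as the reversible, area-preserving integrator:

```python
import random, math

def hmc_step(q, S, grad_S, n_md=20, dtau=0.1, rng=random):
    """One HMC update: Gaussian momentum heatbath + PQP leapfrog + Metropolis."""
    p = rng.gauss(0.0, 1.0)                       # momentum refreshment
    H_old = 0.5 * p * p + S(q)
    q_new, p_new = q, p
    for _ in range(n_md):                         # leapfrog trajectory
        p_new -= 0.5 * dtau * grad_S(q_new)       # half kick
        q_new += dtau * p_new                     # drift
        p_new -= 0.5 * dtau * grad_S(q_new)       # half kick
    dH = 0.5 * p_new * p_new + S(q_new) - H_old
    if rng.random() < min(1.0, math.exp(-dH)):    # Metropolis accept/reject
        return q_new, 1
    return q, 0                                   # on rejection keep the old q

random.seed(3)
S, grad_S = (lambda q: 0.5 * q * q), (lambda q: q)   # assumed toy action
q, n_acc, qs = 0.0, 0, []
for _ in range(20_000):
    q, acc = hmc_step(q, S, grad_S)
    n_acc += acc
    qs.append(q)
var = sum(v * v for v in qs) / len(qs)
print(n_acc / len(qs), var)   # high acceptance; <q^2> close to 1
```

The momentum flip F is omitted here since it is irrelevant for full (θ = π/2) momentum refreshment, as noted below.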
Partial Momentum Refreshment
This mixes the Gaussian-distributed momenta p with Gaussian noise ξ:
$$\begin{pmatrix}p'\\ \xi'\end{pmatrix} = \begin{pmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{pmatrix} F \begin{pmatrix}p\\ \xi\end{pmatrix}$$
The Gaussian distribution of p is invariant under F
The extra momentum flip F ensures that for small θ the momenta are reversed after a rejection rather than after an acceptance
For θ = π/2 all momentum flips are irrelevant
Symplectic Integrators: I
Baker–Campbell–Hausdorff (BCH) formula
If A and B belong to any (non-commutative) algebra then $e^A e^B = e^{A+B+\delta}$, where δ is constructed from commutators of A and B (i.e., it lies in the Free Lie Algebra generated by {A, B})
More precisely, $\ln\bigl(e^A e^B\bigr) = \sum_{n\ge1} c_n$, where $c_1 = A + B$ and each higher term $c_n$ is a homogeneous polynomial of degree n in the (nested) commutators of A and B, computable recursively
Symplectic Integrators: II
Explicitly, the first few terms are
$$\ln\bigl(e^A e^B\bigr) = A + B + \tfrac12[A,B] + \tfrac1{12}\bigl([A,[A,B]] - [B,[A,B]]\bigr) - \tfrac1{24}[B,[A,[A,B]]] + O(\text{degree }5)$$
(the degree-5 terms, with coefficients proportional to 1/720, involve nested commutators such as [A,[A,[A,[A,B]]]], [B,[A,[A,[A,B]]]], and so on)
In order to construct reversible integrators we use symmetric symplectic integrators
The following identity follows directly from the BCH formula:
$$\ln\bigl(e^{A/2}\,e^{B}\,e^{A/2}\bigr) = A + B - \tfrac1{24}\bigl([A,[A,B]] + 2[B,[A,B]]\bigr) + O(\text{degree }5)$$
(here the degree-5 coefficients are proportional to 7/5760, etc.)
Symplectic Integrators: III
We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian
$$H(q,p) = T(p) + S(q) = \tfrac12 p^2 + S(q)$$
The basic idea of such a symplectic integrator is to write the time evolution operator as
$$e^{\tau\hat H} = \exp\!\left[\tau\!\left(\dot q\,\frac{\partial}{\partial q} + \dot p\,\frac{\partial}{\partial p}\right)\right] = \exp\!\left[\tau\!\left(\frac{\partial H}{\partial p}\frac{\partial}{\partial q} - \frac{\partial H}{\partial q}\frac{\partial}{\partial p}\right)\right] = \exp\!\left[\tau\!\left(T'(p)\frac{\partial}{\partial q} - S'(q)\frac{\partial}{\partial p}\right)\right]$$
Symplectic Integrators: IV
Define $\hat P \equiv -S'(q)\,\dfrac{\partial}{\partial p}$ and $\hat Q \equiv T'(p)\,\dfrac{\partial}{\partial q}$, so that $\hat H = \hat P + \hat Q$
Since the kinetic energy T is a function only of p, and the potential energy S is a function only of q, it follows that the action of $e^{\tau\hat P}$ and $e^{\tau\hat Q}$ may be evaluated trivially:
$$e^{\tau\hat Q}: f(q,p) \mapsto f\bigl(q + \tau\,T'(p),\; p\bigr)$$
$$e^{\tau\hat P}: f(q,p) \mapsto f\bigl(q,\; p - \tau\,S'(q)\bigr)$$
Symplectic Integrators: V
From the BCH formula we find that the PQP symmetric symplectic integrator is given by
$$U_{PQP}(\delta\tau) \equiv \left(e^{\frac{\delta\tau}{2}\hat P}\;e^{\delta\tau\hat Q}\;e^{\frac{\delta\tau}{2}\hat P}\right)^{\tau/\delta\tau}$$
$$= \exp\!\left[\tau\bigl(\hat P+\hat Q\bigr) - \frac{\tau\,\delta\tau^2}{24}\Bigl([\hat P,[\hat P,\hat Q]] + 2[\hat Q,[\hat P,\hat Q]]\Bigr) + O(\delta\tau^4)\right] = e^{\tau\hat H'},\qquad \hat H' = \hat H + O(\delta\tau^2)$$
In addition to conserving energy to O(δτ²), such symmetric symplectic integrators are manifestly area preserving and reversible
Symplectic Integrators: VI
For each symplectic integrator there exists a Hamiltonian H′ which is exactly conserved
This is given by substituting Poisson brackets for commutators in the BCH formula:
$$H' = T + S - \frac{\delta\tau^2}{24}\Bigl(\{S,\{S,T\}\} + 2\,\{T,\{S,T\}\}\Bigr) + O(\delta\tau^4)$$
(the O(δτ⁴) terms, with coefficients proportional to 1/5760, involve higher nested brackets such as $\{S,\{S,\{S,\{S,T\}\}\}\}$, $\{T,\{S,\{S,\{S,T\}\}\}\}$, …)
Symplectic Integrators: VII
Poisson brackets form a Lie algebra:
$$\{A,B\} \equiv \frac{\partial A}{\partial q}\frac{\partial B}{\partial p} - \frac{\partial A}{\partial p}\frac{\partial B}{\partial q}$$
$$\{A,B\} = -\{B,A\}$$
$$\{A,\{B,C\}\} + \{B,\{C,A\}\} + \{C,\{A,B\}\} = 0$$
Homework exercise: verify the Jacobi identity
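A sketch of the homework check for polynomial test functions of (q, p); the monomial-dictionary representation and the three test functions are arbitrary choices, not part of the lectures:

```python
from collections import defaultdict

# A polynomial in (q, p) is a dict {(i, j): coeff} meaning coeff * q^i * p^j.
def diff(f, var):  # var = 0 for d/dq, 1 for d/dp
    out = defaultdict(float)
    for (i, j), c in f.items():
        e = (i, j)[var]
        if e:
            out[(i - 1, j) if var == 0 else (i, j - 1)] += c * e
    return out

def mul(f, g):
    out = defaultdict(float)
    for (i, j), c in f.items():
        for (k, l), d in g.items():
            out[(i + k, j + l)] += c * d
    return out

def add(f, g):
    out = defaultdict(float, f)
    for m, c in g.items():
        out[m] += c
    return out

def poisson(f, g):  # {f,g} = df/dq dg/dp - df/dp dg/dq
    out = defaultdict(float, mul(diff(f, 0), diff(g, 1)))
    for m, c in mul(diff(f, 1), diff(g, 0)).items():
        out[m] -= c
    return out

# Three arbitrary polynomial test functions
A = {(2, 1): 1.0, (0, 3): -2.0}   # q^2 p - 2 p^3
B = {(1, 1): 3.0, (4, 0): 1.0}    # 3 q p + q^4
C = {(0, 2): 1.0, (3, 1): -1.0}   # p^2 - q^3 p

jacobi = add(add(poisson(A, poisson(B, C)),
                 poisson(B, poisson(C, A))),
             poisson(C, poisson(A, B)))
print(all(abs(c) < 1e-9 for c in jacobi.values()))  # True: all terms cancel
```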
Symplectic Integrators: VIII
The explicit result for H’ is
2 2 2124
44 2 2 2 4 61720
2
6 2 3
H H p S S
p S p SS S S S O
Note that H’ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential termAs H’ is conserved, δH is of O(δτ 2) for trajectories of arbitrary length
Even if τ = O (δτ -k) with k > 1
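A numerical illustration of δH = O(δτ²) for the PQP leapfrog; the quartic action S(q) = q⁴/4 and the initial condition are assumptions for the sketch:

```python
def leapfrog_max_dH(q, p, dtau, n, S, grad_S):
    """Run n PQP leapfrog steps, returning the largest |H - H0| along the way."""
    H0 = 0.5 * p * p + S(q)
    max_dH = 0.0
    for _ in range(n):
        p -= 0.5 * dtau * grad_S(q)   # half kick
        q += dtau * p                 # drift
        p -= 0.5 * dtau * grad_S(q)   # half kick
        max_dH = max(max_dH, abs(0.5 * p * p + S(q) - H0))
    return max_dH

S = lambda q: 0.25 * q ** 4           # assumed anharmonic toy action
grad_S = lambda q: q ** 3

# Same trajectory length tau = 1, two step sizes differing by a factor of 2
e1 = leapfrog_max_dH(1.0, 0.3, 0.02, 50, S, grad_S)
e2 = leapfrog_max_dH(1.0, 0.3, 0.01, 100, S, grad_S)
print(e1 / e2)  # ≈ 4: |δH| = O(δτ²), so halving δτ reduces it ~4×
```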
Symplectic Integrators: IX
Multiple timescales: split the Hamiltonian into pieces
$$H(q,p) = T(p) + S_1(q) + S_2(q)$$
Define $\hat P_i \equiv -S_i'(q)\,\dfrac{\partial}{\partial p}$ and $\hat Q \equiv T'(p)\,\dfrac{\partial}{\partial q}$, so that $\hat H = \hat P_1 + \hat P_2 + \hat Q$
Introduce a nested (Sexton–Weingarten) symmetric symplectic integrator of the form
$$U_{SW}(\delta\tau) \equiv \left[e^{\frac{\delta\tau}{2}\hat P_2}\left(e^{\frac{\delta\tau}{4n}\hat P_1}\;e^{\frac{\delta\tau}{2n}\hat Q}\;e^{\frac{\delta\tau}{4n}\hat P_1}\right)^{\!n}\,e^{\frac{\delta\tau}{2}\hat P_2}\right]^{\tau/\delta\tau}$$
If the inner step count n is chosen so that each sub-step contributes a comparable force, the instability in the integrator is tickled equally by each sub-step
This helps if the most expensive force computation does not correspond to the largest force
Dynamical fermions: I
Pseudofermions
Direct simulation of Grassmann fields is not feasible
The problem is not that of manipulating anticommuting values in a computer
$\bar\psi$ and $\psi$ always occur quadratically, and the overall sign of the exponent is unimportant
It is that $e^{-S_F} = e^{-\bar\psi M\psi}$ is not positive, and thus we get poor importance sampling
We therefore integrate out the fermion fields to obtain the fermion determinant
$$\int d\bar\psi\,d\psi\; e^{-\bar\psi M\psi} = \det M$$
Dynamical fermions: II
Any operator can be expressed solely in terms of the bosonic fields
Use Schwinger's source method: add source terms $\bar\eta\psi + \bar\psi\eta$ for the fermion fields, integrate out the fermions, and differentiate with respect to the sources (setting $\eta = \bar\eta = 0$ at the end)
E.g., the fermion propagator is
$$G(x,y) \equiv \bigl\langle\psi(x)\,\bar\psi(y)\bigr\rangle = M^{-1}(x,y)$$
Dynamical fermions: III
Represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel:
$$\det M = \int d\bar\phi\,d\phi\; e^{-\bar\phi\,M^{-1}\phi}$$
The fermion kernel must be positive definite (all its eigenvalues must have positive real parts), otherwise the bosonic integral will not converge
The new bosonic fields are called pseudofermions
Including the determinant as part of the observable to be measured,
$$\langle\Omega\rangle = \frac{\displaystyle\int d\phi\;\Omega\,\det M\;e^{-S_B}}{\displaystyle\int d\phi\;\det M\;e^{-S_B}},$$
is not feasible: the determinant is extensive in the lattice volume, thus again we get poor importance sampling
Dynamical fermions: IV
It is usually convenient to introduce two flavours of fermion and to write
$$(\det M)^2 = \det\bigl(M^\dagger M\bigr) = \int d\bar\phi\,d\phi\; e^{-\bar\phi\,(M^\dagger M)^{-1}\phi}$$
This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying $M^\dagger$ to a random Gaussian-distributed field
The evaluation of the pseudofermion action and the corresponding force then requires us to find the solution of a (large) set of linear equations, $(M^\dagger M)\,\chi = \phi$
The equations of motion for the boson (gauge) fields are
$$\dot p = -\frac{\partial S_B}{\partial q} + \bar\phi\,(M^\dagger M)^{-1}\left[\frac{\partial M^\dagger}{\partial q}\,M + M^\dagger\,\frac{\partial M}{\partial q}\right](M^\dagger M)^{-1}\phi$$
Dynamical fermions: V
It is not necessary to carry out the inversions required for the equations of motion exactly
There is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step
The inversions required to compute the pseudofermion action for the accept/reject step do need to be computed exactly, however
We usually take "exactly" to be synonymous with "to machine precision"
Reversibility: I
Are HMC trajectories reversible and area preserving in practice?
The only fundamental source of irreversibility is the rounding error caused by using finite precision floating point arithmetic
For fermionic systems we can also introduce irreversibility by choosing the starting vector for the iterative linear equation solver time-asymmetrically
We do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector
Floating point arithmetic is not associative
It is more natural to store compact variables as scaled integers (fixed point)
Saves memory
Does not solve the precision problem
Reversibility: II
Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially
The underlying continuous time equations of motion are chaotic
Ляпунов exponents characterise the divergence of nearby trajectories
The instability in the integrator occurs when δH ≫ 1
Zero acceptance rate anyhow
Reversibility: III
In QCD the Ляпунов exponents appear to scale such that λ ≈ constant (in physical units) as the system approaches the continuum limit
This can be interpreted as saying that the Ляпунов exponent characterises the chaotic nature of the continuum classical equations of motion, and is not a lattice artefact
Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit
Caveat: data is only for small lattices, and is not conclusive
Polynomial approximation
What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0,1]?
Best with respect to the appropriate norm:
$$\|p - f\|_n \equiv \left[\int_0^1 dx\;\bigl|p(x) - f(x)\bigr|^n\right]^{1/n}$$
where n ≥ 1
Weierstraß’ theorem
Weierstraß: Any continuous function can be arbitrarily well approximated by a polynomial
Taking n → ∞ this is the minimax norm: the best polynomial minimises
$$\min_{p}\,\max_{0\le x\le1}\bigl|p(x) - f(x)\bigr|$$
Бернштейн polynomials
The explicit solution is provided by the Бернштейн polynomials
$$p_n(x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right)\binom{n}{k}\,x^k\,(1-x)^{n-k}$$
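A sketch of this construction in Python (the test function sin(πx) is an arbitrary choice); note that the convergence, while uniform, is only O(1/n) even for smooth f:

```python
from math import comb, sin, pi

def bernstein(f, n, x):
    """p_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k)."""
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

# Approximate a smooth test function on [0,1] and measure the max error on a grid.
f = lambda x: sin(pi * x)
err = max(abs(bernstein(f, 200, i / 100) - f(i / 100)) for i in range(101))
print(err)  # small, but shrinking only like 1/n
```

The slow convergence is one motivation for the minimax and rational approximations discussed on the following slides.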
Чебышев's theorem
Чебышев: There is always a unique polynomial of any degree d which minimises
$$\|p - f\|_\infty \equiv \max_{0\le x\le1}\bigl|p(x) - f(x)\bigr|$$
The error |p(x) − f(x)| reaches its maximum at exactly d + 2 points on the unit interval
Чебышев's theorem: Necessity
Suppose p − f has less than d + 2 extrema of equal magnitude
Then at most d + 1 maxima exceed some magnitude; this defines a "gap"
We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation), and whose magnitude is smaller than the "gap"
The polynomial p + q is then a better approximation than p to f
Чебышев's theorem: Sufficiency
Suppose there is a polynomial p′ with $\|p' - f\|_\infty \le \|p - f\|_\infty$
Then at each of the d + 2 extrema, $\bigl|p'(x_i) - f(x_i)\bigr| \le \bigl|p(x_i) - f(x_i)\bigr|$, so p′ − p alternates in sign (or vanishes) there
Therefore p′ − p must have d + 1 zeros on the unit interval
Thus p′ − p = 0, as it is a polynomial of degree d
Чебышев polynomials
The notation $T_d$ is an old transliteration of Чебышев!
The best approximation of degree d − 1 over [−1,1] to $x^d$ is
$$p_{d-1}(x) = x^d - 2^{1-d}\,T_d(x)$$
where the Чебышев polynomials are
$$T_d(x) = \cos\bigl(d\cos^{-1}x\bigr)$$
The error is
$$\max_{x}\bigl|x^d - p_{d-1}(x)\bigr| = \frac{\max_x|T_d(x)|}{2^{d-1}} = 2\,e^{-d\ln2}$$
Convergence is often exponential in d
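A quick numerical check of this minimax error for d = 5 (the grid resolution is an arbitrary choice):

```python
from math import cos, acos

def T(d, x):
    """Chebyshev polynomial T_d(x) = cos(d arccos x) on [-1, 1]."""
    return cos(d * acos(x))

d = 5
# Best degree d-1 approximation to x^d on [-1,1]: p(x) = x^d - 2^(1-d) T_d(x)
p = lambda x: x ** d - 2 ** (1 - d) * T(d, x)
xs = [-1 + 2 * i / 2000 for i in range(2001)]
err = max(abs(x ** d - p(x)) for x in xs)
print(err)  # 2^(1-d) = 0.0625, the minimax error, reached at d+2 points
```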
Чебышев rational functions
Чебышев's theorem is easily extended to rational approximations
Rational functions usually give a much better approximation than polynomials
Rational functions with nearly equal degree numerator and denominator are usually best
Convergence is still often exponential
A simple (but somewhat slow) numerical algorithm for finding the optimal Чебышев rational approximation was given by Ремез
Чебышев rationals: Example
A realistic example of a rational approximation is
$$r(x) = 0.3904603901\;\frac{(x + 2.3475661045)\,(x + 0.1048344600)\,(x + 0.00730638141)}{(x + 0.4105999719)\,(x + 0.0286165446)\,(x + 0.0012779193)}$$
This is accurate to within almost 0.1% over the range [0.003, 1]
Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly
The partial fraction expansion of the rational function above appears to be numerically stable
Polynomials versus rationals
Optimal L² approximation with weight $1/\sqrt{1-x^2}$ (to |x|, say) is the truncated Чебышев series
$$|x| = \frac{2}{\pi} - \frac{4}{\pi}\sum_{j=1}^{n}(-1)^j\,\frac{T_{2j}(x)}{4j^2-1} + \cdots$$
This has L² error of O(1/n)
Optimal L∞ approximation cannot be too much better (or it would lead to a better L² approximation)
Золотарев's formula has an L∞ error that falls exponentially in n
Non-linearity of CG solver
Suppose we want to solve A²x = b for Hermitian A by CG
It is better to solve Ax = y, Ay = b successively
Condition number κ(A²) = κ(A)²
The cost is thus 2κ(A) < κ(A²) in general
Suppose we want to solve Ax = b
Why don't we solve A^{1/2}x = y, A^{1/2}y = b successively?
The square root of A is uniquely defined if A > 0
This is the case for fermion kernels
All this generalises trivially to nth roots
No tuning is needed to split the condition number evenly
How do we apply the square root of a matrix?
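A small numerical check of κ(A²) = κ(A)², using a random symmetric positive definite matrix as a stand-in for a fermion kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random symmetric positive definite matrix (an assumed stand-in for M†M)
X = rng.standard_normal((50, 50))
A = X @ X.T + 50 * np.eye(50)

kappa = np.linalg.cond(A)
kappa2 = np.linalg.cond(A @ A)
print(kappa2 / kappa ** 2)  # ≈ 1: squaring the matrix squares the condition number
```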
Rational matrix approximation
Functions on matrices
Defined for a Hermitian matrix by diagonalisation: H = U D U⁻¹
f(H) = f(U D U⁻¹) = U f(D) U⁻¹
Rational functions do not require diagonalisation:
H^m + H^n = U (D^m + D^n) U⁻¹
H⁻¹ = U D⁻¹ U⁻¹
Rational functions have nice properties
Cheap (relatively)
Accurate
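A sketch of f(H) = U f(D) U⁻¹ in Python, applied to the matrix square root (the random Hermitian positive definite H is an assumption for illustration):

```python
import numpy as np

def matrix_function(H, f):
    """f(H) = U f(D) U^{-1} for Hermitian H, via diagonalisation."""
    d, U = np.linalg.eigh(H)          # H = U diag(d) U^dagger
    return (U * f(d)) @ U.T.conj()    # scale columns of U by f(eigenvalues)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 30))
H = X @ X.T + 30 * np.eye(30)        # Hermitian positive definite

R = matrix_function(H, np.sqrt)      # H^{1/2}
print(np.allclose(R @ R, H))         # True: the square root squares back to H
```

In production codes this diagonalisation is precisely what one wants to avoid; the rational approximation applied with a multishift solver achieves the same effect without it.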
No Free Lunch Theorem
We must apply the rational approximation with each CG iteration
M^{1/n} → r(M)
The condition number for each term in the partial fraction expansion is approximately κ(M)
So the cost of applying M^{1/n} is proportional to κ(M)
Even though the condition number κ(M^{1/n}) = κ(M)^{1/n}
And even though κ(r(M)) ≈ κ(M)^{1/n}
So we don’t win this way…
Pseudofermions
We want to evaluate a functional integral including the fermionic determinant det M
$$\det M = \int d\bar\phi\,d\phi\; e^{-\bar\phi\,M^{-1}\phi}$$
We write this as a bosonic functional integral over a pseudofermion field with kernel M -1
Multipseudofermions
We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral
This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields
This increases the maximum fermion force
This triggers the integrator instability
This requires decreasing the integration step size
A better estimate is det M = [det M^{1/n}]^n:
$$\det M = \left[\det M^{1/n}\right]^{n} = \int\prod_{j=1}^{n} d\bar\phi_j\,d\phi_j\;\exp\!\Bigl(-\sum_{j=1}^{n}\bar\phi_j\,M^{-1/n}\phi_j\Bigr)$$
Violation of NFL Theorem
So let's try using our nth root trick to implement multipseudofermions
Condition number κ(r(M)) = κ(M)^{1/n}
So the maximum force is reduced by a factor of n·κ(M)^{(1/n)−1}
This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues
This is so in the case of interest
Cost reduced by a factor of n·κ(M)^{(1/n)−1}
Optimal value n_opt ≈ ln κ(M)
So the optimal cost reduction is (e ln κ)/κ
This works!
Rational Hybrid Monte Carlo: I
The RHMC algorithm for the fermionic kernel $(M^\dagger M)^{1/2n}$ (one factor of a multipseudofermion decomposition):
Generate the pseudofermion from a Gaussian heatbath: draw ξ with $P(\xi)\propto e^{-\frac12\xi^\dagger\xi}$ and set $\phi = (M^\dagger M)^{1/4n}\,\xi$, so that
$$P(\phi) \propto e^{-\frac12\,\phi^\dagger(M^\dagger M)^{-1/2n}\phi}$$
Use an accurate rational approximation $r(x) \approx x^{1/4n}$ for the heatbath
Use a less accurate approximation $\bar r(x) \approx x^{-1/2n}$ for the MD evolution
$\bar r(x) \ne r(x)^{-2}$, so there are no double poles
Use the accurate approximation for the Metropolis acceptance step
Rational Hybrid Monte Carlo: II
Reminders
Apply rational approximations using their partial fraction expansions
Denominators are all just shifts of the original fermion kernel
All poles of optimal rational approximations are real and positive for the cases of interest (Miracle #1)
Only simple poles appear (by construction!)
Use a multishift solver to invert all the partial fractions using a single Krylov space
Cost is dominated by the Krylov space construction, at least for O(20) shifts
Result is numerically stable, even in 32-bit precision
All partial fractions have positive coefficients (Miracle #2)
MD force term is of the usual form for each partial fraction
Applicable to any kernel
Comparison with R algorithm: I
Algorithm   δt       A      B4
R           0.0019          1.56(5)
R           0.0038          1.73(4)
RHMC        0.055    84%    1.57(2)
Binder cumulant of chiral condensate, B4, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 83×4, mud = 0.0076, ms = 0.25, τ= 1.0)
Comparison with R algorithm: II
Algorithm   mud    ms     δt       A       P
R           0.04   0.04   0.01             0.60812(2)
R           0.02   0.04   0.01             0.60829(1)
R           0.02   0.04   0.005            0.60817
RHMC        0.04   0.04   0.02     65.5%   0.60779(1)
RHMC        0.02   0.04   0.0185   69.3%   0.60809(1)
The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 163×32×8, DBW2 gauge action, β = 0.72)
The step-size variation of the plaquette with mud =0.02
Comparison with R algorithm: III
The integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with mud = 0.04
Multipseudofermions with multiple timescales
Semiempirical observation: The largest force from a single pseudofermion does not come from the smallest shift
For example, look at the numerators in the partial fraction expansion we exhibited earlier
Make use of this by using a coarser timescale for the more expensive smaller shifts
[Figure: bar chart showing, for each shift ln β in the partial fraction expansion, the relative size of the residue α, the L² force, α/(β + 0.125), and the CG iteration count]
Berlin Wall for Wilson fermions
HMC Results
C Urbach, K Jansen, A Schindler, and U Wenger, hep-lat/0506011, hep-lat/0510064
Comparable performance to Lüscher's SAP algorithm
RHMC?
t.b.a.
Conclusions (RHMC)
Advantages of RHMC
Exact
No step-size errors; no step-size extrapolations
Significantly cheaper than the R algorithm
Allows easy implementation of Hasenbusch (multipseudofermion) acceleration
Further improvements possible
Such as multiple timescales for different terms in the partial fraction expansion
Disadvantages of RHMC
???
QCD Machines: I
We want a cost-effective computer to solve interesting scientific problems
In fact we wanted a computer to solve lattice QCD
But it turned out that there is almost nothing that was not applicable to many other problems too
Not necessary to solve all problems with one architecture
Demise of the general-purpose computer?
Development cost « hardware cost for one large machine
Simple OS and software model
Interleave a few long jobs without time- or space-sharing
QCD Machines: II
Take advantage of mass market components
It is not cost- or time-effective to compete with the PC market in designing custom chips
Use standard software and tools whenever possible
Do not expect compilers or optimisers to do anything particularly smart
Parallelism has to be built into algorithms and programs from the start
Hand code critical kernels in assembler
And develop these along with the hardware design
QCD Machines: III
Parallel applications
Many real-world applications are intrinsically parallel
Because they are approximations to continuous systems
Lattice QCD is a good example
The lattice is a discretisation of four-dimensional space-time
Lots of arithmetic on small complex matrices and vectors
Relatively tiny amount of I/O required
Amdahl's law
The amount of parallel work may be increased by working on a larger volume
The relevant parameter is σ, the number of lattice sites per processor
Strong Scaling
The amount of computation $V^\delta$ required to equilibrate a system of volume V increases faster than linearly
If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one
For lattice QCD we have algorithms with δ = 5/4
We are therefore driven to as small a value of σ as possible
This corresponds to "thin nodes," as opposed to the "fat nodes" of PC clusters
Clusters are competitive in price/performance up to a certain maximum problem size
This borderline increases with time, of course
Data Parallelism
All processors run the same code
Not necessarily SIMD, where they share a common clock
Synchronization on communication, or at explicit barriers
Types of data parallel operations
Pointwise arithmetic
Nearest neighbour shifts
Perhaps simultaneously in several directions
Global operations
Broadcasts, sums, and other reductions
Alternative Paradigms
Multithreading
Parallelism comes from running many separate, more-or-less independent threads
Recent architectures propose running many light-weight threads on each processor to overcome memory latency
But what are the threads that need almost no memory?
Calculating zillions of digits of π?
Cryptography?
Computational Grids
In the future carrying out large scale computations using the Grid will be as easy as plugging into an electric socket
Hardware Choices
Cost/MFlop
Goal is to be about 10 times more cost-effective than commercial machines
Otherwise it is not worth the effort and risk of building our own machine
Our current Teraflops machines cost about $1/MFlop
For a Petaflops machine we will need to reach about $1/GFlop
Power/cooling
Most cost-effective to use low-power components and high-density packaging
Life is much easier if the machine can be air cooled
Clock Speed
Peak/Sustained speed
The sustained performance should be about 20%–50% of peak
Otherwise there is either too much or too little floating point hardware
Real applications rarely have equal numbers of adds and multiplies
Clock speed
The "sweet spot" is currently at 0.5–1 GHz chips
High-performance chips running at 3 GHz are both hot and expensive
Using a moderate clock speed makes electrical design issues such as clock distribution much simpler
Memory Systems
Memory bandwidth
This is currently the main bottleneck in most architectures
There are two obvious solutions
Data prefetching
Vector processing is one way of doing this
Requires more sophisticated software
Feasible for our class of applications because the control flow is essentially static (almost no data dependencies)
Hierarchical memory system (NUMA)
We make use of both approaches
Memory Systems
On-chip memory
For QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM memory
Cache
Whether the on-chip memory is managed as a cache or as directly addressable memory is not too important
Cache flushing for communications DMA is a nuisance
Caches are built into most μ-processors, so it is not worth designing our own!
Off-chip memory
The amount of off-chip memory is determined by cost
If the cost of the memory is more than 50% of the total cost then buy more processing nodes instead
After all, the processors are almost free!
Communications Network: I
This is where a massively parallel computer differs from a desktop or server machine
In the future the network will become the principal bottleneck for large data-parallel applications
We will end up designing networks and decorating them with processors and memories, which are almost free
Communications Network
Topology
Grid
This is easiest to build
How many dimensions?
QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
Extra dimensions allow easier partitioning of the machine
Hypercube/Fat tree/Ω network/Butterfly network
These are all essentially infinite-dimensional grids
Good for FFTs
Switch
Expensive, and does not scale well
Global Operations
[Figure: global sum using grid wires — the values 1 2 3 4 and 5 6 7 8 are combined into partial sums 6 8 10 12 and then into the total 36]
Grid wires
Not very good error propagation
O(N^(1/4)) latency (grows as the perimeter of the grid)
O(N) hardware
Combining network
A tree which can perform arithmetic at each node
Useful for global reductions
But global operations can be performed using grid wires
It is all a matter of cost
Used in BGL, not in QCDx
Global Operations
[Figure: global sum using a combining tree — leaves 1 2 3 4 5 6 7 8 combine pairwise into 3 7 11 15, then 10 26, then the total 36]
Combining tree
Good error propagation
O(log N) latency
O(N log N) hardware
Bit-wise operations
Allow arbitrary precision
Used on QCDSP
Not used on QCDOC or BGL, because data is sent byte-wise (for ECC)
Global Operations
[Figure: worked examples of bit-serial global operations. Max is computed MSB first: each node compares the incoming bit stream with its own value (e.g. 110100 and 111000), selecting the larger stream at the first bit where they differ, and the winner is broadcast. Sum is computed LSB first with a carry latch: e.g. 001011 + 000111 = 010010.]
Communications Network: II
Parameters
Bandwidth
Latency
Packet size
The ability to send small [O(10²) bytes] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)
Control (DMA)
The DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU
Block-strided moves
I/O completion can be polled (or interrupt driven)
Packet Size
Block-strided Moves
[Figure: a block-strided move — a 16-word array (words 0–15) is gathered as four blocks of four words each, with a fixed stride between successive source blocks; the labelled parameters are the block size and the stride]
For each direction separately we specify in memory-mapped registers:
Source starting address
Target starting address
Block size
Stride
Number of blocks
Hardware Design Process
Optimise machine for applications of interest
We run simulations of lattice QCD to tune our design
Most design tradeoffs can be evaluated using a spreadsheet, as the applications are mainly static (data independent)
It also helps to debug the machine if you use the actual kernels you will run in production as test cases!
But running QCD even on an RTL simulator is painfully slow
Circuit design is done using hardware description languages (e.g., VHDL)
Time to completion is critical
Trade-off between risk of respin and delay in tape-out
VHDL
entity lcount5 is
  port (
    signal clk, res:   in vlbit;
    signal reset, set: in vlbit;
    signal count:      inout vlbit_1d(4 downto 0));
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res='1';
    if res='1' then
      count <= b"00000";
    else
      if reset='1' then
        count <= b"00000";
      elsif set='1' then
        count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
        count(3) <= count(3) xor (count(0) and count(1) and count(2));
        count(2) <= count(2) xor (count(0) and count(1));
        count(1) <= count(1) xor count(0);
        count(0) <= not count(0);