Algorithms for Dynamical Fermions

Thursday 20 April 2023NCTS 2006 School on Modern Numerical

Methods in Mathematics and Physics

Algorithms for Algorithms for Dynamical FermionsDynamical Fermions

A D KennedyA D KennedySchool of Physics, The University of EdinburghSchool of Physics, The University of Edinburgh

Thursday 20 April 2023

A D Kennedy 2

Outline

Monte Carlo integrationImportance samplingMarkov chainsDetailed balance, Metropolis algorithmSymplectic integratorsHybrid Monte CarloPseudofermionsRHMCQCD Computers


A D Kennedy 3

Monte Carlo methods: I

Monte Carlo integration is based on the identification of probabilities with measures

There are much better methods of carrying out low dimensional quadrature

All other methods become hopelessly expensive for large dimensions

In lattice QFT there is one integration per degree of freedom

We are approximating an infinite dimensional functional integral


A D Kennedy 4

Measure the value of on each configuration and compute the average

1

1( )

N

ttN

Monte Carlo methods: II

Generate a sequence of random field configurations chosen from the probability distribution

1 2 t N, , , , ,

( )1( ) tS

t tP d eZ


A D Kennedy 5

Central Limit Theorem: I

Distribution of values for a single sample = ()

( )P d P

Central Limit Theorem

where the variance of the distribution of is

2CNO

22C

Law of Large Numbers limN

The Laplace–DeMoivre Central Limit theorem is an asymptotic expansion for the probability distribution of


A D Kennedy 6

3

0 3

4 21 4 2

2

2

0

3

C C

C C C

C

The first few cumulants are

Central Limit Theorem: II

Note that this is an asymptotic expansion

Generating function for connected moments

( ) ln ( ) ikW k d P e ln lnik ikd P e e

0 !

n

nn

ikC

n


A D Kennedy 7

Central Limit Theorem: III

Connected generating function

lnN

ik Nd P e

1 11

ln expN

N N tt

ikd d P P

N

( ) ln ( ) ikW k d P e

1

1 !

n

nn

n

ik Cn N

ln ik NN e

kNW

N

1 11

1 N

N N tt

P d d P PN

Distribution of the average of N samples


A D Kennedy 8

3 4

3 4212 3 3 4

223! 4!2

N

C Cd dik ik CN N dk ikd de e e

Take inverse Fourier transform to obtain distribution P

12

W k ikP dk e e

23 4

23 4 22 3 3 43! 4!

2 2

C Cd d C N

N Nd d

C N

ee

Central Limit Theorem: IV


A D Kennedy 9

Central Limit Theorem: V

where and

N

22

2 23 2

32 2

31

6 2

CC C eF

C N C

dP F

d

Re-scale to show convergence to Gaussian distribution


A D Kennedy 10

Importance Sampling: I

( ) 1N dx x

Sample from distributionProbability

Normalisation

0 ( ) . .x a e

( )dx f x Integral

( )( )

( )f x

I x dxx

Estimator of integral

Estimator of variance2 2

2( ) ( )( )

( ) ( )f x f x

V x dx I dx Ix x


A D Kennedy 11

Importance Sampling: II

2

2

( ) ( )0

( ) ( )V N f y

y y

Minimise varianceConstraint N=1Lagrange multiplier

( )( )

( )opt

f xx

dx f x

Optimal measure

Optimal variance 2 2

opt ( ) ( )V dx f x dx f x


A D Kennedy 12

( ) sin( )f x x Example:

12( ) sin( )x x Optimal weight

2 22 2

opt0 0

sin( ) sin( ) 16V dx x dx x

Optimal variance

2

0sindx x

Importance Sampling: III


A D Kennedy 13

Importance Sampling: IV2

0sindx x

1 Construct cheap approximation to |sin(x)|

2 Calculate relative area within each rectangle

5 Choose another random number uniformly

4 Select corresponding rectangle

3 Choose a random number uniformly

6 Select corresponding x value within rectangle

7 Compute |sin(x)|


A D Kennedy 14

Importance Sampling: V

2 2

0 0

sin( ) sin( ) sin( ) sin( )I dx x x dx x x

But we can do better!

opt 0V For which

With 100 rectangles we have V = 16.02328561

With 100 rectangles we have V = 0.011642808

2

0sindx x


A D Kennedy 15

Markov Chains: I

State space (Ergodic) stochastic transitions P’:

Distribution converges to unique fixed point

Q

Deterministic evolution of probability distribution P: Q Q


A D Kennedy 16

The sequence Q, PQ, P²Q, P³Q,… is Cauchy

Define a metric on the space of (equivalence classes of) probability distributions

1 2 1 2,d Q Q dx Q x Q x

Prove that with > 0, so the Markov process P is a contraction mapping

1 2 1 2, 1 ,d PQ PQ d Q Q

The space of probability distributions is complete, so the sequence converges to a unique fixed pointlim n

nQ P Q

Convergence of Markov Chains: I


A D Kennedy 17

1 2 1 2,d PQ PQ dx PQ x PQ x 1 2dx dy P x y Q y dy P x y Q y

dx dy P x y Q y Q y Q y

dx dy P x y Q y

2 mindx dy P x y Q y Q y

dx dy P x y Q y

2min ,a b a b a b

1 2Q y Q y Q y

1y y

Convergence of Markov Chains: II


A D Kennedy 18

2 mindy Q y dx dy P x y Q y Q y

2 inf min

ydy Q y dx P x y dy Q y Q y

infy

dy Q y dx P x y dy Q y 1 2(1 ) ,d Q Q

1 2 1 1 0dy Q y dy Q y dy Q y dy Q y Q y dy Q y Q y

1dx P x y

12dy Q y Q y dy Q y

0 infy

dx P x y

1 2,d PQ PQ

Convergence of Markov Chains: III


A D Kennedy 19

Markov Chains: II

Use Markov chains to sample from QSuppose we can construct an ergodic Markov process P which has distribution Q as its fixed point

Start with an arbitrary state (“field configuration”)

Iterate the Markov process until it has converged (“thermalized”)

Thereafter, successive configurations will be distributed according to Q

But in general they will be correlated

To construct P we only need relative probabilities of states

Do not know the normalisation of Q

Cannot use Markov chains to compute integrals directly

We can compute ratios of integrals


A D Kennedy 20

Metropolis algorithm

min 1,Q x

P x yQ y

Markov Chains: III

How do we construct a Markov process with a specified fixed point ?

Q x dy P x y Q y Integrate w.r.t. y to obtain fixed point conditionSufficient but not necessary for fixed point

P y x Q x P x y Q y Detailed balance

Q xP x y

Q x Q y

Other choices are possible, e.g.,

Consider cases and separately to obtain detailed balance condition

( ) ( )Q x Q y ( ) ( )Q x Q y

Sufficient but not necessary for detailed balance


A D Kennedy 21

Markov Chains: IV

Composition of Markov stepsLet P1 and P2 be two Markov steps which have the desired

fixed point distribution

They need not be ergodic

Then the composition of the two steps P2P1 will also have

the desired fixed point

And it may be ergodic

This trivially generalises to any (fixed) number of steps

For the case where P1 is not ergodic but (P1 ) n is the

terminology weakly and strongly ergodic are sometimes used


A D Kennedy 22

Markov Chains: V

This result justifies “sweeping” through a lattice performing single site updates

Each individual single site update has the desired fixed point because it satisfies detailed balance

The entire sweep therefore has the desired fixed point, and is ergodic

But the entire sweep does not satisfy detailed balance

Of course it would satisfy detailed balance if the sites were updated in a random order

But this is not necessaryAnd it is undesirable because it puts too much randomness into the system


A D Kennedy 23

Markov Chains: VI

Coupling from the PastPropp and Wilson (1996)Use fixed set of random numbersFlypaper principle: If states coalesce they stay together forever

– Eventually, all states coalesce to some state with probability one

– Any state from t = - will coalesce to – is a sample from the fixed point distribution


A D Kennedy 24

Markov Chains: VII

Suppose we have a latticeThat is, a partial ordering with a largest and smallest element

And an update step that preserves itThen once the largest and smallest states have coalesced all the others must have too

a

b dc

e

g

f


A D Kennedy 25

Markov Chains: VIII

β ≥ 0Ordering is spin configuration A ≥ B iff for every site As ≥ Bs

Update is a local heatbathi jij

H ssb= å

A non-trivial example of where this is possible is the ferrormagnetic Ising model


A D Kennedy 26

Autocorrelations: I

This corresponds to the exponential autocorrelation time

expmax

10

lnN

Exponential autocorrelationsThe unique fixed point of an ergodic Markov process

corresponds to the unique eigenvector with eigenvalue

1

All its other eigenvalues must lie within the unit circlemax 1 In particular, the largest subleading eigenvalue is


A D Kennedy 27

Autocorrelations: II

2

21 1

1 N N

t tt tN

1

2

21 1 1

12

N N N

t t tt t t tN

Integrated autocorrelationsConsider the autocorrelation of some operator Ω

Without loss of generality we may assume 0

12 2 2

1

1 2 N

N CN N

2

t tC

Define the autocorrelation function


A D Kennedy 28

Autocorrelations: III

The autocorrelation function must fall faster that the

exponential autocorrelation exp

maxNC e

2

2 exp1 2 1N

A ON N

2

exp

1

1 2 1N

C ON N

For a sufficiently large number of samples

1

A C

Define the integrated autocorrelation function


A D Kennedy 29

In order to carry out Monte Carlo computations including the effects of dynamical fermions we would like to find an algorithm which

Update the fields globally

Because single link updates are not cheap if the action is not local

Take large steps through configuration space

Because small-step methods carry out a random walk which leads to critical slowing down with a dynamical critical exponent z=2

Hybrid Monte Carlo: I

Does not introduce any systematic errors

z relates the autocorrelation to the correlation length of the system,

zA


A D Kennedy 30

A useful class of algorithms with these properties is the (Generalised) Hybrid Monte Carlo (HMC) method

Introduce a “fictitious” momentum p corresponding to each dynamical degree of freedom q

Find a Markov chain with fixed point exp[-H(q,p) ] where H is the “fictitious” Hamiltonian ½p2 + S(q)

The action S of the underlying QFT plays the rôle of the potential in the “fictitious” classical mechanical system

This gives the evolution of the system in a fifth dimension, “fictitious” or computer time

This generates the desired distribution exp[-S(q) ] if we ignore the momenta p (i.e., the marginal distribution)

Hybrid Monte Carlo: II


A D Kennedy 31

The HMC Markov chain alternates two Markov steps

Molecular Dynamics Monte Carlo (MDMC)(Partial) Momentum Refreshment

Both have the desired fixed pointTogether they are ergodic

Hybrid Monte Carlo: IIIHybrid Monte Carlo: III


A D Kennedy 32

MDMC: I

If we could integrate Hamilton’s equations exactly we could follow a trajectory of constant fictitious energy

This corresponds to a set of equiprobable fictitious phase space configurationsLiouville’s theorem tells us that this also preserves the functional integral measure dp dq as required

Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test

With acceptance probability min[1,exp(-H)]


A D Kennedy 33

MDMC: II

Molecular Dynamics (MD), an approximate integrator which is exactly : , ,U q p q p

We build the MDMC step out of three parts

A Metropolis accept/reject step

Area preserving, *

,det det 1

,

q pU

q p

Reversible, 1F U F U

A momentum flip :F p p

1H Hq qF U e y y e

p p

The composition of these gives

with y being a uniformly distributed random number in [0,1]


A D Kennedy 34

The Gaussian distribution of p is invariant under F

The extra momentum flip F ensures that for small the momenta are reversed after a rejection rather than after an acceptance

For = / 2 all momentum flips are irrelevant

This mixes the Gaussian distributed momenta p with Gaussian noise

cos sin

sin cos

p pF

Partial Momentum Refreshment


A D Kennedy 35

If A and B belong to any (non-commutative)

algebra then , where constructed from commutators of A and B (i.e.,

is in the Free Lie Algebra generated by {A,B })

A B A Be e e

Symplectic Integrators: I

1 2

1 2

1 2

221

1 2, , 10

1, , , ,

1 2 ! m

m

m

nm

n n k kk km

k k n

Bc c A B c c A B

n m

More precisely, where and

1

ln A Bn

n

e e c

1c A B

Baker-Campbell-Hausdorff (BCH) formula


A D Kennedy 36

Symplectic Integrators: IIExplicitly, the first few terms are

1 1 12 12 24

1720

ln , , , , , , , ,

, , , , 4 , , , ,

6 , , , , 4 , , , ,

2 , , , , , , , ,

A Be e A B A B A A B B A B B A A B

A A A A B B A A A B

A B A A B B B A A B

A B B A B B B B A B

In order to construct reversible integrators we use symmetric symplectic integrators

2 2 124

15760

ln , , 2 , ,

7 , , , , 28 , , , ,

12 , , , , 32 , , , ,

16 , , , , 8 , , , ,

A B Ae e e A B A A B B A B

A A A A B B A A A B

A B A A B B B A A B

A B B A B B B B A B

The following identity follows directly from the BCH formula


A D Kennedy 37

Symplectic Integrators: III

exp expd dp dqdt dt p dt q

212,H q p T p S q p S q

We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian

The basic idea of such a symplectic integrator is to write the time evolution operator as

ˆexp HH He

q p p q

exp S q T pp q


A D Kennedy 38

Symplectic Integrators: IV

Define and so thatH P Q P S qp

Q T pq

Since the kinetic energy T is a function only of p and the potential energy S is a function only of q, it follows that the action of and may be evaluated triviallyPe Qe

: , ,

: , ,

Q

P

e f q p f q T p p

e f q p f q p S q


A D Kennedy 39

Symplectic Integrators: V

From the BCH formula we find that the PQP symmetric symplectic integrator is given by

1 12 2

//

0( ) P PQU e e e

3 5124exp , , 2 , ,P Q P P Q Q P Q O

2 4124exp , , 2 , ,P Q P P Q Q P Q O

ˆ 2P QHe e O

In addition to conserving energy to O (² ) such symmetric symplectic integrators are manifestly area preserving and reversible


A D Kennedy 40

Symplectic Integrators: VI

For each symplectic integrator there exists a Hamiltonian H’ which is exactly conserved

This is given by substituting Poisson brackets for commutators in the BCH formula

2124

4 615760

' , , 2 , ,

32 , , , 16 , , , ,

28 , , , , 12 , , , ,

8 , , , , 7 , , , ,

H S T T T S S T S

S S T T S T S S T S

S T T T S T S T T S O

S S S T S T T T T S


A D Kennedy 41

Symplectic Integrators: VII

, , , , , , 0A B C B C A C A B

,A B A B

A Bp q q p

Poisson brackets form a Lie algebra

Homework exercise: verify the Jacobi identity

, ,A B B A


A D Kennedy 42

Symplectic Integrators: VIII

The explicit result for H’ is

2 2 2124

44 2 2 2 4 61720

2

6 2 3

H H p S S

p S p SS S S S O

Note that H’ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential termAs H’ is conserved, δH is of O(δτ 2) for trajectories of arbitrary length

Even if τ = O (δτ -k) with k > 1


A D Kennedy 43

Symplectic Integrators: IX

Define and so that 1 2H P P Q i iP S qp

Q T pq

1 2( , ) ( ) ( ) ( )H q p T p S q S q Multiple timescales

Split the Hamiltonian into pieces

2 1 21 1 1 1 1 1 12 1 24 2 4 4 2 42 2 1 1 1 2 2

/

/SW( ) n n n n n n n

n n nQ P Q P Q P Q

U e e e e e e e

Introduce a symmetric symplectic integrator of the form

If then the instability in the integrator is

tickled equally by each sub-step

1 2

1 22

P PQ

n n

This helps if the most expensive force computation does not correspond to the largest force


A D Kennedy 44

Dynamical fermions: I

PseudofermionsDirect simulation of Grassmann fields is not feasible

The problem is not that of manipulating anticommuting values in a computer

We therefore integrate out the fermion fields to obtain the fermion determinant detMd d e M

It is that is not positive, and thus we get poorimportance sampling

FS Me e

and always occur quadratically

The overall sign of the exponent is unimportant


A D Kennedy 45

Dynamical fermions: II

1

0

, , Me

Any operator can be expressed solely in terms of the bosonic fields

Use Schwinger’s source method and integrate out the fermions

1( , ) ( , )G x y x y M x y

E.g., the fermion propagator is


A D Kennedy 46

Dynamical fermions: III

Represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel 1

det MM d d e

The fermion kernel must be positive definite (all its eigenvalues must have positive real parts) otherwise the bosonic integral will not convergeThe new bosonic fields are called pseudofermions

The determinant is extensive in the lattice volume, thus again we get poor importance sampling

Including the determinant as part of the observable to

be measured is not feasible

det

detB

B

S

S

M

M


A D Kennedy 47

Dynamical fermions: IV

It is usually convenient to introduce two flavours of fermion and to write 1†2 †det det

M MM M M d d e

The evaluation of the pseudofermion action and the corresponding force then requires us to find the solution of a (large) set of linear equations 1†M M

This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying to a random Gaussian distributed field

†M

The equations for motion for the boson (gauge) fields are

†1 1 1† † † † †B BS S

M M M M M M M M


A D Kennedy 48

Dynamical fermions: V

It is not necessary to carry out the inversions required for the equations of motion exactly

There is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step

The inversions required to compute the pseudofermion action for the accept/reject step does need to be computed exactly, however

We usually take “exactly” to by synonymous with “to machine precision”


A D Kennedy 49

Reversibility: I

Are HMC trajectories reversible and area preserving in practice?

The only fundamental source of irreversibility is the rounding error caused by using finite precision floating point arithmetic

For fermionic systems we can also introduce irreversibility by choosing the starting vector for the iterative linear equation solver time-asymmetrically

We do this if we to use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector

Floating point arithmetic is not associativeIt is more natural to store compact variables as scaled integers (fixed point)

Saves memoryDoes not solve the precision problem


A D Kennedy 50

Reversibility: II

Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially

The underlying continuous time equations of motion are chaotic

Ляпунов exponents characterise the divergence of nearby trajectories

The instability in the integrator occurs when H » 1

Zero acceptance rate anyhow


A D Kennedy 51

Reversibility: III

In QCD the Ляпунов exponents appear to scale with as the system approaches the continuum limit

= constant

This can be interpreted as saying that the Ляпунов exponent characterises the chaotic nature of the continuum classical equations of motion, and is not a lattice artefact

Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit

Caveat: data is only for small lattices, and is not conclusive


A D Kennedy 52

What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0,1] ?

Polynomial approximation

Best with respect to the appropriate norm

where n 1

1/1

0

( ) ( )n

np f dx p x f xn


A D Kennedy 53

Weierstraß’ theorem

Weierstraß: Any continuous function can be arbitrarily well approximated by a polynomial

0 1minmax ( ) ( )

p xp f p x f x

Taking n →∞ this is the Taking n →∞ this is the minimax norm norm


A D Kennedy 54

Бернштейне polynomials

0

(1 )n

n n kn

k

nkf x x

n kp x

The explicit solution is provided by Бернштейне polynomials


A D Kennedy 55

Чебышев’s theorem

The error |p(x) - f(x)| reaches its maximum at exactly d+2 points on the unit interval

0 1max

xp f p x f x

ЧебышевЧебышев: There is always : There is always a unique polynomial of any a unique polynomial of any degree d which minimises degree d which minimises


A D Kennedy 56

Чебышев’s theorem: NecessitySuppose p-f has less than d+2 extrema of equal magnitudeThen at most d+1 maxima exceed some magnitude

And whose magnitude is smaller than the “gap”The polynomial p+q is then a better approximation than p to f

This defines a “gap”We can construct a polynomial q of degree d which has the opposite sign to p-f at each of these maxima (Lagrange interpolation)


A D Kennedy 57

Чебышев’s theorem: Sufficiency

Then at each of the d+2 extrema i i i ip x f x p x f x

Thus p’ – p = 0 as it is a polynomial of degree d

Therefore p’ - p must have d+1 zeros on the unit interval

Suppose there is a polynomialp f p f


A D Kennedy 58

The notation is an old transliteration of Чебышев !

Чебышев polynomials

Convergence is often exponential in d The best approximation of degree d-1 over [-

1,1] to xd is 1

112

dd

d dp x x T x

1cos cosdT x d x

Where the Чебышев polynomials are

1 ln21 2

2

dd

d ddx p x T x e

The error is


A D Kennedy 59

Чебышев rational functions

Чебышев’s theorem is easily extended to rational approximations

Rational functions with nearly equal degree numerator and denominator are usually best

Convergence is still often exponential

Rational functions usually give a much better approximation

A simple (but somewhat slow) numerical algorithm for finding the optimal Чебышев rational approximation was given by Ремез


A D Kennedy 60

Чебышев rationals: ExampleA realistic example of a rational approximation is

x 2.3475661045 x 0.1048344600 x 0.00730638141

0.3904603901x 0.4105999719 x 0.0286165446 x 0.0012779193x

Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly.

1 0.0511093775 0.1408286237 0.59648450330.3904603901

x 0.0012779193 x 0.0286165446 x 0.4105999719x

The partial fraction expansion of the rational function above is

This is accurate to within almost 0.1% over the range [0.003,1]

This appears to be numerically stable.


A D Kennedy 61

Polynomials versus rationals

Optimal L2 approximation with weight is

2

1

1 x2 1

0

( ) 4( )

(2 1)

jn

jj

T xj

Optimal L∞ approximation cannot be too much better (or it would lead to a better L2 approximation)

lnn

e Золотарев’s formula has L∞ error

This has L2 error of O(1/n)


A D Kennedy 62

Non-linearity of CG solver

Suppose we want to solve A2x=b for Hermitian A by CG

It is better to solve Ax=y, Ay=b successivelyCondition number (A2) = (A)2

Cost is thus 2(A) < (A2) in general

Suppose we want to solve Ax=bWhy don’t we solve A1/2x=y, A1/2y=b successively?

The square root of A is uniquely defined if A>0This is the case for fermion kernels

All this generalises trivially to nth rootsNo tuning needed to split condition number evenly

How do we apply the square root of a matrix?


A D Kennedy 63

Rational matrix approximation

Functions on matricesDefined for a Hermitian matrix by diagonalisationH = U D U -1

f (H) = f (U D U -1) = U f (D) U -1

Rational functions do not require diagonalisation

H m + H n = U ( D m + D n) U -1

H -1 = U D -1 U -1

Rational functions have nice propertiesCheap (relatively) Accurate


A D Kennedy 64

No Free Lunch Theorem

We must apply the rational approximation with each CG iteration

M1/n r(M)The condition number for each term in the partial fraction expansion is approximately (M)So the cost of applying M1/n is proportional to (M)Even though the condition number (M1/n)=(M)1/n

And even though (r(M))=(M)1/n

So we don’t win this way…


A D Kennedy 65

Pseudofermions

We want to evaluate a functional integral including the fermionic determinant det M

1

det MM d d e

We write this as a bosonic functional integral over a pseudofermion field with kernel M -1


A D Kennedy 66

Multipseudofermions

We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral

This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fieldsThis increases the maximum fermion forceThis triggers the integrator instabilityThis requires decreasing the integration step size

11

detnMnM d d e

A better estimate is det M = [det M1/n]n


A D Kennedy 68

Violation of NFL Theorem

So let’s try using our nth root trick to implement multipseudofermions

Condition number (r(M))=(M)1/n

So maximum force is reduced by a factor of n(M)(1/n)-1

This is a good approximation if the condition number is dominated by a few isolated tiny eigenvaluesThis is so in the case of interest

Cost reduced by a factor of n(M)(1/n)-1

Optimal value nopt ln (M)

So optimal cost reduction is (e ln) /

This works!


A D Kennedy 69

Rational Hybrid Monte Carlo: I

Generate pseudofermion from Gaussian heatbath

RHMC algorithm for fermionic kernel 1

2† nM M

†12( )P e

14† nM M

1

1 † †1 2†14 22 †( )

nnP d e e

M MM M

Use accurate rational approximation 4( ) nr x x

Use less accurate approximation for MD, 2( ) nr x x

, so there are no double poles2( ) ( )r x r x

Use accurate approximation for Metropolis acceptance step


A D Kennedy 70

Rational Hybrid Monte Carlo: II

RemindersApply rational approximations using their partial fraction expansionsDenominators are all just shifts of the original fermion kernel

All poles of optimal rational approximations are real and positive for cases of interest (Miracle #1)Only simple poles appear (by construction!)

Use multishift solver to invert all the partial fractions using a single Krylov space

Cost is dominated by Krylov space construction, at least for O(20) shifts

Result is numerically stable, even in 32-bit precisionAll partial fractions have positive coefficients (Miracle #2)

MD force term is of the usual form for each partial fractionApplicable to any kernel


A D Kennedy 71

Comparison with R algorithm: I

Algorithm

δt A B4

R 0.0019 1.56(5)

R 0.0038 1.73(4)

RHMC 0.055 84% 1.57(2)

Binder cumulant of chiral condensate, B4, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 83×4, mud = 0.0076, ms = 0.25, τ= 1.0)


A D Kennedy 72

Comparison with R algorithm: II

Algorithm

mud ms δt A P

R 0.04 0.04 0.01 0.60812(2)

R 0.02 0.04 0.01 0.60829(1)

R 0.02 0.04 0.005 0.60817

RHMC 0.04 0.04 0.02 65.5% 0.60779(1)

RHMC 0.02 0.040.0185

69.3% 0.60809(1)

The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 163×32×8, DBW2 gauge action, β = 0.72)

The step-size variation of the plaquette with mud =0.02


A D Kennedy 73

Comparison with R algorithm: III

The integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with mud = 0.04


A D Kennedy 74

Multipseudofermions with multiple timescales

Semiempirical observation: The largest force from a single pseudofermion does not come from the smallest shift

1 0.0511093775 0.1408286237 0.59648450330.3904603901

x 0.0012779193 x 0.0286165446 x 0.4105999719x

For example, look at the numerators in the partial fraction expansion we exhibited earlier

Make use of this by using a coarser timescale for the more expensive smaller shifts

0%

25%

50%

75%

100%

-13 -10 -8.5 -7.1 -5.8 -4.4 -3.1 -1.7 -0.3 1.5

Shift [ln(β)]

Residue (α)L² Forceα/(β+0.125)CG iterations


A D Kennedy 76

Berlin Wall for Wilson fermions

HMC ResultsC Urbach, K Jansen, A Schindler, and U Wenger, hep-lat/0506011, hep-lat/0510064

Comparable performance to Lüscher’s SAP algorithmRHMC?

t.b.a.


A D Kennedy 77

Conclusions (RHMC)

Advantages of RHMCExact

No step-size errors; no step-size extrapolations

Significantly cheaper than the R algorithmAllows easy implementation of Hasenbusch (multipseudofermion) accelerationFurther improvements possible

Such as multiple timescales for different terms in the partial fraction expansion

Disadvantages of RHMC???


A D Kennedy 78

QCD Machines: I

We want a cost-effective computer to solve interesting scientific problems

In fact we wanted a computer to solve lattice QCD

But it turned out that there is almost nothing that was not applicable to many other problems too

Not necessary to solve all problems with one architecture

Demise of the general-purpose computer?

Development cost « hardware cost for one large machine

Simple OS and software model

Interleave a few long jobs without time- or space-sharing


A D Kennedy 79

QCD Machines: II

Take advantage of mass market componentsIt is not cost- or time-effective to compete with PC market in designing custom chips

Use standard software and tools whenever possible

Do not expect compilers or optimisers to anything particularly smart

Parallelism has to be built into algorithms and programs from the start

Hand code critical kernels in assemblerAnd develop these along with the hardware design


A D Kennedy 80

QCD Machines: III

Parallel applicationsMany real-world applications are intrinsically parallel

Because they are approximations to continuous systems

Lattice QCD is an good exampleLattice is a discretisation of four dimensional space-time

Lots of arithmetic on small complex matrices and vectors

Relatively tiny amount of I/O required

Amdahl’s lawAmount of parallel work may be increased working on a larger volume

Relevant parameter is σ, the number of sites per processor


A D Kennedy 81

Strong Scaling

The amount of computation V δ required to equilibrate a system of volume V increases faster than linearly

If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than oneFor lattice QCD we have algorithms with δ = 5/4

We therefore are driven to as small a value of σ as possible

This corresponds to “thin nodes,” as opposed to “fat nodes” as in PC clusters

Clusters are competitive in price/performance up to a certain maximum problem size

This borderline increase with time, of course


A D Kennedy 82

Data Parallelism

All processors run the same codeNot necessarily SIMD, where they share a common clock

Synchronization on communication, or at explicit barriers

Type of data parallel operationsPointwise arithmetic

Nearest neighbour shiftsPerhaps simultaneously in several directions

Global operationsBroadcasts, sums or other reductions


A D Kennedy 83

Alternatives Paradigms

MultithreadingParallelism comes from running many separate more-or-less independent threads

Recent architectures propose running many light-weight threads on each processor to overcome memory latency

What are the threads that need almost no memory

Calculating zillions of digits of ?

Cryptography?


A D Kennedy 84

Computational Grids

In the future carrying out large scale In the future carrying out large scale computations using the Grid will be computations using the Grid will be as easy as plugging into an electric as easy as plugging into an electric socketsocket


A D Kennedy 85

Hardware Choices

Cost/MFlopGoal is to be about 10 times more cost-effective than commercial machines

Otherwise it is not worth the effort and risk of building our own machine

Our current Teraflops machines cost about $1/MFlopFor a Petaflops machine we will need to reach about $1/GFlop

Power/coolingMost cost effective to use low-power components and high-density packagingLife is much easier if the machine can be air cooled


A D Kennedy 86

Clock Speed

Peak/Sustained speedThe sustained performance should be about 20%—50% of peakOtherwise there is either too much or too little floating point hardware

Real applications rarely have equal numbers of adds and multiplies

Clock speed“Sweet point” is currently at 0.5—1 GHz chips

High-performance chips running at 3 GHz are both hot and expensiveUsing moderate clock speed makes electrical design issues such as clock distribution much simpler


A D Kennedy 87

Memory Systems

Memory bandwidthThis is currently the main bottleneck in most architecturesThere are two obvious solutions

Data prefetchingVector processing is one way of doing thisRequires more sophisticated softwareFeasible for our class of applications because the control flow is essentially static (almost no data dependencies)

Hierarchical memory system (NUMA)We make use of both approaches


A D Kennedy 88

Memory Systems

On-chip memoryFor QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM memory

CacheWhether the on-chip memory is managed as a cache or as directly addressable memory is not too important

Cache flushing for communications DMA is a nuisanceCaches are built into most μ-processors, not worth designing our own!

Off-chip memoryThe amount of off-chip memory is determined by cost

If the cost of the memory is » 50% of the total cost then buy more processing nodes instead

After all, the processors are almost free!


A D Kennedy 89

Communications Network: I

This is where a massively parallel computer differs from a desktop or server machineIn the future the network will become the principal bottleneck

For large data parallel applicationsWe will end up designing networks and decorating them with processors and memories which are almost free


A D Kennedy 90

Communications Network

TopologyGrid

This is easiest to buildHow many dimensions?

QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6Extra dimensions allow easier partitioning of the machine

Hypercube/Fat Tree/Ω network/Butterfly networkThese are all essentially an infinite dimensional gridGood for FFTs

SwitchExpensive, and does not scale well


A D Kennedy 91

36

Global Operations

6 8 10 12

1 2 3 4

5 6 7 8

Grid wiresNot very good error propagationO(N1/4) latency (grows as the perimeter of the grid)O(N) hardware

Combining networkA tree which can perform arithmetic at each nodeUseful for global reductionsBut global operations can be performed using grid wires

It is all a matter of costUsed in BGL, not in QCDx


A D Kennedy 92

Global Operations

1 52 3 4 6 7 8

3 7 11 15

10 26

36Combining tree

Good error propagationO(log N) latencyO(N log N) hardware

Bit-wise operationsAllow arbitrary precisionUsed on QCDSPNot used on QCDOC or BGL

Because data sent byte-wise (for ECC)


A D Kennedy 93

Broadcast 110100

Global Operations

110100

111000 Select

MaxMSB first

>11010

11100 ?

0

1101

1110 ?

00

110

111 upper

100

11

11 upper

0100

1

1 upper

10100

upper

110100

110100

110100001011

000111

LSB first+

Carry

Sum

00101

00011

1

0

0010

0001

1

10

001

000

1

010

00

00

1

0010

0

0

0

10010

0

010010


A D Kennedy 94

Communications Network: II

ParametersBandwidthLatencyPacket size

The ability to send small [O(102) bytes] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)

Control (DMA)The DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPUBlock-strided movesI/O completion can be polled (or interrupt driven)


A D Kennedy 95

Packet Size


A D Kennedy 96

Block-strided Moves

1514131211109876543210

15141312

111098

7654

3210

15141312

111098

7654

3210

15141312

111098

7654

3210

1514131211109876543210 1514131211109876543210

Block size 14

Stride 312

For each direction separately we specify in memory mapped registers: Source starting address

Target starting address Block size Stride Number of blocks


A D Kennedy 97

Hardware Design Process

Optimise machine for applications of interest

We run simulations of lattice QCD to tune our designMost design tradeoffs can be evaluated using a spreadsheet as the applications are mainly static (data-independent)

It also helps to debug the machine if you use the actual kernels you will run in production as test cases!

But running QCD even on a RTL simulator is painfully slow

Circuit design done using hardware description languages (e.g., VHDL)Time to completion is critical

Trade-off between risk of respin and delay in tape-out


A D Kennedy 99

VHDL

entity lcount5 isport ( signal clk, res: in vlbit;

signal reset, set: in vlbit; signal count: inout vlbit_1d(4 downto 0));

end lcount5;

architecture bhv of lcount5 isbeginprocess

beginwait until prising(clk) or res='1';

if res='1' then count<=b"00000";else if reset='1'

then count<=b"00000"; elsif set='1' then

count(4)<=count(4) xor (count(0) and count(1) and count(2) and count(3)); count(3)<=count(3) xor (count(0) and count(1) and count(2));count(2)<=count(2) xor (count(0) and count(1));count(1)<=count(1) xor count(0);count(0)<=not count(0);

end if;end if; end process ;end bhv;


A D Kennedy 100

QCDSP

Columbia University0.4 TFlops

BNL0.6 Tflops


A D Kennedy 101

4 Mbytes ofEmbedded DRAM

Complete Processor Node forQCD computer on a Chip

fabricated by IBM using SA27E 0.18μ

logic + embedded DRAM processIBM ASIC Library ComponentIBM ASIC Library Component

DMADMAControllerController

QCDOC ASIC Design

DCR SDRAMDCR SDRAMControllerController

DCR SDRAMDCR SDRAM

Custom Designed LogicCustom Designed Logic

InterruptInterruptControllerController

Boot/ClockBoot/ClockSupportSupport

PLBPLBArbiterArbiter

PLBPLB

FPUFPU440 440 CoreCore

HSSLHSSL

HSSLHSSL

HSSLHSSL

SCUSCU

PLLPLL

EDRAMEDRAM

PECPEC

MCMALMCMALEthernetEthernet

DMADMA

locationnumber

IDCIDCInterfaceInterface

Serialnumber

SlaveSlave

OFB-PLBOFB-PLBBridgeBridge

OFBOFBArbiterArbiter

OOFFBB

MII

FIFOFIFO

4KB recv. FIFO4KB recv. FIFO

EMACSEMACS

100 Mbit/secFast Ethernet

24 Off-Node Links12 Gbit/secBandwidth

24 Link DMACommunications

Control

2.6 Gbyte/secEDRAM/SDRAM

DMA

1 (0.8) GlopsDouble PrecisionRISC Processor

8 Gbyte/secMemory-Processor

Bandwidth2.6 Gbyte/secInterface to

External Memory

PEC = Prefetching EDRAM Controller

EDRAM = Embedded DRAM

DRAM = Dynamic RAM

RAM = Random Access Memory


A D Kennedy 102

QCDOC Components


A D Kennedy 103

QCDOC


A D Kennedy 104

Edinburgh ACF

UKQCDOC is located in the ACF near Edinburgh

Our insurers will not let us show photographs of the building for security reasons!


A D Kennedy 105

Blue Gene/L


A D Kennedy 106

Vicious Design Cycle

1. Communications DMA controller requiredOtherwise control by CPU needed

Requires interrupts at each I/O completionCPU becomes bottleneck

2. Introduce more sophisticated DMA instructionsBlock-strided moves for QCDSPSequences of block-strided moves for QCDOC

Why not? It is cheap

3. Use CPU as DMA engineSecond CPU in Blue Gene/L

4. Add FPU Why not? It is cheap, and you double you peak Flops

5. Use second FPU for computationOtherwise sustained fraction of peak is embarrassingly small

6. Go back to step 1


A D Kennedy 107

Power Efficiencies

QCDSP

ASCI White

QCDOC

Blue Gene/ L

NCSA

I ntel Xeon

ASCI Q

Earth Simulator

0.001

0.01

0.1

1

1997 1998 1999 2000 2001 2002 2003 2004 2005

Year

GFlo

ps/W

att


A D Kennedy 108

What Next?

We need a Petaflop computer for QCDNew fermion formulations solve a lot of problems, but cost significantly more in computer time

Algorithms are improvingSlowlyBy factors of about 2

Can we build a machine about 10 times more cost effective than commercial machines?

Otherwise not worth the effort & risk of building our ownBlue Gene/L and its successors provide a new opportunity

Algorithms for Dynamical Fermions

Documents

convergence of markov

iiuse markov chains

ergodic markov process

central limit theorem

fixed pointsufficient

desired fixed pointand

fixed number of stepsfor

monte carlo methods