
ETH Library

Embedding penalties for quantum hardware architectures and performance of simulated quantum annealing

Doctoral Thesis

Author(s): Könz, Mario

Publication date: 2019

Permanent link: https://doi.org/10.3929/ethz-b-000439876

Rights / license: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information, please consult the Terms of use.

Embedding penalties for quantum hardware architectures and performance of simulated quantum annealing

Mario S. Könz

DISS. ETH NO. 25815

Embedding penalties for quantum hardware architectures and performance of simulated quantum annealing

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zurich)

presented by

MARIO SILVESTER KÖNZ

MSc. in Interdisciplinary Science ETH Zurich

MSc. in Physics ETH Zurich

born on 03.12.1989

citizen of Scuol (Guarda) and Zurich

accepted on the recommendation of

Prof. Dr. Matthias Troyer, examiner
Prof. Dr. Helmut G. Katzgraber, co-examiner

Prof. Dr. Wolfgang Lechner, co-examiner


2019


Mario Silvester Könz
Embedding penalties for quantum hardware architectures and performance of simulated quantum annealing
Diss. ETH No. 25815

Digital Object Identifier DOI: 10.3929/ethz-b-000439876
E-mail [email protected]


Abstract

Quantum computing aims to harness the properties of quantum systems to more effectively solve certain kinds of computational tasks. Among quantum computers, quantum annealers are special-purpose machines whose main appeal is their ability to solve hard optimization problems while giving certain theoretical guarantees for correctness. Hard optimization problems can be found in many fields, making good and fast solvers highly sought after. While classical annealing algorithms have been established as very successful heuristics, a potential scaling advantage of a quantum annealer has produced great interest. The possibility to manifest those quantum advantages in real devices, however, remains speculative. In short, the inability to create arbitrarily large all-to-all connected graph-like structures in hardware forces the use of embeddings: mappings of the original problem onto a quantum system with only short-range interactions. Currently there is one main commercial quantum annealer, which uses such an embedding to map the optimization problem onto the hardware. More recently, an alternative embedding was proposed, which we study as well. In order to study the different embeddings, we need a class of optimization problems with potential for quantum annealing. This means picking problems where classical computing does not yet have fast and optimal solvers. We simulate the different embeddings and analyze their scaling compared to simulated quantum annealing, since a classical simulation of quantum annealing has no real-world constraints and can realize all-to-all connectivity without any embedding. Our research suggests that there might be an exponential penalty when using these embeddings for finding the embedded ground-state. In a next step, we analyze the ability of quantum annealing to find degenerate ground-states with equal probabilities, a property called fair sampling.


With a simple transverse field driver, quantum annealing is known not to sample fairly, i.e. some degenerate ground-states are exponentially suppressed. We address the conjecture that more advanced drivers lead to fair sampling.

Throughout this thesis, much emphasis is placed on clean software design that maximizes reusability, flexibility and maintainability of the scientific software. The entire thesis should be reproducible with very little effort, since no external data sources are necessary.


Zusammenfassung

Quantum computing strives to use the properties of quantum systems in order to perform certain kinds of computations more effectively. The quantum annealer, a subtype of quantum computer, is a special-purpose machine whose main ability consists in solving hard optimization problems while providing certain theoretical guarantees of correctness. Such optimization problems occur in many disciplines, and fast algorithms are of great use. Classical annealing algorithms have long proven to be very good heuristics, and since a quantum annealer could potentially gain an advantage from its quantum nature, interest rose further. Realizing this advantage in a real machine, however, still remains to be shown. The core problem lies in the current impossibility of implementing fully connected graph-like structures in hardware, which forces the use of embeddings that map the original problem onto a quantum system exhibiting only short-range interactions. At present there is one commercial quantum annealer, which requires such a mapping in order to represent the optimization problem on the annealer. Recently, a further kind of embedding and physical realization was proposed, which we investigate as well. In order to study these different embeddings, we need a class of optimization problems that could be interesting for quantum annealers. This means finding a problem for which fast and very good classical algorithms do not yet exist. We simulate the quantum annealer with each of these embeddings and analyze the algorithmic complexity. We compare this with direct quantum annealing, since the classical simulation is not subject to any constraints regarding the implementability of fully connected graphs. Our results indicate that all investigated embeddings could carry an exponential price in complexity for obtaining the ground state of the embedding. Furthermore, we analyze the ability of quantum annealers to find degenerate ground states with equal probability. It is known that a quantum annealer with a simple transverse field, where certain degenerate ground states are exponentially suppressed, does not suffice. We investigate whether more complex drivers can remedy this weakness.

Throughout this work, much value was placed on clean software design in order to maximize reusability, flexibility and maintainability. The entire work should be reproducible as easily as possible, since it does not depend on any external data sources.


Contents

I Introduction

1 Introduction
  1.1 Optimization Problems
    1.1.1 Maximum Cut
    1.1.2 Quadratic Unconstrained Binary Optimization
  1.2 Algorithms
    1.2.1 Monte Carlo
    1.2.2 Indirect Sampling
    1.2.3 Markov Chain
    1.2.4 Metropolis-Hastings
    1.2.5 Simulated Annealing
    1.2.6 Quantum Annealing

2 Algorithm Implementation
  2.1 General Implementation Design
    2.1.1 Modular Design
    2.1.2 Parameter Initialization
    2.1.3 Base Class
    2.1.4 Basic Simulation Components
  2.2 Simulated Annealing Implementation
    2.2.1 Configuration
    2.2.2 Temperature Scheduler
    2.2.3 Notation
    2.2.4 Single Spin Algorithm
    2.2.5 UML Diagram
  2.3 Simulated Quantum Annealing Implementation
    2.3.1 Analogy between Quantum and Classical Ising Model
    2.3.2 Analogy Interpretation
    2.3.3 Cluster Update and Optimizations
  2.4 Library Overview
    2.4.1 siquan
    2.4.2 frescolino
    2.4.3 giarsinom
    2.4.4 zurkon
    2.4.5 phd_thesis_msk

II Results

3 Maxcut: Comparing Algorithms
  3.1 MaxCut Generator
  3.2 Exact Algorithms
    3.2.1 Brute Force
    3.2.2 BiqMac Server
  3.3 Goemans Williamson MaxCut Algorithm
  3.4 Comparison between GW, SA and SQA
    3.4.1 Problems
    3.4.2 Quality of Solution
    3.4.3 Runtime Scaling for GW, SA and SQA
    3.4.4 Quality and Runtime Scaling SA vs SQA

4 Embedding Penalties
  4.1 Introduction
  4.2 Minor Embedding
    4.2.1 Encoding
    4.2.2 Decoding
  4.3 Minor Embedding on a Chimera Graph
    4.3.1 Encoding
    4.3.2 Decoding
  4.4 Parity adiabatic quantum optimization
    4.4.1 Constraints
    4.4.2 Decoding
  4.5 Comparison
    4.5.1 Time to Solution
    4.5.2 Scaling Comparison
    4.5.3 Theoretical Analysis
  4.6 Conclusion

5 Fair Sampling with Quantum Annealers
  5.1 Introduction
    5.1.1 Fair Sampling
    5.1.2 Driver
  5.2 Toy Problems
    5.2.1 Graph Generation
  5.3 Perturbation Theory
  5.4 Large Scale Results
  5.5 Effects of more complex drivers
  5.6 Conclusions

III Tools

6 Scientific Measurement Framework
  6.1 Challenges in scientific Computing
    6.1.1 Infrastructure
    6.1.2 Caching
    6.1.3 Active Development
    6.1.4 Personal Experience
  6.2 Zurkon Scheduler
    6.2.1 Multiprocessing in Python
    6.2.2 Static Scheduler
    6.2.3 Dynamic Scheduler
  6.3 Zurkon Provenance
    6.3.1 Basic Example
    6.3.2 Subcall Example
    6.3.3 Passing Through Example
    6.3.4 Self Similar Example
  6.4 Zurkon Caching
    6.4.1 Caching
    6.4.2 Prune Cache
    6.4.3 Lazy Caching
    6.4.4 Provenance Caching
  6.5 Technical Implementation
    6.5.1 Motivation
    6.5.2 Context
    6.5.3 Context Proxy
    6.5.4 Three Examples
    6.5.5 Caching with multiple Hosts
  6.6 Conclusion

7 Conclusion and Outlook
  7.1 Implementation
  7.2 Maxcut: Comparing Algorithms
  7.3 Embedding
  7.4 Fair Sampling
  7.5 Scientific Measurement Framework

IV Appendix

8 List of publications
9 References
10 Acknowledgements
11 Curriculum Vitae


Introduction


1. Introduction

We want to solve hard optimization problems efficiently and fast. They show up in numerous fields and can often be reduced to some well-known optimization problem. They help to streamline processes, find ideal blueprints for a constrained construction problem, find solutions not necessarily intuitive to humans, and much more. Due to the importance of these solvers, efforts in academia and industry push the algorithms and solving strategies further and further. With the dawn of quantum computing, more specifically quantum annealing, the question of its usefulness for relevant optimization problems arose. As expected, the addition of quantum mechanics opens new avenues that benefit certain problems, which can hence, in theory, be solved faster. But when building such a device, current technological constraints demand overheads which might reduce the usefulness of going quantum today. This is important to understand, since it might determine which technology is better suited for these optimization problems. The main focus of this thesis consists of investigating these overheads by using many different methods for solving hard discrete optimization problems and comparing them. Chapter 2 will cover these methods in detail, while this introductory chapter introduces the problems and the fundamental ideas behind the methods. First, the optimization problems of interest will be discussed, followed by a high level description of the central algorithms. A comparison between quantum and classical methods for a specific optimization problem will be provided in chapter 3. The thesis continues by analyzing the overhead current quantum annealing hardware incurs in chapter 4. Then we investigate secondary properties and consequences of using quantum annealing in chapter 5. The last part, chapter 6, covers the challenges and solutions for running a scalable framework for scientific measurements.

1.1 Optimization Problems
The problem of understanding ferromagnetism led to the description of the Ising model. It was simple, yet capable of explaining many phenomena physicists observed. One successful tool to study the Ising model was the family of Monte Carlo methods, described in the next sections. The same methods can be applied successfully to many similar problems.

This section will start with the original Ising problem, then generalize it, and lastlyintroduce the concrete problems studied in this thesis.

The classical Ising model is a 2D square lattice with a spin on each of the N vertices, capable of assuming either the value +1 or −1. Each spin i is connected to its four¹ nearest neighbors, denoted as ⟨i, j⟩, with a ferromagnetic coupling J > 0, which leads to the following energy for a specific spin-configuration or state σ ∈ {−1,1}^N:

H(\sigma) = -J \sum_{\langle i,j \rangle} \sigma_i \sigma_j    (1.1)

This can be used to calculate the partition function and investigate the aforementioned physical properties. We can add a local field, which biases the previously inversion-invariant² problem:

H(\sigma) = -J \sum_{\langle i,j \rangle} \sigma_i \sigma_j - h \sum_i \sigma_i    (1.2)

Furthermore, we can allow for individual couplings J_{i,j} and local fields h_i:

H(\sigma) = -\sum_{\langle i,j \rangle} J_{i,j} \sigma_i \sigma_j - \sum_i h_i \sigma_i    (1.3)

Next, we can remove the requirement of nearest neighbors, since in principle all spins could be connected to all others:

H(\sigma) = -\sum_{i,j} J_{i,j} \sigma_i \sigma_j - \sum_i h_i \sigma_i    (1.4)

where we use \sum_{i,j} as \sum_i \sum_{j>i} in order to prevent double counting the equivalent couplings J_{i,j} and J_{j,i}.

For now, only couplings between two spins are regarded, but the formula can be expanded to take into account, for example, couplings between three spins. Couplings with more than two spins will sometimes be referred to as higher order interactions or higher order couplings:

H(\sigma) = -\sum_i h_i \sigma_i - \sum_{i,j} J_{i,j} \sigma_i \sigma_j - \sum_{i,j,k} J_{i,j,k} \sigma_i \sigma_j \sigma_k    (1.5)

¹ periodic boundary conditions for the spins at the edges
² if one flips all spins, the energy remains the same


Here we have sorted the terms by how many spins are part of the interaction. If we lastly generalize this formula for arbitrary higher order interactions, we obtain:

H(\sigma) = -\sum_{c=1}^{N} \left( \sum_{i \in M_c} J_i \prod_{l=1}^{c} \sigma_{i_l} \right)    (1.6)

where M_c = \{(i_1, \dots, i_c) \mid i_k < i_{k+1}\} and c is the coupling order. For example, c = 1 would be the local field. This formula can describe many interesting optimization problems, two of which are presented in more detail.

1.1.1 Maximum Cut
The maximum cut (MaxCut) problem is an NP-hard optimization problem originating from graph theory. NP-hard (non-deterministic polynomial-time hard) problems cannot be solved in polynomial time if P ≠ NP. This is not proven, but currently there is no algorithm that solves MaxCut to optimality in polynomial time. The goal is to find a cut through a graph that cuts as many edges as possible³. Let G be a graph with vertices v_i ∈ V and edges e_{i,j} ∈ E. The goal is to separate the vertices into two sets, such that the number of edges with one vertex in each set is maximal:

\text{MaxCut} = \max_{S \subset V} \left| \left\{ e_{i,j} \in E \mid v_i \in S \;\text{XOR}\; v_j \in S \right\} \right|    (1.7)

With the following identities, we can transform eq. (1.7) to a form closer to the Ising model (eq. (1.6)):

J_{i,j} = \begin{cases} -1 & \text{for } e_{i,j} \in E \\ 0 & \text{for } e_{i,j} \notin E \end{cases}
\qquad
\sigma_i = \begin{cases} +1 & \text{for } v_i \in S \\ -1 & \text{for } v_i \notin S \end{cases}    (1.8)

The subset S ⊂ V is identified by the spins σ, and the edges as negative couplings J_{i,j} = −1. Minimizing

H(\sigma) = -\sum_{i,j} J_{i,j} \sigma_i \sigma_j    (1.9)

yields a state σ which achieves the MaxCut, since we get a negative energy contribution if σ_i ≠ σ_j and J_{i,j} = −1. If we allow more than just −1 as coupling value, we trivially extend to weighted MaxCut, where we want to maximize the total weight of the edges cut, not necessarily their number. To get the value of the MaxCut once the optimal σ is known, we either calculate it with eq. (1.7) or use the following connection between the MaxCut and the energy H(σ):

\text{MaxCut} = \frac{1}{2} \left( -\sum_{i,j} J_{i,j} - H(\sigma) \right)    (1.10)

³ for the unweighted case


This works because −∑_{i,j} J_{i,j} is the energy where all vertices are in the same set, and subtracting H(σ) removes the edges that are not cut and adds the edges that are cut. We are then left with twice the number of edges cut, requiring the division by 2.
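To make the mapping concrete, the following minimal sketch (C++, not the thesis code; the 4-vertex graph is a hypothetical example) brute-forces the Ising form of eq. (1.9) and recovers the MaxCut value via eq. (1.10):

#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    // hypothetical example graph: a square 0-1-2-3 with one diagonal 0-2
    int const n = 4;
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 3}, {3, 0}, {0, 2}};

    int best_H = n * n;  // safe upper bound on H
    for (int mask = 0; mask < (1 << n); ++mask) {
        int H = 0;
        for (auto const& e : edges) {
            int si = (mask >> e.first & 1) ? 1 : -1;
            int sj = (mask >> e.second & 1) ? 1 : -1;
            H += si * sj;  // J_ij = -1 on every edge, so -J_ij * si * sj = si * sj
        }
        best_H = std::min(best_H, H);
    }
    // eq. (1.10): MaxCut = (-sum_ij J_ij - H) / 2, and -sum_ij J_ij = |E| here
    int maxcut = (static_cast<int>(edges.size()) - best_H) / 2;
    std::cout << "MaxCut = " << maxcut << std::endl;  // prints 4
    return 0;
}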

1.1.2 Quadratic Unconstrained Binary Optimization
Quadratic unconstrained binary optimization, QUBO for short, strives to minimize the target function:

E(x_1, \dots, x_N) = \sum_i c_i x_i + \sum_{i,j} q_{i,j} x_i x_j    (1.11)

where x_i ∈ {0,1} and c_i, q_{i,j} ∈ ℝ. The weight matrix is symmetric, q_{i,j} = q_{j,i}, which will be used later in the derivation. The main difference compared to MaxCut or Ising is the values {0,1} instead of {−1,1} for the variables. Eq. (1.11) can be transformed to the form of eq. (1.6) by substituting:

x_i = \frac{1}{2}(1 + \sigma_i)    (1.12)

\begin{aligned}
E(\sigma_1, \dots, \sigma_N) &= \sum_i c_i \frac{1}{2}(1+\sigma_i) + \sum_{i,j} q_{i,j} \frac{1}{2}(1+\sigma_i) \frac{1}{2}(1+\sigma_j) \\
&= \frac{1}{4} \Big( 2\sum_i c_i (1+\sigma_i) + \sum_{i,j} q_{i,j} (1+\sigma_i)(1+\sigma_j) \Big) \\
&= \frac{1}{4} \Big( 2\sum_i c_i (1+\sigma_i) + \sum_{i,j} q_{i,j} (1+\sigma_i+\sigma_j+\sigma_i\sigma_j) \Big) \\
&= \frac{1}{4} \Big( 2\sum_i c_i + \sum_{i,j} q_{i,j} + 2\sum_i c_i \sigma_i + \sum_{i,j} q_{i,j}\sigma_i + \sum_{i,j} q_{i,j}\sigma_j + \sum_{i,j} q_{i,j}\sigma_i\sigma_j \Big) \\
&= \frac{1}{4} \Big( 2\sum_i c_i + \sum_{i,j} q_{i,j} + 2\sum_i \big(c_i + \sum_j q_{i,j}\big)\sigma_i + \sum_{i,j} q_{i,j}\sigma_i\sigma_j \Big)
\end{aligned}    (1.13)

In the last step we relabeled the indices and used q_{i,j} = q_{j,i}. Now we can identify:

E_0 = \frac{1}{4}\Big(2\sum_i c_i + \sum_{i,j} q_{i,j}\Big), \qquad
h_i = -\frac{1}{2}\Big(c_i + \sum_j q_{i,j}\Big), \qquad
J_{i,j} = -\frac{1}{4} q_{i,j}    (1.14)


to arrive at the form of eq. (1.6) (up to an additive constant):

H(\sigma) = E_0 - \sum_i h_i \sigma_i - \sum_{i,j} J_{i,j} \sigma_i \sigma_j    (1.15)

1.2 Algorithms

The following section describes the core ideas and mechanisms behind the algorithms used throughout this thesis. First, we gradually build towards Metropolis-Hastings Markov chain Monte Carlo by explaining each component (Monte Carlo, Markov chains, and Metropolis-Hastings) separately. Second, we describe the annealing technique and finish with its quantum analog, quantum annealing, highlighting the central differences between the two.

1.2.1 Monte Carlo
The Monte Carlo method [6] is a numerical algorithm that leverages random sampling. The classic educational example is the calculation of π, illustrated in fig. 1.1a.

Figure 1.1: a) The quarter-circle can be sampled with uniformly random points on the unit square and used to calculate π. b) The absolute error of π scales as 1/√N (green). The blue line depicts a single run of points, while the orange is an average of 16 such lines with different pseudo random generator seeds.

The area of the quarter-circle is π/4. By uniformly sampling points in the unit square, we can determine π/4 numerically as the ratio between points inside the circle and the total number of points. This method only requires counting points and division, and can estimate π to arbitrary precision, given enough points. More precisely, the error of the estimate scales as 1/√N, as shown in fig. 1.1b. The key ingredients we need are a sampling mechanism which covers the whole square and good random numbers. While there are many ways to improve the efficiency⁴ and many noteworthy properties to this method, this brief introduction will suffice as a basis. As a final note, while this example is useful to illustrate the method, there are many methods better suited for 2D integration than Monte Carlo. But this method is among the few that remain viable as soon as the dimensionality of the problem is large.

1.2.2 Indirect Sampling
Imagine being tasked to measure the area of a body of water by throwing tennis balls and counting how often a splash occurs. If the body of water is a puddle, this works nicely with the method described in section 1.2.1, also referred to as direct sampling. One can stand in proximity to the puddle, throw tennis balls, and reach every part of the sampling area. As long as this is the case, it does not really matter where the tennis balls are thrown from. But if the body of water is a lake, it becomes impossible to reach every edge of it by throwing⁵. In this scenario we can still use Monte Carlo, but instead of sampling directly, we can only throw the tennis ball a certain fixed distance. Then we walk/row to the landing location, pick it up, and throw it anew in a random direction.

Figure 1.2: Sketch of indirect sampling methods. Earlier parts of a walk are more transparent. a) A single random walk, choosing the next point at a fixed distance but in a random direction. b) The same random walk with further iterations. c) Multiple independent walkers.

Figure 1.2a shows such a potential walk on a circle shaped pond. If we continue to walk in such a random manner (fig. 1.2b), we can get the area⁶, same as before, as the ratio between points inside and the total number of points. Periodic boundary conditions were chosen, since the random walk described asymptotically approximates a uniform distribution. Another strategy would be to call a few colleagues, each following an individual walk, as shown in fig. 1.2c. After some walking, everybody adds their findings for a more precise result.

⁴ for example, importance sampling
⁵ Unless the reader owns a helicopter or a tennis ball cannon.
⁶ One might have issues implementing the shown periodic boundary conditions in reality. Other boundary conditions would work as well, but the computation of π would be slightly more complicated if the sampling distribution is not uniform.

There are two very attractive properties, from a computational point of view, to both approaches, direct and indirect sampling. First, they are trivial to parallelize, and second, we can achieve arbitrary precision by simply running them longer. Every two orders of magnitude in sampling points yields an order of magnitude in precision. The main reason for choosing indirect sampling is the inability to generate configurations from the target distribution at all or, if possible, in an efficient manner. Therefore the indirect sampling approach is superior in practice. But there is one caveat compared to direct sampling. Two neighboring points on a walk are correlated, since the throwing distance is much smaller than the area we are sampling. If we were to take into account all points of the walk, the estimated error would be grossly underestimated. Therefore, we should only sample a point on the walk if it is sufficiently de-correlated from the previous one.

1.2.3 Markov Chain
The walks described in section 1.2.2 are so-called Markov chains, since the next location only depends on the current location and not on the history of the walk. Furthermore, we call them reversible Markov chains if they satisfy the detailed balance condition

\rho(A) \, t(A \to B) = \rho(B) \, t(B \to A)    (1.16)

where ρ(A) is the probability of being in A and t(A→B) the transition probability of moving from A to B. This is the case for all our previous examples. If we are further able to reach every point from any other point in a finite number of steps, we call the Markov chain ergodic. These two conditions, ergodicity and detailed balance, are sufficient to guarantee a stationary distribution ρ. In section 1.2.1 the distribution was known by design, since we used uniformly distributed points. But it is not immediately clear what distribution the random walk with periodic boundary conditions in section 1.2.2 produces⁷. What if mirror boundary conditions had been used? In general, it is not necessary to know the distribution in advance for the method to work, as long as it satisfies ergodicity and detailed balance. One can always weight the points according to the distribution recorded while sampling in order to calculate the value of π. If the sampling and target distribution have very little overlap, the error converges much more slowly.

We will now depart from using Monte Carlo as an integrator, and discuss its ability to sample distributions. One application stems from statistical thermodynamics, where we might want to calculate an ensemble average of an observable property f for a system. For such a system with a configuration space {x} and corresponding unnormalized distribution ρ(x) of the configurations x, we would need to calculate:

\langle f(x) \rangle = \frac{\sum_x \rho(x) f(x)}{\sum_x \rho(x)}    (1.17)

For the π-examples, the function would be 1 inside and 0 outside the circle, and ρ(x) = 1 for all x. For any larger system, eq. (1.17) becomes impossible to evaluate, since we would need to integrate over the entire configuration space. Instead, we use Monte Carlo and Markov chains to sample configurations and approximate the ensemble average numerically. We start with a configuration x, calculate f(x) and then update the configuration, until the next configuration sample is de-correlated. But in order not to waste valuable compute time, we would like to sample the more likely configurations more frequently, since their weight is larger in eq. (1.17). Given x, we can calculate ρ(x), which does not need to be normalized. The next section shows how one can traverse the configuration space in an efficient manner.

⁷ in this case we converge towards a uniform distribution.

1.2.4 Metropolis-Hastings
The Metropolis-Hastings algorithm uses Monte Carlo and Markov chains. The core ingredient is a transition function which ensures detailed balance and samples the more likely configurations more frequently. This is achieved by defining the transition function as follows:

t(A, B) = \min\left(1, \frac{\rho(B)}{\rho(A)}\right)    (1.18)

This means if we propose to go from configuration A to B, we always accept the update if B is more likely than A. If not, we only accept it with probability ρ(B)/ρ(A). This way, more likely configurations will be sampled more frequently. Furthermore, t satisfies eq. (1.16), which can easily be checked by inserting the definition. The basic steps of the algorithm are:

• start: Generate a starting configuration A.
• update: Generate the next configuration B by applying an update to A.
• accept: Accept B with the probability given by t.
• sample: Use B as a sample if it is sufficiently de-correlated from the last sample.
• terminate: Repeat update, accept and sample until the error is in the desired range.

The only two functions a user must provide are the (not necessarily normalized) distribution ρ(x) and an update routine that proposes B given A. In the next section we show how the Metropolis-Hastings algorithm can calculate the solution of optimization problems.
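As a minimal illustration of these steps (C++, not the thesis code), the following sketch applies the accept rule of eq. (1.18) to a toy four-state space with an unnormalized ρ; the sampled frequencies approach ρ(x)/∑ρ:

#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<double> rho = {1.0, 2.0, 4.0, 8.0};  // unnormalized distribution
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    int A = 0;  // start: initial configuration
    std::vector<long> counts(4, 0);
    for (long k = 0; k < 1000000; ++k) {
        // update: propose a random neighbor on the ring of states (symmetric proposal)
        int B = (A + (u(rng) < 0.5 ? 1 : 3)) % 4;
        // accept with probability t(A,B) = min(1, rho(B)/rho(A))
        if (u(rng) < rho[B] / rho[A]) A = B;
        ++counts[A];  // sample (de-correlation is ignored in this toy example)
    }
    for (int x = 0; x < 4; ++x)  // frequencies approach 1/15, 2/15, 4/15, 8/15
        std::cout << "state " << x << ": " << counts[x] / 1e6 << std::endl;
    return 0;
}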

1.2.5 Simulated Annealing
So far, we have highlighted methods to sample configurations according to some distribution, but this alone is not sufficient to solve the problems outlined in section 1.1. We want to find the optimal solution to a problem and are not necessarily interested in averages and other configurations. The approach of annealing stems originally from material science, where a heated material assumes a different crystal structure depending on how fast it is cooled. The slower a material is cooled, the less energy its final state typically has; the material has enough time to always adjust to the new thermal equilibrium. If it is cooled too fast, material stress can be induced, which amounts to additional energy, i.e. a local minimum. This can be desirable for making tempered glass, which shatters into many small pieces upon failure instead of leaving sharp edges, or for quenching (rapid cooling) steel to make it harder.

Here we simulate this process of thermal annealing to find ground states. It is referred to as Simulated Annealing [7], Classical Annealing, Thermal Annealing, or SA in short. For this, we simulate a canonical ensemble with a Boltzmann distribution:

\rho(x) = e^{-\frac{E(x)}{k_B T}} = e^{-\beta E(x)}    (1.19)

where E(x) is the energy of configuration x. For optimization problems, this would be the cost function to minimize. The advantage of the Boltzmann distribution is that lowering the temperature accentuates the lower lying states further and further, which is shown in fig. 1.3.

Figure 1.3: The unnormalized Boltzmann distribution for temperatures T = 1 K, 20 K, 100 K and 10000 K. If it were normalized, the integral of each distribution would be 1. At high temperatures, all energy states are practically equally likely. But the lower the temperature gets, the more the probabilities of higher energy states are suppressed in favor of the lower energy states.

We can start generating configurations from a high temperature distribution and slowly reduce the temperature, while keeping our generated configurations true to the current distribution. When we update a configuration A to get B, the transition probability according to Metropolis-Hastings is given by:

t(A, B) = \min\left(1, \frac{\rho(B)}{\rho(A)}\right) = \min\left(1, \frac{e^{-\beta E(B)}}{e^{-\beta E(A)}}\right) = \min\left(1, e^{-\beta \Delta E}\right)    (1.20)

This only requires the energy difference, which is fast and easy to compute for most update routines. Ideally this leads us to ground states after the cooling process is finished. Figure 1.4a shows schematically how thermal fluctuations overcome barriers in the energy or cost function.

Figure 1.4: Sketch of thermal and quantum annealing. a) Thermal fluctuations enable a configuration to escape the local minimum. The height of the barrier determines the success. b) Quantum tunneling allows the configuration to bypass the barrier and escape the local minimum. The wider the barrier, the less likely this succeeds.

There is a theoretical guarantee to reach the ground state if the annealing is slow enough. This is formulated in the theorem of Geman and Geman [8], which states that if the temperature T in each simulation step k is not smaller than

T(k) > \frac{c}{1 + k}    (1.21)

where c is a k-independent constant, then the probability of reaching the ground state converges to 1 as k → ∞. SA has also proven to be a very successful heuristic when run faster, i.e. with parameters and schedules that do not guarantee the ground state. Of course this then requires validating against exact methods in the problem space of interest and losing the theoretical guarantee. But for many applications, speed and performance can be much more important than this guarantee.
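A minimal sketch (C++, not the siquan implementation) of simulated annealing on a small ferromagnetic Ising ring, combining the Metropolis rule of eq. (1.20) with a linearly decreasing temperature:

#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::size_t const N = 64;  // ring of 64 spins, J = 1, periodic boundaries
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, N - 1);

    std::vector<int> s(N);
    for (auto& si : s) si = u(rng) < 0.5 ? 1 : -1;  // random (hot) start

    int const steps = 100000;
    for (int step = 0; step < steps; ++step) {
        double T = 3.0 + (0.01 - 3.0) * step / steps;  // linear cooling schedule
        std::size_t i = pick(rng);
        // flipping spin i only changes the two bonds to its neighbors
        double dE = 2.0 * s[i] * (s[(i + 1) % N] + s[(i + N - 1) % N]);
        if (dE <= 0.0 || u(rng) < std::exp(-dE / T)) s[i] = -s[i];
    }
    double E = 0.0;
    for (std::size_t i = 0; i < N; ++i) E -= s[i] * s[(i + 1) % N];
    std::cout << "final energy: " << E << " (ground state: " << -double(N) << ")\n";
    return 0;
}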

Figure 1.5 shows the energy for a small Ising model as the temperature is slowly decreased from right to left. More details about the algorithm will be discussed in section 2.2.

Figure 1.5: Energy for a single annealing run of a 10x10 periodic ferromagnetic Ising model. When the temperature is high, the energy is high as well, due to the thermal excitation, or, more algorithmically, since there are many more higher energy configurations which get accepted due to the high temperature. When the temperature is lowered, the ground state at E = −200 is found. (The 10x10 nearest neighbor Ising model has 200 bonds.)

1.2.6 Quantum Annealing
Quantum Annealing [9, 10, 11, 12, 13, 14, 15, 16, 17] is part of the quantum computing field. There are large efforts to leverage the effects of quantum mechanics for computational tasks. As in classical computing, there are two broad subfields, analog and digital computing.

Quantum annealing belongs to the former: a quantum annealer is a special purpose device that solves one task only. General purpose quantum computing requires programmable gates and belongs to the latter.

Similar to classical analog devices, quantum annealers deal with analog signals, and hence have real-world limitations due to calibration precision and measurement resolution. This is important to keep in mind when challenging algorithms on classical digital hardware, which do not have this limitation.

The quantum annealing method has a name similar to classical thermal annealing, since the general theme is the same, yet the mechanism and properties are fundamentally different. Instead of using thermal fluctuations to overcome barriers in the cost function (fig. 1.4a), we use quantum tunneling to bypass barriers, as shown in fig. 1.4b. Barriers should ideally be narrow, since the tunneling probability decreases exponentially with the barrier width.

Similar to simulated thermal annealing, the so-called adiabatic theorem guarantees a ground state if the following conditions are met: if we start in a ground state of a Hamiltonian, and slowly (adiabatically) transform this Hamiltonian into another one, the correspondingly changing state will always be the instantaneous ground state, provided there is a non-vanishing energy gap between the ground state and the first excited state along the entire annealing path. Again the term "slowly" or "slow enough" needs to be quantified. For numerical and optimization purposes, we usually are on the (too) fast side concerning annealing speeds, in order to increase performance and time-to-solution. Given the same total runtime, it can even be beneficial for finding the ground state to run annealing multiple times fast (i.e. sloppily) rather than once slowly. Throughout this thesis, we use some form of transverse field Hamiltonian as the starting Hamiltonian, since we can easily generate its ground state, and then slowly transform toward the problem Hamiltonian, the quantum analog of eq. (1.6), which is diagonal in the computational basis. Figure 1.6 shows the full spectrum of a simulated quantum anneal for a small spin system. The consequence of annealing too fast can be seen when comparing fig. 1.6a with fig. 1.6b. But even if the annealing path leaves the instantaneous ground state, for many problems there is still a solid chance to measure the final ground state. The state may now be a superposition of the ground state and some higher energy states. Since a final read-out measurement will collapse this superposition, we obtain the ground state according to its probability. But for some problems it is also possible not to be able to recover the ground state, since its amplitude in the wave function has vanished. Hence this practice should be used as a heuristic and validated thoroughly in the problem space of interest. The derivation and more details of the algorithm will be covered in section 2.3.

Figure 1.6: The 8 lowest energy states of a 5 spin system with 6 degenerate ground states. When t = 0, only the transverse field Hamiltonian is active, with one clear and simple to initialize ground state. We then slowly transform towards the problem Hamiltonian. The annealing parameter t is an indicator for the time. a) If the anneal is too fast, the state will leave the ground state. b) If the anneal is slow enough, the state always stays in the ground state.


2. Algorithm Implementation

This chapter highlights the implementation of the algorithms described in the introduction. There are several layers to an implementation: first, the high level design philosophy; second, the choice of theoretical algorithm; and third, the technical implementation details and optimization techniques in C++. The first section covers high level design and technical details shared by all implementations. After this, an implementation of Simulated Annealing is discussed, followed by the Simulated Quantum Annealing implementation. Both sections deal with the theoretical algorithm as well as technical implementation details. The final section summarizes the different libraries and their functionality.

2.1 General Implementation Design

All core algorithms are written in modern C++ to meet practical performance requirements. This section details the core design choices made while implementing in C++. It represents the code-base in its final state; many choices only became apparent with growing project size and experience of the author. Many interesting implementations that served as stepping stones to this final one will not be covered. We introduce the modular design, technical details, as well as the basic simulation component structure available. The library has the working name siquan, short for simulated quantum annealing. Despite the central focus on SQA, it also contains other algorithms, such as simulated annealing and exact methods.


2.1.1 Modular Design
A core aspect of science is the potentially unknown follow-up to a finding. Will the project be developed further or left as is? Are there new interesting questions worth investigating? To satisfy this need for flexibility, a modular programming approach has been prioritized. By separating self-contained subtasks into modules, a library of interchangeable modules can be created, eliminating the need to rewrite the same functionality. The individual modules should be as closed as possible and only expose the minimal interface necessary. Furthermore, they should respect certain overarching interface decisions in order to be interchangeable. This allows reuse: many parts of all the algorithms are very similar and should not be implemented twice.

Static Polymorphism
In order to implement this modularity in an efficient and safe way, static polymorphism is used. The main goal is to use types and methods from base classes, as well as to mix in any kind of functionality with just the addition of one more class. To increase performance, as much as possible is resolved at compile-time; hence the implementation makes heavy use of templates. The modularity is implemented via static polymorphism, using object oriented programming and inheritance with templates. To avoid runtime overhead, dynamic polymorphism is entirely disabled, i.e. there are no virtual functions. Listing 2.1 shows a basic example, with the corresponding UML-diagram displayed in fig. 2.1. Throughout this work, we use the directional term above (or higher) for classes earlier in the inheritance chain and below (or lower) for classes that inherit from higher classes. In listing 2.1, base_t would be higher up than derived_t.

Listing 2.1: Static Polymorphism

#include <iostream>

namespace demo {
class base_t {
public:
    using data_type = int;

    void greet() const {
        std::cout << "Hi from base" << std::endl;
    }

    data_type base_property;  // store some state for example
};

template <typename super>
class derived_t : public super {  // super is inherited from
public:
    using typename super::data_type;  // use types from base

    void greet() const {
        super::greet();  // reuse functionality from base
        std::cout << "Hi from derived" << std::endl;
    }

    data_type derived_property;  // store some state as well
};

using use_t = derived_t<base_t>;  // the concrete type
}  // end namespace demo

Figure 2.1: The UML-diagram of the content in listing 2.1. The namespace demo is displayed by a white box, whereas every instantiated class is depicted by a yellow box. The top right white box of a yellow box shows the template parameter of that class.

The main technique for modularity are templated classes like derived_t. By chaining them together as needed, one can get all the necessary features while not losing any performance, thanks to the compile-time resolution. In listing 2.1 we show this with the example of two classes, base_t and derived_t, which both implement a greet function. Note that the base_t class is not templated and has a greet implementation that does not invoke any further calls to a base class. Hence there is the need for one special class which is on top and not provided by a user. Section 2.1.3 describes this class in more detail.
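A hypothetical usage of the composed type from listing 2.1 (not from the thesis) then looks as follows; the statically dispatched greet first reaches the base implementation:

int main() {
    demo::use_t obj;
    obj.greet();  // prints "Hi from base", then "Hi from derived"
    return 0;
}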


2.1.2 Parameter Initialization
A common task for any kind of simulation code is setting the initial conditions and parameters. Due to the modular structure outlined in section 2.1.1, initialization by passing all arguments in the constructor is impossible, since it would require as many overloaded constructors as there are combinations of modules. By introducing another type which carries the data required for construction, this problem can be solved elegantly. Listing 2.2 shows how this is implemented in C++ and fig. 2.2 shows the corresponding UML-diagram.

Listing 2.2: Initialization via param struct

#include <iostream>

namespace demo {
class base_t {
public:
    struct param {
        int base_param;
    };

    explicit base_t(param const& p) {
        std::cout << p.base_param << std::endl;
        // accessing p.derived_param would fail here
    }
};

template <typename super>
class derived_t : public super {
public:
    struct param : public super::param {
        int derived_param;
    };

    explicit derived_t(param const& p) : super(p) {  // pass p up the chain
        std::cout << p.base_param << std::endl;  // works
        std::cout << p.derived_param << std::endl;
    }
};
}  // end namespace demo

int main() {
    using namespace demo;
    using use_t = derived_t<base_t>;  // the concrete type

    // initialize the param struct
    use_t::param p;
    p.base_param = 1;
    p.derived_param = 2;
    // initialize the use type
    use_t sim(p);
    // ... do stuff
    return 0;
}

Figure 2.2: The UML-diagram of the content in listing 2.2. The red arrow shows that a type was defined inside another type, a fact that can also be seen from the name of that type.

Every module defines a struct param if it needs user data during construction. This param inherits from super::param, the previous param in the inheritance chain. If a module chooses not to implement such a param, it has to use super::param as the type of its constructor parameter. The elegant part is the inability of modules higher up to access constructor parameter information of lower modules, due to C++ base slicing. The other way around is possible, but that can be considered a feature, since derived_t is anyway dependent to some degree on base_t. This way, base_t does not need to explicitly expose all arguments it got during construction, since derived modules can just access that information during construction as well.

2.1.3 Base Class
As mentioned in section 2.1.1, the user should only define modules; the base class is fixed and provided by the library. Listing 2.3 defines a type_carrier and a step_counter and then combines them using the siquan::compose functionality.

Listing 2.3: use of siquan base class

#include <iostream>
#include <siquan_core/base.hpp>

namespace demo {
struct type_carrier {
    using count_type = double;
};

template <typename super>
class step_counter : public super {
public:
    using typename super::count_type;

    struct param : public super::param {
        count_type counter;
    };

    explicit step_counter(param const& p) : super(p) {}

    void step() {
        super::step();
        std::cout << "step" << std::endl;
    }

    bool stop() const { return super::stop() or true; }
};

using sim_type = siquan::compose<type_carrier, step_counter>;
}  // end namespace demo

For now, we will ignore the content of these classes and focus on the construction of sim_type. siquan::compose helps to write step_counter<siquan::base<type_carrier>> in a nicer way. The type carrier specifies all possible types that are required by the modules and is technically the first base class. This avoids forcing the user to include and derive the type carrier from siquan::base. The UML-diagram in fig. 2.3 reveals that there is an additional finalizer class, which will always be last in the inheritance chain, since it can be useful for the library to control that last module. As an analogy, the base and finalizer are the pieces of bread in a sandwich, while everything else is situated in between (with the sole exception of the type carrier).
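The exact definition of siquan::compose is internal to the library; one possible way to build such a helper with a recursive template (an illustrative assumption, not the actual siquan implementation) is:

// hypothetical compose helper: wraps each module template around a base type,
// so that chain<base_t, step_counter>::type == step_counter<base_t>
template <typename base, template <typename> class... mods>
struct chain { using type = base; };  // no modules left: just the base

template <typename base, template <typename> class first,
          template <typename> class... rest>
struct chain<base, first, rest...> {
    using type = first<typename chain<base, rest...>::type>;
};

A real compose would additionally wrap the type carrier in siquan::base and append the finalizer mentioned above.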


Figure 2.3: The UML-diagram of the content in listing 2.3. An inheritance chain of four modules composes the type requested by the user; an additional inheritance chain of two makes up the associated parameter struct. There is no red arrow between base and seed_param, as there is between step_counter and its param, since seed_param is defined outside base, but the typedef base::param resolves to it.

Parameter Base

If modules define a param, it always has to derive from the one above. As for the modules, there must be a base param which is the first one in the inheritance chain. This special param is exposed through siquan::base and is called seed_param (see fig. 2.3). From a design point of view, it would have been cleaner to leave it empty, but since almost all simulations require random number generators (rng), it is useful to have a seed rng. Its task is to initialize all subsequent random number generators, and it thus holds the master-seed value needed to reproduce a specific simulation.

2.1.4 Basic Simulation Components

The siquan simulation framework splits the work into three major functions, step, advance and update, which can be seen inside the while loop in listing 2.4, as well as three minor functions for setup, teardown and determining when to stop.

Listing 2.4: basic simulation steps

int main() {
    demo::sim_type::param p;
    demo::sim_type sim(p);

    sim.init();
    sim.update();
    while (not sim.stop()) {
        sim.step();
        // if the schedulers require, we can pass
        // the appropriate data here, e.g. sim
        sim.advance(sim);
        sim.update();
    }
    sim.finish();
    return 0;
}

Before describing these functions, we introduce the following characterizations for modules, depending on what their central task is. It is possible that a module belongs to more than one category, but that usually means it could be split into two modules. The order in which these modules are described here is usually also the order in which they are inherited.

Initializer
These modules only act upon initialization and are idle afterwards. They are mostly used to process input files and check that everything is in order. These modules implement only the init function.

Scheduler
This module type provides some controlled value to the simulation, like the temperature for simulated annealing (see section 1.2.5). It governs the advancement of such a parameter. The scheduler is closely tied to the algorithm module, and it only makes sense to break this functionality out of the algorithm if it can be reused in other places.


Configuration
The state (or configuration) of the simulated system is held by this kind of module. It is usually a pure data module that offers no functionality, since it is the algorithm's job to change the configuration. The benefit of not including the configuration directly in an algorithm module is the ability to change the data structure independently from the algorithm. This can result in a performance increase for certain problems that profit from a different internal data structure.

Algorithm
The central module responsible for stepping from one configuration to another. This would contain the Metropolis-Hastings algorithm, as one example. The algorithm may depend on the presence of specific schedulers.

Observer
If we want to observe/measure the configuration during the simulation, we can employ an observer module to do so. The observer updates its internal state and collects the desired information. It can observe anything a module above it defines.

Finalizer
Similar to the initializer, this type of module only acts after the simulation is done. It can analyze the final state and compute some desirable quantities. It can be thought of as an observer that only measures once, at the end.

Methods
With the different types of modules in mind, the following part clarifies the role of each method shown in listing 2.4.

• init: is responsible for the setup, e.g. reading the problem file, setting up sched-ulers and observers, and initializing the starting configuration.

• update: does not change the configuration of the simulated system. If something is measured repeatedly during the simulation, it would happen during this call. Note that there is one update call more than advance and step, since we also measure the initial configuration.

• stop: determines if the simulation is done. Any module can define a stop method, whose criteria should be combined in an or statement with the stop call of the prior module, as seen in listing 2.3 (line 24). The base::stop() always returns false. The or statement is a design choice: one module can stop the simulation without considering other modules that implement stop.

• step: only concerns the algorithm module. It changes the configuration of the system and ideally makes sure it is sufficiently de-correlated from the previous one. This is however not strictly necessary. If step produces correlated samples, some kind of autocorrelation observer might be necessary.

• advance: changes the state of the scheduler. This method takes one templated argument, sim in listing 2.4. This is necessary for an adaptive scheduler that depends on observed values. For example, when using classical annealing on the Ising model, the temperature should be changed more slowly when the system is close to a phase transition, i.e. if the susceptibility is high. It is not possible for higher modules to access information of lower ones, hence we need to pass this information as an argument.

• finish: executes any tasks necessary after the simulation is done. It can range from refining observed data to decoding (see chapter 4) the configuration.
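To make the chaining concrete, the following is a minimal sketch of how a module can extend the methods of the one above it. The names (base, step_limit) are illustrative only, not siquan's actual interfaces (cf. listing 2.3):

#include <iostream>

// minimal sketch of the module chaining: each module derives from the one
// above and forwards to it; names are illustrative only
struct base {
    void init() {}
    void update() {}
    void finish() {}
    bool stop() const { return false; }    // base::stop() always returns false
};

template<typename Super>
struct step_limit : Super {                // a simple scheduler-like module
    int updates = 0, max_updates = 5;
    void update() { Super::update(); ++updates; }
    // combine the own criterion with the prior module's stop via "or"
    bool stop() const { return Super::stop() or updates > max_updates; }
};

int main() {
    step_limit<base> sim;
    sim.init();
    sim.update();                          // one extra update for the start
    while(not sim.stop()) sim.update();    // step()/advance() omitted here
    sim.finish();
    std::cout << "performed " << sim.updates << " updates\n";
}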

Having defined the general structure of our simulation framework, we now present specific algorithm implementations.

2.2 Simulated Annealing Implementation

The SA implementation of siquan is very simple and limited, but still yields good results. We also implemented a more powerful version, not part of the core siquan library, with performance comparable to [18].

This section expands on section 1.2.5. First, the configuration is specified, followed by the implementation of the scheduler. Finally, the single spin algorithm is explained.

2.2.1 Configuration

The goal of this algorithm is to minimize eq. (1.6). The configuration σ is implemented as std::vector<bool>. Every coupling consists of a weight J_{i,j,...} and an index vector containing the spin-indices i, j, … belonging to that coupling (not just two-body terms). Furthermore, there is a mapping from each spin-index to the subset of couplings it is a part of. This is required for performance, in order to look up the couplings of a spin in a fast manner.
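A minimal sketch of such a data layout, with illustrative names rather than siquan's actual types, could look as follows:

#include <cstddef>
#include <vector>

// sketch of the SA configuration data (illustrative, not siquan's API)
struct coupling {
    double weight;                     // J_{i,j,...}
    std::vector<std::size_t> spins;    // spin indices of this (k-body) term
};

struct configuration {
    std::vector<bool> state;                        // sigma, one bit per spin
    std::vector<coupling> couplings;                // all terms of eq. (1.6)
    std::vector<std::vector<std::size_t>> by_spin;  // spin -> coupling indices
};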

2.2.2 Temperature Scheduler

For simulated annealing, the temperature needs to be decreased slowly. This is the task of the temperature scheduler. siquan provides a few common schedulers, as shown in fig. 2.4. All these schedulers require the presence of a simulation step counter, i.e. they all need to know at what fraction of completion p ∈ [0,1] the simulation currently resides.

Linear

The linear scheduler is the simplest and takes a starting value a and end value b. The value depending on the simulation progress is given by:

T(p) = a + p ∗ (b − a)    (2.1)

Figure 2.4a shows the linear scheduler with a = 11 and b = 1.

Inverse

The inverse scheduler takes the same arguments as the linear scheduler, but schedules the value according to:

T(p) = 1 / (1/a + p ∗ (1/b − 1/a))    (2.2)


Figure 2.4: Sketch of the main schedulers provided by siquan (temperature vs. simulation progress p): a) linear, b) inverse, c) stepped linear, d) piecewise multi.

Values a ≈ 0 and b ≈ 0 should be avoided. Figure 2.4b shows the inverse scheduler with a = 11 and b = 1.

Stepped

This is not an independent scheduler, but rather a modification that can be applied to any scheduler. It takes a step value s and rounds the value of the scheduler to the next value in {b + n ∗ s | n ∈ Z}. If a is not in this set, it will never be attained, despite being the starting value. b is chosen as the reference since it is much more important to match the final (usually close to zero) temperature than the initial temperature, which should just be hot enough when annealing. Figure 2.4c shows a linear scheduler with a = 11 and b = 1, stepped with s = 1.

Piecewise Linear

This scheduler can take more than the two values a and b, namely N > 1 different values. It divides the simulation into N−1 equal parts and schedules linearly between the corresponding numbers in each section. E.g. an input of 11, 2, 3 would result in linear scheduling from 11 to 2 in the first half and linear scheduling from 2 to 3 in the second half of the simulation.

Piecewise Multi

The piecewise multi scheduler works similarly to the piecewise linear scheduler. Instead of only allowing linear scheduling in each section, however, one can specify a few functional behaviors, such as inverse and square segments. The linear and inverse behavior is highlighted above, while the square scheduling is defined as:

T(p) = a + p² ∗ (b − a)    (2.3)

An additional option can be passed to reverse the shape of the square scheduling and make it change faster in the beginning:

T(p) = a + (1 − (1 − p)²) ∗ (b − a)    (2.4)

This changes whether the schedule starts fast (steep slope) or slow. The possible functional forms are passed as string shortcuts: square starting slow sS or fast sF, inverse starting slow iS or fast iF. If no specification is found, linear scheduling is used. Figure 2.4d shows a piecewise multi scheduler with input [11, sF, 5, 9, iS, 2, 1]. There are four parts: first a fast square descent from 11 to 5, followed by a linear ascent to 9. A slow starting inverse scheduler continues until 2, where a linear scheduling takes over and finishes at 1. While one does not have full control over the functional behavior of the scheduler, this already allows for many interesting scheduling functions. Full control however could easily be implemented as well.
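As an illustration, the schedules of eqs. (2.1) to (2.4) and the stepped modification can be sketched in a few lines; the function names are hypothetical and not siquan's scheduler interface:

#include <cmath>

// sketch of the schedules in eqs. (2.1)-(2.4); p is the progress in [0,1]
double linear_T (double p, double a, double b) { return a + p * (b - a); }
double inverse_T(double p, double a, double b) {
    return 1.0 / (1.0 / a + p * (1.0 / b - 1.0 / a));  // avoid a or b near 0
}
double square_slow_T(double p, double a, double b) {
    return a + p * p * (b - a);                        // slow start
}
double square_fast_T(double p, double a, double b) {
    return a + (1.0 - (1.0 - p) * (1.0 - p)) * (b - a);  // steep slope first
}
// stepped modification: round to the nearest value in {b + n*s | n in Z},
// so the final value b is matched exactly
double stepped(double T, double b, double s) {
    return b + s * std::round((T - b) / s);
}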

2.2.3 Notation

To better describe the algorithms, we use the visual representation defined in fig. 2.5. The quantum spins are not relevant for this section, but are still introduced for the next one.

Figure 2.5: Visual notation used to describe annealing algorithms: a) shows three quantum ½-spins, the first up, the second down and the third in some superposition, which is not specified further. b) shows two classical spins, one up and one down, while c) shows four classical spins, connected by ferromagnetic couplings (orange) and antiferromagnetic couplings (blue). If a coupling adds energy, it is marked with a red cross, and if it reduces energy, with a green check-mark.

2.2.4 Single Spin Algorithm

The single spin update algorithm is possibly the simplest to use with Metropolis-Hastings. In order to change the configuration, we select a random spin and try to flip it. This can be seen in fig. 2.6a-b and fig. 2.6c-d. In accordance with section 1.2.5, we need to calculate the energy difference, which would be −8J in the a-b case and +2J in the c-d case, assuming all couplings J are equal. This difference is used to calculate the acceptance probability p_acc = min(1, e^{−β∆E}). Lastly, we generate a random number r ∈ [0,1) and accept the spin flip if r < p_acc. Otherwise we reject it. One weakness among a few is the practical¹ inability to flip larger clusters of spins at low temperatures if certain criteria are met. Figure 2.6c-d shows such a situation, where two clusters of opposite spins coexist with a straight boundary² separating them. If we try to move the boundary with single spin flips, we first have to increase the energy, since a dent in the boundary needs more energy, which is unlikely at low temperatures. This should be considered if initial results do not seem to make sense. Even in irregular and more connected systems this could potentially happen, not just in the very symmetric Ising model.
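A minimal sketch of one such update step, assuming a helper energy_diff(i) that returns ∆E for flipping spin i (illustrative names, not siquan's implementation):

#include <cmath>
#include <random>
#include <vector>

// one Metropolis-Hastings single spin update (sketch)
template<typename EnergyDiff>
void single_spin_step(std::vector<bool>& sigma, double beta,
                      EnergyDiff energy_diff, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, sigma.size() - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::size_t i = pick(rng);
    double dE = energy_diff(i);
    // accept with p_acc = min(1, exp(-beta * dE)), i.e. if r < p_acc
    if (dE <= 0 or unit(rng) < std::exp(-beta * dE))
        sigma[i] = !sigma[i];
}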

Optimization: Precomputing Exponentials

If the couplings are all equal, one can precompute the expensive min(1, e^{−β∆E}) for all possible ∆E. If the highest degree in the connection graph is D, the possible energy differences are {−2DJ, −2(D−1)J, …, 2(D−1)J, 2DJ}. We can reduce this by more than half, since we know that all non-positive energy differences will always be accepted, leaving us with {2J, …, 2(D−1)J, 2DJ}. For the Ising model with D = 4, that would correspond to 4 values. But as soon as the temperature changes, these values need to be recomputed, since β changed. This is one application of the stepped scheduler modification described in section 2.2.2, since it changes the temperature less frequently.
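A sketch of such a lookup table, under the stated assumption of equal couplings J and maximal degree D (illustrative, not siquan's code):

#include <cmath>
#include <vector>

// lookup table for min(1, exp(-beta*dE)): only the positive differences
// {2J, 4J, ..., 2DJ} need storing, everything else is always accepted
std::vector<double> acceptance_table(double beta, double J, int D) {
    std::vector<double> table(D);
    for (int n = 1; n <= D; ++n)
        table[n - 1] = std::exp(-beta * 2.0 * n * J);  // dE = 2nJ
    return table;
}
// the table must be rebuilt whenever beta changes, e.g. once per step
// of the stepped scheduler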

2.2.5 UML Diagram

Figure 2.7 shows the full UML diagram of the SA-simulation provided by siquan. The different stages in order of inheritance are briefly described:

¹ There is always a very small non-zero probability.
² Assume it goes across the periodic boundary, i.e. no spin section encircles the other one completely.


Figure 2.6: A 4x3 subsection of a ferromagnetic Ising model. The Ising model is chosen here for visualization reasons. Any kind of connected graph could be chosen, not restricted to nearest neighbors and two-body interactions. a) A single spin points down and leaves all couplings to its neighbors in the high energy state. b) The spin was flipped and now all bonds are satisfied. c) There are two blocks with a boundary of three unsatisfied couplings. d) Trying to intrude into one section increases the energy, since there are now two more unsatisfied bonds. Such boundaries as in c) are stable at low temperatures if they extend over the periodic boundary and only single spin updates are used.

• type_carrier: contains all typedefs necessary for all modules.
• base: is the first module and derives from type_carrier. For more details, see section 2.1.3.
• connect::basic: is a data class that stores the couplings. In the program, we call them connections instead of couplings, hence the namespace connect.
• connect::read_in: gets a filename containing the problem, but does not specify the format.


Figure 2.7: Full UML-diagram for the SA-simulation provided by siquan.

• connect::read_in_txt: uses the filename and reads in connections from a .txt file.

• connect::remap: makes sure that the internal spin labels start at 0. The user is free to even use strings as spin labels, with a corresponding typedef in the type_carrier. This remapping is undone in the last module, connect::unmap, s.t. the user gets the original labels back.

• scheduler::sim_step: takes a steps argument and stops the simulation after this many steps. Also exposes the current simulation progress for all modules below.

• scheduler::piecewise_multi_T: schedules the temperature, see section 2.2.2.
• state::simple: holds the state of the current configuration. In this case it is just a small wrapper around a std::vector<bool>.
• state::state_connect: since many algorithms need to access all connections belonging to a specific spin in a fast manner, this stage provides the appropriate data structure.

• algo::simulated_annealing: contains the single-spin update algorithm described in section 2.2.4.

• algo::analyze_energy: calculates the total energy from the configuration and the couplings. Note that this is an expensive operation. The algo::simulated_annealing module internally holds an energy variable that is constantly updated without global recalculation, which could also be used. This module recalculates the energy in finish to ensure correctness.

• observer::state: records the configuration.
• observer::T_scheduler: records the temperature.
• observer::accept: observes the acceptance ratio.
• observer::snapshot_limit: triggers an observation. We do not necessarily want as many observations as steps for performance reasons, hence the parameter snapshots will determine how many observations are made.

• connect::unmap: reverses the internal mapping. The user gets back the original spin labels.

2.3 Simulated Quantum Annealing Implementation

The implementation of simulated quantum annealing (SQA) requires a few more steps than the previously discussed SA implementation. First, we mathematically derive the analogy between SQA and SA. Second, an interpretation of the parts of the mapping is given and last, we discuss a more advanced cluster update implementation that is necessary for SQA, along with implementation optimizations. This work builds on top of an initial implementation provided by Bettina Heim in collaboration with other members of the Troyer Group. Most techniques discussed were already present in some form in this initial implementation.

2.3.1 Analogy between Quantum and Classical Ising Model

In the following section, we derive the analogy between the classical and the quantum Ising model with a transverse field. We choose the 1D quantum Ising chain of length L for simplicity, shown in fig. 2.8. The derivation holds all the same for an all-to-all connected system with arbitrary z-body interaction terms as in eq. (1.6). Throughout this section we will use the colors orange for matrices, red for vectors and green to highlight scalar values, though not all scalar values will be highlighted this way.

Figure 2.8: A 1D quantum spin chain of size L = 8 with nearest-neighbor interactions and periodic boundary conditions. See fig. 2.5 for the detailed meaning of each visual component.

The general Hamiltonian we want to investigate

H = H_x + H_z    (2.5)

has two components: First, the problem Hamiltonian

H_z = −J ∑_p^L ᶻσ_p ᶻσ_{p+1}    (2.6)

containing the nearest-neighbor interactions. ˣσ_p represents the Pauli matrix σ_x acting only on the spin at position p in the chain. The same notation is used for ᶻσ_p. Consistent notation of indices is the reason for the unusual top-left indices x and z.

ˣσ_p = I₂ ⊗ … ⊗ σ_x ⊗ … ⊗ I₂,   with σ_x at position p    (2.7)

where I₂ is the two-dimensional identity matrix. The neighbors in the chain (let us assume periodic boundary conditions) want to be parallel or anti-parallel depending on the value of J. And second, the transverse field Hamiltonian

H_x = −Γ ∑_p^L ˣσ_p    (2.8)

will put the spins into a superposition of up and down. Depending on whether the transverse field strength Γ or the coupling strength J dominates, the spins will be fully mixed or close to one of the basis states all up or all down. In order to conveniently make observations on our system, we need to find the partition sum Z

Z = Tr e^{−βH} = Tr O = ∑_{x₁∈B} ⟨x₁| O |x₁⟩    (2.9)

where the trace can be expressed by a sum over a basis B. Even if it is not necessary for the trace, we choose B to be the orthonormal computational basis x_i = x_i^1 ⊗ … ⊗ x_i^L, i.e.


(1,0,…,0), (0,1,…,0), …, (0,0,…,1), since the x_i^p are the eigenvectors of σ_z. β = 1/(k_B T) contains the temperature and Boltzmann factor. Further, we use O as a shorthand for the matrix exponential. The main difficulty lies in this matrix exponential and the non-commuting parts of our Hamiltonian. First we split the matrix exponential into many small parts

O = (O^{1/M})^M = O^{1/M} ⋯ O^{1/M}    (2.10)

and then use the second order Suzuki-Trotter expansion (ST2) [19]

O^{1/M} = e^{−(β/M) H} = e^{−∆H_x − ∆H_z} ≈ e^{−(∆/2)H_z} e^{−∆H_x} e^{−(∆/2)H_z}   (ST2)    (2.11)

which becomes exact for M → ∞. We introduced ∆:

∆ = β/M    (2.12)

The great advantage over the first order Suzuki-Trotter expansion (ST1) is the preservation of unitarity, which is of great help in the numerical simulations. Now we insert the above into the partition sum and add many identities in the form of I = ∑_{x_i} |x_i⟩⟨x_i|, which works since the basis x_i is orthonormal³.

Z = Tr O = ∑_{x₁} ⟨x₁| O^{1/M} I O^{1/M} ⋯ I O^{1/M} |x₁⟩,   inserting I = ∑_{x₂} |x₂⟩⟨x₂| etc.    (2.13)
= ∑_{x₁…x_M} ⟨x₁| O^{1/M} |x₂⟩ ⋯ ⟨x_M| O^{1/M} |x₁⟩    (2.14)
= ∑_{x₁…x_M} ∏_i^M ⟨x_i| O^{1/M} |x_{i+1}⟩,   with x_{M+1} := x₁    (2.15)
= ∑_{x₁…x_M} ∏_i^M ⟨x_i| e^{−(∆/2)H_z} I e^{−∆H_x} I e^{−(∆/2)H_z} |x_{i+1}⟩,   inserting I = ∑_{x_k} |x_k⟩⟨x_k| etc.    (2.16)
= ∑_{x₁…x_M} ∏_i^M ∑_{x_k,x_n} ⟨x_i| e^{−(∆/2)H_z} |x_k⟩ ⟨x_k| e^{−∆H_x} |x_n⟩ ⟨x_n| e^{−(∆/2)H_z} |x_{i+1}⟩    (2.17)

where the first and last factor in eq. (2.17) form the z Part and the middle factor the x Part.

We now have the path integral⁴ formulation of the partition sum, which we know in principle how to sample using the methods described in section 1.2.4. But first, we need to simplify the two matrix exponentials.

z Part

Since x_i is the computational basis, H_z is diagonal and thus the matrix exponential is diagonal as well, leading to:

⟨x_i| e^{−(∆/2)H_z} |x_k⟩ = δ_{i,k} e^{−(∆/2) ⟨x_i|H_z|x_k⟩} = δ_{i,k} e^{−(∆/2) H_z(x_i)}    (2.18)

³ In the following, we will omit explicitly writing x_i ∈ B for brevity.
⁴ Although the integral is a sum here.


where we introduced H_z(x_i) = ⟨x_i| H_z |x_i⟩ in the last step. Inserting this into eq. (2.17) eliminates x_k and x_n, resulting in

Z = ∑_{x₁…x_M} ∏_i^M e^{−(∆/2)H_z(x_i)} ⟨x_i| e^{−∆H_x} |x_{i+1}⟩ e^{−(∆/2)H_z(x_{i+1})}    (2.19)
= ∑_{x₁…x_M} ∏_i^M e^{−∆H_z(x_i)} ⟨x_i| e^{−∆H_x} |x_{i+1}⟩   (x Part)    (2.20)

where we rearranged the scalar factors in the product in the last step, which works due to the periodic boundary conditions.

x Part

This part takes considerably more effort, since the transverse field Hamiltonian H_x is not diagonal with respect to B. First, we observe that all ˣσ_p commute amongst each other. This can easily be confirmed by checking the definition in eq. (2.7) and the fact that two tensor products commute if all their factors commute. We can always decompose one matrix multiplication of two tensor products into one tensor product of many matrix multiplications, given that the dimensions of all factors are the same.

(A₁ ⊗ B₁)(A₂ ⊗ B₂) = (A₁A₂) ⊗ (B₁B₂)    (2.21)

If two matrices commute, we can simplify a matrix exponential as follows

[A, B] = 0 ⟹ e^{A+B} = e^A · e^B    (2.22)

which we now use on the x Part.

⟨x_i| e^{−∆H_x} |x_{i+1}⟩ = ⟨x_i| e^{∆Γ ∑_p^L ˣσ_p} |x_{i+1}⟩    (2.23)
= ⟨x_i| ∏_p^L e^{∆Γ ˣσ_p} |x_{i+1}⟩    (2.24)

Now we only need to solve the matrix exponential for a single component, which is manageable. We need the following two equations to achieve this:

e^{I₂⊗…A…⊗I₂} = I₂ ⊗ … e^A … ⊗ I₂    (2.25)

e^{∆Γσ_x} = exp(∆Γ [[0, 1], [1, 0]]) = [[cosh ∆Γ, sinh ∆Γ], [sinh ∆Γ, cosh ∆Γ]] =: X    (2.26)


We now expand eq. (2.24) with the help of eqs. (2.25) and (2.26) and further simplify with eq. (2.21):

⟨x_i| e^{−∆H_x} |x_{i+1}⟩ = ⟨x_i| ∏_{p=1}^L e^{∆Γ ˣσ_p} |x_{i+1}⟩    (2.27)
= ⟨x_i| (X ⊗ I₂ ⊗ I₂ …)(I₂ ⊗ X ⊗ I₂ …) … |x_{i+1}⟩    (2.28)
= ⟨x_i| (X ⊗ … ⊗ X) |x_{i+1}⟩    (2.29)
= ∏_{p=1}^L ⟨x_i^p| X |x_{i+1}^p⟩    (2.30)
= ∏_{p=1}^L (1 − δ_{z_i^p, z_{i+1}^p}) sinh ∆Γ + δ_{z_i^p, z_{i+1}^p} cosh ∆Γ    (2.31)

The second to last step used the decomposition of |x_i⟩ into its tensor factors ⊗_p^L |x_i^p⟩, where |x_i^p⟩ is a two-dimensional basis vector, either up (1,0)ᵀ or down (0,1)ᵀ. The last step introduces the notation of z_i^p as the measurement outcome of σ_z, which is in {−1,1}.

σ_z |x_i^p⟩ = z_i^p |x_i^p⟩    (2.32)

Further Simplification

Since the values z_i^p ∈ {−1,1}, we can simplify eq. (2.31) by using the fact that only one of the expressions (1−ab) and (1+ab) can be non-zero for any choice of a, b ∈ {−1,1}. The arguments of sinh and cosh are all identical and not displayed:

(1 − δ_{a,b}) sinh + δ_{a,b} cosh = ½(1 − ab) sinh + ½(1 + ab) cosh    (2.33)
= e^{½(1−ab) log sinh + ½(1+ab) log cosh}    (2.34)
= e^{½(log sinh + log cosh)} e^{½ ab (log cosh − log sinh)},   with c(∆Γ) := e^{½(log sinh + log cosh)}    (2.35)
= c e^{−½ ab log tanh}    (2.36)

⟨x_i| e^{−∆H_x} |x_{i+1}⟩ = ∏_p^L c e^{−½ log(tanh(∆Γ)) z_i^p z_{i+1}^p}    (2.37)
= c e^{−½ log(tanh(∆Γ)) ∑_p^L z_i^p z_{i+1}^p}    (2.38)

For the z Part, we apply the same logic as in eqs. (2.27) to (2.30) to arrive at:


H_z(x_i) = ⟨x_i| H_z |x_i⟩    (2.39)
= −J ∑_p^L ⟨x_i| ᶻσ_p ᶻσ_{p+1} |x_i⟩    (2.40)
= −J ∑_p^L z_i^p z_i^{p+1}    (2.41)

Mapping to 2D Ising

Now that we brought all parts into a suitable form, we continue from eq. (2.20):

Z = ∑_{x₁…x_M} ∏_i^M e^{−∆H_z(x_i)} ⟨x_i| e^{−∆H_x} |x_{i+1}⟩    (2.43)
= ∑_{z₁¹,z₁²…z_M^L} ∏_i^M e^{∆J ∑_p^L z_i^p z_i^{p+1}} c e^{−½ log(tanh(∆Γ)) ∑_p^L z_i^p z_{i+1}^p}    (2.44)
= c ∑_{z₁¹,z₁²…z_M^L} ∏_i^M e^{∑_p^L ∆J z_i^p z_i^{p+1} − ½ log(tanh(∆Γ)) z_i^p z_{i+1}^p}    (2.45)
⇒ (drop c)   ∑_{z₁¹,z₁²…z_M^L} e^{∑_i^M ∑_p^L ∆J z_i^p z_i^{p+1} − ½ log(tanh(∆Γ)) z_i^p z_{i+1}^p}    (2.46)

We dropped c in the last step, since the constant is irrelevant in a partition sum for calculating expectation values. If we now compare this form with the partition sum of a two-dimensional classical Ising grid with different couplings in x and y direction

Z_cl = ∑_{z₁¹,z₁²…z_M^L} e^{−β_cl H_cl(z₁¹, z₁², …, z_M^L)}    (2.48)
= ∑_{z₁¹,z₁²…z_M^L} e^{−β_cl (∑_i^M ∑_p^L −J_x z_i^p z_i^{p+1} − J_y z_i^p z_{i+1}^p)}    (2.49)
= ∑_{z₁¹,z₁²…z_M^L} e^{∑_i^M ∑_p^L β_cl J_x z_i^p z_i^{p+1} + β_cl J_y z_i^p z_{i+1}^p}    (2.50)

we can make the following direct mapping between the one-dimensional quantum Ising model and the two-dimensional classical analog:


β_cl J_x = ∆ J_qu = (β_qu / M) J_qu    (2.52)

β_cl J_y = −½ log(tanh(∆Γ)) = −½ log(tanh((β_qu / M) Γ))    (2.53)
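As a small sketch, the mapping of eqs. (2.52) and (2.53) can be computed directly; here β_cl = 1 is assumed (the classical temperature is absorbed into the couplings), and the names are illustrative:

#include <cmath>

// couplings of the 2D classical analog, per eqs. (2.52)-(2.53), beta_cl = 1
struct trotter_couplings { double Jx, Jy; };

trotter_couplings map_couplings(double J_qu, double Gamma,
                                double beta_qu, int M) {
    double Delta = beta_qu / M;
    double Jx = Delta * J_qu;                               // within a slice
    double Jy = -0.5 * std::log(std::tanh(Delta * Gamma));  // between slices
    return {Jx, Jy};
}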

2.3.2 Analogy Interpretation

Now that we have shown the quantum system to be equivalent, within the Suzuki-Trotter approximation, to a classical Ising model, this offers one way of implementing it on a classical computer. But first we investigate the different parts of the analogy further for a better understanding.

Trotter Slices

Figure 2.9 shows this analogy. Every row is called a trotter slice, and the larger M, the more accurate the approximation. Every trotter slice can be thought of as a potential measurement outcome of the quantum system, i.e. a measurement in reality corresponds to selecting a random trotter slice. Thus we can also calculate the probability to measure a ground state with a single simulation, since we can read all trotter slices and are not forced to only measure one, like a quantum annealer would be. Nevertheless, many repetitions of the simulation yield better statistics than one simulation with many trotter slices. In the case of fig. 2.9, all trotter slices are identical, leaving only one possible measurement outcome. Note that the trotter-direction is orthogonal to a trotter slice. The direction inside a trotter slice is the real world direction.

Influence of Trotter Number M

While it is true that a larger M leads to a more accurate approximation, it can be beneficial to leave M not too large if the goal is to use SQA solely as an optimizer. There is an optimal number M for which the ground state is found fastest for a given problem family. The reason is that a coarser discretization allows the system to pass through barriers in fewer update steps, thus making it much more likely for a more discrete system to overcome such barriers. When running SQA as an optimizer, M is a parameter that requires tuning.

Influence of ∆

Defined by ∆ = β/M, delta governs the ratio of the couplings between trotter slices and within them, since it enters differently into the two couplings, see eqs. (2.52) and (2.53). Since the transverse field is shutting down along the anneal, the couplings between the trotter slices get stronger, dominating the contributions of the intra-trotter couplings. In the beginning, while the transverse field is strong, the intra-trotter couplings dictate the update probability. The key is to not waste compute time by having one part massively dominate the other, since the interesting solving of the problem happens while they are comparable. From experience, keeping ∆ ≈ 1 works nicely. This means that the temperature is chosen as T ≈ 1/M in natural units.


Figure 2.9: A 1D quantum spin chain shown in a) can be represented by the 2D classical Ising model displayed in b). The horizontal or real space direction contains the modified couplings and the vertical direction, called trotter-direction, contains the couplings induced by the transverse field.

2.3.3 Cluster Update and Optimizations

Due to the trotter slices M ≈ 1000, around three orders of magnitude more spins need to be simulated compared to the SA implementation of a comparable system. Single spin updates do not deliver the performance necessary in this case. Hence a cluster update algorithm [20] is used.

Cluster Update

Before we describe the cluster updates, we start with a few observations. We switch to a binary representation of the spins in order to save space. We only use cluster updates along the trotter-direction and use local updates in the real-space direction. Let us assume spin i has the following entries along the trotter-direction:

1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1

At low temperatures, only the spins with different neighbors could flip, since any other flip would increase the energy, as mentioned in section 2.2.4. But if we create the following clusters, any one would be accepted if flipped together:

1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1

A simple way to generate these clusters is to compute the break-points with a cheap xor operation:

1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1
1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1   xor
0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0

The second pattern is bit-shifted to the left by one and the first 1 was moved to the back due to periodic boundary conditions. The last pattern has a 1 where there is a break-point in the original pattern. But if this was the only mechanism, clusters would only join and never break up, which cannot be true for higher temperatures. Hence we need to make these clusters smaller, in accordance with the temperature. High temperatures should yield small clusters, even close to single spin updates, but low temperatures should favor larger clusters. If we flip one coupling between trotter slices, J_y = −(1/(2β_cl)) log(tanh(∆Γ)), we have to insert the energy difference ∆E = 2 ∗ J_y into the acceptance formula and get:

p = min(1, e^{−β_cl 2 J_y}) = min(1, tanh(∆Γ)) = tanh(∆Γ)    (2.54)

where we dropped the min since tanh(x) ∈ [−1,1]. We now use this probability in the cluster update outlined in [21]. If two neighboring spins along the trotter-direction are the same, these two are connected to a cluster with probability 1 − p. Equivalently, we can just insert break-points with probability p in the break-point pattern mentioned above. This can be done with a computationally cheap or operation, given a random bit pattern where each bit is set with probability p, since each bit set to 1 will generate a break-point.

0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0   break-point pattern
0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0   random with probability p
0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0   or result
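A sketch of this construction for one quantum spin, packing 64 trotter slices into a uint64_t (illustrative names; the generation of the random pattern with probability p is covered next):

#include <cstdint>

// break-point pattern for one quantum spin; random_bits must have each bit
// set with probability p (see listing 2.5 below)
uint64_t break_points(uint64_t slices, uint64_t random_bits) {
    // rotate left by one to compare each slice with its trotter neighbor
    uint64_t shifted = (slices << 1) | (slices >> 63);  // periodic boundary
    uint64_t forced  = slices ^ shifted;    // 1 wherever neighbors differ
    return forced | random_bits;            // extra breaks with probability p
}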

Generating Random Bit Pattern with Probability p

As mentioned above, we need a random bit pattern to drive the cluster update in trotter-direction. A simple way would be to use a random number r ∈ [0,1) for each of the M entries and set the bit if r < p. Since M is a large number, this is computationally too expensive. If we limit the precision of p to e.g. 32 bits, we can generate this pattern much more elegantly. We use the property that a random number generator that uniformly outputs integers in the range [0,2^k) has p = 0.5 for each bit to be set, i.e. acts as a coin flip. Listing 2.5 shows a function that behaves as if it would compare the given probability p against a random number r ∈ [0,1) with only coin flips available.

Listing 2.5: generating probability p from coin toss in python

import random

def is_bit_set(x, i):   # helper assumed by the listing
    return bool((x >> i) & 1)

def coin_flip():        # helper assumed by the listing
    return random.getrandbits(1) == 1

def smaller_than_prob(p):
    precision = 32
    p_int = int(p * (1 << precision))
    # find the points where the bit-pattern changes from
    # a sequence of 1 to 0 or a sequence of 0 to 1
    pattern_change = p_int ^ (p_int >> 1)

    bits = [is_bit_set(pattern_change, i)
            for i in range(precision)]
    bits = list(reversed(bits))  # most significant bit first

    result = bits[0]  # is True if p >= 0.5
    for bit in bits[1:]:
        # there are at most precision-1 many coin tosses
        flip = coin_flip()
        if not flip:  # we only continue if we get heads
            break
        # flip the result if the bit pattern changes,
        # meaning that p now decreases instead of increases
        # (or vice-versa)
        if bit:
            result ^= True  # flips the value, mask is True

    return result

One might correctly argue that we need many more coin flips than if we were able to generate a single random number r ∈ [0,1). While this is true, the fact that a coin flip is much cheaper, and the ability to run this for many bits at the same time, makes this method superior. In order to run this example for a bit-string of length k instead of a single bit, we need one minor modification: replace the coin_flip with the mentioned random number generator that uniformly outputs an integer in [0,2^k), which is equivalent to tossing k coins in parallel. The result will be a bit pattern where each bit is set with probability p, exactly what was needed for the cluster algorithm.

Accepting Cluster Update

With the clusters now formed, the last ingredient is the acceptance probability of flipping a cluster. This is given by calculating the energy difference in real-space direction of all spins belonging to the cluster, as shown in fig. 2.10. This energy difference is processed normally, p = min(1, e^{−β∆E}), to get the acceptance probability. To conclude the update for one quantum spin, we decide for every cluster built if it should be flipped. The cluster update is not just more efficient, but also features some nice theoretical properties the single spin update lacks, e.g. no critical slowing down when approaching a phase transition.

Figure 2.10: Same setup as in fig. 2.9. If we want to flip the red cluster, we need to calculate the energy this cluster has within the trotter slice. All couplings that need to be regarded are marked with a cross.

Multi-Spin Coding

An important optimization is to use the best data structure available and use cheap operations whenever possible. Every quantum spin has M corresponding classical spins in the M trotter slices. When we update a quantum spin, many classical spins potentially change. After this step, we need to recalculate the local energies of all M spins, which involves computing all couplings of eq. (1.6). For an N-body coupling term, we need to multiply N spins σ_i ∈ {−1,1} M times. The simulation stores the −1 as 1 and the 1 as 0, s.t. this multiplication reduces to N xor operations. This way we find out if there is an even or odd number of down spins among the N spins. Rather than repeat this process M times, we pack the classical spins of a quantum spin into large types that support xor, e.g. an uint64_t. This way we execute 64 bit-xor operations at the same time. At the time of writing, much larger types could have been used as well. This technique of packing multiple spins into one type is called multi-spin coding and speeds up computation drastically. We still need to compute the energy differences for a single spin separately during an update, since floating point operations cannot be reduced to simple parallel binary operations⁵ like the multiplication of spins.
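A minimal sketch of the xor-based product of N packed spins (illustrative names; 0 encoding +1 and 1 encoding −1 as described above):

#include <cstdint>

// multi-spin coding: each bit is one trotter slice of a quantum spin.
// the sign of an N-body product over all 64 slices reduces to N-1 bit-xors
uint64_t product_sign(const uint64_t* spins, int n_body) {
    uint64_t parity = spins[0];
    for (int i = 1; i < n_body; ++i)
        parity ^= spins[i];  // bit set -> odd number of down spins -> product -1
    return parity;
}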

2.4 Library Overview

This section covers the different libraries that will be mentioned throughout this thesis.

2.4.1 siquan

The simulated quantum annealing library (working name siquan) is written in C++ as a header-only library. Despite the main purpose being SQA, the library contains further implementations such as SA and exact methods. The templates are required to achieve the modularity outlined in section 2.1.1. Header-only libraries tend to have longer compile times, but the size and absence of complex template metaprogramming keeps this time acceptable (< 10 sec). All implementations provided are capable of simulating problems formulated by eq. (1.6). While performance was prioritized and achieved, one could still speed up the code considerably, for example by writing specialized versions that do not solve the general eq. (1.6) but a restricted version of it, e.g. by introducing a degree limit or only allowing up to 2-body couplings. The second untapped potential lies in the use of Intel intrinsics, such as SSE and AVX. This restricts portability of the code, but is expected to yield further speedup.

2.4.2 frescolino

The frescolino library⁶, abbreviated to fsc in the code, is a collection of quality of life improvements for C++ and python projects. It is a collaborative effort between Dominik Gresch, Donjan Rodic and the author. The goal was to avoid everyone having to implement similar functionality for themselves. The content is basic functionality useful for any computational project. We briefly highlight the most useful parts the siquan library uses from fsc.

Standard Support

A C++ library which introduces capabilities to print most types from the Standard Template Library (STL). It further contains a templated function fsc::sto<T>, inspired by the many std::stoi, std::stod, … functions. It converts strings to the specified type T. This template can be specialized and supports many types of the STL. It allows to convert e.g. "[1,2]" to a std::vector<int>. The inverse templated function, fsc::to_string, modeled after std::to_string, converts many types of the STL to their string form. These two functions facilitate convenient IO operations of parameters and data.

⁵ Vectorized intrinsics would be needed for computing multiple floating point types in parallel.
⁶ Named after the mini-fridge in the office of the authors.

Explicit Instantiation

As mentioned in section 2.4.1, header-only libraries do not scale well in size, since the required compile time grows proportionally. Any time a compilation unit needs to be compiled, all templates need to be instantiated again. The second factor that increases compile time is the following: Assume the user of the header-only library has 4 compilation units that then get linked to form the final executable. Every compilation unit requires the same template instantiations of the header-only library. The compiler has to instantiate the same set of types 4 times separately, since we could potentially link the compilation units with any other objects. Therefore every compilation unit needs to be self-contained. But if we now link these 4 units, 3 of the 4 instantiated template sets will be removed by the linker, since they are identical. The compile time for instantiating these redundant sets is wasted. In order to solve this problem, C++ allows explicit instantiation. We only provide the template declarations to the compilation units, together with a specific set of supported types the templates are instantiated for. If the user requests the instantiation with an unsupported type, a linker error will occur. These supported types will be instantiated and compiled in a further compilation unit and linked, which avoids wasting compile time. Furthermore, if there is a change in the internal implementation of a template, only this one unit needs recompiling, instead of all 4 in the scenario without explicit instantiation. If the interface changes, we always have to recompile all. The process of splitting a single header containing a template declaration and definition into two headers, declaration and definition, as well as specifying the cpp file with the explicit instantiation, is tedious and invasive if the developer of the template library wants to support this. Provided the C++ project uses cmake, frescolino offers an elegant solution for this issue. The cmake-extension fsc/explicit_inst allows to specify the explicit instantiation in the CMakeLists.txt file without any further steps required. Since cmake has full control over include paths, it can split the headers and make sure the right one is selected during compilation. Adding the additional compilation unit and linking it accordingly can also be done by cmake. We show one example line one would need to write in a CMakeLists.txt file below:

explicit_inst("main lib1 lib2" ./src/A.hpp A<int> A<char>)

The first argument, "main lib1 lib2", specifies all targets that the explicit instantiation should be linked to, i.e. if we make main, the explicit instantiation compilation unit should be linked. The second argument, ./src/A.hpp, is the source location of the template A we want to instantiate explicitly. All further arguments, A<int> A<char>, specify for which types the template A should be instantiated.
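For reference, the underlying C++ mechanism looks roughly as follows, independent of the cmake machinery; the template A is a toy example mirroring the line above:

// A.hpp -- shipped to the user's compilation units: declaration only
template<typename T>
struct A {
    T twice(T x);
};
extern template struct A<int>;   // promise: instantiated elsewhere
extern template struct A<char>;

// A.cpp -- one compilation unit owns the definition and instantiations
template<typename T>
T A<T>::twice(T x) { return x + x; }

template struct A<int>;          // explicit instantiation, compiled once
template struct A<char>;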

This cmake-extension reduces compile time in a convenient way without being invasive, which is especially useful during development.


2.4.3 giarsinom

The giarsinom library⁷ contains all algorithms required by the thesis. Since python is a more convenient language, all algorithms in C++ that use siquan were exposed to python with pybind [22] and cppimport. In order to be installable through python mechanisms only, two giarsinom-helper-modules house the necessary source code of siquan and frescolino. This way the user does not have to install any C++ libraries before installing modules of giarsinom.

2.4.4 zurkon

The zurkon library (working name)⁸ is written in python and provides a scalable framework for scientific measurements. It handles workload distribution, caching, and provenance tracking, as well as many quality of life improvements for working in python. The zurkon library will be covered in chapter 6 in more detail.

2.4.5 phd_thesis_msk

This python library contains everything needed to produce this document. If installed, running a single main.py file will run all necessary simulations, produce all figures and listings, and finally this document. Many measurement results are stored for convenience, but it is possible to reproduce this document without using any measurement data the author has created, although it will require access to a supercomputer.

⁷ Named after a mountain in the lower Engadine, Switzerland.
⁸ Named after a character from the video game series "Ratchet & Clank".


Part II: Results

3 Maxcut: Comparing Algorithms
3.1 MaxCut Generator
3.2 Exact Algorithms
3.3 Goemans Williamson MaxCut Algorithm
3.4 Comparison between GW, SA and SQA

4 Embedding Penalties
4.1 Introduction
4.2 Minor Embedding
4.3 Minor Embedding on a Chimera Graph
4.4 Parity adiabatic quantum optimization
4.5 Comparison
4.6 Conclusion

5 Fair Sampling with Quantum Annealers
5.1 Introduction
5.2 Toy Problems
5.3 Perturbation Theory
5.4 Large Scale Results
5.5 Effects of more complex drivers
5.6 Conclusions


3. Maxcut: Comparing Algorithms

In this chapter, we compare different algorithms for solving the MaxCut problem. The reason for picking this particular problem is the absence of classical algorithms that can approximate the ideal solution arbitrarily well with polynomial complexity. A quantum annealer, being an analog device, has physical limits, e.g. due to calibration, which will put practical upper limits on its performance. Due to the lack of classical algorithms that guarantee close to optimal solutions, the question of a potential advantage of quantum annealing remains interesting. We will compare a classical algorithm with a theoretical guarantee to the two algorithms SA and SQA, which offer no such guarantee when run as heuristics. The reason for running them without satisfying the conditions for a theoretical guarantee is performance. Here we are interested in solving the optimization problem fast, hence we will also look into greedy/aggressive optimizations that break fundamental prerequisites for a theoretical guarantee. The success of the heuristic will justify these changes. But due to the lack of the lower bound guarantee, careful benchmarking in the problem space of interest is required.

3.1 MaxCut Generator

In order to generate the problem instances for the algorithms to solve, we wrote a generator for the MaxCut problem. We use unweighted MaxCut instances characterized by their connection density, where 100% would correspond to an all-to-all connected graph. Given a size N_v, density p and seed s, we get an instance with N_v vertices and N_e = p ∗ N_v(N_v − 1)/2 edges. This number is rounded to the nearest integer. The seed s


initializes the random number generator that draws N_e edges from the set of all possible edges. The reason to fix the number of edges deterministically, rather than selecting each edge with probability p, is the inability of some external solvers to deal with the trivial case N_e = 0, where the solution is E = 0. For small sizes and densities, this would happen and require unnecessary special handling in algorithms we do not have access to and hence cannot fix. All edges carry an anti-ferromagnetic weight J = −1. Figure 3.1 shows a small selection of MaxCut instances. The generating function for the MaxCut instances is written in python and available in giarsinom.density_maxcut_instances.

Figure 3.1: a) Size N = 5 with density p = 0.5. b) Size N = 10 with density p = 0.4. c) Size N = 20 with density p = 0.2.
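A rough sketch of such a generator, here in C++ for illustration (the actual implementation is the python function mentioned above):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// density-based unweighted MaxCut instance; all edges get J = -1
std::vector<std::pair<int,int>> maxcut_instance(int Nv, double p,
                                                std::uint32_t seed) {
    std::vector<std::pair<int,int>> all_edges;
    for (int i = 0; i < Nv; ++i)
        for (int j = i + 1; j < Nv; ++j)
            all_edges.emplace_back(i, j);

    // Ne = round(p * Nv * (Nv - 1) / 2), drawn without replacement
    std::size_t Ne = std::lround(p * Nv * (Nv - 1) / 2.0);
    std::mt19937 rng(seed);
    std::shuffle(all_edges.begin(), all_edges.end(), rng);
    all_edges.resize(Ne);
    return all_edges;
}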

3.2 Exact Algorithms

In order to validate the quality of solutions given by any heuristic solver, we need an exact solver. Solving a problem of exponential complexity takes exponential time, hence we can only solve small problems exactly. Still, these results are a good basis to test the heuristic solvers against. We describe two ways to solve the MaxCut instance exactly.

3.2.1 Brute Force

As the name implies, "brute force" is neither an elegant nor a clever algorithm, but a simple and fast one to implement. The brute force algorithm tests all 2^N configurations if N is the number of variables in the optimization problem. Aside from getting the ground-state energy or optimal solution, we further get an exact report on the number of degenerate ground-states, i.e. the degeneracy. This can be very beneficial when analyzing secondary properties of an algorithm, like fair sampling discussed in chapter 5.

There are three easy optimizations that speed up the brute force algorithm by around a factor of 8. First, searching only half the configuration space is enough due to the spin inversion symmetry of the MaxCut problem. Second, the degrees in the MaxCut problem graph are highly heterogeneous. We want to flip the spins with high degree less often, since these spins require more work for calculating all couplings. If we think of the configuration space {−1,1}^N as a number in binary, e.g. 01000101, the simplest approach would be to count from 0 to 2^N − 1. At each step we check which spins changed compared to the last configuration and update the difference. This is done by checking all of this spin's couplings, looking up all vertices it is connected with and calculating the energy change if it flipped. We can save work by sorting the spins such that the high degree spins correspond to high bits and the ones with low degree to low bits, since the last bit flips every time and the first one only once, thus needing fewer lookups of connected vertices. Lastly, there is a better way to count from 0 to 2^N − 1 instead of incrementing. With incrementing, often more than one bit flips at the same time. Gray code (reflected binary code) always changes only one bit when changing configuration while still covering the entire space. In this way, the updates to compute are smaller, i.e. only one spin flips on any update. Table 3.1 shows the difference between incrementing in binary and gray code. While there are multiple ways of covering the entire space by always only changing one bit, we pick the strategy that also flips the lower bits exponentially more often than the higher bits. The flip occurrence at position i, where i = 1 is the highest bit, is flip_bin(i) = 2^i − 1 for binary counting, while the flip occurrence of the gray code is flip_gray(i) = 2^{i−1}. We save roughly a factor of 2 in flips.

Dec    Binary    Gray     Dec
0  →   0 0 0     0 0 0  →  0
1  →   0 0 1     0 0 1  →  1
2  →   0 1 0     0 1 1  →  3
3  →   0 1 1     0 1 0  →  2
4  →   1 0 0     1 1 0  →  6
5  →   1 0 1     1 1 1  →  7
6  →   1 1 0     1 0 1  →  5
7  →   1 1 1     1 0 0  →  4

Table 3.1: When counting in binary, more than one bit often changes at once, which never happens in gray code: only one bit changes every time the number is advanced.

This code was used to solve systems of sizes up to 30 spins, with runtimes of a few minutes.
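A minimal sketch of the gray-code enumeration (C++20, illustrative; the energy-difference update itself is omitted):

#include <bit>
#include <cstdint>
#include <iostream>

int main() {
    const int N = 4;                                // number of spins
    std::uint64_t prev = 0;                         // gray code of 0
    for (std::uint64_t n = 1; n < (1ull << N); ++n) {
        std::uint64_t gray = n ^ (n >> 1);          // binary -> gray code
        int spin = std::countr_zero(gray ^ prev);   // exactly one bit changed
        std::cout << "flip spin " << spin << "\n";  // apply dE update here
        prev = gray;
    }
}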

3.2.2 BiqMac Server

The biquadratic and maxcut solver, short BiqMac [23], is an online service. One function it provides is to solve maxcut to optimality by intersecting semidefinite and polyhedral relaxations [24]. Since there is an upper limit on compute time for any supplied instance, we solved up to N = 100 spin systems with densities up to p = 0.5 with this service.


3.3 Goemans Williamson MaxCut Algorithm

The Goemans Williamson algorithm (GW) for solving MaxCut is a classical approximation algorithm [25]. It has polynomial complexity and offers a theoretical guarantee to at least find ≈ 0.878 of the optimal cut. Approximating the cut better than ≈ 0.941 is known to be NP-hard, and if the unique games conjecture is true, no polynomial-time algorithm can beat the GW guarantee, but this is not proven yet. Many optimization algorithms sample the local gradient of the cost function to determine where to move next. For integer problems, this is not possible, since one cannot change a variable only slightly. The discrete variable changes from −1 to 1, or from one side of the cut to the other, which renders these methods useless for discrete problems. To circumvent that, the core idea is to relax the original integer σ_i ∈ {−1,1} problem

H(σ) = −∑_{i<j} J_{i,j} σ_i σ_j    (3.1)

where J_{i,j} ∈ {−1,0}. The configuration space is expanded by substituting the original integer variables σ_i with vectors on the n-dimensional unit-sphere, v_i ∈ Sⁿ. These vectors now have the ability to be changed just slightly if an algorithm requires it. It is sufficient to choose the dimension n equal to the number of spins, or fewer if there are fewer couplings than spins:

R(v₁, …, v_n) = −∑_{i<j} J_{i,j} v_i · v_j    (3.2)

where the inner product is used. This problem can now be processed with a semidefinite programming solver, which has polynomial scaling. Once we have the relaxed solution, vectors pointing in all possible directions on a unit sphere, it needs to be projected back to the original configuration space. This is done by selecting a random plane which goes through the origin (represented by its normal vector r ∈ Sⁿ) and setting:

σ_i = +1 if v_i · r < 0,   −1 else    (3.3)

This corresponds to splitting the space with the plane and thus performing a cut. All vectors on one side of the plane are on one side of the cut. In order to prevent a potentially bad random plane (normal vector r), this last step is repeated multiple times¹ and the best result selected.
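The rounding step can be sketched as follows (illustrative names; the relaxed vectors v_i are assumed to come from an external semidefinite programming solver):

#include <cstddef>
#include <random>
#include <vector>

// hyperplane rounding of eq. (3.3): project relaxed unit vectors back to spins
std::vector<int> round_cut(const std::vector<std::vector<double>>& v,
                           std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::size_t n = v.front().size();
    std::vector<double> r(n);
    for (auto& x : r) x = gauss(rng);   // gaussian components give a uniformly
                                        // random direction; normalization is
                                        // irrelevant for the sign of v_i * r
    std::vector<int> sigma;
    for (const auto& vi : v) {
        double dot = 0.0;
        for (std::size_t k = 0; k < n; ++k) dot += vi[k] * r[k];
        sigma.push_back(dot < 0 ? +1 : -1);   // eq. (3.3)
    }
    return sigma;
}

In practice this rounding is repeated many times and the best cut is kept, as it is cheap compared to the semidefinite programming step.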

3.4 Comparison between GW, SA and SQA

When comparing these algorithms, there are two main features to compare: time to reach a solution and quality of the solution found. First, we outline the exact problem family used for the comparison, followed by an analysis of quality and finally the performance comparison, taking both quality and speed into account.

¹ Here usually > 1000, since it is a cheap operation compared to the semidefinite programming solver.


3.4.1 Problems

Throughout this section, three problem categories were used, differing in edge-density p ∈ {0.1, 0.3, 0.5}. For every density, problems of sizes ranging from 5 up to 300 were generated multiple times with different seeds. The main reason for analyzing different densities is our knowledge that the algorithms should scale with the average connectivity of the problem, since checking more bonds takes more time. Going from 5 to 300 spans almost two orders of magnitude, which helps to see the scaling beyond the finite-size effects. For the quality of solution and plots involving GW, the upper size is capped at 100, since this was the possible limit due to external services.

3.4.2 Quality of Solution

First we check the quality of solution. Should one algorithm not perform well, there is no reason to check its time-performance, since even a fast bad solution is still a bad solution. Figure 3.2 concentrates on this quality. In order to quantify the quality, we need the exact solution, which was generated using BiqMac. Then we run 100 instances per size and density for GW and 100 repeated runs for each instance with SA and SQA. The simulation parameters can be found in [26, 27]. It can be noted that all box-plots in this thesis adhere to the following scheme: the box symbolizes the 1st to 3rd quartile, the whiskers represent the 5th and 95th percentile, while the center line displays the median. Figure 3.2 shows that all three algorithms have no problems coming within 95% of the optimal solution found by the exact solver BiqMac. The SA solver with the current settings performs comparatively badly for small sizes but catches up to GW for large sizes. This shows that the heuristics deliver solutions as good as GW in at least 95% (lower whisker plotted) of the cases. It is further evident that all three algorithms perform better than the ≈ 0.878 guarantee of GW. This indicates either that the algorithms are good for problems of these sizes, or, more likely, that the problems are too easy. It could be interesting to look at gaussian distributed weights on MaxCut instances. But for unweighted MaxCut, all three algorithms deliver good quality of solution, so it makes sense to compare the runtime scaling.


Figure 3.2: The cut of the solution found divided by the known optimal solution, plotted against the instance size. The columns are the different densities (0.1, 0.3, 0.5) while the rows contain the different algorithms (sa/opt, sqa/opt, gw/opt). SQA (d-f) has the best performance, almost always finding the optimal solution, while SA (a-c) finds the optimal solution less reliably with the chosen parameter settings. It should further be noted that all algorithms, including GW, are above the 87.9% worst case guarantee of GW. Every box in the box-plot corresponds to one instance size and contains 10000 measurement points for SA and SQA: 100 different instances with 100 runs per instance. For GW it contains 100 instances, since there is no benefit in rerunning the algorithm with a different seed, although many different random planes were used to improve the cut, see section 3.3.

3.4.3 Runtime Scaling for GW, SA and SQA

Having checked quality of solution, we can proceed to analyze the runtime scaling of the three algorithms. Figure 3.3 shows the corresponding runtimes on a logarithmic scale for the same data used in fig. 3.2. The GW algorithm (green) scales much worse than SA or SQA. Hence the heuristics do not just compete in quality of solution but beat GW where the scaling of runtime is concerned. This demonstrates the viability of these heuristics for MaxCut problems.

Figure 3.3: The runtime plotted against the problem size of the algorithms for the three different densities (a-c) on a double logarithmic scale. The scaling for SA and SQA is much better than for GW. It should be noted though, that the scaling for SA and SQA worsens with increasing density (more couplings to store and update) while GW is unaffected by this.

3.4.4 Quality and Runtime Scaling SA vs SQA

For a closer look between the two heuristics, we run larger sizes up to 300 to compare the scaling further. For these sizes we could not get the optimal solution via exact methods, hence only the relative quotient between the solution of SA divided by the solution of SQA is shown alongside the scaling. Nevertheless, comparing the solution and runtime is interesting to get an indication if quantum could give an advantage for this problem. Figure 3.4 compares the scaling of SQA and SA only. The slope of the linear regression shows a clear increase for both algorithms when the density increases, which is to be expected due to the higher number of couplings the algorithms have to track. Furthermore, there is a slight advantage of SQA compared to the scaling of SA. While it could be interpreted as a scaling advantage of quantum vs classical, this is most likely not the case here. Our classical algorithm implementation SA is, as mentioned, a very basic one, and much more powerful variants of simulated annealing exist [28, 29]. There are many classical annealers with better scaling. But if SQA had been worse than the simple SA implementation, the viability of SQA would need to be questioned for MaxCut instances. Furthermore, the slight curvature most likely is due to the acceptance ratios changing for larger instances. If the acceptance ratio is higher, the algorithm has more work to do, since more updates are successful. While SA performs worse for smaller instances concerning quality of solution, it is balanced for larger sizes. Sometimes SQA finds a better solution, sometimes SA. It is an indication that SQA has the potential to compare to or even beat classical annealers, but further comparison with the latest classical annealers is required.

Figure 3.4: Comparing SQA and SA for problem sizes up to 300. Plots a-c show the runtime on a double logarithmic scale, while d-f display the ratio of quality of solution of SA divided by SQA. If larger than 1, SA found better solutions and vice-versa. Fitted slopes: density 0.1: sa m = 2.26, sqa m = 1.80; density 0.3: sa m = 3.00, sqa m = 2.12; density 0.5: sa m = 3.49, sqa m = 2.20.


4. Embedding Penalties

Parts of this chapter are in the process of being published in ref. [1].

4.1 Introduction

One of the many challenges of building a Quantum Annealer to this date is the absence ofgood-long range couplers between quantum spins. For a larger general all to all connectedproblem, there is no feasible direct way of building a device to solve it. Rather, onebuilds a physical system with shorter interaction distances, i.e. only local couplings, uponwhich the logical problem is then mapped. The general idea is to solve the physicalproblem and then undo the mapping to retrieve the logical solution. One can think ofthe mapping as encoding the problem in a way that the physical problem optimum willget decoded to the logical problem solution as shown in fig. 4.1. Mapping the logicalproblem onto a physical system is also called embedding it into this physical system. Thephysical problem is always larger than the logical one and the resulting additional degreesof freedom need to be restricted with corresponding constraints in form of additionalcoupling terms. The larger size can also be harnessed to build in redundancy and errortolerance. But regardless of these advantages, embedding a system always comes at acost due to the many constraints that have to be deployed, resulting in a larger morecomplex system. The main question this chapter tries to investigate: Does the embeddingcost out-weight a potential advantage quantum annealers have? Could the current bestrealization of a quantum annealer be a simulation on classical hardware? The next threesections describe various embeddings and corresponding decoding strategies. Figure 4.2



Figure 4.1: Currently, no device exists that directly solves general logical problems with quantum annealing. One has to take the detour via a buildable physical system.

will serve as an example logical problem throughout these sections. Note that we are now

Figure 4.2: All-to-all connected logical problem with N = 8 spins.

restricted to at most 2-body couplings, i.e. we cannot solve eq. (1.6) directly with these embeddings. Further, we do not mention the local field term (1-body coupling) separately, since it can always be implemented as a 2-body term with an auxiliary spin that is fixed.

4.2 Minor Embedding

4.2.1 Encoding

In this and the following sections we use the fully-connected 8-variable problem depicted in fig. 4.2 as an example logical problem. Our focus is on the embedding overhead. As such, we do not discuss the mapping overhead introduced by reducing a k-local problem to a 2-local one. Finally, without loss of generality we assume that the problem does not have local biases (field or 1-local terms), because these can always be implemented as 2-local terms with a fixed auxiliary variable.



Minor embedding [30] reduces an all-to-all logical system to a physical system connected only locally. This is achieved by representing each logical variable as a chain of physical qubits coupled strongly enough such that the chain ideally is either up or down and not split. Figure 4.3 shows the minor embedding for the example presented in fig. 4.2. Each variable is represented by a chain of qubits of length N−1 = 7. With

Figure 4.3: Minor embedding represents e.g. the logical variable C by a chain of N−1 physical qubits. Constraints are represented by black (dark) lines. These chains are laid out such that every variable chain can connect to every other variable chain, shown by the red (light) couplers.

a careful chain layout, each variable chain has a coupler to every other variable chain, where the logical couplings (red/light color) are located. If the logical problem has a certain structure or limited connectivity, the minor embedding of the logical problem can be optimized to use as few physical qubits as possible. It should be noted that this task is computationally intensive, with an effort that grows with the number of variables. As such, it is often inefficient to require the best embedding possible, because the numerical effort required for such an embedding might vastly exceed the time needed for the actual optimization of the physical problem. The black (dark) couplers along the chains are the constraints needed to convert a group of physical qubits into one logical variable. The

Page 73: Embedding penalties for quantum hardware architectures and ...

70 Chapter 4. Embedding Penalties

constraints for each logical variable are chosen via

C_i(β, γ) = β + γ ∑_{j≠i} |J_{ij}|   (4.1)

where β is a constant determining a fixed base constraint, while γ is the so-called "sum constraint" that is multiplied with the sum of the absolute values of the couplers J_{ij} involving variable i. This allows constraining variables with more or larger couplings more strongly than variables with fewer or smaller couplings, therefore helping to not constrain the system unnecessarily. Generally, we want to constrain as little as possible¹, yet as much as necessary. If the decoding strategy can decode errors (split chains), we might even deliberately pick weak constraints and fix the errors later, if beneficial to solving the problem faster. The chain length scales linearly with the number of variables, hence the same is true for the constraints if γ ≠ 0.
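To make the constraint modeling concrete, a minimal sketch of eq. (4.1) in Python (the dict-based coupler representation and all names are illustrative, not the implementation used for the simulations):

def chain_constraint(i, couplers, beta, gamma):
    """Constraint strength C_i(beta, gamma) for the chain of logical
    variable i, following eq. (4.1): a fixed base constraint beta plus
    gamma times the summed absolute couplings involving i."""
    coupling_sum = sum(abs(J) for (a, b), J in couplers.items() if i in (a, b))
    return beta + gamma * coupling_sum

# For the antiferromagnetic triangle {(0,1): -1, (0,2): -1, (1,2): -1}
# and the parameters used later in this chapter (beta=0, gamma=1.1):
# chain_constraint(0, {(0, 1): -1, (0, 2): -1, (1, 2): -1}, 0, 1.1) -> 2.2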

4.2.2 Decoding

Once we obtain a result after optimization of the physical system, it needs to be decoded to get the solution for the logical system. If no constraints are broken, there is a trivial mapping from the chains to the logical variables. If constraints are broken, there are many decoding strategies for minor embedding. We pick a straightforward and computationally cheap strategy here: for every chain, majority voting determines the logical variable to be whatever the majority of physical qubits in the chain are.
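A minimal sketch of this majority-voting decoder (the ±1 spin values and the dict layout are illustrative assumptions; for even chain lengths a tie-breaking rule has to be chosen):

def decode_majority(chains, physical_state):
    """Majority voting: each logical variable becomes whatever the
    majority of the physical qubits in its chain are."""
    logical = {}
    for var, qubits in chains.items():
        total = sum(physical_state[q] for q in qubits)
        logical[var] = 1 if total >= 0 else -1  # ties broken towards +1
    return logical

# chains = {"A": [0, 1, 2], "B": [3, 4, 5]}
# state = {0: 1, 1: 1, 2: -1, 3: -1, 4: -1, 5: 1}
# decode_majority(chains, state) -> {"A": 1, "B": -1}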

4.3 Minor Embedding on a Chimera Graph

4.3.1 Encoding

If the physical coupler range is slightly larger, one can use a variant of the aforementioned minor embedding. It consists of small bipartite cells of size 2c for some fixed integer c. One logical variable is represented by a chain of physical qubits, but it takes fewer qubits due to the higher connectivity in a single cell. This embedding is of interest because a c = 4 chimera graph is the basic building block of current quantum annealing hardware. Figure 4.4 shows the minor embedding on a c = 4 chimera graph for the example shown in fig. 4.2. The cells on the bottom and right contain the same c qubits, while all other cells (in this example only one) contain 2c unique qubits. A chimera c = 1 graph is identical² to the minor embedding. Constraints are then handled as in the generic minor embedding procedure.

4.3.2 Decoding

As for minor embedding, we use majority-voting decoding. The logical variable state is determined by the majority of states of the physical qubits belonging to that logical variable.

¹ This constraint modeling can be improved. Constraints closer to the chain end can be smaller, since they need only be stronger than the weaker part of the chain.

² The only difference is that the chain in the c = 1 chimera contains one more unnecessary spin.



Figure 4.4: c = 4 chimera graph consisting of a minor-embedding-like chain structure (black/dark lines) combined with bipartite graph cells. These cells contain the logical couplers (red/light lines).

4.4 Parity adiabatic quantum optimization

An alternative way to map the logical problem was recently proposed in Ref. [31]. The parity adiabatic quantum optimization (PAQO) or Lechner-Hauke-Zoller (LHZ) scheme encodes the logical problem fundamentally differently. Instead of having the notion of logical variables in the physical system, each physical variable encodes the product of two logical variables, α_{i,j} = σ_i σ_j, i.e., it stores whether they are equal or opposite. The main advantage of this approach is the mapping from logical couplings to physical local biases, which can be controlled much better than physical couplers. Figure 4.5 shows the PAQO embedding for the example in fig. 4.2.

The constraints are modeled as 4-local terms³ between 4 variables that form a tile. Fixed variables that are positive can be introduced at the lower-right edge to complete tiles (not shown in fig. 4.5). Because every logical variable appears twice in a constraint term, the term must be 1, i.e., for the example in fig. 4.2,

α_{AH} α_{AG} α_{BH} α_{BG} = σ_A σ_H σ_A σ_G σ_B σ_H σ_B σ_G = 1,   (4.2)

³ In the absence of a 4-body coupling term, these constraints can be broken down into 2-body coupling terms by introducing further auxiliary spins.



Figure 4.5: In the PAQO embedding, a physical variable α_{i,j} is the product of two logical variables σ_i σ_j. The logical couplers correspond to physical local biases (red). Constraints are enforced with 4-local-term interactions (black crosses) between neighboring spins.

where α denotes physical variables (qubits) and σ denotes logical variables. This means every tile needs an even number of negative physical variables.
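This parity condition of eq. (4.2) is easy to check per tile; a sketch (the dict of ±1 physical variables keyed by logical pairs is an illustrative representation):

def tile_satisfied(alpha, tile):
    """Eq. (4.2): the product of the four +/-1 physical variables on a
    tile must be +1, i.e. the tile holds an even number of negative
    physical variables."""
    product = 1
    for pair in tile:
        product *= alpha[pair]
    return product == 1

# alpha = {("A","H"): 1, ("A","G"): -1, ("B","H"): 1, ("B","G"): -1}
# tile_satisfied(alpha, [("A","H"), ("A","G"), ("B","H"), ("B","G")]) -> True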

4.4.1 Constraints

The constraints for the PAQO embedding need to be selected more carefully because a safe lower bound for the constraint strength is not immediately evident. Figure 4.6 shows an example where allowing one broken constraint lowers the energy by E = (1/56)N²

where N is the logical system size. This number can be found by minimizing the area of unsatisfied couplings shown in fig. 4.6c and fig. 4.6e in red and calculating the difference. This lower bound dictates that the constraint grows at least as fast in order not to be broken in the physical ground-state. Because this is just one example, it is not a safe theoretical lower bound for an arbitrary problem. As such, calculating the needed minimal constraints is more difficult than for minor embedding. Figure 4.7 shows a concrete example of a logical system of size N = 10, for which the physical ground-state is shown depending on the constraint strength β of the tiles. It implements couplings identical to fig. 4.6. As expected, we observed the necessity to continuously increase β when increasing N in the simulations to get a physical ground-state without broken



[Figure 4.6 panels: a) unconstrained, b) constrained enough, c) unsatisfied couplings, d) c and f superimposed, e) constrained too little, f) unsatisfied couplings]

Figure 4.6: Example to analyze constraint scaling. The logical couplers are set to J_{i,j} = 1 if two variables satisfy i + j < N, and J_{i,j} = −1 else. This is encoded in the local fields on α_{i,j} in the embedded system. (a) If there are no constraints, the physical variables align with the local field terms, up (orange) or down (blue). Panel (b) shows the ground state of the physical system if the constraints are strong enough, i.e., no constraints are broken. The first 75% of logical variables are parallel to each other (larger orange triangle). This block is anti-parallel to the remaining 25% of the variables (blue). These remaining 25% of the variables are again parallel to each other (smaller orange triangle). (c) The red area depicts unsatisfied physical variables in the true ground state and is the overlap between the configurations shown in panels (a) and (b). Panel (e) shows that the constraints are slightly too small and there is one broken constraint, displayed as a red circle. The ground state of this physical system does not directly correspond to a logical state any longer. Panel (f) shows the number of unsatisfied physical variables when constraints are too weak, and is the overlap between the configurations in panels (a) and (e). Panel (d) shows the unsatisfied variables for both scenarios for better comparison. The area difference is ∆A = (1/56)N².

constraints. This is in agreement with the theoretical example shown in fig. 4.6. While there might be problems that require only weaker constraint scaling, there are cases that require some constraints to scale ∝ N². The average of the constraints might scale with a smaller exponent, however. Still, if the problem is unknown, there is no choice but to




Figure 4.7: A concrete example of an N = 10 spin system, embedded in PAQO. a) shows the ground-state for the constraint strength B_C ≡ β = 0. There are no unsatisfied spins (red circles), but 10 broken constraints (red crosses). The energy is calculated as E = −45 + 2·#(red circles) + 2β·#(red crosses). b) Increasing β to 1 removes 8 broken constraints in favor of 5 local field alignments. The new minimal energy is E = −31. c) shows the same situation as in b), but the energy increased due to the stronger constraint. d) One further constraint was satisfied by adding two more unsatisfied spins, which is energetically favorable at β = 3. e) and f) show that a constraint strength β > 3 is sufficient to eliminate all broken constraints and yield the logical ground state energy of E = −25.

scale the constraints according to the worst-case scenario, i.e., ∝ N². An alternative is to set the constraints, perform the optimization and then verify if any of the constraints are broken. If so, increase the value and iterate. This, however, can be a costly exercise.
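A sketch of this iterate-until-satisfied scheme (embed_paqo, anneal and broken_constraints are hypothetical placeholders for the actual embedding, annealing and verification routines):

def anneal_with_constraint_search(problem, beta0, factor=2.0, max_iter=10):
    """Set the constraints, optimize, verify; if any constraint is
    broken, increase the strength and repeat."""
    beta = beta0
    for _ in range(max_iter):
        physical = embed_paqo(problem, beta)   # hypothetical embedding routine
        state = anneal(physical)               # hypothetical annealer call
        if not broken_constraints(physical, state):
            return state, beta
        beta *= factor  # strengthen and re-anneal: potentially costly
    raise RuntimeError("constraints still broken after max_iter anneals")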

4.4.2 Decoding

The PAQO embedding shares similarities with a low-density parity code, which can be decoded using belief propagation, as also proposed by Ref. [?] for this embedding. The original choice of constraints is not unique in any way; there are many combinations of physical variables that form loops, meaning that each logical variable is represented an even number of times, leading the expression to be 1. Belief propagation uses this fact and creates, for example, all possible 3-variable loops, e.g., α_{AB} α_{AC} α_{BC} = 1, and



determines which of the physical variables are most likely to be wrong in a broken constraint. Figure 4.8 shows the probability of decoding a state correctly given a certain error probability p. Each correctly encoded physical bit is flipped with probability p and we then try to decode the correct state. This reproduces the findings of ref. [32]. While the random bit-flip error resilience of the PAQO scheme might be very useful in

[Figure 4.8 plot: logical error (0.0–1.0) vs. logical problem size (10–50) for flip probabilities p = 0.01, 0.1, 0.2, 0.3, 0.4]

Figure 4.8: The larger a system, the more 3-spin loops are available to correct for random flip errors with probability p. For each point, 1250 different random seeds for the initialization of the state and the error flips were used.

an implementation, it cannot correct for single broken constraints as shown in fig. 4.6. Belief propagation will decode the wrong logical state if started from fig. 4.6e compared to 4.6b. Hence, reducing the constraints is undesirable.

4.5 Comparison

We now compare the different embedding schemes using the time to solution (tts) [33] for quantum annealing, with the goal of determining the cost overhead compared to a direct optimization of the logical problem. As benchmark problem and proxy for real-world applications we use unweighted MaxCut instances characterized by their connectivity density ρ, where 100% corresponds to an all-to-all connected graph. Given a graph of size N, density ρ, and seed s, we generate an instance with N vertices and N_e = ρ N(N−1)/2 edges, rounded to the nearest integer. The seed initializes the random number generator that draws the N_e random edges from the set of all possible edges. All edges have an antiferromagnetic weight, i.e., J = −1.
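A minimal sketch of this instance generation (illustrative, using Python's random module rather than the generator actually used for the benchmarks):

import random
from itertools import combinations

def maxcut_instance(N, density, seed):
    """Draw Ne = round(density * N(N-1)/2) distinct edges uniformly at
    random; every edge gets the antiferromagnetic weight J = -1."""
    rng = random.Random(seed)
    all_edges = list(combinations(range(N), 2))
    Ne = round(density * len(all_edges))
    edges = rng.sample(all_edges, Ne)
    return {edge: -1 for edge in edges}

# maxcut_instance(25, 0.3, seed=0) -> dict with 90 edges, all J = -1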



4.5.1 Time to Solution

In order to compare the different methods, we use the time to solution, as defined in the following. When annealing, the annealing duration T is important. Figure 1.6 from section 1.2.6 shows that we leave the ground-state if this duration is chosen too short. On the other hand, setting T too large will waste compute time. Hence, we introduce the probability to find the ground-state, p_gs(T). Since this is a nonlinear function, it can be better to repeat the annealing procedure multiple times with low p_gs instead of running it once with a large success probability p_gs. If we set a certain target probability p_tar with which we want to find the ground-state, we can calculate the necessary number of repetitions r_{p_tar}(T):

r_{p_tar}(T) = log(1 − p_tar) / log(1 − p_gs(T))   (4.3)

The time to solution (tts) required until the solution is found with probability p_tar is given by:

tts_{p_tar}(T) = T · r_{p_tar}(T)   (4.4)

In order to have a fair comparison, we run every algorithm with many different anneal durations and pick the minimal tts as follows:

tts^{opt}_{p_tar} = min_T ( tts_{p_tar}(T) )   (4.5)

We use a simple grid search to determine the minimum, because tts_{p_tar}(T) for a given target probability is typically a convex function of the anneal time T, but better optimization algorithms could be used to find the minimum faster.
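A sketch of eqs. (4.3)–(4.5) (assuming p_gs(T) has already been estimated for each duration on the grid; names are illustrative):

import math

def tts(T, p_gs, p_tar=0.9):
    """Eqs. (4.3)-(4.4): anneal duration times the repetitions needed
    to reach the target probability p_tar."""
    if p_gs <= 0:
        return math.inf      # ground state never found at this duration
    if p_gs >= p_tar:
        return T             # a single run already suffices
    return T * math.log(1 - p_tar) / math.log(1 - p_gs)

def tts_opt(anneal_times, p_gs_estimates, p_tar=0.9):
    """Eq. (4.5) via grid search: minimal tts over all durations T,
    with p_gs_estimates[T] the measured ground-state probability."""
    return min(tts(T, p_gs_estimates[T], p_tar) for T in anneal_times)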

4.5.2 Scaling Comparison

First we compare the tts for finding the physical ground-state. As mentioned, depending on the decoding scheme it could be possible that some broken constraints are tolerable. If no constraint is broken, decoding is trivial. To test the different embeddings, we run a transverse-field simulated quantum annealing [13, 34, 35, 36, 3, 37] algorithm first for the native MaxCut problems outlined above, and second for the three embeddings, to compare the time to solution required to find the ground-state with a probability of at least 90% (tts^{opt}_{0.9}). The MaxCut instance sizes are N ∈ {10, 15, 20, 25} spins with densities ρ ∈ {0.3, 0.5}. For every combination of N and ρ, we generate 5 different instances. For every instance, we average tts^{opt}_{0.9} over 5 different runs of the algorithm. Figure 4.9 shows how y = tts^{opt}_{0.9} grows with the logical system size N. While the logical sizes only go up to N = 25, the scaling is clearly different between the embedded algorithms and the direct QA. The slight curvature might stem from finite size effects and should start diminishing at larger sizes. A nice example of the finite size effect can be seen in the previous chapter, fig. 3.3. The main mechanism for the minor embedding cost might lie in the reduced tunneling rate of the physical chains, since they grow with



[Figure 4.9 plot: tts^{opt}_{0.9} [arbitrary units] (10³–10⁶, log scale) vs. logical system size (10–25) for qa, minor, chimera, paqo; panel a) density = 0.3, panel b) density = 0.5]

Figure 4.9: Shows tts^{opt}_{0.9} on a logarithmic scale for all four methods for two different connectivity densities: direct quantum annealing (QA), minor embedding, minor embedding on a chimera c = 4 graph, and the PAQO embedding. Every point consists of 5 different instances, for which each algorithm is run with 5 different starting seeds. p_gs is determined by averaging over these 5 algorithm runs. The reason for the absence of the largest system size for PAQO is the inability to find the ground-state due to insufficient constraint scaling.

the system size. For the PAQO embedding, the fast-growing constraints might cause the scaling. This remains to be analyzed theoretically.

Figure 4.10 shows x = tts^{opt}_{0.9} of the direct simulation without embedding vs. y = tts^{opt}_{0.9} of the three embeddings. Fitting log(y) = m·x + c to the medians with the least-squares method yields the fitted line. For minor and minor on chimera, we use β = 0, γ = 1.1, while for PAQO, β = N²/50, γ = 1.1 is used as constraint modeling. The exact simulation parameters can be found in [38, 39, 40, 41].

4.5.3 Theoretical Analysis

The following section provides reasoning for how an exponential penalty could be understood. It first establishes the necessity for spin flips during the anneal process.



[Figure 4.10 plot: tts^{opt}_{0.9} of minor, chimera, paqo (10³–10⁶, log scale) vs. tts^{opt}_{0.9} direct (200–500); panel a) density = 0.3, panel b) density = 0.5]

Figure 4.10: Shows y = tts^{opt}_{0.9} of the three embeddings vs. x = tts^{opt}_{0.9} of the direct simulation on a logarithmic scale for two different connectivity densities. The bottom bars with blue medians represent tts^{opt}_{0.9} direct, for comparison of the less important pre-factor. Every bar consists of 5 different instances, for which the corresponding algorithm is run with 5 different starting seeds. tts^{opt}_{0.9} for each instance is determined by averaging over these 5 algorithm runs. The fit is performed by least squares for log(y) = m·x + c according to the medians. The versions with embedding seem to scale exponentially worse. The reason for the absence of the largest system size for PAQO is the inability to find the ground-state, most likely due to insufficient constraint scaling.

Avoided Level Crossings

In hard optimization problems, avoided level crossings [43] arise in the annealing spectrum, since the driver might initially favor certain configurations that become unfavorable once the driver is weak enough. To remain in the instantaneous ground-state and follow the adiabatic anneal path, spins have to flip. Problems where the initially horizontal spins (due to the transverse field driver) only move towards up or down and do not need to flip during the anneal are much easier to solve in contrast. We include a 3-spin example [42] with its spectrum in fig. 4.11(a-b) that displays such an avoided crossing at t = 0.92, where T is the annealing time and t the progress in the annealing schedule. The gap in fig. 4.11(c-d) shrinks linearly to near-zero and grows again. The instantaneous ground-state, displayed as computational basis probabilities in fig. 4.11(e-f), clearly shows that the first two spins flip from up (p_up ≈ 1) to down (p_down ≈ 1), while the third spin changes from a superposition



[Figure 4.11 plot: a) spectrum, b) avoided level crossing, c) energy gap E₁ − E₀, d) gap in log-scale, e) GS composition probabilities for the states ↑↑↑, ↑↑↓, ↓↓↓ vs. annealing progress t, f) zoom on t = 0.88–0.94]

Figure 4.11: Spectrum, gap and instantaneous ground-state (GS) for a 3-spin ferromagnetic chain with individual local fields [42]. a) shows the spectrum of a transverse field anneal, starting with the pure transverse field at t = 0 and transforming to the pure problem Hamiltonian at t = 1 in a linear manner. b) highlights the avoided level crossing between the two lowest states. c-d) show the first energy gap, with a highlight in log-scale. The gap reaches ∆ ≈ 10⁻³ at its smallest position. e) The instantaneous ground-state is shown by the probabilities in the computational basis, of which we only label the 3 dominant parts and display the remaining 5 as black lines. f) shows the sudden rise and fall of the GS components, representing a flip in the first two spins and changing the third spin from a superposition to down.

to down. Figure 4.12(a-d) shows the spins required to flip if one flips the logical spin marked C. For minor and minor on chimera, this corresponds to a ferromagnetic chain that grows linearly with the system size N. In PAQO, we also require a number of flips that grows linearly with N. The physical spins are connected by 4-local terms and hence form a chain structure as well, as seen in fig. 4.12(c). Any embedding needs to accommodate the flips of logical spins during the anneal, ideally in an efficient manner.




Figure 4.12: a) Fully-connected problem with 6 variables labeled "A"–"F". Couplers between the variables are represented as solid (red) lines. b) Minor embedding represents e.g. the logical variable C by a chain of N−1 physical qubits. Constraints are represented by black (dark) lines. These chains are laid out such that every variable chain can connect to every other variable chain, shown by the red (light) couplers. c) In the PAQO embedding, a physical variable α_{i,j} is the product of two logical variables σ_i σ_j. The logical couplers correspond to physical local biases (red). Constraints are enforced with 4-local-term interactions (black crosses) between neighboring spins. d) c = 3 chimera graph consisting of a minor-embedding-like chain structure (black/dark lines) combined with bipartite graph cells. These cells contain the logical couplers (red/light lines).

Exponential Penalty

Due to the chain(-like) nature of the physical spins representing a logical spin, all three embeddings are subject to the fact that the gap of a ferromagnetic chain in a weak field is exponentially suppressed [35] with system size N. More precisely, [35] showed the gap to be ∆ ≈ Γ^N / J^{N−1}, where Γ is the field strength and J the ferromagnetic coupling of the chain, and connected this to the tunneling rate for flipping the chain, which determines the dynamics and hence the required annealing time T. This suppression is the key reason why these embeddings are expected to suffer an exponential penalty. Furthermore, the



constraint-to-logical-coupling ratio worsens with growing chains, since longer chains require stronger constraints to prevent breaks. For our dense MaxCut instances, the constraints grow with N too, making the denominator of order N^{N−1}.

Recovery with Decoding

In the previous analysis, we only tested whether the physical system reaches its ground-state, which seems an excessively strict requirement considering that good decoding algorithms might fix errors during decoding. There might be physical states not in the GS that decode to the logical GS we are looking for. In light of the previous section, however, the core problem was identified as the inability of the chains to flip efficiently; hence the decoding algorithms should not be able to improve tts^{logical}_{p_tar} for the logical GS. We ran multiple constraint schemes and confirmed that tts^{logical}_{p_tar} behaves like tts^{physical}_{p_tar}, as expected; see fig. 4.13 for one example. If a decoding algorithm can nevertheless retrieve the logical GS from a clearly wrong physical GS, it has to be a potent solver on its own, with corresponding complexities.

4.6 Conclusion

With the scaling being exponentially worse for these planar embeddings compared to non-embedded QA, it is likely that the embeddings cannot outperform classical simulations on general all-to-all connected problems. The limit of short-range local couplers needs to be overcome in order to build a more powerful quantum annealer. But currently, QA as a simulation on classical hardware (SQA) seems to outperform the embeddings in terms of scaling.



[Figure 4.13 plot: tts^{opt}_{0.9} [arbitrary units] (10³–10⁶, log scale) vs. logical system size (10–35) for qa, minor, chimera; panel a) density = 0.3, panel b) density = 0.5]

Figure 4.13: Shows tts^{opt}_{0.9} on a logarithmic scale for three methods for two different connectivity densities: direct quantum annealing (QA), minor embedding, and minor embedding on a chimera c = 4 graph. Every point consists of 5 different instances, for which each algorithm is run with 5 different starting seeds. p_gs is determined by averaging over these 5 algorithm runs. The scaling for the decoded state behaves similarly to fig. 4.9, i.e. the decoding algorithms cannot substantially speed up the anneal by recovering errors.


5. Fair Sampling with Quantum Annealers

Parts of this chapter were previously published in ref. [2].

Recently, it was demonstrated both theoretically and experimentally on the D-Wave quantum annealer that transverse-field quantum annealing does not find all ground states with equal probability. In particular, it was proposed that more complex driver Hamiltonians beyond transverse fields might mitigate this shortcoming. Here, we investigate the mechanisms of (un)fair sampling in quantum annealing. While higher-order terms can improve the sampling for selected small problems, we present multiple counterexamples where driver Hamiltonians that go beyond transverse fields do not remove the sampling bias. Using perturbation theory we explain why this is the case. In addition, we present large-scale quantum Monte Carlo simulations for spin glasses with known degeneracy in two space dimensions and demonstrate that the fair-sampling performance of quadratic driver terms is comparable to standard transverse-field drivers. Our results suggest that quantum annealing machines are not well suited for sampling applications, unless post-processing techniques to improve the sampling are applied.

5.1 Introduction

5.1.1 Fair Sampling

Quantum annealing (QA) [9, 10, 11, 12, 13, 14, 15, 16, 17] is a heuristic designed to harness the advantages of quantum mechanics to solve optimization problems. The performance of QA and, in particular, of QA machines such as the D-Wave Systems Inc. devices is controversial to date [44, 45, 46, 47, 48, 33, 49, 50, 51, 52, 53, 54, 55, 48, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65]. Most studies have focused on finding the minimum value of a binary quadratic cost function (problem Hamiltonian), yet less on the variety of solutions obtained when repeating the optimization procedure multiple times. Important applications that rely on sampling, such as SAT-based probabilistic membership filters [66, 67, 68, 69], propositional model counting and related problems [70, 71, 72], or machine learning [73, 74], rely on ideally uncorrelated states. This sought-after fair sampling ability of an algorithm, i.e., the ability to find (ideally all) states associated with a cost function with (ideally) the same probability, is thus of importance for a variety of applications. Moreover, the ability of an algorithm to sample ground states with similar probability is directly related to its ergodicity, which strongly influences the efficiency of optimization and sampling techniques.

Following small-scale studies [75], Ref. [76] recently performed systematic experiments on the D-Wave 2X annealer. The results demonstrated that quantum annealers using a transverse-field driver are biased samplers, an effect also observed in previous studies [47, 48, 77]. Matsuda et al. [75] conjectured that more complex drivers might alleviate this bias, something we test in this work.

5.1.2 Driver

Binary optimization problems can be mapped onto k-local spin Hamiltonians. Without loss of generality we study problem Hamiltonians with N degrees of freedom in a z-basis of the form

H_P = −∑_{i,j=1}^{N} J_{ij} σ^z_i σ^z_j ,   (5.1)

where σ^z_i is the z-component of the Pauli operator acting on site i. Note that local biases can also act on the variables. For such a problem Hamiltonian, in principle, a driver of the form

H_{x,N} = ∑_{n=1}^{N} Γ_{x,n} [⊗σ^x]_n   (5.2)

would induce transitions between all states and therefore ensure fair sampling, provided the anneal is performed slowly enough. Unfortunately, such a driver is hard to engineer and, at best, one can expect drivers of the form H_{x,2} = −∑_j Γ_x σ^x_j + ∑_{j,k} K^x_{j,k} σ^x_j σ^x_k. Quantum fluctuations are induced by the driver and then reduced to sample states from the problem Hamiltonian, i.e., H(t) = (1 − t/T) H_{x,n} + (t/T) H_P, where t ∈ [0,T], T is the annealing time, and n the order of the interactions in the driver. For an infinitely-slow anneal, the adiabatic theorem [10, 78] ensures that for t = T a (ground) state of the problem Hamiltonian is reached. It is therefore desirable to know if, after an infinite number of repetitions, the process results in all minimizing states, i.e., fair sampling.



5.2 Toy Problems

Here we analyze the behavior of more complex drivers of the form H_{x,n} (n > 1) on the fair sampling abilities of QA. Following Ref. [75], we first study small systems where the Schrödinger equation can be integrated using QuTiP [79]. We have exhaustively analyzed all possible graphs with up to N = 6 with both ferromagnetic and antiferromagnetic interactions and show in Fig. 5.1 paradigmatic examples that illustrate different scenarios using drivers with n ≤ 2. Even for these small instances, in some cases the inclusion of higher-order driver terms does not remove the bias. If we anneal adiabatically, i.e., with T large enough, the instantaneous ground state is never left, which means towards the end of the anneal, at T − λ (for a small λ > 0), the system is in the ground state of H(T − λ). This observation is key to predicting the sampling probabilities for different degenerate ground states. These probabilities are given by squaring the amplitudes of the lowest eigenvector of H(T − λ), assuming for now that the small contribution from the driver lifts the degeneracies. Because H(T − λ) can be viewed as H_P perturbed by H_{x,n}, we analyze fair sampling using a perturbative approach [80]. To better quantify the fair-sampling behavior of a given system, we use the term "hard suppression" (i.e., total suppression) if the sampling probability is 0 for a particular ground-state configuration at the end of the anneal, and the term "soft suppression" if a particular state is undersampled by a certain finite fraction in comparison to other minimizing configurations. Finally, we complement these studies with quantum Monte Carlo simulations for large two-dimensional Ising spin-glass problems following Ref. [76] and discuss the effects of higher-order drivers. Our results show that QA is not well suited for sampling applications, unless post-processing techniques are implemented [81].



[Figure 5.1 plots: four toy-problem graphs (a)–(d) with couplings of strength ±1, ±2, and p_GS vs. annealing time T for the drivers H_{x,1} and H_{x,2}, showing the degenerate ground-state configurations and their total probability]

Figure 5.1: Toy problems with up to N = 6 variables and both ferromagnetic (solid lines) and antiferromagnetic (dashed lines) interactions of different strength (thickness of the lines), integrated using QuTiP [79]. Data for both transverse field (H_{x,1}) and quadratic (H_{x,2}) drivers are shown. The data show the instantaneous probability p_GS to find different states that minimize the cost function (up to spin-reversal symmetry) as a function of annealing time T. In all cases, the first spin (labeled 0) is in the state |↑⟩. (a) Toy problem studied in Ref. [75], where drivers with n = 2 sample the states fairly. (b) Similar behavior to panel (a); however, the unfair sampling sets in earlier in the anneal. (c) Even the inclusion of n = 2 drivers does not remove the unfair sampling. As in (a) and (b), at least one state is suppressed. Note that a driver Hamiltonian with n = 4 results in fair sampling. (d) The sampling is not exponentially biased. However, one state occurs twice as likely as the others. Note that the sampling probabilities swap when going from n = 1 to n = 2.

5.2.1 Graph Generation

As described in the introduction, we started the investigation with the toy problem of ref. [76]. In order to find out if the behavior of this problem is generic or a special case, we needed to generate all possible graphs of comparable sizes and connectivities. This section describes how these graphs were generated efficiently.

Efficient Generation

We want to generate all non-isomorphic graphs of a given size and connectivity. Two graphs are isomorphic if one can be changed into the other by only relabeling the vertices. While it is possible to combinatorially generate all graphs and then delete the ones isomorphic to each other, this check quickly becomes expensive. A smarter generation is necessary to avoid unreasonable runtimes for even small sizes (5–7 vertices). By identifying criteria which prevent graphs from being isomorphic to each other, one can use those for the generation.

Let us define the following terms:
• order: The number of vertices in the graph
• edgesize: The number of edges in the graph
• colors: The number of possible different weights the edges can have
• graphic number: The sorted list of degrees

A few examples of the notation are given in fig. 5.2. If one of the following criteria holds,

[Figure 5.2 plots: five example graphs with their per-vertex outgoing-edge bar plots, colored by edge]

Figure 5.2: Graphical notation for describing graphs. The bar plot on the bottom row shows the number of edges for each vertex, colored according to the edge. For the first and second example, the order is 4, while the other examples have order 5. The edgesize for the examples is 4, 5, 5, 7 and 7. Aside from the second example, which has 2, all have 3 colors. The graphic number for the first example would be [2,2,2,2] and for the last example [4,3,3,3,1].

the graphs are not isomorphic:
• different order
• different edgesize
• different colors
• different graphic number
• different graphic number restricted to a specific color.



The last point can be understood as follows. Since we have different colors, we can think of the graph as an overlay of different monochromatic graphs. For each color, we can apply the criteria above to determine if two graphs are not isomorphic. It is important to note that this is not an exhaustive test. If two graphs meet none of these criteria, this does not imply that they are isomorphic, see fig. 5.3.
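A sketch of these invariant checks (graphs represented as an illustrative dict mapping edge tuples to a color; equal invariants do not prove isomorphism, they only fail to rule it out):

from collections import Counter

def invariants(graph):
    """Order, edgesize, colors, graphic number, and the per-color
    graphic numbers of a colored graph given as {(i, j): color}."""
    def graphic_number(edges):
        deg = Counter(v for e in edges for v in e)
        return tuple(sorted(deg.values(), reverse=True))
    vertices = {v for e in graph for v in e}
    colors = sorted(set(graph.values()))
    per_color = tuple(
        (c, graphic_number([e for e, col in graph.items() if col == c]))
        for c in colors)
    return (len(vertices), len(graph), len(colors),
            graphic_number(graph), per_color)

def certainly_not_isomorphic(g1, g2):
    """True if any invariant differs; False only means 'maybe isomorphic'."""
    return invariants(g1) != invariants(g2)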

[Figure 5.3 plots: four example graphs of order 6 with their per-vertex outgoing-edge bar plots]

Figure 5.3: All examples here have order 6 and edgesize 8 with 2 colors. The graphic number [3,3,3,3,2,2] as well as the graphic numbers for red [2,2,2,2,2,2] and blue [1,1,1,1] are identical. But the first three examples are not isomorphic to each other, while the third and fourth are isomorphic.

Graphic Numbers

The graphic number has a certain structure, since not every sequence of numbers describes the number of edges out of the vertices of a graph. For example, the sum has to be even, otherwise there is an outgoing edge with no partner. Formally, this structure is given by the Erdős–Gallai theorem, which states that a non-increasing sequence (x_1, . . . , x_n) with even sum can be the list of degrees of vertices in a graph if and only if:

∑_{i=1}^{k} x_i ≤ k(k−1) + ∑_{i=k+1}^{n} min(k, x_i)   ∀ 1 ≤ k ≤ n   (5.3)
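A direct transcription of eq. (5.3), plus the even-sum check, as a minimal sketch:

def is_graphic(degrees):
    """Erdos-Gallai test: can `degrees` be the degree sequence of a
    simple graph?"""
    xs = sorted(degrees, reverse=True)
    n = len(xs)
    if sum(xs) % 2 != 0:   # every edge contributes 2 to the degree sum
        return False
    for k in range(1, n + 1):
        lhs = sum(xs[:k])
        rhs = k * (k - 1) + sum(min(k, x) for x in xs[k:])
        if lhs > rhs:      # eq. (5.3) violated
            return False
    return True

# is_graphic([2, 1, 1]) -> True;  is_graphic([2]) -> False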

Below we list the first 20 graphic numbers; on the left the number of edges and on the right the list of degrees for the vertices as graphic number:

1 → 1 1
2 → 2 1 1
2 → 1 1 1 1
3 → 3 1 1 1
3 → 2 2 2
3 → 2 2 1 1
3 → 2 1 1 1 1
3 → 1 1 1 1 1 1
4 → 4 1 1 1 1
4 → 3 2 2 1
4 → 3 2 1 1 1
4 → 3 1 1 1 1 1
4 → 2 2 2 2
4 → 2 2 2 1 1
4 → 2 2 1 1 1 1
4 → 2 1 1 1 1 1 1
4 → 1 1 1 1 1 1 1 1
5 → 5 1 1 1 1 1
5 → 4 2 2 1 1
5 → 4 2 1 1 1 1

By subtracting one graphic number from another, we obtain a graphic number again. This can be understood as removing a subgraph from a graph: the remaining graph is still a valid graph and hence has a graphic number associated with it. Since there are multiple subgraphs to remove, subtracting graphic numbers can be done in different ways, too. But one cannot just randomly subtract entries; e.g., if we want to remove [1,1] from [2,1,1], we cannot just cancel out the ones, remaining with [2], since this is not a graphic number¹. For [3,2,2,1] we can subtract [1,1] in three different ways, with [3,1,1,1], [2,2,2] or [2,2,1,1] remaining, as shown in the sketch below.
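A brute-force sketch of such a subtraction, reusing is_graphic from above (vertex assignments are enumerated exhaustively, which is fine for the small sizes considered here):

from itertools import combinations, permutations

def subtract_ways(main, sub):
    """All distinct graphic numbers obtainable by subtracting the
    degree list `sub` from `main` entry-wise and dropping zeros."""
    results = set()
    for pos in combinations(range(len(main)), len(sub)):
        for perm in set(permutations(sub)):
            rest = list(main)
            for p, d in zip(pos, perm):
                rest[p] -= d
            if min(rest) < 0:
                continue
            remaining = tuple(sorted((x for x in rest if x > 0), reverse=True))
            if is_graphic(list(remaining)):
                results.add(remaining)
    return results

# subtract_ways([3, 2, 2, 1], [1, 1])
# -> {(3, 1, 1, 1), (2, 2, 2), (2, 2, 1, 1)}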

Algorithm

In order to generate all graphs, a generator first generates the main graphic number. Next, all graphic numbers that fit within the main number (i.e. can be subtracted from the main number, leaving a graphic number) are generated. For each of those, again, all graphic numbers that fit it are generated, until the process has been repeated colors many times. This can be visualized with fig. 5.2: first, all colors are the same; next, the columns of the bar plot are split into two colors, s.t. both individually are graphic numbers, and so on. To save more time, the color distribution is generated in such a way that the number of edges of the first color is always larger than or equal to the second, and each following color has equal or fewer edges than the previous one. It is very cheap to just permute the colors later and remove the very few isomorphic copies generated in the process. These isomorphic copies happen, e.g., in [2,1,1] where one edge is red and the other blue, and permuting the colors yields the same graph again. The number of isomorphically independent graphs is minuscule compared to a simple combinatorial construction of all graphs. In order to

¹ Although [2] is a valid degree list if we allow loops, edges which start and end at the same vertex.



deal with the remaining isomorphisms (see fig. 5.3), we use the isomorphism algorithm from the networkx package [82]. Figure 5.4 shows all graphs for order 4 with 2 colors.
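A sketch of that final check (assuming the graphs are networkx graphs whose edges carry a 'color' attribute; the attribute name is our choice here):

import networkx as nx
from networkx.algorithms import isomorphism as iso

def colored_isomorphic(g1, g2):
    """Full isomorphism test with networkx [82], requiring edge colors
    to match under the vertex relabeling."""
    edge_match = iso.categorical_edge_match("color", None)
    return nx.is_isomorphic(g1, g2, edge_match=edge_match)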


Figure 5.4: All graphs which are isomorphically independent for order 4, with edgesize 0 to 6 and 2 colors. One can see the repetition due to color inversion, which the algorithm only generates at the end, since these can easily be generated by swapping colors on a reduced (minimal) set of graphs.



5.3 Perturbation Theory

In the following, we show how to determine the sampling probabilities, as well as the influence the driver has on them. In short, if we apply H_D as a perturbation of strength λ to H_P, some degeneracies will be lifted, i.e. the perturbed ground-state space is smaller. The ground-state space is never left during an adiabatic anneal, hence it will not be possible to reach the entire ground-state space of the unperturbed Hamiltonian by annealing in the generic case. This analysis holds for any driver Hamiltonian H_D, not just the stoquastic H_{x,n}-type drivers we use in this work. In non-degenerate perturbation theory, the first-order corrected wave function |n⟩ is given by

|n⟩ = |n⁰⟩ + λ ∑_{m≠n} ( ⟨m⁰|H_D|n⁰⟩ / (E⁰_m − E⁰_n) ) |m⁰⟩,

where |n⁰⟩ are the eigenstates and E⁰_n the eigenvalues of the unperturbed Hamiltonian H_P. If states m ≠ n are degenerate, i.e. E⁰_m = E⁰_n, there is a singularity. To avoid it, degenerate perturbation theory requires linear combinations |α⁰⟩ which satisfy ⟨α⁰|H_D|β⁰⟩ ∼ δ_{α,β} in every degenerate subspace. This ensures that the corrected wave function does not diverge due to singularities. We focus on the ground-state subspace, but the procedure is identical for any subspace. Given k ground-states |n⁰_gs⟩ of H_P with energy E⁰_gs, we need to form the k×k subspace matrix V_{n,m} = ⟨n⁰_gs|H_D|m⁰_gs⟩. Since H_D is Hermitian, V is too. Every Hermitian matrix can be diagonalized by a unitary transformation (U⁻¹VU = D) and we find the correct linear combinations |α⁰_gs⟩ in the columns of U. It satisfies ⟨α⁰_gs|V|β⁰_gs⟩ ∼ δ_{α,β} since D is diagonal. The diagonal entries of D are the eigenvalues of V and also the first-order energy corrections E¹_α. We need to pick the lowest eigenvalue E¹_{α,low} and find the corrected ground-state energy E_GS = E⁰_GS + E¹_{α,low}. The corresponding l eigenvectors |α⁰_gs⟩ will now determine the sampling behavior, since the annealing state will be in their span.

(i) l = 1 – In this case p_i = ⟨n⁰_gs|φ⁰_{α,low}⟩², because there is a single state |φ⁰_{α,low}⟩. If sampling is fair, it will remain fair, regardless of how much the higher energy eigenvalues of H_P change during the adiabatic anneal. If certain states have p_i = 0, |n⁰_gs⟩ will never be available at the end of the anneal.

(ii) l > 1 – Let A be the k×l matrix consisting of all l states |φ⁰_{α,low}⟩. If there is a vector x such that Ax = y and y_i · y*_i = 1 for all i, then fair sampling is potentially possible according to first order. If there exists an i such that y_i = 0 for all x, then that ground-state is never found. The same argument can be made for biased sampling, where there is no suppression but certain states are over-sampled.

(iii) V is zero – All eigenvalues E¹_α = 0 and the sampling probabilities are determined by second-order perturbation, i.e., the probabilities depend on higher eigenvalues of H_P (see fig. 5.6).

The second-order perturbation terms only play a relevant role if V is trivial. If l > 0, the sampling behavior is determined by V, which does not depend on H_P. This means that the sampling behavior is purely a property of the driver Hamiltonian H_{x,n} and the ground-state eigenvectors of H_P. We have verified this on numerous small systems, as well as structured and random-coupling systems, with direct integration and were always able to predict the sampling probabilities that correspond to the state found after the



H_{x,1} =
⎡  0 −1 −1  0 −1  0  0  0 ⎤
⎢ −1  0  0 −1  0 −1  0  0 ⎥
⎢ −1  0  0 −1  0  0 −1  0 ⎥
⎢  0 −1 −1  0  0  0  0 −1 ⎥
⎢ −1  0  0  0  0 −1 −1  0 ⎥
⎢  0 −1  0  0 −1  0  0 −1 ⎥
⎢  0  0 −1  0 −1  0  0 −1 ⎥
⎣  0  0  0 −1  0 −1 −1  0 ⎦

V =
⎡ 0  0  0 ⎤
⎢ 0  0 −1 ⎥   =⇒   k_low = (1/√2) (0, 1, 1)ᵀ
⎣ 0 −1  0 ⎦

Figure 5.5: To obtain the sampling probabilities, the ground-state eigenvectors |g_i⟩ need to be known, represented here as shaded rows and columns in the matrix H_{x,1}, since the solution of the diagonal H_P is a classical one. One then needs to analyze the subspace matrix V that is formed by restricting the driver (here H_{x,1}) to the space spanned by the ground-states of H_P. The lowest eigenvector(s) determine the sampling probabilities. In this example, there is one lowest eigenvector, and the first ground state, corresponding to the top column (first row), is suppressed (all spins up) and will never be sampled in an adiabatic anneal.

anneal.
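A numerical sketch of this recipe (assuming the full driver matrix and the degenerate ground states of H_P are available as dense numpy arrays; all names are illustrative):

import numpy as np

def first_order_sampling(H_D, gs_states, tol=1e-9):
    """Restrict the driver H_D to the span of the k degenerate ground
    states of H_P (columns of gs_states) and read the sampling behavior
    off the lowest-eigenvalue subspace of V."""
    V = gs_states.conj().T @ H_D @ gs_states         # k x k subspace matrix
    if np.allclose(V, 0, atol=tol):
        return "highord"                             # case (iii): second order decides
    evals, U = np.linalg.eigh(V)                     # V is Hermitian
    A = U[:, np.isclose(evals, evals[0], atol=tol)]  # the l lowest eigenvectors
    if A.shape[1] == 1:                              # case (i): l = 1
        return np.abs(A[:, 0]) ** 2                  # probabilities p_i
    return A                                         # case (ii): span determines p_i

For the example of fig. 5.5, this would return the probabilities (0, 1/2, 1/2) from k_low = (0, 1, 1)/√2, reproducing the hard suppression of the all-up ground state.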

Figure 5.1(a) is the example studied in Ref. [75], where H_{x,2} leads to fair sampling. There are 6 degenerate ground states, 2 of which are suppressed. With a driver of the form H_{x,1} we obtain l > 1, meaning that there are multiple |α⁰_gs⟩ states that determine the sampling. However, the suppressed states have p_i = 0. In Fig. 5.1(b) we show a more complex example – the smallest problem we were able to find that has l = 1 and one state where p_i = 0. It is a 12-fold degenerate system with two states fully suppressed when H_{x,1} is used as a driver. The fact that l = 1 could be a reason why the suppression sets in earlier during the anneal. This case is problematic for annealing schedules that are fast quenches, because there is a much smaller window during the anneal where the total ground-state probability is approximately unity and the suppressed state has not yet reached zero probability. Using H_{x,2} results in fair sampling. Figure 5.1(c) shows a system that has 6 ground states with 2 ground states in hard suppression. Using H_{x,1} as a driver, we obtain l = 1 and a unique |α⁰_gs⟩ with 2 hard suppressed states with zero probability. For H_{x,2}, l > 1, and we obtain multiple |α⁰_gs⟩ states. However, two states are hard suppressed. Using H_{x,3} as a driver results in l = 1 and a unique |α⁰_gs⟩. However, there is a soft suppression of two ground states (not shown). Finally, using H_{x,4} we obtain l = 1 and fair sampling. The case shown in Fig. 5.1(d) reveals the undersampled states when H_{x,1} is replaced by H_{x,2}. More precisely, it changes from l > 1 with 4 soft suppressed states to l = 1 with the 2 previously oversampled states now being undersampled. Using a driver H_{x,3} results in l = 1 and fair sampling.

Figure 5.6 shows a problem where, by changing the strength of J_{3,4}, one can change the sampling bias arbitrarily. Note that changing J_{3,4} = −1.2 to −1.8 does shift the relative energies of the ground state and the various low excited states, but does not change their order. In terms of perturbation theory, V is trivial, and second-order perturbations dictate the behavior of the system. Because there are terms ∝ 1/(E_i − E_GS), with E_GS the ground state energy, shifting the energy levels E_i will influence the sampling.

[Figure 5.6 plots: (e) 5-variable toy graph with couplings +1/−1 and tunable J_{3,4}; p_GS vs. annealing time T for H_{x,1} with J_{3,4} = −1.2 and J_{3,4} = −1.8, showing the states ↑↑↑↑↓, ↑↓↑↓↑ and their total]

Figure 5.6: Toy problem with N = 5 variables and both ferromagnetic (solid lines) and antiferromagnetic (dashed lines) interactions of different strength (thickness of the lines), integrated using QuTiP [79]. Data for a driver H_{x,1}. By changing J_{3,4} one can change the sampling bias arbitrarily.

5.4 Large Scale Results

To corroborate our results with larger systems, we perform a fair-sampling study analogous to the one done in Ref. [76] for two-dimensional Ising spin glasses on a square lattice with periodic boundary conditions. The couplers are chosen from J_{i,j} ∈ {±1, ±2, ±4}. This ensures that degeneracies are small. The coupler-configuration space is mined for specific degeneracies, as done in Ref. [76]. Figure 5.7 shows representative rank-ordered probabilities to find different minimizing configurations using simulated annealing (SA) [83, 18, 84], as well as transverse-field simulated quantum annealing (SQA-H_{x,1}) [3][13, 34, 35, 36, 37] and simulated quantum annealing with a stoquastic two-spin driver (SQA-H_{x,2}) [85][86]. The data are averaged over 100 disorder realizations. While the data for SA for this particular problem show a fair sampling of all minimizing configurations, neither a transverse-field H_{x,1} nor a more complex H_{x,2} driver can remove



the bias. This suggests that even if QA machines with more complex H_{x,2} drivers are constructed, sampling will remain unfair unless post-processing is applied [81]. The close connection between SQA and QA performance is discussed in Refs. [35, 36].

[Figure 5.7 plots: rank-ordered p_GS (0.00–0.10) vs. GS number (1–32, p-sorted), comparing SA with H_{x,1}-SQA (top) and SA with H_{x,2}-SQA (bottom)]

Figure 5.7: Rank-ordered probability p_GS to find different degenerate states for a two-dimensional Ising spin glass with N = 8² = 64 and a ground-state degeneracy of 32. The data are averaged over 100 disorder realizations. For each instance, 500 independent runs are performed and the probability to find a given ground-state configuration computed. While simulated annealing (SA) samples close to fair, both SQA-H_{x,1} and SQA-H_{x,2} show a clear bias in the sampling. In particular, there is no notable improvement of using a driver of the form H_{x,2} over a transverse-field driver H_{x,1}. We have also simulated systems with up to N = 12² = 144 variables and ground states with up to 96-fold degeneracy, obtaining similar results. Note that the bias becomes more pronounced for increasing system size N (not shown).



5.5 Effects of more complex drivers

The following section shows that any driver (stoquastic or non-stoquastic) needs to be sufficiently dense to sample fairly for generic larger systems. To predict the sampling probabilities it is sufficient to know V (except when V is trivial). V can be constructed with only the ground-state eigenvectors of H_P (no eigenenergies needed) and the driver H_{x,n}. This can be used to analyze different drivers—without specifying a concrete problem Hamiltonian H_P—by merely sampling from possible ground-state combinations. As an example, consider a 2-fold degeneracy in a 5-spin system. Because we want to test the driver for all possible ground-state combinations, we can exhaustively generate all ground-state pairs, i.e., N(N−1)/2 of them, where N = 2⁵, and check V for each pair to analyze the sampling behavior. For larger N, we sample instead of searching exhaustively.

Figure 5.8 shows how probable it is for a random degeneracy and ground-state combination to be sampled according to the following categories:

• fair – All ground states have the same probability.
• soft – At least one ground state is soft suppressed with a ratio smaller than 1:100 (least likely vs. most likely).
• hard – At least one ground state is suppressed with a ratio larger than 1:100 or not found at all, i.e., total suppression. For better visibility in fig. 5.8 we combine these two cases. However, most of the time the suppression is total.
• highord – The matrix V is trivial. Higher-order perturbation will determine the sampling behavior. In the generic case of random couplings this leads to both soft or hard suppression.

In all cases and for H_{x,n} with n ≤ 8 we use Γ_{x,n} = 1. Using different values for the different amplitudes leads to worse sampling, because the matrix V then has multiple different entries. A random matrix has a unique eigenvector which is not parallel to (1,1, . . . ,1,1) in the generic case. Hence, introducing more variety into V leads to more unique (and unfair) k_low. How the ratio of soft to hard suppression is influenced by this was not investigated, since it is unfair in the generic case. Repeating multiple annealing runs with individually randomized Γ^{x,n}_{i,j,...} and averaging improves the sampling but, if not dense enough, will not be able to remove all hard suppression in a generic case.

5.6 Conclusions

We have studied the necessary ingredients for quantum annealing to sample ground states fairly. From Fig. 5.8 we surmise that a fairly dense driver is needed to obtain fair sampling. Carefully controlling the anneal with additional parameters, for example as shown in Ref. [87], might help mitigate the bias; however, this remains to be tested. We do emphasize, however, that an H_{x,2} driver with the typical annealing modus operandi used in current hardware will not yield a fair sampling of states and performs comparably to a vanilla transverse-field driver.



[Figure 5.8 plots: fraction (0.0–0.8) of fair / soft / hard / highord sampling outcomes for drivers H_{x,1} through H_{x,8} on systems of 4, 8 and 20 spins]

Figure 5.8: For each spin system, all possible degeneracies are sampled with 400 random ground-state combinations. For small systems, all combinatorial possibilities of ground states can be checked, and the random sampling approximates the exact result fast. Fair sampling is reached for all possible problem Hamiltonians H_P once the driver matrix is dense, i.e., all off-diagonal elements contain equal nonzero entries. For example, this is the case when using H_{x,4} for a system with 4 spins. Similarly, for 8 spins the system moves from hard to soft, to fair sampling as more complex drivers are used. In the system with 20 spins, for H_{x,n} with n < 7 there is only a dependence on second-order perturbation (see fig. 5.6) or hard suppression in the average case.


III Tools

6 Scientific Measurement Framework
6.1 Challenges in Scientific Computing
6.2 Zurkon Scheduler
6.3 Zurkon Provenance
6.4 Zurkon Caching
6.5 Technical Implementation
6.6 Conclusion

7 Conclusion and Outlook
7.1 Implementation
7.2 Maxcut: Comparing Algorithms
7.3 Embedding
7.4 Fair Sampling
7.5 Scientific Measurement Framework


6. Scientific Measurement Framework

Computational science has the luxury that almost all of its conduct happens in the digital realm, hence it can be automated and tracked given the proper tools. Compared to a physical lab, where there are always external uncontrollable factors to some degree, computational science has no such excuses and should be reproducible. There are a few exceptions, e.g. performance, when the infrastructure changes. The following chapter describes the progress of the tools developed as part of the effort to make this thesis as reproducible and elegant as possible. The main goal was to provide a tool-suite that hides all possible technical details and leaves the user with a clean, convenient front end. First, we will highlight challenges in scientific computing. Second, we provide techniques to solve scheduler-related challenges. Third, a tool for tracking provenance of data is presented, followed by caching mechanisms for efficient re-use of already calculated data. The chapter concludes with technical remarks that make the convenient front end possible.

6.1 Challenges in scientific Computing

In scientific computing, many challenges arise from the volume of simulations one needs to run. We will introduce the most important ones in the following section. We assume the user works on a local device, e.g. a laptop or workstation, and also has access to a remote system with large compute capabilities, such as a cluster.


6.1.1 Infrastructure
No matter which super-computer or cluster one uses, it is very likely that a program needs some adjustment before running on it efficiently. Since the cluster maintainers have to make sure that all software they offer is compatible, the latest versions of software such as compilers, interpreters or databases are usually unavailable. One has to take this into account from day one when developing on the local machine. The storage format on remote computing machines might differ as well, which will be discussed next.

Storage
The choice of storage should also be checked beforehand on the cluster. Due to the large-scale nature of clusters, they usually have specialized filesystems that may have bad performance for certain tasks, e.g. file-locking. Synchronizing a file-lock over an entire filesystem server understandably takes much more time than on your local machine, hence one needs to be careful when using any storage strategy that relies on such mechanisms. It is best to test your storage system (files / sqlite / database / hdf5) beforehand, or to abstract it. This means isolating code related to storage behind another abstraction layer that does not depend on the concrete storage implementation. The main programs only interact with this layer. A later change in storage is thus manageable, since it also only interfaces with this layer and not with all programs.

Scheduler
Almost all clusters have some sort of scheduling system, since it would be unmanageable to manually log into every compute node and run jobs. It also balances the load and prioritizes the different users appropriately. The main goal of a scheduler is to use the cluster resources as efficiently as possible. There are many different scheduling systems, such as PBS, Slurm, LSF, SGE, LoadLeveler, and while they all achieve the same goal, the details of their operation and limitations can be fairly different. There is always some sort of job limit in place, since most schedulers do not deal well with many jobs (> 100k). This has to be taken into account when developing code. Furthermore, there is always a nonzero chance of a node failure, a scheduler crash, or ultimately a cluster shutdown, so that checkpointing and caching are advised. If the code should run on multiple clusters with different schedulers, the scheduler needs to be abstracted and plugins for each scheduler created. Another key difference between the cluster and the local machine is the requirement to specify the duration of each job. If the duration is chosen too short, the scheduler will abort the unfinished simulation; if chosen too long, the job will typically start later and, depending on the scheduler, reduce your overall priority. Choosing durations which are much too long further puts strain on the scheduler, since it needs to back-fill the resources you applied for but did not use. Especially when the job duration depends on the input variables, it can be tedious to figure out an acceptable strategy.

6.1.2 Caching
As mentioned in section 6.1.1, caching your results during computation is strongly advised, since there can always be failures on the cluster. A lot of compute time can be lost if just one node fails or there is an error at the end of your program. There are many options to store intermediate and final outputs of simulations, as mentioned in section 6.1.1. Caching can be done at different levels, starting on the local machine, which can be useful, e.g., if one just runs different plotting scripts on the results, to prevent having to connect to the cluster every time. The next level is cluster-wide caching that every compute node can access. This level usually requires more attention during development, due to reasons mentioned in section 6.1.1. Caching in a production setting, where the source code rarely changes, is fairly straightforward. But in science one is almost by definition in a testing setting, where source code changes frequently, and so caching can be more involved. One has to prevent false positive hits in the cache, i.e. a modified function should never retrieve results associated with an older version of itself from the cache. This usually involves tracking the version of the cached functions and subsequently invalidating the cache, or at least using a different cache-key when interacting with the cache. Just modifying the cache-key (a shorthand for the storage location of your function with the corresponding inputs) is easier to implement, but leaves the cache ever growing, since old useless entries are not deleted. On a cluster there is a scratch space where unused files get deleted eventually, but if one uses a database or hdf5, there is no automatic cleaning. On the local machine, this issue can lead to disk shortage in the long run.
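As a minimal sketch of this version-aware cache-key idea (the name cache_key and the string layout are illustrative, not the actual implementation):

def cache_key(func_name, version, args):
    # include the function version so a modified function can never
    # retrieve results recorded by an older version of itself
    return f"{func_name}-v{version}-{args!r}"

print(cache_key("run_annealing", "0.2", (42, 0.5)))
# run_annealing-v0.2-(42, 0.5)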

6.1.3 Active Development

When trying to answer scientific questions, one needs to adapt the code according to the findings. The scientific process is a badly defined problem from the software development perspective: maybe the simulation needs to answer questions tomorrow that nobody thought of today. For running the simulations, this means that one needs to track the source code well, since it might change while, e.g., a long simulation is still running. If the main code on the cluster were changed in the middle of execution, that would most likely interfere with the running simulation. Hence it must be possible to have multiple copies of the source code on the cluster at the same time. Having multiple copies of anything and managing them manually is a reliable source of errors, hence this should be automated or at least delegated to version control systems.

While developing for the cluster, there are a few ways of writing code and testing it on the cluster. The main goal is to keep the loop between changing code and testing it as short as possible. Possible solutions to achieve this are:

Develop on Cluster

The most direct way is to modify the code on the cluster itself, which keeps the loop very short. The major downside, depending on your choice of IDE (Integrated Development Environment), is the lack of any GUI (Graphical User Interface). If one prefers a terminal-based editor, this is no problem, but in any other case, the reduced coding speed in an unfamiliar editor is a deal-breaker.



Develop on Local Machine

One can develop on the local machine and push the changes via version control to the cluster. The main advantage is the familiar environment and that one only deals with one code repository. The key downside is the very long loop, since pushing and pulling takes a long time compared to the maybe few lines that changed. Unless one fixes the version-control history later, this also leads to many small commits. A faster way of sharing the code would be tools like rsync or scp.

Mounting Cluster

To have the best of both worlds, one can mount the filesystem of the cluster (if supported) locally. The files on the cluster can be modified with your favorite editor and will be saved directly on the cluster, making this the preferred way. The only part that requires discipline is to avoid having the same file open locally and remotely. This leads to confusion, since you think you have modified the file on the cluster, but instead you changed the file on your local machine. Since one now modifies two versions of the code repository, merging might be necessary.

6.1.4 Personal Experience

This section highlights the personal experience of the author with the challenges described above.

Infrastructure

Regarding storage, I worked file-based in my early days. With growing simulation sizes, the time spent on reading and writing these files became unacceptable, since one needs to read and parse the whole file even if only a small piece of information in it is used. But the file approach worked well on the cluster, as the different compute nodes and processes did not work on the same file concurrently. Later I switched to sqlite, a server-less SQL implementation, which solved the inefficiency of read/write but took adjusting on the cluster due to the lustre filesystem. sqlite needs to make sure that no two processes write to the database at the same time, since there are potentially many (1000+) processes reading and writing to it. As already hinted in section 6.1.1, the main performance issue was the synchronization of the database: sqlite by default uses a file-lock to track if it is locked, but these file-locks are very slow on lustre. Adjusting the options of sqlite solved this problem. A future option for even larger scale would be a server-based SQL storage system. The super-computers at our disposal all run the slurm workload manager. Being in statistical physics, we run large scans of the same program for different instances and/or settings, resulting in many small jobs. At the beginning, the cluster did not limit the job submission well, and the author unknowingly crashed the entire scheduler on the cluster by submitting too many jobs. All scheduled jobs of all cluster users were lost, which should never be repeated again. But having to monitor the cluster closely at all times is not feasible either, hence the scheduler described in section 6.2 took form to avoid these problems.


Caching
Both the file-based caching and the sqlite-based caching served us well on the cluster. In order to remove old entries, we implemented a full cache invalidation system, and we minimized the data transfer between the local machine and the cluster. The main work is tracking which function calls which other functions, to subsequently invalidate all entries down the dependency tree. This system is described in section 6.4.

Active Development
At the beginning, we mounted the cluster directly on the local machine, but with the continuous tool development, the entire code management on the cluster was automated to the point where no manual cluster management was necessary anymore. Currently we use the second strategy outlined in section 6.1.3, which is a bit slower due to the constant pushing and pulling with git, but since the cluster side is automated, the loss in speed is acceptable.


6.2 Zurkon Scheduler
The starting point of this tool suite is the scheduler. As mentioned in section 6.1.1 and section 6.1.4, using the scheduler of the cluster directly is neither elegant nor possible without modifying the code.

For all listings, we use python as the language, but the concepts could also be applied to similar languages. The following terms will be used throughout the next sections:

• content hash: Given a certain python object, the content hash is generated using a cryptographic hash function (e.g. hashlib.md5). It is used as a way to identify the object without having to store the entire object. While there is a theoretical danger of collisions, i.e. different objects mapping to the same hash, this is of no practical relevance with a good cryptographic hash and non-gigantic project sizes.

• payload: The combination of a function and its input arguments. The payload can be executed without further arguments. It can be thought of as a delayed apply, or as a functools.partial in python.

• payload hash: Given a function and its input arguments, the payload hash depends on the public function name, the function version if specified, and the content hashes of the input arguments. The payload hash is a shorthand for the content hash of a payload. A minimal sketch of both hashes follows this list.

• call graph: A tree structure where every function call (which has an associated payload hash) is a node. If a payload A invokes payload B, then B is a child of A in the graph. The call graph visualizes the dependencies between the different functions for the given input arguments.
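The following is a minimal sketch of these two hashes, not the actual zurkon implementation; hash_content and payload_hash are illustrative names, and pickle stands in for whatever deterministic serialization is used:

import hashlib
import pickle

def hash_content(obj):
    # serialize the object and hash the resulting bytes
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

def payload_hash(func_name, args, version="0.0"):
    # combine the public function name, its version, and the
    # content hashes of all input arguments
    parts = [func_name, version] + [hash_content(a) for a in args]
    return hashlib.md5("|".join(parts).encode()).hexdigest()

print(payload_hash("foo", (2,)))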

6.2.1 Multiprocessing in Python
Multi-process programming is long established as a vital technique to increase performance, and many programming languages, python amongst them, provide some solutions. What these solutions provide and where their limits lie will be discussed in this section.

Listing 6.1: using a ProcessPoolExecutor and evaluating the future returned

from concurrent.futures import ProcessPoolExecutor

def foo(x):
    return x

with ProcessPoolExecutor(max_workers=2) as pool:
    future = pool.submit(foo, 2)
    print(future.result())

Listing 6.1 shows the typical use of a pool. We delegate the work of running foo(2) to a process pool and get a so-called future. This is a handle to get the result and, if it is not yet done, wait for it. The concept of delayed or lazy computing is very powerful: e.g., if we were to call foo with many different numbers, we could submit all these tasks and only start blocking the program (i.e. waiting) when we need a result.
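To illustrate this deferred evaluation (the squaring function is just a stand-in for expensive work), one can submit a whole batch of calls up front and only block once the results are consumed:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        # submit everything first; nothing blocks here
        futures = [pool.submit(square, n) for n in range(10)]
        # block only when the results are actually needed
        print([f.result() for f in futures])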


Listing 6.2: the foo calls are not submitted to the pool

def foo(x):
    return x

def bar(x):
    return foo(x) + foo(x+1)

with ProcessPoolExecutor(max_workers=2) as pool:
    print(pool.submit(bar, 1).result())

Listing 6.2 shows the issue with the ProcessPoolExecutor. If the foo calls are expensive, and especially if there are more than just two, we would like to run this work in parallel and not just on the one process tasked with working on bar.

Listing 6.3: bar starts a new pool which doubles the resource consumption

def foo(x):
    return x

def bar(x):
    with ProcessPoolExecutor(max_workers=2) as pool_bar:
        future1 = pool_bar.submit(foo, x)
        future2 = pool_bar.submit(foo, x+1)
        return future1.result() + future2.result()

with ProcessPoolExecutor(max_workers=2) as pool:
    print(pool.submit(bar, 1).result())

Listing 6.3 shows a solution where bar starts another pool, but this has the ugly effect that we cannot limit the resource consumption anymore, since every function will ask for resources, oblivious of how many processes and pools are already in use.

Listing 6.4: passing the pool to functions

def foo(x, pool):
    return x

def bar(x, pool):
    future_a = pool.submit(foo, x, pool)
    future_b = pool.submit(foo, x+1, pool)
    return future_a.result() + future_b.result()

with ProcessPoolExecutor(max_workers=2) as pool:
    print(pool.submit(bar, 1, pool).result())


The second solution, shown in listing 6.4, passes the pool along to the functions, such that they can use it as well. The first issue is the ugly syntax that requires the user to always pass the pool; the second, more important issue is that it does not work: while this is working code with a ThreadPoolExecutor, it does not work with a ProcessPoolExecutor, since the pool cannot be serialized. While these pools do a splendid job for flat call graphs, they do not work for repeated use from submitted functions.

6.2.2 Static Scheduler
The first, futile attempt was a static scheduling system. Static refers to the necessity of knowing the entire call graph (which function invokes which other functions) before any function body is executed.

Listing 6.5: static dependency specification

@dependencies()
def foo(x):
    return x

@dependencies("foo(x), foo(x+1)")
def bar(x):
    return foo(x) + foo(x+1)

Listing 6.5 shows that the function bar needs to specify the foo calls in a @dependencies decorator. While the downsides of code duplication are obvious, they could easily be removed with static code analysis later. The key advantage of static scheduling is the ability to schedule perfectly: one knows all dependencies, i.e. the entire call graph, and can run foo(x) and foo(x+1) before bar(x). The dependency structure also shows that the foo calls can be run independently at the same time. But listing 6.6 reveals the main issue the static scheduling requirement poses for the user: if there is a decision in the function bar that cannot be deduced by static analysis, one would need to run it to figure out which foo call is required, which defeats the idea of static scheduling entirely.

Listing 6.6: main issue with static dependency specification

@dependencies()
def long_decision(x):
    time.sleep(100)
    return x == 1

@dependencies()
def foo(x):
    return x

@dependencies("long_decision(x), foo(x)?, foo(x+1)?")
def bar(x):
    if long_decision(x):
        return foo(x)
    else:
        return foo(x+1)

The only way forward is to either restrict coding practices for the user to prevent these decision situations, or to adapt the scheduling system. The restrictions are too grave for even slightly advanced simulations, and thus the only viable option was the latter.

6.2.3 Dynamic Scheduler
Not knowing the call graph and still scheduling optimally, or at least reasonably, is a much more difficult task, which the following implementation attempts to solve. For now, we just assume that there is a node/worker running with a few slots (worker threads) for functions, to illustrate the various issues; later we discuss in more detail how these workers get launched. Listing 6.3 and Listing 6.4 clearly suffer from boilerplate code and can be improved with a bit of python magic, see listing 6.7. Be aware that even if it looks like normal python code, there are submits and futures occurring behind the scenes: one can envision a submit whenever a function decorated with @context_enable is called, and a future.result every time we use a result. The full capabilities of these decorators will be discussed later.

Listing 6.7: the @context_enable decorator and the MagicPool spare us from writing code like in listing 6.4

@context_enable()
def foo(x):
    return x

@context_enable()
def bar(x):
    return foo(x) + foo(x+1)

with MagicPool(max_workers=2):
    bar(1)

Bearable Load
The normal cluster schedulers are not necessarily set up to deal with many small jobs, hence the first benefit of a custom scheduler is to lessen the load on the cluster scheduler by only submitting a few larger worker jobs, which can then internally work on up to millions of small jobs, something that might not be possible otherwise.

Relaunch Jobs
The solutions within the python standard library have no notion of a job duration but, as mentioned in section 6.1.1, this is required on every cluster. Sometimes your estimated duration is not long enough and the scheduler kills your job before it is finished. This is especially annoying if it happens in some small function that then prevents the entire simulation from running, and one finds an aborted simulation instead of interesting results. The max_extend keyword specifies how many times a timed-out job will be extended until it is finally declared timed out. On every extension, the scheduled duration is doubled, such that in the worst case one has only wasted 50% of the anticipated compute time in the limit of infinite extensions: the aborted runs of duration d, 2d, . . . sum to (2^n - 1)d, which is at most as long as the final successful run of duration 2^n d. Listing 6.8 and fig. 6.1 show this scenario for three function calls.

Listing 6.8: relaunch demonstration, see fig. 6.1

@context_enable(duration=1.2)
def foo(x):
    time.sleep(x)
    return x

with Pool(max_nodes=1, jobs_per_node=3, max_extend=2):
    foo(5)
    foo(3)
    foo(1)


Figure 6.1: The three functions can be launched in parallel (think of the return value as a future). foo(1) can finish successfully (blue) within the given limit of duration=1.2. The other calls time out (orange) and get extended. foo(3) can finish after two extensions, bringing the scheduled duration up to 2 * 2 * 1.2 = 4.8, but for foo(5) this is still too short, and the task times out with no extension left (red).

Hot Caching
Hot caching refers to storing results in working memory rather than on disk. If the pool is shut down, the hot cache is lost as well. The code run in this example is shown in listing 6.9.

Listing 6.9: hot cache demonstration, see fig. 6.2 (without) and fig. 6.3 (with hot caching)

@context_enable(duration=4)
def foo(x):
    time.sleep(x)
    return x

@context_enable(duration=5)
def bar(x):
    res = foo(3)
    time.sleep(1)
    return x + res

with Pool(max_nodes=1, jobs_per_node=3, max_extend=2):
    bar(1)
    bar(2)


Figure 6.2: Listing 6.9 run without hot caching.

Figure 6.2 shows the run with hot caching disabled. The two bar calls are launched; they both schedule the same foo(3) call, but only bar(1)'s call can run immediately, which leads to a successful conclusion. The second foo(3) call starts at the 3 second mark, but is not done in time before bar(2) times out. bar(2) gets extended and relaunched, and submits a foo(3) again, since the new bar(2) has no recollection of the previous foo(3), which is now orphaned and just wasting compute time. After the new foo(3) is done, the second bar(2) completes successfully as well. foo(3) is run 3 times, which seems wasteful. This can be improved by storing the payload hash and its associated future in the pool: whenever the same payload is presented again, the pool just returns the same future, since the payload is already scheduled or running. Figure 6.3 shows the execution with hot caching.


Figure 6.3: Listing 6.9 run with hot caching.


Both bar calls submit foo(3), which the pool recognizes. When foo(3) is done, both bar calls terminate successfully. This saves a lot of compute time, with the only cost being working memory on the machine that runs the pool. For long-running pools, least-recently-used or similar strategies can be deployed to keep the memory consumption in check.
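A minimal sketch of this payload deduplication, using a thread pool and the function name plus arguments as a stand-in for the payload hash (the class name HotCachePool is illustrative, not the actual implementation):

from concurrent.futures import ThreadPoolExecutor

class HotCachePool:
    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}  # payload key -> future

    def submit(self, func, *args):
        key = (func.__name__, args)
        if key not in self._futures:
            # schedule only once; identical payloads share one future
            self._futures[key] = self._pool.submit(func, *args)
        return self._futures[key]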

Dependency Tracking
If a payload A invokes a payload B that runs much longer than A itself, the question is whether A should take this into account in its duration declaration or not. For multiple reasons, the implementation chooses not to put this bookkeeping burden on A: B might change, become more efficient, or suddenly call other subroutines, and it would be a maintenance nightmare to always update A's duration. Hence the only duration A needs to get right is the time it uses itself, without considering submitted subroutines like B. Listing 6.10 shows this scenario. Also note the dynamic duration=lambda x: x + 1 statement in foo; this is a convenient way to further refine the estimated runtime depending on the inputs.

Listing 6.10: dependency tracking demonstration, see Figure 6.4 and Figure 6.5

@context_enable(duration=lambda x: x + 1)
def foo(x):
    time.sleep(x)
    return x

@context_enable(duration=0.8)
def bar(x):
    res = foo(7)
    time.sleep(1)
    return x + res

with Pool(max_nodes=1, jobs_per_node=2, max_extend=2):
    bar(1)


Figure 6.4: Listing 6.10 run without dependency tracking.

Figure 6.4 shows the main issue when we run the scenario with no further measures in place. Function bar gets extended and relaunched, since it does not know that the reason for its timeouts is the unfinished function foo. This can be fixed by tracking which payload submits which children and building the call graph dynamically as submissions enter. Knowing the local call graph around bar, the pool only relaunches it once all children (foo) are done. This behavior can be seen in Figure 6.5.


Figure 6.5: Listing 6.10 run with dependency tracking.

Note that a timeout caused by a child will not trigger an extension of the duration, since extensions are limited by max_extend. Thus bar is relaunched at the 7 second mark after foo is done, but the duration=0.8 is too short, such that an additional extension relaunch is necessary to satisfy the time.sleep(1) statement. This design forces the user to write all functions in an idempotent way, i.e. there is no additional effect if a function is run a second time with the same arguments. Most functions have this property already, such that this constraint is bearable. It allows specifying only the self-duration of a function, and functions can potentially get resubmitted many times. Anyone familiar with async-style programming will recognize the very similar philosophy of suspendable functions, although here in a less elegant manner, since it is impossible to force all frameworks, C bindings and more to adopt async practices.
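A small illustration of this idempotence requirement (both functions are hypothetical examples): pure_mean can safely be resubmitted, while append_log duplicates its side effect on every relaunch and would therefore misbehave under this scheduling scheme.

results_log = []

def append_log(x):
    # NOT idempotent: a relaunch appends the entry a second time
    results_log.append(x)
    return x

def pure_mean(values):
    # idempotent: same inputs, same result, no side effects
    return sum(values) / len(values)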

Idle Timeout
In the previous section it was not apparent, but dependency resubmission has one efficiency issue, namely that parents wait their entire duration for a child to finish, blocking valuable computing resources from running other functions.

Listing 6.11: idle timeout demonstration, see Figure 6.6 and Figure 6.7

@context_enable(duration=lambda x: x + 1)
def foo(x):
    time.sleep(x)
    return x

@context_enable(duration=8)  # maybe we overestimated the duration
def bar(x):
    res = [
        foo(5),
        foo(4)
    ]
    time.sleep(1)
    return x + sum(res)

@context_enable(duration=2)
def baz(x):
    time.sleep(1)
    return bar(x)

with Pool(max_nodes=1, jobs_per_node=2, max_extend=2):
    baz(1)


Figure 6.6: Listing 6.11 run without idle timeout.

Listing 6.11 and fig. 6.6 show this situation. The bar(1) call at the 1 second mark launches both foo calls immediately, but since the node has no more free spots, foo(4) is only started when baz times out naturally. And since bar waits its full duration, it blocks resources that could be used to run foo(4). Figure 6.7 fixes this issue by introducing an idle timeout of 1 second.


Figure 6.7: Listing 6.11 run with idle timeout set to 1 second.

Any job waiting for more than the idle timeout gets aborted and relaunched once all its dependencies are completed. This kind of timeout, like the dependency resubmission, does not extend the job's duration. The reason a seemingly useless redirection via the baz call was introduced is the fact that only functions that do not originate directly from the user can benefit from the idle timeout. Typically the user process is not on the cluster anyway; these idle timeouts work by sending an error to the waiting process, which is then handled correctly by the pool when the process reports that the function encountered this specific error.

Local Processing

While the idle timeout solves many issues concerning the efficiency of resource use, in many situations we can do better than aborting the parent job. When aborting the parent job, we pay the idle_timeout and have to re-run the workload in the parent job upon relaunch up to the point where it was aborted. If a job requires the result of a future that is not yet launched, we can delegate the processing of it to the requesting job for local processing. This only works if the remaining duration of the requesting job is long enough to fit the requested job's duration. Furthermore, the scheduler does not send work to functions invoked by the user directly, hence the presence of the baz function in listing 6.12 again.

Listing 6.12: local processing demonstration, see fig. 6.8 and fig. 6.9

@context_enable(duration=lambda x: x + 1)
def foo(x):
    time.sleep(x)
    return x

@context_enable(duration=7)
def bar(x):
    time.sleep(1)
    res = x + foo(1)  # forces evaluation
    time.sleep(.25)
    res += foo(1.25)
    time.sleep(.25)
    return res

@context_enable(duration=2)
def baz(x):
    time.sleep(1)
    return bar(x)

with Pool(max_nodes=1, jobs_per_node=1, max_extend=2):
    baz(1)

Figure 6.8 shows this run without local processing. Since there is only one compute spot, baz and bar abort due to the idle_timeout=0.25. Then foo(1) runs, but foo(1.25) only gets launched once bar is relaunched. If we enable local processing, as shown in fig. 6.9, both foo calls can be run locally in the bar process. Local processing is not limited to one level: as long as the un-launched job a parent job is waiting for fits its remaining duration, it can be processed locally, e.g. if a foo call were to call another function, that could be processed locally as well, no matter how the foo call itself is handled.



Figure 6.8: Listing 6.12 run without local processing, with an idle timeout of 0.25 seconds.


Figure 6.9: Listing 6.12 run with local processing, with an idle timeout of 0.25 seconds.

While it might seem artificial to have just one worker slot, this situation can easily happen with multiple slots as well.

Worker Scheduling

This next section concerns the management of the workers, which has been skipped so far. The main difference to the previous sections is the circumstance that node-workers do not launch immediately after being submitted, but rather have to wait a potentially long time in the cluster queue. Since we interact with the cluster scheduler, the concerns mentioned in section 6.1.1 have to be addressed. One crucially important decision is the strategy with which one starts workers. This implementation provides a few options, but they are all based on the following principles:

• Priority: prioritize long jobs over short jobs.
• Efficiency: cancel node-workers that have not launched yet if they are not needed anymore.
• Limit: only launch as many node-workers as specified by max_nodes.
• Overestimation: assume that jobs will finish earlier than specified by their duration and do not rely too much on this information.

Figure 6.10 shows how the workers are scheduled to satisfy the waiting jobs; a sketch of the underlying coverage idea follows below. They always cover the longest waiting jobs in the workload. If there are many more jobs than capacity, one can switch to a strategy that regards the integrated workload, but due to the frequent overestimation, focusing on the longest jobs and leaving the shorter ones as back-fill material is more efficient; not necessarily in workers launched, but in time to completion, since shorter node-workers usually get launched quicker on the cluster. Even with this trivial strategy, given jobs_per_node=72, the number of jobs the cluster sees compared to the actual jobs launched is already substantially smaller. The strategy used in production uses the longest job as reference, but adds additional time to prevent excessive worker submissions, further reducing the strain on the cluster scheduler.
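A sketch of this longest-jobs-first coverage idea (workers_needed is an illustrative helper, not the production strategy): sort the waiting jobs by duration and size each node-worker by the longest job it will cover.

def workers_needed(durations, jobs_per_node, max_nodes):
    longest_first = sorted(durations, reverse=True)
    workers = []
    for i in range(0, len(longest_first), jobs_per_node):
        if len(workers) == max_nodes:
            break
        # the worker's duration is set by the longest job it covers
        workers.append(longest_first[i])
    return workers

print(workers_needed(list(range(1, 56)), jobs_per_node=10, max_nodes=4))
# [55, 45, 35, 25] -- as in the left panel of fig. 6.10, 4 workers cover 40 of 55 jobs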


[Figure: two panels of number of waiting jobs versus duration of job (sorted) [s], comparing waiting jobs and waiting node-workers.]

Figure 6.10: This shows the node-workers for a given workload distribution. Every node-worker can work on 10 jobs in parallel (jobs_per_node=10) and we allow a maximum of max_nodes=4 nodes. The left panel has 55 jobs submitted, every job longer than the last in a linear fashion. Since it is only possible to launch 4 nodes, the orange curve stops at 40. The right panel has enough nodes to cover the 35 jobs.

Worker Starvation
The management of the node-workers is easier than the job management discussed above, but a few issues apply to the node-workers as well. The most important is to shut down workers that cannot run jobs anymore due to their remaining duration.

Listing 6.13: worker starvation demonstration, see fig. 6.11 and fig. 6.12

@context_enable(duration=4)
def bar(x):
    time.sleep(2)
    return x

with Pool(max_nodes=1, jobs_per_node=1, max_extend=2):
    bar(1)
    bar(2)

Listing 6.13 and fig. 6.11 show the issue when not taking this into account. Since bar requires a duration of 4, a worker with duration 4 is launched. After bar(1) is done at the 3 second mark, the worker cannot process bar(2), with its duration requirement of 4 seconds, in its remaining 2 seconds. But instead of exiting, node 00 runs until the end of its lifetime before a second worker can be launched. Figure 6.12 shows the same situation with the workers starving after 1 second if they have nothing to do anymore. In this example it seems like a minor optimization, but in production this is of vital importance to prevent entire compute nodes from idling.



Figure 6.11: Listing 6.13 run without worker starving. The lime green color indicates the time a node-worker had to wait until it was launched.


Figure 6.12: Listing 6.13 run with workers starving after 1 second. The lime green color indicates the time a node-worker had to wait until it was launched.


6.3 Zurkon Provenance

Whenever data is stored, the provenance of that data should ideally be recorded as well: it should be clear how the data was created and under which circumstances. More formally, for a computed result we should later be able to identify the function as well as the input arguments that led to this result. If we are presented with only the result, e.g. the number 2.14, we otherwise have no way of knowing if it is the result of an important simulation or just a number produced by a random number generator. Since input arguments can be very large, it is sufficient to know their provenance instead of having to store a copy. Following this logic, we end up with something similar to a call graph, which should be manageable to store. This implementation distinguishes between invocation of a function and usage of its result. Furthermore, any function that does not choose to support provenance is invisible to the tracker. Tracking every addition and basic operation might be interesting, but is much too expensive in a larger simulation. Once a result with provenance is modified in the slightest, it loses the original provenance. If the modifying operation supports provenance, a new one is assigned to the new result; if not, the new result is data without provenance. The following examples explain the basic principles behind the provenance and highlight potential subtleties.

6.3.1 Basic Example
Listing 6.14 shows a simple addition of two function results. Note that functions decorated with @context_enable support provenance while the ones without do not. Figure 6.13 shows the provenance graphically.


Listing 6.14: basic example

@context_enable
def echo(x):
    return x

@context_enable
def adding(x, y):
    return x + y  # used the values of echo(x)

with ProvenanceContext():
    x = echo(1)  # the user invoked echo
    y = echo(2)
    res = adding(x, y)
    print(x, res)  # x is used by the user as well


Figure 6.13: Listing 6.14. In the absence of further specification, every function has the version 0.0. The solid red lines with a diamond indicate that the diamond-facing box was invoked by the connected box. A dashed red line with an arrow symbolizes that the box the arrow points to accessed the result of the connected box. If a box both invoked and used another one, the line is solid with diamond and arrow.

The user invoked all three functions, and adding used the results of the echo functions. Furthermore, the user printed echo(1) and thus it shows up as used, while echo(2) does not.


6.3.2 Subcall Example

Listing 6.15 shows the adding_with_echo function that internally calls the echo functions instead of getting the results as arguments, as shown in fig. 6.14.

Listing 6.15: subcall example

@context_enable
def echo(x):
    return x

@context_enable
def adding_with_echo(x, y):
    return echo(x) + echo(y)  # invoked and used

with ProvenanceContext():
    x = 1
    y = 2
    res = adding_with_echo(x, y)
    print(x, res)


Figure 6.14: Listing 6.15 has an adding function with subcalls. Now the user only invokes one function, while the rest is invoked and used by adding_with_echo.


6.3.3 Passing Through Example
Listing 6.16 shows two similar calls. echo_print uses the argument and returns it, while wrap_echo_print returns a list containing the argument. Figure 6.15 shows that in the latter case the original argument is still used by the user, while in the former case the argument does not show up as used.

Listing 6.16: passing through example

@context_enable
def echo(x):
    return x

@context_enable
def echo_print(x):
    print(x)
    return x

@context_enable
def wrap_echo_print(x):
    print(x)
    return [x]  # x does not get used here

with ProvenanceContext():
    x = echo(1)
    y = echo(2)
    res_x = echo_print(x)
    res_y = wrap_echo_print(y)
    print(res_x, res_y)

If a result with provenance is directly returned by another function with provenance, the original provenance is overwritten. The second call does not overwrite the provenance of the argument, since the new provenance is attached to the list. This overwrite prevents stacking of multiple provenance layers on the same result.

Figure 6.15: Listing 6.16 shows two calls, one where the original provenance is overwritten. Note that echo(1) is not used by the user, while echo(2) is.

6.3.4 Self Similar Example
Listing 6.17 displays three echo_print calls, each fed with the result of the previous one.

Listing 6.17: self similar example

@context_enable
def echo_print(x):
    print(x)
    return x

with ProvenanceContext():
    x = echo_print(1)
    y = echo_print(x)
    z = echo_print(y)
    print(z)

The first call gets the argument 1 with no provenance, while the latter two get a 1 with identical provenance, both generated by echo_print(1). In order for the provenance tracker to resolve this, it is necessary for each result to have a unique id. This allows clean tracking even if function, arguments, and results are identical.

Figure 6.16: Listing 6.17 shows three calls, each fed with the result of the previous one. Since in the last two calls everything is identical, an additional internal unique id is required to allow clean resolution.
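A minimal sketch of such unique result ids (the record layout and helper name are illustrative, not the actual implementation): every produced result carries a fresh id, so identical values from identical calls remain distinguishable in the provenance graph.

import itertools

_next_id = itertools.count()

def new_result_record(value, producer):
    # identical (value, producer) pairs still get distinct ids
    return {"id": next(_next_id), "value": value, "producer": producer}

print(new_result_record(1, "echo_print"))  # id 0
print(new_result_record(1, "echo_print"))  # id 1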

A few more challenges of provenance appear with the introduction of caching and will be discussed in section 6.4.

6.4 Zurkon Caching

6.4.1 Caching
A useful mechanism to save computation time is caching. Once a result has been calculated, it should be stored in case the user requests the same calculation again. The two limits for caching are available storage and writing speed. For very small calculations, it might be faster to recalculate than to retrieve from a slower memory. In the zurkon framework, caching can be added by a context.



Listing 6.18: caching example

@context_enable(version="0.0", cache=True)
def echo(x):
    return x

with CacheContext(root):
    res = echo(1)
    print(res)
    res_b = echo(1)  # cached
    print(res_b)

with CacheContext(root):
    res = echo(1)  # still cached
    print(res)

Listing 6.18 shows an example of such a context. The @context_enable decorator can be instructed whether a function should be cached or not. If the user does not specify a storage function for the function, the framework will try to use a simple serializer (msgpack) that supports all built-in types, but no custom classes. The user either has to provide support for their serialization or specify more advanced and efficient storage formats, like for example hdf5 or sql. As for the duration (section 6.2.3), the decision to cache could also be made dependent on the input arguments by passing a function instead of a simple boolean. A version number can be specified for every function or method. Whenever the function version changes during development, all previously cached data for this function will not be used anymore. If the version is reverted back again, the old matching cache entries will be used again.

6.4.2 Prune Cache
During development, there is usually a cycle of measurements and subsequent modifications of the source code. As displayed in listing 6.19, every function has a version that can be increased.

Listing 6.19: cache prune example

@context_enable(version="0.0")
def foo(x):
    return x

@context_enable(version="0.0")
def bar(x):
    return foo(x) + 1

with CacheContext(root):
    print(bar(1))  # prints 2
    print(bar(1))  # cached

# later on, when foo's version changed to "0.1"

with CacheContext(root) as cach:
    # deletes all outdated cache entries and
    # every other entry that used them
    cach.prune()
    print(bar(1))  # gets recalculated

The issue is that the cached entry of bar does not realize that some function inside of bar changed version, and it will still use cached values as long as the version of bar is the same. Since foo got a new version, should we also have to update the version of every function that uses foo? This would not be maintainable, hence another solution is needed. The cache offers a prune function that removes all entries that do not correspond to the current version of their functions, as well as all entries that made use of the now outdated data. This deletes all cache entries that have to be recalculated due to the version change or changes. By tracking the usage of function results during execution, the cache can learn the dependencies necessary for successful cache pruning later.
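The pruning logic can be sketched as follows, assuming the cache has recorded, for every entry, the versions of all functions it depended on at execution time (all names here are illustrative, not the actual implementation):

current_versions = {"foo": "0.1", "bar": "0.0"}

cache = {"bar(1)": 2}
deps = {"bar(1)": {"bar": "0.0", "foo": "0.0"}}  # recorded at execution time

def prune(cache, deps, current_versions):
    stale = [key for key, used in deps.items()
             if any(current_versions.get(f) != v for f, v in used.items())]
    for key in stale:
        cache.pop(key, None)
        deps.pop(key, None)

prune(cache, deps, current_versions)
print(cache)  # {} -- bar(1) depended on foo v0.0, which is now outdated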

6.4.3 Lazy Caching
When data gets large, loading from cache can take up a substantial amount of time. Hence, only necessary data should ever be loaded from cache. Listing 6.20 shows a situation where original data, which we assume to be very large, gets refined by another function. Both functions are cached, but in order to access the refine cache entry, we need the original_data cache entry, which takes much longer to load, even though the user never uses this result directly.

Listing 6.20: lazy cache example

@context_enable
def original_data(x):
    return something_large(x)

@context_enable
def refine(x):
    return some_averages(x)

with CacheContext(root):
    # let's assume everything is cached
    data = original_data(1)
    averages = refine(data)
    print(averages)  # we do not use data

In order to compute the cache entry location, the payload hash of the function call is needed, see section 6.2. For this, we only need the content hash of the function arguments. Therefore it is possible to postpone loading a cache entry by only loading its content hash. This is very small and will be enough to access further cache entries of functions that use this data. In listing 6.20, it prevents the large data from ever being loaded.
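A sketch of this lazy loading under an assumed two-file cache layout (a small <key>.hash file next to a large <key>.data file; all names here are illustrative): only the hash file is read until the full value is actually requested.

import json
import os

CACHE_DIR = "cache"

class LazyEntry:
    def __init__(self, key):
        self.key = key
        self._data = None

    @property
    def content_hash(self):
        # small read: enough to compute downstream payload hashes
        with open(os.path.join(CACHE_DIR, self.key + ".hash")) as f:
            return f.read().strip()

    def value(self):
        # large read, deferred until the data is really used
        if self._data is None:
            with open(os.path.join(CACHE_DIR, self.key + ".data")) as f:
                self._data = json.load(f)
        return self._data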

6.4.4 Provenance Caching
As mentioned in section 6.3, there are a few more challenges to provenance when caching is involved. The data caching mechanism operates solely on data and does not take provenance into account. We could rewrite the data caching, but that would be invasive and would slow down the caching in a scenario where the user does not want provenance. Therefore the task of solving potential issues lies with the provenance implementation.


Caching Provenance
The first question is the storage of auxiliary data. This might be provenance data, performance data from a pool context, or any other data that is not returned by the function but is still associated with the result. The caching implementation offers a namespace for this type of auxiliary data, which is stored parallel to the content data. A direct consequence is the necessity to invalidate the cache and rerun a function if new contexts are added, since the auxiliary data was not originally recorded. Removing a context that generates auxiliary data only requires rerunning if one wants to clean the cache and remove the now useless auxiliary data from the namespace.

Caching with different Provenance
The second question concerns the reconstruction of provenance for cached entries. Since the cache decides hits or misses via the payload hash, we have to construct the correct provenance in case of a hit. This demands that provenance is stored in a way that does not reference any other provenances, since these could be different for every cache hit. The agnostic provenance storage relates the input arguments and their effect with the output. Every input argument, no matter if it has a provenance or not, is wrapped in a proxy and observed while the function is executed. Listing 6.21 and fig. 6.17 show this on a concrete example.

Listing 6.21: caching provenance example

@context_enable
def echo(x):
    print(x)  # uses #0
    return x

@context_enable
def demo_fct(a, b):
    x = echo(a)   # CREATE-0 with #0 -> #0
    y = echo(b)   # CREATE-1 with #1 -> #0
    z = echo(y)   # CREATE-2 with CREATE-1 -> #0
    return z + x  # uses CREATE-2 and CREATE-0

with NspCacheContext(root), ProvenanceContext():
    res = demo_fct(1, 2)
    print(res)  # part A

with NspCacheContext(root), ProvenanceContext():
    res = demo_fct(1, echo(2))  # will be cached
    print(res)  # part B

First, we run demo_fct(1, 2), which is not yet cached. While running, we note what provenance-relevant functions get invoked and mark their results as CREATE-x, where x starts at 0. This is a way to label the dynamically created provenance entities.



Figure 6.17: The provenance of part A in listing 6.21. Nothing is yet cached, hence the function is run and the result, along with the agnostic provenance, stored.

Arguments are marked with #0, #1, . . . , while keywords are marked as #key, where key is a stand-in for the actual key name. Whenever a subcall happens, the mapping from call side to arguments is made in terms of the labels described above. The comments in listing 6.21 show this. While this may look like static code analysis, it is far from it; conditional expressions were left out for readability. For every concrete payload hash, e.g. demo_fct(1, 2), this agnostic provenance is recorded, and it may vary heavily in form depending on the inputs. When the same payload hash is encountered in the second call demo_fct(1, echo(2)), the cached value is used. Together with the cached value we get the provenance and can easily identify where we have to attach the provenance of the second argument echo(2): it is argument #1, and this is used by CREATE-1 in its first argument slot #0. Hence we get the provenance displayed by fig. 6.18. The important part is storing the provenance in an input-provenance-agnostic way.



Figure 6.18: The provenance of part B in listing 6.21. Since the input arguments are still 1 and 2, the caching mechanism retrieves the result from cache. Now the provenance has to be constructed, since it is not the same as in fig. 6.17.

6.5 Technical Implementation

This section covers more details about the technical implementation of the contexts seen so far.


6.5.1 Motivation
The main goal of this context-focused implementation is the seamless transition between prototype code and production code. This is especially important in computational science due to the changing questions and implementations. Starting out with a single script, the user can expand the code naturally and upgrade it with caching or advanced pools with schedulers by adding the corresponding contexts. The only requirement is the decoration of the desired functions with @context_enable. Optional specifications, like storage functions, can be provided when needed and do not need to be invasive.

6.5.2 Context
A context is a class that modifies the behavior of a function decorated with @context_enable. It has the following properties: When initialized and not yet entered, i.e. before any with statement, the context is serializable, meaning we can send it to other processes and even hosts. Further, the context can specify a subcontext method that returns a child context in case another context-enabled function is called within the first one. Once the context is entered, it gets notified about certain events and can take action; see section 6.5.4 for details about the events. One of these events is the call of any context-enabled function or the use of data originating from such functions.

6.5.3 Context Proxy
In order to make these non-invasive features work, every result returned by a context-enabled function is wrapped in a proxy object. The user will not notice this, except if one checks type(x). But this is not recommended in python anyway; rather, one would access the type information with x.__class__, which returns the type of the wrapped object. Having control over these proxy objects allows full access monitoring and notifying the context about it. As an example, for lazy caching the proxy loads the data as soon as anything is requested of the empty proxy. For schedulers, the proxy holds the future object and only blocks and waits for evaluation when needed. And provenance can track dependencies this way.
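A minimal proxy sketch (ResultProxy and the on_use callback are illustrative names, not the actual zurkon classes): attribute access is forwarded to the wrapped object while the context gets notified, and __class__ reports the wrapped type, as described above.

class ResultProxy:
    def __init__(self, wrapped, on_use):
        object.__setattr__(self, "_wrapped", wrapped)
        object.__setattr__(self, "_on_use", on_use)

    def __getattr__(self, name):
        # called for attributes not found on the proxy itself:
        # notify the context, then forward to the wrapped object
        self._on_use()
        return getattr(self._wrapped, name)

    @property
    def __class__(self):
        # x.__class__ reports the wrapped object's type
        return type(self._wrapped)

p = ResultProxy(3.14, on_use=lambda: print("result used"))
print(p.__class__)     # <class 'float'>
print(p.is_integer())  # prints "result used", then False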

6.5.4 Three Examples
Listing 6.22 shows three contexts: a cache context in parts A and B, and a pool context in part C. Figures 6.19, 6.20 and 6.21 show in detail how these contexts work.

Listing 6.22: context proxy example

@context_enable
def echo(x):
    time.sleep(1)
    return x

with Cache(root):
    x = echo(1)  # Part A
    print(x)

with Cache(root):
    x = echo(1)  # Part B
    print(x)

with Pool():
    x = echo(1)  # Part C
    print(x)

In the first part A, see fig. 6.19, we first enter the context. Upon invoking echo the following happens:

• on_pre_call: Before the call gets invoked, a context could for example modify the incoming arguments or its internal state.
• submit: Offers the option to ship off execution of this function to another location, like a process pool. Along with the function and its arguments, all subcontexts of the active contexts are sent as well. In this scenario though, we have no pool and the subcontext is entered in the same thread.
• on_pre_exec: Gets called on the execution site, before running the function.
• save_cache: Happens after the function finished successfully. The cache subcontext spawned by the cache context now stores the result.
• on_post_exec: Gets called on the execution site, after the function successfully finished and the subcontext is exited.
• on_post_call: Is called when the function is done and the result returned.
• produced: The result gets wrapped in a context proxy.
• on_use: The result is used by the user.

Part B, see fig. 6.20, is identical to part A, but now the data is cached. The steps are the same, but after on_pre_call the cache entry is found and load_cache prevents anything from being submitted.

In part C, we do not have any caching, but a pool that allows sending work to other processes. The corresponding fig. 6.21 is similar to fig. 6.19, but the subcontext and execution of echo are now handled by another thread. More advanced pools can send work to other hosts entirely.

6.5.5 Caching with multiple Hosts
When using schedulers that connect to other hosts for launching their workers, caching happens on multiple hosts. The caching context that supports caching on multiple hosts also requires one path per host for the caching location, since it most likely differs. Figure 6.21 shows such an example for two threads. Since now a potential save_cache happens on the execution host (e.g. the cluster) and not on the local host (e.g. our laptop), we need to trigger another save_cache before the on_post_call on the invoking host. The same applies for load_cache: we first check if the payload is cached on the local host, and if not, we check again right before on_pre_exec. A nice benefit of this caching system is the lower storage requirement on the local host. Only functions the user invoked directly will also be stored on the local host; any subroutine run by these directly invoked functions will only be stored on the executing host.

6.6 Conclusion
The development of tools, as well as striving for elegant and optimal conduct of scientific computing, was a large part of this thesis. We implemented a complete framework in python that allows switching from an exploratory prototyping phase to the production phase in a simple, non-invasive manner by adding the appropriate contexts. All main tasks - caching, provenance and scheduling - have been implemented using this technique. Furthermore, the tool completely automates the cluster-side workflow, eliminating errors stemming from having to work on multiple machines. While the implementation is challenging, the resulting end-user interface is convenient and offers many quality-of-life improvements compared to current solutions.


[Figure: event timeline on thread 0-T0 for echo(x=1): 0.00 enter, on_pre_call, submit, enter, on_pre_exec; 1.00 save_cache; 1.01 on_post_exec, exit, on_post_call, produced, on_use, exit.]

Figure 6.19: Part A in listing 6.22. 0-T0 is the main user thread.


[Figure: event timeline on thread 0-T0 for echo(x=1): 0.00 enter, on_pre_call, load_cache, on_post_call, produced, on_use, exit.]

Figure 6.20: Part B in listing 6.22.


[Figure: event timelines on threads 0-T0 and 0-T1 for echo(x=1): 0-T0 at 0.00 enter, on_pre_call, submit; 0-T1 at 0.00 enter, on_pre_exec, at 1.01 on_post_exec, exit; 0-T0 at 1.01 on_post_call, produced, on_use, exit.]

Figure 6.21: Part C in listing 6.22.


7. Conclusion and Outlook

7.1 Implementation

In this thesis we implemented an efficient Metropolis Monte Carlo algorithm with cluster updates to simulate quantum annealing. Its modularity allows extensions to address further questions, as well as easy maintenance. The code was made available to everyone under a permissive software license, to enable reuse where sensible. Further interesting work could be done with regard to modern classical annealers and their strategies, like iso-energetic cluster updates, parallel tempering or population annealing. Adding these in a modular way would allow mixing the quantum aspect with successful classical algorithms to get the best of both worlds, and potentially gain even more insight into what makes a good optimizer. Another open avenue would be optimizing the code itself to gain a few factors of speedup, but this should only be done once the need is apparent.
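For context, the following is a minimal sketch of the core update such an SQA code performs, assuming the standard discrete-time path-integral (Suzuki-Trotter) mapping of the transverse-field Ising model with the convention H = -sum J_ij s_i s_j; the thesis code additionally uses cluster updates along the imaginary-time direction, which this single-spin Metropolis sketch omits.

import math
import random

def sqa_sweep(spins, J, gamma, beta):
    # one Metropolis sweep; spins[k][i] in {-1, +1} is site i in Trotter slice k
    M, N = len(spins), len(spins[0])
    beta_slice = beta / M
    # ferromagnetic coupling between neighbouring slices; diverges as gamma -> 0
    j_perp = -0.5 / beta_slice * math.log(math.tanh(beta_slice * gamma))
    for k in range(M):
        for i in range(N):
            s = spins[k][i]
            d_e = 2.0 * s * sum(J[i][j] * spins[k][j] for j in range(N))
            d_e += 2.0 * s * j_perp * (spins[(k - 1) % M][i] + spins[(k + 1) % M][i])
            if d_e <= 0 or random.random() < math.exp(-beta_slice * d_e):
                spins[k][i] = -s

# frustrated antiferromagnetic triangle, annealed by ramping gamma down
N, M = 3, 16
J = [[0, -1, -1], [-1, 0, -1], [-1, -1, 0]]
spins = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(M)]
for gamma in (3.0 * (1 - t / 200) + 0.01 for t in range(200)):
    sqa_sweep(spins, J, gamma, beta=2.0)
e0 = -sum(J[i][j] * spins[0][i] * spins[0][j]
          for i in range(N) for j in range(i + 1, N))
print(spins[0], "energy:", e0)  # any ground state of the triangle has energy -1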

7.2 MaxCut: Comparing Algorithms

We highlighted why MaxCut is an interesting problem with potential for simulated quantum annealing. It seems to compete with and even beat a simple classical annealer, which warrants further investigation. Showing that simulated quantum annealing can outperform classical annealing on these MaxCut instances would be an important result and could potentially reveal more properties of MaxCut. While unweighted MaxCut is far from an easy problem, it would be interesting to run the identical comparison with harder instances to grasp the limits of these algorithms and learn even more about them.
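As a reminder of the setting (not thesis code), maximizing the cut of a graph is equivalent to minimizing the antiferromagnetic Ising energy over the graph's edges; the brute-force example below uses a made-up 4-node graph and only illustrates the objective, not a competitive solver.

import itertools

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # illustrative 4-node graph

def cut_value(spins):
    # number of edges crossing the partition defined by the spin signs
    return sum(1 for i, j in edges if spins[i] != spins[j])

best = max(itertools.product([-1, 1], repeat=4), key=cut_value)
print(best, cut_value(best))  # brute force is fine for 4 nodes, hopeless for 200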

7.3 Embedding

The question of embedding cost was covered with respect to the ground state of the embedded system as well as the decoded state. We find indications of an exponential penalty associated with the investigated embeddings when compared to non-embedded quantum annealing. Further research into the effect of advanced decoders and optimal constraint scheduling is necessary, but unlikely to offset this exponential penalty substantially. This is in part due to the avoided level crossings, which will be a source of error if not annealed properly, and which no decoding strategy we investigated was able to fix. If, at the other extreme, the decoder becomes too powerful, meaning it can recover any random state, then the decoder itself is the solver, with a corresponding scaling that will likely be worse than annealing. Nevertheless, it would be interesting to optimize schedulers and decoders to learn more about recoverable defects and the efficiency of annealers. With these initial results, however, the current planar embeddings in real devices might not be able to compete against SQA or SA methods for all-to-all connected problem classes. If this result can be supported with solid theoretical work and more data, it would stand as an important result showing the current limits of quantum annealing hardware.
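As an illustration, the simplest decoding baseline for a minor embedding is a majority vote over each chain of physical spins that represents one logical spin; the sketch below uses made-up chains and is not one of the decoders investigated here.

import random
from collections import Counter

def majority_decode(physical, chains):
    # physical: dict physical qubit -> spin; chains: logical spin -> list of qubits
    logical = {}
    for log_spin, qubits in chains.items():
        votes = Counter(physical[q] for q in qubits)
        if votes[+1] == votes[-1]:
            logical[log_spin] = random.choice([-1, +1])  # badly broken chain: tie
        else:
            logical[log_spin] = votes.most_common(1)[0][0]
    return logical

# two logical spins, each embedded as a chain of three physical qubits
chains = {0: [0, 1, 2], 1: [3, 4, 5]}
physical = {0: +1, 1: +1, 2: -1,   # chain 0 is broken but decodes to +1
            3: -1, 4: -1, 5: -1}
print(majority_decode(physical, chains))  # {0: 1, 1: -1}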

The question of the currently best planar embedding, however, remains open and interesting. Further research should go into the constraint modeling of the PAQO embedding, since it is probably overly constrained in its current state.

7.4 Fair Sampling

We showed that SQA with transverse-field-type drivers does not sample fairly and cannot be expected to do so unless the driver Hamiltonian matrix is almost dense. This result is supported by perturbation theory, direct integration on small systems, as well as large-scale simulations with advanced Metropolis Monte Carlo algorithms. It is an important negative result showing the limitation of quantum annealing compared to classical annealers where fair sampling is concerned. Futile efforts to build slightly more complex drivers in hardware to improve sampling can thus be saved. We also outlined a simple and fast way to test potential drivers.
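The following numpy sketch illustrates the kind of quick driver test meant here: diagonalize H(s) = (1 - s) H_driver + s H_problem late in the anneal and compare the ground-state weights on the degenerate classical minima. The 3-spin problem is invented for illustration; unequal weights signal biased sampling for the chosen driver.

import numpy as np

sx = np.array([[0.0, 1.0], [1.0, 0.0]])
id2 = np.eye(2)

def kron_all(ops):
    # tensor product of a list of single-qubit operators
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

n = 3
# transverse-field driver: -sum_i sigma_x on qubit i
h_driver = -sum(kron_all([sx if j == i else id2 for j in range(n)])
                for i in range(n))
# made-up diagonal classical energies with degenerate minima |000> and |011>;
# |011> has a lower-lying excited neighbour (|111> at energy 1), |000> does not
energies = np.array([0.0, 3.0, 3.0, 0.0, 3.0, 3.0, 3.0, 1.0])
h_problem = np.diag(energies)

s = 0.95  # late in the anneal, where perturbation theory applies
vals, vecs = np.linalg.eigh((1 - s) * h_driver + s * h_problem)
weights = vecs[:, 0] ** 2  # ground-state weight on each computational basis state
print("weight on |000>:", weights[0])
print("weight on |011>:", weights[3])  # larger: this driver samples unfairly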

It would be very interesting to further research iterative methods to improve fair sampling, or additional post-processing methods, which have already shown great promise in this field.

7.5 Scientific Measurement Framework

Throughout the thesis, we also highlighted many technical challenges when it comes to scalable scientific computation, presenting techniques for schedulers, reusable provenance and caching. This entire thesis is completely automated and, with the exception of measured hardware performance, exactly reproducible. The engineering challenge behind such a framework is substantial, and many lessons were learned. These lessons and experiences, not just of the author but of 20 additional PhD students, will be distilled in a separate work, a "Best Practice Guide for good Research Code". This thesis, with its ambition to be fully automated, allowed the author to assess many common problems and challenges in research programming, which helped in creating a guide that will hopefully support future students in the field of scientific programming.


IV

8 List of publications

10 Acknowledgements

11 Curriculum Vitae

Appendix


8. List of publications

[1] M. Könz, W. Lechner, H. G. Katzgraber, and M. Troyer. Exponential asymptotic scaling overhead when embedding binary optimization problems. In preparation, 2019.

[2] M. Könz, G. Mazzola, A. Ochoa, H. G. Katzgraber, and M. Troyer. Uncertain Fate of Fair Sampling in Quantum Annealing. arXiv:1806.06081 [quant-ph], 2018.

[3] M. Könz, B. Heim, and M. Troyer. SQA algorithm. In preparation, 2018.

[4] D. Herr, E. Brown, B. Heim, M. Könz, G. Mazzola, and M. Troyer. Optimizing Schedules for Quantum Annealing. arXiv:1705.00420v1 [quant-ph], 2017.

[5] A. Gaenko, A. E. Antipov, G. Carcassi, T. Chen, X. Chen, Q. Dong, L. Gamper, J. Gukelberger, R. Igarashi, S. Iskakov, M. Könz, J. P. F. LeBlanc, R. Levy, P. N. Ma, J. E. Paki, H. Shinaoka, S. Todo, M. Troyer, and E. Gull. Updated Core Libraries of the ALPS Project. arXiv:1609.03930v2 [physics.comp-ph], 2016.


10. Acknowledgements

I would like to thank Dominik Gresch and Donjan Rodic, with whom I shared an office, for the many interesting discussions about physics and software design and the general sharing of know-how. Furthermore, I would like to thank Giuseppe Carleo, Thomas Häner, Guglielmo Mazzola and Damian Steiger, members of the Zurich-based Troyer group, for the discussions and good company during travel and conferences. Further thanks goes to all current and former members I had the privilege to know and work alongside. I thank Prof. Wolfgang Lechner for the exciting ongoing collaboration and the possibility to visit the beautiful town of Innsbruck multiple times. A special thanks goes to Prof. Helmut Katzgraber for the very efficient collaboration on fair sampling; his continued support made sure that the project was finished in a short time, which was fun. Finally, I thank Prof. Matthias Troyer, my doctoral thesis supervisor, for the many interesting experiences the PhD allowed me. I appreciated the freedom to approach the scientific problems in the way we thought was best. Even though it did not always turn out to be the best way, these detours were very instructive. Despite his absence from ETH, he was always available for questions and support whenever needed, which I appreciated greatly.


11. Curriculum Vitae

Personal data

Name: Mario Silvester Könz
Date of birth: 03.12.1989
Citizen of: Scuol (GR) & Zurich (ZH), Switzerland
Nationality: Switzerland

Education

Current DOCTOR OF SCIENCES (PH.D.)
AUG 2015 "Computational Physics"
Swiss Federal Institute of Technology Zurich ETH

SEP 2014 - JUN 2015 MASTER OF SCIENCE (M.S.)
"Master of Science ETH in Physik" (Physics)
Swiss Federal Institute of Technology Zurich ETH

SEP 2012 - AUG 2014 MASTER OF SCIENCE (M.S.)
"Master of Science ETH in Interdisciplinary Sciences"
Graduation with Distinction
Swiss Federal Institute of Technology Zurich ETH

SEP 2009 - AUG 2014 BACHELOR OF SCIENCE (B.S.)
"Bachelor of Science ETH in Interdisciplinary Sciences"
Swiss Federal Institute of Technology Zurich ETH

AUG 2002 - JUL 2008 MATURA

"Best Degree of the Year award"Lyceum Alpinum Zuoz, Switzerland

Employment

Current DOCTOR OF SCIENCES (PH.D.)
AUG 2015 Teaching Assistant / Substitute Lecturer, Federal Institute of Technology Zurich, Zurich
HS17: Programming Techniques for Scientific Simulations 1
FS17: Computational Quantum Physics
HS16: Programming Techniques for Scientific Simulations 1
FS16: Theory of Heat
HS15: Programming Techniques for Scientific Simulations 2
Swiss Federal Institute of Technology Zurich ETH

United States RESEARCH INTERNSHIP

JUN 2017 - SEP 2017 Microsoft Quantum - Redmond (QuArC), Seattle
Accelerating simulated quantum annealing on custom hardware.

Switzerland TUTOR FOR EXAM PREPARATION COURSES IN PHYSICS

JUN 2012 - JUL 2012 VMP, Federal Institute of Technology Zurich, Zurich
Prepared and held lectures recapping the material of the physics courses taught in the undergraduate years at ETH.

Switzerland ASSISTANT DEVELOPER FOR THE ALPS LIBRARY

JUL 2011 - JUN 2012 Institute for Theoretical Physics, Federal Institute of Technology Zurich, Zurich
Writing high-performance C++ code for the ALPS project.


Others

Switzerland WINNER CLIMATE KIC CLIMATHON 2015 WINTERTHUR

JUN 2015 Team member of winning project 'VeloWage'

Switzerland MILITARY

JUL 2008 - AUG 2009 Mandatory Military Service "Infanterie Führungsstaffel"

International INTERNATIONAL SCIENCE OLYMPIADS

JUN 2009 Honorable Mention at the 40th IPhO (Physics) in Mexico
JUN 2008 Participation at the 40th IChO (Chemistry) in Hungary

Switzerland SWISS SCIENCE OLYMPIADS

MAY 2009 4th place in Chemistry
MAY 2009 6th place in Physics
MAY 2008 2nd place in Chemistry

Switzerland LYCEUM ALPINUM ZUOZ: CORPS OF VOLUNTEERS

OCT 2007 Restoration of a school in Romania
OCT 2006 Restoration of a school in Romania
OCT 2005 Restoration of a school in Romania