Intel Corporation Ahmad Yasin – Top Down Analysis: never lost with Xeon perf counters – CERN workshop 2013

2nd CERN Advanced

Performance Tuning workshop

Top Down Analysis Never lost with Xeon® perf. counters

Ahmad Yasin

Intel Core™ Monitoring & Analysis

Motivation

Preface

• Performance Optimization Is Difficult

– Complicated micro-architectures

– Application/workload diversity

– Unmanageable data

– Tougher constraints – Time, Resources, Priorities

• Top Down Analysis Method

– Identify the true bottleneck in a structured hierarchical process

– Analysis is made easier for non-expert users

– Simplified hierarchy avoids the steep µ-arch learning curve

Agenda

Motivation

• Top Level Heuristics

• Top Down hierarchy

– Results

– Memory breakdown

– Frontend breakdown

• Example

– Many use-cases

• Summary

Performance Analysis

• Process

– System Level: Memory setup

– Application Level: Algorithm

– Architectural & micro-architectural Levels: Vector code, Cache misses

• Assumptions/Caveats

– CPU Bound (IA)

– Predefined analysis goal. Goal: detect bottleneck. Not-a-goal: quantify speedup

– Forward compatibility

Intel Core™ µarch

Front end of processor pipeline

Back end of processor pipeline

Where To Start In This Complex Microarchitecture?

Top Level counters are located here

Top Level Breakdown – the idea

Uop Issue?

– Yes → Uop ever Retire? Yes → Retiring; No → Bad Speculation

– No → BackEnd stall? Yes → BackEnd Bound; No → FrontEnd Bound

The Top Down Hierarchy

Systematically Find True Bottleneck with Less Guess Work

Top Level Breakdown

Uop Allocate?

– Yes → Uop ever Retire? Yes → Retiring; No → Bad Speculation

– No → BackEnd stall? Yes → BackEnd Bound; No → FrontEnd Bound

Classify Each Pipeline Slot Into 1 of 4 Categories

Cycle             1  2  3  4  5
Back End Stall    0  0  1  0  0
Alloc Slot 0      -  v  -  v  v
Alloc Slot 1      -  v  -  v  v
Alloc Slot 2      -  -  -  v  v
Alloc Slot 3      -  -  -  v  -
Frontend Bound    4  2  0     1
Backend Bound           4  0  0
Retiring             2     1  2
Bad Speculation            3  1
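The flowchart above can be read as a per-slot decision procedure. The following is a minimal sketch of that reading in C, not code from the talk: the names `slot_category`, `classify_slot`, and `tally_cycle` are hypothetical, and real hardware derives these categories from PMU events rather than by inspecting individual slots.

```c
#include <assert.h>
#include <stdbool.h>

enum { SLOTS_PER_CYCLE = 4 };   /* 4 allocation slots per cycle, as on the slide */

/* Hypothetical names for the four top-level categories. */
typedef enum {
    FRONTEND_BOUND, BACKEND_BOUND, BAD_SPECULATION, RETIRING, NUM_CATEGORIES
} slot_category;

/* Classify one allocation slot in one cycle, following the flowchart:
 * allocated uops split into Retiring / Bad Speculation, empty slots
 * split into BackEnd Bound / FrontEnd Bound. */
slot_category classify_slot(bool uop_allocated, bool uop_retires, bool backend_stall)
{
    if (uop_allocated)
        return uop_retires ? RETIRING : BAD_SPECULATION;
    return backend_stall ? BACKEND_BOUND : FRONTEND_BOUND;
}

/* Add one cycle's slots to counts[]. `allocated` slots received a uop,
 * and `retiring` of those uops eventually retire. */
void tally_cycle(int allocated, int retiring, bool backend_stall,
                 int counts[NUM_CATEGORIES])
{
    assert(0 <= retiring && retiring <= allocated && allocated <= SLOTS_PER_CYCLE);
    for (int slot = 0; slot < SLOTS_PER_CYCLE; slot++) {
        bool has_uop = slot < allocated;
        bool retires = slot < retiring;
        counts[classify_slot(has_uop, retires, backend_stall)]++;
    }
}
```

Feeding it the five cycles of the slide's table as read here (0, 2, 0, 4, and 3 allocated slots, a back-end stall in cycle 3, and 2, 1, and 2 retiring uops in cycles 2, 4, and 5) gives 7 Frontend Bound, 4 Backend Bound, 5 Retiring, and 4 Bad Speculation slots out of 20.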

Top Level Equations

• Front End Bound

– The front end delivers fewer than 4 uops per cycle while the back end of the pipeline is ready to accept uops

– IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clockticks)

• Bad Speculation

– Tracks uops that never retire, or allocation slots wasted due to recovery from branch misprediction or clears

– (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clockticks)

• Retiring

– Successfully delivered uops that eventually retire

– UOPS_RETIRED.RETIRE_SLOTS / (4 * Clockticks)

• Back End Bound

– No uops are delivered due to lack of required resources at the back end of the pipeline

– 1 - (Front End Bound + Bad Speculation + Retiring)

Just 5 Events Provide Invaluable Insight
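As an illustration, the four equations above can be evaluated from the five raw counts in a few lines of C. This is a sketch under stated assumptions: the struct and function names (`topdown_counts`, `topdown_level1`, `topdown_compute`) are hypothetical, and collecting the counter values (with VTune, perf, or another tool) is outside its scope.

```c
#include <assert.h>
#include <stdint.h>

/* Raw event counts for one measurement interval (hypothetical struct). */
typedef struct {
    uint64_t clockticks;              /* Clockticks */
    uint64_t idq_uops_not_delivered;  /* IDQ_UOPS_NOT_DELIVERED.CORE */
    uint64_t uops_issued;             /* UOPS_ISSUED.ANY */
    uint64_t uops_retired_slots;      /* UOPS_RETIRED.RETIRE_SLOTS */
    uint64_t recovery_cycles;         /* INT_MISC.RECOVERY_CYCLES */
} topdown_counts;

/* Top-level fractions of all pipeline slots; they sum to 1 by construction. */
typedef struct {
    double frontend_bound;
    double bad_speculation;
    double retiring;
    double backend_bound;
} topdown_level1;

topdown_level1 topdown_compute(const topdown_counts *c)
{
    assert(c->clockticks > 0);
    const double slots = 4.0 * (double)c->clockticks;   /* 4 slots per cycle */
    topdown_level1 m;
    m.frontend_bound  = (double)c->idq_uops_not_delivered / slots;
    m.bad_speculation = ((double)c->uops_issued - (double)c->uops_retired_slots
                         + 4.0 * (double)c->recovery_cycles) / slots;
    m.retiring        = (double)c->uops_retired_slots / slots;
    /* Back End Bound is defined as the remainder. */
    m.backend_bound   = 1.0 - (m.frontend_bound + m.bad_speculation + m.retiring);
    return m;
}
```

Because Back End Bound is computed as the remainder, any measurement noise in the other three events (for example from counter multiplexing) ends up in that category.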

Top Level for SPEC CPU2006

SPEC rate 1-copy, Intel Compiler 13, IvyBridge @ 3 GHz

Most Apps Are Backend Bound, esp. FP

INT Apps have quite a few Frontend/Bad Spec. issues

43.5% 6.5% 13% 37%

Top Down Correctly Characterizes All Workloads

VTune “new General Exploration” interface

Hover to see Metric description + formula of PMU events, or click arrow to expand a column to see a breakdown of issues pertaining to that category

• Motivation

• Top Level Heuristics

• Top Down hierarchy

• Memory breakdown

• Frontend breakdown

• Examples

• Summary

Load Bound

Backend Bound

• First distinction

– Core- vs Memory-Bound

• Memory Bound

– Loads limited by which level

– MEM Latency vs Bandwidth

– Store Issues

– Legacy tuning metrics plugged into the hierarchy

– Data Sharing, Store Forward Blocks, False Sharing, …

• Core Bound

– Non-memory core-internal issues

– Example: Divider, Execution Ports Utilization

Results: Memory-level drilldown

[Figure: Memory breakdown per SPEC CPU2006 benchmark. Y-axis: fraction of total Clockticks, from -0.20 to 0.70. Series: L1 Bound, L2 Bound, L3 Bound, MEM Bound, Stores Bound.]

INT: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk

FP: 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.Calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3

Memory & multi-core (1-copy vs 4-copy)

Source: http://www.jaleels.org/ajaleel/workload/

• Motivation

• Top Level Heuristics

• Top Down hierarchy

• Memory breakdown

• Frontend breakdown

• Examples

• Summary

FrontEnd Bound

• FrontEnd issues

– Less encountered in traditional client/HPC, more common in servers/enterprise

• Breakdown

– Rough Frontend Latency vs. Bandwidth classification

– Frontend Latency: intervals with uop delivery starvation. Buckets: i-Cache Miss, iTLB Miss, Branch Resteers

– Frontend Bandwidth: intervals when a non-optimal number of uops is supplied per cycle. Breakdown by fetch source unit (DSB, MITE, LSD)

Results: Frontend drilldown

[Figure: three panels for the SPEC CPU2006 INT benchmarks (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk) and the INT i-geomean.]

Top Level: Frontend Bound, Bad Speculation, Retiring, Backend Bound (axis 0 to 1.00)

Frontend breakdown: Frontend Latency, Frontend Bandwidth (axis 0 to 0.50)

Frontend Latency breakdown: ICache Misses, ITLB misses, Branch Resteers, DSB switches, Slots wasted per mispredict (axes 0 to 0.08 and 50 to 110)

Note the difference in misprediction cost

Frontend

Enterprise: Latency Bound

“Client”: Bandwidth Sensitive

Hold on… but why does this differ?

• Top Down utilizes designated PMU heuristics

– IDQ_UOPS_NOT_DELIVERED

– CYCLE_ACTIVITY.STALLS_L2_MISS

• Naïve methods are often inaccurate

– Example: Counted_Stalls = Σ Fixed_Penalty_i * Number_i

– Many Issues

– Assumes stalls are sequential!

– Speculation not well handled

– Fixed penalty for all workloads

– Restriction to a pre-defined set of miss-events

– Superscalar oblivious
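To make the naïve formula concrete, here is a minimal C sketch of Counted_Stalls = Σ Fixed_Penalty_i * Number_i. The `miss_event` struct, the function name, and all numbers in the note that follows are hypothetical and chosen only to illustrate the "assumes stalls are sequential" problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One class of miss event in the naive model (hypothetical struct). */
typedef struct {
    uint64_t count;          /* Number_i: how often the event occurred */
    uint64_t fixed_penalty;  /* Fixed_Penalty_i: assumed cycles lost per event */
} miss_event;

/* Counted_Stalls = sum over i of Fixed_Penalty_i * Number_i.
 * Every event is charged its full penalty, as if no two stalls ever
 * overlapped and no work proceeded underneath a miss. */
uint64_t naive_counted_stalls(const miss_event *events, size_t n)
{
    assert(events != NULL || n == 0);
    uint64_t stalls = 0;
    for (size_t i = 0; i < n; i++)
        stalls += events[i].fixed_penalty * events[i].count;
    return stalls;
}
```

With two made-up event classes, 10,000 events at 12 cycles and 5,000 events at 200 cycles, the estimate is 1,120,000 stall cycles. If the run itself took only 1,000,000 cycles, the model reports more stall time than there was execution time, which can happen whenever misses overlap in an out-of-order core.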

EXAMPLE 1: MATRIX MULTIPLY

Un-tuned

Loop Interchange

void matrix_multiply(void)
{
    // Multiply the two matrices
    // i-j-k order: the innermost loop walks down a column of matrix_b
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLUMNS; j++) {
            for (int k = 0; k < COLUMNS; k++) {
                matrix_r[i][j] = matrix_r[i][j] + matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
}
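The transcript keeps only the i-j-k loop; the tuned code is not preserved. As a sketch of what loop interchange usually means for this kernel, the version below swaps the two inner loops so that the innermost loop accesses `matrix_b` and `matrix_r` along a row. The matrix sizes, the `double` element type, and the function name `matrix_multiply_interchanged` are assumptions, not taken from the slides.

```c
#define ROWS    64   /* hypothetical size; the slides do not give one */
#define COLUMNS 64   /* hypothetical size; the slides do not give one */

/* Assumed element type and shapes, matching the indexing in the original loop. */
static double matrix_a[ROWS][COLUMNS];
static double matrix_b[COLUMNS][COLUMNS];
static double matrix_r[ROWS][COLUMNS];

/* Same arithmetic as the i-j-k version, with the j and k loops swapped. */
void matrix_multiply_interchanged(void)
{
    for (int i = 0; i < ROWS; i++) {
        for (int k = 0; k < COLUMNS; k++) {
            const double a_ik = matrix_a[i][k];   /* invariant in the j loop */
            /* Unit-stride accesses in C's row-major layout: consecutive j
             * values touch consecutive elements of matrix_b[k] and matrix_r[i]. */
            for (int j = 0; j < COLUMNS; j++) {
                matrix_r[i][j] += a_ik * matrix_b[k][j];
            }
        }
    }
}
```

The result is the same sum of products per element; only the order of the additions changes, which matters for cache behavior and may change floating-point rounding slightly.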

Loop Interchange

Vectorization

Example 2: False Sharing

• Field threading example

– By UIUC class using VTune

– Single-threaded compute-bound kernel is parallelized

– 1st attempt shows no speedup due to false sharing

– Backend.Memory.StoreBound is highlighted

– 2nd attempt works: a 3.8x speedup is achieved and the code is compute-bound again
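The slides do not show the code from this example. As a general illustration of false sharing and one common remedy, the sketch below contrasts a packed per-thread counter array with a layout that gives each counter its own cache line. The 64-byte line size, the thread count, and all type names are assumptions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */
#define NUM_THREADS 4        /* hypothetical thread count */

/* Packed layout: neighboring counters share a cache line, so stores from
 * different threads keep invalidating each other's copy of that line even
 * though no thread reads another thread's counter (false sharing). */
struct packed_counters {
    long count[NUM_THREADS];
};

/* Padded layout: aligning the member to the line size also rounds the
 * struct size up to one full line, so array elements never share a line. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) long count;
};

struct padded_counters {
    struct padded_counter per_thread[NUM_THREADS];
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each padded counter should occupy exactly one cache line");
```

The trade-off is memory: the padded array takes `NUM_THREADS * CACHE_LINE_SIZE` bytes, whereas the packed one fits all counters in a single line.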

Example 3: Software prefetching

Original Code Tuned (1.35x speedup)

Prefetching can help Memory Latency Bound Apps. Use Carefully
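The original and tuned code from this slide are not preserved in the transcript. The sketch below shows the general shape of software prefetching on an indirect access pattern, using the GCC/Clang builtin `__builtin_prefetch`. The function name, the access pattern, and `PREFETCH_DISTANCE` are hypothetical, and the distance would need tuning per workload and machine.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* hypothetical look-ahead, in iterations */

/* Sum table[index[i]] for i in [0, n). The addresses depend on index[],
 * a pattern that hardware prefetchers often cannot predict. */
double gather_sum_prefetch(const double *table, const size_t *index, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Hint that this element will be read soon (rw = 0 for read,
             * locality = 1 for low temporal locality). It is only a hint. */
            __builtin_prefetch(&table[index[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += table[index[i]];
    }
    return sum;
}
```

The prefetch never changes the result, only the timing. If the distance is too short the data arrives late; if it is too long, or the code is not latency bound in the first place, the extra instructions and cache pollution can slow the loop down, which is why the slide advises care.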

Example 4: Microarchitecture comparison

• Haswell (4th Core gen) has improved front-end

– Speculative iTLB and cache accesses with better timing to improve the benefits of prefetching

• Benefiting benchmarks clearly show reduction in Frontend Bound

Using Top Down, forward compatibility is assured on Intel Core™

Enterprise Challenges

Software

• LARGE

– Data and Code size

– # modules/developers

• Un-optimized code

– E.g. x87

– Dead code

– JITed

• Cloud era: Virtualized, …

PMU/Tools

• Counter Multiplexing

• Hyper-Threading

• Precise profiling accuracy *

• Long-tail profiles

– Streams across modules

• Data Profiling

• …

* A joint work with CERN openlab

[Figure labels: Classic, Precise, Error, Error]

Summary

• Top Down Analysis

– An effective method to identify the true bottleneck

– Google “Ahmad Yasin Intel” – for the ISCA’13 talk/article links

• Integrated into VTune™, Linux perf toplev wrapper, and other tools

• Forward compatibility on Intel Core™ platforms

Try it out and share your feedback

Links

• Whitepaper

– How to Tune Applications Using a Top-down Characterization of Microarchitectural Issues

– http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues

• Tools

– VTune Amplifier XE 2013 (Update 8 or later)

– Basic support in PBA – Performance Bottleneck Analyzer

– ocperf / toplev – A wrapper on top of the Linux perf utility

• Tutorial on Analysis Methodologies and Tools – ISCA’2013 – https://sites.google.com/site/analysismethods/isca2013/program-1

• Questions or feedback – + [email protected]


EXAMPLE 3: PINPOINT A SUBTLE MEMORY ISSUE

Memory Bound breakdown* for Spec FP, on Ivy Bridge

[Figure: per-benchmark chart, axis 0 to 0.70. Series: L1 Bound, L2 Bound, L3 Bound Estim., DRAM Bound Estim.]

Hierarchy shown: Backend Bound → Core Bound, Memory Bound; Memory Bound → L1, L2, L3, Mem

Sandy Bridge field example: Pinpoint Memory Issue across-functions in 465.tonto

– Step 1: Top Level

Frontend Bound 0.01, Bad Speculation 0.00, Retiring 0.47, Backend Bound 0.52

Drill down on Back-End and only Back-End

– Step 2: Back-End Level breakdown (indirectly inferred in SNB)

Resource Stalls: MEM_RS 0.43, OOO 0.10

Execution Busy: Port 0 0.45, Port 3 0.3, Port 5 0.38

IvyBridge added a flavor to CYCLE_ACTIVITY which directly detects Memory Bound

– Step 3: Memory Level breakdown

Load Per Inst 0.267, L1 Hits 0.988, L1 misses penalty 0.004

IvyBridge added a flavor to CYCLE_ACTIVITY which measures L1 Bound

– Step 4: Mem L1D blocks

% Loads with L1 Blocks: 0.09

Loads blocked due to Store Forwards (penalty cycles %): 0.10

Top Down Analysis relies on designated PMU events

Stream#  Block#  Instr#  Function     RIP          ASM                                        Line comment
0        0       0       SHELL2_MO..  0x140193E58  mov r9,qword ptr [rbp+2e58]
0        0       1       SHELL2_MO..  0x140193E5F  lea rcx,ptr [rbp+23a0]                     sparing area for parameters &
0        0       2       SHELL2_MO..  0x140193E66  mov r10,qword ptr [rbp+2700]               returned value on stack
0        1       7       cexp         0x14009CACD  mov qword ptr [rsp+b0],rcx
0        4       6       cexp         0x14009CDE2  addpd xmm2,xmm6
0        4       7       cexp         0x14009CDE6  mulpd xmm0,xmm2                            calculations…
0        4       8       cexp         0x14009CDEA  movq xmm1,xmm0
0        4       9       cexp         0x14009CDEE  pshufd xmm0,xmm0,e
0        4       10      cexp         0x14009CDF3  mov rcx,qword ptr [rsp+b0]
0        4       11      cexp         0x14009CDFB  movq qword ptr [rcx],xmm0                  store result on stack
0        4       17      cexp         0x14009CE26  ret
0        5       0       SHELL2_MO..  0x140193EC4  vmulpd xmm1,xmm15,xmmword ptr [rbp+23a0]   Load using cexp() returned data
0        5       1       SHELL2_MO..  0x140193ECC  vmovddup xmm0,qword ptr [rbx+r12*1]
0        5       2       SHELL2_MO..  0x140193ED2  inc r15
0        5       10      SHELL2_MO..  0x140193EFF  jb 1.40E+63