Perseus Design
Dec 13, 2015
[Figure: architecture diagram. Labels: Original Application, Behavioral Analysis, Components + Behavioral Meta-data, Design Optimization Engine, Platform Models, Optimization Heuristics, Components, Instrumentation Engine, Instrumented Code, Execution Engine, Configuration Plan]
Architecture
Behavioral “signatures” are extracted from a baseline execution
Prototype will focus on support for x86 binaries on a Linux platform
Configuration plans define application-triggered system control (affinity & power)
The plethora of variables presents a huge solution space ideal for Genetic Algorithm approaches
Platform models define number of cores and cache characteristics
Second phase of instrumentation “hooks” configuration plan into application
Behavioral Analysis Sub-system
Behavioral analysis is split in two: TEG (Temporal Execution Graph) and TMAM (Temporal Memory Access Map)
Precise data is collected on a per-thread, per-call-site basis
Binary instrumentation is facilitated by Dyninst (University of Wisconsin-Madison)
Accurate counting (e.g., processor cycles) and timing are facilitated through PAPI (University of Tennessee)
[Figure: behavioral analysis pipeline. Labels: Binary Code Instrumentation, Components Instrumented with Measurement Probes, Real Platform Execution, Raw Trace Data, Data Distillation and Model Construction, Behavioral Profile]
TEG Collection
TEG collects information about how much time the application spends executing each of its functions. Both cycle counts and timestamps are collected so that potential “slow-downs” can be identified
Per-thread, per-call-site timing and cycle count information is collected for selected function calls
Results provide timing distributions for functions, as opposed to the averages and counts reported by tools such as gprof and callgrind
Overhead depends on the density of instrumentation (i.e., number of functions + calls); in most cases it is negligible
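To illustrate why a distribution is more informative than an average, here is a minimal sketch of a per-call-site record that keeps a log2-spaced histogram of call durations. The struct layout, the names `teg_site`, `teg_bucket`, and `teg_record`, and the bucket scheme are assumptions made for illustration; the slides do not describe the actual .teg record format or how the probes store their data.

```c
#include <stddef.h>
#include <stdint.h>

#define TEG_BUCKETS 32  /* bucket i counts durations in [2^i, 2^(i+1)) cycles */

/* Hypothetical per-thread, per-call-site record (illustrative names only). */
struct teg_site {
    uint64_t call_site;          /* address of the instrumented call */
    uint64_t count;              /* number of completed calls */
    uint64_t total_cycles;       /* sum of all durations */
    uint64_t hist[TEG_BUCKETS];  /* log2-spaced duration histogram */
};

/* Position of the highest set bit, clamped to the last bucket; 0 and 1 map to bucket 0. */
static size_t teg_bucket(uint64_t cycles)
{
    size_t b = 0;
    while (cycles > 1 && b < TEG_BUCKETS - 1) {
        cycles >>= 1;
        b++;
    }
    return b;
}

/* Record one call from cycle-counter readings taken before and after it. */
static void teg_record(struct teg_site *s, uint64_t start, uint64_t end)
{
    uint64_t d = end - start;
    s->count++;
    s->total_cycles += d;
    s->hist[teg_bucket(d)]++;
}
```

Two call sites with the same `total_cycles / count` can have very different histograms, which is the kind of “slow-down” an average alone would hide.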
[Figure: TEG collection pipeline. Labels: Application Binary, TEG Instrumentor (teg.exe), Real Platform Execution (Components Instrumented with Measurement Probes), Event Data, Shared Memory Data Logger (logger.exe), TEG Binary File (.teg)]
TMAM Collection
All application reads and writes to memory are captured via probes instrumented at the binary level. This data is essential for identifying cache false sharing
Data is collected via a shared memory logger
Overhead is very high: on the order of 100x slower
At these levels we have to be careful not to affect normal behavior. Dynamic probe placement and sampling could be used to alleviate this problem
Massive volumes of data result (e.g., a 20-second program can generate 100 Gb+)
Two modes of operation: off-line analysis and real-time analysis
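As a rough illustration of how a memory access trace can reveal false sharing, the sketch below flags pairs of accesses in which different threads touch different addresses on the same cache line and at least one access is a write. The event layout, the 64-byte line size, and the function names are assumptions; the slides do not show the .tmam format or the algorithm used by conflicts.exe, and a real analysis would also have to consider access width and timing.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size; a real value would come from the platform model */

/* Hypothetical trace record; the actual event layout is not given in the slides. */
struct tmam_event {
    uint32_t thread;    /* thread instance id */
    uint64_t addr;      /* accessed address */
    int      is_write;  /* nonzero for a store */
};

/*
 * Returns 1 if a and b form a false-sharing candidate: different threads,
 * same cache line, different addresses, and at least one write.
 */
static int false_share_pair(const struct tmam_event *a, const struct tmam_event *b)
{
    return a->thread != b->thread
        && a->addr / CACHE_LINE == b->addr / CACHE_LINE
        && a->addr != b->addr
        && (a->is_write || b->is_write);
}

/* Counts candidate pairs with a simple O(n^2) scan over the trace. */
static size_t count_false_sharing(const struct tmam_event *ev, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            hits += (size_t)false_share_pair(&ev[i], &ev[j]);
    return hits;
}
```

The quadratic scan is only suitable for small windows of events; given the data volumes quoted above, a practical tool would more likely keep per-line state and process the trace as a stream.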
[Figure: TMAM collection pipelines. Off-line analysis: Application Binary, TMAM Instrumentor (tmam.exe), Real Platform Execution (Components Instrumented with Measurement Probes), Shared Memory Data Logger (logger.exe), Event Data (.tmam), Conflicts Analysis (conflicts.exe), Conflicts File. Real-time analysis: Application Binary, TMAM Instrumentor (tmam.exe), Real Platform Execution (Components Instrumented with Measurement Probes), Event Data, Real-time Data Distiller (distiller.exe), Conflicts File]
Platform Analysis
Micro-benchmarks implemented as part of the current solution empirically measure data concerning:
Number of processors, number (and values) of frequency steppings
Cost of thread migration (i.e., affinity change)
Ratios of power-to-cycles at different frequencies
Cost (in cycles) of frequency modulation
Core topology
Example Platform Information
Example data empirically collected through fine-grained on-chip timing and a micro-benchmark program
Data collected from a dual-processor, quad-core Xeon running Debian Linux. Each matrix element is shaded according to the measured latency of the migration (darker is slower).
[Figure: two 8 x 8 thread-migration latency matrices. Axes: "From Core" and "To Core", cores 1 to 8]
Figure annotation: dark regions represent “slow” migration across cores on different processors
Figure annotation: within a single processor, migration between two of the four cores is faster (suggesting that the quad-core Xeon is two dual-cores linked together)
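The observation above suggests three cost tiers for a migration. A minimal sketch of how a platform model might encode them follows; the `core_info` fields, the tier names, and any core-to-package mapping used with it are hypothetical and are not taken from the measured data.

```c
/* Hypothetical description of one core's position in the topology. */
struct core_info {
    int id;         /* logical core number */
    int processor;  /* physical package the core belongs to */
    int pair;       /* dual-core pair within that package */
};

/* Tiers ordered from cheapest to most expensive migration, per the slide's observation. */
enum migration_tier { SAME_CORE, SAME_PAIR, SAME_PROCESSOR, CROSS_PROCESSOR };

/* Classify a migration by how much of the topology the source and target share. */
static enum migration_tier classify_migration(struct core_info from, struct core_info to)
{
    if (from.id == to.id)
        return SAME_CORE;
    if (from.processor != to.processor)
        return CROSS_PROCESSOR;
    if (from.pair != to.pair)
        return SAME_PROCESSOR;
    return SAME_PAIR;
}
```

An optimizer could attach a measured cycle cost to each tier instead of storing a full core-by-core matrix, at the price of losing any per-core variation.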
Design Optimization Engine
[Figure: GA-based Optimization Engine flow. Inputs: Temporal Execution Graph, Temporal Memory Access Map, Platform Models & GA Parameters (XML). Stages: Data Transformation and Compilation, Random Population Generation, Fitness Evaluation, Tournament Selection, Reproduction, Cross-over (2-point), Mutation, Random Population Infusion, Next Generation. Output: Deployment Plan (source code & trigger point description)]
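The genetic operators named in the diagram can be sketched as follows. This assumes a chromosome is an array with one gene per thread whose value is a target core, and that fitness is expressed as a cost where lower is better; neither the encoding nor the fitness function is specified in the slides. Cut points, candidates, and mutation targets are passed in explicitly so the caller controls the randomness.

```c
#include <stddef.h>

/* Two-point crossover: the child copies parent a, except genes in [lo, hi) come from parent b. */
static void crossover_2pt(const int *a, const int *b, int *child,
                          size_t n, size_t lo, size_t hi)
{
    for (size_t i = 0; i < n; i++)
        child[i] = (i >= lo && i < hi) ? b[i] : a[i];
}

/* Tournament selection: among k candidate indices, return the one with the lowest cost. */
static size_t tournament(const double *cost, const size_t *cand, size_t k)
{
    size_t best = cand[0];
    for (size_t i = 1; i < k; i++)
        if (cost[cand[i]] < cost[best])
            best = cand[i];
    return best;
}

/* Point mutation: reassign one gene (thread) to a different core. */
static void mutate(int *chrom, size_t gene, int new_core)
{
    chrom[gene] = new_core;
}
```

For example, crossing parents {0,0,0,0} and {1,2,3,4} with cut points 1 and 3 yields the child {0,2,3,0}.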
Example Deployment Data
Deployment results are made up of trigger locations and auto-generated trigger source code
libControl.so
8048C07,Before_CS_8048C07
8048C98,Before_CS_8048C98
8048D92,Before_CS_8048D92
8048DB0,Before_CS_8048DB0
#include <pthread.h>
#include "affinity.h"
#include "fvctrl.h"
#include "triggeraux.h"

void Init_Frequency()
{
    modulate_cpu(0, 1, 0);
    modulate_cpu(1, 1, 0);
    modulate_cpu(2, 1, 0);
    modulate_cpu(3, 0, 0);
    modulate_cpu(4, 0, 0);
    modulate_cpu(5, 0, 0);
    modulate_cpu(6, 0, 0);
    modulate_cpu(7, 1, 0);
}

void Before_CS_8048D92()
{
    switch (GetThreadInstanceId()) {
    case 1: { affinize_thread(0, pthread_self()); break; }
    case 2: { affinize_thread(3, pthread_self()); break; }
    case 3: { affinize_thread(1, pthread_self()); break; }
    case 4: { affinize_thread(1, pthread_self()); break; }
    }
}
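The trigger list above appears to pair a hexadecimal call-site address with the name of a generated trigger function, preceded by the name of the control library. Under that reading, a consumer of the file could parse each entry as sketched below; the `trigger` struct and `parse_trigger` are hypothetical names, and the slides do not show how the instrumentation engine actually reads the plan.

```c
#include <stdio.h>

/* Hypothetical in-memory form of one "address,function" entry. */
struct trigger {
    unsigned long addr;  /* call-site address, parsed as hexadecimal */
    char name[64];       /* trigger function to invoke before the call site */
};

/* Parses a line such as "8048C07,Before_CS_8048C07"; returns 1 on success, 0 otherwise. */
static int parse_trigger(const char *line, struct trigger *t)
{
    return sscanf(line, "%lx,%63s", &t->addr, t->name) == 2;
}
```

Lines that do not match the pattern, such as the leading library name, are rejected, so the caller can handle them separately.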