Optimizing Thread Performance for a Genomics Variant Caller

Apr 12, 2017


AllineaSoftware
Transcript
Page 1: Optimizing thread performance for a genomics variant caller

Optimizing Thread Performance for a Genomics Variant Caller

Page 2: Optimizing thread performance for a genomics variant caller

This talk

• Introduce two tools that can help improve the performance of multithreaded code

• Apply the tools to a real world Genomics code

Page 3: Optimizing thread performance for a genomics variant caller

Tool 1: Allinea Performance Reports – benchmarking and characterization

Page 4: Optimizing thread performance for a genomics variant caller

Tool 2: Allinea Forge - Debugging and Profiling

• Debug and profile from one interface, with one configuration

• Secure native remote and local access

• Rapidly switch between the tasks

• Edit, build, commit, debug, profile, optimize..

Page 5: Optimizing thread performance for a genomics variant caller

Small data files

<5% slowdown

No instrumentation

No recompilation

Our profiler finds the performance bottlenecks

Page 6: Optimizing thread performance for a genomics variant caller

Our debugger helps with bugs and performance

• Observe why workload is imbalanced

• Observe why particular code paths are followed

• ... and fix any bugs that optimization creates!

Page 7: Optimizing thread performance for a genomics variant caller

Above all…

• The tools are aimed at any performance problem that matters

– Focus on time: the ultimate judge of performance

• Do not prejudge the problem

– Don’t assume it’s MPI messages, threads or I/O before profiling!

• If there’s a problem..

– Allinea Performance Reports shows it, and advises you on solutions

– Allinea Forge’s profiler shows it, next to your code

Page 8: Optimizing thread performance for a genomics variant caller

6 steps to improve performance

Get a realistic test case

• Performance on real data matters

• Keep the test case for reference and re-use

Profile your code

• Add the “-g” flag to your compilation

• Run with a profiler

Look for the significant

• Which part/phase of the code dominates time?

• Is there any unexpected significant time use?

What is the nature of the problem?

• Compute? I/O? MPI? Thread synchronization?

• Display the metrics that show the problem best

Apply brain to solve

• MPI – can you balance the work better?

• Compute – is memory time dominant – can you improve layout?

Think of the future

• Try larger process or thread counts to watch for scalability problems

• Keep the profile (.map file) for future comparison

Page 9: Optimizing thread performance for a genomics variant caller

Example: Improving Thread Usage in Genomics

• DISCOVAR

– Variant caller and small genome assembler

– Sub-mammalian sized genomes

– Newer DISCOVAR de novo for larger genomes

• C++ and OpenMP

• Developed by the Broad Institute of MIT and Harvard

Page 10: Optimizing thread performance for a genomics variant caller

A first look – on real hardware

• It’s not I/O intensive

• Good quantity of OpenMP time

• No vectorization

Page 11: Optimizing thread performance for a genomics variant caller

OpenMP in detail

• Physical cores are 200% loaded: hyperthreading is on

• 17% of parallel region time is synchronization

• ... That’s quite high

Page 12: Optimizing thread performance for a genomics variant caller

Investigating the OpenMP synchronization

• Horizontal time axis: colour coded

– Dark green – single core

– Light green – OpenMP work

– Light blue – pthread synchronization

– Gray – idle

• Vertical axis

– Number of cores doing something

• Something’s very wrong towards the end – with all the gray

Page 13: Optimizing thread performance for a genomics variant caller

Zoom in on the region

• Stacks, code, regions, time are all focused on zoom area

• Key observation:

– OpenMP region with “omp critical” is where the time is being wasted

Page 14: Optimizing thread performance for a genomics variant caller

Fixing

• #pragma omp critical

– Executes exactly one thread at a time to ensure safety

• Is costing too much

– Passing “token” from thread to thread to do small pieces of work

• Run whole section on one thread instead

– Has same semantics

Page 15: Optimizing thread performance for a genomics variant caller

Impact of change

• Runtime down by 7%

Page 16: Optimizing thread performance for a genomics variant caller

As a performance report

• Improvements in

– Runtime

– Synchronization overhead

Page 17: Optimizing thread performance for a genomics variant caller

Let’s try something bigger – into Amazon cloud!

• C4.8xlarge

– 36 hyperthreaded cores

– 60 GB RAM

– Xeon E5-2666 v3 Haswell

– 25 MB cache

– 2.6 GHz

vs

• Our physical server

– 24 hyperthreaded cores

– 24 GB RAM

– Xeon E5-2407 v2

– 10 MB cache

– 2.4 GHz

$ ./runme.sh

discovar version: Discovar r52488

loadaverage: 0.05 0.98 1.36 1/790 16317

2015-07-27 07:57 PERF: REAL 835.857 USER 36.188 SYSTEM 5.441 PERC 4.71

835 seconds to run on EC2

… vs …

~448 seconds on our physical server

Why?

Page 18: Optimizing thread performance for a genomics variant caller

Profile with Allinea Forge to find where the problem is

• Focus on initial 300 seconds: something must be wrong here

• Serious lack of good “green” compute

Page 19: Optimizing thread performance for a genomics variant caller

In detail…

• 36 threads, waiting… but who is using madvise?!

Page 20: Optimizing thread performance for a genomics variant caller

Why is glibc so bad?

• madvise system call in _int_free()

– At least two context switches each call ...

– This glibc version has issues…?

• What other options are there?

Page 21: Optimizing thread performance for a genomics variant caller

Maybe Google TCMalloc?

• Optimized for multi-threaded applications

• No win

– Same run time

– Issue is use of sys_futex, not madvise

• ... Not optimized for this multithreaded application!

Page 22: Optimizing thread performance for a genomics variant caller

Jemalloc?

• As recommended by the Broad Institute

• … same runtime

Page 23: Optimizing thread performance for a genomics variant caller

Jemalloc – same problem

• Source proves the issue again…

Page 24: Optimizing thread performance for a genomics variant caller

Can Intel libraries help?

• We try the Intel TBB multithreaded allocator

• 14 minutes down to 10 minutes!

• .. But still this code has scope for more…
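The talk does not say how the allocator was swapped in. One in-source way to experiment, sketched here under assumptions, is to route hot containers through an allocator alias. `USE_TBB_ALLOCATOR`, `WorkAllocator`, `WorkVector` and `churn` are hypothetical names; `tbb::scalable_allocator` is the TBB allocator class template and needs the TBB headers and its malloc library at link time. The default build below falls back to `std::allocator`.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

#ifdef USE_TBB_ALLOCATOR  // hypothetical build switch, not from the talk
#include <tbb/scalable_allocator.h>
template <class T>
using WorkAllocator = tbb::scalable_allocator<T>;
#else
template <class T>
using WorkAllocator = std::allocator<T>;
#endif

template <class T>
using WorkVector = std::vector<T, WorkAllocator<T>>;

// Allocation-heavy loop: every iteration allocates and frees a scratch
// buffer. With many threads this pattern stresses the allocator's free path.
long long churn(int rounds, std::size_t size) {
    long long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int r = 0; r < rounds; ++r) {
        WorkVector<int> scratch(size, 1);  // allocate
        total += std::accumulate(scratch.begin(), scratch.end(), 0LL);
    }                                      // free at scope exit
    return total;
}
```

Keeping the allocator behind one alias means each candidate can be profiled on the same test case without touching the loop itself.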

Page 25: Optimizing thread performance for a genomics variant caller

Real optimization of OpenMP regions

• NB – still profiling for first 300 seconds only

• Significant inactivity in final 60 seconds

• OpenMP region

– #pragma omp parallel for

• Is it working?

– No – the threads are idle

• Let’s remove it
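The slides do not show the loop or say why its threads were idle, and the fix used in the talk was simply to delete the pragma. A gentler alternative, if the cause is too little work to repay the cost of starting a parallel region, is OpenMP's `if` clause, which keeps the loop serial below a size threshold. This is a swapped-in technique, not the talk's change; the function name and threshold value are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical threshold: below this many elements the loop stays serial.
constexpr std::ptrdiff_t kParallelThreshold = 100000;

void scale_in_place(std::vector<double>& values, double factor) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(values.size());
    // The if clause makes the region run on one thread when the condition
    // is false, so small inputs avoid the fork/join overhead.
    #pragma omp parallel for if(n >= kParallelThreshold)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        values[i] *= factor;
    }
}
```

A sensible threshold has to come from profiling on real data; the value here is a placeholder.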

Page 26: Optimizing thread performance for a genomics variant caller

After the first fix…

• Now able to run to completion

– 358 seconds

• Still inactivity at end of run

Page 27: Optimizing thread performance for a genomics variant caller

Zoomed to the inactivity…

• Another OpenMP region

• Quick edit: comment out the OpenMP, again!

Page 28: Optimizing thread performance for a genomics variant caller

… and the impact

• Down to 304 seconds

Page 29: Optimizing thread performance for a genomics variant caller

Finally… something to sort out

• Recursive, in-place multithreaded sorter

• Does not scale well with thread count

• Options?

– Re-engineer

– Replace

– Tune

Page 30: Optimizing thread performance for a genomics variant caller

Let’s tune

• Try limiting the thread pool to 8 workers

– Better than 36 clashing threads?
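DISCOVAR's sorter is not shown in the slides, so the following is a stand-in: a recursive, in-place quicksort built on OpenMP tasks, with the worker count capped by the `num_threads` clause. All names and the cutoff value are hypothetical; only the cap of 8 comes from the slide.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int kMaxSortWorkers = 8;                 // cap taken from the slide
constexpr std::ptrdiff_t kSerialCutoff = 1 << 14;  // hypothetical cutoff

// Recursive in-place quicksort. Each half becomes an OpenMP task; small
// ranges fall back to std::sort to avoid creating tiny tasks.
void sort_range(int* first, int* last) {
    if (last - first <= kSerialCutoff) {
        std::sort(first, last);
        return;
    }
    const int pivot = *(first + (last - first) / 2);
    // Three-way partition: [< pivot | == pivot | > pivot]. The middle block
    // is never empty, so both recursive ranges are strictly smaller.
    int* mid1 = std::partition(first, last,
                               [pivot](int v) { return v < pivot; });
    int* mid2 = std::partition(mid1, last,
                               [pivot](int v) { return v == pivot; });
    #pragma omp task firstprivate(first, mid1)
    sort_range(first, mid1);
    #pragma omp task firstprivate(mid2, last)
    sort_range(mid2, last);
    #pragma omp taskwait
}

void capped_parallel_sort(std::vector<int>& data) {
    int* first = data.data();
    int* last = first + data.size();
    // At most kMaxSortWorkers threads execute tasks, however many cores
    // the machine exposes.
    #pragma omp parallel num_threads(kMaxSortWorkers)
    {
        #pragma omp single
        sort_range(first, last);
    }
}
```

Whether 8 is the right cap depends on the machine and the data; the point of the sketch is that the limit is one constant that can be varied between profiling runs.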

Page 31: Optimizing thread performance for a genomics variant caller

Result…

• Runtime 4.7 minutes

• 3x improvement on original

• #1 position on the Broad Benchmark list for a sub-$2 / hour system!

Page 32: Optimizing thread performance for a genomics variant caller

Lessons learned

• Real codes exhibit many different performance patterns

– Profiling real data sets at real scales is vital to target the effort

– Small test cases do not expose all the problems

– Small thread counts can be too small to find real problems

• Changing code can be simple

– Use threads wisely – it will not always be faster

– Changing libraries – someone else might have fixed your problem

• Re-engineering is sometimes necessary

– Take advantage of vector units

– Take advantage of threads

Page 33: Optimizing thread performance for a genomics variant caller

Increase the performance of your software

Analyze and tune with Allinea Performance Reports

Develop, profile and debug applications with Allinea Forge

With professional support when you need it most