The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology

Benchmarking your applications
With lots of shiny pictures

Tennessee Leeuwenburg
20 August 2011
www.cawcr.gov.au
What is Benchmarking?
Originally (circa 1842) a mark cut into stone by land surveyors to secure a "bench" (19th-century land surveying jargon for a type of bracket) on which to mount measuring equipment
Another claim is that the term benchmarking originated with cobblers measuring people’s feet
Benchmarking is important to anyone who wishes to ensure a process is consistent and repeatable.
Images are attributed in the PPT notes
What is Benchmarking?
• Evaluate or check (something) by comparison with a standard: "we are benchmarking our performance against external criteria".
• Fundamentally, it’s evaluating performance through measurement
Competitor Analysis

[Chart: monthly values, January through December, comparing "How good they are" against "How good we are". Image by the author, fictional data]
PyPy: A Concrete Software Example
• PyPy doesn’t use benchmarker.py; they have a custom benchmark execution rig
• They have concentrated on building a system for visualising performance, called “CodeSpeed”.
• They have chosen to measure speed. They could have focused on memory, or network performance, or anything else that makes sense for their “business”.
• However, speed is one of the most important aspects of a language, and one of the biggest reasons for someone not to choose PyPy instead of standard CPython
PyPy Benchmark against CPython
Image source: speed.pypy.org
Benchmarking to Drive Development
• “What gets measured gets done” – Peter Drucker
• Benchmarking introduces performance into the feedback loop that drives our activity at work. Hide the information, and you hide the pressure. Publicise the information, and you increase the pressure.
• It’s a tool for raising the profile of what you are measuring. It says performance is important.
• To improve performance, first measure it. (Of course, it’s not the only way, but it helps)
• This makes selecting your measurement important… measure something meaningful
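As a minimal illustration of "first measure it", the stdlib timeit module can put a number on two candidate implementations before any tuning. This example is not from the slides; the two string-building functions are made up for the sketch:

```python
import timeit

def join_concat(n=1000):
    # Build a string the slow way: repeated concatenation.
    s = ""
    for i in range(n):
        s += str(i)
    return s

def join_fast(n=1000):
    # Build the same string with str.join.
    return "".join(str(i) for i in range(n))

# Measure both candidates the same way: same repeat count, same input.
slow = timeit.timeit(join_concat, number=200)
fast = timeit.timeit(join_fast, number=200)
print(f"concat: {slow:.4f}s  join: {fast:.4f}s")
# Never benchmark two things that differ in output:
assert join_concat() == join_fast()
```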
Performance over Time
Image source: speed.pypy.org
A word of warning…
http://xkcd.com/605/
What to compare against?
• Benchmarking over time (by revision or date)
  • The most normal kind of benchmarking is a historical comparison of past performance. This lets you understand what, if any, progress is being made in the application
• Benchmarking by configuration
  • If an application has multiple configurations, especially if it has to run over a larger data set in production than in test, benchmarking those differences can be important
• Benchmarking by hardware
  • Benchmarking by hardware has the obvious advantage that you can evaluate the impact of purchasing a hardware upgrade
• If there is a direct competitor, and you have their code, you can benchmark your procedures against theirs. But this is unlikely.
• Some applications may have standard trials and tests
Benchmarking to Notice Problems
• Benchmarking can also solve specific problems by bringing them to your attention
• Most usefully, it can highlight when something bad goes in with a commit
• This graph shows a timeline of commits
• No need to worry about performance ahead of time
• Something slow went in with revision 10
• So go fix it!
• Best of all, bounce the commit!
[Chart: "Time Taken" against revision number 1-10, with a sharp jump at revision 10. Image by the author, fictional data]
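Flagging "something slow went in with revision 10" can be automated over a recorded history of per-revision timings. A sketch with fictional data shaped like the chart above; the 1.5x threshold is an arbitrary illustrative choice:

```python
def find_regressions(times_by_revision, threshold=1.5):
    """Return revisions whose time exceeds `threshold` x the previous revision's.

    `times_by_revision` is a list of (revision, seconds) pairs in commit order.
    """
    flagged = []
    for (prev_rev, prev_t), (rev, t) in zip(times_by_revision, times_by_revision[1:]):
        if t > prev_t * threshold:
            flagged.append(rev)
    return flagged

# Fictional data: steady until revision 10, then a jump.
history = [(r, 2.0) for r in range(1, 10)] + [(10, 11.0)]
print(find_regressions(history))  # -> [10]
```

A CI hook could run this after each benchmarked commit and bounce the offending revision.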
What is benchmarking of software?
• This question has a few parts, being:
  • What is measurable about software?
  • What should be the basis for comparison?
  • What standards for comparison exist?
• Most software benchmarking is about speed. Why?
  • It’s easiest to measure
  • It’s important
  • Most people understand speed of execution and what it’s like for an application to be unresponsive for the user
  • It’s often easy to fix
• But…
  • Memory, disk and networking?
  • User acceptance and use?
You too can benchmark your Python code!
• Benchmarker.py … this thing I wrote and would like to share
• Benchmarker.py will collect all this data for you. It Just Works (YMMV).
• Benchmarker.py is a tool which measures and reports on execution speed. It utilises the cProfile Python module to record statistics in a historical archive
• Benchmarker.py has an integration module for CodeSpeed, a website for visualising performance.
• Your manager will love it! (YMMV)
Introducing Benchmarker.py
• Easily available!
  • easy_install decorator.py. Easy to install!
  • https://bitbucket.org/tleeuwenburg/benchmarker.py/ Grab the source!
• Easy to follow tutorials!
  • https://bitbucket.org/tleeuwenburg/benchmarker.py/wiki/FirstTutorial
• Easy to use!
  • Simple syntax: simply decorate the function you would like profiled; no complex function execution required
  • Or, integrate directly with py.test to use it without any code modification at all
• Test-driven benchmarking
  • Because benchmarking in operations will slow the app down
I’m trying to avoid this…
Image source: icanhascheezburger.com
How to Use Benchmarker.py
In [2]: import bench
In [3]: import bench.benchmarker
In [4]: from bench.benchmarker import benchmark
In [5]: @benchmark()
   ...: def foo():
   ...:     for i in range(100):
   ...:         pass
   ...:
In [6]: foo()
In [7]: bench.benchmarker.print_stats()

100 function calls in 0.005 CPU seconds
Random listing order was used
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     0    0.000    0.000                    profile:0(profiler)
   100    0.005    0.000    0.005    0.000  <ipython console>:1(foo)
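For comparison, the same kind of measurement can be made with the stdlib cProfile and pstats modules that benchmarker.py builds on. This is plain cProfile usage, not benchmarker.py's own API:

```python
import cProfile
import io
import pstats

def foo():
    for i in range(100):
        pass

# Record a profile of one call to foo().
profiler = cProfile.Profile()
profiler.enable()
foo()
profiler.disable()

# Print the same kind of ncalls/tottime/cumtime report shown above.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
print(buf.getvalue())
```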
Creating a good historical archive
• The key is to maintain a good historical archive. This means a certain amount of integration with your tool chain, but if you are using py.test it’s easy.
demo_project]$ find /tmp/bench_history
/tmp/bench_history
/tmp/bench_history/demonstration
/tmp/bench_history/demonstration/Z400
/tmp/bench_history/demonstration/Z400/full_tests
/tmp/bench_history/demonstration/Z400/full_tests/2011
/tmp/bench_history/demonstration/Z400/full_tests/2011/07
/tmp/bench_history/demonstration/Z400/full_tests/2011/07/25
/tmp/bench_history/demonstration/Z400/full_tests/2011/07/25/2011_07_25_06_19.pstats
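The layout above (project / host / test run / year / month / day / timestamped .pstats file) is simple enough to reproduce in any rig. A sketch of building such a path with the stdlib; the values "demonstration", "Z400" and "full_tests" are just the ones from the listing, and this is not benchmarker.py's actual internal code:

```python
import datetime
import os

def archive_path(root, project, host, run_name, when=None):
    """Build an archive path in the style shown above:
    root/project/host/run_name/YYYY/MM/DD/YYYY_MM_DD_HH_MM.pstats
    """
    when = when or datetime.datetime.now()
    day_dir = os.path.join(root, project, host, run_name,
                           f"{when:%Y}", f"{when:%m}", f"{when:%d}")
    filename = f"{when:%Y_%m_%d_%H_%M}.pstats"
    return os.path.join(day_dir, filename)

when = datetime.datetime(2011, 7, 25, 6, 19)
print(archive_path("/tmp/bench_history", "demonstration", "Z400", "full_tests", when))
# On POSIX systems:
# -> /tmp/bench_history/demonstration/Z400/full_tests/2011/07/25/2011_07_25_06_19.pstats
```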
Choosing what to pay attention to
• One of the fundamental choices when benchmarking is what to watch. Nothing can automate this, although choosing the ten most expensive functions is probably not a bad first try. Options include:
• Watching the most expensive functions
• Watching the most common user operations
• Hand-selecting a mix of “inner loop” type functions and “outer loop” type functions
• “Critical path” functions that can’t execute in the background or be avoided
• Crafting a watch list based on a specific objective or system component
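That first try, the ten most expensive functions, can be read straight out of a recorded profile with the stdlib pstats module. A sketch: the workload function is made up, and the profile is round-tripped through a .pstats file to mirror the archive format above:

```python
import cProfile
import io
import os
import pstats
import tempfile

def busy():
    # A made-up workload for illustration.
    total = 0
    for i in range(10000):
        total += i * i
    return total

# Record a profile and dump it to disk, as an archive would.
profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()
path = os.path.join(tempfile.gettempdir(), "watchlist.pstats")
profiler.dump_stats(path)

# Reload the recorded profile and report only the ten most expensive
# entries by total in-function time ("tottime").
buf = io.StringIO()
pstats.Stats(path, stream=buf).sort_stats("tottime").print_stats(10)
print(buf.getvalue())
```

Sorting by "cumulative" instead would surface outer-loop functions rather than inner-loop ones.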
Figuring it out the first time
• Before setting up the list of watched functions for the graph server, try opening the file in a spreadsheet. Benchmarker comes with a CSV export mode.
Function name                               # of calls   Total time   % Total   Cumulative % Total
<_AFPSSup.ReferenceData_grid>                    33929         1217        21                   21
_getLandOrWaterRefData                             544          860        15                   36
<compile>                                        10584          839        14                   51
<_AFPSDB.Parm_saveParameter>                     11476          662        12                   63
<_AFPSSup.IFPClient_getReferenceData>            34386          386         7                   69
<_AFPSSup.ReferenceData_pyGrid>                  25257          313         5                   75
<_AFPSSup.new_HistoSampler>                       2356          289         5                   80
shuffle                                           4746          224         4                   84
<_AFPSSup.IFPClient_getTextData>                  9864          112         2                   86
<_AFPSSup.IFPClient_getParmList>                  2299           90         2                   87
<_AFPSSup.IFPClient_getReferenceInventory>        1345           79         1                   89
<_AFPSSup.IFPClient_getTopoData>                  1173           60         1                   90
Data taken from BoM application tests
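The CSV export itself isn't shown in the slides. As a generic sketch of the idea, rows like the table above can be produced from any cProfile run with the stdlib csv module; the workload function is made up, and `stats.stats` is pstats' internal dictionary (widely used, but not a documented public API):

```python
import cProfile
import csv
import io
import pstats

def work():
    # A made-up workload for illustration.
    return sorted(range(5000), key=lambda x: -x)

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, name) -> (cc, ncalls, tottime, cumtime, callers)
grand_total = sum(tt for (_, _, tt, _, _) in stats.stats.values()) or 1.0

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Function name", "# of calls", "Total time", "% Total"])
for (_, _, name), (_, ncalls, tottime, _, _) in stats.stats.items():
    writer.writerow([name, ncalls, round(tottime, 6),
                     round(100.0 * tottime / grand_total, 1)])
print(buf.getvalue())
```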
Almost all the time is in one place
• In the previous slide, based on our actual application at work, 90% of the time spent in the automated test was concentrated in just 12 functions.
• The total number of functions measured was 6,763.
• 90% of the time is spent in around 0.2% of the functions. Looking for where to improve speed is no mystery here!
• The codebase is mostly Python… but the expensive operations are mostly in C. I guess this is a good thing!
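The "Cumulative % Total" arithmetic behind that observation is easy to reproduce. A small sketch with fictional numbers, not the BoM data:

```python
def cumulative_share(total_times):
    """Given per-function total times (any order), return the cumulative
    percentage of all time covered after each function, most expensive first."""
    times = sorted(total_times, reverse=True)
    grand_total = sum(times)
    shares, running = [], 0.0
    for t in times:
        running += t
        shares.append(100.0 * running / grand_total)
    return shares

def functions_for_share(total_times, target_pct=90.0):
    """How many of the most expensive functions cover `target_pct` of the time?"""
    for i, pct in enumerate(cumulative_share(total_times), start=1):
        if pct >= target_pct:
            return i
    return len(total_times)

# Fictional data: six functions whose times sum to 100.
print(functions_for_share([50, 30, 10, 5, 3, 2]))  # -> 3
```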
Version Control Integration
• Version control integration is primitive, but available
• py.test --bench_history --override-decorator --version_tag=0.4
• Goals are to:
  • Clean up the syntax for this
  • Set up auto-sniffing of version tags
Visualisation and Key Metrics
• Integration with codespeed is in a decoupled module which only relies on the filesystem structure created by benchmarker.py
• Which means you can make use of benchmarker.py on-the-desk to produce reports without the web interface
• Or it means you can adjust your own benchmarking rig to produce compatible file output and easily integrate with codespeed
Taking a look at the demo
Image produced by the author.
Data based on real execution of sort functions.
Benchmarking 102
• Controlling the environment
  • Run it on a box that isn’t doing anything else!
  • Distributed is solvable, but not done yet
• Writing specific tests
  • Your tests may not be representative of the program user experience, so you might want to write specific tests for benchmarking against
  • Execution time is data-dependent (e.g. large arrays). Make sure you have a consistent standard, and make sure you have a realistic standard
• Measure the test, not the function
  • The function may get called by other top-level functions, so you need to pull that apart to understand the relationships
Benchmarking 102
• Total Time vs Cumulative Time
  • Total time is where a three-deep loop iterates on a large array
  • Cumulative time is where you call that function with a large array… and wait
  • Total time is the CPU time spent in-function
  • Cumulative time accumulates the cost of called functions
• Large per-call total time is bad.
  • It means a large operation.
  • Either increase its efficiency, or reduce the number of times it is called
• Small per-call total time can be okay.
  • It means a small operation.
  • Efficiency is only important if it is called many times
  • But can you unroll the function to reduce call overhead?
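The tottime/cumtime distinction can be seen directly in a profile: an outer function that only delegates shows a tiny total time but a large cumulative time. A stdlib cProfile sketch, not from the slides, with made-up functions:

```python
import cProfile
import io
import pstats

def inner():
    # All the real work happens here, so tottime concentrates in inner().
    total = 0
    for i in range(200000):
        total += i * i
    return total

def outer():
    # outer() does almost nothing itself: small tottime, large cumtime.
    return inner()

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
# inner() dominates the tottime column; outer() shows up only via cumtime.
print(report)
```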
Future Directions (Bugs n Stuff)
• (1) Needs a userbase larger than one
• (1) Improved version control information (version sniffing)
• (2) Needs to properly namespace functions
• (2) The codespeed timeline is a bit broken (uses submission time, not data validity time; looks like a bug in codespeed)
• (3) Expansion into memory, disk and network profiling
• (3) Expansion into interactive benchmarking through usage analysis and dialog-based user queries
• (3) Maybe create a benchmarker class to allow multiple instances? (I believe this is actually not as necessary as feedback would suggest)
Acknowledgements
• Thanks to
  • Ed Schofield, who got the Codespeed integration over the line
• Miquel Torres, developer of Codespeed
• Bureau of Meteorology, for allowing this work to progress as open source
Tennessee Leeuwenburg
Phone: 03 9669 4310
Work Email: [email protected]
Email: [email protected]
Web: www.cawcr.gov.au
Thank you
www.cawcr.gov.au