The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology

Benchmarking your applications
With lots of shiny pictures

Tennessee Leeuwenburg
20 August 2011
www.cawcr.gov.au
What is Benchmarking?
Originally (circa 1842) a mark cut into stone by land surveyors to secure a "bench" (19th-century land surveying jargon for a type of bracket) on which to mount measuring equipment
Another claim is that the term benchmarking originated with cobblers measuring people’s feet
Benchmarking is important to anyone who wishes to ensure a process is consistent and repeatable.
Images are attributed in the PPT notes
What is Benchmarking?
• Evaluate or check (something) by comparison with a standard: "we are benchmarking our performance against external criteria".
• Fundamentally, it’s evaluating performance through measurement
Competitor Analysis

[Chart: monthly values, January through December, comparing "How good they are" against "How good we are". Image by the author, fictional data]
PyPy: A Concrete Software Example
• PyPy doesn’t use benchmarker.py; they have a custom benchmark execution rig
• They have concentrated on building a system for visualising performance, called “CodeSpeed”.
• They have chosen to measure speed. They could have focused on memory, or network performance, or anything else that makes sense for their “business”.
• However, speed is one of the most important aspects of a language, and one of the biggest reasons for someone not to choose PyPy instead of standard CPython
PyPy Benchmark against CPython
Image source: speed.pypy.org
Benchmarking to Drive Development
• “What gets measured gets done” – Peter Drucker
• Benchmarking introduces performance into the feedback loop that drives our activity at work. Hide the information, and you hide the pressure. Publicise the information, and you increase the pressure.
• It’s a tool for raising the profile of what you are measuring. It says performance is important.
• To improve performance, first measure it. (Of course, it’s not the only way, but it helps)
• This makes selecting your measurement important… measure something meaningful
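As a minimal illustration of "first measure it", the stdlib timeit module can put a number on two candidate implementations before any tuning. This example is not from the slides; the two string-building functions are made up for the sketch:

```python
import timeit

def join_concat(n=1000):
    # Build a string the slow way: repeated concatenation.
    s = ""
    for i in range(n):
        s += str(i)
    return s

def join_fast(n=1000):
    # Build the same string with str.join.
    return "".join(str(i) for i in range(n))

# Measure both candidates the same way: same repeat count, same input.
slow = timeit.timeit(join_concat, number=200)
fast = timeit.timeit(join_fast, number=200)
print(f"concat: {slow:.4f}s  join: {fast:.4f}s")
# Never benchmark two things that differ in output:
assert join_concat() == join_fast()
```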
Performance over Time
Image source: speed.pypy.org
A word of warning…
http://xkcd.com/605/
What to compare against?
• Benchmarking over time (by revision or date)
  • The most normal kind of benchmarking is a historical comparison of past performance. This lets you understand what, if any, progress is being made in the application
• Benchmarking by configuration
  • If an application has multiple configurations, especially if it has to run over a larger data set in production than in test, benchmarking those differences can be important
• Benchmarking by hardware
  • Benchmarking by hardware has the obvious advantage that you can evaluate the impact of purchasing a hardware upgrade
• If there is a direct competitor, and you have their code, you can benchmark your procedures against theirs. But this is unlikely.
• Some applications may have standard trials and tests
Benchmarking to Notice Problems
• Benchmarking can also solve specific problems by bringing them to your attention
• Most usefully, it can highlight when something bad goes in with a commit
• This graph shows a timeline of commits
• No need to worry about performance ahead of time
• Something slow went in with revision 10
• So go fix it!
• Best of all, bounce the commit!
[Chart: "Time Taken" against revision number 1-10, with a sharp jump at revision 10. Image by the author, fictional data]
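Flagging "something slow went in with revision 10" can be automated over a recorded history of per-revision timings. A sketch with fictional data shaped like the chart above; the 1.5x threshold is an arbitrary illustrative choice:

```python
def find_regressions(times_by_revision, threshold=1.5):
    """Return revisions whose time exceeds `threshold` x the previous revision's.

    `times_by_revision` is a list of (revision, seconds) pairs in commit order.
    """
    flagged = []
    for (prev_rev, prev_t), (rev, t) in zip(times_by_revision, times_by_revision[1:]):
        if t > prev_t * threshold:
            flagged.append(rev)
    return flagged

# Fictional data: steady until revision 10, then a jump.
history = [(r, 2.0) for r in range(1, 10)] + [(10, 11.0)]
print(find_regressions(history))  # -> [10]
```

A CI hook could run this after each benchmarked commit and bounce the offending revision.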
What is benchmarking of software?
• This question has a few parts, being:
  • What is measurable about software?
  • What should be the basis for comparison?
  • What standards for comparison exist?
• Most software benchmarking is about speed. Why?
  • It’s easiest to measure
  • It’s important
  • Most people understand speed of execution and what it’s like for an application to be unresponsive for the user
  • It’s often easy to fix
• But…
  • Memory, disk and networking?
  • User acceptance and use?
You too can benchmark your Python code!
• Benchmarker.py … this thing I wrote and would like to share
• Benchmarker.py will collect all this data for you. It Just Works (YMMV).
• Benchmarker.py is a tool which measures and reports on execution speed. It utilises the cProfile Python module to record statistics in a historical archive
• Benchmarker.py has an integration module for CodeSpeed, a website for visualising performance.
• Your manager will love it! (YMMV)
Introducing Benchmarker.py
• Easily available!
  • easy_install decorator.py. Easy to install!
  • https://bitbucket.org/tleeuwenburg/benchmarker.py/ Grab the source!
• Easy to follow tutorials!
  • https://bitbucket.org/tleeuwenburg/benchmarker.py/wiki/FirstTutorial
• Easy to use!
  • Simple syntax: simply decorate the function you would like profiled; no complex function execution required
  • Or, integrate directly with py.test to use it without any code modification at all
• Test-driven benchmarking
  • Because benchmarking in operations will slow the app down
I’m trying to avoid this…
Image source: icanhascheezburger.com
How to Use Benchmarker.py
In [2]: import bench
In [3]: import bench.benchmarker
In [4]: from bench.benchmarker import benchmark
In [5]: @benchmark()
   ...: def foo():
   ...:     for i in range(100):
   ...:         pass
   ...:
In [6]: foo()
In [7]: bench.benchmarker.print_stats()

100 function calls in 0.005 CPU seconds
Random listing order was used
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     0    0.000    0.000                    profile:0(profiler)
   100    0.005    0.000    0.005    0.000  <ipython console>:1(foo)
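For comparison, the same kind of measurement can be made with the stdlib cProfile and pstats modules that benchmarker.py builds on. This is plain cProfile usage, not benchmarker.py's own API:

```python
import cProfile
import io
import pstats

def foo():
    for i in range(100):
        pass

# Record a profile of one call to foo().
profiler = cProfile.Profile()
profiler.enable()
foo()
profiler.disable()

# Print the same kind of ncalls/tottime/cumtime report shown above.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
print(buf.getvalue())
```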
Creating a good historical archive
• The key is to maintain a good historical archive. This means a certain amount of integration with your tool chain, but if you are using py.test it’s easy.
demo_project]$ find /tmp/bench_history
/tmp/bench_history
/tmp/bench_history/demonstration
/tmp/bench_history/demonstration/Z400
/tmp/bench_history/demonstration/Z400/full_tests
/tmp/bench_history/demonstration/Z400/full_tests/2011
/tmp/bench_history/demonstration/Z400/full_tests/2011/07
/tmp/bench_history/demonstration/Z400/full_tests/2011/07/25
/tmp/bench_history/demonstration/Z400/full_tests/2011/07/25/2011_07_25_06_19.pstats
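The layout above (project / host / test run / year / month / day / timestamped .pstats file) is simple enough to reproduce in any rig. A sketch of building such a path with the stdlib; the values "demonstration", "Z400" and "full_tests" are just the ones from the listing, and this is not benchmarker.py's actual internal code:

```python
import datetime
import os

def archive_path(root, project, host, run_name, when=None):
    """Build an archive path in the style shown above:
    root/project/host/run_name/YYYY/MM/DD/YYYY_MM_DD_HH_MM.pstats
    """
    when = when or datetime.datetime.now()
    day_dir = os.path.join(root, project, host, run_name,
                           f"{when:%Y}", f"{when:%m}", f"{when:%d}")
    filename = f"{when:%Y_%m_%d_%H_%M}.pstats"
    return os.path.join(day_dir, filename)

when = datetime.datetime(2011, 7, 25, 6, 19)
print(archive_path("/tmp/bench_history", "demonstration", "Z400", "full_tests", when))
# On POSIX systems:
# -> /tmp/bench_history/demonstration/Z400/full_tests/2011/07/25/2011_07_25_06_19.pstats
```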
Choosing what to pay attention to
• One of the fundamental choices when benchmarking is what to watch. Nothing can automate this, although choosing the ten most expensive functions is probably not a bad first try. Options include:
• Watching the most expensive functions
• Watching the most common user operations
• Hand-selecting a mix of “inner loop” type functions and “outer loop” type functions
• “Critical path” functions that can’t execute in the background or be avoided
• Crafting a watch list based on a specific objective or system component
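That first try, the ten most expensive functions, can be read straight out of a recorded profile with the stdlib pstats module. A sketch: the workload function is made up, and the profile is round-tripped through a .pstats file to mirror the archive format above:

```python
import cProfile
import io
import os
import pstats
import tempfile

def busy():
    # A made-up workload for illustration.
    total = 0
    for i in range(10000):
        total += i * i
    return total

# Record a profile and dump it to disk, as an archive would.
profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()
path = os.path.join(tempfile.gettempdir(), "watchlist.pstats")
profiler.dump_stats(path)

# Reload the recorded profile and report only the ten most expensive
# entries by total in-function time ("tottime").
buf = io.StringIO()
pstats.Stats(path, stream=buf).sort_stats("tottime").print_stats(10)
print(buf.getvalue())
```

Sorting by "cumulative" instead would surface outer-loop functions rather than inner-loop ones.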
Figuring it out the first time
• Before setting up the list of watched functions for the graph server, try opening the file in a spreadsheet. Benchmarker comes with a CSV export mode.
Function name                               # of calls   Total time   % Total   Cumulative % Total
<_AFPSSup.ReferenceData_grid>                    33929         1217        21                   21
_getLandOrWaterRefData                             544          860        15                   36
<compile>                                        10584          839        14                   51
<_AFPSDB.Parm_saveParameter>                     11476          662        12                   63
<_AFPSSup.IFPClient_getReferenceData>            34386          386         7                   69
<_AFPSSup.ReferenceData_pyGrid>                  25257          313         5                   75
<_AFPSSup.new_HistoSampler>                       2356          289         5                   80
shuffle                                           4746          224         4                   84
<_AFPSSup.IFPClient_getTextData>                  9864          112         2                   86
<_AFPSSup.IFPClient_getParmList>                  2299           90         2                   87
<_AFPSSup.IFPClient_getReferenceInventory>        1345           79         1                   89
<_AFPSSup.IFPClient_getTopoData>                  1173           60         1                   90
Data taken from BoM application tests
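The CSV export itself isn't shown in the slides. As a generic sketch of the idea, rows like the table above can be produced from any cProfile run with the stdlib csv module; the workload function is made up, and `stats.stats` is pstats' internal dictionary (widely used, but not a documented public API):

```python
import cProfile
import csv
import io
import pstats

def work():
    # A made-up workload for illustration.
    return sorted(range(5000), key=lambda x: -x)

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, name) -> (cc, ncalls, tottime, cumtime, callers)
grand_total = sum(tt for (_, _, tt, _, _) in stats.stats.values()) or 1.0

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Function name", "# of calls", "Total time", "% Total"])
for (_, _, name), (_, ncalls, tottime, _, _) in stats.stats.items():
    writer.writerow([name, ncalls, round(tottime, 6),
                     round(100.0 * tottime / grand_total, 1)])
print(buf.getvalue())
```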
Almost all the time is in one place
• In the previous slide, based on our actual application at work, 90% of the time spent in the automated test was concentrated in just 12 functions.
• The total number of functions measured was 6,763.
• 90% of the time is spent in around 0.2% of the functions. Looking for where to improve speed is no mystery here!
• The codebase is mostly Python… but the expensive operations are mostly in C. I guess this is a good thing!
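The "Cumulative % Total" arithmetic behind that observation is easy to reproduce. A small sketch with fictional numbers, not the BoM data:

```python
def cumulative_share(total_times):
    """Given per-function total times (any order), return the cumulative
    percentage of all time covered after each function, most expensive first."""
    times = sorted(total_times, reverse=True)
    grand_total = sum(times)
    shares, running = [], 0.0
    for t in times:
        running += t
        shares.append(100.0 * running / grand_total)
    return shares

def functions_for_share(total_times, target_pct=90.0):
    """How many of the most expensive functions cover `target_pct` of the time?"""
    for i, pct in enumerate(cumulative_share(total_times), start=1):
        if pct >= target_pct:
            return i
    return len(total_times)

# Fictional data: six functions whose times sum to 100.
print(functions_for_share([50, 30, 10, 5, 3, 2]))  # -> 3
```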
Version Control Integration
• Version control integration is primitive, but available
• py.test --bench_history --override-decorator --version_tag=0.4
• Goals are to:
  • Clean up the syntax for this
  • Set up auto-sniffing of version tags
Visualisation and Key Metrics
• Integration with codespeed is in a decoupled module which only relies on the filesystem structure created by benchmarker.py
• Which means you can make use of benchmarker.py on-the-desk to produce reports without the web interface
• Or it means you can adjust your own benchmarking rig to produce compatible file output and easily integrate with codespeed
Taking a look at the demo
Image produced by the author.
Data based on real execution of sort functions.
Benchmarking 102
• Controlling the environment
  • Run it on a box that isn’t doing anything else!
  • Distributed is solvable, but not done yet
• Writing specific tests
  • Your tests may not be representative of the program user experience, so you might want to write specific tests for benchmarking against
  • Execution time is data-dependent (e.g. large arrays). Make sure you have a consistent standard, and make sure you have a realistic standard
• Measure the test, not the function
  • The function may get called by other top-level functions, so you need to pull that apart to understand the relationships
Benchmarking 102
• Total Time vs Cumulative Time
  • Total time is where a three-deep loop iterates on a large array
  • Cumulative time is where you call that function with a large array… and wait
  • Total time is the CPU time spent in-function
  • Cumulative time accumulates the cost of called functions
• Large per-call total time is bad.
  • It means a large operation.
  • Either increase its efficiency, or reduce the number of times it is called
• Small per-call total time can be okay.
  • It means a small operation.
  • Efficiency is only important if it is called many times
  • But can you unroll the function to reduce call overhead?
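The tottime/cumtime distinction can be seen directly in a profile: an outer function that only delegates shows a tiny total time but a large cumulative time. A stdlib cProfile sketch, not from the slides, with made-up functions:

```python
import cProfile
import io
import pstats

def inner():
    # All the real work happens here, so tottime concentrates in inner().
    total = 0
    for i in range(200000):
        total += i * i
    return total

def outer():
    # outer() does almost nothing itself: small tottime, large cumtime.
    return inner()

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
# inner() dominates the tottime column; outer() shows up only via cumtime.
print(report)
```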
Future Directions (Bugs n Stuff)
• (1) Needs a userbase larger than one
• (1) Improved version control information (version sniffing)
• (2) Needs to properly namespace functions
• (2) The codespeed timeline is a bit broken (uses submission time, not data validity time; looks like a bug in codespeed)
• (3) Expansion into memory, disk and network profiling
• (3) Expansion into interactive benchmarking through usage analysis and dialog-based user queries
• (3) Maybe create a benchmarker class to allow multiple instances? (I believe this is actually not as necessary as feedback would suggest)
Acknowledgements
• Thanks to
  • Ed Schofield, who got the Codespeed integration over the line
• Miquel Torres, developer of Codespeed
• Bureau of Meteorology, for allowing this work to progress as open source
Tennessee Leeuwenburg
Phone: 03 9669 4310
Work Email: [email protected]
Email: [email protected]
Web: www.cawcr.gov.au
Thank you
www.cawcr.gov.au