A look inside pandas design and development. Wes McKinney, Lambda Foundry, Inc. @wesmckinn. NYC Python Meetup, 1/10/2012

A look inside pandas design and development

Jan 27, 2015


wesm

 
Transcript
Page 1: A look inside pandas design and development

A look inside pandas design and development

Wes McKinney

Lambda Foundry, Inc.

@wesmckinn

NYC Python Meetup, 1/10/2012

1

Page 2: A look inside pandas design and development

a.k.a. “Pragmatic Python for high performance data analysis”

2

Page 3: A look inside pandas design and development

a.k.a. “Rise of the pandas”

3

Page 4: A look inside pandas design and development

Me

4

Page 5: A look inside pandas design and development

More like...

SPEED!!!

5

Page 6: A look inside pandas design and development

Or maybe... (j/k)

6

Page 7: A look inside pandas design and development

Me

• Mathematician at heart

• 3 years in the quant finance industry

• Last 2: statistics + freelance + open source

• My new company: Lambda Foundry

• Building analytics and tools for finance and other domains

7

Page 8: A look inside pandas design and development

Me

• Blog: http://blog.wesmckinney.com

• GitHub: http://github.com/wesm

• Twitter: @wesmckinn

• Working on “Python for Data Analysis” for O’Reilly Media

• Giving PyCon tutorial on pandas (!)

8

Page 9: A look inside pandas design and development

pandas?

• http://pandas.sf.net

• Swiss-army knife of (in-memory) data manipulation in Python

• Like R’s data.frame on steroids

• Excellent performance

• Easy-to-use, highly consistent API

• A foundation for data analysis in Python

9

Page 10: A look inside pandas design and development

pandas

• In heavy production use in the financial industry

• Generally much better performance than other open source alternatives (e.g. R)

• Hope: basis for the “next generation” data analytical environment in Python

10

Page 11: A look inside pandas design and development

Simplifying data wrangling

• Data munging / preparation / cleaning / integration is slow, error prone, and time consuming

• Everyone already <3’s Python for data wrangling: pandas takes it to the next level

11

Page 12: A look inside pandas design and development

Explosive pandas growth

• Last 6 months: 240 files changed, 49428 insertions(+), 15358 deletions(-)

Cython-generated C removed

12

Page 13: A look inside pandas design and development

Rigorous unit testing

• Need to be able to trust your $1e3/e6/e9s to pandas

• > 98% line coverage as measured by coverage.py

• v0.3.0 (2/19/2011): 533 test functions

• v0.7.0 (1/09/2012): 1272 test functions

13

Page 14: A look inside pandas design and development

Some development asides

• I get a lot of questions about my dev env

• Emacs + IPython FTW

• Indispensable development tools

• pdb (and IPython-enhanced pdb)

• pylint / pyflakes (integrated with Emacs)

• nose

• coverage.py

• grin, for searching code. >> ack/grep IMHO

14

Page 15: A look inside pandas design and development

IPython

• Matthew Goodman: “If you are not using this tool, you are doing it wrong!”

• Tab completion, introspection, interactive debugger, command history

• Designed to enhance your productivity in every way. I can’t live without it

• IPython HTML notebook is a game changer

15

Page 16: A look inside pandas design and development

Profiling and optimization

• %time, %timeit in IPython

• %prun, to profile a statement with cProfile

• %run -p to profile whole programs

• line_profiler module, for line-by-line timing

• Optimization: find right algorithm first. Cython-ize the bottlenecks (if need be)
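
Outside IPython, roughly the same measurements can be taken with the standard library. The following is a minimal sketch, not the workflow shown in the talk: `slow_sum` is a hypothetical workload, `timeit` stands in for %timeit, and `cProfile` with `pstats` stands in for %prun.

```python
import cProfile
import io
import pstats
import timeit


def slow_sum(n):
    """Hypothetical workload: sum of squares with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Rough stand-in for %timeit: best of 3 repeats, 100 calls each.
best = min(timeit.repeat(lambda: slow_sum(1000), number=100, repeat=3))
print("best of 3: %.6f s per 100 calls" % best)

# Rough stand-in for %prun: profile one call, sort by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The cumulative-time ordering is the same view used later in the talk to spot the factorize bottleneck.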

16

Page 17: A look inside pandas design and development

Other things that matter

• Follow PEP8 religiously

• Naming conventions, other code style

• 80 characters per line hard limit

• Test more than you think you need to, aim for 100% line coverage

• Avoid long functions (> 50 lines), refactor aggressively

17

Page 18: A look inside pandas design and development

I’m serious about function length

http://gist.github.com/1580880

18

Page 19: A look inside pandas design and development

Don’t make a mess

YouTube: “What killed Smalltalk could kill s/Ruby/Python, too”

Uncle Bob

19

Page 20: A look inside pandas design and development

Other stuff

• Good keyboard

20

Page 21: A look inside pandas design and development

Other stuff

• Big monitors

21

Page 22: A look inside pandas design and development

Other stuff

• Ergonomic chair (good hacking posture)

22

Page 23: A look inside pandas design and development

pandas DataFrame

• Jack-of-trades tabular data structure

    In [10]: tips[:10]
    Out[10]:
        total_bill  tip   sex     smoker  day  time    size
    1   16.99       1.01  Female  No      Sun  Dinner  2
    2   10.34       1.66  Male    No      Sun  Dinner  3
    3   21.01       3.50  Male    No      Sun  Dinner  3
    4   23.68       3.31  Male    No      Sun  Dinner  2
    5   24.59       3.61  Female  No      Sun  Dinner  4
    6   25.29       4.71  Male    No      Sun  Dinner  4
    7   8.770       2.00  Male    No      Sun  Dinner  2
    8   26.88       3.12  Male    No      Sun  Dinner  4
    9   15.04       1.96  Male    No      Sun  Dinner  2
    10  14.78       3.23  Male    No      Sun  Dinner  2

23

Page 24: A look inside pandas design and development

DataFrame

• Heterogeneous columns

• Data alignment and axis indexing

• No-copy data selection (!)

• Agile reshaping

• Fast joining, merging, concatenation

24

Page 25: A look inside pandas design and development

DataFrame

• Axis indexing enables rich data alignment, joins / merges, reshaping, selection, etc.

    day            Fri    Sat    Sun    Thur
    sex    smoker
    Female No      3.125  2.725  3.329  2.460
           Yes     2.683  2.869  3.500  2.990
    Male   No      2.500  3.257  3.115  2.942
           Yes     2.741  2.879  3.521  3.058

25

Page 26: A look inside pandas design and development

Let’s have a little fun

To the IPython Notebook, Batman

http://ashleyw.co.uk/project/food-nutrient-database

26

Page 27: A look inside pandas design and development

Axis indexing, the special pandas-flavored sauce

• Enables “alignment-free” programming

• Prevents major source of data munging frustration and errors

• Fast data selection (O(1) or O(log n))

• Powerful way of describing reshape / join / merge / pivot-table operations

27

Page 28: A look inside pandas design and development

Data alignment, join ops

• The brains live in the axis index

• Indexes know how to do set logic

• Join/align ops: produce “indexers”

• Mapping between source/output

• Indexer passed to fast “take” function

28

Page 29: A look inside pandas design and development

Index join example

    left:    d  b  c  e
    right:   a  b  c

JOIN

    joined:  a  b  c  d  e
    lidx:   -1  1  2  0  3
    ridx:    0  1  2 -1 -1

left_values.take(lidx, axis) -> reindexed data
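
The indexers in this example can be reproduced with a plain dict as the lookup table. This is a minimal pure-Python sketch, not the pandas implementation: `build_indexer` and `take` are hypothetical helpers, `left_values` is made-up data, and -1 marks a label that is missing from the source index.

```python
def build_indexer(source, target):
    """For each label in `target`, return its position in `source`, or -1.

    Assumes the labels in `source` are unique.
    """
    table = {label: pos for pos, label in enumerate(source)}  # hash table
    return [table.get(label, -1) for label in target]


def take(values, indexer, fill=None):
    """Reorder `values` by `indexer`; positions marked -1 get `fill`."""
    return [values[i] if i != -1 else fill for i in indexer]


left = ["d", "b", "c", "e"]
right = ["a", "b", "c"]
joined = sorted(set(left) | set(right))  # outer join of the two indexes

lidx = build_indexer(left, joined)
ridx = build_indexer(right, joined)

left_values = [10.0, 20.0, 30.0, 40.0]  # hypothetical data aligned with `left`
print(joined)                    # ['a', 'b', 'c', 'd', 'e']
print(lidx)                      # [-1, 1, 2, 0, 3]
print(ridx)                      # [0, 1, 2, -1, -1]
print(take(left_values, lidx))   # [None, 20.0, 30.0, 10.0, 40.0]
```

Once the indexer exists, moving the data is a single pass over it, which is why the join logic can live entirely in the index.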

29

Page 30: A look inside pandas design and development

Implementing index joins

• Completely irregular case: use hash tables

• Monotonic / increasing values

• Faster specialized left/right/inner/outer join routines, especially for native types (int32/64, datetime64)

• Lookup hash table is persisted inside the Index object!

30

Page 31: A look inside pandas design and development

Um, hash table?

    left:     d  b  c  e
    map:      {d: 0, b: 1, c: 2, e: 3}

    joined:   a  b  c  d  e
    indexer: -1  1  2  0  3

31

Page 32: A look inside pandas design and development

Hash tables

• Form the core of many critical pandas algorithms

• unique (for set intersection / union)

• “factor”ize

• groupby

• join / merge / align

32

Page 33: A look inside pandas design and development

GroupBy, a brief algorithmic exploration

• Simple problem: compute group sums for a vector given group identifications

    labels:  b  b  a  a  b  a  a
    values: -1  3  2  3  2 -4  1

    unique labels:  a  b
    group sums:     2  4

33

Page 34: A look inside pandas design and development

GroupBy: Algo #1

    unique_labels = np.unique(labels)
    results = np.empty(len(unique_labels))

    for i, label in enumerate(unique_labels):
        results[i] = values[labels == label].sum()

For all these examples, assume N data points and K unique groups

34

Page 35: A look inside pandas design and development

GroupBy: Algo #1, don’t do this

    unique_labels = np.unique(labels)
    results = np.empty(len(unique_labels))

    for i, label in enumerate(unique_labels):
        results[i] = values[labels == label].sum()

Some obvious problems

• O(N * K) comparisons. Slow for large K

• K passes through values

• numpy.unique is pretty slow (more on this later)

35

Page 36: A look inside pandas design and development

GroupBy: Algo #2

Make this dict in O(N) (pseudocode):

    g_inds = {label : [i where labels[i] == label]}

Now:

    for i, label in enumerate(unique_labels):
        indices = g_inds[label]
        label_values = values.take(indices)
        result[i] = label_values.sum()

Pros: one pass through values. ~O(N) for N >> K

Cons: g_inds can be built in O(N), but too many list/dict API calls, even using Cython
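
As a runnable version of the pseudocode for Algo #2, here is a minimal pure-Python sketch. It uses the small labels / values example from the earlier slide and plain lists in place of NumPy arrays; sorting the unique labels at the end is my choice for readable output, not something the slide prescribes.

```python
from collections import defaultdict

labels = ["b", "b", "a", "a", "b", "a", "a"]
values = [-1, 3, 2, 3, 2, -4, 1]

# One O(N) pass: map each label to the positions where it occurs.
g_inds = defaultdict(list)
for i, label in enumerate(labels):
    g_inds[label].append(i)

# Sum the values at each group's positions.
unique_labels = sorted(g_inds)
result = [sum(values[i] for i in g_inds[label]) for label in unique_labels]

print(unique_labels)  # ['a', 'b']
print(result)         # [2, 4]
```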

36

Page 37: A look inside pandas design and development

GroupBy: Algo #3, much faster

• “Factorize” labels

• Produce vector of integers from 0, ..., K-1 corresponding to the unique observed values (use a hash table)

    result = np.zeros(k)
    for i, j in enumerate(factorized_labels):
        result[j] += values[i]

Pros: avoids expensive dict-of-lists creation. Avoids numpy.unique and gives the option not to sort the unique labels, skipping O(K lg K) work
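
The factorize step and the accumulation loop can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the Cython code in pandas: `factorize` here is a hypothetical helper that uses a dict as the hash table and keeps the uniques in order of first appearance (the "don't sort" variant).

```python
def factorize(labels):
    """Return (ids, uniques) where ids[i] is the position of labels[i]
    in uniques. Uniques keep their order of first appearance."""
    table = {}
    uniques = []
    ids = []
    for label in labels:
        idx = table.get(label)
        if idx is None:
            idx = len(uniques)
            table[label] = idx
            uniques.append(label)
        ids.append(idx)
    return ids, uniques


labels = ["b", "b", "a", "a", "b", "a", "a"]
values = [-1, 3, 2, 3, 2, -4, 1]

factorized_labels, uniques = factorize(labels)
k = len(uniques)

# Single pass over the data: add each value into its group's slot.
result = [0] * k
for i, j in enumerate(factorized_labels):
    result[j] += values[i]

print(uniques)            # ['b', 'a']
print(factorized_labels)  # [0, 0, 1, 1, 0, 1, 1]
print(result)             # [4, 2]
```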

37

Page 38: A look inside pandas design and development

Speed comparisons

• Test case: 100,000 data points, 5,000 groups

• Algo 3, don’t sort groups: 5.46 ms

• Algo 3, sort groups: 10.6 ms

• Algo 2: 155 ms (14.6x slower than sorted Algo 3)

• Algo 1: 10.49 seconds (990x slower than sorted Algo 3)

• Algos 2/3 implemented in Cython

38

Page 39: A look inside pandas design and development

GroupBy

• Situation is significantly more complicated in the multi-key case.

• More on this later

39

Page 40: A look inside pandas design and development

Algo 3, profiled

    In [32]: %prun for _ in xrange(100): algo3_nosort()

    cumtime  filename:lineno(function)
      0.592  <string>:1(<module>)
      0.584  groupby_ex.py:37(algo3_nosort)
      0.535  {method 'factorize' of 'DictFactorizer' objects}
      0.047  {pandas._tseries.group_add}
      0.002  numeric.py:65(zeros_like)
      0.001  {method 'fill' of 'numpy.ndarray' objects}
      0.000  {numpy.core.multiarray.empty_like}
      0.000  {numpy.core.multiarray.empty}

Curious

40

Page 41: A look inside pandas design and development

Slaves to algorithms

• Turns out that numpy.unique works by sorting, not a hash table. Thus O(N log N) versus O(N)

• Takes > 70% of the runtime of Algo #2

• Factorize is the new bottleneck, possible to go faster?!

41

Page 42: A look inside pandas design and development

Unique-ing faster

Basic algorithm using a dict, do this in Cython

    table = {}
    uniques = []
    for value in values:
        if value not in table:
            table[value] = None  # dummy
            uniques.append(value)
    if sort:
        uniques.sort()

Performance may depend on the number of unique groups (due to dict resizing)

42

Page 43: A look inside pandas design and development

Unique-ing faster

No Sort: at best ~70x faster, worst 6.5x faster

Sort: at best ~70x faster, worst 1.7x faster

43

Page 44: A look inside pandas design and development

Remember

44

Page 45: A look inside pandas design and development

Can we go faster?

• Python dict is renowned as one of the best hash table implementations anywhere

• But:

• No ability to preallocate, subject to arbitrary resizings

• We don’t care about reference counting, throw away table once done

• Hm, what to do, what to do?

45

Page 46: A look inside pandas design and development

Enter klib

• http://github.com/attractivechaos/klib

• Small, portable C data structures and algorithms

• khash: fast, memory-efficient hash table

• Hack a Cython interface (pxd file) and we’re in business

46

Page 47: A look inside pandas design and development

khash Cython interface

    cdef extern from "khash.h":
        ctypedef struct kh_pymap_t:
            khint_t n_buckets, size, n_occupied, upper_bound
            uint32_t *flags
            PyObject **keys
            Py_ssize_t *vals

        inline kh_pymap_t* kh_init_pymap()
        inline void kh_destroy_pymap(kh_pymap_t*)
        inline khint_t kh_get_pymap(kh_pymap_t*, PyObject*)
        inline khint_t kh_put_pymap(kh_pymap_t*, PyObject*, int*)
        inline void kh_clear_pymap(kh_pymap_t*)
        inline void kh_resize_pymap(kh_pymap_t*, khint_t)
        inline void kh_del_pymap(kh_pymap_t*, khint_t)
        bint kh_exist_pymap(kh_pymap_t*, khiter_t)

47

Page 48: A look inside pandas design and development

PyDict vs. khash unique

Conclusions: dict resizing makes a big impact

48

Page 49: A look inside pandas design and development

Use strcmp in C

49

Page 50: A look inside pandas design and development

Gloves come off with int64

PyObject* boxing / PyRichCompare obvious culprit

50

Page 51: A look inside pandas design and development

Some NumPy-fu

• Think about the sorted factorize algorithm

• Want to compute sorted unique labels

• Also compute integer ids relative to the unique values, without making 2 passes through a hash table!

    sorter = uniques.argsort()
    # integer dtype so the result can be used as an indexer
    reverse_indexer = np.empty(len(sorter), dtype=np.int64)
    reverse_indexer.put(sorter, np.arange(len(sorter)))

    labels = reverse_indexer.take(labels)
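
To see the reverse-indexer trick on concrete data, here is a small self-contained sketch. The input arrays are made up for illustration: `uniques` holds the unique values in order of first appearance and `labels` holds ids relative to that unsorted order, as an unsorted factorize would produce. The integer dtype on `reverse_indexer` is my addition so the remapped labels stay usable as indices.

```python
import numpy as np

uniques = np.array(["b", "a", "c"])  # uniques in order of first appearance
labels = np.array([0, 0, 1, 2, 1])   # ids relative to `uniques`

sorter = uniques.argsort()           # positions that sort the uniques

# reverse_indexer[old_id] = new_id in the sorted ordering
reverse_indexer = np.empty(len(sorter), dtype=np.int64)
reverse_indexer.put(sorter, np.arange(len(sorter)))

sorted_uniques = uniques.take(sorter)
sorted_labels = reverse_indexer.take(labels)

print(sorted_uniques.tolist())  # ['a', 'b', 'c']
print(sorted_labels.tolist())   # [1, 1, 0, 2, 0]
```

The remapped labels decode to the same values as before, so only the K uniques were sorted, never the N observations.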

51

Page 52: A look inside pandas design and development

Aside, for the R community

• R’s factor function is suboptimal

• Makes two hash table passes

• unique: uniquify and sort

• match: compute ids relative to unique labels

• This is highly fixable

• R’s integer unique is about 40% slower than my khash_int64 unique

52

Page 53: A look inside pandas design and development

Multi-key GroupBy

• Significantly more complicated because the number of possible key combinations may be very large

• Example, group by two sets of labels

• 1000 unique values in each

• “Key space”: 1,000,000, even though the number of observed key pairs may be small

53

Page 54: A look inside pandas design and development

Multi-key GroupBy

Simplified Algorithm

    id1, count1 = factorize(label1)
    id2, count2 = factorize(label2)
    group_id = id1 * count2 + id2
    nobs = count1 * count2

    if nobs > LARGE_NUMBER:
        group_id, nobs = factorize(group_id)

    result = group_add(data, group_id, nobs)
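
A runnable pure-Python rendering of the simplified algorithm might look like the sketch below. The helpers `factorize` and `group_add` are hypothetical stand-ins for the Cython routines, the data is made up, and `LARGE_NUMBER` is set artificially low so that the key-space compression branch runs on this tiny input.

```python
def factorize(labels):
    """Return (ids, n_unique); ids are assigned in order of first appearance."""
    table = {}
    ids = []
    for label in labels:
        ids.append(table.setdefault(label, len(table)))
    return ids, len(table)


def group_add(data, group_id, nobs):
    """Sum `data` into `nobs` slots according to `group_id`."""
    out = [0.0] * nobs
    for value, gid in zip(data, group_id):
        out[gid] += value
    return out


LARGE_NUMBER = 4  # hypothetical threshold, tiny on purpose

label1 = ["x", "y", "x", "z", "y"]
label2 = ["p", "q", "p", "r", "q"]
data = [1.0, 2.0, 3.0, 4.0, 5.0]

id1, count1 = factorize(label1)
id2, count2 = factorize(label2)
group_id = [a * count2 + b for a, b in zip(id1, id2)]  # [0, 4, 0, 8, 4]
nobs = count1 * count2                                  # key space: 9

# Compress the key space down to the key pairs actually observed.
if nobs > LARGE_NUMBER:
    group_id, nobs = factorize(group_id)

result = group_add(data, group_id, nobs)
print(group_id, nobs)  # [0, 1, 0, 2, 1] 3
print(result)          # [4.0, 7.0, 4.0]
```

Without the compression step the output would have 9 slots, 6 of them empty; at a key space of 1e8 that wasted allocation is what dominates the runtime.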

54

Page 55: A look inside pandas design and development

Multi-GroupBy

• Pathological, but realistic example

• 50,000 values, 1e4 unique keys x 2, key space 1e8

• Compress key space: 9.2 ms

• Don’t compress: 1.2s (!)

• I actually discovered this problem while writing this talk (!!)

55

Page 56: A look inside pandas design and development

Speaking of performance

• Testing the correctness of code is easy: write unit tests

• How to systematically test performance?

• Need to catch performance regressions

• Being mildly performance obsessed, I got very tired of playing performance whack-a-mole with pandas

56

Page 57: A look inside pandas design and development

vbench project

• http://github.com/wesm/vbench

• Run benchmarks for each version of your codebase

• vbench checks out each revision of your codebase, builds it, and runs all the benchmarks you define

• Results stored in a SQLite database

• Only works with git right now

57

Page 58: A look inside pandas design and development

vbench

    join_dataframe_index_single_key_bigger = \
        Benchmark("df.join(df_key2, on='key2')", setup,
                  name='join_dataframe_index_single_key_bigger')

58

Page 59: A look inside pandas design and development

vbench

    stmt3 = "df.groupby(['key1', 'key2']).sum()"
    groupby_multi_cython = Benchmark(stmt3, setup,
                                     name="groupby_multi_cython",
                                     start_date=datetime(2011, 7, 1))

59

Page 60: A look inside pandas design and development

Fast database joins

• Problem: SQL-compatible left, right, inner, outer joins

• Row duplication

• Join on index and / or join on columns

• Sorting vs. not sorting

• Algorithmically closely related to groupby etc.

60

Page 61: A look inside pandas design and development

Row duplication

    left               right
    key  lvalue        key  rvalue
    foo  1             foo  5
    foo  2             foo  6
    bar  3             bar  7
    baz  4             qux  8

outer join

    key  lvalue  rvalue
    foo  1       5
    foo  1       6
    foo  2       5
    foo  2       6
    bar  3       7
    baz  4       NA
    qux  NA      8

61

Page 62: A look inside pandas design and development

Join indexers

    left               right
    key  lvalue        key  rvalue
    foo  1             foo  5
    foo  2             foo  6
    bar  3             bar  7
    baz  4             qux  8

outer join

    key  lidx  ridx
    foo   0     0
    foo   0     1
    foo   1     0
    foo   1     1
    bar   2     2
    baz   3    -1
    qux  -1     3

62

Page 63: A look inside pandas design and development

Join indexers

    left               right
    key  lvalue        key  rvalue
    foo  1             foo  5
    foo  2             foo  6
    bar  3             bar  7
    baz  4             qux  8

outer join

    key  lidx  ridx
    foo   0     0
    foo   0     1
    foo   1     0
    foo   1     1
    bar   2     2
    baz   3    -1
    qux  -1     3

Problem: factorized keys need to be sorted!
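
The lidx / ridx columns for this example can be reproduced with a short pure-Python sketch. It is not the pandas algorithm: `outer_join_indexers` is a hypothetical helper that groups row positions per key with dicts (relying on insertion-ordered dicts, Python 3.7+), which sidesteps the sorting problem rather than solving it the way the talk goes on to describe.

```python
def outer_join_indexers(left_keys, right_keys):
    """Return (keys, lidx, ridx) for an outer join with row duplication.

    -1 marks a row that has no match on that side.
    """
    left_pos, right_pos = {}, {}
    for i, key in enumerate(left_keys):
        left_pos.setdefault(key, []).append(i)
    for i, key in enumerate(right_keys):
        right_pos.setdefault(key, []).append(i)

    # Left keys in order of first appearance, then right-only keys.
    order = list(left_pos) + [k for k in right_pos if k not in left_pos]

    keys, lidx, ridx = [], [], []
    for key in order:
        # Cartesian product of matching rows; [-1] stands for "no row".
        for li in left_pos.get(key, [-1]):
            for ri in right_pos.get(key, [-1]):
                keys.append(key)
                lidx.append(li)
                ridx.append(ri)
    return keys, lidx, ridx


left_keys = ["foo", "foo", "bar", "baz"]
right_keys = ["foo", "foo", "bar", "qux"]
lvalue = [1, 2, 3, 4]
rvalue = [5, 6, 7, 8]

keys, lidx, ridx = outer_join_indexers(left_keys, right_keys)
print(lidx)  # [0, 0, 1, 1, 2, 3, -1]
print(ridx)  # [0, 1, 0, 1, 2, -1, 3]

# Applying the indexers yields the joined columns (None plays the role of NA).
joined_l = [lvalue[i] if i != -1 else None for i in lidx]
joined_r = [rvalue[i] if i != -1 else None for i in ridx]
print(joined_l)  # [1, 1, 2, 2, 3, 4, None]
print(joined_r)  # [5, 6, 5, 6, 7, None, 8]
```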

63

Page 64: A look inside pandas design and development

An algorithmic observation

• If N values are known to be from the range 0 through K - 1, can be sorted in O(N)

• Variant of counting sort

• For our purposes, only compute the sorting indexer (argsort)
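
A counting-sort argsort over integer ids in the range 0 through K - 1 can be sketched as below. This is an illustrative pure-Python version with a hypothetical function name, not the routine used inside pandas; it returns only the sorting indexer, as the slide suggests, and it is stable.

```python
def counting_argsort(ids, k):
    """Return positions that sort `ids` (integers in 0..k-1) in O(N + K)."""
    # Pass 1: count how many rows fall in each group.
    counts = [0] * k
    for gid in ids:
        counts[gid] += 1

    # starts[g] is the first output slot reserved for group g.
    starts = [0] * k
    for g in range(1, k):
        starts[g] = starts[g - 1] + counts[g - 1]

    # Pass 2: drop each row position into the next free slot of its group.
    indexer = [0] * len(ids)
    for pos, gid in enumerate(ids):
        indexer[starts[gid]] = pos
        starts[gid] += 1
    return indexer


ids = [2, 0, 1, 0, 2, 1, 0]
indexer = counting_argsort(ids, 3)
print(indexer)                    # [1, 3, 6, 2, 5, 0, 4]
print([ids[i] for i in indexer])  # [0, 0, 0, 1, 1, 2, 2]
```

No comparisons between elements are made, which is how the O(N log N) bound of comparison sorts is avoided.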

64

Page 65: A look inside pandas design and development

Winning join algorithm

• Factorize keys columns: O(K log K) (sort keys) or O(N) (don’t sort keys)

• Compute / compress group indexes: O(N) (refactorize)

• "Sort" by group indexes: O(N) (counting sort)

• Compute left / right join indexers for join method: O(N_output) (this step is actually fairly nontrivial)

• Remap indexers relative to original row ordering: O(N_output)

• Move data efficiently into output DataFrame: O(N_output)

65

Page 66: A look inside pandas design and development

“You’re like CLR, I’m like CLRS” - “Kill Dash Nine”, by Monzy

66

Page 67: A look inside pandas design and development

Join test case

• Left: 80k rows, 2 key columns, 8k unique key pairs

• Right: 8k rows, 2 key columns, 8k unique key pairs

• 6k matching key pairs between the tables, many-to-one join

• One column of numerical values in each

67

Page 68: A look inside pandas design and development

Join test case

• Many-to-many case: stack right DataFrame on top of itself to yield 16k rows, 2 rows for each key pair

• Aside: sorting the unique keys dominates the runtime (that pesky O(K log K)), not included in these benchmarks

68

Page 69: A look inside pandas design and development

Quick, algebra!

Many-to-one

• Left join: 80k rows

• Right join: 62k rows

• Inner join: 60k rows

• Outer join: 82k rows

Many-to-many

• Left join: 140k rows

• Right join: 124k rows

• Inner join: 120k rows

• Outer join: 144k rows
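
The row counts follow from simple arithmetic once one assumption is made explicit: every left key pair appears exactly 10 times (80k rows / 8k key pairs). That uniformity is my assumption, not stated on the slides, but it reproduces the figures. The function name below is hypothetical.

```python
def join_sizes(left_rows_per_key, right_rows_per_key,
               n_left_keys, n_right_keys, n_matching):
    """Output row counts for each join type, assuming uniform key frequency."""
    inner = n_matching * left_rows_per_key * right_rows_per_key
    left_only = (n_left_keys - n_matching) * left_rows_per_key
    right_only = (n_right_keys - n_matching) * right_rows_per_key
    return {
        "left": inner + left_only,
        "right": inner + right_only,
        "inner": inner,
        "outer": inner + left_only + right_only,
    }


# 8k key pairs on each side, 6k of them shared.
many_to_one = join_sizes(10, 1, 8000, 8000, 6000)
many_to_many = join_sizes(10, 2, 8000, 8000, 6000)  # right stacked on itself

print(many_to_one)
# {'left': 80000, 'right': 62000, 'inner': 60000, 'outer': 82000}
print(many_to_many)
# {'left': 140000, 'right': 124000, 'inner': 120000, 'outer': 144000}
```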

69

Page 70: A look inside pandas design and development

Results vs. some R packages

* relative timings

70

Page 71: A look inside pandas design and development

Results vs SQLite3

Note: In SQLite3 doing something like

Absolute timings

* outer is LEFT OUTER in SQLite3

71

Page 72: A look inside pandas design and development

DataFrame sort by columns

• Applied same ideas / tools to “sort by multiple columns op” yesterday

72

Page 73: A look inside pandas design and development

The bottom line

• Just a flavor: pretty much all of pandas has seen the same level of design effort and performance scrutiny

• Make sure whoever implemented your data structures and algorithms cares about performance. A lot.

• Python has amazingly powerful and productive tools for implementation work

73

Page 74: A look inside pandas design and development

Thanks!

• Follow me on Twitter: @wesmckinn

• Blog: http://blog.wesmckinney.com

• Exciting Python things ahead in 2012

74