Session 5: Extreme Python An introduction to scientific programming with
Session 5 - Scientific Programming in Python · Managing your environment •Some good things about Python •lots of modules from many sources •ongoing development of Python and

May 29, 2020

dariahiddleston
Session 5: Extreme Python

An introduction to scientific programming with

Outline

• Managing your environment

• Efficiently handling large datasets

• Optimising your code

• Squeezing out extra speed

• Writing neat and robust code

• Graphical interfaces

Managing your environment

Managing your environment

• Some good things about Python
  • lots of modules from many sources
  • ongoing development of Python and modules

• Some bad things about Python
  • lots of modules from many sources
  • ongoing development of Python and modules

• A solution
  • Maintain (or have option to create) separate environments (or manifests) for different projects

Managing your environment

• virtualenv
  • general Python solution – http://virtualenv.pypa.io
  • modules are installed with pip – https://pip.pypa.io

$ pip install virtualenv             # install virtualenv
$ virtualenv ENV1                    # create a new environment ENV1
$ source ENV1/bin/activate           # set PATH to our environment
(ENV1)$ pip install emcee            # install modules into ENV1
(ENV1)$ pip install numpy==1.8.2     # install specific version
(ENV1)$ python                       # use our custom environment
(ENV1)$ deactivate                   # return our PATH to normal

Managing your environment

• virtualenv
  • can record current state of modules to a 'requirements' file
  • using that file can always recreate the same environment

(ENV1)$ pip freeze > requirements.txt
(ENV1)$ cat requirements.txt
emcee==2.1.0
numpy==1.8.2
(ENV1)$ deactivate
$ virtualenv ENV2
$ source ENV2/bin/activate
(ENV2)$ pip install -r requirements.txt
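
The pinned name==version lines of a requirements file can also be checked from inside Python. The following is a minimal sketch, assuming only exact == pins as in the requirements.txt shown here; parse_requirements and check_environment are hypothetical helper names, not part of pip or virtualenv.

```python
from importlib import metadata


def parse_requirements(text):
    """Return {name: version} for lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_environment(pins):
    """Return {name: (wanted, installed)} for every pin that is not satisfied."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None                   # package is not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


pins = parse_requirements("emcee==2.1.0\nnumpy==1.8.2\n")
print(pins)
print(check_environment(pins))   # empty dict only if both pins are satisfied
```

An empty result means the active environment matches the file; anything else lists the wanted and installed versions side by side.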

Managing your environment

• conda – http://conda.pydata.org

• specific to the Anaconda Python distribution

• similar to 'pip', but can install binaries (not just python)

• can use pip within a conda environment

$ conda create -n ENV1                            # create a new environment ENV1
$ source activate ENV1                            # set PATH to our environment
$ conda install numpy                             # install modules into ENV1
$ conda install -c thebamf emcee                  # install from channel
$ source deactivate                               # return our PATH to normal
$ conda list -n ENV1 -e > requirements.txt        # clone ENV1
$ conda create -n ENV2 --file requirements.txt    # to ENV2

Managing your environment

• Updating packages

$ conda update --all

$ conda update scipy emcee

OR

$ pip install --upgrade scipy emcee   # pip has no '--all' option; name the packages to upgrade

Efficiently handling large datasets

Databases

• Python has tools for accessing most (all?) databases

• e.g. MySQL, SQLite, MongoDB, Postgres, …

• Allow one to work with huge datasets

• Data can be at remote locations

• However, most databases are designed for webserver use

• Not optimised for data analysis
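
As an illustration of database access from Python, here is a minimal sketch using the standard-library sqlite3 module with a throwaway in-memory database. The table name, column names and the three rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE stars (name TEXT, mag REAL)")
conn.executemany(
    "INSERT INTO stars VALUES (?, ?)",
    [("Sirius", -1.46), ("Vega", 0.03), ("Polaris", 1.98)],
)
conn.commit()

# Let the database do the selection; only matching rows reach Python
bright = conn.execute(
    "SELECT name, mag FROM stars WHERE mag < ? ORDER BY mag", (1.0,)
).fetchall()
print(bright)
conn.close()
```

The ? placeholders let the driver handle quoting, which is safer than building the SQL string by hand.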

PyTables

• http://pytables.github.io

• For creating, storing and analysing datasets

• from simple, small tables to complex, huge datasets

• standard HDF5 file format

• incredibly fast – even faster with indexing

• uses on-the-fly block compression

• designed for modern systems

• fast multi-core CPU; large, slow memory

• "in-kernel" – data and algorithm are sent to CPU in optimal way

• "out-of-core" – avoids loading whole dataset into memory
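
The "out-of-core" idea can be sketched without PyTables at all: process a file one chunk at a time so that memory use stays bounded. This is only an illustration of the principle using the standard library; it is not how PyTables is implemented, and the chunk size, file layout and function names are assumptions made for the example.

```python
import os
import struct
import tempfile

CHUNK = 4096                      # values per chunk (arbitrary choice)


def write_values(path, n):
    """Write the float64 values 0, 1, ..., n-1 to a binary file."""
    with open(path, "wb") as f:
        for start in range(0, n, CHUNK):
            block = range(start, min(start + CHUNK, n))
            f.write(struct.pack("<%dd" % len(block), *block))


def chunked_mean(path):
    """Mean of all float64 values in the file, reading one chunk at a time."""
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while True:
            raw = f.read(CHUNK * 8)            # 8 bytes per float64
            if not raw:
                break
            values = struct.unpack("<%dd" % (len(raw) // 8), raw)
            total += sum(values)
            count += len(values)
    return total / count


path = os.path.join(tempfile.mkdtemp(), "values.bin")
write_values(path, 100000)
print(chunked_mean(path))
```

Only CHUNK values are held in memory at any moment, however large the file is.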

PyTables

>>> import numpy as np
>>> from tables import *
>>> h5file = open_file("test.h5", mode="w")    # was openFile in PyTables 2.x
>>> x = h5file.create_array("/", "x", np.arange(1000))
>>> y = h5file.create_array("/", "y", np.sqrt(np.arange(1000)))
>>> h5file.close()

• Can store many things in one HDF5 file (like FITS)

• Tree structure

• Everything in a group (starting with root group, '/')

• Data stored in leaves

• Arrays (e.g. n-dimensional images)

PyTables

>>> class MyTable(IsDescription):
...     z = Float32Col()
...
>>> table = h5file.create_table("/", "mytable", MyTable)
>>> row = table.row
>>> for i in range(1000):
...     row["z"] = i**(3.0/2.0)
...     row.append()
...
>>> table.flush()
>>> z = table.cols.z

• Tables (columns with different formats)

• described by a class

• accessed by a row iterator

PyTables Expr

>>> r = h5file.create_array("/", "r", np.zeros(1000))
>>> xyz = Expr("x*y*z")
>>> xyz.set_output(r)
>>> xyz.eval()
/r (Array(1000,)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
>>> r.read(0, 10)
array([   0.        ,    1.        ,    7.99999986,   26.9999989 ,
         64.        ,  124.99999917,  216.00000085,  343.00001259,
        511.99999124,  729.        ])

• Expr enables in-kernel & out-of-core operations

PyTables Expr

>>> r_bigish = [ row['z'] for row in
...              table.where('(z > 1000) & (z <= 2000)') ]
>>> for big in table.where('z > 10000'):
...     print('A big z is {}'.format(big['z']))

• where enables in-kernel selections

• There is also a where in Expr

pandas

• Python Data Analysis Library
  • http://pandas.pydata.org

• Easy-to-use data structures
  • DataFrame (more friendly recarray)
  • Handles missing data (more friendly masked array)
  • read and write various data formats
  • data-alignment
  • tries to be helpful, though not always intuitive

• Easy to combine data tables

• Surprisingly fast!

Notebook demo…

Graphical interfaces

GUIs

• Give your scientific code a friendly face!
  • easy configuration
  • monitor progress
  • particularly for public code, cloud computing, HPC

• Many modules to construct GUIs in Python

• PyJs (used to be PyJamas) – browser based

• Kivy – modern and cross-platform

• Tkinter – built-in

• Qt – C++

• wx – C++

GUIs

wxPython is popular – recently updated to Python 3: Project Phoenix

http://www.wxpython.org

e.g., https://github.com/bamford/control/

GUIs

For simple GUI, especially if output is a plot…

matplotlib widgets are very useful

• layout controls on a canvas

• functionality implemented using callback functions:

• every time a control is activated it will call the function

• function then examines the event and takes action
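
The callback pattern can be sketched without any GUI toolkit. ToyButton below is a hypothetical stand-in written for this example, not a matplotlib class; its on_clicked method is only similar in spirit to the registration methods that matplotlib's widgets provide.

```python
class ToyButton:
    """Hypothetical stand-in for a GUI control that stores callbacks."""

    def __init__(self, label):
        self.label = label
        self._callbacks = []

    def on_clicked(self, func):
        """Register func to be called each time the button is activated."""
        self._callbacks.append(func)

    def click(self, x, y):
        """Simulate the GUI event loop delivering a click event."""
        event = {"label": self.label, "x": x, "y": y}
        for func in self._callbacks:
            func(event)


log = []


def report(event):
    # The callback examines the event and takes action
    log.append("{label} clicked at ({x}, {y})".format(**event))


button = ToyButton("Reset")
button.on_clicked(report)
button.click(10, 20)
print(log)
```

In a real GUI the toolkit's event loop calls click-like code for you; your own code only registers the callback.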

Python web frameworks

Django: a high-level Python Web framework that encourages rapid development and clean, pragmatic design.

and many others, e.g. Zope (massive), web2py (light), …

(and PyJs)

• An (unscientific) example

Optimising your code

Testing performance

timeit – use in interpreter, script or command line

python -m timeit [-n N] [-r N] [-s S] [statement ...]

Options:

-s S, --setup=S     statement to be executed once initially (default pass)
-n N, --number=N    how many times to execute 'statement' (default: take ~0.2 sec total)
-r N, --repeat=N    how many times to repeat the timer (default 3)

IPython magic version

%timeit      # one line
%%timeit     # whole notebook cell

Testing performance

# fastest way to calculate x**5?

$ python -m timeit -s 'from math import pow; x = 1.23' 'x*x*x*x*x'

10000000 loops, best of 3: 0.161 usec per loop

$ python -m timeit -s 'from math import pow; x = 1.23' 'x**5'

10000000 loops, best of 3: 0.111 usec per loop

$ python -m timeit -s 'from math import pow; x = 1.23' 'pow(x, 5)'

1000000 loops, best of 3: 0.184 usec per loop
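
The same comparison can be run from a script with the timeit module. This is a minimal sketch; the choice of 100000 executions per run is arbitrary, and the timings it prints will differ from machine to machine.

```python
import timeit

setup = "from math import pow; x = 1.23"
statements = ["x*x*x*x*x", "x**5", "pow(x, 5)"]

results = {}
for stmt in statements:
    # three timing runs of 100000 executions each; keep the best run
    times = timeit.repeat(stmt, setup=setup, repeat=3, number=100000)
    results[stmt] = min(times) / 100000      # seconds per execution

for stmt, seconds in results.items():
    print("{:12s} {:.3f} usec per loop".format(stmt, seconds * 1e6))
```

Taking the minimum of the repeats mirrors the "best of 3" figure that the command-line tool reports.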

Profiling

• Understand which parts of your code limit its execution time

• print summary to screen, or save file for detailed analysis

From shell

$ python -m cProfile -o program.prof my_program.py

From IPython

%prun -D program.prof my_function()
%%prun     # profile an entire notebook cell

Lots of functionality… see docs
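
Profiling can also be driven from inside a script with cProfile and pstats. A minimal sketch follows; slow_sum and my_function are toy functions invented for the example.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop, so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def my_function():
    return [slow_sum(20000) for _ in range(50)]


profiler = cProfile.Profile()
profiler.enable()
my_function()
profiler.disable()

# Sort by cumulative time and print the five most expensive entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The cumulative column shows time spent in a function including everything it calls, which is usually the quickest way to spot the bottleneck.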

Profiling

Nice visualisation with snakeviz – http://jiffyclub.github.io/snakeviz/

$ conda install snakeviz

OR

$ pip install snakeviz

In IPython:

%load_ext snakeviz
%snakeviz my_function()
%%snakeviz     # profile entire cell

Extremely neat Python

Coding Guidelines

• PEP8
  • https://www.python.org/dev/peps/pep-0008/
  • several tools to test compliance – search ‘pep8’

• 4 spaces for indentation

• 79 characters maximum line length (72 for docstrings)

• use blank lines sparingly, but always around functions, etc.

• whitespace
  • one space around operators (except = in function args)
  • after commas

• ClassNames and function_names
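
A short illustration of these conventions; the class and function are invented for the example and carry no meaning beyond showing the layout.

```python
class DataPoint:
    """A measured value with its uncertainty (CapWords class name)."""

    # no spaces around = in function arguments
    def __init__(self, value, error=0.0):
        self.value = value
        self.error = error

    # lower_case_with_underscores for function and method names
    def relative_error(self):
        # one space around operators
        return self.error / self.value


def mean_value(points):
    """Return the mean of the values of a list of DataPoint objects."""
    total = sum(point.value for point in points)
    return total / len(points)


# a space after each comma, none before
points = [DataPoint(2.0, 0.1), DataPoint(4.0, error=0.2)]
print(mean_value(points))
```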

Writing robust code

Tests

• Unit tests
  • test individual units of code
  • specific units
    • e.g. a single function or interaction between functions
  • tested as generally as possible

• Functional tests
  • test the whole programme under a variety of inputs

• Regression tests
  • check for inconsistent behaviour between consecutive versions
  • detect new bugs, ensure old bugs do not reoccur

Tests

• Several testing frameworks

• unittest is the main Python module

• doctest enables tests in documentation strings

• nose is a third-party module that nicely automates testing

• pytest is preferred by astropy

• but generally all interoperable

• Review docs, but basically just name any tests test_*

• astropy has detailed testing guidelines:

• http://docs.astropy.org/en/stable/development/testguide.html
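
A minimal unit-test sketch using the built-in unittest module, with test methods named test_* so that test runners discover them. The primes function is a small trial-division helper written for this example; the test cases are illustrative choices.

```python
import unittest


def primes(kmax):
    """Return the first kmax prime numbers (trial division)."""
    p = []
    n = 2
    while len(p) < kmax:
        if all(n % q != 0 for q in p):
            p.append(n)
        n += 1
    return p


class TestPrimes(unittest.TestCase):
    def test_first_primes(self):
        self.assertEqual(primes(5), [2, 3, 5, 7, 11])

    def test_zero_requested(self):
        self.assertEqual(primes(0), [])

    def test_length(self):
        self.assertEqual(len(primes(100)), 100)


# Run the tests programmatically; from a shell, 'python -m unittest' also works
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPrimes)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Because the class and methods follow the usual naming conventions, pytest should also collect and run the same file unchanged.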

Tests

• Online testing (continuous integration) services

• Integrate with GitHub

• Free for public open source projects

• Travis CI: https://travis-ci.org

• Test coverage reports

• Coveralls: https://coveralls.io

• astropy affiliated package template

• guides you through setup of these services

• https://github.com/astropy/package-template

Squeezing out extra speed

Multiprocessing

• Python includes modules for writing "parallel" programs:

• threading – limited by the Global Interpreter Lock

• multiprocessing – generally more useful

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)     # start 4 worker processes
    z = range(10)
    print(pool.map(f, z))        # apply f to each element of z in parallel

Multiprocessing

from multiprocessing import Process
from time import sleep

def f(name):
    print('Hello {}, I am going to sleep now'.format(name))
    sleep(3)
    print('OK, finished sleeping')

if __name__ == '__main__':
    p = Process(target=f, args=('Steven',))
    p.start()     # start additional process
    sleep(1)      # carry on doing stuff
    print('Wow, how lazy is that function!')
    p.join()      # wait for process to complete

$ python thinking.py
Hello Steven, I am going to sleep now
Wow, how lazy is that function!
OK, finished sleeping

(Really, should use a lock to avoid writing output to screen at same time)

Mixing Python and C – fast and flexible

Cython is used for compiling Python-like code to machine-code
  • supports a big subset of the Python language
  • conditions and loops run 2-8x faster, overall 30% faster for plain Python code
  • add types for speedups (hundreds of times)
  • easily use native libraries (C/C++/Fortran) directly

• Cython code is turned into C code
  • uses the CPython API and runtime

• Coding in Cython is like coding in Python and C at the same time!

Some material borrowed from Dag Sverre Seljebotn (University of Oslo) EuroSciPy 2010 presentation

Cython

Use cases:

• Performance-critical code
  • which does not translate to array-based approach (numpy / pytables)
  • existing Python code → optimise critical parts

• Wrapping existing C/C++ libraries
  • particularly higher-level Pythonised wrapper
  • for one-to-one wrapping other tools might be better suited

Cython

Cython code must be compiled.

Two stages:

• A .pyx file is compiled by Cython to a .c file, containing the code of a Python extension module

• The .c file is compiled by a C compiler
  • Generated C code can be built without Cython installed
  • Cython is a developer dependency, not a build-time dependency
  • Generated C code works with Python 2.3+
  • The result is a .so file (or .pyd on Windows) which can be imported directly into a Python session

Cython

Ways of building Cython code:

• Run cython command-line utility and compile the resulting C file
  • use favourite build tool
  • for cross-system operation you need to query Python for the C build options to use

• Use pyximport to import Cython .pyx files as if they were .py files, building on the fly (recommended to start)
  • things get complicated if you must link to native libraries
  • larger projects tend to need a build phase anyway

• Write a distutils setup.py
  • standard way of distributing, building and installing Python modules

Cython

• Cython supports most of normal Python

• Most standard Python code can be used directly with Cython
  • typical speedups of (very roughly) a factor of two
  • should not ever slow down code – safe to try
  • name file .pyx or use pyimport = True

>>> import pyximport
>>> pyximport.install()
>>> import mypyxmodule     # converts and compiles on the fly

>>> pyximport.install(pyimport = True)
>>> import mypymodule      # converts and compiles on the fly
                           # should fall back to Python if fails

Cython

• Big speedup from defining types of key variables

• Use native C-types (int, double, char *, etc.)

• Use Python C-types (Py_int_t, Py_float_t, etc.)

• Use cdef to declare variable types

• Also use cdef to declare C-only functions (with return type)
  • can also use cpdef to declare functions which are automatically treated as C or Python depending on usage

• Don't forget function arguments (but note cdef not used here)

Cython – primes example

• Efficient algorithm to find first N prime numbers

primes.py

def primes(kmax):
    p = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            k = k + 1
            p.append(n)
        n = n + 1
    return p

$ python -m timeit -s 'import primes as p' 'p.primes(100)'
1000 loops, best of 3: 1.35 msec per loop

Cython – primes example

cprimes.pyx

def primes(kmax):
    p = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            k = k + 1
            p.append(n)
        n = n + 1
    return p

$ python -m timeit -s 'import pyximport; pyximport.install(); import cprimes as p' 'p.primes(100)'
1000 loops, best of 3: 731 usec per loop

1.8x speedup

Cython – primes example

xprimes.pyx

def primes(int kmax):           # declare types of parameters
    cdef int n, k, i            # declare types of variables
    cdef int p[1000]            # including arrays
    result = []                 # can still use normal Python types
    if kmax > 1000:             # in this case need to hardcode limit
        kmax = 1000
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result               # return Python object

40.8 usec per loop

33x speedup

contains only C-code

Cython and Numpy

• Cython provides a way to quickly access Numpy arrays with specified types and dimensionality
  → for implementing fast specific algorithms

• Also provides way to create generalized Ufuncs

• Can be useful, but often using functions provided by numpy, scipy, numexpr or pytables will be easier and faster

Numba

• JIT: just in time compilation of Python functions

• Compilation for both CPU and GPU hardware

import numpy as np
from numba import jit

@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

a = np.arange(10000).reshape(1000,10)

%timeit sum2d(a)
1000 loops, best of 3: 334 µs per loop

%timeit sum2d(a) # without @jit
1 loops, best of 3: 2.15 µs per loop

Numba

• JIT: just in time compilation of Python functions

• Compilation for both CPU and GPU hardware

from numbapro import vectorize, float32

# MULTITHREADED UFUNC
@vectorize([float32(float32, float32)], target='parallel')
def sum(a, b):
    return a + b

# CUDA ACCELERATED UFUNC
@vectorize([float32(float32, float32)], target='gpu')
def sum(a, b):
    return a + b

Anaconda Accelerate and IOPro

• Accelerate
  • Optimised versions of numpy, numexpr, scipy, scikit-learn, etc.
  • NumbaPro: extends and speeds up Numba

• IOPro
  • loads very large NumPy arrays (and Pandas DataFrames) efficiently
  • from files, SQL databases, and NoSQL stores
  • drop-in replacement for NumPy loadtxt() and genfromtxt()

• Free for academics:
  • https://www.continuum.io/anaconda-academic-subscriptions-available

The End

An introduction to scientific programming with