Introduction to Programming with Pythonipp211/wiki.files/V-Lecture 12...Python Libraries for Data Science SciPy: collection of algorithms for linear algebra, differential equations,

Lecture 12:


Data analysis using Python

Time complexity

Introduction to Programming

with Python

Programming skills for data analysis are essential in the era of data explosion


Last time: Scientific computing

• Manipulating large arrays and matrices of numeric data with NumPy

• Plotting with Matplotlib

• Simple data analysis

Today

• Python Libraries for Data Science

• Data Analysis using NumPy, SciPy and Pandas

• Image Processing

• Time complexity


Python Libraries for Data Science

Many popular Python toolboxes/libraries:

• NumPy

• SciPy

• Pandas

Visualization libraries

• matplotlib

and many more …



NumPy:

• introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects

• provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance

• many other Python libraries are built on NumPy


Link: http://www.numpy.org/
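To make the vectorization point concrete, here is a minimal sketch (the array contents are made up for illustration) that computes the same squares once with an explicit Python loop and once with a single array expression:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data

# Loop version: one Python-level multiplication per element.
squares_loop = [v * v for v in values]

# Vectorized version: one expression applied to the whole array at once.
squares_vec = values ** 2

print(squares_loop)
print(squares_vec)
```

On large arrays the vectorized form is usually much faster, because the loop runs inside NumPy's compiled code rather than in the interpreter.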



SciPy:

• collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more

• part of the SciPy Stack

• built on NumPy


Link: https://www.scipy.org/scipylib/


matplotlib:

• Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats

• a set of functionalities similar to those of MATLAB

• line plots, scatter plots, bar charts, histograms, pie charts etc.

• relatively low-level; some effort needed to create advanced visualizations

Link: https://matplotlib.org/




Pandas:

• adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)

• provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.

• allows handling missing data


Link: http://pandas.pydata.org/


Example:

Simple data analysis using Numpy

Processing medical data

Real data analysis example

• Look at the file inflammation-01.csv

• (download the lecture’s code)

• Tables as CSV files

• Text files

• Each row holds the same number of columns

• Values are separated by commas


Real data analysis example

• Look at the file inflammation-01.csv

• The data: inflammation levels in patients after a treatment, stored in CSV format

• Each row holds information for a single patient

• The columns represent successive days.

• The first few lines in the file:


(Figure: the first lines of the file; rows are patients, columns are days)

Reading the data using NumPy


Properties of the data
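The code for these two slides is not preserved in the text, so the following is only a sketch of how such a file could be read and inspected with np.loadtxt. A tiny made-up table stands in for inflammation-01.csv so the example runs on its own:

```python
import io
import numpy as np

# Stand-in for inflammation-01.csv (made-up values): 3 patients x 5 days.
# With the real file, pass its path instead of the StringIO object.
sample = io.StringIO("0,0,1,3,1\n0,1,2,1,2\n0,1,1,3,3\n")

data = np.loadtxt(sample, delimiter=",")

print(type(data))   # the result is a NumPy array
print(data.dtype)   # element type (floating point by default)
print(data.shape)   # (number of patients, number of days)
```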


Tasks

• Remove the first and last 10 days

• Plot the average, min, and max inflammation score

per day


Data trimming and plots


# remove the first 10 and the last 10 days
n, m = data.shape
data = data[:, 10:m-10]

How can we get the average/min/max of each day?

Removing first and last 10 days


Inflammation values per day
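One possible answer to the question above is to aggregate along axis 0, which collapses the patient axis and leaves one value per day. This sketch uses made-up stand-in data (3 patients, 30 days) rather than the real file:

```python
import numpy as np

# Hypothetical stand-in data: 3 patients x 30 days.
data = np.arange(3 * 30, dtype=float).reshape(3, 30)

n, m = data.shape
trimmed = data[:, 10:m - 10]       # drop the first 10 and the last 10 days

daily_mean = trimmed.mean(axis=0)  # axis=0 aggregates over patients
daily_min = trimmed.min(axis=0)
daily_max = trimmed.max(axis=0)

print(trimmed.shape)
print(daily_mean[0], daily_min[0], daily_max[0])
```

Each of the three result arrays can then be passed to plt.plot to draw one curve per statistic.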


Grade Analysis (Example for Data Analysis using NumPy)


We’ll use the following imports:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import scipy.cluster.vq

import scipy.cluster.hierarchy

In this example we will analyze the following grade

sheet in python.

The table is saved as a CSV file.

Grade Analysis


grades.csv

1. Load the grades from a CSV file

• We will read a complex tabular file containing column and row

headers into NumPy Arrays.

2. Count failing grades in total and per student.

3. Visualize the data using a box plot and a heatmap

4. Sort the table based on student’s average

5. Cluster the students into groups based on grade similarity

1. Using K-means clustering

2. Using Hierarchical clustering

Analysis steps


Reading the grades table from file

• Read CSV file using Pandas library

Data

Column Headers

Row

Headers

StudentGrades.csv

                     Yael  Nadav  Michal  Shoshana  Danielle  Omer  Yarden  Avi  Roy  Tal
Programming            46     90      56        87        97    87      43   54   98   45
Marine Biology         60     92      45        89        93    91      54   48   91   58
Stellar Cartography    59     60      47        72        85    86      58   40   85   57
Math                   94     65      58        80        90    85      92   55   92   88
History                95     70      60        78        90    81      87   53   90   86
Planet Survival        85     78      52        73        98    79      92   50  100   87
Art                    97     70      50        70        98    75     100   51   95   84

Pandas DataFrame

https://pandas.pydata.org/pandas-docs/stable/dsintro.html
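As a rough sketch of the loading step, the grade table can be read into a DataFrame with pd.read_csv. The table is embedded as text here so the example is self-contained; the header name "Classes" for the first column is an assumption, not something the slides confirm:

```python
import io
import pandas as pd

# Grade table from the slide. The first header ("Classes") is assumed.
CSV = """Classes,Yael,Nadav,Michal,Shoshana,Danielle,Omer,Yarden,Avi,Roy,Tal
Programming,46,90,56,87,97,87,43,54,98,45
Marine Biology,60,92,45,89,93,91,54,48,91,58
Stellar Cartography,59,60,47,72,85,86,58,40,85,57
Math,94,65,58,80,90,85,92,55,92,88
History,95,70,60,78,90,81,87,53,90,86
Planet Survival,85,78,52,73,98,79,92,50,100,87
Art,97,70,50,70,98,75,100,51,95,84
"""

# With a real file: pd.read_csv("StudentGrades.csv")
df = pd.read_csv(io.StringIO(CSV))

print(df.shape)      # 7 courses; 1 label column + 10 student columns
print(df["Yael"])    # a single column is a pandas Series
print(df.iloc[0])    # first row, selected by position
```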


Grades table

columns

rows

Accessing columns and rows

Column:

Yael’s

grades

Classes

Row:

Pandas Series



Row??

“Programming”

is not a row

or a column

Numeric Row Selection: iloc

Correcting the index column

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html


Why has the DataFrame not changed?

The change did not occur: most Pandas operations are not in-place!

Using assignment

Use assignment to keep the result!
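The point about assignment can be illustrated with a small sketch. The three-row table and the column name "Classes" are assumptions made for the example:

```python
import pandas as pd

# Small illustrative table; the column name "Classes" is assumed.
df = pd.DataFrame({
    "Classes": ["Programming", "Math", "Art"],
    "Yael": [46, 94, 97],
    "Nadav": [90, 65, 70],
})

df.set_index("Classes")        # returns a new DataFrame; df is unchanged
print(list(df.columns))        # "Classes" is still an ordinary column

df = df.set_index("Classes")   # assign the result back to keep it
print(list(df.columns))        # now only the student columns remain
print(df.loc["Math", "Yael"])  # rows can be selected by course name
```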

After Updating the index:

• Yael’s grades: before:

Row\Column slicing using indices

• Note the inclusive range
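A short sketch of the difference between label-based and position-based slicing, using a subset of the grade table. With loc both end labels are included, while iloc follows the usual Python convention and excludes the end position:

```python
import pandas as pd

grades = pd.DataFrame(
    [[46, 90, 56], [60, 92, 45], [59, 60, 47], [94, 65, 58]],
    index=["Programming", "Marine Biology", "Stellar Cartography", "Math"],
    columns=["Yael", "Nadav", "Michal"],
)

# Label-based: both "Stellar Cartography" and "Nadav" are included.
by_label = grades.loc["Programming":"Stellar Cartography", "Yael":"Nadav"]

# Position-based: row 2 and column 1 are excluded.
by_position = grades.iloc[0:2, 0:1]

print(by_label.shape)
print(by_position.shape)
```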

Data handling made easy with Pandas


After Loading:

import numpy as np

def load_table_as_array(filename):
    """Read a CSV table with column and row headers into NumPy arrays."""
    with open(filename, 'r') as f:
        # Parse the first line, which holds the column headers
        header_line = f.readline().strip(',\n')
        column_names = header_line.split(',')
        # Parse the rest of the file
        mat = []
        row_names = []
        for line in f:
            tokens = line.rstrip().split(',')
            row_names.append(tokens[0])  # first token is the row header
            values = [float(n) for n in tokens[1:]]  # convert to float
            mat.append(values)  # append the current row to the matrix
    row_names = np.array(row_names)
    column_names = np.array(column_names)
    data = np.array(mat)
    return data, column_names, row_names


Without Pandas..

List Comprehension

provides another way

to produce a list

Reminder …

List Comprehensions

• Python supports a concept called "list comprehensions". It can be used to construct lists in a very natural, easy way:


>>> S = [x**2 for x in range(10)]

>>> V = [2**i for i in range(13)]

>>> M = [x for x in S if x % 2 == 0]

>>>

>>> print(S); print(V); print(M)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

[0, 4, 16, 36, 64]

Total number of fails

• How many grades below 60 are there in the entire table ?
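The slide's own code is not preserved in the text; one way to answer the question, sketched here with the grade table embedded as text (the "Classes" header is an assumption), is to compare the whole table against 60 and sum the resulting booleans:

```python
import io
import pandas as pd

CSV = """Classes,Yael,Nadav,Michal,Shoshana,Danielle,Omer,Yarden,Avi,Roy,Tal
Programming,46,90,56,87,97,87,43,54,98,45
Marine Biology,60,92,45,89,93,91,54,48,91,58
Stellar Cartography,59,60,47,72,85,86,58,40,85,57
Math,94,65,58,80,90,85,92,55,92,88
History,95,70,60,78,90,81,87,53,90,86
Planet Survival,85,78,52,73,98,79,92,50,100,87
Art,97,70,50,70,98,75,100,51,95,84
"""
grades = pd.read_csv(io.StringIO(CSV), index_col=0)

# (grades < 60) is a table of booleans; True counts as 1 when summed.
# The first sum() works per column, the second adds the column totals.
total_fails = int((grades < 60).sum().sum())

print(total_fails)  # 21 for this table
```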


Number of fails per student

• Use conditional indexing


all cells where

value is not >=60

are set to NaN

isna() method

• returns True for NaN, False otherwise


Now use sum

count NaN values per student
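The three steps (conditional indexing, isna, sum) can be sketched as follows. The table is embedded as text and the "Classes" header is an assumption:

```python
import io
import pandas as pd

CSV = """Classes,Yael,Nadav,Michal,Shoshana,Danielle,Omer,Yarden,Avi,Roy,Tal
Programming,46,90,56,87,97,87,43,54,98,45
Marine Biology,60,92,45,89,93,91,54,48,91,58
Stellar Cartography,59,60,47,72,85,86,58,40,85,57
Math,94,65,58,80,90,85,92,55,92,88
History,95,70,60,78,90,81,87,53,90,86
Planet Survival,85,78,52,73,98,79,92,50,100,87
Art,97,70,50,70,98,75,100,51,95,84
"""
grades = pd.read_csv(io.StringIO(CSV), index_col=0)

# Conditional indexing: cells that are not >= 60 become NaN.
passing = grades[grades >= 60]

# isna() marks the NaN cells; sum() counts them per column (student).
fails_per_student = passing.isna().sum()

# A more direct formulation that gives the same counts.
fails_direct = (grades < 60).sum()

print(fails_per_student)
```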

NaNs represent

grades < 60

Same as

Number of fails per student

Without Pandas:
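With plain NumPy arrays, a comparable count might look like this sketch, where rows are courses and columns are students:

```python
import numpy as np

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

# Boolean mask of failing grades, summed over the course axis (axis=0).
fails_per_student = (data < 60).sum(axis=0)

for name, count in zip(students, fails_per_student):
    print(name, count)
```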


Plotting a boxplot of the grades
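The plotting code itself is not preserved in the text. A minimal sketch with matplotlib could look like the following; drawing one box per student is an assumption about the orientation used in the lecture, and only three courses are included to keep the example short:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

students = ["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
            "Omer", "Yarden", "Avi", "Roy", "Tal"]
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
], dtype=float)

fig, ax = plt.subplots()
result = ax.boxplot(data)              # for a 2D array: one box per column
ax.set_xticklabels(students, rotation=45)
ax.set_ylabel("Grade")
# In an interactive session, call plt.show() to display the figure.
```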


The resulting Box-Plot


Querying the data


argmin() and argmax()

Return the index of the minimal/maximal element in an array.
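A small illustration of both functions, on a made-up 1D array and on a 2D array with an axis argument:

```python
import numpy as np

a = np.array([70.3, 72.1, 64.9, 79.9, 79.0])

print(np.argmin(a))  # position of the smallest value
print(np.argmax(a))  # position of the largest value

# With a 2D array, axis selects the direction of the search.
m = np.array([[1, 9],
              [7, 3]])
print(m.argmax(axis=0))  # row index of the maximum in each column
```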


Data queries

What is the name of the student with the highest

average ?
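One way to answer this with NumPy is to average each column and look up the name at the argmax position. This is a sketch on the grade table, not necessarily the lecture's code:

```python
import numpy as np

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

averages = data.mean(axis=0)   # one average per student (column)
best = np.argmax(averages)     # index of the first maximal average

print(students[best], averages[best])
```

In this table Danielle and Roy have the same average (93.0); argmax reports the first of the tied positions.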


Data queries

What’s the name of the course with minimal number of

failing students?


Is the course with the least number of failures also the

course with the highest average?
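Both course-level questions can be explored by aggregating along axis 1 (over students). The following sketch compares the two answers for the grade table:

```python
import numpy as np

courses = np.array(["Programming", "Marine Biology", "Stellar Cartography",
                    "Math", "History", "Planet Survival", "Art"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

fails_per_course = (data < 60).sum(axis=1)  # failing students per course
course_means = data.mean(axis=1)            # average grade per course

fewest_fails = courses[np.argmin(fails_per_course)]
highest_mean = courses[np.argmax(course_means)]

print(fewest_fails, highest_mean, fewest_fails == highest_mean)
```

For this table the two courses differ: History has the fewest failing students, while Math has the highest average.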


Data queries

Plotting an image of the table (Heatmap)


Draw a heatmap

Since we are going to draw several heatmaps, let’s

wrap this code as a function:
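The function from the slide is not preserved in the text. A possible version, with the hypothetical name draw_heatmap and an arbitrarily chosen color map, might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

def draw_heatmap(data, row_labels, col_labels, title=""):
    """Draw a matrix as a color image (hypothetical helper)."""
    fig, ax = plt.subplots()
    image = ax.imshow(data, cmap="hot", aspect="auto",
                      interpolation="nearest")
    ax.set_xticks(range(len(col_labels)))
    ax.set_xticklabels(col_labels, rotation=90)
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels)
    ax.set_title(title)
    fig.colorbar(image, ax=ax)   # legend mapping colors to grades
    return fig, ax

# Usage on a small part of the grade table.
sample = np.array([[46, 90, 56], [94, 65, 58]], dtype=float)
fig, ax = draw_heatmap(sample, ["Programming", "Math"],
                       ["Yael", "Nadav", "Michal"], title="Grades")
```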


Reordering the matrix based on

student’s average


How can we sort the matrix columns based on their mean ?

Reordering the grade matrix based on student’s average

Using argsort to retrieve the indices that would sort the array
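A sketch of the argsort idea on the grade table: compute the per-student averages, ask argsort for the order that would sort them, and use that order to index both the columns and the names. The kind="stable" argument is added here only so that tied averages keep their original order:

```python
import numpy as np

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

averages = data.mean(axis=0)
order = np.argsort(averages, kind="stable")  # indices, lowest average first

sorted_data = data[:, order]        # reorder the matrix columns
sorted_students = students[order]   # reorder the names the same way

print(sorted_students)
```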


The resulting reordered heatmap


Clustering students based on grade similarity

• Clustering algorithms can be used to automatically partition items into groups, exposing inner matrix structures.

• Each item (a student in our case) is represented by a vector (the student’s grades).

• Items are clustered based on vector similarity

• given some distance metric (e.g., Euclidean distance)
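To illustrate the distance metric, this sketch computes the Euclidean distance between the grade vectors of three students from the table, once from the definition and once with np.linalg.norm:

```python
import numpy as np

# Grade vectors (one value per course), taken from the grade table.
danielle = np.array([97, 93, 85, 90, 90, 98, 98], dtype=float)
roy = np.array([98, 91, 85, 92, 90, 100, 95], dtype=float)
michal = np.array([56, 45, 47, 58, 60, 52, 50], dtype=float)

# Euclidean distance: square root of the sum of squared differences.
d_close = np.sqrt(np.sum((danielle - roy) ** 2))
d_far = np.linalg.norm(danielle - michal)   # same formula via NumPy

print(d_close)  # small: similar grades
print(d_far)    # large: dissimilar grades
```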


Clustering students based on grade similarity

• Clustering analysis is a type of unsupervised machine learning (without “labeled” data)

• Popular clustering algorithms include K-Means and Hierarchical clustering (both are implemented in SciPy!)


Clustering applications

• Market segmentation

• Exploring data

• Visualizing social networks (social network analysis)

• Compressing images with color reduction

Applying the K-Means clustering

algorithm to the students’ grades

• Grouping the students (columns) into K groups based

on the similarity between their grades. K is a

parameter to the algorithm.

• Each group is represented by its mean vector (named

‘centroid’).

• K-Means iteratively decreases the distance between all

vectors (students) and their assigned centroids.

(algorithmic details a bit later).


Clustering the students into 4 groups

using the K-Means algorithm
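The slide's code is not preserved in the text. A plausible sketch uses the kmeans and vq functions from scipy.cluster.vq on the transposed matrix, so that each student is one observation. K-means starts from a random initialization, so the groups and their numbering can differ between runs and need not match the lecture's output:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

obs = data.T                               # one row per student
centroids, distortion = kmeans(obs, 4)     # find (up to) 4 centroids
assignment, distances = vq(obs, centroids) # nearest centroid per student

for k in range(len(centroids)):
    print("Cluster", k, ":", students[assignment == k])
```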


The assignment list

maps each student

to one of 4 groups

without .T we would cluster the courses instead of the students

Printing the assignment of students into

the 4 clusters


Michal and Avi

are “similar”

Reordering the matrix columns based

on the clustering results
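Reordering by cluster can reuse the argsort idea. In this sketch the assignment list is written out by hand to match the cluster listing printed in this lecture (Cluster 0 to Cluster 3), and only the Programming row is used to keep the example short:

```python
import numpy as np

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
# Cluster label per student, copied from the lecture's printed result.
assignment = np.array([2, 0, 1, 0, 3, 0, 2, 1, 3, 2])
# Programming grades only, as a 1-row matrix.
data = np.array([[46, 90, 56, 87, 97, 87, 43, 54, 98, 45]], dtype=float)

# Stable sort keeps the original order inside each cluster.
order = np.argsort(assignment, kind="stable")

reordered_students = students[order]
reordered_data = data[:, order]   # same column reordering for the matrix

print(reordered_students)
```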


similar students

are grouped

together

Clustering the students into 4 groups

based on their grades

(Figure: heatmap with columns grouped into Cluster 0, Cluster 1, Cluster 2, Cluster 3)

Cluster 0 : ['Nadav' 'Shoshana' 'Omer']

Cluster 1 : ['Michal' 'Avi']

Cluster 2 : ['Yael' 'Yarden' 'Tal']

Cluster 3 : ['Danielle' 'Roy']

Bonus Example: Applying Hierarchical Clustering on the matrix columns

• Hierarchical clustering is another famous algorithm for dividing a set of items into groups based on their similarity.

• Applying to the matrix columns will arrange the columns

hierarchically based on their similarity.

• Works by iteratively uniting similar items into a cluster,

based on a given distance function (Euclidean,

correlation…).

• The output of the algorithm can be visualized as a

dendrogram.


Hierarchical Clustering – The output


Applying Hierarchical Clustering on the

matrix columns


# Step 1: Calculate the linkage matrix
Z = scipy.cluster.hierarchy.linkage(data.T, method='average', metric='euclidean')
print(Z)  # The linkage matrix describes the clustering process

[[  4.           8.           4.69041576   2.        ]
 [  2.           7.          11.18033989   2.        ]
 [  0.           6.          13.11487705   2.        ]
 [  3.           5.          17.17556404   2.        ]
 [  9.          12.          17.57609378   3.        ]
 [  1.          13.          28.46576198   3.        ]
 [ 10.          15.          43.67974589   5.        ]
 [ 14.          16.          68.29991281   8.        ]
 [ 11.          17.          85.94480647  10.        ]]

Each row in the linkage table describes an algorithm step.

On the first step, students #4 and #8 are united to form a cluster of size 2, since they have the smallest distance (4.69)

On step 5, the first cluster of size 3 is formed by adding student #9 to an existing cluster of 2 (designated as #12).
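The first row of the linkage matrix can be reproduced with a self-contained sketch that rebuilds the grade matrix and calls linkage with the same arguments as above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

Z = linkage(data.T, method="average", metric="euclidean")

# Each row: [cluster a, cluster b, distance, size of the merged cluster].
a, b, dist, size = Z[0]
print(students[int(a)], students[int(b)], dist, int(size))
```

Clusters created during the process get new ids starting at the number of items (10 here), which is why the cluster formed on step 3 appears as #12 on step 5.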



Hierarchical Clustering on the

columns


The fcluster command extracts K clusters from Z

and returns the assignment of each item to a cluster

(Corresponds to cutting the dendrogram at a level

that would produce a K-partition)
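A sketch of extracting K = 4 flat clusters with fcluster, using the "maxclust" criterion. The labels returned by fcluster start at 1, and their numbering is not guaranteed to match the K-means cluster numbers shown earlier:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

Z = linkage(data.T, method="average", metric="euclidean")

# Cut the hierarchy so that at most 4 flat clusters remain.
labels = fcluster(Z, t=4, criterion="maxclust")

for k in sorted(set(labels)):
    print("Cluster", k, ":", students[labels == k])
```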

Grade Analysis Summary

• We parsed a tabular file and loaded its data into NumPy arrays,

using Pandas

• NumPy commands were used to manipulate and explore the data.

• We used the matplotlib package to draw a box-plot and a heatmap.

• argsort was used to sort the matrix columns.

• SciPy commands for applying k-means / hierarchical clustering on

the matrix columns were used (kmeans, vq, linkage, dendrogram,

fcluster).

• Read more about hierarchical clustering here

• http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

• https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

Time complexity

list

• The Average Case assumes parameters generated uniformly at random.

• Internally, a list is represented as an array; the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move). If you need to add/remove at both ends, consider using a collections.deque instead.


Operation          Average Case   Amortized Worst Case
Copy               O(n)           O(n)
Append[1]          O(1)           O(1)
Pop last           O(1)           O(1)
Pop intermediate   O(k)           O(k)
Insert             O(n)           O(n)
Get Item           O(1)           O(1)
Set Item           O(1)           O(1)
Delete Item        O(n)           O(n)
Iteration          O(n)           O(n)
Get Slice          O(k)           O(k)
Del Slice          O(n)           O(n)
Set Slice          O(k+n)         O(k+n)
Extend[1]          O(k)           O(k)
Sort               O(n log n)     O(n log n)
Multiply           O(nk)          O(nk)
x in s             O(n)
min(s), max(s)     O(n)
Get Length         O(1)           O(1)

Generally, 'n' is the number of elements currently in the container. 'k' is either the

value of a parameter or the number of elements in the parameter.

http://en.wikipedia.org/wiki/Amortized_analysis

http://svn.python.org/projects/python/trunk/Objects/listsort.txt

Time complexity


collections.deque

• A deque (double-ended queue) is represented internally as a doubly linked list. (Well, a list of arrays rather than objects, for greater efficiency.) Both ends are accessible, but even looking at the middle is slow, and adding to or removing from the middle is slower still.

Operation     Average Case   Amortized Worst Case
Copy          O(n)           O(n)
append        O(1)           O(1)
appendleft    O(1)           O(1)
pop           O(1)           O(1)
popleft       O(1)           O(1)
extend        O(k)           O(k)
extendleft    O(k)           O(k)
rotate        O(k)           O(k)
remove        O(n)           O(n)
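The contrast between the two containers can be sketched with a few operations at the front of a sequence. The comments repeat the complexities from the tables; the example itself uses made-up values:

```python
from collections import deque

items = [1, 2, 3]
items.insert(0, 0)    # list: O(n), every element shifts one slot
items.pop(0)          # list: O(n) for the same reason

queue = deque([1, 2, 3])
queue.appendleft(0)   # deque: O(1)
queue.popleft()       # deque: O(1)
queue.append(4)       # deque: O(1)
queue.rotate(1)       # deque: O(k), moves k items from the right end to the left

print(items)
print(queue)
```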

Time complexity


dict

The Average Case times listed for dict objects assume that the hash function for the

objects is sufficiently robust to make collisions uncommon. The Average Case assumes

the keys used in parameters are selected uniformly at random from the set of all keys.

Note that there is a fast-path for dicts that (in practice) only deal with str keys; this doesn't

affect the algorithmic complexity, but it can significantly affect the constant factors: how

quickly a typical program finishes.

Operation      Average Case   Amortized Worst Case
Copy[2]        O(n)           O(n)
Get Item       O(1)           O(n)
Set Item[1]    O(1)           O(n)
Delete Item    O(1)           O(n)
Iteration[2]   O(n)           O(n)

Notes

[1] = These operations rely on the "Amortized" part of

"Amortized Worst Case". Individual actions may take

surprisingly long, depending on the history of the

container.

[2] = For these operations, the worst case n is the

maximum size the container ever achieved, rather than

just the current size. For example, if N objects are

added to a dictionary, then N-1 are deleted, the

dictionary will still be sized for N objects (at least) until

another insertion is made.
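As a rough illustration of the tables, the following sketch times a membership test on a list (O(n) scan) against the same test on a dict (O(1) on average). The sizes and repeat counts are arbitrary, and the measured times depend on the machine:

```python
import timeit

n = 100000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)   # same keys, values are None

target = n - 1   # worst case for the list: the last element

t_list = timeit.timeit(lambda: target in as_list, number=100)
t_dict = timeit.timeit(lambda: target in as_dict, number=100)

print("list:", t_list)
print("dict:", t_dict)
```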
