Lecture 12: Data Analysis Using Python; Time Complexity
Introduction to Programming with Python
Programming skills for data analysis are essential in the era of data explosion
Last time: Scientific computing
• Manipulating large arrays and matrices of
numeric data with NumPy
• Plotting with Matplotlib
• Simple data analysis
Today
• Python Libraries for Data Science
• Data Analysis using Numpy, Scipy and Pandas
• Image Processing
• Time complexity
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
Visualization libraries
• matplotlib
and many more …
Python Libraries for Data Science
NumPy:
introduces objects for multidimensional arrays and matrices, as
well as functions that make it easy to perform advanced
mathematical and statistical operations on those objects
provides vectorization of mathematical operations on arrays and
matrices, which significantly improves performance
many other Python libraries are built on NumPy
Link: http://www.numpy.org/
Python Libraries for Data Science
SciPy:
collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics
and more
part of SciPy Stack
built on NumPy
Link: https://www.scipy.org/scipylib/
matplotlib:
a Python 2D plotting library that produces publication-quality
figures in a variety of hardcopy formats
a set of functionalities similar to those of MATLAB
line plots, scatter plots, bar charts, histograms, pie charts, etc.
relatively low-level; some effort is needed to create advanced
visualizations
Link: https://matplotlib.org/
Python Libraries for Data Science
Pandas:
adds data structures and tools designed to work with
table-like data (similar to Series and Data Frames in R)
provides tools for data manipulation: reshaping,
merging, sorting, slicing, aggregation etc.
allows handling missing data
Link: http://pandas.pydata.org/
Example: Simple Data Analysis Using NumPy
Processing medical data
Real data analysis example
• Look at the file inflammation-01.csv
• (download the lecture’s code)
• Tables as CSV files
• Text files
• Each row holds the same number of columns
• Values are separated by commas
Real data analysis example
• Look at the file inflammation-01.csv
• The data: inflammation level in patients after a treatment
(CSV) format
• Each row holds information for a single patient
• The columns represent successive days.
• The first few lines of the file (rows = patients, columns = days)
Reading the data using NumPy
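A minimal sketch of reading such a CSV with np.loadtxt. The tiny in-memory table below is a made-up stand-in for inflammation-01.csv, which is not reproduced here:

```python
import io
import numpy as np

# Made-up stand-in for inflammation-01.csv: 3 patients (rows) x 5 days (columns).
csv_text = "0,1,2,3,1\n0,2,1,4,2\n1,1,3,2,0\n"

# np.loadtxt parses the comma-separated values into a 2D float array.
data = np.loadtxt(io.StringIO(csv_text), delimiter=',')

print(data.shape)  # (3, 5): (patients, days)
print(data.dtype)  # float64
```

With the real file you would pass the filename instead of the StringIO stand-in: np.loadtxt('inflammation-01.csv', delimiter=',').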
Properties of the data
Tasks
• Remove the first and last 10 days
• Plot the average, min, and max inflammation score
per day
Data trimming and plots
# remove the first and last 10 days
n, m = data.shape
data = data[:, 10:(m - 10)]
How can we get the average/min/max of each day?
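The per-day statistics follow from NumPy's axis argument: axis=0 aggregates over the patient axis (the rows), yielding one value per day. A small sketch with a made-up 2×3 matrix:

```python
import numpy as np

# Toy patients-by-days matrix standing in for the trimmed inflammation data.
data = np.array([[1.0, 4.0, 2.0],
                 [3.0, 0.0, 5.0]])

# axis=0 collapses the patient axis, yielding one value per day.
avg_per_day = data.mean(axis=0)
min_per_day = data.min(axis=0)
max_per_day = data.max(axis=0)

print(avg_per_day)  # [2.  2.  3.5]
print(min_per_day)  # [1. 0. 2.]
print(max_per_day)  # [3. 4. 5.]
```

The same arrays can then be handed to matplotlib (e.g., plt.plot(avg_per_day)) to produce the three requested curves.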
Removing first and last 10 days
Inflammation values per day
Grade Analysis (an Example of Data Analysis Using NumPy)
We’ll use the following imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.vq
import scipy.cluster.hierarchy
In this example we will analyze the following grade
sheet in python.
The table is saved as a CSV file.
Grade Analysis
grades.csv
1. Load the grades from a CSV file
• We will read a complex tabular file containing column and row
headers into NumPy Arrays.
2. Count failing grades in total and per student.
3. Visualize the data using a box plot and a heatmap
4. Sort the table based on student’s average
5. Cluster the students into groups based on grade similarity
1. Using K-means clustering
2. Using Hierarchical clustering
Analysis steps
Reading the grades table from file
• Read CSV file using Pandas library
StudentGrades.csv (column headers = student names, row headers = course names):
                     Yael  Nadav  Michal  Shoshana  Danielle  Omer  Yarden  Avi  Roy  Tal
Programming            46     90      56        87        97    87      43   54   98   45
Marine Biology         60     92      45        89        93    91      54   48   91   58
Stellar Cartography    59     60      47        72        85    86      58   40   85   57
Math                   94     65      58        80        90    85      92   55   92   88
History                95     70      60        78        90    81      87   53   90   86
Planet Survival        85     78      52        73        98    79      92   50  100   87
Art                    97     70      50        70        98    75     100   51   95   84
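Reading such a file takes one line with Pandas; index_col=0 turns the first column (the course names) into the row labels. A sketch using an in-memory stand-in for a few cells of StudentGrades.csv:

```python
import io
import pandas as pd

# A small made-up stand-in for StudentGrades.csv (first column = course names).
csv_text = (",Yael,Nadav,Michal\n"
            "Programming,46,90,56\n"
            "Math,94,65,58\n")

# index_col=0 makes the first column the row labels (the course names).
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.columns.tolist())  # ['Yael', 'Nadav', 'Michal']
print(df.index.tolist())    # ['Programming', 'Math']
```

With the real file: pd.read_csv('StudentGrades.csv', index_col=0).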
Pandas DataFrame
Link: https://pandas.pydata.org/pandas-docs/stable/dsintro.html
The grades table: courses as rows, students as columns
Accessing columns and rows
Column: one student’s grades (e.g., Yael’s)
Row: one course’s grades across the class
Pandas Series
Link: https://pandas.pydata.org/pandas-docs/stable/dsintro.html
A row? Selecting “Programming” fails: before the index is corrected, “Programming” is a value in the first data column, not a row label or a column name.
Numeric Row Selection: iloc
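To illustrate the difference between the selection styles, a small sketch (the mini-DataFrame is a made-up subset of the grades):

```python
import pandas as pd

# Courses as row labels, students as columns (a made-up subset).
df = pd.DataFrame({'Yael': [46, 94], 'Nadav': [90, 65]},
                  index=['Programming', 'Math'])

col = df['Yael']       # column selection by name -> a Series
row = df.loc['Math']   # row selection by label -> a Series
row2 = df.iloc[1]      # row selection by integer position -> the same row

print(int(col['Programming']))  # 46
print(int(row['Nadav']))        # 65
```

df.loc works with row labels, df.iloc with numeric positions; both return the row as a Series.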
Correcting the index column
Link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
Why hasn’t the DataFrame changed? The change did not occur because most Pandas operations are not in-place: they return a new object and leave the original untouched.
Using assignment: to keep the result, assign it back to the variable.
After updating the index:
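A minimal sketch of the not-in-place behavior and the assignment fix (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'Course': ['Programming', 'Math'],
                   'Yael': [46, 94]})

df.set_index('Course')          # returns a NEW DataFrame; df is unchanged
print('Course' in df.columns)   # True -> nothing happened

df = df.set_index('Course')     # assign the result back to keep it
print(df.index.tolist())        # ['Programming', 'Math']
```

Many such methods also accept inplace=True as an alternative, but reassignment is the more common idiom.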
• Yael’s grades: before:
Row/column slicing using indices
• Note that label-based slicing with .loc is inclusive of both endpoints
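A sketch contrasting the inclusive .loc label slice with the usual half-open .iloc integer slice (the mini-DataFrame is made up from a few grades):

```python
import pandas as pd

df = pd.DataFrame({'Yael': [46, 94, 95], 'Nadav': [90, 65, 70]},
                  index=['Programming', 'Math', 'History'])

# Label slicing with .loc is INCLUSIVE of both endpoints...
by_label = df.loc['Programming':'Math']
print(by_label.shape)  # (2, 2) -- 'Math' IS included

# ...while integer slicing with .iloc follows Python's half-open rule.
by_pos = df.iloc[0:1]
print(by_pos.shape)    # (1, 2) -- row 1 is excluded
```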
Data handling made easy with Pandas
After Loading:
def load_table_as_array(filename):
    f = open(filename, 'r')
    # Parse the first line, containing the column headers
    header_line = f.readline()
    header_line = header_line.strip(',\n')
    column_names = header_line.split(',')
    # Parse the rest of the file
    mat = []
    row_names = []
    for line in f:
        tokens = line.rstrip().split(',')
        row_names.append(tokens[0])  # Add first token to row header list
        values = [float(n) for n in tokens[1:]]  # Convert to float
        mat.append(values)  # Append the current row to the matrix
    f.close()
    row_names = np.array(row_names)
    column_names = np.array(column_names)
    data = np.array(mat)
    return data, column_names, row_names
Without Pandas:
Reminder: list comprehensions
• Python supports a concept called “list comprehensions”: another way to construct lists, in a very natural, easy way:
>>> S = [x**2 for x in range(10)]
>>> V = [2**i for i in range(13)]
>>> M = [x for x in S if x % 2 == 0]
>>>
>>> print(S); print(V); print(M)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
[0, 4, 16, 36, 64]
Total number of fails
• How many grades below 60 are there in the entire table?
Number of fails per student
• Use conditional indexing
• All cells whose value is not >= 60 are set to NaN
The isna() operator
• returns True for NaN, False otherwise
• Now use sum() to count the NaN values per student (the NaNs represent grades < 60)
• This gives the same result as the number of fails per student
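The counting steps above can be sketched as follows (a made-up 2×2 subset of the grades; df.where is one way to set the failing cells to NaN):

```python
import pandas as pd

df = pd.DataFrame({'Yael': [46, 94], 'Michal': [56, 58]},
                  index=['Programming', 'Math'])

# Total number of failing grades (below 60) in the whole table:
total_fails = (df < 60).sum().sum()
print(int(total_fails))  # 3  (Yael: 46; Michal: 56 and 58)

# Per student: where() keeps values meeting the condition, NaN elsewhere,
# and isna() marks those NaNs so sum() can count them per column.
masked = df.where(df >= 60)
fails_per_student = masked.isna().sum()
print(int(fails_per_student['Michal']))  # 2
```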
Without Pandas:
Plotting a boxplot of the grades
The resulting Box-Plot
Querying the data
argmin() and argmax()
Return the index of the minimal/maximal element of an array.
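For example, combining a per-column mean with argmax answers the highest-average question (the names and numbers are a made-up subset of the grades):

```python
import numpy as np

names = np.array(['Yael', 'Nadav', 'Michal'])
# Courses as rows, students as columns, matching the grades matrix.
data = np.array([[46, 90, 56],
                 [94, 65, 58]])

means = data.mean(axis=0)        # per-student averages
best = names[np.argmax(means)]   # index of the maximal average -> name
print(means)                     # [70.  77.5 57. ]
print(best)                      # Nadav
```

The Pandas equivalent would be df.mean().idxmax(), which returns the label directly.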
Data queries
What is the name of the student with the highest average?
What is the name of the course with the minimal number of failing students?
Is the course with the least number of failures also the
course with the highest average?
Plotting an image of the table (Heatmap)
Draw a heatmap
Since we are going to draw several heatmaps, let’s
wrap this code as a function:
Reordering the matrix based on the students’ averages
How can we sort the matrix columns based on their mean?
Using argsort to retrieve the indices that would sort the array
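A sketch of the argsort-based reordering (made-up 2×3 grade matrix); the same index array reorders both the data columns and the name array, keeping them consistent:

```python
import numpy as np

names = np.array(['Yael', 'Nadav', 'Michal'])
data = np.array([[46, 90, 56],
                 [94, 65, 58]])

order = np.argsort(data.mean(axis=0))  # indices that sort the column means
print(order)                           # [2 0 1]

sorted_data = data[:, order]           # reorder the columns...
sorted_names = names[order]            # ...and the headers, consistently
print(sorted_names)                    # ['Michal' 'Yael' 'Nadav']
```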
The resulting reordered heatmap
Clustering students based on grade similarity
• Clustering algorithms can be used to automatically partition items into groups, exposing inner matrix structures.
• Each item (a student, in our case) is represented by a vector (the student’s grades).
• Items are clustered based on vector similarity, given some distance metric (e.g., Euclidean distance).
• Clustering analysis is a type of unsupervised machine learning (it uses no “labeled” data).
• Popular clustering algorithms include K-Means and hierarchical clustering (both are implemented in SciPy!).
Clustering applications: market segmentation, social network analysis, exploring and visualizing data, compressing images via color reduction.
Applying the K-Means clustering
algorithm to the students’ grades
• Grouping the students (columns) into K groups based
on the similarity between their grades. K is a
parameter to the algorithm.
• Each group is represented by its mean vector (named
‘centroid’).
• K-Means iteratively decreases the distance between all
vectors (students) and their assigned centroids.
(algorithmic details a bit later).
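A minimal, deterministic sketch with scipy.cluster.vq.kmeans2 (one of the imports listed earlier). The toy 2-D points and the explicit initial centroids are made up; passing minit='matrix' avoids random initialization. kmeans2 clusters the ROWS of its input, which is why the lecture transposes the grades matrix with .T so that students become rows:

```python
import numpy as np
import scipy.cluster.vq

# Toy "students": each row is one observation vector.
points = np.array([[0.0, 1.0], [1.0, 0.0],       # blob A
                   [10.0, 11.0], [11.0, 10.0]])  # blob B

# Explicit initial centroids keep the run deterministic (minit='matrix').
init = np.array([[0.0, 0.0], [10.0, 10.0]])
centroids, labels = scipy.cluster.vq.kmeans2(points, init, minit='matrix')

print(labels)  # e.g. [0 0 1 1]: the two blobs land in different clusters
```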
Clustering the students into 4 groups
using the K-Means algorithm
The assignment list maps each student to one of the 4 groups.
(Note: without the .T transpose, we would cluster the courses instead of the students.)
Printing the assignment of students into
the 4 clusters
Michal and Avi end up in the same cluster: their grades are “similar”.
Reordering the matrix columns based
on the clustering results
Similar students are grouped together.
Clustering the students into 4 groups
based on their grades
Cluster 0 : ['Nadav' 'Shoshana' 'Omer']
Cluster 1 : ['Michal' 'Avi']
Cluster 2 : ['Yael' 'Yarden' 'Tal']
Cluster 3 : ['Danielle' 'Roy']
Bonus Example: Applying Hierarchical Clustering to the matrix columns
• Hierarchical clustering is another famous algorithm for dividing a set of items into groups based on their similarity.
• Applying it to the matrix columns will arrange the columns hierarchically based on their similarity.
• It works by iteratively uniting similar items into a cluster, based on a given distance function (Euclidean, correlation, …).
• The output of the algorithm can be visualized as a dendrogram.
Hierarchical Clustering: the output
Applying Hierarchical Clustering on the
matrix columns
# Step 1: Calculate the linkage matrix
Z = scipy.cluster.hierarchy.linkage(data.T, method='average', metric='euclidean')
print(Z)  # The linkage matrix describes the clustering process
[[ 4.          8.          4.69041576  2.        ]
 [ 2.          7.         11.18033989  2.        ]
 [ 0.          6.         13.11487705  2.        ]
 [ 3.          5.         17.17556404  2.        ]
 [ 9.         12.         17.57609378  3.        ]
 [ 1.         13.         28.46576198  3.        ]
 [10.         15.         43.67974589  5.        ]
 [14.         16.         68.29991281  8.        ]
 [11.         17.         85.94480647 10.        ]]
Each row in the linkage table describes one algorithm step.
In the first step, students #4 and #8 are united to form a cluster of size 2, since they have the smallest distance (4.69).
In step 5, the first cluster of size 3 is formed by adding student #9 to an existing cluster of two (designated #12).
Hierarchical Clustering on the columns
The fcluster command extracts K clusters from Z
and returns the assignment of each item to a cluster
(Corresponds to cutting the dendrogram at a level
that would produce a K-partition)
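A tiny end-to-end sketch of linkage followed by fcluster, on made-up 2-D points (in the lecture the input would be data.T, i.e., the students):

```python
import numpy as np
import scipy.cluster.hierarchy as sch

points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0]])

# Build the linkage matrix: one row per merge step.
Z = sch.linkage(points, method='average', metric='euclidean')

# Cut the dendrogram so that exactly K=2 clusters remain.
labels = sch.fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2]: nearby points share a cluster id
```

Note that fcluster's cluster ids are 1-based, unlike kmeans2's 0-based labels.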
Grade Analysis Summary
• We parsed a tabular file and loaded its data into NumPy arrays,
using Pandas
• NumPy commands were used to manipulate and explore the data.
• We used the matplotlib package to draw a box-plot and a heatmap.
• argsort was used to sort the matrix columns.
• SciPy commands for applying k-means / hierarchical clustering on
the matrix columns were used (kmeans, vq, linkage, dendrogram,
fcluster).
• Read more about hierarchical clustering:
• http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
• https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
Time complexity: list
• The Average Case assumes parameters generated uniformly at random.
• Internally, a list is represented as an array; the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move). If you need to add/remove at both ends, consider using a collections.deque instead.
Operation         Average Case    Amortized Worst Case
Copy              O(n)            O(n)
Append[1]         O(1)            O(1)
Pop last          O(1)            O(1)
Pop intermediate  O(k)            O(k)
Insert            O(n)            O(n)
Get Item          O(1)            O(1)
Set Item          O(1)            O(1)
Delete Item       O(n)            O(n)
Iteration         O(n)            O(n)
Get Slice         O(k)            O(k)
Del Slice         O(n)            O(n)
Set Slice         O(k+n)          O(k+n)
Extend[1]         O(k)            O(k)
Sort              O(n log n)      O(n log n)
Multiply          O(nk)           O(nk)
x in s            O(n)
min(s), max(s)    O(n)
Get Length        O(1)            O(1)

Generally, 'n' is the number of elements currently in the container. 'k' is either the value of a parameter or the number of elements in the parameter.
http://en.wikipedia.org/wiki/Amortized_analysis
http://svn.python.org/projects/python/trunk/Objects/listsort.txt
Time complexity: collections.deque
• A deque (double-ended queue) is represented internally as a doubly linked list. (Well, a list of arrays rather than objects, for greater efficiency.) Both ends are accessible, but even looking at the middle is slow, and adding to or removing from the middle is slower still.
Operation   Average Case    Amortized Worst Case
Copy        O(n)            O(n)
append      O(1)            O(1)
appendleft  O(1)            O(1)
pop         O(1)            O(1)
popleft     O(1)            O(1)
extend      O(k)            O(k)
extendleft  O(k)            O(k)
rotate      O(k)            O(k)
remove      O(n)            O(n)
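A short sketch of the O(1) end operations that motivate choosing deque over list when both ends are used:

```python
from collections import deque

d = deque([2, 3, 4])

d.appendleft(1)   # O(1) at the left end (a list pays O(n) for insert(0, ...))
d.append(5)       # O(1) at the right end, like list.append
print(list(d))    # [1, 2, 3, 4, 5]

out_left = d.popleft()      # O(1), vs O(n) for list.pop(0)
out_right = d.pop()         # O(1), like list.pop()
print(out_left, out_right)  # 1 5
print(list(d))              # [2, 3, 4]
```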
Time complexity: dict
The Average Case times listed for dict objects assume that the hash function for the
objects is sufficiently robust to make collisions uncommon. The Average Case assumes
the keys used in parameters are selected uniformly at random from the set of all keys.
Note that there is a fast-path for dicts that (in practice) only deal with str keys; this doesn't
affect the algorithmic complexity, but it can significantly affect the constant factors: how
quickly a typical program finishes.
Operation     Average Case    Amortized Worst Case
Copy[2]       O(n)            O(n)
Get Item      O(1)            O(n)
Set Item[1]   O(1)            O(n)
Delete Item   O(1)            O(n)
Iteration[2]  O(n)            O(n)
Notes
[1] = These operations rely on the "Amortized" part of
"Amortized Worst Case". Individual actions may take
surprisingly long, depending on the history of the
container.
[2] = For these operations, the worst case n is the
maximum size the container ever achieved, rather than
just the current size. For example, if N objects are
added to a dictionary, then N-1 are deleted, the
dictionary will still be sized for N objects (at least) until
another insertion is made.