Lecture 12: Data analysis using Python, Time complexity (Introduction to Programming with Python)

Jan 26, 2021


dariahiddleston
  • Lecture 12:

    Data analysis using Python

    Time complexity

    Introduction to Programming with Python


  • Programming skills for data analysis are essential in the era of data explosion


  • Last time: Scientific computing

    • Manipulating large arrays and matrices of

    numeric data with NumPy

    • Plotting with MatPlotLib

    • Simple data analysis

  • Today

    • Python Libraries for Data Science

    • Data Analysis using Numpy, Scipy and Pandas

    • Image Processing

    • Time complexity


  • Python Libraries for Data Science

    Many popular Python toolboxes/libraries:

    • NumPy

    • SciPy

    • Pandas

    Visualization libraries

    • matplotlib

    and many more …


  • Python Libraries for Data Science

    NumPy:

    introduces objects for multidimensional arrays and matrices, as

    well as functions for easily performing advanced

    mathematical and statistical operations on those objects

    provides vectorization of mathematical operations on arrays and

    matrices, which significantly improves performance

    many other Python libraries are built on NumPy

    Link: http://www.numpy.org/
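To illustrate the vectorization point, here is a minimal sketch that computes the same expression once with an explicit Python loop and once with a single NumPy expression. The array contents and variable names are made up for the example.

```python
import numpy as np

values = np.arange(1, 6, dtype=float)   # [1. 2. 3. 4. 5.]

# Element-by-element computation in plain Python
loop_result = [v ** 2 + 1 for v in values]

# Vectorized: the same arithmetic is applied to the whole array at once
vec_result = values ** 2 + 1

print(vec_result)
```

On large arrays the vectorized form is usually much faster, because the loop runs inside NumPy's compiled code rather than in the Python interpreter.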

  • Python Libraries for Data Science

    SciPy:

    collection of algorithms for linear algebra, differential

    equations, numerical integration, optimization, statistics

    and more

    part of SciPy Stack

    built on NumPy

    Link: https://www.scipy.org/scipylib/

  • Python Libraries for Data Science

    matplotlib:

    Python 2D plotting library which produces publication-quality

    figures in a variety of hardcopy formats

    a set of functionalities similar to those of MATLAB

    line plots, scatter plots, bar charts, histograms, pie charts etc.

    relatively low-level; some effort needed to create advanced

    visualizations

    Link: https://matplotlib.org/

  • Python Libraries for Data Science

    Pandas:

    adds data structures and tools designed to work with

    table-like data (similar to Series and Data Frames in R)

    provides tools for data manipulation: reshaping,

    merging, sorting, slicing, aggregation etc.

    allows handling missing data

    Link: http://pandas.pydata.org/

  • Pandas

    https://pandas.pydata.org/

  • Pandas


  • Example:

    Simple data analysis using Numpy

    Processing medical data

  • Real data analysis example

    • Look at the file inflammation-01.csv

    • (download the lecture’s code)

    • Tables as CSV files

    • Text files

    • Each row holds the same number of columns

    • Values are separated by commas


  • Real data analysis example

    • Look at the file inflammation-01.csv

    • The data (in CSV format): inflammation levels in patients after a treatment

    • Each row holds information for a single patient

    • The columns represent successive days.

    • The first few lines in the file:

    (Figure: the first lines of the file; rows are patients, columns are days.)

  • Reading the data using NumPy
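The code on this slide did not survive in the transcript. One common way to read such a file is `np.loadtxt`; the sketch below uses a tiny made-up table in place of inflammation-01.csv so that it runs on its own.

```python
import io
import numpy as np

# Hypothetical stand-in for inflammation-01.csv: 3 patients x 4 days
csv_text = "0,1,3,2\n0,2,1,4\n0,1,2,2\n"

# np.loadtxt accepts a filename or a file-like object
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(data.shape)   # (patients, days)
print(data.dtype)
```

With the real file, the first argument would simply be the path to inflammation-01.csv.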


  • Properties of the data


  • Tasks

    • Remove the first and last 10 days

    • Plot the average, min, and max inflammation score

    per day


  • Data trimming and plots

    # remove the first and last 10 days

    n, m = data.shape

    data = data[:, 10:m-10]

    How can we get the average/min/max of each day?
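One possible answer is to reduce along axis 0, which collapses the patient rows and leaves one value per day. This is a minimal sketch on synthetic data (not the real inflammation file); the names `trimmed`, `day_mean`, `day_min` and `day_max` are chosen here for illustration.

```python
import numpy as np

# Synthetic stand-in: 5 patients x 40 days
rng = np.random.default_rng(0)
data = rng.integers(0, 20, size=(5, 40)).astype(float)

n, m = data.shape
trimmed = data[:, 10:m - 10]      # drop the first 10 and the last 10 days

# axis=0 collapses the patient axis, leaving one value per day
day_mean = trimmed.mean(axis=0)
day_min = trimmed.min(axis=0)
day_max = trimmed.max(axis=0)

print(trimmed.shape, day_mean.shape)
```

Each of the three result arrays can then be passed directly to a matplotlib plotting call, one curve per statistic.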

  • Removing first and last 10 days


  • Inflammation values per day


  • Grade Analysis (Example for Data Analysis using NumPy)

    We’ll use the following imports:

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt

    import scipy.cluster.vq

    import scipy.cluster.hierarchy

  • Grade Analysis

    In this example we will analyze the following grade

    sheet in Python.

    The table is saved as a CSV file.

    grades.csv

  • Analysis steps

    1. Load the grades from a CSV file

    • We will read a complex tabular file containing column and row

    headers into NumPy arrays.

    2. Count failing grades in total and per student.

    3. Visualize the data using a box plot and a heatmap

    4. Sort the table based on each student’s average

    5. Cluster the students into groups based on grade similarity

    1. Using K-means clustering

    2. Using Hierarchical clustering

  • Reading the grades table from file

    • Read CSV file using Pandas library

    StudentGrades.csv (column headers: students; row headers: courses; cells: grade data)

                         Yael  Nadav  Michal  Shoshana  Danielle  Omer  Yarden  Avi  Roy  Tal
    Programming            46     90      56        87        97    87      43   54   98   45
    Marine Biology         60     92      45        89        93    91      54   48   91   58
    Stellar Cartography    59     60      47        72        85    86      58   40   85   57
    Math                   94     65      58        80        90    85      92   55   92   88
    History                95     70      60        78        90    81      87   53   90   86
    Planet Survival        85     78      52        73        98    79      92   50  100   87
    Art                    97     70      50        70        98    75     100   51   95   84

  • Pandas DataFrame

    https://pandas.pydata.org/pandas-docs/stable/dsintro.html

  • Grades table

    columns

    rows

  • Accessing columns and rows

    Column: Yael’s grades

    Classes

    Row:

  • Pandas Series

    https://pandas.pydata.org/pandas-docs/stable/dsintro.html

  • Row??

    “Programming”

    is not a row

    or a column

  • Numeric Row Selection: iloc

  • Correcting the index column

    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html


  • Why has the DataFrame not changed?

    The change did not occur because most Pandas operations are not in-place:

    they return a new object and leave the original untouched.

  • Using assignment

    use assignment!
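The slide code is missing from the transcript, so here is a minimal sketch of the point about `set_index`. It uses a two-course excerpt of the grade table and assumes the first header cell of the CSV is empty; `name_col`, `unchanged` and `csv_text` are names introduced for this example.

```python
import io
import pandas as pd

# Two-course excerpt of the grade table; the first header cell is assumed empty
csv_text = ",Yael,Nadav,Michal\nProgramming,46,90,56\nMath,94,65,58\n"
df = pd.read_csv(io.StringIO(csv_text))

name_col = df.columns[0]      # whatever label pandas gave the unnamed first column

df.set_index(name_col)        # returns a new DataFrame; df itself is unchanged
unchanged = list(df.index)    # still the default 0, 1

df = df.set_index(name_col)   # assignment keeps the result
print(df.loc["Programming", "Yael"])
```

`set_index` also accepts `inplace=True`, but assigning the returned object, as above, is the pattern the slide recommends.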

  • After Updating the index:

    • Yael’s grades: before:

  • Row\Column slicing using indices

    • Note the inclusive range
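Assuming the slide refers to label-based slicing with `.loc`, the sketch below contrasts it with position-based slicing using `.iloc`. The small DataFrame is an excerpt of the grade table, and `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"Yael": [46, 60, 59], "Nadav": [90, 92, 60], "Michal": [56, 45, 47]},
    index=["Programming", "Marine Biology", "Stellar Cartography"],
)

# Label-based slicing with .loc includes BOTH endpoints
by_label = df.loc["Programming":"Marine Biology", "Yael":"Nadav"]

# Position-based slicing with .iloc excludes the stop position, like a list
by_position = df.iloc[0:2, 0:2]

print(by_label)
```

Both selections return the same 2 x 2 block, even though the label slice names its last row and column explicitly while the positional slice stops one before index 2.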

  • Data handling made easy with Pandas


    After Loading:

  • Without Pandas..

    import numpy as np

    def load_table_as_array(filename):
        with open(filename, 'r') as f:
            # Parse the first line, which contains the column headers
            header_line = f.readline().strip(',\n')
            column_names = header_line.split(',')

            # Parse the rest of the file
            mat = []
            row_names = []
            for line in f:
                tokens = line.rstrip().split(',')
                row_names.append(tokens[0])  # first token is the row header
                values = [float(n) for n in tokens[1:]]  # convert to float
                mat.append(values)  # append the current row to the matrix

        row_names = np.array(row_names)
        column_names = np.array(column_names)
        data = np.array(mat)
        return data, column_names, row_names

    A list comprehension (used above to build values) provides another way

    to produce a list.

  • Reminder … List Comprehensions

    • Python supports a concept called "list

    comprehensions". It can be used to

    construct lists in a very natural, easy way:

    >>> S = [x**2 for x in range(10)]
    >>> V = [2**i for i in range(13)]
    >>> M = [x for x in S if x % 2 == 0]
    >>>
    >>> print(S); print(V); print(M)
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
    [0, 4, 16, 36, 64]

  • Total number of fails

    • How many grades below 60 are there in the entire table ?
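A minimal sketch of one way to answer this with NumPy, using the grade table shown earlier (rows are courses, columns are students). The name `total_fails` is introduced here for illustration.

```python
import numpy as np

# Grade table from StudentGrades.csv: rows are courses, columns are students
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
])

# data < 60 is a boolean matrix; summing it counts the True cells
total_fails = int((data < 60).sum())
print(total_fails)   # 21
```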


  • Number of fails per student

    • Use conditional indexing

    All cells whose value is not >= 60 are set to NaN

  • isna( ) operator

    • returns True for NaN, False otherwise


  • Now use sum

    Count the NaN values per student

    (the NaNs represent grades < 60)

    Same as

  • Number of fails per student

    Without Pandas:
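The slide code is not in the transcript. The per-student count can be sketched both ways: with conditional indexing, `isna()` and `sum()` in Pandas, and with a boolean mask in NumPy. The variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

students = ["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
            "Omer", "Yarden", "Avi", "Roy", "Tal"]
courses = ["Programming", "Marine Biology", "Stellar Cartography",
           "Math", "History", "Planet Survival", "Art"]
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
])
df = pd.DataFrame(data, index=courses, columns=students)

# With Pandas: cells that fail the condition become NaN, then count NaN per column
passing_only = df[df >= 60]
fails_pandas = passing_only.isna().sum()

# Without Pandas: sum the boolean mask along the course axis
fails_numpy = (data < 60).sum(axis=0)

print(fails_pandas)
```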


  • Plotting a boxplot of the grades


  • The resulting Box-Plot


  • Querying the data


  • argmin() and argmax()

    Return the index of the minimal/maximal element in an array.

  • Data queries

    What is the name of the student with the highest

    average ?


  • Data queries

    What’s the name of the course with minimal number of

    failing students?


  • Data queries

    Is the course with the fewest failures also the

    course with the highest average?
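The three data queries can be sketched with `argmax` and `argmin` as follows. This is one possible formulation, not necessarily the slide's code; the names `best_student`, `fewest_fails_course` and `best_avg_course` are introduced here.

```python
import numpy as np

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
courses = np.array(["Programming", "Marine Biology", "Stellar Cartography",
                    "Math", "History", "Planet Survival", "Art"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
])

# Student with the highest average (argmax returns the first index on a tie)
best_student = students[np.argmax(data.mean(axis=0))]

# Course with the fewest failing students
fails_per_course = (data < 60).sum(axis=1)
fewest_fails_course = courses[np.argmin(fails_per_course)]

# Course with the highest average
best_avg_course = courses[np.argmax(data.mean(axis=1))]

print(best_student, fewest_fails_course, best_avg_course)
```

On this table Danielle and Roy share the top average, so `argmax` reports Danielle, the first of the two. The course with the fewest failures (History) is not the course with the highest average (Math).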

  • Plotting an image of the table (Heatmap)


  • Draw a heatmap

    Since we are going to draw several heatmaps, let’s

    wrap this code as a function:
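The function itself is not in the transcript. A minimal sketch of such a wrapper, assuming matplotlib's `imshow` is used to draw the matrix; `draw_heatmap` and its parameters are hypothetical names, and the color map is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless run; omit in a notebook or GUI session
import matplotlib.pyplot as plt
import numpy as np

def draw_heatmap(data, row_names, column_names):
    """Draw the matrix as a color image with labeled axes (hypothetical helper)."""
    fig, ax = plt.subplots()
    image = ax.imshow(data, cmap="hot", aspect="auto", interpolation="nearest")
    ax.set_xticks(range(len(column_names)))
    ax.set_xticklabels(column_names, rotation=90)
    ax.set_yticks(range(len(row_names)))
    ax.set_yticklabels(row_names)
    fig.colorbar(image, ax=ax)
    return fig, ax

# Small excerpt of the grade table, just to exercise the function
grades = np.array([[46, 90, 56], [94, 65, 58]])
fig, ax = draw_heatmap(grades, ["Programming", "Math"], ["Yael", "Nadav", "Michal"])
```

Returning the figure and axes lets the caller save the image with `fig.savefig(...)` or add a title before showing it.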


  • Reordering the matrix based on

    student’s average


    How can we sort the matrix columns based on their mean ?

  • Reordering the grade matrix based on

    student’s average

    Using argsort to retrieve the indices that would sort

    the array
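A minimal sketch of the argsort step, assuming the columns are ordered from lowest to highest average. `order`, `sorted_data` and `sorted_students` are names chosen here; `kind="stable"` is added only to make the order of tied students deterministic.

```python
import numpy as np

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
])

student_means = data.mean(axis=0)
order = np.argsort(student_means, kind="stable")   # indices, lowest mean first

sorted_data = data[:, order]         # reorder the columns
sorted_students = students[order]    # reorder the labels the same way

print(sorted_students)
```

Indexing both the matrix and the name array with the same `order` keeps every column attached to its student.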


  • Reordering the grade matrix based on

    student’s average


    from previous slide

  • Reordering the grade matrix based on

    student’s average


  • The resulting reordered heatmap


  • Clustering students based on grade similarity

    • Clustering algorithms can be used to automatically partition

    items into groups, exposing inner matrix structures.

    • Each item (a student in our case) is represented by a vector (the student’s grades).

    • Cluster items based on vector similarity

    • given some distance metric (e.g., Euclidean distance)

  • Clustering students based on grade similarity

    • Clustering analysis is a type of unsupervised machine

    learning (without “labeled” data)

    • Popular clustering algorithms include K-Means and Hierarchical

    clustering (both are implemented in SciPy!)

  • Clustering applications

    Market Segmentation

    Exploring Data

    Visualizing Social Networks

    Social Network Analysis

  • Clustering applications


    Compressing images

    with color reduction

  • Applying the K-Means clustering

    algorithm to the students’ grades

    • Grouping the students (columns) into K groups based

    on the similarity between their grades. K is a

    parameter to the algorithm.

    • Each group is represented by its mean vector (named

    ‘centroid’).

    • K-Means iteratively decreases the distance between all

    vectors (students) and their assigned centroids.

    (algorithmic details a bit later).
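The slide code is not in the transcript. Here is a minimal sketch using `kmeans` and `vq` from `scipy.cluster.vq`, the two functions named in the lecture summary. As an assumption made only for reproducibility, the four starting centroids are taken from four arbitrary students; passing the integer 4 instead would pick random starting points, so cluster numbering and membership may differ from the slides.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

obs = data.T   # transpose: one row (observation) per student

# Assumption: fixed starting centroids so that the run is deterministic
initial_centroids = obs[[0, 1, 2, 4]]
centroids, distortion = kmeans(obs, initial_centroids)

# vq assigns every student to the nearest centroid
assignment, _ = vq(obs, centroids)

for k in range(len(centroids)):
    print("Cluster", k, ":", students[assignment == k])
```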


  • Clustering the students into 4 groups

    using the K-Means algorithm

    The assignment list maps each student to one of 4 groups.

    Without the transpose (.T) we would cluster the courses instead.

  • Printing the assignment of students into

    the 4 clusters

    Michal and Avi are “similar”

  • Reordering the matrix columns based

    on the clustering results


    similar students

    are grouped

    together

  • Clustering the students into 4 groups

    based on their grades

    (Figure: heatmap with the columns grouped into Cluster 0, Cluster 1, Cluster 2, Cluster 3)

    Cluster 0 : ['Nadav' 'Shoshana' 'Omer']

    Cluster 1 : ['Michal' 'Avi']

    Cluster 2 : ['Yael' 'Yarden' 'Tal']

    Cluster 3 : ['Danielle' 'Roy']

  • Bonus Example: Applying Hierarchical Clustering on the matrix columns

    • Hierarchical clustering is another famous algorithm for

    dividing a set of items into groups based on their similarity.

    • Applying to the matrix columns will arrange the columns

    hierarchically based on their similarity.

    • Works by iteratively uniting similar items into a cluster,

    based on a given distance function (Euclidean,

    correlation…).

    • The output of the algorithm can be visualized as a

    dendrogram.


  • Hierarchical Clustering – The output


  • Applying Hierarchical Clustering on the

    matrix columns

    # Step 1: Calculate the linkage matrix
    Z = scipy.cluster.hierarchy.linkage(data.T, method='average', metric='euclidean')
    print(Z)  # The linkage matrix describes the clustering process

    [[  4.           8.           4.69041576   2.        ]
     [  2.           7.          11.18033989   2.        ]
     [  0.           6.          13.11487705   2.        ]
     [  3.           5.          17.17556404   2.        ]
     [  9.          12.          17.57609378   3.        ]
     [  1.          13.          28.46576198   3.        ]
     [ 10.          15.          43.67974589   5.        ]
     [ 14.          16.          68.29991281   8.        ]
     [ 11.          17.          85.94480647  10.        ]]

    Each row in the linkage table describes one step of the algorithm: the two items merged, their distance, and the size of the new cluster.

    On the first step, students #4 and #8 are united to form a cluster of size 2, since they have the smallest distance (about 4.69).

    On step 5, the first cluster of size 3 is formed by adding student #9 to an existing cluster of 2 (designated as #12).

  • Applying Hierarchical Clustering on the

    matrix columns


  • Applying Hierarchical Clustering on the

    matrix columns


  • Applying Hierarchical Clustering on the

    matrix columns


  • Hierarchical Clustering on the

    columns


    The fcluster command extracts K clusters from Z

    and returns the assignment of each item to a cluster

    (Corresponds to cutting the dendrogram at a level

    that would produce a K-partition)
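The remaining slide code is missing from the transcript. The sketch below repeats the linkage call from the earlier slide and then extracts K = 4 clusters with `fcluster`; using `criterion="maxclust"` and `no_plot=True` are choices made for this example, not necessarily the lecture's.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

students = np.array(["Yael", "Nadav", "Michal", "Shoshana", "Danielle",
                     "Omer", "Yarden", "Avi", "Roy", "Tal"])
data = np.array([
    [46, 90, 56, 87, 97, 87, 43, 54, 98, 45],
    [60, 92, 45, 89, 93, 91, 54, 48, 91, 58],
    [59, 60, 47, 72, 85, 86, 58, 40, 85, 57],
    [94, 65, 58, 80, 90, 85, 92, 55, 92, 88],
    [95, 70, 60, 78, 90, 81, 87, 53, 90, 86],
    [85, 78, 52, 73, 98, 79, 92, 50, 100, 87],
    [97, 70, 50, 70, 98, 75, 100, 51, 95, 84],
], dtype=float)

# Step 1: linkage matrix over the columns (students)
Z = linkage(data.T, method="average", metric="euclidean")

# Step 2: cut the tree so that at most 4 clusters remain; labels start at 1
labels = fcluster(Z, t=4, criterion="maxclust")

# Step 3: compute the dendrogram layout; drop no_plot=True to draw it with matplotlib
tree = dendrogram(Z, labels=list(students), no_plot=True)

for k in sorted(set(labels)):
    print("Cluster", k, ":", students[labels == k])
```

On this table the cut reproduces the four groups from the K-Means slides: Danielle and Roy, Michal and Avi, Yael with Yarden and Tal, and Nadav with Shoshana and Omer. The label numbers themselves are arbitrary.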

  • Grade Analysis Summary

    • We parsed a tabular file and loaded its data into NumPy arrays,

    using Pandas

    • NumPy commands were used to manipulate and explore the data.

    • We used the matplotlib package to draw a box-plot and a heatmap.

    • argsort was used to sort the matrix columns.

    • SciPy commands for applying k-means / hierarchical clustering on

    the matrix columns were used (kmeans, vq, linkage, dendrogram,

    fcluster).

    • Read more about hierarchical clustering here:

    • http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

    • https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

  • Time complexity

    list

    • The Average Case assumes parameters generated uniformly at random.

    • Internally, a list is represented as an array; the largest costs come from

    growing beyond the current allocation size (because everything must move),

    or from inserting or deleting somewhere near the beginning

    (because everything after that must move). If you need to add/remove at

    both ends, consider using a collections.deque instead.

    Operation          Average Case   Amortized Worst Case
    Copy               O(n)           O(n)
    Append[1]          O(1)           O(1)
    Pop last           O(1)           O(1)
    Pop intermediate   O(k)           O(k)
    Insert             O(n)           O(n)
    Get Item           O(1)           O(1)
    Set Item           O(1)           O(1)
    Delete Item        O(n)           O(n)
    Iteration          O(n)           O(n)
    Get Slice          O(k)           O(k)
    Del Slice          O(n)           O(n)
    Set Slice          O(k+n)         O(k+n)
    Extend[1]          O(k)           O(k)
    Sort               O(n log n)     O(n log n)
    Multiply           O(nk)          O(nk)
    x in s             O(n)
    min(s), max(s)     O(n)
    Get Length         O(1)           O(1)

    Generally, 'n' is the number of elements currently in the container. 'k' is either the

    value of a parameter or the number of elements in the parameter.

    http://en.wikipedia.org/wiki/Amortized_analysis

    http://svn.python.org/projects/python/trunk/Objects/listsort.txt

  • Time complexity

    collections.deque

    • A deque (double-ended queue) is represented internally as a doubly linked

    list. (Well, a list of arrays rather than objects, for greater efficiency.)

    Both ends are accessible, but even looking at the middle is slow, and

    adding to or removing from the middle is slower still.

    Operation    Average Case   Amortized Worst Case
    Copy         O(n)           O(n)
    append       O(1)           O(1)
    appendleft   O(1)           O(1)
    pop          O(1)           O(1)
    popleft      O(1)           O(1)
    extend       O(k)           O(k)
    extendleft   O(k)           O(k)
    rotate       O(k)           O(k)
    remove       O(n)           O(n)
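To make the list versus deque difference concrete, here is a minimal sketch that empties a queue from the front with each structure. The function names `drain_list` and `drain_deque` are made up for this example.

```python
from collections import deque

def drain_list(items):
    """Remove items from the front of a list; each pop(0) shifts the rest, O(n)."""
    queue = list(items)
    out = []
    while queue:
        out.append(queue.pop(0))
    return out

def drain_deque(items):
    """Same traversal with a deque; each popleft() is O(1)."""
    queue = deque(items)
    out = []
    while queue:
        out.append(queue.popleft())
    return out

print(drain_list(range(5)) == drain_deque(range(5)))
```

Both functions return the same result, but the list version does O(n) work per removal and therefore O(n²) overall, while the deque version is O(n) overall. Timing the two with the `timeit` module on a few tens of thousands of items should make the gap visible.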

  • Time complexity


    dict

    The Average Case times listed for dict objects assume that the hash function for the

    objects is sufficiently robust to make collisions uncommon. The Average Case assumes

    the keys used in parameters are selected uniformly at random from the set of all keys.

    Note that there is a fast-path for dicts that (in practice) only deal with str keys; this doesn't

    affect the algorithmic complexity, but it can significantly affect the constant factors: how

    quickly a typical program finishes.

    Operation      Average Case   Amortized Worst Case
    Copy[2]        O(n)           O(n)
    Get Item       O(1)           O(n)
    Set Item[1]    O(1)           O(n)
    Delete Item    O(1)           O(n)
    Iteration[2]   O(n)           O(n)

    Notes

    [1] = These operations rely on the "Amortized" part of

    "Amortized Worst Case". Individual actions may take

    surprisingly long, depending on the history of the

    container.

    [2] = For these operations, the worst case n is the

    maximum size the container ever achieved, rather than

    just the current size. For example, if N objects are

    added to a dictionary, then N-1 are deleted, the

    dictionary will still be sized for N objects (at least) until

    another insertion is made.
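As an illustration of the tables above, this minimal sketch looks up the same value by scanning a list and by hashing into a dict. The names and the five-student list are chosen for the example.

```python
names = ["Yael", "Nadav", "Michal", "Shoshana", "Danielle"]

# O(n): scans the list until it finds a match
position_by_scan = names.index("Shoshana")

# O(1) on average: one hash lookup, after a one-time O(n) build
position_of = {name: i for i, name in enumerate(names)}
position_by_hash = position_of["Shoshana"]

# Membership tests follow the same pattern
in_list = "Omer" in names          # O(n)
in_dict = "Omer" in position_of    # O(1) on average

print(position_by_scan, position_by_hash, in_list, in_dict)
```

Building the dict costs O(n) once, so it pays off when the same collection is queried many times.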