
Introduction to Programming with Python ipp211/wiki.files/V-Lecture 12... Python Libraries for Data Science SciPy: collection of algorithms for linear algebra, differential equations,

Jan 26, 2021


  • Lecture 12: Data analysis using Python, Time complexity

    Introduction to Programming with Python


  • Programming skills for data analysis are essential in the era of data explosion


  • Last time: Scientific computing

    • Manipulating large arrays and matrices of numeric data with NumPy

    • Plotting with Matplotlib

    • Simple data analysis

  • Today

    • Python Libraries for Data Science

    • Data Analysis using NumPy, SciPy and Pandas

    • Image Processing

    • Time complexity


  • Python Libraries for Data Science

    Many popular Python toolboxes/libraries:

    • NumPy

    • SciPy

    • Pandas

    Visualization libraries

    • matplotlib

    and many more …


  • Python Libraries for Data Science

    NumPy:

    • introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects

    • provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance

    • many other Python libraries are built on NumPy

    Link: http://www.numpy.org/
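To illustrate what vectorization means in practice, here is a minimal sketch. The temperature values are made up for illustration; the point is that one array expression replaces an explicit Python loop and gives the same result.

```python
import numpy as np

# Hypothetical data: five temperature readings in Celsius.
celsius = np.array([0.0, 12.5, 25.0, 37.0, 100.0])

# Loop version: one Python-level operation per element.
fahrenheit_loop = np.array([c * 9 / 5 + 32 for c in celsius])

# Vectorized version: a single expression applied to the whole array.
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_vec)
print(np.allclose(fahrenheit_loop, fahrenheit_vec))
```

On large arrays the vectorized form is typically much faster, because the loop runs inside NumPy's compiled code rather than in the Python interpreter.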


  • Python Libraries for Data Science

    SciPy:

    • collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more

    • part of the SciPy Stack

    • built on NumPy

    Link: https://www.scipy.org/scipylib/
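As a small taste of the numerical integration routines, the sketch below integrates sin(x) from 0 to pi with `scipy.integrate.quad`. The choice of integrand is only an illustration; the exact answer is 2.

```python
import numpy as np
from scipy import integrate

# Integrate sin(x) over [0, pi]; quad returns the estimate and an error bound.
value, abs_error = integrate.quad(np.sin, 0, np.pi)

print(value)
print(abs_error)
```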


  • Python Libraries for Data Science

    matplotlib:

    • Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats

    • a set of functionalities similar to those of MATLAB

    • line plots, scatter plots, bar charts, histograms, pie charts etc.

    • relatively low-level; some effort is needed to create advanced visualizations

    Link: https://matplotlib.org/

  • Python Libraries for Data Science

    Pandas:

    • adds data structures and tools designed to work with table-like data (Series and DataFrames, similar to data frames in R)

    • provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.

    • allows handling missing data

    Link: http://pandas.pydata.org/
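A minimal sketch of what "handling missing data" can look like. The tiny table and its column names (`day1`, `day2`) are hypothetical and serve only to show three common calls: `isna`, `mean` and `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing measurement (NaN).
df = pd.DataFrame({"day1": [1.0, 2.0, 3.0], "day2": [4.0, np.nan, 6.0]})

missing_per_column = df.isna().sum()  # count missing values in each column
column_means = df.mean()              # NaN values are skipped by default
filled = df.fillna(0)                 # new DataFrame with NaN replaced by 0

print(missing_per_column)
print(column_means)
print(filled)
```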


  • Pandas

    https://pandas.pydata.org/

  • Example: Simple data analysis using NumPy

    Processing medical data

  • Real data analysis example

    • Look at the file inflammation-01.csv

    • (download the lecture’s code)

    • Tables as CSV files

    • Text files

    • Each row holds the same number of columns

    • Values are separated by commas


  • Real data analysis example

    • Look at the file inflammation-01.csv

    • The data: inflammation levels in patients after a treatment, stored in CSV format

    • Each row holds information for a single patient

    • The columns represent successive days

    • The first few lines in the file:

    [Figure: excerpt of the file; rows are patients, columns are days]

  • Reading the data using NumPy
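One common way to read a purely numeric CSV file into an array is `np.loadtxt`. The sketch below is an assumption about how the loading step could look, not the slide's own code; it uses a small in-memory stand-in (made-up values) so that it runs without the data file.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: 3 patients (rows) x 5 days (columns).
csv_text = "0,1,3,2,1\n0,2,4,3,0\n0,1,2,2,1\n"

# loadtxt accepts a filename or a file-like object; values are split on commas.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(data)
```

With the lecture's file in the working directory, the same call would be `np.loadtxt("inflammation-01.csv", delimiter=",")`.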


  • Properties of the data
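A few array attributes and methods that are typically used to inspect a freshly loaded table. The small array here is hypothetical (3 patients by 5 days), standing in for the real data.

```python
import numpy as np

# Hypothetical inflammation scores: 3 patients (rows) x 5 days (columns).
data = np.array([[0.0, 1.0, 3.0, 2.0, 1.0],
                 [0.0, 2.0, 4.0, 3.0, 0.0],
                 [0.0, 1.0, 2.0, 2.0, 1.0]])

print(type(data))    # the container type: numpy.ndarray
print(data.shape)    # (number of patients, number of days)
print(data.dtype)    # type of each element
print(data.min(), data.max(), data.mean())  # statistics over the whole table
```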


  • Tasks

    • Remove the first and last 10 days

    • Plot the average, min, and max inflammation score per day

  • Data trimming and plots

    # remove the first and last 10 days
    n, m = data.shape
    data = data[:, 10:m-10]

    How can we get the average/min/max of each day?
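One possible answer, sketched under the assumption that rows are patients and columns are days: pass `axis=0` so that the statistic is computed down each column, giving one value per day. The array is synthetic (3 patients, 30 days) so the result is easy to check by hand.

```python
import numpy as np

# Synthetic data: 3 patients x 30 days, values 0..89 laid out row by row.
data = np.arange(90, dtype=float).reshape(3, 30)

# Remove the first and last 10 days (columns).
n, m = data.shape
trimmed = data[:, 10:m - 10]

# axis=0 collapses the patient axis, leaving one value per day.
day_mean = trimmed.mean(axis=0)
day_min = trimmed.min(axis=0)
day_max = trimmed.max(axis=0)

print(trimmed.shape)
print(day_mean)
```

Using `axis=1` instead would give one value per patient.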

  • Removing first and last 10 days


  • Inflammation values per day


  • Grade Analysis (Example for Data Analysis using NumPy)


    We’ll use the following imports:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import scipy.cluster.vq
    import scipy.cluster.hierarchy

  • Grade Analysis

    In this example we will analyze the following grade sheet in Python.

    The table is saved as a CSV file (grades.csv).

  • Analysis steps

    1. Load the grades from a CSV file

       • We will read a complex tabular file containing column and row headers into NumPy arrays.

    2. Count failing grades in total and per student

    3. Visualize the data using a box plot and a heatmap

    4. Sort the table based on each student’s average

    5. Cluster the students into groups based on grade similarity

       1. Using K-means clustering

       2. Using hierarchical clustering

  • Reading the grades table from file

    • Read CSV file using Pandas library

    StudentGrades.csv (row headers: classes; column headers: students)

                         Yael  Nadav  Michal  Shoshana  Danielle  Omer  Yarden  Avi  Roy  Tal
    Programming            46     90      56        87        97    87      43   54   98   45
    Marine Biology         60     92      45        89        93    91      54   48   91   58
    Stellar Cartography    59     60      47        72        85    86      58   40   85   57
    Math                   94     65      58        80        90    85      92   55   92   88
    History                95     70      60        78        90    81      87   53   90   86
    Planet Survival        85     78      52        73        98    79      92   50  100   87
    Art                    97     70      50        70        98    75     100   51   95   84
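A minimal sketch of the loading step with `pd.read_csv`. It reads an in-memory excerpt of the table (three students, three classes) rather than the real file, and it assumes the first column is headed `Classes`; that header name is a guess, not something the table confirms.

```python
import io
import pandas as pd

# Excerpt of the grade sheet; the "Classes" header is an assumption.
csv_text = """Classes,Yael,Nadav,Michal
Programming,46,90,56
Marine Biology,60,92,45
Math,94,65,58
"""

# read_csv accepts a path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)             # the "Classes" column counts as a regular column
print(df.columns.tolist())
```

Note that the class names end up in an ordinary column and the rows get a default numeric index (0, 1, 2, ...).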

  • Pandas DataFrame

    https://pandas.pydata.org/pandas-docs/stable/dsintro.html

  • Grades table

    [Figure: the grades table with its columns and rows labeled]

  • Accessing columns and rows

    Column: Yael’s grades

    Classes

    Row:
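Selecting a column by its header returns a pandas Series. The sketch below uses an excerpt of the grade sheet and assumes the class names live in a column called `Classes` (a hypothetical header).

```python
import io
import pandas as pd

# Excerpt of the grade sheet; the "Classes" header is an assumption.
csv_text = """Classes,Yael,Nadav,Michal
Programming,46,90,56
Marine Biology,60,92,45
Math,94,65,58
"""
df = pd.read_csv(io.StringIO(csv_text))

# Square brackets with a column header select that column as a Series.
yael = df["Yael"]

print(type(yael))
print(yael)
```

The same bracket syntax does not select a row: `df["Programming"]` raises a `KeyError`, because "Programming" is a value inside the `Classes` column, not a column header.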

  • Pandas Series

    https://pandas.pydata.org/pandas-docs/stable/dsintro.html

  • Row??

    “Programming” is not a row or a column

  • Numeric Row Selection: iloc
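A short sketch of position-based row selection with `iloc`, again on an excerpt of the grade sheet with an assumed `Classes` header. Positions start at 0 and, as with Python lists, the end of an `iloc` slice is excluded.

```python
import io
import pandas as pd

# Excerpt of the grade sheet; the "Classes" header is an assumption.
csv_text = """Classes,Yael,Nadav,Michal
Programming,46,90,56
Marine Biology,60,92,45
Math,94,65,58
"""
df = pd.read_csv(io.StringIO(csv_text))

first_row = df.iloc[0]        # a Series holding the first row
first_two_rows = df.iloc[0:2] # a DataFrame with rows at positions 0 and 1

print(first_row)
print(first_two_rows)
```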

  • Correcting the index column

    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html


  • Why has the DataFrame not changed?

    The change did not occur…

    Most Pandas operations are not in-place!

  • Using assignment

    use assignment!
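The difference can be sketched as follows, using an excerpt of the grade sheet with an assumed `Classes` header: calling `set_index` alone returns a new DataFrame and leaves the original untouched, so the result has to be assigned back.

```python
import io
import pandas as pd

# Excerpt of the grade sheet; the "Classes" header is an assumption.
csv_text = """Classes,Yael,Nadav,Michal
Programming,46,90,56
Marine Biology,60,92,45
Math,94,65,58
"""
df = pd.read_csv(io.StringIO(csv_text))

# Without assignment: the returned DataFrame is discarded, df is unchanged.
df.set_index("Classes")
still_a_column = "Classes" in df.columns

# With assignment: df now uses the class names as its row index.
df = df.set_index("Classes")

print(still_a_column)
print(df.loc["Programming"])
```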

  • After Updating the index:

    • Yael’s grades: before:

  • Row/Column slicing using indices

    • Note the inclusive range
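To make the inclusive range concrete, the sketch below compares label-based slicing with `loc` (both endpoints included) against position-based slicing with `iloc` (end excluded). The table is a three-by-three excerpt of the grade sheet.

```python
import pandas as pd

# Excerpt of the grade sheet, indexed by class name.
grades = pd.DataFrame(
    {"Yael": [46, 60, 94], "Nadav": [90, 92, 65], "Michal": [56, 45, 58]},
    index=["Programming", "Marine Biology", "Math"],
)

# Label slices include BOTH endpoints: 3 rows and 2 columns here.
by_label = grades.loc["Programming":"Math", "Yael":"Nadav"]

# Position slices exclude the end, like list slicing: 2 rows and 1 column.
by_position = grades.iloc[0:2, 0:1]

print(by_label)
print(by_position)
```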

  • Data handling made easy with Pandas


    After Loading:

  • Without Pandas:

    def load_table_as_array(filename):
        with open(filename, 'r') as f:
            # Parse the first line, which contains the column headers
            header_line = f.readline()
            header_line = header_line.strip(',\n')
            column_names = header_line.split(',')
            # Parse the rest of the file
            mat = []
            row_names = []
            for line in f:
                tokens = line.rstrip().split(',')
                row_names.append(tokens[0])  # Add first token to row header list
                values = [float(n) for n in tokens[1:]]  # Convert to float
                mat.append(values)  # Append the current row to the matrix
        row_names = np.array(row_names)
        column_names = np.array(column_names)
        data = np.array(mat)
        return data, column_names, row_names

    List comprehension provides another way to produce a list

  • Reminder … List Comprehensions

    • Python supports a concept called "list comprehensions". It can be used to construct lists in a very natural, easy way:

    >>> S = [x**2 for x in range(10)]
    >>> V = [2**i for i in range(13)]
    >>> M = [x for x in S if x % 2 == 0]
    >>>
    >>> print(S); print(V); print(M)
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
    [0, 4, 16, 36, 64]

  • Total number of fails

    • How many grades below 60 are there in the entire table ?
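One way to answer this, sketched on a three-by-three excerpt of the grade sheet: comparing the table with 60 gives a table of booleans, and summing twice (first per column, then over the column totals) counts the `True` cells.

```python
import pandas as pd

# Excerpt of the grade sheet, indexed by class name.
grades = pd.DataFrame(
    {"Yael": [46, 60, 94], "Nadav": [90, 92, 65], "Michal": [56, 45, 58]},
    index=["Programming", "Marine Biology", "Math"],
)

failing = grades < 60                    # boolean DataFrame, True where grade < 60
total_fails = int(failing.sum().sum())   # sum per column, then sum the totals

print(failing)
print(total_fails)
```

In this excerpt the failing grades are 46 (Yael) and 56, 45, 58 (Michal).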


  • Number of fails per student

    • Use conditional indexing

    All cells where the value is not >= 60 are set to NaN
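A minimal sketch of conditional indexing on an excerpt of the grade sheet: indexing a DataFrame with a boolean DataFrame keeps the cells where the condition holds and puts NaN everywhere else.

```python
import pandas as pd

# Excerpt of the grade sheet, indexed by class name.
grades = pd.DataFrame(
    {"Yael": [46, 60, 94], "Nadav": [90, 92, 65], "Michal": [56, 45, 58]},
    index=["Programming", "Marine Biology", "Math"],
)

# Cells that fail the condition (grade < 60) become NaN.
passed = grades[grades >= 60]

print(passed)
```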

  • isna() method

    • returns True for NaN, False otherwise

  • Now use sum

    Count NaN values per student

    NaNs represent grades < 60

    Same as
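The two routes can be compared side by side. This sketch, on an excerpt of the grade sheet, counts NaN cells per student after conditional indexing and checks that a direct comparison with 60 gives the same counts. The direct form is an assumption about what the slide's missing right-hand side showed.

```python
import pandas as pd

# Excerpt of the grade sheet: one column per student.
grades = pd.DataFrame(
    {"Yael": [46, 60, 94], "Nadav": [90, 92, 65], "Michal": [56, 45, 58]},
    index=["Programming", "Marine Biology", "Math"],
)

# Route 1: mask failing grades as NaN, flag them with isna, sum per column.
fails_via_nan = grades[grades >= 60].isna().sum()

# Route 2: compare directly and sum the True values per column.
fails_direct = (grades < 60).sum()

print(fails_via_nan)
print((fails_via_nan == fails_direct).all())
```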

  • Number of fails per student

    Without Pandas:
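With plain NumPy the same count could be sketched like this, assuming the grades sit in a 2D array with one row per class and one column per student (an excerpt of the grade sheet is used here).

```python
import numpy as np

# Rows: Programming, Marine Biology, Math. Columns: Yael, Nadav, Michal.
data = np.array([[46, 90, 56],
                 [60, 92, 45],
                 [94, 65, 58]])
column_names = np.array(["Yael", "Nadav", "Michal"])

# data < 60 is a boolean array; summing along axis=0 counts True per column.
fails_per_student = np.sum(data < 60, axis=0)
total_fails = int(np.sum(data < 60))

for name, count in zip(column_names, fails_per_student):
    print(name, count)
print(total_fails)
```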


  • Plotting a boxplot of the grades
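A hedged sketch of one way to draw such a plot with Matplotlib, using an excerpt of the grade sheet. Drawing one box per student is an assumption (the plot could equally be drawn per class), and the non-interactive `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show a window
import matplotlib.pyplot as plt
import pandas as pd

# Excerpt of the grade sheet: one column per student.
grades = pd.DataFrame(
    {"Yael": [46, 60, 94], "Nadav": [90, 92, 65], "Michal": [56, 45, 58]},
    index=["Programming", "Marine Biology", "Math"],
)

fig, ax = plt.subplots()
# For a 2D array, boxplot draws one box per column (here: per student).
result = ax.boxplot(grades.to_numpy())
ax.set_xticks(list(range(1, len(grades.columns) + 1)))
ax.set_xticklabels(list(grades.columns))
ax.set_ylabel("Grade")
ax.set_title("Grades per student")
```

In an interactive session, `plt.show()` would display the figure.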


  • The resulting Box-Plot


  • Querying the data
