Transcript
Introduction to Pandas in Python
Pandas is an open-source library designed for working with relational or labeled data easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series. The library is built on top of NumPy and offers high performance and productivity. Its key features include:
Fast and efficient manipulation and analysis of data.
Loading data from many different file formats.
Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data.
Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
Merging and joining of data sets.
Flexible reshaping and pivoting of data sets.
Time-series functionality.
Powerful group-by functionality for performing split-apply-combine operations on data sets.
import pandas as pd
Here, pd is an alias for Pandas. Importing the library under an alias is not required; it simply means less typing every time a method or property is called.
Pandas provides two primary data structures for manipulating data:
Series
DataFrame
Series:
A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A Series is essentially a single column of an Excel sheet. Labels need not be unique, but they must be hashable. The object supports both integer- and label-based indexing and provides a host of methods for operations involving the index.
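A minimal illustration of the indexing described above: a small Series with string labels supports both label- and position-based access (the data here is made up for the example).

```python
import pandas as pd

# A Series with string labels; both label- and position-based access work.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])     # label-based access -> 20
print(s.iloc[0])  # integer-position access -> 10
```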
Creating a Series
In the real world, a Pandas Series is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A Series can also be created from a list, a dictionary, a scalar value, and so on.
import pandas as pd
import numpy as np
# Creating an empty series (explicit dtype avoids a warning in newer pandas)
ser = pd.Series(dtype='float64')
print(ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
Series([], dtype: float64)
0 g
1 e
2 e
3 k
4 s
dtype: object
DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
Creating a DataFrame:
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from a list, a dictionary, a list of dictionaries, and so on.
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
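The "list of dictionaries" case mentioned above is worth a quick sketch (the dictionaries here are made up for the example): each dictionary becomes one row, and keys become column names.

```python
import pandas as pd

# Each dict becomes a row; keys become columns.
records = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}]
df = pd.DataFrame(records)
print(df)  # column 'c' is NaN in the first row, since that key is missing
```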
Why Pandas is used for Data Science
Pandas is widely used for data science, but have you wondered why? It is because Pandas works in conjunction with the other libraries used for data science. It is built on top of NumPy, which means many NumPy structures are used or replicated in Pandas. Data produced by Pandas is often used as input for the plotting functions of Matplotlib, statistical analysis in SciPy, and machine-learning algorithms in scikit-learn.
A Pandas program can be run from any text editor, but Jupyter Notebook is recommended, since Jupyter gives the ability to execute code in a particular cell rather than the entire file. Jupyter also provides an easy way to visualize Pandas data frames and plots.
Read Data
In the following examples, the data frame used contains data on some NBA players.
In this example, the top 5 rows of the data frame are returned and stored in a new variable. No parameter is passed to the .head() method, since the default is 5.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# calling head() method
# storing in new variable
data_top = data.head()
# display
data_top
In this example, the .head() method is called on a series with a custom value of the n parameter, to return the top 9 rows of the series.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# number of rows to return
n = 9
# creating series
series = data["Name"]
# returning top n rows
top = series.head(n = n)
# display
top
In this example, the bottom 5 rows of the data frame are returned and stored in a new variable. No parameter is passed to the .tail() method, since the default is 5.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# calling tail() method
# storing in new variable
data_bottom = data.tail()
# display
data_bottom
In this example, the .tail() method is called on a series with a custom value of the n parameter, to return the bottom 12 rows of the series.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# number of rows to return
n = 12
# creating series
series = data["Salary"]
# returning bottom n rows
bottom = series.tail(n = n)
# display
bottom
In this example, the data frame is described. ['object', 'float', 'int'] is passed to the include parameter so that object columns are described alongside the numeric ones, and [.20, .40, .60, .80] is passed to the percentiles parameter to view those percentiles of the numeric columns.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# removing null values to avoid errors
data.dropna(inplace = True)
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
include =['object', 'float', 'int']
# calling describe method
desc = data.describe(percentiles = perc, include = include)
# display
desc
In this example, the describe method is called on the Name column to see its behaviour with the object data type.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# removing null values to avoid errors
data.dropna(inplace = True)
# calling describe method
desc = data["Name"].describe()
# display
desc
Dealing with Columns
To deal with columns, we perform basic operations such as selecting, deleting, adding and renaming.
To select a column in a Pandas DataFrame, we can simply access it by its column name.
# Import pandas package
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
To add a column to a Pandas DataFrame, we can declare a new list as a column and add it to an existing DataFrame.
# Import pandas package
import pandas as pd
# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Height': [5.1, 6.2, 5.1, 5.2],
'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
# Observe the result
print(df)
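The two remaining column operations mentioned above, renaming and deleting, can be sketched the same way (the data below is a small hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi'],
                   'Age': [27, 24],
                   'City': ['Delhi', 'Kanpur']})
df = df.rename(columns={'City': 'Address'})  # rename a column
df = df.drop(columns=['Age'])                # delete a column
print(list(df.columns))  # ['Name', 'Address']
```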
Dealing with Rows:
To deal with rows, we can perform basic operations such as selecting, deleting, adding and renaming.
Pandas provides a unique method to retrieve rows from a data frame: DataFrame.loc[] is used to retrieve rows by label. Rows can also be selected by passing their integer location to the iloc[] indexer.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
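The iloc[] indexer mentioned above works by integer position rather than label; a small stand-in frame (so the nba.csv file is not needed) shows the idea:

```python
import pandas as pd

# Two made-up rows indexed by player name.
df = pd.DataFrame({'Team': ['Boston', 'Utah'], 'Number': [0, 5]},
                  index=['Avery Bradley', 'R.J. Hunter'])
first = df.iloc[0]  # same row that df.loc['Avery Bradley'] returns
print(first['Team'])  # Boston
```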
To add a row to a Pandas DataFrame, we can concatenate the old DataFrame with a new one.
# importing pandas module
import pandas as pd
# making data frame
df = pd.read_csv("nba.csv", index_col ="Name")
df.head(10)
new_row = pd.DataFrame({'Name':'Geeks', 'Team':'Boston', 'Number':3,
'Position':'PG', 'Age':33, 'Height':'6-2',
'Weight':189, 'College':'MIT', 'Salary':99999},
index =[0])
# simply concatenate both dataframes
df = pd.concat([new_row, df]).reset_index(drop = True)
df.head(5)
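Deleting rows, the one basic row operation not shown above, uses drop with row labels (the labels here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 26, 27]}, index=['A', 'B', 'C'])
df = df.drop(index=['B'])  # delete the row labelled 'B'
print(list(df.index))  # ['A', 'C']
```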
Python Pandas - Sorting
There are two kinds of sorting available in Pandas:
By label
By actual value
Let us consider an example with an output.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])
print(unsorted_df)
Using the sort_index() method, the DataFrame can be sorted by passing the axis argument and the order of sorting. By default, sorting is done on row labels in ascending order.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])
sorted_df = unsorted_df.sort_index()
print(sorted_df)
By passing the Boolean value to ascending parameter, the order of the
sorting can be controlled. Let us consider the following example to
understand the same.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])
sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)
By passing the axis argument with a value of 0 or 1, the sorting can be done on row or column labels. By default axis=0, i.e., sort by row labels; axis=1 sorts the column labels instead. Let us consider the following example.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])
sorted_df = unsorted_df.sort_index(axis=1)
print(sorted_df)
Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument that takes the name of the column to sort by.
import pandas as pd
unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)
Observe that the col1 values are sorted, and the corresponding col2 values and row index move along with them. That is why the index looks unsorted.
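If a freshly-numbered index is wanted after sorting, reset_index renumbers the rows; a short sketch with a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1], 'col2': [1, 3]})
s = df.sort_values(by='col1')
print(list(s.index))                         # [1, 0] - labels travel with rows
print(list(s.reset_index(drop=True).index))  # [0, 1] - renumbered
```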
The 'by' argument can also take a list of column names.
import pandas as pd
unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
sorted_df = unsorted_df.sort_values(by=['col1', 'col2'])
print(sorted_df)
sort_values() also lets you choose the sorting algorithm via the kind parameter: mergesort, heapsort or quicksort. Mergesort is the only stable algorithm.
import pandas as pd
unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort')
print(sorted_df)
Python Pandas - Aggregations
Applying Aggregations on DataFrame
Let us create a DataFrame and apply aggregations on it.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r)
We can aggregate by passing a function to the entire DataFrame, or select a column via standard getitem syntax.
Apply Aggregation on a Whole Dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))
Apply Aggregation on a Single Column of a Dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))
Apply Aggregation on Multiple Columns of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))
Apply Multiple Functions on a Single Column of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate([np.sum, np.mean]))
Apply Multiple Functions on Multiple Columns of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate([np.sum, np.mean]))
Apply Different Functions to Different Columns of a Dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4),
index = pd.date_range('1/1/2000', periods=3),
columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate({'A': np.sum, 'B': np.mean}))
Python Pandas - Missing Data
When and Why Is Data Missing?
Consider an online survey for a product. People often do not share all of the information asked of them. Some share their experience but not how long they have been using the product; others share how long they have been using it and their experience, but not their contact information. One way or another, part of the data is almost always missing, and this is very common in real-world data.
Let us now see how we can handle missing values (NA or NaN) using Pandas.
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
Using reindexing, we have created a DataFrame with missing values. In
the output, NaN means Not a Number.
Check for Missing Values
To make detecting missing values easier (and across different array
dtypes), Pandas provides the isnull() and notnull() functions, which are
also methods on Series and DataFrame objects −
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].notnull())
Calculations with Missing Data
When summing data, NA is treated as zero.
If the data are all NA, the result is NA (in older pandas; recent versions return 0 unless min_count is set).
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].sum())
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
print(df['one'].sum())
Cleaning / Filling Missing Data
Pandas provides various methods for cleaning the missing values. The
fillna function can “fill in” NA values with non-null data in a couple of
ways, which we have illustrated in the following sections.
Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Here, we are filling with value zero; instead we can also fill with any
other value.
Fill NA Forward and Backward
Using the concepts of filling discussed in the ReIndexing Chapter we
will fill the missing values.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.fillna(method='pad'))  # forward fill; newer pandas prefers df.ffill()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.fillna(method='backfill'))  # backward fill; newer pandas prefers df.bfill()
Drop Missing Values
If you want to simply exclude the missing values, use the dropna function along with the axis argument. By default axis=0, i.e., along rows, meaning that if any value within a row is NA the whole row is excluded.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
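Dropping along the other axis works the same way: axis=1 removes any column containing NA. A tiny made-up frame shows both directions:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan], 'two': [3.0, 4.0]})
print(df.dropna())        # drops row 1 (it contains a NaN) -> one row left
print(df.dropna(axis=1))  # drops column 'one' instead -> one column left
```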
Replace Missing (or) Generic Values
Often we have to replace a generic value with a specific value. We can achieve this with the replace method.
Replacing NA with a scalar value is equivalent to using the fillna() function.
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
print(df.replace({1000: 10, 2000: 60}))
Python Pandas - GroupBy
Any groupby operation involves one of the following operations on the
original object. They are −
Splitting the Object
Applying a function
Combining the results
In many situations, we split the data into sets and we apply some functionality
on each subset. In the apply functionality, we can perform the following
operations −
Aggregation − computing a summary statistic
Transformation − perform some group-specific operation
Filtration − discarding the data with some condition
Let us now create a DataFrame object and perform all the operations on it −
#import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)
Split Data into Groups
A Pandas object can be split using one or more keys. There are multiple ways to split an object, such as:
obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key,axis=1)
Let us now see how the grouping objects can be applied to the
DataFrame object
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team'))
View Groups
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)
Group by with multiple columns −
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df.groupby(['Team', 'Year']).groups)
Iterating through Groups
With the groupby object in hand, we can iterate through the groups much as with itertools:
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)
By default, each group is labelled by the value of the grouping key.
Select a Group
Using the get_group() method, we can select a single group.
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped.get_group(2014))
Aggregations
An aggregating function returns a single aggregated value for each group. Once the groupby object is created, several aggregation operations can be performed on the grouped data.
An obvious one is aggregation via the aggregate or
equivalent agg method −
# import the pandas library
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
Another way to see the size of each group is by applying the size()
function −
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped.agg(np.size))
Applying Multiple Aggregation Functions at Once
With grouped Series, you can also pass a list or dict of functions to aggregate with, generating a DataFrame as output:
# import the pandas library
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))
Transformations
A transformation on a group or a column returns an object indexed the same as the one being grouped, so the transform function should return a result that is the same size as the group chunk.
# import the pandas library
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))
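The third apply-step operation listed earlier, filtration, discards whole groups that fail a condition; a short sketch with hypothetical data:

```python
import pandas as pd

# Keep only the groups whose total Points exceed 10.
df = pd.DataFrame({'Team': ['A', 'A', 'B'], 'Points': [10, 20, 5]})
kept = df.groupby('Team').filter(lambda g: g['Points'].sum() > 10)
print(kept)  # only Team A's rows survive (their sum, 30, exceeds 10)
```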
Python Pandas - Merging/Joining
Pandas has full-featured, high performance in-memory join operations
idiomatically very similar to relational databases like SQL.
Pandas provides a single function, merge, as the entry point for all
standard database join operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
Here, we have used the following parameters −
left − A DataFrame object.
right − Another DataFrame object.
on − Columns (names) to join on. Must be found in both the left
and right DataFrame objects.
left_on − Columns from the left DataFrame to use as keys. Can
either be column names or arrays with length equal to the length of
the DataFrame.
right_on − Columns from the right DataFrame to use as keys. Can
either be column names or arrays with length equal to the length of
the DataFrame.
left_index − If True, use the index (row labels) from the left
DataFrame as its join key(s). In case of a DataFrame with a
MultiIndex (hierarchical), the number of levels must match the
number of join keys from the right DataFrame.
right_index − Same usage as left_index for the right DataFrame.
how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each
method has been described below.
sort − Sort the result DataFrame by the join keys in lexicographical order. Setting it to False (the default in recent versions of pandas) improves performance substantially in many cases.
Let us now create two different DataFrames and perform the merging
operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)
Merge Two DataFrames on a Key
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='id'))
Merge Two DataFrames on Multiple Keys
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on=['id', 'subject_id']))
Merge Using 'how' Argument
The how argument to merge specifies how to determine which keys are
to be included in the resulting table. If a key combination does not
appear in either the left or the right tables, the values in the joined table
will be NA.
Here is a summary of the how options and their SQL equivalents:
Merge Method   SQL Equivalent     Description
left           LEFT OUTER JOIN    Use keys from left object
right          RIGHT OUTER JOIN   Use keys from right object
outer          FULL OUTER JOIN    Use union of keys
inner          INNER JOIN         Use intersection of keys
Left Join
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id', how='left'))
Its output is as follows −
   Name_x  id_x subject_id Name_y  id_y
0    Alex     1       sub1    NaN   NaN
1     Amy     2       sub2  Billy   1.0
2   Allen     3       sub4  Brian   2.0
3   Alice     4       sub6  Bryce   4.0
4  Ayoung     5       sub5  Betty   5.0
Right Join
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id', how='right'))
Its output is as follows −
   Name_x  id_x subject_id Name_y  id_y
0     Amy   2.0       sub2  Billy     1
1   Allen   3.0       sub4  Brian     2
2   Alice   4.0       sub6  Bryce     4
3  Ayoung   5.0       sub5  Betty     5
4     NaN   NaN       sub3   Bran     3
Outer Join
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, how='outer', on='subject_id'))
Its output is as follows −
   Name_x  id_x subject_id Name_y  id_y
0    Alex   1.0       sub1    NaN   NaN
1     Amy   2.0       sub2  Billy   1.0
2   Allen   3.0       sub4  Brian   2.0
3   Alice   4.0       sub6  Bryce   4.0
4  Ayoung   5.0       sub5  Betty   5.0
5     NaN   NaN       sub3   Bran   3.0
Inner Join
By contrast, the DataFrame.join method joins on the index, and the operation honors the object on which it is called, so a.join(b) is not equal to b.join(a).
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id', how='inner'))
Its output is as follows −
   Name_x  id_x subject_id Name_y  id_y
0     Amy     2       sub2  Billy     1
1   Allen     3       sub4  Brian     2
2   Alice     4       sub6  Bryce     4
3  Ayoung     5       sub5  Betty     5
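The index-based join mentioned above can be sketched with two tiny made-up frames; swapping the caller changes which index the result keeps:

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['a', 'c'])
print(left.join(right))  # keeps left's index: rows 'a' and 'b'
print(right.join(left))  # keeps right's index: rows 'a' and 'c'
```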
1 - Project Overview
1.1 - Project Introduction
Chipotle's full name is Chipotle Mexican Grill; a chipotle is originally a kind of Mexican dried chili pepper. The fast-food chain was founded in 1993 in Denver, USA, by a young man named Steve Ells. Its products are Mexican-style food, made mostly from fresh beef, chicken, pork, various vegetables, various beans, rice and so on.
Chipotle is a leader in the American fast-casual dining industry, ranking 54th on Fortune's 2011 list of the 100 fastest-growing companies. Chipotle has maintained astonishing growth: in the fiscal year ending June 30, 2011, its revenue exceeded 2 billion US dollars, up 23.5% from the previous year. Since 2006 its revenue has nearly tripled, while the number of restaurants in the chain has doubled. In the first half of 2011, sales at Chipotle restaurants open at least one year grew 11%. Chipotle's restaurant profit margin has stayed between 25% and 26%, among the best in the fast-food industry.
As the company grows, it increasingly needs data-driven operations to support its business. The company therefore wants data analysts to analyze the overall operating data and provide a basis for sales decisions.
In this project, we will use Python's data-analysis module, pandas, to perform exploratory analysis of the Chipotle data.
1.2 - Teaching Objectives
This project mainly teaches you to use the following parts of the Pandas data-analysis toolkit fluently:

Topic                            Description
Pandas environment setup         Installing the Pandas data-analysis tools
Pandas data-analysis modules     Pandas data-analysis functions, etc.
Basic data-analysis workflow     The basic process and methods of data analysis
1.3 - Prerequisites
To get the most out of this project, you should ideally have the following skills:
A grasp of Python programming basics
A grasp of basic statistics
The ability to set up a Python environment
A grasp of the basics of data analysis
A grasp of basic Pandas operations
1.4 - Study Schedule

Total time   Task 1   Task 2   Task 3   Task 4   Task 5
3h           1h       0.5h     0.5h     0.5h     0.5h
1.5 - Supporting Materials
Pandas official documentation: https://pandas.pydata.org
Python basics: https://www.runoob.com/python3/python3-tutorial.html
Experiment data download: https://pan.baidu.com/s/1BItW8kYNU4xp2cw8rF5ojA (password: fvtn)
Reference for installing the Python and Pandas development environments:
https://blog.csdn.net/weixin_42526141/article/details/84141157
2 - Project Analysis
2.1 - Project Interpretation
Data analysis is the process of examining and summarizing collected data in detail with appropriate statistical methods, in order to extract useful information and form conclusions. This process is also a supporting process of a quality-management system. In practice, data analysis helps people make judgments so that appropriate action can be taken.
The Chipotle fast-food data is an entry-level dataset for data analysis, with 5 features and 4622 samples. The data is simple and easy to understand, while still fully supporting practice with the key Pandas functions and the various methods of descriptive analysis.
To understand how the Chipotle restaurants are doing, we can use Pandas functions to analyze the data from multiple angles.
2.2 - Skill Requirements
A grasp of pandas usage
A grasp of Python programming basics
A grasp of basic statistics
2.3 - Task Breakdown
This project is divided into 5 tasks:
Task 1: Set up and prepare the development environment
Build the Python data analysis environment and prepare the experiment data,
laying the groundwork for the analysis tasks that follow.
Task 2: Import the data and understand its fields
Use pandas' data loading API to load the restaurant data for analysis. The
first step of any analysis is usually loading the data, and data sources come
in many different formats, e.g. csv and txt, so learning how to load data is
very important for us.
Task 3: The most-ordered item
Use pandas' data analysis API (groupby, etc.) to find the item ordered the
most times.
Task 4: Sales analysis by item
Use pandas' data analysis API (groupby, etc.) to analyze how the different
items sell.
Task 5: Total revenue and total sales
Use pandas' data analysis API (value_counts, groupby, etc.) to analyze total
revenue and total sales.
3 - Task Details
3.1 - Environment Setup with Anaconda
pandas is a tool built on top of NumPy, created to tackle data analysis tasks.
It incorporates a large number of libraries and some standard data models, and
provides the tools needed to operate on large datasets efficiently, along with
a wealth of functions and methods that let us process data quickly and
conveniently. You will soon find that it is one of the key reasons Python is
such a powerful and efficient data analysis environment.
In this task we use Anaconda to set up the Python data analysis environment.
Task requirements:
1. Set up a data analysis environment based on the latest version of Anaconda
2. Set up the Jupyter and Spyder development environments
3. Download the practice data
Task hints:
Anaconda is an open-source Python distribution that contains conda, Python,
and more than 180 scientific packages with their dependencies. Because it
bundles so many scientific packages, the Anaconda download is fairly large
(about 531 MB); if you only need certain packages, or want to save bandwidth
or disk space, you can instead use Miniconda, a smaller distribution
containing only conda and Python.
An installation walkthrough is available at:
https://baijiahao.baidu.com/s?id=1616120886763657106&wfr=spider&for
=pc
Anaconda download link: https://www.anaconda.com/distribution. Download the
Python 3.7 version that matches your operating system.
Once the setup succeeds, it looks like the following:
Setting up the development environment:
Either Jupyter or Spyder will do; below we take Jupyter as the example.
Jupyter's interface is shown below.
Jupyter is an excellent Python data analysis development tool; for detailed
usage you can refer to
http://baijiahao.baidu.com/s?id=1601883438842526311&wfr=spi
der&for=pc
Its notebook-style programming interface is clean and pleasant to work in.
Downloading the test data:
Next, download the test data from the following link.
Experiment data download:
https://pan.baidu.com/s/1BItW8kYNU4xp2cw8rF5ojA (password: fvtn)
The data is hosted on Baidu Netdisk; simply download it directly. The fields
used in this project are described below:
Field name            Meaning
item_name             product name
quantity              quantity sold
choice_description    description of the chosen options
order_id              order ID
item_price            product price
3.2 - Import the Data and Understand Its Fields
Having finished the environment setup and data download in 3.1, we can begin
the actual data analysis work. Data analysis is part of operations, and plays
a key role in them: through analysis we can make correct judgments. In
practice, data analysis helps people make judgments so they can take
appropriate action. Data analysis is the process by which an organization
purposefully collects and analyzes data, turning it into information.
Throughout any analysis, the data itself deserves the most attention; reading
the data in is simple, yet it is the most important step. In this first task
our main job is to read the dataset in and get an overall picture of the
information it contains.
In this task we will load the dataset and do some basic exploration of it.
Task requirements:
1. Read the Chipotle dataset and name it chipo
2. View the first 100 rows of the data
3. View the dataset's information
4. View the columns in the dataset
Task hints:
The first step is to import the pandas package. pandas is a data analysis
package built on NumPy whose main purpose is data analysis; it provides a
wealth of high-level data structures and methods for processing data.
import pandas as pd
Next we import the dataset as a DataFrame and name it chipo. A DataFrame is a
data structure in Python's pandas library: a two-dimensional table much like
an Excel sheet. It is somewhat like a MATLAB matrix, but a MATLAB matrix can
only hold numeric values (although MATLAB can store mixed types in a cell
array), whereas DataFrame cells can hold numbers, strings, and so on, just
like an Excel table. A DataFrame also has column names (columns) and row
labels (index), so data can be accessed either by position, MATLAB-style, or
by column and row labels.
We use pandas' read_csv() to read the data. The code is shown below, where
path is the path where the dataset is stored and the sep parameter specifies
the separator. If sep is not given, a comma separator is assumed. The data
file here is a TSV file; the difference between a TSV file and a CSV file is
that the former uses \t as its separator while the latter uses a comma, so
sep is set to '\t'.
path = "/Users/用户数据.tsv"
# Read the tab-separated data with \t as the separator; the key function is read_csv
chipo = pd.read_csv(path, sep='\t')
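The TSV-versus-CSV point can be sketched in a self-contained way (the sample rows below are made up, standing in for the real file) by feeding tab-separated text to read_csv via StringIO:

```python
import pandas as pd
from io import StringIO

# Made-up tab-separated sample standing in for the real .tsv file
tsv_text = "order_id\tquantity\titem_name\n1\t1\tChips\n1\t2\tCanned Soda\n"

sample = pd.read_csv(StringIO(tsv_text), sep='\t')  # '\t' because it is TSV
print(sample.shape)  # (2, 3)
```

Reading the same text without sep='\t' would produce a single mangled column, which is exactly the failure mode to watch for with TSV files.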
With the previous step, we have imported the full dataset and named it chipo.
When doing data analysis, after importing the data we generally check whether
anything is wrong with it and get a rough sense of its contents. Since the
full dataset may be very large, we only need to look at part of it, so we
call head() to view the first 100 rows:
# The key function is head
print(chipo.head(100))
Next, to understand the dataset more deeply, we call the info() function to
view the dataset's information:
# The key function is info
chipo.info()
The result includes the index range, the column information (column name,
number of non-null entries per column, and each column's data type), the
dtypes, and the memory usage.
When we only want to see which columns the data contains, we use the columns
attribute.
# The key attribute is columns
print(chipo.columns)
3.3 - Analysis of the Most-Ordered Item
After the practice in task 3.2, we have imported the data and gained some
familiarity with its contents. From this task onward, we begin the analysis
work proper.
First we analyze, by item, which item is ordered the most. The most-ordered
items are generally the flagship products and account for a high share of
revenue, so analyzing them is important.
In this task we will find the most-ordered item.
Task requirements:
1. Extract the item name and order quantity columns
2. Group by item name, computing each group's quantity sum and quantity mean
3. Sort the grouped data by the quantity sum
Task hints:
The first step is to take the two columns we need, item name and order
quantity, out of the DataFrame chipo and create a new DataFrame, c. When
taking columns from a DataFrame, a single column is selected with
a = df['col'], and multiple columns with b = df[['col1', 'col2', 'col3']],
so the code is as follows:
c = chipo[['item_name','quantity']]
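The difference between the single- and double-bracket forms is worth seeing on a toy frame (the values here are made up): single brackets return a Series, double brackets return a DataFrame.

```python
import pandas as pd

# Toy frame for illustration
df = pd.DataFrame({'item_name': ['Soda', 'Bowl'], 'quantity': [2, 1]})

one_col = df['item_name']                 # a Series
two_cols = df[['item_name', 'quantity']]  # a DataFrame

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
```

This is why c above is itself a DataFrame and can be grouped and aggregated like the original.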
Next we group by item name and compute each group's quantity sum and quantity
mean. First we call groupby() to group c by item name. Its as_index parameter
(bool, default True) makes the aggregated output use the group labels as the
index (this is only relevant for DataFrame input); as_index=False gives
"SQL-style" grouped output. pandas also provides the agg() function for
column-based aggregation, whereas groupby can be seen as row-based (index-
based) grouping. The simplest way to picture the difference: grouping is a
row operation. For example, splitting a class roster into those 180 cm and
above, those between 160 and 180, and those below 160 means scanning row by
row, checking which condition each row meets, and assigning it to a group.
That is groupby at its simplest. Once the groups exist, computing each
group's average height is a column operation: within a group we add up all
the heights from top to bottom and divide by the group size, rather than
working left to right. agg() is introduced to make this column-wise step
convenient. So we call agg() on the grouped data; by default it aggregates
the remaining columns of each group. The code is:
# The key functions are groupby and agg
c1 = c.groupby(['item_name'], as_index=False).agg({'quantity': ['sum', 'mean']})
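The class-roster analogy above can be sketched directly (the names and heights are made up): a row-wise step assigns each row to a height band, then agg() aggregates each band column-wise.

```python
import pandas as pd

# Made-up roster for the height-band analogy
roster = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                       'height': [185, 170, 158, 181]})

def band(h):
    # Row-wise step: decide which group each row belongs to
    if h >= 180:
        return '>=180'
    if h >= 160:
        return '160-180'
    return '<160'

roster['band'] = roster['height'].apply(band)

# Column-wise step: count and average the heights within each band
stats = roster.groupby('band').agg({'height': ['count', 'mean']})
print(stats.loc['>=180', ('height', 'mean')])  # 183.0
```

Note that the dict-of-lists form of agg() produces MultiIndex columns such as ('height', 'mean'), which is also why the sort in the next step addresses the column as ('quantity', 'sum').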
Finally we sort the grouped data from the previous step by the quantity sum.
Here we call sort_values(). pandas' sort_values() works much like SQL's
ORDER BY: it sorts the dataset by the values in some field, and it can sort
by a specified column or by a specified row. The code is shown below; the
ascending parameter controls whether the specified column is sorted in
ascending order (default True, i.e. ascending), and inplace controls whether
the sorted data replaces the original (default False, i.e. no replacement).
# The key function is sort_values
c1.sort_values([('quantity', 'sum')], ascending=False, inplace=True)
3.4 - Sales Analysis by Item
Having finished the practice in 3.3, we can see from how many kinds of items
were ordered whether any items went unsold, which helps with adjusting what
is on the menu. Next we analyze how the different items sell, to look for
factors that influence sales.
In this task we will analyze the sales of the different items.
Task requirements:
1. Find how many kinds of items were ordered
2. Analyze how the different items sell
3. Find how many items were ordered in total
Task hints:
The first step is to see how many kinds of items were ordered, i.e. to count
the distinct item names. We implement this with nunique(), which returns the
number of unique values.
# The key function is nunique
print(chipo['item_name'].nunique())
Next we analyze how the different items sell, i.e. count the frequencies of
the 'choice_description' field. We implement this with value_counts(), a
quick way to see how many distinct values a column contains and how many
times each distinct value repeats in that column.
# The key function is value_counts
chipo['choice_description'].value_counts().head(10)
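On a made-up column, value_counts() behaves like this: each distinct value is counted, and the result is sorted by frequency with the most common value first.

```python
import pandas as pd

# Made-up drink column for illustration
s = pd.Series(['Coke', 'Sprite', 'Coke', 'Coke', 'Sprite', 'Fanta'])

counts = s.value_counts()
print(counts.index[0])  # 'Coke' (the most frequent value comes first)
print(counts['Coke'])   # 3
```

Because the result is already sorted descending, chaining .head(10) as in the code above gives the ten most common values directly.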
Then we analyze how many items were ordered in total, i.e. sum the 'quantity'
field. The code is:
# The key function is sum
total_items_orders = chipo.quantity.sum()
3.5 - Total Revenue and Total Order Count
Finally, we can summarize revenue and order counts to analyze the shop's
overall operation, and compute the average total per order to gauge the
shop's price level.
In this task we will analyze total revenue and the total number of orders.
Task requirements:
1. For the period covered by the dataset, what is the total revenue?
2. For the period covered by the dataset, how many orders were there?
3. What is the average total price per order?
Task hints:
First we analyze total revenue, i.e. the sum of item_price. Before summing
item_price, recall from task 3.2 that item_price has dtype object, so it must
be converted to a numeric type before it can be summed. Each entry of
chipo['item_price'] is a string ('$' followed by a number), so calling
.astype(float) directly would fail, since such a string cannot be converted
to float. Instead we can handle each entry with a callable and pass that
callable to apply. The code is:
dollarizer = lambda x: float(x[1:-1])  # strip the '$' and convert the number to float
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
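An equivalent vectorized alternative is str.replace followed by astype. The sketch below uses made-up price strings; note that the lambda in the text drops both the first and the last character, which assumes the price strings end with a trailing space, as they appear to in this dataset.

```python
import pandas as pd

# Made-up '$'-prefixed price strings with a trailing space, as assumed above
prices = pd.Series(['$2.39 ', '$3.39 ', '$16.98 '])

# The lambda from the text: drop the leading '$' and the trailing character
via_apply = prices.apply(lambda x: float(x[1:-1]))

# Vectorized alternative: remove '$', then cast (float() tolerates the space)
via_str = prices.str.replace('$', '', regex=False).astype(float)

print(via_apply.equals(via_str))  # True
```

The vectorized form avoids a Python-level call per row and does not care where the '$' sits, but either approach works here.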
After that conversion, we can sum the item_price field with sum():
# The key function is sum
chipo.item_price.sum()
Next we analyze the total number of orders. We first call value_counts() to
see how many distinct values the order ID column contains and how many times
each repeats, then call count() on the result of value_counts() to count
them. The code is:
# The key functions are value_counts and count
chipo.order_id.value_counts().count()
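On a made-up order_id column, value_counts().count() and nunique() agree, since both reduce to the number of distinct values:

```python
import pandas as pd

# Made-up order IDs: three distinct orders
order_id = pd.Series([1, 1, 2, 3, 3, 3])

a = order_id.value_counts().count()  # rows in the frequency table
b = order_id.nunique()               # number of unique values

print(a, b)  # 3 3
```

nunique() is the more direct spelling; the value_counts() chain is shown in the text because the frequency table itself is often useful along the way.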
Finally we compute the average total price per order. First we call
groupby() to group the data by order ID, and use sum() to total each group.
Then we take the mean of the grouped data's item_price field:
# The key functions are groupby, sum, and mean
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['item_price']
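The same average can be sketched in one chain on a made-up frame by selecting item_price before summing, which also sidesteps summing non-numeric columns (something recent pandas versions reject when calling sum() on a whole grouped DataFrame):

```python
import pandas as pd

# Made-up orders: order 1 totals 5.0, order 2 totals 10.0
orders = pd.DataFrame({'order_id': [1, 1, 2],
                       'item_price': [2.0, 3.0, 10.0]})

avg_order = orders.groupby('order_id')['item_price'].sum().mean()
print(avg_order)  # 7.5
```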