Part 4 - Productivity with Pandas

In the previous notebook, we introduced Pandas, which provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive.

In this notebook, we will build on our knowledge of Pandas to be more productive. Pandas provides sophisticated, multi-level indexing functionality, along with the ability to perform data aggregation operations such as grouping, merging, and joining data. It also provides capabilities for working with time series data, including navigating and manipulating date ranges and time indices. Let's dive into the details.

# Import libraries
import pandas as pd
import numpy as np

Working with Categorical Data

In many practical Data Science activities, you may come across data that contain categorical variables. These variables are typically stored as text values in columns. For such data, you may want to find the unique elements, frequency of each category present, or transform the categorical data into suitable numeric values.

Pandas provides various approaches to handle categorical data. To get started, let's load a small dataset and look at some examples. The Seaborn library comes preloaded with some sample datasets; we will load the tips data from seaborn for our analysis.

import seaborn as sns
tips_data = sns.load_dataset('tips')
print(tips_data.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Unique Values and Value Counts

Pandas provides methods such as unique(), nunique(), and value_counts() to extract information about the values in a column.

unique() and nunique()

unique() can be used to identify the unique elements of a column.

tips_data['day'].unique()
[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]

The result is array-like and can easily be converted to a list by chaining the tolist() method.

tips_data['day'].unique().tolist()
['Sun', 'Sat', 'Thur', 'Fri']

Similarly, unique() can be applied on the index.

indexed_tip = tips_data.set_index('day')
indexed_tip.head()
     total_bill   tip     sex smoker    time  size
day                                               
Sun       16.99  1.01  Female     No  Dinner     2
Sun       10.34  1.66    Male     No  Dinner     3
Sun       21.01  3.50    Male     No  Dinner     3
Sun       23.68  3.31    Male     No  Dinner     2
Sun       24.59  3.61  Female     No  Dinner     4
indexed_tip.index.unique()
CategoricalIndex(['Sun', 'Sat', 'Thur', 'Fri'], categories=['Sun', 'Sat', 'Thur', 'Fri'], ordered=False, name='day', dtype='category')

nunique() can be used to count the number of unique values in a column.

# Count of unique values in day
tips_data['day'].nunique()
4
# Count of unique values in time
tips_data['time'].nunique()
2

value_counts()

value_counts() is used to determine the frequency of the different values present in a column.

tips_data['day'].value_counts()
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

reset_index() can be chained to the value_counts() operation to easily get the results as a DataFrame.

days_df = tips_data['day'].value_counts().reset_index()
days_df
  index  day
0   Sat   87
1   Sun   76
2  Thur   62
3   Fri   19
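Depending on the pandas version, the column names produced by reset_index() can be confusing; in the output above, the day labels end up under index and the counts under day. A quick rename, sketched below (the name count is just an illustrative choice), makes the frame easier to read.

# Rename the columns so the day labels and their counts are clearly named
days_df = tips_data['day'].value_counts().reset_index()
days_df.columns = ['day', 'count']
days_df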

One Hot Encoding

Many machine learning algorithms do not support the presence of categorical values in data. Pandas provides various approaches to transform the categorical data into suitable numeric values to create dummy variables, and one such approach is called One Hot Encoding. The basic strategy is to convert each category value into a new column and assign a 0 or 1 (True/False) value to the column. Dummy variables can be created using get_dummies.

pd.get_dummies(tips_data, columns=['day'])
     total_bill   tip     sex smoker    time  size  day_Thur  day_Fri  day_Sat  day_Sun
0         16.99  1.01  Female     No  Dinner     2         0        0        0        1
1         10.34  1.66    Male     No  Dinner     3         0        0        0        1
2         21.01  3.50    Male     No  Dinner     3         0        0        0        1
3         23.68  3.31    Male     No  Dinner     2         0        0        0        1
4         24.59  3.61  Female     No  Dinner     4         0        0        0        1
..          ...   ...     ...    ...     ...   ...       ...      ...      ...      ...
239       29.03  5.92    Male     No  Dinner     3         0        0        1        0
240       27.18  2.00  Female    Yes  Dinner     2         0        0        1        0
241       22.67  2.00    Male    Yes  Dinner     2         0        0        1        0
242       17.82  1.75    Male     No  Dinner     2         0        0        1        0
243       18.78  3.00  Female     No  Dinner     2         1        0        0        0

244 rows × 10 columns

The resulting dataset contains four new columns (one for each day): day_Thur, day_Fri, day_Sat, and day_Sun. You can pass as many categorical columns as you like and control how the new columns are labeled using the prefix parameter.

pd.get_dummies(tips_data, columns=['day', 'smoker', 'sex'], prefix=['weekday', 'smokes', 'gender'])
     total_bill   tip    time  size  weekday_Thur  weekday_Fri  weekday_Sat  weekday_Sun  smokes_Yes  smokes_No  gender_Male  gender_Female
0         16.99  1.01  Dinner     2             0            0            0            1           0          1            0              1
1         10.34  1.66  Dinner     3             0            0            0            1           0          1            1              0
2         21.01  3.50  Dinner     3             0            0            0            1           0          1            1              0
3         23.68  3.31  Dinner     2             0            0            0            1           0          1            1              0
4         24.59  3.61  Dinner     4             0            0            0            1           0          1            0              1
..          ...   ...     ...   ...           ...          ...          ...          ...         ...        ...          ...            ...
239       29.03  5.92  Dinner     3             0            0            1            0           0          1            1              0
240       27.18  2.00  Dinner     2             0            0            1            0           1          0            0              1
241       22.67  2.00  Dinner     2             0            0            1            0           1          0            1              0
242       17.82  1.75  Dinner     2             0            0            1            0           0          1            1              0
243       18.78  3.00  Dinner     2             1            0            0            0           0          1            0              1

244 rows × 12 columns
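Because the last dummy column in each group is implied by the others, get_dummies also accepts a drop_first flag to keep k-1 columns per category; a small sketch using the same data:

# Keep k-1 dummy columns per categorical column
pd.get_dummies(tips_data, columns=['day', 'smoker', 'sex'], drop_first=True)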

Binning Continuous Variables

You may come across scenarios where you need to bin continuous data into discrete chunks to be used as a categorical variable. We can use the pd.cut() function to cut the data into discrete buckets.

# Bin data into 5 equal-width bins
pd.cut(tips_data['total_bill'], bins=5)
0      (12.618, 22.166]
1       (3.022, 12.618]
2      (12.618, 22.166]
3      (22.166, 31.714]
4      (22.166, 31.714]
             ...       
239    (22.166, 31.714]
240    (22.166, 31.714]
241    (22.166, 31.714]
242    (12.618, 22.166]
243    (12.618, 22.166]
Name: total_bill, Length: 244, dtype: category
Categories (5, interval[float64]): [(3.022, 12.618] < (12.618, 22.166] < (22.166, 31.714] < (31.714, 41.262] < (41.262, 50.81]]

The function results in five equal-width bins. We can also specify bin edges to create specific non-uniform bins.

# Bin data by specifying bin edges
bill_cat = pd.cut(tips_data['total_bill'], bins=[0, 31.714, 50.81])
bill_cat
0      (0.0, 31.714]
1      (0.0, 31.714]
2      (0.0, 31.714]
3      (0.0, 31.714]
4      (0.0, 31.714]
           ...      
239    (0.0, 31.714]
240    (0.0, 31.714]
241    (0.0, 31.714]
242    (0.0, 31.714]
243    (0.0, 31.714]
Name: total_bill, Length: 244, dtype: category
Categories (2, interval[float64]): [(0.0, 31.714] < (31.714, 50.81]]
# Value count for each category
bill_cat.value_counts()
(0.0, 31.714]      218
(31.714, 50.81]     26
Name: total_bill, dtype: int64

The operation creates two non-uniform categories for total_bill.
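If readable category names are preferred over interval notation, pd.cut also accepts a labels argument; the names used below ('regular' and 'large') are just illustrative choices:

# Assign custom labels to the two bins
bill_named = pd.cut(tips_data['total_bill'], bins=[0, 31.714, 50.81],
                    labels=['regular', 'large'])
bill_named.value_counts()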

Data Aggregation

Summarizing data by applying aggregation functions such as sum(), mean(), and median() to each group or category within the data is a critical component of a data analysis workflow. Simple aggregations can give you a high-level overview but are often not enough to gain a deeper understanding of the data.

Pandas provides a flexible groupby() operation which allows for quick and efficient aggregation on subsets of data.

GroupBy

The name "group by" comes from a command in the SQL language. Hadley Wickham, author of popular packages in R programming language, described grouping operations by coining the term split-apply-combine.

  • The split step breaks up and groups a DataFrame based on the value of the specified key. Splitting is performed on a particular axis of a DataFrame i.e. rows (axis=0) or columns (axis=1).
  • The apply step computes some aggregation function within the individual groups.
  • The combine step merges the results of these operations into an output array.

The image below shows a mockup of a simple group aggregation.
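To make the three steps concrete, here is a minimal hand-rolled sketch of split-apply-combine on the tips data (it assumes the tips_data frame loaded earlier and uses its sex and tip columns):

# Split: one sub-frame per unique value of the key column
groups = {key: tips_data[tips_data['sex'] == key]
          for key in tips_data['sex'].unique()}

# Apply: compute an aggregate (here the mean tip) within each group
means = {key: frame['tip'].mean() for key, frame in groups.items()}

# Combine: assemble the per-group results into a single Series
pd.Series(means)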

The groupby() method can be used to apply the basic split-apply-combine operation on a DataFrame by specifying the desired column name.

# Apply Group By
grp1 = tips_data.groupby('sex')
grp1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff1186b91c0>

The method returns a DataFrameGroupBy object. No actual computation has been performed by the groupby() method yet. The idea is that this object has all the information needed to then apply some operation to each of the groups in the data. This "lazy evaluation" approach means that common aggregation functions can be implemented very efficiently using groupby(). For example, to compute the mean, we can call mean() method on the GroupBy object.

# Compute mean
grp1.mean()
        total_bill       tip      size
sex                                   
Male     20.744076  3.089618  2.630573
Female   18.056897  2.833448  2.459770

The data has been aggregated according to the group key and is now indexed by the unique values in the sex column. By default, all of the numeric columns are aggregated.
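Even before any aggregation is applied, the grouped object can be inspected directly; as a small illustrative sketch (using the grp1 object created above), individual groups can be retrieved with get_group() or examined by iterating:

# Retrieve the rows belonging to a single group
grp1.get_group('Female').head()

# Iterate over (group name, sub-DataFrame) pairs
for name, group in grp1:
    print(name, group.shape)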

Using Multiple Keys

Multiple column names can be passed as group keys to group the data appropriately. Let's group the data by smoker and day columns.

# Aggregation using multiple keys
tips_data.groupby(['smoker', 'day']).mean()
             total_bill       tip      size
smoker day                                 
Yes    Thur   19.190588  3.030000  2.352941
       Fri    16.813333  2.714000  2.066667
       Sat    21.276667  2.875476  2.476190
       Sun    24.120000  3.516842  2.578947
No     Thur   17.113111  2.673778  2.488889
       Fri    18.420000  2.812500  2.250000
       Sat    19.661778  3.102889  2.555556
       Sun    20.506667  3.167895  2.929825

The data is now indexed by the unique values in the smoker and day columns. Similarly, other aggregation operations such as sum(), median(), std() etc. can be applied to the groups within data.

# Sum operation
tips_data.groupby(['smoker', 'day']).sum()
             total_bill     tip  size
smoker day                           
Yes    Thur      326.24   51.51    40
       Fri       252.20   40.71    31
       Sat       893.62  120.77   104
       Sun       458.28   66.82    49
No     Thur      770.09  120.32   112
       Fri        73.68   11.25     9
       Sat       884.78  139.63   115
       Sun      1168.88  180.57   167
# Median operation
tips_data.groupby(['smoker', 'day']).median()
             total_bill    tip  size
smoker day                          
Yes    Thur      16.470  2.560     2
       Fri       13.420  2.500     2
       Sat       20.390  2.690     2
       Sun       23.100  3.500     2
No     Thur      15.950  2.180     2
       Fri       19.235  3.125     2
       Sat       17.820  2.750     2
       Sun       18.430  3.020     3
# Standard Deviation operation
tips_data.groupby(['smoker', 'day']).std()
             total_bill       tip      size
smoker day                                 
Yes    Thur    8.355149  1.113491  0.701888
       Fri     9.086388  1.077668  0.593617
       Sat    10.069138  1.630580  0.862161
       Sun    10.442511  1.261151  0.901591
No     Thur    7.721728  1.282964  1.179796
       Fri     5.059282  0.898494  0.500000
       Sat     8.939181  1.642088  0.784960
       Sun     8.130189  1.224785  1.032674

Using aggregate()

The aggregate() method allows for even greater flexibility by taking a string, a function, or a list of these and computing all the aggregates at once. In the example below, the minimum is specified as a string, the median as a NumPy function, and the maximum as Python's built-in max, all passed together in a list.

tips_data.groupby(['smoker', 'day']).aggregate(['min', np.median, max])
            total_bill                  tip                size           
                   min  median    max   min median    max   min median max
smoker day                                                                
Yes    Thur      10.34  16.470  43.11  2.00  2.560   5.00     2      2   4
       Fri        5.75  13.420  40.17  1.00  2.500   4.73     1      2   4
       Sat        3.07  20.390  50.81  1.00  2.690  10.00     1      2   5
       Sun        7.25  23.100  45.35  1.50  3.500   6.50     2      2   5
No     Thur       7.51  15.950  41.19  1.25  2.180   6.70     1      2   6
       Fri       12.46  19.235  22.75  1.50  3.125   3.50     2      2   3
       Sat        7.25  17.820  48.33  1.00  2.750   9.00     1      2   4
       Sun        8.77  18.430  48.17  1.01  3.020   6.00     2      3   6

Aggregation functions can also be passed as a dictionary mapping column names to the operations to apply to each column. The example below applies the min operation to the total_bill column and the max operation to the tip column.

tips_data.groupby(['smoker', 'day']).aggregate({'total_bill':'min', 'tip':'max'})
             total_bill    tip
smoker day                    
Yes    Thur       10.34   5.00
       Fri         5.75   4.73
       Sat         3.07  10.00
       Sun         7.25   6.50
No     Thur        7.51   6.70
       Fri        12.46   3.50
       Sat         7.25   9.00
       Sun         8.77   6.00

A more complex operation could involve passing a list of operations to be applied to a specific column.

tips_data.groupby(['smoker', 'day']).aggregate({'total_bill':['min','max','count'], 'tip':'max'})
            total_bill                  tip
                   min    max count     max
smoker day                                 
Yes    Thur      10.34  43.11    17    5.00
       Fri        5.75  40.17    15    4.73
       Sat        3.07  50.81    42   10.00
       Sun        7.25  45.35    19    6.50
No     Thur       7.51  41.19    45    6.70
       Fri       12.46  22.75     4    3.50
       Sat        7.25  48.33    45    9.00
       Sun        8.77  48.17    57    6.00

Selecting a Subset of Columns

For large datasets, it may be desirable to aggregate a specific column or only a subset of columns. As an example, we can group the data by smoker and compute the mean of the tip column as follows:

tips_data.groupby(['smoker'])['tip'].mean()
smoker
Yes    3.008710
No     2.991854
Name: tip, dtype: float64

Similarly, we can group the data by the smoker and day columns and compute the median of the tip column.

tips_data.groupby(['smoker','day'])['tip'].median()
smoker  day 
Yes     Thur    2.560
        Fri     2.500
        Sat     2.690
        Sun     3.500
No      Thur    2.180
        Fri     3.125
        Sat     2.750
        Sun     3.020
Name: tip, dtype: float64

Pivot Tables

Pivot Table is a popular operation that is commonly used on tabular data in spreadsheets. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. Pivot Tables are essentially a multidimensional version of GroupBy. Pandas includes a pandas.pivot_table function and DataFrame also has a pivot_table method.

The Seaborn library comes preloaded with some sample datasets. We will load the titanic dataset from seaborn for our analysis and look at some examples.

# Get Data
titanic_df = sns.load_dataset('titanic')
titanic_df.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Let's say we want to look at the average survival rate by both sex and class. We can get the results using both GroupBy and pivot_table.

  • Using GroupBy
titanic_df.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
class      First    Second     Third
sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
  • Using pivot_table
titanic_df.pivot_table(values='survived', index='sex', columns='class')
class      First    Second     Third
sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

We can see that the pivot_table approach is much more readable than the GroupBy version and produces the same result. The default aggregation operation is mean.

Multi-level Pivot Table

Just as in GroupBy, the data can be grouped at multiple levels using pivot_table. Suppose we want to group survival by sex and age. Since age is a continuous variable, we can create bins for age using the pd.cut function and then group the data.

# Create bins for Age
age = pd.cut(titanic_df['age'], bins=[0,18,80])
age
0      (18.0, 80.0]
1      (18.0, 80.0]
2      (18.0, 80.0]
3      (18.0, 80.0]
4      (18.0, 80.0]
           ...     
886    (18.0, 80.0]
887    (18.0, 80.0]
888             NaN
889    (18.0, 80.0]
890    (18.0, 80.0]
Name: age, Length: 891, dtype: category
Categories (2, interval[int64]): [(0, 18] < (18, 80]]
# Group data
titanic_df.pivot_table(values='survived', index=['sex', age], columns='class')
class               First    Second     Third
sex    age                                   
female (0, 18]   0.909091  1.000000  0.511628
       (18, 80]  0.972973  0.900000  0.423729
male   (0, 18]   0.800000  0.600000  0.215686
       (18, 80]  0.375000  0.071429  0.133663

The operation can be applied to columns in a similar fashion. Suppose we want to group survival by sex and age and look at the data by class and fare.

We can discretize the fare variable into equal-sized buckets based on sample quantiles using pd.qcut and then group the data.

# Create bins for fare
fare = pd.qcut(titanic_df['fare'], q=2)
# Group data
titanic_df.pivot_table(values='survived', index=['sex', age], columns=[fare, 'class'])
fare            (-0.001, 14.454]                     (14.454, 512.329]                    
class                      First    Second     Third             First    Second     Third
sex    age                                                                                
female (0, 18]               NaN  1.000000  0.714286          0.909091  1.000000  0.318182
       (18, 80]              NaN  0.880000  0.444444          0.972973  0.914286  0.391304
male   (0, 18]               NaN  0.000000  0.260870          0.800000  0.818182  0.178571
       (18, 80]              0.0  0.098039  0.125000          0.391304  0.030303  0.192308

Using aggfunc

The aggfunc keyword can be used to specify which aggregation function is applied to each column.

titanic_df.pivot_table(index='sex', columns='class',
                      aggfunc={'survived':'sum', 'fare': 'mean'})
              fare                            survived              
class        First     Second      Third        First  Second  Third
sex                                                                  
female  106.125798  21.970121  16.118810           91      70     72
male     67.226127  19.741782  12.661633           45      17     47

To compute totals along each grouping, the margins keyword can be used.

titanic_df.pivot_table(values='survived', index='sex', columns='class',
                      aggfunc='sum', margins=True)
class   First  Second  Third  All
sex                              
female     91      70     72  233
male       45      17     47  109
All       136      87    119  342

describe() method

Convenience methods, such as describe(), can be used to compute several common aggregates for each column. It also comes in handy when you are trying to understand the overall properties of a dataset.

tips_data.describe()
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

describe() on GroupBy

describe() can be used on a groupby() object to get common aggregates for a subset of data.

tips_data.groupby(['smoker','day'])['total_bill'].describe()
             count       mean        std    min     25%     50%      75%    max
smoker day                                                                     
Yes    Thur   17.0  19.190588   8.355149  10.34  13.510  16.470  19.8100  43.11
       Fri    15.0  16.813333   9.086388   5.75  11.690  13.420  18.6650  40.17
       Sat    42.0  21.276667  10.069138   3.07  13.405  20.390  26.7925  50.81
       Sun    19.0  24.120000  10.442511   7.25  17.165  23.100  32.3750  45.35
No     Thur   45.0  17.113111   7.721728   7.51  11.690  15.950  20.2700  41.19
       Fri     4.0  18.420000   5.059282  12.46  15.100  19.235  22.5550  22.75
       Sat    45.0  19.661778   8.939181   7.25  14.730  17.820  20.6500  48.33
       Sun    57.0  20.506667   8.130189   8.77  14.780  18.430  25.0000  48.17

Combining Data

Data science workflows often involve combining data from different sources to enhance the analysis. There are multiple ways in which data can be combined, ranging from the straightforward concatenation of two different datasets to more complicated database-style joins.

  • pandas.concat - stacks together objects along an axis.
  • DataFrame.append - works similarly to pandas.concat but does not modify the original object. It creates a new object with the combined data.
  • pandas.merge - joins rows of DataFrames based on one or more keys. It works as the entry point for all standard database join operations between DataFrame or Series objects.
  • DataFrame.join - a DataFrame method for merging by index. It can be used to combine many DataFrame objects with the same or similar indexes but non-overlapping columns.

concat()

concat() can be used to stack data frames together along an axis.

# Create Data
df2 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
print('df2 is: \n', df2)

df3 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
print('\ndf3 is: \n', df3)

df4 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['one', 'two'])
print('\ndf4 is: \n', df4)
df2 is: 
    one  two
a    0    1
b    2    3
c    4    5

df3 is: 
    three  four
a      5     6
c      7     8

df4 is: 
    one  two
a    5    6
c    7    8

By default, concat() works row-wise within the DataFrame (along axis=0).

pd.concat([df2,df4])
   one  two
a    0    1
b    2    3
c    4    5
a    5    6
c    7    8

The ignore_index flag can be used to discard the existing index when it is not needed; the result gets a fresh integer index.

pd.concat([df2,df4], ignore_index=True)
   one  two
0    0    1
1    2    3
2    4    5
3    5    6
4    7    8

You can specify the axis along which the concatenation should take place. If the axis parameter is not specified, the concatenation works row-wise, generating NaN values for unmatched columns.

pd.concat([df2,df3])
   one  two  three  four
a  0.0  1.0    NaN   NaN
b  2.0  3.0    NaN   NaN
c  4.0  5.0    NaN   NaN
a  NaN  NaN    5.0   6.0
c  NaN  NaN    7.0   8.0

When axis='columns' is specified, the concatenation works along columns.

pd.concat([df2,df3], axis='columns')
   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0

append()

append() works similarly to concat() but does not modify the original object; it creates a new object with the combined data. The method works row-wise within the DataFrame (along axis=0). It is not very efficient, as each call involves the creation of a new index and data buffer.

df2.append(df4)
   one  two
a    0    1
b    2    3
c    4    5
a    5    6
c    7    8
df2.append(df3)
   one  two  three  four
a  0.0  1.0    NaN   NaN
b  2.0  3.0    NaN   NaN
c  4.0  5.0    NaN   NaN
a  NaN  NaN    5.0   6.0
c  NaN  NaN    7.0   8.0
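Because each append() call copies the data, a common pattern when combining many frames (for example, frames built up in a loop) is to collect them in a list and call concat() once; a small sketch using the frames defined above:

# Collect the pieces first, then concatenate in a single call
frames = [df2, df3, df4]
pd.concat(frames)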

merge()

merge() joins rows of DataFrames based on one or more keys. It works as the entry point for all standard database join operations. Let's create sample data and look at some examples.

# Create data
dept_df = pd.DataFrame({'employee': ['John', 'Jake', 'Jane', 'Suzi', 'Chad'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'Management', 'Marketing']})
print('dept_df is: \n', dept_df)

skills_df = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'Management', 'Management', 'Operations'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization', 'SAP']})
print('\nskills_df is: \n', skills_df)
dept_df is: 
   employee        group
0     John   Accounting
1     Jake  Engineering
2     Jane  Engineering
3     Suzi   Management
4     Chad    Marketing

skills_df is: 
          group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4   Management  spreadsheets
5   Management  organization
6   Operations           SAP
# Apply merge
pd.merge(dept_df,skills_df)
  employee        group        skills
0     John   Accounting          math
1     John   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Jane  Engineering        coding
5     Jane  Engineering         linux
6     Suzi   Management  spreadsheets
7     Suzi   Management  organization

Notice that a column to join on was not specified; merge() uses the overlapping column names as the join keys. It is good practice to explicitly specify the join column using the on keyword. The 'Marketing' and 'Operations' values and their associated data are missing from the result because the operation returns only the common set of keys.

pd.merge(dept_df,skills_df, on='group')
  employee        group        skills
0     John   Accounting          math
1     John   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Jane  Engineering        coding
5     Jane  Engineering         linux
6     Suzi   Management  spreadsheets
7     Suzi   Management  organization

If the column names differ between the DataFrames, the left_on and right_on keywords can be used.

# Create data
emp_df = pd.DataFrame({'name': ['John', 'Jake', 'Jane', 'Suzi', 'Chad'],
                    'salary': [70000, 80000, 120000, 65000, 90000]})
emp_df
   name  salary
0  John   70000
1  Jake   80000
2  Jane  120000
3  Suzi   65000
4  Chad   90000
# Merge
pd.merge(dept_df, emp_df, left_on='employee', right_on='name')
  employee        group  name  salary
0     John   Accounting  John   70000
1     Jake  Engineering  Jake   80000
2     Jane  Engineering  Jane  120000
3     Suzi   Management  Suzi   65000
4     Chad    Marketing  Chad   90000

The redundant column can be dropped as needed using the drop() method.

pd.merge(dept_df, emp_df, left_on='employee', right_on='name').drop('name', axis='columns')
  employee        group  salary
0     John   Accounting   70000
1     Jake  Engineering   80000
2     Jane  Engineering  120000
3     Suzi   Management   65000
4     Chad    Marketing   90000

Inner Join

By default, merge() performs an inner join. The result is an intersection, or the common set of keys found in both DataFrames. The merge operations we just saw were all inner joins. Different join types such as left, right, and outer can be specified using the how parameter.

pd.merge(dept_df,skills_df, on='group', how='inner')
  employee        group        skills
0     John   Accounting          math
1     John   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Jane  Engineering        coding
5     Jane  Engineering         linux
6     Suzi   Management  spreadsheets
7     Suzi   Management  organization

'Marketing' and 'Operations' values and their associated data are missing from the result, as the operation returns only the common set.

Left Join

This join returns all records from the left DataFrame and the matched records from the right DataFrame.

pd.merge(dept_df,skills_df, on='group', how='left')
  employee        group        skills
0     John   Accounting          math
1     John   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Jane  Engineering        coding
5     Jane  Engineering         linux
6     Suzi   Management  spreadsheets
7     Suzi   Management  organization
8     Chad    Marketing           NaN

The result is a Cartesian product of the matching rows. Since there were two 'Engineering' rows in the left DataFrame and two in the right, there are four 'Engineering' rows in the result. Similarly, there was one 'Accounting' row in the left DataFrame and two in the right, resulting in two 'Accounting' rows. The row for 'Operations' is missing, as this join only keeps matched rows from the right DataFrame.

Right Join

This join returns all records from the right DataFrame and the matched records from the left DataFrame.

pd.merge(dept_df,skills_df, on='group', how='right')
  employee        group        skills
0     John   Accounting          math
1     John   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jane  Engineering        coding
4     Jake  Engineering         linux
5     Jane  Engineering         linux
6     Suzi   Management  spreadsheets
7     Suzi   Management  organization
8      NaN   Operations           SAP

The result is a Cartesian product of the matching rows. Since there were two 'Engineering' rows in the right DataFrame and two in the left, there are four 'Engineering' rows in the result. Similarly, there was one 'Accounting' row in the left DataFrame and two in the right, resulting in two 'Accounting' rows. The row for 'Marketing' is missing, as this join only keeps matched rows from the left DataFrame.

Outer Join

The outer join takes the union of the keys, combining the effect of applying both left and right joins.

out_df = pd.merge(dept_df,skills_df, on='group', how='outer')
out_df
  employee        group        skills
0     John   Accounting          math
1     John   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Jane  Engineering        coding
5     Jane  Engineering         linux
6     Suzi   Management  spreadsheets
7     Suzi   Management  organization
8     Chad    Marketing           NaN
9      NaN   Operations           SAP

The result contains all key combinations from both DataFrames, with missing values filled in as NaN.
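When exploring an outer join, it can help to know where each row came from. merge() accepts an indicator flag for this; a quick sketch using the same frames:

# indicator=True adds a _merge column recording whether each row came from
# the left frame only, the right frame only, or both
pd.merge(dept_df, skills_df, on='group', how='outer', indicator=True)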

Using index to merge

The index can also be used as the merge key by specifying the left_index and/or right_index flags.

# Create data
dept_dfa = dept_df.set_index('group')
print('dept_dfa is \n', dept_dfa)
skills_dfa = skills_df.set_index('group')
print('\nskills_dfa is \n', skills_dfa)
dept_dfa is 
             employee
group               
Accounting      John
Engineering     Jake
Engineering     Jane
Management      Suzi
Marketing       Chad

skills_dfa is 
                    skills
group                    
Accounting           math
Accounting   spreadsheets
Engineering        coding
Engineering         linux
Management   spreadsheets
Management   organization
Operations            SAP
# Merge
pd.merge(dept_dfa, skills_dfa, left_index=True, right_index=True)
            employee        skills
group                             
Accounting      John          math
Accounting      John  spreadsheets
Engineering     Jake        coding
Engineering     Jake         linux
Engineering     Jane        coding
Engineering     Jane         linux
Management      Suzi  spreadsheets
Management      Suzi  organization

join()

The join() method of a DataFrame can also be used for merging by index. The how keyword specifies the type of join.

dept_dfa.join(skills_dfa, how='inner')
            employee        skills
group                             
Accounting      John          math
Accounting      John  spreadsheets
Engineering     Jake        coding
Engineering     Jake         linux
Engineering     Jane        coding
Engineering     Jane         linux
Management      Suzi  spreadsheets
Management      Suzi  organization

'Marketing' and 'Operations' values and their associated data are missing from the result, as the operation returns only the common set.

dept_dfa.join(skills_dfa, how='outer')
            employee        skills
group                             
Accounting      John          math
Accounting      John  spreadsheets
Engineering     Jake        coding
Engineering     Jake         linux
Engineering     Jane        coding
Engineering     Jane         linux
Management      Suzi  spreadsheets
Management      Suzi  organization
Marketing       Chad           NaN
Operations       NaN           SAP

The result shows all data, filling in missing values with NaN.

Hierarchical Indexing

Hierarchical indexing (also known as multi-indexing) allows you to have multiple (two or more) index levels within a single index on an axis. It provides a way for representing higher dimensional data in a lower dimensional form. Let's start with a simple example, creating a series with multi-index.

Multi-indexed Series

# Create Data
index = [('CA', 2005), ('CA', 2015),
         ('NY', 2005), ('NY', 2015),
         ('TX', 2005), ('TX', 2015)]
population = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop_series = pd.Series(population, index=index)
pop_series
(CA, 2005)    33871648
(CA, 2015)    37253956
(NY, 2005)    18976457
(NY, 2015)    19378102
(TX, 2005)    20851820
(TX, 2015)    25145561
dtype: int64

With this indexing, you can index or slice the series using these tuples. However, what if you wanted to select all the values for 2015? The tuple-based index is essentially a rudimentary multi-index, and the Pandas MultiIndex type allows us to create proper multi-level indexes. This gives us the flexibility to perform operations on the index easily and efficiently.

We will create a multi-level index for pop_series using the MultiIndex type.

new_index = pd.MultiIndex.from_tuples(index)
new_index.levels
FrozenList([['CA', 'NY', 'TX'], [2005, 2015]])

Notice that the new_index object contains multiple levels of indexing: the state names and the years. We can now reindex pop_series to see a hierarchical representation of the data.

pop_series = pop_series.reindex(new_index)
pop_series
CA  2005    33871648
    2015    37253956
NY  2005    18976457
    2015    19378102
TX  2005    20851820
    2015    25145561
dtype: int64

Subset Selection

Subset selection of multi-indexed data is similar to what we have seen in the previous part of this guide series. Let's take a quick look.

  • Select data for California
pop_series['CA']
2005    33871648
2015    37253956
dtype: int64
  • Select data from New York to Texas
pop_series['NY':'TX']
NY  2005    18976457
    2015    19378102
TX  2005    20851820
    2015    25145561
dtype: int64
  • Select data for California and Texas
pop_series[['CA','TX']]
CA  2005    33871648
    2015    37253956
TX  2005    20851820
    2015    25145561
dtype: int64

We can now easily access the data by the second index level and answer our question about selecting all the values for 2015.

pop_series[:,2015]
CA    37253956
NY    19378102
TX    25145561
dtype: int64
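An equivalent, and sometimes more readable, way to take this cross-section is the xs() method, which selects all entries for a given key at a specified level; a small sketch:

# Cross-section: all values for the year 2015 (the second index level)
pop_series.xs(2015, level=1)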

unstack() and stack()

The unstack() method converts a multi-indexed Series into a DataFrame, and stack() naturally does the opposite.

# Unstack
pop_df = pop_series.unstack()
pop_df
        2005      2015
CA  33871648  37253956
NY  18976457  19378102
TX  20851820  25145561

The result is a DataFrame where the second index level (years) is converted to columns, and the first level (states) remains as the index of the DataFrame.

# Stack
pop_df.stack()
CA  2005    33871648
    2015    37253956
NY  2005    18976457
    2015    19378102
TX  2005    20851820
    2015    25145561
dtype: int64

A level can be specified to tell unstack() which index level to pivot into columns. Specifying level=0 will unstack based on the outermost index level, i.e. by state.

pop_series.unstack(level=0)
            CA        NY        TX
2005  33871648  18976457  20851820
2015  37253956  19378102  25145561

Specifying level=1 unstacks by the inner index, in this case 'year'.

pop_series.unstack(level=1)
        2005      2015
CA  33871648  37253956
NY  18976457  19378102
TX  20851820  25145561

Multi-indexed DataFrame

In a DataFrame, both rows and columns can have multiple levels of indices. Let's create some sample data and take a look.

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2019, 2018], [2, 1]],
                                   names=['year', 'quarter'])
columns = pd.MultiIndex.from_product([['John', 'Jane', 'Ben'], ['Product A', 'Product B']],
                                     names=['sales person', 'product'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 55

# create the DataFrame
sales_data = pd.DataFrame(data, index=index, columns=columns)
sales_data
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2019 2             49.0      56.3      54.0      55.4      48.0      54.5
     1             51.0      55.7      58.0      53.5      67.0      52.9
2018 2             43.0      54.9      52.0      55.9      46.0      55.0
     1             39.0      55.6      62.0      56.3      39.0      55.2

sales_data is essentially four-dimensional data with 'sales person', 'product', 'year', and 'quarter' as its dimensions.

Subset Selection

Since columns in a DataFrame are individual Series, the syntax used for multi-indexed Series applies to the columns.

  • Select data for 'Product A' sold by John
sales_data['John','Product A']
year  quarter
2019  2          49.0
      1          51.0
2018  2          43.0
      1          39.0
Name: (John, Product A), dtype: float64
  • Select data for John and Jane
sales_data[['John','Jane']]
sales person       John                Jane          
product       Product A Product B Product A Product B
year quarter                                          
2019 2             49.0      56.3      54.0      55.4
     1             51.0      55.7      58.0      53.5
2018 2             43.0      54.9      52.0      55.9
     1             39.0      55.6      62.0      56.3

The .loc and .iloc indexers can also be used.

  • Sales data for first two rows and first four columns
sales_data.iloc[:2, :4]
sales person       John                Jane          
product       Product A Product B Product A Product B
year quarter                                          
2019 2             49.0      56.3      54.0      55.4
     1             51.0      55.7      58.0      53.5
  • Jane's sales in 2019
sales_data.loc[2019,'Jane']
product  Product A  Product B
quarter                      
2             54.0       55.4
1             58.0       53.5
  • Jane's sales in 2019 for Product B
sales_data.loc[2019,('Jane','Product B')]
quarter
2    55.4
1    53.5
Name: (Jane, Product B), dtype: float64
  • Jane's sales in second quarter of 2019 for Product B
sales_data.loc[(2019,2), ('Jane','Product B')]
55.4
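For more involved selections on both axes, pd.IndexSlice provides a convenient slicing helper. Slicing a MultiIndex generally requires it to be lexsorted, so the sketch below (using the same sales_data) sorts both axes first:

# Build slicers for the row and column MultiIndex
idx = pd.IndexSlice
sorted_sales = sales_data.sort_index().sort_index(axis='columns')

# All of 2019's rows, 'Product A' columns for every sales person
sorted_sales.loc[idx[2019, :], idx[:, 'Product A']]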

Sorting

By Index and Level

sort_index() can be used to sort the index or specific levels within your data. By default, the sorting operation is performed on the outermost index level (level=0) in ascending order.

sales_data.sort_index()
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2018 1             39.0      55.6      62.0      56.3      39.0      55.2
     2             43.0      54.9      52.0      55.9      46.0      55.0
2019 1             51.0      55.7      58.0      53.5      67.0      52.9
     2             49.0      56.3      54.0      55.4      48.0      54.5

Specifying level=1 sorts by the inner index.

sales_data.sort_index(level=1)
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2018 1             39.0      55.6      62.0      56.3      39.0      55.2
2019 1             51.0      55.7      58.0      53.5      67.0      52.9
2018 2             43.0      54.9      52.0      55.9      46.0      55.0
2019 2             49.0      56.3      54.0      55.4      48.0      54.5

Sorting can be applied on columns by specifying axis='columns'.

sales_data.sort_index(axis='columns')
sales person        Ben                Jane                John          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2019 2             48.0      54.5      54.0      55.4      49.0      56.3
     1             67.0      52.9      58.0      53.5      51.0      55.7
2018 2             46.0      55.0      52.0      55.9      43.0      54.9
     1             39.0      55.2      62.0      56.3      39.0      55.6
sales_data.sort_index(axis='columns', level=1)
sales person        Ben      Jane      John       Ben      Jane      John
product       Product A Product A Product A Product B Product B Product B
year quarter                                                             
2019 2             48.0      54.0      49.0      54.5      55.4      56.3
     1             67.0      58.0      51.0      52.9      53.5      55.7
2018 2             46.0      52.0      43.0      55.0      55.9      54.9
     1             39.0      62.0      39.0      55.2      56.3      55.6

By Value

sort_values() can be used to sort the values in a DataFrame by one or more columns.

  • Sort by values of quarter and then year
sales_data.sort_values(by=['quarter','year'])
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2018 1             39.0      55.6      62.0      56.3      39.0      55.2
2019 1             51.0      55.7      58.0      53.5      67.0      52.9
2018 2             43.0      54.9      52.0      55.9      46.0      55.0
2019 2             49.0      56.3      54.0      55.4      48.0      54.5

For multi-indexed data, the column label must be unique, so the value passed to the by parameter must be a tuple with one element per level.

  • Sort by values of Product A for Ben
sales_data.sort_values(by=('Ben','Product A'))
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2018 1             39.0      55.6      62.0      56.3      39.0      55.2
     2             43.0      54.9      52.0      55.9      46.0      55.0
2019 2             49.0      56.3      54.0      55.4      48.0      54.5
     1             51.0      55.7      58.0      53.5      67.0      52.9

Multiple columns, or a combination of columns and index levels, can be specified by passing them as a list.

sales_data.sort_values(by=[('Jane','Product A'), ('quarter')])
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2018 2             43.0      54.9      52.0      55.9      46.0      55.0
2019 2             49.0      56.3      54.0      55.4      48.0      54.5
     1             51.0      55.7      58.0      53.5      67.0      52.9
2018 1             39.0      55.6      62.0      56.3      39.0      55.2

Data Aggregations

We have seen data aggregations in a previous section in this notebook. Various aggregation functions such as sum(), mean(), median() can be applied to multi-indexed data.

# Get data
sales_data
sales person       John                Jane                 Ben          
product       Product A Product B Product A Product B Product A Product B
year quarter                                                             
2019 2             49.0      56.3      54.0      55.4      48.0      54.5
     1             51.0      55.7      58.0      53.5      67.0      52.9
2018 2             43.0      54.9      52.0      55.9      46.0      55.0
     1             39.0      55.6      62.0      56.3      39.0      55.2
  • Total sales by sales person and product
sales_data.sum()
sales person  product  
John          Product A    182.0
              Product B    222.5
Jane          Product A    226.0
              Product B    221.1
Ben           Product A    200.0
              Product B    217.6
dtype: float64
  • Total sales by year and quarter
sales_data.sum(axis='columns')
year  quarter
2019  2          317.2
      1          338.1
2018  2          306.8
      1          307.1
dtype: float64

The level parameter controls the index level over which the aggregation is applied. Let's look at some examples.

  • Total sales for each quarter by product and sales person
sales_data.sum(level='quarter')
sales person      John                Jane                 Ben          
product      Product A Product B Product A Product B Product A Product B
quarter                                                                  
2                 92.0     111.2     106.0     111.3      94.0     109.5
1                 90.0     111.3     120.0     109.8     106.0     108.1
  • Average sales for each year by product and sales person
yearly_avg = sales_data.mean(level='year')
yearly_avg
sales person      John                Jane                 Ben          
product      Product A Product B Product A Product B Product A Product B
year                                                                     
2019              50.0     56.00      56.0     54.45      57.5      53.7
2018              41.0     55.25      57.0     56.10      42.5      55.1
  • Average sales per year by Product
yearly_avg.mean(axis=1, level='product')
product  Product A  Product B
year                         
2019     54.500000  54.716667
2018     46.833333  55.483333
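Note that recent pandas releases deprecate passing level= directly to reduction methods such as sum() and mean(); an equivalent groupby-based form (a sketch using the same sales_data) is:

# Total sales for each quarter, grouping on the 'quarter' index level
sales_data.groupby(level='quarter').sum()

# Average sales for each year, grouping on the 'year' index level
sales_data.groupby(level='year').mean()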

Conclusion

In this part of the guide series, we learned how to be more productive with Pandas. We started by working with categorical data, covering unique values and value counts, one hot encoding, and binning continuous variables. Next, we explored data aggregation using groupby and pivot_table, and then looked at how data can be combined using the concat(), append(), merge(), and join() methods. Finally, in the hierarchical indexing section, we saw how data can be indexed at multiple levels and discussed selection, sorting, and aggregation on multi-indexed Series and DataFrames.

In the next part of this guide series, we will explore the capabilities for working with Time Series data.

