Pandas cheat sheet

07-05-2023

Introduction

In this article I will go through the common tasks that you will do using pandas. I will use a premier league season result as an example data set to work with.

Background

While working with pandas I found myself using a set of common commands to work on a given dataset. I thought adding all these commands in one place ina form of cheat sheet could be beneficial.

The dataset

I using this dataset for premier league season 2018-2019. It is a dataset that most people will find it easy to read and follow. The shema for this dataset can be found here

Load the dataset

The dataset is stored in a csv file. I am using the read_csv command to load it. I am also using %matplotlib inline to allow graphs to be displayed in jupyter

%matplotlib inline
import pandas as pd

df = pd.read_csv("data/season-1819.csv")

Explore data

These commands will show you some metadata about the dataset. They are used as a good introduction to understand few things about the dataset like how big it is. I have added sample out put for df.shape and df.columns. df.describe() will display mean, count, min and max among other calculations. This method display some fantastic statistics really quickly. For example the maximum home team shots were 36!

df.shape # output: (380, 62)
df.columns 
"""
output:
Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
       'BWA', 'IWH', 'IWD', 'IWA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA',
       'VCH', 'VCD', 'VCA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD', 'BbAvD',
       'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5',
       'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA',
       'PSCH', 'PSCD', 'PSCA'],
      dtype='object')
"""
df.describe()

Working with Columns

While viewing a dataset sometimes you may want to remove columns or rename their label for easier reading. The following listings show some columns modifications techniques.

# Modify columns labels
df.rename(columns={'FTHG': 'FullTimeHomeGoals'})
# Remove columns by label
df.drop(columns=['Div'])
df.drop('Div', axis=1)
# Remove columns by column index
df.drop(df.columns[0], axis=1)
# remove columns by range
df.drop(df.iloc[:, 23:65], axis=1) # specify start and end
df.drop(df.iloc[:, 23:], axis=1) # specify start only (23) and it will automatically go to the end

When using iloc notice the first element in the array is :, first element is for rows but since we are targeting columns we use the second element. The second element is a normal python array range. For example 23:65 returns the list of columns from 23 to 65 whereas 23: will return all columns starting from 23.

Working with rows

You can use the techniques shown in below listings to remove or select certain rows.

# remove rows
# remove first row by numerical auto index
df.drop(0)
# remove multiple rows by index
df.drop(df.index[[0,1,2,3]])
# remove rows using range
df.drop(df.index[2:])
# remove rows using condition
df[df.HomeTeam != "Man United"]
df.loc[df.HomeTeam != "Man United"]

Rename auto index

When loading a data frame, pandas used auto index that starts from 0 to identify each row. You can rename the title for this auto generated index column. Please see listings below on how to do that:

# Rename auto index
df.index.names = ["GameId"]

Summary

In this blog I went through a set of common commands you will need when working with pandas. These commands will help you to understand more about the data you are working with as well as tailor it to your needs.