Introduction
In this article I will go through the common tasks that you will do using pandas. I will use a premier league season result as an example data set to work with.
Background
While working with pandas I found myself using a set of common commands to work on a given dataset. I thought adding all these commands in one place ina form of cheat sheet could be beneficial.
The dataset
I using this dataset for premier league season 2018-2019. It is a dataset that most people will find it easy to read and follow. The shema for this dataset can be found here
Load the dataset
The dataset is stored in a csv
file. I am using the read_csv
command to load it. I am also using %matplotlib inline
to allow graphs to be displayed in jupyter
%matplotlib inline
import pandas as pd
df = pd.read_csv("data/season-1819.csv")
Explore data
These commands will show you some metadata about the dataset. They are used as a good introduction to understand few things about the dataset like how big it is. I have added sample out put for df.shape
and df.columns
. df.describe()
will display mean
, count
, min
and max
among other calculations. This method display some fantastic statistics really quickly. For example the maximum home team shots were 36!
df.shape # output: (380, 62)
df.columns
"""
output:
Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
'BWA', 'IWH', 'IWD', 'IWA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA',
'VCH', 'VCD', 'VCA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD', 'BbAvD',
'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5',
'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA',
'PSCH', 'PSCD', 'PSCA'],
dtype='object')
"""
df.describe()
Working with Columns
While viewing a dataset sometimes you may want to remove columns or rename their label for easier reading. The following listings show some columns modifications techniques.
# Modify columns labels
df.rename(columns={'FTHG': 'FullTimeHomeGoals'})
# Remove columns by label
df.drop(columns=['Div'])
df.drop('Div', axis=1)
# Remove columns by column index
df.drop(df.columns[0], axis=1)
# remove columns by range
df.drop(df.iloc[:, 23:65], axis=1) # specify start and end
df.drop(df.iloc[:, 23:], axis=1) # specify start only (23) and it will automatically go to the end
When using iloc
notice the first element in the array is :
, first element is for rows but since we are targeting columns we use the second element. The second element is a normal python
array range. For example 23:65
returns the list of columns from 23 to 65 whereas 23:
will return all columns starting from 23
.
Working with rows
You can use the techniques shown in below listings to remove or select certain rows.
# remove rows
# remove first row by numerical auto index
df.drop(0)
# remove multiple rows by index
df.drop(df.index[[0,1,2,3]])
# remove rows using range
df.drop(df.index[2:])
# remove rows using condition
df[df.HomeTeam != "Man United"]
df.loc[df.HomeTeam != "Man United"]
Rename auto index
When loading a data frame, pandas
used auto index that starts from 0
to identify each row. You can rename the title for this auto generated index column. Please see listings below on how to do that:
# Rename auto index
df.index.names = ["GameId"]
Summary
In this blog I went through a set of common commands you will need when working with pandas
. These commands will help you to understand more about the data you are working with as well as tailor it to your needs.