Pandas#

Pandas allows you to manipulate tabular data (i.e. data organized as rows and columns) in a data structure known as a dataframe. While you may at times need to create such a structure yourself, very often you will read the data for the dataframe in from a file such as a CSV file. (Many other file types are supported as well, such as Parquet, Excel, and JSON.)
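
Readers for those other formats follow the same pattern as pd.read_csv used below. A quick sketch (the file names are hypothetical; Parquet additionally requires the pyarrow or fastparquet package, and Excel files require openpyxl):

import pandas as pd

df_parquet = pd.read_parquet('assets/example.parquet')  # hypothetical Parquet file
df_excel = pd.read_excel('assets/example.xlsx')         # hypothetical Excel file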

import pandas as pd
import matplotlib.pyplot as plt

Read in a dataframe and display it#

df = pd.read_csv('assets/pandas-example.csv')
df
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
0 Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0
1 Bob London 1985 Green UK Doctor Reading Cat 180 2 1
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
3 David Tokyo 1988 Indigo Japan Chef Cooking Bird 175 1 2
4 Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0
5 Frank Berlin 1983 Orange Germany Musician Playing Guitar Dog 185 1 1
6 Grace London 1987 Yellow UK Journalist Writing Rabbit 168 0 2
7 Heidi Paris 1994 Blue France Student Photography NaN 163 2 0
8 Ivan New York 1980 Green USA Architect Cycling Cat 178 1 3
9 Judy Tokyo 1989 Red Japan Designer Gaming Hamster 166 0 1
10 Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0
11 Liam London 1986 Yellow UK Nurse Running Dog 172 1 2
12 Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0
13 Noah Paris 1982 Blue France Writer Traveling Fish 176 1 1
14 Olivia Tokyo 1990 Indigo Japan Economist Investing Bird 164 2 0
15 Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
17 Rahul Mumbai 1981 Orange India Engineer Meditation Dog 179 3 1
18 Mei Shanghai 1990 Yellow China Programmer Calligraphy Cat 162 1 0
19 Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2
20 Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0
21 Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0
22 Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1

Utility functions to get an overview of the data in the dataframe#

df.head(3)  # default is the first 5 rows if no argument is provided
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
0 Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0
1 Bob London 1985 Green UK Doctor Reading Cat 180 2 1
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
df.tail(8) # default is the last 5 rows if no argument is provided
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
15 Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
17 Rahul Mumbai 1981 Orange India Engineer Meditation Dog 179 3 1
18 Mei Shanghai 1990 Yellow China Programmer Calligraphy Cat 162 1 0
19 Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2
20 Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0
21 Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0
22 Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1
df.columns   # an iterable Index object containing all of the column names
Index(['Name', 'City', 'Birth Year', 'Favorite Color', 'Country of Birth',
       'Profession', 'Hobbies', 'Pet', 'Height (cm)', 'Number of Siblings',
       'Number of Children'],
      dtype='object')
for c in df.columns:
    print(c)
Name
City
Birth Year
Favorite Color
Country of Birth
Profession
Hobbies
Pet
Height (cm)
Number of Siblings
Number of Children
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Name                23 non-null     object
 1   City                23 non-null     object
 2   Birth Year          23 non-null     int64 
 3   Favorite Color      23 non-null     object
 4   Country of Birth    23 non-null     object
 5   Profession          23 non-null     object
 6   Hobbies             23 non-null     object
 7   Pet                 16 non-null     object
 8   Height (cm)         23 non-null     int64 
 9   Number of Siblings  23 non-null     int64 
 10  Number of Children  23 non-null     int64 
dtypes: int64(4), object(7)
memory usage: 2.1+ KB
df.shape   # Shows number of rows and columns
(23, 11)
len(df)   # Shows number of rows
23

Broadly speaking, there are two types of data you may have in a Pandas dataframe: numerical data and categorical data. When getting to know a dataset it is usually important to understand which columns fall into which category and to get a quick overview of the range of values each contains.

Numerical data consists of numbers, both floating point and integer values. Quick summary statistics can be obtained using the describe() method, either on the whole dataframe or on a single column:

df.describe()  # Provides statistical data for numeric columns
Birth Year Height (cm) Number of Siblings Number of Children
count 23.000000 23.000000 23.000000 23.000000
mean 1988.130435 172.739130 1.086957 0.826087
std 4.256592 7.788204 0.949308 0.936734
min 1980.000000 160.000000 0.000000 0.000000
25% 1985.500000 166.500000 0.000000 0.000000
50% 1989.000000 172.000000 1.000000 1.000000
75% 1991.500000 178.500000 2.000000 1.500000
max 1995.000000 187.000000 3.000000 3.000000
df['Height (cm)'].describe()  # Provides statistical data for a single numeric column
count     23.000000
mean     172.739130
std        7.788204
min      160.000000
25%      166.500000
50%      172.000000
75%      178.500000
max      187.000000
Name: Height (cm), dtype: float64
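
Individual statistics are also available directly as methods on a column; a quick sketch (the values match the describe() output above):

df['Height (cm)'].mean()   # 172.739...
df['Height (cm)'].max()    # 187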

Numerical data is often explored using visualizations/plots. We discuss some methods for doing that with Pandas below, but a more complete explanation can be found in the Matplotlib section.

Categorical data usually consists of a set of unique values. These unique values may come from a known finite set (e.g. the two-letter abbreviations for the 50 states in the US) or may in principle be unbounded (e.g. names), although within a given dataset even values that are unbounded in principle will of course form a finite set of unique values.

With the unique() method you get the set of values present in the dataset for a categorical variable:

df['Favorite Color'].unique()
array(['Blue', 'Green', 'Red', 'Indigo', 'Violet', 'Orange', 'Yellow'],
      dtype=object)

That set of values (returned as a NumPy array) can be stored in a variable and used as part of the analysis (avoid naming the variable list, since that would shadow the built-in type):

colors = df['Favorite Color'].unique()
print(colors)
['Blue' 'Green' 'Red' 'Indigo' 'Violet' 'Orange' 'Yellow']

You can also see how many times each element of the categorical set appears in the data:

df['Favorite Color'].value_counts()
Favorite Color
Blue      4
Green     4
Red       3
Indigo    3
Violet    3
Orange    3
Yellow    3
Name: count, dtype: int64
d = df['Favorite Color'].value_counts()
dict(d)
{'Blue': np.int64(4),
 'Green': np.int64(4),
 'Red': np.int64(3),
 'Indigo': np.int64(3),
 'Violet': np.int64(3),
 'Orange': np.int64(3),
 'Yellow': np.int64(3)}

One can also identify entries where no value (NaN) is present. This is perfectly normal in many cases (e.g. some people don’t have pets), but in other cases it may indicate missing data.

df['Pet'].unique()
array(['Dog', 'Cat', 'Fish', 'Bird', nan, 'Rabbit', 'Hamster'],
      dtype=object)
df['Pet'].value_counts()  # counts for non-NaN values
Pet
Dog        5
Cat        4
Fish       2
Bird       2
Rabbit     2
Hamster    1
Name: count, dtype: int64
df['Pet'].isna().sum()  # Get number of NaN values in the column
np.int64(7)
print(f"There are {df['Pet'].value_counts().sum()} people with pets out of {len(df)} total people, thus {df['Pet'].isna().sum()} have no pet.")
There are 16 people with pets out of 23 total people, thus 7 have no pet.

And you can dig in further to see which rows have no pet:

df['Pet'].isna() # Returns True or False for each row if the value is NaN
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19     True
20     True
21    False
22     True
Name: Pet, dtype: bool
df[df['Pet'].isna()]
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
4 Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0
7 Heidi Paris 1994 Blue France Student Photography NaN 163 2 0
10 Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0
15 Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2
19 Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2
20 Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0
22 Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1
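
Depending on the analysis, rows with missing values can be dropped or the gaps filled with a placeholder. A quick sketch (the variable names are just for illustration):

df_with_pets = df.dropna(subset=['Pet'])     # keep only the rows that have a pet
df_pet_filled = df.fillna({'Pet': 'None'})   # or fill the missing pet entries with a placeholder value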

Subsets of a Dataframe#

# Select columns by name
selected_columns = ['Name','Birth Year','Country of Birth']
df[selected_columns].head()
Name Birth Year Country of Birth
0 Alice 1990 USA
1 Bob 1985 UK
2 Charlie 1992 France
3 David 1988 Japan
4 Eve 1991 USA
# Select column names numerically with slicing syntax
df.columns[0:5]
Index(['Name', 'City', 'Birth Year', 'Favorite Color', 'Country of Birth'], dtype='object')
# You can then use that to select dataframe columns 
df[df.columns[0:5]].head()
Name City Birth Year Favorite Color Country of Birth
0 Alice New York 1990 Blue USA
1 Bob London 1985 Green UK
2 Charlie Paris 1992 Red France
3 David Tokyo 1988 Indigo Japan
4 Eve New York 1991 Violet USA
df[0:3]  # selects rows with slicing syntax
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
0 Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0
1 Bob London 1985 Green UK Doctor Reading Cat 180 2 1
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
# Conditions return a series of Boolean values
df['Birth Year']>1990
0     False
1     False
2      True
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10     True
11    False
12     True
13    False
14    False
15    False
16     True
17    False
18    False
19    False
20    False
21     True
22    False
Name: Birth Year, dtype: bool
# Select based on a condition (the series from the conditional)
df[df['Birth Year']>1990]
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
4 Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0
7 Heidi Paris 1994 Blue France Student Photography NaN 163 2 0
10 Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0
12 Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
21 Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0

One can also select based on multiple conditions. Be careful to use & (the element-wise “and”) rather than the Python keyword and, and remember to put parentheses around each condition.

# Select based on multiple conditions
df[(df['Birth Year']>1990) & (df['Favorite Color']=='Red')]
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
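
Similarly, | is the element-wise “or”, ~ negates a condition, and .isin() matches against a set of values. A quick sketch (these are not run above):

df[(df['City']=='London') | (df['City']=='Paris')]     # rows matching either condition
df[~(df['Birth Year']>1990)]                           # negation: born in 1990 or earlier
df[df['Country of Birth'].isin(['India', 'China'])]    # membership in a set of values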

Accessing specific dataframe parts - iloc and loc#

Sometimes it is necessary to access specific rows, columns and individual entries. This can be done using the iloc (integer position based) and loc (label based) indexers.

df.iloc[12:15]  # Gets rows from 12 up to (but not including) 15
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
12 Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0
13 Noah Paris 1982 Blue France Writer Traveling Fish 176 1 1
14 Olivia Tokyo 1990 Indigo Japan Economist Investing Bird 164 2 0
df.iloc[12:15,0:4]  # Gets rows from 12 up to (but not including) 15 and columns 0 up to (but not including) 4 (the index does not count as a column)
Name City Birth Year Favorite Color
12 Mia New York 1995 Green
13 Noah Paris 1982 Blue
14 Olivia Tokyo 1990 Indigo

In this example dataframe the index (the unnamed leftmost column) is just the default numeric RangeIndex. If you have a column with unique values, for example names, you can set that column to be the index instead. This makes it possible to use the loc indexer to access rows by label rather than by position. For example:

df = df.set_index('Name') # Change the index to be the "Name" column
df.loc['Olivia','Favorite Color']
'Indigo'
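
A few more loc patterns, as a sketch (note that label-based slices include both endpoints):

df.loc['Olivia']                                         # the full row for Olivia
df.loc['Bob':'David', 'City']                            # rows Bob through David (inclusive), City column only
df.loc[df['Birth Year']>1990, ['City', 'Profession']]    # Boolean selection combined with column selection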

Plotting data from a Pandas dataframe#

# Dataframe columns can be passed directly to Matplotlib (run each plot in its own cell):
plt.scatter(df['Birth Year'], df['Number of Children'])   # scatter plot of two numeric columns
plt.hist(df['Birth Year'])                                 # histogram of a numeric column
df['Favorite Color'].value_counts().plot(kind='bar')      # a bar chart fits a categorical column better than a histogram
# Add a new column
df['Height (m)'] = df['Height (cm)'] / 100
df
City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children Height (m)
Name
Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0 1.65
Bob London 1985 Green UK Doctor Reading Cat 180 2 1 1.80
Charlie Paris 1992 Red France Artist Painting Fish 170 0 0 1.70
David Tokyo 1988 Indigo Japan Chef Cooking Bird 175 1 2 1.75
Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0 1.60
Frank Berlin 1983 Orange Germany Musician Playing Guitar Dog 185 1 1 1.85
Grace London 1987 Yellow UK Journalist Writing Rabbit 168 0 2 1.68
Heidi Paris 1994 Blue France Student Photography NaN 163 2 0 1.63
Ivan New York 1980 Green USA Architect Cycling Cat 178 1 3 1.78
Judy Tokyo 1989 Red Japan Designer Gaming Hamster 166 0 1 1.66
Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0 1.82
Liam London 1986 Yellow UK Nurse Running Dog 172 1 2 1.72
Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0 1.69
Noah Paris 1982 Blue France Writer Traveling Fish 176 1 1 1.76
Olivia Tokyo 1990 Indigo Japan Economist Investing Bird 164 2 0 1.64
Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2 1.87
Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0 1.71
Rahul Mumbai 1981 Orange India Engineer Meditation Dog 179 3 1 1.79
Mei Shanghai 1990 Yellow China Programmer Calligraphy Cat 162 1 0 1.62
Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2 1.83
Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0 1.67
Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0 1.74
Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1 1.77
# Make a copy of the dataframe with only selected columns -and- rename some columns
# (since 'Name' was set as the index above, reset it back to a regular column first)
columns = ['Name','City','Height (m)', 'Birth Year']
df2 = df.reset_index()[columns].copy().rename(columns={'Birth Year':'Year','Height (m)':'Height'})
df2.head(3)
Name City Height Year
0 Alice New York 1.65 1990
1 Bob London 1.80 1985
2 Charlie Paris 1.70 1992
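
With the renamed columns in place, df2 can be used for plotting; a minimal sketch (the axis labels are my own choice):

plt.scatter(df2['Year'], df2['Height'])
plt.xlabel('Birth Year')
plt.ylabel('Height (m)')
plt.show()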

Creating and Modifying Dataframes#

Most of the time you will load data from files in various formats, but you can also create a dataframe directly from a dictionary of lists:

data = {'Name':['Jon', 'Janie', 'Jack', 'Jasmine'], 
        'Age': [32, 64, 96, 128], 
        'Job': ['Tinker', 'Tailor', 'Soldier', 'Spy'], 
        'Shoes': [2,5,10,15], 
        'City': ['Mumbai', 'Dallas', 'Kuala Lumpur', 'Buenos Aires']}
df = pd.DataFrame(data)
df
Name Age Job Shoes City
0 Jon 32 Tinker 2 Mumbai
1 Janie 64 Tailor 5 Dallas
2 Jack 96 Soldier 10 Kuala Lumpur
3 Jasmine 128 Spy 15 Buenos Aires

And you can add columns. These can be calculated from existing columns or, in principle, derived from other data sources (e.g. if you had some other lookup table or dataframe with related data):

df['Shoes Per Year'] = df['Shoes']/df['Age']
df
Name Age Job Shoes City Shoes Per Year
0 Jon 32 Tinker 2 Mumbai 0.062500
1 Janie 64 Tailor 5 Dallas 0.078125
2 Jack 96 Soldier 10 Kuala Lumpur 0.104167
3 Jasmine 128 Spy 15 Buenos Aires 0.117188

Here, to add a new column, we use an “external” data source and the map() method, which does an element-wise lookup in the provided dictionary:

# For example, maybe some other data source provides population (in millions) for cities:
city_populations = {
    "Berlin": 3.6,
    "Dallas": 1.3,
    "Kuala Lumpur": 1.8,
    "Tokyo": 37.4,
    "Mumbai": 20.7,
    "Nairobi": 4.4,
    "São Paulo": 22.4,
    "Buenos Aires": 15.5
}
df['Population'] = df['City'].map(city_populations)
df
Name Age Job Shoes City Shoes Per Year Population
0 Jon 32 Tinker 2 Mumbai 0.062500 20.7
1 Janie 64 Tailor 5 Dallas 0.078125 1.3
2 Jack 96 Soldier 10 Kuala Lumpur 0.104167 1.8
3 Jasmine 128 Spy 15 Buenos Aires 0.117188 15.5
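
If a city were missing from the dictionary, map() would return NaN for that row; such gaps can be filled in afterwards. A sketch (the default of 0 is an arbitrary choice):

df['Population'] = df['City'].map(city_populations).fillna(0)   # assumed default for unknown cities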

How to understand a dataset you load in Pandas#

If you load a new (to you) dataset in Pandas, typically the useful things to do are (a sketch that puts several of these steps together follows the list):

  • Use df.head() to get a quick sense of what is there

  • Understand what the rows are

  • Understand what columns are available (by column names or by separate documentation)

    • Attributes like df.columns and df.dtypes

    • Understand which columns are numerical data and which are categorical data

  • Drop any columns you do not need

  • Understand the range of data in the different columns

    • Use df.describe() and plotting for numerical data

    • Use .value_counts() and .unique() for categorical data

  • Understand if the data is complete or if some elements are missing

    • Use methods like df['foo'].value_counts() and df.isna().sum()

    • Consider dropping or recategorizing rows with special or missing data (though be careful, as sometimes those are the interesting cases!)

  • Make plots of numerical data, both individual columns as well as correlations between columns, to better understand the basic distributions, ranges, etc.
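
A minimal sketch that strings several of these steps together (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv('new_dataset.csv')                # hypothetical file
df.head()                                          # quick look at the first rows
df.info()                                          # column names, dtypes, non-null counts
df = df.drop(columns=['unneeded_column'])          # hypothetical column you do not need
df.describe()                                      # ranges of the numerical columns
df['some_category'].value_counts(dropna=False)     # categorical counts, including missing values
df.isna().sum()                                    # number of missing values per column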