Pandas#

Pandas allows you to manipulate tabular data (i.e. data organized as rows and columns) in a data structure known as a dataframe. While you may at times need to create such a structure yourself, very often you will read the data for the dataframe in from a file such as a CSV file. (Many other file types are supported as well, such as Parquet, Excel, and JSON.)
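
Readers for those other formats follow the same pattern as pd.read_csv used below. A quick sketch (the file names are hypothetical; Parquet additionally requires the pyarrow or fastparquet package, and Excel files require openpyxl):

import pandas as pd

df_parquet = pd.read_parquet('assets/example.parquet')  # hypothetical Parquet file
df_excel = pd.read_excel('assets/example.xlsx')         # hypothetical Excel file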

import pandas as pd
import matplotlib.pyplot as plt

Read in a dataframe and display it#

df = pd.read_csv('assets/pandas-example.csv')
df
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
0 Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0
1 Bob London 1985 Green UK Doctor Reading Cat 180 2 1
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
3 David Tokyo 1988 Indigo Japan Chef Cooking Bird 175 1 2
4 Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0
5 Frank Berlin 1983 Orange Germany Musician Playing Guitar Dog 185 1 1
6 Grace London 1987 Yellow UK Journalist Writing Rabbit 168 0 2
7 Heidi Paris 1994 Blue France Student Photography NaN 163 2 0
8 Ivan New York 1980 Green USA Architect Cycling Cat 178 1 3
9 Judy Tokyo 1989 Red Japan Designer Gaming Hamster 166 0 1
10 Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0
11 Liam London 1986 Yellow UK Nurse Running Dog 172 1 2
12 Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0
13 Noah Paris 1982 Blue France Writer Traveling Fish 176 1 1
14 Olivia Tokyo 1990 Indigo Japan Economist Investing Bird 164 2 0
15 Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
17 Rahul Mumbai 1981 Orange India Engineer Meditation Dog 179 3 1
18 Mei Shanghai 1990 Yellow China Programmer Calligraphy Cat 162 1 0
19 Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2
20 Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0
21 Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0
22 Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1

Utility functions to get an overview of the data in the dataframe#

df.head(3)  # default is the first 5 rows if no argument is provided
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
0 Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0
1 Bob London 1985 Green UK Doctor Reading Cat 180 2 1
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
df.tail(8) # default is the last 5 rows if no argument is provided
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
15 Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
17 Rahul Mumbai 1981 Orange India Engineer Meditation Dog 179 3 1
18 Mei Shanghai 1990 Yellow China Programmer Calligraphy Cat 162 1 0
19 Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2
20 Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0
21 Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0
22 Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1
df.columns   # an iterable Index object containing all of the column names
Index(['Name', 'City', 'Birth Year', 'Favorite Color', 'Country of Birth',
       'Profession', 'Hobbies', 'Pet', 'Height (cm)', 'Number of Siblings',
       'Number of Children'],
      dtype='object')
for c in df.columns:
    print(c)
Name
City
Birth Year
Favorite Color
Country of Birth
Profession
Hobbies
Pet
Height (cm)
Number of Siblings
Number of Children
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Name                23 non-null     object
 1   City                23 non-null     object
 2   Birth Year          23 non-null     int64 
 3   Favorite Color      23 non-null     object
 4   Country of Birth    23 non-null     object
 5   Profession          23 non-null     object
 6   Hobbies             23 non-null     object
 7   Pet                 16 non-null     object
 8   Height (cm)         23 non-null     int64 
 9   Number of Siblings  23 non-null     int64 
 10  Number of Children  23 non-null     int64 
dtypes: int64(4), object(7)
memory usage: 2.1+ KB
df.shape   # Shows number of rows and columns
(23, 11)
len(df)   # Shows number of rows
23

Broadly speaking, there are two types of data you may have in a Pandas dataframe: numerical data and categorical data. When getting to know a dataset it is usually important to understand which columns fall into which category and to get a quick overview of the range of values each contains.

Numerical data consists of numbers, both floating point and integer values. Quick summary statistics can be obtained using the describe() method, either on the whole dataframe or on a single column:

df.describe()  # Provides statistical data for numeric columns
Birth Year Height (cm) Number of Siblings Number of Children
count 23.000000 23.000000 23.000000 23.000000
mean 1988.130435 172.739130 1.086957 0.826087
std 4.256592 7.788204 0.949308 0.936734
min 1980.000000 160.000000 0.000000 0.000000
25% 1985.500000 166.500000 0.000000 0.000000
50% 1989.000000 172.000000 1.000000 1.000000
75% 1991.500000 178.500000 2.000000 1.500000
max 1995.000000 187.000000 3.000000 3.000000
df['Height (cm)'].describe()  # Provides statistical data for a single numeric column
count     23.000000
mean     172.739130
std        7.788204
min      160.000000
25%      166.500000
50%      172.000000
75%      178.500000
max      187.000000
Name: Height (cm), dtype: float64
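
Individual statistics are also available directly as methods on a column; a quick sketch (the values match the describe() output above):

df['Height (cm)'].mean()   # 172.739...
df['Height (cm)'].max()    # 187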

Numerical data is often explored using visualizations/plots. We discuss some methods for doing that with Pandas below, but a more complete explanation can be found in the Matplotlib section.

Categorical data usually consists of a set of unique values. These unique values may come from a known finite set (e.g. the two-letter abbreviations for the 50 states in the US) or may in principle be unbounded (e.g. names), although within a given dataset even values that are unbounded in principle will of course form a finite set of unique values.

With the unique() method you get the set of values present in the dataset for a categorical variable:

df['Favorite Color'].unique()
array(['Blue', 'Green', 'Red', 'Indigo', 'Violet', 'Orange', 'Yellow'],
      dtype=object)

That set of values (returned as a NumPy array) can be stored in a variable and used as part of the analysis (avoid naming the variable list, since that would shadow the built-in type):

colors = df['Favorite Color'].unique()
print(colors)
['Blue' 'Green' 'Red' 'Indigo' 'Violet' 'Orange' 'Yellow']

You can also see how many times each element of the categorical set appears in the data:

df['Favorite Color'].value_counts()
Favorite Color
Blue      4
Green     4
Red       3
Indigo    3
Violet    3
Orange    3
Yellow    3
Name: count, dtype: int64
d = df['Favorite Color'].value_counts()
dict(d)
{'Blue': np.int64(4),
 'Green': np.int64(4),
 'Red': np.int64(3),
 'Indigo': np.int64(3),
 'Violet': np.int64(3),
 'Orange': np.int64(3),
 'Yellow': np.int64(3)}

One can also identify entries where no value (NaN) is present. This is perfectly normal in many cases (e.g. some people don’t have pets), but in other cases it may indicate missing data.

df['Pet'].unique()
array(['Dog', 'Cat', 'Fish', 'Bird', nan, 'Rabbit', 'Hamster'],
      dtype=object)
df['Pet'].value_counts()  # counts for non-NaN values
Pet
Dog        5
Cat        4
Fish       2
Bird       2
Rabbit     2
Hamster    1
Name: count, dtype: int64
df['Pet'].isna().sum()  # Get number of NaN values in the column
np.int64(7)
print(f"There are {df['Pet'].value_counts().sum()} people with pets out of {len(df)} total people, thus {df['Pet'].isna().sum()} have no pet.")
There are 16 people with pets out of 23 total people, thus 7 have no pet.

And you can dig in further to see which rows have no pet:

df['Pet'].isna() # Returns True or False for each row if the value is NaN
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19     True
20     True
21    False
22     True
Name: Pet, dtype: bool
df[df['Pet'].isna()]
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
4 Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0
7 Heidi Paris 1994 Blue France Student Photography NaN 163 2 0
10 Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0
15 Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2
19 Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2
20 Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0
22 Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1
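
Depending on the analysis, rows with missing values can be dropped or the gaps filled with a placeholder. A quick sketch (the variable names are just for illustration):

df_with_pets = df.dropna(subset=['Pet'])     # keep only the rows that have a pet
df_pet_filled = df.fillna({'Pet': 'None'})   # or fill the missing pet entries with a placeholder value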

Subsets of a Dataframe#

# Select columns by name
selected_columns = ['Name','Birth Year','Country of Birth']
df[selected_columns].head()
Name Birth Year Country of Birth
0 Alice 1990 USA
1 Bob 1985 UK
2 Charlie 1992 France
3 David 1988 Japan
4 Eve 1991 USA
# Select column names numerically with slicing syntax
df.columns[0:5]
Index(['Name', 'City', 'Birth Year', 'Favorite Color', 'Country of Birth'], dtype='object')
# You can then use that to select dataframe columns 
df[df.columns[0:5]].head()
Name City Birth Year Favorite Color Country of Birth
0 Alice New York 1990 Blue USA
1 Bob London 1985 Green UK
2 Charlie Paris 1992 Red France
3 David Tokyo 1988 Indigo Japan
4 Eve New York 1991 Violet USA
df[0:3]  # selects rows with slicing syntax
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
0 Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0
1 Bob London 1985 Green UK Doctor Reading Cat 180 2 1
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
# Conditions return a series of Boolean values
df['Birth Year']>1990
0     False
1     False
2      True
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10     True
11    False
12     True
13    False
14    False
15    False
16     True
17    False
18    False
19    False
20    False
21     True
22    False
Name: Birth Year, dtype: bool
# Select based on a condition (the series from the conditional)
df[df['Birth Year']>1990]
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
4 Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0
7 Heidi Paris 1994 Blue France Student Photography NaN 163 2 0
10 Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0
12 Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
21 Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0

One can also select based on multiple conditions. Be careful to use & (the element-wise “and”) rather than the Python keyword and, and remember to put parentheses around each condition.

# Select based on multiple conditions
df[(df['Birth Year']>1990) & (df['Favorite Color']=='Red')]
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
2 Charlie Paris 1992 Red France Artist Painting Fish 170 0 0
16 Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0
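
Similarly, | is the element-wise “or”, ~ negates a condition, and .isin() matches against a set of values. A quick sketch (these are not run above):

df[(df['City']=='London') | (df['City']=='Paris')]     # rows matching either condition
df[~(df['Birth Year']>1990)]                           # negation: born in 1990 or earlier
df[df['Country of Birth'].isin(['India', 'China'])]    # membership in a set of values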

Accessing specific dataframe parts - iloc and loc#

Sometimes it is necessary to access specific rows, columns and individual entries. This can be done using the iloc (integer position based) and loc (label based) indexers.

df.iloc[12:15]  # Gets rows from 12 up to (but not including) 15
Name City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children
12 Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0
13 Noah Paris 1982 Blue France Writer Traveling Fish 176 1 1
14 Olivia Tokyo 1990 Indigo Japan Economist Investing Bird 164 2 0
df.iloc[12:15,0:4]  # Gets rows from 12 up to (but not including) 15 and columns 0 up to (but not including) 4 (the index does not count as a column)
Name City Birth Year Favorite Color
12 Mia New York 1995 Green
13 Noah Paris 1982 Blue
14 Olivia Tokyo 1990 Indigo

In this example dataframe the index (the unnamed leftmost column) is just the default numeric RangeIndex. If you have a column with unique values, for example names, you can set that column to be the index instead. This makes it possible to use the loc indexer to access rows by label rather than by position. For example:

df = df.set_index('Name') # Change the index to be the "Name" column
df.loc['Olivia','Favorite Color']
'Indigo'
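
A few more loc patterns, as a sketch (note that label-based slices include both endpoints):

df.loc['Olivia']                                         # the full row for Olivia
df.loc['Bob':'David', 'City']                            # rows Bob through David (inclusive), City column only
df.loc[df['Birth Year']>1990, ['City', 'Profession']]    # Boolean selection combined with column selection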

Plotting data from a Pandas dataframe#

# Dataframe columns can be passed directly to Matplotlib (run each plot in its own cell):
plt.scatter(df['Birth Year'], df['Number of Children'])   # scatter plot of two numeric columns
plt.hist(df['Birth Year'])                                 # histogram of a numeric column
df['Favorite Color'].value_counts().plot(kind='bar')      # a bar chart fits a categorical column better than a histogram
# Add a new column
df['Height (m)'] = df['Height (cm)'] / 100
df
City Birth Year Favorite Color Country of Birth Profession Hobbies Pet Height (cm) Number of Siblings Number of Children Height (m)
Name
Alice New York 1990 Blue USA Software Engineer Hiking Dog 165 1 0 1.65
Bob London 1985 Green UK Doctor Reading Cat 180 2 1 1.80
Charlie Paris 1992 Red France Artist Painting Fish 170 0 0 1.70
David Tokyo 1988 Indigo Japan Chef Cooking Bird 175 1 2 1.75
Eve New York 1991 Violet USA Teacher Gardening NaN 160 3 0 1.60
Frank Berlin 1983 Orange Germany Musician Playing Guitar Dog 185 1 1 1.85
Grace London 1987 Yellow UK Journalist Writing Rabbit 168 0 2 1.68
Heidi Paris 1994 Blue France Student Photography NaN 163 2 0 1.63
Ivan New York 1980 Green USA Architect Cycling Cat 178 1 3 1.78
Judy Tokyo 1989 Red Japan Designer Gaming Hamster 166 0 1 1.66
Kevin Berlin 1993 Orange Germany Mechanic Fixing Cars NaN 182 2 0 1.82
Liam London 1986 Yellow UK Nurse Running Dog 172 1 2 1.72
Mia New York 1995 Green USA Dancer Dancing Cat 169 0 0 1.69
Noah Paris 1982 Blue France Writer Traveling Fish 176 1 1 1.76
Olivia Tokyo 1990 Indigo Japan Economist Investing Bird 164 2 0 1.64
Peter Berlin 1984 Violet Germany Electrician Woodworking NaN 187 0 2 1.87
Quinn London 1992 Red UK Veterinarian Animal Care Rabbit 171 1 0 1.71
Rahul Mumbai 1981 Orange India Engineer Meditation Dog 179 3 1 1.79
Mei Shanghai 1990 Yellow China Programmer Calligraphy Cat 162 1 0 1.62
Sven Stockholm 1987 Green Sweden Fisherman Sailing NaN 183 0 2 1.83
Priya Mumbai 1989 Blue India Accountant Yoga NaN 167 2 0 1.67
Arjun Mumbai 1993 Indigo India Marketing Manager Cricket Dog 174 1 0 1.74
Giuseppe Rome 1986 Violet Italy Waiter Cooking NaN 177 0 1 1.77
# Make a copy of the dataframe with only selected columns -and- rename some columns
# (since 'Name' was set as the index above, reset it back to a regular column first)
columns = ['Name','City','Height (m)', 'Birth Year']
df2 = df.reset_index()[columns].copy().rename(columns={'Birth Year':'Year','Height (m)':'Height'})
df2.head(3)
Name City Height Year
0 Alice New York 1.65 1990
1 Bob London 1.80 1985
2 Charlie Paris 1.70 1992
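
With the renamed columns in place, df2 can be used for plotting; a minimal sketch (the axis labels are my own choice):

plt.scatter(df2['Year'], df2['Height'])
plt.xlabel('Birth Year')
plt.ylabel('Height (m)')
plt.show()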

Creating and Modifying Dataframes#

Most of the time you will load data from files in various formats, but you can also create a dataframe directly from a dictionary of lists:

data = {'Name':['Jon', 'Janie', 'Jack', 'Jasmine'], 
        'Age': [32, 64, 96, 128], 
        'Job': ['Tinker', 'Tailor', 'Soldier', 'Spy'], 
        'Shoes': [2,5,10,15], 
        'City': ['Mumbai', 'Dallas', 'Kuala Lumpur', 'Buenos Aires']}
df = pd.DataFrame(data)
df
Name Age Job Shoes City
0 Jon 32 Tinker 2 Mumbai
1 Janie 64 Tailor 5 Dallas
2 Jack 96 Soldier 10 Kuala Lumpur
3 Jasmine 128 Spy 15 Buenos Aires

And you can add columns. These can be calculated from existing columns or, in principle, derived from other data sources (e.g. if you had some other lookup table or dataframe with related data):

df['Shoes Per Year'] = df['Shoes']/df['Age']
df
Name Age Job Shoes City Shoes Per Year
0 Jon 32 Tinker 2 Mumbai 0.062500
1 Janie 64 Tailor 5 Dallas 0.078125
2 Jack 96 Soldier 10 Kuala Lumpur 0.104167
3 Jasmine 128 Spy 15 Buenos Aires 0.117188

Here, to add a new column, we use an “external” data source and the map() method, which does an element-wise lookup in the provided dictionary:

# For example, maybe some other data source provides population (in millions) for cities:
city_populations = {
    "Berlin": 3.6,
    "Dallas": 1.3,
    "Kuala Lumpur": 1.8,
    "Tokyo": 37.4,
    "Mumbai": 20.7,
    "Nairobi": 4.4,
    "São Paulo": 22.4,
    "Buenos Aires": 15.5
}
df['Population'] = df['City'].map(city_populations)
df
Name Age Job Shoes City Shoes Per Year Population
0 Jon 32 Tinker 2 Mumbai 0.062500 20.7
1 Janie 64 Tailor 5 Dallas 0.078125 1.3
2 Jack 96 Soldier 10 Kuala Lumpur 0.104167 1.8
3 Jasmine 128 Spy 15 Buenos Aires 0.117188 15.5
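
If a city were missing from the dictionary, map() would return NaN for that row; such gaps can be filled in afterwards. A sketch (the default of 0 is an arbitrary choice):

df['Population'] = df['City'].map(city_populations).fillna(0)   # assumed default for unknown cities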

How to understand a dataset you load in Pandas#

If you load a new (to you) dataset in Pandas, typically the useful things to do are (a sketch that puts several of these steps together follows the list):

  • Use df.head() to get a quick sense of what is there

  • Understand what the rows are

  • Understand what columns are available (by column names or by separate documentation)

    • Attributes like df.columns and df.dtypes

    • Understand which columns are numerical data and which are categorical data

  • Drop any columns you do not need

  • Understand the range of data in the different columns

    • Use df.describe() and plotting for numerical data

    • Use .value_counts() and .unique() for categorical data

  • Understand if the data is complete or if some elements are missing

    • Use methods like df['foo'].value_counts() and df.isna().sum()

    • Consider dropping or recategorizing rows with special or missing data (though be careful, as sometimes those are the interesting cases!)

  • Make plots of numerical data, both individual columns as well as correlations between columns, to better understand the basic distributions, ranges, etc.
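
A minimal sketch that strings several of these steps together (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv('new_dataset.csv')                # hypothetical file
df.head()                                          # quick look at the first rows
df.info()                                          # column names, dtypes, non-null counts
df = df.drop(columns=['unneeded_column'])          # hypothetical column you do not need
df.describe()                                      # ranges of the numerical columns
df['some_category'].value_counts(dropna=False)     # categorical counts, including missing values
df.isna().sum()                                    # number of missing values per column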