Pandas#
Pandas allows for the manipulation of tabular data (i.e. data organized as rows and columns) in a data structure known as a dataframe. While you may at times need to create such a structure yourself, very often you will read in the data for the dataframe from a file such as a CSV file. (Many other file formats are supported, such as Parquet, JSON, and Excel; a brief sketch appears after the first example below.)
import pandas as pd
import matplotlib.pyplot as plt
Read in a dataframe and display it#
df = pd.read_csv('assets/pandas-example.csv')
df
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alice | New York | 1990 | Blue | USA | Software Engineer | Hiking | Dog | 165 | 1 | 0 |
1 | Bob | London | 1985 | Green | UK | Doctor | Reading | Cat | 180 | 2 | 1 |
2 | Charlie | Paris | 1992 | Red | France | Artist | Painting | Fish | 170 | 0 | 0 |
3 | David | Tokyo | 1988 | Indigo | Japan | Chef | Cooking | Bird | 175 | 1 | 2 |
4 | Eve | New York | 1991 | Violet | USA | Teacher | Gardening | NaN | 160 | 3 | 0 |
5 | Frank | Berlin | 1983 | Orange | Germany | Musician | Playing Guitar | Dog | 185 | 1 | 1 |
6 | Grace | London | 1987 | Yellow | UK | Journalist | Writing | Rabbit | 168 | 0 | 2 |
7 | Heidi | Paris | 1994 | Blue | France | Student | Photography | NaN | 163 | 2 | 0 |
8 | Ivan | New York | 1980 | Green | USA | Architect | Cycling | Cat | 178 | 1 | 3 |
9 | Judy | Tokyo | 1989 | Red | Japan | Designer | Gaming | Hamster | 166 | 0 | 1 |
10 | Kevin | Berlin | 1993 | Orange | Germany | Mechanic | Fixing Cars | NaN | 182 | 2 | 0 |
11 | Liam | London | 1986 | Yellow | UK | Nurse | Running | Dog | 172 | 1 | 2 |
12 | Mia | New York | 1995 | Green | USA | Dancer | Dancing | Cat | 169 | 0 | 0 |
13 | Noah | Paris | 1982 | Blue | France | Writer | Traveling | Fish | 176 | 1 | 1 |
14 | Olivia | Tokyo | 1990 | Indigo | Japan | Economist | Investing | Bird | 164 | 2 | 0 |
15 | Peter | Berlin | 1984 | Violet | Germany | Electrician | Woodworking | NaN | 187 | 0 | 2 |
16 | Quinn | London | 1992 | Red | UK | Veterinarian | Animal Care | Rabbit | 171 | 1 | 0 |
17 | Rahul | Mumbai | 1981 | Orange | India | Engineer | Meditation | Dog | 179 | 3 | 1 |
18 | Mei | Shanghai | 1990 | Yellow | China | Programmer | Calligraphy | Cat | 162 | 1 | 0 |
19 | Sven | Stockholm | 1987 | Green | Sweden | Fisherman | Sailing | NaN | 183 | 0 | 2 |
20 | Priya | Mumbai | 1989 | Blue | India | Accountant | Yoga | NaN | 167 | 2 | 0 |
21 | Arjun | Mumbai | 1993 | Indigo | India | Marketing Manager | Cricket | Dog | 174 | 1 | 0 |
22 | Giuseppe | Rome | 1986 | Violet | Italy | Waiter | Cooking | NaN | 177 | 0 | 1 |
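Other file formats are read with analogous functions. A minimal sketch (the file names here are hypothetical, and the Parquet and Excel readers need an optional dependency such as pyarrow or openpyxl installed):
df_parquet = pd.read_parquet('assets/example.parquet')   # hypothetical Parquet file
df_excel = pd.read_excel('assets/example.xlsx')           # hypothetical Excel file
df_json = pd.read_json('assets/example.json')             # hypothetical JSON file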
Utility functions to get an overview of the data in the dataframe#
df.head(3) # default is the first 5 rows if no value is provided
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alice | New York | 1990 | Blue | USA | Software Engineer | Hiking | Dog | 165 | 1 | 0 |
1 | Bob | London | 1985 | Green | UK | Doctor | Reading | Cat | 180 | 2 | 1 |
2 | Charlie | Paris | 1992 | Red | France | Artist | Painting | Fish | 170 | 0 | 0 |
df.tail(8) # default is the last 5 rows if no argument is provided
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
15 | Peter | Berlin | 1984 | Violet | Germany | Electrician | Woodworking | NaN | 187 | 0 | 2 |
16 | Quinn | London | 1992 | Red | UK | Veterinarian | Animal Care | Rabbit | 171 | 1 | 0 |
17 | Rahul | Mumbai | 1981 | Orange | India | Engineer | Meditation | Dog | 179 | 3 | 1 |
18 | Mei | Shanghai | 1990 | Yellow | China | Programmer | Calligraphy | Cat | 162 | 1 | 0 |
19 | Sven | Stockholm | 1987 | Green | Sweden | Fisherman | Sailing | NaN | 183 | 0 | 2 |
20 | Priya | Mumbai | 1989 | Blue | India | Accountant | Yoga | NaN | 167 | 2 | 0 |
21 | Arjun | Mumbai | 1993 | Indigo | India | Marketing Manager | Cricket | Dog | 174 | 1 | 0 |
22 | Giuseppe | Rome | 1986 | Violet | Italy | Waiter | Cooking | NaN | 177 | 0 | 1 |
df.columns # an iterable Index object containing all of the column names
Index(['Name', 'City', 'Birth Year', 'Favorite Color', 'Country of Birth',
'Profession', 'Hobbies', 'Pet', 'Height (cm)', 'Number of Siblings',
'Number of Children'],
dtype='object')
for c in df.columns:
print(c)
Name
City
Birth Year
Favorite Color
Country of Birth
Profession
Hobbies
Pet
Height (cm)
Number of Siblings
Number of Children
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 23 non-null object
1 City 23 non-null object
2 Birth Year 23 non-null int64
3 Favorite Color 23 non-null object
4 Country of Birth 23 non-null object
5 Profession 23 non-null object
6 Hobbies 23 non-null object
7 Pet 16 non-null object
8 Height (cm) 23 non-null int64
9 Number of Siblings 23 non-null int64
10 Number of Children 23 non-null int64
dtypes: int64(4), object(7)
memory usage: 2.1+ KB
df.shape # Shows number of rows and columns
(23, 11)
len(df) # Shows number of rows
23
Broadly speaking there are two types of data you may have in a Pandas dataframe: numerical data and "categorical" data. When getting to know a dataset it is usually important to understand which columns fall into which category and to get a quick overview of the range of values each contains.
Numerical data contains numbers, of course, both floating-point and integer values. Quick statistics can be found using the describe() method, either on the whole dataframe or on a single column:
df.describe() # Provides statistical data for numeric columns
Birth Year | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|
count | 23.000000 | 23.000000 | 23.000000 | 23.000000 |
mean | 1988.130435 | 172.739130 | 1.086957 | 0.826087 |
std | 4.256592 | 7.788204 | 0.949308 | 0.936734 |
min | 1980.000000 | 160.000000 | 0.000000 | 0.000000 |
25% | 1985.500000 | 166.500000 | 0.000000 | 0.000000 |
50% | 1989.000000 | 172.000000 | 1.000000 | 1.000000 |
75% | 1991.500000 | 178.500000 | 2.000000 | 1.500000 |
max | 1995.000000 | 187.000000 | 3.000000 | 3.000000 |
df['Height (cm)'].describe() # Provides statistical data for a single numeric column
count 23.000000
mean 172.739130
std 7.788204
min 160.000000
25% 166.500000
50% 172.000000
75% 178.500000
max 187.000000
Name: Height (cm), dtype: float64
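Individual statistics are also available as methods on a single column (or on the whole dataframe), for example:
df['Height (cm)'].mean()    # average height
df['Height (cm)'].median()  # median height
df['Birth Year'].min()      # earliest birth year
df['Birth Year'].max()      # latest birth year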
Numerical data is often explored using visualizations/plots. We discuss some methods for that with Pandas below, but a more complete explanation can be found in the Matplotlib section.
Categorical data usually consists of a set of unique values. These unique values may be a known, finite set (e.g. the two-letter abbreviations for the 50 states in the US) or could in principle be unbounded (e.g. names), although within a given dataset even values that are unbounded in principle will of course form a finite set of unique values.
With the unique() method you get the set of values present in the dataset for a categorical variable:
df['Favorite Color'].unique()
array(['Blue', 'Green', 'Red', 'Indigo', 'Violet', 'Orange', 'Yellow'],
dtype=object)
That set can naturally be stored in a variable and used as part of the analysis:
colors = df['Favorite Color'].unique()
print(colors)
['Blue' 'Green' 'Red' 'Indigo' 'Violet' 'Orange' 'Yellow']
You can also get counts of how many times each element of the categorical set appears in the data:
df['Favorite Color'].value_counts()
Favorite Color
Blue 4
Green 4
Red 3
Indigo 3
Violet 3
Orange 3
Yellow 3
Name: count, dtype: int64
d = df['Favorite Color'].value_counts()
dict(d)
{'Blue': np.int64(4),
'Green': np.int64(4),
'Red': np.int64(3),
'Indigo': np.int64(3),
'Violet': np.int64(3),
'Orange': np.int64(3),
'Yellow': np.int64(3)}
One can also identify situations where no value (NaN) is provided. This is normal in many cases (e.g. some people don’t have pets), but in others may just represent missing data.
df['Pet'].unique()
array(['Dog', 'Cat', 'Fish', 'Bird', nan, 'Rabbit', 'Hamster'],
dtype=object)
df['Pet'].value_counts() # counts for non-NaN values
Pet
Dog 5
Cat 4
Fish 2
Bird 2
Rabbit 2
Hamster 1
Name: count, dtype: int64
df['Pet'].isna().sum() # Get number of NaN values in the column
np.int64(7)
print(f"There are {df['Pet'].value_counts().sum()} people with pets out of {len(df)} total people, thus {df['Pet'].isna().sum()} have no pet.")
There are 16 people with pets out of 23 total people, thus 7 have no pet.
And you can dig in further to see which rows have no pet:
df['Pet'].isna() # Returns True or False for each row if the value is NaN
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
9 False
10 True
11 False
12 False
13 False
14 False
15 True
16 False
17 False
18 False
19 True
20 True
21 False
22 True
Name: Pet, dtype: bool
df[df['Pet'].isna()]
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
4 | Eve | New York | 1991 | Violet | USA | Teacher | Gardening | NaN | 160 | 3 | 0 |
7 | Heidi | Paris | 1994 | Blue | France | Student | Photography | NaN | 163 | 2 | 0 |
10 | Kevin | Berlin | 1993 | Orange | Germany | Mechanic | Fixing Cars | NaN | 182 | 2 | 0 |
15 | Peter | Berlin | 1984 | Violet | Germany | Electrician | Woodworking | NaN | 187 | 0 | 2 |
19 | Sven | Stockholm | 1987 | Green | Sweden | Fisherman | Sailing | NaN | 183 | 0 | 2 |
20 | Priya | Mumbai | 1989 | Blue | India | Accountant | Yoga | NaN | 167 | 2 | 0 |
22 | Giuseppe | Rome | 1986 | Violet | Italy | Waiter | Cooking | NaN | 177 | 0 | 1 |
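Depending on the analysis, such missing values can either be filled in or the corresponding rows dropped. A minimal sketch (both calls return a new dataframe rather than modifying df in place):
df_filled = df.fillna({'Pet': 'None'})    # replace NaN in the Pet column with the string 'None'
df_has_pet = df.dropna(subset=['Pet'])    # keep only the rows where Pet is not NaN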
Subsets of a Dataframe#
# Select columns by name
selected_columns = ['Name','Birth Year','Country of Birth']
df[selected_columns].head()
Name | Birth Year | Country of Birth | |
---|---|---|---|
0 | Alice | 1990 | USA |
1 | Bob | 1985 | UK |
2 | Charlie | 1992 | France |
3 | David | 1988 | Japan |
4 | Eve | 1991 | USA |
# Select column names numerically with slicing syntax
df.columns[0:5]
Index(['Name', 'City', 'Birth Year', 'Favorite Color', 'Country of Birth'], dtype='object')
# You can then use that to select dataframe columns
df[df.columns[0:5]].head()
Name | City | Birth Year | Favorite Color | Country of Birth | |
---|---|---|---|---|---|
0 | Alice | New York | 1990 | Blue | USA |
1 | Bob | London | 1985 | Green | UK |
2 | Charlie | Paris | 1992 | Red | France |
3 | David | Tokyo | 1988 | Indigo | Japan |
4 | Eve | New York | 1991 | Violet | USA |
df[0:3] # selects rows with slicing syntax
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alice | New York | 1990 | Blue | USA | Software Engineer | Hiking | Dog | 165 | 1 | 0 |
1 | Bob | London | 1985 | Green | UK | Doctor | Reading | Cat | 180 | 2 | 1 |
2 | Charlie | Paris | 1992 | Red | France | Artist | Painting | Fish | 170 | 0 | 0 |
# Conditions return a series of Boolean values
df['Birth Year']>1990
0 False
1 False
2 True
3 False
4 True
5 False
6 False
7 True
8 False
9 False
10 True
11 False
12 True
13 False
14 False
15 False
16 True
17 False
18 False
19 False
20 False
21 True
22 False
Name: Birth Year, dtype: bool
# Select rows based on a condition (i.e. using the Boolean series from the comparison)
df[df['Birth Year']>1990]
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | Charlie | Paris | 1992 | Red | France | Artist | Painting | Fish | 170 | 0 | 0 |
4 | Eve | New York | 1991 | Violet | USA | Teacher | Gardening | NaN | 160 | 3 | 0 |
7 | Heidi | Paris | 1994 | Blue | France | Student | Photography | NaN | 163 | 2 | 0 |
10 | Kevin | Berlin | 1993 | Orange | Germany | Mechanic | Fixing Cars | NaN | 182 | 2 | 0 |
12 | Mia | New York | 1995 | Green | USA | Dancer | Dancing | Cat | 169 | 0 | 0 |
16 | Quinn | London | 1992 | Red | UK | Veterinarian | Animal Care | Rabbit | 171 | 1 | 0 |
21 | Arjun | Mumbai | 1993 | Indigo | India | Marketing Manager | Cricket | Dog | 174 | 1 | 0 |
One can also select based on multiple conditions. Be careful to use & (element-wise "and") and not the Python keyword and, and to wrap each condition in parentheses:
# Select based on two conditions combined with &
df[(df['Birth Year']>1990) & (df['Favorite Color']=='Red')]
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | Charlie | Paris | 1992 | Red | France | Artist | Painting | Fish | 170 | 0 | 0 |
16 | Quinn | London | 1992 | Red | UK | Veterinarian | Animal Care | Rabbit | 171 | 1 | 0 |
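The same parenthesized style works for the element-wise "or" (|) and "not" (~) operators, for example (a sketch):
df[(df['City'] == 'London') | (df['City'] == 'Paris')]   # people living in London or Paris
df[~(df['Country of Birth'] == 'USA')]                    # everyone not born in the USA
df[df['City'].isin(['London', 'Paris'])]                  # isin() is a convenient alternative to chained "or"s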
Accessing specific dataframe parts - iloc and loc#
Sometimes it is necessary to access specific rows, columns and entries. This can be done using the iloc (integer-position based) and loc (label based) accessors.
df.iloc[12:15] # Gets rows from 12 up to (but not including) 15
Name | City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | |
---|---|---|---|---|---|---|---|---|---|---|---|
12 | Mia | New York | 1995 | Green | USA | Dancer | Dancing | Cat | 169 | 0 | 0 |
13 | Noah | Paris | 1982 | Blue | France | Writer | Traveling | Fish | 176 | 1 | 1 |
14 | Olivia | Tokyo | 1990 | Indigo | Japan | Economist | Investing | Bird | 164 | 2 | 0 |
df.iloc[12:15,0:4] # Gets rows from 12 up to (but not including) 15 and columns 0 up to (but not including) 4; with iloc, column 0 is the first data column ('Name'), not the index
Name | City | Birth Year | Favorite Color | |
---|---|---|---|---|
12 | Mia | New York | 1995 | Green |
13 | Noah | Paris | 1982 | Blue |
14 | Olivia | Tokyo | 1990 | Indigo |
In this example dataframe, the index (the unlabeled leftmost column) is just the default numeric range. If you have a column with unique values, for example names, you can also set that to be the index. This then facilitates the use of the loc accessor to access entries by name rather than by number. For example:
df = df.set_index('Name') # Change the index to be the "Name" column
df.loc['Olivia','Favorite Color']
'Indigo'
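With the name-based index in place, loc can also be used to pull out whole rows, or several rows and columns at once, for example (a sketch):
df.loc['Olivia']                          # the full row for Olivia, as a Series
df.loc['Olivia', ['City', 'Profession']]  # several columns for one row
df.loc[['Bob', 'Mei'], 'Height (cm)']     # one column for several rows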
Plotting data from a Pandas dataframe#
plt.scatter(df['Birth Year'], df['Number of Children'])  # scatter plot of two numeric columns
plt.show()
plt.hist(df['Birth Year'])  # histogram of a numeric column
plt.show()
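A histogram is not well suited to categorical data such as Favorite Color (plt.hist expects numeric bins); a bar chart of the value counts, here using the pandas plotting wrapper around Matplotlib, usually works better:
df['Favorite Color'].value_counts().plot(kind='bar')   # counts per color as a bar chart
plt.show()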
# Add a new column
df['Height (m)'] = df['Height (cm)'] / 100
df
City | Birth Year | Favorite Color | Country of Birth | Profession | Hobbies | Pet | Height (cm) | Number of Siblings | Number of Children | Height (m) | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Alice | New York | 1990 | Blue | USA | Software Engineer | Hiking | Dog | 165 | 1 | 0 | 1.65 |
Bob | London | 1985 | Green | UK | Doctor | Reading | Cat | 180 | 2 | 1 | 1.80 |
Charlie | Paris | 1992 | Red | France | Artist | Painting | Fish | 170 | 0 | 0 | 1.70 |
David | Tokyo | 1988 | Indigo | Japan | Chef | Cooking | Bird | 175 | 1 | 2 | 1.75 |
Eve | New York | 1991 | Violet | USA | Teacher | Gardening | NaN | 160 | 3 | 0 | 1.60 |
Frank | Berlin | 1983 | Orange | Germany | Musician | Playing Guitar | Dog | 185 | 1 | 1 | 1.85 |
Grace | London | 1987 | Yellow | UK | Journalist | Writing | Rabbit | 168 | 0 | 2 | 1.68 |
Heidi | Paris | 1994 | Blue | France | Student | Photography | NaN | 163 | 2 | 0 | 1.63 |
Ivan | New York | 1980 | Green | USA | Architect | Cycling | Cat | 178 | 1 | 3 | 1.78 |
Judy | Tokyo | 1989 | Red | Japan | Designer | Gaming | Hamster | 166 | 0 | 1 | 1.66 |
Kevin | Berlin | 1993 | Orange | Germany | Mechanic | Fixing Cars | NaN | 182 | 2 | 0 | 1.82 |
Liam | London | 1986 | Yellow | UK | Nurse | Running | Dog | 172 | 1 | 2 | 1.72 |
Mia | New York | 1995 | Green | USA | Dancer | Dancing | Cat | 169 | 0 | 0 | 1.69 |
Noah | Paris | 1982 | Blue | France | Writer | Traveling | Fish | 176 | 1 | 1 | 1.76 |
Olivia | Tokyo | 1990 | Indigo | Japan | Economist | Investing | Bird | 164 | 2 | 0 | 1.64 |
Peter | Berlin | 1984 | Violet | Germany | Electrician | Woodworking | NaN | 187 | 0 | 2 | 1.87 |
Quinn | London | 1992 | Red | UK | Veterinarian | Animal Care | Rabbit | 171 | 1 | 0 | 1.71 |
Rahul | Mumbai | 1981 | Orange | India | Engineer | Meditation | Dog | 179 | 3 | 1 | 1.79 |
Mei | Shanghai | 1990 | Yellow | China | Programmer | Calligraphy | Cat | 162 | 1 | 0 | 1.62 |
Sven | Stockholm | 1987 | Green | Sweden | Fisherman | Sailing | NaN | 183 | 0 | 2 | 1.83 |
Priya | Mumbai | 1989 | Blue | India | Accountant | Yoga | NaN | 167 | 2 | 0 | 1.67 |
Arjun | Mumbai | 1993 | Indigo | India | Marketing Manager | Cricket | Dog | 174 | 1 | 0 | 1.74 |
Giuseppe | Rome | 1986 | Violet | Italy | Waiter | Cooking | NaN | 177 | 0 | 1 | 1.77 |
# Make a copy of the dataframe with only selected columns -and- rename some columns
columns = ['Name','City','Height (m)', 'Birth Year']
df2 = df[columns].copy().rename(columns={'Birth Year':'Year','Height (m)':'Height'})
df2
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[35], line 3
1 # Make a copy of the dataframe with only selected columns -and- rename some columns
2 columns = ['Name','City','Height (m)', 'Birth Year']
----> 3 df2 = df[columns].copy().rename(columns={'Birth Year':'Year','Height (m)':'Height'})
4 df2
File ~/opt/anaconda3/envs/jupyter-book/lib/python3.13/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
4111 if is_iterator(key):
4112 key = list(key)
-> 4113 indexer = self.columns._get_indexer_strict(key, "columns")[1]
4115 # take() does not accept boolean indexers
4116 if getattr(indexer, "dtype", None) == bool:
File ~/opt/anaconda3/envs/jupyter-book/lib/python3.13/site-packages/pandas/core/indexes/base.py:6212, in Index._get_indexer_strict(self, key, axis_name)
6209 else:
6210 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6212 self._raise_if_missing(keyarr, indexer, axis_name)
6214 keyarr = self.take(indexer)
6215 if isinstance(key, Index):
6216 # GH 42790 - Preserve name from an Index
File ~/opt/anaconda3/envs/jupyter-book/lib/python3.13/site-packages/pandas/core/indexes/base.py:6264, in Index._raise_if_missing(self, key, indexer, axis_name)
6261 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6263 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6264 raise KeyError(f"{not_found} not in index")
KeyError: "['Name'] not in index"
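The KeyError occurs because 'Name' was turned into the index with set_index() above, so it is no longer an ordinary column. One way to fix it (a sketch) is to move the index back to a column with reset_index() before selecting:
# 'Name' is the index, not a column, so bring it back as a column first
columns = ['Name', 'City', 'Height (m)', 'Birth Year']
df2 = df.reset_index()[columns].rename(columns={'Birth Year': 'Year', 'Height (m)': 'Height'})
df2.head()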
Creating and Modifying Dataframes#
For the most part you will load data from files in various formats, but you can also create a dataframe directly from a dictionary of lists:
data = {'Name':['Jon', 'Janie', 'Jack', 'Jasmine'],
'Age': [32, 64, 96, 128],
'Job': ['Tinker', 'Tailor', 'Soldier', 'Spy'],
'Shoes': [2,5,10,15],
'City': ['Mumbai', 'Dallas', 'Kuala Lumpur', 'Buenos Aires']}
df = pd.DataFrame(data)
df
Name | Age | Job | Shoes | City | |
---|---|---|---|---|---|
0 | Jon | 32 | Tinker | 2 | Mumbai |
1 | Janie | 64 | Tailor | 5 | Dallas |
2 | Jack | 96 | Soldier | 10 | Kuala Lumpur |
3 | Jasmine | 128 | Spy | 15 | Buenos Aires |
And you can add columns. These can be calculated from existing columns or, in principle, filled in from other data sources (e.g. if you had some other lookup table or dataframe with related data):
df['Shoes Per Year'] = df['Shoes']/df['Age']
df
Name | Age | Job | Shoes | City | Shoes Per Year | |
---|---|---|---|---|---|---|
0 | Jon | 32 | Tinker | 2 | Mumbai | 0.062500 |
1 | Janie | 64 | Tailor | 5 | Dallas | 0.078125 |
2 | Jack | 96 | Soldier | 10 | Kuala Lumpur | 0.104167 |
3 | Jasmine | 128 | Spy | 15 | Buenos Aires | 0.117188 |
Here, to add a new column, we use an "external" data source and the map() method, which does an element-wise lookup in the provided dict:
# For example, maybe some other data source provides population (in millions) for cities:
city_populations = {
"Berlin": 3.6,
"Dallas": 1.3,
"Kuala Lumpur": 1.8,
"Tokyo": 37.4,
"Mumbai": 20.7,
"Nairobi": 4.4,
"São Paulo": 22.4,
"Buenos Aires": 15.5
}
df['Population'] = df['City'].map(city_populations)
df
Name | Age | Job | Shoes | City | Shoes Per Year | Population | |
---|---|---|---|---|---|---|---|
0 | Jon | 32 | Tinker | 2 | Mumbai | 0.062500 | 20.7 |
1 | Janie | 64 | Tailor | 5 | Dallas | 0.078125 | 1.3 |
2 | Jack | 96 | Soldier | 10 | Kuala Lumpur | 0.104167 | 1.8 |
3 | Jasmine | 128 | Spy | 15 | Buenos Aires | 0.117188 | 15.5 |
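In this example every city happens to be present in the lookup dict, but in general map() returns NaN for any key it cannot find, so it can introduce missing values. Those can be checked and, if appropriate, filled (a sketch; filling with 0 is just one hypothetical choice):
df['Population'].isna().sum()   # 0 here, since every city appears in city_populations
df['Population'].fillna(0.0)    # hypothetical: returns a copy with any unknown populations set to 0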
How to understand a dataset you load in Pandas#
If you load a new (to you) dataset in Pandas, typically the useful things to do are (a compact sketch combining these steps follows the list):
- Use df.head() to get a quick sense of what is there
- Understand what the rows are
- Understand what columns are available (by column names or by separate documentation)
  - Attributes like df.columns and df.dtypes help here
- Understand which columns are numerical data and which are categorical data
- Drop any columns you do not need
- Understand the range of data in the different columns
  - Use df.describe() and plotting for numerical data
  - Use .value_counts() and .unique() for categorical data
- Understand if the data is complete or if some elements are missing
  - Use functions like df['foo'].value_counts() and df.isna().sum()
  - Consider dropping or recategorizing data with special or missing values (though be careful, as sometimes those are the interesting cases!)
- Make plots of numerical data, both individual columns as well as correlations, to better understand the basic distributions, ranges, etc.
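Putting the checklist together, a minimal first-look workflow for the example dataset above might look like the following (a sketch, meant to be run statement by statement in a notebook so that each result is displayed):
df = pd.read_csv('assets/pandas-example.csv')
df.head()                          # quick look at the first few rows
df.info()                          # column names, dtypes and non-null counts
df.describe()                      # summary statistics for the numeric columns
df['Favorite Color'].unique()      # distinct values of a categorical column
df['Pet'].value_counts()           # how often each category appears (NaN excluded)
df.isna().sum()                    # number of missing values per column
df = df.drop(columns=['Hobbies'])  # drop a column that is not needed for the analysis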