- Introduction to Pandas 🐼
- What are data frames and series?
- Read and write a CSV
- Selecting columns
- Selecting rows
- Working with different datatypes
- Most common functions
- Plotting with Pandas
- 🐕 Exercises
If you need to perform any kind of calculation with tabular data, use Pandas. Pandas is a library that helps programmers query, perform calculations on, and clean CSVs. If you get nothing else from these seminars, learning Pandas is probably the most useful investment of your time. Regardless of what you end up doing, you’re inevitably going to work with data in a table, and Pandas is a lot faster than Excel or Google Sheets.
Pandas has been to space. It’s used on Wall Street and the Federal Reserve. It’s used by digital humanitists and data journalists. It’s powerful, and thankfully, if you know how to code, it’s pretty straightforward to learn!
A “data frame” is a word you’ll hear a lot. It’s what Pandas (as well as other data manipulation software) calls the table object. It’s also why the most common variable used to store a CSV is
df. In Pandas vernacular, “series” just means “column”. A data frame contains multiple series objects.
For all our examples we’ll be using our CSV of
@dog_feelings tweets. In order to load a CSV you just need to provide a path to a CSV file, whether that’s on your computer or on the internet.
url = 'https://raw.githubusercontent.com/kmcelwee/fsi-web-scraping-seminar/main/data/dog_feelings-tweets.csv' df = pd.read_csv(url)
After changing your CSV, you may want to save it as a file, for that, just use the
.to_csv() method, where you just provide a filename where you want to save your data frame. (Just as a warning though, this will overwrite anything with the same name!)
In general, just add
index=False. It’s not worth going into the details here, but put succinctly, you’ll get an extra column if you don’t include that flag.
To view all columns:
To work with one column:
# Select the 'retweet_count' column df['retweet_count']
To work with a subset of multiple columns:
# Select the 'retweet_count' and 'favorite_count' columns df[['retweet_count', 'favorite_count']]
To create a new column, all you need to do is define it. For example, if we wanted to add favorites and retweets together for some reason, just add them together and set them equal to a new column:
df['retweets_plus_favorites'] = df['retweet_count'] + df['favorite_count']
There are many ways to apply functions to columns to manipulate them, but with this baseline knowledge, I trust you can Google appropriately. If you want to learn more, check out this YouTube lecture:
In order select a subset of rows we often want to filter on a certain condition. For example, what if we wanted to only get tweets that had more than 1300 favorites, we could make what’s called a “mask”. First we need to create an array of True and False (boolean) values:
df['favorite_count'] > 1300
The output of which would be:
0 True 1 True 2 True 3 True 4 True ... 1124 False 1125 False 1126 False 1127 False 1128 True
You can then apply this as mask to the data frame by using square brackets like so:
df[df['favorite_count'] > 1300]
This will select only the rows where
df['favorite_count'] > 1300 is True.
To see the different datatypes in our
df.dtypes. You’ll get the following result:
timestamp object id int64 text object favorite_count int64 retweet_count int64 hashtags object
We’re oversimplifying here, but
int64 means that the column contains an integer and the
object columns here are strings.
When we want to filter by date though (which we often want to do), we need to parse the timestamp column as a “datetime” object. You do this with the
pd.to_datetime function. You can you can overwrite the
timestamp column like so:
df['timestamp'] = pd.to_datetime(df['timestamp'])
If you look at
df.dtypes again, you’ll see that the data type of the
timestamp column is
datetime64[ns, UTC]. Now we can create new columns like “month” or “day-of-week” using the
df['month'] = df['timestamp'].dt.month df['day-of-week'] = df['timestamp'].dt.dayofweek
Often you’ll want to get the number of rows or columns in a data frame. You can access this info using the
# return a list that gives the shape of the data frame df.shape # >>> (1129, 6) # number of rows in data frame df.shape # number of columns in a data frame df.shape
# Show the first 5 rows of a df df.head() # Show the first 10 rows of a df df.head(10) # Show the last 5 rows of a df df.tail() # Show the last 10 rows of a df df.tail(10)
Sort a dataframe by the given column.
# Sort df by the column 'favorite_count' df.sort_values('favorite_count') # Sort df by the column 'favorite_count' in ascending order df.sort_values('favorite_count', ascending=False)
If called on a data frame, the count function gives the number of non-null values in all columns. If called on a series, the count function will return the number of non-null values in that column.
# How many non-null values are in each column? df.count() # How many non-null values are in the column 'hashtags' df['hashtags'].count()
When provided numerical data, you will inevitably want to find various statistics on that distribution. Here are some of the most popular:
# Average number of retweets per tweet df['retweet_count'].mean() # Median number of retweets per tweet df['retweet_count'].median() # Minimum number of retweets per tweet df['retweet_count'].min() # Max number of retweets per tweet df['retweet_count'].max() # 25th quantile of retweets per tweet # (meaning 25 percent of tweets have fewer retweets than this number) df['retweet_count'].quantile(.25)
Question: Why might you prefer median instead of mean as a way to describe your data?
Medians are less sensitive to outliers, meaning that one large or really small numbers won't dramatically change your calculation.
This is most common when talking about incomes. Especially in the US, economists will discuss "median household income." This is because the "mean household income" is significantly higher because economic inequality. Mean household income would be a poor reflection of the actual state of the economy.
For example, if you and three friends made $30k, $45k, $45k, and $50k/year, the mean would be $45k/year. If Jeff Bezos was added to the mix, the mean just rose to a few billion, not because you are any richer. The median, however, stayed the same at $45k.
If you’ve ever used Excel or Google Sheets, you’ve probably wanted to take one column and apply some kind of function to it. In Pandas, the
.apply() function does that for you. If you have a series, and want to use each of those rows to create a new column (and you aren’t doing simple arithmetic), then just feed a function into this method. This is best explained by example:
def sentence_count(tweet_text): return len([sentence for sentence in tweet_text.split('.') if sentence != '']) df['sentence_count'] = df['text'].apply(sentence_count)
Here we apply the new function
sentence_count to each value in the
df['text'] column and assign it to a new column
sentence_count function isn’t entirely accurate. Can you think of why that might be?
Sometimes the tweet separates sentences by newline instead of a period. Sometimes the tweet has enumerated points (e.g "1. ...", 2. ...", etc). When making these functions, you'll often have to continually test to make sure you are considering these "corner cases".
This method returns true if the strings in your column contain a given substring. This is useful when filtering rows in our dataset. For example, if we wanted to know what tweets contain the word “gooooob”:
.groupby() isn’t the most straightforward function, but it is very useful in many contexts. Many times you’ll want to group categorical data together, and run a calculation for each of those groups. For example, if you wanted to know what the average number of favorites is per day of the week, here’s what that would look like:
# Ensure that the "timestamp" column is appropriately parsed as a datetime df['timestamp'] = pd.to_datetime(df['timestamp']) # Create a "day-of-week" column df['day-of-week'] = df['timestamp'].dt.dayofweek # Groupby "day-of-week", select the "favorite_count" column and get the mean df.groupby('day-of-week')['favorite_count'].mean()
The output should look like this:
day-of-week 0 65683.261538 1 77871.142857 2 70649.446927 3 74367.388235 4 48420.922619 5 60879.256198 6 70115.652893 Name: favorite_count, dtype: float64
groupby() is a function that can be difficult to get a handle on, so I won’t expand too much further. I recommend the following video if you need to learn more in order to answer your research question:
Thankfully, plotting with Pandas (as long as you’re keeping things simple) is pretty straightforward. Just add
.plot() to whatever series you’re working with. The default is a line chart, but if you want a bar chart, just set
kind='bar'. To set a title, just set
title='My awesome graph'. For everything else, just google “How to change X in a graph with Pandas”. Pandas is built on Matplotlib so familiarizing yourself with that library may be helpful, but in general, you shouldn’t need to do too much outside of just changing that plot function.
If you don’t know how to proceed, try googling the solution. Prepare an answer before clicking the “View Solution” button.
1. How many tweets are were sent in 2019?
2. What was the average number of retweets that a tweet would get in 2017? 2018? 2019? …etc
3. How many tweets contain the word ‘gooooob’?
4. How many tweets are in all caps?
5. What’s the average ratio of favorites to retweets?