If you’ve ever worked with a spreadsheet, you know what a CSV is. A CSV is just a standardized way of storing data in a table. CSV stands for “Comma Separated Value”, meaning literally each value in row is separated by a comma. Another popular format is TSV or “Tab Separated Value”. It’s the same as a CSV, but just separates values by tabs. This is especially useful when there are a lot of commas in your data. Here’s what a CSV looks like (Fortune 100 companies and their Twitter handles):
Corporation,URL,Rank,Handle,Sector Walmart,https://twitter.com/Walmart,1,Walmart,Retailing Amazon.com,https://twitter.com/amazon,2,amazon,Retailing Exxon Mobil,https://twitter.com/exxonmobil,3,exxonmobil,Energy
That data when loaded into Excel or Google Sheets would look like this:
CSVs are important because they are non-proprietary and plain text. Therefore they can easily be parsed and created by the software programs we write. We’ll learn more about working with CSVs when we dig into the Pandas library next week.
One drawback of CSVs is that they are limited to one value for each row and each column. If we were trying to add a
Hashtags column, we might get ten or fifteen values in each row. If, for example, Walmart used the hashtags “#retail”, “#IndependenceDay”, and “#savings” all in one tweet, we could combine with some kind of unique character like a semicolon:
retail;IndependenceDay;savings. Or we could use JSON…
As you can see, there’s a lot of information for just a single tweet, and representing that information in a CSV would be nearly impossible. JSON makes it easy to communicate complex datasets in a plain text format.
Again, JSON is just a combination of dictionaries and lists. To open a JSON file and get certain values, you’d treat it as you would any list or dictionary.
You can load a JSON file using Python’s
import json json_file_path = 'tweet.json' with open(json_file_path) as f: tweet = json.load(json_file_path)
Feel free to copy the following code into a Colab notebook and mess around with the
tweet variable. Here are some examples:
import json import requests r = requests.get('https://raw.githubusercontent.com/kmcelwee/fsi-web-scraping-seminar/main/data/tweet.json') tweet = json.loads(r.text) # To get the text of this tweet tweet['full_text'] # To get the users mentioned in this tweet tweet['entities']['user_mentions']['screen_name']
It can sometimes be difficult to figure out the path to a variable in JSON, so some trial and error should be expected.
💡 Exercise: Can you make a list of the hashtags with the example tweet provided above?