Initial Data Engineering
When I found the initial dump of tweets that I wanted, they were stored in a huge text file as a single JSON array of objects, where each element represented a single tweet, along with a ton of information I don't need. I did some initial exploring and started writing a Python script to get them into a more usable format for my purposes.
Exploration
I am a huge fan of command line tools that make the job easy, and in this case, I'm using jq, an awesome CLI for looking at JSON objects. Since I use Ubuntu, it was as easy as sudo apt-get install jq, and it was there.
I have a data folder I've been using in my project folder, so I cd'd into that, and wanted to look at the first object in the files I had.
cat 2017.json | jq '.[0]'
jq takes a bunch of different filter types, and in this case I just wanted to see the first element of the data I had. Because this tool is awesome, it even syntax-highlighted and pretty-printed the output for me, which it does by default.
Looking at the data I got back, I saw that it had a bunch of extra information that I don't really need, so I'll need to write an extraction script that pulls just the keys I want. Each tweet has a huge block of User information that I don't need in my dataset, so I wanted to see just what it contained:
cat 2017.json | jq '.[0]' | jq 'del(.["user"])'
This showed me that the base object, without the user object, doesn't contain who tweeted it at all, so I'll need to dig into that too. However, looking at that set of keys, I'm now realizing that I'll need a few extra things from my first list. I think I'll initially pull out:
- is_quote_status
- in_reply_to_status_id
- geo - this is just for fun, as I'm not sure I'll use it yet, but someday I might
- id
- favorite_count
- full_text
- entities.user_mentions
- entities.hashtags
- entities.urls
- retweeted
- in_reply_to_status_id_str
- in_reply_to_screen_name
- in_reply_to_user_id
- retweet_count
And then, from the User schema, I'll need:
- user.id
- user.screen_name
- user.followers_count (also just for fun)
I'm thinking that, initially, I'm going to store them as a pipe-separated CSV, which means I'll need to do a little cleaning on the full_text field to strip out any pipe characters. For the fields that are JSON arrays, I'll store them as comma-separated values.
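To make that concrete, here's a rough sketch of the kind of per-tweet cleaning I have in mind. The function name and the exact entity sub-fields are illustrative, not the final script:

```python
def tweet_to_row(tweet):
    """Flatten one raw tweet dict into a list of pipe-safe column values (sketch only)."""
    # Pipes are the CSV delimiter, so strip them (and newlines) out of the free-text field.
    full_text = tweet.get("full_text", "").replace("|", " ").replace("\n", " ")

    entities = tweet.get("entities", {})
    # Array-valued fields get collapsed into comma-separated strings.
    mentions = ",".join(u["screen_name"] for u in entities.get("user_mentions", []))
    hashtags = ",".join(h["text"] for h in entities.get("hashtags", []))
    urls = ",".join(u["expanded_url"] for u in entities.get("urls", []))

    user = tweet.get("user", {})

    return [
        tweet.get("id"),
        tweet.get("is_quote_status"),
        tweet.get("in_reply_to_status_id"),
        tweet.get("in_reply_to_screen_name"),
        tweet.get("in_reply_to_user_id"),
        tweet.get("geo"),
        tweet.get("favorite_count"),
        tweet.get("retweet_count"),
        tweet.get("retweeted"),
        full_text,
        mentions,
        hashtags,
        urls,
        user.get("id"),
        user.get("screen_name"),
        user.get("followers_count"),
    ]
```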
Engineering the data
JSON files are ok, but I'm pretty sure I'm going to want a CSV. So, I wrote a script to parse the three files I had (2016, 2017, and 2018), grab the fields I want, clean them up as necessary, and spit them out to an output file. It'll also be on my GitHub as I get further along, but basically, you pass it a folder of JSON files, and it loops through each of those, parses the necessary fields out of the JSON objects, cleans up the text fields, and writes them out to a CSV.
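The overall flow is roughly this, reusing the hypothetical tweet_to_row helper from the sketch above; the folder path, file pattern, and output name are placeholders:

```python
import csv
import json
from pathlib import Path


def convert_folder(json_dir, output_path):
    """Read every JSON file in json_dir and write one pipe-separated CSV (sketch only)."""
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="|")
        for json_file in sorted(Path(json_dir).glob("*.json")):
            with open(json_file, encoding="utf-8") as f:
                tweets = json.load(f)  # each file is one big JSON array of tweet objects
            for tweet in tweets:
                writer.writerow(tweet_to_row(tweet))


# e.g. convert_folder("data", "test_output.csv")
```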
When I downloaded my files, I noticed that, suspiciously, the 2017 and 2018 files were exactly the same size - I thought something was amiss. Just to make sure, I ran my output file through a sort/uniq chain, like so:
sort test_output.csv | uniq > test_output_uniq.csv
And my suspicions were right - I removed a ton of duplicates, and only have data through the end of 2017. Based solely on volume and glancing at the data, I feel like I can trust it up to that date - 6900 tweets across two years for this user seems about right. So now I just need to figure out what to do about this year...
Another data source
Luckily, the same source where I found my original, full tweets also had 'condensed' archives, which only contained:
- source
- id_str
- text
- created_at
- retweet_count
- in_reply_to_user_id_str
- favorite_count
- is_retweet
This, though, is totally enough to get the rest from the Twitter API. I get rate limited to 900 requests every 15 minutes, but that shouldn't be so bad, since I only have a few months of tweets that I need to get with this method. So I wrote some quick additions to a fork of my original engineering script, using the python-twitter library, which made my life a ton easier.
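The addition looks roughly like the sketch below; the credentials are obviously placeholders, and GetStatus is python-twitter's call for pulling a single tweet by id:

```python
import twitter  # pip install python-twitter

# Placeholder credentials - these come from a Twitter developer app.
api = twitter.Api(
    consumer_key="...",
    consumer_secret="...",
    access_token_key="...",
    access_token_secret="...",
    tweet_mode="extended",     # return full_text instead of the truncated text field
    sleep_on_rate_limit=True,  # wait out the 15-minute rate limit windows automatically
)


def hydrate(tweet_id):
    """Fetch the full tweet for one id from the condensed archive (sketch only)."""
    status = api.GetStatus(tweet_id)
    return status.AsDict()  # roughly the same dict shape the earlier cleaning code expects
```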
Getting Replies
Now that I had figured out how to get all of the 'base' tweets, I needed to get all of the replies I possibly could. Luckily, I found another source that had a HUGE library of tweet IDs that were filterable, so I filtered the dataset (using their website) to just tweets that were in reply to my base user, and downloaded approximately 4.2 million tweet IDs. I need to rehydrate them (Twitter's term for downloading the full data from the API), so I'm going to use the same scripts from above, with slight modifications to read the IDs from a file and then make the appropriate API calls. The Twitter API rate limits, but the python-twitter library has built-in methods for handling that.
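The modified loop is basically: read the IDs out of the file and hydrate them in batches, since GetStatuses accepts up to 100 ids per request. A sketch, reusing the api object from the block above:

```python
def hydrate_id_file(id_path, batch_size=100):
    """Yield hydrated tweets for every id in a file, one id per line (sketch only)."""
    # Deleted or protected tweets simply come back missing from the response.
    with open(id_path) as f:
        ids = [line.strip() for line in f if line.strip()]

    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        for status in api.GetStatuses(batch):
            yield status.AsDict()
```

Even at up to 100 IDs per call, 4.2 million IDs is still a lot of requests, but the library just sleeps through the rate limits for me.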
Until next time (which will probably be sooner than later).