Exploring Twitter API and Data Using Tweepy, Pandas and Matplotlib — Part 1


In my last article, about building a quote-tweeting bot, I mentioned that if I did anything more with the Twitter API and its data, I would write about it. That is exactly what I am doing in this article.

I recently finished an introductory course in scientific computing and Python for data science, and I wanted to try out some of what I had learnt on data I curated myself. So I did exactly that, using the Twitter API and Tweepy to collect the data. After gathering the raw tweets, I did some basic statistical analysis with Pandas and visualization with Matplotlib. Here's how I did it.

Getting The Data (Tweets)

To talk to the Twitter API you need developer credentials: an API key and secret, and an access token and secret. I put mine in a separate credentials.py file so they stay out of the main script (since it's a Python module, the values are strings):

API_KEY = 'YOUR-API-KEY'
API_SECRET_KEY = 'YOUR-API-SECRET-KEY'
ACCESS_TOKEN = 'YOUR-ACCESS-TOKEN'
ACCESS_TOKEN_SECRET = 'YOUR-ACCESS-TOKEN-SECRET'

What data did I decide to collect from Twitter? There's a popular reality TV show currently airing on DSTV (cable TV) called Big Brother Naija. Every week, housemates are nominated for eviction: if they do not get enough votes, they are evicted from the house. The criteria for nomination are usually determined by an unseen character on the show called Big Brother, a.k.a. Biggie. Viewers outside the house get to canvass and vote for their favorite housemates so that they remain in the house.

My interest was in getting information about the Twitter fans of each of the six housemates up for eviction during the week of 26th through 29th of August. How did I determine who is a fan? When viewers canvass votes for their faves, they use words or phrases like 'Vote housemate_name', so I decided my streamer would use that phrase as its search phrase for each housemate. It is probably not exhaustive, but this project was purely exploratory, not meant for inferences or deductions.

So now that I had the criteria for my search, I had to install the libraries I planned to use for this project. To avoid installing them on my system itself, I created a virtual environment within my project folder; there's a simple guide from Real Python on how to create one. To activate the virtual environment on Windows (I use Git Bash; in plain cmd the equivalent is venv\Scripts\activate), run the following in your command line within the project folder, where venv is the name of the folder containing your virtual environment. To confirm it has been activated, you'll see (venv) in your prompt.

source ./venv/Scripts/activate 

I first installed Tweepy, a Python wrapper for interacting with the Twitter API: pip install tweepy

Then I created my streamer file, streamer.py. At the top of the file, I imported my credentials.py file, the tweepy package and the built-in json library, and set my credential variables:

import credentials
import tweepy
import json
consumer_key = credentials.API_KEY
consumer_secret_key = credentials.API_SECRET_KEY
access_token = credentials.ACCESS_TOKEN
access_token_secret = credentials.ACCESS_TOKEN_SECRET

Next, I authenticated my credentials and created an instance of the Tweepy API class.

auth = tweepy.OAuthHandler(consumer_key, consumer_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Next, I created a list of the search phrases I wanted to stream. Then I wrote the streaming or tweets-getting function as shown below.

search_terms = ['Vote Cindy', 'Vote Esther', 'Vote Frodd', 'Vote Sir Dee', 'Vote Tacha', 'Vote Venita']

def stream_tweets(search_term):
    data = []  # empty list to which tweet_details dicts will be added
    counter = 0  # counter to keep track of each iteration
    for tweet in tweepy.Cursor(api.search, q='"{}" -filter:retweets'.format(search_term),
                               count=100, lang='en', tweet_mode='extended').items():
        tweet_details = {}
        tweet_details['name'] = tweet.user.screen_name
        tweet_details['tweet'] = tweet.full_text
        tweet_details['retweets'] = tweet.retweet_count
        tweet_details['location'] = tweet.user.location
        tweet_details['created'] = tweet.created_at.strftime("%d-%b-%Y")
        tweet_details['followers'] = tweet.user.followers_count
        tweet_details['is_user_verified'] = tweet.user.verified
        data.append(tweet_details)

        counter += 1
        if counter == 1000:
            break
    with open('data/{}.json'.format(search_term), 'w') as f:
        json.dump(data, f)
    print('done!')

In my case I didn't just want to listen for new tweets as they arrive; I wanted past tweets as well, which is why I used the api.search method. However, api.search does not retrieve more than 100 tweets per request, and I ideally wanted more than that. This is where the tweepy.Cursor object with .items(), in addition to the counter variable, comes into play. The Cursor paginates the results retrieved from the API so we receive one page at a time. As Omar Aref from Big Data Zone puts it:

…the general idea is that it will allow us to read 100 tweets, store them in a page inherently, then read the next 100 tweets.

We set the counter to stop at 1000 because, now that we are searching with the Cursor object, the loop would otherwise keep requesting pages indefinitely rather than stopping at 100 tweets. The counter ensures the loop stops after 1000 iterations, i.e. 1000 tweets.

Some other important things to note are:

  1. api.search returns a list of dictionary-like objects containing a lot of attributes for each tweet, such as the tweet text itself, when it was created, the name of the user who sent it, how many followers the user has, etc. A list of some of the attributes and a brief description can be found in this tutorial.
  2. The search parameter q='"{}" -filter:retweets'.format(search_term) ensures that the tweets returned contain the exact search phrase and that retweets are excluded from the results, since they would produce duplicate data.
  3. The tweet_mode='extended' parameter ensures that we retrieve the full text of each tweet, not a preview. Leaving this out gets you a truncated version of the tweet followed by an ellipsis (…) and a URL to the tweet.
  4. tweet_details['created'] = tweet.created_at.strftime("%d-%b-%Y"): tweet.created_at is a datetime object, and since I was first storing my tweets in a JSON file, I had to convert it to a text format. That is why the .strftime("%d-%b-%Y") method is called.
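To see why the conversion in point 4 matters: json.dump cannot serialize a datetime object directly, while the strftime string serializes fine. A small self-contained sketch (the date here is made up):

```python
import json
from datetime import datetime

created_at = datetime(2019, 8, 26, 12, 30)

# json cannot serialize a raw datetime...
try:
    json.dumps({'created': created_at})
    serializable = True
except TypeError:
    serializable = False

# ...but the formatted string works fine.
formatted = created_at.strftime("%d-%b-%Y")  # '26-Aug-2019'
payload = json.dumps({'created': formatted})
```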

Finally, at the bottom of the file I added an if __name__ == "__main__": block; within it, I loop through the search_terms list and call stream_tweets for each term. Then I ran the streamer file from the command line.

if __name__ == "__main__":
    print('Starting to stream...')
    for search_term in search_terms:
        stream_tweets(search_term)
    print('finished!')

You should see JSON files in the data folder of your project directory, each named after one of the search terms in search_terms (note that open() will not create the data folder for you, so create it before running the script). You have successfully gotten 1000 tweets for each housemate. Yay!

Analyzing the Tweets

In a new tweets_analyzer.py file, I started with the imports I would need for analysis and plotting:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Next, I added helper functions for some repetitive tasks. I wanted to remove odd characters from my tweets, such as URLs, emojis and symbols, so I used this clean_tweet function from this tutorial (I still struggle with regex 😓 so it was very helpful). I added it to a newly created helper_functions.py file.

import re

def clean_tweet(tweet):
    # Strip @mentions, URLs and non-alphanumeric symbols, then collapse whitespace
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ', tweet).split())
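To see what the regex actually removes, here's a quick check on a made-up tweet (the function is repeated so the snippet runs on its own):

```python
import re

def clean_tweet(tweet):
    # Strip @mentions, URLs and non-alphanumeric symbols, then collapse whitespace
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ', tweet).split())

# The mention, exclamation marks and URL are all stripped out:
cleaned = clean_tweet('@bbnaija Vote Tacha!!! https://t.co/abc123')
# cleaned == 'Vote Tacha'
```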

I imported the function, from helper_functions import clean_tweet, just below the other imports. Then I began creating an individual dataframe for each file. I tried looping through the list of names and creating the dataframes on the fly, but I kept running into an out-of-memory error. The function for creating the dataframe for one search term looked like this:

def create_cindy_df():
    cindy_df = pd.read_json('data/Vote Cindy.json', orient='records')
    cindy_df['clean_tweet'] = cindy_df['tweet'].apply(lambda x: clean_tweet(x))
    cindy_df['housemate_name'] = 'Cindy'
    cindy_df.drop_duplicates(subset=['name'], keep='first', inplace=True)
    return cindy_df

All subsequent functions looked like that, with only the housemate name changing. The functions also removed rows with duplicate user handles so that each handle appears only once. After creating dataframes for all six search terms/housemates, I joined them like this to form a single dataframe:

def join_dfs():
    cindy_df = create_cindy_df()
    esther_df = create_esther_df()
    frodd_df = create_frodd_df()
    sirdee_df = create_sirdee_df()
    tacha_df = create_tacha_df()
    venita_df = create_venita_df()
    frames = [cindy_df, esther_df, frodd_df, venita_df, sirdee_df, tacha_df]
    housemates_df = pd.concat(frames, ignore_index=True)
    return housemates_df
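As an aside, the six near-identical create_* functions could be folded into one parameterized function, since only the name changes between them. A sketch following the same file layout; create_housemate_df is my own name for it, and the cleaning regex is inlined so the snippet stands alone:

```python
import re
import pandas as pd

def clean_tweet(tweet):
    # Same cleaning regex as in helper_functions.py
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ', tweet).split())

def create_housemate_df(name):
    # One dataframe per housemate, deduplicated by twitter handle
    df = pd.read_json('data/Vote {}.json'.format(name), orient='records')
    df['clean_tweet'] = df['tweet'].apply(clean_tweet)
    df['housemate_name'] = name
    df.drop_duplicates(subset=['name'], keep='first', inplace=True)
    return df
```

With this, create_housemate_df('Cindy'), create_housemate_df('Esther') and so on replace the six separate functions.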

The analysis could then begin properly. I first wanted to find out how many fans each housemate had. In my analyze function, after getting the joined housemates_df dataframe, I used a dataframe.groupby to get that information. .groupby() splits the data into groups based on some criteria; in my case, I wanted the data split into groups by housemate name, and then to retrieve the number of unique twitter handles in each group. This was what I came up with:

def analyze():
    housemates_df = join_dfs()
    # Number of Fans by Housemate
    fans_by_housemate = housemates_df.groupby('housemate_name')['name'].count().reset_index()
    fans_by_housemate.columns = ['housemate', 'no_of_fans']

Now I have a dataframe containing the housemates and the total number of unique twitter handles, or 'fans', for each housemate.
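On a toy frame, the same groupby/count pattern looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'housemate_name': ['Cindy', 'Cindy', 'Tacha'],
    'name': ['fan1', 'fan2', 'fan3'],  # handles, already deduplicated
})
fans = df.groupby('housemate_name')['name'].count().reset_index()
fans.columns = ['housemate', 'no_of_fans']
# fans:
#   housemate  no_of_fans
# 0     Cindy           2
# 1     Tacha           1
```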

I also wanted to know the locations of these 'fans'. The housemates_df dataframe has a 'location' column, and when I checked how many unique locations there were, it ran into the thousands. Looking at the list via housemates_df['location'].unique(), I could see that certain locations clearly belonged to certain countries. So I did a thorough scan and created a list of possible locations based on the data. In my helper_functions.py file, I created a function, extract_locations(location), which takes a location argument and returns a likely country for it. I added the possible-locations list to it, then created slices of the list for each country:

def extract_locations(location):
    # To extract locations
    possible_locations = [
        'Nigeria', 'Lagos', 'Osun', 'Ogun', 'Oyo', 'Kano', 'Abia', 'Uyo', 'Abuja', 'Port Harcourt', 'Portharcourt', 'PH', 'Enugu',
        'Owerri', 'Cross River', 'Anambra', 'Warri', 'Ikeja', 'Unilag', 'University of Lagos', 'V.G.C', 'Naija', 'Ikorodu', 'Makurdi',
        'Benin City', 'Benin', 'Osogbo', 'Ondo', 'Kalabari Ethnic Nationality', 'NG', 'Delta', 'Onitsha', 'Zaria', 'Kaduna', 'lasgidi',
        'South Africa', 'Pretoria', 'Cape Town', 'Jhb', 'Johannesburg', 'SA', 'S.A', 'Ghana', 'Accra', 'GH', 'United Kingdom', 'England',
        'USA', 'U.S.A', 'United States', 'NYC', 'NY', 'Northern Virginia', 'CO', 'Las Vegas', 'Colorado', 'NM', 'CA', 'Canada', 'Toronto',
        'Ontario', 'Italy', 'Finland', 'India', 'Swaziland', 'Cameroon', 'Namibia', 'Liberia', 'Mauritius', 'Malawi', 'Tanzania', 'Zambia',
        'Uganda', 'Kenya', 'Sierra Leone', 'Zimbabwe', 'Africa'
    ]

    nigeria = possible_locations[:35]
    south_africa = possible_locations[35:42]
    ghana = possible_locations[42:45]
    uk = possible_locations[45:47]
    usa = possible_locations[47:58]
    canada = possible_locations[58:61]
    europe = possible_locations[61:63]
    other_parts_of_africa = possible_locations[-13:]
    india = possible_locations[63]

Still inside the function, I check whether the location argument contains any of the strings in each of the country lists; if it does, the function returns that country or general location. Here's what I mean:

    # Return 'N/A' i.e. not available if the location is empty or does not
    # contain any of the strings found in possible_locations
    if location == '' or not any(location_substring in location for location_substring in possible_locations):
        return 'N/A'
    if any(ng_location_substring in location for ng_location_substring in nigeria):
        return 'Nigeria'
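A more compact alternative to the slice-based lists would be a dict mapping each country to its identifying substrings; a condensed sketch with shortened lists (the full version would carry all the entries above):

```python
def extract_location(location):
    # Condensed variant of the article's function: country -> substrings
    country_substrings = {
        'Nigeria': ['Nigeria', 'Lagos', 'Abuja', 'Naija'],
        'Ghana': ['Ghana', 'Accra'],
        'USA': ['USA', 'United States', 'NYC'],
    }
    for country, substrings in country_substrings.items():
        if any(s in location for s in substrings):
            return country
    return 'N/A'  # empty or unrecognized location
```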

Back in my analyze function in tweets_analyzer.py, I added the following code:

# Locations of Fans by Housemate
housemates_df['location'] = housemates_df['location'].apply(lambda location: extract_locations(location))
fans_by_location = pd.DataFrame(housemates_df.groupby('housemate_name')['location'].value_counts().rename('count')).reset_index()

What I am doing here is applying extract_locations to the housemates_df['location'] column via a lambda, then grouping by housemate_name, getting the count of each unique location per housemate, and converting the result back into a dataframe.
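On a toy frame, the chained value_counts pattern produces one row per (housemate, location) pair:

```python
import pandas as pd

df = pd.DataFrame({
    'housemate_name': ['Cindy', 'Cindy', 'Cindy', 'Tacha'],
    'location': ['Nigeria', 'Nigeria', 'Ghana', 'Nigeria'],
})
# value_counts per group, renamed so reset_index can flatten the MultiIndex
fans_by_location = pd.DataFrame(
    df.groupby('housemate_name')['location'].value_counts().rename('count')
).reset_index()
# fans_by_location has columns: housemate_name, location, count
# e.g. the Cindy/Nigeria row carries count 2
```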

The last analysis I have been able to do so far is get the average number of followers of the fans of each housemate. This is similar to getting the number of fans but this time around I’m concerned with the followers column and I’m getting the mean instead of the count.

# Average no of followers of fans by housemate
followers_of_fans_by_hm = housemates_df.groupby('housemate_name')['followers'].mean().reset_index()
followers_of_fans_by_hm.columns = ['housemate', 'average_no_of_followers_of_fans']

Finally, I want to return all these new dataframes so that I can use them in a subsequent plotting function, so I return a tuple containing the three variables:

return (fans_by_housemate, fans_by_location, followers_of_fans_by_hm)

Plotting Charts

def plot_graphs():
    analysis_details = analyze()
    fans_by_housemate, fans_by_location, _ = analysis_details
    # Bar chart for number of fans per housemate
    fig1, ax1 = plt.subplots()
    ax1.bar(fans_by_housemate['housemate'], fans_by_housemate['no_of_fans'], label='fans by housemate')
    ax1.set_xlabel('Housemate')
    ax1.set_ylabel('Number of Twitter Fans')
    ax1.set_title('Number of Twitter Fans by Housemate')
    # Bar chart for locations of fans of housemates
    ax2 = fans_by_location.pivot(index='housemate_name', columns='location', values='count').T.plot(kind='bar', label='fans by location')
    ax2.set_xlabel('Locations')
    ax2.set_ylabel('Number of Twitter Fans')
    ax2.set_yticks(np.arange(0, 300, 15))
    ax2.set_title('Location of Twitter Fans of Housemates')
    ax2.figure.set_size_inches(10, 17)

Because I want to display the results on a web page, as we'll see in the subsequent section(s), I add these last lines to the plot function. They return a list of the plotted figures, which I will display on a web page in the next section.

    list_of_figures = [plt.figure(i) for i in plt.get_fignums()]
    return list_of_figures
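This works because Matplotlib keeps a registry of open figures: plt.get_fignums() lists their numbers, and calling plt.figure(i) with an existing number returns that figure rather than creating a new one. A minimal illustration, using the non-interactive Agg backend:

```python
import matplotlib
matplotlib.use('Agg')  # non-GUI backend, suitable for scripts and servers
import matplotlib.pyplot as plt

fig_a = plt.figure()
fig_b = plt.figure()

# plt.figure(i) fetches the already-open figures instead of creating new ones
figures = [plt.figure(i) for i in plt.get_fignums()]
```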

In my next article, I will explain how I used Flask to display my dataframes and plotted graphs on a web page and add the link to the github repo for the code. If you liked this one, don’t forget to share with your friends.

References

  1. Stream Tweets in Under 15 Lines of Code + Some Interactive Data Visualization
  2. Tutorial: Working with Streaming Data and the Twitter API in Python.
  3. Lesson 2. Get and Work With Twitter Data in Python Using Tweepy
  4. Tweepy Documentation
  5. Pandas Documentation
  6. Matplotlib Documentation
  7. Python Plotting with Matplotlib (Guide)

Part 2 of this discussion is linked below.
