Pollytics

As part of a course in our Data Science degree, we analyzed sentiment around Biden and Trump during 2020 using around 60,000 Reddit posts.

We compared that Reddit sentiment with the candidates' own tweets (2k tweets from Biden, 10k tweets from Trump) and with poll data obtained from FiveThirtyEight.

We built a website that visualizes all of this data using Python's Dash framework. The app fetches its data from Firebase and is deployed on Heroku. We also used Tableau for some of the visualizations. For topic modeling, we used LDA (latent Dirichlet allocation) to fit a generative statistical model for each month. LDA explains sets of observations through unobserved groups (the topics), which accounts for why some parts of the data are similar.
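The per-month topic modeling described above can be sketched roughly as follows. This is a minimal, standard-library illustration of LDA fitted with collapsed Gibbs sampling, not the code used in this project (which library we used is not stated here). The function names `fit_lda` and `top_words`, the hyperparameters, and the tiny tokenized posts are all made up for the example.

```python
import random
from collections import defaultdict


def fit_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized docs with collapsed Gibbs sampling.

    Returns one {word: count} mapping per topic (hypothetical helper).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]             # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                           # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample each word's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words of each topic (ties broken alphabetically)."""
    result = []
    for counts in topic_word:
        ranked = sorted(((w, c) for w, c in counts.items() if c > 0),
                        key=lambda wc: (-wc[1], wc[0]))
        result.append([w for w, _ in ranked[:n]])
    return result


# Made-up, pre-tokenized posts grouped by month (illustrative only).
posts_by_month = {
    "2020-03": [
        ["covid", "lockdown", "testing"],
        ["covid", "testing", "masks"],
        ["primary", "delegates", "vote"],
        ["primary", "vote", "delegates"],
    ],
    "2020-10": [
        ["debate", "moderator", "interrupt"],
        ["debate", "interrupt", "moderator"],
        ["ballot", "mail", "vote"],
        ["mail", "ballot", "vote"],
    ],
}

# One independent model per month, as in the description above.
topics_by_month = {
    month: top_words(fit_lda(docs, num_topics=2))
    for month, docs in posts_by_month.items()
}

for month, topics in topics_by_month.items():
    print(month, topics)
```

On a real corpus you would likely use an established library and clean the text first (stop-word removal, lowercasing); the sketch only shows the shape of fitting one model per month.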

Stack Used
• Python - Data Scraping, Collection, Cleaning, Processing
• PySpark - Data Processing for Poll Data
• Firebase - for realtime Data Storage and access
• Plotly Express - for interactive Data Visualizing and Graphing
• Tableau - Data Visualizing
• Python Dash (with Flask) - for creating a React Web App with Bootstrap components
• Heroku - deploying Web App

Data Sources
• Twitter API - for scraping Donald Trump and Joe Biden Tweets
• Trump Archive - Since the Twitter API limits retrieval to about 3k tweets per account and Trump had over 10k tweets for 2020, we used Trump Archive to get the tweets beyond that limit
• PushShift Reddit API - for scraping Reddit posts and aggregated stats for 2020
• FiveThirtyEight - for Polling Data
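
Combining the Twitter API results with the Trump Archive means merging two tweet collections that may overlap. A minimal sketch of that merge is below, assuming each tweet is a dict with `id`, `created_at` (ISO date string), and `text` fields. Those field names, the `merge_tweets` helper, and the sample records are hypothetical and only illustrate de-duplicating by tweet ID.

```python
def merge_tweets(api_tweets, archive_tweets):
    """Combine two tweet lists, keeping one copy per tweet ID, sorted by date."""
    merged = {}
    for tweet in archive_tweets:
        merged[tweet["id"]] = tweet
    for tweet in api_tweets:
        # On overlap, the API copy replaces the archive copy (arbitrary choice).
        merged[tweet["id"]] = tweet
    # ISO date strings sort chronologically when compared as text.
    return sorted(merged.values(), key=lambda t: t["created_at"])


# Made-up placeholder records (illustrative only).
api_tweets = [
    {"id": "3", "created_at": "2020-12-01", "text": "newer tweet"},
    {"id": "2", "created_at": "2020-06-15", "text": "overlap (API copy)"},
]
archive_tweets = [
    {"id": "1", "created_at": "2020-01-05", "text": "older tweet"},
    {"id": "2", "created_at": "2020-06-15", "text": "overlap (archive copy)"},
]

all_tweets = merge_tweets(api_tweets, archive_tweets)
print([t["id"] for t in all_tweets])
```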