Breast Cancer Forum Analysis

  • Category: Data Analysis
  • Company: Private Client (UpWork)
  • Code Repo: Private
  • Language: Python

Worked on a long-term, multi-stage project on UpWork involving data scraping, exploratory analysis (using sentiment extraction), and unsupervised clustering of forum posts.

Data Scraping
Initially built a simple bot that scraped HTML pages from the Inspire Medical Forum. However, the forum shows user information only to logged-in users. In the second phase, I therefore developed a crawler bot using Python's excellent Selenium library that could log in and scrape user and thread information through a Chrome instance.

With the help of this bot, I scraped over 50k user posts on threads and around 6k user profiles.

Sentiment Analysis
Performed sentiment analysis to extract the sentiment of user posts using two approaches:
• A positive and negative words approach, which weighs the use of positive words against negative words to score a sentence.
• vaderSentiment, an open-source lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
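
A minimal sketch of the positive and negative words approach, for illustration only. The word lists, the function name `word_polarity_score`, and the normalization to the range [-1, 1] are assumptions; the project's actual lexicons and scoring formula are not shown here.

```python
import re

# Hypothetical mini-lexicons for illustration; not the word lists used in the project.
POSITIVE_WORDS = {"good", "better", "hope", "relief", "grateful"}
NEGATIVE_WORDS = {"pain", "worse", "afraid", "nausea", "tired"}


def word_polarity_score(sentence):
    """Weigh positive words against negative words.

    Returns (positives - negatives) / (positives + negatives), a value in
    [-1, 1], or 0.0 when the sentence contains no lexicon words.
    """
    # Lowercase and split into simple word tokens, dropping punctuation.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    positives = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negatives = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    total = positives + negatives
    return (positives - negatives) / total if total else 0.0
```

Under these assumptions, a sentence with one positive and two negative lexicon words scores below zero, and a sentence with no lexicon words scores exactly zero. For the second approach, the vaderSentiment package provides `SentimentIntensityAnalyzer.polarity_scores`, which returns a dictionary that includes a `compound` score.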

K-Means Clustering
Used K-Means, an unsupervised clustering method, to cluster posts based on several features, including:
• Medical drugs discussed in the post, encoded as categorical variables. The drugs were extracted using lexicons provided by the client, with exact-match heuristics to match post text against the lexicons.
• Sentiment score of the post.
• Similarity of the interests and community memberships of the user who created the post.
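
The clustering step can be sketched roughly as follows, using only the standard library. This is not the project's code: the drug lexicon, the helper names `post_features` and `kmeans`, and the choice to combine 0/1 drug indicators with the sentiment score in a single Euclidean distance are all assumptions, and the user-similarity feature is left out for brevity.

```python
import random
import re

# Hypothetical drug lexicon; the client-provided lexicons are private.
DRUG_LEXICON = ["tamoxifen", "letrozole", "herceptin"]


def post_features(text, sentiment):
    """Build a feature vector: one 0/1 indicator per lexicon drug
    (exact token match) followed by the post's sentiment score."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    indicators = [1.0 if drug in tokens else 0.0 for drug in DRUG_LEXICON]
    return indicators + [sentiment]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-Means (Lloyd's algorithm) with a fixed number of iterations."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up example posts paired with made-up sentiment scores.
posts = [
    ("Started tamoxifen and feel good", 0.6),
    ("Tamoxifen is helping, feeling hopeful", 0.7),
    ("Letrozole gives me joint pain", -0.5),
    ("Bad week on letrozole", -0.6),
]
features = [post_features(text, score) for text, score in posts]
labels, centroids = kmeans(features, k=2)
```

In practice a library implementation such as scikit-learn's `KMeans` would be the more likely choice, and features on different scales would usually need to be normalized before clustering.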