
Breast Cancer Forum Analysis
- Category: Data Analysis
- Company: Private Client (UpWork)
- Code Repo: Private
- Language: Python
Worked on a long-term, multi-stage project on UpWork involving data scraping, exploratory analysis (using sentiment extraction), and unsupervised clustering.
Data Scraping
Initially worked on a simple bot that scraped HTML pages from the Inspire Medical Forum. However, the forum restricts user information to logged-in users. Thus, in the second phase, developed a crawler bot using Python's excellent Selenium library that could log in and scrape user and thread information through a Chrome instance. With the help of this bot, I scraped over 50k user posts on threads and around 6k user profiles.
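The log-in-then-scrape flow can be sketched roughly as below. This is a minimal illustration rather than the project's code, which is private: the URLs, the form field names ("email", "password") and the CSS selectors are hypothetical placeholders and would have to be replaced with whatever the forum's pages actually use.

```python
def extract_posts(driver, post_locator):
    """Return the non-empty visible text of every element matching post_locator.

    post_locator is a (strategy, value) tuple such as (By.CSS_SELECTOR, "div.post").
    """
    by, value = post_locator
    texts = (element.text.strip() for element in driver.find_elements(by, value))
    return [text for text in texts if text]


def scrape_thread_logged_in(login_url, thread_url, username, password):
    """Log in through a Chrome instance, open one thread and return its post texts."""
    # Imported here so extract_posts can be used and tested without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Hypothetical selector: the real class name depends on the forum's markup.
    post_locator = (By.CSS_SELECTOR, "div.post-content")

    driver = webdriver.Chrome()
    try:
        # Fill in and submit the login form (field names are assumptions).
        driver.get(login_url)
        url_before_login = driver.current_url
        driver.find_element(By.NAME, "email").send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        # Treat a URL change as the sign that the login went through.
        WebDriverWait(driver, 15).until(EC.url_changes(url_before_login))

        # Open the thread and wait for posts to be present before reading them.
        driver.get(thread_url)
        WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located(post_locator)
        )
        return extract_posts(driver, post_locator)
    finally:
        driver.quit()
```

Keeping the extraction step separate from the browser session means the same helper can be reused for user profile pages by passing a different locator.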
Sentiment Analysis
Performed sentiment analysis to extract the sentiment of user posts using two approaches:
• A positive and negative words approach, which weighs the use of positive words against negative words to score a sentence.
• vaderSentiment, an open-source lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
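The first approach can be illustrated with a small sketch. The word lists below are tiny hypothetical stand-ins, and the scoring formula (positive minus negative hits, divided by total hits) is one plausible way to weigh the two counts, not necessarily the exact formula used in the project.

```python
import re

# Hypothetical mini-lexicons; a real run would load much larger word lists.
POSITIVE_WORDS = {"good", "better", "hopeful", "grateful", "relief", "improved"}
NEGATIVE_WORDS = {"pain", "worse", "afraid", "nausea", "tired", "scared"}


def word_count_sentiment(sentence):
    """Score a sentence in [-1, 1] by weighing positive against negative words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    total_hits = positive_hits + negative_hits
    if total_hits == 0:
        return 0.0  # no sentiment-bearing words found: treat as neutral
    return (positive_hits - negative_hits) / total_hits
```

For the second approach, the vaderSentiment package provides a SentimentIntensityAnalyzer class whose polarity_scores method returns negative, neutral, positive and compound scores for a piece of text, so no hand-built lexicon is needed.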
K-Means Clustering
Used an unsupervised clustering method, K-Means clustering, to cluster posts based on several metrics:
• Medical drugs discussed in the post, encoded as categorical variables. The drugs were extracted using lexicons provided by the client, with exact-match heuristics.
• Sentiment score of the post.
• Similarity of the interests and community memberships of the user who created the post.
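As a rough sketch of how such features can feed K-Means, the example below encodes drug mentions as 0/1 indicators via exact matching, appends the sentiment score, and clusters the resulting vectors. The drug list is a hypothetical stand-in for the client's lexicons, the user-similarity feature is left out for brevity, and the clustering loop is a plain-Python version of the standard algorithm written so that the example has no dependencies; in practice a library implementation such as scikit-learn's KMeans would be the usual choice.

```python
import random
import re

# Hypothetical stand-in for the client-provided drug lexicons.
DRUG_LEXICON = ["tamoxifen", "letrozole", "herceptin"]


def post_features(text, sentiment):
    """Encode a post as [one 0/1 flag per lexicon drug..., sentiment score]."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1.0 if drug in words else 0.0 for drug in DRUG_LEXICON] + [sentiment]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points with Lloyd's algorithm; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(point) for point in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids
```

Because the drug flags are 0 or 1 while the sentiment score lies between -1 and 1, the relative scaling of the features affects which posts end up together, so rescaling or weighting the columns is a design choice worth revisiting for real data.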