Social Media Sentiment Analysis
Analyze customer sentiment from social media posts and product reviews using NLP techniques, VADER sentiment scoring, and time-series tracking to monitor brand perception in real time.
Project Overview
You are part of a brand monitoring team at a consumer electronics company. Customer feedback streams in daily from Twitter, Instagram, Reddit, and product review sites. Leadership wants to understand overall brand sentiment, detect negative sentiment spikes quickly, and identify which product features or topics drive positive or negative reactions. This project focuses on SQL for data extraction and aggregation, and Python for NLP processing, sentiment scoring, and visualization. Excel plays a supporting role for summary reporting.
Dataset Description
The dataset simulates a social listening data warehouse capturing posts and reviews from multiple platforms, storing the raw text alongside pre-computed sentiment scores for analysis.
| Table | Key Columns | Rows |
|---|---|---|
| posts | post_id, user_id, platform, post_text, post_date, likes, shares, comments_count | 25,000 |
| reviews | review_id, user_id, product_id, review_text, rating, review_date, verified_purchase | 12,000 |
| users | user_id, username, platform, follower_count, account_created, location | 18,500 |
| sentiment_scores | score_id, post_id, review_id, compound_score, pos_score, neg_score, neu_score, label | 37,000 |
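Note that each sentiment_scores row links to either a post (post_id) or a review (review_id), with the other key left null, so scores must be joined to their source table separately. A minimal pandas sketch with made-up sample rows mirroring the schema:

```python
import pandas as pd

# Hypothetical sample rows following the table definitions above
posts = pd.DataFrame({
    "post_id": [1, 2],
    "platform": ["twitter", "reddit"],
    "post_text": ["Love the new earbuds!", "Battery drains too fast."],
})
scores = pd.DataFrame({
    "score_id": [10, 11],
    "post_id": [1, 2],
    "review_id": [None, None],   # null because these scores belong to posts
    "compound_score": [0.84, -0.62],
    "label": ["positive", "negative"],
})

# Join scores to posts on the matching key; reviews would join on review_id
post_sentiment = posts.merge(scores, on="post_id", how="inner")
print(post_sentiment[["platform", "compound_score", "label"]])
```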
SQL Analysis
1. Sentiment Distribution by Platform
Aggregate sentiment labels across platforms to compare brand perception on each channel.
SELECT
p.platform,
ss.label,
COUNT(*) AS post_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY p.platform), 1)
AS pct_of_platform
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
GROUP BY p.platform, ss.label
ORDER BY p.platform, post_count DESC;
2. Trending Topics by Sentiment
Identify keywords and topics that appear most frequently in negative posts to flag emerging issues.
-- Extract posts with negative sentiment from the last 30 days
SELECT
p.post_id,
p.platform,
p.post_text,
ss.compound_score,
p.post_date
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
WHERE ss.label = 'negative'
AND p.post_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY ss.compound_score ASC
LIMIT 100;
-- Aggregate negative sentiment by week to spot spikes
SELECT
DATE_TRUNC('week', p.post_date) AS week_start,
COUNT(*) FILTER (WHERE ss.label = 'negative') AS negative_count,
COUNT(*) FILTER (WHERE ss.label = 'positive') AS positive_count,
COUNT(*) AS total_posts,
ROUND(
COUNT(*) FILTER (WHERE ss.label = 'negative') * 100.0
/ NULLIF(COUNT(*), 0), 1
) AS negative_pct
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
GROUP BY DATE_TRUNC('week', p.post_date)
ORDER BY week_start DESC;
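The SQL above surfaces the raw negative posts and weekly spike counts; the actual keyword extraction is easier to do in Python once the text has been cleaned. A minimal sketch using collections.Counter on already-preprocessed text (the sample strings are hypothetical):

```python
from collections import Counter

# Hypothetical cleaned negative-post texts (output of a preprocessing step)
negative_texts = [
    "battery drain update terrible",
    "battery died hour terrible support",
    "update bricked phone support slow",
]

# Tally token frequencies across all negative posts
counts = Counter()
for text in negative_texts:
    counts.update(text.split())

# Most frequent tokens flag recurring complaint themes
for word, n in counts.most_common(5):
    print(f"{word}: {n}")
```

In practice you would run this over the 100 posts returned by the first query, ideally after the stopword removal and lemmatization shown in the Python section.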
3. Source Comparison: Reviews vs. Social Posts
Compare average sentiment scores between product reviews and social media posts.
SELECT
'Social Posts' AS source,
ROUND(AVG(ss.compound_score), 3) AS avg_compound,
ROUND(AVG(ss.pos_score), 3) AS avg_positive,
ROUND(AVG(ss.neg_score), 3) AS avg_negative,
COUNT(*) AS total_items
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
UNION ALL
SELECT
'Product Reviews' AS source,
ROUND(AVG(ss.compound_score), 3),
ROUND(AVG(ss.pos_score), 3),
ROUND(AVG(ss.neg_score), 3),
COUNT(*)
FROM reviews r
JOIN sentiment_scores ss ON ss.review_id = r.review_id;
4. High-Influence Negative Posts
Find negative posts from users with large followings, as these pose the highest reputational risk.
SELECT
u.username,
u.follower_count,
p.platform,
p.post_text,
ss.compound_score,
p.likes + p.shares AS engagement
FROM posts p
JOIN users u ON u.user_id = p.user_id
JOIN sentiment_scores ss ON ss.post_id = p.post_id
WHERE ss.label = 'negative'
AND u.follower_count > 10000
ORDER BY engagement DESC
LIMIT 20;
Python Analysis
1. NLP Preprocessing
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize on newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
posts_df = pd.read_csv('posts.csv')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
"""Clean and normalize social media text."""
text = text.lower()
text = re.sub(r'http\S+|www\S+', '', text) # Remove URLs
text = re.sub(r'@\w+', '', text) # Remove mentions
text = re.sub(r'#(\w+)', r'\1', text) # Keep hashtag text
text = re.sub(r'[^a-z\s]', '', text) # Remove non-alpha
tokens = word_tokenize(text)
tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and len(t) > 2]
return ' '.join(tokens)
posts_df['clean_text'] = posts_df['post_text'].astype(str).apply(preprocess_text)
print(posts_df[['post_text', 'clean_text']].head())
2. VADER Sentiment Scoring
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
def get_sentiment(text):
"""Return VADER sentiment scores."""
scores = sia.polarity_scores(text)
return pd.Series({
'compound': scores['compound'],
'positive': scores['pos'],
'negative': scores['neg'],
'neutral': scores['neu']
})
# Apply VADER to each post
sentiment_df = posts_df['post_text'].astype(str).apply(get_sentiment)
posts_df = pd.concat([posts_df, sentiment_df], axis=1)
# Classify sentiment
posts_df['label'] = posts_df['compound'].apply(
lambda x: 'positive' if x >= 0.05 else ('negative' if x <= -0.05 else 'neutral')
)
print(posts_df['label'].value_counts())
print(f"\nAverage compound score: {posts_df['compound'].mean():.3f}")
3. Word Cloud Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Build corpora of cleaned text for positive and negative posts
positive_text = ' '.join(posts_df[posts_df['label'] == 'positive']['clean_text'])
negative_text = ' '.join(posts_df[posts_df['label'] == 'negative']['clean_text'])
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
wc_pos = WordCloud(width=800, height=400, background_color='white',
colormap='Greens', max_words=100).generate(positive_text)
axes[0].imshow(wc_pos, interpolation='bilinear')
axes[0].set_title('Positive Sentiment Keywords', fontsize=14)
axes[0].axis('off')
wc_neg = WordCloud(width=800, height=400, background_color='white',
colormap='Reds', max_words=100).generate(negative_text)
axes[1].imshow(wc_neg, interpolation='bilinear')
axes[1].set_title('Negative Sentiment Keywords', fontsize=14)
axes[1].axis('off')
plt.tight_layout()
plt.savefig('sentiment_wordclouds.png', dpi=150)
plt.show()
4. Sentiment Time Series
import matplotlib.dates as mdates
posts_df['post_date'] = pd.to_datetime(posts_df['post_date'])
# Daily average sentiment with 7-day rolling average
daily = posts_df.groupby(posts_df['post_date'].dt.date).agg(
avg_compound=('compound', 'mean'),
post_count=('post_id', 'count'),
negative_pct=('label', lambda x: (x == 'negative').mean() * 100)
).reset_index()
daily['post_date'] = pd.to_datetime(daily['post_date'])
daily['rolling_avg'] = daily['avg_compound'].rolling(7, min_periods=1).mean()
fig, ax1 = plt.subplots(figsize=(14, 5))
ax1.plot(daily['post_date'], daily['avg_compound'], alpha=0.3, color='#3498db', label='Daily')
ax1.plot(daily['post_date'], daily['rolling_avg'], color='#2c3e50', linewidth=2, label='7-day MA')
ax1.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax1.set_ylabel('Compound Sentiment Score')
ax1.set_title('Brand Sentiment Over Time')
ax1.legend(loc='upper left')
ax2 = ax1.twinx()
ax2.bar(daily['post_date'], daily['post_count'], alpha=0.15, color='#95a5a6', label='Post Volume')
ax2.set_ylabel('Daily Post Volume')
ax2.legend(loc='upper right')
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
plt.tight_layout()
plt.savefig('sentiment_timeseries.png', dpi=150)
plt.show()
Key Insights
Sentiment Spikes
Sudden drops in compound sentiment often correlate with product launches, outages, or PR incidents. Monitoring the 7-day moving average helps separate noise from genuine trend shifts.
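One way to turn that monitoring into an alert, sketched here on a synthetic daily series: compare each day's negative share to a trailing baseline and flag days that exceed it by a chosen number of standard deviations (the 7-day window and 2.0 threshold are assumptions to tune).

```python
import pandas as pd

# Synthetic daily negative-percentage series with a spike on the last day
negative_pct = pd.Series([12.0, 11.5, 13.0, 12.2, 11.8, 12.5, 12.1, 25.0])

# Trailing baseline and spread, shifted so today is compared to prior days only
baseline = negative_pct.rolling(7, min_periods=3).mean().shift(1)
spread = negative_pct.rolling(7, min_periods=3).std().shift(1)

# Flag days where negative share jumps well above the recent baseline
z = (negative_pct - baseline) / spread
spikes = z > 2.0
print(negative_pct[spikes])
```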
Platform Differences
Twitter/X posts tend to be more polarized (strong positive or negative) while Reddit discussions are more nuanced. Review sites skew toward extreme ratings, creating a U-shaped distribution.
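The U-shape claim can be checked directly from the reviews table by measuring what share of ratings sit at the extremes. A sketch with made-up star ratings:

```python
import pandas as pd

# Hypothetical star ratings from the reviews table
ratings = pd.Series([5, 5, 1, 4, 5, 1, 1, 5, 3, 5, 2, 1])

# Normalized rating distribution; a U-shape concentrates mass at 1 and 5
distribution = ratings.value_counts(normalize=True).sort_index()
extreme_share = distribution.loc[[1, 5]].sum()

print(distribution)
print(f"Share of 1- and 5-star reviews: {extreme_share:.0%}")
```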
Influencer Risk
A single negative post from a high-follower account can generate more engagement and reputational damage than hundreds of low-reach complaints. Prioritize response based on reach and engagement.
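A simple triage rule following that logic, sketched below: rank negative posts by a reach-weighted score (engagement scaled by the log of follower count; both the weighting and the sample data are assumptions, not a prescribed formula).

```python
import numpy as np
import pandas as pd

# Hypothetical negative posts joined with author follower counts
neg = pd.DataFrame({
    "post_id": [101, 102, 103],
    "follower_count": [250, 80_000, 12_000],
    "likes": [4, 310, 95],
    "shares": [1, 120, 30],
})

# Reach-weighted priority: engagement scaled by audience size (log-damped)
neg["engagement"] = neg["likes"] + neg["shares"]
neg["priority"] = neg["engagement"] * np.log10(neg["follower_count"] + 1)
triage = neg.sort_values("priority", ascending=False)
print(triage[["post_id", "priority"]])
```

The log damping keeps a mega-account from drowning out everything else while still ranking it first when engagement is also high.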
VADER Limitations
VADER works well for short, informal text but struggles with sarcasm, domain-specific jargon, and context-dependent sentiment. For production systems, consider fine-tuned transformer models.
Knowledge Check
1. What does the VADER compound score represent?
2. Why do we remove stopwords during text preprocessing?
3. What is the purpose of lemmatization in the preprocessing pipeline?
4. In the SQL query for high-influence negative posts, why is follower_count > 10000 used as a filter?
5. What does a 7-day rolling average of sentiment scores help you identify?