Social Media Sentiment Analysis
Analyze customer sentiment from social media posts and product reviews using NLP techniques, VADER sentiment scoring, and time-series tracking to monitor brand perception in real time.
Project Overview
You are part of a brand monitoring team at a consumer electronics company. Customer feedback streams in daily from Twitter, Instagram, Reddit, and product review sites. Leadership wants to understand overall brand sentiment, detect negative sentiment spikes quickly, and identify which product features or topics drive positive or negative reactions. This project focuses on SQL for data extraction and aggregation, and Python for NLP processing, sentiment scoring, and visualization. Excel plays a supporting role for summary reporting.
Dataset Description
The dataset simulates a social listening data warehouse capturing posts and reviews from multiple platforms, storing the raw text alongside pre-computed sentiment scores for analysis.
| Table | Key Columns | Rows |
|---|---|---|
| posts | post_id, user_id, platform, post_text, post_date, likes, shares, comments_count | 25,000 |
| reviews | review_id, user_id, product_id, review_text, rating, review_date, verified_purchase | 12,000 |
| users | user_id, username, platform, follower_count, account_created, location | 18,500 |
| sentiment_scores | score_id, post_id, review_id, compound_score, pos_score, neg_score, neu_score, label | 37,000 |
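Note that each sentiment_scores row links to either a post (post_id) or a review (review_id), with the other key left null, so scores must be joined to their source table separately. A minimal pandas sketch with made-up sample rows mirroring the schema:

```python
import pandas as pd

# Hypothetical sample rows following the table definitions above
posts = pd.DataFrame({
    "post_id": [1, 2],
    "platform": ["twitter", "reddit"],
    "post_text": ["Love the new earbuds!", "Battery drains too fast."],
})
scores = pd.DataFrame({
    "score_id": [10, 11],
    "post_id": [1, 2],
    "review_id": [None, None],   # null because these scores belong to posts
    "compound_score": [0.84, -0.62],
    "label": ["positive", "negative"],
})

# Join scores to posts on the matching key; reviews would join on review_id
post_sentiment = posts.merge(scores, on="post_id", how="inner")
print(post_sentiment[["platform", "compound_score", "label"]])
```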
SQL Analysis
1. Sentiment Distribution by Platform
Aggregate sentiment labels across platforms to compare brand perception on each channel.
SELECT
p.platform,
ss.label,
COUNT(*) AS post_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY p.platform), 1)
AS pct_of_platform
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
GROUP BY p.platform, ss.label
ORDER BY p.platform, post_count DESC;
2. Trending Topics by Sentiment
Identify keywords and topics that appear most frequently in negative posts to flag emerging issues.
-- Extract posts with negative sentiment from the last 30 days
SELECT
p.post_id,
p.platform,
p.post_text,
ss.compound_score,
p.post_date
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
WHERE ss.label = 'negative'
AND p.post_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY ss.compound_score ASC
LIMIT 100;
-- Aggregate negative sentiment by week to spot spikes
SELECT
DATE_TRUNC('week', p.post_date) AS week_start,
COUNT(*) FILTER (WHERE ss.label = 'negative') AS negative_count,
COUNT(*) FILTER (WHERE ss.label = 'positive') AS positive_count,
COUNT(*) AS total_posts,
ROUND(
COUNT(*) FILTER (WHERE ss.label = 'negative') * 100.0
/ NULLIF(COUNT(*), 0), 1
) AS negative_pct
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
GROUP BY DATE_TRUNC('week', p.post_date)
ORDER BY week_start DESC;
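The SQL above surfaces the raw negative posts and weekly spike counts; the actual keyword extraction is easier to do in Python once the text has been cleaned. A minimal sketch using collections.Counter on already-preprocessed text (the sample strings are hypothetical):

```python
from collections import Counter

# Hypothetical cleaned negative-post texts (output of a preprocessing step)
negative_texts = [
    "battery drain update terrible",
    "battery died hour terrible support",
    "update bricked phone support slow",
]

# Tally token frequencies across all negative posts
counts = Counter()
for text in negative_texts:
    counts.update(text.split())

# Most frequent tokens flag recurring complaint themes
for word, n in counts.most_common(5):
    print(f"{word}: {n}")
```

In practice you would run this over the 100 posts returned by the first query, ideally after the stopword removal and lemmatization shown in the Python section.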
3. Source Comparison: Reviews vs. Social Posts
Compare average sentiment scores between product reviews and social media posts.
SELECT
'Social Posts' AS source,
ROUND(AVG(ss.compound_score), 3) AS avg_compound,
ROUND(AVG(ss.pos_score), 3) AS avg_positive,
ROUND(AVG(ss.neg_score), 3) AS avg_negative,
COUNT(*) AS total_items
FROM posts p
JOIN sentiment_scores ss ON ss.post_id = p.post_id
UNION ALL
SELECT
'Product Reviews' AS source,
ROUND(AVG(ss.compound_score), 3),
ROUND(AVG(ss.pos_score), 3),
ROUND(AVG(ss.neg_score), 3),
COUNT(*)
FROM reviews r
JOIN sentiment_scores ss ON ss.review_id = r.review_id;
4. High-Influence Negative Posts
Find negative posts from users with large followings, as these pose the highest reputational risk.
SELECT
u.username,
u.follower_count,
p.platform,
p.post_text,
ss.compound_score,
p.likes + p.shares AS engagement
FROM posts p
JOIN users u ON u.user_id = p.user_id
JOIN sentiment_scores ss ON ss.post_id = p.post_id
WHERE ss.label = 'negative'
AND u.follower_count > 10000
ORDER BY engagement DESC
LIMIT 20;
Python Analysis
1. NLP Preprocessing
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize on newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
posts_df = pd.read_csv('posts.csv')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
"""Clean and normalize social media text."""
text = text.lower()
text = re.sub(r'http\S+|www\S+', '', text) # Remove URLs
text = re.sub(r'@\w+', '', text) # Remove mentions
text = re.sub(r'#(\w+)', r'\1', text) # Keep hashtag text
text = re.sub(r'[^a-z\s]', '', text) # Remove non-alpha
tokens = word_tokenize(text)
tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and len(t) > 2]
return ' '.join(tokens)
posts_df['clean_text'] = posts_df['post_text'].astype(str).apply(preprocess_text)
print(posts_df[['post_text', 'clean_text']].head())
2. VADER Sentiment Scoring
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
def get_sentiment(text):
"""Return VADER sentiment scores."""
scores = sia.polarity_scores(text)
return pd.Series({
'compound': scores['compound'],
'positive': scores['pos'],
'negative': scores['neg'],
'neutral': scores['neu']
})
# Apply VADER to each post
sentiment_df = posts_df['post_text'].astype(str).apply(get_sentiment)
posts_df = pd.concat([posts_df, sentiment_df], axis=1)
# Classify sentiment
posts_df['label'] = posts_df['compound'].apply(
lambda x: 'positive' if x >= 0.05 else ('negative' if x <= -0.05 else 'neutral')
)
print(posts_df['label'].value_counts())
print(f"\nAverage compound score: {posts_df['compound'].mean():.3f}")
3. Word Cloud Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Build corpora of cleaned text for positive and negative posts
positive_text = ' '.join(posts_df[posts_df['label'] == 'positive']['clean_text'])
negative_text = ' '.join(posts_df[posts_df['label'] == 'negative']['clean_text'])
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
wc_pos = WordCloud(width=800, height=400, background_color='white',
colormap='Greens', max_words=100).generate(positive_text)
axes[0].imshow(wc_pos, interpolation='bilinear')
axes[0].set_title('Positive Sentiment Keywords', fontsize=14)
axes[0].axis('off')
wc_neg = WordCloud(width=800, height=400, background_color='white',
colormap='Reds', max_words=100).generate(negative_text)
axes[1].imshow(wc_neg, interpolation='bilinear')
axes[1].set_title('Negative Sentiment Keywords', fontsize=14)
axes[1].axis('off')
plt.tight_layout()
plt.savefig('sentiment_wordclouds.png', dpi=150)
plt.show()
4. Sentiment Time Series
import matplotlib.dates as mdates
posts_df['post_date'] = pd.to_datetime(posts_df['post_date'])
# Daily average sentiment with 7-day rolling average
daily = posts_df.groupby(posts_df['post_date'].dt.date).agg(
avg_compound=('compound', 'mean'),
post_count=('post_id', 'count'),
negative_pct=('label', lambda x: (x == 'negative').mean() * 100)
).reset_index()
daily['post_date'] = pd.to_datetime(daily['post_date'])
daily['rolling_avg'] = daily['avg_compound'].rolling(7, min_periods=1).mean()
fig, ax1 = plt.subplots(figsize=(14, 5))
ax1.plot(daily['post_date'], daily['avg_compound'], alpha=0.3, color='#3498db', label='Daily')
ax1.plot(daily['post_date'], daily['rolling_avg'], color='#2c3e50', linewidth=2, label='7-day MA')
ax1.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax1.set_ylabel('Compound Sentiment Score')
ax1.set_title('Brand Sentiment Over Time')
ax1.legend(loc='upper left')
ax2 = ax1.twinx()
ax2.bar(daily['post_date'], daily['post_count'], alpha=0.15, color='#95a5a6', label='Post Volume')
ax2.set_ylabel('Daily Post Volume')
ax2.legend(loc='upper right')
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
plt.tight_layout()
plt.savefig('sentiment_timeseries.png', dpi=150)
plt.show()
Key Insights
Sentiment Spikes
Sudden drops in compound sentiment often correlate with product launches, outages, or PR incidents. Monitoring the 7-day moving average helps separate noise from genuine trend shifts.
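One way to turn that monitoring into an alert, sketched here on a synthetic daily series: compare each day's negative share to a trailing baseline and flag days that exceed it by a chosen number of standard deviations (the 7-day window and 2.0 threshold are assumptions to tune).

```python
import pandas as pd

# Synthetic daily negative-percentage series with a spike on the last day
negative_pct = pd.Series([12.0, 11.5, 13.0, 12.2, 11.8, 12.5, 12.1, 25.0])

# Trailing baseline and spread, shifted so today is compared to prior days only
baseline = negative_pct.rolling(7, min_periods=3).mean().shift(1)
spread = negative_pct.rolling(7, min_periods=3).std().shift(1)

# Flag days where negative share jumps well above the recent baseline
z = (negative_pct - baseline) / spread
spikes = z > 2.0
print(negative_pct[spikes])
```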
Platform Differences
Twitter/X posts tend to be more polarized (strong positive or negative) while Reddit discussions are more nuanced. Review sites skew toward extreme ratings, creating a U-shaped distribution.
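The U-shape claim can be checked directly from the reviews table by measuring what share of ratings sit at the extremes. A sketch with made-up star ratings:

```python
import pandas as pd

# Hypothetical star ratings from the reviews table
ratings = pd.Series([5, 5, 1, 4, 5, 1, 1, 5, 3, 5, 2, 1])

# Normalized rating distribution; a U-shape concentrates mass at 1 and 5
distribution = ratings.value_counts(normalize=True).sort_index()
extreme_share = distribution.loc[[1, 5]].sum()

print(distribution)
print(f"Share of 1- and 5-star reviews: {extreme_share:.0%}")
```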
Influencer Risk
A single negative post from a high-follower account can generate more engagement and reputational damage than hundreds of low-reach complaints. Prioritize response based on reach and engagement.
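A simple triage rule following that logic, sketched below: rank negative posts by a reach-weighted score (engagement scaled by the log of follower count; both the weighting and the sample data are assumptions, not a prescribed formula).

```python
import numpy as np
import pandas as pd

# Hypothetical negative posts joined with author follower counts
neg = pd.DataFrame({
    "post_id": [101, 102, 103],
    "follower_count": [250, 80_000, 12_000],
    "likes": [4, 310, 95],
    "shares": [1, 120, 30],
})

# Reach-weighted priority: engagement scaled by audience size (log-damped)
neg["engagement"] = neg["likes"] + neg["shares"]
neg["priority"] = neg["engagement"] * np.log10(neg["follower_count"] + 1)
triage = neg.sort_values("priority", ascending=False)
print(triage[["post_id", "priority"]])
```

The log damping keeps a mega-account from drowning out everything else while still ranking it first when engagement is also high.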
VADER Limitations
VADER works well for short, informal text but struggles with sarcasm, domain-specific jargon, and context-dependent sentiment. For production systems, consider fine-tuned transformer models.
Knowledge Check
1. What does the VADER compound score represent?
2. Why do we remove stopwords during text preprocessing?
3. What is the purpose of lemmatization in the preprocessing pipeline?
4. In the SQL query for high-influence negative posts, why is follower_count > 10000 used as a filter?
5. What does a 7-day rolling average of sentiment scores help you identify?