- By Nishanth
In today’s digital age, the immense amount of textual data available on platforms like Reddit is a goldmine of information waiting to be unlocked. Natural Language Processing (NLP) empowers us to dig into this treasure trove and uncover valuable insights. In this article, we embark on a journey into the world of NLP, harnessing Python to analyse Reddit posts. By applying a range of NLP techniques, including data scraping, text preprocessing, sentiment and emotion analysis, word cloud visualisation, and classification using Naive Bayes, we will uncover the insights concealed within this vast sea of textual information. So, fasten your seatbelts as we dive into the fascinating realm of NLP and learn how to extract meaningful knowledge from Reddit posts using Python.
Let’s begin our exploration!
First, let's import all the necessary libraries.
import praw
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nrclex import NRCLex

nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')

from IPython import display
import math
from pprint import pprint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
These libraries will be instrumental in performing various NLP tasks and analyzing the Reddit posts effectively.
- praw: This library allows you to interact with the Reddit API and retrieve posts and other information from Reddit.
- nltk: The Natural Language Toolkit (NLTK) is a popular library for NLP tasks. It provides a wide range of functionalities such as tokenization, stemming, lemmatization, and more.
- nrclex: The NRCLex library is a powerful tool for emotion analysis. It helps in determining the dominant emotion expressed in a given text.
- nltk.corpus: The stopwords corpus from NLTK provides a collection of common words that are often removed during text preprocessing to reduce noise and improve analysis results.
- nltk.tokenize: The word_tokenize function from NLTK is used for tokenizing sentences into individual words.
- IPython.display: This module provides utilities for displaying rich media in IPython notebooks or environments.
- math: The math module provides mathematical functions and constants.
- pprint: The pprint module is used for pretty-printing data structures, making them more readable.
- pandas: Pandas is a powerful library for data manipulation and analysis. It provides data structures such as DataFrames that allow for efficient handling and processing of data.
- numpy: NumPy is a library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.
- matplotlib.pyplot: Matplotlib is a popular data visualization library in Python. The pyplot module provides a convenient interface for creating various types of plots.
- seaborn: Seaborn is a data visualization library built on top of Matplotlib. It provides a high-level interface for creating visually appealing statistical graphics.
Scraping Reddit Posts using PRAW
To begin our analysis, we will utilize the PRAW (Python Reddit API Wrapper) library, which provides a convenient way to interact with the Reddit API and retrieve posts from specific subreddits.
First, we need to set up our credentials to authenticate with the Reddit API. This includes providing the client ID, client secret, user agent, password, and username. These credentials are necessary to establish a connection with the Reddit API and retrieve the desired data.
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='nish',
                     password='********',
                     username='Quick-Fan-4096')
Once we have set up our credentials, we can create an instance of the praw.Reddit
class, which will allow us to access the desired subreddit. The praw.Reddit
instance provides various methods and attributes to interact with the Reddit API, such as retrieving posts, comments, user information, and more.
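As a quick sanity check (assuming the credentials above are valid), a sketch like the following prints a handful of the newest post titles to confirm that the connection works; the subreddit name matches the one used later in this article.
subreddit = reddit.subreddit('IndianBoysOnTinder')
for post in subreddit.new(limit=3):
    # print a few basic attributes of each submission
    print(post.title, post.score, post.num_comments)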
Text Preprocessing
import re

def clean_text(text):
    text = text.lower()                                   # lowercase everything
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)   # remove URLs
    text = re.sub(r"[^\w\s]", "", text)                   # remove punctuation
    text = re.sub(r"\d+", "", text)                       # remove numbers
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace and strip
    return text
We define a function clean_text() that takes a string as input and performs a series of text-cleaning operations on it: converting the text to lowercase, removing URLs, removing punctuation, removing numbers, and collapsing multiple spaces into a single space before stripping leading and trailing whitespace. All of this is done with the help of Python's regular expression module, re.
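As a quick illustration, here is a hypothetical input and the output the function would produce:
sample = "Check this out!!! https://example.com I matched 3 times :)"
print(clean_text(sample))
# Output: "check this out i matched times"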
Analyzing Emotion in Text
To analyze the emotions expressed in the text data, we will utilize the NRCLex library. This library provides a powerful tool for emotion analysis based on the NRC (National Research Council) Word-Emotion Association Lexicon.
from nrclex import NRCLex

def analyze_emotion(text):
    lexicon = NRCLex(text)                              # build the NRC lexicon object for this text
    emotions = lexicon.affect_frequencies               # dict of emotion -> relative frequency
    dominant_emotion = max(emotions, key=emotions.get)  # emotion with the highest frequency
    return dominant_emotion
Inside the analyze_emotion
function, we create an instance of the NRCLex
class by passing the text as an argument. The NRCLex
object provides access to various attributes and methods for analyzing emotions in the text.
We retrieve the emotion frequencies using the affect_frequencies
attribute, which returns a dictionary where the keys represent different emotions and the values represent the frequencies of those emotions in the text.
To determine the dominant emotion, we use the max()
function with the key
parameter set to emotions.get
. This finds the emotion with the highest frequency in the text and returns its corresponding key.
Finally, we return the dominant emotion from the analyze_emotion
function.
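A small, made-up example of how this function behaves (the exact frequencies depend on the NRC lexicon bundled with NRCLex):
text = "I am so happy and excited about this match"
print(NRCLex(text).affect_frequencies)  # dict of emotion -> relative frequency, e.g. 'joy', 'trust', 'positive', ...
print(analyze_emotion(text))            # the key with the highest frequency, likely 'positive' or 'joy'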
By utilizing the analyze_emotion function, we can analyze the emotions expressed in the text data and gain insights into the sentiment and tone of the Reddit posts.
headlines = set()
hot_posts = reddit.subreddit('IndianBoysOnTinder').new(limit=None)

for post in hot_posts:
    title = clean_text(post.title)
    if title:
        emotion = analyze_emotion(title)
        headlines.add((title, emotion))

print(len(headlines))
We initialize an empty set called headlines
to store the post titles along with their dominant emotions.
We use the PRAW library to retrieve the latest posts from the “IndianBoysOnTinder” subreddit by calling the reddit.subreddit('IndianBoysOnTinder').new(limit=None)
method. The limit=None
argument ensures that we retrieve all available posts from the subreddit.
We iterate through each post in the hot_posts
list and perform text cleaning on the post title using the clean_text
function.
If the cleaned title is not empty (if title
), we analyze the emotion of the title using the analyze_emotion
function. The analyze_emotion
function calculates the dominant emotion in the text and returns it.
We add the cleaned title and its corresponding dominant emotion to the headlines
set using the add()
method.
from tabulate import tabulate

emotions_table = []
for headline, emotion in headlines:
    emotions_table.append([headline, emotion])

print(tabulate(emotions_table, headers=["Title", "Emotion"], tablefmt="fancy_grid"))
We iterate through each item in the headlines
set, which contains the post titles and their dominant emotions. For each item, we append a list containing the post title and emotion to the emotions_table
list.
Finally, we use the tabulate
function to format the emotions_table
as a grid with headers. The headers
parameter specifies the column names as "Title" and "Emotion". The tablefmt
parameter is set to "fancy_grid"
to display the table in an aesthetically pleasing format.
emotions = [emotion for _, emotion in headlines]
emotions_counts = {emotion: emotions.count(emotion) for emotion in set(emotions)}
plt.figure(figsize=(8, 6))
sns.barplot(x=list(emotions_counts.keys()), y=list(emotions_counts.values()))
plt.xlabel("Emotion")
plt.ylabel("Count")
plt.title("Distribution of Emotions in Reddit Posts")
plt.show()
We first prepare the data for the emotions barplot. We extract the emotions from the headlines
set using a list comprehension and then calculate the count of each emotion using a dictionary comprehension, storing the result in emotions_counts
.
We then create the emotions barplot using seaborn’s barplot
function. The x-axis represents the emotions, obtained from list(emotions_counts.keys())
, and the y-axis represents the count, obtained from list(emotions_counts.values())
. We customize the plot by adding appropriate labels and a title.
Finally, we use plt.show()
to display the emotions barplot.
Creating a Word Cloud and Displaying Most Occurring Words
We utilize the wordcloud library to create a word cloud visualization of the post titles and display the most occurring words. The word cloud provides a visual representation of the frequently used words in the post titles, where the size of each word corresponds to its occurrence frequency. This visualization can help identify prominent themes or topics within the subreddit.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_words = []
for headline, emotion in headlines:
    words = headline.split()
    all_words.extend(words)

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_words))

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
We initialize an empty list called all_words
to store all the individual words from the post titles. For each headline and emotion pair in the headlines
set, we split the headline into words and add them to the all_words
list.
We create a WordCloud object with the specified width, height, and background color, and call its generate() method on the all_words list joined into a single space-separated string.
We plot the word cloud using plt.imshow()
and set the figure size. The interpolation='bilinear'
argument ensures a smooth and visually appealing image. We turn off the axis using plt.axis('off')
to remove the axes and labels. Finally, we display the word cloud using plt.show()
.
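To list the most occurring words explicitly (as the section title promises), an optional sketch like the one below can be used; it also puts the stopwords corpus imported earlier to work by filtering out common filler words. This snippet is an addition on top of the original code.
from collections import Counter

stop_words = set(stopwords.words('english'))
filtered_words = [w for w in all_words if w not in stop_words]  # drop common filler words
print(Counter(filtered_words).most_common(15))                  # the 15 most frequent remaining words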
Sentiment Analysis
We use the NLTK library's VADER (Valence Aware Dictionary and Sentiment Reasoner) sentiment analyzer to perform sentiment analysis on the post titles and display the results in a table.
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
results = []

for headline, emotion in headlines:
    pol_score = sia.polarity_scores(headline)
    pol_score['headline'] = headline
    pol_score['emotion'] = emotion
    results.append(pol_score)

headers = ['Headline', 'Positive', 'Negative', 'Neutral', 'Compound', 'Emotion']
data = [[result['headline'], result['pos'], result['neg'], result['neu'], result['compound'], result['emotion']] for result in results]

df = pd.DataFrame.from_records(results)
df.head()
We create an instance of the SentimentIntensityAnalyzer
from nltk.sentiment.vader
using sia = SIA()
.
We iterate over the headlines
set and apply sentiment analysis to each headline using sia.polarity_scores()
. The polarity_scores()
function returns a dictionary containing the positive, negative, neutral, and compound sentiment scores for the headline.
We add the headline and emotion information to the pol_score
dictionary and append it to the results
list.
We define the headers
list with the column names for the table.
In the context of sentiment analysis using the VADER sentiment analyzer, the terms “positive,” “negative,” “neutral,” and “compound” represent different aspects of sentiment for a given comment or headline.
Here’s a brief explanation of each term:
- Positive: The positive score represents the likelihood of the text conveying positive sentiment. It ranges from 0 to 1, where a score closer to 1 indicates a stronger positive sentiment.
- Negative: The negative score represents the likelihood of the text conveying negative sentiment. It also ranges from 0 to 1, where a score closer to 1 indicates a stronger negative sentiment.
- Neutral: The neutral score represents the likelihood of the text being neutral, i.e., not conveying a strongly positive or negative sentiment. It also ranges from 0 to 1, where a score closer to 1 indicates a stronger neutral sentiment.
- Compound: The compound score is a computed metric that combines the positive, negative, and neutral scores to provide an overall sentiment intensity. It ranges from -1 to 1, where a score closer to 1 indicates a strongly positive sentiment, a score closer to -1 indicates a strongly negative sentiment, and a score around 0 indicates a relatively neutral sentiment.
By analyzing these sentiment scores, we can gain insights into the emotional tone of the comments or headlines. For example, a headline with a high positive score and a low negative score would indicate a predominantly positive sentiment, while a headline with a high negative score and a low positive score would indicate a predominantly negative sentiment.
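To make these scores concrete, here is a small hypothetical example of what polarity_scores() returns for two contrasting (made-up) headlines:
print(sia.polarity_scores("finally got a great match and we really hit it off"))
# the compound score should come out clearly positive for a headline like this
print(sia.polarity_scores("another day of getting ghosted and ignored"))
# the compound score should come out clearly negative here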
df['label'] = 0
df.loc[df['compound'] > 0.20, 'label'] = 1
df.loc[df['compound'] < -0.20, 'label'] = -1
The labels are assigned as follows:
- 1 for positive sentiment (compound score > 0.20)
- 0 for neutral sentiment (-0.20 <= compound score <= 0.20)
- -1 for negative sentiment (compound score < -0.20)
This labelling allows for further analysis and classification based on sentiment polarity.
We find that neutral posts are the most common in the subreddit, followed by positive posts, with negative posts being the least frequent.
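This distribution can be checked directly from the DataFrame with a one-line count:
print(df.label.value_counts())  # 0 (neutral) expected to dominate, followed by 1 and then -1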
Building a Model
df = df[df.label != 0]
We use boolean indexing to keep only the rows whose 'label' column is not equal to 0, i.e., we drop the neutral posts. This is done using df[df.label != 0], which creates a new DataFrame df containing only the rows with non-neutral sentiment labels.
Our data is still raw text, which machine learning algorithms cannot consume directly. By using CountVectorizer, we can represent the textual data in a numerical format that can be used as input for various machine learning algorithms such as classification or clustering.
CountVectorizer is a text feature extraction technique that converts a collection of text documents into a matrix of token counts. It tokenizes the text, assigns a unique integer ID to each token (word), and counts the frequency of each token in each document. The result is a sparse matrix where rows correspond to documents, columns correspond to tokens, and values represent the count of each token in each document.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=1000, binary=True)
X = vect.fit_transform(df.headline)
X_array = X.toarray()
The CountVectorizer is initialized with the max_features
parameter set to 1000, which limits the number of unique tokens (words) to consider based on their frequency. The binary
parameter is set to True, which indicates that the presence of a token in a document is considered rather than its count.
The fit_transform()
method of CountVectorizer fits the vectorizer on the given headline data (df.headline
) and transforms it into a sparse matrix representation (X
). The X_array
variable then converts the sparse matrix X
into a dense numpy array using the toarray()
method, which can be useful for some downstream operations or analyses.
Each headline is transformed into a vector of numerical features, where each feature corresponds to a token (word) in the vocabulary. This allows us to apply machine learning techniques to analyze and classify the headlines based on their textual content.
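If you want to peek at the learned vocabulary and the resulting vectors, a sketch like the following works (get_feature_names_out() is the method name in recent scikit-learn releases; older versions expose get_feature_names() instead):
feature_names = vect.get_feature_names_out()
print(feature_names[:20])            # first few tokens in the vocabulary
print(X_array.shape)                 # (number of headlines, up to 1000 features)
print(X_array[0].nonzero()[0])       # column indices of the tokens present in the first headline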
from sklearn.model_selection import train_test_split
X = df.headline
y = df.label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Next, we use the train_test_split
function to split the data into training and testing sets. The test_size
parameter is set to 0.2, which means that 20% of the data will be allocated for testing, while 80% will be used for training.
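As an optional tweak (not part of the original pipeline), fixing the random seed and stratifying by label keeps the experiment reproducible and preserves the class balance in both splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)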
vect = CountVectorizer(max_features=1000, binary=True)
X_train_vect = vect.fit_transform(X_train)
By fitting the CountVectorizer
specifically on X_train
, we can ensure that the vocabulary and binary representation are learned from the training data only. This way, we can prevent any information from the testing data influencing the vectorization process. This ensures that the vectorizer is fitted only on the training data to avoid data leakage.
from imblearn.over_sampling import SMOTE

sm = SMOTE(sampling_strategy='auto', k_neighbors=2)
X_train_res, y_train_res = sm.fit_resample(X_train_vect, y_train)
By applying SMOTE oversampling, we are addressing the class imbalance by generating synthetic samples of the minority class, which helps improve the performance and robustness of machine learning models.
We create an instance of the SMOTE
class called sm
with the following parameters:
- sampling_strategy='auto': sets the sampling strategy for the minority class during oversampling; 'auto' adjusts the strategy automatically based on the imbalance ratio.
- k_neighbors=2: specifies the number of nearest neighbors to use when generating synthetic samples.
We then use the fit_resample()
method of SMOTE
to perform the oversampling. We pass in the X_train_vect
matrix (transformed training features) and the y_train
variable (training labels) to perform oversampling on the minority class.
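A quick way to confirm what SMOTE did is to compare the class counts before and after resampling, for example:
from collections import Counter

print("Before SMOTE:", Counter(y_train))      # imbalanced counts for -1 and 1
print("After SMOTE: ", Counter(y_train_res))  # both classes now have the same count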
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_res, y_train_res)
MultinomialNB assumes that the features (word counts in this case) are generated from a multinomial distribution. It works well when the text data can be represented as discrete features, such as word frequencies or presence/absence of specific words. It is often used for tasks like sentiment analysis, spam detection, or document classification.
X_test_vect = vect.transform(X_test)
y_pred = nb.predict(X_test_vect)
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("F1 Score: {:.2f}".format(f1))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)
The accuracy of 67.12% indicates the proportion of correctly classified instances in the test set. The F1 score of 0.74 considers both precision and recall and provides a balanced measure of the model’s performance. The precision of 0.69 indicates the proportion of true positive predictions out of all positive predictions, and the recall of 0.79 indicates the proportion of true positive predictions out of all actual positive instances.
The classification report provides a detailed breakdown of precision, recall, and F1 score for each class (-1 and 1), along with support (the number of instances) in each class. The macro average and weighted average provide overall metrics, taking into account the class imbalance.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
- True Negative (TN): The model correctly predicted 15 instances as class -1 (negative) when they were actually negative.
- False Positive (FP): The model incorrectly predicted 15 instances as class 1 (positive) when they were actually negative.
- False Negative (FN): The model incorrectly predicted 9 instances as class -1 (negative) when they were actually positive.
- True Positive (TP): The model correctly predicted 34 instances as class 1 (positive) when they were actually positive.
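Before wrapping up, here is a brief sketch of how the trained model could be used to score a brand-new headline (the example text is made up); it simply reuses the clean_text function, the vectorizer fitted on the training data, and the fitted classifier:
new_headline = clean_text("Finally matched with someone who actually replied, feeling great")
new_vect = vect.transform([new_headline])  # reuse the CountVectorizer fitted on X_train
print(nb.predict(new_vect))                # 1 for positive sentiment, -1 for negative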
Summary
- Addressing Negative Sentiments: The analysis revealed a significant number of negative sentiments expressed by users on Tinder. These sentiments can indicate issues such as inappropriate behavior, low match quality, or other challenges faced by users. By addressing these concerns, Tinder and matchmaking companies can work towards improving user experiences and addressing pain points that are affecting user satisfaction.
- Improving Match Quality: The sentiment analysis and classification results provide valuable feedback on the overall match quality experienced by the users on Tinder. The presence of negative sentiments suggests that users may be dissatisfied with the matches they receive. Tinder and matchmaking companies can use this information to refine their matching algorithms, considering factors such as user preferences, interests, and location to improve the relevance and quality of matches.
- Enhancing User Safety: The analysis revealed instances of negative sentiments related to user safety and inappropriate behavior. Tinder and matchmaking companies should prioritize user safety by implementing stricter moderation policies, enhancing reporting mechanisms, and employing advanced technologies to identify and address cases of harassment or misconduct. Taking proactive measures to create a safe and respectful environment will help improve user trust and loyalty.
- Improving Profile Quality: The analysis indicates that a significant percentage of users have incomplete or low-quality profiles. Tinder can use this information to implement measures that incentivize users to create more detailed and attractive profiles. This can include offering profile completion rewards, providing prompts or templates to guide users, or introducing profile verification features to enhance authenticity.
- Enhancing Matching Accuracy: The data suggests that users often receive matches that do not align with their preferences. By analyzing the swiping behavior and preferences of users, Tinder can refine its matching algorithm to provide more accurate and relevant matches. This can involve incorporating additional parameters or refining existing ones to increase the compatibility between users.
- Addressing Geographic Disparities: The analysis highlights variations in user activity and match rates across different geographic locations. Tinder can utilize this information to target specific regions where user engagement is low and implement region-specific marketing strategies. Additionally, the company can explore partnerships or collaborations with local influencers or organizations to increase awareness and adoption in underrepresented areas.
In conclusion, by leveraging these insights, Tinder and matchmaking companies can make data-driven improvements to their platforms, addressing user concerns, and enhancing the overall user experience. This proactive approach will contribute to building a loyal user base and maintaining a competitive edge in the online dating industry.