- By Srivatsan and Medha
INTRODUCTION
A recommendation system is an algorithm that suggests relevant items to the user: on Netflix, which movie to watch; on the Play Store, which app to install; and on an e-commerce site, which products to buy.
Recommendation Systems date back to the 1990s when collaborative filtering emerged, predicting user preferences based on similar behaviors. A pivotal moment occurred in 2006 with Netflix’s million-dollar challenge, catalyzing advancements in personalized suggestions. The 2010s witnessed the ascendancy of machine learning and AI, enabling more sophisticated algorithms. Streaming services like Netflix and Spotify harnessed these technologies to create dynamic recommendations grounded in user behavior, demographics, and contextual insights. Today, these systems, integrating collaborative and deep learning techniques, stand as sophisticated orchestrators of digital content discovery, reshaping how audiences engage with online platforms.
Businesses using Recommendation systems:
- Amazon: Amazon suggests products based on a user’s browsing history, purchase history, and similar users’ preferences.
- Netflix: Netflix employs recommendation algorithms to suggest movies and TV shows based on a user’s viewing history, ratings, and preferences.
- Spotify: Spotify uses recommendation systems to suggest music based on a user’s listening habits, genre preferences, and collaborative filtering.
- YouTube: YouTube recommends videos based on a user’s watch history, likes, and subscriptions.
- Instagram: Instagram suggests posts, stories, and accounts to follow based on user interactions and preferences.
Why do businesses use recommendation systems?
Recommendation systems have the potential to drive revenue growth: personalized, relevant recommendations create more cross-selling and upselling opportunities.
Recommendation systems can also raise customer satisfaction, which leads to better customer retention. They excel at delivering a personalized experience.
For instance, Netflix’s total revenue increased from $15.8 billion in 2018 to $25 billion in 2020, showcasing a substantial surge in just two years. The surge in revenue coincides with the period during which Netflix intensified its focus on personalized content recommendations, investing heavily in original productions and refining its algorithmic suggestions.
Architecture of Recommendation Systems
- Candidate Generation: The system starts from a huge corpus and generates a much smaller subset of candidates (hundreds or thousands).
- Scoring: The system scores and ranks the candidates to select the set of items to display to the user (usually on the order of 10).
- Re-Ranking: The system must consider additional constraints for the final ranking.
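The three stages above can be sketched on a toy catalog. This is only an illustration under simple assumptions: the catalog, the precomputed scores, and the function names are all hypothetical, and a real system would use learned models at each stage.

```python
# Toy three-stage pipeline; the catalog, scores and function names are hypothetical.
CATALOG = {
    "m1": {"genre": "comedy", "score": 0.9},
    "m2": {"genre": "comedy", "score": 0.7},
    "m3": {"genre": "drama", "score": 0.8},
    "m4": {"genre": "comedy", "score": 0.6},
    "m5": {"genre": "action", "score": 0.95},
}

def generate_candidates(catalog, liked_genre):
    """Stage 1: cheaply narrow the full corpus to a small candidate set."""
    return [item_id for item_id, item in catalog.items() if item["genre"] == liked_genre]

def score_candidates(catalog, candidates):
    """Stage 2: rank the candidates with a (here trivial) scoring model."""
    return sorted(candidates, key=lambda item_id: catalog[item_id]["score"], reverse=True)

def rerank(ranked, already_seen, k):
    """Stage 3: apply extra constraints, e.g. drop items the user has already seen."""
    return [item_id for item_id in ranked if item_id not in already_seen][:k]

candidates = generate_candidates(CATALOG, "comedy")
ranked = score_candidates(CATALOG, candidates)
final = rerank(ranked, already_seen={"m1"}, k=2)
print(final)
```

Keeping the stages separate lets the first stage stay cheap enough to scan the whole corpus, while the more expensive scoring only runs on the short candidate list.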
Types of Recommendation Systems
- Content-based Filtering: recommends items similar to ones the user has liked in the past. For example, if you liked comedy movies, the system suggests more comedy films, focusing on your preferences.
- Collaborative Filtering: recommends items based on the preferences of similar users. For example, if you and a friend like similar movies, the system suggests movies your friend enjoyed but you haven’t seen.
- Hybrid Recommendation Systems: Hybrid systems combine collaborative and content-based methods for more accurate suggestions. By considering both user behavior and item characteristics, these systems offer a balanced approach.
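The collaborative filtering example above (suggesting what a like-minded friend enjoyed) can be sketched in a few lines. The users, titles, and the choice of Jaccard overlap as the similarity measure are assumptions made for this illustration; production systems typically work with rating matrices and learned embeddings.

```python
# Hypothetical liked-title sets per user.
likes = {
    "you": {"Movie A", "Movie B", "Movie C"},
    "friend": {"Movie A", "Movie B", "Movie D"},
    "stranger": {"Movie X", "Movie Y"},
}

def jaccard(a, b):
    """Overlap between two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def collaborative_recommend(user, likes):
    """Suggest what the most similar other user liked that `user` has not seen."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[nearest] - likes[user])

suggestions = collaborative_recommend("you", likes)
print(suggestions)
```

Note that nothing about the content of the titles is used here; only the overlap in user behavior matters, which is the key difference from content-based filtering.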
Understanding Content-Based Recommender Systems
1. Feature Engineering:
Content-based recommender systems rely on the extraction and representation of various features associated with items. These features span multiple media formats such as text, pictures, and videos. Each format demands specific feature engineering to create a comprehensive vector that encapsulates the item’s characteristics.
2. User Preferences Inference:
The primary challenge lies in deciphering the specific features that drive a user’s liking for an item. Users generate extensive content on the web, providing a wealth of information that can be harnessed to understand their preferences. By analyzing this content, recommender systems aim to discern the explicit preferences of users and match items relevant to those preferences.
3. Effectiveness in Dynamic Content Domains:
Content-based recommender systems shine in domains characterized by dynamic content. In scenarios where collaborative filtering struggles, content-based methods excel. The ability to adapt to evolving user preferences and dynamic content landscapes makes this approach highly effective.
Objective
The objective of this project is to explore and understand recommendation systems, a vital tool for businesses seeking to enhance revenue and customer retention in the digital era. We build a content-based recommendation system in Python using a Natural Language Processing (NLP) technique, the Bag of Words model, together with scikit-learn’s CountVectorizer and cosine_similarity. Through this hands-on approach, the project seeks to gain insight into the underlying principles of recommendation systems, their implementation, and their potential impact on businesses and user experiences.
Brief Overview of the Dataset
The dataset comprises Netflix movies and TV shows released between 2015 and 2021, totaling approximately 9,425 items. It’s sourced from Kaggle, a reputable platform for datasets and data science projects. This comprehensive dataset offers a rich repository of information, enabling in-depth analysis and exploration of trends in Netflix’s content library over recent years.
Dataset Link: Netflix Dataset Latest 2021 (kaggle.com)
- Title: The name of the Netflix movie or TV show.
- Genre: The category or type of content, such as drama, comedy, action, etc.
- Keywords: Keywords or descriptors associated with the content.
- Languages: The languages in which the content is available.
- Series or Movie: Indicates whether the item is a series (TV show) or a movie.
- Hidden Gem Score: A rating or score indicating the perceived quality or value of the content.
- Country Availability: The countries where the content is accessible on Netflix.
- Runtime: The duration of the movie or TV show.
- Director: The director of the content.
- Writer: The individual(s) credited with writing the content.
- Actors: The cast members who appear in the movie or TV show.
- View Rating: The age or content rating assigned to the item.
- IMDb Score: The rating of the content on IMDb, an online database of movies and TV shows.
- Rotten Tomatoes Score: The rating of the content on Rotten Tomatoes, a review-aggregation website.
- Metacritic Score: The rating of the content on Metacritic, a review-aggregation website.
- Awards Received: The number of awards the content has won.
- Awards Nominated For: The number of awards for which the content has been nominated.
- Box-office: The revenue generated by the content at the box office (if applicable).
- Release Date: The original release date of the content.
- Netflix Release Date: The date when the content was made available on Netflix.
- Production House: The company responsible for producing the content.
- Netflix Link: A link to the content’s page on Netflix.
- IMDb Link: A link to the content’s page on IMDb.
- Summary: A brief description or synopsis of the content.
- IMDb Votes: The number of votes or ratings the content has received on IMDb.
- Image Poster: The poster or artwork associated with the content.
- TMDb Trailer: The trailer of the content available on The Movie Database (TMDb).
- Trailer Site: The website where the trailer can be viewed.
Insights from Data Analysis
This plot highlights that the most prevalent genres on Netflix are drama, comedy, action, thriller, and romance, suggesting a strong audience preference for these genres in the content library.
This count plot reveals a prevalence of movies over series in Netflix’s content library, indicating a stronger emphasis on standalone cinematic experiences within the platform.
This count plot underscores a Netflix trend where movies with a duration of 1–2 hours dominate, suggesting a preference for concise cinematic experiences among viewers.
This heatmap shows a surge in movie additions in the last three months of the year, with a recurring December peak observed across all years. December 2020 stands out with the highest number of additions in the period covered.
Steps Involved
Data Cleaning:
- Columns with many missing values, such as ‘Hidden Gem Score’, ‘View Rating’, ‘Rotten Tomatoes Score’, ‘Metacritic Score’, ‘Awards Received’, ‘Awards Nominated’, ‘Boxoffice’, ‘Production House’ and ‘IMDb votes’, were dropped to ensure data quality and reliability.
- Columns like ‘Netflix Link’, ‘IMDb Link’, ‘Poster’, ‘TMDb Trailer’ and ‘Trailer site’ were also dropped because they do not contribute directly to the analysis objectives; removing them makes the dataset more streamlined.
- Null values in the ‘IMDb Score’ column were replaced with the median value, a common imputation choice because the median is not distorted by extreme scores.
- Null values in columns such as ‘Director’, ‘Actors’, ‘Languages’, ‘Genre’, ‘Key Words’ and ‘Summary’ were replaced with an empty string (‘’), since a mean or median is not defined for these text columns.
# Created a copy with selected columns, which are to be kept for further analysis.
# Column names follow the spellings used in the rest of this post.
selected_columns = ['Title', 'Genre', 'Key Words', 'Languages', 'Series or Movie', 'Runtime', 'Director', 'Actors', 'Summary', 'Release Date', 'IMDb Score', 'Image']
df = df[selected_columns].copy()
df.head()
# Filling missing values in the IMDb Score column with its median value.
df['IMDb Score'] = df['IMDb Score'].fillna(df['IMDb Score'].median())
# Filling NA values in the text columns with an empty string.
# Assigning the result back avoids relying on inplace=True on a column selection.
text_columns = ['Director', 'Languages', 'Actors', 'Genre', 'Key Words', 'Summary']
df[text_columns] = df[text_columns].fillna('')
Text pre-processing:
- We cleaned the text data by removing stop words, punctuation, and other irrelevant characters.
- We performed stemming and lemmatization to reduce words to their base forms, using the NLTK library.
# removing punctuation
import string

def remove_punctuation(text):
    # Non-string values (e.g. NaN) are returned in their string form.
    if isinstance(text, str):
        return "".join(ch for ch in text if ch not in string.punctuation)
    return str(text)

# storing the punctuation free text
# remove spaces between names, so that each multi-word name becomes a single token
for col in ['Genre', 'Actors', 'Key Words', 'Director']:
    df[col] = df[col].str.replace(' ', '')

# converting to lower case
def lowercase(text):
    if isinstance(text, str):
        return text.lower()
    return str(text)

for col in ['Title', 'Genre', 'Key Words', 'Languages', 'Director', 'Actors', 'Summary', 'Series or Movie']:
    df[col] = df[col].apply(lowercase)
df.head()
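The stop-word removal mentioned in the bullets above is not shown in the code. A minimal pure-Python sketch of that cleaning step might look like the following. The stop-word set is a tiny hypothetical stand-in for NLTK’s much longer English list, the function name clean_text is made up for this example, and stemming and lemmatization are left out.

```python
import string

# Tiny hypothetical stop list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}

def clean_text(text):
    """Lowercase, strip punctuation, and drop stop words from one string."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

cleaned = clean_text("The Story of a Spy, and the City!")
print(cleaned)
```

Only the content-bearing words survive, which keeps frequent but uninformative words from dominating the similarity scores later on.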
Creating Tags:
- We identified the key features in the dataset that we can use for recommendations: ‘Genre’, ‘Key Words’, ‘Director’, ‘Actors’, ‘Languages’, ‘Summary’ and ‘Series or Movie’. We combined all of these features to create a new column, ‘Tags’.
- We created a new Pandas DataFrame consisting of ‘Title’, ‘Tags’, ‘Runtime’, ‘IMDb Score’, ‘Release Year’ and ‘Image’.
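# keeping only required columns to make tags
# A space is added between columns so the last word of one column
# does not fuse with the first word of the next.
df['Tags'] = (df['Genre'] + ' ' + df['Key Words'] + ' ' + df['Director'] + ' ' + df['Actors'] + ' '
              + df['Languages'] + ' ' + df['Summary'] + ' ' + df['Series or Movie'])
df.head()
# Assumption: 'Release Year' is derived from 'Release Date' (pandas imported as pd).
df['Release Year'] = pd.to_datetime(df['Release Date'], errors='coerce').dt.year
new_df = df[['Title', 'Tags', 'Runtime', 'IMDb Score', 'Release Year', 'Image']].copy()
new_df.head()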
Vectorizing the Tags using Count Vectorizer:
- Using the CountVectorizer class from the sklearn.feature_extraction.text module, we converted the tags into numerical vectors.
- For each item (movie), we created a vector that records how often each vocabulary word appears in that movie’s tags.
- CountVectorizer works by finding the unique words in the corpus (the text of all tags combined) and counting the occurrences of each word; in our case we limited the vocabulary to the 6,000 most frequent words. It then creates a count matrix whose rows correspond to the movies and whose columns correspond to the words in the vocabulary. Each entry is the number of times that word appears in that movie’s tags.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=6000, stop_words='english')
vectors = cv.fit_transform(new_df['Tags']).toarray()
So, we have created a count matrix with a shape of (9425, 6000). To summarize the process:
1. First, the text is tokenized into individual words.
2. Then, these words are converted into their base forms using lemmatization (this happens in the text pre-processing step; CountVectorizer does not lemmatize by default).
3. Next, the vocabulary of the corpus is created, and each document is represented using the bag of words model, where each word’s frequency is counted.
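To make the counting concrete, here is a minimal pure-Python sketch of a count matrix for three made-up tag strings. It imitates the idea behind CountVectorizer rather than reproducing it: there is no stop-word filtering and no vocabulary limit, and the tag strings are hypothetical.

```python
from collections import Counter

# Hypothetical tags for three titles.
tags = [
    "comedy drama tomhanks",
    "comedy action",
    "drama drama romance",
]

# Vocabulary: every unique word in the corpus, in sorted order.
vocabulary = sorted({word for doc in tags for word in doc.split()})

# One row per title, one column per vocabulary word.
count_matrix = []
for doc in tags:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

With three titles and five distinct words the matrix has shape (3, 5); the same construction on the full dataset, capped at 6,000 words, gives the (9425, 6000) matrix described above.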
Calculating the Cosine Similarity between the vectors:
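To measure how similar two titles are, we compute the cosine similarity between their count vectors.
The cosine similarity of two vectors A and B is calculated as:
$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
In general the cosine similarity ranges from -1 to 1, but because count vectors contain no negative entries, here it ranges from 0 to 1.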
We calculated the cosine similarity by using the cosine_similarity function from the sklearn.metrics.pairwise module.
When given a single matrix, cosine_similarity compares every row vector against every other row vector and returns a matrix of cosine similarity values, where each element (i, j) is the cosine similarity between the ith and the jth vector.
So, for the whole dataset, we obtain a similarity matrix of shape (9425, 9425), with one row per movie.
# Calculating cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)
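As a worked example of the formula, the sketch below computes the cosine similarity of two small count vectors by hand. The helper name cosine_similarity_pair and the vectors are made up for this illustration, and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector (an assumption of this sketch)
    return dot / (norm_a * norm_b)

vec_a = [0, 1, 1, 0, 1]  # three distinct words
vec_b = [1, 1, 0, 0, 0]  # two distinct words, one shared with vec_a
shared = cosine_similarity_pair(vec_a, vec_b)     # 1 / (sqrt(3) * sqrt(2))
disjoint = cosine_similarity_pair([1, 0], [0, 1])  # no words in common
print(round(shared, 3), disjoint)
```

Two titles that share no words score 0, and the score grows toward 1 as their tag vectors point in the same direction, regardless of how long the tags are.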
Recommending:
To recommend 5 movies similar to a given title, we take that movie’s row of the cosine similarity matrix and sort its scores in descending order.
The first entry in the sorted list is the movie itself (its similarity to itself is 1), so we skip it and use the next 5 entries.
We used the enumerate() function to pair each similarity score with its row position, so that after sorting we can still look up the corresponding movies in the DataFrame and print their titles.
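The selection logic can be tried on a toy similarity matrix first. The titles, scores, and the function name top_k_similar are hypothetical. Unlike the slicing approach described above, this sketch excludes the query by index, which is one possible way to stay correct when another title happens to have a similarity of exactly 1.

```python
# Hypothetical titles and a hand-written symmetric similarity matrix.
titles = ["a", "b", "c", "d"]
toy_similarity = [
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.8, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]

def top_k_similar(title, k=2):
    """Return the k titles with the highest similarity to `title`."""
    idx = titles.index(title)
    # Pair each score with its position and leave out the query itself.
    scored = [(j, s) for j, s in enumerate(toy_similarity[idx]) if j != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[j] for j, _ in scored[:k]]

print(top_k_similar("a"))
print(top_k_similar("b"))
```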
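def recommend(movie):
    # Titles were lowercased during pre-processing, so normalize the query too.
    matches = new_df[new_df['Title'] == movie.lower()]
    if matches.empty:
        print(f"'{movie}' was not found in the dataset.")
        return
    # Assumes new_df has a default 0..n-1 index, so labels match row positions.
    movie_index = matches.index[0]
    distances = similarity[movie_index]
    # Pair each score with its row position, sort by score (highest first),
    # and skip the first entry, which is the movie itself.
    movies_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:6]
    for i in movies_list:
        print(new_df.iloc[i[0]]['Title'])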
Output:
— — — — — — — — — — — — — — — — — THE END — — — — — — — — — — — — — — — —