A Neural Collaborative Filtering Movie Recommender System

Daihong Chen
Oct 30, 2020

Online shopping is prevalent because it is convenient, saves time, and offers more options, and free-return policies motivate even more people to shop online. Especially during the pandemic, online shopping has dominated the retail industry. Recommender systems have therefore become increasingly popular and critical to the success of online businesses.

A recommender system exposes customers to the items they are most likely to like and purchase. It predicts a user's future preference over a list of items and recommends the top items from that list. Recommender systems personalize marketing and have a direct impact on profitability and customer satisfaction.

Early recommender systems were actually not personalized. They simply recommended the most popular items watched, read, or purchased overall. For example, such a system would recommend the most popular movies to you regardless of your preferences. This works to some degree, but not always.

With the development of machine learning, recommender systems have evolved from unpersonalized to personalized. There are three main types of personalized recommender systems: content-based filtering, collaborative filtering, and hybrid systems that combine the two.

The main idea of content-based filtering is that if you purchase an item, you will like other similar items. For example, I bought a sticker book for my one-year-old daughter, so a content-based recommender would show me other sticker books I might like and buy.

Collaborative filtering is built on the idea that similar users like similar items. For example, I purchased sticker books and puzzles from Amazon, and another customer bought the same items, but she also bought a stuffed animal. Since the other customer and I purchased the same two items and both gave them high ratings, the algorithm considers us similar. The system would then recommend me the stuffed animal that this similar customer bought.

From the example above, we can see that collaborative filtering uses users' ratings of items to make recommendations. The issue with a collaborative filtering model is the cold start: what do you do at the beginning, when you don't have any user data? Well, we can fall back on recommending the most popular items, the traditional way; it is definitely an option, and a small sketch of that fallback follows below. In short, collaborative filtering works better when you have sufficient data.
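As a concrete illustration of that fallback, here is a minimal sketch of a popularity baseline in pandas. The column names match the ones used later in this post, but the tiny table itself is made up.

import pandas as pd

# hypothetical ratings table: one row per (reviewer, movie, rating) interaction
ratings = pd.DataFrame({
    'reviewerID': ['u1', 'u2', 'u3', 'u1', 'u2'],
    'movieID':    ['m1', 'm1', 'm1', 'm2', 'm2'],
    'rating':     [5, 4, 5, 3, 2],
})

# cold-start fallback: rank items by review count, with mean rating as a tie-breaker
popularity = (ratings.groupby('movieID')['rating']
              .agg(['count', 'mean'])
              .sort_values(['count', 'mean'], ascending=False))
print(popularity.head(10).index.tolist())  # most popular movieIDs first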

Collaborative filtering uses user behavior, expressed as user-item ratings, to make predictions. Collaborative filtering (CF) is currently the most widely used approach for building recommender systems, and matrix factorization is the core under the hood. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices, one for the users and one for the items.
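To make the decomposition concrete, here is a minimal numpy sketch on a toy ratings matrix. It uses truncated SVD purely for illustration (and naively treats unrated entries as zeros); the neural model later in this post instead learns the user and item factors from the observed ratings by gradient descent.

import numpy as np

# toy user-item rating matrix: 4 users x 5 movies, 0 means "not rated"
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)

k = 2  # number of latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * np.sqrt(s[:k])                # shape (n_users, k)
item_factors = (np.sqrt(s[:k])[:, None] * Vt[:k, :]).T  # shape (n_movies, k)

# a predicted rating is the dot product of a user vector and an item vector
R_hat = user_factors @ item_factors.T
print(np.round(R_hat, 2))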

In this post, I will provide an example of building a collaborative filtering movie recommender system using Amazon Movies and TV product reviews.

The project goal:

Build a high-performance movies/TV recommender system that engages Amazon customers and drives movies/TV product sales.

Business Understanding

  1. The market for entertainment products such as movies and TV is growing. A high-performance recommender system plays a key role in engaging customers and driving sales.
  2. A recommender system predicts users' interests based on historical data and recommends the items a user is most likely to be interested in.
  3. A neural network in a recommender system can efficiently learn the underlying explanatory factors and useful representations, producing high performance.

Data Understanding

  1. The data used in this project are Amazon Movies/TV reviews from UCSD.
  2. The downloaded data are json.gz files covering 19 years of data (8,765,568 reviews).
  3. The data used in this project is a subsample that only includes 2018 ratings/reviews, to reduce computational cost and to better focus on model performance.
  4. In the meta_Movies_and_TV file, there is a feature named "details" which is in HTML format and includes the links for each product. I used BeautifulSoup to scrape the links.

Process

  1. Download the data
  2. Extract the data from json.gz files, subsample the data, scrape the links, clean and preprocess the data for exploration and building the model.
  3. Explore and visualize the dataset
  4. Build the model
  5. Cross Validation
  6. Make Predictions/Recommendations

Step 1: Download the data files

## use the following two lines of code to download the data files. The files will be saved as Movies_and_TV.json.gz and meta_Movies_and_TV.json.gz in the data folder.
## I already downloaded and saved the data files in the data folder, so I skip this step
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Movies_and_TV.json.gz
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Movies_and_TV.json.gz
# the following code allows get_data() to find the files in the right place
!mkdir ../data/
!mv meta_Movies_and_TV.json.gz ../data/
!mv Movies_and_TV.json.gz ../data/

Step 2: Get the data

Because the full dataset is too large, I extract a subsample for model training and save it as a new csv file.

import gzip
import json
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.preprocessing import LabelEncoder

def get_data():
    # unzip the meta data file and turn it into a dataframe
    data_meta = []
    with gzip.open('../data/meta_Movies_and_TV.json.gz') as f:
        for l in f:
            data_meta.append(json.loads(l.strip()))

    df_meta = pd.DataFrame.from_dict(data_meta)
    # drop rows without details and drop useless columns
    df_meta = df_meta.dropna(subset=['details'])
    df_meta = df_meta.drop(['image', 'feature', 'date', 'tech1'], axis=1)

    # unzip the movie data file and turn it into a dataframe
    data_movie = []
    with gzip.open('../data/Movies_and_TV.json.gz') as f:
        for l in f:
            data_movie.append(json.loads(l.strip()))

    df_movie = pd.DataFrame.from_dict(data_movie)
    # subsample to only include 2018 for computational cost consideration
    df_movie_2018 = df_movie[df_movie['reviewTime'].str.contains('2018')]
    df_movie_2018 = df_movie_2018.drop('image', axis=1)
    df_movie_2018 = df_movie_2018[df_movie_2018['verified'] == True]
    data_2018 = df_movie_2018.merge(df_meta, on='asin', how='inner')

    # extract links from the 'details' column
    links = []
    for i in data_2018['details']:
        soup = BeautifulSoup(i, 'html.parser')
        found_links = soup.select('a.a-text-normal')
        if found_links:
            links.append(found_links[0]['href'])
        else:
            links.append("")
    # add links back to the dataframe
    data_2018['links'] = links

    # clean and preprocess the data for exploration and model building
    data_2018 = data_2018.drop(['verified', 'rank', 'also_buy', 'also_view', 'details'], axis=1)
    data_2018 = data_2018.rename(columns={'overall': 'rating', 'asin': 'movieID'})
    reviewer_count = data_2018.groupby('reviewerID')['rating'].count()
    product_count = data_2018.groupby('movieID')['rating'].count()
    average_rating = data_2018.groupby('movieID')['rating'].mean()

    # remove reviewers and movies that have only one review
    data_2018_1 = data_2018.merge(reviewer_count, on='reviewerID')
    data_2018_1 = data_2018_1.rename(columns={'rating_y': 'reviewer_count', 'rating_x': 'rating'})
    data_2018_1 = data_2018_1.merge(product_count, on='movieID')
    data_2018_1 = data_2018_1.rename(columns={'rating_y': 'movie_count', 'rating_x': 'rating'})
    data_2018_1 = data_2018_1.merge(average_rating, on='movieID')
    data_2018_1 = data_2018_1.rename(columns={'rating_y': 'average_rating', 'rating_x': 'rating'})
    data_2018_1 = data_2018_1[data_2018_1['reviewer_count'] > 1]
    data_2018_1 = data_2018_1[data_2018_1['movie_count'] > 1]

    # create encoded features for building the model
    reviewer_enc = LabelEncoder()
    data_2018_1['reviewer'] = reviewer_enc.fit_transform(data_2018_1['reviewerID'].astype(str).values)
    movie_enc = LabelEncoder()
    data_2018_1['movie'] = movie_enc.fit_transform(data_2018_1['movieID'].astype(str).values)
    data_2018_1['rating'] = data_2018_1['rating'].values.astype(np.float32)

    return data_2018_1

# extract the subsample and save it for the next steps
data_2018_1 = get_data()
data_2018_1.to_csv('../data/data_2018_mr.csv')

Look at the data:

# import the data from csv file saved from last step
data_2018 = pd.read_csv('../data/data_2018_mr.csv')
# take a look at the cleaned and preprocessed data
data_2018.head(3)

Explore the data:

# compute review counts per reviewer and per movie from the loaded data
reviewer_count = data_2018.groupby('reviewerID')['rating'].count()
product_count = data_2018.groupby('movieID')['rating'].count()

top_reviewers = reviewer_count.sort_values(ascending=False)[:20]
top_products = product_count.sort_values(ascending=False)[:20]

print(f"Count of Reviewers: {len(reviewer_count)}")
print(f"Count of Products: {len(product_count)}")
print("")
print("Ratings descriptive statistics: ")
print(data_2018['rating'].describe())
print("")
print("Reviewers by count descriptive statistics: ")
print(reviewer_count.describe())
print("")
print("Products by count descriptive statistics: ")
print(product_count.describe())
print("")
print("Top reviewers by count of reviews: ")
print(top_reviewers)
print("")
print("Top movies by count of reviews: ")
print(top_products)

Visualize the rating distribution, the top 20 movies that received the most reviews, and the top 20 users that reviewed the most movies.

from matplotlib.pyplot import figure
figure(num=None, figsize=(10, 6), dpi=80, facecolor='w', edgecolor='k')
plt.hist(data_2018['rating'])
plt.title("Distribution of Ratings")
plt.xlabel('Ratings')
plt.ylabel('Counts')
from matplotlib.pyplot import figure
figure(num=None, figsize=(10, 6), dpi=80, facecolor='w', edgecolor='k')
top_reviewers.sort_values().plot(kind='barh',color='#86bf91', zorder=2, width=0.85)
plt.title('Top Reviewers')
# plt.bar(top_reviewers['reviewerID'], top_reviewers.values())
from matplotlib.pyplot import figure
figure(num=None, figsize=(10, 6), dpi=80, facecolor='w', edgecolor='k')
top_products.sort_values().plot(kind='barh', zorder=2, color='purple', width=0.85)
plt.title('Top Movies')

Findings:

  1. Most ratings are five stars.
  2. The average number of reviews per movie/TV product is 5.4. The movie with the most reviews is Deadpool, with 959 reviews.
  3. The average number of reviews per reviewer is 2.4 movies. The top reviewer has 110 reviews.
  4. The majority of the top movies by review count are fantasy movies.

Build Model!!!

Step 3: Build the Deep Neural Networks Model

The cf_models.py file includes the base model and the final model. The following model is the final model.

Neural Networks Model

The movie recommender is a collaborative filtering model built with a deep learning embedding technique. A collaborative filtering model uses similarities between users to predict movies/TV products a given user has not yet watched or purchased but would likely be interested in. It is a model-based recommender.

The model applies the deep learning Keras embedding technique. An embedding splits one matrix into two smaller matrices, or in other words transforms a high-dimensional representation into a low-dimensional one. Embeddings are one notably successful use of deep learning for representing discrete variables as continuous vectors. They create a low-dimensional "movie preference" space in which the movies watched by a given user are nearby, and user embeddings sit close to the movies those users have watched. The individual dimensions of these vectors typically have no inherent meaning; instead, it is the overall pattern of locations and distances between vectors that the model exploits. The model can therefore recommend other movies based on their proximity to a user's embedding, because nearby users and movies share preferences.
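To make that concrete, here is a minimal sketch of what a Keras Embedding layer does: it maps an integer ID to a learned dense vector. The sizes and IDs below are made up for illustration; in the model further down, one such layer is used for reviewers and one for movies.

import numpy as np
from tensorflow.keras.layers import Embedding

# 1,000 possible reviewer IDs, each mapped to a 50-dimensional vector
reviewer_embedding = Embedding(input_dim=1000, output_dim=50)

# look up vectors for three reviewer IDs; the weights start random and are
# adjusted by back-propagation once the layer is part of a trained model
ids = np.array([[3], [42], [999]])
vectors = reviewer_embedding(ids)
print(vectors.shape)  # (3, 1, 50)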

The model:

  1. Create the reviewer embeddings and movie embeddings as the input layers. When an Embedding layer is created, its weights are randomly initialized (just like any other layer) and gradually adjusted via back-propagation during training.
  2. Use Concatenate to merge the embedding layers: it takes as input a list of tensors, all of the same shape except for the concatenation axis, and returns a single tensor, the concatenation of all inputs (https://keras.io/layers/merge/).
  3. Add hidden layers, which learn the underlying factors and representations by adjusting the weights via back-propagation.
  4. Add dropout to help prevent overfitting on the training dataset.

Loss function:

Mean Squared Error (MSE). While improving the model, both the MSE and the Mean Absolute Error (MAE) loss functions were tried, and MSE gave better results.

Metrics:

Mean Absolute Error (MAE).

Mean Absolute Error (MAE) measures the average, over the test sample, of the absolute differences between predictions and actual observations, where all individual differences have equal weight. Because the predicted value is a rating on a linear 1-to-5 scale and there is no need to penalize outliers heavily, MAE is more appropriate and easier to interpret here.
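As a quick illustration, here is a minimal sketch computing both metrics on a few made-up predictions on the 1-to-5 rating scale.

import numpy as np

# hypothetical true ratings and model predictions
y_true = np.array([5.0, 4.0, 3.0, 5.0, 1.0])
y_pred = np.array([4.5, 4.2, 3.8, 4.0, 2.5])

mae = np.mean(np.abs(y_true - y_pred))   # average error in "stars", easy to interpret
mse = np.mean((y_true - y_pred) ** 2)    # squared error, penalizes large misses more
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}")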

The final model performs well, with a Mean Squared Error loss of 0.78 and a Mean Absolute Error of 0.43.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pickle
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model
from sklearn.metrics import confusion_matrix
import itertools
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Concatenate, Dense, Dropout
from tensorflow.keras.layers import Add, Activation, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Reshape, Dot
from tensorflow.keras.layers import Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x


def RecommenderNet(n_reviewers, n_movies, n_factors, min_rating, max_rating):
    reviewer = Input(shape=(1,))
    r = EmbeddingLayer(n_reviewers, n_factors)(reviewer)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)

    x = Concatenate()([r, m])
    x = Dropout(0.05)(x)

    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)

    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)

    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    # scale the sigmoid output (0-1) into the rating range (min_rating to max_rating)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[reviewer, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mae'])
    return model
def train_test(data_2018):
    X = data_2018[['reviewer', 'movie']].values
    y = data_2018['rating'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # the model takes two inputs: the reviewer ids and the movie ids
    X_train_array = [X_train[:, 0], X_train[:, 1]]
    X_test_array = [X_test[:, 0], X_test[:, 1]]

    return X_train, X_test, X_train_array, X_test_array, y_train, y_test


def create_parameters_model(data_2018):
    n_reviewers = data_2018['reviewer'].nunique()
    n_movies = data_2018['movie'].nunique()
    n_factors = 50
    min_rating = min(data_2018['rating'])
    max_rating = max(data_2018['rating'])
    return n_reviewers, n_movies, n_factors, min_rating, max_rating


def import_data():
    file = '../data/data_2018_mr.csv'
    data_2018 = pd.read_csv(file)
    return data_2018
# import the cleaned data for the model; it is the saved data_2018_mr.csv file
data_2018 = import_data()
# get parameters for the model
n_reviewers, n_movies, n_factors, min_rating, max_rating = create_parameters_model(data_2018)
# split train and test data for the model
X_train, X_test, X_train_array, X_test_array, y_train, y_test = train_test(data_2018)
# initiate the model
model = RecommenderNet(n_reviewers, n_movies, n_factors, min_rating, max_rating)
# fit the data
history = model.fit(x=X_train_array,
                    y=y_train,
                    batch_size=64,
                    epochs=5,
                    verbose=1,
                    validation_data=(X_test_array, y_test))
# visualize the final model loss
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5
import matplotlib.pyplot as plt
plt.plot(history.history['loss'] , 'g')
plt.plot(history.history['val_loss'] , 'b')
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.grid(True)
plt.show()
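The cross-validation step below reloads the trained model from a file named mr_model.h5. The save call is not shown in the post; here is a minimal sketch of what it presumably looked like, assuming the model object from the training step above.

# save the trained model so the cross-validation step can reload it
# (assumption: the original project saved it under this name; only the load is shown below)
model.save('mr_model.h5')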

Cross Validation

from tensorflow.keras.models import load_model
from sklearn.model_selection import StratifiedKFold
import numpy
import pandas as pd


def cross_val_v1(X_train, y_train):
    # load the trained model saved earlier as mr_model.h5
    model_cv = load_model('mr_model.h5')
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cvscores = []
    for train_index, test_index in kfold.split(X_train, y_train):
        X_train_s, X_test_s = X_train[train_index], X_train[test_index]
        y_train_s, y_test_s = y_train[train_index], y_train[test_index]
        X_train_array = [X_train_s[:, 0], X_train_s[:, 1]]
        X_test_array = [X_test_s[:, 0], X_test_s[:, 1]]
        # fit the model on this fold
        model_cv.fit(x=X_train_array,
                     y=y_train_s,
                     epochs=5,
                     verbose=1,
                     validation_data=(X_test_array, y_test_s))
        # evaluate the model on the held-out fold
        scores = model_cv.evaluate(x=X_test_array, y=y_test_s, verbose=1)
        print(model_cv.metrics_names[1], scores[1])
        cvscores.append(scores[1])
    cvscore_mean = numpy.mean(cvscores)
    cvscore_std = numpy.std(cvscores)
    return cvscores, cvscore_mean, cvscore_std


cvscores, cvscore_mean, cvscore_std = cross_val_v1(X_train, y_train)

The model performance is pretty good!
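Step 6 of the process is making predictions/recommendations, which the post does not show explicitly. Here is a minimal sketch, assuming the trained model and the encoded data_2018 frame from the steps above; the helper function name is hypothetical.

import numpy as np

def recommend_top_n(model, data_2018, reviewer_id, n=10):
    """Predict ratings for movies this reviewer has not rated and return the top n encoded movie ids."""
    all_movies = data_2018['movie'].unique()
    seen = data_2018.loc[data_2018['reviewer'] == reviewer_id, 'movie'].unique()
    unseen = np.setdiff1d(all_movies, seen)

    # the model expects two inputs: an array of reviewer ids and an array of movie ids
    reviewer_array = np.full(len(unseen), reviewer_id)
    preds = model.predict([reviewer_array, unseen], verbose=0).flatten()

    top_idx = preds.argsort()[::-1][:n]
    return unseen[top_idx], preds[top_idx]

# example: top 10 recommendations for the first encoded reviewer
movies, scores = recommend_top_n(model, data_2018, reviewer_id=0, n=10)
print(movies, scores)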

Thanks for reading!
