Online Shopper Intention Prediction: Imbalance Classification Using Stack Model & Neural Network

Daihong Chen
7 min read · Oct 23, 2020

Online shopping has been gaining popularity, especially during the pandemic. This post presents a project that applies Logistic Regression, a stacking ensemble of Random Forest, Gradient Boosting, and XGBoost, and a neural network to predict whether a shopping session ends in a purchase.

The data is downloaded from Kaggle.

First, let's understand the outcome variable and the features in the dataset.

The ‘Revenue’ attribute is used as the class label.

“Administrative”, “Administrative Duration”, “Informational”, “Informational Duration”, “Product Related” and “Product Related Duration” represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories.
The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another.

The “Bounce Rate”, “Exit Rate” and “Page Value” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.

The value of “Bounce Rate” feature for a web page refers to the percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.

The value of the “Exit Rate” feature for a specific web page is the percentage of all pageviews of that page that were the last pageview in their session.
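For example, with made-up numbers: if 1,000 sessions entered the site on a given page and 400 of them left without triggering any further request, that page's bounce rate is 400 / 1,000 = 40%; if the same page received 2,000 pageviews overall and 500 of them were the last pageview of their session, its exit rate is 500 / 2,000 = 25%.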

The “Page Value” feature represents the average value for a web page that a user visited before completing an e-commerce transaction.

The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day) on which sessions are more likely to end with a transaction. The value of this attribute is determined by considering e-commerce dynamics such as the duration between the order date and the delivery date. For example, for Valentine’s Day, this value is nonzero between February 2 and February 12, zero outside this window unless the date is close to another special day, and reaches its maximum value of 1 on February 8.

The dataset also includes operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

Let's load the data and take a look:

# imports used throughout the post
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the data
df = pd.read_csv('online_shoppers_intention.csv')
# take a look
df.head()
# descriptive statistics
df.describe(include='all').transpose()

Exploratory Data Analysis and Preprocessing

EDA step 1: check for missing values and impute them.

df.isnull().sum()

There are not many missing values, so we impute them with the column means:

# impute numeric columns with their means (numeric_only avoids errors on categorical columns)
df.fillna(df.mean(numeric_only=True), inplace=True)

EDA step 2: encode binary variables using LabelEncoder.

from sklearn.preprocessing import LabelEncoder

# encode the two Boolean columns as 0/1
labelencoder = LabelEncoder()
df['Weekend'] = labelencoder.fit_transform(df['Weekend'])
df['Revenue'] = labelencoder.fit_transform(df['Revenue'])

EDA step 3: explore the numerical variables.

num_columns = ['Administrative', 'Administrative_Duration', 'Informational',
               'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
               'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']

# plot each numerical feature, split by the Revenue class
fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(20, 16))
s = 0
for i in range(5):
    for j in range(2):
        ax[i, j].hist(df[df['Revenue'] == 0][num_columns[s]], color='orange', label='False')
        ax[i, j].hist(df[df['Revenue'] == 1][num_columns[s]], color='green', label='True')
        ax[i, j].set_xlabel(num_columns[s])
        ax[i, j].set_ylabel('Count of Shoppers')
        ax[i, j].legend()
        s = s + 1
fig.suptitle('Revenue True VS. False')
plt.show()

EDA step 4: check correlations among the continuous variables.

num_df = df[num_columns]
num_df.corr().round(3)
corr = num_df.corr()
fig, ax =plt.subplots(figsize=(12, 12))
ax = sns.heatmap(
corr,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(20, 220, n=200),
square=True
)
ax.set_xticklabels(
ax.get_xticklabels(),
rotation=45,
horizontalalignment='right'
);

From the heatmap, we can see that the time spent in each page category has a moderate correlation with the number of pages of that category visited in the session.
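As a quick numeric check to back up the heatmap reading (a small sketch beyond the original code), the strongest pairwise correlations can be listed directly:

import numpy as np

# keep only the upper triangle so each feature pair appears once,
# then list the pairs with the largest absolute correlation
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(key=abs, ascending=False).head(5))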

EDA step 5: explore and visualize the categorical variables.

cat_columns = ['Month', 'OperatingSystems',
               'Browser', 'Region', 'TrafficType',
               'VisitorType', 'Weekend']

# plot the first six categorical features, split by the Revenue class
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(20, 16))
s = 0
for i in range(3):
    for j in range(2):
        ax[i, j].hist(df[df['Revenue'] == 0][cat_columns[s]], color='orange', label='False')
        ax[i, j].hist(df[df['Revenue'] == 1][cat_columns[s]], color='green', label='True')
        ax[i, j].set_xlabel(cat_columns[s])
        ax[i, j].set_ylabel('Count of Shoppers')
        ax[i, j].legend()
        s = s + 1
fig.suptitle('Revenue True VS. False')
plt.show()

# plot the remaining feature, Weekend, on its own
fig, ax = plt.subplots(figsize=(10, 5))
plt.hist(df[df['Revenue'] == 0][cat_columns[6]], color='orange', label='False')
plt.hist(df[df['Revenue'] == 1][cat_columns[6]], color='green', label='True')
plt.xlabel(cat_columns[6])
plt.ylabel('Count of Shoppers')
plt.legend()
plt.show()

Prepare Data For Modeling

There are not many issues in the dataset to clean, so we move on to preparing the data for modeling. One important task is to encode the categorical variables; I used target encoding in this project.

# If you don't have category_encoders, run the line below to install it first.
# conda install -c conda-forge category_encoders
import category_encoders as ce

# target-encode all categorical columns except Weekend (already encoded as 0/1)
target_encoder = ce.TargetEncoder(cols=cat_columns[:-1])
df_te = df.copy()
target_encoder.fit(df_te[cat_columns[:-1]], df_te['Revenue'])
df_ = target_encoder.transform(df_te[cat_columns[:-1]], df_te['Revenue'])
df_te = df_te.drop(cat_columns[:-1], axis=1)
df_te = pd.concat([df_te, df_], axis=1)
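To make the encoding concrete, here is a minimal sketch (beyond the original code) of the idea behind target encoding: each category level is replaced by the mean of the target observed for that level. The snippet below computes the raw per-month purchase rate for comparison; TargetEncoder additionally smooths these means toward the global mean.

# raw per-category target means, the quantity target encoding is based on
month_means = df.groupby('Month')['Revenue'].mean()
print(month_means.sort_values(ascending=False).head())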

Split the data into train and test sets, and choose the evaluation metrics. Because this dataset is very imbalanced (10,422 False sessions vs. 1,908 True sessions), and because we want to predict shoppers who intend to purchase (the positive class), the ROC_AUC score is selected as the main metric. The F1 score is also used as a reference, since it combines precision and recall (their harmonic mean).

# import libraries
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict
from sklearn.preprocessing import StandardScaler, RobustScaler, power_transform
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import plot_roc_curve  # removed in scikit-learn >= 1.2; RocCurveDisplay.from_estimator is the replacement

X = df_te.drop(['Revenue'], axis=1)
y = df_te['Revenue']

# split into train and test sets (a 75/25 split is assumed here,
# since the original post does not show this step)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# define the evaluation helpers
def metrics(y_true, y_pred):
    print('Confusion Matrix: ', '\n', confusion_matrix(y_true, y_pred), '\n')
    print('Classification Report:', '\n', classification_report(y_true, y_pred), '\n')
    print('F1 Score: ', f1_score(y_true, y_pred).round(3), '\n')
    print('ROC_AUC Score: ', roc_auc_score(y_true, y_pred).round(3))

def roc_curve_disp(model, X_test, y_test):
    ax = plt.gca()
    plot_roc_curve(model, X_test, y_test, ax=ax, alpha=0.8)
    plt.show()

Build the base model, which is a basic logistic regression model.

# base model: a plain logistic regression
lr = LogisticRegression(solver='liblinear')
base_model = lr.fit(X_train, y_train)
y_pred = base_model.predict(X_test)
metrics(y_test, y_pred)
roc_curve_disp(base_model, X_test, y_test)

The first model I built is a stacking model. A stacking model is an ensemble that combines several base models by training a final meta-learner (here, a logistic regression) on their predictions, which normally performs better than each individual base model.

estimators = [  
('rf', RandomForestClassifier(n_estimators = 100)),
('grad', GradientBoostingClassifier()),
('xgb', XGBClassifier())]
stack = StackingClassifier(estimators = estimators, final_estimator = LogisticRegression(), cv = 5)
stack.fit(X_train, y_train);
print(metrics(y_test, stack.predict(X_test)))
roc_curve_disp(stack, X_test, y_test)
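Under the hood, StackingClassifier generates out-of-fold predictions from each base model via cross-validation and fits the final_estimator on them. As a rough sketch of that idea (not the author's code, and simplified: the real class also refits the base models on the full training set), the meta-features could be built manually like this:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# out-of-fold probability predictions from each base model become meta-features
meta_features = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for _, est in estimators
])

# the meta-learner (final_estimator) is trained on these meta-features
meta_learner = LogisticRegression().fit(meta_features, y_train)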

Our ROC_AUC score increased from 0.674 to 0.759, which is a solid improvement. Let's tune the hyperparameters, and specifically adjust the class weights, to improve the model's performance on the imbalanced data.
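The tuned estimators below use a scale_pos_weight variable that the original code does not define. The usual choice for XGBoost, assumed here, is the ratio of negative to positive samples in the training set:

# assumed definition: XGBoost's recommended ratio of negative to positive samples
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(scale_pos_weight)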

# Transform the data:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Tune the model:
estimators = [('rf', RandomForestClassifier(n_estimators=200, class_weight='balanced')),
              ('grad', GradientBoostingClassifier(n_estimators=200)),
              ('xgb', XGBClassifier(scale_pos_weight=scale_pos_weight))]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
metrics(y_test, stack.predict(X_test))

Confusion Matrix:
[[2449 143]
[ 193 298]]

Classification Report:
precision recall f1-score support

0 0.93 0.94 0.94 2592
1 0.68 0.61 0.64 491

accuracy 0.89 3083
macro avg 0.80 0.78 0.79 3083
weighted avg 0.89 0.89 0.89 3083


F1 Score: 0.639

ROC_AUC Score: 0.776

Both the F1 score and the ROC_AUC score increased after I scaled the features and tuned the model, though not dramatically.

Now, let's move on to the neural network.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.metrics import AUC

Now, build the multilayer perceptron (MLP) neural network model.

model = Sequential()
model.add(Dense(4, input_dim=X_train.shape[1], activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(0.5))
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

auc = AUC()
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=[auc])
history = model.fit(X_train, y_train,
                    epochs=200,
                    verbose=1,
                    validation_data=(X_test, y_test))

The training output shows that the ROC_AUC score for the test set reaches about 0.85.
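To see this, a small sketch (not in the original post) can plot the AUC values stored in the Keras history object; the exact metric key depends on how the AUC metric was named in this session (often 'auc', sometimes 'auc_1'):

# plot training vs. validation AUC over epochs
auc_key = [k for k in history.history if k.startswith('auc')][0]
plt.plot(history.history[auc_key], label='train AUC')
plt.plot(history.history['val_' + auc_key], label='validation AUC')
plt.xlabel('Epoch')
plt.ylabel('AUC')
plt.legend()
plt.show()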

Let's add class weights to the MLP model.

weights = {0: 1, 1: 6}
history1 = model.fit(X_train, y_train,
                     class_weight=weights,
                     epochs=200,
                     verbose=1,
                     validation_data=(X_test, y_test))

# note: Sequential.predict_classes was removed in newer TensorFlow;
# (model.predict(X_test) > 0.5).astype('int32') is the equivalent
score1 = roc_auc_score(y_test, model.predict(X_test))
score2 = roc_auc_score(y_test, model.predict_classes(X_test))
print('ROC AUC: %.3f' % score1)
print('ROC AUC: %.3f' % score2)
print('F1 Score: %.3f' % f1_score(y_test, model.predict_classes(X_test)))

Here I calculated two AUC scores: one from the probabilities the model predicts and one from the predicted classes. The F1 score is lower than that of the stack model.

ROC AUC: 0.901
ROC AUC: 0.819
F1 Score: 0.536

Though the AUC score of the NN model is higher than that of the stack model, the results turned out to be interesting.

y_pred_nn = model.predict_classes(X_test)
y_pred_stack = stack.predict(X_test)
fig, ax=plt.subplots(figsize=(16,12))
plt.hist(y_pred_nn, color='yellow', label='Neural Network', stacked=True)
plt.hist(y_test, color = 'orange', label = 'Ground Truth', stacked=True)
plt.hist(y_pred_stack, color='green', label='Stack predicted', stacked=True)
plt.title('Predicted vs Ground Truth')
plt.legend()
plt.show()

The graph above shows that the NN model predicted the positive class far more often than it actually occurs, whereas the stack model only misclassified a small portion of positive sessions as negative. In other words, the MLP model tends toward false positives, and the stack model tends toward some false negatives.
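One concrete way to check this (a quick sketch, not part of the original analysis) is to print both confusion matrices and compare the off-diagonal counts:

# rows are the true classes, columns the predicted classes;
# the top-right cell counts false positives, the bottom-left false negatives
print('Neural network:\n', confusion_matrix(y_test, y_pred_nn))
print('Stack model:\n', confusion_matrix(y_test, y_pred_stack))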

Honestly, I am not sure of the reason or how to interpret it, but I will definitely do more research and come back with an answer. If you read this article and have an answer, please let me know.

Thank you for reading.
