A Text Analysis of Women Clothing Reviews

Daihong Chen
5 min read · Nov 7, 2020

In this post, I present a natural language processing exercise using a Women's Clothing Reviews dataset downloaded from Kaggle.

Step 1, load the data and take a look.

import pandas as pd

df = pd.read_csv('women_clothing_review.csv')
df.head()

Step 2, preprocess and visualize the data.

Check missing values:

df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Since Title has 3,810 missing values, I decided to combine Title and Review Text into a new variable named text.

df = df.fillna('')
df['text'] = df['Title'].str.cat(df['Review Text'], sep=' ')

# Check the new feature (missing values were already filled with ''):
df['text'].isnull().shape
(23486,)

# Keep the original Title and Review Text columns in case they are needed later.
df.columns

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name', 'text'],
      dtype='object')

Check the distribution of the target variable, Recommended IND: 82% of the values are 1 (recommended).

df['Recommended IND'].value_counts(normalize=True).round(3)

1    0.822
0    0.178
Name: Recommended IND, dtype: float64

Visualize the data by feature.

import matplotlib.pyplot as plt

# check distribution of Age by Recommended IND
fig, ax = plt.subplots(figsize=(20, 8))
plt.hist(df[df['Recommended IND']==1]['Age'], label='Recommended IND = 1')
plt.hist(df[df['Recommended IND']==0]['Age'], label='Recommended IND = 0')
plt.title('Age Distribution by Recommended IND')
plt.legend()
plt.show()

# check distribution of Positive Feedback Count by Recommended IND
fig, ax = plt.subplots(figsize=(12, 8))
plt.hist(df[df['Recommended IND']==1]['Positive Feedback Count'], label='Recommended IND = 1')
plt.hist(df[df['Recommended IND']==0]['Positive Feedback Count'], label='Recommended IND = 0')
plt.title('Distribution of Positive Feedback Count by Recommended IND')
plt.legend()
plt.show()

# check distribution of Rating by Recommended IND
fig, ax = plt.subplots(figsize=(10, 15))
plt.hist(df[df['Recommended IND']==1]['Rating'], label='Recommended IND = 1', color='orange')
plt.hist(df[df['Recommended IND']==0]['Rating'], label='Recommended IND = 0', histtype='step', color='green')
plt.title('Distribution of Rating by Recommended IND')
plt.legend()
plt.show()

# check distribution of count by Division and Recommended IND
fig, ax = plt.subplots(figsize=(20, 8))
plt.hist(df[df['Recommended IND']==1]['Division Name'], label='Recommended IND = 1')
plt.hist(df[df['Recommended IND']==0]['Division Name'], label='Recommended IND = 0')
plt.legend()
plt.show()

# check distribution of count by Department and Recommended IND
fig, ax = plt.subplots(figsize=(20, 8))
plt.hist(df[df['Recommended IND']==1]['Department Name'], label='Recommended IND = 1')
plt.hist(df[df['Recommended IND']==0]['Department Name'], label='Recommended IND = 0')
plt.legend()
plt.show()

# check distribution of count by Class and Recommended IND
fig, ax = plt.subplots(figsize=(20, 8))
plt.hist(df[df['Recommended IND']==1]['Class Name'], label='Recommended IND = 1')
plt.hist(df[df['Recommended IND']==0]['Class Name'], label='Recommended IND = 0')
plt.legend()
plt.show()

Step 3, preprocess the text data.

In NLP, the first step is to clean the data so that it is usable for vectorization. Cleaning text data includes:

1. Make all text lowercase

2. Remove punctuation, numbers, symbols, etc.

3. Remove stop words, perhaps with a custom stop word list

4. Stemming/lemmatization

Clean the data by lowercasing the text and removing punctuation, nonsensical text, and extra whitespace.

import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation,
    and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
data_clean = pd.DataFrame(df.text.apply(round1))

def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was
    missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)
data_clean = pd.DataFrame(data_clean.text.apply(round2))

Since sklearn's TfidfVectorizer and CountVectorizer can remove English stop words (stop_words='english') and accept a custom tokenizer such as a Porter-stemming tokenizer_porter, I will use them directly rather than writing separate stop word removal and stemming steps.
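For reference, a minimal sketch of such a Porter-stemming tokenizer, assuming NLTK is available (this particular helper is illustrative, not the code used above):

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    # split on whitespace and stem each token with the Porter stemmer
    return [porter.stem(word) for word in text.split()]

# pass the custom tokenizer along with the built-in English stop word list
tfidf_stemmed = TfidfVectorizer(stop_words='english', tokenizer=tokenizer_porter)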

Step 4, transform words into feature vectors.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv = CountVectorizer(stop_words='english')
# tokenize and build vocab
cv.fit(data_clean.text)
# encode document
data_cv = cv.transform(data_clean.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_clean.index
print(data_cv.shape)
print(type(data_cv))
(23486, 19371)
<class 'scipy.sparse.csr.csr_matrix'>

Write a function to find the top 20 n-grams (unigrams, 2-grams, 3-grams, and so on) in the text.

# define a function to find the top 20 n-grams:
def top_n_ngram(corpus, n=None, ngram=1):
    vec = CountVectorizer(stop_words='english', ngram_range=(ngram, ngram)).fit(corpus)
    words_bag = vec.transform(corpus)  # count of each n-gram per review
    sum_words = words_bag.sum(axis=0)  # total count of each n-gram across all reviews
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

Visualize the top 20 unigrams:

# visualize the top 20 unigrams
# (use a separate DataFrame so df, which is needed later for the model, is not overwritten)
pop_words = top_n_ngram(data_clean['text'], 20, 1)
unigram_df = pd.DataFrame(pop_words, columns=['ReviewText', 'count'])
plt.figure(figsize=(20, 6))
unigram_df.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 unigrams in review after removing stop words')

Visualize the top 20 2-grams:

Visualize the top 20 3-grams:

Visualize the top 20 5-grams:
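These charts follow the same pattern, reusing top_n_ngram with a different ngram argument; a minimal sketch (the loop and variable names below are illustrative):

# reuse top_n_ngram for the 2-, 3-, and 5-gram charts
for n in (2, 3, 5):
    pop_ngrams = top_n_ngram(data_clean['text'], 20, n)
    ngram_df = pd.DataFrame(pop_ngrams, columns=['ngram', 'count'])
    plt.figure(figsize=(20, 6))
    ngram_df.set_index('ngram')['count'].plot(
        kind='bar', title=f'Top 20 {n}-grams in review after removing stop words')
    plt.show()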

As we can see, words about size and fit are the most frequently used in the reviews.

Now let’s build a logistic regression model that uses the vectorized text to predict Recommended IND.

tfidf = TfidfVectorizer(use_idf=True)
token = tfidf.fit_transform(df['text'])
first_vector_tfidfvectorizer = token[0]
tfidf_df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(),
                        index=tfidf.get_feature_names(), columns=['tfidf'])
tfidf_df = tfidf_df.sort_values(by=['tfidf'], ascending=False)[:10]
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

X_train = token[:20000]
X_test = token[20000:]
y_train = df['Recommended IND'][:20000].values
y_test = df['Recommended IND'][20000:].values
lr = LogisticRegression(random_state=1)
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
y_pred = lr.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))
print()
print(f'ROC_AUC_SCORE is: {roc_auc_score(y_test, y_pred).round(3)}')
     0     1
0  338   273
1   90  2785

ROC_AUC_SCORE is: 0.761

The ROC AUC score is not very high. We could improve the model by adding more variables and tuning the hyperparameters.
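For example, the regularization strength C could be tuned with GridSearchCV, scoring on ROC AUC; a minimal sketch (the parameter grid below is just an assumption):

from sklearn.model_selection import GridSearchCV

# search over the inverse regularization strength C of logistic regression
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(random_state=1, max_iter=1000),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))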

Thanks for reading.
