In this article, I apply a series of natural language processing techniques to a dataset of business reviews. I then train a Logistic Regression model to predict whether a review is "positive" or "negative".
The field of natural language processing offers a set of tools for extracting, labeling, and predicting information from raw text data. These techniques are mainly used for emotion recognition, text tagging (for example, to automate the sorting of customer complaints), chatbots, and voice assistants.
A condensed version of the Yelp dataset will be used. This version contains 1,000 observations, originally in JSON format and then converted into a tabular format.
The review dataset being used:
This dataset consists of 9 features (‘business_id’, ‘cool’, ‘date’, ‘funny’, ‘review_id’, ‘stars’, ‘text’, ‘useful’, ‘user_id’) and contains reviews written by Yelp users; for each review, the user gave a score from 1 to 5 stars. To build an efficient model that predicts whether a review is “positive” or “negative”, we start from a model that takes the text variable as the predictor and the stars variable as the target.
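The reduction to these two columns can be sketched as follows; the records here are made up for illustration and simply mimic the 9-feature schema described above:

```python
import pandas as pd

# Two invented rows mimicking the 9-feature schema of the Yelp dataset.
records = [
    {"business_id": "b1", "cool": 0, "date": "2012-01-01", "funny": 0,
     "review_id": "r1", "stars": 5, "text": "Great food!", "useful": 1,
     "user_id": "u1"},
    {"business_id": "b2", "cool": 1, "date": "2012-02-01", "funny": 0,
     "review_id": "r2", "stars": 1, "text": "Terrible service.", "useful": 0,
     "user_id": "u2"},
]
df = pd.DataFrame(records)

# Keep only the predictor (text) and the target (stars).
df = df[["text", "stars"]]
print(df.columns.tolist())  # ['text', 'stars']
```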
Data preprocessing and exploratory analysis
Once the dataset is reduced to these 2 columns, a small exploratory analysis is possible. It is important to know the distribution of the target variable (stars received): this shows whether the dataset is biased, i.e. imbalanced between positive and negative reviews. An imbalance influences the model's results, making it more likely to predict the outcomes that are most frequent in the training set.
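Checking that distribution is a one-liner with pandas; the ratings below are a toy stand-in for the real ‘stars’ column:

```python
import pandas as pd

# Toy star ratings standing in for the dataset's 'stars' column.
stars = pd.Series([5, 5, 5, 4, 1, 2, 5, 3])

# Count how many reviews fall into each rating, ordered by star value.
distribution = stars.value_counts().sort_index()
print(distribution)
```

In the real dataset, the 5-star bucket dominates, which is exactly the imbalance discussed below.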
As the plot shows, positive reviews (5 stars) are by far the largest component, which creates an imbalance, or bias.
To obtain useful results, it is necessary to reduce the complexity of the problem. An efficient way to do so is to split the reviews into positive and negative, and to use this binary division as the dependent variable.
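One way to binarize the ratings is shown below. The 4-star threshold is my assumption, since the article only says the reviews are divided into positive and negative:

```python
import pandas as pd

stars = pd.Series([5, 4, 2, 1, 3, 5])

# Assumed convention: 4-5 stars -> positive (1), 1-3 stars -> negative (0).
labels = (stars >= 4).astype(int)
print(labels.tolist())  # [1, 1, 0, 0, 0, 1]
```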
Before proceeding with any other visualization, we must apply some preprocessing steps that are very common in NLP:
- Remove any non-useful characters (slashes, punctuation, HTML tags, question marks, etc.)
- Convert the whole text to lowercase characters
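The two cleanup steps above can be sketched in a single helper; the exact regular expressions are an assumption, since the article does not show them:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip non-useful characters (tags, punctuation, etc.)."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)     # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces

print(clean_text("Great <br/> place!! 10/10, would visit again?"))
# -> "great place would visit again"
```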
Helper functions (defined with Python's def) will be very useful for preprocessing the text as described above. From there, it is possible to determine which single words and word pairs (bigrams) are most common:
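A minimal sketch of counting unigrams and bigrams, here with the standard library's Counter on a few invented reviews (NLTK's ngrams helper would work just as well):

```python
from collections import Counter

# Tiny invented corpus standing in for the cleaned review texts.
reviews = [
    "great food great service",
    "food was cold",
    "great food again",
]

# Count single words across all reviews.
tokens = [word for review in reviews for word in review.split()]
unigrams = Counter(tokens)

# Count adjacent word pairs (bigrams) within each review.
bigrams = Counter()
for review in reviews:
    words = review.split()
    bigrams.update(zip(words, words[1:]))

print(unigrams.most_common(2))  # [('great', 3), ('food', 3)]
print(bigrams.most_common(1))   # [(('great', 'food'), 2)]
```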
After a small indexing adjustment, we can create a bubble chart displaying the most common words in each class, starting with the negative reviews:
And for the positive reviews:
After this short but interesting insight, we can proceed to the next phase: model creation.
A very simple, fast-to-train, and efficient algorithm is Logistic Regression. The scikit-learn library provides the tools to build this model, but before doing that, and before the classic train/test split, a few steps are mandatory: stemming, removal of stopwords, and vectorization:
- Stemming reduces every word to its root form, which avoids ‘dispersion’ in the text. For example, inflected forms like ‘running’ and ‘runs’ are both reduced to the root ‘run’.
- Stopword removal consists of discarding very frequent words like ‘the’, ‘that’, and ‘of’, which carry little information and would decrease the model's accuracy.
- Vectorization transforms every observation (review) in the dataset into a numerical representation. This step is mandatory: every machine learning algorithm we might want to train requires numerical input, and vectorization translates the text into numbers.
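The steps above can be sketched end to end with scikit-learn. The corpus and labels below are invented, CountVectorizer's built-in English stopword list stands in for the stopword removal, and stemming (e.g. with NLTK's PorterStemmer) could be plugged in via a custom tokenizer — this is a sketch, not the article's exact code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny invented corpus; the real input would be the cleaned review texts.
texts = [
    "great food and great service",
    "the staff was wonderful",
    "terrible food and rude staff",
    "awful experience never again",
    "lovely place highly recommended",
    "cold food slow service bad",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Vectorization with built-in English stopword removal.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Classic train/test split, then Logistic Regression.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

On the real dataset the same pipeline applies, just with the full review texts and the binarized star labels.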