Naïve Bayes (NB) in Python with Labeled Text Data
To view the HTML version of the complete NB python code click here: Labeled Text Data NB
This page focuses on naive Bayes for text data and looks into Naive Bayes classification, attribute selection measures, and how to build and optimize a Naive Bayes classifier using Python.
Naive Bayes is a straightforward method for building classifiers, which are models that assign class labels to problem instances represented as vectors of feature values, with the class labels drawn from a finite set. There is no single algorithm for training such classifiers, but rather a family of algorithms based on the same principle: all naive Bayes classifiers assume that the value of one feature is independent of the value of any other feature, given the class variable. For example, if a fruit is red, round, and around 10 cm in diameter, it is termed an apple. Regardless of any possible relationships between the color, roundness, and diameter features, a naive Bayes classifier considers each of these features to contribute independently to the likelihood that this fruit is an apple.
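The apple example can be made concrete. Assuming some illustrative, made-up probabilities (not real data) for each feature given the class, the independence assumption lets us multiply the per-feature likelihoods and then normalize:

```python
# Naive Bayes scoring for the fruit example, with illustrative
# (made-up) priors and likelihoods -- not real data.
priors = {"apple": 0.5, "other": 0.5}

# P(feature | class), assumed independent given the class.
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "10cm": 0.7},
    "other": {"red": 0.3, "round": 0.4, "10cm": 0.2},
}

def posterior(features):
    # Unnormalized score: P(class) * product of P(feature | class)
    scores = {}
    for cls in priors:
        score = priors[cls]
        for f in features:
            score *= likelihoods[cls][f]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}
```

With these numbers, a red, round, 10 cm fruit comes out roughly 95% likely to be an apple, because each feature independently favors that class.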
1 Data Science Questions Answered
- What do people think about some of the best companies in the industry: Amazon, Microsoft and Google?
- What is it like to work at one of the MAANG companies?
- Should employees prefer to work for such firms?
1.1 Setting the objective
The objective is to check whether past statements about the firms mentioned above can be used to predict labels for new, unlabeled statements, and also to help predict whether an employee should work for such an MNC.
2 Dataset Used
Tweepy is a Python library for accessing the Twitter Developer API. You need to sign up for a Twitter Developer Account (the Twitter setup page will guide you through the process) in order to extract tweets from Twitter and use that data for your purpose.
You will need to create an app to get access to the API. Once you have access to the API, use the step-by-step guide to create an app and project. Remember to copy the keys into a txt file on your local machine.
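The extraction code itself is not reproduced on this page; a minimal sketch using Tweepy's v2 `Client` might look like the following (the query terms and token handling here are assumptions for illustration, not the exact code used for this dataset):

```python
def build_query(companies):
    # Pure helper: search for any of the company names,
    # excluding retweets and restricting to English tweets.
    return "(" + " OR ".join(companies) + ") -is:retweet lang:en"

def fetch_tweets(bearer_token, query, max_results=50):
    # Deferred import so build_query works even without tweepy installed.
    import tweepy  # pip install tweepy
    client = tweepy.Client(bearer_token=bearer_token)
    resp = client.search_recent_tweets(query=query, max_results=max_results)
    return [tweet.text for tweet in (resp.data or [])]

# Example (requires the bearer token you saved from your Developer app):
# tweets = fetch_tweets(open("keys.txt").read().strip(),
#                       build_query(["Amazon", "Microsoft", "Google"]))
```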
Screenshot of extracted tweets:
3 Data Cleaning
Next, the extracted data was cleaned and prepared so that the model could be run on it. First, the text was cleaned, stemmed, and lemmatized. Then stopwords were removed, so that the vectorizer produces counts of the remaining words. For this, a toolbox (check Text.ipynb) was implemented which makes use of polymorphism and inheritance. The toolbox has a parent class called Datasets and a sub-class called Texdataset. This toolbox can clean any sort of text data and builds a wordcloud for it.
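The real toolbox lives in Text.ipynb; as a rough illustration of the kind of cleaning described above, a minimal version might look like this (the tiny hard-coded stopword list and crude suffix-stripping stemmer are stand-ins for the actual stemmer/lemmatizer, not the code used here):

```python
import re

STOPWORDS = {"a", "an", "the", "is", "at", "of", "to", "and", "in", "for"}

def crude_stem(word):
    # Very rough stand-in for a real stemmer: strip common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs first, while intact
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]
```

The URL pattern is applied before punctuation stripping so that links are removed whole rather than leaving fragments like "https" behind.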
Screenshot of the Cleaned tweets:
4 Histogram of labels
It can be seen that the labels are fairly balanced. A train-test split will therefore work well on this data, and a naive Bayes model can be implemented.
5 Types of Naive Bayes Classifier
5.1 Multinomial Naive Bayes
This is mostly used for document classification problems, i.e., whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
5.2 Bernoulli Naive Bayes
This is similar to multinomial naive Bayes, but the predictors are boolean variables. The parameters used to predict the class variable take only yes/no values, for example whether a word occurs in the text or not.
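Turning word counts into the boolean presence/absence features that Bernoulli naive Bayes expects is just a thresholding step; a sketch (not the code used on this page):

```python
def binarize(counts):
    # Map word counts to 1 (word occurs in the text) / 0 (word absent).
    return {word: 1 if c > 0 else 0 for word, c in counts.items()}
```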
5.3 Gaussian Naive Bayes
When the predictors take continuous rather than discrete values, we assume that these values are sampled from a Gaussian distribution.
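Under that assumption, the likelihood of a feature value given a class is the Gaussian density evaluated with that class's mean and variance (estimated from the training data). A minimal sketch:

```python
import math

def gaussian_likelihood(x, mean, var):
    # P(x | class) under a normal distribution fitted to that class:
    # exp(-(x - mean)^2 / (2 * var)) / sqrt(2 * pi * var)
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```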
Naive Bayes has been implemented with the alpha value set to 1. This means the Laplace smoothing parameter has been set to 1.
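To make the role of alpha concrete, here is a tiny from-scratch multinomial naive Bayes with Laplace smoothing, run on a toy corpus (illustrative data, not the tweets above). With alpha = 1, a word never seen in a class still gets a small nonzero probability instead of zeroing out the whole product:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: parallel list of class labels.
    vocab = {w for doc in docs for w in doc}
    counts = defaultdict(Counter)   # per-class word counts
    class_totals = Counter(labels)  # per-class document counts
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    model = {}
    for y in class_totals:
        total = sum(counts[y].values())
        # Laplace-smoothed log-likelihood: (count + alpha) / (total + alpha * |V|)
        model[y] = {
            "prior": math.log(class_totals[y] / len(docs)),
            "loglik": {w: math.log((counts[y][w] + alpha) / (total + alpha * len(vocab)))
                       for w in vocab},
            "unseen": math.log(alpha / (total + alpha * len(vocab))),
        }
    return model

def predict(model, doc):
    def score(y):
        m = model[y]
        return m["prior"] + sum(m["loglik"].get(w, m["unseen"]) for w in doc)
    return max(model, key=score)
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer documents.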
Python Code for NB : Python NB code
Python HTML : Labeled Text Data NB
6 Naive Bayes Key Features
It is important to have a look at the features that the Naive Bayes classifier considered. Following are the plots of the features, i.e., the words, for each of the labels.
6.1 Amazon
6.2 Microsoft
6.3 Google
The wordcloud for every firm describes various aspects such as work location, salary, designation of employees, type of company, etc. For example, if we take a look at the wordcloud for Microsoft, we can see that employees can work remotely or at some of the mentioned office locations such as London, UK. Certain technologies, such as Cloud, are also displayed.
7 Confusion Matrix for NB
8 Classification report for NB
9 Density Plots
Below you can find the probability density plots of all the features. Note the label encoding:
- 0 refers to Amazon
- 1 refers to Microsoft
- 2 refers to Google
9.1 Alpha = 0
9.2 Alpha = 1
9.3 Alpha = 5
10 Conclusion/Inferences
It can be seen in the confusion matrix that 3 Amazon results were correctly predicted, 1 Microsoft result was correctly predicted, and 1 Google result was correctly predicted. The classification report gives the precision, recall, and F1 score of the classifier. The accuracy of the model as per the report is 88%.
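The figures in the classification report can be recomputed directly from a confusion matrix; a generic sketch follows (the matrix in the test is hypothetical, not the one shown in the figure above):

```python
def accuracy(cm):
    # cm[row][col]: true class = row, predicted class = col.
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def per_class_metrics(cm, i):
    # Precision, recall, and F1 for class i of the confusion matrix.
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(len(cm))) - tp   # predicted i, wrongly
    fn = sum(cm[i]) - tp                              # true i, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```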