Naïve Bayes (NB) in Python with Labeled Text Data
To view the HTML version of the complete NB python code click here: Labeled Text Data NB
This page focuses on naive Bayes for text data and looks into Naive Bayes classification, attribute selection measures, and how to build and optimize a Naive Bayes classifier using Python.
Naive Bayes is a straightforward method for building classifiers, which are models that assign class labels to problem instances represented as vectors of feature values, with the class labels drawn from a finite set. There is no single algorithm for training such classifiers, but rather a family of algorithms based on the same principle: all naive Bayes classifiers assume that the value of one feature is independent of the value of any other feature, given the class variable. For example, if a fruit is red, round, and around 10 cm in diameter, it is termed an apple. Regardless of any possible relationships between the color, roundness, and diameter features, a naive Bayes classifier considers each of these features to contribute independently to the likelihood that this fruit is an apple.
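The apple example can be made concrete. Assuming some illustrative, made-up probabilities (not real data) for each feature given the class, the independence assumption lets us multiply the per-feature likelihoods and then normalize:

```python
# Naive Bayes scoring for the fruit example, with illustrative
# (made-up) priors and likelihoods -- not real data.
priors = {"apple": 0.5, "other": 0.5}

# P(feature | class), assumed independent given the class.
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "10cm": 0.7},
    "other": {"red": 0.3, "round": 0.4, "10cm": 0.2},
}

def posterior(features):
    # Unnormalized score: P(class) * product of P(feature | class)
    scores = {}
    for cls in priors:
        score = priors[cls]
        for f in features:
            score *= likelihoods[cls][f]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}
```

With these numbers, a red, round, 10 cm fruit comes out roughly 95% likely to be an apple, because each feature independently favors that class.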
1 Data Science Questions Answered
- What do people think about some of the best companies in the industry: Amazon, Microsoft and Google?
- What is it like to work at one of the MAANG companies?
- Should employees prefer to work for such firms?
1.1 Setting the objective
The objective is to check whether past statements about the firms mentioned above can be used to predict labels for new, unlabeled statements, and also to help predict whether an employee should work for such an MNC.
2 Dataset Used
Tweepy is a Python library for accessing the Twitter Developer API. You need to sign up for a Twitter Developer Account (the Twitter setup page will guide you through the process) in order to extract tweets from Twitter and use that data for your purpose.
You will need to create an app to get access to the API. Once you have access to the API, use the step-by-step guide to create an app and project. Remember to copy the keys into a txt file on your local machine.
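The extraction code itself is not reproduced on this page; a minimal sketch using Tweepy's v2 `Client` might look like the following (the query terms and token handling here are assumptions for illustration, not the exact code used for this dataset):

```python
def build_query(companies):
    # Pure helper: search for any of the company names,
    # excluding retweets and restricting to English tweets.
    return "(" + " OR ".join(companies) + ") -is:retweet lang:en"

def fetch_tweets(bearer_token, query, max_results=50):
    # Deferred import so build_query works even without tweepy installed.
    import tweepy  # pip install tweepy
    client = tweepy.Client(bearer_token=bearer_token)
    resp = client.search_recent_tweets(query=query, max_results=max_results)
    return [tweet.text for tweet in (resp.data or [])]

# Example (requires the bearer token you saved from your Developer app):
# tweets = fetch_tweets(open("keys.txt").read().strip(),
#                       build_query(["Amazon", "Microsoft", "Google"]))
```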
Screenshot of extracted tweets:
3 Data Cleaning
Next, the extracted data was cleaned and prepared so that the model could be run on it. First, the text was cleaned, stemmed, and lemmatized. Then stopwords were removed, so that the vectorizer produces counts of the remaining words. For this, a toolbox (check Text.ipynb) was implemented which makes use of polymorphism and inheritance. The toolbox has a parent class called Datasets and a sub-class called Texdataset. This toolbox can clean any sort of text data and builds a wordcloud for it.
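The real toolbox lives in Text.ipynb; as a rough illustration of the kind of cleaning described above, a minimal version might look like this (the tiny hard-coded stopword list and crude suffix-stripping stemmer are stand-ins for the actual stemmer/lemmatizer, not the code used here):

```python
import re

STOPWORDS = {"a", "an", "the", "is", "at", "of", "to", "and", "in", "for"}

def crude_stem(word):
    # Very rough stand-in for a real stemmer: strip common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs first, while intact
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]
```

The URL pattern is applied before punctuation stripping so that links are removed whole rather than leaving fragments like "https" behind.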
Screenshot of the Cleaned tweets:
4 Histogram of labels
It can be seen that the labels are fairly balanced. A train-test split will therefore work well on this data, and a naive Bayes model can be implemented.
5 Types of Naive Bayes Classifier
5.1 Multinomial Naive Bayes
This is mostly used for document classification problems, i.e., whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
5.2 Bernoulli Naive Bayes
This is similar to multinomial naive Bayes, but the predictors are boolean variables. The parameters used to predict the class variable take only yes/no values, for example whether a word occurs in the text or not.
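Turning word counts into the boolean presence/absence features that Bernoulli naive Bayes expects is just a thresholding step; a sketch (not the code used on this page):

```python
def binarize(counts):
    # Map word counts to 1 (word occurs in the text) / 0 (word absent).
    return {word: 1 if c > 0 else 0 for word, c in counts.items()}
```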
5.3 Gaussian Naive Bayes
When the predictors take continuous rather than discrete values, we assume that these values are sampled from a Gaussian distribution.
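Under that assumption, the likelihood of a feature value given a class is the Gaussian density evaluated with that class's mean and variance (estimated from the training data). A minimal sketch:

```python
import math

def gaussian_likelihood(x, mean, var):
    # P(x | class) under a normal distribution fitted to that class:
    # exp(-(x - mean)^2 / (2 * var)) / sqrt(2 * pi * var)
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```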
Naive Bayes has been implemented with the alpha value set to 1. This means the Laplace smoothing parameter has been set to 1.
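To make the role of alpha concrete, here is a tiny from-scratch multinomial naive Bayes with Laplace smoothing, run on a toy corpus (illustrative data, not the tweets above). With alpha = 1, a word never seen in a class still gets a small nonzero probability instead of zeroing out the whole product:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: parallel list of class labels.
    vocab = {w for doc in docs for w in doc}
    counts = defaultdict(Counter)   # per-class word counts
    class_totals = Counter(labels)  # per-class document counts
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    model = {}
    for y in class_totals:
        total = sum(counts[y].values())
        # Laplace-smoothed log-likelihood: (count + alpha) / (total + alpha * |V|)
        model[y] = {
            "prior": math.log(class_totals[y] / len(docs)),
            "loglik": {w: math.log((counts[y][w] + alpha) / (total + alpha * len(vocab)))
                       for w in vocab},
            "unseen": math.log(alpha / (total + alpha * len(vocab))),
        }
    return model

def predict(model, doc):
    def score(y):
        m = model[y]
        return m["prior"] + sum(m["loglik"].get(w, m["unseen"]) for w in doc)
    return max(model, key=score)
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer documents.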
Python Code for NB : Python NB code
Python HTML : Labeled Text Data NB
6 Naive Bayes Key Features
It is important to have a look at the features that the Naive Bayes classifier considered. Following are the plots of the features, i.e., the words, for each of the labels.
6.1 Amazon
6.2 Microsoft
6.3 Google
The wordcloud for every firm describes various aspects such as work location, salary, designation of employees, type of company, etc. For example, if we take a look at the wordcloud for Microsoft, we can see that employees can work remotely or at some of the mentioned office locations such as London, UK. Certain technologies, such as Cloud, are also displayed.
7 Confusion Matrix for NB
8 Classification report for NB
9 Density Plots
Below you can find the probability density plots of all the features. Note the label encoding:
- 0 refers to Amazon
- 1 refers to Microsoft
- 2 refers to Google
9.1 Alpha = 0
9.2 Alpha = 1
9.3 Alpha = 5
10 Conclusion/Inferences
It can be seen in the confusion matrix that 3 Amazon results were correctly predicted, 1 Microsoft result was correctly predicted, and 1 Google result was correctly predicted. The classification report gives the precision, recall, and F1 score of the classifier. The accuracy of the model as per the report is 88%.
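The figures in the classification report can be recomputed directly from a confusion matrix; a generic sketch follows (the matrix in the test is hypothetical, not the one shown in the figure above):

```python
def accuracy(cm):
    # cm[row][col]: true class = row, predicted class = col.
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def per_class_metrics(cm, i):
    # Precision, recall, and F1 for class i of the confusion matrix.
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(len(cm))) - tp   # predicted i, wrongly
    fn = sum(cm[i]) - tp                              # true i, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```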