Naïve Bayes (NB) in R with Labeled Record Data
To view the HTML version of the complete NB rmd code click here: Record Data NB
This page focuses on Naive Bayes for record data and covers Naive Bayes classification, attribute selection measures, and how to build and optimize a Naive Bayes classifier in R.
Naive Bayes is a straightforward method for building classifiers: models that assign class labels, drawn from a finite set, to problem instances represented as vectors of feature values. There is no single algorithm for training such classifiers, but rather a family of algorithms based on the same principle: all naive Bayes classifiers assume that the value of one feature is independent of the value of any other feature, given the class variable. For example, a fruit may be labeled an apple if it is red, round, and around 10 cm in diameter. Regardless of any possible relationships between the color, roundness, and diameter features, a naive Bayes classifier treats each of these features as contributing independently to the probability that this fruit is an apple.
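The apple example above can be worked through numerically. This is a minimal sketch in base R; every probability below is a made-up illustrative number, not an estimate from any dataset.

```r
# Priors over the two classes (made-up numbers for illustration)
p_apple  <- 0.5
p_orange <- 0.5

# Class-conditional likelihoods of the three observed features
# (red, round, ~10 cm), multiplied together under the "naive"
# assumption that they are independent given the class.
lik_apple  <- 0.80 * 0.90 * 0.70   # P(red|apple)  * P(round|apple)  * P(10cm|apple)
lik_orange <- 0.05 * 0.90 * 0.60   # P(red|orange) * P(round|orange) * P(10cm|orange)

# Bayes' rule: normalize the prior-times-likelihood products
post_apple <- (p_apple * lik_apple) /
  (p_apple * lik_apple + p_orange * lik_orange)
post_apple   # close to 1, so the fruit is classified as an apple
```

With these numbers the posterior for "apple" is about 0.95, so the classifier picks that label even though the individual features were scored without regard to each other.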
1 Data Science Questions Answered
- What factors are involved in predicting that the given job belongs to a private or a public sector?
- How much does working in the private or public sector affect the salary?
- Should employees work in the private or the public sector?
1.1 Setting the objective
The objective is to check whether the sector of a listed job can be predicted from attributes like Job Title, Company Rating, State, the firm's salary range, and the company size (number of employees).
2 Dataset Used
A dataset was collected using different websites during the Data Gathering Phase which can be found here.
Screenshot of the Initial Dataset:
3 Data Cleaning
Next, the dataset was cleaned and prepared so that our model could be run on it.
First, unneeded columns such as Index, Job Description, and Headquarters were removed. Minimum and maximum salaries and company sizes were extracted from the salary and size ranges, respectively. A 'Sector' column was created from the 'Type of Ownership' column, and a 'State' column was similarly created from the location.
We have chosen 3 labels for our analysis: Data Scientist, Data Engineer, and Data Analyst. These labels were chosen because they have the highest counts in the dataset, which gives the model enough examples to learn their respective features. Next, two separate data frames were created for the private and public sector records; after the required feature generation on each, they were merged back together. Other required data cleaning and preparation steps were also applied and can be seen in detail in the HTML version of the Rmd file.
R Code for data cleaning : Record Data Cleaning
Raw csv : Raw data.csv
Clean csv : Clean data.csv
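The cleaning steps described above can be sketched in base R on a hypothetical miniature of the raw dataset. The column names and range formats here are assumptions for illustration; the actual code is in the linked Record Data Cleaning file.

```r
# Hypothetical two-row stand-in for the raw dataset; real columns may differ
raw <- data.frame(
  Job.Title         = c("Data Scientist", "Data Analyst"),
  Salary.Estimate   = c("$50K-$90K", "$45K-$70K"),
  Size              = c("51 to 200 employees", "1001 to 5000 employees"),
  Location          = c("New York, NY", "Austin, TX"),
  Type.of.Ownership = c("Company - Private", "Company - Public"),
  stringsAsFactors  = FALSE
)

clean <- raw
# Extract numeric min/max salary (in $K) from the salary range string
clean$Min.Salary <- as.numeric(sub("^\\$(\\d+)K-.*$", "\\1", raw$Salary.Estimate))
clean$Max.Salary <- as.numeric(sub("^.*-\\$(\\d+)K$", "\\1", raw$Salary.Estimate))
# Extract min/max company size from the size range string
clean$Min.Size <- as.numeric(sub("^(\\d+) to .*$", "\\1", raw$Size))
clean$Max.Size <- as.numeric(sub("^\\d+ to (\\d+) employees$", "\\1", raw$Size))
# Derive State from Location and Sector from Type of Ownership
clean$State  <- sub("^.*,\\s*", "", raw$Location)
clean$Sector <- ifelse(grepl("Private", raw$Type.of.Ownership), "Private", "Public")
# Drop the raw columns the model does not need
clean$Salary.Estimate <- clean$Size <- NULL
clean$Location <- clean$Type.of.Ownership <- NULL
```

Each `sub()` call keeps only the captured group from the range string, which is how the 'Min'/'Max' salary and size columns are obtained from a single text field.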
Screenshot of clean data:
4 Histogram of labels
After splitting the data into train and test sets, we will implement the naive Bayes model.
5 Types of Naive Bayes Classifier
5.1 Multinomial Naive Bayes
This is mostly used for document classification problems, i.e. whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
5.2 Bernoulli Naive Bayes
This is similar to multinomial naive Bayes, but the predictors are Boolean variables. The parameters used to predict the class variable take only yes/no values, for example whether a word occurs in the text or not.
5.3 Gaussian Naive Bayes
When the predictors take continuous values rather than discrete ones, we assume that these values are sampled from a Gaussian distribution.
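Under the Gaussian assumption, each class contributes a normal density evaluated at the observed value. A minimal base-R sketch, using hypothetical per-class statistics for a continuous predictor like Company Rating (the means and standard deviations below are assumed, not fitted):

```r
# Hypothetical per-class statistics for the Rating predictor (assumed values)
mean_private <- 3.8; sd_private <- 0.6
mean_public  <- 3.4; sd_public  <- 0.5

# Class-conditional density of observing Rating = 3.9 under each class
lik_private <- dnorm(3.9, mean = mean_private, sd = sd_private)
lik_public  <- dnorm(3.9, mean = mean_public,  sd = sd_public)

# Each density enters the naive Bayes product for its class; here the
# observation sits closer to the private-class mean, so lik_private is larger
```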
Multinomial Naive Bayes has been used in the dataset shown above. Following are the types used for some of the key parameters:
Naive Bayes has been implemented with the alpha value set to 1, i.e. the Laplace smoothing parameter has been set to 1.
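What Laplace smoothing with alpha = 1 does can be shown with a hand computation in base R. The counts below are made-up illustrative numbers; in `e1071::naiveBayes` the same effect is obtained via the `laplace` argument.

```r
# Toy counts: how often each State appears within each Sector (assumed numbers)
counts <- matrix(c(5, 0,    # State = "NY": 5 private, 0 public
                   3, 4),   # State = "TX": 3 private, 4 public
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("NY", "TX"), c("Private", "Public")))

alpha <- 1                  # Laplace smoothing parameter (alpha = 1)
k <- nrow(counts)           # number of distinct values the feature can take

# Without smoothing, P(NY | Public) = 0/4 = 0, which zeroes out the entire
# naive Bayes product for the Public class whenever State = "NY"
p_unsmoothed <- counts["NY", "Public"] / sum(counts[, "Public"])

# With alpha = 1, every cell gets one pseudo-count, so no conditional
# probability is ever exactly zero: (0 + 1) / (4 + 1 * 2) = 1/6
p_smoothed <- (counts["NY", "Public"] + alpha) /
  (sum(counts[, "Public"]) + alpha * k)
```

This is why setting the Laplace parameter matters: without it, a single feature value never seen with a class in training would make that class impossible at prediction time.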
R Code for NB : NB code
R HTML : Record Data NB
6 Confusion Matrix for NB
7 Classification report for NB
8 Density Plots
8.1 Naive Bayes PDF(Title)
8.2 Naive Bayes PDF(Rating)
8.3 Naive Bayes PDF(State)
8.4 Naive Bayes PDF(Min Salary)
8.5 Naive Bayes PDF(Max Salary)
8.6 Naive Bayes PDF(Min Size)
8.7 Naive Bayes PDF(Max Size)
9 Conclusion/Inferences
Looking at the PDFs of the minimum and maximum sizes, it can be inferred that the plots are quite insightful with respect to real-world notions of firm size. On the lower side of the curve, where the minimum size is small, there is a high probability that the firm is a startup, and thus the predicted label would be private. As the minimum size increases, the prediction becomes more likely to be public. This matches real-life scenarios: a firm that goes public generally has many employees, whereas a private firm has comparatively fewer. One interesting observation is that the minimum and maximum salaries yield similar PDFs, which suggests that, statistically, private firms tend to offer a higher CTC in order to attract employees.
The confusion matrix shows that 14 public-firm results and 84 private-firm results were correctly predicted. The classification report gives the precision, recall, and F1 score of the classifier. According to the report, the accuracy of the model is 89.09%.
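The reported accuracy can be checked against the confusion-matrix counts. The total test-set size below is inferred from the reported 89.09% (98 correct / 0.8909 ≈ 110), not stated directly in the report, so treat it as an assumption.

```r
# Correct predictions reported in the confusion matrix
correct <- 14 + 84          # 14 public + 84 private correctly classified

# Test-set size implied by the reported 89.09% accuracy (an inference)
total <- 110

accuracy <- correct / total
round(100 * accuracy, 2)    # 89.09
```

This agrees with the classification report, which is a quick sanity check that the confusion matrix and the reported accuracy describe the same evaluation run.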