---
title: SVM in Python with Labeled Text Data
---

To view the HTML version of the complete SVM Python code, click here: <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Techniques/SVM/Python/SVM_Python.ipynb" target="_blank">Labeled Text Data SVM</a>

<b>This page focuses on SVM for text data and looks into SVM classification, kernel selection, and how to build and optimize an SVM classifier using Python.</b>

The support vector machine is a simple algorithm that every machine learning practitioner should have in their arsenal. It is highly preferred by many because it produces significant accuracy with less computational power. The Support Vector Machine, abbreviated SVM, can be used for both regression and classification tasks.

# What is Support Vector Machine?

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

<img src="./images/svm_intro.png" style="width:1000px;" align="center">

To separate two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane with the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
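As a minimal illustration (using synthetic 2-D points, not this project's data), a linear SVM can be fit and its maximum-margin hyperplane inspected:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D (synthetic, for illustration only)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM finds the maximum-margin hyperplane w.x + b = 0
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)  # width of the margin between the two classes
print("w =", w, "b =", b, "margin width =", round(margin, 3))
```

The support vectors (`clf.support_vectors_`) are exactly the points that sit on the margin boundaries.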

# Data Science Questions
1. What do people think about some of the best companies in the industry: Amazon, Microsoft and Google?
2. What does working at one of the MAANG companies feel like?
3. Should employees prefer to work for such firms?

## Setting the objective
To check whether past statements about the firms mentioned above can be used to classify unlabeled statements, and to predict whether an employee should work for such an MNC.

# Dataset Used
Tweepy is a Python library for accessing the Twitter API. You need to sign up for a Twitter Developer account (the <u><a href="https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api" target="_blank">Twitter setup page</a></u> will guide you through the process) in order to extract tweets from Twitter and use that data. You will need to create an app to get access to the API; once you have access, follow the <u><a href="https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2" target="_blank">step-by-step guide</a></u> to create an app and project. Remember to copy the keys into a text file on your local machine.

Screenshot of extracted tweets:

<img src="./images/tweets_initial.png">

# Data Cleaning

Next, the extracted data has been cleaned and prepared for the model. First, the text has been cleaned, stemmed and lemmatized; then stopwords have been removed so that a vectorizer can produce clean word counts. For this, a toolbox (<u><a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Techniques/Text.ipynb" target="_blank">Text.ipynb</a></u>) has been implemented, making use of polymorphism and inheritance. The toolbox has a parent class called Datasets and a sub-class called Texdataset. It can clean any sort of text data and generate a word cloud for it.
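A simplified sketch of such a cleaning step (a stand-in for the Text.ipynb toolbox, not its actual code; the regexes and sklearn stop-word list here are assumptions, and stemming/lemmatization are omitted for brevity):

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_tweet(text: str) -> str:
    """Lowercase, strip URLs/mentions/punctuation, drop English stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|@\w+", " ", text)  # remove URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only (drops '#', '!', digits)
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("Loving the culture at @Google! https://t.co/xyz #dreamjob"))
```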

Screenshot of the Cleaned tweets:

<img src="./images/tweets_cleaned.png" style="width:1000px;" align="center">

# Histogram of Labels

<img src="./images/histogram_of_labels.png" style="width:1000px;" align="center">

It can be seen that all the labels are fairly balanced. Thus a train-test split will work well on this data, and an SVM model can be implemented.

It should also be noted that the labels were label-encoded and that a CountVectorizer was used, with stop_words='english' passed as an argument.
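The encoding and vectorization step might look like the following sketch (the example tweets are invented placeholders, not the project's data):

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

# Tiny synthetic stand-ins for the cleaned tweets and their company labels
tweets = ["great place to work", "shipping fast cloud products", "search quality improving"]
labels = ["amazon", "microsoft", "google"]

# Encode the string labels as integers (classes are sorted alphabetically)
le = LabelEncoder()
y = le.fit_transform(labels)
print(dict(zip(le.classes_, range(len(le.classes_)))))

# Bag-of-words features with English stopwords removed
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
print(X.shape)  # (n_tweets, vocabulary_size)
```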

# SVM: Types of Kernels

Four types of kernels have been used to implement the SVM classification algorithm on the labeled text dataset. These are as follows:

1. Linear
2. Polynomial
3. RBF
4. Sigmoid
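On hypothetical synthetic features standing in for the vectorized tweets (three classes, as in this project), the four kernels can be compared with a loop like this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized tweet features
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit one SVC per kernel and compare held-out accuracy
scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    scores[kernel] = SVC(kernel=kernel).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{kernel:8s} accuracy: {scores[kernel]:.2f}")
```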

## Linear Kernel

The linear kernel is used when the data is linearly separable, i.e., when it can be separated by a single hyperplane (a straight line in two dimensions). It is one of the most commonly used kernels, especially when a dataset has a large number of features.

Tuning has been performed to get the best cost for the kernel.

Tuned Cost : 1
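The tuning can be sketched as a cross-validated grid search over the cost parameter C (synthetic data stands in for the vectorized tweets, and the candidate grid is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; in the project the inputs are the vectorized tweets
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# 5-fold cross-validated search over the cost parameter C
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 5, 10]}, cv=5)
grid.fit(X, y)
print("best cost:", grid.best_params_["C"])
```

The same search, with the kernel swapped out, yields the tuned costs quoted for the other kernels below.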

### Classification Report

<img src="./images/svm_linear_classification_report.png">

### Confusion Matrix

<img src="./images/svm_linear_confusion_matrix.png">

### Predicted Probabilities

<img src="./images/svm_linear_predicted_prob.png">

### Inference and comparison with other kernels

It can be seen in the confusion matrix that 1 amazon result, 4 google results and 5 microsoft results were correctly predicted. The classification report gives the precision, recall and f1-score of the SVM. The accuracy of the model as per the report is 91%. The cost used is 1.
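Such a report and matrix can be reproduced in sklearn; the predictions below are hypothetical, chosen only to match the counts reported above (10 of 11 correct, approximately 91%):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical test labels/predictions consistent with the counts reported:
# 1 amazon, 4 google and 5 microsoft tweets predicted correctly, 1 miss
y_true = ["amazon", "amazon",
          "google", "google", "google", "google",
          "microsoft", "microsoft", "microsoft", "microsoft", "microsoft"]
y_pred = ["amazon", "google",
          "google", "google", "google", "google",
          "microsoft", "microsoft", "microsoft", "microsoft", "microsoft"]

print(confusion_matrix(y_true, y_pred, labels=["amazon", "google", "microsoft"]))
print(classification_report(y_true, y_pred))
print("accuracy:", round(accuracy_score(y_true, y_pred), 2))
```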

## Polynomial Kernel

The polynomial kernel is a kernel function in machine learning that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. It is commonly used with support vector machines (SVMs) and other kernelized models.
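Concretely, the polynomial kernel computes K(x, y) = (γ·xᵀy + r)ᵈ; a quick numeric check against sklearn's implementation (the point values and parameters here are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
gamma, coef0, degree = 0.5, 1.0, 3

# K(x, y) = (gamma * <x, y> + coef0) ** degree
manual = (gamma * x.dot(y) + coef0) ** degree
sk = polynomial_kernel(x.reshape(1, -1), y.reshape(1, -1),
                       degree=degree, gamma=gamma, coef0=coef0)[0, 0]
print(manual, sk)  # both 42.875 for these values
```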

Tuning has been performed to get the best cost for the kernel.

Tuned Cost : 5

### Classification Report

<img src="./images/svm_poly_classification_report.png">

### Confusion Matrix

<img src="./images/svm_poly_confusion_matrix.png">

### Predicted Probabilities

<img src="./images/svm_poly_predicted_prob.png">

### Inference and comparison with other kernels

It can be seen in the confusion matrix that 1 amazon result, 5 microsoft results and 4 google results were correctly predicted. The classification report gives the precision, recall and f1-score of the SVM. The accuracy of the model as per the report is 91%. The cost used is 5. The accuracy of this model matches that of the linear kernel; this may be because the dataset used is small.

## RBF (Radial Basis Function) Kernel

The RBF kernel is the most generalized form of kernelization and one of the most widely used kernels, due to its similarity to the Gaussian distribution. For two points X₁ and X₂, the RBF kernel function computes their similarity, i.e., how close they are to each other.
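Concretely, K(X₁, X₂) = exp(−γ‖X₁ − X₂‖²), which is 1 for identical points and decays toward 0 as they move apart; a quick check against sklearn (arbitrary points and γ):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 4.0])
gamma = 0.1

# K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
sk = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]
print(manual, sk)
```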

Tuning has been performed to get the best cost for the kernel.

Tuned Cost : 5

### Classification Report

<img src="./images/svm_rbf_classification_report.png">

### Confusion Matrix

<img src="./images/svm_rbf_confusion_matrix.png">

### Predicted Probabilities

<img src="./images/svm_rbf_predicted_prob.png">

### Inference and comparison with other kernels

It can be seen in the confusion matrix that 1 amazon result, 5 microsoft results and 4 google results were correctly predicted. The classification report gives the precision, recall and f1-score of the SVM. The accuracy of the model as per the report is 91%. The cost used is 5. The accuracy of this model matches that of the linear and polynomial kernels; this may be because the dataset used is small.

## Sigmoid Kernel

This kernel function is similar to a two-layer perceptron model of a neural network, with tanh serving as the activation function for the neurons. It can be written as:

Sigmoid Kernel Function:
K(x, xⱼ) = tanh(α·xᵀxⱼ + c)
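A quick check of this tanh form against sklearn's sigmoid_kernel (arbitrary points; α maps to sklearn's gamma parameter and c to coef0):

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

x = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
alpha, c = 0.2, 1.0  # slope and offset of the tanh

# K(x, xj) = tanh(alpha * <x, xj> + c)
manual = np.tanh(alpha * x.dot(xj) + c)
sk = sigmoid_kernel(x.reshape(1, -1), xj.reshape(1, -1), gamma=alpha, coef0=c)[0, 0]
print(manual, sk)
```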

Tuning has been performed to get the best cost for the kernel.

Tuned Cost : 5

### Classification Report

<img src="./images/svm_sigmoid_classification_report.png">

### Confusion Matrix

<img src="./images/svm_sigmoid_confusion_matrix.png">

### Predicted Probabilities

<img src="./images/svm_sigmoid_predicted_prob.png">

### Inference and comparison with other kernels

It can be seen in the confusion matrix that 1 amazon result, 5 microsoft results and 4 google results were correctly predicted. The classification report gives the precision, recall and f1-score of the SVM. The accuracy of the model as per the report is 91%. The cost used is 5. The accuracy of this model matches that of the linear, polynomial and RBF kernels; this may be because the dataset used is small.

# Conclusion/Inferences
It is interesting to see that SVM performs better than Naive Bayes (applied earlier) on this dataset; one reason could be the dataset's small size. A similar analysis will be carried out once a larger set of tweets becomes available.

<b>All kernels perform equally well:</b> as the confusion matrices show, all four achieve the same accuracy even with different costs, so any one of them could be chosen as the most suitable kernel for this dataset. The cost used for the linear kernel is 1; for the polynomial, RBF and sigmoid kernels, the cost used is 5.