Employment Analysis


Decision Trees in Python with Labeled Record Data


This page focuses on decision trees for record data: Decision Tree Classification, attribute selection measures, and how to build and optimize a Decision Tree Classifier using the Python scikit-learn package.

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning. This flowchart-like structure aids decision making: its visualization reads like a flowchart and mimics human-level reasoning, which is why decision trees are easy to understand and interpret.
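As a minimal sketch of this idea (toy data invented for illustration, not drawn from our dataset), scikit-learn's DecisionTreeClassifier learns such a recursive partition directly from labelled rows:

```python
# Minimal sketch: fit a decision tree on toy labelled data and predict.
# The data here is invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]  # two toy features
y = [0, 1, 0, 0, 1, 1]                                # binary labels

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)                      # recursive partitioning happens here

print(clf.predict([[2, 2]]))       # -> [1]
```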

1 Data Science Questions

  1. What factors are involved in predicting that the given job belongs to a private or a public sector?
  2. How much does working in the private or public sector affect the salary?
  3. Should employees work in the private or the public sector?

1.1 Setting the objective

Checking whether the sector of a listed job can be predicted from attributes such as the job title, company rating, state, the firm's salary range, and the company size (number of employees).

2 Dataset Used

A dataset was collected from several websites during the Data Gathering phase; it can be found here.

3 Data Cleaning

Next, the dataset was cleaned and prepared for modelling. First, unwanted columns such as Index, Job Description, and Headquarters were removed. 'Min' and 'Max' salaries and company sizes were extracted from the salary and size ranges, respectively. A 'Sector' column was created from the 'Type of Ownership' column, and similarly a 'State' column was created from the location.

We chose 3 job-title labels for our analysis: Data Scientist, Data Engineer, and Data Analyst. These labels were chosen because of their high counts in the dataset, which gives the model enough examples to learn their respective features. Next, two separate dataframes were made for the private- and public-sector records; after the required feature generation on both, they were merged back together. The remaining cleaning and preparation steps can be seen in detail in the HTML version of the ipynb file.

Code for data cleaning : Record Data Cleaning

Raw csv : Raw data.csv

Clean csv : Clean data.csv

4 Code

This report walks you through the code, but you can also find it in the GitHub repository linked below.

Python Code for DT: DT code

4.1 Importing the libraries

Code
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn import tree
from IPython.display import Image
import numpy as np
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.tree import plot_tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.tree import DecisionTreeClassifier

4.2 Load the clean csv file into a dataframe using Pandas

Code
# Load the cleaned csv file
df = pd.read_csv('./clean_all_data.csv')
df.head(10)
Index Title Rating Founded Sector State min.salary max.salary min.size max.size
0 1 Senior Data Scientist 3.5 2007 Private NY 111000.0 181000.0 501.0 1000.0
1 2 Data Scientist, Product Analytics 4.5 2008 Private NY 111000.0 181000.0 1001.0 5000.0
2 3 Data Analyst 3.4 2019 Private NJ 111000.0 181000.0 201.0 500.0
3 4 Director, Data Science 3.4 2007 Private NY 111000.0 181000.0 51.0 200.0
4 5 Data Scientist 2.9 1985 Private NY 111000.0 181000.0 201.0 500.0
5 6 Quantitative Researcher 4.4 1993 Private NY 111000.0 181000.0 51.0 200.0
6 7 AI Scientist 5.0 2018 Private NY 111000.0 181000.0 1.0 50.0
7 8 Quantitative Researcher 4.8 2000 Private NY 111000.0 181000.0 501.0 1000.0
8 9 Data Scientist 3.9 2014 Private NY 111000.0 181000.0 201.0 500.0
9 10 Data Scientist/Machine Learning 4.4 2011 Private NY 111000.0 181000.0 51.0 200.0
Code
# remove unwanted columns
df.drop(columns=['Index', 'Founded'], inplace=True)
# rearrange columns
df = df[['Title','Rating','State','min.salary','max.salary','min.size', 'max.size', 'Sector']]

# Subset the data frame for Binary Classification with Decision Trees
df = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]
df = df.loc[df['Sector'].isin(['Private', 'Public'])]

# use label encoding for categorical data
from sklearn.preprocessing import LabelEncoder
le_title = LabelEncoder()
le_state = LabelEncoder()
le_sector = LabelEncoder()
df['Title'] = le_title.fit_transform(df['Title'])
df['State'] = le_state.fit_transform(df['State'])

# Private - 0
# Public - 1
df['Sector'] = le_sector.fit_transform(df['Sector'])

df.head(10)
Title Rating State min.salary max.salary min.size max.size Sector
2 0 3.4 5 111000.0 181000.0 201.0 500.0 0
4 2 2.9 6 111000.0 181000.0 201.0 500.0 0
8 2 3.9 6 111000.0 181000.0 201.0 500.0 0
11 2 3.9 6 111000.0 181000.0 1001.0 5000.0 0
13 2 3.0 6 111000.0 181000.0 51.0 200.0 0
16 2 3.6 6 111000.0 181000.0 501.0 1000.0 1
22 2 4.4 6 111000.0 181000.0 51.0 200.0 0
24 2 4.1 6 111000.0 181000.0 1001.0 5000.0 1
28 2 4.3 6 120000.0 140000.0 51.0 200.0 0
30 2 4.3 6 120000.0 140000.0 201.0 500.0 0
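As a quick sanity check on the encoding above, the integer mapping a LabelEncoder learns can be recovered from its `classes_` attribute. This standalone toy example mirrors the `le_sector` encoder:

```python
# Standalone sketch: inspect the mapping a LabelEncoder learns.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Private', 'Public', 'Private'])

# classes_ is sorted; position in the array is the integer code
print({str(c): i for i, c in enumerate(le.classes_)})   # {'Private': 0, 'Public': 1}
print([str(s) for s in le.inverse_transform([0, 1])])   # ['Private', 'Public']
```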

4.3 Heatmap of the correlation matrix

Code
# HEAT-MAP FOR THE CORRELATION MATRIX
corr = df.corr();
print(corr.shape)
sns.set_theme(style="white")

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show();
(8, 8)

4.4 Histogram of Labels

Code
ax = df['Sector'].value_counts().plot(kind='bar',
                                    figsize=(14,8),
                                    title="Number for labels")
labels = ['Private', 'Public']
ax.xaxis.set_ticklabels(labels)
ax.set_xlabel("Labels")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')

It can be seen that the labels are fairly balanced. Thus a train-test split will work well on this data, and a decision tree model can be trained.
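A quick way to verify the balance claim, and to preserve the label proportions in the split, is `value_counts(normalize=True)` together with `train_test_split`'s `stratify` argument (standalone toy data here; the real analysis uses the 'Sector' column):

```python
# Standalone sketch: check label balance, then split with stratification.
import pandas as pd
from sklearn.model_selection import train_test_split

df_demo = pd.DataFrame({
    'feature': range(10),
    'Sector':  [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],  # 60% / 40% toy labels
})

# Relative frequency of each label
print(df_demo['Sector'].value_counts(normalize=True))

# stratify keeps the same label proportions in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo[['feature']], df_demo['Sector'],
    test_size=0.2, random_state=0, stratify=df_demo['Sector'])
print(sorted(y_te.tolist()))  # both classes represented in the test set
```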

4.5 Splitting the data into train and test set

Code
# MAKE DATA-FRAMES (or numpy arrays) (X,y) WHERE y="Sector" COLUMN and X="everything else"
X = df.drop(columns=['Sector'])
y = df['Sector']

# PARTITION THE DATASET INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2190)

4.6 Decision Trees: Entropy method

Shannon introduced the concept of entropy, which measures the impurity of an input set. In physics and mathematics, entropy refers to randomness or disorder in a system; in information theory, it refers to the impurity in a group of examples. Information gain is the decrease in entropy: it computes the difference between the entropy before a split and the weighted average entropy after splitting the dataset on a given attribute. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain, choosing the attribute A with the highest gain, Gain(A), as the splitting attribute at a node N. The first decision tree, based on the entropy criterion, is as follows:

Code
dt1_2 = DecisionTreeClassifier(criterion = "entropy", splitter = "best")
dt1_2.fit(X_train, y_train)
y_pred1_2 = dt1_2.predict(X_test)
print(classification_report(y_test, y_pred1_2))
print(confusion_matrix(y_test, y_pred1_2))
              precision    recall  f1-score   support

           0       0.84      0.89      0.87        65
           1       0.63      0.52      0.57        23

    accuracy                           0.80        88
   macro avg       0.74      0.71      0.72        88
weighted avg       0.79      0.80      0.79        88

[[58  7]
 [11 12]]
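To make the criterion concrete, here is a small worked computation of entropy and information gain for one candidate binary split (toy class counts, not taken from our data):

```python
# Worked example: entropy and information gain for a toy binary split.
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Parent node: 10 "Private", 10 "Public" labels
parent = entropy([10, 10])                       # 1.0 bit: maximally impure

# Candidate split -> children with class counts (8, 2) and (2, 8)
left, right = entropy([8, 2]), entropy([2, 8])
weighted = 0.5 * left + 0.5 * right              # children are equal-sized

info_gain = parent - weighted
print(round(parent, 3), round(weighted, 3), round(info_gain, 3))  # 1.0 0.722 0.278
```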

4.6.1 Visualize Confusion Matrix

Code
labels = ['Private', 'Public']
ax1=plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred1_2), annot=True, fmt='g', ax=ax1);

# labels, title and ticks
ax1.set_xlabel('Predicted labels');ax1.set_ylabel('True labels'); 
ax1.set_title('Confusion Matrix'); 
ax1.xaxis.set_ticklabels(labels); ax1.yaxis.set_ticklabels(labels);
plt.show()
plt.close()

4.6.2 Plot tree

Code
plt.figure(figsize = (20,20))
dec_tree = plot_tree(decision_tree=dt1_2,class_names=["Private","Public"],filled=True, rounded=True, fontsize=10, max_depth=6)

4.6.3 Inference and comparison with other trees

Looking at the decision tree plot, the root node splits on the feature at index 5 (min.size), and its majority class is "Private". The confusion matrix shows that 58 private firms and 12 public firms were correctly predicted. The classification report gives the precision, recall, and F1 score of the tree; the accuracy of the model as per the report is 80%.

4.7 Decision Trees: Using GINI Index

The fundamental decision tree algorithm CART (Classification and Regression Tree) uses the Gini method to create split points. The Gini index considers a binary split for each attribute and computes a weighted sum of the impurity of each partition. For a discrete-valued attribute, the subset that gives the minimum Gini index is selected as the splitting subset. For continuous-valued attributes, the strategy is to consider each pair of adjacent values as a possible split point and choose the point with the smaller Gini index. Overall, the attribute with the minimum Gini index is chosen as the splitting attribute. The second decision tree, based on the Gini criterion, is as follows:

Code
model = DecisionTreeClassifier(criterion= "gini", splitter = "best")
model.fit(X_train, y_train)
# USE THE MODEL TO MAKE PREDICTIONS FOR THE TRAINING AND TEST SET 
yp_train = model.predict(X_train)
yp_test = model.predict(X_test)
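As a small worked example of the criterion (toy class counts only), the Gini impurity of a parent node and the weighted impurity of a candidate split can be computed as:

```python
# Worked example: Gini impurity for a toy binary split.
def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Parent node: 10 "Private", 10 "Public"
parent = gini([10, 10])                 # 0.5: maximally impure for 2 classes

# Candidate split -> children with class counts (8, 2) and (2, 8)
weighted = 0.5 * gini([8, 2]) + 0.5 * gini([2, 8])

# CART chooses the split with the smallest weighted Gini impurity
print(parent, round(weighted, 2))       # 0.5 0.32
```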

4.7.1 Function to generate a confusion matrix and print necessary information

Code
def confusion_plot(y_data, y_pred):
    
    cm = confusion_matrix(y_data, y_pred)
    print('ACCURACY: {:.2f}'.format(accuracy_score(y_data, y_pred)))
    print('NEGATIVE RECALL (Y=0): {:.2f}'.format(recall_score(y_data, y_pred, pos_label=0)))
    print('NEGATIVE PRECISION (Y=0): {:.2f}'.format(precision_score(y_data, y_pred, pos_label=0)))
    print('POSITIVE RECALL (Y=1): {:.2f}'.format(recall_score(y_data, y_pred, pos_label=1)))
    print('POSITIVE PRECISION (Y=1): {:.2f}'.format(precision_score(y_data, y_pred, pos_label=1)))
    print(cm)

    labels = ['Private', 'Public']
    ax1=plt.subplot()
    sns.heatmap(cm, annot=True, fmt='g', ax=ax1);

    # labels, title and ticks
    ax1.set_xlabel('Predicted labels');ax1.set_ylabel('True labels'); 
    ax1.set_title('Confusion Matrix'); 
    ax1.xaxis.set_ticklabels(labels); ax1.yaxis.set_ticklabels(labels);
    plt.show()
    plt.close()

4.7.2 Confusion Matrix

Code
print("------TRAINING------")
confusion_plot(y_train,yp_train)
print("------TEST------")
confusion_plot(y_test,yp_test)
------TRAINING------
ACCURACY: 1.00
NEGATIVE RECALL (Y=0): 1.00
NEGATIVE PRECISION (Y=0): 1.00
POSITIVE RECALL (Y=1): 1.00
POSITIVE PRECISION (Y=1): 1.00
[[264   0]
 [  0  87]]

------TEST------
ACCURACY: 0.82
NEGATIVE RECALL (Y=0): 0.92
NEGATIVE PRECISION (Y=0): 0.85
POSITIVE RECALL (Y=1): 0.52
POSITIVE PRECISION (Y=1): 0.71
[[60  5]
 [11 12]]

4.7.3 Classification Report

Code
print(classification_report(y_test, yp_test))
              precision    recall  f1-score   support

           0       0.85      0.92      0.88        65
           1       0.71      0.52      0.60        23

    accuracy                           0.82        88
   macro avg       0.78      0.72      0.74        88
weighted avg       0.81      0.82      0.81        88

4.7.4 Function to visualize the decision tree

Code
# NOTE: this helper shadows sklearn.tree's plot_tree imported above;
# it calls tree.plot_tree internally, so the original remains reachable.
def plot_tree(model, X, Y):
    plt.figure(figsize=(20, 20))
    tree.plot_tree(model, class_names=["Private", "Public"], filled=True,
                   feature_names=X.columns, rounded=True, fontsize=10, max_depth=6)
    plt.show()

plot_tree(model, X_train, y_train)

4.7.5 Inference and comparison with other trees

Looking at the decision tree plot, the root node splits on min.size (min.size <= 7500), and its majority class is "Private". The confusion matrix shows that 60 private firms were correctly predicted (up from 58 with the entropy method) and 12 public firms were correctly predicted (the same as with entropy). The classification report gives the precision, recall, and F1 score of the tree; the accuracy of the model as per the report is 82%, higher than the entropy method's 80%.

4.8 Hyper-parametric tuning

Parameters are values the model learns from the data, whereas hyperparameters are arguments accepted by the model-building function that can be modified to reduce overfitting and improve the model's generalization. The process of calibrating a model by finding the right hyperparameters is called hyperparameter tuning. We will tune the max_depth hyperparameter of our decision tree.

Code
# LOOP OVER POSSIBLE HYPER-PARAMETERS VALUES
test_results=[]
train_results=[]

for num_layer in range(1,20):
    model = tree.DecisionTreeClassifier(max_depth=num_layer)
    model = model.fit(X_train, y_train)

    yp_train=model.predict(X_train)
    yp_test=model.predict(X_test)

    # print(y_pred.shape)
    test_results.append([num_layer,accuracy_score(y_test, yp_test),recall_score(y_test, yp_test,pos_label=0),recall_score(y_test, yp_test,pos_label=1)])
    train_results.append([num_layer,accuracy_score(y_train, yp_train),recall_score(y_train, yp_train,pos_label=0),recall_score(y_train, yp_train,pos_label=1)])

4.8.1 Generate the plots

Code
plt.plot([x[0] for x in test_results],[x[1] for x in test_results],label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results],[x[1] for x in train_results],label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('ACCURACY (Y=0): Training (blue) and Test (red)')
plt.show()

plt.plot([x[0] for x in test_results],[x[2] for x in test_results],label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results],[x[2] for x in train_results],label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=0): Training (blue) and Test (red)')
plt.show()

plt.plot([x[0] for x in test_results],[x[3] for x in test_results],label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results],[x[3] for x in train_results],label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=1): Training (blue) and Test (red)')
plt.show()
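Instead of the manual loop above, the same search can be cross-validated with scikit-learn's GridSearchCV. This is a minimal sketch on synthetic data for self-containment; in our analysis it would be fit on X_train and y_train:

```python
# Sketch: cross-validated grid search over max_depth on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=7,
                                     random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': list(range(1, 20))},
    cv=5,                      # 5-fold cross-validation
    scoring='accuracy')
search.fit(X_demo, y_demo)

# Best depth and its mean cross-validated accuracy
print(search.best_params_, round(search.best_score_, 3))
```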

4.9 Decision Trees: Using Max Depth

How does the max_depth parameter affect the model? How do high or low values of max_depth influence how accurately the test data is predicted?

'max_depth' is exactly what the name suggests: the maximum depth you allow the tree to grow to. The deeper you allow it, the more complex the model becomes. For training error it is easy to see what happens: if you increase max_depth, training error will always go down (or at least not go up). For test error it is less obvious. If you set max_depth too high, the decision tree may simply overfit the training data without capturing the useful patterns we want, causing test error to increase. But setting it too low is not good either: the tree then has too little flexibility to capture the patterns and interactions in the training data, which also causes test error to increase. There is a sweet spot between the too-high and too-low extremes. Usually, the modeller treats max_depth as a hyperparameter and uses some sort of grid or random search with cross-validation to find a good value. The third decision tree, built with a limited max depth, is as follows:

Code
#### TRAIN A SKLEARN DECISION TREE MODEL ON X_train,y_train 
model_maxdepth = DecisionTreeClassifier(max_depth=7, splitter = "best")
model_maxdepth.fit(X_train, y_train)

yp_train=model_maxdepth.predict(X_train)
yp_test=model_maxdepth.predict(X_test)

4.9.1 Confusion Matrix

Code
print("------TRAINING------")
confusion_plot(y_train,yp_train)
print()
print("------TEST------")
confusion_plot(y_test,yp_test)
------TRAINING------
ACCURACY: 0.96
NEGATIVE RECALL (Y=0): 0.99
NEGATIVE PRECISION (Y=0): 0.96
POSITIVE RECALL (Y=1): 0.86
POSITIVE PRECISION (Y=1): 0.97
[[262   2]
 [ 12  75]]


------TEST------
ACCURACY: 0.84
NEGATIVE RECALL (Y=0): 0.92
NEGATIVE PRECISION (Y=0): 0.87
POSITIVE RECALL (Y=1): 0.61
POSITIVE PRECISION (Y=1): 0.74
[[60  5]
 [ 9 14]]

4.9.2 Classification Report

Code
print(classification_report(y_test, yp_test))
              precision    recall  f1-score   support

           0       0.87      0.92      0.90        65
           1       0.74      0.61      0.67        23

    accuracy                           0.84        88
   macro avg       0.80      0.77      0.78        88
weighted avg       0.83      0.84      0.84        88

4.9.3 Visualize the tree

Code
plot_tree(model_maxdepth, X_train, y_train)

4.9.4 Inference and comparison with other trees

Looking at the decision tree plot, the root node again splits on min.size (min.size <= 7500), and its majority class is "Private". The confusion matrix shows that 60 private firms and 14 public firms were correctly predicted; the count of correctly predicted public firms is the highest of the three trees. The classification report gives the precision, recall, and F1 score of the tree; the accuracy of the model as per the report is 84%, the best of the three.

It is very interesting to compare the 3 decision trees, which has been done above. One more interesting fact is that all three decision trees share a common root node.

The following code cell shows the comparison between all the decision trees:

Code
comparison_df = pd.DataFrame([[58, 12, 80], [60, 12, 82], [60, 14, 84]],
                             index=['DT1', 'DT2', 'DT3'],
                             columns=['Private-Correct Prediction', 'Public-Correct Prediction', 'Accuracy'])
comparison_df
Private-Correct Prediction Public-Correct Prediction Accuracy
DT1 58 12 80
DT2 60 12 82
DT3 60 14 84
Source Code
---
title: Decision Trees in Python with Labeled Record Data
---

<b>This page focuses on record data decision trees and will look into Decision Tree Classification, attribute selection measures, and how to build and optimize Decision Tree Classifier using Python Scikit-learn package.</b>

A decision tree is a flowchart-like tree structure where an internal node represents feature(or attribute), the branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. It learns to partition on the basis of the attribute value. It partitions the tree in recursively manner call recursive partitioning. This flowchart-like structure helps you in decision making. It's visualization like a flowchart diagram which easily mimics the human level thinking. That is why decision trees are easy to understand and interpret.

# Data Science Questions
1. What factors are involved in predicting that the given job belongs to a private or a public sector? 
2. How much does working in the private or public sector effects the salary? 
3. Should employees work in the private or the public sector?

## Setting the objective
Checking if the sector of a listed job can be predicted based on attributes like Job Title, Company Rating, State, salary range of the firm and the company size (number of employees).

# Dataset Used
A dataset was collected using different websites during the Data Gathering Phase which can be found <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/501/data/R/DataScientist.csv" target="_blank">here</a>.<br><br>

# Data Cleaning

Next, the dataset has been cleaned and prepped to be set in a specific way to run our model.
In this, first undesired columns like Index, Job Description, Headquarters, etc have been removed.
'Min' and 'Max' salaries and size of a company had been extracted from the salary and size range
respectively. 'Sector' column was created from 'Type of Ownership' column and similarly 'State' column
was created from location.

We have chosen 3 labels for the job title for our analysis : Data Scientist, Data Engineer and Data
Analyst. These labels are chosen on the basis of their counts in the dataset and therefore can train the model
well in learning their respective features. Next, 2 different dataframes have been made to
have the private and public sector representation and after some required feature generations on both of
these data frames, they have been merged back. Other required data cleaning and prepping steps have also been
applied, and can be seen in detail in the html version of the ipynb file.

Code for data cleaning : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Cleaning/R/Record-Data-Cleaning-in-R.Rmd" target="_blank">Record Data Cleaning</a>

Raw csv : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/DataScientist.csv" target="_blank">Raw data.csv</a>

Clean csv : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/clean_all_data.csv" target="_blank">Clean data.csv</a>

# Code
Although, this comprehensive report will walk you through the code but you can also find the code on the github repository for which the link has been provided below.

Python Code for DT: <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Techniques/Decision%20Trees/Python/Decision_Trees_python.ipynb" target="_blank">DT code</a>

## Importing the libraries

```{python}
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn import tree
from IPython.display import Image
import numpy as np
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.tree import plot_tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.tree import DecisionTreeClassifier
```

## Load the clean csv file into a dataframe using Pandas

```{python}
# Load the cleaned csv file
df = pd.read_csv('./clean_all_data.csv')
df.head(10)
```

```{python}
# remove unwanted columns
df.drop(columns=['Index', 'Founded'], inplace=True)
# rearrange columns
df = df[['Title','Rating','State','min.salary','max.salary','min.size', 'max.size', 'Sector']]

# Subset the data frame for Binary Classification with Decision Trees
df = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]
df = df.loc[df['Sector'].isin(['Private', 'Public'])]

# use label encoding for categorical data
from sklearn.preprocessing import LabelEncoder
le_title = LabelEncoder()
le_state = LabelEncoder()
le_sector = LabelEncoder()
df['Title'] = le_title.fit_transform(df['Title'])
df['State'] = le_state.fit_transform(df['State'])

# Private - 0
# Public - 1
df['Sector'] = le_sector.fit_transform(df['Sector'])

df.head(10)
```

## Heatmap of the correlation matrix

```{python}
# HEAT-MAP FOR THE CORRELATION MATRIX
corr = df.corr();
print(corr.shape)
sns.set_theme(style="white")

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show();
```

## Histogram of Labels

```{python}
ax = df['Sector'].value_counts().plot(kind='bar',
                                    figsize=(14,8),
                                    title="Number for labels")
labels = ['Private', 'Public']
ax.xaxis.set_ticklabels(labels)
ax.set_xlabel("Labels")
ax.set_ylabel("Frequency")
```

It can be seen that all the labels are very fairly balanced. Thus the test train split would work on this data and a naive bayes model can be implemented.

## Splitting the data into train and test set

```{python}
# MAKE DATA-FRAMES (or numpy arrays) (X,y) WHERE y="Sector" COLUMN and X="everything else"
X = df.drop(columns=['Sector'])
y = df['Sector']

# PARTITION THE DATASET INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2190)
```

## Decision Trees: Entropy method

Shannon invented the concept of entropy, which measures the impurity of the input set. In physics and mathematics, entropy referred as the randomness or the impurity in the system. In information theory, it refers to the impurity in a group of examples. Information gain is the decrease in entropy. Information gain computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values. ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N(). First decison tree based on entropy method is as follows:

```{python}
dt1_2 = DecisionTreeClassifier(criterion = "entropy", splitter = "best")
dt1_2.fit(X_train, y_train)
y_pred1_2 = dt1_2.predict(X_test)
print(classification_report(y_test, y_pred1_2))
print(confusion_matrix(y_test, y_pred1_2))
```

### Visualise Confusion Matrix

```{python}
labels = ['Private', 'Public']
ax1=plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred1_2), annot=True, fmt='g', ax=ax1);

# labels, title and ticks
ax1.set_xlabel('Predicted labels');ax1.set_ylabel('True labels'); 
ax1.set_title('Confusion Matrix'); 
ax1.xaxis.set_ticklabels(labels); ax1.yaxis.set_ticklabels(labels);
plt.show()
plt.close()
```

### Plot tree

```{python}
plt.figure(figsize = (20,20))
dec_tree = plot_tree(decision_tree=dt1_2,class_names=["Private","Public"],filled=True, rounded=True, fontsize=10, max_depth=6)
```

### Inference and comparison with other trees
After looking at the Decision Tree plot, it can be inferred that the root node is splitting based of the 5th index value and belongs to "Private" class. It can be seen in the confusion matrix that 59 private firms were correctly predicted and 11 public firms were correctly predicted as well. The classification report tells us the precision, recall and f1 value of the tree. The accuracy of the model as per the report is 78%.

## Decision Trees: Using GINI Index

A fundamental decision tree algorithm CART (Classification and Regression Tree) uses the Gini method to create split points. The Gini Index considers a binary split for each attribute and can compute a weighted sum of the impurity of each partition. In case of a discrete-valued attribute, the subset that gives the minimum gini index for that chosen is selected as a splitting attribute. In the case of continuous-valued attributes, the strategy is to select each pair of adjacent values as a possible split-point and point with smaller gini index chosen as the splitting point. The attribute with minimum Gini index is chosen as the splitting attribute. Second decison tree based on gini method is as follows:

```{python}
model = DecisionTreeClassifier(criterion= "gini", splitter = "best")
model.fit(X_train, y_train)
# USE THE MODEL TO MAKE PREDICTIONS FOR THE TRAINING AND TEST SET 
yp_train = model.predict(X_train)
yp_test = model.predict(X_test)
```

### Function to generate a confusion matrix and print necessary information

```{python}
def confusion_plot(y_data, y_pred):
    
    cm = confusion_matrix(y_data, y_pred)
    print('ACCURACY: {:.2f}'.format(accuracy_score(y_data, y_pred)))
    print('NEGATIVE RECALL (Y=0): {:.2f}'.format(recall_score(y_data, y_pred, pos_label=0)))
    print('NEGATIVE PRECISION (Y=0): {:.2f}'.format(precision_score(y_data, y_pred, pos_label=0)))
    print('POSITIVE RECALL (Y=1): {:.2f}'.format(recall_score(y_data, y_pred, pos_label=0)))
    print('POSITIVE PRECISION (Y=1): {:.2f}'.format(precision_score(y_data, y_pred, pos_label=1)))
    print(cm)

    labels = ['Private', 'Public']
    ax1=plt.subplot()
    sns.heatmap(cm, annot=True, fmt='g', ax=ax1);

    # labels, title and ticks
    ax1.set_xlabel('Predicted labels');ax1.set_ylabel('True labels'); 
    ax1.set_title('Confusion Matrix'); 
    ax1.xaxis.set_ticklabels(labels); ax1.yaxis.set_ticklabels(labels);
    plt.show()
    plt.close()
```

### Confusion Matrix

```{python}
print("------TRAINING------")
confusion_plot(y_train,yp_train)
print("------TEST------")
confusion_plot(y_test,yp_test)
```

### Classification Report

```{python}
print(classification_report(y_test, yp_test))
```

### Function to visualize the decision tree

```{python}
def plot_tree(model, X, Y):
    plt.figure(figsize=(20, 20))
    tree.plot_tree(model, class_names=["Private","Public"], filled=True, feature_names=X.columns, rounded=True, fontsize=10, max_depth=6)
    plt.show()
    
plot_tree(model, X_train, y_train)
```

### Inference and comparison with other trees

After looking at the Decision Tree plot, it can be inferred that the root node is splitting based on the min size value (if min size <= 7500) and belongs to "Private" class. It can be seen in the confusion matrix that 62 private firms were correctly predicted and 13 public firms were correctly predicted as well which is higher than the entropy method. The classification report tells us the precision, recall and f1 value of the tree. The accuracy of the model as per the report is 85% which is again more than the entropy method.

## Hyper-parametric tuning

Parameters are the values that a model learns from the training data. Hyperparameters, by contrast, are arguments accepted by the model-making function; they can be adjusted to reduce overfitting and thereby improve the model's generalization. The process of calibrating a model by finding the right hyperparameters is called hyperparameter tuning. We will tune the max_depth hyperparameter of our decision tree.
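Beyond the manual loop used below, scikit-learn can automate this search. The sketch that follows is a minimal, self-contained example using `GridSearchCV` with 5-fold cross-validation; the synthetic data from `make_classification` merely stands in for this notebook's `X_train` and `y_train`.

```{python}
# Hedged sketch: tuning max_depth with GridSearchCV and cross-validation.
# Synthetic data stands in for the notebook's X_train / y_train.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": range(1, 20)},
    cv=5,                  # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The cross-validated score guards against picking a depth that merely fits one particular train/test split.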

```{python}
# LOOP OVER POSSIBLE HYPER-PARAMETERS VALUES
test_results=[]
train_results=[]

for num_layer in range(1,20):
    model = tree.DecisionTreeClassifier(max_depth=num_layer)
    model = model.fit(X_train, y_train)

    yp_train=model.predict(X_train)
    yp_test=model.predict(X_test)

    # print(y_pred.shape)
    test_results.append([num_layer,accuracy_score(y_test, yp_test),recall_score(y_test, yp_test,pos_label=0),recall_score(y_test, yp_test,pos_label=1)])
    train_results.append([num_layer,accuracy_score(y_train, yp_train),recall_score(y_train, yp_train,pos_label=0),recall_score(y_train, yp_train,pos_label=1)])
```

### GENERATE THE PLOTS

```{python}
plt.plot([x[0] for x in test_results],[x[1] for x in test_results],label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results],[x[1] for x in train_results],label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('ACCURACY: Training (blue) and Test (red)')
plt.show()

plt.plot([x[0] for x in test_results],[x[2] for x in test_results],label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results],[x[2] for x in train_results],label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=0): Training (blue) and Test (red)')
plt.show()

plt.plot([x[0] for x in test_results],[x[3] for x in test_results],label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results],[x[3] for x in train_results],label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=1): Training (blue) and Test (red)')
plt.show()
```
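The depth with the best test accuracy can also be read off programmatically from the results gathered above. Each row of `test_results` is `[max_depth, accuracy, recall_0, recall_1]`; the snippet below uses made-up sample rows (`results_demo`) so that it runs on its own.

```{python}
# Hedged sketch: pick the depth with the highest test accuracy.
# results_demo holds hypothetical rows of [max_depth, accuracy, recall_0, recall_1].
results_demo = [[1, 0.78, 0.90, 0.40], [2, 0.85, 0.92, 0.55], [3, 0.83, 0.88, 0.60]]

best_depth, best_acc, *_ = max(results_demo, key=lambda row: row[1])
print(f"best max_depth = {best_depth} (test accuracy {best_acc:.2f})")
```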

## Decision Trees: Using Max Depth

How does the max_depth parameter affect the model? How does a high or low max_depth help in predicting the test data more accurately?

`max_depth` is exactly what the name suggests: the maximum depth to which you allow the tree to grow. The deeper you allow it, the more complex the model becomes. For training error the effect is easy to see: increasing max_depth never increases the training error (it goes down or stays flat). For test error, the picture is less obvious. If max_depth is set too high, the decision tree can overfit the training data, memorizing noise rather than capturing generalizable patterns, and the test error increases. If it is set too low, the tree has too little flexibility to capture the patterns and interactions in the training data, and the test error again increases. There is a sweet spot between these two extremes. In practice, the modeller treats max_depth as a hyperparameter and uses some sort of grid or random search with cross-validation to find a good value.

The third decision tree, based on the max-depth method, is built as follows:


```{python}
#### TRAIN A SKLEARN DECISION TREE MODEL ON X_train,y_train 
model_maxdepth = DecisionTreeClassifier(max_depth=7, splitter = "best")
model_maxdepth.fit(X_train, y_train)

yp_train=model_maxdepth.predict(X_train)
yp_test=model_maxdepth.predict(X_test)
```

### Confusion Matrix

```{python}
print("------TRAINING------")
confusion_plot(y_train,yp_train)
print()
print("------TEST------")
confusion_plot(y_test,yp_test)
```

### Classification Report
```{python}
print(classification_report(y_test, yp_test))
```

### Visualize the tree
```{python}
plot_tree(model_maxdepth, X_train, y_train)
```

### Inference and comparison with other trees

The decision tree plot shows that the root node again splits on the min size value (min size <= 7500) and its majority class is "Private". The confusion matrix shows that 62 private firms and 13 public firms were correctly predicted, higher than the entropy method and matching the gini method. The classification report gives the precision, recall, and F1 score of the tree; the reported accuracy is 85%, the same as with the gini method.

It is very interesting to compare the three decision trees built above. One more interesting fact is that all three decision trees share a common root-node split.<br>

The following code cell compares all three decision trees:

```{python}
comparison_df = pd.DataFrame(
    [[59, 10, 78], [62, 13, 85], [62, 13, 85]],
    index=['DT1', 'DT2', 'DT3'],
    columns=['Private-Correct Prediction', 'Public-Correct Prediction', 'Accuracy'])
comparison_df
```
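Rather than typing these numbers by hand, each row of the comparison table can be derived directly from a fitted model's confusion matrix: the diagonal entries are the correctly predicted counts per class. The sketch below is self-contained and uses synthetic data; in this notebook you would instead pass each of the three trained trees along with `X_test` and `y_test`.

```{python}
# Hedged sketch: build one comparison row [correct class 0, correct class 1,
# accuracy %] from a fitted classifier instead of hardcoding the values.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def comparison_row(model, X, y):
    y_pred = model.predict(X)
    cm = confusion_matrix(y, y_pred)
    # Diagonal entries of the confusion matrix are per-class correct counts.
    return [cm[0, 0], cm[1, 1], round(100 * accuracy_score(y, y_pred))]

# Synthetic stand-in for the notebook's data and models.
X_syn, y_syn = make_classification(n_samples=200, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
print(comparison_row(clf, X_te, y_te))
```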