This page focuses on decision trees for record data: Decision Tree Classification, attribute selection measures, and how to build and optimize a Decision Tree Classifier using the Python scikit-learn package.
A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning. This flowchart-like structure aids decision making: its visualization resembles a flowchart diagram and mimics human-level reasoning, which is why decision trees are easy to understand and interpret.
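To make the structure concrete, here is a minimal sketch on toy data (not the report's dataset; the feature names are invented for illustration) that fits a small tree with scikit-learn and prints its flowchart-like rules:

```python
# Toy sketch (not the report's data): fit a small tree and print its
# flowchart-like rules -- root node, decision-rule branches, leaf outcomes.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [company rating, minimum salary]; 0 = Private, 1 = Public
X = [[3.0, 50000], [4.5, 120000], [2.8, 60000], [4.1, 110000]]
y = [1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["rating", "min_salary"]))
```

The printed rules read top-down exactly like the flowchart described above: the first test is the root node, each indented branch is a decision rule, and each `class:` line is a leaf outcome.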
1 Data Science Questions
1. What factors help predict whether a given job belongs to the private or the public sector?
2. How much does working in the private or public sector affect the salary?
3. Should employees work in the private or the public sector?
1.1 Setting the objective
Checking whether the sector of a listed job can be predicted from attributes such as Job Title, Company Rating, State, the firm's salary range, and the company size (number of employees).
2 Dataset Used
A dataset was collected from different websites during the Data Gathering Phase and can be found here.
3 Data Cleaning
Next, the dataset was cleaned and prepped for modeling. First, unwanted columns such as Index, Job Description, and Headquarters were removed. 'Min' and 'Max' salaries and company sizes were extracted from the salary and size ranges, respectively. A 'Sector' column was created from the 'Type of Ownership' column, and similarly a 'State' column was created from the location.
We chose three job-title labels for our analysis: Data Scientist, Data Engineer, and Data Analyst. These labels were chosen on the basis of their counts in the dataset, so the model can learn their respective features well. Next, two separate dataframes were made to represent the private and public sectors; after some required feature generation on both, they were merged back together. Other required data cleaning and prepping steps were also applied and can be seen in detail in the HTML version of the ipynb file.
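The range-extraction step described above can be sketched in pandas as follows. This is a hypothetical illustration: the column name and the range format (`"$111K-$181K"`) are assumptions, not the actual raw data, which was cleaned separately.

```python
# Hypothetical sketch of extracting 'Min' and 'Max' salaries from a salary
# range string; column name and format are assumed for illustration.
import pandas as pd

raw = pd.DataFrame({"Salary Estimate": ["$111K-$181K", "$120K-$140K"]})
bounds = raw["Salary Estimate"].str.extract(r"\$(\d+)K-\$(\d+)K")
raw["min.salary"] = bounds[0].astype(float) * 1000
raw["max.salary"] = bounds[1].astype(float) * 1000
print(raw[["min.salary", "max.salary"]])
```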
Although this comprehensive report will walk you through the code, you can also find the code in the GitHub repository linked below.
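4.1 Importing the libraries

The notebook's import cell did not survive rendering; it is restored here from the source (with a duplicate numpy import removed):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree
from IPython.display import Image
from sklearn.metrics import precision_score, recall_score, confusion_matrix
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import plot_tree, DecisionTreeClassifier
```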
4.2 Load the clean csv file into a dataframe using Pandas
Code
```python
# Load the cleaned csv file
df = pd.read_csv('./clean_all_data.csv')
df.head(10)
```
| | Index | Title | Rating | Founded | Sector | State | min.salary | max.salary | min.size | max.size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Senior Data Scientist | 3.5 | 2007 | Private | NY | 111000.0 | 181000.0 | 501.0 | 1000.0 |
| 1 | 2 | Data Scientist, Product Analytics | 4.5 | 2008 | Private | NY | 111000.0 | 181000.0 | 1001.0 | 5000.0 |
| 2 | 3 | Data Analyst | 3.4 | 2019 | Private | NJ | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 3 | 4 | Director, Data Science | 3.4 | 2007 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
| 4 | 5 | Data Scientist | 2.9 | 1985 | Private | NY | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 5 | 6 | Quantitative Researcher | 4.4 | 1993 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
| 6 | 7 | AI Scientist | 5.0 | 2018 | Private | NY | 111000.0 | 181000.0 | 1.0 | 50.0 |
| 7 | 8 | Quantitative Researcher | 4.8 | 2000 | Private | NY | 111000.0 | 181000.0 | 501.0 | 1000.0 |
| 8 | 9 | Data Scientist | 3.9 | 2014 | Private | NY | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 9 | 10 | Data Scientist/Machine Learning | 4.4 | 2011 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
Code
```python
# remove unwanted columns
df.drop(columns=['Index', 'Founded'], inplace=True)
# rearrange columns
df = df[['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size', 'Sector']]
# Subset the data frame for Binary Classification with Decision Trees
df = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]
df = df.loc[df['Sector'].isin(['Private', 'Public'])]
# use label encoding for categorical data
from sklearn.preprocessing import LabelEncoder
le_title = LabelEncoder()
le_state = LabelEncoder()
le_sector = LabelEncoder()
df['Title'] = le_title.fit_transform(df['Title'])
df['State'] = le_state.fit_transform(df['State'])
# Private - 0
# Public - 1
df['Sector'] = le_sector.fit_transform(df['Sector'])
df.head(10)
```
| | Title | Rating | State | min.salary | max.salary | min.size | max.size | Sector |
|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 3.4 | 5 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 4 | 2 | 2.9 | 6 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 8 | 2 | 3.9 | 6 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 11 | 2 | 3.9 | 6 | 111000.0 | 181000.0 | 1001.0 | 5000.0 | 0 |
| 13 | 2 | 3.0 | 6 | 111000.0 | 181000.0 | 51.0 | 200.0 | 0 |
| 16 | 2 | 3.6 | 6 | 111000.0 | 181000.0 | 501.0 | 1000.0 | 1 |
| 22 | 2 | 4.4 | 6 | 111000.0 | 181000.0 | 51.0 | 200.0 | 0 |
| 24 | 2 | 4.1 | 6 | 111000.0 | 181000.0 | 1001.0 | 5000.0 | 1 |
| 28 | 2 | 4.3 | 6 | 120000.0 | 140000.0 | 51.0 | 200.0 | 0 |
| 30 | 2 | 4.3 | 6 | 120000.0 | 140000.0 | 201.0 | 500.0 | 0 |
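A note on the label encoding above: scikit-learn's `LabelEncoder` assigns integer codes in sorted (alphabetical) order, which is why 'Private' maps to 0 and 'Public' to 1, and the mapping can be recovered later. A small sketch:

```python
# LabelEncoder assigns codes in sorted (alphabetical) order, so
# 'Private' -> 0 and 'Public' -> 1; inverse_transform recovers the labels.
from sklearn.preprocessing import LabelEncoder

le_sector = LabelEncoder()
codes = le_sector.fit_transform(["Private", "Public", "Private"])
print(dict(zip(le_sector.classes_, le_sector.transform(le_sector.classes_))))
print(le_sector.inverse_transform([0, 1]))
```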
4.3 Heatmap of the correlation matrix
Code
```python
# HEAT-MAP FOR THE CORRELATION MATRIX
corr = df.corr()
print(corr.shape)
sns.set_theme(style="white")
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, vmin=-1, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
```
(8, 8)
4.4 Histogram of Labels
Code
```python
ax = df['Sector'].value_counts().plot(kind='bar', figsize=(14, 8), title="Number for labels")
labels = ['Private', 'Public']
ax.xaxis.set_ticklabels(labels)
ax.set_xlabel("Labels")
ax.set_ylabel("Frequency")
```
Text(0, 0.5, 'Frequency')
It can be seen that the labels are fairly balanced, so a train-test split will work well on this data and a decision tree classifier can be fit.
4.5 Splitting the data into train and test set
Code
```python
# MAKE DATA-FRAMES (or numpy arrays) (X,y) WHERE y="Sector" COLUMN and X="everything else"
X = df.drop(columns=['Sector'])
y = df['Sector']
# PARTITION THE DATASET INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2190)
```
4.6 Decision Trees: Entropy method
Shannon introduced the concept of entropy, which measures the impurity of an input set. In physics and mathematics, entropy refers to the randomness or disorder in a system; in information theory, it refers to the impurity in a group of examples. Information gain is the decrease in entropy: the difference between the entropy before a split and the weighted average entropy after splitting the dataset on a given attribute's values. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain: the attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. The first decision tree, based on the entropy method, is as follows:
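This is the fitting cell from the accompanying notebook. In the report it runs on `X_train`/`y_train` from the split above; synthetic stand-in data is generated here only so the cell is self-contained.

```python
# Entropy-criterion decision tree (from the notebook). The report uses the
# real X_train/y_train; make_classification is a stand-in for portability.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

X, y = make_classification(n_samples=300, n_features=7, random_state=2190)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2190)

dt1_2 = DecisionTreeClassifier(criterion="entropy", splitter="best")
dt1_2.fit(X_train, y_train)
y_pred1_2 = dt1_2.predict(X_test)
print(classification_report(y_test, y_pred1_2))
print(confusion_matrix(y_test, y_pred1_2))
```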
From the Decision Tree plot it can be inferred that the root node splits on the feature at index 5 and belongs to the "Private" class. The confusion matrix shows that 59 private firms and 11 public firms were correctly predicted. The classification report gives the precision, recall, and F1 score of the tree; the accuracy of the model, per the report, is 78%.
4.7 Decision Trees: Using GINI Index
CART (Classification and Regression Tree), a fundamental decision tree algorithm, uses the Gini method to create split points. The Gini index considers a binary split for each attribute and computes a weighted sum of the impurity of each partition. For a discrete-valued attribute, the subset that gives the minimum Gini index is selected as the splitting subset. For continuous-valued attributes, the strategy is to consider each pair of adjacent values as a possible split point and choose the point with the smaller Gini index. The attribute with the minimum Gini index is chosen as the splitting attribute. The second decision tree, based on the Gini method, is as follows:
Code
```python
model = DecisionTreeClassifier(criterion="gini", splitter="best")
model.fit(X_train, y_train)
# USE THE MODEL TO MAKE PREDICTIONS FOR THE TRAINING AND TEST SET
yp_train = model.predict(X_train)
yp_test = model.predict(X_test)
```
4.7.1 Function to generate a confusion matrix and print necessary information
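The helper below is restored from the notebook source (the rendered page dropped this cell), with one fix: the POSITIVE RECALL line now uses `pos_label=1`, where the source passed `pos_label=0`. It is also modified to return the confusion matrix for convenience.

```python
# confusion_plot: print accuracy/precision/recall per class and draw the
# confusion matrix as a heatmap. Fixed: POSITIVE RECALL uses pos_label=1.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

def confusion_plot(y_data, y_pred):
    cm = confusion_matrix(y_data, y_pred)
    print('ACCURACY: {:.2f}'.format(accuracy_score(y_data, y_pred)))
    print('NEGATIVE RECALL (Y=0): {:.2f}'.format(recall_score(y_data, y_pred, pos_label=0)))
    print('NEGATIVE PRECISION (Y=0): {:.2f}'.format(precision_score(y_data, y_pred, pos_label=0)))
    print('POSITIVE RECALL (Y=1): {:.2f}'.format(recall_score(y_data, y_pred, pos_label=1)))
    print('POSITIVE PRECISION (Y=1): {:.2f}'.format(precision_score(y_data, y_pred, pos_label=1)))
    print(cm)
    ax1 = plt.subplot()
    sns.heatmap(cm, annot=True, fmt='g', ax=ax1)
    # labels, title and ticks
    ax1.set_xlabel('Predicted labels')
    ax1.set_ylabel('True labels')
    ax1.set_title('Confusion Matrix')
    ax1.xaxis.set_ticklabels(['Private', 'Public'])
    ax1.yaxis.set_ticklabels(['Private', 'Public'])
    plt.show()
    plt.close()
    return cm
```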
From the Decision Tree plot it can be inferred that the root node splits on the min.size value (min.size <= 7500) and belongs to the "Private" class. The confusion matrix shows that 62 private firms and 13 public firms were correctly predicted, which is higher than with the entropy method. The classification report gives the precision, recall, and F1 score of the tree; the accuracy of the model, per the report, is 85%, again higher than the entropy method.
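For intuition on the two splitting criteria used above, here is a small worked computation (the class counts are hypothetical, chosen only to make the formulas concrete):

```python
# Hypothetical class counts: entropy / information gain and the Gini
# impurity decrease for one candidate binary split.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [60, 40]                 # e.g. 60 Private, 40 Public at a node
left, right = [50, 10], [10, 30]  # class counts in the two child nodes

# children are weighted by their share of the parent's 100 examples
info_gain = entropy(parent) - 0.6 * entropy(left) - 0.4 * entropy(right)
gini_drop = gini(parent) - 0.6 * gini(left) - 0.4 * gini(right)
print(round(info_gain, 3), round(gini_drop, 3))
```

ID3 would prefer the split with the largest `info_gain`; CART would prefer the one with the smallest weighted Gini (largest `gini_drop`).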
4.8 Hyperparameter tuning
Parameters are model features that the model learns from the data, whereas hyperparameters are arguments accepted by a model-making function that can be modified to reduce overfitting, leading to better generalization of the model. This process of calibrating a model by finding the right hyperparameters is called hyperparameter tuning. We will tune the max_depth hyperparameter for our decision tree.
Code
```python
# LOOP OVER POSSIBLE HYPER-PARAMETER VALUES
test_results = []
train_results = []
for num_layer in range(1, 20):
    model = tree.DecisionTreeClassifier(max_depth=num_layer)
    model = model.fit(X_train, y_train)
    yp_train = model.predict(X_train)
    yp_test = model.predict(X_test)
    test_results.append([num_layer,
                         accuracy_score(y_test, yp_test),
                         recall_score(y_test, yp_test, pos_label=0),
                         recall_score(y_test, yp_test, pos_label=1)])
    train_results.append([num_layer,
                          accuracy_score(y_train, yp_train),
                          recall_score(y_train, yp_train, pos_label=0),
                          recall_score(y_train, yp_train, pos_label=1)])
```
4.8.1 Generate the plots
Code
```python
plt.plot([x[0] for x in test_results], [x[1] for x in test_results], label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results], [x[1] for x in train_results], label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('ACCURACY: Training (blue) and Test (red)')
plt.show()
plt.plot([x[0] for x in test_results], [x[2] for x in test_results], label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results], [x[2] for x in train_results], label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=0): Training (blue) and Test (red)')
plt.show()
plt.plot([x[0] for x in test_results], [x[3] for x in test_results], label='test', color='red', marker='o')
plt.plot([x[0] for x in train_results], [x[3] for x in train_results], label='train', color='blue', marker='o')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=1): Training (blue) and Test (red)')
plt.show()
```
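The manual sweep above can also be automated with a cross-validated grid search over max_depth. This is a sketch only, not part of the report's code, and it runs on synthetic stand-in data rather than the report's dataset:

```python
# Sketch of tuning max_depth with cross-validated grid search instead of a
# manual loop; make_classification is a stand-in for the real data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=7, random_state=2190)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": list(range(1, 20))}, cv=5)
search.fit(X, y)
print(search.best_params_["max_depth"], round(search.best_score_, 3))
```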
4.9 Decision Trees: Using Max Depth
How does the max_depth parameter affect the model? How do high or low values of max_depth affect accuracy on the test data?
'max_depth' is what the name suggests: the maximum depth you allow the tree to grow to. The deeper the tree, the more complex the model becomes. For training error it is easy to see what happens: increasing max_depth never increases training error. For test error it is less obvious. If max_depth is set too high, the decision tree may simply overfit the training data without capturing the useful general patterns we want, causing test error to increase. If it is set too low, the tree has too little flexibility to capture the patterns and interactions in the training data, which also causes test error to increase. There is a sweet spot between the two extremes. Usually the modeller treats max_depth as a hyperparameter and uses some form of grid/random search with cross-validation to find a good value. The third decision tree, based on a tuned max depth, is as follows:
Code
```python
# TRAIN A SKLEARN DECISION TREE MODEL ON X_train, y_train
model_maxdepth = DecisionTreeClassifier(max_depth=7, splitter="best")
model_maxdepth.fit(X_train, y_train)
yp_train = model_maxdepth.predict(X_train)
yp_test = model_maxdepth.predict(X_test)
```
From the Decision Tree plot it can be inferred that the root node splits on the min.size value (min.size <= 7500) and belongs to the "Private" class. The confusion matrix shows that 62 private firms and 13 public firms were correctly predicted, higher than the counts achieved with the entropy method and matching the Gini method. The classification report gives the precision, recall, and F1 score of the tree; the accuracy of the model, per the report, is 85%, the same as the Gini method.
It is very interesting to compare the three decision trees, as has been done above. One more interesting fact is that all three decision trees share the same root node.
The following code cell shows the comparison between all the decision trees:
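Restored from the notebook source (the rendered page dropped this cell):

```python
# Summary table comparing the three decision trees, as in the notebook.
import pandas as pd

comparison_df = pd.DataFrame(
    [[59, 10, 78], [62, 13, 85], [62, 13, 85]],
    index=['DT1', 'DT2', 'DT3'],
    columns=['Private-Correct Prediction', 'Public-Correct Prediction', 'Accuracy'])
print(comparison_df)
```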