Clustering groups the records in a table based on similar values in one or more numeric key fields. Similar values are values that lie near one another in the context of the entire data set, and the clusters they form, once identified, reveal patterns in the data. This page first looks at partition clustering (k-means), with a deeper dive into the Elbow and Silhouette methods for determining and visualizing the best value of k, and then turns to hierarchical clustering (with dendrograms).
1 Data Science Questions
1. What factors help predict whether a given job belongs to the private or the public sector?
2. How much does working in the private or public sector affect the salary?
3. Should employees work in the private or the public sector?
1.1 Setting the objective
We check whether attributes like Job Title, Company Rating, State, salary range, and company size (number of employees) can help identify the sector of firms in the market.
2 Dataset Used
A dataset was collected from different websites during the Data Gathering Phase; it can be found [here](https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/data/R/DataScientist.csv).
3 Data Cleaning
Next, the dataset was cleaned and prepped into the shape required by our models. First, unwanted columns like Index, Job Description, and Headquarters were removed. 'Min' and 'Max' salaries and company sizes were extracted from the salary and size ranges, respectively. A 'Sector' column was created from the 'Type of Ownership' column, and similarly a 'State' column was created from the location.
We chose 3 job-title labels for our analysis: Data Scientist, Data Engineer and Data Analyst. These titles have the highest counts in the dataset, which gives the model enough examples to learn their respective features. Other required data cleaning and prepping steps have also been applied, and can be seen in detail in the code for data cleaning.
4 Clustering algorithms used for the current dataset
There are many different models that can be used to perform clustering, but for this dataset we focus on the 4 most used methods:
1. K-Means
2. DBSCAN
3. Hierarchical (Agglomerative)
4. MeanShift
4.1 K-Means Clustering: Selecting the optimal value of K
K-means clustering is an unsupervised learning algorithm that groups data based on each point's Euclidean distance to a central point called a centroid. The centroids are defined as the means of all points in the same cluster. The algorithm first chooses random points as centroids and then iteratively adjusts them until convergence.
An important thing to remember when using K-means is that the number of clusters (k) is a hyperparameter: it must be defined before running the model.
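As a minimal sketch of the procedure (scikit-learn's `KMeans` on synthetic two-blob data, not the project dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-D data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# k is fixed up front; centroids are refined iteratively until convergence
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two learned centroids
print(km.labels_[:5])       # cluster index assigned to the first five points
```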
4.2 DBSCAN
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers). The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
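A small illustrative sketch (synthetic data; the `eps` and `min_samples` values here are arbitrary choices for the toy example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense blobs plus one far-away point that should be flagged as noise
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (40, 2)),
               rng.normal(4, 0.2, (40, 2)),
               [[10.0, 10.0]]])

# a point is a core point if at least min_samples points lie within eps of it
db = DBSCAN(eps=0.6, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; the label -1 marks noise points
```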
4.3 Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
Strategies for hierarchical clustering generally fall into two categories:
Agglomerative: This is a “bottom-up” approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a “top-down” approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
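A brief sketch of the agglomerative (bottom-up) variant with SciPy, on synthetic data; the linkage matrix `Z` built here is what `dendrogram()` would draw:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic 2-D data: two separated groups of 20 points each
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2))])

# agglomerative merges with Ward linkage, recorded in the linkage matrix Z
Z = linkage(X, method='ward')

# cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))
```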
4.4 Mean Shift
Mean shift is a clustering algorithm in the unsupervised-learning family that assigns data points to clusters iteratively by shifting points towards the mode (in the context of mean shift, the mode is the region of highest density of data points). As such, it is also known as the mode-seeking algorithm. The mean-shift algorithm has applications in image processing and computer vision.
Given a set of data points, the algorithm iteratively shifts each data point towards the region where most of its nearby points lie. In each iteration every point therefore moves closer to the densest nearby area, which is, or leads to, the cluster center. When the algorithm stops, each point is assigned to a cluster.
Unlike the popular K-Means cluster algorithm, mean-shift does not require specifying the number of clusters in advance. The number of clusters is determined by the algorithm with respect to the data.
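A minimal sketch on synthetic data (the bandwidth value is an arbitrary choice for this toy example):

```python
import numpy as np
from sklearn.cluster import MeanShift

# synthetic 2-D data: two separated blobs; note that no cluster count is given
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

# only the kernel bandwidth is supplied; the algorithm locates the modes itself
ms = MeanShift(bandwidth=1.5).fit(X)
print(len(ms.cluster_centers_))  # number of clusters estimated from the data
```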
4.5 Hyper-parameter Tuning
When creating a machine-learning model, you can define variables known as hyper-parameters. That is, the user sets the hyper-parameters while creating the model: parameters are learned during training, whereas hyper-parameters govern the learning process. The values of a model's hyper-parameters have a substantial impact on its performance.
Since it is impossible to forecast the best hyper-parameter values in advance, it is advisable to try every plausible value before settling on the best ones. Doing this manually could take a lot of time and resources, so we instead write functions that loop over candidate values and select the best hyper-parameter(s) according to a metric. This metric varies not only between algorithms but also between tuning approaches.
For instance, the silhouette approach, which maximizes the silhouette score, is a widely popular way to tune clustering hyper-parameters. There are other methods as well, including the Elbow method, the Grubbs method, and others. The Elbow Method is effective at identifying good hyper-parameters when the dataset's dimension is modest, whereas the Silhouette Method is recommended for high-dimensional data.
4.5.1 Elbow Method
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
We can use the Elbow method to get an indication of the number of clusters for our data. It consists in interpreting a line plot with an elbow shape: the number of clusters is where the elbow bends.
4.5.2 Silhouette Method
The term ‘silhouette’ refers to a way of interpreting and validating consistency inside data clusters. The method generates a simple graphical depiction of how effectively each object was classified. The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
To determine the silhouette score, this method computes the silhouette coefficient for each point and averages them across all samples. The set of hyper-parameters with the highest silhouette score is chosen. Silhouette scores fall in [-1, 1]: a high value denotes a good match to the object’s own cluster and a poor match to nearby clusters, while a low value denotes a poor fit to the object’s own cluster. If the vast majority of the objects have high values, the clustering configuration is appropriate; numerous points with low or negative values indicate that there may be too many or too few clusters.
There are 2 main terms to know to understand silhouette scores:
1. Cohesion (similarity to its own cluster)
2. Separation (similarity to other clusters)
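As a small sketch of how the score drives the choice of k (synthetic two-blob data, so k = 2 should win; scikit-learn's `silhouette_score` is used, as in the code section below):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic 2-D data with two clear groups
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.4, (40, 2)),
               rng.normal(5, 0.4, (40, 2))])

# average silhouette coefficient for each candidate k; pick the maximum
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```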
5 Code
5.1 Importing the libraries
Code
```python
# import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid', palette='Set2')

# import relevant libraries for clustering.
# we will use KMeans, DBSCAN, AgglomerativeClustering and MeanShift
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import MeanShift
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cdist
import time
import warnings

warnings.filterwarnings('ignore')
```
5.2 Import the dataset
Code
```python
# import the dataset
df = pd.read_csv('./clean_all_data.csv')
df.head(10)
```
| | Index | Title | Rating | Founded | Sector | State | min.salary | max.salary | min.size | max.size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Senior Data Scientist | 3.5 | 2007 | Private | NY | 111000.0 | 181000.0 | 501.0 | 1000.0 |
| 1 | 2 | Data Scientist, Product Analytics | 4.5 | 2008 | Private | NY | 111000.0 | 181000.0 | 1001.0 | 5000.0 |
| 2 | 3 | Data Analyst | 3.4 | 2019 | Private | NJ | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 3 | 4 | Director, Data Science | 3.4 | 2007 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
| 4 | 5 | Data Scientist | 2.9 | 1985 | Private | NY | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 5 | 6 | Quantitative Researcher | 4.4 | 1993 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
| 6 | 7 | AI Scientist | 5.0 | 2018 | Private | NY | 111000.0 | 181000.0 | 1.0 | 50.0 |
| 7 | 8 | Quantitative Researcher | 4.8 | 2000 | Private | NY | 111000.0 | 181000.0 | 501.0 | 1000.0 |
| 8 | 9 | Data Scientist | 3.9 | 2014 | Private | NY | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 9 | 10 | Data Scientist/Machine Learning | 4.4 | 2011 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
Code
```python
# remove unwanted columns
df.drop(columns=['Index', 'Founded'], inplace=True)

# rearrange columns
df = df[['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size', 'Sector']]

# subset the data frame to the three job titles and the two sectors
df = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]
df = df.loc[df['Sector'].isin(['Private', 'Public'])]

# use label encoding for categorical data
from sklearn.preprocessing import LabelEncoder
le_title = LabelEncoder()
le_state = LabelEncoder()
le_sector = LabelEncoder()
df['Title'] = le_title.fit_transform(df['Title'])
df['State'] = le_state.fit_transform(df['State'])
# Private - 0
# Public - 1
df['Sector'] = le_sector.fit_transform(df['Sector'])
df.head(10)
```
| | Title | Rating | State | min.salary | max.salary | min.size | max.size | Sector |
|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 3.4 | 5 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 4 | 2 | 2.9 | 6 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 8 | 2 | 3.9 | 6 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 11 | 2 | 3.9 | 6 | 111000.0 | 181000.0 | 1001.0 | 5000.0 | 0 |
| 13 | 2 | 3.0 | 6 | 111000.0 | 181000.0 | 51.0 | 200.0 | 0 |
| 16 | 2 | 3.6 | 6 | 111000.0 | 181000.0 | 501.0 | 1000.0 | 1 |
| 22 | 2 | 4.4 | 6 | 111000.0 | 181000.0 | 51.0 | 200.0 | 0 |
| 24 | 2 | 4.1 | 6 | 111000.0 | 181000.0 | 1001.0 | 5000.0 | 1 |
| 28 | 2 | 4.3 | 6 | 120000.0 | 140000.0 | 51.0 | 200.0 | 0 |
| 30 | 2 | 4.3 | 6 | 120000.0 | 140000.0 | 201.0 | 500.0 | 0 |
5.4 Data Selection
We split the dataset into X and y. Since clustering is an unsupervised learning task, we will not use the y labels for fitting. We normalize the X data using the StandardScaler function.
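A runnable sketch of this step (the small inline DataFrame is a synthetic stand-in for the cleaned, label-encoded `df` built above):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the cleaned, label-encoded DataFrame
df = pd.DataFrame({
    'Title': [0, 2, 2, 1],
    'Rating': [3.4, 2.9, 3.9, 4.1],
    'State': [5, 6, 6, 6],
    'min.salary': [111000.0, 111000.0, 120000.0, 90000.0],
    'max.salary': [181000.0, 181000.0, 140000.0, 120000.0],
    'min.size': [201.0, 201.0, 51.0, 1001.0],
    'max.size': [500.0, 500.0, 200.0, 5000.0],
    'Sector': [0, 0, 1, 0],
})

# drop the label for unsupervised learning, then standardize the features
X = df.drop(['Sector'], axis=1)
y = df['Sector']
scaler = StandardScaler()
X = scaler.fit_transform(X)
print(X.shape)  # rows x features, now with zero-mean / unit-variance columns
```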
5.5 Clustering with random Hyper-parameters
Every clustering method has hyper-parameters that must be set before fitting, and they must be chosen with our dataset in mind: hyper-parameters that do not match our dataset could lead to loss of information or incorrect clustering of data points.
Let's first look at a k-means model with arbitrary hyper-parameters, before any hyper-parameter tuning:
1. Initializer: kmeans++
2. Number of clusters: 6
3. Maximum iterations: 300
4. Random state: 2190
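In code (a sketch: the random matrix below stands in for the scaled feature matrix X from the Data Selection step):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for the scaled 7-feature matrix X
rng = np.random.default_rng(2190)
X = rng.normal(size=(200, 7))

# k-means with the (arbitrary) hyper-parameters listed above
kmeans = KMeans(init='k-means++', n_clusters=6, max_iter=300, random_state=2190)
label = kmeans.fit_predict(X)
print(np.unique(label))  # the six cluster ids assigned to the points
```

On the real data, these labels are stored in `df['cluster_label']` and visualized with a seaborn pairplot.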
5.6 Hyper-Parameter tuning function (maximize_silhouette)
The tuning function takes as inputs the dataset (X), the algorithm to use (algo), the maximum number of loop iterations (nmax), and a boolean flag for plotting the silhouette scores versus the hyper-parameter (i_plot). It converts the dataset into a contiguous array, runs models with various hyper-parameter values in a for loop, and computes a silhouette score for each iteration. The best hyper-parameter value, labels, and model are then saved and returned. If i_plot is set to True, the function additionally plots silhouette score vs. hyper-parameter.
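A condensed, runnable version of the function described above (assumptions: the k-means and agglomerative branches sweep n_clusters from 3 to nmax and the DBSCAN branch sweeps eps with min_samples fixed to twice the feature count; an internal debug-print flag is dropped and n_init is made explicit):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

def maximize_silhouette(X, algo='kmeans', nmax=20, i_plot=False):
    # force a contiguous array
    X = np.ascontiguousarray(X)

    # loop over the hyper-parameter, tracking the best silhouette score
    params, sil_scores = [], []
    sil_max = -10
    if algo in ('kmeans', 'agg'):
        for param in range(3, nmax + 1):
            if algo == 'agg':
                model = AgglomerativeClustering(n_clusters=param).fit(X)
                labels = model.labels_
            else:
                model = KMeans(n_clusters=param, init='k-means++', n_init=10).fit(X)
                labels = model.predict(X)
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(param)
            except ValueError:
                continue
            if sil_scores[-1] > sil_max:
                opt_param, sil_max = param, sil_scores[-1]
                opt_labels, opt_model = labels, model
    elif algo == 'dbscan':
        for eps in np.arange(0.5, nmax + 1, 0.2):
            model = DBSCAN(eps=eps, min_samples=X.shape[1] * 2).fit(X)
            labels = model.labels_
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(eps)
            except ValueError:
                continue
            if sil_scores[-1] > sil_max:
                opt_param, sil_max = eps, sil_scores[-1]
                opt_labels, opt_model = labels, model

    # optionally plot silhouette score vs. the hyper-parameter
    if i_plot:
        import matplotlib.pyplot as plt
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, '-o')
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()
    return opt_labels, opt_param, opt_model
```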
5.7 K-means (Hyper-Parameter = n_clusters)
5.7.1 Elbow Method
Code
```python
# Using the elbow method to find the optimal number of clusters. We will use
# the inertia_ attribute to find the sum of squared distances of samples to
# their closest cluster center.
evaluation = pd.DataFrame(columns=['Cluster', 'Distortion', 'Inertia'])
for i in range(1, 12):
    kmeanModel = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeanModel.fit(X)
    evaluation = evaluation.append(
        {'Cluster': i,
         'Distortion': sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0],
         'Inertia': kmeanModel.inertia_},
        ignore_index=True)
evaluation
```
| | Cluster | Distortion | Inertia |
|---|---|---|---|
| 0 | 1.0 | 2.563764 | 3073.000000 |
| 1 | 2.0 | 2.274386 | 2431.225991 |
| 2 | 3.0 | 2.086601 | 2047.937528 |
| 3 | 4.0 | 1.962188 | 1836.409041 |
| 4 | 5.0 | 1.850970 | 1626.352431 |
| 5 | 6.0 | 1.784530 | 1506.095051 |
| 6 | 7.0 | 1.709502 | 1404.642749 |
| 7 | 8.0 | 1.650355 | 1314.039079 |
| 8 | 9.0 | 1.605711 | 1228.014252 |
| 9 | 10.0 | 1.549378 | 1158.091282 |
| 10 | 11.0 | 1.500983 | 1096.804660 |
Code
```python
# plot distortion and inertia for kmeans that will suggest the optimal number
# of clusters based on the plot.
evaluation.plot.line(x='Cluster', subplots=True)
```
After looking at the elbow plot, it can be seen that an elbow forms around these values of K: 2, 3 and 9.
5.7.2 Silhouette Method
Code
```python
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size']
start = time.time()
print('KMeans Algorithm')
kmeans_labels, n_clusters, kmeans_model = maximize_silhouette(X, 'kmeans', 7, True)
print('The number of clusters for Kmeans algorithm is: ', len(np.unique(kmeans_labels)))
df['kmeans_label'] = kmeans_labels
algo_features.append('kmeans_label')
print('Cluster Plots for Kmeans algorithm')
sns.pairplot(df[algo_features], hue='kmeans_label')
plt.show()
end = time.time()
print('Time taken for KMEANS algorithm: {} minutes'.format((end - start) / 60))
```
KMeans Algorithm
The number of clusters for Kmeans algorithm is: 5
Cluster Plots for Kmeans algorithm
Time taken for KMEANS algorithm: 0.11039044857025146 minutes
5.7.3 Final Results
We can see that for the larger dataset, the Silhouette method gives a better prediction than the elbow method. The number of clusters for the K-Means algorithm as per the analysis above is 4.
5.8 DBSCAN: Comparing number of clusters with silhouette score
Code
```python
neighbors = NearestNeighbors(n_neighbors=66)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('Distance (EPS)')
plt.title('K-Distance Graph for DBSCAN')
```
Code
```python
# perform DBSCAN clustering. use the eps and min_samples parameters to find the
# optimal number of clusters. plot the number of clusters vs the silhouette score.
dbscan_df = pd.DataFrame(columns=['eps', 'min_samples', 'clusters', 'silhouette_score'])
for i in np.arange(0.5, 1.0, 0.1):
    for j in range(1, 11):
        dbscan = DBSCAN(eps=i, min_samples=j)
        dbscan.fit(X)
        n_found = len(set(dbscan.labels_) - {-1})
        if (n_found > 1) & (n_found < 11):
            dbscan_df = dbscan_df.append(
                {'eps': i, 'min_samples': j, 'clusters': n_found,
                 'silhouette_score': silhouette_score(X, dbscan.labels_)},
                ignore_index=True)
        else:
            dbscan_df = dbscan_df.append(
                {'eps': i, 'min_samples': j, 'clusters': 0, 'silhouette_score': 0},
                ignore_index=True)
dbscan_df.head()
```
5.9 Agglomerative (Hierarchical) Clustering
Code
```python
# Perform Agglomerative Clustering
model = AgglomerativeClustering().fit(X)
labels = model.labels_

# create linkage for agglomerative clustering, and the dendrogram for the linkage
# so as to depict the optimal number of clusters based on the dendrogram.
Z = linkage(X, method='ward')
dend = dendrogram(Z)
plt.axhline(y=21, color='r', linestyle='--', label='21')
```
Code
```python
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size']
start = time.time()
print('Agglomerative (Hierarchical) Algorithm')
ag_labels, n_clusters, agg_model = maximize_silhouette(X, 'agg', 7, True)
print('The number of clusters for Agglomerative (Hierarchical) algorithm is: ', len(np.unique(ag_labels)))
df['ag_label'] = ag_labels
algo_features.append('ag_label')
print('Cluster Plots for Agglomerative (Hierarchical) algorithm')
sns.pairplot(df[algo_features], hue='ag_label')
plt.show()
end = time.time()
print('Time taken for Agglomerative (Hierarchical) algorithm: {} minutes'.format((end - start) / 60))
```
Agglomerative (Hierarchical) Algorithm
The number of clusters for Agglomerative (Hierarchical) algorithm is: 4
Cluster Plots for Agglomerative (Hierarchical) algorithm
Time taken for Agglomerative (Hierarchical) algorithm: 0.10493305524190268 minutes
Code
```python
X1 = df.drop(['Sector', 'ag_label', 'dbscan_label', 'kmeans_label', 'cluster_label'], axis=1)
Z = linkage(X1, method='ward')
dend = dendrogram(Z,
                  truncate_mode='lastp',  # show only the last p merged clusters
                  p=4,
                  leaf_rotation=90,
                  leaf_font_size=12,
                  show_contracted=True)
plt.title('Truncated Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
```
5.9.1 Final Results
The optimal number of clusters based on the dendrogram above is 4.
5.10 Mean Shift
Code
```python
# Perform MeanShift Clustering and predict the number of clusters
model = MeanShift(bandwidth=2).fit(X)
labels = model.labels_
cluster_centers = model.cluster_centers_
labels_unique = np.unique(labels)
n_clusters = len(labels_unique)
print("number of estimated clusters : ", n_clusters)
```
The optimal number of clusters as per mean shift is 6.
6 Results
Comparing the time needed to execute each clustering method, KMEANS takes the longest, followed by DBSCAN, and the Agglomerative (Hierarchical) algorithm is the quickest of all. The K-Means (4) and Agglomerative (Hierarchical) (4) cluster counts come closest to the original number of labels (5). The clusters found by DBSCAN (2) and MeanShift (6) do not correspond to the number of original labels, which gives us a new perspective on the data and the target variable. Even though the method is simplified by fixing an important hyper-parameter (min_samples), DBSCAN still has a higher degree of complexity than the other algorithms. From the cluster plots of each algorithm, the clusters are most distinct for Agglomerative, followed by K-means, then MeanShift, and finally DBSCAN.
7 Conclusions
Because of its quick processing time and low complexity, the Agglomerative (Hierarchical) clustering algorithm is our best option for clustering this data, based on the insights above. The Agglomerative algorithm also suggests that, to produce better clusters, the label variable may need to be split into four groups; more precisely, the 'Government' and 'Public' categories may need to be combined.
The mismatch between the number of clusters and the number of labels across all clustering techniques can be attributed to the complexity, size, and historical nature of the data, as well as the fact that many features are interdependent.
---title: Clustering Labeled Record Data in Python---<b>This page focuses on record data clustering.</b>Clustering groups the records in a table based on similar values in one or more numeric key fields. Similar values are values that are nearby or close to one another in the context of the entire data set. These similar values represent clusters that, once identified, reveal patterns in the data. This page will look at partition clustering (k means) where a deep dive will be done in methods like elbow and Silhouette to determine and visualize the best value of k. Further this page will look into hierarchical clustering (with dendrograms).# Data Science Questions1. What factors are involved in predicting that the given job belongs to a private or a public sector?2. How much does working in the private or public sector effects the salary?3. Should employees work in the private or the public sector?## Setting the objectiveChecking if attributes like Job Title, Company Rating, State, salary range and the company size (number of employees) can help identify the sector of firms in the market.# Dataset UsedA dataset was collected using different websites during the Data Gathering Phase which can be found <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/data/R/DataScientist.csv"target="_blank">here</a>.# Data Cleaning Next, the dataset has been cleaned and prepped to be set in a specific way to run our model. In this, first undesired columns like Index, Job Description, Headquarters, etc have been removed. ‘Min’ and ‘Max’ salaries and size of a company had been extracted from the salary and size range respectively. ‘Sector’ column was created from ‘Type of Ownership’ column and similarly ‘State’ column was created from location.We have chosen 3 labels for the job title for our analysis: Data Scientist, Data Engineer and Data Analyst. 
These labels are chosen on the basis of their counts in the dataset and therefore can train the model well in learning their respective features. Other required data cleaning and prepping steps have also been applied, and can be seen in detail in the code for data cleaning.Code for data cleaning: <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Cleaning/R/Record-Data-Cleaning-in-R.Rmd"target="_blank">Record Data Cleaning</a>Raw csv: <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/DataScientist.csv"target="_blank">Raw data.csv</a>Clean csv: <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/clean_all_data.csv"target="_blank">Clean data.csv</a># Clustering algorithms used for the current datasetThere are tons of different models that can be used to perform Clustering but for this dataset, we are focusing mainly on the 4 most used methods which are:1. K-Means2. DBSCAN3. Hierarchichal (Agglomerative)4. MeanShift ## K Means Clustering : Selecting optimal value of KK-means clustering is an unsupervised learning algorithm that groups data based on each point euclidean distance to a central point called centroid. The centroids are defined by the means of all points that are in the same cluster. The algorithm first chooses random points as centroids and then iterates adjusting them until full convergence.An important thing to remember when using K-means, is that the number of clusters (k) is a hyperparameter, it will be defined before running the model.## DBSCANDBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers). 
The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.## Hierarchichal ClusteringHierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.Strategies for hierarchical clustering generally fall into two categories:1. Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.2. Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.## Mean ShiftMeanshift is falling under the category of a clustering algorithm in contrast of Unsupervised learning that assigns the data points to the clusters iteratively by shifting points towards the mode (mode is the highest density of data points in the region, in the context of the Meanshift). As such, it is also known as the Mode-seeking algorithm. Mean-shift algorithm has applications in the field of image processing and computer vision.Given a set of data points, the algorithm iteratively assigns each data point towards the closest cluster centroid and direction to the closest cluster centroid is determined by where most of the points nearby are at. So each iteration each data point will move closer to where the most points are at, which is or will lead to the cluster center. When the algorithm stops, each point is assigned to a cluster.Unlike the popular K-Means cluster algorithm, mean-shift does not require specifying the number of clusters in advance. 
The number of clusters is determined by the algorithm with respect to the data.## Hyper-parameter TuningWhen creating a machine-learning model, you can define variables known as hyper-parameters. This indicates that while creating the model, the user defines the hyper-parameters. While parameters are taught, hyper-parameters influence the learning process.The value of a model's hyperparameters has a substantial impact on its performance.Noting that it is impossible to forecast in advance the best values for hyperparameters, it is advisable to try every conceivable value before settling on the best ones. This could take a lot of time and resources to complete manually. In order to determine the best possible Hyper-parameter(s) based on a measure, we therefore develop functions with loops and parameters. This metric varies not only between algorithms but also between algorithms' approaches.For instance, the silhouette approach, which maximizes the silhouette score, is a widely popular way for clustering hyper-parameters. However, there are various methods for doing so, including the Elbow method, Grubbs method, and others.When the dataset's dimension is modest, the Elbow Method is effective at identifying the best hyper-parameters; nevertheless, the Silhouette Method is recommended for high dimension data.### Elbow MethodA fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.We can use the Elbow method to have an indication of clusters for our data. It consists in the interpretation of a line plot with an elbow shape. 
The number of clusters is were the elbow bends.### Silhouette MethodThe term 'silhouette' refers to a way of interpreting and validating consistency inside data clusters. The method generates a simple graphical depiction of how effectively each object was classified. When compared to other clusters, the silhouette value is a measure of how similar an object is to its own cluster (cohesion) (separation). In order to determine the silhouette score, this method computes the silhouette coefficients for each point and averages them across all samples. It is decided to use the collection of hyper-parameters with the highest silhouette score. The silhouette scores fall between [-1,1] where a high value denotes a good match to the object's own cluster and a poor match to nearby clusters, while a low value denotes a poor fit to nearby clusters. If the vast majority of the items have high values, the clustering configuration is advantageous. Numerous points with low or negative values indicate that there may be too many or too few clusters in the clustering design.There are 2 main terms to be known to understand silhouette scores:1. Cohesion (Similarity to its own cluster) 2. Separation (Similarity to other clusters) # Code## Importing the libraries```{python}# import the necessary packagesimport pandas as pdimport numpy as npimport seaborn as snsimport matplotlib.pyplot as pltsns.set_theme(style='whitegrid', palette='Set2')# import relevant libraries for clustering. 
we will use KMeans, DBSCAN, AgglomerativeClustering and MeanShiftfrom sklearn.cluster import KMeansfrom sklearn.cluster import DBSCANfrom sklearn.metrics import silhouette_scorefrom sklearn.cluster import AgglomerativeClusteringfrom scipy.cluster.hierarchy import dendrogram, linkagefrom sklearn.cluster import MeanShiftfrom sklearn.neighbors import NearestNeighborsfrom scipy.spatial.distance import cdistimport timeimport warningswarnings.filterwarnings('ignore')```## Import the dataset```{python}# import the datasetdf = pd.read_csv('./clean_all_data.csv')df.head(10)```## Load the clean csv file into a dataframe using Pandas```{python}# Load the cleaned csv filedf = pd.read_csv('./clean_all_data.csv')df.head(10)``````{python}# remove unwanted columnsdf.drop(columns=['Index', 'Founded'], inplace=True)# rearrange columnsdf = df[['Title','Rating','State','min.salary','max.salary','min.size', 'max.size', 'Sector']]# Subset the data frame for Binary Classification with Decision Treesdf = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]df = df.loc[df['Sector'].isin(['Private', 'Public'])]# use label encoding for categorical datafrom sklearn.preprocessing import LabelEncoderle_title = LabelEncoder()le_state = LabelEncoder()le_sector = LabelEncoder()df['Title'] = le_title.fit_transform(df['Title'])df['State'] = le_state.fit_transform(df['State'])# Private - 0# Public - 1df['Sector'] = le_sector.fit_transform(df['Sector'])df.head(10)```## Data SelectionSplitting the dataset in X and y. Since this is an unsupervised learning algorithm, we will not use the y labels. We will normalize the X data by using the StandardScaler function.```{python}from sklearn.preprocessing import StandardScalerX = df.drop(['Sector'], axis=1)y = df['Sector']scalar = StandardScaler()X = scalar.fit_transform(X)```## Clustering with random Hyper-parametersThere are numerous hyperparameters that must be predetermined for each method that clusters a dataset. 
However, these hyper-parameters must be adjusted while taking into account our dataset. Giving random hyperparameters that don't match our dataset could lead to data loss or incorrect datapoint clustering.Let’s see a k-means model with random hyperparameters before hyper-parameter tuning as follows:1. Initializer: kmeans++2. Number of clusters: 63. Maximum iterations: 3004. Random state: 2190```{python}kmeans = KMeans( init="k-means++", n_clusters=6, max_iter=300, random_state=2190)label = kmeans.fit_predict(X)df['cluster_label'] = label# plot the clustersfeatures = ['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size','max.size', 'Sector', 'cluster_label']sns.pairplot(df[features], hue='cluster_label')```## Hyper-Parameter tuning function (maximize_silhouette)The dataset (X), algorithm to be utilized (algo), the maximum number of loops to execute (nmax), and a boolean function to plot the silhouette scores versus the hyper-parameter (i_plot) are all inputs to the tuning function. It turns our dataset into a contiguous array, uses a for loop to run models with various hyper-parameter values, and calculates silhouette scores for each iteration. The best hyper-parameter, label, and model values are then saved and returned. 
The function also plots a graph of silhouette score versus the hyper-parameter when the i_plot parameter is set to True.

```{python}
def maximize_silhouette(X, algo="kmeans", nmax=20, i_plot=False):
    i_print = False

    # FORCE CONTIGUOUS
    X = np.ascontiguousarray(X)

    # LOOP OVER HYPER-PARAM
    params = []
    sil_scores = []
    sil_max = -10

    if algo in ["kmeans", "agg"]:
        for param in range(3, nmax + 1):
            if algo == "agg":
                model = AgglomerativeClustering(n_clusters=param).fit(X)
                labels = model.labels_
            if algo == "kmeans":
                model = KMeans(n_clusters=param, init='k-means++').fit(X)
                labels = model.predict(X)
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(param)
            except:
                continue
            if i_print:
                print(param, sil_scores[-1])
            if sil_scores[-1] > sil_max:
                opt_param = param
                sil_max = sil_scores[-1]
                opt_labels = labels
                opt_model = model
    elif algo in ["dbscan"]:
        for eps in np.arange(0.5, nmax + 1, 0.2):
            model = DBSCAN(eps=eps, min_samples=X.shape[1] * 2).fit(X)
            labels = model.labels_
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(eps)
            except:
                continue
            if i_print:
                print(eps, sil_scores[-1])
            if sil_scores[-1] > sil_max:
                opt_param = eps
                sil_max = sil_scores[-1]
                opt_labels = labels
                opt_model = model

    if i_plot:
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, "-o")
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()

    return opt_labels, opt_param, opt_model
```

## K-means (Hyper-Parameter = n_clusters)

### Elbow Method

```{python}
# Use the elbow method to find the optimal number of clusters. The inertia_
# attribute gives the sum of squared distances of samples to their closest
# cluster center.
evaluation = pd.DataFrame(columns=['Cluster', 'Distortion', 'Inertia'])
for i in range(1, 12):
    kmeanModel = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeanModel.fit(X)
    row = {'Cluster': i,
           'Distortion': sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0],
           'Inertia': kmeanModel.inertia_}
    evaluation = pd.concat([evaluation, pd.DataFrame([row])], ignore_index=True)
evaluation
```

```{python}
# plot distortion and inertia for k-means; the elbow suggests the optimal
# number of clusters
evaluation.plot.line(x='Cluster', subplots=True)
```

Looking at the elbow plot, an elbow forms around these values of K: 2, 3 and 9. Let's look at a scatterplot of the original labels versus the predicted clusters for the min salary and max salary columns:

```{python}
sns.set_theme(style='whitegrid', palette='Set2')
bestK = KMeans(n_clusters=11, init='k-means++', random_state=42, max_iter=500)
labels4 = bestK.fit_predict(X)
df['nlabels'] = labels4

fig, ax = plt.subplots(1, 2, figsize=(10, 5), dpi=150)
sns.scatterplot(x='min.salary', y='max.salary', data=df, hue='Sector', ax=ax[0])
sns.scatterplot(x='min.salary', y='max.salary', data=df, hue='nlabels', ax=ax[1])
```

### Silhouette Method

```{python}
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                 'min.size', 'max.size']
start = time.time()
print('KMeans Algorithm')
kmeans_labels, n_clusters, kmeans_model = maximize_silhouette(X, 'kmeans', 7, True)
print('The number of clusters for Kmeans algorithm is: ', len(np.unique(kmeans_labels)))
df['kmeans_label'] = kmeans_labels
algo_features.append('kmeans_label')
print('Cluster Plots for Kmeans algorithm')
sns.pairplot(df[algo_features], hue='kmeans_label')
plt.show()
end = time.time()
print('Time taken for KMEANS algorithm: {} minutes'.format((end - start) / 60))
```

### Final Results

For the larger dataset, the silhouette method gives a better prediction than the elbow method. As per the analysis above, the number of clusters for the K-Means algorithm is 4.

## DBSCAN: Comparing number of clusters with silhouette score

```{python}
neighbors = NearestNeighbors(n_neighbors=66)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('Distance (EPS)')
plt.title('K-Distance Graph for DBSCAN')
```

```{python}
# perform DBSCAN clustering over a grid of eps and min_samples values, then
# plot the number of clusters vs the silhouette score
dbscan_df = pd.DataFrame(columns=['eps', 'min_samples', 'clusters', 'silhouette_score'])
for i in np.arange(0.5, 1.0, 0.1):
    for j in range(1, 11):
        dbscan = DBSCAN(eps=i, min_samples=j)
        dbscan.fit(X)
        n_found = len(set(dbscan.labels_) - set([-1]))  # exclude the noise label
        if (n_found > 1) & (n_found < 11):
            row = {'eps': i, 'min_samples': j, 'clusters': n_found,
                   'silhouette_score': silhouette_score(X, dbscan.labels_)}
        else:
            row = {'eps': i, 'min_samples': j, 'clusters': 0, 'silhouette_score': 0}
        dbscan_df = pd.concat([dbscan_df, pd.DataFrame([row])], ignore_index=True)
dbscan_df.head()
```

```{python}
dbscan_df.plot.line(x='clusters', y='silhouette_score')
```

```{python}
dbscan_df[dbscan_df['silhouette_score'] == max(dbscan_df['silhouette_score'])]
```

```{python}
optimal_cluster_size = dbscan_df['clusters'][dbscan_df['silhouette_score'] == max(dbscan_df['silhouette_score'])]
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                 'min.size', 'max.size']
start = time.time()
print('DBSCAN Algorithm')
dbscan_labels, parameter, db_model = maximize_silhouette(X, 'dbscan', 7, True)
print('The EPS for DBSCAN algorithm is: {}'.format(parameter))
print('The number of clusters for DBSCAN algorithm is: ', len(np.unique(dbscan_labels)))
df['dbscan_label'] = dbscan_labels
algo_features.append('dbscan_label')
print('Cluster Plots for DBSCAN algorithm')
sns.pairplot(df[algo_features], hue='dbscan_label')
plt.title('DBSCAN Algorithm')
plt.show()
end = time.time()
print('Time taken for DBSCAN algorithm: {} minutes'.format((end - start) / 60))
```

```{python}
dbscan = DBSCAN(eps=2.49, min_samples=6)
y_pred = dbscan.fit_predict(X)
labels_DB = dbscan.labels_
print(labels_DB)
```

### Final Results

The optimal cluster size is 2, with an epsilon value of 2.49.

## Agglomerative Clustering (Hierarchical clustering)

```{python}
# perform Agglomerative Clustering
model = AgglomerativeClustering().fit(X)
labels = model.labels_

# build the linkage matrix and draw its dendrogram to depict the optimal
# number of clusters
Z = linkage(X, method='ward')
dend = dendrogram(Z)
plt.axhline(y=21, color='r', linestyle='--', label='21')
```

```{python}
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                 'min.size', 'max.size']
start = time.time()
print('Agglomerative (Hierarchical) Algorithm')
ag_labels, n_clusters, agg_model = maximize_silhouette(X, 'agg', 7, True)
print('The number of clusters for Agglomerative (Hierarchical) algorithm is: ', len(np.unique(ag_labels)))
df['ag_label'] = ag_labels
algo_features.append('ag_label')
print('Cluster Plots for Agglomerative (Hierarchical) algorithm')
sns.pairplot(df[algo_features], hue='ag_label')
plt.show()
end = time.time()
print('Time taken for Agglomerative (Hierarchical) algorithm: {} minutes'.format((end - start) / 60))
```

```{python}
X1 = df.drop(['Sector', 'ag_label', 'dbscan_label', 'kmeans_label', 'cluster_label'], axis=1)
Z = linkage(X1, method='ward')
dend = dendrogram(Z,
                  truncate_mode='lastp',  # show only the last p merged clusters
                  p=4,
                  leaf_rotation=90,
                  leaf_font_size=12,
                  show_contracted=True)
plt.title('Truncated Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
```

### Final Results

The optimal number of clusters based on the dendrogram above is 4.

## Mean Shift

```{python}
# perform MeanShift clustering and report the number of estimated clusters
model = MeanShift(bandwidth=2).fit(X)
labels = model.labels_
cluster_centers = model.cluster_centers_
labels_unique = np.unique(labels)
n_clusters = len(labels_unique)
print("number of estimated clusters : ", n_clusters)
```

```{python}
labels = pd.DataFrame(labels)
labeledMeanShift = pd.concat((df[['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                                  'min.size', 'max.size']], labels), axis=1)
labeledMeanShift = labeledMeanShift.rename({0: 'labels'}, axis=1)
sns.pairplot(labeledMeanShift, hue='labels')
```

### Final Results

The optimal number of clusters as per mean shift is 6.

# Results

Comparing execution times, K-Means takes the longest, followed by DBSCAN, while the Agglomerative (Hierarchical) algorithm is the quickest of all. The cluster counts for K-Means (4) and Agglomerative (Hierarchical) (4) come closest to the original number of labels (5). The clusters found by DBSCAN (2) and MeanShift (6) do not match the number of original labels, which gives a new perspective on the data and the target variable. Even with min_samples fixed to simplify the search, DBSCAN still carries more complexity than the other algorithms. From the cluster plots of each algorithm, the clusters are most distinct for Agglomerative, followed by K-Means, then MeanShift, and finally DBSCAN.

# Conclusions

Given these insights, the Agglomerative (Hierarchical) clustering algorithm is our best option, thanks to its quick processing time and low complexity.
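The comparison above matches cluster counts to label counts; a stricter check of cluster-to-label agreement is the adjusted Rand index. A toy sketch with made-up partitions (not outputs from our models):

```{python}
from sklearn.metrics import adjusted_rand_score

# made-up ground-truth labels and two candidate cluster assignments
y_true = [0, 0, 1, 1, 2, 2]
ari_same = adjusted_rand_score(y_true, [1, 1, 0, 0, 2, 2])  # same partition, renamed clusters
ari_diff = adjusted_rand_score(y_true, [0, 1, 2, 0, 1, 2])  # unrelated partition
print(ari_same)  # 1.0 -- the index ignores cluster naming
print(ari_diff)  # far below 1.0
```

Because the index is invariant to how clusters are numbered, it would let us compare, say, the four Agglomerative clusters against the original sector labels directly, without first aligning cluster ids to label ids.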
The Agglomerative algorithm also suggests that, to produce better clusters, the label variable may need to be regrouped into four categories; more precisely, the 'Government' and 'Public' categories may need to be combined. The mismatch between the number of clusters and the number of labels across all clustering techniques can be attributed to the complexity, size, and historical nature of the data, as well as the fact that many features are interdependent.