Clustering groups the records in a table based on similar values in one or more numeric key fields. Similar values are values that lie near one another in the context of the entire data set, and the clusters they form, once identified, reveal patterns in the data. This page first looks at partition clustering (k-means), with a deeper dive into the Elbow and Silhouette methods for determining and visualizing the best value of k, and then turns to hierarchical clustering (with dendrograms).
1 Data Science Questions
1. What factors help predict whether a given job belongs to the private or the public sector?
2. How much does working in the private or public sector affect the salary?
3. Should employees work in the private or the public sector?
1.1 Setting the objective
We check whether attributes like Job Title, Company Rating, State, salary range, and company size (number of employees) can help identify the sector of firms in the market.
2 Dataset Used
A dataset was collected from different websites during the Data Gathering Phase; it can be found [here](https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/data/R/DataScientist.csv).
3 Data Cleaning
Next, the dataset was cleaned and prepped into the shape required by our models. First, unwanted columns like Index, Job Description, and Headquarters were removed. 'Min' and 'Max' salaries and company sizes were extracted from the salary and size ranges, respectively. A 'Sector' column was created from the 'Type of Ownership' column, and similarly a 'State' column was created from the location.
We chose 3 job-title labels for our analysis: Data Scientist, Data Engineer and Data Analyst. These titles have the highest counts in the dataset, which gives the model enough examples to learn their respective features. Other required data cleaning and prepping steps have also been applied, and can be seen in detail in the code for data cleaning.
4 Clustering algorithms used for the current dataset
There are many different models that can be used to perform clustering, but for this dataset we focus on the 4 most used methods:
1. K-Means
2. DBSCAN
3. Hierarchical (Agglomerative)
4. MeanShift
4.1 K-Means Clustering: Selecting the optimal value of K
K-means clustering is an unsupervised learning algorithm that groups data based on each point's Euclidean distance to a central point called a centroid. The centroids are defined as the means of all points in the same cluster. The algorithm first chooses random points as centroids and then iteratively adjusts them until convergence.
An important thing to remember when using K-means is that the number of clusters (k) is a hyperparameter: it must be defined before running the model.
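As a minimal sketch of the procedure (scikit-learn's `KMeans` on synthetic two-blob data, not the project dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-D data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# k is fixed up front; centroids are refined iteratively until convergence
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two learned centroids
print(km.labels_[:5])       # cluster index assigned to the first five points
```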
4.2 DBSCAN
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers). The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
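A small illustrative sketch (synthetic data; the `eps` and `min_samples` values here are arbitrary choices for the toy example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense blobs plus one far-away point that should be flagged as noise
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (40, 2)),
               rng.normal(4, 0.2, (40, 2)),
               [[10.0, 10.0]]])

# a point is a core point if at least min_samples points lie within eps of it
db = DBSCAN(eps=0.6, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; the label -1 marks noise points
```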
4.3 Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
Strategies for hierarchical clustering generally fall into two categories:
Agglomerative: This is a “bottom-up” approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a “top-down” approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
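A brief sketch of the agglomerative (bottom-up) variant with SciPy, on synthetic data; the linkage matrix `Z` built here is what `dendrogram()` would draw:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic 2-D data: two separated groups of 20 points each
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2))])

# agglomerative merges with Ward linkage, recorded in the linkage matrix Z
Z = linkage(X, method='ward')

# cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))
```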
4.4 Mean Shift
Mean shift is a clustering algorithm in the unsupervised-learning family that assigns data points to clusters iteratively by shifting points towards the mode (in the context of mean shift, the mode is the region of highest density of data points). As such, it is also known as the mode-seeking algorithm. The mean-shift algorithm has applications in image processing and computer vision.
Given a set of data points, the algorithm iteratively shifts each data point towards the region where most of its nearby points lie. In each iteration every point therefore moves closer to the densest nearby area, which is, or leads to, the cluster center. When the algorithm stops, each point is assigned to a cluster.
Unlike the popular K-Means cluster algorithm, mean-shift does not require specifying the number of clusters in advance. The number of clusters is determined by the algorithm with respect to the data.
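A minimal sketch on synthetic data (the bandwidth value is an arbitrary choice for this toy example):

```python
import numpy as np
from sklearn.cluster import MeanShift

# synthetic 2-D data: two separated blobs; note that no cluster count is given
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

# only the kernel bandwidth is supplied; the algorithm locates the modes itself
ms = MeanShift(bandwidth=1.5).fit(X)
print(len(ms.cluster_centers_))  # number of clusters estimated from the data
```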
4.5 Hyper-parameter Tuning
When creating a machine-learning model, you can define variables known as hyper-parameters. That is, the user sets the hyper-parameters while creating the model: parameters are learned during training, whereas hyper-parameters govern the learning process. The values of a model's hyper-parameters have a substantial impact on its performance.
Since it is impossible to forecast the best hyper-parameter values in advance, it is advisable to try every plausible value before settling on the best ones. Doing this manually could take a lot of time and resources, so we instead write functions that loop over candidate values and select the best hyper-parameter(s) according to a metric. This metric varies not only between algorithms but also between tuning approaches.
For instance, the silhouette approach, which maximizes the silhouette score, is a widely popular way to tune clustering hyper-parameters. There are other methods as well, including the Elbow method, the Grubbs method, and others. The Elbow Method is effective at identifying good hyper-parameters when the dataset's dimension is modest, whereas the Silhouette Method is recommended for high-dimensional data.
4.5.1 Elbow Method
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
We can use the Elbow method to get an indication of the number of clusters for our data. It consists in interpreting a line plot with an elbow shape: the number of clusters is where the elbow bends.
4.5.2 Silhouette Method
The term ‘silhouette’ refers to a way of interpreting and validating consistency inside data clusters. The method generates a simple graphical depiction of how effectively each object was classified. The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
To determine the silhouette score, this method computes the silhouette coefficient for each point and averages them across all samples. The set of hyper-parameters with the highest silhouette score is chosen. Silhouette scores fall in [-1, 1]: a high value denotes a good match to the object’s own cluster and a poor match to nearby clusters, while a low value denotes a poor fit to the object’s own cluster. If the vast majority of the objects have high values, the clustering configuration is appropriate; numerous points with low or negative values indicate that there may be too many or too few clusters.
There are 2 main terms to know to understand silhouette scores:
1. Cohesion (similarity to its own cluster)
2. Separation (similarity to other clusters)
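As a small sketch of how the score drives the choice of k (synthetic two-blob data, so k = 2 should win; scikit-learn's `silhouette_score` is used, as in the code section below):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic 2-D data with two clear groups
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.4, (40, 2)),
               rng.normal(5, 0.4, (40, 2))])

# average silhouette coefficient for each candidate k; pick the maximum
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```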
5 Code
5.1 Importing the libraries
Code
```python
# import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid', palette='Set2')

# import relevant libraries for clustering.
# we will use KMeans, DBSCAN, AgglomerativeClustering and MeanShift
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import MeanShift
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cdist
import time
import warnings

warnings.filterwarnings('ignore')
```
5.2 Import the dataset
Code
```python
# import the dataset
df = pd.read_csv('./clean_all_data.csv')
df.head(10)
```
| | Index | Title | Rating | Founded | Sector | State | min.salary | max.salary | min.size | max.size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Senior Data Scientist | 3.5 | 2007 | Private | NY | 111000.0 | 181000.0 | 501.0 | 1000.0 |
| 1 | 2 | Data Scientist, Product Analytics | 4.5 | 2008 | Private | NY | 111000.0 | 181000.0 | 1001.0 | 5000.0 |
| 2 | 3 | Data Analyst | 3.4 | 2019 | Private | NJ | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 3 | 4 | Director, Data Science | 3.4 | 2007 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
| 4 | 5 | Data Scientist | 2.9 | 1985 | Private | NY | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 5 | 6 | Quantitative Researcher | 4.4 | 1993 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
| 6 | 7 | AI Scientist | 5.0 | 2018 | Private | NY | 111000.0 | 181000.0 | 1.0 | 50.0 |
| 7 | 8 | Quantitative Researcher | 4.8 | 2000 | Private | NY | 111000.0 | 181000.0 | 501.0 | 1000.0 |
| 8 | 9 | Data Scientist | 3.9 | 2014 | Private | NY | 111000.0 | 181000.0 | 201.0 | 500.0 |
| 9 | 10 | Data Scientist/Machine Learning | 4.4 | 2011 | Private | NY | 111000.0 | 181000.0 | 51.0 | 200.0 |
Code
```python
# remove unwanted columns
df.drop(columns=['Index', 'Founded'], inplace=True)

# rearrange columns
df = df[['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size', 'Sector']]

# subset the data frame to the three job titles and the two sectors
df = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]
df = df.loc[df['Sector'].isin(['Private', 'Public'])]

# use label encoding for categorical data
from sklearn.preprocessing import LabelEncoder
le_title = LabelEncoder()
le_state = LabelEncoder()
le_sector = LabelEncoder()
df['Title'] = le_title.fit_transform(df['Title'])
df['State'] = le_state.fit_transform(df['State'])
# Private - 0
# Public - 1
df['Sector'] = le_sector.fit_transform(df['Sector'])
df.head(10)
```
| | Title | Rating | State | min.salary | max.salary | min.size | max.size | Sector |
|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 3.4 | 5 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 4 | 2 | 2.9 | 6 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 8 | 2 | 3.9 | 6 | 111000.0 | 181000.0 | 201.0 | 500.0 | 0 |
| 11 | 2 | 3.9 | 6 | 111000.0 | 181000.0 | 1001.0 | 5000.0 | 0 |
| 13 | 2 | 3.0 | 6 | 111000.0 | 181000.0 | 51.0 | 200.0 | 0 |
| 16 | 2 | 3.6 | 6 | 111000.0 | 181000.0 | 501.0 | 1000.0 | 1 |
| 22 | 2 | 4.4 | 6 | 111000.0 | 181000.0 | 51.0 | 200.0 | 0 |
| 24 | 2 | 4.1 | 6 | 111000.0 | 181000.0 | 1001.0 | 5000.0 | 1 |
| 28 | 2 | 4.3 | 6 | 120000.0 | 140000.0 | 51.0 | 200.0 | 0 |
| 30 | 2 | 4.3 | 6 | 120000.0 | 140000.0 | 201.0 | 500.0 | 0 |
5.4 Data Selection
We split the dataset into X and y. Since clustering is an unsupervised learning task, we will not use the y labels for fitting. We normalize the X data using the StandardScaler function.
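A runnable sketch of this step (the small inline DataFrame is a synthetic stand-in for the cleaned, label-encoded `df` built above):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the cleaned, label-encoded DataFrame
df = pd.DataFrame({
    'Title': [0, 2, 2, 1],
    'Rating': [3.4, 2.9, 3.9, 4.1],
    'State': [5, 6, 6, 6],
    'min.salary': [111000.0, 111000.0, 120000.0, 90000.0],
    'max.salary': [181000.0, 181000.0, 140000.0, 120000.0],
    'min.size': [201.0, 201.0, 51.0, 1001.0],
    'max.size': [500.0, 500.0, 200.0, 5000.0],
    'Sector': [0, 0, 1, 0],
})

# drop the label for unsupervised learning, then standardize the features
X = df.drop(['Sector'], axis=1)
y = df['Sector']
scaler = StandardScaler()
X = scaler.fit_transform(X)
print(X.shape)  # rows x features, now with zero-mean / unit-variance columns
```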
5.5 Clustering with random Hyper-parameters
Every clustering method has hyper-parameters that must be set before fitting, and they must be chosen with our dataset in mind: hyper-parameters that do not match our dataset could lead to loss of information or incorrect clustering of data points.
Let's first look at a k-means model with arbitrary hyper-parameters, before any hyper-parameter tuning:
1. Initializer: kmeans++
2. Number of clusters: 6
3. Maximum iterations: 300
4. Random state: 2190
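In code (a sketch: the random matrix below stands in for the scaled feature matrix X from the Data Selection step):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for the scaled 7-feature matrix X
rng = np.random.default_rng(2190)
X = rng.normal(size=(200, 7))

# k-means with the (arbitrary) hyper-parameters listed above
kmeans = KMeans(init='k-means++', n_clusters=6, max_iter=300, random_state=2190)
label = kmeans.fit_predict(X)
print(np.unique(label))  # the six cluster ids assigned to the points
```

On the real data, these labels are stored in `df['cluster_label']` and visualized with a seaborn pairplot.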
5.6 Hyper-Parameter tuning function (maximize_silhouette)
The tuning function takes as inputs the dataset (X), the algorithm to use (algo), the maximum number of loop iterations (nmax), and a boolean flag for plotting the silhouette scores versus the hyper-parameter (i_plot). It converts the dataset into a contiguous array, runs models with various hyper-parameter values in a for loop, and computes a silhouette score for each iteration. The best hyper-parameter value, labels, and model are then saved and returned. If i_plot is set to True, the function additionally plots silhouette score vs. hyper-parameter.
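A condensed, runnable version of the function described above (assumptions: the k-means and agglomerative branches sweep n_clusters from 3 to nmax and the DBSCAN branch sweeps eps with min_samples fixed to twice the feature count; an internal debug-print flag is dropped and n_init is made explicit):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

def maximize_silhouette(X, algo='kmeans', nmax=20, i_plot=False):
    # force a contiguous array
    X = np.ascontiguousarray(X)

    # loop over the hyper-parameter, tracking the best silhouette score
    params, sil_scores = [], []
    sil_max = -10
    if algo in ('kmeans', 'agg'):
        for param in range(3, nmax + 1):
            if algo == 'agg':
                model = AgglomerativeClustering(n_clusters=param).fit(X)
                labels = model.labels_
            else:
                model = KMeans(n_clusters=param, init='k-means++', n_init=10).fit(X)
                labels = model.predict(X)
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(param)
            except ValueError:
                continue
            if sil_scores[-1] > sil_max:
                opt_param, sil_max = param, sil_scores[-1]
                opt_labels, opt_model = labels, model
    elif algo == 'dbscan':
        for eps in np.arange(0.5, nmax + 1, 0.2):
            model = DBSCAN(eps=eps, min_samples=X.shape[1] * 2).fit(X)
            labels = model.labels_
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(eps)
            except ValueError:
                continue
            if sil_scores[-1] > sil_max:
                opt_param, sil_max = eps, sil_scores[-1]
                opt_labels, opt_model = labels, model

    # optionally plot silhouette score vs. the hyper-parameter
    if i_plot:
        import matplotlib.pyplot as plt
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, '-o')
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()
    return opt_labels, opt_param, opt_model
```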
5.7 K-means (Hyper-Parameter = n_clusters)
5.7.1 Elbow Method
Code
```python
# Using the elbow method to find the optimal number of clusters. We will use
# the inertia_ attribute to find the sum of squared distances of samples to
# their closest cluster center.
evaluation = pd.DataFrame(columns=['Cluster', 'Distortion', 'Inertia'])
for i in range(1, 12):
    kmeanModel = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeanModel.fit(X)
    evaluation = evaluation.append(
        {'Cluster': i,
         'Distortion': sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0],
         'Inertia': kmeanModel.inertia_},
        ignore_index=True)
evaluation
```
| | Cluster | Distortion | Inertia |
|---|---|---|---|
| 0 | 1.0 | 2.563764 | 3073.000000 |
| 1 | 2.0 | 2.274386 | 2431.225991 |
| 2 | 3.0 | 2.086601 | 2047.937528 |
| 3 | 4.0 | 1.962188 | 1836.409041 |
| 4 | 5.0 | 1.850970 | 1626.352431 |
| 5 | 6.0 | 1.784530 | 1506.095051 |
| 6 | 7.0 | 1.709502 | 1404.642749 |
| 7 | 8.0 | 1.650355 | 1314.039079 |
| 8 | 9.0 | 1.605711 | 1228.014252 |
| 9 | 10.0 | 1.549378 | 1158.091282 |
| 10 | 11.0 | 1.500983 | 1096.804660 |
Code
```python
# plot distortion and inertia for kmeans that will suggest the optimal number
# of clusters based on the plot.
evaluation.plot.line(x='Cluster', subplots=True)
```
After looking at the elbow plot, it can be seen that an elbow forms around these values of K: 2, 3 and 9.
5.7.2 Silhouette Method
Code
```python
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size']
start = time.time()
print('KMeans Algorithm')
kmeans_labels, n_clusters, kmeans_model = maximize_silhouette(X, 'kmeans', 7, True)
print('The number of clusters for Kmeans algorithm is: ', len(np.unique(kmeans_labels)))
df['kmeans_label'] = kmeans_labels
algo_features.append('kmeans_label')
print('Cluster Plots for Kmeans algorithm')
sns.pairplot(df[algo_features], hue='kmeans_label')
plt.show()
end = time.time()
print('Time taken for KMEANS algorithm: {} minutes'.format((end - start) / 60))
```
KMeans Algorithm
The number of clusters for Kmeans algorithm is: 5
Cluster Plots for Kmeans algorithm
Time taken for KMEANS algorithm: 0.11039044857025146 minutes
5.7.3 Final Results
We can see that for the larger dataset, the Silhouette method gives a better prediction than the elbow method. The number of clusters for the K-Means algorithm as per the analysis above is 4.
5.8 DBSCAN: Comparing number of clusters with silhouette score
Code
```python
neighbors = NearestNeighbors(n_neighbors=66)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('Distance (EPS)')
plt.title('K-Distance Graph for DBSCAN')
```
Code
```python
# perform DBSCAN clustering. use the eps and min_samples parameters to find the
# optimal number of clusters. plot the number of clusters vs the silhouette score.
dbscan_df = pd.DataFrame(columns=['eps', 'min_samples', 'clusters', 'silhouette_score'])
for i in np.arange(0.5, 1.0, 0.1):
    for j in range(1, 11):
        dbscan = DBSCAN(eps=i, min_samples=j)
        dbscan.fit(X)
        n_found = len(set(dbscan.labels_) - {-1})
        if (n_found > 1) & (n_found < 11):
            dbscan_df = dbscan_df.append(
                {'eps': i, 'min_samples': j, 'clusters': n_found,
                 'silhouette_score': silhouette_score(X, dbscan.labels_)},
                ignore_index=True)
        else:
            dbscan_df = dbscan_df.append(
                {'eps': i, 'min_samples': j, 'clusters': 0, 'silhouette_score': 0},
                ignore_index=True)
dbscan_df.head()
```
5.9 Agglomerative (Hierarchical) Clustering
Code
```python
# Perform Agglomerative Clustering
model = AgglomerativeClustering().fit(X)
labels = model.labels_

# create linkage for agglomerative clustering, and the dendrogram for the linkage
# so as to depict the optimal number of clusters based on the dendrogram.
Z = linkage(X, method='ward')
dend = dendrogram(Z)
plt.axhline(y=21, color='r', linestyle='--', label='21')
```
Code
```python
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size', 'max.size']
start = time.time()
print('Agglomerative (Hierarchical) Algorithm')
ag_labels, n_clusters, agg_model = maximize_silhouette(X, 'agg', 7, True)
print('The number of clusters for Agglomerative (Hierarchical) algorithm is: ', len(np.unique(ag_labels)))
df['ag_label'] = ag_labels
algo_features.append('ag_label')
print('Cluster Plots for Agglomerative (Hierarchical) algorithm')
sns.pairplot(df[algo_features], hue='ag_label')
plt.show()
end = time.time()
print('Time taken for Agglomerative (Hierarchical) algorithm: {} minutes'.format((end - start) / 60))
```
Agglomerative (Hierarchical) Algorithm
The number of clusters for Agglomerative (Hierarchical) algorithm is: 4
Cluster Plots for Agglomerative (Hierarchical) algorithm
Time taken for Agglomerative (Hierarchical) algorithm: 0.10493305524190268 minutes
Code
```python
X1 = df.drop(['Sector', 'ag_label', 'dbscan_label', 'kmeans_label', 'cluster_label'], axis=1)
Z = linkage(X1, method='ward')
dend = dendrogram(Z,
                  truncate_mode='lastp',  # show only the last p merged clusters
                  p=4,
                  leaf_rotation=90,
                  leaf_font_size=12,
                  show_contracted=True)
plt.title('Truncated Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
```
5.9.1 Final Results
The optimal number of clusters based on the dendrogram above is 4.
5.10 Mean Shift
Code
```python
# Perform MeanShift Clustering and predict the number of clusters
model = MeanShift(bandwidth=2).fit(X)
labels = model.labels_
cluster_centers = model.cluster_centers_
labels_unique = np.unique(labels)
n_clusters = len(labels_unique)
print("number of estimated clusters : ", n_clusters)
```
The optimal number of clusters as per mean shift is 6.
6 Results
Comparing the time needed to execute each clustering method, KMEANS takes the longest, followed by DBSCAN, and the Agglomerative (Hierarchical) algorithm is the quickest of all. The K-Means (4) and Agglomerative (Hierarchical) (4) cluster counts come closest to the original number of labels (5). The clusters found by DBSCAN (2) and MeanShift (6) do not correspond to the number of original labels, which gives us a new perspective on the data and the target variable. Even though the method is simplified by fixing an important hyper-parameter (min_samples), DBSCAN still has a higher degree of complexity than the other algorithms. From the cluster plots of each algorithm, the clusters are most distinct for Agglomerative, followed by K-means, then MeanShift, and finally DBSCAN.
7 Conclusions
Because of its quick processing time and low complexity, the Agglomerative (Hierarchical) clustering algorithm is our best option for clustering this data, based on the insights above. The Agglomerative algorithm also suggests that, to produce better clusters, the label variable may need to be split into four groups; more precisely, the 'Government' and 'Public' categories may need to be combined.
The mismatch between the number of clusters and the number of labels across all clustering techniques can be attributed to the complexity, size, and historical nature of the data, as well as the fact that many features are interdependent.
---title: Clustering Labeled Record Data in Python---<b>This page focuses on record data clustering.</b>Clustering groups the records in a table based on similar values in one or more numeric key fields. Similar values are values that are nearby or close to one another in the context of the entire data set. These similar values represent clusters that, once identified, reveal patterns in the data. This page will look at partition clustering (k means) where a deep dive will be done in methods like elbow and Silhouette to determine and visualize the best value of k. Further this page will look into hierarchical clustering (with dendrograms).# Data Science Questions1. What factors are involved in predicting that the given job belongs to a private or a public sector?2. How much does working in the private or public sector effects the salary?3. Should employees work in the private or the public sector?## Setting the objectiveChecking if attributes like Job Title, Company Rating, State, salary range and the company size (number of employees) can help identify the sector of firms in the market.# Dataset UsedA dataset was collected using different websites during the Data Gathering Phase which can be found <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/data/R/DataScientist.csv"target="_blank">here</a>.# Data Cleaning Next, the dataset has been cleaned and prepped to be set in a specific way to run our model. In this, first undesired columns like Index, Job Description, Headquarters, etc have been removed. ‘Min’ and ‘Max’ salaries and size of a company had been extracted from the salary and size range respectively. ‘Sector’ column was created from ‘Type of Ownership’ column and similarly ‘State’ column was created from location.We have chosen 3 labels for the job title for our analysis: Data Scientist, Data Engineer and Data Analyst. 
These labels are chosen on the basis of their counts in the dataset and therefore can train the model well in learning their respective features. Other required data cleaning and prepping steps have also been applied, and can be seen in detail in the code for data cleaning.Code for data cleaning: <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Cleaning/R/Record-Data-Cleaning-in-R.Rmd"target="_blank">Record Data Cleaning</a>Raw csv: <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/DataScientist.csv"target="_blank">Raw data.csv</a>Clean csv: <ahref="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/clean_all_data.csv"target="_blank">Clean data.csv</a># Clustering algorithms used for the current datasetThere are tons of different models that can be used to perform Clustering but for this dataset, we are focusing mainly on the 4 most used methods which are:1. K-Means2. DBSCAN3. Hierarchichal (Agglomerative)4. MeanShift ## K Means Clustering : Selecting optimal value of KK-means clustering is an unsupervised learning algorithm that groups data based on each point euclidean distance to a central point called centroid. The centroids are defined by the means of all points that are in the same cluster. The algorithm first chooses random points as centroids and then iterates adjusting them until full convergence.An important thing to remember when using K-means, is that the number of clusters (k) is a hyperparameter, it will be defined before running the model.## DBSCANDBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers). 
The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.## Hierarchichal ClusteringHierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.Strategies for hierarchical clustering generally fall into two categories:1. Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.2. Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.## Mean ShiftMeanshift is falling under the category of a clustering algorithm in contrast of Unsupervised learning that assigns the data points to the clusters iteratively by shifting points towards the mode (mode is the highest density of data points in the region, in the context of the Meanshift). As such, it is also known as the Mode-seeking algorithm. Mean-shift algorithm has applications in the field of image processing and computer vision.Given a set of data points, the algorithm iteratively assigns each data point towards the closest cluster centroid and direction to the closest cluster centroid is determined by where most of the points nearby are at. So each iteration each data point will move closer to where the most points are at, which is or will lead to the cluster center. When the algorithm stops, each point is assigned to a cluster.Unlike the popular K-Means cluster algorithm, mean-shift does not require specifying the number of clusters in advance. 
The number of clusters is determined by the algorithm with respect to the data.## Hyper-parameter TuningWhen creating a machine-learning model, you can define variables known as hyper-parameters. This indicates that while creating the model, the user defines the hyper-parameters. While parameters are taught, hyper-parameters influence the learning process.The value of a model's hyperparameters has a substantial impact on its performance.Noting that it is impossible to forecast in advance the best values for hyperparameters, it is advisable to try every conceivable value before settling on the best ones. This could take a lot of time and resources to complete manually. In order to determine the best possible Hyper-parameter(s) based on a measure, we therefore develop functions with loops and parameters. This metric varies not only between algorithms but also between algorithms' approaches.For instance, the silhouette approach, which maximizes the silhouette score, is a widely popular way for clustering hyper-parameters. However, there are various methods for doing so, including the Elbow method, Grubbs method, and others.When the dataset's dimension is modest, the Elbow Method is effective at identifying the best hyper-parameters; nevertheless, the Silhouette Method is recommended for high dimension data.### Elbow MethodA fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.We can use the Elbow method to have an indication of clusters for our data. It consists in the interpretation of a line plot with an elbow shape. 
The number of clusters is were the elbow bends.### Silhouette MethodThe term 'silhouette' refers to a way of interpreting and validating consistency inside data clusters. The method generates a simple graphical depiction of how effectively each object was classified. When compared to other clusters, the silhouette value is a measure of how similar an object is to its own cluster (cohesion) (separation). In order to determine the silhouette score, this method computes the silhouette coefficients for each point and averages them across all samples. It is decided to use the collection of hyper-parameters with the highest silhouette score. The silhouette scores fall between [-1,1] where a high value denotes a good match to the object's own cluster and a poor match to nearby clusters, while a low value denotes a poor fit to nearby clusters. If the vast majority of the items have high values, the clustering configuration is advantageous. Numerous points with low or negative values indicate that there may be too many or too few clusters in the clustering design.There are 2 main terms to be known to understand silhouette scores:1. Cohesion (Similarity to its own cluster) 2. Separation (Similarity to other clusters) # Code## Importing the libraries```{python}# import the necessary packagesimport pandas as pdimport numpy as npimport seaborn as snsimport matplotlib.pyplot as pltsns.set_theme(style='whitegrid', palette='Set2')# import relevant libraries for clustering. 
we will use KMeans, DBSCAN, AgglomerativeClustering and MeanShiftfrom sklearn.cluster import KMeansfrom sklearn.cluster import DBSCANfrom sklearn.metrics import silhouette_scorefrom sklearn.cluster import AgglomerativeClusteringfrom scipy.cluster.hierarchy import dendrogram, linkagefrom sklearn.cluster import MeanShiftfrom sklearn.neighbors import NearestNeighborsfrom scipy.spatial.distance import cdistimport timeimport warningswarnings.filterwarnings('ignore')```## Import the dataset```{python}# import the datasetdf = pd.read_csv('./clean_all_data.csv')df.head(10)```## Load the clean csv file into a dataframe using Pandas```{python}# Load the cleaned csv filedf = pd.read_csv('./clean_all_data.csv')df.head(10)``````{python}# remove unwanted columnsdf.drop(columns=['Index', 'Founded'], inplace=True)# rearrange columnsdf = df[['Title','Rating','State','min.salary','max.salary','min.size', 'max.size', 'Sector']]# Subset the data frame for Binary Classification with Decision Treesdf = df.loc[df['Title'].isin(['Data Scientist', 'Data Engineer', 'Data Analyst'])]df = df.loc[df['Sector'].isin(['Private', 'Public'])]# use label encoding for categorical datafrom sklearn.preprocessing import LabelEncoderle_title = LabelEncoder()le_state = LabelEncoder()le_sector = LabelEncoder()df['Title'] = le_title.fit_transform(df['Title'])df['State'] = le_state.fit_transform(df['State'])# Private - 0# Public - 1df['Sector'] = le_sector.fit_transform(df['Sector'])df.head(10)```## Data SelectionSplitting the dataset in X and y. Since this is an unsupervised learning algorithm, we will not use the y labels. We will normalize the X data by using the StandardScaler function.```{python}from sklearn.preprocessing import StandardScalerX = df.drop(['Sector'], axis=1)y = df['Sector']scalar = StandardScaler()X = scalar.fit_transform(X)```## Clustering with random Hyper-parametersThere are numerous hyperparameters that must be predetermined for each method that clusters a dataset. 
However, these hyper-parameters must be adjusted while taking into account our dataset. Giving random hyperparameters that don't match our dataset could lead to data loss or incorrect datapoint clustering.Let’s see a k-means model with random hyperparameters before hyper-parameter tuning as follows:1. Initializer: kmeans++2. Number of clusters: 63. Maximum iterations: 3004. Random state: 2190```{python}kmeans = KMeans( init="k-means++", n_clusters=6, max_iter=300, random_state=2190)label = kmeans.fit_predict(X)df['cluster_label'] = label# plot the clustersfeatures = ['Title', 'Rating', 'State', 'min.salary', 'max.salary', 'min.size','max.size', 'Sector', 'cluster_label']sns.pairplot(df[features], hue='cluster_label')```## Hyper-Parameter tuning function (maximize_silhouette)The dataset (X), algorithm to be utilized (algo), the maximum number of loops to execute (nmax), and a boolean function to plot the silhouette scores versus the hyper-parameter (i_plot) are all inputs to the tuning function. It turns our dataset into a contiguous array, uses a for loop to run models with various hyper-parameter values, and calculates silhouette scores for each iteration. The best hyper-parameter, label, and model values are then saved and returned. 
The function also plots a graph of silhouette score versus the hyper-parameter when the i_plot parameter is set to True.

```{python}
def maximize_silhouette(X, algo="kmeans", nmax=20, i_plot=False):
    i_print = False

    # FORCE CONTIGUOUS
    X = np.ascontiguousarray(X)

    # LOOP OVER HYPER-PARAM
    params = []
    sil_scores = []
    sil_max = -10

    if algo in ["kmeans", "agg"]:
        for param in range(3, nmax + 1):
            if algo == "agg":
                model = AgglomerativeClustering(n_clusters=param).fit(X)
                labels = model.labels_
            if algo == "kmeans":
                model = KMeans(n_clusters=param, init='k-means++').fit(X)
                labels = model.predict(X)
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(param)
            except:
                continue
            if i_print:
                print(param, sil_scores[-1])
            if sil_scores[-1] > sil_max:
                opt_param = param
                sil_max = sil_scores[-1]
                opt_labels = labels
                opt_model = model
    elif algo in ["dbscan"]:
        for eps in np.arange(0.5, nmax + 1, 0.2):
            model = DBSCAN(eps=eps, min_samples=X.shape[1] * 2).fit(X)
            labels = model.labels_
            try:
                sil_scores.append(silhouette_score(X, labels))
                params.append(eps)
            except:
                continue
            if i_print:
                print(eps, sil_scores[-1])
            if sil_scores[-1] > sil_max:
                opt_param = eps
                sil_max = sil_scores[-1]
                opt_labels = labels
                opt_model = model

    if i_plot:
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, "-o")
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()

    return opt_labels, opt_param, opt_model
```

## K-means (Hyper-Parameter = n_clusters)

### Elbow Method

```{python}
# Use the elbow method to find the optimal number of clusters. The inertia_
# attribute gives the sum of squared distances of samples to their closest
# cluster center.
evaluation = pd.DataFrame(columns=['Cluster', 'Distortion', 'Inertia'])
for i in range(1, 12):
    kmeanModel = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeanModel.fit(X)
    row = {'Cluster': i,
           'Distortion': sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0],
           'Inertia': kmeanModel.inertia_}
    evaluation = pd.concat([evaluation, pd.DataFrame([row])], ignore_index=True)
evaluation
```

```{python}
# plot distortion and inertia for k-means; the elbow suggests the optimal
# number of clusters
evaluation.plot.line(x='Cluster', subplots=True)
```

Looking at the elbow plot, an elbow forms around these values of K: 2, 3 and 9. Let's look at a scatterplot of the original labels versus the predicted clusters for the min salary and max salary columns:

```{python}
sns.set_theme(style='whitegrid', palette='Set2')
bestK = KMeans(n_clusters=11, init='k-means++', random_state=42, max_iter=500)
labels4 = bestK.fit_predict(X)
df['nlabels'] = labels4

fig, ax = plt.subplots(1, 2, figsize=(10, 5), dpi=150)
sns.scatterplot(x='min.salary', y='max.salary', data=df, hue='Sector', ax=ax[0])
sns.scatterplot(x='min.salary', y='max.salary', data=df, hue='nlabels', ax=ax[1])
```

### Silhouette Method

```{python}
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                 'min.size', 'max.size']
start = time.time()
print('KMeans Algorithm')
kmeans_labels, n_clusters, kmeans_model = maximize_silhouette(X, 'kmeans', 7, True)
print('The number of clusters for Kmeans algorithm is: ', len(np.unique(kmeans_labels)))
df['kmeans_label'] = kmeans_labels
algo_features.append('kmeans_label')
print('Cluster Plots for Kmeans algorithm')
sns.pairplot(df[algo_features], hue='kmeans_label')
plt.show()
end = time.time()
print('Time taken for KMEANS algorithm: {} minutes'.format((end - start) / 60))
```

### Final Results

For the larger dataset, the silhouette method gives a better prediction than the elbow method. As per the analysis above, the number of clusters for the K-Means algorithm is 4.

## DBSCAN: Comparing number of clusters with silhouette score

```{python}
neighbors = NearestNeighbors(n_neighbors=66)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('Distance (EPS)')
plt.title('K-Distance Graph for DBSCAN')
```

```{python}
# perform DBSCAN clustering over a grid of eps and min_samples values, then
# plot the number of clusters vs the silhouette score
dbscan_df = pd.DataFrame(columns=['eps', 'min_samples', 'clusters', 'silhouette_score'])
for i in np.arange(0.5, 1.0, 0.1):
    for j in range(1, 11):
        dbscan = DBSCAN(eps=i, min_samples=j)
        dbscan.fit(X)
        n_found = len(set(dbscan.labels_) - set([-1]))  # exclude the noise label
        if (n_found > 1) & (n_found < 11):
            row = {'eps': i, 'min_samples': j, 'clusters': n_found,
                   'silhouette_score': silhouette_score(X, dbscan.labels_)}
        else:
            row = {'eps': i, 'min_samples': j, 'clusters': 0, 'silhouette_score': 0}
        dbscan_df = pd.concat([dbscan_df, pd.DataFrame([row])], ignore_index=True)
dbscan_df.head()
```

```{python}
dbscan_df.plot.line(x='clusters', y='silhouette_score')
```

```{python}
dbscan_df[dbscan_df['silhouette_score'] == max(dbscan_df['silhouette_score'])]
```

```{python}
optimal_cluster_size = dbscan_df['clusters'][dbscan_df['silhouette_score'] == max(dbscan_df['silhouette_score'])]
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                 'min.size', 'max.size']
start = time.time()
print('DBSCAN Algorithm')
dbscan_labels, parameter, db_model = maximize_silhouette(X, 'dbscan', 7, True)
print('The EPS for DBSCAN algorithm is: {}'.format(parameter))
print('The number of clusters for DBSCAN algorithm is: ', len(np.unique(dbscan_labels)))
df['dbscan_label'] = dbscan_labels
algo_features.append('dbscan_label')
print('Cluster Plots for DBSCAN algorithm')
sns.pairplot(df[algo_features], hue='dbscan_label')
plt.title('DBSCAN Algorithm')
plt.show()
end = time.time()
print('Time taken for DBSCAN algorithm: {} minutes'.format((end - start) / 60))
```

```{python}
dbscan = DBSCAN(eps=2.49, min_samples=6)
y_pred = dbscan.fit_predict(X)
labels_DB = dbscan.labels_
print(labels_DB)
```

### Final Results

The optimal cluster size is 2, with an epsilon value of 2.49.

## Agglomerative Clustering (Hierarchical clustering)

```{python}
# perform Agglomerative Clustering
model = AgglomerativeClustering().fit(X)
labels = model.labels_

# build the linkage matrix and draw its dendrogram to depict the optimal
# number of clusters
Z = linkage(X, method='ward')
dend = dendrogram(Z)
plt.axhline(y=21, color='r', linestyle='--', label='21')
```

```{python}
algo_features = ['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                 'min.size', 'max.size']
start = time.time()
print('Agglomerative (Hierarchical) Algorithm')
ag_labels, n_clusters, agg_model = maximize_silhouette(X, 'agg', 7, True)
print('The number of clusters for Agglomerative (Hierarchical) algorithm is: ', len(np.unique(ag_labels)))
df['ag_label'] = ag_labels
algo_features.append('ag_label')
print('Cluster Plots for Agglomerative (Hierarchical) algorithm')
sns.pairplot(df[algo_features], hue='ag_label')
plt.show()
end = time.time()
print('Time taken for Agglomerative (Hierarchical) algorithm: {} minutes'.format((end - start) / 60))
```

```{python}
X1 = df.drop(['Sector', 'ag_label', 'dbscan_label', 'kmeans_label', 'cluster_label'], axis=1)
Z = linkage(X1, method='ward')
dend = dendrogram(Z,
                  truncate_mode='lastp',  # show only the last p merged clusters
                  p=4,
                  leaf_rotation=90,
                  leaf_font_size=12,
                  show_contracted=True)
plt.title('Truncated Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
```

### Final Results

The optimal number of clusters based on the dendrogram above is 4.

## Mean Shift

```{python}
# perform MeanShift clustering and report the number of estimated clusters
model = MeanShift(bandwidth=2).fit(X)
labels = model.labels_
cluster_centers = model.cluster_centers_
labels_unique = np.unique(labels)
n_clusters = len(labels_unique)
print("number of estimated clusters : ", n_clusters)
```

```{python}
labels = pd.DataFrame(labels)
labeledMeanShift = pd.concat((df[['Title', 'Rating', 'State', 'min.salary', 'max.salary',
                                  'min.size', 'max.size']], labels), axis=1)
labeledMeanShift = labeledMeanShift.rename({0: 'labels'}, axis=1)
sns.pairplot(labeledMeanShift, hue='labels')
```

### Final Results

The optimal number of clusters as per mean shift is 6.

# Results

Comparing execution times, K-Means takes the longest, followed by DBSCAN, while the Agglomerative (Hierarchical) algorithm is the quickest of all. The cluster counts for K-Means (4) and Agglomerative (Hierarchical) (4) come closest to the original number of labels (5). The clusters found by DBSCAN (2) and MeanShift (6) do not match the number of original labels, which gives a new perspective on the data and the target variable. Even with min_samples fixed to simplify the search, DBSCAN still carries more complexity than the other algorithms. From the cluster plots of each algorithm, the clusters are most distinct for Agglomerative, followed by K-Means, then MeanShift, and finally DBSCAN.

# Conclusions

Given these insights, the Agglomerative (Hierarchical) clustering algorithm is our best option, thanks to its quick processing time and low complexity.
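The comparison above matches cluster counts to label counts; a stricter check of cluster-to-label agreement is the adjusted Rand index. A toy sketch with made-up partitions (not outputs from our models):

```{python}
from sklearn.metrics import adjusted_rand_score

# made-up ground-truth labels and two candidate cluster assignments
y_true = [0, 0, 1, 1, 2, 2]
ari_same = adjusted_rand_score(y_true, [1, 1, 0, 0, 2, 2])  # same partition, renamed clusters
ari_diff = adjusted_rand_score(y_true, [0, 1, 2, 0, 1, 2])  # unrelated partition
print(ari_same)  # 1.0 -- the index ignores cluster naming
print(ari_diff)  # far below 1.0
```

Because the index is invariant to how clusters are numbered, it would let us compare, say, the four Agglomerative clusters against the original sector labels directly, without first aligning cluster ids to label ids.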
The Agglomerative algorithm also suggests that, to produce better clusters, the label variable may need to be regrouped into four categories; more precisely, the 'Government' and 'Public' categories may need to be combined. The mismatch between the number of clusters and the number of labels across all clustering techniques can be attributed to the complexity, size, and historical nature of the data, as well as the fact that many features are interdependent.