Clustering
Because our dataset contains more than 700 stations, we avoided using station ID directly: including it in statistical and machine learning models as a categorical variable would add one level per station, which would be cumbersome and computationally very expensive. Instead, we clustered the stations into 2 separate groups, and the resulting cluster label is later used as a variable in the machine learning part of this project. This reduces the number of categories from more than 700 (one per station) to 2 (one per cluster).
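To illustrate the reduction in dimensionality, the following is a minimal sketch that one-hot encodes a categorical variable and compares the number of indicator columns produced by station ID with the number produced by cluster label. The station IDs, the count of exactly 700, and the alternating cluster assignment are hypothetical placeholders, not values from our dataset.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows), where each row is a list of 0/1 indicators
    with one position per distinct category.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical data: 700 made-up station IDs (assumption, for illustration only).
station_ids = [f"station_{i:03d}" for i in range(700)]

# Placeholder cluster assignment (assumption): alternate between cluster 0 and 1.
cluster_of = {sid: i % 2 for i, sid in enumerate(station_ids)}

_, by_station = one_hot(station_ids)
_, by_cluster = one_hot([cluster_of[sid] for sid in station_ids])

station_width = len(by_station[0])  # one indicator column per station
cluster_width = len(by_cluster[0])  # one indicator column per cluster
print(station_width, cluster_width)
```

Under these assumptions, encoding by station yields one column per station, while encoding by cluster yields only two columns.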
1 K-means clustering
K-means clustering was used because it is one of the most widely used clustering algorithms and fits the problem well. The results are shown in Figure 19.
Judged visually, the algorithm performs very well: there is no visible overlap between the two clusters, and every station is assigned to a cluster in a visually logical manner. The red cluster represents bike stations in Georgetown, Chevy Chase (Maryland), and Arlington (Virginia). The blue cluster represents bike stations in both Washington, D.C., which borders the states of Maryland and Virginia, and the vicinity of Ronald Reagan Washington National Airport (Virginia). The labels produced by the clustering model therefore allow us to factor in the location of each station.
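For readers unfamiliar with the method, the following is a minimal sketch of K-means (Lloyd's algorithm) with 2 clusters. It assumes that each station is described by a (latitude, longitude) pair, which is our reading of the geographic clusters described above rather than a confirmed feature set. The coordinates are made up, and the function `kmeans` is an illustrative helper, not the code used in this project; in practice a library implementation such as scikit-learn's `KMeans` would typically be used.

```python
import math
import random


def kmeans(points, k=2, n_iter=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct input points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        # Plain Euclidean distance on lat/lon is an approximation that is
        # reasonable only over a small area such as one metropolitan region.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical station coordinates (assumption): two well-separated groups.
stations = [
    (38.90, -77.10), (38.91, -77.09), (38.89, -77.11),  # western group
    (38.90, -77.00), (38.91, -76.99), (38.89, -77.01),  # eastern group
]

labels, centroids = kmeans(stations, k=2)
print(labels)
```

The resulting `labels` list holds one cluster index per station and can be joined back to the trip records as a two-level categorical variable. Which group receives index 0 or 1 depends on the random initialisation, so only the grouping itself is meaningful.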