Linear Regression
To predict how long users spend on a bike, a linear regression model was fitted on several variables that may contribute to ride duration. As a first step, selected variables were clustered. As shown in Figure 19, distance in miles, bike type, and membership status were all used to train the model, alongside the cluster assignment, which appears as the two distinct groups “k_1” and “k_0”. Figure 20 depicts the target variable that the model predicts, ride duration in minutes.
1 Details about input variables
❖ Bike type (r_classic_bike, r_docked_bike, and r_electric_bike): The type of bike used for the ride: ‘electric’, ‘classic’, or ‘docked’. Hypothesis testing showed that the mean distance traveled on electric bikes was greater than that on classic bikes, so it is worth examining how long people ride each type of bike on their CaBi journeys.
❖ Member type (M_member and M_casual): Indicates whether the rider is a CaBi member or a casual rider, and so captures how members use CaBi compared with casual riders. The hypothesis tests also provided insight into how much members use electric bikes compared with casual riders. These differences make this column relevant for predicting the dependent variable.
❖ Distance traveled in miles (distance_miles): A feature-engineered column derived from the geospatial data in the initial dataset. The start and end latitudes and longitudes, along with other details such as station name and number, were used to calculate the distance covered by a rider.
❖ K-means variable (k_1 and k_0): Using k-means on the geospatial data, we divided the map into clusters. The clustering helped not only with exploratory data analysis but also with dimensionality reduction.
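The input columns described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the raw field names (start_lat, start_lng, end_lat, end_lng, rideable_type, member_casual, cluster) are assumptions, the haversine formula is one common way to get a distance from two coordinate pairs and may differ from the calculation used here, and the cluster label is assumed to have been produced by k-means beforehand.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def one_hot(value, categories, prefix):
    """Return 0/1 indicator columns, one per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def build_features(ride):
    """Turn one raw ride record (a dict) into the model's input columns."""
    features = {
        "distance_miles": haversine_miles(
            ride["start_lat"], ride["start_lng"],
            ride["end_lat"], ride["end_lng"],
        )
    }
    bike_types = ["classic_bike", "docked_bike", "electric_bike"]
    features.update(one_hot(ride["rideable_type"], bike_types, "r"))
    features.update(one_hot(ride["member_casual"], ["member", "casual"], "M"))
    # Cluster label assumed to come from a k-means step run earlier.
    features.update(one_hot(ride["cluster"], [0, 1], "k"))
    return features


# Illustrative record; the coordinates are made up for the example.
example_ride = {
    "start_lat": 38.8977, "start_lng": -77.0365,
    "end_lat": 38.8895, "end_lng": -77.0353,
    "rideable_type": "electric_bike",
    "member_casual": "member",
    "cluster": 1,
}
example_features = build_features(example_ride)
```

Note that keeping both indicators of a binary variable (for example M_member and M_casual) makes the columns perfectly correlated with each other; whether that matters depends on how the regression is fitted.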
2 Details about the dependent variable
❖ Duration (minutes): How long, on average, does a user ride a CaBi bike in the DC-Arlington area? The response variable can help provide better recommendations by bike type. It may also indicate whether casual riders are likely to become CaBi members.
3 Performance metrics
❖ Mean absolute error: The average of the absolute errors over all data points in the dataset. The model’s mean absolute error was 14.8 minutes.
❖ Mean squared error: The average of the squared errors over all data points in the dataset. The model’s mean squared error was 25,133.51, which corresponds to a root mean squared error of roughly 158.5 minutes and is high for a target measured in minutes.
❖ Median absolute error: The median of the absolute errors in the dataset. The main advantage of this metric is that it is robust to outliers: a single bad data point in the test dataset does not skew the metric, as it would a mean error metric. The model’s median absolute error was 4.97, well below the mean absolute error of 14.8, which is consistent with a small number of very large errors inflating the mean.
❖ Variance score: The explained variance score measures how much of the variation in the dataset our model accounts for. A score of 1.0 indicates a perfect model. Our model’s variance score was 0.01, meaning it explains only about 1% of the variation.
❖ R2-score: The coefficient of determination, which indicates how well the model is expected to predict unseen samples. The best possible score is 1.0, and the score can be negative. The model’s R2-score was 0.01, confirming that it has almost no explanatory power.
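The five metrics above can be written out directly from their definitions. The sketch below uses only the Python standard library and made-up numbers, not the project's data; the function name regression_metrics is hypothetical. scikit-learn offers equivalent functions (mean_absolute_error, mean_squared_error, median_absolute_error, explained_variance_score, r2_score), which may be what was used in practice.

```python
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute the five metrics from their textbook definitions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_bar = mean(y_true)
    e_bar = mean(errors)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total variation in y
    ss_res = sum(e ** 2 for e in errors)             # squared residuals
    ss_err = sum((e - e_bar) ** 2 for e in errors)   # variation of the errors
    return {
        "mean_absolute_error": mean(abs_errors),
        "mean_squared_error": ss_res / len(errors),
        "median_absolute_error": median(abs_errors),
        "explained_variance": 1 - ss_err / ss_tot,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up durations in minutes, purely for illustration.
y_true = [10, 12, 8, 15, 20]
y_pred = [11, 11, 9, 14, 18]
clean = regression_metrics(y_true, y_pred)

# Add one extreme ride to show how the mean and the median react.
with_outlier = regression_metrics(y_true + [10000], y_pred + [15])
```

In this toy example the single extreme ride pushes the mean absolute error from 1.2 to well over 1,000, while the median absolute error stays at 1, which mirrors the gap between the two metrics reported above.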
4 Results
Under this setup, the linear regression model unfortunately does not provide sufficient evidence that it can predict how long users will spend on the bikes. The model shows a high mean squared error alongside an R2 value of 0.01, indicating that it does not fit our data.
Figure 21 also shows how poorly linear regression performs for this problem: the predictions are scattered widely relative to the actual values, with no consistent pattern. The figure also reveals extreme outliers in the test data, with some durations exceeding 10,000 minutes. This information could be used when developing other models to improve their performance and the quality of the test data.
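As one possible way to act on the outlier observation, extreme durations could be flagged before a later model is trained or evaluated. The sketch below uses the interquartile-range rule, a common heuristic that was not part of this analysis; the multiplier of 1.5, the function name split_outliers, and the sample durations are all assumptions for illustration.

```python
from statistics import quantiles


def split_outliers(durations, k=1.5):
    """Split durations into (kept, flagged) using an upper IQR fence."""
    q1, _, q3 = quantiles(durations, n=4)   # quartile cut points
    upper_fence = q3 + k * (q3 - q1)
    kept = [d for d in durations if d <= upper_fence]
    flagged = [d for d in durations if d > upper_fence]
    return kept, flagged


# Made-up ride durations in minutes, including one extreme value.
durations = [5, 8, 12, 15, 20, 25, 30, 45, 10500]
kept, flagged = split_outliers(durations)
```

Whether flagged rides should be dropped, capped, or investigated is a separate decision, since very long durations may reflect unreturned bikes rather than real trips.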