CabiGeoStats

Exploratory Data Analysis

This page explores the cleaned data through plots and other visuals to give a better sense of what it contains.

1 Capital Bikeshare usage per year

Figure 4 was generated from our sampled data, so the actual number of rides in each year is ten times the value shown on the y-axis. Demand for CaBi stayed roughly constant from 2016 to 2019. The drop in 2020 is a consequence of the COVID-19 pandemic and the resulting lockdown. The key finding, however, is that post-COVID demand has not yet caught up to pre-COVID levels.
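The scaling from sampled to full counts can be sketched as below. This is a minimal illustration that assumes a 10% sample (implied by the tenfold factor); the counts and the names `sampled_rides`, `SCALE_FACTOR`, and `estimate_full_counts` are hypothetical and do not come from the project code.

```python
# Hypothetical sampled ride counts per year (illustrative values, not project data).
sampled_rides = {2019: 340_000, 2020: 220_000}

# Assumption: a 10% sample, so full-data counts are ten times the sampled counts.
SCALE_FACTOR = 10


def estimate_full_counts(sampled, scale=SCALE_FACTOR):
    """Scale sampled yearly counts up to estimated full-data counts."""
    return {year: count * scale for year, count in sampled.items()}


full_rides = estimate_full_counts(sampled_rides)
for year, count in full_rides.items():
    print(year, count)
```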

2 Proportion of rides per rider status

Figure 5 shows a positive sign: membership grew from 2016 to 2020. Because the figure reports proportions, the peak for members in 2020 suggests that casual riders, mainly tourists and people living in D.C. for a few months at most, declined significantly under the lockdown. In the lockdown's aftermath, members dropped by almost 25% from 2020 to 2021. The pandemic therefore very likely hurt CaBi’s revenues and has put the system in a period of recovery.
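A per-year member share of the kind plotted here could be computed roughly as follows. This is only a sketch using the standard library; the records, the `"member"`/`"casual"` labels, and the function name `member_share_by_year` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical ride records as (year, rider_status); illustrative only.
rides = [
    (2019, "member"), (2019, "member"), (2019, "member"), (2019, "casual"),
    (2020, "member"), (2020, "member"), (2020, "member"), (2020, "member"),
    (2020, "casual"),
]


def member_share_by_year(records):
    """Return the fraction of rides taken by members in each year."""
    totals = Counter(year for year, _ in records)
    members = Counter(year for year, status in records if status == "member")
    return {year: members[year] / totals[year] for year in sorted(totals)}


shares = member_share_by_year(rides)
```

Note that a rising member share can come from fewer casual rides rather than more member rides, which is why the share alone does not show absolute growth.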

3 Number of rides started per hour of the day by rider status (2016-2022)

Figure 6 shows almost no activity on CaBi bikes among either members or casual riders at 3AM and 4AM across 2016 to 2022. At 8AM, rides started by members spike while those by casual riders do not, implying that members mainly use CaBi to commute to work. A second peak appears at 5PM, this time for both groups: members are likely commuting home from work, whereas casual riders are setting off on leisure trips. This reinforces our observation from Figure 5 that casual riders are mainly tourists.
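The hourly counts behind a plot like this can be sketched as below. The timestamp format, the sample trips, and the helper names `rides_per_hour` and `peak_hour` are assumptions for illustration, not the project's actual code or data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trips as (start timestamp, rider_status); illustrative only.
trips = [
    ("2019-05-06 08:05:00", "member"),
    ("2019-05-06 08:40:00", "member"),
    ("2019-05-06 17:15:00", "member"),
    ("2019-05-06 11:00:00", "casual"),
    ("2019-05-06 17:30:00", "casual"),
    ("2019-05-06 17:45:00", "casual"),
]


def rides_per_hour(records, fmt="%Y-%m-%d %H:%M:%S"):
    """Count rides started in each hour (0-23), separately per rider status."""
    counts = defaultdict(lambda: [0] * 24)
    for started_at, status in records:
        hour = datetime.strptime(started_at, fmt).hour
        counts[status][hour] += 1
    return dict(counts)


def peak_hour(hourly_counts):
    """Return the hour with the most rides (earliest hour wins ties)."""
    return max(range(24), key=lambda h: hourly_counts[h])


hourly = rides_per_hour(trips)
```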

4 Ridge plot of log ride duration

Figure 7 shows that not only the frequency of rides fell in 2020 but also the duration of each ride: the peak of the 2020 ridge is shifted slightly left relative to the other years.

Figure 8 reduces the influence of outliers by log-transforming the duration in minutes. The main takeaway is the heavier tails for the hours from 10AM to 11PM, implying that rides tend to last longer during this part of the day than during the rest.
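To see why a log-transformation tames outliers, consider the sketch below. It assumes a natural log (the base is not stated here) and uses made-up durations; `log_transform` is a hypothetical helper name.

```python
import math
import statistics

# Hypothetical ride durations in minutes, with one extreme value; illustrative only.
durations_min = [4.0, 9.0, 12.0, 15.0, 30.0, 240.0]


def log_transform(values):
    """Natural-log transform; durations must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log-transform requires positive durations")
    return [math.log(v) for v in values]


log_durations = log_transform(durations_min)

# How far the largest value sits from the median, before and after the transform.
raw_ratio = max(durations_min) / statistics.median(durations_min)
log_ratio = max(log_durations) / statistics.median(log_durations)
```

On the raw scale the largest ride is many times the median, while on the log scale the same ride is only about twice the median, so it no longer dominates the plot.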

5 Ridge plot of number of rides started per hour of the day

Analogous to Figure 8, Figure 9 visualizes the distribution of the number of rides started over the day, across all years and riders in our data. At 5AM there is almost no activity, but once riders start their day at 6AM, activity increases gradually until it peaks at 8AM. During work hours, from 9AM to 4PM, activity is lower, and it then rises to an even higher peak at 5PM and 6PM as riders leave their workplaces. After 6PM activity declines gradually, and the cycle begins again at 6AM the next day.

6 Bar plot for proportions

Figure 10 shows that CaBi members account for approximately 73% of the rides in our data and casual riders for the remaining 27%.

Figure 11 covers another categorical variable, bike type. Classic bikes have been available since the inception of CaBi and make up approximately 90% of the data. Electric bikes were introduced in 2020. Because we could not find conclusive information about what docked bikes are, we dropped that category entirely from the statistical analyses.

7 Boxplots for trip duration

Figure 12 shows that the numeric variables, duration in minutes and distance in miles, remain right-skewed even after removing heavy outliers: in each boxplot the median sits closer to the left edge of the box (lower duration or distance) and the left whisker is shorter than the right. The median duration is approximately 12 minutes and the median distance is approximately one mile across members and casual riders. Log-transforming these features is therefore preferable for t-tests, since the transformed values are closer to normally distributed.
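The boxplot reading above (median nearer the lower quartile) can be quantified with Bowley's quartile skewness, a technique swapped in here purely for illustration; nothing indicates the project computed it. The data and the function name `bowley_skewness` are hypothetical.

```python
import math
import statistics


def bowley_skewness(values):
    """Quartile skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1).

    Positive when the median sits closer to Q1 than to Q3 (right skew),
    zero when the box is symmetric around the median.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Hypothetical durations in minutes; illustrative only.
durations_min = [1, 2, 3, 4, 5, 10, 30]

raw_skew = bowley_skewness(durations_min)
log_skew = bowley_skewness([math.log(d) for d in durations_min])
```

In this toy example the log-transform reduces the quartile skewness without removing it, which is the sense in which transformed data is closer to, rather than exactly, normal.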

8 Geospatial

We used the Folium package in Python to generate the geospatial visualizations, since our data contains latitude and longitude for the start and end points of each trip. We wrote a function that takes a station address as a string and outputs a heatmap of the stations where rides starting there ended. Figure 13 shows that most trips ended around the Rosslyn Metro Station and Dupont Circle, which suggests that the Georgetown community uses CaBi as a substitute for the Georgetown University Transportation Shuttle. Moreover, 8,000 trips were started from 37th & O St NW across 2016-2022, 5,500 by members and 2,500 by casual riders.
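A function of the kind described might look like the sketch below. It is not the project's implementation: the record layout, the coordinates (approximate), and the names `end_point_weights` and `build_heatmap` are assumptions, and it assumes the station address identifies the start station. It relies on `folium.Map` and the `HeatMap` plugin from `folium.plugins`.

```python
from collections import Counter

# Hypothetical trips as (start_station, end_lat, end_lng); coordinates are
# approximate and for illustration only.
trips = [
    ("37th & O St NW", 38.8960, -77.0711),  # near Rosslyn
    ("37th & O St NW", 38.8960, -77.0711),
    ("37th & O St NW", 38.9097, -77.0434),  # near Dupont Circle
    ("Some Other Station", 38.9000, -77.0300),
]


def end_point_weights(records, start_station):
    """Count trips from start_station per end location as [lat, lng, count]."""
    counts = Counter(
        (round(lat, 4), round(lng, 4))
        for station, lat, lng in records
        if station == start_station
    )
    return [[lat, lng, n] for (lat, lng), n in counts.items()]


def build_heatmap(points, center, zoom_start=13):
    """Render weighted points as a Folium heatmap centered on `center`."""
    # Imported lazily so the aggregation above runs without Folium installed.
    import folium
    from folium.plugins import HeatMap

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    HeatMap(points).add_to(fmap)
    return fmap


points = end_point_weights(trips, "37th & O St NW")
```

With Folium installed, `build_heatmap(points, center=(38.9076, -77.0723)).save("heatmap.html")` would write an interactive map to disk.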
