Data Cleaning
This page describes the data cleaning techniques applied to the raw data before it was used to develop the ML model.
Because the data was gathered from pre-existing CSV files, no extensive cleaning was required, only minor pre-processing for the specific needs of this project. All data cleaning was completed in Python. The correct sequence to follow in our GitHub repository for data cleaning is “511 Geospatial Features Append.ipynb” followed by “Joining_data_teg.ipynb”.
1 Missing data
All NA values were dropped: there were only about 20,000 NA values across the roughly 20 million rows gathered from all the CSVs, so dropping them would not meaningfully change our analysis. In addition, the column “Bike.number”, which holds the ID of every bike Capital Bikeshare owns, was dropped. Since the bikes are interchangeable and plentiful, analyzing rides on a per-bike basis adds no value.
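The two drops above can be sketched in pandas as follows. The miniature DataFrame is hypothetical; only the column names (“Station.number”, “Bike.number”) mirror the real data.

```python
import pandas as pd
import numpy as np

# Hypothetical miniature version of one yearly Capital Bikeshare CSV.
rides = pd.DataFrame({
    "Station.number": [31000, 31001, np.nan, 31002],
    "Bike.number": ["W0001", "W0002", "W0003", "W0004"],
    "Duration": [600, np.nan, 450, 900],
})

# Drop every row containing an NA value (~20,000 of ~20 million rows,
# so the loss is negligible), then drop the bike-ID column, which adds
# no analytical value because the bikes are interchangeable.
cleaned = rides.dropna().drop(columns=["Bike.number"])
```

Here `cleaned` keeps only the two fully populated rows and no longer carries the bike ID.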
2 Geospatial data
Next, not all of the yearly datasets (2016 - 2020) had longitude and latitude columns, so left joins were used to attach geospatial data. A left join was performed between every yearly dataset from the source on the “Station.number” column (see the visual example below), distinguishing the 2021 and 2022 datasets, which include longitude and latitude columns, from those that do not. “Station.number” is the only column that all the datasets share in common, as shown in Figure 2 below.
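A minimal sketch of that left join, assuming a hypothetical `stations` lookup table built from the datasets that already carry coordinates:

```python
import pandas as pd

# Hypothetical station lookup carrying the geospatial columns, derived
# from the datasets that already include longitude and latitude.
stations = pd.DataFrame({
    "Station.number": [31000, 31001],
    "Latitude": [38.905, 38.910],
    "Longitude": [-77.040, -77.045],
})

# A yearly dataset (2016-2020 style) that lacks coordinates.
rides_2016 = pd.DataFrame({
    "Station.number": [31000, 31001, 31000],
    "Duration": [600, 450, 900],
})

# The left join keeps every ride and appends coordinates wherever the
# station number matches.
joined = rides_2016.merge(stations, on="Station.number", how="left")
```

Because the join is a left join, rides at stations missing from the lookup would simply get NA coordinates rather than being dropped.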
3 Sampling the data
One last task before joining the datasets was sampling: every dataset has around 3 million rows, and joining them as-is would have produced 21 million rows, which was beyond the scope of the project and would have required far more computational power. We therefore needed a sampling strategy that reflects the population data, and decided to sample by station number, as this guarantees that the spread of rides across the DC, Maryland, and Virginia area remains consistent with that of the population data.
As shown below in Figure 3, a small example illustrates the sampling strategy:
Assume the initial dataset has 3 million rows and all rides come from three stations, with 20% of bike rides from the first station, 50% from the second, and 30% from the third. The sampling was conducted using Python functions that guarantee this spread among stations stays the same when x% (in this case 10%) is sampled from the initial data. The real dataset has far more than three stations; this is just a small example to explain the strategy.
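One way to sample proportionally by station in pandas is `groupby(...).sample(frac=...)`, sketched below on a toy frame that mirrors the 20/50/30 split from the example (the real notebooks may use a different function, so this is an illustration of the idea rather than the project's exact code):

```python
import pandas as pd

# Toy dataset mirroring the Figure 3 example: 100 rides from three
# stations in a 20% / 50% / 30% split.
rides = pd.DataFrame({
    "Station.number": [1] * 20 + [2] * 50 + [3] * 30,
})

# Sampling 10% within each station group preserves the per-station
# share of rides in the sample.
sample = (
    rides.groupby("Station.number", group_keys=False)
         .sample(frac=0.10, random_state=42)
)
```

With `frac=0.10`, each station contributes 10% of its own rides (2, 5, and 3 rows here), so the sample keeps the same 20/50/30 spread as the full data.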
Finally, all 7 datasets were joined together into one final cleaned dataset of approximately 2.1 million rows, which remained representative of the population data.
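Stacking the sampled yearly frames into the final dataset can be done with `pd.concat`; the two tiny frames below are placeholders for the seven yearly samples:

```python
import pandas as pd

# Hypothetical sampled yearly frames (the project has seven, one per
# year from 2016 to 2022).
yearly_samples = [
    pd.DataFrame({"Station.number": [1, 2], "Duration": [600, 450]}),
    pd.DataFrame({"Station.number": [3], "Duration": [900]}),
]

# Stack the yearly samples into one final cleaned dataset, resetting
# the row index so it runs 0..n-1.
final = pd.concat(yearly_samples, ignore_index=True)
```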