Employment Analysis
---
title: Data Cleaning
---

<b>This page describes the data-cleaning techniques applied to the raw data that will later be used to develop the ML models. The data has been cleaned using the methods listed below:</b>

# Record Data Cleaning using R

The first method used to clean the record data is in R. Initially, all the files representing salary data were combined into a single data frame. Running the str function confirmed that the data needed cleaning: it contained a large number of NA values, some feature values were missing, the datatypes were set incorrectly, and there were other issues as well. A screenshot of the raw dataset is included below:

<img src="./images/R/Raw Data.png" style="width:1000px;" align="center">


The next stage of the process started by counting the NA values present in every column. This gives an idea of which columns should be kept and which need to be handled accordingly, and it establishes the set of columns necessary for our analysis. Columns with more than 300 NA or -1 values were removed straight away. Columns with a single NA value were explored row-wise and handled accordingly. After this process, the columns still containing NA, -1, or missing values are as follows:

<img src="./images/R/Data Containing NA or -1 or missing values.png" style="width:1000px;" align="center">
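The NA audit described above was implemented in R (see the linked .Rmd); the equivalent logic can be sketched in Python with pandas. The column names and value patterns below are hypothetical stand-ins, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the raw data, where NA and -1 both mark missing values.
df = pd.DataFrame({
    "Competitors": [-1, -1, -1, np.nan] * 100,  # 400 missing markers -> drop
    "Rating": [4.0, 3.5, np.nan, 4.2] * 100,    # 100 missing -> keep, handle row-wise
})

# Count NA and -1 occurrences per column.
missing = df.isna().sum() + (df == -1).sum()

# Remove columns with more than 300 NA/-1 values, the threshold used above.
df = df.drop(columns=missing[missing > 300].index)
```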


These rows have been handled as required. At this stage, all the missing and NA values have been successfully dealt with. Moving forward, the focus of cleaning shifts to other columns such as Revenue, which appears to contain only one value, 'Unknown/ Not-Applicable'. Before removing the column, this was verified by plotting a bar graph of the Revenue column, which looks like this:

<img src="./images/R/Revenue Column.png" style="width:1000px;" align="center">

Since most values in this column (close to 1,200) are 'Unknown/ Not-Applicable', the column is of no relevance to the analysis: we simply don't have revenue data for these companies, and revenue is not something we can impute with the mean, median, or mode. The column has therefore been removed from the dataset. After this, all the datatypes were set appropriately; for example, the Founded column was converted to a date type. Feature generation was also performed to derive columns useful for further analysis: minimum and maximum salaries were extracted from the salary range, company size was extracted from the size range, and a State column was created from the location.
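The feature-generation step was done in R in the linked code; it can be illustrated in pandas. The salary-range and location formats below are assumptions based on the description above:

```python
import pandas as pd

# Hypothetical rows; the real columns come from the combined salary data frame.
df = pd.DataFrame({
    "Salary Estimate": ["$51K-$87K (Glassdoor est.)", "$79K-$131K (Glassdoor est.)"],
    "Location": ["Dallas, TX", "New York, NY"],
})

# Extract numeric min/max salary (in $K) from the salary-range string.
sal = df["Salary Estimate"].str.extract(r"\$(\d+)K-\$(\d+)K").astype(int)
df["Min Salary"] = sal[0]
df["Max Salary"] = sal[1]

# Derive a State column from the "City, ST" location string.
df["State"] = df["Location"].str.split(", ").str[1]
```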

After following the remaining steps in the code, we arrive at a clean dataset that is ready for analysis.

R Code for data cleaning : 
<a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Cleaning/R/Record-Data-Cleaning-in-R.Rmd" target="_blank">Record Data Cleaning</a>

Raw csv : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/DataScientist.csv" target="_blank">Raw data.csv</a>

Clean csv : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/R/clean_all_data.csv" target="_blank">Clean data.csv</a>

Screenshot of clean data:

<img src="./images/R/Clean Data.png" style="width:1000px;height:600px;" align="center">


# Record Data Cleaning using Python

<b>Getting the Data</b>

Retrieving the data is the most confusing part when starting out, since the site's structure determines how accessible the data is. Levels.fyi delivers its entire salary payload from a single endpoint. By grabbing the JSON with Python's requests library and then combining it with the pandas library, the JSON is stored in a tidy dataframe:

<img src="./images/Python/levels_fyi_raw_data.png" style="width:1000px;" align="center">

It's visible that a few useless columns have been pulled in, such as rowNumber. These will be removed so that they don't clutter the analysis in the next steps. We'll also want to replace blank values with numpy's NaN so that the exploratory analysis in the next steps functions properly. Once the DataFrame has been cleaned up, we can start to explore the dataset!
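The fetch-and-tidy steps can be sketched as follows. The endpoint URL and field names are not shown on this page, so the payload below is a hypothetical stand-in with the same overall shape:

```python
import numpy as np
import pandas as pd

# In practice the JSON would be fetched from the single endpoint, e.g.:
#   payload = requests.get(SALARY_ENDPOINT_URL).json()
# Here a small hand-made payload stands in; field names are hypothetical.
payload = [
    {"rowNumber": 1, "company": "Acme", "basesalary": "150000", "gender": ""},
    {"rowNumber": 2, "company": "Initech", "basesalary": "", "gender": "Male"},
]

# Flatten the JSON records into a tidy dataframe.
df = pd.json_normalize(payload)

# Drop columns that carry no information, such as rowNumber.
df = df.drop(columns=["rowNumber"])

# Replace blank strings with numpy NaN so missing-data tooling works correctly.
df = df.replace("", np.nan)
```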

<b>Data Completeness with MissingNo</b>

For this dataset, we first need to understand the completeness of each feature and whether we have enough data to conduct meaningful analysis (for example, examining wage differences by gender, or variance in stock/bonus by tenure). To do this we will use the missingno library, which provides a variety of visualizations that help us better understand the missing data in the dataset.

<img src="./images/Python/missingno_analysis.png" style="width:1000px;" align="center">
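The numbers behind a missingno matrix are simply per-column completeness fractions; a minimal pandas approximation, using a hypothetical slice of the data, looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the salary data with uneven completeness.
df = pd.DataFrame({
    "company": ["Acme", "Initech", "Hooli", "Acme"],
    "tag": [np.nan, np.nan, "ML", "Infra"],        # became a required field later
    "gender": [np.nan, "Female", np.nan, np.nan],  # sparsely filled optional field
})

# Fraction of non-missing values per column -- the same information that
# missingno renders graphically via msno.matrix(df) or msno.bar(df).
completeness = df.notna().mean()
```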

Beyond seeing how complete the data is, we also learn that certain features like 'tag', 'basesalary', and 'bonus' weren't always required form fields, but became required later. It also shows that, apart from 'otherdetails' (and 'gender' for a stretch), optional fields like 'stockgrantvalue' were filled out fairly frequently and can be included in the initial analysis.

<b>Cleaned Data:</b>

After performing the data-cleaning operations, this is how the cleaned dataframe looks:

<img src="./images/Python/clean_python_df.png" style="width:1000px;" align="center">

Python Code to clean data : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Gathering/Python/Analyzing_Salaries_Scraping_LevelsFyi.ipynb" target="_blank">View Jupyter Notebook</a>

Raw Data CSV: <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/Python/levels_fyi_raw_data.csv" target="_blank">Raw Data.csv</a>

Clean Data CSV : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/Python/levels_fyi_clean_data.csv" target="_blank">Clean Data.csv</a>



# Text Data Cleaning using Python

The text data was also cleaned using Python. Below is a description of the process:

1. <b>Punctuation Removal:</b> To ensure consistency in the dataset, punctuation marks were removed from all text values in relevant columns such as company names and job levels. This step helps eliminate variations caused by punctuation and ensures uniformity in the data.

2. <b>Tokenization:</b> The text data was then tokenized, which involves splitting the text into individual words or tokens. Tokenization allows for easier manipulation and analysis of the text data at a granular level.

3. <b>Stop Words:</b> Common English stop words were imported to identify and remove commonly occurring words that do not carry significant meaning in the context of the analysis. These words include articles (e.g., "the," "a"), pronouns (e.g., "he," "she"), and other frequently used words that do not provide specific insights.

4. <b>Custom Stop Words:</b> Additional stop words specific to the analysis were added to the existing set of common English stop words. These custom stop words were chosen to capture domain-specific or context-specific terms that may not be relevant to the analysis or carry negligible meaning.

5. <b>Stemming:</b> Stemming is the process of reducing words to their base or root form by removing suffixes or prefixes. It helps in standardizing words and reducing their variations, enabling better grouping and analysis. A stemming algorithm (such as the Porter stemming algorithm) can be applied to the tokenized text to perform stemming.

6. <b>Lemmatization:</b> Lemmatization is similar to stemming but goes a step further by considering the context and meaning of words. It reduces words to their base form (lemma) using vocabulary analysis and morphological parsing. Lemmatization helps in obtaining more meaningful and accurate results compared to stemming. Tools like NLTK (Natural Language Toolkit) offer lemmatization capabilities.

7. <b>Text Cleaning Utility Function:</b> To streamline the text cleaning process, a utility function can be created using regex patterns. This function can handle various cleaning operations such as removing special characters, numbers, or specific patterns based on regular expressions. It helps in achieving consistent and standardized text data suitable for analysis.
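Steps 1-4 above can be sketched in a single self-contained utility function. The stop-word sets here are deliberately abbreviated stand-ins: the real pipeline imports NLTK's English stop words and extends them with custom, domain-specific terms:

```python
import re

# Abbreviated stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "he", "she", "and", "of", "at"}
# Hypothetical custom stop words specific to the salary/job-title domain.
CUSTOM_STOP_WORDS = {"engineer", "company"}

def clean_text(text: str) -> list[str]:
    """Punctuation removal -> tokenization -> stop-word filtering (steps 1-4)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation/special characters
    tokens = text.split()                     # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS | CUSTOM_STOP_WORDS]
```

In the actual pipeline, the surviving tokens would then be passed through a stemmer such as NLTK's PorterStemmer or a lemmatizer such as WordNetLemmatizer, as described in steps 5 and 6.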

Python Code to clean data : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Gathering/Python/Analyzing_Salaries_Scraping_LevelsFyi.ipynb" target="_blank">Python code to clean text data</a>

Raw Data CSV : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/Python/levels_fyi_raw_data.csv" target="_blank">Raw Data.csv</a>

Clean Data CSV : <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/data/Python/levels_fyi_clean_data.csv" target="_blank">Clean Data.csv</a>