It’s great that you have narrowed down your search to the District of Columbia, Maryland and Virginia (DMV) area.
The region is quickly becoming a hotspot for data science professionals seeking rewarding employment. Its proximity to prominent academic institutions, government agencies, and cutting-edge enterprises creates a distinctive environment that encourages creativity and collaboration in data science. The DMV region has a thriving IT sector, with public- and private-sector organizations producing enormous volumes of data ready for investigation and analysis. It also has a deep talent pool: each year, a large number of top universities and colleges graduate skilled students in data science and related subjects.
Additionally, the area's vibrant startup scene and strong support for entrepreneurship create a climate that nurtures data-driven businesses, making it a desirable location for data science professionals looking to have a real impact. Overall, with its combination of a thriving industry, research opportunities, and talent pool, the DMV region is a prime center for data science professionals wishing to grow their careers in a challenging and rewarding environment.
Let me walk you through this comprehensive report, which will help you find your next job.
2 Data
Here is some information about the dataset, which was provided by our very own Georgetown University DSAN department.
This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.
There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs, all of which were active online as of 04/14/2023.
The postings were obtained using the following search query terms:
“data-scientist”,
“data-analyst”,
“neural-networks”,
“big-data-and-cloud-computing”,
“machine-learning”,
“reinforcement-learning”,
“deep-learning”,
“time-series”,
“block-chain”,
“natural-language-processing”
The search was a local search centered on DC. The files may contain duplicate job postings.
Furthermore, the results extend beyond the specified search area, although they likely contain a higher sampling of jobs in the DC area.
The search results are stored in multiple JSON files, one per search, with the file name representing the search term.
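To make the file layout concrete, here is a minimal, hypothetical sketch of how one such file is assumed to look and how it can be discovered and read. The directory name `demo_data` and the sample record are invented for illustration; the file-per-search-term naming and the `jobs_results` key mirror the parsing code used later in this report.

```python
import json
import glob
import os

# Hypothetical minimal example of the assumed file layout: each search term
# maps to one JSON file whose "jobs_results" key holds the postings.
sample = {
    "jobs_results": [
        {"job_id": "abc123", "title": "Data Scientist", "location": "Washington, DC"}
    ]
}

# Write a sample file named after its search term, then discover it with glob,
# mirroring how the real files ("data-scientist.json", etc.) are read.
path = "demo_data"
os.makedirs(path, exist_ok=True)
with open(os.path.join(path, "data-scientist.json"), "w") as fp:
    json.dump(sample, fp)

files = glob.glob(os.path.join(path, "*.json"))
with open(files[0]) as fp:
    data = json.load(fp)

print(len(data["jobs_results"]))  # number of postings in this file
```

In the real dataset each search term contributes one such file, so iterating over the `glob.glob(...)` result visits every search.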
3 Data preparation
3.1 Importing the libraries
This step needs no explanation. Required packages must always be loaded.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import klib
import plotly.graph_objects as go
import folium
from folium import plugins
from shapely.geometry import Polygon, Point
from wordcloud import WordCloud
import json
import glob
import os
import re
import nltk
from nltk.stem import PorterStemmer
from string import punctuation
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
import warnings
warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/raghavsharma/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/raghavsharma/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/raghavsharma/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
3.2 Importing the dataset
I created a driver function to import the data in such a manner that both audiences (technical and non-technical) are able to understand it.
The function below returns a list of dataframes, one for each job search.
Code
# Function to create the required dataframe for analysis.
def create_job_df(path):
    """
    Takes as input directory to construct a list of dataframes from and returns that list
    :param path: a Path to a directory
    :return: a list of pandas DataFrames
    """
    # Get every file in the folder using glob
    all_files = glob.glob(os.path.join(path, "*.json"))

    # Lists for appending dataframes for every job search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        if 'jobs_results' in data:
            # Create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data cleaning
            # Via
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []
            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)
                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                elif n == 2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                elif n == 1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)
                if 'salary' in extn.keys():
                    salary.append(extn['salary'])
                else:
                    salary.append(np.nan)
                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions',
                             'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location',
                     'description', 'responsibilities', 'qualifications', 'benefits',
                     'salary', 'via', 'posted', 'resources']]

            # Route the dataframe to the list matching the search term in the file name
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)

    # Concat the lists to create the merged dataframes
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)
    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)
    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)
    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)
    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)
    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)
    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)
    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)
    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)
    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # Return the list of dataframes for every job search
    return [data_scientist_df, data_analyst_df, neural_networks_df,
            big_data_and_cloud_computing_df, machine_learning_df,
            reinforcement_learning_df, deep_learning_df, time_series_df,
            block_chain_df, natural_language_processing_df]
Now that you’ve understood the function, let’s see what kind of dataframe we get for the potential analysis.
Code
# Define path
path = '../data/DC/'

# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search, which might later be used for potential analyses.
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings around DC
dc_jobs = pd.concat(df_list, axis=0, ignore_index=True)
dc_jobs.head()
| | job_id | title | company_name | job_type | location | description | responsibilities | qualifications | benefits | salary | via | posted | resources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uLV... | Data Scientist - On-Site | BU_IS, Raytheon Technologies | Full-time | Herndon, VA | Date Posted:\n2022-11-10-08:00\n...\nCountry:\... | [The Data Scientists will be responsible for i... | [Bachelor’s degree with 5 years of experience ... | NaN | NaN | Raytheon Jobs - Raytheon Technologies | NaN | https://www.google.com/search?hl=en&gl=us&q=BU... |
| 1 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | Gryphon Scientific | Full-time | Takoma Park, MD | This position offers options for in-person (in... | [As a junior data scientist at Gryphon, you wi... | [Undergraduate degree in a relevant discipline... | [Fully paid health insurance for staff, and di... | NaN | Jobs By Workable | 6 days ago | http://www.gryphonscientific.com/ |
| 2 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgQ29uc3... | Data Scientist, Consultant | Guidehouse | Full-time | Wheaton-Glenmont, MD | Job Family:\nData Science Consulting\n...\nTra... | [Support National Security clients as a Data S... | [U.S. citizenship and ability to obtain a Publ... | [Guidehouse offers a comprehensive, total rewa... | NaN | Monster | 12 hours ago | https://www.google.com/search?hl=en&gl=us&q=Gu... |
| 3 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... | Data Scientist Jobs | ManTech International | Full-time | Herndon, VA | Secure our Nation, Ignite your Future\n\nEach ... | [Each day U.S. Customs and Border Protection (... | [HS Diploma/GED and 20+ years, AS/AA and 18+ y... | [The projected compensation range for this pos... | NaN | Clearance Jobs | 11 hours ago | http://www.mantech.com/ |
| 4 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | ENS Solutions | Full-time | Washington, DC | The contractor shall work closely with teammat... | [The contractor shall work closely with teamma... | [Experience with data exploration, data cleani... | [Platinum-Level Health/Vision/Dental coverage ... | NaN | Cleared Careers | NaN | https://www.google.com/search?hl=en&gl=us&q=EN... |
3.3 Data wrangling, munging and cleaning
This is quite an interesting section. You will see how the data was cleaned and munged and which other preprocessing techniques were applied. This section also involves feature extraction.
We see that some columns store categorical data as lists. I created a function to join these lists to form the full corpus for the specific column.
Code
def join_data(data_lst):
    # Check if data_lst is not NaN
    if data_lst is not np.nan:
        # If data_lst is not NaN, join the elements with ". " as the separator
        return ". ".join(data_lst)
    # If data_lst is NaN, return NaN
    return np.nan

dc_jobs['responsibilities'] = dc_jobs['responsibilities'].apply(join_data)
dc_jobs['qualifications'] = dc_jobs['qualifications'].apply(join_data)
dc_jobs['benefits'] = dc_jobs['benefits'].apply(join_data)
Some of the job postings listed their location as ‘Anywhere’. So I did some feature extraction and created a new column (‘remote’) which specifies whether the job allows remote work or not.
Code
# Function to check if the job location is remote
def remote_or_not(location):
    # Check if the location is "anywhere" (case-insensitive and stripped of leading/trailing spaces)
    if location.lower().strip() == 'anywhere':
        # If the location is "anywhere", return True
        return True
    # If the location is not "anywhere", return False
    return False

# Apply the remote_or_not function to the 'location' column of the 'dc_jobs' DataFrame and create a new 'remote' column
dc_jobs['remote'] = dc_jobs['location'].apply(remote_or_not)
Next, I saw that the ‘location’ column had some absurd values. Therefore this column was cleaned, and the respective cities and states were extracted for later analyses.
Code
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    # Get the number of elements in loc_lst
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, return it as both city and state
        return loc_lst[0].strip(), loc_lst[0].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(dc_jobs['location'][i])
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'dc_jobs' DataFrame
dc_jobs['city'] = cities
dc_jobs['state'] = states

# Merge certain states for consistency
dc_jobs['state'] = dc_jobs['state'].replace('Maryland', 'MD')
dc_jobs['state'] = dc_jobs['state'].replace('New York', 'NY')
dc_jobs['state'] = dc_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
dc_jobs['state'] = dc_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
dc_jobs.drop(columns=['location'], inplace=True)
dc_jobs = dc_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state', 'remote',
                   'description', 'responsibilities', 'qualifications', 'benefits',
                   'salary', 'via', 'posted', 'resources']]
I have dropped the duplicate job postings. This was done carefully, taking into account the job title, company name, and location (city, state), since an employer may post the same job at a different location.
Code
# Remove duplicate postings based on title, company name, and location
dc_jobs = dc_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
dc_jobs.head()
| | job_id | title | company_name | job_type | city | state | remote | description | responsibilities | qualifications | benefits | salary | via | posted | resources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uLV... | Data Scientist - On-Site | BU_IS, Raytheon Technologies | Full-time | Herndon | VA | False | Date Posted:\n2022-11-10-08:00\n...\nCountry:\... | The Data Scientists will be responsible for id... | Bachelor’s degree with 5 years of experience w... | NaN | NaN | Raytheon Jobs - Raytheon Technologies | NaN | https://www.google.com/search?hl=en&gl=us&q=BU... |
| 1 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | Gryphon Scientific | Full-time | Takoma Park | MD | False | This position offers options for in-person (in... | As a junior data scientist at Gryphon, you wil... | Undergraduate degree in a relevant discipline ... | Fully paid health insurance for staff, and dis... | NaN | Jobs By Workable | 6 days ago | http://www.gryphonscientific.com/ |
| 2 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgQ29uc3... | Data Scientist, Consultant | Guidehouse | Full-time | Wheaton-Glenmont | MD | False | Job Family:\nData Science Consulting\n...\nTra... | Support National Security clients as a Data Sc... | U.S. citizenship and ability to obtain a Publi... | Guidehouse offers a comprehensive, total rewar... | NaN | Monster | 12 hours ago | https://www.google.com/search?hl=en&gl=us&q=Gu... |
| 3 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... | Data Scientist Jobs | ManTech International | Full-time | Herndon | VA | False | Secure our Nation, Ignite your Future\n\nEach ... | Each day U.S. Customs and Border Protection (C... | HS Diploma/GED and 20+ years. AS/AA and 18+ ye... | The projected compensation range for this posi... | NaN | Clearance Jobs | 11 hours ago | http://www.mantech.com/ |
| 4 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | ENS Solutions | Full-time | Washington | DC | False | The contractor shall work closely with teammat... | The contractor shall work closely with teammat... | Experience with data exploration, data cleanin... | Platinum-Level Health/Vision/Dental coverage w... | NaN | Cleared Careers | NaN | https://www.google.com/search?hl=en&gl=us&q=EN... |
3.3.1 Missing Data
I always find missing data crucial to any analysis. Searching for missing data is the first and most important stage of data cleaning. Checking the missing values for each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted, as this project entails combining the dataframes.
Hence I feel that, before progressing, one should always check for missing data and take appropriate steps to handle it.
Below you can find a function that returns the missing-value statistics of the created dataframe.
Code
# Define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # Define columns of the data-frame
    df_stats = pd.DataFrame(columns=['column', 'unique_val', 'num_unique_val',
                                     'num_unique_val_nona', 'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    for c in df.columns:
        # Column
        tmp['column'] = [c]
        # Unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # Number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # Number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # Number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # Percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum() / len(df)).round(3) * 100
        # Append the values to the dataframe (DataFrame.append was removed in pandas 2.0)
        df_stats = pd.concat([df_stats, tmp])
    # Return the created dataframe
    return df_stats

missing_val_stats(dc_jobs)
| | column | unique_val | num_unique_val | num_unique_val_nona | num_miss | pct_miss |
|---|---|---|---|---|---|---|
| 0 | job_id | [eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uL... | 314 | 314 | 0 | 0.0 |
| 0 | title | [Data Scientist - On-Site, Junior Data Scienti... | 252 | 252 | 0 | 0.0 |
| 0 | company_name | [BU_IS, Raytheon Technologies, Gryphon Scienti... | 239 | 239 | 0 | 0.0 |
| 0 | job_type | [Full-time, Contractor, Part-time, Internship] | 4 | 4 | 0 | 0.0 |
| 0 | city | [Herndon, Takoma Park, Wheaton-Glenmont, Washi... | 84 | 84 | 0 | 0.0 |
| 0 | state | [VA, MD, DC, Anywhere, CA, AZ, MA, SC, NY, LA,... | 26 | 26 | 0 | 0.0 |
| 0 | remote | [False, True] | 2 | 2 | 0 | 0.0 |
| 0 | description | [Date Posted:\n2022-11-10-08:00\n...\nCountry:... | 309 | 309 | 0 | 0.0 |
| 0 | responsibilities | [The Data Scientists will be responsible for i... | 257 | 256 | 39 | 12.4 |
| 0 | qualifications | [Bachelor’s degree with 5 years of experience ... | 297 | 297 | 0 | 0.0 |
| 0 | benefits | [nan, Fully paid health insurance for staff, a... | 149 | 148 | 156 | 49.7 |
| 0 | salary | [nan, 20–28 an hour, 384,458 a year, 132,368 a... | 33 | 32 | 262 | 83.4 |
| 0 | via | [Raytheon Jobs - Raytheon Technologies, Jobs B... | 116 | 116 | 0 | 0.0 |
| 0 | posted | [nan, 6 days ago, 12 hours ago, 11 hours ago, ... | 39 | 38 | 121 | 38.5 |
| 0 | resources | [https://www.google.com/search?hl=en&gl=us&q=B... | 240 | 240 | 0 | 0.0 |
As you can see, of all the columns, ‘salary’ has 83.4% missing values. This can get really tricky because salary is not just any variable that you can impute: it depends on several factors, such as the employer, the job title, and the geographical location.
Let’s also visualize the missing data using the ‘klib’ library so that you can see this trend for each column in the dataset.
The klib library helps us visualize missing-data trends in the dataset. Using the ‘missing_val’ plot, we can extract the necessary information about the missing data in every column.
Code
"Missing Value Plot"dc_klib = klib.missingval_plot(dc_jobs, figsize=(10,15))
My point is: if such a huge amount of salary data is missing, how should I proceed with my research to help you make a very important decision about your career in the DMV area?
4 Interesting insight
This usually is not the case, but sometimes an employer may provide salary information either in the description or in the benefits. Hence I decided to investigate and verify whether I could come up with something useful.
It turns out my intuition was right. Accordingly, I have provided two very interesting salary analyses, based on the benefits and the description respectively.
4.1 Salary analysis using benefits of a job provided by the employer
I will use the functions below to extract a salary range for each job whose listed benefits contain salary information.
Code
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['benefits'])):
    benefit = dc_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, append it to 'ben_sal' and its job ID to 'ben_job_id'
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(dc_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and the corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, append them to 'ben_sal_list' and the corresponding job ID to 'ben_job_id_lst'
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
The salary ranges have been extracted from the benefits of some job IDs. Note that these are currently string values. The function below converts each value to a float.
Code
# Function to convert a single value to float
def convert_to_float(value):
    try:
        # Check for values containing 'k' (thousands)
        flag = False
        if 'k' in value or 'K' in value:
            flag = True
        # Check for values containing '.'
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'  # Regular expression pattern
        match = re.search(pattern, value)
        if match:
            value = match.group(1).replace('.', '')  # Remove dots from the matched value
        # Remove any non-digit characters (e.g., commas, hyphens, the 'k' itself)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the original value ended with 'k'
        # (the 'k' was already stripped above, so the remaining digits are the full number)
        if flag:
            return float(value) * 1000
        else:
            return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
Our last step is to iterate over the ‘converted_data’ list above and filter our original dataframe.
Code
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
ben_filtered_df = dc_jobs[dc_jobs['job_id'].isin(ben_job_id_lst)]
Now that we have a new dataframe, can we proceed right away?
This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it to create a new one, always run some pre-verification checks. This ensures the data is tidy before you proceed with your study.
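As a minimal illustration of the kind of pre-verification check meant here (the names `extracted_ranges` and `filtered_rows` are stand-ins for this sketch, not variables from the report):

```python
# Hypothetical pre-verification check: the number of extracted salary ranges
# must line up with the number of rows in the filtered dataframe before the
# two are combined column-wise.
extracted_ranges = [[50000.0, 75000.0], [137000.0, 190000.0]]  # stand-in for the cleaned ranges
filtered_rows = 2                                              # stand-in for the filtered row count

assert len(extracted_ranges) == filtered_rows, "row counts diverged after filtering"
# Every range should be a one- or two-element list after the clean-up step.
assert all(len(r) in (1, 2) for r in extracted_ranges)
```

If either assertion fails, the filtering and the extraction have drifted apart and assigning the ranges as new columns would silently misalign rows.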
After taking a deep dive, I realized the salary provided for each job is either hourly or yearly, but this wasn’t distinguished in the original data. Hence I thought it would make sense to add another column describing whether the provided salary is hourly or yearly.
Code
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of salaries from the ben_filtered_df DataFrame
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)
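The ‘salary_status’ column that appears in the next set of statistics is not created in the block above; a minimal sketch of the threshold rule it presumably follows is shown below (the <= 100 cutoff, which also appears later in the description-based analysis, assumes hourly rates never exceed $100 while yearly salaries always do):

```python
import numpy as np
import pandas as pd

# Heuristic: a minimum salary of 100 or less is treated as an hourly rate,
# anything larger as a yearly figure; NaN stays NaN.
def salary_status(salary):
    if np.isnan(salary):
        return np.nan
    return 'Hourly' if salary <= 100 else 'Yearly'

# Tiny demonstration on made-up values
df = pd.DataFrame({'min_salary': [28.0, 50000.0, np.nan]})
df['salary_status'] = df['min_salary'].apply(salary_status)
print(df['salary_status'].tolist())  # ['Hourly', 'Yearly', nan]
```

The same `apply` call on the benefits-filtered dataframe would produce the column shown in the statistics that follow.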
So the filtered dataframe has been cleaned. Before proceeding with the visualization, let’s take a look at its missing-value statistics.
Code
missing_val_stats(ben_filtered_df)
| | column | unique_val | num_unique_val | num_unique_val_nona | num_miss | pct_miss |
|---|---|---|---|---|---|---|
| 0 | job_id | [eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc... | 74 | 74 | 0 | 0.0 |
| 0 | title | [Junior Data Scientist, Data Scientist Jobs, R... | 63 | 63 | 0 | 0.0 |
| 0 | company_name | [Gryphon Scientific, ManTech International, Bo... | 53 | 53 | 0 | 0.0 |
| 0 | job_type | [Full-time, Internship, Contractor] | 3 | 3 | 0 | 0.0 |
| 0 | city | [Takoma Park, Herndon, Washington, Columbia, A... | 32 | 32 | 0 | 0.0 |
| 0 | state | [MD, VA, DC, CA, WA, Anywhere, NY, HI, IL] | 9 | 9 | 0 | 0.0 |
| 0 | remote | [False, True] | 2 | 2 | 0 | 0.0 |
| 0 | description | [This position offers options for in-person (i... | 73 | 73 | 0 | 0.0 |
| 0 | responsibilities | [As a junior data scientist at Gryphon, you wi... | 68 | 68 | 0 | 0.0 |
| 0 | qualifications | [Undergraduate degree in a relevant discipline... | 69 | 69 | 0 | 0.0 |
| 0 | benefits | [Fully paid health insurance for staff, and di... | 70 | 70 | 0 | 0.0 |
| 0 | via | [Jobs By Workable, Clearance Jobs, CareerCircl... | 40 | 40 | 0 | 0.0 |
| 0 | posted | [6 days ago, 11 hours ago, 5 days ago, nan, 3 ... | 20 | 19 | 30 | 40.5 |
| 0 | resources | [http://www.gryphonscientific.com/, http://www... | 52 | 52 | 0 | 0.0 |
| 0 | min_salary | [50000.0, 137000.0, 73100.0, 100000.0, 110574.... | 57 | 56 | 1 | 1.4 |
| 0 | max_salary | [75000.0, 190000.0, 166000.0, 145000.0, 187976... | 54 | 53 | 1 | 1.4 |
| 0 | salary_status | [Yearly, Hourly, nan] | 3 | 2 | 1 | 1.4 |
You can see that the ‘min_salary’, ‘max_salary’, and ‘salary_status’ columns each have only one missing value, so it wouldn’t hurt to drop that row.
Code
# Drop nan values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
| | job_id | title | company_name | job_type | city | state | remote | description | responsibilities | qualifications | benefits | via | posted | resources | min_salary | max_salary | salary_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | Gryphon Scientific | Full-time | Takoma Park | MD | False | This position offers options for in-person (in... | As a junior data scientist at Gryphon, you wil... | Undergraduate degree in a relevant discipline ... | Fully paid health insurance for staff, and dis... | Jobs By Workable | 6 days ago | http://www.gryphonscientific.com/ | 50000.0 | 75000.0 | Yearly |
| 3 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... | Data Scientist Jobs | ManTech International | Full-time | Herndon | VA | False | Secure our Nation, Ignite your Future\n\nEach ... | Each day U.S. Customs and Border Protection (C... | HS Diploma/GED and 20+ years. AS/AA and 18+ ye... | The projected compensation range for this posi... | Clearance Jobs | 11 hours ago | http://www.mantech.com/ | 137000.0 | 190000.0 | Yearly |
| 6 | eyJqb2JfdGl0bGUiOiJSZXNlYXJjaCBEYXRhIFNjaWVudG... | Research Data Scientist | Booz Allen Hamilton | Full-time | Washington | DC | False | Job Number: R0163240\n\nResearch Data Scientis... | You'll work closely with your customer to un... | 2+ years of experience with machine learning, ... | You may be able to take advantage of our tuiti... | CareerCircle.com | 5 days ago | http://www.boozallen.com/ | 73100.0 | 166000.0 | Yearly |
| 9 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgTGVhZC... | Data Scientist, Lead | Blend360 | Full-time | Columbia | MD | False | Company Description\n\nBlend360 is a Dynamic, ... | We evaluate different sources of data, discove... | Master's degree in Statistics, Math, Operation... | The starting pay range for this role is $100,0... | ZipRecruiter | NaN | http://blend360.com/ | 100000.0 | 145000.0 | Yearly |
| 10 | eyJqb2JfdGl0bGUiOiJDbGVhcmVkIERhdGEgU2NpZW50aX... | Cleared Data Scientist | ICF | Full-time | Arlington | VA | False | We are looking for an experienced Data Scienti... | Ability to perform knowledge elicitation from ... | The ideal candidate should be someone who is s... | Pay Range - There are multiple factors that ar... | LinkedIn | 3 days ago | https://www.icf.com/ | 110574.0 | 187976.0 | Yearly |
We have our final dataframe for the benefits-based salary analysis. Let’s follow the same steps for the other analysis using the ‘description’ column.
4.2 Salary analysis using description of the job provided by the employer
Code
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the dc_jobs DataFrame
for i in range(len(dc_jobs['description'])):
    descpt = dc_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(dc_jobs['job_id'][i])

# If the description contained the keyword, extract the salary from it.
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        # Index into the parallel desc_job_id list, which lines up with desc_sal
        desc_job_id_lst.append(desc_job_id[i])

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
desc_filtered_df = dc_jobs[dc_jobs['job_id'].isin(desc_job_id_lst)]

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the converted data
for sal_lst in desc_converted_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
                                     'remote', 'description', 'responsibilities', 'qualifications',
                                     'benefits', 'min_salary', 'max_salary', 'salary_status',
                                     'via', 'posted', 'resources']]
desc_filtered_df.head()
| | job_id | title | company_name | job_type | city | state | remote | description | responsibilities | qualifications | benefits | min_salary | max_salary | salary_status | via | posted | resources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uLV... | Data Scientist - On-Site | BU_IS, Raytheon Technologies | Full-time | Herndon | VA | False | Date Posted:\n2022-11-10-08:00\n...\nCountry:\... | The Data Scientists will be responsible for id... | Bachelor’s degree with 5 years of experience w... | NaN | 50000.0 | 75000.0 | Yearly | Raytheon Jobs - Raytheon Technologies | NaN | https://www.google.com/search?hl=en&gl=us&q=BU... |
| 2 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgQ29uc3... | Data Scientist, Consultant | Guidehouse | Full-time | Wheaton-Glenmont | MD | False | Job Family:\nData Science Consulting\n...\nTra... | Support National Security clients as a Data Sc... | U.S. citizenship and ability to obtain a Publi... | Guidehouse offers a comprehensive, total rewar... | 137000.0 | 190000.0 | Yearly | Monster | 12 hours ago | https://www.google.com/search?hl=en&gl=us&q=Gu... |
| 3 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... | Data Scientist Jobs | ManTech International | Full-time | Herndon | VA | False | Secure our Nation, Ignite your Future\n\nEach ... | Each day U.S. Customs and Border Protection (C... | HS Diploma/GED and 20+ years. AS/AA and 18+ ye... | The projected compensation range for this posi... | 73100.0 | 166000.0 | Yearly | Clearance Jobs | 11 hours ago | http://www.mantech.com/ |
| 4 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | ENS Solutions | Full-time | Washington | DC | False | The contractor shall work closely with teammat... | The contractor shall work closely with teammat... | Experience with data exploration, data cleanin... | Platinum-Level Health/Vision/Dental coverage w... | 100000.0 | 145000.0 | Yearly | Cleared Careers | NaN | https://www.google.com/search?hl=en&gl=us&q=EN... |
| 5 | eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... | Junior Data Scientist | cBEYONData | Full-time | Fort Belvoir | VA | False | cBEYONData is seeking a Junior Data Scientist ... | cBEYONData is seeking a Junior Data Scientist ... | 1-5 years of experience in Data Science field.... | NaN | 110574.0 | 187976.0 | Yearly | DataYoshi | 28 days ago | http://www.cbeyondata.com/ |
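To make the extraction logic above concrete, here is a small, self-contained sketch of the same regex and the hourly/yearly rule applied to a made-up description string. The sample text and the `to_float` helper are illustrative stand-ins (the report's own converter, `convert_to_float`, is defined elsewhere):

```python
import re

# The same pattern used above to pull dollar amounts out of free-text descriptions
salary_pattern = r"\$([\d,.-]+[kK]?)"

sample = "The projected pay range is $73,100 - $166,000 per year."
matches = re.findall(salary_pattern, sample)
print(matches)  # ['73,100', '166,000']

def to_float(value):
    # Hypothetical stand-in for the convert_to_float helper used in the report
    value = value.strip(",.-").replace(",", "")
    if value.lower().endswith("k"):
        return float(value[:-1]) * 1000
    return float(value)

def salary_status(salary):
    # Values of 100 or less are treated as hourly wages, larger ones as yearly
    return 'Hourly' if salary <= 100 else 'Yearly'

print(to_float(matches[0]), salary_status(to_float(matches[0])))  # 73100.0 Yearly
```

The 100-dollar threshold is the same heuristic the report uses: no plausible hourly wage exceeds $100, and no plausible annual salary falls below it.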
Like benefits, the description column has also been used to create a separate dataframe that I will use to visualize salary information so that you can gain some interesting insights.
Let’s check for missing data in this dataframe as well.
Code
missing_val_stats(desc_filtered_df)
| column | unique_val | num_unique_val | num_unique_val_nona | num_miss | pct_miss |
|---|---|---|---|---|---|
| job_id | [eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uL... | 116 | 116 | 0 | 0.0 |
| title | [Data Scientist - On-Site, Data Scientist, Con... | 96 | 96 | 0 | 0.0 |
| company_name | [BU_IS, Raytheon Technologies, Guidehouse, Man... | 96 | 96 | 0 | 0.0 |
| job_type | [Full-time, Contractor, Part-time, Internship] | 4 | 4 | 0 | 0.0 |
| city | [Herndon, Wheaton-Glenmont, Washington, Fort B... | 46 | 46 | 0 | 0.0 |
| state | [VA, MD, DC, CA, AZ, Anywhere, MA, SC, NY, LA,... | 15 | 15 | 0 | 0.0 |
| remote | [False, True] | 2 | 2 | 0 | 0.0 |
| description | [Date Posted:\n2022-11-10-08:00\n...\nCountry:... | 115 | 115 | 0 | 0.0 |
| responsibilities | [The Data Scientists will be responsible for i... | 100 | 99 | 13 | 11.2 |
| qualifications | [Bachelor’s degree with 5 years of experience ... | 110 | 110 | 0 | 0.0 |
| benefits | [nan, Guidehouse offers a comprehensive, total... | 51 | 50 | 64 | 55.2 |
| min_salary | [50000.0, 137000.0, 73100.0, 100000.0, 110574.... | 76 | 76 | 0 | 0.0 |
| max_salary | [75000.0, 190000.0, 166000.0, 145000.0, 187976... | 80 | 80 | 0 | 0.0 |
| salary_status | [Yearly, Hourly] | 2 | 2 | 0 | 0.0 |
| via | [Raytheon Jobs - Raytheon Technologies, Monste... | 58 | 58 | 0 | 0.0 |
| posted | [nan, 12 hours ago, 11 hours ago, 28 days ago,... | 26 | 25 | 45 | 38.8 |
| resources | [https://www.google.com/search?hl=en&gl=us&q=B... | 97 | 97 | 0 | 0.0 |
Well, this turned out to be just fine: the columns required for the visualization don’t have any missing data.
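For readers curious about the numbers in the summary above, the `pct_miss` column boils down to a simple count; a minimal pure-Python sketch (the `missing_val_stats` helper itself is defined earlier in the report, so this is only an illustration) looks like:

```python
def pct_missing(values):
    """Share of missing (None/NaN-like) entries in a column, as a percentage rounded to 1 decimal."""
    num_miss = sum(1 for v in values if v is None)
    return round(num_miss / len(values) * 100, 1)

# Illustrative column with 13 missing entries out of 116, mirroring 'responsibilities' above
column = ['text'] * 103 + [None] * 13
print(pct_missing(column))  # 11.2
```

The same arithmetic reproduces the other rows: 64 missing of 116 gives 55.2 for benefits, and 45 of 116 gives 38.8 for posted.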
5 Data Visualization
5.1 Geospatial
My first geospatial plot for jobs around the DMV area comes with the plotly Choropleth module.
Code
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode='USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs in the DMV region</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()
The choropleth map graphically shows the number of jobs in the DMV (District of Columbia, Maryland, and Virginia) region. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs, giving a visual representation of how jobs are distributed across the area. Hovering over a state reveals further details: the name of the state and the total number of jobs in it.
The map’s title, “Number of Jobs in the DMV region,” gives the information a clear context: you can see that Virginia shows 63 jobs, the most in the region, while Maryland shows 55, which sits in the middle of the range.
For my next chart, I used the very famous folium library to create another interactive visualization.
Code
dc_jobs_final = dc_jobs_final[(dc_jobs_final['state'] == 'DC') | (dc_jobs_final['state'] == 'MD') | (dc_jobs_final['state'] == 'VA')]

# CREATE DATA
data = dc_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}              # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
dmv_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS
plugins.MarkerCluster(latlong_list).add_to(dmv_job_map)

# SHOW MAP
dmv_job_map
This is an interactive map demonstrating how jobs are distributed around the DMV (District of Columbia, Maryland, and Virginia) region. By visually portraying the job locations, it provides insight into the geographic distribution of employment prospects within the region. The map’s markers emphasize the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. For job seekers, one of the main benefits of this information is the ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation.
Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique assembles neighboring job locations into clusters, each of which is symbolized by a single marker. This makes it simple for visitors to pinpoint areas with lots of employment prospects. Job searchers can zoom in on these clusters to learn more about individual regions and the regional labor market by doing so. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in the DMV region and eventually assisting in decision-making related to job search and recruitment efforts.
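The bounding-box filter used to build the map above can also be expressed without Shapely, since an axis-aligned box only needs two range checks per coordinate. A minimal sketch with the same contiguous-U.S. bounds (the test points are illustrative):

```python
# Same contiguous-U.S. bounds as in the folium code above
CONTIGUOUS_US = {'min_lat': 24.9493, 'min_long': -124.7333,
                 'max_lat': 49.5904, 'max_long': -66.9548}

def in_box(lat, long, box):
    """True if the (lat, long) pair falls inside the axis-aligned bounding box."""
    return (box['min_lat'] <= lat <= box['max_lat']
            and box['min_long'] <= long <= box['max_long'])

print(in_box(38.9072, -77.0369, CONTIGUOUS_US))  # Washington, DC -> True
print(in_box(51.5074, -0.1278, CONTIGUOUS_US))   # London -> False
```

Shapely earns its keep when the region is an actual polygon (state borders, coastlines); for rectangles, the two comparisons above are equivalent and dependency-free.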
I am hoping that you now have a clear idea about the number of jobs around this region. Since you have reached this far, I am also assuming that you would be interested in knowing more about the jobs in the DMV.
Don’t worry. I have got you covered. Let me walk you step by step so that you are mentally prepared to take your crucial decision.
5.2 Textual Analyses
The dataset provided certainly revolves around text data, so I thought I would use the NLP concepts that I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend you take these courses too, as they have proven to be very beneficial.
Coming to handling the text data, I have created some functions that run in sequence, as if they were part of a pipeline.
Code
def remove_punct(text):
    """A method to remove punctuation from text."""
    text = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text)  # removes numbers from text
    return text

def remove_stopwords(text):
    """A method to remove all the stopwords."""
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """A method to tokenize text data."""
    text = re.split(r'\W+', text)  # splitting each sentence into its individual words
    return text

def stemming(text):
    """A method to perform stemming on text data."""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    """A method to perform lemmatization on text data."""
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# Making a common cleaning function for every part below for code reproducibility
def clean_words(list_words):
    # Regex pattern matching the characters we would like to remove from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/"
    pattern = "[" + character_replace + "]"
    new_list_words = []
    # Loop through every word, remove the characters, and append to a new list.
    # str.replace handles the characters that could not be caught through regex.
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern, "", new_word)
        for char in ['[', ']', '-', '—', '“', '’', '”', '‘', '"', "'", ' ']:
            new_word = new_word.replace(char, '')
        new_list_words.append(new_word)
    # Use filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """A method to do basic data cleaning."""
    # Remove punctuation and numbers from the text
    # (fixed: pass the string itself, not a one-element list, so punctuation is actually stripped)
    cleaned = remove_punct(corpus)
    # Tokenize the text into individual words
    text_tokenized = tokenization(cleaned.lower())
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    # Perform lemmatization on the text
    text_lemmatized = lemmatizer(text_without_stop)
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    # Join the cleaned words back into a single string
    return " ".join(text_final)
How did I create the above pipeline for cleaning text data? The answer to this question would, again, be to take either of the courses mentioned above.
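To see the pipeline’s effect end to end, here is a tiny self-contained version of the same steps (lowercase, strip punctuation and digits, tokenize, drop stopwords) using a small hardcoded stopword list in place of NLTK’s, so the snippet runs without any downloads:

```python
import re
from string import punctuation

# Tiny illustrative stopword list, standing in for nltk.corpus.stopwords
STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'for', 'with', 'in', 'to'}

def mini_clean(text):
    text = ''.join(ch for ch in text if ch not in punctuation)  # strip punctuation
    text = re.sub('[0-9]+', '', text)                           # strip numbers
    tokens = re.split(r'\W+', text.lower())                     # tokenize
    return ' '.join(t for t in tokens if t and t not in STOPWORDS)

print(mini_clean("Experience with Python, SQL, and 5+ years of data analysis!"))
# -> 'experience python sql years data analysis'
```

The full pipeline additionally lemmatizes each token and scrubs stray unicode quotes and dashes, but the shape of the transformation is the same.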
Moving on, for our very first textual analysis, I will be using the pipeline created for the ‘description’ column.
Code
descript_list = []
for descript in dc_jobs['description']:
    descript_list.append(clean_text(descript))
Now that the data has been cleaned, I have used the function below to create a word cloud that can give you some insight into the descriptions of Data Science jobs in the DMV area.
Code
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:
“Data”: This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.
“Experience”: This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.
“Machine Learning”: This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.
“Skills”: This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.
“Analytics”: This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.
Overall, I would advise job seekers should pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics. Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.
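The “recurring words” reading above is just term frequency: the word cloud sizes each word by how often it appears across the cleaned descriptions. A minimal sketch with `collections.Counter` (on a made-up pair of cleaned descriptions, not the real ones) shows how the dominant words would be ranked:

```python
from collections import Counter

# Illustrative cleaned descriptions, not from the dataset
cleaned_descriptions = [
    "data experience machine learning data skills",
    "data analytics experience skills data",
]
counts = Counter(' '.join(cleaned_descriptions).split())
print(counts.most_common(1))  # [('data', 4)]
```

WordCloud performs essentially this count internally before scaling font sizes, which is why ubiquitous terms like “data” dominate the image.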
What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let’s find out by running the pipeline on the ‘responsibilities’ column and generating its word cloud.
Code
# Removing missing values from responsibilities for text cleaning
dc_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in dc_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of responsibilities into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Similar to the description word cloud, we see that words such as ‘data’, ‘machine learning’, ‘design’, ‘big data’, ‘project’, ‘model’, and ‘development’ are prevalent.
This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, build models, and provide an analysis much like the one I am presenting right now.
My advice here would be to practice as much as you can. Be it coding, math, statistics, machine learning, or any other data science concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working: using GitHub and connecting with your teammates over Zoom or Google Meet around a project agenda can shape you up for a corporate environment.
At last, we have the moment of truth: are you capable of doing this job? What qualities must one have to be a suitable fit for the employer?
Let’s check this out.
Code
qualif_list = []
for qualif in dc_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of qualifications into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the word cloud
plt.show()
Based on the word cloud, I can point out certain keywords that correspond to the qualities and skills job seekers must have in order to qualify for a Data Science related job. These are as follows:
Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.
Work Experience: The inclusion of “Work Experience” emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.
Data Science: The prominence of “Data Science” indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.
Bachelor Degree: The presence of “Bachelor Degree” suggests that having a bachelor’s degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.
Machine Learning and Deep Learning: The inclusion of “Machine Learning” and “Deep Learning” highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.
Communication Skills: The mention of “Communication Skill” underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.
Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor’s degree, particularly in a related field, and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.
5.3 Visualizing Salaries
Finally!!
I know that ever since the beginning you have been waiting for this. Scrolling and soaking in every tiny bit of information provided above, you have been waiting for the visualizations depicting salaries. I would say you’ve earned it.
Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to qualify for that job, it’s worth knowing about the pay scale.
5.3.1 Using benefits
My first visualization, generated using plotly, shows the yearly salaries extracted from the benefits field of each job.
Code
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
The plot above shows the salary ranges of the various job titles in a visual manner. The position along the x-axis denotes the salary range, and each data point on the plot is associated with a particular job title. The job titles are displayed on the y-axis, making it simple to compare the salary ranges for various positions.
The salaries extend from $1,500 for a Research Scientist in Machine Learning at OpenAI up to $370,000 for the same role at the same company.
For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. Job searchers can better comprehend the possible earning potential for various roles by examining the distribution of salary ranges. When evaluating employment opportunities and negotiating compensation packages, this information might be helpful.
Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers or ranges that are unusually high or low in comparison to others, which may point to variables impacting the wage such as experience level, area of speciality, or geographic location.
In the end, this visualization enables job seekers to make better selections throughout the hiring process. It enables individuals to focus on options that coincide with their financial aspirations by assisting them in matching their professional goals and expectations with the wage ranges associated with particular job titles.
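One small detail worth knowing when reading the hover labels: the `${salary_range[0]:,}` pieces in the plotting code rely on Python’s thousands-separator format spec. A minimal sketch (a slight variation that also drops the trailing decimal of float salaries; the values are examples):

```python
# Thousands-separator formatting, as used to build the hover labels
min_salary, max_salary = 137000.0, 190000.0
hover = f"Salary Range: ${min_salary:,.0f} - ${max_salary:,.0f}"
print(hover)  # Salary Range: $137,000 - $190,000
```

Without the `,` in the format spec the label would read `$137000.0`, which is much harder to scan across dozens of hover tooltips.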
5.3.1.1 Anomaly
I tried to generate the same plot for the hourly wages in the data too. But since there are very few of them (only 4), it made no sense to generate that plot.
5.3.2 Using description
Code
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.
In the plot above, you will also see that some jobs, such as Big Data Cloud Architect at Tephra Inc., have only one salary, i.e. $5,090. This value comes from the data itself.
The dumbbell plots generated from the salary ranges extracted from benefits and description provide a holistic overview of the salaries offered by employers around the DMV area.
6 Limitations
It must be said that this dataset isn’t perfect. I have given my best effort to extract as much meaningful information from this data as possible, but it certainly has some anomalies.
You may notice that the plotly visuals for salaries extracted from benefits and description show some job titles that are not present in the other plot, and vice versa. If that is the case, it can only mean one thing: the salary was provided in either the benefits or the description, but not in both.
7 Conclusions
The project’s insightful findings have highlighted the DMV region as a prime focus area for aspiring Data Scientists, offering abundant job opportunities and immense potential for career advancement. The concentrated presence of relevant job postings in this region makes it an ideal destination for professionals seeking to thrive in the field.
Furthermore, the analysis has emphasized the critical importance of comprehensive job postings. Companies that provide detailed information on aspects such as salary, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit a genuine consideration for prospective candidates. These organizations are highly sought-after by top talent, making them highly desirable employment prospects.
By immersing myself in the exploration of Data Science job opportunities throughout the DMV area, I have gained invaluable knowledge that will serve as a compass for my career decisions and shape my professional trajectory. I sincerely hope that you, too, have derived substantial benefits from this analysis, enabling you to grasp the intricacies and dynamics of the Data Science job market with a profound understanding.
pd.concat(natural_language_processing_list, axis=0, ignore_index=True)# return the list of dataframes for every jobreturn [data_scientist_df, data_analyst_df, neural_networks_df, big_data_and_cloud_computing_df, machine_learning_df, reinforcement_learning_df, deep_learning_df, time_series_df, block_chain_df, natural_language_processing_df]```Now that you've understood the function, lets see what kind of dataframe do we get for the potential analysis.```{python}# Define pathpath ='../data/DC/'# Execute the driver function to get the list of dataframesdf_list = create_job_df(path)# The respective dataframes for each job search which might be later used for potential analyses.data_scientist_df = df_list[0]data_analyst_df = df_list[1]neural_networks_df = df_list[2]big_data_and_cloud_computing_df = df_list[3]machine_learning_df = df_list[4]reinforcement_learning_df = df_list[5]deep_learning_df = df_list[6]time_series_df = df_list[7]block_chain_df = df_list[8]natural_language_processing_df = df_list[9]# Merge all the dataframes to get all job postings around DCdc_jobs = pd.concat(df_list, axis=0, ignore_index=True)dc_jobs.head()```## Data wrangling, munging and cleaningThis is quite an interesting section. You will witness how the data was cleaned and munged and what other techniques were used to preprocess it. This section will also involve feature-extraction.We see some of the columns have categorical data as a list. I created a function to join these lists to form the full corpus for the specific column.```{python}def join_data(data_lst):# Check if data_lst is not NaNif data_lst isnot np.nan:# If data_lst is not NaN, join the elements with ". " as the separatorreturn". 
".join(data_lst)# If data_lst is NaN, return NaN (assuming np.nan is a valid representation of NaN)return np.nandc_jobs['responsibilities'] = dc_jobs['responsibilities'].apply(join_data)dc_jobs['qualifications'] = dc_jobs['qualifications'].apply(join_data)dc_jobs['benefits'] = dc_jobs['benefits'].apply(join_data)```Some of the job postings had listed their location as 'Anywhere'. So I decided to do some feature extraction and created a new column ('remote') which specifies whether the job available allows remote work or not.```{python}# Function to check if the job location is remotedef remote_or_not(location):# Check if the location parameter is "anywhere" (case-insensitive and stripped of leading/trailing spaces)if location.lower().strip() =='anywhere':# If the location is "anywhere", return TruereturnTrue# If the location is not "anywhere", return FalsereturnFalse# Apply the remote_or_not function to the 'location' column of the 'dc_jobs' DataFrame and create a new 'remote' columndc_jobs['remote'] = dc_jobs['location'].apply(remote_or_not)```Next I saw that the 'location' column had some absurd values. 
Hence, this column was cleaned and the respective cities and states were extracted for later analyses.

```{python}
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    # Get the number of elements in loc_lst
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, return it as both city and state
        return loc_lst[0].strip(), loc_lst[0].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(dc_jobs['location'][i])

    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]

    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'dc_jobs' DataFrame
dc_jobs['city'] = cities
dc_jobs['state'] = states

# Merge certain states for consistency
dc_jobs['state'] = dc_jobs['state'].replace('Maryland', 'MD')
dc_jobs['state'] = dc_jobs['state'].replace('New York', 'NY')
dc_jobs['state'] = dc_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
dc_jobs['state'] = dc_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
dc_jobs.drop(columns=['location'], inplace=True)
dc_jobs = dc_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
                   'remote', 'description', 'responsibilities', 'qualifications',
                   'benefits', 'salary', 'via', 'posted', 'resources']]
```

I have also dropped the duplicate job postings. This was done carefully, taking into account the job title, company name, and location (city, state), since an employer may post the same job at a different location.

```{python}
# Remove duplicates based on title, company name and location
dc_jobs = dc_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
dc_jobs.head()
```

### Missing Data

I always find missing data crucial to any analysis. Searching for missing data is the first and most important stage of data cleaning. Checking the missing values for each column gives a solid idea of which columns are necessary and which need to be adjusted or omitted, especially as this project entails combining dataframes. Hence I feel that before progressing, one should always check for missing data and take appropriate steps to handle it.

Below you can find a function which returns the missing value statistics of the created dataframe.

```{python}
# Define a function that returns a dataframe of missing data statistics
def missing_val_stats(df):
    # Define the columns of the dataframe
    df_stats = pd.DataFrame(columns=['column', 'unique_val', 'num_unique_val',
                                     'num_unique_val_nona', 'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    for c in df.columns:
        # Column name
        tmp['column'] = [c]
        # Unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # Number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # Number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # Number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # Percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum() / len(df)).round(3) * 100
        # Append the values to the dataframe
        df_stats = pd.concat([df_stats, tmp], ignore_index=True)
    # Return the created dataframe
    return df_stats

missing_val_stats(dc_jobs)
```

As you can see, of all the columns, 'salary' has 83.6% missing values. This can get really tricky, because salary is not just any variable that you can impute: it depends upon several factors such as the employer, the job title, geographical location, etc.

Let's also visualize the missing data using the 'klib' library so that you can see this trend for each column in the dataset. klib helps us visualize missing-data trends; using its missing-value plot, we can extract the necessary information about the missing data in every column.

```{python}
# Missing Value Plot
dc_klib = klib.missingval_plot(dc_jobs, figsize=(10, 15))
```

My point is: if such a huge amount of salary data is missing, how should I proceed with my research to help you make a very important decision about your career in the DMV area?

# INTERESTING INSIGHT

This usually is not the case, but sometimes an employer provides salary information either in the description or in the benefits. Hence I decided to troubleshoot and verify whether I could come up with something useful. Turns out, my intuition was right.
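Before walking through the full analyses, here is a minimal, self-contained sketch of the core idea. Note that the benefit strings (`samples`) are invented for illustration, not drawn from the dataset: a simple dollar-amount regex of the same shape as the one used below can recover salary figures from free text.

```python
import re

# Hypothetical benefit snippets, invented purely for illustration
samples = [
    "Salary range: $80,000 - $95,000 plus 401(k) match",
    "Pay: $45.50 per hour",
]

# Same shape of pattern as used in this report: digits, commas,
# dots and hyphens after a dollar sign, with an optional k/K suffix
salary_pattern = r"\$([\d,.-]+[kK]?)"

for s in samples:
    print(re.findall(salary_pattern, s))
```

Every dollar amount in a string comes back as a separate match, which is exactly why the extracted results below are stored as lists per job.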
And as per my intuition, I have prepared two very interesting salary analyses for you, based on the benefits and the description columns respectively.

## Salary analysis using benefits of a job provided by the employer

I will use the functions below to extract a salary range for every job whose listed benefits mention one.

```{python}
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['benefits'])):
    benefit = dc_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, record it together with its job ID
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(dc_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and the corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, record them along with the corresponding job ID
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
```

The salary ranges have now been extracted from the benefits of some job IDs. Note that these are currently string values.
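To see why a conversion step is needed, consider a toy example (the values in `extracted` are assumed, not taken from the data): Python's `float()` cannot parse comma-formatted strings directly.

```python
# Assumed examples of what the regex hands back: strings, not numbers
extracted = ["80,000", "95,000", "80k"]

# A naive float() fails on comma-formatted values
try:
    float(extracted[0])
except ValueError:
    print("float('80,000') raises ValueError")

# Stripping the separators first is the essential step
cleaned = float(extracted[0].replace(",", ""))
print(cleaned)
```

Values with a 'k' suffix need an extra scaling step on top of this, which is what the conversion function handles.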
The function below converts each extracted value to a float.

```{python}
# Function to convert a single value to float
def convert_to_float(value):
    try:
        # Check for values containing 'k' (thousands)
        flag = False
        if 'k' in value or 'K' in value:
            flag = True
        # Regular expression pattern capturing the integer part (with thousands separators)
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'
        match = re.search(pattern, value)
        if match:
            value = match.group(1)
        # Remove any non-digit characters (e.g., commas, hyphens)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the original value ended with 'k'
        if flag:
            return float(value) * 1000
        return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
```

Our last step is to iterate over the 'converted_data' list above and filter our original dataframe.

```{python}
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, keep the minimum and maximum
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
ben_filtered_df = dc_jobs[dc_jobs['job_id'].isin(ben_job_id_lst)]
```

Now that we have a new dataframe, can we proceed right away? This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it to create a new one, always run some pre-verification checks.
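As one illustration of such a check (a sketch only; `filtered` and `ranges` are toy stand-ins, not the real data), I make sure the extracted ranges still line up one-to-one with the filtered rows before attaching them as columns:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the filtered dataframe and the extracted salary ranges
filtered = pd.DataFrame({"job_id": ["a1", "b2", "c3"]})
ranges = [[70000.0, 90000.0], [np.nan], [55000.0, 65000.0]]

# Pre-verification: one extracted range per surviving row,
# and every range holds at most two values (min, max)
assert len(ranges) == len(filtered), "row/range count mismatch"
assert all(len(r) <= 2 for r in ranges), "unexpected range shape"
print("checks passed")
```

If either assertion fires, something went wrong upstream and assigning the new columns would silently misalign salaries with jobs.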
This makes sure the data is tidy before you proceed with your study.

After taking a deep dive, I realized the salary provided for each job is either hourly or yearly, but this wasn't distinguished anywhere. Hence I thought it would make sense to add another column that describes whether the provided salary is hourly or yearly.

```{python}
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of salaries from the ben_filtered_df DataFrame
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)
```
The 'salary_status' column below does exactly that, keyed off the minimum salary.

```{python}
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        # NaN fails both comparisons and falls through to here
        return np.nan

ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)
```

So the filtered dataframe has been cleaned. Before proceeding with the visualization, let's take a look at its missing value statistics.

```{python}
missing_val_stats(ben_filtered_df)
```

You can see that the 'min_salary', 'max_salary' and 'salary_status' columns have only one missing value each, so it wouldn't hurt to drop that row.

```{python}
# Drop the rows with NaN salary values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
```

We have our final dataframe for the salary analysis of a job using the benefits provided. Let's follow the same steps for the other analysis, using the 'description' column.

## Salary analysis using description of the job provided by the employer

```{python}
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the dc_jobs DataFrame
for i in range(len(dc_jobs['description'])):
    descpt = dc_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(dc_jobs['job_id'][i])

# If the description contained the keyword, extract the salary from it
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        desc_job_id_lst.append(desc_job_id[i])

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, keep the minimum and maximum
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
desc_filtered_df = dc_jobs[dc_jobs['job_id'].isin(desc_job_id_lst)]

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the corrected data
for sal_lst in desc_correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
                                     'remote', 'description', 'responsibilities', 'qualifications',
                                     'benefits', 'min_salary', 'max_salary', 'salary_status',
                                     'via', 'posted', 'resources']]
desc_filtered_df.head()
```

Like the benefits, the description column has also been used to create a separate dataframe that I will use to visualize salary information so that you can gain interesting insights. Let's check this dataframe for missing data as well.

```{python}
missing_val_stats(desc_filtered_df)
```

This turned out just fine: the columns required for the visualization don't have any missing data.

# Data Visualization

## Geospatial

```{python}
#| echo: false
#| warning: false

# Create a new column 'Address' by combining the 'city' and 'state' columns
dc_jobs['Address'] = dc_jobs['city'] + ', ' + dc_jobs['state']

# Read the address coordinates from the 'address_coords.csv' file
address_df = pd.read_csv("../data/address_coords.csv")

# Merge 'dc_jobs' with 'address_df' on the 'Address' column to attach the coordinates
dc_jobs_final = pd.merge(dc_jobs, address_df, on="Address", how="left")

# Read the 'uscities.csv' file containing State IDs and State Names
uscities_df = pd.read_csv("../data/uscities.csv")

# Extract the State IDs and State Names from 'uscities_df' and drop duplicate rows
state_name = uscities_df[["state_id", "state_name"]].drop_duplicates().reset_index(drop=True)
state_name.columns = ["state", "state_name"]

# Count the number of job sightings in each state
total_count_jobs = dc_jobs_final.groupby("state")['job_id'].count().reset_index()
total_count_jobs.columns = ["state", "total_count"]

# Filter the total count of jobs for specific states (DC, MD, Maryland, VA)
total_count_jobs = total_count_jobs[(total_count_jobs["state"] == "DC") |
                                    (total_count_jobs["state"] == "MD") |
                                    (total_count_jobs["state"] == "Maryland") |
                                    (total_count_jobs["state"] == "VA")].reset_index(drop=True)

# Merge the state IDs and state names with the total count of jobs
total_count_jobs = pd.merge(total_count_jobs, state_name, on="state", how="left")
```

My first geospatial plot for jobs around the DMV area comes with the plotly Choropleth module.

```{python}
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode='USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs in the DMV region</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()
```

The choropleth map shows the number of jobs in the DMV (District of Columbia, Maryland, and Virginia) region. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs, giving a visual representation of the distribution of jobs in the DMV area. Hovering over a state reveals further details: the name of the state and the overall number of jobs there.
The caption of the map, "Number of Jobs in the DMV region," gives the displayed information a clear context: you can see that Virginia shows 63 jobs, which is the most, while Maryland shows 55, which sits around the middle of the range.

For my next chart, I used the very famous folium library to create another interactive visualization.

```{python}
dc_jobs_final = dc_jobs_final[(dc_jobs_final['state'] == 'DC') |
                              (dc_jobs_final['state'] == 'MD') |
                              (dc_jobs_final['state'] == 'VA')]

# CREATE DATA
data = dc_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}              # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
dmv_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS
plugins.MarkerCluster(latlong_list).add_to(dmv_job_map)

# SHOW MAP
dmv_job_map
```

This interactive map demonstrates how jobs are distributed around the DMV (District of Columbia, Maryland, and Virginia) region. It provides insightful information on the geographic distribution of employment prospects within the DMV region by visually portraying the job locations.
The map's markers highlight the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. The main benefit for job seekers is the ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation.

Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique assembles neighboring job locations into clusters, each symbolized by a single marker. This makes it simple to pinpoint areas with lots of employment prospects; job seekers can zoom in on these clusters to learn more about individual regions and the regional labor market. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in the DMV region and ultimately assisting in decisions related to job search and recruitment efforts.

I hope you now have a clear idea about the number of jobs around this region. Since you have reached this far, I am also assuming that you would be interested in knowing more about the jobs in the DMV. Don't worry, I have got you covered. Let me walk you through step by step so that you are prepared to take this crucial decision.

## Textual Analyses

The dataset provided certainly revolves around text data. So I decided to use the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses.
I would recommend you take these courses too, as they have proven to be very beneficial.

Coming to handling the text data, I created some functions that run in sequence, as if in a pipeline.

```{python}
def remove_punct(text):
    """A method to remove punctuation from text"""
    text = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text)  # removes numbers from text
    return text

def remove_stopwords(text):
    """A method to remove all the stopwords"""
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """A method to tokenize text data"""
    text = re.split(r'\W+', text)  # splitting each sentence into its individual words
    return text

def stemming(text):
    """A method to perform stemming on text data"""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    """A method to perform lemmatization on text data"""
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# A common cleaning function for every part below, for code reproducibility
def clean_words(list_words):
    # A regex pattern matching the characters we would like to remove from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/"
    pattern = "[" + character_replace + "]"
    new_list_words = []
    # Loop through every word, remove the characters and append to a new list;
    # str.replace handles the characters that could not be caught through regex
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern, "", new_word)
        new_word = new_word.replace('[', '')
        new_word = new_word.replace(']', '')
        new_word = new_word.replace('-', '')
        new_word = new_word.replace('—', '')
        new_word = new_word.replace('“', '')
        new_word = new_word.replace("’", '')
        new_word = new_word.replace("”", '')
        new_word = new_word.replace("‘", '')
        new_word = new_word.replace('"', '')
        new_word = new_word.replace("'", '')
        new_word = new_word.replace(" ", '')
        new_list_words.append(new_word)
    # Use filter to remove the empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """A method to do basic data cleaning"""
    # Remove punctuation and numbers from the text
    cleaned = remove_punct(corpus)
    # Tokenize the text into individual words
    text_tokenized = tokenization(cleaned.lower())
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    # Perform lemmatization on the text
    text_lemmatized = lemmatizer(text_without_stop)
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    # Join the cleaned words back into a single string
    return " ".join(text_final)
```

How did I create the above pipeline for cleaning text data? The answer to this question would again be: take either of the courses mentioned above.

Moving on, for our very first textual analysis I run the pipeline over the 'description' column.

```{python}
descript_list = []
for descript in dc_jobs['description']:
    descript_list.append(clean_text(descript))
```

Now that the data has been cleaned, I have used the code below to create a wordcloud that can provide you with some information about the descriptions of Data Science jobs in the DMV area.

```{python}
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs.
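As an optional complementary check (a sketch only; the strings below are made-up stand-ins for the cleaned `descript_list`), the same token frequencies that drive word size in the cloud can be ranked explicitly with `collections.Counter`:

```python
from collections import Counter

# Hypothetical cleaned descriptions, standing in for descript_list
descript_list = [
    "data experience machine learning skills",
    "data analytics experience model development",
    "machine learning data skills analytics",
]

# Count token frequencies across all cleaned descriptions
counts = Counter(" ".join(descript_list).split())
print(counts.most_common(3))
```

This is handy when two words render at visually similar sizes and you want the exact counts behind them.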
By analyzing the word cloud, we can identify some important words that stand out:

1. "Data": This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.
2. "Experience": This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.
3. "Machine Learning": This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.
4. "Skills": This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.
5. "Analytics": This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.

Overall, I would advise job seekers to pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics.
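As a small illustration of that advice (the keyword list and resume snippet below are invented for the example, not drawn from the dataset), a job seeker could quickly check which of the recurring word-cloud terms a resume already covers:

```python
# Recurring terms from the word cloud (hypothetical shortlist)
posting_keywords = {"data", "experience", "machine learning", "skills", "analytics"}

# Made-up resume excerpt, lowercased to match case-insensitively
resume_text = """
Graduate student with hands-on experience in data analysis and
machine learning projects; strong Python and visualization skills.
""".lower()

# Which recurring terms does this resume already cover?
covered = {kw for kw in posting_keywords if kw in resume_text}
missing = posting_keywords - covered

print(sorted(covered))  # terms already present in the resume
print(sorted(missing))  # terms worth working into the resume
```

Note that this naive substring check is only a starting point; for instance, "data" would also match inside "database", so a real version would tokenize first.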
Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.

What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let's find out by running the pipeline on the 'responsibilities' column and generating its word cloud.

```{python}
# Removing missing values from responsibilities for text cleaning
dc_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in dc_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of responsibilities into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

Similar to the description word cloud, we see that words such as 'data', 'machine learning', 'design', 'big data', 'project', 'model', 'development', etc. are prevalent. This indicates that when you join a company as a Data Scientist or in any other similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, generate some models, and provide analyses in a fashion similar to what I am doing right now.

My advice here would be to practice as much as you can. Be it coding, maths, statistics, machine learning, or any other data science related concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of doing work. Using GitHub and connecting with your teammates over Zoom or Google Meet to discuss the project's agenda can shape you up for working in a corporate environment.

At last, we have the moment of truth.
Are you capable of doing this job or not? What qualities must one have to be a suitable fit for the employer? Let's check this out.

```{python}
qualif_list = []
for qualif in dc_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of qualifications into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the word cloud
plt.show()
```

As per the word cloud, I can give you certain keywords, which are in turn the qualities and skills that job seekers must have in order to be qualified for a Data Science related job. These are as follows:

1. Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.
2. Work Experience: The inclusion of "Work Experience" emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.
3. Data Science: The prominence of "Data Science" indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.
4. Bachelor's Degree: The presence of "Bachelor Degree" suggests that having a bachelor's degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.
5. Machine Learning and Deep Learning: The inclusion of "Machine Learning" and "Deep Learning" highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.
6. Communication Skills: The mention of "Communication Skill" underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.

Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor's degree (particularly in a related field), and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.

## Visualizing Salaries

Finally!!

I know that ever since the beginning you have been waiting for this. Scrolling and soaking in every tiny bit of information provided above, you have been waiting for the visualizations depicting salaries.
I would say you've deserved it.

Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to make yourself qualified for that job, it's worth knowing about the pay scale.

### Using benefits

Coming to my first visualization, which I have generated using Plotly for the yearly salaries extracted from the benefits of the job.

```{python}
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
```

The plot above shows the various job titles' salary ranges in a visual manner. The position along the x-axis denotes the salary range, and each data point on the plot is associated with a particular job title.
The job titles are displayed on the y-axis, making it simple to compare and identify the salary ranges for various positions.

The salaries extend from $1,500 for a Research Scientist in Machine Learning at OpenAI to $370,000 for the same role at the same company.

For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. Job seekers can better comprehend the potential earnings of various roles by examining the distribution of salary ranges. When evaluating employment opportunities and negotiating compensation packages, this information can be helpful.

Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers, i.e. ranges that are unusually high or low in comparison to others, which may point to variables impacting the salary such as experience level, area of specialty, or geographic location.

In the end, this visualization enables job seekers to make better selections throughout the hiring process. It helps them match their professional goals and expectations with the salary ranges associated with particular job titles, allowing them to focus on options that align with their financial aspirations.

#### Anomaly

I tried to generate the same plot for the hourly wages in the data too.
But it turns out that, their number being very small (four, in particular), it made no sense to generate that plot.

### Using description

```{python}
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
```

Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.

In the plot above, you will also see that some jobs, such as Big Data Cloud Architect at Tephra Inc., have only one salary, i.e. $5,090.
Now, this was provided in the data itself.

The dumbbell plots generated using salary ranges extracted from benefits and descriptions provide a holistic overview of the salaries offered by employers around the DMV area.

# Limitations

It can be said that this dataset isn't perfect after all. I have given my best effort to draw as much meaningful information out of this data as possible, but it certainly has some anomalies.

One can see that the Plotly visuals for salaries extracted from benefits and descriptions may show job titles that are not present in the other plot, or vice versa. If that is the case, it can only mean one thing: the salary was provided either in the benefits or in the description.

# Conclusions

The project's insightful findings have highlighted the DMV region as a prime focus area for aspiring Data Scientists, offering abundant job opportunities and immense potential for career advancement. The concentrated presence of relevant job postings in this region makes it an ideal destination for professionals seeking to thrive in the field.

Furthermore, the analysis has emphasized the critical importance of comprehensive job postings. Companies that provide detailed information on aspects such as salary, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit genuine consideration for prospective candidates. These organizations are sought after by top talent, making them highly desirable employers.

By immersing myself in the exploration of Data Science job opportunities throughout the DMV area, I have gained invaluable knowledge that will serve as a compass for my career decisions and shape my professional trajectory. I sincerely hope that you, too, have derived substantial benefit from this analysis, enabling you to grasp the intricacies and dynamics of the Data Science job market with a profound understanding.