Due to a number of factors, the United States of America (USA) has become a center for employment opportunities in data science. First, the nation has hubs for innovation and cutting-edge technological infrastructure. Regions with high concentrations of technology businesses, startups, and research institutes include Silicon Valley in California, Seattle in Washington, and Boston in Massachusetts. In addition to luring top talent, these areas provide a thriving ecosystem for data science professionals to work on innovative projects and collaborate with one another.
Second, the USA is home to a wide range of industries, from technology and banking to healthcare and retail. Numerous businesses in these industries have invested heavily in data science capabilities because they understand the value of data-driven decision-making. With so many large organizations, including corporate behemoths like Google, Amazon, and Microsoft, data scientists have plenty of chances to work on challenging problems and make important contributions. The USA also has a thriving startup scene, with several new businesses upending numerous industries with ground-breaking data-driven solutions.
Overall, the United States is a desirable location for job seekers looking for data science positions because of its large industry presence, modern technological infrastructure, and innovative culture. The nation is a growing hub for data science workers because it provides a wide range of possibilities, access to cutting-edge initiatives, and the possibility to collaborate with top organizations and subject matter experts.
Let me walk you through this comprehensive report which will help you find your next job.
2 Data
Here is some information about the dataset, which was provided by our very own Georgetown University DSAN department.
This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.
There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs that were active online as of 04/14/2023.
The postings were obtained using the following search query terms:
“data-scientist”,
“data-analyst”,
“neural-networks”,
“big-data-and-cloud-computing”,
“machine-learning”,
“reinforcement-learning”,
“deep-learning”,
“time-series”,
“block-chain”,
“natural-language-processing”
The search was restricted to the USA. The files may contain duplicate job postings.
The search results are stored in multiple JSON files, with each file name representing the search term and each file containing the results of a single search.
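To make the file layout concrete, here is a minimal sketch of the assumed structure of one search-result file. The field names are inferred from how the files are parsed later in this report; the values are made up for illustration.

```python
import json

# Toy payload mimicking one search-result file (e.g. "data-scientist.json"):
# the top-level "jobs_results" key holds a list of postings.
example = {
    "jobs_results": [
        {
            "job_id": "abc123",
            "title": "Data Scientist",
            "company_name": "Acme Corp",
            "location": "Irving, TX",
            "via": "via LinkedIn",
            "description": "We are hiring a data scientist.",
            "job_highlights": [{"title": "Qualifications", "items": ["Python", "SQL"]}],
            "related_links": [{"link": "https://example.com"}],
            "detected_extensions": {"posted_at": "3 days ago", "schedule_type": "Full-time"},
        }
    ]
}

# Round-trip through JSON, as the import step does with each file on disk
data = json.loads(json.dumps(example))
print(len(data["jobs_results"]))  # one posting in this toy file
```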
3 Data preparation
3.1 Importing the libraries
This step needs no explanation. Required packages must always be loaded.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import klib
import plotly.graph_objects as go
import folium
from folium import plugins
from shapely.geometry import Polygon, Point
from wordcloud import WordCloud
import json
import glob
import os
import re
import nltk
from nltk.stem import PorterStemmer
from string import punctuation

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/raghavsharma/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/raghavsharma/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/raghavsharma/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
3.2 Importing the dataset
I created a driver function to import the data in such a manner that both audiences (technical and non-technical) are able to understand it.
The function below will return the list of dataframes for respective job searches.
Code
# Function to create the required dataframes for analysis.
def create_job_df(path):
    """
    Takes as input a directory to construct a list of dataframes from
    and returns that list
    :param path: a path to a directory
    :return: a list of pandas DataFrames
    """
    # Get every file in the folder using glob
    all_files = glob.glob(os.path.join(path, "*.json"))

    # Lists for appending dataframes for every job search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        if 'jobs_results' in data:
            # Create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data cleaning
            # Via: strip the leading "via " prefix
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []
            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)
                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                elif n == 2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                elif n == 1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)
                if 'salary' in extn.keys():
                    salary.append(extn['salary'])
                else:
                    salary.append(np.nan)
                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions',
                             'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location',
                     'description', 'responsibilities', 'qualifications', 'benefits',
                     'salary', 'via', 'posted', 'resources']]

            # Append the dataframe to the list matching its search term
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)

    # Concat the lists to create the merged dataframes
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)
    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)
    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)
    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)
    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)
    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)
    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)
    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)
    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)
    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # Return the list of dataframes, one per job search
    return [data_scientist_df, data_analyst_df, neural_networks_df,
            big_data_and_cloud_computing_df, machine_learning_df,
            reinforcement_learning_df, deep_learning_df, time_series_df,
            block_chain_df, natural_language_processing_df]
Now that you’ve understood the function, let’s see what kind of dataframe we get for the potential analysis.
Code
# Define path
path = '../data/USA/'

# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search, which might later be used for potential analyses
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings across the country
country_jobs = pd.concat(df_list, axis=0, ignore_index=True)
country_jobs.head()
| | job_id | title | company_name | job_type | location | description | responsibilities | qualifications | benefits | salary | via | posted | resources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... | Sr. Data Scientist (NLP) | MCKESSON | Full-time | Texas | Ontada is a leading oncology real-world data a... | [Collaborate with product management, product ... | [5+ years of industry experience in ML and/or ... | [As part of Total Rewards, we are proud to off... | NaN | Jobs At MCKESSON | NaN | http://www.mckesson.com/ |
| 1 | eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... | Sr Dir Data Science & Analytics | Northwestern Mutual | Full-time | Milwaukee, WI | At Northwestern Mutual, we are strong, innovat... | [Provides leadership and direction to analytic... | [Recognized as an expert in the industry and s... | [$143,360.00, $204,800.00] | NaN | Northwestern Mutual Careers | 21 days ago | https://www.google.com/search?q=Northwestern+M... |
| 2 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBTZW5pb3... | Data Scientist Senior | CHRISTUS Health | Full-time | Irving, TX | Summary:\n\nThe Data Scientist Senior is respo... | [The Data Scientist Senior is responsible for ... | [Individual must have extensive knowledge of S... | NaN | NaN | Christus Health Careers | 28 days ago | http://www.christushealth.org/ |
| 3 | eyJqb2JfdGl0bGUiOiJEaXJlY3RvciwgR2xvYmFsIERlbW... | Director, Global Demand Data Scientist Lead | 7Z4 Pfizer, Inc. | Full-time | Anywhere | Why Patients Need You Our manufacturing logist... | NaN | [Why Patients Need You Our manufacturing logis... | NaN | NaN | Workday | 2 days ago | https://www.google.com/search?q=7Z4+Pfizer,+In... |
| 4 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... | Data Scientist | John Deere | Full-time | Anywhere | There are over 7 billion people on this planet... | [Be responsible for working with large amounts... | [5 years experience in programming and data an... | [Additionally, we offer a comprehensive reward... | NaN | Salary.com | NaN | http://www.deere.com/ |
Generating the usa_jobs dataframe for the USA job listings, which will be used for the remaining analyses.
Code
# Define path
path = '../data/USA/'

# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search, which might later be used for potential analyses
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings across the USA
usa_jobs = pd.concat(df_list, axis=0, ignore_index=True)
usa_jobs.head()
3.3 Data cleaning and preprocessing
This is quite an interesting section. You will see how the data was cleaned and munged, and what other techniques were used to preprocess it. This section also involves feature extraction.
We see that some of the columns store their data as lists. I created a function to join these lists to form the full corpus for each such column.
Code
def join_data(data_lst):
    # Check if data_lst is not NaN
    if data_lst is not np.nan:
        # If data_lst is not NaN, join the elements with ". " as the separator
        return ". ".join(data_lst)
    # If data_lst is NaN, return NaN
    return np.nan

usa_jobs['responsibilities'] = usa_jobs['responsibilities'].apply(join_data)
usa_jobs['qualifications'] = usa_jobs['qualifications'].apply(join_data)
usa_jobs['benefits'] = usa_jobs['benefits'].apply(join_data)
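To see what this joining step produces, here is a tiny self-contained rerun of the same logic on toy inputs (the list values are made up):

```python
import numpy as np

def join_data(data_lst):
    # Join list elements with ". "; pass NaN through untouched
    if data_lst is not np.nan:
        return ". ".join(data_lst)
    return np.nan

print(join_data(["Python", "SQL", "5+ years experience"]))
# Python. SQL. 5+ years experience
print(join_data(np.nan))  # nan
```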
Some of the job postings listed their location as ‘Anywhere’. So I did some feature extraction and created a new column (‘remote’) which specifies whether the job allows remote work or not.
Code
# Function to check if the job location is remote
def remote_or_not(location):
    # Check if the location is "anywhere" (case-insensitive and stripped of leading/trailing spaces)
    if location.lower().strip() == 'anywhere':
        return True
    return False

# Apply the remote_or_not function to the 'location' column and create a new 'remote' column
usa_jobs['remote'] = usa_jobs['location'].apply(remote_or_not)
Next, I saw that the ‘location’ column had some absurd values, so this column was cleaned and the respective cities and states were extracted for later analyses.
Code
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, return it as both the city and the state
        return loc_lst[0].strip(), loc_lst[0].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(usa_jobs['location'][i])
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'usa_jobs' DataFrame
usa_jobs['city'] = cities
usa_jobs['state'] = states

# Merge certain states for consistency
usa_jobs['state'] = usa_jobs['state'].replace('Maryland', 'MD')
usa_jobs['state'] = usa_jobs['state'].replace('New York', 'NY')
usa_jobs['state'] = usa_jobs['state'].replace('California', 'CA')
# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
usa_jobs['state'] = usa_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
usa_jobs.drop(columns=['location'], inplace=True)
usa_jobs = usa_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
                     'remote', 'description', 'responsibilities', 'qualifications',
                     'benefits', 'salary', 'via', 'posted', 'resources']]
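A quick sanity check of the splitting logic on the two location shapes seen in the data ("City, State" and a bare single value), as a standalone sketch:

```python
def get_location(location):
    # Split "City, State" into its parts; a single token stands for both
    loc_lst = location.strip().split(',')
    if len(loc_lst) == 2:
        return loc_lst[0].strip(), loc_lst[1].strip()
    return loc_lst[0].strip(), loc_lst[0].strip()

print(get_location("Irving, TX"))  # ('Irving', 'TX')
print(get_location("Texas"))       # ('Texas', 'Texas')
```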
I have dropped the duplicate job postings. This was done carefully, taking into account the job title, company name, and location (city, state), since an employer may have the same job posting at a different location.
Code
# Remove duplicate postings based on title, company name, and location
usa_jobs = usa_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'],
                                    ignore_index=True)
usa_jobs.head()
| | job_id | title | company_name | job_type | city | state | remote | description | responsibilities | qualifications | benefits | salary | via | posted | resources |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... | Sr. Data Scientist (NLP) | MCKESSON | Full-time | Texas | Texas | False | Ontada is a leading oncology real-world data a... | Collaborate with product management, product o... | 5+ years of industry experience in ML and/or d... | As part of Total Rewards, we are proud to offe... | NaN | Jobs At MCKESSON | NaN | http://www.mckesson.com/ |
| 1 | eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... | Sr Dir Data Science & Analytics | Northwestern Mutual | Full-time | Milwaukee | WI | False | At Northwestern Mutual, we are strong, innovat... | Provides leadership and direction to analytics... | Recognized as an expert in the industry and sh... | $143,360.00. $204,800.00 | NaN | Northwestern Mutual Careers | 21 days ago | https://www.google.com/search?q=Northwestern+M... |
| 2 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBTZW5pb3... | Data Scientist Senior | CHRISTUS Health | Full-time | Irving | TX | False | Summary:\n\nThe Data Scientist Senior is respo... | The Data Scientist Senior is responsible for d... | Individual must have extensive knowledge of St... | NaN | NaN | Christus Health Careers | 28 days ago | http://www.christushealth.org/ |
| 3 | eyJqb2JfdGl0bGUiOiJEaXJlY3RvciwgR2xvYmFsIERlbW... | Director, Global Demand Data Scientist Lead | 7Z4 Pfizer, Inc. | Full-time | Anywhere | Anywhere | True | Why Patients Need You Our manufacturing logist... | NaN | Why Patients Need You Our manufacturing logist... | NaN | NaN | Workday | 2 days ago | https://www.google.com/search?q=7Z4+Pfizer,+In... |
| 4 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... | Data Scientist | John Deere | Full-time | Anywhere | Anywhere | True | There are over 7 billion people on this planet... | Be responsible for working with large amounts ... | 5 years experience in programming and data ana... | Additionally, we offer a comprehensive reward ... | NaN | Salary.com | NaN | http://www.deere.com/ |
3.3.1 Missing Data
I always find missing data crucial to any analysis. Searching for missing data is the first and most important stage of data cleaning. Checking for missing values in each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted, as this project entails combining the dataframes.
Hence, before progressing, one should always check for missing data and take appropriate steps to handle it.
Let’s visualize the missing data using the ‘klib’ library so that you can see this trend for each column in the dataset.
The klib library helps us visualize missing-data trends in the dataset. Using its ‘missingval_plot’, we can extract the necessary information about the missing data in every column.
Code
"Missing Value Plot"usa_klib = klib.missingval_plot(usa_jobs, figsize=(10,15))
There are 490 missing salary values. My point is: if such a huge share of the salary data is missing, how should I proceed with research meant to help you make a very important decision about your career in this country?
4 Interesting insight
This usually is not the case, but sometimes an employer may provide salary information either in the description or in the benefits. Hence I decided to troubleshoot and verify whether I could come up with something useful.
It turns out my intuition was right. Accordingly, I have provided two very interesting salary analyses, based on the benefits and the description respectively.
4.1 Salary analysis using benefits of a job provided by the employer
I will use the functions below to provide salary information for each job whose stated benefits can be used to extract a salary range.
Code
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['benefits'])):
    benefit = usa_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, record it along with its job ID
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(usa_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and their corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, record them along with the corresponding job ID
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
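The dollar-amount pattern can be exercised on a made-up benefits string to see what it captures:

```python
import re

# Same pattern as used above: a dollar sign followed by digits, separators,
# and an optional 'k'/'K' suffix
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Hypothetical benefits text for illustration
text = "The pay range for this role is $98,900.00 - $175,000.00 per year."
matches = re.findall(salary_pattern, text)
print(matches)  # ['98,900.00', '175,000.00']
```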
The salary ranges have been extracted from the benefits of some job IDs. Note that these are currently string values. The function below converts each value to a float.
Code
# Function to convert a single value to float
def convert_to_float(value):
    try:
        # Check for values containing a 'k' (thousands) suffix
        flag = False
        if 'k' in value or 'K' in value:
            flag = True
        # Match the leading number, possibly with thousands separators and decimals
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'  # Regular expression pattern
        match = re.search(pattern, value)
        if match:
            value = match.group(1).replace('.', '')  # Remove dots from the matched value
        # Remove any non-digit characters (e.g., commas, hyphens, the 'k' suffix)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the original value ended with 'k'
        if flag:
            return float(value) * 1000
        return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
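As a quick check of the ‘k’-suffix handling, here is the conversion idea rerun on a couple of toy values. This is a simplified standalone copy of the same logic (it truncates any decimal part), not the report’s exact function:

```python
def to_float(value):
    # Simplified conversion: drop the decimal part, strip separators,
    # and honour a trailing 'k'/'K' as "thousands"
    has_k = 'k' in value.lower()
    value = value.split('.')[0]                       # ignore cents
    digits = ''.join(ch for ch in value if ch.isdigit())
    if not digits:
        return None
    return float(digits) * 1000 if has_k else float(digits)

print(to_float("143,360"))     # 143360.0
print(to_float("90k"))         # 90000.0
print(to_float("117,500.00"))  # 117500.0
```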
Our last step is to iterate over the ‘converted_data’ list above and filter our original dataframe.
Code
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, keep the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the usa_jobs DataFrame based on the job IDs with salary information
ben_filtered_df = usa_jobs[usa_jobs['job_id'].isin(ben_job_id_lst)]
Now that we have a new dataframe, can we proceed?
This is where I follow one of the principles of data munging and cleaning: whenever you have made changes to a dataframe and filtered it to create a new one, always run some pre-verification checks. This ensures the data is tidy before you proceed with your study.
After taking a deep dive, I realized the salary provided for each job is either hourly or yearly, but this wasn’t distinguished at the beginning. Hence I thought it would make sense to add another column that describes whether the provided salary is hourly or yearly.
Code
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of rows that already have a 'salary' value
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index,
                            'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index,
                            'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)

# Classify each salary as hourly or yearly based on its magnitude
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the ben_filtered_df DataFrame
ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)
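The hourly-versus-yearly distinction rests on a simple magnitude threshold: amounts at or below 100 are read as hourly rates, anything larger as annual salaries. A minimal standalone sketch of that heuristic:

```python
def salary_status(salary):
    # Heuristic: hourly rates are small numbers, annual salaries are large
    if salary <= 100:
        return 'Hourly'
    return 'Yearly'

print(salary_status(45.0))      # Hourly
print(salary_status(117500.0))  # Yearly
```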
Dropping rows with missing values from our analysis, for the visualization later on.
Code
# Dropping NaN values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
| | job_id | title | company_name | job_type | city | state | remote | description | responsibilities | qualifications | benefits | via | posted | resources | min_salary | max_salary | salary_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... | Sr. Data Scientist (NLP) | MCKESSON | Full-time | Texas | Texas | False | Ontada is a leading oncology real-world data a... | Collaborate with product management, product o... | 5+ years of industry experience in ML and/or d... | As part of Total Rewards, we are proud to offe... | Jobs At MCKESSON | NaN | http://www.mckesson.com/ | 117500.0 | 195800.0 | Yearly |
| 5 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... | Data Scientist | Trustees of University of Pennsylvania | Full-time | Anywhere | Anywhere | True | University Overview The University of Pennsylv... | Posted Job Title Data Scientist Job Profile Ti... | At least 3 years of experience required in a r... | The University offers a competitive benefits p... | Careers@Penn | NaN | https://www.google.com/search?q=Trustees+of+Un... | 61046.0 | 132906.0 | Yearly |
| 12 | eyJqb2JfdGl0bGUiOiJMZWFkIERhdGEgU2NpZW50aXN0Ii... | Lead Data Scientist | SPECTRUM | Full-time | Colorado Springs | CO | False | JOB SUMMARY\nThe goal of our Sales & Competiti... | In this role, the ideal candidate utilizes ana... | Required Skills/Abilities and Knowledge. Abili... | The pay for this position has a salary range o... | Spectrum Careers | 6 days ago | https://www.google.com/search?gl=us&hl=en&q=SP... | 98900.0 | 175000.0 | Yearly |
| 13 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIENyZW... | Data Scientist - Credit Risk | Cottonwood Financial | Full-time | Irving | TX | False | Job Description\nReporting to our Director of ... | Reporting to our Director of Credit Risk Manag... | BS in an analytical field such as Statistics, ... | Starting annual salary of $121,000. Medical, d... | LinkedIn | 3 days ago | https://www.google.com/search?gl=us&hl=en&q=Co... | 121000.0 | 121000.0 | Yearly |
| 16 | eyJqb2JfdGl0bGUiOiJTdGFmZiBEYXRhIFNjaWVudGlzdC... | Staff Data Scientist, Ad Formats & Optimizatio... | Reddit | Full-time | Anywhere | Anywhere | True | Reddit is a community of communities where peo... | You will partner closely with cross functional... | 5+ years of experience in data analytics or re... | Comprehensive Health benefits. 401k Matching. ... | Built In | 4 days ago | https://www.reddit.com/ | 184000.0 | 275000.0 | Yearly |
We have our final dataframe for the benefits-based salary analysis. Let’s proceed to follow the same steps for the other analysis using the ‘description’ column.
4.2 Salary analysis using description of the job provided by the employer
Code
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the usa_jobs DataFrame
for i in range(len(usa_jobs['description'])):
    descpt = usa_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(usa_jobs['job_id'][i])

# If the description contained a keyword, extract the salary from it
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        desc_job_id_lst.append(desc_job_id[i])

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, keep the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the usa_jobs DataFrame based on the job IDs with salary information
desc_filtered_df = usa_jobs[usa_jobs['job_id'].isin(desc_job_id_lst)]

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the corrected data
for sal_lst in desc_correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index,
                             'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index,
                             'max_salary'] = max_salary

# Drop the redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city',
                                     'state', 'remote', 'description', 'responsibilities',
                                     'qualifications', 'benefits', 'min_salary', 'max_salary',
                                     'salary_status', 'via', 'posted', 'resources']]
desc_filtered_df.head()
|   | job_id | title | company_name | job_type | city | state | remote | description | responsibilities | qualifications | benefits | min_salary | max_salary | salary_status | via | posted | resources |
|---|--------|-------|--------------|----------|------|-------|--------|-------------|------------------|----------------|----------|------------|------------|---------------|-----|--------|-----------|
| 0 | eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... | Sr. Data Scientist (NLP) | MCKESSON | Full-time | Texas | Texas | False | Ontada is a leading oncology real-world data a... | Collaborate with product management, product o... | 5+ years of industry experience in ML and/or d... | As part of Total Rewards, we are proud to offe... | 117500.0 | 195800.0 | Yearly | Jobs At MCKESSON | NaN | http://www.mckesson.com/ |
| 1 | eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... | Sr Dir Data Science & Analytics | Northwestern Mutual | Full-time | Milwaukee | WI | False | At Northwestern Mutual, we are strong, innovat... | Provides leadership and direction to analytics... | Recognized as an expert in the industry and sh... | $143,360.00. $204,800.00 | 143360.0 | 204800.0 | Yearly | Northwestern Mutual Careers | 21 days ago | https://www.google.com/search?q=Northwestern+M... |
| 4 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... | Data Scientist | John Deere | Full-time | Anywhere | Anywhere | True | There are over 7 billion people on this planet... | Be responsible for working with large amounts ... | 5 years experience in programming and data ana... | Additionally, we offer a comprehensive reward ... | 61046.0 | 132906.0 | Yearly | Salary.com | NaN | http://www.deere.com/ |
| 6 | eyJqb2JfdGl0bGUiOiJQcmUgU2FsZXMgRGF0YSBTY2llbn... | Pre Sales Data Scientist | Explorium | Full-time | United States | Anywhere | False | Description\n\nPre Sales Data Scientist...\n\n... | In this consultative role, you’ll be relied on... | Proficiency in Python, SQL, and/or R. Ability ... | NaN | 81500.0 | 142600.0 | Yearly | Comeet | 4 days ago | https://www.google.com/search?q=Explorium&sa=X... |
| 8 | eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... | Data Scientist | Mars | Full-time | Chicago | IL | False | [Insert short summary of role – approximately ... | An industry competitive salary and benefits pa... | [Insert list of top 4 key responsibilities for... | NaN | 98900.0 | 175300.0 | Yearly | Careers At Mars - Mars, Incorporated | NaN | http://www.mars.com/ |
Like benefits, the description column has also been used to create a separate dataframe, which I will use to visualize salary information so that you can gain some interesting insights.
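The extraction above hinges on the `salary_pattern` regex. A small sketch shows what that pattern actually captures; the two description snippets here are invented examples, not rows from the dataset:

```python
import re

# Same pattern as in the code above: a dollar sign followed by digits and
# separators, with an optional trailing k/K shorthand
salary_pattern = r"\$([\d,.-]+[kK]?)"

yearly = re.findall(salary_pattern, "The salary range for this role is $117,500 - $195,800 per year.")
shorthand = re.findall(salary_pattern, "Compensation: $81.5k-$142.6k plus equity.")
print(yearly)     # ['117,500', '195,800']
print(shorthand)  # ['81.5k', '142.6k']
```

The captured strings still carry commas, periods, and the `k` suffix, which is why a `convert_to_float` pass is needed before the values can be compared.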
5 Data Visualization
5.1 Geospatial
My first geospatial plot for jobs comes with the plotly Choropleth module.
Code
```python
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode='USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs across USA</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()
```
The choropleth map graphically shows the number of jobs across the USA. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs, giving a visual representation of how jobs are distributed across the country. Hovering over a state reveals further details: the name of the state and its total number of job postings.
The map’s title, “Number of Jobs across USA,” gives the information being displayed a clear context.
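The `total_count_jobs` frame fed into the choropleth is assumed to hold one row per state with a job count; a frame of that shape can be built from the jobs table with a simple groupby. The four postings below are invented toy data:

```python
import pandas as pd

# Toy stand-in for the jobs table: one row per posting, keyed by state
toy_jobs = pd.DataFrame({'state': ['TX', 'TX', 'WI', 'IL']})

# Count postings per state; reset_index turns the counts into a column
total_count_jobs = toy_jobs.groupby('state').size().reset_index(name='total_count')
print(total_count_jobs)
```

`groupby(...).size()` counts every row, including ones with NaN in other columns, which is why it is a good fit for counting postings rather than `count()` on a specific column.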
For my next chart, I used the very famous folium library to create another interactive visualization.
Code
```python
# CREATE DATA
data = usa_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}              # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
usa_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS
plugins.MarkerCluster(latlong_list).add_to(usa_job_map)

# SHOW MAP
usa_job_map
```
This is an interactive map demonstrating how jobs are distributed across the USA. By visually portraying the job locations, it provides insightful information on the geographic distribution of employment prospects across the country. The map’s markers highlight the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. For job seekers, one of the main benefits of this information is the ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation.
Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique groups neighboring job locations into clusters, each of which is represented by a single marker. This makes it simple for visitors to pinpoint areas with many employment prospects, and job seekers can zoom in on these clusters to learn more about individual regions and their labor markets. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in the USA and ultimately assisting in decisions related to job search and recruitment efforts.
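The bounding boxes used for the filter are axis-aligned rectangles, so the same containment check can be sketched with plain coordinate comparisons; for interior points this matches shapely's `within()`. The boxes below are copied from the map code, while the two test coordinates are my own:

```python
# Plain-comparison version of the bounding-box filter used above
def in_box(lat, long, box):
    # A point is inside an axis-aligned box when both coordinates fall
    # between the box's min and max values
    return (box['min_lat'] <= lat <= box['max_lat']
            and box['min_long'] <= long <= box['max_long'])

us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0},             # Alaska
]

dc_in_usa = any(in_box(38.9, -77.0, b) for b in us_bounding_boxes)   # Washington, DC
london_in_usa = any(in_box(51.5, -0.1, b) for b in us_bounding_boxes)  # London
print(dc_in_usa, london_in_usa)  # True False
```

The shapely version in the map code is more general (it would also work for non-rectangular regions), at the cost of constructing a `Polygon` for every point checked.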
I hope that you now have a clear idea about the number of jobs around the country. Since you have made it this far, I am also assuming that you would be interested in knowing more about these jobs.
Don’t worry, I have got you covered. Let me walk you through it step by step so that you are mentally prepared to make your crucial decision.
5.2 Textual Analyses
The dataset provided certainly revolves around text data, so I decided to apply the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend taking these courses too, as they have proven to be very beneficial.
To handle the text data, I created some functions that run in sequence, as if they were stages in a pipeline.
Code
```python
def remove_punct(text):
    """ A method to remove punctuations from text """
    text = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text)  # removes numbers from text
    return text

def remove_stopwords(text):
    """ A method to remove all the stopwords """
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """ A method to tokenize text data """
    text = re.split(r'\W+', text)  # splitting each sentence into its individual words
    return text

def stemming(text):
    """ A method to perform stemming on text data """
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    """ A method to perform lemmatization on text data """
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# Making a common cleaning function for every part below for code reproducibility
def clean_words(list_words):
    # Regex pattern matching the characters we would like to remove from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/"
    pattern = "[" + character_replace + "]"
    new_list_words = []
    # Loop through every word to remove the characters and append to a new list;
    # str.replace is used for the characters that could not be caught through regex
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern, "", new_word)
        for char in ['[', ']', '-', '—', '“', '’', '”', '‘', '"', "'", " "]:
            new_word = new_word.replace(char, '')
        new_list_words.append(new_word)
    # Use filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """ A method to do basic data cleaning """
    # Remove punctuation and numbers from the text
    cleaned = remove_punct(corpus)
    # Tokenize the text into individual words
    text_tokenized = tokenization(cleaned.lower())
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    # Perform lemmatization on the text
    text_lemmatized = lemmatizer(text_without_stop)
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    # Join the cleaned words back into a single string
    return " ".join(text_final)
```
How did I create the above pipeline for cleaning text data? The answer, again, is taking either of the courses mentioned above.
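The order of those stages matters. A miniature, NLTK-free version makes the flow easy to trace; the stopword set here is hand-picked for the example rather than NLTK's, and the sentence is invented:

```python
import re
from string import punctuation

text = "We're hiring 3 Data Scientists with NLP experience!"

# 1. Strip punctuation characters, then digits
no_punct = re.sub('[0-9]+', '', ''.join(c for c in text if c not in punctuation))

# 2. Lowercase and tokenize on non-word character runs, dropping empties
tokens = [t for t in re.split(r'\W+', no_punct.lower()) if t]

# 3. Remove stopwords and rejoin into a single cleaned string
stopwords = {'we', 'were', 'with', 'a', 'the'}
cleaned = ' '.join(t for t in tokens if t not in stopwords)
print(cleaned)  # hiring data scientists nlp experience
```

Note that because punctuation is stripped before tokenizing, "We're" collapses to "were" and is then caught by the stopword filter; the full pipeline above behaves the same way.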
Moving on, for our very first textual analysis, I will run the pipeline on the ‘description’ column.
Code
```python
descript_list = []
for descript in usa_jobs['description']:
    descript_list.append(clean_text(descript))
```
Now that the data has been cleaned, I have used the code below to create a word cloud that can provide you with some information about the descriptions of data science jobs.
Code
```python
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:
“Data”: This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.
“Experience”: This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.
“Machine Learning”: This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.
“Skills”: This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.
“Analytics”: This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.
Overall, I would advise job seekers to pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics. Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.
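A word cloud is just scaled word frequencies, so the same ranking can be inspected as plain numbers with `collections.Counter`. The two snippets below are made-up stand-ins for the cleaned descriptions held in `descript_list`:

```python
from collections import Counter

# Invented examples of already-cleaned description strings
cleaned_descriptions = [
    "data scientist experience machine learning data",
    "experience python data analytics skills",
]

# Count every token across all descriptions and show the top entries
freq = Counter(' '.join(cleaned_descriptions).split())
print(freq.most_common(2))  # [('data', 3), ('experience', 2)]
```

Looking at the raw counts alongside the cloud is a useful sanity check, since a cloud's layout can visually exaggerate or hide words of similar frequency.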
What are the responsibilities of a Data Scientist, a Machine Learning Engineer, or a Data Analyst? Let’s find out by running the pipeline on the ‘responsibilities’ column and generating its word cloud.
Code
```python
# Removing missing values from responsibilities for text cleaning
usa_jobs.dropna(subset=['responsibilities'], inplace=True)
response_list = []
for response in usa_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of responsibilities into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
Similar to the description word cloud, we see that words such as ‘data’, ‘machine learning’, ‘design’, ‘big data’, ‘project’, ‘model’, and ‘development’ are prevalent.
This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, generate models, and provide an analysis much like the one I am presenting right now.
My advice here would be to practice as much as you can. Be it coding, math, statistics, machine learning, or any other data science concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working: using GitHub and connecting with your teammates over Zoom or Google Meet to discuss the project agenda can shape you up for working in a corporate environment.
At last, we have the moment of truth: are you capable of doing this job? What qualities must one have to be a suitable fit for the employer?
Let’s check this out.
Code
```python
qualif_list = []
for qualif in usa_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of qualifications into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the word cloud
plt.show()
```
Based on the word cloud, I can point out certain keywords, which are in essence the qualities and skills that job seekers must have in order to be qualified for a data science job. These are as follows:
Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.
Work Experience: The inclusion of “Work Experience” emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.
Data Science: The prominence of “Data Science” indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.
Bachelor Degree: The presence of “Bachelor Degree” suggests that having a bachelor’s degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.
Machine Learning and Deep Learning: The inclusion of “Machine Learning” and “Deep Learning” highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.
Communication Skills: The mention of “Communication Skill” underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.
Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor’s degree, particularly in a related field, and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.
5.3 Visualizing Salaries
Finally!!
I know that ever since the beginning you have been waiting for this. Scrolling and soaking in every tiny bit of information provided above, you have been waiting for the visualizations depicting salaries. I would say you’ve earned it.
Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to qualify for that job, it’s worth knowing about the pay scale.
5.3.1 Using benefits
My first visualization, generated using plotly, shows the yearly salaries extracted from the benefits sections of the job postings.
Code
```python
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
```
The plot above visually shows the salary ranges for the various job titles. Position along the x-axis denotes the salary range, and each data point on the plot is associated with a particular job title. The job titles are displayed on the y-axis, making it simple to compare and identify the salary ranges of different positions.
For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. Job searchers can better comprehend the possible earning potential for various roles by examining the distribution of salary ranges. When evaluating employment opportunities and negotiating compensation packages, this information might be helpful.
Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers or ranges that are unusually high or low in comparison to others, which may point to variables impacting the wage such as experience level, area of speciality, or geographic location.
In the end, this visualization enables job seekers to make better selections throughout the hiring process. It enables individuals to focus on options that coincide with their financial aspirations by assisting them in matching their professional goals and expectations with the wage ranges associated with particular job titles.
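Beyond eyeballing the dumbbells, range midpoints and spreads can be compared numerically. The toy frame below stands in for the filtered dataframe behind the plot; the min/max figures are taken from the salary table shown earlier in this report:

```python
import pandas as pd

# Toy stand-in for the filtered salary dataframe
df = pd.DataFrame({
    'title': ['Sr. Data Scientist (NLP)', 'Data Scientist', 'Pre Sales Data Scientist'],
    'min_salary': [117500.0, 61046.0, 81500.0],
    'max_salary': [195800.0, 132906.0, 142600.0],
})

# Midpoint summarizes a range in one number; spread shows how wide it is
df['midpoint'] = (df['min_salary'] + df['max_salary']) / 2
df['spread'] = df['max_salary'] - df['min_salary']
print(df.sort_values('midpoint', ascending=False)[['title', 'midpoint', 'spread']])
```

Sorting by midpoint gives a quick ranking of which roles pay most on average, while a large spread flags ranges where negotiation or seniority matters most.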
5.3.1.1 Anomaly
I tried to generate the same plot for the hourly wages in the data too, but since there were so few of them (only 4), it made no sense to generate that plot.
5.3.2 Using description
Code
```python
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
```
Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.
The dumbbell plots generated using salary ranges extracted from benefits and descriptions provide a holistic overview of the salaries offered by employers.
6 Limitations
It must be said that this dataset isn’t perfect. I have given my best effort to extract as much meaningful information from this data as possible, but it certainly has some anomalies.
One can see that the plotly visuals for salaries extracted from benefits and descriptions may show job titles that are not present in the other plot, or vice-versa. If that is the case, it can only mean one thing: the salary was provided in either the benefits or the description, but not both.
7 Conclusions
Based on the insightful findings of this project, it has become evident that aspiring Data Scientists can significantly enhance their future career prospects by focusing on job opportunities in the DMV, California, Texas, and Illinois areas. These regions boast a higher concentration of relevant job postings, presenting a wealth of potential for professional growth and advancement.
Moreover, the analysis has shed light on the paramount importance of comprehensive job postings. Companies that provide detailed information regarding salary descriptions, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit consideration for potential candidates. Such companies are more likely to attract top talent and are therefore highly desirable employment options.
By delving deep into the employment landscape of Data Science jobs across the USA, this project has armed me with invaluable knowledge that will guide my decision-making and shape my future career trajectory. I sincerely hope that you, too, have derived considerable benefit from this analysis, gaining a profound understanding of the intricacies and dynamics of the Data Science job market.
Source Code
---title: "JobHuntUSA: Navigating Data Science careers through Data Visualization"---# IntroductionDue to a number of variables, the United States of America (USA) has become a center for employment possibilities in data science. First of all, the nation has hubs for innovation and cutting-edge technological infrastructure. Cities with high concentrations of technological businesses, startups, and research institutes include Silicon Valley in California, Seattle in Washington, and Boston in Massachusetts. In addition to luring top talent, these areas provide a thriving ecosystem for data science professionals to work on innovative projects and collaborate with one another.Second, the USA is well-represented in a wide range of industries, from technology and banking to healthcare and retail. Numerous businesses in these industries have extensively invested in data science capabilities because they understand the value of data-driven decision-making. With so many huge organizations, including corporate behemoths like Google, Amazon, and Microsoft, data scientists have plenty of chances to work on challenging challenges and make important contributions. The USA also has a thriving startup scene, with several new businesses upending numerous industries with ground-breaking data-driven solutions.Overall, the United States is a desirable location for job seekers looking for data science positions because of its large industry presence, modern technological infrastructure, and innovative culture. The nation is a growing hub for data science workers because it provides a wide range of possibilities, access to cutting-edge initiatives, and the possibility to collaborate with top organizations and subject matter experts.Let me walk you through this comprehensive report which will help you find your next job.# DataSome information about the dataset that was provided by our very own Georgetown University DSAN department.1. 
This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.2. There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs, which are currently active online, as of 04/14/2023 .3. The postings were obtained using the following search query terms:- "data-scientist",- "data-analyst",- "neural-networks",- "big-data-and-cloud-computing",- "machine-learning",- "reinforcement-learning",- "deep-learning",- "time-series",- "block-chain",- "natural-language-processing"4. The search for this data is for USA. The files may contain duplicate job postings.5. The search results are stored in multiple JSON files, with the file name representing the search term. with each file containing the results of a single search# Data preparation## Importing the librariesThis step needs no explanation. Required packages must always be loaded.```{python}import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport klibimport plotly.graph_objects as goimport foliumfrom folium import pluginsfrom shapely.geometry import Polygon, Pointfrom wordcloud import WordCloudimport jsonimport globimport osimport reimport nltkfrom nltk.stem import PorterStemmerfrom string import punctuationnltk.download('stopwords')nltk.download('wordnet')nltk.download('omw-1.4')import warningswarnings.filterwarnings('ignore')```## Importing the datasetI created a driver function to import the data in such a manner that both the audiences (technical and non-technical) are able to understand it.The function below will return the list of dataframes for respective job searches.```{python}# Function to create the required dataframe for analysis.def create_job_df(path): """ Takes as input directory to construct a list of dataframes from and returns that list :param path: a Path to a directory :return: a list of pandas DataFrames """# Get every file in the folder using glob all_files = 
glob.glob(os.path.join(path, "*.json"))

    # Lists for appending dataframes for every job search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        if 'jobs_results' in data:
            # Create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data cleaning: strip the leading "via " prefix from the 'via' column
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []
            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)
                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                elif n == 2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                elif n == 1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)
                if 'salary' in extn.keys():
                    salary.append(extn['salary'])
                else:
                    salary.append(np.nan)
                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions', 'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location', 'description',
                     'responsibilities', 'qualifications', 'benefits', 'salary', 'via', 'posted', 'resources']]

            # Search queries used during the crawl (for reference)
            search_query = ["data-scientist", "data-analyst", "neural-networks",
                            "big-data-and-cloud-computing", "machine-learning",
                            "reinforcement-learning", "deep-learning",
                            "time-series", "block-chain", "natural-language-processing"]

            # Append the dataframe to the list matching the search query in the filename
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)

    # Concat the lists to create the merged dataframes
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)
    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)
    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)
    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)
    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)
    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)
    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)
    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)
    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)
    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # Return the list of dataframes, one per job search
    return [data_scientist_df, data_analyst_df, neural_networks_df, big_data_and_cloud_computing_df,
            machine_learning_df, reinforcement_learning_df, deep_learning_df, time_series_df,
            block_chain_df, natural_language_processing_df]
```

Now that you've understood the function, let's see what kind of dataframe we get for the potential analysis.

```{python}
# Define path
path = '../data/USA/'

# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search, which might be used later for further analyses
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings across the country
country_jobs = pd.concat(df_list, axis=0, ignore_index=True)
country_jobs.head()
```

Generating a separate dataframe for DC job listings, which will be merged with the overall country job listings.

```{python}
# Define path
path = '../data/USA/'

# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search, which might be used later for further analyses
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings around DC
usa_jobs = pd.concat(df_list, axis=0, ignore_index=True)
usa_jobs.head()
```

Merging the two dataframes created above.

```{python}
usa_jobs = pd.concat([country_jobs, usa_jobs], ignore_index=True)
usa_jobs.head()
```

## Data wrangling, munging and cleaning

This is quite an interesting section. You will witness how the data was cleaned and munged, and what other techniques were used to preprocess it. This section also involves feature extraction.

Some of the columns hold categorical data as lists. I created a function to join these lists to form the full corpus for the specific column.

```{python}
def join_data(data_lst):
    # Check if data_lst is not NaN
    if data_lst is not np.nan:
        # If data_lst is not NaN, join the elements with ". " as the separator
        return ". ".join(data_lst)
    # If data_lst is NaN, return NaN
    return np.nan

usa_jobs['responsibilities'] = usa_jobs['responsibilities'].apply(join_data)
usa_jobs['qualifications'] = usa_jobs['qualifications'].apply(join_data)
usa_jobs['benefits'] = usa_jobs['benefits'].apply(join_data)
```

Some of the job postings listed their location as 'Anywhere'. So I did some feature extraction and created a new column ('remote') which specifies whether the job allows remote work or not.

```{python}
# Function to check if the job location is remote
def remote_or_not(location):
    # A location of "anywhere" (case-insensitive, stripped of surrounding spaces) marks a remote job
    return location.lower().strip() == 'anywhere'

# Apply the remote_or_not function to the 'location' column and create a new 'remote' column
usa_jobs['remote'] = usa_jobs['location'].apply(remote_or_not)
```

Next, I saw that the 'location' column had some absurd values.
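As a quick sanity check, the two helpers above can be exercised on toy values. This is a minimal, self-contained sketch (the helpers are redefined here, with an `isinstance` check standing in for the `is not np.nan` comparison):

```python
import numpy as np

def join_data(data_lst):
    # Join list elements with ". "; pass NaN (or any non-list) through untouched
    if isinstance(data_lst, list):
        return ". ".join(data_lst)
    return np.nan

def remote_or_not(location):
    # A location of "Anywhere" marks a remote-friendly posting
    return location.lower().strip() == 'anywhere'

print(join_data(["Python", "SQL"]))      # -> Python. SQL
print(remote_or_not("  Anywhere "))      # -> True
print(remote_or_not("Washington, DC"))   # -> False
```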
So this column was cleaned, and the respective cities and states were extracted for later analyses.

```{python}
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces and split the location string on commas
    location = location.strip()
    loc_lst = location.split(',')
    n = len(loc_lst)
    if n == 2:
        # Two elements: return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # One element: use the same value for both city and state
        return loc_lst[0].strip(), loc_lst[0].strip()
    # More than two elements: fall back to the full string for both
    return location, location

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(usa_jobs['location'][i])
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value contains United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'usa_jobs' DataFrame
usa_jobs['city'] = cities
usa_jobs['state'] = states

# Merge certain states for consistency
usa_jobs['state'] = usa_jobs['state'].replace('Maryland', 'MD')
usa_jobs['state'] = usa_jobs['state'].replace('New York', 'NY')
usa_jobs['state'] = usa_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
usa_jobs['state'] = usa_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
usa_jobs.drop(columns=['location'], inplace=True)
usa_jobs = usa_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state', 'remote',
                     'description', 'responsibilities', 'qualifications', 'benefits',
                     'salary', 'via', 'posted', 'resources']]
```

Next, I dropped the duplicate job postings. This was done carefully, taking into account the job title, company name, and location (city, state), since an employer may post the same job at a different location.

```{python}
# Remove duplicates based on title, company name, and location
usa_jobs = usa_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
usa_jobs.head()
```

### Missing Data

I always find missing data crucial to any analysis. Searching for missing data is the first and most important stage of data cleaning. Checking for missing values in each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted, since this project entails combining the dataframes. Hence, before progressing, one should always check for missing data and take appropriate steps to handle it.

Let's visualize the missing data using the 'klib' library so that you can see this trend for each column in the dataset. Using its 'missingval_plot', we can extract the necessary information about the missing data in every column. <br><br>

```{python}
# Missing value plot
usa_klib = klib.missingval_plot(usa_jobs, figsize=(10, 15))
```

The missing-value plot shows that 490 salary values are missing.
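In case klib is unavailable, the same missing-value counts can be read off with plain pandas. This is a sketch on a toy frame whose column names merely mirror the real dataset:

```python
import numpy as np
import pandas as pd

# A tiny stand-in frame; the real usa_jobs has 490 missing salaries
toy = pd.DataFrame({
    'title':  ['Data Scientist', 'Data Analyst', 'ML Engineer'],
    'salary': [np.nan, '60K–70K a year', np.nan],
})

# Count and rank missing values per column, as missingval_plot does visually
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```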
If such a huge share of the salary data is missing, how should I proceed with research meant to help you make a very important decision about your career in this country?

# Interesting Insight

This usually is not the case, but sometimes an employer provides salary information either in the description or in the benefits. Hence I decided to troubleshoot and verify whether I could come up with something useful. It turns out my intuition was right, and accordingly I have prepared two interesting salary analyses, based on benefits and description respectively.

## Salary analysis using the benefits of a job provided by the employer

I will be using the functions below to extract a salary range for each job whose listed benefits contain salary information.

```{python}
# Check if a benefit mentions the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    return 'salary' in ben or 'range' in ben or 'pay' in ben

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['benefits'])):
    benefit = usa_jobs['benefits'][i]
    # Skip NaN benefits
    if benefit is not np.nan:
        # If the benefit contains the keywords, record it along with its job ID
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(usa_jobs['job_id'][i])

# Regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and the corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
```

The salary ranges have been extracted from the benefits of some job IDs. Note that these are currently strings. The function below converts each value to a float.

```{python}
# Function to convert a single salary string to a float
def convert_to_float(value):
    try:
        # Remember whether the value is expressed in thousands ('k' or 'K')
        flag = 'k' in value or 'K' in value
        # Match the leading number, allowing thousands separators and a decimal part
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'
        match = re.search(pattern, value)
        if match:
            value = match.group(1)
        # Remove any remaining non-digit characters (commas, hyphens, the 'k' itself)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1,000 if the original value ended with 'k'
        if flag:
            return float(value) * 1000
        return float(value)
    except ValueError:
        return None

# Convert every extracted value to a float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
```

The last step is to iterate over the 'converted_data' list above and filter our original dataframe.

```{python}
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value below 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, keep the minimum and maximum
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the usa_jobs DataFrame down to the job IDs with salary information
# (.copy() avoids a SettingWithCopyWarning when adding columns below)
ben_filtered_df = usa_jobs[usa_jobs['job_id'].isin(ben_job_id_lst)].copy()
```

Now that we have a new dataframe, can we proceed right away? This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it to create a new one, always run some pre-verification checks. This ensures the data is tidy before you proceed with your study. After taking a deep dive, I split each extracted range into minimum and maximum salaries, and cross-checked them against the original 'salary' column wherever it was present.

```{python}
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the corrected salary ranges
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the values and job IDs of the rows where the original 'salary' column is present
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Regex pattern to extract salary ranges (note the en dash between the two numbers)
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df['job_id'] == job_ids[i], 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df['job_id'] == job_ids[i], 'max_salary'] = max_salary

# Drop the now redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)
```

Another insight I had: the salary provided for each job is either hourly or yearly, but this wasn't distinguished at first. Hence I added a column that describes whether the provided salary is hourly or yearly.

```{python}
# Classify a salary as hourly or yearly based on its magnitude
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)
```

Dropping any remaining missing data before the visualizations later on.

```{python}
# Drop rows with missing salary values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
```

We now have our final dataframe for the salary analysis of a job using the benefits provided. Let's follow the same steps for the 'description' column.

## Salary analysis using the description of the job provided by the employer

```{python}
# Check if a description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    return 'salary' in descpt or 'range' in descpt or 'pay' in descpt

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the usa_jobs DataFrame
for i in range(len(usa_jobs['description'])):
    descpt = usa_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(usa_jobs['job_id'][i])

# If the description contained a keyword, extract the salary from it
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        desc_job_id_lst.append(desc_job_id[i])  # use the collected job IDs, not usa_jobs indices

# Convert each extracted value to a float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value below 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, keep the minimum and maximum
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the usa_jobs DataFrame down to the job IDs with salary information
desc_filtered_df = usa_jobs[usa_jobs['job_id'].isin(desc_job_id_lst)].copy()

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the corrected salary ranges
for sal_lst in desc_correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Cross-check against the original 'salary' column where present
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df['job_id'] == job_ids[i], 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df['job_id'] == job_ids[i], 'max_salary'] = max_salary

# Drop the redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Determine the salary status based on min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
                                     'remote', 'description', 'responsibilities', 'qualifications',
                                     'benefits', 'min_salary', 'max_salary', 'salary_status',
                                     'via', 'posted', 'resources']]
desc_filtered_df.head()
```

Like benefits, the description has also been used to create a separate dataframe that I will use to visualize salary information, so that you can gain interesting insights.

# Data Visualization

## Geospatial

```{python}
#| echo: false
#| warning: false

# Create a new column 'Address' by combining the 'city' and 'state' columns
usa_jobs['Address'] = usa_jobs['city'] + ', ' + usa_jobs['state']

# Read the address coordinates from the 'address_coords.csv' file
address_df = pd.read_csv("../data/address_coords.csv")

# Merge 'usa_jobs' with 'address_df' on the 'Address' column to add the coordinates
usa_jobs_final = pd.merge(usa_jobs, address_df, on="Address", how="left")

# Read the 'uscities.csv' file containing state IDs and state names
uscities_df = pd.read_csv("../data/uscities.csv")

# Extract the state IDs and names and drop duplicate rows
state_name = uscities_df[["state_id", "state_name"]].drop_duplicates().reset_index(drop=True)
state_name.columns = ["state", "state_name"]

# Count the number of job sightings in each state
total_count_jobs = usa_jobs_final.groupby("state")['job_id'].count().reset_index()
total_count_jobs.columns = ["state", "total_count"]

# Merge the state names with the total job counts
total_count_jobs = pd.merge(total_count_jobs, state_name, on="state", how="left")
```

My first geospatial plot for jobs comes from plotly's Choropleth module.

```{python}
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode='USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs across USA</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()
```

The choropleth map graphically shows the number of jobs across the USA. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs, so the map gives a visual representation of the distribution of jobs in the country. Hovering over a state reveals further details: the name of the state and its overall number of jobs. The map's title, "Number of Jobs across USA," gives the displayed information a clear context.

For my next chart, I used the well-known folium library to create another interactive visualization.

```{python}
# CREATE DATA
data = usa_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}              # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check the remaining bounding boxes

# INITIALIZE MAP
usa_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS
plugins.MarkerCluster(latlong_list).add_to(usa_job_map)

# SHOW MAP
usa_job_map
```

This interactive map demonstrates how jobs are distributed across the USA. It provides insightful information on the geographic distribution of employment prospects across the country by visually portraying the job locations. The map's markers emphasize the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. The main benefit for job seekers is the ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation.

Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique groups neighboring job locations into clusters, each symbolized by a single marker. This makes it simple to pinpoint areas with many employment prospects, and job seekers can zoom in on these clusters to learn more about individual regions and the regional labor market.
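For readers without shapely installed, the bounding-box membership test above reduces to plain coordinate comparisons. Here is a dependency-free sketch of the same filtering logic, with the boxes copied from the chunk above:

```python
# Bounding boxes for the contiguous U.S. and Alaska, as in the folium chunk
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0},             # Alaska
]

def in_usa(lat, long):
    # True if the point falls inside any of the bounding boxes
    return any(b['min_lat'] <= lat <= b['max_lat'] and
               b['min_long'] <= long <= b['max_long']
               for b in us_bounding_boxes)

print(in_usa(38.9, -77.0))   # Washington, DC -> True
print(in_usa(61.2, -149.9))  # Anchorage, AK -> True
print(in_usa(51.5, -0.1))    # London, UK -> False
```

For axis-aligned rectangles this is equivalent to the `point.within(box)` check, just without constructing shapely geometry on every iteration.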
As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in the USA, ultimately assisting decision-making in job searches and recruitment efforts.

I hope you now have a clear idea of the number of jobs around the country. Since you have reached this far, I am also assuming that you would be interested in knowing more about these jobs. Don't worry, I have got you covered. Let me walk you through step by step so that you are prepared to make your crucial decision.

## Textual Analyses

The dataset provided certainly revolves around text data. So I decided to use the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend you take these courses too, as they have proven to be very beneficial.

Coming to handling the text data, I created some functions that run in sequence, as if in a pipeline.

```{python}
def remove_punct(text):
    """A method to remove punctuation from text"""
    text = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text)  # removes numbers from text
    return text

def remove_stopwords(text):
    """A method to remove all the stopwords"""
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """A method to tokenize text data"""
    text = re.split(r'\W+', text)  # splitting each sentence into its individual words
    return text

def stemming(text):
    """A method to perform stemming on text data"""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    """A method to perform lemmatization on text data"""
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# A common cleaning function used in every part below, for code reproducibility
def clean_words(list_words):
    # Regex pattern matching the characters we would like to remove from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/"
    pattern = "[" + character_replace + "]"
    new_list_words = []
    # Loop through every word to remove the characters, appending the result to a new list.
    # replace() handles the characters that could not be caught through regex.
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern, "", new_word)
        for char in ('[', ']', '-', '—', '“', '’', '”', '‘', '"', "'", " "):
            new_word = new_word.replace(char, '')
        new_list_words.append(new_word)
    # Use filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """A method to do basic data cleaning"""
    # Remove punctuation and numbers from the text
    cleaned = remove_punct(corpus)
    # Tokenize the text into individual words
    text_tokenized = tokenization(cleaned.lower())
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    # Perform lemmatization on the text (stemming is available above, but lemmatization is used here)
    text_lemmatized = lemmatizer(text_without_stop)
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    # Join the cleaned words back into a single string
    return " ".join(text_final)
```

How did I create the above pipeline for cleaning text data?
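To make that question concrete, here is roughly what the pipeline does to a sample string. This is a self-contained sketch, not the pipeline itself: it uses a tiny hand-rolled stopword list instead of NLTK's, and skips the lemmatization step, so its output will differ slightly from the real `clean_text`:

```python
import re
from string import punctuation

# A tiny stand-in stopword list so the sketch runs without downloading NLTK data
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'with', 'for', 'to', 'is'}

def mini_clean(text):
    # Strip punctuation and digits, as remove_punct does
    text = ''.join(ch for ch in text if ch not in punctuation)
    text = re.sub(r'[0-9]+', '', text)
    # Lowercase and tokenize on non-word characters, as tokenization does
    tokens = re.split(r'\W+', text.lower())
    # Drop stopwords and empty strings
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return ' '.join(tokens)

print(mini_clean("Experience with Python, SQL and 3+ years of ML!"))
# -> experience python sql years ml
```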
The answer, again, lies in taking either of the courses mentioned above.

Moving on, for our very first textual analysis, I run the pipeline on the 'description' column.

```{python}
descript_list = []
for descript in usa_jobs['description']:
    descript_list.append(clean_text(descript))
```

Now that the data has been cleaned, the code below creates a word cloud that provides some information about the descriptions of data science jobs.

```{python}
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:

1. "Data": This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.
2. "Experience": This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.
3. "Machine Learning": This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.
4. "Skills": This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.
5. "Analytics": This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.

Overall, I would advise job seekers to pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics. Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.

What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let's find out by running the pipeline on the 'responsibilities' column and generating its word cloud.

```{python}
# Remove missing values from responsibilities before text cleaning
usa_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in usa_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of responsibilities into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

Similar to the description word cloud, we see that words such as 'data', 'machine learning', 'design', 'big data', 'project', 'model', 'development', etc. are prevalent. This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, generate models, and provide analyses much like what I am doing right now.

My advice here would be to practice as much as you can. Be it coding, maths, statistics, machine learning, or any other data science concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working: using GitHub and connecting with your teammates over Zoom or Google Meet on the project agenda can shape you up for a corporate environment.

At last, we have the moment of truth: are you capable of doing this job or not? What qualities must one have to be a suitable fit for the employer? Let's check this out.

```{python}
# Remove missing values from qualifications before text cleaning (mirrors the responsibilities step)
usa_jobs.dropna(subset=['qualifications'], inplace=True)

qualif_list = []
for qualif in usa_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of qualifications into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green',
                      color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the word cloud
plt.show()
```

As per the word cloud, here are certain keywords, in effect the qualities and skills job seekers must have to qualify for a data science related job:

1. Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles.
Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.

2. Work Experience: The inclusion of "Work Experience" emphasizes the importance of relevant work experience in the field of data science. Job seekers should showcase practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.

3. Data Science: The prominence of "Data Science" indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates with a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.

4. Bachelor's Degree: The presence of "Bachelor Degree" suggests that a bachelor's degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.

5. Machine Learning and Deep Learning: The inclusion of "Machine Learning" and "Deep Learning" highlights the increasing demand for expertise in these areas. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.

6. Communication Skills: The mention of "Communication Skill" underscores the importance of effective communication for data scientists.
Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.

Overall, this word cloud suggests that job seekers in data science should prioritize Python programming, relevant work experience, a solid understanding of data science principles, a bachelor's degree (particularly in a related field), and strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.

## Visualizing Salaries

Finally!! I know you have been waiting for this since the beginning. Scrolling through and soaking in every tiny bit of information above, you have been waiting for the visualizations depicting salaries. I would say you've earned it.

Now that you know the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be there, and what you can do to qualify for that job, it's worth learning about the pay scale.

### Using benefits

My first visualization, generated with Plotly, shows the yearly salaries extracted from the benefits section of each job posting.

```{python}
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"

    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
```

The plot above shows the salary ranges for the various job titles. Each data point corresponds to a particular job title, with its position along the x-axis denoting the salary range. The job titles are displayed on the y-axis, making it simple to compare salary ranges across positions.

For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. By examining the distribution of salary ranges, job seekers can better understand the earning potential of different roles. This information can be helpful when evaluating employment opportunities and negotiating compensation packages.

Additionally, the plot makes it possible to spot differences in salary ranges among positions sharing the same title. Job seekers can identify outliers, ranges that are unusually high or low compared to others, which may point to factors affecting the wage such as experience level, area of specialty, or geographic location.

In the end, this visualization enables job seekers to make better-informed decisions throughout the hiring process.
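The `min_salary` and `max_salary` columns plotted here were extracted upstream from free-text benefits fields, a step not shown in this section. As a hypothetical sketch only (the regex and the function name `extract_salary_range` are mine, not part of the original pipeline), such an extraction might look like:

```{python}
import re

# Sketch: pull a salary range such as "$120,000 - $150,000 a year" out of a
# free-text benefits/description string. Returns (min, max) as ints, or None
# when no dollar range is found.
SALARY_RE = re.compile(r"\$([\d,]+)\s*(?:-|to|–)\s*\$([\d,]+)")

def extract_salary_range(text):
    match = SALARY_RE.search(text or "")
    if not match:
        return None
    low, high = (int(group.replace(",", "")) for group in match.groups())
    return (low, high)
```

Rows where this returns `None` would simply be excluded from the dumbbell plot, which is consistent with the filtering seen above.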
It enables them to focus on options that align with their financial aspirations by matching their professional goals and expectations with the salary ranges associated with particular job titles.

#### Anomaly

I tried to generate the same plot for the hourly wages in the data too. But it turns out that, because there are so few of them (4, to be exact), generating that plot made no sense.

### Using description

```{python}
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()
for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"

    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600   # Specify the desired height
)

# Show the interactive graph
fig.show()
```

Like the plot generated from benefits, this plot provides information about the salary ranges for different job titles.
Each job title is represented by a data point, with the x-axis indicating the salary range and the y-axis indicating the job title. Together, the dumbbell plots built from the salary ranges extracted from benefits and from descriptions provide a holistic overview of the salaries offered by employers.

# Limitations

It must be said that this dataset isn't perfect. I have given my best effort to extract as much meaningful information from it as possible, but the dataset certainly has some anomalies.

One can see that the Plotly visuals for salaries extracted from benefits and from descriptions may each show job titles that are absent from the other plot. If that is the case, it can only mean one thing: the salary was provided in either the benefits or the description, but not both.

# Conclusions

Based on the findings of this project, it has become evident that aspiring Data Scientists can significantly enhance their career prospects by focusing on job opportunities in the DMV, California, Texas, and Illinois areas. These regions boast a higher concentration of relevant job postings, presenting a wealth of potential for professional growth and advancement.

Moreover, the analysis has shed light on the importance of comprehensive job postings. Companies that provide detailed information on salary, benefits, qualifications, and requirements not only demonstrate transparency but also show consideration for potential candidates. Such companies are more likely to attract top talent and are therefore highly desirable employers.

By delving deep into the employment landscape of Data Science jobs across the USA, this project has armed me with invaluable knowledge that will guide my decision-making and shape my future career trajectory. I sincerely hope that you, too, have benefited from this analysis, gaining a deeper understanding of the dynamics of the Data Science job market.