On this page

  • 1 Introduction
  • 2 Data
  • 3 Data preparation
    • 3.1 Importing the libraries
    • 3.2 Importing the dataset
    • 3.3 Data wrangling, munging and cleaning
      • 3.3.1 Missing Data
  • 4 INTERESTING INSIGHT
    • 4.1 Salary analysis using benefits of a job provided by the employer
    • 4.2 Salary analysis using description of the job provided by the employer
  • 5 Data Visualization
    • 5.1 Geospatial
    • 5.2 Textual Analyses
    • 5.3 Visualizing Salaries
      • 5.3.1 Using benefits
      • 5.3.2 Using description
  • 6 Limitations
  • 7 Conclusions

JobHuntUSA: Navigating Data Science careers through Data Visualization

1 Introduction

Due to a number of factors, the United States of America (USA) has become a center for employment opportunities in data science. First, the nation hosts hubs of innovation with cutting-edge technological infrastructure. Regions with high concentrations of technology businesses, startups, and research institutes include Silicon Valley in California, Seattle in Washington, and Boston in Massachusetts. In addition to luring top talent, these areas provide a thriving ecosystem where data science professionals can work on innovative projects and collaborate with one another.

Second, the USA is well represented across a wide range of industries, from technology and banking to healthcare and retail. Numerous businesses in these industries have invested heavily in data science capabilities because they understand the value of data-driven decision-making. With so many large organizations, including corporate behemoths like Google, Amazon, and Microsoft, data scientists have plenty of chances to work on challenging problems and make important contributions. The USA also has a thriving startup scene, with new businesses upending numerous industries through ground-breaking data-driven solutions.

Overall, the United States is a desirable location for job seekers looking for data science positions because of its large industry presence, modern technological infrastructure, and innovative culture. The nation is a growing hub for data science professionals because it provides a wide range of possibilities, access to cutting-edge initiatives, and the opportunity to collaborate with top organizations and subject matter experts.

Let me walk you through this comprehensive report which will help you find your next job.

2 Data

Some information about the dataset, which was provided by our very own Georgetown University DSAN department:

  1. This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.

  2. There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs that were active online as of 04/14/2023.

  3. The postings were obtained using the following search query terms:

  • “data-scientist”,
  • “data-analyst”,
  • “neural-networks”,
  • “big-data-and-cloud-computing”,
  • “machine-learning”,
  • “reinforcement-learning”,
  • “deep-learning”,
  • “time-series”,
  • “block-chain”,
  • “natural-language-processing”
  4. The search is restricted to the USA. The files may contain duplicate job postings.

  5. The search results are stored in multiple JSON files, one per search, with each file name representing the search term.
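Assuming each file follows the crawler's layout, with a top-level `jobs_results` key holding the postings (the miniature file below is a hypothetical example, not real data), a single search file can be inspected like this:

```python
import json

# Hypothetical miniature of one search-result file (illustration only)
raw = '''{"jobs_results": [
    {"title": "Data Scientist", "company_name": "Acme Corp", "location": "Austin, TX"}
]}'''

data = json.loads(raw)
for job in data["jobs_results"]:
    print(job["title"], "-", job["location"])  # Data Scientist - Austin, TX
```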

3 Data preparation

3.1 Importing the libraries

This step needs no explanation. Required packages must always be loaded.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import klib
import plotly.graph_objects as go
import folium
from folium import plugins
from shapely.geometry import Polygon, Point
from wordcloud import WordCloud

import json
import glob
import os

import re
import nltk
from nltk.stem import PorterStemmer
from string import punctuation
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/raghavsharma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/raghavsharma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/raghavsharma/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

3.2 Importing the dataset

I created a driver function to import the data in such a manner that both technical and non-technical audiences can understand it.

The function below returns a list of dataframes, one for each job search.

Code
# Function to create the required dataframe for analysis.
def create_job_df(path):    
    """
    Takes as input directory to construct a list of dataframes from and returns that list
    :param path: a Path to a directory
    :return: a list of pandas DataFrames
    """

    # Get every file in the folder using glob
    all_files = glob.glob(os.path.join(path, "*.json"))

    # lists for appending dataframes for every job-search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        
        if 'jobs_results' in data:
            # create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data Cleaning
            # Via
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []

            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)

                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                
                elif n==2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                
                elif n==1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)

                if 'salary' in extn.keys():
                    salary.append(extn['salary'])  
                else:
                    salary.append(np.nan)

                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions', 'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location', 'description', 'responsibilities', 'qualifications', 
                    'benefits', 'salary', 'via', 'posted', 'resources']]
            
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)
    
    # Concat the lists to create the merged dataframe
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)

    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)

    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)

    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)

    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)

    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)

    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)

    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)

    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)

    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # return the list of dataframes for every job
    return [data_scientist_df, data_analyst_df, neural_networks_df, big_data_and_cloud_computing_df, machine_learning_df, reinforcement_learning_df, deep_learning_df, time_series_df, block_chain_df, natural_language_processing_df]

Now that you’ve understood the function, let’s see what kind of dataframe we get for the potential analysis.

Code
# Define path
path = '../data/USA/'
# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search which might be later used for potential analyses.
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings across the USA
country_jobs = pd.concat(df_list, axis=0, ignore_index=True)
country_jobs.head()
job_id title company_name job_type location description responsibilities qualifications benefits salary via posted resources
0 eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... Sr. Data Scientist (NLP) MCKESSON Full-time Texas Ontada is a leading oncology real-world data a... [Collaborate with product management, product ... [5+ years of industry experience in ML and/or ... [As part of Total Rewards, we are proud to off... NaN Jobs At MCKESSON NaN http://www.mckesson.com/
1 eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... Sr Dir Data Science & Analytics Northwestern Mutual Full-time Milwaukee, WI At Northwestern Mutual, we are strong, innovat... [Provides leadership and direction to analytic... [Recognized as an expert in the industry and s... [$143,360.00, $204,800.00] NaN Northwestern Mutual Careers 21 days ago https://www.google.com/search?q=Northwestern+M...
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBTZW5pb3... Data Scientist Senior CHRISTUS Health Full-time Irving, TX Summary:\n\nThe Data Scientist Senior is respo... [The Data Scientist Senior is responsible for ... [Individual must have extensive knowledge of S... NaN NaN Christus Health Careers 28 days ago http://www.christushealth.org/
3 eyJqb2JfdGl0bGUiOiJEaXJlY3RvciwgR2xvYmFsIERlbW... Director, Global Demand Data Scientist Lead 7Z4 Pfizer, Inc. Full-time Anywhere Why Patients Need You Our manufacturing logist... NaN [Why Patients Need You Our manufacturing logis... NaN NaN Workday 2 days ago https://www.google.com/search?q=7Z4+Pfizer,+In...
4 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist John Deere Full-time Anywhere There are over 7 billion people on this planet... [Be responsible for working with large amounts... [5 years experience in programming and data an... [Additionally, we offer a comprehensive reward... NaN Salary.com NaN http://www.deere.com/

Generating a second dataframe of job listings from the same directory, which will be merged with the country-wide listings above. Any duplicate postings this introduces are dropped later in the cleaning step.

Code
# Define path
path = '../data/USA/'
# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search which might be later used for potential analyses.
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings across the USA
usa_jobs = pd.concat(df_list, axis=0, ignore_index=True)
usa_jobs.head()
job_id title company_name job_type location description responsibilities qualifications benefits salary via posted resources
0 eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... Sr. Data Scientist (NLP) MCKESSON Full-time Texas Ontada is a leading oncology real-world data a... [Collaborate with product management, product ... [5+ years of industry experience in ML and/or ... [As part of Total Rewards, we are proud to off... NaN Jobs At MCKESSON NaN http://www.mckesson.com/
1 eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... Sr Dir Data Science & Analytics Northwestern Mutual Full-time Milwaukee, WI At Northwestern Mutual, we are strong, innovat... [Provides leadership and direction to analytic... [Recognized as an expert in the industry and s... [$143,360.00, $204,800.00] NaN Northwestern Mutual Careers 21 days ago https://www.google.com/search?q=Northwestern+M...
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBTZW5pb3... Data Scientist Senior CHRISTUS Health Full-time Irving, TX Summary:\n\nThe Data Scientist Senior is respo... [The Data Scientist Senior is responsible for ... [Individual must have extensive knowledge of S... NaN NaN Christus Health Careers 28 days ago http://www.christushealth.org/
3 eyJqb2JfdGl0bGUiOiJEaXJlY3RvciwgR2xvYmFsIERlbW... Director, Global Demand Data Scientist Lead 7Z4 Pfizer, Inc. Full-time Anywhere Why Patients Need You Our manufacturing logist... NaN [Why Patients Need You Our manufacturing logis... NaN NaN Workday 2 days ago https://www.google.com/search?q=7Z4+Pfizer,+In...
4 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist John Deere Full-time Anywhere There are over 7 billion people on this planet... [Be responsible for working with large amounts... [5 years experience in programming and data an... [Additionally, we offer a comprehensive reward... NaN Salary.com NaN http://www.deere.com/

Merging the two dataframes created above

Code
usa_jobs = pd.concat([country_jobs, usa_jobs], ignore_index=True)
usa_jobs.head()
job_id title company_name job_type location description responsibilities qualifications benefits salary via posted resources
0 eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... Sr. Data Scientist (NLP) MCKESSON Full-time Texas Ontada is a leading oncology real-world data a... [Collaborate with product management, product ... [5+ years of industry experience in ML and/or ... [As part of Total Rewards, we are proud to off... NaN Jobs At MCKESSON NaN http://www.mckesson.com/
1 eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... Sr Dir Data Science & Analytics Northwestern Mutual Full-time Milwaukee, WI At Northwestern Mutual, we are strong, innovat... [Provides leadership and direction to analytic... [Recognized as an expert in the industry and s... [$143,360.00, $204,800.00] NaN Northwestern Mutual Careers 21 days ago https://www.google.com/search?q=Northwestern+M...
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBTZW5pb3... Data Scientist Senior CHRISTUS Health Full-time Irving, TX Summary:\n\nThe Data Scientist Senior is respo... [The Data Scientist Senior is responsible for ... [Individual must have extensive knowledge of S... NaN NaN Christus Health Careers 28 days ago http://www.christushealth.org/
3 eyJqb2JfdGl0bGUiOiJEaXJlY3RvciwgR2xvYmFsIERlbW... Director, Global Demand Data Scientist Lead 7Z4 Pfizer, Inc. Full-time Anywhere Why Patients Need You Our manufacturing logist... NaN [Why Patients Need You Our manufacturing logis... NaN NaN Workday 2 days ago https://www.google.com/search?q=7Z4+Pfizer,+In...
4 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist John Deere Full-time Anywhere There are over 7 billion people on this planet... [Be responsible for working with large amounts... [5 years experience in programming and data an... [Additionally, we offer a comprehensive reward... NaN Salary.com NaN http://www.deere.com/

3.3 Data wrangling, munging and cleaning

This is quite an interesting section. You will witness how the data was cleaned and munged, and what other techniques were used to preprocess it. This section also involves feature extraction.

We see that some of the columns store their text data as lists. I created a function to join these lists into the full corpus for each column.

Code
def join_data(data_lst):
    # Join the list elements with ". " as the separator
    if isinstance(data_lst, list):
        return ". ".join(data_lst)
    # Pass missing values (NaN) through unchanged
    return np.nan

usa_jobs['responsibilities'] = usa_jobs['responsibilities'].apply(join_data)
usa_jobs['qualifications'] = usa_jobs['qualifications'].apply(join_data)
usa_jobs['benefits'] = usa_jobs['benefits'].apply(join_data)
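A quick check of the helper's behavior on a typical benefits list (restated so the snippet runs on its own):

```python
import numpy as np

def join_data(data_lst):
    # Join list entries into one corpus string; pass missing values through
    if isinstance(data_lst, list):
        return ". ".join(data_lst)
    return np.nan

print(join_data(["401(k) match", "Paid time off"]))  # 401(k) match. Paid time off
```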

Some of the job postings had listed their location as ‘Anywhere’. So I decided to do some feature extraction and created a new column (‘remote’) which specifies whether the job available allows remote work or not.

Code
# Function to check if the job location is remote
def remote_or_not(location):
    # Check if the location parameter is "anywhere" (case-insensitive and stripped of leading/trailing spaces)
    if location.lower().strip() == 'anywhere':
        # If the location is "anywhere", return True
        return True
    # If the location is not "anywhere", return False
    return False

# Apply the remote_or_not function to the 'location' column of the 'usa_jobs' DataFrame and create a new 'remote' column
usa_jobs['remote'] = usa_jobs['location'].apply(remote_or_not)

Next, I saw that the ‘location’ column had some inconsistent values, so the column was cleaned and the respective cities and states were extracted for later analyses.

Code
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    # Get the number of elements in the loc_lst
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, use it for both city and state
        return loc_lst[0].strip(), loc_lst[0].strip()
    # Fall back to the first and last elements for longer strings
    return loc_lst[0].strip(), loc_lst[-1].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(usa_jobs['location'][i])
    
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'usa_jobs' DataFrame
usa_jobs['city'] = cities
usa_jobs['state'] = states

# Merge certain states for consistency
usa_jobs['state'] = usa_jobs['state'].replace('Maryland', 'MD')
usa_jobs['state'] = usa_jobs['state'].replace('New York', 'NY')
usa_jobs['state'] = usa_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
usa_jobs['state'] = usa_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
usa_jobs.drop(columns=['location'], inplace=True)
usa_jobs = usa_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state', 'remote',
       'description', 'responsibilities', 'qualifications', 'benefits',
       'salary', 'via', 'posted', 'resources']]
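To see what the extraction yields, here is a compact restatement of `get_location` (with the fallback for strings holding more than one comma) applied to two representative values:

```python
def get_location(location):
    # Split "City, ST" into parts; a single token doubles as both city and state
    parts = [p.strip() for p in location.strip().split(',')]
    if len(parts) >= 2:
        return parts[0], parts[-1]
    return parts[0], parts[0]

print(get_location("Milwaukee, WI"))  # ('Milwaukee', 'WI')
print(get_location("Anywhere"))      # ('Anywhere', 'Anywhere')
```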

I have dropped the duplicate job postings. This was done carefully, taking into account the job title, company name, and location (city, state), since an employer may post the same job at a different location.

Code
# remove duplicate values from title and company name
usa_jobs = usa_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
usa_jobs.head()
job_id title company_name job_type city state remote description responsibilities qualifications benefits salary via posted resources
0 eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... Sr. Data Scientist (NLP) MCKESSON Full-time Texas Texas False Ontada is a leading oncology real-world data a... Collaborate with product management, product o... 5+ years of industry experience in ML and/or d... As part of Total Rewards, we are proud to offe... NaN Jobs At MCKESSON NaN http://www.mckesson.com/
1 eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... Sr Dir Data Science & Analytics Northwestern Mutual Full-time Milwaukee WI False At Northwestern Mutual, we are strong, innovat... Provides leadership and direction to analytics... Recognized as an expert in the industry and sh... $143,360.00. $204,800.00 NaN Northwestern Mutual Careers 21 days ago https://www.google.com/search?q=Northwestern+M...
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBTZW5pb3... Data Scientist Senior CHRISTUS Health Full-time Irving TX False Summary:\n\nThe Data Scientist Senior is respo... The Data Scientist Senior is responsible for d... Individual must have extensive knowledge of St... NaN NaN Christus Health Careers 28 days ago http://www.christushealth.org/
3 eyJqb2JfdGl0bGUiOiJEaXJlY3RvciwgR2xvYmFsIERlbW... Director, Global Demand Data Scientist Lead 7Z4 Pfizer, Inc. Full-time Anywhere Anywhere True Why Patients Need You Our manufacturing logist... NaN Why Patients Need You Our manufacturing logist... NaN NaN Workday 2 days ago https://www.google.com/search?q=7Z4+Pfizer,+In...
4 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist John Deere Full-time Anywhere Anywhere True There are over 7 billion people on this planet... Be responsible for working with large amounts ... 5 years experience in programming and data ana... Additionally, we offer a comprehensive reward ... NaN Salary.com NaN http://www.deere.com/
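The deduplication semantics can be verified on a toy frame: two rows matching on all four subset columns collapse into one, while the same job at another location survives.

```python
import pandas as pd

jobs = pd.DataFrame({
    "title": ["Data Scientist"] * 3,
    "company_name": ["Acme Corp"] * 3,
    "city": ["Austin", "Austin", "Dallas"],
    "state": ["TX", "TX", "TX"],
})
deduped = jobs.drop_duplicates(subset=["title", "company_name", "city", "state"],
                               ignore_index=True)
print(len(deduped))  # 2
```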

3.3.1 Missing Data

I always find missing data crucial to any analysis. Searching for missing data is the first and most important stage of data cleaning. Checking the missing values in each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted, as this project entails combining dataframes.

Hence I feel that before progressing, one should always check missing data and take appropriate steps to handle it.

Let’s visualize the missing data using the ‘klib’ library so that you can see this pattern for each column in the dataset.

The klib library helps us visualize missing-data trends. Using its ‘missingval_plot’, we can extract the necessary information about the missing data in every column.

Code
"Missing Value Plot"
usa_klib = klib.missingval_plot(usa_jobs, figsize=(10,15))
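The per-column counts behind the plot can also be obtained numerically; on a toy frame with two missing salaries:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "salary": [np.nan, "80K–90K a year", np.nan],
})
print(toy.isna().sum()["salary"])  # 2
```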

There are 490 missing values in the salary column. If such a huge amount of salary data is missing, how should I proceed with a report meant to help you make a very important decision about your career in this country?

4 INTERESTING INSIGHT

This usually is not the case, but sometimes an employer provides salary information either in the description or in the benefits. Hence I decided to dig in and verify whether I could come up with something useful.

Turns out, my intuition was right. Below are two analyses that recover salary information from the benefits and the description, respectively.

4.1 Salary analysis using benefits of a job provided by the employer

I will use the functions below to extract a salary range from the benefits of every job whose listed benefits contain one.

Code
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['benefits'])):
    benefit = usa_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, append it to the 'ben_sal' list and its job ID to the 'ben_job_id' list
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(usa_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and their corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, append them to the 'ben_sal_list' and their corresponding job ID to the 'ben_job_id_lst'
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
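The pattern simply grabs every dollar figure (with an optional ‘k’ suffix) that follows a ‘$’. On a hypothetical benefits sentence:

```python
import re

salary_pattern = r"\$([\d,.-]+[kK]?)"
benefit = "We offer a base pay range of $95k to $120k plus equity."
print(re.findall(salary_pattern, benefit))  # ['95k', '120k']
```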

The salary ranges have been extracted from the benefits of some job IDs. Note that these are currently strings. The function below converts each value to a float.

Code
# Function to convert a single salary string to float
def convert_to_float(value):
    try:
        # Check for a 'k'/'K' (thousands) suffix
        has_k = 'k' in value or 'K' in value
        # Keep only the integer part, e.g. "143,360.00" -> "143,360"
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'  # Regular expression pattern
        match = re.search(pattern, value)
        if match:
            value = match.group(1)
        # Remove any non-digit characters (e.g., commas, hyphens)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the original value carried a 'k' suffix
        if has_k:
            return float(value) * 1000
        return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
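The intended semantics can be summarized in a compact sketch (an illustrative restatement, not the exact helper above): drop any cents, strip separators, and treat a ‘k’ suffix as thousands.

```python
def to_float(value):
    # 'k'/'K' marks thousands; commas and a trailing ".00" are noise
    has_k = 'k' in value.lower()
    value = value.split('.')[0]                        # drop the cents portion
    digits = ''.join(ch for ch in value if ch.isdigit())
    return float(digits) * (1000 if has_k else 1) if digits else None

print(to_float("125k"))        # 125000.0
print(to_float("143,360.00"))  # 143360.0
```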

Our last step would be to iterate over the ‘converted_data’ list above and filter our original dataframe.

Code
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the usa_jobs DataFrame based on the job IDs with salary information
# (.copy() avoids pandas' SettingWithCopy warnings when adding columns later)
ben_filtered_df = usa_jobs[usa_jobs['job_id'].isin(ben_job_id_lst)].copy()

Now that we have a new dataframe, can we proceed right away?

This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it to create a new one, always run some pre-verification checks. This ensures the data is tidy before you proceed with your study.

After taking a deep dive, I realized the extracted ranges still need to be split into minimum and maximum salaries. Moreover, for the few rows where the original ‘salary’ column is populated, that employer-stated range can be parsed and used to overwrite the extracted values.

Code
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of salaries from the ben_filtered_df DataFrame
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)

Another insight I had about this data is that the salary provided for each job is either hourly or yearly, but this wasn't distinguished in the original data. Hence I thought it would make sense to add another column describing whether the provided salary is hourly or yearly (as a heuristic, a minimum salary of $100 or less is treated as an hourly wage).

Code
def salary_status(salary):
    # Heuristic: a minimum salary of $100 or less is an hourly wage
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        # NaN fails both comparisons and ends up here
        return np.nan

ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)

Dropping any vague missing data from our analyses for the visualization later on.

Code
# dropping nan values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
job_id title company_name job_type city state remote description responsibilities qualifications benefits via posted resources min_salary max_salary salary_status
0 eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... Sr. Data Scientist (NLP) MCKESSON Full-time Texas Texas False Ontada is a leading oncology real-world data a... Collaborate with product management, product o... 5+ years of industry experience in ML and/or d... As part of Total Rewards, we are proud to offe... Jobs At MCKESSON NaN http://www.mckesson.com/ 117500.0 195800.0 Yearly
5 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist Trustees of University of Pennsylvania Full-time Anywhere Anywhere True University Overview The University of Pennsylv... Posted Job Title Data Scientist Job Profile Ti... At least 3 years of experience required in a r... The University offers a competitive benefits p... Careers@Penn NaN https://www.google.com/search?q=Trustees+of+Un... 61046.0 132906.0 Yearly
12 eyJqb2JfdGl0bGUiOiJMZWFkIERhdGEgU2NpZW50aXN0Ii... Lead Data Scientist SPECTRUM Full-time Colorado Springs CO False JOB SUMMARY\nThe goal of our Sales & Competiti... In this role, the ideal candidate utilizes ana... Required Skills/Abilities and Knowledge. Abili... The pay for this position has a salary range o... Spectrum Careers 6 days ago https://www.google.com/search?gl=us&hl=en&q=SP... 98900.0 175000.0 Yearly
13 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIENyZW... Data Scientist - Credit Risk Cottonwood Financial Full-time Irving TX False Job Description\nReporting to our Director of ... Reporting to our Director of Credit Risk Manag... BS in an analytical field such as Statistics, ... Starting annual salary of $121,000. Medical, d... LinkedIn 3 days ago https://www.google.com/search?gl=us&hl=en&q=Co... 121000.0 121000.0 Yearly
16 eyJqb2JfdGl0bGUiOiJTdGFmZiBEYXRhIFNjaWVudGlzdC... Staff Data Scientist, Ad Formats & Optimizatio... Reddit Full-time Anywhere Anywhere True Reddit is a community of communities where peo... You will partner closely with cross functional... 5+ years of experience in data analytics or re... Comprehensive Health benefits. 401k Matching. ... Built In 4 days ago https://www.reddit.com/ 184000.0 275000.0 Yearly

We have our final dataframe for the salary analysis using the benefits provided. Let's follow the same steps for the other research question using the 'description' column.

4.2 Salary analysis using description of the job provided by the employer

Code
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the usa_jobs DataFrame
for i in range(len(usa_jobs['description'])):
    descpt = usa_jobs['description'][i]
    if isinstance(descpt, str):  # guard against missing (NaN) descriptions
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(usa_jobs['job_id'][i])

# If the description contained the keyword, extract the salary from it.
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        desc_job_id_lst.append(desc_job_id[i])  # i indexes desc_sal, so the matching id lives in desc_job_id

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary not in (None, 0.0))
        max_sal = max(salary for salary in sal_range if salary is not None)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the usa_jobs DataFrame based on the job IDs with salary information
desc_filtered_df = usa_jobs[usa_jobs['job_id'].isin(desc_job_id_lst)].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the corrected data
for sal_lst in desc_correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
       'remote', 'description', 'responsibilities', 'qualifications',
       'benefits', 'min_salary', 'max_salary', 'salary_status', 'via', 'posted', 'resources']]

desc_filtered_df.head()
job_id title company_name job_type city state remote description responsibilities qualifications benefits min_salary max_salary salary_status via posted resources
0 eyJqb2JfdGl0bGUiOiJTci4gRGF0YSBTY2llbnRpc3QgKE... Sr. Data Scientist (NLP) MCKESSON Full-time Texas Texas False Ontada is a leading oncology real-world data a... Collaborate with product management, product o... 5+ years of industry experience in ML and/or d... As part of Total Rewards, we are proud to offe... 117500.0 195800.0 Yearly Jobs At MCKESSON NaN http://www.mckesson.com/
1 eyJqb2JfdGl0bGUiOiJTciBEaXIgRGF0YSBTY2llbmNlIF... Sr Dir Data Science & Analytics Northwestern Mutual Full-time Milwaukee WI False At Northwestern Mutual, we are strong, innovat... Provides leadership and direction to analytics... Recognized as an expert in the industry and sh... $143,360.00. $204,800.00 143360.0 204800.0 Yearly Northwestern Mutual Careers 21 days ago https://www.google.com/search?q=Northwestern+M...
4 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist John Deere Full-time Anywhere Anywhere True There are over 7 billion people on this planet... Be responsible for working with large amounts ... 5 years experience in programming and data ana... Additionally, we offer a comprehensive reward ... 61046.0 132906.0 Yearly Salary.com NaN http://www.deere.com/
6 eyJqb2JfdGl0bGUiOiJQcmUgU2FsZXMgRGF0YSBTY2llbn... Pre Sales Data Scientist Explorium Full-time United States Anywhere False Description\n\nPre Sales Data Scientist...\n\n... In this consultative role, you’ll be relied on... Proficiency in Python, SQL, and/or R. Ability ... NaN 81500.0 142600.0 Yearly Comeet 4 days ago https://www.google.com/search?q=Explorium&sa=X...
8 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW... Data Scientist Mars Full-time Chicago IL False [Insert short summary of role – approximately ... An industry competitive salary and benefits pa... [Insert list of top 4 key responsibilities for... NaN 98900.0 175300.0 Yearly Careers At Mars - Mars, Incorporated NaN http://www.mars.com/

Like benefits, description has also been used to create a separate dataframe that I will use to visualize salary information so that you can gain interesting insights.

5 Data Visualization

5.1 Geospatial

My first geospatial plot for jobs comes with the plotly Choropleth module.

Code
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode = 'USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs across USA</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()

The choropleth map graphically shows the number of jobs across the USA. Each state is color-coded by its total job count, with darker hues indicating more jobs, giving a visual representation of how jobs are distributed across the country. Hovering over a state reveals further details: the state's name and its total number of jobs.

The map's title, "Number of Jobs across USA", gives clear context for the information being displayed.

For my next chart, I used the very famous folium library to create another interactive visualization.

Code
# CREATE DATA
data = usa_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}  # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
usa_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS 
plugins.MarkerCluster(latlong_list).add_to(usa_job_map)

# SHOW MAP
usa_job_map

This is an interactive map to demonstrate how jobs are distributed across USA. It provides insightful information on the geographic distribution of employment prospects across the country by visually portraying the job locations. The map’s markers emphasize the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. The ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation is one of the main benefits of this information for job seekers.

Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique assembles neighboring job locations into clusters, each of which is symbolized by a single marker. This makes it simple for visitors to pinpoint areas with lots of employment prospects. Job searchers can zoom in on these clusters to learn more about individual regions and the regional labor market by doing so. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in USA and eventually assisting in decision-making related to job search and recruitment efforts.

I am hoping that you now have a clear idea about the number of jobs around the country. Since you have reached this far, I am also assuming that you would be interested in knowing more about these jobs.

Don’t worry. I have got you covered. Let me walk you step by step so that you are mentally prepared to take your crucial decision.

5.2 Textual Analyses

The dataset provided certainly revolved around text data, so I decided to apply the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend taking these courses too, as they have proven to be very beneficial.

Coming to handling the text data, I have created some functions that run in sequence, as if they were part of a pipeline.

Code
def remove_punct(text):
    """ A method to remove punctuations from text """
    text  = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text) #removes numbers from text
    return text

def remove_stopwords(text):
    """ A method to remove all the stopwords """
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """ A method to tokenize text data """
    text = re.split('\W+', text) #splitting each sentence/ tweet into its individual words
    return text

def stemming(text):
    """ A method to perform stemming on text data"""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# Making a common cleaning function for every part below for code reproducibility
def clean_words(list_words):
    # Making a regex pattern to match in the characters we would like to replace from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/" 
    pattern = "[" + character_replace + "]"
    new_list_words = []
    
    # Looping through every word to remove the characters and appending back to a new list
    # replace is being used for the characters that could not be caught through regex
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern,"",new_word)
        new_word = new_word.replace('[', '')
        new_word = new_word.replace(']', '')
        new_word = new_word.replace('-', '')
        new_word = new_word.replace('—', '')
        new_word = new_word.replace('“', '')
        new_word = new_word.replace("’", '')
        new_word = new_word.replace("”", '')
        new_word = new_word.replace("‘", '')
        new_word = new_word.replace('"', '')
        new_word = new_word.replace("'", '')
        new_word = new_word.replace(" ", '')
        new_list_words.append(new_word)

    # Using filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """ A method to do basic data cleaning """
    
    # Remove punctuation and numbers from the text
    cleaned = remove_punct(corpus)
    
    # Tokenize the text into individual words
    text_tokenized = tokenization(cleaned.lower())
    
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    
    # Perform lemmatization on the text (preferred over stemming here,
    # since lemmas remain readable words for the word clouds)
    text_lemmatized = lemmatizer(text_without_stop)
    
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    
    # Join the cleaned words back into a single string
    return " ".join(text_final)

How did I create the above pipeline for cleaning text data? The answer, again, would be taking either of the courses mentioned above.

Moving on, for our very first textual analysis, I will run the pipeline on the 'description' column.

Code
descript_list = []
for descript in usa_jobs['description']:
    # Skip missing descriptions so clean_text only ever receives strings
    if isinstance(descript, str):
        descript_list.append(clean_text(descript))

Now that the data has been cleaned, I have used the code below to create a word cloud that provides some information about the descriptions of Data Science jobs.

Code
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:

  1. “Data”: This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.

  2. “Experience”: This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.

  3. “Machine Learning”: This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.

  4. “Skills”: This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.

  5. “Analytics”: This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.

Overall, I would advise job seekers should pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics. Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.

What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let's find out by running the pipeline on the 'responsibilities' column and generating its word cloud.

Code
# Removing missing values from responsibilities for text cleaning
usa_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in usa_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of descriptions into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow', color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Similar to the description word cloud, we see that words such as 'data', 'machine learning', 'design', 'big data', 'project', 'model', 'development', etc. are prevalent.

This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, generate models, and provide an analysis much like the one I am doing right now.

My advice here would be to practice as much as you can. Be it coding, maths, statistics, machine learning, or any other data science concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working: using GitHub and connecting with your teammates over Zoom or Google Meet to work through a project's agenda can shape you up for a corporate environment.

At last, we have the moment of truth: are you capable of doing this job or not? What qualities must one have to be a suitable fit for the employer?

Let’s check this out.

Code
qualif_list = []
for qualif in usa_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of descriptions into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green', color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the  word cloud
plt.show()

Based on the word cloud, here are some keywords that translate into the qualities and skills job seekers must have in order to be qualified for a Data Science related job:

  1. Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.

  2. Work Experience: The inclusion of “Work Experience” emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.

  3. Data Science: The prominence of “Data Science” indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.

  4. Bachelor Degree: The presence of “Bachelor Degree” suggests that having a bachelor’s degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.

  5. Machine Learning and Deep Learning: The inclusion of “Machine Learning” and “Deep Learning” highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.

  6. Communication Skills: The mention of “Communication Skill” underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.

Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor’s degree, particularly in a related field, and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.

5.3 Visualizing Salaries

Finally!!

I know that ever since the beginning you have been waiting for this. Scrolling through and soaking in every tiny bit of information above, you have been waiting for the visualizations depicting salaries. I would say you've earned it.

Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to make yourself qualified, it's worth knowing about the pay scale.

5.3.1 Using benefits

Here is my first visualization, generated with plotly for the yearly salaries extracted from the job benefits.

Code
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()

The plot up top shows the salary ranges for the various job titles in a visual manner. Each data point is associated with a particular job title, and its position along the x-axis denotes the salary range. The job titles are displayed on the y-axis, making it simple to compare and identify the salary ranges for various positions.

For job seekers, this plot is quite useful because it shows the expected salaries for various job titles. By examining the distribution of salary ranges, job seekers can better understand the potential earnings of different roles. This information can be helpful when evaluating employment opportunities and negotiating compensation packages.

Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers or ranges that are unusually high or low in comparison to others, which may point to variables impacting the wage such as experience level, area of speciality, or geographic location.

In the end, this visualization enables job seekers to make better selections throughout the hiring process. It enables individuals to focus on options that coincide with their financial aspirations by assisting them in matching their professional goals and expectations with the wage ranges associated with particular job titles.
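If you want to surface those unusually wide or narrow ranges programmatically rather than by eye, ranking postings by the width of their range is a quick way to do it. A minimal sketch on invented rows shaped like the salary dataframe:

```python
import pandas as pd

# Hypothetical rows shaped like the salary dataframe; values are made up
jobs = pd.DataFrame({
    "title": ["Sr. Data Scientist", "Data Scientist", "Lead Data Scientist"],
    "min_salary": [117500.0, 61046.0, 98900.0],
    "max_salary": [195800.0, 132906.0, 175000.0],
})

# Width of each posted range; a very wide spread can hint at
# level-dependent pay, while a zero spread is a single fixed figure
jobs["spread"] = jobs["max_salary"] - jobs["min_salary"]

# Largest spreads first
widest = jobs.sort_values("spread", ascending=False)
print(widest[["title", "spread"]])
```

On the real dataframe, the top and bottom of this ranking are exactly the postings worth a second look before drawing conclusions from the plot.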

5.3.1.1 Anomaly

I tried to generate the same plot for the hourly wages in the data too. But it turns out that, since there are very few of them (4, in particular), it made no sense to generate that plot.
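The check behind that decision is just a frequency count on the salary_status column; a small sketch with invented counts shows the idea:

```python
import pandas as pd

# Toy stand-in for the salary_status column (counts are invented)
statuses = pd.Series(["Yearly"] * 20 + ["Hourly"] * 4, name="salary_status")

counts = statuses.value_counts()
print(counts)

# Too few hourly rows to support a meaningful salary-range plot
if counts.get("Hourly", 0) < 10:
    print("Skipping the hourly plot: not enough data points.")
```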

5.3.2 Using description

Code
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()

Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.

The dumbbell plots built from the salary ranges extracted from benefits and from description together provide a holistic overview of the salaries offered by employers.

6 Limitations

This dataset is far from perfect. I have done my best to extract as much meaningful information from it as possible, but it certainly contains some anomalies.

You may notice that the plotly visuals for salaries extracted from benefits and from description show somewhat different sets of job titles, with some titles appearing in one plot but not the other. That can only mean one thing: the employer disclosed the salary in only one of the two places, the benefits or the description.

7 Conclusions

Based on the insightful findings of this project, it has become evident that aspiring Data Scientists can significantly enhance their future career prospects by focusing on job opportunities in the DMV, California, Texas, and Illinois areas. These regions boast a higher concentration of relevant job postings, presenting a wealth of potential for professional growth and advancement.

Moreover, the analysis has shed light on the paramount importance of comprehensive job postings. Companies that provide detailed information regarding salary descriptions, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit consideration for potential candidates. Such companies are more likely to attract top talent and are therefore highly desirable employment options.

By delving deep into the employment landscape of Data Science jobs across the USA, this project has armed me with invaluable knowledge that will guide my decision-making and shape my future career trajectory. I sincerely hope that you, too, have derived considerable benefit from this analysis, gaining a profound understanding of the intricacies and dynamics of the Data Science job market.

Source Code
---
title: "JobHuntUSA: Navigating Data Science careers through Data Visualization"
---

# Introduction

Due to a number of variables, the United States of America (USA) has become a center for employment possibilities in data science. First of all, the nation has hubs for innovation and cutting-edge technological infrastructure. Cities with high concentrations of technological businesses, startups, and research institutes include Silicon Valley in California, Seattle in Washington, and Boston in Massachusetts. In addition to luring top talent, these areas provide a thriving ecosystem for data science professionals to work on innovative projects and collaborate with one another.

Second, the USA is well-represented in a wide range of industries, from technology and banking to healthcare and retail. Numerous businesses in these industries have extensively invested in data science capabilities because they understand the value of data-driven decision-making. With so many huge organizations, including corporate behemoths like Google, Amazon, and Microsoft, data scientists have plenty of chances to work on challenging challenges and make important contributions. The USA also has a thriving startup scene, with several new businesses upending numerous industries with ground-breaking data-driven solutions.

Overall, the United States is a desirable location for job seekers looking for data science positions because of its large industry presence, modern technological infrastructure, and innovative culture. The nation is a growing hub for data science workers because it provides a wide range of possibilities, access to cutting-edge initiatives, and the possibility to collaborate with top organizations and subject matter experts.

Let me walk you through this comprehensive report which will help you find your next job.

# Data

Some background on the dataset, which was provided by our very own Georgetown University DSAN department:

1. This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.

2. There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs, all of which were active online as of 04/14/2023.

3. The postings were obtained using the following search query terms:
- "data-scientist",
- "data-analyst",
- "neural-networks",
- "big-data-and-cloud-computing",
- "machine-learning",
- "reinforcement-learning",
- "deep-learning",
- "time-series",
- "block-chain",
- "natural-language-processing"

4. The search was restricted to the USA. The files may contain duplicate job postings.

5. The search results are stored in multiple JSON files, with each file containing the results of a single search and its name indicating the search term.
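Since each file's name encodes its search term (and, with roughly 85 searches across ten terms, the names presumably also carry an index), a small helper can recover the term from a path. The trailing-digit naming convention here is an assumption on my part:

```{python}
import os

def search_term_from_filename(path):
    """Recover the search query term from a results file name,
    e.g. '../data/USA/machine-learning7.json' -> 'machine-learning'.
    Assumes file names follow '<term><optional index>.json'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Strip any trailing search index digits
    return stem.rstrip('0123456789')

print(search_term_from_filename('../data/USA/machine-learning7.json'))
```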


# Data preparation

## Importing the libraries

This step needs no explanation. Required packages must always be loaded.


```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import klib
import plotly.graph_objects as go
import folium
from folium import plugins
from shapely.geometry import Polygon, Point
from wordcloud import WordCloud

import json
import glob
import os

import re
import nltk
from nltk.stem import PorterStemmer
from string import punctuation
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')
```

## Importing the dataset

I created a driver function to import the data in a manner that both technical and non-technical audiences can follow.

The function below will return the list of dataframes for respective job searches.

```{python}
# Function to create the required dataframe for analysis.
def create_job_df(path):    
    """
    Takes as input directory to construct a list of dataframes from and returns that list
    :param path: a Path to a directory
    :return: a list of pandas DataFrames
    """

    # Get every file in the folder using glob
    all_files = glob.glob(os.path.join(path, "*.json"))

    # lists for appending dataframes for every job-search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        
        if 'jobs_results' in data:
            # create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data Cleaning
            # Via
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []

            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)

                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                
                elif n==2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                
                elif n==1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)

                if 'salary' in extn.keys():
                    salary.append(extn['salary'])  
                else:
                    salary.append(np.nan)

                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions', 'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location', 'description', 'responsibilities', 'qualifications', 
                    'benefits', 'salary', 'via', 'posted', 'resources']]
            
            search_query = ["data-scientist","data-analyst","neural-networks","big-data-and-cloud-computing",
                "machine-learning", 'reinforcement-learning','deep-learning', "time-series","block-chain",
                "natural-language-processing"]
            
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)
    
    # Concat the lists to create the merged dataframe
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)

    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)

    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)

    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)

    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)

    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)

    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)

    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)

    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)

    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # return the list of dataframes for every job
    return [data_scientist_df, data_analyst_df, neural_networks_df, big_data_and_cloud_computing_df, machine_learning_df, reinforcement_learning_df, deep_learning_df, time_series_df, block_chain_df, natural_language_processing_df]
```

Now that you've seen how the function works, let's look at the kind of dataframe we get for the analysis.

```{python}
# Define path
path = '../data/USA/'
# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search which might be later used for potential analyses.
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings around DC
country_jobs = pd.concat(df_list, axis=0, ignore_index=True)
country_jobs.head()
```

All postings are already loaded into `country_jobs` above; the rest of the analysis works with a copy of it under the name `usa_jobs`, so no posting is counted twice.

```{python}
usa_jobs = country_jobs.copy()
usa_jobs.head()
```

## Data wrangling, munging and cleaning

This is quite an interesting section. You will witness how the data was cleaned and munged and what other techniques were used to preprocess it. This section will also involve feature-extraction.

We see that some of the columns store their categorical data as lists. I created a function that joins these lists into the full text corpus for each column.

```{python}
def join_data(data_lst):
    # Check if data_lst is not NaN
    if data_lst is not np.nan:
        # If data_lst is not NaN, join the elements with ". " as the separator
        return ". ".join(data_lst)
    # If data_lst is NaN, return NaN (assuming np.nan is a valid representation of NaN)
    return np.nan

usa_jobs['responsibilities'] = usa_jobs['responsibilities'].apply(join_data)
usa_jobs['qualifications'] = usa_jobs['qualifications'].apply(join_data)
usa_jobs['benefits'] = usa_jobs['benefits'].apply(join_data)
```
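As a quick sanity check, here is the joiner on a toy column (the values are purely illustrative, and the function is re-declared so the snippet runs on its own):

```{python}
import numpy as np
import pandas as pd

def join_data(data_lst):
    # Join list entries into one corpus string; pass NaN through
    if data_lst is not np.nan:
        return ". ".join(data_lst)
    return np.nan

# Toy column mirroring the list-valued data (illustrative values only)
toy = pd.DataFrame({'benefits': [['401k', 'Health insurance'], np.nan]})
toy['benefits'] = toy['benefits'].apply(join_data)
print(toy['benefits'].tolist())
```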

Some of the job postings had listed their location as 'Anywhere'. So I decided to do some feature extraction and created a new column ('remote') which specifies whether the job available allows remote work or not.

```{python}
# Function to check if the job location is remote
def remote_or_not(location):
    # Check if the location parameter is "anywhere" (case-insensitive and stripped of leading/trailing spaces)
    if location.lower().strip() == 'anywhere':
        # If the location is "anywhere", return True
        return True
    # If the location is not "anywhere", return False
    return False

# Apply the remote_or_not function to the 'location' column of the 'usa_jobs' DataFrame and create a new 'remote' column
usa_jobs['remote'] = usa_jobs['location'].apply(remote_or_not)
```

Next, I saw that the 'location' column had some absurd values, so this column was cleaned and the respective cities and states were extracted for later analyses.

```{python}
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    # Get the number of elements in the loc_lst
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, return the stripped city and state as the same value
        return loc_lst[0].strip(), loc_lst[0].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(usa_jobs['location'][i])
    
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'usa_jobs' DataFrame
usa_jobs['city'] = cities
usa_jobs['state'] = states

# Merge certain states for consistency
usa_jobs['state'] = usa_jobs['state'].replace('Maryland', 'MD')
usa_jobs['state'] = usa_jobs['state'].replace('New York', 'NY')
usa_jobs['state'] = usa_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
usa_jobs['state'] = usa_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
usa_jobs.drop(columns=['location'], inplace=True)
usa_jobs = usa_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state', 'remote',
       'description', 'responsibilities', 'qualifications', 'benefits',
       'salary', 'via', 'posted', 'resources']]
```
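To see what the extraction does on typical location strings, here is the splitter on two representative inputs (helper re-declared so this runs standalone):

```{python}
def get_location(location):
    # Split 'City, ST' into its parts; a single token doubles as both
    loc_lst = location.strip().split(',')
    if len(loc_lst) == 2:
        return loc_lst[0].strip(), loc_lst[1].strip()
    return loc_lst[0].strip(), loc_lst[0].strip()

print(get_location('  Arlington, VA '))
print(get_location('Anywhere'))
```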

I have dropped the duplicate job postings. This has been done very carefully by taking into account the columns of job title, company name, and the location (city, state). An employer may have the same job posting at a different location.

```{python}
# remove duplicate values from title and company name
usa_jobs = usa_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
usa_jobs.head()
```
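The effect of the chosen subset can be seen on a toy example: identical postings collapse into one, while the same title at a different city survives (toy values only):

```{python}
import pandas as pd

toy = pd.DataFrame({
    'title': ['Data Scientist', 'Data Scientist', 'Data Scientist'],
    'company_name': ['Acme', 'Acme', 'Acme'],
    'city': ['Austin', 'Austin', 'Dallas'],
    'state': ['TX', 'TX', 'TX'],
})
# Exact repeats of (title, company, city, state) are dropped
deduped = toy.drop_duplicates(subset=['title', 'company_name', 'city', 'state'],
                              ignore_index=True)
print(len(deduped))
```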


### Missing Data

I always find missing data crucial to any analysis. Searching for missing values is the first and most important stage of data cleaning. Checking the missing values in each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted, which matters here because this project entails combining dataframes.

Hence, before progressing, one should always check for missing data and take appropriate steps to handle it.


Let's visualize the missing data using the 'klib' library so that you can see this pattern for each column in the dataset.

The klib library helps us visualize missing-data trends. Using its 'missing_val' plot, we can extract the necessary information about the missing data in every column. <br><br>

```{python}
# Missing Value Plot
usa_klib = klib.missingval_plot(usa_jobs, figsize=(10,15))
```

There are 490 postings with a missing salary value. If such a huge share of the salary data is missing, how can I proceed with research meant to help you make an important decision about your career in this country?
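For readers without klib, the same per-column tallies come straight from pandas; a minimal sketch on a toy frame (toy values only):

```{python}
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'salary': [np.nan, '60K–90K a year', np.nan],
    'title': ['Data Analyst', 'Data Scientist', 'ML Engineer'],
})
# Count missing values per column, most-missing first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```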

# INTERESTING INSIGHT

It is not usually the case, but sometimes an employer provides salary information in either the description or the benefits. So I decided to dig in and verify whether I could come up with something useful.

It turns out my intuition was right. Accordingly, I provide two interesting salary analyses below, based on benefits and on description respectively.

## Salary analysis using benefits of a job provided by the employer

I will be using the below functions to provide salary information for that job whose given benefits can be used to extract the salary range.

```{python}
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'usa_jobs' DataFrame
for i in range(len(usa_jobs['benefits'])):
    benefit = usa_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, append it to the 'ben_sal' list and its job ID to the 'ben_job_id' list
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(usa_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and their corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, append them to the 'ben_sal_list' and their corresponding job ID to the 'ben_job_id_lst'
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
```
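To make the pattern concrete, here is what it pulls out of a representative benefit string (the string itself is made up for illustration):

```{python}
import re

salary_pattern = r"\$([\d,.-]+[kK]?)"
sample = "Salary range: $90,000 - $120,000 plus 401k match"
# Only dollar-prefixed figures are captured; '401k' is ignored
print(re.findall(salary_pattern, sample))
```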

The salary ranges have been extracted from the benefits of some job IDs. Note that these are currently string values. The function below converts each value to a float.

```{python}
# Function to convert a single value to float
def convert_to_float(value):
    try:
        # Check whether the value carries a 'k' suffix (thousands)
        flag = 'k' in value or 'K' in value
        # Keep only the leading (possibly comma-grouped) integer part
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'  # Regular expression pattern
        match = re.search(pattern, value)
        if match:
            value = match.group(1)
        # Remove any non-digit characters (e.g., commas, hyphens, the 'k' itself)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the original value ended with 'k'
        if flag:
            return float(value) * 1000
        return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
```
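A quick check on representative strings clarifies the intended conversions. This is a simplified stand-in (integer amounts only), re-declared so the snippet runs on its own:

```{python}
def to_float(value):
    # Simplified sketch: keep the digits, scale by 1000 on a 'k' suffix
    has_k = value.lower().rstrip().endswith('k')
    digits = ''.join(ch for ch in value if ch.isdigit())
    if not digits:
        return None
    return float(digits) * 1000 if has_k else float(digits)

print(to_float('150k'))
print(to_float('90,000'))
```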

Our last step would be to iterate over the 'converted_data' list above and filter our original dataframe.

```{python}
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the usa_jobs DataFrame based on the job IDs with salary information
ben_filtered_df = usa_jobs[usa_jobs['job_id'].isin(ben_job_id_lst)]
```

Now that we have a new dataframe, can we simply proceed? Not quite.

This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it into a new one, always run some pre-verification checks. This ensures the data is tidy before you proceed with your study.

After taking a deeper look, I found that some of these postings also carry a value in the original 'salary' column. The next step therefore fills in the minimum and maximum salary columns and, wherever the 'salary' column provides an explicit range, uses that range instead.

```{python}
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of salaries from the ben_filtered_df DataFrame
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)
```
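The en dash in the pattern is deliberate: the scraped 'salary' strings appear to use '–' rather than a plain hyphen between the two figures. On a sample string (made up for illustration), the groups line up as follows:

```{python}
import re

salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'
m = re.search(salary_pattern, '60K–90K a year')
# Group 1/4 hold the figures, group 3/6 the optional 'k' suffixes
min_salary = float(m.group(1)) * (1000 if m.group(3) else 1)
max_salary = float(m.group(4)) * (1000 if m.group(6) else 1)
print(min_salary, max_salary)
```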

Another insight: the salary listed for each job is either hourly or yearly, but the data does not distinguish between the two. Hence I added a column describing whether the provided salary is hourly or yearly, using a simple threshold on the minimum salary.

```{python}
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)
```
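The heuristic is simply a threshold at 100 on the minimum figure: anything at or below reads as an hourly rate, anything above as an annual one. Re-declared here so the check runs standalone:

```{python}
import numpy as np

def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

print(salary_status(45.0), salary_status(95000.0))
```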

Dropping any vague missing data from our analyses for the visualization later on.

```{python}
# dropping nan values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
```

We now have our final dataframe for the salary analysis of jobs using the benefits provided. Let's follow the same steps for the parallel analysis using the 'description' column.

## Salary analysis using description of the job provided by the employer

```{python}
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the usa_jobs DataFrame
for i in range(len(usa_jobs['description'])):
    descpt = usa_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(usa_jobs['job_id'][i])

# If the description contained the keyword, extract the salary from it.
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        desc_job_id_lst.append(desc_job_id[i])

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] is not None and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the usa_jobs DataFrame based on the job IDs with salary information
desc_filtered_df = usa_jobs[usa_jobs['job_id'].isin(desc_job_id_lst)]

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the converted data
for sal_lst in desc_converted_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
       'remote', 'description', 'responsibilities', 'qualifications',
       'benefits', 'min_salary', 'max_salary', 'salary_status', 'via', 'posted', 'resources']]

desc_filtered_df.head()
```

Like benefits, the description column has been used to create a separate dataframe that I will use to visualize salary information, so that you can gain interesting insights.

# Data Visualization

## Geospatial

```{python}
#| echo: false
#| warning: false

# Create a new column 'Address' by combining 'city' and 'state' columns
usa_jobs['Address'] = usa_jobs['city'] + ', ' + usa_jobs['state']

# Read the address coordinates from the 'address_coords.csv' file
address_df = pd.read_csv("../data/address_coords.csv")

# Merge the 'usa_jobs' DataFrame with the 'address_df' DataFrame based on the 'Address' column
# and add the coordinates information to 'usa_jobs_final' DataFrame
usa_jobs_final = pd.merge(usa_jobs, address_df, on="Address", how="left")

# Read the 'uscities.csv' file containing State IDs and State Names
uscities_df = pd.read_csv("../data/uscities.csv")

# Extract the State IDs and State Names from 'uscities_df' and drop duplicate rows
state_name = uscities_df[["state_id", "state_name"]].drop_duplicates().reset_index(drop=True)
state_name.columns = ["state", "state_name"]

# Count the number of job sightings in each state
total_count_jobs = usa_jobs_final.groupby("state")['job_id'].count().reset_index()
total_count_jobs.columns = ["state", "total_count"]

# Merge the state IDs and state names with the total count of jobs
total_count_jobs = pd.merge(total_count_jobs, state_name, on="state", how="left")
```

My first geospatial plot for jobs comes with the plotly Choropleth module.

```{python}
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode = 'USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs across USA</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()
```

The choropleth map graphically shows the number of jobs across the USA. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs. The map gives a visual representation of the distribution of jobs in the country; hovering over a state reveals further details, namely the name of the state and its total number of jobs.

The title of the map, "Number of Jobs across USA," gives the information being displayed a clear context.

For my next chart, I used the popular folium library to create another interactive visualization.

```{python}
# CREATE DATA
data = usa_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}  # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
usa_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS 
plugins.MarkerCluster(latlong_list).add_to(usa_job_map)

# SHOW MAP
usa_job_map
```

This is an interactive map to demonstrate how jobs are distributed across USA. It provides insightful information on the geographic distribution of employment prospects across the country by visually portraying the job locations. The map's markers emphasize the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. The ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation is one of the main benefits of this information for job seekers.

Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique assembles neighboring job locations into clusters, each of which is symbolized by a single marker. This makes it simple for visitors to pinpoint areas with lots of employment prospects. Job searchers can zoom in on these clusters to learn more about individual regions and the regional labor market by doing so. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in USA and eventually assisting in decision-making related to job search and recruitment efforts.


I am hoping that you now have a clear idea about the number of jobs around the country. Since you have reached this far, I am also assuming that you would be interested in knowing more about these jobs.

Don't worry, I have got you covered. Let me walk you through step by step so that you are mentally prepared to make your crucial decision.

## Textual Analyses

The dataset provided certainly revolves around text data, so I thought to apply the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend you take these courses too, as they have proven to be very beneficial.

Coming to handling the text data, I have created some functions that run in sequence, as if they were part of a pipeline.

```{python}
def remove_punct(text):
    """ A method to remove punctuations from text """
    text  = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text) #removes numbers from text
    return text

def remove_stopwords(text):
    """ A method to remove all the stopwords """
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """ A method to tokenize text data """
    text = re.split(r'\W+', text) #splitting the text into its individual words
    return text

def stemming(text):
    """ A method to perform stemming on text data"""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    """ A method to perform lemmatization on text data """
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# Making a common cleaning function for every part below for code reproducibility
def clean_words(list_words):
    # Making a regex pattern to match in the characters we would like to replace from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/" 
    pattern = "[" + character_replace + "]"
    new_list_words = []
    
    # Looping through every word to remove the characters and appending back to a new list
    # replace is being used for the characters that could not be caught through regex
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern,"",new_word)
        new_word = new_word.replace('[', '')
        new_word = new_word.replace(']', '')
        new_word = new_word.replace('-', '')
        new_word = new_word.replace('—', '')
        new_word = new_word.replace('“', '')
        new_word = new_word.replace("’", '')
        new_word = new_word.replace("”", '')
        new_word = new_word.replace("‘", '')
        new_word = new_word.replace('"', '')
        new_word = new_word.replace("'", '')
        new_word = new_word.replace(" ", '')
        new_list_words.append(new_word)

    # Using filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """ A method to do basic data cleaning """
    
    # Remove punctuation and numbers from the text
    clean_text = remove_punct(corpus)
    
    # Tokenize the text into individual words
    text_tokenized = tokenization(clean_text.lower())
    
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    
    # Perform lemmatization on the text (used here instead of stemming,
    # since it keeps real dictionary words)
    text_lemmatized = lemmatizer(text_without_stop)
    
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    
    # Join the cleaned words back into a single string
    return " ".join(text_final)
```
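As a quick sanity check, here is a minimal, self-contained sketch of the same cleaning steps. A tiny hard-coded stopword list stands in for NLTK's (so it runs without any downloads), and the stemming/lemmatization step is omitted for brevity:

```{python}
import re

# Hedged sketch of the pipeline: remove numbers, remove punctuation,
# lowercase, tokenize, and drop a small set of stand-in stopwords
STOPWORDS = {"a", "an", "the", "and", "of", "in", "for", "to", "with"}

def clean_text_sketch(corpus):
    text = re.sub(r'[0-9]+', '', corpus)      # remove numbers
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation
    tokens = re.split(r'\W+', text.lower())   # tokenize
    return " ".join(t for t in tokens if t and t not in STOPWORDS)

print(clean_text_sketch("Experience with 5+ years of Machine Learning, a plus!"))
# experience years machine learning plus
```

The real pipeline does the same thing, except with NLTK's full stopword list and WordNet lemmatization on top.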

How did I create the above pipeline for cleaning text data? The answer, again, would be to take either of the courses mentioned above.

Moving on, for our very first textual analysis, I will be using the pipeline created for the 'description' column.

```{python}
descript_list = []
for descript in usa_jobs['description']:
    descript_list.append(clean_text(descript))
```

Now that the data has been cleaned, I have used the code below to create a wordcloud that can provide you with some information about the descriptions of Data Science jobs.

```{python}
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:

1. "Data": This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.

2. "Experience": This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.

3. "Machine Learning": This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.

4. "Skills": This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.

5. "Analytics": This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.

Overall, I would advise job seekers to pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics. Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.
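Impressions drawn from a word cloud can also be verified numerically. A sketch using `collections.Counter` on a few hypothetical cleaned descriptions (standing in for `descript_list`) looks like this:

```{python}
from collections import Counter

# Hypothetical cleaned descriptions, standing in for descript_list above
descript_list = [
    "data scientist experience machine learning python",
    "data analyst experience sql analytics",
    "machine learning engineer python data pipelines",
]

# Count word frequencies across all cleaned descriptions
counts = Counter(" ".join(descript_list).split())
print(counts.most_common(3))
```

The top entries of `most_common` should match the largest words in the word cloud, since WordCloud sizes words by frequency.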

What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let's find out by running the pipeline for the 'responsibilities' column and generating its word cloud.

```{python}
# Removing missing values from responsibilities for text cleaning
usa_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in usa_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of descriptions into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow', color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

Similar to the description wordcloud, we see that words such as 'data', 'machine learning', 'design', 'big data', 'project', 'model', 'development', etc. are prevalent.

This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, generate some models, and provide an analysis, much like the one I am presenting right now.

My advice here would be to practice as much as you can. Be it coding, maths, statistics, machine learning, or any other data science concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working: using GitHub and connecting with your teammates over Zoom or Google Meet to discuss the project agenda can shape you up for working in a corporate environment.


At last, we have the moment of truth: are you capable of doing this job or not?
What qualities must one have to be a suitable fit for the employer?

Let's check this out.

```{python}
# Removing missing values from qualifications before text cleaning
usa_jobs.dropna(subset=['qualifications'], inplace=True)

qualif_list = []
for qualif in usa_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of descriptions into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green', color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the  word cloud
plt.show()
```

As per the word cloud, I can point out certain keywords, which in turn represent the qualities and skills that job seekers must have in order to be qualified for a Data Science related job. These are as follows:

1. Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.

2. Work Experience: The inclusion of "Work Experience" emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.

3. Data Science: The prominence of "Data Science" indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.

4. Bachelor Degree: The presence of "Bachelor Degree" suggests that having a bachelor's degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.

5. Machine Learning and Deep Learning: The inclusion of "Machine Learning" and "Deep Learning" highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.

6. Communication Skills: The mention of "Communication Skill" underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.

Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor's degree, particularly in a related field, and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.

## Visualizing Salaries

Finally!!

I know that ever since the beginning you have been waiting for this. Scrolling and soaking in every tiny bit of information provided above, you have been waiting for the visualizations depicting salaries. I would say you've earned it.

Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to make yourself qualified for that job, it's worth knowing about the pay scale.

### Using benefits

Coming to my first visualization, which I have generated with plotly from the yearly salaries extracted from the benefits of the job.

```{python}
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()
```

The plot above visually shows the salary ranges for the various job titles. Each data point on the plot is associated with a particular job title, with the position along the x-axis denoting the salary range. The job titles are displayed on the y-axis, making it simple to compare and identify the salary ranges for various positions.

For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. Job seekers can better comprehend the potential earnings of various roles by examining the distribution of salary ranges. When evaluating employment opportunities and negotiating compensation packages, this information can be helpful.

Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers or ranges that are unusually high or low in comparison to others, which may point to variables impacting the wage such as experience level, area of speciality, or geographic location.

In the end, this visualization enables job seekers to make better selections throughout the hiring process. It enables individuals to focus on options that coincide with their financial aspirations by assisting them in matching their professional goals and expectations with the wage ranges associated with particular job titles.

#### Anomaly

I tried to generate the same plot for the hourly wages in the data too. But it turns out that, with only 4 such records, it made no sense to generate that plot.

### Using description

```{python}
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()
```

Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.

The dumbbell plots generated using salary ranges extracted from benefits and description provide a holistic overview of the salaries offered by employers.

# Limitations

It can be said that this dataset isn't perfect after all. I have given my best effort to extract as much meaningful information as possible from this data, but the dataset certainly has some anomalies.

One can see that the plotly visuals for salaries extracted from benefits and description may show different job titles; a title present in one plot may be missing from the other. If that is the case, it can only mean one thing: the salary was provided in either the benefits or the description, but not both.
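One way to confirm this is to compare which job titles appear in each of the two salary dataframes. A sketch with hypothetical stand-in dataframes (mirroring `ben_filtered_df` and `desc_filtered_df`):

```{python}
import pandas as pd

# Hypothetical stand-ins for the two salary dataframes built earlier
ben_filtered_df = pd.DataFrame({"title": ["Data Scientist", "ML Engineer"]})
desc_filtered_df = pd.DataFrame({"title": ["Data Scientist", "Data Analyst"]})

# Titles whose salary appeared only in benefits or only in the description
only_in_benefits = set(ben_filtered_df["title"]) - set(desc_filtered_df["title"])
only_in_description = set(desc_filtered_df["title"]) - set(ben_filtered_df["title"])

print(only_in_benefits)      # titles with salary found only in benefits
print(only_in_description)   # titles with salary found only in description
```

Titles landing in either difference set are exactly the ones that show up in one dumbbell plot but not the other.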

# Conclusions

Based on the insightful findings of this project, it has become evident that aspiring Data Scientists can significantly enhance their future career prospects by focusing on job opportunities in the DMV, California, Texas, and Illinois areas. These regions boast a higher concentration of relevant job postings, presenting a wealth of potential for professional growth and advancement.

Moreover, the analysis has shed light on the paramount importance of comprehensive job postings. Companies that provide detailed information regarding salary descriptions, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit consideration for potential candidates. Such companies are more likely to attract top talent and are therefore highly desirable employment options.

By delving deep into the employment landscape of Data Science jobs across the USA, this project has armed me with invaluable knowledge that will guide my decision-making and shape my future career trajectory. I sincerely hope that you, too, have derived considerable benefit from this analysis, gaining a profound understanding of the intricacies and dynamics of the Data Science job market.