Data Science JobHunt

On this page

  • 1 Introduction
  • 2 Data
  • 3 Data preparation
    • 3.1 Importing the libraries
    • 3.2 Importing the dataset
    • 3.3 Data wrangling, munging and cleaning
      • 3.3.1 Missing Data
  • 4 INTERESTING INSIGHT
    • 4.1 Salary analysis using benefits of a job provided by the employer
    • 4.2 Salary analysis using description of the job provided by the employer
  • 5 Data Visualization
    • 5.1 Geospatial
    • 5.2 Textual Analyses
    • 5.3 Visualizing Salaries
      • 5.3.1 Using benefits
      • 5.3.2 Using description
  • 6 Limitations
  • 7 Conclusions

JobHuntDC: Navigating Data Science careers through Data Visualization


1 Introduction

It’s great that you have narrowed down your search to the District of Columbia, Maryland and Virginia (DMV) area.

The region is quickly becoming a hotspot for data science professionals seeking rewarding employment. Its proximity to prominent academic institutions, government agencies, and cutting-edge enterprises creates a distinctive environment for creativity and collaboration in data science. The DMV region has a thriving IT sector, with a variety of public- and private-sector organizations producing enormous volumes of data ready for investigation and analysis. The region is also rich in talent: each year, a large number of top universities and colleges graduate skilled students in data science and related fields.

In addition, the area’s vibrant startup scene and strong support for entrepreneurship create a welcoming climate for data-driven businesses, making it a desirable location for data science professionals looking to have a real impact. Overall, with its combination of a thriving industry, research opportunities, and deep talent pool, the DMV region is a prime center for anyone wishing to grow a data science career in a challenging and gratifying environment.

Let me walk you through this comprehensive report which will help you find your next job.

2 Data

Some information about the dataset, provided by our very own Georgetown University DSAN department:

  1. This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.

  2. There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs, all active online as of 04/14/2023.

  3. The postings were obtained using the following search query terms:

  • “data-scientist”,
  • “data-analyst”,
  • “neural-networks”,
  • “big-data-and-cloud-computing”,
  • “machine-learning”,
  • “reinforcement-learning”,
  • “deep-learning”,
  • “time-series”,
  • “block-chain”,
  • “natural-language-processing”
  4. The search for this data was run locally for DC. The files may contain duplicate job postings.

  5. Furthermore, the search results extend outside the specified search area; however, they likely contain a higher sampling of jobs in the DC area.

  6. The search results are stored in multiple JSON files, with the file name representing the search term and each file containing the results of a single search.
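Based on the fields the import code below reads, each file looks roughly like this. This is a hypothetical, heavily truncated sketch with invented values, not the crawler's actual schema:

```python
import json

# Hypothetical, truncated sketch of one search-result file; the field
# names mirror what the import code expects, the values are invented.
example = {
    "jobs_results": [
        {
            "job_id": "eyJqb2JfdGl0bGUi...",
            "title": "Data Scientist",
            "company_name": "Acme Analytics",
            "location": "Washington, DC",
            "description": "We are hiring a data scientist to ...",
            "via": "via LinkedIn",
            "job_highlights": [
                {"items": ["MS in a quantitative field", "Python, SQL"]}
            ],
            "related_links": [{"link": "https://example.com/job"}],
            "extensions": ["3 days ago", "Full-time"],
            "detected_extensions": {"posted_at": "3 days ago",
                                    "schedule_type": "Full-time"},
        }
    ]
}
print(json.dumps(example, indent=2))
```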

3 Data preparation

3.1 Importing the libraries

This step needs no explanation. Required packages must always be loaded.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import klib
import plotly.graph_objects as go
import folium
from folium import plugins
from shapely.geometry import Polygon, Point
from wordcloud import WordCloud

import json
import glob
import os

import re
import nltk
from nltk.stem import PorterStemmer
from string import punctuation
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/raghavsharma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/raghavsharma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/raghavsharma/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

3.2 Importing the dataset

I created a driver function to import the data in a way that both technical and non-technical audiences can understand.

The function below will return the list of dataframes for respective job searches.

Code
# Function to create the required dataframe for analysis.
def create_job_df(path):    
    """
    Takes as input directory to construct a list of dataframes from and returns that list
    :param path: a Path to a directory
    :return: a list of pandas DataFrames
    """

    # Get every file in the folder using glob
    all_files = glob.glob(os.path.join(path, "*.json"))

    # lists for appending dataframes for every job-search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        
        if 'jobs_results' in data:
            # create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data Cleaning
            # Strip the leading "via " prefix from the source column
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []

            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)

                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                
                elif n==2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                
                elif n==1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)

                if 'salary' in extn.keys():
                    salary.append(extn['salary'])  
                else:
                    salary.append(np.nan)

                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions', 'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location', 'description', 'responsibilities', 'qualifications', 
                    'benefits', 'salary', 'via', 'posted', 'resources']]
            
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)
    
    # Concat the lists to create the merged dataframe
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)

    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)

    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)

    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)

    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)

    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)

    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)

    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)

    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)

    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # return the list of dataframes for every job
    return [data_scientist_df, data_analyst_df, neural_networks_df, big_data_and_cloud_computing_df, machine_learning_df, reinforcement_learning_df, deep_learning_df, time_series_df, block_chain_df, natural_language_processing_df]
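As a side note, the `ignore_index=True` argument in these `pd.concat` calls renumbers the rows of the stacked per-file frames; a minimal illustration with made-up stand-in frames:

```python
import pandas as pd

# Two tiny stand-in frames, as if they came from two JSON files
a = pd.DataFrame({'title': ['Data Scientist'], 'state': ['VA']})
b = pd.DataFrame({'title': ['Data Analyst'], 'state': ['MD']})

# ignore_index=True gives the stacked frame a fresh 0..n-1 index
merged = pd.concat([a, b], axis=0, ignore_index=True)
print(merged.index.tolist())  # [0, 1] rather than [0, 0]
```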

Now that you’ve seen the function, let’s look at the kind of dataframe we get for the analysis.

Code
# Define path
path = '../data/DC/'
# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search which might be later used for potential analyses.
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings around DC
dc_jobs = pd.concat(df_list, axis=0, ignore_index=True)
dc_jobs.head()
job_id title company_name job_type location description responsibilities qualifications benefits salary via posted resources
0 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uLV... Data Scientist - On-Site BU_IS, Raytheon Technologies Full-time Herndon, VA Date Posted:\n2022-11-10-08:00\n...\nCountry:\... [The Data Scientists will be responsible for i... [Bachelor’s degree with 5 years of experience ... NaN NaN Raytheon Jobs - Raytheon Technologies NaN https://www.google.com/search?hl=en&gl=us&q=BU...
1 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist Gryphon Scientific Full-time Takoma Park, MD This position offers options for in-person (in... [As a junior data scientist at Gryphon, you wi... [Undergraduate degree in a relevant discipline... [Fully paid health insurance for staff, and di... NaN Jobs By Workable 6 days ago http://www.gryphonscientific.com/
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgQ29uc3... Data Scientist, Consultant Guidehouse Full-time Wheaton-Glenmont, MD Job Family:\nData Science Consulting\n...\nTra... [Support National Security clients as a Data S... [U.S. citizenship and ability to obtain a Publ... [Guidehouse offers a comprehensive, total rewa... NaN Monster 12 hours ago https://www.google.com/search?hl=en&gl=us&q=Gu...
3 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... Data Scientist Jobs ManTech International Full-time Herndon, VA Secure our Nation, Ignite your Future\n\nEach ... [Each day U.S. Customs and Border Protection (... [HS Diploma/GED and 20+ years, AS/AA and 18+ y... [The projected compensation range for this pos... NaN Clearance Jobs 11 hours ago http://www.mantech.com/
4 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist ENS Solutions Full-time Washington, DC The contractor shall work closely with teammat... [The contractor shall work closely with teamma... [Experience with data exploration, data cleani... [Platinum-Level Health/Vision/Dental coverage ... NaN Cleared Careers NaN https://www.google.com/search?hl=en&gl=us&q=EN...

3.3 Data wrangling, munging and cleaning

This is quite an interesting section. You will witness how the data was cleaned and munged and what other techniques were used to preprocess it. This section will also involve feature-extraction.

We see that some of the columns store their values as lists. I created a function to join these lists into the full text for each column.

Code
def join_data(data_lst):
    # Check if data_lst is not NaN
    if data_lst is not np.nan:
        # If data_lst is not NaN, join the elements with ". " as the separator
        return ". ".join(data_lst)
    # If data_lst is NaN, return NaN (assuming np.nan is a valid representation of NaN)
    return np.nan

dc_jobs['responsibilities'] = dc_jobs['responsibilities'].apply(join_data)
dc_jobs['qualifications'] = dc_jobs['qualifications'].apply(join_data)
dc_jobs['benefits'] = dc_jobs['benefits'].apply(join_data)
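For example, the helper joins list-valued entries with ". " and passes missing values through unchanged (restated here as a self-contained check):

```python
import numpy as np

def join_data(data_lst):
    # Join list elements with ". "; NaN entries pass through unchanged
    if data_lst is not np.nan:
        return ". ".join(data_lst)
    return np.nan

print(join_data(["401(k) match", "Health insurance"]))  # 401(k) match. Health insurance
print(join_data(np.nan))                                # nan
```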

Some of the job postings listed their location as ‘Anywhere’. So I did some feature extraction and created a new column (‘remote’) that specifies whether the job allows remote work.

Code
# Function to check if the job location is remote
def remote_or_not(location):
    # Check if the location parameter is "anywhere" (case-insensitive and stripped of leading/trailing spaces)
    if location.lower().strip() == 'anywhere':
        # If the location is "anywhere", return True
        return True
    # If the location is not "anywhere", return False
    return False

# Apply the remote_or_not function to the 'location' column of the 'dc_jobs' DataFrame and create a new 'remote' column
dc_jobs['remote'] = dc_jobs['location'].apply(remote_or_not)
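A quick sanity check of the flag, with the same logic restated standalone:

```python
def remote_or_not(location):
    # True only for the literal value "Anywhere", ignoring case and spaces
    return location.lower().strip() == 'anywhere'

print(remote_or_not('  Anywhere '))     # True
print(remote_or_not('Washington, DC'))  # False
```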

Next, I saw that the ‘location’ column had some inconsistent values, so the column was cleaned and the respective cities and states were extracted for later analyses.

Code
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    # Get the number of elements in the loc_lst
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, return the stripped city and state as the same value
        return loc_lst[0].strip(), loc_lst[0].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(dc_jobs['location'][i])
    
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'dc_jobs' DataFrame
dc_jobs['city'] = cities
dc_jobs['state'] = states

# Merge certain states for consistency
dc_jobs['state'] = dc_jobs['state'].replace('Maryland', 'MD')
dc_jobs['state'] = dc_jobs['state'].replace('New York', 'NY')
dc_jobs['state'] = dc_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
dc_jobs['state'] = dc_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
dc_jobs.drop(columns=['location'], inplace=True)
dc_jobs = dc_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state', 'remote',
       'description', 'responsibilities', 'qualifications', 'benefits',
       'salary', 'via', 'posted', 'resources']]
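The splitting logic above behaves as follows on typical values (restated in compact form, with hypothetical inputs):

```python
def get_location(location):
    # Split "City, ST" into parts; a single token doubles as both city and state
    parts = [p.strip() for p in location.strip().split(',')]
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], parts[0]

print(get_location('Herndon, VA'))  # ('Herndon', 'VA')
print(get_location('Anywhere'))     # ('Anywhere', 'Anywhere')
```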

I have dropped the duplicate job postings. This was done carefully, taking into account job title, company name, and location (city, state), since an employer may run the same posting in more than one location.

Code
# remove duplicate values from title and company name
dc_jobs = dc_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
dc_jobs.head()
job_id title company_name job_type city state remote description responsibilities qualifications benefits salary via posted resources
0 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uLV... Data Scientist - On-Site BU_IS, Raytheon Technologies Full-time Herndon VA False Date Posted:\n2022-11-10-08:00\n...\nCountry:\... The Data Scientists will be responsible for id... Bachelor’s degree with 5 years of experience w... NaN NaN Raytheon Jobs - Raytheon Technologies NaN https://www.google.com/search?hl=en&gl=us&q=BU...
1 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist Gryphon Scientific Full-time Takoma Park MD False This position offers options for in-person (in... As a junior data scientist at Gryphon, you wil... Undergraduate degree in a relevant discipline ... Fully paid health insurance for staff, and dis... NaN Jobs By Workable 6 days ago http://www.gryphonscientific.com/
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgQ29uc3... Data Scientist, Consultant Guidehouse Full-time Wheaton-Glenmont MD False Job Family:\nData Science Consulting\n...\nTra... Support National Security clients as a Data Sc... U.S. citizenship and ability to obtain a Publi... Guidehouse offers a comprehensive, total rewar... NaN Monster 12 hours ago https://www.google.com/search?hl=en&gl=us&q=Gu...
3 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... Data Scientist Jobs ManTech International Full-time Herndon VA False Secure our Nation, Ignite your Future\n\nEach ... Each day U.S. Customs and Border Protection (C... HS Diploma/GED and 20+ years. AS/AA and 18+ ye... The projected compensation range for this posi... NaN Clearance Jobs 11 hours ago http://www.mantech.com/
4 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist ENS Solutions Full-time Washington DC False The contractor shall work closely with teammat... The contractor shall work closely with teammat... Experience with data exploration, data cleanin... Platinum-Level Health/Vision/Dental coverage w... NaN Cleared Careers NaN https://www.google.com/search?hl=en&gl=us&q=EN...
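To illustrate why the subset matters, here is a made-up frame where the same role appears in two cities; both rows survive deduplication:

```python
import pandas as pd

# Same title and company, different city: not a duplicate under this subset
df = pd.DataFrame({
    'title':        ['Data Scientist', 'Data Scientist'],
    'company_name': ['Acme', 'Acme'],
    'city':         ['Herndon', 'Reston'],
    'state':        ['VA', 'VA'],
})
deduped = df.drop_duplicates(subset=['title', 'company_name', 'city', 'state'],
                             ignore_index=True)
print(len(deduped))  # 2: both postings survive
```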

3.3.1 Missing Data

I always find missing data crucial to any analysis. Searching for missing data is the first and most important stage of data cleaning. Checking the missing values in each column gives a solid idea of which columns are usable and which need to be adjusted or omitted, especially since this project entails combining dataframes.

Hence, before progressing, one should always check for missing data and take appropriate steps to handle it.

Below is a function that returns the missing-value statistics of the created dataframe.

Code
# Define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # Define columns of the data-frame
    df_stats = pd.DataFrame(columns = ['column', 'unique_val', 'num_unique_val', 'num_unique_val_nona', 
                                       'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    
    for c in df.columns:
        # Column
        tmp['column'] = [c]
        # Unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # Number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # Number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # Number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # Percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum()/ len(df)).round(3)*100
        # Append the values to the dataframe (DataFrame.append was removed in pandas 2.0)
        df_stats = pd.concat([df_stats, tmp])
    # Return the created dataframe
    return df_stats

missing_val_stats(dc_jobs)
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 job_id [eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uL... 314 314 0 0.0
0 title [Data Scientist - On-Site, Junior Data Scienti... 252 252 0 0.0
0 company_name [BU_IS, Raytheon Technologies, Gryphon Scienti... 239 239 0 0.0
0 job_type [Full-time, Contractor, Part-time, Internship] 4 4 0 0.0
0 city [Herndon, Takoma Park, Wheaton-Glenmont, Washi... 84 84 0 0.0
0 state [VA, MD, DC, Anywhere, CA, AZ, MA, SC, NY, LA,... 26 26 0 0.0
0 remote [False, True] 2 2 0 0.0
0 description [Date Posted:\n2022-11-10-08:00\n...\nCountry:... 309 309 0 0.0
0 responsibilities [The Data Scientists will be responsible for i... 257 256 39 12.4
0 qualifications [Bachelor’s degree with 5 years of experience ... 297 297 0 0.0
0 benefits [nan, Fully paid health insurance for staff, a... 149 148 156 49.7
0 salary [nan, 20–28 an hour, 384,458 a year, 132,368 a... 33 32 262 83.4
0 via [Raytheon Jobs - Raytheon Technologies, Jobs B... 116 116 0 0.0
0 posted [nan, 6 days ago, 12 hours ago, 11 hours ago, ... 39 38 121 38.5
0 resources [https://www.google.com/search?hl=en&gl=us&q=B... 240 240 0 0.0

As you can see, of all the columns, ‘salary’ has 83.4% missing values. This can get really tricky because salary is not just any variable you can impute; it depends on several factors such as the employer, the job title, geographical location, etc.

Let’s also visualize the missing data using the ‘klib’ library so that you can see this trend for each column in the dataset.

The klib library helps us visualize missing-data trends in the dataset. Using its ‘missingval_plot’, we can extract the necessary information about the missing data in every column.

Code
"Missing Value Plot"
dc_klib = klib.missingval_plot(dc_jobs, figsize=(10,15))

My point is: if such a huge amount of salary data is missing, how should I proceed with the research meant to help you make such an important decision about your career in the DMV area?

4 INTERESTING INSIGHT

This usually is not the case, but sometimes an employer may provide salary information either in the description or in the benefits. Hence I decided to dig in and verify whether I could come up with something useful.

It turns out my intuition was right. Below I provide two interesting salary analyses, based on benefits and description respectively.

4.1 Salary analysis using benefits of a job provided by the employer

I will be using the functions below to provide salary information for jobs whose listed benefits can be used to extract a salary range.

Code
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['benefits'])):
    benefit = dc_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, append it to the 'ben_sal' list and its job ID to the 'ben_job_id' list
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(dc_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and their corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, append them to the 'ben_sal_list' and their corresponding job ID to the 'ben_job_id_lst'
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
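Applied to a hypothetical benefits sentence, the pattern picks out the dollar figures. Note that a trailing hyphen can slip into a match; the digit filter in the conversion step strips it:

```python
import re

# The same salary regex as above, on an invented benefits sentence
salary_pattern = r"\$([\d,.-]+[kK]?)"
text = "The salary range for this position is $90,000-$120,000 per year."
print(re.findall(salary_pattern, text))  # ['90,000-', '120,000']
```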

The salary ranges have now been extracted from the benefits of some job IDs. Note that these are currently string values. The function below converts each value to a float.

Code
# Function to convert a single value to float
def convert_to_float(value):
    try:
        # check for values containing k
        flag = False
        if 'k' in value or 'K' in value:
            flag = True
        # check for values containing '.'
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'  # Regular expression pattern
        match = re.search(pattern, value)
        if match:
            value =  match.group(1).replace('.', '')  # Remove dots from the matched value
        # Remove any non-digit characters (e.g., commas, hyphens)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the value ended with 'k'/'K'
        # (the letter itself was already removed by the digit filter above)
        if flag:
            return float(value) * 1000
        else:
            return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
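A couple of checks on the conversion, restated standalone with the ‘k’ suffix taken as thousands:

```python
import re

def convert_to_float(value):
    # 'k'/'K' marks thousands; commas and stray punctuation are dropped
    try:
        flag = 'k' in value or 'K' in value
        match = re.search(r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?', value)
        if match:
            value = match.group(1)
        value = ''.join(filter(str.isdigit, value))
        return float(value) * 1000 if flag else float(value)
    except ValueError:
        return None

print(convert_to_float('80k'))     # 80000.0
print(convert_to_float('75,000'))  # 75000.0
```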

Our last step would be to iterate over the ‘converted_data’ list above and filter our original dataframe.

Code
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
# (use .copy() so later column assignments don't trigger SettingWithCopyWarning)
ben_filtered_df = dc_jobs[dc_jobs['job_id'].isin(ben_job_id_lst)].copy()

Now that we have a new dataframe, can we proceed right away?

This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it to create a new one, always run some pre-verification checks. This ensures the data is tidy before you proceed with your study.

Taking a deeper look, the next step attaches the extracted minimum and maximum salaries to the filtered dataframe, and also parses any salary ranges present in the original ‘salary’ column.

Code
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of salaries from the ben_filtered_df DataFrame
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)
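The range pattern uses an en dash, matching values like ‘20–28 an hour’ seen in the ‘salary’ column; a quick check:

```python
import re

# The same range regex as above; note the en dash (–), not a hyphen
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'
m = re.search(salary_pattern, '20–28 an hour')
print(m.group(1), m.group(4))  # 20 28
```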

Another insight I had about this data is that each job’s salary is quoted either hourly or yearly, but this wasn’t distinguished initially. Hence I thought it would make sense to add another column describing whether the listed salary is hourly or yearly.

Code
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)
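The cutoff treats any figure at or below 100 as an hourly rate; NaN fails both comparisons and stays NaN:

```python
import numpy as np

def salary_status(salary):
    # <= 100 reads as an hourly rate; anything larger as an annual figure
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    return np.nan  # NaN fails both comparisons and falls through

print(salary_status(45.0))     # Hourly
print(salary_status(95000.0))  # Yearly
print(salary_status(np.nan))   # nan
```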

So the filtered dataframe has been cleaned. Before proceeding with the visualization, let’s take a look at its missing-value statistics.

Code
missing_val_stats(ben_filtered_df)
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 job_id [eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc... 74 74 0 0.0
0 title [Junior Data Scientist, Data Scientist Jobs, R... 63 63 0 0.0
0 company_name [Gryphon Scientific, ManTech International, Bo... 53 53 0 0.0
0 job_type [Full-time, Internship, Contractor] 3 3 0 0.0
0 city [Takoma Park, Herndon, Washington, Columbia, A... 32 32 0 0.0
0 state [MD, VA, DC, CA, WA, Anywhere, NY, HI, IL] 9 9 0 0.0
0 remote [False, True] 2 2 0 0.0
0 description [This position offers options for in-person (i... 73 73 0 0.0
0 responsibilities [As a junior data scientist at Gryphon, you wi... 68 68 0 0.0
0 qualifications [Undergraduate degree in a relevant discipline... 69 69 0 0.0
0 benefits [Fully paid health insurance for staff, and di... 70 70 0 0.0
0 via [Jobs By Workable, Clearance Jobs, CareerCircl... 40 40 0 0.0
0 posted [6 days ago, 11 hours ago, 5 days ago, nan, 3 ... 20 19 30 40.5
0 resources [http://www.gryphonscientific.com/, http://www... 52 52 0 0.0
0 min_salary [50000.0, 137000.0, 73100.0, 100000.0, 110574.... 57 56 1 1.4
0 max_salary [75000.0, 190000.0, 166000.0, 145000.0, 187976... 54 53 1 1.4
0 salary_status [Yearly, Hourly, nan] 3 2 1 1.4

You can see that columns such as ‘min_salary’, ‘max_salary’, and ‘salary_status’ have only one missing value each, so it wouldn’t hurt to drop those rows.

Code
# dropping nan values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
job_id title company_name job_type city state remote description responsibilities qualifications benefits via posted resources min_salary max_salary salary_status
1 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist Gryphon Scientific Full-time Takoma Park MD False This position offers options for in-person (in... As a junior data scientist at Gryphon, you wil... Undergraduate degree in a relevant discipline ... Fully paid health insurance for staff, and dis... Jobs By Workable 6 days ago http://www.gryphonscientific.com/ 50000.0 75000.0 Yearly
3 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... Data Scientist Jobs ManTech International Full-time Herndon VA False Secure our Nation, Ignite your Future\n\nEach ... Each day U.S. Customs and Border Protection (C... HS Diploma/GED and 20+ years. AS/AA and 18+ ye... The projected compensation range for this posi... Clearance Jobs 11 hours ago http://www.mantech.com/ 137000.0 190000.0 Yearly
6 eyJqb2JfdGl0bGUiOiJSZXNlYXJjaCBEYXRhIFNjaWVudG... Research Data Scientist Booz Allen Hamilton Full-time Washington DC False Job Number: R0163240\n\nResearch Data Scientis... You'll work closely with your c us tomer to un... 2+ years of experience with machine learning, ... You may be able to take advantage of our tuiti... CareerCircle.com 5 days ago http://www.boozallen.com/ 73100.0 166000.0 Yearly
9 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgTGVhZC... Data Scientist, Lead Blend360 Full-time Columbia MD False Company Description\n\nBlend360 is a Dynamic, ... We evaluate different sources of data, discove... Master's degree in Statistics, Math, Operation... The starting pay range for this role is $100,0... ZipRecruiter NaN http://blend360.com/ 100000.0 145000.0 Yearly
10 eyJqb2JfdGl0bGUiOiJDbGVhcmVkIERhdGEgU2NpZW50aX... Cleared Data Scientist ICF Full-time Arlington VA False We are looking for an experienced Data Scienti... Ability to perform knowledge elicitation from ... The ideal candidate should be someone who is s... Pay Range - There are multiple factors that ar... LinkedIn 3 days ago https://www.icf.com/ 110574.0 187976.0 Yearly

We have our final dataframe for the salary analysis of a job using the benefits provided. Let’s proceed to follow the same steps for the other analysis using the ‘description’ column.

4.2 Salary analysis using description of the job provided by the employer

Code
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the dc_jobs DataFrame
for i in range(len(dc_jobs['description'])):
    descpt = dc_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(dc_jobs['job_id'][i])

# If the description contained the keyword, extract the salary from it.
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        # Use the job ID saved alongside this description; indexing dc_jobs by i
        # here would mismatch rows, since i indexes desc_sal, not dc_jobs
        desc_job_id_lst.append(desc_job_id[i])

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
# (.copy() avoids SettingWithCopyWarning when adding columns below)
desc_filtered_df = dc_jobs[dc_jobs['job_id'].isin(desc_job_id_lst)].copy()

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the converted data
for sal_lst in desc_converted_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
       'remote', 'description', 'responsibilities', 'qualifications',
       'benefits', 'min_salary', 'max_salary', 'salary_status', 'via', 'posted', 'resources']]

desc_filtered_df.head()
job_id title company_name job_type city state remote description responsibilities qualifications benefits min_salary max_salary salary_status via posted resources
0 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uLV... Data Scientist - On-Site BU_IS, Raytheon Technologies Full-time Herndon VA False Date Posted:\n2022-11-10-08:00\n...\nCountry:\... The Data Scientists will be responsible for id... Bachelor’s degree with 5 years of experience w... NaN 50000.0 75000.0 Yearly Raytheon Jobs - Raytheon Technologies NaN https://www.google.com/search?hl=en&gl=us&q=BU...
2 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCwgQ29uc3... Data Scientist, Consultant Guidehouse Full-time Wheaton-Glenmont MD False Job Family:\nData Science Consulting\n...\nTra... Support National Security clients as a Data Sc... U.S. citizenship and ability to obtain a Publi... Guidehouse offers a comprehensive, total rewar... 137000.0 190000.0 Yearly Monster 12 hours ago https://www.google.com/search?hl=en&gl=us&q=Gu...
3 eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBKb2JzIi... Data Scientist Jobs ManTech International Full-time Herndon VA False Secure our Nation, Ignite your Future\n\nEach ... Each day U.S. Customs and Border Protection (C... HS Diploma/GED and 20+ years. AS/AA and 18+ ye... The projected compensation range for this posi... 73100.0 166000.0 Yearly Clearance Jobs 11 hours ago http://www.mantech.com/
4 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist ENS Solutions Full-time Washington DC False The contractor shall work closely with teammat... The contractor shall work closely with teammat... Experience with data exploration, data cleanin... Platinum-Level Health/Vision/Dental coverage w... 100000.0 145000.0 Yearly Cleared Careers NaN https://www.google.com/search?hl=en&gl=us&q=EN...
5 eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3... Junior Data Scientist cBEYONData Full-time Fort Belvoir VA False cBEYONData is seeking a Junior Data Scientist ... cBEYONData is seeking a Junior Data Scientist ... 1-5 years of experience in Data Science field.... NaN 110574.0 187976.0 Yearly DataYoshi 28 days ago http://www.cbeyondata.com/

Like benefits, the description column has also been used to create a separate dataframe that I will use to visualize salary information so that you can gain interesting insights.

Let’s check for missing data in this dataframe as well.

Code
missing_val_stats(desc_filtered_df)
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 job_id [eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCAtIE9uL... 116 116 0 0.0
0 title [Data Scientist - On-Site, Data Scientist, Con... 96 96 0 0.0
0 company_name [BU_IS, Raytheon Technologies, Guidehouse, Man... 96 96 0 0.0
0 job_type [Full-time, Contractor, Part-time, Internship] 4 4 0 0.0
0 city [Herndon, Wheaton-Glenmont, Washington, Fort B... 46 46 0 0.0
0 state [VA, MD, DC, CA, AZ, Anywhere, MA, SC, NY, LA,... 15 15 0 0.0
0 remote [False, True] 2 2 0 0.0
0 description [Date Posted:\n2022-11-10-08:00\n...\nCountry:... 115 115 0 0.0
0 responsibilities [The Data Scientists will be responsible for i... 100 99 13 11.2
0 qualifications [Bachelor’s degree with 5 years of experience ... 110 110 0 0.0
0 benefits [nan, Guidehouse offers a comprehensive, total... 51 50 64 55.2
0 min_salary [50000.0, 137000.0, 73100.0, 100000.0, 110574.... 76 76 0 0.0
0 max_salary [75000.0, 190000.0, 166000.0, 145000.0, 187976... 80 80 0 0.0
0 salary_status [Yearly, Hourly] 2 2 0 0.0
0 via [Raytheon Jobs - Raytheon Technologies, Monste... 58 58 0 0.0
0 posted [nan, 12 hours ago, 11 hours ago, 28 days ago,... 26 25 45 38.8
0 resources [https://www.google.com/search?hl=en&gl=us&q=B... 97 97 0 0.0

Well, this turned out to be just fine since you can see that the columns required for the visualization don’t have any missing data.

5 Data Visualization

5.1 Geospatial

My first geospatial plot for jobs around the DMV area comes with the plotly Choropleth module.

Code
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode = 'USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs in the DMV region</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()

The choropleth map graphically shows the number of jobs in the DMV (District of Columbia, Maryland, and Virginia) region. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs. The map gives a visual representation of the distribution of jobs in the DMV area; hovering over a state reveals its name and its total number of jobs.

The map’s title, “Number of Jobs in the DMV region,” gives the displayed information a clear context: you can see that Virginia shows 63 jobs, the most, while Maryland shows 55, roughly in the middle of the range.
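For reference, an aggregated frame like `total_count_jobs` can be built from the jobs data with a simple `groupby`. This is a sketch on toy data; the real frame is constructed earlier in the notebook and also carries a `state_name` column for the hover text:

```python
import pandas as pd

# Toy stand-in for the jobs dataframe
dc_jobs_toy = pd.DataFrame({'state': ['VA', 'MD', 'DC', 'VA', 'VA', 'MD']})

# Count postings per state, mirroring what the choropleth needs
total_count_jobs = (dc_jobs_toy.groupby('state')
                    .size()
                    .reset_index(name='total_count')
                    .sort_values('total_count', ascending=False))

print(total_count_jobs.to_dict('records'))
# [{'state': 'VA', 'total_count': 3}, {'state': 'MD', 'total_count': 2}, {'state': 'DC', 'total_count': 1}]
```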

For my next chart, I used the very famous folium library to create another interactive visualization.

Code
dc_jobs_final = dc_jobs_final[(dc_jobs_final['state'] == 'DC') | (dc_jobs_final['state'] == 'MD') | (dc_jobs_final['state'] == 'VA')]

# CREATE DATA
data = dc_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}  # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
dmv_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS 
plugins.MarkerCluster(latlong_list).add_to(dmv_job_map)

# SHOW MAP
dmv_job_map

This is an interactive map to demonstrate how jobs are distributed around the DMV (District of Columbia, Maryland, and Virginia) region. It provides insightful information on the geographic distribution of employment prospects within the DMV region by visually portraying the job locations. The map’s markers emphasize the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. The ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation is one of the main benefits of this information for job seekers.

Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique assembles neighboring job locations into clusters, each of which is symbolized by a single marker. This makes it simple for visitors to pinpoint areas with lots of employment prospects. Job searchers can zoom in on these clusters to learn more about individual regions and the regional labor market by doing so. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in the DMV region and eventually assisting in decision-making related to job search and recruitment efforts.
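Since the bounding boxes are axis-aligned rectangles, the Shapely `point.within()` check above is equivalent (up to points exactly on the boundary) to two plain range comparisons, which is easy to verify with the same contiguous-U.S. box:

```python
# Same contiguous-U.S. bounding box as above
box = {'min_lat': 24.9493, 'min_long': -124.7333,
       'max_lat': 49.5904, 'max_long': -66.9548}

def in_box(lat, long, bb):
    """Axis-aligned point-in-rectangle test; matches Shapely's point.within()
    for rectangular boxes, except that boundary points are included here."""
    return (bb['min_lat'] <= lat <= bb['max_lat']
            and bb['min_long'] <= long <= bb['max_long'])

print(in_box(38.9072, -77.0369, box))   # True  (Washington, DC)
print(in_box(51.5074, -0.1278, box))    # False (London)
```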

I am hoping that you now have a clear idea about the number of jobs around this region. Since you have reached this far, I am also assuming that you would be interested in knowing more about the jobs in the DMV.

Don’t worry, I have got you covered. Let me walk you through it step by step so that you are mentally prepared to make this crucial decision.

5.2 Textual Analyses

The dataset provided certainly revolves around text data, so I thought I would use the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend you take these courses too, as they have proven to be very beneficial.

Coming to handling the text data, I have created some functions that run in sequence, as if they were stages of a pipeline.

Code
def remove_punct(text):
    """ A method to remove punctuations from text """
    text  = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text) #removes numbers from text
    return text

def remove_stopwords(text):
    """ A method to remove all the stopwords """
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """ A method to tokenize text data """
    text = re.split(r'\W+', text) # splitting each document into its individual words
    return text

def stemming(text):
    """ A method to perform stemming on text data"""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# Making a common cleaning function for every part below for code reproducibility
def clean_words(list_words):
    # Making a regex pattern to match in the characters we would like to replace from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/" 
    pattern = "[" + character_replace + "]"
    new_list_words = []
    
    # Looping through every word to remove the characters and appending back to a new list
    # replace is being used for the characters that could not be caught through regex
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern,"",new_word)
        new_word = new_word.replace('[', '')
        new_word = new_word.replace(']', '')
        new_word = new_word.replace('-', '')
        new_word = new_word.replace('—', '')
        new_word = new_word.replace('“', '')
        new_word = new_word.replace("’", '')
        new_word = new_word.replace("”", '')
        new_word = new_word.replace("‘", '')
        new_word = new_word.replace('"', '')
        new_word = new_word.replace("'", '')
        new_word = new_word.replace(" ", '')
        new_list_words.append(new_word)

    # Using filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """ A method to do basic data cleaning """
    
    # Remove punctuation and numbers from the text
    # (pass the string itself; wrapping it in a list would skip the
    # character-level punctuation filtering entirely)
    clean_text = remove_punct(corpus)
    
    # Tokenize the text into individual words
    text_tokenized = tokenization(clean_text.lower())
    
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    
    # Perform lemmatization on the text (the stemmed output was not used
    # downstream, so the stemming step is skipped here)
    text_lemmatized = lemmatizer(text_without_stop)
    
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    
    # Join the cleaned words back into a single string
    return " ".join(text_final)

How did I create the above pipeline for cleaning text data? The answer, again, would be to take either of the courses mentioned above.

Moving on, for our very first textual analysis, I will be using the pipeline created for the ‘description’ column.

Code
descript_list = []
for descript in dc_jobs['description']:
    descript_list.append(clean_text(descript))

Now that the data has been cleaned, I have used the code below to create a wordcloud that can provide you with some information about the descriptions of Data Science jobs in the DMV area.

Code
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:

  1. “Data”: This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.

  2. “Experience”: This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.

  3. “Machine Learning”: This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.

  4. “Skills”: This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.

  5. “Analytics”: This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.

Overall, I would advise job seekers to pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly. They should emphasize their experience with data, machine learning, relevant skills, and analytics. Additionally, job seekers should highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles they are interested in.

What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let’s find out by running the pipeline on the ‘responsibilities’ column and generating its word cloud.

Code
# Removing missing values from responsibilities for text cleaning
dc_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in dc_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of descriptions into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow', color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Similar to the description wordcloud, we see that words such as ‘data’, ‘machine learning’, ‘design’, ‘big data’, ‘project’, ‘model’, ‘development’, etc. are prevalent.

This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, build models, and provide analyses in a similar fashion to what I am doing right now.

My advice here would be to practice as much as you can. Be it coding, maths, statistics, machine learning, or any other data science related concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working. Using GitHub and connecting with your teammates over Zoom or Google Meet to discuss the project agenda can shape you up for working in a corporate environment.

At last, we have the moment of truth. Are you capable of doing this job? What qualities must one have to be a suitable fit for the employer?

Let’s check this out.

Code
qualif_list = []
for qualif in dc_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of descriptions into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green', color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the  word cloud
plt.show()

As per the word cloud, I can point out certain keywords that correspond to the qualities and skills job seekers must have in order to be qualified for a Data Science related job. These are as follows:

  1. Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.

  2. Work Experience: The inclusion of “Work Experience” emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.

  3. Data Science: The prominence of “Data Science” indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.

  4. Bachelor Degree: The presence of “Bachelor Degree” suggests that having a bachelor’s degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.

  5. Machine Learning and Deep Learning: The inclusion of “Machine Learning” and “Deep Learning” highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.

  6. Communication Skills: The mention of “Communication Skill” underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.

Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor’s degree, particularly in a related field, and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.

5.3 Visualizing Salaries

Finally!!

I know that ever since the beginning you have been waiting for this. Scrolling and soaking in every tiny bit of information provided above, you have been waiting for the visualizations depicting salaries. I would say you’ve earned it.

Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to qualify for that job, it’s worth knowing about the pay scale.

5.3.1 Using benefits

My first visualization was generated using plotly for the yearly salaries extracted from the benefits of each job.

Code
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()

The plot above visually shows the salary ranges for the various job titles. The position along the x-axis denotes the salary range, and each data point on the plot is associated with a particular job title. The job titles are displayed on the y-axis, making it simple to compare and identify the salary ranges for various positions.

The salaries extend from $1,500 for a Research Scientist in Machine Learning at OpenAI to $370,000 for the same role at the same company.

For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. Job searchers can better comprehend the possible earning potential for various roles by examining the distribution of salary ranges. When evaluating employment opportunities and negotiating compensation packages, this information might be helpful.

Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers or ranges that are unusually high or low in comparison to others, which may point to variables impacting the wage such as experience level, area of speciality, or geographic location.

In the end, this visualization enables job seekers to make better selections throughout the hiring process. It enables individuals to focus on options that coincide with their financial aspirations by assisting them in matching their professional goals and expectations with the wage ranges associated with particular job titles.

5.3.1.1 Anomaly

I tried to generate the same plot for the hourly wages in the data too. But it turns out that, with only 4 such records, it made no sense to generate that plot.

5.3.2 Using description

Code
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()

Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.

In the plot above, you will also see that some jobs, such as Big Data Cloud Architect at Tephra Inc., have only a single salary value, i.e. $5,090. This value was provided in the data itself.

The dumbbell plots generated using salary ranges extracted from benefits and description provide a holistic overview of the salaries offered by employers around the DMV area.

6 Limitations

This dataset isn't perfect. I have done my best to draw as much meaningful information out of it as possible, but it certainly has some anomalies.

You may notice that the Plotly visuals for salaries extracted from benefits and from descriptions can show job titles that are absent from the other plot. When that happens, it can only mean one thing: the salary was provided in either the benefits or the description, but not both.

7 Conclusions

The project’s insightful findings have highlighted the DMV region as a prime focus area for aspiring Data Scientists, offering abundant job opportunities and immense potential for career advancement. The concentrated presence of relevant job postings in this region makes it an ideal destination for professionals seeking to thrive in the field.

Furthermore, the analysis has emphasized the critical importance of comprehensive job postings. Companies that provide detailed information on aspects such as salary, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit a genuine consideration for prospective candidates. These organizations are highly sought-after by top talent, making them highly desirable employment prospects.

By immersing myself in the exploration of Data Science job opportunities throughout the DMV area, I have gained invaluable knowledge that will serve as a compass for my career decisions and shape my professional trajectory. I sincerely hope that you, too, have derived substantial benefits from this analysis, enabling you to grasp the intricacies and dynamics of the Data Science job market with a profound understanding.

Source Code
---
title: "JobHuntDC: Navigating Data Science careers through Data Visualization"  
---

# Introduction

It's great that you have narrowed down your search to the District of Columbia, Maryland and Virginia (DMV) area. 

The region is quickly becoming a hotspot for data science experts looking for rewarding employment opportunities. The area offers a distinctive environment that encourages creativity and cooperation in the field of data science due to its close proximity to prominent academic institutions, governmental organizations, and cutting-edge enterprises. The DMV region has a thriving IT sector, with a variety of businesses in the public and private sectors that produce enormous volumes of data that are ready for investigation and analysis. Additionally, the region has a wealth of talent, since each year, talented graduates in data science and related subjects are produced by a large number of top institutions and colleges. 

Additionally, the area's vibrant startup scene and strong support for entrepreneurship foster a climate that fosters data-driven businesses, making it a desirable location for data science experts looking to have a real effect. Overall, the DMV region is a prime center for data science professionals wishing to grow their careers in a challenging and gratifying environment due to its combination of a thriving industry, research possibilities, and talent pool.

Let me walk you through this comprehensive report which will help you find your next job.

# Data

Some information about the dataset that was provided by our very own Georgetown University DSAN department.

1. This dataset is the outcome of a web-crawling exercise aimed at identifying employment opportunities that could potentially interest DSAN students.

2. There are roughly 85 searches, each yielding up to 10 job postings, for a total of around 850 jobs, which are currently active online as of 04/14/2023.

3. The postings were obtained using the following search query terms:
- "data-scientist",
- "data-analyst",
- "neural-networks",
- "big-data-and-cloud-computing",
- "machine-learning",
- "reinforcement-learning",
- "deep-learning",
- "time-series",
- "block-chain",
- "natural-language-processing"

4. The search for this data is a local search centered on DC. The files may contain duplicate job postings.

5. Furthermore, the search results extend outside of the specified search area; however, they likely contain a higher sampling of jobs in the DC area.

6. The search results are stored in multiple JSON files, with the file name representing the search term and each file containing the results of a single search.
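To make the file layout concrete, here is a minimal sketch of what one of these search-result files might look like. The field names follow what the loader below consumes; the values and the file contents themselves are made up for illustration.

```{python}
import json

# A hypothetical, trimmed-down example of one search-result file,
# e.g. "data-scientist.json" (field names follow the loader below)
sample = """
{
  "jobs_results": [
    {
      "job_id": "abc123",
      "title": "Data Scientist",
      "company_name": "Acme Corp",
      "location": "Washington, DC",
      "via": "via LinkedIn",
      "description": "We are hiring a data scientist...",
      "job_highlights": [{"title": "Qualifications", "items": ["Python", "SQL"]}],
      "related_links": [{"link": "https://example.com/jobs/abc123"}],
      "extensions": ["3 days ago", "Full-time"],
      "detected_extensions": {"posted_at": "3 days ago", "schedule_type": "Full-time"}
    }
  ]
}
"""

data = json.loads(sample)
print(len(data["jobs_results"]))         # number of postings in this file
print(data["jobs_results"][0]["title"])  # "Data Scientist"
```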


# Data preparation

## Importing the libraries

This step needs no explanation. Required packages must always be loaded.


```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import klib
import plotly.graph_objects as go
import folium
from folium import plugins
from shapely.geometry import Polygon, Point
from wordcloud import WordCloud

import json
import glob
import os

import re
import nltk
from nltk.stem import PorterStemmer
from string import punctuation
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')
```

## Importing the dataset

I created a driver function to import the data in such a manner that both technical and non-technical audiences are able to understand it.

The function below will return the list of dataframes for respective job searches.

```{python}
# Function to create the required dataframe for analysis.
def create_job_df(path):    
    """
    Takes as input directory to construct a list of dataframes from and returns that list
    :param path: a Path to a directory
    :return: a list of pandas DataFrames
    """

    # Get every file in the folder using glob
    all_files = glob.glob(os.path.join(path, "*.json"))

    # lists for appending dataframes for every job-search
    data_scientist_list = []
    data_analyst_list = []
    neural_networks_list = []
    big_data_and_cloud_computing_list = []
    machine_learning_list = []
    reinforcement_learning_list = []
    deep_learning_list = []
    time_series_list = []
    block_chain_list = []
    natural_language_processing_list = []

    # Iterate over the files in the folder
    for filename in all_files:
        # Read the json file
        with open(filename, 'r') as fp:
            data = json.load(fp)
        
        if 'jobs_results' in data:
            # create dataframe
            df = pd.DataFrame(data['jobs_results'])

            # Data Cleaning
            # Via
            df['via'] = df['via'].apply(lambda x: x[4:])

            # Job highlights
            qualifications = []
            responsibilities = []
            benefits = []

            for i in range(len(df['job_highlights'])):
                jd = df['job_highlights'][i]
                n = len(jd)

                if n == 3:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(jd[2]['items'])
                
                elif n==2:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(jd[1]['items'])
                    benefits.append(np.nan)
                
                elif n==1:
                    qualifications.append(jd[0]['items'])
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)
                else:
                    qualifications.append(np.nan)
                    responsibilities.append(np.nan)
                    benefits.append(np.nan)

            # Related links
            resources = []
            for i in range(len(df['related_links'])):
                links = df['related_links'][i]
                resources.append(links[0]['link'])

            # Extensions and detected extensions
            posted = []
            salary = []
            job_type = []
            for i in range(len(df['detected_extensions'])):
                extn = df['detected_extensions'][i]
                if 'posted_at' in extn.keys():
                    posted.append(extn['posted_at'])
                else:
                    posted.append(np.nan)

                if 'salary' in extn.keys():
                    salary.append(extn['salary'])  
                else:
                    salary.append(np.nan)

                if 'schedule_type' in extn.keys():
                    job_type.append(extn['schedule_type'])
                else:
                    job_type.append(np.nan)

            # Add the created columns
            df['qualifications'] = qualifications
            df['responsibilities'] = responsibilities
            df['benefits'] = benefits
            df['posted'] = posted
            df['salary'] = salary
            df['job_type'] = job_type
            df['resources'] = resources

            # Drop the redundant columns
            df.drop(columns=['job_highlights', 'related_links', 'extensions', 'detected_extensions'], inplace=True)

            # Rearrange the columns
            df = df[['job_id', 'title', 'company_name', 'job_type', 'location', 'description', 'responsibilities', 'qualifications', 
                    'benefits', 'salary', 'via', 'posted', 'resources']]
            
            search_query = ["data-scientist","data-analyst","neural-networks","big-data-and-cloud-computing",
                "machine-learning", 'reinforcement-learning','deep-learning', "time-series","block-chain",
                "natural-language-processing"]
            
            if "data-scientist" in filename:
                data_scientist_list.append(df)
            elif "data-analyst" in filename:
                data_analyst_list.append(df)
            elif "neural-networks" in filename:
                neural_networks_list.append(df)
            elif "big-data-and-cloud-computing" in filename:
                big_data_and_cloud_computing_list.append(df)
            elif "machine-learning" in filename:
                machine_learning_list.append(df)
            elif "reinforcement-learning" in filename:
                reinforcement_learning_list.append(df)
            elif "deep-learning" in filename:
                deep_learning_list.append(df)
            elif "time-series" in filename:
                time_series_list.append(df)
            elif "block-chain" in filename:
                block_chain_list.append(df)
            elif "natural-language-processing" in filename:
                natural_language_processing_list.append(df)
    
    # Concat the lists to create the merged dataframe
    data_scientist_df = pd.concat(data_scientist_list, axis=0, ignore_index=True)

    data_analyst_df = pd.concat(data_analyst_list, axis=0, ignore_index=True)

    neural_networks_df = pd.concat(neural_networks_list, axis=0, ignore_index=True)

    big_data_and_cloud_computing_df = pd.concat(big_data_and_cloud_computing_list, axis=0, ignore_index=True)

    machine_learning_df = pd.concat(machine_learning_list, axis=0, ignore_index=True)

    reinforcement_learning_df = pd.concat(reinforcement_learning_list, axis=0, ignore_index=True)

    deep_learning_df = pd.concat(deep_learning_list, axis=0, ignore_index=True)

    time_series_df = pd.concat(time_series_list, axis=0, ignore_index=True)

    block_chain_df = pd.concat(block_chain_list, axis=0, ignore_index=True)

    natural_language_processing_df = pd.concat(natural_language_processing_list, axis=0, ignore_index=True)

    # return the list of dataframes for every job
    return [data_scientist_df, data_analyst_df, neural_networks_df, big_data_and_cloud_computing_df, machine_learning_df, reinforcement_learning_df, deep_learning_df, time_series_df, block_chain_df, natural_language_processing_df]
```

Now that you've understood the function, let's see what kind of dataframe we get for the potential analysis.

```{python}
# Define path
path = '../data/DC/'
# Execute the driver function to get the list of dataframes
df_list = create_job_df(path)

# The respective dataframes for each job search which might be later used for potential analyses.
data_scientist_df = df_list[0]
data_analyst_df = df_list[1]
neural_networks_df = df_list[2]
big_data_and_cloud_computing_df = df_list[3]
machine_learning_df = df_list[4]
reinforcement_learning_df = df_list[5]
deep_learning_df = df_list[6]
time_series_df = df_list[7]
block_chain_df = df_list[8]
natural_language_processing_df = df_list[9]

# Merge all the dataframes to get all job postings around DC
dc_jobs = pd.concat(df_list, axis=0, ignore_index=True)
dc_jobs.head()
```

## Data wrangling, munging and cleaning

This is quite an interesting section. You will witness how the data was cleaned and munged and what other techniques were used to preprocess it. This section will also involve feature-extraction.

Some of the columns store their categorical data as lists. I created a function to join these lists to form the full corpus for the specific column.

```{python}
def join_data(data_lst):
    # Check if data_lst is not NaN
    if data_lst is not np.nan:
        # If data_lst is not NaN, join the elements with ". " as the separator
        return ". ".join(data_lst)
    # If data_lst is NaN, return NaN (assuming np.nan is a valid representation of NaN)
    return np.nan

dc_jobs['responsibilities'] = dc_jobs['responsibilities'].apply(join_data)
dc_jobs['qualifications'] = dc_jobs['qualifications'].apply(join_data)
dc_jobs['benefits'] = dc_jobs['benefits'].apply(join_data)
```
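As a quick standalone check of the helper's behavior (the function is redefined here so the snippet runs on its own), note that NaN values pass through unchanged:

```{python}
import numpy as np

def join_data(data_lst):
    # Join list elements with ". "; pass NaN through unchanged
    if data_lst is not np.nan:
        return ". ".join(data_lst)
    return np.nan

print(join_data(["Health insurance", "401(k)"]))  # "Health insurance. 401(k)"
print(join_data(np.nan))                          # nan
```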

Some of the job postings had listed their location as 'Anywhere'. So I decided to do some feature extraction and created a new column ('remote') which specifies whether the job available allows remote work or not.

```{python}
# Function to check if the job location is remote
def remote_or_not(location):
    # Check if the location parameter is "anywhere" (case-insensitive and stripped of leading/trailing spaces)
    if location.lower().strip() == 'anywhere':
        # If the location is "anywhere", return True
        return True
    # If the location is not "anywhere", return False
    return False

# Apply the remote_or_not function to the 'location' column of the 'dc_jobs' DataFrame and create a new 'remote' column
dc_jobs['remote'] = dc_jobs['location'].apply(remote_or_not)
```

Next, I saw that the 'location' column had some absurd values. Hence this column was cleaned and the respective cities and states were extracted for later analyses.

```{python}
# Get city and state
def get_location(location):
    # Strip leading/trailing spaces from the location string
    location = location.strip()
    # Split the location string by comma
    loc_lst = location.split(',')
    # Get the number of elements in the loc_lst
    n = len(loc_lst)
    if n == 2:
        # If there are two elements, return the stripped city and state
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif n == 1:
        # If there is only one element, return the stripped city and state as the same value
        return loc_lst[0].strip(), loc_lst[0].strip()

# Create empty lists to store the extracted cities and states
cities = []
states = []

# Iterate over the 'location' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['location'])):
    # Extract the city and state using the get_location function
    city, state = get_location(dc_jobs['location'][i])
    
    # Check for city or state containing '+1'
    if '+1' in city:
        city_lst = city.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in city:
            city = city_lst[0] + ' ' + city_lst[1]
        else:
            city = city_lst[0]
    if '+1' in state:
        state_lst = state.split()
        # If the value is United States, merge the first two items to generate the proper location
        if 'United States' in state:
            state = state_lst[0] + ' ' + state_lst[1]
        else:
            state = state_lst[0]
    
    # Append the city and state to the respective lists
    cities.append(city)
    states.append(state)

# Add 'city' and 'state' columns to the 'dc_jobs' DataFrame
dc_jobs['city'] = cities
dc_jobs['state'] = states

# Merge certain states for consistency
dc_jobs['state'] = dc_jobs['state'].replace('Maryland', 'MD')
dc_jobs['state'] = dc_jobs['state'].replace('New York', 'NY')
dc_jobs['state'] = dc_jobs['state'].replace('California', 'CA')

# Replace 'United States' with 'Anywhere' since it indicates working anywhere within the country
dc_jobs['state'] = dc_jobs['state'].replace('United States', 'Anywhere')

# Drop the 'location' column and re-arrange the columns in the desired order
dc_jobs.drop(columns=['location'], inplace=True)
dc_jobs = dc_jobs[['job_id', 'title', 'company_name', 'job_type', 'city', 'state', 'remote',
       'description', 'responsibilities', 'qualifications', 'benefits',
       'salary', 'via', 'posted', 'resources']]
```
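A standalone copy of the splitting logic makes its two cases easy to see: a "City, ST" string yields a distinct city and state, while a single token (such as 'Anywhere') fills both slots.

```{python}
def get_location(location):
    # Split "City, ST" into (city, state); a single token fills both slots
    loc_lst = location.strip().split(',')
    if len(loc_lst) == 2:
        return loc_lst[0].strip(), loc_lst[1].strip()
    elif len(loc_lst) == 1:
        return loc_lst[0].strip(), loc_lst[0].strip()

print(get_location("Washington, DC"))  # ('Washington', 'DC')
print(get_location("Anywhere"))        # ('Anywhere', 'Anywhere')
```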

I have dropped the duplicate job postings. This has been done very carefully by taking into account the columns of job title, company name, and the location (city, state). An employer may have the same job posting at a different location.

```{python}
# remove duplicate values from title and company name
dc_jobs = dc_jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'], ignore_index=True)
dc_jobs.head()
```
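To illustrate why the subset matters, here is a small sketch with made-up postings: an exact repeat is dropped, but the same title and company at a different location survives deduplication as a distinct job.

```{python}
import pandas as pd

# Hypothetical postings: rows 0 and 1 are exact duplicates, row 2 differs by location
jobs = pd.DataFrame({
    'title':        ['Data Scientist', 'Data Scientist', 'Data Scientist'],
    'company_name': ['Acme Corp',      'Acme Corp',      'Acme Corp'],
    'city':         ['Washington',     'Washington',     'Arlington'],
    'state':        ['DC',             'DC',             'VA'],
})

deduped = jobs.drop_duplicates(subset=['title', 'company_name', 'city', 'state'],
                               ignore_index=True)
print(len(deduped))  # 2: the Arlington posting is kept as a separate job
```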


### Missing Data

I always find missing data very crucial to any analyses. Searching for missing data is the first and most important stage in data cleaning. Checking for missing values for each column (per data set) would give a solid idea of which columns are necessary and which need to be adjusted or omitted as this project entails combining the dataframes.

Hence I feel that before progressing, one should always check missing data and take appropriate steps to handle it.

Below you can find a function which will return you the missing value statistics of the created dataframe.

```{python}
# Define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # Define columns of the data-frame
    df_stats = pd.DataFrame(columns = ['column', 'unique_val', 'num_unique_val', 'num_unique_val_nona', 
                                       'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    
    for c in df.columns:
        # Column
        tmp['column'] = [c]
        # Unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # Number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # Number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # Number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # Percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum()/ len(df)).round(3)*100
        # Append the values to the dataframe (DataFrame.append was removed in pandas 2.0)
        df_stats = pd.concat([df_stats, tmp], ignore_index=True)
    # Return the created dataframe
    return df_stats

missing_val_stats(dc_jobs)
```


As you can see, of all the columns 'salary' has 83.6% missing values. Now this can get really tricky because salary is not just any variable that you can impute. It depends upon several factors such as the employer, the job title, geographical location, etc.

Let's also visualize the missing data using the 'klib' library so that you are able to see this trend for each column in the dataset.

klib library helps us to visualize missing data trends in the dataset. Using the 'missing_val' plot, we will be able to extract necessary information of the missing data in every column. <br><br>

```{python}
# Missing Value Plot
dc_klib = klib.missingval_plot(dc_jobs, figsize=(10,15))
```

My point is: if such a huge amount of salary data is missing, how should I proceed with my research to help you make a very important decision about your career in the DMV area?

# INTERESTING INSIGHT

This usually is not the case but sometimes an employer may provide information about the salary either in description or in benefits. Hence I decided to troubleshoot and verify if I could come up with something useful.

Turns out, my intuition was right. Accordingly, I have provided two interesting salary analyses, based on benefits and description respectively.

## Salary analysis using benefits of a job provided by the employer

I will be using the below functions to provide salary information for that job whose given benefits can be used to extract the salary range.

```{python}
# Define a function to check if the benefit contains the keyword 'salary', 'pay', or 'range'
def get_sal_ben(benefit):
    # Convert the benefit string to lowercase and split it into words
    ben = benefit.lower().split()
    # Check if any of the keywords are present in the benefit
    if 'salary' in ben or 'range' in ben or 'pay' in ben:
        return True
    return False

# Create empty lists to store benefits containing salary information and their corresponding job IDs
ben_sal = []
ben_job_id = []

# Iterate over the 'benefits' column of the 'dc_jobs' DataFrame
for i in range(len(dc_jobs['benefits'])):
    benefit = dc_jobs['benefits'][i]
    # Check if the benefit is not NaN
    if benefit is not np.nan:
        # If the benefit contains the keywords, append it to the 'ben_sal' list and its job ID to the 'ben_job_id' list
        if get_sal_ben(benefit):
            ben_sal.append(benefit)
            ben_job_id.append(dc_jobs['job_id'][i])

# Define a regex pattern to extract salary information from the benefits
salary_pattern = r"\$([\d,.-]+[kK]?)"

# Create empty lists to store the extracted salary information and their corresponding job IDs
ben_sal_list = []
ben_job_id_lst = []

# Iterate over the benefits containing salary information
for i in range(len(ben_sal)):
    benefit = ben_sal[i]
    # Find all matches of the salary pattern in the benefit
    matches = re.findall(salary_pattern, benefit)
    if matches:
        # If there are matches, append them to the 'ben_sal_list' and their corresponding job ID to the 'ben_job_id_lst'
        ben_sal_list.append(matches)
        ben_job_id_lst.append(ben_job_id[i])
```
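To see what the pattern captures, here is a quick standalone check on a couple of made-up benefits strings (the dollar amounts are illustrative):

```{python}
import re

salary_pattern = r"\$([\d,.-]+[kK]?)"

benefit = "Competitive pay. Salary range: $80,000 - $120,000 per year plus equity"
print(re.findall(salary_pattern, benefit))    # ['80,000', '120,000']

benefit_k = "Pay range $80k - $120k depending on experience"
print(re.findall(salary_pattern, benefit_k))  # ['80k', '120k']
```

Note that each captured value is still a string ('80,000', '80k'), which is exactly why the numeric conversion step follows.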

The salary ranges have been extracted from the benefits of some job IDs. Note that these are currently string values. The function below converts each value to a float.

```{python}
# Function to convert a single value to float
def convert_to_float(value):
    try:
        # check for values containing k
        flag = False
        if 'k' in value or 'K' in value:
            flag = True
        # check for values containing '.'
        pattern = r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?'  # Regular expression pattern
        match = re.search(pattern, value)
        if match:
            value =  match.group(1).replace('.', '')  # Remove dots from the matched value
        # Remove any non-digit characters (e.g., commas, hyphens)
        value = ''.join(filter(str.isdigit, value))
        # Multiply by 1000 if the original value ended with 'k'/'K'
        # (the digit filter above has already stripped the suffix itself)
        if flag:
            return float(value) * 1000
        else:
            return float(value)
    except ValueError:
        return None

# Iterate over the data and convert each value to float
converted_data = [[convert_to_float(value) for value in row] for row in ben_sal_list]
```
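To see how the conversion behaves, here is a simplified standalone sketch of the same logic with a few sample inputs. It assumes a 'k'/'K' suffix means thousands, and it drops the dot-removal step of the original since group 1 of the pattern never contains a dot for inputs like these.

```{python}
import re

def convert_to_float(value):
    # Standalone sketch of the converter; assumes 'k'/'K' denotes thousands
    try:
        flag = 'k' in value or 'K' in value
        match = re.search(r'^(\d{1,3}(?:,\d{3})*)(?:\.\d+)?', value)
        if match:
            value = match.group(1)
        # Remove any non-digit characters (commas, suffixes)
        value = ''.join(filter(str.isdigit, value))
        return float(value) * 1000 if flag else float(value)
    except ValueError:
        return None

print(convert_to_float('80k'))      # 80000.0
print(convert_to_float('125,000'))  # 125000.0
print(convert_to_float('17.50'))    # 17.0 (the decimal part is dropped by the pattern)
```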

Our last step would be to iterate over the 'converted_data' list above and filter our original dataframe.

```{python}
# Create an empty list to store the corrected salary ranges
correct_data = []

# Iterate over the converted_data list
for i in range(len(converted_data)):
    sal_range = converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
ben_filtered_df = dc_jobs[dc_jobs['job_id'].isin(ben_job_id_lst)]
```

Now that we have a new dataframe, can we proceed right away?

This is where I follow one of the principles of data munging and cleaning: whenever you have changed a dataframe and filtered it to create a new one, always run some pre-verification checks. This ensures that the data is tidy before you proceed with your study.

After taking a deep dive, I realized that a handful of postings also carry a salary range in the original 'salary' column. The code below adds 'min_salary' and 'max_salary' columns from the ranges extracted above and, wherever the 'salary' column is present, overrides them with the values parsed from it.

```{python}
# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the correct_data list
for sal_lst in correct_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the minimum and maximum salaries to the ben_filtered_df DataFrame
ben_filtered_df['min_salary'] = min_sal
ben_filtered_df['max_salary'] = max_sal

# Get the data and job IDs of salaries from the ben_filtered_df DataFrame
data = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['salary'])
job_ids = list(ben_filtered_df[ben_filtered_df['salary'].notna()]['job_id'])

# Define a regex pattern to extract salary ranges
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        ben_filtered_df.loc[ben_filtered_df[ben_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop the redundant 'salary' column
ben_filtered_df.drop(columns=['salary'], inplace=True)
```
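As a sanity check on the range pattern, here is how it behaves on a hypothetical salary string. Note that the pattern uses an en dash (–), as found in the scraped 'salary' strings, not a plain hyphen.

```{python}
import re

# The pattern matches ranges like "80K–120K"; the dash is an en dash (–)
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

m = re.search(salary_pattern, '80K–120K a year')
min_salary = float(m.group(1)) * (1000 if m.group(3) else 1)
max_salary = float(m.group(4)) * (1000 if m.group(6) else 1)
print(min_salary, max_salary)  # 80000.0 120000.0
```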

Another insight I had: the salary provided for each job is either hourly or yearly, but this wasn't distinguished in the original data. Hence I thought it would make sense to add a column describing whether the provided salary is hourly or yearly.

```{python}
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan

ben_filtered_df['salary_status'] = ben_filtered_df['min_salary'].apply(salary_status)
```
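A quick standalone check of the heuristic (redefined here so it runs on its own). One subtlety: for NaN both comparisons are False, so NaN falls through to the else branch and is passed along unchanged.

```{python}
import numpy as np

def salary_status(salary):
    # Heuristic: values of $100 or less are taken to be hourly rates
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        return np.nan  # both comparisons are False for NaN

print(salary_status(17.5))    # 'Hourly'
print(salary_status(95000))   # 'Yearly'
print(salary_status(np.nan))  # nan
```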

So the filtered dataframe has been cleaned. Before proceeding with the visualization, let's take a look at its missing value statistics.

```{python}
missing_val_stats(ben_filtered_df)
```

You can see that the columns 'min_salary', 'max_salary', and 'salary_status' have only one missing value each. So it wouldn't hurt to drop that row.

```{python}
# dropping nan values
ben_filtered_df.dropna(subset=['min_salary', 'max_salary', 'salary_status'], inplace=True)
ben_filtered_df.head()
```

We have our final dataframe for the salary analysis using benefits. Let's follow the same steps for the other analysis, using the 'description' column.

## Salary analysis using description of the job provided by the employer

```{python}
# Define a function to check if the description contains keywords related to salary
def get_sal_desc(descript):
    descpt = descript.lower().split()
    if 'salary' in descpt or 'range' in descpt or 'pay' in descpt:
        return True
    return False

# Create empty lists to store the descriptions and job IDs with salary information
desc_sal = []
desc_job_id = []

# Iterate over the descriptions in the dc_jobs DataFrame
for i in range(len(dc_jobs['description'])):
    descpt = dc_jobs['description'][i]
    if descpt is not np.nan:
        if get_sal_desc(descpt):
            desc_sal.append(descpt)
            desc_job_id.append(dc_jobs['job_id'][i])

# If the description contained the keyword, extract the salary from it.
salary_pattern = r"\$([\d,.-]+[kK]?)"
desc_sal_list = []
desc_job_id_lst = []

# Iterate over the descriptions with salary information
for i in range(len(desc_sal)):
    descript = desc_sal[i]
    matches = re.findall(salary_pattern, descript)
    if matches:
        desc_sal_list.append(matches)
        # i indexes desc_sal/desc_job_id, not dc_jobs, so use the job ID collected above
        desc_job_id_lst.append(desc_job_id[i])

# Iterate over the data and convert each value to float
desc_converted_data = [[convert_to_float(value) for value in row] for row in desc_sal_list]

# Create an empty list to store the corrected salary ranges
desc_correct_data = []

# Iterate over the converted data
for i in range(len(desc_converted_data)):
    sal_range = desc_converted_data[i]
    n = len(sal_range)
    # If the salary range has only one value less than 16.5, replace it with NaN
    if n == 1 and sal_range[0] < 16.5:
        sal_range = [np.nan]
    # If the salary range has more than two values, find the minimum and maximum values
    elif n > 2:
        min_sal = min(salary for salary in sal_range if salary != 0.0)
        max_sal = max(sal_range)
        sal_range = [min_sal, max_sal]
    desc_correct_data.append(sal_range)

# Filter the dc_jobs DataFrame based on the job IDs with salary information
desc_filtered_df = dc_jobs[dc_jobs['job_id'].isin(desc_job_id_lst)]

# Create empty lists to store the minimum and maximum salaries
min_sal = []
max_sal = []

# Iterate over the converted data
for sal_lst in desc_converted_data:
    if len(sal_lst) == 2:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[1])
    else:
        min_sal.append(sal_lst[0])
        max_sal.append(sal_lst[0])

# Add the min_salary and max_salary columns to the desc_filtered_df DataFrame
desc_filtered_df['min_salary'] = min_sal
desc_filtered_df['max_salary'] = max_sal

# Extract salaries from the 'salary' column
data = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['salary'])
job_ids = list(desc_filtered_df[desc_filtered_df['salary'].notna()]['job_id'])
# Note: the separator matched here is an en dash (–), which is how the scraped 'salary' strings are formatted
salary_pattern = r'(\d+(\.\d+)?)([kK])?–(\d+(\.\d+)?)([kK])?'

# Iterate over the data and extract salaries
for i in range(len(data)):
    match = re.search(salary_pattern, data[i])
    if match:
        min_salary = float(match.group(1))
        if match.group(3):
            min_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'min_salary'] = min_salary
        max_salary = float(match.group(4))
        if match.group(6):
            max_salary *= 1000
        desc_filtered_df.loc[desc_filtered_df[desc_filtered_df['job_id'] == job_ids[i]].index, 'max_salary'] = max_salary

# Drop redundant 'salary' column
desc_filtered_df.drop(columns=['salary'], inplace=True)

# Define a function to determine the salary status based on the min_salary:
# values of 100 or less are assumed to be hourly rates, larger values yearly salaries
def salary_status(salary):
    if salary <= 100:
        return 'Hourly'
    elif salary > 100:
        return 'Yearly'
    else:
        # Both comparisons are False for NaN, so missing salaries stay NaN
        return np.nan

# Add the 'salary_status' column to the desc_filtered_df DataFrame
desc_filtered_df['salary_status'] = desc_filtered_df['min_salary'].apply(salary_status)

# Reorder the columns in the desc_filtered_df DataFrame
desc_filtered_df = desc_filtered_df[['job_id', 'title', 'company_name', 'job_type', 'city', 'state',
       'remote', 'description', 'responsibilities', 'qualifications',
       'benefits', 'min_salary', 'max_salary', 'salary_status', 'via', 'posted', 'resources']]

desc_filtered_df.head()
```

Like benefits, the description column has also been used to create a separate dataframe that I will use to visualize salary information, so that you can gain some interesting insights.
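To make the extraction step above concrete, here is a minimal, self-contained sketch of how the salary pattern behaves on a sample description. The sample text and the `to_float` helper are illustrative stand-ins, not values or code from the dataset:

```python
import re

salary_pattern = r"\$([\d,.-]+[kK]?)"

sample = "We offer a salary range of $90k - $120k plus benefits."
matches = re.findall(salary_pattern, sample)
# matches == ['90k', '120k']

# Illustrative stand-in for the convert_to_float helper used earlier
def to_float(value):
    value = value.replace(",", "").rstrip("-")
    if value.lower().endswith("k"):
        return float(value[:-1]) * 1000
    return float(value)

salaries = [to_float(m) for m in matches]
# salaries == [90000.0, 120000.0]
```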

Let's check this dataframe for missing data as well.

```{python}
missing_val_stats(desc_filtered_df)
```

Well, this turned out just fine: you can see that the columns required for the visualization don't have any missing data.

# Data Visualization

## Geospatial


```{python}
#| echo: false
#| warning: false
# Create a new column 'Address' by combining 'city' and 'state' columns
dc_jobs['Address'] = dc_jobs['city'] + ', ' + dc_jobs['state']

# Read the address coordinates from the 'address_coords.csv' file
address_df = pd.read_csv("../data/address_coords.csv")

# Merge the 'dc_jobs' DataFrame with the 'address_df' DataFrame based on the 'Address' column
# and add the coordinates information to 'dc_jobs_final' DataFrame
dc_jobs_final = pd.merge(dc_jobs, address_df, on="Address", how="left")

# Read the 'uscities.csv' file containing State IDs and State Names
uscities_df = pd.read_csv("../data/uscities.csv")

# Extract the State IDs and State Names from 'uscities_df' and drop duplicate rows
state_name = uscities_df[["state_id", "state_name"]].drop_duplicates().reset_index(drop=True)
state_name.columns = ["state", "state_name"]

# Count the number of job sightings in each state
total_count_jobs = dc_jobs_final.groupby("state")['job_id'].count().reset_index()
total_count_jobs.columns = ["state", "total_count"]

# Filter the total count of jobs for specific states (DC, MD, Maryland, VA)
total_count_jobs = total_count_jobs[(total_count_jobs["state"] == "DC") |
                                    (total_count_jobs["state"] == "MD") |
                                    (total_count_jobs["state"] == "Maryland") |
                                    (total_count_jobs["state"] == "VA")].reset_index(drop=True)

# Merge the state IDs and state names with the total count of jobs
total_count_jobs = pd.merge(total_count_jobs, state_name, on="state", how="left")
```

My first geospatial plot for jobs around the DMV area comes with the plotly Choropleth module.

```{python}
# CREATE A CHOROPLETH MAP
fig = go.Figure(go.Choropleth(
    locations=total_count_jobs['state'],
    z=total_count_jobs['total_count'],
    colorscale='darkmint',
    locationmode = 'USA-states',
    name="",
    text=total_count_jobs['state_name'] + '<br>' + 'Total jobs: ' + total_count_jobs['total_count'].astype(str),
    hovertemplate='%{text}',
))

# ADD TITLE AND ANNOTATIONS
fig.update_layout(
    title_text='<b>Number of Jobs in the DMV region</b>',
    title_font_size=24,
    title_x=0.5,
    geo_scope='usa',
    width=1100,
    height=700
)

# SHOW FIGURE
fig.show()
```

The choropleth map shows the number of jobs in the DMV (District of Columbia, Maryland, and Virginia) region. Each state is color-coded by its total number of jobs, with darker hues indicating more jobs, giving a visual sense of how jobs are distributed across the area. Hovering over a state reveals further details: the name of the state and its total job count.

The map's title, "Number of Jobs in the DMV region," gives the displayed information a clear context: you can see that Virginia shows 63 jobs, the most in the region, while Maryland shows 55, roughly the middle of the range.

For my next chart, I used the very famous folium library to create another interactive visualization. 

```{python}
dc_jobs_final = dc_jobs_final[(dc_jobs_final['state'] == 'DC') | (dc_jobs_final['state'] == 'MD') | (dc_jobs_final['state'] == 'VA')]

# CREATE DATA
data = dc_jobs_final[["Latitude", "Longitude"]].values.tolist()

# Define a list of bounding boxes for the United States, including Alaska
us_bounding_boxes = [
    {'min_lat': 24.9493, 'min_long': -124.7333, 'max_lat': 49.5904, 'max_long': -66.9548},  # Contiguous U.S.
    {'min_lat': 50.0, 'min_long': -171.0, 'max_lat': 71.0, 'max_long': -129.0}  # Alaska
]

# Filter out lat/long pairs that do not belong to the United States
latlong_list = []
for latlong in data:
    point = Point(latlong[1], latlong[0])  # Shapely uses (x, y) coordinates, so we swap lat and long
    for bounding_box in us_bounding_boxes:
        box = Polygon([(bounding_box['min_long'], bounding_box['min_lat']),
                       (bounding_box['min_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['max_lat']),
                       (bounding_box['max_long'], bounding_box['min_lat'])])
        if point.within(box):
            latlong_list.append(latlong)
            break  # No need to check remaining bounding boxes if the point is already within one

# INITIALIZE MAP
dmv_job_map = folium.Map([40, -100], zoom_start=4, min_zoom=3)

# ADD POINTS 
plugins.MarkerCluster(latlong_list).add_to(dmv_job_map)

# SHOW MAP
dmv_job_map
```

This is an interactive map to demonstrate how jobs are distributed around the DMV (District of Columbia, Maryland, and Virginia) region. It provides insightful information on the geographic distribution of employment prospects within the DMV region by visually portraying the job locations. The map's markers emphasize the precise areas where job openings are present, giving a clear picture of job concentrations and hotspots. The ability to identify areas with a higher density of employment prospects and make educated decisions about their job search and prospective relocation is one of the main benefits of this information for job seekers.

Furthermore, the marker clustering feature used in the map aids in identifying regions with a high concentration of employment opportunities. The clustering technique assembles neighboring job locations into clusters, each of which is symbolized by a single marker. This makes it simple for visitors to pinpoint areas with lots of employment prospects. Job searchers can zoom in on these clusters to learn more about individual regions and the regional labor market by doing so. As a result, the map is an effective resource for both job seekers and employers, offering a thorough picture of the locations and concentrations of jobs in the DMV region and eventually assisting in decision-making related to job search and recruitment efforts.
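Since the bounding boxes above are axis-aligned rectangles, the same containment test can be sanity-checked with plain comparisons, no geometry library needed. A minimal sketch using a Washington, DC coordinate and a London coordinate (both illustrative, not rows from the dataset):

```python
# Contiguous-U.S. bounding box from the filter above
contiguous_us = {'min_lat': 24.9493, 'min_long': -124.7333,
                 'max_lat': 49.5904, 'max_long': -66.9548}

def in_box(lat, long, box):
    """Plain range check, equivalent to point.within(box) for an axis-aligned rectangle."""
    return (box['min_lat'] <= lat <= box['max_lat']
            and box['min_long'] <= long <= box['max_long'])

in_box(38.9072, -77.0369, contiguous_us)   # Washington, DC -> True
in_box(51.5074, -0.1278, contiguous_us)    # London -> False
```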


I hope you now have a clear idea of the number of jobs around this region. Since you have read this far, I am also assuming that you would be interested in knowing more about the jobs in the DMV.

Don't worry, I have got you covered. Let me walk you through it step by step so that you are mentally prepared to make your crucial decision.

## Textual Analyses

The dataset provided certainly revolves around text data, so I thought I would use the NLP concepts I gained from the ANLY-580 (Natural Language Processing) and ANLY-521 (Computational Linguistics) courses. I would recommend you take these courses too, as they have proven to be very beneficial.

Coming to handling the text data, I have created some functions that run in sequence, as if they were part of a pipeline.

```{python}
def remove_punct(text):
    """ A method to remove punctuations from text """
    text  = "".join([char for char in text if char not in punctuation])
    text = re.sub('[0-9]+', '', text) #removes numbers from text
    return text

def remove_stopwords(text):
    """ A method to remove all the stopwords """
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopwords]
    return text

def tokenization(text):
    """ A method to tokenize text data """
    text = re.split(r'\W+', text) # splitting the text into its individual words
    return text

def stemming(text):
    """ A method to perform stemming on text data"""
    porter_stem = nltk.PorterStemmer()
    text = [porter_stem.stem(word) for word in text]
    return text

def lemmatizer(text):
    word_net_lemma = nltk.WordNetLemmatizer()
    text = [word_net_lemma.lemmatize(word) for word in text]
    return text

# Making a common cleaning function for every part below for code reproducibility
def clean_words(list_words):
    # Making a regex pattern to match in the characters we would like to replace from the words
    character_replace = ",()0123456789.?!@#$%&;*:_,/" 
    pattern = "[" + character_replace + "]"
    new_list_words = []
    
    # Looping through every word to remove the characters and appending back to a new list
    # replace is being used for the characters that could not be caught through regex
    for s in list_words:
        new_word = s.lower()
        new_word = re.sub(pattern,"",new_word)
        new_word = new_word.replace('[', '')
        new_word = new_word.replace(']', '')
        new_word = new_word.replace('-', '')
        new_word = new_word.replace('—', '')
        new_word = new_word.replace('“', '')
        new_word = new_word.replace("’", '')
        new_word = new_word.replace("”", '')
        new_word = new_word.replace("‘", '')
        new_word = new_word.replace('"', '')
        new_word = new_word.replace("'", '')
        new_word = new_word.replace(" ", '')
        new_list_words.append(new_word)

    # Using filter to remove empty strings
    new_list_words = list(filter(None, new_list_words))
    return new_list_words

def clean_text(corpus):
    """ A method to do basic data cleaning """
    
    # Remove punctuation and numbers from the text
    clean_text = remove_punct(corpus)
    
    # Tokenize the text into individual words
    text_tokenized = tokenization(clean_text.lower())
    
    # Remove stopwords from the tokenized text
    text_without_stop = remove_stopwords(text_tokenized)
    
    # Perform lemmatization on the text (lemmas are kept for the final output,
    # so the stemming helper is not called here)
    text_lemmatized = lemmatizer(text_without_stop)
    
    # Further clean and process the words
    text_final = clean_words(text_lemmatized)
    
    # Join the cleaned words back into a single string
    return " ".join(text_final)
```

How did I create the above pipeline of cleaning text data? The answer to this question would again be taking either of the above courses mentioned.
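For readers who want to see the sequence end to end without the NLTK dependencies, here is a condensed, dependency-free sketch of the same stages: punctuation and digit stripping, lowercasing, tokenizing, and stopword removal. The stemming/lemmatization steps are omitted, and the tiny stopword set is illustrative only, not the full NLTK list:

```python
import re
from string import punctuation

def mini_clean(text, stopwords=frozenset({"a", "the", "for", "and", "in", "of"})):
    # Strip punctuation, then digits, and lowercase
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"[0-9]+", "", text).lower()
    # Tokenize on non-word characters, then drop stopwords and empty strings
    tokens = re.split(r"\W+", text)
    return " ".join(t for t in tokens if t and t not in stopwords)

mini_clean("Analyze 100+ datasets for the DMV region!")
# 'analyze datasets dmv region'
```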

Moving on, for our very first textual analysis, I will be using the pipeline created above on the 'description' column.

```{python}
descript_list = []
for descript in dc_jobs['description']:
    descript_list.append(clean_text(descript))
```

Now that the data has been cleaned, I have used the code below to create a word cloud that can provide you with some information about the descriptions of Data Science jobs in the DMV area.

```{python}
# Join the list of descriptions into a single string
text = ' '.join(descript_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

The generated word cloud provides a visual representation of the most frequent words in the descriptions of data science jobs. By analyzing the word cloud, we can identify some important words that stand out:

1. "Data": This word indicates the central focus of data science jobs. It highlights the importance of working with data, analyzing it, and extracting insights. Job seekers should emphasize their skills and experience related to data handling, data analysis, and data-driven decision-making.

2. "Experience": This word suggests that job seekers should pay attention to the level of experience required for data science positions. Employers often look for candidates with relevant industry experience or specific technical skills. Job seekers should tailor their resumes to showcase their experience and highlight relevant projects or accomplishments.

3. "Machine Learning": This term highlights the growing demand for machine learning expertise in data science roles. Job seekers should focus on showcasing their knowledge and experience in machine learning algorithms, model development, and implementation.

4. "Skills": This word emphasizes the importance of having a diverse skill set in data science. Job seekers should highlight their proficiency in programming languages (e.g., Python, R), statistical analysis, data visualization, and other relevant tools and technologies.

5. "Analytics": This term suggests that data science positions often involve working with analytics tools and techniques. Job seekers should demonstrate their ability to extract insights from data, perform statistical analysis, and apply analytical approaches to solve complex problems.

Overall, I would advise job seekers to pay attention to the recurring words in the word cloud and tailor their resumes and job applications accordingly: emphasize experience with data, machine learning, relevant skills, and analytics. Additionally, highlight any unique qualifications or specific domain expertise that aligns with the requirements of the data science roles you are interested in.
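If you want the frequencies behind a word cloud as numbers rather than font sizes, the same cleaned text can also be tallied directly. A minimal sketch with `collections.Counter`; the two strings below are illustrative stand-ins for the cleaned descriptions, not dataset records:

```python
from collections import Counter

# Toy stand-ins for the cleaned job descriptions
cleaned_descriptions = [
    "data scientist experience machine learning python",
    "data analyst experience sql data visualization",
]

tokens = " ".join(cleaned_descriptions).split()
top_words = Counter(tokens).most_common(2)
# top_words == [('data', 3), ('experience', 2)]
```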

What are the responsibilities of a Data Scientist, Machine Learning Engineer, or Data Analyst? Let's find out by running the pipeline on the 'responsibilities' column and generating its word cloud.

```{python}
# Removing missing values from responsibilities for text cleaning
dc_jobs.dropna(subset=['responsibilities'], inplace=True)

response_list = []
for response in dc_jobs['responsibilities']:
    response_list.append(clean_text(response))

# Join the list of descriptions into a single string
text = ' '.join(response_list)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='yellow', color_func=lambda *args, **kwargs: 'black').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

Similar to the description word cloud, we see that words such as 'data', 'machine learning', 'design', 'big data', 'project', 'model', and 'development' are prevalent.

This indicates that when you join a company as a Data Scientist or in a similar role, you will be looped into a project that may involve machine learning or big data. You may be required to do some development, generate some models, and provide an analysis much like the one I am doing right now.

My advice here would be to practice as much as you can. Be it coding, maths, statistics, machine learning, or any other data science related concept, if you practice you will never fall behind. I would also encourage job seekers to do a lot of projects. Projects help you adjust to a formal way of working: using GitHub and connecting with your teammates over Zoom or Google Meet to discuss the project's agenda can shape you up for working in a corporate environment.


At last, we have the moment of truth: are you capable of doing this job or not? What qualities must one have to be a suitable fit for the employer?

Let's check this out.

```{python}
qualif_list = []
for qualif in dc_jobs['qualifications']:
    qualif_list.append(clean_text(qualif))

# Join the list of descriptions into a single string
text = ' '.join(qualif_list)

# Generate the word cloud with a custom background color
wordcloud = WordCloud(width=800, height=400, background_color='green', color_func=lambda *args, **kwargs: 'black').generate(text)

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')

# Display the  word cloud
plt.show()
```

As per the word cloud, I can point out certain keywords, which in turn are the qualities and skills that job seekers must have in order to be qualified for a Data Science related job. These are as follows:

1. Python: Python is a popular programming language widely used in data science. Its presence in the word cloud suggests that proficiency in Python is important for data science job roles. Job seekers should focus on acquiring or highlighting their Python skills to increase their chances of success in data science positions.

2. Work Experience: The inclusion of "Work Experience" emphasizes the importance of relevant work experience in the field of data science. Job seekers should consider showcasing their practical experience and projects related to data science to demonstrate their expertise and ability to apply concepts in real-world scenarios.

3. Data Science: The prominence of "Data Science" indicates that job seekers should have a strong foundation in data science concepts, techniques, and methodologies. Employers are likely looking for candidates who possess a solid understanding of data analysis, statistical modeling, data visualization, and machine learning algorithms.

4. Bachelor Degree: The presence of "Bachelor Degree" suggests that having a bachelor's degree, preferably in a related field such as computer science, mathematics, or statistics, is often a minimum requirement for data science roles. Job seekers should ensure they meet the educational qualifications specified in the job descriptions.

5. Machine Learning and Deep Learning: The inclusion of "Machine Learning" and "Deep Learning" highlights the increasing demand for expertise in these areas within the field of data science. Job seekers should consider acquiring knowledge and practical experience in machine learning and deep learning techniques, algorithms, and frameworks to enhance their competitiveness in the job market.

6. Communication Skills: The mention of "Communication Skill" underscores the importance of effective communication for data scientists. Job seekers should focus not only on technical skills but also on developing strong communication skills, including the ability to present findings, explain complex concepts to non-technical stakeholders, and collaborate effectively within interdisciplinary teams.

Overall, this word cloud suggests that job seekers in the field of data science should prioritize acquiring or highlighting skills in Python programming, gaining relevant work experience, having a solid understanding of data science principles, possessing a bachelor's degree, particularly in a related field, and developing strong communication skills. Additionally, focusing on machine learning and deep learning techniques can further enhance their prospects in the job market.

## Visualizing Salaries

Finally!!

I know that ever since the beginning you have been waiting for this. Scrolling and soaking in every tiny bit of information provided above, you have been waiting for the visualizations depicting salaries. I would say you've earned it.

Now that you know about the geographical aspect of these jobs, what you will do in a particular role, what your responsibilities will be, and what you can do to make yourself qualified for that job, it's worth knowing about the pay scale.

### Using benefits

Coming to my first visualization: I have generated it with plotly for the yearly salaries extracted from the benefits of the jobs.

```{python}
# Filter the dataframe by yearly salary status
status_filtered_df = ben_filtered_df[ben_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(status_filtered_df['title'])
company_names = list(status_filtered_df['company_name'])
min_salaries = list(status_filtered_df['min_salary'])
max_salaries = list(status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()
```

The plot above visualizes the salary ranges for the various job titles. The position along the x-axis denotes the salary range, and each data point on the plot is associated with a particular job title. The job titles are displayed on the y-axis, making it simple to compare and identify the salary ranges for various positions.

The salaries extend from $1,500 for a Research Scientist in Machine Learning at OpenAI to $370,000 for the same role at the same company.

For job seekers, this plot is quite useful because it provides information on the expected salaries for various job titles. Job searchers can better comprehend the possible earning potential for various roles by examining the distribution of salary ranges. When evaluating employment opportunities and negotiating compensation packages, this information might be helpful.

Additionally, the plot makes it possible for job seekers to spot any differences in salary ranges among positions with the same title. They can identify outliers or ranges that are unusually high or low in comparison to others, which may point to variables impacting the wage such as experience level, area of speciality, or geographic location.

In the end, this visualization enables job seekers to make better selections throughout the hiring process. It enables individuals to focus on options that coincide with their financial aspirations by assisting them in matching their professional goals and expectations with the wage ranges associated with particular job titles.

#### Anomaly

I tried to generate the same plot for the hourly wages in the data too. But it turns out that, with only four such records, generating that plot made no sense.
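To show what that filter looks like in isolation, here is a minimal sketch of the same `salary_status` threshold rule on a toy dataframe (the rows are illustrative, not records from the dataset):

```python
import pandas as pd

# Toy stand-in for the filtered salary dataframe
toy = pd.DataFrame({
    'title': ['Data Analyst', 'Data Scientist', 'ML Engineer', 'Statistician'],
    'min_salary': [25.0, 120000.0, 85.5, 95000.0],
})

# Same rule as salary_status(): min_salary <= 100 is treated as an hourly rate
hourly = toy[toy['min_salary'] <= 100]
len(hourly)   # 2
```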

### Using description

```{python}
# Filter the dataframe by yearly salary status
desc_status_filtered_df = desc_filtered_df[desc_filtered_df['salary_status'] == 'Yearly']

# Extract relevant data columns
job_titles = list(desc_status_filtered_df['title'])
company_names = list(desc_status_filtered_df['company_name'])
min_salaries = list(desc_status_filtered_df['min_salary'])
max_salaries = list(desc_status_filtered_df['max_salary'])
salary_ranges = list(zip(min_salaries, max_salaries))

# Create the figure and add the traces
fig = go.Figure()

for i, (title, company, salary_range) in enumerate(zip(job_titles, company_names, salary_ranges)):
    # Create hover text with job title, company, and salary range
    hover_text = f"{title}<br>Company: {company}<br>Salary Range: ${salary_range[0]:,} - ${salary_range[1]:,}"
    
    # Add a scatter trace for each job title
    fig.add_trace(go.Scatter(
        x=[salary_range[0], salary_range[1]],
        y=[title, title],
        mode='lines+markers',
        name=title,
        line=dict(width=4),
        marker=dict(size=10),
        hovertemplate=hover_text,
    ))

# Customize the layout
fig.update_layout(
    title='Salary Range for Different Job Titles',
    xaxis_title='Salary',
    yaxis_title='Job Title',
    hovermode='closest',
    showlegend=False,
    width=1500,  # Specify the desired width
    height=600  # Specify the desired height
)

# Show the interactive graph
fig.show()
```

Similar to the plot generated using benefits, this plot too provides information about the salary ranges for different job titles. Each job title is represented by a data point on the plot, with the x-axis indicating the salary range and the y-axis indicating the job title.

In the plot above, you will also see that some jobs, such as Big Data Cloud Architect at Tephra Inc., have only one salary, i.e. $5,090. This value was provided in the data itself.

The dumbbell plots generated using salary ranges extracted from benefits and description provide a holistic overview of the salaries offered by employers around the DMV area.

# Limitations

It can be said that this dataset isn't perfect after all. I have given my best effort to draw as much meaningful information out of this data as possible, but it certainly has some anomalies.

One can see that the plotly visuals for salaries extracted from benefits and description may show job titles that are not present in the other plot, or vice-versa. If that is the case, it can only mean one thing: the salary was provided in either the benefits or the description, but not both.

# Conclusions

The project's insightful findings have highlighted the DMV region as a prime focus area for aspiring Data Scientists, offering abundant job opportunities and immense potential for career advancement. The concentrated presence of relevant job postings in this region makes it an ideal destination for professionals seeking to thrive in the field.

Furthermore, the analysis has emphasized the critical importance of comprehensive job postings. Companies that provide detailed information on aspects such as salary, benefits, qualifications, and requirements not only demonstrate transparency but also exhibit a genuine consideration for prospective candidates. These organizations are highly sought-after by top talent, making them highly desirable employment prospects.

By immersing myself in the exploration of Data Science job opportunities throughout the DMV area, I have gained invaluable knowledge that will serve as a compass for my career decisions and shape my professional trajectory. I sincerely hope that you, too, have derived substantial benefits from this analysis, enabling you to grasp the intricacies and dynamics of the Data Science job market with a profound understanding.