---
title: Data Gathering
---

<b>This page describes the data-gathering techniques used to collect the raw data that is later used to develop the ML model.</b>

```{=html}
<table cellspacing="10" cellpadding="0">
    <tbody>
        <tr>
            <th>
                <img src="./images/Data-Collection.png"
                    style="width:500px;" align="right">
            </th>

            <th>
                <img src="./images/data-gathering.jpeg"
                    style="width:500px;" align="center">
            </th>

        </tr>
    </tbody>
</table>
<br><br>
```

Data has been collected via 4 methods:

1. <a href="#python_code">Python API</a><br><br>
2. <a href="#r_code">R API</a><br><br>
3. <a href="#web_scraping">Web Scraping</a><br><br>
4. <a href="#raw_data">Raw data</a><br>

<p id="python_code"></p>

## Data gathering using Python API 


<b>Tweepy is a Python library for accessing the Twitter Developer API. You need to sign up for a Twitter Developer Account (the <u><a
                            href="https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api"
                            target="_blank">Twitter setup page</a></u> will guide you through the process). Once you have access to the API, use the <u><a href="https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2"
                            target="_blank">step-by-step guide</a></u> to create an app and project. Remember to copy the keys into a text file on your local machine.</b>

Twitter is one of the most widely used social networks. For many organizations and people, having a great Twitter presence is a key factor to keeping their audience engaged. Part of having a great Twitter presence involves keeping your account active with new tweets and retweets, following interesting accounts, and quickly replying to your followers’ messages.

Tweepy provides a fairly easy way to access Twitter data from Python. Different types of data can be collected, with the obvious focus on the “tweet” object, and once some data has been collected the analytics possibilities are broad. One of the major applications of Tweepy is extracting tweets for sentiment or emotion analysis: the user’s emotion can be inferred from a tweet by tokenizing each word and applying machine learning algorithms to the resulting tokens. Such sentiment detection is already widely used and is likely to see even broader use in the future.

We have used Tweepy to extract tweets related to salary and employment. The resulting text data is then analyzed to predict the names of firms based on their past statements.
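The tokenization step mentioned above can be sketched as a small cleaning function; the exact cleaning rules here are illustrative, not the project's actual preprocessing:

```python
import re

def tokenize_tweet(text):
    """Lowercase a tweet, strip URLs, @mentions, and punctuation,
    and split it into word tokens for downstream ML models."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)           # drop @mentions
    text = re.sub(r"[^a-z0-9#\s]", " ", text)   # keep words and hashtags
    return text.split()

tokens = tokenize_tweet("Just got a raise! @boss https://t.co/x #salary")
# -> ['just', 'got', 'a', 'raise', '#salary']
```

Token lists like this are what get fed into the sentiment and classification models described on the Techniques pages.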

<a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Gathering/Python/Using_Twitter_API_for_gathering_tweets.ipynb"
                target="_blank"><u><b>Link to Notebook</b></u></a>


<p id="r_code"></p>

## Data gathering using R API

<b><u><a href="https://newsapi.org/" target="_blank">NEWSDATA.IO</a></u> provides an API that aggregates news from 20,000 sources. It lets users search by query, country, language, category, and domain. The API response reports the status of the request, the total number of results, and the news results themselves. Each news result contains the title, the link to the article where it was posted, the source of the article, keywords (a list of keywords related to the article), creator (the author(s) of the article), image_url, video_url, a description, the published date, and the full content.</b>

Newsdata.io’s news API provides live and historical global news from thousands of sources with fast responses. You can retrieve top stories by country, as well as search all news data and filter by category, language, source, publish date, and more. Given an API key, the API returns headlines, images, and other article metadata from a range of popular news sources as JSON.
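As an illustration in Python (the actual collection code for this section is in R), a response of the shape described above can be flattened into one record per article. The field names here follow that description but should be treated as assumptions about the JSON layout:

```python
def flatten_articles(response):
    """Pull the per-article fields described above out of a
    Newsdata.io-style JSON response: one flat dict per article."""
    rows = []
    for article in response.get("results", []):
        rows.append({
            "title": article.get("title"),
            "link": article.get("link"),
            "keywords": article.get("keywords"),
            "creator": article.get("creator"),
            "pubDate": article.get("pubDate"),
        })
    return rows

sample = {
    "status": "success",
    "totalResults": 1,
    "results": [{"title": "Tech hiring slows", "link": "https://example.com/a",
                 "keywords": ["employment"], "creator": ["Jane Doe"],
                 "pubDate": "2022-11-01"}],
}
rows = flatten_articles(sample)
```

A flat table like `rows` is then straightforward to save as CSV for the cleaning step.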


<a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Gathering/R/NewsAPI_Data_gathering.Rmd"
                target="_blank"><u><b>Link to Code</b></u></a>
                

<p id="web_scraping"></p>

## Web Scraping Levels.fyi

<b><u><a href="https://www.levels.fyi" target="_blank">Levels.fyi</a></u> is an online platform for comparing career levels and compensation packages across different companies. We scrape data from this website for our model.</b>

If you work at one of the <b>MAANG (Meta/Apple/Amazon/Netflix/Google)</b> companies, you’ve likely heard of levels.fyi. For those who haven’t, Levels makes it easy to compare and contrast career levels across different companies, and is generally considered more accurate about actual tech salaries than other compensation sites such as <u><a href="https://www.glassdoor.com/member/home/index.htm"
                        target="_blank">glassdoor.com</a></u>. Levels.fyi has a great interface, and its visualizations make it easy to compare salary bands across companies; for this project, however, the data has been analyzed in ways that the site’s pre-aggregated views do not allow.


<a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Data/Data%20Gathering/Python/Analyzing_Salaries_Scraping_LevelsFyi.ipynb"
                target="_blank"><u><b>Link to Notebook</b></u></a>

<p id="raw_data"></p>

## Raw Data Gathering

<b>Raw data has been gathered from multiple sources and hosted on GitHub.</b>

<a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/tree/main/501-project-website/501/data"
                target="_blank"><u><b>Downloaded data</b></u></a>