Exploratory Data Analysis
This page explores the cleaned data through plots and other visuals to give a better sense of what it contains.
1 Raw record data
2 Clean record data
3 Histogram of label (Sector)
4 Using AutoViz on data web-scraped from Levels.fyi
AutoViz is a library that finds the ‘relevant’ details of a dataset, groups them in logical ways, and presents the result to the user as a set of graphs. To try it on our Levels.fyi dataset, all we need to do is name the dependent variable we want to analyze (totalyearlycompensation in this example). Let’s also filter the dataset to software engineers, so that the output covers a single job type.
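The filter-then-visualize step described above might look roughly like the sketch below. The `title` column name, the "Software Engineer" value, and the helper names `filter_to_title` and `run_autoviz` are assumptions made for illustration; only `totalyearlycompensation` comes from the text. The tiny DataFrame is made-up sample data, not the scraped dataset.

```python
import pandas as pd


def filter_to_title(df: pd.DataFrame, title: str = "Software Engineer") -> pd.DataFrame:
    """Keep only the rows for one job title (the `title` column name is assumed)."""
    return df[df["title"] == title].reset_index(drop=True)


def run_autoviz(df: pd.DataFrame, target: str = "totalyearlycompensation"):
    """Pass an in-memory DataFrame to AutoViz with `target` as the dependent variable."""
    # Imported lazily so the filtering helper still works without AutoViz installed.
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    # filename stays empty because the data is supplied directly through dfte.
    return av.AutoViz(filename="", depVar=target, dfte=df, verbose=1)


# Made-up rows standing in for the scraped data.
sample = pd.DataFrame(
    {
        "title": ["Software Engineer", "Product Manager", "Software Engineer"],
        "totalyearlycompensation": [180000, 210000, 250000],
    }
)
swe = filter_to_title(sample)
# run_autoviz(swe) would then generate the charts for the filtered rows.
```

Filtering before the AutoViz call keeps the charts restricted to one job type, so differences between roles do not get mixed into the compensation plots.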
So we can see that, of the 13 features in the dataset, AutoViz dropped 6 for low information relevance. Surprisingly, company and location are among them. However, if we look at the location counts using value_counts(normalize=True) on the location column, we see that 51% of the salaries come from just 5 locations, so there may not be enough data to make location a strong indicator.
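A concentration check like the one above can be sketched as follows. The helper name `top_k_share` and the sample locations are hypothetical; on the real data the Series would be the dataset's location column.

```python
import pandas as pd


def top_k_share(values: pd.Series, k: int = 5) -> float:
    """Fraction of all rows that fall in the k most frequent values."""
    # value_counts(normalize=True) returns proportions sorted from most to least frequent.
    shares = values.value_counts(normalize=True)
    return float(shares.head(k).sum())


# Made-up locations for illustration only.
locations = pd.Series(
    ["Seattle, WA"] * 4
    + ["San Francisco, CA"] * 3
    + ["New York, NY"] * 2
    + ["Austin, TX"]
)
share = top_k_share(locations, k=2)  # share of rows in the two most common locations
```

Calling the helper with k=5 on the location column would yield the kind of figure quoted in the text.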
Let’s now look at the graphs it generated and see what we can learn. Under ‘continuous variables mapped against target variable’, we see the following:
From this we can see that:
1. Base salaries tend to top out around $250,000, regardless of total yearly compensation.
2. Stock grants and bonuses are highly variable and can comprise a large portion of a software engineer’s total yearly compensation.
3. Bonuses tend to be more evenly dispersed, while stock grants appear to form three clusters of points (at roughly 90, 50, and 20 degree angles). This could be a data quality issue, but it could also suggest considerable stock value appreciation.
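One way to probe the observations above is to express each component as a share of total compensation, since points lying along a straight line in the scatter plot would correspond to a roughly constant share. This is only a sketch: the column names `basesalary`, `stockgrantvalue`, and `bonus` are assumed, the helper `compensation_shares` is hypothetical, and the rows are made up.

```python
import pandas as pd


def compensation_shares(df: pd.DataFrame) -> pd.DataFrame:
    """Return each compensation component as a fraction of total yearly compensation."""
    total = df["totalyearlycompensation"]
    parts = df[["basesalary", "stockgrantvalue", "bonus"]]
    # Non-positive totals become NaN so they cannot cause a division by zero.
    shares = parts.div(total.where(total > 0), axis=0)
    return shares.add_suffix("_share")


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "totalyearlycompensation": [200000, 400000],
        "basesalary": [150000, 200000],
        "stockgrantvalue": [40000, 200000],
        "bonus": [10000, 0],
    }
)
shares = compensation_shares(sample)
```

A histogram of the stock share column on the real data would show whether the records bunch around a few distinct ratios, which could help separate a data entry pattern from genuine variation in stock value.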