Apr 26 2023 · Adrian Błazeusz

Python Dashboard for 15,963 Data Analyst job listings

Interactive Matplotlib plots in Jupyter Notebook

The US job market is filled with exciting opportunities for aspiring Data Analysts. However, landing your first job can be challenging due to the diverse range of requirements employers are looking for. In this article, we analyze data from 15,963 Data Analyst job listings. We build a Jupyter Notebook with data analysis and visualization and serve it as an interactive web app. For example, we search for the most needed skills and show how they relate to the average yearly salary. Let's check what the most needed skills for a Data Analyst are!

The outline for this article:

  1. About dataset
  2. Import needed libraries
  3. Data preprocessing
  4. Creating charts
  5. Exploratory data analysis (EDA)
  6. Share your analysis as a web app
  7. Summary

Infographic

Data Analyst skills

1. About dataset

We use the Data Analyst Job Postings dataset available on kaggle.com. This dataset pulls job postings from Google's search results for Data Analyst positions in the United States every 6 hours. It has the following information about each listing:

  • title - the job title,
  • company_name - the company that listed the job posting,
  • location - city and state,
  • via - the platform the job listing was posted on,
  • description - the full description of the job posting,
  • extensions - common keywords for the job posting,
  • salary - the salary from the job posting,
  • description_tokens - tokens with skills.

Markdown for about data
  • The dataset has 15,963 job offers.
  • 81% of the salary data is missing. Understandably, most companies prefer to negotiate compensation rather than publish it.

The data and notebook from this article are available on the GitHub repository.

2. Import needed libraries

We use the Jupyter Notebook for conducting the analysis. It is a popular tool for prototyping and exploring data because of its interactive interface. It provides rich output capabilities for visualizations, plots, and interactive widgets. We need the following packages:

matplotlib>=3.6.2
mercury>=2.2.7
numpy>=1.24.0
pandas>=1.5.2
wordcloud>=1.8.2.2

In the first cell, we import required libraries:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import mercury as mr
from wordcloud import WordCloud

from dateutil.parser import parse
from string import punctuation as punct

Packages explanation:

  • numpy provides support for numerical operations,
  • pandas is a library for data manipulation and analysis,
  • matplotlib.pyplot allows you to visualize data and create charts,
  • mercury allows you to turn the notebook into a web app and add interactive widgets,
  • wordcloud allows the creation of word clouds,
  • dateutil.parser provides tools for parsing and manipulating dates and times,
  • string contains a set of constants, such as punctuation marks, that can be used in operations on text.

Load data

The data is available as a CSV file (zipped to save space). Let's load it with the pandas package:

# load the zipped CSV and drop helper index columns
df = pd.read_csv("df-job-al.zip")
df = df.drop(columns=["Unnamed: 0", "index"])

# Remove punctuation marks from strings, then split each string into a list of words.
remove_punct = lambda x: x.translate(str.maketrans('', '', punct)).split()
df.extensions = df.extensions.apply(remove_punct)
df.description_tokens = df.description_tokens.apply(remove_punct)
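
Before moving on, we can sanity-check the figures from the dataset description. The sketch below (assuming the df loaded above) prints the number of offers and the share of missing salary values:

# quick sanity check of the dataset description
print(df.shape[0])                       # number of job offers, about 15,963
print(df['salary'].isna().mean() * 100)  # share of missing salary values, about 81%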

3. Data preprocessing

Data preprocessing is one of the more difficult parts of the analysis. It requires good planning and code that can handle large amounts of data efficiently. It is worth taking the time to refine this code, because the quality of the entire analysis depends on it.

Data preprocessing.

In this section, let's prepare the data for Exploratory Data Analysis (EDA):

  • job experience: let's create a separate column (experience) in the df dataset that specifies the required experience level for each offer,
  • skills: let's create a separate dataset skills_df that contains the number of times each skill appears in the offers.

Job experience

In this step, our goal is to collect information on the level of experience required for each position. The only problem is that every offer is written differently: the required level can be described with different words. To generalize the data and extract this information, we create a set of keywords for each job grade.


# keywords to detect seniority level

junior_keywords = {'junior', 'entry level', 'intern', 'apprentice','jr','I'}

mid_keywords = {'mid-level', 'mid level', 'midlevel','mid', 'II',
    'consultant', 'intermediate', 'experienced','associate','practitioner'}

senior_keywords = {'senior', 'lead', 'principal', 'sr', 'III', 'IV', 
    'senior level', 'advanced','senior associate', 'expert'}

After creating the keyword sets, we can run a simple function that iterates over each posting; if it finds a keyword from one of the sets in the title or description, it assigns the posting to the corresponding class (Junior, Mid, or Senior). If no keyword matches, it assigns the unknown class. After checking all the job offers, we store the results in a new column called experience.

# function to assign an offer to a category
def categorize_job(row):
    title = row['title'].lower()
    description = row['description'].lower()
    for keyword in junior_keywords:
        if keyword in title or keyword in description:
            return 'Junior'
    for keyword in mid_keywords:
        if keyword in title or keyword in description:
            return 'Mid'
    for keyword in senior_keywords:
        if keyword in title or keyword in description:
            return 'Senior'
    return 'unknown'

# add column 
df['experience'] = df.apply(categorize_job, axis=1)
Data preprocessing.

Skills

In this case, to create the new skills_df dataset, we use the tokens from the description_tokens column. We iterate over them with a for loop and add each token to the skills list using the extend function. After the loop, the skills list contains all the skills appearing in the descriptions. Then, the np.unique function returns the unique skills and counts how many times each of them appears in the skills list.

# Counting skills appearing in the description and title
skills = []

for val in df.description_tokens.values:
    skills.extend(val)
skills, counts = np.unique(skills, return_counts=True)

# Create DataFrame with skills and number_of_postings columns
skills_df = pd.DataFrame({"skill": list(skills), "number_of_postings": counts})

Great, we have a new dataset. Now let's select only the skills that appear at least 10 times and sort them in descending order of count.

# Select rows with number_of_postings >= 10
skills_df = skills_df[skills_df['number_of_postings'] >= 10]

# Sort by number_of_postings in descending order
skills_df = skills_df.sort_values(by=["number_of_postings"], ascending=False)

Our data frame looks like this.

skill   number_of_postings
==========================
sql           7743
excel         5292
tableau       4291
python        4189
powerbi       4111
...           ...
aspnet         14
dplyr          13
unixlinux      12
jquery         12
js             11
==========================

4. Creating charts

Our data is ready to be presented, so let's create the charts. We will use the matplotlib library.

plt.figure(figsize=(12,6))
plt.style.use('ggplot')

level = df['experience'].value_counts()
level.plot(kind='bar')

plt.title(" Number of jobs")
plt.ylabel("Counts")
plt.xticks(rotation=45)
plt.show()

The next three charts are created in exactly the same way; just change the column name in df[ ] to one of these three (see the sketch after the list):

  • company_name
  • via
  • schedule_type
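
For instance, a minimal sketch for the company_name column could look like this (assuming the same df as above and limiting the bars to the ten most frequent companies so the axis stays readable); the via and schedule_type charts follow the same pattern:

# top 10 companies by number of job postings (same pattern as the experience chart)
plt.figure(figsize=(12,6))
companies = df['company_name'].value_counts().head(10)
companies.plot(kind='bar')

plt.title("Top 10 companies by number of postings")
plt.ylabel("Counts")
plt.xticks(rotation=45)
plt.show()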

Remote work

To make the remote-work chart, I chose the get_dummies function from the pandas library. The pd.get_dummies() function creates new columns with the prefix 'remote' and binary values for each unique value from the 'work_from_home' column. (df_lvl is the experience-filtered DataFrame introduced in section 6.)


dummies = pd.get_dummies(df_lvl['work_from_home'], prefix='remote', prefix_sep='_')

plt.figure(figsize=(12,6))
# sort by index so the "no remote work" bar comes first, matching the labels below
counts = dummies.value_counts().sort_index()
bar_name = ["You can't work remotely", 'You can work remotely']
plt.bar(bar_name, counts)

Word Cloud

We can use a word cloud to make our presentation more interesting. It is a collection of all the counted words, where the size of each word depends on how many times it appears in the text. The word cloud, just like the rest of the code, will be recomputed according to our selection in the experience column.

# create a dictionary (skill -> number_of_postings) from the skills_df data frame
word_dict = skills_df.set_index('skill')['number_of_postings'].to_dict()

# generate the word cloud
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                min_font_size = 10).generate_from_frequencies(word_dict)

Then we visualize the word cloud with matplotlib.

# Display the word cloud
plt.figure(figsize = (12, 6), facecolor = None) 
plt.imshow(wordcloud) 

plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.title('Word cloud for data analyst skills')

Word cloud for data analyst skills.

Top 10 required skills

The last chart we will prepare shows the top 10 required skills. We just need to select the first 10 rows from skills_df and present them on a bar chart.

# extract the 10 most common keywords
top_skills_df = skills_df[:10]
plt.figure(figsize=(12,6))
plt.bar(top_skills_df['skill'], top_skills_df['number_of_postings'])

plt.title('Top 10 most needed skills')
plt.ylabel('Count')

All done! The preprocessing and data visualization code is ready. 🎉

5. Exploratory Data Analysis

We can finally move on to checking our charts and drawing conclusions.

Who is looking for the largest number of Data Analysts?

The company Upwork, with almost 1,000 job offers, clearly stands out. You can find freelance contracts there.

Who is looking for the largest number of analysts?

Where can you find the most job offers?

LinkedIn comes first, and right after it we see Upwork again.

Where can you find the most job offers?

What are the types of contract?

Of course, most companies need full-time Data Analysts.

What are the types of contract?

And when it comes to remote work, more than 5,000 listings offer this option.

How many offers?

What are the 10 most needed skills?

According to the chart, the most sought-after skills are SQL, Excel and Tableau.

What skills are most in demand?

6. Share your analysis as a web app

We've done a great job creating a Python notebook. Why not share it with our colleagues as a web app? It can easily be done with the Mercury framework. You can add interactive widgets to the notebook and deploy it with a few clicks in the Mercury Cloud.

We can use interactive widgets to personalize the data and charts for recipients depending on their job experience and abilities. For example, a user can choose the job experience level with a widget, and the Mercury framework will recompute all charts in the notebook based on the selection.
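
Optionally, we can also configure how the notebook is presented as a web app with an mr.App cell near the top of the notebook. This is a minimal sketch, assuming Mercury 2.x; the title and description values are only illustrative:

# configure the web app title and description (values are illustrative)
app = mr.App(title="Data Analyst job listings",
             description="EDA of 15,963 Data Analyst job postings",
             show_code=False)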

Notebook with widget schema

Interactive widgets

Let's use a MultiSelect widget to filter job experience in the DataFrame.

lvl = mr.MultiSelect(value=['Junior','Mid','Senior'],
                     choices=['Junior','Mid','Senior'],
                     label="Choose your level")

  • value determines which options are selected by default,
  • choices determines which options are available.

Let's use widget selection to filter rows in our dataset df.

df_lvl = df[df['experience'].isin(lvl.value)]

In the case of our second dataset skills_df, the code will look as follows:

# skill selection widget
sk_set = mr.MultiSelect(value=['sql','tableau', 'python', 'r', ...],
                        choices=skills_df['skill'], 
                        label="Choose your skills")

# filter skills
sk_df = skills_df[skills_df.skill.isin(sk_set.value)]

From now on, any chart we have made so far will be based only on the selected choices, once we change df to df_lvl and skills_df to sk_df.
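
For example, the experience chart from section 4, recomputed for the widget selection, could look like this sketch (assuming the df_lvl defined above):

# number of jobs for the experience levels selected in the widget
plt.figure(figsize=(12,6))
level = df_lvl['experience'].value_counts()
level.plot(kind='bar')

plt.title("Number of jobs")
plt.ylabel("Counts")
plt.xticks(rotation=45)
plt.show()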

Start web app

In the terminal, navigate to the folder with the notebook and execute the command mercury run to start a web app at 127.0.0.1:8000. Below is a screen recording showing how to interact with the web app:

Mercury Web App created from Jupyter Notebook

You can easily deploy the notebook over the internet, for free, with the Mercury Cloud service. The notebook from this article is available online at use-cases.runmercury.com.

7. Summary

In this article, we learned how to extract information from job offers and went through the entire process of creating an EDA. Most job listings are on LinkedIn. The most significant demand is for Junior Data Analysts. It turns out that Data Analysts with knowledge of SQL, Excel, and Tableau are most in demand. Do you agree? We also showed how anyone can interact with your notebook and how, without any frontend knowledge, you can easily share your work thanks to the Mercury framework.
