
Scraping Glassdoor Job Data

Glassdoor stores over 100 million reviews, salaries, and insights, has 2.2 million employers actively posting jobs to its marketplace, and gets about 59 million unique visits per month. With so much data and demand, Glassdoor is a gold mine for job and company data.

In today’s tutorial, we’ll be scraping job listing data from Glassdoor without using any type of headless browser or logging in to the site, keeping our activities legal and ethically responsible.

Speaking of legal…

Is It Legal to Scrape Glassdoor?

The straightforward answer is yes, scraping Glassdoor is legal as long as you don’t break some essential rules.

In general, Glassdoor doesn’t like to be scraped, as stated in its Terms of Use. However, there are some nuances as you have to accept these terms for them to affect you.

For example, creating an account and then scraping data behind the login wall would be considered illegal because you agreed to the terms of use the moment you created the account.

That said, all pages that are accessible without an account are considered public; therefore, it’s legal for you to scrape those pages.

Still, we recommend using some best practices to ensure you’re treating the website respectfully and not harming its users.

Scraping Glassdoor Jobs in JavaScript

For this project, we will scrape Glassdoor’s part-time job opportunities in Milano to extract the job title, company hiring, and the link to the job posting.

These category pages are publicly available (they’re not behind any kind of log-in or paywall), so we’re doing this 100% white hat.

Requirements

Although we’ll explain every step of the process, we assume you have the basic knowledge covered. However, if you ever feel lost or confused, it can help to work through a few easier web scraping projects first and build up your skills gradually.

With this out of the way, let’s start by setting up the project.

1. Getting the Project Set Up

To get everything up and running, you’ll need to create a new folder for your project (we named our folder glassdoor-scraper) and open it in VS Code or your favorite IDE.

Once inside the folder, open a terminal and initialize a Node.js project like this:

    npm init -y

This creates a package.json file inside your project; a package-lock.json file is added next to it once you install dependencies.

Then, we’ll install our favorite three dependencies:

    npm install axios cheerio objects-to-csv

From there, create a new file named glassdoorScraper.js and import the dependencies at the top:

    const axios = require("axios");
    const cheerio = require("cheerio");
    const ObjectsToCsv = require("objects-to-csv");

For the next step, let’s explore Glassdoor to understand how to access each data point we’re looking for.

2. Understanding Glassdoor’s Pages

Clicking on the link “Offerte di lavoro Part time a Milano” takes you to a list of job listings, shown as cards on the left with more details on the right. Each card represents a job and contains all the information we’re looking for: company name, job title, and a link to the job post.

But we’re not interested in the visual rendering, are we? To find our CSS targets, let’s inspect the page and see how these cards are structured.

The first element on the card is the company’s name, which is three levels down inside its container: div > a > span. This is important to notice because the <span> containing the text doesn’t have any attribute we could target.

If we just go for <span> inside our parser, we would extract all <span> elements on the page – that’s less than great.

Instead, we can go up a level and target the parent <a> tag because it has plenty of attributes to go for.

The selector for the company’s name would look something like “a.job-search-key-l2wjgv.e1n63ojh0.jobLink > span”.

In other words, we’re targeting every <a> tag that has the classes job-search-key-l2wjgv, e1n63ojh0, and jobLink, and then moving down to the <span> child.

Note: This tag has three different class values assigned. In the page’s HTML, these values are separated by spaces, like in the image above. In a CSS selector, a space means “descendant of”, so when constructing the selector in your code, put a dot in front of each class name and join them without spaces.

If you do the same process for the rest of the elements, it will look something like this:

  • For the Job Title, we’re actually going after the data-test attribute: “a[data-test='job-link'] > span”
  • For the URL, we’re going to use the same selector as for the Job Title but without the <span> element.

But how can we tell if these are going to work? Well, we could build the scraper and try out the selectors on the first page, but if they don’t work, will we just keep sending request after request?

No! Before we put our IP in danger, it’s better to use the browser’s console to try out these selectors.

3. Testing Selectors Inside the Browser’s Console

Right from where you are, click on the Console tab. You’ll see a lot of information printed there.

To get rid of it, press CTRL + L on your keyboard to clear the console.

With a clean slate, let’s pass the first selector to the querySelectorAll() function and see what gets returned:

Nice, that worked! As you can see, it returns a total of 30 nodes, and as we hover over them, they highlight the company’s name on each card. Plus, now we know there are 30 jobs per page.

Try testing the rest of the selectors to see the process yourself. When you’re ready, let’s go back to VS Code.

4. Sending the HTTP Request Through Axios

You know what you want to extract, and now you know where to find it. It’s time to send our scraper out to the wild.

In our glassdoorScraper.js file, let’s create a new async function and initialize Axios by passing the URL we’re targeting.

    (async function () {
      const page = await axios("");
    })();

Oh! But we haven’t chosen a URL yet, have we? Going back to the current page, the URL looks something like this:

    https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16.htm

But you should never take the first URL without first evaluating if there’s a better variant.

Case in point, if we navigate the rest of the URLs in the paginated series, here’s a common trend from page to page:

Page 2:

https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP2.htm?includeNoSalaryJobs=true&pgc=AB4AAYEAHgAAAAAAAAAAAAAAAeJsfSYASAEBAQ0BkGkLZy7wZR4%2F2Zo9gFfJc%2BaGfJR2hsdPG88aYkQEq%2BZCuA1D8cX0auxYd5YLWXw4PlrFLs6CbF64VTKidMy%2FVVlQewAA

There’s a lot of noise in these URLs, but take a closer look at the base: everything up to .htm, where the _IP2 suffix marks the page number, followed by the includeNoSalaryJobs=true parameter.

If we use just that part and drop the long pgc token, we get the same results as if we were moving through the pagination. So let’s use that structure from now on.

    (async function () {
      const page = await axios(
        "https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP1.htm?includeNoSalaryJobs=true"
      );

      console.log(page.status);
    })();

And we’re console logging for good measure.

Awesome, a 200 status code, which means the request succeeded! However, before we continue, we’ll need to do one more thing to make our scraper more resilient once we scale our project to more requests.

5. Integrating ScraperAPI to Avoid Getting Blocked

Something to consider while scraping high-traffic or data-heavy websites is that most of them don’t like to be scraped, so they have several tricks up their sleeves to block your scripts from accessing their servers.

To get around this, you’d need to code behaviors that convince servers your scraper is a real human interacting with the page: dealing with CAPTCHAs, creating and maintaining a pool of IP addresses to rotate through, sending the right headers, and changing your IP location to access geo-sensitive data.

Or we can use a simple API to handle all of this for us.

ScraperAPI uses machine learning, years of statistical analysis, and huge browser farms to prevent your scraping bots from getting flagged and blocked.

First, let’s create a free ScraperAPI account to generate an API key – which you’ll find in your dashboard.

And we’ll use the following structure to modify our initial request:

    http://api.scraperapi.com?api_key={yourApiKey}&url=https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP1.htm?includeNoSalaryJobs=true

Now, our request will be sent from ScraperAPI’s servers, rotating our IP address in every request and handling all complexities and anti-scraping systems our scraper encounters.
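One detail worth watching in this structure: the target URL carries its own query string (?includeNoSalaryJobs=true). If it ever gained a second parameter, the & would be read as part of the ScraperAPI request rather than the Glassdoor one. Below is a minimal sketch of percent-encoding the target first, written in Python (the language of the second half of this tutorial; URLSearchParams does the same job in Node). The helper name build_scraperapi_url is ours and YOUR_API_KEY is a placeholder; only the api_key and url parameters come from the structure above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SCRAPERAPI_ENDPOINT = "http://api.scraperapi.com"  # endpoint shown above


def build_scraperapi_url(api_key: str, target_url: str) -> str:
    """Return the proxy URL with the target URL safely percent-encoded."""
    # urlencode escapes '?', '&', '=' and ',' inside target_url, so the
    # target's own query string cannot leak into the outer request.
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{SCRAPERAPI_ENDPOINT}?{query}"


target = (
    "https://www.glassdoor.it/Lavoro/milano-part-time-lavori-"
    "SRCH_IL.0,6_IC2802090_KO7,16_IP1.htm?includeNoSalaryJobs=true"
)
proxied = build_scraperapi_url("YOUR_API_KEY", target)

# Decoding the query string gives back the original target unchanged.
decoded = parse_qs(urlsplit(proxied).query)
print(decoded["url"][0] == target)
```

Requests can also do this encoding for you if you pass the two values through its params argument instead of building the string by hand.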

6. Parsing the Response with Cheerio

The fun part begins! The first step toward extracting our desired data is to parse the response so we can navigate through the nodes and pick the elements using the previously built selectors.

    const html = page.data;
    const $ = cheerio.load(html);

What you’ve just done is store the response’s data (which is HTML) in a variable and then pass it to Cheerio for parsing.

Cheerio parses the HTML into a tree of nodes that we can traverse using CSS selectors.

Still, there’s one selector we haven’t discussed yet: the main container.

On the page, every job listing is represented by a card, and each card contains the data we want. To make it easier for our scraper to find the information – and reduce the likelihood of useless data leaking into our project – we first need to pick all the cards and then loop through them to extract the data points.

Every card is a <li> element, and we can pick them using the [data-test="jobListing"] attribute.

Note: You can’t see it on the image because of the cut in the screenshot, but you’ll be able to find the attribute on the page.

So here’s how we can write the entire parser:

    let allJobs = $('[data-test="jobListing"]');
    allJobs.each((index, element) => {
      const jobTitle = $(element).find("a[data-test='job-link'] > span").text();
      const company = $(element)
        .find("a.job-search-key-l2wjgv.e1n63ojh0.jobLink > span")
        .text();
      const jobLink = $(element).find("a[data-test='job-link']").attr("href");
    });

Notice the .text() method at the end of the chain? As you’ve probably figured out, the method extracts the text from the element. Without it, .find() would hand us a Cheerio object wrapping the matched elements rather than the text itself, which is not very helpful.

On the other hand, when we want to extract the value of an attribute within an element, we can use the .attr() method and pass the attribute from which we want the value.

If we ran our script now, nothing would actually happen because the script is not doing anything with the data it’s picking.

We can go ahead and log the data to the terminal, but it will all be very confusing to see. So, before we log it, let’s format it using an array.

7. Pushing the Data to an Empty Array

Outside of the main async function, create an empty array like so:

    let jobListings = [];

To add the scraped data inside, all we need to use is the .push() method on the array:

    jobListings.push({
      "Job Title": jobTitle,
      "Hiring Company": company,
      "Job Link": "https://www.glassdoor.it" + jobLink,
    });

Did you catch that? We’re pushing a string before the returned value from jobLink. But why?

This is exactly why web scraping is about the details. Let’s go back to the page and see the href value:

There’s a lot of information there, but there’s a piece missing from the URL: “https://www.glassdoor.it”. The href is a relative URL, which is common practice on websites, so the domain isn’t part of it.

By concatenating the domain as a string with jobLink’s value, we turn it into a complete, usable link.

With this out of the way, let’s test our code by console logging the resulting array:

Excellent work so far; you’ve built the hardest part! Now, let’s take that data out of the terminal, shall we?

8. Building the CSV File

Exporting the scraped information to a CSV file is actually quite simple, thanks to the ObjectsToCsv package. All you’ll need to do is add the following snippet outside the .each() method:

    const csv = new ObjectsToCsv(jobListings);
    await csv.toDisk("./glassdoorJobs.csv", { append: true });
    console.log("Save to CSV");

It’s important that we set append to true so we don’t overwrite the file every time we write to it.

We’ve tested this snippet before, so don’t run your code yet. There’s one more thing to do first.

9. Dealing with Paginated Pages

We have already figured out how the URL structure changes from page to page within the paginated series. With that intel, we can create a for loop to increase the IP{x} number until we reach the last page in the pagination:

    for (let pageNumber = 1; pageNumber < 31; pageNumber += 1) {}

Also, we’ll need to add this number dynamically in the axios() request, using a template literal (note the backticks):

    const page = await axios(
      `http://api.scraperapi.com?api_key={yourApiKey}&url=https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP${pageNumber}.htm?includeNoSalaryJobs=true`
    );

Finally, we move the entire code inside the for loop – leaving the CSV part out of the loop for simplicity’s sake.

10. Test Run and Full Glassdoor Node.js Scraper

If you’ve been following along (if you came directly to this section: Hi!), your code base should look like this:

    const axios = require("axios");
    const cheerio = require("cheerio");
    const ObjectsToCsv = require("objects-to-csv");

    // Collected across all pages, then written to disk once at the end.
    let jobListings = [];

    (async function () {
      for (let pageNumber = 1; pageNumber < 31; pageNumber += 1) {
        // Replace {yourApiKey} with the key from your ScraperAPI dashboard.
        const page = await axios(
          `http://api.scraperapi.com?api_key={yourApiKey}&url=https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP${pageNumber}.htm?includeNoSalaryJobs=true`
        );
        const html = page.data;
        const $ = cheerio.load(html);

        // Each job card is an <li> carrying data-test="jobListing".
        let allJobs = $('[data-test="jobListing"]');
        allJobs.each((index, element) => {
          const jobTitle = $(element).find("a[data-test='job-link'] > span").text();
          const company = $(element).find("a.e1n63ojh0 > span").text();
          const jobLink = $(element).find("a[data-test='job-link']").attr("href");
          jobListings.push({
            "Job Title": jobTitle,
            "Hiring Company": company,
            // The href is relative, so prepend the domain (no trailing slash).
            "Job Link": "https://www.glassdoor.it" + jobLink,
          });
        });

        console.log(pageNumber + " Done!");
      }

      const csv = new ObjectsToCsv(jobListings);
      await csv.toDisk("./glassdoorJobs.csv");
      console.log("Save to CSV");
      console.log(jobListings);
    })();

After running your code, a new CSV file will be created inside your folder.

Note: For this to work, remember that you need to add your ScraperAPI key to the script, replacing the {yourApiKey} placeholder.

We made a few changes:

  • First, we added a console.log(pageNumber + " Done!") line to give visual feedback while the script runs.
  • Second, we deleted the { append: true } argument from the .toDisk() method; as it is no longer inside the for loop, we won’t be adding (appending) any more data to the file.

Congratulations, you built your first Glassdoor scraper in JavaScript!

You can use the same principles to scrape basically every page on Glassdoor, and using the same logic, you can translate this script to other languages.

Let’s create a Python script to do the same thing as a demonstration.

Scraping Glassdoor Jobs in Python

When writing a Glassdoor scraper in Python, you might be inclined to use a tool like Selenium. However, just like with JavaScript, we don’t need to use any kind of headless browser.

Instead, we’ll use Requests and Beautiful Soup to build a loop to access and parse the HTML of the paginated pages, extracting the data as we did above.

1. Setting Up the Python Environment

Inside your project folder, create a new glassdoor-python-scraper directory and add a glassdoor_scraper.py file to it. Then install Requests and Beautiful Soup from the terminal:

    pip install requests beautifulsoup4

Finally, import both dependencies at the top of the file:

    import requests
    from bs4 import BeautifulSoup

Just like that, we’re ready for the next step.

2. Using Requests in a For Loop

For good measure, send the initial request to the server and print the status code.

    response = requests.get(
        "https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP1.htm?includeNoSalaryJobs=true")

    print(response.status_code)

Note: Remember that you’ll need to cd into the new folder before you can run your Python script.

It’s working so far! Now, let’s put this into a for loop and try to access the first three pages in the pagination. To do so, we’ll create a range from 1 to 4 (4 itself is not included) and turn the URL into an f-string with an {x} placeholder for the page number:

    for x in range(1, 4):
        # The f prefix is what makes {x} get replaced by the page number.
        response = requests.get(
            f"https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP{x}.htm?includeNoSalaryJobs=true")

        print(response.status_code)

With this simple for loop, our scraper will be able to move through the pagination without any issue.
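The {x} placeholder only gets filled in if the string is an f-string (or is passed through str.format); in a plain string it is sent to the server literally. Here is a small sketch that isolates the URL construction so it can be checked without sending a request. The helper name page_url is ours, not part of Requests.

```python
BASE = (
    "https://www.glassdoor.it/Lavoro/milano-part-time-lavori-"
    "SRCH_IL.0,6_IC2802090_KO7,16"
)


def page_url(page_number: int) -> str:
    """Build the listing URL for one page of the paginated series."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    # _IP{n} is the page marker seen in the pagination URLs.
    return f"{BASE}_IP{page_number}.htm?includeNoSalaryJobs=true"


# range(1, 4) yields 1, 2, 3 -- the stop value is excluded.
urls = [page_url(x) for x in range(1, 4)]
for url in urls:
    print(url)
```

Printing the URLs before requesting them is a cheap way to confirm the page number lands where you expect.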

3. Scraping Glassdoor Data with Beautiful Soup

For testing purposes, we don’t want our scraper to fail on three different pages, so let’s reduce the range to 1 – 2; it’ll only scrape page one.

As before, we’ll pick all the job cards using the [data-test="jobListing"] attribute selector, after parsing the response with Beautiful Soup:

    soup = BeautifulSoup(response.content, "html.parser")
    all_jobs = soup.select("[data-test='jobListing']")

With all cards stored inside the all_jobs variable, we can loop through them to extract the target data points:

    for job in all_jobs:
        job_title = job.find("a", attrs={"data-test": "job-link"}).text
        company = job.select_one(
            "a.job-search-key-l2wjgv.e1n63ojh0.jobLink > span").text
        job_link = job.find("a", attrs={"data-test": "job-link"})["href"]

Note: Using .find() to extract the company name didn’t work for us, most likely because .find() expects a tag name and attribute filters rather than a CSS selector string. The select_one() method accepts CSS selectors, so we used it instead.
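The compound selector passed to select_one() is the tag name followed by each class joined with dots, since the class attribute in the HTML separates them with spaces. A hypothetical helper (the name selector_from_classes is ours) that does the conversion:

```python
def selector_from_classes(tag: str, class_attr: str) -> str:
    """Turn a tag name and a raw class attribute value into a CSS selector."""
    # split() with no arguments also collapses repeated whitespace.
    classes = class_attr.split()
    return tag + "".join(f".{name}" for name in classes)


# The class attribute as it appears in the page's HTML:
raw = "job-search-key-l2wjgv e1n63ojh0 jobLink"
company_selector = selector_from_classes("a", raw) + " > span"
print(company_selector)
# → a.job-search-key-l2wjgv.e1n63ojh0.jobLink > span
```

Class names such as job-search-key-l2wjgv look auto-generated, so they may change when the site is updated; if the selector stops matching, re-inspect the page and rebuild it.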

4. Constructing the JSON File

We went into more detail about handling JSON files in our tutorial on scraping tabular data with Python. Still, as a brief explanation, we’ll add the data to an empty list and use the json.dump() method to store the list in a JSON file:

    glassdoor_jobs.append({
        "Job Title": job_title,
        "Company": company,
        "Job Link": "https://www.glassdoor.it" + job_link
    })

Note: You’ll need to import json at the top of the file and create an empty glassdoor_jobs = [] list outside the loop for this to work.
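The "Job Link" value is built by string concatenation, which works as long as every href starts with a slash and the base has no slash at the end. As a sketch, urljoin from Python’s standard library handles both relative and already-absolute hrefs. The helper name absolute_job_link is ours, and the sample paths below are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.glassdoor.it"


def absolute_job_link(href: str) -> str:
    """Resolve a scraped href against the site's base URL."""
    return urljoin(BASE_URL, href)


# Hypothetical hrefs, not real Glassdoor paths.
relative = absolute_job_link("/partner/jobListing.htm?pos=101")
absolute = absolute_job_link("https://www.glassdoor.it/job-listing/example")

print(relative)  # the domain is prepended
print(absolute)  # already absolute, returned unchanged
```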

With the array ready with our data in a nice format, we’ll dump the data into a JSON file with the next snippet:

    with open('glassdoor_jobs', 'w') as json_file:
        json.dump(glassdoor_jobs, json_file, indent=2)
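Here is a slightly fuller sketch of the same step, under two assumptions of ours: the file gets a .json extension so editors recognize it, and ensure_ascii=False keeps accented characters in Italian job titles readable instead of escaping them. The sample record is invented.

```python
import json
import os
import tempfile

# Invented sample data standing in for the scraped list.
glassdoor_jobs = [
    {
        "Job Title": "Addetto/a alle vendite - Part time",
        "Company": "Società Esempio",
        "Job Link": "https://www.glassdoor.it/example",
    }
]

# Written to a temp directory so the sketch stays out of your project folder.
out_path = os.path.join(tempfile.mkdtemp(), "glassdoor_jobs.json")

with open(out_path, "w", encoding="utf-8") as json_file:
    # ensure_ascii=False writes "à" as-is rather than "\u00e0".
    json.dump(glassdoor_jobs, json_file, indent=2, ensure_ascii=False)

# Reading the file back should give the same list of dictionaries.
with open(out_path, encoding="utf-8") as json_file:
    loaded = json.load(json_file)

print(loaded == glassdoor_jobs)
```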

One last thing to do: test it!

5. Test Run and Full Glassdoor Python Scraper

Without more preamble, here’s the full Python script to scrape Glassdoor job data:

    import requests
    from bs4 import BeautifulSoup
    import json

    # Replace with the key from your ScraperAPI dashboard.
    API_KEY = "your_api_key"

    glassdoor_jobs = []

    for x in range(1, 31):
        # f-string so that {API_KEY} and {x} are filled in on each iteration.
        response = requests.get(
            f"http://api.scraperapi.com?api_key={API_KEY}&url=https://www.glassdoor.it/Lavoro/milano-part-time-lavori-SRCH_IL.0,6_IC2802090_KO7,16_IP{x}.htm?includeNoSalaryJobs=true")
        soup = BeautifulSoup(response.content, "html.parser")

        # Each job card carries the data-test="jobListing" attribute.
        all_jobs = soup.select("[data-test='jobListing']")
        for job in all_jobs:
            job_title = job.find("a", attrs={"data-test": "job-link"}).text
            company = job.select_one(
                "a.job-search-key-l2wjgv.e1n63ojh0.jobLink > span").text
            job_link = job.find("a", attrs={"data-test": "job-link"})["href"]
            glassdoor_jobs.append({
                "Job Title": job_title,
                "Company": company,
                "Job Link": "https://www.glassdoor.it" + job_link
            })
        print("Page " + str(x) + " is done")

    with open('glassdoor_jobs', 'w') as json_file:
        json.dump(glassdoor_jobs, json_file, indent=2)

A few changes we’ve made:

  • We changed the range from 1 – 2 to 1 – 31. The script will stop at page 30 (as 31 is not included), which is the last page in the paginated series.
  • We added a print("Page " + str(x) + " is done") statement for visual feedback as the code runs. It converts our x variable from an integer to a string so that we can concatenate the entire phrase.
  • To protect our IP and handle any anti-scraping technique thrown at us, we’ll send our requests through ScraperAPI’s servers. You can see the new string in the initial URL and learn more about ScraperAPI functionalities with our documentation.
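The range of 30 pages is hard-coded from what we saw in the browser, so the loop would keep requesting pages even if the listing shrinks. One possible refinement, sketched here with stand-in functions: stop as soon as a page yields no job cards, and optionally pause between requests. The names crawl_pages, fetch_jobs, and fake_fetch, as well as the fake data, are ours; in the real script, fetch_jobs would wrap the requests.get() and Beautiful Soup steps shown above.

```python
import time
from typing import Callable, Dict, List

Job = Dict[str, str]


def crawl_pages(
    fetch_jobs: Callable[[int], List[Job]],
    max_pages: int = 30,
    delay_seconds: float = 0.0,
) -> List[Job]:
    """Collect jobs page by page, stopping early on the first empty page."""
    collected: List[Job] = []
    for page_number in range(1, max_pages + 1):
        jobs = fetch_jobs(page_number)
        if not jobs:
            # No cards on this page: assume we ran past the last page.
            break
        collected.extend(jobs)
        print("Page " + str(page_number) + " is done")
        time.sleep(delay_seconds)  # optional pause between requests
    return collected


# Stand-in for the network call: pretend only 3 pages exist, 2 jobs each.
def fake_fetch(page_number: int) -> List[Job]:
    if page_number > 3:
        return []
    return [{"Job Title": f"Job {page_number}-{i}"} for i in (1, 2)]


all_jobs = crawl_pages(fake_fetch, max_pages=30)
print(len(all_jobs))
```

Passing the fetching step in as a function also makes the loop easy to try out without spending API credits.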

Here’s the end result:

30 pages scraped and all data formatted into a reusable JSON file.
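If you would rather end up with a CSV, as in the JavaScript version, Python’s built-in csv module can write the same list of dictionaries. This is a sketch with invented sample rows; append_rows is a name of ours. Opening the file in "a" mode mirrors the { append: true } option discussed earlier: new rows are added after the existing ones instead of replacing the file.

```python
import csv
import os
import tempfile

FIELDS = ["Job Title", "Company", "Job Link"]


def append_rows(path: str, rows: list) -> None:
    """Append rows to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Invented sample rows; the temp directory keeps the sketch out of your project.
csv_path = os.path.join(tempfile.mkdtemp(), "glassdoor_jobs.csv")
append_rows(csv_path, [
    {"Job Title": "Barista", "Company": "A", "Job Link": "https://www.glassdoor.it/a"},
])
append_rows(csv_path, [
    {"Job Title": "Cameriere", "Company": "B", "Job Link": "https://www.glassdoor.it/b"},
])

with open(csv_path, newline="", encoding="utf-8") as csv_file:
    saved = list(csv.DictReader(csv_file))

print(len(saved))  # number of data rows; the header is not counted
```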

Wrapping Up

By scaling this project, you can scrape more pages and get even more data points. You can also narrow the results down to specific jobs, such as those with a certain title, location, or value (e.g., jobs that show a salary), and build a curated job board or job opportunity newsletter.
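As a small illustration of that kind of filtering, the sketch below keeps only the jobs whose title contains a keyword. It assumes the list-of-dictionaries shape used in this tutorial; filter_by_title and the sample entries are ours.

```python
def filter_by_title(jobs: list, keyword: str) -> list:
    """Return the jobs whose title contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [job for job in jobs if needle in job["Job Title"].casefold()]


# Invented sample entries in the same shape as the scraped data.
sample_jobs = [
    {"Job Title": "Barista Part Time", "Company": "A"},
    {"Job Title": "Addetto vendite", "Company": "B"},
    {"Job Title": "BARISTA weekend", "Company": "C"},
]

baristas = filter_by_title(sample_jobs, "barista")
print([job["Company"] for job in baristas])
# → ['A', 'C']
```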

With this much information, the sky is the limit, so keep your mind open to the possibilities.

Until next time, happy scraping!



Source: Trends Pinoy
