Python web scraping simplified for beginners — Automate your workload with Selenium

Disclaimer

The information below comes from my personal learning experiences as a self-taught developer. Depending on when you read this, some parts of this article may be outdated. Please be respectful when scraping websites.

What is web scraping?

Simply put, web scraping is the process of extracting information from websites using bots. For the bot to know what needs to be extracted, you’ll need to tell it what you want using CSS Selectors (explained below).

Here are a few cases where web scraping can be used:

  • Monitoring prices of desired products
    Using a scheduling tool, you could set up a daily or weekly trigger that runs a web scraper and notifies you when the price of a specific product falls below a threshold you choose.
  • Stock market analysis/prediction
    With the help of machine learning, you could set up a scraper to check stock trends and history to notify you when you should buy or sell certain stocks.
  • Data enrichment
    Salespeople could take advantage of web scraping to enhance the data they have on their prospects to help them close more deals. For example, you could extract social media links from a company website using the proper CSS selectors.
  • Monitoring real estate listings
    Paired with a scheduling tool, a web scraper could be set up to search multiple real estate sites for your ideal property and notify you whenever a new listing shows up.
  • Lead generation
    Some people use scrapers to extract contact information from sites that hold large data sets of contacts.
  • Gathering data to train machine learning models
    The web is full of data that could be used to train machine learning models. Web scraping can make it easier to gather that information.

Search engines like Google and Bing also use scraping to power their sites.

Terminology

  • Web Driver:
    Think of a web driver as a simulated browser for bots. If you want your bot to complete actions, it needs to be done through that browser (web driver).
  • Element:
    Elements are the building blocks of a web page. In more technical terms, an element is everything from a start tag to its matching end tag in HTML, including the content in between.
  • Text editor:
    Text editors are tools used by developers to simplify writing scripts. It’s kinda like what Microsoft Word is for writers. There are many text editors out there, but I use Visual Studio Code.
  • CSS Selector:
    A CSS Selector is a pattern that matches one or more elements on a web page. This is how you define the elements you want to extract from a web page.
    W3Schools does a very good job going over this.
  • Libraries:
    Libraries are created by a third party to simplify/extend the functionality of a programming language. For example, the library I’ll be showing in this article (Selenium) isn’t built into Python, it’s something that someone else created to simplify the task of manipulating a browser/extracting data from a website.
  • Terminal:
    A terminal is a built-in tool used to communicate with your computer to accomplish certain tasks. Developers usually need to use the terminal to install certain programming languages/libraries and even run their code.
  • Root Folder:
    The root folder is a term used to identify the folder that holds your project. In the example I’ll be giving below, the root folder is “first-webscraping-project”.
  • Variable:
    In programming, variables allow developers to store values under a unique identifier that can be re-used in other parts of their code. Developers have complete control over what they name their variables apart from a few exceptions, such as reserved keywords. As soon as you assign a value to an identifier, a variable is created.
    Here’s an example of declaring a variable (x) in Python.
  • Function:
    In programming, functions are blocks of reusable code that are designed to perform a specific action. Similar to variables, functions can be named almost anything. It’s best practice to keep the names short and descriptive of what they do. In Python, a function definition starts with “def” followed by the name you choose. A function is run (called) by writing its name followed by parentheses.
  • Parameter:
    Parameters allow developers to pass values to a function. They are separated by commas and defined within the parentheses after the function’s name. In the following example, there are 3 parameters.
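
To tie the last three terms together, here’s a small sketch. The names (x, say_hello, add_three) are placeholders made up for illustration; only the “def” keyword, the parentheses and the commas are required by Python.

```python
# A variable: the identifier x now holds the value 5.
x = 5


# A function with no parameters.
def say_hello():
    return "Hello!"


# A function with 3 parameters, separated by commas.
def add_three(first, second, third):
    return first + second + third


# Functions are run by writing their name followed by parentheses.
greeting = say_hello()
total = add_three(x, 10, 20)

print(greeting)  # prints Hello!
print(total)     # prints 35
```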

Terminology (Bonus)

If you’re like me and love to go a step further in understanding how Selenium works in the background, you’ll need to understand a few more terms. Learning about classes should be a separate article since there’s a lot to cover, but I’ll try to simplify it while still touching on most aspects of it.

  • Class:
    A class is a template that includes properties and functions that can be used to describe an object. For example, most cars are generally the same. They all have a color, a make, and a model, but those properties aren’t the same for each car. Classes allow programmers to define those differences. Classes can also have actions, like accelerate or decelerate, that are written as functions inside the class (called methods).
    Here’s an example of a simple class:
  • Object:
    An object is an instance of a class. To access properties or call methods of a certain class, you’ll need to do it using the instance of the class (the object).
  • Method:
    A method is a function defined in a class that can be used by an object.
  • Property:
    A property is like a variable for a class. It describes the traits of an object.
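
Here’s a minimal sketch of the car example as a Python class. The class name Car, the speed property and the sample values are assumptions added for illustration.

```python
class Car:
    # __init__ runs when a new object is created and sets its properties.
    def __init__(self, color, make, model):
        self.color = color
        self.make = make
        self.model = model
        self.speed = 0

    # Functions defined inside the class describe actions the object can take.
    def accelerate(self, amount):
        self.speed += amount
        return self.speed

    def decelerate(self, amount):
        # Never let the speed drop below zero.
        self.speed = max(0, self.speed - amount)
        return self.speed


# my_car is an object: one specific instance of the Car class.
my_car = Car("red", "Toyota", "Corolla")
my_car.accelerate(30)
my_car.decelerate(10)

print(my_car.color)  # prints red
print(my_car.speed)  # prints 20
```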

Installations

You’ll need to make sure you have a few things installed on your computer before starting to scrape the web.

  1. Install Python: https://www.python.org/downloads/
  2. Install Pip: https://pypi.org/project/pip/ (recent Python installers already include it, so you may be able to skip this step)
  3. Install Selenium: Open Terminal and enter “pip install selenium”
    https://pypi.org/project/selenium/
  4. Install a text editor: I use Visual Studio Code

Is everything installed properly?

An easy way to check if you have everything installed properly is by doing the following:

  • Open Terminal and enter “python”. If you get any errors, Python isn’t properly installed. If it is, you should see a line with the Python version number followed by a “>>>” prompt.
  • If Python is properly installed, in the same terminal, after entering “python”, enter “import selenium”. If you didn’t receive any errors, Selenium is properly installed.
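
If you’d rather check from a script than from the interactive prompt, something like the following should work. It’s only a sketch: is_installed is a made-up helper name, and it relies on importlib.util.find_spec from Python 3’s standard library, which looks a package up without actually importing it.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if Python can find the package, without importing it."""
    return importlib.util.find_spec(package_name) is not None


# sys.version starts with the version number, e.g. "3.x.y ...".
print("Python version:", sys.version.split()[0])
print("Selenium installed:", is_installed("selenium"))
```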

Are you new to Python?

Although having a good understanding of Python isn’t necessarily required for what I’ll be covering in this article, it definitely doesn’t hurt to know a little bit about it.

Python is by far my favorite programming language and I recommend it for almost everyone that’s looking to learn how to code.

Here are a few resources that helped me a lot when learning Python.

I’ve also heard that Grasshopper is another great interactive way of learning Python.

Running an example project

Before going into details on how Selenium works, let’s create an example project to confirm that everything is set up properly.

Set up the essential folders:

First off, let’s create a folder for your project that’ll help keep everything organized. I use this format for most of my projects.

  1. Create a new folder to store your new web scraping project (I called mine first-webscraping-project)
  2. Create a new folder called “webdrivers” inside your web scraping project folder
  3. Download chromedriver
    Make sure you’re downloading the driver that matches the version of Chrome you’re currently using.
  4. Unzip the chromedriver file and move it to the newly created “webdrivers” folder

If you followed the steps above, you should be left with a folder that looks like this:

first-webscraping-project
  webdrivers
    chromedriver

Now that you have the essential setup, you can start using Selenium!

Create a Selenium script:

  • Create a new file inside of first-webscraping-project (root folder) and name it “scraper.py” (or whatever you’d prefer as long as it ends with “.py”)
  • Open the newly created scraper.py file in your text editor (Visual Studio Code) and enter the following code (make sure to save):
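
The original script isn’t reproduced here, so below is a minimal sketch that matches the behaviour described in this section: it opens Chrome, visits bing.com for 3 seconds and closes the browser. It assumes a recent Selenium 4 release, where the driver path is passed through a Service object, and the names DRIVER_PATH, TEST_URL, WAIT_SECONDS and open_briefly are placeholders.

```python
import os
import time

# Path to the chromedriver file from the folder setup above.
# On Windows the file is usually named chromedriver.exe.
DRIVER_PATH = os.path.join("webdrivers", "chromedriver")
TEST_URL = "https://www.bing.com"
WAIT_SECONDS = 3


def open_briefly(driver_path, url, seconds):
    """Open url in Chrome, wait a few seconds, then close the browser."""
    # Selenium is imported here so the path check below still runs without it.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(driver_path))
    try:
        driver.get(url)
        time.sleep(seconds)
    finally:
        # Always close the browser, even if something above fails.
        driver.quit()


# Only launch the browser if the driver file is where we expect it.
if os.path.isfile(DRIVER_PATH):
    open_briefly(DRIVER_PATH, TEST_URL, WAIT_SECONDS)
else:
    print("No chromedriver found at " + DRIVER_PATH)
```

If the script prints the “No chromedriver found” message instead of opening a browser, double-check the “webdrivers” folder from the previous step.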

Run the script:

The code above might seem scary at first, but trust me, it’ll all make sense later in this article! For now, let’s just run it to make sure you have everything set up properly!

Using Visual Studio Code
If you’re using Visual Studio Code, running your script is pretty easy! Just make sure you have the root folder (first-webscraping-project) opened in Visual Studio Code.

  • In the top bar of Visual Studio Code, click the “Terminal” dropdown and select “New Terminal”.

This will open up a new terminal at the bottom of the page. There you’ll be able to write “python scraper.py” to run your code. If you decided to rename your script to something else, just replace “scraper.py” with the name of your script.

If a new browser opens automatically and visits bing.com for 3 seconds, congratulations, you’ve got everything set up correctly!

Using computer terminal
This second option requires a few more steps, but it’s still another great way of running your code.

  • Open a new terminal, start writing “cd “, then drag your root folder into the terminal window. Make sure to add a space after “cd”.

Once you’ve finished that, you’ll be able to run your script by writing “python scraper.py” inside of the terminal!

If a new browser opens and visits bing.com for 3 seconds, congratulations, you’ve got everything installed properly!

How Selenium works

Now that we’ve made sure you have everything properly set up, you can start working on your own project!

Step 1: Import webdriver

When you want to use Selenium in a project, you’ll need to first write a line of code to import it. Instead of importing the whole library, you’ll only need to import “webdriver” for what we’ll go over in this article.

from selenium import webdriver

Step 2: Create a browser instance

After that, you’ll need to create and store a browser instance in a variable. The browser instance is used to simulate a browser and run specific actions (listed below). Here’s how you create a variable of a browser instance:

driver = webdriver.Chrome("webdrivers/chromedriver")

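The path inside the parentheses points to the webdriver you downloaded in an earlier step. Recent Selenium 4 releases no longer accept the path this way; there it is wrapped in a Service object, as in webdriver.Chrome(service=Service("webdrivers/chromedriver")), with Service imported from selenium.webdriver.chrome.service.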

Step 3: Define your actions

Selenium has a list of specific actions that can be run on the browser level. For example, if you wanted to open a specific URL, you’d need to first open a driver, then use the get() method.

driver = webdriver.Chrome("webdrivers/chromedriver")
driver.get('https://google.com')

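There are also a few methods (actions) that can only be run after another method. For example, if you’re trying to click a specific button (.click()), Selenium first needs to know which button you’d like to click. In that case, you’d first need to find the button by using find_element_by_css_selector(), then chain the .click() method afterwards. Keep in mind that Selenium 4.3 removed the find_element_by_* helpers; on newer versions the same lookup is written as driver.find_element(By.CSS_SELECTOR, 'button'), with By imported from selenium.webdriver.common.by. Here’s an example:

driver.find_element_by_css_selector('button').click()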

Methods that are run on the driver

All of the following methods (actions) need to be run on the browser instance. In my examples, the browser instance is stored in a variable called “driver”.

driver = webdriver.Chrome("webdrivers/chromedriver")
  • driver.get(url)
    This method is used to go to a specific URL.
  • driver.find_element_by_css_selector(selector)
    This method is used to find a single element based on a CSS selector.
  • driver.find_elements_by_css_selector(selector)
    This method is used to find multiple elements based on a CSS selector. It returns them as a list.
  • driver.close()
    This will close the current browser window. To close every window and end the session, use driver.quit().

For a full list of selenium methods, check out their documentation: https://selenium-python.readthedocs.io/api.html

Methods run on a specific element

The following methods (actions) can only be run after specifying an element. To define a specific element, you’ll need to run a method like find_element_by_css_selector(). In my examples, I’ll store the element in a variable called “elem”.

elem = driver.find_element_by_css_selector('a')
  • elem.click()
    This will click the element.
  • elem.send_keys(text)
    This will type the text passed as the parameter into the element.
  • elem.text
    This will retrieve the text of an element. It’s a property rather than a method, so it’s written without parentheses.
  • elem.get_attribute(attribute)
    This will return the value of a specific attribute of an element, such as “href” for a link.

For a full list of selenium methods, check out their documentation: https://selenium-python.readthedocs.io/api.html
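
To see the driver and element methods working together, here’s a minimal sketch. It uses the Selenium 4 lookup style, find_element(By.CSS_SELECTOR, ...), rather than the older find_element_by_css_selector() helper shown above, since that helper was removed in Selenium 4.3. The function name describe_first_link is made up for illustration.

```python
def describe_first_link(driver, url):
    """Open url and return the text and href of the first <a> element."""
    # Imported inside the function so loading this file doesn't require Selenium.
    from selenium.webdriver.common.by import By

    driver.get(url)                                   # method run on the driver
    elem = driver.find_element(By.CSS_SELECTOR, "a")  # returns a single element
    return {
        "text": elem.text,                            # property of the element
        "href": elem.get_attribute("href"),           # method run on the element
    }
```

With a real browser instance stored in “driver”, print(describe_first_link(driver, "https://example.com")) would print the link’s text and address, after which driver.quit() closes the browser.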

Exporting Data

Which method you use to export the data really depends on what you’re trying to build. Sometimes, just printing the results could be the most efficient way of exporting the data.

  • Printing
    This is definitely the easiest and probably the one you’d prefer to stick to if you’re just starting out and only extracting a few results.
  • Exporting as CSV
    This would be a great option for bigger scraping projects where you’d like to use the data inside Excel or Google Sheets. Python has a built-in csv module, along with a few third-party libraries, that can help manipulate data in a CSV format.
  • Exporting as JSON
    This could be a good option for projects where the data is nested, with values that contain lists or further levels of values.

In the exercise below, I’d recommend sticking to printing the results as it’s definitely the easiest. You can wrap the property that returns the desired value (“elem.text”) inside of a print() call to display the value inside of your terminal.
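
For comparison, here’s a rough sketch of all three export options using only Python’s standard library. The rows in “results” are made-up placeholders standing in for scraped values, and the helper names to_csv and to_json are assumptions.

```python
import csv
import io
import json

# Made-up placeholder rows standing in for values a scraper collected.
results = [
    {"title": "Example Domain", "url": "https://example.com"},
    {"title": "Example Org", "url": "https://example.org"},
]

# Option 1: printing each result to the terminal.
for row in results:
    print(row["title"], row["url"])


# Option 2: building CSV text with a header row.
def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Option 3: building JSON text, indented so it's easy to read.
def to_json(rows):
    return json.dumps(rows, indent=2)


csv_text = to_csv(results)
json_text = to_json(results)
```

To save either one, the text can be written to a file, for example with open("results.csv", "w", newline="") followed by a write(csv_text) call on the opened file.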

Finding the right CSS selectors

This step can be confusing at first, but there are a lot of tools out there that’ll help simplify the process.

Google Chrome offers a built-in way of getting the right CSS selectors for your desired elements in just a few clicks. If you right-click on your desired element, you’ll see an “Inspect” option.

Clicking that will open up Chrome’s developer tools with the element’s HTML highlighted.

Once there, you’ll be able to right-click the highlighted HTML, choose “Copy”, then “Copy selector” to copy the CSS selector for it.

Scraping Challenge

Now’s the time for the fun part! You’ve installed Python and Selenium, you’ve created a test project to make sure everything was set up properly and you’re finally ready to experiment with creating your very own scraper.

The challenge:

  • Create a scraper that returns the first result of a Bing search

I’ll let you try and figure it out on your own. If you’re having trouble, try re-visiting the “Terminology” and “How Selenium works” sections of this article.

Helpful guidelines for the challenge:

  1. Add your web driver to your project
  2. Import Selenium to your script
  3. Create a browser instance
  4. Add steps to get to the results page for your search
  5. Print the first result

Before writing any code, I’d recommend that you go through Bing manually and figure out the fewest steps needed to get to the results page of a search.

I’ll have a few possible answers listed below once you’re ready to review what you wrote.

Possible answers

There are many ways of completing this challenge, but I’m only going to list 2 potential answers that will hopefully give you an idea of potential approaches you could’ve taken.

Script example #1:
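
The original script isn’t reproduced here. As a rough sketch of this approach, the code below opens Bing, types “Amazon” into the search box, clicks search and prints the first result. It assumes a recent Selenium 4 release, and the three CSS selectors are my assumptions about Bing’s markup, so check them in the developer tools before relying on them. The function names are placeholders.

```python
import os

DRIVER_PATH = os.path.join("webdrivers", "chromedriver")

# Assumed selectors for Bing's page; they may need updating.
SEARCH_BOX = "#sb_form_q"
SEARCH_BUTTON = "#search_icon"
FIRST_RESULT = "#b_results li.b_algo h2"


def search_with_form(driver, query):
    """Search through Bing's form and return the first result's text."""
    from selenium.webdriver.common.by import By  # Selenium 4 lookup style

    # Keep retrying element lookups for up to 10 seconds while pages load.
    driver.implicitly_wait(10)
    driver.get("https://www.bing.com")
    driver.find_element(By.CSS_SELECTOR, SEARCH_BOX).send_keys(query)
    driver.find_element(By.CSS_SELECTOR, SEARCH_BUTTON).click()
    return driver.find_element(By.CSS_SELECTOR, FIRST_RESULT).text


def main():
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(DRIVER_PATH))
    try:
        print(search_with_form(driver, "Amazon"))
    finally:
        driver.quit()


# Only launch the browser if the driver file is where we expect it.
if os.path.isfile(DRIVER_PATH):
    main()
else:
    print("No chromedriver found at " + DRIVER_PATH)
```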

This first example will definitely get the job done, but there’s an even simpler approach that could be taken.

If you’d like to attempt to find a better alternative, I’d suggest you look at ways that would eliminate the need to type “Amazon” into the search box and click search. I’ll have the answer listed below.

Script example #2:

This method is a much simpler version that’ll quickly get you the result you’re looking for!
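
The original script isn’t reproduced here either, so this is a hedged sketch of the shorter route: build the results page URL and read the first result from it. It assumes a recent Selenium 4 release, the result selector is my assumption about Bing’s markup, and the function names are placeholders.

```python
import os

DRIVER_PATH = os.path.join("webdrivers", "chromedriver")

# Assumed selector for the first result's heading; it may need updating.
FIRST_RESULT = "#b_results li.b_algo h2"


def results_url(query):
    """Build the Bing results page URL for a one-word query."""
    return "https://www.bing.com/search?q=" + query


def first_result(driver, query):
    """Visit the results page directly and return the first result's text."""
    from selenium.webdriver.common.by import By  # Selenium 4 lookup style

    # Keep retrying element lookups for up to 10 seconds while the page loads.
    driver.implicitly_wait(10)
    driver.get(results_url(query))
    return driver.find_element(By.CSS_SELECTOR, FIRST_RESULT).text


def main():
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(DRIVER_PATH))
    try:
        print(first_result(driver, "amazon"))
    finally:
        driver.quit()


# Only launch the browser if the driver file is where we expect it.
if os.path.isfile(DRIVER_PATH):
    main()
else:
    print("No chromedriver found at " + DRIVER_PATH)
```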

Above you can see that instead of using the input field to execute the search, I instead opted to directly visit the results page.

By replacing amazon in the following URL, you can essentially find the results page of any given search query! https://www.bing.com/search?q=amazon

Script example #3 (Bonus):

Although I said I’d only be giving 2 examples, I just had to include this last one. This one showcases some of the interesting things that can be accomplished the more comfortable you get with writing code.

Here’s an example of a script that creates all of the result page URLs for you based on a given list of search queries. This works wonders, but there is a downside…
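
A minimal sketch of that idea, assuming the search queries live in a list called “keywords” (its contents here are made up, apart from “amazon”):

```python
BASE_URL = "https://www.bing.com/search?q="

# Made-up one-word queries; anything with spaces or symbols is not handled yet.
keywords = ["amazon", "python", "selenium"]

# Build one results page URL per keyword.
result_urls = [BASE_URL + keyword for keyword in keywords]

for url in result_urls:
    print(url)
# prints:
# https://www.bing.com/search?q=amazon
# https://www.bing.com/search?q=python
# https://www.bing.com/search?q=selenium
```

Each URL in “result_urls” could then be passed to driver.get() in a loop to scrape the first result of every search.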

Values within the “keywords” list need to be URL-friendly. Any special characters or spaces could end up breaking the script, but adding some code to convert the values into URL-friendly keywords solves that problem!

Depending on which version of Python you’re using, one of these two scripts will make the values URL-friendly.

To check which version of Python you’re using, enter the following command in a new terminal:

python -V

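Python 2: the standard library’s “urllib” module provides quote_plus() for this.

Python 3: the same function lives in the “urllib.parse” module.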
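
The original snippets aren’t reproduced here. One way to cover both versions in a single sketch is to try the Python 3 import first and fall back to the Python 2 location. I’m assuming quote_plus is a reasonable choice for search queries, since it turns spaces into “+” and escapes other special characters; the keywords are made up.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

BASE_URL = "https://www.bing.com/search?q="

# Made-up queries containing spaces and special characters.
keywords = ["amazon prime", "c++ tutorial", "cats & dogs"]

# Convert each keyword into a URL-friendly value before building the URL.
result_urls = [BASE_URL + quote_plus(keyword) for keyword in keywords]

for url in result_urls:
    print(url)
# prints:
# https://www.bing.com/search?q=amazon+prime
# https://www.bing.com/search?q=c%2B%2B+tutorial
# https://www.bing.com/search?q=cats+%26+dogs
```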

Thanks for reading!

I truly hope you got a lot of value from this article. If you have any questions or feedback, I’d love to hear them!

If you enjoyed reading, make sure to keep an eye out for more of my stories! I share my learnings and experiences as a polymath in the startup world.
