How to Scrape Websites With PHP Using Goutte

November 11, 2021
Written by Matt Nikonorov

For many PHP-based applications involving data collection or data analysis, PHP scripts need to scrape data from external web pages. This is especially true if the web source you want to interact with doesn't provide an API, or provides one that you don't want to pay for.

Web scraping is usually performed with Node.js or Python. However, if your application is written in PHP, scraping with Node.js or Python and then passing the data to the frontend complicates the process of getting data from the web onto a web page.

This is where Goutte makes life easier. Instead of relying on a Node.js or Python script to scrape the data and then passing the results to a PHP script for display on the frontend, Goutte lets you scrape data from the web directly inside your PHP script.

Goutte is a lightweight web scraping library, so passing scraped data to the frontend doesn't significantly increase loading time, nor does it take up too much RAM in the backend.

Functionality

Since Goutte is a lightweight library not intended to handle heavy processes, it has limited functionality compared to more heavyweight web scraping libraries. Goutte's functionality includes:

  1. Finding HTML elements through their CSS selector or HTML tag
  2. Extracting text from HTML elements
  3. Clicking links and filling out forms

In this article, you will learn how to use Goutte's functionality for different purposes through practical code examples.

Prerequisites

To follow this tutorial you'll need the following:

  1. PHP installed on your machine
  2. Composer installed globally
  3. Your preferred code editor or IDE

Setup

Before installing Goutte, you'll need to create the project directory and navigate into it, by running the commands below.

mkdir goutte-web-scraping
cd goutte-web-scraping

Installation

To add Goutte as a project dependency, run the following command in your project's terminal:

composer require fabpot/goutte

Working with Goutte

Let's start with the most practical use case of Goutte: text extraction. First things first, using your preferred editor or IDE, create a new PHP script inside your project directory called scrape.php. To require and initialize the Goutte library inside scrape.php, add the following lines of code to the beginning of the script:

<?php

require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();

Now that Goutte is initialized, add the two lines below to the end of the file to fetch a URL using the $client->request() method.

$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

I've chosen the BBC stock market news page as the destination URL because it has plenty of textual information that you can extract using Goutte.

The BBC stock market news website

Finding elements

Let's try and scrape today's top stock market news headline as displayed on this web page. First, you will need a way to find the DOM (Document Object Model) element containing the first headline. You can do this by right-clicking on the first headline and clicking on the Inspect option, which you can see an example of in the screenshot below.

Inspecting the first news headline

A pop-up will then appear on the right of your screen, with the selected headline's DOM element highlighted in blue.

Pop-up displaying this webpage's DOM

To access this element through Goutte, you can use the element's CSS selector. A selector is, in simple terms, the address of an element within a web page and is used for accessing a specific element inside the DOM. To get the selector of the highlighted element, in Chrome or Safari's Elements tab, or Firefox's Inspector tab, right-click on the element, hover over Copy, and click Copy selector in the dropdown.

Copy this element's selector

If you paste the copied selector somewhere and take a look at it, it may seem cryptic at first glance, but it is actually derived from the element's id property or its path in the DOM. This particular element's selector is #title_58798625, because the element's id is title_58798625, which you can see in the screenshot below.

This element's id property

If present, an element's id will determine its selector. Note that the selector has a "#" before the id, as seen with this example element. If an element doesn't have an id, on the other hand, its selector will look something like the following:

div > div > a

As you can see, the selector of an element with no id simply reflects the element's position in the DOM tree. The above selector matches an a tag that sits inside a parent div tag, which in turn sits inside another parent div tag.
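
For reference, either kind of selector can be passed to Goutte's filter() function, which you'll start using in the next section. Below is a minimal sketch; the id value comes from the example above and will have changed by the time you read this, and the tag-path selector may match several elements:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets');

// Select by id (this id is only an example and changes as the page is updated)
$byId = $crawler->filter('#title_58798625');

// Select by tag path: <a> tags that are direct children of a <div> nested inside another <div>
$byPath = $crawler->filter('div > div > a');

echo $byId->count()." element(s) matched by id\n";
echo $byPath->count()." element(s) matched by tag path\n";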

Extracting an element's text contents

Now that you've copied the first headline's selector, add the two lines below to the end of scrape.php, replacing <headline's selector> with the selector that you've just copied in your browser.

$news = $crawler->filter("<headline's selector>")->text();
echo($news."\n");

The code finds the element within the web page using the filter() function, extracts its text contents using the text() function, and then displays the extracted text. If you run the script with php scrape.php, you should see the top news headline from the BBC stock market news web page, as in the example below.

Evergrande investors kept waiting over 'major' deal

You can follow the same process to fetch multiple news headlines and display their extracted text contents. Use the following code, replacing the placeholders with their respective selectors, to display the text contents of multiple headline elements.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

$news = $crawler->filter("<first headline's selector>")->text();
$news2 = $crawler->filter("<second headline's selector>")->text();
$news3 = $crawler->filter("<third headline's selector>")->text();

echo($news."\n");
echo($news2."\n");
echo($news3."\n");

If you run the code again, you should see output similar to the example below.

Evergrande investors kept waiting over 'major' deal
Meet Mr Goxx, the crypto-trading hamster
Evergrande investors in the dark over $83m payment
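
Copying a separate selector into its own variable gets tedious as the number of headlines grows. As an alternative, you could keep the selectors in an array and loop over them. The sketch below uses the same hypothetical placeholders as above, which you'd replace with selectors copied from your browser:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

// Replace these hypothetical placeholders with selectors copied from your browser
$selectors = [
    "<first headline's selector>",
    "<second headline's selector>",
    "<third headline's selector>",
];

// Print the text contents of each matched headline element
foreach ($selectors as $selector) {
    echo $crawler->filter($selector)->text()."\n";
}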

Scraping elements using HTML tags

Another method of scraping elements, which is faster but messier, is to select them by their HTML tag. For example, using the code below, you can return the text contents of the web page's first h1 tag.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);
$news = $crawler->filter('h1')->text();

echo($news."\n");

Goutte can also return the text contents of all the h3 tags inside this web page using the each() function. To do that, update scrape.php to match the revised code below.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

$crawler->filter('h3')->each(function ($node) {
   print $node->text()."\n";
});

Running the code again should give you output similar to the example below.

Evergrande investors kept waiting over 'major' deal
Meet Mr Goxx, the crypto-trading hamster
Evergrande investors in the dark over $83m payment
Evergrande shares jump as debt deadline looms
China set to open new stock exchange in Beijing
Chip giant hit by Beijing crackdown on business
Former Netflix staffers charged for insider trading
Shares jump in Indonesia's biggest ever market debut
Robinhood shares surge amid frenzied trading
China Tesla rival plans Hong Kong secondary listing
China ride-hailing giant denies plans to go private
Uber slides on reports of $2bn shares selloff
China stocks see biggest slump in US since 2008
Shares in China tuition firms slump after shake-up
Tencent shares slide after Beijing music crackdown
Shares of India's Zomato soar on market debut
Didi shares fall on reports of penalties in China
JP Morgan boss set to net millions from 'award'
Asia follows global shares slide amid Covid fears
Chinese ride-hailing firm Didi sued as shares slide

This method is messy because it can output lots of unwanted data contained within the same type of HTML tag. Luckily, in the case of this web page, every h3 tag contains a news headline and nothing else. Using the same approach, you can also narrow the match to elements inside a specific chain of parent HTML tags. For example, you can retrieve the text contents of all the p tags that sit inside two nested div tags on this web page by updating the code in scrape.php to match the code below:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

$crawler->filter('div > div > p')->each(function ($node) {
  print $node->text()."\n";
});

Running the code again should give you output similar to the example below, which was shortened for brevity's sake.

Two more Chinese property companies are causing concerns over their ability to repay their debts.
...
Stocks in Asia follow Monday's falls, but shares in Europe make a brighter start on Tuesday.
The lawsuits come after a crackdown by Beijing triggered a slump in its share price of more than 20%.
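
A useful detail here is that each() returns an array built from whatever your callback returns. So if you'd rather collect the scraped paragraphs than print them, for example to hand them to your frontend as JSON, you could do something like the sketch below:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

// each() returns an array made up of the callback's return values
$paragraphs = $crawler->filter('div > div > p')->each(function ($node) {
    return $node->text();
});

// Encode the scraped paragraphs as JSON, ready to pass to the frontend
echo json_encode($paragraphs, JSON_PRETTY_PRINT)."\n";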

Scraping and displaying innerHTML

It is also possible to extract a scraped element's innerHTML. Let's try extracting the innerHTML of the menu bar at the top of this web page, starting by copying its CSS selector.

Inspecting the navigation bar element

To extract this element's innerHTML, you must use the html() function instead of the text() function which you used earlier. Update scrape.php to match the code below, replacing <menu bar's selector> with the menu bar's CSS selector.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);
$news = $crawler->filter("<menu bar's selector>")->html();

echo($news);

Running scrape.php again will display the innerHTML of this element, which you can see in the example output below. I've truncated it, since the full output is too long to include in the article.

<div class="orb-nav-section orb-nav-blocks"><a href="https://www.bbc.co.uk">Homepage</a></div><section>...<span>Search BBC</span></button></div></form> </div>
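
Since Goutte is built on Symfony's DomCrawler component, you can also feed the extracted innerHTML into a new Crawler instance and pick it apart further. The sketch below, for example, prints the href attribute of every link inside the menu bar; <menu bar's selector> is the same placeholder you replaced above:

<?php

require 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

// Extract the menu bar's innerHTML, then wrap it in a fresh Crawler
$html = $crawler->filter("<menu bar's selector>")->html();
$menu = new Crawler($html);

// Print the destination of every link found inside the menu bar
$menu->filter('a')->each(function ($node) {
    print $node->attr('href')."\n";
});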

Interactive web scraping

Goutte's interactive web scraping capabilities are very limited compared to more heavyweight web scraping libraries such as Puppeteer or Selenium. However, it can still do two things very well when it comes to interacting with a web page:

  1. Clicking on links
  2. Filling out and submitting forms

The first news headline displayed on the BBC stock market news web page is inside an a tag, which means that the headline acts as a link. Let's try clicking on the first news headline and grabbing the headline element of the page it links to.

The below screenshot shows the link destination's headline element.

Get the CSS selector of an article's header in Chrome's developer tools

To find the headline element through Goutte, copy the element's CSS selector. This element has an id property of main-heading, so its selector is conveniently #main-heading. Then, update scrape.php to match the following code, replacing the placeholders with the appropriate selectors, to click on the first headline and display the text contents of the destination page's headline.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

$news = $crawler->filter("<headline's selector>")->text();
$link = $crawler->selectLink($news)->link();
$crawler = $client->click($link);
$h1 = $crawler->filter("<link destination's headline's selector>")->text();

echo($h1."\n");

Unfortunately, finding a link through its text contents is the only way Goutte can recognise links; you can't use selectors the way you do when fetching elements. After the call to click(), the $crawler variable points to the page at the clicked link's destination. The code then fetches the headline from that page through its selector and displays its text contents.

If you run scrape.php again, you should get output similar to the example output below.

Evergrande: Investors kept waiting over 'major' deal
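
If matching a link by its text contents ever feels too fragile, a workable alternative is to read the link's href attribute with attr() and request that URL directly. The sketch below assumes a hypothetical selector for the a tag that wraps the headline (not the headline text element itself), and that the href may be a relative path:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = "https://www.bbc.com/news/topics/cgdzpg5yvdvt/stock-markets";
$crawler = $client->request('GET', $url);

// Hypothetical placeholder: a selector matching the <a> tag that wraps the headline
$href = $crawler->filter("<headline link's selector>")->attr('href');

// The href may be a relative path (e.g. /news/...), so prepend the domain if needed
if (strpos($href, 'http') !== 0) {
    $href = 'https://www.bbc.com'.$href;
}

$article = $client->request('GET', $href);
echo $article->filter('#main-heading')->text()."\n";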

Submitting forms

You can also submit forms through Goutte. As a working example, we're going to make scrape.php fill in and submit GitHub's sign in form by clicking the Sign in button at the top right of GitHub's home page, and then filling in and submitting the email and password fields once redirected to the sign in form.

The GitHub home page

Just like links, Goutte can only find buttons through their text contents. Using the following code, Goutte will find the Sign in link through its text contents and click it, taking you to the sign-in form.

$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());

To identify the input elements that you'd like your PHP script to find and fill in once redirected to the sign in form, you must use their name attributes. I've highlighted them in blue in the screenshots below.

Inspecting the first input element

In the screenshot above the name attribute has the value "login".

Inspecting the second input element

In the screenshot above the name attribute has the value "password".

Using the code below, scrape.php will be able to:

  1. Open https://github.com.
  2. Click the Sign in button at the top right of the home page through its text contents.
  3. Once redirected to the sign in form, fill in each of the needed fields by recognising their name attribute, and finally submit the form.
  4. If successfully signed in, once redirected to your GitHub account's home page, scrape and display the first h1 tag on that page, to confirm that Goutte has signed in.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'your email', 'password' => 'your password']);
$h1 = $crawler->filter("h1")->text();

echo($h1."\n");

Update scrape.php to match the code above. Once you've swapped in your real credentials, as described next, and run it again, the output should match the example below.

Dashboard

Now, let's make one final change: specify what you'd like the code to fill in for the input elements. In the example below, I've replaced the "your email" and "your password" placeholders with sample credentials; use your own GitHub email and password here.

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'myemail@gmail.com', 'password' => '12345']);
$h1 = $crawler->filter("h1")->text();

echo($h1."\n");
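
As an aside, the Form object returned by form() also supports array-style access, so you could fill each field individually before submitting. Here's a small alternative sketch of the same sign-in flow:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());

// Grab the form via its Sign in button, then fill the fields one by one
$form = $crawler->selectButton('Sign in')->form();
$form['login'] = 'myemail@gmail.com';
$form['password'] = '12345';

$crawler = $client->submit($form);

echo $crawler->filter('h1')->text()."\n";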

Wrapping up

I hope this article has successfully demonstrated how you can scrape data from external web pages in PHP using the Goutte library. Happy scraping!

Matt Nikonorov is a developer from Kazakhstan with a passion for data science, data mining and machine learning. He loves developing web and desktop applications to make the world more interesting. When he’s not developing, you can reach him via Twitter.