Finding and Fixing Website Link Rot with Python, BeautifulSoup and Requests
When hyperlinks go dead by returning 404 or 500 HTTP status codes, or by redirecting to spam websites, we get the awful phenomenon known as “link rot”. Link rot is a widespread problem; in fact, research shows that an average link lasts four years.
In this blog post, we will look at how link rot affects user experience, using Full Stack Python as our example. We’ll build a Python script that detects link rot in Markdown and HTML files so we can quickly find and fix our issues.
A Link Rot Example
fullstackpython.com is a website created by Twilio employee Matt Makai in 2012. The site has helped many folks, including me, learn how to best use Python and the tools within its ecosystem.
The site now has over 145,000 words and 150 pages, including:
- 2400+ links in the repository
- 300+ HTML files
- 150+ Markdown files
And more links and files are expected in the future. With 2400+ links on the site, it is really difficult to immediately spot dead links. At best, users could report them via issues or pull requests; at worst, users may not know what to do and simply leave the site. On the maintainer's side, checking all the URLs by hand is not a sustainable solution. Assuming each link check takes 10 seconds, it would take at least 24,000 seconds (or 6.7 hours) to go through all the links in one sitting.
There must be an automated solution to handle all of the link rot madness!
Python to the Rescue
Our approach will be to aggregate all the links from the site and check each URL using a Python script. Since the site content is all accessible on GitHub as a repository, we can clone the repository and run our script from the base folder.
The first step is to clone the repository:
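The clone command is not reproduced in this version of the post; assuming you want the public GitHub repository for the site, it would be along these lines:

```bash
# Clone the Full Stack Python source and move into it
git clone https://github.com/mattmakai/fullstackpython.com.git
cd fullstackpython.com
```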
Please make sure that Python 3 is installed on your machine before proceeding further. At the time of this writing, the latest GA version is Python 3.7.0.
We will use the following built-in packages for this script:
- `futures` is used for asynchronous processing
- `mp` is used for determining the CPU count
- `os` is used for walking through files
- `json` is used for printing JSON
- `uuid` is used for generating random identifiers
We will also use the following third-party packages for this script:
- `BeautifulSoup` is used for parsing HTML
- `markdown` is used for parsing Markdown
- `requests` is an easy-to-use interface for making HTTP requests
- `urllib3` is the underlying implementation for `requests`
Feel free to install the third-party packages with `pip` or `pipenv`.
If you plan to use `pip`, create a `requirements.txt` with the following content:
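Exact version pins aren't shown in this post, so a minimal, unpinned `requirements.txt` would simply list the four third-party packages:

```
beautifulsoup4
Markdown
requests
urllib3
```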
These packages are already listed in the repository's `requirements.txt` file, so you can also copy and paste it from there.
Once the third-party packages are installed on your machine, create a new file named `check_urls_twilio.py`. Start the file by declaring imports:
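A sketch of what those imports could look like, based on the package list above; the aliases (for example, importing `concurrent.futures` as `futures` and `multiprocessing` as `mp`) are assumptions:

```python
# Standard library
import concurrent.futures as futures  # asynchronous processing
import multiprocessing as mp          # determining the CPU count
import os                             # walking through files
import json                           # printing JSON progress output
import uuid                           # generating random identifiers

# Third-party packages
import requests                       # easy-to-use HTTP requests
import urllib3                        # underlying implementation for requests
from bs4 import BeautifulSoup         # parsing HTML
from markdown import markdown         # converting Markdown to HTML
```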
Now we can write the algorithm to seek out link rot.
Our First Link Rot-Finder Approach
Here is how our first attempt at coding our link rot finder will operate:
- Identify URLs from Markdown and HTML files in the fullstackpython repository
- Write identified URLs into an input file
- Read the URLs one-by-one from an input file
- Run a GET request and check if the URL gives a bad response (4xx/5xx)
- Write all bad URLs to output file
Why not just run a regular expression across all the files in the repository? I have access to all of them in the cloned repository. I first configured a Linux command to do exactly that:
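The original one-liner is not reproduced in this version of the post; based on the explanation that follows, it was along these lines (the exact flags and regex are an approximation):

```bash
# Find every file (not directory), pull out anything that looks like an
# HTTP(S) link, drop grep's "Binary file ... matches" noise, then sort,
# de-duplicate, and save the links for the script to read.
find . -type f -print0 \
  | xargs -0 grep -ohE "https?://[a-zA-Z0-9./?=&_%:#~+-]*" \
  | grep -v "Binary" \
  | sort -u > urlin.txt
```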
Here’s a high-level explanation of the command:
- Find all files (not directories)
- Look for HTTP(S) links
- Filter out lines that are not real links (e.g. grep's "Binary file" messages)
- Sort all the links and remove duplicate links
The regular expression is used for finding HTTP(S) links. The following is a description of what the expression means under the hood, from left to right:
- `http` or `https` (the `s` is optional)
- `://` is the separator between the protocol and link metadata
- Any combination of link characters:
  - alphabetical characters
  - numeric characters
  - query parameter characters
  - DNS `.` separators
But we are writing a Python script, not a Bash script! So update your `check_urls_twilio.py` file with the same commands coded in Python:
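A sketch of that update, assuming the intermediate file names `urlin.txt` and `urlout.txt` (both referenced later in this post) and that the shell pipeline is executed with `os.system`:

```python
# Paths for the intermediate files: extracted links in, bad links out.
IN_PATH = 'urlin.txt'
OUT_PATH = 'urlout.txt'

# The same find/grep/sort pipeline as above, executed through the OS so
# that the de-duplicated links land in IN_PATH.
FIND_URLS_COMMAND = (
    'find . -type f -print0 '
    '| xargs -0 grep -ohE "https?://[a-zA-Z0-9./?=&_%:#~+-]*" '
    '| grep -v "Binary" '
    '| sort -u > ' + IN_PATH
)
os.system(FIND_URLS_COMMAND)
```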
Because all the URLs are now in `IN_PATH`, we can process each one and check whether it is valid:
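The original snippet is not reproduced here; based on the step-by-step description below, the checking loop might look roughly like this sketch. The name `get_url_status` (implemented a bit later) is a placeholder, while `run_workers` is the threaded runner introduced further down:

```python
def check_urls():
    """Read URLs from IN_PATH one-by-one and write the bad ones to OUT_PATH."""
    with open(IN_PATH) as in_file:
        urls = [line.strip() for line in in_file if line.strip()]

    # Open OUT_PATH for writing (NOT appending) so results from previous
    # runs are discarded.
    with open(OUT_PATH, 'w') as out_file:
        url_id = 0
        for checked_url, status in run_workers(get_url_status, urls):
            # Print progress: the ID and hostname of the URL being checked.
            host = urllib3.util.parse_url(checked_url).host
            print(json.dumps({'id': url_id, 'host': host}))
            # 4xx and 5xx responses count as bad links.
            if status >= 400:
                out_file.write('{} {}\n'.format(checked_url, status))
            url_id += 1
```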
The code above does the following:
1. Opens `OUT_PATH` for writing, NOT appending. We don't want results from previous runs to be included.
2. Gets the status of each URL.
3. Prints the progress of the total execution (i.e. the ID and hostname of the URL being checked).
4. Checks whether the URL status is bad or good; if it's bad, writes it as a new line to `OUT_PATH`.
5. Increments the ID by 1 each time steps 2 through 4 execute.
Let's see how to implement the HTTP status check.
The status check first tests whether the link is "bogus" (e.g. a localhost address). If it's legitimate, the cleaned link is sent a GET request with a timeout of 10 seconds.
At the end of the function, an appropriate status code is attached to the link and both are returned to the caller as a tuple result.
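A sketch of that function follows; the name `get_url_status`, the placeholder status of 0 for skipped links, and treating connection failures as a 503 are assumptions (`HEADERS` is defined in the next snippet):

```python
def get_url_status(url):
    """Return a (url, status_code) tuple for a single link."""
    cleaned_url = url.strip().rstrip('/')
    # Skip "bogus" links such as localhost addresses; a placeholder status
    # of 0 keeps them out of the bad-link report.
    if 'localhost' in cleaned_url or '127.0.0.1' in cleaned_url:
        return (cleaned_url, 0)
    try:
        response = requests.get(cleaned_url, headers=HEADERS, timeout=10)
        return (cleaned_url, response.status_code)
    except requests.exceptions.RequestException:
        # Timeouts and connection failures are reported as dead links.
        return (cleaned_url, 503)
```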
We need to watch out for rate limiting mechanisms, so a header was added to change the identifier (the User-Agent) that the requests library sends with each GET request:
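One way to do that with the `uuid` module imported earlier is a randomly generated User-Agent string; the exact header value here is an assumption:

```python
# A random User-Agent per run makes each execution of the script look like
# a "different" client to throttling mechanisms.
HEADERS = {
    'User-Agent': 'link-checker-{}'.format(uuid.uuid4()),
}
```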
This meant that I was effectively a "different" user each time I ran the program, making it less likely that throttling mechanisms on websites like Reddit would kick in.
We'll use Python's threading via `run_workers` to improve performance; since the requests are I/O-bound, the GIL (Global Interpreter Lock) doesn't stop threads from running them concurrently:
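The `run_workers` implementation is not reproduced here; a minimal sketch using `concurrent.futures` (imported as `futures` above) could look like this:

```python
def run_workers(work, data, worker_threads=mp.cpu_count() * 4):
    """Run the work function over data with a pool of threads, yielding
    each (url, status) result as soon as it completes."""
    with futures.ThreadPoolExecutor(max_workers=worker_threads) as executor:
        submitted = [executor.submit(work, item) for item in data]
        for future in futures.as_completed(submitted):
            yield future.result()
```

To make the first version of the script runnable end-to-end, call `check_urls()` at the bottom of the file, for example under an `if __name__ == '__main__':` guard.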
The number of threads that `run_workers` uses defaults to four times the number of CPUs on the machine running the script. However, this can be configured to whatever value makes the most sense.
Time to test our script! Assuming you named the script `check_urls_twilio.py`, you can run it by executing `python check_urls_twilio.py`. Verify that the script creates an output file called `urlout.txt` with a list of bad links.
First Approach Pros and Cons
This was a simple solution to push as a pull request. It handled 80% of the use cases out there and helped remove a lot of dead links. However, the script was not accurate enough for a couple of nasty URLs, and it even identified false positives! See PR #159 for more context. I soon realized that URLs have so much variation that detecting one in arbitrary text is not trivial without using a parser.
Improving Our Script
We can definitely improve this script, so it's time to use a parser! Luckily, we are only checking Markdown and HTML, which limits our parsing needs considerably. Instead of running the OS system call to extract the links, `Markdown` and `bs4` can extract the required URLs. Remove the OS system call and the variables associated with it, so that the top of our script looks like this:
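A sketch of what remains at the top of the file once the shell-command pieces (the `os.system` call, the command string, and the `IN_PATH`/`OUT_PATH` constants) are deleted; the `HEADERS` definition carries over from the first version:

```python
import concurrent.futures as futures
import multiprocessing as mp
import os
import json
import uuid

import requests
import urllib3
from bs4 import BeautifulSoup
from markdown import markdown

# Same random User-Agent trick as in the first version of the script.
HEADERS = {
    'User-Agent': 'link-checker-{}'.format(uuid.uuid4()),
}
```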
The logic for processing the URLs one-by-one remains the same, so we will focus primarily on the link extraction logic.
Here’s how the code should work:
1. Find all files recursively
2. Detect whether they have Markdown or HTML extensions
3. Extract URLs and add them to a set of unique URLs found so far
4. Repeat steps 2 through 3 for all files identified in step 1
Let's implement the `extract_urls` function. Add the following code to the end of your `check_urls_twilio.py` file:
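Here is a sketch of `extract_urls` based on the description that follows; the helper `extract_urls_from_content` (implemented in the next snippet) is a placeholder name:

```python
def extract_urls():
    """Collect every URL found in the Markdown and HTML files below the
    current directory."""
    all_urls = set()
    # Walk the current directory from the top down (recursively).
    for root, _, file_names in os.walk('.'):
        for file_name in file_names:
            file_path = os.path.join(root, file_name)
            if file_name.endswith('.markdown'):
                # Markdown is converted to HTML first, then parsed for links.
                with open(file_path, encoding='utf-8') as md_file:
                    extract_urls_from_content(markdown(md_file.read()), all_urls)
            elif file_name.endswith('.html'):
                # HTML files can be parsed directly, no conversion needed.
                with open(file_path, encoding='utf-8') as html_file:
                    extract_urls_from_content(html_file.read(), all_urls)
    return all_urls
```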
The code above does the following:
- Walk through current directory from top-down (recursive).
- For each visited directory, obtain a list of files.
- Determine if the file ends with a favorable extension.
- If a file with .markdown extension was found, then convert to HTML and extract URLs from content.
- If a file with .html extension was found, then extract URLs from content (no conversion).
Let's also look at how content extraction can be implemented. Place the following code after the `extract_urls` function you just wrote:
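One way that helper might look; the function name `extract_urls_from_content` and its signature are assumptions:

```python
def extract_urls_from_content(html_content, all_urls):
    """Add every external link found in html_content to the all_urls set."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Only anchor tags that actually carry an href attribute are matched.
    for anchor in soup.find_all('a', href=True):
        href = anchor['href']
        # Keep only absolute http(s) URLs; relative and mailto: links are skipped.
        if href.startswith('http'):
            all_urls.add(href)
```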
We use BeautifulSoup to find all anchor DOM elements that have a link reference (`href=True`). If there are no link references, the for loop terminates immediately. Otherwise, each iteration checks whether the href starts with `http`, meaning it points to an external website; if it does, we add it to the set of collected URLs. Since `all_urls` is a Python set, duplicate URLs are handled automatically.
Replace the check-links code with the improved version below; combined with the extraction functions above, that gives us our full, finished script:
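The finished script is not reproduced in full here; together with `get_url_status`, `run_workers`, `extract_urls`, and `extract_urls_from_content` from the sketches above, the replacement checking code might look like this:

```python
def check_urls(urls):
    """Check every extracted URL and report the bad ones as JSON."""
    bad_urls = []
    url_id = 0
    for checked_url, status in run_workers(get_url_status, urls):
        # Progress output: the ID and hostname of the URL being checked.
        host = urllib3.util.parse_url(checked_url).host
        print(json.dumps({'id': url_id, 'host': host}))
        if status >= 400:
            bad_urls.append({'url': checked_url, 'status': status})
        url_id += 1
    # Print the final report instead of writing it to urlout.txt.
    print(json.dumps(bad_urls, indent=2))


if __name__ == '__main__':
    check_urls(extract_urls())
```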
The rationale behind this change was to avoid using `urlin.txt` and `urlout.txt` in the first place. Now the script can be validated simply by running `python check_urls_twilio.py`.
Approach 2 Pros and Cons
This was even easier, as I did not have to worry about the structure of the URL. I could just reference the href attribute of an anchor tag and check whether the link started with http (which encompassed all the URLs I wanted to check).
What could make this solution better is the following:
- Add `argparse` parameters to customize the timeout for GET requests.
- Identify which files have a certain URL.
- Apply this URL checking solution for file-scraping and web-scraping use cases.
- Open-source it so that others don’t have to reinvent the wheel.
Conclusion
In summary, we covered how to check links for a website with over 2,000 links. Creating a pull request to fix bad links in this repository is much easier than before we had the script; I can now put one together in just a few minutes.
Running the link rot checker takes minutes, which is much better than spending 6.7 hours checking all of the links by hand.
To learn more about link rot and how to defend against it, check out these links:
Thanks for reading through this article. Happy link-checking everyone!