A Quick Script to Find Any Broken Links on Your Site 🎯

  • Gabriel Romualdo

  • November 9, 2019

Introduction

It seems like almost every other click on the internet ends up in an "Error 404: Page Not Found" page. "Whoops, the page you're looking for does not exist," "Sorry, the requested URL was not found on this server," "Oops, something went wrong. Page not found." Every internet user has seen pages like these.

I think web developers should pay less attention to building clever 404 pages and more attention to eliminating broken links altogether.

The Program

I've built an automated program to find broken links.

Program Demo

Written in Python 3, it recursively follows links on any given site and checks each one for 404 errors. When the program has finished searching an entire site, it prints every broken link it found along with the page that links to it, so that developers can fix them.
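
To make the crawl concrete before touching the network, here is a minimal sketch of the same idea run against a made-up, in-memory site. The `crawl` function and the `site` dictionary are hypothetical and not part of the program; a URL missing from the dictionary stands in for a 404, and the sketch uses an explicit stack where the real program uses recursion.

```python
from urllib.parse import urljoin, urlparse

def crawl(start_url, pages):
    """Return (broken_url, parent_url) pairs found by walking `pages`.

    `pages` is a hypothetical stand-in for the web: it maps a URL to the
    list of hrefs on that page, and a URL missing from it counts as a 404.
    """
    domain = urlparse(start_url).netloc
    seen = set()
    broken = []
    stack = [(start_url, "")]  # (URL to check, page that linked to it)
    while stack:
        url, parent = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        if url not in pages:
            broken.append((url, parent))
            continue
        # Only pages on the starting domain have their own links followed.
        if urlparse(url).netloc == domain:
            for href in pages[url]:
                stack.append((urljoin(url, href), url))
    return broken

# Made-up site used only for illustration.
site = {
    "https://example.com/": ["/about", "/missing", "https://other.example/"],
    "https://example.com/about": ["/", "old-page"],
    "https://other.example/": ["/never-followed"],
}
for url, parent in crawl("https://example.com/", site):
    print("BROKEN: link " + url + " from " + parent)
```

External pages are checked once but their links are never followed, which is what keeps the crawl from wandering off the site being tested.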

Note that the program makes a lot of HTTP requests in a relatively short period of time, so keep any limits on your Internet usage in mind.
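
If the request rate is a concern, one option is to pause between requests. The following is a rough sketch only: the `Throttle` class is hypothetical and not part of the program, and the 0.5 second interval is an arbitrary example value.

```python
import time

class Throttle:
    """Enforce a minimum gap, in seconds, between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                # Too soon since the previous call: sleep off the difference.
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

throttle = Throttle(min_interval=0.5)  # 0.5 s is an arbitrary example value
# Hypothetical use: call throttle.wait() immediately before each HTTP request.
```

The first call to `wait()` returns immediately; later calls sleep only if the previous call was less than `min_interval` seconds ago.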

Usage

  1. Check if you have Python 3 installed:

If the following command does not yield a version number, download Python 3 from python.org.

$ python3 -V
  2. Install the Requests package (for HTTP requests) and the Beautiful Soup package (for HTML parsing) from PyPI using pip.

(Note: I do not maintain these packages and am not associated with them, so install them at your own risk.)

$ pip3 install requests
$ pip3 install beautifulsoup4
  3. Copy and paste the following code into a file (I use the name find_broken_links.py in this article).
import sys
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

searched_links = []  # every URL already requested, so none is fetched twice
broken_links = []    # one report line per link that returned a 404

def getLinksFromHTML(html):
    # Collect the href value of every <a> element that has one.
    soup = BeautifulSoup(html, features="html.parser")
    return [el["href"] for el in soup.select("a[href]")]

def shouldSkip(URL):
    # Skip links already checked, mail and script links, and common image files.
    return (URL in searched_links
            or URL.startswith("mailto:")
            or "javascript:" in URL
            or URL.endswith((".png", ".jpg", ".jpeg")))

def find_broken_links(domainToSearch, URL, parentURL):
    if shouldSkip(URL):
        return
    # Record the URL before requesting it so that a failed request is not retried.
    searched_links.append(URL)
    try:
        requestObj = requests.get(URL)
        if requestObj.status_code == 404:
            broken_links.append("BROKEN: link " + URL + " from " + parentURL)
            print(broken_links[-1])
        else:
            print("NOT BROKEN: link " + URL + " from " + parentURL)
            # Only pages on the starting domain are crawled further.
            if urlparse(URL).netloc == domainToSearch:
                for link in getLinksFromHTML(requestObj.text):
                    find_broken_links(domainToSearch, urljoin(URL, link), URL)
    except Exception as e:
        print("ERROR: " + str(e))

find_broken_links(urlparse(sys.argv[1]).netloc, sys.argv[1], "")

print("\n--- DONE! ---\n")
print("The following links were broken:")

for link in broken_links:
    print("\t" + link)
  4. Run the script from the command line with a website of your choice.
$ python3 find_broken_links.py https://your_site.com/
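
The script relies on urljoin to turn each href into a full URL and on urlparse to decide whether a page belongs to the site being searched. The short sketch below shows how those two standard library functions behave; the page and link values are made up for illustration.

```python
from urllib.parse import urljoin, urlparse

page = "https://example.com/blog/post.html"  # hypothetical page being parsed

# Relative hrefs are resolved against the page they appear on.
print(urljoin(page, "about.html"))     # https://example.com/blog/about.html
print(urljoin(page, "/contact"))       # https://example.com/contact
print(urljoin(page, "../index.html"))  # https://example.com/index.html

# Absolute hrefs pass through unchanged.
print(urljoin(page, "https://other.example/x"))  # https://other.example/x

# The domain check is an exact comparison of network locations.
same_site = urlparse("https://www.example.com/").netloc == urlparse(page).netloc
print(same_site)  # False
```

Because the comparison is exact, a site reachable both with and without a www prefix is only crawled under the form you pass on the command line; links to the other form are still checked for 404s, but their pages are not searched further.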

Conclusion

I hope you found this useful; it certainly helped me find a few broken links on my own site.

This program is released under the CC0 license, so it is completely free to use, but it comes with no warranties or guarantees.

If you liked this post, share it with your friends and colleagues!

Thanks for scrolling.

— Gabriel Romualdo, November 10, 2019