Python for SEO

Posted: Aug. 19, 2021

Updated: Jan. 6, 2022


This post introduces Python scripts for SEO. A link to my relevant GitHub repository is here.


Python Modules for SEO

Before we get started, here are the most common Python modules for SEO.

  • webbrowser: opens a browser to a specific page

  • requests: downloads files and webpages from the internet

  • bs4: parses HTML and XML

  • selenium: launches and controls a web browser

  • google-search-results: scrapes and parses Google, Google Scholar, Bing, Baidu, Yandex, Yahoo, and eBay results using SerpApi

webbrowser is part of the Python standard library, so it should have been installed automatically when you installed Python 3.


You can install the remaining modules with pip, which also ships with Python. Simply type pip install followed by the module name, for example pip install bs4.


If you're not sure whether the modules are installed, or you just want to double-check, run this test in a terminal.

$ python -c "import webbrowser"
$ echo $?
0 # webbrowser module exists in system

$ python -c "import bs4"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named bs4
$ echo $?
1 # bs4 module does not exist in system

To get a list of installed modules without pip, type help("modules") in any Python console. To do the same thing using pip, type pip list.


The Python scripts introduced in this post are mostly focused on data extraction and analysis. Even if you are not a programmer, you can use most of them by simply copying and pasting. If you are already familiar with Python and HTML, you can of course edit and customize the templates.


Python SEO Analyzer

The Python SEO Analyzer is a comprehensive crawler for your website or XML sitemap that analyzes some of the most important ranking factors: page title, meta description, keywords, image alt text, and word count. This is useful for automating simple audits.
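
If you prefer to call it from your own scripts rather than the command line, usage looks roughly like the sketch below. This assumes the pyseoanalyzer package and its analyze function; check the project's README for the current interface.

import json
from seoanalyzer import analyze  # assumes the pyseoanalyzer package is installed

# Crawl the site (the sitemap argument is optional) and collect the results as a dict
output = analyze('https://www.example.com/', 'https://www.example.com/sitemap.xml')

# Dump the full report, or pull out individual fields such as titles and word counts
print(json.dumps(output, indent=2))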


URL scraping

This article explains three ways to scrape search results pages in bulk using Python. The author provides an example of scraping hostel data from Hostelworld, but the same workflow can be adapted to your specific needs. For a beginner-friendly introduction to web scraping with Python, I recommend the web scraping section of Automate the Boring Stuff with Python.
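
As a rough sketch of what that workflow looks like with requests and bs4 (the URL, query parameters, and CSS selector below are placeholders you would swap for the site you are scraping):

import requests
from bs4 import BeautifulSoup

collected = []
for page in range(1, 4):  # first three result pages; widen the range as needed
    # Placeholder URL and parameters; most listing sites paginate with a similar query string
    resp = requests.get('https://www.example.com/search', params={'q': 'hostels', 'page': page})
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Placeholder selector for the links you want to extract from each results page
    for link in soup.select('a.result-title'):
        collected.append(link.get('href'))

print('{} URLs collected'.format(len(collected)))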


Screaming Frog Shingling

Most SEOs are probably already familiar with Screaming Frog, but this Python script can make your audit process easier. Screaming Frog Shingling uses a Screaming Frog internal HTML export with text extraction, along with a shingling algorithm, to compare possible duplicate content across your pages. Once the comparison is complete, it creates a CSV file with a similarity score for each pair of pages.
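
To give an idea of what the comparison does (this is a simplified illustration, not the script's actual code): each page's extracted text is broken into overlapping word n-grams, called shingles, and the similarity score is the Jaccard overlap between the two shingle sets.

def shingles(text, size=4):
    # Break the text into overlapping word n-grams ("shingles")
    words = text.lower().split()
    return {' '.join(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(text_a, text_b, size=4):
    # Jaccard similarity of the two shingle sets: 1.0 = identical, 0.0 = no overlap
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(similarity('extracted text of page one ...', 'extracted text of page two ...'))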


Optimize Images

Optimize Images is a command-line interface (CLI) utility to help you reduce image file sizes and boost page speed and performance.
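
Basic usage from the terminal looks something like this, assuming the package installs an optimize-images command (run optimize-images --help to see what your version supports).

$ pip install optimize-images
$ optimize-images ./img   # placeholder folder; reduces the file size of the images it finds there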


Check for Page Updates

This depends on how the page is served, but you can start by checking whether the Last-Modified or ETag response headers have changed.

import requests

url = 'yoururl'
headers = requests.get(url).headers  # response headers (lookups are case-insensitive)

# .get() returns None instead of raising a KeyError when a header is missing
last_mod = headers.get('Last-Modified')
etag = headers.get('ETag')

print('ETag: {}'.format(etag))
print('Last-Modified: {}'.format(last_mod))

If Last-Modified and ETag are not available, you can try a workaround with hashlib or http.client (httplib in Python 2), or check the urlwatch project.
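
As a minimal sketch of the hashlib approach: hash the response body and compare it with the hash saved on the previous run (here the previous hash is just kept in a local text file).

import hashlib
import requests

url = 'yoururl'
body = requests.get(url).content
current_hash = hashlib.sha256(body).hexdigest()

# Load the hash from the previous run, if there was one
try:
    with open('page_hash.txt') as f:
        previous_hash = f.read().strip()
except FileNotFoundError:
    previous_hash = None

if current_hash != previous_hash:
    print('Page changed (or first run)')
    with open('page_hash.txt', 'w') as f:
        f.write(current_hash)
else:
    print('No change detected')

Keep in mind that dynamic elements such as timestamps or tracking parameters will change the hash even when the content you care about has not, which is one reason to look at urlwatch, since it can filter out parts of the page before comparing.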


Time Series Analysis

If your business is seasonal or cyclical, you will need time series analysis to compare results over the relevant periods. There are many ways to approach it, but this comprehensive article explains how to create an additive model for financial time series data using Python and the Prophet forecasting package developed by Facebook.
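
As a minimal sketch of the same approach applied to SEO data, assuming a CSV of daily organic sessions with the two column names Prophet expects ('ds' for the date and 'y' for the value):

import pandas as pd
from prophet import Prophet  # in older versions: from fbprophet import Prophet

# Prophet expects a DataFrame with a 'ds' (date) column and a 'y' (value) column
df = pd.read_csv('organic_sessions.csv')  # placeholder file with 'ds' and 'y' columns

model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
model.fit(df)

# Forecast 90 days ahead and plot the trend, weekly, and yearly components
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)
model.plot_components(forecast)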


External Links

The code below, only slightly modified from the template in Web Scraping with Python published by O'Reilly, outputs all external links from a given URL.

import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urlsplit

ext = set()

def getExt(url):
    # Domain of the page being crawled, used to filter out internal links
    netloc = urlsplit(url).netloc
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    # Only look at absolute links (http:// or https://)
    for link in bs.find_all('a', href=re.compile('^https?://')):
        href = link.attrs['href']
        # Skip links that point back to the same domain
        if netloc not in href:
            ext.add(href)

getExt('yoururl')
for i in ext:
    print(i)

Crawl 404

As the name suggests, Crawl 404 checks the pages listed in an XML sitemap to find 404 errors, i.e., broken links. It works by collecting links (the "href" attribute), checking their response codes, and dumping the URLs that return 404 to a text file.
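
The sketch below follows the same idea (it is not the Crawl 404 source): read the page URLs from the sitemap, collect the links on each page, and write any that return a 404 to a text file. The sitemap URL is a placeholder, and the 'xml' parser requires lxml to be installed.

import requests
from bs4 import BeautifulSoup

sitemap_url = 'https://www.example.com/sitemap.xml'  # placeholder sitemap URL
sitemap = BeautifulSoup(requests.get(sitemap_url).text, 'xml')
pages = [loc.text for loc in sitemap.find_all('loc')]

broken = set()
for page in pages:
    soup = BeautifulSoup(requests.get(page).text, 'html.parser')
    # Check every absolute link on the page and keep the ones that return 404
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('http') and requests.head(href, allow_redirects=True).status_code == 404:
            broken.add(href)

with open('404s.txt', 'w') as f:
    f.write('\n'.join(sorted(broken)))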



Python can of course be useful for automating everyday workflows beyond SEO tasks. The shutil module handles basic interactions with your local file system, such as copying, renaming, and moving files. I recently started using it to automate my invoicing workflow, which has saved a lot of time on operational tasks.


See my repository here. This Python script creates a new folder for organizing invoices and checks that the file sizes are not too large.
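
The version in the repository is specific to my setup, but a rough sketch of the idea looks like this (the folder name, file pattern, and size limit are placeholders):

import os
import shutil

folder = 'invoices-2022-01'   # placeholder folder name
limit = 2 * 1024 * 1024       # placeholder size limit: 2 MB

# Create the folder if it does not exist, then move the invoice PDFs into it
os.makedirs(folder, exist_ok=True)
for name in os.listdir('.'):
    if name.lower().endswith('.pdf'):
        if os.path.getsize(name) > limit:
            print('Warning: {} is larger than {} bytes'.format(name, limit))
        shutil.move(name, os.path.join(folder, name))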