Automatic sitemap generation with Jekyll

14 Jan 2021 - tsp
Last update 14 Jan 2021
Reading time 4 mins

This is a short summary of how to use the jekyll-sitemap plugin to automatically generate a sitemap.xml when building a static webpage with Jekyll - like this page is built - how to reference the sitemap from your robots.txt and how to exclude specific directories - for example directories containing static PDFs - from the sitemap.

What is a sitemap anyway?

Basically a sitemap is just a list of all pages that make up a website. Several formats are supported - most commonly a simple plain text format that just contains the fully qualified URI of every page, one per line. Stored with UTF-8 encoding, such a file would look like the following:

https://www.example.com/page1.html
https://www.example.com/page2.html
https://www.example.com/page3.html

The second major format is an XML based format that is also stored in a UTF-8 encoded text file. Such a file would look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
 <url>
  <loc>http://example.com/</loc>
  <lastmod>2021-01-14</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.95</priority>
 </url>
 <url>
  <loc>http://example.com/page1.html</loc>
  <lastmod>2020-06-05</lastmod>
  <changefreq>monthly</changefreq>
  <priority>1.0</priority>
 </url>
</urlset>

The idea behind sitemaps is to allow better indexing of websites. Usually a web crawler tries to follow links inside a website - in case there are pages that are not linked correctly it's possible for a crawler to miss some of them. The sitemap usually does not magically add pages to a search engine's index - the crawler still works as usual - but it helps with debugging problems. In Google's Search Console, for example, one can verify how many pages were submitted via the sitemap and how many of these pages are actually indexed. The sitemap also gives search engines an indication of which content one considers good search engine landing pages oneself - for example one should not list index pages without high quality content.

Additionally the XML format allows one to specify a change interval - in addition to the classic HTTP Expires header - to somewhat control crawling frequency. The mentioned Jekyll plugin by default only generates loc and lastmod entries.

In case one wants to add additional information it's usually best to build a custom sitemap.xml template instead.
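As a rough sketch - not what the plugin generates, and with changefreq chosen purely for illustration - such a template could be placed as sitemap.xml in the site root. The front matter block tells Jekyll to process the Liquid tags:

---
layout: null
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% for post in site.posts %}
 <url>
  <loc>{{ site.url }}{{ post.url }}</loc>
  <lastmod>{{ post.date | date: "%Y-%m-%d" }}</lastmod>
  <changefreq>monthly</changefreq>
 </url>
{% endfor %}
</urlset>

This only iterates over blog posts; static pages could be added with a similar loop over site.pages. Note that jekyll-sitemap steps aside when a hand-written sitemap.xml exists.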

Installing required plugins

The sitemap plugin itself is contained in the jekyll-sitemap gem. In addition it's advisable to install jekyll-last-modified-at to supply the correct last modified date to the sitemap. On FreeBSD one would install the gems system wide using:

$ gem install jekyll-sitemap
$ gem install jekyll-last-modified-at

On other systems - or when using a Gemfile - one would simply add the plugins there and re-run bundler.
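With a Gemfile based setup this boils down to adding the gems - conventionally inside the jekyll_plugins group - and running bundle install:

group :jekyll_plugins do
  gem "jekyll-sitemap"
  gem "jekyll-last-modified-at"
end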

After that one can add the plugins to the _config.yml:

plugins:
  - jekyll-sitemap
  - jekyll-last-modified-at

Including and excluding pages

First off the plugin by default includes all resources that have been processed by Jekyll into the sitemap. This can be controlled using the sitemap front matter variable. If one only has a small number of files that should not be included in the sitemap one can simply set sitemap: false on those pages (a short front matter example follows after the configuration below). For larger areas it's more convenient to set defaults inside _config.yml. If one - for example - wants to exclude all static PDF asset files contained inside the /assets/pdf directory as well as up to three levels of sub directories, one could do this with the following configuration:

defaults:
  -
    scope:
      path: "/assets/pdf/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*/*.pdf"
    values:
      sitemap: false
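The per-page variant mentioned above just requires the sitemap key in the front matter of the page or post in question - for example (layout and title are of course placeholders):

---
layout: post
title: "Some unlisted page"
sitemap: false
---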

Publishing the sitemap

The sitemap can then either be submitted manually to all desired search engines or referenced from robots.txt - the latter of course being the preferred way. Just as a reminder: robots.txt allows one to give a hint to search engines which resources one would like to have indexed and which ones should not be crawled - this can even be done on a per crawler basis, though honoring it is entirely voluntary for the crawlers. One can simply add a sitemap reference - the sitemap is by default generated at /sitemap.xml - to allow search engines to discover the sitemap automatically:

User-agent: *
Disallow:
Allow: /
Sitemap: https://www.tspi.at/sitemap.xml
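To quickly check the result one can build the site and inspect the generated file - assuming a bundler based setup:

$ bundle exec jekyll build
$ head _site/sitemap.xml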
