Semantic search has become a key ingredient in making websites more accessible and discoverable. Instead of matching keywords, semantic search interprets meaning: it retrieves content that is similar in meaning to a query or another document. This article explains how I built an automatic "Similar Articles" recommender for a static blog built with Jekyll (the approach is generator-agnostic) - you can see the result of the recommender on every article page of this blog. I will go from the basics of embeddings and cosine similarity, through PostgreSQL with pgvector, to the actual pipeline and integration with Jekyll.

The above image shows an example set of recommendations generated by the script that is described in this article.
From Keywords to Semantics
Traditional search engines operated on keywords for a very long time. A query like "neural networks" will only match articles containing those exact words. Of course they also perform stopword removal, work with word triplets, assign statistical weights and so on - but in essence it is still keyword matching. Semantic search, in contrast, projects texts into a high-dimensional vector space where semantically similar texts lie close together. This is achieved using embeddings - dense, high-dimensional numerical representations of meaning. We have looked into those before.
Cosine Similarity
Let's quickly recall cosine similarity - one of the most common ways to measure similarity between embeddings. Each text embedding is a vector in an N-dimensional space. Cosine similarity measures the angle between two vectors: the smaller the angle, the more similar the texts.
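As a quick illustration (this helper is not part of the tool, which lets PostgreSQL compute distances via pgvector), cosine similarity can be implemented in a few lines of plain Python:

import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy example with made-up three-dimensional "embeddings"
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.15]))  # close to 1.0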
Visualizing Embeddings
Before describing the implementation we can ask ourselves what this actually looks like. To demonstrate what embedding vectors are and what you can actually do with them, I render interactive visualizations of this blog's data here. These views let the reader pan, tilt, zoom, and rotate the projected point cloud, and they can also choose which principal vectors to display. When JavaScript is enabled the display is fully interactive; if it is disabled we fall back to static images.
I created two datasets for this: one in which every so-called chunk is shown and clustered into 24 clusters, and one in which only the article centroids are shown and clustered into 16 groups. In both cases I applied k-means clustering after dimensionality reduction so that the reader can immediately see how groups of semantically related content emerge. Hovering the cursor over any point reveals which article it corresponds to, so the semantic relation between posts becomes directly tangible.
The choice of 24 clusters for the chunk-level view and 16 clusters for the article-level view is deliberate: the larger number captures the greater variability between fine-grained text fragments, while the smaller number emphasizes broader themes when only article centroids are shown. These values were determined experimentally and provide a balance between readability and resolution - for a practical application you could use a metric such as the gap statistic to determine the optimal number of clusters, but this has not been done here. In this way the visualization goes beyond a static scatter plot and turns the abstract mathematics of embeddings into an intuitive, explorable landscape of the blog's content.
Chunks (2D projection, main principal components)
Chunks (3D projection, selectable principal components)
Articles (2D projection, main principal components)
Articles (3D projection, selectable principal components)
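The figures above were generated offline; the following is a minimal sketch of how such a projection and clustering can be produced with scikit-learn, assuming the chunk embeddings have already been exported to a NumPy array (the file name and shapes are illustrative, and the actual pipeline for the figures differs in details):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical export: one row per chunk embedding, e.g. shape (n_chunks, 1024)
embeddings = np.load("chunk_embeddings.npy")

# Project onto the main principal components for a 2D/3D scatter plot
coords = PCA(n_components=3).fit_transform(embeddings)

# Cluster in the reduced space (24 clusters for chunks, 16 for article centroids)
labels = KMeans(n_clusters=24, n_init=10, random_state=0).fit_predict(coords)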
Choosing the Right Database Backend
There are many ways to store and query embeddings, and each option comes with its own strengths and trade-offs.
PostgreSQL with pgvector
PostgreSQL is a general-purpose relational database that has proven itself as extremely stable and standards-conformant. With the pgvector extension it can natively store embedding vectors and perform similarity queries. Because it is still a full SQL database, you can combine semantic queries with relational joins, full-text search, and even graph queries (for example via Apache AGE). This makes it an excellent choice if embeddings are only one part of a larger system that also needs traditional relational data management or graphs.
PostgreSQL has been in continuous development since 1986 (originally as the POSTGRES project at the University of California, Berkeley) and saw its first public release in the mid-1990s. It is widely regarded as one of the most stable and mature open-source databases. The project is actively maintained by a large global community and commercial contributors, with major releases coming regularly every year. PostgreSQL is released under the permissive PostgreSQL License, a liberal license similar to the MIT license, which makes it free and open source for both academic and commercial use. This long history and liberal license are key reasons why it is trusted for critical applications across industries.
Chroma
Chroma is a purpose-built vector database written in Python. It is very easy to integrate into machine learning workflows and prototypes. Developers can quickly spin up a local instance, insert embeddings, and query them with minimal code. The downside is that it is less suited for heavy production workloads, and support on less common platforms (such as FreeBSD) can be problematic. Still, it shines in research environments and small projects where fast iteration matters more than long-term operational stability.
Chroma is a relatively young project, having emerged in the 2020s, and is under active development with a fast-moving feature set. It is released under the Apache 2.0 open-source license. While this makes it attractive for experimentation and integration into AI projects, the rapid development pace means that breaking changes and evolving APIs should be expected.
Faiss
Faiss (Facebook AI Similarity Search) is a C++ library with Python bindings designed for extremely fast nearest-neighbor searches in high-dimensional vector spaces. It offers GPU acceleration and supports a wide range of indexing strategies (IVF, HNSW, product quantization, etc.), making it ideal for very large datasets where performance is critical. However, Faiss is a library, not a database - so you need to build your own storage, persistence, and metadata layers around it.
Faiss was first released by Facebook AI Research (FAIR) in 2017 and has since become one of the standard tools for large-scale similarity search. It is released under the MIT license, is stable and widely used in both academic and industrial settings, and continues to be actively maintained and extended by the open-source community.
The choice for this project
In my case I chose PostgreSQL with pgvector. This choice was rooted mainly in the fact that I want more than just vector search: standard SQL queries, relational joins, metadata storage, and overall database stability - at least in other projects. On top of that, stability and reliability as well as platform independence are key requirements for me, so the choice was easy.
The Pipeline
Gathering and Indexing Data
My tool iterates over all rendered HTML files generated by Jekyll, which are usually stored in the _site/ directory after a jekyll build command. For each file it first extracts the main content using BeautifulSoup, deliberately ignoring navigation, sidebars, ads or contact information so that only real article text is processed. This approach not only reduces unnecessary data but also makes it easier to detect whether an article's content has truly changed, as opposed to layout or metadata adjustments.
The extracted HTML is then converted into Markdown, a format that is simpler to handle and more natural for large language models and embedding transformers. The Markdown text is split into overlapping chunks of roughly 400 tokens each, with an overlap of about 100 tokens. This chunking is necessary because embedding models typically work best with limited context sizes, and the overlap helps compensate for random sentence cuts at the edges of chunks.
For every chunk, an embedding vector is generated using a transformer model. These embeddings capture the semantic meaning of the text and are later used for similarity comparisons. To avoid unnecessary recomputation, the system computes a SHA hash of each file's content and only regenerates embeddings when the actual content has changed.
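The change-detection hash can be as simple as the following sketch (the helper name and the choice of SHA-256 are illustrative; the tool only needs a stable digest of the extracted content):

import hashlib

def content_hash(markdown_text: str) -> str:
    # Hash only the extracted article text, so pure layout or navigation changes
    # do not trigger a costly re-embedding of the page.
    return hashlib.sha256(markdown_text.encode("utf-8")).hexdigest()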
Below are the core code fragments that implement this pipeline setup and extraction logic:
# Defaults relevant to crawling and chunking
DEFAULTS = {
    "site_root": "_site",
    "exclude_globs": ["tags/**", "drafts/**", "private/**", "admin/**"],
    "content_ids": ["content"],
    "chunk": {"max_tokens": 800, "overlap_tokens": 80},
}
def select_content_element(soup, ids_or_classes):
    """Return inner HTML of the first element matched by the provided list.
    Each entry can be an id, class, or any CSS selector."""
    for key in ids_or_classes:
        key = (key or "").strip()
        if not key:
            continue
        if key.startswith(("#", ".")) or any(ch in key for ch in " >:+~[]"):
            el = soup.select_one(key)
            if el:
                return el.decode_contents()
        el = soup.find(id=key) or soup.find(class_=key)
        if el:
            return el.decode_contents()
    return None
def html_to_markdown(inner_html: str) -> str:
    try:
        import markdownify
        return markdownify.markdownify(inner_html, heading_style="ATX")
    except Exception:
        soup = BeautifulSoup(inner_html, "lxml")
        for a in soup.find_all("a"):
            a.replace_with(f"{a.get_text(strip=True)} ({a.get('href','')})")
        return soup.get_text("\n", strip=True)
def chunk_markdown(md: str, max_tokens=800, overlap=80):
    # Rough heuristic: one word corresponds to about 1.3 tokens on average
    words = md.split()
    approx_ratio = 1 / 1.3
    max_words = int(max_tokens * approx_ratio)
    overlap_words = int(overlap * approx_ratio)
    out, i = [], 0
    while i < len(words):
        j = min(len(words), i + max_words)
        out.append(" ".join(words[i:j]))
        if j == len(words):
            break
        i = max(0, j - overlap_words)  # step back to create the overlap
    return out
Generating Embeddings
There are two embedding providers that my script currently supports, each with its own story and trade-offs.
Ollama is the default in my setup. It allows me to run models such as mxbai-embed-large locally through a simple REST API. This makes deployment straightforward, as the only Python dependency is the requests library. Under the hood, Ollama is capable of GPU acceleration, and because it exposes a network API, it is possible to distribute embedding generation across multiple machines in a cluster. Ollama is a fairly young project, but it has quickly become popular because it lowers the entry barrier to running strong embedding (and large language) models without the need to integrate a whole machine learning framework directly into your code. For me, the decisive advantages are that it avoids any external costs (remember that even modest costs add up very fast, especially once you have automated tasks), it scales out to multiple indexers, and it remains extremely simple to operate.
OpenAI, on the other hand, provides a cloud-based embedding service as part of their API platform. These embeddings are generated with high-quality transformer models - though for most applications the quality is comparable to what we can achieve with Ollama. Even on very powerful GPUs there may be a small performance edge, but in practice this is usually negated by the latency of sending requests over the internet. The most important difference is the cost structure: pricing is per token, which can accumulate quickly if you need to embed an entire large corpus (for a small blog you will certainly not notice it, even if you regenerate the index every day). In addition, using OpenAI ties your pipeline to an external service, which can be acceptable for some use cases but runs counter to my preference for self-hosting and independence - if you build databases that should last for decades or even longer, you cannot rely on an external cloud service to keep operating. In my opinion the advantage of using OpenAI's embeddings is negligible for this application - in contrast to large language models, where local execution of larger models is usually prohibitive and the cloud is the only economically sane solution.
In the end, I prefer Ollama for day-to-day work, while acknowledging that OpenAI's embeddings may be attractive for teams who value a managed service and are willing to pay for convenience and scalability - or who run serverless, on-demand scalable services.
It is worth mentioning BERT here as well. BERT, short for Bidirectional Encoder Representations from Transformers, was one of the first transformer models widely used for generating embeddings. When I started experimenting with semantic similarity I initially used BERT models, typically loaded through Hugging Face Transformers. They produce quality embeddings and have the advantage of being free and well documented. However, they also introduce a heavy dependency chain: you need to install large Python libraries, manage model weights, and often pull in GPU-specific (and operating-system-specific) tooling. For lightweight pipelines and tools meant to be distributed or run in many environments this becomes cumbersome. Because of these additional dependencies and operational complexity, I decided not to include BERT in the final tool, even though it was the starting point of my experimentation.
The embedding calls and provider abstraction are tiny and explicit:
def embed_texts_ollama(texts, model, url):
    embs = []
    for txt in texts:
        r = requests.post(url, json={"model": model, "prompt": txt}, timeout=(20, 600))
        r.raise_for_status()
        embs.append(r.json()["embedding"])
    return embs

def embed_texts_openai(texts, model, base, api_key_env):
    key = os.environ.get(api_key_env)
    if not key:
        raise RuntimeError(f"OpenAI API key missing (env {api_key_env})")
    headers = {"Authorization": f"Bearer {key}"}
    r = requests.post(base, json={"model": model, "input": texts}, headers=headers, timeout=600)
    r.raise_for_status()
    js = r.json()
    return [d["embedding"] for d in js["data"]]

def embeddings_for(texts, cfg):
    emb = cfg["embedding"]
    if emb["provider"] == "ollama":
        return embed_texts_ollama(texts, emb["model"], emb["ollama_url"])
    return embed_texts_openai(texts, emb["model"], emb["openai_api_base"], emb["openai_api_key_env"])

def detect_embedding_dim(cfg):
    vecs = embeddings_for(["dimension probe"], cfg)
    if not vecs or not vecs[0]:
        raise RuntimeError("Failed to detect embedding dimension from provider.")
    return len(vecs[0])
Database Schema
The storage layer is deliberately small and transparent. The pages table holds one row per rendered article, keyed by its URL-like path (primary key). I decided to use this path as the key since the pages I use this for all operate at the root of their domain, so it fits the URL scheme. Alongside this identifier we store a content_hash (a SHA over the content area) to detect real changes and avoid needless re-embedding, the optionally extracted OpenGraph metadata (title, description, image) for later presentation, and a centroid vector which is the arithmetic mean of all chunk embeddings for that page. Operational columns include updated_at, automatically set to the current timestamp on each upsert, and is_public, a boolean that lets the pipeline exclude drafts or private sections from recommendation generation. This flag is currently not supported - but may be used later on.
The chunks table contains the text fragments produced during chunking. Each row stores the parent path, an ord field that preserves the original order of chunks within the page, the chunk's Markdown text in text_md, and its high-dimensional embedding vector. The column path is declared as a foreign key to pages(path) with ON DELETE CASCADE, ensuring that removing a page also removes all of its chunks and keeping the database free of orphans. This deviates slightly from the scheme I would usually use (an artificially generated integer or UUID primary key, an index over path, and the foreign key referencing that artificial primary key).
To make similarity search efficient, we create conventional B-tree indexes where appropriate and vector indexes where they matter. There is an index on chunks(path) to fetch all chunks of a page quickly, and helper indexes on pages(path), pages(updated_at), and pages(is_public) to support frequent filters and maintenance queries. For vector retrieval we use pgvector IVF-Flat indexes on chunks.embedding and on pages.centroid, both created with USING ivfflat (… vector_cosine_ops) WITH (lists = 100). The cosine operator class gives us cosine distance as the ranking metric (smaller means more similar), while IVF-Flat provides approximate nearest-neighbor search that scales well for millions of vectors.
For readers who prefer seeing the schema as DDL, here is the exact creation snippet used by the tool:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS pages (
    path TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    title TEXT,
    description TEXT,
    image TEXT,
    centroid vector(DIM),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    is_public BOOLEAN NOT NULL DEFAULT true
);

CREATE TABLE IF NOT EXISTS chunks (
    id BIGSERIAL PRIMARY KEY,
    path TEXT NOT NULL REFERENCES pages(path) ON DELETE CASCADE,
    ord INTEGER NOT NULL,
    text_md TEXT NOT NULL,
    embedding vector(DIM) NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_chunks_path ON chunks(path);
CREATE INDEX IF NOT EXISTS idx_pages_path ON pages(path);
CREATE INDEX IF NOT EXISTS idx_pages_updated_at ON pages(updated_at);
CREATE INDEX IF NOT EXISTS idx_pages_is_public ON pages(is_public);
CREATE INDEX IF NOT EXISTS idx_chunks_embedding ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX IF NOT EXISTS idx_pages_centroid ON pages USING ivfflat (centroid vector_cosine_ops) WITH (lists = 100);
In short, pages captures one vector summary per article together with presentation metadata and housekeeping flags, chunks stores the fine-grained embeddings tied back to their page via a strict foreign key, and a small set of carefully chosen indexes keeps both maintenance and recommendation queries fast.
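The centroid column mentioned above is simply the arithmetic mean of a page's chunk embeddings; here is a minimal sketch of how it can be computed and serialized for pgvector (the helper names are illustrative, the tool's internals may differ):

import numpy as np

def centroid_of(chunk_embeddings):
    # pages.centroid is the arithmetic mean of all chunk vectors of that page
    return np.mean(np.asarray(chunk_embeddings, dtype=float), axis=0).tolist()

def to_pgvector_literal(vec):
    # pgvector accepts a vector as a text literal of the form '[v1,v2,...]'
    return "[" + ",".join(str(x) for x in vec) + "]"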
Error Handling
The script retries failed embedding requests, ensures dimension consistency, and validates that embeddings fit the expected size. On mismatch, a resetdb command rebuilds the schema. The typical failure behaviour at the moment is simply crashing and raising an exception. This propagates as a non-zero exit code to the upstream application - in my case the Jenkins build system - which is sufficient for detecting and reacting to errors.
Two examples of explicit error raising are the missing OpenAI API key and an embedding dimension mismatch:
if not key:
raise RuntimeError(f"OpenAI API key missing (env {api_key_env})")
if any(len(e) != dim_probe for e in embs):
raise RuntimeError("Embedding dimension mismatch; run 'resetdb' after changing model/provider.")
Because the tool is invoked as a CLI, any uncaught exception will terminate the process with a non-zero exit status, which Jenkins (or any orchestrator or build automation tool) can interpret as a failed build step.
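The retry behaviour can be a thin wrapper around the embedding calls; a sketch assuming simple exponential backoff (the actual retry parameters in the tool may differ):

import time

def with_retries(fn, attempts=3, backoff=2.0):
    """Call fn(); on failure wait and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff ** attempt)

# e.g. embs = with_retries(lambda: embeddings_for(chunk_texts, cfg))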
Generating Related Articles
Once indexed, the system generates a JSON file that maps each page to a curated set of neighbors. For every page I first fetch a pool of ksample candidates from the database, which are simply the closest articles according to cosine distance on their centroid embeddings. From this larger pool I then select k items that will actually appear as the "similar articles" recommendation. The reduction step is important because the nearest-neighbor set often contains many articles that are semantically close in a very narrow way; I do not want to present a monotonous list but rather a balanced sample that still respects similarity - and also add some feeling of dynamics to the otherwise static page with every page rebuild.
To achieve this balance I apply Boltzmann sampling. The intuition comes from statistical mechanics: each candidate with distance value $d$ is given a weight proportional to $e^{-\frac{d - d_{min}}{T}}$, where $d_{min}$ is the smallest distance in the pool and $T$ is the temperature parameter. At low temperatures (small $T$) the sampling distribution is sharply peaked, favoring the very best matches, while at higher temperatures the distribution flattens out, allowing more diversity. In practice we use $T = 0.7$, which is a good compromise between determinism and variety. This way the same article will not always be paired with exactly the same neighbors, but the selected ones remain recognizably close in meaning.
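To get a feeling for the temperature parameter, here is a small numeric illustration with made-up distances (not part of the tool):

import math

distances = [0.10, 0.15, 0.30, 0.50]  # cosine distances of four candidates (made up)
for T in (0.1, 0.7, 5.0):
    w = [math.exp(-(d - min(distances)) / T) for d in distances]
    probs = [round(x / sum(w), 2) for x in w]
    print(T, probs)
# Low T concentrates probability on the closest candidates,
# while higher T flattens the distribution and admits more diversity.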
The configuration knobs live in the defaults and are overridable in your JSON config:
"neighbors": {
"ksample": 16,
"k": 8,
"temperature": 0.7,
"pin_top": True,
"seed": None,
"seealso": 4
}
The candidate retrieval and Boltzmann sampling are implemented as follows (simplified to the essential parts):
-- within cmd_genrel():
SELECT path,
centroid <=> (SELECT centroid FROM pages WHERE path = %s) AS dist
FROM pages
WHERE centroid IS NOT NULL AND is_public AND path <> %s
ORDER BY dist
LIMIT %s;
def boltzmann_sample(cands, k, temperature=0.7, pin_top=True, seed=None):
    """Sample k paths from (path, distance) candidates using Boltzmann weights."""
    if not cands:
        return []
    k = min(k, len(cands))
    rng = random.Random(seed) if seed is not None else random
    chosen, rest = [], cands[:]
    if pin_top and k > 0:
        # Always keep the single closest candidate
        chosen.append(rest[0][0])
        rest = rest[1:]
        k -= 1
    if k == 0 or not rest:
        return chosen
    dmin = min(d for _, d in rest)
    T = max(float(temperature), 1e-6)
    weights = [math.exp(-(d - dmin) / T) for _, d in rest]
    items = list(zip([p for p, _ in rest], weights))
    selected = []
    for _ in range(k):
        total = sum(w for _, w in items)
        if total <= 0:
            idx = rng.randrange(len(items))
        else:
            # Roulette-wheel selection proportional to the Boltzmann weights
            r = rng.random() * total
            acc, idx = 0.0, len(items) - 1
            for i, (_, w) in enumerate(items):
                acc += w
                if acc >= r:
                    idx = i
                    break
        selected.append(items[idx][0])
        items.pop(idx)  # sample without replacement
    return chosen + selected
After sampling we exclude already chosen items and, when configured, add a handful of random suggestions labelled as "See also". These provide serendipitous entry points into other parts of the blog that are not semantically near the source article but may still be interesting for the reader. The final mapping is written into _data/related.json, which Jekyll can consume to render recommendation boxes on each article page.
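The "See also" selection is essentially a random draw from the remaining public pages; a sketch of the idea (the helper is hypothetical, not the tool's actual function):

import random

def pick_seealso(all_paths, path, related, n):
    # Exclude the page itself and the already selected related articles,
    # then draw a few random entry points from the rest of the blog.
    pool = [p for p in all_paths if p != path and p not in set(related)]
    return random.sample(pool, min(n, len(pool)))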
For completeness, the final packaging step constructs each entry's payload and writes the JSON:
def pack(p: str) -> Dict[str, str]:
    mi = meta.get(p, {})
    title = mi.get("title") or p.strip("/").split("/")[-1].replace("-", " ").title()
    desc = mi.get("desc") or ""  # Optionally one can fetch only a subset of characters to limit length
    img = mi.get("image") or ""
    return {"url": p, "title": title, "desc": desc, "image": img}

result[path] = {
    "related": [pack(p) for p in related],
    "seealso": [pack(p) for p in seealso]
}

out_file.write_text(json.dumps(result, ensure_ascii=False, indent=2), "utf-8")
print(f"Wrote {out_file}")
Jekyll Integration
In Jekyll templates (Liquid), I read the JSON and render a "Related Articles" section at the end of each post. The Liquid code below is what I currently use (provided via an include to different layouts):
{%- assign key = page.url | relative_url -%}
{%- assign lastchar = key | split: '' | last -%}
{%- if key == "/" -%}
{%- assign key = "/index.html" -%}
{%- elsif key contains ".html" -%} <!-- -->
{%- elsif lastchar == "/" -%}
{%- assign key = key | append: "index.html" -%}
{%- else -%}
{%- assign key = key | append: "/index.html" -%}
{%- endif -%}
{%- assign block = site.data.related[key] -%}
{% if block %}
{% if block.related and block.related.size > 0 %}
<div class="related">
<h2> Related articles </h2>
<div class="relatedgrid">
{% for it in block.related %}
<div class="relcard">
<a href="{{ it.url | relative_url }}">
{% if it.image and it.image != "" %}<img src="{{ it.image | relative_url }}" alt="">{% else %}<img src="/assets/images/png/unknownpage_small.png" alt="">{% endif %}
<h3>{{ it.title }}</h3>
{% if it.desc %}<p>{{ it.desc }}</p>{% endif %}
</a>
</div>
{% endfor %}
</div>
</div>
{% endif %}
{% if block.seealso and block.seealso.size > 0 %}
<div class="related">
<h2> Also on this blog </h2>
<div class="relatedgrid">
{% for it in block.seealso %}
<div class="relcard">
<a href="{{ it.url | relative_url }}">
{% if it.image and it.image != "" %}<img src="{{ it.image | relative_url }}" alt="">{% else %}<img src="/assets/images/png/unknownpage_small.png" alt="">{% endif %}
<h3>{{ it.title }}</h3>
{% if it.desc %}<p>{{ it.desc }}</p>{% endif %}
</a>
</div>
{% endfor %}
</div>
</div>
{% endif %}
{% endif %}
This template logic normalizes the current page URL into a key that matches the entries in _data/related.json. It then looks up the corresponding block of recommendations. If related articles exist, it renders them with their title, description, image and link. If configured, it also renders an additional section called "Also on this blog" that displays a few random suggestions. Each entry gracefully falls back to a placeholder image if no specific image is available. The result is that every article page automatically ends with a visually consistent block of recommendations that encourage further exploration.
Configuration
Before using the tool it is important to understand its configuration file. By default, blogsimi looks for a JSON configuration file at ~/.config/blogsimilarity.cfg. This file describes where the rendered site is located, how embeddings should be generated, and how the PostgreSQL database can be reached. It also defines parameters for chunk sizes and neighbor selection.
A minimal configuration might look like this:
{
  "site_root": "_site",
  "data_out": "_data/related.json",
  "embedding": {
    "provider": "ollama",
    "model": "mxbai-embed-large",
    "ollama_url": "http://127.0.0.1:11434/api/embeddings"
  },
  "db": {
    "host": "127.0.0.1",
    "port": 5432,
    "user": "blog",
    "password": "blog",
    "dbname": "blog"
  },
  "neighbors": {
    "ksample": 16,
    "k": 8,
    "temperature": 0.7,
    "seealso": 4
  }
}
Here site_root points to the rendered HTML directory, data_out specifies the JSON file that Jekyll will later consume, embedding configures the embedding provider and model, and db holds the connection details to PostgreSQL. The neighbors section tunes how many candidates are sampled and how many recommendations will be shown, as well as the temperature parameter for Boltzmann sampling and the number of random "see also" entries. You can also override the glob patterns that exclude certain directories (for example drafts/** or private/**) and adjust the chunking parameters (max_tokens and overlap_tokens) to control how the text is split before embedding.
If no configuration file is found, the tool falls back to its internal defaults. Every option can be overridden by supplying a different configuration file with the --config option.
Using the blogsimi CLI
This project ships as a single script (blogsimi) providing a small CLI. It is installable via PyPI; the source is available on GitHub. To install the package you can simply execute the usual pip command (assuming the package name on PyPI matches the script name):
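pip install blogsimi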
All commands read the configuration from ~/.config/blogsimilarity.cfg by default; use --config to point to a different file.
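For example, to run the indexer against a non-default configuration file (the path is only an illustration):

blogsimi index --config /path/to/alternative.cfg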
One-time database setup
A PostgreSQL superuser (or a role with sufficient privileges) must create the role and database and enable the pgvector extension; this requires superuser privileges:
-- as a PostgreSQL superuser
CREATE ROLE blog LOGIN PASSWORD 'blog';
CREATE DATABASE blog OWNER blog;
\c blog
CREATE EXTENSION IF NOT EXISTS vector; -- requires superuser or appropriate privileges
Then you can initialize the tables via the tool. When you change the embedding provider/model (and thus the vector dimension), recreate the tables (dropping all data) with the resetdb command, presumably invoked like this:
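blogsimi resetdb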
Index the rendered site
Point the indexer at your built HTML. Use --page to override the site root (defaults to site_root in the config). The indexer only re-embeds changed pages.
blogsimi index --page _site
Generate the recommendations JSON
Write the related/seealso mapping into your Jekyll data directory. Use --out to override the output path (defaults to data_out in the config).
blogsimi genrel --out _data/related.json
The CLI exits with a non-zero status on errors (e.g., DB connection issues, missing API keys). This makes it easy to wire into CI systems like Jenkins.
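In a CI job the two steps can simply be chained, so that a failing index run prevents stale recommendations from being written (this is just one possible wiring):

blogsimi index --page _site && blogsimi genrel --out _data/related.json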
Conclusion
Semantic search brings a new layer of discovery to static sites. By combining local embeddings, PostgreSQL with pgvector, and Jekyll integration, we can:
- Suggest related content that is truly relevant.
- Keep all data self-hosted and under control.
- Extend beyond blog posts into any content type.
The result: a richer and more engaging browsing experience, built on top of solid and transparent technology.