Implementing search in a static site
The prerendering of single-page applications has made static websites popular (again). But server-side does have its perks, which is why so many static websites resort to cloud solutions for functionality like comments and search.
When I rewrote this site to ditch the WordPress REST API, I implemented a clever search. The downside of the cloud is that it eventually updates with breaking changes (I've had stable applications fail because of MongoDB updates). So the only cloud service for me is a CDN, for stuff that doesn't fare well under Git.
How search works
Searching a website that is rendered server-side mostly involves a MySQL query against specific fields in specific tables. You never know what somebody might search for, so about all you can do to speed things up is cache these queries.
With our current static implementation, the equivalent of a database is a large collection of Markdown files. These are converted to JSON by a build step, to be loaded on the fly when you navigate. This page, for instance, is loaded through this JSON file.
But searching these would mean loading them all into memory with a shitload of requests. So much for client-side searching...
Find first, search later
But let's look at searching from another perspective: you might not know what people will search for, but you know exactly what they can find.
Keeping things static, we might not have any server-side computing power at our disposal, but we do have enough storage to search all conceivable results and store each as a JSON file.
Say you search for the word "frog". You would load frog.json, which is simply a list of endpoints containing the word frog: ["where-are-all-the-amfibians"]. But because we do not want non-existing search terms to time out, we have to create an index of all the words that actually have a JSON file. This also allows us to autosuggest or autocomplete searches. So we end up with one index pointing to several hundred words (the word/endpoint files are prefixed with s_, so we can also search for the word "words"):
search
├─ words.json
├─ s_abandoned.json
├─ s_abbreviate.json
├─ ...
├─ s_zip.json
└─ s_zoomable.json
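To make the mechanics concrete: a minimal client-side sketch could look like the snippet below. The /search/ path, the type names and the helper functions are my own assumptions, not the actual implementation.

```typescript
// Hypothetical shapes of the generated files (assumed, not the site's actual code).
type WordIndex = string[]; // words.json: every indexed word, e.g. ["abandoned", ..., "zoomable"]
type WordEntry = string[]; // s_<word>.json: endpoints containing that word

// The index is fetched once and then served from the browser cache.
async function loadIndex(): Promise<WordIndex> {
  const response = await fetch("/search/words.json");
  return response.json();
}

// A term file is only fetched when the index says the word exists,
// so unknown search terms never trigger a failing request.
async function loadTerm(word: string): Promise<WordEntry> {
  const response = await fetch(`/search/s_${word}.json`);
  return response.json();
}
```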
Too large to load
That index file is a bit of a bottleneck, because it contains all the unique words from our combined collection of Markdown files. To bring down the file size, we only want meaningful words of three characters or more.
This is all automated with a build script of course, but there is still the tedious manual part of sifting through all the words to decide which ones to exclude (dummy text examples, for instance). It might also help to leave code snippets, with their stupid variable names, out of the index.
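As an illustration, such a build step could be sketched like this. The paths, the stop-word list and the function name are hypothetical; the real script (linked at the end) will differ.

```typescript
import { promises as fs } from "fs";
import path from "path";

// A hypothetical stop list; the real exclusions are curated by hand.
const STOP_WORDS = new Set(["the", "and", "with", "lorem", "ipsum"]);

async function buildSearchIndex(contentDir: string, outDir: string) {
  const index = new Map<string, Set<string>>(); // word -> endpoints

  for (const file of await fs.readdir(contentDir)) {
    if (!file.endsWith(".md")) continue;
    const endpoint = file.replace(/\.md$/, "");
    const text = await fs.readFile(path.join(contentDir, file), "utf8");

    // Keep only meaningful words of three characters or more.
    const words = text.toLowerCase().match(/[a-z]{3,}/g) ?? [];
    for (const word of words) {
      if (STOP_WORDS.has(word)) continue;
      if (!index.has(word)) index.set(word, new Set());
      index.get(word)!.add(endpoint);
    }
  }

  // One s_<word>.json per word, plus the words.json index.
  await fs.mkdir(outDir, { recursive: true });
  for (const [word, endpoints] of index) {
    await fs.writeFile(path.join(outDir, `s_${word}.json`), JSON.stringify([...endpoints]));
  }
  await fs.writeFile(path.join(outDir, "words.json"), JSON.stringify([...index.keys()].sort()));
}
```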
When I first implemented this for 250 pages, I extracted 2500 search terms resulting in a 24KB JSON file. Six years later, I'm up to 320 pages with 3500 search terms in a 37KB file. Maybe time for some pruning.
Luckily, files are cached, so that index file is only loaded once.
Now, when we search for "pink frog", the index file is loaded and entries are found for both "pink" and "frog". The two related JSON files are then loaded and their endpoints counted. If an endpoint is present in both, it ends up higher in the final search results.
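In code, that ranking can be as simple as counting in how many of the fetched term files each endpoint appears. A sketch, reusing the hypothetical loadIndex and loadTerm helpers from above and handling exact word matches only:

```typescript
// Search for one or more words and rank endpoints by how many terms they match.
async function search(query: string): Promise<string[]> {
  const index = new Set(await loadIndex());
  const terms = query.toLowerCase().split(/\s+/).filter((term) => index.has(term));

  // Fetch only the term files that exist, e.g. s_pink.json and s_frog.json.
  const entries = await Promise.all(terms.map(loadTerm));

  // Count for each endpoint how many of the searched terms it appears in.
  const scores = new Map<string, number>();
  for (const endpoints of entries) {
    for (const endpoint of endpoints) {
      scores.set(endpoint, (scores.get(endpoint) ?? 0) + 1);
    }
  }

  // Endpoints matching both "pink" and "frog" end up above single matches.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([endpoint]) => endpoint);
}
```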
You can try it right here (and watch the network tab):
Any downsides?
The downside to this method is that it is quite simplistic: it only searches for separate words. Word combinations are a lot harder: when searching for "pink frog" we may find a page about "pink elephants and dead frogs". But we can match "frog" to "frogs", so that's nice.
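That plural match presumably comes from comparing query terms against the index by prefix rather than exact equality; a minimal sketch of the idea (again an assumption, not the actual code):

```typescript
// Expand a query term to every indexed word that starts with it,
// so "frog" also picks up "frogs" (and the same list can drive autosuggest).
function expandTerm(term: string, index: string[]): string[] {
  return index.filter((word) => word.startsWith(term));
}
```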
I guess it can be improved upon, maybe by counting word occurrences within a page, or by giving extra weight to words in titles or headings. But the current state works and is still simple.
If you want to check out the sources, the build script is located here.