About
How it started ...
A while ago I postulated something on the fediverse:
Searching the web is a really crappy experience these days. Instead of finding cool blogs and forums related to my
interests I get a mix of ads, AI slop and unrelated junk. Therefore, we need a new search engine for the small web.
Ideally something that can be self-hosted, has some capability for federation/decentralization and is respectful of
people's choices and resources while crawling. Also, it should return good results, because that's kind of the point.
For $dayjob I've recently spent a lot of time looking at embeddings: basically extra fancy hash functions that turn
data of a certain type into a long vector (aka a bunch of floating point numbers), such that semantically similar
bits of data end up close to each other in the vector space. These functions have existed for text for a while, but
they recently started getting a lot better as one of the more useful side products of LLMs. There are ready-made
python libraries, e.g. sentence_transformers, that make them really easy to use. There are also vector databases (or
database extensions) that can store those vectors and implement nearest neighbour search as an index. Since I'm a
fan of boring databases, I'd use something like postgres and pgvector. Taking a bunch of text, splitting it into
paragraphs or sentences and creating a semantic search index on those shouldn't be too hard. That already gives us
"query in and semantically close search hits out" as a system.
Now we need to think about web crawling. Traditionally that's an annoying and messy process, where the computer
tries to process information that's intended to be rendered by a browser and parsed by a human. It works, but it's
not really designed with intention. Also, many modern websites need to run a bunch of obnoxious JavaScript before
you can actually get at the content. If only there was a standard for providing website content in a
text-centric way ... Oh wait, RSS and Atom feeds exist. They are made to be polled regularly and they mostly follow
a standardized structure. It's easy to check if new stuff has been added. We can even use the
Last-Modified header to check for new content. This is really important, because if we want to build a
better search engine, it has to be respectful of other people's resources. That is to say, it should not just
DoS small servers into oblivion, like AI crawlers tend to do. RSS and Atom feeds are also nice because there are
many battle-tested libraries for working with them. Blogs have them, discussion boards usually offer them and even
a lot of fediverse software has them on a per-account basis. It would also be really easy to use the metadata
fields of those feeds to implement an opt-in only system for the search index.
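The polite-polling part boils down to a conditional HTTP GET. A minimal stdlib sketch (the User-Agent string is made up; feedparser can also do this for you via its `etag` and `modified` arguments to `parse()`):

```python
def conditional_headers(last_modified=None, etag=None):
    """Build headers for a polite conditional GET on a feed URL."""
    # Identify the crawler honestly; the name here is a made-up example.
    headers = {"User-Agent": "small-web-search/0.1"}
    if last_modified:
        # Lets the server answer "304 Not Modified" with an empty body
        # instead of re-sending the whole feed on every poll.
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers

headers = conditional_headers(last_modified="Sat, 01 Feb 2025 10:00:00 GMT")
print(headers)
```

On a 304 response the crawler simply skips the feed until the next poll, so an unchanged blog costs its server almost nothing.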
The other interesting thing is: As long as everyone is using the same embedding function, we can just share the
database entries, and they will work for everyone. So different instances of the search engine could subscribe to
each other and send each other updates to their indices quite easily. That could be used to distribute the workload of
indexing and crawling stuff in a network. There could even be tags on the entries such that my instance can pull
only the analogue electronics stuff from magicsmoke.search and the enduro entries from bikes.index... This concept
would probably share many of the (moderation, abuse, drama...) issues of the fediverse, but it also has the
potential of having a lot of the advantages the fediverse has for niche communities.
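The tag-based subscription idea is simple enough to sketch. The entry shape, instance names and tags below are all hypothetical; the point is just that a shared entry carries its embedding plus enough metadata to filter on:

```python
# Hypothetical shape for entries shared between instances: the embedding
# travels with the metadata, so the receiver never has to re-crawl.
remote_entries = [
    {"url": "https://magicsmoke.example/posts/1",
     "tags": {"analogue", "electronics"}, "embedding": [0.1, 0.9]},
    {"url": "https://bikes.example/posts/7",
     "tags": {"enduro"}, "embedding": [0.7, 0.2]},
    {"url": "https://bikes.example/posts/8",
     "tags": {"road-cycling"}, "embedding": [0.6, 0.3]},
]

def subscribe(entries, wanted_tags):
    # Keep only entries whose tags overlap what this instance cares
    # about; everything else stays on the remote index.
    return [e for e in entries if e["tags"] & wanted_tags]

pulled = subscribe(remote_entries, {"analogue", "enduro"})
print([e["url"] for e in pulled])
```

Because every instance uses the same embedding function, the pulled entries can go straight into the local pgvector table and become searchable immediately.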
... how it's going
Since nobody else wanted to build it, I had to sit down and explore the concept a little more.
Actually, it wasn't that hard:
- feedparser to read the feeds
- beautifulsoup4 to turn the HTML from the feed entries into plaintext
- sent_tokenize from nltk, because splitting text into sentences is harder than it sounds
- SentenceTransformers with distiluse-base-multilingual-cased-v1 to generate the embeddings
- pgvector and postgres as the backing database, to query chunks of text by cosine distance of their embeddings
- sqlalchemy with pgvector-python as the ORM
- stsb-roberta-large as the cross encoder for reranking on the query side of things
The rest is just the usual: flask with some plugins, alembic, bulma for the css ...
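The ingest side of that list can be sketched end to end. To keep it self-contained and runnable, the stdlib's html.parser stands in for beautifulsoup4 and a naive regex stands in for nltk's sent_tokenize (which handles abbreviations and other edge cases far better); the sample HTML is made up:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Stand-in for beautifulsoup4's get_text(): collect text nodes only,
    # dropping all tags and attributes.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def feed_entry_to_chunks(html):
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(p.strip() for p in extractor.parts if p.strip())
    # Naive sentence splitter standing in for nltk's sent_tokenize:
    # break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

chunks = feed_entry_to_chunks("<p>Feeds are neat. They are <em>easy</em> to poll!</p>")
print(chunks)
```

Each chunk would then be embedded and written to the pgvector table; at query time the top candidates by cosine distance get reranked by the cross encoder before being shown.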
Performance
Let's talk about the elephant in the room. Depending on what has happened and where I've moved this since writing
this paragraph, you might have already noticed that the performance of the search is not-great(TM). It's not really
bad either. If you hit the search button, it will take between 10s and 60s to get some results back. You can work
with that, it just feels wrong after being spoiled by getting your results back instantly for years.
If we add a GPU into the mix, things will look different immediately. Even my dated RTX2060 will reduce the time
to process a single request to less than a second. That unfortunately makes the "self-hosted" part a little harder.
While you can get GPU servers or VMs with access to GPUs from most hosting providers, they are stupidly expensive.
You can probably get an older data center GPU (e.g. a Tesla K80) for just the price of a few months of hosting fees.
Hence, the far more feasible approach is probably to host your instance in your homelab, your friend's homelab, or
the rack in your hackerspace. Get creative. I'd still count that as "self-hostable".