About
How it started ...
A while ago I postulated something on the fediverse:
Searching the web is a really crappy experience these days. Instead of finding cool blogs and forums related to my
interests I get a mix of ads, AI slop and unrelated junk. Therefore, we need a new search engine for the small web.
Ideally something that can be self-hosted, has some capability for federation/decentralization and is respectful of
people's choices and resources while crawling. Also, it should return good results, because that's kind of the point.
For $dayjob I've recently spent a lot of time looking at embeddings: basically extra fancy hash functions that turn
data of a certain type into a long vector (aka a bunch of floating point numbers), such that semantically similar
bits of data end up close to each other in the vector space. These functions have existed for text for a while, but
they recently started getting a lot better as one of the more useful side products of LLMs. There are ready-made
python libraries, e.g. sentence_transformers, that make them really easy to use. There are also vector databases (or
database extensions) that can store those vectors and implement nearest neighbour search as an index. Since I'm a
fan of boring databases, I'd use something like postgres and pgvector. Taking a bunch of text, splitting it into
paragraphs or sentences and creating a semantic search index on those shouldn't be too hard. That already gives us
"query in and semantically close search hits out" as a system.
Now we need to think about web crawling. Traditionally that's an annoying and messy process, where the computer
tries to process information that's intended to be rendered by a browser and parsed by a human. It works, but it's
not really designed with intention. Also, many modern websites need to run a bunch of obnoxious JavaScript before
you can actually get at the content. If only there was a standard for providing website content in a
text-centric way ... Oh wait, RSS and Atom feeds exist. They are made to be polled regularly and they mostly follow
a standardized structure. It's easy to check if new stuff has been added. We can even use the
Last-Modified header to check for new content. This is really important, because if we want to build a
better search engine, it has to be respectful of other people's resources. That is to say, it should not just
DoS small servers into oblivion, like AI crawlers tend to do. RSS and Atom feeds are also nice because there are
many battle-tested libraries for working with them. Blogs have them, discussion boards usually offer them and even
a lot of fediverse software has them on a per-account basis. It would also be really easy to use the metadata
fields of those feeds to implement an opt-in only system for the search index.
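The polite-polling part boils down to a conditional HTTP GET. A minimal stdlib sketch (the User-Agent string is made up; feedparser can also do this for you via its `etag` and `modified` arguments to `parse()`):

```python
def conditional_headers(last_modified=None, etag=None):
    """Build headers for a polite conditional GET on a feed URL."""
    # Identify the crawler honestly; the name here is a made-up example.
    headers = {"User-Agent": "small-web-search/0.1"}
    if last_modified:
        # Lets the server answer "304 Not Modified" with an empty body
        # instead of re-sending the whole feed on every poll.
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers

headers = conditional_headers(last_modified="Sat, 01 Feb 2025 10:00:00 GMT")
print(headers)
```

On a 304 response the crawler simply skips the feed until the next poll, so an unchanged blog costs its server almost nothing.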
The other interesting thing is: As long as everyone is using the same embedding function, we can just share the
database entries, and they will work for everyone. So different instances of the search engine could subscribe to
each other and send each other updates to their indices quite easily. That could be used to distribute the workload of
indexing and crawling stuff in a network. There could even be tags on the entries such that my instance can pull
only the analogue electronics stuff from magicsmoke.search and the enduro entries from bikes.index... This concept
would probably share many of the (moderation, abuse, drama...) issues of the fediverse, but it also has the
potential of having a lot of the advantages the fediverse has for niche communities.
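The tag-based subscription idea is simple enough to sketch. The entry shape, instance names and tags below are all hypothetical; the point is just that a shared entry carries its embedding plus enough metadata to filter on:

```python
# Hypothetical shape for entries shared between instances: the embedding
# travels with the metadata, so the receiver never has to re-crawl.
remote_entries = [
    {"url": "https://magicsmoke.example/posts/1",
     "tags": {"analogue", "electronics"}, "embedding": [0.1, 0.9]},
    {"url": "https://bikes.example/posts/7",
     "tags": {"enduro"}, "embedding": [0.7, 0.2]},
    {"url": "https://bikes.example/posts/8",
     "tags": {"road-cycling"}, "embedding": [0.6, 0.3]},
]

def subscribe(entries, wanted_tags):
    # Keep only entries whose tags overlap what this instance cares
    # about; everything else stays on the remote index.
    return [e for e in entries if e["tags"] & wanted_tags]

pulled = subscribe(remote_entries, {"analogue", "enduro"})
print([e["url"] for e in pulled])
```

Because every instance uses the same embedding function, the pulled entries can go straight into the local pgvector table and become searchable immediately.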
... how it's going
Since nobody else wanted to build it, I had to sit down and explore the concept a little more.
Actually, it wasn't that hard:
- feedparser to read the feeds
- beautifulsoup4 to turn the HTML from the feed entries into plaintext
- sent_tokenize from nltk, because splitting text into sentences is harder than it sounds
- SentenceTransformers with distiluse-base-multilingual-cased-v1 to generate the embeddings
- pgvector and postgres as the backing database, to query chunks of text by cosine distance of their embeddings
- sqlalchemy with pgvector-python as the ORM
- stsb-roberta-large as the cross encoder for reranking on the query side of things
The rest is just the usual: flask with some plugins, alembic, bulma for the css ...
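The ingest side of that list can be sketched end to end. To keep it self-contained and runnable, the stdlib's html.parser stands in for beautifulsoup4 and a naive regex stands in for nltk's sent_tokenize (which handles abbreviations and other edge cases far better); the sample HTML is made up:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Stand-in for beautifulsoup4's get_text(): collect text nodes only,
    # dropping all tags and attributes.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def feed_entry_to_chunks(html):
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(p.strip() for p in extractor.parts if p.strip())
    # Naive sentence splitter standing in for nltk's sent_tokenize:
    # break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

chunks = feed_entry_to_chunks("<p>Feeds are neat. They are <em>easy</em> to poll!</p>")
print(chunks)
```

Each chunk would then be embedded and written to the pgvector table; at query time the top candidates by cosine distance get reranked by the cross encoder before being shown.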
Performance
Let's talk about the elephant in the room. Depending on what has happened and where I've moved this since writing
this paragraph, you might have already noticed that the performance of the search is not-great(TM). It's not really
bad either. If you hit the search button, it will take between 10s and 60s to get some results back. You can work
with that, it just feels wrong after being spoiled by getting your results back instantly for years.
If we add a GPU into the mix, things will look different immediately. Even my dated RTX2060 will reduce the time
to process a single request to less than a second. That unfortunately makes the "self-hosted" part a little harder.
While you can get GPU servers or VMs with access to GPUs from most hosting providers, they are stupidly expensive.
You can probably get an older data center GPU (e.g. a Tesla K80) for just the price of a few months of hosting fees.
Hence, the far more feasible approach is probably to host your instance in your homelab, your friend's homelab, or
the rack in your hackerspace. Get creative. I'd still count that as "self-hostable".