After yesterday's work and the resulting performance issues, I did some quick profiling. It revealed that most of the time spent building the HNSW search index was going to dot product calculations between the 768-element vectors that make up each document embedding. Fortunately, the ndarray crate came to the rescue: with support for BLAS and Apple's Accelerate framework enabled, the time to build the search index went from 45 seconds down to 5, so I'm quite happy with that.
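For reference, here's roughly what that change looks like. This is a minimal sketch, assuming ndarray's blas feature with an Accelerate-backed blas-src in Cargo.toml (exact versions and feature names may vary by release); the dot products themselves just go through Array1::dot.

```rust
// Sketch only. Assumes Cargo.toml enables a BLAS backend, e.g.:
//   ndarray = { version = "0.15", features = ["blas"] }
//   blas-src = { version = "0.8", features = ["accelerate"] }
// and that `extern crate blas_src;` appears somewhere so the backend gets linked.
use ndarray::Array1;

/// Dot product between two 768-element document embeddings.
/// With BLAS enabled this dispatches to Accelerate instead of a scalar loop.
fn similarity(a: &Array1<f32>, b: &Array1<f32>) -> f32 {
    a.dot(b)
}

fn main() {
    let a = Array1::from(vec![0.1_f32; 768]);
    let b = Array1::from(vec![0.2_f32; 768]);
    println!("similarity: {}", similarity(&a, &b));
}
```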
Next up I'm going to try out putting all this into a simple GUI using Tauri.
I've also been enjoying Rust's new let/else syntax. It takes a bit to get used to, but it's really convenient, for example when you want to unwrap an Option or return from the function early if it's None.
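A quick illustrative example (the names here are made up, not from the actual code):

```rust
use std::collections::HashMap;

fn print_title(doc_id: i64, titles: &HashMap<i64, String>) {
    // Bind the value if it's there, or bail out of the function early if not.
    let Some(title) = titles.get(&doc_id) else {
        eprintln!("no document with id {doc_id}");
        return;
    };
    println!("Document: {title}");
}
```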
Got browser history searching done today. Chrome's history file is just a SQLite database so that was pretty easy to pull in.
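With rusqlite, the core of it is something like the sketch below, pulling from Chrome's urls table. The last_visit_time values use Chrome's own epoch (microseconds since 1601), so they still need converting, and since Chrome keeps the live file locked you'll generally want to copy it and open the copy.

```rust
use rusqlite::Connection;

struct HistoryItem {
    url: String,
    title: String,
    last_visit_time: i64, // microseconds since 1601-01-01, still needs converting
}

fn read_history(path: &str) -> rusqlite::Result<Vec<HistoryItem>> {
    let conn = Connection::open(path)?;
    let mut stmt = conn.prepare(
        "SELECT url, title, last_visit_time FROM urls ORDER BY last_visit_time DESC",
    )?;
    let items = stmt
        .query_map([], |row| {
            Ok(HistoryItem {
                url: row.get(0)?,
                title: row.get(1)?,
                last_visit_time: row.get(2)?,
            })
        })?
        .collect::<Result<Vec<_>, _>>()?;
    Ok(items)
}
```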
Lots of little details related to actually fetching the data took up most of the day, though they aren't worth going into too much here. I also revamped the import pipeline a bit to reduce head-of-line blocking when some requests take a while.
With this done, the database contains about 13,000 items, and the HNSW search index takes a while to build. So I'll work on optimizing that tomorrow.
It was a bit tricky getting rust-bert to work on the M1 GPU. The issue, apparently, is that PyTorch JIT models not trained for MPS (the macOS GPU framework) can't be loaded directly onto MPS.
But it turns out there's an easy solution: load the VarStore on the CPU device first, and then it's just a matter of var_store.set_device(tch::Device::Mps) and you're running on the GPU!
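In code it's something like this, sketched at the tch level with the rust-bert setup around it omitted:

```rust
use tch::{nn::VarStore, Device};

fn load_on_gpu(weights_path: &str) -> Result<VarStore, tch::TchError> {
    // Load the weights on the CPU first, since the JIT model wasn't built for MPS...
    let mut var_store = VarStore::new(Device::Cpu);
    var_store.load(weights_path)?;

    // ...then move the whole VarStore over to the Apple GPU.
    var_store.set_device(Device::Mps);
    Ok(var_store)
}
```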
In my initial tests with an M1 Pro, this is about 2-3x as fast as running on CPU/AMX. This took the time to scan and index my Logseq database (~1000 documents) down to 6 seconds. Curious if this would have been 3 seconds on an M1 Max, but I didn't spend the extra $400 a couple years ago to find out now. :)
Switching Models
The MiniLmL12V2 model that I started out with is trained more for "sentence similarity" than for searching longer documents, and it shows. I switched the model to msmarco-bert-base-dot-v5, which is supposed to work a lot better for semantic search, and indeed the search results improved immensely. The import process takes a lot longer (40 seconds for ~1000 documents), but that's still not bad. That GPU inference is pulling its weight.
These models aren't automatically supported by rust-bert, but the instructions on how to download and use other models worked great, and this one is similar enough to the existing sentence embedding pipeline that I didn't have to change much.
Search Highlighting
Finally, I implemented search result highlighting, so that you get not only the title of the found document, but a snippet of text from the document relevant to the query. I'm now using two models in the program at once. The primary model used for the search is still a BERT-based model, and handles the full document encoding.
The BERT model is powerful but relatively slow, so for highlighting, I used the MiniLmL6V2 model, which is both much faster and focused on small strings of text.
Then for each matching document, I tokenize it, break the list of tokens into overlapping chunks, and encode each chunk with the model. I also encode the original query, take the dot product between the query embedding and each chunk embedding, and the chunk with the highest dot product is the best match for that document.
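In sketch form, with a hypothetical encode function standing in for the MiniLM pipeline:

```rust
/// Hypothetical stand-in for running the small sentence-embedding model.
fn encode(tokens: &[String]) -> Vec<f32> {
    unimplemented!()
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Find the chunk of `doc_tokens` that best matches the query.
/// Chunks are `chunk_len` tokens long and overlap by half a chunk.
fn best_chunk(doc_tokens: &[String], query_embedding: &[f32], chunk_len: usize) -> Option<Vec<String>> {
    let step = (chunk_len / 2).max(1);
    let mut best: Option<(f32, &[String])> = None;

    let mut start = 0;
    while start < doc_tokens.len() {
        let end = (start + chunk_len).min(doc_tokens.len());
        let chunk = &doc_tokens[start..end];
        let score = dot(&encode(chunk), query_embedding);
        if best.map_or(true, |(s, _)| score > s) {
            best = Some((score, chunk));
        }
        if end == doc_tokens.len() {
            break;
        }
        start += step;
    }

    best.map(|(_, chunk)| chunk.to_vec())
}
```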
I think it could use some tweaking to pay more attention to actual word boundaries in the tokens. But overall I'm quite happy with this as a first effort in a few hours.
Got the import pipeline, model encoding, and embedding search working for Perceive. I ended up using the instant-distance crate to do the nearest neighbor searching, but hnsw_rs looks good as well.
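For anyone curious, the instant-distance API looks roughly like this. This is from memory, so treat the exact method and field names as approximate; it also assumes normalized embeddings, so 1 minus the dot product works as a distance.

```rust
use instant_distance::{Builder, Point, Search};

#[derive(Clone)]
struct Embedding(Vec<f32>);

impl Point for Embedding {
    fn distance(&self, other: &Self) -> f32 {
        // instant-distance wants a distance, so use 1 - dot product
        // (assuming the embeddings are normalized).
        1.0 - self.0.iter().zip(&other.0).map(|(a, b)| a * b).sum::<f32>()
    }
}

fn build_and_search(embeddings: Vec<Embedding>, doc_ids: Vec<i64>, query: Embedding) {
    // Build the HNSW index, associating each point with its document id.
    let map = Builder::default().build(embeddings, doc_ids);

    // Query the nearest neighbors.
    let mut search = Search::default();
    for item in map.search(&query, &mut search).take(10) {
        println!("doc {} at distance {}", item.value, item.distance);
    }
}
```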
Tomorrow I'll look at generating better embeddings. The model I'm using cuts off the input at 250 tokens, so I'm going to add something to split documents up and take a weighted average of the resulting vectors for each piece. Might play around with some other methods too.
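Something along these lines, weighting each piece's vector by its share of the document's tokens (a hypothetical sketch, not necessarily what I'll end up with):

```rust
/// Combine per-chunk embeddings into one document embedding,
/// weighting each chunk by how many tokens it covered.
fn weighted_average(chunks: &[(Vec<f32>, usize)]) -> Vec<f32> {
    let dims = chunks.first().map_or(0, |(v, _)| v.len());
    let total_tokens: usize = chunks.iter().map(|(_, n)| n).sum();
    let mut out = vec![0.0_f32; dims];
    if total_tokens == 0 {
        return out;
    }

    for (embedding, token_count) in chunks {
        let weight = *token_count as f32 / total_tokens as f32;
        for (o, x) in out.iter_mut().zip(embedding) {
            *o += weight * x;
        }
    }
    out
}
```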
Not a lot of time spent coding this weekend due to Christmas and spending time with the kids, but I got a few things going on #PerceiveProject.
The SQLite database is set up, and I have an import pipeline design that can process input from multiple sources.
The first source is just plain text file scanning via the ignore and globset crates, which will be sufficient for much of my personal needs.
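A trimmed-down sketch of what that scanning looks like with those two crates (the glob patterns here are just examples):

```rust
use std::path::PathBuf;

use globset::{Glob, GlobSetBuilder};
use ignore::WalkBuilder;

fn scan(root: &str) -> Result<Vec<PathBuf>, Box<dyn std::error::Error>> {
    // Only index the file types we care about.
    let mut globs = GlobSetBuilder::new();
    globs.add(Glob::new("*.md")?);
    globs.add(Glob::new("*.txt")?);
    let globs = globs.build()?;

    let mut files = Vec::new();
    // WalkBuilder respects .gitignore and hidden-file rules by default.
    for entry in WalkBuilder::new(root).build() {
        let entry = entry?;
        if entry.file_type().map_or(false, |t| t.is_file()) && globs.is_match(entry.path()) {
            files.push(entry.path().to_path_buf());
        }
    }
    Ok(files)
}
```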
Finally, I was able to implement the first stages of the import pipeline, which figure out whether a scanned file has changed since the previous scan.
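The check itself is nothing fancy; conceptually it's along these lines, comparing a file's metadata against what was stored for it in the previous scan (simplified, with a made-up FileRecord type):

```rust
use std::fs;
use std::time::SystemTime;

/// What we remembered about a file from the previous scan (hypothetical).
struct FileRecord {
    size: u64,
    modified: SystemTime,
}

/// Returns true if the file looks different from last time and
/// needs to be re-read and re-embedded.
fn needs_reimport(path: &std::path::Path, previous: Option<&FileRecord>) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(match previous {
        None => true, // never seen before
        Some(prev) => meta.len() != prev.size || meta.modified()? != prev.modified,
    })
}
```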
For Tuesday: writing new/changed entries to the database, generating embeddings, and a basic search functionality.
I've been doing a lot of work on the Ergo dataflow node UI. Ended up doing both the dragging and the connector code totally from scratch to get them just how I liked, which was a fun diversion. The line drawing code takes some pains to not interact with the source and destination boxes. There's more work to do there but I'm happy with it for now.
The demos right now are mostly videos and it's a bit of a hassle to post videos here. Here's one though:
Next up will be to integrate the actual editors and things into that UI.
But for now, it's holiday research project time! I'll be diving into semantic search, trying to build Perceive, a search engine for my local computer and expanding to browser history if I can.
The core will be in Rust, starting with a CLI interface and adding a Tauri app and Svelte GUI if I have the time.
I'm using rust-bert for running the embedding models. There are a couple small tricks to get it running on an M1 Mac, which I'll publish soon.
The persistent data store will be sqlite3. Postgres provides a nice pgvector extension for vector search, but it's more difficult to ship something self-contained that uses Postgres, so I think SQLite is the way to go here.
So far I've built a REPL that does a simple embedding comparison. More to come!
The image in last night's post was too compressed by Pic Store. I updated the webp compressor to use a higher quality setting, and also added a new endpoint that can rerun the conversion process without needing to upload the image again. Nice and simple when you can just trigger it with an easy-to-use background job framework.
Final work on the Ergo dataflow backend is done for now, and I can start on the UI for real. I'm looking at NeoDrag and Perfect Arrows to help with some of that. I'll probably roll the infinite canvas code myself, which I think won't be too hard but we'll see.
Finally, I have some extended time off next week, so I'm considering taking a break from my normal side projects to play around with embeddings and semantic search. Looks like there are a bunch of Rust crates such as rust-bert and hnsw_rs that should let me make good progress quickly.
Finally switched my blog to self-host the fonts. Seems like some improvement on the "flash of unstyled content" but it's still there somewhat with font-display: swap. I'm giving it a try with that left unset, which means the text area is briefly blank on first load, but that feels less jarring than actually swapping the fonts.
Instead of starting on the Ergo dataflow UI, I did some extra testing, adding a full integration test against the real server and checking that it works properly with the triggers and actions systems.
The dataflow tasks are working! I added some convenience functions to Ergo's JavaScript support that allow an expression to return a promise, which will be automatically awaited and unwrapped. Now to start building the UI...
Didn't get much time to hack over the weekend, but I've started on the server side of Ergo's dataflow model. The code to walk the DAG in topological order from any node down to all the connected leaf nodes is done, and so the rest will build upon that.
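The general shape of that walk, as a standalone sketch rather than the actual Ergo code: do a depth-first search from the starting node and emit nodes in reverse post-order, which is a topological order for a DAG.

```rust
use std::collections::{HashMap, HashSet};

/// Visit every node reachable from `start`, in an order where each node
/// comes before the nodes it feeds into. Assumes `edges` describes a DAG.
fn topo_walk(edges: &HashMap<u32, Vec<u32>>, start: u32) -> Vec<u32> {
    fn visit(node: u32, edges: &HashMap<u32, Vec<u32>>, seen: &mut HashSet<u32>, out: &mut Vec<u32>) {
        if !seen.insert(node) {
            return;
        }
        for &next in edges.get(&node).into_iter().flatten() {
            visit(next, edges, seen, out);
        }
        // Post-order: a node is pushed only after everything downstream of it.
        out.push(node);
    }

    let mut seen = HashSet::new();
    let mut out = Vec::new();
    visit(start, edges, &mut seen, &mut out);
    out.reverse(); // reverse post-order == topological order for a DAG
    out
}
```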
For the rest of the server side work, I'm leveraging a lot of the existing task triggers and JavaScript execution code. Then I can finally build some more tasks for Ergo than the one I currently have: a one-state state machine that runs youtube-dl for every payload it receives. :)
For that existing task, I have an action set up in Drafts, so I can paste in URLs and it will call the Ergo endpoint once for each line in the document. Works nicely.
Another possibility I'm coming around to with Ergo is a block-based design, similar to natto.dev, but where everything can continue running on the server when you close the tab. This is slightly different from a state machine, more of a dataflow type of thing, but it fits well with the input-based model, and could probably reuse a lot of the UI design between the two.
This also fits well with allowing some blocks to be tables and graphs, and these can be highlighted in a “view” mode and show up in some form in the dashboard as well.
Time to start up Ergo once again! I upgraded to the latest version of SvelteKit and some of the backend dependencies as well. Upgrading Deno's core packages through about 50 versions when you're using internal undocumented stuff was slightly tricky, but fortunately their code is quite readable so I was able to update mine to match.
Next up, I'm going to play with various ways of compiling UIs down to state machines to perform tasks.
I released a new version of Effectum today, with support for altering or cancelling pending jobs. This will be a necessary item for the Email Digest Service project, if I pursue it, since it will need to schedule a digest but then delay it as emails continue to come in.
I finished up the integration between Pic Store and my Logseq exporter today.
I'm caching the image URLs for speed, and it can also look up from Pic Store by hash to avoid duplicate uploads. One nice thing about already having an SQLite database set up for utilities like this is that it becomes easy to add additional tables.
And for good measure, here's an image that I dragged into Logseq, which the exporter uploaded, and which is now being served from the image CDN. It was generated while playing around with Stable Diffusion over the Thanksgiving break, and is served as a WebP, AVIF, or JPEG depending on your browser support.
Poisson Disc sampling is a method of obtaining a set of points that are roughly evenly spaced, and with all points some minimum distance from each other. Popular algorithms for performing this efficiently are by Bridson and Cem Yuksel, with the latter seeming to give better results according to some sources.
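For reference, here's a rough sketch of Bridson's version in Rust, using the rand crate; illustrative only, not tuned or tested.

```rust
use rand::Rng;

/// Bridson's Poisson disc sampling over a `width` x `height` rectangle,
/// with minimum distance `r` between points and `k` candidate attempts
/// per active point (the paper suggests k = 30). Sketch only.
fn poisson_disc(width: f64, height: f64, r: f64, k: usize) -> Vec<(f64, f64)> {
    let cell = r / 2f64.sqrt();
    let cols = (width / cell).ceil() as usize;
    let rows = (height / cell).ceil() as usize;
    // Each grid cell holds at most one sample; store its index or None.
    let mut grid: Vec<Option<usize>> = vec![None; cols * rows];
    let cell_of = |x: f64, y: f64| (x / cell) as usize + (y / cell) as usize * cols;

    let mut rng = rand::thread_rng();
    let mut samples = Vec::new();
    let mut active = Vec::new();

    // Seed with one random point.
    let first = (rng.gen::<f64>() * width, rng.gen::<f64>() * height);
    grid[cell_of(first.0, first.1)] = Some(0);
    samples.push(first);
    active.push(0);

    while !active.is_empty() {
        let idx = rng.gen_range(0..active.len());
        let (px, py) = samples[active[idx]];
        let mut found = false;

        for _ in 0..k {
            // Uniform candidate in the annulus between r and 2r around (px, py).
            let radius = r * (1.0 + 3.0 * rng.gen::<f64>()).sqrt();
            let angle = rng.gen::<f64>() * std::f64::consts::TAU;
            let (x, y) = (px + radius * angle.cos(), py + radius * angle.sin());
            if x < 0.0 || x >= width || y < 0.0 || y >= height {
                continue;
            }

            // Check the neighboring grid cells for anything closer than r.
            let (cx, cy) = ((x / cell) as usize, (y / cell) as usize);
            let mut ok = true;
            'neighbors: for ny in cy.saturating_sub(2)..=(cy + 2).min(rows - 1) {
                for nx in cx.saturating_sub(2)..=(cx + 2).min(cols - 1) {
                    if let Some(s) = grid[nx + ny * cols] {
                        let (sx, sy) = samples[s];
                        if (sx - x).powi(2) + (sy - y).powi(2) < r * r {
                            ok = false;
                            break 'neighbors;
                        }
                    }
                }
            }

            if ok {
                let id = samples.len();
                grid[cell_of(x, y)] = Some(id);
                samples.push((x, y));
                active.push(id);
                found = true;
                break;
            }
        }

        if !found {
            // No candidate fit around this point; retire it.
            active.swap_remove(idx);
        }
    }

    samples
}
```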
Added conditional format conversions to Pic Store today. This lets you configure it to do things like generate PNGs only if the input was also a PNG, so that you don't do silly things like output a PNG of a photo. Next up will be the Pic Store integration mentioned in the previous post.
Last night I watched Join Order Optimization with (almost) no Statistics by Tom Ebergen. In this video, Tom describes enhancements he made to the join planner in DuckDB, and how it applies both to native tables where you have some extra metadata about the contents of each column, and on external files such as Parquet or CSV, where you don't know much about the actual contents. Good watching if you have any interest in database systems.