As procrastination for writing my book, I wrote a Neovim plugin to generate word counts per section in a document.
Neovim lets you attach virtual text to a position using an extmark. So the idea was to set an extmark on each header line showing its word count, but I never quite figured out how extmarks actually move around with the text. It feels somewhat unintuitive, but there are probably some subtleties that I haven't figured out yet.
Instead of dealing with that, I set up my plugin as a "decoration provider." This lets you set an ephemeral extmark that only lasts for the draw cycle; the next time a line is redrawn, you just create it again (or not). This ended up being a bit more code, but it's much simpler, since now I only have to track the word counts and where the headers are, and not worry about whether an extmark is still on the right line.
Overall Lua feels nice, if rather barebones. I'm definitely missing the functional programming idioms afforded by most modern languages. But the Neovim APIs are very easy to use.
Today I encountered an issue using Vite with the Rush monorepo software. Rush keeps its lockfile in an unusual place, and so Vite could not automatically check the lockfile to know when to clear its cache. But it turns out to be pretty easy to code your own logic here.
// vite.config.js
import * as fs from 'fs';
import * as path from 'path';
import { fileURLToPath } from 'url';

const dirname = path.dirname(fileURLToPath(import.meta.url));

function lockfileNeedsRebuild() {
  const lockFilePath = path.resolve(dirname, '../../common/config/rush/pnpm-lock.yaml');
  const lockFileDatePath = path.resolve(dirname, '.last-lockfile-date');

  let lockFileSavedDate = '';
  try {
    lockFileSavedDate = fs.readFileSync(lockFileDatePath).toString();
  } catch (e) {
    // It's fine if there is no existing file.
  }

  const lockFileDate = fs.statSync(lockFilePath).mtime.valueOf().toString();
  if (lockFileSavedDate.trim() !== lockFileDate) {
    fs.writeFileSync(lockFileDatePath, lockFileDate);
    return true;
  }

  return false;
}

/** @type {import('vite').UserConfig} */
const config = {
  optimizeDeps: {
    force: lockfileNeedsRebuild(),
  },
  // ... and other config
};

export default config;
I've made great progress on my Maplibre Svelte library, enough so that it's basically ready for use and I'm starting to move on to more advanced features such as new types of layers with custom shaders. Really happy with how this turned out and how easy it is to create great maps with Maplibre.
The Demo Site has a lot more demos now, so check it out!
I also created a small utility called merge-geo, which takes a GeoJSON file and related CSV files with information about the regions in the GeoJSON, and imports the CSV data into the GeoJSON. This comes up a lot when working with US Census data and similar sources, so it can save a lot of time.
The new article on loading geodata is published and I'm happy with how it turned out. Now I've been starting on the Svelte MapLibre library.
Today I got the basic map and simple marker support working. Tomorrow I'm starting on real sources and layers, and along the way I'm planning on some wrappers to make it easier to create fancy styles and shaders too.
If you're interested in following the progress here, you can check out the Github Repo or the Demo Site.
Making some last minute changes to the new post tonight and aiming to publish it on Saturday. After that's out, I'm going to start on a Svelte wrapper for the Maplibre mapping package. This will be a basis of future posts on working with geographic data in the browser.
Finished the initial draft of my 2nd GIS post today. Coming soon: "Loading Geographic Data in a Format You Can Actually Use". Then we can move on to the fun stuff.
Starting on my next geodata blog post. This is another foundational topic — how to actually get geographic data into a format you can use in your application. I'll cover shapefiles, KML, US census data, OpenStreetMap, and more.
TLA+ does worst-case model checking, so it fails if it finds any path to an error. This enables a famous trick: if you want to find the set of steps that solves a problem, write a property saying "the problem isn't solved" and make that an invariant. Then any behavior that finds the solution also breaks the invariant, and the model checker will dutifully spit out the set of steps in that behavior.
If a three-digit number is divisible by 37, it remains divisible by 37 if you rotate its digits. For example, 148 is divisible by 37, and so are 814 and 481.
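A quick sketch of why this works (my own addition, just basic arithmetic): write the number as \(n = 100a + 10b + c\), so rotating its digits gives

\[ n' = 100b + 10c + a = 10n - 999a. \]

Since \(999 = 27 \times 37\), if 37 divides \(n\) it also divides \(n'\), and repeating the argument covers the second rotation.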
Yet more changes to my publishing flow over the past two days. I added the ability to set custom HTML classes and elements in my Logseq page exporter, which lets me add nice custom styling like in the "code and image" pairs in the new Introduction to GeoJSON article.
And speaking of that article, it's live on the site now. This is targeted toward newcomers to the geographic data world, so if you are a beginner, then I hope it's useful, and if not, then expect additional content soon!
I wrote 1,100 words on an intro to GeoJSON last night. The article is coming together and should be up soon once I make the content a bit less dry :) This one is just the basics, but will set the stage for doing actual work.
Did a bit more tonight, but I took a break to fix some CSS stuff on this site and update my Logseq exporter to better support exporting longform writing. Now I can write blog posts in Logseq and still use the outline hierarchy to organize sections, but get flattened HTML output when I publish here.
Starting the first geographic data article tonight. This one will be an overview of GeoJSON, to lay the foundation for more complex topics.
Links
This article on rollback netcode is a nice overview of some different ways of synchronizing state when latency and timing really matter. Mostly only useful for game programming, but well explained.
I'm thinking about doing some more content on working with geographic data. This would include topics such as GeoJSON, working with PostGIS, and writing full apps with SvelteKit and Leaflet. If there is anything that you've found confusing or hard to learn in this area, please reach out. My email is just daniel at this domain, and I'm also on Twitter and Mastodon.
In other news, I haven't been posting updates on Ergo recently, but the dataflow model is working. Still needs a bit more work to generate type definitions for the editors and for more visualizations, but the core functionality is there.
Small Perceive update today. Given a particular item, you can find other items that are most similar. Since the semantic search is already basically a similarity match, this was just a matter of changing the code to read the item's embedding vector from the database, instead of creating one from a typed-in string.
Added support for semantic search on Browser bookmarks. It's very convenient that Chrome's metadata files are all just SQLite databases or JSON files.
I'm thinking that bookmark management is going to become a first-class feature of Perceive, so you can get semantic search not only on bookmarks imported from the browser, but also add bookmarks inside the tool itself and search through them as well. So next up, I'm going to try out a GUI in Tauri.
Did various cleanup on Perceive over the weekend, including fixing the HTML parsing, which had previously been removing a lot of spaces between words. Since a lot of the data comes from browsing history and similar online sources, I added a command to allow reprocessing all the data without downloading it again.
This brought about an unexpected issue. I ended up with a data processing pipeline deadlock where all the Rayon thread pool's threads were waiting on blocking channel sends. Then a different stage later in the pipeline which also used Rayon was unable to get any threads to do anything, and so no progress was made.
Attaching with the debugger was very useful here. I had my suspicions, mostly from eliminating pretty much every other potential cause, but looking at the call stacks of all the different threads made it very obvious.
Fortunately the solution was easy. It turns out Rayon lets you create separate thread pools, and so I did exactly that to remove contention. A couple hours of debugging, and only a couple minutes to make the fix.
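The shape of the fix, as a rough sketch (the pool sizes and stage names here are made up, not the actual Perceive code):

use rayon::ThreadPoolBuilder;

fn main() {
    // Give each pipeline stage its own Rayon pool so that one stage blocking on
    // channel sends can't starve the other stage of worker threads.
    let embed_pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();
    let index_pool = ThreadPoolBuilder::new().num_threads(2).build().unwrap();

    embed_pool.install(|| {
        // Parallel iterators called in here run on embed_pool's threads.
    });

    index_pool.install(|| {
        // ...and these run on index_pool's threads, independently of the stage above.
    });
}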
After yesterday's work and the resulting performance issues, I did some quick profiling. This revealed that most of the time in building the HNSW search index was in dot product calculations between the 768-element vectors that make up each document embedding. Fortunately, the ndarray crate came to the rescue. With support for BLAS and Apple's Accelerate framework enabled, the time to build the search index went from 45 seconds down to 5, so I'm quite happy with that.
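For reference, the hot path boils down to something like this sketch; with ndarray's blas feature enabled and Accelerate as the backend, the dot product dispatches to an optimized BLAS routine instead of a plain loop (the vectors here are dummies):

use ndarray::Array1;

fn main() {
    // Two 768-element embedding vectors; `dot` goes through the BLAS backend
    // when the feature is enabled.
    let a = Array1::<f32>::ones(768);
    let b = Array1::<f32>::ones(768);
    let score: f32 = a.dot(&b);
    println!("dot product = {score}");
}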
Next up I'm going to try out putting all this into a simple GUI using Tauri.
I've also been enjoying Rust's new let/else syntax. It takes a bit to get used to, but it's really convenient, for example when you want to unwrap an Option or return early from the function if it's None.
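A toy example of the pattern (my own, not from any of the projects above):

fn main() {
    println!("{}", greet(Some("Daniel")));
    println!("{}", greet(None));
}

fn greet(name: Option<&str>) -> String {
    // let/else: bind the Some value, or take the else branch, which must
    // diverge (return, break, continue, panic, ...).
    let Some(name) = name else {
        return "Hello, whoever you are!".to_string();
    };
    format!("Hello, {name}!")
}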
Got browser history searching done today. Chrome's history file is just a SQLite database so that was pretty easy to pull in.
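Roughly what the query looks like with rusqlite; the urls table and its url, title, and last_visit_time columns are from my memory of Chrome's schema, so treat this as a sketch rather than the actual import code:

use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    // Open a copy of Chrome's History file, which is just a SQLite database.
    let conn = Connection::open("History")?;
    let mut stmt = conn.prepare("SELECT url, title FROM urls ORDER BY last_visit_time DESC")?;
    let rows = stmt.query_map([], |row| {
        let url: String = row.get(0)?;
        let title: Option<String> = row.get(1)?;
        Ok((url, title.unwrap_or_default()))
    })?;

    for row in rows {
        let (url, title) = row?;
        println!("{title}: {url}");
    }
    Ok(())
}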
Lots of little details related to actually fetching the data, which aren't worth talking about too much and took up most of the day. I also revamped the import pipeline a bit to reduce head-of-line blocking when some requests take a while.
With this done, the database contains about 13,000 items, and the HNSW search index takes a while to build. So I'll work on optimizing that tomorrow.
It was a bit tricky getting rust-bert to work on the M1 GPU. The issue, apparently, is that PyTorch JIT models not trained for MPS (the macOS GPU framework) cannot be loaded directly onto MPS.
But it turns out there's an easy workaround: load the model on the CPU and then convert it to MPS. After loading the VarStore on the CPU device, it's just a matter of calling var_store.set_device(tch::Device::Mps), and then you're running on the GPU!
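In sketch form, the device juggling looks like this; the model construction and weight loading are omitted, so this is just the outline rather than the real code:

use tch::{nn::VarStore, Device};

fn main() {
    // Build the VarStore on the CPU, load the model weights into it as usual,
    // and only then move everything over to the Apple GPU via MPS.
    let mut var_store = VarStore::new(Device::Cpu);
    // ... construct the model and load its weights into `var_store` here ...
    var_store.set_device(Device::Mps);
    println!("running on {:?}", var_store.device());
}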
In my initial tests with an M1 Pro, this is about 2-3x as fast as running on CPU/AMX. This took the time to scan and index my Logseq database (~1000 documents) down to 6 seconds. Curious if it would have been 3 seconds on an M1 Max, but I didn't spend the extra $400 a couple of years ago, so I can't find out now. :)
Switching Models
The MiniLmL12V2 model that I started out with is trained more for "sentence similarity" than for searching longer documents, and it shows. I switched the model to msmarco-bert-base-dot-v5, which is supposed to work a lot better for semantic search, and indeed the search results improved immensely. The import process takes a lot longer (40 seconds for ~1000 documents), but that's still not bad. That GPU inference is pulling its weight.
These models aren't automatically supported by rust-bert, but the instructions on how to download and use other models worked great, and this one is similar enough to the existing sentence embedding pipeline that I didn't have to change much.
Search Highlighting
Finally, I implemented search result highlighting, so that you get not only the title of the found document, but a snippet of text from the document relevant to the query. I'm now using two models in the program at once. The primary model used for the search is still a BERT-based model, and handles the full document encoding.
The BERT model is powerful but relatively slow, so for highlighting, I used the MiniLmL6V2 model, which is both much faster and focused on small strings of text.
Then for each matching document, I tokenize it, break the list of tokens into overlapping chunks, and encode each chunk with the model. I also encode the original query and take the dot product between the query embedding and each chunk embedding; the chunk with the highest dot product for each document is the best match.
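In rough Rust, the chunk scoring looks something like this sketch; embed stands in for the MiniLmL6V2 call, and the chunk sizes are made up:

// Find the chunk of a document that best matches the query, by embedding
// overlapping windows of tokens and comparing each one to the query embedding.
fn best_chunk(
    embed: impl Fn(&str) -> Vec<f32>,
    query: &str,
    tokens: &[&str],
) -> Option<String> {
    const CHUNK: usize = 64;
    const STEP: usize = 32; // 50% overlap between consecutive chunks

    let query_vec = embed(query);
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();

    let mut best: Option<(f32, String)> = None;
    let mut start = 0;
    loop {
        let end = (start + CHUNK).min(tokens.len());
        let chunk = tokens[start..end].join(" ");
        let score = dot(&embed(&chunk), &query_vec);
        if best.as_ref().map_or(true, |(s, _)| score > *s) {
            best = Some((score, chunk));
        }
        if end == tokens.len() {
            break;
        }
        start += STEP;
    }
    best.map(|(_, text)| text)
}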
I think it could use some tweaking to pay more attention to actual word boundaries in the tokens. But overall I'm quite happy with this as a first effort in a few hours.
Got the import pipeline, model encoding, and embedding search working for Perceive. I ended up using the instant-distance crate to do the nearest neighbor searching, but hnsw_rs looks good as well.
Tomorrow I'll look at generating better embeddings. The model I'm using cuts off the input at 250 tokens so I'm going to add something to cut documents up and do a weighted average of the resulting vectors for each piece. Might play around with some other methods too.
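Roughly the shape I have in mind, as a sketch (embed stands in for the model call, and the weighting scheme may change):

// Weighted average of per-chunk embeddings: split the token list into pieces
// the model can handle, embed each piece, and weight it by the piece's length.
fn document_embedding(
    embed: impl Fn(&[&str]) -> Vec<f32>,
    tokens: &[&str],
    max_tokens: usize,
) -> Vec<f32> {
    let mut total: Vec<f32> = Vec::new();
    let mut total_weight = 0.0f32;

    for chunk in tokens.chunks(max_tokens) {
        let vec = embed(chunk);
        if total.is_empty() {
            total = vec![0.0; vec.len()];
        }
        let weight = chunk.len() as f32;
        for (t, v) in total.iter_mut().zip(&vec) {
            *t += *v * weight;
        }
        total_weight += weight;
    }

    for t in &mut total {
        *t /= total_weight.max(1.0);
    }
    total
}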
Not a lot of time spent coding this weekend due to Christmas and spending time with the kids, but I got a few things going on #PerceiveProject.
The SQLite database is set up, and I have an import pipeline design that can process input from multiple sources.
The first source is just plain text file scanning via the ignore and globset crates, which will be sufficient for much of my personal needs.
Finally, I was able to implement the first stages of the import pipeline, which figure out if a scanned file has been changed or not from the previous scan.
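The check itself can be as simple as this sketch, comparing each file's modification time against what the previous scan recorded; the real pipeline may use more than just mtime:

use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

// Decide whether a file needs reprocessing by comparing its modification time
// against the one recorded during the previous scan.
fn needs_processing(
    path: &Path,
    previous_scan: &HashMap<PathBuf, SystemTime>,
) -> std::io::Result<bool> {
    let modified = fs::metadata(path)?.modified()?;
    Ok(match previous_scan.get(path) {
        Some(prev) => modified > *prev,
        None => true, // never seen before
    })
}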
For Tuesday: writing new/changed entries to the database, generating embeddings, and a basic search functionality.
I've been doing a lot of work on the Ergo dataflow node UI. I ended up writing both the dragging and the connector code totally from scratch to get them just how I liked, which was a fun diversion. The line-drawing code takes some pains not to intersect the source and destination boxes. There's more work to do there, but I'm happy with it for now.
The demos right now are mostly videos and it's a bit of a hassle to post videos here. Here's one though:
Next up will be to integrate the actual editors and things into that UI.
But for now, it's holiday research project time! I'll be diving into semantic search, trying to build Perceive, a search engine for my local computer, expanding to browser history if I can.
The core will be in Rust, starting with a CLI interface and adding a Tauri app and Svelte GUI if I have the time.
I'm using rust-bert for running the embedding models. There are a couple small tricks to get it running on an M1 Mac, which I'll publish soon.
The persistent data store will be sqlite3. Postgres provides a nice pgvector extension for vector search, but it's more difficult to ship something self-contained that uses Postgres, so I think SQLite is the way to go here.
So far I've built a REPL that does a simple embedding comparison. More to come!
The image in last night's post was too compressed by Pic Store. I updated the webp compressor to use a higher quality setting, and also added a new endpoint that can rerun the conversion process without needing to upload the image again. Nice and simple when you can just trigger it with an easy-to-use background job framework.
Final work on the Ergo dataflow backend is done for now, and I can start on the UI for real. I'm looking at NeoDrag and Perfect Arrows to help with some of that. I'll probably roll the infinite canvas code myself, which I think won't be too hard but we'll see.
Finally, I have some extended time off next week, so I'm considering taking a break from my normal side projects to play around with embeddings and semantic search. Looks like there are a bunch of Rust crates such as rust-bert and hnsw_rs that should let me make good progress quickly.
Finally switched my blog to self-host the fonts. Seems like some improvement on the "flash of unstyled content" but it's still there somewhat with font-display: swap. I'm giving it a try with that left unset, which means the text area is briefly blank on first load, but that feels less jarring than actually swapping the fonts.
Instead of starting on the Ergo dataflow UI, I did some extra testing, adding a full integration test against the real server and checking that it properly works with the triggers and actions systems.
The dataflow tasks are working! I added some convenience functions to the Ergo JavaScript support which allow an expression to return a promise, which will be automatically awaited and unwrapped. Now to start building the UI...
Didn't get much time to hack over the weekend, but I've started on the server side of Ergo's dataflow model. The code to walk the DAG in topological order from any node down to all the connected leaf nodes is done, and so the rest will build upon that.
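Conceptually the walk is along these lines (a from-scratch sketch, not Ergo's actual code): depth-first from the starting node, then reverse the post-order.

use std::collections::{HashMap, HashSet};

// Walk a DAG from `start` and return the reachable nodes in a topological
// order, by reversing a depth-first post-order traversal.
fn topo_from(start: u32, edges: &HashMap<u32, Vec<u32>>) -> Vec<u32> {
    fn visit(node: u32, edges: &HashMap<u32, Vec<u32>>, seen: &mut HashSet<u32>, out: &mut Vec<u32>) {
        if !seen.insert(node) {
            return;
        }
        for &next in edges.get(&node).into_iter().flatten() {
            visit(next, edges, seen, out);
        }
        out.push(node);
    }

    let mut seen = HashSet::new();
    let mut out = Vec::new();
    visit(start, edges, &mut seen, &mut out);
    out.reverse();
    out
}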
For the rest of the server-side work, I'm leveraging a lot of the existing task triggers and JavaScript execution code. Then I can finally build more tasks for Ergo than the one I currently have: a one-state state machine that runs youtube-dl for every payload it receives. :)
For that existing task, I have an action set up in Drafts, so I can paste in URLs and it will call the Ergo endpoint once for each line in the document. Works nicely.
Another possibility I'm coming around to with Ergo is a block-based design, similar to natto.dev, but where everything can continue running on the server when you close the tab. This is slightly different from a state machine, more of a dataflow type of thing, but it fits well with the input-based model, and could probably reuse a lot of the UI design between the two.
This also fits well with allowing some blocks to be tables and graphs, and these can be highlighted in a "view" mode and show up in some form in the dashboard as well.
Time to start up Ergo once again! I upgraded to the latest version of SvelteKit and some of the backend dependencies as well. Upgrading Deno's core packages through about 50 versions when you're using internal undocumented stuff was slightly tricky, but fortunately their code is quite readable so I was able to update mine to match.
Next up, I'm going to play with various ways of compiling UIs down to state machines to perform tasks.
I released a new version of Effectum today, with support for altering or cancelling pending jobs. This will be a necessary item for the Email Digest Service project, if I pursue it, since it will need to schedule a digest but then delay it as emails continue to come in.
I finished up the integration between Pic Store and my Logseq exporter today.
I'm caching the image URLs for speed, and it can also look up from Pic Store by hash to avoid duplicate uploads. One nice thing about already having an SQLite database set up for utilities like this is that it becomes easy to add additional tables.
And for good measure, here's an image that I dragged into Logseq, which the exporter uploaded, and which is now being served from the image CDN. It was generated while playing around with Stable Diffusion over the Thanksgiving break, and is served as WebP, AVIF, or JPEG depending on your browser support.
Poisson disc sampling is a method of obtaining a set of points that are roughly evenly spaced, with every point at least some minimum distance from the others. Popular algorithms for performing this efficiently are by Bridson and Cem Yuksel, with the latter seeming to give better results according to some sources.
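Bridson's version is simple enough to sketch from memory; this rough Rust version uses the rand crate and isn't code from either paper:

use rand::Rng;

// Rough sketch of Bridson's algorithm for 2D Poisson disc sampling: points fill
// [0, width) x [0, height) and end up at least `r` apart; `k` candidates are
// tried around each active point before it's retired.
fn poisson_disc(width: f64, height: f64, r: f64, k: usize) -> Vec<(f64, f64)> {
    let cell = r / std::f64::consts::SQRT_2;
    let cols = (width / cell).ceil() as usize;
    let rows = (height / cell).ceil() as usize;
    let mut grid = vec![usize::MAX; cols * rows]; // index of the point in each cell
    let mut points: Vec<(f64, f64)> = Vec::new();
    let mut active: Vec<usize> = Vec::new();
    let mut rng = rand::thread_rng();

    // Seed with a single random point.
    let first = (rng.gen_range(0.0..width), rng.gen_range(0.0..height));
    grid[(first.1 / cell) as usize * cols + (first.0 / cell) as usize] = 0;
    points.push(first);
    active.push(0);

    while !active.is_empty() {
        let slot = rng.gen_range(0..active.len());
        let (px, py) = points[active[slot]];
        let mut placed = false;

        for _ in 0..k {
            // Candidate placed uniformly in the annulus between r and 2r.
            let angle = rng.gen_range(0.0..std::f64::consts::TAU);
            let dist = rng.gen_range(r..2.0 * r);
            let (x, y) = (px + dist * angle.cos(), py + dist * angle.sin());
            if x < 0.0 || x >= width || y < 0.0 || y >= height {
                continue;
            }

            // Reject the candidate if any existing point in the surrounding
            // grid cells is closer than r.
            let (cx, cy) = ((x / cell) as usize, (y / cell) as usize);
            let ok = (cx.saturating_sub(2)..=(cx + 2).min(cols - 1)).all(|gx| {
                (cy.saturating_sub(2)..=(cy + 2).min(rows - 1)).all(|gy| {
                    let p = grid[gy * cols + gx];
                    p == usize::MAX || (points[p].0 - x).hypot(points[p].1 - y) >= r
                })
            });

            if ok {
                grid[cy * cols + cx] = points.len();
                points.push((x, y));
                active.push(points.len() - 1);
                placed = true;
                break;
            }
        }

        if !placed {
            active.swap_remove(slot);
        }
    }

    points
}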