Perceive
- This project was an experiment from December 2022 in indexing a bunch of personal data and performing semantic search on it via embedding similarity (a minimal sketch of the core loop follows below). Everything on the backend was done in Rust just to make it hard. :)
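- A minimal sketch of that embed-and-compare core, using rust-bert's sentence-embeddings pipeline; the document list and cosine helper here are illustrative, not the project's actual structure:

  ```rust
  use rust_bert::pipelines::sentence_embeddings::{
      SentenceEmbeddingsBuilder, SentenceEmbeddingsModelType,
  };

  fn cosine(a: &[f32], b: &[f32]) -> f32 {
      let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
      let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
      dot / (norm(a) * norm(b))
  }

  fn main() -> Result<(), Box<dyn std::error::Error>> {
      // Downloads and loads the MiniLM sentence-embeddings model.
      let model = SentenceEmbeddingsBuilder::remote(
          SentenceEmbeddingsModelType::AllMiniLmL12V2,
      )
      .create_model()?;

      let docs = ["a note about rust-bert", "a grocery list"];
      let doc_embeddings = model.encode(&docs)?;

      // Embed the query, then rank documents by cosine similarity.
      let query = model.encode(&["semantic search in rust"])?;
      let mut ranked: Vec<(usize, f32)> = doc_embeddings
          .iter()
          .enumerate()
          .map(|(i, e)| (i, cosine(&query[0], e)))
          .collect();
      ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
      println!("best match: {}", docs[ranked[0].0]);
      Ok(())
  }
  ```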
- Tauri app
- Loading bar for initial load
- Search field
- real-time search as you type (without highlighting)
- ability to select different sources and source categories
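- A hedged sketch of how the search field can be wired up on the Rust side; the `SearchIndex` and `SearchResult` types here are illustrative stand-ins, not the app's actual code:

  ```rust
  use tauri::State;

  // Stand-in for the app's real in-memory index.
  struct SearchIndex;

  #[derive(serde::Serialize)]
  struct SearchResult {
      source: String,
      snippet: String,
      score: f32,
  }

  // Called from the frontend on each (debounced) keystroke, which is
  // what makes search-as-you-type work.
  #[tauri::command]
  fn search(query: String, index: State<'_, SearchIndex>) -> Vec<SearchResult> {
      let _ = (query, index);
      Vec::new() // embed the query and rank it against the index here
  }
  ```

- The command would be registered with `.invoke_handler(tauri::generate_handler![search])` on the Tauri builder, and the frontend calls it via `invoke("search", ...)`.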
- Asymmetric Search
- This is useful for asymmetric search, where the queries are short but the corpora to be matched are long.
- Probably not worrying about this for now, maybe later
- https://www.sbert.net/examples/applications/retrieve_rerank/README.html
- Use tokenizer: https://docs.rs/rust-bert/latest/rust_bert/pipelines/sentence_embeddings/struct.SentenceEmbeddingsTokenizerConfigResources.html#associatedconstant.ALL_MINI_LM_L12_V2
- Possible additional config? https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2/blob/main/tokenizer_config.json
- The conversion process might include all this, not sure.
- Implementing a Cross-Encoder
- Unlike the normal bi-encoder process, which generates a vector for one document at a time and then relies on cosine similarity or other metrics, a cross-encoder takes both documents together and produces a similarity score straight from the model.
- This can give much better results when you have a top list of similar documents but want to sort that list further for presentation to the user. It's much slower, though, which is why you use the vector similarity from the bi-encoder to get the set of candidate documents first.
- We do a normal comparison on the bi-encoder vectors to get a top N, and then re-sort that result list with the cross-encoder, which encodes the query together with each candidate document and takes the score from each pair (sketched after this list).
- https://www.sbert.net/examples/applications/cross-encoder/README.html
- https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/cross_encoder/CrossEncoder.py
- SBERT provides cross-encoder models pre-trained on the MS MARCO dataset
- https://www.sbert.net/docs/pretrained_cross-encoders.html
- Data at https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2/tree/main
- Need to convert the model weights and see what else needs to happen for this to work.
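- A sketch of that retrieve-then-rerank flow; `cross_encoder_score` is a hypothetical stand-in for whatever the converted MS MARCO model ends up exposing, since rust-bert has no ready-made cross-encoder pipeline:

  ```rust
  /// Hypothetical: score a (query, document) pair with a cross-encoder.
  /// In practice this would run the converted ms-marco-MiniLM model.
  fn cross_encoder_score(query: &str, doc: &str) -> f32 {
      let _ = (query, doc);
      unimplemented!("needs the converted cross-encoder weights")
  }

  fn cosine(a: &[f32], b: &[f32]) -> f32 {
      let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
      let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
      dot / (norm(a) * norm(b))
  }

  /// Stage 1: cheap bi-encoder similarity over all documents, keep top N.
  /// Stage 2: expensive cross-encoder over only those N for the final order.
  fn retrieve_and_rerank<'a>(
      query: &str,
      query_vec: &[f32],
      docs: &[(&'a str, Vec<f32>)],
      top_n: usize,
  ) -> Vec<&'a str> {
      let mut candidates: Vec<(&'a str, f32)> = docs
          .iter()
          .map(|(text, vec)| (*text, cosine(query_vec, vec)))
          .collect();
      candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
      candidates.truncate(top_n);

      // Re-score only the small candidate set with the cross-encoder.
      let mut reranked: Vec<(&'a str, f32)> = candidates
          .iter()
          .map(|&(text, _)| (text, cross_encoder_score(query, text)))
          .collect();
      reranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
      reranked.into_iter().map(|(text, _)| text).collect()
  }
  ```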
- Other Cross-Encoder Models
- ColBERT - https://huggingface.co/vespa-engine/col-minilm - ColBERT uses a lighter-weight context model to speed things up
- https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2
- rust-bert on M1
- 1. The latest published version of the rust-bert crate uses tch-rs 0.8, but you need at least 0.10. The git version already uses the newest version, so you can just set that in your Cargo.toml: `rust-bert = { git = "https://github.com/guillaume-be/rust-bert.git" }`
- 2. You need to install PyTorch manually. The simplest method is to just install it globally via Homebrew or pip3, though a more robust method would be to install a local copy in a Python venv or something, and reference it from there.
- 3. Set the LIBTORCH environment variable to wherever you have libtorch installed. With the Homebrew method this is `LIBTORCH=/opt/homebrew/opt/pytorch`.
- 4. Tell the linker where to find it in your Cargo config (sketch below).
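- A sketch of what that Cargo config can look like, assuming the Homebrew path from step 3 (exact rpath handling may vary by setup):

  ```toml
  # .cargo/config.toml
  [env]
  LIBTORCH = "/opt/homebrew/opt/pytorch"

  [build]
  # Embed an rpath so the libtorch dylibs are found at runtime too.
  rustflags = ["-C", "link-arg=-Wl,-rpath,/opt/homebrew/opt/pytorch/lib"]
  ```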
Initial Sprint Reflections
- Still much easier to do ML stuff in Python, though rust-bert was hugely useful in implementing a lot of this
- Setting up libtorch
- Web scraping browser history is kind of a hassle
- Lots of pages require authentication
- GitHub really hates it, even for public pages
- Content extraction for HTML
- Readability works well; the Rust port needs some work
- In the future I would probably run a sidecar that hosts the up-to-date JS version
- Rayon thread pool exhaustion
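- The notes don't record the fix, but the standard mitigation is a dedicated pool for heavy work so it can't starve Rayon's global pool; a minimal sketch:

  ```rust
  use rayon::prelude::*;

  fn main() {
      // A separate pool for long-running inference work, so the global
      // pool stays free for the rest of the app's parallel iterators.
      let inference_pool = rayon::ThreadPoolBuilder::new()
          .num_threads(2)
          .build()
          .expect("failed to build thread pool");

      let squares: Vec<u64> = inference_pool
          .install(|| (0..8u64).into_par_iter().map(|n| n * n).collect());
      println!("{squares:?}");
  }
  ```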
- Model choice matters a lot
- Running bulk inference without GPU support is still slow for the larger models.
- ndarray is great
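- For example (illustrative values, not the project's code): with L2-normalized embeddings, scoring a query against every document is one matrix-vector product:

  ```rust
  use ndarray::{array, Array1, Array2};

  fn main() {
      // Rows are already-normalized document embeddings.
      let docs: Array2<f32> = array![
          [1.0, 0.0, 0.0],
          [0.6, 0.8, 0.0],
          [0.0, 0.0, 1.0],
      ];
      let query: Array1<f32> = array![1.0, 0.0, 0.0];

      // For unit vectors, cosine similarity reduces to a dot product,
      // so one matrix-vector multiply scores every document at once.
      let scores = docs.dot(&query);
      println!("{scores}");
  }
  ```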
- Future work
- Better article scraping
- ML model feedback
- More integrations
- OpenAI integration to allow running on less powerful systems