Smelter

  • This is a project to do map-reduce style calculations using a bunch of serverless workers reading from S3.
  • Task List

    • Up Next

    • Soon

    • Later/Maybe

      • Platform adapters could make it easier to build the local and platform-side code
        • Not sure how useful this actually is, but it would mostly be a preconfigured way for a binary to determine whether it's running in the platform or not, and then run either your manager code or your worker code when you're using the single-source-code configuration
        • I think a simple abstraction would be something like:
          • The spawner has an easy way to set a "SMELTER_TASK" variable in the environment (or the platform's equivalent) of the spawned task.
          • Provide a function that retrieves this value, and then you can match on it and run your own code. If it's not set, then you assume that you are the job manager and run that code instead.
          • Ultimately this isn't too different from doing it yourself, but it might be a nice convenience, especially when developing the pipeline. (There's a rough sketch of this dispatch at the end of the task list below.)
      • Ability for spawner to send an event when a container goes from pending to running
      • Option to run a server within the worker container that can collect logs and stream them back to the manager
      • Option to time out if we don't see any logs for too long
        • This depends on being able to get logs shipped from the container down to the manager. (A small watchdog sketch is at the end of the task list below.)
      • AWS Lambda adapter
      • Support Fargate spot instances
        • A way for SpawnedTask to indicate that a task failure is from a spot instance shutdown
          • This could be a new error code alongside TaskFailed; call it TaskPreempted. (A sketch of this is at the end of the task list below.)
        • Specify which capacity provider to use for Spot vs. on-demand launching
        • An option to relaunch a task as non-Spot if the Spot task gets preempted too many times or takes too long to start.
        • Currently Fargate doesn't support Spot on ARM
      • Convert a single SQL statement into a multi-stage set of tasks
        • This may be better done by integrating with something like DataFusion, which already supports creating query plans
        • Consider also supporting the Substrait query plan format
      • Clean up inter-level channels
        • There are a bunch of channels for communicating cancel state, task results, etc. between the tasks, stages, job, and manager. See if there's a way to make this cleaner.
    • Done

      • Retry inside the spawner when Fargate returns a capacity error
      • Cancel all the jobs when an error occurs, and wait for them to be cancelled
      • Handle rate exceeded errors from Fargate API
        • The Rust AWS SDK doesn't properly flag ThrottlingError responses for retry, but a custom retry layer can handle this.
      • Workers automatically track RAM usage, load factor, time running, etc., and return these with the task results
      • Full test of Fargate jobs
      • Allow a cancel signal to kill all containers
      • StatusSender needs to work with a raw StatusUpdateItem, and the StatusCollector should be able to handle native StatusUpdateItem and have its own separate mechanism for the read operations. (Maybe just use a mutex here to simplify?)
      • Build Fargate container
      • Have something that helps generate task definitions for Fargate
      • AWS Fargate Spawner
        • How to figure out if a container succeeded or not? The exit code is exposed, though there's no other way to pass back data
      • AWS Fargate Worker Framework
      • Example to run workers as local processes
      • Remove the Spawner trait altogether
        • It's no longer necessary and isn't sufficiently expressive for most spawners
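
A rough sketch of the platform-adapter dispatch from the Later/Maybe items above. The only convention taken from the notes is the SMELTER_TASK environment variable; run_manager and run_worker here just stand in for whatever manager and worker code your own pipeline has.

```rust
// Stand-ins for your pipeline's actual manager and worker entry points.
fn run_manager() -> Result<(), Box<dyn std::error::Error>> {
    println!("running as the job manager");
    Ok(())
}

fn run_worker(task: &str) -> Result<(), Box<dyn std::error::Error>> {
    println!("running worker task {task}");
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    match std::env::var("SMELTER_TASK") {
        // The spawner set SMELTER_TASK, so this process is a spawned worker.
        Ok(task) => run_worker(&task),
        // Not set, so assume this process is the job manager.
        Err(std::env::VarError::NotPresent) => run_manager(),
        // Set but not valid Unicode, etc.; surface the error.
        Err(e) => Err(e.into()),
    }
}
```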
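
Next, a sketch of the log-idle timeout, assuming the log-forwarding server mentioned above delivers worker output to the manager over a channel. The channel setup and names are invented for the example.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

// Watch a stream of log lines coming back from a worker. If the worker goes
// quiet for longer than `idle_limit`, treat it as stuck.
async fn watch_logs(
    mut log_rx: mpsc::Receiver<String>,
    idle_limit: Duration,
) -> Result<(), &'static str> {
    loop {
        match timeout(idle_limit, log_rx.recv()).await {
            // A log line arrived before the deadline; keep waiting.
            Ok(Some(line)) => println!("worker: {line}"),
            // The sender was dropped, meaning the worker finished normally.
            Ok(None) => return Ok(()),
            // Nothing arrived within the idle limit.
            Err(_) => return Err("no logs received before the idle timeout"),
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(async move {
        tx.send("starting up".to_string()).await.ok();
        // Stay alive but silent so the watchdog fires after five seconds.
        tokio::time::sleep(Duration::from_secs(60)).await;
        drop(tx);
    });
    println!("{:?}", watch_logs(rx, Duration::from_secs(5)).await);
}
```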
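
And a sketch of how a Spot interruption could be reported as its own error variant next to TaskFailed, as the Fargate Spot item above suggests. The enum here is illustrative, not Smelter's actual error type.

```rust
// Illustrative task error type; the real one would carry more context.
#[derive(Debug)]
enum TaskError {
    /// The task ran and exited unsuccessfully.
    TaskFailed { exit_code: i32 },
    /// The task was stopped because its Fargate Spot capacity was reclaimed.
    TaskPreempted,
}

impl TaskError {
    /// Preemptions are expected when running on Spot, so they shouldn't
    /// count against the normal retry budget the way real failures do.
    fn is_preemption(&self) -> bool {
        matches!(self, TaskError::TaskPreempted)
    }
}

fn main() {
    let errors = [
        TaskError::TaskPreempted,
        TaskError::TaskFailed { exit_code: 1 },
    ];
    for err in &errors {
        if err.is_preemption() {
            println!("preempted; retry the task, maybe on-demand this time");
        } else {
            println!("real failure: {err:?}");
        }
    }
}
```
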
  • Test Datasets

  • Initial Features

    • Split up queries across a number of workers
      • Split queries into chunks
      • Max number of concurrent workers
    • Run workers
    • Gather results
    • Reduce results using 1 or more reducers
    • Retryable workers with some threshold for retrying
      • e.g. auto-retry the remaining workers once the first 90% have returned (see the sketch right after this list)
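
A sketch of that last bullet: re-request the stragglers once 90% of the chunks have returned, and let the duplicate attempts race. The run_chunk function and the types around it are placeholders, not Smelter's real scheduler.

```rust
use std::collections::HashMap;
use futures::stream::{FuturesUnordered, StreamExt};

// Placeholder for the real work of a chunk: in Smelter this would be a
// serverless worker querying its slice of the data from S3.
async fn run_chunk(chunk: usize) -> (usize, Vec<u8>) {
    (chunk, Vec::new())
}

async fn run_with_tail_retry(chunks: Vec<usize>) -> HashMap<usize, Vec<u8>> {
    let total = chunks.len();
    let threshold = total * 9 / 10; // re-request stragglers once 90% are done

    let mut in_flight: FuturesUnordered<_> =
        chunks.iter().map(|&c| run_chunk(c)).collect();
    let mut results: HashMap<usize, Vec<u8>> = HashMap::new();
    let mut retried = false;

    while let Some((chunk, data)) = in_flight.next().await {
        // The first attempt to finish for a chunk wins; later duplicates
        // from the retry wave are ignored.
        results.entry(chunk).or_insert(data);
        if results.len() == total {
            break;
        }

        if !retried && results.len() >= threshold {
            retried = true;
            // Re-request every chunk that hasn't finished yet without
            // cancelling the original attempts, and let them race.
            for &c in &chunks {
                if !results.contains_key(&c) {
                    in_flight.push(run_chunk(c));
                }
            }
        }
    }

    results
}

#[tokio::main]
async fn main() {
    let results = run_with_tail_retry((0..100).collect()).await;
    println!("collected {} chunk results", results.len());
}
```
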
  • Pluggable Backends

    • Execution
      • AWS Lambda
      • Docker/Kubernetes/Nomad
    • Storage
      • Various cloud storage providers
  • Applications

    • Query over some data and output a DuckDB file with the results for further analysis.
  • From Honeycomb's O11y Book:

    But serverless functions and cloud object storage aren't 100% reliable. In practice, the latency at the tails of the distribution of invocation time can be significant, orders of magnitude higher than the median time. That last 5% to 10% of results may take tens of seconds to return, or may never complete. In the Retriever implementation, we use impatience to return results in a timely manner. Once 90% of the requests to process segments have completed, the remaining 10% are re-requested, without canceling the still-pending requests. The parallel attempts race each other, with whichever returns first being used to populate the query result. Even if it is 10% more expensive to always retry the slowest 10% of subqueries, a different read attempt against the cloud provider's backend will likely perform faster than a "stuck" query blocked on S3, or network I/O that may never finish before it times out.

    What happens if someone attempts to group results by a high-cardinality field? How can you still return accurate values without running out of memory? The simplest solution is to fan out the reduce step by assigning reduce workers to handle only a proportional subset of the possible groups. For instance, you could follow the pattern Chord does by creating a hash of the group and looking up the hash correspondence in a ring covering the keyspace.
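
The reduce fan-out at the end of that excerpt is easy to picture with a tiny sketch: hash each group key onto one of a fixed number of reducers. This uses Rust's default hasher and a plain modulo rather than the Chord-style consistent-hash ring the book mentions, and the keys are invented for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a group key onto one of `num_reducers` partitions so that no single
// reducer has to hold the entire keyspace in memory. A real implementation
// might use a consistent-hash ring (as Chord does) so reducers can be added
// or removed without reshuffling every key.
fn reducer_for_group(group_key: &str, num_reducers: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    group_key.hash(&mut hasher);
    hasher.finish() % num_reducers
}

fn main() {
    for key in ["status:200", "status:500", "service:api"] {
        println!("{key} -> reducer {}", reducer_for_group(key, 8));
    }
}
```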


Thanks for reading! If you have any questions or comments, please send me a note on Twitter.