Extracting Structured Data with LLMs

Written 2023-04-30 — Updated 2023-05-02

Is it useful to finetune a smaller LLM for structured data extraction? Or at that point are you better off just running multiple queries and forming them together into a whole answer?
Code
- https://github.com/1rgs/jsonformer looks like the most promising option so far if you don't need to run against a prepackaged service like OpenAI.
- https://github.com/kyang6/llmparser
- LangChain SelfQueryRetriever
Prompting
- Most examples are using few-shot prompts to achieve this.
- https://twitter.com/goodside/status/1564437905497628674
- Eugene Yan tweet
  - I've found LLMs to reliably return structured data via API by adding the system prompt: "Respond in the format of a JSON array [{key: value}, {key: value}]" Having an "unsure" option also reduces hallucination and indicates uncertainty.

Code