Structured Outputs the easy way with BAML

While working on Treechat’s new treesearch feature (a kind of mini-Perplexity that searches the web for you and reads through all the results to answer the question you’re actually asking), I needed to build a small classifier to figure out whether a given search result is relevant to what the user was asking.

Given how cheap gpt-4o-mini is, it seemed like a great fit for this type of task, but getting a general-purpose LLM to act as a simple classifier requires a bit more wrangling than you might expect at first blush. If you’ve worked with LLMs a bit you’ll know where I’m going with this: LLMs are just as bad as humans at sticking to strict protocols like JSON — they forget closing quotes, slip in trailing commas, and (virtually) fat-finger basic syntax mistakes almost as often as we do. And don’t even get me started on how fiddly they can be when it comes to getting them to wrap (or not wrap) their output in markdown-style triple backticks.

In LLM-land the most common solution to this problem is to use something called “structured output,” which is basically LLM jargon for getting LLMs to output consistent JSON of a certain shape. OpenAI and Anthropic both offer various ways to help you get consistent JSON from their APIs, usually under names like “tool use” or “function calling,” and in OpenAI’s case, simply “structured outputs,” all of which tend to use a fairly verbose data definition language called JSON Schema, which you are then responsible for creating and providing when making your API call.
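To get a sense of the boilerplate involved, here’s roughly what a tool definition for a trivial yes/no relevance check looks like with OpenAI’s Chat Completions API (the function name and descriptions are illustrative, not from Treechat’s actual code):

# Illustrative "tool" definition for a yes/no relevance classifier.
# Everything below the top-level keys is plain JSON Schema.
relevance_tool = {
    "type": "function",
    "function": {
        "name": "report_relevance",
        "description": "Report whether the search result is relevant to the query.",
        "parameters": {
            "type": "object",
            "properties": {
                "is_relevant": {
                    "type": "boolean",
                    "description": "True if the search result is relevant to the query.",
                }
            },
            "required": ["is_relevant"],
        },
    },
}

All that scaffolding, and every token of it goes out with each request.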

That… seems like overkill for such a basic problem. In this case I just want the LLM to reliably give me back a simple true or false telling me whether the search result is relevant to the search query. Until I discovered BAML my go-to solution for this type of thing had been @jxnlco’s Instructor library, which patches OpenAI’s client library to support returning data as Pydantic models. That works pretty well: you make a simple Python class to define the shape of the JSON you want back, and Pydantic and Instructor take care of all the wrangling back and forth between JSON Schema definitions and Python types.
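For comparison, here’s a rough sketch of what that looks like with Instructor, assuming the current instructor.from_openai API (query and search_result are the query string and serialized result from the surrounding search code):

import instructor
from openai import OpenAI
from pydantic import BaseModel

# The Pydantic model defines the shape of the JSON we want back.
class Relevance(BaseModel):
    is_relevant: bool

# instructor.from_openai wraps the official OpenAI client so that
# responses are parsed and validated into the response_model you ask for.
client = instructor.from_openai(OpenAI())

# query and search_result come from the surrounding search code.
result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Relevance,
    messages=[{
        "role": "user",
        "content": f"Is this search result relevant to the query?\n"
                   f"Query: {query}\nSearch result: {search_result}",
    }],
)
is_relevant = result.is_relevant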

While this works, it’s a bit too “magical” for my tastes (its default behavior is to monkey-patch the official OpenAI Python SDK), and it can sometimes be difficult to understand what it’s doing under the hood — in some modes it will even silently modify your prompt before passing it to the LLM. Not to mention, it’s still using the overly verbose JSON Schema language. With Instructor at least all the boilerplate is abstracted away, but all that extra token verbosity is still going straight to your OpenAI bill.

BAML, however, is a little different: it uses a data description DSL that reads like a streamlined, more intuitive version of JSON Schema. If you’ve ever used TypeScript types or Python type annotations you’ll feel right at home (it’s so simple that Cursor’s autocomplete was able to figure it out immediately). Better yet, it lets you define the classes and functions once and then call the resulting functions from multiple programming languages. At the time of writing BAML supports Python, JS/TS, and Ruby natively, and indirectly supports Go and other languages through third-party codegen (by way of OpenAPI).

This made my little problem a lot easier to solve. With BAML installed and configured in my repo, building my little classifier function looks like this:

function ClassifyIsRelevant(query: string, search_result: string) -> bool {
  client FastOpenAI
  prompt #"
    Act as a classifier. Given a search result and a query, determine if the search result is
    relevant to the query.
    <query>
    {{ query }}
    </query>
    <search_result>
    {{ search_result }}
    </search_result>

    {{ ctx.output_format }}
  "#
}

Given that little stub, BAML automagically generates fully typed client code that you can call from Python like so:

import json

from baml_client import b

# query is the user's search string; search_result is the raw result from the web search
is_relevant = b.ClassifyIsRelevant(query, json.dumps(search_result))

Under the hood BAML takes care of making the API call to OpenAI and parsing the LLM’s reply in such a way that it’s guaranteed to return a bool. This means you can just treat it like a standard Python function, and you get convenient structured output with fewer tokens. And, if BAML’s benchmarking is to be believed, better performance (on both speed and accuracy) than even OpenAI’s native structured outputs give you. Huzzah!
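To give a sense of how it slots in, here’s a minimal sketch of how a classifier like this might be used to filter results in a treesearch-style pipeline (the filter_relevant helper and the shape of the results are hypothetical, not Treechat’s actual code):

import json

from baml_client import b

def filter_relevant(query: str, search_results: list[dict]) -> list[dict]:
    # Keep only the results the classifier judges relevant to the query.
    return [
        result
        for result in search_results
        if b.ClassifyIsRelevant(query, json.dumps(result))
    ]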