Tool calling with open-source LLMs is fragmented because every model family uses a different wire format for encoding function calls. Inference engines like vLLM, SGLang, and TensorRT-LLM each write custom parsers per model, and grammar engines like Outlines and XGrammar independently reverse-engineer the same format knowledge. This creates an M×N duplication problem: N models times M implementations of the same format logic. The proposed solution is a declarative spec that captures each model's wire format — boundary tokens, argument serialization, reasoning token behavior — so both grammar engines and output parsers can consume it without redundant reverse-engineering work every time a new model ships.
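
To make the idea concrete, here is a minimal sketch of what such a declarative spec could look like and how an output parser might consume it. Everything in it is an assumption for illustration: the `ToolCallFormat` class, its field names, the `parse_tool_calls` helper, the delimiter strings, and the choice of JSON for argument serialization are hypothetical and are not taken from the post or from any engine's API.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCallFormat:
    """Hypothetical declarative description of one model family's wire format."""

    call_start: str  # boundary token that opens a tool call
    call_end: str  # boundary token that closes a tool call
    reasoning_start: Optional[str] = None  # optional reasoning delimiters
    reasoning_end: Optional[str] = None
    # Assumption: arguments are serialized as a JSON object between the boundaries.


def parse_tool_calls(text: str, fmt: ToolCallFormat) -> list[dict]:
    """Generic parser driven only by the spec, with no model-specific branches."""
    # Drop reasoning spans first so calls drafted inside them are ignored.
    if fmt.reasoning_start and fmt.reasoning_end:
        reasoning = (
            re.escape(fmt.reasoning_start) + r".*?" + re.escape(fmt.reasoning_end)
        )
        text = re.sub(reasoning, "", text, flags=re.DOTALL)
    # Extract every payload that sits between the boundary tokens.
    pattern = re.escape(fmt.call_start) + r"(.*?)" + re.escape(fmt.call_end)
    return [json.loads(body) for body in re.findall(pattern, text, flags=re.DOTALL)]


# Delimiters chosen for illustration only; not a claim about any specific model.
EXAMPLE_FORMAT = ToolCallFormat(
    call_start="<tool_call>",
    call_end="</tool_call>",
    reasoning_start="<think>",
    reasoning_end="</think>",
)

output = (
    "<think>I should check the weather.</think>"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
calls = parse_tool_calls(output, EXAMPLE_FORMAT)
```

In this sketch, supporting a new model would mean adding one `ToolCallFormat` value instead of writing a new parser. A grammar engine could, in principle, read the same fields to constrain decoding, which is the shared-consumer property the spec is meant to provide.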

4m read time. From thetypicalset.com
Table of contents
What “supporting a model” actually means
The pace of the problem
Generic parsers are swimming against the current
The missing separation
