Agentic AIAI Engineering10 min readUpdated

LLM Function Calling: A Production Engineering Guide

By Mudassir Khan — Agentic AI Consultant & AI Systems Architect, Islamabad, Pakistan

Cover illustration for: LLM Function Calling: A Production Engineering Guide

Quick answer

How does LLM function calling work? The model receives your conversation plus a list of tool definitions. Instead of replying with text, it returns a structured JSON object naming a function and its arguments. Your code runs the function, appends the result to the conversation, and sends everything back to the model, which continues reasoning from the new information. The model never executes code directly.

Section 01 · Mechanics

How function calling actually works in an LLM

The loop has four steps: send request with tool definitions, model decides to call a tool, your code executes and returns the result, model continues reasoning. Every vendor implements the same pattern under different names.

Function calling is not a special mode or a separate API endpoint. It is a pattern built on top of the standard completions interface. You send a request that includes both your conversation and a list of tool definitions formatted as JSON Schema objects. The model processes everything together — the conversation, the tool descriptions, the parameter schemas — and then decides how to respond.

When the model decides to call a function, it does not return a text reply. Instead it returns a structured completion that names the function and provides a JSON object with the argument values it has inferred from the conversation. Your application code receives this completion, extracts the function name and arguments, executes the actual function call, and appends the result back to the conversation as a new message. The model then receives the updated conversation — now including the function result — and continues reasoning from there.

This loop can run multiple times in a single user interaction. A single request might trigger three or four function calls before the model has enough information to produce a final response. Each iteration extends the conversation, which means the input token count grows with every round trip. OpenAI refers to this mechanism as function calling. Anthropic refers to it as tool use. Google refers to it as function declarations. The underlying loop is identical across all three providers.

Four-step function calling loop: model receives tools, returns invocation spec, application executes, result returned to model

Section 02 · Schema Design

Designing tool schemas that hold up in production

The most common production failure in function calling is not the model — it is the schema. Vague descriptions, overly wide parameters, and missing enums produce hallucinated arguments at scale.

Every tool definition has three parts: a name, a description, and a parameter schema. The name and description are the primary signal the model uses to decide whether to call this tool at all. A description that says "handles data" is functionally useless. A description that says "retrieves the current account balance for a given customer ID from the billing system" gives the model enough context to make a correct tool selection decision even when similar tools are registered alongside it.

Parameter surfaces should be as narrow as production requirements allow. A common mistake is registering a tool that accepts a raw SQL query string as a parameter because it is flexible for development. In production, that wide surface gives the model room to construct queries that are syntactically valid but semantically wrong. Replace free-form parameters with structured alternatives wherever possible: specific field names instead of query strings, typed numeric ranges instead of open string fields, explicit boolean flags instead of instruction-bearing text fields.

Enumerate values whenever the valid set is finite. If a parameter accepts one of six status codes, define it as an enum in the schema rather than a string. If a parameter represents a priority level with four valid values, enumerate them. Enum constraints are enforced at the schema level and prevent an entire class of argument hallucination where the model invents a plausible-sounding value that does not exist in the actual system. The multi-agent design patterns guide covers how tool schema quality affects orchestration reliability when multiple agents share a tool registry.

Section 03 · Cost

Context window costs explode as your tool count grows

Registering 58 tools costs roughly 55,000 input tokens per request — before any conversation content. Tool routing is the standard mitigation.

Tool definitions are serialized into the prompt on every request. A minimal tool definition — a short name, a two-sentence description, and two or three parameters — consumes roughly 800 to 1,000 input tokens. That cost is paid on every single request, regardless of whether the model uses the tool. At 10 tools, the overhead is approximately 9,500 tokens. At 20 tools, approximately 19,000 tokens. At 58 tools, which is not an unusual count for an enterprise agentic system with integrations across multiple services, the tool list alone consumes roughly 55,000 input tokens before a single word of conversation has been included.

Tool routing is the standard mitigation. Instead of registering the full tool set on every request, the system maintains a registry of all available tools and selects a relevant subset — typically 10 to 15 tools — based on the current conversation context before making the LLM call. The routing layer itself is a lightweight classifier that examines the user intent and the current agent state. Implemented well, tool routing cuts the tool-related token overhead by 70 to 85 percent. For additional context on evaluating this in production, see the post on LLM agent evaluation in production.

Line chart showing token cost rising from roughly 9,500 tokens for 10 tools to 55,000 tokens for 58 tools, with a tool routing threshold marker at 20 tools

Section 04 · Resilience

Building retry and fallback logic that actually works

Production function calling fails in ways tutorials never cover — malformed arguments, wrong tool selection, and validation errors that require explicit structured feedback to recover from.

The first principle is to validate before you execute. When the model returns a function call, run the arguments through your parameter validation logic before calling the actual function. If the arguments are malformed — a required field is missing, a value is outside the acceptable range, a referenced entity does not exist — return a structured error description to the model rather than letting the function fail at runtime. The model can recover from explicit, structured feedback. It cannot recover from a swallowed exception or a generic 500 error with no context.

The second principle is to cap retries per function call, not per conversation turn. Most teams implement retry logic at the conversation level: they catch a failed turn and retry the entire thing. The problem is that function calling loops can have two or three independent retry layers — one in the application code, one in the agent framework, and sometimes one in the LLM client library. An uncapped loop across two retry layers with a limit of five each can generate 25 or more LLM calls before timing out. Cap retries at the function invocation level: three attempts maximum per function call, then surface a specific failure message to the model or escalate to a human fallback.

The third principle is to build an explicit fallback path. Every agent that calls functions should have a defined behavior for the case where a function is unavailable, consistently failing, or returning results the agent cannot use. The fallback does not need to be sophisticated — it can be as simple as responding to the user with a specific error message and a suggested next step. What it cannot be is silence, a false success, or an infinite loop. Define the fallback explicitly and test it with chaos injection before deploying to production.

Section 05 · Concurrency

When to use parallel function calling

Most providers support returning multiple function invocations in one response. The performance gain is real. The correctness risk is easy to underestimate.

Parallel function calling allows the model to return multiple function invocation specifications in a single response. Your runtime executes them concurrently, then appends all results to the conversation before the next model call. For independent read operations — fetching user profile data and account balance simultaneously, or running three separate search queries in parallel — this is unambiguously the right approach. Latency drops by the factor of the parallelism, and correctness is unaffected because the operations have no state dependencies on each other.

For write operations or any pair of functions with state dependencies, parallel execution is dangerous. If a model returns two function calls where the second logically depends on the result of the first, executing them in parallel produces a race condition. The second function runs against stale state, potentially writing incorrect data or producing a result that the model then reasons from incorrectly. This is particularly insidious because the error does not surface as an exception — it surfaces as subtly wrong agent behavior that is hard to trace back to the concurrent execution. Enforce sequential execution for any function that writes state or has explicit ordering dependencies.

Section 06 · Failure Modes

Six production failure modes that tutorials skip

Most tutorials show the success path. These six failure modes are predictable, common, and worth designing for before they hit users.

Six production failure modes: Schema Drift, Parallel Explosion, Context Accumulation, Retry Storm, Argument Hallucination, Tool Deregistration

Schema drift occurs when a function's actual interface changes but the registered tool definition does not — the model continues generating arguments for the old schema, producing failures that look like model errors but are really deployment coordination failures. Parallel explosion happens when a model with broad tool access returns five or six parallel function calls for a request that warranted one, overwhelming downstream services. Context accumulation compounds across multi-turn sessions as function results pile up in the conversation, eventually degrading reasoning quality and inflating costs. Retry storms arise from uncapped retry layers as described above. Argument hallucination produces syntactically valid but semantically wrong arguments that fail silently. Tool deregistration — removing a tool the model has learned to rely on without retraining or adjusting the system prompt — produces persistent wrong tool selection until the prompt is updated. Each of these failure modes is predictable, and each has a specific design response. None of them are edge cases.

Section 07 · FAQ

Frequently asked questions

The questions engineers ask most when building function calling into production systems.

How does function calling work in an LLM?

The model receives a request that includes both the conversation and a list of tool definitions in JSON Schema format. Instead of responding with text, it returns a structured function call specification — a JSON object naming the function and its arguments. Your application executes the function, appends the result to the conversation, and sends the full updated conversation back to the model, which continues reasoning from the new information.

What is the difference between function calling and tool use?

Nothing meaningful. The terms describe the same mechanism and are used interchangeably across vendors. OpenAI documentation uses function calling. Anthropic documentation uses tool use. Both refer to the loop where a model emits a structured invocation specification, the application executes the named function, and the result is returned to the model for continued reasoning.

What is the difference between JSON mode and function calling?

JSON mode forces the model to return syntactically valid JSON regardless of content. Function calling is a higher level abstraction where the model decides whether to invoke a tool, constructs the argument object for that specific tool's schema, and returns control to the application. JSON mode has no tool concept. Function calling includes intent detection, tool selection, and structured argument generation as distinct steps the model performs.

Can multiple functions be called in one LLM response?

Yes. Most providers support parallel function calling, where the model returns multiple function invocation specifications in a single response. Your runtime can execute them concurrently. Use parallel calling freely for independent read operations. Avoid it for operations with state dependencies — parallel execution can introduce race conditions that are difficult to debug after the fact.

How much does function calling cost in tokens?

Tool definitions add tokens to every request. In practice: 10 tools costs roughly 9,500 input tokens, 20 tools roughly 19,000, and 58 tools roughly 55,000 — just for the tool list, before any conversation content. Tool routing, selecting a relevant subset per request, is the standard mitigation for systems with large tool registries.

What happens when an LLM function call fails in production?

The safest pattern is to validate parameters before execution, return a structured error description to the model when validation fails, and cap retries at three per function invocation. The model can recover from explicit, structured failure feedback. Unhandled failures — where the exception is swallowed or the model receives no feedback — typically cause the agent to stall, hallucinate a false success, or enter a retry loop with no exit condition.

If you are building function calling into a production agentic system and need architecture review, schema design support, or help wiring evaluation into your CI pipeline, the agentic AI consulting service covers all of these as part of a structured engagement.

Written by Mudassir Khan

Agentic AI consultant and AI systems architect based in Islamabad, Pakistan. CEO of Cube A Cloud. 38+ agentic AI launches delivered for global founders and CTOs.

View agentic AI consulting serviceSee SentientOps case study

Related service

Agentic AI Consulting

See scope & pricing →

Related case study

SentientOps Control Center

Read case study →

More on this topic

Need an AI systems architect?

Book a 30-minute architecture call. I will sketch the high-level design for your use case and give you an honest view of the trade-offs.

Book a strategy call →