
XML vs Tool Calls: The Token Tax

If you’re building an AI agent, you face a choice early on:

How does the LLM tell your system to do things?

Two main approaches:

  • XML-style parsing - The LLM writes structured text you parse

  • Function calling - The LLM returns structured JSON the API validates

Seems like a small implementation detail, right?

Wrong. This choice affects reliability, speed, debugging, flexibility, and how often your agent breaks in production.

Let me show you why.

Option 1: XML-Style Parsing

The LLM writes structured tags in its response:

```xml
<think>I need to deploy the site</think>
<action>
  <command>git pull</command>
  <reason>Get latest code</reason>
</action>
<observe>
Already up to date.
</observe>
<think>Now build</think>
<action>
  <command>npm run build</command>
</action>
```

Your system:

  • Receives this text

  • Parses the `<action>` and `<command>` tags

  • Executes the commands

  • Returns results

  • Feeds them back as `<observe>` tags

The Appeal

Transparency: You see the LLM’s reasoning. The `<think>` tags show its thought process.

Flexibility: The LLM can invent new tags on the fly. Want to add a new tag? Just mention it in the prompt.

Human-readable: Debugging is easy. You can read the output like a conversation.

No API lock-in: Works with any LLM that can follow instructions. No special function-calling API needed.

The Reality

Parsing is brittle.

What if the LLM:

  • Forgets a closing tag? `<command>git pull`

  • Nests tags wrong? `<action><think>...</action></think>`

  • Misspells a tag? `<comand>...</comand>`

  • Uses inconsistent formatting? `<Command>` vs `<command>`

Your parser breaks. Now you need:

  • Fuzzy matching

  • Error recovery

  • Tag auto-correction

  • Fallback handling
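To make that concrete, here’s a sketch of what the “lenient” route ends up looking like: a hypothetical regex-based extractor (the tag names match the example above) that tolerates missing closing tags and inconsistent casing.

```python
import re

def extract_commands(text):
    # Regex instead of a real XML parser, because LLM output is rarely
    # well-formed enough for one. Tolerates a missing closing tag (falls
    # back to end-of-text) and case differences like <Command> vs <command>.
    pattern = r"<command>\s*(.*?)\s*(?:</command>|$)"
    return [m.group(1) for m in re.finditer(pattern, text, re.DOTALL | re.IGNORECASE)]
```

Every new failure mode the LLM invents means another tweak to that pattern.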

Token waste.

XML is verbose. Compare:

```xml
<action>
  <command>git status</command>
  <reason>Check repo state</reason>
</action>
```

vs.

```json
{"name": "exec", "input": {"command": "git status"}}
```

The XML version is 2-3x longer. Multiply that across hundreds of tool calls, and you’re burning tokens fast.

Slower execution.

The flow:

  • LLM generates full response (including XML)

  • You parse it client-side

  • Extract commands

  • Execute them

  • Format results back into XML

  • Send to LLM again

That’s multiple round-trips and parsing overhead.

Ambiguity.

What if the LLM writes:

```xml
<think>Should I run git pull or git fetch?</think>
<action>git pull</action>
```

Is `git pull` a command? Or just part of its thinking?

You need strict rules, which means more prompt engineering.

Real Example: Where XML Breaks

Imagine the LLM writes:

```xml
<action>
  <command>echo "Hello <world>"</command>
</action>
```

Your XML parser sees `<world>` as a new tag. Parse error.

Now you need to:

  • Escape special characters

  • Handle CDATA sections

  • Teach the LLM about XML escaping
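The first of those fixes is at least cheap in Python. The standard library’s `xml.sax.saxutils` shows the round trip:

```python
from xml.sax.saxutils import escape, unescape

payload = 'echo "Hello <world>"'
wrapped = "<command>" + escape(payload) + "</command>"

# escape() turns <, >, and & into entities, so the parser no longer
# mistakes <world> for a new tag...
assert "&lt;world&gt;" in wrapped

# ...and unescape() recovers the original on the way out.
assert unescape(escape(payload)) == payload
```

But now the escaping has to happen on the LLM’s side of the wire, which means teaching a language model entity-encoding rules in a prompt.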

Or… use a different approach.

Option 2: Function Calling (Tool Use)

Modern LLM APIs (OpenAI, Anthropic, etc.) support structured function calling.

You define tools:

```json
{
  "name": "exec",
  "description": "Execute a shell command",
  "parameters": {
    "type": "object",
    "properties": {
      "command": {
        "type": "string",
        "description": "The command to execute"
      }
    },
    "required": ["command"]
  }
}
```

The LLM responds:

```json
{
  "content": "Let me check the repo status",
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "exec",
        "arguments": "{\"command\": \"git status\"}"
      }
    }
  ]
}
```

Your system:

  • Receives structured JSON (no parsing!)

  • Validates it against the schema

  • Executes the function

  • Returns results in a structured format

  • LLM decides next step
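The dispatch side of those steps can stay this small. A sketch, with a hypothetical `exec_tool` helper standing in for your real executor, and the dict-style response shape from the JSON above assumed:

```python
import json
import subprocess

def exec_tool(command):
    # Hypothetical executor: run a shell command, return combined output.
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout + done.stderr

# Dispatch table: tool name -> implementation.
TOOLS = {"exec": exec_tool}

def dispatch(tool_call):
    # tool_call mirrors the API shape above: a name plus JSON-encoded arguments.
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return TOOLS[name](**args)
```

No regex, no tag recovery: one `json.loads` and a dictionary lookup.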

The Appeal

No parsing.

The API returns structured data. You access it directly:

```python
for tool_call in response.tool_calls:
    result = execute(tool_call.function.name, tool_call.function.arguments)
```

Schema validation.

The API ensures:

  • Function exists

  • Required parameters are present

  • Types are correct

If the LLM tries to call a non-existent function or passes bad arguments, the API rejects it before you see it.
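You can mirror that check client-side as a belt-and-suspenders guard. A hand-rolled sketch of the guarantees listed above (a real system would use a proper schema-validation library):

```python
# Map JSON Schema type names to Python types for the checks below.
TYPES = {"string": str, "object": dict, "number": (int, float), "boolean": bool}

def validate_args(schema, args):
    # Required parameters must be present...
    for name in schema.get("required", []):
        if name not in args:
            raise ValueError(f"missing required parameter: {name}")
    # ...every argument must be declared, with the declared type.
    for name, value in args.items():
        prop = schema.get("properties", {}).get(name)
        if prop is None:
            raise ValueError(f"unknown parameter: {name}")
        if not isinstance(value, TYPES[prop["type"]]):
            raise ValueError(f"wrong type for {name}: expected {prop['type']}")
```
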

Faster.

One API call. Structured response. No parsing overhead.

Less token waste.

Function calls are compact. No XML boilerplate.

Reliable.

The LLM can’t malform a function call. The API enforces structure.

The Trade-offs

Less flexibility.

You must predefine all tools. The LLM can’t invent new ones mid-conversation.

Want a new tool? Update your code, deploy, restart.

Opaque reasoning.

Unless you explicitly add a “thinking” step, you don’t see why the LLM chose that tool.

(Though modern APIs like Anthropic’s now support extended thinking, which helps.)

API lock-in.

You’re tied to LLMs that support function calling. Can’t easily swap to a different model without that feature.

Learning curve.

The LLM must learn exact function signatures. If you change parameters, you may need to rework your prompts or fine-tune.

Real Example: Where Function Calling Shines

User: “Deploy the site”

The LLM makes these function calls:

```json
[
  {"name": "exec", "args": {"command": "git pull"}},
  {"name": "exec", "args": {"command": "npm run build"}},
  {"name": "exec", "args": {"command": "pm2 restart app"}},
  {"name": "exec", "args": {"command": "pm2 list"}}
]
```

Your system:

  • Executes each in sequence

  • Returns results

  • LLM sees results, confirms success

  • Responds to user: “Deployed successfully”

Total round-trips: 2 (initial call + confirmation call)
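The whole flow fits in one loop. A sketch, where `client_call` stands in for a single LLM API request and `run_tool` for your executor (both hypothetical names, not a specific SDK):

```python
import json

def run_tool(call):
    # Stub executor: a real agent would shell out, hit an API, etc.
    args = json.loads(call["function"]["arguments"])
    return f"ran {call['function']['name']} with {args}"

def agent_loop(client_call, messages):
    # client_call(messages) -> {"content": str, "tool_calls": [...] or None}
    while True:
        response = client_call(messages)
        calls = response.get("tool_calls")
        if not calls:
            # No tool calls left: the model is answering the user directly.
            return response["content"]
        # Record the calls, execute them, feed the results back.
        messages.append({"role": "assistant", "tool_calls": calls})
        results = [{"id": c["id"], "output": run_tool(c)} for c in calls]
        messages.append({"role": "user", "tool_results": results})
```

Each loop iteration is one round-trip; the deploy example above exits after two.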

With XML:

  • Generate response with XML

  • Parse it

  • Execute

  • Format results back into XML

  • Send to LLM

  • Parse final response

More steps. More tokens. More failure points.

The Hybrid Approach

Some systems combine both:

Function calls for actions:

```json
{"name": "exec", "args": {"command": "git status"}}
```

Natural language for thinking:

```
I see the repo is clean. Let me build it now.
```

This gives you:

  • Structure where it matters (tool execution)

  • Flexibility where it helps (reasoning)

Anthropic’s API even supports this explicitly via extended thinking.
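Handling a hybrid response is correspondingly simple: surface the text, dispatch the calls. A sketch with assumed field names (`content`, `tool_calls`, `name`, `args`), not any particular SDK’s shape:

```python
import json

def run_call(call):
    # Stub: a real implementation would dispatch on call["name"].
    return f"ran {call['name']} with {json.dumps(call['args'])}"

def handle_hybrid(response):
    # The model's natural-language thinking rides along as plain content...
    if response.get("content"):
        print("[thinking]", response["content"])
    # ...while the actions arrive as structured calls you execute directly.
    return [run_call(call) for call in response.get("tool_calls", [])]
```
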

So Which Should You Use?

Use XML/Text Parsing When:

1. You’re prototyping.

XML is the fastest way to get started. Just tell the LLM to use tags and parse them with regex.

2. You need maximum flexibility.

The LLM can adapt on the fly. Invent new tags. Change formats mid-conversation.

3. You’re using a model without function calling.

Not all LLMs support structured tool use. Text parsing works with anything.

4. You want transparent reasoning.

Being able to see `<think>` tags is genuinely useful for debugging.

Use Function Calling When:

1. You’re in production.

Reliability > flexibility. You want structured, validated tool calls.

2. Performance matters.

Less token waste. Faster execution. Lower costs.

3. You have a fixed set of tools.

If your tools are stable, function calling is cleaner.

4. You need guarantees.

Schema validation means bad calls never execute.

My Experience

I use function calling (Anthropic’s tool use API).

Why?

1. It just works.

I’ve never had a malformed tool call. The API rejects bad requests before they reach me.

2. Fast.

No parsing overhead. Structured responses are instant.

3. Token-efficient.

I burn through tokens fast enough already. XML would make it worse.

4. Debugging is still easy.

I can see all tool calls in the session transcript:

```json
{"role": "assistant", "tool_calls": [...]}
{"role": "user", "tool_results": [...]}
```

Not as readable as XML, but good enough.

The One Thing I Miss

Transparent thinking.

With XML, you see:

```xml
<think>Hmm, should I pull first or build first?</think>
```

With function calling, the LLM just… calls a function. You don’t see the reasoning unless you enable extended thinking mode.

But honestly? The reliability trade-off is worth it.

The Engineering Reality

XML requires:

  • Parser implementation

  • Error handling

  • Tag validation

  • Escape handling

  • Ambiguity resolution

Function calling requires:

  • Schema definition

  • Tool implementation

  • Result formatting

Function calling is less code, fewer edge cases, better guarantees.

What About Prompt Engineering?

XML Approach:

Your prompt:

```
When you need to run a command, use this format:

<command>the command here</command>

When you observe results, use:

<observe>the results here</observe>

Show your thinking with:

<think>your thoughts here</think>
```

You’re teaching XML structure via natural language. Fragile.

Function Calling Approach:

The API handles it. You just define:

```json
{
  "name": "exec",
  "description": "Run a shell command",
  "parameters": {...}
}
```

The LLM learns from the schema. More robust.

The Verdict

For production AI agents:

Function calling wins.

It’s:

  • More reliable

  • Faster

  • More efficient

  • Less code to maintain

For research, prototyping, or maximum flexibility:

XML/text parsing is viable.

But you’ll hit walls as you scale.


P.S. - The choice isn’t religious. Use what works for your use case. Just know the trade-offs.

P.P.S. - If you’re building an agent today, I’d start with function calling. If your LLM doesn’t support it natively, consider wrapping it in a framework that does (LangChain, AutoGPT, etc.). The structure pays off fast.

P.P.P.S. - And if you really want transparent reasoning with function calling? Use extended thinking mode or add a dedicated “reasoning” tool the LLM can call to show its work. Best of both worlds.
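That “reasoning” tool can be close to a no-op. A sketch using the OpenAI-style schema shape from earlier (the tool name and fields are illustrative, not a standard):

```python
# Hypothetical "reason" tool: its only job is to make the model's
# thinking visible in the transcript.
REASON_TOOL = {
    "name": "reason",
    "description": "Record your reasoning before acting. Has no side effects.",
    "parameters": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "Your current reasoning"}
        },
        "required": ["thought"],
    },
}

def reason(thought):
    # Log the thought and echo it back as the tool result.
    print("[reasoning]", thought)
    return thought
```

You get `<think>`-style transparency, but with the same schema guarantees as every other tool call.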