The Part Nobody Explains: How AI Agents Decide What To Do

In the last post, we saw this:
AI can use tools.
It can:
- search the web
- run code
- open files
- call APIs
Cool.
But something still feels… missing.
Because there's one question almost nobody properly explains:
How does the AI decide which tool to use?
The Hidden Step Most People Never See
When you type something like:
"What's the weather in London?"
the AI doesn't just magically "know" what to do.
Under the hood, something more structured is happening.
Instead of replying normally, the model generates a structured tool call — essentially an instruction that says:
"Use this tool with these inputs."
The exact format varies by model and framework, but the idea is always the same: instead of plain text, the model outputs a structured action for the system to execute.
And honestly, this is the moment AI agents started making more sense to me.
Because suddenly it stopped feeling like magic.
At First, I Assumed One Giant Model Did Everything
My original mental model was basically:
"Okay… GPT probably handles everything itself."
Reasoning. Planning. Tool selection. Responses. Memory. Everything.
And to be fair, a lot of modern systems actually do work like that.
Large models from companies like OpenAI, Anthropic, and Google Gemini can often decide which tools to call directly.
But then I discovered something interesting.
Some systems use smaller, specialized models just for tool calling.
And that's where FunctionGemma comes in.
Meet FunctionGemma
FunctionGemma is a specialized version of Google's Gemma model built specifically for function calling.
Not chatting. Not storytelling. Not writing essays.
Its main job is:
Take user intent → convert it into structured tool calls.
That's it.
And honestly, I think that idea is fascinating.
Because instead of trying to make one giant model do everything…
you can split the system into smaller, focused parts.
And Here's the Wild Part
FunctionGemma is tiny compared to modern large language models.
It's built on the Gemma 3 270M parameter model.
Which sounds ridiculously small in today's AI world.
But the important realization is this:
Tool calling is actually a much narrower problem than open-ended conversation.
The model doesn't need to:
- write novels
- explain philosophy
- debate politics
It mainly needs to:
- understand intent
- choose a function
- generate structured outputs correctly
That constrained problem makes smaller specialized models surprisingly effective — especially after fine-tuning.
And the size has another huge advantage: FunctionGemma is specifically designed to run on-device — on laptops, phones, or edge hardware — without needing a server. That's a big deal for privacy and offline use.
This Completely Changed How I See AI Agents
Before this, I imagined agents like:
"One super-intelligent AI doing everything."
But now I think of them more like systems made of layers:
Main model → reasoning and conversation
Tool-calling layer → converts intent into actions
Tools / APIs → actually perform the work
Orchestration → manages the loop, retries, and flow
Almost like:
Brain → Decision Layer → Hands
And suddenly agents feel a lot less mystical. And a lot more understandable.
So What Actually Happens in an Agent?
At a very basic level, the loop is surprisingly simple.
User says:
"Search AI startups"
The system generates a structured tool call, something like:
tool: search_web
query: "AI startups"
Then:
- The backend executes the tool
- Gets the results
- Sends them back to the model
- The model continues
Think → Call tool → Get result → Continue
That's the core loop.
Real Agents Add More Layers
Of course, production agents usually become much more complex.
They may include:
- memory
- retries
- validation
- planning
- permissions
- context management
- error handling
But the important thing is: the core idea is still understandable.
And honestly, that realization made AI feel way more approachable to me.
One Important Thing Most People Miss
FunctionGemma is not really meant to be dropped in "as-is" as a universal agent model.
Google actually positions it as a foundation for fine-tuning.
Meaning:
- You define your own tools
- Train it on your own examples
- Improve its reliability for your specific use case
So instead of:
"one AI that knows everything"
you get:
"a small, specialized model trained for your specific workflows"
That's a very different philosophy. And the benchmark numbers back it up — the base model scores 58% on Mobile Actions tasks. After fine-tuning? 85%.
And Honestly… I Think This Is Where AI Is Heading
The more I learn about agents, the more it feels like modern AI systems are becoming modular.
Instead of one massive model doing everything, we're starting to see:
- routers
- planners
- memory systems
- verification models
- specialized tool-callers
- local edge models
Smaller parts working together.
And weirdly… that makes the whole field feel less intimidating.
What I'm Doing Next
So instead of only reading about agents…
I want to build one myself.
A small one. From scratch. Nothing insane.
Just:
- a few tools
- a simple loop
- tool-calling logic
- structured outputs
- backend execution
And I'll document the whole process here as I learn.
One Line Worth Remembering
AI agents are not just "one giant AI." They're systems made of smaller parts. And sometimes… the part deciding what action to take next can be surprisingly small.



