Rohan Shakya
AI Engineering · 10 min read

Building AI Agents: A Practical Guide to Agentic Development

A grounded, hands-on look at how AI agents really work: the agent loop, tool calling, planning, memory, evaluation, and the failure modes you'll actually hit.

  • AI Agents
  • LLM
  • Tool Calling
  • Agentic Development
  • Architecture

I've shipped a fair number of LLM-backed features over the last couple of years, and the word "agent" has quietly become one of the most overloaded terms in our industry. Every demo is an "agent," every wrapper around a chat completion is an "agent." That vagueness makes it hard to reason about what you're actually building.

So let me strip the hype away. In this post I'll explain what an agent really is, walk through the loop that makes it tick, and share the practical decisions — tools, planning, memory, evaluation, guardrails — that separate something that works from something that quietly burns tokens in an infinite loop. This is the guide I wish I'd had when I started.

What an agent actually is

An agent, at its core, is three things glued together:

  1. An LLM that decides what to do next.
  2. A set of tools the model can invoke to affect the outside world (read a file, call an API, run a query).
  3. A loop that feeds the results of those actions back to the model until a goal is reached.

That's it. The model is the brain, the tools are the hands, and the loop is what turns a single prediction into goal-directed behavior over multiple steps.

The key distinction from a plain LLM call: an agent decides how many steps it takes and which tools it uses. You hand it an objective, not a script.

Contrast that with a workflow, where you — the engineer — hard-code the sequence: "summarize, then classify, then route." The LLM fills in each blank, but the control flow is yours. In an agent, control flow is delegated to the model. That delegation is powerful and also exactly where things get risky.
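
To make the contrast concrete, here's a minimal sketch of the "summarize, then classify, then route" workflow. The `Complete` function type, the `triageTicket` name, and the three route labels are all hypothetical stand-ins, not any particular SDK's API; the point is only that the order of steps lives in your code, not in the model's output.

```typescript
// Hypothetical: any function that sends a prompt to a model and returns its text.
type Complete = (prompt: string) => Promise<string>;

// Hypothetical route labels for a support-ticket triage flow.
type Route = "billing" | "support" | "sales";
const ROUTES: Route[] = ["billing", "support", "sales"];

// A workflow: the engineer fixes the order (summarize, classify, route).
// The model fills in each step but never chooses what happens next.
async function triageTicket(ticket: string, complete: Complete) {
  const summary = await complete(
    `Summarize this ticket in one sentence:\n${ticket}`,
  );

  const reply = await complete(
    `Classify as one of ${ROUTES.join(", ")}. Reply with the label only:\n${summary}`,
  );
  const label = reply.trim().toLowerCase();

  // Fall back to a safe default if the model replies with anything unexpected.
  const route: Route = ROUTES.find((r) => r === label) ?? "support";
  return { summary, route };
}
```

Because the control flow is ordinary code, you can unit-test it with a fake `complete` function and never touch a real model.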

The agent loop

The mental model I keep coming back to is a four-phase cycle: perceive → decide → act → observe.

  • Perceive — assemble the current context: the goal, prior steps, tool results, relevant memory.
  • Decide — the model produces either a final answer or a request to call one or more tools.
  • Act — your runtime executes the requested tool calls.
  • Observe — the tool outputs are appended to the context, and the loop repeats.

The loop terminates when the model emits a final answer instead of a tool call, or when a stopping condition (max steps, budget, timeout) trips. That termination condition matters more than people expect — I'll come back to it under failure modes.

Tool / function calling

Tools are how the model reaches outside its own context. Modern LLM APIs support structured tool calling: you describe each tool with a name, a natural-language description, and a JSON schema for its parameters. The model responds with a structured request naming the tool and its arguments, you execute it, and you return the result.

The quality of your tool descriptions matters as much as the code behind them. The model can only call a tool well if it understands when and how to use it.

ts
const tools = [
  {
    name: "search_orders",
    description:
      "Look up a customer's orders by email. Returns the 10 most recent " +
      "orders with status and total. Use this before answering any " +
      "question about order history or refunds.",
    input_schema: {
      type: "object",
      properties: {
        email: { type: "string", description: "Customer email address" },
        status: {
          type: "string",
          enum: ["all", "open", "shipped", "cancelled"],
          description: "Filter by status. Defaults to all.",
        },
      },
      required: ["email"],
    },
  },
];

A few hard-won rules:

  • Keep tools narrow and well-named. One tool, one job. search_orders beats a god-tool database_query that takes raw SQL.
  • Make outputs compact. Return what the model needs, not your full ORM object. Every field you return costs context and invites confusion.
  • Validate arguments before executing. The model will occasionally produce malformed or out-of-range inputs. Treat tool args like untrusted user input.
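
The last rule is worth a sketch. Assuming the `search_orders` schema above, a hand-rolled validator might look like the following. The `parseSearchOrdersArgs` name and the deliberately loose email check are my own illustration; in a real codebase you'd more likely reach for a schema-validation library.

```typescript
// Argument shape implied by the search_orders schema above.
type SearchOrdersArgs = {
  email: string;
  status: "all" | "open" | "shipped" | "cancelled";
};

const STATUSES = ["all", "open", "shipped", "cancelled"] as const;

// Parse raw model output into typed args, or throw an error message
// that is specific enough for the model to correct itself.
function parseSearchOrdersArgs(raw: unknown): SearchOrdersArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Arguments must be a JSON object");
  }
  // status is optional in the schema and defaults to "all".
  const { email, status = "all" } = raw as Record<string, unknown>;

  // Loose check: exactly one "@" and no whitespace. Enough to reject obvious garbage.
  if (typeof email !== "string" || !/^[^@\s]+@[^@\s]+$/.test(email)) {
    throw new Error("email must be a valid email address");
  }

  const match = STATUSES.find((s) => s === status);
  if (!match) {
    throw new Error(`status must be one of: ${STATUSES.join(", ")}`);
  }
  return { email, status: match };
}
```

The thrown messages are written for the model, not for a human operator: they say what was wrong and what the valid options are.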

When to use an agent vs a simple workflow

This is the decision I see teams get wrong most often. Agents are flexible but harder to test, slower, more expensive, and less predictable. Reach for a workflow first.

Use a workflow when:

  • The steps are known and stable.
  • You need deterministic, auditable behavior.
  • Latency and cost are tight.

Use an agent when:

  • The number and order of steps genuinely depend on the input.
  • The task requires open-ended exploration (e.g., debugging, research, multi-step data gathering).
  • A fixed pipeline would balloon into an unmaintainable tangle of branches.

My rule of thumb: if you can draw the flowchart, build the flowchart. Only when the flowchart has too many unknown branches does an agent earn its keep.

Planning

For multi-step tasks, it helps for the model to plan before acting. There are a few patterns, in rough order of complexity:

  • Implicit planning — the model reasons step by step inside its normal generation and just calls tools as it goes. Good enough for short tasks.
  • Explicit plan-then-execute — you first ask the model to produce a plan (a list of steps), then execute against it, optionally re-planning when reality diverges.
  • Decomposition — break a big goal into sub-goals, each handled in its own sub-loop.

A practical middle ground I like is asking the model to maintain a lightweight to-do list as part of its working context. It keeps long tasks from drifting and gives you something to log and inspect.

ts
// A simple plan structure the agent updates as it works
type PlanStep = {
  id: number;
  description: string;
  status: "pending" | "in_progress" | "done" | "skipped";
};
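
As a rough sketch of how that list might be used, here are two small helpers: one to update a step and one to render the plan as text you can inject into the context or write to a log. The `setStatus` and `renderPlan` names and the checkbox markers are hypothetical, and the `PlanStep` type is repeated so the snippet stands alone.

```typescript
type PlanStep = {
  id: number;
  description: string;
  status: "pending" | "in_progress" | "done" | "skipped";
};

// Return a new plan with one step's status changed. The input is not mutated,
// so earlier snapshots stay valid for logging.
function setStatus(
  plan: PlanStep[],
  id: number,
  status: PlanStep["status"],
): PlanStep[] {
  return plan.map((step) => (step.id === id ? { ...step, status } : step));
}

// Render the plan as a plain-text checklist.
function renderPlan(plan: PlanStep[]): string {
  const marks: Record<PlanStep["status"], string> = {
    pending: "[ ]",
    in_progress: "[~]",
    done: "[x]",
    skipped: "[-]",
  };
  return plan
    .map((s) => `${marks[s.status]} ${s.id}. ${s.description}`)
    .join("\n");
}
```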

Don't over-engineer this. Elaborate planning frameworks often add latency and failure surface without improving outcomes. Start simple and add structure only when you can measure that it helps.

Memory and context

Agents are bounded by the context window. The art is deciding what goes in it on each turn. I think about memory in layers:

  • Working memory — the current loop's messages and tool results. This grows fast on long tasks.
  • Short-term/session memory — a running summary of what's happened so far, used to compact the working memory when it gets large.
  • Long-term memory — facts persisted across sessions, usually in a database or vector store and retrieved on demand.

A common technique is context compaction: when the message history grows past a threshold, summarize the older turns into a compact note and drop the raw messages. This keeps the agent coherent over long runs without blowing the window.
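
Here's one way compaction could look, as a sketch under a few loud assumptions: the first two messages are the system prompt and the goal (as in the loop later in this post), token counts are estimated at roughly four characters per token, and `summarize` is a hypothetical function you supply (typically another model call). Real APIs also expect each tool result to stay paired with the assistant message that requested it, which this sketch ignores.

```typescript
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Hypothetical: turns a list of old messages into a short note.
type Summarize = (messages: Msg[]) => Promise<string>;

// Crude size estimate, assuming about 4 characters per token.
const estimateTokens = (messages: Msg[]): number =>
  Math.ceil(messages.reduce((n, m) => n + m.content.length, 0) / 4);

async function compact(
  messages: Msg[],
  summarize: Summarize,
  maxTokens = 8000,
  keepRecent = 6,
): Promise<Msg[]> {
  // Under the threshold: leave the history untouched.
  if (estimateTokens(messages) <= maxTokens) return messages;

  // Always keep the system prompt and the original goal verbatim.
  const head = messages.slice(0, 2);
  const tail = messages.slice(2);
  const cut = tail.length - keepRecent;
  if (cut <= 0) return messages; // nothing old enough to fold away

  // Summarize the older turns and keep only the most recent ones raw.
  const note = await summarize(tail.slice(0, cut));
  return [
    ...head,
    { role: "user", content: `Summary of earlier steps:\n${note}` },
    ...tail.slice(cut),
  ];
}
```

Keeping the goal outside the summarized span is deliberate: it is the one piece of context the agent can least afford to lose.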

For retrieval, resist the urge to stuff everything in. Retrieve the smallest relevant slice. A focused three-paragraph snippet beats ten loosely related documents that crowd out the actual task.

Multi-agent orchestration

Once a single agent works, the temptation is to spawn many. Multi-agent systems can help when a task has genuinely separable concerns — for example, an orchestrator that delegates research sub-tasks to parallel worker agents and synthesizes their results.

ts
// Orchestrator delegating to specialized sub-agents
const subtasks = await orchestrator.plan(goal);
const results = await Promise.all(
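  // runAgent (defined later in this post) takes (goal, tools), so the worker
  // prompt is folded into the goal; this assumes each task is a string
  subtasks.map((task) => runAgent(`${workerPrompt}\n\nTask: ${task}`, tools)),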
);
const answer = await orchestrator.synthesize(goal, results);

But be honest about the cost. Every extra agent multiplies token usage and adds a coordination problem. The information one agent gathers has to be passed cleanly to the next, and lossy hand-offs are a real source of bugs. I only go multi-agent when the sub-tasks are independent enough to run in parallel and the single-agent version has clearly hit a wall.

Evaluation

You cannot improve what you don't measure, and agents are notoriously hard to measure because they're non-deterministic and multi-step. A few approaches that have actually worked for me:

  • End-to-end task evals — a fixed set of input scenarios with checkable success criteria. Did the agent reach the right final state? This is the metric that matters most.
  • Trajectory inspection — log every step (decision, tool call, args, result) so you can replay what the agent did when it fails. This is non-negotiable; without traces you're debugging blind.
  • LLM-as-judge — for fuzzy outputs, use a separate model with a rubric to score quality. Useful, but validate the judge against human ratings before you trust it.

Build the eval harness early. The teams that ship reliable agents are the ones who can run fifty scenarios in CI and see a regression before users do.
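
A harness doesn't need to be fancy to be useful. The following is a minimal sketch, assuming your agent can be wrapped as a function from input string to output string; the `Scenario` and `EvalReport` shapes and the `runEvals` name are hypothetical. Real checks will often inspect final state (a database row, a file) rather than just the returned text.

```typescript
type Scenario = {
  name: string;
  input: string;
  // Programmatic success criterion for this scenario.
  check: (output: string) => boolean;
};

type EvalReport = { passed: number; failed: string[] };

async function runEvals(
  agent: (input: string) => Promise<string>,
  scenarios: Scenario[],
): Promise<EvalReport> {
  const report: EvalReport = { passed: 0, failed: [] };
  for (const s of scenarios) {
    try {
      const output = await agent(s.input);
      if (s.check(output)) report.passed += 1;
      else report.failed.push(s.name);
    } catch (err) {
      // A crash counts as a failure, not as a skipped scenario.
      report.failed.push(s.name);
    }
  }
  return report;
}
```

In CI, you'd fail the build when `report.failed` is non-empty, or when the pass rate drops below a threshold you've chosen, since non-deterministic agents rarely hold at 100%.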

Guardrails and safety

Because you're handing control flow to a model that can take real actions, guardrails aren't optional.

  • Least privilege — give the agent only the tools it needs. An agent that can read should not also be able to delete unless that's the job.
  • Human-in-the-loop for irreversible actions — sending money, deleting data, emailing customers. Require confirmation.
  • Sandboxing — run code execution and shell access in an isolated environment, never against production directly.
  • Input/output filtering — watch for prompt injection in retrieved content. A document the agent reads can contain instructions; treat external text as data, not commands.
  • Hard limits — max steps, token budget, and wall-clock timeout on every run.
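
The hard limits in particular are cheap to implement. Here's a sketch of a small budget tracker covering all three; the `RunBudget` class and its method names are hypothetical, and the injectable clock exists only to make it testable.

```typescript
type Limits = { maxSteps: number; maxTokens: number; timeoutMs: number };

class RunBudget {
  private steps = 0;
  private tokens = 0;
  private readonly limits: Limits;
  private readonly now: () => number;
  private readonly startedAt: number;

  // `now` defaults to the wall clock; tests can pass a fake.
  constructor(limits: Limits, now: () => number = () => Date.now()) {
    this.limits = limits;
    this.now = now;
    this.startedAt = now();
  }

  // Call once per loop iteration with that step's token usage.
  record(tokensUsed: number): void {
    this.steps += 1;
    this.tokens += tokensUsed;
  }

  // Returns the reason to stop, or null if the run may continue.
  exceeded(): string | null {
    if (this.steps >= this.limits.maxSteps) return "max steps reached";
    if (this.tokens >= this.limits.maxTokens) return "token budget exhausted";
    if (this.now() - this.startedAt >= this.limits.timeoutMs) return "timeout";
    return null;
  }
}
```

Check `exceeded()` at the top of every loop iteration and log the returned reason, so a stopped run tells you which limit it hit.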

Common failure modes

These are the ones I've personally debugged more than once:

  • Infinite or repetitive loops — the agent keeps calling the same tool, or oscillates between two states, never converging. The fix: a max-step cap, plus detecting repeated identical tool calls and short-circuiting.
  • Hallucinated tool calls — the model invokes a tool that doesn't exist or invents arguments. Validate against your schema and return a clear error the model can recover from.
  • Context drift — on long runs the agent forgets the original goal. Compaction that preserves the goal and re-injecting the objective each turn both help.
  • Premature completion — declaring success before the task is done. Tighten your prompt's definition of "done" and verify final state programmatically when you can.
  • Over-eager tool use — calling tools when it should just answer. Often a description problem: be explicit about when not to use a tool.
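
For the first failure mode, detecting repeated identical calls can be as simple as counting them. This is a sketch with hypothetical names (`ToolCall`, `callKey`, `RepeatDetector`); it counts repeats across the whole run rather than only consecutive ones, so it also catches an agent oscillating between two calls.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };

// Build a stable key for a call. Top-level argument keys are sorted so that
// {a, b} and {b, a} match; nested objects keep their own key order.
function callKey(call: ToolCall): string {
  const sorted = Object.keys(call.args)
    .sort()
    .map((k) => [k, call.args[k]]);
  return `${call.name}:${JSON.stringify(sorted)}`;
}

class RepeatDetector {
  private readonly seen = new Map<string, number>();
  private readonly maxRepeats: number;

  constructor(maxRepeats = 2) {
    this.maxRepeats = maxRepeats;
  }

  // Records the call and returns true once this exact call has been made
  // more than maxRepeats times in the current run.
  isLooping(call: ToolCall): boolean {
    const key = callKey(call);
    const count = (this.seen.get(key) ?? 0) + 1;
    this.seen.set(key, count);
    return count > this.maxRepeats;
  }
}
```

When `isLooping` returns true, one option is to skip execution and return a tool error telling the model it has already made this call, which gives it a chance to change course before the step cap ends the run.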

A minimal agent loop

Here's a stripped-down sketch in TypeScript-flavored pseudocode that ties it together. Real implementations add streaming, error handling, and tracing, but the shape is exactly this.

ts
async function runAgent(goal: string, tools: Tool[], maxSteps = 12) {
  const messages: Message[] = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: goal },
  ];

  for (let step = 0; step < maxSteps; step++) {
    // DECIDE: ask the model what to do next
    const response = await llm.chat({ messages, tools });

    // No tool calls means the model is done — return its answer
    if (!response.toolCalls?.length) {
      return response.content;
    }

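    // Keep the tool calls on the assistant message so each tool result
    // pushed below can be matched to the request that produced it
    messages.push({
      role: "assistant",
      content: response.content,
      toolCalls: response.toolCalls,
    });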

    // ACT: execute each requested tool call
    for (const call of response.toolCalls) {
      let result: string;
      try {
        const fn = lookupTool(tools, call.name);
        if (!fn) throw new Error(`Unknown tool: ${call.name}`);
        const args = validate(fn.schema, call.args); // guardrail
        result = await fn.execute(args);
      } catch (err) {
        // Return errors to the model so it can recover
        result = `Error: ${(err as Error).message}`;
      }

      // OBSERVE: feed the result back into context
      messages.push({
        role: "tool",
        toolCallId: call.id,
        content: truncate(result, 4000),
      });
    }
  }

  throw new Error("Agent exceeded max steps without finishing");
}

Notice the things that make it robust: a hard step cap, tool validation, errors returned to the model rather than thrown away, and truncated tool output. In my experience, those four defenses head off most runaway behavior.
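
The loop leans on a couple of helpers it never defines. Here's a sketch of what `lookupTool` and `truncate` might look like, assuming a `Tool` shape inferred from how the loop uses it; the exact fields are my guess, not a fixed interface.

```typescript
// Hypothetical Tool shape, inferred from how the loop uses it.
type Tool = {
  name: string;
  schema: object;
  execute: (args: unknown) => Promise<string>;
};

// Find a tool by name; undefined means the model asked for a tool that doesn't exist.
function lookupTool(tools: Tool[], name: string): Tool | undefined {
  return tools.find((t) => t.name === name);
}

// Keep at most maxChars of the tool output and append a short marker so the
// model knows text was cut. The marker itself is added on top of maxChars.
function truncate(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const omitted = text.length - maxChars;
  return `${text.slice(0, maxChars)}\n[truncated ${omitted} characters]`;
}
```

The explicit marker matters: silently cutting output can lead the model to treat a partial result as complete.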

Final thoughts

Agents are not magic, and they're not a default. They're a specific architecture — an LLM, tools, and a loop — that trades predictability for flexibility. The engineering discipline that makes them reliable is unglamorous: narrow tools, compact context, hard limits, real evaluation, and traces you can actually read.

My advice: start with the simplest thing that could work. Build a workflow if a workflow will do. When you do reach for an agent, instrument it heavily from day one, cap everything, and earn each bit of additional autonomy with a passing eval. Do that, and agentic development stops feeling like a gamble and starts feeling like ordinary, if sophisticated, software engineering.