Getting Started with the Claude API: Building with LLMs

I've integrated a few different LLM providers into production apps, and the Anthropic Claude API is consistently the one I find easiest to reason about. The whole thing is built around a single endpoint — the Messages API — and almost everything else (tools, streaming, caching, structured output) is a feature of that one endpoint rather than a separate API to learn.

This post is a practical on-ramp. I'll use the official @anthropic-ai/sdk (TypeScript/JavaScript) throughout, because that's what most of my work is in, but the shapes map directly to the Python SDK if that's your stack. By the end you'll understand the request model, how to hold a conversation, how to stream, how tool use works, and how to keep your app from falling over on errors and rate limits.

What the Messages API is

At its core, the Claude API has one job: you send a list of messages, and it returns the next message. That's it. A message has a role (user or assistant) and content. You build up a conversation by appending to the list and sending it back.

The API is stateless. Claude doesn't remember anything between requests — you hold the conversation history and resend it each time. This feels like extra work at first, but it's actually liberating: you have complete control over what context the model sees on every turn.

Three parameters are required on every request: model (which Claude model to use), max_tokens (the ceiling on the response length), and messages (the conversation so far).

Setting up the SDK

Install it:

npm install @anthropic-ai/sdk

Then create a client. The SDK reads your API key from the ANTHROPIC_API_KEY environment variable by default — don't hardcode it:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
// Equivalent to: new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

Set the key in your environment (a .env file in development, a secrets manager in production):

export ANTHROPIC_API_KEY="sk-ant-..."

Your first request

Here's a complete, minimal call:

const response = await client.messages.create({
  model: 'claude-opus-4-8',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: 'Explain what an API is in one sentence.' },
  ],
});

for (const block of response.content) {
  if (block.type === 'text') {
    console.log(block.text);
  }
}

Two things worth noting. First, content on the response is an array of content blocks, not a plain string. A response can contain multiple blocks (text, tool calls, thinking), so you iterate and narrow by block.type. Second, I'm using a current Claude model string here — pick the model that fits your latency/cost/capability needs; the API surface is identical across them.

The response also carries a usage object (input and output token counts) and a stop_reason telling you why generation ended — end_turn for a natural finish, max_tokens if you hit the cap, tool_use if the model wants to call a tool.
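As a rough sketch of how those fields get used, the helpers below pull the text out of a response and map stop_reason to a next step. The types are local stand-ins trimmed to the fields read here (the SDK's real types carry more), and extractText and nextAction are hypothetical names, not SDK functions.

```typescript
// Local stand-ins for the response shape (assumed, trimmed to what is read here).
type TextBlock = { type: 'text'; text: string };
type ContentBlock =
  | TextBlock
  | { type: 'tool_use'; id: string; name: string; input: unknown }
  | { type: 'thinking'; thinking: string };

interface MessageLike {
  content: ContentBlock[];
  stop_reason: string | null;
  usage: { input_tokens: number; output_tokens: number };
}

// Hypothetical helper: join every text block and ignore other block types.
function extractText(message: MessageLike): string {
  return message.content
    .filter((b): b is TextBlock => b.type === 'text')
    .map((b) => b.text)
    .join('');
}

// Hypothetical helper: translate stop_reason into what the caller should do next.
function nextAction(message: MessageLike): 'done' | 'truncated' | 'run_tools' | 'other' {
  switch (message.stop_reason) {
    case 'end_turn':
      return 'done'; // natural finish
    case 'max_tokens':
      return 'truncated'; // hit the cap; the text is incomplete
    case 'tool_use':
      return 'run_tools'; // the model is waiting on a tool result
    default:
      return 'other';
  }
}
```

Branching on stop_reason in one place keeps the rest of the app from guessing whether a reply is complete.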

System prompts

The system parameter sets the model's role, behavior, and constraints. It's not a message in the messages array — it's a top-level field. This is where you put instructions that should apply to the whole conversation:

const response = await client.messages.create({
  model: 'claude-opus-4-8',
  max_tokens: 1024,
  system:
    'You are a senior TypeScript engineer. Answer concisely, ' +
    'prefer code examples over prose, and never invent APIs.',
  messages: [{ role: 'user', content: 'How do I debounce a function?' }],
});

A good system prompt does most of the heavy lifting for tone and behavior. Keep it stable across requests — both for consistency and, as we'll see, because a stable system prompt is what makes prompt caching effective.

Multi-turn conversations

Because the API is stateless, a conversation is just an accumulating array. After each response, you append the assistant's reply and the next user message, then send the whole thing again:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: 'My favorite language is TypeScript.' },
];

let res = await client.messages.create({
  model: 'claude-opus-4-8',
  max_tokens: 1024,
  messages,
});

// Append the assistant's full content, then the next user turn
messages.push({ role: 'assistant', content: res.content });
messages.push({ role: 'user', content: 'What did I just say my favorite language was?' });

res = await client.messages.create({
  model: 'claude-opus-4-8',
  max_tokens: 1024,
  messages,
});

The rules: the first message must be user, and you push the assistant's content (the array, not just the text) back onto the history so tool calls and other blocks are preserved. Using the SDK's Anthropic.MessageParam type keeps everything type-safe.
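Those rules are small enough to wrap in a helper. Here is a minimal sketch, using local stand-in types rather than the SDK's, and a hypothetical recordExchange function that returns a new history instead of mutating the old one:

```typescript
// Local stand-ins for the message shapes (assumed; the SDK's MessageParam is richer).
type HistoryBlock = { type: string; [key: string]: unknown };
interface Turn {
  role: 'user' | 'assistant';
  content: string | HistoryBlock[];
}

// Hypothetical helper: append the assistant's full content array and the next
// user message, returning a new array so the caller's history is untouched.
function recordExchange(
  history: Turn[],
  assistantContent: HistoryBlock[],
  nextUserText: string,
): Turn[] {
  if (history[0]?.role !== 'user') {
    throw new Error('History must start with a user message');
  }
  return [
    ...history,
    { role: 'assistant', content: assistantContent }, // the array, not just the text
    { role: 'user', content: nextUserText },
  ];
}
```

Returning a fresh array is a design choice, not a requirement: it makes it easy to keep a snapshot of the conversation at each turn, at the cost of a copy.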

Streaming

For anything user-facing (a chat UI, a long response) you want to stream. Streaming sends tokens as they're generated instead of making the user wait for the whole response. It also helps avoid HTTP timeouts on large outputs, because data keeps flowing on the connection while the model generates.

const stream = client.messages.stream({
  model: 'claude-opus-4-8',
  max_tokens: 4096,
  messages: [{ role: 'user', content: 'Write a short poem about the sea.' }],
});

stream.on('text', (delta) => {
  process.stdout.write(delta); // print each chunk as it arrives
});

const finalMessage = await stream.finalMessage();
console.log('\nTokens used:', finalMessage.usage.output_tokens);

The text event hands you just the incremental string, which is the simplest way to drive a UI. When you need the complete assembled message — to read usage or stop_reason — call stream.finalMessage(). Let the SDK manage the stream lifecycle; don't hand-roll promises around the raw events.
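To make the "incremental string" concrete, here is a sketch of what sits underneath the text event: the raw stream is a sequence of typed events, and text arrives in content_block_delta events carrying a text_delta. The event shapes below are assumed and trimmed to the fields this sketch reads, and textDelta and assembleText are hypothetical helpers. This is for understanding only; in an app, the SDK's text event and finalMessage() already do this for you.

```typescript
// Assumed raw event shapes, trimmed to what this sketch reads.
type RawStreamEvent =
  | {
      type: 'content_block_delta';
      index: number;
      delta:
        | { type: 'text_delta'; text: string }
        | { type: 'input_json_delta'; partial_json: string };
    }
  | {
      type:
        | 'message_start'
        | 'content_block_start'
        | 'content_block_stop'
        | 'message_delta'
        | 'message_stop';
    };

// Hypothetical helper: the text carried by one event, or '' if it carries none.
function textDelta(ev: RawStreamEvent): string {
  if (ev.type === 'content_block_delta' && ev.delta.type === 'text_delta') {
    return ev.delta.text;
  }
  return '';
}

// Hypothetical helper: concatenate the text deltas of a finished stream.
function assembleText(events: RawStreamEvent[]): string {
  let buffer = '';
  for (const ev of events) {
    buffer += textDelta(ev);
  }
  return buffer;
}
```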

Tool use (function calling)

Tool use is how you let Claude reach beyond its training data — look up live information, query your database, call your own functions. Conceptually the flow is a loop:

1. You define tools (name, description, input schema) and include them in the request.
2. Claude responds with a tool_use block instead of (or alongside) text, specifying which tool and what arguments.
3. Your code executes the tool and sends the result back as a tool_result.
4. Claude continues, now with the tool's output in context.

The critical thing to understand: Claude never runs your code. It asks you to run something and you feed back the answer. Here's the manual loop for a weather tool:

const tools: Anthropic.Tool[] = [
  {
    name: 'get_weather',
    description: 'Get the current weather for a city.',
    input_schema: {
      type: 'object',
      properties: {
        city: { type: 'string', description: 'City name, e.g. Kathmandu' },
      },
      required: ['city'],
    },
  },
];

const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: "What's the weather in Lalitpur?" },
];

const res = await client.messages.create({
  model: 'claude-opus-4-8',
  max_tokens: 1024,
  tools,
  messages,
});

if (res.stop_reason === 'tool_use') {
  const toolUse = res.content.find((b) => b.type === 'tool_use');

  if (toolUse && toolUse.type === 'tool_use') {
    // Execute YOUR function with the arguments Claude chose
    const result = await getWeather(toolUse.input as { city: string });

    messages.push({ role: 'assistant', content: res.content });
    messages.push({
      role: 'user',
      content: [
        { type: 'tool_result', tool_use_id: toolUse.id, content: JSON.stringify(result) },
      ],
    });

    // Send back — Claude now answers using the tool result
    const final = await client.messages.create({
      model: 'claude-opus-4-8',
      max_tokens: 1024,
      tools,
      messages,
    });
    console.log(final.content);
  }
}

The tool_use_id is how Claude matches your result to the request it made — always pass it through. The TypeScript SDK also has a higher-level tool runner helper that automates this loop, but I like showing the manual version first because it makes the mechanics obvious. Write a clear, prescriptive description for each tool — it's how the model decides when to call it.
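The example above handles a single tool_use block. A response can contain several, so a slightly more general version of the "execute and reply" step might look like the sketch below. The types are local stand-ins, runToolCalls and the handler map are hypothetical, and reporting failures with is_error on the tool_result is one reasonable choice rather than the only one.

```typescript
// Local stand-ins for the block shapes (assumed, trimmed).
type ToolUseBlock = { type: 'tool_use'; id: string; name: string; input: unknown };
type ResponseBlock = ToolUseBlock | { type: 'text'; text: string };
type ToolResultBlock = {
  type: 'tool_result';
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};
type ToolHandler = (input: unknown) => unknown | Promise<unknown>;

// Hypothetical helper: run every requested tool and build one tool_result per
// call, always echoing the tool_use_id so results can be matched to requests.
async function runToolCalls(
  content: ResponseBlock[],
  handlers: Map<string, ToolHandler>,
): Promise<ToolResultBlock[]> {
  const calls = content.filter((b): b is ToolUseBlock => b.type === 'tool_use');
  return Promise.all(
    calls.map(async (call): Promise<ToolResultBlock> => {
      const handler = handlers.get(call.name);
      if (!handler) {
        return {
          type: 'tool_result',
          tool_use_id: call.id,
          content: `Unknown tool: ${call.name}`,
          is_error: true,
        };
      }
      try {
        const output = await handler(call.input);
        return { type: 'tool_result', tool_use_id: call.id, content: JSON.stringify(output) };
      } catch (err) {
        // Report the failure to the model instead of crashing the loop.
        const message = err instanceof Error ? err.message : String(err);
        return { type: 'tool_result', tool_use_id: call.id, content: message, is_error: true };
      }
    }),
  );
}
```

The returned array goes into a single user message as its content, after the assistant message that contained the tool_use blocks.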

Prompt caching

If you're sending the same large context repeatedly — a long system prompt, a big document, a fixed set of few-shot examples — prompt caching can dramatically cut cost and latency. You mark a stable prefix of the prompt, and Anthropic caches it; subsequent requests that share that exact prefix read from cache instead of reprocessing it.

The mental model is a prefix match: caching works on the exact bytes from the start of the prompt up to your cache breakpoint. Any change before the breakpoint invalidates the cache. So you put stable content (system prompt, reference docs) first, and volatile content (the user's current question) last.
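Because the match is on exact bytes, anything you serialize into the prefix has to come out identically every time. JSON.stringify follows property insertion order, so two equal objects built in different orders can produce different strings. One way to guard against that, sketched here for plain JSON data only (no Dates, Maps, or class instances), is a hypothetical stableStringify that sorts keys:

```typescript
// Hypothetical helper: JSON serialization with object keys sorted recursively,
// so equal data always yields the same string. Assumes plain JSON-like input.
function stableStringify(value: unknown): string {
  if (value === undefined) {
    return 'null'; // mirrors how JSON.stringify treats undefined inside arrays
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((key) => obj[key] !== undefined) // JSON.stringify drops these too
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value); // string, number, boolean, null
}
```

With this, `stableStringify({ b: 1, a: 2 })` and `stableStringify({ a: 2, b: 1 })` return the same string, so reference data embedded in a cached prefix does not change just because it was assembled in a different order.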

const response = await client.messages.create({
  model: 'claude-opus-4-8',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: longReferenceDocument, // large, stable content
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Summarize section 3.' }],
});

console.log(response.usage.cache_read_input_tokens); // served from cache
console.log(response.usage.cache_creation_input_tokens); // written to cache this time

Verify it's working by checking usage.cache_read_input_tokens on repeated requests. If it's stuck at zero, something in your prefix is changing between requests — a timestamp interpolated into the system prompt, a non-deterministically serialized object, a varying tool list. Cache reads are far cheaper than full input tokens, so this is one of the highest-leverage optimizations available.
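A small sketch for tracking this over time: compute the share of input tokens served from cache on each request and log it. It assumes input_tokens counts only the uncached portion, so total input is the sum of the three fields; cacheHitRatio is a hypothetical helper, and the usage type is a local stand-in.

```typescript
// Local stand-in for the usage fields read here (assumed shape).
interface UsageLike {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Hypothetical helper: fraction of all input tokens that were read from cache.
// Assumes input_tokens excludes cached tokens, so the three fields add up to
// the total input for the request.
function cacheHitRatio(usage: UsageLike): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

A first request that only writes the cache gives a ratio of 0; repeated requests with an unchanged prefix should push it toward 1. A ratio that stays at 0 points to the prefix problems described above.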

Handling errors and rate limits

Production code has to handle failure. The SDK throws typed exception classes, so check those with instanceof rather than string-matching error messages:

import Anthropic from '@anthropic-ai/sdk';

try {
  const response = await client.messages.create({
    model: 'claude-opus-4-8',
    max_tokens: 1024,
    messages: [{ role: 'user', content: 'Hello' }],
  });
} catch (error) {
  if (error instanceof Anthropic.RateLimitError) {
    // 429 — back off and retry
    console.error('Rate limited, retry after a delay');
  } else if (error instanceof Anthropic.AuthenticationError) {
    console.error('Check your API key'); // 401
  } else if (error instanceof Anthropic.BadRequestError) {
    console.error('Malformed request:', error.message); // 400
  } else if (error instanceof Anthropic.APIError) {
    console.error(`API error ${error.status}:`, error.message);
  } else {
    throw error; // not an API error; don't swallow it
  }
}

The key codes to know:

400 — bad request (malformed messages, missing required fields). Not retryable; fix the request.
401 — bad or missing API key.
429 — rate limited. Retryable with backoff. The response carries a retry-after header telling you how long to wait.
500 / 529 — server error or overloaded. Retryable with backoff.

Good news: the SDK automatically retries 429 and 5xx errors with exponential backoff (two retries by default). You can tune that with the maxRetries client option. For anything beyond the defaults (a job queue hammering the API, say), implement your own backoff and respect the retry-after header.
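If you do end up writing your own backoff, the core of it is deciding whether to retry and how long to wait. Here is a minimal sketch under a few assumptions: retry-after is given in seconds (the HTTP-date form is not handled), the base and maximum delays are arbitrary defaults of mine, and isRetryable and retryDelayMs are hypothetical helpers rather than SDK functions.

```typescript
interface BackoffOptions {
  baseMs?: number; // first-attempt ceiling (assumed default)
  maxMs?: number; // upper bound on the computed delay (assumed default)
  random?: () => number; // injectable for deterministic tests
}

// Retry rate limits and server-side failures; 400 and 401 need a fix, not a retry.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500;
}

// Delay before retry number `attempt` (0-based). A numeric retry-after header
// wins; otherwise use exponential backoff with full jitter.
function retryDelayMs(
  attempt: number,
  retryAfterHeader: string | null | undefined,
  opts: BackoffOptions = {},
): number {
  const { baseMs = 500, maxMs = 30000, random = Math.random } = opts;

  if (retryAfterHeader) {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000; // the server told us how long to wait
    }
  }

  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling); // jitter spreads out retrying clients
}
```

Jitter matters most for the job-queue case: without it, many workers that were rate limited at the same moment all retry at the same moment too.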

Practical tips

A few things I've learned the hard way:

Don't lowball max_tokens. If the response hits the cap, it's truncated mid-thought, and you have to check for stop_reason === 'max_tokens' and retry with a higher limit. Give it room.
Stream long responses. It improves perceived latency and avoids timeouts.
Keep the system prompt stable so prompt caching actually hits.
Always parse tool inputs as structured data — never raw-string-match the serialized JSON.
Log usage and request_id. Token counts drive your cost model, and the request ID is what you give Anthropic support to trace an issue.
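For the first tip, the retry decision can be a tiny pure function. This is a sketch only: nextMaxTokens is a hypothetical helper, doubling is an arbitrary growth policy, and ceiling is your own budget for a single response, which you should keep within what the chosen model allows.

```typescript
// Hypothetical helper: given how the last request ended, return the max_tokens
// to retry with, or null if no retry is warranted.
function nextMaxTokens(
  stopReason: string | null,
  current: number,
  ceiling: number,
): number | null {
  if (stopReason !== 'max_tokens') {
    return null; // the response was not truncated by the cap
  }
  if (current >= ceiling) {
    return null; // already at our own limit; surface the truncation instead
  }
  return Math.min(current * 2, ceiling);
}
```

A truncated response at 1024 tokens would be retried at 2048, and so on up to the ceiling, after which the caller has to decide how to handle an incomplete answer.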

Final thoughts

The Claude API rewards a simple mental model: it's one stateless endpoint that takes a list of messages and returns the next one. System prompts, multi-turn history, streaming, tool use, and caching are all variations on that one call. Start with a basic request, add streaming for your UI, reach for tools when you need live data or actions, and layer in caching once you have a stable prefix worth caching.

Get the error handling right from day one — typed exceptions, sensible max_tokens, and respect for rate limits — and you'll have an integration that holds up under real traffic instead of one that works in the demo and falls apart in production.
