Guide to Understanding, Building, and Optimizing API-Calling Agents

The role of Artificial Intelligence in technology companies is rapidly evolving: AI use cases have shifted from passive information processing to proactive agents capable of executing tasks. According to a March 2025 survey on global AI adoption conducted by Georgian and NewtonX, 91% of technical executives at growth-stage and enterprise companies report that they are using or planning to use agentic AI.

API-calling agents are a primary example of this shift. These agents leverage Large Language Models (LLMs) to interact with software systems via their Application Programming Interfaces (APIs).

For example, by translating natural language commands into precise API calls, agents can retrieve real-time data, automate routine tasks, or even control other software systems. This capability transforms AI agents into useful intermediaries between human intent and software functionality.

Companies are currently using API-calling agents in various domains including:

  • Consumer Applications: Assistants like Apple’s Siri or Amazon’s Alexa have been designed to simplify daily tasks, such as controlling smart home devices and making reservations.
  • Enterprise Workflows: Enterprises have deployed API agents to automate repetitive tasks like retrieving data from CRMs, generating reports, or consolidating information from internal systems.
  • Data Retrieval and Analysis: Enterprises are using API agents to simplify access to proprietary datasets, subscription-based resources, and public APIs in order to generate insights.

In this article I will use an engineering-centric approach to understanding, building, and optimizing API-calling agents. The material in this article is based in part on the practical research and development conducted by Georgian’s AI Lab. The motivating question for much of the AI Lab’s research in the area of API-calling agents has been: “If an organization has an API, what is the most effective way to build an agent that can interface with that API using natural language?”

I will explain how API-calling agents work and how to successfully architect and engineer these agents for performance. Finally, I will provide a systematic workflow that engineering teams can use to implement API-calling agents.

I. Key Definitions:

  • API (Application Programming Interface): A set of rules and protocols enabling different software applications to communicate and exchange information.
  • Agent: An AI system designed to perceive its environment, make decisions, and take actions to achieve specific goals.
  • API-Calling Agent: A specialized AI agent that translates natural language instructions into precise API calls.
  • Code Generating Agent: An AI system that assists in software development by writing, modifying, and debugging code. While related, my focus here is primarily on agents that call APIs, though AI can also help build these agents.
  • MCP (Model Context Protocol): A protocol, notably developed by Anthropic, defining how LLMs can connect to and utilize external tools and data sources.

II. Core Task: Translating Natural Language into API Actions

The fundamental function of an API-calling agent is to interpret a user’s natural language request and convert it into one or more precise API calls. This process typically involves:

  1. Intent Recognition: Understanding the user’s goal, even if expressed ambiguously.
  2. Tool Selection: Identifying the appropriate API endpoint(s)—or “tools”—from a set of available options that can fulfill the intent.
  3. Parameter Extraction: Identifying and extracting the necessary parameters for the selected API call(s) from the user’s query.
  4. Execution and Response Generation: Making the API call(s), receiving the response(s), and then synthesizing this information into a coherent answer or performing a subsequent action.

Consider a request like, “Hey Siri, what’s the weather like today?” The agent must identify the need to call a weather API, determine the user’s current location (or allow specification of a location), and then formulate the API call to retrieve the weather information.

For the request “Hey Siri, what’s the weather like today?”, a sample API call might look like:

GET /v1/weather?location=New%20York&units=metric
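
To make these four steps concrete, below is a minimal Python sketch of how they might be wired together. The weather tool, the call_llm helper, and the API endpoint are hypothetical placeholders rather than any specific vendor's SDK; in practice the "Steps 1–3" portion would be handled by your LLM provider's tool- or function-calling interface.

import json
import requests

# Hypothetical tool catalog the LLM can choose from (Step 2: tool selection).
TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "location": "string, required - city or place name",
            "units": "string, optional - 'metric' or 'imperial'",
        },
    }
]

def call_llm(query: str, tools: list, context: list | None = None) -> dict:
    """Hypothetical helper covering Steps 1-3: the LLM reads the query (plus any
    prior context) and the tool catalog, then returns something like
    {"tool": "get_weather", "arguments": {"location": "New York", "units": "metric"}}."""
    raise NotImplementedError("Replace with your LLM provider's tool-calling API.")

def execute_tool(tool_call: dict) -> dict:
    # Step 4: translate the selected tool into an actual HTTP request.
    if tool_call["tool"] == "get_weather":
        resp = requests.get(
            "https://api.example.com/v1/weather",  # placeholder endpoint
            params=tool_call["arguments"],
            timeout=10,
        )
        return resp.json()
    raise ValueError(f"Unknown tool: {tool_call['tool']}")

tool_call = call_llm("What's the weather like today?", TOOLS)  # Steps 1-3
result = execute_tool(tool_call)                               # Step 4
print(json.dumps(result, indent=2))  # the agent would synthesize this into an answer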

Several high-level challenges are inherent in this translation process, including the ambiguity of natural language and the need for the agent to maintain context across multi-step interactions.

For example, the agent must often “remember” previous parts of a conversation or earlier API call results to inform current actions. Context loss is a common failure mode if not explicitly managed.
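
One lightweight way to manage this is to keep a running history of user turns and tool results and pass it back to the model on every step. The sketch below reuses the hypothetical call_llm and execute_tool helpers from the earlier example and simply adds the bookkeeping.

# Running conversational context: prior turns and tool results are appended
# and passed back to the model so later steps can reference earlier ones.
history: list[dict] = []

def run_turn(user_query: str) -> dict:
    history.append({"role": "user", "content": user_query})
    tool_call = call_llm(user_query, TOOLS, context=history)  # prior turns inform Steps 1-3
    result = execute_tool(tool_call)
    history.append({"role": "tool", "name": tool_call["tool"], "content": result})
    return result

run_turn("What's the weather in New York?")
run_turn("And what about tomorrow?")  # only answerable if the location was remembered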

III. Architecting the Solution: Key Components and Protocols

Building effective API-calling agents requires a structured architectural approach.

1. Defining “Tools” for the Agent

For an LLM to use an API, that API’s capabilities must be described to it in a way it can understand. Each API endpoint or function is often represented as a “tool.” A robust tool definition includes:

  • A clear, natural language description of the tool’s purpose and functionality.
  • A precise specification of its input parameters (name, type, whether it’s required or optional, and a description).
  • A description of the output or data the tool returns.
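
For instance, a single tool definition covering these three elements might look like the following. The structure is purely illustrative; the exact schema you need depends on the model provider or protocol you target.

get_weather_tool = {
    "name": "get_weather",
    # 1. Natural language description of purpose and functionality
    "description": "Returns the current weather conditions for a given location.",
    # 2. Input parameters: name, type, required/optional, description
    "input_parameters": {
        "location": {"type": "string", "required": True,
                     "description": "City or place name, e.g. 'New York'"},
        "units": {"type": "string", "required": False,
                  "description": "'metric' or 'imperial'; defaults to 'metric'"},
    },
    # 3. Description of what the tool returns
    "output": "A JSON object with temperature, conditions, and humidity.",
}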

2. The Role of Model Context Protocol (MCP)

MCP is a critical enabler for more standardized and robust tool use by LLMs. It provides a structured format for defining how models can connect to external tools and data sources.

MCP standardization is beneficial because it allows easier integration of diverse tools and promotes reusability of tool definitions across different agents and models. A related best practice for engineering teams is to start from well-defined API specifications, such as an OpenAPI spec. Tools like Stainless.ai are designed to help convert these OpenAPI specs into MCP configurations, streamlining the process of making APIs “agent-ready.”
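
As a rough illustration, an MCP server wrapping the weather endpoint used earlier might advertise the tool to clients in a form along these lines. MCP tool listings pair a name and description with a JSON Schema for inputs; the exact field names and envelope can vary by protocol revision and SDK, so treat this as a sketch.

mcp_tool_listing = {
    "tools": [
        {
            "name": "get_weather",
            "description": "Returns the current weather conditions for a given location.",
            "inputSchema": {  # JSON Schema describing the tool's arguments
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City or place name"},
                    "units": {"type": "string", "enum": ["metric", "imperial"]},
                },
                "required": ["location"],
            },
        }
    ]
}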

3. Agent Frameworks & Implementation Choices

Several frameworks can aid in building the agent itself. These include:

  • Pydantic: While not exclusively an agent framework, Pydantic is useful for defining data structures and ensuring type safety for tool inputs and outputs, which is important for reliability. Many custom agent implementations leverage Pydantic for this structural integrity (see the sketch after this list).
  • LastMile’s mcp_agent: This framework is specifically designed to work with MCP, offering a more opinionated structure that aligns with practices for building effective agents described in research from places like Anthropic.
  • Internal Framework: It’s also increasingly common to use AI code-generating agents (via tools like Cursor or Cline) to help write the boilerplate code for the agent, its tools, and the surrounding logic. The Georgian AI Lab’s experience working with companies on agentic implementations shows this approach works well for creating very minimal, custom frameworks.
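
As a small example of the Pydantic approach mentioned above, tool inputs can be declared as typed models so that malformed arguments coming back from the LLM are caught before any API call is made. This is a generic sketch using Pydantic v2 syntax, not any particular agent framework's API.

from pydantic import BaseModel, Field, ValidationError

class WeatherInput(BaseModel):
    """Arguments for the hypothetical get_weather tool used in the earlier sketches."""
    location: str = Field(..., min_length=1, description="City or place name")
    units: str = Field("metric", description="'metric' or 'imperial'")

def validated_weather_call(raw_arguments: dict) -> dict:
    try:
        args = WeatherInput(**raw_arguments)  # type and presence checks happen here
    except ValidationError as err:
        # Hand the structured error back to the agent so it can retry with corrected arguments.
        return {"error": err.errors()}
    return {"status": "ok", "params": args.model_dump()}  # then make the real API call

print(validated_weather_call({"location": "New York"}))
print(validated_weather_call({"units": "metric"}))  # missing 'location' -> validation error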

IV. Engineering for Reliability and Performance

Ensuring that an agent makes API calls reliably and performs well requires focused engineering effort. Two ways to do this are (1) dataset creation and validation and (2) prompt engineering and optimization.

1. Dataset Creation & Validation

Training (if applicable), testing, and optimizing an agent requires a high-quality dataset. This dataset should consist of representative natural language queries and their corresponding desired API call sequences or outcomes.

  • Manual Creation: Manually curating a dataset ensures high precision and relevance but can be labor-intensive.
  • Synthetic Generation: Generating data programmatically or using LLMs can scale dataset creation, but this approach presents significant challenges. The Georgian AI Lab’s research found that ensuring the correctness and realistic complexity of synthetically generated API calls and queries is very difficult. Often, generated questions were either too trivial or impossibly complex, making it hard to measure nuanced agent performance. Careful validation of synthetic data is absolutely critical.

For critical evaluation, a smaller, high-quality, manually verified dataset often provides more reliable insights than a large, noisy synthetic one.
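
Whichever route you take, each record should pair a natural language query with the exact call(s) the agent is expected to make, so evaluation can be automated with a simple comparison. The format below is illustrative only.

# A few hand-written evaluation records: query -> expected tool call(s).
evaluation_set = [
    {
        "query": "What's the weather like in New York today?",
        "expected_calls": [
            {"tool": "get_weather", "arguments": {"location": "New York", "units": "metric"}}
        ],
    },
    {
        "query": "Add 'Buy groceries' to my to-do list.",
        "expected_calls": [
            {"tool": "add_task", "arguments": {"description": "Buy groceries"}}
        ],
    },
]

def exact_match(predicted_calls: list, expected_calls: list) -> bool:
    """A strict metric: tool names and arguments must match exactly."""
    return predicted_calls == expected_calls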

2. Prompt Engineering & Optimization

The performance of an LLM-based agent is heavily influenced by the prompts used to guide its reasoning and tool selection.

  • Effective prompting involves clearly defining the agent’s task, providing descriptions of available tools, and structuring the prompt to encourage accurate parameter extraction.
  • Systematic optimization using frameworks like DSPy can significantly enhance performance. DSPy allows you to define your agent’s components (e.g., modules for thought generation, tool selection, parameter formatting) and then uses a compiler-like approach with few-shot examples from your dataset to find optimized prompts or configurations for these components.
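
A rough sketch of that optimization loop is shown below. It assumes DSPy's signature/module/optimizer pattern and reuses the evaluation records from the previous section; exact class names and arguments vary across DSPy versions, so treat it as directional rather than copy-paste ready.

import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes an LM has already been configured, e.g. dspy.settings.configure(lm=...).

class SelectToolCall(dspy.Signature):
    """Choose the correct API tool and extract its parameters for the user's query."""
    query = dspy.InputField()
    tool_catalog = dspy.InputField(desc="Descriptions of the available tools")
    tool_name = dspy.OutputField()
    arguments_json = dspy.OutputField(desc="JSON object of parameters for the chosen tool")

select_call = dspy.ChainOfThought(SelectToolCall)

# Wrap the evaluation dataset from the previous section as DSPy examples.
trainset = [
    dspy.Example(
        query=rec["query"],
        tool_catalog=str(TOOLS),  # TOOLS from the earlier pipeline sketch
        tool_name=rec["expected_calls"][0]["tool"],
        arguments_json=str(rec["expected_calls"][0]["arguments"]),
    ).with_inputs("query", "tool_catalog")
    for rec in evaluation_set
]

def call_matches(example, prediction, trace=None) -> bool:
    return (prediction.tool_name == example.tool_name
            and prediction.arguments_json == example.arguments_json)

# Compile: search for few-shot demonstrations and prompts that maximize the metric.
optimizer = BootstrapFewShot(metric=call_matches)
optimized_select_call = optimizer.compile(select_call, trainset=trainset)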

V. A Recommended Path to Effective API Agents

Developing robust API-calling AI agents is an iterative engineering discipline. Based on the findings of Georgian AI Lab’s research, outcomes may be significantly improved using a systematic workflow such as the following:

  1. Start with Clear API Definitions: Begin with well-structured OpenAPI Specifications for the APIs your agent will interact with.
  2. Standardize Tool Access: Convert your OpenAPI specs into MCP configurations. Tools like Stainless.ai can facilitate this, creating a standardized way for your agent to understand and use your APIs.
  3. Implement the Agent: Choose an appropriate framework or approach. This might involve using Pydantic for data modeling within a custom agent structure or leveraging a framework like LastMile’s mcp_agent that is built around MCP.
    • Before doing this, consider connecting the MCP server to a client like Claude Desktop or Cline and manually using that interface to get a feel for how well a generic agent can use it, how many iterations it usually takes to use the MCP correctly, and any other details that might save you time during implementation.
  4. Curate a Quality Evaluation Dataset: Manually create or meticulously validate a dataset of queries and expected API interactions. This is critical for reliable testing and optimization.
  5. Optimize Agent Prompts and Logic: Employ frameworks like DSPy to refine your agent’s prompts and internal logic, using your dataset to drive improvements in accuracy and reliability.

VI. An Illustrative Example of the Workflow

Here’s a simplified example illustrating the recommended workflow for building an API-calling agent:

Step 1: Start with Clear API Definitions

Imagine an API for managing a simple To-Do list, defined in OpenAPI:

openapi: 3.0.0
info:
  title: To-Do List API
  version: 1.0.0
paths:
  /tasks:
    post:
      summary: Add a new task
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                description:
                  type: string
      responses:
        '201':
          description: Task created successfully
    get:
      summary: Get all tasks
      responses:
        '200':
          description: List of tasks

Step 2: Standardize Tool Access

Convert the OpenAPI spec into Model Context Protocol (MCP) configurations. Using a tool like Stainless.ai, this might yield:

| Tool Name | Description | Input Parameters | Output Description |
| --- | --- | --- | --- |
| Add Task | Adds a new task to the To-Do list. | `description` (string, required): The task’s description. | Task creation confirmation. |
| Get Tasks | Retrieves all tasks from the To-Do list. | None | A list of tasks with their descriptions. |

Step 3: Implement the Agent

Using Pydantic for data modeling, create functions corresponding to the MCP tools. Then, use an LLM to interpret natural language queries and select the appropriate tool and parameters.
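
A compact sketch of this step is shown below, assuming tool selection is delegated to the LLM behind a hypothetical choose_tool helper and that the To-Do API lives at a placeholder host.

from pydantic import BaseModel, Field
import requests

API_BASE = "https://api.example.com"  # placeholder for the To-Do API host

class AddTaskInput(BaseModel):
    description: str = Field(..., min_length=1, description="The task's description")

def add_task(args: AddTaskInput) -> dict:
    """'Add Task' tool: POST /tasks."""
    resp = requests.post(f"{API_BASE}/tasks", json={"description": args.description}, timeout=10)
    return {"status_code": resp.status_code}  # 201 expected on success

def get_tasks() -> dict:
    """'Get Tasks' tool: GET /tasks."""
    resp = requests.get(f"{API_BASE}/tasks", timeout=10)
    return resp.json()

def handle_query(query: str) -> dict:
    # choose_tool is a hypothetical LLM call that returns a tool name plus arguments,
    # e.g. {"tool": "add_task", "arguments": {"description": "Buy groceries"}}.
    decision = choose_tool(query, tools=["add_task", "get_tasks"])
    if decision["tool"] == "add_task":
        return add_task(AddTaskInput(**decision["arguments"]))
    if decision["tool"] == "get_tasks":
        return get_tasks()
    raise ValueError(f"Unexpected tool: {decision['tool']}")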

Step 4: Curate a Quality Evaluation Dataset

Create a dataset:

| Query | Expected API Call | Expected Outcome |
| --- | --- | --- |
| “Add ‘Buy groceries’ to my list.” | `Add Task` with `description` = “Buy groceries” | Task creation confirmation |
| “What’s on my list?” | `Get Tasks` | List of tasks, including “Buy groceries” |

Step 5: Optimize Agent Prompts and Logic

Use DSPy to refine the prompts, focusing on clear instructions, tool selection, and parameter extraction using the curated dataset for evaluation and improvement.

By integrating these building blocks—from structured API definitions and standardized tool protocols to rigorous data practices and systematic optimization—engineering teams can build more capable, reliable, and maintainable API-calling AI agents.
