The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling - A Survey

  • The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
  • Introduction

    • The paper defines an agent as a system that uses planning, loops, reflection, and other control structures, and that leverages the model's reasoning abilities, to accomplish a task.
    • The paper focuses mostly on the distinction between single-agent and multi-agent architectures.
    • Multi-agent architectures are further subdivided into vertical and horizontal architectures.
    • Each agent has a persona, which is basically the system prompt plus the set of tools the agent has access to.
      • In addition to the instructions for the task, the persona may define a specific role such as an expert coder or a manager or a reviewer and so on.
    • Tools, of course, are external function calls that the model can request, such as editing a document or searching the web: actions that the model is not able to perform inside its own computation. (A small sketch of a persona with one tool follows at the end of this list.)
    • The paper defines single-agent architectures as those powered by a single language model that performs all tasks on its own, with no feedback from other models or agents (though there may be feedback from humans).
    • In a multi-agent setup, each agent typically has a different persona.
    • In a vertical multi-agent architecture, one agent acts as the leader and the other agents report to it. There can be multiple levels of hierarchy as well, but the main distinction is the clear division of labor between the different sub-agents.
    • In a horizontal architecture, the agents are all more or less equal and take part in a single shared discussion about the task. Messages are visible to all of the agents, and agents can volunteer to complete certain tasks or call tools.
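    • To make the persona idea concrete, here is a minimal sketch (the names are my own hypothetical illustration, not an API from the paper) of a persona as a system prompt plus a list of callable tools:

```python
# Minimal sketch of an agent "persona": a system prompt plus the tools the
# agent is allowed to call. All names here are hypothetical illustrations,
# not an API from the surveyed papers.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # takes the model's argument string, returns an observation

@dataclass
class Persona:
    role: str          # e.g. "expert coder", "manager", "reviewer"
    instructions: str  # task-specific instructions appended to the role
    tools: list[Tool] = field(default_factory=list)

    def system_prompt(self) -> str:
        tool_list = "\n".join(f"- {t.name}: {t.description}" for t in self.tools)
        return f"You are a {self.role}. {self.instructions}\nAvailable tools:\n{tool_list}"

# Example: a reviewer persona with a single (stubbed) web-search tool.
reviewer = Persona(
    role="careful technical reviewer",
    instructions="Review the draft for factual errors and unclear writing.",
    tools=[Tool("web_search", "Search the web for a query.", run=lambda q: f"results for {q!r}")],
)
print(reviewer.system_prompt())
```
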
  • Key Considerations

    • Reasoning

      • Reasoning is basically the same thing we humans do: thinking critically about a problem, understanding how it fits into the world around us, and making a decision.
      • For a model, reasoning is what allows it to go beyond its training data and learn new tasks or make decisions under new circumstances.
    • Planning

      • Planning is an application of reasoning.
      • There are five major approaches to it: task decomposition, multi-plan selection, external planner-aided planning, reflection and refinement, and memory-augmented planning. See Understanding the Planning of LLM Agents.
      • Most agents have a dedicated planning step that they run before executing any actions. There are many ways to do this; the paper particularly calls out Graph-enhanced Large Language Models in Asynchronous Plan Reasoning (AKA "Plan Like a Graph") and Tree of Thoughts as examples that allow the agent to execute multiple steps in parallel (a toy sketch of the graph idea follows this list).
        • Although my recollection of Tree of Thoughts is that it was more about trying different permutations of problem solving and not so much about planning.
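      • To make the "plan like a graph" idea concrete, here is a toy illustration of my own (not code from the Plan Like a Graph or Tree of Thoughts papers): the plan is stored as a dependency graph, and steps whose dependencies are finished run in parallel.

```python
# Toy sketch: execute a plan as a dependency graph so that independent steps
# can run in parallel. My own illustration of the general idea, not code from
# the "Plan Like a Graph" paper.
from concurrent.futures import ThreadPoolExecutor

# Each step maps to the set of steps it depends on.
plan = {
    "gather_requirements": set(),
    "draft_outline": {"gather_requirements"},
    "write_section_a": {"draft_outline"},
    "write_section_b": {"draft_outline"},   # independent of section A, so it can run in parallel
    "merge_and_review": {"write_section_a", "write_section_b"},
}

def run_step(step: str) -> str:
    # Stand-in for an LLM/agent call that executes one plan step.
    return f"done: {step}"

done: set[str] = set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(plan):
        # All steps whose dependencies are satisfied form one parallel "wave".
        ready = [s for s, deps in plan.items() if s not in done and deps <= done]
        for step, result in zip(ready, pool.map(run_step, ready)):
            print(result)
        done.update(ready)
```
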
    • Tool Calling

      • Tool calling goes hand in hand with reasoning and is what really allows the model to make effective and informed decisions.
      • Many agents use an iterative process of planning, reasoning, and tool calling, then breaking the task into further substeps with more planning, and so on (a toy version of this loop is sketched after this list).
      • But some papers point out that single agent architectures often have trouble with these long chains of subtasks.
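      • Here is a toy version of that iterative loop (plan, call a tool, observe, re-plan), with the model call stubbed out; every function and name below is hypothetical, not from the paper.

```python
# Toy sketch of the plan -> tool call -> observe -> re-plan loop described
# above. The "model" here is a stub; a real agent would call an LLM that
# returns either a tool request or a final answer.
def call_model(history: list[str]) -> dict:
    # Hypothetical stand-in: pretend the model asks for one search, then finishes.
    if not any("TOOL_RESULT" in entry for entry in history):
        return {"type": "tool_call", "tool": "search", "argument": "classical PDDL planners"}
    return {"type": "final", "answer": "Summary based on the search result."}

TOOLS = {"search": lambda query: f"top hit for {query!r}"}

history = ["TASK: summarise recent work on classical planners"]
for _ in range(5):  # cap the loop so a confused agent cannot spin forever
    step = call_model(history)
    if step["type"] == "final":
        print(step["answer"])
        break
    observation = TOOLS[step["tool"]](step["argument"])
    history.append(f"TOOL_RESULT {step['tool']}: {observation}")
```
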
  • Single Agent Architectures

    • Proper planning and self-correction are paramount here.
    • A big risk with single-agent architectures is that, because they don't have any external method of automatically correcting themselves, they may get stuck in an infinite loop, repeating the same reasoning step over and over with the same result.
    • ReAct
      • ReAct: Synergizing Reasoning and Acting in Language Models was one of the first single-agent methods designed to improve over single-step prompting. In ReAct, which stands for Reason + Act, the agent runs a cycle of thinking about the task, performing an action based on that thought, and observing the output.
      • Aside from improved reliability, one big advantage of this method over previous single-prompt methods is that the sequence of thoughts and actions is all there to see, so it's easier to figure out how the model arrived at its conclusion.
      • But ReAct is susceptible to the infinite loops mentioned above (the sketch below caps its iterations for exactly that reason).
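      • A minimal sketch of the ReAct cycle, with the model and the single tool stubbed out (an illustration of the loop, not the paper's implementation); the useful property is that the whole thought/action/observation trace is kept and easy to inspect.

```python
# Minimal sketch of a ReAct-style loop: think, act, observe, repeat.
def think_and_act(trace: str) -> tuple[str, str, str]:
    # Stand-in for one LLM call that returns a thought, an action name, and the action input.
    if "Observation" not in trace:
        return "I should look up the population of Paris.", "search", "population of Paris"
    return "I have the answer now.", "finish", "about 2.1 million"

def search(query: str) -> str:
    # Stand-in for a real search tool.
    return "Paris has roughly 2.1 million inhabitants (city proper)."

trace = "Question: What is the population of Paris?"
for _ in range(5):  # cap iterations to avoid the infinite loops mentioned above
    thought, action, argument = think_and_act(trace)
    trace += f"\nThought: {thought}\nAction: {action}[{argument}]"
    if action == "finish":
        trace += f"\nFinal answer: {argument}"
        break
    trace += f"\nObservation: {search(argument)}"
print(trace)  # the full trace shows how the agent arrived at its answer
```
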
    • RAISE
      • RAISE, as described in From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models, stands for Reasoning and Acting through Scratchpad and Examples. There's no 'I', but I guess they thought RAISE sounded better than RASPE or something.
      • It's based on the ReAct method, but adds a scratchpad for short-term storage and a dataset of similar previous examples for long-term storage (my guess at how these might fit together is sketched after these bullets).
        • NOTE How does this work?
      • One interesting problem the RAISE paper found was that agents would often exceed their defined roles, such as a sales-agent persona that ended up writing Python code. The authors also cited problems with hallucinations and difficulty understanding complex logic.
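      • My guess at how the two memories might fit together (a hedged sketch to answer my own note above, not code from the RAISE paper): retrieve the most similar stored examples, keep a scratchpad of working notes for the current conversation, and put both into the prompt.

```python
# Hedged sketch of how a RAISE-style agent *might* combine a scratchpad
# (short-term memory) with retrieved examples (long-term memory).
def retrieve_examples(query: str, example_pool: list[str], k: int = 2) -> list[str]:
    # Toy retrieval: rank stored examples by word overlap with the query.
    def overlap(example: str) -> int:
        return len(set(example.lower().split()) & set(query.lower().split()))
    return sorted(example_pool, key=overlap, reverse=True)[:k]

example_pool = [
    "User asked to reschedule a meeting; agent checked the calendar tool first.",
    "User asked for a refund; agent verified the order before replying.",
    "User asked about pricing; agent quoted the published price list.",
]

scratchpad: list[str] = []  # short-term working memory for this conversation
query = "User asks to move tomorrow's meeting to Friday."

examples = retrieve_examples(query, example_pool)
scratchpad.append("Plan: check calendar availability, then confirm with the user.")

prompt = (
    "Relevant past examples:\n" + "\n".join(examples) +
    "\n\nScratchpad:\n" + "\n".join(scratchpad) +
    f"\n\nCurrent request: {query}\nRespond as the assistant."
)
print(prompt)  # this prompt would be sent to the language model
```
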
    • Reflexion
      • Reflexion: Language Agents with Verbal Reinforcement Learning
      • Reflexion is a method in which the agent is asked to reflect on its own performance using signals such as the success state and whether the current trajectory matches the agent's desired task (a rough sketch of the loop follows these bullets).
      • NOTE look at the paper to determine more about these
      • Some limitations cited by the authors
        • Reflexion is prone to falling into non-optimal local minima
        • The agent's memory is simply stored in the model's context with a sliding window, so older but important items may be forgotten
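      • A rough sketch of a Reflexion-style loop (the attempt, evaluator, and reflection steps are stubs standing in for model calls and a success check; the names are mine, not the paper's): failed attempts produce a verbal self-reflection that is kept in a small sliding window and fed into the next attempt.

```python
# Rough sketch of a Reflexion-style loop: attempt a task, evaluate the result,
# and if it failed, store a verbal self-reflection for the next attempt.
def attempt(task: str, reflections: list[str]) -> str:
    # Stand-in for the agent's attempt, conditioned on past reflections.
    return "correct answer" if reflections else "wrong answer"

def evaluate(result: str) -> bool:
    # Stand-in for a success signal, e.g. a unit test or a heuristic check.
    return result == "correct answer"

def reflect(task: str, result: str) -> str:
    # Stand-in for asking the model what went wrong and what to do differently.
    return f"Previous attempt produced {result!r}; verify the answer before returning it."

MAX_MEMORY = 3  # sliding window: only the newest reflections are kept, older ones are forgotten
reflections: list[str] = []
for trial in range(4):
    result = attempt("solve the task", reflections)
    if evaluate(result):
        print(f"trial {trial}: success")
        break
    reflections = (reflections + [reflect("solve the task", result)])[-MAX_MEMORY:]
    print(f"trial {trial}: failed, stored a reflection")
```
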
    • AutoGPT + P
      • AutoGPT+P: Affordance-based Task Planning with Large Language Models
      • AutoGPT+P is a technique specifically designed for use in robotics. It uses computer vision to detect the objects present in a scene, and can then use four tools to try to complete its task:
        • Plan Tool
        • Partial Plan Tool
        • Suggest Alternative Tool
        • Explore Tool
      • The model also works in concert with a traditional planner using PDDL (Planning Domain Definition Language). This planner helps translate the model's instructions into actions that the robot is actually able to perform given its physical limitations (a rough sketch of this control flow follows below).
      • As with many of the above approaches, it does have some problems such as sometimes choosing the wrong tools or getting stuck in loops. And at least as described in the paper, there's no opportunity for human interaction such as the agent asking for clarification or the human interrupting if the robot starts to do something wrong.
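      • A very rough sketch of the control flow as I understand it from the summary above (every function body is a stand-in; this is not the paper's implementation): the model picks one of the four tools, and the classical PDDL planner is what turns the goal into actions the robot can actually execute.

```python
# Very rough, hypothetical sketch of an AutoGPT+P-style control flow.
def detect_objects(scene: str) -> list[str]:
    return ["cup", "table", "sponge"]  # stand-in for the computer-vision module

def classical_planner(goal: str, objects: list[str]) -> list[str]:
    # Stand-in for a PDDL planner that returns actions the robot can execute.
    return ["pick(cup)", "place(cup, table)"]

def choose_tool(goal: str, objects: list[str]) -> str:
    # Stand-in for the LLM deciding which of the four tools to use.
    return "plan" if objects else "explore"

goal = "put the cup on the table"
objects = detect_objects("kitchen scene")
tool = choose_tool(goal, objects)

if tool == "plan":
    print(classical_planner(goal, objects))
elif tool == "partial_plan":
    print("plan as far as possible with the objects seen so far")
elif tool == "suggest_alternative":
    print("propose a substitute for a missing object")
else:  # explore
    print("move around to find relevant objects")
```
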
    • LATS
      • Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
      • LATS is an algorithm based on Monte Carlo Tree Search. You can read the link for more details, but basically, it's inspired by Monte Carlo simulation, in which you do a bunch of random runs to get a better idea of the probability space and the best action to take (a toy MCTS loop is sketched after these bullets).
      • But as you can imagine, doing a bunch of random runs down a tree with language models can be very slow and expensive. Also, the paper doesn't tackle particularly complex scenarios.
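      • For orientation, here is a toy Monte Carlo Tree Search loop of the kind LATS builds on (the action sampler and the scorer are stand-ins for model calls; this illustrates plain MCTS, not the LATS codebase): select a node by UCT, expand it with candidate actions, score the result, and backpropagate the value.

```python
# Toy MCTS loop: selection, expansion, evaluation, backpropagation.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node: Node, c: float = 1.4) -> float:
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def sample_actions(state: str) -> list[str]:
    return [f"{state}+a", f"{state}+b"]  # stand-in for sampling candidate actions from the LLM

def score(state: str) -> float:
    return random.random()  # stand-in for an LLM value estimate or environment reward

root = Node("start")
for _ in range(20):  # in practice each iteration costs several model calls, hence the expense
    node = root
    while node.children:                       # selection
        node = max(node.children, key=uct)
    node.children = [Node(s, parent=node) for s in sample_actions(node.state)]  # expansion
    leaf = random.choice(node.children)
    reward = score(leaf.state)                 # evaluation
    while leaf is not None:                    # backpropagation
        leaf.visits += 1
        leaf.value += reward
        leaf = leaf.parent

best = max(root.children, key=lambda n: n.visits)
print("most-visited first action:", best.state)
```
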
  • Multi Agent Architectures

    • Common themes with multi-agent architectures
      • Leadership of agent teams
      • Dynamic creation of agent teams between stages
      • Information sharing between team members
    • Embodied LLM Agents Learn to Cooperate in Organized Teams
      • This method uses a hybrid approach that is mostly a horizontal team, but has a leader agent over the rest of the team.
      • They found that teams with a leader finished their tasks about 10% faster, and that without a leader the agents spent about half of their time giving orders to each other; with a single designated leader, the leader devotes about 60% of its messages to giving directions, and the other agents can focus more on exchanging useful information.
    • DyLAN
    • AgentVerse
      • AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
      • AgentVerse uses a four-stage process (sketched at the end of this subsection):
        • Recruitment, which uses a “recruiter” agent to generate personas for a set of agents to work on this iteration, based on the current goal state.
        • Collaborative decision-making between the agents.
          • This can be vertical or horizontal arrangement, depending on the task.
        • Independent action execution by each agent
          • Each agent uses a ReAct loop with up to 10 iterations to get to the desired output
        • Evaluation of how close the current state is to the goal.
      • This process can be repeated until the goal is reached.
      • One important finding here is that agent feedback is not always reliable.
        • Even if an agent’s feedback is not valid, the receiving agent may incorporate it anyway.
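      • A sketch of the four-stage loop described above, with every stage stubbed out (each stub stands in for one or more model calls; the personas and thresholds are my own placeholders):

```python
# Sketch of the AgentVerse-style loop: recruit personas, decide on actions,
# execute each action with a bounded ReAct-style loop, then evaluate progress.
def recruit(goal: str) -> list[str]:
    return ["planner", "coder", "reviewer"]  # stand-in for the recruiter agent proposing personas

def decide(agents: list[str], goal: str) -> dict[str, str]:
    # Stand-in for collaborative decision-making (vertical or horizontal, depending on the task).
    return {agent: f"{agent} works on '{goal}'" for agent in agents}

def execute(agent: str, action: str) -> str:
    for _ in range(10):  # each agent runs a ReAct loop, capped at 10 iterations
        pass             # think / act / observe would happen here
    return f"result from {agent}"

def evaluate(results: list[str], goal: str) -> bool:
    return len(results) >= 3  # stand-in for an evaluator agent judging closeness to the goal

goal = "write and review a data-cleaning script"
for round_number in range(5):  # repeat the whole process until the goal is reached
    agents = recruit(goal)
    assignments = decide(agents, goal)
    results = [execute(agent, action) for agent, action in assignments.items()]
    if evaluate(results, goal):
        print(f"goal reached after round {round_number}")
        break
```
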
    • MetaGPT
      • MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
      • MetaGPT focuses on using structured outputs to communicate between agents instead of plain text in order to reduce unproductive chatter and inefficiencies, such as "how are you? I'm fine".
      • It also implements a message bus, which allows agents to publish their information to a common place but subscribe only to the information that is relevant to them (a small publish/subscribe sketch follows below).
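      • A small publish/subscribe sketch of the kind of message bus this describes (the message format and topic names are my own illustration, not MetaGPT's actual schema): agents publish structured messages to a shared bus, but each agent only receives the topics it subscribed to.

```python
# Small sketch of a publish/subscribe message bus: structured messages go to a
# common place, but agents only see the topics they care about.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    topic: str     # e.g. "requirements", "design", "code_review"
    content: dict  # structured payload instead of free-form chat

class MessageBus:
    def __init__(self):
        self.subscribers: dict[str, list] = defaultdict(list)

    def subscribe(self, topic: str, inbox: list):
        self.subscribers[topic].append(inbox)

    def publish(self, message: Message):
        for inbox in self.subscribers[message.topic]:
            inbox.append(message)  # only interested agents ever see this message

bus = MessageBus()
engineer_inbox: list[Message] = []
reviewer_inbox: list[Message] = []
bus.subscribe("design", engineer_inbox)       # the engineer cares about design documents
bus.subscribe("code_review", reviewer_inbox)  # the reviewer only cares about review requests

bus.publish(Message("architect", "design", {"modules": ["parser", "storage"]}))
bus.publish(Message("engineer", "code_review", {"diff": "add parser module"}))

print(len(engineer_inbox), "message(s) for the engineer")  # 1: the design document
print(len(reviewer_inbox), "message(s) for the reviewer")  # 1: the review request
```
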
  • Discussion

    • Single-agent patterns tend to work best with a narrowly defined list of tools and well-defined processes. They're also easier to implement because there is only one agent and one set of tools, and they don't face the limitations of multi-agent systems, like poor feedback from other agents or unrelated chatter from other team members. But they are more likely to get stuck in loops and fail to make progress if they find themselves in a situation that does not match their reasoning strengths.
    • Multi-agent architectures work best when feedback from different personas helps to accomplish the task, such as drafting a document and then reviewing or proofreading it. They're also useful for performing parallel execution when there are distinct independent subtasks. Multi-agent architecture is particularly advantageous when no examples of the task have been provided.
    • Feedback can be very helpful, but it's not a panacea. The AgentVerse paper notes a case where an agent gave invalid feedback to another agent, but it was still incorporated. Similarly, human feedback may conflict with the desired behavior of the agent, but because the language models tend to be willing to please, they may incorporate it anyway.
    • Information Sharing

      • Information sharing in a horizontal multi-agent system is very useful, but also has issues. For example, agents can too closely simulate a human when assigned a persona and start asking the other agents small-talk questions such as "how are you?" Agents may also be exposed to information that is irrelevant to their particular task, so systems that allow subscribing or filtering incoming information can be helpful for keeping an agent on task.
      • Vertical architectures tend to not have as many of these issues, but can encounter problems when the managing agent does not send enough information to its team for them to do the job. The paper recommends using prompting techniques to help with this.
    • Careful design of the system prompt for the persona can help to keep an agent on task and reduce the amount of unnecessary chatter between agents.
    • Dynamic team creation, where agents are brought in and out of the system between stages, can be a big help because it keeps irrelevant agents from adding noise at a particular stage of the problem.
  • Limitations

    • Evaluating agents is difficult and there are not very many good standard benchmarks.
    • Many papers introduce their own benchmarks alongside a new agent system, which makes it difficult to compare agent systems beyond those tested in that particular paper.
    • Many agent evals are complex and require manual scoring, which can be tedious, limits the size of the evaluation set, and adds the possibility of evaluator bias. The complexity of agents also leads to a lot more variation in their outputs, so it's more difficult to properly determine if an agent's answer is correct or not.
    • As with language model evaluations, data set contamination is a problem, where the tasks that the agents are trying to work on can be found in their training data.
    • Many standard benchmarks designed for language model testing, such as MMLU and GSM8K, are not applicable to agents because they don't really exercise an agent's ability to reason beyond what you would find in a single call to a language model.
    • Some agent eval systems use simpler answers such as yes or no, which are easier to evaluate, but this limits the real world applicability of the eval, where most tasks require more complex answers. More complex benchmarks that use logic puzzles or video games come closer, but even in those cases it's questionable how much it translates to the real world, where tasks are less well-defined and data is dirtier.
    • The paper mentions WildBench and SWE-bench as a couple of benchmarks that use real-world data, though WildBench doesn't seem to be designed for agent testing.

Thanks for reading! If you have any questions or comments, please send me a note on Twitter.