
Two years ago, most people didn’t even know how to define AI agents.
Dozens of tech articles came out between 2023 and 2024 trying to explain what an agent was and which types of AI tools qualified as agents. The term itself was vague. Depending on who you asked, it could mean anything from a chatbot that could trigger actions to a workflow chaining together a few API calls. There were debates about whether an agent needed planning, autonomy, tool use, or some combination of the three. In many ways, the industry was still trying to figure out what the category actually meant.

Today the conversation has moved much further. Agents are no longer something people are trying to define. They're something many teams are actively building and companies are now trying to deploy in real workflows.
At first glance, many agent demos look similar. An agent takes a goal, figures out the steps required, and executes them using various tools. It might search the web, call APIs, retrieve documents, or run code. These capabilities are impressive, but they share an important characteristic: the environment is controlled.
When an agent interacts with software systems, the rules are relatively predictable. APIs return structured responses. Inputs and outputs have defined formats. If something fails, the system can detect the error and retry the step. The agent is essentially operating inside a structured sandbox where the boundaries are known.
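The detect-and-retry behavior that makes this sandbox forgiving can be sketched in a few lines. This is an illustrative example, not any particular framework's API; the `flaky_api` function and its failure pattern are invented for the demonstration.

```python
import time

def call_with_retry(step, max_attempts=3, base_delay=0.01):
    """Execute one tool step; on a known failure, back off and retry.

    In a structured sandbox the failure modes are enumerable, so a
    simple retry loop with exponential backoff is enough to recover.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))

# A hypothetical flaky API call that fails twice, then succeeds.
calls = {"n": 0}

def flaky_api():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"status": "ok"}

result = call_with_retry(flaky_api)  # succeeds on the third attempt
```

Nothing like this works once the "failure" is a human who replied ambiguously or not at all, which is the contrast the rest of this piece explores.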
But much of the work people actually do every day does not happen inside structured systems. It happens through communication.
Meetings are coordinated through email. Decisions are negotiated in Slack threads. Details are clarified through text messages. Work unfolds through back-and-forth conversations that stretch across hours, days, or even weeks. When an agent enters this world, it is no longer interacting with predictable software interfaces. It is interacting with humans.
That change introduces an entirely different level of complexity.
Human communication is inherently messy. People respond late, forget details, change their minds, and introduce new information halfway through a conversation. Messages are often ambiguous or incomplete. Someone might reply to a thread while another person starts a separate conversation about the same task. Participants may be added after the discussion has already begun. Despite all of this, the work still needs to move forward.
For an agent to function effectively in this environment, it has to do far more than execute a predefined workflow. It must maintain context across long conversations, interpret intent from imperfect messages, track constraints and preferences, and update its understanding of the task as new information arrives. In many cases, it also needs to coordinate between multiple people while communicating clearly and politely.
At that point the system starts to resemble something closer to a human assistant than traditional automation software.
Another challenge is that communication environments are fundamentally open-ended. When an agent interacts with an API, it typically knows the possible inputs and outputs in advance. The system can anticipate what types of responses it might receive.
In contrast, when an agent participates in email or chat, there is no fixed script. Someone might suddenly introduce a new constraint. Another participant might suggest a completely different plan. A meeting that seemed settled might need to be rescheduled. A participant might respond with partial information or ask a clarifying question that shifts the direction of the conversation.
The agent has to interpret that message, update the state of the task, and continue the interaction in a way that feels coherent to everyone involved.
This creates a hybrid problem that combines reasoning, state management, and human communication. Solving it reliably is significantly harder than simply chaining together a series of tool calls.
One of the hardest aspects of this type of work is that it unfolds asynchronously. Conversations rarely happen in a single interaction. Instead they stretch across long periods of time, with pauses, interruptions, and changes along the way.
Someone might not respond for two days and then suddenly reply with a scheduling change. Another participant might start a new thread referencing the same task. A proposed time might later conflict with another meeting, forcing the conversation to restart. A message might arrive in Slack even though the conversation originally started in email.
For an agent, each of these situations requires reconnecting the message to the correct underlying task. The system has to remember what it was trying to accomplish, understand the current state of that effort, and continue the interaction from the appropriate point.
Most agent systems today implicitly assume short-lived interactions: a request arrives, the agent performs a series of steps, and the task is completed. Communication-driven work rarely behaves that way. It is persistent, evolving, and often fragmented across multiple conversations.
Handling this reliably requires a much deeper concept of memory and task state. The agent needs to track what it is trying to accomplish, what information has already been gathered, which constraints have been introduced, and which participants are involved. When a new message arrives—whether minutes or days later—the system has to pick up the task exactly where it left off.
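One way to picture that task state is as a persistent record that any incoming message gets folded into, no matter when it arrives. The sketch below is a minimal illustration under assumed field names (`goal`, `participants`, `constraints`, `gathered`); a real system would also need durable storage and logic for matching a message to the right task.

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    """Persistent state for one long-running coordination task."""
    goal: str
    participants: set[str] = field(default_factory=set)
    constraints: list[str] = field(default_factory=list)
    gathered: dict[str, str] = field(default_factory=dict)
    status: str = "open"

    def apply_message(self, sender, constraint=None, info=None):
        """Fold a new message into the task, whether it arrives
        minutes or days after the last one."""
        self.participants.add(sender)
        if constraint:
            self.constraints.append(constraint)
        if info:
            self.gathered.update(info)

# A scheduling task resumes exactly where it left off as replies trickle in.
task = TaskState(goal="schedule kickoff meeting")
task.apply_message("alice@example.com", info={"alice_tz": "PST"})
task.apply_message("bob@example.com", constraint="no meetings Friday")
```

The point of the structure is that the conversation can be fragmented across threads and channels while the task itself remains a single, continuously updated object.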
Reliability also becomes far more important once humans are involved. When an agent operates purely behind the scenes, occasional mistakes are often tolerable. If an internal automation fails, the system can retry the operation.
But when an agent sends a message to a person, that message becomes part of a social interaction.
A confusing or poorly phrased response can create friction, erode trust, or make the agent appear incompetent. Users lose confidence quickly if the system behaves unpredictably or fails to follow the context of a conversation. As a result, agents that operate in communication environments have to meet a much higher bar for clarity, consistency, and contextual awareness.
Despite these challenges, agents that can operate inside real conversations may ultimately represent one of the most valuable categories of AI software. A significant portion of organizational work consists of coordination: scheduling meetings, aligning on plans, following up on decisions, and managing the logistics of collaboration between people.
These problems are not primarily computational. They are coordination problems, and coordination happens through communication.
If agents are going to become true digital workers, they will need to participate directly in these environments. They will need to maintain context across conversations, coordinate between people, and help groups move work forward over time.
Building systems like this is far more difficult than building agents that operate inside structured software tools. But that difficulty is also a signal. The areas that are hardest to automate often correspond to the areas where the most value can be created.
This is part of what makes building communication agents so interesting. At Skej, we spend much of our time thinking about exactly these problems—how an agent can participate naturally in email and chat conversations, keep track of long-running tasks, and pick things back up days later when someone finally replies.
As the technology continues to evolve, the most impactful agents will not simply run tools in the background. They will communicate, coordinate, and adapt as situations change.
In many ways, the future of agents may look less like traditional automation and more like something familiar: a capable assistant helping people navigate the complexity of getting things done together.
