When Skills Aren't Enough

May 28, 2026·
Dong Liang
· 13 min read

When Skills Aren’t Enough: testing dynamic workflow

Skills have been a game changer for AI applications, but they are far from solving all our problems. Write a long skill that contains multiple steps and watch it run a few times; I can guarantee that somewhere a quiet failure will surface. The first run gets lucky: it follows your steps exactly. By the third, it has begun to drift: skipping a step here, reordering two there, quietly dropping the one that felt like busywork. You wrote the steps down and the machine read them, and yet somewhere between the reading and the doing, the steps stopped binding.

I have a skill like this. I call it rick, and it is a reader-impersonation checklist. The idea is simple: instead of asking an AI whether a piece of writing is any good, I have it read the text in particular roles, then sort what those roles said, and finally decide which complaints point at a real problem and which are merely one reader wishing for a different book.

The roles are the heart of it. Rick calls them casts, and one reads like this:

You are a thirty-four-year-old project manager at a mid-sized healthcare company. You’ve been using ChatGPT for about two years, drafting emails, summarizing meetings, explaining to your kids why the sky is blue. You’ve never thought systematically about writing. You write because your job requires it, not because you love it. You bought this book because a podcast host you trust mentioned it, and because something about AI and language has been bothering you for months and you can’t quite name what.

I then hand that person the chapter along with a short list of questions. Where did you start to trust the writer? Where did you stop? Which passages did the real work? Where did you skim, and why? What would have made you put the book down? The cast answers in character, and the answers are useful precisely because they come from inside a head rather than off a rubric.

So the checklist works. And yet my skill.md opens with the following:

A workflow, not a set of guidelines. Three phases. Named inputs and outputs at each. Do them in order. Do not skip phases. Do not reorder.

And it ends, two hundred lines later, in a section called “Failure modes the workflow exists to prevent”:

  • Pattern-matching mode. Scanning casts for known problem-shapes and aggregating matches without checking the page.
  • Severity inflation. Copyedits described in the language of structural findings.
  • Throwing away praise. Forwarding only negatives.
  • Mixing aggregation with verification. Reaching for the chapter while still sorting cast feedback.

This is not documentation. It is a person who has run the thing often enough to know exactly how it breaks, writing the breakages down and asking the machine, please, not to commit them. “Do not skip phases” is the sort of sentence you write only when you have no way to actually stop the skipping. The whole vocabulary of the file — non-negotiable, do them in order, do not break character — is the sound of a process that can do nothing but beg.

That is the pain, and it runs deeper than one floor. At the top: a checklist is a set of instructions, and instructions can be ignored. Beneath it sit three more. Long runs pile up in the machine’s memory until its judgment goes soft; a run that dies halfway dies all the way back to the beginning; and you can never quite promise a friend that the process they run is the same one you built.

Introducing dynamic workflows, a feature Anthropic released today. (I noticed it in preview under the Max plan for a few days, where it was referred to as ultrawork; now it is simply workflow.)

The one-line version is that they move the process from something the AI follows to something a script executes. What follows is the detailed breakdown.

What separates a workflow from a skill

The two differ on a single question: who holds the plan?

A skill is a set of instructions the AI follows. It reads them, decides each next step as it goes, and deposits every result back into its own memory. A workflow is a script that a runtime executes, and the difference that matters is that the loop, the branching, and the half-finished results all live in the script rather than in the AI’s head.

When rick runs, the AI is the boss: it spawns the casts, reads their full replies back into its own memory, sorts them there, and decides there which findings deserve a second look.

A workflow moves the boss into code. That single move, from the AI holding the plan to the script holding it, is the root of all five gains below.

1. Determinism: the logic lives in code, not in compliance

Consider the gate at the top of rick’s second phase:

Stay in the cast space. Do not open the chapter file in this phase.

The instruction is there for a reason. The second phase is where the AI sorts what the readers said, and it has to do that sorting before it tests any of it against the page; if it peeks at the chapter too early, it stops listening to the readers and starts defending its own prose. So the file says it twice, in two different ways: not yet.

As things stand, that rule is a matter of honor. It holds only as far as the AI chooses to honor it, and choosing to honor a rule is precisely what a language model does worst once the conversation has grown long and the chapter file is sitting one tool call away.

A workflow turns honor into architecture. The sorting stage becomes an agent that is simply never handed the path to the chapter, so it cannot open the page for the plain reason that it has no page to open. The gate stops being a sentence someone has to remember and becomes a door that was never built. The phase order follows the same logic: “do them in order, do not reorder” is no longer a plea but the literal shape of the script, which has no will to weaken.

A sign on the grass asks. A fence decides.
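A minimal sketch of that fence in plain Python. This is not the actual workflow API: `AgentTask`, `run_agent`, and the three phase functions are hypothetical names, and the agent call is a stub. The only point is that the sorting function has no chapter parameter, so there is no path for it to pass along.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTask:
    prompt: str
    files: tuple[str, ...] = ()  # the only files this agent is handed


CALLS: list[AgentTask] = []  # record of what each agent received


def run_agent(task: AgentTask) -> str:
    # Stub standing in for a real agent call (hypothetical).
    CALLS.append(task)
    return f"reply to a {len(task.prompt)}-character prompt"


def read_phase(chapter_path: str, casts: list[str]) -> list[str]:
    # Phase 1: each cast reads the chapter in character.
    return [run_agent(AgentTask(prompt=c, files=(chapter_path,))) for c in casts]


def sort_phase(cast_replies: list[str]) -> str:
    # Phase 2: no chapter_path parameter exists here, so none can leak through.
    prompt = "Sort these reader replies:\n" + "\n".join(cast_replies)
    return run_agent(AgentTask(prompt=prompt))


def verify_phase(chapter_path: str, sorted_map: str) -> str:
    # Phase 3: only now does an agent get the page again.
    prompt = f"Check these findings against the page:\n{sorted_map}"
    return run_agent(AgentTask(prompt=prompt, files=(chapter_path,)))


def workflow(chapter_path: str, casts: list[str]) -> str:
    # The phase order is simply the order of these three lines.
    replies = read_phase(chapter_path, casts)
    sorted_map = sort_phase(replies)
    return verify_phase(chapter_path, sorted_map)


report = workflow("chapter.md", ["project manager", "editor"])
```

After the run, `CALLS[2]` is the sorting agent’s task, and its `files` tuple is empty: the rule is enforced by what the function is given, not by what the prompt asks.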

The same shift buys a second kind of reliability, one scale up. A skill is a markdown file the model reads afresh every time, and reads a little differently each time, so the process I run on Tuesday is never quite the one I run on Friday, nor the one a collaborator runs from the same file. A workflow is the orchestration itself, fixed: the same steps in the same order, run after run and person after person, because the sequence no longer passes through a mind that might read it otherwise. The first reliability is that a rule cannot be broken inside a run; the second, that the run is the same run every time. Both refuse to depend on the model being intelligent or obedient, and that is why they hold.

2. Resumability: a dead run picks up where it fell

Rick fans out to a handful of agents, then fans out again to chase down each finding, and a process with that many moving parts may break somewhere in the middle: a crash, a closed laptop, a network hiccup three phases deep.

When a skill is driving those agents, an interruption throws away the entire turn. If rick dies after the casts have finished reading but before the sorting begins, the readings are simply gone; the AI spawns all the readers again, and they read the whole chapter from the top. The work was real and the tokens were spent, but nothing survives.

A workflow’s runtime, by contrast, remembers each agent’s result the moment it lands. Stop the run and start it again, and the agents that already finished return what they found while only the unfinished work runs live. For three casts that is a small mercy. But point a dozen readers at an entire manuscript and let the verification fan out behind them, and it becomes the difference between losing a morning and losing a minute. The longer the run, the more it saves, which is the exact inverse of the skill, where the longer the run, the more you stand to lose.
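How the real runtime stores results is not something I have seen documented, so the following is only a sketch of the general idea under my own assumptions: a hypothetical `run_step` helper that saves each result to a JSON file as soon as it exists and replays it on the next run.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_step(cache_path: Path, step_id: str, work: Callable[[], str]) -> str:
    """Return the saved result for step_id, or run the step and save it at once."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if step_id in cache:
        return cache[step_id]  # finished in an earlier run: replay it
    result = work()  # unfinished: run it live
    cache[step_id] = result
    cache_path.write_text(json.dumps(cache))  # persist the moment it lands
    return result


calls: list[str] = []


def cast_reading() -> str:
    # Stub for an expensive agent call; counts how often it really runs.
    calls.append("ran")
    return "reader reply"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "run.json"
    first = run_step(cache_file, "cast-1", cast_reading)   # runs live
    second = run_step(cache_file, "cast-1", cast_reading)  # replayed from disk
```

The second call returns the same reply without running the stub again, which is the whole of the saving: the longer and wider the run, the more steps fall on the replay branch after an interruption.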

3. Context performance: a clean head for the hard part

This is the irony worth sitting with, because it shows the split between worker and boss at its sharpest.

A big challenge for AI applications is the fixed context window. Context rot will invariably arrive: the slow softening of judgment as a conversation fills up, with the halfway mark as the point at which you should start to worry. Rick exists to guard judgment. Run as a skill, it quietly rots its own.

Trace how it happens. The cast readings are sealed, so the chapter is never read into the AI’s memory three times over; so far, so good. But each cast returns its full reply to the boss, the sorting happens in the boss’s memory, and every later verdict happens there too. By the time the AI reaches its most delicate calls — real problem or mere wish, five-word fix or structural tear-out — its memory is carrying three full reader monologues, a sorted map, and a growing stack of findings. The hardest thinking happens in the most crowded room.

A workflow keeps those intermediate results in the script’s own variables rather than in the AI’s memory. The readings live in a variable, the sorted map lives in a variable, and a fresh agent reads them only when the time comes, so that only the finished report ever reaches your conversation. The plan stops fouling the very judgment it exists to protect. The official description is blunt about it: the AI’s memory holds only the final answer. For a process whose entire purpose is to keep a single judgment sharp, performing that judgment in a clean room rather than a crowded one is not a refinement. It is the whole game.

4. Verification: skepticism you can wire in

This is the gain I care about most, because it changes the work itself and not merely the machinery around it.

Rick guards against invented problems with two moves, and both are written as instructions. The first is the removal test, for when a reader’s complaint is only vaguely aimed:

cut each candidate sentence one at a time and identify whose removal most improves the passage. That cut names the offender.

The second is the phantom verdict, the last fate a finding can meet:

Phantom finding — if the page does not actually support the finding (the sentence the cast describes isn’t there, or isn’t doing what they said), drop it.

Both are skeptical second readings, and both are exactly the slow, contrary checking that a tired mind skips first. As a skill, rick can only ask the AI to play its own skeptic, which is a little like asking a writer to catch their own typos an hour after setting down the last sentence.

A workflow can make the skeptic a separate person altogether. One agent locates a problem and names its size and its fix; a second agent, who never saw the first one’s verdict, reads the quoted passage cold and answers a single question: is this really on the page, and is it doing what the finding claims? The finding survives only if it survives that stranger. The phantom dies not because the AI happened to remember to look, but because looking has been wired into the line.
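A sketch of that wiring, with everything in it assumed for illustration: `Finding`, `verifier_view`, and `verify` are hypothetical names, and the verifier is a stub that only checks whether the quoted passage exists on the page. A real second agent would also judge whether the passage does what the finding claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    quote: str     # the passage the finding points at
    claim: str     # what the finding says the passage is doing
    severity: str  # the first agent's verdict, hidden from the verifier
    fix: str       # the first agent's proposed fix, also hidden


def verifier_view(finding: Finding) -> dict[str, str]:
    # The second agent is handed the quote and the claim, never the verdict.
    return {"quote": finding.quote, "claim": finding.claim}


def verify(view: dict[str, str], chapter: str) -> bool:
    # Stub for the second agent: is the quoted passage on the page at all?
    return view["quote"] in chapter


def drop_phantoms(findings: list[Finding], chapter: str) -> list[Finding]:
    # A finding survives only if the independent check confirms it.
    return [f for f in findings if verify(verifier_view(f), chapter)]


chapter = "The sky is blue because air scatters short wavelengths."
findings = [
    Finding("air scatters short wavelengths", "too technical", "copyedit", "add a gloss"),
    Finding("the sky is green", "factual error", "structural", "rewrite"),
]
kept = drop_phantoms(findings, chapter)
```

The second finding quotes a sentence that is not in the chapter, so it is dropped no matter how severe the first agent rated it. The check runs on every finding because it sits in the pipeline, not in a paragraph of instructions.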

None of this is hypothetical. Claude Code’s own /deep-research already works this way with facts: it fans across sources, votes on each claim, and returns the report with the claims that lost the vote cut out. The shape generalizes readily, whether as independent agents grading one another’s findings before any of it reaches you, or as a plan drafted from several angles and weighed against itself. The careful reader that rick keeps hoping the AI will be becomes an agent who is careful whether or not anyone is in the mood. You cannot write that into a paragraph of instruction; you can only build it.

5. Monitorability: a run you can watch

Two smaller gains share a single root.

While rick runs as a skill, your conversation waits on it: the AI is busy playing casts, and you sit there watching it work. A workflow runs in the background instead. The runtime executes the script off to one side while your session stays free, so that you can keep working as the agents grind through the phases, and you are no longer babysitting the plan.

And because the plan is now code, you get a record rather than a transcript. Type /workflows and a panel opens with every phase laid out: its agent count, its token total, its elapsed time. You can drill into any phase, and then into any single agent, to read what it was asked and what it found. Set that beside the skill version, where you reconstruct the run by scrolling back through a wall of chat and straining to recall which reader said what. The run becomes something you can read as a run.

Two further knobs are worth knowing. The script can route different stages to different models, sending the cast readings to a cheaper one and reserving the strongest for the hard judgment, which a skill cannot do cleanly. And the runtime keeps walls around the whole thing: sixteen agents at once, a thousand to a run, so that a fan-out can never quietly turn into a runaway.
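I do not know how the runtime enforces those walls internally. The sketch below shows one common way to get the same effect in Python, using a semaphore for the concurrency limit and a plain length check for the per-run limit. The two numbers come from the text above; `run_all` and the stub agent are hypothetical.

```python
import asyncio

MAX_CONCURRENT = 16  # "sixteen agents at once"
MAX_PER_RUN = 1000   # "a thousand to a run"


async def run_all(prompts: list[str]) -> tuple[list[str], int]:
    """Run one stub agent per prompt; return the results and the peak concurrency."""
    if len(prompts) > MAX_PER_RUN:
        raise ValueError("fan-out exceeds the per-run cap")
    gate = asyncio.Semaphore(MAX_CONCURRENT)
    running = 0
    peak = 0

    async def one(prompt: str) -> str:
        nonlocal running, peak
        async with gate:  # waits here while 16 agents are already running
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # stub for the agent's actual work
            running -= 1
            return f"done: {prompt}"

    results = await asyncio.gather(*(one(p) for p in prompts))
    return list(results), peak


results, peak = asyncio.run(run_all([f"finding {i}" for i in range(40)]))
```

Forty tasks are requested, but the measured peak never exceeds sixteen, and a request for more than a thousand fails before any agent starts.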

What about token counts?

I gave the workflow a single test run, and it cost about 1.09 million tokens across thirty-four agents in a little over six minutes. That is a lot for what the job actually was: read one chapter, gather three readers’ reactions, sort them, check them against the page. When I looked at where the tokens went, almost all of them sat in one phase. Judging the findings and then verifying them came to eighty-two percent of the run. The cause was crude. I had given each finding its own agent and each check its own agent, so thirteen findings became twenty-six separate agents, and every one of them opened the chapter, read its three thousand words from scratch, handled a single complaint, and quit. In one run the chapter was read twenty-nine times.

I caught that quickly, and the fix was simple. But the part worth saying is that it was not bad luck. When you let Claude build its own workflow, this is the kind of mistake it reaches for. Faced with a list, it fans out, one agent per item, because that is the obvious shape, and the obvious shape reads as efficient when it is the opposite. The cost of an agent is mostly the cost of starting one at all, so multiplying agents multiplies the overhead whether or not each one has much to do.

The recovery, though, is real, and it is in your hands. The run leaves logs, agent by agent, with the token counts attached. You can read them, see exactly what the system did, and go back into the script to change it. In my case the two dozen single-finding agents collapse into two: one that judges every finding in a single pass, one that checks them, with the structure that made the result trustworthy left intact. The token count falls by most of what it was. This is not a special case. Much of the waste a generated workflow produces is sitting in plain sight in the logs, and much of it comes back out.
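The arithmetic behind those numbers, worked through with the figures from the test run. One assumption is mine: that the twenty-nine chapter reads are the three casts plus the twenty-six per-finding agents. The per-agent figure is only an average, and no token total for the fixed script is computed here, since I have not given one.

```python
total_tokens = 1_090_000   # "about 1.09 million tokens"
judge_verify_share = 0.82  # "eighty-two percent of the run"
findings = 13
casts = 3

# Before the fix: one judging agent and one checking agent per finding.
agents_before = findings * 2
chapter_reads_before = casts + agents_before  # assumed breakdown of the 29 reads
judge_verify_tokens = round(total_tokens * judge_verify_share)
avg_tokens_per_agent = round(judge_verify_tokens / agents_before)

# After the fix: one agent judges every finding, one agent checks them.
agents_after = 2
chapter_reads_after = casts + agents_after

print(agents_before, chapter_reads_before)        # 26 29
print(judge_verify_tokens, avg_tokens_per_agent)  # 893800 34377
print(agents_after, chapter_reads_after)          # 2 5
```

Roughly thirty-four thousand tokens per agent, for agents that each handled one complaint, is the overhead made visible: most of what each one spent went on starting up and rereading the chapter.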

The dynamic workflow tool is genuinely useful, but that doesn’t mean you can just let it do its work and be happy about the cost. The way it spends agents, when it writes itself, is not yet economical. It has no real sense of cost efficiency (perhaps that was left out intentionally). I also find its choice of model suspect: it used Haiku for things that I think are beyond its abilities. Luckily, all of those decisions are ones you can supply by hand, after the fact, by reading the logs.


Further explorations of Anthropic’s official example, /deep-research, are upcoming.