Deep dive

How the work actually works.

For the curious. Most visitors skip this. The rest of you, here's what actually happens.

We show up with a notebook and zero deck. The first three days, we shadow whoever's actually running the work. Standups. Tickets. The thing on the screen at 4 PM on Wednesday.

We're not looking for what's "AI-shaped." We're looking for the hour that drains your week. Sometimes it's obvious. Sometimes it's a handoff so embedded the team doesn't read it as friction anymore.

By Friday we have a ranked list. We pick one workflow to start with.

Every audit ends with a ranked list. Every list has the same shape: the workflow at the top is the one nobody flagged when we asked.

Three shapes we keep finding. Different teams, different stacks, same friction.

Shopify ops, team of nine. CX runs through Gorgias. Every morning the support lead spends ninety minutes before standup categorizing overnight tickets: refund requests, shipping issues, product questions, wholesale inquiries. The team can't start until she finishes.

When you ask what she does all day, she says "respond to customers." She doesn't read the categorization as work. It's the prerequisite for the work, so it doesn't count.

The agent reads the overnight queue, tags each ticket, drafts a first-pass reply in the brand voice. Routes wholesale and refund requests to specific humans. Flags the weird ones for review. Standup moves from 9:30 to 9:00.
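
A minimal sketch of that routing step, in Python. The four categories come from the example above. Everything else is an assumption for illustration: the owner handles, the 0.8 confidence cutoff, and the `route` function itself. The tagging (a model call in practice) is assumed to have already produced a category and a confidence score.

```python
from dataclasses import dataclass

# Hypothetical owners; in practice these map to real people in your helpdesk.
OWNERS = {"wholesale": "wholesale-lead", "refund": "refunds-lead"}
KNOWN = {"refund", "shipping", "product_question", "wholesale"}
REVIEW_THRESHOLD = 0.8  # assumed cutoff below which a human looks first


@dataclass
class Tagged:
    ticket_id: str
    category: str      # produced upstream by the model
    confidence: float  # 0.0 to 1.0, also produced upstream


def route(t: Tagged) -> str:
    """Return where the ticket goes next."""
    # Anything unfamiliar or low-confidence is "weird": a human reviews it.
    if t.category not in KNOWN or t.confidence < REVIEW_THRESHOLD:
        return "review"
    # Wholesale and refund requests go to a named person.
    if t.category in OWNERS:
        return f"assign:{OWNERS[t.category]}"
    # Everything else gets a drafted reply waiting in the queue.
    return "draft_reply"
```

The review check runs first on purpose: a low-confidence "refund" should reach a human as a question, not land on the refunds lead as a fact.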

B2B sales, three-person team. Maybe four signed contracts a week. After every signature, somebody sits in front of the executed PDF and copies terms into Salesforce: start date, contract value, billing frequency, special clauses. Twenty minutes per contract, plus the inevitable typo found two months later.

The team calls this "admin." Nobody pushes back because it's admin.

The agent watches the DocuSign-completed inbox, parses the contract, populates the Salesforce fields, posts a summary to the deal Slack channel. Flags anomalies for human review before write. The win isn't the twenty minutes. It's that Salesforce stops being three days behind reality.
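
The "flag before write" gate is the part worth sketching. The version below is a rough Python illustration, not the production check: the field names, the allowed billing values, and the rule that any special clause forces a review are all assumptions. Parsing the PDF and writing to Salesforce are out of scope here.

```python
REQUIRED = ("start_date", "contract_value", "billing_frequency")
BILLING = {"monthly", "quarterly", "annual"}  # assumed allowed values


def anomalies(fields: dict) -> list[str]:
    """List the reasons a parsed contract should not be written automatically."""
    problems = [f"missing {name}" for name in REQUIRED if fields.get(name) in (None, "")]
    value = fields.get("contract_value")
    if isinstance(value, (int, float)) and value <= 0:
        problems.append("contract_value must be positive")
    freq = fields.get("billing_frequency")
    if freq and freq not in BILLING:
        problems.append(f"unknown billing_frequency: {freq}")
    # Assumption: special clauses always get a human read before the write.
    if fields.get("special_clauses"):
        problems.append("special clauses present")
    return problems


def decide(fields: dict) -> tuple[str, list[str]]:
    """Write only when nothing looks off; otherwise hold with the reasons attached."""
    problems = anomalies(fields)
    return ("hold_for_review" if problems else "write", problems)
```

Returning the reasons alongside the decision means the reviewer sees why the contract was held, not just that it was.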

Editorial team of twelve. Pitches land in a Slack channel called #pitches all week. Every Monday morning the managing editor spends two hours reading back through the channel, deduplicating, sorting by priority, rebuilding the editorial calendar in Notion before the 11 AM meeting.

She's done it long enough that she's stopped noticing it's a job.

The agent watches #pitches, extracts each pitch into a structured row, deduplicates against the existing calendar, posts a draft Monday calendar to the editor's DMs by Sunday night. She reviews, tweaks, and ships. The two hours get rerouted to writing.
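
Deduplication is the mechanical core of that workflow. Here is a deliberately simple Python sketch: it treats two pitches as the same when their titles match after normalization. That matching rule is an assumption standing in for whatever the real agent uses (likely something fuzzier), and `normalize` and `new_pitches` are hypothetical names.

```python
import re


def normalize(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", title.lower()).split())


def new_pitches(pitches: list[str], calendar: list[str]) -> list[str]:
    """Keep pitches not already on the calendar and not repeated in the channel."""
    seen = {normalize(t) for t in calendar}
    fresh = []
    for title in pitches:
        key = normalize(title)
        if key and key not in seen:
            seen.add(key)
            fresh.append(title)  # keep the original wording for the editor
    return fresh
```

Because the output is a draft the editor reviews, a missed duplicate costs her a few seconds. That is what makes a crude rule acceptable as a starting point.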

Three different teams, three different stacks, same shape. The right first workflow is:

Repetitive, daily or weekly.
Bounded. The agent isn't being asked to invent strategy.
Recoverable. A wrong draft is a draft. A wrong send is a problem.
Already happening in tools you use. No new SaaS to buy.
Owned by someone who can articulate what a good outcome looks like, even if they've stopped noticing.

That last one is what week one is really about. Not finding work to automate. Finding the person who knows what "right" looks like, and who'll spot it the moment the agent gets it wrong.

Day one of week two, we open a PR. Code lands in your repo from the start. No staging shadow system, no demo environment that diverges from your actual stack.

"Prototype" doesn't mean "polished demo." It means a working agent inside your tools that the team can use Monday morning. It will have rough edges. By Wednesday afternoon the team has told us what's wrong. By Friday we've fixed half of it.

What ships in this phase:

Prompts in your repo, version-controlled.
Tool definitions for everything the agent can touch.
Observability hookups, every agent action logged.
A Slack channel for fast feedback.
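
Tool definitions and action logging fit together naturally: if the agent can only act through registered tools, one wrapper can log every action. A minimal Python sketch under that assumption follows. The `tool` decorator, `call_tool`, the in-memory `ACTION_LOG`, and the `tag_ticket` example are all hypothetical; a real setup would write to your existing log pipeline.

```python
import json
import time
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}
ACTION_LOG: list[str] = []  # stand-in for your real log pipeline


def tool(name: str):
    """Register a function as something the agent is allowed to call."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


def call_tool(name: str, **args: Any) -> Any:
    """Run an allowed tool and log the action, whether it succeeds or fails."""
    if name not in TOOLS:
        ACTION_LOG.append(json.dumps({"tool": name, "status": "denied", "ts": time.time()}))
        raise PermissionError(f"tool not allowed: {name}")
    status = "error"
    try:
        result = TOOLS[name](**args)
        status = "ok"
        return result
    finally:
        # Runs on success and on exception, so no action goes unlogged.
        ACTION_LOG.append(
            json.dumps({"tool": name, "args": args, "status": status, "ts": time.time()})
        )


@tool("tag_ticket")
def tag_ticket(ticket_id: str, tag: str) -> dict:
    # Hypothetical tool body; a real one would call your helpdesk API.
    return {"ticket_id": ticket_id, "tag": tag}
```

Calling `call_tool("tag_ticket", ticket_id="T1", tag="refund")` runs the tool and appends one JSON line to the log. Calling a name that was never registered raises `PermissionError` and logs the denial.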

This is where most agent projects die. The model is fine. The infrastructure around the model is what makes or breaks it.

Six things we ship in this phase:

Eval suite. Tests, but for prompts. Runs in CI on every PR. Catches model drift when Anthropic ships a new Sonnet overnight.
Cost caps. Hard per-day spend limits at the API gateway. If something goes weird, the bill stops.
Rate limiting. Per user, per workflow, per origin. Abuse stops at the door.
Failure recovery. Agents call APIs. APIs fail. Agents retry, then escalate, then page. No silent corruption.
Audit trail. Every decision the agent made, queryable. "Why did the agent send this email at 2 PM?" gets a real answer.
Cost-of-failure analysis. What happens when the agent makes a wrong call? Is it reversible? Auto-flagged for human review? Documented before we ship.
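
The failure-recovery item above (retry, then escalate, then page) can be sketched in a few lines of Python. This is an illustration under assumptions, not the shipped code: `with_recovery` and `Escalated` are hypothetical names, the backoff numbers are made up, and escalation and paging are collapsed into a single `escalate` callback.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Escalated(Exception):
    """Raised after retries are exhausted and a human has been notified."""


def with_recovery(
    action: Callable[[], T],
    escalate: Callable[[str], None],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the action, back off between failures, then hand off loudly."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as err:  # a real version would catch only transient errors
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Out of retries: tell a human, then fail visibly instead of continuing.
    escalate(f"gave up after {attempts} attempts: {last_error}")
    raise Escalated(str(last_error)) from last_error
```

The final `raise` is what rules out silent corruption in this sketch: the caller cannot mistake an exhausted retry for a success.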

By end of week five, the system runs unattended.

We train one operator on your team. Not "watch us click through this dashboard" training. Real, hands-on:

Read an eval result and decide what to do.
Modify a prompt and run the eval suite.
Add a new tool to the agent's allowed list.
Roll back a config.
Read the audit log when something looks off.
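
To make "read an eval result and decide what to do" concrete, here is a small Python sketch of what an eval run and its pass/fail gate might look like. It is an assumption-heavy illustration: the `Case` shape, the `run_evals` and `gate` names, and the 0.95 threshold are hypothetical, and `classify` stands in for a prompt plus a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input: str
    expected: str


def run_evals(classify: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case and report the pass rate plus each failure."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    failures = []
    for case in cases:
        got = classify(case.input)
        if got != case.expected:
            failures.append({"input": case.input, "expected": case.expected, "got": got})
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures}


def gate(result: dict, min_pass_rate: float = 0.95) -> bool:
    """True if the change may merge. The threshold is an assumed default."""
    return result["pass_rate"] >= min_pass_rate
```

The operator's job starts at `failures`: each entry shows the input, what was expected, and what the agent produced, which is usually enough to tell a prompt regression from a bad test case.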

By Friday, that person is the answer to "who runs this?"

What you walk away with:

All code in your repos.
All prompts in version control.
Documentation of the actual things that go wrong in your automation, not generic docs.
Eval suite running in your CI.
Observability dashboard using what you already have (Datadog, Langfuse, Cloudflare logs).
Three months of Slack-channel availability for follow-up questions.

Three things most consultancies skip:

Evals are tests, not vibes. A prompt change without an eval pass is a regression, not progress. CI runs them on every PR.
Costs are bounded by code, not goodwill. Hard caps at the API gateway. The agent can't run up a $10k bill because a malformed input created a loop.
Every agent action is auditable. Compliance and ops both want this. The audit log isn't a feature, it's the substrate.
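
"Bounded by code" can be as small as a counter that refuses the next call. The Python sketch below shows the idea in-process; the caps described above sit at the API gateway, so treat this as an illustration of the rule, not of where it runs. `DailySpendCap`, `SpendCapExceeded`, and the per-call cost estimate are assumptions.

```python
from datetime import date


class SpendCapExceeded(Exception):
    """Raised instead of making a call that would break the daily limit."""


class DailySpendCap:
    """Refuse any call that would push today's spend past a hard limit."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.day = None
        self.spent_usd = 0.0

    def charge(self, estimated_usd: float, today: date) -> None:
        """Record a call's estimated cost, or raise before the call happens."""
        if today != self.day:  # new day, reset the meter
            self.day, self.spent_usd = today, 0.0
        if self.spent_usd + estimated_usd > self.limit_usd:
            raise SpendCapExceeded(
                f"cap {self.limit_usd:.2f} USD reached ({self.spent_usd:.2f} spent)"
            )
        self.spent_usd += estimated_usd
```

Call `charge` before each model request. A malformed input that sends the agent into a loop hits the exception after a bounded number of calls, and the bill stops there.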

If you've done this before, none of this is news. If you haven't, it's the difference between a demo and a system you can actually run.

Everything happens in your tenant:

Workspace agents authenticate via your Google admin. Data flows through Google APIs your team already authorizes.
Shopify agents run through your Shopify Partner app, scoped to your store.
Code lives in your GitHub. We push directly to your repos.
Prompts live next to the code, version-controlled.
Observability writes to your existing stack.

We sign DPAs and BAAs if you need them. We don't extract data to a separate system. Every API call our agents make is auditable in your own logs.

We don't have a stack. The stack you already use is the stack.

Models: Claude by default. GPT and Gemini when the workflow benefits.
Orchestration: depends on the workflow. Sometimes plain SDK calls, sometimes Cloudflare Workers, sometimes the Agents SDK.
Evals: bring-your-own, or we set up vitest-based eval scripts.
Observability: Langfuse, LangSmith, Datadog, or structured logs through your existing pipeline.
Code work: Claude Code in your repo. The same loop you'd run yourself, but the operator is paid.

If your stack has a constraint we haven't seen (regulatory, custom platform, on-prem), we'll tell you up front.
