# llms-full.txt — LLM Workbench (protocol v1.0.0)

Canonical site: https://www.llmworkbench.io
Source: https://github.com/llmworkbench/llm-workbench
License: Proprietary

---

## App README

# `@llm-workbench/web`

The hosted reference deployment for [LLM Workbench](../../README.md). It is a
Next.js 16 (App Router) application that proves the
runtime works end-to-end against real infrastructure: Supabase for run
persistence, Clerk for auth, and Vercel AI Gateway for model calls via the AI
SDK v5.

This app is intentionally a **reference**, not a finished product. Expect to
fork it, swap providers, and harden the trade-offs flagged with `// SECURITY:`
comments in the source.

## Stack

- **Next.js 16** (App Router; Cache Components intentionally off for Clerk compatibility — see `next.config.mjs`)
- **Tailwind CSS v4** (CSS-first `@theme` config) + **shadcn/ui** primitives
- **Clerk** (`@clerk/nextjs`) for authentication and tenancy
- **Supabase** (`@supabase/supabase-js`) for the `runs` table
- **AI SDK v5** (`ai`) routed through **Vercel AI Gateway**
- **`@llm-workbench/runtime`**, **`@llm-workbench/ui`**, **`@llm-workbench/mcp`** — workspace packages (`mcp` powers `/api/mcp`)

## Routes

### Product UI

| Path | What it is |
| --- | --- |
| `/` | Marketing landing page with a “Try the playground” CTA. |
| `/sign-in`, `/sign-up` | Clerk hosted flows. |
| `/playground` | Live job-search workflow demo backed by AI Gateway (**auth required**). |
| `/runs` | Saved runs for the current Clerk org/user. |
| `/runs/[runId]` | Run detail: trace timeline, artifact viewer, gate panel. |
| `/runs/demo` | **Public** read-only demo run (no sign-in). |
| `/blog` | Blog index (Markdown sources under `content/blog/`). |
| `/blog/[slug]` | Individual article (static paths from `.md` front matter). |
| `/docs/protocol` | Protocol overview (**public**). |

### HTTP APIs

| Path | What it is |
| --- | --- |
| `GET /api/health` | Liveness check (**public**). |
| `GET /api/runs?limit=N` | List runs for the caller’s tenant (`HttpRunRepository.list` shape). |
| `GET/PUT/DELETE /api/runs/[runId]` | Single-run CRUD using the workbench wire format. |
| `POST /api/llm` | AI Gateway streaming proxy (demo). |
| `POST /api/mcp` | MCP JSON-RPC (`tools/list` public; mutating tools require auth — see handler). |
| `GET /api/openapi.json` | OpenAPI 3.1 for the run REST surface (**public**). |

### Discovery & feeds (machine-readable)

These are intentional entry points for crawlers, assistants, and integrations:

| Path | What it is |
| --- | --- |
| `/llms.txt` | Short LLM-oriented site summary + important links. |
| `/llms-full.txt` | Long-form narrative for model context. |
| `/agents.md` | Agent-oriented capability summary. |
| `/robots.txt`, `/sitemap.xml` | Crawling hints + URL list. |
| `/.well-known/security.txt` | RFC 9116 security contact (GitHub private advisories). |
| `/.well-known/mcp.json` | MCP server descriptor. |
| `/feed.xml` | RSS 2.0 for blog posts. |

### Routing & security notes

- **Clerk + CSP** live in [`middleware.ts`](middleware.ts) (Next.js middleware convention). Public routes include `/`, `/blog`, `/feed.xml`, `/docs/*`, discovery URLs above, `/runs/demo`, and `/api/openapi.json`; gated surfaces (`/playground`, `/runs`, `/api/runs`, …) require a session. API routes return **401 JSON** when unauthenticated — they never redirect to HTML sign-in.

## Prerequisites

- Node.js **22+** (matches monorepo `engines` and CI)
- A Clerk application (publishable + secret key)
- A Supabase project (URL + service-role key)
- Vercel AI Gateway access (`AI_GATEWAY_API_KEY`, or OIDC if deployed on Vercel)

## Local setup

```bash
# 1. Install all workspace dependencies (run from the repo root)
npm install

# 2. Copy the env template and fill in real values
cp apps/web/.env.example apps/web/.env.local

# 3. Apply the database migration (see apps/web/supabase/README.md)
cd apps/web && supabase db push && cd -

# 4. Run the dev server
npm run dev:web
```

The app will be available at <http://localhost:3000>.

### Lighthouse (optional)

After `npm install` in the repo root, from `apps/web`:

```bash
npm run lighthouse:smoke
```

The script builds the production bundle, serves it briefly with `next start`, audits `/`, and writes scores under `reports/` (gitignored). It requires Chromium for headless Chrome.

## Deploying

This app is designed to deploy on Vercel with zero modification.

[![Deploy with Vercel](https://vercel.com/button)](https://vercel.com/new/clone?repository-url=https://github.com/roymcfarland/llm-workbench&project-name=llm-workbench-reference&repository-name=llm-workbench-reference&env=NEXT_PUBLIC_SUPABASE_URL,SUPABASE_SERVICE_ROLE_KEY,NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY,CLERK_SECRET_KEY,AI_GATEWAY_API_KEY,NEXT_PUBLIC_SITE_ORIGIN)

When deployed on Vercel, prefer OIDC-based auth for AI Gateway (`vercel env
pull` will inject `VERCEL_OIDC_TOKEN`) so you do not have to rotate
`AI_GATEWAY_API_KEY` manually.

### Deploy in 10 minutes

The reference deployment is intentionally cheap (~$0/month at design-partner
volume on the Supabase / Clerk / Vercel free tiers). End-to-end:

1. **Fork or clone** this repository to your own GitHub account.
2. **Supabase.** Create a project at <https://supabase.com>. In the SQL
   editor, run the contents of
   [`apps/web/supabase/migrations/0001_init.sql`](./supabase/migrations/0001_init.sql)
   (or `supabase db push` if you have the CLI linked). Copy the project URL
   and the `service_role` key from Project Settings → API.
3. **Clerk.** Create an application at <https://clerk.com>. Under Paths,
   set the sign-in path to `/sign-in` and the sign-up path to `/sign-up`.
   Copy the publishable key and secret key.
4. **Vercel AI Gateway.** Enable AI Gateway on the Vercel project (Settings
   → AI Gateway). When the project is deployed on Vercel, OIDC injects the
   gateway token automatically — you do **not** need `AI_GATEWAY_API_KEY`
   in production. Set it locally in `.env.local` only.
5. **Vercel.** Import the repo. Set **Root
   Directory** to `apps/web` so Vercel picks up
   [`vercel.json`](./vercel.json) (install + build run from the monorepo root
   via `cd ../..`). Paste env vars from [`.env.example`](./.env.example) into
   Project Settings → Environment Variables; set `NEXT_PUBLIC_SITE_ORIGIN` to
   your production URL. Optional: `SENTRY_DSN`, `NEXT_PUBLIC_SENTRY_DSN`,
   `UPSTASH_REDIS_REST_URL`, `UPSTASH_REDIS_REST_TOKEN`, `CSP_EXTRA_CONNECT_SRC`.
6. **Deploy.** Push to `main` (or click Deploy). The first build typically
   takes a few minutes; CI on GitHub runs `build`, `test`, web `typecheck`,
   `lint`, and `build:web` (see `.github/workflows/ci.yml`).

> Cost expectation at design-partner volume: ~$0/month. Supabase free tier
> covers 500 MB Postgres + 2 GB egress; Clerk free tier covers 10 000 MAUs;
> Vercel hobby covers 100 GB of bandwidth. AI Gateway charges through to
> the underlying provider; budget for that separately.

## Security trade-offs (read these)

Every shortcut is annotated inline with `// SECURITY:` comments. The big ones:

- **Service-role Supabase key.** The server uses the service role and gates
  access by tenant in `lib/auth/tenant.ts`. RLS is enabled as defense in
  depth but is not the primary boundary. See
  [`supabase/migrations/0001_init.sql`](./supabase/migrations/0001_init.sql)
  for the alternative.
- **Tenant key fallback.** When a Clerk user has no active org, we use
  `user:<userId>` as the tenant. That matches the personal-workspace model
  most demos want; multi-tenant production apps should require an org.
- **Body limit.** API routes cap PUT bodies at 25 MB. Large run states
  should be summarized or chunked before persistence.
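
The tenant key fallback above can be sketched as a small pure function. This is a minimal illustration, not the code in `lib/auth/tenant.ts`: `resolveTenantKey` and the `AuthContext` shape are hypothetical names chosen for the example.

```typescript
// Hypothetical shape: the subset of Clerk auth state this sketch needs.
interface AuthContext {
  userId: string | null;
  orgId?: string | null;
}

// Returns the tenant key, or null when nobody is signed in.
function resolveTenantKey(auth: AuthContext): string | null {
  if (!auth.userId) return null; // unauthenticated: the caller should answer 401
  // Prefer the active org; otherwise fall back to a personal workspace key.
  return auth.orgId ?? `user:${auth.userId}`;
}
```

A multi-tenant production app that requires an org would return `null` (or throw) instead of taking the `user:` fallback.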

---

## Project README

# LLM Workbench

**A proprietary control plane for LLM-powered products.**

LLM Workbench gives AI applications a production-grade human interface for
the messy parts that matter: workflow state, artifacts, rules, human review
gates, trace history, model I/O, cost telemetry, import/export, and replay.

It is not another chat UI. It is the layer you bolt onto an LLM pipeline when
you want non-technical users to inspect, edit, approve, branch, audit, and
learn from the work your system is doing.

The runtime is headless, model-agnostic, and environment-agnostic. It does not
call OpenAI, Anthropic, local models, or any other provider directly. Your host
application owns prompts, tools, models, and policy. LLM Workbench records what
happened and gives humans a clean control surface over it.

> **License in one line:** Proprietary. All rights reserved. Use, modification,
> and operation are limited to Authorized Users (Roy McFarland personally and
> entities controlled by Roy McFarland, including Brightline Ltd) except by
> separate written agreement. See [License](#license).

## Status

`v0.2.0` (2026-04-27): the runtime adds Trace 2.0 (hierarchical spans, OTel
GenAI mapper), hierarchical supervision (`runChildrenOf`, `cancelRunCascade`),
and an externalizable `ArtifactStore`; `@llm-workbench/ai-sdk` wraps Vercel
AI SDK v5 with automatic trace events; the UI ships scoped `lwb-` CSS,
accessible `@dnd-kit` reorder, virtualized trace, and a `WorkflowGraph`;
and a hosted reference deployment lands at [`apps/web`](apps/web).
See [CHANGELOG.md](CHANGELOG.md) for the full list.

**Project spec:** [PROJECT.md](PROJECT.md) is the authoritative source of
truth for purpose, scope, non-goals, and the rules that automated reviewers
enforce on every PR.

## See It Live

- **Interactive demo (no signup):** https://www.llmworkbench.io/runs/demo — a
  read-only LLM Workbench run rendered exactly as an authenticated run is.
- **Overview & docs:** https://www.llmworkbench.io · https://www.llmworkbench.io/docs/protocol

## For Reviewers

If you're reviewing this repo, a useful 15-minute path is:

1. Open the live demo first: https://www.llmworkbench.io/runs/demo.
2. Skim [PROJECT.md](PROJECT.md), then the [Architecture](#architecture)
   section below.
3. Read one representative source file:
   [`packages/runtime/src/runtime/session.ts`](packages/runtime/src/runtime/session.ts).
4. Read one representative test suite:
   [`packages/runtime/src/runtime/workbench.test.ts`](packages/runtime/src/runtime/workbench.test.ts).

## How This Repo Is Built

Most changes are shipped as deliberately small slices. The maintainer
acts as architect/advisor: designing scope, grounding the prompt in repository
recon, catching spec errors, reviewing the implementation, and deciding whether
to merge. A coding agent then implements the scoped PR, and a separate verifier
agent independently checks it against [PROJECT.md](PROJECT.md) with a
structured APPROVE/REJECT verdict.

The process artifacts at repo root are there on purpose. [PROJECT.md](PROJECT.md)
is the contract both agents are held to. [CLOSEOUT.md](CLOSEOUT.md) is the
latest slice's build record. [VERIFIER-AUDIT-PR8.md](VERIFIER-AUDIT-PR8.md)
and [VERIFIER-AUDIT-PR10.md](VERIFIER-AUDIT-PR10.md) are independent
verification transcripts from specific PRs.

## Why It Exists

LLM apps fail in boring, expensive ways:

- Outputs change and nobody knows why.
- Prompts, rules, artifacts, and human edits drift apart.
- Non-technical reviewers get a black box instead of useful controls.
- Teams cannot replay what happened after a bad run.
- Model spend is logged somewhere, but not where product decisions happen.
- "Add AI" becomes a pile of custom debugging panels and brittle JSON editors.

LLM Workbench turns that chaos into an inspectable run graph.

## What You Get

- **Model-agnostic runtime.** The host decides which provider, model, prompt
  strategy, and tool registry to use. The runtime records model I/O and tool
  calls through explicit APIs.
- **Workflow-shaped execution.** Workflows are DAGs with step-level gate
  policies: `AUTO`, `PAUSE_BEFORE`, `PAUSE_AFTER`, and `CHECKPOINT`.
- **Human review gates.** Pause before or after important steps, collect
  approvals, rejections, edits, and notes, then resume with traceable intent.
- **Schema-validated artifacts and rules.** Bring JSON Schemas, validate data
  through Ajv, patch artifacts safely, and export redacted user bundles.
- **Tamper-evident run bundles.** Exports carry a SHA-256 digest computed over
  canonical JSON. Imports verify integrity by default.
- **Telemetry-ready traces.** Track provider, model, usage, duration, cost,
  user, tenant, account, and plan metadata without locking into a vendor.
- **Cost and usage summaries.** `summarizeModelTelemetry` turns raw trace
  events into a typed ledger grouped by provider, model, step, user, tenant,
  and plan.
- **Pluggable persistence.** Use memory, IndexedDB, or HTTP behind one
  `RunRepository` interface. The HTTP adapter supports auth headers, timeouts,
  retries, and abort signals.
- **Composable UI.** Use `WorkbenchShell` as a ready-made React control panel,
  or build your own UI against the headless runtime.

## Architecture

```
host app
  owns models, prompts, tools, business logic
  calls runtime APIs as work happens

@llm-workbench/runtime
  records workflow state, artifacts, rules, gates, traces, bundles, telemetry
  runs in browser, Node, or edge-style runtimes

@llm-workbench/ui
  React shell for artifact editing, rules, trace history, gates, import/export

@llm-workbench/adapters-react
  subscription hooks for live runtime state
```

## Repository Layout

```
packages/
  runtime/              @llm-workbench/runtime
  ui/                   @llm-workbench/ui
  adapters-react/       @llm-workbench/adapters-react
  ai-sdk/               @llm-workbench/ai-sdk
  mcp/                  @llm-workbench/mcp (MCP server + HTTP adapter)
examples/
  job-search-demo/    Vite demo app exercising the full surface
  run-repo-server/    Reference REST store for HttpRunRepository
apps/
  web/                Hosted reference deployment (Next.js + Supabase + AI Gateway + Clerk)
```

| Package | What it gives you |
| --- | --- |
| `@llm-workbench/runtime` | Protocol types, `WorkbenchRuntime`, `WorkbenchSession`, `SchemaRegistry`, persistence adapters, bundle import/export, telemetry summaries, and structured `WorkbenchError`. |
| `@llm-workbench/ui` | `WorkbenchShell`, a themeable React interface for artifacts, rules, traces, gates, and bundles. |
| `@llm-workbench/adapters-react` | `useWorkbenchRunRevision` for subscribing React components to live run state. |
| `@llm-workbench/ai-sdk` | Vercel AI SDK v5 wrappers (`tracedGenerateText`, `tracedStreamText`, `tracedGenerateObject`, `tracedStreamObject`, `traceTools`) that emit correlated `model_io`, `tool_call`, and gateway-cost trace events automatically. |
| `@llm-workbench/mcp` | Model Context Protocol server factory plus HTTP handler (`createWorkbenchMcpHttpHandler`) for exposing the runtime over MCP — see [`packages/mcp/README.md`](packages/mcp/README.md). |

## Quick Start

```bash
npm install
npm test
npm run build
npm run demo               # Vite demo app at http://localhost:5173
npm run demo:http-server   # Reference REST store for HttpRunRepository
```

Node.js **22+** is required (`engines` in root `package.json`). CI runs on **Node 22 and 24** (`.github/workflows/ci.yml`).

## 60-Second Integration

```ts
import {
  WorkbenchRuntime,
  SchemaRegistry,
  registerDemoSchemas,
  summarizeModelTelemetry,
} from "@llm-workbench/runtime";

const registry = new SchemaRegistry();
registerDemoSchemas(registry);

const runtime = new WorkbenchRuntime();
const { runId } = runtime.startRun({
  workflow: {
    id: "my-pipeline",
    version: 1,
    steps: [
      { id: "parse", gatePolicy: "PAUSE_BEFORE" },
      { id: "score", gatePolicy: "AUTO" },
    ],
    edges: [{ id: "e1", from: "parse", to: "score" }],
  },
  subject: {
    userId: "user_123",
    tenantId: "team_456",
    planId: "pro",
  },
});

const session = runtime.session(runId);

session.resolveGate({
  stepId: "parse",
  gate: "PAUSE_BEFORE",
  decision: "approved",
});

session.beginStep("parse");

session.writeArtifact({
  artifactKey: "compiledProfile",
  typeId: "compiledProfile",
  data: {
    headline: "TypeScript engineer",
    skills: ["typescript", "react", "systems"],
    summary: "Strong full-stack builder with AI workflow experience.",
  },
});

session.logModelIO({
  stepId: "parse",
  direction: "response",
  provider: "openai",
  model: "gpt-example",
  usage: { inputTokens: 120, outputTokens: 40 },
  cost: { amount: 0.0012, currency: "USD" },
  durationMs: 900,
});

session.completeStep("parse");

const telemetry = summarizeModelTelemetry(session.snapshot());
console.log(telemetry.totals, telemetry.byProviderModel);
```

Drop the shell anywhere in your app:

```tsx
<WorkbenchShell runtime={runtime} runId={runId} registry={registry} />
```

## Runtime Principles

- The runtime never hides state behind provider-specific abstractions.
- Structured outputs should be schema-validated before they become product
  state.
- Human edits and approvals are first-class trace events, not side notes.
- Exported runs should be useful for debugging, audits, demos, and learning.
- Model telemetry should be close enough to the workflow that cost and quality
  can be managed together.
- The public protocol should be boring, explicit, and durable.

## License

LLM Workbench is **proprietary**. All rights reserved.

Use, modification, deployment, and operation are limited to Authorized Users
(Roy McFarland personally and any entity controlled by Roy McFarland,
including Brightline Ltd) except by separate written agreement. The full
text of this grant is in [`LICENSE`](LICENSE), and an identical copy lives in
each `packages/*/LICENSE` directory.

For licensing inquiries, contact Roy McFarland.

## Contributing

Outside contributions are not currently accepted. Issue reports are welcome
through [GitHub Issues](https://github.com/roymcfarland/llm-workbench/issues),
but pull requests from third parties will be closed without merge unless a
separate written agreement is in place.

## Security

Please report security issues through the process in [SECURITY.md](SECURITY.md).

---

# Protocol overview

LLM Workbench v1.0.0 is a runtime-and-wire-format pair for
recording, gating, and replaying LLM-powered work. It is deliberately boring
on the surface — JSON in, JSON out, structured trace events — so it survives
upgrades, model swaps, framework migrations, and audits.

## Host boundaries

The runtime **never** chooses models, executes prompts, or registers tools on
your behalf. Your application owns orchestration policy; LLM Workbench exposes
explicit APIs (`WorkbenchSession`) so **recording is intentional**. Every
meaningful action turns into typed facts (`TraceEvent`) rather than inferred
guesses from stderr or vendor dashboards.

That separation matters when you swap providers or refactor prompts: semantic
gates and artifact schemas travel with the run; vendor IDs remain annotations on
`model_io`, not primary keys for truth.

## Run bundles

A run bundle is the canonical **export** of a single run — the interchange format
for email attachments, audit ZIPs, cold storage, or compliance tooling. It is a
JSON document with these top-level fields:

- `protocolVersion`: literal protocol identifier (currently `1.0.0`).
- `run`: the `RunInstance` (id, workflow snapshot, status, timestamps, optional subject and metadata).
- `trace`: ordered array of `TraceEvent`s, each carrying a stable id and ISO timestamp.
- `artifacts`: every version of every artifact ever written (artifacts are append-only in bundle land).
- `ruleSets`: every version of every rule set referenced by the run.
- `engine` (optional): an internal snapshot of step status, gate state, and idempotency keys for byte-faithful rehydration. When absent, the runtime can re-derive these from the trace.
- `integrity.sha256`: hex SHA-256 over **canonical JSON** of `{run, trace, artifacts, ruleSets, engine?}`. The bundle is tamper-evident — verify on import.

Bundles are content-addressed in spirit: canonical serialization sorts object keys
lexicographically, drops undefined properties consistently, and rejects cyclic or
non-JSON-safe values before hashing — so two semantically equal exports bit-match.
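
The canonicalization rules above can be sketched roughly as follows. This is an illustrative approximation, not the runtime's serializer: `canonicalize` and `sha256OfCanonical` are hypothetical names, and the exact byte layout the runtime hashes may differ.

```typescript
import { createHash } from "node:crypto";

// Serializes a value with sorted keys, skipping undefined properties and
// rejecting cyclic or non-JSON-safe values (hypothetical helper).
function canonicalize(value: unknown, seen: Set<object> = new Set()): string {
  if (value === null) return "null";
  switch (typeof value) {
    case "string":
    case "boolean":
      return JSON.stringify(value);
    case "number":
      if (!Number.isFinite(value)) throw new TypeError("non-finite number");
      return JSON.stringify(value);
    case "object": {
      const obj = value as object;
      if (seen.has(obj)) throw new TypeError("cyclic value");
      const proto = Object.getPrototypeOf(obj);
      if (!Array.isArray(obj) && proto !== Object.prototype && proto !== null) {
        throw new TypeError("non-plain object");
      }
      seen.add(obj);
      let out: string;
      if (Array.isArray(obj)) {
        out = "[" + obj.map((item) => canonicalize(item, seen)).join(",") + "]";
      } else {
        const record = obj as Record<string, unknown>;
        const parts = Object.keys(record)
          .sort() // lexicographic by UTF-16 code unit
          .filter((key) => record[key] !== undefined)
          .map((key) => JSON.stringify(key) + ":" + canonicalize(record[key], seen));
        out = "{" + parts.join(",") + "}";
      }
      seen.delete(obj); // shared (non-cyclic) references remain allowed
      return out;
    }
    default:
      throw new TypeError(`non-JSON-safe value of type ${typeof value}`);
  }
}

// Hex SHA-256 over the canonical serialization.
function sha256OfCanonical(value: unknown): string {
  return createHash("sha256").update(canonicalize(value), "utf8").digest("hex");
}
```

With a serializer of this kind, key order and `undefined` properties stop affecting the digest, which is what lets two semantically equal exports hash to the same value.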

## Live persistence versus export bundles

Day-to-day HTTP persistence (`PUT /api/runs/{runId}` in this reference app)
stores a **`RunStoreState`** snapshot: maps for artifacts, gate state,
idempotency keys, and step status — optimized for merging and optimistic
concurrency. A **`RunBundle`**, by contrast, is the denormalized archive you
hand to another system or verify offline. The runtime translates between them:
export flattens maps to ordered arrays for signing; import rehydrates into
session state. When you read the OpenAPI spec, you are looking at **store** wire
format, not necessarily a fully materialized bundle with `integrity` — use
`export_bundle` via MCP (or equivalent) when you need a signed artifact.

## Trace events

Every observable runtime fact is a typed trace event. The discriminated union
covers:

- `step_started` / `step_completed`: lifecycle of a workflow step.
- `artifact_written` / `artifact_patch`: creating or evolving structured outputs (with idempotency keys to dedupe writes).
- `model_io`: a model call's request, response, or stream chunk, with optional usage, cost, duration, summary, and a redacted payload.
- `tool_call`: a tool invocation with arguments and result.
- `human_gate_requested` / `human_gate_resolved`: pause-and-decide events with explicit decisions (`approved`, `rejected`, `edited`) and optional notes.
- `rule_changed`: snapshot of a new rule-set revision.
- `policy_changed`: a step's gate policy was overridden mid-run.
- `error`: a structured error with optional code and `fatal` flag.
- `run_forked`: the run was branched off another (`parentRunId` or `parentRunIds`).
- `annotation`: free-text human note with optional tags.
- `run_status_changed`: terminal transitions (`completed`, `failed`, `cancelled`).
- `span_started` / `span_ended`: hierarchical spans modeled after OpenTelemetry's GenAI semantic conventions; convertible to OTel spans.

Every event has `id`, `runId`, `ts`, and an optional `stepId` and
`correlationId`. The `TraceEventSchema` Zod parser is the authoritative
contract — any host that emits structured trace events should validate against
it before persistence.

### Ordering, replay, and correlation

The trace is an **append-only** narrative. UIs and replay tooling usually expect
events in time order; the schema does not enforce monotonic timestamps (clock
skew happens), but exporters should preserve append order as the source of truth
for human review. Use `correlationId` to stitch a **single logical operation**
split across multiple events — for example every `stream_chunk` for one
completion should share one id so downstream analysis can collapse a stream back
into one row without heuristics.
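
As a rough sketch of that collapse, the helper below groups events by `correlationId` while preserving append order. The `ModelIoEvent` shape is a trimmed, hypothetical subset of the real event, and `collapseByCorrelation` is not a runtime export.

```typescript
// Hypothetical minimal event shape; real model_io events carry more fields.
interface ModelIoEvent {
  id: string;
  type: "model_io";
  direction: "request" | "response" | "stream_chunk";
  correlationId?: string;
  summary?: string;
}

// Groups events by correlationId. Map iteration order follows first
// appearance, and each bucket keeps the trace's append order.
function collapseByCorrelation(events: ModelIoEvent[]): Map<string, ModelIoEvent[]> {
  const groups = new Map<string, ModelIoEvent[]>();
  for (const event of events) {
    // Assumption: an event without a correlationId forms its own group.
    const key = event.correlationId ?? event.id;
    const bucket = groups.get(key);
    if (bucket) bucket.push(event);
    else groups.set(key, [event]);
  }
  return groups;
}
```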

### Spans and external observability

`span_started` / `span_ended` mirror OpenTelemetry's GenAI semantic conventions: you can convert
the trace to OTel spans with `traceEventsToOtelSpans` (see `@llm-workbench/runtime`)
and ship them to whichever OTLP backend you already operate. That path is complementary,
not redundant: OTLP answers “where was latency?” while the bundle still answers “which
artifact version did a reviewer approve before it reached a customer?”.

## Gates

Every workflow step carries a gate policy:

- `AUTO`: no gate; runs whenever predecessors are ready.
- `PAUSE_BEFORE`: hold until a reviewer approves **before** the step executes.
- `PAUSE_AFTER`: hold after the step completes until a reviewer approves the result.
- `CHECKPOINT`: arbitrary named checkpoints **inside** a long step — each checkpoint is independently approved.

Gate state is part of the run state and persists across reloads. Resuming a
paused run is a `resolveGate` call followed by the next eligible `beginStep`.
Rejections (`human_gate_resolved` with `decision: "rejected"`) are explicit trace facts — they **stop forward progress** until product logic forks or retries; there is no silent skip.
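
The `PAUSE_BEFORE` rule can be pictured as a tiny predicate. This sketch is an assumption-laden simplification: `mayBeginStep` is a hypothetical helper, and it treats only `approved` as unblocking, following the wording above; how the runtime handles `edited` at this gate is not specified here.

```typescript
type GatePolicy = "AUTO" | "PAUSE_BEFORE" | "PAUSE_AFTER" | "CHECKPOINT";
type GateDecision = "approved" | "rejected" | "edited";

// Decides whether a step may begin, given its policy and the latest
// PAUSE_BEFORE decision (undefined means the gate is still unresolved).
function mayBeginStep(policy: GatePolicy, beforeDecision?: GateDecision): boolean {
  // Only PAUSE_BEFORE holds a step before it starts; the other policies
  // either do not gate or gate later in the step's lifecycle.
  if (policy !== "PAUSE_BEFORE") return true;
  return beforeDecision === "approved";
}
```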

## Artifacts and schemas

Artifacts are versioned, JSON-shaped values keyed by `artifactKey`. They
carry a `typeId` that maps to a schema in the `SchemaRegistry`. Writes are
either full replacements (`writeArtifact`) or RFC 6902 JSON Patches
(`patchArtifact`), and both flavours produce structured trace events so
diffs survive replay.

Schemas live in the host: `registerDemoSchemas` ships a useful set of
examples; in production you bring your own Ajv-validated JSON Schemas. The
runtime refuses to write artifacts that do not validate against the registered
schema for their `typeId`.

### Idempotency

Heavy steps may retry — network flakes, duplicate webhook deliveries — so artifact
writes carry **idempotency keys** where the host needs deduplication. Replays with
the same key collapse to one version bump, which keeps forensic traces readable
without duplicate `artifact_written` noise.
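
A minimal in-memory sketch of that deduplication, assuming a key that has been seen before simply returns the version it originally produced. `SketchArtifactStore` is a hypothetical class for illustration, not the runtime's `ArtifactStore`.

```typescript
interface ArtifactVersion {
  artifactKey: string;
  version: number;
  data: unknown;
}

// Hypothetical store: version history per artifact plus seen idempotency keys.
class SketchArtifactStore {
  private versions = new Map<string, ArtifactVersion[]>();
  private byIdempotencyKey = new Map<string, ArtifactVersion>();

  write(artifactKey: string, data: unknown, idempotencyKey?: string): ArtifactVersion {
    if (idempotencyKey !== undefined) {
      const prior = this.byIdempotencyKey.get(idempotencyKey);
      if (prior) return prior; // retry with a known key: no new version
    }
    const existing = this.versions.get(artifactKey) ?? [];
    const next: ArtifactVersion = { artifactKey, version: existing.length + 1, data };
    existing.push(next);
    this.versions.set(artifactKey, existing);
    if (idempotencyKey !== undefined) this.byIdempotencyKey.set(idempotencyKey, next);
    return next;
  }

  history(artifactKey: string): readonly ArtifactVersion[] {
    return this.versions.get(artifactKey) ?? [];
  }
}
```

Writing twice with the same key yields one entry in the history; a write with a fresh key (or none) bumps the version as usual.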

## Telemetry

`summarizeModelTelemetry(state)` reduces the trace into a typed ledger keyed
by provider, model, step, user, tenant, and plan. It surfaces input/output
token totals, cached and reasoning tokens, and per-currency cost rollups. The
ledger is a derived view — the underlying `model_io` events are the system of
record.

For **billing truth**, reconcile against your gateway or cloud invoice APIs; the trace
ledger is the fast, run-scoped approximation that makes product decisions legible next
to workflow structure.
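
To make the shape of such a ledger concrete, here is a small sketch that rolls `model_io` events up by provider and model. It is not `summarizeModelTelemetry`: `sketchLedger`, `ModelIoRecord`, and `LedgerRow` are hypothetical, and the real summary covers more dimensions (step, user, tenant, plan, cached and reasoning tokens).

```typescript
// Hypothetical trimmed view of a model_io event.
interface ModelIoRecord {
  type: "model_io";
  provider?: string;
  model?: string;
  usage?: { inputTokens?: number; outputTokens?: number };
  cost?: { amount: number; currency: string };
}

interface LedgerRow {
  inputTokens: number;
  outputTokens: number;
  costByCurrency: Record<string, number>; // costs are never summed across currencies
}

// Reduces events into one row per "provider/model" key.
function sketchLedger(events: ModelIoRecord[]): Record<string, LedgerRow> {
  const ledger: Record<string, LedgerRow> = {};
  for (const event of events) {
    const key = `${event.provider ?? "unknown"}/${event.model ?? "unknown"}`;
    let row = ledger[key];
    if (!row) {
      row = { inputTokens: 0, outputTokens: 0, costByCurrency: {} };
      ledger[key] = row;
    }
    row.inputTokens += event.usage?.inputTokens ?? 0;
    row.outputTokens += event.usage?.outputTokens ?? 0;
    if (event.cost) {
      const { amount, currency } = event.cost;
      row.costByCurrency[currency] = (row.costByCurrency[currency] ?? 0) + amount;
    }
  }
  return ledger;
}
```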

## Forks and lineage

`run_forked` plus `RunInstance` parent linkage (`parentRunId` or plural
`parentRunIds`) lets you express forks (human branches an investigation) or
supervisor/worker graphs without losing ancestry. Consumers should prefer helpers that
normalize plural vs singular parents — older bundles may only carry `parentRunId`.
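
One way such a normalizing helper might look is sketched below. `normalizeParents` is a hypothetical name, and placing the singular parent first when both fields are present is an assumption rather than documented behaviour.

```typescript
// Hypothetical subset of RunInstance covering only parent linkage.
interface ParentLinkage {
  parentRunId?: string;
  parentRunIds?: string[];
}

// Always returns an array, whichever field the bundle carries.
function normalizeParents(run: ParentLinkage): string[] {
  const ids = [...(run.parentRunIds ?? [])];
  // Assumption: a singular parent missing from the plural list goes first.
  if (run.parentRunId !== undefined && !ids.includes(run.parentRunId)) {
    ids.unshift(run.parentRunId);
  }
  return ids;
}
```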

## Migrations

Bundle migration is a single `migrateRunBundle` step keyed off
`protocolVersion`. Bumping the version forces an explicit migration path —
older bundles are accepted, transformed, and re-signed before they enter the
runtime. The runtime refuses to import a bundle whose declared protocol
version it does not understand (unless migration hooks extend the importer).
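
The version-keyed flow might be sketched like this. Everything here is illustrative: `migrateSketch` is not `migrateRunBundle`, and the `0.9.0` entry is an invented prior version used only to show the lookup. Re-computing the integrity hash after migration is left to the caller.

```typescript
interface VersionedBundle {
  protocolVersion: string;
  [field: string]: unknown;
}
type Migration = (bundle: VersionedBundle) => VersionedBundle;

const CURRENT_VERSION = "1.0.0";

// Hypothetical registry: "0.9.0" and its transform are invented for illustration.
const migrations: Record<string, Migration> = {
  "0.9.0": (bundle) => ({ ...bundle, protocolVersion: "1.0.0" }),
};

// Applies migrations step by step until the bundle reaches CURRENT_VERSION.
function migrateSketch(bundle: VersionedBundle): VersionedBundle {
  let current = bundle;
  const visited = new Set<string>();
  while (current.protocolVersion !== CURRENT_VERSION) {
    const step = migrations[current.protocolVersion];
    // Refuse unknown versions, and guard against a migration loop.
    if (!step || visited.has(current.protocolVersion)) {
      throw new Error(`unsupported protocolVersion: ${current.protocolVersion}`);
    }
    visited.add(current.protocolVersion);
    current = step(current);
  }
  return current;
}
```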

## Sample minimal RunBundle

```json
{
  "protocolVersion": "1.0.0",
  "run": {
    "id": "run_demo_001",
    "workflowId": "jobSearchWorkflow",
    "workflowVersion": 1,
    "workflowSnapshot": {
      "id": "jobSearchWorkflow",
      "version": 1,
      "steps": [
        { "id": "parse",   "gatePolicy": "PAUSE_BEFORE" },
        { "id": "score",   "gatePolicy": "AUTO" }
      ],
      "edges": [{ "id": "e1", "from": "parse", "to": "score" }]
    },
    "startedAt": "2026-04-01T12:00:00.000Z",
    "endedAt":   "2026-04-01T12:00:42.000Z",
    "status": "completed"
  },
  "trace": [
    { "id": "evt_1", "type": "human_gate_resolved", "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:01.000Z",
      "stepId": "parse", "gate": "PAUSE_BEFORE", "decision": "approved" },
    { "id": "evt_2", "type": "step_started", "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:02.000Z", "stepId": "parse" },
    { "id": "evt_3", "type": "model_io", "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:03.000Z", "stepId": "parse",
      "direction": "response", "provider": "anthropic",
      "model": "claude-haiku-4-5",
      "usage": { "inputTokens": 110, "outputTokens": 40, "totalTokens": 150 },
      "cost":  { "amount": 0.003, "currency": "USD" },
      "durationMs": 220 },
    { "id": "evt_4", "type": "artifact_written", "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:04.000Z",
      "artifact": {
        "artifactKey": "compiledProfile", "typeId": "compiledProfile",
        "version": 1, "createdAt": "2026-04-01T12:00:04.000Z",
        "data": { "headline": "Senior TypeScript engineer",
                  "skills": ["typescript","react","systems"],
                  "summary": "Strong full-stack builder." }
      } }
  ],
  "artifacts": [],
  "ruleSets": [],
  "integrity": { "sha256": "<hex>" }
}
```

## Driving the workbench programmatically

Two complementary surfaces are intended for agents and integrations:

- **REST.** `GET /api/runs` lists runs for the caller's tenant; `GET /api/runs/{runId}` returns serialized **live state** (`RunStoreState` shape); `PUT /api/runs/{runId}` persists the next revision (same wire shape); `DELETE /api/runs/{runId}` removes it. Responses mirror what `HttpRunRepository` reads and writes — **not** automatically the signed bundle envelope unless your exporter wraps it. See `/api/openapi.json` for schemas.
- **MCP.** `/api/mcp` is a Streamable HTTP MCP endpoint. The `@llm-workbench/mcp` core registers `list_runs`, `get_run`, `verify_run_integrity`, and `validate_run_bundle`; this reference app adds `start_run`, `resolve_gate`, `write_artifact`, and `export_bundle` (tamper-evident **RunBundle** JSON with engine snapshot — use this when automation needs hashes, not only row state). Discovery lives at `/.well-known/mcp.json`.

Both surfaces share Clerk-based auth and the tenant-scoping rules described in
`/agents.md`.

---

# Driving the workbench programmatically

Two surfaces expose the runtime over the network:

## REST (`/api/runs` and `/api/runs/{runId}`)

- `GET /api/runs?limit=N` returns `SavedRunMeta[]` for the caller's tenant.
- `GET /api/runs/{runId}` returns the serialized `RunStoreState` (the same wire format `HttpRunRepository` produces).
- `PUT /api/runs/{runId}` persists a serialized state. Body limit is 25 MB. Validates structural invariants and rejects `state.run.id !== runId`.
- `DELETE /api/runs/{runId}` removes a run.
- All responses carry `Link: </api/openapi.json>; rel="describedby"`.
- Auth is Clerk-based: the request must carry a session cookie (or a Clerk bearer token in production deployments). Tenants are derived as `orgId ?? "user:" + userId`.

The full schema lives at `/api/openapi.json` (OpenAPI 3.1).
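
The `PUT` checks listed above can be approximated with a small pre-flight function. This is a sketch under assumptions: `checkPutBody` is a hypothetical helper, "25 MB" is read as 25 × 1024 × 1024 bytes, and the handler's real structural validation is more thorough than the single `run.id` check shown.

```typescript
const MAX_PUT_BYTES = 25 * 1024 * 1024; // assumption: binary megabytes

type PutCheck = { ok: true } | { ok: false; status: 400 | 413; error: string };

// Validates size, JSON well-formedness, and that state.run.id matches the URL.
function checkPutBody(runId: string, rawBody: string, maxBytes: number = MAX_PUT_BYTES): PutCheck {
  if (new TextEncoder().encode(rawBody).length > maxBytes) {
    return { ok: false, status: 413, error: "body too large" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, status: 400, error: "invalid JSON" };
  }
  const run = (parsed as { run?: { id?: unknown } } | null)?.run;
  if (!run || run.id !== runId) {
    return { ok: false, status: 400, error: "state.run.id must match runId" };
  }
  return { ok: true };
}
```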

## MCP (`/api/mcp`)

A Streamable HTTP MCP endpoint registers:

- Core (`@llm-workbench/mcp`): `list_runs`, `get_run`, `verify_run_integrity`, `validate_run_bundle`.
- Reference app additions: `start_run`, `resolve_gate`, `write_artifact`, `export_bundle` (full-profile tamper-evident bundle).

Discovery via `/.well-known/mcp.json`. Resources expose `runs://{runId}` bundle URIs — see `packages/mcp/README.md`.

HTML crawlers and link previews do not carry Clerk sessions. Middleware explicitly allows OG/Twitter metadata image routes (`/opengraph-image`, `/twitter-image`) and marketing paths; tenant APIs and MCP stay behind auth, and `robots.txt` adds a `Disallow` for private APIs as crawl-budget hygiene.

## Error model

Errors are JSON: `{ "error": "<human message>", "code": "<optional canonical code>" }`. Status codes follow standard REST conventions: `400` invalid body, `401` missing session, `404` unknown run, `413` body too large, `500` for unexpected failures.
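
On the client side, the error envelope can be read defensively, as in the sketch below. `parseApiError` and `WorkbenchApiError` are hypothetical names; the fallback message for non-JSON bodies is an assumption, not part of the documented contract.

```typescript
interface WorkbenchApiError {
  status: number;
  message: string;
  code?: string;
}

// Parses `{ "error": string, "code"?: string }`, tolerating malformed bodies.
function parseApiError(status: number, bodyText: string): WorkbenchApiError {
  try {
    const body: unknown = JSON.parse(bodyText);
    if (typeof body === "object" && body !== null) {
      const { error, code } = body as { error?: unknown; code?: unknown };
      if (typeof error === "string") {
        return { status, message: error, ...(typeof code === "string" ? { code } : {}) };
      }
    }
  } catch {
    // Not JSON (for example an upstream proxy page): fall through.
  }
  return { status, message: `HTTP ${status}` };
}
```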

## Rate limits

No rate limits are enforced in this reference deployment. Production
deployments must add a per-tenant limiter at the API or MCP layer before
exposing this surface to untrusted clients.
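
As a starting point for such a limiter, a per-tenant token bucket can be sketched in a few lines. `TenantRateLimiter` is a hypothetical class, and because it keeps state in process memory it would not coordinate across multiple serverless instances; a shared store would be needed for that.

```typescript
// Hypothetical per-tenant token bucket, kept in process memory.
class TenantRateLimiter {
  private buckets = new Map<string, { tokens: number; updatedAt: number }>();
  private capacity: number;
  private refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true and spends one token if the tenant has budget left.
  tryConsume(tenant: string, nowMs: number = Date.now()): boolean {
    const bucket = this.buckets.get(tenant) ?? { tokens: this.capacity, updatedAt: nowMs };
    const elapsedSeconds = Math.max(0, nowMs - bucket.updatedAt) / 1000;
    // Refill in proportion to elapsed time, capped at capacity.
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.updatedAt = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(tenant, bucket);
    return allowed;
  }
}
```

A handler would call `tryConsume(tenantKey)` after auth and answer with a `429` when it returns `false`; that status code is a suggestion and is not part of the error model listed above.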

---

# Trace event reference

Every observable runtime fact is one of these typed events. The discriminated
union is the authoritative contract — host code should validate against
`TraceEventSchema` before persistence. All events carry `id`, `runId`,
`ts`, optional `stepId`, optional `correlationId`.

- `step_started` — a workflow step has started executing.
- `step_completed` — a step finished; `ok: boolean` plus optional `error`.
- `artifact_written` — a new artifact version was written (full replacement).
- `artifact_patch` — an artifact was advanced via RFC 6902 JSON Patch ops.
- `model_io` — a model call (`request`, `response`, or `stream_chunk`) with optional provider, model, usage, cost, durationMs, summary, and a redacted payload.
- `tool_call` — a tool was invoked, with arguments and result.
- `human_gate_requested` — runtime is paused waiting for a human decision (PAUSE_BEFORE | PAUSE_AFTER | CHECKPOINT).
- `human_gate_resolved` — a human delivered a decision (`approved`, `rejected`, `edited`) plus an optional note.
- `rule_changed` — a rule set was updated; the snapshot is embedded.
- `policy_changed` — a step's gate policy was overridden mid-run.
- `error` — a structured error with optional code and `fatal` flag.
- `run_forked` — the run was branched off another run (`parentRunId` or `parentRunIds`).
- `annotation` — free-text human annotation with optional tags.
- `run_status_changed` — terminal transitions: `completed`, `failed`, `cancelled`.
- `span_started` / `span_ended` — hierarchical spans modeled after OpenTelemetry GenAI semconv; convertible to OTel via `traceEventsToOtelSpans`.