# llms-full.txt: LLM Workbench (protocol v1.0.0)

Canonical site: https://www.llmworkbench.io
Source: https://github.com/llmworkbench/llm-workbench
License: Proprietary

---

## App README

# `@llm-workbench/web`

The hosted reference deployment for [LLM Workbench](../../README.md). It is a Next.js 16 (App Router) application that proves the runtime works end-to-end against real infrastructure: Supabase for run persistence, Clerk for auth, and Vercel AI Gateway for model calls via the AI SDK v5.

This app is intentionally a **reference**, not a finished product. Expect to fork it, swap providers, and harden the trade-offs flagged with `// SECURITY:` comments in the source.

## Stack

- **Next.js 16** (App Router; Cache Components intentionally off for Clerk compatibility, see `next.config.mjs`)
- **Tailwind CSS v4** (CSS-first `@theme` config) + **shadcn/ui** primitives
- **Clerk** (`@clerk/nextjs`) for authentication and tenancy
- **Supabase** (`@supabase/supabase-js`) for the `runs` table
- **AI SDK v5** (`ai`) routed through **Vercel AI Gateway**
- **`@llm-workbench/runtime`**, **`@llm-workbench/ui`**, **`@llm-workbench/mcp`**: workspace packages (`mcp` powers `/api/mcp`)

## Routes

### Product UI

| Path | What it is |
| --- | --- |
| `/` | Marketing landing page with a “Try the playground” CTA. |
| `/sign-in`, `/sign-up` | Clerk hosted flows. |
| `/playground` | Live job-search workflow demo backed by AI Gateway (**auth required**). |
| `/runs` | Saved runs for the current Clerk org/user. |
| `/runs/[runId]` | Run detail: trace timeline, artifact viewer, gate panel. |
| `/runs/demo` | **Public** read-only demo run (no sign-in). |
| `/blog` | Blog index (Markdown sources under `content/blog/`). |
| `/blog/[slug]` | Individual article (static paths from `.md` front matter). |
| `/docs/protocol` | Protocol overview (**public**). |

### HTTP APIs

| Path | What it is |
| --- | --- |
| `GET /api/health` | Liveness check (**public**). |
| `GET /api/runs?limit=N` | List runs for the caller’s tenant (`HttpRunRepository.list` shape). |
| `GET/PUT/DELETE /api/runs/[runId]` | Single-run CRUD using the workbench wire format. |
| `POST /api/llm` | AI Gateway streaming proxy (demo). |
| `POST /api/mcp` | MCP JSON-RPC (`tools/list` public; mutating tools require auth, see handler). |
| `GET /api/openapi.json` | OpenAPI 3.1 for the run REST surface (**public**). |

### Discovery & feeds (machine-readable)

These are intentional entry points for crawlers, assistants, and integrations:

| Path | What it is |
| --- | --- |
| `/llms.txt` | Short LLM-oriented site summary + important links. |
| `/llms-full.txt` | Long-form narrative for model context. |
| `/agents.md` | Agent-oriented capability summary. |
| `/robots.txt`, `/sitemap.xml` | Crawling hints + URL list. |
| `/.well-known/security.txt` | RFC 9116 security contact (GitHub private advisories). |
| `/.well-known/mcp.json` | MCP server descriptor. |
| `/feed.xml` | RSS 2.0 for blog posts. |

### Routing & security notes

- **Clerk + CSP** live in [`middleware.ts`](middleware.ts) (Next.js middleware convention). Public routes include `/`, `/blog`, `/feed.xml`, `/docs/*`, the discovery URLs above, `/runs/demo`, and `/api/openapi.json`; gated surfaces (`/playground`, `/runs`, `/api/runs`, …) require a session. API routes return **401 JSON** when unauthenticated; they never redirect to the HTML sign-in page.

## Prerequisites

- Node.js **22+** (matches monorepo `engines` and CI)
- A Clerk application (publishable + secret key)
- A Supabase project (URL + service-role key)
- Vercel AI Gateway access (`AI_GATEWAY_API_KEY`, or OIDC if deployed on Vercel)

## Local setup

```bash
# 1. Install all workspace dependencies (run from the repo root)
npm install

# 2. Copy the env template and fill in real values
cp apps/web/.env.example apps/web/.env.local

# 3. Apply the database migration (see apps/web/supabase/README.md)
cd apps/web && supabase db push && cd -

# 4. Run the dev server
npm run dev:web
```

The app will then be available at the local URL that the dev server prints on startup.

### Lighthouse (optional)

After `npm install` in the repo root, from `apps/web`:

```bash
npm run lighthouse:smoke
```

The script builds for production, serves `next start` briefly, audits `/`, and writes scores under `reports/` (gitignored). It requires Chromium for headless Chrome.

## Deploying

This app is designed to deploy on Vercel with zero modification.

[![Deploy with Vercel](https://vercel.com/button)](https://vercel.com/new/clone?repository-url=https://github.com/roymcfarland/llm-workbench&project-name=llm-workbench-reference&repository-name=llm-workbench-reference&env=NEXT_PUBLIC_SUPABASE_URL,SUPABASE_SERVICE_ROLE_KEY,NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY,CLERK_SECRET_KEY,AI_GATEWAY_API_KEY,NEXT_PUBLIC_SITE_ORIGIN)

When deployed on Vercel, prefer OIDC-based auth for AI Gateway (`vercel env pull` will inject `VERCEL_OIDC_TOKEN`) so you do not have to rotate `AI_GATEWAY_API_KEY` manually.

### Deploy in 10 minutes

The reference deployment is intentionally cheap (~$0/month at design-partner volume on the Supabase / Clerk / Vercel free tiers). End-to-end:

1. **Fork or clone** this repository to your own GitHub account.
2. **Supabase.** Create a Supabase project. In the SQL editor, run the contents of [`apps/web/supabase/migrations/0001_init.sql`](./supabase/migrations/0001_init.sql) (or `supabase db push` if you have the CLI linked). Copy the project URL and the `service_role` key from Project Settings → API.
3. **Clerk.** Create a Clerk application. Under Paths, set the sign-in path to `/sign-in` and the sign-up path to `/sign-up`. Copy the publishable key and secret key.
4. **Vercel AI Gateway.** Enable AI Gateway on the Vercel project (Settings → AI Gateway). When the project is deployed on Vercel, OIDC injects the gateway token automatically, so you do **not** need `AI_GATEWAY_API_KEY` in production. Set it locally in `.env.local` only.
5. **Vercel.** Import the repo.
   Set **Root Directory** to `apps/web` so Vercel picks up [`vercel.json`](./vercel.json) (install + build run from the monorepo root via `cd ../..`). Paste env vars from [`.env.example`](./.env.example) into Project Settings → Environment Variables; set `NEXT_PUBLIC_SITE_ORIGIN` to your production URL. Optional: `SENTRY_DSN`, `NEXT_PUBLIC_SENTRY_DSN`, `UPSTASH_REDIS_REST_URL`, `UPSTASH_REDIS_REST_TOKEN`, `CSP_EXTRA_CONNECT_SRC`.
6. **Deploy.** Push to `main` (or click Deploy). The first build typically takes a few minutes; CI on GitHub runs `build`, `test`, web `typecheck`, `lint`, and `build:web` (see `.github/workflows/ci.yml`).

> Cost expectation at design-partner volume: ~$0/month. Supabase free tier
> covers 500 MB Postgres + 2 GB egress; Clerk free tier covers 10 000 MAUs;
> Vercel hobby covers 100 GB of bandwidth. AI Gateway charges through to
> the underlying provider; budget for that separately.

## Security trade-offs (read these)

Every shortcut is annotated inline with `// SECURITY:` comments. The big ones:

- **Service-role Supabase key.** The server uses the service role and gates access by tenant in `lib/auth/tenant.ts`. RLS is enabled as defense in depth but is not the primary boundary. See [`supabase/migrations/0001_init.sql`](./supabase/migrations/0001_init.sql) for the alternative.
- **Tenant key fallback.** When a Clerk user has no active org, we use `user:` followed by the Clerk user ID as the tenant. That matches the personal-workspace model most demos want; multi-tenant production apps should require an org.
- **Body limit.** API routes cap PUT bodies at 25 MB. Large run states should be summarized or chunked before persistence.

---

## Project README

# LLM Workbench

**A proprietary control plane for LLM-powered products.**

LLM Workbench gives AI applications a production-grade human interface for the messy parts that matter: workflow state, artifacts, rules, human review gates, trace history, model I/O, cost telemetry, import/export, and replay.

It is not another chat UI.
It is the layer you bolt onto an LLM pipeline when you want non-technical users to inspect, edit, approve, branch, audit, and learn from the work your system is doing.

The runtime is headless, model-agnostic, and environment-agnostic. It does not call OpenAI, Anthropic, local models, or any other provider directly. Your host application owns prompts, tools, models, and policy. LLM Workbench records what happened and gives humans a clean control surface over it.

> **License in one line:** Proprietary. All rights reserved. Use, modification,
> and operation are limited to Authorized Users (Roy McFarland personally and
> entities controlled by Roy McFarland, including Brightline Ltd) except by
> separate written agreement. See [License](#license).

## Status

`v0.2.0` (2026-04-27): the runtime adds Trace 2.0 (hierarchical spans, OTel GenAI mapper), hierarchical supervision (`runChildrenOf`, `cancelRunCascade`), and an externalizable `ArtifactStore`; `@llm-workbench/ai-sdk` wraps Vercel AI SDK v5 with automatic trace events; the UI ships scoped `lwb-` CSS, accessible `@dnd-kit` reorder, virtualized trace, and a `WorkflowGraph`; and a hosted reference deployment lands at [`apps/web`](apps/web). See [CHANGELOG.md](CHANGELOG.md) for the full list.

**Project spec:** [PROJECT.md](PROJECT.md) is the authoritative source of truth for purpose, scope, non-goals, and the rules that automated reviewers enforce on every PR.

## See It Live

- **Interactive demo (no signup):** https://www.llmworkbench.io/runs/demo, a read-only LLM Workbench run rendered exactly as an authenticated run is.
- **Overview & docs:** https://www.llmworkbench.io · https://www.llmworkbench.io/docs/protocol

## For Reviewers

If you're reviewing this repo, a useful 15-minute path is:

1. Open the live demo first: https://www.llmworkbench.io/runs/demo.
2. Skim [PROJECT.md](PROJECT.md), then the [Architecture](#architecture) section below.
3. Read one representative source file: [`packages/runtime/src/runtime/session.ts`](packages/runtime/src/runtime/session.ts).
4. Read one representative test suite: [`packages/runtime/src/runtime/workbench.test.ts`](packages/runtime/src/runtime/workbench.test.ts).

## How This Repo Is Built

Most changes are shipped as deliberately small slices. The maintainer acts as architect/advisor: designing scope, grounding the prompt in repository recon, catching spec errors, reviewing the implementation, and deciding whether to merge. A coding agent then implements the scoped PR, and a separate verifier agent independently checks it against [PROJECT.md](PROJECT.md) with a structured APPROVE/REJECT verdict.

The process artifacts at repo root are there on purpose. [PROJECT.md](PROJECT.md) is the contract both agents are held to. [CLOSEOUT.md](CLOSEOUT.md) is the latest slice's build record. [VERIFIER-AUDIT-PR8.md](VERIFIER-AUDIT-PR8.md) and [VERIFIER-AUDIT-PR10.md](VERIFIER-AUDIT-PR10.md) are independent verification transcripts from specific PRs.

## Why It Exists

LLM apps fail in boring, expensive ways:

- Outputs change and nobody knows why.
- Prompts, rules, artifacts, and human edits drift apart.
- Non-technical reviewers get a black box instead of useful controls.
- Teams cannot replay what happened after a bad run.
- Model spend is logged somewhere, but not where product decisions happen.
- "Add AI" becomes a pile of custom debugging panels and brittle JSON editors.

LLM Workbench turns that chaos into an inspectable run graph.

## What You Get

- **Model-agnostic runtime.** The host decides which provider, model, prompt strategy, and tool registry to use. The runtime records model I/O and tool calls through explicit APIs.
- **Workflow-shaped execution.** Workflows are DAGs with step-level gate policies: `AUTO`, `PAUSE_BEFORE`, `PAUSE_AFTER`, and `CHECKPOINT`.
- **Human review gates.** Pause before or after important steps, collect approvals, rejections, edits, and notes, then resume with traceable intent.
- **Schema-validated artifacts and rules.** Bring JSON Schemas, validate data through Ajv, patch artifacts safely, and export redacted user bundles.
- **Tamper-evident run bundles.** Exports are SHA-256 signed over canonical JSON. Imports verify integrity by default.
- **Telemetry-ready traces.** Track provider, model, usage, duration, cost, user, tenant, account, and plan metadata without locking into a vendor.
- **Cost and usage summaries.** `summarizeModelTelemetry` turns raw trace events into a typed ledger grouped by provider, model, step, user, tenant, and plan.
- **Pluggable persistence.** Use memory, IndexedDB, or HTTP behind one `RunRepository` interface. The HTTP adapter supports auth headers, timeouts, retries, and abort signals.
- **Composable UI.** Use `WorkbenchShell` as a ready-made React control panel, or build your own UI against the headless runtime.
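
To make "workflow-shaped execution" a little more concrete, here is a minimal sketch of how a host might reason about a workflow DAG and its gate policies. The `Step`, `Edge`, and `Workflow` shapes follow the workflow literal used in the 60-Second Integration example; `readySteps` and `needsApprovalBeforeStart` are hypothetical helpers written for this illustration and are not part of `@llm-workbench/runtime`.

```ts
// Shapes mirror the workflow literal in the integration example; the helpers are illustrative only.
type GatePolicy = "AUTO" | "PAUSE_BEFORE" | "PAUSE_AFTER" | "CHECKPOINT";

interface Step {
  id: string;
  gatePolicy: GatePolicy;
}

interface Edge {
  id: string;
  from: string;
  to: string;
}

interface Workflow {
  id: string;
  version: number;
  steps: Step[];
  edges: Edge[];
}

// Hypothetical helper: steps that are not yet complete and whose predecessors all are.
function readySteps(workflow: Workflow, completed: ReadonlySet<string>): Step[] {
  return workflow.steps.filter((step) => {
    if (completed.has(step.id)) return false;
    return workflow.edges
      .filter((edge) => edge.to === step.id)
      .every((edge) => completed.has(edge.from));
  });
}

// Hypothetical helper: must a reviewer approve before this step may begin?
function needsApprovalBeforeStart(step: Step): boolean {
  return step.gatePolicy === "PAUSE_BEFORE";
}

const sampleWorkflow: Workflow = {
  id: "my-pipeline",
  version: 1,
  steps: [
    { id: "parse", gatePolicy: "PAUSE_BEFORE" },
    { id: "score", gatePolicy: "AUTO" },
  ],
  edges: [{ id: "e1", from: "parse", to: "score" }],
};

// With nothing completed, only "parse" is ready, and it needs approval first.
for (const step of readySteps(sampleWorkflow, new Set<string>())) {
  console.log(step.id, needsApprovalBeforeStart(step) ? "needs approval" : "can start");
}
```

The real runtime tracks step status and gate state itself; the sketch only shows why a DAG plus per-step policies is enough information to decide what may run next.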
## Architecture

```
host app
  owns models, prompts, tools, business logic
  calls runtime APIs as work happens

@llm-workbench/runtime
  records workflow state, artifacts, rules, gates, traces, bundles, telemetry
  runs in browser, Node, or edge-style runtimes

@llm-workbench/ui
  React shell for artifact editing, rules, trace history, gates, import/export

@llm-workbench/adapters-react
  subscription hooks for live runtime state
```

## Repository Layout

```
packages/
  runtime/          @llm-workbench/runtime
  ui/               @llm-workbench/ui
  adapters-react/   @llm-workbench/adapters-react
  ai-sdk/           @llm-workbench/ai-sdk
  mcp/              @llm-workbench/mcp (MCP server + HTTP adapter)
examples/
  job-search-demo/  Vite demo app exercising the full surface
  run-repo-server/  Reference REST store for HttpRunRepository
apps/
  web/              Hosted reference deployment (Next.js + Supabase + AI Gateway + Clerk)
```

| Package | What it gives you |
| --- | --- |
| `@llm-workbench/runtime` | Protocol types, `WorkbenchRuntime`, `WorkbenchSession`, `SchemaRegistry`, persistence adapters, bundle import/export, telemetry summaries, and structured `WorkbenchError`. |
| `@llm-workbench/ui` | `WorkbenchShell`, a themeable React interface for artifacts, rules, traces, gates, and bundles. |
| `@llm-workbench/adapters-react` | `useWorkbenchRunRevision` for subscribing React components to live run state. |
| `@llm-workbench/ai-sdk` | Vercel AI SDK v5 wrappers (`tracedGenerateText`, `tracedStreamText`, `tracedGenerateObject`, `tracedStreamObject`, `traceTools`) that emit correlated `model_io`, `tool_call`, and gateway-cost trace events automatically. |
| `@llm-workbench/mcp` | Model Context Protocol server factory plus HTTP handler (`createWorkbenchMcpHttpHandler`) for exposing the runtime over MCP; see [`packages/mcp/README.md`](packages/mcp/README.md). |
## Quick Start

```bash
npm install
npm test
npm run build
npm run demo              # Vite demo app at http://localhost:5173
npm run demo:http-server  # Reference REST store for HttpRunRepository
```

Node.js **22+** is required (`engines` in root `package.json`). CI runs on **Node 22 and 24** (`.github/workflows/ci.yml`).

## 60-Second Integration

```ts
import {
  WorkbenchRuntime,
  SchemaRegistry,
  registerDemoSchemas,
  summarizeModelTelemetry,
} from "@llm-workbench/runtime";

const registry = new SchemaRegistry();
registerDemoSchemas(registry);

const runtime = new WorkbenchRuntime();

// Start a run from a two-step workflow; "parse" is gated, "score" is not.
const { runId } = runtime.startRun({
  workflow: {
    id: "my-pipeline",
    version: 1,
    steps: [
      { id: "parse", gatePolicy: "PAUSE_BEFORE" },
      { id: "score", gatePolicy: "AUTO" },
    ],
    edges: [{ id: "e1", from: "parse", to: "score" }],
  },
  subject: {
    userId: "user_123",
    tenantId: "team_456",
    planId: "pro",
  },
});

const session = runtime.session(runId);

// Approve the PAUSE_BEFORE gate, then begin the step.
session.resolveGate({
  stepId: "parse",
  gate: "PAUSE_BEFORE",
  decision: "approved",
});
session.beginStep("parse");

// Record the structured output of the step.
session.writeArtifact({
  artifactKey: "compiledProfile",
  typeId: "compiledProfile",
  data: {
    headline: "TypeScript engineer",
    skills: ["typescript", "react", "systems"],
    summary: "Strong full-stack builder with AI workflow experience.",
  },
});

// Record the model call the host made (the runtime never calls a provider itself).
session.logModelIO({
  stepId: "parse",
  direction: "response",
  provider: "openai",
  model: "gpt-example",
  usage: { inputTokens: 120, outputTokens: 40 },
  cost: { amount: 0.0012, currency: "USD" },
  durationMs: 900,
});

session.completeStep("parse");

const telemetry = summarizeModelTelemetry(session.snapshot());
console.log(telemetry.totals, telemetry.byProviderModel);
```

Drop the shell anywhere in your app:

```tsx
```

## Runtime Principles

- The runtime never hides state behind provider-specific abstractions.
- Structured outputs should be schema-validated before they become product state.
- Human edits and approvals are first-class trace events, not side notes.
- Exported runs should be useful for debugging, audits, demos, and learning.
- Model telemetry should be close enough to the workflow that cost and quality can be managed together.
- The public protocol should be boring, explicit, and durable.

## License

LLM Workbench is **proprietary**. All rights reserved. Use, modification, deployment, and operation are limited to Authorized Users (Roy McFarland personally and any entity controlled by Roy McFarland, including Brightline Ltd) except by separate written agreement.

The full text of this grant is in [`LICENSE`](LICENSE), and an identical copy lives in each `packages/*/LICENSE` directory. For licensing inquiries, contact Roy McFarland.

## Contributing

Outside contributions are not currently accepted. Issue reports are welcome through [GitHub Issues](https://github.com/roymcfarland/llm-workbench/issues), but pull requests from third parties will be closed without merge unless a separate written agreement is in place.

## Security

Please report security issues through the process in [SECURITY.md](SECURITY.md).

---

# Protocol overview

LLM Workbench v1.0.0 is a runtime-and-wire-format pair for recording, gating, and replaying LLM-powered work. It is deliberately boring on the surface (JSON in, JSON out, structured trace events) so it survives upgrades, model swaps, framework migrations, and audits.

## Host boundaries

The runtime **never** chooses models, executes prompts, or registers tools on your behalf. Your application owns orchestration policy; LLM Workbench exposes explicit APIs (`WorkbenchSession`) so **recording is intentional**. Every meaningful action turns into typed facts (`TraceEvent`) rather than inferred guesses from stderr or vendor dashboards.

That separation matters when you swap providers or refactor prompts: semantic gates and artifact schemas travel with the run; vendor IDs remain annotations on `model_io`, not primary keys for truth.
## Run bundles

A run bundle is the canonical **export** of a single run: the interchange format for email attachments, audit ZIPs, cold storage, or compliance tooling. It is a JSON document with these top-level fields:

- `protocolVersion`: literal protocol identifier (currently `1.0.0`).
- `run`: the `RunInstance` (id, workflow snapshot, status, timestamps, optional subject and metadata).
- `trace`: ordered array of `TraceEvent`s, each carrying a stable id and ISO timestamp.
- `artifacts`: every version of every artifact ever written (artifacts are append-only in bundle land).
- `ruleSets`: every version of every rule set referenced by the run.
- `engine` (optional): an internal snapshot of step status, gate state, and idempotency keys for byte-faithful rehydration. When absent, the runtime can re-derive these from the trace.
- `integrity.sha256`: hex SHA-256 over **canonical JSON** of `{run, trace, artifacts, ruleSets, engine?}`. The bundle is tamper-evident; verify on import.

Bundles are content-addressed in spirit: canonical serialization sorts object keys lexicographically, drops undefined properties consistently, and rejects cyclic or non-JSON-safe values before hashing, so two semantically equal exports bit-match.

## Live persistence versus export bundles

Day-to-day HTTP persistence (`PUT /api/runs/{runId}` in this reference app) stores a **`RunStoreState`** snapshot: maps for artifacts, gate state, idempotency keys, and step status, optimized for merging and optimistic concurrency.

A **`RunBundle`**, by contrast, is the denormalized archive you hand to another system or verify offline. The runtime translates between them: export flattens maps to ordered arrays for signing; import rehydrates into session state. When you read the OpenAPI spec, you are looking at **store** wire format, not necessarily a fully materialized bundle with `integrity`. Use `export_bundle` via MCP (or equivalent) when you need a signed artifact.
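
The canonical-JSON hashing described under "Run bundles" can be sketched roughly as follows. This is an independent illustration of the stated rules (sorted keys, dropped `undefined` properties, rejection of cyclic or non-JSON-safe values), assuming Node's built-in `node:crypto`; `canonicalize` and `sha256Hex` are hypothetical names, and the runtime's own serializer may differ in details, so hashes from this sketch should not be expected to match real bundles.

```ts
import { createHash } from "node:crypto";

// Illustrative canonical serializer for plain JSON-like data (not the runtime's implementation).
function canonicalize(value: unknown, ancestors: Set<object> = new Set()): string {
  if (value === null) return "null";
  switch (typeof value) {
    case "string":
    case "boolean":
      return JSON.stringify(value);
    case "number":
      if (!Number.isFinite(value)) throw new TypeError("non-finite numbers are not JSON-safe");
      return JSON.stringify(value);
    case "object": {
      const obj = value as object;
      // Track only the current ancestor chain so shared (non-cyclic) references stay legal.
      if (ancestors.has(obj)) throw new TypeError("cyclic value cannot be canonicalized");
      ancestors.add(obj);
      let out: string;
      if (Array.isArray(obj)) {
        out = "[" + obj.map((item) => canonicalize(item, ancestors)).join(",") + "]";
      } else {
        const record = obj as Record<string, unknown>;
        const parts = Object.keys(record)
          .filter((key) => record[key] !== undefined) // drop undefined properties
          .sort() // lexicographic key order
          .map((key) => JSON.stringify(key) + ":" + canonicalize(record[key], ancestors));
        out = "{" + parts.join(",") + "}";
      }
      ancestors.delete(obj);
      return out;
    }
    default:
      // undefined (outside object properties), functions, symbols, and bigints are rejected.
      throw new TypeError("unsupported value of type " + typeof value);
  }
}

// Hex SHA-256 over the canonical form.
function sha256Hex(payload: unknown): string {
  return createHash("sha256").update(canonicalize(payload), "utf8").digest("hex");
}

// Key order and undefined properties do not affect the digest.
const a = sha256Hex({ run: { id: "run_demo_001" }, trace: [] });
const b = sha256Hex({ trace: [], run: { id: "run_demo_001", note: undefined } });
console.log(a === b);
```

A verifier would recompute the digest over the same fields on import and compare it with `integrity.sha256`.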
## Trace events

Every observable runtime fact is a typed trace event. The discriminated union covers:

- `step_started` / `step_completed`: lifecycle of a workflow step.
- `artifact_written` / `artifact_patch`: creating or evolving structured outputs (with idempotency keys to dedupe writes).
- `model_io`: a model call's request, response, or stream chunk, with optional usage, cost, duration, summary, and a redacted payload.
- `tool_call`: a tool invocation with arguments and result.
- `human_gate_requested` / `human_gate_resolved`: pause-and-decide events with explicit decisions (`approved`, `rejected`, `edited`) and optional notes.
- `rule_changed`: snapshot of a new rule-set revision.
- `policy_changed`: a step's gate policy was overridden mid-run.
- `error`: a structured error with optional code and `fatal` flag.
- `run_forked`: the run was branched off another (`parentRunId` or `parentRunIds`).
- `annotation`: free-text human note with optional tags.
- `run_status_changed`: terminal transitions (`completed`, `failed`, `cancelled`).
- `span_started` / `span_ended`: hierarchical spans modeled after OpenTelemetry's GenAI semantic conventions; convertible to OTel spans.

Every event has `id`, `runId`, `ts`, and an optional `stepId` and `correlationId`. The `TraceEventSchema` Zod parser is the authoritative contract: any host that emits structured trace events should validate against it before persistence.

### Ordering, replay, and correlation

The trace is an **append-only** narrative. UIs and replay tooling usually expect events in time order; the schema does not enforce monotonic timestamps (clock skew happens), but exporters should preserve append order as the source of truth for human review.

Use `correlationId` to stitch together a **single logical operation** split across multiple events. For example, every `stream_chunk` for one completion should share one id so downstream analysis can collapse a stream back into one row without heuristics.
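
As a rough sketch of the correlation idea above, a consumer could group events by `correlationId` while preserving append order. `TraceEventLike` is a deliberately narrow, hypothetical stand-in that uses only field names mentioned in this document; real code should rely on the types and `TraceEventSchema` exported by the runtime.

```ts
// Narrow stand-in for a trace event; only fields named in this document are used.
interface TraceEventLike {
  id: string;
  type: string;
  correlationId?: string;
  direction?: "request" | "response" | "stream_chunk";
}

// Hypothetical helper: one group per logical operation, in append order.
// Events without a correlationId form a single-event group keyed by their own id.
function groupByCorrelation(trace: readonly TraceEventLike[]): Map<string, TraceEventLike[]> {
  const groups = new Map<string, TraceEventLike[]>();
  for (const event of trace) {
    // Prefix keys so an event id can never collide with a correlation id.
    const key = event.correlationId !== undefined ? "corr:" + event.correlationId : "evt:" + event.id;
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(event);
    } else {
      groups.set(key, [event]);
    }
  }
  return groups;
}

const sampleTrace: TraceEventLike[] = [
  { id: "evt_1", type: "step_started" },
  { id: "evt_2", type: "model_io", direction: "request", correlationId: "call_a" },
  { id: "evt_3", type: "model_io", direction: "stream_chunk", correlationId: "call_a" },
  { id: "evt_4", type: "model_io", direction: "stream_chunk", correlationId: "call_a" },
  { id: "evt_5", type: "step_completed" },
];

for (const [key, events] of groupByCorrelation(sampleTrace)) {
  console.log(key, events.map((e) => e.id).join(","));
}
```

Because a `Map` iterates in insertion order, the groups come out in the order their first event was appended, which keeps the collapsed view aligned with the trace.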
### Spans and external observability

`span_started` / `span_ended` mirror GenAI semantic layers: you can convert the trace to vendor spans with `traceEventsToOtelSpans` (see `@llm-workbench/runtime`) and ship them to whichever OTLP backend you already operate. That path is complementary, not redundant: OTLP answers “where was latency?” while the bundle still answers “which artifact version did a reviewer approve before it reached a customer?”.

## Gates

Every workflow step carries a gate policy:

- `AUTO`: no gate; runs whenever predecessors are ready.
- `PAUSE_BEFORE`: hold until a reviewer approves **before** the step executes.
- `PAUSE_AFTER`: hold after the step completes until a reviewer approves the result.
- `CHECKPOINT`: arbitrary named checkpoints **inside** a long step; each checkpoint is independently approved.

Gate state is part of the run state and persists across reloads. Resuming a paused run is a `resolveGate` call followed by the next eligible `beginStep`.

Rejections (`human_gate_resolved` with `decision: "rejected"`) are explicit trace facts. They **stop forward progress** until product logic forks or retries; there is no silent skip.

## Artifacts and schemas

Artifacts are versioned, JSON-shaped values keyed by `artifactKey`. They carry a `typeId` that maps to a schema in the `SchemaRegistry`. Writes are either full replacements (`writeArtifact`) or RFC 6902 JSON Patches (`patchArtifact`), and both flavours produce structured trace events so diffs survive replay.

Schemas live in the host: `registerDemoSchemas` ships a useful set of examples; in production you bring your own Ajv-validated JSON Schemas. The runtime refuses to write artifacts that do not validate against the registered schema for their `typeId`.

### Idempotency

Heavy steps may retry (network flakes, duplicate webhook deliveries), so artifact writes carry **idempotency keys** where the host needs deduplication.
Replays with the same key collapse to one version bump, which keeps forensic traces readable without duplicate `artifact_written` noise.

## Telemetry

`summarizeModelTelemetry(state)` reduces the trace into a typed ledger keyed by provider, model, step, user, tenant, and plan. It surfaces input/output token totals, cached and reasoning tokens, and per-currency cost rollups. The ledger is a derived view; the underlying `model_io` events are the system of record.

For **billing truth**, reconcile against your gateway or cloud invoice APIs; the trace ledger is the fast, run-scoped approximation that makes product decisions legible next to workflow structure.

## Forks and lineage

`run_forked` plus `RunInstance` parent linkage (`parentRunId` or plural `parentRunIds`) lets you express forks (a human branches an investigation) or supervisor/worker graphs without losing ancestry. Consumers should prefer helpers that normalize plural vs singular parents, because older bundles may only carry `parentRunId`.

## Migrations

Bundle migration is a single `migrateRunBundle` step keyed off `protocolVersion`. Bumping the version forces an explicit migration path: older bundles are accepted, transformed, and re-signed before they enter the runtime. The runtime refuses to import a bundle whose declared protocol version it does not understand (unless migration hooks extend the importer).
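
The version-keyed migration described above can be pictured with a small dispatcher. This is a sketch under stated assumptions, not the actual `migrateRunBundle`: the `0.9.0` source version, the `migrations` table, and `migrateBundleSketch` are all hypothetical, and re-signing after transformation is left out.

```ts
// Minimal stand-in for a bundle: only the version field matters for dispatch.
interface VersionedBundle {
  protocolVersion: string;
  [key: string]: unknown;
}

type Migration = (bundle: VersionedBundle) => VersionedBundle;

const CURRENT_PROTOCOL_VERSION = "1.0.0";

// Hypothetical migration table keyed by the version a bundle declares.
const migrations = new Map<string, Migration>([
  // "0.9.0" is an invented older version used purely for illustration.
  ["0.9.0", (bundle) => ({ ...bundle, protocolVersion: "1.0.0" })],
]);

// Walk the table until the bundle reaches the current version; refuse unknown versions.
function migrateBundleSketch(bundle: VersionedBundle): VersionedBundle {
  let current = bundle;
  const visited = new Set<string>();
  while (current.protocolVersion !== CURRENT_PROTOCOL_VERSION) {
    const step = migrations.get(current.protocolVersion);
    if (!step || visited.has(current.protocolVersion)) {
      throw new Error("unsupported protocolVersion: " + current.protocolVersion);
    }
    visited.add(current.protocolVersion);
    current = step(current);
  }
  // A real importer would recompute integrity.sha256 here before accepting the bundle.
  return current;
}

console.log(migrateBundleSketch({ protocolVersion: "0.9.0", trace: [] }).protocolVersion);
```

Keying each step on the declared version keeps every upgrade path explicit, which is the property the protocol text asks for.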
## Sample minimal RunBundle

```json
{
  "protocolVersion": "1.0.0",
  "run": {
    "id": "run_demo_001",
    "workflowId": "jobSearchWorkflow",
    "workflowVersion": 1,
    "workflowSnapshot": {
      "id": "jobSearchWorkflow",
      "version": 1,
      "steps": [
        { "id": "parse", "gatePolicy": "PAUSE_BEFORE" },
        { "id": "score", "gatePolicy": "AUTO" }
      ],
      "edges": [{ "id": "e1", "from": "parse", "to": "score" }]
    },
    "startedAt": "2026-04-01T12:00:00.000Z",
    "endedAt": "2026-04-01T12:00:42.000Z",
    "status": "completed"
  },
  "trace": [
    {
      "id": "evt_1",
      "type": "human_gate_resolved",
      "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:01.000Z",
      "stepId": "parse",
      "gate": "PAUSE_BEFORE",
      "decision": "approved"
    },
    {
      "id": "evt_2",
      "type": "step_started",
      "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:02.000Z",
      "stepId": "parse"
    },
    {
      "id": "evt_3",
      "type": "model_io",
      "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:03.000Z",
      "stepId": "parse",
      "direction": "response",
      "provider": "anthropic",
      "model": "claude-haiku-4-5",
      "usage": { "inputTokens": 110, "outputTokens": 40, "totalTokens": 150 },
      "cost": { "amount": 0.003, "currency": "USD" },
      "durationMs": 220
    },
    {
      "id": "evt_4",
      "type": "artifact_written",
      "runId": "run_demo_001",
      "ts": "2026-04-01T12:00:04.000Z",
      "artifact": {
        "artifactKey": "compiledProfile",
        "typeId": "compiledProfile",
        "version": 1,
        "createdAt": "2026-04-01T12:00:04.000Z",
        "data": {
          "headline": "Senior TypeScript engineer",
          "skills": ["typescript", "react", "systems"],
          "summary": "Strong full-stack builder."
        }
      }
    }
  ],
  "artifacts": [],
  "ruleSets": [],
  "integrity": { "sha256": "" }
}
```

## Driving the workbench programmatically

Two complementary surfaces are intended for agents and integrations:

- **REST.** `GET /api/runs` lists runs for the caller's tenant; `GET /api/runs/{runId}` returns serialized **live state** (`RunStoreState` shape); `PUT /api/runs/{runId}` persists the next revision (same wire shape); `DELETE /api/runs/{runId}` removes it.
  Responses mirror what `HttpRunRepository` reads and writes, **not** automatically the signed bundle envelope unless your exporter wraps it. See `/api/openapi.json` for schemas.
- **MCP.** `/api/mcp` is a Streamable HTTP MCP endpoint. The `@llm-workbench/mcp` core registers `list_runs`, `get_run`, `verify_run_integrity`, and `validate_run_bundle`; this reference app adds `start_run`, `resolve_gate`, `write_artifact`, and `export_bundle` (tamper-evident **RunBundle** JSON with engine snapshot; use this when automation needs hashes, not only row state). Discovery lives at `/.well-known/mcp.json`.

Both surfaces share Clerk-based auth and the tenant-scoping rules described in `/agents.md`.

---

# Driving the workbench programmatically

Two surfaces expose the runtime over the network:

## REST (`/api/runs` and `/api/runs/{runId}`)

- `GET /api/runs?limit=N` returns `SavedRunMeta[]` for the caller's tenant.
- `GET /api/runs/{runId}` returns the serialized `RunStoreState` (the same wire format `HttpRunRepository` produces).
- `PUT /api/runs/{runId}` persists a serialized state. Body limit is 25 MB. Validates structural invariants and rejects `state.run.id !== runId`.
- `DELETE /api/runs/{runId}` removes a run.
- All responses carry `Link: ; rel="describedby"`.
- Auth is Clerk-based: the request must carry a session cookie (or a Clerk bearer token in production deployments). Tenants are derived as `orgId ?? "user:" + userId`.

The full schema lives at `/api/openapi.json` (OpenAPI 3.1).

## MCP (`/api/mcp`)

A Streamable HTTP MCP endpoint registers:

- Core (`@llm-workbench/mcp`): `list_runs`, `get_run`, `verify_run_integrity`, `validate_run_bundle`.
- Reference app additions: `start_run`, `resolve_gate`, `write_artifact`, `export_bundle` (full-profile tamper-evident bundle).

Discovery via `/.well-known/mcp.json`. Resources expose `runs://{runId}` bundle URIs; see `packages/mcp/README.md`.

HTML crawlers and link previews do not carry Clerk sessions.
Middleware explicitly allows OG/Twitter metadata image routes (`/opengraph-image`, `/twitter-image`) and marketing paths; tenant APIs and MCP stay behind auth (`robots.txt` `Disallow` on private APIs for crawl-budget hygiene).

## Error model

Errors are JSON: `{ "error": "", "code": "" }`. Status codes follow standard REST conventions: `400` invalid body, `401` missing session, `404` unknown run, `413` body too large, `500` for unexpected failures.

## Rate limits

No rate limits are enforced in this reference deployment. Production deployments must add a per-tenant limiter at the API or MCP layer before exposing this surface to untrusted clients.

---

# Trace event reference

Every observable runtime fact is one of these typed events. The discriminated union is the authoritative contract; host code should validate against `TraceEventSchema` before persistence.

All events carry `id`, `runId`, `ts`, optional `stepId`, optional `correlationId`.

- `step_started`: a workflow step has started executing.
- `step_completed`: a step finished; `ok: boolean` plus optional `error`.
- `artifact_written`: a new artifact version was written (full replacement).
- `artifact_patch`: an artifact was advanced via RFC 6902 JSON Patch ops.
- `model_io`: a model call (`request`, `response`, or `stream_chunk`) with optional provider, model, usage, cost, durationMs, summary, and a redacted payload.
- `tool_call`: a tool was invoked, with arguments and result.
- `human_gate_requested`: the runtime is paused waiting for a human decision (`PAUSE_BEFORE`, `PAUSE_AFTER`, or `CHECKPOINT`).
- `human_gate_resolved`: a human delivered a decision (`approved`, `rejected`, `edited`) plus an optional note.
- `rule_changed`: a rule set was updated; the snapshot is embedded.
- `policy_changed`: a step's gate policy was overridden mid-run.
- `error`: a structured error with optional code and `fatal` flag.
- `run_forked`: the run was branched off another run (`parentRunId` or `parentRunIds`).
- `annotation`: free-text human annotation with optional tags.
- `run_status_changed`: terminal transitions: `completed`, `failed`, `cancelled`.
- `span_started` / `span_ended`: hierarchical spans modeled after OpenTelemetry GenAI semconv; convertible to OTel via `traceEventsToOtelSpans`.
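
For readers who consume these events in TypeScript, the following sketch shows how a discriminated union lets the compiler check that each event type is handled. It hand-models only four of the events above, using field names that appear in this document; `SketchTraceEvent` and `describeEvent` are hypothetical, and the authoritative shapes remain those of `TraceEventSchema`.

```ts
// Fields shared by every event, as listed in the reference above.
interface BaseEvent {
  id: string;
  runId: string;
  ts: string;
  stepId?: string;
  correlationId?: string;
}

// Hand-written subset of the union, for illustration only.
type SketchTraceEvent =
  | (BaseEvent & { type: "step_started" })
  | (BaseEvent & { type: "step_completed"; ok: boolean })
  | (BaseEvent & {
      type: "human_gate_resolved";
      gate: "PAUSE_BEFORE" | "PAUSE_AFTER" | "CHECKPOINT";
      decision: "approved" | "rejected" | "edited";
    })
  | (BaseEvent & {
      type: "model_io";
      direction: "request" | "response" | "stream_chunk";
      provider?: string;
      model?: string;
    });

// Switching on `type` narrows the event so only valid fields are reachable.
function describeEvent(event: SketchTraceEvent): string {
  const step = event.stepId ?? "(no step)";
  switch (event.type) {
    case "step_started":
      return `step ${step} started`;
    case "step_completed":
      return `step ${step} ${event.ok ? "succeeded" : "failed"}`;
    case "human_gate_resolved":
      return `${event.gate} gate on ${step} was ${event.decision}`;
    case "model_io":
      return `model ${event.direction} from ${event.provider ?? "unknown provider"}`;
    default: {
      // Compile-time exhaustiveness check: adding a variant without a case fails to type-check.
      const unreachable: never = event;
      throw new Error("unhandled event: " + JSON.stringify(unreachable));
    }
  }
}

console.log(
  describeEvent({
    id: "evt_1",
    type: "human_gate_resolved",
    runId: "run_demo_001",
    ts: "2026-04-01T12:00:01.000Z",
    stepId: "parse",
    gate: "PAUSE_BEFORE",
    decision: "approved",
  }),
);
```

The `never` assignment in the default branch is a common TypeScript idiom for keeping handlers in sync when the union grows.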