linkworld eval — scenario regression tests

linkworld eval is a scenario test runner for your app. You declare inputs (inbound messages, schedule firings, tool calls) and expectations (which platform tools the handler should call, what they should be called with, what the handler should return); the runner loads main.py in-process, drives each scenario through the SDK’s TestClient, and reports pass/fail.

Like pytest, but for agents. Drop it into CI with --junit.

Quickstart

$ linkworld eval                                     # default: ./linkworld.eval.yaml + ./main.py
$ linkworld eval scenarios/regressions.yaml          # custom path
$ linkworld eval --filter "schedule"                 # only scenarios whose name contains "schedule"
$ linkworld eval --junit reports/eval.xml            # write JUnit for CI
$ linkworld eval --json reports/eval.json --quiet    # write JSON, suppress per-scenario table

Exit codes: 0 = all scenarios passed, 1 = at least one scenario failed, 2 = configuration problem (bad YAML, no main.py, etc.).
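If a Python script wraps the runner (for example in a custom CI step), the exit codes above can be mapped to readable statuses. This is a minimal sketch; `describe_exit` and `run_eval` are hypothetical helper names, not part of the linkworld CLI, and `run_eval` assumes the `linkworld` executable is on PATH.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "configuration problem",
}


def describe_exit(code: int) -> str:
    """Translate a runner exit code into a short status string."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval(*extra_args: str) -> str:
    """Run `linkworld eval` with extra CLI arguments and describe the result."""
    completed = subprocess.run(["linkworld", "eval", *extra_args])
    return describe_exit(completed.returncode)
```

Calling `run_eval("--junit", "reports/eval.xml")` would then return one of the three status strings, or the fallback for any code not documented here.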

File format

scenarios:
  - name: <human-readable>
    given:
      tools:
        # Mock responses keyed by platform tool name. Same shape as
        # MockTools.set_response: each value (usually a dict) is
        # returned verbatim as the result of the handler's awaited
        # ctx.tools.call(...) for that tool.
        crm_lookup:
          ok: true
          result: { customer: "Acme Inc" }
        email_send:
          ok: true
      secrets:
        OPENAI_KEY: "test-key"
    when:
      type: inbound | schedule | tool | install | uninstall | user_added
      # type-specific fields below
    then:
      tool_calls:           # ordered match; args are subset-matched
        - name: crm_lookup
          args_match: { id: "42" }
      tool_calls_count: 2   # exact total
      tool_calls_include:   # unordered "must contain" match
        - name: email_send
      handler_returns:      # only meaningful for type=tool
        ok: true
      handler_raises:       # expect the handler to raise
        type: ToolCallError
        message_contains: "scope_denied"

All fields under then are optional; an empty then: {} means “the handler just needs to run without raising.”

When-types

inbound

when:
  type: inbound
  message_text: "Help with order #42"
  channel: email                 # default: web
  modality: text                 # default: text
  channel_thread_id: "thread-1"  # optional
  attachments: []                # optional list of {name, mime, content_b64}
  tenant_id: t-eval              # optional
  user_id: u-eval                # optional

schedule

when:
  type: schedule
  schedule_name: daily   # name registered via @app.on_schedule("daily")

tool

when:
  type: tool
  tool_name: classify
  args:
    text: "Help with order #42"

The handler’s return value is captured and asserted via then.handler_returns.

install / uninstall / user_added

when: { type: install }
when: { type: uninstall }
when: { type: user_added, user_id: u-new }

Assertions

`then.tool_calls` — ordered match

Entry i asserts on the i-th call the handler made. Args are subset-matched: keys present in the actual call but absent from args_match are ignored. The number of entries must equal the number of actual calls unless tool_calls_count is also set, in which case tool_calls_count takes precedence.

then:
  tool_calls:
    - name: crm_lookup
      args_match: { id: "42" }   # actual call must have id == "42"
    - name: email_send
      args_match: { to: "user@example.com" }
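The ordered-match rule can be sketched in a few lines of Python. This is an illustration of the semantics described above, not the runner's actual code: the `{"name": ..., "args": ...}` shape for recorded calls, the shallow (top-level keys only) subset comparison, and the reading of "tool_calls_count wins" as replacing the length check are all assumptions.

```python
def args_subset(expected: dict, actual: dict) -> bool:
    """True if every expected key is present in actual with an equal value."""
    return all(key in actual and actual[key] == value for key, value in expected.items())


def match_ordered(expected_calls, actual_calls, tool_calls_count=None) -> bool:
    """Compare expected entries against recorded calls position by position."""
    # Assumed length rule: tool_calls_count, when given, replaces the
    # "as many entries as calls" requirement.
    required_total = len(expected_calls) if tool_calls_count is None else tool_calls_count
    if len(actual_calls) != required_total or len(expected_calls) > len(actual_calls):
        return False
    for expected, actual in zip(expected_calls, actual_calls):
        if expected["name"] != actual["name"]:
            return False
        if not args_subset(expected.get("args_match", {}), actual["args"]):
            return False
    return True


# Hypothetical recorded calls; the extra args (region, subject) are ignored.
recorded = [
    {"name": "crm_lookup", "args": {"id": "42", "region": "eu"}},
    {"name": "email_send", "args": {"to": "ops", "subject": "Order 42"}},
]
spec = [{"name": "crm_lookup", "args_match": {"id": "42"}}, {"name": "email_send"}]
print(match_ordered(spec, recorded))  # True
```

Swapping the two entries in `spec` makes the match fail, since position matters here.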

`then.tool_calls_count` — exact count

then:
  tool_calls_count: 2

`then.tool_calls_include` — unordered match

For each entry, at least one actual call must match (by name + args subset). Useful when the handler may make extra book-keeping calls in any order.

then:
  tool_calls_include:
    - name: report_pdf_generate
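For comparison, the unordered variant only asks that each expected entry match some recorded call. A minimal sketch under the same assumptions as before (recorded calls as `{"name": ..., "args": ...}` dicts, shallow subset comparison); `audit_log` is a made-up bookkeeping tool used only for illustration.

```python
def call_matches(expected: dict, actual: dict) -> bool:
    """Match by tool name plus a shallow subset check on args."""
    if expected["name"] != actual["name"]:
        return False
    wanted = expected.get("args_match", {})
    return all(k in actual["args"] and actual["args"][k] == v for k, v in wanted.items())


def includes_all(expected_calls, actual_calls) -> bool:
    """Every expected entry must match at least one recorded call, in any position."""
    return all(
        any(call_matches(expected, actual) for actual in actual_calls)
        for expected in expected_calls
    )


# Hypothetical recording with bookkeeping calls around the one we care about.
recorded = [
    {"name": "audit_log", "args": {"event": "start"}},
    {"name": "report_pdf_generate", "args": {"template": "weekly"}},
    {"name": "audit_log", "args": {"event": "done"}},
]
print(includes_all([{"name": "report_pdf_generate"}], recorded))  # True
```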

`then.handler_returns` — for tool scenarios

then:
  handler_returns:
    label: "support"     # subset-matched against the dict the tool returned

`then.handler_raises` — expected error

then:
  handler_raises:
    type: ToolCallError
    message_contains: "scope_denied"

When handler_raises is set, the runner expects the handler to raise and reports a failure if it doesn’t. Without it, any exception fails the scenario, and the error is captured in the report.
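The pass/fail logic for exceptions can be sketched as follows. This is a guess at the semantics, not the runner's implementation: it assumes `type` is compared against the exception's class name and `message_contains` against `str(exc)`. The `ToolCallError` class below is a local stand-in so the example runs without the SDK.

```python
class ToolCallError(Exception):
    """Local stand-in for the SDK error type, for illustration only."""


def judge_exception(spec, exc) -> bool:
    """Return True if the scenario passes given the raised exception (or None)."""
    if spec is None:
        # No handler_raises: any exception is a failure.
        return exc is None
    if exc is None:
        # handler_raises was set but the handler returned normally.
        return False
    if "type" in spec and type(exc).__name__ != spec["type"]:
        return False
    if "message_contains" in spec and spec["message_contains"] not in str(exc):
        return False
    return True


spec = {"type": "ToolCallError", "message_contains": "scope_denied"}
print(judge_exception(spec, ToolCallError("scope_denied: email_send")))  # True
print(judge_exception(spec, None))                                       # False
```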

CI integration

- run: pip install linkworld
- run: linkworld eval --junit eval-report.xml
- if: always()
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: eval-report.xml

The report uses the common JUnit XML format, so GitHub’s test summary, GitLab’s test reports, Buildkite Test Analytics, and similar tools can render it.
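When a CI system has no JUnit renderer, the report can also be summarized with the Python standard library. The sketch below assumes the usual JUnit layout (`testcase` elements with an optional `failure` or `error` child); the sample XML is hand-written for illustration and is not actual runner output.

```python
import xml.etree.ElementTree as ET


def failed_cases(junit_xml: str) -> list[str]:
    """Return the names of test cases that contain a <failure> or <error> child."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name", ""))
    return failed


# Hypothetical report content, written by hand for this example.
SAMPLE = """<testsuite name="eval" tests="2" failures="1">
  <testcase name="inbound order lookup"/>
  <testcase name="daily schedule">
    <failure message="expected 2 tool calls, got 1"/>
  </testcase>
</testsuite>"""

print(failed_cases(SAMPLE))  # ['daily schedule']
```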

How it works

The runner imports your main.py in-process (so the same Python process holds the App instance and all its handlers). For each scenario, a fresh TestClient with MockTools and MockSecrets is built, the relevant handler is invoked, and the captured tool calls are asserted against the spec.

No real LLM calls, no platform connection, no Docker: scenarios run as pure Python against the in-memory app, which makes them fast (on the order of milliseconds each) and deterministic.
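The per-scenario loop described above can be approximated with a small self-contained sketch. Everything here is a stand-in: `RecordingTools` and `Ctx` are hypothetical substitutes for the SDK's MockTools and handler context, the `call(name, **args)` signature is assumed, and only `tool_calls_count` is checked. The point is the shape of the flow: fresh mocks per scenario, one handler invocation, then assertions on the captured calls.

```python
import asyncio


class RecordingTools:
    """Hypothetical stand-in for MockTools: canned responses plus a call log."""

    def __init__(self, responses):
        self.responses = responses
        self.calls = []

    async def call(self, name, **args):
        self.calls.append({"name": name, "args": args})
        return self.responses.get(name, {"ok": True})


class Ctx:
    """Hypothetical handler context exposing tools and secrets."""

    def __init__(self, tools, secrets):
        self.tools = tools
        self.secrets = secrets


async def handle_inbound(ctx, message_text):
    """Example handler: look up the customer, then send an email."""
    lookup = await ctx.tools.call("crm_lookup", id="42")
    await ctx.tools.call("email_send", to="ops", body=lookup["result"]["customer"])


def run_scenario(handler, scenario):
    """Build fresh mocks, invoke the handler once, and check the call count."""
    given = scenario.get("given", {})
    tools = RecordingTools(given.get("tools", {}))
    ctx = Ctx(tools, given.get("secrets", {}))
    try:
        asyncio.run(handler(ctx, scenario["when"]["message_text"]))
    except Exception as exc:  # without handler_raises, any exception fails
        return {"passed": False, "error": repr(exc), "calls": tools.calls}
    expected_count = scenario.get("then", {}).get("tool_calls_count")
    passed = expected_count is None or len(tools.calls) == expected_count
    return {"passed": passed, "error": None, "calls": tools.calls}


scenario = {
    "given": {
        "tools": {
            "crm_lookup": {"ok": True, "result": {"customer": "Acme Inc"}},
            "email_send": {"ok": True},
        },
        "secrets": {"OPENAI_KEY": "test-key"},
    },
    "when": {"type": "inbound", "message_text": "Help with order #42"},
    "then": {"tool_calls_count": 2},
}
result = run_scenario(handle_inbound, scenario)
print(result["passed"], [call["name"] for call in result["calls"]])
```

Because the mocks are rebuilt on every call to `run_scenario`, recorded calls from one scenario never leak into the next, which is what keeps results deterministic.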

For chats that need a real LLM (e.g. asserting on agent reasoning), use the live playground instead — see linkworld dev once we ship the chat surface.