linkworld eval — scenario regression tests

linkworld eval is a scenario test runner for your app. You declare inputs (inbound messages, schedule firings, tool calls) and expectations (which platform tools the handler should call, what they should be called with, what the handler should return); the runner loads main.py in-process, drives each scenario through the SDK’s TestClient, and reports pass/fail.

Like pytest — but for agents. Drop into CI with --junit.

$ linkworld eval # default: ./linkworld.eval.yaml + ./main.py
$ linkworld eval scenarios/regressions.yaml # custom path
$ linkworld eval --filter "schedule" # only scenarios whose name contains "schedule"
$ linkworld eval --junit reports/eval.xml # write JUnit for CI
$ linkworld eval --json reports/eval.json --quiet # write JSON, suppress per-scenario table

Exit codes: 0 = all scenarios passed, 1 = at least one scenario failed, 2 = configuration problem (bad YAML, missing main.py, etc.).
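A wrapper script can turn these exit codes into readable status lines. The following is a minimal sketch, not part of linkworld: the helper names `describe_exit` and `run_eval` are hypothetical, and the only facts taken from this page are the three exit codes and the `linkworld eval` command itself.

```python
import subprocess
import sys

# Meanings copied from the exit-code list on this page.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "configuration problem (bad YAML, missing main.py, ...)",
}


def describe_exit(code: int) -> str:
    """Map a `linkworld eval` exit code to a human-readable status."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval(extra_args: list[str]) -> int:
    """Run `linkworld eval` with extra arguments and report what the exit code means.

    Hypothetical helper: requires the `linkworld` CLI to be on PATH.
    """
    result = subprocess.run(["linkworld", "eval", *extra_args])
    print(describe_exit(result.returncode), file=sys.stderr)
    return result.returncode
```

Distinguishing 1 from 2 is useful in CI: a 1 means the app regressed, while a 2 usually means the pipeline itself is misconfigured.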

linkworld.eval.yaml
scenarios:
  - name: <human-readable>
    given:
      tools:
        # Mock responses keyed by platform tool name. Same shape as
        # MockTools.set_response: either a dict (returned verbatim)
        # or any value the runner passes back to the handler's
        # ctx.tools.call(...) await.
        crm_lookup:
          ok: true
          result: { customer: "Acme Inc" }
        email_send:
          ok: true
      secrets:
        OPENAI_KEY: "test-key"
    when:
      type: inbound | schedule | tool | install | uninstall | user_added
      # type-specific fields below
    then:
      tool_calls:                  # ordered subset match
        - name: crm_lookup
          args_match: { id: "42" }
      tool_calls_count: 2          # exact total
      tool_calls_include:          # unordered "must contain" match
        - name: email_send
      handler_returns:             # only meaningful for type=tool
        ok: true
      handler_raises:              # expect the handler to raise
        type: ToolCallError
        message_contains: "scope_denied"

All fields under then are optional; an empty then: {} means “the scenario just needs to not throw.”

when:
  type: inbound
  message_text: "Help with order #42"
  channel: email                 # default: web
  modality: text                 # default: text
  channel_thread_id: "thread-1"  # optional
  attachments: []                # optional list of {name, mime, content_b64}
  tenant_id: t-eval              # optional
  user_id: u-eval                # optional

when:
  type: schedule
  schedule_name: daily           # name registered via @app.on_schedule("daily")

when:
  type: tool
  tool_name: classify
  args:
    text: "Help with order #42"

For type: tool scenarios, the handler’s return value is captured and asserted via then.handler_returns.

when: { type: install }
when: { type: uninstall }
when: { type: user_added, user_id: u-new }

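Each entry in then.tool_calls asserts on the i-th call the handler made, in order. Extra args in the actual call are ignored (args_match is a subset match). The number of entries must equal the number of calls the handler made unless tool_calls_count is also set, in which case tool_calls_count wins.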

then:
  tool_calls:
    - name: crm_lookup
      args_match: { id: "42" }   # actual call must have id == "42"
    - name: email_send
      args_match: { to: "[email protected]" }

then:
  tool_calls_count: 2
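The ordered matching rules can be sketched in plain Python. This is an illustration of the semantics described on this page, not the runner's actual code: `args_subset` and `check_ordered` are hypothetical names, and recorded calls are assumed to be dicts of the form `{"name": ..., "args": {...}}`. It also assumes that "tool_calls_count wins" means the count replaces the length check while the listed entries are still compared position by position.

```python
from typing import Any, Optional


def args_subset(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """True if every expected key is present in actual with an equal value."""
    return all(k in actual and actual[k] == v for k, v in expected.items())


def check_ordered(
    expected: list[dict[str, Any]],
    actual: list[dict[str, Any]],
    count: Optional[int] = None,
) -> list[str]:
    """Return failure messages; an empty list means the check passed."""
    failures = []
    # Assumption: an explicit count overrides the "lengths must match" rule.
    wanted_total = count if count is not None else len(expected)
    if len(actual) != wanted_total:
        failures.append(f"expected {wanted_total} tool calls, got {len(actual)}")
    for i, exp in enumerate(expected):
        if i >= len(actual):
            failures.append(f"call {i}: missing (expected {exp['name']})")
            continue
        act = actual[i]
        if act["name"] != exp["name"]:
            failures.append(f"call {i}: expected {exp['name']}, got {act['name']}")
        elif not args_subset(exp.get("args_match", {}), act.get("args", {})):
            failures.append(f"call {i}: args do not match {exp['args_match']}")
    return failures
```

With two recorded calls, a single expected entry passes only when `count=2` is supplied; without it, the length mismatch is reported as a failure.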

then.tool_calls_include — unordered match

For each entry, at least one actual call must match (by name + args subset). Useful when the handler may make extra book-keeping calls in any order.

then:
  tool_calls_include:
    - name: report_pdf_generate
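The unordered rule reduces to "for each expected entry, some recorded call matches". A minimal sketch under the same assumptions as before (recorded calls are dicts with `name` and `args` keys; `missing_includes` is a hypothetical helper, not a linkworld API):

```python
from typing import Any


def call_matches(exp: dict[str, Any], act: dict[str, Any]) -> bool:
    """Name must be equal; args_match must be a subset of the actual args."""
    if act["name"] != exp["name"]:
        return False
    actual_args = act.get("args", {})
    wanted = exp.get("args_match", {})
    return all(k in actual_args and actual_args[k] == v for k, v in wanted.items())


def missing_includes(
    expected: list[dict[str, Any]], actual: list[dict[str, Any]]
) -> list[str]:
    """Return the names of expected entries that no recorded call satisfies."""
    return [
        exp["name"]
        for exp in expected
        if not any(call_matches(exp, act) for act in actual)
    ]
```

Because position is ignored, extra book-keeping calls before or after the call of interest do not affect the result.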

then.handler_returns — for tool scenarios

then:
  handler_returns:
    label: "support"   # subset-matched against the dict the tool returned

then:
  handler_raises:
    type: ToolCallError
    message_contains: "scope_denied"

When handler_raises is set, the runner expects the handler to throw and reports a failure if it doesn’t. Without it, any exception is a scenario failure (with the error captured in the report).
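The two cases above (exception expected versus not expected) can be sketched as a single check. This is an illustration only: `check_raises` is a hypothetical helper, the `ToolCallError` class is a local stand-in so the snippet runs on its own, and matching `type` by exact class name is an assumption, since this page does not say whether subclasses also match.

```python
from typing import Optional


class ToolCallError(Exception):
    """Local stand-in for the SDK's ToolCallError, defined only for this sketch."""


def check_raises(
    spec: Optional[dict], exc: Optional[BaseException]
) -> Optional[str]:
    """Return None if the outcome matches the spec, else a failure message."""
    if spec is None:
        # No handler_raises: any exception fails the scenario.
        return None if exc is None else f"unexpected {type(exc).__name__}: {exc}"
    if exc is None:
        return "expected the handler to raise, but it returned normally"
    # Assumption: `type` is compared against the exception's class name.
    if "type" in spec and type(exc).__name__ != spec["type"]:
        return f"expected {spec['type']}, got {type(exc).__name__}"
    if "message_contains" in spec and spec["message_contains"] not in str(exc):
        return f"message {str(exc)!r} does not contain {spec['message_contains']!r}"
    return None
```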

.github/workflows/test.yml
- run: pip install linkworld
- run: linkworld eval --junit eval-report.xml
- if: always()
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: eval-report.xml

The JUnit XML follows the standard schema, so GitHub’s test summary, GitLab’s test reports, Buildkite Test Analytics, and similar tools can all render it.
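Because the report is ordinary JUnit XML, it can also be post-processed with the standard library. The sketch below assumes the common JUnit layout (`testcase` elements with an optional `failure` or `error` child); the exact attributes `linkworld eval` writes are not specified on this page, so `SAMPLE` is an invented report and `summarize_junit` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample in the common JUnit layout, not real linkworld output.
SAMPLE = """<testsuite name="eval" tests="2" failures="1">
  <testcase name="inbound order lookup"/>
  <testcase name="schedule daily">
    <failure message="expected 2 tool calls, got 1"/>
  </testcase>
</testsuite>"""


def summarize_junit(xml_text: str) -> dict[str, int]:
    """Count total, failed, and passed test cases in a JUnit XML document."""
    root = ET.fromstring(xml_text)
    total = 0
    failed = 0
    # iter() walks the whole tree, so both <testsuite> and <testsuites> roots work.
    for case in root.iter("testcase"):
        total += 1
        if case.find("failure") is not None or case.find("error") is not None:
            failed += 1
    return {"total": total, "failed": failed, "passed": total - failed}
```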

The runner imports your main.py in-process (so the same Python process holds the App instance and all its handlers). For each scenario, a fresh TestClient with MockTools and MockSecrets is built, the relevant handler is invoked, and the captured tool calls are asserted against the spec.
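The per-scenario flow (fresh mocks, one handler invocation, captured calls) can be illustrated without the SDK. Everything in this sketch is a stand-in: `RecordingTools` imitates the role MockTools plays, the `call(name, **args)` signature is an assumption about `ctx.tools.call`, and `handler` is a toy function rather than one registered in a real main.py.

```python
import asyncio
from typing import Any


class RecordingTools:
    """Hypothetical stand-in for MockTools: canned responses plus a call log."""

    def __init__(self, responses: dict[str, Any]):
        self._responses = responses
        self.calls: list[dict[str, Any]] = []

    async def call(self, name: str, **args: Any) -> Any:
        # Record the call, then return the canned response (None if not configured).
        self.calls.append({"name": name, "args": args})
        return self._responses.get(name)


async def handler(tools: RecordingTools, message_text: str) -> None:
    """Toy inbound handler: look up a customer, then send an email."""
    customer = await tools.call("crm_lookup", id="42")
    await tools.call("email_send", to="ops", body=f"{message_text}: {customer}")


def run_scenario(scenario: dict[str, Any]) -> list[dict[str, Any]]:
    """Build fresh mocks, invoke the handler once, and return the captured calls."""
    tools = RecordingTools(scenario["given"]["tools"])
    asyncio.run(handler(tools, scenario["when"]["message_text"]))
    return tools.calls
```

Building a new `RecordingTools` per scenario is what keeps scenarios independent: no call log or canned response leaks from one scenario into the next.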

No real LLM calls, no platform connection, no Docker — pure-Python execution against the in-memory app. That makes scenarios fast (~ms per scenario) and deterministic.

For chats that need a real LLM (e.g. asserting on agent reasoning), use the live playground instead — see linkworld dev once we ship the chat surface.