linkworld eval — scenario regression tests
linkworld eval is a scenario test runner for your app. You declare
inputs (inbound messages, schedule firings, tool calls) and
expectations (which platform tools the handler should call, what they
should be called with, what the handler should return); the runner
loads main.py in-process, drives each scenario through the SDK’s
TestClient, and reports pass/fail.
Like pytest — but for agents. Drop into CI with --junit.
Quickstart
    $ linkworld eval                                   # default: ./linkworld.eval.yaml + ./main.py
    $ linkworld eval scenarios/regressions.yaml        # custom path
    $ linkworld eval --filter "schedule"               # only scenarios whose name contains "schedule"
    $ linkworld eval --junit reports/eval.xml          # write JUnit for CI
    $ linkworld eval --json reports/eval.json --quiet  # write JSON, suppress per-scenario table

Exit codes: 0 = all passed, 1 = at least one failure, 2 = config
problem (bad YAML, no main.py, etc.).
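For wrapper scripts that need to branch on the result, the three exit codes above fit in a small lookup table. This is a minimal sketch, assuming only the codes documented here; the helper names `describe_exit` and `run_eval` are hypothetical and not part of linkworld.

```python
import subprocess
import sys

# Meanings taken from the exit codes documented above.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "config problem (bad YAML, no main.py, etc.)",
}


def describe_exit(code: int) -> str:
    """Return a human-readable meaning for a `linkworld eval` exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval(extra_args: list[str]) -> int:
    """Run `linkworld eval` with extra CLI arguments and report the outcome.

    Hypothetical wrapper: it requires the linkworld CLI on PATH and is
    defined here but not called.
    """
    result = subprocess.run(["linkworld", "eval", *extra_args])
    print(describe_exit(result.returncode), file=sys.stderr)
    return result.returncode
```

A CI step could call `sys.exit(run_eval(["--junit", "reports/eval.xml"]))` so the job status mirrors the runner's exit code.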
File format
    scenarios:
      - name: <human-readable>
        given:
          tools:
            # Mock responses keyed by platform tool name. Same shape as
            # MockTools.set_response: either a dict (returned verbatim)
            # or any value the runner passes back to the handler's
            # ctx.tools.call(...) await.
            crm_lookup:
              ok: true
              result: { customer: "Acme Inc" }
            email_send:
              ok: true
          secrets:
            OPENAI_KEY: "test-key"
        when:
          type: inbound | schedule | tool | install | uninstall | user_added
          # type-specific fields below
        then:
          tool_calls:                # ordered subset match
            - name: crm_lookup
              args_match: { id: "42" }
          tool_calls_count: 2        # exact total
          tool_calls_include:        # unordered "must contain" match
            - name: email_send
          handler_returns:           # only meaningful for type=tool
            ok: true
          handler_raises:            # expect the handler to raise
            type: ToolCallError
            message_contains: "scope_denied"

All fields under then are optional; an empty then: {} means
“the scenario just needs to not throw.”
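To catch typos before a run, the structure above can be checked against a parsed scenario mapping. This is a minimal sketch, not the runner's own validation: `lint_scenario` is a hypothetical helper, and the allowed when-types and then-keys are copied from the format shown above.

```python
# Allowed values copied from the file format above.
WHEN_TYPES = {"inbound", "schedule", "tool", "install", "uninstall", "user_added"}
THEN_KEYS = {
    "tool_calls",
    "tool_calls_count",
    "tool_calls_include",
    "handler_returns",
    "handler_raises",
}


def lint_scenario(scenario: dict) -> list[str]:
    """Return a list of problems found in one parsed scenario mapping."""
    problems = []
    when = scenario.get("when") or {}
    if when.get("type") not in WHEN_TYPES:
        problems.append(f"unknown when.type: {when.get('type')!r}")
    then = scenario.get("then") or {}
    # Flag keys under `then` that the format above does not list.
    for key in sorted(set(then) - THEN_KEYS):
        problems.append(f"unknown then key: {key}")
    if "handler_returns" in then and when.get("type") != "tool":
        problems.append("handler_returns is only meaningful for type=tool")
    return problems


good = {
    "name": "lookup then email",
    "when": {"type": "inbound", "message_text": "Help with order #42"},
    "then": {"tool_calls_count": 2},
}
bad = {"name": "typo", "when": {"type": "inbond"}, "then": {"tool_call": []}}

print(lint_scenario(good))  # → []
print(lint_scenario(bad))   # → ["unknown when.type: 'inbond'", 'unknown then key: tool_call']
```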
When-types
inbound

    when:
      type: inbound
      message_text: "Help with order #42"
      channel: email                  # default: web
      modality: text                  # default: text
      channel_thread_id: "thread-1"   # optional
      attachments: []                 # optional list of {name, mime, content_b64}
      tenant_id: t-eval               # optional
      user_id: u-eval                 # optional

schedule

    when:
      type: schedule
      schedule_name: daily   # name registered via @app.on_schedule("daily")

tool

    when:
      type: tool
      tool_name: classify
      args:
        text: "Help with order #42"

The handler’s return value is captured and asserted via
then.handler_returns.
install / uninstall / user_added
    when: { type: install }
    when: { type: uninstall }
    when: { type: user_added, user_id: u-new }

Assertions
then.tool_calls (ordered match)
Each entry asserts on the i-th call the handler made. Extra args in
the actual call are ignored (subset match). The number of actual calls
must equal the number of entries unless tool_calls_count is also set,
in which case that count takes precedence.

    then:
      tool_calls:
        - name: crm_lookup
          args_match: { id: "42" }   # actual call must have id == "42"
        - name: email_send

then.tool_calls_count (exact count)
    then:
      tool_calls_count: 2

then.tool_calls_include (unordered match)
For each entry, at least one actual call must match (by name + args subset). Useful when the handler may make extra book-keeping calls in any order.

    then:
      tool_calls_include:
        - name: report_pdf_generate

then.handler_returns (for tool scenarios)
    then:
      handler_returns:
        label: "support"   # subset-matched against the dict the tool returned

then.handler_raises (expected error)
    then:
      handler_raises:
        type: ToolCallError
        message_contains: "scope_denied"

When handler_raises is set, the runner expects the handler to raise
and reports a failure if it doesn’t. Without it, any exception is a
scenario failure (with the error captured in the report).
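The matching rules in this section can be sketched in a few functions. This is one reading of the documented semantics, not the runner's actual code. It assumes recorded calls are dicts with `name` and `args` keys, that the subset match is shallow (top-level keys only), and that a set tool_calls_count replaces the length check for the ordered match; all function names are hypothetical.

```python
from typing import Optional


def args_subset(expected: dict, actual: dict) -> bool:
    """True if every expected key is present in actual with an equal value."""
    return all(k in actual and actual[k] == v for k, v in expected.items())


def call_matches(expected: dict, actual: dict) -> bool:
    """Match one spec entry against one recorded call (name + args subset)."""
    return expected["name"] == actual["name"] and args_subset(
        expected.get("args_match", {}), actual.get("args", {})
    )


def check_ordered(expected: list, actual: list, count: Optional[int] = None) -> bool:
    """then.tool_calls: the i-th entry must match the i-th recorded call."""
    wanted_len = len(expected) if count is None else count
    if len(actual) != wanted_len or len(expected) > len(actual):
        return False
    return all(call_matches(e, a) for e, a in zip(expected, actual))


def check_include(expected: list, actual: list) -> bool:
    """then.tool_calls_include: each entry matches at least one recorded call."""
    return all(any(call_matches(e, a) for a in actual) for e in expected)


# Hypothetical recorded calls from one handler run.
actual = [
    {"name": "crm_lookup", "args": {"id": "42", "verbose": True}},
    {"name": "audit_log", "args": {}},
    {"name": "email_send", "args": {"to": "a@example.com"}},
]
lookup = {"name": "crm_lookup", "args_match": {"id": "42"}}

print(check_ordered([lookup], actual, count=3))             # → True
print(check_ordered([lookup, {"name": "email_send"}], actual))  # → False
print(check_include([{"name": "email_send"}], actual))      # → True
```

The second check fails on two counts: three calls were recorded for two entries, and the second recorded call is `audit_log`, not `email_send`. That kind of extra book-keeping call is the case tool_calls_include is meant for.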
CI integration
    - run: pip install linkworld
    - run: linkworld eval --junit eval-report.xml
    - if: always()
      uses: actions/upload-artifact@v4
      with:
        name: eval-report
        path: eval-report.xml

The JUnit XML follows the standard schema, so GitHub’s test summary, GitLab’s test reports, Buildkite Test Analytics, etc. all render it.
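Because the report is ordinary JUnit XML, a CI script can also summarize it with the standard library. This is a minimal sketch: `summarize_junit` is a hypothetical helper, and the sample document is an assumed example of the common JUnit layout, not actual linkworld output.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Count test cases and collect the names of failed or errored ones."""
    root = ET.fromstring(xml_text)
    total = 0
    failed = []
    # iter() walks the whole tree, so this works whether the root element
    # is <testsuites> or a single <testsuite>.
    for case in root.iter("testcase"):
        total += 1
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name"))
    return {"total": total, "failed": failed}


# Assumed sample in the common JUnit layout; scenario names are made up.
SAMPLE = """<testsuite name="linkworld-eval" tests="2" failures="1">
  <testcase name="inbound order lookup"/>
  <testcase name="schedule daily digest">
    <failure message="expected 2 tool calls, got 1"/>
  </testcase>
</testsuite>"""

print(summarize_junit(SAMPLE))  # → {'total': 2, 'failed': ['schedule daily digest']}
```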
How it works
The runner imports your main.py in-process (so the same Python
process holds the App instance and all its handlers). For each
scenario, a fresh TestClient with MockTools and MockSecrets is
built, the relevant handler is invoked, and the captured tool calls
are asserted against the spec.
No real LLM calls, no platform connection, no Docker — pure-Python execution against the in-memory app. That makes scenarios fast (~ms per scenario) and deterministic.
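The per-scenario flow described above (fresh mocks, invoke the handler, inspect the captured calls) can be illustrated without the SDK. Everything in this sketch is a hypothetical stand-in: `FakeTools` is not the real MockTools, the `call(name, **args)` signature and the default `{"ok": True}` response are assumptions, and `handle_inbound` is a toy handler.

```python
import asyncio


class FakeTools:
    """Hypothetical stand-in for MockTools: canned responses plus a call log."""

    def __init__(self, responses: dict):
        self._responses = responses
        self.calls = []

    async def call(self, name: str, **args):
        # Record the call, then return the canned response for that tool.
        self.calls.append({"name": name, "args": args})
        return self._responses.get(name, {"ok": True})  # assumed default


async def handle_inbound(tools: FakeTools, message_text: str) -> None:
    """Toy handler: look up the customer, then send a reply."""
    customer = await tools.call("crm_lookup", id="42")
    name = customer["result"]["customer"]
    await tools.call("email_send", body=f"Hello {name}, re: {message_text}")


def run_scenario(given_tools: dict, message_text: str) -> list:
    """Build fresh mocks, drive the handler once, return the captured calls."""
    tools = FakeTools(given_tools)  # new instance per scenario, no shared state
    asyncio.run(handle_inbound(tools, message_text))
    return tools.calls


calls = run_scenario(
    {"crm_lookup": {"ok": True, "result": {"customer": "Acme Inc"}}},
    "Help with order #42",
)
print([c["name"] for c in calls])  # → ['crm_lookup', 'email_send']
```

Since nothing leaves the process, repeated runs of the same scenario record the same calls, which is what makes the assertions deterministic.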
For chats that need a real LLM (e.g. asserting on agent reasoning),
use the live playground instead — see linkworld dev once we ship
the chat surface.