Benchmarks

How jobs run in production—and how we score them

Enterprise General Intelligence

A benchmark is not a model leaderboard. It is a record of completed runs: the agent executed a standardized job against real systems, left an auditable trail, and passed the spec.

We run standardized jobs against live tenant systems, score each execution pass or fail, and aggregate results across businesses in unlike industries. The benchmark you review is that execution history—not a synthetic capability index.

Sample production run

Prospect, outreach & opportunity engagement

REV-PIPE-042 · Revenue · B2B industrial distributor · anonymized

PASS

14m 32s

Systems touched: Salesforce · Outreach · Slack

00:00 · Job spec REV-PIPE-042 loaded · tenant CRM bound

00:42 · 47 stale opportunities identified · re-score started

04:18 · Outbound sequences drafted · policy check passed

09:05 · 12 opportunities engaged · next steps logged in CRM

14:32 · Job closed · pipeline fields match spec · audit trail written

Goal persistence: Pass

Tool execution: Pass

Multi-step completion: Pass

Constraint adherence: Pass


How each run executes

Every scored benchmark follows the same execution path—from job spec to systems updated to pass/fail logged.

01 · Load job spec
Jobs-to-be-done (JTBD) definition for the function: what “done” means, which systems may be touched, and policy limits.
02 · Connect tenant stack
Live CRM, ERP, WMS, or finance APIs for that business, not an isolated test harness.
03 · Execute end-to-end
Multi-step run with tool calls and code generation; state is held until the job completes or fails.
04 · Update systems of record
Production objects (pipeline, journals, POs, tickets) written per the job definition.
05 · Score the run
Pass or fail vs. the job spec, plus the execution checks below. Every run is logged.
06 · Publish benchmark
Passes aggregated across unlike tenants before the agent deploys to you.

[JTBD] Job spec loaded → [EXECUTE] Agent runs end-to-end → [TOOLS] APIs · DBs · SaaS → [SYSTEMS] Records updated → [SCORE] Pass / fail logged
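As a rough illustration of the first step in the path above, a job spec can be thought of as a small structured record: what "done" means, which systems may be touched, and the policy limits. The sketch below is an assumption for illustration only; the `JobSpec` class, its field names, and the `systems_outside_spec` helper are hypothetical and do not describe the production schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job spec record; field names are illustrative only."""
    job_id: str
    function: str
    done_when: str                     # what "done" means for this job
    allowed_systems: frozenset[str]    # systems the agent may touch
    policy_limits: dict[str, float] = field(default_factory=dict)


def systems_outside_spec(spec: JobSpec, touched: set[str]) -> set[str]:
    """Return any systems a run touched that the spec does not allow."""
    return set(touched) - spec.allowed_systems


# Values taken from the sample run shown on this page.
spec = JobSpec(
    job_id="REV-PIPE-042",
    function="Revenue",
    done_when="Opportunities engaged with documented touchpoints and next steps in CRM",
    allowed_systems=frozenset({"Salesforce", "Outreach", "Slack"}),
)

# An empty result means the run stayed inside the systems named in the spec.
violations = systems_outside_spec(spec, {"Salesforce", "Outreach", "Slack"})
```

Keeping the allowed systems in the spec itself means the same check can run unchanged for every tenant, with only the spec contents varying.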

Production runs scored

Real end-to-end workflow runs at real businesses. Names omitted; you see the job, what the agent executed, and the pass criteria.

Revenue · Prospect, outreach & opportunity engagement

Agent executes: Prospects targets, runs outbound outreach, and engages each opportunity in the CRM—activity logged, next steps set, pipeline advanced.
Tenants validated: B2B industrial distributor · tech & SaaS · regional healthcare services
Done when: Prospects sourced, outreach executed, opportunities engaged with documented touchpoints and next steps in CRM.
Pass criteria: Pass = full prospect-to-engagement cycle completes on each tenant's CRM and outreach stack.

Finance · Month-end reconciliation assist

Agent executes: Runs the full close workflow: pulls sub-ledgers, matches exceptions, drafts journals inside policy, routes approvals, writes audit entries.
Tenants validated: Multi-entity CPG operator · logistics tech platform · financial services back-office
Done when: Exceptions surfaced, proposed journals within policy, full audit trail per run.
Pass criteria: Pass = the same job spec completes against each tenant's chart of accounts and approval workflow.

Purchasing · Signal-driven purchasing decisions

Agent executes: Tracks demand, inventory, vendor, and spend signals across systems; synthesizes inputs and executes or routes purchase decisions within policy.
Tenants validated: Industrial distribution · CPG & beverage · retail & commerce
Done when: Signals monitored, decision rationale documented, PO or requisition created or escalated per approval rules.
Pass criteria: Pass = full signal-to-decision cycle on each tenant's procurement and ERP stack.

Scored on every run

Before a run counts toward the published benchmark, it must clear these checks. No partial credit for a clever single step.

→ Job closed to defined expectation (yes / no)
→ Correct tools and parameters for that tenant's stack
→ Full sequence executed, not a single demo step
→ Audit log captures each action and system change
→ Policies and constraints held for that business
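The "no partial credit" rule above amounts to a logical AND over the five checks. A minimal sketch, assuming each check reduces to a boolean; the `RunChecks` class and `run_counts` function are hypothetical names, not the actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class RunChecks:
    """Hypothetical per-run check results, one flag per check listed above."""
    job_closed_to_spec: bool
    correct_tools_and_params: bool
    full_sequence_executed: bool
    audit_log_complete: bool
    policies_held: bool


def run_counts(checks: RunChecks) -> bool:
    """A run counts only if every check passes; there is no partial credit."""
    return all(vars(checks).values())


clean_run = RunChecks(True, True, True, True, True)
# Four of five checks passing still fails the run.
missing_audit = RunChecks(True, True, True, False, True)
```

Here `run_counts(clean_run)` is `True` and `run_counts(missing_audit)` is `False`, so a single failed check is enough to keep a run out of the published benchmark.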

Execution dimensions

Each dimension describes what the agent must do in production—not a lab capability label.

Goal persistence

Finishes the job through handoffs and long runs

Tool execution

Calls the right APIs with correct parameters

Multi-step completion

Runs the full job chain, not one isolated action

State & context

Retains job state across steps and interruptions

Error recovery

Recovers and continues without abandoning the job

Outcome quality

Deliverables match the job spec

Constraint adherence

Stays inside tenant policy and approval rules

Earn the benchmark across tenants

One completed run is a data point. The benchmark publishes when the same job passes at multiple businesses with different stacks—e.g. prospect-and-outreach runs at a SaaS vendor, a healthcare operator, and an industrial distributor, each with its own CRM, sequences, and approval chain.

That cross-tenant execution record is what you review before deploy—not a generic model score from a fixture.
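The publish rule described above can be sketched as a grouping step: collect the distinct tenants where each job has passed, then publish only jobs that clear a tenant threshold. The page says "multiple businesses" without giving a number, so the `min_tenants=2` default is an assumption, as are the `Run` record, the tenant labels, and the `FIN-CLOSE-001` job ID.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Run(NamedTuple):
    """Hypothetical scored run record."""
    job_id: str
    tenant: str    # anonymized tenant label
    passed: bool


def publishable_jobs(runs: Iterable[Run], min_tenants: int = 2) -> set[str]:
    """Return job IDs that passed at `min_tenants` or more distinct tenants.

    The threshold is an assumed parameter, not a documented value.
    """
    passing_tenants = defaultdict(set)
    for run in runs:
        if run.passed:
            passing_tenants[run.job_id].add(run.tenant)
    return {
        job_id
        for job_id, tenants in passing_tenants.items()
        if len(tenants) >= min_tenants
    }


runs = [
    Run("REV-PIPE-042", "saas-vendor", True),
    Run("REV-PIPE-042", "healthcare-operator", True),
    Run("REV-PIPE-042", "industrial-distributor", True),
    Run("FIN-CLOSE-001", "cpg-operator", True),  # hypothetical job ID, one tenant only
]
ready = publishable_jobs(runs)
```

Counting distinct tenants rather than total passes means repeated passes at one business cannot stand in for a cross-tenant record.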

Benchmarked across unlike industries

Cybersecurity · Industrial distribution · Retail & commerce · eCommerce & DTC · CPG & beverage · Fashion & jewelry · Logistics tech · Software & SaaS · Financial services · Education & EdTech · Healthcare & life sciences · Pharma, biotech & CROs · Data centers & AI infra · Utilities & energy · Industrial warehousing · Food & beverage · Creative & marketing

Same job spec. Different tenants. Scored before deploy.

What you get before Day 1

→ Job spec for your function

→ Pass/fail history on that job at unlike tenants

→ Execution logs and dimension scores from validating runs

→ Deploy in days: connect your stack, run under your policies
