Benchmarking Claude Agent SDK and Codex SDK Memory for Managed Agents

Sandbox0 Team · April 14, 2026

When you run managed agents in production, SDK choice is not only an API ergonomics decision.

It changes the shape of the runtime you have to host.

A normal API client adds a little memory inside your application process. An agent SDK may start a local agent runtime, hold session state, run tools, keep working directories alive, and spawn helper processes. That fixed local footprint becomes important when you multiply it by warm pools, concurrent sessions, and per-tenant isolation boundaries.

That is why we added a reproducible memory benchmark for Sandbox0 Managed Agents.

The goal is not to declare one SDK universally better. The goal is simpler: make the baseline process cost visible before you design a managed agent backend around it.

What We Wanted to Measure

There are two different questions people often mix together:

How much memory does the SDK import add to my application process?
How much memory does the local agent runtime use once a session can actually run?

Those are different numbers.

For Claude Agent SDK, the important runtime path is the bundled Claude CLI process that the SDK drives underneath. Anthropic's current Agent SDK docs present the SDK as Claude Code in library form, and the hosting guide notes that each SDK instance needs runtime resources and spawns the bundled Claude Code CLI.

For Codex SDK, the TypeScript SDK controls local Codex agents, while the Python path is currently experimental and controls the local Codex app-server over JSON-RPC. OpenAI documents those paths in the Codex SDK and Codex app-server docs.

So the benchmark measures both import-only cost and local child-process cost.

The Benchmark Design

For more stable results, run more samples and write the JSON report to a named file:

```bash
npm run bench -- \
    --samples 10 \
    --warmup-ms 1500 \
    --stability-probes 8 \
    --output results/linux-arm64.json
```

The script measures:

Node.js baseline RSS
import-only RSS for @anthropic-ai/claude-agent-sdk
import-only RSS for @openai/codex-sdk
idle RSS for Codex exec --experimental-json
idle RSS for Codex app-server
idle RSS for the bundled Claude CLI process

It deliberately does not send prompts or call model APIs. That keeps token usage, network latency, and task-specific tool output out of the baseline.

For child processes, the benchmark uses process-tree RSS as the primary metric. That matters because agent SDKs may spawn helper processes depending on platform and mode. After a warmup delay, the script samples a short stability window and records the highest RSS observed in that window, which avoids treating startup race artifacts as real idle memory.
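As a rough illustration of the two ideas in that paragraph, the sketch below sums RSS across a process tree and keeps the peak reading from a stability window. It is not the benchmark's actual code: the `PsRow` shape, the function names, and the assumption that the input comes from `ps -A -o pid=,ppid=,rss=` (RSS reported in KiB) are illustrative choices.

```typescript
// One row of `ps -A -o pid=,ppid=,rss=` output (assumed input format).
interface PsRow {
  pid: number;
  ppid: number;
  rssKb: number;
}

// Parse whitespace-separated "pid ppid rss" lines, skipping malformed ones.
function parsePs(output: string): PsRow[] {
  const rows: PsRow[] = [];
  for (const line of output.split("\n")) {
    const parts = line.trim().split(/\s+/);
    if (parts.length !== 3) continue;
    const pid = Number(parts[0]);
    const ppid = Number(parts[1]);
    const rssKb = Number(parts[2]);
    if (![pid, ppid, rssKb].every((n) => Number.isFinite(n))) continue;
    rows.push({ pid, ppid, rssKb });
  }
  return rows;
}

// Sum RSS for rootPid and every descendant, returned in MiB.
function treeRssMb(rows: PsRow[], rootPid: number): number {
  const children = new Map<number, PsRow[]>();
  for (const row of rows) {
    const siblings = children.get(row.ppid) ?? [];
    siblings.push(row);
    children.set(row.ppid, siblings);
  }
  const root = rows.find((r) => r.pid === rootPid);
  if (root === undefined) return 0;

  let totalKb = 0;
  const seen = new Set<number>();
  const stack: PsRow[] = [root];
  while (stack.length > 0) {
    const row = stack.pop();
    if (row === undefined) break;
    if (seen.has(row.pid)) continue; // guard against cycles in bad input
    seen.add(row.pid);
    totalKb += row.rssKb;
    stack.push(...(children.get(row.pid) ?? []));
  }
  return totalKb / 1024;
}

// Keep the highest reading observed across the stability probes.
function peakOf(probesMb: number[]): number {
  return probesMb.reduce((max, v) => Math.max(max, v), 0);
}
```

Summing per-process RSS counts pages shared between parent and helpers more than once, so a tree total read this way is an upper bound rather than an exact figure.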

The script also runs child processes with temporary HOME-like directories. Local user configuration, credentials, and session history should not become part of the measurement.
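One way to get that isolation is to build the child environment from scratch instead of inheriting the parent's. The sketch below is a guess at the general approach, not the script's implementation: the `isolatedEnv` name, the pass-through list, and the choice of `XDG_*` variables are all assumptions.

```typescript
// Build a child-process environment whose HOME-like variables point at a
// throwaway directory. The variable list here is an assumption.
function isolatedEnv(
  base: Record<string, string | undefined>,
  tmpHome: string,
): Record<string, string> {
  // Pass through only what a CLI typically needs in order to start.
  const passthrough: Record<string, string> = {};
  for (const key of ["PATH", "LANG", "TERM"]) {
    const value = base[key];
    if (value !== undefined) passthrough[key] = value;
  }
  return {
    ...passthrough,
    HOME: tmpHome,
    USERPROFILE: tmpHome, // Windows counterpart of HOME
    XDG_CONFIG_HOME: `${tmpHome}/.config`,
    XDG_CACHE_HOME: `${tmpHome}/.cache`,
    XDG_DATA_HOME: `${tmpHome}/.local/share`,
  };
}
```

In Node.js, the temporary directory could come from `fs.mkdtempSync(path.join(os.tmpdir(), "bench-home-"))`, and the returned object could be passed as the `env` option of `child_process.spawn`. Because nothing outside the pass-through list is copied, API keys and other credentials in the parent environment never reach the child.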

One Local Run

Here is one run from a macOS arm64 laptop using Node v23.11.0, @anthropic-ai/[email protected], @openai/[email protected], and @openai/[email protected].

The command:

```bash
npm run bench -- \
    --samples 3 \
    --warmup-ms 1200 \
    --stability-probes 5 \
    --output /tmp/agent-sdk-memory-benchmark.json
```

The result:

Benchmark	Avg MB	Min MB	Max MB	Primary metric
Node.js baseline	29.1	28.7	29.3	process RSS
Codex SDK import	34.8	34.5	35.0	process RSS
Claude Agent SDK import	57.4	57.2	57.6	process RSS
Codex `exec` idle	21.1	20.9	21.3	process-tree RSS
Codex `app-server` idle	45.6	43.3	47.0	process-tree RSS
Claude bundled CLI idle	215.8	212.9	221.5	process-tree RSS

Do not copy these numbers directly into a production capacity model. They are local macOS cold-start idle numbers. The value is in the shape of the comparison and in the ability to run the same benchmark on your target Linux image.

What the Numbers Say

The import-only numbers are useful, but they are not the main story.

Codex SDK import stayed close to the Node baseline in this run. Claude Agent SDK import added more memory inside the Node process. That matters for application servers that import the SDK before any agent is active, but it is still smaller than the runtime-process difference.
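To put a number on "close to the baseline", subtract the Node.js baseline from each import-only average. The sketch below does that arithmetic with the averages from the results table; the constant and function names are illustrative only.

```typescript
// Average RSS in MB, copied from the results table for this one local run.
const baselineMb = 29.1;
const importRssMb: Record<string, number> = {
  "Codex SDK import": 34.8,
  "Claude Agent SDK import": 57.4,
};

// Memory attributable to the import itself: measured RSS minus the bare
// Node.js baseline, rounded to one decimal to match the table's precision.
function importOverheadMb(rssMb: number, baseline: number): number {
  return Math.round((rssMb - baseline) * 10) / 10;
}

// Overhead for every row, keyed by benchmark name.
const overheadByName = new Map<string, number>(
  Object.entries(importRssMb).map(
    ([name, rss]): [string, number] => [name, importOverheadMb(rss, baselineMb)],
  ),
);
```

For this run, that works out to about 5.7 MB of import overhead for Codex SDK and about 28.3 MB for Claude Agent SDK.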

The larger difference appears once the local agent runtime is present.

In this idle benchmark, Codex exec averaged about 21 MB and Codex app-server about 46 MB, while the bundled Claude CLI process averaged about 216 MB. That is roughly ten times the exec path and nearly five times app-server, before any repository scan, MCP server, shell command, package install, or long session history is added.

This does not mean a Claude-backed managed agent is too expensive to run. It means you should not budget it like a thin HTTP client.

Anthropic's own hosting guide recommends allocating 1GiB RAM per SDK instance. For serious Claude Agent SDK workloads, that recommendation is a reasonable starting point. The idle CLI process is only the floor. Real sessions can add memory through tool output, subprocesses, MCP servers, local files, and long-lived context management.

Codex looks lighter in this idle process benchmark, especially for the exec path. The app-server path is still compact, but it is a persistent server interface, so the right comparison depends on whether your architecture wants one process per turn, one process per loaded thread, or a longer-running local server.

Why This Matters for Managed Agents

Managed agents turn SDK process cost into infrastructure cost.

If every active session gets its own runtime boundary, then an idle process of roughly 200 MB is not a footnote. It affects:

how many warm sessions fit per node
how aggressively idle sessions should be paused
whether one tenant should share a runtime pool or get stronger isolation
how large the default sandbox memory request should be
how fast the platform can scale under bursty traffic
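The first item on that list is simple division, and it is worth doing explicitly. The sketch below is a back-of-the-envelope calculation, not a scheduler: the `NodeBudget` shape, the 16 GiB node, and the 2 GiB reservation are hypothetical values chosen for illustration.

```typescript
interface NodeBudget {
  nodeMemoryMiB: number; // total memory on the node
  reservedMiB: number; // OS, node agents, sidecars, and other overhead
  perSessionMiB: number; // memory budgeted for one agent runtime
}

// How many runtimes fit once the reserved memory is set aside.
function warmSessionsPerNode(b: NodeBudget): number {
  const usable = b.nodeMemoryMiB - b.reservedMiB;
  if (usable <= 0 || b.perSessionMiB <= 0) return 0;
  return Math.floor(usable / b.perSessionMiB);
}

// Hypothetical node: 16 GiB total with 2 GiB reserved.
const node = { nodeMemoryMiB: 16 * 1024, reservedMiB: 2 * 1024 };

// Budgeting at the idle floor measured for the bundled Claude CLI.
const atIdleFloor = warmSessionsPerNode({ ...node, perSessionMiB: 216 });

// Budgeting at 1 GiB per SDK instance, the hosting guide's recommendation.
const atOneGiB = warmSessionsPerNode({ ...node, perSessionMiB: 1024 });
```

On this hypothetical node, the idle floor gives 66 sessions and the 1 GiB budget gives 14. The real number sits somewhere between the two and depends on what the sessions actually do, which is why the idle floor should be treated as a lower bound rather than a sizing target.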

This is exactly the kind of tradeoff Sandbox0 is built to make explicit.

The agent session should be durable. The runtime should be replaceable. Persistent storage, event history, credentials, and policy should live outside the agent process when possible. Once that boundary exists, you can choose a runtime strategy based on the actual workload:

keep a Claude runtime warm when user latency matters and memory budget allows it
use a Codex exec style path for short automation tasks where startup and steady-state memory both matter
use an app-server style interface when you need richer streaming, approvals, history, and loaded-thread control
pause or recycle runtimes while preserving session truth and workspace state outside the process
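The last option on that list, pausing runtimes while session state lives elsewhere, can be sketched as a small selection policy. Everything below is hypothetical (the `RuntimeInfo` shape, the function name, and the longest-idle-first rule); it shows one possible policy, not how Sandbox0 schedules runtimes.

```typescript
interface RuntimeInfo {
  sessionId: string;
  rssMb: number; // current memory of the runtime's process tree
  idleMs: number; // time since the session's last activity
}

// Pick runtimes to pause, longest-idle first, until total RSS fits the
// budget. Runtimes idle for less than minIdleMs are never selected.
function selectRuntimesToPause(
  runtimes: RuntimeInfo[],
  budgetMb: number,
  minIdleMs: number,
): string[] {
  let totalMb = runtimes.reduce((sum, r) => sum + r.rssMb, 0);
  const candidates = runtimes
    .filter((r) => r.idleMs >= minIdleMs) // filter copies, so sort is safe
    .sort((a, b) => b.idleMs - a.idleMs);

  const toPause: string[] = [];
  for (const r of candidates) {
    if (totalMb <= budgetMb) break;
    toPause.push(r.sessionId);
    totalMb -= r.rssMb;
  }
  return toPause;
}
```

A policy like this is only safe because of the boundary described above: if session truth and workspace state live outside the process, pausing a runtime costs resume latency rather than data.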

The benchmark does not make those decisions for you. It gives you the baseline data needed to make them honestly.

How to Extend the Benchmark

The current script is intentionally narrow. It measures the local idle floor.

The next useful profiles are workload-specific:

small prompt, no tools
repository scan with grep or rg
file read and edit workload
MCP-enabled session with one local MCP server
long session resume with existing history
concurrent sessions inside one container

Those profiles should run on the same target image used by the managed agent runtime, ideally under cgroup memory accounting. RSS from ps is useful for local development, but production scheduling should care about container memory, peak memory, and OOM behavior.
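On Linux with cgroup v2, the kernel exposes those container-level figures as small text files: `memory.current` and `memory.peak` hold byte counts (`memory.peak` needs a reasonably recent kernel), and `memory.events` holds counters such as `oom` and `oom_kill`. The sketch below parses their contents; the function names are illustrative, and the `/sys/fs/cgroup/` location mentioned afterwards is an assumption about how the container mounts its cgroup.

```typescript
// Parse the key/value counters in a cgroup v2 `memory.events` file,
// for example "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n".
function parseMemoryEvents(text: string): Map<string, number> {
  const counters = new Map<string, number>();
  for (const line of text.split("\n")) {
    const parts = line.trim().split(/\s+/);
    if (parts.length !== 2) continue;
    const value = Number(parts[1]);
    if (Number.isFinite(value)) counters.set(String(parts[0]), value);
  }
  return counters;
}

// Convert the single byte count in `memory.current` or `memory.peak` to MiB.
function bytesFileToMiB(text: string): number {
  return Number(text.trim()) / (1024 * 1024);
}
```

Inside a container these files are commonly found directly under `/sys/fs/cgroup/`, so a workload profile could read them with `fs.readFileSync(path, "utf8")` before and after each run and record peak memory and any change in `oom_kill` next to the RSS figures.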

The Practical Takeaway

For managed agents, memory benchmarking should happen before the runtime abstraction hardens.

Claude Agent SDK and Codex SDK have different local process shapes. The Claude path gives you a full Claude Code-style runtime through the Agent SDK, and that comes with a higher baseline memory floor. The Codex SDK path is lighter in this idle benchmark, especially for one-shot exec, while app-server gives a more persistent control surface at a still modest idle footprint.

The right answer depends on product requirements, not just MB numbers. But capacity planning needs real measurements.

Run the benchmark on your target infrastructure, capture the JSON report, and treat the result as the starting point for runtime placement, warm-pool sizing, and session lifecycle policy.