Benchmarking Claude Agent SDK and Codex SDK Memory for Managed Agents
When you run managed agents in production, SDK choice is not only an API ergonomics decision.
It changes the shape of the runtime you have to host.
A normal API client adds a little memory inside your application process. An agent SDK may start a local agent runtime, hold session state, run tools, keep working directories alive, and spawn helper processes. That fixed local footprint becomes important when you multiply it by warm pools, concurrent sessions, and per-tenant isolation boundaries.
That is why we added a reproducible memory benchmark for Sandbox0 Managed Agents.
The goal is not to declare one SDK universally better. The goal is simpler: make the baseline process cost visible before you design a managed agent backend around it.
What We Wanted to Measure
There are two different questions people often mix together:
- How much memory does the SDK import add to my application process?
- How much memory does the local agent runtime use once a session can actually run?
Those are different numbers.
For Claude Agent SDK, the important runtime path is the bundled Claude CLI process that the SDK uses underneath. Anthropic's current Agent SDK docs describe the SDK as Claude Code as a library, and the hosting guide calls out that each SDK instance needs runtime resources and spawns the bundled Claude Code CLI.
For Codex SDK, the TypeScript SDK controls local Codex agents, while the Python path is currently experimental and controls the local Codex app-server over JSON-RPC. OpenAI documents those paths in the Codex SDK and Codex app-server docs.
So the benchmark measures both import-only cost and local child-process cost.
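To make the import-only side concrete, here is a minimal sketch. It is not the benchmark's actual code. The helper names `bytesToMb` and `measureImportRss` are hypothetical, and it assumes MB means 1024 × 1024 bytes. It reads `process.memoryUsage().rss` before and after a dynamic import, so it only sees what the SDK adds inside your own Node process.

```javascript
// Hypothetical helper: convert bytes to MB (assumed to be 1024 * 1024 bytes), one decimal place.
function bytesToMb(bytes) {
  return Math.round((bytes / (1024 * 1024)) * 10) / 10;
}

// Measure how much RSS the current process gains after importing a module.
// `specifier` is whatever you want to measure, for example "@openai/codex-sdk".
async function measureImportRss(specifier) {
  const before = process.memoryUsage().rss;
  await import(specifier);
  const after = process.memoryUsage().rss;
  return {
    beforeMb: bytesToMb(before),
    afterMb: bytesToMb(after),
    deltaMb: bytesToMb(after - before),
  };
}
```

Node caches modules after the first import, so a measurement like this is only meaningful once per process; running each sample in a fresh Node process is the safer design. Memory used by a spawned CLI or app-server lives in a different PID and never shows up here, which is why the child-process cost needs its own measurement.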
The Benchmark Design
For more stable results, run more samples and write the JSON report to a named file:
```bash
npm run bench -- \
  --samples 10 \
  --warmup-ms 1500 \
  --stability-probes 8 \
  --output results/linux-arm64.json
```
The script measures:
- Node.js baseline RSS
- import-only RSS for @anthropic-ai/claude-agent-sdk
- import-only RSS for @openai/codex-sdk
- idle RSS for Codex exec --experimental-json
- idle RSS for Codex app-server
- idle RSS for the bundled Claude CLI process
It deliberately does not send prompts or call model APIs. That keeps token usage, network latency, and task-specific tool output out of the baseline.
For child processes, the benchmark uses process-tree RSS as the primary metric. That matters because agent SDKs may spawn helper processes depending on platform and mode. After a warmup delay, the script samples a short stability window and records the highest RSS observed in that window, which avoids treating startup race artifacts as real idle memory.
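The process-tree metric and the stability window can be sketched roughly as follows. This illustrates the idea and is not the script's implementation: the function names are hypothetical, and it assumes process snapshots come from something like `ps -A -o pid=,ppid=,rss=`, which reports RSS in KiB.

```javascript
// Parse `ps` output with three numeric columns per line: pid, ppid, rss (KiB).
function parsePsRows(text) {
  return text
    .split("\n")
    .map((line) => line.trim().split(/\s+/).map(Number))
    .filter((cols) => cols.length === 3 && cols.every(Number.isFinite))
    .map(([pid, ppid, rssKb]) => ({ pid, ppid, rssKb }));
}

// Sum RSS for a root process and all of its descendants in one snapshot.
function processTreeRssKb(rows, rootPid) {
  const childrenOf = new Map();
  for (const row of rows) {
    if (!childrenOf.has(row.ppid)) childrenOf.set(row.ppid, []);
    childrenOf.get(row.ppid).push(row);
  }
  const root = rows.find((row) => row.pid === rootPid);
  if (!root) return 0; // the process has already exited
  let total = 0;
  const stack = [root];
  const seen = new Set();
  while (stack.length > 0) {
    const current = stack.pop();
    if (seen.has(current.pid)) continue;
    seen.add(current.pid);
    total += current.rssKb;
    for (const child of childrenOf.get(current.pid) ?? []) stack.push(child);
  }
  return total;
}

// Stability window: keep the highest tree RSS seen across several snapshots.
function peakOverWindow(snapshots, rootPid) {
  return Math.max(...snapshots.map((rows) => processTreeRssKb(rows, rootPid)));
}
```

Summing the whole tree means a helper process spawned by the CLI counts toward the same number as the CLI itself, so the metric does not depend on how a given platform or mode splits the work across processes.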
The script also runs child processes with temporary HOME-like directories. Local user configuration, credentials, and session history should not become part of the measurement.
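One way to get that isolation is to override the HOME-like environment variables for the child process. The sketch below is an assumption about how this could be done, not a description of what the script sets: `isolatedEnv` is a hypothetical helper, and the variable list covers only the standard HOME, USERPROFILE, and XDG locations.

```javascript
// Build an environment that points HOME-like variables at a throwaway directory.
// `scratchDir` should be a fresh temp directory created by the caller,
// for example with fs.mkdtempSync.
function isolatedEnv(baseEnv, scratchDir) {
  return {
    ...baseEnv, // keep PATH and everything else the child needs
    HOME: scratchDir,
    USERPROFILE: scratchDir, // Windows counterpart of HOME
    XDG_CONFIG_HOME: `${scratchDir}/.config`,
    XDG_CACHE_HOME: `${scratchDir}/.cache`,
    XDG_DATA_HOME: `${scratchDir}/.local/share`,
  };
}
```

The result can be passed as the `env` option of `child_process.spawn`. Individual CLIs may honor additional configuration variables of their own, so check each tool's documentation before relying on this list.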
One Local Run
Here is one run from a macOS arm64 laptop using Node v23.11.0, @anthropic-ai/[email protected], @openai/[email protected], and @openai/[email protected].
The command:
```bash
npm run bench -- \
  --samples 3 \
  --warmup-ms 1200 \
  --stability-probes 5 \
  --output /tmp/agent-sdk-memory-benchmark.json
```
The result:
| Benchmark | Avg MB | Min MB | Max MB | Primary metric |
|---|---|---|---|---|
| Node.js baseline | 29.1 | 28.7 | 29.3 | process RSS |
| Codex SDK import | 34.8 | 34.5 | 35.0 | process RSS |
| Claude Agent SDK import | 57.4 | 57.2 | 57.6 | process RSS |
| Codex exec idle | 21.1 | 20.9 | 21.3 | process-tree RSS |
Codex app-server idle | 45.6 | 43.3 | 47.0 | process-tree RSS |
| Claude bundled CLI idle | 215.8 | 212.9 | 221.5 | process-tree RSS |
Do not copy these numbers directly into a production capacity model. They are local macOS cold-start idle numbers. The value is in the shape of the comparison and in the ability to run the same benchmark on your target Linux image.
What the Numbers Say
The import-only numbers are useful, but they are not the main story.
Codex SDK import stayed close to the Node baseline in this run. Claude Agent SDK import added more memory inside the Node process. That matters for application servers that import the SDK before any agent is active, but it is still smaller than the runtime-process difference.
The larger difference appears once the local agent runtime is present.
In this idle benchmark, Codex exec was around 21 MB and Codex app-server was around 46 MB. The bundled Claude CLI process was around 216 MB. That is a large difference before any repository scan, MCP server, shell command, package install, or long session history is added.
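The comparison is easier to see as arithmetic on the averages from the table. The snippet below only restates numbers from this single macOS run; the variable names are illustrative.

```javascript
// Averages (MB) copied from the single macOS arm64 run in the table.
const avgMb = {
  nodeBaseline: 29.1,
  codexImport: 34.8,
  claudeImport: 57.4,
  codexExecIdle: 21.1,
  codexAppServerIdle: 45.6,
  claudeCliIdle: 215.8,
};

const roundTo1 = (x) => Math.round(x * 10) / 10;

// Import cost: RSS after import minus the bare Node.js baseline.
const importDeltaMb = {
  codex: roundTo1(avgMb.codexImport - avgMb.nodeBaseline),
  claude: roundTo1(avgMb.claudeImport - avgMb.nodeBaseline),
};

// Idle runtime size of the bundled Claude CLI relative to each Codex path.
const idleRatio = {
  claudeVsCodexExec: roundTo1(avgMb.claudeCliIdle / avgMb.codexExecIdle),
  claudeVsCodexAppServer: roundTo1(avgMb.claudeCliIdle / avgMb.codexAppServerIdle),
};

console.log(importDeltaMb); // { codex: 5.7, claude: 28.3 }
console.log(idleRatio); // { claudeVsCodexExec: 10.2, claudeVsCodexAppServer: 4.7 }
```

In this run the import difference is about 23 MB, while the idle runtime difference is roughly 170 to 195 MB, which is why the runtime process dominates the budget.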
This does not mean a Claude-backed managed agent is too expensive to run. It means you should not budget it like a thin HTTP client.
Anthropic's own hosting guide recommends allocating 1 GiB of RAM per SDK instance. For serious Claude Agent SDK workloads, that recommendation is a reasonable starting point. The idle CLI process is only the floor. Real sessions can add memory through tool output, subprocesses, MCP servers, local files, and long-lived context management.
Codex looks lighter in this idle process benchmark, especially for the exec path. The app-server path is still compact, but it is a persistent server interface, so the right comparison depends on whether your architecture wants one process per turn, one process per loaded thread, or a longer-running local server.
Why This Matters for Managed Agents
Managed agents turn SDK process cost into infrastructure cost.
If every active session gets a runtime boundary, then a 200 MB idle process is not a footnote. It affects:
- how many warm sessions fit per node
- how aggressively idle sessions should be paused
- whether one tenant should share a runtime pool or get stronger isolation
- how large the default sandbox memory request should be
- how fast the platform can scale under bursty traffic
This is exactly the kind of tradeoff Sandbox0 is built to make explicit.
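As a back-of-the-envelope illustration of the first item in that list, here is a sketch of warm-session density. The node size and reserved memory are hypothetical, and `warmSessionsPerNode` is an illustrative helper. Only the 216 MB idle figure and the 1 GiB guidance come from the text above.

```javascript
// How many warm sessions fit on one node for a given per-session memory budget.
// All inputs are illustrative; replace them with numbers measured on your target image.
function warmSessionsPerNode({ nodeMemoryMb, reservedMb, perSessionMb }) {
  const usableMb = nodeMemoryMb - reservedMb;
  if (usableMb <= 0 || perSessionMb <= 0) return 0;
  return Math.floor(usableMb / perSessionMb);
}

// Hypothetical node: 16 GiB total, 2 GiB reserved for the OS and platform daemons.
const nodeShape = { nodeMemoryMb: 16 * 1024, reservedMb: 2 * 1024 };

// Budgeting at the measured idle floor versus the 1 GiB per-instance guidance.
const atIdleFloor = warmSessionsPerNode({ ...nodeShape, perSessionMb: 216 });
const atOneGiB = warmSessionsPerNode({ ...nodeShape, perSessionMb: 1024 });

console.log(atIdleFloor, atOneGiB); // 66 14
```

The gap between 66 and 14 is the planning range: the idle floor is the best case for sessions that are warm but doing nothing, and the 1 GiB figure is closer to what an active session may need.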
The agent session should be durable. The runtime should be replaceable. Persistent storage, event history, credentials, and policy should live outside the agent process when possible. Once that boundary exists, you can choose a runtime strategy based on the actual workload:
- keep a Claude runtime warm when user latency matters and memory budget allows it
- use a Codex exec style path for short automation tasks where startup and steady-state memory both matter
- use an app-server style interface when you need richer streaming, approvals, history, and loaded-thread control
- pause or recycle runtimes while preserving session truth and workspace state outside the process
The benchmark does not make those decisions for you. It gives you the baseline data needed to make them honestly.
How to Extend the Benchmark
The current script is intentionally narrow. It measures the local idle floor.
The next useful profiles are workload-specific:
- small prompt, no tools
- repository scan with grep or rg
- file read and edit workload
- MCP-enabled session with one local MCP server
- long session resume with existing history
- concurrent sessions inside one container
Those profiles should run on the same target image used by the managed agent runtime, ideally under cgroup memory accounting. RSS from ps is useful for local development, but production scheduling should care about container memory, peak memory, and OOM behavior.
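For container-level numbers, a minimal sketch of reading cgroup v2 memory files might look like this. It assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup` inside the container and a kernel recent enough to expose `memory.peak`. The function names are hypothetical, and the file reader is injected so the parsing can be exercised without a real cgroup.

```javascript
// Parse cgroup v2 `memory.events` text ("key value" per line) into an object.
function parseMemoryEvents(text) {
  const events = {};
  for (const line of text.split("\n")) {
    const [key, value] = line.trim().split(/\s+/);
    if (key && value !== undefined) events[key] = Number(value);
  }
  return events;
}

// Read container-level memory figures.
// `readText` takes a path and returns the file contents as a string;
// in a real container you might pass (p) => fs.readFileSync(p, "utf8").
function readCgroupMemory(readText, root = "/sys/fs/cgroup") {
  const toMb = (bytesText) => Number(bytesText.trim()) / (1024 * 1024);
  const events = parseMemoryEvents(readText(`${root}/memory.events`));
  return {
    currentMb: toMb(readText(`${root}/memory.current`)),
    peakMb: toMb(readText(`${root}/memory.peak`)), // not available on older kernels
    oomKills: events.oom_kill ?? 0,
  };
}
```

Unlike per-process RSS, these counters cover everything charged to the container, including page cache, so they are closer to what the scheduler and the OOM killer actually act on.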
The Practical Takeaway
For managed agents, memory benchmarking should happen before the runtime abstraction hardens.
Claude Agent SDK and Codex SDK have different local process shapes. The Claude path gives you a full Claude Code-style runtime through the Agent SDK, and that comes with a higher baseline memory floor. The Codex SDK path is lighter in this idle benchmark, especially for one-shot exec, while app-server gives a more persistent control surface at a still modest idle footprint.
The right answer depends on product requirements, not just MB numbers. But capacity planning needs real measurements.
Run the benchmark on your target infrastructure, capture the JSON report, and treat the result as the starting point for runtime placement, warm-pool sizing, and session lifecycle policy.