
AI Coding Agents Build Better Apps with Infrastructure Skills: Claude and Codex Benchmark on Hyperliquid

3rd March 2026 · 12 min read

You ask an AI coding agent to build a Hyperliquid trading dashboard. It searches the web, experiments with API endpoints, writes 500 lines of code, and delivers a working app on the public API with default book depth. A second agent, given a 126-line skill file describing Dwellir's Hyperliquid infrastructure, skips the research phase entirely, connects to optimized endpoints with 50-level order book depth, and finishes 37% faster at 41% lower cost. A third agent, given the same skill plus instructions for the Dwellir CLI, finishes 70% faster at 57% lower cost.

That gap between "figure it out" and "already knows" is what infrastructure skills solve for AI agents. We ran two rounds of benchmarks to measure exactly how much focused tooling changes the outcome, and found that the gains compound when you layer skills with agent-friendly CLIs.

Baseline agent (left) vs skill-equipped agent (right). The baseline uses the public WebSocket with 20-level book depth; the skill variant connects to Dwellir's dedicated Orderbook WebSocket with 50-level depth.

What Are Agent Skills?

AI coding agents like Claude Code and OpenAI Codex can build full applications from natural language prompts. They write code, install dependencies, debug errors, and iterate until the app works. But when these agents encounter specialized infrastructure (blockchain RPCs, WebSocket protocols, proprietary API formats), they rely on training data that may be outdated or incomplete.

An agent skill is a structured reference document (formatted as markdown) that gives the agent the same knowledge a senior developer would have after reading the provider's documentation. For Hyperliquid, Dwellir's skill covers endpoint URLs, authentication patterns, WebSocket subscription formats, available book depth levels, and the read/write architecture split between Dwellir's data infrastructure and Hyperliquid's native exchange API.

The skill file is compact: 126 lines of markdown with links to 6 detailed reference documents covering the Info API, Orderbook WebSocket, gRPC Gateway, HyperEVM JSON-RPC, native API, and historical data access.
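The shape of such a skill file follows the Agent Skills layout: a short YAML frontmatter block, then markdown sections the agent reads on demand. The skeleton below is an illustrative sketch, not the actual Dwellir skill; the section names and placeholder values are assumptions.

```markdown
---
name: hyperliquid-infrastructure
description: Connect to Hyperliquid market data and trading infrastructure.
---

# Hyperliquid Infrastructure

## Endpoints
- Orderbook WebSocket: wss://<provider-endpoint> (details: references/orderbook-websocket.md)
- Info API proxy: https://<provider-endpoint> (details: references/info-api.md)

## Authentication
<how to pass the API key>

## Read/write split
Reads go through provider infrastructure; order placement uses the native exchange API.
```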

Benchmark Design

We built a sandboxed benchmark harness that eliminates information leakage between variants. Each agent runs in an isolated /tmp/ directory containing only a package.json scaffold: no other source files, no neighboring investigation scripts, no .env files to discover.

The Task

All agents received the same complex prompt: build a real-time Multi-Market Dashboard for Hyperliquid. The requirements were deliberately ambitious:

  • Market discovery: Query perpDexs, spotMetaAndAssetCtxs, and metaAndAssetCtxs to find available markets
  • 4-market WebSocket streaming: Subscribe to l2Book updates for BTC-PERP, ETH-PERP, the highest-volume HIP-3 perp, and the highest-volume spot pair
  • Per-market analytics: Volume-weighted mid price, spread in basis points, USD liquidity within 50bps of mid, order book imbalance ratio
  • Cross-market divergence detection: Track the BTC/ETH spread ratio and alert when it deviates beyond 2 standard deviations from its 5-minute rolling mean
  • 30-second dashboard tables: Formatted output with price change and volume estimates
  • Continuous operation with WebSocket reconnection handling
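The per-market analytics in the list above reduce to a few arithmetic passes over the two sides of the book. Here is a minimal Python sketch (the benchmark apps were Node, so this is illustrative only; the volume-weighted mid uses one common definition that weights each best quote by the opposite side's size):

```python
def book_analytics(bids, asks, band_bps=50):
    """Compute mid-price metrics from two sorted sides of an order book.

    bids/asks are lists of (price, size) tuples, best level first.
    """
    best_bid, bid_sz = bids[0]
    best_ask, ask_sz = asks[0]

    # Volume-weighted mid: weight each best quote by the opposite side's
    # size, so the mid leans toward the side with more resting liquidity.
    vw_mid = (best_bid * ask_sz + best_ask * bid_sz) / (bid_sz + ask_sz)

    mid = (best_bid + best_ask) / 2
    spread_bps = (best_ask - best_bid) / mid * 10_000

    # USD value of orders resting within band_bps of the mid, both sides.
    lo = mid * (1 - band_bps / 10_000)
    hi = mid * (1 + band_bps / 10_000)
    liquidity = (sum(p * s for p, s in bids if p >= lo)
                 + sum(p * s for p, s in asks if p <= hi))

    # Imbalance: total bid volume over total ask volume.
    bid_vol = sum(s for _, s in bids)
    ask_vol = sum(s for _, s in asks)
    imbalance = bid_vol / ask_vol if ask_vol else float("inf")

    return {"vw_mid": vw_mid, "spread_bps": spread_bps,
            "liquidity_usd": liquidity, "imbalance": imbalance}
```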

This is not a toy task. It requires understanding Hyperliquid's dual-layer architecture, the distinction between perpetual and spot markets, HIP-3 DEX metadata, WebSocket subscription formats with depth parameters, and statistical calculations for divergence detection.
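The divergence detector amounts to a rolling z-score over the BTC/ETH price ratio. A minimal Python sketch, assuming a sample-count window as a stand-in for the 5-minute rolling window (the actual agents' implementations varied):

```python
from collections import deque
import math

class DivergenceDetector:
    """Alert when the BTC/ETH price ratio strays beyond k standard
    deviations from its rolling mean over the last `window` samples."""

    def __init__(self, window=300, k=2.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def update(self, btc_mid, eth_mid):
        ratio = btc_mid / eth_mid
        self.samples.append(ratio)
        if len(self.samples) < 2:
            return False  # not enough history to estimate a deviation
        n = len(self.samples)
        mean = sum(self.samples) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in self.samples) / n)
        if std == 0:
            return False  # flat history, no meaningful deviation
        return abs(ratio - mean) > self.k * std
```

A production version would evict samples by timestamp rather than count, but the alert condition is the same.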

Round 1: Skill vs No Skill

The first benchmark compared two leading agents, Claude Opus 4.6 and OpenAI Codex (gpt-5.3-codex), with and without the Dwellir Hyperliquid skill.

| Variant | Agent | Model | Skill? | What the agent receives |
|---|---|---|---|---|
| claude-no-skill | Claude Code | Opus 4.6 | No | Task prompt only. No API key, no provider mentioned. |
| claude-skill | Claude Code | Opus 4.6 | Yes | Task prompt + Dwellir skill file + reference docs + API key |
| codex-no-skill | OpenAI Codex | gpt-5.3-codex | No | Task prompt only. No API key, no provider mentioned. |
| codex-skill | OpenAI Codex | gpt-5.3-codex | Yes | Task prompt + Dwellir skill file + reference docs + API key |

The no-skill variants received zero mentions of Dwellir. No API key, no endpoint URLs, no hints about which provider to use. They could search the web and use any public API they found. This creates a clean baseline measuring what each agent knows from training data alone.

The skill variants received the Dwellir Hyperliquid skill file injected into the prompt, the references/ directory copied into their sandbox for on-disk reading, and a Dwellir API key.

Claude Opus 4.6

| Metric | No Skill | With Skill | Difference |
|---|---|---|---|
| Wall time | 391s | 245s | 37% faster |
| Output tokens | 25,970 | 9,403 | 64% fewer |
| Cost | $1.49 | $0.88 | 41% cheaper |
| Code size | 551 lines | 446 lines | 19% smaller |
| Uses Dwellir | No | Yes | - |
| Book depth | 20 levels per side (default) | 50 levels per side | 2.5x deeper |

The skill variant finished in 245 seconds and spent $0.88. The no-skill variant took 391 seconds and cost $1.49 for a less capable result.

The token difference is striking: 64% fewer output tokens with the skill. The agent did not need to generate exploratory code, debug API format issues, or write verbose workarounds. It knew the endpoint URLs, the subscription format, and the available depth parameters from the skill file.

OpenAI Codex (gpt-5.3-codex)

| Metric | No Skill | With Skill | Difference |
|---|---|---|---|
| Output tokens | 15,914 | 14,768 | 7% fewer |
| Est. cost | ~$0.84 | ~$0.68 | ~19% cheaper |
| Tool calls (shell commands) | 24 | 24 | - |
| Web searches | 5 | 0 | 100% fewer |
| Reasoning steps | 33 | 18 | 45% fewer |
| Code size | 491 lines | 520 lines | 6% larger |
| Uses Dwellir | No | Yes | - |
| Book depth | 20 levels per side (public cap) | 50 levels per side | 2.5x deeper |

Codex shows a different pattern. Wall time was similar across both variants, but the quality of work changed significantly: zero web searches needed (versus 5 for the no-skill variant), and 45% fewer reasoning steps. Estimated cost dropped 19% from ~$0.84 to ~$0.68.

The no-skill Codex variant ran 5 web searches against Hyperliquid's documentation to figure out the API format before writing code. The skill variant read the reference docs from disk and went straight to implementation.

Codex costs are estimated from JSONL token counts using published API pricing ($1.75/1M input, $0.4375/1M cached input, $14/1M output).
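The estimate is a straight linear combination of the three token counts at those published rates. As a sketch:

```python
def estimate_cost(input_tokens, cached_input_tokens, output_tokens):
    """Estimate API cost in USD from JSONL token counts, using the
    published rates quoted above: $1.75/1M input, $0.4375/1M cached
    input, $14/1M output."""
    return (input_tokens * 1.75
            + cached_input_tokens * 0.4375
            + output_tokens * 14.0) / 1_000_000
```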

What the Agents Built

Both no-skill agents fell back to Hyperliquid's public WebSocket at api.hyperliquid.xyz/ws. Neither discovered Dwellir endpoints on their own, confirming the sandbox isolation worked. The skill agents connected to Dwellir's dedicated Orderbook WebSocket, which is optimized for order book delivery with edge servers in Singapore and Tokyo.

| Variant | WebSocket Endpoint | Book Depth |
|---|---|---|
| claude-no-skill | wss://api.hyperliquid.xyz/ws (public) | 20 levels per side (default) |
| claude-skill | wss://api-hyperliquid-mainnet-orderbook.n.dwellir.com | 50 levels per side |
| codex-no-skill | wss://api.hyperliquid.xyz/ws (public) | 20 levels per side (capped by public endpoint) |
| codex-skill | wss://api-hyperliquid-mainnet-orderbook.n.dwellir.com | 50 levels per side |

The skill variants produced more focused code. Claude's skill variant was the most concise at 446 lines with robust reconnection handling (10 reconnection-related code sections versus 4 in the no-skill variant). The no-skill variants wrote more defensive code and included extra error handling around API format assumptions they were uncertain about.
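Reconnection handling in these dashboards typically boils down to exponential backoff with a cap. A minimal sketch of the delay schedule (illustrative, not the agents' actual code; the base, cap, and jitter choices are assumptions):

```python
import random

def backoff_delays(attempts, base=1.0, cap=30.0, jitter=False):
    """Yield reconnection delays: base * 2^n, capped at `cap` seconds,
    with optional full jitter to avoid synchronized reconnect storms."""
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        yield random.uniform(0, delay) if jitter else delay
```

On a successful reconnect the attempt counter resets and all four l2Book subscriptions are re-sent.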

Baseline: public WebSocket, 20 levels per side.
Skill-v2: Dwellir Orderbook WebSocket, 50 levels per side.

Round 2: Layering Skills with the Dwellir CLI

After shipping the Dwellir Agent Toolkit, including a CLI built for agent consumption, we ran a second benchmark to test whether combining skills with an agent-friendly CLI produces compounding gains. The updated Hyperliquid skill now includes a section teaching agents to use dwellir endpoints search, dwellir docs search, and dwellir keys list for programmatic infrastructure discovery.

This round tested both Claude Opus 4.6 and OpenAI Codex (gpt-5.3-codex) across four configurations:

| Variant | What the agent receives |
|---|---|
| baseline | Task prompt only. No skill, no CLI, no API key. |
| cli-only | Task prompt + Dwellir API key + instructions to use the dwellir CLI for endpoint discovery. No skill file. |
| skill-v1 | Task prompt + original skill (without CLI section) + reference docs + API key |
| skill-v2 | Task prompt + updated skill (with CLI section) + reference docs + API key |

Claude Opus 4.6

All four Claude variants produced working applications that ran continuously.

| Variant | Wall Time | Output Tokens | Cost | Code Size |
|---|---|---|---|---|
| baseline | 776s | 44,555 | $2.19 | 527 lines |
| cli-only | 400s | 19,321 | $1.22 | 508 lines |
| skill-v1 | 255s | 11,558 | $1.18 | 408 lines |
| skill-v2 | 236s | 8,208 | $0.94 | 393 lines |

The progression tells a clear story. Each layer of tooling produced measurable improvement:

Baseline to cli-only: 48% faster, 44% cheaper. The CLI alone gave the agent a way to discover endpoints and read documentation without web searches, cutting wall time by 376 seconds.

Baseline to skill-v1: 67% faster, 46% cheaper. The original skill (without CLI instructions) eliminated the research phase more effectively than the CLI alone, because the agent received endpoint URLs and subscription formats directly in the prompt.

Baseline to skill-v2: 70% faster, 57% cheaper. The updated skill with CLI instructions produced the best results across every metric. The agent used the skill for immediate infrastructure knowledge and the CLI for dynamic discovery, resulting in 82% fewer output tokens than the baseline.

skill-v1 to skill-v2: 7% faster, 20% cheaper. Adding CLI instructions to an already-effective skill produced incremental improvement on top of the large gains from the skill itself.
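The percentages in this progression follow directly from the table. Recomputing them from the raw wall-time and cost numbers:

```python
def pct_drop(before, after):
    """Percent improvement relative to the earlier value, rounded."""
    return round((before - after) / before * 100)

# Round 2 Claude results: variant -> (wall time in seconds, cost in USD)
runs = {"baseline": (776, 2.19), "cli-only": (400, 1.22),
        "skill-v1": (255, 1.18), "skill-v2": (236, 0.94)}

base_t, base_c = runs["baseline"]
for name in ("cli-only", "skill-v1", "skill-v2"):
    t, c = runs[name]
    print(f"{name}: {pct_drop(base_t, t)}% faster, {pct_drop(base_c, c)}% cheaper")
```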

All four Claude variants running simultaneously. Top-left: baseline, top-right: cli-only, bottom-left: skill-v1, bottom-right: skill-v2.

OpenAI Codex (gpt-5.3-codex)

All four Codex variants also passed, producing working dashboards. Codex does not expose wall time in its output, but token counts are available from the JSONL output. Costs below are estimated using published API pricing ($1.75/1M input, $0.4375/1M cached input, $14/1M output).

| Variant | Tool Calls | Output Tokens | Est. Cost | Code Size |
|---|---|---|---|---|
| baseline | 30 | 17,223 | ~$0.97 | 555 lines |
| cli-only | 43 | 15,106 | ~$0.72 | 487 lines |
| skill-v1 | 29 | 15,219 | ~$0.74 | 548 lines |
| skill-v2 | 28 | 19,749 | ~$0.84 | 498 lines |

Codex shows a different pattern from Claude. Cost differences are smaller (all variants between $0.72 and $0.97), and the cheapest configuration was cli-only rather than skill-v2. Codex processes tasks in fewer turns with larger tool call batches, so it benefits less from the progressive disclosure pattern of skills. The most consistent signal across both agents: the baseline was the most expensive variant every time. Skills and CLI access reliably reduced the cost of building the same application.

Training Data vs Structured Knowledge

The two agents showed opposite training-data biases. Claude's baseline used only public Hyperliquid endpoints and never mentioned Dwellir. Codex's baseline included Dwellir endpoint URLs in its generated code, but without an API key it fell back to the public WebSocket. Both baselines produced working applications regardless. The skill-equipped variants of both agents connected to Dwellir's dedicated Orderbook WebSocket with correct authentication.

Even when a model already "knows" about a provider from training data, structured skill files plus CLI access still cut Claude's cost by 57% and time by 70%. The agent spends zero tokens on research and discovery when the answers are in the skill.

Why Skills and CLIs Compound

The performance gap comes down to three factors, each amplified when skills and CLIs work together:

Eliminated research phase. The no-skill variants spent significant time and tokens searching the web, reading documentation, and experimenting with API calls before writing code. Skill variants went from prompt to implementation immediately because the reference docs provided endpoint URLs, authentication methods, subscription formats, and depth parameters. The CLI variant discovered the same information through structured commands, but the skill-v2 variant had both paths available and used whichever was faster for each subtask.

Fewer wrong turns. Without the skill, Claude generated exploratory code to test API responses, then rewrote sections when the format differed from expectations. With the skill, each function was written correctly on the first attempt because the reference docs include request/response examples. The CLI added a verification layer: agents could confirm endpoint availability before writing code against them.

Progressive disclosure over context flooding. Agent skills use a three-tier loading system. Metadata loads at startup (~600 tokens), core instructions load on-demand (~2,000-5,000 tokens), and detailed references load only when needed. This avoids the context pollution problem where dumping everything into the prompt degrades model performance. Research on context engineering shows skills achieve equivalent extensibility with 96% less context overhead compared to static tool loading. The CLI extends this pattern by letting agents query for specific information at runtime rather than loading it all upfront.
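The three-tier pattern is essentially lazy loading: only metadata stays resident, and heavier tiers enter the context on demand. An illustrative sketch (the class and method names are ours, not the Agent Skills spec):

```python
class Skill:
    """Three-tier skill loader: metadata always resident, core
    instructions loaded when the skill activates, reference files
    loaded individually as needed."""

    def __init__(self, metadata, core_text, reference_files):
        self.metadata = metadata       # ~hundreds of tokens, always loaded
        self._core = core_text         # loaded only when the skill fires
        self._refs = reference_files   # dict of name -> text, loaded per query
        self.context = [metadata]      # what actually occupies the prompt

    def activate(self):
        """Pull core instructions into context on first use."""
        if self._core not in self.context:
            self.context.append(self._core)

    def load_reference(self, name):
        """Pull one reference doc into context, leaving the rest on disk."""
        text = self._refs[name]
        if text not in self.context:
            self.context.append(text)
        return text
```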

Try It Yourself

Dwellir's Hyperliquid infrastructure skill is open source and works with Claude Code, OpenAI Codex, Cursor, and 40+ other agents that support the Agent Skills standard. It covers all five Dwellir Hyperliquid services: HyperEVM JSON-RPC, Info API proxy, gRPC Gateway, Orderbook WebSocket, and dedicated nodes.

Install the skill in your project:

```bash
npx skills add dwellir-public/hyperliquid-skills
```

Install the Dwellir CLI:

```bash
curl -fsSL https://raw.githubusercontent.com/dwellir-public/cli/main/scripts/install.sh | sh
```

The benchmark harness is open source. You can reproduce these results or run your own skill comparisons.

Get started with Dwellir's Hyperliquid infrastructure.


© Copyright 2026 Dwellir AB