Claude Code 4.7 Front-End Demo Review: AgentHub AI Dashboard (vs GPT-5.5 Head-to-Head)

Same 1,500-word prompt handed to Claude Code 4.7 (Opus 4.7 / Sonnet 4.6) and Codex/GPT-5.5. Claude passed 9/9 delivery checks, shipped a 599KB main chunk (170KB gzip), and built 2,388 modules in 2.42s. A side-by-side, numbers-driven Claude vs GPT-5.5 review on a real AgentHub AgentOps dashboard — covering delivery checklist, custom-control density, mobile edges, state coupling depth, code structure, and bundle size. Also useful when comparing Claude Code with Cursor or the Codex CLI for SaaS dashboard prototyping.

Tags: Claude Code 4.7, Anthropic Claude Code, Claude vs GPT-5.5, Claude vs Cursor, Claude vs Codex CLI, Claude Sonnet 4.6, Claude Opus 4.7, 1M Context, AI Coding Agent Comparison, agentic coding, AgentOps Dashboard, AI Generated React UI, AI Generated Tailwind Dashboard, Vite React TypeScript, SaaS Prototype Review, LLM Coding Benchmark

This is the second half of a controlled experiment. The same ~1,500-word prompt I gave to Codex/GPT-5.5 in the previous AgentHub review now goes to Claude Code 4.7. Same product brief, same boundaries, same delivery requirements. The question: under identical input, how far does the latest agentic-coding generation move the bar?

What Search Intent This Review Covers

How good is Claude Code 4.7 at front-end work? Start from an empty directory and let it deliver a Vite + React + TypeScript AgentOps dashboard end-to-end, hitting 9/9 acceptance checks. Then evaluate interactions, mobile behavior, code structure, and build output.
Claude vs GPT-5.5: which is better for SaaS demos? Same 1,500-word prompt, same delivery checklist, both agents executed once. Side-by-side numbers on 9 acceptance checks, bundle size, and custom-control density.
What metrics matter when comparing AI coding agents? Use a delivery checklist, mobile-edge handling, depth of state coupling, custom-control density, and bundle size to quantify the real gap, not just screenshot quality.
How does Claude Code 4.7 compare with Cursor or the Codex CLI? Working backwards from this AgentHub run (code structure, bundle size, accessibility defaults, engineering choices) sketches where Claude Code sits relative to the Cursor IDE and the Codex CLI for SaaS prototyping.

1. Experiment Setup

To make the two runs comparable, I did not change a single character of the prompt. Same product context, same page sections, same interaction list, same design constraints, same code-quality bar, same delivery requirements. The execution environment this time is the Claude Code 4.7 CLI agent, starting from an empty directory, creating its own Vite + React + TypeScript project from scratch.

2. What It Delivered

Starting from an empty directory, Claude Code 4.7 generated a Vite + React + TypeScript + Tailwind project, pulled in only lucide-react and recharts, and skipped shadcn entirely. The result was 9 feature components organized across 8 folders, plus a 15.3KB structured mock-data file.

Desktop screenshot: a 6-agent left roster, 5 KPI cards on top, stacked area + stacked bar charts in the middle, risks panel on the right, kanban and timeline below. Restrained dark palette with a teal accent — none of the "single blue-purple gradient" the prompt explicitly forbade.
  • Delivery checks: 9
  • Fully passed: 9
  • Main JS chunk: 599KB
  • Production build: 2.42s
Top navigation: AgentHub branding, custom project dropdown (3 projects with repo and branch), Today / 7 Days / 30 Days switcher, search button, theme toggle, active-agent counter, CL avatar.
Agent list: 6 agents (one extra: Refactor Surgeon). Each carries a personal name (Atlas / Pixel / Forge / Sentry / Lens / Mason), role, status, model attribution (opus-4.7 / sonnet-4.6 / haiku-4.5), current task, tokens, cost, and progress bar.
Core metrics: All 5 metrics implemented. Each card has a trend chip (up / down / warn / flat), delta label, and a contextual hint such as "Mason idle since 38m".
Project kanban: Backlog / In Progress / Review / Done columns. Cards include T-2041-style IDs, p0/p1/p2 priority pills, assignee avatar initial, file count, checks (passed/failed/pending icon counters), and relative time.
Charts: Stacked area "Agent token usage" with a Tokens/Cost toggle; stacked bar "Task throughput" by completed/reviewed/failed. Both have grid, custom tooltip, and legend, fully aligned with the card design system.
Risks and timeline: 6 risks with multi-select severity pills (high/medium/low) and a hover-revealed dismiss button. The timeline includes a connecting vertical line and per-status icons.
Detail drawer: Two modes, one for agents and one for tasks. Each shows a status pill, current-task progress, a three-stat row, recent activity, open files, active risks, and a Pause / Reassign / View runs / Restart action row. ESC closes; opening locks body scroll.
Responsive behavior: Sidebar + main on desktop; mobile flips to a top agent strip + single-column stack. The kanban becomes 1 column, 2 at md, 4 at xl (see the sketch after this list). At 393px width I saw no horizontal overflow or clipping.
Build verification: npm run build finished in 2.42s with 2,388 modules, a 599.24KB main JS chunk (gzip 170.40KB), and 26.49KB of CSS. Vite warned that the chunk exceeds 500KB, the same caveat as the GPT-5.5 run.
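As promised in the responsive-behavior item, the kanban breakpoints boil down to a single Tailwind grid declaration. A minimal sketch, with class and component names assumed rather than taken from the generated source:

```tsx
import type { ReactNode } from "react";

// Hypothetical sketch: 1 kanban column on mobile, 2 from the md
// breakpoint, 4 from xl, matching the behavior described above.
export function KanbanBoard({ columns }: { columns: ReactNode[] }) {
  return (
    <div className="grid grid-cols-1 gap-4 md:grid-cols-2 xl:grid-cols-4">
      {columns.map((column, i) => (
        <section key={i}>{column}</section>
      ))}
    </div>
  );
}
```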

3. What It Got Right

First: it closed the gaps GPT-5.5 left behind

The two loudest complaints from the GPT-5.5 run were a native <select> for status changes and right-edge clipping in the mobile agent strip. Under the same prompt, Claude Code 4.7 spontaneously built a custom dropdown menu for status moves (with a "Move to" header, CircleDot icons, and the current state marked by Check), and added HTML5 drag-and-drop on top: the target column highlights on drag-over, and the source card goes semi-transparent during dragging. The mobile agent rail was implemented as overflow-x-auto no-scrollbar inside a strict overflow-x-hidden main container, so the 393px viewport produced zero horizontal overflow. Neither of these was a new requirement in the prompt; Claude decided on its own that those spots needed polish.
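For readers who want the mechanics: the drag-over highlight and the semi-transparent source card fall out of four HTML5 drag events. A minimal sketch of that pattern, with hypothetical component and prop names (this is not the generated code):

```tsx
import { useState } from "react";

// Drop target: highlight while a card is dragged over it, accept the drop.
function KanbanColumn({
  status,
  onMove,
}: {
  status: string;
  onMove: (taskId: string, status: string) => void;
}) {
  const [isOver, setIsOver] = useState(false);
  return (
    <div
      className={isOver ? "ring-2 ring-teal-500/60" : ""}
      onDragOver={(e) => {
        e.preventDefault(); // required, or the drop event never fires
        setIsOver(true);
      }}
      // A real version would guard against dragleave flicker on children.
      onDragLeave={() => setIsOver(false)}
      onDrop={(e) => {
        setIsOver(false);
        onMove(e.dataTransfer.getData("text/plain"), status);
      }}
    >
      {/* cards render here */}
    </div>
  );
}

// Drag source: stash the task id on drag start, dim the card in flight.
function TaskCard({ id }: { id: string }) {
  const [dragging, setDragging] = useState(false);
  return (
    <div
      draggable
      className={dragging ? "opacity-50" : ""}
      onDragStart={(e) => {
        e.dataTransfer.setData("text/plain", id);
        setDragging(true);
      }}
      onDragEnd={() => setDragging(false)}
    >
      {id}
    </div>
  );
}
```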

Second: mock data reads like a real project

Six agents carry personal names (Atlas, Pixel, Forge, Sentry, Lens, Mason) with distinct Claude model labels (opus-4.7 / sonnet-4.6 / haiku-4.5), echoing how a real AgentOps platform routes complex vs simple work to different model tiers. Task IDs run T-2041 to T-2049 plus three done-state tasks, file paths are concrete strings like src/checkout/PaymentForm.tsx, and risk numbers ("Coverage fell from 82.1% to 77.9% after PaymentForm rewrite — 3 new branches lack tests") line up with diff stats in the activity timeline ("Updated PaymentForm.tsx (+412 / −287)"). The internal consistency is a step beyond GPT-5.5's already-good run.
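A plausible shape for that dataset, reconstructed from the review rather than copied from the source (field names are my assumptions):

```ts
type ModelTier = "opus-4.7" | "sonnet-4.6" | "haiku-4.5";
type AgentStatus = "running" | "idle" | "blocked" | "reviewing" | "done";

interface Agent {
  name: string;        // "Atlas", "Pixel", "Forge", ...
  role: string;        // "Planner", "Code Reviewer", ...
  status: AgentStatus;
  model: ModelTier;    // the routing tier for this agent's work
  currentTask: string; // a T-20xx task id
  tokens: number;      // tokens consumed in the selected time range
  costUsd: number;
  progress: number;    // 0..100 for the card's progress bar
}
```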

Third: the theming system is engineered, not hardcoded

Instead of a binary light / dark swap, every color is exposed as an RGB triplet (--bg-base: 247 248 250), so Tailwind can do bg-status-running/10 with proper alpha derivation. Status colors (running / idle / blocked / reviewing / done) live in the same variable layer — re-skinning the entire app is two :root blocks. The prompt didn't ask for this, but Claude went there anyway. Combined with a useTheme hook that reads prefers-color-scheme, persists to localStorage, and toggles a class="dark" on the html element, theme switching is flicker-free.
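The Tailwind side of that pattern is worth spelling out. A minimal sketch of the config that makes bg-status-running/10 work, assuming the variable names from the description above rather than the generated file:

```ts
// tailwind.config.ts (sketch)
import type { Config } from "tailwindcss";

export default {
  content: ["./index.html", "./src/**/*.{ts,tsx}"],
  theme: {
    extend: {
      colors: {
        // With :root declaring e.g. `--bg-base: 247 248 250;`, Tailwind's
        // `<alpha-value>` placeholder lets utilities like
        // bg-status-running/10 derive opacity from the same triplet.
        "bg-base": "rgb(var(--bg-base) / <alpha-value>)",
        "status-running": "rgb(var(--status-running) / <alpha-value>)",
      },
    },
  },
} satisfies Config;
```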

Fourth: interaction polish above the demo bar

Opening a drawer locks document.body.style.overflow; drawers close on ESC or backdrop click and slide in via animate-slideIn. The project dropdown listens for outside mousedown; the risk dismiss button appears only on hover so the main view stays calm; running-status dots get a pulseRing animation; KPI numbers use font-variant-numeric: tabular-nums for stable column alignment. Each item alone is small. Together they make the page feel like the engineer who built it actually ships product.
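Two of those behaviors, ESC-to-close and the body scroll lock, fit in one small hook. A sketch under my own naming, not the generated code:

```tsx
import { useEffect } from "react";

// Lock background scroll and close on Escape while a drawer is open.
function useDrawerChrome(open: boolean, onClose: () => void) {
  useEffect(() => {
    if (!open) return;
    const onKey = (e: KeyboardEvent) => {
      if (e.key === "Escape") onClose();
    };
    const previous = document.body.style.overflow;
    document.body.style.overflow = "hidden";
    window.addEventListener("keydown", onKey);
    return () => {
      // Restore whatever overflow value the page had before opening.
      document.body.style.overflow = previous;
      window.removeEventListener("keydown", onKey);
    };
  }, [open, onClose]);
}
```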

4. Where It Still Falls Short

Main chunk slightly larger than in the GPT-5.5 run

Bundle size came in at 599.24KB raw (170.40KB gzip), about 38KB larger than GPT-5.5's 561KB. The reason is mechanical: 6 agents + multi-project dropdown + dual-mode chart + drag-and-drop logic all land in a single entry. Vite explicitly warned that "Some chunks are larger than 500 kB". The fix is obvious (tree-shake Recharts, lazy-load the Drawer), but Claude didn't run that pass on its own. Like GPT-5.5, it stopped at "demo-acceptable."
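That missing pass is only a few lines of work. A sketch of the lazy-loaded drawer, assuming a hypothetical file path and a default export:

```tsx
import { Suspense, lazy } from "react";

// Code-split the drawer so its Recharts-heavy detail views stay out of
// the main chunk until a user actually opens one.
const AgentDrawer = lazy(() => import("./components/drawer/AgentDrawer"));

function DrawerHost({ open }: { open: boolean }) {
  if (!open) return null; // nothing is fetched until the first open
  return (
    <Suspense fallback={null}>
      <AgentDrawer />
    </Suspense>
  );
}
```

Vite's build.rollupOptions.output.manualChunks would do the equivalent for vendor code like the chart library.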

Drag-and-drop is HTML5-native, no touch fallback

Mouse drag on desktop is smooth, but HTML5 native draggable doesn't really work in iOS Safari. Claude did add a per-card "..." menu with a "Move to →" submenu so touch users aren't blocked, but on mobile the drag interaction is effectively decorative. A production version would need @dnd-kit/core or similar for pointer events. The prompt only asked for "drag or status switch — at minimum the status must change," so technically nothing was missed; but the gap to production is real.
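For the production path, the @dnd-kit swap looks roughly like this; a skeleton only, with my own handler names:

```tsx
import {
  DndContext,
  PointerSensor,
  useSensor,
  useSensors,
  type DragEndEvent,
} from "@dnd-kit/core";

// Pointer events cover mouse, pen, and touch; the activation constraint
// keeps ordinary taps from starting accidental drags.
function Board({
  onMove,
}: {
  onMove: (taskId: string, column: string) => void;
}) {
  const sensors = useSensors(
    useSensor(PointerSensor, { activationConstraint: { distance: 8 } })
  );

  const handleDragEnd = (e: DragEndEvent) => {
    // `over` is the droppable (column) the card was released on, if any.
    if (e.over) onMove(String(e.active.id), String(e.over.id));
  };

  return (
    <DndContext sensors={sensors} onDragEnd={handleDragEnd}>
      {/* columns registered with useDroppable, cards with useDraggable */}
    </DndContext>
  );
}
```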

No focus trap inside the drawer

Keyboard navigation works — buttons, roles, and aria attributes are present, and .focus-ring is defined as a utility class — but Tab can leak out of an open drawer back into the underlying page. Doesn't show up in casual demos, but any SaaS product that goes through accessibility review will get flagged here. It's the kind of "easy-to-add but skipped" item that separates a demo from a shipped feature.
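A minimal trap is not much code either. A sketch of the wrap-around Tab handling, simplified relative to what a library like focus-trap-react covers:

```tsx
import { useEffect, useRef } from "react";

// Keep Tab cycling inside the drawer instead of leaking to the page.
function useFocusTrap(active: boolean) {
  const ref = useRef<HTMLDivElement>(null);
  useEffect(() => {
    if (!active || !ref.current) return;
    const root = ref.current;
    const onKey = (e: KeyboardEvent) => {
      if (e.key !== "Tab") return;
      const focusables = root.querySelectorAll<HTMLElement>(
        'a[href], button:not([disabled]), input, select, textarea, [tabindex]:not([tabindex="-1"])'
      );
      if (focusables.length === 0) return;
      const first = focusables[0];
      const last = focusables[focusables.length - 1];
      // Wrap focus at both ends of the drawer.
      if (e.shiftKey && document.activeElement === first) {
        e.preventDefault();
        last.focus();
      } else if (!e.shiftKey && document.activeElement === last) {
        e.preventDefault();
        first.focus();
      }
    };
    root.addEventListener("keydown", onKey);
    return () => root.removeEventListener("keydown", onKey);
  }, [active]);
  return ref;
}
```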

Mobile screenshot, Pixel 5 viewport (393px): the top agent strip scrolls horizontally without right-edge clipping, KPI cards collapse to 2 columns, the kanban switches to vertical stacking, and the chart legend wraps gracefully. The whole page is consumable by scrolling top-to-bottom, with no horizontal scroll required. This is one of the clearest deltas vs the GPT-5.5 run.

5. Same Follow-up Interaction Test, This Run

In the GPT-5.5 review I ran a combined operation: move a backlog task to In Progress, switch to 7 Days, then filter to high risk only. I ran the same combo here. The kanban accepted both drag and the "..." menu — during drag the source card went semi-transparent and the target column showed an accent border. The 7-day switch recomputed the Active Agents scale, Estimated Cost, and token-usage chart together. Severity pills allow multi-select and full deselect; if you turn all three off, the filter resets back to "all on" so the list never collapses to empty.

Reading the source, those state slices live at the App level via useState, with useMemo deriving series / throughput / metrics and filteredRisks. Like GPT-5.5, no Redux or Zustand. Unlike GPT-5.5, the line between source data and derived data is sharper here: kanban tasks are mutated through setTasks, risk dismissals through setRisks, severity filtering only touches the UI layer. That clarity is exactly what makes the difference between a demo that can plug into a real backend and one that has to be rewritten for it.
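That layering reduces to a pattern worth copying. A sketch with assumed types and names, mirroring the described behavior rather than the actual source:

```tsx
import { useMemo, useState } from "react";

type Severity = "high" | "medium" | "low";
interface Risk { id: string; severity: Severity; dismissed: boolean }

const ALL: Severity[] = ["high", "medium", "low"];

function useRiskFilter(initial: Risk[]) {
  const [risks, setRisks] = useState(initial);            // source data
  const [severities, setSeverities] = useState<Severity[]>(ALL);

  const toggleSeverity = (s: Severity) =>
    setSeverities((prev) => {
      const next = prev.includes(s)
        ? prev.filter((x) => x !== s)
        : [...prev, s];
      // Turning every pill off resets to "all on", so the list never
      // collapses to empty.
      return next.length === 0 ? ALL : next;
    });

  // Derived data: filtering never mutates the source array.
  const filteredRisks = useMemo(
    () =>
      risks.filter((r) => !r.dismissed && severities.includes(r.severity)),
    [risks, severities]
  );

  return { filteredRisks, toggleSeverity, setRisks };
}
```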

The boundary, again, is honest: refresh resets everything, the Pause / Reassign buttons in the drawer carry no handlers, and the activity timeline does not gain a new entry when I drag a task. "Internal state coupling" was implemented; "operation logging" and "persistence" were left for the next phase.

6. Same Prompt, Two Deliveries — Side by Side

Putting the two runs side by side, the pattern is visible:

  • GPT-5.5: 8 passes, 1 partial (mobile clipping). Status changes via native select. 5 agents, 1 project. Main chunk 561KB.
  • Claude Code 4.7: 9 passes. Status changes via custom menu + drag. 6 agents with model attribution. 3 switchable projects. Main chunk 599KB (+38KB).

Neither has a fatal flaw. Both reach demo-grade — clickable, screenshot-able, ready for product discussions. GPT-5.5 already meets the "build a respectable workspace" bar; Claude Code 4.7 takes it further on "after building, polish a few production details" — most visibly by closing the GPT-5.5 gaps (native select, mobile clipping) and by making the theme system, state layering, and a11y attributes more engineered.

The cost is bundle size and complexity. If the goal is "runnable enough to demo in a meeting," both agents qualify. If the goal is "ship to a design and PM review tomorrow where someone will pick on details," the Claude Code 4.7 output keeps the discussion on product logic instead of "why does this select look wrong?"

My more subjective day-to-day read is this: I still trust Codex more for backend work, code search, bug isolation, and following a messy call chain through an existing repository. It behaves more like a backend engineer willing to trace the problem all the way down. But for front-end pages, where visual judgment, layout, interaction details, and product feel have to land together, Claude Code tends to be more sensitive. This comparison shows the same pattern: Claude Code 4.7 was more proactive on custom controls, mobile edges, visual hierarchy, and making the page feel like a real product. I would not present that as a benchmark result. It is a usage preference: for backend debugging I would call Codex first; for front-end prototypes and page polish I would let Claude Code take the first pass.

One variable that should not be ignored: prompt density. The 1,500-word set of boundaries itself decides roughly 60% of the model's judgment calls. Hand either model "build an AgentOps console" as a one-liner and the gap probably widens: weaker models improvise where the prompt is silent, stronger models fill in sensible defaults. The denser the prompt, the more the differences collapse into engineering taste; the looser the prompt, the more they spill back into product judgment. This experiment ran with a dense prompt, so the gap shows up mostly at the engineering layer rather than at the product-shape layer.

Search Questions

How big is the gap between Claude Code 4.7 and Codex/GPT-5.5 on front-end work?

In this 1,500-word prompted test the gap is not huge but it has direction: Claude Code 4.7 passed all 9 delivery checks; GPT-5.5 passed 8 with 1 partial (mobile clipping). Claude proactively built a custom dropdown + drag-and-drop status changer, closing the native-select issue from the GPT-5.5 run. The price is a 38KB heavier main chunk (599KB vs 561KB) because 6 agents, multi-project switching, and a dual-mode chart all land in one entry.

Is Claude Code 4.7 a good fit for SaaS dashboard prototypes?

Based on this AgentHub run, yes. It bootstrapped Vite + React + TypeScript + Tailwind from an empty directory, organized 9 components across 8 feature folders, and produced a 15.3KB internally-consistent mock dataset. Production build was 2.42s with 2,388 modules and a 170KB gzip main chunk. Production-readiness still needs Recharts on-demand imports, lazy Drawer, focus trap, and a touch-friendly drag library — but for product reviews this is more than enough.

Why is the Claude Code 4.7 main chunk larger than the GPT-5.5 one?

Because Claude shipped more functionality. Six agents (it added a Refactor Surgeon), three switchable projects, a dual-mode token/cost chart, and HTML5 drag logic all land in the same entry, taking the main chunk from 561KB to 599KB. It is not less efficient code; it is a richer feature surface under the same prompt. Manual chunk splitting (e.g. Rollup's manualChunks in the Vite config) or lazy imports would close this gap.

How does Claude Code 4.7 handle theming and color systems?

It exposes all colors as RGB triplets (e.g. --bg-base: 247 248 250) rather than hex strings, so Tailwind utilities like bg-status-running/10 can derive alpha properly. Status colors (running / idle / blocked / reviewing / done) all live in the CSS variable layer. Paired with a useTheme hook that reads prefers-color-scheme, persists to localStorage, and toggles class="dark", switching is flicker-free. The prompt did not ask for this — it was a self-initiated engineering choice.

What is different about the Claude Code 4.7 mock data?

Six agents carry personal names (Atlas, Pixel, Forge, Sentry, Lens, Mason) tagged with different Claude models (opus-4.7 / sonnet-4.6 / haiku-4.5), echoing how a real AgentOps platform would route by task complexity. Tasks, risks, and activity events line up — coverage drop percentages in the risk panel match diff stats in the activity timeline for the same PaymentForm.tsx change. The internal consistency is sharper than the GPT-5.5 run.

What is the catch with AI-coded kanban drag-and-drop?

Claude Code 4.7 used HTML5 native draggable. Desktop mouse drag is smooth, but iOS Safari barely supports it. It did add a "..." menu fallback for status changes so touch users are not blocked, but on mobile drag is effectively decorative. A production version needs @dnd-kit/core or similar with pointer-event support.

How much does prompt detail affect AI coding agent results?

A lot. This experiment used a 1,500-word prompt covering product, page, interaction, design, code-quality, and delivery — that prompt alone makes ~60% of the modeling decisions. The denser the prompt, the more model differences collapse into engineering taste and code style. Hand either model a one-liner like "build an AgentOps console" and the Claude vs GPT-5.5 gap would widen significantly.

How does Claude Code 4.7 handle accessibility?

Basic aria attributes, role labels, and a focus-ring utility are all present, and keyboard navigation reaches every interactive element. However, the open Drawer has no focus trap — Tab leaks out to the underlying page, which any serious a11y audit will flag. It is the kind of "easy-to-add but skipped" item that the prompt did not explicitly require.

How does Claude Code 4.7 compare with Cursor and the Codex CLI?

Cursor is an IDE — its strength is inline completion plus repo-as-context. Codex CLI and Claude Code 4.7 are both command-line agents — their strength is taking a task, editing multiple files, running checks, and explaining results. In this test, Claude Code 4.7 and Codex/GPT-5.5 both delivered the AgentHub dashboard end-to-end under the same prompt; the differences landed at engineering defaults (custom controls, theme system, mobile edges). If you already write in Cursor and need "run a full task autonomously," reach for Claude Code or Codex CLI; if you want "I write, AI assists inline," Cursor is still more comfortable.

Anthropic Claude vs OpenAI GPT for coding — which is better?

No silver-bullet answer. In this prompted comparison, Claude Code 4.7 (Opus 4.7 + Sonnet 4.6 under the hood) was more proactive on custom-control density, mobile edges, and accessibility defaults. Codex/GPT-5.5 produced a slightly smaller bundle and equally sound structural decisions. By 2026 the gap has collapsed to "engineering taste" and "default productization depth" rather than "can it ship at all." Pick by whose default style is closer to your team — not by which model is "smarter."

Can AI-generated front-end code ship to production directly?

It can demo, discuss, and screenshot — not ship. This Claude Code 4.7 delivery has a 599KB main chunk that Vite flags, no focus trap inside the drawer, HTML5 drag that breaks on iOS Safari, no persistence, and action buttons without real handlers. Those are the engineering finishing layers between a prototype and production. My take: AI coding agents take front-end prototypes from 0 to 0.7. The remaining 0.7 to 1.0 (bundle splits, a11y, touch fallback, state persistence, error boundaries, telemetry) still needs engineers.

Claude Sonnet 4.6 vs Opus 4.7 — which to use for coding?

In this AgentHub mock, Claude tagged Atlas (Planner) and Lens (Code Reviewer) as opus-4.7, the executor agents (Pixel, Forge, Mason) as sonnet-4.6, and the test runner (Sentry) as haiku-4.5 — mirroring the cost-routing pattern the community already uses: architecture judgment and PR review on Opus, routine work on Sonnet, mechanical work on Haiku. For day-to-day Claude Code, default to Sonnet 4.6 (best price/perf) and switch to Opus 4.7 for hard reasoning or multi-file coordination. The gap is in "judgment depth" and "cross-context reasoning," not single-file edits.

What is the Claude Code 1M context window actually useful for?

Claude Code 4.7 supports a 1M-token context window. For a 10-file front-end demo it is overkill — no measurable benefit. The real use cases are large repo refactors, state-machine rewrites spanning 10+ files, and feeding an entire SDK doc plus codebase together for a compliance pass. This AgentHub run used under 30K tokens to deliver 9 components and a 15.3KB mock dataset; the 1M window was unused. The practical value is that engineers no longer have to pre-plan for context overflow — the agent can decide what to read.