DeepSeek V4 And The Cost Of Remembering: 1M Tokens, Selective Attention, And A China-Chip Pivot

Long context isn’t a feature. It’s a bill. DeepSeek’s V4 is about making that bill smaller.

DeepSeek’s V4 preview is being pitched as open, cheap, and capable — with a one‑million‑token context window and architectural tweaks aimed at making ‘remembering’ less computationally ruinous. The interesting part isn’t the bragging rights. It’s the cost curve, and the hardware politics underneath it.

What DeepSeek actually shipped (and what it claims)

MIT Technology Review reports that DeepSeek released a preview of V4 in two variants (V4‑Pro and V4‑Flash), both offering reasoning modes and a stated 1‑million‑token context window. Engadget highlights the same headline: the pitch is “cost‑effective” long context and continued “open” availability (downloadable and modifiable, per the company).

DeepSeek also claims aggressive pricing — MIT Technology Review cites $1.74/$3.48 per million input/output tokens for V4‑Pro, and roughly $0.14/$0.28 for V4‑Flash — positioning the model as a cost weapon as much as a capability statement.
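A quick sanity check on what those list prices buy you. The sketch below just multiplies the cited per‑million‑token rates by an assumed workload (a full 1M‑token prompt plus a 2K‑token reply, both invented for illustration), and it assumes the list price applies unchanged at full context, which DeepSeek hasn’t spelled out.

```python
# Back-of-envelope cost of one long-context call at the cited list prices.
# The 1M-input / 2K-output workload is an illustrative assumption, not a benchmark.
PRICES_USD_PER_MTOK = {  # (input, output) per million tokens
    "V4-Pro":   (1.74, 3.48),
    "V4-Flash": (0.14, 0.28),
}

def call_cost(model, input_tokens, output_tokens):
    in_rate, out_rate = PRICES_USD_PER_MTOK[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

for model in PRICES_USD_PER_MTOK:
    print(f"{model}: ${call_cost(model, 1_000_000, 2_000):.2f} per full-window call")
```

Under those assumptions, filling the whole window once comes to roughly $1.75 on V4‑Pro and about $0.14 on V4‑Flash, which is why “cost weapon” isn’t hyperbole if the prices survive contact with production traffic.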

The non-obvious thing: long context becomes strategic when attention stops being a quadratic pain

Everyone can slap “1M tokens” on a slide deck now. The differentiator is whether you can do it without turning inference into a bonfire. MIT Technology Review describes V4’s core move as selective attention: compress older context, keep nearby text “full fidelity,” and focus compute on the bits most likely to matter.

DeepSeek says that, at a 1M‑token context, V4‑Pro uses 27% of the compute and 10% of the memory of its prior model (V3.2), with even larger reductions claimed for V4‑Flash. If those numbers hold in real deployments (big “if”), this is the difference between long context being a premium gimmick and long context being default infrastructure.
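DeepSeek hasn’t published the mechanism in enough detail here to reproduce it, so treat the following as a minimal sketch of the general selective‑attention idea (keep a recent window of keys and values at full resolution, compress everything older into per‑block summaries, then attend over the much smaller set). Every name, shape, and the mean‑pooling choice below are assumptions for illustration, not V4’s actual design.

```python
import numpy as np

# Sketch of "selective attention": recent tokens stay at full fidelity, older
# context is compressed into one summary vector per block, so attention cost
# scales with (recent + summaries) rather than full sequence length.
# Mean-pooling is a stand-in compressor; DeepSeek's real operator is unknown.

def compress_old_context(keys, values, block=64):
    """Mean-pool older key/value pairs into one summary vector per block."""
    n = (len(keys) // block) * block
    k = keys[:n].reshape(-1, block, keys.shape[-1]).mean(axis=1)
    v = values[:n].reshape(-1, block, values.shape[-1]).mean(axis=1)
    return k, v

def selective_attention(query, keys, values, recent=1024, block=64):
    # Split the context: the old part gets compressed, the recent part stays exact.
    old_k, old_v = compress_old_context(keys[:-recent], values[:-recent], block)
    k = np.concatenate([old_k, keys[-recent:]])
    v = np.concatenate([old_v, values[-recent:]])
    scores = k @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

d, seq_len = 64, 100_000
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(seq_len, d)), rng.normal(size=(seq_len, d))
out = selective_attention(rng.normal(size=d), keys, values)
print(out.shape)  # (64,) -- attended over ~2.5K entries instead of 100K
```

The point of the toy numbers: attention here runs over roughly 2.5K entries instead of 100K. That is the shape of saving that could plausibly sit behind a claim like “27% of the compute, 10% of the memory” at 1M tokens, whatever the real compression operator turns out to be.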

The other non-obvious thing: ‘open’ is colliding with ‘chips’

V4 is also framed as DeepSeek’s first model optimized for domestic Chinese chips (notably Huawei’s Ascend), and MIT Technology Review calls out Huawei’s statement that Ascend 950‑series “supernode” products will support V4. The geopolitical read is obvious: export controls pushed China toward a parallel stack. The operational read is the interesting one: inference is where domestic chips may be “good enough” sooner than training, which means a lab can shift its cost base and pricing strategy even if training still leans on Nvidia elsewhere.

So you get a neat two‑layer strategy: lower the cost of memory (architecturally), then lower the cost of compute (supply chain). If you’re trying to build an ecosystem around your model, that’s how you turn “open weights” into “open distribution.”

The Singularity Soup Take

The next model war isn’t “who’s smartest,” it’s “who can afford to remember.” DeepSeek is trying to turn long-context from a luxury into a baseline — and it’s doing it in the most 2026 way possible: architecture tweaks plus a chip-alignment story that doubles as industrial policy.

What to Watch

  • Real-world long-context reliability. Does the compression approach preserve the right details, or does it quietly eat the evidence you needed?
  • Developer uptake at the tooling layer. Watch whether “optimized for agent frameworks” turns into real integrations and tutorials — not just benchmark charts.
  • Ascend at scale. If Huawei supernodes ship in volume (and work), pricing pressure on closed models intensifies fast.