RTX Spark and how every Nvidia GPU outside the datacenter is cut down except for one

Nvidia just announced the RTX Spark today. Because it is being presented as an AI computer, it prompted me to write about a key Nvidia product-design choice that is relatively unknown in the space: across Nvidia Blackwell GPUs, Tensor Cores are deliberately cut down in silicon in every product line you can buy outside a datacenter, except for a single little-known product. Let’s work through the math.

Nvidia has doubled datacenter Tensor Core FLOP performance per clock cycle every generation from Ampere to Hopper to Blackwell. Datacenter Blackwell now does 2048 FLOP/clk per Tensor Core (TC) at Dense BF16/FP16 (see SemiAnalysis’ NVIDIA Tensor Core Evolution). To get to the Sparse FP4 number Nvidia often quotes:

\[2048 \text{ (BF16)} \times 2 \text{ (BF16} \to \text{FP8)} \times 2 \text{ (FP8} \to \text{FP4)} \times 2 \text{ (sparsity)} = 16{,}384 \text{ FLOP/clk per TC}\]

All datacenter Blackwell GPUs have 4 TCs per SM, which gives 65,536 FLOP/clk per SM at sparse FP4. For consumer Blackwell (the RTX 50 series), however, Nvidia ships a physically narrower Tensor Core running at exactly 1/8 of this rate: 8,192 FLOP/clk per SM. That width is fixed at design time in silicon.
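The per-clock arithmetic above can be checked with a few lines of Python. This is only a sketch of the multiplication chain in the text; the constant and function names are made up for illustration and are not Nvidia terminology.

```python
# Per-clock Tensor Core throughput, using only the figures quoted in the text.
DENSE_BF16_FLOP_PER_CLK_PER_TC = 2048  # datacenter Blackwell, dense BF16/FP16
TCS_PER_SM = 4                         # Tensor Cores per SM
CONSUMER_WIDTH_DIVISOR = 8             # consumer cores run at 1/8 of the rate


def sparse_fp4_flop_per_clk_per_tc(dense_bf16: int) -> int:
    """Apply the three doublings: BF16 -> FP8, FP8 -> FP4, then sparsity."""
    return dense_bf16 * 2 * 2 * 2


per_tc = sparse_fp4_flop_per_clk_per_tc(DENSE_BF16_FLOP_PER_CLK_PER_TC)
per_sm_datacenter = per_tc * TCS_PER_SM
per_sm_consumer = per_sm_datacenter // CONSUMER_WIDTH_DIVISOR

print(f"Sparse FP4 per TC:            {per_tc:,} FLOP/clk")
print(f"Sparse FP4 per SM (full):     {per_sm_datacenter:,} FLOP/clk")
print(f"Sparse FP4 per SM (consumer): {per_sm_consumer:,} FLOP/clk")
```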

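The peak Tensor Core compute performance for a GPU can be calculated with the formula below. Here FLOP/clk is the full-width sparse FP4 figure of 65,536 per SM, the width ratio is 1 for full-width Tensor Cores and 1/8 for the narrower ones, and the product is in FLOP/s (divide by \(10^{15}\) for PFLOPS):

\[\text{SMs} \times \text{FLOP/clk per SM} \times \text{clock} \times \text{width ratio} = \text{peak FLOP/s}\]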

Jetson Thor (robotics, GB10B):

\[20 \text{ SMs} \times 65{,}536 \times 1570 \text{ MHz} \times \tfrac{1}{1} \approx 2 \text{ PFLOPS}\]

Width ratio = 1. Full datacenter-width cores.

B200 (datacenter, dual-die — 80 SMs per die with 74 enabled, so 148 SMs total):

\[148 \text{ SMs} \times 65{,}536 \times {\sim}1860 \text{ MHz (boost)} \times \tfrac{1}{1} \approx 18 \text{ PFLOPS}\]

Width ratio = 1. (~1.86 GHz boost clock is what the official 18 PFLOPS implies, sitting right next to the H100 SXM’s 1.83 GHz.)

RTX 5090 (consumer Blackwell):

\[170 \text{ SMs} \times 65{,}536 \times 2407 \text{ MHz (boost clock)} \times \tfrac{1}{8} \approx 3.35 \text{ PFLOPS}\]

Width ratio = 1/8. That is one-eighth of Thor’s per-SM, per-clock throughput, and it reproduces Nvidia’s advertised 3,352 “AI TOPS” exactly.
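The three worked examples above can be reproduced with a small script. It is a minimal sketch of the same formula, using the SM counts, clocks and width ratios stated in this post; the function name and the dictionary layout are illustrative choices, not anything official.

```python
# Peak sparse FP4 throughput: SMs x FLOP/clk per SM x clock x width ratio.
FLOP_PER_CLK_PER_SM = 65_536  # full-width sparse FP4 figure from the text


def peak_sparse_fp4_pflops(sms: int, clock_mhz: float, width_ratio: float) -> float:
    """Return peak sparse FP4 throughput in PFLOPS."""
    flop_per_second = sms * FLOP_PER_CLK_PER_SM * clock_mhz * 1e6 * width_ratio
    return flop_per_second / 1e15


# (SM count, clock in MHz, width ratio), all taken from the examples above.
gpus = {
    "Jetson Thor": (20, 1570, 1.0),
    "B200": (148, 1860, 1.0),   # clock is the value implied by 18 PFLOPS
    "RTX 5090": (170, 2407, 1 / 8),
}

results = {name: peak_sparse_fp4_pflops(*spec) for name, spec in gpus.items()}

for name, pflops in results.items():
    print(f"{name:12s} {pflops:6.2f} PFLOPS")
```

Changing the RTX 5090's width ratio from 1/8 to 1 in this sketch multiplies its result by eight, which is the gap the rest of this post is about.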

Which brings us to today’s RTX Spark:

Jetson Thor: 20 SMs (2560 CUDA cores), 2 PFLOPS sparse FP4
RTX Spark: 48 SMs (6144 CUDA cores), 1 PFLOPS sparse FP4

2.4× the SMs, half the AI performance. How?

Run the width-ratio test backwards. Assume the consumer 1/8 ratio, solve for clock:

\[48 \times 65{,}536 \times \text{clock} \times \tfrac{1}{8} \approx 1 \text{ PFLOPS} \implies \text{clock} \approx 2.54 \text{ GHz}\]

That lands right in desktop RTX 50 boost territory (the 5080 boosts at 2.62 GHz), and RTX Spark has exactly the same SM count and CUDA core count as the RTX 5070.

So RTX Spark is effectively a 5070 boosting at desktop-class clocks, but still with the same 1/8-width Tensor Cores as other RTX-branded GPUs.
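The back-solve for the RTX Spark clock can be sketched the same way, by rearranging the formula to solve for clock. The function name is hypothetical, and the second call (full width) is only a sanity check I added to show what clock the same 1 PFLOPS would imply if the 1/8 assumption were wrong.

```python
# Solve SMs x FLOP/clk per SM x clock x width ratio = target for the clock.
FLOP_PER_CLK_PER_SM = 65_536  # full-width sparse FP4 figure from the text


def implied_clock_ghz(target_pflops: float, sms: int, width_ratio: float) -> float:
    """Return the clock (GHz) needed to reach target_pflops."""
    flop_per_clk = sms * FLOP_PER_CLK_PER_SM * width_ratio
    return target_pflops * 1e15 / flop_per_clk / 1e9


# RTX Spark: 48 SMs, 1 PFLOPS sparse FP4 (figures quoted in the text).
clock_consumer = implied_clock_ghz(1.0, 48, 1 / 8)
clock_full_width = implied_clock_ghz(1.0, 48, 1.0)

print(f"Assuming 1/8-width cores:  {clock_consumer:.2f} GHz")
print(f"Assuming full-width cores: {clock_full_width:.2f} GHz")
```

Under the full-width assumption the implied clock comes out at roughly 0.32 GHz, far below any clock quoted in this post, which is why the 1/8 ratio is the more plausible reading.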

Jetson Thor, despite having far fewer SMs, runs full datacenter-width Tensor Cores and remains the only Nvidia product outside the datacenter that does. That’s how the Jetson Thor can have higher AI performance than the RTX Spark with way fewer CUDA cores.

Errata: An earlier version derived a different ratio for the B200 from incorrect spec-database figures; the real chip has 148 SMs at an implied ~1.86 GHz boost. The RTX Spark implied clock has also been corrected from 2.62 to 2.54 GHz.