RTX Spark and how every Nvidia GPU product line is throttled except for one

Nvidia released the RTX Spark today. Because it is being presented as an AI computer, it prompted me to write about a key Nvidia product-design choice that few people in the space know about: across Nvidia’s Blackwell GPUs, Tensor Cores are deliberately throttled in every product line except one little-known product. Let’s work through the math.

Nvidia has doubled Tensor Core FLOP performance per clock cycle every generation from Ampere to Hopper to Blackwell. Blackwell now does 2048 FLOP/clk per Tensor Core (TC) at Dense BF16/FP16 (see SemiAnalysis’ NVIDIA Tensor Core Evolution). To get to the Sparse FP4 number Nvidia often quotes:

\[2048 \text{ (BF16)} \times 2 \text{ (BF16} \to \text{FP8)} \times 2 \text{ (FP8} \to \text{FP4)} \times 2 \text{ (sparsity)} = 16{,}384 \text{ FLOP/clk per TC}\]

All Blackwell GPUs have 4 TCs per SM, so that’s 65,536 FLOP/clk per SM at Sparse FP4.
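The chain of doublings above can be checked in a few lines. This is only a sketch of the arithmetic; the constant and function names are illustrative, and the input figures are the ones quoted in the text.

```python
# Figures taken from the text above; the names are illustrative only.
DENSE_BF16_FLOP_PER_CLK_PER_TC = 2048  # Blackwell, dense BF16/FP16
TENSOR_CORES_PER_SM = 4


def sparse_fp4_flop_per_clk_per_tc(dense_bf16: int = DENSE_BF16_FLOP_PER_CLK_PER_TC) -> int:
    """Apply the three 2x steps: BF16 -> FP8, FP8 -> FP4, then sparsity."""
    return dense_bf16 * 2 * 2 * 2


def sparse_fp4_flop_per_clk_per_sm() -> int:
    """Scale the per-Tensor-Core figure by the number of Tensor Cores in an SM."""
    return sparse_fp4_flop_per_clk_per_tc() * TENSOR_CORES_PER_SM


print(sparse_fp4_flop_per_clk_per_tc())  # 16384
print(sparse_fp4_flop_per_clk_per_sm())  # 65536
```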

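The peak Tensor Core compute performance for a GPU can be calculated with this formula, where the throttle ratio is the fraction of the full per-clock Tensor Core rate that a product actually runs at:

\[\text{SMs} \times \text{FLOP/clk per SM} \times \text{clock} \times \text{throttle ratio} = \text{peak FLOPS} \quad (\div\, 10^{15} \text{ for PFLOPS})\]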

Jetson Thor (robotics, GB10B):

\[20 \text{ SMs} \times 65{,}536 \times 1570 \text{ MHz} \times \tfrac{1}{1} \approx 2 \text{ PFLOPS}\]

Throttle ratio = 1. Fully unlocked.

B200 (datacenter, dual-die):

\[2 \times 132 \text{ SMs} \times 65{,}536 \times 1665 \text{ MHz} \times \tfrac{5}{8} \approx 18 \text{ PFLOPS}\]

Throttle ratio = 5/8.

RTX 5090 (consumer Blackwell):

\[170 \text{ SMs} \times 65{,}536 \times 2407 \text{ MHz} \times \tfrac{1}{8} \approx 3.35 \text{ PFLOPS}\]

Throttle ratio = 1/8. One-eighth of Thor’s per-clock speed.
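The three calculations above can be reproduced with a small helper. This is a minimal sketch: `peak_pflops` and the `gpus` table are names made up for illustration, and the SM counts, clocks, and throttle ratios are the ones used in this post, not values read from any Nvidia tool.

```python
from fractions import Fraction

FLOP_PER_CLK_PER_SM = 65_536  # sparse FP4, 4 Tensor Cores per SM


def peak_pflops(sms: int, clock_mhz: float, throttle: Fraction) -> float:
    """SMs x FLOP/clk per SM x clock x throttle ratio, expressed in PFLOPS."""
    flops = sms * FLOP_PER_CLK_PER_SM * clock_mhz * 1e6 * float(throttle)
    return flops / 1e15


# (SM count, clock in MHz, throttle ratio) as used in the text above.
gpus = {
    "Jetson Thor": (20, 1570, Fraction(1, 1)),
    "B200": (2 * 132, 1665, Fraction(5, 8)),
    "RTX 5090": (170, 2407, Fraction(1, 8)),
}

for name, (sms, mhz, ratio) in gpus.items():
    print(f"{name}: {peak_pflops(sms, mhz, ratio):.2f} PFLOPS")
# Jetson Thor: 2.06 PFLOPS
# B200: 18.00 PFLOPS
# RTX 5090: 3.35 PFLOPS
```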

Which brings us to today’s RTX Spark:

  • Jetson Thor: 20 SMs (2560 CUDA cores), 2 PFLOPS sparse FP4
  • RTX Spark: 48 SMs (6144 CUDA cores), 1 PFLOPS sparse FP4

2.4× the SMs, half the AI performance. How?

Run the throttle-ratio test backwards. Assume the consumer 1/8 ratio, solve for clock:

\[48 \times 65{,}536 \times \text{clock} \times \tfrac{1}{8} \approx 1 \text{ PFLOPS} \implies \text{clock} \approx 2.54 \text{ GHz}\]

That is within about 3% of the RTX 5080’s boost clock (2.62 GHz), which would give roughly 1.03 PFLOPS and round to the quoted 1 PFLOPS. RTX Spark also has exactly the same SM count / CUDA core count as the RTX 5070.
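The back-solve can be checked numerically in both directions. Because the quoted 1 PFLOPS is a rounded figure, this sketch computes both the clock implied by exactly 1 PFLOPS and the throughput at a 2.62 GHz clock. The 1/8 throttle ratio is the assumption being tested here, not a published spec, and the function names are illustrative.

```python
FLOP_PER_CLK_PER_SM = 65_536  # sparse FP4, 4 Tensor Cores per SM
SPARK_SMS = 48
ASSUMED_THROTTLE = 1 / 8      # assumption: same ratio as the RTX 5090


def implied_clock_ghz(target_pflops: float, sms: int = SPARK_SMS,
                      throttle: float = ASSUMED_THROTTLE) -> float:
    """Solve the peak-compute formula for the clock, in GHz."""
    return target_pflops * 1e15 / (sms * FLOP_PER_CLK_PER_SM * throttle) / 1e9


def pflops_at_clock(clock_ghz: float, sms: int = SPARK_SMS,
                    throttle: float = ASSUMED_THROTTLE) -> float:
    """Evaluate the peak-compute formula forward, in PFLOPS."""
    return sms * FLOP_PER_CLK_PER_SM * clock_ghz * 1e9 * throttle / 1e15


print(f"{implied_clock_ghz(1.0):.2f} GHz")    # 2.54 GHz
print(f"{pflops_at_clock(2.62):.2f} PFLOPS")  # 1.03 PFLOPS
```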

So RTX Spark is effectively a 5070 with a 5080-class boost clock, but still with the same 1/8 Tensor Core throttle as other RTX-branded GPUs.

Jetson Thor, despite having far fewer SMs, runs fully unlocked Tensor Cores and remains the only Nvidia product line that does. That’s how Jetson Thor can have higher AI performance than the RTX Spark with far fewer CUDA cores.