Roofline — Milk-V Jupiter

Measured Peak Compute (cpufp vfmacc.vv)

Precision	1 core (GFLOPS)	8 cores (GFLOPS)
FP16	63.9	511.5
FP32	32.0	255.8
FP64	16.0	127.9
INT8 (IME)	n/a	2.05 TOPS (4 cores only)

Ridge Points (FP32, 8 cores)

Bandwidth	Ridge OI	Min N for compute-bound
Theoretical (10.66 GB/s)	24.0 FLOP/B	N ≈ 144
Realistic (~7 GB/s)	36.5 FLOP/B	N ≈ 220

Square NxN FP32 GEMM: OI = 2N³ FLOPs / (3 × N² × 4 bytes) = N/6 FLOP/byte, assuming A, B and C each cross the memory bus exactly once.
Orange markers show where each N falls on the roofline.
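The ridge-point table follows from two divisions: ridge OI = peak / bandwidth, and, because square-GEMM OI is N/6, the smallest compute-bound N is 6 × ridge. The sketch below reproduces that arithmetic using only the 255.8 GFLOPS peak and the two bandwidth figures from these notes; the function names are illustrative, not from any tool.

```python
# Ridge-point arithmetic for the FP32, 8-core roofline.
# Inputs: measured cpufp peak and the two bandwidth assumptions from the table.
PEAK_FP32_8C_GFLOPS = 255.8


def ridge_oi(peak_gflops: float, bw_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the memory roof meets the compute roof."""
    return peak_gflops / bw_gbs


def gemm_oi(n: int) -> float:
    """Ideal OI of square FP32 GEMM: 2N^3 FLOPs over 3 matrices x N^2 x 4 bytes."""
    return (2 * n**3) / (3 * n * n * 4)


def min_compute_bound_n(peak_gflops: float, bw_gbs: float) -> float:
    """Smallest N with gemm_oi(N) >= ridge; since OI = N/6 this is 6 * ridge."""
    return 6 * ridge_oi(peak_gflops, bw_gbs)


if __name__ == "__main__":
    for label, bw in [("theoretical", 10.66), ("realistic", 7.0)]:
        r = ridge_oi(PEAK_FP32_8C_GFLOPS, bw)
        n = min_compute_bound_n(PEAK_FP32_8C_GFLOPS, bw)
        print(f"{label:12s} ridge = {r:.1f} FLOP/B, N >= {n:.0f}")
```

The realistic case evaluates to N ≈ 219.3, which the table rounds up to 220.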

Microarchitecture Notes

X60 core: In-order, dual-issue. VLEN=256, DLEN=128 (128-bit execution datapath). Two vector units can execute alongside the scalar FP unit. The microarchitecture closely resembles the XuanTie C908, but with a wider VLEN.

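FP32 per-core throughput: ~32 GFLOPS → ~20 FLOPs/cycle @ 1.6 GHz, i.e. ~10 FP32 FMAs/cycle sustained. The two vector units supply only 2 × (128b / 32b) = 8 FMAs/cycle (16 FLOPs/cycle), and the scalar FPU is unlikely to cover the gap: the cpufp vfmacc.vv kernel issues only vector FMAs, and the measured peaks scale exactly with element width (FP16 : FP32 : FP64 = 63.9 : 32.0 : 16.0), which a fixed scalar contribution would break. A sustained clock of ~2.0 GHz would match the 8-FMA/cycle model exactly (8 × 2 FLOPs × 2.0 GHz = 32 GFLOPS); the actual clock was not measured here.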
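One way to sanity-check the per-core figures is to compare measured FLOPs/cycle against a simple vector-pipe model. The sketch below assumes two 128-bit vector FMA pipes and tries two candidate clocks: 1.6 GHz as stated in these notes, and 2.0 GHz as a hypothetical alternative. cpufp measures neither the pipe count nor the clock, so both are assumptions.

```python
# Compare measured per-core FLOPs/cycle with a pure-vector FMA model.
# ASSUMPTIONS: 2 vector FMA pipes, 128-bit datapath (DLEN), candidate clocks below.
def flops_per_cycle(gflops: float, clock_ghz: float) -> float:
    """Measured throughput divided by an assumed clock."""
    return gflops / clock_ghz


def vector_fmas_per_cycle(pipes: int, dlen_bits: int, elem_bits: int) -> int:
    """FMA lanes per cycle if every pipe retires one DLEN-wide FMA per cycle."""
    return pipes * (dlen_bits // elem_bits)


# Element width in bits -> measured 1-core GFLOPS (cpufp vfmacc.vv).
MEASURED = {16: 63.9, 32: 32.0, 64: 16.0}

if __name__ == "__main__":
    for clock in (1.6, 2.0):  # 2.0 GHz is a hypothetical, not a measured value
        for bits, gflops in MEASURED.items():
            measured = flops_per_cycle(gflops, clock)
            model = 2 * vector_fmas_per_cycle(2, 128, bits)  # 2 FLOPs per FMA
            print(f"{clock} GHz FP{bits}: measured {measured:.1f} "
                  f"vs model {model} FLOP/cycle (ratio {measured / model:.2f})")
```

At 1.6 GHz the measured/model ratio is about 1.25 for every precision. A constant ratio across element widths is what a misestimated clock would produce. An extra scalar FMA would instead add the same few FLOPs/cycle to every row and skew the ratio. At 2.0 GHz the model matches all three rows.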

IME (Integer Matrix Extension): SpacemiT custom vendor extension for INT8 matrix multiply. Only available on Cluster 0 (cores 0–3). Delivers ~2 TOPS across 4 cores — this is the "2.0 TOPS AI computing power" in SpacemiT's marketing.

Memory: Single 32-bit LPDDR4X bus @ 2666 MT/s = 4 B × 2666 MT/s ≈ 10.66 GB/s theoretical peak. Realistic sustained bandwidth with these in-order cores is likely 6–8 GB/s. This puts the FP32 8-core ridge point quite high (24 FLOP/byte at the theoretical peak, ~37 FLOP/byte at 7 GB/s), so square GEMM needs N ≥ 144–220 to become compute-bound.
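The bandwidth figure and the roofline itself can be written down in a few lines. This is a minimal sketch. It assumes the 32-bit, 2666 MT/s bus described above, the measured 255.8 GFLOPS FP32 8-core peak, and ideal square-GEMM traffic (OI = N/6). The helper names are hypothetical.

```python
# Minimal roofline model for the FP32, 8-core configuration.
# ASSUMPTIONS: 32-bit LPDDR4X at 2666 MT/s, 255.8 GFLOPS measured peak,
# square GEMM with A, B and C each moved to/from DRAM exactly once.
BUS_BYTES = 32 // 8          # 32-bit bus -> 4 bytes per transfer
TRANSFERS_PER_S = 2666e6     # 2666 MT/s
PEAK_GFLOPS = 255.8


def dram_peak_gbs(bus_bytes: int = BUS_BYTES, tps: float = TRANSFERS_PER_S) -> float:
    """Theoretical DRAM bandwidth in GB/s (decimal)."""
    return bus_bytes * tps / 1e9


def attainable_gflops(oi: float, bw_gbs: float, peak: float = PEAK_GFLOPS) -> float:
    """Roofline: performance is capped by memory traffic or by compute, whichever is lower."""
    return min(peak, oi * bw_gbs)


def gemm_fp32_oi(n: int) -> float:
    """Ideal arithmetic intensity of an NxN FP32 GEMM, in FLOP/byte."""
    return n / 6


if __name__ == "__main__":
    theo = dram_peak_gbs()
    for n in (64, 144, 220, 512):
        oi = gemm_fp32_oi(n)
        print(f"N={n:4d} OI={oi:5.1f}  "
              f"theoretical={attainable_gflops(oi, theo):6.1f}  "
              f"realistic(7 GB/s)={attainable_gflops(oi, 7.0):6.1f} GFLOPS")
```

The 7 GB/s value is the midpoint of the 6–8 GB/s estimate used in the ridge-point table. Swapping in a measured STREAM-style number would tighten the realistic curve.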