A hardware implementation of a Transformer, built as a learning tool and optimized past 50k tokens/s.
May 1, 2026
Talos V2 asks a simple hardware question: what happens if a small Transformer is not treated as software at all? Instead of running a model through a runtime, Talos V2 turns the inference path into explicit RTL: embeddings, attention, normalization, the MLP, the language-model head, and token sampling.
The model is microGPT, trained on Karpathy's names dataset to generate names one character at a time. It is intentionally small enough to study end to end, but still complex enough to contain the pieces that make modern generative models interesting: matrix-vector projections, self-attention, softmax-like behavior, residual paths, normalization, and stochastic sampling.
The point of the project is not just the final speed number. The point is that the design is readable. Each block exists because the model math forced it to exist, and each optimization is tied to a real hardware constraint: area, memory bandwidth, routing pressure, timing closure, or unnecessary cycles in the finite-state machine.
Most explanations of Transformers stop at equations. Most FPGA explanations stop at adders, multipliers, and timing reports. Talos V2 is meant to connect those worlds. It shows how a familiar neural-network layer turns into memories, counters, state transitions, accumulators, lookup tables, and multicycle arithmetic engines.
That makes the project useful even if you are not trying to build this exact accelerator. You can use it to learn how to translate model operations into hardware questions. Where do weights live? How wide should the accumulator be? Which values must be stored, and which can be streamed? What work can happen in parallel without making the design impossible to route? Which parts of the model are mathematically clean but physically expensive?
The most important lesson was that throughput did not improve by blindly adding parallelism. The winning changes were the ones that respected the FPGA. We attempted wider datapaths, more aggressive parallel compute, and deeper shortcuts, but the design only got faster when those changes still fit, closed timing, and reduced real cycles in the measured generation path.
Floating point is convenient in Python, but expensive in FPGA fabric. Talos V2 uses Q4.12 fixed-point math throughout the standalone RTL core. Activations and weights are 16-bit signed values, with enough fractional precision for the small character model and enough structure for predictable hardware.
\[\underbrace{\,\overbrace{\text{sign + integer}}^{\text{4 bits}}\;|\;\overbrace{\text{fraction}}^{\text{12 bits}}\,}_{16\ \text{bits total}}\]

The weights are exported into ROM-friendly hex files and loaded with `$readmemh`. That matters because runtime weight movement is not free. If the model is fixed, the fastest path is to put the weights where the datapath can read them deterministically.
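As a rough illustration of what that looks like in RTL, here is a minimal SystemVerilog sketch of a `$readmemh`-initialized weight ROM and a Q4.12 multiply. The module names, ROM depth, and file name are placeholders, not the actual Talos V2 interfaces.

```systemverilog
// Illustrative sketch (placeholder names, depth, and file), not the
// actual Talos V2 RTL: a weight ROM initialized with $readmemh and a
// Q4.12 multiply.
module weight_rom #(
    parameter DEPTH  = 1024,
    parameter ADDR_W = 10
) (
    input  wire               clk,
    input  wire [ADDR_W-1:0]  addr,
    output reg  signed [15:0] q          // one Q4.12 weight per cycle
);
    reg signed [15:0] mem [0:DEPTH-1];
    initial $readmemh("weights.hex", mem);  // hex exported from the trained model
    always @(posedge clk) q <= mem[addr];
endmodule

// Q4.12 * Q4.12 gives a Q8.24 product; an arithmetic shift by 12 brings it
// back to Q4.12 (rounding and saturation omitted for brevity).
module q412_mul (
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output wire signed [15:0] p
);
    wire signed [31:0] full = a * b;
    assign p = full >>> 12;
endmodule
```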
A Transformer is mostly matrix-vector multiplication wearing different names. Query, key, value, output projection, MLP expansion, MLP projection, and the final language-model head all reduce to the same core operation: multiply a vector by rows of weights and accumulate the result.
The first instinct was to make that hardware wide, but the useful question was how much parallelism could still fit and close timing on the Cyclone V. The practical design uses a 16-lane streamed systolic matrix-vector tile: sixteen output rows in parallel, one vector column streamed per cycle.
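A minimal sketch of that idea, with illustrative port names and widths rather than the real tile interface: each cycle one vector element is broadcast to sixteen multiply-accumulate lanes, and each lane owns one output row.

```systemverilog
// Sketch of a 16-lane streamed matrix-vector tile (illustrative interface,
// not the actual Talos V2 ports).  Each cycle one vector element x is
// broadcast to all sixteen lanes; lane i holds the matching weight of
// output row i and accumulates its own dot product.
module mv_tile16 (
    input  wire                clk,
    input  wire                start,        // clear the accumulators
    input  wire                valid,        // x / w pair is valid this cycle
    input  wire signed [15:0]  x,            // streamed vector element, Q4.12
    input  wire signed [15:0]  w [0:15],     // one weight per lane, Q4.12
    output reg  signed [39:0]  acc [0:15]    // wide accumulators in Q8.24
);
    integer i;
    always @(posedge clk) begin
        for (i = 0; i < 16; i = i + 1) begin
            if (start)
                acc[i] <= '0;
            else if (valid)
                acc[i] <= acc[i] + x * w[i];  // one MAC per lane per cycle
        end
    end
endmodule
```

In a tile like this, the accumulators stay wider than Q4.12 so truncation back to 16 bits happens once per output row instead of once per term.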
That tile became the center of the design. Instead of creating a separate datapath for every layer, Talos V2 time-multiplexes one known-good tile across the Transformer. This is the central area-throughput tradeoff: spend a few more cycles reusing hardware, but keep the circuit small and fast enough to actually run.
The same tile computes the Q, K, and V projections, then the attention output projection. In software, attention is usually described as one compact expression:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]

In RTL, that expression becomes a schedule. First generate Q, K, and V. Then scan dot products. Track the maximum. Convert scores into approximate weights. Sum them. Divide. Accumulate weighted values. Move the result into the residual path. The math is the same, but the design problem is completely different.
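One way to make that schedule concrete is to name its phases as states. The names below are illustrative, not the actual Talos V2 control FSM, but they show how the single softmax expression fans out into sequenced hardware work:

```systemverilog
// Illustrative phase names for the attention schedule above; the real
// Talos V2 state machine differs, but the sequencing problem is the same.
package attn_sched_pkg;
    typedef enum logic [3:0] {
        S_PROJ_QKV,     // run the shared tile for the Q, K, V projections
        S_SCORES,       // dot products of q against each cached key
        S_TRACK_MAX,    // track the running maximum score for stability
        S_EXP_APPROX,   // turn shifted scores into weights via the exp LUT
        S_SUM_WEIGHTS,  // accumulate the normalization denominator
        S_DIVIDE,       // multicycle saturated divide to normalize
        S_WEIGHTED_V,   // accumulate the weighted value vectors
        S_OUT_PROJ,     // attention output projection on the shared tile
        S_RESIDUAL      // fold the result back into the residual stream
    } attn_state_t;
endpackage
```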
Softmax is where the hardware cost becomes obvious. A direct implementation wants exponentials, a reduction sum, and division. Talos V2 handles that with a small lookup-table-based exponential approximation and a multicycle saturated divider. The divider is not general purpose because it does not need to be. Its input range is bounded by the model, so the engine can be narrower and faster than a naive divider.
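For a flavor of the lookup-table side, here is a hedged sketch of a registered exp LUT. The score has already had the running maximum subtracted, so the input is bounded and non-positive; the table file, index scaling, and clamping here are illustrative assumptions, not the real Talos V2 tables.

```systemverilog
// Sketch of a lookup-table exponential (illustrative table file, index
// scaling, and clamping; not the real Talos V2 tables).  The score has
// already had the running maximum subtracted, so the input is <= 0 and
// the table only covers a bounded negative range.
module exp_lut (
    input  wire               clk,
    input  wire signed [15:0] x_q412,   // score minus max, Q4.12, <= 0
    output reg         [15:0] y_q016    // approx exp(x) in [0, 1), Q0.16
);
    reg [15:0] exp_rom [0:255];
    initial $readmemh("exp_table.hex", exp_rom);   // precomputed offline

    wire [15:0] mag = -x_q412;                     // magnitude of the input
    // 256 entries spanning |x| in [0, 8) with a step of 1/32; clamp anything
    // outside that range to the smallest table value.
    wire [7:0]  idx = mag[15] ? 8'hFF : mag[14:7];

    always @(posedge clk) y_q016 <= exp_rom[idx];
endmodule
```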
The design also uses two divider engines for the attention value path. That lets two channels progress at the same time and cuts a serial bottleneck without exploding the rest of the datapath. This is the kind of optimization that mattered most: targeted parallelism at a proven bottleneck, not parallelism everywhere.
The optimization process was not a straight line. Some ideas made the RTL prettier but slower. Some ideas made simulation faster but failed to fit. The final throughput came from repeatedly asking a more grounded question: does this reduce the number of useful cycles, shorten the critical path, or remove data movement from the token-generation path?
The current pure RTL path generates more than 50,000 tokens per second on the DE1-SoC, with a measured checkpoint around 53,000 tokens/s. That path executes the Transformer block and performs token sampling in hardware. The host is not choosing tokens for the design.
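For completeness, this is what in-fabric sampling can look like in principle: an LFSR supplies a pseudo-random threshold, probabilities stream past it, and the first token whose cumulative sum crosses the threshold is emitted. This is a hypothetical sketch of the general technique, not Talos V2's actual sampler.

```systemverilog
// Hypothetical sketch of in-fabric sampling (NOT the actual Talos V2
// sampler): an LFSR supplies a pseudo-random threshold, probabilities
// stream in one per cycle, and the first token whose running cumulative
// sum crosses the threshold is emitted.
module sampler #(
    parameter VOCAB = 27                       // e.g. a-z plus an end token
) (
    input  wire                       clk,
    input  wire                       start,   // latch threshold, clear sum
    input  wire                       valid,   // one probability this cycle
    input  wire                       last,    // final vocabulary entry
    input  wire [15:0]                prob,    // unsigned Q0.16 probability
    input  wire [$clog2(VOCAB)-1:0]   token_in,
    output reg  [$clog2(VOCAB)-1:0]   token_out,
    output reg                        done
);
    // Free-running 16-bit maximal-length LFSR (taps 16, 14, 13, 11).
    reg [15:0] lfsr = 16'hACE1;
    wire fb = lfsr[15] ^ lfsr[13] ^ lfsr[12] ^ lfsr[10];

    reg [15:0] thresh;
    reg [16:0] cdf;                            // wide enough for a full Q0.16 sum
    reg        found;

    always @(posedge clk) begin
        lfsr <= {lfsr[14:0], fb};
        done <= 1'b0;
        if (start) begin
            thresh <= lfsr;
            cdf    <= 17'd0;
            found  <= 1'b0;
        end else if (valid && !found) begin
            cdf <= cdf + prob;
            // 'last' guards the rounding edge case where quantized
            // probabilities never cross the threshold.
            if ((cdf + prob > {1'b0, thresh}) || last) begin
                token_out <= token_in;
                found     <= 1'b1;
                done      <= 1'b1;
            end
        end
    end
endmodule
```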
That number is important, but the way the design reached it is more important. The speedup came from a collection of disciplined hardware decisions: keep weights local, reuse the tile, stream data instead of buffering everything, fold bookkeeping into existing passes, bound the math engines, reduce serial attention work, and only raise the clock after timing closure supported it.
Talos V2 is therefore both an accelerator and a map. It shows how a Transformer can be lowered into RTL, where the performance traps are, and why the best optimization is often not the biggest datapath. The best optimization is the one that survives synthesis, fits the board, closes timing, and removes real work from the next-token loop.
Talos V2 is open source because accelerator design is easier to learn when the full stack is visible. There is still a lot of room to take this design further: tighter scheduling, better memory reuse, wider or better-balanced tiles, improved sampling, cleaner host integration, and more aggressive timing-driven RTL cleanup. If you want to push the throughput higher, improve the tooling, or extend the writeup, open an issue, send a PR, or explore the Talos V2 repo.
Big shoutout to Gawtham for giving us the FPGA that made this project possible.