A hardware implementation of a Transformer, built as a learning tool and optimized past 50k tokens/s.
May 1, 2026
Talos V2 asks a simple hardware question: what happens if a small Transformer is not treated as software at all? Instead of running a model through a runtime, Talos V2 turns the inference path into explicit RTL: embeddings, attention, normalization, the MLP, the language-model head, and token sampling.
The model is microGPT, trained on Karpathy's names dataset to generate names one character at a time. It is intentionally small enough to study end to end, but still complex enough to contain the pieces that make modern generative models interesting: matrix-vector projections, self-attention, softmax-like behavior, residual paths, normalization, and stochastic sampling.
The point of the project is not just the final speed number. The point is that the design is readable. Each block exists because the model math forced it to exist, and each optimization is tied to a real hardware constraint: area, memory bandwidth, routing pressure, timing closure, or unnecessary cycles in the finite-state machine.
Most explanations of Transformers stop at equations. Most FPGA explanations stop at adders, multipliers, and timing reports. Talos V2 is meant to connect those worlds. It shows how a familiar neural-network layer turns into memories, counters, state transitions, accumulators, lookup tables, and multicycle arithmetic engines.
That makes the project useful even if you are not trying to build this exact accelerator. You can use it to learn how to translate model operations into hardware questions. Where do weights live? How wide should the accumulator be? Which values must be stored, and which can be streamed? What work can happen in parallel without making the design impossible to route? Which parts of the model are mathematically clean but physically expensive?
The most important lesson was that throughput did not improve by blindly adding parallelism. The winning changes were the ones that respected the FPGA. We attempted wider datapaths, more aggressive parallel compute, and deeper shortcuts, but the design only got faster when those changes still fit, closed timing, and reduced real cycles in the measured generation path.
Floating point is convenient in Python, but expensive in FPGA fabric. Talos V2 uses Q4.12 fixed-point math throughout the standalone RTL core. Activations and weights are 16-bit signed values, with enough fractional precision for the small character model and enough structure for predictable hardware.
\[\underbrace{\,\overbrace{\text{sign + integer}}^{\text{4 bits}}\;|\;\overbrace{\text{fraction}}^{\text{12 bits}}\,}_{16\ \text{bits total}}\]

The weights are exported into ROM-friendly hex files and loaded with `$readmemh`. That matters because runtime weight movement is not free. If the model is fixed, the fastest path is to put the weights where the datapath can read them deterministically.
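As a rough illustration of what that looks like in RTL, here is a minimal SystemVerilog sketch of a `$readmemh`-initialized weight ROM and a Q4.12 multiply. The module names, ROM depth, and file name are placeholders, not the actual Talos V2 interfaces.

```systemverilog
// Illustrative sketch (placeholder names, depth, and file), not the
// actual Talos V2 RTL: a weight ROM initialized with $readmemh and a
// Q4.12 multiply.
module weight_rom #(
    parameter DEPTH  = 1024,
    parameter ADDR_W = 10
) (
    input  wire               clk,
    input  wire [ADDR_W-1:0]  addr,
    output reg  signed [15:0] q          // one Q4.12 weight per cycle
);
    reg signed [15:0] mem [0:DEPTH-1];
    initial $readmemh("weights.hex", mem);  // hex exported from the trained model
    always @(posedge clk) q <= mem[addr];
endmodule

// Q4.12 * Q4.12 gives a Q8.24 product; an arithmetic shift by 12 brings it
// back to Q4.12 (rounding and saturation omitted for brevity).
module q412_mul (
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output wire signed [15:0] p
);
    wire signed [31:0] full = a * b;
    assign p = full >>> 12;
endmodule
```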
A Transformer is mostly matrix-vector multiplication wearing different names. Query, key, value, output projection, MLP expansion, MLP projection, and the final language-model head all reduce to the same core operation: multiply a vector by rows of weights and accumulate the result.
The first instinct was to make that hardware wide, but the useful question was how much parallelism could still fit and close timing on the Cyclone V. The practical design uses a 16-lane streamed systolic matrix-vector tile: sixteen output rows in parallel, one vector column streamed per cycle.
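A minimal sketch of that idea, with illustrative port names and widths rather than the real tile interface: each cycle one vector element is broadcast to sixteen multiply-accumulate lanes, and each lane owns one output row.

```systemverilog
// Sketch of a 16-lane streamed matrix-vector tile (illustrative interface,
// not the actual Talos V2 ports).  Each cycle one vector element x is
// broadcast to all sixteen lanes; lane i holds the matching weight of
// output row i and accumulates its own dot product.
module mv_tile16 (
    input  wire                clk,
    input  wire                start,        // clear the accumulators
    input  wire                valid,        // x / w pair is valid this cycle
    input  wire signed [15:0]  x,            // streamed vector element, Q4.12
    input  wire signed [15:0]  w [0:15],     // one weight per lane, Q4.12
    output reg  signed [39:0]  acc [0:15]    // wide accumulators in Q8.24
);
    integer i;
    always @(posedge clk) begin
        for (i = 0; i < 16; i = i + 1) begin
            if (start)
                acc[i] <= '0;
            else if (valid)
                acc[i] <= acc[i] + x * w[i];  // one MAC per lane per cycle
        end
    end
endmodule
```

In a tile like this, the accumulators stay wider than Q4.12 so truncation back to 16 bits happens once per output row instead of once per term.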
That tile became the center of the design. Instead of creating a separate datapath for every layer, Talos V2 time-multiplexes one known-good tile across the Transformer. This is the central area-throughput tradeoff: spend a few more cycles reusing hardware, but keep the circuit small and fast enough to actually run.
The same tile computes the Q, K, and V projections, then the attention output projection. In software, attention is usually described as one compact expression:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]

In RTL, that expression becomes a schedule. First generate Q, K, and V. Then scan dot products. Track the maximum. Convert scores into approximate weights. Sum them. Divide. Accumulate weighted values. Move the result into the residual path. The math is the same, but the design problem is completely different.
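One way to make that schedule concrete is to name its phases as states. The names below are illustrative, not the actual Talos V2 control FSM, but they show how the single softmax expression fans out into sequenced hardware work:

```systemverilog
// Illustrative phase names for the attention schedule above; the real
// Talos V2 state machine differs, but the sequencing problem is the same.
package attn_sched_pkg;
    typedef enum logic [3:0] {
        S_PROJ_QKV,     // run the shared tile for the Q, K, V projections
        S_SCORES,       // dot products of q against each cached key
        S_TRACK_MAX,    // track the running maximum score for stability
        S_EXP_APPROX,   // turn shifted scores into weights via the exp LUT
        S_SUM_WEIGHTS,  // accumulate the normalization denominator
        S_DIVIDE,       // multicycle saturated divide to normalize
        S_WEIGHTED_V,   // accumulate the weighted value vectors
        S_OUT_PROJ,     // attention output projection on the shared tile
        S_RESIDUAL      // fold the result back into the residual stream
    } attn_state_t;
endpackage
```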
Softmax is where the hardware cost becomes obvious. A direct implementation wants exponentials, a reduction sum, and division. Talos V2 handles that with a small lookup-table-based exponential approximation and a multicycle saturated divider. The divider is not general purpose because it does not need to be. Its input range is bounded by the model, so the engine can be narrower and faster than a naive divider.
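For a flavor of the lookup-table side, here is a hedged sketch of a registered exp LUT. The score has already had the running maximum subtracted, so the input is bounded and non-positive; the table file, index scaling, and clamping here are illustrative assumptions, not the real Talos V2 tables.

```systemverilog
// Sketch of a lookup-table exponential (illustrative table file, index
// scaling, and clamping; not the real Talos V2 tables).  The score has
// already had the running maximum subtracted, so the input is <= 0 and
// the table only covers a bounded negative range.
module exp_lut (
    input  wire               clk,
    input  wire signed [15:0] x_q412,   // score minus max, Q4.12, <= 0
    output reg         [15:0] y_q016    // approx exp(x) in [0, 1), Q0.16
);
    reg [15:0] exp_rom [0:255];
    initial $readmemh("exp_table.hex", exp_rom);   // precomputed offline

    wire [15:0] mag = -x_q412;                     // magnitude of the input
    // 256 entries spanning |x| in [0, 8) with a step of 1/32; clamp anything
    // outside that range to the smallest table value.
    wire [7:0]  idx = mag[15] ? 8'hFF : mag[14:7];

    always @(posedge clk) y_q016 <= exp_rom[idx];
endmodule
```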
The design also uses two divider engines for the attention value path. That lets two channels progress at the same time and cuts a serial bottleneck without exploding the rest of the datapath. This is the kind of optimization that mattered most: targeted parallelism at a proven bottleneck, not parallelism everywhere.
The optimization process was not a straight line. Some ideas made the RTL prettier but slower. Some ideas made simulation faster but failed to fit. The final throughput came from repeatedly asking a more grounded question: does this reduce the number of useful cycles, shorten the critical path, or remove data movement from the token-generation path?
The current pure RTL path generates more than 50,000 tokens per second on the DE1-SoC, with a measured checkpoint around 53,000 tokens/s. That path executes the Transformer block and performs token sampling in hardware. The host is not choosing tokens for the design.
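For completeness, this is what in-fabric sampling can look like in principle: an LFSR supplies a pseudo-random threshold, probabilities stream past it, and the first token whose cumulative sum crosses the threshold is emitted. This is a hypothetical sketch of the general technique, not Talos V2's actual sampler.

```systemverilog
// Hypothetical sketch of in-fabric sampling (NOT the actual Talos V2
// sampler): an LFSR supplies a pseudo-random threshold, probabilities
// stream in one per cycle, and the first token whose running cumulative
// sum crosses the threshold is emitted.
module sampler #(
    parameter VOCAB = 27                       // e.g. a-z plus an end token
) (
    input  wire                       clk,
    input  wire                       start,   // latch threshold, clear sum
    input  wire                       valid,   // one probability this cycle
    input  wire                       last,    // final vocabulary entry
    input  wire [15:0]                prob,    // unsigned Q0.16 probability
    input  wire [$clog2(VOCAB)-1:0]   token_in,
    output reg  [$clog2(VOCAB)-1:0]   token_out,
    output reg                        done
);
    // Free-running 16-bit maximal-length LFSR (taps 16, 14, 13, 11).
    reg [15:0] lfsr = 16'hACE1;
    wire fb = lfsr[15] ^ lfsr[13] ^ lfsr[12] ^ lfsr[10];

    reg [15:0] thresh;
    reg [16:0] cdf;                            // wide enough for a full Q0.16 sum
    reg        found;

    always @(posedge clk) begin
        lfsr <= {lfsr[14:0], fb};
        done <= 1'b0;
        if (start) begin
            thresh <= lfsr;
            cdf    <= 17'd0;
            found  <= 1'b0;
        end else if (valid && !found) begin
            cdf <= cdf + prob;
            // 'last' guards the rounding edge case where quantized
            // probabilities never cross the threshold.
            if ((cdf + prob > {1'b0, thresh}) || last) begin
                token_out <= token_in;
                found     <= 1'b1;
                done      <= 1'b1;
            end
        end
    end
endmodule
```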
That number is important, but the way the design reached it is more important. The speedup came from a collection of disciplined hardware decisions: keep weights local, reuse the tile, stream data instead of buffering everything, fold bookkeeping into existing passes, bound the math engines, reduce serial attention work, and only raise the clock after timing closure supported it.
Talos V2 is therefore both an accelerator and a map. It shows how a Transformer can be lowered into RTL, where the performance traps are, and why the best optimization is often not the biggest datapath. The best optimization is the one that survives synthesis, fits the board, closes timing, and removes real work from the next-token loop.
Talos V2 is open source because accelerator design is easier to learn when the full stack is visible. There is still a lot of room to take this design further: tighter scheduling, better memory reuse, wider or better-balanced tiles, improved sampling, cleaner host integration, and more aggressive timing-driven RTL cleanup. If you want to push the throughput higher, improve the tooling, or extend the writeup, open an issue, send a PR, or explore the Talos V2 repo.
Big shoutout to Gawtham for giving us the FPGA that made this project possible.