Cryptography

Inspired by NaCl: https://nacl.cr.yp.to/

Why I Started with AES

AES is a perfect vehicle for teaching hardware cryptography because it embodies nearly every challenge I'll face when implementing crypto in hardware. It requires finite field arithmetic, careful state management, and complex control flow, and it demands both high throughput and constant-time execution. If I could implement AES correctly in hardware, most of the patterns needed for other cryptographic primitives would be easier. More pragmatically, AES is everywhere: trading systems, secure communications, and data storage all rely on it. By starting here, I could immediately validate my design choices against real-world requirements.

I wanted to build more than just a simple AES implementation, so I decided to build HCL. The features I talk about in this wiki are not implemented as of July 2025; the pseudocode is only meant to give you an idea of what HCL is all about.

You can find the code here: github.com/2SpaceMasterRace/HCL

Understanding AES Through a Hardware Lens

Before diving into Hardcaml code, let's understand why AES is particularly well-suited for hardware implementation. The algorithm operates on a 4×4 matrix of bytes, applying the same operations repeatedly. This regular structure maps beautifully to hardware's spatial parallelism.

In software, you might implement AES with loops and conditional branches. In hardware, we think differently. Every operation happens simultaneously across all 16 bytes of the state. When we perform SubBytes, we're not iterating through bytes—we're instantiating 16 S-box circuits that all compute in parallel. This fundamental shift in thinking—from temporal to spatial computation—underlies everything in HCL.

The four AES operations each serve a specific cryptographic purpose, and each presents unique implementation challenges:

  • SubBytes provides non-linearity through S-box substitution. In hardware, we face a classic tradeoff: lookup tables (fast but resource-heavy) versus computed S-boxes (compact but slower). The choice depends on your target FPGA and performance requirements.

  • ShiftRows is pure wire routing—no logic gates needed. This showcases hardware's ability to perform certain operations at zero cost. What requires memory operations and index arithmetic in software becomes simple signal reordering in hardware.

  • MixColumns involves Galois field multiplication, introducing the challenge of implementing finite field arithmetic efficiently. The operation mixes bytes within columns, providing diffusion.

  • AddRoundKey is a simple XOR operation, but it requires careful key schedule implementation to ensure all round keys are available when needed.
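To make the "pure wire routing" point about ShiftRows concrete, here is a minimal software sketch in Python (not Hardcaml, and not HCL code). It assumes the FIPS-197 column-major state layout; the names `SHIFT_ROWS_PERM` and `shift_rows` are made up for illustration. The whole operation reduces to a fixed index permutation, which is what becomes plain signal reordering in hardware.

```python
# Software model only: in hardware this permutation is just wiring.
# Assumed layout (FIPS-197): byte i sits at row i % 4, column i // 4.
# Row r is rotated left by r columns.
SHIFT_ROWS_PERM = [(i % 4) + 4 * ((i // 4 + i % 4) % 4) for i in range(16)]


def shift_rows(state: bytes) -> bytes:
    """Output byte i is input byte SHIFT_ROWS_PERM[i]; the data is never modified."""
    return bytes(state[j] for j in SHIFT_ROWS_PERM)
```

Because the permutation is fixed at design time, a synthesis tool has nothing to compute: each output bit is connected directly to one input bit.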

Architecture Decision: Iterative vs. Pipelined

One of HCL's core design principles is that no single implementation can serve all use cases. For AES, I explored three architectures, each optimized for different scenarios.

The iterative architecture implements one round of AES and reuses it 10 times. Think of it as the hardware equivalent of a for loop. This approach minimizes area—you only pay for one round's worth of logic—but requires 11 clock cycles to encrypt a block. Here's how we structure it:
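A minimal sketch of that structure follows, written as a cycle-level software model in Python rather than Hardcaml. It is not the HCL implementation: the class name `IterativeAes128`, the phase names, and the on-the-fly key schedule are all assumptions chosen for illustration. Each `step()` call stands for one clock cycle, and one shared round function is reused for all 10 rounds.

```python
# Software reference model (not Hardcaml): AES-128 driven by a small control FSM.
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]
# ShiftRows as an index permutation over the column-major state.
PERM = [(i % 4) + 4 * ((i // 4 + i % 4) % 4) for i in range(16)]


def gmul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return r


def _sbox_entry(x):
    inv = 1
    for _ in range(254):            # x^254 is the inverse; 0 maps to 0
        inv = gmul(inv, x)
    out = inv
    for n in range(1, 5):           # affine transform: XOR of four rotations
        out ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return out ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def next_round_key(k, rcon):
    """Derive the next round key from the current one (on-the-fly schedule)."""
    t = [SBOX[k[13]] ^ rcon, SBOX[k[14]], SBOX[k[15]], SBOX[k[12]]]
    out = []
    for i in range(16):
        out.append(k[i] ^ (t[i] if i < 4 else out[i - 4]))
    return bytes(out)


def aes_round(state, round_key, last):
    s = [SBOX[state[j]] for j in PERM]           # SubBytes + ShiftRows
    if not last:                                  # the final round skips MixColumns
        s = [gmul(s[4 * c + r], 2) ^ gmul(s[4 * c + (r + 1) % 4], 3)
             ^ s[4 * c + (r + 2) % 4] ^ s[4 * c + (r + 3) % 4]
             for c in range(4) for r in range(4)]
    return bytes(x ^ k for x, k in zip(s, round_key))


class IterativeAes128:
    """One step() call models one clock cycle: Load -> Round (x10) -> Done."""

    def __init__(self, key, block):
        self.key, self.block = key, block
        self.phase, self.round, self.cycles = "load", 0, 0
        self.state = self.round_key = None

    def step(self):
        if self.phase == "done":
            return
        if self.phase == "load":                  # initial AddRoundKey
            self.state = bytes(b ^ k for b, k in zip(self.block, self.key))
            self.round_key, self.round, self.phase = self.key, 1, "round"
        else:                                     # the single shared round datapath
            self.round_key = next_round_key(self.round_key, RCON[self.round - 1])
            self.state = aes_round(self.state, self.round_key, last=self.round == 10)
            self.phase = "done" if self.round == 10 else "round"
            self.round += 1
        self.cycles += 1


def encrypt(key, block):
    core = IterativeAes128(key, block)
    while core.phase != "done":
        core.step()
    return core.state, core.cycles
```

In this model, `encrypt` returns the ciphertext together with the cycle count, so a test can check both the result and the 11-cycle latency described above.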

The beauty of this approach lies in its simplicity and flexibility. Need to add fault detection? Insert checking logic in the Round state. Want to support different key sizes? Parameterize the round count. The state machine abstraction makes these modifications straightforward.

The pipelined architecture takes the opposite approach: it unrolls all rounds into dedicated hardware. Each round becomes a pipeline stage, with registers between stages. This trades area for throughput—you can start a new encryption every cycle, achieving peak throughput of one block per cycle after the initial latency.

The pipelined approach showcases hardware's true strength: massive parallelism. With 11 blocks in flight simultaneously, we achieve throughput impossible in software. The cost? We've instantiated 160 S-boxes (16 per round × 10 rounds) compared to just 16 in the iterative design.
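The tradeoff can be put into numbers with a small back-of-the-envelope model in Python. This is a sketch under simple assumptions, not a measurement: the iterative core is assumed to accept a new block only after the previous one finishes, the pipelined core is assumed to accept one block per cycle, and only datapath S-boxes are counted (key schedule logic is ignored). The function names are hypothetical.

```python
ROUNDS = 10
LATENCY = ROUNDS + 1   # initial AddRoundKey cycle plus 10 round cycles


def cycles_iterative(n_blocks: int) -> int:
    """Blocks are processed back to back, one at a time."""
    return LATENCY * n_blocks


def cycles_pipelined(n_blocks: int) -> int:
    """The first block pays the full latency; each later block adds one cycle."""
    return LATENCY + (n_blocks - 1) if n_blocks > 0 else 0


def datapath_sboxes(architecture: str) -> int:
    """16 S-boxes per round instance; the pipeline instantiates every round."""
    return 16 if architecture == "iterative" else 16 * ROUNDS
```

For 1000 blocks this model gives 11000 cycles for the iterative design and 1010 for the pipelined one, at ten times the S-box count.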

The S-Box

The S-box implementation decision ripples through the entire design. It's worth examining in detail because it exemplifies the nuanced tradeoffs in hardware design.

The naive approach uses a lookup table:

Simple, but this synthesizes to a massive multiplexer tree. On modern FPGAs with abundant lookup tables (LUTs), this might be optimal. But what if we're area-constrained? The composite field approach computes the S-box using mathematical operations:

This approach uses about 1/5 the area but requires more logic levels, potentially limiting clock frequency. The choice depends on your specific requirements—there's no universally correct answer.

In the future, I aim for HCL to provide both implementations and make the choice explicit:
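As an illustration of what an explicit choice could look like, here is a hedged Python sketch (a software model, not HCL's API). `SboxImpl`, `sbox_computed`, and `sbox` are hypothetical names. Note that the computed variant below uses plain GF(2^8) inversion by exponentiation followed by the affine transform; it is a simpler stand-in for the composite field decomposition, not an implementation of it.

```python
from enum import Enum


class SboxImpl(Enum):
    TABLE = "table"          # 256-entry lookup
    COMPUTED = "computed"    # inversion + affine transform, no table


def gmul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return r


def sbox_computed(x: int) -> int:
    inv = 1
    for _ in range(254):            # x^254 equals x^-1 in GF(2^8); 0 maps to 0
        inv = gmul(inv, x)
    out = inv
    for n in range(1, 5):           # affine transform: XOR of four left rotations
        out ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return out ^ 0x63


# The table variant is generated once from the computed one.
SBOX_TABLE = bytes(sbox_computed(x) for x in range(256))


def sbox(x: int, impl: SboxImpl) -> int:
    """The caller must name the implementation; there is no silent default."""
    if impl is SboxImpl.TABLE:
        return SBOX_TABLE[x]
    return sbox_computed(x)
```

Requiring the `impl` argument forces the area-versus-speed decision to appear at every call site instead of being buried in a default.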

Galois Field Arithmetic

MixColumns requires multiplication in GF(2^8), and implementing this efficiently is crucial for performance. The key insight is that multiplication by fixed values (2 and 3 in MixColumns) can be optimized into simple circuits:

For MixColumns, we only multiply by constants 1, 2, and 3. This means we can hardcode optimized circuits:
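A minimal software sketch of those constant multipliers, in Python rather than Hardcaml, with hypothetical names `mul2`, `mul3`, and `mix_column`. Multiplying by 2 is a one-bit shift plus a conditional XOR with 0x1B (the reduction by the AES polynomial), and multiplying by 3 is that result XORed with the input.

```python
def mul2(b: int) -> int:
    """Multiply by 2 in GF(2^8): shift left, reduce by 0x1B if bit 7 was set."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0)) & 0xFF


def mul3(b: int) -> int:
    """3 * b = (2 * b) XOR b."""
    return mul2(b) ^ b


def mix_column(col: bytes) -> bytes:
    """Apply the MixColumns matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    return bytes([
        mul2(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ mul2(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ mul2(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ mul2(a3),
    ])
```

In hardware the conditional in `mul2` is not a branch: bit 7 simply feeds a few XOR gates, so the circuit is a handful of gates per byte.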

Key Schedule

The key schedule often gets less attention than the main datapath, but it's equally critical. Poor key schedule implementation can bottleneck the entire system. HCL aims to offer three key schedule strategies:

The choice of key schedule strategy depends on your architecture. Pipelined designs need all keys immediately, making precomputation mandatory. Iterative designs can trade latency for area by computing keys on demand.
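The two strategies named in this paragraph can be sketched in software as follows. This is a Python model for AES-128 only, not HCL code, and the names `expand_all` and `round_keys_on_demand` are hypothetical. The precomputed variant materializes all 11 round keys up front, while the on-demand variant keeps only the current round key and derives the next one when asked.

```python
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def gmul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return r


def _sbox_entry(x):
    inv = 1
    for _ in range(254):            # x^254 is the inverse; 0 maps to 0
        inv = gmul(inv, x)
    out = inv
    for n in range(1, 5):           # affine transform
        out ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return out ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def next_round_key(k: bytes, rcon: int) -> bytes:
    """One key schedule step: RotWord, SubWord, Rcon, then chained XORs."""
    t = [SBOX[k[13]] ^ rcon, SBOX[k[14]], SBOX[k[15]], SBOX[k[12]]]
    out = []
    for i in range(16):
        out.append(k[i] ^ (t[i] if i < 4 else out[i - 4]))
    return bytes(out)


def expand_all(key: bytes) -> list:
    """Precomputed strategy: all 11 round keys exist before encryption starts."""
    keys = [key]
    for rcon in RCON:
        keys.append(next_round_key(keys[-1], rcon))
    return keys


def round_keys_on_demand(key: bytes):
    """On-demand strategy: only one round key is held at any time."""
    current = key
    yield current
    for rcon in RCON:
        current = next_round_key(current, rcon)
        yield current
```

The precomputed form corresponds to 11 key registers feeding a pipeline; the generator corresponds to a single key register updated once per round in an iterative core.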

Constant-Time Execution

One of hardware's advantages is natural constant-time execution. Every operation takes exactly the same number of clock cycles, regardless of data values. But this property must be preserved carefully:

HCL could in theory enforce constant-time operations through type design:
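One way to picture this idea is a wrapper type that refuses to be used as a branch condition or compared directly. The Python sketch below only models the shape of such an API; Python itself gives no timing guarantees, and `Secret`, `to_mask`, `reveal`, and `ct_select` are hypothetical names, not HCL features.

```python
def _val(x):
    """Unwrap a Secret or mask a plain int to 8 bits."""
    return x._v if isinstance(x, Secret) else x & 0xFF


class Secret:
    """A byte whose value must never steer control flow."""

    def __init__(self, value: int):
        self._v = value & 0xFF

    # Data-path operations are allowed and stay inside the wrapper.
    def __xor__(self, other):
        return Secret(self._v ^ _val(other))

    def __and__(self, other):
        return Secret(self._v & _val(other))

    def __or__(self, other):
        return Secret(self._v | _val(other))

    def __invert__(self):
        return Secret(~self._v)

    def to_mask(self):
        """Spread bit 0 across the byte: 0 -> 0x00, 1 -> 0xFF."""
        return Secret(-(self._v & 1))

    # Control-flow uses are rejected.
    def __bool__(self):
        raise TypeError("secret value used as a branch condition")

    def __eq__(self, other):
        raise TypeError("secret value compared directly")

    def reveal(self) -> int:
        """Explicit, greppable escape hatch for outputs and tests."""
        return self._v


def ct_select(cond: Secret, if_one: Secret, if_zero: Secret) -> Secret:
    """Branch-free 2:1 mux: both inputs are always evaluated."""
    mask = cond.to_mask()
    return (if_one & mask) | (if_zero & ~mask)
```

Writing `if key_byte:` with a `Secret` fails immediately, while `ct_select` expresses the same choice as the mux a hardware designer would draw.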

Testing

Testing cryptographic hardware requires more than just checking test vectors. HCL's testing philosophy encompasses multiple layers:

First, we validate against known test vectors:

We can also verify constant-time execution:

We can use property-based testing to explore corner cases:
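As a small illustration of the property-based style, the Python sketch below checks two algebraic properties of MixColumns on random inputs instead of fixed vectors: the inverse transform undoes it, and it is linear over XOR. It uses the standard library `random` module with a fixed seed rather than a dedicated property-testing framework, and all function names are hypothetical.

```python
import random


def gmul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return r


def _circulant(col, coef):
    """Multiply a column by the circulant matrix whose first row is coef."""
    return bytes(
        gmul(col[r], coef[0]) ^ gmul(col[(r + 1) % 4], coef[1])
        ^ gmul(col[(r + 2) % 4], coef[2]) ^ gmul(col[(r + 3) % 4], coef[3])
        for r in range(4)
    )


def mix_column(col):
    return _circulant(col, (0x02, 0x03, 0x01, 0x01))


def inv_mix_column(col):
    return _circulant(col, (0x0E, 0x0B, 0x0D, 0x09))


def check_mix_column_properties(trials=500, seed=2025):
    """Raise AssertionError on the first counterexample; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = bytes(rng.randrange(256) for _ in range(4))
        b = bytes(rng.randrange(256) for _ in range(4))
        # Property 1: the inverse transform recovers the input.
        assert inv_mix_column(mix_column(a)) == a, a.hex()
        # Property 2: MixColumns is linear over XOR.
        a_xor_b = bytes(x ^ y for x, y in zip(a, b))
        expected = bytes(x ^ y for x, y in zip(mix_column(a), mix_column(b)))
        assert mix_column(a_xor_b) == expected, (a.hex(), b.hex())
    return trials
```

The same pattern carries over to a hardware simulation: generate random stimulus, run it through the simulated circuit, and compare against properties or a reference model.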

Integration Patterns

A cryptographic primitive is only useful if it can be easily integrated into larger systems. HCL aims to provide several integration patterns.

For streaming applications, buffered interfaces are in the works:

For integration with existing designs, HCL can also provide wrappers similar to Google Tink's API:

Performance Optimization

Once a working implementation exists, optimization begins. HCL aims to provide tools to analyze and improve performance:

Real optimization often requires algorithmic changes. For example, bit-slicing can dramatically improve throughput for parallel operations:
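To show what bit-slicing means in practice, here is a minimal Python sketch (a software model, not HCL code; the function names are hypothetical). Instead of storing each byte separately, it stores eight "bit planes", where plane k holds bit k of every byte. Multiplication by 2 in GF(2^8) then becomes a fixed rewiring plus three XORs that act on all bytes at once.

```python
def to_bitplanes(data: bytes) -> list:
    """planes[k] has bit `lane` set when bit k of data[lane] is set."""
    planes = [0] * 8
    for lane, byte in enumerate(data):
        for bit in range(8):
            planes[bit] |= ((byte >> bit) & 1) << lane
    return planes


def from_bitplanes(planes: list, n: int) -> bytes:
    """Inverse of to_bitplanes for n lanes."""
    return bytes(
        sum(((planes[bit] >> lane) & 1) << bit for bit in range(8))
        for lane in range(n)
    )


def xtime_bitsliced(p: list) -> list:
    """Multiply every lane by 2 in GF(2^8) using whole-plane operations.

    A left shift moves plane k to plane k + 1; the old top plane (bit 7)
    is XORed into planes 0, 1, 3 and 4, which is the 0x1B reduction.
    """
    b0, b1, b2, b3, b4, b5, b6, b7 = p
    return [b7, b0 ^ b7, b1, b2 ^ b7, b3 ^ b7, b4, b5, b6]
```

Here a single call processes as many bytes as there are lanes, with no data-dependent branch; in hardware the same structure is three XOR gates per byte and nothing else.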

Modes of Operation

While AES provides the foundation, practical applications require modes of operation. HCL's design makes composition natural:
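As one example of such composition, the Python sketch below builds counter (CTR) mode on top of any 16-byte block function passed in as a parameter. It is an illustration rather than HCL's interface: the 12-byte nonce plus 32-bit big-endian counter layout is an assumption, and `toy_block` is a deliberately insecure placeholder that only exists so the example runs without a full AES.

```python
from typing import Callable

BlockFn = Callable[[bytes], bytes]   # 16 bytes in, 16 bytes out


def ctr_xor(encrypt_block: BlockFn, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt `data` in CTR mode (the two operations are identical).

    Assumed counter block layout: 12-byte nonce || 32-bit big-endian counter,
    starting at 0.
    """
    if len(nonce) != 12:
        raise ValueError("nonce must be 12 bytes")
    out = bytearray()
    for offset in range(0, len(data), 16):
        counter_block = nonce + (offset // 16).to_bytes(4, "big")
        keystream = encrypt_block(counter_block)
        chunk = data[offset:offset + 16]
        # zip stops at the shorter input, so a partial final block needs no padding.
        out += bytes(d ^ k for d, k in zip(chunk, keystream))
    return bytes(out)


def toy_block(block: bytes) -> bytes:
    """NOT a cipher: a stand-in block function used only to exercise ctr_xor."""
    return bytes((x * 7 + 3) & 0xFF for x in block)
```

The mode never looks inside the block function, which is the property that lets an iterative or pipelined AES core be swapped in without touching the mode logic. CTR is also a natural fit for a pipelined core, because every counter block is independent of the others.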

Looking Forward: Building a Complete Library

AES is just the beginning. The patterns I've established—parameterizable architectures, explicit tradeoffs, comprehensive testing, and composable designs—extend to all cryptographic primitives.

Consider how these patterns apply to other algorithms:

  • SHA-256 shares AES's iterative structure but processes data unidirectionally. The round function is simpler, but the padding logic requires careful handling of message boundaries.

  • ChaCha20 is naturally parallelizable with no table lookups, making it ideal for high-security applications where side-channel resistance is paramount.

  • Elliptic curve operations introduce new challenges; thankfully, Jane Street already has an implementation in Hardcaml ZPrize.

Each primitive added to HCL strengthens the foundation. The goal isn't just to implement algorithms but to create a toolkit that makes secure, efficient hardware cryptography accessible to everyone at Jane Street and beyond.

Building HCL taught me several key lessons about hardware cryptography:

  • First, there's no substitute for deep understanding. You can't implement what you don't understand, and in cryptography, partial understanding is dangerous. That's why I document not just what our code does, but why it does it.

  • Second, flexibility and performance are not opposites. By making architectural choices explicit and providing multiple implementations, I want users to be able to choose the right tradeoff for their application.

  • Third, verification cannot be skipped, even when building an MVP. Every optimization and every line of code must be validated against test vectors, verified for constant-time execution, and tested for edge cases. It is easy to believe some code works, only to watch every test case fail and then spend another 6 hours validating each piece of logic you wrote. That experience did push me to write tests for that logic, which made debugging much faster.

Finally, the best abstraction is one that makes the right thing easy and the wrong thing obvious. HCL's type system, module structure, and API design should work together to guide users toward secure implementations.
