Cryptography
inspired by NaCl: https://nacl.cr.yp.to/
Why I Started with AES
AES is a perfect vehicle for teaching hardware cryptography because it embodies nearly every challenge I'll face when implementing crypto in hardware. It requires finite field arithmetic, careful state management, and complex control flow, and it demands both high throughput and constant-time execution. If I could implement AES correctly in hardware, most of the patterns needed for other cryptographic primitives would come easier. More pragmatically, AES is everywhere. Trading systems, secure communications, data storage—all rely on AES. By starting here, I could immediately validate my design choices against real-world requirements.
I wanted to build more than just a simple AES implementation, so I decided to build HCL. The features I've discussed in this wiki are not implemented as of July 2025, and the pseudocode is meant to give you an idea of what HCL is all about.
You can find the code here: github.com/2SpaceMasterRace/HCL
Understanding AES Through a Hardware Lens
Before diving into Hardcaml code, let's understand why AES is particularly well-suited for hardware implementation. The algorithm operates on a 4×4 matrix of bytes, applying the same operations repeatedly. This regular structure maps beautifully to hardware's spatial parallelism.
In software, you might implement AES with loops and conditional branches. In hardware, we think differently. Every operation happens simultaneously across all 16 bytes of the state. When we perform SubBytes, we're not iterating through bytes—we're instantiating 16 S-box circuits that all compute in parallel. This fundamental shift in thinking—from temporal to spatial computation—underlies everything in HCL.
The four AES operations each serve a specific cryptographic purpose, and each presents unique implementation challenges:
SubBytes provides non-linearity through S-box substitution. In hardware, we face a classic tradeoff: lookup tables (fast but resource-heavy) versus computed S-boxes (compact but slower). The choice depends on your target FPGA and performance requirements.
ShiftRows is pure wire routing—no logic gates needed. This showcases hardware's ability to perform certain operations at zero cost. What requires memory operations and index arithmetic in software becomes simple signal reordering in hardware.
MixColumns involves Galois field multiplication, introducing the challenge of implementing finite field arithmetic efficiently. The operation mixes bytes within columns, providing diffusion.
AddRoundKey is a simple XOR operation, but it requires careful key schedule implementation to ensure all round keys are available when needed.
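To make the "pure wire routing" claim concrete before the hardware versions below, here is a minimal software sketch of ShiftRows in Python (a reference model, not HCL code). The body is pure index remapping, which is exactly why it costs zero logic in hardware:

```python
# AES state is column-major: state[r][c] holds row r of column c.
# ShiftRows rotates row r left by r positions -- a fixed permutation.

def shift_rows(state):
    """Pure index remapping: no arithmetic, no memory -- just wires in hardware."""
    return [[state[r][(c + r) % 4] for c in range(4)] for r in range(4)]

# Fill the state with distinct values so we can watch the bytes move.
state = [[4 * c + r for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
```

In hardware the same permutation is expressed by connecting each output byte's wires to the corresponding input byte; no gates are instantiated at all.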
Architecture Decision: Iterative vs. Pipelined
One of HCL's core design principles is that no single implementation can serve all use cases. For AES, I explored three architectures, each optimized for different scenarios.
The iterative architecture implements one round of AES and reuses it 10 times. Think of it as the hardware equivalent of a for loop. This approach minimizes area—you only pay for one round's worth of logic—but requires 11 clock cycles to encrypt a block. Here's how we structure it:
(* The iterative approach models AES as a state machine *)
type aes_state =
| Idle (* Waiting for new encryption request *)
| LoadKey (* Initial key whitening *)
| Round of int (* Executing round n *)
| Finalize (* Final round without MixColumns *)
| Done (* Result ready *)
let create_iterative scope (i : _ I.t) =
let open Always in
let reg_spec = Reg_spec.create ~clock:i.clock ~clear:i.clear () in
(* State machine for control *)
let state = State_machine.create (module struct
type t = aes_state [@@deriving sexp, compare, enumerate]
end) reg_spec ~enable:vdd in
(* Data path registers *)
let round_counter = Variable.reg reg_spec ~width:4 in
let state_reg = Variable.reg reg_spec ~width:128 in
(* Key expansion - critical design choice here *)
(* Do we expand all keys upfront or on-the-fly? *)
let round_keys =
if config.minimize_latency then
expand_all_keys_parallel i.key (* All keys ready immediately *)
else
expand_keys_on_demand i.key round_counter.value (* Save area *)
in
(* The round function - this is where the magic happens *)
let round_output =
let after_subbytes = sub_bytes state_reg.value in
let after_shiftrows = shift_rows after_subbytes in
let after_mixcolumns =
(* MixColumns is skipped in the final round *)
mux2 (round_counter.value ==:. 10)
after_shiftrows
(mix_columns after_shiftrows) in
add_round_key after_mixcolumns (select_round_key round_keys round_counter.value)
in
(* State machine behavior *)
compile [
(* Parentheses matter: function application binds tighter than @, so the
concatenated case list must be grouped before passing it to switch *)
state.switch ([
Idle, [
when_ i.start [
state_reg <-- i.plaintext;
round_counter <-- zero 4;
state.set_next LoadKey;
]
];
LoadKey, [
(* Initial AddRoundKey *)
state_reg <-- add_round_key state_reg.value (select_round_key round_keys 0);
round_counter <-- one 4;
state.set_next (Round 1);
];
] @ List.concat_map (List.range 1 11) ~f:(fun r ->
[Round r, [
state_reg <-- round_output;
round_counter <-- round_counter.value +:. 1;
state.set_next (if r = 10 then Done else Round (r + 1));
]]
) @ [
Done, [
when_ (~:(i.start)) [
state.set_next Idle;
]
];
])
];
{ O.valid = state.is Done;
ciphertext = state_reg.value;
ready = state.is Idle }
The beauty of this approach lies in its simplicity and flexibility. Need to add fault detection? Insert checking logic in the Round state. Want to support different key sizes? Parameterize the round count. The state machine abstraction makes these modifications straightforward.
The pipelined architecture takes the opposite approach: it unrolls all rounds into dedicated hardware. Each round becomes a pipeline stage, with registers between stages. This trades area for throughput—you can start a new encryption every cycle, achieving peak throughput of one block per cycle after the initial latency.
(* Pipelined architecture - maximum throughput *)
let create_pipelined scope (i : _ I.t) =
let open Signal in
let reg_spec = Reg_spec.create ~clock:i.clock ~clear:i.clear () in
(* Generate all round keys upfront *)
let round_keys = expand_all_keys_parallel i.key in
(* Pipeline: each round is a combinational function followed by registers *)
let stage_outputs = Array.create ~len:11 (zero 128) in
(* Initial round key addition *)
stage_outputs.(0) <-
reg reg_spec ~enable:i.valid
(i.plaintext ^: select_round_key round_keys 0);
(* Main rounds 1-9 *)
for round = 1 to 9 do
let round_function input =
input
|> sub_bytes
|> shift_rows
|> mix_columns
|> add_round_key (select_round_key round_keys round)
in
stage_outputs.(round) <-
reg reg_spec ~enable:vdd (round_function stage_outputs.(round - 1))
done;
(* Final round (no MixColumns) *)
stage_outputs.(10) <-
reg reg_spec ~enable:vdd
(stage_outputs.(9)
|> sub_bytes
|> shift_rows
|> add_round_key (select_round_key round_keys 10));
(* Valid signal propagation through pipeline *)
let valid_pipe =
Array.create ~len:11 (zero 1) in
valid_pipe.(0) <- reg reg_spec ~enable:vdd i.valid;
for stage = 1 to 10 do
valid_pipe.(stage) <- reg reg_spec ~enable:vdd valid_pipe.(stage - 1)
done;
{ O.valid = valid_pipe.(10);
ciphertext = stage_outputs.(10);
ready = vdd } (* Always ready to accept new data *)
The pipelined approach showcases hardware's true strength: massive parallelism. With 11 blocks in flight simultaneously, we achieve throughput impossible in software. The cost? We've instantiated 160 S-boxes (16 per round × 10 rounds) compared to just 16 in the iterative design.
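The tradeoff is easy to quantify with back-of-envelope arithmetic. The 200 MHz clock below is an assumed figure, not a measured one; real numbers depend on the target FPGA and timing closure:

```python
# Back-of-envelope throughput comparison of the two architectures.
# CLOCK_HZ is an assumption for illustration, not a synthesis result.

CLOCK_HZ = 200e6
BLOCK_BITS = 128

# Iterative: one block every 11 cycles (initial key add + 10 rounds).
iterative_gbps = BLOCK_BITS * CLOCK_HZ / 11 / 1e9

# Pipelined: one block per cycle once the 11-stage pipeline is full.
pipelined_gbps = BLOCK_BITS * CLOCK_HZ / 1e9

print(f"iterative: {iterative_gbps:.2f} Gbit/s")
print(f"pipelined: {pipelined_gbps:.2f} Gbit/s")
```

At the assumed clock the pipelined design delivers 25.6 Gbit/s, an 11× gain, paid for with roughly 10× the round logic.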
The S-Box
The S-box implementation decision ripples through the entire design. It's worth examining in detail because it exemplifies the nuanced tradeoffs in hardware design.
The naive approach uses a lookup table:
let sbox_lut = [|
0x63; 0x7c; 0x77; 0x7b; 0xf2; 0x6b; 0x6f; 0xc5;
(* ... 248 more values ... *)
|]
let sub_byte_lut byte =
(* This becomes a 256-to-1 multiplexer in hardware *)
mux byte (Array.to_list (Array.map ~f:(of_int ~width:8) sbox_lut))
Simple, but this synthesizes to a massive multiplexer tree. On modern FPGAs with abundant lookup tables (LUTs), this might be optimal. But what if we're area-constrained? The composite field approach computes the S-box using mathematical operations:
(* Composite field S-box - trading computation for area *)
let sub_byte_composite byte =
(* Map from GF(2^8) to GF((2^4)^2) *)
let map_to_composite x =
let open Signal in
(* This seemingly magical matrix comes from irreducible polynomial selection *)
let bits = Array.init 8 ~f:(fun i -> bit x i) in
concat [
bits.(7) ^: bits.(5);
bits.(7) ^: bits.(6) ^: bits.(4) ^: bits.(3) ^: bits.(2) ^: bits.(1);
bits.(7) ^: bits.(5) ^: bits.(3) ^: bits.(2);
bits.(7) ^: bits.(5) ^: bits.(3) ^: bits.(2) ^: bits.(1);
bits.(7) ^: bits.(6) ^: bits.(2) ^: bits.(1);
bits.(7) ^: bits.(4) ^: bits.(3) ^: bits.(2) ^: bits.(1);
bits.(6) ^: bits.(4) ^: bits.(1);
bits.(6) ^: bits.(1) ^: bits.(0);
]
in
(* Compute multiplicative inverse in GF((2^4)^2) *)
(* This is where we save area - 4-bit operations instead of 8-bit *)
let inv_in_composite = composite_field_inverse (map_to_composite byte) in
(* Map back and apply affine transformation *)
let from_composite = inverse_isomorphism inv_in_composite in
affine_transform from_composite
This approach uses about 1/5 the area but requires more logic levels, potentially limiting clock frequency. The choice depends on your specific requirements—there's no universally correct answer.
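Whichever variant is chosen, it has to match the specification bit for bit. A small software golden model — Python here, computing the S-box as GF(2^8) inverse followed by the affine transform — is enough to cross-check both the LUT contents and the composite-field datapath:

```python
# Golden model of the AES S-box: multiplicative inverse in GF(2^8),
# then the affine transform with constant 0x63 (FIPS-197, Section 5.1.1).

def xtime(a):
    """Multiply by 2 in GF(2^8) with the AES polynomial 0x11b."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    acc = 0
    for i in range(8):
        if (b >> i) & 1:
            acc ^= a
        a = xtime(a)
    return acc

def gf_inv(a):
    """a^254 = a^-1 in GF(2^8); conveniently maps 0 -> 0 as AES requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def sbox(a):
    x = gf_inv(a)
    out = 0
    for i in range(8):
        # Affine transform: b'_i = b_i ^ b_{i+4} ^ b_{i+5} ^ b_{i+6} ^ b_{i+7} ^ c_i
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out
```

Generating all 256 entries from this model and comparing against the synthesized circuit's truth table is how I'd validate either implementation choice.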
In the future, I aim for HCL to provide both implementations and make the choice explicit:
module type SboxConfig = sig
val implementation : [`Lut | `Composite | `CanrightComposite]
val pipeline_depth : int (* For deep pipelining if needed *)
end
module MakeSbox (Config : SboxConfig) : sig
val sub_byte : Signal.t -> Signal.t
val latency : int
end
Galois Field Arithmetic
MixColumns requires multiplication in GF(2^8), and implementing this efficiently is crucial for performance. The key insight is that multiplication by fixed values (2 and 3 in MixColumns) can be optimized into simple circuits:
(* Multiplication by 2 in GF(2^8) - the xtime operation *)
let gf_mul2 x =
let open Signal in
(* Left shift *)
let shifted = sll x 1 in
(* Conditional XOR with irreducible polynomial if MSB was set *)
mux2 (msb x)
(shifted ^: of_int ~width:8 0x1b)
shifted
(* Multiplication by 3 = multiplication by 2 + original value *)
let gf_mul3 x =
gf_mul2 x ^: x
let gf_mul a b =
let open Signal in
(* Shift-and-add ("Russian peasant") multiplication in GF(2^8):
double a each step, accumulate it when the matching bit of b is set *)
let acc = ref (zero 8) in
let shifted = ref a in
for i = 0 to 7 do
acc := mux2 (bit b i) (!acc ^: !shifted) !acc;
shifted := gf_mul2 !shifted
done;
!acc
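The same shift-and-add structure is easy to validate against the multiplication example worked through in FIPS-197 ({57} × {13} = {fe}) with a small software model:

```python
# Software golden model of the GF(2^8) primitives above, checked against
# the worked multiplication example in FIPS-197 Section 4.2.

def xtime(a):
    """Multiply by 2: shift left, reduce by 0x1b when the old MSB was set."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def gf_mul(a, b):
    """Shift-and-add multiplication, mirroring the hardware loop."""
    acc = 0
    for i in range(8):
        if (b >> i) & 1:
            acc ^= a
        a = xtime(a)
    return acc
```

Running a model like this over all 65536 input pairs and diffing against a simulation of the hardware multiplier is cheap insurance against reduction-polynomial mistakes.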
For MixColumns, we only multiply by constants 1, 2, and 3. This means we can hardcode optimized circuits:
let mix_single_column column =
let open Signal in
let bytes = Array.init 4 ~f:(fun i ->
select column ((i+1)*8 - 1) (i*8)) in
(* Rows of the MixColumns matrix; concat is msb-first, so byte 3's row
is listed first and byte 0's row last *)
concat [
gf_mul3 bytes.(0) ^: bytes.(1) ^: bytes.(2) ^: gf_mul2 bytes.(3);
bytes.(0) ^: bytes.(1) ^: gf_mul2 bytes.(2) ^: gf_mul3 bytes.(3);
bytes.(0) ^: gf_mul2 bytes.(1) ^: gf_mul3 bytes.(2) ^: bytes.(3);
gf_mul2 bytes.(0) ^: gf_mul3 bytes.(1) ^: bytes.(2) ^: bytes.(3);
]
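A software model of one column, checked against the FIPS-197 worked example (the round-1 column [d4, bf, 5d, 30] must map to [04, 66, 81, e5]), catches byte-ordering mistakes early:

```python
# Reference model for one MixColumns column, validated against the
# FIPS-197 Appendix B round-1 intermediate values.

def xtime(a):
    """Multiply by 2 in GF(2^8): shift, reduce by 0x1b if the MSB was set."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def mul2(a): return xtime(a)
def mul3(a): return xtime(a) ^ a

def mix_column(b):
    """Apply the MixColumns matrix [[2,3,1,1],[1,2,3,1],[1,1,2,3],[3,1,1,2]]."""
    return [
        mul2(b[0]) ^ mul3(b[1]) ^ b[2] ^ b[3],
        b[0] ^ mul2(b[1]) ^ mul3(b[2]) ^ b[3],
        b[0] ^ b[1] ^ mul2(b[2]) ^ mul3(b[3]),
        mul3(b[0]) ^ b[1] ^ b[2] ^ mul2(b[3]),
    ]
```

Byte ordering between the software model and the hardware concat is exactly the kind of mismatch this check exists to catch.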
Key Schedule
The key schedule often gets less attention than the main datapath, but it's equally critical. Poor key schedule implementation can bottleneck the entire system. HCL aims to offer three key schedule strategies:
module type KeyScheduleStrategy = sig
type t
val create : Signal.t -> t
val get_round_key : t -> round:Signal.t -> Signal.t
val area_estimate : int
val latency : int
end
(* Strategy 1: Precompute all keys *)
module PrecomputedKeys : KeyScheduleStrategy = struct
type t = Signal.t array
let create key =
let keys = Array.create ~len:11 (zero 128) in
keys.(0) <- key;
for i = 1 to 10 do
(* Key expansion logic using previous key *)
keys.(i) <- expand_key keys.(i-1) i
done;
keys
let get_round_key keys ~round =
(* Simple mux - all keys ready immediately *)
mux round (Array.to_list keys)
let area_estimate = 1408 (* 11 keys × 128 bits *)
let latency = 0 (* All keys ready immediately *)
end
(* Strategy 2: On-demand generation *)
module OnDemandKeys : KeyScheduleStrategy = struct
type t = Signal.t (* Just store the original key *)
let create key = key
let get_round_key key ~round =
(* Iteratively compute the requested round key *)
(* This saves area but adds latency *)
(* Round keys must be derived in ascending order: key r comes from key r-1 *)
let rec compute_key_up_to r initial_key =
if r = 0 then initial_key
else expand_key (compute_key_up_to (r - 1) initial_key) r
in
compute_key_up_to round key
let area_estimate = 128 (* Only store original key *)
let latency = 10 (* Worst case: computing round 10 key *)
end
(* Strategy 3: Hybrid - store some intermediate keys *)
module HybridKeys : KeyScheduleStrategy = struct
(* Store keys 0, 5, 10 - compute others as needed *)
(* Balances area and latency *)
end
The choice of key schedule strategy depends on your architecture. Pipelined designs need all keys immediately, making precomputation mandatory. Iterative designs can trade latency for area by computing keys on demand.
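Whatever strategy is chosen, all of them must agree with the specification. A software golden model of the AES-128 schedule — checked here against the FIPS-197 Appendix A expansion of the example key — gives a single source of truth to test against:

```python
# Golden model of the AES-128 key schedule. The S-box is computed
# (inverse + affine) rather than tabulated, to keep the sketch self-contained.

def xtime(a):
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def gf_mul(a, b):
    acc = 0
    for _ in range(8):
        if b & 1:
            acc ^= a
        a, b = xtime(a), b >> 1
    return acc

def sbox(a):
    x = 1
    for _ in range(254):          # x = a^254 = a^-1 in GF(2^8); 0 -> 0
        x = gf_mul(x, a)
    out = 0
    for i in range(8):            # affine transform with constant 0x63
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

def expand_key_128(key):
    """Return 44 four-byte words; round key r is words[4r : 4r+4]."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = t[1:] + t[:1]                 # RotWord
            t = [sbox(b) for b in t]          # SubWord
            t[0] ^= rcon                      # round constant
            rcon = xtime(rcon)
        w.append([x ^ y for x, y in zip(w[i - 4], t)])
    return w

key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
words = expand_key_128(key)
round1 = bytes(b for wd in words[4:8] for b in wd).hex()
```

In a testbench, this model can feed expected round keys into the simulation for every strategy, so a bug in on-demand generation shows up as a mismatch at a specific round rather than as a wrong ciphertext ten rounds later.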
Constant-Time Execution
One of hardware's advantages is natural constant-time execution. Every operation takes exactly the same number of clock cycles, regardless of data values. But this property must be preserved carefully:
(* DANGEROUS: Data-dependent timing *)
let bad_early_termination data =
let rec first_zero_byte i =
if i >= 16 then None
else if select data (i*8+7) (i*8) = zero 8 then Some i
else first_zero_byte (i + 1)
in
(* This creates different circuits based on data! *)
(* SAFE: Always constant time *)
let good_constant_time data =
let is_zero = Array.init 16 ~f:(fun i ->
let byte = select data (i*8+7) (i*8) in
reduce ~f:(&:) (List.init 8 ~f:(fun b -> ~:(bit byte b)))
) in
(* Always examines all bytes, combines results with a fixed mux chain *)
Array.foldi is_zero ~init:(zero 4) ~f:(fun idx acc is_z ->
mux2 is_z (of_int ~width:4 idx) acc
)
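The mux2-chain pattern above has a direct software analogue: branchless selection by masking. This Python sketch (illustrative, not HCL) shows the same idea — both candidates are always evaluated, and the choice is pure arithmetic on a 0/1 condition rather than a branch whose timing could depend on secret data:

```python
# Software analogue of hardware mux2: expand a 0/1 condition into a full
# mask and combine both candidates with bitwise logic -- no data-dependent
# branch, so execution takes the same path for every input.

def ct_select(cond_bit, if_true, if_false, width=8):
    """Select between two width-bit values using a mask, not a branch."""
    full = (1 << width) - 1
    mask = -cond_bit & full        # 0x00..0 when cond_bit=0, 0xFF..F when 1
    return (if_true & mask) | (if_false & ~mask & full)
```

In hardware this property comes for free: a mux physically evaluates both inputs every cycle. The discipline is about not reintroducing data-dependent control flow around it.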
HCL could in theory enforce constant-time operations through type design:
(* Secret data is tagged at the type level *)
module Secret : sig
type 'a t
val create : 'a -> 'a t
val unwrap : 'a t -> 'a
(* Operations that preserve constant-time *)
val map : 'a t -> f:('a -> 'b) -> 'b t
val map2 : 'a t -> 'b t -> f:('a -> 'b -> 'c) -> 'c t
(* Explicitly unsafe operations require justification *)
val unsafe_select : bool t -> 'a t -> 'a t -> 'a t
[@@alert unsafe "Selection on secret data breaks constant-time"]
end
Testing
Testing cryptographic hardware requires more than just checking test vectors. HCL's testing philosophy encompasses multiple layers:
First, we validate against known test vectors:
let test_nist_vectors () =
let test_vectors = [
(* From FIPS-197 Appendix B *)
{ plaintext = "128'h3243f6a8885a308d313198a2e0370734";
key = "128'h2b7e151628aed2a6abf7158809cf4f3c";
ciphertext = "128'h3925841d02dc09fbdc118597196a0b32" };
] in
List.iter test_vectors ~f:(fun { plaintext; key; ciphertext } ->
let module Sim = Cyclesim.With_interface(I)(O) in
let sim = Sim.create (create_aes ~architecture:`Iterative) in
let inputs = Sim.inputs sim in
let outputs = Sim.outputs sim in
(* Drive inputs - 128-bit constants exceed OCaml's int, so use string literals *)
inputs.plaintext := Bits.of_string plaintext;
inputs.key := Bits.of_string key;
inputs.start := Bits.vdd;
Cyclesim.cycle sim;
inputs.start := Bits.gnd;
(* Wait for completion *)
while Bits.to_bool !(outputs.valid) |> not do
Cyclesim.cycle sim
done;
(* Check result *)
assert (Bits.equal !(outputs.ciphertext) (Bits.of_string ciphertext))
)
We can also verify constant-time execution:
let verify_constant_time () =
(* Run encryption with different data but same structure *)
let trace1 = simulate_and_trace ~data:all_zeros ~key:test_key in
let trace2 = simulate_and_trace ~data:all_ones ~key:test_key in
let trace3 = simulate_and_trace ~data:random_data ~key:test_key in
(* Verify all traces have identical control flow *)
assert (trace1.cycle_count = trace2.cycle_count);
assert (trace2.cycle_count = trace3.cycle_count);
assert (trace1.state_sequence = trace2.state_sequence);
assert (trace2.state_sequence = trace3.state_sequence)
We can use property-based testing to explore corner cases:
let property_tests = [
QCheck.Test.make ~count:10000
~name:"Encryption inverts decryption"
QCheck.(pair (bytes 16) (bytes 16))
(fun (plaintext, key) ->
let ciphertext = encrypt plaintext key in
let recovered = decrypt ciphertext key in
plaintext = recovered);
QCheck.Test.make ~count:10000
~name:"Different keys produce different ciphertexts"
QCheck.(triple (bytes 16) (bytes 16) (bytes 16))
(fun (plaintext, key1, key2) ->
if key1 = key2 then true
else encrypt plaintext key1 <> encrypt plaintext key2);
]
Integration Patterns
A cryptographic primitive is only useful if it can be easily integrated into larger systems, so HCL aims to provide several integration patterns. For streaming applications, buffered interfaces are in the works:
module StreamingAes = struct
type 'a t = {
(* Standard AES signals *)
clock : 'a;
clear : 'a;
(* Streaming interface *)
s_valid : 'a;
s_data : 'a; [@bits 8] (* Byte stream *)
s_ready : 'a; [@rtlname "s_ready"]
(* Configuration *)
key : 'a; [@bits 128]
mode : 'a; [@bits 2] (* ECB, CBC, CTR, GCM *)
} [@@deriving sexp_of, hardcaml]
let create scope (i : _ I.t) =
(* Internal buffering to accumulate 128-bit blocks *)
let buffer = Fifo.create ~capacity:16 ~width:8 in
let block_ready = buffer.count >=:. 16 in
(* State machine to coordinate streaming *)
(* ... implementation ... *)
end
For integration with existing designs, HCL can also provide wrappers similar to Google Tink's API:
module AesAxi4Stream = struct
(* Wrap AES in industry-standard AXI4-Stream interface *)
include Axi4stream.Make(struct
let data_width = 128
let id_width = 0
let dest_width = 0
let user_width = 0
end)
let create scope (i : _ I.t) =
(* Bridge between AXI4-Stream and our AES core *)
let aes = Aes.create scope aes_inputs in
(* Protocol conversion logic *)
end
Performance Optimization
Once there's a working implementation, optimization begins. HCL aims to provide tools to analyze and improve performance:
module Performance = struct
type metrics = {
max_frequency : float;
latency_cycles : int;
throughput_mbps : float;
area_luts : int;
area_ffs : int;
area_brams : int;
}
let analyze design =
(* Static analysis of the design *)
let critical_path = find_critical_path design in
let resource_usage = estimate_resources design in
(* Provide optimization suggestions *)
let suggestions =
if critical_path.delay > target_period then
["Consider pipelining at: " ^ identify_bottleneck critical_path]
else if resource_usage.luts * 10 > available_luts * 8 then (* >80% utilization *)
["High LUT usage. Consider composite S-boxes or resource sharing"]
else
[]
in
(metrics, suggestions)
end
Real optimization often requires algorithmic changes. For example, bit-slicing can dramatically improve throughput for parallel operations:
(* Bit-sliced AES: with 512-bit slices, 32 blocks are processed in parallel *)
module BitslicedAes = struct
(* Instead of 16 8-bit bytes, we have 8 bit-plane slices *)
(* Slice i contains bit i from every byte being processed *)
let bitslice_sbox slices =
(* S-box expressed as Boolean operations *)
let open Signal in
let x0, x1, x2, x3, x4, x5, x6, x7 = slices in
(* Compute multiplicative inverse using tower field arithmetic *)
(* The smallest known circuits (Boyar-Peralta) need about 113 gates *)
(* ... Boolean algebra implementation ... *)
end
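The layout change itself is easy to demonstrate in software. This Python toy (not the AES S-box circuit) bit-slices a handful of bytes so that one bitwise operation on the slices acts on all of them at once — exactly the transpose that bit-sliced AES performs before replacing the S-box with a gate network:

```python
# Toy bitslice demo: lane i collects bit i of every input byte, so a single
# bitwise operation per lane applies to all bytes simultaneously.

def bitslice(values, width=8):
    """Transpose: lane i packs bit i of every value, one bit per position."""
    lanes = [0] * width
    for pos, v in enumerate(values):
        for i in range(width):
            lanes[i] |= ((v >> i) & 1) << pos
    return lanes

def unbitslice(lanes, count, width=8):
    """Inverse transpose: reassemble the original byte values."""
    return [sum(((lanes[i] >> pos) & 1) << i for i in range(width))
            for pos in range(count)]

data = [0x00, 0xFF, 0x3C, 0xA5]
lanes = bitslice(data)
mask = (1 << len(data)) - 1
# Complement every byte in parallel: one XOR per lane instead of one per byte.
inverted = unbitslice([lane ^ mask for lane in lanes], len(data))
```

The AES S-box works the same way in bit-sliced form, except the per-lane operation is the Boyar-Peralta AND/XOR network rather than a simple complement.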
Modes of Operation
While AES provides the foundation, practical applications require modes of operation. HCL's design makes composition natural:
module MakeMode (Cipher : BlockCipher) = struct
module Ecb = struct
(* Simplest mode - direct application *)
let create = Cipher.create
end
module Cbc = struct
(* Chaining provides better security *)
let create scope (i : _ I.t) =
let cipher = Cipher.create scope cipher_inputs in
let iv_reg = Variable.reg reg_spec ~width:Cipher.block_bits in
compile [
if_ i.start [
iv_reg <-- i.iv;
] @@ elif cipher.valid [
iv_reg <-- cipher.ciphertext;
] @@ []
];
{ cipher with
plaintext = i.plaintext ^: iv_reg.value }
end
module Ctr = struct
(* Counter mode turns block cipher into stream cipher *)
let create scope (i : _ I.t) =
let counter = Counter.create ~width:128 in
let cipher = Cipher.create scope {
plaintext = counter.value;
key = i.key;
start = i.valid;
} in
{ valid = cipher.valid;
ciphertext = cipher.ciphertext ^: i.plaintext }
end
module Gcm = struct
(* Authenticated encryption - the gold standard *)
(* Combines CTR mode with GHASH authentication *)
(* Implementation requires careful attention to standards *)
end
end
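The CTR composition is easy to sanity-check in software. This Python sketch uses a keyed toy permutation as a stand-in block cipher (not secure, and not AES — purely to show the structure): the cipher only ever processes counter blocks, so decryption is literally the same function as encryption:

```python
# CTR mode sketch with a stand-in "block cipher". Any pseudorandom
# function of (key, counter) illustrates the composition; SHA-256 here
# is a toy placeholder, NOT a secure or standard construction.
import hashlib

def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    return hashlib.sha256(key + block).digest()[:16]

def ctr_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with a counter-derived keystream."""
    out = bytearray()
    for i in range(0, len(data), 16):
        counter = nonce + (i // 16).to_bytes(8, "big")
        keystream = toy_block_cipher(key, counter)
        out.extend(a ^ b for a, b in zip(data[i:i + 16], keystream))
    return bytes(out)

key, nonce = b"k" * 16, b"n" * 8
msg = b"attack at dawn -- hold position!"
ct = ctr_crypt(key, nonce, msg)
pt = ctr_crypt(key, nonce, ct)  # decryption is the same operation
```

This symmetry is why the hardware Ctr module above only instantiates the cipher in the forward direction, and why keystream blocks can be precomputed ahead of the data.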
Looking Forward: Building a Complete Library
AES is just the beginning. The patterns I've established—parameterizable architectures, explicit tradeoffs, comprehensive testing, and composable designs—extend to all cryptographic primitives.
Consider how these patterns apply to other algorithms:
SHA-256 shares AES's iterative structure but processes data unidirectionally. The round function is simpler, but the padding logic requires careful handling of message boundaries.
ChaCha20 is naturally parallelizable with no table lookups, making it ideal for high-security applications where side-channel resistance is paramount.
Elliptic curve operations introduce new challenges; thankfully, Jane Street already has an implementation from its Hardcaml ZPrize work.
Each primitive added to HCL strengthens the foundation. The goal isn't just to implement algorithms but to create a toolkit that makes secure, efficient hardware cryptography accessible to everyone at Jane Street and beyond.
Building HCL taught me several key lessons about hardware cryptography:
First, there's no substitute for deep understanding. You can't implement what you don't understand, and in cryptography, partial understanding is dangerous. That's why I document not just what the code does, but why it does it.
Second, flexibility and performance are not opposites. By making architectural choices explicit and providing multiple implementations, I want users to be able to choose the right tradeoff for their application.
Third, verification cannot be skipped, even when building an MVP. Every optimization and every line of code must be validated against test vectors, verified for constant-time execution, and tested for edge cases. It's easy to believe some code works, only to watch every test case fail and spend another six hours validating each piece of logic you wrote. But that pain pushed me to write tests for that logic, which in turn made debugging much faster.
Finally, the best abstraction is one that makes the right thing easy and the wrong thing obvious. HCL's type system, module structure, and API design should work together to guide users toward secure implementations.