Softcore CPU Reference

Everything you need to know about building a softcore processor

Setting Up Your FPGA Development Environment

Before diving into CPU design, let me walk you through setting up a proper FPGA development environment on Ubuntu. I'll focus on the Arty A7 board since it's popular and affordable, but the process is similar for other boards.

Installing Vivado on Ubuntu

First, you'll need Xilinx Vivado. Here's how I got it working on my Ubuntu system:

# Download Vivado from Xilinx website (you'll need to create an account)
# I'm using version 2024.2, but check for the latest

# Extract the installer
tar -xvf Xilinx_Unified_2024.2_*.tar.gz
cd Xilinx_Unified_2024.2_*/

# Run the installer
sudo ./xsetup

# During installation:
# - Choose "Vivado" (not Vitis)
# - Select "Vivado ML Standard" 
# - Make sure to include support for Artix-7 devices
# - Install to /tools/Xilinx/Vivado/2024.2 (or your preferred location)

After installation, you need to source the settings file every time you want to use Vivado:

source /tools/Xilinx/Vivado/2024.2/settings64.sh

# I add this to my .bashrc with an alias:
echo "alias vivado-init='source /tools/Xilinx/Vivado/2024.2/settings64.sh'" >> ~/.bashrc

Installing Open Source FPGA Tools

While Vivado is powerful, I also recommend installing the open-source toolchain. It's faster for small designs and great for learning:

# Install prerequisites
sudo apt-get update
sudo apt-get install build-essential clang bison flex \
    libreadline-dev gawk tcl-dev libffi-dev git \
    graphviz xdot pkg-config python3 libboost-system-dev \
    libboost-python-dev libboost-filesystem-dev zlib1g-dev

# Install IceStorm tools (for Lattice FPGAs)
git clone https://github.com/YosysHQ/icestorm.git
cd icestorm
make -j$(nproc)
sudo make install

# Install Yosys
cd ..
git clone https://github.com/YosysHQ/yosys.git
cd yosys
make -j$(nproc)
sudo make install

# Install nextpnr
cd ..
git clone https://github.com/YosysHQ/nextpnr.git
cd nextpnr
cmake -DARCH=ice40 -DCMAKE_INSTALL_PREFIX=/usr/local .
make -j$(nproc)
sudo make install

Connecting Your Arty Board

The Arty board uses an FTDI chip for USB communication. You'll need to set up proper permissions:

# Add yourself to the dialout group
sudo usermod -a -G dialout $USER

# Create udev rules for the FTDI chip
sudo bash -c 'cat > /etc/udev/rules.d/99-ftdi.rules << EOF
# FTDI FT2232 for Arty
SUBSYSTEM=="usb", ATTR{idVendor}=="0403", ATTR{idProduct}=="6010", MODE="0666"
SUBSYSTEM=="tty", ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6010", MODE="0666"
EOF'

# Reload udev rules
sudo udevadm control --reload-rules
sudo udevadm trigger

# Log out and back in for group changes to take effect

Installing OpenOCD for Debugging

OpenOCD is essential for loading programs onto softcore CPUs:

# Install dependencies
sudo apt-get install libusb-1.0-0-dev libftdi1-dev

# Clone and build OpenOCD
git clone https://github.com/openocd-org/openocd.git
cd openocd
./bootstrap
./configure --enable-ftdi
make -j$(nproc)
sudo make install

From Blinky to CPU

Part 1: The Heartbeat

My first design was embarrassingly simple:

module SimpleCPU (
    input clk,
    input reset,
    output [31:0] debug_pc
);
    reg [31:0] PC;
    
    always @(posedge clk) begin
        if (reset) begin
            PC <= 0;
        end else begin
            PC <= PC + 4;
        end
    end
    
    assign debug_pc = PC;
endmodule

No ALU, no decode logic, nothing. Just a counter. But this counter would become the heartbeat of my CPU.

Part 2: Understanding RISC-V

I chose RISC-V because it's clean and unencumbered by decades of legacy. The base instruction set (RV32I) has just 47 instructions. That's it. This minimalism meant I could build a useful CPU in days, not months. The beauty of RISC-V is its regularity. Instruction fields are always in the same place:

// This consistency saved me hours of debugging
wire [6:0] opcode = instruction[6:0];
wire [4:0] rd     = instruction[11:7];
wire [4:0] rs1    = instruction[19:15];
wire [4:0] rs2    = instruction[24:20];
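You can sanity-check the slicing without a simulator; here is a hypothetical Python model of the same field extraction (the `decode` helper and the example encoding are mine, not part of the hardware):

```python
# 0x002081B3 encodes `add x3, x1, x2` in RV32I.
def decode(instr: int) -> dict:
    return {
        "opcode": instr & 0x7F,          # instruction[6:0]
        "rd":     (instr >> 7)  & 0x1F,  # instruction[11:7]
        "funct3": (instr >> 12) & 0x07,  # instruction[14:12]
        "rs1":    (instr >> 15) & 0x1F,  # instruction[19:15]
        "rs2":    (instr >> 20) & 0x1F,  # instruction[24:20]
        "funct7": (instr >> 25) & 0x7F,  # instruction[31:25]
    }

fields = decode(0x002081B3)  # add x3, x1, x2
print(fields["rd"], fields["rs1"], fields["rs2"])  # 3 1 2
```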

Part 3: The State Machine

Every CPU is fundamentally a state machine. I learned this the hard way when my first "all-in-one-cycle" design turned into a combinatorial nightmare. Breaking it down into states made everything click:

localparam FETCH     = 0;
localparam DECODE    = 1;
localparam EXECUTE   = 2;
localparam WRITEBACK = 3;

reg [1:0] state;

always @(posedge clk) begin
    case (state)
        FETCH: begin
            instruction <= memory[PC[31:2]];
            state <= DECODE;
        end
        
        DECODE: begin
            rs1_value <= registers[rs1];
            rs2_value <= registers[rs2];
            state <= EXECUTE;
        end
        
        EXECUTE: begin
            // This is where the magic happens
            alu_result <= rs1_value + rs2_value; // Simplified!
            state <= WRITEBACK;
        end
        
        WRITEBACK: begin
            if (rd != 0) registers[rd] <= alu_result;
            PC <= PC + 4;
            state <= FETCH;
        end
    endcase
end

Four cycles per instruction. Not fast, but correct. And correct is the foundation you build performance on.
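A software model makes the cycle accounting concrete. This hypothetical Python sketch mirrors the four states (add-only, like the simplified EXECUTE above); the `run` helper and its arguments are my invention:

```python
FETCH, DECODE, EXECUTE, WRITEBACK = range(4)
MASK = 0xFFFFFFFF

def run(program, regs, instruction_count):
    """Step the 4-state machine until `instruction_count` instructions retire."""
    pc, state, cycles, retired = 0, FETCH, 0, 0
    instr = rs1_v = rs2_v = alu = 0
    while retired < instruction_count:
        cycles += 1
        if state == FETCH:
            instr = program[pc >> 2]        # Word-addressed, like PC[31:2]
            state = DECODE
        elif state == DECODE:
            rs1_v = regs[(instr >> 15) & 31]
            rs2_v = regs[(instr >> 20) & 31]
            state = EXECUTE
        elif state == EXECUTE:
            alu = (rs1_v + rs2_v) & MASK    # Simplified: add only
            state = WRITEBACK
        else:  # WRITEBACK
            rd = (instr >> 7) & 31
            if rd != 0:                     # x0 stays zero
                regs[rd] = alu
            pc += 4
            state = FETCH
            retired += 1
    return cycles

regs = [0] * 32
regs[1], regs[2] = 5, 3
cycles = run([0x002081B3], regs, 1)  # add x3, x1, x2
print(cycles, regs[3])               # 4 8
```

One instruction, four cycles: the model retires `add x3, x1, x2` and lands 8 in x3, exactly as the hardware state machine would.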

The WebAssembly-on-FPGA Frontier

After getting comfortable with RISC-V, I started exploring WebAssembly softcores. The idea is compelling: instead of compiling to native code, why not run WASM directly on hardware?

Why WASM on FPGA?

WASM has several advantages for FPGA implementation:

  1. No garbage collection in WASM 1.0 - This dramatically simplifies the hardware design

  2. Simple memory model - Just linear memory, no complex MMU required

  3. Stack machine architecture - Different from register machines, potentially more compact

I found several existing projects attempting this:

  • WasMachine: A sequential 6-step WASM implementation

  • wasm-fpga-engine: Executes a subset of WASM instructions

The Compilation Pipeline Challenge

The biggest challenge I faced was understanding the compilation pipeline differences between traditional ISAs and WASM. With RISC-V, the flow is straightforward:

C Code → RISC-V Assembly → Machine Code → FPGA Memory

With WASM, it's more complex:

C/Rust/Go → WASM Bytecode → ??? → FPGA Implementation

That middle step is where things get interesting. Unlike traditional CPUs that execute native instructions, a WASM softcore needs to interpret bytecode or compile it to microcode on the fly.
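Part of what makes that middle step hairy: WASM immediates and indices are LEB128-encoded, so even instruction fetch involves variable-length decode that native ISAs avoid. A hypothetical Python sketch of the unsigned variant (the `decode_uleb128` helper is mine):

```python
def decode_uleb128(data: bytes, offset: int = 0):
    """Decode one unsigned LEB128 integer; returns (value, next_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift   # Low 7 bits carry payload
        if not (byte & 0x80):              # High bit clear ends the number
            return result, offset
        shift += 7

print(decode_uleb128(bytes([0xE5, 0x8E, 0x26])))  # (624485, 3)
```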

My WASM Softcore Design Approach

I decided to start with WASM 1.0 only, avoiding the complexity of newer features. Here's my basic architecture:

module WASMCore (
    input clk,
    input reset,
    // Memory interface
    output [31:0] mem_addr,
    input [31:0] mem_rdata,
    output mem_wen,
    output [31:0] mem_wdata
);
    // WASM uses a stack machine
    reg [31:0] stack [0:255];
    reg [7:0] sp;
    
    // Linear memory is separate from stack
    // (Interfaced through mem_* signals)
    
    // Current instruction
    reg [7:0] opcode;
    
    // Simplified decode for basic opcodes
    always @(*) begin
        case (opcode)
            8'h6a: begin // i32.add
                // Pop two values, push sum
            end
            8'h20: begin // local.get
                // Push local variable to stack
            end
            // ... more opcodes
        endcase
    end
endmodule
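Before committing opcodes to hardware, I found it useful to prototype the stack semantics in software. This is a hypothetical Python interpreter for the two opcodes sketched above, assuming single-byte LEB128 local indices (real WASM allows multi-byte indices):

```python
def run_wasm(code: bytes, locals_: list):
    """Interpret a tiny subset of WASM 1.0 bytecode; returns the stack."""
    stack, ip = [], 0
    while ip < len(code):
        op = code[ip]
        ip += 1
        if op == 0x20:                       # local.get (index < 128 assumed,
            stack.append(locals_[code[ip]])  # so its LEB128 is one byte)
            ip += 1
        elif op == 0x6A:                     # i32.add
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)
        else:
            raise ValueError(f"unhandled opcode {op:#04x}")
    return stack

# local.get 0, local.get 1, i32.add
print(run_wasm(bytes([0x20, 0x00, 0x20, 0x01, 0x6A]), [5, 3]))  # [8]
```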

Bootloader Experiments

Getting code onto the WASM softcore required a custom bootloader. I experimented with several approaches:

  1. JTAG Loading: Using OpenOCD to write directly to FPGA memory

  2. UART Bootloader: Slower but more universal

  3. SPI Flash: For permanent storage

Here's a simple UART bootloader I implemented:

module UARTBootloader (
    input clk,
    input uart_rx,
    output reg [31:0] mem_addr,
    output reg [31:0] mem_data,
    output reg mem_write
);
    // Receive bytes from UART, assemble them into 32-bit little-endian
    // words, and write each completed word to memory.
    // (uart_byte and uart_byte_ready come from a separate UART receiver.)
    reg [1:0] byte_count;

    always @(posedge clk) begin
        mem_write <= 0;                    // Write strobe is a one-cycle pulse
        if (mem_write)
            mem_addr <= mem_addr + 4;      // Advance only after the write lands
        if (uart_byte_ready) begin
            case (byte_count)
                0: mem_data[7:0]   <= uart_byte;
                1: mem_data[15:8]  <= uart_byte;
                2: mem_data[23:16] <= uart_byte;
                3: begin
                    mem_data[31:24] <= uart_byte;
                    mem_write <= 1;        // Word complete: issue the write
                end
            endcase
            byte_count <= byte_count + 1;  // 2-bit counter wraps 3 -> 0
        end
    end
endmodule
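The byte-to-word assembly is easy to verify host-side before flashing anything. A hypothetical Python model mirroring the byte lanes above (byte 0 lands in bits [7:0], little-endian):

```python
def assemble_words(byte_stream: bytes):
    """Pack a UART byte stream into little-endian 32-bit words by address."""
    words, addr = {}, 0
    word, count = 0, 0
    for b in byte_stream:
        word |= b << (8 * count)   # Same lane assignment as the case statement
        count += 1
        if count == 4:
            words[addr] = word
            addr += 4
            word, count = 0, 0
    return words

print(hex(assemble_words(bytes([0xB3, 0x81, 0x20, 0x00]))[0]))  # 0x2081b3
```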

Hello World on WASM

Getting "hello world" running required three pieces:

  1. WASM to Memory Compiler: Converting WASM bytecode to memory initialization

  2. Basic I/O: Memory-mapped UART for output

  3. Minimal WASM Runtime: Supporting just enough opcodes for string output

Performance Optimization Journey

Once I had working designs, optimization became the focus. Here's what made the biggest differences:

For the RISC-V Core:

  • Pipelining: 4x theoretical speedup, 2.5x actual after accounting for hazards

  • Branch prediction: Even 2-bit prediction gave 85% accuracy

  • Forwarding paths: Eliminated most pipeline stalls

For the WASM Core:

  • Stack caching: Top 8 stack entries in registers

  • Opcode fusion: Common sequences executed as single operations

  • Memory prefetching: Predictable access patterns in WASM helped here

Why Roll Your Own CPU?

When you build your own softcore, you're not just learning architecture; you're learning how to think about computation itself. Every decision you make - from instruction encoding to pipeline depth - teaches you something about the trade-offs that real architects face every day. The beauty of FPGAs is they let you experiment without the $100M price tag of a tape-out. That's the kind of iteration cycle that leads to real understanding.

The Simplest MVP

Let's start with the absolute minimum viable CPU. Not because it's useful, but because complexity is the enemy of understanding. You want something you can hold in your head all at once.

module SimpleCPU (
    input clk,
    input reset,
    output [31:0] debug_pc,
    output [31:0] debug_instruction
);
    reg [31:0] PC;
    reg [31:0] instruction;
    reg [31:0] registers [0:31];
    
    // This is your Harvard architecture right here - separate instruction
    // and data paths. We'll fix this later, but for now, simple wins.
    reg [31:0] instruction_memory [0:255];
    
    always @(posedge clk) begin
        if (reset) begin
            PC <= 0;
        end else begin
            instruction <= instruction_memory[PC[31:2]];
            PC <= PC + 4;
        end
    end
endmodule

See what we did there? No ALU, no decode logic, nothing. Just a counter that reads instructions. This is your heartbeat. Everything else is just organs attached to this pulse.

Understanding RISC-V

RISC-V isn't perfect, but it's good enough, and good enough is often better than perfect. The instruction set is clean, the encoding is regular, and most importantly, it's not encumbered by 40 years of backwards compatibility cruft. Here's the thing about instruction sets: they're a contract between software and hardware. Break that contract, and you're on your own. Respect it, and you get to leverage millions of hours of compiler development. The RV32I base instruction set has exactly 47 instructions. That's it. Everything else is optional. This minimalism is a feature, not a bug. It means you can build a useful CPU in a weekend, not a year.

// Instruction decoder - the rosetta stone between software and hardware
wire [6:0] opcode = instruction[6:0];
wire [4:0] rd     = instruction[11:7];
wire [4:0] rs1    = instruction[19:15];
wire [4:0] rs2    = instruction[24:20];
wire [2:0] funct3 = instruction[14:12];
wire [6:0] funct7 = instruction[31:25];

// The magic of RISC-V: these fields are ALWAYS in the same place
// No variable-length decode nightmares, no modal bits changing the
// meaning of other bits. Just simple, boring, beautiful regularity.

State Machines

Every CPU is fundamentally a state machine. Fetch, decode, execute, writeback - it's a dance as old as von Neumann. The question isn't whether you need these states; it's how you choreograph them.

localparam FETCH     = 0;
localparam DECODE    = 1;
localparam EXECUTE   = 2;
localparam WRITEBACK = 3;

reg [1:0] state;

always @(posedge clk) begin
    if (reset) begin
        state <= FETCH;
        PC <= 0;
    end else begin
        case (state)
            FETCH: begin
                instruction <= memory[PC[31:2]];
                state <= DECODE;
            end
            
            DECODE: begin
                // This is where you pay the price for generality
                // Every instruction needs to be categorized
                rs1_value <= registers[rs1];
                rs2_value <= registers[rs2];
                state <= EXECUTE;
            end
            
            EXECUTE: begin
                // The actual work happens here
                case (opcode)
                    7'b0110011: begin // R-type
                        case (funct3)
                            3'b000: alu_result <= (funct7[5]) ? 
                                    rs1_value - rs2_value :  // SUB
                                    rs1_value + rs2_value;   // ADD
                            // ... more operations
                        endcase
                    end
                endcase
                state <= WRITEBACK;
            end
            
            WRITEBACK: begin
                if (rd != 0) begin  // x0 is always zero in RISC-V
                    registers[rd] <= alu_result;
                end
                PC <= next_pc;
                state <= FETCH;
            end
        endcase
    end
end

Four cycles per instruction. It's not fast, but it's correct, and correct is the foundation you build performance on.

The ALU

The ALU is where the rubber meets the road. It's tempting to build a massive combinatorial blob that does everything, but that's a mistake. Start simple, measure, then optimize.

module ALU (
    input [31:0] a,
    input [31:0] b,
    input [3:0] op,
    output reg [31:0] result
);
    // Here's a secret: subtraction is just addition with extra steps
    wire [32:0] sum = {1'b0, a} + {1'b0, op[3] ? ~b : b} + op[3];
    
    always @(*) begin
        case (op[2:0])
            3'b000: result = sum[31:0];                    // ADD/SUB
            3'b001: result = a << b[4:0];                  // SLL
            // For the compares, op[3] must be 1 so that sum = a - b
            3'b010: result = {31'b0, (a[31] ^ b[31]) ?
                             a[31] : sum[31]};            // SLT (signed)
            3'b011: result = {31'b0, ~sum[32]};           // SLTU (no carry = borrow)
            3'b100: result = a ^ b;                        // XOR
            3'b101: result = op[3] ? 
                            ($signed(a) >>> b[4:0]) :      // SRA
                            (a >> b[4:0]);                  // SRL
            3'b110: result = a | b;                        // OR
            3'b111: result = a & b;                        // AND
        endcase
    end
endmodule

Notice how we share the adder for both addition and comparison? That's not being clever - that's recognizing that silicon area costs money, even in an FPGA.
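The adder-sharing trick is easier to convince yourself of in software first. A hypothetical Python model of the 33-bit adder and the two compares it serves (`shared_adder`, `slt`, and `sltu` are my names, not the module's):

```python
MASK = 0xFFFFFFFF

def shared_adder(a: int, b: int, subtract: bool):
    """One 33-bit adder serving ADD, SUB, and both compares."""
    b_eff = (~b & MASK) if subtract else b       # Invert b for subtraction...
    total = a + b_eff + (1 if subtract else 0)   # ...and add the carry-in
    return total & MASK, (total >> 32) & 1       # (result, carry-out)

def sltu(a, b):
    _, carry = shared_adder(a, b, True)
    return 1 - carry          # No carry-out means a borrow: a < b unsigned

def slt(a, b):
    diff, _ = shared_adder(a, b, True)
    sa, sb = a >> 31, b >> 31
    return sa if sa != sb else diff >> 31  # Signs differ: the negative one is less

print(sltu(3, 5), slt(0xFFFFFFFF, 1))  # 1 1  (and -1 < 1 signed)
```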

The Memory

Memory is where most CPUs go to die. You can have the world's best pipeline, but if you're waiting on memory, you're just warming the room. This is why caches exist, but let's not get ahead of ourselves.

module Memory (
    input clk,
    input [31:0] address,
    input [31:0] write_data,
    input [3:0] write_mask,  // Byte-level write enables
    input read_enable,
    output reg [31:0] read_data
);
    // In real life, this would be SRAM or DRAM
    // In an FPGA, it's Block RAM (BRAM)
    reg [31:0] mem [0:1023];
    
    wire [29:0] word_addr = address[31:2];
    
    always @(posedge clk) begin
        if (read_enable) begin
            read_data <= mem[word_addr];
        end
        
        // Byte-level writes are crucial for RISC-V
        // This is why memory systems are complex
        if (write_mask[0]) mem[word_addr][7:0]   <= write_data[7:0];
        if (write_mask[1]) mem[word_addr][15:8]  <= write_data[15:8];
        if (write_mask[2]) mem[word_addr][23:16] <= write_data[23:16];
        if (write_mask[3]) mem[word_addr][31:24] <= write_data[31:24];
    end
endmodule
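Byte-lane masking is worth checking in software too. A hypothetical Python sketch of the masked write (bit 0 of the mask covers bits [7:0], matching the Verilog above):

```python
def masked_write(word: int, write_data: int, mask: int) -> int:
    """Apply a 4-bit byte-lane write mask to a stored 32-bit word."""
    for lane in range(4):
        if mask & (1 << lane):
            shift = 8 * lane
            word = (word & ~(0xFF << shift)) | (write_data & (0xFF << shift))
    return word & 0xFFFFFFFF

# A store-halfword touches only the low two byte lanes:
print(hex(masked_write(0xAABBCCDD, 0x11223344, 0b0011)))  # 0xaabb3344
```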

Optimization

Here's where it gets interesting. That 4-cycle state machine? It's killing your performance. Time to pipeline. But first, let me tell you a secret: premature optimization is the root of all evil, except in CPU design, where leaving performance on the table is a sin. The trick is knowing when you're being premature.

// A simple 3-stage pipeline: Fetch, Execute, Writeback
// Why 3 and not 5? Because memory access and execute can overlap
// in our simple design. Don't add stages you don't need.

always @(posedge clk) begin
    // Stage 1: Fetch
    if_id_instruction <= instruction_memory[PC[31:2]];
    if_id_pc <= PC;
    PC <= PC + 4;  // Assume no branches for now
    
    // Stage 2: Decode/Execute
    id_ex_rd <= if_id_instruction[11:7];
    id_ex_result <= alu_result;  // Combinatorial from decoded instruction
    
    // Stage 3: Writeback
    if (id_ex_rd != 0) begin
        registers[id_ex_rd] <= id_ex_result;
    end
end

But wait! What about data hazards? What if instruction N+1 needs the result from instruction N? Welcome to the fun part of CPU design.

Hazards and Forwarding

Hazards are why CPU design is hard. It's not the arithmetic or the control logic - it's the corner cases when instructions depend on each other.

// Forwarding logic - the duct tape of CPU design
wire forward_from_ex = (id_ex_rd != 0) && 
                       (id_ex_rd == rs1_current);
wire forward_from_wb = (ex_wb_rd != 0) && 
                       (ex_wb_rd == rs1_current) && 
                       !forward_from_ex;

wire [31:0] forwarded_rs1 = forward_from_ex ? id_ex_result :
                            forward_from_wb ? ex_wb_result :
                            registers[rs1_current];

This is inelegant. It's also necessary. Every cycle you stall is performance lost forever.
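The priority is the whole point: the EX-stage result is newer than the WB-stage result, which is newer than the register file. A hypothetical Python model of the rs1 mux (the `select_rs1` helper is mine):

```python
def select_rs1(rs1, id_ex_rd, id_ex_result, ex_wb_rd, ex_wb_result, regfile):
    """Pick the newest value of rs1, mirroring the forwarding priority above."""
    if id_ex_rd != 0 and id_ex_rd == rs1:
        return id_ex_result            # EX stage has the freshest result
    if ex_wb_rd != 0 and ex_wb_rd == rs1:
        return ex_wb_result            # WB stage is next-freshest
    return regfile[rs1]                # Otherwise the register file is current

regs = [0] * 32
print(select_rs1(5, 5, 111, 5, 222, regs))  # 111 (EX wins over WB)
```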

Branches

Branches are where the von Neumann model shows its age. You're fetching instructions sequentially, but programs aren't sequential. They jump around like a hyperactive squirrel.

// Simple branch predictor - always predict not taken
// Loop branches are almost always taken, so this mispredicts
// nearly every iteration - but it's simple
always @(posedge clk) begin
    if (branch_taken && (predicted_pc != branch_target)) begin
        // Flush the pipeline - those fetched instructions are garbage
        if_id_valid <= 1'b0;
        PC <= branch_target;
    end
end

// Better: a branch history table
reg [1:0] branch_history [0:255];  // 2-bit saturating counters
wire [7:0] bht_index = PC[9:2];    // Use PC bits as index

// 00 = strongly not taken, 01 = weakly not taken
// 10 = weakly taken, 11 = strongly taken
wire predict_taken = branch_history[bht_index][1];

Two-bit prediction gets you to about 85% accuracy. Want better? Add more history. But remember: every bit of state is area, and area is money.
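You can watch a single 2-bit counter earn its keep with a software model. A hypothetical Python sketch, run against one short loop branch; note the warm-up and loop-exit mispredicts that keep accuracy below 100%:

```python
def update(counter: int, taken: bool) -> int:
    """2-bit saturating counter: count up on taken, down on not taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def predict(counter: int) -> bool:
    return counter >= 2   # High bit set = predict taken

# A 10-iteration loop: branch taken 9 times, then falls through once.
counter, correct = 0, 0
for taken in [True] * 9 + [False]:
    correct += predict(counter) == taken
    counter = update(counter, taken)
print(correct)  # 7 of 10: two warm-up misses plus the final exit
```

Longer loops amortize those three misses away, which is how the average climbs toward the 85% range.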

Tools and Testing

Building a CPU without proper testing is like flying blind. You need to know it works before you synthesize it.

# Start with icarus verilog - it's free and good enough
iverilog -o cpu_tb cpu.v cpu_tb.v
vvp cpu_tb

# But for performance, you want Verilator
verilator --cc cpu.v --exe cpu_main.cpp
make -C obj_dir -f Vcpu.mk
./obj_dir/Vcpu

Write test programs. Start simple:

# Test 1: Can you add?
addi x1, x0, 5
addi x2, x0, 3
add  x3, x1, x2  # x3 should be 8

# Test 2: Can you branch?
loop:
    addi x1, x1, -1
    bnez x1, loop
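To load these tests without a full assembler, you can hand-encode them. A hypothetical Python sketch of the two formats used above (both happen to have funct3 = 000 and funct7 = 0, so those fields drop out):

```python
def encode_addi(rd, rs1, imm):
    """I-type: imm[11:0] | rs1 | funct3=000 | rd | opcode 0010011."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0b0010011

def encode_add(rd, rs1, rs2):
    """R-type: funct7=0000000 | rs2 | rs1 | funct3=000 | rd | opcode 0110011."""
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | 0b0110011

print(hex(encode_addi(1, 0, 5)))  # 0x500093  (addi x1, x0, 5)
print(hex(encode_add(3, 1, 2)))   # 0x2081b3  (add x3, x1, x2)
```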

Memory Mapped I/O

A CPU that can't communicate is just a space heater. Memory-mapped I/O is the simplest way to connect to peripherals.

// Address space layout - this is architecture
// 0x00000000 - 0x0000FFFF : RAM (64KB)
// 0x10000000 - 0x1000000F : UART
// 0x10000010 - 0x1000001F : GPIO

always @(*) begin
    if (address[28]) begin  // I/O space
        case (address[7:4])
            4'h0: read_data = uart_data;
            4'h1: read_data = {27'b0, gpio_in};
            default: read_data = 32'h0;
        endcase
    end else begin
        read_data = ram_data;
    end
end
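The same decode is easy to check in software. A hypothetical Python model of the address map above (bit 28 selects I/O space, then address[7:4] picks the peripheral):

```python
def decode_addr(address: int) -> str:
    """Route an address per the map above."""
    if address & (1 << 28):                # I/O space
        sel = (address >> 4) & 0xF
        return {0x0: "UART", 0x1: "GPIO"}.get(sel, "UNMAPPED")
    return "RAM"

print(decode_addr(0x10000000), decode_addr(0x10000010), decode_addr(0x1234))
# UART GPIO RAM
```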

From Simulation to Silicon

Simulation is one thing. Running on real hardware is another. The FPGA tools will humble you.

// What works in simulation might not synthesize
always @(posedge clk) begin
    case (state)
        FETCH: begin
            // This creates a combinatorial loop in synthesis
            // if next_state depends on current_state
            state <= next_state;  
        end
    endcase
end

// Better: separate combinatorial and sequential logic
always @(*) begin
    case (state)
        FETCH:  next_state = DECODE;
        DECODE: next_state = EXECUTE;
        // ...
        default: next_state = FETCH;  // A default arm avoids inferred latches
    endcase
end

always @(posedge clk) begin
    state <= next_state;
end

Performance Analysis

You can't optimize what you can't measure. Add performance counters:

reg [31:0] cycle_count;
reg [31:0] instruction_count;
reg [31:0] branch_mispredict_count;

always @(posedge clk) begin
    cycle_count <= cycle_count + 1;
    if (instruction_retired) instruction_count <= instruction_count + 1;
    if (branch_mispredicted) branch_mispredict_count <= branch_mispredict_count + 1;
end

// IPC = instruction_count / cycle_count
// Branch prediction accuracy = 1 - (branch_mispredict_count / branch_count)
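The two formulas in the comments, as a hypothetical Python helper (numbers below are illustrative, using the 4-cycle machine and the 85% figure from earlier):

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle - the headline throughput number."""
    return instructions / cycles

def branch_accuracy(mispredicts: int, branches: int) -> float:
    return 1 - mispredicts / branches

# The 4-cycle state machine retires one instruction every 4 cycles:
print(ipc(1000, 4000))             # 0.25
print(branch_accuracy(150, 1000))  # 0.85
```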

Advanced Topics

Once you have a working CPU, you can start adding complexity. But remember: every feature has a cost.

Caches

Caches are just fast memory with an attitude problem. They think they know better than you what data you'll need next. Sometimes they're right.

Out-of-Order Execution

This is where CPUs get really complex. You're essentially building a dependency graph of instructions and executing them as soon as their inputs are ready. It's beautiful when it works and a nightmare to debug.

Multiple Issue

Why execute one instruction per cycle when you can do two? Or four? This is where you need to understand your workload. Not all code has enough parallelism to feed a wide machine.

Toolchain

A CPU without a compiler is like a car without roads. You need to understand how software will use your hardware.

The GNU toolchain is your friend here. Building GCC for a new architecture is... non-trivial. But RISC-V already has great compiler support. Use it.

# Building code for your CPU
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -c program.c
riscv64-unknown-elf-ld -T link.ld program.o -o program.elf
riscv64-unknown-elf-objcopy -O binary program.elf program.bin
