Softcore CPU Reference
Everything you need to know about building a softcore processor
Setting Up Your FPGA Development Environment
Before diving into CPU design, let me walk you through setting up a proper FPGA development environment on Ubuntu. I'll focus on the Arty A7 board since it's popular and affordable, but the process is similar for other boards.
Installing Vivado on Ubuntu
First, you'll need Xilinx Vivado. Here's how I got it working on my Ubuntu system:
# Download Vivado from Xilinx website (you'll need to create an account)
# I'm using version 2024.2, but check for the latest
# Extract the installer
tar -xvf Xilinx_Unified_2024.2_*.tar.gz
cd Xilinx_Unified_2024.2_*/
# Run the installer
sudo ./xsetup
# During installation:
# - Choose "Vivado" (not Vitis)
# - Select "Vivado ML Standard"
# - Make sure to include support for Artix-7 devices
# - Install to /tools/Xilinx/Vivado/2024.2 (or your preferred location)
After installation, you need to source the settings file every time you want to use Vivado:
source /tools/Xilinx/Vivado/2024.2/settings64.sh
# I add this to my .bashrc with an alias:
echo "alias vivado-init='source /tools/Xilinx/Vivado/2024.2/settings64.sh'" >> ~/.bashrc
Installing Open Source FPGA Tools
While Vivado is powerful, I also recommend installing the open-source toolchain. It's faster for small designs and great for learning:
# Install prerequisites
sudo apt-get update
sudo apt-get install build-essential clang bison flex \
libreadline-dev gawk tcl-dev libffi-dev git \
graphviz xdot pkg-config python3 libboost-system-dev \
libboost-python-dev libboost-filesystem-dev zlib1g-dev
# Install IceStorm tools (for Lattice FPGAs)
git clone https://github.com/YosysHQ/icestorm.git
cd icestorm
make -j$(nproc)
sudo make install
cd ..
# Install Yosys
git clone https://github.com/YosysHQ/yosys.git
cd yosys
make -j$(nproc)
sudo make install
cd ..
# Install nextpnr (needs cmake and Eigen3: sudo apt-get install cmake libeigen3-dev)
git clone https://github.com/YosysHQ/nextpnr.git
cd nextpnr
cmake -DARCH=ice40 -DCMAKE_INSTALL_PREFIX=/usr/local .
make -j$(nproc)
sudo make install
Connecting Your Arty Board
The Arty board uses an FTDI chip for USB communication. You'll need to set up proper permissions:
# Add yourself to the dialout group
sudo usermod -a -G dialout $USER
# Create udev rules for the FTDI chip
sudo bash -c 'cat > /etc/udev/rules.d/99-ftdi.rules << EOF
# FTDI FT2232 for Arty
SUBSYSTEM=="usb", ATTR{idVendor}=="0403", ATTR{idProduct}=="6010", MODE="0666"
SUBSYSTEM=="tty", ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6010", MODE="0666"
EOF'
# Reload udev rules
sudo udevadm control --reload-rules
sudo udevadm trigger
# Log out and back in for group changes to take effect
Installing OpenOCD for Debugging
OpenOCD is essential for loading programs onto softcore CPUs:
# Install dependencies
sudo apt-get install libusb-1.0-0-dev libftdi1-dev
# Clone and build OpenOCD
git clone https://github.com/openocd-org/openocd.git
cd openocd
./bootstrap
./configure --enable-ftdi
make -j$(nproc)
sudo make install
From Blinky to CPU
Part 1: The Heartbeat
My first design was embarrassingly simple:
module SimpleCPU (
input clk,
input reset,
output [31:0] debug_pc
);
reg [31:0] PC;
always @(posedge clk) begin
if (reset) begin
PC <= 0;
end else begin
PC <= PC + 4;
end
end
assign debug_pc = PC;
endmodule
No ALU, no decode logic, nothing. Just a counter. But this counter would become the heartbeat of my CPU.
Part 2: Understanding RISC-V
I chose RISC-V because it's clean and unencumbered by decades of legacy. The base instruction set (RV32I) has just 47 instructions. That's it. This minimalism meant I could build a useful CPU in days, not months. The beauty of RISC-V is its regularity. Instruction fields are always in the same place:
// This consistency saved me hours of debugging
wire [6:0] opcode = instruction[6:0];
wire [4:0] rd = instruction[11:7];
wire [4:0] rs1 = instruction[19:15];
wire [4:0] rs2 = instruction[24:20];
Part 3: The State Machine
Every CPU is fundamentally a state machine. I learned this the hard way when my first "all-in-one-cycle" design turned into a combinatorial nightmare. Breaking it down into states made everything click:
localparam FETCH = 0;
localparam DECODE = 1;
localparam EXECUTE = 2;
localparam WRITEBACK = 3;
always @(posedge clk) begin
case (state)
FETCH: begin
instruction <= memory[PC[31:2]];
state <= DECODE;
end
DECODE: begin
rs1_value <= registers[rs1];
rs2_value <= registers[rs2];
state <= EXECUTE;
end
EXECUTE: begin
// This is where the magic happens
alu_result <= rs1_value + rs2_value; // Simplified!
state <= WRITEBACK;
end
WRITEBACK: begin
if (rd != 0) registers[rd] <= alu_result;
PC <= PC + 4;
state <= FETCH;
end
endcase
end
Four cycles per instruction. Not fast, but correct. And correct is the foundation you build performance on.
WebAssembly on FPGA Frontier
After getting comfortable with RISC-V, I started exploring WebAssembly softcores. The idea is compelling: instead of compiling to native code, why not run WASM directly on hardware?
Why WASM on FPGA?
WASM has several advantages for FPGA implementation:
No garbage collection in WASM 1.0 - This dramatically simplifies the hardware design
Simple memory model - Just linear memory, no complex MMU required
Stack machine architecture - Different from register machines, potentially more compact
I found several existing projects attempting this:
WasMachine: A sequential 6-step WASM implementation
wasm-fpga-engine: Executes a subset of WASM instructions
The Compilation Pipeline Challenge
The biggest challenge I faced was understanding the compilation pipeline differences between traditional ISAs and WASM. With RISC-V, the flow is straightforward:
C Code → RISC-V Assembly → Machine Code → FPGA Memory
With WASM, it's more complex:
C/Rust/Go → WASM Bytecode → ??? → FPGA Implementation
That middle step is where things get interesting. Unlike traditional CPUs that execute native instructions, a WASM softcore needs to interpret bytecode or compile it to microcode on the fly.
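Part of what makes that middle step hard is the encoding itself: unlike RISC-V's fixed 32-bit words, WASM immediates are variable-length LEB128, so even fetching an operand takes a small state machine. Here's a hedged sketch of unsigned LEB128 accumulation, one byte per cycle (signal names are illustrative, and `imm` is assumed to be cleared before the first byte arrives):

```verilog
// Accumulate an unsigned LEB128 value, one byte per cycle.
// Each byte contributes 7 payload bits; bit 7 set means more bytes follow.
reg [31:0] imm;
reg [2:0]  group;   // which 7-bit group we're on (a u32 needs at most 5)

always @(posedge clk) begin
    if (byte_valid) begin
        imm <= imm | ({25'b0, in_byte[6:0]} << (group * 7));
        if (in_byte[7])
            group <= group + 1;   // continuation bit set: keep accumulating
        else
            group <= 0;           // final byte: imm now holds the value
    end
end
```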
My WASM Softcore Design Approach
I decided to start with WASM 1.0 only, avoiding the complexity of newer features. Here's my basic architecture:
module WASMCore (
input clk,
input reset,
// Memory interface
output [31:0] mem_addr,
input [31:0] mem_rdata,
output mem_wen,
output [31:0] mem_wdata
);
// WASM uses a stack machine
reg [31:0] stack [0:255];
reg [7:0] sp;
// Linear memory is separate from stack
// (Interfaced through mem_* signals)
// Current instruction
reg [7:0] opcode;
// Simplified decode for basic opcodes
always @(*) begin
case (opcode)
8'h6a: begin // i32.add
// Pop two values, push sum
end
8'h20: begin // local.get
// Push local variable to stack
end
// ... more opcodes
endcase
end
endmodule
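To make the stack discipline concrete, here's a sketch of how an engine like this might actually execute i32.add in a clocked process. The `state`/`EXECUTE` signals are assumptions, not part of the module above:

```verilog
// One way to execute i32.add: the two pops and one push collapse into a
// single write plus a stack-pointer decrement. Assumes sp points at the
// next free slot, so the operands live at sp-1 and sp-2.
always @(posedge clk) begin
    if (state == EXECUTE && opcode == 8'h6a) begin // i32.add
        stack[sp - 2] <= stack[sp - 1] + stack[sp - 2];
        sp <= sp - 1; // net effect of pop, pop, push
    end
end
```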
Bootloader Experiments
Getting code onto the WASM softcore required a custom bootloader. I experimented with several approaches:
JTAG Loading: Using OpenOCD to write directly to FPGA memory
UART Bootloader: Slower but more universal
SPI Flash: For permanent storage
Here's a simple UART bootloader I implemented:
module UARTBootloader (
    input clk,
    input reset,
    // Byte stream from a UART receiver (not shown) that deserializes uart_rx
    input uart_byte_ready,
    input [7:0] uart_byte,
    output reg [31:0] mem_addr,
    output reg [31:0] mem_data,
    output reg mem_write
);
    reg [1:0] byte_count;

    // Assemble four UART bytes (LSB first) into a 32-bit word,
    // pulse mem_write for one cycle, then advance the address so the
    // write lands before mem_addr changes
    always @(posedge clk) begin
        if (reset) begin
            byte_count <= 0;
            mem_addr   <= 0;
            mem_write  <= 0;
        end else begin
            if (mem_write) begin
                mem_write <= 0;            // one-cycle write strobe
                mem_addr  <= mem_addr + 4; // advance after the write completes
            end
            if (uart_byte_ready) begin
                case (byte_count)
                    2'd0: mem_data[7:0]   <= uart_byte;
                    2'd1: mem_data[15:8]  <= uart_byte;
                    2'd2: mem_data[23:16] <= uart_byte;
                    2'd3: begin
                        mem_data[31:24] <= uart_byte;
                        mem_write <= 1;
                    end
                endcase
                byte_count <= byte_count + 1;
            end
        end
    end
endmodule
Hello World on WASM
Getting "hello world" out of the WASM core took three pieces working together:
WASM to Memory Compiler: Converting WASM bytecode to memory initialization
Basic I/O: Memory-mapped UART for output
Minimal WASM Runtime: Supporting just enough opcodes for string output
Performance Optimization Journey
Once I had working designs, optimization became the focus. Here's what made the biggest differences:
For the RISC-V Core:
Pipelining: 4x theoretical speedup, 2.5x actual after accounting for hazards
Branch prediction: Even 2-bit prediction gave 85% accuracy
Forwarding paths: Eliminated most pipeline stalls
For the WASM Core:
Stack caching: Top 8 stack entries in registers
Opcode fusion: Common sequences executed as single operations
Memory prefetching: Predictable access patterns in WASM helped here
Why Roll Your Own CPU?
When you build your own softcore, you're not just learning architecture; you're learning how to think about computation itself. Every decision you make - from instruction encoding to pipeline depth - teaches you something about the trade-offs that real architects face every day. The beauty of FPGAs is they let you experiment without the $100M price tag of a tape-out. That's the kind of iteration cycle that leads to real understanding.
The Simplest MVP
Let's start with the absolute minimum viable CPU. Not because it's useful, but because complexity is the enemy of understanding. You want something you can hold in your head all at once.
module SimpleCPU (
input clk,
input reset,
output [31:0] debug_pc,
output [31:0] debug_instruction
);
reg [31:0] PC;
reg [31:0] instruction;
reg [31:0] registers [0:31];
// This is your Harvard architecture right here - separate instruction
// and data paths. We'll fix this later, but for now, simple wins.
reg [31:0] instruction_memory [0:255];
always @(posedge clk) begin
if (reset) begin
PC <= 0;
end else begin
instruction <= instruction_memory[PC[31:2]];
PC <= PC + 4;
end
end
endmodule
See what we did there? No ALU, no decode logic, nothing. Just a counter that reads instructions. This is your heartbeat. Everything else is just organs attached to this pulse.
Understanding RISC-V
RISC-V isn't perfect, but it's good enough, and good enough is often better than perfect. The instruction set is clean, the encoding is regular, and most importantly, it's not encumbered by 40 years of backwards compatibility cruft. Here's the thing about instruction sets: they're a contract between software and hardware. Break that contract, and you're on your own. Respect it, and you get to leverage millions of hours of compiler development. The RV32I base instruction set has exactly 47 instructions. That's it. Everything else is optional. This minimalism is a feature, not a bug. It means you can build a useful CPU in a weekend, not a year.
// Instruction decoder - the rosetta stone between software and hardware
wire [6:0] opcode = instruction[6:0];
wire [4:0] rd = instruction[11:7];
wire [4:0] rs1 = instruction[19:15];
wire [4:0] rs2 = instruction[24:20];
wire [2:0] funct3 = instruction[14:12];
wire [6:0] funct7 = instruction[31:25];
// The magic of RISC-V: these fields are ALWAYS in the same place
// No variable-length decode nightmares, no modal bits changing the
// meaning of other bits. Just simple, boring, beautiful regularity.
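The immediates are the one place where bits do move around between formats, but even there the sign bit is always instruction[31], so extraction stays mechanical. A sketch for the three formats a minimal core needs first:

```verilog
// I-type (ADDI, loads), S-type (stores), B-type (branches).
// The sign bit is always instruction[31]; only the low bits shuffle.
wire [31:0] imm_i = {{20{instruction[31]}}, instruction[31:20]};
wire [31:0] imm_s = {{20{instruction[31]}}, instruction[31:25],
                     instruction[11:7]};
wire [31:0] imm_b = {{19{instruction[31]}}, instruction[31], instruction[7],
                     instruction[30:25], instruction[11:8], 1'b0};
```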
State Machines
Every CPU is fundamentally a state machine. Fetch, decode, execute, writeback - it's a dance as old as von Neumann. The question isn't whether you need these states; it's how you choreograph them.
localparam FETCH = 0;
localparam DECODE = 1;
localparam EXECUTE = 2;
localparam WRITEBACK = 3;
reg [1:0] state;
always @(posedge clk) begin
if (reset) begin
state <= FETCH;
PC <= 0;
end else begin
case (state)
FETCH: begin
instruction <= memory[PC[31:2]];
state <= DECODE;
end
DECODE: begin
// This is where you pay the price for generality
// Every instruction needs to be categorized
rs1_value <= registers[rs1];
rs2_value <= registers[rs2];
state <= EXECUTE;
end
EXECUTE: begin
// The actual work happens here
case (opcode)
7'b0110011: begin // R-type
case (funct3)
3'b000: alu_result <= (funct7[5]) ?
rs1_value - rs2_value : // SUB
rs1_value + rs2_value; // ADD
// ... more operations
endcase
end
endcase
state <= WRITEBACK;
end
WRITEBACK: begin
if (rd != 0) begin // x0 is always zero in RISC-V
registers[rd] <= alu_result;
end
PC <= next_pc;
state <= FETCH;
end
endcase
end
end
Four cycles per instruction. It's not fast, but it's correct, and correct is the foundation you build performance on.
The ALU
The ALU is where the rubber meets the road. It's tempting to build a massive combinatorial blob that does everything, but that's a mistake. Start simple, measure, then optimize.
module ALU (
input [31:0] a,
input [31:0] b,
input [3:0] op,
output reg [31:0] result
);
// Here's a secret: subtraction is just addition with extra steps
wire [32:0] sum = {1'b0, a} + {1'b0, op[3] ? ~b : b} + op[3];
always @(*) begin
case (op[2:0])
3'b000: result = sum[31:0]; // ADD/SUB
3'b001: result = a << b[4:0]; // SLL
// For the compares, op[3] must be 1 so that sum = a - b
3'b010: result = {31'b0, (a[31] != b[31]) ? a[31] : sum[31]}; // SLT (signed)
3'b011: result = {31'b0, ~sum[32]};  // SLTU: no carry out means a < b
3'b100: result = a ^ b; // XOR
3'b101: result = op[3] ?
($signed(a) >>> b[4:0]) : // SRA
(a >> b[4:0]); // SRL
3'b110: result = a | b; // OR
3'b111: result = a & b; // AND
endcase
end
endmodule
Notice how we share the adder for both addition and comparison? That's not being clever - that's recognizing that silicon area costs money, even in an FPGA.
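One way to form `op` from the instruction fields is to reuse funct3 directly and force the subtract bit on for the compares, so the shared adder runs in subtract mode when they need a - b. This encoding is an assumption of the sketch (and covers R-type instructions; immediates need their own handling):

```verilog
// op[2:0] = funct3; op[3] = 1 for SUB/SRA (funct7[5]) and for the
// compares, which need the adder computing a - b.
wire is_compare   = (funct3 == 3'b010) || (funct3 == 3'b011); // SLT/SLTU
wire [3:0] alu_op = {funct7[5] | is_compare, funct3};
```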
The Memory
Memory is where most CPUs go to die. You can have the world's best pipeline, but if you're waiting on memory, you're just warming the room. This is why caches exist, but let's not get ahead of ourselves.
module Memory (
input clk,
input [31:0] address,
input [31:0] write_data,
input [3:0] write_mask, // Byte-level write enables
input read_enable,
output reg [31:0] read_data
);
// In real life, this would be SRAM or DRAM
// In an FPGA, it's Block RAM (BRAM)
reg [31:0] mem [0:1023];
wire [29:0] word_addr = address[31:2];
always @(posedge clk) begin
if (read_enable) begin
read_data <= mem[word_addr];
end
// Byte-level writes are crucial for RISC-V
// This is why memory systems are complex
if (write_mask[0]) mem[word_addr][7:0] <= write_data[7:0];
if (write_mask[1]) mem[word_addr][15:8] <= write_data[15:8];
if (write_mask[2]) mem[word_addr][23:16] <= write_data[23:16];
if (write_mask[3]) mem[word_addr][31:24] <= write_data[31:24];
end
endmodule
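A BRAM like this is also the natural place to preload a program: most FPGA toolchains will fold a `$readmemh` in an initial block into the BRAM's initialization contents in the bitstream. Inside the Memory module (the file name is illustrative):

```verilog
// Preload the BRAM at simulation/synthesis time.
// program.hex holds one 32-bit hex word per line.
initial begin
    $readmemh("program.hex", mem);
end
```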
Optimization
Here's where it gets interesting. That 4-cycle state machine? It's killing your performance. Time to pipeline. But first, let me tell you a secret: premature optimization is the root of all evil, except in CPU design, where leaving performance on the table is a sin. The trick is knowing when you're being premature.
// A simple 3-stage pipeline: Fetch, Execute, Writeback
// Why 3 and not 5? Because memory access and execute can overlap
// in our simple design. Don't add stages you don't need.
always @(posedge clk) begin
// Stage 1: Fetch
if_id_instruction <= instruction_memory[PC[31:2]];
if_id_pc <= PC;
PC <= PC + 4; // Assume no branches for now
// Stage 2: Decode/Execute
id_ex_rd <= if_id_instruction[11:7];
id_ex_result <= alu_result; // Combinatorial from decoded instruction
// Stage 3: Writeback
if (id_ex_rd != 0) begin
registers[id_ex_rd] <= id_ex_result;
end
end
But wait! What about data hazards? What if instruction N+1 needs the result from instruction N? Welcome to the fun part of CPU design.
Hazards and Forwarding
Hazards are why CPU design is hard. It's not the arithmetic or the control logic - it's the corner cases when instructions depend on each other.
// Forwarding logic - the duct tape of CPU design
wire forward_from_ex = (id_ex_rd != 0) &&
(id_ex_rd == rs1_current);
wire forward_from_wb = (ex_wb_rd != 0) &&
(ex_wb_rd == rs1_current) &&
!forward_from_ex;
wire [31:0] forwarded_rs1 = forward_from_ex ? id_ex_result :
forward_from_wb ? ex_wb_result :
registers[rs1_current];
This is inelegant. It's also necessary. Every cycle you stall is performance lost forever.
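Forwarding has one hole: a load's data isn't ready until the memory access completes, so an instruction that consumes the loaded register immediately after still needs a one-cycle bubble. A sketch with illustrative signal names (in a real design the PC hold folds into the existing fetch logic):

```verilog
// Detect a load followed directly by a consumer of its destination
wire load_use_hazard = id_ex_is_load && (id_ex_rd != 0) &&
                       ((id_ex_rd == rs1_current) ||
                        (id_ex_rd == rs2_current));

always @(posedge clk) begin
    if (load_use_hazard) begin
        if_id_valid <= 1'b0; // turn the fetched instruction into a bubble
        PC <= PC;            // and hold fetch for one cycle
    end
end
```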
Branches
Branches are where the von Neumann model shows its age. You're fetching instructions sequentially, but programs aren't sequential. They jump around like a hyperactive squirrel.
// Simple branch predictor - always predict not taken
// This mispredicts nearly every iteration of a loop's back-branch, but it's simple
always @(posedge clk) begin
if (branch_taken && (predicted_pc != branch_target)) begin
// Flush the pipeline - those fetched instructions are garbage
if_id_valid <= 1'b0;
PC <= branch_target;
end
end
// Better: a branch history table
reg [1:0] branch_history [0:255]; // 2-bit saturating counters
wire [7:0] bht_index = PC[9:2]; // Use PC bits as index
// 00 = strongly not taken, 01 = weakly not taken
// 10 = weakly taken, 11 = strongly taken
wire predict_taken = branch_history[bht_index][1];
Two-bit prediction gets you to about 85% accuracy. Want better? Add more history. But remember: every bit of state is area, and area is money.
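The counters only earn that accuracy if they're updated when branches resolve; "saturating" just means clamping at the ends instead of wrapping. A sketch of the update side (the resolution-side signal names are illustrative):

```verilog
// Update the 2-bit saturating counter for a resolved branch:
// count up on taken, down on not-taken, clamped at 00 and 11.
always @(posedge clk) begin
    if (branch_resolved) begin
        if (branch_was_taken && branch_history[resolved_index] != 2'b11)
            branch_history[resolved_index] <= branch_history[resolved_index] + 1;
        else if (!branch_was_taken && branch_history[resolved_index] != 2'b00)
            branch_history[resolved_index] <= branch_history[resolved_index] - 1;
    end
end
```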
Tools and Testing
Building a CPU without proper testing is like flying blind. You need to know it works before you synthesize it.
# Start with Icarus Verilog - it's free and good enough
iverilog -o cpu_tb cpu.v cpu_tb.v
vvp cpu_tb
# But for performance, you want Verilator
verilator --cc cpu.v --exe cpu_main.cpp
make -C obj_dir -f Vcpu.mk
./obj_dir/Vcpu
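A minimal self-checking testbench for the SimpleCPU counter from earlier looks like this (timing values are illustrative; the #1 delays sample after the nonblocking updates settle):

```verilog
`timescale 1ns/1ps
module cpu_tb;
    reg clk = 0, reset = 1;
    wire [31:0] debug_pc;
    reg [31:0] prev_pc;

    SimpleCPU dut (.clk(clk), .reset(reset), .debug_pc(debug_pc));

    always #5 clk = ~clk; // 100 MHz clock

    initial begin
        repeat (4) @(posedge clk); // hold reset a few cycles
        reset = 0;
        @(posedge clk); #1 prev_pc = debug_pc;
        @(posedge clk); #1
        if (debug_pc !== prev_pc + 4)
            $display("FAIL: PC stepped %h -> %h", prev_pc, debug_pc);
        else
            $display("PASS: PC increments by 4");
        $finish;
    end
endmodule
```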
Write test programs. Start simple:
# Test 1: Can you add?
addi x1, x0, 5
addi x2, x0, 3
add x3, x1, x2 # x3 should be 8
# Test 2: Can you branch?
loop:
addi x1, x1, -1
bnez x1, loop
Memory Mapped I/O
A CPU that can't communicate is just a space heater. Memory-mapped I/O is the simplest way to connect to peripherals.
// Address space layout - this is architecture
// 0x00000000 - 0x0000FFFF : RAM (64KB)
// 0x10000000 - 0x1000000F : UART
// 0x10000010 - 0x1000001F : GPIO
always @(*) begin
if (address[28]) begin // I/O space
case (address[7:4])
4'h0: read_data = uart_data;
4'h1: read_data = {27'b0, gpio_in};
default: read_data = 32'h0;
endcase
end else begin
read_data = ram_data;
end
end
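Reads are only half the story; stores have to be steered the same way. Here's a sketch of the write side, assuming the same address split and an eight-bit UART transmitter (the `uart_write`/`uart_tx_byte`/`mem_wen` names are illustrative):

```verilog
// Route stores: a write to 0x10000000 goes to the UART transmitter,
// everything below the I/O region goes to RAM.
always @(posedge clk) begin
    uart_write <= 1'b0; // default: one-cycle strobe
    if (mem_wen && address[28] && address[7:4] == 4'h0) begin
        uart_tx_byte <= write_data[7:0];
        uart_write   <= 1'b1;
    end
end

// RAM only sees writes that miss the I/O decode
wire ram_wen = mem_wen && !address[28];
```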
From Simulation to Silicon
Simulation is one thing. Running on real hardware is another. The FPGA tools will humble you.
// What works in simulation might not synthesize cleanly
always @(posedge clk) begin
    case (state)
        FETCH: begin
            // Computing next_state inline here tangles next-state logic
            // into the clocked block; it simulates fine, but it scales
            // badly and makes the synthesized logic hard to reason about
            state <= next_state;
        end
    endcase
end
// Better: separate combinatorial and sequential logic
always @(*) begin
case (state)
FETCH: next_state = DECODE;
DECODE: next_state = EXECUTE;
// ...
endcase
end
always @(posedge clk) begin
state <= next_state;
end
Performance Analysis
You can't optimize what you can't measure. Add performance counters:
reg [31:0] cycle_count;
reg [31:0] instruction_count;
reg [31:0] branch_mispredict_count;
always @(posedge clk) begin
    if (reset) begin
        cycle_count <= 0;
        instruction_count <= 0;
        branch_mispredict_count <= 0;
    end else begin
        cycle_count <= cycle_count + 1;
        if (instruction_retired) instruction_count <= instruction_count + 1;
        if (branch_mispredicted) branch_mispredict_count <= branch_mispredict_count + 1;
    end
end
// IPC = instruction_count / cycle_count
// Branch prediction accuracy = 1 - (branch_mispredict_count / branch_count)
Advanced Topics
Once you have a working CPU, you can start adding complexity. But remember: every feature has a cost.
Caches
Caches are just fast memory with an attitude problem. They think they know better than you what data you'll need next. Sometimes they're right.
Out-of-Order Execution
This is where CPUs get really complex. You're essentially building a dependency graph of instructions and executing them as soon as their inputs are ready. It's beautiful when it works and a nightmare to debug.
Multiple Issue
Why execute one instruction per cycle when you can do two? Or four? This is where you need to understand your workload. Not all code has enough parallelism to feed a wide machine.
Toolchain
A CPU without a compiler is like a car without roads. You need to understand how software will use your hardware.
The GNU toolchain is your friend here. Building GCC for a new architecture is... non-trivial. But RISC-V already has great compiler support. Use it.
# Building code for your CPU
# (works if your riscv64-unknown-elf toolchain was built with rv32 multilib
#  support; otherwise use a riscv32-unknown-elf build)
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -c program.c
riscv64-unknown-elf-ld -T link.ld program.o -o program.elf
riscv64-unknown-elf-objcopy -O binary program.elf program.bin
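The link.ld referenced above has to match your CPU's memory map. A minimal sketch for a core whose RAM starts at address 0 — the size and section list are assumptions, adjust them to your design:

```
MEMORY {
    RAM (rwx) : ORIGIN = 0x00000000, LENGTH = 64K
}
SECTIONS {
    .text : { *(.text*) } > RAM
    .data : { *(.data*) } > RAM
    .bss  : { *(.bss*)  } > RAM
}
```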