Softcore CPU Reference
Everything you need to know about building a softcore processor
Setting Up Your FPGA Development Environment
Before diving into CPU design, let me walk you through setting up a proper FPGA development environment on Ubuntu. I'll focus on the Arty A7 board since it's popular and affordable, but the process is similar for other boards.
Installing Vivado on Ubuntu
First, you'll need Xilinx Vivado. Here's how I got it working on my Ubuntu system:
```bash
# Download Vivado from Xilinx website (you'll need to create an account)
# I'm using version 2024.2, but check for the latest

# Extract the installer
tar -xvf Xilinx_Unified_2024.2_*.tar.gz
cd Xilinx_Unified_2024.2_*/

# Run the installer
sudo ./xsetup

# During installation:
# - Choose "Vivado" (not Vitis)
# - Select "Vivado ML Standard"
# - Make sure to include support for Artix-7 devices
# - Install to /tools/Xilinx/Vivado/2024.2 (or your preferred location)
```
After installation, you need to source the settings file every time you want to use Vivado:
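For example, with the install location suggested above (adjust the path if yours differs):

```bash
source /tools/Xilinx/Vivado/2024.2/settings64.sh

# Or add it to your shell startup so every terminal has it
echo 'source /tools/Xilinx/Vivado/2024.2/settings64.sh' >> ~/.bashrc
```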
Installing Open Source FPGA Tools
While Vivado is powerful, I also recommend installing the open-source toolchain. It's faster for small designs and great for learning:
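One way to pull in the core pieces on Ubuntu is shown below. These are the simulation and synthesis front ends from the standard repositories; a full open-source place-and-route flow for the Artix-7 (Project X-Ray plus nextpnr-xilinx) exists but needs a separate build from source.

```bash
# Simulation, synthesis and waveform tools from the Ubuntu repositories
sudo apt update
sudo apt install -y yosys iverilog verilator gtkwave

# Quick sanity check
yosys -V
iverilog -V
verilator --version
```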
Connecting Your Arty Board
The Arty board uses an FTDI chip for USB communication. You'll need to set up proper permissions:
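The rule below covers the usual case - 0403:6010 is the FT2232H used on Digilent boards, but confirm yours with lsusb, and the rule file name is arbitrary:

```bash
# Allow non-root access to the FTDI JTAG/UART interface
sudo tee /etc/udev/rules.d/99-arty-ftdi.rules > /dev/null <<'EOF'
ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6010", MODE="0666", GROUP="plugdev"
EOF

sudo udevadm control --reload-rules
sudo udevadm trigger

# The serial console shows up as /dev/ttyUSB*, which needs the dialout group
sudo usermod -aG dialout $USER   # log out and back in afterwards
```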
Installing OpenOCD for Debugging
OpenOCD is essential for loading programs onto softcore CPUs:
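The packaged build is usually enough to start with (distribution builds normally include FTDI adapter support); build from source only if you need newer RISC-V debug features:

```bash
sudo apt install -y openocd
openocd --version
```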
From Blinky to CPU
Part 1: The Heartbeat
My first design was embarrassingly simple:
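Roughly this - a free-running counter blinking an LED. The 27-bit width assumes the Arty's 100 MHz clock:

```verilog
// No ALU, no decode - just a counter, the heartbeat.
module heartbeat (
    input  wire clk,   // 100 MHz board clock
    output wire led
);
    reg [26:0] count = 0;

    always @(posedge clk)
        count <= count + 1'b1;

    // The top bit toggles about every 0.67 s at 100 MHz
    assign led = count[26];
endmodule
```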
No ALU, no decode logic, nothing. Just a counter. But this counter would become the heartbeat of my CPU.
Part 2: Understanding RISC-V
I chose RISC-V because it's clean and unencumbered by decades of legacy. The base instruction set (RV32I) has just 47 instructions. That's it. This minimalism meant I could build a useful CPU in days, not months. The beauty of RISC-V is its regularity. Instruction fields are always in the same place:
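For example, the registers and opcode come out of fixed bit positions regardless of instruction format, so "decode" is mostly bit slicing (the port names below are mine):

```verilog
// RV32I keeps register and opcode fields at the same bit positions
// across formats, so decoding them is pure wiring.
module rv32i_fields (
    input  wire [31:0] instr,
    output wire [6:0]  opcode,
    output wire [4:0]  rd, rs1, rs2,
    output wire [2:0]  funct3,
    output wire [6:0]  funct7
);
    assign opcode = instr[6:0];
    assign rd     = instr[11:7];
    assign funct3 = instr[14:12];
    assign rs1    = instr[19:15];
    assign rs2    = instr[24:20];
    assign funct7 = instr[31:25];
endmodule
```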
Part 3: The State Machine
Every CPU is fundamentally a state machine. I learned this the hard way when my first "all-in-one-cycle" design turned into a combinatorial nightmare. Breaking it down into states made everything click:
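A sketch of that control (state encoding and names are illustrative):

```verilog
// Control as an explicit four-state machine.
module cpu_ctrl (
    input  wire       clk,
    input  wire       rst,
    output reg  [1:0] state
);
    localparam FETCH = 2'd0, DECODE = 2'd1, EXECUTE = 2'd2, WRITEBACK = 2'd3;

    always @(posedge clk) begin
        if (rst)
            state <= FETCH;
        else case (state)
            FETCH:     state <= DECODE;    // read the instruction at PC
            DECODE:    state <= EXECUTE;   // slice fields, read registers
            EXECUTE:   state <= WRITEBACK; // ALU or memory access
            WRITEBACK: state <= FETCH;     // write result, bump PC
        endcase
    end
endmodule
```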
Four cycles per instruction. Not fast, but correct. And correct is the foundation you build performance on.
The WebAssembly-on-FPGA Frontier
After getting comfortable with RISC-V, I started exploring WebAssembly softcores. The idea is compelling: instead of compiling to native code, why not run WASM directly on hardware?
Why WASM on FPGA?
WASM has several advantages for FPGA implementation:
No garbage collection in WASM 1.0 - This dramatically simplifies the hardware design
Simple memory model - Just linear memory, no complex MMU required
Stack machine architecture - Different from register machines, and potentially more compact (no architectural register file, denser bytecode)
I found several existing projects attempting this:
WasMachine: A sequential 6-step WASM implementation
wasm-fpga-engine: Executes a subset of WASM instructions
The Compilation Pipeline Challenge
The biggest challenge I faced was understanding the compilation pipeline differences between traditional ISAs and WASM. With RISC-V, the flow is straightforward:
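Roughly, with the usual tools standing in for the arrows:

```
1. C source    --riscv gcc-->   RISC-V ELF
2. RISC-V ELF  --objcopy-->     flat binary / hex image
3. hex image   --> loaded into instruction memory and executed natively
```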
With WASM, it's more complex:
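Roughly, with an extra stage in the middle that has no native-ISA equivalent:

```
1. C / Rust source  --clang / rustc (wasm32 target)-->           .wasm bytecode
2. .wasm bytecode   --decoded and translated on (or for) the core-->  internal form
3. internal form    --> executed by the softcore
```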
That middle step is where things get interesting. Unlike traditional CPUs that execute native instructions, a WASM softcore needs to interpret bytecode or compile it to microcode on the fly.
My WASM Softcore Design Approach
I decided to start with WASM 1.0 only, avoiding the complexity of newer features. Here's my basic architecture:
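The broad shape: a bytecode store, a program counter, an operand stack, and one linear memory. The skeleton below shows that split - the names, sizes, and memory depths are illustrative stand-ins rather than a record of the actual implementation.

```verilog
// Skeleton of a WASM 1.0 core: bytecode store, program counter,
// operand stack, and a single linear memory. Execute logic is elided.
module wasm_core (
    input wire clk,
    input wire rst
);
    // Function bytecode, loaded by the bootloader
    reg [7:0]  code_mem [0:4095];
    reg [11:0] pc = 0;

    // Operand stack - WASM is a stack machine, so this replaces a register file
    reg [31:0] op_stack [0:63];
    reg [5:0]  sp = 0;

    // The module's one linear memory (WASM 1.0: no MMU, no segmentation)
    reg [7:0]  linear_mem [0:8191];

    wire [7:0] opcode = code_mem[pc];

    always @(posedge clk) begin
        if (rst) begin
            pc <= 0;
            sp <= 0;
        end else begin
            // decode `opcode`, push/pop op_stack, touch linear_mem, and
            // advance pc (immediates are LEB128-encoded, so step size varies)
            pc <= pc + 1'b1;
        end
    end
endmodule
```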
Bootloader Experiments
Getting code onto the WASM softcore required a custom bootloader. I experimented with several approaches:
JTAG Loading: Using OpenOCD to write directly to FPGA memory
UART Bootloader: Slower but more universal
SPI Flash: For permanent storage
Here's a simple UART bootloader I implemented:
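In outline: receive a fixed-length image over the serial line, write it into program memory byte by byte, then release the CPU from reset. The sketch below assumes a 100 MHz clock, 115200 baud, a fixed image length, and made-up port names; a real version also wants an input synchronizer and some framing or a checksum.

```verilog
// Minimal UART boot loader sketch: shift in bytes, write them to program
// memory, start the CPU when the whole image has arrived.
module uart_bootloader #(
    parameter CLK_HZ = 100_000_000,   // board clock
    parameter BAUD   = 115_200,
    parameter LENGTH = 1024           // bytes expected in the image
) (
    input  wire        clk,
    input  wire        rx,
    output reg         wr_en      = 1'b0,
    output reg  [15:0] wr_addr    = 0,
    output reg  [7:0]  wr_data    = 0,
    output reg         cpu_resetn = 1'b0   // released once the image is in
);
    localparam integer DIV = CLK_HZ / BAUD;

    reg [15:0] baud_cnt = 0;
    reg [3:0]  bit_idx  = 0;
    reg [7:0]  shreg    = 0;
    reg        busy     = 1'b0;

    always @(posedge clk) begin
        // advance the write pointer the cycle after a byte is written
        if (wr_en) begin
            wr_addr <= wr_addr + 1'b1;
            if (wr_addr == LENGTH - 1)
                cpu_resetn <= 1'b1;         // whole image loaded: start the CPU
        end
        wr_en <= 1'b0;

        if (!busy) begin
            if (!rx) begin                  // start bit detected
                busy     <= 1'b1;
                bit_idx  <= 0;
                baud_cnt <= DIV + DIV/2 - 1; // land mid-way into data bit 0
            end
        end else if (baud_cnt != 0) begin
            baud_cnt <= baud_cnt - 1'b1;
        end else if (bit_idx < 8) begin     // sample the 8 data bits, LSB first
            shreg    <= {rx, shreg[7:1]};
            bit_idx  <= bit_idx + 1'b1;
            baud_cnt <= DIV - 1;
        end else begin                      // stop bit: hand the byte over
            busy    <= 1'b0;
            wr_en   <= 1'b1;
            wr_data <= shreg;
        end
    end
endmodule
```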
Hello World on WASM
Getting a "Hello, world" out of the core took three pieces:
WASM to Memory Compiler: Converting WASM bytecode to memory initialization
Basic I/O: Memory-mapped UART for output
Minimal WASM Runtime: Supporting just enough opcodes for string output
The specifications checklist I developed:
Performance Optimization Journey
Once I had working designs, optimization became the focus. Here's what made the biggest differences:
For the RISC-V Core:
Pipelining: 4x theoretical speedup, 2.5x actual after accounting for hazards
Branch prediction: Even 2-bit prediction gave 85% accuracy
Forwarding paths: Eliminated most pipeline stalls
For the WASM Core:
Stack caching: Top 8 stack entries in registers
Opcode fusion: Common sequences executed as single operations
Memory prefetching: Predictable access patterns in WASM helped here
Why Roll Your Own CPU?
When you build your own softcore, you're not just learning architecture; you're learning how to think about computation itself. Every decision you make - from instruction encoding to pipeline depth - teaches you something about the trade-offs that real architects face every day. The beauty of FPGAs is they let you experiment without the $100M price tag of a tape-out. That's the kind of iteration cycle that leads to real understanding.
The Simplest MVP
Let's start with the absolute minimum viable CPU. Not because it's useful, but because complexity is the enemy of understanding. You want something you can hold in your head all at once.
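Something like this: a program counter marching through instruction memory, and nothing else (the memory depth and hex file name are placeholders):

```verilog
// The minimum viable "CPU": a program counter that walks through
// instruction memory. It fetches; it doesn't execute anything yet.
module mvp_cpu (
    input  wire        clk,
    input  wire        rst,
    output reg  [31:0] instr
);
    reg [31:0] pc = 0;
    reg [31:0] imem [0:255];                 // 256-word instruction memory

    initial $readmemh("program.hex", imem);  // placeholder image name

    always @(posedge clk) begin
        if (rst) begin
            pc <= 0;
        end else begin
            instr <= imem[pc[9:2]];          // word-addressed fetch
            pc    <= pc + 32'd4;             // next instruction
        end
    end
endmodule
```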
See what we did there? No ALU, no decode logic, nothing. Just a counter that reads instructions. This is your heartbeat. Everything else is just organs attached to this pulse.
Understanding RISC-V
RISC-V isn't perfect, but it's good enough, and good enough is often better than perfect. The instruction set is clean, the encoding is regular, and most importantly, it's not encumbered by 40 years of backwards compatibility cruft. Here's the thing about instruction sets: they're a contract between software and hardware. Break that contract, and you're on your own. Respect it, and you get to leverage millions of hours of compiler development. The RV32I base instruction set has exactly 47 instructions. That's it. Everything else is optional. This minimalism is a feature, not a bug. It means you can build a useful CPU in a weekend, not a year.
State Machines
Every CPU is fundamentally a state machine. Fetch, decode, execute, writeback - it's a dance as old as von Neumann. The question isn't whether you need these states; it's how you choreograph them.
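One way to write the choreography is to let each state raise a different enable for the datapath (the enable names are illustrative; the datapath they drive isn't shown):

```verilog
// Four states, one per cycle, each enabling a different part of the datapath.
module choreography (
    input  wire clk,
    input  wire rst,
    output wire imem_read,   // FETCH: read instruction memory
    output wire reg_read,    // DECODE: read rs1/rs2
    output wire alu_go,      // EXECUTE: ALU op or address calculation
    output wire reg_write,   // WRITEBACK: commit the result
    output wire pc_advance   // WRITEBACK: move to the next instruction
);
    localparam FETCH = 2'd0, DECODE = 2'd1, EXECUTE = 2'd2, WRITEBACK = 2'd3;
    reg [1:0] state = FETCH;

    always @(posedge clk) begin
        if (rst) state <= FETCH;
        else     state <= (state == WRITEBACK) ? FETCH : state + 2'd1;
    end

    assign imem_read  = (state == FETCH);
    assign reg_read   = (state == DECODE);
    assign alu_go     = (state == EXECUTE);
    assign reg_write  = (state == WRITEBACK);
    assign pc_advance = (state == WRITEBACK);
endmodule
```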
Four cycles per instruction. It's not fast, but it's correct, and correct is the foundation you build performance on.
The ALU
The ALU is where the rubber meets the road. It's tempting to build a massive combinatorial blob that does everything, but that's a mistake. Start simple, measure, then optimize.
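Here's a small RV32I-flavoured ALU sketch in that spirit - the op encoding follows funct3, and sub_or_cmp is a control the decoder would assert for SUB, SLT and SLTU:

```verilog
// A small ALU that reuses one adder for ADD, SUB and both comparisons.
module alu (
    input  wire [31:0] a, b,
    input  wire [2:0]  op,
    input  wire        sub_or_cmp,
    output reg  [31:0] y
);
    // One shared adder: invert b and set carry-in to subtract, then let the
    // same subtraction result drive the comparisons.
    wire [31:0] b_eff = sub_or_cmp ? ~b : b;
    wire [32:0] sum   = {1'b0, a} + {1'b0, b_eff} + sub_or_cmp;
    wire        ltu   = ~sum[32];                           // unsigned a < b
    wire        lt    = (a[31] != b[31]) ? a[31] : sum[31]; // signed a < b

    always @(*) begin
        case (op)
            3'b000: y = sum[31:0];        // ADD / SUB
            3'b001: y = a << b[4:0];      // SLL
            3'b010: y = {31'b0, lt};      // SLT
            3'b011: y = {31'b0, ltu};     // SLTU
            3'b100: y = a ^ b;            // XOR
            3'b101: y = a >> b[4:0];      // SRL (SRA left out for brevity)
            3'b110: y = a | b;            // OR
            3'b111: y = a & b;            // AND
            default: y = 32'b0;
        endcase
    end
endmodule
```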
Notice how we share the adder for both addition and comparison? That's not being clever - that's recognizing that silicon area costs money, even in an FPGA.
The Memory
Memory is where most CPUs go to die. You can have the world's best pipeline, but if you're waiting on memory, you're just warming the room. This is why caches exist, but let's not get ahead of ourselves.
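For reference, what the FPGA fabric actually gives you is synchronous block RAM with a one-cycle read latency - exactly the latency the state machine or pipeline has to absorb. A minimal sketch, with depth and addressing chosen arbitrarily:

```verilog
// Synchronous RAM that maps onto FPGA block RAM. The registered read is the
// one-cycle latency the rest of the CPU has to live with.
module data_mem (
    input  wire        clk,
    input  wire        we,
    input  wire [31:0] addr,
    input  wire [31:0] wdata,
    output reg  [31:0] rdata
);
    reg [31:0] mem [0:1023];             // 4 KiB, word-addressed via addr[11:2]

    always @(posedge clk) begin
        if (we)
            mem[addr[11:2]] <= wdata;
        rdata <= mem[addr[11:2]];        // data shows up on the next cycle
    end
endmodule
```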
Optimization
Here's where it gets interesting. That 4-cycle state machine? It's killing your performance. Time to pipeline. But first, let me tell you a secret: premature optimization is the root of all evil, except in CPU design, where leaving performance on the table is a sin. The trick is knowing when you're being premature.
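The reshaping is easier to see on something tiny first. The toy below pipelines a multiply-accumulate across three register stages: each result still takes three cycles to appear, but a new input is accepted every cycle - the same trade the 4-state machine makes when it becomes a 4-stage pipeline. It's an illustration of the principle, not a piece of the CPU.

```verilog
// Pipelining in miniature: a multiply-accumulate cut into three register
// stages. Latency is three cycles, but throughput is one input per cycle.
module pipelined_mac (
    input  wire        clk,
    input  wire [15:0] a, b,
    input  wire [31:0] c,
    output reg  [31:0] y
);
    reg [31:0] s1_prod, s1_c;   // after stage 1
    reg [31:0] s2_sum;          // after stage 2

    always @(posedge clk) begin
        // Stage 1: multiply
        s1_prod <= a * b;
        s1_c    <= c;
        // Stage 2: add the carried-along accumulator input
        s2_sum  <= s1_prod + s1_c;
        // Stage 3: register the result
        y       <= s2_sum;
    end
endmodule
```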
But wait! What about data hazards? What if instruction N+1 needs the result from instruction N? Welcome to the fun part of CPU design.
Hazards and Forwarding
Hazards are why CPU design is hard. It's not the arithmetic or the control logic - it's the corner cases when instructions depend on each other.
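The standard fix is forwarding: when a younger instruction needs a value an older one has computed but not yet written back, steal it from the pipeline register instead of the register file. A sketch for one ALU operand (stage and signal names are mine; a load-use hazard still forces a one-cycle stall):

```verilog
// Forwarding mux for one ALU operand: prefer in-flight results over the
// stale register-file copy, newest stage first.
module forward_mux (
    input  wire [4:0]  rs,            // source register being read in EX
    input  wire [31:0] regfile_val,   // value read from the register file
    input  wire [4:0]  mem_rd,        // destination of the instruction in MEM
    input  wire [31:0] mem_result,
    input  wire        mem_writes,
    input  wire [4:0]  wb_rd,         // destination of the instruction in WB
    input  wire [31:0] wb_result,
    input  wire        wb_writes,
    output wire [31:0] operand
);
    wire fwd_mem = mem_writes && (mem_rd != 5'd0) && (mem_rd == rs);
    wire fwd_wb  = wb_writes  && (wb_rd  != 5'd0) && (wb_rd  == rs);

    // Prefer the newest value (MEM) over the older one (WB)
    assign operand = fwd_mem ? mem_result :
                     fwd_wb  ? wb_result  :
                               regfile_val;
endmodule
```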
This is inelegant. It's also necessary. Every cycle you stall is performance lost forever.
Branches
Branches are where the von Neumann model shows its age. You're fetching instructions sequentially, but programs aren't sequential. They jump around like a hyperactive squirrel.
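The classic answer is a table of 2-bit saturating counters indexed by the branch address: predict taken in the two "taken" states, and nudge the counter toward the actual outcome when the branch resolves. A sketch with arbitrary table size:

```verilog
// Branch history table: 2-bit saturating counters indexed by low PC bits.
module bht #(
    parameter ENTRIES = 256
) (
    input  wire        clk,
    // prediction side
    input  wire [31:0] fetch_pc,
    output wire        predict_taken,
    // update side, driven when the branch actually resolves
    input  wire        update_en,
    input  wire [31:0] update_pc,
    input  wire        taken
);
    reg [1:0] counters [0:ENTRIES-1];

    wire [7:0] fetch_idx  = fetch_pc[9:2];
    wire [7:0] update_idx = update_pc[9:2];

    wire [1:0] fetch_ctr = counters[fetch_idx];
    assign predict_taken = fetch_ctr[1];        // states 10/11 mean "taken"

    integer i;
    initial for (i = 0; i < ENTRIES; i = i + 1) counters[i] = 2'b01;  // weakly not-taken

    always @(posedge clk) begin
        if (update_en) begin
            if (taken && counters[update_idx] != 2'b11)
                counters[update_idx] <= counters[update_idx] + 2'b01;
            else if (!taken && counters[update_idx] != 2'b00)
                counters[update_idx] <= counters[update_idx] - 2'b01;
        end
    end
endmodule
```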
Two-bit prediction gets you to about 85% accuracy. Want better? Add more history. But remember: every bit of state is area, and area is money.
Tools and Testing
Building a CPU without proper testing is like flying blind. You need to know it works before you synthesize it.
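At minimum that means a simulation harness that clocks the design, releases reset, and dumps a waveform for GTKWave. A bare-bones sketch - the DUT's module and port names here are placeholders:

```verilog
`timescale 1ns/1ps
module cpu_tb;
    reg clk = 0, rst = 1;
    always #5 clk = ~clk;              // 100 MHz

    // Device under test - module name and ports are placeholders
    cpu dut (.clk(clk), .rst(rst));

    initial begin
        $dumpfile("cpu_tb.vcd");       // open the waveform in GTKWave
        $dumpvars(0, cpu_tb);
        repeat (4) @(posedge clk);
        rst = 0;
        repeat (500) @(posedge clk);   // long enough for a short test program
        $finish;
    end
endmodule
```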
Write test programs. Start simple:
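For example, a first smoke test can be two immediates, an add, and a parking loop (register choices are arbitrary):

```asm
# Smoke test: if the core works, x3 ends up holding 15.
    addi x1, x0, 5        # x1 = 5
    addi x2, x0, 10       # x2 = 10
    add  x3, x1, x2       # x3 = 5 + 10
loop:
    jal  x0, loop         # park here so the result can be inspected
```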
Memory Mapped I/O
A CPU that can't communicate is just a space heater. Memory-mapped I/O is the simplest way to connect to peripherals.
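In hardware terms: decode one magic address ahead of the data memory, and a store to that address becomes a byte out of the UART. A sketch, with the base address and signal names as assumptions:

```verilog
// Memory-mapped UART transmit register: a store to UART_TX_ADDR hands the
// low byte to the UART; every other store falls through to normal RAM.
module mmio_uart_tx #(
    parameter [31:0] UART_TX_ADDR = 32'h1000_0000
) (
    input  wire        clk,
    input  wire        mem_write,
    input  wire [31:0] mem_addr,
    input  wire [31:0] mem_wdata,
    output reg  [7:0]  uart_tx_data  = 0,
    output reg         uart_tx_valid = 0,   // pulses one cycle per byte
    output wire        dmem_write            // qualified write for data RAM
);
    wire uart_sel = (mem_addr == UART_TX_ADDR);
    assign dmem_write = mem_write && !uart_sel;

    always @(posedge clk) begin
        uart_tx_valid <= 1'b0;
        if (mem_write && uart_sel) begin
            uart_tx_data  <= mem_wdata[7:0];
            uart_tx_valid <= 1'b1;
        end
    end
endmodule
```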
From Simulation to Silicon
Simulation is one thing. Running on real hardware is another. The FPGA tools will humble you.
Performance Analysis
You can't optimize what you can't measure. Add performance counters:
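A few free-running counters go a long way - cycles, retired instructions, stall cycles. The sketch below assumes the core provides one-cycle "retired" and "stall" pulses; cycles divided by retired instructions gives CPI, the most honest number a core can report about itself.

```verilog
// Free-running performance counters, wide enough to never wrap in practice.
module perf_counters (
    input  wire        clk,
    input  wire        rst,
    input  wire        instr_retired,   // one pulse per completed instruction
    input  wire        stall,           // high while the pipeline is stalled
    output reg  [63:0] cycles  = 0,
    output reg  [63:0] retired = 0,
    output reg  [63:0] stalls  = 0
);
    always @(posedge clk) begin
        if (rst) begin
            cycles  <= 0;
            retired <= 0;
            stalls  <= 0;
        end else begin
            cycles  <= cycles + 1;
            if (instr_retired) retired <= retired + 1;
            if (stall)         stalls  <= stalls + 1;
        end
    end
endmodule
```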
Advanced Topics
Once you have a working CPU, you can start adding complexity. But remember: every feature has a cost.
Caches
Caches are just fast memory with an attitude problem. They think they know better than you what data you'll need next. Sometimes they're right.
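The core trick is small: low address bits pick a line, the high bits are stored as a tag, and it's a hit when the stored tag matches. A direct-mapped lookup sketch, with refill logic elided and sizes arbitrary:

```verilog
// Direct-mapped cache lookup: addr = | tag (24) | index (6) | offset (2) |.
// One 32-bit word per line; refill is driven from outside.
module dcache_lookup (
    input  wire        clk,
    input  wire [31:0] addr,
    input  wire        fill,            // write a line on refill
    input  wire [31:0] fill_data,
    output wire        hit,
    output wire [31:0] rdata
);
    wire [5:0]  index = addr[7:2];
    wire [23:0] tag   = addr[31:8];

    reg         valid [0:63];
    reg  [23:0] tags  [0:63];
    reg  [31:0] data  [0:63];

    integer i;
    initial for (i = 0; i < 64; i = i + 1) valid[i] = 1'b0;

    assign hit   = valid[index] && (tags[index] == tag);
    assign rdata = data[index];

    always @(posedge clk) begin
        if (fill) begin
            valid[index] <= 1'b1;
            tags[index]  <= tag;
            data[index]  <= fill_data;
        end
    end
endmodule
```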
Out-of-Order Execution
This is where CPUs get really complex. You're essentially building a dependency graph of instructions and executing them as soon as their inputs are ready. It's beautiful when it works and a nightmare to debug.
Multiple Issue
Why execute one instruction per cycle when you can do two? Or four? This is where you need to understand your workload. Not all code has enough parallelism to feed a wide machine.
Toolchain
A CPU without a compiler is like a car without roads. You need to understand how software will use your hardware.
The GNU toolchain is your friend here. Building GCC for a new architecture is... non-trivial. But RISC-V already has great compiler support. Use it.
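A typical bare-metal build looks like the sketch below. The linker script and startup file are placeholders, the toolchain prefix varies by build (riscv64-unknown-elf-gcc with rv32 -march/-mabi flags is a common multilib setup), and the final conversion produces a hex image for $readmemh:

```bash
# Cross-compile a bare-metal RV32I program and turn it into a memory image.
# link.ld and start.S are placeholders for your own linker script and startup code.
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -nostartfiles \
    -T link.ld -o hello.elf start.S hello.c

# Flat binary, then one 32-bit hex word per line for $readmemh
riscv64-unknown-elf-objcopy -O binary hello.elf hello.bin
hexdump -v -e '1/4 "%08x\n"' hello.bin > hello.hex
```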