Back in November the Tiny Tapeout project opened its doors for the second time. The project is a way to get a (simple) chip design manufactured on actual silicon for a (somewhat) affordable price.
Since I wanted to try my hand at Verilog for some time, naturally I jumped on this opportunity...
At the last possible moment.
TL;DR: I implemented a 5-bit Galois LFSR and a 16-bit CPU in Verilog. The code is here, and a 3D rendered GDS is here.
- Shift into high gear
- The Plan
- We'll fit
- Attempt #1
- Attempt #2 - Reduce RAM
- Attempt #3 - Disable multiplication
- Attempt #4 - Reduce ROM
- Attempt #5 - Absolutely disable multiplication
- Attempt #6 - Streamline PROM
- Attempt #7 - Simplify 7-segment somewhat
- Attempt #8 - Disable multiplication (again)
- Attempt #9 - Optimize LFSR assembly, reduce ROM size
- Attempt #10 - Implement XOR instruction
- Attempt #11 - Further optimize LFSR assembly
- Attempt #12 - Downclock
- Attempt #13 - Remove RAM cell
- Attempt #14 - Get rid of loop in LFSR assembly
- Attempt #15 - Get rid of RAM entirely
- Attempt #16 - Cut ROM in half
- Attempt #17 - Further reduce memory
- Attempt #18 - Cut ROM in half
- Appendix: Hack ISA
Shift into high gear
It is evening, November 13th. The submission deadline is November 14th, 19:00. What can I do in two evenings with virtually zero experience in Verilog?
A counter that increments the digit on the 7-segment display every second? I did that once on an FPGA, so that's a no. I want something new.
How about an LFSR? It's still basically a counter, but the sequence it generates is pseudo-random. Here's a nice video that explains the basic operation (in fact, that was the inspiration).
The plan is to have a configurable LFSR: the user should be able to set both the initial state and the taps. How wide (in bits) can we make it?
We'll need two registers: one to store the taps, another to store the LFSR's current state. Since on power-up the contents of the registers are undefined, we'll also need two reset signals to initialize them: one reset will initialize the taps register from the user input, the other will initialize the initial state. Another signal is reserved for the clock. So we're left with 5 inputs to work with.
And here it is1:
The gist is this:
- On every clock cycle:
- Check whether either of the reset signals is asserted.
- If so, initialize the appropriate register.
- If not, check the tick counter.
- If we've not yet counted enough ticks for a full second, advance the counter.
- Otherwise, reset the counter and advance the LFSR.
- Check whether either of the reset signals is asserted.
For example, if we have an 8-bit LFSR with a clock frequency of 4 Hz, we might see something like this:
For a 5-bit LFSR there are enough choices of taps for a maximal-length sequence, so we should be able to have a pretty light display 🚦.
The whole thing was submitted on November 14th, at exactly 19:00:2
no more time
And then, ~1 hour after the submission deadline, this arrives in the mail:
I just heard from Efabless that the November 14th Tapeout deadline has been pushed back to the end of the month.
But okay, the design is done, it works (in simulation), no need to touch it.
I'll just leave it as it is.
Hack, as defined in the Nand2Tetris course, is a very simple 16-bit Harvard architecture machine:
- CPU capable of basic arithmetic operations:
- Bitwise AND
- Bitwise OR
- Bitwise NOT
- ROM with 32K 16-bit words for storing the program.
- RAM with 16K 16-bit words.
- 512x256 black and white (1 bit) display.
And that's about it. There isn't even a shift operation! Additionally:
- The RAM and ROM can only be accessed by individual words. It's impossible to access a single byte, for example.
- There is no instruction for reading from ROM, so it can't be used for storing constants.
- All arithmetic operations are signed. There are no unsigned operations.
(See the appendix for more info.)
But even though the CPU is extremely limited, people managed to do really impressive things with it. I mean, here's a raytracer: https://blog.alexqua.ch/posts/from-nand-to-raytracer/.
Yes, technically the author runs the code in the VM emulator, not directly on the Hack CPU. The assembled code might not even fit in the ROM.
But, the code still manages to run with the platform's limited RAM and arithmetic capabilities.
It is a very impressive achievement.
So check out the course, it's pretty good!
There are various implementations of Hack on an FPGA, but wouldn't it be nice to have it on actual silicon? If we're cursing sand to do our bidding, might as well go all the way 😈.
We already have a working LFSR in hardware, so naturally the next step is to make it work in software. The goal is to build a Hack implementation that can run an LFSR and output its values through the PCB's output pins. Using the 7-segment display is a bonus. And, the original hardware LFSR should still be present, just in case the CPU doesn't work.
- A Hack CPU
- It mustn't be very large. Just enough to make the LFSR program work.
- Again, only big enough to hold the LFSR code.
- A UART for uploading the program to the ROM
- Glue for interfacing the Hack CPU to the output pins
- A way to switch between running the CPU and the hardware LFSR.
For the UART, I mostly adapted the code from whitequark's implementation. The CPU itself was more-or-less directly translated from my code for the Nand2Tetris course. Actually, it was the "expanded" version of the CPU, with support for multiplication and bitwise shifting.
And here's the address map (in 16-bit words) I landed on:
|Low address||High address||Description|
RAM_WORDS is a compile-time parameter which must be a power of 2.
Then, the address space between the RAM and the "I/O space" will just wrap back around
to point to RAM. Reading from address
0x4000 would sample the input pins of the PCB.
0x4001 would drive the PCB's output pins.
Finally, I used
io_in to determine whether the LFSR or the CPU
should be active. Recall that these pins are used to reset
the LFSR initial state and taps, respectively. Asserting both of them at the same time
isn't really valid, so we can use that state to indicate that the CPU should be active
And the LFSR program? Here it is:
The program is an 8-bit LFSR, and outputs its state once a second (approximately)
directly to the PCB's
io_out pins. There's no 7-segment logic here.
After assembling, the whole thing comes down to 42 instructions.
You can see all the code from this stage here.
Okay then. The design is done, the tests are passing, it's time to run synthesis.
[INFO GPL-0019] Util(%): 390.08 ... [ERROR GPL-0301] Utilization exceeds 100%.
That's, umm... Less than ideal. We're basically using 4 times the area we have for the design.
Attempt #2 - Reduce RAM
Okay, let's try reducing RAM. After all, flip-flops take the most area (or so I'm told). We only need two RAM slots for our LFSR program, so let's do that.
[INFO GPL-0019] Util(%): 341.80 ... [ERROR GPL-0301] Utilization exceeds 100%.
That's something, but still not nearly enough.
Attempt #3 - Disable multiplication
We don't actually need multiplication for the LFSR, so that goes next:
[INFO GPL-0019] Util(%): 308.01 ... [ERROR GPL-0301] Utilization exceeds 100%.
Okay, I didn't actually expect that big of a reduction. Anyway, onwards.
Attempt #4 - Reduce ROM
We don't need that much ROM:
[INFO GPL-0019] Util(%): 294.81 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #5 - Absolutely disable multiplication
Multiplication currently returns 0, so let's get rid of that logic entirely:
[INFO GPL-0019] Util(%): 305.38 ... [ERROR GPL-0301] Utilization exceeds 100%.
The area increased?! Revert that and try again.
Attempt #6 - Streamline PROM
Up till now, the UART accessed the ROM as a normal single-port RAM. But, if we let it poke directly at its internals we can save a bunch of logic and flops.
[INFO GPL-0019] Util(%): 204.53 ... [ERROR GPL-0301] Utilization exceeds 100%.
90% reduction! But still not enough.
Attempt #7 - Simplify 7-segment somewhat
The Verilog code for driving the 7-segment display that is used by the hardware LFSR has some redundancy that can be removed.
[INFO GPL-0019] Util(%): 204.00 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #8 - Disable multiplication (again)
Okay, I have another idea for removing multiplication...
[INFO GPL-0019] Util(%): 204.56 ... [ERROR GPL-0301] Utilization exceeds 100%.
But it's not that bad. Let's leave it as-is.
Attempt #9 - Optimize LFSR assembly, reduce ROM size
With the recent RAM optimizations we can save a couple of instructions in the LFSR program:
This also means we can further reduce the ROM size:
And, since the instruction pointer is only 15 bits wide, we can save one flip-flop by cutting that bit from the PC register.
[INFO GPL-0019] Util(%): 191.77 ... [ERROR GPL-0301] Utilization exceeds 100%.
Inching ever closer.
Attempt #10 - Implement XOR instruction
Time for more drastic measures. Instead of XORing by hand, let's add a new instruction to the CPU. It will take the place of the multiplication instruction we removed. This means we can further reduce the size of the LFSR program and the ROM.
The whole thing is now only 27 instructions long.
[INFO GPL-0019] Util(%): 149.37 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #11 - Further optimize LFSR assembly
We can shave off 3 more instructions using some clever bit twiddling.
In each cycle of an LFSR, the XOR with the taps is performed only if the LSB of
the current state is 1. The current program implements this using a branch, but we
can remove it using this neat trick:
lfsr ^= (-lsb) & taps3.
Now at 24 instructions.
[INFO GPL-0019] Util(%): 140.78 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #12 - Downclock
By reducing the clock speed from 6250 Hz to 625 Hz we can save a couple of flops. Specifically, both the hardware LFSR and the UART have registers for counting clock cycles. If the clock is slower, we need less bits for those counters.
[INFO GPL-0019] Util(%): 140.41 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #13 - Remove RAM cell
We don't actually need 2 RAM cells at this point. We just need one for storing the loop count. Let's get rid of the other one.
[INFO GPL-0019] Util(%): 136.29 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #14 - Get rid of loop in LFSR assembly
Let's do away with the constraint that the LFSR should update once a second. Removing the busy loop reduces the instruction count (and thus ROM size) to 16.
[INFO GPL-0019] Util(%): 103.36 ... [ERROR GPL-0301] Utilization exceeds 100%.
Attempt #15 - Get rid of RAM entirely
Since we removed the busy loop, we don't actually need any RAM. Let's cut it out of the design and update the memory map:
|Low address||High address||Description|
[INFO GPL-0019] Util(%): 99.59 ... [ERROR GPL-0302] Use a higher -density or re-floorplan with a larger core area. Given target density: 0.55 Suggested target density: 1.00
It fits! Except it doesn't, because apparently the density requirement is 55%. So we need to cut more stuff out.
Attempt #16 - Cut ROM in half
Right now the program has some initialization code that sets the LFSR's initial state. This is a waste of ROM, since that code is run only once. Instead, let's use a two-phase approach:
- Upload an initialization program.
- Run the CPU.
- Stop the CPU.
- Upload the actual LFSR program.
- Run the CPU again.
Our main program now becomes this, at only 8 instructions:
The main trick is that if the ROM size is also 8 words, we don't need an explicit jump anymore. Instead, when execution falls off the bottom, the program counter will simply wrap around.
[INFO GPL-0019] Util(%): 67.72 ... [ERROR GPL-0302] Use a higher -density or re-floorplan with a larger core area. Given target density: 0.55 Suggested target density: 0.68
Just a little further...
Attempt #17 - Further reduce memory
There's no reason for us to be able to read from
io_in, right? It doesn't help
the LFSR code at all. Let's make it so that all memory accesses go directly
io_out. This'll save us one instruction:
But won't actually save on ROM, since we still need the wrap-around behaviour, which is only possible if the size is a power of 2.4 Hopefully we'll save some muxes, though...
[INFO GPL-0019] Util(%): 67.28 ... [ERROR GPL-0302] Use a higher -density or re-floorplan with a larger core area. Given target density: 0.55 Suggested target density: 0.68
Attempt #18 - Cut ROM in half
I don't think there's a way to get around this. We have to cut the ROM again. The LFSR program won't fit anymore, but at least we'll have something...
[INFO GPL-0019] Util(%): 50.71
So we have a Hack CPU with 4 words of ROM and 1 word of RAM, the bottom 8 bits of which are used to drive the output I/O pins. It is just enough for this:
Or a counter loop:
And now we wait for the silicon to arrive.5
See you next year, folks 🎄.
Appendix: Hack ISA
The Hack CPU has only two 16-bit general-purpose registers —
and only two instructions — the A-Instruction and the C-Instruction.
- Binary encoding:
0vvv vvvv vvvv vvvv
- Description: Loads a 15-bit value into the
jump(or both) may be omitted.
- Binary encoding:
111a cccc ccdd djjj
cbits encode the
dbits encode the
jbits encode the
abit modifies the interpretation of some
- Description: Performs a computation, stores its result in one or more locations, and jumps.
AD=-1 stores the value
-1 in the
D registers. In this case,
comp part is
-1, and the
dest part is
The C-Instruction also recognizes the pseudo-register
M, which is interpreted
as the RAM location referenced by the
A register. So, for instance, the following
1 at RAM location
jump specification supports both conditional and unconditional jumps.
For conditional jumps, the condition is the result of the computation performed
by the current instruction. The jump target is taken from the
Will jump to address
5 in ROM if the value of the
D register is strictly greater
On the other hand,
0;JMP performs an unconditional jump. Since a
comp must always
be supplied, we just use
Refer to the "Machine Language" chapter of the Nand2Tetris course for a full reference.
This is the final submitted version. It's cleaned-up and doesn't have an off-by-one bug that was present in the original version. ↩︎
The two other commits there were made within the next 15 minutes, and are purely cosmetic. They design would've still worked even if they didn't get in. ↩︎
Wait, if the ROM has 8 words, but the program is only 7 words, won't we execute one garbage instruction before wrapping around to the start? Not if we zero-initialize the ROM. Zero translates to the
@0instruction, which is harmless here. ↩︎
Actually, the final utilization is 52.09%, because I needed to fix a bug 😅. But that would ruin the flow of the narrative. ↩︎