Biko's House of Horrors

Tiny Tapeout 02

2D rendering of the final silicon layout

Back in November the Tiny Tapeout project opened its doors for the second time. The project is a way to get a (simple) chip design manufactured on actual silicon for a (somewhat) affordable price.

Since I wanted to try my hand at Verilog for some time, naturally I jumped on this opportunity...

At the last possible moment.

TL;DR: I implemented a 5-bit Galois LFSR and a 16-bit CPU in Verilog. The code is here, and a 3D rendered GDS is here.

Shift into high gear

It is evening, November 13th. The submission deadline is November 14th, 19:00. What can I do in two evenings with virtually zero experience in Verilog?

A counter that increments the digit on the 7-segment display every second? I did that once on an FPGA, so that's a no. I want something new.

How about an LFSR? It's still basically a counter, but the sequence it generates is pseudo-random. Here's a nice video that explains the basic operation (in fact, that was the inspiration).

The plan is to have a configurable LFSR: the user should be able to set both the initial state and the taps. How wide (in bits) can we make it?

We'll need two registers: one to store the taps, another to store the LFSR's current state. Since on power-up the contents of the registers are undefined, we'll also need two reset signals to initialize them: one reset will initialize the taps register from the user input, the other will initialize the initial state. Another signal is reserved for the clock. So we're left with 5 inputs to work with.

And here it is1:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
module lfsr #(
    parameter BITS = 8,
    parameter TICKS = 1
) (
    input clk,

    input reset_lfsr_i,
    input [BITS-1:0] initial_state_i,

    input reset_taps_i,
    input [BITS-1:0] taps_i,

    output [BITS-1:0] state_o
);
    reg [BITS-1:0] taps;
    reg [BITS-1:0] lfsr;

    localparam TICK_BITS =
        ($clog2(TICKS) == 0) ? 1 : $clog2(TICKS);
    reg [TICK_BITS-1:0] tick_count;

    assign state_o = lfsr;

    always @(posedge clk) begin
        if (reset_taps_i) begin
            taps <= taps_i;
        end else if (reset_lfsr_i) begin
            tick_count <= 0;
            lfsr <= initial_state_i;
        end else begin
            if (tick_count == TICKS - 1) begin
                tick_count <= 0;

                // Advance the LFSR
                if (lfsr[0]) begin
                    lfsr <= (lfsr >> 1) ^ taps;
                end else begin
                    lfsr <= (lfsr >> 1);
                end
            end else begin
                tick_count <= tick_count + 1;
            end
        end
    end
endmodule

The gist is this:

  1. On every clock cycle:
    1. Check whether either of the reset signals is asserted.
      1. If so, initialize the appropriate register.
      2. If not, check the tick counter.
        1. If we've not yet counted enough ticks for a full second, advance the counter.
        2. Otherwise, reset the counter and advance the LFSR.

For example, if we have an 8-bit LFSR with a clock frequency of 4 Hz, we might see something like this:

For a 5-bit LFSR there are enough choices of taps for a maximal-length sequence, so we should be able to have a pretty light display 🚦.

The whole thing was submitted on November 14th, at exactly 19:00:2

Screenshot of GitHub commits list. The highlighted commit was made at the exact time of the deadline.

There's no more time

And then, ~1 hour after the submission deadline, this arrives in the mail:

I just heard from Efabless that the November 14th Tapeout deadline has been pushed back to the end of the month.

Oof.

But okay, the design is done, it works (in simulation), no need to touch it.

I'll just leave it as it is.

It's fine.

Hack

Hack, as defined in the Nand2Tetris course, is a very simple 16-bit Harvard architecture machine:

  1. CPU capable of basic arithmetic operations:
    1. Addition
    2. Subtraction
    3. Bitwise AND
    4. Bitwise OR
    5. Bitwise NOT
  2. ROM with 32K 16-bit words for storing the program.
  3. RAM with 16K 16-bit words.
  4. 512x256 black and white (1 bit) display.
  5. Keyboard.

And that's about it. There isn't even a shift operation! Additionally:

  1. The RAM and ROM can only be accessed by individual words. It's impossible to access a single byte, for example.
  2. There is no instruction for reading from ROM, so it can't be used for storing constants.
  3. All arithmetic operations are signed. There are no unsigned operations.

(See the appendix for more info.)

But even though the CPU is extremely limited, people managed to do really impressive things with it. I mean, here's a raytracer: https://blog.alexqua.ch/posts/from-nand-to-raytracer/.

Nitpicker's corner

Yes, technically the author runs the code in the VM emulator, not directly on the Hack CPU. The assembled code might not even fit in the ROM.

But, the code still manages to run with the platform's limited RAM and arithmetic capabilities.

It is a very impressive achievement.

So check out the course, it's pretty good!

There are various implementations of Hack on an FPGA, but wouldn't it be nice to have it on actual silicon? If we're cursing sand to do our bidding, might as well go all the way 😈.

The Plan

We already have a working LFSR in hardware, so naturally the next step is to make it work in software. The goal is to build a Hack implementation that can run an LFSR and output its values through the PCB's output pins. Using the 7-segment display is a bonus. And, the original hardware LFSR should still be present, just in case the CPU doesn't work.

We'll need:

  1. A Hack CPU
  2. RAM
    • It mustn't be very large. Just enough to make the LFSR program work.
  3. ROM
    • Again, only big enough to hold the LFSR code.
  4. A UART for uploading the program to the ROM
  5. Glue for interfacing the Hack CPU to the output pins
  6. A way to switch between running the CPU and the hardware LFSR.

For the UART, I mostly adapted the code from whitequark's implementation. The CPU itself was more-or-less directly translated from my code for the Nand2Tetris course. Actually, it was the "expanded" version of the CPU, with support for multiplication and bitwise shifting.

And here's the address map (in 16-bit words) I landed on:

Low address High address Description
0 RAM_WORDS-1 RAM
RAM_WORDS 0x3FFF A20 😎
0x4000 0x4000 io_in (high 8 bits are always 0 on read)
0x4001 0x4001 io_out (high 8 bits are ignored on write, 0 on read)

Where RAM_WORDS is a compile-time parameter which must be a power of 2. Then, the address space between the RAM and the "I/O space" will just wrap back around to point to RAM. Reading from address 0x4000 would sample the input pins of the PCB. Writing to 0x4001 would drive the PCB's output pins.

Finally, I used io_in[1] and io_in[2] to determine whether the LFSR or the CPU should be active. Recall that these pins are used to reset the LFSR initial state and taps, respectively. Asserting both of them at the same time isn't really valid, so we can use that state to indicate that the CPU should be active instead.

And the LFSR program? Here it is:

lfsr.asm
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
// IO_OUT   - holds current LFSR state
// R0       - holds loop count
// R1       - temp storage

(lblStart)
    @1          // Initial state
    D=A
    @16385      // IO_OUT
    M=D

(lblLoop)
    // Each instruction takes a cycle, and there are
    // 10 instructions in the busy loop, so we need 625
    // loops to wait a second, with a clock speed
    // of 6250 Hz.
    @R0
    D=M
    @624
    D=D-A
    @lblNext
    D;JEQ

    @R0
    M=M+1
    @lblLoop
    0;JMP

(lblNext)
    @R0
    M=0

    @16385  // IO_OUT
    D=M
    M=M>>
    @1
    D=D&A
    @lblXor
    D;JGT

    @lblLoop
    0;JMP

(lblXor)
    // R1 = ~(IO_OUT & Taps)
    @16385  // IO_OUT
    D=M
    @142    // Taps - 0x8E
    D=D&A
    D=!D
    @R1
    M=D

    // IO_OUT = (IO_OUT | Taps) & R1
    @16385  // IO_OUT
    D=M
    @142    // Taps - 0x8E
    D=D|A
    @R1
    D=D&M
    @16385  // IO_OUT
    M=D

    @lblLoop
    0;JMP

The program is an 8-bit LFSR, and outputs its state once a second (approximately) directly to the PCB's io_out pins. There's no 7-segment logic here. After assembling, the whole thing comes down to 42 instructions.

You can see all the code from this stage here.

Okay then. The design is done, the tests are passing, it's time to run synthesis.

We'll fit

Attempt #1

Code, GDS run

[INFO GPL-0019] Util(%): 390.08
...
[ERROR GPL-0301] Utilization exceeds 100%.

That's, umm... Less than ideal. We're basically using 4 times the area we have for the design.

Attempt #2 - Reduce RAM

Commit, GDS run

Okay, let's try reducing RAM. After all, flip-flops take the most area (or so I'm told). We only need two RAM slots for our LFSR program, so let's do that.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
diff --git a/src/mbikovitsky_top.v b/src/mbikovitsky_top.v
index d7da2ac1..382e8863 100644
--- a/src/mbikovitsky_top.v
+++ b/src/mbikovitsky_top.v
@@ -2,7 +2,7 @@ module mbikovitsky_top #(
     parameter CLOCK_HZ = 6250,
     parameter BAUD = 781,
     parameter ROM_WORDS = 64, // Must be a power of 2
-    parameter RAM_WORDS = 16  // Must be a power of 2
+    parameter RAM_WORDS = 2  // Must be a power of 2
 ) (
     input [7:0] io_in,
     output [7:0] io_out

And...

[INFO GPL-0019] Util(%): 341.80
...
[ERROR GPL-0301] Utilization exceeds 100%.

That's something, but still not nearly enough.

Attempt #3 - Disable multiplication

Commit, GDS run

We don't actually need multiplication for the LFSR, so that goes next:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
diff --git a/src/extend_alu.v b/src/extend_alu.v
index a3a0c69d..507a7ab8 100644
--- a/src/extend_alu.v
+++ b/src/extend_alu.v
@@ -24,11 +24,12 @@ module ExtendALU (
     // Otherwise, the lower bit takes precedence, 0 meaning the multiplication
     // result should be returned. If it is not 0, we return the shift
     // result.
+    // UPDATE: Multiplication disabled
     reg signed [15:0] result;
     always @(*) begin
         case (instruction[8:7])
             2'b00:
-                result = x * y;
+                result = 0;
             2'b01:
                 case (instruction[5:4])
                     2'b00:
@@ -41,7 +42,7 @@ module ExtendALU (
                         result = x <<< 1;
                 endcase
             2'b10:
-                result = x * y;
+                result = 0;
             2'b11:
                 result = simple_alu_result;
         endcase

And...

[INFO GPL-0019] Util(%): 308.01
...
[ERROR GPL-0301] Utilization exceeds 100%.

Okay, I didn't actually expect that big of a reduction. Anyway, onwards.

Attempt #4 - Reduce ROM

Commit, GDS run

We don't need that much ROM:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
diff --git a/src/mbikovitsky_top.v b/src/mbikovitsky_top.v
index 382e8863..d7f330ae 100644
--- a/src/mbikovitsky_top.v
+++ b/src/mbikovitsky_top.v
@@ -1,7 +1,7 @@
 module mbikovitsky_top #(
     parameter CLOCK_HZ = 6250,
     parameter BAUD = 781,
-    parameter ROM_WORDS = 64, // Must be a power of 2
+    parameter ROM_WORDS = 42,
     parameter RAM_WORDS = 2  // Must be a power of 2
 ) (
     input [7:0] io_in,

And...

[INFO GPL-0019] Util(%): 294.81
...
[ERROR GPL-0301] Utilization exceeds 100%.

Next!

Attempt #5 - Absolutely disable multiplication

Commit, GDS run

Multiplication currently returns 0, so let's get rid of that logic entirely:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
diff --git a/src/extend_alu.v b/src/extend_alu.v
index 507a7ab8..f4ff31b2 100644
--- a/src/extend_alu.v
+++ b/src/extend_alu.v
@@ -28,8 +28,6 @@ module ExtendALU (
     reg signed [15:0] result;
     always @(*) begin
         case (instruction[8:7])
-            2'b00:
-                result = 0;
             2'b01:
                 case (instruction[5:4])
                     2'b00:
@@ -41,8 +39,6 @@ module ExtendALU (
                     2'b11:
                         result = x <<< 1;
                 endcase
-            2'b10:
-                result = 0;
             2'b11:
                 result = simple_alu_result;
         endcase

And...

[INFO GPL-0019] Util(%): 305.38
...
[ERROR GPL-0301] Utilization exceeds 100%.

The area increased?! Revert that and try again.

Attempt #6 - Streamline PROM

Commit, GDS run

Up till now, the UART accessed the ROM as a normal single-port RAM. But, if we let it poke directly at its internals we can save a bunch of logic and flops.

[INFO GPL-0019] Util(%): 204.53
...
[ERROR GPL-0301] Utilization exceeds 100%.

90% reduction! But still not enough.

Attempt #7 - Simplify 7-segment somewhat

Commit, GDS run

The Verilog code for driving the 7-segment display that is used by the hardware LFSR has some redundancy that can be removed.

[INFO GPL-0019] Util(%): 204.00
...
[ERROR GPL-0301] Utilization exceeds 100%.

Attempt #8 - Disable multiplication (again)

Commit, GDS run

Okay, I have another idea for removing multiplication...

[INFO GPL-0019] Util(%): 204.56
...
[ERROR GPL-0301] Utilization exceeds 100%.

😑

But it's not that bad. Let's leave it as-is.

Attempt #9 - Optimize LFSR assembly, reduce ROM size

Commit 1, Commit 2, GDS run

With the recent RAM optimizations we can save a couple of instructions in the LFSR program:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
diff --git a/src/lfsr.asm b/src/lfsr.asm
index cc58ff1..9172e67 100644
--- a/src/lfsr.asm
+++ b/src/lfsr.asm
@@ -9,25 +9,21 @@
     M=D

 (lblLoop)
-    // Each instruction takes a cycle, and there are 10 instructions in the
-    // busy loop, so we need 625 loops to wait a second, with a clock speed
+    // Each instruction takes a cycle, and there are 8 instructions in the
+    // busy loop, so we need 781.25 loops to wait a second, with a clock speed
     // of 6250 Hz.
     @R0
     D=M
-    @624
+    @781
     D=D-A
     @lblNext
     D;JEQ

-    @R0
-    M=M+1
-    @lblLoop
-    0;JMP
+    @lblLoop    // Since lblLoop is at address 4, and we have only 2 RAM slots,
+                // this wraps around to 0 :)
+    M=M+1;JMP

 (lblNext)
-    @R0
-    M=0
-
     @16385  // IO_OUT
     D=M
     M=M>>
@@ -37,7 +33,7 @@
     D;JGT

     @lblLoop
-    0;JMP
+    M=0;JMP

 (lblXor)
     // R1 = ~(IO_OUT & Taps)
@@ -60,4 +56,4 @@
     M=D

     @lblLoop
-    0;JMP
+    M=0;JMP

This also means we can further reduce the ROM size:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
diff --git a/src/mbikovitsky_top.v b/src/mbikovitsky_top.v
index 917060e..a12e1f8 100644
--- a/src/mbikovitsky_top.v
+++ b/src/mbikovitsky_top.v
@@ -1,7 +1,7 @@
 module mbikovitsky_top #(
     parameter CLOCK_HZ = 6250,
     parameter BAUD = 781,
-    parameter ROM_WORDS = 42,
+    parameter ROM_WORDS = 38,
     parameter RAM_WORDS = 2  // Must be a power of 2
 ) (
     input [7:0] io_in,

And, since the instruction pointer is only 15 bits wide, we can save one flip-flop by cutting that bit from the PC register.

And...

[INFO GPL-0019] Util(%): 191.77
...
[ERROR GPL-0301] Utilization exceeds 100%.

Inching ever closer.

Attempt #10 - Implement XOR instruction

Commit, GDS run

Time for more drastic measures. Instead of XORing by hand, let's add a new instruction to the CPU. It will take the place of the multiplication instruction we removed. This means we can further reduce the size of the LFSR program and the ROM.

The whole thing is now only 27 instructions long.

[INFO GPL-0019] Util(%): 149.37
...
[ERROR GPL-0301] Utilization exceeds 100%.

Attempt #11 - Further optimize LFSR assembly

Commit, GDS run

We can shave off 3 more instructions using some clever bit twiddling.

In each cycle of an LFSR, the XOR with the taps is performed only if the LSB of the current state is 1. The current program implements this using a branch, but we can remove it using this neat trick: lfsr ^= (-lsb) & taps3.

Now at 24 instructions.

[INFO GPL-0019] Util(%): 140.78
...
[ERROR GPL-0301] Utilization exceeds 100%.

Attempt #12 - Downclock

Commit, GDS run

By reducing the clock speed from 6250 Hz to 625 Hz we can save a couple of flops. Specifically, both the hardware LFSR and the UART have registers for counting clock cycles. If the clock is slower, we need less bits for those counters.

[INFO GPL-0019] Util(%): 140.41
...
[ERROR GPL-0301] Utilization exceeds 100%.

Attempt #13 - Remove RAM cell

Commit, GDS run

We don't actually need 2 RAM cells at this point. We just need one for storing the loop count. Let's get rid of the other one.

[INFO GPL-0019] Util(%): 136.29
...
[ERROR GPL-0301] Utilization exceeds 100%.

Attempt #14 - Get rid of loop in LFSR assembly

Commit, GDS run

Let's do away with the constraint that the LFSR should update once a second. Removing the busy loop reduces the instruction count (and thus ROM size) to 16.

[INFO GPL-0019] Util(%): 103.36
...
[ERROR GPL-0301] Utilization exceeds 100%.

So close!

Attempt #15 - Get rid of RAM entirely

Commit, GDS run

Since we removed the busy loop, we don't actually need any RAM. Let's cut it out of the design and update the memory map:

Low address High address Description
0 0x3FFF Zeroes
0x4000 0x4000 io_in (high 8 bits are always 0 on read)
0x4001 0x4001 io_out (high 8 bits are ignored on write, 0 on read)

And...

[INFO GPL-0019] Util(%): 99.59
...
[ERROR GPL-0302] Use a higher -density or re-floorplan with a larger core area.
Given target density: 0.55
Suggested target density: 1.00

It fits! Except it doesn't, because apparently the density requirement is 55%. So we need to cut more stuff out.

Attempt #16 - Cut ROM in half

Commit, GDS run

Right now the program has some initialization code that sets the LFSR's initial state. This is a waste of ROM, since that code is run only once. Instead, let's use a two-phase approach:

  1. Upload an initialization program.
  2. Run the CPU.
  3. Stop the CPU.
  4. Upload the actual LFSR program.
  5. Run the CPU again.

Our main program now becomes this, at only 8 instructions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
@16385  // lfsr
MD=D^M
M=M>>

// (-lsb) & taps
@1
D=D&A
D=-D
@142    // taps - 0x8E
D=D&A

The main trick is that if the ROM size is also 8 words, we don't need an explicit jump anymore. Instead, when execution falls off the bottom, the program counter will simply wrap around.

And...

[INFO GPL-0019] Util(%): 67.72
...
[ERROR GPL-0302] Use a higher -density or re-floorplan with a larger core area.
Given target density: 0.55
Suggested target density: 0.68

Just a little further...

Attempt #17 - Further reduce memory

Commit, GDS run

There's no reason for us to be able to read from io_in, right? It doesn't help the LFSR code at all. Let's make it so that all memory accesses go directly to io_out. This'll save us one instruction:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-@16385  // lfsr
 MD=D^M
 M=M>>

 // (-lsb) & taps
 @1
 D=D&A
 D=-D
 @142    // taps - 0x8E
 D=D&A

But won't actually save on ROM, since we still need the wrap-around behaviour, which is only possible if the size is a power of 2.4 Hopefully we'll save some muxes, though...

[INFO GPL-0019] Util(%): 67.28
...
[ERROR GPL-0302] Use a higher -density or re-floorplan with a larger core area.
Given target density: 0.55
Suggested target density: 0.68

Man...

Attempt #18 - Cut ROM in half

Diff, GDS run

I don't think there's a way to get around this. We have to cut the ROM again. The LFSR program won't fit anymore, but at least we'll have something...

[INFO GPL-0019] Util(%): 50.71

So we have a Hack CPU with 4 words of ROM and 1 word of RAM, the bottom 8 bits of which are used to drive the output I/O pins. It is just enough for this:

1
2
3
4
@x
D=A
@y
M=D+A

Or a counter loop:

1
2
3
4
M=0
M=M+1
@1
0;JMP

And now we wait for the silicon to arrive.5

See you next year, folks πŸŽ„.

Appendix: Hack ISA

The Hack CPU has only two 16-bit general-purpose registers β€” A and D β€” and only two instructions β€” the A-Instruction and the C-Instruction.

A-Instruction

C-Instruction

For example, AD=-1 stores the value -1 in the A and D registers. In this case, the comp part is -1, and the dest part is AD.

The C-Instruction also recognizes the pseudo-register M, which is interpreted as the RAM location referenced by the A register. So, for instance, the following program:

1
2
@7
M=1

Will store 1 at RAM location 7.

The jump specification supports both conditional and unconditional jumps. For conditional jumps, the condition is the result of the computation performed by the current instruction. The jump target is taken from the A register. For example:

1
2
@5
D;JGT

Will jump to address 5 in ROM if the value of the D register is strictly greater than 0.

On the other hand, 0;JMP performs an unconditional jump. Since a comp must always be supplied, we just use 0.

Refer to the "Machine Language" chapter of the Nand2Tetris course for a full reference.


  1. This is the final submitted version. It's cleaned-up and doesn't have an off-by-one bug that was present in the original version. ↩︎

  2. The two other commits there were made within the next 15 minutes, and are purely cosmetic. They design would've still worked even if they didn't get in. ↩︎

  3. Taken from Wikipedia↩︎

  4. Wait, if the ROM has 8 words, but the program is only 7 words, won't we execute one garbage instruction before wrapping around to the start? Not if we zero-initialize the ROM. Zero translates to the @0 instruction, which is harmless here. ↩︎

  5. Actually, the final utilization is 52.09%, because I needed to fix a bug πŸ˜…. But that would ruin the flow of the narrative. ↩︎