A deep technical journey through implementing bfloat16 floating-point arithmetic from scratch in RTL, culminating in two ASIC tapeouts on IHP 130nm silicon via Tiny Tapeout. It covers IEEE 754 internals (NaN, subnormals, rounding modes, signed zero, infinity); the rationale for choosing bfloat16 with round-toward-zero and no support for subnormals, NaN or infinity; a dual-path adder architecture; a radix-4 Booth multiplier; verification using the C++23 <stdfloat> type std::bfloat16_t as a golden model (and the pitfalls of its float32 emulation); and Yosys synthesis surprises with leading-zero counter (LZC) implementations. The design reached 454 MHz on the sg13cmos5l node.

Table of contents
How floating point works
What you never wanted to know
Adder example
Building the hard part
ASIC implementation rules
Optimizing
What do we need
Architecture
Verification
Implementation
Closing