Hardware Implementation of Low-Latency 32-bit Floating-Point Reciprocal

Daniel Kho CK, Ahmad Fauzi M

Abstract

As the speed requirements of imaging and communications systems increase, the latency requirements placed on their digital circuits also become more stringent. Under such tight latency and timing constraints, circuits built with many pipeline stages must be redesigned to meet low-latency targets. Most modern imaging and communications systems rely on digital signal processing (DSP) that computes complex mathematical operations. The emergence of powerful, low-cost field-programmable gate array (FPGA) devices with hundreds of hardware multipliers has enabled many such DSP applications, traditionally implemented only in software, to move into hardware. The reciprocal square root algorithm is a popular technique for computing square roots and is widely used in software applications. This paper shows how the algorithm can be implemented efficiently in hardware, making it suitable for low-latency, mathematically intensive applications. On a low-cost FPGA, the implementation occupies fewer than 1,000 look-up tables (LUTs), which on an Artix-7 XC7A200T device translates to less than 1% of the chip's LUT resources.
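To show why the reciprocal square root is a convenient route to the square root, the sketch below uses the widely published software "fast inverse square root" bit trick: an initial estimate formed by shifting and subtracting the raw IEEE 754 single-precision bits, refined by Newton-Raphson iterations, with sqrt(x) then obtained as x * (1/sqrt(x)). This is an illustration under stated assumptions, not the authors' hardware design: the magic constant 0x5f3759df, the iteration count, and the function names `rsqrt_approx` and `sqrt_via_rsqrt` are hypothetical choices made here. The shift and subtract involve no multiplication, and each refinement step costs only a few multiplications and a subtraction, which is the kind of structure that would plausibly map onto FPGA multiplier blocks.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, normal single-precision x.
 * Assumption: the classic software bit trick with constant 0x5f3759df,
 * shown for illustration only (not the paper's hardware design). */
static float rsqrt_approx(float x, int iterations)
{
    assert(isnormal(x) && x > 0.0f);      /* trick assumes a positive normal float */

    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret float bits without UB */
    bits = 0x5f3759dfu - (bits >> 1);     /* halve and negate the exponent, roughly */

    float y;
    memcpy(&y, &bits, sizeof y);          /* initial estimate, about 3.4% max error */

    const float half_x = 0.5f * x;
    for (int i = 0; i < iterations; ++i) {
        /* Newton-Raphson step for f(y) = 1/y^2 - x; each step roughly
         * squares the relative error. */
        y = y * (1.5f - half_x * y * y);
    }
    return y;
}

/* Square root via the identity sqrt(x) = x * (1/sqrt(x)). */
static float sqrt_via_rsqrt(float x, int iterations)
{
    return x * rsqrt_approx(x, iterations);
}
```

With one iteration the relative error of `rsqrt_approx` stays below about 0.2%; a second iteration brings it to the order of 1e-6, which is close to single-precision rounding. For positive x, squaring the result also yields an approximate reciprocal, since (1/sqrt(x))^2 = 1/x.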

Relevant Publications in Electrical & Electronic Systems