Question #1
Your ECE friends over at Hamerschlag Hall are looking to implement hardware for a new computing system and have asked for your help in choosing a specification for their 16-bit floating point format.
In addition to the 64-bit (FP64) and 32-bit (FP32) formats, the IEEE 754 standard also specifies a 16-bit (FP16) floating point format. The 16 bits are divided as follows:
- 1 sign bit
- 5 EXP bits
- 10 FRAC bits
Google Brain, however, created their own Brain Floating Point Format (BFLOAT16) for use in their deep learning systems. The 16 bits are divided as follows:
- 1 sign bit
- 8 EXP bits
- 7 FRAC bits
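
As a concrete (and purely illustrative) reference, the sketch below shows how a raw 16-bit pattern splits into sign/EXP/FRAC fields under each of the two layouts above. It is not part of the question; the names `fields_t`, `unpack_fp16`, and `unpack_bfloat16` are just assumptions made for this example, and the field widths come straight from the specs listed above.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned sign;  /* 1 bit in both formats                */
    unsigned exp;   /* 5 bits in FP16, 8 bits in BFLOAT16   */
    unsigned frac;  /* 10 bits in FP16, 7 bits in BFLOAT16  */
} fields_t;

/* FP16 layout: [sign:1][EXP:5][FRAC:10] */
fields_t unpack_fp16(uint16_t bits) {
    fields_t f;
    f.sign = (bits >> 15) & 0x1;
    f.exp  = (bits >> 10) & 0x1F;
    f.frac = bits & 0x3FF;
    return f;
}

/* BFLOAT16 layout: [sign:1][EXP:8][FRAC:7] */
fields_t unpack_bfloat16(uint16_t bits) {
    fields_t f;
    f.sign = (bits >> 15) & 0x1;
    f.exp  = (bits >> 7) & 0xFF;
    f.frac = bits & 0x7F;
    return f;
}

int main(void) {
    uint16_t bits = 0x3C00;             /* 1.0 when read as FP16 (EXP = 15, FRAC = 0) */
    fields_t h = unpack_fp16(bits);
    fields_t b = unpack_bfloat16(bits); /* same bit pattern read under the BFLOAT16 layout */
    printf("FP16:     sign=%u exp=%u frac=%u\n", h.sign, h.exp, h.frac);
    printf("BFLOAT16: sign=%u exp=%u frac=%u\n", b.sign, b.exp, b.frac);
    return 0;
}
```

Note that the same 16-bit pattern decodes to different EXP and FRAC values under the two layouts, which is exactly the tension the questions below ask you to reason about.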
a. Describe the tradeoffs between the FP16 and BFLOAT16 formats, i.e., how they compare in range (largest and smallest positive values) and step size (distance between neighboring numbers). No need to calculate anything; just give a qualitative explanation based on the specs of each format.
b. List any problem(s) that might arise when converting certain numbers from FP16 to BFLOAT16 and vice versa.
c. Now think about how converting from FP16 to FP32 would work. What would you need to do to the EXP and FRAC fields of the FP16 number?
d. Google Brain was formed in 2011 to leverage massive computing resources to perform deep learning research. Knowing that they need to do a ton of number conversions, why do you think they chose to create their own 16-bit floating point format that uses exactly 8 EXP bits? (Hint: How many EXP bits does FP32 have?)