This is a copy of a document by Christopher Vickery. It resided at this URL, but that server no longer seems to be responding, and the document is far too useful to consign to oblivion (in my opinion, the tables towards the end of the page are particularly useful), so I have grabbed a copy from Google's cache to preserve it. If you are aware of a proper current home for this page, please inform me (A.M.Iwi {at} rl.ac.uk) and I will replace this page with a redirect.
This page provides some links to places to go for more information about the IEEE-754 standard. You can order a copy of the standards document from [ the IEEE ].
The Q4, 1999 issue of the Intel Technology Journal has an article, IA-64 Floating-Point operations and the IEEE standard for binary floating-point arithmetic, by M. Cornea-Hasegan and B. Norin, which provides a summary of the IEEE-754 standard and includes a table of ten different floating-point formats used by Intel's 64-bit microprocessors (the IA-64 architecture). Most of the article discusses features of the IA-64 floating-point instruction set.
A significant floating-point standard, which pre-dates the IEEE-754 standard, is the "hexadecimal encoding" used on IBM mainframes. This format uses sixteen instead of two as the base to which the exponent is raised. The IBM S/390 G5 processor was the first one to integrate traditional hexadecimal encoding and IEEE-754 in the same floating-point unit. It is described in the paper, The S/390 G5 floating point unit by E. M. Schwarz and C. A. Krygowski, which appeared in the IBM Journal of Research and Development, vol. 43, No. 5/6, September/November 1999, pp 707-721.
The rest of the material on this page came from Kevin J. Brewer, who worked for Delco Electronics at the time he wrote it. In addition to the material below, Kevin greatly refined the JavaScript code for the IEEE-754 Calculator page originally written by a Queens College student, Quanfei Wen.
There are currently three separate calculator pages:
- [ Show the IEEE-754 encoding of decimal numbers. ]
- [ Convert 32-bit floating-point values to decimal and show the bit fields within the floating-point value. ]
- [ Convert 64-bit floating-point values to decimal and show the bit fields within the floating-point value. ]
At the end of this page is [ Kevin's Chart ] which summarizes the IEEE-754 single and double precision formats.
It's the nature of the Web that some of the links below no longer work. If you find a broken one, please let me know, especially if you know where the page has moved. (Send mail to vickery at babbage.cs.qc.edu with "IEEE-754" in the Subject line.)
Kevin suggested, "Scroll up and down from the locations cited below in order to learn other information about the IEEE-754 standard."
The source which showed me that there were actually positive and negative NaNs and introduced me to a new special number, Indeterminate, was [ this page ]. To find the table showing these NaNs and Indeterminate, use the Edit | Find... command on the string "the corresponding values". Scroll up a little in order to take a look at the "Special Operations" table. And right above that table is the list of special numbers and their meanings.
The source which introduced me to the concepts of "signaling" and "quiet" NaNs was [ http://www.cas.american.edu/~studdard/classes/fall1995/4028201/notes/17oct95/I.html ]. To find the section on "signaling" and "quiet" NaNs, use the Edit | Find... command on the string "NaNs can be signaling or quiet".
The source which allowed me to distinguish between "signaling" and "quiet" NaNs was [ this page ]. To find the section on NaNs and the encodings of other special numbers, use the Edit | Find... command on the string "The definition of NaNs".
[ This source ] shows the mathematical equations which define the various IEEE-754 values and ranges.
The source which introduced me to IEEE-754's four rounding modes and the guard, round, and "sticky" bits was [ this page ]. To find the section on rounding, use the Edit | Find... command on the string "four different rounding modes".
Some sources on the Web claim that IEEE-754 specifies four floating-point formats in two groups, basic and extended, with a "single-precision" and a "double-precision" format in each of the two groups. To find this information, use the Edit | Find... command on the string "IEEE 754 specifies four" on [ http://www.cas.american.edu/~studdard/classes/fall1995/4028201/notes/17oct95/I.html ] and the Edit | Find... command on the string "The other two formats" on [ this page ].
Upon reading the IEEE-754 standard, one learns from "Table 1, Summary of Format Parameters" on page 9 that the extended formats are very loosely defined, with unspecified exponent biases and only lower bounds for precisions and exponents, while the basic formats are specified exactly in terms of field widths and semantics. The extended formats are so loosely defined that particular implementations may differ enough for numerical approximation routines using them to be non-portable.
Other sources on the Web claim that IEEE-754 specifies only three floating-point formats, "single-precision", "double-precision", and "quadruple-precision". [ One source ] shows the three IEEE-754 formats and their max and min values in DEC's Fortran-90 documentation. To find the section on the three IEEE-754 formats, use the Edit | Find... command on the string "32-bit IEEE". [ Another source ] shows the encodings of the special numbers and the number of bits in each field for each of the three IEEE-754 formats. To find the sections on the three IEEE-754 formats, use the Edit | Find... command on the string "For single-precision floating point numbers" and start scrolling down.
When comparing the format parameters of "extended double-precision" in IEEE-754's Table 1 and those of the so-called "quadruple-precision", one finds that the "quadruple-precision" format is simply a specific instance of the "extended double-precision" format. Similarly, one will note that "double-precision" is a specific instance of "extended single-precision".
The 80-bit "extended-precision" format is used "internally" by the Intel 80x87 floating-point math "co-processor" in order to be able to shift operands back and forth without any loss of precision in the IEEE-754 64-bit (and 32-bit) format. To find this information, use the Edit | Find... command on the string "it also implements an "extended-precision" format" on [ http://www.cas.american.edu/~studdard/classes/fall1995/4028201/notes/17oct95/I.html ].
A source which describes the exponent bias of Intel's 80-bit "extended-precision" format and its usage of the additional bits it contains relative to the "double-precision" format is [ http://webster.cs.ucr.edu/Page_asm/ArtofAssembly/CH14/CH14-1.html ]. To find this data, use the Edit | Find... command on the string "In order to help ensure accuracy".
[ http://webster.cs.ucr.edu/Page_asm/ArtofAssembly/CH14/CH14-3.html ] states that Intel's "extended-precision" format supports non-normalized numbers (values very close to zero whose most significant mantissa bit is not zero). To find this support information, use the Edit | Find... command on the string "Normalized values provide".
When one compares these stated and implied format parameters of Intel's "extended-precision" with those of "extended double-precision" in Table 1, one finds that the "extended-precision" format is a specific instance of the "extended double-precision" format, similarly to the "quadruple-precision" format.
Table 1 (Expanded)
Summary of Format Parameters

| No. | Parameter | Single | Single Extended | Double | Double Extended | Quadruple^{ +} | Extended^{ #} |
|---|---|---|---|---|---|---|---|
| (1) | p (precision, apparent mantissa width in bits) | 24 | ≥ 32 | 53 | ≥ 64 | 113 | 64 |
| (2) | Decimal digits of precision, p / log_{2}(10) | 7.22 | ≥ 9.63 | 15.95 | ≥ 19.26 | 34.01 | 19.26 |
| (3) | Mantissa's MS-Bit | hidden bit | unspecified | hidden bit | unspecified | hidden bit | explicit bit |
| (4) | Actual mantissa width in bits | 23 | ≥ 31 | 52 | ≥ 63 | 112 | 64 |
| (5) | E_{max} | +127 | ≥ +1023 | +1023 | ≥ +16383 | +16383 | +16383 |
| (6) | E_{min} | -126 | ≤ -1022 | -1022 | ≤ -16382 | -16382 | -16382 |
| (7) | Exponent bias | +127 | unspecified | +1023 | unspecified | +16383 | +16383 |
| (8) | Exponent width in bits | 8 | ≥ 11 | 11 | ≥ 15 | 15 | 15 |
| (9) | Sign width in bits | 1 | 1 | 1 | 1 | 1 | 1 |
| (10) | Format width in bits, (9) + (8) + (4) | 32 | ≥ 43 | 64 | ≥ 79 | 128 | 80 |
| (11) | Range Magnitude Maximum, 2^{Emax + 1} | 3.4028E+38 | ≥ 1.7976E+308 | 1.7976E+308 | ≥ 1.1897E+4932 | 1.1897E+4932 | 1.1897E+4932 |
| (12) | Range Magnitude Minimum, 2^{Emin} | 1.1754E-38 | ≤ 2.2250E-308 | 2.2250E-308 | ≤ 3.3621E-4932 | 3.3621E-4932 | 3.3621E-4932 |
| (13) | Range Magnitude Minimum (Denormalized), 2^{Emin - (4)} | 1.4012E-45 | ≤ 1.0361E-317 | 4.9406E-324 | ≤ 3.6451E-4951 | 6.4751E-4966 | 1.8225E-4951 |
| (14) | FORTRAN Language Type | REAL*4 | | REAL*8 | | REAL*16 | REAL*10 |
| (15) | C Language Type | float | | double | | long double | long double |

© Copyright 1985 by The Institute of Electrical and Electronics Engineers, Inc
^{+ }Although the "quadruple-precision" name and the particular parameters of its format are not specified in the IEEE-754 standard, it is a legally derived IEEE-754 format because its parameters are specific subset elements within the bounds of those specified for the "extended double-precision" format.
^{# }Like the "quadruple-precision" format, Intel's "extended-precision" format is a legal IEEE-754 format derived from the "extended double-precision" format.
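The derived rows of the table follow from a handful of base parameters, and the footnotes' claim that a format is a "specific instance" of the extended double-precision format amounts to checking lower bounds. The following is a minimal Python sketch of both ideas; the dictionary keys and function names are hypothetical, chosen only for illustration, and the digit counts are truncated to two decimals because that is how the table's values appear to have been produced.

```python
import math

# Rows (1), (4) and (8) of the table, keyed by hypothetical short names.
# Each entry is (p, stored mantissa bits, exponent bits).
FORMATS = {
    "single": (24, 23, 8),
    "double": (53, 52, 11),
    "quadruple": (113, 112, 15),
    "intel_extended": (64, 64, 15),
}

def decimal_digits(p):
    """Row (2): decimal digits of precision, p / log2(10)."""
    return p / math.log2(10)

def format_width(mantissa_bits, exponent_bits, sign_bits=1):
    """Row (10): sign width + exponent width + stored mantissa width."""
    return sign_bits + exponent_bits + mantissa_bits

def fits_double_extended(p, exponent_bits):
    """Check the lower bounds the table lists for Double Extended."""
    return p >= 64 and exponent_bits >= 15

for name, (p, mantissa_bits, exponent_bits) in FORMATS.items():
    # Truncate (not round) to two decimals, as the table seems to do.
    digits = math.floor(decimal_digits(p) * 100) / 100
    print(name, digits, format_width(mantissa_bits, exponent_bits))
```

Under these assumptions, both the quadruple and the Intel extended parameters pass `fits_double_extended`, while the basic double format does not.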
Other sources on IEEE-754 include:
- [ http://spectra.eng.hawaii.edu/Courses/EE361.S95/Lectures/Lec38/lec38.3.html ]
- [ http://www.ece.uiuc.edu/~ece291/lecture/l11.html ]
- [ Carleton University ]
- [ Florida State University ]
- [ http://duke.usask.ca/~reeves/prog/geoe314/geoe314.005b.html ]
- [ Grinnell College ]
- [ Papers on Floating-Point by William Kahan -- "The Father of IEEE-754" ]
Kevin's Summary Charts
Storage Layout and Ranges of Floating-Point Numbers
IEEE-754 floating-point numbers require three component fields: the sign, the exponent, and the mantissa. The exponential base is 2 and is never stored in any way with the value in either the registers or memory (it is implied). In order to allow the exponent and mantissa, when taken together, to vary monotonically, the signed exponent is represented in excess-127 unsigned form for single precision and excess-1023 for double precision. This excess-127 (or excess-1023) representation is indicated by the variable "e" below.
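As an illustration of the three fields and the excess-127 exponent, here is a small Python sketch (not part of Kevin's original material) that splits a single-precision value into s, e and m using the standard `struct` module. The function names are hypothetical. It also shows the monotonic property mentioned above: for positive values, the exponent and mantissa read together as one unsigned integer sort in the same order as the values themselves.

```python
import struct

BIAS_SINGLE = 127  # excess-127 representation for single precision

def single_fields(x):
    """Return (s, e, m) for x encoded as a 32-bit IEEE-754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # 1 bit  [31]
    e = (bits >> 23) & 0xFF   # 8 bits [30-23], stored with the bias added
    m = bits & 0x7FFFFF       # 23 bits [22-0]
    return s, e, m

def positive_single_key(x):
    """Exponent and mantissa taken together as one unsigned integer."""
    _, e, m = single_fields(x)
    return (e << 23) | m

s, e, m = single_fields(0.5)
true_exponent = e - BIAS_SINGLE   # 126 - 127 = -1, since 0.5 = 1.0 x 2^-1

# Increasing positive values (all exactly representable in single precision).
values = [2.0**-149, 2.0**-126, 0.75, 1.0, 1.5, 2.0, 1024.0]
keys = [positive_single_key(v) for v in values]
```

For double precision the same approach would use the `">d"` and `">Q"` format codes with an 11-bit exponent, a 52-bit mantissa and a bias of 1023.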
Since IEEE-754 floating-point numbers are stored in a signed magnitude form, the ranges and binary patterns of the positive and negative numbers are symmetric about the midpoint of the entire range of values (between the positive and negative zeros). As a result, essentially any statement made in regard to the positive numbers is also true of the negative numbers and vice versa.
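The signed magnitude layout means that a number and its negative differ only in the sign bit. A brief Python sketch of that symmetry, using hypothetical helper names and the single-precision layout, might look like this:

```python
import math
import struct

SIGN_BIT_SINGLE = 0x80000000  # bit 31 of the 32-bit pattern

def single_bits(x):
    """32-bit pattern of x encoded as an IEEE-754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def from_single_bits(bits):
    """Value of a 32-bit IEEE-754 single pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def negate_by_sign_bit(x):
    """Negate x by flipping only the sign bit of its single encoding."""
    return from_single_bits(single_bits(x) ^ SIGN_BIT_SINGLE)

# Positive and negative zero compare equal but have distinct patterns.
plus_zero_bits = single_bits(0.0)
minus_zero_bits = single_bits(-0.0)
```

Flipping the sign bit of +0 yields -0, which is why the two zeros sit on either side of the midpoint of the range.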
The range of positive floating-point numbers is split into normalized numbers (normal numbers) which preserve the full precision of the mantissa, including the hidden bit, (24 bits for single precision and 53 bits for double precision) and denormalized numbers (subnormal numbers, so-called unnormalized numbers) which have from 1 to 23 significant bits for single precision and 1 to 52 bits for double precision.
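To make the loss of precision in the denormalized range concrete, the sketch below counts the significant bits of a single-precision value: 24 for any normalized number, and the distance from the leading 1-bit to the end of the mantissa for a denormalized one. The function name is hypothetical, and infinities and NaNs are rejected rather than assigned a precision.

```python
import struct

def significant_bits_single(x):
    """Significant mantissa bits of x when stored as an IEEE-754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0xFF:
        raise ValueError("infinities and NaNs have no precision")
    if e == 0:
        # Denormalized (or zero): no hidden bit, so precision runs from
        # the leading 1-bit of m down to bit 0. Zero yields 0.
        return m.bit_length()
    return 24  # hidden bit plus 23 stored bits

smallest_denormal = 2.0**-149
largest_denormal = (1 - 2.0**-23) * 2.0**-126
```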
The number line tables below, which show the layout for single (32-bit) and double (64-bit) precision floating-point numbers and their special values, were inspired by the table on [ this page ]. To find the table on which these two are based, use the Edit | Find... command on the string "the corresponding values". In their column headers, these tables indicate the number of bits in each field along with their bit ranges in square brackets.
The values shown in the Decimal Range column of the tables are the end points of their respective ranges with the IEEE-754 round-to-nearest value mode applied. JavaScript uses IEEE-754 double precision floating-point with round-to-nearest value mode to perform all of its arithmetic operations including its input string to numeric conversion routine. Therefore, by default, double (64-bit) precision conversions are automatically rounded to values matching these tables. In order for single (32-bit) precision conversions to be rounded to values matching these tables, the user must click the Rounded button on those pages where it is present.
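The double-then-single rounding described here can be imitated outside the calculator pages. The sketch below uses Python rather than JavaScript; on typical platforms Python floats are also IEEE-754 doubles, and `struct` with the `f` format code rounds a double to the nearest single. The helper `to_single` is a hypothetical name, and mapping `OverflowError` to infinity is an assumption made so that the result matches the overflow rows of the tables (`struct` itself refuses to pack a finite value that rounds to infinity).

```python
import math
import struct

def to_single(x):
    """Round a double to the nearest single, returned as a double."""
    try:
        return struct.unpack(">f", struct.pack(">f", x))[0]
    except OverflowError:
        # struct raises instead of producing infinity; follow the tables.
        return math.copysign(math.inf, x)

largest_single = to_single(float("3.4028234663852886E+38"))
overflowed = to_single(float("3.4028235677973365E+38"))
smallest_normal = to_single(float("1.1754943508222875E-38"))
```

A value such as 0.1, which has no exact binary representation, changes when passed through `to_single`, because the single has fewer mantissa bits than the double it came from.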
32-bit Single Precision

| Range Name | Sign (s) 1 [31] | Exponent (e) 8 [30-23] | Mantissa (m) 23 [22-0] | Hexadecimal Range | Range | Decimal Range^{ §} |
|---|---|---|---|---|---|---|
| Quiet -NaN | 1 | 11..11 | 11..11 to 10..01 | FFFFFFFF to FFC00001 | | |
| Indeterminate | 1 | 11..11 | 10..00 | FFC00000 | | |
| Signaling -NaN | 1 | 11..11 | 01..11 to 00..01 | FFBFFFFF to FF800001 | | |
| -Infinity (Negative Overflow) | 1 | 11..11 | 00..00 | FF800000 | < -(2-2^{-23}) × 2^{127} | ≤ -3.4028235677973365E+38 |
| Negative Normalized: -1.m × 2^{(e-127)} | 1 | 11..10 to 00..01 | 11..11 to 00..00 | FF7FFFFF to 80800000 | -(2-2^{-23}) × 2^{127} to -2^{-126} | -3.4028234663852886E+38 to -1.1754943508222875E-38 |
| Negative Denormalized: -0.m × 2^{(-126)} | 1 | 00..00 | 11..11 to 00..01 | 807FFFFF to 80000001 | -(1-2^{-23}) × 2^{-126} to -2^{-149} (-(1+2^{-52}) × 2^{-150})^{ *} | -1.1754942106924411E-38 to -1.4012984643248170E-45 (-7.0064923216240862E-46)^{ *} |
| Negative Underflow | 1 | 00..00 | 00..00 | 80000000 | -2^{-150} to < -0 | -7.0064923216240861E-46 to < -0 |
| -0 | 1 | 00..00 | 00..00 | 80000000 | -0 | -0 |
| +0 | 0 | 00..00 | 00..00 | 00000000 | 0 | 0 |
| Positive Underflow | 0 | 00..00 | 00..00 | 00000000 | > 0 to 2^{-150} | > 0 to 7.0064923216240861E-46 |
| Positive Denormalized: 0.m × 2^{(-126)} | 0 | 00..00 | 00..01 to 11..11 | 00000001 to 007FFFFF | ((1+2^{-52}) × 2^{-150})^{ *} 2^{-149} to (1-2^{-23}) × 2^{-126} | (7.0064923216240862E-46)^{ *} 1.4012984643248170E-45 to 1.1754942106924411E-38 |
| Positive Normalized: 1.m × 2^{(e-127)} | 0 | 00..01 to 11..10 | 00..00 to 11..11 | 00800000 to 7F7FFFFF | 2^{-126} to (2-2^{-23}) × 2^{127} | 1.1754943508222875E-38 to 3.4028234663852886E+38 |
| +Infinity (Positive Overflow) | 0 | 11..11 | 00..00 | 7F800000 | > (2-2^{-23}) × 2^{127} | ≥ 3.4028235677973365E+38 |
| Signaling +NaN | 0 | 11..11 | 00..01 to 01..11 | 7F800001 to 7FBFFFFF | | |
| Quiet +NaN | 0 | 11..11 | 10..00 to 11..11 | 7FC00000 to 7FFFFFFF | | |
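The single-precision table can be read as a decision procedure on the three fields. The following Python sketch, with a hypothetical function name, maps a 32-bit pattern to the range names used in the table; it is an illustration of the layout above, not code from the calculator pages. The underflow and overflow rows describe inputs to a conversion rather than stored patterns, so they do not appear as separate results.

```python
def classify_single(bits):
    """Name the table row that a 32-bit IEEE-754 single pattern falls in."""
    sign = (bits >> 31) & 1
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    prefix = "-" if sign else "+"
    word = "Negative" if sign else "Positive"

    if e == 0xFF:                      # exponent all ones
        if m == 0:
            return prefix + "Infinity"
        if bits == 0xFFC00000:         # the one pattern the table singles out
            return "Indeterminate"
        # The top mantissa bit separates quiet from signaling NaNs.
        kind = "Quiet" if m & 0x400000 else "Signaling"
        return kind + " " + prefix + "NaN"
    if e == 0:                         # exponent all zeros
        if m == 0:
            return prefix + "0"
        return word + " Denormalized"
    return word + " Normalized"
```

For example, `classify_single(0x7F7FFFFF)` names the largest positive normalized pattern, and `classify_single(0x00000001)` the smallest positive denormalized one.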
64-bit Double Precision

| Range Name | Sign (s) 1 [63] | Exponent (e) 11 [62-52] | Mantissa (m) 52 [51-0] | Hexadecimal Range | Range | Decimal Range^{ §} |
|---|---|---|---|---|---|---|
| Quiet -NaN | 1 | 11..11 | 11..11 to 10..01 | FFFFFFFFFFFFFFFF to FFF8000000000001 | | |
| Indeterminate | 1 | 11..11 | 10..00 | FFF8000000000000 | | |
| Signaling -NaN | 1 | 11..11 | 01..11 to 00..01 | FFF7FFFFFFFFFFFF to FFF0000000000001 | | |
| -Infinity (Negative Overflow) | 1 | 11..11 | 00..00 | FFF0000000000000 | < -(2-2^{-52}) × 2^{1023} | ≤ -1.7976931348623158E+308 |
| Negative Normalized: -1.m × 2^{(e-1023)} | 1 | 11..10 to 00..01 | 11..11 to 00..00 | FFEFFFFFFFFFFFFF to 8010000000000000 | -(2-2^{-52}) × 2^{1023} to -2^{-1022} | -1.7976931348623157E+308 to -2.2250738585072014E-308 |
| Negative Denormalized: -0.m × 2^{(-1022)} | 1 | 00..00 | 11..11 to 00..01 | 800FFFFFFFFFFFFF to 8000000000000001 | -(1-2^{-52}) × 2^{-1022} to -2^{-1074} (-(1+2^{-52}) × 2^{-1075})^{ *} | -2.2250738585072010E-308 to -4.9406564584124654E-324 (-2.4703282292062328E-324)^{ *} |
| Negative Underflow | 1 | 00..00 | 00..00 | 8000000000000000 | -2^{-1075} to < -0 | -2.4703282292062327E-324 to < -0 |
| -0 | 1 | 00..00 | 00..00 | 8000000000000000 | -0 | -0 |
| +0 | 0 | 00..00 | 00..00 | 0000000000000000 | 0 | 0 |
| Positive Underflow | 0 | 00..00 | 00..00 | 0000000000000000 | > 0 to 2^{-1075} | > 0 to 2.4703282292062327E-324 |
| Positive Denormalized: 0.m × 2^{(-1022)} | 0 | 00..00 | 00..01 to 11..11 | 0000000000000001 to 000FFFFFFFFFFFFF | ((1+2^{-52}) × 2^{-1075})^{ *} 2^{-1074} to (1-2^{-52}) × 2^{-1022} | (2.4703282292062328E-324)^{ *} 4.9406564584124654E-324 to 2.2250738585072010E-308 |
| Positive Normalized: 1.m × 2^{(e-1023)} | 0 | 00..01 to 11..10 | 00..00 to 11..11 | 0010000000000000 to 7FEFFFFFFFFFFFFF | 2^{-1022} to (2-2^{-52}) × 2^{1023} | 2.2250738585072014E-308 to 1.7976931348623157E+308 |
| +Infinity (Positive Overflow) | 0 | 11..11 | 00..00 | 7FF0000000000000 | > (2-2^{-52}) × 2^{1023} | ≥ 1.7976931348623158E+308 |
| Signaling +NaN | 0 | 11..11 | 00..01 to 01..11 | 7FF0000000000001 to 7FF7FFFFFFFFFFFF | | |
| Quiet +NaN | 0 | 11..11 | 10..00 to 11..11 | 7FF8000000000000 to 7FFFFFFFFFFFFFFF | | |

^{§ }Your least significant digits may differ.
^{* }The minimum magnitude values of denormalized ranges are represented by a single significant bit (a bit whose value is 1) at the right hand end of its format's mantissa. For single (32-bit) and double (64-bit) precision, these minimum range values are 1.4012984643248170E-45 and 4.9406564584124654E-324 respectively. The values 7.0064923216240862E-46 and 2.4703282292062328E-324 are each a little more than half of these minima. They are represented by one significant bit to the right of their format's storable mantissa and another 1-bit spaced the double precision's mantissa width to the right of the first bit. Then, as a result of the IEEE-754 round-to-nearest value mode's operation, these values are rounded to the denormalized range minimum values.
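The rounding behaviour described in this footnote can be checked with a short Python sketch. It assumes a Python build whose floats are IEEE-754 doubles with correctly rounded string conversion (true of typical platforms), and it imitates the single-precision calculator by converting the decimal string to a double first and then rounding that double to a single with `struct`. The helper name `to_single` is hypothetical.

```python
import math
import struct

def to_single(x):
    """Round a double to the nearest single, returned as a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

MIN_DENORMAL_DOUBLE = math.ldexp(1.0, -1074)  # 2^-1074
MIN_DENORMAL_SINGLE = math.ldexp(1.0, -149)   # 2^-149

# Double precision: the two decimal strings straddle 2^-1075, the halfway
# point between zero and the smallest denormalized double.
double_below = float("2.4703282292062327E-324")   # rounds down to zero
double_above = float("2.4703282292062328E-324")   # rounds up to 2^-1074

# Single precision via a double: the first string becomes exactly 2^-150,
# a tie that round-to-nearest-even resolves to zero; the second becomes
# the next double above 2^-150 and so rounds up to 2^-149.
single_below = to_single(float("7.0064923216240861E-46"))
single_above = to_single(float("7.0064923216240862E-46"))
```

In both formats the value ending in the smaller final digit lands in the underflow row of the tables, and the value ending in the larger one lands on the denormalized range minimum.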
[ Convert Decimal Floating-Point Numbers to IEEE-754 Hexadecimal Representations. ]
[ Convert IEEE-754 32-bit Hexadecimal Representations to Decimal Floating-Point Numbers. ]
[ Convert IEEE-754 64-bit Hexadecimal Representations to Decimal Floating-Point Numbers. ]