IEEE 754-based half-precision floating point library
This is a C++ header-only library to provide an IEEE 754 conformant 16-bit half-precision floating point type along with corresponding arithmetic operators, type conversions and common mathematical functions. It aims for both efficiency and ease of use, trying to accurately mimic the behaviour of the builtin floating point types at the best performance possible. It is hosted on SourceForge.net.
Version 1.8.1 of the library has been released. This just fixes a compile error when including
half.hpp in multiple files, resulting in multiple definitions of the nanh() function due to a missing
Version 1.8.0 of the library has been released. It adds support for a number of additional C++11 mathematical functions even if their single-precision counterparts are not supported, in particular exponential and logarithmic functions (exp2(), expm1(), log2(), log1p()), hyperbolic area functions (asinh(), acosh(), atanh()) and the hypotenuse function (hypot()). The fma() function now falls back to the default single-precision implementation when the single-precision version from <cmath> is available but not faster than the straightforward implementation. It is therefore always at least as fast as the manual half-precision x*y + z operation (while still being correctly rounded as a single operation), so FP_FAST_FMAH is practically always defined.
Furthermore, the internal expression implementation has been completely revised. This fixes issues with overload resolution which could occur when trying to call certain mathematical functions by unqualified invocation (relying on
using declarations or ADL) and led to ambiguities or the incorrect preference of the standard library functions over the half-precision versions.
The library in its most recent version can be obtained from here, see the Release Notes for further information:
If you are interested in previous versions of the library, see the SourceForge download page.
Conveniently, the library consists of just a single header file containing all the functionality, which can be directly included by your projects, without the necessity to build anything or link to anything.
The library needs an IEEE-754-conformant single-precision
float type, but this should be the case on most modern platforms. Whereas the library is fully C++98-compatible, it can profit from certain C++11 features. Support for those features is checked and enabled automatically at compile (or rather preprocessing) time, but can be explicitly enabled or disabled by defining the corresponding preprocessor symbols to either 1 or 0 yourself. This is useful when the automatic detection fails (for more exotic implementations) or when a feature should be explicitly disabled:
|C++11 feature||Used for||Enabled for (and newer)||Override with|
|functions returning ||VC++ 2003, gcc, clang|
|static assertions||extended compile-time checks||VC++ 2010, gcc 4.3, clang 2.9|
|generalized constant expressions||constant operations||gcc 4.6, clang 3.1|
|proper ||gcc 4.6, clang 3.0|
|user-defined literals||half-precision literals||gcc 4.7, clang 3.1|
|sized integer types from ||more flexible type sizes||VC++ 2010, libstdc++ 4.3, libc++|
|certain new ||corresponding half functions||libstdc++ 4.3, libc++|
|hash function for halfs||VC++ 2010, libstdc++ 4.3, libc++|
The library has been tested successfully with Visual C++ 2005 - 2012, gcc 4.4 - 4.7 and clang 3.1. Please contact me if you have any problems, suggestions or even just success testing it on other platforms.
What follows are some general words about the usage of the library and its implementation. For a complete reference documentation of its interface you should consult the API Documentation.
To make use of the library just include its only header file half.hpp, which defines all half-precision functionality inside the half_float namespace. The actual 16-bit half-precision data type is represented by the half type. This type behaves like the builtin floating point types as much as possible, supporting the usual arithmetic, comparison and streaming operators, which makes its use pretty straightforward:
Furthermore, the library provides proper specializations for std::numeric_limits, defining various implementation properties, and for std::hash for hashing half-precision numbers (assuming support for C++11 std::hash). Similar to the corresponding preprocessor symbols from <cmath>, the library also defines the HUGE_VALH constant and, where applicable, the FP_FAST_FMAH symbol.
The half type is explicitly constructible/convertible from a single-precision float argument. Thus it is also explicitly constructible/convertible from any type implicitly convertible to float, but constructing it from types like int will trigger the usual warnings that arise when implicitly converting those types to float, because of the lost precision. On the one hand those warnings are intentional, because converting those types to half necessarily also reduces precision. On the other hand they are raised even for explicit conversions from those types, where the user presumably knows what they are doing. So if those warnings keep bugging you, you will have to either explicitly convert to float first before converting to half, or use the half_cast() described below. In addition you can also directly assign float values to halfs.
For performance reasons the conversion from float to half uses truncation (round toward zero, but mapping overflows to infinity) for values not exactly representable in half-precision. If you need a different rounding behaviour (though this should rarely be the case), you can use half_cast(). In addition to performing an explicit cast between half and any other type convertible to/from float via an explicit cast to/from float (and thus without any warnings due to possible precision loss), it lets you explicitly specify the rounding mode to use for the float-to-half conversion. You can even synchronize it with the builtin single-precision implementation's rounding mode:
In contrast to the float-to-half conversion, which reduces precision, the conversion from half to float (and thus to any other type implicitly convertible to float) is implicit, because all values representable with half-precision are also representable with single-precision. This way the half-to-float conversion behaves similarly to the builtin float-to-double conversion and all arithmetic expressions involving both half-precision and single-precision arguments will be of single-precision type. You can therefore also directly use the mathematical functions of the C++ standard library, though in this case you will invoke the single-precision versions, which also return single-precision values. Even if this performs the exact same computation (see below), it is not as conceptually clean when working in a half-precision environment.
You may also specify explicit half-precision literals, since the library provides a user-defined literal inside the half_float::literal namespace, which you just need to import (assuming support for C++11 user-defined literals):
For performance reasons (and ease of implementation) many of the mathematical functions provided by the library, as well as all arithmetic operations, are actually carried out in single-precision under the hood, calling the C++ standard library implementations of those functions whenever appropriate, meaning the arguments are converted to
floats and the result back to half. But to reduce the conversion overhead as much as possible any temporary values inside of lengthy expressions are kept in single-precision as long as possible, while still maintaining a strong half-precision type to the outside world. Only when finally assigning the value to a half or calling a function that works directly on halfs is the actual conversion done (or never, when further converting the result to
This approach has two implications. First of all you have to treat the documentation on this site as a simplified version, describing the behaviour of the library as if implemented this way. The actual argument and return types of functions and operators may involve other internal types (feel free to generate the exact developer documentation from the Doxygen comments in the library's header file if you really need to). Nevertheless, the behaviour is exactly as specified in the documentation. The other implication is that in the presence of rounding errors or over-/underflows arithmetic expressions may produce different results when compared to converting to half-precision after each individual operation:
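The effect can be reproduced without the library. The following self-contained sketch is not the library's code: the hypothetical helper truncate_to_half_precision() only emulates an 11-bit significand with truncation, ignoring the half-precision exponent range, subnormals and special values. It compares rounding after every operation with rounding once at the end:

```cpp
#include <cmath>

// Hypothetical helper: truncate a float to 11 significant bits (round toward
// zero), emulating half-precision accuracy but not its exponent range.
float truncate_to_half_precision(float value)
{
    int exponent = 0;
    const float fraction = std::frexp(value, &exponent);       // value = fraction * 2^exponent
    const float kept = std::trunc(std::ldexp(fraction, 11));   // keep 11 bits, drop the rest
    return std::ldexp(kept, exponent - 11);
}

// Converts to reduced precision after each individual addition.
float rounded_after_each_step(float x)
{
    const float step = truncate_to_half_precision(x + 1.0f);
    return truncate_to_half_precision(step + 1.0f);
}

// Keeps the intermediate sum in single-precision and converts only once.
float rounded_once(float x)
{
    return truncate_to_half_precision(x + 1.0f + 1.0f);
}
```

For x = 2048 the representable values are 2 apart, so each single increment is lost when rounding after every step (the result stays 2048), whereas rounding once preserves the combined increment and gives 2050.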
But this should only be a problem in very few cases. One last word has to be said when talking about performance. Even with its efforts in reducing conversion overhead as much as possible, the software half-precision implementation can most probably not beat the direct use of single-precision computations. Usually using actual float values for all computations and temporaries and using halfs only for storage is the recommended way. On the one hand this makes the provided mathematical functions somewhat obsolete (especially in light of the implicit conversion from half to float), but nevertheless the goal of this library was to provide a complete and conceptually clean half-precision implementation, to which the standard mathematical functions belong, even if usually not needed.
The half type uses the standard IEEE representation with 1 sign bit, 5 exponent bits and 10 mantissa bits (11 when counting the hidden bit). It supports all types of special values, like subnormal values, infinity and NaNs. But there are some limitations to the complete conformance to the IEEE 754 standard:
Some of those points could have been circumvented by controlling the floating point environment using <cfenv> or implementing a similar exception mechanism. But this would have required excessive runtime checks, with too high an impact on performance for something that is rarely ever needed. If you really need to rely on proper floating point exceptions, it is recommended to explicitly perform computations using the builtin floating point types to be on the safe side. In the same way, if you really need to rely on a particular rounding behaviour, it is recommended to use single-precision computations and explicitly convert the result to half-precision using half_cast() and specifying the desired rounding mode. But those are really considered expert scenarios rarely encountered in practice, since actually working with half-precision usually comes with a certain tolerance/ignorance of exactness considerations.
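To make the bit layout mentioned above (1 sign bit, 5 exponent bits, 10 mantissa bits) more concrete, here is a self-contained decoding sketch. It is not the library's conversion routine, which according to the credits uses Jeroen van der Zijp's algorithms. The function name decode_binary16 is hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an IEEE 754 half-precision bit pattern into a float.
float decode_binary16(std::uint16_t bits)
{
    const int sign     = (bits >> 15) & 0x1;    // 1 sign bit
    const int exponent = (bits >> 10) & 0x1F;   // 5 exponent bits, bias 15
    const int mantissa = bits & 0x3FF;          // 10 stored mantissa bits

    float magnitude;
    if (exponent == 0x1F)        // all ones: infinity or NaN
        magnitude = mantissa ? std::numeric_limits<float>::quiet_NaN()
                             : std::numeric_limits<float>::infinity();
    else if (exponent == 0)      // zero or subnormal: no hidden bit, mantissa * 2^-24
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);
    else                         // normal: hidden leading 1 gives 11 significant bits
        magnitude = std::ldexp(static_cast<float>(mantissa | 0x400), exponent - 25);

    return sign ? -magnitude : magnitude;
}
```

For instance, the pattern 0x3C00 decodes to 1.0, 0x7BFF to 65504 (the largest finite half-precision value) and 0x0001 to the smallest positive subnormal, 2^-24.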
This library is developed by Christian Rau and released under the MIT License. If you have any questions or problems with it, feel free to contact me at rauy AT users.sourceforge.net or use any of the other means for support.
Additional credit goes to Jeroen van der Zijp for his paper on Fast Half Float Conversions, whose algorithms have been used in the library for converting between half-precision and single-precision values.