Message ID | 20211015025255.1285628-1-wangyang.guo@intel.com |
---|---|

State | Changes Requested |

Headers | show |

Series | x86-64: Impelment __ieee754_remainder with static rounding | expand |

Context | Check | Description |
---|---|---|

dj/TryBot-apply_patch | success | Patch applied to master at the time it was sent |

dj/TryBot-32bit | success | Build for i686 |

On Fri, 15 Oct 2021, Wangyang Guo via Libc-alpha wrote: > __ieee754_remainder sets and resets MXCSR register to use a specific > rounding mode. Save, set and restore MXCSR register has overhead. > Use AVX512 static rounding to override MXCSR-based rounding control > for floating point instructions with rounding semantics. It can > improve the performance of 2 indicators in bench-math-inlines: I think a key concern here is the effects of the patch on the readability and maintainability of the code. This patch is introducing macros for basic arithmetic operations, which aren't needed for any of the other macroized versions of libm functions. The code already suffers from a lack of comments, but all these extra macros make it even less readable. Furthermore, there are no comments added on the new default macro definitions to explain the exact semantics of the macros (requirements on the macro definitions and call sites), meaning anyone changing the code would probably need to be familiar with AVX512-specific details as well as reverse-engineering the semantics of individual macros. (For example, the macro you've called ROUND_TO_ZERO surely only has such an effect given various other requirements on its arguments and the environment in which it is called, so really needs a comment explaining the semantics and such requirements in detail.) Given how specific the macros seem to be to the particular instructions available in AVX512, it might well be better to use an entirely x86_64-specific implementation file instead of trying to share with the generic implementation using macros. However, I'd like us to have a better understanding of the wider context before concluding that's the right approach. * Do you expect such implementations using AVX512 static rounding to be useful for lots of functions rather than just one or a few? * Some of the complexity and much of what seems very AVX512-specific seems to come from changing types to use vectors. Are the AVX512 operations in question not available for scalar arithmetic? * Some other architectures also have static rounding support in their instruction sets, which should be taken into consideration in any changes to architecture-independent code to support static rounding. It may, however, have differences from the AVX512 version. For example, while RISC-V allows a rounding mode to be set in each floating-point instruction, exception flags are still set as usual; it doesn't have support for instructions that don't raise exceptions. Perhaps the RISC-V maintainers could comment on whether, given that, there would be any performance benefit on common RISC-V implementations in using static rounding modes for code that still needs to save and restore exception state, or whether static rounding modes would only help performance on RISC-V for cases where the code doesn't need to do anything special with exception state? Do any other current architectures supported by glibc have static rounding support in their instruction sets? If so, could their maintainers also comment on these issues? (IA64 has something similar - multiple sets of floating-point state, so that code can use a different rounding mode register and set of exception flags, subject to ABI requirements on that floating-point state - but isn't really a current architecture.) * To what extent is the use of static rounding modes something the compiler should be able to do rather than macroizing the code? TS 18661-1 and C2X support #pragma STDC FENV_ROUND FE_TONEAREST which seems like what this code needs for the static rounding mode part of things, while TS 18661-5 adds #pragma STDC FENV_EXCEPT NO_FLAG FE_ALL_EXCEPT which might be what's wanted for discarding exceptions. Implementing those pragmas in GCC would certainly be a substantial project (maybe needing to start with something like Marc Glisse's -ffenv-access patches from August 2020 - a generic version would need to handle the case of the pragma changing the dynamic rounding modes and restricting code movement, before support is then added for architectures with actual static rounding mode support). It would however seem better in principle as a long-term solution for allowing lots of functions that currently change the rounding mode temporarily to use static rounding modes on all architectures that have static rounding mode support, rather than changing basic arithmetic to use macros everywhere in affected code. But it's possible that even given such pragma implementations you could have issues applying them for AVX512, if the change of types to use vectors is an essential part of the patch. While a compiler pragma implementation should certainly be able to set instruction fields for static rounding modes or discarding exceptions, it seems more doubtful to what extent it would rewrite scalar code, mixed with a lot of bit-wise manipulations of floating-point representations, into code working on vectors instead; especially the mix of floating-point operations with bit-wise manipulation is a rather specific use case for changing scalar to vector operations.

On Mon, Oct 18, 2021 at 11:02 AM Joseph Myers <joseph@codesourcery.com> wrote: > > On Fri, 15 Oct 2021, Wangyang Guo via Libc-alpha wrote: > > > __ieee754_remainder sets and resets MXCSR register to use a specific > > rounding mode. Save, set and restore MXCSR register has overhead. > > Use AVX512 static rounding to override MXCSR-based rounding control > > for floating point instructions with rounding semantics. It can > > improve the performance of 2 indicators in bench-math-inlines: > > I think a key concern here is the effects of the patch on the readability > and maintainability of the code. This patch is introducing macros for > basic arithmetic operations, which aren't needed for any of the other > macroized versions of libm functions. The code already suffers from a > lack of comments, but all these extra macros make it even less readable. > > Furthermore, there are no comments added on the new default macro > definitions to explain the exact semantics of the macros (requirements on > the macro definitions and call sites), meaning anyone changing the code > would probably need to be familiar with AVX512-specific details as well as > reverse-engineering the semantics of individual macros. (For example, the > macro you've called ROUND_TO_ZERO surely only has such an effect given > various other requirements on its arguments and the environment in which > it is called, so really needs a comment explaining the semantics and such > requirements in detail.) > > Given how specific the macros seem to be to the particular instructions > available in AVX512, it might well be better to use an entirely > x86_64-specific implementation file instead of trying to share with the > generic implementation using macros. However, I'd like us to have a > better understanding of the wider context before concluding that's the > right approach. > > > * Do you expect such implementations using AVX512 static rounding to be > useful for lots of functions rather than just one or a few? > > > * Some of the complexity and much of what seems very AVX512-specific seems > to come from changing types to use vectors. Are the AVX512 operations in > question not available for scalar arithmetic? > > > * Some other architectures also have static rounding support in their > instruction sets, which should be taken into consideration in any changes > to architecture-independent code to support static rounding. It may, > however, have differences from the AVX512 version. For example, while > RISC-V allows a rounding mode to be set in each floating-point > instruction, exception flags are still set as usual; it doesn't have > support for instructions that don't raise exceptions. Perhaps the RISC-V > maintainers could comment on whether, given that, there would be any > performance benefit on common RISC-V implementations in using static > rounding modes for code that still needs to save and restore exception > state, or whether static rounding modes would only help performance on > RISC-V for cases where the code doesn't need to do anything special with > exception state? For the RISC-V implementations I've worked on, simultaneously changing the dynamic rounding mode and the exception state is no more expensive than changing only the exception state. (I'm ignoring the cost of any integer instructions that might be needed to synthesize the new fcsr value before writing the fcsr.) The static rounding modes are primarily beneficial when a single computation makes use of multiple rounding modes, in which case repeated writes of the dynamic rounding mode register would become expensive. HTH. > > > Do any other current architectures supported by glibc have static rounding > support in their instruction sets? If so, could their maintainers also > comment on these issues? (IA64 has something similar - multiple sets of > floating-point state, so that code can use a different rounding mode > register and set of exception flags, subject to ABI requirements on that > floating-point state - but isn't really a current architecture.) > > > * To what extent is the use of static rounding modes something the > compiler should be able to do rather than macroizing the code? TS 18661-1 > and C2X support > > #pragma STDC FENV_ROUND FE_TONEAREST > > which seems like what this code needs for the static rounding mode part of > things, while TS 18661-5 adds > > #pragma STDC FENV_EXCEPT NO_FLAG FE_ALL_EXCEPT > > which might be what's wanted for discarding exceptions. > > Implementing those pragmas in GCC would certainly be a substantial project > (maybe needing to start with something like Marc Glisse's -ffenv-access > patches from August 2020 - a generic version would need to handle the case > of the pragma changing the dynamic rounding modes and restricting code > movement, before support is then added for architectures with actual > static rounding mode support). It would however seem better in principle > as a long-term solution for allowing lots of functions that currently > change the rounding mode temporarily to use static rounding modes on all > architectures that have static rounding mode support, rather than changing > basic arithmetic to use macros everywhere in affected code. > > But it's possible that even given such pragma implementations you could > have issues applying them for AVX512, if the change of types to use > vectors is an essential part of the patch. While a compiler pragma > implementation should certainly be able to set instruction fields for > static rounding modes or discarding exceptions, it seems more doubtful to > what extent it would rewrite scalar code, mixed with a lot of bit-wise > manipulations of floating-point representations, into code working on > vectors instead; especially the mix of floating-point operations with > bit-wise manipulation is a rather specific use case for changing scalar to > vector operations. > > -- > Joseph S. Myers > joseph@codesourcery.com

Joseph Myers wrote: > Given how specific the macros seem to be to the particular instructions > available in AVX512, it might well be better to use an entirely > x86_64-specific implementation file instead of trying to share with the > generic implementation using macros. However, I'd like us to have a > better understanding of the wider context before concluding that's the > right approach. OK, I will change the patch to x86_64 specific only. > * Do you expect such implementations using AVX512 static rounding to be > useful for lots of functions rather than just one or a few? Currently, I just find out AVX512 static rounding can have benefit to remainder function without MXCSR overhead. I am not sure whether it could be applied to other wider areas. > * Some of the complexity and much of what seems very AVX512-specific seems > to come from changing types to use vectors. Are the AVX512 operations in > question not available for scalar arithmetic? As far as I know, there is no scalar arithmetic with static rounding. BR Wangyang

Dear Wangyang, if not already done, please also fix the patch title: Impelment -> Implement. Paul

```
Paul Zimmermann wrote:
> if not already done, please also fix the patch title: Impelment -> Implement.
Got it. Thanks.
BR
Wangyang
```

diff --git a/sysdeps/ieee754/dbl-64/e_remainder.c b/sysdeps/ieee754/dbl-64/e_remainder.c index d076a37168..257e41ae7a 100644 --- a/sysdeps/ieee754/dbl-64/e_remainder.c +++ b/sysdeps/ieee754/dbl-64/e_remainder.c @@ -36,6 +36,7 @@ #include <math_private.h> #include <fenv_private.h> #include <libm-alias-finite.h> +#include <remainder-round-dbl-64.h> /**************************************************************************/ /* An ultimate remainder routine. Given two IEEE double machine numbers x */ @@ -44,9 +45,8 @@ double __ieee754_remainder (double x, double y) { - double z, d, xx; int4 kx, ky, n, nn, n1, m1, l; - mynumber u, t, w = { { 0, 0 } }, v = { { 0, 0 } }, ww = { { 0, 0 } }, r; + mynumber u, t; u.x = x; t.x = y; kx = u.i[HIGH_HALF] & 0x7fffffff; /* no sign for x*/ @@ -55,65 +55,88 @@ __ieee754_remainder (double x, double y) /*------ |x| < 2^1023 and 2^-970 < |y| < 2^1024 ------------------*/ if (kx < 0x7fe00000 && ky < 0x7ff00000 && ky >= 0x03500000) { - SET_RESTORE_ROUND_NOEX (FE_TONEAREST); + SET_ENV (FE_TONEAREST); + MYNUMBER_R_TYPE r; + MYNUMBER_R_DECL_VAL (u_r, u); + MYNUMBER_R_DECL_VAL (t_r, t); + MYNUMBER_R_DECL_ZERO (w); + MYNUMBER_R_DECL_ZERO (v); + MYNUMBER_R_DECL_ZERO (ww); + DOUBLE_R_TYPE z_r, d_r, xx; + DOUBLE_R_DECL_VAL (x_r, x); + if (kx + 0x00100000 < ky) return x; if ((kx - 0x01500000) < ky) { - z = x / t.x; - v.i[HIGH_HALF] = t.i[HIGH_HALF]; - d = (z + big.x) - big.x; - xx = (x - d * v.x) - d * (t.x - v.x); - if (d - z != 0.5 && d - z != -0.5) - return (xx != 0) ? xx : ((x > 0) ? ZERO.x : nZERO.x); + z_r = DIV_RN (x_r, t_r.x); + COPY_HIGH_HALF (v, t_r); + d_r = ROUND_TO_ZERO (z_r); + xx = FNMADD_RN (d_r, SUB_RN (t_r.x, v.x), + FNMADD_RN (d_r, v.x, x_r)); + if (IS_NEQ (ABS (SUB_RN (d_r, z_r)), 0.5)) + return (IS_NEQ (xx, 0) + ? TO_FP64 (xx) + : ((x > 0) ? ZERO.x : nZERO.x)); else { - if (fabs (xx) > 0.5 * t.x) - return (z > d) ? xx - t.x : xx + t.x; + if (IS_GT (ABS (xx), MUL_RN (TO_V (0.5), t_r.x))) + return (IS_GT (z_r, d_r) + ? TO_FP64 (SUB_RN (xx, t_r.x)) + : TO_FP64 (ADD_RN (xx, t_r.x))); else - return xx; + return TO_FP64 (xx); } } /* (kx<(ky+0x01500000)) */ else { - r.x = 1.0 / t.x; - n = t.i[HIGH_HALF]; + r.x = DIV_RN (TO_V (1.0), t_r.x); + n = GET_HIGH_HALF (t_r); nn = (n & 0x7ff00000) + 0x01400000; - w.i[HIGH_HALF] = n; - ww.x = t.x - w.x; + SET_HIGH_HALF (w, n); + ww.x = SUB_RN (t_r.x, w.x); l = (kx - nn) & 0xfff00000; - n1 = ww.i[HIGH_HALF]; - m1 = r.i[HIGH_HALF]; + n1 = GET_HIGH_HALF (ww); + m1 = GET_HIGH_HALF (r); while (l > 0) { - r.i[HIGH_HALF] = m1 - l; - z = u.x * r.x; - w.i[HIGH_HALF] = n + l; - ww.i[HIGH_HALF] = (n1) ? n1 + l : n1; - d = (z + big.x) - big.x; - u.x = (u.x - d * w.x) - d * ww.x; - l = (u.i[HIGH_HALF] & 0x7ff00000) - nn; + SET_HIGH_HALF (r, (m1 - l)); + z_r = MUL_RN (u_r.x, r.x); + SET_HIGH_HALF (w, (n + l)); + SET_HIGH_HALF (ww, ((n1) ? n1 + l: n1)); + d_r = ROUND_TO_ZERO (z_r); + u_r.x = SUB_RN (FNMADD_RN (d_r, w.x, u_r.x), + MUL_RN (d_r, ww.x)); + l = (GET_HIGH_HALF (u_r) & 0x7ff00000) - nn; } - r.i[HIGH_HALF] = m1; - w.i[HIGH_HALF] = n; - ww.i[HIGH_HALF] = n1; - z = u.x * r.x; - d = (z + big.x) - big.x; - u.x = (u.x - d * w.x) - d * ww.x; - if (fabs (u.x) < 0.5 * t.x) - return (u.x != 0) ? u.x : ((x > 0) ? ZERO.x : nZERO.x); - else - if (fabs (u.x) > 0.5 * t.x) - return (d > z) ? u.x + t.x : u.x - t.x; + SET_HIGH_HALF (r, m1); + SET_HIGH_HALF (w, n); + SET_HIGH_HALF (ww, n1); + z_r = MUL_RN (u_r.x, r.x); + d_r = ROUND_TO_ZERO (z_r); + u_r.x = SUB_RN (FNMADD_RN (d_r, w.x, u_r.x), + MUL_RN (d_r, ww.x)); + if (IS_LT (ABS (u_r.x), MUL_RN (TO_V (0.5), t_r.x))) + return (IS_NEQ (u_r.x, 0) + ? TO_FP64 (u_r.x) + : ((x > 0) ? ZERO.x : nZERO.x)); else - { - z = u.x / t.x; d = (z + big.x) - big.x; - return ((u.x - d * w.x) - d * ww.x); - } + if (IS_GT (ABS (u_r.x), MUL_RN (TO_V (0.5), t_r.x))) + return (IS_GT (d_r, z_r) + ? TO_FP64 (ADD_RN (u_r.x, t_r.x)) + : TO_FP64 (SUB_RN (u_r.x, t_r.x))); + else + { + z_r = DIV_RN (u_r.x, t_r.x); + d_r = ROUND_TO_ZERO (z_r); + return TO_FP64 (SUB_RN (FNMADD_RN (d_r, w.x, u_r.x), + MUL_RN (d_r, ww.x))); + } } } /* (kx<0x7fe00000&&ky<0x7ff00000&&ky>=0x03500000) */ else { + double z, d; if (kx < 0x7fe00000 && ky < 0x7ff00000 && (ky > 0 || t.i[LOW_HALF] != 0)) { y = fabs (y) * t128.x; @@ -150,4 +173,6 @@ __ieee754_remainder (double x, double y) } } } +#ifndef __ieee754_remainder libm_alias_finite (__ieee754_remainder, __remainder) +#endif diff --git a/sysdeps/ieee754/dbl-64/remainder-round-dbl-64.h b/sysdeps/ieee754/dbl-64/remainder-round-dbl-64.h new file mode 100644 index 0000000000..2fc2558cb8 --- /dev/null +++ b/sysdeps/ieee754/dbl-64/remainder-round-dbl-64.h @@ -0,0 +1,42 @@ +/* Used by remainder functions. Generic version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SET_ENV(RM) SET_RESTORE_ROUND_NOEX (RM) + +#define DOUBLE_R_TYPE double +#define DOUBLE_R_DECL_VAL(v_r, v) double v_r = (v) +#define MYNUMBER_R_TYPE mynumber +#define MYNUMBER_R_DECL_VAL(v_r, v) mynumber v_r = (v) +#define MYNUMBER_R_DECL_ZERO(v_r) mynumber v_r = { { 0, 0 } } + +#define COPY_HIGH_HALF(a, b) (a).i[HIGH_HALF] = (b).i[HIGH_HALF] +#define SET_HIGH_HALF(a, b) (a).i[HIGH_HALF] = (b) +#define GET_HIGH_HALF(a) (a).i[HIGH_HALF] + +#define ROUND_TO_ZERO(a) (((a) + big.x) - big.x) +#define ADD_RN(a, b) ((a) + (b)) +#define SUB_RN(a, b) ((a) - (b)) +#define MUL_RN(a, b) ((a) * (b)) +#define DIV_RN(a, b) ((a) / (b)) +#define FNMADD_RN(a, b, c) ((c) - (a) * (b)) +#define IS_NEQ(a, b) ((a) != (b)) +#define IS_LT(a, b) ((a) < (b)) +#define IS_GT(a, b) ((a) > (b)) +#define TO_FP64(a) ((double) (a)) +#define TO_V(a) ((double) (a)) +#define ABS(a) (fabs (a)) diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile index d425ffd6d3..06c61241a0 100644 --- a/sysdeps/x86_64/fpu/multiarch/Makefile +++ b/sysdeps/x86_64/fpu/multiarch/Makefile @@ -56,6 +56,10 @@ CFLAGS-e_log-avx.c = -msse2avx -DSSE2AVX CFLAGS-s_atan-avx.c = -msse2avx -DSSE2AVX CFLAGS-s_sin-avx.c = -msse2avx -DSSE2AVX CFLAGS-s_tan-avx.c = -msse2avx -DSSE2AVX + +libm-sysdep_routines += e_remainder-avx512 + +CFLAGS-e_remainder-avx512.c = -mavx512f endif ifeq ($(subdir),mathvec) diff --git a/sysdeps/x86_64/fpu/multiarch/e_remainder-avx512.c b/sysdeps/x86_64/fpu/multiarch/e_remainder-avx512.c new file mode 100644 index 0000000000..5852a19794 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/e_remainder-avx512.c @@ -0,0 +1,20 @@ +/* AVX512F version of IEEE 754 remainder. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define __ieee754_remainder __ieee754_remainder_avx512f +#include <sysdeps/ieee754/dbl-64/e_remainder.c> diff --git a/sysdeps/x86_64/fpu/multiarch/e_remainder.c b/sysdeps/x86_64/fpu/multiarch/e_remainder.c new file mode 100644 index 0000000000..e1462438e3 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/e_remainder.c @@ -0,0 +1,32 @@ +/* Multiple versions of IEEE 754 remainder. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <math.h> +#include <libm-alias-finite.h> + +extern double __redirect_ieee754_remainder (double, double); + +#define SYMBOL_NAME ieee754_remainder +#include "ifunc-avx512f.h" + +libc_ifunc_redirected (__redirect_ieee754_remainder, + __ieee754_remainder, IFUNC_SELECTOR ()); +libm_alias_finite (__ieee754_remainder, __remainder) + +#define __ieee754_remainder __ieee754_remainder_sse2 +#include <sysdeps/ieee754/dbl-64/e_remainder.c> diff --git a/sysdeps/x86_64/fpu/multiarch/ifunc-avx512f.h b/sysdeps/x86_64/fpu/multiarch/ifunc-avx512f.h new file mode 100644 index 0000000000..1f7b1d1ce6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/ifunc-avx512f.h @@ -0,0 +1,32 @@ +/* Common definition for ifunc selections optimized with AVX512F. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ +#include <init-arch.h> + +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512f) attribute_hidden; + +static inline void * +IFUNC_SELECTOR (void) +{ + const struct cpu_features* cpu_features = __get_cpu_features (); + + if (CPU_FEATURE_USABLE_P (cpu_features, AVX512F)) + return OPTIMIZE (avx512f); + + return OPTIMIZE (sse2); +} diff --git a/sysdeps/x86_64/fpu/multiarch/remainder-round-dbl-64.h b/sysdeps/x86_64/fpu/multiarch/remainder-round-dbl-64.h new file mode 100644 index 0000000000..74301ee5f0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/remainder-round-dbl-64.h @@ -0,0 +1,66 @@ +/* Used by remainder functions. AVX512F version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifdef __AVX512F__ +# include <x86intrin.h> + +typedef union MM_Number +{ + __m128d x; + __m128i i; +} MM_Number; + +# define AVX512_RN (_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) +# define SET_ENV(RM) + +# define DOUBLE_R_TYPE __m128d +# define DOUBLE_R_DECL_VAL(v_r, v) __m128d v_r = _mm_set_sd (v) +# define MYNUMBER_R_TYPE MM_Number +# define MYNUMBER_R_DECL_VAL(v_r, v) \ + MM_Number v_r = { __extension__ (__m128d) { (v).x, 0.0 } } +# define MYNUMBER_R_DECL_ZERO(v_r) \ + MM_Number v_r = { __extension__ (__m128d) { 0.0, 0.0 } } + +# define COPY_HIGH_HALF(a, b) \ + (a).i = _mm_blend_epi32 ((a).i, (b).i, 0x1) +# define SET_HIGH_HALF(a, b) \ + (a).i = _mm_insert_epi32 ((a).i, b, 0x1) +# define GET_HIGH_HALF(a) _mm_extract_epi32 ((a).i, 0x1) + +# define ROUND_TO_ZERO(a) _mm_round_pd ((a), AVX512_RN) +# define ADD_RN(a, b) _mm_add_round_sd ((a), (b), AVX512_RN) +# define SUB_RN(a, b) _mm_sub_round_sd ((a), (b), AVX512_RN) +# define MUL_RN(a, b) _mm_mul_round_sd ((a), (b), AVX512_RN) +# define DIV_RN(a, b) _mm_div_round_sd ((a), (b), AVX512_RN) +# define FNMADD_RN(a, b, c) _mm_fnmadd_round_sd ((a), (b), (c), AVX512_RN) + +/* b is not a vector data type */ +# define IS_NEQ(a, b) (_mm_cvtsd_f64 (a) != b) +/* b is a vector data type */ +# define IS_LT(a, b) _mm_cmp_sd_mask ((a), (b), _CMP_LT_OS) +/* b is a vector data type */ +# define IS_GT(a, b) _mm_cmp_sd_mask ((a), (b), _CMP_GT_OS) +# define TO_FP64(a) _mm_cvtsd_f64 (a) +# define TO_V(a) _mm_set_sd (a) +# define ABS(a) \ + ((__m128d) _mm_and_si128 ((__m128i) (a), \ + _mm_cvtsi64_si128 (0x7fffffffffffffffll))) + +#else +# include <sysdeps/ieee754/dbl-64/remainder-round-dbl-64.h> +#endif

`__ieee754_remainder sets and resets MXCSR register to use a specific rounding mode. Save, set and restore MXCSR register has overhead. Use AVX512 static rounding to override MXCSR-based rounding control for floating point instructions with rounding semantics. It can improve the performance of 2 indicators in bench-math-inlines: * remainder_test1.normal: +18.9% * remainder_test2.normal: +19.4% Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> --- sysdeps/ieee754/dbl-64/e_remainder.c | 105 +++++++++++------- .../ieee754/dbl-64/remainder-round-dbl-64.h | 42 +++++++ sysdeps/x86_64/fpu/multiarch/Makefile | 4 + .../x86_64/fpu/multiarch/e_remainder-avx512.c | 20 ++++ sysdeps/x86_64/fpu/multiarch/e_remainder.c | 32 ++++++ sysdeps/x86_64/fpu/multiarch/ifunc-avx512f.h | 32 ++++++ .../fpu/multiarch/remainder-round-dbl-64.h | 66 +++++++++++ 7 files changed, 261 insertions(+), 40 deletions(-) create mode 100644 sysdeps/ieee754/dbl-64/remainder-round-dbl-64.h create mode 100644 sysdeps/x86_64/fpu/multiarch/e_remainder-avx512.c create mode 100644 sysdeps/x86_64/fpu/multiarch/e_remainder.c create mode 100644 sysdeps/x86_64/fpu/multiarch/ifunc-avx512f.h create mode 100644 sysdeps/x86_64/fpu/multiarch/remainder-round-dbl-64.h`