sys-libs/glibc/files/local/glibc-2.19-arm-memcpy.patch - mirrors/cros/chromiumos/overlays/chromiumos-overlay - Git at Google

 libc: Speed up large memcpy() on Cortex-A7/A15

 For details, see crbug://331427

 Experimentally it's been found that the "unaligned" memcpy() is
 actually faster for sufficiently large memory copies.  It appears that
 the changeover point is a little different for different processors,
 though.  For A15 there's a lot more run-to-run variance for
 medium-sized memcpy() but the changeover appears to be at ~16K.  For
 A7 (and maybe A9) the changeover seems to be a little further out.
 We think the variance in A15 memcpy() is due to different physical
 addresses for the chunks of memory given to us by the kernel.  It is
 certain that the "aligned" code works faster at 4K and less and that
 the "unaligned" code works faster with very large chunks of memory.
 Since we care most about A15 performance and the A7 performance is not
 that much worse (and actually better for SDRAM transfers), we'll pick
 the number that's best for the A15.
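 The resulting path selection can be sketched in C as below. This is
 only an illustration, not the glibc code: choose_path(), the enum and
 the macro name are hypothetical, and the 8-byte mutual-alignment test
 is an assumption about what the preceding "cmp tmp1, tmp2" compares.
 Handling of very small copies earlier in memcpy() is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum copy_path { PATH_ALIGNED, PATH_NOTALIGNED };

/* 16K, the threshold the patch compares against (#0x4000). */
#define LARGE_COPY_THRESHOLD 0x4000u

enum copy_path choose_path(uintptr_t dst, uintptr_t src, size_t count)
{
    /* Assumption: the existing check compares the low 3 bits of the
       two pointers, i.e. whether they are mutually 8-byte aligned. */
    if ((dst & 7u) != (src & 7u))
        return PATH_NOTALIGNED;

    /* New in the patch: large copies also take the "unaligned" code.
       Note the assembly uses a signed compare (bge); this unsigned
       compare differs only for counts of 2 GiB or more. */
    if (count >= LARGE_COPY_THRESHOLD)
        return PATH_NOTALIGNED;

    return PATH_ALIGNED;
}

/* Small usage example covering both sides of the changeover. */
void choose_path_self_check(void)
{
    assert(choose_path(0x1000, 0x2000, 8 * 1024) == PATH_ALIGNED);
    assert(choose_path(0x1000, 0x2000, 16 * 1024) == PATH_NOTALIGNED);
    assert(choose_path(0x1001, 0x2000, 8 * 1024) == PATH_NOTALIGNED);
}
```

 With this rule, 4K and 8K aligned copies keep using the aligned code,
 which matches the "unchanged" results for small copies.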
 Tests on snow (A15 only):
 * Large (12M) aligned copies go from ~2350 MiB/s to ~2900 MiB/s.
 * Medium (96K) aligned copies go from ~5900 MiB/s to ~6300 MiB/s.
 * Medium (16K) aligned copies seem to be better but there's a lot of
   noise (old=8151.8, 8736.6, 8168.7; new=9364.9, 9829.5, 9639.0)
 * Small (4K, 8K) aligned copies are unchanged.
 For A7-only on pit:
 * Large (12M) aligned copies go from 440 MiB/s to 930 MiB/s.
 * Medium (96K) aligned copies regress from ~2650 MiB/s to ~2400 MiB/s.
 * Medium (16K) aligned copies regress from ~3000 MiB/s to ~2800 MiB/s.
 * Small (4K, 8K) aligned copies are unchanged.
 See punybench changes at
 <https://chromium-review.googlesource.com/#/c/182168/3> for how this
 was tested.  For A15 changes I ran 3 times and averaged (there wasn't
 much variance except for 16K).  For A7 changes I ran 2 times.
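
 For readers without access to the punybench change, a throughput
 measurement of this kind might look roughly like the sketch below. It
 is not the punybench code: copy_mib_per_s() is a hypothetical helper,
 and the buffer setup and POSIX clock_gettime() timing are assumptions
 about one reasonable way to measure MiB/s.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy `size` bytes `iters` times and return the
   observed throughput in MiB/s, or -1.0 if it could not be measured. */
double copy_mib_per_s(size_t size, int iters)
{
    assert(size > 0 && iters > 0);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so page faults are not part of the timing. */
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    /* Calling through a volatile pointer keeps the compiler from
       inlining or eliding the copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        copy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    int copied_ok = (dst[size - 1] == 0xA5);

    free(src);
    free(dst);

    if (!copied_ok || secs <= 0.0)
        return -1.0;
    return ((double)size * (double)iters) / (1024.0 * 1024.0) / secs;
}
```

 Calling copy_mib_per_s(16 * 1024, 100000) a few times and comparing
 the results would show the kind of run-to-run noise described for 16K
 copies; sizes such as 96 * 1024 and 12 * 1024 * 1024 correspond to
 the "medium" and "large" cases listed above.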

 --- a/glibc-2.19/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
 +++ b/glibc-2.19/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
 @@ -374,6 +374,10 @@ ENTRY(memcpy)
  	cmp	tmp1, tmp2
  	bne	.Lcpy_notaligned

 +	/* Use the non-aligned code for >=16K; faster on A7/A15 (A9 too?) */
 +	cmp	count, #0x4000
 +	bge	.Lcpy_notaligned
 +
  #ifdef USE_VFP
  	/* Magic dust alert!  Force VFP on Cortex-A9.  Experiments show
  	   that the FP pipeline is much better at streaming loads and