libc: Speed up large memcpy() on Cortex-A7/A15
For details, see crbug://331427.
Experimentally it's been found that the "unaligned" memcpy() is
actually faster for sufficiently large memory copies. It appears that
the changeover point is a little different for different processors,
though. For A15 there's a lot more run-to-run variance for
medium-sized memcpy() but the changeover appears to be at ~16K. For
A7 (and maybe A9) the changeover seems to be a little further out.
We think the variance in A15 memcpy() is due to different physical
addresses for the chunks of memory given to us by the kernel. It is
certain that the "aligned" code works faster at 4K and less and that
the "unaligned" code works faster with very large chunks of memory.
Since we care most about A15 performance and the A7 performance is not
that much worse (and actually better for SDRAM transfers), we'll pick
the number that's best for the A15.
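As a rough illustration of the decision this patch adds, here is a
C sketch of the path selection. It models only the branch logic, not
the copy itself. The names choose_copy_path, PATH_ALIGNED and
PATH_UNALIGNED are hypothetical, and the 8-byte mutual-alignment check
is an assumption about what the tmp1/tmp2 compare tests.

```c
#include <stddef.h>
#include <stdint.h>

/* Threshold from this patch: copies of 16 KiB (0x4000 bytes) or more
   take the "unaligned" path even when the pointers are aligned. */
#define UNALIGNED_THRESHOLD 0x4000u

/* Hypothetical labels for the two code paths in memcpy_impl.S. */
enum copy_path { PATH_ALIGNED, PATH_UNALIGNED };

/* Hypothetical helper that models the branch decision only. */
static enum copy_path choose_copy_path(const void *dst, const void *src,
                                       size_t count)
{
    /* Assumption: "mutually aligned" means dst and src share the same
       offset within an 8-byte word (models cmp tmp1, tmp2). */
    int mutually_aligned = (((uintptr_t)dst ^ (uintptr_t)src) & 7u) == 0;

    if (!mutually_aligned)
        return PATH_UNALIGNED;      /* bne .Lcpy_notaligned */
    if (count >= UNALIGNED_THRESHOLD)
        return PATH_UNALIGNED;      /* new: cmp count, #0x4000; bge */
    return PATH_ALIGNED;
}
```

With this model, a mutually aligned 4K copy stays on the aligned path,
while a mutually aligned copy of exactly 16K switches to the unaligned
path.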
Tests on snow (A15 only):
* Large (12M) aligned copies go from ~2350 MiB/s to ~2900 MiB/s.
* Medium (96K) aligned copies go from ~5900 MiB/s to ~6300 MiB/s.
* Medium (16K) aligned copies seem to be better but there's a lot of
noise (old=8151.8, 8736.6, 8168.7; new=9364.9, 9829.5, 9639.0)
* Small (4K, 8K) aligned copies are unchanged.
For A7-only on pit:
* Large (12M) aligned copies go from 440 MiB/s to 930 MiB/s.
* Medium (96K) aligned copies regress from ~2650 MiB/s to ~2400 MiB/s.
* Medium (16K) aligned copies regress from ~3000 MiB/s to ~2800 MiB/s.
* Small (4K, 8K) aligned copies are unchanged.
See punybench changes at
<https://chromium-review.googlesource.com/#/c/182168/3> for how this
was tested. For A15 changes I ran 3 times and averaged (there wasn't
much variance except for 16K). For A7 changes I ran 2 times.
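For readers without punybench handy, a minimal sketch of how a MiB/s
figure like the ones above could be measured and averaged. This is not
the punybench code; mib_per_sec, mean and time_memcpy are hypothetical
helpers, and clock() measures CPU time at coarse resolution, so treat
the results as rough.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Convert bytes copied and elapsed seconds to MiB/s (1 MiB = 2^20 bytes).
   Returns 0 if the elapsed time was too short to measure. */
static double mib_per_sec(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / (1024.0 * 1024.0) / seconds : 0.0;
}

/* Arithmetic mean of n samples, e.g. the results of repeated runs. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return n ? sum / (double)n : 0.0;
}

/* Time `iters` copies of `size` bytes and return the throughput in
   MiB/s, or a negative value if allocation fails. */
static double time_memcpy(size_t size, unsigned iters)
{
    /* Calling through a volatile pointer keeps the compiler from
       inlining or eliding the copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    char *src = malloc(size);
    char *dst = malloc(size);
    double secs;
    clock_t start, end;

    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);    /* touch the pages before timing */
    memset(dst, 0, size);

    start = clock();
    for (unsigned i = 0; i < iters; i++)
        copy(dst, src, size);
    end = clock();

    secs = (double)(end - start) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    return mib_per_sec((double)size * iters, secs);
}
```

Calling time_memcpy() several times for one size and passing the
results to mean() mirrors the "ran 3 times and averaged" procedure.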
--- a/glibc-2.19/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
+++ b/glibc-2.19/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
@@ -374,6 +374,10 @@ ENTRY(memcpy)
cmp tmp1, tmp2
bne .Lcpy_notaligned
+ /* Use the non-aligned code for >=16K; faster on A7/A15 (A9 too?) */
+ cmp count, #0x4000
+ bge .Lcpy_notaligned
+
#ifdef USE_VFP
/* Magic dust alert! Force VFP on Cortex-A9. Experiments show
that the FP pipeline is much better at streaming loads and