| From a591d5008952bb00bbafa592f0132930b115217c Mon Sep 17 00:00:00 2001 |
| From: Yunlian Jiang <yunlian@google.com> |
| Date: Fri, 1 Aug 2014 15:19:34 -0700 |
| Subject: [PATCH 1/8] libc: Speed up large memcpy() on Cortex-A7/A15 |
| |
| Details please see crbug://331427 |
| |
| Experimentally it's been found that the "unaligned" memcpy() is |
| actually faster for sufficiently large memory copies. It appears that |
| the changeover point is a little different for different processors, |
| though. For A15 there's a lot more run-to-run variance for |
| medium-sized memcpy() but the changeover appears to be at ~16K. For |
| A7 (and maybe A9) the changeover seems to be a little further out. |
| We think the variance in A15 memcpy() is is due to different physical |
| addresses for the chunks of memory given to us by the kernel. It is |
| certain that the "aligned" code works faster at 4K and less and that |
| the "unaligned" code works faster with very large chunks of memory. |
| Since we care most about A15 performance and the A7 performance is not |
| that much worse (and actually better for SDRAM transfers), we'll pick |
| the number that's best for the A15. |
| Tests on snow (A15 only): |
| * Large (12M) aligned copies go from ~2350 MiB/s to ~2900 MiB/s. |
| * Medium (96K) aligned copies go from ~5900 MiB/s to ~6300 MiB/s. |
| * Medium (16K) aligned copies seem to be better but there's a lot of |
| noise (old=8151.8, 8736.6, 8168.7; new=9364.9, 9829.5, 9639.0) |
| * Small (4K, 8K) algined copies are unchanged. |
| For A7-only on pit: |
| * Large (12M) aligned copies go from 440 MiB/s to 930 MiB/s. |
| * Medium (96K) aligned copies regress from ~2650 MiB/s to ~2400 MiB/s. |
| * Medium (16K) aligned copies regress from ~3000 MiB/s to ~2800 MiB/s. |
| * Small (4K, 8K) aligned copies are unchanged. |
| See punybench changes at |
| <https://chromium-review.googlesource.com/#/c/182168/3> for how this |
| was tested. For A15 changes I ran 3 times and averaged (there wasn't |
| lots of variance except for 16K). For A7 changes I ran 2 times. |
| [Adrian: forward-ported to glibc 2.32] |
| --- |
| sysdeps/arm/armv7/multiarch/memcpy_impl.S | 4 ++++ |
| 1 file changed, 4 insertions(+) |
| |
| diff --git a/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/sysdeps/arm/armv7/multiarch/memcpy_impl.S |
| index 379bb56fc9..78d8f64a91 100644 |
| --- a/sysdeps/arm/armv7/multiarch/memcpy_impl.S |
| +++ b/sysdeps/arm/armv7/multiarch/memcpy_impl.S |
| @@ -327,6 +327,10 @@ ENTRY(memcpy) |
| cmp tmp1, tmp2 |
| bne .Lcpy_notaligned |
| |
| + /* Use the non-aligned code for >=16K; faster on A7/A15 (A9 too?) */ |
| + cmp count, #0x4000 |
| + bge .Lcpy_notaligned |
| + |
| #ifdef USE_VFP |
| /* Magic dust alert! Force VFP on Cortex-A9. Experiments show |
| that the FP pipeline is much better at streaming loads and |
| -- |
| 2.30.2 |
| |