| libc: Speed up large memcpy() on Cortex-A7/A15 |
| |
| Details please see crbug://331427 |
| |
| Experimentally it's been found that the "unaligned" memcpy() is |
| actually faster for sufficiently large memory copies. It appears that |
| the changeover point is a little different for different processors, |
| though. For A15 there's a lot more run-to-run variance for |
| medium-sized memcpy() but the changeover appears to be at ~16K. For |
| A7 (and maybe A9) the changeover seems to be a little further out. |
| We think the variance in A15 memcpy() is is due to different physical |
| addresses for the chunks of memory given to us by the kernel. It is |
| certain that the "aligned" code works faster at 4K and less and that |
| the "unaligned" code works faster with very large chunks of memory. |
| Since we care most about A15 performance and the A7 performance is not |
| that much worse (and actually better for SDRAM transfers), we'll pick |
| the number that's best for the A15. |
| Tests on snow (A15 only): |
| * Large (12M) aligned copies go from ~2350 MiB/s to ~2900 MiB/s. |
| * Medium (96K) aligned copies go from ~5900 MiB/s to ~6300 MiB/s. |
| * Medium (16K) aligned copies seem to be better but there's a lot of |
| noise (old=8151.8, 8736.6, 8168.7; new=9364.9, 9829.5, 9639.0) |
| * Small (4K, 8K) algined copies are unchanged. |
| For A7-only on pit: |
| * Large (12M) aligned copies go from 440 MiB/s to 930 MiB/s. |
| * Medium (96K) aligned copies regress from ~2650 MiB/s to ~2400 MiB/s. |
| * Medium (16K) aligned copies regress from ~3000 MiB/s to ~2800 MiB/s. |
| * Small (4K, 8K) aligned copies are unchanged. |
| See punybench changes at |
| <https://chromium-review.googlesource.com/#/c/182168/3> for how this |
| was tested. For A15 changes I ran 3 times and averaged (there wasn't |
| lots of variance except for 16K). For A7 changes I ran 2 times. |
| |
| --- a/glibc-2.19/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S |
| +++ b/glibc-2.19/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S |
| @@ -374,6 +374,10 @@ ENTRY(memcpy) |
| cmp tmp1, tmp2 |
| bne .Lcpy_notaligned |
| |
| + /* Use the non-aligned code for >=16K; faster on A7/A15 (A9 too?) */ |
| + cmp count, #0x4000 |
| + bge .Lcpy_notaligned |
| + |
| #ifdef USE_VFP |
| /* Magic dust alert! Force VFP on Cortex-A9. Experiments show |
| that the FP pipeline is much better at streaming loads and |