
64-bit – memcpy() performance – Ubuntu x86_64

Source: collected from the internet. Published by 自由互联, 2021-06-22.


I am observing some strange behaviour that I cannot explain. Here are the details:

#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>

void memcpy_test() {
    int size = 32*4;
    char* src = new char[size];
    char* dest = new char[size];
    // general_utility::ProcessTimer tmr;  // helper from the asker's own code base; not defined here and unused
    unsigned int num_cpy = 1024*1024*16;
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for(unsigned int i=0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    // Record the end time.
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
    // Include tv_sec so the result stays valid when the run crosses a second boundary.
    double elapsed_ns = (end_time__.tv_sec - start_time__.tv_sec) * 1e9
                      + (end_time__.tv_nsec - start_time__.tv_nsec);
    std::cout << "time = " << elapsed_ns/num_cpy << std::endl;
    delete [] src;
    delete [] dest;
}

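When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions that could show this kind of behaviour?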

Edit 1:
Another interesting point is that if size > 32*4, there is no difference between the run times of the binaries thus generated.

Edit 2:
Below is the detailed performance analysis (__builtin_memcpy()):

size = 32 * 4, without -march=native: 7.5 ns, with -march=native: 19.3 ns

size = 32 * 8, without -march=native: 26.3 ns, with -march=native: 26.5 ns

Edit 3:

This observation does not change even if I allocate int64_t / int32_t.

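Edit 4:

size = 8192, without -march=native: ~2750 ns, with -march=native: ~2750 ns (there was an error in reporting this number earlier; it was wrongly written as 26.5 and is now correct)

I have run this many times, and the numbers are consistent across runs.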
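The measurements above can be reproduced with a small harness along the following lines. This is only a sketch under stated assumptions: the names elapsed_ns and ns_per_copy are made up for illustration, the size is passed as a template parameter so that it is a compile-time constant as in the original test, and the empty asm statement is a GCC/Clang compiler barrier that the original loop does not contain (it is there to keep the optimizer from removing the copy).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <time.h>

// Elapsed nanoseconds between two timespec values, including the seconds field.
inline std::int64_t elapsed_ns(const timespec& start, const timespec& end) {
    return static_cast<std::int64_t>(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

// Average process CPU time, in ns, for one copy of Size bytes.
// Size is a template parameter, so the compiler sees a constant size.
template <std::size_t Size>
double ns_per_copy(unsigned int num_cpy) {
    assert(num_cpy > 0);
    char* src = new char[Size]();  // zero-initialised source buffer
    char* dest = new char[Size];
    timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (unsigned int i = 0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, Size);
        // Compiler barrier (assumption, not in the original test):
        // tells the optimizer that dest may be read here.
        asm volatile("" : : "r"(dest) : "memory");
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    delete[] src;
    delete[] dest;
    return static_cast<double>(elapsed_ns(start, end)) / num_cpy;
}
```

Calling ns_per_copy<32 * 4>(1024 * 1024 * 16), ns_per_copy<32 * 8>(...) and ns_per_copy<8192>(...) from main, once in a binary built with -march=native and once without, should give numbers comparable to the ones listed above, although the exact values depend on the compiler version and CPU.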

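I have reproduced your findings with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 on Linux 2.6.38-10-generic #46-Ubuntu x86_64, running on my Core 2 Duo. Results will probably vary with your compiler version and CPU. I get ~26 and ~9.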

When I specify -march=native in compiler options, generated binary runs 2.7 times slower. Why is that ?

Because the -march=native version gets compiled into the following (found using objdump -D; you could also use gcc -S -fverbose-asm):

rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8

And the version without the switch gets compiled into 16 load/store pairs like:

mov    0x20(%rbp),%rdx
mov    %rdx,0x20(%rbx)

Which apparently is faster on our computers.
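To compare the two strategies directly, independent of what the compiler picks for __builtin_memcpy, something like the following sketch could be timed. The function names copy_rep_movsq and copy_16_pairs are hypothetical, the inline assembly assumes GCC or Clang on x86_64 (other targets fall back to std::memcpy), and the compiler remains free to unroll or otherwise rewrite the second loop, so the generated code should be checked with objdump -D before drawing conclusions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Strategy A: a single string instruction copying `qwords` 8-byte words.
inline void copy_rep_movsq(void* dst, const void* src, std::size_t qwords) {
    assert(dst != nullptr && src != nullptr);
#if defined(__x86_64__) && defined(__GNUC__)
    // RDI = destination, RSI = source, RCX = count; all three are modified.
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(qwords)
                 :
                 : "memory");
#else
    std::memcpy(dst, src, qwords * 8);  // fallback on other targets
#endif
}

// Strategy B: 16 separate 8-byte load/store pairs (128 bytes in total).
inline void copy_16_pairs(void* dst, const void* src) {
    assert(dst != nullptr && src != nullptr);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (int i = 0; i < 16; ++i) {
        std::uint64_t tmp;
        std::memcpy(&tmp, s + 8 * i, sizeof tmp);  // load 8 bytes
        std::memcpy(d + 8 * i, &tmp, sizeof tmp);  // store 8 bytes
    }
}
```

For the 128-byte case discussed here, copy_rep_movsq(dest, src, 16) corresponds to rcx = 128 / 8, and copy_16_pairs(dest, src) corresponds to the sequence of mov pairs.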

If anything, I would expect -march=native to produce optimized code.

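In this case, favouring rep movsq over a series of moves turned out to be a pessimization, but that might not always be the case. The first version is shorter, which might be better in some (most?) cases. Or it could be a bug in the optimizer.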

Is there other functions which could show this type of behavior ?

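Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and functions whose names begin with __builtin. Possibly also (floating-point) math functions.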

Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated

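This is because both then compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of loads/stores (it would be interesting to see whether this is also the case on other platforms). BTW, when the compiler does not know the size at compile time (e.g. int size = atoi(argv[1]);), the copy simply turns into a call to memcpy, with or without the switch.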
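The difference between a size known at compile time and one known only at run time can be seen with a pair of functions like these. This is a minimal sketch; the names copy_fixed_128 and copy_runtime are invented for illustration, and the noinline attribute (a GCC/Clang extension) is added only so that the compiler cannot propagate a constant size from the call site.

```cpp
#include <cassert>
#include <cstddef>

// Size is a compile-time constant: GCC may expand the copy inline
// (a series of moves or rep movsq, depending on the options in effect).
void copy_fixed_128(char* dest, const char* src) {
    assert(dest != nullptr && src != nullptr);
    __builtin_memcpy(dest, src, 32 * 4);
}

// Size is only known at run time: per the answer above, this is expected
// to become a plain call to memcpy, with or without -march=native.
__attribute__((noinline))
void copy_runtime(char* dest, const char* src, std::size_t size) {
    assert(dest != nullptr && src != nullptr);
    __builtin_memcpy(dest, src, size);
}
```

Compiling this with g++ -O2 -S -fverbose-asm, once with and once without -march=native, and comparing the bodies of the two functions should show which expansion was chosen; the result may differ between GCC versions.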
