Perusing the Linux x86 idle loop, I noticed a memory barrier in between monitor and mwait, and I can't figure out exactly why it's necessary.
void mwait_idle_with_hints(unsigned long ax, unsigned long cx) {
if (!need_resched()) {
if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) // True for Intel Xeon 7400 series CPUs
clflush((void *)¤t_thread_info()->flags); // Workaround for erratum AAI65 for the Xeon 7400 series
__monitor((void *)¤t_thread_info()->flags, 0, 0);
smp_mb();
if (!need_resched())
__mwait(ax, cx);
}
}
smp_mb() is a macro for asm volatile("mfence":::"memory").
Why is it necessary here? I understand why a compiler memory barrier would be necessary, but not a hardware memory barrier.
Related
My objective
I want to write a RAM bandwidth benchmark
My solution
U64 benchRead(U64 &io_minLoopTicks) //io_minLoopTicks initial value is U64_MAX
{
const U32 elementsCount = 100'000'000;
U64 accumulator = 0;
std::vector<U64> tab;
tab.resize(elementsCount); //800MB
for(U8 i = 0; i < 10; ++i) // Loops to get a reliable value.
{
const U64 startTimestamp = profiler.start(); // profiler uses under the hood QueryPerformanceCounter
for(const U64 j : tab)
accumulator += j; // Do something with the RAM to not be optimized away.
const U64 loopTicks = profiler.end() - startTimestamp;
if(loopTicks < io_minLoopTicks)
io_minLoopTicks = loopTicks;
}
return accumulator;
}
My problem
The theory would be 57.6 GB/s (8 bytes (bus width) * 2 (channels) * 3.6 GT).
In practise I have 27.7 GB/s (9700K #4.8GHz, 4 single rank 3600MT DDR4 DIMMs).
So my number seems to reflect only one channel performance.
My guess is that the way I coded, the array is in an area of RAM that can only be accessed by one channel. I am not familiar at all with multi-channel so I don't really understand the limits of that technology.
For sure my RAM is dual channel (CPU-Z confirmed it)
Performance analysis
I checked the ouput of Clang14 and it uses SSE (O3 with no -march=native). I could be wrong (not an ASM guy) but it seems it unrolled the loop and does 128 bytes per iteration. A loop is 4.5 cycles on Coffee Lake according to uiCA leading to 139.38 GB/s, way above the RAM theoretical speed. So we should be RAM limited.
My questions
Is my array stored in one bank hence the fact it can only be accessed by one channel ?
How do we write code to leverage the benefits of multichannel RAM (here dual channel) ?
References
Godbolt
I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
__m128i in = _mm_loadu_si128(inptr);
__m128i out = in; // real code does more than this, but I've simplified it
_mm_stream_si12(outptr,out);
inptr += 12;
outptr += 16;
}
This code runs about 5 times faster on our older Sandy Bridge Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Sandy Bridge Haswell and 70 seconds on Skylake.
We upgraded to the lasted microcode on the Skylake,
and also stuck in vzeroupper commands to avoid any AVX issues. Both fixes had no effect.
outptr is aligned to 16 bytes, so the stream command should be writing to aligned addresses. (I put in checks to verify this statement). inptr is not aligned by design. Commenting out the loads doesn't make any effect, the limiting commands are the stores. outptr and inptr are pointing to different memory regions, there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) why is there such a big difference between Sandy Bridge Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two difference processors, E5-2680 v3 (Haswell) and 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on a Intel(R) Xeon(R) Platinum 8180 CPU # 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 # 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting luck here that it is aligned. I have tested that this is true, and in my production application, I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that off of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from #Mystical made me check that the outputs were all cache aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).
So changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third Edit. Purposely misalignment
I tried the test suggested by #HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 bytes or 32 bytes inside the allocator such that the allocation was either 16 Byte or 32 Byte aligned, but NOT 64 byte aligned. I also increased the number of loops to 1,000,000, and ran the test 3 times and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For furture use, here is the misaligned allocator I used (I did not templatize the misalignment, you'll have to edit the 32 and change to 0 or 16 as needed):
template <class T >
struct Mallocator {
typedef T value_type;
Mallocator() = default;
template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept
{}
T* allocate(std::size_t n) {
if(n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
if(! p1) throw std::bad_alloc();
p1 += 32; // misalign on purpose
return reinterpret_cast<T*>(p1);
}
void deallocate(T* p, std::size_t) noexcept {
uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
p1 -= 32;
std::free(p1); }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
// std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
__m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
__m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
__m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
__m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));
__m128i tileVal0 = value0;
__m128i tileVal1 = value1;
__m128i tileVal2 = value2;
__m128i tileVal3 = value3;
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);
ptr += diffBytes * 4;
count -= diffBytes * 4;
tile += diffPixels * 4;
ipixel += diffPixels * 4;
if (ipixel == 32)
{
// go to next tile
ipixel = 0;
tileIter++;
tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
}
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
The following code fragment creates a function (fun) with just one RET instruction.
The loop repeatedly calls the function and overwrites the contents of the RET instruction after returning.
#include <sys/mman.h>
#include<stdlib.h>
#include<unistd.h>
#include <string.h>
typedef void (*foo)();
#define RET (0xC3)
int main(){
// Allocate an executable page
char * ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, 0, 0);
// Just write a RET instruction
*ins = RET;
// make fun point to the function with just RET instruction
foo fun = (foo)(ins);
// Repeat 0xfffffff times
for(long i = 0; i < 0xfffffff; i++){
fun();
*ins = RET;
}
return 0;
}
The Linux perf on X86 Broadwell machine has the following icache and iTLB statistics:
perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out
Performance counter stats for './a.out':
805,516,067 L1-icache-load-misses
4,857 iTLB-load-misses
32.052301220 seconds time elapsed
Now, look at the same code without overwriting the RET instruction.
#include <sys/mman.h>
#include<stdlib.h>
#include<unistd.h>
#include <string.h>
typedef void (*foo)();
#define RET (0xC3)
int main(){
// Allocate an executable page
char * ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, 0, 0);
// Just write a RET instruction
*ins = RET;
// make fun point to the function with just RET instruction
foo fun = (foo)(ins);
// Repeat 0xfffffff times
for(long i = 0; i < 0xfffffff; i++){
fun();
// Commented *ins = RET;
}
return 0;
}
And here is the perf statistics on the same machine.
perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out
Performance counter stats for './a.out':
11,738 L1-icache-load-misses
425 iTLB-load-misses
0.773433500 seconds time elapsed
Notice that overwriting the instruction causes L1-icache-load-misses to grow from 11,738 to 805,516,067 -- a manifold growth.
Also notice that iTLB-load-misses grows from 425 to 4,857--quite a growth but less compared to L1-icache-load-misses.
The running time grows from 0.773433500 seconds to 32.052301220 seconds -- a 41x growth!
It is unclear why the CPU should cause i-cache misses if the instruction footprint is so small. The only difference in the two examples is that the instruction is modified. Granted the L1 iCache and dCache are separate, isn't there a way to install code into iCache so that the cache i-cache misses can be avoided?
Furthermore, why is there a 10x growth in the iTLB misses?
Granted the L1 iCache and dCache are separate, isn't there a way to install code into iCache so that the cache i-cache misses can be avoided?
No.
If you want to modify code - the only path this can go is the following:
Store Date Execution Engine
Store Buffer & Forwarding
L1 Data Cache
Unified L2 Cache
L1 Instruction Cache
Note that you are also missing out on the μOP Cache.
This is illustrated by this diagram1, which I believe is sufficiently accurate.
I would suspect the iTLB misses could be due to regular TLB flushes. In case of no modification you are not affected by iTLB misses because your instructions actually come from the μOP Cache.
If they don't, I'm not quite sure. I would think the L1 Instruction Cache is virtually addressed, so no need to access the TLB if there is a hit.
1: unfortunately the image has a very restrictive copyright, so I refrain from highlighting the path / inlining the image.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
Improve this question
I have implemented the strlen() function in different ways, including SSE2 assembly, SSE4.2 assembly and SSE2 intrinsic, I also exerted some experiments on them, with strlen() in <string.h> and strlen() in glibc. However, their performance in terms of milliseconds (time) are unexpected.
My experiment environment:
CentOS 7.0 + gcc 4.8.5 + Intel Xeon
Following are my implementations:
strlen using SSE2 assembly
long strlen_sse2_asm(const char* src){
long result = 0;
asm(
"movl %1, %%edi\n\t"
"movl $-0x10, %%eax\n\t"
"pxor %%xmm0, %%xmm0\n\t"
"lloop:\n\t"
"addl $0x10, %%eax\n\t"
"movdqu (%%edi,%%eax), %%xmm1\n\t"
"pcmpeqb %%xmm0, %%xmm1\n\t"
"pmovmskb %%xmm1, %%ecx\n\t"
"test %%ecx, %%ecx\n\t"
"jz lloop\n\t"
"bsf %%ecx, %%ecx\n\t"
"addl %%ecx, %%eax\n\t"
"movl %%eax, %0"
:"=r"(result)
:"r"(src)
:"%eax"
);
return result;
}
2.strlen using SSE4.2 assembly
long strlen_sse4_2_asm(const char* src){
long result = 0;
asm(
"movl %1, %%edi\n\t"
"movl $-0x10, %%eax\n\t"
"pxor %%xmm0, %%xmm0\n\t"
"lloop2:\n\t"
"addl $0x10, %%eax\n\t"
"pcmpistri $0x08,(%%edi, %%eax), %%xmm0\n\t"
"jnz lloop2\n\t"
"add %%ecx, %%eax\n\t"
"movl %%eax, %0"
:"=r"(result)
:"r"(src)
:"%eax"
);
return result;
}
3. strlen using SSE2 intrinsic
long strlen_sse2_intrin_align(const char* src){
if (src == NULL || *src == '\0'){
return 0;
}
const __m128i zero = _mm_setzero_si128();
const __m128i* ptr = (const __m128i*)src;
if(((size_t)ptr&0xF)!=0){
__m128i xmm = _mm_loadu_si128(ptr);
unsigned int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(xmm,zero));
if(mask!=0){
return (const char*)ptr-src+(size_t)ffs(mask);
}
ptr = (__m128i*)(0x10+(size_t)ptr & ~0xF);
}
for (;;ptr++){
__m128i xmm = _mm_load_si128(ptr);
unsigned int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(xmm,zero));
if (mask!=0)
return (const char*)ptr-src+(size_t)ffs(mask);
}
}
I also looked up the one implemented in linux kernel, following is its implementation
size_t strlen_inline_asm(const char* str){
int d0;
size_t res;
asm volatile("repne\n\t"
"scasb"
:"=c" (res), "=&D" (d0)
: "1" (str), "a" (0), "" (0xffffffffu)
: "memory");
return ~res-1;
}
In my experience, I also added the one of standard library and compared their performance.
Followings are my main function code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>
#include <x86intrin.h>
#include <emmintrin.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>
int main()
{
struct timeval tpstart,tpend;
int i=0;
for(;i<1023;i++){
test_str[i] = 'a';
}
test_str[i]='\0';
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen(test_str);
gettimeofday(&tpend,NULL);
printf("strlen from stirng.h--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_inline_asm(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_inline_asm--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_sse2_asm(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_sse2_asm--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_sse4_2_asm(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_sse4_2_asm--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_sse2_intrin_align(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_sse2_intrin_align--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
return 0;
}
The result is : (ms)
strlen from stirng.h--->23.518000
strlen_inline_asm--->222.311000
strlen_sse2_asm--->782.907000
strlen_sse4_2_asm--->955.960000
strlen_sse2_intrin_align--->3499.586000
I have some questions about it:
Why strlen of string.h is so fast? I think its code should be identify to strlen_inline_asm because I copied the code from /linux-4.2.2/arch/x86/lib/string_32.c[http://lxr.oss.org.cn/source/arch/x86/lib/string_32.c#L164]
Why sse2 intrinsic and sse2 assembly are so different in performance?
Could someone help me how to disassembly the code so that I can see what has the function strlen of static library been transformed by the compiler? I used gcc -s but didn't find the disassembly of strlen from the <string.h>
I think my code may be not very well, I would be appreciate if you could help me improve my code, especially assembly ones.
Thanks.
Like I said in comments, your biggest error is benchmarking with -O0. I discussed exactly why testing with -O0 is a terrible idea in the first part of another post.
Benchmarks should be done with at least -O2, preferably with the same optimizations as your full project will build with, if you're trying to test test what source makes the fastest asm.
-O0 explains inline asm being way faster than C with intrinsics (or regular compiled C, for C strlen implementation borrowed from glibc).
IDK -O0 would still optimize away loop that discards the result of library strlen repeatedly, or if it somehow just avoided some other huge performance pitfall. It's not interesting to guess about exactly what happened in such a flawed test.
I tightened up your SSE2 inline-asm version. Mostly just because I've been playing with gcc inline asm input/output constraints recently, and wanted to see what it would look like if I wrote it to let the compiler choose which registers to use for temporaries, and avoided unneeded instructions.
The same inline asm works for 32 and 64-bit x86 targets; see this compiled for both on the Godbolt compiler explorer. When compiling to a stand-along function, it doesn't have to save/restore any registers even in 32bit mode:
WARNING: it can read past the end of the string by up to 15 bytes. This could segfault. See Is it safe to read past the end of a buffer within the same page on x86 and x64? for details on avoiding that: get to an alignment boundary, then use aligned loads because that's always safe if the vector contains at least 1 byte of string data. I left the code unchanged because it's interesting to discuss the effect of aligning pointers for SSE vs. AVX. Aligning pointers also avoids cache-line splits, and 4k page-splits (which are a performance pothole before Skylake).
#include <immintrin.h>
size_t strlen_sse2_asm(const char* src){
// const char *orig_src = src; // for a pointer-increment with a "+r" (src) output operand
size_t result = 0;
unsigned int tmp1;
__m128i zero = _mm_setzero_si128(), vectmp;
// A pointer-increment may perform better than an indexed addressing mode
asm(
"\n.Lloop:\n\t"
"movdqu (%[src], %[res]), %[vectmp]\n\t" // result reg is used as the loop counter
"pcmpeqb %[zerovec], %[vectmp]\n\t"
"pmovmskb %[vectmp], %[itmp]\n\t"
"add $0x10, %[res]\n\t"
"test %[itmp], %[itmp]\n\t"
"jz .Lloop\n\t"
"bsf %[itmp], %[itmp]\n\t"
"add %q[itmp], %q[res]\n\t" // q modifier to get quadword register.
// (add %edx, %rax doesn't work). But in 32bit mode, q gives a 32bit reg, so the same code works
: [res] "+r"(result), [vectmp] "=&x" (vectmp), [itmp] "=&r" (tmp1)
: [zerovec] "x" (zero) // There might already be a zeroed vector reg when inlining
, [src] "r"(src)
, [dummy] "m" (*(const char (*)[])src) // this reads the whole object, however long gcc thinks it is
: //"memory" // not needed because of the dummy input
);
return result;
// return result + tmp1; // doing the add outside the asm makes gcc sign or zero-extend tmp1.
// No benefit anyway, since gcc doesn't know that tmp1 is the offset within a 16B chunk or anything.
}
Note the dummy input, as an alternative to a "memory" clobber, to tell the compiler that the inline asm reads the memory pointed to by src, as well as the value of src itself. (The compiler doesn't know what the asm does; for all it knows the asm just aligns a pointer with and or something, so assuming that all input pointers are dereferenced would lead to missed optimizations from reordering / combining loads and stores across the asm. Also, this lets the compiler know we only read the memory, not modify it.) The GCC manual uses an example with this unspecified-length array syntax "m" (*(const char (*)[])src)
It should keep register pressure to a minimum when inlining, and doesn't tie up any special-purpose registers (like ecx which is needed for variable-count shifts).
If you could shave another uop out of the inner loop, it would be down to 4 uops that could issue at one per cycle. As it is, 5 uops means each iteration may take 2 cycles to issue from the frontend, on Intel SnB CPUs. (Or 1.25 cycles on later CPUs like Haswell, and maybe on SnB if I was wrong about the whole-number behaviour.)
Using an aligned pointer would allow the load to fold into a memory operand for pcmpeqb. (As well as being necessary for correctness if the string start is unaligned and the end is near the end of a page). Interestingly, using the zero-vector as the destination for pcmpeqb is ok in theory: you don't need to re-zero the vector between iterations, because you exit the loop if it's ever non-zero. It has 1-cycle latency, so turning the zero vector into a loop-carried dependency is only a problem when cache-misses delay an old iteration. Removing this loop-carried dependency chain might help in practice, though, by letting the back end go faster when catching up after a cache miss that delayed an old iteration.
AVX solves the problem completely (except for correctness if the string ends near the end of a page). AVX allows the load to be folded even without doing an alignment check first. 3-operand non-destructive vpcmpeqb avoids turning the zero vector into a loop-carried dependency. AVX2 would allow checking 32B at once.
Unrolling will help either way, but helps more without AVX. Align to a 64B boundary or something, and then load the whole cache line into four 16B vectors. Doing a combined check on the result of PORing them all together may be good, since pmovmsk + compare-and-branch is 2 uops.
Using SSE4.1 PTEST doesn't help (compared to pmovmsk / test / jnz) because it's 2 uops and can't macro-fuse the way test can.
PTEST can directly test for the whole 16B vector being all-zero or all-ones (using ANDNOT -> CF part), but not if one of the byte-elements is zero. (So we can't avoid pcmpeqb).
Have a look at Agner Fog's guides for optimizing asm, and the other links on the x86 wiki. Most optimization (Agner Fog's, and Intel's and AMD's) will mention optimizing memcpy and strlen specifically, IIRC.
If you read the source of the strlen function in the glibc, you can see that the function is not testing the string char by char, but longword by longword with complex bitwise operations : http://www.stdlib.net/~colmmacc/strlen.c.html. I guess it explains its speed, but the fact that it's even faster than rep instructions in assembly is indeed quite surprising.
Maybe (or rather not maybe) I do something in a wrong way but it seems that I can't get good performance from OpenCl kernel even though it runs significantly faster on GPU than in CPU.
Let me explain.
CPU kernel running time is ~100ms.
GPU kernel running time is ~8ms.
Above was measured using the clCreateCommandQueue with CL_QUEUE_PROFILING_ENABLE flag.
But, what is the problem is the time needed to call (enqueue) the kernel repeatedly.
200 kernel calls on CPU: ~19s
200 kernel calls on GPU: ~18s
Above was measured with calls to gettimeofday before and after the 200 loop. And just after the loop was the call to clFinish to wait until the 200 enqueued kernels are done.
Moreover, the time was measured only for enqueueing and executing the kernel, no data transfer from/to the kernel was involved.
Here is the loop:
size_t global_item_size = LIST_SIZE;
Start_Clock(&startTime);
for(int k=0; k<200; k++)
{
// Execute the OpenCL kernel on the list
ret = clEnqueueNDRangeKernel (command_queue, kernel, 1, NULL, &global_item_size, NULL, 0, NULL, &event);
}
clFinish(command_queue);
printf("] (in %0.4fs)\n", Stop_Clock(&startTime));
If 200 calls to the kernel take ~18s then it's completely irrelevant that the kernel on GPU is several times faster than on CPU...
What am I doing wrong?
EDIT
I made some additional tests and it seems that actually assigning the result of the computation to the output buffer is producing the overhead.
This kernel
__kernel void test_kernel(__global const float *A, __global const float *B, __global float *C)
{
// Get the index of the current element to be processed
int i = get_global_id(0);
// Do the work
C[i] = sqrt(sin(A[i]) + tan(B[i]));
}
executed 200 times has the timings as above. But if I change the C[i] line to
float z = sqrt(sin(A[i]) + tan(B[i]));
then this kernel takes 0.3s on CPU and 2.6s on GPU.
Interesting.
I wonder if it would be possible to speed up the execution by collecting the results in __local table and then assigning them to the output buffer C only in the execution of the last kernel call? (the kernel with the last global id, not the 200th enqueued kernel)