Impala - out of memory exception. Slow queries - performance

Can someone help me? I'm running a cluster of 5 Impala nodes for my API. Now I get a lot of 'out of memory' exceptions when I run queries:
Failed to get minimum memory reservation of 3.94 MB on daemon r5c3s4.colo.vm:22000 for query 924d155863398f6b:c4a3470300000000 because it would exceed an applicable memory limit. Memory is likely oversubscribed. Reducing query concurrency or configuring admission control may help avoid this error. Memory usage:
, node=[r4c3s2]
Process: Limit=55.00 GB Total=49.79 GB Peak=49.92 GB, node=[r4c3s2]
Buffer Pool: Free Buffers: Total=0, node=[r4c3s2]
Buffer Pool: Clean Pages: Total=4.21 GB, node=[r4c3s2]
Buffer Pool: Unused Reservation: Total=-4.21 GB, node=[r4c3s2]
Free Disk IO Buffers: Total=1.19 GB Peak=1.52 GB, node=[r4c3s2]
However, the daemon reports only 23.83 GB used of the 150.00 GB limit. The queries have also become really slow. This problem appeared out of nowhere. Does anyone have an explanation for it?
Here is all the memory information I got from the "/memz?detailed=true" page of one node:
Memory Usage
Memory consumption / limit: 23.83 GB / 150.00 GB
Breakdown
Process: Limit=150.00 GB Total=23.83 GB Peak=58.75 GB
Buffer Pool: Free Buffers: Total=72.69 MB
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=-71.94 MB
Free Disk IO Buffers: Total=1.61 GB Peak=1.67 GB
RequestPool=root.default: Total=20.77 GB Peak=59.92 GB
Query(2647a4f63d37fdaa:690ad3b500000000): Reservation=20.67 GB ReservationLimit=120.00 GB OtherMemory=101.21 MB Total=20.77 GB Peak=20.77 GB
Unclaimed reservations: Reservation=71.94 MB OtherMemory=0 Total=71.94 MB Peak=139.94 MB
Fragment 2647a4f63d37fdaa:690ad3b50000001c: Reservation=0 OtherMemory=114.48 KB Total=114.48 KB Peak=855.48 KB
AGGREGATION_NODE (id=9): Total=102.12 KB Peak=102.12 KB
Exprs: Total=102.12 KB Peak=102.12 KB
EXCHANGE_NODE (id=8): Total=0 Peak=0
DataStreamRecvr: Total=0 Peak=0
DataStreamSender (dst_id=10): Total=872.00 B Peak=872.00 B
CodeGen: Total=3.50 KB Peak=744.50 KB
Fragment 2647a4f63d37fdaa:690ad3b500000014: Reservation=0 OtherMemory=243.31 KB Total=243.31 KB Peak=1.57 MB
AGGREGATION_NODE (id=3): Total=102.12 KB Peak=102.12 KB
Exprs: Total=102.12 KB Peak=102.12 KB
AGGREGATION_NODE (id=7): Total=119.12 KB Peak=119.12 KB
Exprs: Total=119.12 KB Peak=119.12 KB
EXCHANGE_NODE (id=6): Total=0 Peak=0
DataStreamRecvr: Total=0 Peak=0
DataStreamSender (dst_id=8): Total=6.81 KB Peak=6.81 KB
CodeGen: Total=7.25 KB Peak=1.34 MB
Fragment 2647a4f63d37fdaa:690ad3b50000000c: Reservation=2.32 GB OtherMemory=349.48 KB Total=2.32 GB Peak=2.32 GB
AGGREGATION_NODE (id=2): Total=119.12 KB Peak=119.12 KB
Exprs: Total=119.12 KB Peak=119.12 KB
AGGREGATION_NODE (id=5): Reservation=2.32 GB OtherMemory=199.74 KB Total=2.32 GB Peak=2.32 GB
Exprs: Total=120.12 KB Peak=120.12 KB
EXCHANGE_NODE (id=4): Total=0 Peak=0
DataStreamRecvr: Total=336.00 B Peak=549.14 KB
DataStreamSender (dst_id=6): Total=6.44 KB Peak=6.44 KB
CodeGen: Total=15.85 KB Peak=3.10 MB
Fragment 2647a4f63d37fdaa:690ad3b500000004: Reservation=18.29 GB OtherMemory=100.52 MB Total=18.38 GB Peak=18.38 GB
AGGREGATION_NODE (id=1): Reservation=18.29 GB OtherMemory=334.12 KB Total=18.29 GB Peak=18.29 GB
Exprs: Total=148.12 KB Peak=148.12 KB
HDFS_SCAN_NODE (id=0): Total=100.17 MB Peak=178.15 MB
Exprs: Total=4.00 KB Peak=4.00 KB
DataStreamSender (dst_id=4): Total=6.75 KB Peak=6.75 KB
CodeGen: Total=9.72 KB Peak=2.92 MB
RequestPool=fe-eval-exprs: Total=0 Peak=12.00 KB
Untracked Memory: Total=1.44 GB
tcmalloc
------------------------------------------------
MALLOC: 24646559936 (23504.8 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 725840992 ( 692.2 MiB) Bytes in central cache freelist
MALLOC: + 4726720 ( 4.5 MiB) Bytes in transfer cache freelist
MALLOC: + 208077600 ( 198.4 MiB) Bytes in thread cache freelists
MALLOC: + 105918656 ( 101.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 25691123904 (24501.0 MiB) Actual memory used (physical + swap)
MALLOC: + 53904392192 (51407.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 79595516096 (75908.2 MiB) Virtual address space used
MALLOC:
MALLOC: 133041 Spans in use
MALLOC: 842 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
System
Physical Memory: 252.41 GB
Transparent Huge Pages Config:
enabled: always [madvise] never
defrag: [always] madvise never
khugepaged defrag: 1
Process and system memory metrics
Name Value Description
memory.anon-huge-page-bytes 19.01 GB Total bytes of anonymous (a.k.a. transparent) huge pages used by this process.
memory.mapped-bytes 113.09 GB Total bytes of memory mappings in this process (the virtual memory size).
memory.num-maps 18092 Total number of memory mappings in this process.
memory.rss 24.51 GB Resident set size (RSS) of this process, including TCMalloc, buffer pool and Jvm.
memory.thp.defrag [always] madvise never The system-wide 'defrag' setting for Transparent Huge Pages.
memory.thp.enabled always [madvise] never The system-wide 'enabled' setting for Transparent Huge Pages.
memory.thp.khugepaged-defrag 1 The system-wide 'defrag' setting for khugepaged.
memory.total-used 23.83 GB Total memory currently used by TCMalloc and buffer pool.
Buffer pool memory metrics
Name Value Description
buffer-pool.clean-page-bytes 0 Total bytes of clean page memory cached in the buffer pool.
buffer-pool.clean-pages 0 Total number of clean pages cached in the buffer pool.
buffer-pool.clean-pages-limit 12.00 GB Limit on number of clean pages cached in the buffer pool.
buffer-pool.free-buffer-bytes 72.69 MB Total bytes of free buffer memory cached in the buffer pool.
buffer-pool.free-buffers 177 Total number of free buffers cached in the buffer pool.
buffer-pool.limit 120.00 GB Maximum allowed bytes allocated by the buffer pool.
buffer-pool.reserved 20.67 GB Total bytes of buffers reserved by Impala subsystems
buffer-pool.system-allocated 20.67 GB Total buffer memory currently allocated by the buffer pool.
buffer-pool.unused-reservation-bytes 71.94 MB Total bytes of buffer reservations by Impala subsystems that are currently unused
JVM aggregate memory metrics
Name Value Description
jvm.total.committed-usage-bytes 1.45 GB Jvm total Committed Usage Bytes
jvm.total.current-usage-bytes 903.10 MB Jvm total Current Usage Bytes
jvm.total.init-usage-bytes 1.92 GB Jvm total Init Usage Bytes
jvm.total.max-usage-bytes 31.23 GB Jvm total Max Usage Bytes
jvm.total.peak-committed-usage-bytes 2.09 GB Jvm total Peak Committed Usage Bytes
jvm.total.peak-current-usage-bytes 1.48 GB Jvm total Peak Current Usage Bytes
jvm.total.peak-init-usage-bytes 1.92 GB Jvm total Peak Init Usage Bytes
jvm.total.peak-max-usage-bytes 31.41 GB Jvm total Peak Max Usage Bytes
JVM heap memory metrics
Name Value Description
jvm.heap.committed-usage-bytes 1.37 GB Jvm heap Committed Usage Bytes
jvm.heap.current-usage-bytes 827.25 MB Jvm heap Current Usage Bytes
jvm.heap.init-usage-bytes 2.00 GB Jvm heap Init Usage Bytes
jvm.heap.max-usage-bytes 26.67 GB Jvm heap Max Usage Bytes
jvm.heap.peak-committed-usage-bytes 0 Jvm heap Peak Committed Usage Bytes
jvm.heap.peak-current-usage-bytes 0 Jvm heap Peak Current Usage Bytes
jvm.heap.peak-init-usage-bytes 0 Jvm heap Peak Init Usage Bytes
jvm.heap.peak-max-usage-bytes 0 Jvm heap Peak Max Usage Bytes
JVM non-heap memory metrics
Name Value Description
jvm.non-heap.committed-usage-bytes 76.90 MB Jvm non-heap Committed Usage Bytes
jvm.non-heap.current-usage-bytes 75.68 MB Jvm non-heap Current Usage Bytes
jvm.non-heap.init-usage-bytes 2.44 MB Jvm non-heap Init Usage Bytes
jvm.non-heap.max-usage-bytes -1.00 B Jvm non-heap Max Usage Bytes
jvm.non-heap.peak-committed-usage-bytes 0 Jvm non-heap Peak Committed Usage Bytes
jvm.non-heap.peak-current-usage-bytes 0 Jvm non-heap Peak Current Usage Bytes
jvm.non-heap.peak-init-usage-bytes 0 Jvm non-heap Peak Init Usage Bytes
jvm.non-heap.peak-max-usage-bytes 0 Jvm non-heap Peak Max Usage Bytes

Process: Limit=150.00 GB Total=23.83 GB Peak=58.75 GB
This is caused by a memory limit being exceeded ("Memory limit exceeded").
Change the settings memory.soft_limit_in_bytes, memory.limit_in_bytes, mem_limit and default_pool_mem_limit to -1 or 0;
-1 or 0 represents unlimited.

Related

Possible eager mmap page eviction MacOS

I have a program which accesses a large memory block allocated with mmap. It accesses it unevenly: mostly the first ~1 GB of the mapping, sometimes the next ~2 GB, and rarely the last ~4 GB. The memory is a shared memory mapping with PROT_READ and PROT_WRITE, backed by an unlinked file.
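Roughly, the mapping is set up like the following minimal sketch (simplified; I'm assuming here that the backing file is created with mkstemp, unlinked, and sized with ftruncate, and the sizes are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    const size_t len = 7ULL << 30;            /* ~7 GB region, as described above */

    /* Create a file, unlink it immediately so it has no name, then size it. */
    char path[] = "/tmp/mapXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) { perror("mkstemp"); return 1; }
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); return 1; }

    /* Shared read/write mapping backed by the (now unlinked) file. */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Uneven access pattern: mostly the first ~1 GB of the mapping. */
    for (size_t i = 0; i < (1ULL << 30); i += 4096)
        p[i]++;

    munmap(p, len);
    close(fd);
    return 0;
}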
Compared to the Linux version, I've found the macOS version is exceedingly slow, yet memory pressure is low (6.42 used, 9.51 cached).
The following usage statistics originate from activity monitor:
"Memory": 1.17 GB
Real memory Size: 3.71 GB
Virtual Memory Size: 51.15 GB
Shared Memory Size: 440 KB
Private Memory Size: 3.74 GB
Why is this? Is there any way to improve the caching behavior?

Is it possible to reclaim Private_Clean pages?

I have a process that has to read a bunch of stuff from a mmap()ed file and then does memory intensive processing of some of this data (discarding the mmaped data as it gets processed). In my case, the mmaped file is from LMDB. After I mmap the file, I get something like this:
7fc32f29b000-7fc50c000000 rw-s 00000000 fc:02 75628978 /tmp/.tmp5SYf4y/data.mdb
Size: 7812500 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 285516 kB
Pss: 285516 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 285516 kB
Private_Dirty: 0 kB
Referenced: 285516 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 285516 kB
Let's say the process has only 300000 kB of physical RAM, limited by cgroups (and may or may not have swap available).
I understand that since all memory pages are Locked (pinned), they cannot be swapped out. After mmap()ing the file (i.e. reading from LMDB), the process then starts allocating more memory, beyond (physical RAM - Private_Clean) in the output above. Can these Private_Clean pages be evicted and reclaimed under the memory pressure created by the new allocations from the same process?
Answering my own question: the answer is yes, clean mmapped pages are reclaimed when there is memory pressure from within the same process.
To demonstrate this, I added a loop that allocates heap memory after the LMDB file has been mmapped. I ran the process under a cgroup limiting the RSS to the mmapped file size. /proc/<pid>/smaps then shows that the Rss of the mmapped file (all of it Private_Clean) starts dropping once the program enters the heap allocation loop, while the heap RSS grows correspondingly.
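Roughly, the demonstration looks like this (a minimal sketch; the file path, chunk size and sleep-based sampling are placeholders I made up, and the real test used the LMDB file under a cgroup memory limit):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "data.mdb";   /* placeholder path */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Read-only shared mapping of the data file, as LMDB uses. */
    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch every page so the whole file becomes resident (shows up as Rss in smaps). */
    volatile unsigned char sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096) sum += map[i];

    /* Now allocate and dirty heap memory. Under a cgroup memory limit, the clean
       file-backed pages above get reclaimed as the heap grows; this can be watched
       in /proc/<pid>/smaps while the loop sleeps. */
    for (size_t total = 0; total < (size_t)st.st_size; total += 64u << 20) {
        unsigned char *heap = malloc(64u << 20);
        if (!heap) break;
        memset(heap, 1, 64u << 20);   /* dirty the pages so they really consume RAM */
        sleep(1);                     /* leave time to sample smaps */
    }
    (void)sum;
    return 0;
}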

Spark Scratch Space

I have a cluster of 13 machines, each with 4 physical CPUs and 24 GB of RAM.
I started a Spark cluster with one driver and 12 slaves.
I set the number of cores per slave to 12, meaning I have a cluster as follows:
Alive Workers: 12
Cores in use: 144 Total, 110 Used
Memory in use: 263.9 GB Total, 187.0 GB Used
I started an application with the following configuration:
[('spark.driver.cores', '4'),
('spark.executor.memory', '15G'),
('spark.executor.id', 'driver'),
('spark.driver.memory', '5G'),
('spark.python.worker.memory', '1042M'),
('spark.cores.max', '96'),
('spark.rdd.compress', 'True'),
('spark.serializer.objectStreamReset', '100'),
('spark.executor.cores', '8'),
('spark.default.parallelism', '48')]
I understand there are 15 GB of RAM per executor with 8 task slots, and a parallelism of 48 (48 = 6 task slots * 12 slaves).
Then I have two big files on HDFS, 6 GB each (from a directory of 12 files of 5 blocks of 128 MB each), with a 3x replication factor.
I union these two files, so I think I get one dataframe of 12 GB, but I see a 37 GB input read in the UI:
That is the first question: why 37 GB?
Then, as the execution time is too long for me, I try to cache the data so that I can go faster. But the caching never finishes; here you can see it has already been running for 45 minutes (vs. 6 min without caching!):
So I try to understand why, and I look at the Memory/Disk usage in the storage section of the UI:
So some parts of the RDD are staying on disk.
Furthermore, I see the executors may still have free memory:
And I notice on the same "Storage" page that the size of the RDD has jumped:
Storage Level: Disk Serialized 1x Replicated
Cached Partitions: 72
Total Partitions: 72
Memory Size: 42.7 GB
Disk Size: 73.3 GB
=> I understand: Memory Size: 42.7 GB + Disk Size: 73.3 GB = 116 GB!
=> So my 12 GB of data has turned into 37 GB and then into 116 GB???
But I try to understand why there is still some memory left on my executors, so I go to the "err" dump of one of them, and I see:
18/02/08 11:04:08 INFO MemoryStore: Will not store rdd_50_46
18/02/08 11:04:09 WARN MemoryStore: Not enough space to cache rdd_50_46 in memory! (computed 1134.1 MB so far)
18/02/08 11:04:09 INFO MemoryStore: Memory use = 1641.6 KB (blocks) + 7.7 GB (scratch space shared across 6 tasks(s)) = 7.7 GB. Storage limit = 7.8 GB.
18/02/08 11:04:09 WARN BlockManager: Persisting block rdd_50_46 to disk instead.
And here I see that the executor wants to cache a 1641.6 KB block (only ~1.6 MB!) but can't, because there is a ["scratch space"] of 7.7 GB "shared across 6 tasks".
=> What is a "scratch space"?
=> The 6 tasks => comes from the parallelism of 48 / 12 = 6
And then I go back to the application information, and I see that the count that lasted 48 min read only 37 GB of data! (The 48 minutes are clearly spent caching the data too.)
When I do a count on the cached dataframe, I get a 116 GB input read:
And at the end of the day, the time saved by the cached count is not that impressive; here are 3 durations:
4.8 min: count on the cached df
48 min: count while caching
5.8 min: count on the non-cached df (read directly from HDFS)
So why is it so ?
Because the cached df is not that much cached :
Meaning more or less 40 GB in memory and 60 GB on disk.
I am surprised, because at 15 GB per executor * 12 slaves => 180 GB of memory, and I can cache only 40 GB... But in fact I remember that the memory is split:
30% for spark
54% for storage
16% for shuffle
So I understand that I do have 54% * 15 GB for storage per executor, i.e. 8.1 GB, meaning that of my 180 GB I only have about 97 GB for storage. Why do I then have roughly 97 - 40 = 57 GB not used?
Oops... this is a long post!
Plenty of questions... sorry.

Committed Bytes and Commit Limit - Memory Statistics

I'm trying to understand the actual difference between committed bytes and commit limit.
From the definitions below,
Commit Limit is the amount of virtual memory that can be committed
without having to extend the paging file(s). It is measured in bytes.
Committed memory is the physical memory which has space reserved on the disk paging files.
Committed Bytes is the amount of committed virtual memory, in bytes.
From my computer configuration, I see that my Physical Memory is 1991 MB, Virtual Memory (total paging file for all drives) is 1991 MB, and
Minimum Allowed is 16 MB, Recommended is 2986 MB, and Currently Allocated is 1991 MB.
But when I open perfmon and monitor Committed Bytes and Commit Limit, the numbers differ a lot. So what exactly are Committed Bytes and Commit Limit, and how are they determined?
Right now in my perfmon, Committed Bytes is at 3041 MB (sometimes it goes to 4000 MB as well) and Commit Limit is 4177 MB. So how are they calculated? Kindly explain. I've read a lot of documents but I still don't understand how this works.
Please help. Thanks.
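For reference, here is one way to sample the same counters programmatically (a sketch using the Win32 GlobalMemoryStatusEx and VirtualAlloc APIs; treating "commit limit minus available commit" as an approximation of the commit charge is my assumption):

#include <windows.h>
#include <stdio.h>

int main(void) {
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof(ms);
    if (!GlobalMemoryStatusEx(&ms)) return 1;

    /* ullTotalPageFile is the current commit limit (RAM plus paging files);
       ullAvailPageFile is how much more can still be committed, so their
       difference approximates the current commit charge ("Committed Bytes"). */
    printf("Physical RAM : %llu MB\n", ms.ullTotalPhys / (1024 * 1024));
    printf("Commit limit : %llu MB\n", ms.ullTotalPageFile / (1024 * 1024));
    printf("Committed    : %llu MB\n",
           (ms.ullTotalPageFile - ms.ullAvailPageFile) / (1024 * 1024));

    /* Reserving address space alone does not raise the commit charge... */
    void *reserved = VirtualAlloc(NULL, 512ull << 20, MEM_RESERVE, PAGE_READWRITE);
    /* ...but committing memory does, even before any page is touched. */
    void *committed = VirtualAlloc(NULL, 512ull << 20,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    GlobalMemoryStatusEx(&ms);
    printf("Committed after MEM_COMMIT of 512 MB: %llu MB\n",
           (ms.ullTotalPageFile - ms.ullAvailPageFile) / (1024 * 1024));

    VirtualFree(committed, 0, MEM_RELEASE);
    VirtualFree(reserved, 0, MEM_RELEASE);
    return 0;
}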

How can the rep stosb instruction execute faster than the equivalent loop?

How can the instruction rep stosb execute faster than this code?
Clear: mov byte [edi],AL ; Write the value in AL to memory
inc edi ; Bump EDI to next byte in the buffer
dec ecx ; Decrement ECX by one position
jnz Clear ; And loop again until ECX is 0
Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?
In modern CPUs, rep stosb's and rep movsb's microcoded implementation actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.
(Note this only applies to stos and movs, not repe cmpsb or repne scasb. They're still slow, unfortunately: at best 2 cycles per byte on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.
See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)
rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See the Intel/AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets and it's optimal to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).
When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.
See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.
See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)
A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition was a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)
One major feature of so-called Fast String operations (rep movs and rep stos) on Intel P6 and SnB-family CPUs is that they avoid the read-for-ownership cache coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)
IDK how good AMD's implementation is.
(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)
See the optimization resources (esp. Agner Fog's guides) linked from the x86 tag wiki.
Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out of order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast string op, so any reordering is only visible between stores made by the string op, not between it and other stores, i.e. you still don't need sfence after rep movs.
So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.
For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.
Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.
See the code for an optimized memset/memcpy implementation for more info on what is done in practice. (e.g. Agner Fog's library).
If your CPU has CPUID ERMSB bit, then rep movsb and rep stosb commands are executed differently than on older processors.
See Intel Optimization Reference Manual, section 3.7.6 Enhanced REP MOVSB and REP STOSB operation (ERMSB).
Both the manual and my tests show that the benefits of rep stosb compared to generic 32-bit register moves in 32-bit code on the Skylake microarchitecture appear only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code you have shown (mov byte [edi],al; inc edi; dec ecx; jnz Clear) would be much faster, since the startup cost of rep stosb is very high - about 35 cycles. However, this speed difference has diminished on the Ice Lake microarchitecture launched in September 2019, which introduced the Fast Short REP MOV (FSRM) feature. This feature can be tested via a CPUID bit. It was intended to make strings of 128 bytes and shorter quick, but in fact strings below 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit; on 32-bit, strings have to be at least 4 KB for rep movsb to start outperforming other methods.
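For reference, a rep stosb fill and the corresponding CPUID feature checks can be written in C roughly as follows (my sketch using GCC-style inline assembly and <cpuid.h>; this is not the benchmark code used for the numbers below):

#include <cpuid.h>
#include <stddef.h>
#include <stdio.h>

/* Fill len bytes at dst with val using rep stosb. The direction flag is assumed
   clear (the ABI default), which corresponds to the cld/"forward" requirement. */
static void fill_rep_stosb(void *dst, unsigned char val, size_t len) {
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(len)   /* RDI = destination, RCX = count */
                     : "a"(val)               /* AL  = byte value               */
                     : "memory");
}

int main(void) {
    unsigned eax, ebx, ecx, edx;
    /* __get_cpuid_count requires a reasonably recent GCC or Clang. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("ERMSB: %s\n", (ebx & (1u << 9)) ? "yes" : "no");  /* CPUID.7.0:EBX[9] */
        printf("FSRM : %s\n", (edx & (1u << 4)) ? "yes" : "no");  /* CPUID.7.0:EDX[4] */
    }

    static unsigned char buf[32 * 1024];      /* a 32K block, as in the benchmarks */
    fill_rep_stosb(buf, 0, sizeof buf);
    return 0;
}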
To get the benefits of rep stosb on the processors with CPUID ERMSB bit, the following conditions should be met:
the destination buffer has to be aligned to a 16-byte boundary;
if the length is a multiple of 64, it can produce even higher performance;
the direction bit should be set "forward" (set by the cld instruction).
According to the Intel Optimization Manual, ERMSB begins to outperform memory stores via regular registers on Skylake when the length of the memory block is at least 128 bytes. As I wrote, ERMSB has a high internal startup cost - about 35 cycles. ERMSB begins to clearly outperform other methods, including AVX copy and fill, when the length is more than 2048 bytes. However, this mainly applies to the Skylake microarchitecture and may not be the case for other CPU microarchitectures.
On some processors, but not others, REP STOSB using ERMSB can perform better than SIMD approaches (i.e. using MMX or SSE registers) when the destination buffer is 16-byte aligned. When the destination buffer is misaligned, memset() performance using ERMSB can degrade by about 20% relative to the aligned case on processors based on the Intel microarchitecture code named Ivy Bridge. In contrast, a SIMD implementation degrades less when the destination is misaligned, according to Intel's optimization manual.
Benchmarks
I've done some benchmarks. The code was filling the same fixed-size buffer many times, so the buffer stayed in cache (L1, L2 or L3, depending on the size of the buffer). The number of iterations was chosen so that the total execution time was about two seconds.
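The harness was essentially of the following shape (my simplified sketch; plain memset stands in for whichever fill method is under test, and the iteration count is taken from the 32K line below):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const size_t block = 32 * 1024;         /* block size under test, e.g. 32K    */
    const long long iters = 3168750;        /* chosen so the run lasts ~2 seconds */
    unsigned char *buf = aligned_alloc(64, block);   /* 64-byte-aligned destination */
    if (!buf) return 1;
    volatile unsigned char sink = 0;

    double t0 = now_sec();
    for (long long i = 0; i < iters; i++) {
        memset(buf, 0, block);              /* same buffer every time: stays in cache      */
        sink += buf[i & (block - 1)];       /* read a byte back so the fill is not elided  */
    }
    double secs = now_sec() - t0;

    double mbytes = (double)iters * block / (1024.0 * 1024.0);
    printf("%lld blocks of %zu bytes: %.4f secs %.4f Megabytes/sec\n",
           iters, block, secs, mbytes / secs);
    (void)sink;
    free(buf);
    return 0;
}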
Skylake
On an Intel Core i5 6600 processor, released in September 2015 and based on the Skylake-S quad-core microarchitecture (3.30 GHz base frequency, 3.90 GHz max turbo frequency) with 4 x 32K L1 cache, 4 x 256K L2 cache and 6 MB L3 cache, I could obtain ~100 GB/sec with REP STOSB on 32K blocks.
The memset() implementation that uses REP STOSB:
1297920000 blocks of 16 bytes: 13.6022 secs 1455.9909 Megabytes/sec
0648960000 blocks of 32 bytes: 06.7840 secs 2919.3058 Megabytes/sec
1622400000 blocks of 64 bytes: 16.9762 secs 5833.0883 Megabytes/sec
817587402 blocks of 127 bytes: 8.5698 secs 11554.8914 Megabytes/sec
811200000 blocks of 128 bytes: 8.5197 secs 11622.9306 Megabytes/sec
804911628 blocks of 129 bytes: 9.1513 secs 10820.6427 Megabytes/sec
407190588 blocks of 255 bytes: 5.4656 secs 18117.7029 Megabytes/sec
405600000 blocks of 256 bytes: 5.0314 secs 19681.1544 Megabytes/sec
202800000 blocks of 512 bytes: 2.7403 secs 36135.8273 Megabytes/sec
101400000 blocks of 1024 bytes: 1.6704 secs 59279.5229 Megabytes/sec
3168750 blocks of 32768 bytes: 0.9525 secs 103957.8488 Megabytes/sec (!), i.e., 103 GB/s
2028000 blocks of 51200 bytes: 1.5321 secs 64633.5697 Megabytes/sec
413878 blocks of 250880 bytes: 1.7737 secs 55828.1341 Megabytes/sec
19805 blocks of 5242880 bytes: 2.6009 secs 38073.0694 Megabytes/sec
The memset() implementation that uses MOVDQA [RCX],XMM0:
1297920000 blocks of 16 bytes: 3.5795 secs 5532.7798 Megabytes/sec
0648960000 blocks of 32 bytes: 5.5538 secs 3565.9727 Megabytes/sec
1622400000 blocks of 64 bytes: 15.7489 secs 6287.6436 Megabytes/sec
817587402 blocks of 127 bytes: 9.6637 secs 10246.9173 Megabytes/sec
811200000 blocks of 128 bytes: 9.6236 secs 10289.6215 Megabytes/sec
804911628 blocks of 129 bytes: 9.4852 secs 10439.7473 Megabytes/sec
407190588 blocks of 255 bytes: 6.6156 secs 14968.1754 Megabytes/sec
405600000 blocks of 256 bytes: 6.6437 secs 14904.9230 Megabytes/sec
202800000 blocks of 512 bytes: 5.0695 secs 19533.2299 Megabytes/sec
101400000 blocks of 1024 bytes: 4.3506 secs 22761.0460 Megabytes/sec
3168750 blocks of 32768 bytes: 3.7269 secs 26569.8145 Megabytes/sec (!) i.e., 26 GB/s
2028000 blocks of 51200 bytes: 4.0538 secs 24427.4096 Megabytes/sec
413878 blocks of 250880 bytes: 3.9936 secs 24795.5548 Megabytes/sec
19805 blocks of 5242880 bytes: 4.5892 secs 21577.7860 Megabytes/sec
Please note that the drawback of using the XMM0 register is that it is 128 bits (16 bytes), while I could have used the YMM0 register of 256 bits (32 bytes). Anyway, stosb uses the non-RFO protocol. Intel x86 has had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger and implemented them with 64-bit microcode loads and stores and a non-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge. See https://stackoverflow.com/a/33905887/6910868 for more details and the source.
Anyway, even if you compare just the two methods I have provided, and even though the second method is far from ideal, as you see, on 64-byte blocks rep stosb is slower, but starting from 128-byte blocks rep stosb begins to outperform the other method, and the difference is very significant from 512-byte blocks and longer, provided you are clearing the same memory block again and again within the cache.
Therefore, for REP STOSB the maximum speed was 103957 Megabytes per second, while with MOVDQA [RCX],XMM0 it was just 26569 Megabytes per second.
As you see, the highest performance was on 32K blocks, which matches the 32K L1 data cache of the CPU on which I ran the benchmarks.
Ice Lake
REP STOSB vs AVX-512 store
I have also done tests on an Intel i7 1065G7 CPU, released in August 2019 (Ice Lake/Sunny Cove microarchitecture), Base frequency: 1.3 GHz, Max Turbo frequency 3.90 GHz. It supports AVX512F instruction set. It has 4 x 32K L1 instruction cache and 4 x 48K data cache, 4x512K L2 cache and 8 MB L3 cache.
Destination alignment
On 32K blocks zeroized by rep stosb, performance ranged from 175231 MB/s for a destination misaligned by 1 byte (e.g. $7FF4FDCFFFFF), quickly rose to 219464 MB/s when aligned to 64 bytes (e.g. $7FF4FDCFFFC0), and then gradually rose to 222424 MB/s for destinations aligned to 256 bytes (e.g. $7FF4FDCFFF00). After that, the speed did not rise further even when the destination was aligned to 32 KB (e.g. $7FF4FDD00000); it was still about 224850 MB/s.
There was no difference in speed between rep stosb and rep stosq.
On buffers aligned to 32K, the speed of the AVX-512 store was exactly the same as for rep stosb for loops with at least 2 stores per iteration (227777 MB/s), and did not grow for loops unrolled to 4 or even 16 stores. However, for a loop with just 1 store the speed was a little lower - 203145 MB/s.
However, if the destination buffer was misaligned by just 1 byte, the speed of the AVX-512 store dropped dramatically, i.e. by more than 2 times, to 93811 MB/s, in contrast to rep stosb on similarly misaligned buffers, which gave 175231 MB/s.
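In intrinsic form, the aligned AVX-512 store loop discussed here is roughly the following (my sketch of the 4-stores-per-iteration variant; the actual benchmark used vmovdqa64 with ZMM31 in assembly):

#include <immintrin.h>
#include <stddef.h>

/* Zero a 64-byte-aligned buffer whose length is a multiple of 256 bytes,
   using four aligned 64-byte AVX-512 stores per iteration.
   Compile with -mavx512f. */
static void fill_avx512_aligned(void *dst, size_t len) {
    const __m512i zero = _mm512_setzero_si512();
    char *p = (char *)dst;
    for (size_t i = 0; i < len; i += 256) {
        _mm512_store_si512((__m512i *)(p + i),       zero);
        _mm512_store_si512((__m512i *)(p + i + 64),  zero);
        _mm512_store_si512((__m512i *)(p + i + 128), zero);
        _mm512_store_si512((__m512i *)(p + i + 192), zero);
    }
}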
Buffer Size
For 1K (1024-byte) blocks, AVX-512 (205039 MB/s) was about 3 times faster than rep stosb (71817 MB/s).
For 512-byte blocks, AVX-512 performance was about the same as for larger block sizes (194181 MB/s), while rep stosb dropped to 38682 MB/s. At this block size, the difference was 5 times in favour of AVX-512.
For 2K (2048-byte) blocks, AVX-512 had 210696 MB/s, while rep stosb had 123207 MB/s, almost twice as slow. Again, there was no difference between rep stosb and rep stosq.
For 4K (4096-byte) blocks, AVX-512 had 225179 MB/s, while rep stosb had 180384 MB/s - almost catching up.
For 8K (8192-byte) blocks, AVX-512 had 222259 MB/s, while rep stosb had 194358 MB/s - close!
For 32K (32768-byte) blocks, AVX-512 had 228432 MB/s and rep stosb 220515 MB/s - now, at last! We are approaching the L1 data cache size of my CPU - 48K. This is 220 gigabytes per second!
For 64K (65536-byte) blocks, AVX-512 had 61405 MB/s and rep stosb 70395 MB/s!
Such a huge drop when we ran out of the L1 data cache! And it is evident that, from this point, rep stosb begins to outperform AVX-512 stores.
Now let's check the L2 cache size. For 512K blocks, AVX-512 made 62907 MB/s and rep stosb made 70653 MB/s. That's where rep stosb begins to outperform AVX-512. The difference is not yet significant, but the bigger the buffer, the bigger the difference.
Now let's take a huge buffer of 1 GB (1073741824 bytes). With AVX-512, the speed was 14319 MB/s; with rep stosb it was 27412 MB/s, i.e. twice as fast as AVX-512!
I also tried to use non-temporal instructions for filling the 32K buffers, vmovntdq [rcx], zmm31, but the performance was about 4 times slower than plain vmovdqa64 [rcx], zmm31. How can I take advantage of vmovntdq when filling memory buffers? Does the buffer have to be of some specific size for vmovntdq to give an advantage?
Also, if the destination buffers are aligned by at least 64 bits, there is no performance difference between vmovdqa64 and vmovdqu64. So I have a question: is the instruction vmovdqa64 only needed for debugging and safety, given that we have vmovdqu64?
Figure 1: Speed of iterative store to the same buffer, MB/s
block AVX stosb
----- ----- ------
0.5K 194181 38682
1K 205039 71817
2K 210696 123207
4K 225179 180384
8K 222259 194358
32K 228432 220515
64K 61405 70395
512K 62907 70653
1G 14319 27412
Summary on performance of repeatedly clearing the same memory block within the cache
rep stosb on Ice Lake CPUs begins to outperform AVX-512 stores only when repeatedly clearing a memory buffer larger than the L1 data cache, i.e. 48K on the Intel i7 1065G7 CPU. On small memory buffers, AVX-512 stores are much faster: for 1 KB, 3 times faster; for 512 bytes, 5 times faster.
However, AVX-512 stores are susceptible to misaligned buffers, while rep stosb is not as sensitive to misalignment.
Therefore, I have found that rep stosb begins to outperform AVX-512 stores only on buffers that exceed the L1 data cache size, i.e. 48 KB in the case of the Intel i7 1065G7 CPU. This conclusion is valid at least for Ice Lake CPUs. The earlier Intel recommendation that string copy begins to outperform AVX copy starting from 2 KB buffers should also be re-tested for newer microarchitectures.
Clearing different memory buffers, each only once
My previous benchmarks were filling the same buffer many times in a row. A better benchmark is to allocate many different buffers and fill each buffer only once, so that the cache does not skew the results.
In this scenario, there is not much difference at all between rep stosb and AVX-512 stores. The only thing that made a difference was whether the total data came close to the physical memory limit (under Windows 10 64-bit in my case). In the following benchmarks, the total data size was below 8 GB with 16 GB of total physical RAM. When I allocated about 12 GB, performance dropped about 20 times, regardless of the method: Windows began to discard purged memory pages and probably did other work as memory was about to run out. The 8 MB L3 cache of the i7 1065G7 CPU did not seem to matter for these benchmarks at all. All that matters is not coming close to the physical memory limit, and how that situation is handled depends on your operating system. As I said, under Windows 10, taking just half of physical memory was fine, but taking 3/4 of the available memory slowed my benchmark about 20 times. I didn't even try to take more than 3/4. As I said, the total memory size is 16 GB; the amount available, according to Task Manager, was 12 GB.
Here is a benchmark of the speed of filling various blocks of memory totalling 8 GB with zeros (in MB/sec) on the i7 1065G7 CPU with 16 GB total memory, single-threaded. By "AVX" I mean normal AVX-512 stores, and by "stosb" I mean rep stosb.
Figure 2: Speed of store to multiple buffers, once each, MB/s
block AVX stosb
----- ---- ----
0.5K 3641 2759
1K 4709 3963
2K 12133 13163
4K 8239 10295
8K 3534 4675
16K 3396 3242
32K 3738 3581
64K 2953 3006
128K 3150 2857
256K 3773 3914
512K 3204 3680
1024K 3897 4593
2048K 4379 3234
4096K 3568 4970
8192K 4477 5339
Conclusion on clearing the memory within the cache
If your memory is not already in the cache, then the performance of AVX-512 stores and rep stosb is about the same when you need to fill memory with zeros. It is the cache that matters, not the choice between these two methods.
The use of non-temporal store to clear the memory which is not in the cache
I was zeroizing 6-10 GB of memory split into a sequence of buffers aligned to 64 bytes. No buffer was zeroized twice. Smaller buffers had some overhead, and since I had only 16 GB of physical memory, I zeroized less memory in total with smaller buffers. I used various tests with buffers from 256 bytes up to 8 GB per buffer.
I took 3 different methods:
Normal AVX-512 store by vmovdqa64 [rcx+imm], zmm31 (a loop of 4 stores and then compare the counter);
Non-temporal AVX-512 store by vmovntdq [rcx+imm], zmm31 (same loop of 4 stores; see the sketch after this list);
rep stosb.
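The non-temporal variant (the second item above) corresponds roughly to this sketch (my assumption: _mm512_stream_si512, which compiles to vmovntdq, with an sfence at the end to order the weakly-ordered stores before whatever follows):

#include <immintrin.h>
#include <stddef.h>

/* Zero a 64-byte-aligned buffer (length a multiple of 256 bytes) with
   non-temporal AVX-512 stores that bypass the cache. Compile with -mavx512f. */
static void fill_avx512_nt(void *dst, size_t len) {
    const __m512i zero = _mm512_setzero_si512();
    char *p = (char *)dst;
    for (size_t i = 0; i < len; i += 256) {
        _mm512_stream_si512((__m512i *)(p + i),       zero);
        _mm512_stream_si512((__m512i *)(p + i + 64),  zero);
        _mm512_stream_si512((__m512i *)(p + i + 128), zero);
        _mm512_stream_si512((__m512i *)(p + i + 192), zero);
    }
    _mm_sfence();   /* make the NT stores globally visible before later stores */
}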
For small buffers, the normal AVX-512 store was the winner. Then, starting from 4KB, the non-temporal store took the lead, while rep stosb still lagged behind.
Then, from 256 KB, rep stosb outperformed the normal AVX-512 store, but not the non-temporal store, and from that point on the situation did not change. The winner was the non-temporal AVX-512 store, then came rep stosb, and then the normal AVX-512 store.
Figure 3: Speed of store to multiple buffers, once each, by three different methods: normal AVX-512 store, non-temporal AVX-512 store and rep stosb.
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 2.90s, 2.30 GB/s by normal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by nontemporal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by rep stosb
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.06s, 2.62 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.02s, 2.65 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.66s, 2.18 GB/s by rep stosb
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.10s, 2.87 GB/s by normal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.37s, 2.64 GB/s by nontemporal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 4.85s, 1.83 GB/s by rep stosb
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.45s, 2.73 GB/s by normal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.79s, 2.48 GB/s by nontemporal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 4.83s, 1.95 GB/s by rep stosb
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by normal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 3.46s, 2.81 GB/s by nontemporal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by rep stosb
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.24s, 3.04 GB/s by normal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 2.65s, 3.71 GB/s by nontemporal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.35s, 2.94 GB/s by rep stosb
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.37s, 2.94 GB/s by normal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 2.73s, 3.63 GB/s by nontemporal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.53s, 2.81 GB/s by rep stosb
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.19s, 3.12 GB/s by normal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 2.64s, 3.77 GB/s by nontemporal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.44s, 2.90 GB/s by rep stosb
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.08s, 3.24 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 2.58s, 3.86 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.29s, 3.03 GB/s by rep stosb
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.22s, 3.10 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 2.49s, 4.01 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.26s, 3.07 GB/s by rep stosb
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.52s, 3.97 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 1.98s, 5.06 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.43s, 4.11 GB/s by rep stosb
Zeroized 10.00 GB: 20475 blocks of 512 KB for 2.15s, 4.65 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.70s, 5.87 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.81s, 5.53 GB/s by rep stosb
Zeroized 10.00 GB: 10238 blocks of 1 MB for 2.18s, 4.59 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.50s, 6.68 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.63s, 6.13 GB/s by rep stosb
Zeroized 10.00 GB: 5119 blocks of 2 MB for 2.02s, 4.96 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.59s, 6.30 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.54s, 6.50 GB/s by rep stosb
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.90s, 5.26 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.37s, 7.29 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.47s, 6.81 GB/s by rep stosb
Zeroized 9.99 GB: 1279 blocks of 8 MB for 2.04s, 4.90 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.51s, 6.63 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.56s, 6.41 GB/s by rep stosb
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.93s, 5.18 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.37s, 7.30 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.45s, 6.89 GB/s by rep stosb
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.95s, 5.11 GB/s by normal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.41s, 7.06 GB/s by nontemporal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.42s, 7.02 GB/s by rep stosb
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.85s, 5.38 GB/s by normal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.33s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.40s, 7.09 GB/s by rep stosb
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.99s, 4.96 GB/s by normal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.42s, 6.97 GB/s by nontemporal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.55s, 6.37 GB/s by rep stosb
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.83s, 5.32 GB/s by normal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.32s, 7.38 GB/s by nontemporal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.64s, 5.93 GB/s by rep stosb
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.89s, 5.02 GB/s by normal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.31s, 7.27 GB/s by nontemporal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.42s, 6.71 GB/s by rep stosb
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.76s, 5.13 GB/s by normal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.26s, 7.12 GB/s by nontemporal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.29s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.48s, 5.42 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.07s, 7.49 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.15s, 6.94 GB/s by rep stosb
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.48s, 5.40 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.08s, 7.40 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.14s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.50s, 5.35 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.07s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.21s, 6.63 GB/s by rep stosb
Avoiding AVX-SSE transition penalties
For all the AVX-512 code I used the ZMM31 register, because the SSE registers are XMM0 to XMM15, so the AVX-512 registers 16 to 31 have no SSE counterparts and thus do not incur the transition penalty.
