Heavy disk I/O writes from a SELECT statement in Hive - Hadoop

In Hive I am running this query:
select ret[0],ret[1],ret[2],ret[3],ret[4],ret[5],ret[6] from (select combined1(extra) as ret from log_test1) a ;
Here ret[0], ret[1], ret[2], ... are domain, date, IP, etc. This query is doing heavy writes to disk.
iostat output on one of the boxes in the cluster:
avg-cpu: %user %nice %system %iowait %steal %idle
20.65 0.00 1.82 57.14 0.00 20.39
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 0.00 0.00 0.00 535.00 0.00 23428.00 87.58 143.94 269.11 0.00 269.11 1.87 100.00
My mapper is basically stuck in disk I/O. I have a 3-box cluster. My YARN configuration is:
Mapper memory (mapreduce.map.memory.mb) = 2 GB
I/O sort memory buffer = 1 GB
I/O sort spill percent = 0.8
The counters of my job are:
FILE: Number of bytes read 0
FILE: Number of bytes written 2568435
HDFS: Number of bytes read 1359720216
HDFS: Number of bytes written 19057298627
Virtual memory (bytes) snapshot 24351916032
Total committed heap usage (bytes) 728760320
Physical memory (bytes) snapshot 2039455744
Map input records 76076426
Input split bytes 2738
GC time elapsed (ms) 55602
Spilled Records 0
The mapper should initially write everything to RAM and only spill the data to disk when RAM (the I/O sort memory buffer) gets full. But as I am seeing, Spilled Records = 0, and the mapper is not even using its full RAM, yet there is still heavy disk write.
Even when I run the query
select combined1(extra) from log_test1;
I get the same heavy disk write I/O.
What can be the reason for this heavy disk write, and how can I reduce it? In this case, disk I/O is becoming the bottleneck for my mapper.

It may be that your subquery is being written to disk before the second stage of the processing takes place. Use EXPLAIN to examine the execution plan.
You could also try rewriting your subquery as a CTE: https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression
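For example (a sketch using the table and UDF names from the question; whether Hive materializes the CTE or merges it into the outer query depends on your Hive version and settings), the subquery could be rewritten as:

```sql
-- Same semantics as the original query, expressed as a CTE.
WITH a AS (
  SELECT combined1(extra) AS ret
  FROM log_test1
)
SELECT ret[0], ret[1], ret[2], ret[3], ret[4], ret[5], ret[6]
FROM a;
```

Running EXPLAIN on both forms will show whether an intermediate stage is being written out between the UDF and the final projection.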

Related

Children vs parent output in iozone

I'm failing to understand what the iozone benchmark outputs.
Here I'm launching a basic read test with 16 processes, each of them reading a 2048 kB file all at once.
I've aggressively disabled caching with echo 3 > /proc/sys/vm/drop_caches.
The results are the following:
Run began: Thu Apr 21 22:12:42 2022
File size set to 2048 kB
Record Size 2048 kB
Include close in write timing
Include fsync in write timing
Command line used: iozone -t 16 -s 2048 -r 2048 -ce -i 1
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 16 processes
Each process writes a 2048 kByte file in 2048 kByte records
Children see throughput for 16 readers = 1057899.00 kB/sec
Parent sees throughput for 16 readers = 559102.01 kB/sec
Min throughput per process = 0.00 kB/sec
Max throughput per process = 1057899.00 kB/sec
Avg throughput per process = 66118.69 kB/sec
Min xfer = 0.00 kB
Children see throughput for 16 re-readers = 948555.56 kB/sec
Parent sees throughput for 16 re-readers = 584476.30 kB/sec
Min throughput per process = 0.00 kB/sec
Max throughput per process = 948555.56 kB/sec
Avg throughput per process = 59284.72 kB/sec
Min xfer = 0.00 kB
I don't get why the 'children' bandwidth differs so much from the 'parent' bandwidth, nor why it seems that only one process was used (Min throughput per process is 0.00 kB/sec, and Avg throughput per process is exactly 'Children see throughput for 16 readers' divided by 16).
This SO question is roughly the same, but the only answer is a bit vague.

How can the rep stosb instruction execute faster than the equivalent loop?

How can the instruction rep stosb execute faster than this code?
Clear: mov byte [edi],AL ; Write the value in AL to memory
inc edi ; Bump EDI to next byte in the buffer
dec ecx ; Decrement ECX by one position
jnz Clear ; And loop again until ECX is 0
Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?
In modern CPUs, rep stosb's and rep movsb's microcoded implementation actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.
(Note this only applies to stos and movs, not repe cmpsb or repne scasb. Those are still slow, unfortunately: at best 2 cycles per byte on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.
See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)
rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See the Intel/AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets and it's optimal to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).
When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.
See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.
See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)
A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition was a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)
One major feature of so-called Fast String operations (rep movs and rep stos on Intel P6 and SnB-family CPUs) is that they avoid the read-for-ownership cache coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)
IDK how good AMD's implementation is.
(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)
See the optimization resources (esp. Agner Fog's guides) linked from the x86 tag wiki.
Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out-of-order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast string op, so any reordering is only visible between stores made by the string op, not between it and other stores. i.e. you still don't need sfence after rep movs.
So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.
For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.
Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.
See the code for an optimized memset/memcpy implementation for more info on what is done in practice. (e.g. Agner Fog's library).
If your CPU has the CPUID ERMSB bit set, then the rep movsb and rep stosb instructions are executed differently than on older processors.
See Intel Optimization Reference Manual, section 3.7.6 Enhanced REP MOVSB and REP STOSB operation (ERMSB).
Both the manual and my tests show that the benefits of rep stosb, compared to generic 32-bit register moves on a 32-bit CPU of the Skylake microarchitecture, appear only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code that you have shown (mov byte [edi],al; inc edi; dec ecx; jnz Clear) would be much faster, since the startup cost of rep stosb is very high - about 35 cycles. This speed difference has diminished on the Ice Lake microarchitecture, launched in September 2019, which introduced the Fast Short REP MOV (FSRM) feature. This feature can be tested with a CPUID bit. It was intended to make strings of 128 bytes and shorter quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented in 64-bit mode, not in 32-bit mode. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings in 64-bit mode; in 32-bit mode, strings have to be at least 4KB for rep movsb to start outperforming other methods.
To get the benefits of rep stosb on processors with the CPUID ERMSB bit, the following conditions should be met:
the destination buffer has to be aligned to a 16-byte boundary;
if the length is a multiple of 64, it can produce even higher performance;
the direction bit should be set "forward" (set by the cld instruction).
According to the Intel Optimization Manual, ERMSB begins to outperform memory stores via regular registers on Skylake when the length of the memory block is at least 128 bytes. As I wrote, ERMSB has a very high internal startup cost - about 35 cycles. ERMSB begins to clearly outperform other methods, including AVX copy and fill, when the length is more than 2048 bytes. However, this mainly applies to the Skylake microarchitecture and is not necessarily the case for other CPU microarchitectures.
On some processors, but not on others, when the destination buffer is 16-byte aligned, REP STOSB using ERMSB can perform better than SIMD approaches, i.e., when using MMX or SSE registers. When the destination buffer is misaligned, memset() performance using ERMSB can degrade by about 20% relative to the aligned case, for processors based on the Intel microarchitecture code name Ivy Bridge. In contrast, a SIMD implementation of memset() experiences only negligible degradation when the destination is misaligned, according to Intel's optimization manual.
Benchmarks
I've done some benchmarks. The code was filling the same fixed-size buffer many times, so the buffer stayed in cache (L1, L2 or L3, depending on the size of the buffer). The number of iterations was chosen so that the total execution time was about two seconds.
Skylake
On an Intel Core i5 6600 processor, released in September 2015 and based on the Skylake-S quad-core microarchitecture (3.30 GHz base frequency, 3.90 GHz max turbo frequency) with 4 x 32K L1 cache, 4 x 256K L2 cache and 6 MB L3 cache, I could obtain ~100 GB/sec on REP STOSB with 32K blocks.
The memset() implementation that uses REP STOSB:
1297920000 blocks of 16 bytes: 13.6022 secs 1455.9909 Megabytes/sec
0648960000 blocks of 32 bytes: 06.7840 secs 2919.3058 Megabytes/sec
1622400000 blocks of 64 bytes: 16.9762 secs 5833.0883 Megabytes/sec
817587402 blocks of 127 bytes: 8.5698 secs 11554.8914 Megabytes/sec
811200000 blocks of 128 bytes: 8.5197 secs 11622.9306 Megabytes/sec
804911628 blocks of 129 bytes: 9.1513 secs 10820.6427 Megabytes/sec
407190588 blocks of 255 bytes: 5.4656 secs 18117.7029 Megabytes/sec
405600000 blocks of 256 bytes: 5.0314 secs 19681.1544 Megabytes/sec
202800000 blocks of 512 bytes: 2.7403 secs 36135.8273 Megabytes/sec
101400000 blocks of 1024 bytes: 1.6704 secs 59279.5229 Megabytes/sec
3168750 blocks of 32768 bytes: 0.9525 secs 103957.8488 Megabytes/sec (!), i.e., 103 GB/s
2028000 blocks of 51200 bytes: 1.5321 secs 64633.5697 Megabytes/sec
413878 blocks of 250880 bytes: 1.7737 secs 55828.1341 Megabytes/sec
19805 blocks of 5242880 bytes: 2.6009 secs 38073.0694 Megabytes/sec
The memset() implementation that uses MOVDQA [RCX],XMM0:
1297920000 blocks of 16 bytes: 3.5795 secs 5532.7798 Megabytes/sec
0648960000 blocks of 32 bytes: 5.5538 secs 3565.9727 Megabytes/sec
1622400000 blocks of 64 bytes: 15.7489 secs 6287.6436 Megabytes/sec
817587402 blocks of 127 bytes: 9.6637 secs 10246.9173 Megabytes/sec
811200000 blocks of 128 bytes: 9.6236 secs 10289.6215 Megabytes/sec
804911628 blocks of 129 bytes: 9.4852 secs 10439.7473 Megabytes/sec
407190588 blocks of 255 bytes: 6.6156 secs 14968.1754 Megabytes/sec
405600000 blocks of 256 bytes: 6.6437 secs 14904.9230 Megabytes/sec
202800000 blocks of 512 bytes: 5.0695 secs 19533.2299 Megabytes/sec
101400000 blocks of 1024 bytes: 4.3506 secs 22761.0460 Megabytes/sec
3168750 blocks of 32768 bytes: 3.7269 secs 26569.8145 Megabytes/sec (!) i.e., 26 GB/s
2028000 blocks of 51200 bytes: 4.0538 secs 24427.4096 Megabytes/sec
413878 blocks of 250880 bytes: 3.9936 secs 24795.5548 Megabytes/sec
19805 blocks of 5242880 bytes: 4.5892 secs 21577.7860 Megabytes/sec
Please note that the drawback of using the XMM0 register is that it is 128 bits (16 bytes), while I could have used the YMM0 register of 256 bits (32 bytes). Anyway, stosb uses the non-RFO protocol. Intel x86 has had "fast strings" since the Pentium Pro (P6) in 1996. P6 fast strings took REP MOVSB and larger, and implemented them with 64-bit microcode loads and stores and a non-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge. See https://stackoverflow.com/a/33905887/6910868 for more details and the source.
Anyway, even if you compare just the two methods that I have provided, and even though the second method is far from ideal, you can see that on 64-byte blocks rep stosb is slower, but starting from 128-byte blocks rep stosb begins to outperform the other methods, and the difference is very significant starting from 512-byte blocks and longer, provided that you are clearing the same memory block again and again within the cache.
Therefore, for REP STOSB, the maximum speed was 103957 MB/s, while with MOVDQA [RCX],XMM0 it was just 26569 MB/s.
As you see, the highest performance was on 32K blocks, which is equal to 32K L1 cache of the CPU on which I've made the benchmarks.
Ice Lake
REP STOSB vs AVX-512 store
I have also done tests on an Intel i7 1065G7 CPU, released in August 2019 (Ice Lake/Sunny Cove microarchitecture), Base frequency: 1.3 GHz, Max Turbo frequency 3.90 GHz. It supports AVX512F instruction set. It has 4 x 32K L1 instruction cache and 4 x 48K data cache, 4x512K L2 cache and 8 MB L3 cache.
Destination alignment
On 32K blocks zeroized by rep stosb, performance was 175231 MB/s for a destination misaligned by 1 byte (e.g. $7FF4FDCFFFFF), quickly rose to 219464 MB/s for destinations aligned by 64 bytes (e.g. $7FF4FDCFFFC0), and then gradually rose to 222424 MB/sec for destinations aligned by 256 bytes (e.g. $7FF4FDCFFF00). After that, the speed did not rise further, even if the destination was aligned by 32KB (e.g. $7FF4FDD00000), and stayed at about 224850 MB/sec.
There was no difference in speed between rep stosb and rep stosq.
On buffers aligned by 32K, the speed of the AVX-512 store was exactly the same as for rep stosb for loops of 2 or more stores per iteration (227777 MB/sec), and did not grow when unrolling to 4 or even 16 stores. However, for a loop with just 1 store the speed was a little lower - 203145 MB/sec.
However, if the destination buffer was misaligned by just 1 byte, the speed of the AVX-512 store dropped dramatically, by more than 2 times, to 93811 MB/sec, in contrast to rep stosb on similarly misaligned buffers, which gave 175231 MB/sec.
Buffer Size
For 1K (1024-byte) blocks, AVX-512 (205039 MB/s) was 3 times faster than rep stosb (71817 MB/s).
For 512-byte blocks, AVX-512 performance was about the same as for larger block sizes (194181 MB/s), while rep stosb dropped to 38682 MB/s. At this block size, the difference was 5 times in favor of AVX-512.
For 2K (2048) blocks, AVX-512 had 210696 MB/s, while rep stosb had 123207 MB/s, almost twice as slow. Again, there was no difference between rep stosb and rep stosq.
For 4K (4096) blocks, AVX-512 had 225179 MB/s, while rep stosb had 180384 MB/s, almost catching up.
For 8K (8192) blocks, AVX-512 had 222259 MB/s, while rep stosb had 194358 MB/s - close!
For 32K (32768) blocks, AVX-512 had 228432 MB/s, rep stosb: 220515 MB/s - now at last! We are approaching the L1 data cache size of my CPU - 48KB. This is 220 gigabytes per second!
For 64K (65536) blocks, AVX-512 had 61405 MB/s, rep stosb: 70395 MB/s!
Such a huge drop when we run out of the L1 data cache! And it is evident that, from this point, rep stosb begins to outperform AVX-512 stores.
Now let's check the L2 cache size. For 512K blocks, AVX-512 made 62907 MB/s and rep stosb made 70653 MB/s. That's where rep stosb begins to outperform AVX-512. The difference is not yet significant, but the bigger the buffer, the bigger the difference.
Now let's take a huge buffer of 1 GB (1073741824 bytes). With AVX-512, the speed was 14319 MB/s; with rep stosb it was 27412 MB/s, i.e. twice as fast as AVX-512!
I've also tried to use the non-temporal instruction vmovntdq [rcx], zmm31 for filling the 32K buffers, but the performance was about 4 times slower than just vmovdqa64 [rcx], zmm31. How can I take advantage of vmovntdq when filling memory buffers? Is there some specific buffer size at which vmovntdq starts to pay off?
Also, if the destination buffers are aligned by at least 64 bytes, there is no performance difference between vmovdqa64 and vmovdqu64. Therefore, I do have a question: is the instruction vmovdqa64 only needed for debugging and safety when we have vmovdqu64?
Figure 1: Speed of iterative store to the same buffer, MB/s
block AVX stosb
----- ----- ------
0.5K 194181 38682
1K 205039 71817
2K 210696 123207
4K 225179 180384
8K 222259 194358
32K 228432 220515
64K 61405 70395
512K 62907 70653
1G 14319 27412
Summary on performance of multiple clearing the same memory block within the cache
rep stosb on Ice Lake CPUs begins to outperform AVX-512 stores only when repeatedly clearing the same memory buffer larger than the L1 data cache size, i.e. 48K on the Intel i7 1065G7 CPU. On small memory buffers, AVX-512 stores are much faster: for 1KB - 3 times faster, for 512 bytes - 5 times faster.
However, the AVX-512 stores are susceptible to misaligned buffers, while rep stosb is not as sensitive to misalignment.
Therefore, I have figured out that rep stosb begins to outperform AVX-512 stores only on buffers that exceed the L1 data cache size, or 48KB as in the case of the Intel i7 1065G7 CPU. This conclusion is valid at least on Ice Lake CPUs. An earlier Intel recommendation that string copy begins to outperform AVX copy starting from 2KB buffers should also be re-tested for newer microarchitectures.
Clearing different memory buffers, each only once
My previous benchmarks were filling the same buffer many times in a row. A better benchmark might be to allocate many different buffers and fill each buffer only once, so as not to interfere with the cache.
In this scenario, there is not much difference at all between rep stosb and AVX-512 stores. The only difference appears when the total data comes close to the physical memory limit, under Windows 10 64-bit. In the following benchmarks, the total data size was below 8 GB, with 16 GB of total physical RAM. When I allocated about 12 GB, performance dropped about 20 times, regardless of the method: Windows began discarding purged memory pages, and probably did other work as memory was about to run out. The 8 MB L3 cache of the i7 1065G7 CPU did not seem to matter for these benchmarks at all. All that matters is not coming close to the physical memory limit; how that situation is handled depends on your operating system. As I said, under Windows 10, taking half the physical memory was OK, but taking 3/4 of the available memory slowed my benchmark 20 times. I didn't even try to take more than 3/4. The total memory size is 16 GB; the amount available, according to Task Manager, was 12 GB.
Here is the benchmark of the speed of filling various blocks of memory totalling 8 GB with zeros (in MB/sec) on the i7 1065G7 CPU with 16 GB total memory, single-threaded. By "AVX" I mean "AVX-512" normal stores, and by "stosb" I mean "rep stosb".
Figure 2: Speed of store to the multiple buffers, once each, MB/s
block AVX stosb
----- ---- ----
0.5K 3641 2759
1K 4709 3963
2K 12133 13163
4K 8239 10295
8K 3534 4675
16K 3396 3242
32K 3738 3581
64K 2953 3006
128K 3150 2857
256K 3773 3914
512K 3204 3680
1024K 3897 4593
2048K 4379 3234
4096K 3568 4970
8192K 4477 5339
Conclusion on clearing memory that is not in the cache
If your memory does not exist in the cache, then the performance of AVX-512 stores and rep stosb is about the same when you need to fill memory with zeros. It is the cache that matters, not the choice between these two methods.
The use of non-temporal store to clear the memory which is not in the cache
I was zeroizing 6-10 GB of memory split into a sequence of buffers aligned to 64 bytes. No buffer was zeroized twice. Smaller buffers had some overhead, and I had only 16 GB of physical memory, so I zeroized less memory in total with smaller buffers. I used various tests with buffers starting from 256 bytes and up to 8 GB per buffer.
I took 3 different methods:
Normal AVX-512 store by vmovdqa64 [rcx+imm], zmm31 (a loop of 4 stores and then compare the counter);
Non-temporal AVX-512 store by vmovntdq [rcx+imm], zmm31 (same loop of 4 stores);
rep stosb.
For small buffers, the normal AVX-512 store was the winner. Then, starting from 4KB, the non-temporal store took the lead, while rep stosb still lagged behind.
Then, from 256KB, rep stosb outperformed the normal AVX-512 store, but not the non-temporal store, and from then on the situation didn't change. The winner was the non-temporal AVX-512 store, then came rep stosb, and then the normal AVX-512 store.
Figure 3. Speed of store to the multiple buffers, once each, MB/s by three different methods: normal AVX-512 store, nontemporal AVX-512 store and rep stosb.
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 2.90s, 2.30 GB/s by normal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by nontemporal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by rep stosb
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.06s, 2.62 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.02s, 2.65 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.66s, 2.18 GB/s by rep stosb
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.10s, 2.87 GB/s by normal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.37s, 2.64 GB/s by nontemporal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 4.85s, 1.83 GB/s by rep stosb
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.45s, 2.73 GB/s by normal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.79s, 2.48 GB/s by nontemporal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 4.83s, 1.95 GB/s by rep stosb
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by normal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 3.46s, 2.81 GB/s by nontemporal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by rep stosb
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.24s, 3.04 GB/s by normal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 2.65s, 3.71 GB/s by nontemporal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.35s, 2.94 GB/s by rep stosb
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.37s, 2.94 GB/s by normal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 2.73s, 3.63 GB/s by nontemporal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.53s, 2.81 GB/s by rep stosb
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.19s, 3.12 GB/s by normal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 2.64s, 3.77 GB/s by nontemporal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.44s, 2.90 GB/s by rep stosb
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.08s, 3.24 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 2.58s, 3.86 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.29s, 3.03 GB/s by rep stosb
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.22s, 3.10 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 2.49s, 4.01 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.26s, 3.07 GB/s by rep stosb
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.52s, 3.97 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 1.98s, 5.06 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.43s, 4.11 GB/s by rep stosb
Zeroized 10.00 GB: 20475 blocks of 512 KB for 2.15s, 4.65 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.70s, 5.87 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.81s, 5.53 GB/s by rep stosb
Zeroized 10.00 GB: 10238 blocks of 1 MB for 2.18s, 4.59 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.50s, 6.68 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.63s, 6.13 GB/s by rep stosb
Zeroized 10.00 GB: 5119 blocks of 2 MB for 2.02s, 4.96 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.59s, 6.30 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.54s, 6.50 GB/s by rep stosb
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.90s, 5.26 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.37s, 7.29 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.47s, 6.81 GB/s by rep stosb
Zeroized 9.99 GB: 1279 blocks of 8 MB for 2.04s, 4.90 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.51s, 6.63 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.56s, 6.41 GB/s by rep stosb
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.93s, 5.18 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.37s, 7.30 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.45s, 6.89 GB/s by rep stosb
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.95s, 5.11 GB/s by normal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.41s, 7.06 GB/s by nontemporal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.42s, 7.02 GB/s by rep stosb
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.85s, 5.38 GB/s by normal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.33s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.40s, 7.09 GB/s by rep stosb
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.99s, 4.96 GB/s by normal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.42s, 6.97 GB/s by nontemporal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.55s, 6.37 GB/s by rep stosb
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.83s, 5.32 GB/s by normal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.32s, 7.38 GB/s by nontemporal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.64s, 5.93 GB/s by rep stosb
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.89s, 5.02 GB/s by normal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.31s, 7.27 GB/s by nontemporal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.42s, 6.71 GB/s by rep stosb
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.76s, 5.13 GB/s by normal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.26s, 7.12 GB/s by nontemporal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.29s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.48s, 5.42 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.07s, 7.49 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.15s, 6.94 GB/s by rep stosb
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.48s, 5.40 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.08s, 7.40 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.14s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.50s, 5.35 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.07s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.21s, 6.63 GB/s by rep stosb
Avoiding AVX-SSE transition penalties
For all the AVX-512 code, I've used the ZMM31 register, because SSE registers go from 0 to 15, so the AVX-512 registers 16 to 31 do not have SSE counterparts and thus do not incur the transition penalty.

understanding iostat %utilization

I used the command below to test the limit of what throughput the disk can achieve:
dd if=/dev/zero of=test bs=4k count=25000 conv=fdatasync
With multiple runs it averaged out to about 130 MB/s.
Now, when running Cassandra on these systems, I am monitoring the disk usage using
iostat -dmxt 30 sdd sdb sdc
There are certain entries I want to make sure I am interpreting correctly, like the one below:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdc 0.00 2718.60 186.30 27.20 17.87 12.06 287.13 44.98 215.06 2.79 59.58
I had assumed that the sum of rMB/s + wMB/s, relative to the disk throughput of 130 MB/s, should roughly match %util. I assume some of the utilization goes toward seeks, but can that difference really be big enough to account for about 24% of utilization?
Thanks in advance for any help.
The frequent spins/seeks do take a significant amount of (latency) time; in my tests, the bandwidth difference between sequential I/O and random I/O is about 3x. Also, it is better to use fio (https://github.com/axboe/fio) to run this type of test, e.g. direct I/O, sequential read/write with a proper block size (256 KB or 512 KB, depending on the support from the controller), libaio as the I/O engine, and an I/O queue depth of 64. The test will then be much more controlled.
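A sketch of such an fio run (the file name and size are placeholders; all flags are standard fio options, and fio must be installed):

```shell
fio --name=seqread --filename=/tmp/fio.test --size=1G \
    --rw=read --direct=1 --ioengine=libaio --bs=256k --iodepth=64
```

To see the sequential-vs-random gap mentioned above, repeat the run with --rw=randread and compare the reported bandwidth.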

Insert statement with high buffer gets and high index contention

We have a table with over 300,000,000 rows and two single-column indexes. Every now and then the application comes to a halt. At the same time, there is high index contention on the insert statement for this table. I also noticed a large number of buffer gets. Can someone help me remedy this problem?
Here are statistics for the statement when the index contention is high and we are having performance issues.
Total Per Execution Per Row
Executions 51,857 1 1.00
Elapsed Time (sec) 3,270.67 0.06 0.06
CPU Time (sec) 1,554.41 0.03 0.03
Buffer Gets 140,844,228 2,716.01 2,716.01
Disk Reads 1,160 0.02 0.02
Direct Writes 0 0.00 0.00
Rows 51,857 1.00 1
Fetches 0 0.00 0.00
Same statement, same time range, similar workload.
Total Per Execution Per Row
Executions 94,424 1 1.00
Elapsed Time (sec) 30.41 <0.01 <0.01
CPU Time (sec) 12.90 <0.01 <0.01
Buffer Gets 1,130,297 11.97 11.97
Disk Reads 469 <0.01 <0.01
Direct Writes 0 0.00 0.00
Rows 94,424 1.00 1
Fetches 0 0.00 0.00
There are two ways to look at a primary index:
a way to do fast lookups for the most common queries
a way to speed up insertions (and possibly deletions)
Most people think of the primary index in the first sense, but there can be only one primary key, since it determines the actual disk order.
By having a sequence (or a timestamp) as the primary key, you are basically putting records very close together (on the same page), and you can get contention, as all inserts try to go to the same place.
If you instead use your primary key to distribute the data, you will have fewer insert collisions. It can pay to have a primary key on the most variable attribute (the one closest to a good distribution), even if that attribute is rarely queried; in fact, adding an extra column with a random value can be used.
There is not enough information about how you use the data, but it might pay to trade a bit of query time to avoid these collisions.
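As an illustration (hypothetical object names; assuming Oracle, since "buffer gets" suggests it), two common ways to spread sequence-driven inserts across index blocks are a reverse-key index or a hash-partitioned index:

```sql
-- Reverse-key index: byte-reverses key values so consecutive sequence
-- numbers land in different leaf blocks (trade-off: range scans on
-- the key can no longer use the index efficiently).
ALTER INDEX my_table_pk REBUILD REVERSE;

-- Hash-partitioned global index: spreads inserts that would all hit
-- the right-most leaf block across 8 index partitions.
CREATE INDEX my_table_ts_ix ON my_table (created_ts)
  GLOBAL PARTITION BY HASH (created_ts) PARTITIONS 8;
```

Both reduce hot-block contention at the cost of some lookup locality; which trade-off is acceptable depends on the query workload, which the question does not describe.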

Sphinx indexer speed

I have the following issue:
I am trying to index the same table, with the same data and the same config, on 2 servers.
The first is my local machine (not very good hardware).
The second is an Amazon EC2 general-purpose M1 medium instance.
The indexing results differ by almost a factor of two:
Local:
collected 84208 docs, 7.6 MB
sorted 54.9 Mhits, 100.0% done
total 84208 docs, 7646878 bytes
total 27.188 sec, 281252 bytes/sec, 3097.17 docs/sec
total 3 reads, 0.177 sec, 99835.0 kb/call avg, 59.2 msec/call avg
total 1013 writes, 7.735 sec, 679.4 kb/call avg, 7.6 msec/call avg
Amazon:
collected 84208 docs, 7.6 MB
sorted 54.9 Mhits, 100.0% done
total 84208 docs, 7646878 bytes
total 52.111 sec, 146740 bytes/sec, 1615.92 docs/sec
total 3 reads, 1.270 sec, 99833.9 kb/call avg, 423.4 msec/call avg
total 1010 writes, 6.980 sec, 680.8 kb/call avg, 6.9 msec/call avg
Does anybody have a clue what could be the reason for such results?
Do I need to set some specific option for the Sphinx server on Amazon?
