Redis Sets Performance Issue - performance

I am trying to benchmark my redis SUNION command.
While benchmarking one of the sets contains ~ 1000 elements and other contains ~10 elements.
The order of execution is around 0.52 ms per call.
Is this performance ideal or am I missing out on some tuning settings in the conf file.
I am trying to implement tag filtering on objects using basic sets operations.
For ex.
obj1 -> {id - 1 colour red location x}
obj1 -> {id - 1 colour red location x}
obj2 -> { id - 2 colour yellow location y}
obj3 -> { id - 3 clolour red location y}
in order to store i am using sets for storing object ids for each dimensions.hence
colour:red -> {1,3}
colour:yellow -> {2}
location:x -> {1}
location:y -> {2,3}
This enables me to expose apis on top of this like :
objects coloured red in location x
objects coloured red in any location
each of these actually translates into multiple sets operations for me using union intersection diff which i have implemented using pipelines.
Scale :
The max number of elements inside any set is very less ~ 5000. And latency is the main point consider. If there is any other way that i should take to achieve this kind of performance. Would be helpful.

I don't know if the performance is ideal (sounds great, less than 1ms is great in my book, but it really depends on your requirements).
I've performed the below test on my laptop:
$ for i in `seq 0 999`; do redis-cli sadd s1000 forbar$i; done
...
$ for i in `seq 0 9`; do redis-cli sadd s10 foobar$i; done
...
$ redis-benchmark SUNION s1000 s10
...
100.00% <= 56 milliseconds
1171.37 requests per second
$ redis-benchmark SUNION s1000
...
100.00% <= 57 milliseconds
1062.70 requests per second
$ redis-benchmark SMEMBERS s1000
100.00% <= 19 milliseconds
3300.33 requests per second
$ redis-benchmark SINTER s1000
100.00% <= 17 milliseconds
3311.26 requests per second
...
127.0.0.1:6379> INFO
# Server
redis_version:999.999.999
redis_git_sha1:4e5e0d37
redis_git_dirty:1
redis_build_id:abfc1e37fd7acbf7
redis_mode:standalone
os:Darwin 17.7.0 x86_64
arch_bits:64
multiplexing_api:kqueue
atomicvar_api:atomic-builtin
gcc_version:4.2.1
process_id:23596
run_id:73b0f2f6795eccb8286fc086c83d89da45314ce2
tcp_port:6379
uptime_in_seconds:55429
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:9775394
...
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,4
Processor Name: Intel Core i7
Processor Speed: 2.2 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB
Boot ROM Version: MBP114.0184.B00
SMC Version (system): 2.29f24
These results (and some digging into the code) show that:
SUNION is slower than just getting the members of a set (SMEMBERS uses the same implementation as SINTER).
Iterating and replying with 1000 elements takes at least 17-19ms
The difference between SUNION and the others is not in orders of magnitude, which is good, but perhaps requires some optimizations in the Redis code (at least for this edge case).

Related

How to get better performace in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now in 6.4-15) and we noticed a decay in performance whenever there is some heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). OSDs are hard drives (HDD) WD Gold or better (4~12 Tb). Nodes with 64/128 Gbytes RAM, dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench" getting stable 110 Mb/sec data transfer to each of them with +- 10 Mb/sec spread during normal operations. Apply/Commit Latency is normally below 55 ms with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs), we are trying to move to 10 Gbps but we found some trouble we are still trying to figure out how to solve (unstable OSDs disconnections).
The Pool is defined as "replicated" with 3 copies (2 needed to keep running). Now the total amount of disk space is 305 Tb (72% used), reweight is in use as some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Got some Ceph pointers that you could follow...
get some good NVMEs (one or two per server but if you have 8HDDs per server 1 should be enough) and put those as DB/WALL (make sure they have power protection)
the ceph tell osd.* bench is not that relevant for real world, I suggest to try some FIO tests see here
set OSD osd_memory_target to at 8G or RAM minimum.
in order to save some write on your HDD (data is not replicated X times) create your RBD pool as EC (erasure coded pool) but please do some research on that because there are some tradeoffs. Recovery takes some extra CPU calculations
All and all, hype-converged clusters are good for training, small projects and medium projects with not such a big workload on them... Keep in mind that planning is gold
Just my 2 cents,
B.

Latency of accessing main memory is almost the same order of sending a packet

Looking at Jeff Dean's famous latency guides
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
One thing which looks somewhat uncanny to me is the time taken to read 1MB sequentially from disk is only 10 times faster than sending a round trip packet across the Atlantic. Can anyone give me more intuition why this feels right.
Q : 1MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why this feels right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - CPU own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1 KB with Zippy PROCESS (+GHz,+SIMD,+multicore tricks)
20,000 ns - Send 2 KB over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
Trans-Atlantic Network RTT :
Global optical networks work roughly at a speed of light ( 300.000.000 m/s )
LA(CA)-AMS(NL) packet has to travel not the geodetical "distance", but over a set of continental and trans-atlantic "submarine" cables, the length of which is way longer ( see the map )
These factors do not "improve" - only the transport capacity is growing, with add-on latencies introduced in light-amplifiers, retiming units and other L1-PHY / L2-/L3-networking technologies are kept under control, as small as possible.
So the LA(CA)-AMS(NL) RTT will remain, using this technology, the same ~ 150 ms
Using other technology, LEO-Sat Cubes - as an example - the "distance" will only grow from ~ 9000 km P2P, by a pair of additional GND/LEO segments, plus by a few addition LEO/LEO hops, which introduce "longer" distance, add-on hop/hop re-processing latencies and capacity will not get any close to the current optical transports available, so no magic jump "back to the future" is to be expected ( we still miss the DeLorean ).
The HDD Disk :
HDD-s can have very fast and very short transport-path for moving the data, but the READ-ops have to wait for the physical / mechanical operations of the media-reading heads ( that takes most of the time here, not the actual data-transfer to the host RAM )
HDD-s are rotational devices, the disk has to "align" where to start the read, which costs the first about 10 [ms]
HDD-s devices store data into a static structure of heads( 2+, reading physical signals from the magnetic plates' surfaces ):cylinders( concentric circular zones on the plate, into which a cyl-aligned reading-head gets settled by disk-head micro-controller):sector( angular-sections of the cylinder, each carrying a block of the same sized data ~ 4KB, 8KB, ... )
These factors do not "improve" - all commodity produced drives remain at industry selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k }-spins/min (RPM). This means, that if a well-compacted data-layouts are maintained on such a disk, one continuous head:cylinder aligned reading round the whole cylinder will take:
>>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
11.1 ms per CYL # 5k4 RPM disk,
8.3 ms per CYL # 7k2 RPM disk,
6.0 ms per CYL # 10k RPM disk,
4.0 ms per CYL # 15k RPM disk,
3.3 ms per CYL # 18k RPM disk.
Data-density is also limited by the magnetic media properties. Spintronics R&D will bring some more densely stored data, yet the last 30 years have been well inside the limits of the reliable magnetic storage.
More is to be expected from a trick to co-parallel-read from several heads at-once, yet this goes against the design of the embedded microcontrollers, so most of the reading goes but sequentially, from one head after another, into the HDD-controller onboard buffers, best if no cyl-to-cyl heads mechanical re-alignment were to take place ( technically this depends on the prior data-to-disc layout, maintained by the O/S and possible care of disk-optimisers ( originally called disk disk-"compression", which just tried to re-align the known sequences of FAT-described data-blocks, so as to follow the most optimal trajectory of head:cyl:sector transitions, depending most on the actual device's head:head and cyl:cyl latencies ). So even the most optimistic data-layout takes ~ 13..21 [ms] to seek-and-read but one head:cyl-path
Laws of Physics decide
Some numbers from 2020.
Load from L1 is 4 cycles on Intel Coffee Lake and Ryzen (0.8nsec on a 5GHz CPU).
Load from memory is ~215 cycles on Intel Coffee Lake (43nsec on a 5GHz CPU). ~280 cycles on Ryzen.

How to interpret stats/show in Rebol 3

Wanted to make some profiling on a R3 script and was checking at the stats command.
But what do these informations mean?
How can it be used to monitor memory usage?
>> stats/show
Series Memory Info:
node size = 16
series size = 20
5 segs = 409640 bytes - headers
4888 blks = 812448 bytes - blocks
1511 strs = 86096 bytes - byte strings
2 unis = 86016 bytes - unicode strings
4 odds = 39216 bytes - odd series
6405 used = 1023776 bytes - total used
0 free / 14075 bytes - free headers / node-space
Pool[ 0] 8B 202/ 3328: 256 ( 6%) 13 segs, 26728 total
Pool[ 1] 16B 178/ 512: 256 (34%) 2 segs, 8208 total
Pool[ 2] 32B 954/ 2560: 512 (37%) 5 segs, 81960 total
...
Pool[26] 64B 0/ 0: 128 ( 0%) 0 segs, 0 total
Pools used 654212 of 1906200 (34%)
System pool used 497664
== 1023776
It shows the internal memory management information, not sure how useful it would be to the script.
Anyway, here are some explanations about the memory pools.
Most pools are for series (there is a dedicated pool for GOB!s, and some others if you're looking at Atronix source code), to make it simple, I will focus on series pools here.
Internally, a series has a header and its data which is a chunk of contiguous memory. The header has the width and length info about the series. The data holds the actual content of the series. In R3, Series is used extensively to implement block!, port!, string!, object!, etc. So managing memory in R3 is almost managing (allocating and destroying) series. Because of the difference in the width and length of serieses, pools are introduced to reduce the fragmentation.
When a new series is needed, the header is allocated in a special pool, and another pool is chosen for its data. The pool whose width is closed to the size of the series is chosen. E.g. a block with 3 elements will probably be allocated in a pool with width of 128-byte (on 32-bit systems, a block is a series with 4 (3 + 1 terminater) elements). As a pool could increase as the the program runs, it's implemented as a list of segments. New segments will be allocated and appended to the list as needed (but it's never released back to system).
Another special pool is the system pool, which is chosen when the required memory is big. R3 doesn't actually manage this pool other than collecting some statistics.
When it tries to collect garbage, it will sweep the root context, and mark everything that can be reachable, then it will go through the series header pool, and find out all unneeded serieses and destroy them.
If you use stats without a refinement, you can see the actual memory usage. So comparing memory usage before and after your implementations you can see which one uses less memory.
>> stats
== 1129824
>> s: make string! 1024
== ""
>> stats
== 1132064

How to compute the time of an in-place external merge sort?

The original problem is like this:
You are to sort 1PB size of integers ranging from -2^31 ~ 2^31 - 1 (int), you have 1024 machines each having 1TB disk space and 16GB memory space. Assume disk speed is 128MB/s (r/w) and memory speed is 8GB/s (r/w). Time for CPU can be ignored. Network transfer time can be ignored for simplicity. Compute the approximated time needed.
I know with external sort we can sort the 1TB data on a single machine in roughly 10hrs as computed like this:
Disk access (2r2w): 1T * 4 / 128MB/s = 2 ^ 15 sec ~ 9 hrs
Mem access:
sorting 2^48 Integers in 64 parts (2 ^ 42 each) roughly takes 1.3 min each. So totally 1.4 hr.
63 way merging takes several seconds, and thus is ignored.
But what about the next step: the combination of 1024T data? I have no idea how this is computed. So any help please?
2^31 is = 2 billion (2 "giga"). So you are looking at lot of duplicate numbers and fixed range. So consider Radix Sort ( http://en.wikipedia.org/wiki/Radix_sort ).
Each processor, for a subset od data) creates 'count' array (x[0] contains the count of 0s etc). Then you can merge all results into one array. Later you can "construct" the sorted array.

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked
machines? ("To process" refers to finding the single UTF-16 character
with the most occurrences in that 1TB file). The rate of data
transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and
each computer has a quad core CPU.
Below is my attempt at the question using an array of longs (with array size of 2^16) to keep track of the character count. This should fit into memory of a single machine, since 2^16 x 2^3 (size of long) = 2^19 = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency times cited by Jeff Dean, and I tried my best to use the best approximations that I knew of. The final answer is:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfr 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8bits/B x 1s/Gbit
= 8,192s = 137m = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill other machine's array w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^20B/2 x 2^2 x 2^3 bits/B x 1s/2^30bits
= 2^25/2^31s = (2^-6)s
d) Time to merge arrays
--> 0 sec(since it can be merge while receiving)
e) Total time
Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. 1 TiB file contains 550 Gsyms and because nothing is said about their expected freqency, you would need a count array of at least 64-bit integers (that is 8 bytes/element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes or just 512 KiB and not 4 GiB as you have miscalculated. It would only take ≈4.3 ms to send this data over 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes /less with jumbo frames but they are not widely supported/). Also this array size perfectly fits in the CPU cache.
You have grossly overestimated the time it would take to process the data and extract the frequency and you have also overlooked the fact that it can overlap disk reads. In fact it is so fast to update the frequency array, which resides in the CPU cache, that the computation time is negligible as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive and hence you would still need the full 5.8 hrs to read the data in the single machine case.
In fact, this is an exemple kind of data processing that neither benefits from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storages that can deliver many GB/s of aggregate read/write speeds.
You only need to send 0.8tb if your source machine is part of the 5.
It may not even make sense sending the data to other machines. Consider this:
In order to for the source machine to send the data it must first hit the disk in order to read the data into main memory before it send the data over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So under the assumption that loading to CPU cache is much less expensive than disk to memory or data over network (which is true, unless you are dealing with alien hardware), then you are better off just doing it on the source machine, and the only place splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1Tb file, with a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since it is sequential so you only cache miss once per piece of data.
The primary point here is that disk is the primary bottleneck which overshadows everything else.

Resources