Redis - Benchmark vs reality - caching

I have a standalone Redis instance in production. Earlier, 8 instances of my application, each holding 64 Redis connections (8*64 = 512 in total), at a rate of 2000 QPS per instance, gave me a latency of < 10 ms (which I am fine with). Due to an increase in traffic, I had to increase the number of application instances to 16 while also decreasing the connection count per instance from 64 to 16 (16*16 = 256 in total). This was done after benchmarking with memtier_benchmark as below:
12 Threads
64 Connections per thread
2000 Requests per thread

ALL STATS
========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
------------------------------------------------------------------------
Sets            0.00          ---          ---      0.00000         0.00
Gets        79424.54       516.26     78908.28      9.90400      2725.45
Waits           0.00          ---          ---      0.00000          ---
Totals      79424.54       516.26     78908.28      9.90400      2725.45

16 Threads
16 Connections per thread
2000 Requests per thread

ALL STATS
========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
------------------------------------------------------------------------
Sets            0.00          ---          ---      0.00000         0.00
Gets        66631.87       433.11     66198.76      3.32800      2286.47
Waits           0.00          ---          ---      0.00000          ---
Totals      66631.87       433.11     66198.76      3.32800      2286.47
redis-benchmark gave similar results.
However, when I made this change in production (16*16), the latency shot back up to 60-70 ms. I thought the provisioned connection count was too low (which seemed unlikely) and went back to 64 connections per instance (16*64), which, as expected, increased the latency further. For now, I have half of my application instances hitting the Redis master and the other half connected to the slave, each with 64 connections (8*64 to the master, 8*64 to the slave), and this works for me (8-10 ms latency).
What could have gone wrong that the latency increased with 256 (16*16) connections but dropped with 512 (8*64) connections, even though the benchmark says otherwise? I agree not to fully trust the benchmark, but even as a guideline, these are polar opposite results.
Note:
1. The application and Redis are colocated, so there is no network latency. Memory used in Redis is about 40% and the fragmentation ratio is about 1.4. The application uses Jedis for connection pooling.
2. The latency does not include the overhead of a Redis miss; only the Redis round trip is considered.
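For reference, the per-instance connection count being discussed lives in the Jedis pool configuration of each application instance. Below is a minimal sketch of such a setup; the host, port, key and the specific pool values are illustrative assumptions, not the actual production settings.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolExample {
    public static void main(String[] args) {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(64);        // connections per application instance (illustrative value)
        config.setMaxIdle(64);         // keep established connections around instead of churning them
        config.setMinIdle(16);
        config.setMaxWaitMillis(50);   // how long a caller may block waiting for a free connection
        config.setTestOnBorrow(false); // avoid an extra PING round trip on every checkout

        // Illustrative host/port; pool and connection are closed automatically.
        try (JedisPool pool = new JedisPool(config, "127.0.0.1", 6379);
             Jedis jedis = pool.getResource()) {
            jedis.get("some-key");     // hypothetical key, just to exercise the pool
        }
    }
}

setMaxTotal is the per-instance cap that, multiplied by the number of application instances, gives the totals discussed above (8*64, 16*16, 16*64).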

Related

Why is the Impala Scan Node very slow (RowBatchQueueGetWaitTime)?

This query returns in 10 seconds most of the time, but occasionally it needs 40 seconds or more.
There are two executor nodes in the cluster, and there is no remarkable difference between the profiles of the two nodes; the following is one of them:
HDFS_SCAN_NODE (id=0):(Total: 39s818ms, non-child: 39s818ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.07
- AverageScannerThreadConcurrency: 1.47
- BytesRead: 563.73 MB (591111366)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 0
- BytesReadRemoteUnexpected: 0
- BytesReadShortCircuit: 0
- CachedFileHandlesHitCount: 0 (0)
- CachedFileHandlesMissCount: 560 (560)
- CollectionItemsRead: 0 (0)
- DecompressionTime: 1s501ms
- MaterializeTupleTime(*): 11s685ms
- MaxCompressedTextFileLength: 0
- NumColumns: 9 (9)
- NumDictFilteredRowGroups: 0 (0)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 56 (56)
- NumScannerThreadMemUnavailable: 0 (0)
- NumScannerThreadReservationsDenied: 0 (0)
- NumScannerThreadsStarted: 4 (4)
- NumScannersWithNoReads: 0 (0)
- NumStatsFilteredRowGroups: 0 (0)
- PeakMemoryUsage: 142.10 MB (149004861)
- PeakScannerThreadConcurrency: 2 (2)
- PerReadThreadRawHdfsThroughput: 151.39 MB/sec
- RemoteScanRanges: 1.68K (1680)
- RowBatchBytesEnqueued: 2.32 GB (2491334455)
- RowBatchQueueGetWaitTime: 39s786ms
- RowBatchQueuePeakMemoryUsage: 1.87 MB (1959936)
- RowBatchQueuePutWaitTime: 0.000ns
- RowBatchesEnqueued: 6.38K (6377)
- RowsRead: 73.99M (73994828)
- RowsReturned: 6.40M (6401849)
- RowsReturnedRate: 161.27 K/sec
- ScanRangesComplete: 56 (56)
- ScannerThreadsInvoluntaryContextSwitches: 99 (99)
- ScannerThreadsTotalWallClockTime: 1m10s
- ScannerThreadsSysTime: 630.808ms
- ScannerThreadsUserTime: 12s824ms
- ScannerThreadsVoluntaryContextSwitches: 1.25K (1248)
- TotalRawHdfsOpenFileTime(*): 9s396ms
- TotalRawHdfsReadTime(*): 3s789ms
- TotalReadThroughput: 11.70 MB/sec
Buffer pool:
- AllocTime: 1.240ms
- CumulativeAllocationBytes: 706.32 MB (740630528)
- CumulativeAllocations: 578 (578)
- PeakReservation: 140.00 MB (146800640)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 33.83 MB (35471360)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
We can see that RowBatchQueueGetWaitTime is very high, almost 40 seconds, but I cannot figure out why. Even granting that TotalRawHdfsOpenFileTime takes 9 seconds and TotalRawHdfsReadTime takes almost 4 seconds, I still cannot explain where the other 27 seconds are spent.
Can you suggest the possible issue and how I can solve it?
The threading model in the scan nodes is pretty complex because there are two layers of worker threads for scanning and I/O - I'll call them scanner threads and I/O threads. I'll go top down and call out some potential bottlenecks and how to identify them.
High RowBatchQueueGetWaitTime indicates that the main thread consuming from the scan is spending a lot of time waiting for the scanner threads to produce rows. One major source of variance can be the number of scanner threads - if the system is under resource pressure each query can get fewer threads. So keep an eye on AverageScannerThreadConcurrency to understand if that is varying.
The scanner threads would be spending their time doing a variety of things. The bulk of the time generally goes to:
1. Not running because the operating system scheduled a different thread.
2. Waiting for I/O threads to read data from the storage system.
3. Decoding data, evaluating predicates, and other work.
With #1 you would see a higher value for ScannerThreadsInvoluntaryContextSwitches and ScannerThreadsUserTime/ScannerThreadsSysTime much lower than ScannerThreadsTotalWallClockTime. If ScannerThreadsUserTime is much lower than MaterializeTupleTime, that would be another symptom.
With #3 you would see high ScannerThreadsUserTime and MaterializeTupleTime. It looks like there is a significant amount of CPU time going to that here, but not the bulk of the time.
To identify #2, I would recommend looking at TotalStorageWaitTime in the fragment profile to understand how much time threads actually spent waiting for I/O. I also added ScannerIoWaitTime in more recent Impala releases, which is more convenient since it's in the scanner profile.
If the storage wait time is high, there are a few things to consider:
If TotalRawHdfsOpenFileTime is high, it could be that opening the files is a bottleneck. This can happen on any storage system, including HDFS. See Why Impala spend a lot of time Opening HDFS File (TotalRawHdfsOpenFileTime)?
If TotalRawHdfsReadTime is high, reading from the storage system may be slow (e.g. if the data is not in the OS buffer cache or it is a remote filesystem like S3)
Other queries may be contending for I/O resources and/or I/O threads
I suspect in your case that the root cause is both slowness opening files for this query, and slowness opening files for other queries causing scanner threads to be occupied. Likely enabling file handle caching will solve the problem - we've seen dramatic improvements in performance on production deployments by doing that.
Another possibility worth mentioning is that the built-in JVM is doing some garbage collection - this could block some of the HDFS operations. We have some pause detection that logs messages when there is a JVM pause. You can also look at the /memz debug page, which I think has some GC stats. Or connect up other Java debugging tools.
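One way to "connect up other Java debugging tools", assuming the embedded JVM has been started with remote JMX enabled (that is an assumption, and the host and port below are placeholders), is to read the garbage collector beans over JMX and watch whether collection time jumps around the slow queries. A rough sketch:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPauseCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; requires the target JVM to expose remote JMX.
        String url = "service:jmx:rmi:///jndi/rmi://impalad-host:9999/jmxrmi";
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // All garbage collector MXBeans of the remote JVM.
            Set<ObjectName> gcBeans =
                conn.queryNames(new ObjectName("java.lang:type=GarbageCollector,*"), null);
            for (ObjectName name : gcBeans) {
                long count = (Long) conn.getAttribute(name, "CollectionCount");
                long timeMs = (Long) conn.getAttribute(name, "CollectionTime");
                System.out.printf("%s: %d collections, %d ms total%n",
                        name.getKeyProperty("name"), count, timeMs);
            }
        } finally {
            connector.close();
        }
    }
}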
ScannerThreadsVoluntaryContextSwitches: 1.25K (1248) means that there were 1248 situations where scan threads got "stuck" waiting for some external resource and were subsequently put to sleep.
Most likely that resource was disk I/O. That would explain the quite low average read speed (TotalReadThroughput: 11.70 MB/sec) despite a "normal" per-read throughput (PerReadThreadRawHdfsThroughput: 151.39 MB/sec).
EDIT
To increase performance, you may want to try:
enable short circuit reads (dfs.client.read.shortcircuit=true)
configure HDFS caching and alter the Impala table to use the cache
(Note that both are applicable only if you're running Impala against HDFS, not some sort of object store.)

What can be the reason for CPU load to NOT scale linearly with the number of traffic processing workers?

We are writing a Front End that is supposed to process a large volume of traffic (in our case it is Diameter traffic, but that may be irrelevant to the question). When a client connects, its server socket gets assigned to one of the Worker processes, which performs all the actual traffic processing. In other words, a Worker does all the work, and more Workers should be added when more clients get connected.
One would expect the CPU load per message to be the same for different numbers of Workers, because Workers are totally independent and serve different sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? Workers are independent processes and there is no inter-process communication between them. Also, each Worker is single-threaded.
The programming language is C++.
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores, 2-3 GHz.
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data it processes (or both) is not small, the reason could be the reduced effectiveness of the various caches: the locality over time with which a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
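To get a feel for the cache argument in isolation, here is a small single-threaded sketch (written in Java purely for illustration, not taken from the question's C++ code): it performs the same number of array accesses against working sets of different sizes, and the cost per access rises once the working set no longer fits in the CPU caches, which is roughly what competing workers do to each other's working sets.

public class CacheLocalityDemo {
    // Touches 'steps' elements of an int[] of the given size (a power of two)
    // with an odd stride, and returns the average nanoseconds per access.
    static double nsPerAccess(int size, int steps) {
        int[] data = new int[size];
        int mask = size - 1;
        int idx = 0;
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < steps; i++) {
            sum += data[idx];
            idx = (idx + 4099) & mask;   // odd stride: visits every slot, defeats pure streaming
        }
        long elapsed = System.nanoTime() - start;
        if (sum == 42) System.out.print(""); // keep the JIT from removing the loop
        return (double) elapsed / steps;
    }

    public static void main(String[] args) {
        int steps = 50_000_000;
        // Working sets of ~64 KB, ~4 MB and ~128 MB (run with e.g. -Xmx512m for the last one).
        for (int size : new int[] {1 << 14, 1 << 20, 1 << 25}) {
            System.out.printf("%,10d ints -> %.2f ns per access%n", size, nsPerAccess(size, steps));
        }
    }
}

The absolute numbers are machine dependent, but the per-access cost typically climbs at each size step, mirroring how adding workers can inflate CPU time per message even though no single worker does more work.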
I completely agree with Lutz Prechelt. Here I just want to add a method for investigating the issue, and the answer is perf.
perf is a performance analysis tool on Linux which collects both kernel and userspace events and provides some nice metrics. It's been widely used in my team to find bottlenecks in CPU-bound applications.
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache miss rate, which makes it easy to tune your program and see the effect.
I wrote an article about cache line effects and perf; you can read it for more details.

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2).
machine 2: an Oracle DB
As a performance metric I usually look at the number of processed messages per unit of time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they do not have enough to do.
What I expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of the values seem to be at any limit.
PS: for testing, of course, the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically need to measure INSIDE the application as well. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.
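As a cheap first cut before attaching a profiler, per-message timing inside the Java application can already split the time between the database round trip and local processing. A minimal sketch; Message, Processor and the method names are hypothetical stand-ins for the real application types:

public class MessageTimer {
    // Hypothetical stand-ins for the application's real types.
    public interface Message {}
    public interface Processor {
        void preProcess(Message m);
        void queryDatabase(Message m);  // the round trip to the Oracle DB on machine 2
        void postProcess(Message m);
    }

    private long dbNanos, totalNanos, messages;

    // Times one message end to end and the DB portion separately.
    public void handle(Message msg, Processor p) {
        long start = System.nanoTime();
        p.preProcess(msg);

        long dbStart = System.nanoTime();
        p.queryDatabase(msg);
        dbNanos += System.nanoTime() - dbStart;

        p.postProcess(msg);
        totalNanos += System.nanoTime() - start;
        messages++;
    }

    public void report() {
        System.out.printf("messages=%d, avg total=%.3f ms, avg in DB=%.3f ms%n",
                messages, totalNanos / 1e6 / messages, dbNanos / 1e6 / messages);
    }
}

If most of the per-message time turns out to sit in the DB call, the next place to look is inside Oracle, as suggested above.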

Cassandra on Amazon EC2, lots of IOWait

We have the following stats for a single-node Cassandra on an Amazon EC2/RightScale m1.large instance with 2 ephemeral disks in RAID 0 (7.6 GB total memory).
4 GB of RAM is allocated to the Cassandra heap; 800 MB is the heap NEW size.
The following stats are from OpsCenter Community 2.0:
Read Requests 285 to 340 per second
Write Requests 257 to 720 per second
OS Load 15.15 to 17.15
Write Request Latency 293 to 685 micros
OS Sent Network Traffic 18 MB to 30 MB per second
OS Received Network Traffic 22 MB to 34 MB per second
OS Disk Queue Size 23 to 26 requests
Read Requests Pending 8 to 20
Read Request Latency 69140 to 92885 micros
OS Disk latency 37 to 42 ms
OS Disk Throughput 12 to 14 Mb per second
Disk IOPs Reads 600 to 740 per second
Disk IOPs Writes 2 to 7 per second
IOWait 60 to 70 % CPU avg
Idle 24 to 30 % CPU avg
Rowcache is disabled.
Are the above stats satisfactory for the given configuration, or how could we tweak it to get less IOWait? We think we are experiencing a lot of IOWait; how could we tune this setup for the best results?
The read requests are mixed: some are from one super column family and one standard column family having more than a million keys, with a varying number of super columns (at most 14), a varying number of subcolumns per super column (from 1 to 10000), and a varying number of columns (at most 14) in the standard column family. The subcolumns are very thin, with 0-byte values and 8-byte names.
The process removes data from the super column family and writes the processed data to the standard one.
Would EBS disks work better on Amazon EC2?
I'm not positive whether you can easily tweak your config to get more disk performance, but using Snappy compression could help a good deal by making your app need to read less overall. It may also help to use the new composite key layout instead of supercolumns.
One thing I can say for sure: EBS will NOT work better. Stay away from it at all costs if you care about latency.

NodeJS on Ubuntu slow?

I just installed Ubuntu 10.10 server with NodeJS 0.4.6 using this guide: http://www.codediesel.com/linux/installing-node-js-on-ubuntu-10-04/ on my laptop:
Acer 5920G (Intel Core 2 Duo (2 GHz), 4 GB RAM)
After that I created a little test of how NodeJS would perform and wrote this little hello-world script:
var http = require('http');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/html'});
  res.write('Hello World');
  res.end();
}).listen(8000);
Now, to test the performance, I used ApacheBench on Windows with the following settings:
ab -r -c 1000 -n 10000 http://192.168.1.103:8000/
But the results are very low compared to http://zgadzaj.com/benchmarking-node-js-testing-performance-against-apache-php/
Server Software:
Server Hostname: 192.168.1.103
Server Port: 8000
Document Path: /
Document Length: 12 bytes
Concurrency Level: 1000
Time taken for tests: 23.373 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 760000 bytes
HTML transferred: 120000 bytes
Requests per second: 427.84 [#/sec] (mean)
Time per request: 2337.334 [ms] (mean)
Time per request: 2.337 [ms] (mean, across all concurrent requests)
Transfer rate: 31.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 1.3 1 28
Processing: 1236 2236 281.2 2327 2481
Waiting: 689 1522 169.5 1562 1785
Total: 1237 2238 281.2 2328 2484
Percentage of the requests served within a certain time (ms)
50% 2328
66% 2347
75% 2358
80% 2364
90% 2381
95% 2397
98% 2442
99% 2464
100% 2484 (longest request)
Anyone got a clue? (Compile, hardware problem, drivers, configuration, slow script)
Edit 4-17 14:04 GMT+1
I am testing the machine over a 1 Gbit local connection. When I ping it, I get 0 ms, so that should be fine I guess. When I run ApacheBench on my Windows 7 machine, its CPU goes to 100% :|
It seems like you are running the test over a medium with a high bandwidth-delay product; in your case, high latency (>1 s). Assuming a 1 s delay, a 100 Mbit link and 76 bytes per request, you need more than 150,000 requests in parallel to saturate it (100 Mbit/s × 1 s = 12.5 MB in flight; 12.5 MB / 76 bytes per request ≈ 165,000 requests).
First, test the latency (with ping or similar). Also, watch the CPU and network usage on all participating machines; this will give you an indication of the bottleneck in your tests. What are the benchmark results for an Apache webserver?
Also, it could be a hardware/driver problem. Watch dmesg on both machines. And although it's probably not the reason for this specific problem, don't forget to change the CPU frequency governor to performance on both machines!
