YCSB: understanding the performance output

I searched the site and found another question about this, but it has no answers.
I'm running the YCSB tool against a Cassandra cluster, and the output is:
[OVERALL], RunTime(ms), 302016.0 -> 05 mins 02 secs
[OVERALL], Throughput(ops/sec), 3311.0828565374018
[UPDATE], Operations, 499411
[UPDATE], AverageLatency(us), 2257.980987603397
[UPDATE], MinLatency(us), 389
[UPDATE], MaxLatency(us), 169380
[UPDATE], 95thPercentileLatency(ms), 4
[UPDATE], 99thPercentileLatency(ms), 8
[UPDATE], Return=0, 499411
[UPDATE], 0, 50039
[UPDATE], 1, 222610
[UPDATE], 2, 138349
[UPDATE], 3, 49465
and it continues like this up to about number 70. What does this mean? Are these the numbers of operations run during each second? That seems strange, because the test ran for more than 5 minutes, as you can see from the OVERALL entry.
Thank you for your time!

The output indicates:
The total execution time was 05 mins 02 secs.
The average throughput was 3311.0828565374018 ops/sec across all threads.
There were 499411 update operations.
The average, minimum, maximum, 95th- and 99th-percentile latencies are reported.
499411 operations returned a code of zero (all were successful; a non-zero return code indicates a failed operation).
50039 operations completed in less than 1ms.
222610 operations completed between 1 and 2ms.
138349 operations completed between 2 and 3 ms.
...and so on. The buckets will probably go up to 1000 ms.
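As a quick sanity check on that reading: the first three buckets account for 50039 + 222610 + 138349 = 410998 of the 499411 updates, i.e. roughly 82% completed in under 3 ms, which is consistent with the reported 95th-percentile latency of 4 ms.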
It is also possible to get a time-series for the latencies by adding the -p timeseries.granularity=2000 switch to the ycsb command.
More information is available in the documentation
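For example, a run command that enables the time-series output could look something like the line below; the cassandra-cql binding and workloada are just placeholders for whatever binding and workload you already use, and measurementtype=timeseries is the property the granularity setting applies to:
./bin/ycsb run cassandra-cql -P workloads/workloada -p measurementtype=timeseries -p timeseries.granularity=2000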

Related

Calculating average response time in JMeter

(Attached as image)
"In My summary report
Total Samplers = 11944
My total average response = 2494 milliseconds = 2.49 seconds.
What I understand from this is that the 11944 samplers were processed in an average of 2.49 seconds each. That means my test should actually have run for 11944 x 2.49 seconds = 82 hours, but it actually ran for about 15-20 minutes max.
So I am trying to understand: is the execution time reduced because of JMeter's parallel/multi-threaded execution, or am I understanding this the wrong way?
I want to know the average response time of a single request."
JMeter calculates response time as:
Sum of all Samplers response times
Divided by the number of samplers
basically, it's the arithmetic mean of all samplers' response times.
11944 x 2.49 / 3600 gives about 8.2 hours, and yes, this is how much time it would take to execute the test with a single user; the time reduces proportionally with the number of threads used.
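Spelling the arithmetic out: 11944 x 2.49 s = 29,741 s (roughly), and 29,741 / 3600 is about 8.26 hours, so the "82 hours" in the question is off by a factor of ten. Dividing that single-user figure by the number of threads gives the expected wall-clock time, as the 50-thread example in the next answer shows.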
More information:
Calculator class source code
JMeter Glossary
Understanding Your Reports: Part 2 - KPI Correlations
It depends on the number of threads you used.
For example, if you used 50 threads for 12K samples/requests and each request took an average of 2.5 seconds:
12000 * 2.5 / 50 / 60 = 10 minutes
(requests x average seconds per request / threads / seconds per minute)

JMeter: interpreting results in simple terms

So I'm trying to test a website and to interpret the aggregate report by "common sense" (I tried looking up the meaning of each result, but I cannot understand how the values should be interpreted).
TEST 1
Thread Group: 1
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 645
- Median 645
- 90% Line 645
- Min 645
- Max 645
- Throughput 1.6/sec
So I am under the assumption that the first result is the best outcome.
TEST 2
Thread Group: 5
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 647
- Median 647
- 90% Line 647
- Min 643
- Max 652
- Throughput 3.5/sec
I am assuming TEST 2 result is not so bad, given that the results are near TEST 1.
TEST 3
Thread Group: 10
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 710
- Median 711
- 90% Line 739
- Min 639
- Max 786
- Throughput 6.2/sec
Given the dramatic difference, I am assuming that if 10 users concurrently requested the website, it would not perform well. How would this set of tests be interpreted in simple terms?
It is as simple as available resources.
Response Times are dependent on many things and following are critical factors:
Server Machine Resources (Network, CPU, Disk, Memory etc)
Server Machine Configuration (type of server, number of nodes, no. of threads etc)
Client Machine Resources (Network, CPU, Disk, Memory etc)
As you can see, it is mostly about how busy the server is responding to other requests and how busy the client machine is generating/processing the load (I assume you run all 10 users on a single machine).
The best way to find the actual reason is to monitor these resources, using nmon on Linux and perfmon or Task Manager on Windows (or any other monitoring tool), and compare the differences when you run 1, 5 and 10 users.
Theory aside, I assume it is taking time because you are applying a sudden load, so the server is still busy processing the previous requests.
Are you using client and server on the same machine? If Yes, that would tell us that the system resources are utilized both for client threads (10 threads) and server threads.
Response Time = time for the client to send the request to the server + server processing time + time for the server to send the response back to the client.
In your case, one or more of these times might have increased.
If you have good bandwidth, then it is probably the server processing time.
Your results are confusing.
For thread counts of 5 and 10, you have reported the same number of samples - 1. It should be 1 sample for 1 thread, 5 for 5 threads and 10 for 10 threads. Your experiment has too few samples to conclude anything statistically. You should model your load so that the 1-thread load is sustained for a longer period before you ramp up to 5 and 10 threads. If you are running a small test to assess the scalability of your application, you could do something like
1 thread - 15 mins
5 threads - 15 mins
10 threads - 15 mins
and compare the observations for each 15-minute period. If your application really scales, it should maintain the same response time even under the increased load.
Looking at your results, I don't see any issues with your application; there is not much variation. Again, you don't have enough samples to draw a statistically relevant conclusion.

questions on time usage reported by SLURM

I have problems understanding the time usage report below:
1) why the times for job step 1 & 2 do not add up to the batch line?
2) what is the relationship between each column, especially for TotalCPU and CPUTime?
3) for time usage of the job, which one is best to report?
$ sacct -o JOBID,AllocCPUs,AveCPU,reqcpus,systemcpu,usercpu,totalcpu,cputime,cputimeraw -j 649176
       JobID  AllocCPUS     AveCPU  ReqCPUS  SystemCPU    UserCPU   TotalCPU    CPUTime CPUTimeRAW
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
649176               24                  24  00:02.047  01:06.896  01:08.943   00:23:36       1416
649176.batch         24   00:00:00       24  00:00.027  00:00.014  00:00.041   00:23:36       1416
649176.0             24   00:00:00       24  00:00.813  00:24.886  00:25.699   00:08:48        528
649176.1             24   00:00:18       24  00:01.207  00:41.996  00:43.203   00:14:24        864
1) why the times for job step 1 & 2 do not add up to the batch line?
The times reported for .batch under SystemCPU, UserCPU and TotalCPU are the time spent running the commands in the batch script itself, not counting the spawned processes [1]. CPUTime and CPUTimeRAW do count the spawned processes, and thus they add up across the lines corresponding to the job steps.
2) what is the relationship between each column, especially for TotalCPU and CPUTime?
TotalCPU is the sum of UserCPU and SystemCPU over all CPUs, while CPUTime is the elapsed time multiplied by the number of requested CPUs. The difference between the two is the time the CPUs spent doing nothing (neither in user mode nor in kernel mode), most of the time waiting for I/O [2].
3) for time usage of the job, which one is best to report?
It depends on what you want to show. Elapsed (which you did not show here) gives the "time to solution". CPUTimeRAW is what is most often accounted and paid for. The difference between CPUTime and TotalCPU gives information about the I/O overhead.
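As a rough check against the numbers above (assuming the relations in the previous answers hold exactly): the job-level TotalCPU is the sum of its steps, 00:00.041 + 00:25.699 + 00:43.203 = 01:08.943. CPUTimeRAW is 1416 s, and 1416 / 24 allocated CPUs = 59 s of elapsed time, so 24 x 59 s = 1416 s = 00:23:36, the reported CPUTime. The gap between CPUTime (1416 CPU-seconds) and TotalCPU (about 69 CPU-seconds) is the time the allocated CPUs sat idle.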
[1] From the man page
SystemCPU  The amount of system CPU time used by the job or job step. The format of the output is identical to that of the Elapsed field.
           NOTE: SystemCPU provides a measure of the task's parent process and does not include CPU time of child processes.
[2] https://en.wikipedia.org/wiki/CPU_time

Node.js HTTP with Redis only gets 6000 req/s

Testing the node_redis benchmark, it shows INCR doing more than 100000 ops/s:
$ node multi_bench.js
Client count: 5, node version: 0.10.15, server version: 2.6.4, parser: hiredis
INCR, 1/5 min/max/avg/p95: 0/ 2/ 0.06/ 1.00 1233ms total, 16220.60 ops/sec
INCR, 50/5 min/max/avg/p95: 0/ 4/ 1.61/ 3.00 648ms total, 30864.20 ops/sec
INCR, 200/5 min/max/avg/p95: 0/ 14/ 5.28/ 9.00 529ms total, 37807.18 ops/sec
INCR, 20000/5 min/max/avg/p95: 42/ 508/ 302.22/ 467.00 519ms total, 38535.65 ops/sec
Then I added Redis to a Node.js HTTP server:
var http = require("http"), server,
    redis_client = require("redis").createClient();

server = http.createServer(function (request, response) {
    response.writeHead(200, {
        "Content-Type": "text/plain"
    });

    redis_client.incr("requests", function (err, reply) {
        response.write(reply + '\n');
        response.end();
    });
}).listen(6666);

server.on('error', function (err) {
    console.log(err);
    process.exit(1);
});
Using the ab command to test, it only gets 6000 req/s:
$ ab -n 10000 -c 100 localhost:6666/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software:
Server Hostname: localhost
Server Port: 6666
Document Path: /
Document Length: 7 bytes
Concurrency Level: 100
Time taken for tests: 1.667 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1080000 bytes
HTML transferred: 70000 bytes
Requests per second: 6000.38 [#/sec] (mean)
Time per request: 16.666 [ms] (mean)
Time per request: 0.167 [ms] (mean, across all concurrent requests)
Transfer rate: 632.85 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:    12   16   3.2     15      37
Waiting:       12   16   3.1     15      37
Total:         13   17   3.2     16      37
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 17
90% 20
95% 23
98% 28
99% 34
100% 37 (longest request)
Finally, I tested a plain 'hello world' handler, and it reached 7K req/s:
Requests per second: 7201.18 [#/sec] (mean)
How can I profile this and figure out why adding Redis to the HTTP server loses some performance?
I think you have misinterpreted the result of the multi_bench benchmark.
First, this benchmark spreads the load over 5 connections, while you have only one in your node.js program. More connections mean more communication buffers (allocated on a per socket basis) and better performance.
Then, while a Redis server is able to sustain 100K op/s (provided you open several connections, and/or use pipelining), node.js and node_redis are not able to reach this level. The result of your run of multi_bench shows that when pipelining is not used, only 16K op/s are achieved.
Client count: 5, node version: 0.10.15, server version: 2.6.4, parser: hiredis
INCR, 1/5 min/max/avg/p95: 0/ 2/ 0.06/ 1.00 1233ms total, 16220.60 ops/sec
This result means that with no pipelining, and with 5 concurrent connections, node_redis is able to process 16K op/s globally. Please note that measuring a throughput of 16K op/s while only sending 20K ops (default value of multi_bench) is not very accurate. You should increase num_requests for better accuracy.
The result of your second benchmark is not so surprising: you add an HTTP layer (which is more expensive to parse than the Redis protocol itself), use only 1 connection to Redis while ab tries to open 100 concurrent connections to node.js, and you still get 6K op/s, only about 1.2K op/s less than the "hello world" HTTP server. What did you expect?
You could try to squeeze out a bit more performance by leveraging node.js clustering capabilities, as described in this answer.
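A minimal sketch of that clustering idea, assuming one worker per CPU core and one Redis connection per worker (port 6666 and the "requests" key are taken from the question; everything else is illustrative, not the linked answer's exact code):

var cluster = require("cluster");
var http = require("http");
var os = require("os");

if (cluster.isMaster) {
    // Fork one worker per CPU core; each worker gets its own event loop
    // and its own Redis connection, so the load is spread across cores.
    for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
    }
} else {
    var redis_client = require("redis").createClient();
    http.createServer(function (request, response) {
        redis_client.incr("requests", function (err, reply) {
            response.writeHead(200, { "Content-Type": "text/plain" });
            response.end(reply + "\n");
        });
    }).listen(6666); // workers share the listening port via the cluster module
}

Each worker parses HTTP and handles Redis replies independently, which is where the single-process version becomes the bottleneck.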

Doubling each number a number of times specified by the user

I am new to Hadoop and I am learning from a few examples. I am currently trying to pass in a file with random integers in it. I want each number to be doubled a number of times, based on a value specified by the user at runtime.
3536 5806 2545 249 485 5467 1162 8941 962 6457
665 6754 889 5159 3161 5401 704 4897 135 907
8111 1059 4971 5195 3031 630 6265 827 5882 9358
9212 9540 676 3191 4995 8401 9857 4884 8002 3701
931 875 6427 6945 5483 545 4322 5120 1694 2540
9039 5524 872 840 8730 4756 2855 718 6612 4125
Above is the file sample.
For example, when the user specifies at runtime
jar ~/dissertation/workspace/TestHadoop/src/DoubleNum.jar DoubleNum Integer Output 3
the output for, say, the first line will be
3536 * 8  5806 * 8  2545 * 8  249 * 8  485 * 8  5467 * 8  1162 * 8  8941 * 8  962 * 8  6457 * 8
because for each iteration the number is doubled, so for 3 iterations it is multiplied by 2^3. How can I achieve this using MapReduce?
For chaining one job into the next, check out:
Chaining multiple MapReduce jobs in Hadoop
Also, this may be a good time to learn about sequence files, as they provide an efficient way of passing data from one map/reduce job to another.
As for your particular problem, you don't need reducers here, so make it map-only by setting the number of reducers to zero. Sending the output to reducers would only incur extra network overhead. (However, be careful about the number of files you create over time; eventually the NameNode will not appreciate it. Each mapper will create one file.)
I understand that you are trying to use this as an example of perhaps something more complex, but in this case you can use a common optimization technique: if you find yourself wanting to chain one mapper-only task into another map/reduce job, you can squash the two mappers together. For example, instead of multiplying by 2, then by 2 again, then by 2 again in separate jobs, why not just do all the multiplications in the same mapper? Basically, if all your operations depend only on one number or line, you can apply the iterations within the same mapper, per record. This will reduce the amount of overhead significantly.
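To make that squashing concrete with the question's own numbers: doubling 3536 three times gives 3536 -> 7072 -> 14144 -> 28288, which is exactly 3536 x 2^3 = 3536 x 8. So a single mapper that multiplies each value by 2^k (where k is the user-supplied iteration count, 3 in the example command) produces the same output as chaining k doubling passes.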
