Storm topology performance hit when acking - apache-storm

I'm using this tool from yahoo to run some performance tests on my storm cluster -
https://github.com/yahoo/storm-perf-test
I notice an almost 10x performance hit when I turn acking on. Here are the details to reproduce the test -
Cluster -
3 supervisor nodes and 1 nimbus node. Each node is a c3.large.
With acking -
bin/storm jar storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --ack --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 141 0 1424707134585 0 0 0.0
WAITING 1 3 3 141 141 1424707154585 20000 24660 0.11758804321289062
WAITING 1 3 3 141 141 1424707174585 20000 17320 0.08258819580078125
RUNNING 1 3 3 141 141 1424707194585 20000 13880 0.06618499755859375
RUNNING 1 3 3 141 141 1424707214585 20000 21720 0.10356903076171875
RUNNING 1 3 3 141 141 1424707234585 20000 43220 0.20608901977539062
RUNNING 1 3 3 141 141 1424707254585 20000 35520 0.16937255859375
RUNNING 1 3 3 141 141 1424707274585 20000 33820 0.16126632690429688
Without acking -
bin/storm jar ~/target/storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 140 0 1424707374386 0 0 0.0
WAITING 1 3 3 140 140 1424707394386 20000 565460 2.6963233947753906
WAITING 1 3 3 140 140 1424707414386 20000 1530680 7.298851013183594
RUNNING 1 3 3 140 140 1424707434386 20000 3280760 15.643882751464844
RUNNING 1 3 3 140 140 1424707454386 20000 3308000 15.773773193359375
RUNNING 1 3 3 140 140 1424707474386 20000 4367260 20.824718475341797
RUNNING 1 3 3 140 140 1424707494386 20000 4489000 21.40522003173828
RUNNING 1 3 3 140 140 1424707514386 20000 5058960 24.123001098632812
The last two columns are the important ones: they show the number of tuples transferred and the throughput in MB/s.
Is this kind of performance hit expected with storm when we turn on acking? I'm using version 0.9.3 and no advanced networking.

There is always going to be a certain degree of performance degradation with acking enabled -- it's the price you pay for reliability. Throughput will ALWAYS be higher with acking disabled, but you have no guarantee if your data is processed or dropped on the floor. Whether that's a 10x hit like you're seeing, or significantly less, is a matter of tuning.
One important setting is topology.max.spout.pending, which allows you to throttle spouts so that only that many tuples are allowed "in flight" at any given time. That setting is useful for making sure downstream bolts don't get overwhelmed and start timing out tuples.
That setting also has no effect with acking disabled -- it's like opening the flood gates and dropping any data that overflows. So again, it will always be faster.
With acking enabled, Storm will make sure everything gets processed at least once, but you need to tune topology.max.spout.pending appropriately for your use case. Since every use case is different, this is a matter of trial and error. Set it too low, and you will have low throughput. Set it too high and your downstream bolts will get overwhelmed, tuples will time out, and you will get replays.
To illustrate, set maxSpoutPending to 1 and run the benchmark again. Then try 1000.
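For your own topologies (the perf-test tool exposes this knob as --maxSpoutPending), the same setting goes on the topology Config. Here is a minimal sketch against the 0.9.x API; the concrete values are just starting points to tune from:
```java
import backtype.storm.Config;

public class AckingTuning {
    // Build the conf you would pass to StormSubmitter.submitTopology(name, conf, topology).
    public static Config buildConf() {
        Config conf = new Config();
        conf.setNumWorkers(9);
        // Cap the number of un-acked tuples in flight per spout task
        // (topology.max.spout.pending). Too low starves throughput; too high
        // overwhelms the bolts, so tuples time out and get replayed.
        conf.setMaxSpoutPending(1000);
        // Optionally raise the tuple timeout while tuning (default is 30 seconds).
        conf.setMessageTimeoutSecs(60);
        return conf;
    }
}
```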
So yes, a 10x performance hit is possible without proper tuning. If data loss is okay for your use case, turn acking off. But if you need reliable processing, turn it on, tune for your use case, and scale horizontally (add more nodes) to reach your throughput requirements.

Related

Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then

Over the past few days our ES 7.4 cluster (4 nodes) has regularly been giving read timeouts and is getting slower and slower at running certain management commands. Before that it had been running for more than a year without any trouble. For instance, /_cat/nodes took 2 minutes to execute yesterday; today it is already taking 4 minutes. Server loads are low and memory usage seems fine; I'm not sure where to look further.
Using the opster.com online tool I got a hint that the management queue size is high; however, when running the suggested commands to investigate, I don't see anything out of the ordinary other than the command taking a long time to return a result:
$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
(curl progress meter: the 345-byte response took 2 minutes 52 seconds to arrive)
id active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA 1 0 4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg 1 0 4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ 5 0 4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w 1 0 4592766 elastic7-3
How can I debug this / solve this? Thanks in advance.
If you look at the output, your elastic7-2 node has 5 active requests in the management thread pool, which is really high: the management pool's maximum size is just 5, and it is used only for a small set of operations (management tasks, not search/index).
You can have a look at thread pools in Elasticsearch for further reading.
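To confirm that saturation, you can extend the same _cat/thread_pool query with the pool's maximum size and queue depth. A minimal sketch using only Java 11's built-in HttpClient and the node address from the question; the column names follow the _cat/thread_pool API:
```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ManagementPoolCheck {
    public static void main(String[] args) throws Exception {
        // Same _cat/thread_pool endpoint as above, but also asking for the pool's
        // maximum size and current queue depth.
        String url = "http://127.0.0.1:9201/_cat/thread_pool/management"
                + "?v&h=node_name,name,max,active,queue,rejected,completed";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // If 'active' sits at 'max' (5 by default for the management pool) and
        // 'queue' keeps growing, the management pool is saturated.
        System.out.println(response.body());
    }
}
```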

Elasticsearch Uneven Write Queue Distribution

I have a cluster with an even distribution of data that is routed based on the document _id, which is a random string. During normal operations, searching and writing to the cluster are evenly distributed. However, when bulk updating documents in the cluster for several minutes, only 1-2 nodes appear to be doing any work.
Here is what a bulk update operation looks like after several minutes of running (columns are queue depth, queue size, and node id):
q qs node_id
0 200 Wd5JFj4gRk-9pKL_Jubd3w
0 200 FQ86BI1ASUS0tu-XQMuk6w
0 200 dMeO029LSiqjwicm3YP8JA
0 200 b8zAduWdRyO7P9Lz7hSFBQ
0 200 K0o4v_mHRqSRNZWJpzvJPQ
224 200 HN1yQG_hRF2eiCyy_0Dpcg
0 200 GXsc0FKsSUemue-e1Cuzsg
0 200 LcDaZoipQA63UOg0_WHguA
0 200 PdKFe7nLRaCnEqECNLpFvg
0 200 glani3PYQ4qppwzvLQnjIQ
0 200 T9jqycccQ-a03YtUCGVy0w
As you can see, the HN1y node becomes very active while the other nodes go quiet. The total throughput of updates drops dramatically, and the only way to resolve it is to pause the bulk update operation, wait a minute, and resume. At that point the cycle repeats: distribution starts out even, and eventually one node appears to be doing all of the work.
How can a cluster get into a situation like this? Does this suggest there really is an uneven distribution, or is something else going on?

Jmeter interpreting results in simple terms

So I'm trying to test a website and interpret the Aggregate Report by "common sense" (I tried looking up the meaning of each result, but I cannot understand how they should be interpreted).
TEST 1
Thread Group: 1
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 645
- Median 645
- 90% Line 645
- Min 645
- Max 645
- Throughput 1.6/sec
So I am under the assumption that the first result is the best outcome.
TEST 2
Thread Group: 5
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 647
- Median 647
- 90% Line 647
- Min 643
- Max 652
- Throughput 3.5/sec
I am assuming TEST 2 result is not so bad, given that the results are near TEST 1.
TEST 3
Thread Group: 10
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 710
- Median 711
- 90% Line 739
- Min 639
- Max 786
- Throughput 6.2/sec
Given the dramatic difference, I am assuming that if 10 users concurrently requested the website, it would not perform well. How would this set of tests be interpreted in simple terms?
It is as simple as available resources.
Response times depend on many things; the following are critical factors:
Server Machine Resources (Network, CPU, Disk, Memory etc)
Server Machine Configuration (type of server, number of nodes, no. of threads etc)
Client Machine Resources (Network, CPU, Disk, Memory etc)
As you can see, it is mostly about how busy the server is responding to other requests and how busy the client machine is generating/processing the load (I assume you run all 10 users on a single machine).
The best way to find the actual reason is to monitor these resources using nmon on Linux, or perfmon or Task Manager on Windows (or any other monitoring tool), and compare the differences when you run 1, 5, and 10 users.
Theory aside, I assume it is taking longer because you are applying a sudden load, so the server is still busy processing the previous requests.
Are you running the client and server on the same machine? If yes, the system resources are being used both by the client threads (10 threads) and by the server threads.
Response time = time for the client to send the request to the server + server processing time + time for the server to send the response to the client.
In your case, one or more of these times may have increased.
If you have good bandwidth, then it is probably the server processing time.
Your results are confusing.
For thread counts of 5 and 10, you have reported the same number of samples: 1. It should be 1 sample (1 thread), 5 samples (5 threads), and 10 samples (10 threads). Your experiment has too few samples to conclude anything statistically. You should model your load so that the 1-thread load is sustained for a longer period before you ramp up to 5 and 10 threads. If you are running a small test to assess the scalability of your application, you could do something like:
1 thread - 15 mins
5 threads - 15 mins
10 threads - 15 mins
and record the observations for each 15-minute period. If your application really scales, it should maintain the same response time even under increased load.
Looking at your results, I don't see any issues with your application; nothing is varying much. Again, you don't have enough samples to draw a statistically relevant conclusion.

Cassandra Reading Benchmark with Spark

I'm doing a benchmark of Cassandra's read performance. In the setup step I created clusters with 1 / 2 / 4 EC2 instances as data nodes. I wrote one table with 100 million entries (a ~3 GB CSV file). Then I launch a Spark application which reads the data into an RDD using the spark-cassandra-connector.
I expected the following behavior: the more instances Cassandra uses (with the same number of instances on the Spark side), the faster the reads. With the writes everything seems correct (~2 times faster when the cluster is 2 times larger).
But in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My Benchmark Results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
ADDITIONAL TEST - WITH THE CASSANDRA-STRESS TOOL
I launched the cassandra-stress tool on the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with the following results:
Cluster size  Threads  Ops/sec  Time (s)
1             4        10146    30.1
1             8        15612    30.1
1             16       20037    30.2
1             24       24483    30.2
1             121      43403    30.5
1             913      50933    31.7
2             4        8588     30.1
2             8        15849    30.1
2             16       24221    30.2
2             24       29031    30.2
2             121      59151    30.5
2             913      73342    31.8
3             4        7984     30.1
3             8        15263    30.1
3             16       25649    30.2
3             24       31110    30.2
3             121      58739    30.6
3             913      75867    31.8
4             4        7463     30.1
4             8        14515    30.1
4             16       25783    30.3
4             24       31128    31.1
4             121      62663    30.9
4             913      80656    32.4
Results: with 4 or 8 threads the single-node cluster is as fast as or faster than the larger clusters!
Results as a diagram: the data series are the cluster sizes (1/2/3/4), the x-axis is the number of threads, and the y-axis is ops/sec.
Question: are these results cluster-wide, or is this a test against a single local node (and therefore the result of only one instance of the ring)?
Can someone give an explanation? Thank you!
I ran a similar test with a spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow, and in your environment it may well be the bottleneck. If you co-locate them, you benefit from data locality, since Spark will create the RDD partitions from the token ranges that are local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.
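For reference, a minimal Java sketch of that kind of read-and-count job using the spark-cassandra-connector's Java API; the connection host and the keyspace/table names (test.entries) are placeholders, not values from the question:
```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class CassandraReadBenchmark {
    public static void main(String[] args) {
        // spark.cassandra.connection.host points at one C* node; the connector
        // discovers the rest of the ring from there.
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-read-benchmark")
                .set("spark.cassandra.connection.host", "10.0.0.1"); // placeholder address

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read every row of test.entries as a string. With Spark workers co-located
        // on the C* nodes, each RDD partition is built from token ranges local to
        // that node, so the table data does not have to cross the network.
        JavaRDD<String> rows = javaFunctions(sc)
                .cassandraTable("test", "entries") // placeholder keyspace/table
                .map(row -> row.toString());

        long start = System.currentTimeMillis();
        System.out.println("row count = " + rows.count());
        System.out.println("read took " + (System.currentTimeMillis() - start) + " ms");

        sc.stop();
    }
}
```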

Improving Cassandra read time in my scenario

I'm testing a single-node DataStax Cassandra 2.0 installation with the default configuration, using a client written with Astyanax.
In my scenario there is one CF; each row contains a key (a natural number converted to a string) and one column that holds 1 kB of random text data.
The client inserts rows until the data size reaches 50 GB. It does this at a rate of 3000 req/sec, which is enough for me.
The next step is to read all of this data, in the same order as it was inserted. And here come the problems. Here is an example log produced by my program:
reads writes time (s) req/sec
99998 0 922,59 108
100000 0 508,51 196
100000 0 294,85 339
100000 0 195,99 510
100000 0 137,11 729
100000 0 105,48 948
100000 0 105,83 944
100000 0 76,05 1314
100000 0 71,94 1389
100000 0 63,34 1578
100000 0 63,91 1564
100000 0 65,69 1522
100000 0 1217,52 82
100000 0 725,67 137
100000 0 502,03 199
100000 0 342,17 292
100000 0 336,83 296
100000 0 332,56 300
100000 0 330,27 302
100000 0 359,74 277
100000 0 320,01 312
100000 0 369,02 270
100000 0 774,47 129
100000 0 564,81 177
100000 0 729,50 137
100000 0 656,28 152
100000 0 611,29 163
100000 0 589,29 169
100000 0 693,99 144
100000 0 658,12 151
100000 0 294,53 339
100000 0 126,81 788
100000 0 206,13 485
100000 0 924,29 108
The throughput is unstable, and rather low.
I'm interested in any help that may improve the read time.
I can also provide more information if needed.
Thanks for your help!
Kuba
I'm guessing you are doing your reads sequentially. If you do them in parallel you should be able to do many more operations per second.
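For example, here is a minimal sketch of issuing the Astyanax reads from a fixed thread pool; the column family name, the key scheme, and the concurrency level are assumptions based on the question's description:
```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

public class ParallelReader {
    // Assumed CF layout from the question: string row key, string column names.
    private static final ColumnFamily<String, String> CF =
            new ColumnFamily<>("my_cf", StringSerializer.get(), StringSerializer.get());

    // Read rows 0..rowCount-1 using 'concurrency' worker threads instead of one at a time.
    public static void readAll(Keyspace keyspace, long rowCount, int concurrency) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (long i = 0; i < rowCount; i++) {
            final String key = Long.toString(i); // keys were written as stringified numbers
            pool.submit(() -> {
                try {
                    ColumnList<String> row = keyspace.prepareQuery(CF)
                            .getKey(key)
                            .execute()
                            .getResult();
                    // ... process the row here ...
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        // In a real run you would also bound the number of queued tasks
        // (e.g. with a Semaphore) so the submit loop cannot outrun the readers.
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```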
Update to address single read latency:
Read latency can be affected by the following variables:
Is the row in memory (Memtable or Row cache)?
How many sstables is the row spread over?
How wide is the row?
How many columns need to be scanned past to find the column you are looking for?
Are you reading from the front or the end of the row?
Does the row have tombstones?
Are you using leveled or size-tiered compaction?
Are the sstables in the disk cache or not?
How many replicas does the coordinator need to wait for?
How many other requests is the node servicing at the same time?
network latency
disk latency (rotational)
disk utilization (queue size/await) -- can be affected by compaction
disk read ahead size
Java GC pauses
CPU utilization -- can be affected by compactions
Context switches
Are you in swap?
There are a number of tools that can help you answer these questions, some specific to Cassandra and others general system performance tools. Look in the Cassandra logs for GC pauses and dropped requests. Look at nodetool cfstats to see latency stats. Use nodetool cfhistograms to check latency distributions, the number of sstables hit per read, and the row size distribution. Use nodetool tpstats to check for dropped requests and queue sizes. You can also use tools like iostat and vmstat to see disk and system utilization stats.
