I'm testing a single-node DataStax Cassandra 2.0 with the default configuration, using a client written with Astyanax.
In my scenario there is one CF; each row has a key (a natural number converted to a string) and one column that holds 1 kB of random text data.
The client inserts rows until the data size reaches 50 GB. It does this at 3000 req/sec, which is enough for me.
The next step is to read all of this data back in the same order in which it was inserted, and this is where the problems start. Here is an example log produced by my program (time in seconds):
reads writes time req/sec
99998 0 922,59 108
100000 0 508,51 196
100000 0 294,85 339
100000 0 195,99 510
100000 0 137,11 729
100000 0 105,48 948
100000 0 105,83 944
100000 0 76,05 1314
100000 0 71,94 1389
100000 0 63,34 1578
100000 0 63,91 1564
100000 0 65,69 1522
100000 0 1217,52 82
100000 0 725,67 137
100000 0 502,03 199
100000 0 342,17 292
100000 0 336,83 296
100000 0 332,56 300
100000 0 330,27 302
100000 0 359,74 277
100000 0 320,01 312
100000 0 369,02 270
100000 0 774,47 129
100000 0 564,81 177
100000 0 729,50 137
100000 0 656,28 152
100000 0 611,29 163
100000 0 589,29 169
100000 0 693,99 144
100000 0 658,12 151
100000 0 294,53 339
100000 0 126,81 788
100000 0 206,13 485
100000 0 924,29 108
The throughput is unstable and rather low.
I'd appreciate any help that could improve the read time.
I can also provide more information if needed.
Thanks for help!
Kuba
I'm guessing you are doing your reads sequentially. If you issue them in parallel, you should be able to do many more operations per second.
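As a rough illustration of why parallel reads help, here is a minimal Python sketch. `read_row` is a hypothetical stand-in that only sleeps to simulate one blocking read; it is not an Astyanax or Cassandra call, and the 5 ms latency and the thread counts are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY_S = 0.005  # assumption: 5 ms round trip per blocking read


def read_row(key: str) -> str:
    """Hypothetical stand-in for one blocking read against the database."""
    time.sleep(SIMULATED_LATENCY_S)
    return f"value-for-{key}"


def read_all(keys, workers: int):
    """Read every key using `workers` threads; results keep the order of `keys`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_row, keys))


def measure(keys, workers: int) -> float:
    """Return the achieved requests per second for a given level of parallelism."""
    start = time.perf_counter()
    read_all(keys, workers)
    return len(keys) / (time.perf_counter() - start)


if __name__ == "__main__":
    keys = [str(i) for i in range(200)]
    for workers in (1, 8, 32):
        print(f"{workers:>2} threads: {measure(keys, workers):8.0f} req/sec")
```

With one thread the rate is bounded by 1 / latency; with more threads the requests overlap, so the rate grows until the server or the client becomes the bottleneck. Results still come back in insertion order because `pool.map` preserves the order of its input.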
Update to address single read latency:
Read latency can be affected by the following variables:
Is the row in memory (Memtable or Row cache)?
How many sstables is the row spread over?
How wide is the row?
How many columns need to be scanned past to find the column you are looking for?
Are you reading from the front or the end of the row?
Does the row have tombstones?
Are you using leveled or size-tiered compaction?
Are the sstables in the disk cache or not?
How many replicas does the coordinator need to wait for?
How many other requests is the node servicing at the same time?
network latency
disk latency (rotational)
disk utilization (queue size/await) -- can be affected by compaction
disk read ahead size
Java GC pauses
CPU utilization -- can be affected by compactions
Context switches
Are you in swap?
There are a number of tools that can help you answer these questions, some
specific to Cassandra and others general system performance tools. Look in the
Cassandra logs for GC pauses and for dropped requests. Look at nodetool cfstats
to see latency stats. Use nodetool cfhistograms to check latency distributions,
the number of sstables hit per read, and row size distribution. Use nodetool tpstats
to check for dropped requests and queue sizes.
You can also use tools like iostat and vmstat to see disk and system utilization stats.
I am testing ClickHouse insert performance, and so far I am able to insert over 200K rows/second. To me, this is good. However, I see that system utilization is not very high and wonder if I can push it further.
CH runs on a server with dual xxx 14-core CPUs @ 2.4 GHz (56 vCPUs) and 256 GB of memory. I inserted 1B rows in 1 hour 10 minutes. During that time I saw:
load avg: 23.68, 22.44, 20.32
%Cpu: 2.93 us, 0.54 sy, 0.14 ni, 95.3 id, 0.96 wa, 0.05 hi, 0.09 si, 0 st
clickhouse-serv (%CPU, RES): 134.3%, 25.6g
The numbers above are averages from "top", sampled every 5 seconds.
I have observed that clickhouse-server's %CPU usage never goes above 200%, as if there were a hard limit.
CH version: 21.2.2.8
Engine: Buffer (MergeTree) w/ default configuration; w/o Buffer it performs 10% less
dataset: in json, 2608 B/row, 150 columns
per insert: 500K rows, which is about 1.2GB
insert by 20 processes with clickhouse-clients from a different server
500K rows/insert and 20 clients give best performance (I have tried different numbers)
Linux 4.18.x (Red Hat)
Questions:
Is 200K rows/second (or 200% CPU usage) the maximum per CH server? If not, how can I improve it?
Can I run more than one CH server instance on one server? Would it be practical and give better performance?
If there is no such limit on the clickhouse-server side (and I am not doing something wrong), could anything else be imposing such a limit on applications like clickhouse-server?
Thanks in advance.
dataset: in json, 2608 B/row, 150 columns
insert by 20 processes with clickhouse-clients from a different server
In this case clickhouse-client parses the JSON, so CPU utilization is probably at 100% on the other server, the one running the clients. You need more inserting nodes to parse the JSON.
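A back-of-the-envelope check, using only the figures from the question (1B rows in 1 hour 10 minutes, 2608 B/row, 20 client processes), gives a feel for how much JSON the inserting side has to parse. The even split across client processes is an assumption.

```python
ROWS = 1_000_000_000
ELAPSED_S = 70 * 60      # 1 hour 10 minutes
BYTES_PER_ROW = 2608     # JSON size per row, from the question
CLIENT_PROCESSES = 20


def ingest_rates(rows, elapsed_s, bytes_per_row, clients):
    """Derive aggregate and per-client ingest rates (MB = 10**6 bytes)."""
    rows_per_s = rows / elapsed_s
    mb_per_s = rows_per_s * bytes_per_row / 1e6
    return {
        "rows_per_s": rows_per_s,
        "mb_per_s": mb_per_s,
        # Assumption: the load is spread evenly over the client processes.
        "rows_per_s_per_client": rows_per_s / clients,
        "mb_per_s_per_client": mb_per_s / clients,
    }


if __name__ == "__main__":
    for name, value in ingest_rates(
        ROWS, ELAPSED_S, BYTES_PER_ROW, CLIENT_PROCESSES
    ).items():
        print(f"{name:>22}: {value:,.1f}")
```

Under these assumptions the clients together parse roughly 620 MB of JSON per second, about 31 MB/s per process, so it is worth checking CPU usage on the inserting server (and the network link between the two machines) before looking for a limit in clickhouse-server.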
Over the past few days our ES 7.4 cluster (4 nodes) has regularly been giving read timeouts and is getting slower and slower at running certain management commands. Before that it ran for more than a year without any trouble. For instance, /_cat/nodes took 2 minutes to execute yesterday; today it already takes 4 minutes. Server loads are low and memory usage seems fine, so I'm not sure where to look further.
Using the opster.com online tool I got a hint that the management queue size is high. However, when I run the commands suggested there to investigate, I don't see anything out of the ordinary, other than that the command takes a long time to return a result:
$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 345 100 345 0 0 2 0 0:02:52 0:02:47 0:00:05 90
id active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA 1 0 4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg 1 0 4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ 5 0 4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w 1 0 4592766 elastic7-3
How can I debug this / solve this? Thanks in advance.
Notice that your elastic7-2 node has 5 active requests in the management thread pool. That is really high: the management pool is capped at 5 threads by default, and it is used for only a few operations (management, not search/index).
See the Elasticsearch thread pool documentation for further reading.
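To make that check repeatable, here is a small Python sketch that parses the `_cat/thread_pool` output pasted in the question and flags nodes whose active count has reached the pool limit. The limit of 5 comes from the answer above; the column layout (node name in the last column) is taken from the pasted output and may differ if you request other columns.

```python
CAT_OUTPUT = """\
id active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA 1 0 4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg 1 0 4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ 5 0 4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w 1 0 4592766 elastic7-3
"""

MANAGEMENT_POOL_MAX = 5  # per the answer above; adjust if your cluster differs


def saturated_nodes(cat_output: str, pool_max: int):
    """Return names of nodes whose 'active' count has reached pool_max."""
    lines = [line.split() for line in cat_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    active_idx = header.index("active")
    name_idx = len(header) - 1  # last column holds the node name in this output
    return [row[name_idx] for row in rows if int(row[active_idx]) >= pool_max]


if __name__ == "__main__":
    print(saturated_nodes(CAT_OUTPUT, MANAGEMENT_POOL_MAX))
```

For the output in the question this flags only elastic7-2, which makes that node the first place to look.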
I have autotrace statistics from before and after modifying my query.
Do these statistics imply a significant performance improvement?
The before/after statistics are shown below.
BEFORE AFTER
----- -----
recursive calls 5 3
db block gets 16 8
consistent gets 45 44
physical reads 2 1
redo size 1156 600
bytes sent via SQL*Net to client 624 624
bytes received via SQL*Net from client 519 519
SQL*Net roundtrips to/from client 2 2
sorts (memory) 0 1
sorts (disk) 0 0
rows processed 1 1
I wouldn't read too much into the autotrace information alone. You should also check the explain plan and the actual run time of the query to see whether performance has improved. Also make sure you have gathered up-to-date stats on all of the tables used in the query.
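As a hedged aid for reading the table, this Python sketch computes the relative change for the counters that differ between the two runs. The values are copied from the question; the helper name `percent_change` is only for illustration.

```python
# (before, after) pairs copied from the autotrace output in the question
AUTOTRACE = {
    "recursive calls": (5, 3),
    "db block gets": (16, 8),
    "consistent gets": (45, 44),
    "physical reads": (2, 1),
    "redo size": (1156, 600),
    "sorts (memory)": (0, 1),
}


def percent_change(before, after):
    """Relative change in percent, or None when the baseline is zero."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for name, (before, after) in AUTOTRACE.items():
        change = percent_change(before, after)
        shown = "n/a" if change is None else f"{change:+.1f}%"
        print(f"{name:<16} {before:>5} -> {after:>5}  {shown}")
```

The percentages look large for some counters, but the absolute numbers are tiny (for example, 2 physical reads versus 1), and consistent gets barely move (45 to 44). That is in line with the advice above: confirm any improvement with the explain plan and measured run time.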
I'm benchmarking Cassandra's read performance. In the test setup I created clusters with 1 / 2 / 4 EC2 instances and data nodes. I wrote 1 table with 100 million entries (~3 GB CSV file). Then I launch a Spark application which reads the data into an RDD using the spark-cassandra-connector.
I expected the following behavior: the more instances Cassandra uses (with the same number of instances for Spark), the faster the reads. Writes behave as expected (about 2 times faster when the cluster is 2 times larger).
But in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My Benchmark Results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
ADDITIONAL TRY - WITH THE CASSANDRA-STRESS TOOL
I launched the "cassandra-stress" tool on the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with following results:
Cluster size  Threads  Ops/sec  Time
1             4        10146    30,1
1             8        15612    30,1
1             16       20037    30,2
1             24       24483    30,2
1             121      43403    30,5
1             913      50933    31,7
2             4        8588     30,1
2             8        15849    30,1
2             16       24221    30,2
2             24       29031    30,2
2             121      59151    30,5
2             913      73342    31,8
3             4        7984     30,1
3             8        15263    30,1
3             16       25649    30,2
3             24       31110    30,2
3             121      58739    30,6
3             913      75867    31,8
4             4        7463     30,1
4             8        14515    30,1
4             16       25783    30,3
4             24       31128    31,1
4             121      62663    30,9
4             913      80656    32,4
Results: With 4 or 8 threads, the single-node cluster is as fast as or faster than the larger clusters!
Results as diagram:
The data-sets are the cluster sizes (1/2/3/4), x-axis the threads, and y-axis the ops/sec.
--> Question here: Are these results the cluster-wide results or is this a test for a local node (and so the result of only one instance of the ring)???
Can someone give an explanation? Thank you!
I ran a similar test with a spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
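To put a number on "scales pretty well", here is a short Python sketch that turns the three timings above into speedup and parallel efficiency relative to the single-node run. The timings are the ones just listed (1 min 42 s = 102 s); the function name `scaling` is only for illustration.

```python
# Seconds per run, keyed by node count (from the timings listed above)
TIMINGS_S = {1: 102, 2: 55, 4: 35}


def scaling(timings):
    """Map node count -> (speedup, efficiency) relative to the smallest cluster."""
    base_nodes = min(timings)
    base_seconds = timings[base_nodes]
    result = {}
    for nodes, seconds in sorted(timings.items()):
        speedup = base_seconds / seconds
        efficiency = speedup / (nodes / base_nodes)
        result[nodes] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    for nodes, (speedup, efficiency) in scaling(TIMINGS_S).items():
        print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Feeding the read times from the question (284 s, 420 s and 360 s for 1, 2 and 4 nodes) into the same function gives speedups below 1, which is what you would expect if something other than the Cassandra nodes is the bottleneck.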
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow and perhaps in your environment is a bottleneck. If you co-locate them, then you benefit from data locality since spark will create the RDD partitions from the tokens that are local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.
I'm using this tool from Yahoo to run some performance tests on my Storm cluster -
https://github.com/yahoo/storm-perf-test
I notice an almost 10x performance hit when I turn acking on. Here are some details to reproduce the test -
Cluster -
3 supervisor nodes and 1 nimbus node. Each node is a c3.large.
With acking -
bin/storm jar storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --ack --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 141 0 1424707134585 0 0 0.0
WAITING 1 3 3 141 141 1424707154585 20000 24660 0.11758804321289062
WAITING 1 3 3 141 141 1424707174585 20000 17320 0.08258819580078125
RUNNING 1 3 3 141 141 1424707194585 20000 13880 0.06618499755859375
RUNNING 1 3 3 141 141 1424707214585 20000 21720 0.10356903076171875
RUNNING 1 3 3 141 141 1424707234585 20000 43220 0.20608901977539062
RUNNING 1 3 3 141 141 1424707254585 20000 35520 0.16937255859375
RUNNING 1 3 3 141 141 1424707274585 20000 33820 0.16126632690429688
Without acking -
bin/storm jar ~/target/storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 140 0 1424707374386 0 0 0.0
WAITING 1 3 3 140 140 1424707394386 20000 565460 2.6963233947753906
WAITING 1 3 3 140 140 1424707414386 20000 1530680 7.298851013183594
RUNNING 1 3 3 140 140 1424707434386 20000 3280760 15.643882751464844
RUNNING 1 3 3 140 140 1424707454386 20000 3308000 15.773773193359375
RUNNING 1 3 3 140 140 1424707474386 20000 4367260 20.824718475341797
RUNNING 1 3 3 140 140 1424707494386 20000 4489000 21.40522003173828
RUNNING 1 3 3 140 140 1424707514386 20000 5058960 24.123001098632812
The last 2 columns are the ones that really matter. They show the number of tuples transferred and the rate in MB/s.
Is this kind of performance hit expected with Storm when acking is turned on? I'm using version 0.9.3 and no advanced networking.
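For readers comparing the two logs, the throughput column appears to follow directly from the transferred count. The Python sketch below reproduces the logged values under two assumptions inferred from the numbers, not taken from the tool's source: every tuple carries `--messageSizeByte` (100) bytes, and MB means 2^20 bytes.

```python
def throughput_mb_s(transferred, message_size_bytes=100, interval_ms=20000):
    """Tuples per interval -> MB/s, assuming fixed-size tuples and MB = 2**20 bytes."""
    bytes_per_second = transferred * message_size_bytes / (interval_ms / 1000)
    return bytes_per_second / (1024 * 1024)


if __name__ == "__main__":
    # (transferred, logged throughput) pairs copied from the two logs
    samples = [
        (24660, 0.11758804321289062),   # with acking
        (33820, 0.16126632690429688),   # with acking
        (565460, 2.6963233947753906),   # without acking
        (5058960, 24.123001098632812),  # without acking
    ]
    for transferred, logged in samples:
        print(f"{transferred:>8} tuples -> {throughput_mb_s(transferred):.6f} MB/s "
              f"(logged {logged:.6f})")
```

So the two columns carry the same information; comparing either one between the runs is enough.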
There is always going to be a certain degree of performance degradation with acking enabled -- it's the price you pay for reliability. Throughput will ALWAYS be higher with acking disabled, but you have no guarantee if your data is processed or dropped on the floor. Whether that's a 10x hit like you're seeing, or significantly less, is a matter of tuning.
One important setting is topology.max.spout.pending, which allows you to throttle spouts so that only that many tuples are allowed "in flight" at any given time. That setting is useful for making sure downstream bolts don't get overwhelmed and start timing out tuples.
That setting also has no effect with acking disabled -- it's like opening the flood gates and dropping any data that overflows. So again, it will always be faster.
With acking enabled, Storm will make sure everything gets processed at least once, but you need to tune topology.max.spout.pending appropriately for your use case. Since every use case is different, this is a matter of trial and error. Set it too low, and you will have low throughput. Set it too high and your downstream bolts will get overwhelmed, tuples will time out, and you will get replays.
To illustrate, set maxSpoutPending to 1 and run the benchmark again. Then try 1000.
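One way to reason about this setting before running the experiment is Little's law: tuples in flight = throughput × average complete latency. Because topology.max.spout.pending applies per spout task, it caps acked throughput at roughly spout tasks × max pending / complete latency. The Python sketch below uses the 20 spout tasks from the benchmark command; the 50 ms complete latency is a made-up value, so treat the output as an upper bound under that assumption, not a prediction.

```python
def max_acked_throughput(spout_tasks, max_spout_pending, complete_latency_s):
    """Upper bound on acked tuples/sec implied by the in-flight limit (Little's law)."""
    return spout_tasks * max_spout_pending / complete_latency_s


if __name__ == "__main__":
    SPOUT_TASKS = 20             # --spoutParallel from the benchmark command
    COMPLETE_LATENCY_S = 0.05    # assumption: 50 ms from emit to full ack
    for pending in (1, 100, 1000):
        bound = max_acked_throughput(SPOUT_TASKS, pending, COMPLETE_LATENCY_S)
        print(f"maxSpoutPending={pending:>4}: at most {bound:,.0f} tuples/sec")
```

In practice the complete latency is not constant: once the limit is high enough to overwhelm the bolts, latency rises and tuples start timing out, which is why the right value has to be found by measurement.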
So yes, a 10x performance hit is possible without proper tuning. If data loss is okay for your use case, turn acking off. But if you need reliable processing, turn it on, tune for your use case, and scale horizontally (add more nodes) to reach your throughput requirements.