I'm benchmarking Cassandra's read performance. For the test setup I created clusters with 1 / 2 / 4 EC2 instances as data nodes. I wrote one table with 100 million entries (~3 GB as a CSV file). Then I launched a Spark application which reads the data into an RDD using the spark-cassandra-connector.
I expected the following behavior: the more instances Cassandra uses (with the same number of instances on the Spark side), the faster the reads. The writes behave as expected (roughly 2 times faster when the cluster is 2 times larger).
But: in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My Benchmark Results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
ADDITIONAL TRY - WITH THE CASSANDRA-STRESS TOOL
I launched the "cassandra-stress" tool on the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with the following results:
Cluster size   Threads   Ops/sec   Time
1              4         10146     30.1
1              8         15612     30.1
1              16        20037     30.2
1              24        24483     30.2
1              121       43403     30.5
1              913       50933     31.7
2              4         8588      30.1
2              8         15849     30.1
2              16        24221     30.2
2              24        29031     30.2
2              121       59151     30.5
2              913       73342     31.8
3              4         7984      30.1
3              8         15263     30.1
3              16        25649     30.2
3              24        31110     30.2
3              121       58739     30.6
3              913       75867     31.8
4              4         7463      30.1
4              8         14515     30.1
4              16        25783     30.3
4              24        31128     31.1
4              121       62663     30.9
4              913       80656     32.4
Result: with 4 or 8 threads the single-node cluster is as fast as or faster than the larger clusters!
Results as a diagram:
The data sets are the cluster sizes (1/2/3/4), the x-axis shows the number of threads, and the y-axis the ops/sec.
--> Question here: are these results cluster-wide, or is this a test against a local node only (and therefore the result of just one instance of the ring)?
Can someone give an explanation? Thank you!
I ran a similar test with a spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.
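A minimal sketch of that kind of read-and-count job, using the spark-cassandra-connector's Java API (connector 1.1-era package names), looks roughly like this; the connection host, keyspace, and table names are placeholders, not the exact values from this test:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import com.datastax.spark.connector.japi.CassandraRow;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CassandraReadCount {
        public static void main(String[] args) {
            // Point the connector at one of the Cassandra nodes (placeholder address).
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-read-count")
                    .set("spark.cassandra.connection.host", "10.0.0.1");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Build an RDD over the table, render each row as a string, and count the rows.
            JavaRDD<String> rows = javaFunctions(sc)
                    .cassandraTable("my_keyspace", "my_table")   // placeholder names
                    .map(CassandraRow::toString);
            System.out.println("row count: " + rows.count());

            sc.stop();
        }
    }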
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow, and in your environment it may well be the bottleneck. If you co-locate them, you benefit from data locality, since Spark will create the RDD partitions from the token ranges that are local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.
Related
I am testing ClickHouse insert performance and so far I am able to insert over 200K rows/second. To me, this is good. However, I see that system utilization is not very high and wonder whether I can push it further.
ClickHouse runs on a server with dual xxx 14-core CPUs @ 2.4 GHz (56 vCPUs) and 256 GB of memory, and inserts 1B rows in 1 hour 10 minutes. During that time I see:
load avg: 23.68, 22.44, 20.32
%Cpu: 2.93 us, 0.54 sy, 0.14 ni, 95.3 id, 0.96 wa, 0.05 hi, 0.09 si, 0 st
clickhouse-serv (%CPU, RES): 134.3%, 25.6g
The numbers above are averages from "top", sampled every 5 seconds.
I have observed that clickhouse-server's %CPU usage never goes above 200%, as if there were a hard limit.
CH version: 21.2.2.8
Engine: Buffer (MergeTree) with the default configuration; without Buffer it performs about 10% worse
dataset: in json, 2608 B/row, 150 columns
per insert: 500K rows, which is about 1.2GB
inserts are done by 20 clickhouse-client processes from a different server
500K rows per insert and 20 clients give the best performance (I have tried other combinations)
Linux 4.18.x (Red Hat)
Questions:
Is 200K rows/second (or 200% CPU usage) the maximum per ClickHouse server? If not, how can I improve it?
Can I run more than one ClickHouse server instance on a single machine? Would that be practical and give better performance?
If there is no such limit on the clickhouse-server side (or I am doing something wrong), I am checking whether anything else could impose such a limit on the application (clickhouse-server).
Thanks in advance.
dataset: in json, 2608 B/row, 150 columns
inserts are done by 20 clickhouse-client processes from a different server
In this case clickhouse-client parses the JSON, so CPU utilization is probably at 100% on that other server rather than on the ClickHouse server. You need more inserting nodes to parse the JSON.
I'm using this tool from Yahoo to run some performance tests on my Storm cluster -
https://github.com/yahoo/storm-perf-test
I notice an almost 10x performance hit when I turn acking on. Here are some details to reproduce the test -
Cluster -
3 supervisor nodes and 1 nimbus node. Each node is a c3.large.
With acking -
bin/storm jar storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --ack --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 141 0 1424707134585 0 0 0.0
WAITING 1 3 3 141 141 1424707154585 20000 24660 0.11758804321289062
WAITING 1 3 3 141 141 1424707174585 20000 17320 0.08258819580078125
RUNNING 1 3 3 141 141 1424707194585 20000 13880 0.06618499755859375
RUNNING 1 3 3 141 141 1424707214585 20000 21720 0.10356903076171875
RUNNING 1 3 3 141 141 1424707234585 20000 43220 0.20608901977539062
RUNNING 1 3 3 141 141 1424707254585 20000 35520 0.16937255859375
RUNNING 1 3 3 141 141 1424707274585 20000 33820 0.16126632690429688
Without acking -
bin/storm jar ~/target/storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 140 0 1424707374386 0 0 0.0
WAITING 1 3 3 140 140 1424707394386 20000 565460 2.6963233947753906
WAITING 1 3 3 140 140 1424707414386 20000 1530680 7.298851013183594
RUNNING 1 3 3 140 140 1424707434386 20000 3280760 15.643882751464844
RUNNING 1 3 3 140 140 1424707454386 20000 3308000 15.773773193359375
RUNNING 1 3 3 140 140 1424707474386 20000 4367260 20.824718475341797
RUNNING 1 3 3 140 140 1424707494386 20000 4489000 21.40522003173828
RUNNING 1 3 3 140 140 1424707514386 20000 5058960 24.123001098632812
The last two columns are the important ones: they show the number of tuples transferred and the throughput in MB/s.
Is this kind of performance hit expected with Storm when acking is turned on? I'm using version 0.9.3 and no advanced networking.
There is always going to be a certain degree of performance degradation with acking enabled -- it's the price you pay for reliability. Throughput will ALWAYS be higher with acking disabled, but you have no guarantee if your data is processed or dropped on the floor. Whether that's a 10x hit like you're seeing, or significantly less, is a matter of tuning.
One important setting is topology.max.spout.pending, which allows you to throttle spouts so that only that many tuples are allowed "in flight" at any given time. That setting is useful for making sure downstream bolts don't get overwhelmed and start timing out tuples.
That setting also has no effect with acking disabled -- it's like opening the flood gates and dropping any data that overflows. So again, it will always be faster.
With acking enabled, Storm will make sure everything gets processed at least once, but you need to tune topology.max.spout.pending appropriately for your use case. Since every use case is different, this is a matter of trial and error. Set it too low, and you will have low throughput. Set it too high and your downstream bolts will get overwhelmed, tuples will time out, and you will get replays.
To illustrate, set maxSpoutPending to 1 and run the benchmark again. Then try 1000.
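Outside the benchmark tool (which exposes the setting through --maxSpoutPending), you apply it to your own topology through its Config. A rough sketch for Storm 0.9.x; the worker count and the value itself are only illustrative:

    import backtype.storm.Config;                     // org.apache.storm.Config in Storm 1.x+
    import backtype.storm.StormSubmitter;
    import backtype.storm.generated.StormTopology;

    public class TunedSubmit {
        // Submits an already-built topology with throttled spouts.
        public static void submit(String name, StormTopology topology) throws Exception {
            Config conf = new Config();
            conf.setNumWorkers(9);
            // Maximum number of un-acked tuples allowed "in flight" per spout task.
            // Start small (try 1, then 1000, as suggested above) and raise it until
            // throughput stops improving without causing tuple timeouts and replays.
            conf.setMaxSpoutPending(1000);
            StormSubmitter.submitTopology(name, conf, topology);
        }
    }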
So yes, a 10x performance hit is possible without proper tuning. If data loss is okay for your use case, turn acking off. But if you need reliable processing, turn it on, tune for your use case, and scale horizontally (add more nodes) to reach your throughput requirements.
Currently I'm working on a project which uses elasticsearch 1.4.1.
The cluster consists of 22 nodes, 20 of which are data nodes; each node has 16 GB of heap memory.
The thing is, when I'm running massive queries, some of the nodes (2 or 3) consume 70% of their heap memory, while the rest use less than 10%.
So I'm wondering: is this because most of the queries go to those 2 or 3 nodes?
If not, how can I achieve better performance?
Thanks!
Update:
I just ran this command: curl -XGET localhost:9200/_cat/shards?v, and it returned:
index shard prirep state docs store ip node
....
mm 2 r STARTED 2248969 293.6mb 10.2.4.117 Mark Todd
mm 2 p STARTED 2248969 293.6mb 10.2.4.129 Saint Elmo
mm 19 r STARTED 30172116 3.5gb 10.2.4.126 Fixer
mm 19 p STARTED 30172116 3.5gb 10.2.4.123 Loki
....
I'm wondering what store means here. If it is the actual size of the documents, can I load all of them into memory?
This could be because the document matches for that query are somehow concentrated on just those 2 machines.
That is, if there are 20 million matches, chances are that 8 million of them belong to one machine, 8 million come from another, and only 4 million come from the rest of the 18 machines.
I am guessing you are also using aggregations in the process, which causes the field data cache to be built up on those nodes.
(On a single machine)
I installed Hadoop 2.4.1 and wrote a program that reads a 28.6 MB sequence file; I iterated this program 10,000 times.
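A minimal sketch of such a read loop (not the exact program used here; the file path is taken from the command line):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqFileReadBench {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Path path = new Path(args[0]);              // HDFS path of the 28.6 MB sequence file

            long start = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++) {           // repeat the full read 10,000 times
                try (SequenceFile.Reader reader =
                         new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
                    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                    while (reader.next(key, value)) {
                        // every record is read, so the whole file goes through the DataNode
                    }
                }
            }
            System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
        }
    }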
These are the results:
Without Centralized Cache
Run   Time (ms)
1     19840
2     15096
3     14091
4     14222
5     14576

With Centralized Cache
Run   Time (ms)
1     19158
2     14649
3     14461
4     14302
5     14715
I also wrote a MapReduce job and iterated it 25 times.
Results:
Without Centralized Cache
Run   Time (ms)
1     909265
2     922750
3     898311

With Centralized Cache
Run   Time (ms)
1     898550
2     897663
3     926033
I found no major difference in performance between using the centralized cache and not using it.
How can I analyze and demonstrate the performance increase from using the centralized cache?
Please suggest any other way to show a performance increase from the centralized cache.
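For context, pinning the file into the centralized cache and checking that it is actually resident in DataNode memory before timing the reads can be done through the HDFS Java API (the hdfs cacheadmin command-line tool offers the same operations). A rough sketch; the pool name and path are placeholders, not my actual setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class CacheCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

            // Create a pool and pin the test file into the centralized cache (placeholder names).
            dfs.addCachePool(new CachePoolInfo("benchPool"));
            dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
                    .setPool("benchPool")
                    .setPath(new Path("/bench/test.seq"))
                    .build());

            // A directive is only effective once bytesCached equals bytesNeeded,
            // so verify that before timing any reads.
            RemoteIterator<CacheDirectiveEntry> it =
                    dfs.listCacheDirectives(new CacheDirectiveInfo.Builder().build());
            while (it.hasNext()) {
                CacheDirectiveEntry entry = it.next();
                System.out.println(entry.getInfo().getPath() + " cached "
                        + entry.getStats().getBytesCached() + " / "
                        + entry.getStats().getBytesNeeded() + " bytes");
            }
        }
    }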
I am new to Hadoop and I am learning from a few examples. I am currently trying to pass in a file with random integers in it. I want each number to be doubled a number of times specified by the user at runtime.
3536 5806 2545 249 485 5467 1162 8941 962 6457
665 6754 889 5159 3161 5401 704 4897 135 907
8111 1059 4971 5195 3031 630 6265 827 5882 9358
9212 9540 676 3191 4995 8401 9857 4884 8002 3701
931 875 6427 6945 5483 545 4322 5120 1694 2540
9039 5524 872 840 8730 4756 2855 718 6612 4125
Above is a sample of the input file.
For example, when the user specifies at runtime:
jar ~/dissertation/workspace/TestHadoop/src/DoubleNum.jar DoubleNum Integer Output 3
the output for, say, the first line will be:
3536*8 5806*8 2545*8 249*8 485*8 5467*8 1162*8 8941*8 962*8 6457*8
Because the number is doubled on each iteration, after 3 iterations it is multiplied by 2^3 = 8. How can I achieve this using MapReduce?
For chaining one job into the next, check out:
Chaining multiple MapReduce jobs in Hadoop
Also, this may be a good time to learn about sequence files, as they provide an efficient way of passing data from one map/reduce job to another.
As for your particular problem, you don't need reducers here, so make it map-only by setting the number of reducers to zero. Sending the output to reducers would only incur extra network overhead. (However, be careful about the number of files you create over time; eventually the NameNode will not appreciate it. Each mapper will create one file.)
I understand that you are trying to use this as an example of perhaps something more complex... but in this case you can use a common optimization technique: if you find yourself wanting to chain one mapper-only task into another map/reduce job, you can squash the two mappers together. For example, instead of multiplying by 2, then by 2 again, then by 2 again, why not just multiply by 2 and by 2 and by 2 in the same mapper? Basically, if all your operations work independently on one number or line, you can apply the iterations within the same mapper, per record. This will reduce the amount of overhead significantly.
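To make that concrete, here is a rough, map-only sketch of the squashed version. It assumes the same argument order as your command (input path, output path, number of doublings) and is not your existing DoubleNum class:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DoubleNum {

        public static class DoubleMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            private long factor;

            @Override
            protected void setup(Context context) {
                // 2^iterations, passed in through the job configuration
                int iterations = context.getConfiguration().getInt("doublenum.iterations", 1);
                factor = 1L << iterations;
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Apply all the doublings to every number on the line in a single pass.
                StringBuilder out = new StringBuilder();
                for (String token : value.toString().trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        out.append(Long.parseLong(token) * factor).append(' ');
                    }
                }
                context.write(new Text(out.toString().trim()), NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("doublenum.iterations", Integer.parseInt(args[2])); // e.g. 3 -> multiply by 8

            Job job = Job.getInstance(conf, "DoubleNum");
            job.setJarByClass(DoubleNum.class);
            job.setMapperClass(DoubleMapper.class);
            job.setNumReduceTasks(0);                 // map-only: no reducers, no shuffle
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the whole 2^k factor is applied per record inside one mapper, there is no job chaining and no shuffle at all.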