How to validate increased performance using Centralized Cache Management in HDFS (Hadoop)

(On a single machine)
I installed Hadoop 2.4.1 and wrote a program that reads a 28.6 MB sequence file, iterating the read 10,000 times per run.
The results:
Without Centralized Cache
Run   Time (ms)
1     19840
2     15096
3     14091
4     14222
5     14576

With Centralized Cache
Run   Time (ms)
1     19158
2     14649
3     14461
4     14302
5     14715
I also wrote a MapReduce job and iterated it 25 times per run.
Results:
Without Centralized Cache
Run   Time (ms)
1     909265
2     922750
3     898311

With Centralized Cache
Run   Time (ms)
1     898550
2     897663
3     926033
I see no meaningful performance difference between running with Centralized Cache and without it.
How should I analyze the performance gain from Centralized Cache?
Please suggest any other way to demonstrate a performance increase with Centralized Cache.
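For reference, a typical way to pin the file and to verify it is actually resident in the DataNode cache before benchmarking (pool and path names below are placeholders, not my real ones):

# Create a cache pool and pin the sequence file into it
hdfs cacheadmin -addPool testPool
hdfs cacheadmin -addDirective -path /data/input.seq -pool testPool

# BYTES_CACHED should equal BYTES_NEEDED before the benchmark starts
hdfs cacheadmin -listDirectives -stats

# dfs.datanode.max.locked.memory (hdfs-site.xml) and the DataNode user's
# "ulimit -l" lock limit must both be larger than the 28.6 MB file,
# otherwise nothing actually gets cached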

Related

Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then

Regularly over the past few days our ES 7.4 cluster (4 nodes) has been giving read timeouts and is getting slower and slower when running certain management commands. Before that it had been running for more than a year without any trouble. For instance, /_cat/nodes took 2 minutes to execute yesterday; today it is already taking 4 minutes. Server load is low and memory usage seems fine; I am not sure where to look further.
Using the opster.com online tool I got a hint that the management queue size is high, but when executing the suggested commands to investigate I don't see anything out of the ordinary other than that the command takes a long time to return a result:
$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   345  100   345    0     0      2      0  0:02:52  0:02:47  0:00:05    90
id active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA 1 0 4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg 1 0 4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ 5 0 4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w 1 0 4592766 elastic7-3
How can I debug this / solve this? Thanks in advance.
If you look closely, your elastic7-2 node has 5 active requests in the management thread pool, which is really high: the management pool's maximum size is just 5, and it is used only for a small number of internal operations (management, not search/index).
You can have a look at thread pools in Elasticsearch for further reading.
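If you want to see what those management threads are actually busy with, the stock node APIs are a reasonable next step (same host/port as in the question):

# Show what the threads on the suspect node are doing
curl "http://127.0.0.1:9201/_nodes/elastic7-2/hot_threads?threads=10"

# Check for a backlog of pending cluster-state tasks
curl "http://127.0.0.1:9201/_cat/pending_tasks?v"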

Jmeter interpreting results in simple terms

So I'm trying to test a website and to interpret the aggregate report using "common sense" (I tried looking up the meanings of each result, and I cannot understand how they should be interpreted).
TEST 1
Thread Group: 1
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 645
- Median 645
- 90% Line 645
- Min 645
- Max 645
- Throughput 1.6/sec
So I am under the assumption that the first result is the best outcome.
TEST 2
Thread Group: 5
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 647
- Median 647
- 90% Line 647
- Min 643
- Max 652
- Throughput 3.5/sec
I am assuming TEST 2 result is not so bad, given that the results are near TEST 1.
TEST 3
Thread Group: 10
Ramp-up: 1
Loop Count: 1
- Samples 1
- Average 710
- Median 711
- 90% Line 739
- Min 639
- Max 786
- Throughput 6.2/sec
Given the dramatic difference, I am assuming that if 10 users concurrently request the website, it will not perform well. How should this set of tests be interpreted in simple terms?
It is as simple as available resources.
Response times depend on many things; the following are critical factors:
Server machine resources (network, CPU, disk, memory, etc.)
Server machine configuration (type of server, number of nodes, number of threads, etc.)
Client machine resources (network, CPU, disk, memory, etc.)
As you can see, it mostly comes down to how busy the server is responding to other requests and how busy the client machine is generating/processing the load (I assume you run all 10 users on a single machine).
The best way to find the actual reason is to monitor these resources, using nmon for Linux and perfmon or Task Manager for Windows (or any other monitoring tool), and compare the differences when you run 1, 5, and 10 users.
Theory aside, I assume it is taking longer because you are applying the load suddenly, so the server is still busy processing the previous requests.
Are you running the client and server on the same machine? If yes, that would mean the system resources are shared between the client threads (10 threads) and the server threads.
Response time = time for the client to send the request to the server + server processing time + time for the server to send the response back to the client.
In your case, one or more of these times may have increased.
If you have good bandwidth, then it is probably the server processing time.
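For example, on Linux a minimal nmon capture for each run could look like the following (interval and snapshot count are only examples); on Windows, perfmon counters for CPU, memory, disk and network serve the same purpose:

# Write a snapshot every 10 seconds, 90 snapshots (~15 minutes), to an .nmon file
nmon -f -s 10 -c 90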
Your results are confusing.
For thread counts of 5 and 10, you have reported the same number of samples: 1. It should be 1 sample for 1 thread, 5 samples for 5 threads, and 10 samples for 10 threads. Your experiment has too few samples to support any conclusion. You should model your load so that the 1-thread load is sustained for a longer period before you ramp up to 5 and 10 threads. If you are running a small test to assess the scalability of your application, you could do something like
1 thread - 15 mins
5 threads - 15 mins
10 threads - 15 mins
and record the observations for each 15-minute period. If your application really scales, it should maintain the same response time even under the increased load.
Looking at your results, I don't see any issues with your application; nothing is varying much. Again, you don't have enough samples to draw a statistically relevant conclusion.
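Whichever load model you choose, run the plan in non-GUI mode so the JMeter GUI itself does not skew the numbers; something along these lines (file names are placeholders, and the dashboard flags need JMeter 3.0+):

# Non-GUI run: raw samples go to results.jtl, HTML dashboard to report/
jmeter -n -t website-test.jmx -l results.jtl -e -o report/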

Correcting improper usage of Cassandra

I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS instance - Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4 GB RAM.
2 nodes of Cassandra DataStax Community Edition (2.1.3).
PHP 5.5.9 with the DataStax php-driver.
I come from a MySQL background, with only very basic NoSQL hands-on experience: Elasticsearch (the company is now called Elastic) and MongoDB for document storage.
When I read about how to use Cassandra, these are the points I took away:
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your query rather than to use indices
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" write
I have a PHP Silex framework API that receives batched JSON data and inserts it into at least 4 tables, at most 6 (mainly because of the different sort orders I need).
At first I only had two Cassandra nodes. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch sizes and concurrency.
Concurrency   Batch size   Avg. time (ms), 2 nodes   Avg. time (ms), 3 nodes
1             5            288                       180
1             50           421                       302
1             400          1298                      1504
25            5            1993                      2111
25            50           3636                      3466
25            400          32208                     21032
100           5            5115                      5167
100           50           11776                     10675
100           400          61892                     60454
A batch size is the number of entries (across the 4-6 tables) made per call.
So a batch of 5 means it is inserting 5 x (4-6) tables' worth of data. At higher batch size / concurrency the application times out.
Each table has 5 columns with relatively small data (mostly ints, with text no longer than roughly 10 characters).
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? It seems to be this is relatively small data set of that considering that Cassandra was built on BigDataTable at very high write speed.
Do I add more nodes beyond 3 in order to speed things up?
Do I change my replication factor and tune QUORUM read/write consistency, hunting for a sweet spot using the DataStax doc (see the sketch at the end of this question): http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch frameworks, e.g. go to node.js for higher concurrency?
Do I rework my tables? I have no good example of how to use column families effectively; I need a hint for this one.
For the table question:
I'm tracking a user's history. A user has an event that is associated with a media id, and there is some extra metadata too.
So the columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them in different ways, so I made a separate table for each sort order (as I understood Cassandra data modeling should work... I may be wrong). I am therefore duplicating the data across various tables.
Help?
EDIT PART HERE
The application also has Redis and MySQL attached for other CRUD points of interest, such as retrieving user data and caching it for faster pulls.
So far, on average, I see 72 ms once Redis kicks in, and 180 ms from MySQL before Redis.
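(To be concrete about the replication-factor option above: I assume the keyspace change itself would be something like the following, with the RF value only as an example and the consistency level then chosen per request in the php-driver.)

cqlsh -e "ALTER KEYSPACE user_data WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"
nodetool repair user_data   # stream the newly owned replicas to the other node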
The first problem is that you're trying to benchmark the whole system without knowing what any individual component can do. Are you trying to see how fast an individual operation is, or how many operations per second you can do? They're different values.
I typically recommend you start by benchmarking Cassandra itself. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads/second or writes/second. Use cassandra-stress to make sure Cassandra is doing what you expect, THEN loop in your application and see if it matches. If things slow way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc.).
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.
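As a rough starting point, a cassandra-stress run of the kind mentioned above could look like this (node address, op counts and thread counts are placeholders to adjust for your setup):

# Raw write throughput of the cluster itself, with PHP out of the loop
cassandra-stress write n=1000000 -rate threads=50 -node 10.0.0.11
# Then the matching read pass over the same data
cassandra-stress read n=1000000 -rate threads=50 -node 10.0.0.11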

Cassandra Reading Benchmark with Spark

I'm benchmarking Cassandra's read performance. In the test-setup step I created a cluster with 1 / 2 / 4 EC2 instances and the same number of data nodes. I wrote 1 table with 100 million entries (~3 GB as a CSV file). Then I launched a Spark application which reads the data into an RDD using the spark-cassandra-connector.
I expected the behavior to be: the more instances Cassandra uses (with the same number of Spark instances), the faster the reads. The writes do behave that way (roughly 2x faster when the cluster is 2x larger).
But: in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My Benchmark Results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
ADDITIONAL TRY - WITH THE CASSANDRA-STRESS TOOL
I launched the cassandra-stress tool against the Cassandra cluster (sizes 1 / 2 / 3 / 4 nodes), with the following results:
Cluster size   Threads   Ops/sec   Time (s)
1              4         10146     30.1
1              8         15612     30.1
1              16        20037     30.2
1              24        24483     30.2
1              121       43403     30.5
1              913       50933     31.7
2              4         8588      30.1
2              8         15849     30.1
2              16        24221     30.2
2              24        29031     30.2
2              121       59151     30.5
2              913       73342     31.8
3              4         7984      30.1
3              8         15263     30.1
3              16        25649     30.2
3              24        31110     30.2
3              121       58739     30.6
3              913       75867     31.8
4              4         7463      30.1
4              8         14515     30.1
4              16        25783     30.3
4              24        31128     31.1
4              121       62663     30.9
4              913       80656     32.4
Results: with 4 or 8 threads the single-node cluster is as fast as or faster than the larger clusters!
Results as a diagram: the data sets are the cluster sizes (1/2/3/4), the x-axis is the number of threads, and the y-axis is ops/sec.
--> Question here: are these results cluster-wide, or is this a test of a single local node (and therefore the result of only one instance of the ring)?
Can someone give an explanation? Thank you!
I ran a similar test with a spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow, and in your environment it is perhaps the bottleneck. If you co-locate them, you benefit from data locality, since Spark will create the RDD partitions from the token ranges that are local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.
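For illustration, a submit for the row-count job described above, pointed at a local Cassandra node, might look like this (class name, jar, host and connector version are placeholders to match your own Spark/Scala versions):

spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.CountRows \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
  --conf spark.cassandra.connection.host=10.0.0.11 \
  --conf spark.locality.wait=3s \
  count-rows.jar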

Spark cache inefficiency

I have a fairly powerful cluster with 3 nodes, each with 24 cores and 96 GB RAM (288 GB total). I am trying to load 100 GB of TSV files into the Spark cache and run a series of simple computations over the data, like sum(col20) grouped by col2-col4 combinations. I think it's a clear-cut scenario for cache usage.
But during execution I found that the cache NEVER loads 100% of the data, despite plenty of RAM being available. After 1 hour of execution I have 70% of the partitions in cache, with 75 GB of cache used out of 170 GB available. It looks like Spark somehow limits the number of blocks/partitions it adds to the cache, rather than adding everything during the very first action and getting great performance from the start.
I use MEMORY_ONLY_SER with Kryo (the cached size is approximately 110% of the on-disk data size).
Does anyone have similar experience, or know of Spark configs / environment conditions that could cause this caching behaviour?
So, "problem" was solved with further reducing of split size. With mapreduce.input.fileinputformat.split.maxsize set to 100mb I got 98% cache load after 1st action finished, and 100% at 2nd action.
Other thing that worsened my results was spark.speculation=true - I try to avoid long-running tasks with that, but speculation management creates big performance overhead, and is useless for my case. So, just left default value for spark.speculation ( false )
My performance comparison for 20 queries are as following:
- without cache - 160 minutes ( 20 times x 8 min, reload each time 100gb from disk to memory )
- cache - 33 minutes total - 10m to load cache 100% ( during first 2 queries ) and 18 queries x 1.5 minutes each ( from in-memory Kryo-serialized cache )
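For reference, the settings discussed above can all be passed at submit time; a rough sketch (master URL, memory size and jar name are placeholders):

spark-submit \
  --master spark://master:7077 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.speculation=false \
  --conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=104857600 \
  --executor-memory 80g \
  aggregation-job.jar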
