Performance: Spark dynamic allocation with Cassandra

I hope my question is simple: what happens when you enable Spark dynamic allocation with a Cassandra database?
I have a 16-node cluster where every node has Spark and Cassandra installed, in order to also achieve data locality. I am wondering how dynamic allocation works in this case. Spark calculates the workload in order to "hire" workers, right? But how does Spark know the size of the data in Cassandra (in order to calculate the workload) unless it queries it first?
For example, what if Spark hires 2 workers and the data in Cassandra is located on a 3rd node? Wouldn't that increase network traffic and the time it takes for Cassandra to copy the data from node 3 to node 2?
I tried it with my application and saw in the Spark UI that the master hired 1 executor to query the data from Cassandra and then added another 5 or 6 executors to do the further processing. Overall, it took 10 minutes longer than the 1 minute the job normally takes without dynamic allocation.
(FYI: I am also using spark-cassandra-connector 3.1.0)

The Spark Cassandra connector estimates the size of the table using the values stored in the system.size_estimates table. For example, if the size_estimates indicates that there are 200K CQL partitions in the table and the mean partition size is 1MB, the estimated table size is:
estimated_table_size = mean_partition_size x number_of_partitions
= 1 MB x 200,000
= 200,000 MB
The connector then calculates the Spark partitions as:
spark_partitions = estimated_table_size / input.split.size_in_mb
= 200,000 MB / 64 MB
= 3,125
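As a rough sketch of how this plays out in code (the host, keyspace, and table names below are placeholders; in the 3.x connector the property is spelled spark.cassandra.input.split.sizeInMB), the number of partitions you get back should land near the estimate above:
import org.apache.spark.sql.SparkSession
// Placeholder host/keyspace/table names; sizeInMB is the target size of each Spark partition.
val spark = SparkSession.builder()
  .appName("cassandra-read-example")
  .config("spark.cassandra.connection.host", "10.0.0.1")
  .config("spark.cassandra.input.split.sizeInMB", "64")
  .getOrCreate()
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()
println(df.rdd.getNumPartitions)  // roughly estimated_table_size / 64 MB, i.e. ~3,125 in this example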
When there is data locality (the Spark worker/executor JVMs are co-located with the Cassandra JVMs), the connector knows which nodes own the data. You can take advantage of this functionality by using repartitionByCassandraReplica(), so that each Spark partition is processed by an executor on the same node where the data resides, avoiding shuffling.
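A minimal sketch of that pattern with the connector's RDD API (the keyspace, table, and key case class are hypothetical):
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
case class UserKey(user_id: Int)  // field name must match the table's partition key column
def localJoin(sc: SparkContext) = {
  val keys = sc.parallelize(Seq(UserKey(1), UserKey(2), UserKey(3)))
  keys
    .repartitionByCassandraReplica("my_keyspace", "users")  // move each key to a node that replicates it
    .joinWithCassandraTable("my_keyspace", "users")         // then read locally, without shuffling table data
}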
For more info, see the Spark Cassandra connector documentation. Cheers!

Related

Why Spark Fails for Huge Dataset with Container Getting Killed Issue and Hive works

I am trying to run a simple query. I assume that running queries with spark.sql("query") has no performance difference compared to the DataFrame API, since I am using Spark 2.1.0 and have the Catalyst optimizer and Tungsten enabled to take care of the optimization.
Here I am joining 2 tables with a left-outer join. My 1st table is 200 GB and is the driving table (being on the left side), and the 2nd table is 2 GB; there can be no filters as per our business requirement.
Configuration of my cluster: as this is a shared cluster, I have been assigned a specific queue which allows me to use 3 TB of memory (yes, 3 terabytes), but the number of VCORES is 480. That means I can only run 480 parallel tasks. On top of that, at the YARN level there is a constraint of a maximum of 8 cores per node and a maximum of 16 GB of container memory. Because of this, I cannot give my executor memory (which is per node) more than 12 GB, as I am giving 3 GB as executor memory overhead to be on the safe side, which comes to 15 GB of per-node memory utilization.
So after calculating: 480 total allowed vcores with the 8-cores-per-node limit gives me 480/8 = 60 nodes for my computation, which comes to 60 * 15 = 900 GB of usable memory (I don't know why the total queue memory is 3 TB). And this is at peak, i.e. if I am the only one using the queue, which is not always the case.
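For reference, a hedged sketch of how those limits could be expressed when building the session (the values simply mirror the numbers above, not a recommendation; the app name is hypothetical):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("left-outer-join-job")                        // hypothetical app name
  .config("spark.executor.memory", "12g")                // 12 GB heap per executor
  .config("spark.yarn.executor.memoryOverhead", "3072")  // 3 GB overhead, Spark 2.1 property name
  .config("spark.executor.cores", "8")                   // the per-node core cap at YARN level
  .getOrCreate()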
Now the doubt is how Spark uses this whole 900 GB of memory. From the numbers and stats I can clearly say that my job should run without any issues, as the data size I am trying to process is just 210-250 GB max and I have 900 GB of available memory.
But I keep getting "container killed" error messages, and I cannot increase the YARN container size because it is set at the YARN level and the whole cluster would get the increased container size, which is not the right thing to do. I have also tried setting the vmem-check.enabled property to false in my code using sparksession.config(property), but that doesn't help either; maybe I am not allowed to change anything at the YARN level, so it might be ignoring it.
Now, on what basis does Spark split the data initially? Is it based on the block size defined at the cluster level (assuming 128 MB)? I am asking because when my job starts I see that my big table, which is around 200 GB, gets 2000 tasks, so on what basis does Spark calculate these 2000 tasks (partitions)? I thought maybe the default partition size when Spark loads my table is quite big, judging by the Input Size/Records and Shuffle Write Size/Records under the Stages tab of the Spark UI, and that this is why I am getting the container-killed error and the suggestion to increase executor memory overhead, which did not help either.
I tried repartitioning the data from 10k up to 100k partitions and tried persisting with MEMORY_ONLY, MEMORY_AND_DISK and DISK_ONLY, but nothing helped. Many of my tasks kept failing and in the end the job would fail, sometimes with container killed, direct buffer, and other errors.
Now, what is the use of persist/caching here and how does it behave? I am doing:
val result = spark.sql("query big_table").repartition(10000, $"<column name>").persist()
The column in repartition is the joining key, so the data gets distributed by it. To make this work before the join, I am doing result.show(1), so the action is performed and the data gets persisted, and Spark will read the persisted data for the join; there should be no load on memory, as it is stored in small chunks on disk (am I correct here?).
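For what it's worth, a hedged variant that makes the disk-only intent explicit (the query and column name are placeholders): plain persist() on a Dataset defaults to MEMORY_AND_DISK, and show(1) may not materialize every partition, whereas count() does.
import org.apache.spark.storage.StorageLevel
import spark.implicits._  // for the $"..." column syntax
val result = spark.sql("SELECT * FROM big_table")  // placeholder query
  .repartition(10000, $"join_key")                 // placeholder join-key column
  .persist(StorageLevel.DISK_ONLY)
result.count()  // touches every partition, unlike show(1)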
Why does this same job in Hive, with the same big table plus some additional tables with left joins, complete successfully? It takes time, but it completes; yet it fails in Spark. Why? Is Spark not a complete replacement for Hive? Doesn't Spark work like Hive when it comes to spilling to disk and writing data to disk when persisting with DISK?
Does the YARN container size play a role if we have a small container size but a good number of nodes?
Does Spark combine the memory of all the available nodes (15 GB per node, as per the container size) to load a large partition?

hbase skip region server to read rows directly from hfile

I am attempting to dump over 10 billion records into HBase, which will grow on average at 10 million per day, and then attempt a full table scan over the records. I understand that a full scan over HDFS will be faster than HBase.
HBase is being used to order the disparate data on HDFS. The application is being built using Spark.
The data is bulk-loaded into HBase. Because of the various 2G limits, the region size was reduced to 1.2G from an initial test of 3G (this still requires a bit more detailed investigation).
Scan cache is 1000 and cache blocks is off.
Total HBase size is in the 6 TB range, yielding several thousand regions across 5 region servers (nodes); the recommendation is low hundreds.
The Spark job essentially runs across each row and then computes something based on columns within a range.
Using spark-on-hbase, which internally uses TableInputFormat, the job ran in about 7.5 hours.
In order to bypass the region servers, I created a snapshot and used TableSnapshotInputFormat instead. The job completed in about 5.5 hours.
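A hedged sketch of that snapshot read path (the snapshot name and restore directory are hypothetical):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
def readSnapshot(sc: SparkContext) = {
  val job = Job.getInstance(HBaseConfiguration.create())
  // The restore dir must live on the same filesystem as the HBase root dir.
  TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot_restore"))
  sc.newAPIHadoopRDD(
    job.getConfiguration,
    classOf[TableSnapshotInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])  // splits are still created per region of the snapshotted table
}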
Questions
When reading from HBase into Spark, the regions seem to dictate the Spark partitions and thus the 2G limit, hence the problems with caching. Does this imply that the region size needs to be small?
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits by region, so it would still run into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility which can read a row directly from an HFile (to be specific, from a snapshot-referenced HFile)?
Are there any other pointers, say configurations, that may help boost performance? For instance the HDFS block size, etc.? The main use case is a full table scan for the most part.
As it turns out, this was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations of an IP address: InetAddress took a significant amount of time to resolve an IP address. We switched to using the raw bytes to extract whatever we needed. This alone made the job finish in about 2.5 hours.
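A hedged sketch of the idea, assuming IPv4 addresses stored as 4 raw bytes (the actual column layout isn't shown here):
// Build a dotted-quad string from raw bytes without going through java.net.InetAddress,
// which can be slow once name resolution gets involved.
def ipv4FromBytes(bytes: Array[Byte]): String =
  bytes.map(b => (b & 0xff).toString).mkString(".")
// ipv4FromBytes(Array(10, 0, 0, 1).map(_.toByte))  // "10.0.0.1"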
Modelling the problem as a MapReduce problem and running it on MR2 with the same change as above showed that it could finish in about 1 hour 20 minutes.
The iterative nature and smaller memory footprint helped MR2 achieve more parallelism and hence made it much faster.

Hadoop machine configuration

I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500 GB, but to analyze that 500 GB of data I don't need to go through the whole 7 TB again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and Hadoop with MapReduce and HBase to process and store the data.
At the moment I have 5 machines of the following configuration:
Data node server configuration: 2-2.5 GHz hexa-core CPU, 48 GB RAM, 1 TB 7200 RPM disk (x 8)
Number of data nodes: 5
Name node server: enterprise-class server configuration (x 2) (1 additional for the secondary name node)
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing:
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Count * 1.2) / Comp Ratio
Assuming default values:
repl_count = 3 (default)
comp_ratio = 3-4 (default)
intermediate data size = 30%-50% of raw data size
1.2 factor = temp space
So for your first year, you will need about 16.9 TB. You have 8 TB * 5 nodes = 40 TB, so space is not the issue.
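As a hedged illustration, the 16.9 TB figure can be reproduced if you assume 30% intermediate data and a compression ratio of about 3.6, so that replication and temp space roughly cancel out against compression:
raw_data_year_1  = 7 TB + 12 x 0.5 TB          = 13 TB
intermediate     = 13 TB x 0.3                 = 3.9 TB
required_storage = (13 + 3.9) x 3 x 1.2 / 3.6  ≈ 16.9 TB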
Performance
5 data nodes. Reading 1 TB takes on average 2.5 hours on a single drive (source: Hadoop - The Definitive Guide), so 600 GB on one drive would be 1.5 hours. Assuming the data is replicated so that you can use all 5 nodes in parallel, reading the whole data set with 5 nodes can take as little as 18 minutes.
You may have to add some more time depending on what your queries do and how you have configured your data processing.
Memory consumption
48 GB is not much; the default RAM for many data nodes starts at 128 GB. If you use the cluster only for processing, it might work out. It also depends a bit on how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you will run into heap errors.
To sum it up:
It depends very much on what you want to do with your cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes of processing time for 600 GB of data (as a baseline; real values depend on many factors unknown when answering this question) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give a tremendous speed boost by switching to a columnar compressed format like ORC or Parquet. We're talking about potential 30-40x improvements in query performance. With the latest Hive you can leverage streaming data ingest into ORC files.
You can leave things as you planned (HBase + Hive) and just rely on the brute force of 5 x (6-core, 48 GB, 7200 RPM) machines, but you don't have to. A bit of work can get you into interactive ad-hoc query territory, which will open up data analysis.
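If you end up doing that conversion from Spark rather than through Hive DDL (a swap of tools, not what the answer above prescribes), a minimal sketch with placeholder table names, assuming a SparkSession built with Hive support:
// Rewrite a raw table into a compressed columnar layout; ORC here, Parquet works the same way.
spark.sql("""
  CREATE TABLE events_orc STORED AS ORC
  AS SELECT * FROM events_raw
""")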

How to fully utilize all Spark nodes in cluster?

I have launched a 10-node cluster with the EC2 script in standalone mode for Spark. I am accessing data in S3 buckets from within the PySpark shell, but when I perform transformations on the RDD, only one node is ever used. For example, the code below will read in data from the Common Crawl corpus:
bucket = ("s3n://#aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/"
"/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10"
"-180-212-248.ec2.internal.warc.gz")
data = sc.textFile(bucket)
data.count()
When I run this, only one of my 10 slaves processes the data. I know this because only one slave (213) has any logs of the activity when viewed from the Spark web console. When I view the activity in Ganglia, this same node (213) is the only slave with a spike in memory usage while the job was running.
Furthermore, I get the exact same performance when I run the same script with an EC2 cluster of only one slave. I am using Spark 1.1.0 and any help or advice is greatly appreciated.
...ec2.internal.warc.gz
I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.
(Note, however, that Spark can load 10 gzipped files in parallel just fine; it's just that each of those 10 files can only be loaded by 1 task. You can still get parallelism across files, just not within a file.)
You can confirm that you only have 1 partition by checking the number of partitions in your RDD explicitly:
data.getNumPartitions()
The upper bound on the number of tasks that can run in parallel on an RDD is the number of partitions in the RDD or the number of slave cores in your cluster, whichever is lower.
In your case, it's the number of RDD partitions. You can increase that by repartitioning your RDD as follows:
data = sc.textFile(bucket).repartition(sc.defaultParallelism * 3)
Why sc.defaultParallelism * 3?
The Spark Tuning guide recommends having 2-3 tasks per core, and sc.defaultParallelism gives you the number of cores in your cluster.

Hadoop on cassandra database

I am using Cassandra to store my data and Hive to process my data.
I have 5 machines on which I have set up Cassandra, and 2 machines I use as analytics nodes (where Hive runs).
So I want to ask: does Hive do MapReduce on just the two machines (the analytics nodes) and bring the data there, or does it move the processing/computation to the 5 Cassandra nodes as well and process/compute the data on those machines? (What I know is that in Hadoop, the process moves to the data, not the data to the process.)
If you are interested in marrying Hadoop and Cassandra, the first link should be DataStax, a company built around this concept: http://www.datastax.com/
They built and support Hadoop with HDFS replaced by Cassandra.
To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/
There is a good answer about Hadoop and Cassandra data locality if you run MapReduce against Cassandra:
Cassandra and MapReduce - minimal setup requirements
Regarding your question, there is a trade-off:
a) If you run Hadoop/Hive on separate nodes, you lose data locality and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop/Hive on the same nodes as Cassandra, you can get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and thereby affect your quality of service from Cassandra.
My suggestion would be to have separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.
