I am facing a problem using Spark Streaming with Apache Kafka, where Spark is deployed on YARN. I am using the Direct Approach (No Receivers) to read data from Kafka, with 1 topic and 48 partitions. With this setup on a 5-node (4-worker) Spark cluster (24 GB of memory available on each machine) and the Spark configuration spark.executor.memory=2gb and spark.executor.cores=1, there should be 48 executors on the Spark cluster (12 executors on each machine).
The Spark Streaming documentation also confirms that there is a one-to-one mapping between Kafka partitions and RDD partitions. So for 48 Kafka partitions there should be 48 RDD partitions, each executed by 1 executor. But when running this, only 12 executors are created, the Spark cluster's capacity remains unused, and we are not able to get the desired throughput.
It seems that this Direct Approach to reading data from Kafka in Spark Streaming is not behaving according to the Spark Streaming documentation. Can anyone suggest what I am doing wrong here, as I am not able to scale horizontally to increase the throughput?
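For reference, here is a minimal sketch of the setup described, assuming the spark-streaming-kafka-0-10 direct stream API; the broker address, topic name, group id, and batch interval are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val conf = new SparkConf()
      .setAppName("kafka-direct-example")
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "1")
      // On YARN the executor count is not derived from the number of Kafka
      // partitions; it has to be requested explicitly (or via dynamic allocation).
      .set("spark.executor.instances", "48")

    val ssc = new StreamingContext(conf, Seconds(10))   // placeholder batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",            // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group"            // placeholder
    )

    // Direct stream: one RDD partition per Kafka partition (48 here). Those 48
    // tasks are then scheduled over whatever executor cores actually exist.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.foreachRDD(rdd => println(s"RDD partitions = ${rdd.getNumPartitions}"))

    ssc.start()
    ssc.awaitTermination()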
I hope my question is simple. What happens when someone enables Spark's dynamic allocation with a Cassandra database?
I have a 16-node cluster where every node has both Spark and Cassandra installed, in order to also achieve data locality. I am wondering how dynamic allocation works in this case. Spark will calculate the workload in order to "hire" workers, right? But how does Spark know the size of the data in Cassandra (in order to calculate the workload) unless it queries it first?
For example, what if Spark hires 2 workers and the data in Cassandra is located on a 3rd node? Wouldn't that increase network traffic and the time it takes until Cassandra copies the data from node 3 to node 2?
I tried it with my application, and I saw in the Spark UI that the master hired 1 executor to query the data from Cassandra and then added another 5 or 6 executors to do the further processing. Overall, it took 10 minutes more than the normal 1 minute it takes without dynamic allocation.
(FYI: I am also using spark-cassandra-connector 3.1.0)
The Spark Cassandra connector estimates the size of the table using the values stored in the system.size_estimates table. For example, if the size_estimates table indicates that there are 200K CQL partitions in the table and the mean partition size is 1 MB, the estimated table size is:
estimated_table_size = mean_partition_size x number_of_partitions
                     = 1 MB x 200,000
                     = 200,000 MB
The connector then calculates the Spark partitions as:
spark_partitions = estimated_table_size / input.split.size_in_mb
                 = 200,000 MB / 64 MB
                 = 3,125
When there is data locality (Spark worker/executor JVMs are co-located with the Cassandra JVMs), the connector knows which nodes own the data. You can take advantage of this functionality by using repartitionByCassandraReplica(), so that each Spark partition is processed by an executor on the same node where the data resides, avoiding shuffling.
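As a rough illustration, here is a minimal sketch of that pattern with the RDD API; the connection host, keyspace, table, key column, and partitions-per-host value are all placeholders:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.SparkSession

    // Case class whose field name must match the table's partition key column.
    case class UserKey(user_id: Int)

    val spark = SparkSession.builder()
      .appName("cassandra-locality-example")
      .config("spark.cassandra.connection.host", "127.0.0.1")   // placeholder
      .getOrCreate()
    val sc = spark.sparkContext

    // Keys we want to look up (placeholder values).
    val keys = sc.parallelize((1 to 1000).map(i => UserKey(i)))

    // Group the keys so each Spark partition only holds keys whose data lives
    // on the Cassandra replica local to the executor that will process it.
    val local = keys.repartitionByCassandraReplica("my_keyspace", "users", 10)   // 10 partitions per host

    // Join against the table without shuffling the Cassandra-side data.
    val joined = local.joinWithCassandraTable("my_keyspace", "users")
    joined.take(5).foreach(println)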
For more info, see the Spark Cassandra connector documentation. Cheers!
I am setting up a Spark cluster. I have the HDFS data nodes and Spark nodes on the same instances.
The current setup is:
1 master (Spark and HDFS)
6 Spark workers and HDFS data nodes
All instances are the same: 16 GB of RAM, dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have two options:
1. Just deploy ES on these 3 machines. The cluster will look like:
1 master (Spark and HDFS)
6 Spark workers and HDFS data nodes
3 Elasticsearch nodes
2. Deploy an ES master on 1 machine, and extend Spark, HDFS, and ES across all the others. The cluster will look like:
1 master (Spark and HDFS)
1 Elasticsearch master
8 Spark workers, HDFS data nodes, and ES data nodes
My application heavily uses Spark for joins, ML, etc., but we are also looking for search capabilities. Search definitely does not need to be real time, and a refresh interval of up to 30 minutes is fine for us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution does not need to be one of the above; I am open to experimentation if someone has a suggestion. It would also be handy for other devs once this is concluded.
Also, I am trying the ES-Hadoop / es-spark project, but ingestion feels very slow with 3 dedicated nodes; it's about 0.6 million records/minute.
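For context, the ingestion path looks roughly like this (a sketch assuming the RDD-based saveToEs API from es-spark; the node list, index name, and batch settings are illustrative, not our exact values):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._   // adds saveToEs to RDDs

    val conf = new SparkConf()
      .setAppName("es-ingest-example")
      .set("es.nodes", "es-node-1,es-node-2,es-node-3")   // placeholder hosts
      .set("es.batch.size.entries", "5000")               // illustrative bulk sizing
      .set("es.batch.size.bytes", "5mb")
      .set("es.batch.write.refresh", "false")             // we don't need near-real-time search

    val sc = new SparkContext(conf)

    // Stand-in for the documents produced by the Spark job.
    val docs = sc.parallelize(Seq(
      Map("id" -> 1, "text" -> "hello"),
      Map("id" -> 2, "text" -> "world")))

    docs.saveToEs("myindex/doc")   // illustrative index/type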
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it is the bottleneck in your operation.
I would first check whether the network links are saturated, e.g. via iftop -i any or similar. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.
If the network turns out not to be the bottleneck here, I would next look into the way Spark and HDFS are deployed.
Are you using all of the available RAM (Java Xmx set high enough? Spark memory limits? YARN memory limits, if Spark is deployed via YARN?)?
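For example, these are the kinds of memory settings I would verify (a sketch only; the values are placeholders, not recommendations):

    import org.apache.spark.SparkConf

    // Memory knobs worth checking when Spark runs on YARN (placeholder values).
    val conf = new SparkConf()
      .set("spark.executor.memory", "10g")          // per-executor JVM heap (Xmx)
      .set("spark.executor.memoryOverhead", "1g")   // off-heap headroom on top of the heap
                                                    // (spark.yarn.executor.memoryOverhead on older releases)
      .set("spark.driver.memory", "4g")

    // The YARN side must allow containers of that size:
    // yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb
    // both need to be >= executor memory + overhead.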
You should also check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances: 3 ES nodes feeding 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at providing the data than HDFS is at writing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check whether your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)
Currently we are using Spark with a 1-minute batch interval to process the data. The data flow is HTTP endpoint -> Spring XD -> Kafka -> Spark Streaming -> HBase. The format is JSON. We are running the Spark jobs in an environment that has 6 NodeManagers, each with 16 CPU cores and 110 GB of RAM. For caching metadata, Scala's TrieMap is used, so the cache is per executor. Results on Spark with the settings below:
Kafka partitions - 45
Spark executors - 3
Cores per executor - 15
Number of JSON records - 455,000, received over a period of 10 minutes
Spark processed the records in 12 minutes, and each executor was able to process about 350-400 records per second. JSON parsing, validation, and other work are done before loading into HBase.
With almost the same code, modified for Flink, I ran the job with Flink streaming deployed on a YARN cluster. Results on Flink with the settings below:
Kafka partitions - 45
Number of Task Managers running the job - 3
Slots per TM - 15
Parallelism - 45
Number of JSON records - 455,000, received over a period of 10 minutes
But Flink takes almost 50 minutes to process the records, with each TM processing 30-40 records per second.
What am I missing here? Are there any other parameters/configurations, apart from the ones mentioned, that impact performance? The job flow is DataStream -> Map -> custom functions (sketched below). How can I improve the performance of Flink here?
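In outline, the Flink job looks roughly like this (a sketch using the Flink Scala API and a recent Kafka connector class; the broker, topic, group id, and the map/sink bodies are placeholders for our custom functions and the HBase sink):

    import java.util.Properties

    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(45)   // one source subtask per Kafka partition

    val props = new Properties()
    props.setProperty("bootstrap.servers", "broker1:9092")   // placeholder
    props.setProperty("group.id", "flink-example")           // placeholder

    val source = env.addSource(
      new FlinkKafkaConsumer[String]("my-topic", new SimpleStringSchema(), props))

    // Stand-in for JSON parsing, validation, and the other custom functions.
    val processed = source.map(json => json.length)

    // Stand-in for the real HBase sink.
    processed.print()

    env.execute("flink-streaming-sketch")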
My Spark cluster has 1 master and 3 workers (on 4 separate machines, each machine with 1 core), and the other settings are as in the picture below, where spark.cores.max is set to 3 and spark.executor.cores is also 3 (see pic-1).
But when I submit my job to the Spark cluster, I can see in the Spark web UI that only one executor is used (judging by the memory used and the RDD blocks in pic-2), not all of them. In this case the processing speed is much slower than I expected.
Since I've set the max cores to 3, shouldn't all the executors be used for this job?
How can I configure Spark to distribute the current job to all executors, instead of having only one executor run it?
Thanks a lot.
pic-1:
pic-2:
You said you are running two receivers; what kind of receivers are they (Kafka, HDFS, Twitter...)?
Which Spark version are you using?
In my experience, if you are using any receiver other than the file receiver, it will occupy 1 core permanently.
So when you say you have 2 receivers, 2 cores will be permanently used for receiving the data, and you are left with only 1 core doing the actual work.
Please post a screenshot of the Spark master homepage as well, and of the job's Streaming page.
In Spark Streaming only 1 receiver is launched to get the data from the input source into RDDs.
Repartitioning the data after the first transformation can increase parallelism.
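A minimal sketch of that idea; the input source, partition count, and processing step are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("receiver-repartition-example")
    val ssc = new StreamingContext(conf, Seconds(10))   // placeholder batch interval

    // A receiver-based source occupies one core, and the received blocks land
    // in relatively few partitions (socket source used here as a placeholder).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Spread the received data across more partitions so that every available
    // executor core gets work for the downstream transformations.
    val widened = lines.repartition(12)
    widened.map(_.length).print()

    ssc.start()
    ssc.awaitTermination()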
I am using Cassandra to store my data and Hive to process it.
I have 5 machines on which I have set up Cassandra, and 2 machines I use as analytics nodes (where Hive runs).
So I want to ask: does Hive run MapReduce on just the two machines (the analytics nodes) and bring the data there, or does it move the processing/computation to the 5 Cassandra nodes as well and compute the data on those machines? (What I know is that in Hadoop, the process moves to the data, not the data to the process.)
If you are interested in marrying Hadoop and Cassandra, the first place to look should be DataStax, a company built around this concept: http://www.datastax.com/
They build and support a Hadoop distribution with HDFS replaced by Cassandra.
To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/
There is a good answer about Hadoop & Cassandra data locality when you run MapReduce against Cassandra:
Cassandra and MapReduce - minimal setup requirements
Regarding your question, there is a trade-off:
a) If you run Hadoop/Hive on separate nodes, you lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop/Hive on the same nodes where Cassandra runs, you can get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and thereby affect your quality of service from Cassandra.
My suggestion would be to have separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.