We have a system built on Spark and an Oracle DB. The bottleneck is when the workers insert or update rows in the DB: some executors hit timeout errors. We have 6 workers, each with 64 GB RAM and 8 cores; the number of executors per worker equals its core count, and each executor runs one task. The Oracle DB has 16 cores and 96 GB RAM. We suspect that our Spark cluster is simply bigger than the DB system for inserts/updates of this volume (each executor may insert or update about 7 GB of data simultaneously). Is that right?
The other question is whether a distributed DB like Cassandra would be a solution for this bottleneck.
I hope my question is simple. What happens when someone enables Spark's dynamic allocation with a Cassandra database?
I have a 16-node cluster where every node has Spark and Cassandra installed, in order to also achieve data locality. I am wondering how dynamic allocation works in this case. Spark will calculate the workload in order to "hire" workers, right? But how does Spark know the size of the data in Cassandra (in order to calculate the workload) unless it queries it first?
For example, what if Spark hires 2 workers and the data in Cassandra is located on a 3rd node? Wouldn't that increase network traffic and the time it takes for Cassandra to ship the data from node 3 to node 2?
I tried it with my application and saw in the Spark UI that the master hired 1 executor to query the data from Cassandra and then added another 5 or 6 executors for the further processing. Overall, it took 10 minutes more than the roughly 1 minute the job takes without dynamic allocation.
(FYI: I am also using spark-cassandra-connector 3.1.0)
The Spark Cassandra connector estimates the size of the table using the values stored in the system.size_estimates table. For example, if the size_estimates indicates that there are 200K CQL partitions in the table and the mean partition size is 1MB, the estimated table size is:
estimated_table_size = mean_partition_size x number_of_partitions
= 1 MB x 200,000
= 200,000 MB
The connector then calculates the Spark partitions as:
spark_partitions = estimated_table_size / input.split.size_in_mb
= 200,000 MB / 64 MB
= 3,125
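As a rough sketch of where that split size comes from in practice, here is how a read could be configured; the keyspace/table names and connection host are placeholders, and the 64 MB value simply matches the example above:

import org.apache.spark.sql.SparkSession

// Keyspace/table names and the connection host are placeholders.
val spark = SparkSession.builder()
  .appName("cassandra-read")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .config("spark.cassandra.input.split.sizeInMB", "64") // the split size used above
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// Roughly estimated_table_size / input.split.size_in_mb partitions.
println(df.rdd.getNumPartitions)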
When there is data locality (Spark worker/executor JVMs co-located with the Cassandra JVMs), the connector knows which nodes own the data. You can take advantage of this functionality with repartitionByCassandraReplica(), so that each Spark partition is processed by an executor on the same node where the data resides, avoiding a shuffle.
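A minimal sketch of that pattern, assuming a hypothetical table my_ks.users with partition key id (connection host is a placeholder):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical schema: table my_ks.users with partition key "id".
case class UserKey(id: Int)

val conf = new SparkConf()
  .setAppName("locality-join")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder
val sc = new SparkContext(conf)

// Keys we want to look up, e.g. produced by an earlier stage.
val keys = sc.parallelize((1 to 1000).map(UserKey(_)))

// Move each key to an executor co-located with a replica that owns it,
// then join locally so the table data is not shuffled across the network.
val rows = keys
  .repartitionByCassandraReplica("my_ks", "users", 10)
  .joinWithCassandraTable("my_ks", "users")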
For more info, see the Spark Cassandra connector documentation. Cheers!
I am trying to develop a Hadoop project for one of our clients. We will be receiving around 2 TB of data per day, and as part of reconciliation we would like to read the 2 TB of data and perform sorting and filtering operations.
We have set up the Hadoop cluster with 5 data nodes running on t2.xlarge AWS instances with 4 CPU cores and 16 GB RAM each. What is the advisable number of mappers and reducers to launch to complete the data processing quickly?
Take a look at these:
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-1/
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-2/
This depends on the nature of the task, whether it is RAM- or CPU-bound, and how parallel your workload can be.
Since every node has 4 CPU cores and 16 GB RAM, I suggest on average 4 to 6 map/reduce tasks per node.
Creating too many MapReduce tasks will degrade CPU performance, and you may run into container failures from insufficient memory.
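As a hedged starting point (illustrative values only, shown here via the Hadoop configuration API from Scala), something along these lines keeps roughly 4 concurrent containers on a 16 GB / 4-core node:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Illustrative values only: about 4 concurrent containers per node,
// leaving headroom for the OS and the node's own daemons.
val conf = new Configuration()
conf.set("mapreduce.map.memory.mb", "3072")       // container size per map task
conf.set("mapreduce.map.java.opts", "-Xmx2457m")  // ~80% of the container
conf.set("mapreduce.reduce.memory.mb", "3072")
conf.set("mapreduce.reduce.java.opts", "-Xmx2457m")

val job = Job.getInstance(conf, "reconciliation-sort-filter")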
I want to understand a basic thing in Spark Streaming. I have 50 Kafka topic partitions and 5 executors, and I am using the direct API, so the number of RDD partitions will be 50. How will these partitions be processed on 5 executors? Will Spark process 1 partition at a time on each executor, or, if an executor has enough memory and cores, will it process more than 1 partition in parallel?
Will Spark process 1 partition at a time on each executor, or, if an executor has enough memory and cores, will it process more than 1 partition in parallel?
Spark will process each partition depending on the total number of cores available to the job you're running.
Let's say your streaming job has 10 executors, each one with 2 cores. This means that you'll be able to process 10 x 2 = 20 partitions concurrently, assuming spark.task.cpus is set to 1.
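For illustration, that setup could be expressed as follows (the property values are examples, not recommendations):

import org.apache.spark.SparkConf

// 10 executors x 2 cores, 1 CPU per task => up to 20 partitions in flight.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", "2")
  .set("spark.task.cpus", "1") // the default, shown here only for clarity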
If you really want the details: Spark Standalone requests resources through CoarseGrainedSchedulerBackend, and you can look at its makeOffers method:
private def makeOffers() {
  // Filter out executors under killing
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toIndexedSeq
  launchTasks(scheduler.resourceOffers(workOffers))
}
The key here is executorDataMap, which holds a mapping from executor id to an ExecutorData entry that tracks how many cores each executor has free. Based on that, together with the preferred locality of the partition, the scheduler decides which executor each task should run on.
Here is an example from a live Spark Streaming app consuming from Kafka:
We have 5 partitions and 3 executors running, where each executor has more than 2 cores, which lets the streaming job process all 5 partitions concurrently.
I am running a Spark-Kafka streaming job with 4 executors (1 core each), and the Kafka source topic has 50 partitions.
In the foreachPartition of the streaming Java program, I am connecting to Oracle and doing some work. Apache Commons DBCP2 is used for the connection pool.
The Spark Streaming program makes 4 connections to the database, presumably 1 per executor. My expectation was that, since there are 50 partitions, there should be 50 threads running and 50 database connections open.
How do I increase the parallelism without increasing the number of cores?
Your expectation is wrong. In Spark nomenclature, one core is one available task thread, and hence one partition that can be processed at a time.
4 "cores" -> 4 threads -> 4 partitions processed concurrently.
In a Spark executor, each core processes partitions one by one (one at a time). Since you have 4 executors with only 1 core each, you can process only 4 partitions concurrently. So, if your Kafka topic has 50 partitions, your Spark cluster needs about 13 rounds (4 partitions per round, 50 / 4 = 12.5, rounded up) to finish a batch. That is also why you see only 4 connections to the database.
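As a rough Scala sketch of the pattern described in the question (the JDBC URL, credentials, and the `stream` handle are placeholders): one pool lives per executor JVM and one connection is borrowed per partition, so with 4 single-core executors at most 4 connections are open at any moment.

import org.apache.commons.dbcp2.BasicDataSource

// One pool per executor JVM; URL and credentials are placeholders.
object ConnectionPool {
  lazy val dataSource: BasicDataSource = {
    val ds = new BasicDataSource()
    ds.setUrl("jdbc:oracle:thin:@//db-host:1521/SERVICE")
    ds.setUsername("app_user")
    ds.setPassword("secret")
    ds.setMaxTotal(4) // no more connections than tasks that can run at once
    ds
  }
}

// `stream` stands for the Kafka direct stream from the question.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = ConnectionPool.dataSource.getConnection
    try {
      records.foreach { r => /* INSERT/UPDATE through conn */ }
    } finally {
      conn.close() // returns the connection to the pool
    }
  }
}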
I am confused about dealing with executor memory and driver memory in Spark.
My environment settings are as below:
Memory 128 G, 16 CPU for 9 VM
Centos
Hadoop 2.5.0-cdh5.2.0
Spark 1.1.0
Input data information:
3.5 GB data file from HDFS
For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 GB memory) with spark-submit. Now I would like to set the executor memory or driver memory for performance tuning.
From the Spark documentation, the definition for executor memory is
Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
How about driver memory?
The memory you need to assign to the driver depends on the job.
If the job is based purely on transformations and terminates in some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ..., then the memory needs of the driver will be very low. A few hundred MB will do. The driver is also responsible for delivering files and collecting metrics, but it is not involved in data processing.
If the job requires the driver to participate in the computation, e.g. some ML algorithm that needs to materialize results and broadcast them for the next iteration, then your job becomes dependent on the amount of data passing through the driver. Operations like .collect, .take and .takeSample deliver data to the driver, and hence the driver needs enough memory to hold that data.
E.g. if you have an RDD of 3 GB in the cluster and call val myresultArray = rdd.collect, then you will need 3 GB of memory in the driver to hold that data, plus some extra room for the responsibilities mentioned in the first paragraph.
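A small sketch of the two cases, assuming a spark-shell session where sc is available and placeholder HDFS paths:

// Assuming a spark-shell session (sc available); paths are placeholders.
val rdd = sc.textFile("hdfs:///data/input")

// Driver-light: output flows from the executors straight to HDFS,
// so a few hundred MB of driver memory is enough.
rdd.saveAsTextFile("hdfs:///data/output")

// Driver-heavy: the whole dataset is materialized in the driver JVM,
// so spark.driver.memory must be sized for it (e.g. --driver-memory 4g).
val myresultArray = rdd.collect()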
In a Spark application, the driver is responsible for task scheduling and the executors are responsible for executing the concrete tasks in your job.
If you are familiar with MapReduce: your map tasks & reduce tasks are all executed in the executors (in Spark they are called ShuffleMapTasks & ResultTasks), and whatever RDD you want to cache also lives in the executors' JVM heap & local disk.
So I think a few GB will be fine for your driver.
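To illustrate that last point (spark-shell session assumed, path is a placeholder), a cached RDD is materialized on the executors, not the driver:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/input")
rdd.persist(StorageLevel.MEMORY_AND_DISK) // cached blocks live on the executors
rdd.count()                               // first action materializes the cache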
Spark shell required memory = (Driver Memory + 384 MB) + (Number of executors * (Executor memory + 384 MB))
Here 384 MB is the default per-JVM memory overhead that Spark reserves on top of the heap when executing jobs (the actual overhead is the larger of 384 MB and a fixed fraction of the configured memory).
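For example, with a hypothetical 2 GB driver and 8 executors of 4 GB each:

required memory = (2,048 MB + 384 MB) + 8 x (4,096 MB + 384 MB)
                = 2,432 MB + 35,840 MB
                = 38,272 MB (roughly 37.4 GB)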