Running parallel queries in Spark - hadoop

How does spark handle concurrent queries? I have read a bit about spark and underlying RDD's but I am unable to understand how concurrent queries would be handled?
For example if I run a query which loads the data in memory and the entire available memory is consumed and at the same time someone else runs a query involving another set of data, how would spark allocate the memory to both the queries? Also what would be the impact if the priorities are taken into account.
Also can running lots of parallel queries would result in the machines hanging ?

Firstly Spark doesn't take the in-memory (RAM) more than threshold limit.
Spark tries to allocate the default in-memory to every job.
If there is insufficient memory for a new job then it tries to spill the in-memory content of LeastRecentlyUsed (LRU) RDD to disk and then allocates to new job.
Optionally you can also specify the storage of RDD like IN-MEMORY only, DISK only, MEMORY AND DISK etc..
Scenario: consider a low in-memory machine with huge no of jobs, then most of the RDDs will be placed in disk only, as per the above approach.
So, the jobs will continue to run but it will not take the advantage of Spark in-memory processing.
Spark does the memory allocation very intelligently.
If Spark used on top-of YARN then Resource manager also takes place in the resource allocation.

Related

In Spark Streaming, can we store data (hashmap) in Executor memory

I want to maintain a cache (HashMap) in Spark Executors memory (long lived cache) so that all tasks running on the executor (at different times) can do a lookup there and also be able to update the cache.
Is this possible in Spark streaming?
I'm not sure there is a way to store custom data structures permanently on executors. My suggestion here is to use some external caching system (like Redis, Memcached or even ZooKeeper in some cases). You can further connect to that system using such methods like foreachPartition or mapPartitions while processing RDD/DataFrame to reduce the number of connections to 1 connection per partition.
The reason of why this would work is that i.e. both Redis and Memcached are in-memory storages so there will be no overhead of spilling data to disk.
The two other ways to distribute some state across executors are Accumulators and Broadcast variables. For Accumulators all executors can write into it but reading can be performed only by driver. For Broadcast variable you write it only once on driver and then distribute it as a read-only data structure to executors. Both cases doesn't work for you so the described solution is the only possible way that I can see here.

Random Reads and Scans inside the same HBase cluster

We have a situation where we host data for:
MapReduce/Spark jobs (disk accessed by seq. reads)
Random reads.
(disk accessed by seeks)
All inside the same cluster/table.
With YARN we can manage resources like CPU and RAM, but during intensive scans HDD can become a bottleneck and can slow down random read performance. How to manage that resource
How this kind of situations are being handled in general?
Since mapreduce generally does not require live data, people often make a backup of hbase table and run mapreduce on backup data table. Or do a snapshot of table and run mp. on it.

Spark SQL performance with Simple Scans

I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), and I am running my queries through Spark SQL using hive context.
After checking the performance tuning documents on the spark page and some clips from latest spark summit, I decided to set the following configs in my spark-env:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
(As my tasks tend to be long so the overhead of starting multiple JVMs, one per worker is much less than the total query times). As I monitor the job progress, I realized that while the Worker memory is 2.5GB, the executors (one per worker) have max memory of 512MB (which is default). I enlarged this value in my application as:
conf.set("spark.executor.memory", "2.5g");
Trying to give max available memory on each worker to its only executor, but I observed that my queries are running slower than the prev case (default 512MB). Changing 2.5g to 1g improved the performance time, it is close to but still worse than 512MB case. I guess what I am missing here is what is the relationship between the "WORKER_MEMORY" and 'executor.memory'.
Isn't it the case that WORKER tries to split this memory among its executors (in my case its only executor) ? Or there are other stuff being done worker which need memory ?
What other important parameters I need to look into and tune at this point to get the best response time out of my HW ? (I have read about Kryo serializer, and I am about trying that - I am mainly concerned about memory related settings and also knobs related to parallelism of my jobs). As an example, for a simple scan-only query, Spark is worse than Hive (almost 3 times slower) while both are scanning the exact same table & file format. That is why I believe I am missing some params by leaving them as defaults.
Any hint/suggestion would be highly appreciated.
Spark_worker_cores is shared across the instances. Increase the cores to say 8 - then you should see the kind of behavior (and performance) that you had anticipated.

Why is Spark fast when word count? [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
Test case: word counting in 6G data in 20+ seconds by Spark.
I understand MapReduce, FP and stream programming models, but couldn’t figure out the word counting is so amazing fast.
I think it’s an I/O intensive computing in this case, and it’s impossible to scan 6G files in 20+ seconds. I guess there is index is performed before word counting, like Lucene does. The magic should be in RDD (Resilient Distributed Datasets) design which I don’t understand well enough.
I appreciate if anyone could explain RDD for the word counting case. Thanks!
First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs which is not fast. Spark job startup (on existing Spark cluster) causes existing JVM to fork new task threads, which is times faster than starting JVM
Next, no indexing and no magic. 6GB file is stored in 47 blocks of 128MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks are residing on different JBOD HDDs. Each of them would deliver you 70 MB/sec scan rate, which means you can read this data in ~2 seconds. With 10GbE network in your cluster you can transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data to disks a number of times. It puts map output to the disk at least once (and more if the map output is big and on-disk merges happen). It puts the data to disks next time on reduce side before the reduce itself is executed. Spark puts the data to HDDs only once during the shuffle phase, and the reference Spark implementation recommends to increase the filesystem write cache not to make this 'shuffle' data hit the disks
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question
Other than the factors mentioned by 0x0FFF, local combining of results also makes spark run word count more efficiently. Spark, by default, combines results on each node before sending the results to other nodes.
In case of word count job, Spark calculates the count for each word on a node and then sends the results to other nodes. This reduces the amount of data to be transferred over network. To achieve the same functionality in Hadoop Map-reduce, you need to specify combiner class job.setCombinerClass(CustomCombiner.class)
By using combineByKey() in Spark, you can specify a custom combiner.
Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or reduce action. But Spark needs a lot of memory
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
Resilient Distributed Dataset (RDD), which allows you to transparently store data on memory and persist it to disc if it's needed.
Since Spark uses in-memory, there's no synchronisation barrier that's slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground-up to as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, but with richer optimizations under the hood that supports to the speed of spark.
Using RDDs Spark simplifies complex operations like join and groupBy and in the backend, you’re dealing with fragmented data. That fragmentation is what enables Spark to execute in parallel.
Spark allows to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Sparks speed.
Hope this helps.

How to deal with executor memory and driver memory in Spark?

I am confused about dealing with executor memory and driver memory in Spark.
My environment settings are as below:
Memory 128 G, 16 CPU for 9 VM
Centos
Hadoop 2.5.0-cdh5.2.0
Spark 1.1.0
Input data information:
3.5 GB data file from HDFS
For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 G memory) with spark-submit. Now I would like to set executor memory or driver memory for performance tuning.
From the Spark documentation, the definition for executor memory is
Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
How about driver memory?
The memory you need to assign to the driver depends on the job.
If the job is based purely on transformations and terminates on some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ... then the memory needs of the driver will be very low. Few 100's of MB will do. The driver is also responsible of delivering files and collecting metrics, but not be involved in data processing.
If the job requires the driver to participate in the computation, like e.g. some ML algo that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent of the amount of data passing through the driver. Operations like .collect,.take and takeSample deliver data to the driver and hence, the driver needs enough memory to allocate such data.
e.g. If you have an rdd of 3GB in the cluster and call val myresultArray = rdd.collect, then you will need 3GB of memory in the driver to hold that data plus some extra room for the functions mentioned in the first paragraph.
In a Spark Application, Driver is responsible for task scheduling and Executor is responsible for executing the concrete tasks in your job.
If you are familiar with MapReduce, your map tasks & reduce tasks are all executed in Executor(in Spark, they are called ShuffleMapTasks & ResultTasks), and also, whatever RDD you want to cache is also in executor's JVM's heap & disk.
So I think a few GBs will just be OK for your Driver.
Spark shell required memory = (Driver Memory + 384 MB) + (Number of executors * (Executor memory + 384 MB))
Here 384 MB is maximum memory (overhead) value that may be utilized by Spark when executing jobs.

Resources