I am currently working on Spark and trying to build an adaptive execution plan, and I am wondering whether it is possible to modify the parameters of the Spark engine at runtime. For example, can I use different compression codecs for two separate stages, or can I modify the memory fractions reserved for shuffling and computation at runtime? Say, for the map phase, could I reduce the memory fraction allocated for shuffling and then increase it later, when the shuffle actually occurs?
Thanks
It is not possible in general.
While a subset of configuration options can be changed at runtime through the RuntimeConfig object (see: Customize SparkContext using sparkConf.set(..) when using spark-shell), core options cannot be modified unless the SparkContext is restarted.
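For illustration, here is a minimal sketch (assuming Spark 2.x+ and a SparkSession; the option names shown are real, the values are just examples) of the distinction between options the RuntimeConfig accepts and core options that are fixed at startup:
// Sketch: some SQL options are read at query-planning time and can be changed per job.
val spark = org.apache.spark.sql.SparkSession.builder().appName("runtime-config-demo").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")   // takes effect for subsequent queries
// Core engine options such as memory fractions or the shuffle compression codec are read
// when the executors start; changing them here is rejected or ignored at runtime.
// spark.conf.set("spark.memory.fraction", "0.4")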
Related
I am running a PySpark application and I am trying to persist a DataFrame because I reuse it later in the code.
I am using the following:
sourceDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
I am processing 30GB of data.
I have 3 nodes, all 16 GB RAM and 4 Virtual Cores.
From the Spark UI, I see that the Size in Memory after persisting is very small. I want it to keep as much of the cached data in RAM as possible.
How can I best utilise RAM Memory?
Also, the GC time for the tasks seems quite high. How can I reduce it?
You're already making the best use of memory by using dataframes and storing data with serialization. There's not much more you can do besides filtering out as much data as possible that isn't needed for the final result before caching.
Garbage collection is tricky. When you work with the DataFrame API and untyped transformations, Catalyst does its best to avoid unnecessary object creation, so you don't have much say over GC when you run into issues with DataFrames. Some operations are inherently more expensive in terms of performance and object creation, but you can only control those through the typed Dataset API and the RDD API. You're best off doing what you're doing now. If GC is truly an issue, the most effective step is to use a JVM profiling tool, find which pieces of code create the most objects, and optimize those. In addition, minimizing data skew as much as possible and leveraging broadcast joins where possible should help avoid some GC.
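As a small illustration of the broadcast-join point (a sketch in Scala; the lookupDF name and the id join column are assumptions, not from the question):
import org.apache.spark.sql.functions.broadcast
// Broadcasting the small side avoids shuffling the large DataFrame,
// which cuts network traffic and object churn, and with it some GC pressure.
val joined = sourceDF.join(broadcast(lookupDF), Seq("id"))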
What is the best way to optimize Spark jobs deployed on a YARN-based cluster?
I am looking for configuration-level changes, not code-level ones. My question is a classic design-level question: what approach should be used to optimize jobs developed with Spark Streaming or Spark SQL?
There is a myth that BigData is magic and that your code will work like a dream once it is deployed to a BigData cluster.
Every newbie has the same belief :) There is also a misconception that the configurations given in web blogs will work fine for every problem.
There is no shortcut for optimizing or tuning jobs on Hadoop without understanding your cluster deeply.
But with the approach below, I'm confident you'll be able to optimize your job within a couple of hours.
I prefer to apply a purely scientific approach to optimizing jobs. The following steps can be used as a baseline to start optimizing a job:
Understand the Block Size configured at cluster.
Check the maximum memory limit available for container/executor.
Understand the VCores available for the cluster.
Optimize the rate of data, specifically in the case of Spark Streaming real-time jobs. (This is the trickiest part in Spark Streaming.)
Consider the GC settings during optimization.
There is always room for optimization at the code level; that needs to be considered as well.
Control the block size optimally based on the cluster configuration from Step 1 and on the data rate. In Spark Streaming, the number of blocks per batch can be calculated as batchInterval / blockInterval.
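For example, a rough sketch (the batch and block intervals are illustrative values, not recommendations): with a 10-second batch interval and a 200 ms block interval, each receiver produces about 10000 / 200 = 50 blocks, and therefore about 50 tasks, per batch.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// ~10000 ms batch / 200 ms block = ~50 blocks (tasks) per receiver per batch
val conf = new SparkConf().setAppName("rate-tuning-demo")
  .set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(10))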
Now come the most important steps. The knowledge I'm sharing is more specific to real-time use cases like Spark Streaming and SQL with Kafka.
First of all, you need to know at what number of messages/records your job works best. Once you know it, you can limit the rate to that particular number and start configuration-based experiments to optimize the job. That is what I did below, and it resolved a performance issue at high throughput.
I read some of the parameters from the Spark configuration documentation and checked their impact on my jobs, then built a grid and ran the same job under five different configuration versions. Within three experiments I was able to optimize my job; the highlighted combination in that grid was the magic formula for my job's optimization.
The same parameters might be very helpful for similar use cases, but obviously these parameters do not cover everything.
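As an illustration only, one "configuration version" in such a grid might look like the sketch below; the configuration keys are real Spark settings, but the values are placeholders to experiment with, not a recipe:
import org.apache.spark.SparkConf
// Illustrative tuning knobs for a Spark Streaming + Kafka job; values must be found by experiment.
val tunedConf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")        // let Spark adapt the ingestion rate
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // cap records/sec per Kafka partition
  .set("spark.executor.memory", "6g")
  .set("spark.executor.cores", "3")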
Assuming that the application works, i.e. the memory configuration is taken care of and we have at least one successful run of the application, I usually look for underutilisation of executors and try to minimise it. Here are the common questions worth asking to find opportunities for improving utilisation of the cluster/executors:
How much of the work is done in the driver vs the executors? Note that while the main Spark application thread is busy in the driver, the executors are idle.
Does your application have more tasks per stage than the number of cores? If not, some cores will sit idle during that stage.
Are your tasks uniform, i.e. not skewed? Since Spark moves computation from stage to stage (except for some stages that can run in parallel), it is possible for most of your tasks to complete while the stage keeps running because one skewed task is still held up.
Shameless plug (I am the author): Sparklens https://github.com/qubole/sparklens can answer these questions for you, automatically.
Some things are not specific to the application itself. If your application has to shuffle lots of data, pick machines with better disks and network. Partition your data to avoid full data scans. Use columnar formats like Parquet or ORC to avoid fetching data for columns you don't need all the time. The list is pretty long, and some problems are known but don't have good solutions yet.
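A minimal sketch of the columnar-format point (the paths and column names here are made up for illustration): writing partitioned Parquet lets later reads prune both partitions and columns.
// Write partitioned Parquet so later queries skip irrelevant files and columns.
df.write.partitionBy("event_date").parquet("s3://my-bucket/events")
// Only the matching partition and the selected column are actually fetched.
val slice = spark.read.parquet("s3://my-bucket/events")
  .where("event_date = '2018-01-01'")
  .select("user_id")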
I keep hearing that Spark has an advantage over Hadoop due to Spark's in-memory computation. However, one obvious problem is that not all the data can fit into one computer's memory. So is Spark then limited to smaller datasets? At the same time, there is the notion of a Spark cluster. So I am not following the purported advantages of Spark over Hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged: Apache Drill, a low-latency SQL engine for self-service data exploration, and Apache Spark, a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified framework. Let's dig a little bit more into Spark.
To understand Spark, you really have to understand three big concepts.
The first is RDDs, resilient distributed datasets. An RDD is a representation of the data coming into your system in an object format that allows you to do computations on top of it. RDDs are resilient because they carry their lineage: whenever there's a failure in the system, they can recompute themselves using that lineage information.
The second concept is transformations. Transformations are what you apply to RDDs to get other RDDs. Examples of transformations would be things like opening a file and creating an RDD, or applying functions like filter that then produce other RDDs.
The third and final concept is actions. These are operations where you actually ask the system for an answer, for instance count, or asking what's the first line that has Spark in it. The interesting thing about Spark is that it does lazy evaluation, which means that RDDs are not loaded and pushed into the system the moment the system encounters them; they are only materialized when there is actually an action to be performed.
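A tiny sketch of these three concepts and of lazy evaluation (the file path is an example; sc is an existing SparkContext):
val lines  = sc.textFile("hdfs:///logs/app.log")    // transformation: nothing is read yet
val sparkLines = lines.filter(_.contains("Spark"))  // transformation: still nothing executed
val n      = sparkLines.count()                     // action: the whole lineage is evaluated now
val first  = sparkLines.first()                     // action: "the first line that has Spark in it"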
One question that comes up with RDDs, given that they are resilient and held in main memory, is how they compare with the distributed shared memory architectures most of us are familiar with. There are a few differences; let's go through them briefly. First, writes in RDDs are coarse-grained and at the core of Spark: they happen at the level of a whole RDD. Writes in distributed shared memory are typically fine-grained. Reads in distributed shared memory are fine-grained as well, while reads in RDDs can be fine- or coarse-grained.
The second piece is recovery. What happens if there is a fault in the system, how do we recover? Since RDDs build a lineage graph, if something goes bad they can go back, recompute based on that graph, and regenerate the RDD; lineage is relied on heavily in RDDs for recovery. In distributed shared memory we typically fall back to checkpoints taken at intervals, or some other semantic checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath is assumed to be immutable. If, however, the data were changing, then consistency would be a problem. Distributed shared memory doesn't make any assumptions about mutability and therefore leaves the consistency semantics to the application to take care of.
At last let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark optimizes computations, and places them optimally, using the directed acyclic graph (DAG).
Very easy programming paradigms using transformations and actions on RDDs, as well as rich library support for machine learning, graph processing and, more recently, DataFrames.
At this point a question comes up: if Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for you to write your big data applications; it still needs to run on a storage system or on a NoSQL system.
Spark is not limited to smaller datasets, and it is not always about in-memory computation. Spark has a good number of higher-level APIs, and it can process data well beyond what fits in memory. In my real-world experience I have used Spark for streaming applications where we typically receive data at a rate of several GB per hour, and we have also used Spark in telecommunications to handle larger datasets. Check RDD Persistence for how to accommodate bigger datasets.
Real-world problems usually can't be solved by a single MapReduce program with one Mapper class and one Reducer class; we mostly need to build a pipeline. A pipeline consists of multiple stages, each being a MapReduce program, and the output of one stage is fed one or more times into the subsequent stages. This is a pain because of the amount of IO it involves.
In MapReduce there are the map and reduce tasks, after which there is a synchronization barrier and the data has to be preserved to disk. This feature of the MapReduce framework was developed with the intent that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum. And this becomes worse when you have an iterative algorithm in your pipeline: every iteration causes a significant amount of disk IO.
So, to solve this problem, Spark introduced a new data structure called the RDD: a data structure that holds information such as how the data can be read from disk and what to compute. Spark also provides an easy programming paradigm for creating a pipeline (a DAG) by transforming RDDs, and what you get is a series of RDDs that know how to get the data and what to compute.
Finally, when an action is invoked, the Spark framework internally optimizes the pipeline, groups together the portions that can be executed together (the map phases), and creates a final optimized execution plan from the logical pipeline, which it then executes. It also gives the user the flexibility to choose which data should be cached. Hence Spark is able to achieve batch processing roughly 10 to 100 times faster than MapReduce.
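A brief sketch of why this helps an iterative pipeline (the data path and the computation are illustrative): the parsed data is cached once and reused on every iteration instead of being re-read from disk each time, which is where a chain of MapReduce jobs pays repeated IO.
// Parse once, cache in memory, then iterate over the cached RDD.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

var total = 0.0
for (_ <- 1 to 10) {
  total += points.map(_.sum).reduce(_ + _)   // reuses the in-memory data every iteration
}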
Spark advantages over Hadoop:
Since Spark tasks across stages can be executed in the same executor JVMs, the time to spawn an executor is saved across multiple tasks.
Even with huge memory available, MapReduce can never take advantage of caching data in memory and reusing that in-memory data for subsequent steps.
Spark, on the other hand, can cache data if a large enough JVM is available to it, and the in-memory data is reused across stages.
In Spark, tasks run as threads within the same executor, making the per-task memory footprint light.
In MapReduce, the map and reduce tasks are separate processes, not threads.
Spark uses efficient serialization format to store data on disk.
Follow this for a detailed understanding: http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/
I'm writing an application which produces several files and stores them back to S3.
Most of the transformations operate on DataFrames. The current state of the application is already somewhat complex being translated into 60 jobs, some of them mapped to hundreds of stages. Some of the DataFrames are reused along the way and those are cached.
The problem is performance, which is clearly impacted by dependencies.
I have a few questions, any input on any of them will be highly appreciated.
(1) When I split the application into parts and execute them individually, reading the inputs from files generated by the previous parts, the total execution time is a fraction (15%) of the execution time of the application run as a whole. This is counterintuitive, as the whole application reuses DataFrames already in memory, caching guarantees that no DataFrame is computed more than once, and the various jobs execute in parallel wherever possible.
I also noticed that the total number of stages when running the application as a whole is much higher than when running the parts individually, and I would expect them to be comparable. Is there an explanation for this?
(2) If the approach of executing parts of the application individually is the way to go, then how do I enforce the dependencies between the parts to make sure the necessary inputs are ready?
(3) I read a few books, which devote some chapters to the execution model and performance analysis through the Spark Web UI. All of them use RDDs and I need DataFrames. Obviously even for DataFrame based applications Spark Web UI provides a lot of useful information but the analysis is much harder. Is there a good resource I could use to help me out?
(4) Is there a good example demonstrating how to minimize shuffling by appropriate partitioning of the DataFrame? My attempts so far have been ineffective.
Thanks.
Splitting the application is not recommended. If you have many stages and are seeing performance issues, try checkpointing, which saves an RDD to a reliable storage system (e.g. HDFS, S3) while forgetting the RDD's lineage completely.
//set the checkpoint directory in the program (sc is your SparkContext)
sc.setCheckpointDir("/tmp/checkpoints")   //a directory on reliable storage such as HDFS
//checkpoint the RDD (rdd is the RDD whose lineage you want to truncate)
rdd.checkpoint()
If you are using DataFrames, then manually checkpoint the data at logical points by introducing Parquet/ORC hops (writing the data to Parquet/ORC files and reading it back):
//Write to ORC
dataframe.write.format("orc").save("/tmp/src-dir/tempFile.orc")
//where /tmp/src-dir/ is a HDFS directory
//Read ORC
val orcRead = sqlContext.read.format("orc").load("/tmp/src-dir/tempFile.orc")
2. Splitting the program is not recommended, but if you still want to do it, then create separate ".scala" programs and manage the dependencies at the Oozie level.
3. In the Spark web UI, refer to the SQL tab, which will show you the execution plan. For a detailed study of a DataFrame, run
DF.explain() //which will show you the execution plan
DataFrames in Spark have their execution automatically optimized by a query optimizer. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution. Because the optimizer understands the semantics of operations and structure of the data, it can make intelligent decisions to speed up computation.
Refer Spark guide - http://spark.apache.org/docs/latest/sql-programming-guide.html
4. Sort the data before operations such as join. To reduce shuffling, use the repartition function.
val DF1 = DF.repartition(10)
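For question (4), here is a sketch of pre-partitioning both sides of a join on the join key (the DataFrame names and the customer_id column are hypothetical), so the join itself does not add another shuffle:
import org.apache.spark.sql.functions.col
// Both sides hashed on the same key and partition count, then joined on that key.
val ordersByKey    = ordersDF.repartition(200, col("customer_id"))
val customersByKey = customersDF.repartition(200, col("customer_id"))
val joined = ordersByKey.join(customersByKey, "customer_id")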
Please post your code if you have any other specific doubt.
How can we get the overall memory used by a Spark job? I am not able to find the exact parameter to refer to in order to retrieve this. I have looked at the Spark UI but am not sure which field to use. In Ganglia we have the following options:
a) Memory Buffer
b) Cache Memory
c) Free Memory
d) Shared Memory
e) Free Swap Space
I am not able to find any option related to Memory Used. Does anyone have any ideas on this?
If you persist your RDDs you can see how big they are in memory via the UI.
It's hard to get an idea of how much memory is being used for intermediate tasks (e.g. for shuffles). Basically Spark will use as much memory as it needs given what's available. This means that if your RDDs take up more than 50% of your available resources, your application might slow down because there are fewer resources available for execution.
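If you want a rough programmatic view rather than the UI, one option (a sketch assuming a SparkContext named sc; this reflects storage memory only, not total usage) is:
// Per executor: maximum memory available for caching and how much of it is still free.
sc.getExecutorMemoryStatus.foreach { case (executor, (maxBytes, remainingBytes)) =>
  val usedMb = (maxBytes - remainingBytes) / 1024 / 1024
  println(s"$executor: storage used = $usedMb MB of ${maxBytes / 1024 / 1024} MB")
}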