Performance optimization of DataFrame based application - performance

I'm writing an application, which produces several files storing them back to S3.
Most of the transformations operate on DataFrames. The current state of the application is already somewhat complex being translated into 60 jobs, some of them mapped to hundreds of stages. Some of the DataFrames are reused along the way and those are cached.
The problem is performance, which is clearly impacted by dependencies.
I have a few questions, any input on any of them will be highly appreciated.
(1) When I split the application into parts, execute them individually reading the inputs from generated files by the previous ones, the total execution time is a fraction (15%) of the execution time of the application run as a whole. This is counterintuitive as the whole application reuses DataFrames already in memory, caching guarantees that no DataFrame is computed more than once and various jobs are executing in parallel wherever possible.
I also noticed that the total number of stages in the latter case is much higher than the first one and I would think they should be comparable. Is there an explanation for this?
(2) If the approach with executing parts of the application individually is the way to go then how to enforce the dependencies between the parts to make sure the necessary inputs are ready.
(3) I read a few books, which devote some chapters to the execution model and performance analysis through the Spark Web UI. All of them use RDDs and I need DataFrames. Obviously even for DataFrame based applications Spark Web UI provides a lot of useful information but the analysis is much harder. Is there a good resource I could use to help me out?
(4) Is there a good example demonstrating how to minimize shuffling by appropriate partitioning of the DataFrame? My attempts so far have been ineffective.
Thanks.

Splitting application is not recommended, if you have more stages and having performance issues then try Checkpointing which saves an RDD to a reliable storage system (e.g. HDFS, S3) while forgetting the RDD’s lineage completely.
//set this property in program
SparkContext.setCheckpointDir(directory: String) method.
//checkpoint RDD
RDD.checkpoint()
If you are using DataFrames then manually checkpoint the data after logical points by introducing Parquet/ORC hops (Writing & Reading data from Parquet/ORC files)
//Write to ORC
dataframe.write.format("orc").save("/tmp/src-dir/tempFile.orc")
//where /tmp/src-dir/ is a HDFS directory
//Read ORC
val orcRead = sqlContext.read.format("orc").load("/tmp/src-dir/tempFile.orc")
Spiliting program is not recommended but still if you want to do it then create separate ".scala" programs and apply dependencies at Oozie level.
3.In Spark web UI refer SQL tab which will give you the execuiton plan. For detailed study on dataframes run
DF.explain() //which will show you the execution plan
DataFrames in Spark have their execution automatically optimized by a query optimizer. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution. Because the optimizer understands the semantics of operations and structure of the data, it can make intelligent decisions to speed up computation.
Refer Spark guide - http://spark.apache.org/docs/latest/sql-programming-guide.html
4.Sort the data before any operations such as join. To reduce shuffling use repartition function.
DF1 = DF.repartition(10)
Please post your code if you have any other specific doubt.

Related

Is it possible to save intermediate Spark DataFrames to disk without significantly affecting performance?

Let's suppose I have the following data pipeline:
Companies and Shuttles are my input datasets read from disk using Spark, the preprocess_* functions perform some heavy data cleaning operations, and create_master_table_node is another function that takes 3 Spark DataFrames as input.
Generally, my understanding is that it's most performant to not write any of the intermediate Spark DataFrames (i.e. Preprocessed Companies, Preprocessed Shuttles) to disk. https://spokeo-engineering.medium.com/whats-the-fastest-way-to-store-intermediate-results-in-spark-54f2080defb6 suggests the same.
However, there are also benefits to persisting intermediate data to disk, such as being able to resume failed workflows or analyze intermediate results post-hoc. These benefits are important to me, so I want the intermediate outputs persisted at some point in my workflow--be it just after they're computed or one the entirely pipeline finishes or errors out. What's the least performance-impacting way to save Preprocessed Companies, Preprocessed Shuttles, etc. to disk? Is there a way to do this without affecting the rest of the computational graph?

Spark performance advantage vs. Hadoop MapReduce [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
I am hearing that Spark has an advantage over hadoop due to spark's in-memory computation. However, one of the obvious problems is not all the data can fit into one computers memory. So is Spark then limited to smaller datasets. At the same time, there is the notion of spark cluster. So I am not following the purported advantages of spark over hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged, Apache Drill, which is a low-density SQL engine for self-service data exploration and Apache Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified frame. Let's dig a little bit more into Spark.
To understand Spark, you have to understand really three big concepts.
First is RDDs, the resilient distributed data sets. This is really a representation of the data that's coming into your system in an object format and allows you to do computations on top of it. RDDs are resilient because they have a long lineage. Whenever there's a failure in the system, they can recompute themselves using the prior information using lineage.
The second concept is transformations. Transformations is what you do to RDDs to get other resilient RDDs. Examples of transformations would be things like opening a file and creating an RDD or doing functions like printer that would then create other resilient RDDs.
The third and the final concept is actions. These are things which will do where you're actually asking for an answer that the system needs to provide you, for instance, count or asking a question about what's the first line that has Spark in it. The interesting thing with Spark is that it does lazy elevation which means that these RDDs are not loaded and pushed into the system as in when the system encounters an RDD but they're only done when there is actually an action to be performed.
One thing that comes up with RDDs is that when we come back to them being that they are resilient and in main memory is that how do they compare with distributed shared memory architectures and most of what are familiar from our past? There are a few differences. Let's go with them in a small, brief way. First of all, writes in RDDs are core of Spark. They are happening at an RDD level. Writes in distributor-shared memory are typically fine-grained. Reads and distributor-shared memory are fine-grained as well. Writes in RDD can be fine or course-grained.
The second piece is recovery. What happens if there is a part in the system, how do we recover it? Since RDDs build this lineage graph if something goes bad, they can go back and recompute based on that graph and regenerate the RDD. Lineage is used very strongly in RDDs to recovery. In distributor-shared memories we typically go back to check-pointing done at intervals or any other semantic check-pointing mechanism. Consistency is relatively trivial in RDDs because the data underneath it is assumed to be immutable. If, however, the data was changing, then consistency would be a problem here. Distributor-shared memory doesn't make any assumptions about mutability and, therefore, leaves the consistency semantics to the application to take care of.
At last let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark is optimized in making computations as well as placing the computations optimally using the directory cyclic graph.
Very easy programming paradigms using the transformation and actions on RDDs as well as a ready-rich library support for machine learning, graphics and recently data frames.
At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no because Spark provides an application framework for you to write your big data applications. However, it still needs to run on a storage system or on a no-SQL system.
Spark is never limited to smaller dataset and its not always about in-memorycomputation. Spark has very good number higher APIS . Spark can process the in GB as well. In my realtime experience i have used Spark to handle the streaming application where we usually gets the data in GB/Hour basic . And we have used Spark in Telecommunication to handle bigger dataset as well . Check this RDD Persistence how to accommodate bigger datasets.
In case of real world problem we can't solve them just by one MapReduce program which is having a Mapper class and a reducer class, We mostly need to build a pipeline. A pipeline will consists of multiple stages each having MapReduce program , and out put of one stage will be fed to one or multiple times to the subsequent stages. And this is a pain because of the amount of IO it involves.
In case of MapReduce there are these Map and Reduce tasks subsequent to which there is a synchronization barrier and one needs to preserve the data to the disc. This feature of MapReduce framework was developed with the intent that in case of failure the jobs can be recovered but the drawback to this is that, it does not leverage the memory of the Hadoop cluster to the maximum. And this becomes worse when you have a iterative algorithm in your pipeline. Every iteration will cause significant amount of Disk IO.
So in order to solve the problem , Spark introduced a new Data Structure called RDD . A DS that can hold the information like how the data can be read from the disk and what to compute. Spark also provided easy programming paradigm to create pipeline(DAG) by transforming RDDs . And what you get it a series of RDD which knows how to get the data and what to compute.
Finally when an Action is invoked Spark framework internally optimize the pipeline , group together the portion that can be executed together(map phases), and create a final optimized execution plan from the logical pipeline. And then executes it. It also provides user the flexibility to select the data user wanted to be cached. Hence spark is able to achieve near about 10 to 100 times faster batch processing than MapReduce.
Spark advantages over hadoop.
As spark tasks across stages can be executed on same executor nodes, the time to spawn the Executor is saved for multiple task.
Even if you have huge memory, MapReduce can never make any advantage of caching data in memory and using the in memory data for subsequent steps.
Spark on other hand can cache data if huge JVM is available to it. Across stages the inmemory data is used.
In Spark task run as threads on same executor, making the task memory footprint light.
In MapReduce the Map of reduce Task are processes and not threads.
Spark uses efficient serialization format to store data on disk.
Follow this for detail understanding http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

How to join big dataframes in Spark SQL? (best practices, stability, performance)

I'm getting the same error than Missing an output location for shuffle when joining big dataframes in Spark SQL. The recommendation there is to set MEMORY_AND_DISK and/or spark.shuffle.memoryFraction 0. However, spark.shuffle.memoryFraction is deprecated in Spark >= 1.6.0 and setting MEMORY_AND_DISK shouldn't help if I'm not caching any RDD or Dataframe, right? Also I'm getting lots of other WARN logs and task retries that lead me to think that the job is not stable.
Therefore, my question is:
What are best practices to join huge dataframes in Spark SQL >= 1.6.0?
More specific questions are:
How to tune number of executors and spark.sql.shuffle.partitions to achieve better stability/performance?
How to find the right balance between level of parallelism (num of executors/cores) and number of partitions? I've found that increasing the num of executors is not always the solution as it may generate I/O reading time out exceptions because of network traffic.
Is there any other relevant parameter to be tuned for this purpose?
My understanding is that joining data stored as ORC or Parquet offers better performance than text or Avro for join operations. Is there a significant difference between Parquet and ORC?
Is there an advantage of SQLContext vs HiveContext regarding stability/performance for join operations?
Is there a difference regarding performance/stability when the dataframes involved in the join are previously registerTempTable() or saveAsTable()?
So far I'm using this is answer and this chapter as a starting point. And there are a few more stackoverflow pages related to this subject. Yet I haven't found a comprehensive answer to this popular issue.
Thanks in advance.
That are a lot of questions. Allow me to answer these one by one:
Your number of executors is most of the time variable in a production environment. This depends on the available resources. The number of partitions is important when you are performing shuffles. Assuming that your data is now skewed, you can lower the load per task by increasing the number of partitions.
A task should ideally take a couple of minus. If the task takes too long, it is possible that your container gets pre-empted and the work is lost. If the task takes only a few milliseconds, the overhead of starting the task gets dominant.
The level of parallelism and tuning your executor sizes, I would like to refer to the excellent guide by Cloudera: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
ORC and Parquet only encode the data at rest. When doing the actual join, the data is in the in-memory format of Spark. Parquet is getting more popular since Netflix and Facebook adopted it and put a lot of effort in it. Parquet allows you to store the data more efficient and has some optimisations (predicate pushdown) that Spark uses.
You should use the SQLContext instead of the HiveContext, since the HiveContext is deprecated. The SQLContext is more general and doesn't only work with Hive.
When performing the registerTempTable, the data is stored within the SparkSession. This doesn't affect the execution of the join. What it stores is only the execution plan which gets invoked when an action is performed (for example saveAsTable). When performining a saveAsTable the data gets stored on the distributed file system.
Hope this helps. I would also suggest watching our talk at the Spark Summit about doing joins: https://www.youtube.com/watch?v=6zg7NTw-kTQ. This might provide you some insights.
Cheers, Fokko

MapReduce for same task/different data

We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e., schema, but only carries a share of the data (and not the full data!).The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.
By distributing the data to multiple databases, we want to lower the amount of work processed by each database, and ultimately speed up search. At the end, we combine the results of each database to create the final results.
A friend of mine has recommended looking at MapReduce (Hadoop). In my opinion, map-reduce only makes sense if the single workers share the same data but perform different type of work on it (corresponds to multiple instruction, single data).
In our case, however, the workers should perform the same task, but perform that task on various data (corresponds to single instruction, multiple data).
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Yes.
I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., "same tables"), but different segments of that data. The parallel Map and Reduce tasks are the same tasks over different portions of the data. MapReduce is most definitely "single instruction, multiple data" from your definition.
Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:
Note that MapReduce is only really going to do batch analytics for you. Things like rollups and counts and aggregates. You won't be able to retrieve or search with MapReduce effectively. Also, updating data in Hadoop is not a typical way you want to do things-- you treat things as more "append only". For any of that, you'll probably want to look at HBase.
Hadoop's file system segments the data for you. From a file system perspective, it'll look like files in folders that contain CSV (or some other file format). Files get split up into blocks, which can then be operated on separately with map tasks. You won't have to manually shard the data like you are now.
Take a look at Hive. It's a abstraction layer on top of MapReduce that interprets a light version of SQL into MapReduce under the covers. It should allow you to convert some of your logic a bit easier.

what are the disadvantages of mapreduce?

What are the disadvantages of mapreduce? There are lots of advantages of mapreduce. But I would like to know the disadvantages of mapreduce too.
I would rather ask when mapreduce is not a suitable choice? I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where mapreduce is not a suitable choice :
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other(jobs run in isolation).
When your processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well are you using it. For example, you can't expect a MR job to give you the result in a couple of ms. You can't count it as its disadvantage either. It's just that you are using it at the wrong place. And it holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some usecases where MapReduce does not work very well.
When you need a response fast. e.g. say < few seconds (Use stream
processing, CEP etc instead)
Processing graphs
Complex algorithms e.g. some machine learning algorithms like SVM, and also see 13 drawfs
(The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again. e.g. KMeans - use Spark
When map phase generate too many keys. Thensorting takes for ever.
Joining two large data sets with complex conditions (equal case can
be handled via hashing etc)
Stateful operations - e.g. evaluate a state machine Cascading tasks
one after the other - using Hive, Big might help, but lot of overhead
rereading and parsing data.
You need to rethink/ rewrite trivial operations like Joins, Filter to achieve in map/reduce/Key/value patterns
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied with Java, of course you have Pig and Hive for rescue but you lose flexibility.
First of all, it streams the map output, if it is possible to keep it in memory this will be more efficient. I originally deployed my algorithm using MPI but when I scaled up some nodes started swapping, that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a hadoop book (Hadoop in action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies if you have too many files your Namenode could experience problems.
If you do not want to use the streaming API you have to write your program in the java language. I for example did a translation from C++. This has some side effects, for example Java has a large string overhead compared to C. Since my software is all about strings this is some sort of drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Resources