In my Spark application, I read a few Hive tables into Spark RDDs and then perform transformations on those RDDs. To avoid recomputation I cached those RDDs using the rdd.cache() or rdd.persist() methods, and also tried rdd.checkpoint().
Based on the Spark documentation and online references, I was under the impression that checkpointing is costlier than caching: caching keeps the RDD lineage, while checkpointing breaks the lineage but has to write to and read from HDFS.
The strange thing I observed in my case is that the checkpointing stage is faster (nearly 2 times) than caching/persisting (memory only). I ran it multiple times and the results were similar.
I am not able to understand why this is happening. Any help would be appreciated.
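Roughly, my setup looks like the sketch below (the table name, the transformation and the checkpoint directory are placeholders):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# checkpoint() needs a reliable directory, HDFS in my case
sc.setCheckpointDir("hdfs:///tmp/spark_checkpoints")

# read a Hive table and drop down to the underlying RDD
rdd = spark.sql("SELECT * FROM some_db.some_table").rdd \
           .map(lambda row: (row[0], row))   # placeholder transformation

# in one run: cache/persist (memory only), lineage is kept
rdd.persist(StorageLevel.MEMORY_ONLY)        # same as rdd.cache()

# in another run: checkpoint instead, lineage is truncated after the write to HDFS
# rdd.checkpoint()

rdd.count()   # the first action materializes the cache (or the checkpoint)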
I have run similar benchmarks lately and I am seeing the same thing: checkpoint() is faster despite the extra I/O. My explanation is that keeping the whole lineage around is itself a costly operation.
I ran benchmarks on 1, 10, 100, 1000, 10000, 100000, 1000000, 2000000, 10 million and more records, and checkpoint was always faster. The lineage was pretty simple (record filtering followed by a couple of aggregations). Storage was local on NVMe drives (not block storage over the network). I guess it really depends on a lot of criteria.
I keep hearing that Spark has an advantage over Hadoop due to Spark's in-memory computation. However, one obvious problem is that not all the data can fit into one computer's memory. So is Spark then limited to smaller datasets? At the same time, there is the notion of a Spark cluster. So I am not following the purported advantages of Spark over Hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay for batch jobs on Hadoop for a long time. However, two very promising technologies have emerged: Apache Drill, a low-latency SQL engine for self-service data exploration, and Apache Spark, a general-purpose compute engine that lets you run batch, interactive and streaming jobs on the cluster using the same unified framework. Let's dig a little deeper into Spark.
To understand Spark, you really have to understand three big concepts.
The first is the RDD, the resilient distributed dataset. This is a representation of the data coming into your system in an object format that allows you to run computations on top of it. RDDs are resilient because they carry their lineage: whenever there is a failure in the system, they can recompute themselves from their upstream data using that lineage.
The second concept is transformations. Transformations are what you apply to RDDs to get other RDDs. Examples of transformations are reading a file into an RDD, or applying functions like filter that then produce further RDDs.
The third and final concept is actions. These are the operations where you actually ask the system for an answer, for instance count, or asking for the first line that has "Spark" in it. The interesting thing is that Spark uses lazy evaluation, which means RDDs are not loaded and computed the moment the system encounters them; the work is only done when an action is actually performed.
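As a rough illustration of these three concepts (the file path here is made up), nothing in the snippet below is read until one of the actions at the end runs:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-concepts")

# transformation: create an RDD from a file (nothing is loaded yet)
lines = sc.textFile("hdfs:///data/logs.txt")

# transformation: keep only the lines that mention Spark (still lazy)
spark_lines = lines.filter(lambda line: "Spark" in line)

# actions: only now does Spark actually read the file and run the filter
print(spark_lines.count())    # how many lines mention Spark
print(spark_lines.first())    # the first line that has Spark in it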
One question that comes up with RDDs, given that they are resilient and live in main memory, is how they compare with the distributed shared memory architectures most of us are familiar with from the past. There are a few differences; let's go through them briefly. First of all, writes in RDDs are coarse-grained and are core to Spark: they happen at the level of a whole RDD, through transformations. Writes in distributed shared memory are typically fine-grained. Reads in distributed shared memory are fine-grained as well, whereas reads from RDDs can be fine- or coarse-grained.
The second piece is recovery. What happens if there is a fault in the system; how do we recover? Since RDDs build a lineage graph, if something goes bad they can go back, recompute based on that graph and regenerate the RDD. Lineage is relied on heavily in RDDs for recovery. In distributed shared memory we typically fall back to checkpointing done at intervals, or some other application-level checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath them is assumed to be immutable. If the data were changing, consistency would be a problem here. Distributed shared memory makes no assumptions about mutability and therefore leaves the consistency semantics to the application.
Finally, let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark optimizes computations as well as their placement using the directed acyclic graph (DAG).
Very easy programming paradigms using transformations and actions on RDDs, as well as rich library support for machine learning, graph processing and, more recently, DataFrames.
At this point a question comes up: if Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for writing your big data applications. However, it still needs to run on top of a storage system or a NoSQL system.
Spark is not limited to smaller datasets, and it is not only about in-memory computation. Spark also has a good set of higher-level APIs. Spark can process data in the hundreds of GB as well. In my real-world experience I have used Spark for a streaming application where we usually receive data at a rate of GBs per hour, and we have used Spark in telecommunications to handle bigger datasets as well. Check RDD Persistence for how to accommodate bigger datasets.
Real-world problems usually can't be solved by a single MapReduce program with one Mapper class and one Reducer class; we mostly need to build a pipeline. A pipeline consists of multiple stages, each a MapReduce program, and the output of one stage is fed one or more times into subsequent stages. This is a pain because of the amount of I/O it involves.
In MapReduce there are Map and Reduce tasks, after which there is a synchronization barrier, and the data has to be preserved to disk. This feature of the MapReduce framework was developed so that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the fullest. This becomes worse when you have an iterative algorithm in your pipeline: every iteration causes a significant amount of disk I/O.
To solve this problem, Spark introduced a new data structure called the RDD: a structure that holds the information about how the data can be read from disk and what to compute. Spark also provides an easy programming paradigm for creating pipelines (DAGs) by transforming RDDs. What you get is a series of RDDs that know how to get their data and what to compute.
Finally, when an action is invoked, the Spark framework internally optimizes the pipeline, groups together the portions that can be executed together (the map phases), and creates a final optimized execution plan from the logical pipeline, then executes it. It also gives the user the flexibility to select which data should be cached. Hence Spark is able to achieve roughly 10 to 100 times faster batch processing than MapReduce.
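As a rough sketch of such a pipeline (the input path and fields are made up): nothing runs until the first action, at which point Spark plans and executes the whole DAG, and the second job reuses the cached RDD instead of re-reading the input:

from pyspark import SparkContext

sc = SparkContext(appName="pipeline-dag")

events = sc.textFile("hdfs:///data/events.csv")               # stage input (placeholder path)
parsed = events.map(lambda line: line.split(","))             # transformation: parse
errors = parsed.filter(lambda fields: fields[2] == "ERROR")   # transformation: filter

errors.cache()   # the user decides what is worth keeping in memory

# action 1: triggers execution of the whole pipeline and fills the cache
print(errors.count())

# action 2: a second job that reuses the cached RDD instead of re-reading the file
per_host = errors.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)
print(per_host.take(5))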
Spark advantages over Hadoop:
Since Spark tasks across stages can be executed on the same executor nodes, the time to spawn executors is saved across multiple tasks.
Even if you have huge memory available, MapReduce can never take advantage of caching data in memory and reusing that in-memory data for subsequent steps.
Spark, on the other hand, can cache data if a large JVM heap is available to it; the in-memory data is reused across stages.
In Spark, tasks run as threads within the same executor, making the per-task memory footprint light.
In MapReduce, the Map and Reduce tasks are processes, not threads.
Spark uses an efficient serialization format to store data on disk.
Follow this for a detailed understanding: http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/
I am conducting a performance test that compares queries on existing internal Hive tables between Spark SQL and Hive on Tez. Throughout the tests, Spark showed query execution times that were on par with or faster than Hive on Tez. These results are consistent with many of the examples out there. However, there was one noted exception: a query that involved key-based selection at the individual record level. In this instance, Spark was significantly slower than Hive on Tez.
After researching this topic on the internet, I could not find a satisfactory answer and wanted to pose this example to the SO community to see if this is an individual one-off case associated with our environment or data, or a larger pattern related to Spark.
Spark 1.6.1
Spark Conf: Executors 2, Executor Memory 32G, Executor Cores 4.
Data is in an internal Hive Table which is stored as ORC file types compressed with zlib. The total size of the compressed files is ~2.2 GB.
Here is the query code.
#Python API
#orc with zlib key based select
dforczslt = sqlContext.sql("SELECT * FROM dev.perf_test_orc_zlib WHERE test_id= 12345678987654321")
dforczslt.show()
The total time to complete this query was over 400 seconds, compared to ~6 seconds with Hive on Tez. I also tried enabling predicate pushdown via the SQL context configs, but this resulted in no noticeable performance increase. Also, when the same test was conducted using Parquet, the query time was on par with Hive as well. I'm sure there are other solutions out there to increase the performance of the queries, such as using RDDs vs. DataFrames, etc., but I'm really looking to understand how Spark is interacting with ORC files in a way that results in this gap.
Let me know if I can provide additional clarification around any of the talking points listed above.
The following steps might help to improve the performance of the Spark SQL query.
In general, Hive takes the memory of the whole Hadoop cluster, which is significantly larger than the executor memory here (2 * 32 GB = 64 GB). What is the memory size of the nodes?
Further, the number of executors seems low (2) compared to the number of map/reduce tasks generated by the Hive query. Increasing the number of executors in multiples of 2 might help to improve the performance.
In Spark SQL and DataFrames, optimized execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. This feature can be enabled by setting spark.sql.tungsten.enabled to true in case it is not already enabled.
sqlContext.setConf("spark.sql.tungsten.enabled", "true")
The columnar nature of the ORC format helps to avoid reading unnecessary columns. However, we are still reading unnecessary rows even when the query has a WHERE clause filter. ORC predicate push-down would improve performance with its built-in indexes, but it is disabled in Spark SQL by default and needs to be explicitly enabled.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
I would recommend doing some more research to find the potential performance blockers, if any.
I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines with 122 GB of memory and 16 CPUs each, and I am allocating all resources to each of the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage completes locally on some nodes faster than on others. Based on that observation, here is what I would try:
Cache the RDD that you process, and do not forget to unpersist it when you don't need it anymore (see the sketch after this list).
Understanding caching, persisting in Spark.
Check whether the partitions are balanced, which doesn't seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail of distributed computing, isn't it? :)
How to balance my data across the partitions?
Reduce the communication costs, i.e. use fewer workers than you currently use, and see what happens. Of course that heavily depends on your application. Sometimes the communication costs become so big that they dominate, so using fewer machines can speed up the job. However, I would do that only if steps 1 and 2 do not suffice.
Without any more info, it would seem that at some point in the computation your data gets spilled to disk because there is no more space in memory.
It's just a guess; you should check your Spark UI.
Everyone says that Spark uses memory and because of that it's much faster than Hadoop.
I didn't understand from the Spark documentation what the real difference is.
Where does Spark store the data in memory, while Hadoop doesn't?
What happens if the data is too big for the memory? How similar would it be to Hadoop in that case?
Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk: Spark keeps intermediate output in main memory, while Hadoop stores intermediate results in secondary storage. MapReduce inserts barriers, and it takes a long time to write things to disk and read them back, so MapReduce can be slow and laborious. Eliminating this restriction makes Spark orders of magnitude faster. For things like SQL engines such as Hive, a chain of MapReduce operations is usually needed, and this requires a lot of I/O activity: onto disk, off of disk, onto disk, off of disk. When similar operations are run on Spark, it can keep things in memory without that I/O, so you can keep operating on the same data quickly. This results in dramatic improvements in performance, and it means Spark definitely moves us into at least the interactive category. For the record, there are some benefits to MapReduce recording everything to disk: it allows the possibility of restarting after a failure. If you're running a multi-hour job, you don't want to begin again from scratch. For applications on Spark that run in seconds or minutes, a restart is obviously less of an issue.
It's also easier to develop for Spark. Spark is much more powerful and expressive in terms of how you give it instructions to crunch data. Spark has a map and a reduce function like MapReduce, but it adds others like filter, join and group-by.
Spark also adds libraries for doing things like machine learning, streaming, graph programming and SQL.
In Hadoop MapReduce the input data is on disk, you perform a map and a reduce and put the result back on disk. Apache Spark allows more complex pipelines. Maybe you need to map twice but don't need to reduce. Maybe you need to reduce then map then reduce again. The Spark API makes it very intuitive to set up very complex pipelines with dozens of steps.
You could implement the same complex pipeline with MapReduce too. But then between each stage you write to disk and read it back. Spark avoids this overhead when possible. Keeping data in-memory is one way. But very often even that is not necessary. One stage can just pass the computed data to the next stage without ever storing the whole data anywhere.
This is not an option with MapReduce, because one MapReduce job does not know about the next. It has to complete fully before the next one can start. That is why Spark can be more efficient for complex computations.
The API, especially in Scala, is very clean too. A classical MapReduce job is often a single line. It's very empowering to use.
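For instance, the classical word count is only a handful of lines in PySpark, and roughly one expression in Scala (the paths here are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///data/books")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/wordcount-output")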
Test case: word counting over 6 GB of data in 20+ seconds with Spark.
I understand the MapReduce, FP and stream programming models, but I couldn't figure out why the word counting is so amazingly fast.
I think it's an I/O-intensive computation in this case, and it seems impossible to scan 6 GB of files in 20+ seconds. I guessed that some index is built before the word counting, as Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don't understand well enough.
I would appreciate it if anyone could explain RDDs for the word counting case. Thanks!
First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs, which is not fast. Spark job startup (on an existing Spark cluster) causes an existing JVM to fork new task threads, which is many times faster than starting a JVM.
Next, there is no indexing and no magic. A 6 GB file is stored in 47 blocks of 128 MB each. Imagine you have a big enough Hadoop cluster that all 47 of these HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/sec scan rate, which means you could read this data in about 2 seconds. With a 10GbE network in your cluster you could transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data on disk a number of times. It puts the map output on disk at least once (and more if the map output is big and on-disk merges happen). It puts the data on disk again on the reduce side before the reduce itself is executed. Spark puts data on the HDDs only once, during the shuffle phase, and the reference Spark configuration recommends increasing the filesystem write cache so that this 'shuffle' data does not hit the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. By default, Spark combines results on each node before sending them to other nodes.
In the case of a word count job, Spark calculates the count for each word on a node and then sends the results to the other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class: job.setCombinerClass(CustomCombiner.class)
By using combineByKey() in Spark, you can specify a custom combiner.
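Here is a hedged sketch of what that looks like for word count (the input path is a placeholder); reduceByKey already does map-side combining by default, while combineByKey lets you spell out the combiner yourself:

from pyspark import SparkContext

sc = SparkContext(appName="combiner-demo")

pairs = (sc.textFile("hdfs:///data/sample.txt")
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1)))

counts = pairs.combineByKey(
    lambda v: v,              # createCombiner: first value seen for a key in a partition
    lambda acc, v: acc + v,   # mergeValue: combine further values within the partition (map-side combine)
    lambda a, b: a + b)       # mergeCombiners: merge partial counts across partitions after the shuffle

print(counts.take(5))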
Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after each map or reduce action. But Spark needs a lot of memory.
Spark loads the data into memory and keeps it there until further notice, for the sake of caching.
The Resilient Distributed Dataset (RDD) allows you to transparently store data in memory and persist it to disk if needed.
Since Spark works in memory, there is no synchronization barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (pandas), but designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you're dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.
Spark lets you develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
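As a small sketch of how that in-memory persistence and reuse looks in practice (the dataset here is just a stand-in), a DataFrame can be persisted so that partitions stay in memory and spill to local disk when they do not fit, and later jobs reuse it:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 10_000_000).toDF("id")      # stand-in for a real dataset

# keep partitions in memory, spill the rest to local disk if they do not fit
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()                                      # materializes the persisted data
df.filter("id % 2 = 0").count()                 # a second job reuses it instead of recomputing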
Hope this helps.