Spark vs MapReduce: why is Spark faster than MR, and what is the principle?

As I understand it, Spark preloads the data from every node's disk (HDFS) into every node's RDDs to compute. But as I see it, MapReduce must also load the data from HDFS into memory and then compute on it in memory. So why is Spark faster?
Is it just because MapReduce loads the data into memory every time it wants to compute, while Spark preloads it? Thank you very much.

Spark uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when needed.
In MapReduce, on the other hand, data is shuffled and sorted between the map and reduce tasks (a synchronisation barrier) and written to disk.
In Spark there is no such synchronisation barrier slowing things down, and keeping data in memory makes the execution engine really fast.
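As an illustration, here is a minimal Scala sketch of that behaviour (the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("RddPersist").getOrCreate()
val sc = spark.sparkContext

val records = sc.textFile("hdfs:///data/events.log") // placeholder path

// MEMORY_AND_DISK keeps partitions in RAM and spills them to disk
// only when they do not fit, i.e. the "persist to disk when needed"
// behaviour described above.
records.persist(StorageLevel.MEMORY_AND_DISK)

// Subsequent actions reuse the cached partitions without re-reading
// HDFS and without a MapReduce-style disk barrier in between.
val total = records.count()
val errors = records.filter(_.contains("ERROR")).count()
```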

Hadoop MapReduce:
1. Batch processing.
2. High latency, because intermediate data is written to and read from disk (HDFS) between stages.

Spark:
1. Stream processing is supported as well as batch.
2. Low latency, thanks to in-memory RDDs.

Here is a fuller comparison of Hadoop MapReduce and Spark: http://commandstech.com/basic-difference-between-spark-and-map-reduce-with-examples/
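As a small illustration of the streaming side, a minimal DStream word-count sketch (host, port and batch interval are placeholder values):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

// Placeholder source: lines of text arriving on a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
counts.print() // emit each micro-batch's counts with low latency

ssc.start()
ssc.awaitTermination()
```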

Related

How is RAM used in MapReduce processing?

I need clarification on processing. Daemons like the NameNode, DataNode, JobTracker and TaskTracker all lie in a cluster (in a single-node cluster, are they distributed on the hard disk?).
What is the use of RAM or cache in MapReduce processing, and how is it accessed by the various processes in MapReduce?
The JobTracker and TaskTracker were used to manage resources in the cluster in MapReduce 1.x, and they were removed because that mechanism was not efficient. Since MapReduce 2.x a new mechanism called YARN is used instead. You can visit this link http://javacrunch.in/Yarn.jsp for an in-depth explanation of how YARN works. Hadoop daemons use RAM to optimize job execution; for example, in MapReduce, RAM is used to keep resource logs in memory when a new job is submitted, so that the ResourceManager can work out how to distribute the job across the cluster. One more important thing: Hadoop MapReduce performs disk-oriented jobs, i.e. it uses the disk while executing a job, and that is a major reason it is slower than Spark.
Hope this solves your query.
You mentioned a cluster in your question; we would not call a single server or machine a cluster.
Daemons (processes) are not distributed across hard disks; they utilize RAM to run.
Regarding the cache, look into this answer.
RAM is used during the processing of a MapReduce application.
Once the data is read through InputSplits (from HDFS blocks) into memory (RAM), the processing happens on the data stored in RAM.
mapreduce.map.memory.mb = The amount of memory to request from the scheduler for each map task.
mapreduce.reduce.memory.mb = The amount of memory to request from the scheduler for each reduce task.
The default value for both of the above parameters is 1024 MB (1 GB).
Several more memory-related parameters are used in the MapReduce phase; have a look at the documentation page for mapred-site.xml for more details.
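If you would rather set these per job in code than in mapred-site.xml, here is a minimal sketch using the standard Hadoop Configuration API (the values and job name are illustrative only):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
conf.set("mapreduce.map.memory.mb", "2048")    // RAM requested per map task
conf.set("mapreduce.reduce.memory.mb", "4096") // RAM requested per reduce task

val job = Job.getInstance(conf, "memory-tuned-job") // illustrative job name
```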
Related SE questions:
Mapreduce execution in a hadoop cluster

How do non-MapReduce applications work in YARN?

By using YARN, we can run non-MapReduce applications.
But how does that work?
In HDFS, everything is stored in blocks, and for each block one mapper task is created to process the whole dataset.
But how do non-MapReduce applications process datasets spread across different DataNodes without using MapReduce?
Please explain.
Do not confuse the MapReduce paradigm with other applications like, for instance, Spark. Spark can run under YARN but does not use mappers or reducers.
Instead it uses executors, and these executors are aware of data locality in the same way MapReduce is.
The Spark driver starts executors on data nodes and tries to keep data locality in mind when doing so.
Also, do not confuse MapReduce's default behaviour with required behaviour: you do not need one mapper per input split.
Also, HDFS and MapReduce are two different things: HDFS is just the storage layer, while MapReduce handles processing.
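For context, a minimal sketch of a Spark application that declares YARN as its resource manager; Spark then asks YARN for executor containers rather than mapper or reducer slots (the resource values are illustrative, and these settings are more commonly passed via spark-submit):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("SparkOnYarn")
  .setMaster("yarn")                     // request containers from YARN
  .set("spark.executor.memory", "4g")    // illustrative container size
  .set("spark.executor.instances", "10") // illustrative executor count

val spark = SparkSession.builder.config(conf).getOrCreate()

// Executors launched in YARN containers on the DataNodes read HDFS
// blocks locally where possible, preserving data locality.
val data = spark.sparkContext.textFile("hdfs:///data/input.txt") // placeholder
println(data.count())
```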

Why is Spark fast at word count? [duplicate]

This question already has answers here: Why is Spark faster than Hadoop Map Reduce (2 answers). Closed 5 years ago.
Test case: Spark runs a word count over 6 GB of data in 20+ seconds.
I understand the MapReduce, FP and stream programming models, but I couldn't figure out why this word count is so amazingly fast.
I think it is an I/O-intensive computation in this case, and it seems impossible to scan 6 GB of files in 20+ seconds. My guess was that an index is built before the word count, as Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don't understand well enough.
I would appreciate it if anyone could explain RDDs for the word-count case. Thanks!
First is startup time. Starting a Hadoop MapReduce job requires spinning up a number of separate JVMs, which is not fast. Starting a Spark job (on an existing Spark cluster) just causes existing JVMs to fork new task threads, which is many times faster than starting new JVMs.
Next, there is no indexing and no magic. A 6 GB file is stored in roughly 48 blocks of 128 MB each. Imagine you have a Hadoop cluster big enough that all of these HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/s scan rate, which means you can read this data in about 2 seconds. With a 10GbE network in your cluster, you can transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data on disk a number of times. It writes the map output to disk at least once (and more often if the map output is big and on-disk merges happen), and it writes the data to disk again on the reduce side before the reduce itself is executed. Spark writes data to HDDs only once, during the shuffle phase, and the reference Spark configuration recommends increasing the filesystem write cache so that this 'shuffle' data never hits the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs as far as this question is concerned.
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. By default, Spark combines results on each node before sending them to other nodes.
In the case of a word-count job, Spark computes the count for each word on a node and only then sends the results to other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class: job.setCombinerClass(CustomCombiner.class).
In Spark, you can specify a custom combiner with combineByKey().
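A minimal Scala sketch of both variants (the HDFS paths are placeholders; the combineByKey call is just the general form of the same aggregation):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("WordCount").getOrCreate()
val sc = spark.sparkContext

val lines = sc.textFile("hdfs:///data/input.txt") // placeholder path

// reduceByKey merges values on each node before the shuffle
// (map-side combine), so only one (word, partialCount) pair per
// word leaves each node, which is the behaviour described above.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// combineByKey is the general form; this call is equivalent:
val counts2 = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .combineByKey(
    (v: Int) => v,                 // createCombiner
    (acc: Int, v: Int) => acc + v, // mergeValue, runs per node
    (a: Int, b: Int) => a + b)     // mergeCombiners, runs across nodes

counts.saveAsTextFile("hdfs:///data/output") // placeholder path
```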
Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after each map or reduce action; the trade-off is that Spark needs a lot of memory.
Spark loads data into memory and keeps it there until further notice, for the sake of caching.
Spark's Resilient Distributed Dataset (RDD) allows you to transparently keep data in memory and persist it to disk only when needed.
Since Spark works in memory, there is no synchronisation barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in near real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (pandas), but was designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
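For illustration, a minimal DataFrame word count (the file path is a placeholder); the query runs through Spark's optimizer rather than hand-written RDD code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

val spark = SparkSession.builder.appName("DfWordCount").getOrCreate()

// read.textFile yields a single "value" column of lines.
val counts = spark.read.textFile("hdfs:///data/input.txt") // placeholder path
  .select(explode(split(col("value"), "\\s+")).as("word"))
  .groupBy("word")
  .count()

counts.show()
```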
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you are dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.
Spark lets you develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
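A small sketch of that in-memory sharing (the input path is a placeholder): the cached RDD is computed once and then feeds two separate jobs without re-reading HDFS.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SharedCache").getOrCreate()

val words = spark.sparkContext
  .textFile("hdfs:///data/input.txt") // placeholder path
  .flatMap(_.split("\\s+"))
  .cache() // keep partitions in memory once first computed

val totalWords = words.count()               // job 1: fills the cache
val distinctWords = words.distinct().count() // job 2: served from memory
```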
Hope this helps.

HBase and MapReduce interaction

I have a program that uses HBase and MapReduce.
I store data in HDFS; the size of this file is 100 GB. I then put this data into HBase.
Scanning this file with MapReduce takes 5 minutes, but scanning the HBase table takes 30 minutes.
How can I increase the speed when using HBase with MapReduce?
Thanks.
I am assuming you have a single-node HDFS. If you had your 100 GB file in a multi-node HDFS cluster, it would be much faster for both MapReduce and Hive.
You could try increasing the number of mappers and reducers in MapReduce to gain some performance (see the sketch below), and have a look at this post.
Hive is essentially a data-warehousing tool built on top of HDFS, and underneath every query is a MapReduce job itself, so the post above answers this problem as well.
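A minimal sketch of that tuning with the standard Hadoop Job API; the split size, reducer count, and job name are illustrative values, not recommendations:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
// A smaller maximum split size yields more input splits,
// and therefore more map tasks running in parallel.
conf.set("mapreduce.input.fileinputformat.split.maxsize",
         (64 * 1024 * 1024).toString)

val job = Job.getInstance(conf, "hbase-scan-job") // illustrative job name
job.setNumReduceTasks(8) // raise reducer parallelism explicitly
```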

How is sorting (ORDER BY) implemented in Hive?

We know that Hive doesn't do sampling before a sorting job starts. It just leverages the sorting mechanism of MapReduce and performs a merge sort on the reduce side, with only one reducer used. Since that single reducer collects all the data output by the mappers in this scenario, say the machine running the reducer has only a 100 GB disk: what happens if the data is too big to fit on that disk?
The parallel sorting mechanism of Hive is still under development; see here.
A well-designed data warehouse or database application will avoid such global sorting. If needed, try Pig or TeraSort (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html).
