MapReduce or Spark for batch processing on Hadoop?

I know that MapReduce is a great framework for batch processing on Hadoop. But Spark can also be used as a batch framework on Hadoop, offering scalability, fault tolerance and high performance compared to MapReduce. Cloudera, Hortonworks and MapR have started supporting Spark on Hadoop with YARN as well.
But a lot of companies are still using the MapReduce framework on Hadoop for batch processing instead of Spark.
So I am trying to understand what the current challenges are with using Spark as a batch-processing framework on Hadoop.
Any thoughts?

Spark is an order of magnitude faster than MapReduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVM.
Spark 1.1 introduced a new sort-based shuffle (instead of the hash-based shuffle), a new Netty-based network module (instead of sending shuffle data through the block manager) and a new external shuffle service. With these changes Spark performed the fastest petabyte sort (on 190 nodes with 46TB RAM) and a terabyte sort that broke Hadoop's old record.
Spark can easily handle datasets that are an order of magnitude larger than the cluster's aggregate memory. So my thought is that Spark is heading in the right direction and will eventually get even better.
For reference, this blog post explains how Databricks performed the petabyte sort.

I'm assuming when you say Hadoop you mean HDFS.
There are a number of benefits of using Spark over Hadoop MR.
Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (that need to make several passes over the same dataset) it can be a few orders of magnitude faster, because MapReduce writes the output of each stage to HDFS.
1.1. Spark can cache (depending on the available memory) these intermediate results and therefore reduce latency due to disk IO (a minimal caching sketch follows the word-count example below).
1.2. Spark operations are lazy. This means Spark can perform certain optimizations before it starts processing the data, because it can reorder operations that have not executed yet.
1.3. Spark keeps a lineage of operations and, in case of failure, recreates the failed partial state from this lineage.
Unified Ecosystem: Spark provides a unified programming model for various types of analysis - batch (spark-core), interactive (REPL), streaming (spark-streaming), machine learning (mllib), graph processing (graphx), SQL queries (SparkSQL)
Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.
Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive and ElasticSearch, to name a few. I'm sure support for many other popular data stores is coming soon. This essentially means that if you want to adopt Spark, you don't have to move your data around.
For example, here is what the code for word count looks like in Spark (Scala).
val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
I'm sure you have to write a few more lines if you are using standard Hadoop MR.
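To make point 1.1 concrete, here is a minimal caching sketch. The input path and the iterative logic are made up for illustration, and it assumes the spark-shell, where sc is already defined: the parsed dataset is cached once and reused across iterations instead of being re-read from HDFS on every pass, the way chained MapReduce jobs would.
// Parse once and cache; path and logic are illustrative.
val values = sc.textFile("hdfs:///data/measurements.txt").map(_.toDouble).cache()
var threshold = 0.0
for (_ <- 1 to 10) {
  // each pass reads the cached partitions, not the raw files on HDFS
  threshold = values.filter(_ > threshold).mean()
}
println(threshold)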
Here are some common misconceptions about Spark.
Spark is just an in-memory cluster computing framework. However, this is not true. Spark excels when your data can fit in memory, because memory access latency is lower. But you can make it work even when your dataset doesn't completely fit in memory.
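As a rough sketch of that last point (the path is made up, and this again assumes a spark-shell where sc exists), persisting with MEMORY_AND_DISK keeps the partitions that fit in memory and spills the rest to local disk instead of failing:
import org.apache.spark.storage.StorageLevel
// Keep what fits in memory, spill the remaining partitions to disk. Path is illustrative.
val records = sc.textFile("hdfs:///data/very_large_dataset/*").persist(StorageLevel.MEMORY_AND_DISK)
println(records.count())                              // first action materializes the persisted RDD
println(records.filter(_.contains("ERROR")).count())  // later actions reuse it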
You need to learn Scala to use Spark. Spark is written in Scala and runs on the JVM, but it provides support for most of the common APIs in Java and Python as well, so you can easily get started with Spark without knowing Scala.
Spark does not scale. Spark is for small datasets (GBs) only and doesn't scale to a large number of machines or TBs of data. This is also not true; it has been used successfully to sort petabytes of data.
Finally, if you do not have a legacy codebase in Hadoop MR, it makes perfect sense to adopt Spark; the simple reason is that all major Hadoop vendors are moving towards Spark, and for good reason.

Apache Spark runs in memory, making it much faster than MapReduce.
Spark started as a research project at Berkeley.
MapReduce uses disk extensively (for external sorts, shuffles, ...).
Since the input size for a Hadoop job is typically on the order of terabytes, Spark's memory requirements will be higher than those of traditional Hadoop.
So basically, for smaller jobs and with huge memory in your cluster, Spark wins. And this is not practically the case for most clusters.
Refer to spark.apache.org for more details on spark


Is it possible to convert from HBase to a Spark RDD efficiently?

I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering whether it's possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase regions are located on the same machines, Spark will create partitions according to regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
Operations in Spark vs operations in HBase
Rule of thumb: use HBase scans only, and do everything else with Spark.
To avoid shuffling in your Spark operations, you can consider working on your partitions. For example, you can join two Spark RDDs from HBase scans on their rowkey or rowkey prefix without any shuffling.
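As a starting point, here is a hedged sketch of pulling an HBase table into an RDD via the standard TableInputFormat. The table name and rowkey bounds are hypothetical, and it assumes a spark-shell with the HBase client jars on the classpath. Each HBase region becomes one Spark partition, which is what makes the colocation discussed above useful.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "items")          // hypothetical table name
hbaseConf.set(TableInputFormat.SCAN_ROW_START, "prefix_000")  // narrow the scan when you can
hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, "prefix_999")

// One Spark partition per HBase region; no shuffle is involved in the scan itself.
val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
val byRowkey = rows.map { case (key, result) => (Bytes.toString(key.get()), result) }
println(byRowkey.count())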
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting : http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
And the link below also has some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting to change the scanner cache config, but this holds only with HBase < 1.x
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges if you attempted to perform remote scans from Spark/MapReduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and performs incremental deltas with the memstore during scans. With this, Spark could perform fast enough for most OLAP applications. We also separated the resource management so the OLAP resources were allocated via YARN (on premise) or Mesos (cloud) and would not disturb normal OLTP apps.
I wish you luck on your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.

Spark performance advantage vs. Hadoop MapReduce [duplicate]

I am hearing that Spark has an advantage over Hadoop due to Spark's in-memory computation. However, one of the obvious problems is that not all the data can fit into one computer's memory. So is Spark then limited to smaller datasets? At the same time, there is the notion of a Spark cluster. So I am not following the purported advantages of Spark over Hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged: Apache Drill, which is a low-latency SQL engine for self-service data exploration, and Apache Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified framework. Let's dig a little bit more into Spark.
To understand Spark, you really have to understand three big concepts.
First is RDDs, Resilient Distributed Datasets. This is really a representation of the data coming into your system, in an object format that allows you to do computations on top of it. RDDs are resilient because they carry their lineage: whenever there's a failure in the system, they can recompute themselves using that prior information.
The second concept is transformations. Transformations are what you do to RDDs to get other RDDs. Examples of transformations would be things like opening a file and creating an RDD, or applying functions like filter that then produce other RDDs.
The third and final concept is actions. These are operations where you're actually asking the system for an answer, for instance count, or asking for the first line that has Spark in it. The interesting thing with Spark is that it does lazy evaluation: RDDs are not loaded and pushed into the system as soon as the system encounters them; the work is only done when an action actually has to be performed.
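To make the transformation/action distinction concrete, here is a tiny sketch (hypothetical log path, spark-shell assumed) where nothing touches the data until the actions at the end:
// Transformations only build the lineage graph; nothing is read yet.
val lines = sc.textFile("hdfs:///logs/app.log")
val matches = lines.filter(_.contains("Spark"))
// Actions trigger the actual computation.
println(matches.count())   // scans the file now
println(matches.first())   // "what's the first line that has Spark in it"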
One question that comes up with RDDs, given that they are resilient and live in main memory, is how they compare with the distributed shared memory architectures most of us are familiar with from the past. There are a few differences; let's go through them briefly. First of all, writes in RDDs are coarse-grained, happening at the level of a whole RDD, which is core to Spark's design, whereas writes in distributed shared memory are typically fine-grained. Reads in distributed shared memory are fine-grained as well, while reads from RDDs can be fine- or coarse-grained.
The second piece is recovery. What happens if there is a fault in the system; how do we recover? Since RDDs build a lineage graph, if something goes bad they can go back, recompute based on that graph and regenerate the RDD; lineage is used very heavily in RDDs for recovery. In distributed shared memory we typically fall back to checkpointing done at intervals, or some other semantic checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath is assumed to be immutable. If, however, the data were changing, consistency would be a problem here. Distributed shared memory doesn't make any assumptions about mutability and, therefore, leaves the consistency semantics to the application to take care of.
At last let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark is optimized for making computations as well as placing them optimally, using the directed acyclic graph (DAG).
Very easy programming paradigms using transformations and actions on RDDs, as well as rich library support for machine learning, graph processing and, more recently, DataFrames.
At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for you to write your big data applications; it still needs to run on a storage system or on a NoSQL system.
Spark is by no means limited to smaller datasets, and it is not only about in-memory computation. Spark also has a good set of higher-level APIs. Spark can process data in the GB range as well; in my real-world experience I have used Spark for a streaming application where we typically receive data at a rate of GBs per hour, and we have used Spark in telecommunications to handle bigger datasets too. Check RDD Persistence for how to accommodate bigger datasets.
Real-world problems usually can't be solved by a single MapReduce program with one mapper class and one reducer class; we mostly need to build a pipeline. A pipeline consists of multiple stages, each of which is a MapReduce program, and the output of one stage is fed one or more times into the subsequent stages. This is a pain because of the amount of disk IO it involves.
In MapReduce there are the map and reduce tasks, after which there is a synchronization barrier and the data has to be preserved to disk. This feature of the MapReduce framework was developed so that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum. And this becomes worse when you have an iterative algorithm in your pipeline: every iteration causes a significant amount of disk IO.
So, in order to solve the problem, Spark introduced a new data structure called the RDD, a structure that holds information such as how the data can be read from disk and what to compute. Spark also provides an easy programming paradigm to create a pipeline (DAG) by transforming RDDs, and what you get is a series of RDDs that know how to get the data and what to compute.
Finally, when an action is invoked, the Spark framework internally optimizes the pipeline, groups together the portions that can be executed together (the map phases) and creates a final optimized execution plan from the logical pipeline, then executes it. It also gives the user the flexibility to select which data should be cached. Hence Spark achieves batch processing roughly 10 to 100 times faster than MapReduce.
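A small sketch of what such a pipeline looks like in practice (the paths, field layout and the two aggregations are made up, spark-shell assumed): one parsed, cached intermediate feeds two downstream stages, with no intermediate HDFS writes between them.
// Stage 1: parse and clean once; cache because two downstream stages reuse it.
val events = sc.textFile("hdfs:///data/events.csv")
  .map(_.split(","))
  .filter(_.length == 3)                          // keep well-formed records: user,action,amount
  .map(f => (f(0), (f(1), f(2).toDouble)))
  .cache()
// Stage 2: total amount per user.  Stage 3: action count per user.
val totals = events.mapValues(_._2).reduceByKey(_ + _)
val counts = events.mapValues(_ => 1L).reduceByKey(_ + _)
// Only these actions make Spark build and run the optimized physical plan.
totals.saveAsTextFile("hdfs:///out/totals")
counts.saveAsTextFile("hdfs:///out/counts")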
Spark's advantages over Hadoop:
Since Spark tasks across stages can be executed on the same executor nodes, the time to spawn executors is saved across multiple tasks.
Even if you have huge memory, MapReduce can never take advantage of caching data in memory and using that in-memory data for subsequent steps.
Spark, on the other hand, can cache data if a large JVM is available to it; across stages the in-memory data is reused.
In Spark, tasks run as threads on the same executor, keeping the task memory footprint light.
In MapReduce, the map and reduce tasks are processes, not threads.
Spark uses an efficient serialization format to store data on disk.
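For that last point, a hedged configuration sketch: the path is illustrative, and the Kryo setting and storage level are standard Spark options rather than anything specific to this answer.
// Kryo is enabled via configuration, e.g.
//   spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
import org.apache.spark.storage.StorageLevel
// Cache in serialized form; serialized blocks are compact in memory and cheap to spill to disk.
val data = sc.textFile("hdfs:///data/large.txt").persist(StorageLevel.MEMORY_AND_DISK_SER)
println(data.count())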
Follow this for detail understanding http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

Why is Spark fast at word count? [duplicate]

Test case: word counting in 6G data in 20+ seconds by Spark.
I understand the MapReduce, FP and stream programming models, but couldn't figure out why the word counting is so amazingly fast.
I think it's an I/O-intensive computation in this case, and it seems impossible to scan 6G of files in 20+ seconds. I guess an index is built before the word counting, like Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don't understand well enough.
I'd appreciate it if anyone could explain RDDs for the word-counting case. Thanks!
First is startup time. A Hadoop MapReduce job startup requires starting a number of separate JVMs, which is not fast. A Spark job startup (on an existing Spark cluster) causes an existing JVM to fork new task threads, which is many times faster than starting a new JVM.
Next, there is no indexing and no magic. The 6GB file is stored in 47 blocks of 128MB each. Imagine you have a big enough Hadoop cluster that all 47 HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/sec scan rate, which means you can read this data in about 2 seconds. With a 10GbE network in your cluster, you can transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data on disk a number of times. It puts the map output to disk at least once (and more if the map output is big and on-disk merges happen). It puts the data to disk again on the reduce side before the reduce itself is executed. Spark puts the data on HDDs only once, during the shuffle phase, and the reference Spark setup recommends increasing the filesystem write cache so that this 'shuffle' data does not hit the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. Spark, by default, combines results on each node before sending them to other nodes.
In the case of a word count job, Spark calculates the count for each word on a node and then sends the results to the other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class: job.setCombinerClass(CustomCombiner.class)
By using combineByKey() in Spark, you can specify a custom combiner.
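As a small illustration (hypothetical input path, spark-shell assumed), here is word count with the built-in map-side combining of reduceByKey, and the same thing written with combineByKey so the combiner steps are explicit:
val words = sc.textFile("hdfs:///data/corpus.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// reduceByKey already combines on the map side before the shuffle.
val counts1 = words.reduceByKey(_ + _)

// combineByKey spells out the three combiner steps explicitly.
val counts2 = words.combineByKey(
  (v: Int) => v,                   // createCombiner: first value seen for a key in a partition
  (acc: Int, v: Int) => acc + v,   // mergeValue: fold further values in, still per partition
  (a: Int, b: Int) => a + b)       // mergeCombiners: merge partial counts across partitions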
Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after a map or reduce action. But Spark needs a lot of memory.
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
The Resilient Distributed Dataset (RDD) allows you to transparently store data in memory and persist it to disk if needed.
Since Spark works in memory, there's no synchronisation barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you're dealing with fragmented (partitioned) data, and that fragmentation is what enables Spark to execute in parallel.
Spark allows you to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
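A minimal DataFrame sketch, assuming Spark 1.4+ in the spark-shell and a made-up JSON dataset with user and amount columns, to show the named-column API the optimizer works with:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Each input line is a JSON record such as {"user": "alice", "amount": 12.5}
val df = sqlContext.read.json("hdfs:///data/purchases.json")
df.filter($"amount" > 10).groupBy("user").count().show()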
Hope this helps.

What is the Hadoop ecosystem and how does Apache Spark fit in?

I'm having a lot of trouble grasping what exactly a 'Hadoop ecosystem' is conceptually. I understand that you have some data processing tasks that you want to run, and so you use MapReduce to split the job up into smaller pieces, but I'm unsure about what people mean when they say 'Hadoop ecosystem'. I'm also unclear as to what the benefits of Apache Spark are and why this is seen as so revolutionary. If it's all in-memory calculation, wouldn't that just mean that you would need higher-RAM machines to run Spark jobs? How is Spark different from writing some parallelized Python code or something of that nature?
Your question is rather broad - the Hadoop ecosystem is a wide range of technologies that either support Hadoop MapReduce, make it easier to apply, or otherwise interact with it to get stuff done.
Examples:
The Hadoop Distributed Filesystem (HDFS) stores data to be processed by MapReduce jobs, in a scalable redundant distributed fashion.
Apache Pig provides a language, Pig Latin, for expressing data flows that are compiled down into MapReduce jobs
Apache Hive provides an SQL-like language for querying huge datasets stored in HDFS
There are many, many others - see for example https://hadoopecosystemtable.github.io/
Spark is not all in-memory; it can perform calculations in-memory if enough RAM is available, and can spill data over to disk when required.
It is particularly suitable for iterative algorithms, because data from the previous iteration can remain in memory. It provides a very different (and much more concise) programming interface, compared to plain Hadoop. It can provide some performance advantages even when the work is mostly done on disk rather than in-memory. It supports streaming as well as batch jobs. It can be used interactively, unlike Hadoop.
Spark is relatively easy to install and play with, compared to Hadoop, so I suggest you give it a try to understand it better - for experimentation it can run off a normal filesystem and does not require HDFS to be installed. See the documentation.
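For instance, a minimal local try-out might look like this (the file path is just an example; any plain text file works):
// Start a local shell first:  ./bin/spark-shell --master "local[*]"
// Then point sc at any plain local file; no HDFS involved.
val counts = sc.textFile("file:///tmp/sample.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)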

MySQL Cluster vs. Hadoop for handling big data

I want to know the advantages/disadvantages of using a MySQL Cluster and using the Hadoop framework.
Which is the better solution? I would like to read your opinions.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?
The advantages of Hadoop with Hive on top of it are:
also good scalability
you can also use commodity hardware
the ability to run in heterogenous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real-time data access. It may take minutes or hours to analyze the data.
So in my opinion, for handling big data, a MySQL cluster is the better solution. Why is Hadoop the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge differentiation between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data - you declare the data type of each column in a table, etc. Hadoop doesn't care about this at all.
Example - if you have a billion text log files, to make analysis even possible for MySQL you'd need to parse and load the data first into a MySQL table, typing each column along the way. With Hadoop and MapReduce, you define the function that is to scan/analyze/return the data from its raw source - you don't need pre-processing ETL to get it pre-structured.
If the data is already structured and in MySQL - then (hopefully) it's well structured - why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?
Hadoop is not a replacement for MySQL, so I think they each have their own scenarios.
Everyone knows Hadoop is better for batch jobs or offline compute, but there are also many related real-time products, such as HBase.
If you want to choose an offline compute & storage architecture, I suggest Hadoop rather than a MySQL cluster, because of:
Cost: obviously, a Hadoop cluster is cheaper than a MySQL cluster
Scalability: Hadoop supports more than ten thousand machines in a cluster
Ecosystem: MapReduce, Hive, Pig, Sqoop, etc.
So you can choose Hadoop for offline compute & storage and MySQL for online compute & storage; you can also learn more from the Lambda architecture.
The other answer is good, but doesn't really explain why Hadoop is more scalable for offline data crunching than MySQL Cluster. Hadoop is more efficient for large datasets that must be distributed across many machines because it gives you full control over the sharding of data.
MySQL Cluster uses auto-sharding, and it's designed to randomly distribute the data so no one machine gets hit with more of the load. On the other hand, Hadoop allows you to explicitly define the data partitioning, so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among the machines necessary to get the job done. This makes Hadoop better for processing massive datasets in many cases.
The answer to this question has a good explanation of this distinction.
