What is the Hadoop ecosystem and how does Apache Spark fit in? - hadoop

I'm having a lot of trouble grasping what exactly a 'Hadoop ecosystem' is conceptually. I understand that you have some data processing tasks that you want to run and so you use MapReduce to split the job up into smaller pieces but I'm unsure about what people mean when they say 'Hadoop Ecosystem'. I'm also unclear as to what the benefits of Apache Spark are and why this is seen as so revolutionary? If it's all in-memory calculation, wouldn't that just mean that you would need higher RAM machines to run Spark jobs? How is Spark different than writing some parallelized Python code or something of that nature.

Your question is rather broad - the Hadoop ecosystem is a wide range of technologies that either support Hadoop MapReduce, make it easier to apply, or otherwise interact with it to get stuff done.
Examples:
The Hadoop Distributed Filesystem (HDFS) stores data to be processed by MapReduce jobs, in a scalable redundant distributed fashion.
Apache Pig provides a language, Pig Latin, for expressing data flows that are compiled down into MapReduce jobs
Apache Hive provides an SQL-like language for querying huge datasets stored in HDFS
There are many, many others - see for example https://hadoopecosystemtable.github.io/
Spark is not all in-memory; it can perform calculations in-memory if enough RAM is available, and can spill data over to disk when required.
It is particularly suitable for iterative algorithms, because data from the previous iteration can remain in memory. It provides a very different (and much more concise) programming interface, compared to plain Hadoop. It can provide some performance advantages even when the work is mostly done on disk rather than in-memory. It supports streaming as well as batch jobs. It can be used interactively, unlike Hadoop.
Spark is relatively easy to install and play with, compared to Hadoop, so I suggest you give it a try to understand it better - for experimentation it can run off a normal filesystem and does not require HDFS to be installed. See the documentation.

Related

Why does Hadoop choose MapReduce as its computing engine?

I know MapReduce(MR) is one of the three core frameworks of Hadoop and I am familiar with its mapper-shuffle-reducer progress.
My question can be separated into two parts:
1) What makes MR unique for Hadoop and why not other computing algorithms?
2) How does other language(e.g.: shell, python)'s computing part work? Is their computing procedure similar to MR?
"Divide and conquer" is a very powerful approach for handling datasets. MapReduce provides a way to read large amounts of data, but distributes the workload in an scalable fashion. Often, even unstructured data has a way to separate out individual "records" from a raw file, and Hadoop (or its compatible filesystems like S3) offer ways to store large quantities of data in raw formats rather than in a database. Hadoop is just a process managing a bunch of disks and files. YARN (or Mesos, or other scheduler) is what allows you to deploy and run individual "units of work" that go gather and process data.
MapReduce is not the only execution framework, though. Tez or Spark interact with the InputFormat and OutputFormat API of MapReduce to read or write data. Both offer optimizations over MapReduce where tasks can be defined using a computation graph called a DAG rather than saying "map this input, reduce that result, map the output of that reducer to another reducer..." . Even higher languages like Hive or Pig are easier to work with than pure MapReduce functions. All of these are extensions to Hadoop, though, and deserve to be standalone projects rather than part of Hadoop core.
I would argue most Hadoop people are not concerned about why Hadoop uses MapReduce, they just accept its API is how you process files on HDFS interfaces, even if that's abstracted behind other libraries like Spark.
Shell, Python, Ruby, Javascript, etc can all be ran via the Hadoop Streaming API as long as YARN nodes all have the required runtime dependencies. It does operate similarly to MapReduce. Data is processed by a mapper, keys are shuffled into a reducer, and processed again. Except I do not believe custom data formats are possible to be processed. For example, you might not be able to read Parquet data in Shell using Hadoop Streaming. Data must exist as new-line delimiter events, read via standard input, then written to standardized output.

How is Apache Spark different from the Hadoop approach?

Everyone is saying that Spark is using the memory and because of that it's much faster than Hadoop.
I didn't understand from the Spark documentation what the real difference is.
Where does Spark stores the data in memory while Hadoop doesn't?
What happens if the data is too big for the memory? How similar would it be to Hadoop in that case?
Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk. Mean intermediate output store in main memory where as hadoop store intermediate result in secondary memory. MapReduce inserts barriers, and it takes a long time to write things to disk and read them back. Hence MapReduce can be slow and laborious. The elimination of this restriction makes Spark orders of magnitude faster. For things like SQL engines such as Hive, a chain of MapReduce operations is usually needed, and this requires a lot of I/O activity. On to disk, off of disk—on to disk, off of disk. When similar operations are run on Spark, Spark can keep things in memory without I/O, so you can keep operating on the same data quickly. This results in dramatic improvements in performance, and that means Spark definitely moves us into at least the interactive category. For the record, there are some benefits to MapReduce doing all that recording to disk — as recording everything to disk allows for the possibility of restarting after failure. If you’re running a multi-hour job, you don’t want to begin again from scratch. For applications on Spark that run in the seconds or minutes, restart is obviously less of an issue.
It’s easier to develop for Spark. Spark is much more powerful and expressive in terms of how you give it instructions to crunch data. Spark has a Map and a Reduce function like MapReduce, but it adds others like Filter, Join and Group-by, so it’s easier to develop for Spark.
Spark also adds libraries for doing things like machine learning, streaming, graph programming and SQL
In Hadoop MapReduce the input data is on disk, you perform a map and a reduce and put the result back on disk. Apache Spark allows more complex pipelines. Maybe you need to map twice but don't need to reduce. Maybe you need to reduce then map then reduce again. The Spark API makes it very intuitive to set up very complex pipelines with dozens of steps.
You could implement the same complex pipeline with MapReduce too. But then between each stage you write to disk and read it back. Spark avoids this overhead when possible. Keeping data in-memory is one way. But very often even that is not necessary. One stage can just pass the computed data to the next stage without ever storing the whole data anywhere.
This is not an option with MapReduce, because one MapReduce does not know about the next. It has to complete fully before the next one can start. That is why Spark can be more efficient for complex computation.
The API, especially in Scala, is very clean too. A classical MapReduce is often a single line. It's very empowering to use.

MapReduce or Spark for Batch processing on Hadoop?

I know that MapReduce is a great framework for batch processing on Hadoop. But, Spark also can be used as batch framework on Hadoop that provides scalability, fault tolerance and high performance compared MapReduce. Cloudera, Hortonworks and MapR started supporting Spark on Hadoop with YARN as well.
But, a lot of companies are still using MapReduce Framework on Hadoop for batch processing instead of Spark.
So, I am trying to understand what are the current challenges of Spark to be used as batch processing framework on Hadoop?
Any thoughts?
Spark is an order of magnitude faster than mapreduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVM.
With Spark 1.1 which primarily includes a new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on netty instead of using block manager for sending shuffle data), a new external shuffle service made Spark perform the fastest PetaByte sort (on 190 nodes with 46TB RAM) and TeraByte sort breaking Hadoop's old record.
Spark can easily handle the dataset which are order of magnitude larger than the cluster's aggregate memory. So, my thought is that Spark is heading in the right direction and will eventually get even better.
For reference this blog post explains how databricks performed the petabyte sort.
I'm assuming when you say Hadoop you mean HDFS.
There are number of benefits of using Spark over Hadoop MR.
Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (that need to perform number of iterations of the same dataset) is can be a few orders of magnitude faster. Map-reduce writes the output of each stage to HDFS.
1.1. Spark can cache (depending on the available memory) this intermediate results and therefore reduce latency due to disk IO.
1.2. Spark operations are lazy. This means Spark can perform certain optimizing before it starts processing the data because it can reorder operations because they have executed yet.
1.3. Spark keeps a lineage of operations and recreates the partial failed state based on this lineage in case of failure.
Unified Ecosystem: Spark provides a unified programming model for various types of analysis - batch (spark-core), interactive (REPL), streaming (spark-streaming), machine learning (mllib), graph processing (graphx), SQL queries (SparkSQL)
Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.
Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive and ElasticSearch to name a few. I'm sure support for many other popular data stores is comings soon. This essentially if you want to adopt Spark you don't have to move your data around.
For example, here is what code for word count looks in Spark (Scala).
val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
I'm sure you have to write a few more lines if you are using standard Hadoop MR.
Here are some common misconceptions about Spark.
Spark is just a in-memory cluster computing framework. However, this is not true. Spark excels when your data can fit in memory because memory access latency is lower. But you can make it work even when your dataset doesn't completely fit in memory.
You need to learn Scala to use Spark. Spark is written in Scala and runs on the JVM. But the Spark provides support for most of the common APIs in Java and Python as well. So you can easily get started with Spark without knowing Scala.
Spark does not scale. Spark is for small datasets (GBs) only and doesn't scale to large number of machines or TBs of data. This is also not true. It has been used successfully to sort PetaBytes of data
Finally, if you do not have a legacy codebase in Hadoop MR it makes perfect sense to adopt Spark, the simple reason being all major Hadoop vendors are moving towards Spark for good reason.
Apache Spark runs in memory, making it much faster than mapreduce.
Spark started as a research project at Berkeley.
Mapreduce use disk extensively (for external sort, shuffle,..).
As the input size for a hadoop job is in order of terabytes. Spark memory requirements will be more than traditional hadoop.
So basically, for smaller jobs and with huge memory in ur cluster, sparks wins. And this is not practically the case for most clusters.
Refer to spark.apache.org for more details on spark

In which types of use cases is MapReduce superior to Spark?

I just attended a introductory class on Spark and asked the speaker if Spark could fully replace MapReduce, and was told that Spark could be used in replace of MapReduce for any use case, but there are particular use cases that MapReduce is actually faster than Spark.
What are the characteristics of the use cases that MapReduce can solve faster than Spark?
Pardon me for quoting myself from Quora, but:
For the data-parallel, one-pass, ETL-like jobs MapReduce was designed for, MapReduce is lighter-weight compared to the Spark equivalent
Spark is fairly mature, and so is YARN now, but Spark-on-YARN is still pretty new. The two may not be optimally integrated yet. For example until recently I don't think Spark could ask YARN for allocations based on number of cores? That is: MapReduce might be easier to understand, manage and tune
You can reproduce almost all of MapReduce's behavior in Spark, since Spark has narrow, simpler functions that can be used to produce a lot of executions. You don't always want to mimic MapReduce.
One thing Spark can't do yet is an out-of-core sort of the sort you happen to get from how classic MapReduce works, but that's coming. I suppose there aren't very direct analogs of a few things like MultipleOutputs either.

Hadoop - CPU intensive application - Small data

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.
If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we do for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was do handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general purpose business use case. Thus, Hadoop as described in its common form is not targeted to the use case you and we have. But, Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a special built hardware environment, PSIKLOPS, that is powerful for small cluster (1-10) nodes with enough power at low cost for run 4-20 parallel jobs. We will be showcasing this in a series of web casts by Inside Analysis titled Tech Lab in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use to launch multiple concurrent containers of custom Java.

Resources