I am facing a unique problem, and wanted your opinions here.
I have a legacy map-reduce application, where multiple map-reduce jobs run sequentially, the intermediate data is written back and forth to HDFS. Because of intermediate data written to HDFS, the jobs with small data lose more than gain from HDFS's features, and take considerably more time than what a non-Hadoop equivalent would have taken. Eventually I plan to convert all my map reduce jobs to Spark DAGs, however that's a big-bang change, so I am reasonably procrastinating.
What I really want as a short term solution is that, change the storage layer, so that I continue to benefit from map-reduce parallelism, but do not pay much penalty for storage layer. In that direction, I am thinking of using Spark as the storage layer, where map-reduce jobs will store their outputs in Spark through Spark Context, and the inputs will be read again (by creating Spark input split, each split will have it's own Spark RDD) from Spark Context.
In this way, I will be able to operate intermediate data read/write at memory speed, which will theoretically give me significant performance improvement.
My question is, does this architectural scheme make sense? Has anyone encountered situations like this? Am I missing something significant, which I should have considered even at this preliminary stage of the solution?
Thanks in advance!
does this architectural scheme make sense?
It doesn't. Spark has no standalone storage layer so there is nothing you can use here. If it wasn't enough at its core it is using standard Hadoop input formats for reading and writing data.
If you want to reduce overhead of a storage layer you should rather consider accelerated accelerated storage (like Alluxio) or memory grid (like Ignite Hadoop Accelerator).
Related
This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
I am hearing that Spark has an advantage over hadoop due to spark's in-memory computation. However, one of the obvious problems is not all the data can fit into one computers memory. So is Spark then limited to smaller datasets. At the same time, there is the notion of spark cluster. So I am not following the purported advantages of spark over hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged, Apache Drill, which is a low-density SQL engine for self-service data exploration and Apache Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified frame. Let's dig a little bit more into Spark.
To understand Spark, you have to understand really three big concepts.
First is RDDs, the resilient distributed data sets. This is really a representation of the data that's coming into your system in an object format and allows you to do computations on top of it. RDDs are resilient because they have a long lineage. Whenever there's a failure in the system, they can recompute themselves using the prior information using lineage.
The second concept is transformations. Transformations is what you do to RDDs to get other resilient RDDs. Examples of transformations would be things like opening a file and creating an RDD or doing functions like printer that would then create other resilient RDDs.
The third and the final concept is actions. These are things which will do where you're actually asking for an answer that the system needs to provide you, for instance, count or asking a question about what's the first line that has Spark in it. The interesting thing with Spark is that it does lazy elevation which means that these RDDs are not loaded and pushed into the system as in when the system encounters an RDD but they're only done when there is actually an action to be performed.
One thing that comes up with RDDs is that when we come back to them being that they are resilient and in main memory is that how do they compare with distributed shared memory architectures and most of what are familiar from our past? There are a few differences. Let's go with them in a small, brief way. First of all, writes in RDDs are core of Spark. They are happening at an RDD level. Writes in distributor-shared memory are typically fine-grained. Reads and distributor-shared memory are fine-grained as well. Writes in RDD can be fine or course-grained.
The second piece is recovery. What happens if there is a part in the system, how do we recover it? Since RDDs build this lineage graph if something goes bad, they can go back and recompute based on that graph and regenerate the RDD. Lineage is used very strongly in RDDs to recovery. In distributor-shared memories we typically go back to check-pointing done at intervals or any other semantic check-pointing mechanism. Consistency is relatively trivial in RDDs because the data underneath it is assumed to be immutable. If, however, the data was changing, then consistency would be a problem here. Distributor-shared memory doesn't make any assumptions about mutability and, therefore, leaves the consistency semantics to the application to take care of.
At last let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark is optimized in making computations as well as placing the computations optimally using the directory cyclic graph.
Very easy programming paradigms using the transformation and actions on RDDs as well as a ready-rich library support for machine learning, graphics and recently data frames.
At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no because Spark provides an application framework for you to write your big data applications. However, it still needs to run on a storage system or on a no-SQL system.
Spark is never limited to smaller dataset and its not always about in-memorycomputation. Spark has very good number higher APIS . Spark can process the in GB as well. In my realtime experience i have used Spark to handle the streaming application where we usually gets the data in GB/Hour basic . And we have used Spark in Telecommunication to handle bigger dataset as well . Check this RDD Persistence how to accommodate bigger datasets.
In case of real world problem we can't solve them just by one MapReduce program which is having a Mapper class and a reducer class, We mostly need to build a pipeline. A pipeline will consists of multiple stages each having MapReduce program , and out put of one stage will be fed to one or multiple times to the subsequent stages. And this is a pain because of the amount of IO it involves.
In case of MapReduce there are these Map and Reduce tasks subsequent to which there is a synchronization barrier and one needs to preserve the data to the disc. This feature of MapReduce framework was developed with the intent that in case of failure the jobs can be recovered but the drawback to this is that, it does not leverage the memory of the Hadoop cluster to the maximum. And this becomes worse when you have a iterative algorithm in your pipeline. Every iteration will cause significant amount of Disk IO.
So in order to solve the problem , Spark introduced a new Data Structure called RDD . A DS that can hold the information like how the data can be read from the disk and what to compute. Spark also provided easy programming paradigm to create pipeline(DAG) by transforming RDDs . And what you get it a series of RDD which knows how to get the data and what to compute.
Finally when an Action is invoked Spark framework internally optimize the pipeline , group together the portion that can be executed together(map phases), and create a final optimized execution plan from the logical pipeline. And then executes it. It also provides user the flexibility to select the data user wanted to be cached. Hence spark is able to achieve near about 10 to 100 times faster batch processing than MapReduce.
Spark advantages over hadoop.
As spark tasks across stages can be executed on same executor nodes, the time to spawn the Executor is saved for multiple task.
Even if you have huge memory, MapReduce can never make any advantage of caching data in memory and using the in memory data for subsequent steps.
Spark on other hand can cache data if huge JVM is available to it. Across stages the inmemory data is used.
In Spark task run as threads on same executor, making the task memory footprint light.
In MapReduce the Map of reduce Task are processes and not threads.
Spark uses efficient serialization format to store data on disk.
Follow this for detail understanding http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/
I have a doubt that in which cases , MapReduce is chosen over hive or pig.
I know that it is used when
We need indepth filtering of the input data.
working with unstructured data.
Working with graph. ....
But is there any place where we cant use hive, pig or we can work much better with MapReduce and it is used highly in real projects
Hive and Pig are generic solutions and they will have overhead while processing the data. Most of the scenarios it is negligible but in some cases it can be considerable.
If there are many tables that needs to be joined, using Hive and Pig tries to apply generic solution, if you use map reduce after understanding the data, you can come up with more optimal solution.
However map reduce should be treated as kernel. If your solution can be reused else where, it will be better to develop it using map reduce and integrate with Hive/Pig/Sqoop.
Pig can be used to process unstructured data. It will give more flexibility than Hive while processing the data.
Bare MapReduce is not written very often these days. Higher level abstractions such as the two you mentioned are more popular and adequate for query workloads.
Even in scenarios where HiveQL is too restrictive one might seek alternatives such as Cascading or Scalding for low-level batch jobs or the ever more popular Spark.
A primary motivation of using these high level abstractions is because most applications require a sequence of map and reduce phase which the MapReduce APIs leave you on your own to figure out how to serialize data between tasks.
Everyone is saying that Spark is using the memory and because of that it's much faster than Hadoop.
I didn't understand from the Spark documentation what the real difference is.
Where does Spark stores the data in memory while Hadoop doesn't?
What happens if the data is too big for the memory? How similar would it be to Hadoop in that case?
Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk. Mean intermediate output store in main memory where as hadoop store intermediate result in secondary memory. MapReduce inserts barriers, and it takes a long time to write things to disk and read them back. Hence MapReduce can be slow and laborious. The elimination of this restriction makes Spark orders of magnitude faster. For things like SQL engines such as Hive, a chain of MapReduce operations is usually needed, and this requires a lot of I/O activity. On to disk, off of disk—on to disk, off of disk. When similar operations are run on Spark, Spark can keep things in memory without I/O, so you can keep operating on the same data quickly. This results in dramatic improvements in performance, and that means Spark definitely moves us into at least the interactive category. For the record, there are some benefits to MapReduce doing all that recording to disk — as recording everything to disk allows for the possibility of restarting after failure. If you’re running a multi-hour job, you don’t want to begin again from scratch. For applications on Spark that run in the seconds or minutes, restart is obviously less of an issue.
It’s easier to develop for Spark. Spark is much more powerful and expressive in terms of how you give it instructions to crunch data. Spark has a Map and a Reduce function like MapReduce, but it adds others like Filter, Join and Group-by, so it’s easier to develop for Spark.
Spark also adds libraries for doing things like machine learning, streaming, graph programming and SQL
In Hadoop MapReduce the input data is on disk, you perform a map and a reduce and put the result back on disk. Apache Spark allows more complex pipelines. Maybe you need to map twice but don't need to reduce. Maybe you need to reduce then map then reduce again. The Spark API makes it very intuitive to set up very complex pipelines with dozens of steps.
You could implement the same complex pipeline with MapReduce too. But then between each stage you write to disk and read it back. Spark avoids this overhead when possible. Keeping data in-memory is one way. But very often even that is not necessary. One stage can just pass the computed data to the next stage without ever storing the whole data anywhere.
This is not an option with MapReduce, because one MapReduce does not know about the next. It has to complete fully before the next one can start. That is why Spark can be more efficient for complex computation.
The API, especially in Scala, is very clean too. A classical MapReduce is often a single line. It's very empowering to use.
Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.
If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we do for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was do handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general purpose business use case. Thus, Hadoop as described in its common form is not targeted to the use case you and we have. But, Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a special built hardware environment, PSIKLOPS, that is powerful for small cluster (1-10) nodes with enough power at low cost for run 4-20 parallel jobs. We will be showcasing this in a series of web casts by Inside Analysis titled Tech Lab in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use to launch multiple concurrent containers of custom Java.
What are the disadvantages of mapreduce? There are lots of advantages of mapreduce. But I would like to know the disadvantages of mapreduce too.
I would rather ask when mapreduce is not a suitable choice? I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where mapreduce is not a suitable choice :
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other(jobs run in isolation).
When your processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well are you using it. For example, you can't expect a MR job to give you the result in a couple of ms. You can't count it as its disadvantage either. It's just that you are using it at the wrong place. And it holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some usecases where MapReduce does not work very well.
When you need a response fast. e.g. say < few seconds (Use stream
processing, CEP etc instead)
Processing graphs
Complex algorithms e.g. some machine learning algorithms like SVM, and also see 13 drawfs
(The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again. e.g. KMeans - use Spark
When map phase generate too many keys. Thensorting takes for ever.
Joining two large data sets with complex conditions (equal case can
be handled via hashing etc)
Stateful operations - e.g. evaluate a state machine Cascading tasks
one after the other - using Hive, Big might help, but lot of overhead
rereading and parsing data.
You need to rethink/ rewrite trivial operations like Joins, Filter to achieve in map/reduce/Key/value patterns
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied with Java, of course you have Pig and Hive for rescue but you lose flexibility.
First of all, it streams the map output, if it is possible to keep it in memory this will be more efficient. I originally deployed my algorithm using MPI but when I scaled up some nodes started swapping, that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a hadoop book (Hadoop in action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies if you have too many files your Namenode could experience problems.
If you do not want to use the streaming API you have to write your program in the java language. I for example did a translation from C++. This has some side effects, for example Java has a large string overhead compared to C. Since my software is all about strings this is some sort of drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.