After reading this and this paper, I decided I want to implement distributed volume rendering of large datasets on MapReduce as my undergraduate thesis work. Is Hadoop a reasonable choice? Wouldn't its being Java hurt performance or make integration with CUDA difficult? Would Phoenix++ be a better tool for the job?
Hadoop also has a C++ API called Hadoop Pipes. Pipes lets you write your Map and Reduce code in C++ and thus interface with any C/C++ libraries you have available, which should also make it possible to interface with CUDA.
To my understanding, Pipes only replaces the user-facing Map and Reduce code, so all of the network communication and the distributed filesystem are still handled by Java. Hadoop is intended to make parallelization of tasks simple and general, and as a result it cannot be the most efficient MapReduce implementation. Your requirements for efficiency versus available programmer time will probably be the deciding factor between Hadoop and a more efficient, lower-level framework.
See the Word Count in Pipes example. Documentation is unfortunately sparse, but having the source available makes things a lot easier.
I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.
For example, one common piece of rule-of-thumb advice I see can be summarized as:
Big data, non-iterative, fault tolerant => MapReduce
Speed, small data, iterative, non-Mapper-Reducer type => MPI
But then, I also see an implementation of MapReduce on top of MPI (MR-MPI) which does not provide fault tolerance, yet seems to be more efficient than Hadoop MapReduce on some benchmarks and to handle big data via out-of-core processing.
Conversely, there are also MPI implementations (MPICH2-YARN) running on the new-generation Hadoop YARN with its distributed filesystem (HDFS).
Besides, there seem to be provisions within MPI (scatter/gather, checkpoint-restart, ULFM and other fault-tolerance features) that mimic several features of the MapReduce paradigm.
And how do Mahout, Mesos and Spark fit into all this?
What criteria can be used when deciding between (or a combo of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?
There might be good technical criteria for this decision but I haven't seen anything published on it. There seems to be a cultural divide where it's understood that MapReduce gets used for sifting through data in corporate environments while scientific workloads use MPI. That may be due to underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out:
Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be in a cluster of "white box" commodity systems connected via ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027 which suggests that use of Infiniband in a MapReduce environment is a relatively new thing.
So why would you want to run on a system that's highly optimized for Infiniband? It's significantly more expensive than ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).
If your application is sensitive to network performance, the Infiniband optimizations already baked into many MPI libraries could be valuable to you. If it is relatively insensitive to network performance and spends more time on computation that doesn't require communication between processes, MapReduce is probably the better choice.
If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network - downclock GigE to 100 Mbit or Infiniband QDR to DDR, for example - then fit a line through the results and see whether buying a faster interconnect that MPI can exploit would get you where you want to go.
The link you posted about FEM being done on MapReduce (Link) actually uses MPI. It states so right in the abstract: they combined MPI's programming model (not embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.
Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic-complexity point of view and from measurements.
I'm having a lot of trouble grasping what exactly a 'Hadoop ecosystem' is conceptually. I understand that you have some data processing tasks you want to run, so you use MapReduce to split the job up into smaller pieces, but I'm unsure what people mean when they say 'Hadoop ecosystem'. I'm also unclear about the benefits of Apache Spark and why it is seen as so revolutionary. If it's all in-memory computation, wouldn't that just mean you need machines with more RAM to run Spark jobs? How is Spark different from writing some parallelized Python code or something of that nature?
Your question is rather broad - the Hadoop ecosystem is a wide range of technologies that either support Hadoop MapReduce, make it easier to apply, or otherwise interact with it to get stuff done.
Examples:
The Hadoop Distributed Filesystem (HDFS) stores data to be processed by MapReduce jobs, in a scalable redundant distributed fashion.
Apache Pig provides a language, Pig Latin, for expressing data flows that are compiled down into MapReduce jobs.
Apache Hive provides an SQL-like language for querying huge datasets stored in HDFS.
There are many, many others - see for example https://hadoopecosystemtable.github.io/
Spark is not all in-memory; it can perform calculations in-memory if enough RAM is available, and can spill data over to disk when required.
It is particularly suitable for iterative algorithms, because data from the previous iteration can remain in memory. It provides a very different (and much more concise) programming interface, compared to plain Hadoop. It can provide some performance advantages even when the work is mostly done on disk rather than in-memory. It supports streaming as well as batch jobs. It can be used interactively, unlike Hadoop.
Spark is relatively easy to install and play with, compared to Hadoop, so I suggest you give it a try to understand it better - for experimentation it can run off a normal filesystem and does not require HDFS to be installed. See the documentation.
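To give a concrete feel for how concise that interface is, here is a minimal word-count sketch against Spark's Java API (assuming Spark 2.x; the input file and output directory names are placeholders), run in local mode so that no cluster or HDFS is needed:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // Local mode: use all cores of this machine, no cluster or HDFS required.
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("input.txt");                  // placeholder input path
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // pair each word with a count of 1
                    .reduceByKey(Integer::sum);                                    // sum the counts per word
                counts.saveAsTextFile("counts-out");                               // placeholder output directory
            }
        }
    }

The equivalent job in plain Hadoop MapReduce needs a Mapper class, a Reducer class and driver boilerplate, which is a large part of why Spark's interface is considered so much more concise.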
Is Hadoop a proper solution for jobs that are CPU-intensive and only need to process a small file of around 500 MB? I have read that Hadoop is aimed at processing so-called Big Data, and I wonder how it performs with a small amount of data (but a CPU-intensive workload).
I would mainly like to know whether a better approach for this scenario exists, or whether I should stick with Hadoop.
Hadoop is a distributed computing framework built around a MapReduce engine. If you can express your parallelizable, CPU-intensive application in this paradigm (or in any other paradigm supported by Hadoop modules), you may be able to take advantage of Hadoop.
A classic example of a Hadoop computation is the calculation of Pi, which doesn't need any input data. As you'll see here, Yahoo managed to determine the two-quadrillionth digit of Pi thanks to Hadoop.
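To make the paradigm concrete for a CPU-bound job with a small input, here is a hedged sketch using the standard Hadoop MapReduce Java API; the class names and the expensiveComputation method are made up for illustration. The point is simply that the mapper carries all of the heavy per-record computation (run in parallel across map tasks), while the reducer only aggregates small results:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: each input line describes one work item; all the CPU-heavy work
    // happens here, in parallel across the map tasks.
    public class CpuHeavyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            double result = expensiveComputation(value.toString()); // hypothetical CPU-bound kernel
            context.write(new Text("result"), new DoubleWritable(result));
        }

        private double expensiveComputation(String workItem) {
            return workItem.length(); // placeholder for the real computation
        }
    }

    // Reducer: merely aggregates the small per-task results.
    class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }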
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for that purpose. For instance, it comes with a filesystem designed to hold huge files. These files are chunked into many blocks spread across a large number of nodes, and to ensure data integrity each block is replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster at your disposal, you may want to take advantage of it.
If that's not the case, then, while I can't recommend anything specific since I don't know exactly what you need, I think you can find more lightweight frameworks than Hadoop.
Well, a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs, possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much, much easier and much, much faster than Hadoop.
If you don't have access to a cluster, it can still be a good idea to use Spark so that you can use it more easily in the future. Or just use .par in Scala, which will parallelize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we have with Data Normalization: a need for parallel processing on cheap hardware and software, with ease of use, instead of going through all the special programming of traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations; indeed, the test application still distributed with it, WordCount, is numbingly simplistic. That is because the genesis of Hadoop was to handle the tremendous amount of data and concurrent processing needed for search, with the "Big Data" analytics movement bolted on afterwards to find a more general-purpose business use case. Thus, Hadoop as commonly described is not targeted at the use case you and we have. But Hadoop does offer the key capability of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a purpose-built hardware environment, PSIKLOPS, that works well for small clusters (1-10 nodes), with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early August 2014. We see this capability as a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use for launching multiple concurrent containers of custom Java.
Does anybody know how I could transform the code from the Mahout in Action book, regarding the recommendation engines, so that it works in a fully distributed Hadoop environment? My main difficulty is transforming my code (which currently reads from and writes to a local disk) so that it runs in a pseudo-distributed environment (such as Cloudera's). Is the solution to my problem as simple as this one, or should I expect something more complex than that?
A truly distributed computation is quite different from a non-distributed computation, even when computing the same result. The structure is not the same, and the infrastructure it uses is not the same.
If you are just asking how the pseudo-distributed solution works with local files: you would ignore the Hadoop input/output mechanism and write a Mapper that reads your input from somewhere on HDFS and copies it to local disk.
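A minimal sketch of that approach, assuming the HDFS path of each input file arrives as the map input value and that /tmp is an acceptable local staging directory (both assumptions are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map task copies one file from HDFS to its local disk, then hands it
    // to the existing non-distributed code that expects local files.
    public class StageToLocalMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Path hdfsInput = new Path(value.toString());               // HDFS path passed in as the input value
            Path localCopy = new Path("/tmp/" + hdfsInput.getName());  // placeholder local destination
            FileSystem fs = FileSystem.get(context.getConfiguration());
            fs.copyToLocalFile(hdfsInput, localCopy);
            // ... run the existing local-disk recommender code against localCopy here ...
        }
    }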
If you are asking how you actually distribute the computation, then you would have to switch to the (completely different) distributed implementations in the project. Those actually use Hadoop to split up the computation; the process above is a hack that just runs many non-distributed tasks within a Hadoop container. Note, however, that those implementations are completely offline.
If you mean that you want a real-time recommender like the ones in Mahout's .cf.taste packages, but also want to actually use Hadoop's distributed computing power, then you need more than Mahout. In Mahout it's one or the other: there is code that does each, but the two are not related.
This is exactly what Myrrix is, by the way. I don't mind advertising it here since it sounds like exactly what you may be looking for. It's an evolution of the work I began in this Mahout code. Among other things, it's a 2-tier architecture that has the real-time elements of Taste but can also transparently offload the computation to a Hadoop cluster.
(Even more basic than Difference between Pig and Hive? Why have both?)
I have a data processing pipeline written in several Java map-reduce tasks over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic operations such as join, inverse, sort and group by. My code is involved and not very generic.
What are the pros and cons of continuing this admittedly development-intensive approach versus migrating everything to Pig/Hive with several UDFs? Which jobs won't I be able to execute? Will I suffer a performance degradation (working with hundreds of TB)? Will I lose the ability to tweak and debug my code during maintenance? Will I be able to pipeline some of the jobs as Java map-reduce and use their input/output with my Pig/Hive jobs?
Reference (Twitter): typically a Pig script is 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take between 110% and 150% of the time a native map/reduce job would have taken to execute. Of course, if there is a routine that is highly performance-sensitive, they still have the option to hand-code the native map/reduce functions directly.
The above reference also discusses the pros and cons of Pig compared to developing applications directly in MapReduce.
As with any higher-level language or abstraction, Pig/Hive trades away some flexibility and performance in exchange for developer productivity.
In this paper from 2009 it is stated that Pig runs 1.5 times slower than plain MapReduce. Higher-level tools built on top of Hadoop are expected to perform slower than plain MapReduce; on the other hand, making MapReduce perform optimally requires an advanced user who writes a lot of boilerplate code (e.g. binary comparators).
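As an illustration of the kind of boilerplate meant by "binary comparators": a raw comparator orders keys directly on their serialized bytes, so the framework never has to deserialize them during the sort. Here is a hedged sketch for integer keys (Hadoop already ships an optimized comparator for IntWritable, so this is purely illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparator;

    // Compares serialized IntWritable keys byte-for-byte, avoiding the cost of
    // creating and deserializing key objects for every comparison during the sort.
    public class RawIntComparator extends WritableComparator {
        public RawIntComparator() {
            super(IntWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int left = readInt(b1, s1);   // readInt is inherited from WritableComparator
            int right = readInt(b2, s2);
            return Integer.compare(left, right);
        }
    }

Such a comparator would be registered on the job with something like job.setSortComparatorClass(RawIntComparator.class); writing and wiring up this kind of code correctly is exactly the expertise the higher-level tools spare you.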
I find it relevant to mention a new API called Pangool (of which I'm a developer) that aims to replace the plain Hadoop MapReduce API by making a lot of things easier to code and understand (secondary sort, reduce-side joins). Pangool imposes barely any performance overhead (about 5% in its first benchmark) and retains all of the flexibility of the original MapRed API.