Hadoop - CPU intensive application - Small data - performance

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.

Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.

Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.
If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.

You have exactly the type of computing issue that we do for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was do handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general purpose business use case. Thus, Hadoop as described in its common form is not targeted to the use case you and we have. But, Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a special built hardware environment, PSIKLOPS, that is powerful for small cluster (1-10) nodes with enough power at low cost for run 4-20 parallel jobs. We will be showcasing this in a series of web casts by Inside Analysis titled Tech Lab in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use to launch multiple concurrent containers of custom Java.

Related

Spark as the storage layer for Mapreduce

I am facing a unique problem, and wanted your opinions here.
I have a legacy map-reduce application, where multiple map-reduce jobs run sequentially, the intermediate data is written back and forth to HDFS. Because of intermediate data written to HDFS, the jobs with small data lose more than gain from HDFS's features, and take considerably more time than what a non-Hadoop equivalent would have taken. Eventually I plan to convert all my map reduce jobs to Spark DAGs, however that's a big-bang change, so I am reasonably procrastinating.
What I really want as a short term solution is that, change the storage layer, so that I continue to benefit from map-reduce parallelism, but do not pay much penalty for storage layer. In that direction, I am thinking of using Spark as the storage layer, where map-reduce jobs will store their outputs in Spark through Spark Context, and the inputs will be read again (by creating Spark input split, each split will have it's own Spark RDD) from Spark Context.
In this way, I will be able to operate intermediate data read/write at memory speed, which will theoretically give me significant performance improvement.
My question is, does this architectural scheme make sense? Has anyone encountered situations like this? Am I missing something significant, which I should have considered even at this preliminary stage of the solution?
Thanks in advance!
does this architectural scheme make sense?
It doesn't. Spark has no standalone storage layer so there is nothing you can use here. If it wasn't enough at its core it is using standard Hadoop input formats for reading and writing data.
If you want to reduce overhead of a storage layer you should rather consider accelerated accelerated storage (like Alluxio) or memory grid (like Ignite Hadoop Accelerator).

Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.
For example, one common rule of thumb advice I see can be summarized as...
Big data, non-iterative, fault tolerant => MapReduce
Speed, small data, iterative, non-Mapper-Reducer type => MPI
But then, I also see implementation of MapReduce on MPI (MR-MPI) which does not provide fault tolerance but seems to be more efficient on some benchmarks than MapReduce on Hadoop, and seems to handle big data using out-of-core memory.
Conversely, there are also MPI implementations (MPICH2-YARN) on new generation Hadoop Yarn with its distributed file system (HDFS).
Besides, there seems to be provisions within MPI (Scatter-Gather, Checkpoint-Restart, ULFM and other fault tolerance) that mimic several features of MapReduce paradigm.
And how does Mahout, Mesos and Spark fit in all this?
What criteria can be used when deciding between (or a combo of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?
There might be good technical criteria for this decision but I haven't seen anything published on it. There seems to be a cultural divide where it's understood that MapReduce gets used for sifting through data in corporate environments while scientific workloads use MPI. That may be due to underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out:
Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be in a cluster of "white box" commodity systems connected via ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027 which suggests that use of Infiniband in a MapReduce environment is a relatively new thing.
So why would you want to run on a system that's highly optimized for Infiniband? It's significantly more expensive than ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).
If you have an application that would be sensitive to those effects of optimizations for Infiniband that are already baked into many MPI libraries, maybe that would be useful for you. If your app is relatively insensitive to network performance and spends more time on computations that don't require communication between processes, maybe MapReduce is a better choice.
If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100mbit or Infiniband QDR to DDR, for example, draw a line through the results and see if the purchase of a faster interconnect optimized by MPI would get you where you want to go.
The link you posted about FEM being done on MapReduce: Link
uses MPI. It states it right there in the abstract. They combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data to exploit data locality.
Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic complexity point of view, and also from a measurement point of view.

what are the disadvantages of mapreduce?

What are the disadvantages of mapreduce? There are lots of advantages of mapreduce. But I would like to know the disadvantages of mapreduce too.
I would rather ask when mapreduce is not a suitable choice? I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where mapreduce is not a suitable choice :
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other(jobs run in isolation).
When your processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well are you using it. For example, you can't expect a MR job to give you the result in a couple of ms. You can't count it as its disadvantage either. It's just that you are using it at the wrong place. And it holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some usecases where MapReduce does not work very well.
When you need a response fast. e.g. say < few seconds (Use stream
processing, CEP etc instead)
Processing graphs
Complex algorithms e.g. some machine learning algorithms like SVM, and also see 13 drawfs
(The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again. e.g. KMeans - use Spark
When map phase generate too many keys. Thensorting takes for ever.
Joining two large data sets with complex conditions (equal case can
be handled via hashing etc)
Stateful operations - e.g. evaluate a state machine Cascading tasks
one after the other - using Hive, Big might help, but lot of overhead
rereading and parsing data.
You need to rethink/ rewrite trivial operations like Joins, Filter to achieve in map/reduce/Key/value patterns
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied with Java, of course you have Pig and Hive for rescue but you lose flexibility.
First of all, it streams the map output, if it is possible to keep it in memory this will be more efficient. I originally deployed my algorithm using MPI but when I scaled up some nodes started swapping, that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a hadoop book (Hadoop in action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies if you have too many files your Namenode could experience problems.
If you do not want to use the streaming API you have to write your program in the java language. I for example did a translation from C++. This has some side effects, for example Java has a large string overhead compared to C. Since my software is all about strings this is some sort of drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Mahout recommendation engine: going distributed

Does anybody know how I could transform the code found on the Mahout in Action book, regarding the recommendation engines, so that it is consistent with a Ηadoop fully-distributed environment? My main difficulty is to transform my code (that currently reads and writes from a local disk) so that it runs in a pseudo-distributed environment (such Cloudera). Is the solution to my problem as simple as this one, or I should expect something more complex than that?
A truly distributed computation is quite different than a non-distributed computation, even when computing the same result. The structure is not the same, and the infrastructure it uses is not the same.
If you are just asking how the pseudo-distributed solution works regarding local files: you would ignore the Hadoop input/output mechanism and write a Mapper that reads your input from somewhere on HDFS and copies to local disk.
If you are asking how you actually distribute the computation, then you would have to switch to use the (completely-different) distributed implementations in the project. These actually use Hadoop to split up the computation. The process above is a hack that just runs many non-distributed tasks within a Hadoop container. These implementations are however completely off-line.
If you mean that you want a real-time recommender like in the Mahout .cf.taste packages, but also want to actually use Hadoop's distributed computing power, then you need more than Mahout. It's either one or the other in Mahout; there is code that does one or the other but they are not related.
This is exactly what Myrrix is, by the way. I don't mind advertising it here since it sounds like exactly what you may be looking for. It's an evolution of the work I began in this Mahout code. Among other things, it's a 2-tier architecture that has the real-time elements of Taste but can also transparently offload the computation to a Hadoop cluster.

Hadoop: Disadvantages of using just 2 machines?

I want to do log parsing of huge amounts of data and gather analytic information. However all the data comes from external sources and I have only 2 machines to store - one as backup/replication.
I'm trying to using Hadoop, Lucene... to accomplish that. But, all the training docs mention that Hadoop is useful for distributed processing, multi-node. My setup does not fit into that architecture.
Are they any overheads with using Hadoop with just 2 machines? If Hadoop is not a good choice are there alternatives? We looked at Splunk, we like it, but that is expensive for us to buy. We just want to build our own.
Hadoop should be used for distributed batch processing problems.
5-common-questions-about-hadoop
Analysis of log files is one of the more common uses of Hadoop, its one of the tasks Facebook use it for.
If you have two machines, you by definition have a multi-node cluster. You can use Hadoop on a single machine if you want, but as you add more nodes the time it takes to process the same amount of data is reduced.
You say you have huge amounts of data? These are important numbers to understand. Personally when I think huge in terms of data, i think in the 100s terabytes+ range. If this is the case, you'll probably need more than two machines, especially if you want to use replication over the HDFS.
The analytic information you want to gather? Have you determined that these questions can be answered using the MapReduce approach?
Something you could consider would be to use Hadoop on Amazons EC2 if you have a limited amount of hardware resources. Here are some links to get you started:
hadoop-world-building-data-intensive-apps-with-hadoop-and-ec2
Hadoop Wiki - AmazonEC2

Resources