Any clustering system, including Hadoop, has both benefits and costs. The benefit is the ability to compute in parallel; the cost is the overhead of distributing tasks.
Suppose I don't need those benefits and want to use only one node. How can I run Hadoop so that this overhead is avoided completely? Is running in single-node pseudo-distributed mode sufficient, or will it still carry some parallelization overhead?
Related
I heard that Hadoop can run into performance problems when you run broad queries, because too many nodes can be involved.
Can anyone verify or falsify this statement?
The namenode runs into performance problems if you add too many files, since it must hold the locations of all files in memory. You can mitigate this by periodically consolidating small files into larger archives; for example, daily database dumps become monthly/yearly compressed archives that are still in a processable format.
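As a rough illustration of that idea (the paths are made up, and a SequenceFile is just one possible "processable archive" format), many small dump files can be packed into one large file so the namenode only has to track a single entry:

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One large SequenceFile on HDFS instead of hundreds of small files,
        // so the namenode keeps metadata for a single entry (hypothetical path).
        Path archive = new Path("/archives/daily-dumps-2014.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(archive),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File dump : new File("/local/daily-dumps").listFiles()) { // hypothetical local dir
                byte[] content = Files.readAllBytes(dump.toPath());
                // key = original file name, value = raw bytes of the dump
                writer.append(new Text(dump.getName()), new BytesWritable(content));
            }
        }
    }
}
```

MapReduce jobs can then read the archive back record by record via SequenceFileInputFormat, so nothing is lost for processing.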
HDFS datanodes are just a filesystem and scale linearly. Adding more NodeManager nodes has no overall negative consequences, and YARN has been reported to run with up to 1000 nodes; I would suggest running separate standalone clusters if you actually needed more than that.
As with any distributed system, you need to optimize network switching and system monitoring, but those are operational performance concerns that are not specific to Hadoop.
I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.
For example, one common rule-of-thumb piece of advice I see can be summarized as:
Big data, non-iterative, fault tolerant => MapReduce
Speed, small data, iterative, non-Mapper-Reducer type => MPI
But then I also see an implementation of MapReduce on MPI (MR-MPI) which does not provide fault tolerance but seems to be more efficient on some benchmarks than MapReduce on Hadoop, and which seems to handle big data using out-of-core memory.
Conversely, there are also MPI implementations (MPICH2-YARN) on new generation Hadoop Yarn with its distributed file system (HDFS).
Besides, there seem to be provisions within MPI (Scatter-Gather, Checkpoint-Restart, ULFM and other fault-tolerance mechanisms) that mimic several features of the MapReduce paradigm.
And how do Mahout, Mesos and Spark fit into all this?
What criteria can be used when deciding between (or a combo of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?
There might be good technical criteria for this decision, but I haven't seen anything published on it. There seems to be a cultural divide: it's understood that MapReduce gets used for sifting through data in corporate environments, while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out:
Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be in a cluster of "white box" commodity systems connected via ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027 which suggests that use of Infiniband in a MapReduce environment is a relatively new thing.
So why would you want to run on a system that's highly optimized for Infiniband? It's significantly more expensive than ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).
If you have an application that is sensitive to those effects, the Infiniband optimizations already baked into many MPI libraries may be useful for you. If your app is relatively insensitive to network performance and spends more time on computation that doesn't require communication between processes, MapReduce is probably the better choice.
If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100mbit or Infiniband QDR to DDR, for example, draw a line through the results and see if the purchase of a faster interconnect optimized by MPI would get you where you want to go.
The link you posted about FEM being done on MapReduce: Link
actually uses MPI. It states so right there in the abstract. They combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.
Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will perform poorly with Hadoop. This can be demonstrated both from an algorithmic-complexity point of view and from measurements.
I have recently built a multi-node Hadoop 2.5.0 cluster using a few ARM development boards, and I cannot decide whether I should use the same type of board as the master, use a faster ARM board as the master, or even use a desktop machine as the master to manage the slave nodes.
Is there any benefit to having a master node that is faster than the slave nodes in the same cluster?
Besides the benefits of increased RAM, does increased CPU performance on the master node matter?
The namenode/jobtracker hardware specifications should be kept in proportion to the worker nodes (something like this might help).
But I don't recall any recommendation to make the master nodes more powerful. They don't need extra RAM/HDD/CPU power; in fact, you can save money by using less powerful master nodes without losing much performance. (Just don't forget to keep them in proportion to the workers.)
There is a lot of content explaining data locality and how MapReduce and HDFS work on multi-node clusters, but I can't find much information regarding a single-node setup. In the three months I have been experimenting with Hadoop, I have read many tutorials and threads about the number of mappers and reducers and about writing custom partitioners to optimize jobs, but I keep wondering: does any of it apply to a single-node cluster?
What do you lose by running MapReduce jobs on a single-node cluster compared to a multi-node cluster?
Does the parallelism provided by splitting the input data still apply in this case?
What is the difference between reading input from single-node HDFS and reading from the local filesystem?
Given my limited experience I can't answer these questions clearly myself, so any help is appreciated!
EDIT: I understand Hadoop is not suitable for a single-node setup because of all the factors listed by #TC1. So, what's the benefit of setting up a pseudo-distributed Hadoop environment?
I have read many tutorials and threads about the number of mappers and reducers and about writing custom partitioners to optimize jobs, but I keep wondering: does any of it apply to a single-node cluster?
It depends. Combiners run between the map and reduce phases, and you would definitely feel their impact even on a single node if they are used right. Custom partitioners -- probably not: the data hits the same disk before reducing either way. They would affect the logic, i.e., which data each reducer receives, but probably not the performance.
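To make that concrete, here is a hedged sketch (the class names and the split-by-first-letter rule are invented for illustration) of where a combiner and a custom partitioner plug into a job; on a single node only the combiner line is likely to change performance, by shrinking the intermediate data spilled to disk:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerPartitionerDemo {

    /** Sends words starting with a-m to reducer 0 and the rest to reducer 1.
     *  This changes which reducer sees which keys, not how fast the job runs
     *  on a single node, since all partitions land on the same disk anyway. */
    public static class AlphaPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions < 2 || key.getLength() == 0) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            return first <= 'm' ? 0 : 1;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner and partitioner demo");
        job.setJarByClass(CombinerPartitionerDemo.class);
        job.setMapperClass(TokenCounterMapper.class); // library mapper: emits (word, 1)
        job.setCombinerClass(IntSumReducer.class);    // pre-aggregates map output locally -> less spilled data
        job.setReducerClass(IntSumReducer.class);     // library reducer: sums the counts
        job.setPartitionerClass(AlphaPartitioner.class);
        job.setNumReduceTasks(2);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```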
What do you lose by running MapReduce jobs on a single-node cluster compared to a multi-node cluster?
Processing capability. If you can get by with a single node setup for your data, you probably shouldn't be using Hadoop for your processing in the first place.
Does the parallelism provided by splitting the input data still apply in this case?
No, the bottleneck typically is I/O, i.e., accessing the disk. In this case, you're still accessing the same disk, only hitting it from more threads.
What is the difference between reading input from single-node HDFS and reading from the local filesystem?
Virtually non-existent. The idea of HDFS is to
store files in big, contiguous blocks, to avoid disk seeking, and
replicate these blocks among the nodes to provide resilience;
both of those are moot when running on a single node.
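If you want to see that for yourself, a quick check (the path is hypothetical) is to ask HDFS where the blocks of a file live; on a single-node cluster every block reports the same host:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical path
        // One entry per block; on a single-node cluster every block lists the same host.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + "-" + (block.getOffset() + block.getLength())
                    + " on " + String.join(",", block.getHosts()));
        }
    }
}
```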
EDIT:
The difference between "single-node" (standalone) and "pseudo-distributed" is that in standalone mode all the Hadoop processes run in a single JVM. There is no network communication involved, not even through localhost. Even if you are simply testing a job on small data, I'd advise using pseudo-distributed mode, since it is essentially the same environment as a real cluster.
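As a hedged illustration, the switch largely comes down to a few well-known configuration properties, normally set in core-site.xml, hdfs-site.xml and mapred-site.xml rather than in code; the port and values below are the usual defaults from the Hadoop docs, not something specific to your setup:

```java
import org.apache.hadoop.conf.Configuration;

public class ModeConfigs {
    // Standalone ("local") mode: one JVM, local filesystem, no daemons at all.
    static Configuration standalone() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");              // read/write the local filesystem
        conf.set("mapreduce.framework.name", "local");     // run the job in-process
        return conf;
    }

    // Pseudo-distributed mode: real NameNode/DataNode and YARN daemons, all on localhost.
    static Configuration pseudoDistributed() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // talk to a local HDFS instance
        conf.set("mapreduce.framework.name", "yarn");      // submit to a local ResourceManager
        conf.set("dfs.replication", "1");                  // one copy is enough with one datanode
        return conf;
    }
}
```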
Apache Hadoop is inspired by the Google MapReduce paper. The flow of MapReduce can be viewed as two sets of SIMD (single instruction, multiple data) operations, one for the mappers and another for the reducers. The reducers consume the output of the mappers, grouped by a predefined key. The essence of the MapReduce framework (and Hadoop) is to automatically partition the data, determine the number of partitions and parallel tasks, and manage the distributed resources.
I have a general algorithm (not necessarily expressible as MapReduce) to be run in parallel. I am not implementing the algorithm itself the MapReduce way; instead, the algorithm is just a single-machine Python/Java program. I want to run 64 copies of this program in parallel (assuming there are no concurrency issues in the program). That is, I am more interested in the computing resources of the Hadoop cluster than in the MapReduce framework. Is there any way I can use the Hadoop cluster in this old-fashioned way?
Another way of thinking about MapReduce is that Map does the transformation and Reduce does some sort of aggregation.
Hadoop also allows for a map-only job. That way, it should be possible to run 64 copies of the program in parallel as map tasks.
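Here is a hedged sketch of that idea (the program path, the 64-line parameter file and the class names are assumptions, not something from the question): each line of the input file becomes its own map task via NLineInputFormat, the reducer count is set to zero, and each mapper just launches the external program.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyLauncher {

    /** Each map task gets one line of the input file (e.g. a parameter for one
     *  of the 64 copies) and launches the external single-machine program. */
    public static class LaunchMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical wrapper script around the single-machine Python/Java program.
            ProcessBuilder pb = new ProcessBuilder("/opt/myalgo/run.sh", line.toString());
            pb.inheritIO();
            int exitCode = pb.start().waitFor();
            context.write(NullWritable.get(), new Text(line + " -> exit " + exitCode));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "64 parallel copies");
        job.setJarByClass(MapOnlyLauncher.class);
        job.setMapperClass(LaunchMapper.class);
        job.setNumReduceTasks(0);                         // map-only: no shuffle, no reduce
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);     // one input line -> one map task
        NLineInputFormat.addInputPath(job, new Path(args[0]));  // a file with 64 lines
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Hadoop Streaming can achieve much the same thing without writing Java, by passing the external program as the mapper command.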
Hadoop has the concept of slots. By default there are 2 map and 2 reduce slots per node/machine, so for 64 processes in parallel, 32 nodes are required. If the nodes have a higher-end configuration, the number of map/reduce slots per node can also be bumped up.