how to tune mapred.reduce.parallel.copies? - hadoop

After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we want to experiment with mapred.reduce.parallel.copies.
The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?

In order to do that you should basically look at four things: CPU, RAM, disk and network. If your setup is crossing the threshold on these metrics, you can deduce that you are pushing the limits. For example, if you set "mapred.reduce.parallel.copies" to a value much higher than the number of cores available, you'll end up with too many threads in the waiting state, since threads are created based on this property to fetch the map output. On top of that, the network might get overwhelmed. Also, if there is too much intermediate output to be shuffled, your job will become slow, as you will need a disk-based shuffle in that case, which is slower than a RAM-based shuffle. Choose a sensible value for "mapred.job.shuffle.input.buffer.percent" based on your RAM (it defaults to 70% of the reducer heap, which is normally fine). These are the kinds of things that will tell you whether you are over-parallelizing or not. There are a lot of other things you should consider as well. I would recommend going through Chapter 6 of "Hadoop: The Definitive Guide".
Some of the measures you could take to make your jobs more efficient are using a combiner to limit data transfer, enabling intermediate compression, and so on.
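As a rough illustration only, here is how those knobs could be set from a job driver. This is a minimal sketch: the property names are the old mapred.* ones used in the question, the class is hypothetical, and the values are placeholders to be tuned against your own benchmarks, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Threads each reducer uses to fetch map output in parallel.
            // Pushing this past what your cores/network can sustain just adds waiting threads.
            conf.setInt("mapred.reduce.parallel.copies", 10);              // placeholder value

            // Fraction of the reducer heap used to buffer shuffled map output in RAM.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);

            // Compress intermediate map output to cut shuffle traffic.
            conf.setBoolean("mapred.compress.map.output", true);

            Job job = Job.getInstance(conf, "tuning-sketch");
            job.setJarByClass(TuningSketch.class);
            // job.setMapperClass(...); job.setCombinerClass(...); job.setReducerClass(...);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }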
HTH
P.S.: This answer is not specific to just "mapred.reduce.parallel.copies"; it is about tuning your job in general. Honestly, setting only this property is not going to help you much. You should consider other important properties as well.

Reaching the "sweet spot" is really just finding the parameters that give you the best result for whichever metric you consider the most important, usually overall job time. To figure out what parameters are working I would suggest using the following profiling tools that Hadoop comes with, MrBench, TestDFSIO, and NNBench. These are found in the hadoop-mapreduce-client-jobclient-*.jar.
By running this command you will see a long list of benchmark programs that you can use besides the ones I mentioned above.
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*.jar
I would suggest first running with the default parameters to establish baseline benchmarks, then changing one parameter at a time and rerunning. It is a bit time consuming but worth it, especially if you use a script to change the parameters and run the benchmarks.

Related

Which metrics to measure the efficiency of a MapReduce application?

I wrote a MapReduce application which runs on a cluster of 6 nodes.
I am sure that my MapReduce algorithm (run on a cluster of computers) outperforms the sequential algorithm (run on a single computer), but I think this does not mean that my MapReduce algorithm is efficient enough, right?
I have searched around and found the speedup, scaleup, and sizeup metrics. Is it true that these are the metrics we normally consider when measuring the efficiency of a MapReduce application? Is there any other metric that we need to consider?
Thank you a lot.
Before specifically addressing your question, let's revisit the map-reduce model and see what the real problem is that it tries to solve. You can refer to this answer (by me; of course you can refer to other good answers for the same problem) to get an idea of the map-reduce model.
So what does it really try to solve? It provides a generic model that can be applied to a vast range of problems that need to process massive amounts of data (usually gigabytes or even petabytes). The real strength of this model is that it can be easily parallelized, and execution can even be easily distributed among a number of nodes. This article (by me) has a detailed explanation of the whole model.
So let's get to your question: you are asking about measuring the efficiency of a map-reduce program in terms of speed, memory efficiency and scalability.
To the point: the efficiency of a map-reduce program always depends on how well it exploits the parallelism offered by the underlying computational power. This directly implies that a map-reduce program that runs well on one cluster may not be the ideal program to run on a different cluster. So we need to have a good idea of our cluster if we hope to fine-tune our program to that precise a level. But in practice it is rare that someone needs to tune it that far.
Let's take your points one by one:
Speed up:
It depends on how you split your input into portions. This directly determines the amount of parallelism (and is under your control). So, as I mentioned above, the speed-up directly depends on how well your split logic is able to utilize your cluster.
Memory efficiency:
It mostly depends on how memory efficient your mapper logic and reducer logic are.
Scalability:
This is mostly not a concern. You can see that the map-reduce model is already highly scalable, to a level where one would rarely need to go the extra mile.
So, on the whole, the efficiency of a map-reduce program (even in terms of speed and memory) is rarely a concern. Practically speaking, the most valuable metric is the quality of its output, i.e. how good your analytic data are (for marketing, research, etc.).
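For reference, the three metrics named in the question are usually defined along the following lines (standard formulations from the parallel-systems literature, sketched here rather than anything Hadoop-specific; T is wall-clock time and D is the dataset):

    \begin{aligned}
    \text{speedup}(p) &= \frac{T(1\ \text{node},\ D)}{T(p\ \text{nodes},\ D)}           &&\text{ideal value: } p\\
    \text{scaleup}(p) &= \frac{T(1\ \text{node},\ D)}{T(p\ \text{nodes},\ p \cdot D)}   &&\text{ideal value: } 1\\
    \text{sizeup}(k)  &= \frac{T(n\ \text{nodes},\ k \cdot D)}{T(n\ \text{nodes},\ D)}  &&\text{ideal value: at most } k
    \end{aligned}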

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class Logistic Regression classifier.
It is expected that this kind of Machine Learning processing will be expensive in terms of time and resources.
I am running the code now and everything seems to be OK, except that I have no idea whether my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for a certain type of processing and computing resources, the processing time should be of the approximate order of...
Is there any method that helps in judging whether my application is running slow or fast, or is it purely a matter of experience?
I had the same question and I didn't find a real answer/tool/way to test how good my performance was by looking "only inside" my application.
I mean, as far as I know, there's no tool like a speed test, the way there is for an internet connection :-)
The only way I found is to re-write my app (if possible) with another stack in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found two main resources very interesting, even if they are quite old:
1) A sort of four-point guide to remember when coding:
Understanding the Performance of Spark Applications, Spark Summit 2013
2) A two-part article from the Cloudera blog on tuning your jobs:
episode1
episode2
Hope it helps
FF
Your question is pretty generic, so I would also highlight a few general areas where you can look for performance optimizations:
Scheduling Delays - Are there significant delays in scheduling the tasks? If yes, analyze the reasons (maybe your cluster needs more resources, etc.).
Cluster Utilization - Are your jobs utilizing the available cluster resources (CPU, memory)? If not, again look for the reasons. Maybe creating more partitions helps with faster execution. Maybe significant time is spent in serialization, in which case you can switch to Kryo serialization.
JVM Tuning - Consider analyzing the GC logs and tune if you find anomalies.
Executor Configuration - Analyze the memory and cores given to your executors. They should be sufficient to hold the data processed by the task/job.
Your DAG - Review your job's DAG (e.g. in the Spark UI) for unnecessary shuffles and stages.
Driver Configuration - As with the executors, the driver should also have enough memory to hold the results of functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality your tasks achieve.
All of the above are needed for a preliminary investigation, and in some cases they can also improve the performance of your jobs to an extent, but there may be more complex issues whose solution will depend on the specific case.
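As a small, hedged illustration of a few of the knobs above (Kryo serialization, executor/driver sizing, partition count), here is a minimal Java sketch; the class name, values and input path are placeholders to be sized to your own cluster, not recommendations:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TuningKnobs {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("tuning-knobs")
                    // Kryo is usually faster and more compact than Java serialization.
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    // Executor/driver sizing: placeholders, size them to your cluster.
                    .set("spark.executor.memory", "4g")
                    .set("spark.executor.cores", "4")
                    .set("spark.driver.memory", "2g");

            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile(args[0]);
            // More partitions can improve cluster utilization when tasks are cheap and wide.
            JavaRDD<String> repartitioned = lines.repartition(200);   // placeholder count
            System.out.println("lines: " + repartitioned.count());
            sc.stop();
        }
    }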
Please also see Spark Tuning Guide

What is the difference and how to choose between distributed queue and distributed computing platform?

There are many files that need to be processed by two computers in real time. I want to distribute them to the two computers, and these tasks need to be completed as soon as possible (meaning real-time processing). I am thinking about the plans below:
(1) a distributed queue like Gearman
(2) a distributed computing platform like Hadoop/Spark/Storm/S4 and so on
I have two questions:
(1) What are the advantages and disadvantages of (1) versus (2)?
(2) How do I choose within (2): Hadoop? Spark? Storm? S4? Or something else?
Thanks!
Maybe I have not described the question clearly. In most cases there are 1000-3000 files with the same format. These files are independent, so you do not need to care about their order; the size of one file is maybe tens to hundreds of KB, and in the future both the number of files and the size of a single file will grow. I have written a program that can process a file, pick out the data and then store the data in MongoDB. For now there are only two computers; I just want a solution that can process these files with the program quickly (as soon as possible) and is easy to extend and maintain.
A distributed queue is easy to use in my case but maybe hard to extend and maintain; Hadoop/Spark is too "big" for two computers but easy to extend and maintain. Which is better? I am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other, or do you need some form of aggregation? I.e., do some pieces of data need to go together? Say, all transactions from a single user account.
Is your processing CPU bound? Memory bound? Filesystem bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on answers to these (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then running batch computations with Spark (which should generally be better than plain Hadoop).
Maybe not, and you can use Spark Streaming to process as you receive the data.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
etc. So the answer to "how do you choose?" is "by carefully looking at the constraints of your particular problem, estimating the load demands on your system, and picking the solution that best matches those". None of these solutions and frameworks dominates the others; that's why they are all alive and kicking. The choice is all in the tradeoffs you are willing/able to make.
Hope it helps.
First of all, dannyhow is right - this is not what real-time processing is about. There is a great book, http://www.manning.com/marz/, which says a lot about the lambda architecture.
The two approaches you mentioned serve completely different purposes and are connected to the definition of the word "task". For example, Spark will take a whole job you give it and divide it into "tasks", but the outcome of one task is useless to you on its own; you still need to wait for the whole job to finish. You can create small jobs working on the same dataset and use Spark's caching to speed them up, but then you won't get much advantage from distribution (if they have to be run one after another).
Are the files big? Are they somehow connected to each other? If yes, I'd go with Spark. If not, a distributed queue.

Best method of having a single process distributed across a cluster

I'm very new to cluster computing, and wanted to know more about the various software used for cluster computing, and which is best for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers type scenario, where a single Manager is responsible for the creation of 100s to 1000s of jobs. Each job, while relatively large, must execute on a small frame-by-frame basis. I.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame will be very small, so latency between the Manager and the worker machines must be very small, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high performance networking, such as infiniband.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.
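To make the Master/Worker shape concrete, here is a minimal sketch assuming the Open MPI Java bindings (package mpi; method names such as getRank/send/recv follow those bindings and differ slightly in older mpiJava-style packages, so treat the exact API as an assumption). The per-frame "work" is a stand-in:

    import mpi.MPI;
    import mpi.MPIException;

    public class FrameMasterWorker {
        public static void main(String[] args) throws MPIException {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.getRank();
            int size = MPI.COMM_WORLD.getSize();
            int frames = 10;                      // hypothetical number of frames

            if (rank == 0) {
                // Master: tell every worker to advance one frame, then collect the reports.
                for (int frame = 0; frame < frames; frame++) {
                    int[] cmd = {frame};
                    for (int w = 1; w < size; w++) {
                        MPI.COMM_WORLD.send(cmd, 1, MPI.INT, w, 0);
                    }
                    double[] report = new double[1];
                    for (int w = 1; w < size; w++) {
                        MPI.COMM_WORLD.recv(report, 1, MPI.DOUBLE, w, 1);
                    }
                }
            } else {
                // Worker: receive a frame index, do the (tiny) per-frame step, report back.
                int[] cmd = new int[1];
                for (int frame = 0; frame < frames; frame++) {
                    MPI.COMM_WORLD.recv(cmd, 1, MPI.INT, 0, 0);
                    double[] report = {cmd[0] * 2.0};   // stand-in for the real per-frame work
                    MPI.COMM_WORLD.send(report, 1, MPI.DOUBLE, 0, 1);
                }
            }
            MPI.Finalize();
        }
    }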

Hybrid : OpenMPI + OpenMP on a cluster

I am numerically solving some ordinary differential equations.
I have a very simple (conceptually) but very long computation. There is a very long array (~2M cells), and for each cell I need to perform a numerical integration. This procedure should be repeated 1000 times. Using OpenMP parallelism on one 24-core machine, it takes around a week, which is not acceptable.
I have a cluster of 20 such (24-core) machines and am thinking about a hybrid implementation: I want to use MPI to distribute the work over these 20 nodes and, within each node, use regular OpenMP parallelism.
Basically, I need to split my very long array into 20 (nodes) x 24 (cores) working units.
Are there any suggestions for a better implementation or better ideas? I've read a lot on this subject and have the impression that such a hybrid implementation does not necessarily bring a real speed-up.
Maybe I should create a "pool of workers" and "feed" them with my array, or something else.
Any suggestion and useful links are welcome!
If your computation is as embarrassingly parallel as you indicate you should expect good speedup by spreading the load across all 20 of your machines. By good I mean close to 20 and by close to 20 I mean any number which you actually get which leaves you thinking that the effort has been worthwhile.
Your proposed hybrid solution is certainly feasible and you should get good speedup if you implement it.
One alternative to a hybrid MPI+OpenMP program would be a job script (written in your favourite scripting language) which simply splits your large array into 20 pieces and starts 20 jobs, one on each machine running an instance of your program. When they've all finished have another script ready to recombine the results. This would avoid having to write any MPI code at all.
If your computer has an installation of Grid Engine you can probably write a job submission script to submit your work as an array job and let Grid Engine take care of parcelling the work out to the individual machines/tasks. I expect that other job management systems have similar facilities but I'm not familiar with them.
Another alternative would be an all-MPI code, that is drop the OpenMP altogether and modify your code to use whatever processors it finds available when you run it. Again, if your program requires little or no inter-process communication you should get good speedup.
Using MPI on a shared memory computer is sometimes a better (in performance terms) approach than OpenMP, sometimes worse. Trouble is, it's difficult to be certain about which approach is better for a particular program on a particular architecture with RAM and cache and interconnects and buses and all the other variables to consider.
One factor I've ignored, largely because you've provided no data to consider, is the load-balancing of your program. If you split your very large dataset into 20 equal-sized pieces, do you end up with 20 equal-duration jobs? If not, and if you have an idea how job time varies with the inputs, you might do something more sophisticated in splitting the job up than simply chopping your dataset into those 20 equal pieces. You might, for instance, chop it into 2000 equal pieces and serve them one at a time to the machinery for execution. In this case what you gain in load-balancing might be at risk of being lost to the time costs of job management. You pays yer money and you takes yer choice.
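Whichever route you take (hybrid MPI+OpenMP, independent jobs driven by a script, or all-MPI), the splitting itself is just index arithmetic. Here is a minimal sketch of that decomposition, written in Java only to match the other examples on this page (your integrator is presumably C/Fortran with OpenMP); the 2,000,000-cell and 2000-piece figures come from the discussion above:

    import java.util.ArrayList;
    import java.util.List;

    public class ChunkPlanner {
        // Split [0, n) into `parts` contiguous ranges whose sizes differ by at most one cell.
        static List<int[]> split(int n, int parts) {
            List<int[]> ranges = new ArrayList<>();
            int base = n / parts, extra = n % parts, start = 0;
            for (int i = 0; i < parts; i++) {
                int len = base + (i < extra ? 1 : 0);   // spread the remainder over the first chunks
                ranges.add(new int[]{start, start + len});
                start += len;
            }
            return ranges;
        }

        public static void main(String[] args) {
            int cells = 2_000_000;
            // Coarse split: one piece per node (20 nodes); each node then uses its 24 cores internally.
            for (int[] r : split(cells, 20)) {
                System.out.println("node range [" + r[0] + ", " + r[1] + ")");
            }
            // Finer split for load balancing: 2000 pieces handed to idle workers one at a time.
            System.out.println("number of 2000-way pieces: " + split(cells, 2000).size());
        }
    }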
From your problem statement I wouldn't be making a decision about which solution to go for on the basis of expected performance, because I'd expect any of the approaches to get into the same ballpark performance-wise, but on the time to develop a working solution.
