Does it make sense to use smaller blocks for jobs that perform more computationally intense tasks?
For example, in my Mapper I'm calculating the distance between two signals, which may take some time depending on the signal length, but on the other hand my dataset is currently not that big. That tempts me to specify a smaller block size (like 16 MB) and to increase the number of nodes in the cluster. Does that make sense?
How should I proceed? And if it is OK to use smaller blocks, how do I do it? I haven't done it before...
Whether that makes sense to do for your job can only really be known by testing the performance. There is an overhead cost associated with launching additional JVM instances, and it's a question of whether the additional parallelization is given enough load to offset that cost and still make it a win.
You can change this setting for a particular job rather than across the entire cluster. You'll have to decide what's a normal use case when deciding whether to make this a global change or not. If you want to make this change globally, you'll put the property in the XML config file or Cloudera Manager. If you only want to do it for particular jobs, put it in the job's configuration.
Either way, the smaller the value in mapreduce.input.fileinputformat.split.maxsize, the more mappers you'll get (it defaults to Integer.MAX_VALUE). That will work for any InputFormat that uses block size to determine its splits (most do, since most extend FileInputFormat).
So to max out your utilization, you might do something like this:
// Target a given number of map tasks, but never let the max split size drop below one cluster block.
long bytesPerMapper = inputSizeInBytes / numberOfMapTasksYouWant;
long splitSize = Math.max(CLUSTER_BLOCK_SIZE, bytesPerMapper);
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", splitSize);
Note that you can also increase the value of mapreduce.input.fileinputformat.split.minsize to reduce the number of mappers (it defaults to 1).
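For example — a hedged one-liner, assuming the same job object as in the snippet above and a 128 MB cluster block size — raising the minimum split size to two blocks will roughly halve the number of mappers:
// Hypothetical: force splits of at least 256 MB (two 128 MB blocks) to get fewer, larger map tasks.
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);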
I have an operation that runs out of memory when using a batch size greater than 4 (I normally run with 32). I thought I could be clever by splitting this one operation along the batch dimension using tf.split, running it on a subset of the batch, and then recombining using tf.concat. For some reason this doesn't work and results in an OOM error. Just to be clear, if I run on a batch size of 4 it works without splitting. If instead I run on a batch size of 32, even if I perform a 32-way split so that each individual element is run independently, I still run out of memory. Doesn't TF schedule separate operations so that they do not overwhelm memory? If not, do I need to explicitly set up some sort of conditional dependence?
I discovered that the functional ops, specifically map_fn in this case, address my needs. By setting the parallel_iterations option to 1 (or some small number that would make the computation fit in memory), I'm able to control the degree of parallelism and avoid running out of memory.
I have a basic mapreduce question.
My input consists of many small files, and I have designed a custom CombineFileInputFormat (which is working properly).
The size of all files together is only about 100 MB for 20 000 files, but processing an individual file takes a couple of minutes (it's a heavy indexing problem), therefore I want as many map tasks as possible. Will Hadoop take care of this, or do I have to enforce it, and how? In the latter case my first guess would be to manipulate the maximum split size, but I am not sure if I am on the right track. Any help greatly appreciated! (Suggestions on how best to set the split size in the latter case are also welcome.)
Some extra information to be more clear:
There is, however, another reason I wanted to process multiple files per task, and that is that I want to be able to use combiners. The output of a single task only produces unique keys, but between several files there might be a substantial overlap. By processing multiple files in the same map task I can implement a combiner or make use of in-mapper combining (a sketch follows at the end of this question). This would definitely limit the amount of IO. The fact is that although a single file is only a couple of kilobytes in size, its output is roughly 30 * 10^6 key-value pairs, which easily adds up to a couple of gigabytes.
I don't think there is another way to benefit from combining (or in-mapper combining) if you have only one file per map task, is there?
Regards, Dieter
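Here is a minimal sketch of the in-mapper combining pattern mentioned above; the class name and the key/value types (text keys, long counts) are hypothetical choices, not taken from the actual job:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that aggregates counts in memory and emits them once per task,
// instead of writing one key-value pair per occurrence.
public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Aggregate locally; nothing is written to the context yet.
        counts.merge(value.toString(), 1L, Long::sum);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the aggregated pairs once, at the end of the task.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
    }
}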
To get the best utilization for your long-running map tasks, you'll probably want each file to run in its own task rather than using your CombineFileInputFormat implementation.
Using a combine input format is usually advisable when you have small files that are quickly processed, as it takes longer to instantiate the map task (JVM, config, etc.) than it does to process the file itself. You can alleviate this by configuring 'JVM reuse', but still, for CPU-bound tasks (as opposed to IO-bound tasks) you'll just want to run a map task for each input file.
You will however need your Job Tracker to have a good chunk of memory allocated to it so it can manage and track the 20k map tasks created.
Edit: In response to your updated question, if you want to use the combined input format then you'll need to set the configuration properties for the min / max split size per node / rack. Hadoop won't be able to do anything more intelligent than trying to keep files that are data-local or rack-local together in the same map task.
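As a rough sketch — assuming a Job object as in the earlier snippet, and using the Hadoop 2.x property names, which you should verify against your version — you could bound the combined splits like this:
// Cap each combined split at roughly 128 MB, and ask for at least 64 MB of local data
// per node / per rack before files spill over into a non-local split.
Configuration conf = job.getConfiguration();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);
conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.node", 64L * 1024 * 1024);
conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 64L * 1024 * 1024);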
We have a list of tasks with different lengths, a number of CPU cores, and a context-switch time.
We want to find the best scheduling of tasks among the cores to maximize processor utilization.
How could we find this?
Isn't it best if we always pick the biggest available task from the list and hand the tasks one by one to whichever cores are currently ready, or do you think we must try all orderings to find out which one is best?
I must add that all cores are ready at time unit 0 and that the tasks are supposed to run concurrently.
The idea here is that there's no silver bullet: you must consider what types of tasks are being executed and try to schedule them as nicely as possible.
CPU-bound tasks don't do much communication (I/O), and thus need to be executed continuously, interrupted only when necessary, according to the policy being used;
I/O-bound tasks may repeatedly be put aside during execution, allowing other processes to work, since they will sleep for long periods waiting for data to be fetched into primary memory;
interactive tasks must be executed continuously, but need not run without interruption, since they generate interrupts while waiting for user input; they do, however, need a high priority, so that the user does not notice delays in execution.
Considering this, and the context-switch costs, you must evaluate what types of tasks you have and choose one or more policies for your scheduler accordingly.
Edit:
I thought this was simply a conceptual question. Since you have to implement a solution, you must analyze the requirements.
Since you have the lengths of the tasks and the context-switch times, and you have to keep the cores busy, this becomes an optimization problem: you want as few cores as possible to sit idle towards the end of the schedule, while also keeping the number of context switches to a minimum, so that your overall execution time does not grow too much.
As pointed out by svick, this sounds like the partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists so that the sums of the lists are equal.
In your problem you'd have a relaxation of the objective: you no longer need all the cores to execute for exactly the same amount of time, but you want the difference between any two cores' execution times to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.
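As a starting point — not the dynamic programming approach from the reference, just a minimal sketch of the greedy heuristic the question describes (always hand the longest remaining task to the least-loaded core), with hypothetical names and ignoring context-switch costs:
import java.util.Arrays;
import java.util.PriorityQueue;

public class LptScheduler {
    // Returns the per-core finish times after assigning each task (given by its length)
    // to the currently least-loaded core, longest tasks first (the LPT heuristic).
    static long[] schedule(long[] taskLengths, int cores) {
        long[] sorted = taskLengths.clone();
        Arrays.sort(sorted);                        // ascending; we walk it backwards below
        PriorityQueue<Long> load = new PriorityQueue<>();
        for (int i = 0; i < cores; i++) load.add(0L);
        for (int i = sorted.length - 1; i >= 0; i--) {
            long least = load.poll();               // least-loaded core so far
            load.add(least + sorted[i]);
        }
        long[] finish = new long[cores];
        int c = 0;
        for (long l : load) finish[c++] = l;
        return finish;
    }

    public static void main(String[] args) {
        long[] finish = schedule(new long[] {8, 7, 6, 5, 4, 3, 2, 1}, 3);
        System.out.println(Arrays.toString(finish)); // per-core loads; the maximum is the makespan
    }
}
For the exact optimum you would still need the DP or exhaustive approach, but this gives a baseline to compare against.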
After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we want to experiment with mapred.reduce.parallel.copies.
The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?
In order to do that you should basically look at four things: CPU, RAM, disk and network. If your setup is crossing the thresholds of these metrics, you can deduce that you are pushing the limits. For example, if you have set "mapred.reduce.parallel.copies" to a value much higher than the number of cores available, you'll end up with too many threads in the waiting state, since threads are created to fetch the map output based on this property. On top of that, the network might get overwhelmed. Or, if there is too much intermediate output to be shuffled, your job will become slow, as you will need a disk-based shuffle in that case, which is slower than a RAM-based shuffle. Choose a sensible value for "mapred.job.shuffle.input.buffer.percent" based on your RAM (it defaults to 70% of the reducer heap, which is normally good). These are the kinds of things that will tell you whether you are over-parallelizing or not. There are a lot of other things you should consider as well; I would recommend going through Chapter 6 of "Hadoop: The Definitive Guide".
Some of the measures you could take to make your jobs efficient are using a combiner to limit the data transfer, enabling intermediate compression, etc.
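As a rough sketch of where these knobs live — using the MR1-era property names from the question (newer releases call them mapreduce.reduce.shuffle.parallelcopies, mapreduce.reduce.shuffle.input.buffer.percent and mapreduce.map.output.compress), and with placeholder values to tune rather than recommendations:
// Hypothetical per-job tuning, assuming a Job object named job.
Configuration conf = job.getConfiguration();
conf.setInt("mapred.reduce.parallel.copies", 10);                 // parallel fetch threads per reducer
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);  // share of reducer heap used for shuffle buffers
conf.setBoolean("mapred.compress.map.output", true);              // compress intermediate map output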
HTH
P.S.: This answer is not specific to just "mapred.reduce.parallel.copies"; it is about tuning your job in general. Frankly, setting only this property is not going to help you much; you should consider the other important properties as well.
Reaching the "sweet spot" is really just finding the parameters that give you the best result for whichever metric you consider most important, usually overall job time. To figure out which parameters are working, I would suggest using the benchmarking tools that Hadoop comes with: MrBench, TestDFSIO, and NNBench. These are found in hadoop-mapreduce-client-jobclient-*.jar.
By running this command you will see a long list of benchmark programs that you can use besides the ones I mentioned above.
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*.jar
I would suggest running with the default parameters first to establish baseline benchmarks, then changing one parameter at a time and rerunning. It's a bit time-consuming but worth it, especially if you use a script to change the parameters and run the benchmarks.
I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large compared to the time each operation takes that simply creating a task for each operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split the operations into nicely sized chunks, where each task operates on one chunk. But how can I determine the appropriate number of tasks/chunks?
Testing, and profiling. What makes sense, and what works well is application specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
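A minimal sketch of such a sweep — here in Java, with a fixed thread pool standing in for the real task system and a dummy workload, so every name and number below is a placeholder:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkSweep {
    static final int TOTAL_OPS = 200_000;

    // Simulates processing one chunk of operations; replace with the real work.
    static void processChunk(int chunkSize) {
        long x = 0;
        for (int i = 0; i < chunkSize * 1_000; i++) x += i; // dummy CPU work
    }

    // Runs TOTAL_OPS operations as TOTAL_OPS/chunkSize tasks on `workers` threads
    // and returns operations completed per second.
    static double throughput(int workers, int chunkSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long start = System.nanoTime();
        int tasks = TOTAL_OPS / chunkSize;
        for (int t = 0; t < tasks; t++) pool.submit(() -> processChunk(chunkSize));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        double seconds = (System.nanoTime() - start) / 1e9;
        return TOTAL_OPS / seconds;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int workers : new int[] {2, 4, 8}) {
            for (int chunk : new int[] {10, 100, 1000}) {
                System.out.printf("workers=%d chunk=%d -> %.0f ops/sec%n",
                        workers, chunk, throughput(workers, chunk));
            }
        }
    }
}
The (workers, chunk) pair with the highest printed throughput is the combination you'd pick.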
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time while others take X*3 time), then you can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data gives you an idea of whether that task will take X time or X*3 (or something in between), you can use that information before processing the tasks to dynamically adjust the worker/chunk size and achieve the best throughput for the current workload. This approach is taken with Amazon EC2, for example, where customers spin up extra VMs when needed to handle higher load and spin them back down when load drops.
Whatever you choose, an unknown speed issue should almost always involve some kind of benchmarking, at least if the speed at which it runs is critical to the success of your application (sometimes the processing time is so small that it's negligible).
Good luck!