What's the best value for the -j switch?
I usually set it to the number of CPUs/cores available.
Thanks.
I've always seen the number of available cores plus one as the recommended value.
Just measure.
Start with the number of cores, then add one at a time until you hit diminishing returns.
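A hedged way to measure that (assuming a Linux box where nproc reports the core count, and a tree you can rebuild from clean):

for j in $(nproc) $(($(nproc) + 1)) $(($(nproc) + 2)); do
    make clean > /dev/null            # start each run from the same state
    echo "== make -j$j =="
    time make -j"$j" > /dev/null      # compare the wall-clock times
done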
I am using the "make" command to compile. I know that if I use "make -jN", N indicates the number of jobs. But if I don't put any number after -j, what does it mean?
No number means no limit.
From the GNU Make manual:
If the '-j' option is followed by an integer, this is the number of recipes to execute at once; this is called the number of job slots. If there is nothing looking like an integer after the '-j' option, there is no limit on the number of job slots. The default number of job slots is one, which means serial execution (one thing at a time).
Does it make sense to use smaller blocks for jobs that perform more intensive tasks?
For example, in my Mapper I'm calculating the distance between two signals, which may take some time depending on the signal length, but on the other hand my dataset is currently not that big. That tempts me to specify a smaller block size (like 16 MB) and to increase the number of nodes in the cluster. Does that make sense?
How should I proceed? And if it is OK to use smaller blocks, how do I do it? I haven't done it before...
Whether that makes sense to do for your job can only really be known by testing the performance. There is an overhead cost associated with launching additional JVM instances, and it's a question of whether the additional parallelization is given enough load to offset that cost and still make it a win.
You can change this setting for a particular job rather than across the entire cluster. You'll have to decide what the normal use case is when deciding whether to make this a global change or not. If you want to make the change globally, put the property in the XML config file or in Cloudera Manager. If you only want it for particular jobs, put it in the job's configuration.
Either way, the smaller the value of mapreduce.input.fileinputformat.split.maxsize, the more mappers you'll get (it defaults to Integer.MAX_VALUE). That will work for any InputFormat that uses the block size to determine its splits (most do, since most extend FileInputFormat).
So to max out your utilization, you might do something like this:
// One split per desired task, but never smaller than the cluster block size.
long bytesPerReducer = inputSizeInBytes / numberOfReduceTasksYouWant;
long splitSize = Math.max(CLUSTER_BLOCK_SIZE, bytesPerReducer);
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", splitSize);
Note that you can also increase the value of mapreduce.input.fileinputformat.split.minsize to reduce the number of mappers (it defaults to 1).
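If you do decide to shrink the splits for a CPU-heavy mapper, a minimal per-job sketch (the class and job name are hypothetical; 16 MB is just the figure floated in the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        // FileInputFormat picks each split size as max(minsize, min(maxsize, blockSize)),
        // so capping maxsize below the block size means more, smaller splits -> more mappers.
        Job job = Job.getInstance(new Configuration(), "signal-distance"); // hypothetical name
        job.getConfiguration().setLong(
                "mapreduce.input.fileinputformat.split.maxsize", 16L * 1024 * 1024); // 16 MB
        // Raising the floor has the opposite effect (fewer mappers):
        // job.getConfiguration().setLong(
        //         "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
    }
}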
I've got eight materialized views, each containing about a thousand rows. They are refreshed with FORCE on demand in a very time-critical job which runs every minute. While refreshing, the views still need to deliver data.
I'd like to use the following command for refreshing:
BEGIN
  dbms_mview.refresh(list           => 'MVIEW1, MVIEW2, [...]',
                     atomic_refresh => TRUE);
END;
Now there is also the parallelism parameter. I thought it would be clever to set a well-considered value for it.
Are there generally accepted guidelines for this parameter's value? Should it be equal to the number of materialized views (within sane limits)?
Thanks for your help.
When choosing the parallelism parameter, as with any parallel operation in Oracle, you should really consider the number of CPUs and the available I/O capacity. Also consider whether you can afford to consume all available CPU, or whether you need to leave some capacity for other users.
Also, note that even if you set the parallelism parameter, parallelism won't kick in, unless the materialized view was created as parallel.
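As a hedged sketch of that combination (the degree of 4 and the view names are placeholders, not recommendations):

ALTER MATERIALIZED VIEW MVIEW1 PARALLEL 4;

BEGIN
  dbms_mview.refresh(list           => 'MVIEW1, MVIEW2',
                     atomic_refresh => TRUE,
                     parallelism    => 4);
END;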
There's a nice little whitepaper on the subject here:
http://www.doug.org/newsletter/march/MV_Refresh_Parallel.pdf
Hope that helps...
After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we want to experiment with mapred.reduce.parallel.copies.
The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?
To do that you should basically watch four things: CPU, RAM, disk, and network. If your setup is crossing the threshold on any of these metrics, you can deduce that you are pushing the limits. For example, if you set "mapred.reduce.parallel.copies" to a value much higher than the number of cores available, you'll end up with too many threads in the waiting state, since the threads that fetch the map output are created based on this property. On top of that, the network might get overwhelmed. And if there is too much intermediate output to shuffle, your job will become slow, since you'll need a disk-based shuffle in that case, which is slower than a RAM-based shuffle. Choose a sensible value for "mapred.job.shuffle.input.buffer.percent" based on your RAM (it defaults to 70% of the reducer heap, which is normally good).
These are the kinds of signals that tell you whether you are over-parallelizing or not. There are a lot of other things you should consider as well. I would recommend going through Chapter 6 of "Hadoop: The Definitive Guide".
Some measures you could take to make your jobs efficient include using a combiner to limit data transfer, enabling intermediate compression, etc.
HTH
P.S.: This answer is not specific to "mapred.reduce.parallel.copies" alone; it is about tuning your job in general. Setting only this property is not going to help you much on its own. You should consider other important properties as well.
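Pulling the properties mentioned above together, a hedged sketch of setting them on a job (the values and the job name are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Copier threads each reducer uses to fetch map output in parallel (default 5).
        conf.setInt("mapred.reduce.parallel.copies", 10);
        // Fraction of reducer heap that buffers shuffled map output (default 0.70).
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapred.compress.map.output", true);

        Job job = Job.getInstance(conf, "tuned-job"); // hypothetical job name
        // A combiner limits data transfer; reusing the reducer only works when
        // the reduce function is commutative and associative.
        // job.setCombinerClass(MyReducer.class);     // hypothetical class
    }
}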
Reaching the "sweet spot" is really just finding the parameters that give you the best result for whichever metric you consider most important, usually overall job time. To figure out which parameters are working, I would suggest using the profiling tools that Hadoop comes with: MrBench, TestDFSIO, and NNBench. These are found in hadoop-mapreduce-client-jobclient-*.jar.
By running the following command you will see a long list of benchmark programs that you can use besides the ones I mentioned above:
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*.jar
I would suggest starting with the default parameters to establish baseline benchmarks, then changing one parameter at a time and rerunning. It's a bit time-consuming but worth it, especially if you use a script to change the parameters and run the benchmarks.
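A hedged sketch of such a script (the jar path matches the command above; whether mrbench accepts the generic -D option in your version is an assumption worth verifying):

#!/bin/sh
# Sweep one parameter at a time and record the overall run time of each benchmark.
JAR=./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*.jar
for copies in 5 10 20; do
    echo "== mapred.reduce.parallel.copies=$copies =="
    time hadoop jar $JAR mrbench -D mapred.reduce.parallel.copies=$copies -numRuns 3
done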
I'm not even sure the following is possible, but it never hurts to ask:
I have two nodes running the same application. Each machine needs a sequence generator that will give it a number between 0 and 1e6. If a node has used a number, the other node must not use it. The generator should reset every night at midnight. No number should be used twice in the same day, even if a machine restarts. We'd like to avoid any solution involving databases, distributed caches or filesystems. Let's assume we will never need more than 1e6 numbers per day. The numbers do not have to be used in sequence.
So far we have thought of the following:
1) Machine A uses odd numbers, machine B uses even numbers.
Pros: no shared state.
Cons: a machine might run out of numbers when there are plenty left. If a machine restarts, it will reuse previously used numbers.
2) Machine A counts up from 0 to 1e6, machine B counts down from 1e6 to 0.
Pros: no shared state. Guarantees that all available numbers will be consumed before running into problems.
Cons: doesn't scale to more than two machines. Same problem when a machine restarts.
What do you think? Is there a magic algorithm that will fulfill our requirements without needing to write anything to disk?
No number should be used twice in the same day, even if a machine restarts.
Since you don't want to use any persistent state, this suggests to me that the number must depend on the time somehow. That is the only way in which the algorithm can tell two distinct startups apart. Can't you just use a combination (node, timestamp) for sufficiently fine timestamps, instead of your numbers?
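A hedged sketch of that idea (assuming two nodes with reasonably synchronized clocks, at most one ID per node per 200 ms, and that a restart takes longer than one tick):

import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;

public class TimestampIds {
    private static final int NUM_NODES = 2;
    private static final long TICK_MS = 200;  // (86,400,000 / 200) * 2 = 864,000 IDs/day < 1e6
    private final int nodeId;                 // 0 or 1
    private long lastTick = -1;

    public TimestampIds(int nodeId) { this.nodeId = nodeId; }

    // Derive the ID purely from the time of day: nothing is persisted,
    // the sequence resets at midnight, and a restart cannot rewind the clock.
    public synchronized long next() throws InterruptedException {
        while (true) {
            long millis = Duration.between(
                    LocalDate.now().atStartOfDay(), LocalDateTime.now()).toMillis();
            long tick = millis / TICK_MS;
            if (tick != lastTick) {
                lastTick = tick;
                return tick * NUM_NODES + nodeId;        // disjoint ranges per node
            }
            Thread.sleep(TICK_MS - (millis % TICK_MS));  // wait out the current tick
        }
    }
}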
Why not just have a small service that hands out IDs upon request? This scales to more than two machines, and doesn't require a change to the clients if you need to change the ID allocation algorithm. It is rather simple to implement and quite easy to maintain going forward.
I really think the best way would be to have some machine that hands out numbers on request (maybe even number ranges, if you want to avoid too many queries) and writes things out to disk.
If you're really against that, you could be clever with method 1, provided you can guarantee the rate at which numbers are consumed. For example, the machine could use the current time to determine where in its range to begin: if it's noon, begin at the middle of the range. This could be tweaked if you can put an upper limit on the number of IDs generated per second (or per time interval). It still has the problem of wasted numbers and is pretty convoluted just to avoid writing a single number to disk.
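A hedged sketch of that tweak (assumptions: two nodes, the odd/even interleaving from method 1, and an agreed cap of five IDs per second per node; midnight reset and cap enforcement are left out):

import java.time.LocalTime;

public class TimeOffsetIds {
    private static final int NUM_NODES = 2;
    private static final long MAX_PER_SECOND = 5;  // 86,400 s * 5 * 2 nodes = 864,000 < 1e6
    private final int nodeId;                      // 0 takes even IDs, 1 takes odd
    private long counter;

    public TimeOffsetIds(int nodeId) {
        this.nodeId = nodeId;
        // Start where the clock says we "should" be; as long as the cap was
        // honored before a restart, we can never land behind an ID already used.
        counter = (long) LocalTime.now().toSecondOfDay() * MAX_PER_SECOND;
    }

    public synchronized long next() {
        return (counter++) * NUM_NODES + nodeId;   // interleaved, disjoint per node
    }
}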