More than 120 counters in Hadoop

There's a limit on the number of Hadoop counters, 120 by default. I tried using the configuration property "mapreduce.job.counters.limit" to change that, but it doesn't work. I've looked at the source code; it seems the JobConf instance in the class "org.apache.hadoop.mapred.Counters" is private.
Has anybody seen this before? What's your solution?
Thanks :)

You can override that property in mapred-site.xml on your JobTracker, TaskTracker, and client nodes, but be aware that this is a system-wide modification:
<configuration>
  ...
  <property>
    <name>mapreduce.job.counters.limit</name>
    <value>500</value>
  </property>
  ...
</configuration>
Then restart the MapReduce service on your cluster.

In Hadoop 2, this configuration parameter is called
mapreduce.job.counters.max
Setting it on the command line or in your Configuration object isn't enough, though. You need to call the static method
org.apache.hadoop.mapreduce.counters.Limits.init()
in the setup() method of your mapper or reducer to get the setting to take effect.
Tested with 2.6.0 and 2.7.1.
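For illustration, a minimal sketch of a mapper making that call in setup(); class, group, and counter names are hypothetical, and in the 2.6/2.7 code the init() method takes the job Configuration:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.counters.Limits;

// Hypothetical mapper that re-initializes the counter limits before using many counters.
public class ManyCountersMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Re-read the limits from the job configuration so mapreduce.job.counters.max takes effect.
        Limits.init(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One counter per distinct input value; only feasible once the limit has been raised.
        context.getCounter("values", value.toString()).increment(1);
        context.write(value, new LongWritable(1));
    }
}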

These parameters are set via the config file; when set there, the parameters below will take effect:
mapreduce.job.counters.max=1000
mapreduce.job.counters.groups.max=500
mapreduce.job.counters.group.name.max=1000
mapreduce.job.counters.counter.name.max=500

Just adding this in case anyone else faces the same problem we did: increasing the counters from within MRJob.
To raise the number of counters, add emr_configurations to your mrjob.conf (or pass it to MRJob as a config parameter):
runners:
  emr:
    emr_configurations:
      - Classification: mapred-site
        Properties:
          mapreduce.job.counters.max: 1024
          mapreduce.job.counters.counter.name.max: 256
          mapreduce.job.counters.groups.max: 256
          mapreduce.job.counters.group.name.max: 256

We can customize the limits as command-line options for specific jobs only, instead of making the change in mapred-site.xml:
-Dmapreduce.job.counters.limit=x
-Dmapreduce.job.counters.groups.max=y
NOTE: x and y are custom values based on your environment/requirement.
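These -D options are only parsed into the job configuration when the driver goes through GenericOptionsParser, typically via ToolRunner; a minimal sketch of such a driver, with hypothetical class and job names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver; ToolRunner applies the -D options to getConf() before run() is invoked.
public class CounterLimitDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "counter-limit-demo");
        job.setJarByClass(CounterLimitDriver.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new CounterLimitDriver(), args));
    }
}
The jar would then be launched with the -D options listed above placed before the job-specific arguments.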

Related

Apache Nutch 2.3.1, increase reducer memory

I have set up a small Hadoop cluster with HBase for Nutch 2.3.1. The Hadoop version is 2.7.7 and HBase is 0.98. I have customized a Hadoop job and now I have to set the memory for the reducer task in the driver class. I have come to know that in plain Hadoop MR jobs you can use the JobConf method setMemoryForReducer, but there isn't any such option available in Nutch. In my case, currently, reducer memory is set to 4 GB via mapred-site.xml (Hadoop configuration), but for Nutch I have to double it.
Is it possible without changing the Hadoop conf files, either via the driver class or nutch-site.xml?
Finally, I was able to find the solution. NutchJob accomplishes the objective. Following is the code snippet:
NutchJob job = NutchJob.getInstance(getConf(), "rankDomain-update");
int reducer_mem = 8192;
String memory = "-Xmx" + (int) (reducer_mem * 0.8) + "m";
job.getConfiguration().setInt("mapreduce.reduce.memory.mb", reducer_mem);
job.getConfiguration().set("mapreduce.reduce.java.opts", memory);
// rest of code below

Hadoop passing variables from reducer to main

I am working on a MapReduce program. I'm trying to pass parameters to the context configuration in the reduce method using the setLong method, and then read them in main after the job completes.
In the reducer:
context.getConfiguration().setLong(key, someLong);
In main, after job completion, I try to read it using:
long val = job.getConfiguration().getLong(key, -1);
but I always get -1.
When I try reading inside the reducer, I see that the value is set and I get the correct answer.
Am I missing something?
Thank you
You can use counters: set and update their value in the reducers, and then access them in your client application (main).
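A minimal sketch of that approach, assuming a hypothetical counter enum and a sum-style reducer:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CounterPassBack {

    // Hypothetical counter used to carry a value from the reducers back to the client.
    public enum Stats { SOME_LONG }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            // Instead of context.getConfiguration().setLong(...), record the value in a counter.
            context.getCounter(Stats.SOME_LONG).increment(sum);
            context.write(key, new LongWritable(sum));
        }
    }

    // In main, after job.waitForCompletion(true), the aggregated value is visible on the client:
    static long readBack(Job job) throws IOException {
        return job.getCounters().findCounter(Stats.SOME_LONG).getValue();
    }
}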
You can pass configuration from main to a map or reduce task, but you cannot pass it back. Configuration propagation works like this:
A configuration file is generated on the MapReduce client based on the configuration you set in main, and it is pushed to an HDFS path shared only by the job. The file is read-only.
When a map or reduce task launches, the configuration file is pulled from that HDFS path, and the task initializes its configuration from the file.
If you want to pass information back, you can use another HDFS file: write it from the reducer, and read it after the job completes.
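A minimal sketch of that side channel, with a hypothetical helper class; note that with multiple reducers each one would need its own path or path suffix:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper for passing a single value from a reducer back to the client via HDFS.
public class HdfsResultFile {

    // Call from the reducer (for example in cleanup()) to persist the value.
    public static void writeResult(Configuration conf, Path path, long value) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeLong(value);
        }
    }

    // Call from main after the job has completed to read the value back.
    public static long readResult(Configuration conf, Path path) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(path)) {
            return in.readLong();
        }
    }
}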

Override storm.yaml / setting in Java bolts/spouts

I need help with tuning Apache Storm. I have run a command on the Nimbus server to increase the number of executors for a spout and for a bolt.
My question is simple. Does the command:
storm rebalance TopologyName -e spout/or/bolt=
override the parallelism hints in the Java code?
I ran this and did not see a change in the web GUI.
Also, is there a way to override this parameter from the storm.yaml file?
topology.max.spout.pending: 1000
Thanks for any help on this. I have an excellent book on Storm, but I cannot find out why my changes are not being reflected after the rebalance...
Did you set the number of tasks high enough? See here for further details:
Rebalancing executors in Apache Storm
https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
So yes, it does override the parallelism hint, but only if applicable.
And yes, you can use storm.yaml to set the default "max pending" parameter. This value can be changed for each topology individually by overriding the default in the configuration you provide when submitting the topology:
Config conf = new Config();
conf.setMaxSpoutPending( /* put your value here */ );
StormSubmitter.submitTopology("topologyName", conf, builder.createTopology());

hadoop: how to increase the limit of failed tasks

I want to run a job so that all task failures are just logged and are otherwise ignored (basically to test my input). Right now, when a task fails I get "# of failed Map Tasks exceeded allowed limit". How do I increase the limit?
I use Hadoop 1.2.1
Specify mapred.max.map.failures.percent and mapred.max.reduce.failures.percent in mapred-site.xml to set the failure threshold. Both default to 0. Check the code of JobConf.java for more details.
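For a per-job alternative to editing mapred-site.xml, the old (Hadoop 1.x) JobConf API exposes setters for the same thresholds; a hedged sketch with hypothetical class and job names:
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical old-API driver; allowing 100% failed tasks means failures are logged
// but never cause the whole job to fail.
public class TolerantJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TolerantJobDriver.class);
        conf.setJobName("tolerant-job");
        // Programmatic equivalents of mapred.max.map.failures.percent and
        // mapred.max.reduce.failures.percent.
        conf.setMaxMapTaskFailuresPercent(100);
        conf.setMaxReduceTaskFailuresPercent(100);
        // ... set mapper, reducer, input and output paths here ...
        JobClient.runJob(conf);
    }
}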
In order to increase the limit of MapTasks, try adding the following to the mapred-site.xml file:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>{cores}</value>
</property>
This sets the maximum number of MapTasks per TaskTracker. In place of {cores}, substitute the number of cores you have; setting it to the exact number of available cores is not considered good practice. Let me know if you have any questions.
Hope this helps.
Happy Hadooping!!!

How RecommenderJob(org.apache.mahout.cf.taste.hadoop.item.RecommenderJob) will call my custom mappers and reducers?

I am running the Mahout in Action example for chapter 6 using the command:
"hadoop jar target/mia-0.1-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData"
But the mappers and reducers in the chapter 06 example are not working?
You have to change the code to use the custom Mapper and Reducer classes you have in mind; otherwise, of course, it runs the ones that are currently in the code. Add them, change the caller, recompile, and run it all on Hadoop. I am not sure what you are referring to that is not working.
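To illustrate what "change the caller" means in plain Hadoop terms, here is a generic, hedged sketch of wiring custom classes into a job; all class names are hypothetical, and RecommenderJob assembles its own pipeline internally, so the real change happens inside that code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomStepDriver {

    // Hypothetical placeholders for the custom classes the job should run.
    public static class MyCustomMapper extends Mapper<LongWritable, Text, LongWritable, Text> { }
    public static class MyCustomReducer extends Reducer<Text, LongWritable, Text, LongWritable> { }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "custom-step");
        job.setJarByClass(CustomStepDriver.class);
        job.setMapperClass(MyCustomMapper.class);    // the "caller" is told which classes to use
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}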
