New Distributed Cache API Hadoop 2 backward compatibility - hadoop

I would like to know whether the new Distributed Cache API is backward compatible with Hadoop 1.
If I change my code to adhere to the new API (since the old one is deprecated), will it still work on a Hadoop 1 cluster?
By new I mean:
Configuration conf = getConf();
...
Job job = Job.getInstance(conf);
...
job.addCacheFile(new URI(filename));
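For reference, here is a minimal sketch contrasting the two styles; the job name and file path are placeholders of mine, and the Hadoop 1 calls are shown only as comments for comparison:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheApiSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // New (org.apache.hadoop.mapreduce) style: register the cache file on the Job.
        Job job = Job.getInstance(conf, "cache-example");   // placeholder job name
        job.addCacheFile(new URI("/user/me/lookup.txt"));   // placeholder HDFS path

        // Old Hadoop 1 style (deprecated in Hadoop 2), shown for comparison only:
        // org.apache.hadoop.filecache.DistributedCache.addCacheFile(
        //         new URI("/user/me/lookup.txt"), job.getConfiguration());

        // Inside a new-API mapper/reducer the cached files come back via:
        // URI[] cached = context.getCacheFiles();
    }
}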

Related

Apache Nutch 2.3.1, increase reducer memory

I have set up a small Hadoop cluster with HBase for Nutch 2.3.1. The Hadoop version is 2.7.7 and HBase is 0.98. I have customized a Hadoop job and now I have to set the memory for the reducer task in the driver class. I have come to know that in plain Hadoop MR jobs you can use the JobConf method setMemoryForReduceTask, but there isn't any such option available in Nutch. In my case, the reducer memory is currently set to 4 GB via mapred-site.xml (the Hadoop configuration), but for Nutch I have to double it.
Is it possible to do this without changing the Hadoop conf files, either via the driver class or nutch-site.xml?
Finally, I was able to find the solution. NutchJob does the job. The following is the code snippet:
NutchJob job = NutchJob.getInstance(getConf(), "rankDomain-update");
int reducer_mem = 8192;                                    // reduce container size in MB
String memory = "-Xmx" + (int) (reducer_mem * 0.8) + "m";  // heap at ~80% leaves headroom for non-heap memory
job.getConfiguration().setInt("mapreduce.reduce.memory.mb", reducer_mem);
job.getConfiguration().set("mapreduce.reduce.java.opts", memory);
// rest of code below

Job Configuration Object

I am implementing a new scheduling algorithm for Hadoop called TaskTrackerAware Scheduler. I have to configure some properties such as mapred.tascheduler.task.max (the maximum number of tasks that can run on a tasktracker for a single job) and mapred.tascheduler.hosts (the host names of the tasktrackers on which the job needs to run). How do I configure these properties in the Job configuration object?
Find below a snippet that may help you:
Configuration conf = new Configuration();
conf.set("mapred.tascheduler.task.max", "<value>");
conf.set("mapred.tascheduler.hosts", "<value>");
Job job = Job.getInstance(conf, "app name");
Let me know if you need any further help..
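For completeness, a compilable sketch of the same idea; the values are placeholders of mine, and how the custom TaskTrackerAware scheduler actually consumes these properties depends on its implementation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaSchedulerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.tascheduler.task.max", 4);        // placeholder value
        conf.set("mapred.tascheduler.hosts", "host1,host2");  // placeholder host list
        Job job = Job.getInstance(conf, "app name");

        // Any code that holds the job's Configuration can read the values back:
        int maxTasks = job.getConfiguration().getInt("mapred.tascheduler.task.max", 1);
        String hosts = job.getConfiguration().get("mapred.tascheduler.hosts");
        System.out.println(maxTasks + " tasks max on hosts " + hosts);
    }
}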

Hadoop MapReduce log4j - log messages to a custom file in userlogs/job_ dir?

It's not clear to me how one should configure Hadoop MapReduce log4j at the job level. Can someone help me answer these questions?
1) How do I add log4j logging support from a client machine? I.e. I want to use a log4j property file on the client machine, and hence don't want to disturb the Hadoop log4j setup in the cluster. I would think having the property file in the project/jar should suffice, and Hadoop's distributed cache should do the rest when transferring the map-reduce jar.
2) How do I log messages to a custom file in the $HADOOP_HOME/logs/userlogs/job_/ dir?
3) Will the map-reduce tasks use both log4j property files, the one supplied by the client job and the one present in the Hadoop cluster? If yes, would the log4j.rootLogger combine both property values?
Thanks
Srivatsan Nallazhagappan
You can configure log4j directly in your code. For example, you can call PropertyConfigurator.configure(properties), e.g. in the mapper/reducer setup method.
This is an example with the properties stored on HDFS:
InputStream is = fs.open(log4jPropertiesPath);
Properties properties = new Properties();
properties.load(is);
PropertyConfigurator.configure(properties);
where fs is a FileSystem object and log4jPropertiesPath is a path on HDFS.
With this you can also output logs to a directory named after the job id. For example, you can modify the properties before calling PropertyConfigurator.configure(properties):
// Replace a JOB_ID_PATTERN placeholder in the property values (e.g. in a file
// appender path) with the actual job id taken from the task context.
Enumeration propertiesNames = properties.propertyNames();
while (propertiesNames.hasMoreElements()) {
    String propertyKey = (String) propertiesNames.nextElement();
    String propertyValue = properties.getProperty(propertyKey);
    if (propertyValue.indexOf(JOB_ID_PATTERN) != -1) {
        properties.setProperty(propertyKey,
                propertyValue.replace(JOB_ID_PATTERN, context.getJobID().toString()));
    }
}
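If you would rather not keep the properties file on HDFS (question 1), a variant of the same idea, as a minimal sketch, is to bundle the file inside the job jar and load it from the classpath in the setup method; the resource name below is a placeholder:

import java.io.InputStream;
import java.util.Properties;
import org.apache.log4j.PropertyConfigurator;

public class ClasspathLog4jSetup {
    public static void configureFromJar() throws Exception {
        // Assumes log4j-task.properties is packaged at the root of the job jar.
        try (InputStream is = ClasspathLog4jSetup.class
                .getResourceAsStream("/log4j-task.properties")) {
            Properties properties = new Properties();
            properties.load(is);
            PropertyConfigurator.configure(properties);   // same call as in the HDFS variant
        }
    }
}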
There is no straightforward way to override the log4j properties at each job level.
A MapReduce job itself doesn't store its logs in HDFS; it writes them to the local file system (${hadoop.log.dir}/userlogs) of the datanodes. There is a separate process in YARN called log aggregation which collects those logs and combines them.
Use yarn logs -applicationId <appId> to fetch the full log, then use standard unix commands to parse out the part of the log you need.

Find Job Status by Job name or id for a hadoop mapreduce job

I'm very new to Hadoop and have a question.
I'm submitting (or creating) MapReduce jobs using the Hadoop Job API v2 (i.e. the mapreduce package rather than the old mapred one).
I submit MR jobs based on our own jobs. We maintain the Hadoop job name in this table.
I want to track the submitted jobs' progress (and thus completion) so that we can mark our own jobs as complete.
All the job status APIs require a Job object, whereas our 'Job Monitoring' module does not have any Job object available.
Can you please help us with any way to get the job status given a job name? We make sure job names are unique.
I googled quite a bit only to find the snippet below. Is this the way to go? Is there no other way in the v2 (.mapreduce., not .mapred.) API to get a job's status given the JobId?
Configuration conf = new Configuration();
JobClient jobClient = new JobClient(new JobConf(conf));   // deprecation WARN
JobID jobID = JobID.forName(jobIdString);                 // deprecation WARN; jobIdString holds the "job_..." id
RunningJob runningJob = jobClient.getJob(jobID);
Field field = runningJob.getClass().getDeclaredField("status"); // reflection !!!
field.setAccessible(true);
JobStatus jobStatus = (JobStatus) field.get(runningJob);
http://blog.erdemagaoglu.com/post/9407457968/hadoop-mapreduce-job-statistics-a-fraction-of-them
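As a possible alternative, here is a sketch that stays entirely within the new mapreduce API (org.apache.hadoop.mapreduce.Cluster), without reflection; the job id and job name below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.JobStatus;

public class JobStatusLookup {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());

        // Look up by job id (placeholder id string):
        Job job = cluster.getJob(JobID.forName("job_1406089840171_0002"));
        if (job != null) {
            JobStatus status = job.getStatus();
            System.out.println(status.getJobName() + " is " + status.getState());
        }

        // Or scan all known jobs and match on the (unique) job name:
        for (JobStatus s : cluster.getAllJobStatuses()) {
            if ("my-unique-job-name".equals(s.getJobName())) {   // placeholder name
                System.out.println(s.getJobID() + " -> " + s.getState());
            }
        }
        cluster.close();
    }
}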

MapReduce Distributed Cache

I am adding a file to the Hadoop distributed cache using:
Configuration cng=new Configuration();
JobConf conf = new JobConf(cng, Driver.class);
DistributedCache.addCacheFile(new Path("DCache/Orders.txt").toUri(), cng);
where DCache/Orders.txt is a file in HDFS.
When I try to retrieve this file from the cache in the configure method of my mapper using:
Path[] cacheFiles=DistributedCache.getLocalCacheFiles(conf);
I get a null pointer. What could the error be?
Thanks
DistributedCache doesn't work in single-node (local) mode; it just returns a null pointer. Or at least that was my experience with the current version.
I think the URL is supposed to start with the hdfs:// scheme.
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#DistributedCache
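Along the lines of that suggestion, a minimal sketch that uses a fully qualified hdfs:// URI and registers the file on the JobConf that is actually submitted (note that in the original snippet the file is added to cng after the JobConf has already been copied from it, so the setting may never reach the job); the namenode address and paths are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
    public static JobConf buildJobConf() throws Exception {
        Configuration cng = new Configuration();
        JobConf conf = new JobConf(cng, CacheSketch.class);
        // Register the file on the JobConf that gets submitted, with a fully
        // qualified hdfs:// URI (namenode host/port and path are placeholders).
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:8020/user/me/DCache/Orders.txt"), conf);
        return conf;
    }

    // Called from the mapper's configure(JobConf job) method:
    public static void readCache(JobConf job) throws Exception {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
        if (cacheFiles != null) {
            for (Path p : cacheFiles) {
                System.out.println("local copy: " + p);   // local path on the task node
            }
        }
    }
}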
