MapReduce Distributed Cache - caching

I am adding a file to Hadoop's distributed cache using:
Configuration cng=new Configuration();
JobConf conf = new JobConf(cng, Driver.class);
DistributedCache.addCacheFile(new Path("DCache/Orders.txt").toUri(), cng);
where DCache/Orders.txt is a file in HDFS.
When I try to retrieve this file from the cache in the mapper's configure method using:
Path[] cacheFiles=DistributedCache.getLocalCacheFiles(conf);
I get a null pointer. What could be causing this?
Thanks

DistributedCache doesn't work in single-node mode; it just returns a null pointer. At least, that was my experience with the current version.
I think the URI is supposed to start with the hdfs:// scheme.
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#DistributedCache
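The second suggestion above can be checked without a cluster. The following is a minimal sketch using only java.net.URI that shows the difference between the relative path from the question and a fully qualified HDFS URI. The class name CacheUriCheck, the host namenode, the port 8020, and the /user/hadoop directory are all assumptions made for illustration.

```java
import java.net.URI;

final class CacheUriCheck {
    /** Returns true when the URI names its file system explicitly via the hdfs scheme. */
    static boolean hasHdfsScheme(URI uri) {
        // getScheme() is null for a relative URI, so this is false in that case
        return "hdfs".equalsIgnoreCase(uri.getScheme());
    }

    public static void main(String[] args) {
        // Relative path as in the question: no scheme, so the file system is implied
        URI relative = URI.create("DCache/Orders.txt");
        // Hypothetical fully qualified form; host, port, and user directory are assumptions
        URI qualified = URI.create("hdfs://namenode:8020/user/hadoop/DCache/Orders.txt");

        System.out.println(relative + " -> hdfs scheme? " + hasHdfsScheme(relative));
        System.out.println(qualified + " -> hdfs scheme? " + hasHdfsScheme(qualified));
    }
}
```

It may also be worth checking that the file is registered on the configuration object the job is actually submitted with. In the question it is added to cng after conf was constructed from it, and JobConf copies the configuration at construction time, so that could be another reason for the null result.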

Related

Apache Nutch 2.3.1, increase reducer memory

I have set up a small Hadoop cluster with HBase for Nutch 2.3.1. The Hadoop version is 2.7.7 and HBase is 0.98. I have customized a Hadoop job and now need to set the memory for the reducer task in the driver class. I understand that in plain Hadoop MR jobs you can use the JobConf method setMemoryForReduceTask, but no such option seems to be available in Nutch. In my case, reducer memory is currently set to 4 GB via mapred-site.xml (Hadoop configuration), but for Nutch I have to double it.
Is this possible without changing the Hadoop conf files, either via the driver class or nutch-site.xml?
I eventually found a solution: NutchJob does the job. The code snippet follows:
NutchJob job = NutchJob.getInstance(getConf(), "rankDomain-update");
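int reducer_mem = 8192; // container size for each reduce task, in MB
// Give the JVM heap 80% of the container; the rest is left for non-heap memory
String memory = "-Xmx" + (int) (reducer_mem * 0.8) + "m";
job.getConfiguration().setInt("mapreduce.reduce.memory.mb", reducer_mem);
job.getConfiguration().set("mapreduce.reduce.java.opts", memory);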
// rest of code below
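To make the heap arithmetic in the snippet above easy to check, here is a small stand-alone sketch that computes the same -Xmx string without needing Nutch or Hadoop on the classpath. The class name ReducerMemory and the helper heapOption are hypothetical; only the two property names and the 0.8 factor come from the snippet.

```java
final class ReducerMemory {
    /** Builds a -Xmx option sized as a fraction of the container memory (in MB). */
    static String heapOption(int containerMb, double heapFraction) {
        if (containerMb <= 0 || heapFraction <= 0.0 || heapFraction > 1.0) {
            throw new IllegalArgumentException("invalid container size or heap fraction");
        }
        // The cast truncates, matching the (int) cast used in the driver snippet
        return "-Xmx" + (int) (containerMb * heapFraction) + "m";
    }

    public static void main(String[] args) {
        int reducerMb = 8192;
        System.out.println("mapreduce.reduce.memory.mb = " + reducerMb);
        System.out.println("mapreduce.reduce.java.opts = " + heapOption(reducerMb, 0.8));
    }
}
```

For an 8192 MB container this yields a 6553 MB heap. Keeping the heap below the container size is the point of the 0.8 factor, since the JVM also needs memory outside the heap.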

Hadoop passing variables from reducer to main

I am working on a MapReduce program. I'm trying to pass parameters through the context configuration by calling setLong in the reduce method and then reading them in main after the job completes.
In the reducer:
context.getConfiguration().setLong(key, someLong);
In main, after the job completes, I try to read the value using:
long val = job.getConfiguration().getLong(key, -1);
but I always get -1.
When I try reading it inside the reducer, I see that the value is set and I get the correct answer.
Am I missing something?
Thank you
You can use counters: set and update their values in the reducers, and then read them in your client application (main).
You can pass configuration from main to a map or reduce task, but you cannot pass it back. Configuration is propagated as follows:
1) A configuration file is generated on the MapReduce client from the configuration you set in main, and it is pushed to an HDFS path used only by that job. The file is read-only.
2) When a map or reduce task launches, it pulls the configuration file from that HDFS path and initializes its configuration from the file.
If you want to pass values back, you can use another HDFS file: write it in the reducer and read it after the job completes.
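The one-way propagation described above can be imitated with plain java.util.Properties. This is only an analogy, not Hadoop code: the client's settings are serialized into a snapshot, the "task" loads its own copy, and anything the task sets afterwards never reaches the client object. The class name ConfigSnapshotDemo and the property names are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ConfigSnapshotDemo {
    /** Serializes the client-side settings, standing in for the job file pushed to HDFS. */
    static byte[] publish(Properties clientConf) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            clientConf.store(out, "job configuration snapshot");
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds an independent task-side copy from the snapshot. */
    static Properties loadInTask(byte[] snapshot) {
        try {
            Properties taskConf = new Properties();
            taskConf.load(new ByteArrayInputStream(snapshot));
            return taskConf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties clientConf = new Properties();
        clientConf.setProperty("input.threshold", "10");

        Properties taskConf = loadInTask(publish(clientConf));
        taskConf.setProperty("result", "42"); // changes only the task's own copy

        System.out.println("task sees result   = " + taskConf.getProperty("result", "-1"));
        System.out.println("client sees result = " + clientConf.getProperty("result", "-1"));
    }
}
```

In a real job, the counter route suggested above would typically look like context.getCounter(group, name).increment(n) in the reducer and job.getCounters().findCounter(group, name).getValue() in the driver after the job completes, where group and name are strings of your choosing.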

New Distributed cache API Hadoop2 backcompatibility

Is the new distributed cache API backward compatible with Hadoop 1?
If I change my code to use the new API (since the old one is deprecated), will it work on a Hadoop 1 cluster?
By new I mean:
Configuration conf = getConf();
...
Job job = Job.getInstance(conf);
...
job.addCacheFile(new URI(filename));

Set replication in Hadoop

As an experiment, I was trying to load a file using the Hadoop API.
I want to set the replication to the minimum, since this is only an experiment.
I first tried this with FileSystem.setReplication():
Configuration config = new Configuration();
config.set("fs.defaultFS","hdfs://192.168.248.166:8020");
FileSystem dfs2 = FileSystem.get(config);
Path src2 = new Path("C:\\Users\\abc\\Desktop\\testfile.txt");
Path dst2 = new Path(dfs2.getWorkingDirectory()+"/tempdir");
dfs2.copyFromLocalFile(src2, dst2);
dfs2.setReplication(dst2, (short)1); /**setting replication**/
The replication factor was shown as 1, but the file was available on 3 datanodes.
I then tried it with Configuration.set():
Configuration config = new Configuration();
config.set("fs.defaultFS","hdfs://192.168.248.166:8020");
config.set("dfs.replication", "1"); /**setting replication**/
FileSystem dfs2 = FileSystem.get(config);
Path src2 = new Path("C:\\Users\\abc\\Desktop\\testfile.txt");
Path dst2 = new Path(dfs2.getWorkingDirectory()+"/tempdir");
dfs2.copyFromLocalFile(src2, dst2);
This gave the desired outcome (1 replica available on 1 datanode).
Why are there two APIs for the same thing?
What is the difference between the two?
The difference is that FileSystem's setReplication() sets the replication of an existing file on HDFS. In your case, you first copy the local file testfile.txt to HDFS using the default replication factor (3), and then change the replication factor of this file to 1. After this call, it takes a while until the over-replicated blocks are deleted.
On the other hand, when you set the replication with config.set("dfs.replication", "1"); and copy the local file afterwards, its blocks are written just once from the start.
In other words, I believe (but I might be wrong) that both approaches have the same final result, but with the first one you have to wait a little until the extra replicas are removed.

Hadoop MapReduce log4j - log messages to a custom file in userlogs/job_ dir?

It's not clear to me how one should configure Hadoop MapReduce log4j at the job level. Can someone help me answer these questions?
1) How do I add log4j logging support from a client machine? That is, I want to use a log4j properties file on the client machine without disturbing the Hadoop log4j setup in the cluster. I would think that having the properties file in the project/jar should suffice, and that Hadoop's distributed cache should do the rest when transferring the map-reduce jar.
2) How do I log messages to a custom file in the $HADOOP_HOME/logs/userlogs/job_/ dir?
3) Will a map-reduce task use both log4j properties files, the one supplied by the client job and the one present in the Hadoop cluster? If so, would log4j.rootLogger combine the values from both?
Thanks
Srivatsan Nallazhagappan
You can configure log4j directly in your code. For example, you can call PropertyConfigurator.configure(properties); in the mapper/reducer setup method.
Here is an example with the properties stored on HDFS:
Properties properties = new Properties();
// try-with-resources closes the HDFS stream once the properties are loaded
try (InputStream is = fs.open(log4jPropertiesPath)) {
    properties.load(is);
}
PropertyConfigurator.configure(properties);
where fs is a FileSystem object and log4jPropertiesPath is a path on HDFS.
With this you can also write logs to a directory named after the job ID. For example, you can modify the properties before calling PropertyConfigurator.configure(properties);
Enumeration<?> propertiesNames = properties.propertyNames();
while (propertiesNames.hasMoreElements()) {
    String propertyKey = (String) propertiesNames.nextElement();
    String propertyValue = properties.getProperty(propertyKey);
    // Substitute the job ID wherever the placeholder appears in a value
    if (propertyValue.contains(JOB_ID_PATTERN)) {
        properties.setProperty(propertyKey, propertyValue.replace(JOB_ID_PATTERN, context.getJobID().toString()));
    }
}
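The snippet above relies on a JOB_ID_PATTERN constant and a task context that are not shown. As a stand-alone sketch of the same substitution, the following uses only java.util.Properties. The placeholder ${job.id}, the appender name custom, the log path, and the sample job ID are all assumptions chosen for illustration, not values from the original answer.

```java
import java.util.Properties;

final class JobIdSubstitution {
    // Hypothetical placeholder; the original snippet does not show how JOB_ID_PATTERN is defined
    static final String JOB_ID_PATTERN = "${job.id}";

    /** Replaces the placeholder in every property value with the given job ID, in place. */
    static void substituteJobId(Properties properties, String jobId) {
        // stringPropertyNames() returns a snapshot, so updating values while looping is safe
        for (String key : properties.stringPropertyNames()) {
            String value = properties.getProperty(key);
            if (value.contains(JOB_ID_PATTERN)) {
                properties.setProperty(key, value.replace(JOB_ID_PATTERN, jobId));
            }
        }
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("log4j.rootLogger", "INFO, custom");
        properties.setProperty("log4j.appender.custom.File", "/tmp/logs/${job.id}/custom.log");

        // In a real task the ID would come from context.getJobID().toString()
        substituteJobId(properties, "job_201301010000_0001");
        System.out.println(properties.getProperty("log4j.appender.custom.File"));
    }
}
```

Values without the placeholder, such as log4j.rootLogger here, are left untouched, so the same properties file can be shared across jobs.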
There is no straightforward way to override the log4j properties on a per-job basis.
A MapReduce job does not store its logs in HDFS; it writes them to the local file system (${hadoop.log.dir}/userlogs) of the datanodes. A separate YARN process called log aggregation collects those logs and combines them.
Use yarn logs -applicationId <appId> to fetch the full log, then use Unix commands to parse and extract the part of the log you need.

Resources