Hadoop Global Property Conf.Set / Conf.Get in Cleanup()? - hadoop

I am trying to use Global Variables in Hadoop via the Conf.set() and Context.getConfiguration().get() methods.
However, these don't seem to work inside a cleanup() method I'm using, though I am able to use the properties in the Mapper and Reducer. Is this strange or normal behaviour?
Is there any other way of propagating the value of a variable across MapReduce jobs, and inside the cleanup method of a Hadoop job?

Parameters set on the Configuration that is passed to the Job do come through correctly in the cleanup method.
The following is in the main method:
    Configuration conf = new Configuration();
    conf.set("test", "123");
    Job job = new Job(conf);
The following is the Mapper#cleanup method:
    protected void cleanup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        String param = conf.get("test");
        System.out.println("clean p--> param = " + param);
    }
The output of the above is:
    clean p--> param = 123
Check your code again. BTW, I tested it against the 0.21 release.

Related

Connecting to Accumulo inside a Mapper using Kerberos

I am moving some software from an older Hadoop cluster (which uses username/password authentication) to a newer one, 2.6.0-cdh5.12.0, which has Kerberos authentication enabled.
I have been able to get many of the existing Map/Reduce jobs that use Accumulo for their input and/or output to work fine using a DelegationToken set in the AccumuloInput/OutputFormat classes.
However, I have one job that uses AccumuloInput/OutputFormat for input and output but also, inside its Mapper.setup() method, connects to Accumulo via ZooKeeper so that Mapper.map() can compare each key/value it processes to an entry in another Accumulo table.
I have included the relevant code below, which shows the setup() method connecting to ZooKeeper using a PasswordToken and then creating an Accumulo table Scanner, which is then used in the map method.
So the question is: how do I replace the PasswordToken with a KerberosToken when setting up the Accumulo scanner in the Mapper.setup() method? I can find no way to "get" the DelegationToken that I set on the AccumuloInput/OutputFormat classes.
I have tried context.getCredentials().getAllTokens() and looked for a token of type org.apache.accumulo.core.client.security.tokens.AuthenticationToken, but all of the tokens returned are of type org.apache.hadoop.security.token.Token.
Please note that I typed the code fragments in rather than cutting and pasting, as the code runs on a network that is not connected to the internet, so there may be a typo. :)
//****************************
// code in the M/R driver
//****************************
    ClientConfiguration accumuloCfg = ClientConfiguration.loadDefault().withInstance("Accumulo1").withZkHosts("zookeeper1");
    ZooKeeperInstance inst = new ZooKeeperInstance(accumuloCfg);
    // conn is assumed to be a Connector obtained from inst earlier (not shown in the original)
    AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
    AccumuloInputFormat.setConnectorInfo(job, username, dt);
    AccumuloOutputFormat.setConnectorInfo(job, username, dt);
    // other job setup and then
    job.waitForCompletion(true);
//****************************
// this is inside the Mapper class of the M/R job
//****************************
    private Scanner index_scanner;

    public void setup(Context context) {
        Configuration cfg = context.getConfiguration();
        // properties set and passed from M/R Driver program
        String username = cfg.get("UserName");
        String password = cfg.get("Password");
        String accumuloInstName = cfg.get("InstanceName");
        String zookeepers = cfg.get("Zookeepers");
        String tableName = cfg.get("TableName");
        Instance inst = new ZooKeeperInstance(accumuloInstName, zookeepers);
        try {
            // this is the password-based connection that needs a Kerberos replacement
            AuthenticationToken passwordToken = new PasswordToken(password);
            Connector conn = inst.getConnector(username, passwordToken);
            index_scanner = conn.createScanner(tableName, conn.securityOperations().getUserAuthorizations(username));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void map(Key key, Value value, Context context) throws IOException, InterruptedException {
        String uuid = key.getRow().toString();
        // look up the current row in the secondary (index) table
        index_scanner.clearColumns();
        index_scanner.setRange(Range.exact(uuid));
        for (Entry<Key, Value> entry : index_scanner) {
            // do some processing in here
        }
    }
The provided AccumuloInputFormat and AccumuloOutputFormat have a method to set the token in the job configuration: Accumulo*putFormat.setConnectorInfo(job, principal, token). You can also serialize the token to a file in HDFS using the AuthenticationTokenSerializer and use the version of the setConnectorInfo method that accepts a file name.
If a KerberosToken is passed in, the job will create a DelegationToken to use; if a DelegationToken is passed in, it will just use that.
The provided AccumuloInputFormat should handle its own scanner, so normally you shouldn't have to do that in your Mapper if you've set the configuration properly. However, if you're doing secondary scanning (for something like a join) inside your Mapper, you can inspect the source code of the provided AccumuloInputFormat's RecordReader for an example of how to retrieve the configuration and construct a Scanner.

Specify job properties and override properties in hadoop jobs

I have a hadoop (2.2.0) map-reduce job which reads text from a specified path (say INPUT_PATH), and does some processing. I don't want to hardcode the input path (since it comes from some other source which changes each week).
I believe there should be a way in Hadoop to specify an XML properties file when running through the command line. How should I do it?
One way I thought was to set an environment variable which points to the location of the properties file and then read this env variable in code and subsequently read the property file. This could work because the value of the env variable can be changed each week without changing the code. But I feel this is an ugly way of loading properties and overrides.
Please let me know the least hacky way of doing this.
There is no built-in way to read a configuration file for input/output.
One approach I can suggest is to implement a Java M/R driver program that does the following:
1. Reads the configuration (XML, properties, or anything else), probably generated or updated by the other process.
2. Sets the job properties.
3. Submits the job using your hadoop command, with the configuration file passed as an argument.
Something like this:
    public class SampleMRDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // Read the configuration file named by the first argument
            Properties prop = new Properties();
            prop.loadFromXML(new FileInputStream(args[0]));

            Job job = Job.getInstance(getConf(), "Test Job");
            job.setJarByClass(SampleMRDriver.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(TestMapper.class);
            job.setReducerClass(TestReducer.class);

            // getProperty returns a String; get() returns Object and would not compile here
            FileInputFormat.setInputPaths(job, new Path(prop.getProperty("input_path")));
            FileOutputFormat.setOutputPath(job, new Path(prop.getProperty("output_path")));

            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new SampleMRDriver(), args));
        }
    }
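For reference, Properties.loadFromXML expects the standard Java XML properties layout, which is not the same as a Hadoop *-site.xml file. The sketch below parses a sample document in that layout. The keys input_path and output_path mirror the driver above; the class name JobPropertiesDemo, the sample paths, and the in-memory stream are assumptions made for illustration. In a driver, the stream would be a FileInputStream opened on the file named on the command line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of Hadoop: shows the file layout loadFromXML accepts.
final class JobPropertiesDemo {

    // Sample document in the java.util.Properties XML format (paths are made up).
    static final String SAMPLE_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"input_path\">/data/weekly/input</entry>\n"
        + "  <entry key=\"output_path\">/data/weekly/output</entry>\n"
        + "</properties>\n";

    // Parses an XML properties document; I/O and format errors surface as unchecked.
    static Properties load(InputStream in) {
        Properties props = new Properties();
        try (InputStream stream = in) {
            props.loadFromXML(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(SAMPLE_XML.getBytes(StandardCharsets.UTF_8));
        Properties props = load(in);
        // getProperty returns String, so the values can be handed straight to a path constructor.
        System.out.println(props.getProperty("input_path"));
        System.out.println(props.getProperty("output_path"));
    }
}
```

Only the file's contents need to change from week to week; the driver code stays the same.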

Hadoop: how to include third-party JARs when running a MapReduce job

As we know, we need to pack all the required classes into the job JAR and upload it to the server, which is slow. I would like to know whether there is a way to specify the third-party JARs to include when executing a MapReduce job, so that I can package only my own classes without their dependencies.
PS: I found there is a "-libjars" option, but I haven't figured out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
To support them, your job should implement Tool.
Run your job like this:
    hadoop jar yourfile.jar [mainClass] -libjars <comma-separated list of jars> args
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application:
    public class YourClass extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new YourClass(), args);
            System.exit(res);
        }

        public int run(String[] args) throws Exception {
            // parse your normal arguments here
            Configuration conf = getConf();
            Job job = new Job(conf, "Name of job");
            // set the class names etc.
            // set the output data type classes etc.
            // accept the HDFS input and output directories at run time
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }
For me, I had to specify the -libjars option before the application arguments. Otherwise it was treated as an ordinary argument.

Calling progress or increase counter in configure method of reducer

Is it possible to do so?
Context: the configure method of my reducer needs to read a set of files from the DistributedCache (total size ~150 MB). However, for some reason it takes so long that Hadoop kills some reducers, even though other reducers finish successfully.
I use the old API, where I can only access the JobConf conf variable in the configure method.
My idea is to make the reporter variable a field so that I can call it in the configure method. But it seems configure is called before reduce is called.
Convert your code to use the new API!
Then in setup(), you can access the context variable and call progress() as follows:
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        context.progress();
    }

Accessing files in hadoop distributed cache

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.
Then, my setup function looks like this:
    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        // etc.
    }
However, this localFiles array is always null.
I was initially running on a single-host cluster for testing, but I read that this prevents the distributed cache from working. I tried a pseudo-distributed setup, but that didn't work either.
I'm using Hadoop 1.0.3.
thanks
Peter
The problem here was that I was doing the following:
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
    Configuration conf = new Configuration();
    DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
    Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on the Hadoop user list for the help.
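The copy-on-construct behaviour described above can be modelled without Hadoop. This is a simplified sketch: MiniConf and MiniJob are hypothetical stand-ins, not Hadoop classes, and they reproduce only the one property that matters here, namely that the job keeps its own copy of the configuration it was given.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfCopyDemo {

    // Hypothetical stand-in for a configuration object (NOT org.apache.hadoop.conf.Configuration).
    static final class MiniConf {
        private final Map<String, String> values = new HashMap<>();

        MiniConf() {}

        // Copy constructor: snapshots the other object's current entries.
        MiniConf(MiniConf other) {
            values.putAll(other.values);
        }

        void set(String key, String value) { values.put(key, value); }

        String get(String key) { return values.get(key); }
    }

    // Hypothetical stand-in for a job that copies the configuration it receives.
    static final class MiniJob {
        private final MiniConf conf;

        MiniJob(MiniConf conf) {
            this.conf = new MiniConf(conf); // internal copy, as the answer describes
        }

        MiniConf getConfiguration() { return conf; }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.set("before", "1");                   // set before the job exists: copied

        MiniJob job = new MiniJob(conf);
        conf.set("after", "2");                    // touches only the caller's object
        job.getConfiguration().set("viaJob", "3"); // touches the job's own copy

        System.out.println(job.getConfiguration().get("before")); // prints 1
        System.out.println(job.getConfiguration().get("after"));  // prints null
        System.out.println(job.getConfiguration().get("viaJob")); // prints 3
    }
}
```

In this model there are two ways to get a setting into the job: set it before constructing the job, or set it on the object returned by getConfiguration().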
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
You can also do it this way.
Once the Job has been created from a Configuration object, i.e.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
any later changes made through conf, e.g.
    conf.set("delimiter", "|");
or
    DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
are not reflected on a pseudo-distributed or fully distributed cluster, although they do work in a local environment.
This version of the code (which is slightly different from the constructs mentioned above) has always worked for me.
    // in main(String[] args)
    Job job = new Job(conf, "Word Count");
    ...
    DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in your Mapper code; mine looks like this:
    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
        // [0] because we added just one file.
        BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
        // now one can use BufferedReader's readLine() to read data
    }
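The read loop that the last comment leaves out is ordinary Java I/O once the file has been localized. A minimal sketch, assuming the cache file holds one entry per line: the class CacheFileLoader, the method readEntries, and the line layout are illustrative assumptions, and the in-memory reader in main stands in for the BufferedReader built from fs.open(dataFile[0]) above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop.
final class CacheFileLoader {

    // Reads every non-blank line into a set and closes the reader when done.
    static Set<String> readEntries(BufferedReader reader) {
        Set<String> entries = new LinkedHashSet<>();
        try (BufferedReader in = reader) {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    entries.add(trimmed); // duplicates collapse; first-seen order is kept
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Stand-in for the localized cache file: two distinct entries, one blank line, one repeat.
        BufferedReader reader = new BufferedReader(new StringReader("alpha\n\nbeta\nalpha\n"));
        System.out.println(readEntries(reader)); // prints [alpha, beta]
    }
}
```

Loading the entries once in setup() and keeping them in a field lets map() do in-memory lookups instead of re-reading the file for every record.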

Resources