Connecting to Accumulo inside a Mapper using Kerberos - hadoop

I am moving some software from an older Hadoop cluster (which uses username/password authentication) to a newer one, 2.6.0-cdh5.12.0, which has Kerberos authentication enabled.
I have been able to get many of the existing Map/Reduce jobs that use Accumulo for their input and/or output to work fine using a DelegationToken set on the AccumuloInput/OutputFormat classes.
However, I have one job that uses AccumuloInput/OutputFormat for input and output, but also connects to Accumulo via Zookeeper inside its Mapper.setup() method, so that in Mapper.map() it can compare each key/value being processed against an entry in another Accumulo table.
I have included the relevant code below, which shows the setup() method connecting to Zookeeper using a PasswordToken and then creating an Accumulo table Scanner that is used in the map() method.
So the question is: how do I replace the use of the PasswordToken with a KerberosToken when setting up the Accumulo Scanner in Mapper.setup()? I can find no way to "get" the DelegationToken used by the AccumuloInput/OutputFormat classes that I set.
I have tried context.getCredentials().getAllTokens() and looked for a token of type org.apache.accumulo.core.client.security.tokens.AuthenticationToken -- all of the tokens returned there are of type org.apache.hadoop.security.token.Token.
Please note that I typed the code fragments in rather than cut/pasting them, as the code runs on a network that is not connected to the internet -- so there may be a typo. :)
//****************************
// code in the M/R driver
//****************************
ClientConfiguration accumuloCfg = ClientConfiguration.loadDefault().withInstance("Accumulo1").withZkHosts("zookeeper1");
ZooKeeperInstance inst = new ZooKeeperInstance(accumuloCfg);
// obtain a Connector as the Kerberos principal (e.g. with a no-arg KerberosToken for the logged-in user)
Connector conn = inst.getConnector(username, new KerberosToken());
AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
AccumuloInputFormat.setConnectorInfo(job, username, dt);
AccumuloOutputFormat.setConnectorInfo(job, username, dt);
// other job setup and then
job.waitForCompletion(true);
//****************************
// this is inside the Mapper class of the M/R job
//****************************
private Scanner index_scanner;

public void setup(Context context) {
    Configuration cfg = context.getConfiguration();
    // properties set and passed from the M/R driver program
    String username = cfg.get("UserName");
    String password = cfg.get("Password");
    String accumuloInstName = cfg.get("InstanceName");
    String zookeepers = cfg.get("Zookeepers");
    String tableName = cfg.get("TableName");
    Instance inst = new ZooKeeperInstance(accumuloInstName, zookeepers);
    try {
        AuthenticationToken passwordToken = new PasswordToken(password);
        Connector conn = inst.getConnector(username, passwordToken);
        index_scanner = conn.createScanner(tableName, conn.securityOperations().getUserAuthorizations(username));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void map(Key key, Value value, Context context) throws IOException, InterruptedException {
    String uuid = key.getRow().toString();
    index_scanner.clearColumns();
    index_scanner.setRange(Range.exact(uuid));
    for (Entry<Key, Value> entry : index_scanner) {
        // do some processing in here
    }
}

The provided AccumuloInputFormat and AccumuloOutputFormat have a method to set the token in the job configuration with Accumulo*putFormat.setConnectorInfo(job, principal, token). You can also serialize the token to a file in HDFS, using the AuthenticationTokenSerializer, and use the version of setConnectorInfo that accepts a file name.
If a KerberosToken is passed in, the job will create a DelegationToken to use, and if a DelegationToken is passed in, it will just use that.
The provided AccumuloInputFormat should handle its own Scanner, so normally you shouldn't have to create one in your Mapper if you've set the configuration properly. However, if you're doing a secondary scan (for something like a join) inside your Mapper, you can inspect the provided AccumuloInputFormat's RecordReader source code for an example of how to retrieve the configuration and construct a Scanner.
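For the secondary scan specifically, one possible approach (a sketch, not taken from the answer above) is to serialize the DelegationToken yourself in the driver and rebuild it in Mapper.setup(). This assumes Accumulo's AuthenticationToken.AuthenticationTokenSerializer nested class is available; the property names index.token and index.token.class are made up here for illustration, and dt, cfg, inst, and username refer to the variables in the fragments above.
// Driver: stash the serialized DelegationToken in the job configuration
// (index.token / index.token.class are illustrative property names, not Accumulo APIs;
// requires java.util.Base64; note this places the token in the conf, whereas the
// file-in-HDFS route mentioned above keeps it out of the configuration)
job.getConfiguration().set("index.token.class", dt.getClass().getName());
job.getConfiguration().set("index.token",
        Base64.getEncoder().encodeToString(AuthenticationToken.AuthenticationTokenSerializer.serialize(dt)));

// Mapper.setup(): rebuild the token and use it in place of the PasswordToken
byte[] tokenBytes = Base64.getDecoder().decode(cfg.get("index.token"));
AuthenticationToken token = AuthenticationToken.AuthenticationTokenSerializer
        .deserialize(cfg.get("index.token.class"), tokenBytes);
Connector conn = inst.getConnector(username, token);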

Related

Creating Custom Processor In Apache NiFi

I am building a custom processor to process flow files. To process a flow file I need to read a CSV file from my local file system. I created a property descriptor CSV_PATH as follows:
public static final PropertyDescriptor CSV_PATH = new PropertyDescriptor.Builder()
        .name("CSV Path")
        .displayName("CSV Path")
        .description("CSV Path Reader")
        .required(true)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .build();

@Override
protected void init(final ProcessorInitializationContext context) {
    final List<PropertyDescriptor> descriptors = new ArrayList<PropertyDescriptor>();
    descriptors.add(JSON_PATH);
    descriptors.add(CSV_PATH);
    this.descriptors = Collections.unmodifiableList(descriptors);

    final Set<Relationship> relationships = new HashSet<Relationship>();
    relationships.add(SUCCESS);
    this.relationships = Collections.unmodifiableSet(relationships);
}
Now I want to get the value of the CSV_PATH property set in the UI while configuring the processor, but I am not able to get the CSV_PATH value. Also, if I hardcode the file path in code, I am still not able to read the CSV from the local file system.
You want to use the following code to retrieve the value of the PropertyDescriptor from the ProcessContext:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    final String csvPath = context.getProperty(CSV_PATH).getValue();
    // Do something with csvPath
}
If you decide to support NiFi Expression Language in that property descriptor, you will also want to evaluate for that:
final String csvPath = context.getProperty(CSV_PATH).evaluateAttributeExpressions().getValue();
There are additional overloads of that method, covering flowfile attributes, the variable registry, custom decorators, etc.
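As a rough sketch (not part of the original answer), supporting Expression Language on the descriptor and evaluating it against the incoming flowfile's attributes might look like the following; the ExpressionLanguageScope value shown assumes a reasonably recent NiFi release.
// Property descriptor that allows Expression Language referencing flowfile attributes
public static final PropertyDescriptor CSV_PATH = new PropertyDescriptor.Builder()
        .name("CSV Path")
        .description("Path to the CSV file to read")
        .required(true)
        .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .build();

// In onTrigger(): evaluate the expression against the flowfile's attributes
final String csvPath = context.getProperty(CSV_PATH)
        .evaluateAttributeExpressions(flowFile)
        .getValue();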
This is documented in the Apache NiFi Developer's Guide. I recently did a presentation at Dataworks Summit Barcelona 2019 covering custom processor development with some best practices included and examples that may be helpful. You can also look at any existing processor in the NiFi codebase to see examples.

Understanding HBase Java Client

I started with HBase a few days back and have been going through all the material available online.
I have installed and configured HBase, and the shell commands are working fine.
I found an example of a Java client that gets data from an HBase table, and it executed successfully, but I could not understand how it works. Nowhere in the code do we mention the port or host of the HBase server, so how is it able to fetch the data from the table?
This is my code:
public class RetriveData {
    public static void main(String[] args) throws IOException {
        // Instantiating Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating HTable class
        @SuppressWarnings({ "deprecation", "resource" })
        HTable table = new HTable(config, "emp");

        // Instantiating Get class
        Get g = new Get(Bytes.toBytes("1"));

        // Reading the data
        Result result = table.get(g);

        // Reading values from Result class object
        byte[] value = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
        byte[] value1 = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"));

        // Printing the values
        String name = Bytes.toString(value);
        String city = Bytes.toString(value1);
        System.out.println("name: " + name + " city: " + city);
    }
}
The output looks like:
Output:
name: raju city: hyderabad
I agree with Binary Nerd's answer,
and am adding some more information for better understanding.
Your question:
I could not understand how it works. Nowhere in the code do we mention the port or host of the HBase server, so how is it able to fetch the data from the table?
Since you are executing this program on the cluster:
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
all the cluster properties are picked up from inside the cluster, since you are on the cluster when you execute the HBase Java client program.
Now try the following (run the same program a different way, for example from Eclipse on a remote Windows machine, to see the difference between what you did earlier and this):
public static Configuration configuration; // this is a class variable

static { // fill clusternode1, clusternode2, clusternode3 from your cluster
    configuration = HBaseConfiguration.create();
    configuration.set("hbase.zookeeper.property.clientPort", "2181");
    configuration.set("hbase.zookeeper.quorum", "clusternode1,clusternode2,clusternode3");
    configuration.set("hbase.master", "clusternode1:60000");
}
Hope this helps you to understand.
If you look at the source code for HBaseConfiguration on GitHub, you can see what it does when you call create():
public static Configuration create() {
    Configuration conf = new Configuration();
    // In case HBaseConfiguration is loaded from a different classloader than
    // Configuration, conf needs to be set with appropriate class loader to resolve
    // HBase resources.
    conf.setClassLoader(HBaseConfiguration.class.getClassLoader());
    return addHbaseResources(conf);
}
Followed by:
public static Configuration addHbaseResources(Configuration conf) {
    conf.addResource("hbase-default.xml");
    conf.addResource("hbase-site.xml");
    checkDefaultsVersion(conf);
    HeapMemorySizeUtil.checkForClusterFreeMemoryLimit(conf);
    return conf;
}
So it's loading the configuration from your HBase configuration files, hbase-default.xml and hbase-site.xml.
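As a side note (a sketch, not part of the original answer): the HTable constructor used above is deprecated, which is why the @SuppressWarnings is there. On HBase 1.0 and later the same read is usually written with the Connection API, which picks up hbase-site.xml from the classpath in exactly the same way:
// Same lookup with the non-deprecated Connection API (HBase 1.0+)
Configuration config = HBaseConfiguration.create(); // still reads hbase-site.xml from the classpath
try (Connection connection = ConnectionFactory.createConnection(config);
     Table table = connection.getTable(TableName.valueOf("emp"))) {
    Result result = table.get(new Get(Bytes.toBytes("1")));
    String name = Bytes.toString(result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name")));
    String city = Bytes.toString(result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city")));
    System.out.println("name: " + name + " city: " + city);
}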

Specify job properties and override properties in hadoop jobs

I have a Hadoop (2.2.0) map-reduce job which reads text from a specified path (say INPUT_PATH) and does some processing. I don't want to hardcode the input path (since it comes from some other source which changes each week).
I believe there should be a way in Hadoop to specify an XML properties file when running through the command line. How should I do it?
One way I thought of was to set an environment variable which points to the location of the properties file, then read this env variable in code and subsequently read the property file. This could work because the value of the env variable can be changed each week without changing the code. But I feel this is an ugly way of loading properties and overrides.
Please let me know the least hacky way of doing this.
There is no inbuilt way to read an arbitrary configuration file for the input/output paths.
One way I can suggest is to implement a Java M/R driver program that does the following:
Read the configuration (XML/properties/anything), probably generated/updated by the other process
Set the job properties
Submit the job using your hadoop command (passing the configuration file as an argument)
Something like this:
public class SampleMRDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Read the configuration file path from args
        Properties prop = new Properties();
        prop.loadFromXML(new FileInputStream(args[0]));

        Job job = Job.getInstance(getConf(), "Test Job");
        job.setJarByClass(SampleMRDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);

        FileInputFormat.setInputPaths(job, new Path(prop.getProperty("input_path")));
        FileOutputFormat.setOutputPath(job, new Path(prop.getProperty("output_path")));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new SampleMRDriver(), args);
    }
}
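For reference, Properties.loadFromXML expects the standard java.util.Properties XML format, so the file passed as args[0] would look something like the sketch below; the property names input_path and output_path just mirror the ones read in the driver above, and the paths are placeholders.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="input_path">/data/incoming/week-current</entry>
    <entry key="output_path">/data/processed/week-current</entry>
</properties>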

set a conf value in mapper - get it in run method

In the run method of the driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration only flows one way: from the driver to the tasks, so changes made in the mapper are not visible back in the driver.
If you want to pass non-counter types of values back to the driver, you can use HDFS for that.
Either write to the main output (keys and values) that you emit from your job,
or alternatively use MultipleOutputs if you do not want to mix them with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read from HDFS. For example, you can store the name/value pairs in the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
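For completeness, here is a rough sketch of the writing side of the MultipleOutputs route described above; the named output "props" and the Text/Text types are illustrative choices, not something prescribed by Hadoop. The resulting props-m-##### or props-r-##### files are what the load() method above would then read back in the driver.
// Driver: declare a named output alongside the normal job output
MultipleOutputs.addNamedOutput(job, "props", TextOutputFormat.class, Text.class, Text.class);

// Mapper or Reducer: write name/value pairs to the named output
private MultipleOutputs<Text, Text> mos;

@Override
protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.write("props", new Text("feedName"), new Text("someValue"));
    mos.close();
}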
Using the Configuration you can only do the reverse (driver to mapper).
You can set values in the driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and then get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file, store it in HDFS, and then access it in the driver class. These files can be treated as "intermediate files".
Just try it and see.

Hadoop Global Property Conf.Set / Conf.Get in Cleanup()?

I am trying to use global variables in Hadoop via the conf.set() and context.getConfiguration().get() methods.
However, these don't seem to be working inside a cleanup method I'm using, though I am able to use the properties in the Mapper and Reducer. Is this strange or normal behaviour?
Is there any other way of propagating the value of a variable across MapReduce jobs, and inside the cleanup method of a Hadoop job?
The parameters set on the Job's Configuration come through properly in the cleanup method.
The following is in the main method
Configuration conf = new Configuration();
conf.set("test", "123");
Job job = new Job(conf);
The following is the Mapper#cleanup method
protected void cleanup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String param = conf.get("test");
    System.out.println("clean p--> param = " + param);
}
The output of the above is:
clean p--> param = 123
Check your code again. BTW, I tested this against the 0.21 release.
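One common pitfall worth checking (an assumption about the asker's code, not something stated in the answer above): the Job takes its own copy of the Configuration it is given, so values set on the original conf object after the Job has been created are not visible to the tasks. Setting the value before creating the Job, or setting it through job.getConfiguration(), avoids that:
// Visible in setup()/map()/cleanup(): set before the Job copies the conf
Configuration conf = new Configuration();
conf.set("test", "123");
Job job = new Job(conf);

// Not visible to tasks: the Job already took its own copy of conf
conf.set("late", "456");

// Also visible: set directly on the Job's own Configuration
job.getConfiguration().set("alsoVisible", "789");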
