I am writing a map-reduce job in Java.
I want to use an external table for writing HBase Increment objects.
For that, I am creating a new HBaseConfiguration.
I want to be able to create it once and use it in all mappers.
Any idea?
The Job configuration is already passed to the mappers (and to the reducers) as a part of the context. You can access any HBase table with it.
Job job = new Job(HBaseConfiguration.create(), ...);
/* ... Rest of the job setup ... */
job.waitForCompletion(true);
Within your mapper setup method:
Configuration config = context.getConfiguration();
HTable mytable = new HTable(config, "my_table_name");
....
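If it helps, here is a minimal sketch of a mapper that opens the table once in setup and uses it to apply Increment objects; the table name, column family/qualifier, and row-key derivation are placeholders for your own:
public class IncrementMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable mytable;

    @Override
    protected void setup(Context context) throws IOException {
        // the HBase settings travel with the job configuration
        mytable = new HTable(context.getConfiguration(), "my_table_name");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // placeholder: derive the row key from the input and bump a counter column
        Increment increment = new Increment(Bytes.toBytes(value.toString()));
        increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
        mytable.increment(increment);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        mytable.close();
    }
}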
You can even send your custom parameters or arguments so you can instantiate any kind of object you could need on your mappers:
Configuration config = HBaseConfiguration.create();
config.set("myStringParam", "customValue");
config.setStrings("myStringsParam", "customValue1", "customValue2", "customValue3");
config.setBoolean("myBooleanParam", true);
Job job = new Job(config, ...);
/* ... Rest of the job setup ... */
job.waitForCompletion(true);
Within your mapper setup method:
Configuration config = context.getConfiguration();
String myStringParam = config.get("myStringParam");
boolean myBooleanParam = config.getBoolean("myBooleanParam", false);
BTW: I don't know why this question has so many downvotes, it's not a bad question.
I am trying to create an HBase connection in a MapPartitionFunction in Spark, but I get:
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
I tried the following code:
SparkConf conf = new SparkConf()
        .setAppName("EnterPrise Risk Score")
        .setMaster("local");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.set("spark.kryo.registrationRequired", "true");
conf.registerKryoClasses(new Class<?>[] {
        Class.forName("org.apache.hadoop.conf.Configuration"),
        Class.forName("org.apache.hadoop.hbase.client.Table"),
        Class.forName("com.databricks.spark.avro.DefaultSource$SerializableConfiguration") });
SparkSession sparkSession = SparkSession.builder().config(conf)
        .getOrCreate();
Configuration hbaseConf = HBaseConfiguration.create(hadoopConf);
I am using sparkSession to create a dataset and passing hbaseConf to create connections to HBase.
Is there any way to connect to HBase?
You are probably implicitly passing the HBase configuration into a Spark action, like this:
Configuration hbaseConfiguration = HBaseConfiguration.create();
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
    // more valid code
});
Why don't you just create the Configuration right inside it, like this:
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Configuration hbaseConfiguration = HBaseConfiguration.create();
    hbaseConfiguration.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
    Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
    // more valid code
});
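If you want a slightly fuller picture, here is a sketch of the same pattern with the connection closed before the partition returns (Spark 2.x Java API assumed; the table name and the per-record handling are placeholders):
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(records -> {
    Configuration hbaseConfiguration = HBaseConfiguration.create();
    hbaseConfiguration.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
    List<String> results = new ArrayList<>();
    // try-with-resources closes the Table and Connection once the partition is processed
    try (Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
         Table table = connection.getTable(TableName.valueOf("my_table"))) {
        while (records.hasNext()) {
            records.next();
            // placeholder: read from or write to `table` for each record,
            // adding whatever you need to `results`
        }
    }
    return results.iterator();
});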
Hi everybody. I'm using MapReduce to process some log files that live on HDFS. I want to retrieve some info from the files and store it in HBase.
So I launch the job:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar crm_hbase-1.0.jar /datastream/music/useraction/2014-11-30/music_useraction_20141130-230003072+0800.24576015364769354.00018022.lzo
If I just run the job as "hadoop jar xxxx" it complains that it cannot find HBaseConfiguration.
My code is quite simple:
public int run(String[] strings) throws Exception {
    Configuration config = HBaseConfiguration.create();
    String kerbConfPrincipal = "ndir@HADOOP.HZ.NETEASE.COM";
    String kerbKeytab = "/srv/zwj/ndir.keytab";
    UserGroupInformation.loginUserFromKeytab(kerbConfPrincipal, kerbKeytab);
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    System.out.println(" auth: " + ugi.getAuthenticationMethod());
    System.out.println(" name: " + ugi.getUserName());
    System.out.println(" using keytab:" + ugi.isFromKeytab());
    HBaseAdmin.checkHBaseAvailable(config);
    // set job name
    Job job = new Job(config, "Import from file ");
    job.setJarByClass(LogRun.class);
    // set map class
    job.setMapperClass(LogMapper.class);
    // set output format and output table name
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "crm_data");
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.addDependencyJars(job);
But when I try to run this MR job, it cannot execute context.write(null, put); the map seems to halt at that line.
I think it is related to kerbKeytab. Does it mean I need to log in again when the map task runs?
After adding the TableMapReduceUtil calls it works:
Job job = new Job(config, "Import from file ");
job.setJarByClass(LogRun.class);
//set map class
job.setMapperClass(LogMapper.class);
TableMapReduceUtil.initTableReducerJob(table, null, job);
job.setNumReduceTasks(0);
TableMapReduceUtil.addDependencyJars(job);
FileInputFormat.setInputPaths(job,input);
//FileInputFormat.addInputPath(job, new Path(strings[0]));
int ret = job.waitForCompletion(true) ? 0 : 1;
I am running a Java program on a remote computer and trying to read the split data using a RecordReader object, but instead I get:
Exception in thread "main" java.io.IOException: job information not found in JobContext. HCatInputFormat.setInput() not called?
I already have called the following:
_hcatInputFmt = HCatInputFormat.setInput(_myJob, db,tbl);
and then I create the RecordReader object as:
_hcatInputFmt.createRecordReader(hSplit, taskContext)
On debugging, it fails while searching for the value of the key HCAT_KEY_JOB_INFO in the job configuration object when trying to create the RecordReader object.
How do I set this value? Any pointers will be helpful.
Thanks.
We have to use the getConfiguration() method to get the configuration from the job object. The configuration object that was used to create the job won't do it.
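In other words, something like this minimal sketch (dbName and tableName are assumed to be defined elsewhere):
Job job = Job.getInstance(new Configuration());
HCatInputFormat.setInput(job, dbName, tableName);
// read back from the job's own Configuration copy, not from the conf you started with
HCatSchema schema = HCatBaseInputFormat.getTableSchema(job.getConfiguration());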
I had the same problem; you should use:
HCatInputFormat.setInput(job, dbName, inputTableName);
HCatSchema inputschema = HCatBaseInputFormat.getTableSchema(job.getConfiguration());
not
HCatInputFormat.setInput(job, dbName, inputTableName);
HCatSchema inputschema = HCatBaseInputFormat.getTableSchema(getConf());
This is because when you use Job.getInstance(conf), it copies the conf, so you can't use the original conf afterwards. Here is the relevant Configuration copy constructor:
/**
 * A new configuration with the same settings cloned from another.
 *
 * @param other the configuration from which to clone settings.
 */
@SuppressWarnings("unchecked")
public Configuration(Configuration other) {
    this.resources = (ArrayList<Resource>) other.resources.clone();
    synchronized(other) {
        if (other.properties != null) {
            this.properties = (Properties) other.properties.clone();
        }
        if (other.overlay != null) {
            this.overlay = (Properties) other.overlay.clone();
        }
        this.updatingResource = new ConcurrentHashMap<String, String[]>(
            other.updatingResource);
        this.finalParameters = Collections.newSetFromMap(
            new ConcurrentHashMap<String, Boolean>());
        this.finalParameters.addAll(other.finalParameters);
    }
    synchronized(Configuration.class) {
        REGISTRY.put(this, null);
    }
    this.classLoader = other.classLoader;
    this.loadDefaults = other.loadDefaults;
    setQuietMode(other.getQuietMode());
}
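In practice the copy semantics look like this (a small illustration; the key name is made up):
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);

conf.set("my.example.key", "value");                     // updates only the original conf
String v = job.getConfiguration().get("my.example.key"); // null: the job works on its own clone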
I'm using Sqoop 1.4.4 and its Java API to run an import job, and I'm having trouble figuring out how to access the job counters once the import has completed. I see suitable methods in the ConfigurationHelper class, like getNumMapOutputRecords, but I'm not sure how to pass the job to them.
Is there a way to get at the job from the SqoopTool or Sqoop objects?
My code looks something like this:
SqoopTool sqoopTool = new ImportTool();
SqoopOptions options = new SqoopOptions();
options.setConnectString(connectString);
options.setUsername(username);
options.setPassword(password);
options.setTableName(table);
options.setColumns(columns);
options.setWhereClause(whereClause);
options.setTargetDir(targetDir);
options.setNumMappers(1);
options.setFileLayout(FileLayout.TextFile);
options.setFieldsTerminatedBy(delimiter);
Configuration config = new Configuration();
config.set("oracle.sessionTimeZone", timezone.getID());
System.setProperty(Sqoop.SQOOP_RETHROW_PROPERTY, "1");
Sqoop sqoop = new Sqoop(sqoopTool, config, options);
String[] nullArgs = new String[0];
Sqoop.runSqoop(sqoop, nullArgs);
I am adding a file to the distributed cache using the following code:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
Then I read the file into the mappers:
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            String[] s = line.split("\t");
            // do stuff with s
        }
    } finally {
        joinReader.close();
    }
}
The problem is that I only read in one line, and it is not the file I was putting into the cache. Rather it is: cm9vdA==, or root in base64.
Has anyone else had this problem, or can you see how I'm using the distributed cache incorrectly? I am using Hadoop 0.20.2, fully distributed.
Common mistake in your job configuration:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
After you create your Job object, you need to pull the Configuration back out of it, because Job makes a copy of it; configuring values in conf2 after you create the job will have no effect on the job itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
You should also check the number of files in the distributed cache; there is probably more than one, and you're opening an arbitrary one, which is giving you the value you are seeing.
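A quick way to check is to log what is actually registered, for example from your mapper's setup method (diagnostic sketch only):
URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());
if (cacheFiles != null) {
    for (URI cached : cacheFiles) {
        System.err.println("distributed cache entry: " + cached); // appears in the task logs
    }
}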
I suggest you use symlinking, which will make the file available in the local working directory with a known name:
DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);
// then in your mapper setup:
BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));