How to get job counters with the Sqoop 1.4.4 Java API?

I'm using Sqoop 1.4.4 and its Java API to run an import job, and I'm having
trouble figuring out how to access the job counters once the import has
completed. I see suitable methods in the ConfigurationHelper class, like
getNumMapOutputRecords, but I'm not sure how to pass the job to them.
Is there a way to get at the job from the SqoopTool or Sqoop objects?
My code looks something like this:
SqoopTool sqoopTool = new ImportTool();
SqoopOptions options = new SqoopOptions();
options.setConnectString(connectString);
options.setUsername(username);
options.setPassword(password);
options.setTableName(table);
options.setColumns(columns);
options.setWhereClause(whereClause);
options.setTargetDir(targetDir);
options.setNumMappers(1);
options.setFileLayout(FileLayout.TextFile);
options.setFieldsTerminatedBy(delimiter);
Configuration config = new Configuration();
config.set("oracle.sessionTimeZone", timezone.getID());
System.setProperty(Sqoop.SQOOP_RETHROW_PROPERTY, "1");
Sqoop sqoop = new Sqoop(sqoopTool, config, options);
String[] nullArgs = new String[0];
Sqoop.runSqoop(sqoop, nullArgs);
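For reference, here is a hedged sketch of what reading the counters could look like if you did have a handle on the completed org.apache.hadoop.mapreduce.Job behind the import; obtaining that handle from the ImportTool is exactly the open part of the question, so the helper below is illustrative only and uses the plain Hadoop 2.x counters API rather than anything confirmed to be exposed by Sqoop (the getNumMapOutputRecords method mentioned above presumably takes the same Job object):
import java.io.IOException;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterSketch {
    // Illustrative only: given the completed MapReduce Job behind the import,
    // read the map-output-records counter directly from the Hadoop counters API.
    static long mapOutputRecords(Job completedJob) throws IOException {
        Counters counters = completedJob.getCounters();
        return counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
    }
}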

Related

Not possible to set job description in Spark StreamingContext?

I am trying to do:
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("appName")
val sc: SparkContext = new SparkContext(sparkConf)
sc.setJobGroup("jobName", "job description")
val ssc: StreamingContext = new StreamingContext(sc, Seconds(2))
It seems that setJobGroup never takes effect. I found https://issues.apache.org/jira/browse/SPARK-10649, which suggests that the StreamingContext overwrites this information in the SparkContext. However, StreamingContext doesn't provide any method to set the Spark job description. Is there any way to customize the job description for a Spark streaming job?
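One workaround that is sometimes suggested (not confirmed here, and version-dependent) is to set the group and description inside foreachRDD, which runs on the driver thread that submits each batch's jobs, so the thread-local properties are in place for those jobs. A minimal Java sketch against Spark 2.x, with the socket source and the names used here as placeholders:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingJobDescription {
    public static void main(String[] args) throws InterruptedException {
        SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("appName");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(2));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999); // placeholder source
        lines.foreachRDD(rdd -> {
            // Runs on the driver for every batch, on the thread that submits the
            // batch's jobs, so the group/description set here applies to them.
            jsc.setJobGroup("jobName", "job description");
            rdd.foreach(line -> { });
        });

        jssc.start();
        jssc.awaitTermination();
    }
}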

Connect to HBase using Hadoop config in Spark

I am trying to create an HBase connection in a MapPartitionFunction in Spark, and I get:
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
I tried the following code:
SparkConf conf = new SparkConf()
.setAppName("EnterPrise Risk Score")
.setMaster("local");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.set("spark.kryo.registrationRequired", "true");
conf.registerKryoClasses(new Class<?>[] {
Class.forName("org.apache.hadoop.conf.Configuration"),
Class.forName("org.apache.hadoop.hbase.client.Table"),
Class.forName("com.databricks.spark.avro.DefaultSource$SerializableConfiguration")});
SparkSession sparkSession = SparkSession.builder().config(conf)
.getOrCreate();
Configuration hbaseConf = HBaseConfiguration.create(hadoopConf);
I am using sparkSession to create a dataset and pass hbaseConf to create connections to HBase.
Is there any way to connect to HBase?
You probably implicitly pass an HBase configuration to a Spark action, like this:
Configuration hbaseConfiguration = HBaseConfiguration.create();
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
//more valid code
});
Why don't you just create the Configuration right inside of it, like this:
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
Configuration hbaseConfiguration = HBaseConfiguration.create();
hbaseConfiguration.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
//more valid code
});
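For completeness, a minimal sketch of the per-partition pattern described above: everything HBase-related is created (and closed) inside the partition function, so the Spark closure only captures serializable values. The table name and the quorum parameter are placeholders, not taken from the question:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PartitionLookup {
    // Called from inside mapPartitions: the Configuration and Connection are
    // built per partition and closed before returning, so nothing
    // non-serializable ends up in the closure.
    static Iterator<Result> lookup(Iterator<String> rowKeys, String quorum) throws IOException {
        Configuration hbaseConfiguration = HBaseConfiguration.create();
        hbaseConfiguration.set("hbase.zookeeper.quorum", quorum);
        List<Result> results = new ArrayList<>();
        try (Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            while (rowKeys.hasNext()) {
                results.add(table.get(new Get(Bytes.toBytes(rowKeys.next()))));
            }
        }
        return results.iterator();
    }
}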

Exception : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=hbase, access=EXECUTE

I am trying to perform a bulk load into HBase. The input to the MapReduce job is an HDFS file (from Hive).
I am using the below code in the Tool (Job) class to initiate the bulk loading process:
HFileOutputFormat.configureIncrementalLoad(job, new HTable(config, TABLE_NAME));
In the Mapper, I am using the following as the output of the Mapper:
context.write(new ImmutableBytesWritable(Bytes.toBytes(hbaseTable)), put);
Once the mapper has completed, I perform the actual bulk load using:
LoadIncrementalHFiles loadFfiles = new LoadIncrementalHFiles(configuration);
HTable hTable = new HTable(configuration, tableName);
loadFfiles.doBulkLoad(new Path(pathToHFile), hTable);
The job runs fine, but once the LoadIncrementalHFiles step starts, it hangs forever, and I have to stop the job after many attempts. However, after a long wait of maybe 30 minutes, I finally got the above error. After extensive searching I found that HBase tries to access the files (HFiles) placed in the output folder, and that folder does not have permission to be written or executed, hence the above error. So an alternative solution is to add file access permissions in the Java code, as below, before the bulk load is performed:
FileSystem fileSystem = FileSystem.get(config);
fileSystem.setPermission(new Path(outputPath),FsPermission.valueOf("drwxrwxrwx"));
Is this the correct approach as we move from development to production? Also, once I added the above code, I got a similar error for the folder created inside the output folder, this time for the column family folder, which is created dynamically at runtime.
As a temporary workaround, I did as below and was able to move ahead.
fileSystem.setPermission(new Path(outputPath+"/col_fam_folder"),FsPermission.valueOf("drwxrwxrwx"));
Both of these steps seem to be workarounds, and I need a correct solution to move to production. Thanks in advance.
Try this
System.setProperty("HADOOP_USER_NAME", "hadoop");
Secure bulk load seems to be an appropriate answer. This thread explains a sample implementation. The snippet is copied below.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.SecureBulkLoadClient;
import org.apache.hadoop.hbase.security.UserProvider;
import org.apache.hadoop.hbase.security.token.FsDelegationToken;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.security.UserGroupInformation;
String keyTab = "pathtokeytabfile";
String tableName = "tb_name";
String pathToHFile = "/tmp/tmpfiles/";
Configuration configuration = new Configuration();
configuration.set("hbase.zookeeper.quorum","ZK_QUORUM");
configuration.set("hbase.zookeeper.property.clientPort", "2181");
configuration.set("hbase.master","MASTER:60000");
configuration.set("hadoop.security.authentication", "Kerberos");
configuration.set("hbase.security.authentication", "kerberos");
//Obtaining kerberos authentication
UserGroupInformation.setConfiguration(configuration);
UserGroupInformation.loginUserFromKeytab("kerberos principal here", keyTab);
HBaseAdmin.checkHBaseAvailable(configuration);
System.out.println("HBase is running!");
HBaseConfiguration.addHbaseResources(configuration);
Connection conn = ConnectionFactory.createConnection(configuration);
Table table = conn.getTable(TableName.valueOf(tableName));
HRegionInfo tbInfo = new HRegionInfo(table.getName());
//path to the HFiles that need to be loaded
Path hfofDir = new Path(pathToHFile);
//acquiring user token for authentication
UserProvider up = UserProvider.instantiate(configuration);
FsDelegationToken fsDelegationToken = new FsDelegationToken(up, "name of the key tab user");
fsDelegationToken.acquireDelegationToken(hfofDir.getFileSystem(configuration));
//preparing for the bulk load
SecureBulkLoadClient secureBulkLoadClient = new SecureBulkLoadClient(table);
String bulkToken = secureBulkLoadClient.prepareBulkLoad(table.getName());
System.out.println(bulkToken);
//creating the family list (list of family names and path to the hfile corresponding to the family name)
final List<Pair<byte[], String>> famPaths = new ArrayList<>();
Pair<byte[], String> p = new Pair<>();
//name of the family
p.setFirst("nameofthefamily".getBytes());
//path to the HFile (HFile are organized in folder with the name of the family)
p.setSecond("/tmp/tmpfiles/INTRO/nameofthefilehere");
famPaths.add(p);
//bulk loading ,using the secure bulk load client
secureBulkLoadClient.bulkLoadHFiles(famPaths, fsDelegationToken.getUserToken(), bulkToken, tbInfo.getStartKey());
System.out.println("Bulk Load Completed..");

Create HbaseConfiguration once in MapReduce job

I am writing a map-reduce job in Java.
I want to use an external table for writing HBase Increment objects.
For that, I am creating a new HBaseConfiguration.
I want to be able to create it once and use it in all mappers.
Any idea?
The Job configuration is already passed to the mappers (and to the reducers) as a part of the context. You can access any HBase table with it.
Job job = new Job(HBaseConfiguration.create(), ...);
/* ... Rest of the job setup ... */
job.waitForCompletion(true);
Within your mapper setup method:
Configuration config = context.getConfiguration();
HTable mytable = new HTable(config, "my_table_name");
....
You can even send your custom parameters or arguments so you can instantiate any kind of object you could need on your mappers:
Configuration config = HBaseConfiguration.create();
config.set("myStringParam", "customValue");
config.setStrings("myStringsParam", "customValue1", "customValue2", "customValue3");
config.setBoolean("myBooleanParam", true);
Job job = new Job(config, ...);
/* ... Rest of the job setup ... */
job.waitForCompletion(true);
Within your mapper setup method:
Configuration config = context.getConfiguration();
String myStringParam = config.get("myStringParam");
boolean myBooleanParam = config.getBoolean("myBooleanParam", false);
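Putting the two fragments together, a minimal sketch (with placeholder class, key/value types, and table name) of how the setup looks inside a mapper:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private HTable myTable;
    private String myStringParam;
    private boolean myBooleanParam;

    @Override
    protected void setup(Context context) throws IOException {
        // The job Configuration (including your custom parameters) arrives via the context.
        Configuration config = context.getConfiguration();
        myTable = new HTable(config, "my_table_name");
        myStringParam = config.get("myStringParam");
        myBooleanParam = config.getBoolean("myBooleanParam", false);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        myTable.close(); // release the table when the mapper finishes
    }
}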
BTW: I don't know why this question has so many downvotes, it's not a bad question.

How to Read file from Hadoop using Java without command line

I wanted to read a file from the Hadoop file system, and I could do that using the below code:
String uri = theFilename;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
To run this, I have to run hadoop jar myjar.jar com.mycompany.cloud.CatFile /filepathin_hadoop
That works, but how can I do the same from another program, I mean without using the hadoop jar command?
You can add your core-site.xml to that Configuration object so it knows the URI for your HDFS instance. This method requires HADOOP_HOME to be set.
Configuration conf = new Configuration();
Path coreSitePath = new Path(System.getenv("HADOOP_HOME"), "conf/core-site.xml");
conf.addResource(coreSitePath);
FileSystem hdfs = FileSystem.get(conf);
// rest of code the same
Now, without using hadoop jar you can open a connection to your HDFS instance.
Edit: You have to use conf.addResource(Path). If you use a String argument, it looks in the classpath for that filename.
There is another Configuration method, set(parameterName, value). If you use this method, you don't have to specify the location of core-site.xml. This would be useful for accessing HDFS from a remote location, such as a web server.
Usage is as follows:
String uri = theFilename;
Configuration conf = new Configuration();
conf.set("fs.default.name","hdfs://10.132.100.211:8020/");
FileSystem fs = FileSystem.get(conf);
// Rest of the code
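Combining the snippets above, here is a minimal standalone sketch: a plain main() that reads an HDFS file without going through hadoop jar. The NameNode address is a placeholder for your own cluster:
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFileStandalone {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // HDFS path passed on the command line
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode-host:8020/"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}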
