Modifying data while importing from Oracle to HBase using Sqoop - oracle

I am trying to transfer data from my Oracle database to my HBase table using Sqoop, and I am able to do that successfully with the Java Sqoop client.
However, in this case I am only doing a straight transfer, always using hbase_row_key as "COL1,COL2".
What I want now is to decide on the hbase_row_key before the data is put into the HBase table: it should be "COL1,COL2" if COL2 is present, and "COL1,COL3" if COL2 is absent (assuming COL3 is always present).
I think using a custom mapper instead of the default one should do it, but I am not sure how to set that up with Sqoop. How can I make Sqoop use a custom mapper before inserting data into HBase?
Any help in this regard would be highly appreciated.
Thanks again!
Below is my Java Sqoop client code:
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;

public class TestSqoopClient {
    public static void main(String[] args) throws Exception {
        SqoopOptions options = new SqoopOptions();
        options.setConnectString("my_database_connection_string");
        options.setUsername("my_user");
        options.setPassword("my_password");
        options.setNumMappers(2); // Default value is 4
        //options.setSqlQuery("SELECT * FROM user_logs WHERE $CONDITIONS limit 10");
        options.setTableName("my_tablename");
        options.setWhereClause("my_where_condition");
        options.setSplitByCol("my_split_column");

        // HBase options
        options.setHBaseTable("my_hbase_table_name");
        options.setHBaseColFamily("my_column_family");
        options.setCreateHBaseTable(false); // Do not create the HBase table; it is expected to exist
        options.setHBaseRowKeyColumn("COL1,COL2");

        int ret = new ImportTool().run(options);
    }
}

Have a look at the HBase serialization extensions described at http://sqoop.apache.org/docs/1.4.6/SqoopDevGuide.html#_hbase_serialization_extensions and write a custom PutTransformer.
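For illustration, here is a minimal sketch of such a transformer, assuming Sqoop 1.4.x's org.apache.sqoop.hbase.PutTransformer API and a pre-2.0 HBase client. The class name ConditionalRowKeyTransformer, the "_" key separator, and the sqoop.hbase.insert.put.transformer.class property name are assumptions made for the example and should be verified against your Sqoop version:

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;

/* Hypothetical transformer: row key is COL1_COL2 when COL2 is present,
   otherwise COL1_COL3 (COL3 assumed always present). */
public class ConditionalRowKeyTransformer extends PutTransformer {

    @Override
    public List<Put> getPutCommand(Map<String, Object> fields) throws IOException {
        Object col1 = fields.get("COL1");
        Object col2 = fields.get("COL2");
        Object col3 = fields.get("COL3");
        String rowKey = (col2 != null) ? col1 + "_" + col2 : col1 + "_" + col3;

        Put put = new Put(Bytes.toBytes(rowKey));
        byte[] family = Bytes.toBytes(getColumnFamily());
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (e.getValue() != null) {
                // pre-2.0 HBase API; newer clients use addColumn() instead
                put.add(family, Bytes.toBytes(e.getKey()),
                        Bytes.toBytes(e.getValue().toString()));
            }
        }
        return Collections.singletonList(put);
    }
}

The transformer would then be wired in through the Hadoop configuration (e.g. options.getConf().set("sqoop.hbase.insert.put.transformer.class", ConditionalRowKeyTransformer.class.getName()) in the Java client), so no custom mapper is needed.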

Related

Integrate key-value database with Spark

I'm having trouble understanding how Spark interacts with storage.
I would like to make a Spark cluster that fetches data from a RocksDB database (or any other key-value store). However, at this moment, the best I can do is fetch the whole dataset from the database into memory in each of the cluster nodes (into a map for example) and build an RDD from that object.
What do I have to do to fetch only the necessary data (like Spark does with HDFS)? I've read about Hadoop Input Format and Record Readers, but I'm not completely grasping what I should implement.
I know this is kind of a broad question, but I would really appreciate some help to get me started. Thank you in advance.
Here is one possible solution. I assume you have a client library for the key-value store (RocksDB in your case) that you want to access.
KeyValuePair is a bean class representing one key-value pair from your key-value store.
Classes
/* Lazy iterator to read from the key-value store */
class KeyValueIterator implements Iterator<KeyValuePair> {
    public KeyValueIterator() {
        //TODO initialize your custom reader using the Java client library
    }

    @Override
    public boolean hasNext() {
        //TODO
    }

    @Override
    public KeyValuePair next() {
        //TODO
    }
}

class KeyValueReader implements FlatMapFunction<KeyValuePair, KeyValuePair> {
    @Override
    public Iterator<KeyValuePair> call(KeyValuePair keyValuePair) throws Exception {
        // ignore the empty 'keyValuePair' object
        return new KeyValueIterator();
    }
}
Create KeyValue RDD
/*list with a dummy KeyValuePair instance*/
ArrayList<KeyValuePair> keyValuePairs = new ArrayList<>();
keyValuePairs.add(new KeyValuePair());
JavaRDD<KeyValuePair> keyValuePairRDD = javaSparkContext.parallelize(keyValuePairs);
/*Read one key-value pair at a time lazily*/
keyValuePairRDD = keyValuePairRDD.flatMap(new KeyValueReader());
Note:
The above solution creates an RDD with two partitions by default (one of them will be empty). Increase the number of partitions before applying any transformation on keyValuePairRDD, to distribute the processing across executors.
Different ways to increase partitions:
keyValuePairRDD.repartition(partitionCounts)
//OR
keyValuePairRDD.partitionBy(...)
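For concreteness, here is a hedged sketch of how KeyValueIterator's TODOs could be filled in with the RocksDB Java client (org.rocksdb). The database path and the KeyValuePair(byte[], byte[]) constructor are assumptions for the example, and resource cleanup is omitted:

import java.util.Iterator;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

class KeyValueIterator implements Iterator<KeyValuePair> {
    private RocksIterator it;

    public KeyValueIterator() {
        RocksDB.loadLibrary();
        try {
            // open read-only on the executor; "/path/to/rocksdb" is a placeholder
            RocksDB db = RocksDB.openReadOnly(new Options(), "/path/to/rocksdb");
            it = db.newIterator();
            it.seekToFirst();
        } catch (RocksDBException e) {
            throw new RuntimeException("Could not open RocksDB", e);
        }
    }

    @Override
    public boolean hasNext() {
        return it.isValid();
    }

    @Override
    public KeyValuePair next() {
        // assumes a KeyValuePair(byte[] key, byte[] value) constructor
        KeyValuePair pair = new KeyValuePair(it.key(), it.value());
        it.next();
        return pair;
    }
}

Because the iterator is created inside KeyValueReader.call(), the database is opened lazily on the executor rather than serialized from the driver.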

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark.
Here is the code:
JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("MySQLtoEs")
                .set("es.index.auto.create", "true")
                .set("es.nodes", "127.0.0.1:9200")
                .set("es.mapping.id", "id")
                .set("spark.serializer", KryoSerializer.class.getName()));
SQLContext sqlContext = new SQLContext(sc);

// Data source options
Map<String, String> options = new HashMap<>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "OFFERS");
options.put("partitionColumn", "id");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");

// Load MySQL query result as DataFrame
LOGGER.info("Loading DataFrame");
DataFrame jdbcDF = sqlContext.load("jdbc", options);
DataFrame df = jdbcDF.select("id", "title", "description",
        "merchantId", "price", "keywords", "brandId", "categoryId");
df.show();
LOGGER.info("df.count : " + df.count());
EsSparkSQL.saveToEs(df, "offers/product");
You can see the code is very straightforward. It reads the data into a DataFrame, selects some columns and then performs a count as a basic action on the Dataframe. Everything works fine up to this point.
Then it tries to save the data into Elasticsearch, but it fails because it cannot handle some type. You can see the error log here.
I'm not sure about why it can't handle that type. Does anyone know why this is occurring?
I'm using Apache Spark 1.5.0, Elasticsearch 1.4.4 and elasticsearch-hadoop 2.1.1.
EDIT:
I have updated the gist link with a sample dataset along with the source code.
I have also tried to use the elasticsearch-hadoop dev builds as mentioned by @costin on the mailing list.
The answer for this one was tricky, but thanks to samklr, I have managed to figure out what the problem was.
Nevertheless, the solution isn't straightforward and involves some "unnecessary" transformations.
First, let's talk about serialization.
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions. In this case, it's about data serialization and thus deserialization.
From Spark’s perspective, the only thing required is setting up serialization - Spark relies by default on Java serialization which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types - namely Writables. As such, InputFormat and OutputFormats are required to return Writables which, out of the box, Spark does not understand.
With the elasticsearch-spark connector one must enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
Since Kryo does not require that a class implement a particular interface to be serialized, POJOs can be used in RDDs without any further work beyond enabling Kryo serialization.
That said, @samklr pointed out to me that Kryo needs classes to be registered before using them.
This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered, but the full classname otherwise. Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward. Create a class that implements KryoRegistrator and override the registerClasses() method:
public class MyKryoRegistrator implements KryoRegistrator, Serializable {
    @Override
    public void registerClasses(Kryo kryo) {
        // Product POJO associated to a product Row from the DataFrame
        kryo.register(Product.class);
    }
}
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "MyKryoRegistrator")
Secondly, even though the Kryo serializer was set and the class registered, with the changes made in Spark 1.5, and for some reason, Elasticsearch couldn't deserialize the DataFrame because it can't infer the SchemaType of the DataFrame in the connector.
So I had to convert the DataFrame to a JavaRDD:
JavaRDD<Product> products = df.javaRDD().map(new Function<Row, Product>() {
    public Product call(Row row) throws Exception {
        long id = row.getLong(0);
        String title = row.getString(1);
        String description = row.getString(2);
        int merchantId = row.getInt(3);
        double price = row.getDecimal(4).doubleValue();
        String keywords = row.getString(5);
        long brandId = row.getLong(6);
        int categoryId = row.getInt(7);
        return new Product(id, title, description, merchantId, price, keywords, brandId, categoryId);
    }
});
Now the data is ready to be written into Elasticsearch:
JavaEsSpark.saveToEs(products, "test/test");
References:
Elasticsearch's Apache Spark support documentation.
Hadoop: The Definitive Guide, 4th ed., Chapter 19 (Spark) – Tom White.
User samklr.

LeaseExpiredException with custom UDF in Hive

I have a Hive UDF which is supposed to extract the device from a UA string. It uses the ua-parser library:
https://github.com/tobie/ua-parser
The UDF is rather simple:
import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import ua_parser.Client;
import ua_parser.Parser;

public class DeviceTypeExtractTest extends UDF {

    private Text result = new Text();
    private static final Parser uaParser;

    static {
        try {
            uaParser = new Parser();
        }
        catch (IOException e) {
            throw new RuntimeException("Could not instantiate User-Agent parser.");
        }
    }

    public Text evaluate(Text uaField) {
        if (uaField == null) {
            return null;
        }
        try {
            String uaString = uaField.toString();
            Client client = uaParser.parse(uaString);
            result.set(client.device.family);
            return result;
        }
        catch (Exception e) {
            return null;
        }
    }
}
And it works just fine when run on a small dataset.
create table categories(
cat string);
insert overwrite table categories select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;
However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt:
http://pastebin.com/yK6Qmx6r
And my map and reduce processes remain stuck at 0% for hours. Note that if I take out this UDF and use some internal Hive UDFs just for testing, this behavior does not occur.
I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).
I tried increasing the performance of the cluster by deploying better hardware, but I get the same problem with any hardware scenario.
Any ideas?
Okay, scratch that. It seems that after upgrading my instance, things started to move along, but I was just not waiting long enough for the mapping to happen. And the LeaseExpiredException was actually thrown because of little ol' me killing the processes.
Still, the parsing is taking an immense amount of time and I would love some suggestions to further optimize this UDF.

How to pass Hive conf variable in hive udf?

I want to pass a Hive conf variable to a Hive UDF.
Below is a code snippet:
hive -f ../hive/testHive.sql -hivevar testArg=${testArg}
Below is hive UDF call.
select setUserDefinedValueForColumn(columnName,'${testArg}') from testTable;
In the UDF I am getting the value of testArg as null.
Please advise me on how to use a Hive conf variable in a UDF and how to access the Hive configuration in a Hive UDF.
I think that you should pass the Hive variable as a 'hiveconf' using the command below:
hive --hiveconf testArg="my test args" -f ../hive/testHive.sql
Then you may have the code below inside a GenericUDF evaluate() method:
@Override
public Object evaluate(DeferredObject[] args) throws HiveException {
    String myconf = null;
    SessionState ss = SessionState.get();
    if (ss != null) {
        HiveConf conf = ss.getConf();
        myconf = conf.get("testArg");
        System.out.println("sysout.myconf: " + myconf);
    }
    return myconf;
}
The code is tested on Hive 1.2.
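For context, here is a minimal sketch of how that evaluate() body might sit inside a complete GenericUDF; the class name GetHiveConfUDF and the plain-string return type are choices made for this example, not part of the original answer:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.session.SessionState;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class GetHiveConfUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // the UDF returns the variable's value as a plain string
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        SessionState ss = SessionState.get();
        return (ss != null) ? ss.getConf().get("testArg") : null;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "getHiveConf()";
    }
}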
You can't pass a Hive variable directly to a view by using ${hiveconf:testArg} in the view code, because during view creation Hive will substitute the exact value of the variable, so the view will be static.
The only option is to use a UDF to access the Hive variable.
You can use GenericUDF. It has a configure() method which takes a MapredContext as a parameter. So you need to override configure() in your GenericUDF, like:
public void configure(MapredContext context) {
    yourVar = context.getJobConf().get("hive_variable");
}
This is only called at runtime of a MapRedTask.
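A hedged sketch of how that override might look inside a complete GenericUDF; the class name HiveVarUDF is made up for the example, and "hive_variable" echoes the snippet above:

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class HiveVarUDF extends GenericUDF {

    private String yourVar;

    @Override
    public void configure(MapredContext context) {
        // only invoked when the UDF runs inside a MapRedTask
        yourVar = context.getJobConf().get("hive_variable");
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        // null when configure() was never called, e.g. in local-only execution
        return yourVar;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "hiveVarUDF()";
    }
}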

Pig: Perform task on completion of UDF

In Hadoop I have a Reducer that looks like this, transforming data from a prior mapper into a series of files of a type that is not InputFormat-compatible.
private LocalDatabase ld;

protected void setup(Context context) {
    ld = new LocalDatabase("localFilePath");
}

protected void reduce(BytesWritable key, Text value, Context context) {
    ld.addValue(key, value);
}

protected void cleanup(Context context) {
    saveLocalDatabaseInHDFS(ld);
}
I am rewriting my application in Pig and can't figure out how this would be done in a Pig UDF, as there's no cleanup function or anything else to denote when the UDF has finished running. How can this be done in Pig?
I would say you'd need to write a StoreFunc UDF wrapping your own custom OutputFormat - then you'd have the ability to close things out in the OutputFormat's RecordWriter.close() method.
This will create a database in HDFS for each reducer, however, so if you want everything in a single file you'd need to run with a single reducer, or run a secondary step to merge the databases together.
http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
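For illustration, a rough sketch of the wrapped OutputFormat, where the RecordWriter's close() plays the role of the Reducer's cleanup(). LocalDatabase and saveLocalDatabaseInHDFS are the question's own placeholders, and the class name is made up; a StoreFunc subclass would return this from getOutputFormat() and forward each tuple to the writer in putNext():

import java.io.IOException;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.pig.data.Tuple;

public class LocalDatabaseOutputFormat extends OutputFormat<Object, Tuple> {

    @Override
    public RecordWriter<Object, Tuple> getRecordWriter(TaskAttemptContext context) {
        return new RecordWriter<Object, Tuple>() {
            // placeholder class carried over from the question
            private final LocalDatabase ld = new LocalDatabase("localFilePath");

            @Override
            public void write(Object key, Tuple value) throws IOException {
                ld.addValue(key, value);
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                // this is the "cleanup" moment the question is after
                saveLocalDatabaseInHDFS(ld);
            }
        };
    }

    @Override
    public void checkOutputSpecs(JobContext context) { }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // reuse a no-op committer rather than writing one
        return new NullOutputFormat<Object, Tuple>().getOutputCommitter(context);
    }
}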
If you want something to run at the end of your UDF, use the finish() call. This will be called after all records have been processed by your UDF. It will be called once per mapper or reducer, the same as the cleanup call in your reducer.
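As a small illustration of the finish() approach, assuming an EvalFunc along the lines of the Reducer above (LocalDatabase and saveLocalDatabaseInHDFS remain the question's placeholders, and the class name is made up):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class LocalDatabaseWriter extends EvalFunc<String> {

    // built up across all input tuples of this map or reduce task
    private final LocalDatabase ld = new LocalDatabase("localFilePath");

    @Override
    public String exec(Tuple input) throws IOException {
        ld.addValue(input.get(0), input.get(1));
        return null;
    }

    @Override
    public void finish() {
        // called once per mapper/reducer after the last record, like cleanup()
        saveLocalDatabaseInHDFS(ld);
    }
}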
