I want to pass a Hive conf variable to a Hive UDF.
Below is a code snippet.
hive -f ../hive/testHive.sql -hivevar testArg=${testArg}
Below is the Hive UDF call.
select setUserDefinedValueForColumn(columnName,'${testArg}') from testTable;
In the UDF I am getting the value of testArg as null.
Please advise me on how to use a Hive conf variable in a UDF and how to access the Hive configuration from within a Hive UDF.
I think you should pass the Hive variable as a 'hiveconf' variable using the command below:
hive --hiveconf testArg="my test args" -f ../hive/testHive.sql
Then you may have the code below inside a GenericUDF evaluate() method:
@Override
public Object evaluate(DeferredObject[] args) throws HiveException {
    String myconf = null;
    SessionState ss = SessionState.get();
    if (ss != null) {
        HiveConf conf = ss.getConf();
        myconf = conf.get("testArg");
        System.out.println("sysout.myconf:" + myconf);
    }
    // Return something that matches the ObjectInspector declared in initialize()
    return myconf == null ? null : new Text(myconf);
}
The code was tested on Hive 1.2.
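Note that the substitution syntax inside the script changes as well: with --hiveconf, the '${testArg}' reference from the question would be written as '${hiveconf:testArg}' if you also want Hive to substitute the value directly into testHive.sql.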
You can't pass a Hive variable to a view by using ${hiveconf:testArg} in the view code, because during view creation Hive substitutes the exact value of the variable, so the view will be static.
The only option is to use a UDF to access the Hive variable:
You can use a GenericUDF. It has a configure method which takes a MapredContext as a parameter. So you need to override the configure method in your GenericUDF, like:
@Override
public void configure(MapredContext context) {
    yourVar = context.getJobConf().get("hive_variable");
}
Note that configure() is only called at runtime of a MapRedTask.
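For illustration, here is a minimal sketch of what such a GenericUDF could look like end to end; the class name and the "hive_variable" key are just placeholders taken from the snippet above, and the remaining GenericUDF lifecycle methods are filled in with the simplest possible bodies:
import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

public class GetHiveVariable extends GenericUDF {

    private String yourVar;

    @Override
    public void configure(MapredContext context) {
        // Runs on the task side once the MapReduce job starts
        yourVar = context.getJobConf().get("hive_variable");
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) {
        // The UDF returns a string
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        return yourVar == null ? null : new Text(yourVar);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "get_hive_variable()";
    }
}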
I am attempting to execute a stored procedure from Spring Batch; the stored procedure has two parameters, an IN parameter and an OUT parameter. What I want is to get the result set and the OUT parameter when the stored procedure is called.
I referred to StoredProcedureItemReader and StoredProcedureItemReaderBuilder.
I can use these to call a stored procedure that has only an IN parameter; however, I can't make the call after registering an OUT parameter.
With the raw JDBC API it is possible to call a stored procedure with IN and OUT parameters using a CallableStatement (https://docs.oracle.com/javase/7/docs/api/java/sql/CallableStatement.html).
And I believe that StoredProcedureItemReader / StoredProcedureItemReaderBuilder uses a CallableStatement behind the scenes.
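For reference, this is roughly what I mean in plain JDBC (sketch only; connection and exception handling omitted):
CallableStatement cs = connection.prepareCall("{call GetNameCountByFname(?, ?)}");
cs.setString(1, "bob");                    // IN parameter
cs.registerOutParameter(2, Types.INTEGER); // OUT parameter
cs.execute();
int total = cs.getInt(2);                  // read the OUT value after execution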
My question is: how do I register OUT parameters within Spring Batch using StoredProcedureItemReaderBuilder?
Here's some sample code I tried:
@StepScope
@Bean
public StoredProcedureItemReader<MyRow> rowReader(@Value("#{stepExecutionContext[tableName]}") String tableName) {
    return new StoredProcedureItemReaderBuilder<MyRow>()
            .procedureName("GetNameCountByFname")
            .parameters(new SqlParameter[]{
                    new SqlParameter("fname", Types.VARCHAR),
                    new SqlOutParameter("total", Types.INTEGER)})
            .preparedStatementSetter(new PreparedStatementSetter() {
                @Override
                public void setValues(PreparedStatement ps) throws SQLException {
                    ps.setString(1, "bob");
                }
            })
            .rowMapper(new MyRowMapper(tableName))
            .name(tableName + "_read")
            .dataSource(dataSource)
            .build();
}
The following error is given:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:422) ~[na:1.8.0_251]
at java.util.ArrayList.get(ArrayList.java:435) ~[na:1.8.0_251]
at com.mysql.cj.jdbc.CallableStatement$CallableStatementParamInfo.getParameter(CallableStatement.java:283) ~[mysql-connector-java-8.0.20.jar:8.0.20]
at com.mysql.cj.jdbc.CallableStatement.checkIsOutputParam(CallableStatement.java:634) ~[mysql-connector-java-8.0.20.jar:8.0.20]
at com.mysql.cj.jdbc.CallableStatement.getObject(CallableStatement.java:1356) ~[mysql-connector-java-8.0.20.jar:8.0.20]
The following stored procedure is called
DELIMITER $$
CREATE PROCEDURE GetNameCountByFname(
IN fname VARCHAR(25),
OUT total INT
)
BEGIN
SELECT COUNT(*)
INTO total
FROM `first`
WHERE `name` = fname;
END$$
DELIMITER ;
My question is: how do I register OUT parameters within Spring Batch using StoredProcedureItemReaderBuilder?
That's not possible. This feature has already been requested but was rejected. Please find more details in https://github.com/spring-projects/spring-batch/issues/2024.
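As a possible workaround (this is only a sketch using Spring's standard JDBC support, not something the reader itself can do), the procedure can be called outside the reader and the OUT parameter read from the result map, for example with SimpleJdbcCall:
SimpleJdbcCall call = new SimpleJdbcCall(dataSource)
        .withProcedureName("GetNameCountByFname")
        .declareParameters(
                new SqlParameter("fname", Types.VARCHAR),
                new SqlOutParameter("total", Types.INTEGER));
// The returned map contains the OUT parameter under its declared name
Map<String, Object> out = call.execute(new MapSqlParameterSource().addValue("fname", "bob"));
Integer total = (Integer) out.get("total");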
I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark.
Here is the code:
JavaSparkContext sc = new JavaSparkContext(
new SparkConf().setAppName("MySQLtoEs")
.set("es.index.auto.create", "true")
.set("es.nodes", "127.0.0.1:9200")
.set("es.mapping.id", "id")
.set("spark.serializer", KryoSerializer.class.getName()));
SQLContext sqlContext = new SQLContext(sc);
// Data source options
Map<String, String> options = new HashMap<>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "OFFERS");
options.put("partitionColumn", "id");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");
// Load MySQL query result as DataFrame
LOGGER.info("Loading DataFrame");
DataFrame jdbcDF = sqlContext.load("jdbc", options);
DataFrame df = jdbcDF.select("id", "title", "description",
"merchantId", "price", "keywords", "brandId", "categoryId");
df.show();
LOGGER.info("df.count : " + df.count());
EsSparkSQL.saveToEs(df, "offers/product");
You can see the code is very straightforward. It reads the data into a DataFrame, selects some columns and then performs a count as a basic action on the Dataframe. Everything works fine up to this point.
Then it tries to save the data into Elasticsearch, but it fails because it cannot handle some type. You can see the error log here.
I'm not sure about why it can't handle that type. Does anyone know why this is occurring?
I'm using Apache Spark 1.5.0, Elasticsearch 1.4.4 and elasticsearch-hadoop 2.1.1.
EDIT:
I have updated the gist link with a sample dataset along with the source code.
I have also tried to use the elasticsearch-hadoop dev builds as mentioned by @costin on the mailing list.
The answer for this one was tricky, but thanks to samklr, I have managed to figure out what the problem was.
Nevertheless, the solution isn't straightforward and involves some seemingly "unnecessary" transformations.
First let's talk about Serialization.
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions. In this case, it's about data serialization and thus deserialization.
From Spark's perspective, the only thing required is setting up serialization - Spark relies by default on Java serialization, which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types - namely Writables. As such, InputFormats and OutputFormats are required to return Writables which, out of the box, Spark does not understand.
With the elasticsearch-spark connector one must enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
Better yet, Kryo does not require that a class implement a particular interface to be serialized, which means POJOs can be used in RDDs without any further work beyond enabling Kryo serialization.
That said, @samklr pointed out to me that Kryo needs to register classes before using them.
This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered but is the full classname otherwise. Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward: create a class that implements KryoRegistrator and override the registerClasses() method:
public class MyKryoRegistrator implements KryoRegistrator, Serializable {
    @Override
    public void registerClasses(Kryo kryo) {
        // Product POJO associated to a product Row from the DataFrame
        kryo.register(Product.class);
    }
}
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "MyKryoRegistrator")
Secondly, even though the Kryo serializer is set and the class registered, with the changes made in Spark 1.5, for some reason the Elasticsearch connector couldn't deserialize the DataFrame, because it couldn't infer the SchemaType of the DataFrame for the connector.
So I had to convert the DataFrame to a JavaRDD:
JavaRDD<Product> products = df.javaRDD().map(new Function<Row, Product>() {
public Product call(Row row) throws Exception {
long id = row.getLong(0);
String title = row.getString(1);
String description = row.getString(2);
int merchantId = row.getInt(3);
double price = row.getDecimal(4).doubleValue();
String keywords = row.getString(5);
long brandId = row.getLong(6);
int categoryId = row.getInt(7);
return new Product(id, title, description, merchantId, price, keywords, brandId, categoryId);
}
});
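For completeness, here is a minimal sketch of the Product POJO assumed by the registrator and the mapping above (it isn't shown in the original code; field names and types are inferred from the Row accessors):
public class Product implements Serializable {
    private long id;
    private String title;
    private String description;
    private int merchantId;
    private double price;
    private String keywords;
    private long brandId;
    private int categoryId;

    public Product() {} // no-arg constructor for serialization

    public Product(long id, String title, String description, int merchantId,
                   double price, String keywords, long brandId, int categoryId) {
        this.id = id;
        this.title = title;
        this.description = description;
        this.merchantId = merchantId;
        this.price = price;
        this.keywords = keywords;
        this.brandId = brandId;
        this.categoryId = categoryId;
    }

    // Getters and setters omitted for brevity; elasticsearch-hadoop uses them to map the fields.
}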
Now the data is ready to be written into Elasticsearch:
JavaEsSpark.saveToEs(products, "test/test");
References:
Elasticsearch's Apache Spark support documentation.
Hadoop: The Definitive Guide, 4th edition, Chapter 19 (Spark) – Tom White.
User samklr.
I am trying to transfer my data, which is in an Oracle database, to my HBase table using Sqoop. I am able to do that successfully using the Java Sqoop client.
However, in this case I am doing just the transfer and always using "COL1,COL2" as the hbase_row_key.
What I want to do now is decide on the hbase_row_key before I put the data into the HBase table: it should be "COL1,COL2" if COL2 is present, and if COL2 is absent the hbase_row_key should be "COL1,COL3" (assuming COL3 is always present).
I think using a custom mapper instead of the default mapper should do it, but I am not sure how to do that with Sqoop. How can I make Sqoop use a custom mapper before inserting data into HBase?
Any help in this regard would be highly appreciated.
Thanks again!
Below is my Java Sqoop client code:
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;
public class TestSqoopClient {
public static void main(String[] args) throws Exception {
SqoopOptions options = new SqoopOptions();
options.setConnectString("my_database_connection_tring");
options.setUsername("my_user");
options.setPassword("my_password");
options.setNumMappers(2); // Default value is 4
//options.setSqlQuery("SELECT * FROM user_logs WHERE $CONDITIONS limit 10");
options.setTableName("my_tablename");
options.setWhereClause("my_where_condition");
options.setSplitByCol("my_split_column");
// HBase options
options.setHBaseTable("my_hbase_table_name");
options.setHBaseColFamily("my_column_family");
options.setCreateHBaseTable(false); // Create HBase table, if it does not exist
options.setHBaseRowKeyColumn("COL1,COL2");
int ret = new ImportTool().run(options);
}
}
Have a look at extending Sqoop's HBase support by writing a custom PutTransformer, as described at http://sqoop.apache.org/docs/1.4.6/SqoopDevGuide.html#_hbase_serialization_extensions.
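A rough sketch of what such a transformer could look like (the COL1/COL2/COL3 names and the key format are just the example from the question, and the exact Sqoop/HBase APIs may differ slightly between versions):
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;

public class ConditionalRowKeyPutTransformer extends PutTransformer {

    @Override
    public List<Put> getPutCommand(Map<String, Object> fields) throws IOException {
        Object col1 = fields.get("COL1");
        Object col2 = fields.get("COL2");
        Object col3 = fields.get("COL3");

        // Row key is COL1,COL2 when COL2 is present, otherwise COL1,COL3
        String rowKey = (col2 != null) ? col1 + "," + col2 : col1 + "," + col3;

        Put put = new Put(Bytes.toBytes(rowKey));
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            if (field.getValue() != null) {
                put.addColumn(Bytes.toBytes(getColumnFamily()), // use put.add(...) on older HBase
                        Bytes.toBytes(field.getKey()),
                        Bytes.toBytes(field.getValue().toString()));
            }
        }
        return Collections.singletonList(put);
    }
}
The transformer is then registered through the Sqoop job configuration (the property is sqoop.hbase.insert.put.transformer.class in the Sqoop versions I'm aware of; check the developer guide linked above for your version).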
I have a Hive UDF which is supposed to extract the device from a UA string. It uses the ua-parser library:
https://github.com/tobie/ua-parser
The UDF is rather simple:
import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import ua_parser.Client;
import ua_parser.Parser;

public class DeviceTypeExtractTest extends UDF {

    private Text result = new Text();
    private static final Parser uaParser;

    static {
        try {
            uaParser = new Parser();
        } catch (IOException e) {
            throw new RuntimeException("Could not instantiate User-Agent parser.");
        }
    }

    public Text evaluate(Text uaField) {
        if (uaField == null) {
            return null;
        }
        try {
            String uaString = uaField.toString();
            Client client = uaParser.parse(uaString);
            result.set(client.device.family);
            return result;
        } catch (Exception e) {
            return null;
        }
    }
}
And it works just fine when run on a small dataset.
create table categories(
cat string);
insert overwrite table categories select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;
However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt:
http://pastebin.com/yK6Qmx6r
And my map and reduce processes remain stuck at 0% for hours. Note that if I take out this UDF and use some built-in Hive UDFs just for testing, this behavior does not occur.
I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).
I tried increasing the performance of the cluster by deploying better hardware, but I get the same problem with any hardware scenario.
Any ideas?
Okay, scratch that. It seems that after upgrading my instance, things started moving along, but I was just not waiting long enough for the mapping to happen. And the LeaseExpiredException was actually thrown because of little ol' me killing the processes.
Still, the parsing is taking an immense amount of time and I would love some suggestions to further optimize this UDF.
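One suggestion that is not from the original post, just a hedged sketch: since user-agent strings repeat heavily in web logs, memoizing the parse results can cut the work down dramatically. Assuming the uaParser and result fields from the UDF above (and java.util.LinkedHashMap / java.util.Map imports), a bounded LRU cache could look like this:
private static final int MAX_CACHE_SIZE = 10000;

// Access-ordered LinkedHashMap acting as a simple LRU cache of UA string -> device family
private final Map<String, String> deviceCache =
        new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_CACHE_SIZE;
            }
        };

public Text evaluate(Text uaField) {
    if (uaField == null) {
        return null;
    }
    String uaString = uaField.toString();
    String device = deviceCache.get(uaString);
    if (device == null) {
        try {
            device = uaParser.parse(uaString).device.family;
        } catch (Exception e) {
            return null;
        }
        deviceCache.put(uaString, device);
    }
    result.set(device);
    return result;
}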
In Hadoop I have a Reducer that looks like this, which transforms data from a prior mapper into a series of files of a non-InputFormat-compatible type.
private LocalDatabase ld;

protected void setup(Context context) {
    ld = new LocalDatabase("localFilePath");
}

protected void reduce(BytesWritable key, Iterable<Text> values, Context context) {
    for (Text value : values) {
        ld.addValue(key, value);
    }
}

protected void cleanup(Context context) {
    saveLocalDatabaseInHDFS(ld);
}
I am rewriting my application in Pig, and I can't figure out how this would be done in a Pig UDF, as there's no cleanup function or anything else to denote when the UDF has finished running. How can this be done in Pig?
I would say you'd need to write a StoreFunc UDF wrapping your own custom OutputFormat - then you'd have the ability to close out in the OutputFormat's RecordWriter.close() method.
This will create a database in HDFS for each reducer, however, so if you want everything in a single file you'd need to run with a single reducer or run a secondary step to merge the databases together.
http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
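As an illustration only, the RecordWriter returned by that custom OutputFormat could do the cleanup-equivalent work in close(); a sketch reusing the hypothetical LocalDatabase / saveLocalDatabaseInHDFS helpers from the question:
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class LocalDatabaseRecordWriter extends RecordWriter<BytesWritable, Text> {

    private final LocalDatabase ld = new LocalDatabase("localFilePath");

    @Override
    public void write(BytesWritable key, Text value) {
        ld.addValue(key, value);
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        // Equivalent of the Reducer's cleanup(): persist the finished database to HDFS
        saveLocalDatabaseInHDFS(ld);
    }
}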
If you want something to run at the end of your UDF, use the finish() call. This will be called after all records have been processed by your UDF. It will be called once per mapper or reducer, the same as the cleanup call in your reducer.
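For the finish() route, a minimal sketch of an EvalFunc, again reusing the hypothetical helpers from the question:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class LocalDatabaseEval extends EvalFunc<String> {

    private LocalDatabase ld;

    @Override
    public String exec(Tuple input) throws IOException {
        if (ld == null) {
            // Lazily create the database once per map or reduce task
            ld = new LocalDatabase("localFilePath");
        }
        ld.addValue(input.get(0), input.get(1));
        return null;
    }

    @Override
    public void finish() {
        // Called once after all records have been processed by this task
        saveLocalDatabaseInHDFS(ld);
    }
}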