Is that possible to save values generated by Hive UDTF? - hadoop

I created Hive custom UDTF. My new UDTF returns new 3 columns based on input 2 columns.
But, I can't any INSERT operation by using generated values.
For example,
INSERT OVERWRITE DIRECTORY 'generated_data.csv' SELECT udtf(one, two) FROM table_orig;
INSERT OVERWRITE TABLE test_table SELECT udtf(one, two) FROM table_orig;
Both of INSERT queries returns NullPointerException like following:
2017-05-30T08:02:45,209 ERROR [main([])]: exec.Task (:()) - Failed to execute tez graph.
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.canWorkWithSameSession(TezSessionPoolManager.java:430)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:451)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:396)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:134)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2073)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1744)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1453)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1171)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1161)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Any suggestion is welcome, thank you !

Add columns list after UDTF:
INSERT OVERWRITE TABLE test_table SELECT udtf(one, two) as (col1, col2, col3) FROM table_orig;

Related

Fetching the sequence number from an Oracle database using jdbc .load()

I have a sequence number in my Oracle database that I am trying to fetch and increment using the following query: SELECT TABLE.SEQ.NEXTVAL FROM DUAL where TABLE is the table and SEQ is the sequence. It gets the value and increments it in DBeaver. However, when I run the following code:
QUERY = "SELECT TABLE.SEQ.NEXTVAL FROM DUAL"
sqlContext.read.format("jdbc").options(url=URL, driver=DRIVER,QUERY=QUERY, user=USER,password=PASSWORD).load()
It gives me error:
py4j.protocol.Py4JJavaError: An error occurred while calling o98.load.
: java.sql.SQLSyntaxErrorException: ORA-02287: sequence number not allowed here
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:886)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1175)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1296)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3613)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3657)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1495)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
The solution is to not query the sequence. Starting with 12.2 Oracle allows an identity column, where the ID is assigned in the INSERTstatement without a need of querying the nextval.
Unfortunately you can't modify existing primary key column to identity column, but you may workaroud the problem of getting the sequence value by setting the sequence as a DEFALULT value of the column - see here for further details.
Example - Table with PK Colun ID filled from sequence SEQ
CREATE TABLE TAB
(id INTEGER,
col1 VARCHAR2(5),
PRIMARY KEY (id));
select * from tab;
ID COL1
---------- -----
1 y
2 z
So the nextvalof the sequence is 3
alter table TAB modify id default SEQ.nextval;
Now you may insert new records in the table and the PK will be assigned from the sequence automatically
insert into tab (col1) values('u');
select * from tab;
ID COL1
---------- -----
1 y
2 z
3 u
You may still use the (same) sequence value explicite in the insert statement (as if you provide a value the DEFAULTclause is not firing).
insert into tab(id,col1) values(seq.nextval,'c');
If you want to assign a sequence nextval in case your application passes NULL such as
insert into tab(id,col1) values(null,'d');
you must use the default on null option
alter table TAB modify id default on null SEQ.nextval;

Unable to create hive table with constraints in Hive 2.3.0

I am unable to create table with constraints just like primary key or not null. Without constraints I can create table sucessfully.
I found that Hive support primary keys/foreign keys constraint as part of create table command in 2.1.0 and my version is 2.3.0. Following is the example code:
create table test3(a int primary key)
and this returns me the following error message:
MismatchedTokenException(221!=347)
at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:617)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:6179)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:3808)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2382)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1333)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1316)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:25 mismatched input 'primary' expecting ) near 'int' in create table statement
I am using Hive 2.3.0 and Hadoop 2.7.3.
You will have to create the PRIMARY KEY using the below command
CREATE TABLE TEST3(A INT
PRIMARY KEY(A) disable novalidate);
Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive.

Creating hive table with columns from a text file

I am trying to create a table in Hive from a txt file using a shell script in this format.
My t_cols.txt has data as below:
id string, name string, city string, lpd timestamp
I want to create hive table whose columns should be coming from this text file.
This is how my shell script looks like:
table_cols=`cat t_cols.txt`
hive --hiveconf t_name=${table_cols} -e 'create table leap_frog_snapshot.LINKED_OBJ_TRACKING (\${hiveconf:t_name}) stored as orc tblproperties ("orc.compress"="SNAPPY");'
This is not working somehow.
I am getting the below error:
Logging initialized using configuration in file:/etc/hive/2.4.3.0-227/0/hive-log4j.properties
NoViableAltException(307#[])
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:38618)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:38375)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:38059)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:36183)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5222)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2648)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1658)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1117)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:432)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:316)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1202)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1250)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1129)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:168)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:379)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:314)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:412)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:428)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:717)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:624)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:60 cannot recognize input near ')' 'stored' 'as' in column type
Am I missing something?
If this is not the right thing to do, what is the correct way to achieve this?
This is because cat is giving only first word before a space into the variable.
Following should work
#!/bin/bash
hive --hiveconf t_name="`cat t_cols.txt`" -e 'create table leap_frog_snapshot.LINKED_OBJ_TRACKING (${hiveconf:t_name}) stored as orc tblproperties ("orc.compress"="SNAPPY") ; '

unable to create partitions in hive

I am unable to create the partition into a new table from the table which is already present on hive.
The query that I am running on hive after the Table creation is
INSERT INTO TABLE ba_data.PNR_INFO1_partitioned PARTITION(pnr_create_dt) select * from pnr_info1_external;
The error that I am getting is
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hive/warehouse/ba_data.db/pnr_info1_partitioned/.hive-staging_hive_2016-08-09_17-47-47_508_8688474345886508021-1/_task_tmp.-ext-10002/pnr_create_dt=18%2F12%2F2013/_tmp.000000_3 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
After I browsed and found that namenode,datanode folders needs to be deleted and namenode should be formatted.I have done that sanitary task as well.But still the same error I am getting.
Also I have set the replication factor to 1 and all the Hadoop processes are running well.
Please suggest me how to proceed in order to get away from this issue.Your suggestions are much appreciated.
first you need to create
1. table with all field
2. load data into table
3. create table with partition column with their type
4. copy data from first table to partition table
I think the dynamic partition needs to be enabled. following works.
set hive.exec.dynamic.partition.mode=nonstrict;
create table parttable (id int) partitioned by (partcolumn string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
;
create table source_table (id int,partcolumn string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
;
insert into source_table values (1,'Chicago');
insert into source_table values (2,'Chicago');
insert into source_table values (3,'Orlando');
set hive.exec.dynamic.partition=true;
insert overwrite table parttable partition(partcolumn)
select id,partcolumn from source_table;

Hive error when inserting data into a partitioned table

I'm using Hive 0.12.0 and I've created a partitioned table.
Then I try to insert the data into the table: LOAD DATA LOCAL INPATH 'path/data' insert into table test partition (idx=1)
But then I get the following error:
ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(134)) - NoSuchObjectException(message:partition values=[1])
at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionWithAuth(ObjectStore.java:1427)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:111)
at com.sun.proxy.$Proxy4.getPartitionWithAuth(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partition_with_auth(HiveMetaStore.java:2025)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:102)
at com.sun.proxy.$Proxy5.get_partition_with_auth(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partition_with_auth.getResult(ThriftHiveMetastore.java:6924)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partition_with_auth.getResult(ThriftHiveMetastore.java:6908)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:104)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
What's the solution for this?
You need to either pre-generate via add partition the partitions or use dynamic partitions.
Pre-generate partitions:
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
Using dynamic partitions:
Dynamic partitions

Resources