I am loading data from one table into another in Hive, and the properties of the new table differ from the original.
While loading I am facing the issue below. Any help to fix this?
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"mdse_item_i":671841,"co_loc_i":146,"persh_expr_d":"2014-05-01","greg_d":"2013-06-17","persh_oh_q":16.0,"crte_btch_i":765,"updt_btch_i":765,"range_n":"ITEM_LOC_DAY_PERSH_OH_INV_2013-04-01_2013-07-31"}
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:159)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"mdse_item_i":671841,"
My old table definition:
hive> describe nonclickstream.ITEM_LOC_DAY_PERSH_OH_INV;
OK
mdse_item_i int
co_loc_i int
persh_expr_d string
greg_d string
persh_oh_q double
crte_btch_i int
updt_btch_i int
range_n string
Time taken: 0.058 seconds
My new table definition is below:
hive> describe ITEM_LOC_DAY_PERSH_OH_INV;
OK
mdse_item_i int from deserializer
co_loc_i int from deserializer
persh_expr_d string from deserializer
greg_d string from deserializer
persh_oh_q string from deserializer
crte_btch_i int from deserializer
updt_btch_i int from deserializer
greg_date string
Time taken: 0.241 seconds
The new table is created with an Avro schema.
CREATE external TABLE ITEM_LOC_DAY_PERSH_OH_INV
partitioned by (greg_date string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
Location '/common/TD/INV_new/ITEM_LOC_DAY_PERSH_OH_INV/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///common/TD/INV_new/ITEM_LOC_DAY_PERSH_OH_INV/ITEM_LOC_DAY_PERSH_OH_INV.avs');
Load command we are using:
INSERT INTO TABLE ITEM_LOC_DAY_PERSH_OH_INV PARTITION (greg_date)
SELECT
mdse_item_i,
co_loc_i,
persh_expr_d,
greg_d,
persh_oh_q,
crte_btch_i,
updt_btch_i,
greg_d FROM nonclickstream.ITEM_LOC_DAY_PERSH_OH_INV where range_n='ITEM_LOC_DAY_PERSH_OH_INV_2013-04-01_2013-07-31';
We are using dynamic partitioning while loading.
What we are actually trying to do is re-partition the table on a different column while also modifying the schema.
The same approach worked for other tables, but for this table alone we are facing this issue.
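Comparing the two definitions, one difference that stands out is that persh_oh_q is a double in the old table but a string in the new Avro-backed table (and the new table has no range_n column). Below is a sketch of the load with an explicit cast, assuming that type mismatch is what the AvroSerDe is tripping over; the dynamic-partition settings are the usual ones and may already be set in the session:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE ITEM_LOC_DAY_PERSH_OH_INV PARTITION (greg_date)
SELECT
  mdse_item_i,
  co_loc_i,
  persh_expr_d,
  greg_d,
  CAST(persh_oh_q AS string),   -- double in the source table, string in the new Avro-backed table
  crte_btch_i,
  updt_btch_i,
  greg_d                        -- becomes the greg_date partition value
FROM nonclickstream.ITEM_LOC_DAY_PERSH_OH_INV
WHERE range_n = 'ITEM_LOC_DAY_PERSH_OH_INV_2013-04-01_2013-07-31';

If the .avs file actually declares persh_oh_q as a double, the cleaner fix may be to correct the Avro schema instead of casting.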
Related
I am trying to create a Hive table with a SerDe, but it always fails.
My table creation command -
CREATE TABLE products_info_raw(
id STRING,
name STRING,
reseller STRING,
category STRING,
price BIGINT,
discount FLOAT,
profit_percent FLOAT
)
PARTITIONED BY (
rptg_dt STRING
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
I added the JAR -
ADD jar /Users/<user>/Development/Hadoop/projects/e-commerce/hive-json-serde.jar;
that contains the necessary JsonSerde class -
META-INF/
META-INF/MANIFEST.MF
org/
org/apache/
org/apache/hadoop/
org/apache/hadoop/hive/
org/apache/hadoop/hive/contrib/
org/apache/hadoop/hive/contrib/serde2/
org/json/
org/apache/hadoop/hive/contrib/serde2/JsonSerde.class
org/apache/hadoop/hive/contrib/serde2/NewJson.class
org/json/CDL.class
org/json/Cookie.class
org/json/CookieList.class
org/json/HTTP.class
org/json/HTTPTokener.class
org/json/JSONArray.class
org/json/JSONException.class
org/json/JSONML.class
org/json/JSONObject$1.class
org/json/JSONObject$Null.class
org/json/JSONObject.class
org/json/JSONString.class
org/json/JSONStringer.class
org/json/JSONTokener.class
org/json/JSONWriter.class
org/json/Test$1Obj.class
org/json/Test.class
org/json/XML.class
org/json/XMLTokener.class
But I always keep getting the error below -
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.serde2.SerDe
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I am using Hive 3.1.0.
Please help.
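For what it's worth, the class that cannot be found, org.apache.hadoop.hive.serde2.SerDe, is the old SerDe interface that third-party hive-json-serde builds implement; as far as I can tell it no longer ships with Hive 3.x (AbstractSerDe replaced it), which is why a jar that worked on older Hive fails here. One way around this, assuming the data is one JSON object per line, is the JsonSerDe that ships with Hive's HCatalog; a sketch:

-- hive-hcatalog-core needs to be on the classpath if it is not already; the path below is illustrative
ADD JAR /opt/hive/hcatalog/share/hcatalog/hive-hcatalog-core.jar;

CREATE TABLE products_info_raw(
  id STRING,
  name STRING,
  reseller STRING,
  category STRING,
  price BIGINT,
  discount FLOAT,
  profit_percent FLOAT
)
PARTITIONED BY (rptg_dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;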
I am trying to create a Hive table on top of an HBase table, using the query below.
create external table MaprDB_batch_info_table (Batch_ID string, BatchParserJobId string, count string, CurrentRunTime string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,BatchInfo:BatchParserJobId,BatchInfo:count,BatchInfo:CurrentRunTime") TBLPROPERTIES ('hbase.table.name' = '/user/all/batchinfo');
This command executes successfully in the Hive shell, but when I try to run the same thing through the bash shell:
hive -e "create external table MaprDB_batch_info_table (Batch_ID string, BatchParserJobId string, count string, CurrentRunTime string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,BatchInfo:BatchParserJobId,BatchInfo:count,BatchInfo:CurrentRunTime") TBLPROPERTIES ('hbase.table.name' = '/user/all/batchinfo');
I get the error below:
NoViableAltException(26#[])
at org.apache.hadoop.hive.ql.parse.HiveParser.tablePropertiesList(HiveParser.java:34375)
at org.apache.hadoop.hive.ql.parse.HiveParser.tableProperties(HiveParser.java:34243)
at org.apache.hadoop.hive.ql.parse.HiveParser.tableFileFormat(HiveParser.java:35913)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5380)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:397)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:309)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1146)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1194)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1083)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1073)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:708)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:473 cannot recognize input near 'hbase' '.' 'columns' in table properties list'
Can anybody help in rectifying this, please?
Replace the double quotes (") that you have within the query with single quotes ('):
...('hbase.columns.mapping'=':key,BatchInfo:BatchParserJobId,BatchInfo:count,BatchInfo:CurrentRunTime')...
You also have an issue with the value given to 'hbase.table.name': replace the path with the actual table name.
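A sketch of the corrected command with the quoting flipped (only the quotes are changed from the original, plus the closing double quote at the end; adjust the 'hbase.table.name' value per the point above):

hive -e "create external table MaprDB_batch_info_table (Batch_ID string, BatchParserJobId string, count string, CurrentRunTime string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,BatchInfo:BatchParserJobId,BatchInfo:count,BatchInfo:CurrentRunTime') TBLPROPERTIES ('hbase.table.name' = '/user/all/batchinfo');"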
In summary, this is what I did:
Original data -> SELECT and save filtered data in HDFS -> create an External table using the file saved in HDFS -> populate an empty table using the External table.
Looking at the exception, it seems this has something to do with the OUTPUT types between the two tables.
In detail:
1) I have a "table_log" table with lots of data (in database A) with the following structure (with 3 partitions):
CREATE TABLE `table_log`(
`e_id` string,
`member_id` string,
.
.
PARTITIONED BY (
`dt` string,
`service_type` string,
`event_type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
2) I filtered the data by (dt, service_type, event_type) and saved the result in HDFS as follows:
INSERT OVERWRITE DIRECTORY '/user/atscale/filterd-ratlog' SELECT * FROM table_log WHERE dt >= '2016-05-01' AND dt <= '2016-05-31' AND service_type='xxxx_jp' AND event_type='vv';
3) Then I created an external table (table_log_filtered_ext) in database B over the above result.
Note that this table doesn't have the partitions.
DROP TABLE IF EXISTS table_log_filtered_ext;
CREATE EXTERNAL TABLE `table_log_filtered_ext`(
`e_id` string,
`member_id` string,
.
.
dt string,
service_type string,
event_type string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
LOCATION '/user/atscale/filterd-ratlog'
4) I created another new table (table_log_filtered), similar to the "table_log" structure (with 3 partitions):
CREATE TABLE `table_log_filtered` (
`e_id` string,
`member_id` string,
.
.
PARTITIONED BY (
`dt` string,
`service_type` string,
`event_type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
5) Now I want to populate the "table_log_filtered" table (with the 3 partitions, as in "table_log") from the data in the external table "table_log_filtered_ext":
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.execution.engine=tez;
INSERT OVERWRITE TABLE table_log_filtered PARTITION(dt, service_type, event_type)
SELECT * FROM table_log_filtered_ext;
But I get this java.lang.ClassCastException. Looking at the exception, this has something to do with the OUTPUT types between the two tables.
Any tips?
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":
.
.
.
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
... 16 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.LimitOperator.process(LimitOperator.java:54)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
... 17 more
Just in case anyone else bumps into this issue: the fix was, as @Samson Scharfrichter mentioned, to specify STORED AS ORC for table_log_filtered:
CREATE TABLE `table_log_filtered` (
`e_id` string,
`member_id` string,
.
.
PARTITIONED BY (
`dt` string,
`service_type` string,
`event_type` string)
STORED AS ORC
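A short note on why this works (my reading of the stack trace, not something the original answer spelled out): with only INPUTFORMAT and OUTPUTFORMAT specified, the table still used LazySimpleSerDe from ROW FORMAT DELIMITED, which hands Text rows to OrcOutputFormat, while OrcOutputFormat expects OrcSerdeRow objects, hence the ClassCastException. STORED AS ORC wires up the ORC serde and both file formats together; spelled out, it is roughly equivalent to:

CREATE TABLE `table_log_filtered` (
  `e_id` string,
  `member_id` string
  -- remaining columns elided, as in the original DDL
)
PARTITIONED BY (
  `dt` string,
  `service_type` string,
  `event_type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';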
I am trying to insert into a Hive bucketed, sorted table and am stuck with a NegativeArraySizeException thrown by the reducer. Please find the stack trace below.
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NegativeArraySizeException
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
And my table DDL is (only showing a subset of columns for readability; the actual DDL has 100 columns):
CREATE TABLE clustered_sorted_orc( conv_type string,
multi_dim_id int,
multi_key_id int,
advertiser_id bigint,
buy_id bigint,
day timestamp)
PARTITIONED BY(job_instance_id int)
CLUSTERED BY(conv_type) SORTED BY (day) INTO 8 BUCKETS
STORED AS ORC;
The insert statement is:
FROM not_clustered_orc
INSERT OVERWRITE TABLE clustered_sorted_orc PARTITION(job_instance_id)
SELECT conv_type, multi_dim_id, multi_key_id, advertiser_id, buy_id, day, job_instance_id;
The following Hive properties are set:
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode=nonstrict;
This is a log snippet from MergeManagerImpl which specifies ioSortFactor, mergeThreshold, etc., if it helps.
2016-06-30 05:57:20,518 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=12828540928, maxSingleShuffleLimit=3207135232, mergeThreshold=8466837504, ioSortFactor=64, memToMemMergeOutputsThreshold=64
I am using CDH 5.7.1, Hive 1.1.0, Hadoop 2.6.0. Has anyone faced a similar issue before? Any help is really appreciated.
I got it working after setting
hive.optimize.sort.dynamic.partition=true
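For completeness, a sketch of the full session with that property added (everything else is taken from the question as-is):

set hive.optimize.sort.dynamic.partition=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;

FROM not_clustered_orc
INSERT OVERWRITE TABLE clustered_sorted_orc PARTITION(job_instance_id)
SELECT conv_type, multi_dim_id, multi_key_id, advertiser_id, buy_id, day, job_instance_id;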
Here's the scenario: three folders are located in HDFS, containing the following files:
/root/20140901/part-0
/root/20140901/part-1
/root/20140901/part-2
/root/20140902/part-0
/root/20140902/part-1
/root/20140902/part-2
/root/20140903/part-0
/root/20140903/part-1
/root/20140903/part-2
After creating a Hive table with the command below, I run the HQL [select * from hive_combine_test where rdm > 50000;]. This uses 9 mappers, the same as the number of files in HDFS.
CREATE EXTERNAL table hive_combine_test
(id string,
rdm string)
PARTITIONED BY (dateid string)
row format delimited fields terminated by '\t'
stored as textfile;
ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140901')
location '/root/20140901';
ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140902')
location '/root/20140902';
ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140903')
location '/root/20140903';
But what I want is to put all of each folder's part-i files together in one split, so that there would be only three mappers.
I've tried inheriting from org.apache.hadoop.hive.ql.io.HiveInputFormat to test whether a custom JudHiveInputFormat can work.
public class JudHiveInputFormat<K extends WritableComparable, V extends Writable>
extends HiveInputFormat<WritableComparable, Writable> {
}
But when I use it in Hive, it throws an exception:
hive> add jar /my_path/jud_udf.jar;
hive> set hive.input.format=com.judking.hive.inputformat.JudHiveInputFormat;
hive> select * from hive_combine_test where rdm > 50000;
java.lang.RuntimeException: com.judking.hive.inputformat.JudCombineHiveInputFormat
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:290)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1472)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1239)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1057)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:880)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:870)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Could anyone give me some clue? Thanks a lot!
As far as I know, to add a custom INPUT/OUTPUT format in Hive you need to mention that format in your CREATE TABLE statement, something like this:
CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT '<your input format class name>' OUTPUTFORMAT '<your output format class name>';
Since you only need the InputFormat, your CREATE TABLE statement will look like this:
CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'com.judking.hive.inputformat.JudHiveInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';
Why do you need to mention the OUTPUT format class? Since you have overridden the INPUT format, Hive expects the OUTPUT class as well, so here we need to tell Hive to use its default OUTPUT format class.
Maybe you can give it a try.
Hope it helps!
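A side note, separate from the answer above: if the goal is simply fewer mappers rather than plugging in a custom class, Hive's built-in CombineHiveInputFormat can merge the small part-i files of a partition into a single split, provided the max split size is large enough to cover them; in this layout that should come out to roughly one mapper per dateid folder. A sketch (the split size value is illustrative):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=1073741824;   -- 1 GB; pick a value larger than the total size of one folder's part-i files

select * from hive_combine_test where rdm > 50000;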