I am trying to insert into a hive bucketed sorted table and stuck with a Negative Array Size exception thrown by the reducer. Please find below stack trace.
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NegativeArraySizeException
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
And my table DDL is (Only showing a subset of columns for readability. Actual DDL has 100 columns)
CREATE TABLE clustered_sorted_orc( conv_type string,
multi_dim_id int,
multi_key_id int,
advertiser_id bigint,
buy_id bigint,
day timestamp
PARTITIONED BY(job_instance_id int)
CLUSTERED BY(conv_type) SORTED BY (day) INTO 8 BUCKETS
STORED AS ORC;
Insert statement is
FROM not_clustered_orc
INSERT OVERWRITE TABLE clustered_sorted_orc PARTITION(job_instance_id)
SELECT conv_type ,multi_dim_id ,multi_key_id ,advertiser_id,buy_id ,day, job_instance_id
Following hive properties are set
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode=nonstrict;
This is a log snippet from MergerManagerImpl which specifies ioSortFactor,mergeThreshold etc if it helps.
2016-06-30 05:57:20,518 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=12828540928, maxSingleShuffleLimit=3207135232, mergeThreshold=8466837504, ioSortFactor=64, memToMemMergeOutputsThreshold=64
I am using CDH 5.7.1, Hive1.1.0, Hadoop 2.6.0. Has anyone faced a similar issue before? Any help is really appreciated.
I got it working after setting
hive.optimize.sort.dynamic.partition=true
Related
I share my experience about adding columns on a partitioned hive table.
As you can see, despite the CASCADE function, the ALTER brakes my table :(
add columns on partitioned table
table description
CREATE TABLE test (
a string,
b string,
c string
)
PARTITIONED BY (
x string,
y string,
z string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.compress'='SNAPPY'
);
duplicate the table
CREATE TABLE test_tmp...
hadoop distcp hdfs://.../test/* dfs://.../test_tmp
MSCK REPAIR TABLE test_tmp;
SELECT * FROM test_tmp
LIMIT 100
check : OK (I get results)
modify the table
ALTER TABLE test_tmp
ADD COLUMNS(
aa timestamp,
bb string,
cc int,
dd string
) CASCADE;
SELECT * FROM test_tmp
LIMIT 100
...
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:19, Vertex vertex_1502459312997_187854_4_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
... 1 statement(s) executed, 0 rows affected, exec/fetch time: 21.655/0.000 sec [0 successful, 1 errors]
check : KO (I get this error)
If you are using Hive 0.x or 1.x then you are probably a victim of...
HIVE-10598 Vectorization borks when column is added to table.
...which is specific to ORC format, even if it's not apparent from the JIRA label.
There is a partial fix as of Hive 2.0 (i.e. ADD is fixed, but DROP / RENAME / CHANGE are still crippled) thanks to
HIVE-11981 ORC Schema Evolution Issues (Vectorized, ACID, and
Non-Vectorized)
And another related fix as of Hive 2.1.1 for CHANGE
HIVE-14355 Schema evolution for ORC in llap is broken
for Int to String conversion
To be continued...
In summary this is what I did:
Original data -> SELECT and save filtered data in HDFS -> create an External table using the file saved in HDFS -> populate an empty table using the External table.
Looking at the Exception, seems this has something todo with OUTPUT types between the two tables
In details :
1) I have "table_log" table with lots of data (in Database A) with the following structure (with 3 partitions) :
CREATE TABLE `table_log`(
`e_id` string,
`member_id` string,
.
.
PARTITIONED BY (
`dt` string,
`service_type` string,
`event_type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
2) I filtered data by (td,service_type,event_type) and saved the result in HDFS as follows :
INSERT OVERWRITE DIRECTORY '/user/atscale/filterd-ratlog' SELECT * FROM rat_log WHERE dt >= '2016-05-01' AND dt <='2016-05-31' AND service_type='xxxx_jp' AND event_type='vv';
3) Then I created an External Table (table_log_filtered_ext) (in Database B) with above result.
Note that this table doesn't have the partitions.
DROP TABLE IF EXISTS table_log_filtered_ext;
CREATE EXTERNAL TABLE `table_log_filtered_ext`(
`e_id` string,
`member_id` string,
.
.
dt string,
service_type string,
event_type string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
LOCATION '/user/atscale/filterd-ratlog'
4) I created another new table (table_log_filtered) similar to the "table_log" structure(with 3 partitions) as :
CREATE TABLE `table_log_filtered` (
`e_id` string,
`member_id` string,
.
.
PARTITIONED BY (
`dt` string,
`service_type` string,
`event_type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
5) Now I wanted to populate "table_log_filtered" table (with 3 partitions as in "table_log") from the data from the external table "table_log_filtered_ext"
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.execution.engine=tez;
INSERT OVERWRITE TABLE rat_log_filtered PARTITION(dt, service_type, event_type)
SELECT * FROM table_log_filtered_ext;
But I get this "java.lang.ClassCastException.
Looking at the exception, this has something todo with OUTPUT types between the two tables..
AnyTips ?:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":
.
.
.
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
... 16 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.LimitOperator.process(LimitOperator.java:54)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
... 17 more
Just in case if anyone else bump into this issue, the fix was as #Samson Scharfrichter mentioned , I specified STORED AS ORC for the table_log_filtered
CREATE TABLE `table_log_filtered` (
`e_id` string,
`member_id` string,
.
.
PARTITIONED BY (
`dt` string,
`service_type` string,
`event_type` string)
STORED AS ORC
I am unable to create the partition into a new table from the table which is already present on hive.
The query that I am running on hive after the Table creation is
INSERT INTO TABLE ba_data.PNR_INFO1_partitioned PARTITION(pnr_create_dt) select * from pnr_info1_external;
The error that I am getting is
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hive/warehouse/ba_data.db/pnr_info1_partitioned/.hive-staging_hive_2016-08-09_17-47-47_508_8688474345886508021-1/_task_tmp.-ext-10002/pnr_create_dt=18%2F12%2F2013/_tmp.000000_3 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
After I browsed and found that namenode,datanode folders needs to be deleted and namenode should be formatted.I have done that sanitary task as well.But still the same error I am getting.
Also I have set the replication factor to 1 and all the Hadoop processes are running well.
Please suggest me how to proceed in order to get away from this issue.Your suggestions are much appreciated.
first you need to create
1. table with all field
2. load data into table
3. create table with partition column with their type
4. copy data from first table to partition table
I think the dynamic partition needs to be enabled. following works.
set hive.exec.dynamic.partition.mode=nonstrict;
create table parttable (id int) partitioned by (partcolumn string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
;
create table source_table (id int,partcolumn string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
;
insert into source_table values (1,'Chicago');
insert into source_table values (2,'Chicago');
insert into source_table values (3,'Orlando');
set hive.exec.dynamic.partition=true;
insert overwrite table parttable partition(partcolumn)
select id,partcolumn from source_table;
I have an orc formatted partitioned and clustered hive table to which I have to insert data from a temp table.
Create table statements:
orc table:
CREATE EXTERNAL TABLE DYN_TGT_TABLE(ID INT,USRNAME VARCHAR(20), LD TIMESTAMP) PARTITIONED BY (PART VARCHAR(20)) CLUSTERED BY (id) INTO 1024 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION '/tmp/dyn_tgt_table' TBLPROPERTIES ('transactional'='true','transient_lastDdlTime'='1461157247');
temp table:
CREATE EXTERNAL TABLE DYN_TEMP_TABLE(ID INT,USRNAME VARCHAR(20),PART VARCHAR(20), LD TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tmp/dyn_temp_table' TBLPROPERTIES ('transactional'='TRUE','transient_lastDdlTime'='1461216088');
Added 5 records to temp table:
hive> select * from DYN_TEMP_TABLE;
OK
1 user1 A 2016-05-06 06:31:48
2 user2 A 2016-05-06 06:31:48
14 user3 B 2016-05-06 06:31:48
15 user4 B 2016-05-06 06:31:48
16 user5 B 2016-05-06 06:31:48
Time taken: 0.166 seconds, Fetched: 5 row(s)
The below dynamic insert query is erroring:
insert into TABLE DYN_TGT_TABLE PARTITION(part) select id,usrname,ld,part from DYN_TEMP_TABLE;
Error message:
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0)
{"key":{},"value":{"_col0":15,"_col1":"user4","_col2":"2016-05-06 06:31:48","_col3":"B"}}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":15,"_col1":"user4","_col2":"2016-05-06 06:31:48","_col3":"B"}}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.findWriterOffset(FileSinkOperator.java:755)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:683)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
... 7 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Please help me point out the cause for this error.
i am loading data from one table into another in hive, while the new properties of the new table differ from the original.
While loading i am facing the below issue... Any help to fix this... ?
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"mdse_item_i":671841,"co_loc_i":146,"persh_expr_d":"2014-05-01","greg_d":"2013-06-17","persh_oh_q":16.0,"crte_btch_i":765,"updt_btch_i":765,"range_n":"ITEM_LOC_DAY_PERSH_OH_INV_2013-04-01_2013-07-31"}
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:159)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"mdse_item_i":671841,"
My old table defntn:
hive> describe nonclickstream.ITEM_LOC_DAY_PERSH_OH_INV;
OK
mdse_item_i int
co_loc_i int
persh_expr_d string
greg_d string
persh_oh_q double
crte_btch_i int
updt_btch_i int
range_n string
Time taken: 0.058 seconds
My new table def. is below:
hive> describe ITEM_LOC_DAY_PERSH_OH_INV;
OK
mdse_item_i int from deserializer
co_loc_i int from deserializer
persh_expr_d string from deserializer
greg_d string from deserializer
persh_oh_q string from deserializer
crte_btch_i int from deserializer
updt_btch_i int from deserializer
greg_date string
Time taken: 0.241 seconds
The new one is created with avro schema.
CREATE external TABLE ITEM_LOC_DAY_PERSH_OH_INV
partitioned by (greg_date string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
Location '/common/TD/INV_new/ITEM_LOC_DAY_PERSH_OH_INV/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///common/TD/INV_new/ITEM_LOC_DAY_PERSH_OH_INV/ITEM_LOC_DAY_PERSH_OH_INV.avs');
Load command we are using:
INSERT INTO TABLE ITEM_LOC_DAY_PERSH_OH_INV PARTITION (greg_date)
SELECT
mdse_item_i,
co_loc_i,
persh_expr_d,
greg_d,
persh_oh_q,
crte_btch_i,
updt_btch_i,
greg_d FROM nonclickstream.ITEM_LOC_DAY_PERSH_OH_INV where range_n='ITEM_LOC_DAY_PERSH_OH_INV_2013-04-01_2013-07-31';
we are using dynamic partitioning while loading!
Actually what we are trying to do is re-partitioning the table with another column. while modified the schema also.
The same approach worked for other tables... but for only this table we are facing this issue......