Hive compression in ORC using Lz4 - hadoop

I am trying to compress RC and ORC file using LZ4. I have installed Hadoop-2.7.1 and Hive-1.2.1. In case of LZ4, I can compress RC file without any problem. But, when I try to load data in ORC file using LZ4, it is not working. I have created ORC table like below:
CREATE TABLE FINANCE_orc(
PERMNO STRING,
DATE STRING,
CUSIP STRING,
NCUSIP STRING,
COMNAM STRING,
TICKET STRING,
PERMCO STRING,
SHRCD STRING,
EXCHCD STRING,
HEXCD STRING,
SICCD STRING,
HSLCCD STRING,
PRC STRING,
VOL STRING,
RET STRING,
SHROUT STRING,
DLRET STRING,
VWRETD STRING,
EWRETD STRING,
SPRTRN STRING)
STORED AS ORC tblproperties ("orc.compress"="Lz4");
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.type = BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE table finance_orc select * from finance;
But at the time of loading data it gives the following error:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"permno":"PERMNO","ndate":"DATE","cusip":"CUSIP","ncusip":"NCUSIP","comnam":"COMNAM","ticket":"TICKER","permco":"PERMCO","shrcd":"SHRCD","exchcd":"EXCHCD","hexcd":"HEXCD","siccd":"SICCD","hslccd":"HSICCD","prc":"PRC","vol":"VOL","ret":"RET","shrout":"SHROUT","dlret":"DLRET","vwretd":"VWRETD","ewretd":"EWRETD","sprtrn":"SPRTRN"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"permno":"PERMNO","ndate":"DATE","cusip":"CUSIP","ncusip":"NCUSIP","comnam":"COMNAM","ticket":"TICKER","permco":"PERMCO","shrcd":"SHRCD","exchcd":"EXCHCD","hexcd":"HEXCD","siccd":"SICCD","hslccd":"HSICCD","prc":"PRC","vol":"VOL","ret":"RET","shrout":"SHROUT","dlret":"DLRET","vwretd":"VWRETD","ewretd":"EWRETD","sprtrn":"SPRTRN"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hive.ql.io.orc.CompressionKind.Lz4
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:577)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:675)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
... 9 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hive.ql.io.orc.CompressionKind.Lz4
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:622)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:566)
... 16 more
Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hive.ql.io.orc.CompressionKind.Lz4
at java.lang.Enum.valueOf(Enum.java:236)
at org.apache.hadoop.hive.ql.io.orc.CompressionKind.valueOf(CompressionKind.java:25)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getOptions(OrcOutputFormat.java:143)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:203)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:52)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
... 18 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I have used Snappy and Zlib with the same command and it is working fine. But problem is only with LZ4. I do not know the reason why?

The compression we can use in addition to ORC columnar compression is either of one of NONE, ZLIB, SNAPPY.
The default compression codec is ZLIB.
The compression codecs other than above are not allowed.
In general to get an idea about error, read the error log completely to find the issue to some extent. The error log said -
"org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hive.ql.io.orc.CompressionKind.Lz4"

Now you can manually substitute the jar file for lz4 on official spark repo.

Related

What is 152³305746

I am getting 152³305746 as some bad data in my input file. I am trying to filter it, but have been unsuccessful in telling hive on how to detect it and filter. I am not even sure what data type it is. I am expecting bigint values instead of what i see here.
I have tried the below things and their various combinations which did not help skip the few bad records i have in my input:
1)
CAST(mycol AS string) RLIKE "^[0-9]+$"
2)
mycol < 2147483647
3)
CREATE TEMPORARY MACRO isNumber(s string) CAST(s as BIGINT) IS NOT NULL;
4)
isNumber(mycol) != false
5)
SET mapreduce.map.skip.maxrecords = 100000000;
None of the above approaches worked. Hive fails with the following error:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NumberFormatException: For input string: "152³305746"
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:416)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:126)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
... 9 more
Caused by: java.lang.NumberFormatException: For input string: "152³305746"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseLong(ParsePrimitiveUtils.java:49)
at org.openx.data.jsonserde.objectinspector.primitive.JavaStringLongObjectInspector.get(JavaStringLongObjectInspector.java:46)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:400)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:279)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:239)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:201)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.makeValueWritable(ReduceSinkOperator.java:565)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:395)
... 17 more

Cloudera Hive: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask MapReduce

I keep getting this error when trying to query data using hue
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask MapReduce
From the hue job browser under the syslog tab
The error log is too big to paste here
http://pastebin.com/h8tgYuzR
Error from terminal
hive> SELECT count(*) FROM tweets;
Query ID = cloudera_20161128145151_137efb02-413b-4457-b21d-084101b77091
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1480364897609_0003, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1480364897609_0003/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1480364897609_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-11-28 14:52:09,804 Stage-1 map = 0%, reduce = 0%
2016-11-28 14:53:10,955 Stage-1 map = 0%, reduce = 0%
2016-11-28 14:53:13,213 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1480364897609_0003 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://quickstart.cloudera:8088/proxy/application_1480364897609_0003/
Examining task ID: task_1480364897609_0003_m_000000 (and more) from job job_1480364897609_0003
Task with the most failures(4):
-----
Task ID:
task_1480364897609_0003_m_000000
URL:
http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1480364897609_0003&tipid=task_1480364897609_0003_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable Objavro.schema�
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable Objavro.schema�
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:505)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
... 8 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#7aee0989; line: 1, column: 2]
at com.cloudera.hive.serde.JSONSerDe.deserialize(JSONSerDe.java:128)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:136)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:100)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:496)
... 9 more
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#7aee0989; line: 1, column: 2]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:306)
at org.codehaus.jackson.impl.ReaderBasedParser._handleUnexpectedValue(ReaderBasedParser.java:630)
at org.codehaus.jackson.impl.ReaderBasedParser.nextToken(ReaderBasedParser.java:364)
at org.codehaus.jackson.map.ObjectMapper._initForReading(ObjectMapper.java:2439)
at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2396)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1602)
at com.cloudera.hive.serde.JSONSerDe.deserialize(JSONSerDe.java:126)
... 12 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Here is the table
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/flume/tweets';
data from the file I am trying to load http://pastebin.com/g7eg1BaP
I have a feeling that the table was defined using AVRO as the data type but a non-avro file was loaded to the table.
Remember that Hive is "schema on read" and not "schema on load". It will check for schema only when a job is run, not during loading or defining.
Please post the CREATE TABLE command used and a few records from the file you are trying to load.
Hope this helps.

Spark and Hive interoperability

I am using EMR-4.3.0, Spark 1.6.0, Hive 1.0.0.
I write a table like so (pseudocode) -
val df = <a dataframe>
df.registerTempTable("temptable")
sqlContext.setConf("hive.exec.dynamic.partition.mode","true")
sqlContext.sql("create external table exttable ( some columns ... )
partitioned by (partkey int) stored as parquet location 's3://some.bucket'")
sqlContext.sql("insert overwrite exttable partition(partkey) select columns from
temptable")
The write works fine and I can read the table back using -
sqlContext.sql("select * from exttable")
However, when I try to read the table using Hive as -
hive -e 'select * from exttable'
Hive throws a NullPointerException with the stack trace below. Any help appreciated! -
questTime=[0.008], ResponseProcessingTime=[0.295], HttpClientSendRequestTime=[0.313],
2016-05-19 03:08:02,537 ERROR [main()]: CliDriver (SessionState.java:printError(833)) - Failed with exception java.io.IOException:java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:663)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:561)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1619)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:221)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:153)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:364)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:712)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:631)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:570)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.NullPointerException
at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:247)
at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:368)
at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:346)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:296)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:254)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:200)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:498)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:588)
... 15 more
UPDATE - After messing around for a bit, it seems like null values in the data mess Hive up. How do I avoid this?

spark-sql (hive#spark and hive#hadoop) dies with exceptions

Spark-SQL dies with the following exceptions:
Lost task 13.0 in stage 1.0 (TID 14, 10.15.0.224): java.io.InvalidClassException: org.apache.spark.sql.catalyst.expressions.AttributeMap; local class incompatible: stream classdesc serialVersionUID = -4625798594364953144, local class serialVersionUID = -4625798594364953144
and / or
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.storage.BlockManagerId
What is going on?
More details follow below.
Case: Spark-sql
single server case, local configuration
downloaded spark-1.4.0-bin-hadoop2.6 distribution
centos 6.5 Server (32 Cores, 100GB RAM)
configuration:
(variables)
export SPARK_HOME=/opt/spark-1.4.0-bin-hadoop2.6
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_LOCAL_DIRS=$SPARK_HOME/work
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4G
export SPARK_WORKER_INSTANCES=2
export SPARK_DAEMON_MEMORY=384m
(starting and running)
$SPARK_HOME/sbin/start-all.sh
$SPARK_HOME/bin/spark-sql --master spark://127.0.0.1:7077
The data reside in *.csv.gz individual files (a few thousands of them, a few tens to hundred of megabytes each). I've tested all files for gzip/io errors (non found). The format:
A1;A2;A3;A4;A5;A6;A7;A8;A9
'string';841054310;0;11.383907333;48.788023833;380.700000000;'2014-04-28T06:11:01.753990';0;'2015-06-26T23:54:49.461211'
'string';841054310;1;11.383867000;48.788031667;381.000000000;'2014-04-28T06:14:15.272059';4.77;'2015-06-26T23:55:03.132637'
'string';841054310;2;11.383829000;48.788019000;381.000000000;'2014-04-28T06:14:18.765123';3.19;'2015-06-26T23:55:03.414938'
'string';841054310;3;11.383804667;48.788041333;380.900000000;'2014-04-28T06:14:28.477338';5.1;'2015-06-26T23:55:04.190245'
'string';841054310;4;11.383767167;48.788053167;381.000000000;'2014-04-28T06:14:31.765796';4.29;'2015-06-26T23:55:04.459112'
'string';841054310;5;11.383726500;48.788057667;381.000000000;'2014-04-28T06:14:33.778935';6.18;'2015-06-26T23:55:04.628419'
'string';841054310;6;11.383685667;48.788059333;381.000000000;'2014-04-28T06:14:35.584490';5.71;'2015-06-26T23:55:04.779281'
'string';841054310;7;11.383643667;48.788062833;381.000000000;'2014-04-28T06:14:37.289736';9.21;'2015-06-26T23:55:04.921655'
'string';841054310;8;11.383601333;48.788069333;381.100000000;'2014-04-28T06:14:38.463847';10.78;'2015-06-26T23:55:05.022049'
'string';841054310;9;11.383558000;48.788074500;381.200000000;'2014-04-28T06:14:39.570567';10.92;'2015-06-26T23:55:05.118134'
'string';841054310;10;11.383514500;48.788076000;381.200000000;'2014-04-28T06:14:40.757880';6.88;'2015-06-26T23:55:05.217862'
'string';841054310;11;11.383472000;48.788074000;381.100000000;'2014-04-28T06:14:43.364629';7.45;'2015-06-26T23:55:05.440946'
'string';841054310;12;11.383423667;48.788068333;381.100000000;'2014-04-28T06:14:44.762990';11.91;'2015-06-26T23:55:05.565626'
'string';841054310;13;11.383375833;48.788059000;381.100000000;'2014-04-28T06:14:45.762960';15.1;'2015-06-26T23:55:05.653718'
Spark-SQL commands:
create external table raw (
A1 string,
A2 int,
A3 int,
A4 double,
A5 double,
A6 double,
A7 string,
A8 double,
A9 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
location '/directory/'
alter table raw set tblproperties('skip.header.line.count'='1');
select count(*) from raw;
Typical error:
15/07/15 20:13:22 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.storage.BlockManagerId
Serialization trace:
org$apache$spark$scheduler$CompressedMapStatus$$loc (org.apache.spark.scheduler.CompressedMapStatus)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:265)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:95)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.storage.BlockManagerId
at org.apache.spark.scheduler.CompressedMapStatusFieldAccess.set(Unknown Source)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:617)
... 12 more
for a directory with a few tens of Data-files spark computes the result without an error. Once more files are added errors are thrown.
Case hive#spark:
$SPARK_HOME/sbin/start-thriftserver.sh --master spark://127.0.0.1:7077 --hiveconf hive.server2.thrift.port=10001 --driver-memory 2G
$SPARK_HOME/bin/beeline
Beeline version 1.4.0 by Apache Hive
beeline> !connect jdbc:hive2://127.0.0.1:10001
0: jdbc:hive2://127.0.0.1:10001> select count(*) from raw;
I get the following error:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: Buffer underflow.
Serialization trace:
org$apache$spark$scheduler$CompressedMapStatus$$compressedSizes (org.apache.spark.scheduler.CompressedMapStatus) (state=,code=0)
Case: hive#hadoop:
Details:
single server, local mode
extracted hadoop-2.6.0 distribution under /opt
extracted apache-hive-1.2.1-bin distribution under /opt
extracted db-derby-10.11.1.1-bin distribution under /opt
configuration:
export HADOOP_HOME=/opt/hadoop-2.6.0
export DERBY_HOME=/opt/db-derby-10.11.1.1-bin
. /opt/db-derby-10.11.1.1-bin/bin/setEmbeddedCP
export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
export HIVE_HOME=/opt/apache-hive-1.2.1-bin
export JAVA_HOME='/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/'
Running the thing:
$HIVE_HOME/bin/hive
hive> select count(*) from raw;
Output (and error):
Query ID = root_20150715204331_a7a4be13-31e5-41d2-a7dd-8fdbf9d9f2b0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-07-15 20:44:02,482 Stage-1 map = 0%, reduce = 0%
Ended Job = job_local1201654881_0001 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
hive>
where the hack are the log information stored in this case? I cannot find anything under /tmp/hive, nor in /opt/hadoop-2.6.0/, nor in /opt/apache-hive-1.2.1-bin

HBase Hive Integration - Error

When I try to Load Data from HDFS to HBase using Hive logical tables, I am facing the following problem. I am new for hadoop and not able to trace the error,.I am using CDH4 VM,
Creating a new HBase table which is managed by Hive
CREATE TABLE hive_hbasetable(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek1");
Hbase shell Output
hbase(main):002:0> list
TABLE
hivebasek1
mysql_cityclimate
2 row(s) in 0.2470 seconds
I created a logical table hive_logictable in Hive
CREATE TABLE hive_logictable (foo INT, bar STRING) row format delimited fields terminated by ',';
Inserting data in hive_logictable from HDFS.
cat TextFile.txt
100,value1
101,value2
102,value3
103,value4
104,value5
105,value6
LOAD DATA LOCAL INPATH '/home/cloudera/TextFile.txt' OVERWRITE INTO TABLE hive_logictable;
Loading data into HBase table using Hive.
INSERT OVERWRITE TABLE hive_hbasetable SELECT * FROM hive_logictable;
Below are the error messages throwing....
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501200937_0004, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201501200937_0004
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201501200937_0004
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2015-01-20 10:38:07,412 Stage-0 map = 0%, reduce = 0%
2015-01-20 10:38:52,822 Stage-0 map = 100%, reduce = 100%
Ended Job = job_201501200937_0004 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201501200937_0004
Examining task ID: task_201501200937_0004_m_000002 (and more) from job job_201501200937_0004
Task with the most failures(4):
-----
Task ID:
task_201501200937_0004_m_000000
URL:
http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201501200937_0004&tipid=task_201501200937_0004_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
End of Error Message.
Could you please check if the atomic insert works fine on the HIVE table ? And share the results ?

Resources