How to create an ORC file in Hive CDH? - hadoop

I can easily create a table in the ORC file format in Apache Hadoop or Hortonworks' HDP:
CREATE TABLE ... STORED AS ORC
However, this doesn't work in Cloudera's CDH 4.5. (Surprise!) I get:
FAILED: SemanticException Unrecognized file format in STORED AS clause: ORC
So as an alternative, I tried to download and install the Hive jar that contains the ORC classes:
hive> add jar /opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/hive/lib/hive-exec-0.11.0.jar;
Then create my ORC Table:
hive> CREATE TABLE test (name STRING)
> row format serde
> 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> stored as inputformat
> 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> outputformat
> 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
OK
But upon inserting into this table from some CSV data, I get an error:
hive> INSERT OVERWRITE TABLE test
> SELECT name FROM textdata;
Diagnostic Messages for this Task:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
How should I create an ORC table in Hive in CDH?

CDH 4.5 contains Hive 0.10; see CDH Version 4.5.0 Packaging and Tarballs. ORC was added in Hive 0.11; see the release notes and HIVE-3874: Create a new Optimized Row Columnar file format for Hive.
CDH 5 is in beta now, but it does contain Hive 0.11; see CDH Version 5.0.0 Beta 1.
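Once you are on a release that ships Hive 0.11 or later (for example CDH 5), the simple syntax from the question works directly. A minimal sketch (the table name test_orc is illustrative; textdata is the source table from the question):

CREATE TABLE test_orc (name STRING)
STORED AS ORC;

-- load it from the existing text table
INSERT OVERWRITE TABLE test_orc
SELECT name FROM textdata;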

Related

Alter table in Hive is not working for SerDe 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' in Apache Hive 2.1.1-cdh6.3.4

Environment:
Apache Hive (version 1.1.0-cdh5.14.2)
I tried creating a table with the below DDL.
create external table test1 (v_src_code string, d_extraction_date date)
partitioned by (d_mis_date date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties ("field.delim"="~|")
stored as textfile
location '/hdfs_path/test1'
tblproperties ("serialization.null.format"="");
Then I altered this table by adding one extra column, as below.
alter table test1 add columns(n_limit_id bigint);
This worked perfectly fine.
But recently our cluster was upgraded. The new environment is:
Apache Hive (version 2.1.1-cdh6.3.4)
The same table was created in this new environment. When I run the same ALTER TABLE statement, I get the below error.
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Error: type expected at the position 0 of '<derived from deserializer>:bigint' but '<' is found. (state=08S01,code=1)

XmlSerDe error in Hive

While trying to execute the CREATE TABLE statement in Hive, I get the below error.
CREATE EXTERNAL TABLE BOOKDATA(
> TITLE VARCHAR(40),
> PRICE INT
> )ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
> WITH SERDEPROPERTIES (
> "column.xpath.TITLE"="/CATALOG/BOOK/TITLE/",
> "column.xpath.PRICE"="/CATALOG/BOOK/PRICE/")
> STORED AS
> INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
> LOCATION '/sourcedata'
> TBLPROPERTIES (
> "xmlinput.start"="<CATALOG",
> "xmlinput.end"= "</CATALOG>"
> );
FAILED: SemanticException Cannot find class 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
Please help on how to resolve this issue. I am using the Hive CLI.
The SerDe and InputFormat classes have to be on the classpath of whatever executes the query (the Hive CLI or HiveServer2). See https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryandDDLExecution
1. hive.aux.jars.path (requires a restart of HiveServer2)
- Default Value: (empty)
- Added In: Hive 0.2.0 or earlier
The location of the plugin jars that contain implementations of user defined functions (UDFs) and SerDes.
2. hive.reloadable.aux.jars.path
- Default Value: (empty)
- Added In: Hive 0.14.0 with HIVE-7553
The locations of the plugin jars, which can be comma-separated folders or jars. They can be renewed (added, removed, or updated) by executing the Beeline reload command without having to restart HiveServer2. These jars can be used just like the auxiliary classes in hive.aux.jars.path for creating UDFs or SerDes.
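For example, assuming the IBM SPSS hivexmlserde jar (which contains both the XmlSerDe and XmlInputFormat classes) has been placed at an illustrative path such as /opt/hive/auxlib/, you could register it permanently in hive-site.xml:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/hive/auxlib/hivexmlserde-1.0.5.3.jar</value>
</property>

or add it only for the current CLI session (the jar file name and version here are illustrative):

ADD JAR /opt/hive/auxlib/hivexmlserde-1.0.5.3.jar;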
I too faced a similar issue in the past. What worked for me was simply changing the Hive execution engine to Tez:
set hive.execution.engine=tez;
When you set the Tez engine, it will pick up all the jars it needs to run a query.
Let me know if that works out for you.

Unable to run SerDe

We have one EBCDIC sample file.
It is stored in /user/hive/warehouse/ebcdic_test_file.txt.
The COBOL layout of the file is stored in /user/hive/Warehouse/CobolSerde.cob.
We are running queries from the Hue browser query editor.
We also tried the CLI, but the same error comes up.
We have added CobolSerde.jar via
add jar /home/cloudera/Desktop/CobolSerde.jar;
It was added successfully, as confirmed by LIST JARS.
Query
CREATE EXTERNAL TABLE cobol2Hve
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde2.cobol.CobolSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/hive/warehouse/ebcdic_test_file.txt'
TBLPROPERTIES ('cobol.layout.url'='/user/hive/warehouse/CobolSerDe.cob','fb.length'='159');
Error while processing statement:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Cannot validate serde: com.savy3.hadoop.hive.serde2.cobol.CobolSerDe
Why does this error occur?
What is fb.length?

FileAlreadyExistsException occurred when I was exporting data from Presto with Hive engine on Amazon EMR

I tried to export data from one S3 bucket to another S3 bucket using Presto with the Hive engine on Amazon EMR (an ETL-like job), but a FileAlreadyExistsException occurred while I was exporting the data.
How can I export data using Presto?
Environments
emr-4.3.0
Hive 1.0.0
Presto-Sandbox 0.130
Error
I tried the following operation:
$ hive
hive> CREATE EXTERNAL TABLE logs(log string)
-> LOCATION 's3://foo-bucket/logs/';
hive> CREATE EXTERNAL TABLE s3_export(log string)
-> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-> LOCATION 's3://foo-bucket/export/';
hive> exit;
$ presto-cli --catalog hive --schema default
presto:default> INSERT INTO s3_export SELECT log FROM logs;
Query 20160203_125741_00018_ba5sw, FAILED, 3 nodes
Splits: 7 total, 1 done (14.29%)
0:01 [39 rows, 4KB] [49 rows/s, 5.13KB/s]
Query 20160203_125741_00018_ba5sw failed: java.nio.file.FileAlreadyExistsException: /tmp
This is caused by the Presto Hive connector not liking the symlinked /tmp/ that EMR (4.2 and 4.3) uses for hive.s3.staging-directory. You can use the Configuration API to override hive.s3.staging-directory and set it to /mnt/tmp/, like this:
classification=presto-connector-hive,properties=[hive.s3.staging-directory=/mnt/tmp/]
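If you supply the configuration as JSON instead (for example from a file passed to the Configuration API), the equivalent would look roughly like this sketch:

[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.s3.staging-directory": "/mnt/tmp/"
    }
  }
]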
I've resolved the problem with the following command:
presto-cli --catalog hive --schema default --execute 'select log from logs' | aws s3 cp - s3://foo-bucket/export/data.txt

Integrating HBase with Hive: registering an HBase table

I am using Hortonworks Sandbox 2.0, which contains the following versions of HBase and Hive:
Component Version
------------------------
Apache Hadoop 2.2.0
Apache Hive 0.12.0
Apache HBase 0.96.0
Apache ZooKeeper 3.4.5
...
I am trying to register my HBase table in Hive using the following query:
CREATE TABLE IF NOT EXISTS Document_Table_Hive (key STRING, author STRING, category STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,metadata:author,categories:category')
TBLPROPERTIES ('hbase.table.name' = 'Document');
This does not work, I get the following Exception:
2014-03-26 09:14:57,341 ERROR exec.DDLTask (DDLTask.java:execute(435)) - java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.setConf(HBaseStorageHandler.java:249)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
2014-03-26 09:14:57,368 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hbase/HBaseConfiguration
I have already created the HBase table 'Document', and the describe command gives the following description:
'Document',
{NAME => 'categories',..},
{NAME => 'comments',..},
{NAME => 'metadata',..}
I have tried the following things:
1. Adding hive.aux.jars.path in hive-site.xml:
hive.aux.jars.path
file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar,file:///usr/lib/hive/lib/hive-hbase-handler-0.12.0.2.0.6.0-76.jar,file:///usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar
2. Adding jars using the Hive add jar command:
add jar /usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.12.0.2.0.6.0-76.jar;
add jar /usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar;
add jar /usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar;
add file /etc/hbase/conf/hbase-site.xml
3. Specifying HADOOP_CLASSPATH:
export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2:/usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar
And it is still not working!
How can I add the jars to the Hive classpath so that it finds the HBaseConfiguration class, or is this an entirely different issue?
No need to add all the jars. Just hbase-*.jar, zookeeper*.jar, and hive-hbase-handler*.jar would be enough. By default all Hadoop-related jars are added to the Hadoop classpath, since Hive internally uses the hadoop command to execute.
Alternatively, instead of copying the HBase jars into the Hive library directory, setting the HIVE_AUX_JARS_PATH environment variable to /usr/lib/hbase/lib/ in /etc/hive/conf/hive-env.sh will also do.
The second approach is recommended over the first.
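A minimal sketch of the hive-env.sh approach (the paths follow the Sandbox layout mentioned in the question):

# /etc/hive/conf/hive-env.sh
# expose the HBase, ZooKeeper and hive-hbase-handler jars to Hive without copying them
export HIVE_AUX_JARS_PATH=/usr/lib/hbase/lib

Restart the Hive services (or start a fresh Hive CLI session) afterwards so the variable is picked up.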
