How to create a Hive table on top of ORC-format data? - hadoop

I have source data in ORC format on HDFS.
I created an external Hive table on top of the HDFS data with the command below. I am using Hive 1.2.1.
CREATE EXTERNAL TABLE IF NOT EXISTS test( table_columns ... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS ORC
LOCATION 'path'
TBLPROPERTIES ("orc.compress"="SNAPPY");
But when I select data from the table, I get this exception:
"protobuf.InvalidProtocolBufferException: Protocol message was too large"
Please help me resolve this issue.
Thanks.

Related

Create hive table from table schema stored in .avsc file

I have a Hive table schema stored in an HDFS file, schema.avsc.
I want to create a Hive table with the same schema and load into it data that is stored at another HDFS path.
1: How can I create the table?
2: How can I load the data stored in the HDFS file into the created table?
How can I create the table?
The Apache Hive documentation on the AvroSerDe shows the syntax for creating a table based on an Avro schema stored in a file. For convenience, I'll repeat one of the examples here:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='http://schema_provider/kst.avsc');
This example pulls the schema file from a web server. The documentation also shows other options, such as pulling from a local file, depending on your specific needs.
I recommend reading the entire AvroSerDe documentation page. There is a lot of useful information there about getting the most out of using Hive with Avro.
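To make the link between an .avsc file and the resulting table columns concrete, here is a minimal Python sketch. An .avsc file is plain JSON, so the column list can be read straight out of it. The schema text and the helper name hive_columns are hypothetical, and the type map covers only Avro primitives and simple nullable unions; the AvroSerDe documentation remains the authority on type conversion.

```python
import json

# Hypothetical schema text; a real schema.avsc would be read from disk or HDFS.
AVSC_TEXT = """
{
  "type": "record",
  "name": "kst",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "title", "type": "string"},
    {"name": "score", "type": ["null", "double"]}
  ]
}
"""

# Avro primitive type -> Hive column type (primitives only; an assumption-level
# summary, complex types such as record, array and map are not handled here).
AVRO_TO_HIVE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}


def hive_columns(avsc_text):
    """Return (column name, Hive type) pairs for a flat Avro record schema."""
    schema = json.loads(avsc_text)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # A union like ["null", "double"] marks a nullable column;
        # use the first non-null branch as the column type.
        if isinstance(avro_type, list):
            avro_type = [t for t in avro_type if t != "null"][0]
        columns.append((field["name"], AVRO_TO_HIVE[avro_type]))
    return columns


for name, hive_type in hive_columns(AVSC_TEXT):
    print(name, hive_type)
```

With the AvroSerDe you do not need to write these columns out yourself, since Hive derives them from avro.schema.url; the sketch only shows what that derivation looks like for simple schemas.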
How can I load the data stored in the HDFS file into the created table?
You can define an external table that references the existing HDFS files. The documentation page for External Tables shows the syntax. Repeating an example:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
After defining the external table, you can then use an INSERT-SELECT query that reads from the external table and writes to the Avro table. The documentation on Inserting data into Hive Tables from queries describes the INSERT-SELECT syntax. For example:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

Index Hbase data to solr via Hive external table

I have crawled some data with Nutch 2.3.1. The data is stored in an HBase 0.98 table. I have created an external Hive table that imports data from the HBase table. Now I have to index this data into Solr 4.10.3. For that I followed this well-known tutorial. I created a Hive table like this:
create external table if not exists solr_items (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,content,url,title")
tblproperties ("solr.url" = "http://localhost:8983/solr/collection1") ;
There was a problem when I tried to copy data from HBase, posted here. So I decided to index some dummy data first, loading it from a file like this:
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items;
But it gave the following error:
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Where is the problem?
The Hadoop version is 1.2.1.
You can't use LOAD DATA with a non-native table, that is, one backed by a storage handler such as SolrStorageHandler. From the Hive LanguageManual DML:
Hive does not do any transformation while loading data into tables.
Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
Hive can't simply copy files into place for a Solr-backed table, because Solr uses its own internal data representation.
You can insert, though:
insert into table solr_items select * from tempTable;

How to load a flat file(not delimited file) into HBase?

I am new to HBase and I have a flat (non-delimited) file that I would like to load into a single HBase table.
Here is a preview of a row in my file:
0107E07201512310015071C11100747012015123100
I know, for example, that positions 1 to 7 are an id and positions 7 to 15 are a date...
The problem is how to build a schema that corresponds to my file, or whether there is a way to convert it to a delimited file or to read such a file using Jaql, since I'm working with InfoSphere BigInsights.
Any help would be greatly appreciated.
Thanks in advance.
Create a Hive table using RegexSerDe:
CREATE EXTERNAL TABLE testtable (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{6})(.{3}).*")
LOCATION '<hdfs-file-location>';
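To see what that input.regex does to a record, here is a small Python sketch that applies the same pattern to the sample row from the question and joins the captured groups with a delimiter. The widths 5, 6 and 3 are the placeholders from the answer above, not the asker's real layout, and the function name split_fixed_width and the "|" delimiter are hypothetical.

```python
import re

# Same pattern as the input.regex above: three fixed-width groups,
# with the rest of the line matched but not captured.
FIXED_WIDTH = re.compile(r"(.{5})(.{6})(.{3}).*")


def split_fixed_width(line, delimiter="|"):
    """Turn one fixed-width record into a delimited record."""
    match = FIXED_WIDTH.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError("line is shorter than the expected fixed-width layout")
    return delimiter.join(match.groups())


row = "0107E07201512310015071C11100747012015123100"
print(split_fixed_width(row))  # prints 0107E|072015|123
```

Each capture group becomes one column, in order, which is also how the groups map onto col1, col2 and col3 in the table definition. Adjusting the widths to the real record layout is all that changes.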
You can create a Hive table that points to HBase. Here are the instructions:
http://hortonworks.com/blog/hbase-via-hive-part-1/
You can then use INSERT OVERWRITE TABLE to load the data from the Hive table into the HBase table:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SELECTSandFILTERS
You can write a SerDe to deserialize the file into Hive and use Hive to export to HBase.

Malformed ORC file error

After changing a Hive external table from RC to ORC format and running MSCK REPAIR TABLE on it, I get the following error when I select all rows from the table:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://myServer:port/my_table/prtn_date=yyyymm/part-m-00000__xxxxxxxxxxxxx Invalid postscript length 1
What is the process, if there is one, for migrating RC-formatted historical data to the new ORC definition of the same table?
Hive doesn't automatically reformat the data when you add partitions. You have two choices:
Leave the old partitions as RC files and make the new partitions ORC.
Move the data to a staging table and use insert overwrite to re-write the data as ORC files.
Add the row format, input format, and output format to the CREATE statement to solve the problem:
create external table xyz
(
a string,
b string)
PARTITIONED BY (
c string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION 'hdfs path';

Loading Data from a .txt file to Table Stored as ORC in Hive

I have a data file which is in .txt format. I am using the file to load data into Hive tables. When I load the file in a table like
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS TEXTFILE;
the data is loaded correctly using
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
and I can run a SELECT * FROM test_details_txt; on the table in Hive.
However, if I try to load the data into a table defined as
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS ORC;
I receive the following error on trying to run a SELECT:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://master:6000/user/hive/warehouse/test.db/transaction_details/test_details.txt. Invalid postscript.
While loading the data using the LOAD statement above, I do not receive any error or exception.
Is there anything else that needs to be done when using the LOAD DATA INPATH ... command to store data in an ORC table?
LOAD DATA just copies the files into the table's data directory. Hive does not do any transformation while loading data into tables.
So, in this case, the input file /home/user/test_details.txt needs to be in ORC format if you are loading it into an ORC table.
A possible workaround is to create a temporary table with STORED AS TEXTFILE, LOAD DATA into it, and then copy the data from this table to the ORC table.
Here is an example:
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy to ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;
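Because LOAD DATA never inspects the file, it can help to check the format yourself before loading. ORC files begin with the three magic bytes ORC, so a quick look at the start of a local copy of the file tells you whether it could be ORC at all. The following Python sketch assumes a local file; the function name looks_like_orc is hypothetical, and a matching header is only a necessary condition, not proof of a valid ORC file.

```python
import os
import tempfile

ORC_MAGIC = b"ORC"  # ORC files start with these three bytes


def looks_like_orc(path):
    """Return True if the local file at `path` starts with the ORC magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(ORC_MAGIC)) == ORC_MAGIC


if __name__ == "__main__":
    # Demonstrate on a throwaway text file shaped like test_details.txt.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1,10\n2,20\n")
    try:
        print(looks_like_orc(tmp.name))  # prints False: plain text is not ORC
    finally:
        os.remove(tmp.name)
```

A file that fails this check has to go through a text table and an INSERT ... SELECT, as in the example above, rather than being loaded into the ORC table directly.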
Steps:
1. First create a table using STORED AS TEXTFILE (i.e. the default, or whichever format matches your input file).
2. Load the data into the text table.
3. Create a table using STORED AS ORC AS SELECT * FROM text_table.
4. SELECT * from the ORC table.
Example:
CREATE TABLE text_table(line STRING);
LOAD DATA INPATH 'path_of_file' OVERWRITE INTO TABLE text_table;
CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM text_table;
SELECT * FROM orc_table; /*(it can now be read)*/
Since Hive does not do any transformation to our input data, the format needs to be the same: either the file should be in ORC format, or we can load data from a text file to a text table in Hive.
ORC is a binary file format, so you cannot load text files directly into ORC tables.
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75%, which also speeds up data processing. ORC shows better performance than the Text, Sequence, and RC file formats.
An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
First you need to create a normal table stored as TEXTFILE, load your data into it, and then use an INSERT OVERWRITE query to write the data into the ORC table.
CREATE TABLE table_name1 (schema of the table) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE table_name2 (schema of the table) STORED AS ORC;
LOAD DATA LOCAL INPATH 'path of your file' INTO TABLE table_name1; -- loading data from the local file system
INSERT OVERWRITE TABLE table_name2 SELECT * FROM table_name1;
Now all your data will be stored in ORC files.
A similar procedure applies to the other binary file formats in Hive, i.e. Sequence files, RC files, and Parquet files.
You can refer to the below link for more details.
https://acadgild.com/blog/file-formats-in-apache-hive/
Steps to load data into ORC file format in Hive:
1. Create a normal table using the TEXTFILE format.
2. Load the data normally into this table.
3. Create a table with the schema of the expected results of your normal Hive table, using STORED AS ORCFILE.
4. Use an INSERT OVERWRITE query to copy the data from the TEXTFILE table to the ORCFILE table.
Refer to the blog post "Load data into all file formats in hive" for a hands-on walkthrough of loading data into all file formats in Hive.
