Load from HIVE table into HDFS as AVRO file - hadoop

I want to write the contents of a Hive table to HDFS as an .avro file.
Currently I can move a table from Hive to HDFS as a file, but I cannot specify the format of the target file. Can someone help me with this?

So your question is really:
How do I convert a Hive table to a different storage format?
Create a new table with the same fields and types as the existing table, but declare it with the storage format you want (Avro in this case). Then insert into the new table from the old table.
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable;
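As a minimal sketch of those two steps, assuming a Hive version where STORED AS AVRO is available (0.14 or later). The table name newtable_avro and the columns id and name are placeholders, not taken from the question:

```sql
-- Hypothetical schema: replace id/name with the columns of your own table.
CREATE TABLE newtable_avro (
  id   INT,
  name STRING
)
STORED AS AVRO;

-- Rewrite the rows of the existing table into Avro files.
INSERT OVERWRITE TABLE newtable_avro
SELECT id, name FROM oldtable;

-- The "Location" field in this output is the HDFS directory holding the Avro files.
DESCRIBE FORMATTED newtable_avro;
```

The Avro files end up in the table's directory in HDFS, so no separate export step should be needed.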

Related

How to convert existing text data in hdfs to Avro?

I have a table in HDFS that is stored in text format, and I now need to add a new column in the middle of it. I thought of loading the new columns into Avro, since Avro supports schema evolution, but the existing data is still in text format.
If you already have a Hive table, you can load it directly into an Avro table from Hive. If not, create a Hive table over the text file and load that into an Avro table.
Something like:
create table test(fields type) row format delimited fields terminated by ',' stored as textfile location 'textfilepath';
create table avrotbl like test stored as avrofile;
insert into table avrotbl select * from test;

Create a HIVE table and save it to a tab-separated file?

I have some data in hdfs.
This data was migrated from a PostgreSQL database by using Sqoop.
The data is laid out in the usual Hadoop way, with files such as _SUCCESS, part-m-00000, etc.
I need to create a Hive table based on this data and then I need to export this table to a single tab-separated file.
As far as I know, I can create a table this way.
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Then I can save the table as tsv file:
hive -e 'select * from some_table' > /home/myfile.tsv
I don't know how to load data from hdfs into a Hive table.
Moreover, should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
"I don't know how to load data from hdfs into Hive table"
You create a table schema over an HDFS directory, like you're doing.
"should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?"
Unless you told Sqoop to create the Hive table, you must define it manually.
"export this table into a single tab-separated file."
A query might work. Otherwise, unless Sqoop already set the delimiter to \t, you need to create another table from the first one that specifies tab as the column separator. After that you don't even need to query the table; just run hdfs dfs -getmerge on the table's directory.
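A rough sketch of that second table, assuming table_name is the external table defined over the Sqoop output. The name table_name_tsv is made up for illustration:

```sql
-- Create a tab-delimited text copy of the existing table (CTAS).
-- table_name_tsv is a hypothetical name.
CREATE TABLE table_name_tsv
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS
SELECT * FROM table_name;

-- Then, from a shell, merge the part files into a single local file.
-- <table_dir> stands for the table's HDFS location, which
-- DESCRIBE FORMATTED table_name_tsv reports:
--   hdfs dfs -getmerge <table_dir> /home/myfile.tsv
```

Merging the files this way avoids passing every row through the Hive CLI, which is what the hive -e redirect does.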

Insert partitioned data into partitioned hive table

I have stored the data in HDFS using Pig MultiStorage, split on the column id.
So the data is stored as
/output/1/part-0000
/output/2/
/output/3/
Now I have created a partitioned table in hive and I want to load the data from /output folder into this partitioned table. Is there any way to achieve this?
First, create a temporary Hive table and load all the data from the Pig output into it.
Then load your actual partitioned Hive table from the temp table.
Something like below:
FROM emp_external temp INSERT OVERWRITE TABLE emp_partition PARTITION(country) SELECT temp.id,temp.name,temp.dept,temp.sal,temp.country;
Alternatively, you can explore HCatalog for this case.
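The insert above uses a dynamic partition (country has no fixed value), which usually needs two session settings first. A sketch, where the column types of emp_partition are guesses made only for illustration:

```sql
-- Allow Hive to derive partition values from the SELECT output.
SET hive.exec.dynamic.partition=true;
-- nonstrict mode lets every partition column be dynamic.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical DDL for the target table; the types are assumptions.
CREATE TABLE emp_partition (
  id   INT,
  name STRING,
  dept STRING,
  sal  DOUBLE
)
PARTITIONED BY (country STRING);

-- The partition column must come last in the SELECT list.
FROM emp_external temp
INSERT OVERWRITE TABLE emp_partition PARTITION (country)
SELECT temp.id, temp.name, temp.dept, temp.sal, temp.country;
```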
I'm not sure whether you want to insert the data in the output folder (created by Pig) into an existing table, or load it into a new partitioned Hive table.
If you want to load the data into a new Hive table, you can create a new partitioned table that points to the output folder.
If you want to load the data into an existing Hive table, you can either create a temp table as @Aman mentioned and insert into the destination table,
or
you can just move or copy the files in HDFS from output/ to the Hive table's location.
Hope this helps
Define a Hive table over the Pig output location, partitioned by the id column, and register each directory with ALTER TABLE ... ADD PARTITION. Now both are Hive tables, and you can use a WHERE clause on the partition column to move the data over.
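A sketch of that approach for the /output/1, /output/2, /output/3 layout from the question. The table name, data columns and tab delimiter are assumptions, since the question does not show the file contents:

```sql
-- Hypothetical external table over the Pig output.
-- A partition column cannot share a name with a regular column, so if the
-- data files also contain the id value, it is read here as id_in_file.
CREATE EXTERNAL TABLE pig_output (
  id_in_file INT,
  name       STRING
)
PARTITIONED BY (id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/output';

-- The directories are not named id=1, id=2, ... so each one is registered explicitly.
ALTER TABLE pig_output ADD PARTITION (id = 1) LOCATION '/output/1';
ALTER TABLE pig_output ADD PARTITION (id = 2) LOCATION '/output/2';
ALTER TABLE pig_output ADD PARTITION (id = 3) LOCATION '/output/3';

-- Filter on the partition column to read a single directory.
SELECT * FROM pig_output WHERE id = 1;
```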

Data in HDFS files not seen under hive table

I have to create a hive table from data present in oracle tables.
I'm doing a sqoop, thereby converting the oracle data into HDFS files. Then I'm creating a hive table on the HDFS files.
The sqoop completes successfully and the files also get generated in the HDFS target directory.
Then I run the create table script in Hive. The table gets created, but it is empty; no data is seen in the Hive table.
Has anyone faced a similar problem?
Hive's default field delimiter is Ctrl-A (\001); if you don't specify a delimiter, Hive assumes that default. Add the line below to your Hive script so the delimiter matches your files:
row format delimited fields terminated by '\t'
Your Hive script and your expectation are wrong. You are trying to create a partitioned table over data that you have already imported, and partitions don't work that way. If your create statement has no partition in it, you will be able to see the data.
Basically, if you want a partitioned table, you can't create it directly on the underlying data like you have tried above. To get Hive partitions, load the data from an intermediate table (or from that Sqoop directory) into your partitioned table.
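Putting both answers together, a sketch under stated assumptions: all table names, columns and the HDFS path are hypothetical, and the files are assumed to be comma-delimited text, which is Sqoop's default for text imports unless --fields-terminated-by was passed:

```sql
-- Unpartitioned staging table over the Sqoop target directory.
-- The delimiter must match what Sqoop actually wrote.
CREATE EXTERNAL TABLE orders_staging (
  order_id   INT,
  customer   STRING,
  order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hypothetical/sqoop/orders';

-- Should return rows if the delimiter and location match the files.
SELECT * FROM orders_staging LIMIT 10;

-- Partitioned target table, filled from the staging table.
CREATE TABLE orders_partitioned (
  order_id INT,
  customer STRING
)
PARTITIONED BY (order_date STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE orders_partitioned PARTITION (order_date)
SELECT order_id, customer, order_date FROM orders_staging;
```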

Loading Data from a .txt file to Table Stored as ORC in Hive

I have a data file which is in .txt format. I am using the file to load data into Hive tables. When I load the file in a table like
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS TEXTFILE;
the data is loaded correctly using
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
and I can run a SELECT * FROM test_details_txt; on the table in Hive.
However, if I try to load the data into a table defined as
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS ORC;
I receive the following error on trying to run a SELECT:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://master:6000/user/hive/warehouse/test.db/transaction_details/test_details.txt. Invalid postscript.
While loading the data using above LOAD statement I do not receive any error or exception.
Is there anything else that needs to be done when using the LOAD DATA INPATH command to store data in an ORC table?
LOAD DATA just copies (or moves) the files into the table's data directory. Hive does not transform the data while loading it into tables.
So, in this case, the input file /home/user/test_details.txt needs to be in ORC format if you are loading it into an ORC table.
A possible workaround is to create a temporary table with STORED AS TEXTFILE, then LOAD DATA into it, and then copy the data from this table to the ORC table.
Here is an example:
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy to ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;
Steps:
1. First create a table stored as TEXTFILE (i.e. the default, or whichever format matches your input file).
2. Load data into the text table.
3. Create the ORC table using CREATE TABLE ... STORED AS ORC AS SELECT * FROM text_table.
4. Select * from the ORC table.
Example:
CREATE TABLE text_table(line STRING);
LOAD DATA INPATH 'path_of_file' OVERWRITE INTO TABLE text_table;
CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM text_table;
SELECT * FROM orc_table; -- it can now be read
Since Hive does not transform the input data during a load, the formats need to match: either the file is already in ORC format, or you load the text file into a text table in Hive.
ORC is a binary file format, so you cannot load text files directly into ORC tables.
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75%, and as a result data processing also gets faster. ORC performs better than the Text, Sequence and RC file formats.
An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
First create a normal table stored as TEXTFILE and load your data into it; then use an INSERT OVERWRITE query to write the data into the ORC table.
create table table_name1 (schema of the table) row format delimited fields terminated by ',' stored as TEXTFILE;
create table table_name2 (schema of the table) stored as ORC;
load data local inpath 'path of your file' into table table_name1; -- loading data from the local file system
INSERT OVERWRITE TABLE table_name2 SELECT * FROM table_name1;
Now all your data will be stored in ORC files.
A similar procedure applies to the other binary file formats in Hive, i.e. Sequence files, RC files and Parquet files.
You can refer to the link below for more details.
https://acadgild.com/blog/file-formats-in-apache-hive/
Steps to load data into the ORC file format in Hive:
1. Create a normal table using the TEXTFILE format.
2. Load the data normally into this table.
3. Create a table with the same schema as your normal Hive table, using STORED AS ORCFILE.
4. Use an INSERT OVERWRITE query to copy the data from the TEXTFILE table to the ORCFILE table.
Refer to the blog to learn hands-on how to load data into all file formats in Hive:
Load data into all file formats in hive
