How to use Parquet in my current architecture?

My current system is architected as follows.
A log parser parses the raw logs every 5 minutes and writes the output to HDFS in TSV format. I created a Hive table on top of the TSV files in HDFS.
From some benchmarks, I found that Parquet can save up to 30-40% of the space usage. I also found that I can create a Hive table over Parquet files starting with Hive 0.13. I would like to know whether I can convert the TSV files to Parquet.
Any suggestions are appreciated.

Yes, in Hive you can easily convert from one format to another by inserting from one table to the other.
For example, if you have a TSV table defined as:
CREATE TABLE data_tsv
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
And a Parquet table defined as:
CREATE TABLE data_parquet
(col1 STRING, col2 INT)
STORED AS PARQUET;
You can convert the data with:
INSERT OVERWRITE TABLE data_parquet SELECT * FROM data_tsv;
Or you can skip the separate Parquet table DDL by using CREATE TABLE ... AS SELECT:
CREATE TABLE data_parquet STORED AS PARQUET AS SELECT * FROM data_tsv;
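For the pipeline in the question, where a parser drops new TSV files into HDFS every 5 minutes, one possible arrangement is an external TSV staging table plus periodic appends into the Parquet table. This is only a sketch: the table name logs_tsv_staging and the path /data/logs/tsv are made up for illustration, and data_parquet is assumed to exist with the DDL shown above.

```sql
-- Hypothetical staging table over the directory the log parser writes to.
-- EXTERNAL means dropping the table leaves the TSV files in place.
CREATE EXTERNAL TABLE logs_tsv_staging
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/tsv';  -- assumed path; must be a directory

-- Append (rather than overwrite) the staged rows to the Parquet table.
INSERT INTO TABLE data_parquet SELECT * FROM logs_tsv_staging;
```

Re-running the INSERT over files that were already converted would duplicate their rows, so converted files need to be moved out of the staging directory (or the batches kept apart in some other way) between runs.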

Related

How do I convert a sequence file to parquet format

I have a HIVE Table (test) that I need to create in the PARQUET format. I will be using a bunch of SEQUENCE files in order to create and insert into a table.
Once the table is created, is there a way to convert into PARQUET? I mean I know we could have done, say
CREATE TABLE default.test( user_id STRING, location STRING)
PARTITIONED BY ( dt INT ) STORED AS PARQUET
initially while creating the table itself. However, in my case I am forced to use SEQUENCE files to create the table first because it is the format that I have to begin with and cannot directly convert to PARQUET.
Is there a way I could convert into parquet after the table is created and data inserted?

To convert from sequence files to Parquet, you need to load the data into a new table (CTAS).
The question is tagged with presto, so I am giving you the Presto syntax for this. I am including partitioning because the example in the question contains it.
CREATE TABLE test_parquet WITH(format='PARQUET', partitioned_by=ARRAY['dt']) AS
SELECT * FROM test_sequencefile;
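If you run the conversion in Hive rather than Presto, a rough equivalent uses a dynamic-partition insert. The sketch below assumes the SequenceFile-backed table is named test_sequencefile (as in the Presto statement above) and has the columns from the question; adjust the names to your schema.

```sql
-- Allow Hive to derive partition values from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE test_parquet (user_id STRING, `location` STRING)
PARTITIONED BY (dt INT)
STORED AS PARQUET;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE test_parquet PARTITION (dt)
SELECT user_id, `location`, dt FROM test_sequencefile;
```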

load text to Orc file

How to load text file into Hive orc external table?
create table MyDB.TEST (
Col1 String,
Col2 String,
Col3 String,
Col4 String)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
I have already created the table above as ORC, but fetching data from it fails with the following error:
Failed with exception java.io.IOException:org.apache.orc.FileFormatException: Malformed ORC file hdfs://localhost:9000/Ext/sqooporc/part-m-00000. Invalid postscript.

There are multiple steps to that. The details follow.
Create a Hive table that can read the plain text file. Assuming that your file is comma-delimited and is on HDFS at /user/data/file1.txt, the syntax is as follows.
create table MyDB.TEST (
Col1 String,
Col2 String,
Col3 String,
Col4 String
)
row format delimited
fields terminated by ','
location '/user/data'; -- LOCATION must be a directory (here, the one containing file1.txt), not a file
Now you have a schema that matches the format of the data you possess.
Create another table with ORC schema
Now you need to create the ORC table, as you tried to do earlier. Here is a simpler syntax for creating that table.
create table MyDB.TEST_ORC (
Col1 String,
Col2 String,
Col3 String,
Col4 String)
STORED AS ORC;
Your TEST_ORC table is an empty table now. You can populate this table using the data from TEST table using the following command.
INSERT OVERWRITE TABLE TEST_ORC SELECT * FROM TEST;
The aforementioned statement will select all the records from TEST table and will try to write those records to TEST_ORC table. Since TEST_ORC is an ORC table, the data will be converted to ORC format on the fly when written into the table.
You can even check the storage location of TEST_ORC table for ORC files.
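One way to do that check, sketched here with a standard Hive command: DESCRIBE FORMATTED prints the table's Location, InputFormat and SerDe, so you can confirm that the table is ORC-backed and see where its files live.

```sql
-- Shows Location, InputFormat, OutputFormat and SerDe Library for the table.
DESCRIBE FORMATTED MyDB.TEST_ORC;
```

From the Hive CLI you can then list that directory with dfs -ls followed by the Location value from the output.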
Now your data is in ORC format and your table TEST_ORC has the required schema to parse it. You may drop your TEST table now, if not needed.
Hope that helps!

Can we use TEXT FILE format for Hive Table with Snappy compression?

I have a Hive external table in HDFS and I am trying to create a Hive managed table on top of it. I am using the TEXTFILE format with Snappy compression, but I want to know how that helps the table.
CREATE TABLE standard_cd
(
last_update_dttm TIMESTAMP,
last_operation_type CHAR (1) ,
source_commit_dttm TIMESTAMP,
transaction_dttm TIMESTAMP ,
transaction_type CHAR (1)
)
PARTITIONED BY (process_dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("orc.compress" = "SNAPPY");
Let me know if there are any issues with creating the table in this format.

There is no issue as such when creating the table, but the table properties differ between a table created and stored as TEXTFILE and one created and stored as ORC (the two property listings are not reproduced here).
The size of both tables was the same after loading some data, though.
Also check the documentation on the ORC file format.
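Note that orc.compress is read only by the ORC writer, so on a TEXTFILE table it is stored as a table property but does not compress anything. If Snappy-compressed text output is what you want, one option is to enable output compression for the session before inserting. This is a sketch, assuming the Snappy codec is available on the cluster and that the external table from the question is called standard_cd_ext (a hypothetical name) with matching columns:

```sql
-- Compress the files written by subsequent INSERT statements in this session.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Dynamic partitioning, since process_dt is taken from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE standard_cd PARTITION (process_dt)
SELECT last_update_dttm, last_operation_type, source_commit_dttm,
       transaction_dttm, transaction_type, process_dt
FROM standard_cd_ext;  -- hypothetical source table
```

Keep in mind that Snappy-compressed text files are not splittable, so each file is read by a single mapper. Storing the table as ORC or Parquet with Snappy compression avoids that limitation.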

Insert xml file on hdfs to Hive Parquet Table

I have a gzipped 3 GB XML file that I want to map to a Hive Parquet table.
I'm using an XML SerDe to parse that file into a temporary external table, and then I'm using INSERT to copy this data into a Hive Parquet table (I want the data to be stored in the Hive table, not just to create an interface to the XML file on HDFS).
I came up with this script:
CREATE TEMPORARY EXTERNAL TABLE temp_table (someData1 INT, someData2 STRING, someData3 ARRAY<STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.someData1" ="someXpath1/text()",
"column.xpath.someData2"="someXpath2/text()",
"column.xpath.someData3"="someXpath3/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://locationToGzippedXmlFile'
TBLPROPERTIES (
"xmlinput.start"="<MyItem>",
"xmlinput.end"="</MyItem>"
);
CREATE TABLE parquet_table
STORED AS Parquet
AS select * from temp_table;
The main point is that I want an optimized way to access the data. I don't want to parse the XML on every query; instead, I want to parse the whole file once and put the result into a Parquet table. However, the script above seems to run indefinitely, and in the logs I can see that only 1 mapper is used.
I don't really know if this is the correct approach (maybe it's possible to do this with partitions?).
BTW, I'm using Hue with Cloudera.

Loading Data from a .txt file to Table Stored as ORC in Hive

I have a data file which is in .txt format. I am using the file to load data into Hive tables. When I load the file in a table like
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS TEXTFILE;
the data is loaded correctly using
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
and I can run a SELECT * FROM test_details_txt; on the table in Hive.
However, if I try to load the data into a table that is defined as
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS ORC;
I receive the following error on trying to run a SELECT:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://master:6000/user/hive/warehouse/test.db/transaction_details/test_details.txt. Invalid postscript.
While loading the data using above LOAD statement I do not receive any error or exception.
Is there anything else that needs to be done while using the LOAD DATA IN PATH.. command to store data into an ORC table?

LOAD DATA just copies the files to hive datafiles. Hive does not do any transformation while loading data into tables.
So, in this case the input file /home/user/test_details.txt needs to be in ORC format if you are loading it into an ORC table.
A possible workaround is to create a temporary table with STORED AS TEXTFILE, LOAD DATA into it, and then copy the data from this table to the ORC table.
Here is an example:
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy to ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;

Steps:
1. First create a table stored as TEXTFILE (i.e. the default, or whichever format matches your input file).
2. Load the data into the text table.
3. Create the ORC table with CREATE TABLE ... STORED AS ORC AS SELECT * FROM text_table.
4. Select from the ORC table.
Example:
CREATE TABLE text_table(line STRING);
LOAD DATA INPATH 'path_of_file' OVERWRITE INTO TABLE text_table;
CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM text_table;
SELECT * FROM orc_table; -- it can now be read

Since Hive does not do any transformation to our input data, the format needs to be the same: either the file should be in ORC format, or we can load data from a text file to a text table in Hive.

ORC file is a binary file format, so you can not directly load text files into ORC tables.
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75%. As a result, the speed of data processing also increases. ORC shows better performance than the Text, Sequence and RC file formats.
An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
First you need to create a normal table stored as TEXTFILE and load your data into it; then you can use an INSERT OVERWRITE query to write your data into the ORC table.
create table table_name1 (schema of the table) row format delimited fields terminated by ',' stored as TEXTFILE;
create table table_name2 (schema of the table) stored as ORC;
load data local inpath 'path of your file' into table table_name1; -- loading data from the local file system
INSERT OVERWRITE TABLE table_name2 SELECT * FROM table_name1;
Now all your data will be stored in ORC files.
A similar procedure applies to all the binary file formats in Hive, i.e. Sequence files, RC files and Parquet files.
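For instance, assuming table_name1 is the populated text table from the steps above, the same copy step could target the other formats by changing only the STORED AS clause (the target table names here are illustrative):

```sql
-- Same data, three different storage formats; only STORED AS changes.
CREATE TABLE table_name_seq STORED AS SEQUENCEFILE AS SELECT * FROM table_name1;
CREATE TABLE table_name_rc  STORED AS RCFILE       AS SELECT * FROM table_name1;
CREATE TABLE table_name_pq  STORED AS PARQUET      AS SELECT * FROM table_name1;
```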
You can refer to the below link for more details.
https://acadgild.com/blog/file-formats-in-apache-hive/

Steps to load data into the ORC file format in Hive:
1. Create a normal table using the textFile format.
2. Load the data normally into this table.
3. Create a table with the same schema as your normal Hive table, using stored as orcfile.
4. Use an insert overwrite query to copy the data from the textFile table to the orcfile table.
Refer to the blog post "Load data into all file formats in hive" for a hands-on walkthrough of loading data into all file formats in Hive.
