Replicating table setup from ORC to Parquet - hadoop

I have the following table definition with ORC that I would like to replicate to Parquet (there are more fields I am not showing):
CREATE EXTERNAL TABLE `test_a`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (
`server_date` date
)
CLUSTERED BY (
sha_sum
)
SORTED BY (
sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://cluster/user/myuser/test_a'
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.create.index'='true',
'orc.stripe.size'='130023424',
'orc.row.index.stride'='64000');
I was wondering how I can replicate this setup for Parquet. I would like to use ZLIB or something similar for compression, have indexes, and potentially tune some of the TBLPROPERTIES for Parquet.
CREATE EXTERNAL TABLE `test_b`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (
`server_date` date
)
CLUSTERED BY (
sha_sum
)
SORTED BY (
sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true'
)
Is there a list of all of the options available for Parquet through TBLPROPERTIES?
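As a sketch of the compression side only (assumptions: the Hive version in use honours the parquet.compression table property; Parquet has no ZLIB codec, GZIP/DEFLATE being the nearest equivalent; and there is no direct counterpart of orc.create.index, since Parquet writes min/max statistics per row group on its own):
CREATE EXTERNAL TABLE `test_b`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (`server_date` date)
CLUSTERED BY (sha_sum)
SORTED BY (sha_sum, parent_sha_sum, md5_sum)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES (
'parquet.compression'='GZIP' -- or SNAPPY
);
-- depending on the Hive version, the codec may instead need to be set at write time:
-- SET parquet.compression=GZIP;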

Related

Change Hive External Table Column names to upper case and add new columns

I have an external table, for example dump_table, which is partitioned over year, month and day. If I run show create table dump_table I get the following:
CREATE EXTERNAL TABLE `dump_table`
(
`col_name` double,
`col_name_2` timestamp
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
CLUSTERED BY (
someid)
INTO 32 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://somecluster/test.db/dump_table'
TBLPROPERTIES (
'orc.compression'='SNAPPY',
'transient_lastDdlTime'='1564476840')
I have to change its columns to upper case and also add new columns, so it will become something like:
CREATE EXTERNAL TABLE `dump_table_2`
(
`COL_NAME` DOUBLE,
`COL_NAME_2` TIMESTAMP,
`NEW_COL` DOUBLE
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
CLUSTERED BY (
someid)
Option 1:
As an option I can run ALTER TABLE ... CHANGE (DDL Reference here) to change the column names and then add the new columns to it; a rough sketch of that follows. BUT the thing is that I do not have any backup for this table and it contains a lot of data. If anything goes wrong I might lose data.
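For reference, that in-place route would look roughly like this; a sketch only, and CASCADE (available from Hive 1.1) pushes the change to existing partition metadata as well:
-- rename a column in place (repeat per column); the type must be restated
ALTER TABLE dump_table CHANGE COLUMN col_name COL_NAME double CASCADE;
ALTER TABLE dump_table CHANGE COLUMN col_name_2 COL_NAME_2 timestamp CASCADE;
-- add the new column
ALTER TABLE dump_table ADD COLUMNS (NEW_COL double) CASCADE;
-- note: Hive treats column names case-insensitively, so the upper-case spelling may not be preserved in the metastore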
Can I create a new external table and migrate the data, partition by partition, from dump_table to dump_table_2? What will the query look like for this migration?
Is there any better way of achieving this use case? Please help.
You can create the new table dump_table_2 with the new columns and load the data using SQL:
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dump_table_2 partition (`year`, `month`, `day`)
select col1,
...
colN,
`year`, `month`, `day`
from dump_table t --join other tables if necessary to calculate columns
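The target table itself is created beforehand with the full new schema; a sketch reusing the bucketing and storage pattern from the question (the LOCATION and the type of someid are assumptions):
CREATE EXTERNAL TABLE dump_table_2(
`COL_NAME` double,
`COL_NAME_2` timestamp,
`someid` bigint, -- bucketing column carried over from dump_table (type assumed)
`NEW_COL` double)
PARTITIONED BY (`year` int, `month` int, `day` int)
CLUSTERED BY (someid) INTO 32 BUCKETS
STORED AS ORC
LOCATION 'hdfs://somecluster/test.db/dump_table_2'
TBLPROPERTIES ('orc.compress'='SNAPPY');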

msck repair table not working on unpartitioned table - hive config issue

I have an unpartitioned EXTERNAL table:
CREATE EXTERNAL TABLE `db.tableName`(
`sid` string,
`uid` int,
`t1` timestamp,
`t2` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<db_location>/tableName'
TBLPROPERTIES (
'serialization.null.format'='',
'transient_lastDdlTime'='1551121065')
When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero.
There are 10 rows in the CSV and I expect to get the count back as 10.
Any help is appreciated.
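For clarity, the sequence being run is roughly the following (the file copy happens outside Hive; the count query is an assumption about how the zero was observed):
-- file copied outside Hive, e.g. aws s3 cp tableName.csv s3://<db_location>/tableName/
MSCK REPAIR TABLE db.tableName;
SELECT COUNT(*) FROM db.tableName; -- returns 0, expected 10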

Difference in create table properties in hive while using ORC serde

Below is the structure of one of the existing Hive tables.
CREATE TABLE `tablename`(
col1 datatype,
col2 datatype,
col3 datatype)
partitioned by (col3 datatype)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='T',
'serialization.format'='T')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/file/location'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1536752440')
Now I want to create a table with the same properties. How do I define the below properties in the create table syntax? (See the sketch after the generated DDL below.)
field delimiter and serialization format
TBLPROPERTIES to store numFiles, numRows, rawDataSize, totalSize (and what other information can we store in the TBLPROPERTIES option)
Below is the create table statement which I have used:
create table test_orc_load (a int, b int) partitioned by (c int) stored as ORC;
Table definition which I got using the show create table option:
CREATE TABLE `test_orc_load`(
`a` int,
`b` int)
PARTITIONED BY (
`c` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/user/hive/warehouse/alb_supply_chain.db/test_orc_load'
TBLPROPERTIES (
'transient_lastDdlTime'='1537774167')
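For the first item, the field delimiter and serialization format can be declared directly in the CREATE statement via WITH SERDEPROPERTIES; a sketch copying the values from the table above (ORC is a binary format, so these serde properties are recorded in the metastore but a field delimiter is not actually used when writing ORC files). The numFiles, numRows, rawDataSize and totalSize entries are statistics that Hive maintains itself (for example after ANALYZE TABLE ... COMPUTE STATISTICS) rather than values you normally set by hand in TBLPROPERTIES.
CREATE TABLE test_orc_load (a int, b int)
PARTITIONED BY (c int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='T',
'serialization.format'='T')
STORED AS ORC;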

what is the use of serde in HIVE

Hi, I'm a beginner to Hive and I found the below in some sample code. Can someone help me understand this piece of code:
CREATE EXTERNAL TABLE emp ( -- EXTERNAL: Hive tracks the metadata but does not own or delete the files
id bigint,
name string,
dept bigint,
salary bigint)
partitioned by (yearofjoining string) -- data laid out in one directory per yearofjoining value
ROW FORMAT SERDE -- the SerDe (serializer/deserializer) turns file bytes into rows and back
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( -- SerDe configuration: fields are separated by '|'
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT -- how files are split and read into records
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT -- how records are written back out (plain text, key dropped)
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION -- S3 directory holding the data files
's3n://xxx/xxxx/xxx/xxx/xx'

XML Serde for Hadoop/Hive

I used JSONSerde to process huge amounts of JSON data stored on S3 using Amazon EMR. One of my clients has a requirement to process massive XML data, but I couldn't find any XML SerDe to use with Hive.
Have you folks processed XML with Hive? I would appreciate your suggestions and comments on this before I start building my own XML SerDe.
I use the following SerDe for XML parsing in Hive:
CREATE EXTERNAL TABLE XYZ(
X STRING,
Y STRING,
Z ARRAY<STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.X"="/XX/#X",
"column.xpath.Y"="/YY/#Y"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/XXX'
TBLPROPERTIES (
"xmlinput.start"="<xml start",
"xmlinput.end"="</xml end>"
);
The link to download the XML SerDe is:
http://central.maven.org/maven2/com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.0.0/hivexmlserde-1.0.0.0.jar
Put this jar file in the path /usr/lib/hive/lib.
Once you have done this, you can use the XML SerDe:
CREATE TABLE xml_bank(customer_id STRING, income BIGINT, demographics
map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/#customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
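For context, an input record matching those XPath mappings would look roughly like the one sketched in the comments below (values invented for illustration); the two map columns pick up the child elements of demographics and financial:
-- Illustrative input, one XML record per row:
--   <record customer_id="c-001"><income>75000</income>
--     <demographics><age>40</age><gender>F</gender></demographics>
--     <financial><score>700</score></financial></record>
SELECT customer_id, income, demographics['age'], financial['score']
FROM xml_bank;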
