Difference in CREATE TABLE properties in Hive while using the ORC SerDe - hadoop

Below is the structure of one of the existing Hive tables:
CREATE TABLE `tablename`(
col1 datatype,
col2 datatype)
PARTITIONED BY (col3 datatype)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='T',
'serialization.format'='T')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/file/location'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1536752440')
Now I want to create a table with the same properties. How do I define the below properties in the CREATE TABLE syntax?
field delimiter and serialization format
TBLPROPERTIES to store numFiles, numRows, rawDataSize, totalSize (and what other information can we store through the TBLPROPERTIES option?)
Below is the CREATE TABLE statement I have used:
create table test_orc_load (a int, b int) partitioned by (c int) stored as ORC;
Table properties which I got using the SHOW CREATE TABLE option:
CREATE TABLE `test_orc_load`(
`a` int,
`b` int)
PARTITIONED BY (
`c` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/user/hive/warehouse/alb_supply_chain.db/test_orc_load'
TBLPROPERTIES (
'transient_lastDdlTime'='1537774167')
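For reference, both pieces can be spelled out explicitly in the DDL. A minimal sketch, with two caveats: the ORC SerDe is a binary columnar format and does not actually use a field delimiter, so the SERDEPROPERTIES below are stored in the metastore but ignored at read time; and the statistics keys (numFiles, numRows, rawDataSize, totalSize) are maintained by Hive itself, so they are not set by hand in the DDL.
CREATE TABLE test_orc_load (a int, b int)
PARTITIONED BY (c int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ('field.delim'='T', 'serialization.format'='T')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB');

-- The statistics are refreshed automatically when hive.stats.autogather is on, or via:
ANALYZE TABLE test_orc_load PARTITION (c) COMPUTE STATISTICS;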

Related

Hive table shows NULL values

As per a customer requirement, we are migrating the Hive database from an AWS EC2 instance to AWS EMR.
I gathered all the CREATE TABLE statements, as below:
CREATE TABLE abc( col1 double, col2 double, col3 string, col4 timestamp, col5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 's3a://oldprodbucket/hive_folder/hive_database.db/hive_database_ABC'
TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'numFiles'='0', 'numRows'='-1', 'orc.compress'='ZLIB', 'rawDataSize'='-1', 'totalSize'='0', 'transient_lastDdlTime'='1559130496')
We changed the LOCATION value to point at the new bucket where the data is present, as below.
CREATE TABLE abc( col1 double, col2 double, col3 string, col4 timestamp, col5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 's3://prodbucket/hive_folder/hive_database.db/hive_database_ABC'
TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'numFiles'='0', 'numRows'='-1', 'orc.compress'='ZLIB', 'rawDataSize'='-1', 'totalSize'='0', 'transient_lastDdlTime'='1559130496')
But when running a SELECT query on the table, all the columns show as NULL.
| NULL | NULL | NULL | NULL | NULL
Can someone please help in this regard?
The Stack Overflow question HIVE ORC returns NULLs helped me identify the issue.
With the help of the Hive database admin, we found the property orc.force.positional.evolution. By default, ORC resolves columns by name; when the data files carry internal names such as _col0, _col1, ..., the name lookup fails and every column reads back as NULL, whereas positional evolution matches columns to the table schema by position.
After setting it to true as below, we were able to see the data correctly.
ALTER table TableName SET TBLPROPERTIES('orc.force.positional.evolution'='true');
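Since every table has to be recreated during the migration anyway, the property can also be baked into the DDL up front. A minimal sketch, assuming the same column list and location as above:
CREATE TABLE abc (col1 double, col2 double, col3 string, col4 timestamp, col5 string)
STORED AS ORC
LOCATION 's3://prodbucket/hive_folder/hive_database.db/hive_database_ABC'
TBLPROPERTIES ('orc.compress'='ZLIB', 'orc.force.positional.evolution'='true');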

How to add a multi-level partition in Hive?

I have a managed table named customer in Hive, partitioned by date1 and customerName. My directory structure is like below:
user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz
When I run show partitions customer, it gives no output. So I tried to add a partition with:
MSCK REPAIR TABLE customer;
It gives the error: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
ALTER TABLE customer ADD PARTITION (date1='2021-09-15') PARTITION (customerName='xyz');
It also gives the error: ValidationFailureSemanticException partition spec {customername=xyz} contain non partition column
How can I add these partitions to the Hive metastore?
hive> show create table customer;
OK
CREATE TABLE `customer`(
`act` string)
PARTITIONED BY (
`date1` string,
`customername` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='hdfs://hdcluster/user/hive/warehouse/test.db/customer')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hdcluster/user/hive/warehouse/test.db/customer'
TBLPROPERTIES (
'spark.sql.create.version'='2.4.0',
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"act\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"customername\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
'spark.sql.sources.schema.partCol.0'='date1',
'spark.sql.sources.schema.partCol.1'='customername',
'transient_lastDdlTime'='1631781225')
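As an aside, Hive expects all partition columns of a multi-level partition in a single PARTITION spec rather than one spec per column, which is what the ValidationFailureSemanticException above appears to be complaining about. A minimal sketch, with both values taken from the directory path shown in the question:
ALTER TABLE customer ADD PARTITION (date1='2021-09-16', customerName='xyz')
LOCATION 'hdfs://hdcluster/user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz';
After this, show partitions customer should list date1=2021-09-16/customername=xyz.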

MSCK REPAIR TABLE not working on unpartitioned table - Hive config issue

I have an unpartitioned EXTERNAL table:
CREATE EXTERNAL TABLE `db.tableName`(
`sid` string,
`uid` int,
`t1` timestamp,
`t2` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<db_location>/tableName'
TBLPROPERTIES (
'serialization.null.format'='',
'transient_lastDdlTime'='1551121065')
When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero.
There are 10 rows in the CSV and I expect to get the count back as 10.
Any help is appreciated.
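One thing worth checking: MSCK REPAIR TABLE only discovers missing partition directories, so on an unpartitioned table it has nothing to repair; files placed directly under the table's LOCATION should be readable without any repair step. A minimal sanity check, assuming the DDL above:
-- An unpartitioned external table reads whatever files sit under its LOCATION,
-- so the repair step is a no-op here; query the table directly:
SELECT COUNT(*) FROM db.tableName;
-- If this still returns 0, compare the actual S3 path of the CSV with the
-- table's registered location:
DESCRIBE FORMATTED db.tableName;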

Replicating table setup from ORC to Parquet

I have the following table definition with ORC that I would like to replicate to Parquet (there are more fields I am not showing):
CREATE EXTERNAL TABLE `test_a`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (
`server_date` date
)
CLUSTERED BY (
sha_sum
)
SORTED BY (
sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://cluster/user/myuser/test_a'
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.create.index'='true',
'orc.stripe.size'='130023424',
'orc.row.index.stride'='64000')
I was wondering how I can replicate this to Parquet. I would like to use ZLIB or something similar for compression, to have indexes, and potentially to tune some of the TBLPROPERTIES for Parquet.
CREATE EXTERNAL TABLE `test_b`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (
`server_date` date
)
CLUSTERED BY (
sha_sum
)
SORTED BY (
sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true'
)
Is there a list of all of the options available for Parquet through TBLPROPERTIES?
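I don't have an exhaustive list, but the property most directly analogous to orc.compress is parquet.compression. A sketch assuming GZIP, Parquet's DEFLATE-based codec and the closest match to ZLIB (SNAPPY and UNCOMPRESSED are the other common values):
CREATE EXTERNAL TABLE test_b (
some_id int,
sha_sum string,
parent_sha_sum string,
md5_sum string)
PARTITIONED BY (server_date date)
CLUSTERED BY (sha_sum)
SORTED BY (sha_sum, parent_sha_sum, md5_sum)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES ('parquet.compression'='GZIP');
There is no direct counterpart to orc.create.index: Parquet keeps min/max statistics in its row-group and page metadata on its own.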

What is the use of SerDe in Hive?

Hi, I'm a beginner to Hive and I found the below in some sample code. Can someone help me understand this piece of code?
CREATE EXTERNAL TABLE emp (
id bigint,
name string,
dept bigint,
salary bigint)
partitioned by (yearofjoining string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3n://xxx/xxxx/xxx/xxx/xx'
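In short, a SerDe (serializer/deserializer) tells Hive how to turn the bytes of a stored record into columns on read and back into bytes on write. Here LazySimpleSerDe parses each text line by splitting it on the | delimiter, so a line such as 101|John|2|50000 is read as (id=101, name='John', dept=2, salary=50000). For what it's worth, the same table is more commonly written with ROW FORMAT DELIMITED, which expands to exactly this SerDe configuration under the hood; a sketch:
CREATE EXTERNAL TABLE emp (
id bigint,
name string,
dept bigint,
salary bigint)
PARTITIONED BY (yearofjoining string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3n://xxx/xxxx/xxx/xxx/xx';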
