msck repair table not working on unpartitioned table - hive config issue - hadoop

I have an unpartitioned EXTERNAL table:
CREATE EXTERNAL TABLE `db.tableName`(
`sid` string,
`uid` int,
`t1` timestamp,
`t2` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<db_location>/tableName'
TBLPROPERTIES (
'serialization.null.format'='',
'transient_lastDdlTime'='1551121065')
When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero.
There are 10 rows in the CSV and I expect to get the count back as 10.
Any help is appreciated.

Related

How to add multi-level partition in hive?

I have a managed table, customer, in Hive, partitioned by date and customerName. My directory structure looks like this:
user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz
When I run show partitions customer, it returns no output, so I tried to add the partitions with
MSCK REPAIR TABLE customer;
It gives the error Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I then tried
ALTER TABLE customer ADD PARTITION (date1='2021-09-15') PARTITION (customerName='xyz');
It also gives an error: ValidationFailureSemanticException partition spec {customername=xyz} contain non partition column
How can I add these partitions to the Hive metastore?
hive> show create table customer;
OK
CREATE TABLE `customer`(
`act` string)
PARTITIONED BY (
`date1` string,
`customername` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='hdfs://hdcluster/user/hive/warehouse/test.db/customer')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hdcluster/user/hive/warehouse/test.db/customer'
TBLPROPERTIES (
'spark.sql.create.version'='2.4.0',
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'=
'{\"type\":\"struct\",\"fields\":
[{\"name\":\"act\",\"type\":\"string\",\"nullable\":true,\"metadata\":
{}}, {\"name\":\"date1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"customername\",\"type\":\"string\",\"nullable\":true,\"metadata\
":{}}]}','spark.sql.sources.schema.partCol.0'='date1',
'spark.sql.sources.schema.partCol.1'='customername',
'transient_lastDdlTime'='1631781225')
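The second error is consistent with each PARTITION clause being read as a complete partition spec, so a spec that names only customername fails validation. A minimal sketch of adding one two-level partition with a single clause follows. The LOCATION value is an assumption pieced together from the table location and the directory shown in the question; it is given explicitly because the directory is named customerName=xyz rather than the lowercase customername=xyz.

```sql
-- One PARTITION clause per partition, listing every partition column.
-- The LOCATION path is assumed: table LOCATION plus the directory from the question.
ALTER TABLE customer ADD IF NOT EXISTS
  PARTITION (date1='2021-09-16', customername='xyz')
  LOCATION 'hdfs://hdcluster/user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz';

-- Confirm the partition is now registered in the metastore.
SHOW PARTITIONS customer;
```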

Difference in create table properties in hive while using ORC serde

Below is the structure of one of our existing Hive tables.
CREATE TABLE `tablename`(
col1 datatype,
col2 datatype,
col3 datatype)
partitioned by (col3 datatype)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='T',
'serialization.format'='T')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/file/location'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1536752440')
Now I want to create a table with the same properties. How do I define the following in the CREATE TABLE syntax?
Field delimiter and serialization format
TBLPROPERTIES to store numFiles, numRows, rawDataSize, totalSize (and what other information can be stored in the TBLPROPERTIES option)
Below is one of the CREATE TABLE statements I have used:
create table test_orc_load (a int, b int) partitioned by (c int) stored as ORC;
These are the table properties I got from show create table:
CREATE TABLE `test_orc_load`(
`a` int,
`b` int)
PARTITIONED BY (
`c` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/user/hive/warehouse/alb_supply_chain.db/test_orc_load'
TBLPROPERTIES (
'transient_lastDdlTime'='1537774167')
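One way to carry the SerDe properties over is to spell out the same clauses that show create table prints. The sketch below assumes a hypothetical table name, test_orc_copy, and uses orc.compress only as an illustrative table property. The statistics entries (numFiles, numRows, rawDataSize, totalSize) are normally maintained by Hive itself rather than written by hand, for example when ANALYZE TABLE runs. The delimiter properties are copied for parity only and likely have no effect on how ORC data is read.

```sql
-- test_orc_copy is a hypothetical name used only for illustration.
CREATE TABLE test_orc_copy (a int, b int)
PARTITIONED BY (c int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'field.delim'='T',
  'serialization.format'='T')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB');

-- Recompute basic statistics for all partitions once data is loaded.
ANALYZE TABLE test_orc_copy PARTITION (c) COMPUTE STATISTICS;
```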

what is the use of serde in HIVE

Hi, I'm a beginner with Hive and found the following in some sample code. Can someone help me understand it?
CREATE EXTERNAL TABLE emp (
id bigint,
name string,
dept bigint,
salary bigint)
partitioned by (yearofjoining string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3n://xxx/xxxx/xxx/xxx/xx'

Hive Error: ORC does not support type conversion from DATE to TIMESTAMP

I have a source table in Hive with DDL as:
CREATE EXTERNAL TABLE JRNL.SOURCE_TAB(
ticket_id varchar(11),
ttr_start timestamp,
ttr_stop timestamp
)
PARTITIONED BY (
exp_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://my.cluster.net:8020//db/data/SOURCE_TAB'
TBLPROPERTIES (
'last_modified_by'='edpintdatd',
'last_modified_time'='1466093031',
'serialization.null.format'='',
'transient_lastDdlTime'='1466093031')
When I am querying the table:
hive> select exp_dt from JRNL.SOURCE_TAB limit 3;
It throws an exception:
Failed with exception java.io.IOException:java.io.IOException: ORC does not support type conversion from DATE to TIMESTAMP
Even when I tried to create a replica table like the above source, using:
CREATE TABLE JRNL.SOURCE_TAB_BKP(
ticket_id varchar(11),
ttr_start timestamp,
ttr_stop timestamp
)
PARTITIONED BY (exp_dt string);
and then inserting data in this table using:
INSERT INTO TABLE JRNL.SOURCE_TAB_BKP PARTITION (exp_dt)
SELECT
ticket_id,
ttr_start,
ttr_stop,
exp_dt string
FROM JRNL.SOURCE_TAB;
it still gives me the error ORC does not support type conversion from DATE to TIMESTAMP
I tried using
to_utc_timestamp(unix_timestamp(ttr_start),'UTC'),
to_utc_timestamp(unix_timestamp(ttr_stop),'UTC'),
but this doesn't help either.
I have already set hive.exec.dynamic.partition.mode=nonstrict.
I also tried CAST(.... as DATE) and CAST(.... as TIMESTAMP); neither worked.

hive external partitioned table

First I created a Hive external table partitioned by code and date:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/old_work/XYZ';
Then I ran an insert overwrite on this table, taking data from another table:
INSERT OVERWRITE TABLE XYZ PARTITION (CODE,DATE)
SELECT
*
FROM TEMP_XYZ;
After that I counted the number of records in Hive:
select count(*) from XYZ;
It shows 1000 records.
Then I moved (renamed) the location '/old_work/XYZ' to '/new_work/XYZ',
dropped the XYZ table, and created it again pointing at the new directory,
that is, '/new_work/XYZ':
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/new_work/XYZ';
But now when I run select count(*) from XYZ in Hive, it shows 0 records.
I think I missed something. Can you please help me with this?
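You do not need to drop the table and re-create it a second time.
As soon as you move or rename the external HDFS location of the table, just run:
msck repair table <table_name>
In your case the count was zero because the Hive metastore wasn't updated with the new path. The re-created table has no partitions registered, so queries see no data until the partition directories under the new location are added to the metastore.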
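As a rough sketch of that repair for the table in the question, assuming the directories under '/new_work/XYZ' still follow the code=.../date=... layout written by the earlier insert overwrite:

```sql
-- Register the partition directories found under the table's LOCATION.
MSCK REPAIR TABLE XYZ;

-- Check which partitions the metastore now knows about.
SHOW PARTITIONS XYZ;

-- Re-run the count once the partitions are registered.
SELECT COUNT(*) FROM XYZ;
```

If show partitions still returns nothing, the directory names probably do not match the partition column names, and the partitions would have to be added explicitly with alter table ... add partition.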

Resources