Hive table shows NULL values - hadoop

As per customer requirement, we are migrating the Hive database from AWS EC2 instance to AWS EMR instance.
I have gathered all the create table statements as below
CREATE TABLE abc( col1 double, col2 double, col3 string, col4 timestamp, col5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 's3a://oldprodbucket/hive_folder/hive_database.db/hive_database_ABC'
TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'numFiles'='0', 'numRows'='-1', 'orc.compress'='ZLIB', 'rawDataSize'='-1', 'totalSize'='0', 'transient_lastDdlTime'='1559130496')
We changed the Location value, where the data is present in the new bucket, as below.
CREATE TABLE abc( col1 double, col2 double, col3 string, col4 timestamp, col5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 's3://prodbucket/hive_folder/hive_database.db/hive_database_ABC'
TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'numFiles'='0', 'numRows'='-1', 'orc.compress'='ZLIB', 'rawDataSize'='-1', 'totalSize'='0', 'transient_lastDdlTime'='1559130496')
But when triggering the SELECT query on the table, it shows all the columns as NULL.
| NULL | NULL | NULL | NULL | NULL
Can someone please help in this regards?

Stackoverflow link HIVE ORC returns NULLs
helped me to identify the issue.
With the help of Hive database Admin, we found the property named orc.force.positional.evolution.
After setting it to true as below, we are able to see the data correctly.
ALTER table TableName SET TBLPROPERTIES('orc.force.positional.evolution'='true');

Related

How to get default values of table properties in Hive?

I created an internal table using HiveQL:
CREATE TABLE city (
id INT,
city VARCHAR(15)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE;
Inserted one record
INSERT INTO city SELECT 1, null;
I want to know which default values are used. But Hive returns 'Table default.city does not have property'
SHOW TBLPROPERTIES city('serialization.format');
SHOW TBLPROPERTIES city('serialization.null.format');
SHOW TBLPROPERTIES city('serialization.encoding');
SHOW TBLPROPERTIES city('serialization.escape.crlf');
I also don't see them using the describe command:
DESCRIBE FORMATTED city;
I found out which values are used analyzing files on HDFS but I want to know if there is any easy way to get default values using HiveQL.

msck repair table not working on unpartitioned table - hive config issue

I have an unpartitioned EXTERNAL table:
CREATE EXTERNAL TABLE `db.tableName`(
`sid` string,
`uid` int,
`t1` timestamp,
`t2` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<db_location>/tableName'
TBLPROPERTIES (
'serialization.null.format'='',
'transient_lastDdlTime'='1551121065')
When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero.
There are 10 rows in the CSV and I expect to get the count back as 10.
Any help is appreciated.

Difference in create table properties in hive while using ORC serde

Below is the structure of one of the existing hive table.
CREATE TABLE `tablename`(
col1 datatype,
col2 datatype,
col3 datatype)
partitioned by (col3 datatype)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='T',
'serialization.format'='T')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/file/location'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1536752440')
Now i want to create a table with same properties, how do i define below properties in create table syntax.
field delimiter and seralization format
TBLPROPERTIES to store numFiles, numRows, radDataSize, totalSize (and what all other information we can store in TBLPROPERTIES option)
Below is one of the create table syntax which i have used
create table test_orc_load (a int, b int) partitioned by (c int) stored as ORC;
Table properties which i got using show create table option.
CREATE TABLE `test_orc_load`(
`a` int,
`b` int)
PARTITIONED BY (
`c` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'maprfs:/user/hive/warehouse/alb_supply_chain.db/test_orc_load'
TBLPROPERTIES (
'transient_lastDdlTime'='1537774167')

what is the use of serde in HIVE

Hi I'm beginner to hive and I found the below from one of the sample code, can some one help me in understanding the below piece of code :
CREATE EXTERNAL TABLE emp (
id bigint,
name string,
dept bigint,
salary bigint)
partitioned by (yearofjoining string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3n://xxx/xxxx/xxx/xxx/xx'

Hive Create Table Partitions from file name

New to Hadoop. I know how to create a table in Hive (Syntax)
Creating a table with 3 Partition Key. but the keys are in File Names.
FileName Example : ServerName_ApplicationName_ApplicationName.XXXX.log.YYYY-MM-DD
there are hundreds of file in a directory want to create a table with following Partition Keys from file Name :ServerName,ApplicationName,Date and load all the files in to table
Hive Script would be the preference but open to any other ideas
(files are CSV. and I know The schema(column definitions) of the file )
I assume the File Name is in format ServerName_ApplicationName.XXXX.log.YYYY-MM-DD (removed second "applicationname" assuming it to be a typo).
Create a table on the contents of the original file. Some thing like..
create external table default.stack
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
ROW FORMAT DELIMITED
FIELDS terminated by ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/location1...';
Create another partitioned table in another location like
create external table default.stack_part
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
PARTITIONED BY ( servername string, applicationname string, load_date string)
STORED as AVRO -- u can choose any format for the final file
location 'hdfs://nameservice1/location2...';
Insert into partitioned table from base table using below query:
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
set hive.exec.parallel=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Insert overwrite table default.stack_part
partition ( servername, applicationname, load_date)
select *,
split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[0] as servername
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[0] as applicationname
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[3] as load_date
from default.stack;
I have tested this and it works.

Resources