Hive: Select on partition column returns results even after the table is truncated

I have a table in Hive which is partitioned on a column.
I truncate the table and then run two selects on it.
Select count(*) from table returns 0, which is expected.
But if I do a select on the partition column, I get results instead of the empty result I expected.
Select distinct <partition column>
from table
Result:
partition value 1
partition value 2
......
I can see in HDFS that the partition folders still exist, though they are empty. I expected the metadata to get updated after I did the truncate.
I am not sure why I am getting the above result.
Any help is appreciated
Thanks!
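A plausible explanation, not confirmed in this thread: TRUNCATE TABLE deletes the data files but leaves the partition entries in the metastore, and Hive can answer a query that touches only partition columns straight from that metadata. A minimal sketch of two ways to make the stale values disappear; the table and column names below are placeholders:
-- drop the now-empty partitions so the metastore stops reporting their values
ALTER TABLE my_table DROP PARTITION (part_col='partition value 1');
-- or make the query read the (empty) data files instead of the metastore
SET hive.optimize.metadataonly=false;
SELECT DISTINCT part_col FROM my_table;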

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it:
LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine till here. Now I want to create a dynamically partitioned table. First of all, I set up these params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead, by the way.)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4).
You can run this command to check whether there are null values in the column:
select count(*) from title_ratings where averageRating is null;
If the count is non-zero, you have to fill those nulls first, then apply the partitioning again.
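If nulls do turn up, one way to fill them during the insert, sketched here with an assumed default of 0.0:
INSERT INTO title_ratings_part PARTITION (averageRating)
SELECT tconst, numVotes, COALESCE(averageRating, 0.0) AS averageRating
FROM title_ratings;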
The partition column is stored as the last column in a table, so while inserting you need to maintain the correct order in the select statement. Also, the partition spec must name the partition column (averageRating), not the source table, which is what the error message is complaining about.
Please change the order of columns in the select:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- order-wise this should always be the last column
from title_ratings;
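Since the goal is range buckets (less than 2, between 2 and 4, greater than 4) rather than one partition per distinct rating, here is a hedged sketch of that variant; the table name title_ratings_banded and the band labels are made up for illustration:
CREATE TABLE IF NOT EXISTS title_ratings_banded (
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;

INSERT INTO title_ratings_banded PARTITION (rating_band)
SELECT tconst, averageRating, numVotes,
CASE WHEN averageRating < 2 THEN 'low'
WHEN averageRating <= 4 THEN 'mid'
ELSE 'high' END AS rating_band
FROM title_ratings;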

Hive select * shows 0 rows but count(1) returns millions of rows

I create a hive external table and add a partition by
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a parquet file into the target HDFS directory with hdfs dfs -put ...
I can get expected results using select * from tbA in IMPALA.
In Hive, I can get correct result when using
select count(1) from tbA
However, when I use
select * from tbA limit 10
It returns no result at all.
If anything were wrong with the parquet file or the directory, Impala should not get the correct result, and Hive could not count the rows... Why does select * from ... show nothing? Any help is appreciated.
In addition:
running select distinct day from tbA returns 2021-09-04
running select * from tbA returns data with day = 2021-09-04
It seems this partition is not recognized correctly? I retried dropping the partition and running msck repair table, but it still does not work.
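No resolution appears in this excerpt. A hedged diagnostic sketch, with an assumed warehouse path: first confirm the parquet file actually landed inside the partition subdirectory, then make Hive read the files through a full job instead of the fetch-task shortcut for simple selects.
hdfs dfs -ls /warehouse/tbA/day=2021-09-04/   # path is an assumption; it must match the table's location
-- then, in Hive:
SET hive.fetch.task.conversion=none; -- disables the fetch-task shortcut for simple SELECTs
select * from tbA limit 10;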

how to make the max function in a hive query ignore __HIVE_DEFAULT_PARTITION__

I have a view which uses max to show the latest partition (partitions are of the format 2021-01, 2021-02, 2021-03, 2021-04). The hive table has a __HIVE_DEFAULT_PARTITION__ too.
When we run the query in Impala, max on the partition column gives the correct value of 2021-04, ignoring __HIVE_DEFAULT_PARTITION__, but the same does not work when we run the query in Hive, as it returns __HIVE_DEFAULT_PARTITION__.
Is there any way to make the Hive query ignore the default partition, if it exists, when returning max on that column?
You can filter it:
select max(partition_col) from your_table where partition_col != "__HIVE_DEFAULT_PARTITION__"
If you do not need data in __HIVE_DEFAULT_PARTITION__, you can drop it:
ALTER TABLE your_table DROP PARTITION (partition_col='__HIVE_DEFAULT_PARTITION__');
Transforming __HIVE_DEFAULT_PARTITION__ to NULL can be a solution if, along with max(partition_col), you want to aggregate something else and do not want to exclude the __HIVE_DEFAULT_PARTITION__ partition:
select max(case when partition_col = "__HIVE_DEFAULT_PARTITION__" then NULL else partition_col end) as max_partition_col,
--aggregate something else including HIVE_DEFAULT_PARTITION
from your_table

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_area_code. I have the following fields in accounts_by_area_code: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number).
I need to use a SELECT statement to extract the area code into an INSERT INTO TABLE command to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on the data, so I am not sure why you want to create an empty table with partitions; they will be created automatically when new data is inserted.
Still, I think you can create the empty table with partitions like this:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL as STRING), STRLEFT (phone_number,3) AS areacode FROM accounts;"
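For the error in the question itself: the column permutation plus the PARTITION clause name five columns, so the SELECT has to return five expressions, with the partition value last. A sketch of the full copy, untested, using only column names from the question:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number) PARTITION (areacode)
SELECT acct_num, first_name, last_name, phone_number, STRLEFT(phone_number, 3) AS areacode FROM accounts;"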

Hive: Insert into hive table with column using select 1

Let's say I have a hive table test_entry with a column called entry_id.
hive> desc test_entry;
OK
entry_id int
Time taken: 0.4 seconds, Fetched: 1 row(s)
Suppose I need to insert one row into the above table using select 1 (which returns 1), for example with a syntax like the one below:
hive> insert into table test_entry select 1;
But I get the below error:
FAILED: NullPointerException null
So effectively, I would like to insert one row for entry_id whose value will be 1, with such a select statement (without referring to another table).
How can this be done?
Hive does not support what you're trying to do. Inserts into ORC-based tables were introduced in Hive 0.13.
Prior to that, you have to specify a FROM clause if you're doing INSERT .. SELECT.
A workaround might be to create an external table with one row and do the following:
INSERT .. SELECT 1 FROM table_with_one_row
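A minimal sketch of that workaround; the helper table name dual_helper and the input file path are made up for illustration:
-- a helper table that will hold exactly one row
CREATE TABLE dual_helper (dummy STRING);
LOAD DATA LOCAL INPATH '/tmp/one_line.txt' INTO TABLE dual_helper; -- a file containing a single line
-- SELECT 1 now has a FROM clause to satisfy the parser
INSERT INTO TABLE test_entry SELECT 1 FROM dual_helper;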
