how to make max function in hive query to ignore _HIVE_DEFAULT_PARTITION__

how to make max function in hive query to ignore _HIVE_DEFAULT_PARTITION__ - hadoop

I have a view which uses max to show the latest partition (which is of format 2021-01, 2021-02, 2021-03, 2021-04). The hive table has _HIVE_DEFAULT_PARTITION__ too.
When we run the query in Impala, max on partitions gives the correct value of 2021-04 ignoring _HIVE_DEFAULT_PARTITION__ but the same do not work when we run the query in Hive as it returns _HIVE_DEFAULT_PARTITION__
Is there any way to make Hive query ignore the default partition if exists while returning max on that column?

You can filter it:
select max(partition_col) from your_table where partition_col != "__HIVE_DEFAULT_PARTITION__"
If you do not need data in __HIVE_DEFAULT_PARTITION__, you can drop it:
ALTER TABLE your_table DROP PARTITION (partition_col='__HIVE_DEFAULT_PARTITION__');
Transforming __HIVE_DEFAULT_PARTITION__ to NULL can be a solution if with max(partition_col) you want to aggregate something else and do not want to excluse __HIVE_DEFAULT_PARTITION__ partition:
select max(case when partition_col = "__HIVE_DEFAULT_PARTITION__" then NULL else partition_col end) as max_partition_col,
--aggregate something else including HIVE_DEFAULT_PARTITION
from your_table

Related

Hive:- Select on partition column gives result even when table is truncated

I have a table in hive which is partitioned on a column.
I truncate the table and I do 2 selects on them.
Select count(*) from table, I get the result as 0. Which is expected.
But If I do a select on the partition column, I get results instead of null rows which I expected.
Select distinct <paritition column>
from table
Result:
partition value 1
partition value 2
......
I can see in hdfs that the partition folders still exists , though they are empty. I expected the metadata to get updated after I did the truncate.
I am not sure why I am getting the above result
Any help is appreciated
Thanks!

Select max query returning all the rows in a table in Apache Hive

I am querying my data using this query
SELECT date_col,max(rate) FROM crypto group by date_col ;
I am expecting a single row but it is returning all the rows in the table. What is the mistake in this query?

You'll get one row per date_col because you're grouping by it. If you just want the maximum rate then just do SELECT max(rate) FROM crypto;.
If you want to get the date_col for that record too then:
SELECT
date_col,
rate
FROM crypto
WHERE rate = (SELECT MAX(rate) FROM crypto)

How to replace NULL values in one column to 0 (of a very large table) without creating a new column of the desired results added to the table in HIVE?

I am trying to replace all of the NULL values to 0 in a column of a big table in HIVE.
However, every time I try to implement some code I end up generating a new column to the table. The column I am trying to change/modify still exists and still has the NULL values but the new column that is automatically generated (i.e. _c1) is what I want the column I am trying to modify, to look like.
I tried to run a COALESCE but that also ended up generating a new column. I also tried to implement a CASE WHEN, but the same results ensued.
Select *,
CASE WHEN columnname IS NULL THEN 0
ELSE columnname
END
from tablename;
Also tried
SELECT coalesce(columnname, CAST(0 AS BIGINT)) FROM tablename
I would just like to update the table with the other columns being as is but the column I want to modify still has its original name but instead of NULL values it has 0's that replaced them.
I don't want to generate a new column but modify an existing one.
How should I do that?

Use insert overwrite .. option.
insert overwrite table tablename
select c1,c2,...,coalesce(columnname,0) as columnname
from tablename
Note that you have to specify all the other column names required in select.

updating a table using hive

Right now I run the following Hive query
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
The table test is large. I would like to optimize this query by running it once, and every other time updating the table by getting the last 2 days of data and adding it to the table.
so basically run the above query once and every time run it again but with the condition
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < DATE_SUB(CURRENT_DATE)
How do I update the table using hive to populate the the rows as mentioned in the condition above?

First, your queries would be quicker if the Hive table were partitioned based on date. Your create table statement isn't inserting into any partitions, therefore I suspect your table is not partitioned. It would also be quicker if the source data were Parquet/ORC
In any case, you can overwrite the table for a date range like so
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;

hive : select row with column having maximum value without join

writing hive query over a table to pick the row with maximum value in column
there is table with following data for example:
key value updated_at
1 "a" 1
1 "b" 2
1 "c" 3
the row which is updated last needs to be selected.
currently using following logic
select tab1.* from table_name tab1
join select tab2.key , max(tab2.updated_at) as max_updated from table_name tab2
on tab1.key=tab2.key and tab1.updated_at = tab2.max_updated;
Is there any other better way to perform this?

If it is true that updated_at is unique for that table, then the following is perhaps a simpler way of getting you what you are looking for:
-- I'm using Hive 0.13.0
SELECT * FROM table_name ORDER BY updated_at DESC LIMIT 1;
If it is possible for updated_at to be non-unique for some reason, you may need to adjust the ORDER BY logic to break any ties in the fashion you wish.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

how to make max function in hive query to ignore _HIVE_DEFAULT_PARTITION__ - hadoop

Related

Hive:- Select on partition column gives result even when table is truncated

Select max query returning all the rows in a table in Apache Hive

How to replace NULL values in one column to 0 (of a very large table) without creating a new column of the desired results added to the table in HIVE?

updating a table using hive

hive : select row with column having maximum value without join

Categories

Resources