hive : select row with column having maximum value without join - hadoop

writing hive query over a table to pick the row with maximum value in column
there is table with following data for example:
key value updated_at
1 "a" 1
1 "b" 2
1 "c" 3
the row which is updated last needs to be selected.
currently using following logic
select tab1.* from table_name tab1
join select tab2.key , max(tab2.updated_at) as max_updated from table_name tab2
on tab1.key=tab2.key and tab1.updated_at = tab2.max_updated;
Is there any other better way to perform this?

If it is true that updated_at is unique for that table, then the following is perhaps a simpler way of getting you what you are looking for:
-- I'm using Hive 0.13.0
SELECT * FROM table_name ORDER BY updated_at DESC LIMIT 1;
If it is possible for updated_at to be non-unique for some reason, you may need to adjust the ORDER BY logic to break any ties in the fashion you wish.

Related

how to make max function in hive query to ignore _HIVE_DEFAULT_PARTITION__

I have a view which uses max to show the latest partition (which is of format 2021-01, 2021-02, 2021-03, 2021-04). The hive table has _HIVE_DEFAULT_PARTITION__ too.
When we run the query in Impala, max on partitions gives the correct value of 2021-04 ignoring _HIVE_DEFAULT_PARTITION__ but the same do not work when we run the query in Hive as it returns _HIVE_DEFAULT_PARTITION__
Is there any way to make Hive query ignore the default partition if exists while returning max on that column?
You can filter it:
select max(partition_col) from your_table where partition_col != "__HIVE_DEFAULT_PARTITION__"
If you do not need data in __HIVE_DEFAULT_PARTITION__, you can drop it:
ALTER TABLE your_table DROP PARTITION (partition_col='__HIVE_DEFAULT_PARTITION__');
Transforming __HIVE_DEFAULT_PARTITION__ to NULL can be a solution if with max(partition_col) you want to aggregate something else and do not want to excluse __HIVE_DEFAULT_PARTITION__ partition:
select max(case when partition_col = "__HIVE_DEFAULT_PARTITION__" then NULL else partition_col end) as max_partition_col,
--aggregate something else including HIVE_DEFAULT_PARTITION
from your_table

Delete data based on the count & timestamp using pl\sql

I'm new to PL\SQL programming and I'm from DBA background. I got one requirement to delete data from both main table and reference table but need to follow below logic while deleting data because we need to delete 30M of data from the tables so we're reducing data based on the "State_ID" column below.
Following conditions need to consider
1. As per sample data given below(Main Table), sort data based on timestamp with desc order and leave the first 2 rows of data for each "State_id" and delete rest of the data from the both tables based on "state_id" column.
2. select state_id,count() from maintable group by state_id order by timestamp desc Having count()>2;
So if state_id=1 has 5 rows then has to delete 3 rows of data by leaving first 2 rows for state_id=1 and repeat for other state_id values.
Also same matching data should be deleted from the reference table as well.
Please someone help me on this issue. Thanks.
enter image description here
Main table
You should be able to do each table delete as a single SQL command. Anything else would essentially force row-by-row processing, which is the last thing you want for that much data. Something like this:
delete from main_table m
where m.row_id not in (
with keep_me as (
select row_id,
row_number() over (partition by state_id
order by time_stamp desc) id_row_number
from main_table where id_row_number<3)
select row_id from keep_me)
or
delete from main_table m
where m.row_id in (
with delete_me as (
select row_id,
row_number() over (partition by state_id
order by time_stamp desc) id_row_number
from main_table where id_row_number>2)
select row_id from delete_me)

Using "contains" as a way to join tables?

The Primary key in table one is used as in table 2 but it is modified as so:
Primary key Column in table 1: 123abc
Column in table 2: 123abc_1
I.e. the key is used but then _1 is added to create a unique value in the column of Table 2.
Is there any way that I can join the two tables, the data in the 2 columns is not identical but it very similar. Could I do something like:
SELECT *
FROM TABLE1 INNER JOIN
TABLE2
ON TABLE1.COUMN1 contains TABLE2.COLUMN2;
I.e. checking that the value in Table 1 is within the value in Table 2?
You can check only the first part of column2; for example
SELECT *
FROM TABLE1 INNER JOIN TABLE2
ON INSTR(COLUMN2, COLUMN1) = 1
or
ON COLUMN2 LIKE COLUMN1 || '%'
However, keeping foreign key in such a way can be really dangerous, not to think about performance on large DBs.
You'd better use a different column in Table2 to store the key of Table 1, even adding a constraint.

Sorting a table with a column with SQL Server 2012

I have a table like this :
How can I sort it out in the form below :
Each set of records has been marked by a FlagID at end
Each set of records has been marked by a FlagID at end
I assume this means that each record is implicitly associated with the first non-null FlagID value that occurs at or after its position along the ID primary key. Thus, we can use a correlated subquery to project this implicit FlagID value for each record, sort by it, then sort by your Row column as tiebreaker for each set.
SELECT *
FROM YourTable T1
ORDER BY
(
SELECT TOP 1 T2.FlagID
FROM YourTable T2
WHERE T2.ID <= T1.ID
AND T2.FlagID IS NOT NULL
ORDER BY T2.ID DESC
),
T1.Row
However, if you're able to alter the database content, I would recommend you to explicitly populate all the FlagID fields, as this would make your life easier. If you do so, then the query becomes:
SELECT *
FROM YourTable T1
ORDER BY T1.FlagID,
T1.Row

How to populate column data in one table based on an id match from another table (using the data from the matched table)

In Oracle I have two tables. Both are populated with data and have a timestamp column.
One table has that column filled with data, the other table doesn't. But I need to get the data from the table that does into the column of the other table based on a match of another column.
Each has 'code' so the timestamp from the one table should only be put in the timestamp of the other table where the codes match.
I've tried cursors etc, but it seems like I'm missing something.
Any suggestions?
It sounds like you want a correlated update. This will update every row of destinationTable with the timestamp_col from sourceTable where there is a match on the code column.
UPDATE destinationTable d
SET timestamp_col = (SELECT s.timestamp_col
FROM sourceTable s
WHERE s.code = d.code )
WHERE EXISTS( SELECT 1
FROM sourceTable s
WHERE s.code = d.code )
Just for completeness, I think I would have done something like this (untested), if the destination has a primary key or unique column:
UPDATE (
SELECT d.primary_key AS pk, d.timestamp_col AS dest_col, MAX(s.timestamp_col) AS src_col
FROM dest d
JOIN src s ON d.code = s.code
GROUP BY pk, dest_col
) j
SET j.dest_col = j.src_col

Resources