I am trying to learn about deleting duplicate records from a Hive table.
My Hive table: 'dynpart' with columns: Id, Name, Technology
Id Name Technology
1 Abcd Hadoop
2 Efgh Java
3 Ijkl MainFrames
2 Efgh Java
We have options like DISTINCT to use in a SELECT query, but a SELECT query only retrieves data from the table. Could anyone tell me how to use a DELETE query to remove the duplicate rows from a Hive table?
I know that deleting/updating records in Hive is neither recommended nor standard, but I want to learn how it is done.
You can use an INSERT OVERWRITE statement to rewrite the table without duplicates:
insert overwrite table dynpart select distinct * from dynpart;
In case your table has duplicate rows only on a few selected columns, suppose you have a table structure as shown below:
id Name Technology
1 Abcd Hadoop
2 Efgh Java --> Duplicate
3 Ijkl Mainframe
2 Efgh Python --> Duplicate
Here the Id and Name columns have duplicate values.
You can use an analytic function to find the duplicate rows:
select * from
  (select Id, Name, Technology,
          row_number() over (partition by Id, Name order by Id desc) as row_num
   from yourtable) tab
where row_num > 1;
This will give you output as:
id Name Technology row_num
2 Efgh Python 2
When you need to get both of the duplicate rows:
select * from
  (select Id, Name, Technology,
          count(*) over (partition by Id, Name) as duplicate_count
   from yourtable) tab
where duplicate_count > 1;
Output as:
id Name Technology duplicate_count
2 Efgh Java 2
2 Efgh Python 2
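To actually remove the partial duplicates rather than just list them, the same row_number() ranking can drive an INSERT OVERWRITE that keeps only the first row per (Id, Name). This is a sketch, assuming you are fine with keeping an arbitrary one of the duplicates:

```sql
insert overwrite table yourtable
select Id, Name, Technology
from
  (select Id, Name, Technology,
          row_number() over (partition by Id, Name order by Id desc) as row_num
   from yourtable) tab
where row_num = 1;   -- keep one row per (Id, Name), drop the rest
```

If you need a deterministic survivor (e.g. the latest record), change the ORDER BY inside the window to a column that defines "latest".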
Alternatively, you can insert the distinct records into some other table:
create table temp as select distinct * from dynpart;
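If the goal is to end up with a deduplicated dynpart under the original name, a common follow-up is to swap the tables. A sketch (verify the row counts in the new table before dropping anything):

```sql
-- build a deduplicated copy
create table dynpart_dedup as select distinct * from dynpart;

-- after verifying dynpart_dedup, swap it in
drop table dynpart;
alter table dynpart_dedup rename to dynpart;
```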
Let's say I have a table1 in schema1 like this:
Stu_ID  Math
1       A
2       B
3       B+
Now, I want to add a new column, for instance, Literature, into table1 in schema1.
ALTER TABLE schema1.table1 ADD COLUMNS (Literature STRING);
Table1 now looks like:
Stu_ID  Math  Literature
1       A     NULL
2       B     NULL
3       B+    NULL
I want to load data from table2 in schema2 based on the matching Stu_ID. Is there a way to do so? I have thought of UPDATE, but to my understanding Impala only supports updating Kudu tables. Please correct me if I'm wrong.
Instead of UPDATE you can use INSERT OVERWRITE:
insert overwrite table schema1.table1
select
  t1.stu_id, t1.Math, t2.Literature
from schema1.table1 t1
join schema2.table2 t2 on t1.stu_id = t2.stu_id
This will replace the whole contents of table1 with the old data plus the new column.
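One caveat with the inner join above: any student with no matching row in table2 is dropped entirely from table1. If you need to keep all existing rows and leave Literature as NULL where there is no match, a LEFT JOIN sketch (same table names as above) would look like:

```sql
insert overwrite table schema1.table1
select t1.stu_id, t1.Math, t2.Literature
from schema1.table1 t1
left join schema2.table2 t2 on t1.stu_id = t2.stu_id;
```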
I have 2 tables with the same format: user_id, param1, param2, ...
I have to combine rows from both tables, but in a way that each user_id occurs only once. (If some user_id is in both tables, then use only the row from the 2nd table for this user_id.)
So far I tried to use:
SELECT tt.user_id, * FROM
  (SELECT * FROM t2
   UNION ALL
   SELECT * FROM t1) AS tt
GROUP BY tt.user_id
But it only outputs the user_id field. Is there maybe a "first_occurrence(attribute)" function for grouping that I could use, like:
SELECT tt.user_id, first_occurrence(tt.param1), first_occurrence(tt.param2) FROM ...
Or is there a better way to do that?
PS. Tables have 1-3 million records.
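The "prefer table 2" rule described above can be expressed with a source-priority column and row_number(), so no special first-occurrence aggregate is needed. A sketch, assuming both tables share the columns user_id, param1, param2:

```sql
SELECT user_id, param1, param2
FROM (
  SELECT user_id, param1, param2,
         -- rank rows per user_id; src=1 (t2) sorts before src=2 (t1)
         row_number() OVER (PARTITION BY user_id ORDER BY src) AS rn
  FROM (
    SELECT user_id, param1, param2, 1 AS src FROM t2
    UNION ALL
    SELECT user_id, param1, param2, 2 AS src FROM t1
  ) u
) ranked
WHERE rn = 1;
```

Each user_id then appears exactly once, taken from t2 when it exists there and from t1 otherwise.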
I need your expert suggestion. I am working on a web project (using JSP and Oracle) with multiple database tables, organized by category, in which most columns match across tables. I want to build a search feature over these tables that searches only the matching columns (those that exist in all tables). My idea was to create a view (a union of all tables) and then search the view, but I think this will degrade performance, since these tables are partitioned by state and city and hold huge amounts of data.
Example:
Table A: Col1, Col2, Col3
Table B: Col1, Col2, Col3, Col4
Table C: Col1, Col2, Col3, Col5
I just want to perform a search on Col1, Col2 and Col3 (the columns that exist in all tables).
Is there any other way to build the search that also optimizes performance?
Please help.
WITH table_a AS
  (SELECT col1, col3 FROM table1),
table_b AS
  (SELECT col1, col3 FROM table2)
SELECT col1, col3 FROM table_a
UNION ALL
SELECT col1, col3 FROM table_b
Just a suggestion
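If the search needs to run repeatedly, the UNION ALL can be wrapped in a view over the shared columns, so the application queries one object. A sketch, assuming the tables are named table_a, table_b and table_c as in the example; adding a literal source column lets you tell which table each hit came from:

```sql
CREATE OR REPLACE VIEW common_search AS
SELECT Col1, Col2, Col3, 'A' AS source_table FROM table_a
UNION ALL
SELECT Col1, Col2, Col3, 'B' FROM table_b
UNION ALL
SELECT Col1, Col2, Col3, 'C' FROM table_c;

-- search all three tables through the view
SELECT * FROM common_search WHERE Col1 = :search_term;
```

Performance-wise, Oracle can still push predicates into each branch of a UNION ALL view, so indexes and partition pruning on the base tables remain usable.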
Writing a Hive query over a table to pick the row with the maximum value in a column
There is a table with the following data, for example:
key value updated_at
1 "a" 1
1 "b" 2
1 "c" 3
The row which was updated last needs to be selected.
Currently I'm using the following logic:
select tab1.*
from table_name tab1
join (select key, max(updated_at) as max_updated
      from table_name
      group by key) tab2
  on tab1.key = tab2.key
 and tab1.updated_at = tab2.max_updated;
Is there any other better way to perform this?
If it is true that updated_at is unique for that table, then the following is perhaps a simpler way of getting you what you are looking for:
-- I'm using Hive 0.13.0
SELECT * FROM table_name ORDER BY updated_at DESC LIMIT 1;
If it is possible for updated_at to be non-unique for some reason, you may need to adjust the ORDER BY logic to break any ties in the fashion you wish.
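Note that ORDER BY ... LIMIT 1 returns only one row overall, which works for the single-key sample data. If the table holds many keys and you want the latest row per key, a windowed variant avoids the self-join (assumes Hive 0.11+ for row_number()):

```sql
SELECT key, value, updated_at
FROM (
  SELECT key, value, updated_at,
         -- rank each key's rows from newest to oldest
         row_number() OVER (PARTITION BY key ORDER BY updated_at DESC) AS rn
  FROM table_name
) t
WHERE rn = 1;
```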
I'd like to concatenate values from 2 selected columns and use a result as table name for another select statement:
select a.ColumnA,
       a.ColumnB,
       b.ColumnG,
       (a.ColumnA || '.' || a.ColumnB) "TABLENAME",
       (select t.ColumnX from TABLENAME t where t.ColumnY = 'whatever') "GOAL"
from
  table a,
  table b
where
....
So assuming that
table a:
ColumnA ColumnB ColumnC ...
dev town 15
table b:
ColumnF ColumnG ColumnH ...
aaa bbb ccc
Somewhere there exists a table town in schema dev that can be queried using the name dev.town:
table dev.town:
ColumnX ColumnY ColumnZ ...
Joe whatever Mr
So "my query" returns
ColumnA ColumnB ColumnG TABLENAME GOAL
--------------------------------------
dev town bbb dev.town Joe
Is there a way to get the results I need?
Thanks.
Not in a plain SQL statement in Oracle.
If you dive into PL/SQL, then you can use an EXECUTE IMMEDIATE statement to run dynamically generated SQL against the required table.
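A minimal PL/SQL sketch of the idea, assuming a base table named table_a holding the ColumnA/ColumnB values from the example (the names are illustrative, not from a real schema); the table name is concatenated into the statement text, while the filter value is passed as a bind variable:

```sql
DECLARE
  v_table VARCHAR2(61);   -- "schema.table", e.g. dev.town
  v_goal  VARCHAR2(100);
BEGIN
  -- build the table name from the two columns
  SELECT a.ColumnA || '.' || a.ColumnB
    INTO v_table
    FROM table_a a
   WHERE ROWNUM = 1;

  -- query the dynamically named table; :y is a bind variable
  EXECUTE IMMEDIATE
    'SELECT t.ColumnX FROM ' || v_table || ' t WHERE t.ColumnY = :y'
    INTO v_goal
    USING 'whatever';

  DBMS_OUTPUT.PUT_LINE(v_goal);
END;
/
```

Only the filter value can be bound; identifiers such as the table name must be concatenated, so in real code the assembled name should be validated (e.g. with DBMS_ASSERT) to avoid SQL injection.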