How to delete rows with fewer than 20 repetitions in a Hive table - Hadoop

I am trying to delete user_ids that appear fewer than 20 times in a ratings table (ids with fewer than 20 votes mess up the prediction):
delete * FROM rating
WHERE COUNT(user_id) < 20;
Below is the error I got:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 3:6 Not yet supported place for UDAF 'COUNT'

There are two big problems:
1. Your query is wrong. To work properly you need to use the aggregate function count() together with GROUP BY on the user_id column.
2. You cannot delete records with a DELETE statement unless your table is transactional.
To delete records from a non-transactional table you need to use an INSERT OVERWRITE statement to overwrite the table with only the records you want to keep.
Syntax:
INSERT OVERWRITE TABLE <table_name> SELECT * FROM <table_name> WHERE <condition>
Your code should look like this:
INSERT overwrite TABLE rating
SELECT *
FROM rating
WHERE
user_id IN
(
SELECT user_id
FROM rating
GROUP BY(user_id)
HAVING count(user_id) >= 20
);

If you have a transactional table, then you can delete user_ids with a count of less than 20 using the following statement:
hive> delete from rating where user_id in
(select user_id from rating group by user_id having count(user_id) < 20);
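Before running either statement, it can help to sanity-check how many user_ids will survive the threshold. A quick check (a sketch, assuming the same rating table as above):

```sql
-- Count how many user_ids meet the 20-vote threshold before overwriting
SELECT count(*) AS kept_users
FROM (
  SELECT user_id
  FROM rating
  GROUP BY user_id
  HAVING count(user_id) >= 20
) t;
```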

Related

Delete data based on the count & timestamp using PL/SQL

I'm new to PL/SQL programming and I come from a DBA background. I have a requirement to delete data from both a main table and a reference table, following the logic below. We need to delete 30M rows from the tables, so we're reducing data based on the "State_ID" column.
The following conditions need to be considered:
1. As per the sample data given below (Main Table), sort the data by timestamp in descending order, leave the first 2 rows for each state_id, and delete the rest of the data from both tables based on the state_id column.
2. select state_id, count(*) from maintable group by state_id order by timestamp desc having count(*) > 2;
So if state_id=1 has 5 rows, then 3 rows have to be deleted, leaving the first 2 rows for state_id=1, and the same repeated for the other state_id values.
The same matching data should be deleted from the reference table as well.
Can someone please help me with this issue? Thanks.
(image: Main table sample data)
You should be able to do each table delete as a single SQL command. Anything else would essentially force row-by-row processing, which is the last thing you want for that much data. Something like this:
delete from main_table m
where m.row_id not in (
  with keep_me as (
    select row_id,
           row_number() over (partition by state_id
                              order by time_stamp desc) id_row_number
    from main_table)
  select row_id from keep_me where id_row_number < 3)
or
delete from main_table m
where m.row_id in (
  with delete_me as (
    select row_id,
           row_number() over (partition by state_id
                              order by time_stamp desc) id_row_number
    from main_table)
  select row_id from delete_me where id_row_number > 2)

Oracle | delete duplicate records

I have identified some duplicates in my table:
-- DUPLICATES: ----
select PPLP_NAME,
START_TIME,
END_TIME,
count(*)
from PPLP_LOAD_GENSTAT
group by PPLP_NAME,
START_TIME,
END_TIME
having count(*) > 1
-- DUPLICATES: ----
How is it possible to delete them?
Even if you don't have a primary key, each record has a unique ROWID associated with it.
The query below deletes only the records that don't have the maximum ROWID, by self-joining the table on the columns that cause the duplication. This makes sure you delete all the duplicates.
DELETE FROM PPLP_LOAD_GENSTAT plg_outer
WHERE ROWID NOT IN (
  select MAX(ROWID)
  from PPLP_LOAD_GENSTAT plg_inner
  WHERE plg_outer.pplp_name = plg_inner.pplp_name
  AND plg_outer.start_time = plg_inner.start_time
  AND plg_outer.end_time = plg_inner.end_time
);
I'd suggest something easier:
CREATE TABLE NewTable AS
SELECT DISTINCT pplp_name, start_time, end_time
FROM YourTable;
Then delete your table, and rename the new table.
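The drop-and-rename step can be done in two statements (a sketch, assuming the table names used above):

```sql
-- Replace the original table with the deduplicated copy
DROP TABLE YourTable;
ALTER TABLE NewTable RENAME TO YourTable;
```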
If you really want to delete records, you can find a few examples of how here.

Updating unique id column for newly added records in table in hive

I have a table to which I want a unique identifier to be added automatically as each new record is inserted. Assume the column for the unique identifier has already been created.
Hive can't update a table, but you can create a temporary table or overwrite your first table.
You can also use the concat function to join two different columns or strings.
Here are the examples:
function: concat(string A, string B...)
return: string
hive> select concat('abc','def','gh') from dual;
abcdefgh
HQL & result:
insert overwrite table stock select tradedate,concat('aa',tradetime),stockid ,buyprice,buysize ,sellprice,sellsize from stock;
20130726 aa094251 204001 6.6 152000 6.605 100
20130726 aa094106 204001 6.45 13400 6.46 100

How can we concatenate 2 hive tables without duplicates based on 1 column?

I have 2 tables with the same format: user_id, param1, param2, ...
I have to combine rows from both tables, but in a way that each user_id occurs only once. (If some user_id is in both tables, then use only 2nd table row for this user_id)
So far I tried to use:
SELECT tt.user_id, * FROM
(SELECT * from t2
UNION ALL
SELECT * from t1) as tt
group by tt.user_id
But it only outputs user_id field. Is there maybe a "first_occurance(attribute)" function for grouping that I could use like:
SELECT tt.user_id, first_occurance(tt.param1), first_occurance(tt.param2) FROM ...
Or is there a better way to do that?
PS. Tables have 1-3 million records.
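One way to do this (a sketch, not from the original thread; it assumes Hive with windowing support and that both tables share the same columns) is to tag each source, then rank rows per user_id so the t2 row wins whenever a user_id appears in both tables:

```sql
-- Prefer the t2 row when a user_id exists in both tables
SELECT user_id, param1, param2
FROM (
  SELECT user_id, param1, param2,
         row_number() OVER (PARTITION BY user_id ORDER BY src DESC) AS rn
  FROM (
    SELECT user_id, param1, param2, 1 AS src FROM t1
    UNION ALL
    SELECT user_id, param1, param2, 2 AS src FROM t2
  ) u
) ranked
WHERE rn = 1;
```

This scans each table once and avoids a join, which should scale fine for 1-3 million rows.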

When are records committed when inserting records from another table?

I have a SQL query like:
insert into dedupunclear
select mdnnumber,poivalue
from deduporiginal a
where exists (
select 1
from deduporiginal
where mdnnumber=a.mdnnumber and rowid<a.rowid)
or mdnnumber is null;
There are 500K records in my deduporiginal table. I have put this query inside a function, but it takes around 3 hours to commit the records to the dedupunclear table.
Is there any alternative to resolve the performance issue?
When does this query commit records: at some interval, or after getting all the results from the select query?
This is how I did it the other day:
delete from staging_table a
where rowid >
(select min(rowid) from staging_table b
where nvl(a.id, 'x') = nvl(b.id, 'x') )
Instead of an insert into a dedupe table, I just deleted the rows directly from the staging table. For a table with 1 million rows, this query worked pretty well. I was worried that the nvl function would kill the index, but it worked well enough.
