How can we concatenate 2 hive tables without duplicates based on 1 column? - hadoop

I have 2 tables with the same format: user_id, param1, param2, ...
I have to combine rows from both tables so that each user_id occurs only once. (If a user_id is in both tables, only the row from the 2nd table should be used for that user_id.)
So far I have tried:
SELECT tt.user_id, * FROM
(SELECT * from t2
UNION ALL
SELECT * from t1) as tt
group by tt.user_id
But it only outputs the user_id field. Is there perhaps a "first_occurance(attribute)" function for grouping that I could use like:
SELECT tt.user_id, first_occurance(tt.param1), first_occurance(tt.param2) FROM ...
Or is there a better way to do that?
PS. Tables have 1-3 million records.
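
One possible approach, offered as a sketch rather than as an answer from this thread: keep every row of t2, then append only those t1 rows whose user_id has no match in t2 (a LEFT OUTER JOIN used as an anti-join). The SQL is the part of interest; Python's sqlite3 module is only a stand-in engine so the example is self-contained, and it is not Hive. The function name merge_prefer_t2 and the sample rows are hypothetical, and the query assumes user_id is unique within each table.

```python
import sqlite3

# Hypothetical sketch: all t2 rows, plus the t1 rows that have no t2 counterpart.
MERGE_SQL = """
SELECT user_id, param1, param2 FROM t2
UNION ALL
SELECT t1.user_id, t1.param1, t1.param2
FROM t1 LEFT OUTER JOIN t2 ON t1.user_id = t2.user_id
WHERE t2.user_id IS NULL
"""

def merge_prefer_t2(t1_rows, t2_rows):
    """Load sample rows into an in-memory SQLite database and run MERGE_SQL."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("t1", t1_rows), ("t2", t2_rows)):
        conn.execute(f"CREATE TABLE {name} (user_id INTEGER, param1 TEXT, param2 TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    result = sorted(conn.execute(MERGE_SQL).fetchall())
    conn.close()
    return result

# Made-up sample data: user 2 exists in both tables, so the t2 row should win.
t1 = [(1, "a1", "b1"), (2, "a2_old", "b2_old")]
t2 = [(2, "a2_new", "b2_new"), (3, "a3", "b3")]
merged = merge_prefer_t2(t1, t2)
```

If user_id can repeat within a single table, this query would keep those repeats; ranking rows per user_id with a window function such as ROW_NUMBER() and keeping the first row is a common alternative in engines that support it.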

Related

Spring JPA: Need to join two tables with dynamic clauses

Requirement is as below:
Table 1 - Order(ID, orderId, sequence, reference, valueDate, date)
Table 2 - Audit(AuditId, orderId, sequence, status, updatedBy, lastUpdatedDateTime)
For each record in ORDER table, there could be one or many rows in AUDIT table.
The user is given a search form where they can enter any of the search params, including status from the AUDIT table.
I need a way to join these two tables with dynamic where clause and get the top row for a given ORDER from AUDIT table.
Something like below:
select * from
ORDER t1, AUDIT t2
where t1.orderid = t2.orderId
and t1.sequence = t2.sequence
and t2.status = <userProvidedStatusIfAny>
and t2.lastUpdatedDateTime in (select max(t3.lastUpdatedDateTime) -- getting latest record
from AUDIT t3 where t1.orderid = t3.orderId
and t1.sequence = t3.sequence)
and t1.valueDate = <userProvidedDataIfAny> ..... so on
- I can't do this with @Query because I have to build a dynamic where clause
- @OneToMany fetches all the records in the AUDIT table for the given orderId and sequence
- @Formula in the ORDER table works, but only for one column in the select; I need multiple values from the AUDIT table
Please help

Deleting data from one table using data from a second table

I have a table, table1, with a million rows, and a table with an identical structure, table2, that has only 106 rows. How can I delete these 106 rows from table1?
In these two tables I have fields like id, date, param0, param1, param2.
Presuming that uniqueness is enforced through the ID column in both tables, then:
delete from table1 a
where exists (select null
from table2 b
where b.id = a.id
);
Otherwise, add some more columns (into the where clause) which will help you delete only rows you really want.
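
A small runnable sketch of the same DELETE ... WHERE EXISTS pattern, assuming (as the answer does) that id identifies a row in both tables. It uses Python's sqlite3 module as a stand-in database, so the outer table is referenced by name instead of the alias a; the helper name delete_matching and the sample rows are hypothetical.

```python
import sqlite3

def delete_matching(conn):
    """Delete table1 rows whose id also exists in table2; return the number removed."""
    cur = conn.execute(
        "DELETE FROM table1 "
        "WHERE EXISTS (SELECT NULL FROM table2 b WHERE b.id = table1.id)"
    )
    conn.commit()
    return cur.rowcount

# Made-up sample data: five rows in table1, two of them also present in table2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, param0 TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY, param0 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(i, f"p{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(2, "p2"), (4, "p4")])

removed = delete_matching(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
```

Running the same EXISTS predicate in a SELECT COUNT(*) first is a cheap way to confirm that exactly the expected number of rows will be removed.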

Using "contains" as a way to join tables?

The primary key of table 1 is reused in table 2, but modified like so:
Primary key Column in table 1: 123abc
Column in table 2: 123abc_1
I.e. the key is used but then _1 is added to create a unique value in the column of Table 2.
Is there any way that I can join the two tables? The data in the 2 columns is not identical, but it is very similar. Could I do something like:
SELECT *
FROM TABLE1 INNER JOIN
TABLE2
ON TABLE1.COLUMN1 contains TABLE2.COLUMN2;
I.e. checking that the value in Table 1 is within the value in Table 2?
You can check only the first part of column2; for example
SELECT *
FROM TABLE1 INNER JOIN TABLE2
ON INSTR(COLUMN2, COLUMN1) = 1
or
ON COLUMN2 LIKE COLUMN1 || '%'
However, keeping a foreign key in such a way can be really dangerous, not to mention the performance on large DBs.
You'd be better off using a separate column in Table2 to store the key of Table 1, ideally with a foreign key constraint.
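
To illustrate why a bare prefix match can be risky, here is a sketch using Python's sqlite3 module as a stand-in database (SQLite's INSTR takes the same argument order as used above). The sample keys are made up; the second query is an assumption, not part of the answer: it requires the key to be followed directly by '_', as in the 123abc_1 example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE table2 (column2 TEXT PRIMARY KEY)")
# Hypothetical keys: '123' happens to be a prefix of '123abc'.
conn.executemany("INSERT INTO table1 VALUES (?)", [("123abc",), ("123",)])
conn.executemany("INSERT INTO table2 VALUES (?)", [("123abc_1",), ("123abc_2",), ("123_1",)])

JOIN = "SELECT t1.column1, t2.column2 FROM table1 t1 INNER JOIN table2 t2 ON "

# Plain prefix test from the answer: column2 starts with column1.
loose = sorted(conn.execute(JOIN + "INSTR(t2.column2, t1.column1) = 1").fetchall())

# Stricter variant (assumption): column2 starts with column1 followed by '_'.
strict = sorted(conn.execute(JOIN + "INSTR(t2.column2, t1.column1 || '_') = 1").fetchall())
```

With these rows the plain prefix test also pairs the key 123 with 123abc_1 and 123abc_2, while the stricter test does not. Even the stricter form can mismatch if a key itself contains an underscore, which is one more argument for a dedicated key column.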

Distinct Column in Hive

I am trying to get a query result in HiveQL with one column as distinct. However, the results do not match. There are almost 20 columns in the table.
create table uniq_us row format delimited fields terminated by ',' lines terminated by '\n' as select distinct(a),b,c,d,e,f,g,h,i,j from ctry_us_join;
The resulting number of rows: 513238
select count(distinct a) from ctry_us_join;
The resulting number of rows: 151616
How is this possible, and is something wrong in my first or second query?
You need to use a subselect with a GROUP BY clause.
select count(a) from (
select a, count(*) from ctry_us_join group by a) b
This is just one solution for this.
Distinct is a keyword, not a function. It applies to all columns you list in your select clause. It is quite reasonable that your table has only 151,616 distinct values in the column a, but multiple rows with the same value in the column a have different values in other columns. That might give you 513,238 distinct rows.
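
The difference can be reproduced on a tiny table. This is a sketch using Python's sqlite3 module as a stand-in for Hive, with a made-up two-column version of ctry_us_join; picking MIN(b) as the representative value per a is an arbitrary choice for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ctry_us_join (a TEXT, b INTEGER)")
conn.executemany(
    "INSERT INTO ctry_us_join VALUES (?, ?)",
    [("x", 1), ("x", 2), ("x", 2), ("y", 1)],
)

# DISTINCT (a), b is parsed as DISTINCT applied to the whole (a, b) row.
distinct_rows = conn.execute(
    "SELECT DISTINCT (a), b FROM ctry_us_join ORDER BY a, b"
).fetchall()

# Counts distinct values of column a only.
distinct_a = conn.execute("SELECT COUNT(DISTINCT a) FROM ctry_us_join").fetchone()[0]

# One row per value of a: every other column needs an aggregate.
one_per_a = conn.execute(
    "SELECT a, MIN(b) FROM ctry_us_join GROUP BY a ORDER BY a"
).fetchall()
```

Here the DISTINCT query returns three rows while COUNT(DISTINCT a) is 2, which is the same kind of gap as 513,238 versus 151,616 in the question.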

oracle find difference between 2 tables

I have 2 tables with the same structure. One is a temp table and the other is a prod table. The entire data set gets loaded each time, and sometimes records that were in a prior data set have been deleted from the new one. I load the data set into the temp table first, and if any records were deleted I want to delete them from the prod table as well.
So how can I find the records that exist in prod but not in temp? I tried outer join but it doesn't seem to be working. It's returning all the records from the table in the left or right depending on doing left or right outer join.
I then also want to delete those records in the prod table.
One way would be to use the MINUS operator
SELECT * FROM table1
MINUS
SELECT * FROM table2
will show all the rows in table1 that do not have an exact match in table2 (you can obviously specify a smaller column list if you are only interested in determining whether a particular key exists in both tables).
Another would be to use a NOT EXISTS
SELECT *
FROM table1 t1
WHERE NOT EXISTS( SELECT 1
FROM table2 t2
WHERE t1.some_key = t2.some_key )
How about something like:
SELECT * FROM ProdTable WHERE ID NOT IN
(select ID from TempTable);
It'd work the same as a DELETE statement as well:
DELETE FROM ProdTable WHERE ID NOT IN
(select ID from TempTable);
MINUS can work here
The following statement combines results with the MINUS operator, which returns only rows returned by the first query but not by the second:
SELECT * FROM prod
MINUS
SELECT * FROM temp;
MINUS will only work if both tables have the same structure.
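
One caveat worth checking before using the NOT IN form: if the subquery can return a NULL, NOT IN matches nothing, whereas NOT EXISTS still finds the missing rows. The sketch below shows this with Python's sqlite3 module as a stand-in for Oracle, using the ProdTable and TempTable names from above; the val column and the sample rows, including the NULL ID, are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProdTable (ID INTEGER, val TEXT)")
conn.execute("CREATE TABLE TempTable (ID INTEGER, val TEXT)")
conn.executemany("INSERT INTO ProdTable VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
# Hypothetical data: one TempTable row has a NULL ID.
conn.executemany("INSERT INTO TempTable VALUES (?, ?)", [(1, "a"), (2, "b"), (None, "x")])

# NOT IN compares against NULL, so no ProdTable row qualifies.
not_in = conn.execute(
    "SELECT ID FROM ProdTable WHERE ID NOT IN (SELECT ID FROM TempTable)"
).fetchall()

# NOT EXISTS is unaffected by the NULL and finds the rows missing from TempTable.
not_exists = conn.execute(
    "SELECT ID FROM ProdTable p "
    "WHERE NOT EXISTS (SELECT 1 FROM TempTable t WHERE t.ID = p.ID) ORDER BY ID"
).fetchall()

# The same predicate drives the delete from the prod table.
conn.execute(
    "DELETE FROM ProdTable "
    "WHERE NOT EXISTS (SELECT 1 FROM TempTable t WHERE t.ID = ProdTable.ID)"
)
remaining = [row[0] for row in conn.execute("SELECT ID FROM ProdTable ORDER BY ID")]
```

If ID is declared NOT NULL in both tables, the two forms return the same rows and the choice is mostly a matter of taste.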
