Update Column of a Hive Table without using Sub query - hadoop

This is a question about updating a new column in a Hive table. Since I believe Hive does not allow updating a column of an existing table using subqueries, I wanted to ask what the best way would be to achieve the following update operation.
I have the following two example tables:
Table A:
KeyId ValId Val
W1 V1 10
W2 V2 20
Table B:
KeyId ValId Val
W1 V1 10
W1 V1 30
W1 V3 40
W1 V4 50
W2 V2 0
W2 V2 50
W2 V2 70
W2 V4 80
I want to create another column in Table A, let's say avgVal, that for the KeyId and ValId in each row of Table A holds the average of Val over the rows with the corresponding KeyId and ValId in Table B. Thus, my final output table should look like:
Updated Table A:
KeyId ValId Val avgVal
W1 V1 10 20
W2 V2 20 40
Please let me know if the question is not clear.

It seems you are trying to get aggregate values in Table A from Table B. In that case you cannot have the "Val" column in Table A, because after aggregation, which Val from Table B would you expect in Table A?
Assuming that was a genuine mistake and you remove the "Val" column from Table A, your insert statement for Table A should look like this:
INSERT INTO TABLE table_a
SELECT keyid, valid, AVG(val)
FROM table_b
GROUP BY keyid, valid;

You can use the query below to get the average of the data in Table_B corresponding to each row in Table_A:
SELECT t1.keyid, t1.valid, t1.val, avgval
FROM table_a t1
LEFT JOIN (
    SELECT keyid AS k, valid AS v, AVG(val) AS avgval
    FROM table_b
    GROUP BY keyid, valid
) temp
  ON k = t1.keyid AND t1.valid = v;
You have to check whether table_A is updatable before changing its schema; otherwise, you can create another table and load the data into it.
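If table_A cannot be altered in place, a common Hive pattern is to materialize the result into a new table with CREATE TABLE AS SELECT and swap it in. A minimal sketch under the table and column names above; table_a_new is a hypothetical name:
-- Build a new table carrying the extra avgVal column,
-- then rename it over table_a once verified.
CREATE TABLE table_a_new AS
SELECT t1.keyid,
       t1.valid,
       t1.val,
       agg.avgval                     -- average of table_b's val per (keyid, valid)
FROM table_a t1
LEFT JOIN (
    SELECT keyid, valid, AVG(val) AS avgval
    FROM table_b
    GROUP BY keyid, valid
) agg
  ON agg.keyid = t1.keyid
 AND agg.valid = t1.valid;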

Related

I/O issue with PowerCenter Informatica in Oracle

I have two tables in Oracle and I have to synchronize values (the Field column) between the tables. I'm using Informatica PowerCenter for this synchronization. The source qualifier query causes high I/O usage and I need to solve that.
Table1
Table1 has about 20M rows. Field in Table1 holds the actual value. The Timestamp field holds the create & update date, and the table has daily partitions.
Id  Field  Timestamp
1   A      2017-05-12 03:13:40
2   B      2002-11-01 07:30:46
3   C      2008-03-03 03:26:29
Table2
Table2 has about 500M rows. Field in Table2 should be kept as closely in sync as possible with Field in Table1. The Timestamp field holds the create & update date, and the table has daily partitions. Table2 is also the target in the mapping.
Id   Table1_Id  Field  Timestamp            Action
100  1          A      2005-09-30 03:20:41  Nothing
101  1          B      2015-06-29 09:41:44  Update Field as A
102  1          C      2016-01-10 23:35:49  Update Field as A
103  2          A      2019-05-08 07:42:46  Update Field as B
104  2          B      2003-06-02 11:23:57  Nothing
105  2          C      2021-09-21 12:04:24  Update Field as B
106  3          A      2022-01-23 01:17:18  Update Field as C
107  3          B      2008-04-24 15:17:25  Update Field as C
108  3          C      2010-01-15 07:20:13  Nothing
Mapping Queries
Source Qualifier Query
SELECT *
FROM Table1 t1, Table2 t2
WHERE t1.Id = t2.Table1_Id AND t1.Field <> t2.Field
Update Transformation Query
UPDATE Table2
SET
Field = :tu.Field,
Timestamp = SYSDATE
WHERE Id = :tu.Id
You can use the approach below.
SQ - Your SQL is correct and you can keep it if it works, but add a condition on the partition-date key column. You can also use this SQL to speed it up:
SELECT *
FROM Table2 t2
INNER JOIN Table1 t3
  ON t3.Id = t2.Table1_Id
LEFT OUTER JOIN Table1 t1
  ON t1.Id = t2.Table1_Id
 AND t1.Field = t2.Field
 AND t1.partition_date = t2.partition_date -- you did not mention a partition_date column, but I am assuming a separate column is used for partitioning
WHERE t1.Id IS NULL -- an anti-join; <> is inefficient
Then, in your Informatica target T2 definition, make sure you include partition_date as part of the key along with Id.
Then use an Update Strategy set to DD_UPDATE. You can set the session to update as well.
And remove that target override: it applies the update query to the whole table and can be inefficient and I/O intensive.
Informatica can update data in batches through the Update Strategy; you can increase the commit interval to tune performance.
You shouldn't try to update a 500M-row table in a single go using SQL. You can, however, use PL/SQL to update in batches.
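For illustration, a minimal sketch of such a batched PL/SQL update, assuming the table and column names above; the 50,000-row batch size and the cursor shape are illustrative choices, not part of the original setup:
DECLARE
  CURSOR c_diff IS
    SELECT t2.ROWID AS rid, t1.Field AS new_field
    FROM Table2 t2
    JOIN Table1 t1 ON t1.Id = t2.Table1_Id
    WHERE t1.Field <> t2.Field;
  TYPE t_rid_tab IS TABLE OF ROWID;
  TYPE t_fld_tab IS TABLE OF Table1.Field%TYPE;
  l_rids t_rid_tab;
  l_flds t_fld_tab;
BEGIN
  OPEN c_diff;
  LOOP
    FETCH c_diff BULK COLLECT INTO l_rids, l_flds LIMIT 50000;  -- batch size
    EXIT WHEN l_rids.COUNT = 0;
    FORALL i IN 1 .. l_rids.COUNT
      UPDATE Table2
         SET Field = l_flds(i),
             Timestamp = SYSDATE
       WHERE ROWID = l_rids(i);
    COMMIT;  -- commit per batch to keep undo/redo bounded
    -- note: fetching across commits risks ORA-01555 on a busy table
  END LOOP;
  CLOSE c_diff;
END;
/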

ClickHouse - Latest Record

We have almost 1B records in a ReplicatedMergeTree table.
The primary key is (a, b, c).
Our App keeps writing into this table with every user action. (we accumulate almost a million records per hour)
We append (store) the latest timestamp (updated_at) for a given unique combination of (a,b)
The key requirement is to provide a roll-up against the latest timestamp for a given combination of a,b,c
Currently, we are processing the queries as
select a,b,c, sum(x), sum(y)...etc
from table_1
where (a,b,updated_at) in (select a,b,max(updated_at) from table_1 group by a,b)
and c in (...)
group by a,b,c
clarification on the sub-query
(select a,b,max(updated_at) from table_1 group by a,b)
^ This part is for illustration only. Our app writes the latest updated_at for every (a, b), meaning the clause shown above is effectively
(select a,b,updated_at from tab_1_summary)
[where tab_1_summary holds the latest record for a given (a, b)]
Note: We have to keep the grouping criteria as-is.
The table is structured with PARTITION BY (c) and ORDER BY (a, b, updated_at).
The question is: is there a way to write a better query, one that returns results faster? We are required to shave a few seconds off the overall processing.
FYI: We toyed with a materialized view over ReplicatedReplacingMergeTree, but given the size of this table and the constant inserts, the FINAL clause doesn't necessarily perform well compared to the query above.
Thanks in advance!
Just as a test, try using a join instead of tuple IN (tuples):
select t.a, t.b, t.c, sum(x), sum(y)...etc
from table_1 AS t inner join tab_1_summary using (a, b, updated_at)
where c in (...)
group by t.a, t.b, t.c
Consider using AggregatingMergeTree to pre-calculate result metrics:
CREATE MATERIALIZED VIEW table_1_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(updated_at)
ORDER BY (updated_at, a, b, c)
AS SELECT
    updated_at,
    a, b, c,
    sum(x) AS x, /* see the SimpleAggregateFunction data type: https://clickhouse.tech/docs/en/sql-reference/data-types/simpleaggregatefunction/ */
    sum(y) AS y
    /* for non-simple aggregate functions, use the AggregateFunction data type: https://clickhouse.tech/docs/en/sql-reference/data-types/aggregatefunction/ */
    -- etc.
FROM table_1
GROUP BY updated_at, a, b, c;
And use this way to get result:
select a,b,c, sum(x), sum(y)...etc
from table_1_mv
where (updated_at,a,b) in (select updated_at,a,b from tab_1_summary)
and c in (...)
group by a,b,c
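If SimpleAggregateFunction columns don't fit the metrics involved, an alternative is an explicit target table with -State/-Merge combinators. A sketch under the same names; the column types (String, UInt64) are assumptions for illustration:
CREATE TABLE table_1_agg
(
    updated_at DateTime,
    a String,
    b String,
    c String,
    x_state AggregateFunction(sum, UInt64),
    y_state AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree()  -- use the Replicated* variant on a replicated cluster
PARTITION BY toYYYYMM(updated_at)
ORDER BY (updated_at, a, b, c);

CREATE MATERIALIZED VIEW table_1_agg_mv TO table_1_agg
AS SELECT
    updated_at, a, b, c,
    sumState(x) AS x_state,
    sumState(y) AS y_state
FROM table_1
GROUP BY updated_at, a, b, c;

-- querying: -Merge finalizes the partial aggregation states
SELECT a, b, c, sumMerge(x_state), sumMerge(y_state)
FROM table_1_agg
WHERE (updated_at, a, b) IN (SELECT updated_at, a, b FROM tab_1_summary)
  AND c IN (...)
GROUP BY a, b, c;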

Deleting duplicate data on Oracle using SQL failed

I have a table abc as:
acc subgroup
720V A
720V A
720V A
720V A
111 C
222 D
333 E
My expected output is:
acc subgroup
720V A
111 C
222 D
333 E
Since 720V A is duplicated, I want to delete the three duplicate rows and keep only one such row in my table.
So I tried:
DELETE FROM (
    SELECT t.*, ROWNUM rn FROM abc t WHERE acc = '720V'
) WHERE rn > 1;
And I get this error:
ORA-01732: data manipulation operation not legal on this view
How can I get my expected output?
Your table seems to be lacking a primary key column, which is a big problem here. Assuming there actually is a primary key column pk, we can try using ROW_NUMBER to identify any "duplicates":
DELETE
FROM abc
WHERE pk IN (
    SELECT pk
    FROM (
        SELECT t.pk,
               ROW_NUMBER() OVER (PARTITION BY acc, subgroup ORDER BY pk) rn
        FROM abc t
    ) x
    WHERE rn > 1
);
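If there really is no primary key, Oracle's ROWID pseudocolumn can serve the same purpose; a minimal sketch of that variant:
-- Keep one arbitrary row per (acc, subgroup) and delete the rest
DELETE FROM abc
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM abc
    GROUP BY acc, subgroup
);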
Note that if you can live with keeping your original data, then the most expedient thing to do might be to create a distinct view:
CREATE VIEW abc_view AS
SELECT DISTINCT acc, subgroup
FROM abc;

What is the most efficient way to update values of a table based on a mapping from another table

I have a table including the following details.
empID department location segment
1 23 55 12
2 23 11 12
3 25 11 39
I also have a mapping table like following
Field old value new value
Department 23 74
department 25 75
segment 10 24
location 11 22
So my task is to replace the old values with the new values. I could use a cursor and update departments first, then segments, and so on, but that is time-consuming and inefficient. I would like to know if there is a more efficient way to do this, one that will also hold up if we plan to add more columns to the mapping in the future.
Cheers.
Check if this solves the issue:
UPDATE emp e SET department = (SELECT m.new_value FROM map m WHERE m.field = 'DEPARTMENT' AND m.old_value = e.department)
WHERE EXISTS (SELECT 1 FROM map m WHERE m.field = 'DEPARTMENT' AND m.old_value = e.department);
Note the WHERE EXISTS guard and the field filter: without them, rows that have no mapping would be set to NULL.
How about copying the data to a new table?
CREATE TABLE newemp AS
SELECT e.empid,
       NVL(d.new_value, e.department) AS department,
       NVL(l.new_value, e.location) AS location,
       NVL(s.new_value, e.segment) AS segment
FROM emp e
LEFT JOIN map d ON d.field = 'DEPARTMENT' AND e.department = d.old_value
LEFT JOIN map l ON l.field = 'LOCATION' AND e.location = l.old_value
LEFT JOIN map s ON s.field = 'SEGMENT' AND e.segment = s.old_value
ORDER BY e.empid;
EMPID DEPARTMENT LOCATION SEGMENT
1     74         55       12
2     74         22       12
3     75         22       39
Obviously you'll need three passes through the mapping table, but only one pass through the emp table.
We use a LEFT JOIN because not all values will be changed. If no new_value is found, the NVL function falls back to the existing value from the emp table.
You could update the original table from this new table (if the new table has a primary key):
UPDATE (SELECT empid,
e.department as old_department,
n.department as new_department,
e.location as old_location,
n.location as new_location,
e.segment as old_segment,
n.segment as new_segment
FROM emp e
JOIN newemp n USING (empid))
SET old_department = new_department,
old_location = new_location,
old_segment = new_segment
WHERE old_department != new_department
OR old_location != new_location
OR old_segment != new_segment;
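If the intermediate newemp table isn't wanted, a single MERGE can express the same one-pass update directly. A sketch assuming the same table and column names as above, not tested against the original schema:
MERGE INTO emp e
USING (
    SELECT e2.empid,
           NVL(d.new_value, e2.department) AS department,
           NVL(l.new_value, e2.location) AS location,
           NVL(s.new_value, e2.segment) AS segment
    FROM emp e2
    LEFT JOIN map d ON d.field = 'DEPARTMENT' AND e2.department = d.old_value
    LEFT JOIN map l ON l.field = 'LOCATION' AND e2.location = l.old_value
    LEFT JOIN map s ON s.field = 'SEGMENT' AND e2.segment = s.old_value
) n
ON (e.empid = n.empid)
WHEN MATCHED THEN UPDATE
  SET e.department = n.department,
      e.location = n.location,
      e.segment = n.segment
  -- only touch rows where something actually changes
  WHERE e.department != n.department
     OR e.location != n.location
     OR e.segment != n.segment;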

select statement from a table ONLY if some of the fields were updated ORACLE

Can anyone explain how I can create a select statement and fetch the data from a table, but only if particular fields were updated? Let's say I have:
select a, b, c, d, e, f
from table1 t1
inner join table2 t2
on t1.a = t2.a
I'm interested in whether columns d, e, f were updated since, say, yesterday: if so, I want to include the row in my select statement, but if d, e, f were not updated since yesterday, I want to ignore the row. In table1 I have a date field for when the data was inserted (date_created) and a date field for when it was updated (date_modified). The tricky bit is that data in table1 might be updated by users during the day, but not necessarily fields d, e, f; say a user simply updated columns a, b, c. The date_modified column will still show that the row has been updated, so I cannot rely purely on date_modified. My question is: is there any other way to filter the data and get the correct rows in return? Triggers and stored procedures are not an option, ideally pure SQL. Any help?
It's unclear which columns belong to which table, but one solution is to use a flashback query (provided you have sufficient undo retention to accommodate the 24-hour difference between queries).
An example of finding the differences on a table where columns d, e or f have changed from their values 24 hours ago:
SELECT t.*
FROM table_name t
INNER JOIN
(
SELECT *
FROM table_name
AS OF TIMESTAMP SYSTIMESTAMP - INTERVAL '1' DAY
) p
ON ( t.a = p.a
AND ( t.d <> p.d OR t.e <> p.e OR t.f <> p.f ) );
Solved! Solution: add an extra column (let's say Total) to the target table as the sum of columns d, e, f, and populate it once. After that, if columns d, e, f change during the day, the sum of d, e, f will differ from the Total column, and you can simply filter on that in the WHERE clause.
Maybe it is not the most elegant solution, but it does the job.
Thanks for your ideas!!!
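A sketch of that workaround, assuming the column names from the question (table1 holding d, e, f, plus the new Total column; all names illustrative):
-- one-time setup: add the tracking column and populate it
ALTER TABLE table1 ADD (total NUMBER);
UPDATE table1 SET total = d + e + f;

-- daily check: rows whose d, e or f changed since total was last synced
SELECT t1.a, t1.b, t1.c, t1.d, t1.e, t1.f
FROM table1 t1
INNER JOIN table2 t2 ON t1.a = t2.a
WHERE t1.d + t1.e + t1.f <> t1.total;

-- after processing, re-sync the tracking column
UPDATE table1 SET total = d + e + f;
-- caveat: a change that leaves d + e + f unchanged goes undetected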
