In Hive, update a status condition as per the data load - hadoop

I have the following question. The table below is the input:
table_1: student

id  ETL_create_date  student_marks
----------------------------------
1   2023-02-10       85
2   2023-02-10       75
3   2023-02-10       80
4   2023-02-09       90
5   2023-02-09       65
6   2023-02-09       100
The expected output should be as below (assume the system date is 2023-02-10):
The STATUS column should check whether the max etl_create_date of the student table equals the system date. If the dates are the same, the status should be LOADED, else NOT LOADED.
The etl_create_date column is the max etl_create_date of the student table.
The count column is the number of records in the student table for the given max etl_create_date.
output_table:

table_name     ETL_create_date  STATUS  count
---------------------------------------------
student_table  2023-02-10       LOADED  3
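A minimal HiveQL sketch of this logic (assuming Hive 1.2+ for current_date and window functions; table and column names taken from the question):

SELECT 'student_table' AS table_name,
       etl_create_date,
       CASE WHEN etl_create_date = current_date
            THEN 'LOADED'
            ELSE 'NOT LOADED'
       END AS status,
       cnt AS `count`
FROM (SELECT etl_create_date,
             COUNT(*) AS cnt,
             RANK() OVER (ORDER BY etl_create_date DESC) AS rnk
      FROM student
      GROUP BY etl_create_date) t
WHERE rnk = 1;

The inner query ranks the distinct load dates and keeps only the latest one with its row count; the outer query compares that date to the system date to derive the status.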

Related

Talend: How to fix "The code of method is exceeding the 65535 bytes limit"

I have a set of 5 tables with approximately 2 million rows and 450 columns each.
My job looks like this:
tDBInput 1 ---tMap-----
tDBInput 2 ---tMap-----
tDBInput 3 ---tMap---tUnite---tDBOutput
tDBInput 4 ---tMap-----
tDBInput 5 ---tMap-----
These are the 5 tables I'm trying to union. In each tMap I add an id to trace which table the data comes from, and I reduce the number of columns (from 450 to 20).
Then I unite the 5 flows in one tUnite that loads a table in Truncate - Insert mode.
I'm trying to make it work but always get the same error: "The code of method tDBInput-10Process is exceeding the 65535 bytes limit".
If you use only 20 of the 450 columns, you could select only those columns in each of your tDBInput components, instead of extracting all columns and filtering them in tMap.
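For instance, a tDBInput query along these lines (table and column names are hypothetical) pushes the work to the database and should help keep the generated Java method well under the 65535-byte limit:

SELECT 'table_1' AS source_table,  -- traceability id, instead of adding it in tMap
       col_1,
       col_2,
       -- ...the other needed columns...
       col_20
FROM table_1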

SQL Server - Filter only rows which have values in all columns

How do I filter the rows that have values in all columns (i.e. exclude a row if it has a missing value/NULL in any column)?
Say:

id  name  age   height
-----------------------
1   abc   19    NULL
2   fds   34    2.3
3   grt   NULL  NULL

The output should be only row 2. How do I do this?
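A minimal sketch (assuming the table is named t; SQL Server has no shorthand for "all columns are NOT NULL", so each nullable column is listed explicitly):

SELECT id, name, age, height
FROM t
WHERE name IS NOT NULL
  AND age IS NOT NULL
  AND height IS NOT NULL;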

Adding a column using HIVE

I have the following data table.
ID salary occupation
1 5000 Engineer
2 6000 Doctor
3 8000 Pilot
4 1000 Army
1 3000 Engineer
2 4000 Teacher
3 2000 Engineer
1 1000 Teacher
3 1000 Engineer
1 5000 Doctor
Now I want to add another column, flag, to this table so that it looks like the following:
ID salary occupation Flag
1 5000 Engineer 0
2 6000 Doctor 0
3 8000 Pilot 0
4 1000 Army 0
1 3000 Engineer 1
2 4000 Teacher 1
3 2000 Engineer 1
1 1000 Teacher 2
3 1000 Engineer 2
1 5000 Doctor 3
Now how can I update my original table to the above format using Hive?
Kindly help me.
Provided that you have data in your files for the additional column, you can use the ADD COLUMNS clause of ALTER TABLE.
In your example, do something like this:
ALTER TABLE test ADD COLUMNS (flag TINYINT);
Or you can try REPLACE COLUMNS as well:
ALTER TABLE test REPLACE COLUMNS (id INT, salary INT, occupation STRING, flag TINYINT);
You might need to load (overwrite) your dataset again, though (just a speculation!).
You can definitely add new columns to a Hive table using the ALTER command, as told above:
hive> ALTER TABLE test ADD COLUMNS (flag TINYINT);
In Hive 0.13 and earlier releases the column will have NULL values, but in Hive 0.14.0 and later you can update the column values using the UPDATE command.
Another way is, after adding the column with ALTER, to overwrite the existing data with the new data (which has the flag column):
hive> LOAD DATA LOCAL INPATH 'flagfile.txt' OVERWRITE INTO TABLE <tablename>;
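If the flag has to be derived rather than loaded from a file, one option is to treat it as "the number of earlier rows with the same ID" and regenerate the dataset with a window function. This is only a sketch: test_flagged and load_seq are hypothetical names, and some such ordering column is required because Hive tables have no inherent row order:

INSERT OVERWRITE TABLE test_flagged
SELECT id, salary, occupation,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_seq) - 1 AS flag
FROM test;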

How to select two max values from records that share the same ID, for every ID in the table

I have a problem with this case: I have a log table with many rows sharing the same ID under different conditions. I want to select the two max values for each ID. I've tried, but my query shows only one record, not every record in the table.
Here's my records table:
order_id  seq  status
---------------------
1256      2    4
1256      1    2
1257      0    2
1257      3    1
Here is my code:
WITH t AS (
    SELECT x.order_id,
           MAX(y.seq) AS seq2,
           MAX(y.extern_order_status) AS status
    FROM t_order_demand x
    JOIN t_order_log y ON x.order_id = y.order_id
    WHERE x.order_id LIKE '%12%'
    GROUP BY x.order_id
)
SELECT *
FROM t
WHERE (t.seq2 || t.status) IN (SELECT MAX(tt.seq2 || tt.status) FROM t tt);
This query works, but sometimes it gives wrong values or shows only some records, not all of them.
I want the result to be like this:

order_id  seq2  status
----------------------
1256      2     4
1257      3     2
I think you just want an aggregation:
select d.order_id, max(l.seq) as seq2, max(l.extern_order_status) as status
from t_order_demand d join
     t_order_log l
     on d.order_id = l.order_id
where d.order_id like '%12%'
group by d.order_id;
I'm not sure what your final where clause is supposed to do, but it appears to do unnecessary filtering compared to what you want.

Update query taking long time in oracle 10g

I have a table which holds more than 2 million records. I am trying to update the table using the following query:
UPDATE toc T
   SET RANK = 65535 - (SELECT COUNT (*)
                         FROM toc T2
                        WHERE T2.S_KEY LIKE '00010001%'
                          AND T2.A_ID IS NOT NULL
                          AND T2.TARGET = T.TARGET
                          AND T2.RANK > T.RANK)
 WHERE S_KEY LIKE '00010001%' AND A_ID IS NOT NULL
This query usually takes 5 minutes to update 50,000 rows in our staging DB, which is an exact replica of the production DB, but in production it takes 6 hours to execute.
I tried the Oracle advisors to select the correct execution plan, but nothing is working.
Plan
UPDATE STATEMENT ALL_ROWS  Cost: 329,471
  6 UPDATE TT.TOC
    2 TABLE ACCESS BY INDEX ROWID TABLE TT.TOC  Cost: 5  Bytes: 4,173,236  Cardinality: 54,911
      1 INDEX SKIP SCAN INDEX TT.DATASTAT_SORTKEY_IDX  Cost: 4  Cardinality: 1
    5 SORT AGGREGATE  Bytes: 76  Cardinality: 1
      4 TABLE ACCESS BY INDEX ROWID TABLE TT.TOC  Cost: 5  Bytes: 76  Cardinality: 1
        3 INDEX SKIP SCAN INDEX TT.DATASTAT_SORTKEY_IDX  Cost: 4  Cardinality: 1
I can see the following wait events
1,066 db file sequential read 10,267 0 3,993 0 6 39,933,580
1,066 db file scattered read 413 0 188 0 6 1,876,464
Any help will be greatly appreciated.
Here is the current list of indexes:

INDEX_NAME      COLUMN_NAME     COLUMN_POSITION
-----------------------------------------------
DSTAT_SKEY_IDX  D_STATUS        1
DSTAT_SKEY_IDX  S_KEY           2
IDX$$_165A0002  N_LABEL         1
S_KEY_IDX       S_KEY           1
XAK1_TOC        N_RELATIONSHIP  1
XAK2_TOC        TARGET          1
XAK2_TOC        N_LABEL         2
XAK2_TOC        D_STATUS        3
XAK2_TOC        A_ID            4
XIE1_TOC        N_RELBASE       1
XIF4_TOC        SOURCE_FILE_ID  1
XIF5_TOC        A_ID            1
XPK_TOC         N_ID            1
Atif
You're doing a skip scan where you supposedly want a range scan.
A range scan is only possible when the index columns are ordered by descending selectivity; in your case it seems that it should be S_KEY - TARGET - RANK.
Update: rewriting the query in a different order wouldn't make any difference. What matters is the sequence of the columns in the indexes of that table.
First, show us the current index columns for that table:
select index_name, column_name, column_position
from all_ind_columns
where table_name = 'TOC';
Then you could create a new index, e.g.:
create index toc_i_s_key_target_rank on toc (s_key, target, rank) compress;
