Querying between two tables in Hive - hadoop

I have two tables, f and t.
Table f:
one : two : three
1   : dk  : jkdk
179 : dsa : ppd
90  : dsj : dat
Table t:
one : two : three
0   : 100 : aus
191 : 200 : NZ
I want to compare each f.one value against the range t.one to t.two and return the corresponding t.three.
For example,
if f.one == 90 then the value should be aus.
expected output:
t.three
aus
NZ
aus
I tried:
select t.three from t JOIN (select f.one from f) y where y.one >= t.one AND y.one <= t.two;
and got output as:
aus
aus
Nz

You are doing it right; the problem is that the result does not keep the order of table t, which is why it looks mixed up.
Change the select list to t.one, t.three and you'll see it's fine.
If the order really matters, you can add an order by t.one at the end.
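A minimal sketch of that suggestion, keeping the structure of the query from the question (untested; same tables f and t):
-- sketch only: mirrors the query from the question, with t.one added and an ORDER BY
select t.one, t.three
from t JOIN (select f.one from f) y
where y.one >= t.one AND y.one <= t.two
order by t.one;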

Related

make two rows into one based on a reference column in hive

I have two rows in the same table, named Test, with the following columns:
ref_col, type_taux, type_tans, date_trans, date_val
row 1 : 1011 , 'FIXE', 'Pay', '22092022', '23092022'
row 2 : 1011 , 'Variable', 'Receive', '22092022', '23092022'
The result should be :
ref_col , type_taux_pay, type_tans_pay, date_trans, date_val, type_taux_receive, type_tans_receive
1011 , 'FIXE', 'Pay', '22092022', '23092022', 'Variable', 'Receive'
In other words, I should keep the common columns and add the differing ones on the same row, because both rows have '1011' as the same reference.
How can I do that in HQL? Thanks.
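A common way to flatten rows like this in Hive is conditional aggregation. A sketch, assuming each ref_col has exactly one 'Pay' row and one 'Receive' row (untested):
-- sketch only: assumes one 'Pay' and one 'Receive' row per ref_col
select ref_col,
       max(case when type_tans = 'Pay' then type_taux end)     as type_taux_pay,
       max(case when type_tans = 'Pay' then type_tans end)     as type_tans_pay,
       max(date_trans)                                         as date_trans,
       max(date_val)                                           as date_val,
       max(case when type_tans = 'Receive' then type_taux end) as type_taux_receive,
       max(case when type_tans = 'Receive' then type_tans end) as type_tans_receive
from Test
group by ref_col;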

Nearest neighbor and distance between points and lines

In Oracle Spatial I have two tables (AVALREGULACAO and ATROCOADUTOR) representing points and lines, respectively.
The structure of both tables is as follows:
AVALREGULACAO (295 point records)
IPID [number(10)]
GEOMETRY [MDSYS.SDO_GEOMETRY]
ATROCOADUTOR (12536 line records)
IPID [number(10)]
GEOMETRY [MDSYS.SDO_GEOMETRY]
I need to find the nearest ATROCOADUTOR neighbor for each AVALREGULACAO point and calculate the distance between them, i.e. output rows of the form:
AVALREGULACAO_IPID | ATROCOADUTOR_IPID | DISTANCE
I've tried two options.
Option 1:
SELECT /*+ ORDERED */ A.IPID, B.IPID, MIN(SDO_GEOM.SDO_DISTANCE(sdo_cs.make_2d(A.GEOMETRY), sdo_cs.make_2d(B.GEOMETRY), 0.005)) as DISTANCE
FROM AVALREGULACAO A, ATROCOADUTOR B
GROUP BY A.IPID, B.IPID;
It takes quite a long time to compute, and it generates a huge output of 295 x 12536 = 3 698 120 possible combinations (Cartesian product). Furthermore, the CSV file output cannot accommodate all these records (1 048 576 row limit).
I only need 295 records, one for each of the 295 AVALREGULACAO points.
Option 2:
I've also tried/adapted another query using the nearest-neighbor (sdo_nn) operator:
PROMPT IPID, nearest_IPID, distance
select /*+ ORDERED USE_NL(s,s2) */
       s.IPID,
       s2.IPID as nearest_IPID,
       TO_CHAR(REPLACE(mdsys.sdo_geom.sdo_distance(sdo_cs.make_2d(s.GEOMETRY), sdo_cs.make_2d(s2.GEOMETRY), 0.05), ',', '.')) as distance
from AVALREGULACAO s,
     ATROCOADUTOR s2
where s2.IPID in (select IPID
                  from AVALREGULACAO s3
                  where sdo_nn(s3.GEOMETRY, s.GEOMETRY, 'sdo_batch_size=10', 1) = 'TRUE'
                  and s3.IPID <> s.IPID
                  and rownum < 2)
order by 1, 2;
This query takes forever - I need to shut down the process before it ends.
I guess I'm missing the point on how to optimize/filter the desired results.
Any tips on how to efficiently solve this would be much appreciated.
Thanks in advance,
Pedro
PS:
@Boneist: Thanks a lot for the input.
Unfortunately I got an error after applying your query (I'm still trying to get to grips with the semantics/syntax of the KEEP / DENSE_RANK construct, which is new to me):
SELECT a.ipid a_ipid,
MIN(b.ipid) KEEP (dense_rank FIRST order by sdo_nn(a.GEOMETRY,b.GEOMETRY,'sdo_batch_size=10',1)) b_ipid,
MIN(sdo_geom.sdo_distance(sdo_cs.make_2d(a.geometry), sdo_cs.make_2d(b.geometry), 0.005)) AS distance
FROM avalregulacao a
INNER JOIN atrocoadutor b ON sdo_nn(a.GEOMETRY,b.GEOMETRY,'sdo_batch_size=10',1) = 'TRUE'
GROUP BY a.ipid;
Error
Error starting at line : 1 in command (the query above)
Error at Command Line : 2 Column : 45
Error report -
SQL Error: ORA-29907: found duplicate labels in primary invocations
29907. 00000 - "found duplicate labels in primary invocations"
*Cause:  There are multiple primary invocations of operators with
         the same number as the label.
*Action: Use distinct labels in primary invocations.
I think you're probably after something like:
SELECT a.ipid a_ipid,
MIN(b.ipid) KEEP (dense_rank FIRST order by sdo_nn(a.GEOMETRY,b.GEOMETRY,'sdo_batch_size=10',1)) b_ipid,
MIN(sdo_geom.sdo_distance(sdo_cs.make_2d(a.geometry), sdo_cs.make_2d(b.geometry), 0.005)) AS distance
FROM avalregulacao a
INNER JOIN atrocoadutor b ON sdo_nn(a.GEOMETRY,b.GEOMETRY,'sdo_batch_size=10',1) = 'TRUE'
GROUP BY a.ipid;
This joins both tables on the nearest neighbour function, which should reduce the number of rows being returned.
The MIN(b.ipid) KEEP (dense_rank first order by sdo_nn(a.GEOMETRY,b.GEOMETRY,'sdo_batch_size=10',1)) simply returns the lowest b.ipid value for the lowest difference.
(I think this query will work as is, but I can't test it. You might have to do the join and have sdo_nn(a.GEOMETRY,b.GEOMETRY,'sdo_batch_size=10',1) as a column in a subquery and then do the group by in the outer query.)
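If the ORA-29907 error persists, a sketch of that subquery variant (untested; same tables and column names as above, with sdo_nn kept only in the join predicate and the ranking done on the computed distance rather than a second sdo_nn call):
-- sketch only: ranks on the computed distance instead of a second sdo_nn invocation
SELECT a_ipid,
       MIN(b_ipid) KEEP (dense_rank FIRST ORDER BY distance) AS b_ipid,
       MIN(distance) AS distance
FROM (SELECT a.ipid AS a_ipid,
             b.ipid AS b_ipid,
             sdo_geom.sdo_distance(sdo_cs.make_2d(a.geometry), sdo_cs.make_2d(b.geometry), 0.005) AS distance
      FROM avalregulacao a
      INNER JOIN atrocoadutor b ON sdo_nn(a.GEOMETRY, b.GEOMETRY, 'sdo_batch_size=10', 1) = 'TRUE')
GROUP BY a_ipid;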

Indexing very long number column

I have a table with a few columns, including two VARCHAR2(200) columns. In these columns we basically store serial numbers, which can be numeric or alphanumeric. For alphanumeric values, both columns always hold the same serial. For numeric serials, however, the two columns hold a range (e.g. first column value = 511368000004001226 and second column value = 511368000004001425, covering a quantity of 200 different serials). The maximum length of a serial is 20 digits. I have indexed both columns.
Now I want to search for a serial that falls within such a range (let's say 511368000004001227). I use the following query:
SELECT *
FROM Table_Name d
WHERE d.FROM_SN <= '511368000004001227'
AND d.TO_SN >= '511368000004001227'
Is this a valid query? Can I use the <= and >= operators on numbers stored in a VARCHAR2 column?
Yes, you can use the >= and <= operators on VARCHAR2 columns, but the values will be treated as strings and string comparison will take place.
In that case '4' is considered greater than '34' (i.e. '4' > '34'), whereas the number 4 is less than 34.
It is not good practice to store numbers in VARCHAR2; you lose the behaviour of the NUMBER type if you store them in VARCHAR2.
You can check this concept using the following:
select * from dual where '4' > '34'; -- gives result 'X'
select * from dual where 4 > 34; -- Gives no result
You can try converting the VARCHAR2 columns to numbers using TO_NUMBER, if that is possible in your case.
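Applied to the query from the question, that would look roughly like this (a sketch only; it assumes every FROM_SN/TO_SN value is purely numeric, otherwise TO_NUMBER raises an error, and wrapping the columns in a function will bypass the existing plain indexes):
-- sketch only: assumes all FROM_SN/TO_SN values are numeric
SELECT *
FROM Table_Name d
WHERE TO_NUMBER(d.FROM_SN) <= 511368000004001227
AND TO_NUMBER(d.TO_SN) >= 511368000004001227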
Cheers!!
Your query is "valid" in the sense that it works and will deliver a result. From a numeric standpoint, however, it will not work correctly, because the range operators on VARCHAR columns compare values the same way alphanumeric values are sorted.
e.g.
d.FROM_SN >= '51000'
AND d.TO_SN <= '52000'
This would match values you would expect, like 51001 or 51700, but it would also deliver unexpected values like 52 or 5100000000000000.
If you want numeric selection, you need to convert the values, which of course only works if every value in these columns is numeric:
TO_NUMBER(d.FROM_SN) >= 51000
AND TO_NUMBER(d.TO_SN) <= 52000
You may use alphanumerical comparison provided that
1) the range bounds are of the same length, and
2) all the keys in the range are of the same length.
Example data
SERNO
----------------------------------------
101
1011
1012
1013
1014
102
103
104
This doesn't work
select * from tab
where serno >= '101' and serno <= '102';
SERNO
----------------------------------------
101
102
1011
1012
1013
1014
But constraining the length of the result provides the right answer:
select * from tab
where serno >= '101' and serno <= '102'
and length(serno) = 3;
SERNO
----------------------------------------
101
102

more efficient way of reading data from two tables and writing them into a new one using batch

I'm trying to write a Spring Batch job to move data from two tables into a single table. I have thought of many ways to solve this problem, but I'm still wondering if there is a more efficient solution.
Basically, I have two tables, let's call them table A and table B, and their structure is as follows:
table A
column 1A column 2A
======== ========
bmw 123555
nissan 123456777
audi 12888
toyota 9800765
kia 85834945
table B
column 1B column 2B
======== ========
12 caraudi
123456 carnissan
123 carbmw
0125 carvvv
88963 carbbn
What I'm trying to do is create a table C from the batch's writer which holds all the data from table B (column 1B and column 2B) plus column 1A, without losing any data from either table and without writing duplicate data based on column 2A and column 1B. The two tables have only one column in common (column 1B == column 2A), but each column 2A value has a 3-digit suffix appended to the id, so if I do a join and compare I have to use a substr, which will be very slow because the tables are huge.
The other solution I thought of is to have a reader for table A that writes all its rows to a tempA table with the suffix stripped, then another reader that compares tempA and table B and writes the data to table C as follows:
table C
column 1A (nullable, because not all the records in column 2A exist in column 1B)
column 1B
column 2B
so the table will look like this
table C
column 1c column 2c column 3c
========= ========= =========
12 caraudi audi
123456 carnissan nissan
123 carbmw bmw
0125 carvvv
88963 carbbn
9800765 toyota
85834945 kia
Is this the best way to solve the problem, or is there another way that is more efficient?
thanks in advance!
Before giving up on a LEFT OUTER JOIN from tableA to tableB (or a FULL OUTER JOIN if your query conditions require it) consider using db2expln or the Visual Explain utility in IBM Data Studio to determine the cost of some alternative ways to perform a "begins with" match on VARCHAR columns:
ON a.col2a LIKE b.col1b || '___'
ON a.col2a >= b.col1b || '000' AND a.col2a <= b.col1b || '999'
If 1b is a CHAR column, you might need to trim off its trailing spaces before concatenating additional characters to it: RTRIM( b.col1b ) || '000'
Assuming column 1b is indexed, one prefix-based matching predicate or another is bound to make a join between those two tables less expensive than creating, populating, and joining to your own temp table. If I'm wrong (or there are other complicating factors) and a temp table ends up being the best option, be sure to use a declared global temporary table (DGTT) so you can avoid the logging overhead of populating it.
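For instance, a sketch of the prefix-match join as a single statement (assumed names: tableA(col1a, col2a), tableB(col1b, col2b), with the 3-digit suffix from the question; switch to a FULL OUTER JOIN if you also need the table A rows that have no match in table B):
-- sketch only: table/column names assumed from the question
SELECT b.col1b AS column_1c,
       b.col2b AS column_2c,
       a.col1a AS column_3c
FROM tableB b
LEFT OUTER JOIN tableA a
       ON a.col2a LIKE RTRIM(b.col1b) || '___'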

use lag in the next line after its line has been executed

This is a very complicated situation for me and I was wondering if someone can help me with it:
Here is my table:
Record_no Type Solde SQLCalculatedPmu DesiredValues
------------------------------------------------------------------------
2570088 Insertion 60 133 133
2636476 Insertion 67 119,104 119,104
2636477 Insertion 68 117,352 117,352
2958292 Insertion 74 107,837 107,837
3148350 Radiation 73 107,837 107,83 <---
3282189 Insertion 80 98,401 98,395
3646066 Insertion 160 49,201 49,198
3783510 Insertion 176 44,728 44,725
3783511 Insertion 177 44,475 44,472
4183663 Insertion 188 41,873 41,87
4183664 Insertion 189 41,651 41,648
4183665 Radiation 188 41,651 41,64 <---
4183666 Insertion 195 40,156 40,145
4183667 Insertion 275 28,474 28,466
4183668 Insertion 291 26,908 26,901
4183669 Insertion 292 26,816 26,809
4183670 Insertion 303 25,842 25,836
4183671 Insertion 304 25,757 25,751
In my table, every value in the SQLCalculatedPmu and DesiredValues columns is calculated based on the preceding value.
As you can see, I calculated the SQLCalculatedPmu column by rounding to 3 decimals. On each Radiation line, however, the client wants the next calculation to start from the preceding value on 2 decimals instead of 3 (represented in the DesiredValues column), and the following values have to be recalculated. For example, line 6 changes because the value on line 5 is now on 2 decimals. I could handle this if there were a single Radiation, but in my case there are many Radiations, and each of them changes everything that follows based on the 2-decimal value.
In summary, here are the steps:
1 - round the value of the row preceding a Radiation and put it in the Radiation row.
2 - calculate all the following Insertion rows.
3 - when we reach another Radiation, redo steps 1 and 2, and so on.
I'm using an Oracle DB and I'm the owner, so I can create procedures and run inserts, updates and selects.
But I'm not familiar with procedures or loops.
For information, the formula for SQLCalculatedPmu uses two additional columns, price and number, and is calculated cumulatively line by line for each investor:
(price * number) + (cumulative (price * number) of the preceding lines)
I tried something like this:
update PMUTemp
set SQLCalculatedPmu =
  case when Type = 'Insertion' then
    (number*price) + lag(SQLCalculatedPmu, 1) over (partition by investor order by Record_no)
      / (number + lag(solde, 1) over (partition by investor order by Record_no))
  else
    TRUNC(lag(SQLCalculatedPmu, 1) over (partition by investor order by Record_no))
  end;
but it gave me this error (I think it's because I'm looking at the preceding line, which is itself modified during the SQL statement):
ORA-30486: window functions are allowed only in the SELECT list of a query.
I was wondering if creating a procedure that is called as many times as there are Radiations would do the job, but I'm really not good with procedures.
Any help?
Regards,
Just to make my need simpler: all I want is to derive the DesiredValues column from the SQLCalculatedPmu column. The steps are:
1 - on a Radiation, the value becomes trunc(preceding value, 2)
2 - calculate all the following Insertion rows this way: (price * number) + (cumulative (price * number) of the preceding lines). As the Radiation value has changed, I need to recalculate the next lines based on it.
3 - when we reach another Radiation, redo steps 1 and 2, and so on.
Kindest regards
You should not need a procedure here -- a SQL update of the Radiation rows in the table would do this quicker and more reliably.
Something like ..
update my_table t1
set (column_1, column_2) =
  (select round(column_1, 2), round(column_2, 2)
   from my_table t2
   where t2.type = 'Insertion'
   and t2.record_no = (select max(t3.record_no)
                       from my_table t3
                       where t3.type = 'Insertion'
                       and t3.record_no < t1.record_no))
where t1.type = 'Radiation'
