How to return matching records based on a lookup table using Hive - Hadoop

Let's say we have a lookup table (Table_A) and another table (Table_B) as follows:
We want to search each string in Table_B for the keywords in Table_A, return the matching chemical type, and form Table_C, as follows:
How can we implement this with a Hive query in a Hadoop environment?
The challenging part is searching for multiple keywords within the same string and creating a new row for each matched record.
Thank you!

I think you should structure Table_A differently, with one keyword per row (or keep the current structure, split the keyword list on commas, and use explode in Hive), like so:
Table A
+---------------+---------+
| Chemical Type | Keyword |
+---------------+---------+
| HF            | 100HF   |
+---------------+---------+
| HF            | 100:HF  |
+---------------+---------+
| HCL           | HCL200  |
+---------------+---------+
| HCL           | 500HCL  |
+---------------+---------+
etc...
Then it seems that you need a Cartesian product (cross) join, filtered on the keyword appearing in the string:
SELECT DISTINCT b.machine, b.string, a.chemical_type
FROM Table_A a
CROSS JOIN Table_B b
WHERE instr(b.string, a.keyword) > 0;
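To make the logic of that query concrete, here is a minimal Python sketch of the same idea: split the comma-separated keywords into one row each, pair every lookup row with every record, and keep the pairs where the keyword occurs in the string. The sample rows and the function names are assumptions for illustration only, since the original tables were not shown.

```python
# Hypothetical sample data; the tables from the question were not included.
table_a = [  # (chemical_type, comma-separated keywords)
    ("HF", "100HF,100:HF"),
    ("HCL", "HCL200,500HCL"),
]
table_b = [  # (machine, string)
    ("M1", "rinse 100HF then 500HCL"),
    ("M2", "HCL200 only"),
    ("M3", "no chemicals"),
]


def explode_keywords(rows):
    """Turn one row per chemical type into one row per (type, keyword)."""
    return [(chem, kw.strip()) for chem, kws in rows for kw in kws.split(",")]


def match(lookup, records):
    """Cross join with a substring filter, like instr(b.string, a.keyword) > 0."""
    hits = {
        (machine, text, chem)
        for machine, text in records
        for chem, kw in lookup
        if kw in text
    }
    # The set removes duplicates, playing the role of SELECT DISTINCT.
    return sorted(hits)


table_c = match(explode_keywords(table_a), table_b)
for row in table_c:
    print(row)
```

A string that contains keywords of two chemical types (M1 here) produces two output rows, which is the "new row for each matched record" behaviour asked for.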


How to find, for each table in a list, the maximum value of a given field?

I have a list of Oracle tables and fields, and I would like to find, for each table, the maximum value of the field given in the list.
Input:
+------+--------+
| TAB | FIELDS |
+------+--------+
| tab1 | field1 |
+------+--------+
| tab2 | field2 |
+------+--------+
Output:
+------+--------+-----------+
| TAB | FIELDS | Max value |
+------+--------+-----------+
| tab1 | field1 | 10 |
+------+--------+-----------+
| tab2 | field2 | 15 |
+------+--------+-----------+
I want to write a PL/SQL function that loops over the list, but I have very little knowledge of this language. Do you have any examples to show me?
The input table is dynamic, which is why I want to use a loop.
Thanks in advance.
The input is built from a system view such as all_tab_columns. The output must be stored in a table.
It is indeed not a great design for storing and retrieving data, but I presume something like this should work for you. I've used a VARCHAR2 variable for the max value instead of a NUMBER so that MAX also works for non-numeric fields. The column that stores the max value in your table should likewise be defined as VARCHAR2 for this to work in such cases.
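DECLARE
  -- VARCHAR2 so that the MAX of non-numeric columns can be held as well
  v_maxVal VARCHAR2(400);
BEGIN
  -- One iteration per (table, column) pair found in the data dictionary
  FOR rec IN (
    SELECT table_name, column_name
    FROM user_tab_columns
    WHERE table_name IN ('TAB1', 'TAB2')
  )
  LOOP
    -- Build and run the MAX query dynamically for the current pair
    EXECUTE IMMEDIATE
      'SELECT MAX(' || rec.column_name || ') FROM ' || rec.table_name
      INTO v_maxVal;
    INSERT INTO fieldstab (tab, fields, max_val)
    VALUES (rec.table_name, rec.column_name, v_maxVal);
  END LOOP;
END;
/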
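The same loop can be sketched outside the database. The example below uses Python's built-in sqlite3 module as a stand-in for Oracle, so it only illustrates the pattern (build a MAX query per table/column pair, run it, store the result as text); the helper names and sample rows are assumptions, not part of the original answer.

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier so it can be spliced into dynamic SQL."""
    return '"' + name.replace('"', '""') + '"'


def collect_max_values(conn, pairs):
    """For each (table, column) pair, run a dynamic MAX query and store the result."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fieldstab (tab TEXT, fields TEXT, max_val TEXT)"
    )
    for table, column in pairs:
        sql = f"SELECT MAX({quote_ident(column)}) FROM {quote_ident(table)}"
        (max_val,) = conn.execute(sql).fetchone()
        # Stored as text so that non-numeric maxima fit too (cf. VARCHAR2 above).
        conn.execute(
            "INSERT INTO fieldstab (tab, fields, max_val) VALUES (?, ?, ?)",
            (table, column, None if max_val is None else str(max_val)),
        )
    conn.commit()


# Hypothetical sample tables matching the example in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab1 (field1 INTEGER)")
conn.execute("CREATE TABLE tab2 (field2 INTEGER)")
conn.executemany("INSERT INTO tab1 VALUES (?)", [(3,), (10,), (7,)])
conn.executemany("INSERT INTO tab2 VALUES (?)", [(15,), (2,)])

collect_max_values(conn, [("tab1", "field1"), ("tab2", "field2")])
result = conn.execute(
    "SELECT tab, fields, max_val FROM fieldstab ORDER BY tab"
).fetchall()
print(result)
```

Table and column names cannot be passed as bind parameters, which is why both this sketch and the PL/SQL block concatenate them into the statement; the values being inserted are still bound normally.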

Apply a UDF (which needs another Spark job) to each element of an array column in Spark SQL

The structure of a hive table (tbl_a) is as follows:
name | ids
A | [1,7,13,25168,992]
B | [223, 594, 3322, 192928]
C | null
...
Another Hive table (tbl_b) holds the mapping from id to new_id. This table is big, so it cannot be loaded into memory.
id | new_id
1 | 'aiks'
2 | 'ficnw'
...
I intend to create a new hive table to have the same structure as tbl_a, but convert the array of id to the array of new_id:
name | ids
A | ['aiks','fsijo','fsdix','sssxs','wie']
B | ['cx', 'dds', 'dfsexx', 'zz']
C | null
...
Could anyone give me some idea on how to implement this scenario in spark sql or in spark DataFrame? Thanks!
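This is an expensive operation, but you can do it using a coalesce, an explode, and a left outer join, as follows. Note that collect_list skips nulls, so a row whose ids is null (C above) comes back with an empty array rather than null, and the order of the collected elements is not guaranteed after the join:
import org.apache.spark.sql.functions.{array, coalesce, collect_list, explode, lit}
// The $"..." syntax needs spark.implicits._, which spark-shell imports automatically.
tbl_a
  // Replace a null array with a one-element array so explode keeps the row
  .withColumn("ids", coalesce($"ids", array(lit(null).cast("int"))))
  .select($"name", explode($"ids").alias("id"))
  .join(tbl_b, Seq("id"), "leftouter")
  .groupBy("name").agg(collect_list($"new_id").alias("ids"))
  .show()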
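The data flow of that pipeline (explode, left outer join, regroup) can be traced on a tiny in-memory example. This Python sketch is only an illustration under assumptions: the sample rows are made up, the mapping is a small dict (the real tbl_b is too large for that, which is why the answer uses a join), and the element position is carried along explicitly to keep the original order, which in Spark would need something like posexplode.

```python
from collections import defaultdict

# Hypothetical sample data standing in for tbl_a and tbl_b.
tbl_a = [("A", [1, 7]), ("B", [2]), ("C", None)]
tbl_b = {1: "aiks", 2: "ficnw", 7: "fsijo"}


def explode(rows):
    """One output row per array element; a null array yields a single null id."""
    for name, ids in rows:
        for pos, id_ in enumerate(ids if ids is not None else [None]):
            yield name, pos, id_


def remap(rows, mapping):
    """Explode, left-join each id to its new_id, then regroup by name."""
    grouped = defaultdict(list)
    for name, pos, id_ in explode(rows):
        # Left outer join: an id without a match gets None as new_id.
        grouped[name].append((pos, mapping.get(id_)))
    result = {}
    for name, pairs in grouped.items():
        ordered = sorted(pairs, key=lambda p: p[0])
        # Like collect_list, drop nulls; the null-array row ends up as [].
        result[name] = [new_id for _, new_id in ordered if new_id is not None]
    return result


remapped = remap(tbl_a, tbl_b)
print(remapped)
```

If the null rows must stay null rather than become empty arrays, one more step after the grouping is needed to turn empty results back into null for the names whose original array was null.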

Using two date columns in the source to decide the latest updated record in Informatica

I have a requirement as below:
I have a source table like
id | name | address | updt_date_1 | updt_date_2
1 | abc | xyz | 2000-01-01 | 1999-01-01
1 | abc | pqr | 2001-01-01 | 1999-01-01
2 | lmn | ghi | 1999-01-01 | 1999-01-01
2 | lmn | stu | 2000-01-01 | 2008-01-01
I want the target to contain:
1 | abc | pqr
2 | lmn | stu
i.e. I want the record with the latest update date in either of the two date columns, updt_date_1 or updt_date_2.
Please suggest how this can be implemented in Informatica PowerCenter.
This requirement can be achieved effectively using just 3 transformations (Source Qualifier, Expression, and Filter). Please see the steps below.
1) Use the following SQL override in the Source Qualifier transformation to reduce the two last_updated_date fields into one
SELECT
 id
,name
,address
,CASE WHEN updt_date_1 > updt_date_2 THEN updt_date_1 ELSE updt_date_2 END AS updt_date
FROM source_table
ORDER BY id, updt_date DESC
Now the first row for each id will be the required record.
2) Use an Expression transformation to flag the first row of each id. Use the following ports, in this order, in the Expression transformation (the prefix o_ means output port, v_ means variable port, and i_ means input port). The order matters because v_FIRST_ROW_FLAG must be evaluated before v_PREV_ID is overwritten with the current id.
PORT               EXPRESSION
v_FIRST_ROW_FLAG - IIF(v_PREV_ID = i_id, 'N', 'Y')
v_PREV_ID        - i_id
o_FIRST_ROW_FLAG - v_FIRST_ROW_FLAG
3) Next, add a Filter transformation to filter out records that do not satisfy the following condition:
IIF(o_FIRST_ROW_FLAG = 'Y', TRUE, FALSE)
Connect this filter transformation to the target definition. This will give you the expected output.
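The three steps can be traced on the sample rows from the question. The Python sketch below is only an illustration of the logic (it is not Informatica code, and the function name is an assumption): reduce the two dates to the later one, sort by id with the newest date first, then keep the first row of each id by comparing against the previous id.

```python
from datetime import date

# Sample rows from the question: id, name, address, updt_date_1, updt_date_2
rows = [
    (1, "abc", "xyz", date(2000, 1, 1), date(1999, 1, 1)),
    (1, "abc", "pqr", date(2001, 1, 1), date(1999, 1, 1)),
    (2, "lmn", "ghi", date(1999, 1, 1), date(1999, 1, 1)),
    (2, "lmn", "stu", date(2000, 1, 1), date(2008, 1, 1)),
]


def latest_per_id(rows):
    # Step 1: collapse the two dates into one, then ORDER BY id, updt_date DESC.
    reduced = [(i, n, a, max(d1, d2)) for i, n, a, d1, d2 in rows]
    reduced.sort(key=lambda r: r[3], reverse=True)
    reduced.sort(key=lambda r: r[0])  # stable sort keeps newest first within an id

    # Steps 2 and 3: flag the first row of each id and keep only flagged rows.
    kept = []
    prev_id = None
    for i, n, a, _ in reduced:
        first_row = i != prev_id  # evaluated before prev_id is updated
        prev_id = i
        if first_row:
            kept.append((i, n, a))
    return kept


target = latest_per_id(rows)
print(target)
```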
Basically, we have to determine the maximum updt_date_1 and the maximum updt_date_2, then choose whichever of the two is later.
Use a Source Qualifier and then sort the data on id, name.
Add an Aggregator after it. Pull in the id, name, updt_date_1, and updt_date_2 columns. Create two output columns, max_upd_dt1 and max_upd_dt2, calculated as MAX(updt_date_1) and MAX(updt_date_2) respectively. Set group by on id, name.
Use a Joiner to join the Sorter output and the Aggregator output on id, name. Now you have two extra columns: max_upd_dt1 and max_upd_dt2.
Use an Expression transformation after the Joiner. Pull all columns in. Create two output ports with logic like below:
out_upd_dt1 = IIF(max_upd_dt1 > max_upd_dt2, max_upd_dt1, updt_date_1)
out_upd_dt2 = IIF(max_upd_dt1 < max_upd_dt2, max_upd_dt2, updt_date_2)
Use another Source Qualifier (sorted by id, name) and join it with the above Expression transformation. Join on:
id = id, name = name, out_upd_dt1 = updt_date_1, out_upd_dt2 = updt_date_2
Pick up id, name, address.
HTH
Koushik

Cassandra query on a map in the SELECT clause

I am new to Cassandra and I am trying to read a row from a table that contains these values:
siteId | country | someMap
1 | US | {a:b, x:z}
2 | PR | {a:b, x:z}
I have also created an index on the table using create index on columnfamily(keys(someMap));
but when I query with select * from table where siteId=1 and someMap contains key 'a'
it still returns the entire map:
1 | US | {a:b, x:z}
Can somebody tell me what I should do to get the value as
1 | US | {a:b}
You cannot. Even though each entry of a map, list, or set is stored internally as a column, you can only retrieve the whole collection, not part of it. Your query does not ask Cassandra for the map entry with key X; it asks for the rows whose map contains key X.
HTH,
Carlo
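Given that the whole collection comes back, one workaround is to trim the map on the client after the query returns. The Python sketch below shows only that trimming step; the rows are plain dicts standing in for whatever the driver returns, and the function name and sample data are assumptions for illustration.

```python
def project_map_key(rows, map_column, key):
    """Keep rows whose map contains `key`, trimming each map to that one entry."""
    trimmed = []
    for row in rows:
        collection = row.get(map_column) or {}
        if key in collection:
            # Copy the row so the original result set is left untouched.
            trimmed.append({**row, map_column: {key: collection[key]}})
    return trimmed


# Hypothetical rows, shaped like the table in the question.
rows = [
    {"siteId": 1, "country": "US", "someMap": {"a": "b", "x": "z"}},
    {"siteId": 2, "country": "PR", "someMap": {"x": "z"}},
]

result = project_map_key(rows, "someMap", "a")
print(result)
```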

Nested PL/SQL query in tabular form

I am trying to achieve the following result (the first line is header)
Level 1 | Level 2 | Level 3 | Level 4 | Person
Technicals | Development | Software | Team leader | Eric
Technicals | Development | Software | Team leader | Steven
Technicals | Development | Software | Team leader | Jana
How can I do so? I tried to use the following code. The first part creates the hierarchy, which works fine. The second part, getting the data into the above-mentioned table, is pretty painful.
SELECT * FROM ( /* level2 */
  SELECT * FROM ( /* level1 */
    SELECT * FROM arc.localnode /* create hierarchy */
    WHERE tree_id = 2408362
    CONNECT BY PRIOR node_id = parent_id
    START WITH parent_id IS NULL ) l1node
  LEFT JOIN names ON l1node.parent_id = names.name_id ) l2node
At this point, I am quite lost. A bit of guidance and suggestion would be a lot of help :-)
There are two tables. The first table has data like this:
NODE_ID | PREV_ID | NEXT_ID | PARENT_ID
1421864 |         | 3482917 | 1421768
3482981 | 3482917 | 1421866 | 1421768
3482911 | 3060402 | 3482913 | 1421768
3482917 | 1421864 | 3482981 | 1421768
This is complicated because the data is hierarchical. A PARENT_ID can itself be the NODE_ID of another row that has its own PARENT_ID. Similarly, a PARENT_ID can appear as a PREV_ID or NEXT_ID.
The names are in a separate table keyed by name_id. The name_id in that table corresponds to the NODE_ID of the main hierarchy table.
You can use the Stragg package mentioned on AskTom at the link below:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2196162600402
You can also refer to the link below in the Oracle forum:
https://forums.oracle.com/forums/thread.jspa?threadID=2258996
Kindly post CREATE and INSERT statements for your requirement so that we can test it and confirm.
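Independently of how it is written in SQL, the target layout amounts to walking from each leaf node up its parent links and writing the names root-first, one column per level. The Python sketch below illustrates that walk on made-up data; the node ids, the merged node/name structure, and the function names are assumptions, since no sample data for the names table was posted.

```python
# Hypothetical sample: node_id -> (parent_id, name), i.e. the hierarchy table
# already joined to the names table.
nodes = {
    1: (None, "Technicals"),
    2: (1, "Development"),
    3: (2, "Software"),
    4: (3, "Team leader"),
    5: (4, "Eric"),
    6: (4, "Steven"),
}


def path_to_root(node_id, nodes):
    """Follow parent links upward and return the names ordered root-first."""
    names = []
    seen = set()
    while node_id is not None:
        if node_id in seen:
            raise ValueError(f"cycle detected at node {node_id}")
        seen.add(node_id)
        parent_id, name = nodes[node_id]
        names.append(name)
        node_id = parent_id
    return names[::-1]


def leaf_rows(nodes):
    """One row per leaf: Level 1 .. Level N followed by the leaf itself."""
    parents = {parent_id for parent_id, _ in nodes.values()}
    return [path_to_root(n, nodes) for n in sorted(nodes) if n not in parents]


table = leaf_rows(nodes)
for row in table:
    print(" | ".join(row))
```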
