Pig Latin programming - Hadoop

I have a table loaded in a variable in pig whose schema looks like this:
What I want to accomplish through a Pig Latin script is to populate the values "JKL", "PQR" and so on in Col4, which is blank for the rest of the rows. Each blank row must copy the value from the previous non-blank cell in Col4. See the example below.
The target table should look like this:

If your requirement is to set Col4 to XYZ for all records where it is null or empty, you can use the following code snippet:
--Load input data
input_data = LOAD 'input.txt' USING PigStorage() AS (Col1:chararray, Col2:int, Col3:int, Col4:chararray);
--Perform operation on each record
input_data = FOREACH input_data GENERATE Col1, Col2, Col3, ((Col4 is null or TRIM(Col4) == '') ? 'XYZ' : Col4) as Col4;
Here, assuming your data is held in input_data, each record is checked: if the Col4 value is null or empty, it is replaced with the desired value (XYZ); otherwise the existing value is kept.

Is Col1 the same for all the rows? If yes, use the two projections below. Otherwise you first have to find the unique (Col1, Col4) pairs and remove the NULL values, then use the steps below.
Filter_one will capture Col1 and Col4 where Col4 is not NULL.
Filter_two will capture Col1, Col2 and Col3. Join Filter_two with Filter_one: the three columns of Filter_two are output first, and the second column of Filter_one is output in the fourth position.
Hope this helps.
The Pig script will look like this:
-- Load_Data is assumed to be the relation that was already loaded
-- (col1, col4) pairs, keeping only rows where col4 has a value
Filter_one = foreach Load_Data generate $0 as col1, $3 as col4;
Filter_one_temp = filter Filter_one by ($1 is not null);
-- (col1, col2, col3) for every row
Filter_two = foreach Load_Data generate $0 as col1, $1 as col2, $2 as col3;
-- attach each group's col4 to all of its rows
Join_filter = JOIN Filter_two by $0 LEFT, Filter_one_temp by $0;
generate_output = foreach Join_filter generate $0 as col1, $1 as col2, $2 as col3, $4 as col4;
store generate_output into 'dfs_path' using PigStorage(',');
Because I am storing the result with ',' as the delimiter, the output will look like this (shown here as tuples):
(ABC,34,23,XYZ)
(ABC,12,78,XYZ)
(ABC,4,21,XYZ)
(ABC,22,54,XYZ)
(DEF,32,455,JKL)
(DEF,21,45,JKL)
(DEF,45,687,JKL)
(DEF,232,565,JKL)
(DEF,23,32,JKL)
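To make the join-based approach above easier to follow, here is a minimal sketch of the same logic in Python rather than Pig. The sample rows are hypothetical, None stands for a blank Col4, and the sketch assumes each Col1 value has exactly one non-blank Col4 (otherwise the join in the Pig script would produce duplicate rows).

```python
# Hypothetical sample rows; None stands for a blank Col4.
rows = [
    ("ABC", 34, 23, "XYZ"),
    ("ABC", 12, 78, None),
    ("ABC", 4, 21, None),
    ("DEF", 32, 455, "JKL"),
    ("DEF", 21, 45, None),
]

# Filter_one_temp: (col1, col4) pairs where col4 is present.
lookup = {col1: col4 for col1, _, _, col4 in rows if col4 is not None}

# Filter_two joined back on col1: every row picks up its group's col4.
# A col1 with no known col4 keeps None, like the LEFT join in the script.
filled = [(col1, col2, col3, lookup.get(col1)) for col1, col2, col3, _ in rows]

for row in filled:
    print(row)
```

Note that this fills by Col1 group, not strictly by "previous row", so it only matches the requested result when the blank rows share Col1 with the row that carries the value.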

Related

How to use whereIn with an array having null values (conditionally)

I am new to Laravel and SQL. I have two scenarios, both of which depend on user input,
and the columns may have null values, so the final arrays will look like this:
DB Data
col1 col2
abc xyz
abc2 xyz2
null null
abc3 xyz3
Scenario1 :
col1 = ["abc","abc2","abc3"];
col2 = ["xyz","xyz2","xyz3"];
Scenario 2:
col1 = ["abc","abc2","abc"];
col2 = ["xyz","xyz2",null];
My query is based on user input; for example, col1 can contain an empty value (i.e. null), and so can col2.
I am using this query :
$query->whereIn( $field, $col1);
$query->whereIn( $field2, $col2);
For scenario 1:
everything works fine
col1 col2
abc xyz
abc2 xyz2
abc3 xyz3
For scenario 2:
it does not work with null
col1 col2
empty
And with this query:
$query->whereIn($field, $col1);
$query->orWhereNull($field);
$query->whereIn($field2, $col2);
$query->orWhereNull($field2);
I get no results.
Can anyone guide me in the right direction? How can I get the results that include null as well?
FYI, it's an AND query, i.e. filter on col1 && col2.
I can't test it right now, but you need to group the conditions for each column separately.
Something like:
// where() with a closure wraps the conditions for one column in parentheses,
// so the two column filters are still combined with AND.
$query->where(function ($query) use ($field, $col1) {
    if (in_array(null, $col1, true)) {
        // whereIn never matches NULL, so test for it explicitly.
        $query->whereNull($field)->orWhereIn($field, $col1);
    } else {
        $query->whereIn($field, $col1);
    }
});
$query->where(function ($query) use ($field2, $col2) {
    if (in_array(null, $col2, true)) {
        $query->whereNull($field2)->orWhereIn($field2, $col2);
    } else {
        $query->whereIn($field2, $col2);
    }
});
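The underlying reason is plain SQL behaviour: a NULL inside an IN list never matches a NULL column value, so the NULL check has to be a separate, parenthesised OR. A small sketch with Python's built-in sqlite3 module shows the difference. The table name t is made up for illustration, and the rows are the "DB Data" from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")  # hypothetical table name
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("abc", "xyz"), ("abc2", "xyz2"), (None, None), ("abc3", "xyz3")],
)

# A NULL inside the IN list does not match rows whose column is NULL.
plain = conn.execute(
    "SELECT col1, col2 FROM t "
    "WHERE col1 IN ('abc', 'abc2', NULL) AND col2 IN ('xyz', 'xyz2', NULL) "
    "ORDER BY col1"
).fetchall()

# Grouping "IN (...) OR IS NULL" per column keeps the overall AND intact.
grouped = conn.execute(
    "SELECT col1, col2 FROM t "
    "WHERE (col1 IN ('abc', 'abc2') OR col1 IS NULL) "
    "AND (col2 IN ('xyz', 'xyz2') OR col2 IS NULL) "
    "ORDER BY col1"
).fetchall()

print(plain)
print(grouped)
```

Only the grouped query returns the row where both columns are NULL. The grouping is what the closure form of the query builder is meant to produce.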

Filter records in Pig

Below is the data
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,dollar
61,62,63,64,pound
col1, col2 and col3 together form a unique key combination. The use case is to filter the data based on col5.
For each key combination, we need to filter out the record whose col5 value is "dollar", but only if the same combination also has a record with the value "pound".
The expected output is
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,pound
How should I proceed, since Pig has no special operators for this like Hive does?
A = load 'test1.csv' using PigStorage(',') as (col1:int,col2:int,col3:int,col4:int,col5:chararray);
B = FILTER A BY col5 == 'pound';
Get all the records with 'pound', then get all the records with 'dollar' whose key combination does not match any record with 'pound' in col5. Finally, marry them off ... UNION.
B = FILTER A BY col5 == 'pound';
C = JOIN A BY (col1,col2,col3) LEFT OUTER,B BY (col1,col2,col3);
D = FILTER C BY (B::col1 is null);
E = FOREACH D GENERATE A::col1,A::col2,A::col3,A::col4,A::col5;
F = UNION B,E;
DUMP F;
Output
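For reference, the same steps can be traced on the question's sample rows with a short Python sketch. This is not Pig, just an illustration of the logic: the variable names are made up, and the comments map each step back to the relations B, D/E and F above.

```python
rows = [
    (10, 20, 30, 40, "dollar"),
    (20, 30, 40, 50, "dollar"),
    (20, 30, 10, 50, "dollar"),
    (61, 62, 63, 64, "dollar"),
    (61, 62, 63, 64, "pound"),
]

# B: rows whose col5 is 'pound', plus the key combinations they cover.
pound_rows = [r for r in rows if r[4] == "pound"]
pound_keys = {r[:3] for r in pound_rows}

# D/E: rows of A with no matching key in B (the left join + null filter).
unmatched = [r for r in rows if r[:3] not in pound_keys]

# F: union of both parts.
result = pound_rows + unmatched

for row in result:
    print(row)
```

On this sample the result contains the three dollar rows without a pound counterpart and the single pound row, which matches the expected output in the question. The order of rows after a UNION in Pig is not guaranteed.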

How do I get matching values in PIG without using UDF?

Consider these as my input files,
Input 1: (File 1)
12,23,14,15,9
1,2,3,4,5
34,17,8
.
.
Input 2: (File 2)
12 Twelve
23 TwentyThree
34 ThirtyFour
.
.
I will be reading each line from "Input 1" file using my PIG script and I would like to get the results as below, based on the "Input 2" file.
Output:
Twelve,TwentyThree,Fourteen,Fifteen,Nine
One,Two,Three,Four,Five
.
.
Is it possible to achieve this without UDF ? Please let me know your suggestions.
Thanks in Advance !
This violates your criteria of 'No UDF' but the UDF is built-in so I suspect it will suffice.
Query:
data1 = LOAD 'file1' AS (val:chararray);
data2 = LOAD 'file2' AS (num:chararray, desc:chararray);
A = RANK data1; /* creates row number*/
B = FOREACH A GENERATE rank_data1, FLATTEN(TOKENIZE(val, ',')) AS num;
C = RANK B; /* used to keep tuple elements sorted in bag*/
D = JOIN C BY num, data2 BY num;
E = FOREACH D GENERATE C::rank_data1 AS rank_1:long
, C::rank_B AS rank_2:long
, data2::desc AS description;
grpd = GROUP E BY rank_1;
F = FOREACH grpd {
sorted = ORDER E BY rank_2;
GENERATE sorted;
};
X = FOREACH F GENERATE FLATTEN(BagToTuple(sorted.description));
DUMP X;
Output:
(Twelve,TwentyThree,Fourteen,Fifteen,Nine)
(One,Two,Three,Four,Five)
(ThirtyFour,Seventeen,Eight)
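The role of the two RANK steps is easier to see in a small Python sketch of the same pipeline. This is only an illustration: the lookup table below is an assumption, since the question shows just the first three lines of File 2, and the remaining entries are inferred from the expected output.

```python
lines = ["12,23,14,15,9", "1,2,3,4,5", "34,17,8"]

# Assumed contents of File 2 (only the first three entries appear in the question).
names = {
    "12": "Twelve", "23": "TwentyThree", "34": "ThirtyFour",
    "14": "Fourteen", "15": "Fifteen", "9": "Nine",
    "1": "One", "2": "Two", "3": "Three", "4": "Four", "5": "Five",
    "17": "Seventeen", "8": "Eight",
}

# First RANK numbers the lines; TOKENIZE + FLATTEN gives one record per token.
tokens = [
    (line_no, num)
    for line_no, line in enumerate(lines, start=1)
    for num in line.split(",")
]

# Second RANK numbers the tokens so their order can be restored after the join.
ranked = [(line_no, pos, num) for pos, (line_no, num) in enumerate(tokens, start=1)]

# JOIN with File 2 (inner join: tokens without a match are dropped).
joined = [(line_no, pos, names[num]) for line_no, pos, num in ranked if num in names]

# GROUP by line, ORDER by token position, then flatten the descriptions.
grouped = {}
for line_no, pos, description in sorted(joined):
    grouped.setdefault(line_no, []).append(description)

result = [",".join(grouped[line_no]) for line_no in sorted(grouped)]
print(result)
```

Because the join is an inner join, any number that is missing from File 2 silently disappears from its line in both the sketch and the Pig script.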
Here is a Hive solution:
--Load the data into Hive
CREATE TABLE file1 (
line array<string>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ',';
LOAD DATA INPATH '/tmp/test2/file1' OVERWRITE INTO TABLE file1;
CREATE TABLE file2 (
name string,
value string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
LOAD DATA INPATH '/tmp/test2/file2' OVERWRITE INTO TABLE file2;
--explode the rows from the first table and create a newid to use for correlation
CREATE TABLE file1_exploded
AS
WITH tmp
AS
(SELECT RAND() newid, line from file1)
SELECT newid, item FROM tmp
LATERAL VIEW EXPLODE (line) a AS item;
--apply substitutions using the second table, then join lines back together
SELECT CONCAT_WS(',', COLLECT_LIST(value))
FROM
file1_exploded
JOIN file2 ON item = name
GROUP BY newid;

Apache Pig: Filter one tuple on another?

I want to run a Pig script that splits the data into two relations (or whatever they're called in Pig) based on criteria in col2, manipulates col2 into another column, and then compares the two manipulated relations to do an additional exclude.
REGISTER /home/user1/piggybank.jar;
log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);
--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;
is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;
Splitting and manipulating is the easy part. This is where it gets complicated. . .
merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};
If I can figure out this line(s), the rest would fall in place.
merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;
STORE merge_limited into 'file';
Here's an example of the I/O:
col1 col2 manipulated
This qWerty W
Is qweRty R
An qwertY Y
Example qwErty E
Of qwerTy T
Example Qwerty Q
Data qWerty W
isnt
E
Y
col1 col2
This qWerty
Is qweRty
Of qwerTy
Example Qwerty
Data qWerty
I'm still not sure quite what you need, but I believe you can reproduce your input and output with the following (untested):
data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);
m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
test = COGROUP m BY manipulated, exclude BY excl;
-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);
With the COGROUP, you group all tuples in each relation by the grouping key. If the bag of tuples from exclude is empty, it means that the grouping key was not present in the exclude list, so you keep tuples from m with that key. Conversely, if the grouping key was present in exclude, that bag will not be empty and the tuples from m with that key will be filtered out.
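As an illustration of what the COGROUP and IsEmpty filter do, here is a rough Python equivalent run on the example data from the question. The dictionary of bags is a stand-in for Pig's cogrouped relation, and the manipulated values are copied from the question rather than computed by a UDF.

```python
from collections import defaultdict

# (col1, col2, manipulated) rows taken from the question's example.
m = [
    ("This", "qWerty", "W"),
    ("Is", "qweRty", "R"),
    ("An", "qwertY", "Y"),
    ("Example", "qwErty", "E"),
    ("Of", "qwerTy", "T"),
    ("Example", "Qwerty", "Q"),
    ("Data", "qWerty", "W"),
]
exclude = ["E", "Y"]

# COGROUP: one entry per grouping key, holding a bag of tuples per relation.
groups = defaultdict(lambda: {"m": [], "exclude": []})
for row in m:
    groups[row[2]]["m"].append(row)
for key in exclude:
    groups[key]["exclude"].append(key)

# FILTER ... BY IsEmpty(exclude), then FLATTEN(m).
final = [
    row
    for bags in groups.values()
    if not bags["exclude"]
    for row in bags["m"]
]

for row in final:
    print(row)
```

Changing `if not bags["exclude"]` to `if bags["exclude"]` corresponds to using NOT IsEmpty, which keeps only the rows whose key is in the exclude list.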

For each row of one table, count entries in another table pointing to each of those rows in Oracle

Not sure the title explains the problem well; this is what I'm working with,
I have the following tables,
-- table = kms_doc_ref_currnt_v
DOC_ID VARCHAR2(19)
TO_DOC_ID VARCHAR2(19)
BRANCH_ID NUMBER(8)
REF_TYP_CD VARCHAR2(20)
-- table = kms_fil_nm_t
DOC_ID VARCHAR2(19) PRIMARY KEY UNIQUE
For example, I can get a count of all kms_doc_ref_currnt_v records that have a to_doc_id = 59678, where 59678 is one value in kms_fil_nm_t, with this query,
select 'doc_id 59678 has ' || count(to_doc_id) as cnt from kms_doc_ref_currnt_v where branch_id=1 and ref_typ_cd in ('CONREF', 'KBA') and to_doc_id=59678;
kms_doc_ref_currnt_v.to_doc_id is a field that has one of the kms_fil_nm_t.doc_id values. kms_doc_ref_currnt_v.doc_id is also one of the values in kms_fil_nm_t.
The single query I'm looking for would loop over each kms_fil_nm_t.doc_id and count all the rows in kms_doc_ref_currnt_v that have a matching to_doc_id. Each row returned would look like the output of the query above. Here's example output,
doc_id 1 has 32
doc_id 2 has 314
doc_id 3 has 2718
doc_id 4 has 42
doc_id 5 has 128
doc_id 6 has 11235
.
.
.
Probably simple but I just can't figure it out.
Do a join with two tables and add a GROUP BY clause as below:
SELECT 'doc_id 59678 has ' || count(to_doc_id) as cnt
FROM kms_doc_ref_currnt_v kv, kms_fil_nm_t kt
WHERE kt.doc_id= kv.to_doc_id
AND kv.branch_id=1
AND kv.ref_typ_cd in ('CONREF', 'KBA')
AND kv.to_doc_id=59678
GROUP BY kv.to_doc_id;
EDIT:
To get all records from kms_doc_ref_currnt_v, irrespective of whether they have a matching reference in kms_fil_nm_t and without restricting to kv.to_doc_id=59678, do this:
SELECT 'doc_id 59678 has ' || count(to_doc_id) as cnt
FROM kms_doc_ref_currnt_v kv
LEFT JOIN kms_fil_nm_t kt
ON (kt.doc_id= kv.to_doc_id )
WHERE kv.branch_id=1
AND kv.ref_typ_cd in ('CONREF', 'KBA')
GROUP BY kv.to_doc_id;
To replace the hardcoded 59678, you may want to write:
SELECT 'doc_id ' || kt.doc_id || ' has ' || count(to_doc_id) as cnt
FROM kms_doc_ref_currnt_v kv
LEFT JOIN kms_fil_nm_t kt
ON (kt.doc_id= kv.to_doc_id )
WHERE kv.branch_id=1
AND kv.ref_typ_cd in ('CONREF', 'KBA')
GROUP BY kv.to_doc_id, kt.doc_id;
You need to use an outer join between the driving table that has all the doc_id values and the dependent table that may or may not have matching entries; and a group by clause to define what your aggregate function (count()) is operating against. Something like:
select 'doc_id ' || t.doc_id || ' has ' || count(v.to_doc_id)
from kms_fil_nm_t t
left join kms_doc_ref_currnt_v v
on v.to_doc_id = t.doc_id
and v.branch_id = 1
and v.ref_typ_cd in ('CONREF', 'KBA')
group by t.doc_id;
This assumes you want to know when a doc_id isn't used, so you want entries like doc_id 1234 has 0. If you don't want to see those then you could use an inner join instead of an outer - essentially just remove the word left - but if that is the case then you don't really need to join at all, you could just do:
select 'doc_id ' || v.to_doc_id || ' has ' || count(*)
from kms_doc_ref_currnt_v v
where v.branch_id = 1
and v.ref_typ_cd in ('CONREF', 'KBA')
group by v.to_doc_id;
Unless there are to_doc_id values which are not in the other table, which would be included in the results of this query, but excluded if the tables were joined.
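To see the outer-join version produce zero counts, here is a small sketch using Python's built-in sqlite3 module instead of Oracle. The table and column names come from the question, but the rows are invented for illustration. Note that the sketch counts v.to_doc_id: with a left join, count(*) would report 1 for a doc_id that has no matching rows, because the unmatched row itself is counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kms_fil_nm_t (doc_id TEXT PRIMARY KEY);
    CREATE TABLE kms_doc_ref_currnt_v (
        doc_id TEXT, to_doc_id TEXT, branch_id INTEGER, ref_typ_cd TEXT
    );
    -- Invented sample data
    INSERT INTO kms_fil_nm_t VALUES ('1'), ('2'), ('3');
    INSERT INTO kms_doc_ref_currnt_v VALUES
        ('2', '1', 1, 'CONREF'),
        ('3', '1', 1, 'KBA'),
        ('1', '2', 1, 'OTHER'),
        ('1', '2', 2, 'KBA');
""")

rows = conn.execute("""
    SELECT 'doc_id ' || t.doc_id || ' has ' || count(v.to_doc_id)
    FROM kms_fil_nm_t t
    LEFT JOIN kms_doc_ref_currnt_v v
           ON v.to_doc_id = t.doc_id
          AND v.branch_id = 1
          AND v.ref_typ_cd IN ('CONREF', 'KBA')
    GROUP BY t.doc_id
    ORDER BY t.doc_id
""").fetchall()

result = [r[0] for r in rows]
print(result)
```

The branch_id and ref_typ_cd conditions sit in the ON clause rather than in WHERE, so documents without qualifying references still appear with a count of 0.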

Resources