I want to union/merge two files using Pig, but this is different from a usual union. These are my files (h* are the file headers):
F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14
F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12
The resulting output must be a Union of these files like this :
FR :
h1,h2,h3,h4,h5,h6
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
One more difficulty is that I want to make it generic, so that it works for a dynamic number of common columns. Currently there are two common columns, but there could be three, one, or none at all. For example:
F1 :
h1,h2,h3,h4
a1,a2,a3,a4
F2
h5,h6,h7,h8
b1,b2,b3,b4
FR
h1,h2,h3,h4,h5,h6,h7,h8
a1,a2,a3,a4,,,,
,,,,b1,b2,b3,b4
Any hint or help is appreciated.
Here is how you can do it statically:
F1full = FOREACH F1 GENERATE h1,h2,h3,h4, NULL as h5, NULL as h6;
F2full = FOREACH F2 GENERATE NULL as h1,NULL as h2,h3,h4, h5, h6;
FR = UNION F1full, F2full;
Pig is not very flexible, so I don't think it is possible to generate this dynamically for the generic case within Pig itself.
If you want a solution for the generic case, you could use a language like Python to build the required statements from the metadata of the stored tables/files.
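As a rough sketch of that idea, the statements could be generated from the two header lists like this. The function name build_union_script is made up for illustration, the relation names F1 and F2 are taken from the static example, and the NULL padding simply mirrors the static statements above; treat it as a starting point, not a tested Pig workflow.

```python
def build_union_script(cols_a, cols_b, rel_a="F1", rel_b="F2"):
    """Return Pig statements that pad two relations to one column list and union them.

    Hypothetical helper: it only builds text, it does not run Pig.
    """
    # Ordered union of the headers: all of A's columns, then B's columns not in A.
    all_cols = list(cols_a) + [c for c in cols_b if c not in cols_a]

    def projection(own_cols):
        # Keep a column the relation has; otherwise emit NULL under that name.
        return ", ".join(c if c in own_cols else f"NULL AS {c}" for c in all_cols)

    return "\n".join([
        f"{rel_a}full = FOREACH {rel_a} GENERATE {projection(cols_a)};",
        f"{rel_b}full = FOREACH {rel_b} GENERATE {projection(cols_b)};",
        f"FR = UNION {rel_a}full, {rel_b}full;",
    ])


script = build_union_script(["h1", "h2", "h3", "h4"], ["h3", "h4", "h5", "h6"])
print(script)
```

Because the column list is computed, the same function covers three, one, or zero common columns without any change.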
I tried to solve the problem using the following approach:
1) Load both files.
2) Add a counter to generate a unique field (ID).
3) Start the counter for file B where the counter for file A ended.
4) Cogroup both files on the common columns, including the counter.
5) Take all group columns into a separate relation.
6) Generate the uncommon columns from both files, along with the counter.
7) First join the uncommon columns from file A with the group columns on the counter.
8) Join the result of step 7 with the uncommon columns from file B on the counter.
Following is the Pig script that does this. As the script is generic, I have listed all the parameters it requires before it can be run.
-- Parameters required : $file1_path, $file2_path, $file1_schema, $file2_schema, $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B, $UNCMN_COLUMN_A(Unique columns in file A), $UNCMN_COLUMN_B.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);
RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;
COGRP_RANK_AB = COGROUP RANK_A BY($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);
CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
UNCMN_RA = FOREACH RANK_A GENERATE $UNCMN_COLUMN_A;
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY(rank_A) LEFT OUTER, UNCMN_RA by rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY(CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB by rank_B;
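-- FINAL_DATA is not defined in the script as posted; it is presumably a projection of JOIN_CMN_UNCMN_B.
STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');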
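The column parameters for this script could be derived from the two header lines instead of being typed by hand. The sketch below is only my reading of steps 4 and 6 (the counter is included in both the common and the uncommon column lists); derive_params is a made-up name, and the '~' separator is taken from the CSVExcelStorage calls above.

```python
def derive_params(header_a, header_b, sep="~"):
    """Split two header lines into the column lists the script expects.

    Hypothetical helper. Assumes the counter columns are named rank_A and
    rank_B, which is how RANK names its output field for aliases A and B.
    """
    cols_a = header_a.strip().split(sep)
    cols_b = header_b.strip().split(sep)
    common = [c for c in cols_a if c in cols_b]
    only_a = [c for c in cols_a if c not in common]
    only_b = [c for c in cols_b if c not in common]
    return {
        "CMN_COLUMN_A": ",".join(["rank_A"] + common),
        "CMN_COLUMN_B": ",".join(["rank_B"] + common),
        "UNCMN_COLUMN_A": ",".join(["rank_A"] + only_a),
        "UNCMN_COLUMN_B": ",".join(["rank_B"] + only_b),
    }


params = derive_params("h1~h2~h3~h4", "h3~h4~h5~h6")
for name, value in params.items():
    print(f"{name}={value}")
```

The remaining parameters ($file1_path, $file2_path, the schemas, $COUNT_A and $store_path) still have to be supplied separately.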
I'm using yajra's laravel/datatables plugin and I want to send only the first row of every group from the query, sorted by the record's date in descending order. This is my query:
$qsrecords = QualityScore::where('clientID', '=', $user['id'])
->whereBetween('day', array($startDate, $endDate))
->where($desiredValue, $operator, $quantity)
->where('previousQualityScore','!=','0');
This query returns every record for the given user ID, like:
Client ID | Keyword ID | Quality Score | Date
2         | 81         | 8             | 21.08.2016
2         | 42         | 9             | 19.08.2016
2         | 81         | 7             | 16.08.2016
2         | 42         | 5             | 14.08.2016
As you can see, I have 2 different keywords, and my query gives the output above.
But I want my query to generate results like:
Client ID | Keyword ID | Quality Score | Date
2         | 81         | 8             | 21.08.2016
2         | 42         | 9             | 19.08.2016
Only the latest record of every keyword. That's what I want to achieve.
This is how I send the query to the view:
// Send data to view via datatables plugin
return Datatables::of($qsrecords)->make(true);
Try this:
QualityScore::select('*', DB::raw('MAX(day) AS maxDay'))
->where('clientID', '=', $user['id'])
->whereBetween('day', array($startDate, $endDate))
->where($desiredValue, $operator, $quantity)
->where('previousQualityScore', '!=', '0')
->groupBy('keywordID')
->get();
Finally I've figured it out! I changed my approach to this issue, and the following code solved it. Thanks @Kiran-sadvilkar for the suggestions.
$groupByMaxDateQuery = ' SELECT qs.adGroup,qs.keyword, qs.previousQualityScore, qs.qualityScore, qs.qualityScoreDifference, qs.day
FROM homestead.qualityScore AS qs
INNER JOIN (
SELECT adGroupID, keywordID, max(day)
AS MaxDay
FROM homestead.qualityScore
GROUP BY adGroupID,keywordID
)
innerTable ON qs.adGroupID = innerTable.adGroupID
AND qs.keywordID = innerTable.keywordID
AND qs.day = innerTable.MaxDay
WHERE qs.clientID = '.$user['id'].' AND
qs.day BETWEEN "'.$startDate.'" AND "'.$endDate.'" AND
qs.'.$desiredValue.' '.$operator.' '.$quantity.' AND
qs.previousQualityScore != 0
';
// Pull data from database with current conditions
$qsrecords = DB::table(DB::raw("($groupByMaxDateQuery) as qs"));
In the end I decided to use a raw SQL query, and now it works.
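For anyone who wants to try the "join against MAX(day) per group" idea in isolation, here is a small self-contained sketch using Python's sqlite3 module. It is not the Laravel code: the table is a cut-down, in-memory stand-in with only four columns, the rows are the sample data from the question, and the dates are stored as ISO strings so that MAX() orders them correctly. The values are bound with placeholders instead of being concatenated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qualityScore "
    "(clientID INTEGER, keywordID INTEGER, qualityScore INTEGER, day TEXT)"
)
# Sample rows from the question, with dates rewritten as YYYY-MM-DD.
conn.executemany(
    "INSERT INTO qualityScore VALUES (?, ?, ?, ?)",
    [
        (2, 81, 8, "2016-08-21"),
        (2, 42, 9, "2016-08-19"),
        (2, 81, 7, "2016-08-16"),
        (2, 42, 5, "2016-08-14"),
    ],
)

query = """
SELECT qs.clientID, qs.keywordID, qs.qualityScore, qs.day
FROM qualityScore AS qs
INNER JOIN (
    SELECT keywordID, MAX(day) AS maxDay
    FROM qualityScore
    WHERE clientID = ?
    GROUP BY keywordID
) AS latest
    ON qs.keywordID = latest.keywordID AND qs.day = latest.maxDay
WHERE qs.clientID = ?
ORDER BY qs.day DESC
"""
# The client ID is bound twice: once for the inner query, once for the outer one.
latest_rows = conn.execute(query, (2, 2)).fetchall()
print(latest_rows)
```

The inner query finds the newest day per keyword, and the join keeps only the full rows that match that day.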
I need the following output.
NE 50
SE 80
I am using a Pig query to count countries by zone.
c1 = group country by zone;
c2 = foreach c1 generate COUNT(country.zone), (
case country.zone
when 1 then 'NE'
else 'SE'
);
But I am not able to get this output. I am getting an error like the following:
2016-03-30 13:57:16,569 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: (Name: Equal Type: null Uid: null)incompatible types in Equal Operator left hand side:bag :tuple(zone:int) right hand side:int
Details at logfile: /home/cloudera/pig_1459370643493.log
But I was able to get the counts using the following query.
c2 = foreach c1 generate group, COUNT(country.zone);
This will give following output:
(1,50)
(2,80)
How can I output NE instead of 1 and SE instead of 2? I thought using CASE would help, but I am getting an error. Can anyone help?
EDIT
Pig 0.12.0 and later support the CASE expression:
c2 = FOREACH c1 GENERATE (CASE group
WHEN 1 THEN 'NE'
WHEN 2 THEN 'SE'
WHEN 3 THEN 'AE'
ELSE 'VR' END), COUNT(country.zone);
Older Pig Versions
Older Pig versions do not have a CASE expression, so your best option is to use a UDF. If the group values are limited to only two, you can use the bincond operator to check the value:
c2 = foreach c1 generate (group == 1 ? 'NE' : 'SE'), COUNT(country.zone);
If you have more than two values, nest the bincond operators. I used test values to generate the output.
Input
c2 = FOREACH c1 GENERATE (group == 1 ? 'NE' :
(group == 2 ? 'SE' :
(group == 3 ? 'AE' : 'VR'))), COUNT(country.zone);
In Pig 0.12 and later, you can use a CASE expression.
In your case, country.zone is a bag, and you can't compare a bag to an int.
With the answer posted above, I was getting this error:
mismatched input ')' expecting END.
So here is the updated, working code:
c2 = FOREACH c1 GENERATE (CASE group
WHEN 1 THEN 'NE'
WHEN 2 THEN 'SE'
WHEN 3 THEN 'AE'
ELSE 'VR' END), COUNT(country.zone);
Output:
(NE, 50)
(SE, 80)
(AE, 30)
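If it helps to see the logic outside Pig, here is the same group/count/label step as a plain Python sketch. The zone values are made-up test data chosen to match the counts above. The point it illustrates is that the label is looked up from the group key (a single value per group), not from the collection of rows in the group, which is what the original CASE on country.zone was trying to do.

```python
from collections import Counter

# Made-up test data: one zone value per country row.
zones = [1] * 50 + [2] * 80 + [3] * 30

# Same mapping as the CASE expression; anything else falls back to 'VR'.
labels = {1: "NE", 2: "SE", 3: "AE"}

counts = Counter(zones)  # group key -> number of rows in that group
result = [(labels.get(zone, "VR"), n) for zone, n in sorted(counts.items())]
print(result)
```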
I want to filter the records of data set A whose flight_delay_time is less than some specific value (x).
However, I get the value of x from another Pig query, so x is a tuple rather than a plain number.
Using the following statement throws an error:
B = FILTER A by flight_delay_time < x;
dump B;
The data in file A looks like this:
ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26
avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35
supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,68
lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,77
Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28
The value of x is (40), which is stored as a tuple.
The last column in the data above is flight_delay_time.
I am extracting the value of x as follows.
This is the data stored in C_CONTROL_BATCH.txt:
25
35
40
15
I used the following code to extract the value of x.
control_batch = LOAD 'C_CONTROL_BATCH.txt' AS (start:int);
variable = ORDER control_batch BY start DESC;
X = LIMIT variable 1;
Here is the solution:
INPUT
We have two input files:
airlinesdata.txt - Having the rawdata
ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26
avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35
supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,68
lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,77
Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28
x.txt - Having data from where we get the values of x -
20
30
35
38
37
40
29
The flight_delay_time column is the last column in the relation below and is of type int.
Note - If you don't declare the type here, the program will throw an exception saying that it can't cast from bytearray to int when you filter at the end.
rawdata = LOAD 'airlinesdata.txt' USING PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray,field6:chararray,field7:chararray,field8:chararray,field9:chararray,field10:chararray,field11:chararray,field12:chararray,field13:chararray,field14:chararray,field15:chararray,field16:chararray,field17:chararray,flight_delay_time:int);
x_data = LOAD 'x.txt' USING PigStorage() AS (x_val:int);
order_x_data = ORDER x_data BY x_val desc;
max_value = LIMIT order_x_data 1;
Here we are again casting the value to int for the filter condition to work.
max_value_casted = FOREACH max_value GENERATE $0 as (maxval:int);
Finally we can issue the filter query to get the results.
Note how the maxval is accessed below by using the . operator from the max_value_casted relation.
output_data = FILTER rawdata BY flight_delay_time < max_value_casted.maxval;
DUMP output_data;
OUTPUT - Rows whose flight_delay_time is smaller than the max value of x (40)
(ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26)
(avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35)
(Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28)
Hope it helps :)
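To double-check the expected output, the same filter can be mirrored in a few lines of Python. This is only a sanity check on the sample data above, not part of the Pig solution; the dictionary keeps just the first field and the flight_delay_time of each row.

```python
# First field and flight_delay_time (last column) of each row in airlinesdata.txt.
delays = {"ravi": 26, "avinash": 35, "supreeth": 68, "lavi": 77, "Venkat": 28}

# Contents of x.txt.
x_values = [20, 30, 35, 38, 37, 40, 29]

# Equivalent of ORDER ... DESC followed by LIMIT 1: take the single largest value.
max_x = max(x_values)

# Equivalent of FILTER rawdata BY flight_delay_time < max_value_casted.maxval.
kept = [name for name, delay in delays.items() if delay < max_x]
print(max_x, kept)
```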
I am using the ternary operator to include values in SUM() operation conditionally. Here is how I am doing it.
GROUPED = GROUP ALL_MERGED BY (fld1, fld2, fld3);
REPORT_DATA = FOREACH GROUPED
{ GENERATE group,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : 0) AS sum1,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : (GROUPED.fld5 * -1)) AS sum2;
}
Schema for ALL_MERGED is
{ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray, fld5:int}}
When I execute this, it gives me the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: SUM in {group: (fld1:chararray, fld2:chararray, fld3:chararray), ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray: fld5:int}}
What am I doing wrong here?
SUM is a UDF which takes a bag as input. What you are doing has a number of problems, and I suspect it would help you to review a good reference on Pig. I recommend Programming Pig, available for free online. To begin with, GROUPED has two fields: a tuple called group and a bag called ALL_MERGED, which is what the error message is trying to tell you. (I say "trying" because Pig error messages are often quite cryptic.)
Also, you cannot pass expressions to UDFs like you wish to do. Instead you will have to GENERATE these fields and then pass them afterward. Try this:
ALL_MERGED_2 =
FOREACH ALL_MERGED
GENERATE
fld1 .. fld5,
((fld4 == 'S') ? fld5 : 0) AS sum_me1,
((fld4 == 'S') ? fld5 : fld5*-1) AS sum_me2;
GROUPED = GROUP ALL_MERGED_2 BY (fld1, fld2, fld3);
DATA =
FOREACH GROUPED
GENERATE
group,
SUM(ALL_MERGED_2.sum_me1) AS sum1,
SUM(ALL_MERGED_2.sum_me2) AS sum2;
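The same two-step idea (compute the conditional value per row first, then sum per group) looks like this in plain Python. The rows are invented sample data; the field names follow the schema in the question.

```python
from collections import defaultdict

# Invented sample rows: (fld1, fld2, fld3, fld4, fld5).
rows = [
    ("a", "b", "c", "S", 10),
    ("a", "b", "c", "X", 4),
    ("d", "e", "f", "S", 3),
]

sums = defaultdict(lambda: [0, 0])  # group key -> [sum1, sum2]
for fld1, fld2, fld3, fld4, fld5 in rows:
    # Step 1: per-row conditional values, like the first FOREACH ... GENERATE.
    sum_me1 = fld5 if fld4 == "S" else 0
    sum_me2 = fld5 if fld4 == "S" else -fld5
    # Step 2: accumulate per group, like GROUP followed by SUM.
    key = (fld1, fld2, fld3)
    sums[key][0] += sum_me1
    sums[key][1] += sum_me2

report = {key: tuple(totals) for key, totals in sums.items()}
print(report)
```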