How to access each element of Date in Pig Latin?

Query:
records = LOAD 'input' using PigStorage(' ') as (id:int, name:chararray, desination:chararray, date:chararray, salary: long);
Sample input:
(10102,neha,developer,14/02/13,32000)
(10103,deva,admin,02/02/14,40000)
(10102,neha,developer,01/01/14,45000)
(10245,sasi,developer,01/01/14,20000)
(10109,surya,manager,01/02/2014,56000)
(10102,neha,developer,01/02/2014,45000)
(10245,sasi,developer,02/01/2014,25000)
I want to filter the above data by the year of the date (not the entire date).

Check if this works for you.
records = LOAD '/home/abhijit/Downloads/movies.txt' using PigStorage(',') as (id:int, name:chararray, desination:chararray, date:chararray, salary:int);
-- parse the date string; the pattern must match the layout of the input dates
todate_data = foreach records generate name,desination,ToDate(date,'dd/MM/yyyy') as (date_time:DateTime );
getyear_data = foreach todate_data generate name,desination,GetYear(date_time);
-- the year is the third field of getyear_data, i.e. $2
groupByYear = group getyear_data by $2;
The final output will be:
(2013,{(neha,developer,2013)})
(2014,{(deva,admin,2014),(neha,developer,2014),(sasi,developer,2014),(surya,manager,2014),(neha,developer,2014),(sasi,developer,2014)})
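The question asks for a filter on the year rather than a grouping. A minimal sketch of that, reusing the LOAD statement from the question and assuming the dates are in dd/MM/yyyy form with four-digit years. The alias names with_year and only_2014 and the target year 2014 are illustrative, not from the original posts:

```pig
-- LOAD statement as given in the question.
records = LOAD 'input' using PigStorage(' ') as (id:int, name:chararray, desination:chararray, date:chararray, salary: long);
-- Parse the string into a datetime and keep only its year (assumes dd/MM/yyyy input).
with_year = FOREACH records GENERATE id, name, desination, date, salary, GetYear(ToDate(date, 'dd/MM/yyyy')) AS year;
-- Keep the rows for one (illustrative) year.
only_2014 = FILTER with_year BY year == 2014;
DUMP only_2014;
```

The other parts of the date can be read the same way with the built-in GetMonth and GetDay functions.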

Related

I am getting an error while using FILTER in Pig; when I dump the result it gives an error

The code used in Pig is:
studentsR = LOAD 'hdfs://quickstart.cloudera:8020/students/students' using PigStorage() as (name:chararray,rollno:int);
resultR = LOAD 'hdfs://quickstart.cloudera:8020/students/results' using PigStorage() as (rollno:int,result:chararray);
joniR = JOIN studentsR BY rollno,resultR BY rollno;
filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result) ;
filterRPass = FILTER filterR BY resultR.result == 'pass';
dump filterRPass;
The error is as follows:
ERROR 0: Scalar has more than one row in the output. 1st : (1,fail), 2nd :(2,fail)

Try DUMP and DESCRIBE on every result set to see the output of each alias used.
Refer to: scalar-has-more-than-one-row-in-the-output
studentsR = LOAD '/home/user/students' using PigStorage(' ') as (name:chararray,rollno:int);
dump studentsR;
resultR = LOAD '/home/user/results' using PigStorage(' ') as (rollno:int,result:chararray);
dump resultR;
joniR = JOIN studentsR BY rollno,resultR BY rollno;
dump joniR;
filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
dump filterR;
filterRPass = FILTER filterR BY resultR::result == 'pass';
dump filterRPass;
Modifications:
I used a space as the delimiter in the input files, hence PigStorage(' ').
In filterR I removed the round brackets () around studentsR::name,studentsR::rollno,resultR::result, because they wrap the three fields in a single tuple and the dump output then has an extra pair of brackets.
grunt> filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result);
grunt> describe filterR;
filterR: {org.apache.pig.builtin.totuple_studentsR::name_100: (studentsR::name: chararray,studentsR::rollno: int,resultR::result: chararray)}
grunt> filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
grunt> describe filterR;
filterR: {studentsR::name: chararray,studentsR::rollno: int,resultR::result: chararray}
In filterRPass I used resultR::result instead of resultR.result.
I used a local set of files and ran Pig in local mode for testing.
cat students
a 1
b 2
c 3
cat results
3 pass
2 fail
5 pass
Dump results:
dump studentsR
(a,1)
(b,2)
(c,3)
dump resultR
(3,pass)
(2,fail)
(5,pass)
dump joniR
(b,2,2,fail)
(c,3,3,pass)
dump filterR --filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result);
((b,2,fail))
((c,3,pass))
dump filterR --filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
(b,2,fail)
(c,3,pass)
dump filterRPass; --filterRPass = FILTER filterR BY resultR::result == 'pass'; --or-- filterRPass = FILTER filterR BY $2 == 'pass';
(c,3,pass)
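The steps above can also be condensed by filtering the joined relation directly. This is a sketch under the same assumptions as the answer (space-delimited local files at the same paths). The aliases joined, passed and tidy are illustrative names:

```pig
-- Same space-delimited local files as in the answer above.
studentsR = LOAD '/home/user/students' USING PigStorage(' ') AS (name:chararray, rollno:int);
resultR = LOAD '/home/user/results' USING PigStorage(' ') AS (rollno:int, result:chararray);
joined = JOIN studentsR BY rollno, resultR BY rollno;
-- '::' names a column of the joined relation. 'resultR.result' would instead be
-- read as a scalar lookup into the whole resultR relation, which is only legal
-- when that relation has exactly one row.
passed = FILTER joined BY resultR::result == 'pass';
-- Rename the columns so later steps do not need the '::' prefixes.
tidy = FOREACH passed GENERATE studentsR::name AS name, studentsR::rollno AS rollno, resultR::result AS result;
DUMP tidy;
```

With the sample files shown above, this should leave the single row for student c.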

Data transformation using pig

I have a CSV file with two variables, salary and bonus, that I need to add together. The salary values contain commas, and the addition does not work in Pig, even when I try casting. The original post showed the dataset in a screenshot.
I used the following Pig script:
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray,salary: chararray,bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
B = FOREACH NoHeader GENERATE id,name,address,occupation, (double)salary + bonus as total ;
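One possible fix, sketched under the assumption that the salary column holds text such as "1,200" (commas used as thousands separators; the screenshot is not available to confirm this). The built-in REPLACE function strips the commas before the cast. The aliases emp, no_header, cleaned and totals are illustrative:

```pig
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp = LOAD 'data/employee.csv' USING CSVLoader AS (id:int, name:chararray, address:chararray, occupation:chararray, salary:chararray, bonus:double);
-- Drop the header row the same way the original script does.
ranked = RANK emp;
no_header = FILTER ranked BY rank_emp > 1;
-- Assumption: salary looks like "1,200". Remove the commas, then cast;
-- a string that still contains a comma cannot be parsed as a number.
cleaned = FOREACH no_header GENERATE id, name, address, occupation, (double)REPLACE(salary, ',', '') AS salary, bonus;
totals = FOREACH cleaned GENERATE id, name, address, occupation, salary + bonus AS total;
DUMP totals;
```

If the commas mean something else in the real data (for example a decimal comma), the replacement string would need to change accordingly.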

Get counts by iterating over a data bag, with a separate count for each value of a field

Below is the data I have. Its schema is:
student_name, question_number, actual_result (either false or Correct)
(b,q1,Correct)
(a,q1,false)
(b,q2,Correct)
(a,q2,false)
(b,q3,false)
(a,q3,Correct)
(b,q4,false)
(a,q4,false)
(b,q5,false)
(a,q5,false)
What I want is the count of correct and false answers for each student (a and b).

For the use case shared, the Pig script below is sufficient.
Pig script:
student_data = LOAD 'student_data.csv' USING PigStorage(',') AS (student_name:chararray, question_number:chararray, actual_result:chararray);
student_data_grp = GROUP student_data BY student_name;
student_correct_answer_data = FOREACH student_data_grp {
    answers = student_data.actual_result;
    correct_answers = FILTER answers BY actual_result=='Correct';
    incorrect_answers = FILTER answers BY actual_result=='false';
    GENERATE group AS student_name, COUNT(correct_answers) AS correct_ans_count, COUNT(incorrect_answers) AS incorrect_ans_count;
};
Input : student_data.csv :
b,q1,Correct
a,q1,false
b,q2,Correct
a,q2,false
b,q3,false
a,q3,Correct
b,q4,false
a,q4,false
b,q5,false
a,q5,false
Output : DUMP student_correct_answer_data:
-- schema : (student_name, correct_ans_count, incorrect_ans_count)
(a,1,4)
(b,2,3)
Ref : For more details on nested FOREACH
http://pig.apache.org/docs/r0.12.0/basic.html#foreach
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach

Use this:
data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray,result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
The output will look like:
(a,false,4)
(a,Correct,1)
(b,false,3)
(b,Correct,2)
I hope this is the output you are looking for.

SUM and AVG in Pig are not working

I am analyzing cluster user log files with the following Pig code:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
    grp = group;
    count = COUNT(A);
    avg = AVG(A.cpu_used);
    generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for GROUP and COUNT, but when I include AVG and SUM it shows this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias f_data
I checked the data types and they all look fine. Do you have any suggestions about what I missed? Thank you in advance for your help.

It's a syntax error. See http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig script:
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
    cpu_used_bag = A.cpu_used;
    GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
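The question's script loads without a schema, so every field starts out as a bytearray. A hedged variant that casts the needed fields explicitly before aggregating is sketched below. Whether the missing casts are what triggered ERROR 1066 in the original script is an assumption, not something the thread confirms. The aliases typed, by_user, stats and first_ten are illustrative:

```pig
-- Path and field positions ($6, $52) are taken from the question.
t_data = LOAD 'log_flies/*' USING PigStorage(',');
-- Cast explicitly so AVG and SUM receive real floats rather than bytearrays.
typed = FOREACH t_data GENERATE (chararray)$6 AS user, (float)$52 AS cpu_used;
by_user = GROUP typed BY user;
stats = FOREACH by_user GENERATE group AS user, COUNT(typed) AS jobs, AVG(typed.cpu_used) AS avg_cpu, SUM(typed.cpu_used) AS total_cpu;
first_ten = LIMIT stats 10;
DUMP first_ten;
```

Running DESCRIBE on typed before the GROUP is a quick way to confirm the column types are what the aggregates expect.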

Your Pig script has several problems:
Do not use the same alias on both sides of =.
Load with PigStorage(',') as (...) and declare your schema there.
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
Change this to:
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Change the f_data on the left to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
run in local mode
dump after every line
I wrote a sample script that mimics yours (it works):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
    count = COUNT(C);
    sum = SUM(C.cpu_used);
    avg = AVG(C.cpu_used);
    generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;

Apache Pig - How to extract sets of records

I'm a new user of Apache Pig. I have the data below:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I am trying to extract the following records:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Below is the code that I have tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
    order_item:chararray,
    order_pid: chararray,
    order_qty: chararray,
    order_price: chararray
);
It doesn't work.
Another attempt, saving into a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
    B: bag { T: tuple(
        order_pid: chararray,
        order_qty: chararray,
        order_price: chararray
    )}
);
Still doesn't work.

Can you try this?
input:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
D = FILTER C BY $0 is not null;
DUMP D;
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)
