Grouping and counting records in Pig Latin - hadoop

I am new to Pig Latin and am trying to solve the problem below:
Find the number of employees having a phone number with each area code.
EMPID ADD_ID ZIP SAL PHONE DAT
Abcd411 PbcDr60264 953492 46404 111-432-4193 20150113
Abcd874 PbcDr39353 186307 29873 100-432-9164 20150728
Abcd197 PbcDr46725 306185 31908 113-432-4191 20150410
Abcd160 PbcDr77738 330533 61313 105-432-2468 20151007
Abcd327 PbcDr10034 951703 39301 109-432-9235 20150805
Abcd172 PbcDr21679 683299 71686 105-432-5616 20150908
Abcd227 PbcDr57694 876619 46743 109-432-9181 20151101
Abcd900 PbcDr80166 970136 34242 105-432-7415 20150820
Abcd318 PbcDr34711 234066 10989 101-432-9667 20150906
Abcd702 PbcDr86734 997954 97688 105-432-6592 20151026
And below is the way I am trying to solve it.
empdata = LOAD '/home/cloudera/empData.txt' as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, dateOfJoin:long);
grpdata = GROUP empdata by SUBSTRING(phone, 0, INDEXOF(phone, '-' , 0));
dataCnt = foreach grpdata generate count(grpdata);
But I am getting an error stating: Invalid scalar projection: grpdata : A column needs to be projected from a relation for it to be used as a scalar
There is another problem statement for the same data set:
Find the number of employees whose date of joining is between 2015-01-01 and 2015-05-28.
I tried the solution below, but this time I am not getting any results.
empdata = LOAD '/home/cloudera/empData.txt' as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, doj:chararray);
filtDate = filter empdata by ToDate(doj, 'yyyyMMdd') >= ToDate('20150101', 'yyyymmdd') AND ToDate(doj, 'yyyyMMdd') <= ToDate('20150528', 'yyyymmdd');
Please help with explanation.

Try this (the USING clause has to come before AS):
empdata = LOAD '/home/cloudera/empData.txt' using PigStorage(' ') as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, dateOfJoin:long);
grpdata = GROUP empdata by SUBSTRING(phone, 0, INDEXOF(phone, '-' , 0));
dataCnt = foreach grpdata generate $0, COUNT(empdata);

You should count empdata, the bag inside each group, rather than grpdata:
dataCnt = foreach grpdata generate COUNT(empdata);
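The expected results for both problem statements can be cross-checked outside Pig. The following is a minimal Python sketch, not Pig code, that assumes the sample rows from the question are space-separated; the function names are made up for illustration. On the date filter, Pig's ToDate appears to follow Joda-Time style patterns, where lowercase mm means minutes and uppercase MM means months, so the 'yyyymmdd' bounds in the question are worth checking as well.

```python
from collections import Counter
from datetime import datetime

# Sample rows copied from the question (assumed to be space-separated).
ROWS = """\
Abcd411 PbcDr60264 953492 46404 111-432-4193 20150113
Abcd874 PbcDr39353 186307 29873 100-432-9164 20150728
Abcd197 PbcDr46725 306185 31908 113-432-4191 20150410
Abcd160 PbcDr77738 330533 61313 105-432-2468 20151007
Abcd327 PbcDr10034 951703 39301 109-432-9235 20150805
Abcd172 PbcDr21679 683299 71686 105-432-5616 20150908
Abcd227 PbcDr57694 876619 46743 109-432-9181 20151101
Abcd900 PbcDr80166 970136 34242 105-432-7415 20150820
Abcd318 PbcDr34711 234066 10989 101-432-9667 20150906
Abcd702 PbcDr86734 997954 97688 105-432-6592 20151026
""".splitlines()


def area_code_counts(rows):
    """Count employees per area code (the phone text before the first '-')."""
    counts = Counter()
    for row in rows:
        phone = row.split(" ")[4]
        counts[phone.split("-", 1)[0]] += 1
    return dict(counts)


def joined_between(rows, start, end):
    """Count employees whose joining date lies in [start, end], both yyyyMMdd."""
    fmt = "%Y%m%d"  # %m is the month here; %M would be minutes
    lo = datetime.strptime(start, fmt)
    hi = datetime.strptime(end, fmt)
    return sum(lo <= datetime.strptime(row.split(" ")[5], fmt) <= hi for row in rows)


print(area_code_counts(ROWS))
print(joined_between(ROWS, "20150101", "20150528"))
```

If the Pig scripts are working, the grouped counts and the filtered row count should agree with what this prints for the same input.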

Related

Improving Jira Oracle query performance

We have a process that imports Jira data into an Oracle database for reporting. The issue I am having at the moment is extracting custom fields and converting rows into columns in Oracle.
[image: jira custom data view]
[image: jira data view]
This is the query I am using to extract the data. The problem is that its performance just does not scale.
select A.*, (select cf.date_value from v_jira_custom_fields cf where cf.issue_id = a.issue_id and cf.custom_field_name = 'Start Date') Start_Date,
(select cf.number_value from v_jira_custom_fields cf where cf.issue_id = a.issue_id and cf.custom_field_name = 'Story Points') Story_Points,
(select cf.custom_value from v_jira_custom_fields cf where cf.issue_id = a.issue_id and cf.custom_field_name = 'Ready') Ready
from jira_data A
where A.project = 'DAK'
and A.issue_id = 2222
To really understand where the bottleneck is, we would need at least an execution plan and information about the indexes that exist.
Assuming you have indexes on issue_id and project in both tables, what I'd try next is to get rid of the 3 separate selects and join your jira_data to the pivoted custom fields view:
with P as (
    select
        project
      , issue_id
      , story_type_s
      , impacted_application_s
      , impacted_application_c
      , story_points_n
      , start_date_d
      , end_date_d
      , ready_c
    from v_jira_custom_fields
    pivot (
        max(string_value) as s
      , max(number_value) as n
      , max(text_value) as t
      , max(date_value) as d
      , max(custom_value) as c
        for customfield_id in (
            1 story_type
          , 2 impacted_application
          , 3 story_points
          , 4 start_date
          , 5 end_date
          , 6 ready
        )
    )
)
select
    A.*
  , P.start_date_d start_date
  , P.story_points_n story_points
  , P.ready_c ready
from jira_data A
join P on A.project = P.project and A.issue_id = P.issue_id
where A.project = 'DAK'
and A.issue_id = 2222
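The row-to-column idea can be tried on a toy data set. This is only a sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than Oracle, it swaps the PIVOT clause for plain conditional aggregation (max over a case expression), and the custom_fields table, its columns, and the sample values are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-ins for jira_data and the custom fields view.
conn.executescript("""
create table jira_data (project text, issue_id integer, summary text);
create table custom_fields (
    issue_id integer, custom_field_name text,
    date_value text, number_value real, custom_value text);
insert into jira_data values ('DAK', 2222, 'demo issue');
insert into custom_fields values
    (2222, 'Start Date',   '2015-01-13', null, null),
    (2222, 'Story Points', null,         5,    null),
    (2222, 'Ready',        null,         null, 'Yes');
""")

# One pass over the custom field rows: each max(case ...) picks out the value
# of one named field, turning several rows per issue into several columns.
PIVOT_SQL = """
select a.project, a.issue_id,
       max(case when cf.custom_field_name = 'Start Date'
                then cf.date_value end)   as start_date,
       max(case when cf.custom_field_name = 'Story Points'
                then cf.number_value end) as story_points,
       max(case when cf.custom_field_name = 'Ready'
                then cf.custom_value end) as ready
from jira_data a
left join custom_fields cf on cf.issue_id = a.issue_id
where a.project = 'DAK' and a.issue_id = 2222
group by a.project, a.issue_id
"""

row = conn.execute(PIVOT_SQL).fetchone()
print(row)
```

Whether this form or the PIVOT clause performs better on the real tables depends on the execution plan, so it is worth comparing both.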

Looking for whether a row exists in a subquery

Spoiler alert: I am fairly new to Oracle.
I have four tables: enrollments, courses/sections, standards, and grades.
We are running Honor Roll. I have queries on the first three tables that add various constraints needed to meet honor roll requirements. Then we look at the grades table. If they have a valid enrollment, in a valid course, meeting valid standards, then count up their scores. If their score qty meets thresholds, then they get Honors.
This code is not optimized, and likely can be done in a far better/more compact way I'm sure -- however, it only gets run a few times a year, so I'm willing to trade off optimization in order to increase human readability, so that I can continue to learn the fundamentals. So far I have:
WITH validCC (SELECT CC.ID AS CCID,
CC.STUDENTID AS STUDENTID,
CC.SECTIONID AS SECTIONID,
CC.TERMID AS TERMID,
STUDENTS.DCID AS STUDENTSDCID
FROM CC
INNER JOIN STUDENTS ON CC.STUDENTID = STUDENTS.ID
WHERE TERMID in (2700,2701)
AND CC.SCHOOLID = 406;
), --end validCC
validCrsSect (SELECT SECTIONS.ID AS SECTIONID,
SECTIONS.DCID AS SECTIONSDCID,
SECTIONS.EXCLUDEFROMHONORROLL AS SECTHR,
COURSES.COURSE_NUMBER AS COURSE_NUMBER,
COURSES.COURSE_NAME AS COURSE_NAME,
COURSES.EXCLUDEFROMHONORROLL AS CRSHR
FROM SECTIONS
INNER JOIN COURSES ON SECTIONS.COURSE_NUMBER = COURSES.COURSE_NUMBER AND SECTIONS.SCHOOLID = COURSES.SCHOOLID
WHERE SECTIONS.TERMID IN (2700,2701)
AND SECTIONS.SCHOOLID = 406
AND SECTIONS.EXCLUDEFROMHONORROLL = 0
AND COURSES.EXCLUDEFROMHONORROLL = 0
), --end validCrsSect
validStandard (SELECT STANDARDID,
IDENTIFIER,
TRANSIENTCOURSELIST
FROM STANDARD
WHERE isActive = 1
AND YEARID = 27
AND ( instr (STANDARD.identifier, 'MHS.TS', 1 ,1) > 0 --Is a valid standard for this criteria: MHS TS
or STANDARD.identifier = 'MHTC.TS.2' --or MHTC TS
or STANDARD.identifier = 'MHTC.TS.4' )
), --end validStandard
--sgsWithChecks (
SELECT sgs.STANDARDGRADESECTIONID AS SGSID,
sgs.STUDENTSDCID as STUDENTSDCID,
sgs.STANDARDID AS STANDARDID,
sgs.STORECODE AS STORECODE,
sgs.SECTIONSDCID AS SECTIONSDCID,
sgs.YEARID AS YEARID,
sgs.STANDARDGRADE AS STANDARDGRADE,
(select count(CCID) from validCC INNER JOIN STANDARDGRADESECTION sgs ON sgs.STUDENTSDCID = validCC.STUDENTSDCID and sgs.SECTIONSDCID = validCC.SECTIONID) as CC_OK,
(select count(SECTIONID) from validCrsSection INNER JOIN STANDARDGRADESECTION sgs ON sgs.SECTIONSDCID = validCrsSect.SECTIONSDCID) AS CRS_OK,
(select count(STANDARDID) from validStandard INNER JOIN STANDARDGRADESECTION sgs ON sgs.STANDARDID = validStandard.STANDARDID) AS STD_OK
FROM STANDARDGRADESECTION sgs
The 'OK' columns are in the vGrades table because the final SELECT (not included) goes through and counts up the instances of certain scores, filtering by those checks.
Frustratingly, there are two IDs in both the students table and the sections table (and it's not the same data). So when I go to link everything, some tables use ID as the FK, others use DCID as the FK; and I have to pull in an extra table to make that conversion. Makes the joins more fun that way I guess.
Each individual query works on its own, but I can't get the final select count() to work to pull their data. I tried embedding the initial queries as subqueries, but I couldn't pass the studentid into them, and it would run that query for each student, instead of once at the beginning.
My current error is:
Error starting at line : 13 in command -
SECTIONS.DCID AS SECTIONSDCID,
Error report -
Unknown Command
However before it was saying unknown table and referencing the last line of the join statement. All the table names are valid.
Thoughts?
I replaced the INNER JOIN with a simple WHERE condition. This seems to work.
(SELECT COUNT (CCID) FROM validCC WHERE sgs.STUDENTSDCID = validCC.STUDENTSDCID and sgs.SECTIONSDCID = validCC.SECTIONID) as CC_OK,
(SELECT COUNT (SECTIONID) FROM validCrsSect WHERE sgs.SECTIONSDCID = validCrsSect.SECTIONSDCID) AS CRS_OK,
(SELECT COUNT (STANDARDID) FROM validStandard WHERE sgs.STANDARDID = validStandard.STANDARDID) AS STD_OK
I removed the stray comma at the end of validStandard and replaced from validCrsSection with from validCrsSect (assuming it was meant to refer to that WITH clause and there isn't another validCrsSection table). I am also guessing that the counts are meant to be keyed to the current sgs row and not counts of the whole table. I make it this:
with validcc as
     ( select cc.id as ccid
            , cc.studentid
            , cc.sectionid
            , cc.termid
            , st.dcid as studentsdcid
       from   cc
              join students st on st.id = cc.studentid
       where  cc.termid in (2700, 2701)
       and    cc.schoolid = 406
     )
   , validcrssect as
     ( select s.id as sectionid
            , s.dcid as sectionsdcid
            , s.excludefromhonorroll as secthr
            , c.course_number
            , c.course_name
            , c.excludefromhonorroll as crshr
       from   sections s
              join courses c
                   on  c.course_number = s.course_number
                   and c.schoolid = s.schoolid
       where  s.termid in (2700, 2701)
       and    s.schoolid = 406
       and    s.excludefromhonorroll = 0
       and    c.excludefromhonorroll = 0
     )
   , validstandard as
     ( select standardid
            , identifier
            , transientcourselist
       from   standard
       where  isactive = 1
       and    yearid = 27
       and    (    instr(standard.identifier, 'MHS.TS', 1, 1) > 0
                or standard.identifier in ('MHTC.TS.2','MHTC.TS.4') )
     )
select sgs.standardgradesectionid as sgsid
     , sgs.studentsdcid
     , sgs.standardid
     , sgs.storecode
     , sgs.sectionsdcid
     , sgs.yearid
     , sgs.standardgrade
     , ( select count(*) from validcc
         where  validcc.studentsdcid = sgs.studentsdcid
         and    validcc.sectionid = sgs.sectionsdcid ) as cc_ok
     , ( select count(*) from validcrssect
         where  validcrssect.sectionsdcid = sgs.sectionsdcid ) as crs_ok
     , ( select count(*) from validstandard
         where  validstandard.standardid = sgs.standardid ) as std_ok
from   standardgradesection sgs;
This works with the six table definitions reverse-engineered as:
create table students
( id integer not null
, dcid integer );
create table cc
( id integer
, studentid integer
, sectionid integer
, termid integer
, schoolid integer );
create table courses
( course_number integer
, course_name varchar2(30)
, excludefromhonorroll integer
, schoolid integer );
create table sections
( id integer not null
, dcid integer
, excludefromhonorroll integer
, termid integer
, schoolid integer
, course_number integer );
create table standard
( standardid integer
, identifier varchar2(20)
, transientcourselist varchar2(50)
, isactive integer
, yearid integer );
create table standardgradesection
( standardgradesectionid integer
, studentsdcid integer
, standardid integer
, storecode integer
, sectionsdcid integer
, yearid integer
, standardgrade integer );
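The difference between re-joining the table inside the subquery and correlating on the current sgs row can be seen on a tiny example. This is a sketch only: it uses SQLite through Python's sqlite3 module instead of Oracle, and the two small tables (valid_std, sgs) and their contents are hypothetical stand-ins for the real ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature versions of validStandard and STANDARDGRADESECTION.
conn.executescript("""
create table valid_std (standardid integer);
create table sgs (id integer, standardid integer);
insert into valid_std values (1), (2);
insert into sgs values (10, 1), (11, 1), (12, 3);
""")

QUERY = """
select sgs.id,
       -- Joins a second copy of sgs inside the subquery, so the outer row
       -- is never referenced: the same whole-table count on every row.
       (select count(*)
          from valid_std v
          join sgs inner_sgs on inner_sgs.standardid = v.standardid) as whole_table,
       -- Correlated on the outer row: counts matches for this row only.
       (select count(*)
          from valid_std v
         where v.standardid = sgs.standardid) as per_row
from sgs
order by sgs.id
"""

rows = conn.execute(QUERY).fetchall()
for r in rows:
    print(r)
```

The whole_table column is identical on every row, while per_row is 1 or 0 depending on whether that row's standard is valid, which is the behaviour the OK columns seem to need.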

Pig script to find the max, min, avg, and sum of salary in each department

I get stuck after grouping the data by department number. These are the steps I followed:
grunt> A = load '/home/cloudera/naveen1/hive_data/emp_data.txt' using PigStorage(',') as (eno:int,ename:chararray,job:chararray,sal:float,comm:float,dno:int);
grunt> B = group A by dno;
grunt> describe B;
B: {group: int,A: {(eno: int,ename: chararray,job: chararray,sal: float,comm: float,dno: int)}}
Please let me know the steps after this. I am a bit confused about how the nested FOREACH statement executes.
The data contains eno, ename, sal, job, commission, deptno, and I want to extract the max sal in each dept along with the employee getting the highest salary.
Similarly for the min sal.
Use the aggregate functions after grouping.
C = FOREACH B GENERATE group,MAX(A.sal),MIN(A.sal),AVG(A.sal),SUM(A.sal);
DUMP C;
To get the name, eno, and max sal in each dept, sort the records within each group and take the top row:
C = FOREACH B {
    max_sal = ORDER A BY sal DESC;
    max_limit = LIMIT max_sal 1;
    GENERATE FLATTEN(max_limit);
}
DUMP C;
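For readers who find the nested FOREACH hard to picture, the same per-group logic can be written out in plain Python. This is an illustration of the idea, not Pig code; the three employee rows and the salary_stats helper are made up, with fields in the order of the LOAD schema (eno, ename, job, sal, comm, dno).

```python
from collections import defaultdict

# Hypothetical rows: (eno, ename, job, sal, comm, dno)
EMPLOYEES = [
    (1, "asha", "clerk", 1000.0, 0.0, 10),
    (2, "ben", "manager", 3000.0, 100.0, 10),
    (3, "cara", "analyst", 2000.0, 0.0, 20),
]


def salary_stats(rows):
    """Group rows by dno, then aggregate sal and pick the top earner per group."""
    by_dept = defaultdict(list)  # plays the role of GROUP A BY dno
    for row in rows:
        by_dept[row[5]].append(row)

    stats = {}
    for dno, members in by_dept.items():
        sals = [m[3] for m in members]
        stats[dno] = {
            "max": max(sals),
            "min": min(sals),
            "avg": sum(sals) / len(sals),
            "sum": sum(sals),
            # Same effect as ORDER ... BY sal DESC followed by LIMIT 1.
            "top": max(members, key=lambda m: m[3]),
        }
    return stats


stats = salary_stats(EMPLOYEES)
for dno, s in sorted(stats.items()):
    print(dno, s)
```

The inner loop body corresponds to what runs inside the braces of the nested FOREACH: it sees one department's bag of rows at a time.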

How to use the bincond operator in a group function in Pig

I need to group the data below on fname and lname.
(fname,lname,id)
abc,xyz,I
abc,xyz,N
ppp,xxx,I
ppp,XXX,I
In the id field I expect only two values, N or I. If I get both N and I for the same fname, lname combination, then I should use N as the id; otherwise I need to use the id value as given in the group.
I am expecting the results below:
abc,xyz,N
ppp,xxx,I
I have tried the code below and it works fine:
in =load '/testing/name.txt' USING PigStorage(',') as (fname:chararray,lname:chararray,id:chararray);
grp = group in by (fname,lname);
z = foreach grp generate FLATTEN(group) AS (fname,lname),(COUNT(in.id) >1 ? ('N') :BagToTuple(in.id))as id;
However, now I need to check the values of the id field instead of the count:
z = foreach grp generate FLATTEN(group) AS (fname,lname),((in.id == 'N' or in.id == 'I') ? ('N') :BagToTuple(in.id))as id;
However, it gives the error below:
(Name: Equal Type: null Uid: null)incompatible types in Equal Operator left hand side:bag :tuple(id:chararray) right hand side:chararray
It also gives this error:
Two inputs of BinCond must have compatible schemas. left hand side: #31:tuple(#32:chararray) right hand side: org.apache.pig.builtin.bagtotuple_3#35:tuple(id#36:int)
Please guide.
Are you loading a field that contains characters (N, I) into an int column? Change the load statement so that the id column type is chararray.
in =load '/testing/name.txt' USING PigStorage(',') as (fname:chararray,lname:chararray,id:chararray);
grp = group in by (fname,lname);
z = foreach grp generate FLATTEN(group) AS (fname,lname),(COUNT(in.id) > 1 && in.id matches 'N') ? ('N') : in.id;
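The rule itself (use N when a group contains both N and I, otherwise keep the id as given) can be spelled out in Python as a reference for the expected output. This is a sketch of the intended result, not a Pig solution; resolve_ids is a hypothetical helper, and it assumes the id field only ever holds N or I, as the question states.

```python
from collections import defaultdict

# Sample rows from the question: (fname, lname, id)
ROWS = [
    ("abc", "xyz", "I"),
    ("abc", "xyz", "N"),
    ("ppp", "xxx", "I"),
    ("ppp", "XXX", "I"),
]


def resolve_ids(rows):
    """Return one id per (fname, lname): 'N' if any row has N, else 'I'."""
    seen = defaultdict(set)
    for fname, lname, id_ in rows:
        # Exact, case-sensitive match, so ('ppp', 'XXX') is its own group.
        seen[(fname, lname)].add(id_)
    return {key: ("N" if "N" in ids else "I") for key, ids in seen.items()}


result = resolve_ids(ROWS)
for (fname, lname), id_ in result.items():
    print(fname, lname, id_)
```

Because only two values are possible, "both N and I present" and "N present" lead to the same answer, which is why a single membership test is enough here.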

ORA-30928: "Connect by filtering phase runs out of temp tablespace"

I have created a query that is used to display data in a label. This query is then stored in a program that we use. The query ran just fine until this morning, when it returned the error ORA-30928: "Connect by filtering phase runs out of temp tablespace". I Googled it and found that I can do either of the following:
Include a NO FILTERING hint - but did not work properly
Increase the temp tablespace - not applicable to me since this runs in a production server that I don't have any access to.
Are there other ways to fix this? By the way, below is the query that I use.
SELECT * FROM(
SELECT
gn.wipdatavalue
, gn.containername
, gn.l
, gn.q
, gn.d
, gn.l2
, gn.q2
, gn.d2
, gn.l3
, gn.q3
, gn.d3
, gn.old
, gn.qtyperbox
, gn.productname
, gn.slot
, gn.dt
, gn.ws_green
, gn.ws_pnr
, gn.ws_pcn
, intn.mkt_number dsn
, gn.low_number
, gn.high_number
, gn.msl
, gn.baketime
, gn.exptime
, NVL(gn.q, 0) + NVL(gn.q2, 0) + NVL(gn.q3, 0) AS qtybox
, row_number () over (partition by slot order by low_number) as n
FROM
(
SELECT
tr.*
, TO_NUMBER(SUBSTR(wipdatavalue, 1, INSTR (wipdatavalue || '-', '-') - 1)) AS low_number
, TO_NUMBER(SUBSTR(wipdatavalue, 1 + INSTR ( wipdatavalue, '-'))) AS high_number
, pm.msllevel MSL
, pm.baketime BAKETIME
, pm.expstime EXPTIME
FROM trprinting tr
JOIN CONTAINER c ON tr.containername = c.containername
JOIN a_lotattributes ala ON c.containerid = ala.containerid
JOIN product p ON c.productid = p.productid
LEFT JOIN otherdb.pkg_main pm ON trim(p.brandname) = trim(pm.pcode)
WHERE (c.containername = :lot OR tr.SLOT= :lot)
)gn
LEFT JOIN otherdb.intnr intn ON TRIM(gn.productname) = TRIM(intn.part_number)
connect by level <= HIGH_NUMBER + 1 - LOW_NUMBER and LOW_NUMBER = prior LOW_NUMBER and prior SYS_GUID() is not null
ORDER BY low_number,n
)
WHERE n LIKE :n AND wipdatavalue LIKE :wip AND ROWNUM <= 300 AND wipdatavalue NOT LIKE 0
I am using Oracle 11g too.
Thanks for the help everyone.
