I have the following Pig script, which is taking a lot of time to process 342 files with a 256 MB split size (testing only). Can anybody suggest improvements?
SPLIT filteredalnumcdrs into splitalnumcdrs_1 IF (
(SUBSTRING(aparty,2,3) == '-')),
splitalnumcdrs_2 OTHERWISE;
tmpsplitalnumcdrs_1 = FOREACH splitalnumcdrs_1 GENERATE aparty,srcgt,destgt,SUBSTRING(aparty,0,2) as splitaparty,bparty,smscgt,status,prepost;
groupsplitalnumcdrs_1 = GROUP tmpsplitalnumcdrs_1 BY (aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs_1 = FOREACH groupsplitalnumcdrs_1 {
uniqsplitalnumcdrs_1 = DISTINCT tmpsplitalnumcdrs_1.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs_1) as countalnumcdrs;
};
tmpsplitalnumcdrs_2 = FOREACH splitalnumcdrs_2 GENERATE aparty,srcgt,destgt,aparty as splitaparty_2,bparty,smscgt,status,prepost;
groupsplitalnumcdrs_2 = GROUP tmpsplitalnumcdrs_2 BY (aparty,srcgt,destgt,splitaparty_2,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs_2 = FOREACH groupsplitalnumcdrs_2 {
uniqsplitalnumcdrs_2 = DISTINCT tmpsplitalnumcdrs_2.(aparty,srcgt,destgt,splitaparty_2,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs_2) as countsplitalnumcdrs_2;
};
distinctalnumcdrs = UNION distinctsplitalnumcdrs_1,distinctsplitalnumcdrs_2;
alnumreportmap = FOREACH distinctalnumcdrs GENERATE aparty,smsiuc_udfs.mapgtabparty(srcgt,destgt,splitaparty,bparty),smscgt,status,prepost,countalnumcdrs PARALLEL 20;
alnumreportmapgroup = GROUP alnumreportmap BY (aparty,mappedreport,smscgt,status,prepost);
alnumreportmaprecord = FOREACH alnumreportmapgroup GENERATE FLATTEN(group),SUM(alnumreportmap.countalnumcdrs) as alnumsmscount;
You can avoid the UNION by folding the SUBSTRING logic into a single conditional projection and then grouping once:
tmpsplitalnumcdrs = FOREACH filteredalnumcdrs GENERATE aparty,srcgt,destgt,(SUBSTRING(aparty,2,3) == '-' ? SUBSTRING(aparty,0,2) : aparty) as splitaparty,bparty,smscgt,status,prepost;
groupsplitalnumcdrs = GROUP tmpsplitalnumcdrs BY (aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs = FOREACH groupsplitalnumcdrs GENERATE FLATTEN(group), COUNT(tmpsplitalnumcdrs) as countsplitalnumcdrs;
Also, why do you need this line from your script? uniqsplitalnumcdrs is computed but never used in the GENERATE:
uniqsplitalnumcdrs = DISTINCT tmpsplitalnumcdrs.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
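Putting it together with the tail of your pipeline, the remaining steps could look roughly like this (a sketch, untested; it reuses your smsiuc_udfs.mapgtabparty UDF, aliases its output as mappedreport so it can be grouped on, and moves PARALLEL onto the GROUP, which is where it actually takes effect):
alnumreportmap = FOREACH distinctsplitalnumcdrs GENERATE aparty, smsiuc_udfs.mapgtabparty(srcgt,destgt,splitaparty,bparty) as mappedreport, smscgt, status, prepost, countsplitalnumcdrs;
alnumreportmapgroup = GROUP alnumreportmap BY (aparty,mappedreport,smscgt,status,prepost) PARALLEL 20;
alnumreportmaprecord = FOREACH alnumreportmapgroup GENERATE FLATTEN(group), SUM(alnumreportmap.countsplitalnumcdrs) as alnumsmscount;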
Related
The following query works, but I want to get the same result without using grp.Sum(). Can we do it?
(from item in (await VehicleReplaceCostDataAsync())
group item by (item.type, item.size, item.ADA, item.eseq) into grp
orderby (grp.Key.eseq, grp.Key.size, grp.Key.ADA)
select new VehicleReplacementCost
{
type = grp.Key.type,
size = grp.Key.size,
ADA = grp.Key.ADA,
count = grp.Sum(x => x.count),
cost = grp.Sum(x => x.cost),
Fcount = grp.Sum(x => x.Fcount),
Fcost = grp.Sum(x => x.Fcost),
eseq = grp.Key.eseq,
}).ToList();
Perhaps by using .Aggregate()? [docs]
count = grp.Aggregate(0, (a, b) => a + b.count)
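For example, the whole projection could use Aggregate in place of Sum, roughly like this (untested; the seeds assume count/Fcount are int and cost/Fcost are decimal, so adjust them to the real types):
select new VehicleReplacementCost
{
    type = grp.Key.type,
    size = grp.Key.size,
    ADA = grp.Key.ADA,
    // Aggregate folds the group by hand: start from the seed and add each item's field.
    count = grp.Aggregate(0, (acc, x) => acc + x.count),
    cost = grp.Aggregate(0m, (acc, x) => acc + x.cost),
    Fcount = grp.Aggregate(0, (acc, x) => acc + x.Fcount),
    Fcost = grp.Aggregate(0m, (acc, x) => acc + x.Fcost),
    eseq = grp.Key.eseq,
}
Under the hood this walks the group exactly like Sum does, so it will not be faster; it just avoids calling Sum.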
Thanks for the answer from Astrid. It looks like a good one, but I didn't test it. My colleague gave me this solution instead, using yield:
var groups = costs
.GroupBy(type => (type.SystemId, type.Type, type.Size, type.ADA, type.Eseq))
.OrderBy(group => (group.Key.SystemId, group.Key.Eseq, group.Key.Size, group.Key.ADA));
foreach (var group in groups)
{
var result = new ProgramGuideVehicleCostRow
{
SystemId = group.Key.SystemId,
Type = group.Key.Type,
Size = group.Key.Size,
ADA = group.Key.ADA,
};
foreach (var row in group)
{
result.Cost += row.Cost;
result.Fcost += row.Fcost;
result.Count += row.Count;
result.Fcount += row.Fcount;
}
yield return result;
}
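Note that yield return only compiles inside an iterator method, so this snippet has to sit in a method returning IEnumerable&lt;ProgramGuideVehicleCostRow&gt;, along these lines (a skeleton only; SumByGroup and the CostRow element type are placeholder names, not from the original code):
private static IEnumerable<ProgramGuideVehicleCostRow> SumByGroup(IEnumerable<CostRow> costs)
{
    // paste the GroupBy/OrderBy query and the foreach with yield return shown above here
}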
I have the following LINQ query that takes 10 seconds or more to run. Is there a better way of writing it? It works, but it is just very slow:
var searchQuery = (from p in db.Property
where p.PropertyVendorId == loggedInUserId
from aues in db.ApplicationUserEvents
where aues.ApplicationUserEventsPropertyId == p.PropertyId
&& aues.ApplicationUserEventsFeedbackDate != null
group p by new { p.PropertyId, p.PropertyAddress1, p.PropertyAddress2, p.PropertyAddress3, p.PropertyZipOrPostcode } into pg
select new DashboardFeedback
{
PropertyNumber = pg.FirstOrDefault().PropertyNumber,
PropertyId = pg.FirstOrDefault().PropertyId,
PropertyReference = pg.FirstOrDefault().PropertyId,
PropertyAddress1 = pg.FirstOrDefault().PropertyAddress1,
PropertyAddress2 = pg.FirstOrDefault().PropertyAddress2,
PropertyZipOrPostcode = pg.FirstOrDefault().PropertyZipOrPostcode,
DashboardFeedbackChart = (
from aues2 in db.ApplicationUserEvents
where aues2.ApplicationUserEventsPropertyId == pg.FirstOrDefault().PropertyId
&& aues2.ApplicationUserEventsFeedbackDate != null
from fos in db.FeedbackOptions
where fos.FeedbackOptionsApplicationUserEventsId == aues2.ApplicationUserEventsId
from fo in db.FeedbackOption
where fos.FeedbackOptionsFeedbackOptionId == fo.FeedbackOptionId
group fo by new { fo.FeedbackOptionName, aues2.ApplicationUserEventsPropertyId } into g
select new DashboardFeedbackChart
{
FeedbackOptionName = g.FirstOrDefault().FeedbackOptionName,
FeedbackOptionNameCount = g.Count()
}).ToList<DashboardFeedbackChart>()
}).ToList();
One Property has many ApplicationUserEvents
One ApplicationUserEvents has many FeedbackOptions
One FeedbackOptions has one FeedbackOption
Thanks for any advice!
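One change that might help (a rough sketch, untested): read the columns from the group key instead of calling pg.FirstOrDefault() repeatedly, since each FirstOrDefault() on the group can turn into another subquery in the generated SQL. PropertyNumber is added to the key here on the assumption that it is determined by PropertyId:
var searchQuery = (from p in db.Property
                   where p.PropertyVendorId == loggedInUserId
                   from aues in db.ApplicationUserEvents
                   where aues.ApplicationUserEventsPropertyId == p.PropertyId
                         && aues.ApplicationUserEventsFeedbackDate != null
                   group p by new { p.PropertyId, p.PropertyNumber, p.PropertyAddress1, p.PropertyAddress2, p.PropertyAddress3, p.PropertyZipOrPostcode } into pg
                   select new DashboardFeedback
                   {
                       PropertyNumber = pg.Key.PropertyNumber,
                       PropertyId = pg.Key.PropertyId,
                       PropertyReference = pg.Key.PropertyId,
                       PropertyAddress1 = pg.Key.PropertyAddress1,
                       PropertyAddress2 = pg.Key.PropertyAddress2,
                       PropertyZipOrPostcode = pg.Key.PropertyZipOrPostcode,
                       DashboardFeedbackChart = (
                           from aues2 in db.ApplicationUserEvents
                           where aues2.ApplicationUserEventsPropertyId == pg.Key.PropertyId
                                 && aues2.ApplicationUserEventsFeedbackDate != null
                           from fos in db.FeedbackOptions
                           where fos.FeedbackOptionsApplicationUserEventsId == aues2.ApplicationUserEventsId
                           from fo in db.FeedbackOption
                           where fos.FeedbackOptionsFeedbackOptionId == fo.FeedbackOptionId
                           // the property id is fixed within this group, so group only by option name
                           group fo by fo.FeedbackOptionName into g
                           select new DashboardFeedbackChart
                           {
                               FeedbackOptionName = g.Key,
                               FeedbackOptionNameCount = g.Count()
                           }).ToList()
                   }).ToList();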
I have this LINQ query and am getting the results I need. However, it takes 5-6 seconds to show results on localhost, and I can't even run it on Azure.
I'm new to LINQ, and I'm sure I'm doing something inefficient.
Could someone point me in the right direction to optimize it?
var joblist = (from t in db.Tracking
group t by t.JobNumber into j
let id = j.Max(x => x.ScanDate)
select new
{
jn = j.Key,
ti = j.FirstOrDefault(y => y.ScanDate == id).TrackingId,
sd = j.FirstOrDefault(y => y.ScanDate == id).ScanDate,
lc = j.FirstOrDefault(y => y.ScanDate == id).LocationId
}).Where(z => z.lc == lid).Where(z => z.jn != null);
jfilter = (from tr in joblist
join lc in db.Location on tr.lc equals lc.LocationId
join lt in db.LocType on lc.LocationType equals lt.LocationType
select new ScanMod
{
TrackingId = tr.ti,
LocationName = lc.LocationName,
JobNumber = tr.jn,
LocationTypeName = lt.LocationTypeName,
ScanDate = tr.sd,
StoneId = ""
}).OrderByDescending(z => z.ScanDate);
UPDATE:
This query runs on Azure (S1), but it takes 30 seconds. The table has 500,000 rows, and I assume that OrderByDescending or FirstOrDefault is killing it...
var joblist = db.Tracking
.GroupBy(j => j.JobNumber)
.Select(g => g.OrderByDescending(j => j.ScanDate).FirstOrDefault());
jfilter = (from tr in joblist
join lc in db.Location on tr.LocationId equals lc.LocationId
join lt in db.LocType on lc.LocationType equals lt.LocationType
where tr.LocationId == lid
select new ScanMod
{
TrackingId = tr.TrackingId,
LocationName = lc.LocationName,
JobNumber = tr.JobNumber,
LocationTypeName = lt.LocationTypeName,
ScanDate = tr.ScanDate,
StoneId = ""
}).OrderByDescending(z => z.ScanDate);
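For what it's worth, one shape that often translates to better SQL here (a rough sketch, untested, assuming LINQ to Entities and the same schema): filter on the location first and pick the latest scan per job with a "no later scan exists" test, instead of GroupBy followed by OrderByDescending/FirstOrDefault:
jfilter = (from tr in db.Tracking
           where tr.LocationId == lid
                 // JobNumber != null mirrors the filter from the first version
                 && tr.JobNumber != null
                 // keep only the newest scan per job: no other row for the same job
                 // may have a later ScanDate (ties on ScanDate are all kept)
                 && !db.Tracking.Any(t2 => t2.JobNumber == tr.JobNumber
                                           && t2.ScanDate > tr.ScanDate)
           join lc in db.Location on tr.LocationId equals lc.LocationId
           join lt in db.LocType on lc.LocationType equals lt.LocationType
           orderby tr.ScanDate descending
           select new ScanMod
           {
               TrackingId = tr.TrackingId,
               LocationName = lc.LocationName,
               JobNumber = tr.JobNumber,
               LocationTypeName = lt.LocationTypeName,
               ScanDate = tr.ScanDate,
               StoneId = ""
           });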
I am analyzing cluster user log files with the following code in Pig:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for GROUP and COUNT, but when I include AVG and SUM, it shows this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias f_data
I checked the data types and all of them are fine. Do you have any suggestions about where I went wrong? Thank you in advance for your help.
It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
Your Pig script is full of errors:
Do not use the same alias on both sides of the = sign.
Declare your schema appropriately in the loader's AS (...) clause.
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
Change this to:
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Likewise, change the f_data on the left-hand side to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
Run in local mode.
Dump after every line.
I wrote a sample Pig script to mimic yours (working):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;
I am trying to generate aggregated output. The issue is that all the data is going to a single reducer (the FILTER and COUNT are creating the problem). How can I optimize the following script?
Expected output:
group, 10,2,12,34...
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
grp1 = GROUP data BY UA PARALLEL 50;
fr1 = FOREACH grp1 {
fltrCol1 = FILTER data BY Col1 == 'Other';
fltrCol2 = FILTER data BY Col2 == 'Other';
fltrCol3 = FILTER data BY Col3 == 'Other';
fltrCol4 = FILTER data BY col4 == 'Other';
fltrCol5 = FILTER data BY col5 == 'Other';
cnt_fltrCol1 = COUNT(fltrCol1);
cnt_fltrCol2 = COUNT(fltrCol2);
cnt_fltrCol3 = COUNT(fltrCol3);
cnt_fltrCol4 = COUNT(fltrCol4);
cnt_fltrCol5 = COUNT(fltrCol5);
GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}
You could move the filter logic before the GROUP by generating the fltrCol{1,2,3,4,5} columns as 0/1 integer flags and then summing them up; since SUM is algebraic, Pig can use the combiner and far less data reaches the reducers. Off the top of my head, here is the script:
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
-- assuming the real schema also declares the UA field used below, as in your script;
-- the relation is named flags because 'filter' is a reserved word in Pig
flags = FOREACH data GENERATE UA,
    ((Col1 == 'Other') ? 1 : 0) as fltrCol1,
    ((Col2 == 'Other') ? 1 : 0) as fltrCol2,
    ((Col3 == 'Other') ? 1 : 0) as fltrCol3,
    ((col4 == 'Other') ? 1 : 0) as fltrCol4,
    ((col5 == 'Other') ? 1 : 0) as fltrCol5;
grp1 = GROUP flags BY UA PARALLEL 50;
fr1 = FOREACH grp1 GENERATE group,
    SUM(flags.fltrCol1) as cnt_fltrCol1,
    SUM(flags.fltrCol2) as cnt_fltrCol2,
    SUM(flags.fltrCol3) as cnt_fltrCol3,
    SUM(flags.fltrCol4) as cnt_fltrCol4,
    SUM(flags.fltrCol5) as cnt_fltrCol5;