Optimizing pig script - hadoop

I am trying to generate aggregated output. The issue is that all the data is going to a single reducer(Filter and Count are creating a problem). How can I optimize the following script?
Expected output:
group, 10,2,12,34...
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
grp1 = GROUP data BY UA PARALLEL 50;
fr1 = FOREACH grp1 {
fltrCol1 = FILTER data BY Col1 == 'Other';
fltrCol2 = FILTER data BY Col2 == 'Other';
fltrCol3 = FILTER data BY Col3 == 'Other';
fltrCol4 = FILTER data BY col4 == 'Other';
fltrCol5 = FILTER data BY col5 == 'Other';
cnt_fltrCol1 = COUNT(fltrCol1);
cnt_fltrCol2 = COUNT(fltrCol2);
cnt_fltrCol3 = COUNT(fltrCol3);
cnt_fltrCol4 = COUNT(fltrCol4);
cnt_fltrCol5 = COUNT(fltrCol5);
GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}

You could put the filter logic before the group by adding fltrCol{1,2,3,4,5} columns as integers, than sum them up. From the top of my head here is the script :
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
filter = FOREACH data GENERATE UA,
((Col1 == 'Other') ? 1 : 0) as fltrCol1,
((Col2 == 'Other') ? 1 : 0) as fltrCol2,
((Col3 == 'Other') ? 1 : 0) as fltrCol3,
((Col4 == 'Other') ? 1 : 0) as fltrCol4,
((Col5 == 'Other') ? 1 : 0) as fltrCol5;
grp1 = GROUP data BY UA PARALLEL 50;
fr1 = FOREACH grp1 {
cnt_fltrCol1 = SUM(fltrCol1);
cnt_fltrCol2 = SUM(fltrCol2);
cnt_fltrCol3 = SUM(fltrCol3);
cnt_fltrCol4 = SUM(fltrCol4);
cnt_fltrCol5 = SUM(fltrCol5);
GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}

Related

C#: LINQ not returning the same result as SQL

I'm trying to convert the following SQL query to LINQ, but getting different result count with both,
SQL Query:
SELECT T5.CNTR, T5.BenefitCode,T5.ApprovedFlag,
T5.PaymentFrequencyCode, T5.InstalmentAmt, T5.TotalAmt,
T5.CarRego
FROM
dbo.EmployeeBenefit As T5
LEFT JOIN dbo.Payee ON T5.PayeeCntr = dbo.Payee.CNTR
LEFT JOIN dbo.BankDetails ON dbo.Payee.BankCntr = dbo.BankDetails.BankCntr
Left Join dbo.EmployeeCar As T4 on T5.EmployeeCarCntr=T4.Cntr
Inner Join dbo.EmployeeEntity As T1 On T5.EmployeeEntityCntr=T1.EmployeeEntityCntr
Inner Join dbo.EmployerEntity As T2 On T1.EmployerEntityCntr=T2.EmployerEntityCntr
where T5.EmployeeCntr = 117165
AND ((T5.EndDate is Null) OR (T5.EndDate >= GETDATE()))
LINQ:
var result = (from employeeBenefit in context.EmployeeBenefit
from payee in context.Payee.Where(x => x.Cntr == employeeBenefit.PayeeCntr).DefaultIfEmpty()
from bankDetails in context.BankDetails.Where(x => x.BankCntr == employeeBenefit.PayeeCntr).DefaultIfEmpty()
from employeeCar in context.EmployeeCar.Where(x => x.Cntr == payee.BankCntr).DefaultIfEmpty()
from employeeEntity in context.EmployeeEntity
where employeeEntity.EmployeeEntityCntr == employeeBenefit.EmployeeEntityCntr
from employeeEntity1 in context.EmployeeEntity
where employeeEntity.EmployerEntityCntr == employeeEntity1.EmployerEntityCntr
&& employeeBenefit.EmployeeCntr == iEmployeeID
&& (!employeeBenefit.EndDate.HasValue || employeeBenefit.EndDate >= DateTime.Now)
&& employeeBenefit.EmployeeCntr == 117165
&& employeeBenefit.CarRego == registration
select new
{
CNTR = employeeBenefit.Cntr,
BenefitCode = employeeBenefit.BenefitCode,
PaymentFrequencyCode = employeeBenefit.PaymentFrequencyCode,
InstalmentAmount = employeeBenefit.InstalmentAmt,
TotalAmount = employeeBenefit.TotalAmt,
CarRego = employeeBenefit.CarRego,
ApprovedFlag = employeeBenefit.ApprovedFlag
}).ToList();
Please let me know what i'm missing.
For the data in my database the SQL query is returning 10 records. But, the LINQ is returning 2700 records.
Not a full answer (I'm late for work) but:
var result = (from T5 in context.EmployeeBenefit
join PY in dbo.Payee on T5.PayeeCntr equals PY.CNTR into PY1
where T5.EmployeeCntr = 117165
select new {
CNTR = T5.Cntr,
...
}
).ToList();
It was the issue with the condition mismatch that i had done in the LINQ. The below query just worked fine. Thank you for helping me with the issue.
var result = (from employeeBenefit in context.EmployeeBenefit
from payee in context.Payee.Where(x => x.Cntr == employeeBenefit.PayeeCntr).DefaultIfEmpty()
from bankDetails in context.BankDetails.Where(x => x.BankCntr == payee.BankCntr).DefaultIfEmpty()
from employeeCar in context.EmployeeCar.Where(x => x.Cntr == employeeBenefit.EmployeeCarCntr).DefaultIfEmpty()
join employeeEntity in context.EmployeeEntity
on employeeBenefit.EmployeeEntityCntr equals employeeEntity.EmployeeEntityCntr
join employerEntity in context.EmployerEntity
on employeeEntity.EmployerEntityCntr equals employerEntity.EmployerEntityCntr
where employeeBenefit.EmployeeCntr == 117165 && (!employeeBenefit.EndDate.HasValue || employeeBenefit.EndDate >= DateTime.Now)
select new
{
CNTR = employeeBenefit.Cntr,
BenefitCode = employeeBenefit.BenefitCode,
PaymentFrequencyCode = employeeBenefit.PaymentFrequencyCode,
InstalmentAmount = employeeBenefit.InstalmentAmt,
TotalAmount = employeeBenefit.TotalAmt,
CarRego = employeeBenefit.CarRego,
ApprovedFlag = employeeBenefit.ApprovedFlag
}).ToList();

Pig Performance Issues

I have following PIG script which is taking lot of time for processing 342 files with 256 MB as split size(testing only). Can anybody suggest improvement:
SPLIT filteredalnumcdrs into splitalnumcdrs_1 IF (
(SUBSTRING(aparty,2,3) == '-')),
splitalnumcdrs_2 OTHERWISE;
tmpsplitalnumcdrs_1 = FOREACH splitalnumcdrs_1 GENERATE aparty,srcgt,destgt,SUBSTRING(aparty,0,2) as splitaparty,bparty,smscgt,status,prepost;
groupsplitalnumcdrs_1 = GROUP tmpsplitalnumcdrs_1 BY (aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs_1 = FOREACH groupsplitalnumcdrs_1 {
uniqsplitalnumcdrs_1 = DISTINCT tmpsplitalnumcdrs_1.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs_1) as countalnumcdrs;
};
tmpsplitalnumcdrs_2 = FOREACH splitalnumcdrs_2 GENERATE aparty,srcgt,destgt,aparty as splitaparty_2,bparty,smscgt,status,prepost;
groupsplitalnumcdrs_2 = GROUP tmpsplitalnumcdrs_2 BY (aparty,srcgt,destgt,splitaparty_2,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs_2 = FOREACH groupsplitalnumcdrs_2 {
uniqsplitalnumcdrs_2 = DISTINCT tmpsplitalnumcdrs_2.(aparty,srcgt,destgt,splitaparty_2,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs_2) as countsplitalnumcdrs_2;
};
distinctalnumcdrs = UNION distinctsplitalnumcdrs_1,distinctsplitalnumcdrs_2;
alnumreportmap = FOREACH distinctalnumcdrs GENERATE aparty,smsiuc_udfs.mapgtabparty(srcgt,destgt,splitaparty,bparty),smscgt,status,prepost,countalnumcdrs PARALLEL 20;
alnumreportmapgroup = GROUP alnumreportmap BY (aparty,mappedreport,smscgt,status,prepost);
alnumreportmaprecord = FOREACH alnumreportmapgroup GENERATE FLATTEN(group),SUM(alnumreportmap.countalnumcdrs) as alnumsmscount;
you can avoid union
tmpsplitalnumcdrs = foreach filteredalnumcdrs generate aparty,srcgt,destgt,(SUBSTRING(aparty,2,3) == '-' ?SUBSTRING(aparty,0,2):aparty) as splitaparty,bparty,smscgt,status,prepost;
distinctsplitalnumcdrs = FOREACH tmpsplitalnumcdrs {
uniqsplitalnumcdrs = DISTINCT tmpsplitalnumcdrs.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs) as countsplitalnumcdrs;
};
why do you need
uniqsplitalnumcdrs = DISTINCT tmpsplitalnumcdrs.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);

LINQ - Previous Record

All:
Lets say I have the following table:
RevisionID, Project_ID, Count, Changed_Date
1 2 4 01/01/2016: 01:02:01
2 2 7 01/01/2016: 01:03:01
3 2 8 01/01/2016: 01:04:01
4 2 3 01/01/2016: 01:05:01
5 2 15 01/01/2016: 01:06:01
I am ordering the records based on Updated_Date. A user comes into my site and edits record (RevisionID = 3). For various reasons, using LINQ (with entity framework), I need to get the previous record in the table, which would be RevisionID = 2 so I can perform calculations on "Count". If user went to edit record (RevisionID = 4), I would need to select RevisionID = 3.
I currently have the following:
var x = _db.RevisionHistory
.Where(t => t.Project_ID == input.Project_ID)
.OrderBy(t => t.Changed_Date);
This works in finding the records based on the Project_ID, but how then do I select the record before?
I am trying to do the following, but in one LINQ statement, if possible.
var itemList = from t in _db.RevisionHistory
where t.Project_ID == input.Project_ID
orderby t.Changed_Date
select t;
int h = 0;
foreach (var entry in itemList)
{
if (entry.Revision_ID == input.Revision_ID)
{
break;
}
h = entry.Revision_ID;
}
var previousEntry = _db.RevisionHistory.Find(h);
Here is the correct single query equivalent of your code:
var previousEntry = (
from r1 in db.RevisionHistory
where r1.Project_ID == input.Project_ID && r1.Revision_ID == input.Revision_ID
from r2 in db.RevisionHistory
where r2.Project_ID == r1.Project_ID && r2.Changed_Date < r1.Changed_Date
orderby r2.Changed_Date descending
select r2
).FirstOrDefault();
which generates the following SQL query:
SELECT TOP (1)
[Project1].[Revision_ID] AS [Revision_ID],
[Project1].[Project_ID] AS [Project_ID],
[Project1].[Count] AS [Count],
[Project1].[Changed_Date] AS [Changed_Date]
FROM ( SELECT
[Extent2].[Revision_ID] AS [Revision_ID],
[Extent2].[Project_ID] AS [Project_ID],
[Extent2].[Count] AS [Count],
[Extent2].[Changed_Date] AS [Changed_Date]
FROM [dbo].[RevisionHistories] AS [Extent1]
INNER JOIN [dbo].[RevisionHistories] AS [Extent2] ON [Extent2].[Project_ID] = [Extent1].[Project_ID]
WHERE ([Extent1].[Project_ID] = #p__linq__0) AND ([Extent1].[Revision_ID] = #p__linq__1) AND ([Extent2].[Changed_Date] < [Extent1].[Changed_Date])
) AS [Project1]
ORDER BY [Project1].[Changed_Date] DESC
hope I understood what you want.
Try:
var x = _db.RevisionHistory
.FirstOrDefault(t => t.Project_ID == input.Project_ID && t.Revision_ID == input.Revision_ID -1)
Or, based on what you wrote, but edited:
_db.RevisionHistory
.Where(t => t.Project_ID == input.Project_ID)
.OrderBy(t => t.Changed_Date)
.TakeWhile(t => t.Revision_ID != input.Revision_ID)
.Last()

NullReferenceException Error when trying to iterate a IEnumerator

I have a datatable and want to select some records with LinQ in this format:
var result2 = from row in dt.AsEnumerable()
where row.Field<string>("Media").Equals(MediaTp, StringComparison.CurrentCultureIgnoreCase)
&& (String.Compare(row.Field<string>("StrDate"), dtStart.Year.ToString() +
(dtStart.Month < 10 ? '0' + dtStart.Month.ToString() : dtStart.Month.ToString()) +
(dtStart.Day < 10 ? '0' + dtStart.Day.ToString() : dtStart.Day.ToString())) >= 0
&& String.Compare(row.Field<string>("StrDate"), dtEnd.Year.ToString() +
(dtEnd.Month < 10 ? '0' + dtEnd.Month.ToString() : dtEnd.Month.ToString()) +
(dtEnd.Day < 10 ? '0' + dtEnd.Day.ToString() : dtEnd.Day.ToString())) <= 0)
group row by new { Year = row.Field<int>("Year"), Month = row.Field<int>("Month"), Day = row.Field<int>("Day") } into grp
orderby grp.Key.Year, grp.Key.Month, grp.Key.Day
select new
{
CurrentDate = grp.Key.Year + "/" + grp.Key.Month + "/" + grp.Key.Day,
DayOffset = (new DateTime(grp.Key.Year, grp.Key.Month, grp.Key.Day)).Subtract(dtStart).Days,
Count = grp.Sum(r => r.Field<int>("Count"))
};
and in this code, I try to iterate it with the following code:
foreach (var row in result2)
{
//... row.DayOffset.ToString() + ....
}
this issue occurred :
Object reference not set to an instance of an object.
I think it happens when there's no record with above criteria.
I tried to change it to enumerator like this , and use MoveNext() to check the data is on that or not:
result2.GetEnumerator();
if (enumerator2.MoveNext()) {//--}
but still the same error.
whats the problem?
I guess in one or more rows Media is null.
You then call Equals on null, which results in a NullReferenceException.
You could add a null check:
var result2 = from row in dt.AsEnumerable()
where row.Field<string>("Media") != null
&& row.Field<string>("Media").Equals(MediaTp, StringComparison.CurrentCultureIgnoreCase)
...
or use a surrogate value like:
var result2 = from row in dt.AsEnumerable()
let media = row.Field<string>("Media") ?? String.Empty
where media.Equals(MediaTp, StringComparison.CurrentCultureIgnoreCase)
...
(note that the last approach is slightly different)

Query performance to compare date with many join and entity framework

I have this database model :
I use this query :
public List<Film> ListFilmsSortiesDes7DerniersJoursDVD()
{
DateTime dateDans7Jours = DateTime.Now.AddDays(7);
DateTime dateIlYa7Jours = DateTime.Now.AddDays(-7);
return Query(f => f.Releases.Where(r => r.Langue.langue_code == "FR" && r.TypeRelease.typerelease_code == "DVD").FirstOrDefault().release_date > dateIlYa7Jours
&& f.Releases.Where(r => r.Langue.langue_code == "FR" && r.TypeRelease.typerelease_code == "DVD").FirstOrDefault().release_date < dateDans7Jours && !string.IsNullOrEmpty(f.film_image)).ToList();
}
But the SQL generated have bad performance, about 1.3 seconds to return results (with SQL Server Express 2008 and I already have Index on correct fields):
SELECT [Extent1].[film_id] AS [film_id],
[Extent1].[film_image] AS [film_image],
[Extent1].[film_image_thumb] AS [film_image_thumb],
[Extent1].[film_format] AS [film_format],
[Extent1].[film_motsclefs] AS [film_motsclefs],
[Extent1].[film_nom] AS [film_nom],
[Extent1].[film_nomvf] AS [film_nomvf],
[Extent1].[film_synopsis] AS [film_synopsis],
[Extent1].[film_anneeproduction] AS [film_anneeproduction],
[Extent1].[film_budget] AS [film_budget],
[Extent1].[film_dateajout] AS [film_dateajout],
[Extent1].[film_actif] AS [film_actif],
[Extent1].[utilisateur_id] AS [utilisateur_id],
[Extent1].[film_francais] AS [film_francais],
[Extent1].[film_revenue] AS [film_revenue],
[Extent1].[filmgroupe_id] AS [filmgroupe_id]
FROM [dbo].[Film] AS [Extent1]
OUTER APPLY (SELECT TOP (1) [Filter1].[release_date] AS [release_date]
FROM (SELECT [Extent2].[film_id] AS [film_id],
[Extent3].[release_date] AS [release_date],
[Extent3].[typerelease_id] AS [typerelease_id]
FROM [dbo].[FilmRelease] AS [Extent2]
INNER JOIN [dbo].[Release] AS [Extent3]
ON [Extent3].[release_id] = [Extent2].[release_id]
INNER JOIN [dbo].[Langue] AS [Extent4]
ON [Extent3].[langue_id] = [Extent4].[langue_id]
WHERE N'FR' = [Extent4].[langue_code]) AS [Filter1]
INNER JOIN [dbo].[TypeRelease] AS [Extent5]
ON [Filter1].[typerelease_id] = [Extent5].[typerelease_id]
WHERE ([Extent1].[film_id] = [Filter1].[film_id])
AND (N'CINEMA' = [Extent5].[typerelease_code])) AS [Limit1]
CROSS APPLY (SELECT TOP (1) [Filter3].[release_date] AS [release_date]
FROM (SELECT [Extent6].[film_id] AS [film_id],
[Extent7].[release_date] AS [release_date],
[Extent7].[typerelease_id] AS [typerelease_id]
FROM [dbo].[FilmRelease] AS [Extent6]
INNER JOIN [dbo].[Release] AS [Extent7]
ON [Extent7].[release_id] = [Extent6].[release_id]
INNER JOIN [dbo].[Langue] AS [Extent8]
ON [Extent7].[langue_id] = [Extent8].[langue_id]
WHERE N'FR' = [Extent8].[langue_code]) AS [Filter3]
INNER JOIN [dbo].[TypeRelease] AS [Extent9]
ON [Filter3].[typerelease_id] = [Extent9].[typerelease_id]
WHERE ([Extent1].[film_id] = [Filter3].[film_id])
AND (N'CINEMA' = [Extent9].[typerelease_code])) AS [Limit2]
WHERE ([Limit1].[release_date] > '2013-02-04T00:07:48' /* #p__linq__0 */)
AND ([Limit2].[release_date] < '2013-02-18T00:07:48' /* #p__linq__1 */)
AND ([Extent1].[film_image] IS NOT NULL)
Do you please have any ideas to improve performance of this query ?
Ok why search complicated when the answer is simple :
public List<Film> ListFilmsSortiesDes7DerniersJoursCinema()
{
DateTime dateDans7Jours = DateTime.Now.AddDays(7);
DateTime dateIlYa7Jours = DateTime.Now.AddDays(-7);
return Query(f => f.Releases.Where(r => r.Langue.langue_code == "FR" && r.TypeRelease.typerelease_code == "CINEMA" && r.release_date > dateIlYa7Jours && r.release_date < dateDans7Jours).Any()).ToList();
}
I did a join in too

Resources