Apache PIG: apply LIMIT only if parameter is > 0 - hadoop

How can I achieve the following in PIG, within a foreach:
REL = foreach RELS {
    if ( cnt == 0 )
        limited_result = NULL/Empty;
    else
        limited_result = LIMIT results cnt;
    generate limited_result.some_field;
}
I cannot use LIMIT on its own, since it validates that 'cnt' is greater than 0.
I've tried to use a SPLIT, but apparently it isn't supported within a foreach.

How about FILTERing before your FOREACH?
REL = foreach (filter RELS by cnt > 0) {
    limited_result = LIMIT results cnt;
    generate limited_result.some_field;
}
If you still need the records where cnt is 0, you could SPLIT first and then generate an empty bag when cnt is 0:
split RELS into ZERO if cnt == 0, NONZERO if cnt > 0;
NZ_LIM = foreach NONZERO {
    limited_result = LIMIT results cnt;
    generate limited_result.some_field;
}
Z_LIM = foreach ZERO generate {} as some_field;
REL = union NZ_LIM, Z_LIM;

Related

Optimised EmEditor macro to populate column based on another column for a large file

I’ve got a really large file, circa 10m rows, in which I’m trying to populate a column based on conditions on another column via a jsee macro. While it is quite quick for small files, it does take some time for the large file.
//pseudocode
//No sorting on Col1, which can have empty cells too
For all lines in file
IF (cell in Col2 IS empty) AND (cell in Col1 IS NOT empty) AND (cell in Col1 = previous cell in Col1)
THEN cell in Col2 = previous cell in Col2
//jsee code
document.CellMode = true; // Must be cell selection mode
totalLines = document.GetLines();
for( i = 1; i < totalLines; i++ ) {
    nref = document.GetCell( i, 1, eeCellIncludeNone );
    gsize = document.GetCell( i, 2, eeCellIncludeNone );
    if ( gsize == "" && nref != "" && nref == document.GetCell( i-1, 1, eeCellIncludeNone ) ) {
        document.SetCell( i, 2, document.GetCell( i-1, 2, eeCellIncludeNone ), eeAutoQuote );
    }
}
Input File:
Reference      Group Size
14/12/01819    1
14/12/01820    1
15/01/00191    4
15/01/00191
15/01/00191
15/01/00198
15/01/00292    3
15/01/00292
15/01/00292
15/01/00401    5
15/01/00401
15/01/00402    1
15/01/00403    2
15/01/00403
15/01/00403
15/01/00403
15/01/00404
20/01/01400    1
Output File:
Reference      Group Size
14/12/01819    1
14/12/01820    1
15/01/00191    4
15/01/00191    4
15/01/00191    4
15/01/00198
15/01/00292    3
15/01/00292    3
15/01/00292    3
15/01/00401    5
15/01/00401    5
15/01/00402    1
15/01/00403    2
15/01/00403    2
15/01/00403    2
15/01/00403    2
15/01/00404
20/01/01400    1
Any ideas on how to optimise this and make it run even faster?
I wrote a JavaScript macro for EmEditor for you. It reads each of the two columns with a single GetColumn call, fills in the missing values in memory, and writes the result back with one SetColumn call, instead of calling GetCell/SetCell for every row. You might need to set the correct numbers in the first two lines for iColReference and iColGroupSize.
iColReference = 1; // the column index of "Reference"
iColGroupSize = 2; // the column index of "Group Size"
document.CellMode = true; // Must be cell selection mode
sDelimiter = document.Csv.Delimiter; // retrieve the delimiter
nOldHeadingLines = document.HeadingLines; // retrieve old headings
document.HeadingLines = 0; // set No Headings
yBottom = document.GetLines(); // retrieve the number of lines
if( document.GetLine( yBottom ).length == 0 ) { // -1 if the last line is empty
--yBottom;
}
str = document.GetColumn( iColReference, sDelimiter, eeCellIncludeQuotes, 1, yBottom ); // get the whole "Reference" column from top to bottom, separated by the delimiter
sCol1 = str.split( sDelimiter );
str = document.GetColumn( iColGroupSize, sDelimiter, eeCellIncludeQuotes, 1, yBottom ); // get the whole "Group Size" column from top to bottom, separated by the delimiter
sCol2 = str.split( sDelimiter );
s1 = "";
s2 = "";
for( i = 0; i < yBottom; ++i ) { // loop through all lines
    if( sCol2[i].length != 0 ) { // "Group Size" is present, remember this pair
        s1 = sCol1[i];
        s2 = sCol2[i];
    }
    else {
        if( s1.length != 0 && sCol1[i] == s1 ) { // same "Reference" as the previous line, copy s2
            if( s2.length != 0 ) {
                sCol2[i] = s2;
            }
        }
        else { // different "Reference", reset s1 and s2
            s1 = "";
            s2 = "";
        }
    }
}
str = sCol2.join( sDelimiter );
document.SetColumn( iColGroupSize, str, sDelimiter, eeDontQuote ); // set whole 2nd column from top to bottom with the new values
document.HeadingLines = nOldHeadingLines; // restore the original number of headings
To run this, save this code as, for instance, Macro.jsee, and then select this file from Select... in the Macros menu. Finally, select Run Macro.jsee in the Macros menu.

How do I get the biggest and most frequent number in a dictionary?

I have a number
(23452)
and I want my function to return the digit that appears most often in this number; if several digits are tied, it should return the highest of them.
For the number above it should return '2' and for '225566' it should return '6'.
I tried:
def most_popular_digit(num):
    pop_dig = {}
    c = str(num)
    for n in range(len(c)):
        count = pop_dig.get(c[n], 0)
        count += 1
        pop_dig[c[n]] = count
    list_keys = pop_dig.keys()
    sorted_num = sorted(list_keys, key=pop_dig.get)
but I can't figure out how to also pick the highest digit among those with the most appearances.
Managed to figure it out:
def most_popular_digit(num):
    pop_dig = {}
    for digit in str(num):
        pop_dig[digit] = pop_dig.get(digit, 0) + 1
    # sort by count first and by the digit itself second, so the last
    # element is the highest digit among the most frequent ones
    sorted_num = sorted(pop_dig, key=lambda d: (pop_dig[d], d))
    return sorted_num[-1]
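For comparison, and not part of the original answer, the same behaviour can be written more compactly with collections.Counter and a tuple key that prefers higher counts first and higher digits second:
from collections import Counter

def most_popular_digit(num):
    counts = Counter(str(num))
    # highest count wins; ties are broken by the larger digit
    return max(counts, key=lambda d: (counts[d], d))

print(most_popular_digit(23452))   # '2'
print(most_popular_digit(225566))  # '6'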

Count values that are filtered - Apache PIG

I have the following statement
Values = FILTER Input_Data BY Fields > 0;
How do I count the number of records that pass the filter and the number that do not?
-- split into 2 datasets
SPLIT Input_Data INTO A IF Fields > 0, B IF Fields <= 0;
-- count > 0 records
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A);
-- count <= 0 records
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B);
Hope this will help!!
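As an alternative sketch (not from the answer above, and assuming Fields is never null), both counts can also be produced in a single pass with the bincond operator:
-- flag each record, then sum the flags over a single GROUP ALL
Flags   = FOREACH Input_Data GENERATE (Fields > 0 ? 1 : 0) AS is_pos, (Fields > 0 ? 0 : 1) AS is_nonpos;
All_grp = GROUP Flags ALL;
Counts  = FOREACH All_grp GENERATE SUM(Flags.is_pos) AS pos_count, SUM(Flags.is_nonpos) AS nonpos_count;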

Filter with Linq comma separated field

I have a table with a field called COMMA_SEPARATED_VALUES. How can I filter, with a single LINQ query (I have to integrate it into a larger query), all rows where at least one of the entries falls within a given integer range?
Table TEST
ID COMMA_SEPARATED_VALUES
-----------------------------------
1 '1,2,3,4'
2 '1,5,100,4,33'
3 '666,999'
4 '5,55,5'
Filter for Range "10 - 99" would result in
ID
------------------------
2 (because of 33)
4 (because of 55)
If you are aware of the performance side effects of calling the AsEnumerable() method (the filtering then runs in memory rather than in the database) and that is acceptable:
int lowerBound = 10; // lower bound of your range
int upperBound = 99; // upper bound of your range
var d = from row in context.Test.AsEnumerable()
        let integers = row.COMMA_SEPARATED_VALUES
                          .Split(new char[] { ',' })
                          .Select(p => int.Parse(p))
        where integers.Any(p => p >= lowerBound && p <= upperBound)
        select row;
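To sanity-check the idea against the sample table, here is a small self-contained sketch that uses in-memory rows instead of the real context (the anonymous-type rows below are illustrative, not the original entity class):
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var rows = new[]
        {
            new { ID = 1, COMMA_SEPARATED_VALUES = "1,2,3,4" },
            new { ID = 2, COMMA_SEPARATED_VALUES = "1,5,100,4,33" },
            new { ID = 3, COMMA_SEPARATED_VALUES = "666,999" },
            new { ID = 4, COMMA_SEPARATED_VALUES = "5,55,5" },
        };

        int lowerBound = 10; // lower bound of the range
        int upperBound = 99; // upper bound of the range

        // keep a row if any of its comma-separated entries falls inside the range
        var matches = rows.Where(r => r.COMMA_SEPARATED_VALUES
                                       .Split(',')
                                       .Select(int.Parse)
                                       .Any(v => v >= lowerBound && v <= upperBound));

        foreach (var r in matches)
            Console.WriteLine(r.ID); // prints 2 and 4
    }
}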

LINQ: GroupBy with maximum count in each group

I have a list of duplicate numbers:
Enumerable.Range(1,3).Select(o => Enumerable.Repeat(o, 3)).SelectMany(o => o)
// {1,1,1,2,2,2,3,3,3}
I group them and get the quantity of occurrences:
Enumerable.Range(1,3).Select(o => Enumerable.Repeat(o, 3)).SelectMany(o => o)
.GroupBy(o => o).Select(o => new { Qty = o.Count(), Num = o.Key })
Qty Num
3 1
3 2
3 3
What I really need is to limit the quantity per group to some number. If the limit is 2 the result for the above grouping would be:
Qty Num
2 1
1 1
2 2
1 2
2 3
1 3
So, if Qty = 10 and the limit is 4, the result is 3 rows (4, 4, 2). The Qty of each number is not necessarily equal as in the example above. The specified Qty limit is the same for the whole list (it doesn't differ per number).
Thanks
Some of the other answers are making the LINQ query far more complex than it needs to be. Using a foreach loop is certainly faster and more efficient, but the LINQ alternative is still fairly straightforward.
var input = Enumerable.Range(1, 3).SelectMany(x => Enumerable.Repeat(x, 10));
int limit = 4;
var query =
    input.GroupBy(x => x)
         // number the items within each group and give each a chunk index i / limit
         .SelectMany(g => g.Select((x, i) => new { Val = x, Grp = i / limit }))
         // regroup by (value, chunk) so every chunk becomes one output row
         .GroupBy(x => x, x => x.Val)
         .Select(g => new { Qty = g.Count(), Num = g.Key.Val });
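For reference, enumerating that query (a quick console sketch reusing the input and limit defined above) prints the chunk sizes described in the question:
foreach (var row in query)
    Console.WriteLine(row.Qty + " " + row.Num);
// 4 1
// 4 1
// 2 1
// 4 2
// 4 2
// 2 2
// 4 3
// 4 3
// 2 3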
There was a similar question that came up recently asking how to do this in SQL - there's no really elegant solution and unless this is Linq to SQL or Entity Framework (i.e. being translated into a SQL query), I'd really suggest that you not try to solve this problem with Linq and instead write an iterative solution; it's going to be a great deal more efficient and easier to maintain.
That said, if you absolutely must use a set-based ("Linq") method, this is one way you could do it:
var grouped =
    from n in nums
    group n by n into g
    select new { Num = g.Key, Qty = g.Count() };
int maxPerGroup = 2;
var portioned =
    from x in grouped
    from i in Enumerable.Range(1, grouped.Max(g => g.Qty))
    where (x.Qty % maxPerGroup) == (i % maxPerGroup)
    let tempQty = (x.Qty / maxPerGroup) == (i / maxPerGroup) ?
        (x.Qty % maxPerGroup) : maxPerGroup
    select new
    {
        Num = x.Num,
        Qty = (tempQty > 0) ? tempQty : maxPerGroup
    };
Compare with the simpler and faster iterative version:
foreach (var g in grouped)
{
    int remaining = g.Qty;
    while (remaining > 0)
    {
        int allotted = Math.Min(remaining, maxPerGroup);
        yield return new MyGroup(g.Num, allotted);
        remaining -= allotted;
    }
}
Aaronaught's excellent answer doesn't cover the possibility of getting the best of both worlds... using an extension method to provide an iterative solution.
Untested:
public static IEnumerable<IEnumerable<U>> SplitByMax<T, U>(
    this IEnumerable<T> source,
    int max,
    Func<T, int> maxSelector,
    Func<T, int, U> resultSelector
)
{
    foreach (T x in source)
    {
        int number = maxSelector(x);
        List<U> result = new List<U>();
        do
        {
            int allotted = Math.Min(number, max);
            result.Add(resultSelector(x, allotted));
            number -= allotted;
        } while (number > 0 && max > 0);
        yield return result;
    }
}
Called by:
var query = grouped.SplitByMax(
        10,
        o => o.Qty,
        (o, i) => new { Num = o.Num, Qty = i }
    )
    .SelectMany(split => split);
