I am new to XQuery. I need to limit by 1 in XQuery but I'm having trouble.
I am trying to find out the winner of a tournament by finding each players scores and summing up the scores to get the total scores. I then sort in descending order and try to limit the total score, however I am having problems with trying to limit in XQuery.
Here I am wanting to get the top score but I have tried to use subsequence($sequence, $starting-item, $number-of-items) and [position()] but it seems to not be working.
for $pair_ids in distinct-values(fn:doc("tourny.xml")/Competition[#date = "2012-12-12"]/Tennis/Partner/#pair_id)
let $total_scores := sum(fn:doc("tourny.xml")/Competition[#date = "2012-12-12"]/Tennis/Partner[#pair_id = $pair_ids]/#score)
order by $total_scores descending
return
$total_scores
the output is giving me:
$total_scores:
34
11
20
How can I limit the result so I only get 34 as the highest score?
Thanks
You can use fn:max() function as follow:
fn:max(
for $pair_ids in distinct-values(fn:doc("tourny.xml")/Competition[#date = "2012-12-12"]/Tennis/Partner/#pair_id)
let $total_scores := sum(fn:doc("tourny.xml")/Competition[#date = "2012-12-12"]/Tennis/Partner[#pair_id = $pair_ids]/#score)
order by $total_scores descending
return $total_scores
)
Related
I have a data table that contains transactions by supplier. Each row of data represents one transaction. Each transaction contains a "QTY" column as well as a "Supplier" column.
I need to rank these suppliers by the count of transactions (Count of rows per unique supplier) then by the SUM of the "QTY" for all of each supplier's transactions. This needs to be in 1 rank formula, not two separate rankings. This will help in breaking any ties in my ranking.
I have tried dozens of formulas and approaches and can't seem to get it right.
See below example:
Suppliers ABC and EFG each have 4 transactions so they would effectively tie for Rank 1, however ABC has a Quantity of 30 and EFG has a QTY of 25 so ABC should rank 1 and EFG should rank 2.
Can anyone assist?
https://i.stack.imgur.com/vCsCA.png
Welcome to SO. You can create a new calculated column -
Rank =
var SumTable = SUMMARIZE(tbl, tbl[Supplier], "CountTransactions", COUNT(tbl[Transaction Number]), "SumQuantity", SUM(tbl[Quantity]))
var ThisSupplier = tbl[Supplier]
var ThisTransactions = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [CountTransactions])
var ThisQuantity = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [SumQuantity])
var ThisRank =
FILTER(SumTable,
[CountTransactions] >= ThisTransactions &&
[SumQuantity] >= ThisQuantity)
return
COUNTROWS(ThisRank)
Here's the final result -
I'm curious to see if anyone posts an alternative solution. In the meantime, give mine a try and let me know if it works as expected.
This should be a very simple requirement. But it seems impossible to implement in DAX.
Data model, User lookup table joined to many "Cards" linked to each user.
I have a measure setup to count rows in CardUser. That is working fine.
<measureA> = count rows in CardUser
I want to create a new measure,
<measureB> = IF(User.boolean = 1,<measureA>, 16)
If User.boolean = 1, I want to return a fixed value of 16. Effectively, bypassing measureA.
I can't simply put User.boolean = 1 in the IF condition, throws an error.
I can modify measureA itself to return 0 if User.boolean = 1
measureA> =
CALCULATE (
COUNTROWS(CardUser),
FILTER ( User.boolean != 1 )
)
This works, but I still can't find a way to return 16 ONLY if User.boolean = 1.
That's easy in DAX, you just need to learn "X" functions (aka "Iterators"):
Measure B =
SUMX( VALUES(User.boolean),
IF(User.Boolean, [Measure A], 16))
VALUES function generates a list of distinct user.boolean values (1, 0 in this case). Then, SUMX iterates this list, and applies IF logic to each record.
I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
minsample = MIN(samples.samp);
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
minsample = 3.4F;
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That work, but results (in the real case) in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
minsamples = ORDER samples BY samp;
minsample = LIMIT minsamples 1;
maxsamples = ORDER samples BY samp DESC;
maxsample = LIMIT maxsamples 1;
GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the fact we are sorting based on the samp field is defined in the DEFINE statement by passing 3 as the first param.
I follow a help How to handle spill memory in pig from alexeipab, it really works fine, but I have another question now, same sample code:
pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);
pymt_grp_with_salt = GROUP pymt BY (key,salt)
results_with_salt = FOREACH pymt_grp {
--distinct
mid_set = FILTER pymt BY xxx=='abc';
mid_set_result = DISTINCT mid_set.yyy;
result = COUNT(mid_set_result)
}
pymt_grp = GROUP results_with_salt BY key;
result = FOREACH pymt_grp {
GENERATE SUM(results_with_salt.result); --it is WRONG!!
}
I can't use sum in that group, which it will be very different from result that calculated without salt.
is there any solution? if filter first, it will cost many JOIN job, and slow down the performance.
For this to work, you need to have many to one relationship between mid_set.yyy and salt, so that same value for mid_set.yyy from different rows is mapped into the same value of salt. If it is not, than that value of mid_set.yyy will appear in different bags produced by GROUP pymt BY (key, salt), survive DISTINCT in different salts, thus are included multiple times in the final rollup. That is why you can get wrong results when using salts and COUNT of DISTINCT.
An easy way could be to replace salt with mid_set.yyy itself or to write a UDF/static method which calculates salt by taking hash of mid_set.yyy and does mod N, where N could be 1 to infinity, for best distribution N should be a prime number.
Thanks alexeipab, you give me a great help, what i do as below
pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);
pymt = FOREACH pymt GENERATE *, (yyy%$prime_num) as salt;
pymt_grp_with_salt = GROUP pymt BY (key,salt);
It works!!
if yyy is num integer, you can use hash to convert string or others to a integer
I have 2 columns
Name and Amount
I like the linq to return the Name based on who has the Maximum Amount.
So far I have the following:
string name = (from nm in bg
select nm.Name).Max(Amount);
which obviously will not work.
Thank you.
string name = (from nm in bg
where nm.Amount == bg.Max(i=>i.Amount)
select nm.Name)
or
string name = (from nm in bg
orderby nm.Amount desc
select nm.Name).First()
Fastest approach I can think is as this (find max amount, then find item which has max amount, two way traversing, and is O(n)):
decimal amount = bg.Max(x=>x.Amount);
var name = bg.First(x=>x.Amount == amount).Name; // O(n)
Also you can do:
// O(n^2) in worst case, O(n) in best case
bg.First(x=>x.Amount == bg.Max(x=>x.Amount)).Name;
Or
bg.OrderByDescending(x=>x.Amount).First().Name; // O(n log n) in all situation
The expression
bg.OrderByDescending(nm => nm.Amount).First().Name
will get what you want, but will throw if bg is empty. If this is a problem, use
var topNm = bg.OrderByDescending(nm => nm.Amount).FirstOrDefault();
and check topNm for nullity.