DAX - RANKX Using Two Calculations

I have a data table that contains transactions by supplier. Each row of data represents one transaction. Each transaction contains a "QTY" column as well as a "Supplier" column.
I need to rank these suppliers by the count of transactions (Count of rows per unique supplier) then by the SUM of the "QTY" for all of each supplier's transactions. This needs to be in 1 rank formula, not two separate rankings. This will help in breaking any ties in my ranking.
I have tried dozens of formulas and approaches and can't seem to get it right.
See below example:
Suppliers ABC and EFG each have 4 transactions so they would effectively tie for Rank 1, however ABC has a Quantity of 30 and EFG has a QTY of 25 so ABC should rank 1 and EFG should rank 2.
Can anyone assist?
https://i.stack.imgur.com/vCsCA.png

Welcome to SO. You can create a new calculated column -
Rank =
VAR SumTable = SUMMARIZE(tbl, tbl[Supplier], "CountTransactions", COUNT(tbl[Transaction Number]), "SumQuantity", SUM(tbl[Quantity]))
VAR ThisSupplier = tbl[Supplier]
VAR ThisTransactions = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [CountTransactions])
VAR ThisQuantity = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [SumQuantity])
-- Suppliers ranked at or above this one: more transactions,
-- or the same number of transactions and at least as much quantity
VAR ThisRank =
    FILTER(SumTable,
        [CountTransactions] > ThisTransactions
            || ([CountTransactions] = ThisTransactions && [SumQuantity] >= ThisQuantity))
RETURN
    COUNTROWS(ThisRank)
Here's the final result -
I'm curious to see if anyone posts an alternative solution. In the meantime, give mine a try and let me know if it works as expected.
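For anyone who wants to check the tie-breaking logic outside Power BI, here is a minimal Python sketch of the ranking described in the question: order by transaction count first, then by total quantity. The function name and the sample rows are hypothetical; the rows are chosen so that ABC has 4 transactions totalling 30 and EFG has 4 totalling 25, and a third supplier XYZ is assumed to show that a large quantity does not beat a higher transaction count.

```python
from collections import defaultdict

def rank_suppliers(transactions):
    """Rank suppliers by transaction count, then by total quantity (both descending)."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for supplier, qty in transactions:
        counts[supplier] += 1
        totals[supplier] += qty
    keys = {s: (counts[s], totals[s]) for s in counts}
    # Tuples compare element by element, so (count, total) gives the tie-break for free.
    # Rank = 1 + number of suppliers whose (count, total) pair is strictly greater.
    return {s: 1 + sum(1 for other in keys.values() if other > key)
            for s, key in keys.items()}

# Hypothetical sample data
transactions = [("ABC", 10), ("ABC", 10), ("ABC", 5), ("ABC", 5),
                ("EFG", 10), ("EFG", 5), ("EFG", 5), ("EFG", 5),
                ("XYZ", 100)]
ranks = rank_suppliers(transactions)
print(ranks)
```

With these rows ABC ranks first, EFG second and XYZ third, even though XYZ has the largest quantity.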


Power Bi - Add Total Average column in Matrix

Hi, I am trying to add an average column to a matrix. When I add my measure, the average is repeated for every column, but I need a single average column and a single total column, shown just once at the end.
What I have:
What I need:
Group  | Maria | Pedro | average | total
First  | 4     | 6     | 5       | 10
Second | 5     | 10    | 7.5     | 15
Regards
Following the example detailed in the sample data table, to get the Total you could add the following measure;
Total By Group = CALCULATE( SUM(AverageExample[Maria]) + SUM(AverageExample[Pedro]))
and to average
Average By Group = [Total By Group] / 2
Based on the first three columns, this will provide
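As a quick check of the arithmetic behind these two measures, here is a small Python sketch that computes the total and the average for each group from the sample values in the question. The function name and the dictionary layout are assumptions made for illustration only.

```python
def totals_and_averages(rows):
    """Return the total and the average of the per-person values for each group."""
    result = {}
    for group, values in rows.items():
        total = sum(values.values())
        # Dividing by the number of people mirrors the "/ 2" in the measure above.
        result[group] = {"total": total, "average": total / len(values)}
    return result

# Sample values taken from the question
rows = {
    "First": {"Maria": 4, "Pedro": 6},
    "Second": {"Maria": 5, "Pedro": 10},
}
summary = totals_and_averages(rows)
print(summary)
```

Dividing by the number of people rather than a hard-coded 2 keeps the average correct if more name columns are added later.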
You have to build a DAX table (or Power Query) and a designated measure.
Matrix Table =
UNION(
DATATABLE("Detail", STRING, "Detail Order", INTEGER, "Type", STRING, {{"Average", 1000, "Agregate"}, {"Total", 1001, "Agregate"}}),
SUMMARIZE('Your Names Table', 'Your Names Table'[Name], 'Your Names Table'[Name Order], "Type", "Names")
)
This should give you a table with the list of people and 2 more lines for the aggregations.
After that, you create a measure using variables and a switch function.
Matrix Measure =
var ft = FIRSTNONBLANK('Matrix Table'[Type], 0)
var fd = FIRSTNONBLANK('Matrix Table'[Detail], 0)
return SWITCH(TRUE,
ft = "Names", CALCULATE([Total], KEEPFILTERS('Your Names Table'[Name] = fd)),
fd = "Total", [Your Total Measure],
fd = "Average", [Your Averagex Measure]
)
The rest is up to you: fiddle with the orders, add any aggregate measures, and so on.
Note that the Matrix Table should have no relation with any table from your model.
You can also hide it and the Matrix measure.

How can I increase my python-code speed?

I have a dataframe, df1, that reports courses students have taken, where ID is the student’s id, COURSES is a list of courses taken by the student, and TYPE and MAJOR are student attributes. The dataframe looks like this:
ID | COURSES | TYPE | MAJOR
1 | ['Intr To Archaeology', 'Statics', 'Circuits I…] | Freshman | EEEL
2 | ['Signals & Systems I', 'Instrumentation'…] | Transfer | EEEL
3 | ['Keyboard Competence', 'Elementary … ] | Freshman | EEEL
4 | ['Cultural Anthro', 'Vector Analysis' … ] | Freshman | EEEL
I created a new dataframe, df2, that reports a dissimilarity measure for each pair of students based on the courses they’ve taken. df2 looks like this:
I created it using the following script, but it runs very slowly (there are thousands of students). Can someone suggest a more efficient way to create df2?
One major problem is that the script below calculates the distance between (student 1 and student 2) and (student 2 and student 1), which is redundant since the distances are the same. However, the condition I created to prevent this:
    if (id1 >= id2):
        continue
doesn't work.
Entire script:
import ast

for id1, student1 in df.iterrows():
    for id2, student2 in df.iterrows():
        if (id1 >= id2):
            continue
        ID_1 = student1["ID"]
        ID_2 = student2["ID"]
        # courses as list strings
        s1 = student1["COURSES"]
        s2 = student2["COURSES"]
        try:
            # courses as sets
            courses1 = set(ast.literal_eval(s1))
            courses2 = set(ast.literal_eval(s2))
            distance = float(len(courses1.symmetric_difference(courses2)))/(len(courses1) + len(courses2))
        except:
            # Some strings seem to have a different format
            distance = -1
        ID_1_Transfer = 1 if student1["TYPE"] == "Transfer" else 0
        ID_2_Transfer = 1 if student2["TYPE"] == "Transfer" else 0
        df2= df2.append({'ID_1': ID_1,'ID_2': ID_2,'Distance': distance, 'ID_1_Transfer': ID_1_Transfer, 'ID_2_Transfer': ID_2_Transfer}, ignore_index=True)
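One possible way to speed this up, sketched here with the standard library only, is to parse each course list once, visit each unordered pair of students once with itertools.combinations, and collect the result rows in a plain list instead of appending to a dataframe inside the loop. The function names and the sample records are hypothetical; the distance formula and the -1 sentinel for unparseable course lists are taken from the script above.

```python
import ast
from itertools import combinations

def parse_courses(raw):
    """Return the course-list string as a set, or None if it cannot be parsed."""
    try:
        return set(ast.literal_eval(raw))
    except (ValueError, SyntaxError, TypeError):
        return None

def pairwise_distances(students):
    """students: list of dicts with ID, COURSES (string) and TYPE keys."""
    # Parse every course list once instead of once per pair.
    parsed = [(s["ID"], parse_courses(s["COURSES"]), int(s["TYPE"] == "Transfer"))
              for s in students]
    rows = []
    # combinations() yields each unordered pair exactly once, so (1, 2) is
    # computed but (2, 1) and (1, 1) are not.
    for (id1, c1, t1), (id2, c2, t2) in combinations(parsed, 2):
        if c1 is None or c2 is None or not (c1 or c2):
            distance = -1.0  # same sentinel as the original script
        else:
            distance = len(c1 ^ c2) / (len(c1) + len(c2))
        rows.append({"ID_1": id1, "ID_2": id2, "Distance": distance,
                     "ID_1_Transfer": t1, "ID_2_Transfer": t2})
    return rows

# Hypothetical sample records
students = [
    {"ID": 1, "COURSES": "['A', 'B', 'C']", "TYPE": "Freshman"},
    {"ID": 2, "COURSES": "['B', 'C', 'D']", "TYPE": "Transfer"},
    {"ID": 3, "COURSES": "not a list", "TYPE": "Freshman"},
]
rows = pairwise_distances(students)
print(rows)
```

If pandas is available, the returned list can be turned into df2 in one step with pd.DataFrame(rows), and the input list can be built from the source dataframe with to_dict("records"). The number of pairs still grows quadratically with the number of students, so this removes the redundant work but not the inherent cost.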

Pig - Aggregations based on multiple columns

My Input data set has 3 columns and schema looks like below:
ActivityDate, EventId, EventDate
Now, using Pig, I need to derive multiple variables like the ones below in one output file:
1) All Event Ids after ActivityDate >= EventDate -30 days
2) All Event Ids after ActivityDate >= EventDate -60 days
3) All Event Ids after ActivityDate >= EventDate -90 days
I have more than 30 variables like this. If it is one variable, we can use simple FILTER to filter the data.
I am thinking about any UDF implementation which takes bag as input and returns count of Event IDs based on above criteria for each parameter.
What is the best way to aggregate the data on multiple columns in Pig?
I would suggest creating another file with all of your thresholds and cross joining with the file.
so you would have a file containing:
30
60
90
etc
read it like this:
grouping = load 'grouping.txt' using PigStorage(',') as (groups:double);
Then do:
data_with_grouping = cross data, grouping;
Then have this binary condition:
data_with_binary_condition = foreach data_with_grouping generate ActivityDate, EventId, EventDate, groups, (ActivityDate >= EventDate - groups ? 1 : 0) as binary_condition;
Now you will have one column with the threshold and one column with a binary variable that tells you whether the ID follows the condition or not.
You can then filter out all of the zeros in binary_condition and group on the groups column:
data_with_binary_condition_filtered = filter data_with_binary_condition by (binary_condition != 0);
grouped_by_threshold = group data_with_binary_condition_filtered by groups;
count_of_IDS = foreach grouped_by_threshold generate group, COUNT(data_with_binary_condition_filtered.EventId);
I hope this works. Obviously, I didn't debug it for you since I don't have your files.
This code will take a tad more time to run, but it will produce the output you need without a UDF.
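The cross-join idea can be illustrated outside Pig with a short Python sketch: every record is tested against every threshold, and the matches are counted per threshold. The function name and the sample records are hypothetical, and dates are assumed to be plain day numbers so that subtracting a threshold in days is meaningful.

```python
from collections import Counter

def count_events_by_threshold(records, thresholds):
    """records: iterable of (activity_day, event_id, event_day), dates as day numbers."""
    counts = Counter()
    for activity_day, event_id, event_day in records:
        for days in thresholds:  # plays the role of the cross join with the threshold file
            if activity_day >= event_day - days:
                counts[days] += 1
    return dict(counts)

# Hypothetical sample records
records = [(100, "e1", 120), (100, "e2", 150), (100, "e3", 185)]
counts = count_events_by_threshold(records, [30, 60, 90])
print(counts)
```

Adding a new variable then only means adding one more number to the threshold list, which is the main attraction of this layout when there are 30 or more of them.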
If I understand your question correctly, you want to divide the difference between EventDate and ActivityDate in 30 days blocks (e.g. 1 to 30, 31 to 60, 61 to 90 and so on) and then count the frequency of each block.
In this case, I would just rearrange the above equation to create the variable 'range' as below:
-- assuming the relation data contains 3 columns: ActivityDate, EventId, EventDate
-- (input is a reserved keyword in Pig, so the relation is called data here)
-- dividing the difference between ED and AD by 30 and casting it to int, so that 1 block is represented by 1 integer.
input1 = FOREACH data GENERATE (int)((EventDate - ActivityDate) / 30) AS range;
output1 = GROUP input1 BY range;
output2 = FOREACH output1 GENERATE group AS range, COUNT(input1) AS count;
Hope this helps.
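The same bucketing can be sketched in Python to see which block a given difference falls into. The function name and the sample records are hypothetical, and dates are again assumed to be day numbers.

```python
from collections import Counter

def count_by_30_day_block(records):
    """Count records per 30-day block of (event_day - activity_day)."""
    # int() truncates toward zero, like the (int) cast in the Pig snippet.
    return Counter(int((event_day - activity_day) / 30)
                   for activity_day, _event_id, event_day in records)

# Hypothetical sample records: differences of 20, 50, 85 and 10 days
records = [(100, "e1", 120), (100, "e2", 150), (100, "e3", 185), (100, "e4", 110)]
blocks = count_by_30_day_block(records)
print(dict(blocks))
```

Note that with this truncating division, differences of 0 to 29 days fall in block 0 and a difference of exactly 30 days falls in block 1, so the boundaries may need shifting by one day if blocks of 1 to 30, 31 to 60 and so on are wanted.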

Regroup By in PigLatin

In Pig Latin, I want to group twice in order to select rows using two different rules.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the people whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age;
--Then group by as to select the richest, group by fails :
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried filtering for the nearest and the richest at the same time, but it doesn't work, because the richest people may be much older than I am.
Another, more realistic example:
You have demands file like : iddem, idopedem, datedem
You have operations file like : idope,labelope,dateope,idoftheday,infope
I want to return the operations that match demands as follows:
idopedem matches idope.
The dateope must be the nearest to datedem.
If datedem - dateope > 0, then I must select the operation with the max(idoftheday), else I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH C GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP D BY iddem;
F = FOREACH E GENERATE group, MIN(D.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data in the 2nd example (note date are already in Unix format) :
You have demands file like :
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
You have operations file like :
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
Result must be like :
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
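The two-step selection in this second example can be sketched outside Pig as follows. This is only an illustration of the rule, not a Pig solution: the function name is hypothetical, and the tie-break treats a date difference of zero like a positive one, which is an assumption based on the `>= 0` test in the script and on the expected result.

```python
def match_operations(demands, operations):
    """Pick one operation per demand: nearest date first, then idoftheday."""
    results = []
    for iddem, idopedem, datedem in demands:
        # Operations are (idope, labelope, dateope, idoftheday, infope).
        candidates = [op for op in operations if op[0] == idopedem]
        if not candidates:
            continue
        # First pass: keep the operations whose date is nearest to the demand date.
        min_delta = min(abs(datedem - op[2]) for op in candidates)
        nearest = [op for op in candidates if abs(datedem - op[2]) == min_delta]
        # Second pass (assumed tie-break): among operations dated on or before
        # the demand, take the highest idoftheday; otherwise take the lowest.
        on_or_before = [op for op in nearest if datedem - op[2] >= 0]
        if on_or_before:
            chosen = max(on_or_before, key=lambda op: op[3])
        else:
            chosen = min(nearest, key=lambda op: op[3])
        results.append((iddem,) + chosen)
    return results

# Sample data from the question
demands = [(1, "ctr1", 1359460800000), (2, "ctr2", 1354363200000)]
operations = [
    ("ctr0", "toto", 1359460800000, 1, "blabla0"),
    ("ctr0", "tata", 1359460800000, 2, "blabla1"),
    ("ctr1", "toto", 1359460800000, 1, "blabla2"),
    ("ctr1", "tata", 1359460800000, 2, "blabla3"),
    ("ctr2", "toto", 1359460800000, 1, "blabla4"),
    ("ctr2", "tata", 1359460800000, 2, "blabla5"),
    ("ctr3", "toto", 1359460800000, 1, "blabla6"),
    ("ctr3", "tata", 1359460800000, 2, "blabla7"),
]
matches = match_operations(demands, operations)
for row in matches:
    print(row)
```

On the sample data this selects 'tata' for demand 1 and 'toto' for demand 2, in line with the expected result listed above.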
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgments. For example, in the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those don't fit your intuition, then modify the formula. Likely a simple quadratic factor won't be sufficient. But with a little experimentation you can hit upon something that works for you.
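To experiment with the scoring formula before putting it into a Pig script, a few lines of Python are enough. The sketch below reproduces the comparisons mentioned above; the function names, the dictionary layout and the value of my_age are hypothetical.

```python
def score(money, age, my_age):
    """Money discounted by the squared, scaled age difference (same formula as above)."""
    return money / (1 + abs(my_age - age) / 10) ** 2

def top_n(people, my_age, n=10):
    """Return the n best people by score, highest first."""
    return sorted(people, key=lambda p: score(p["money"], p["age"], my_age),
                  reverse=True)[:n]

my_age = 30  # hypothetical value
people = [
    {"name": "same age, 100k", "age": 30, "money": 100_000},
    {"name": "10 years older, 350k", "age": 40, "money": 350_000},
    {"name": "20 years older, 500k", "age": 50, "money": 500_000},
    {"name": "same age, 50k", "age": 30, "money": 50_000},
]
ranking = [p["name"] for p in top_n(people, my_age)]
print(ranking)
```

Changing the divisor 10 or the exponent 2 and re-running the ranking is a quick way to see how strongly age differences are penalized.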

group by and joining tables in linq to sql

I have the following 3 classes(mapped to sql tables).
Places table:
Name(key)
Address
Capacity
Events table:
Name(key)
Date
Place
Orders table:
Id(key)
EventName
Qty
The Places and Events tables are connected through Places.Name = Events.Place, while the Events and Orders tables: Events.Name = Orders.EventName .
The task is that given an event, return the tickets left for that event. Capacity is the number a place can hold and Qty is the number of tickets ordered by someone. So some sort of grouping in the Orders table is needed and then subtract the sum from capacity.
Something like this (C# code sample below)?
Sorry for the weird variable names, but event is a keyword :)
I didn't use visual studio, so I hope that the syntax is correct.
string eventName = "Event";
var theEvent = Events.FirstOrDefault(ev => ev.Name == eventName);
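// Sum the ordered quantities (Qty) rather than counting order rows;
// the cast to int? avoids an exception when the event has no orders yet.
int eventOrderNo = Orders.Where(or => or.EventName == eventName).Sum(or => (int?)or.Qty) ?? 0;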
var thePlace = Places.FirstOrDefault(pl => pl.Name == theEvent.Place);
int ticketsLeft = thePlace.Capacity - eventOrderNo;
If the Event has multiple places, the last two lines would look like this:
int placesCapacity = Places.Where(pl => pl.Name == theEvent.Place)
.Sum(pl => pl.Capacity);
int ticketsLeft = placesCapacity - eventOrderNo;
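The calculation itself is small enough to check with an in-memory sketch. The Python below is only an illustration of the arithmetic, not LINQ to SQL: the function name and the sample rows are hypothetical, and it sums Qty per event as the question describes.

```python
def tickets_left(event_name, places, events, orders):
    """Capacity of the event's place minus the total quantity ordered for the event."""
    event = next((e for e in events if e["Name"] == event_name), None)
    if event is None:
        raise KeyError(f"unknown event: {event_name}")
    # Join Events to Places on Events.Place = Places.Name.
    capacity = sum(p["Capacity"] for p in places if p["Name"] == event["Place"])
    # Group Orders by event and sum the ordered quantities.
    ordered = sum(o["Qty"] for o in orders if o["EventName"] == event_name)
    return capacity - ordered

# Hypothetical sample rows
places = [{"Name": "Hall", "Address": "1 Main St", "Capacity": 100}]
events = [{"Name": "Concert", "Date": "2024-01-01", "Place": "Hall"}]
orders = [
    {"Id": 1, "EventName": "Concert", "Qty": 3},
    {"Id": 2, "EventName": "Concert", "Qty": 5},
    {"Id": 3, "EventName": "Other", "Qty": 10},
]
remaining = tickets_left("Concert", places, events, orders)
print(remaining)
```

An event with no orders simply yields the full capacity, because summing an empty sequence gives 0 here.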
On a sidenote
LINQ 101 is a great way to get familiar with LINQ: http://msdn.microsoft.com/en-us/vcsharp/aa336746
