How can I increase my python-code speed? - performance

I have a dataframe, df1, that reports courses students have taken, where ID is the student’s id, COURSES is a list of courses taken by the student, and TYPE and MAJOR are student attributes. The dataframe looks like this:
ID  COURSES                                            TYPE      MAJOR
1   ['Intr To Archaeology', 'Statics', 'Circuits I…]   Freshman  EEEL
2   ['Signals & Systems I', 'Instrumentation'…]        Transfer  EEEL
3   ['Keyboard Competence', 'Elementary … ]            Freshman  EEEL
4   ['Cultural Anthro', 'Vector Analysis' … ]          Freshman  EEEL
I created a new dataframe, df2, that reports a dissimilarity measure for each pair of students based on the courses they’ve taken.
I created it using the script below, but it runs very slowly (there are thousands of students). Can someone suggest a more efficient way to create df2?
One major problem is that the script below calculates the distance for both (student 1, student 2) and (student 2, student 1), which is redundant since the two distances are the same. However, the condition I added to prevent this:
    if (id1 >= id2):
        continue
doesn't work.
Entire script:
import ast

for id1, student1 in df1.iterrows():
    for id2, student2 in df1.iterrows():
        # skip pairs already seen (intended to avoid computing each pair twice)
        if (id1 >= id2):
            continue
        ID_1 = student1["ID"]
        ID_2 = student2["ID"]
        # courses as list strings
        s1 = student1["COURSES"]
        s2 = student2["COURSES"]
        try:
            # courses as sets
            courses1 = set(ast.literal_eval(s1))
            courses2 = set(ast.literal_eval(s2))
            distance = float(len(courses1.symmetric_difference(courses2)))/(len(courses1) + len(courses2))
        except Exception:
            # Some strings seem to have a different format
            distance = -1
        ID_1_Transfer = 1 if student1["TYPE"] == "Transfer" else 0
        ID_2_Transfer = 1 if student2["TYPE"] == "Transfer" else 0
        df2 = df2.append({'ID_1': ID_1, 'ID_2': ID_2, 'Distance': distance, 'ID_1_Transfer': ID_1_Transfer, 'ID_2_Transfer': ID_2_Transfer}, ignore_index=True)
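One possible direction, offered only as a sketch: parse each COURSES string once, let itertools.combinations produce every unordered pair exactly once (so no id comparison is needed), and collect the rows in a plain list instead of appending to a dataframe inside the loop. The function names parse_courses and pairwise_distances and the list-of-dicts input are assumptions made for illustration; the distance formula and the -1 sentinel are taken from the script above.

```python
import ast
from itertools import combinations


def parse_courses(raw):
    """Return the courses in a stringified list as a set, or None if malformed."""
    try:
        return set(ast.literal_eval(raw))
    except (ValueError, SyntaxError, TypeError):
        return None


def pairwise_distances(records):
    """Build one output row per unordered pair of students.

    `records` is a list of dicts with ID, COURSES and TYPE keys
    (a hypothetical input shape, e.g. df1.to_dict("records")).
    """
    # Parse each COURSES string once instead of once per pair.
    students = [
        (r["ID"], parse_courses(r["COURSES"]), int(r["TYPE"] == "Transfer"))
        for r in records
    ]
    rows = []
    # combinations() yields each pair once: never (a, a), never both (a, b) and (b, a).
    for (id1, c1, t1), (id2, c2, t2) in combinations(students, 2):
        if c1 is None or c2 is None or not (c1 or c2):
            distance = -1.0  # same sentinel the original script uses for bad input
        else:
            distance = len(c1 ^ c2) / (len(c1) + len(c2))
        rows.append({
            "ID_1": id1,
            "ID_2": id2,
            "Distance": distance,
            "ID_1_Transfer": t1,
            "ID_2_Transfer": t2,
        })
    return rows
```

With pandas, df2 could then be built in one step with pd.DataFrame(pairwise_distances(df1.to_dict("records"))). The number of pairs still grows quadratically with the number of students, so this roughly halves the pair count and removes the per-row dataframe copies rather than changing the overall complexity. Note also that DataFrame.append was removed in pandas 2.0, so the original loop would need pd.concat on recent versions anyway.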

Related

Power Bi - Add Total Average column in Matrix

Hi, I am trying to add an AVERAGE column to a matrix, but when I add my metric it shows the average for every column. I need a single AVERAGE column and a single TOTAL column, shown just once at the end.
What I have:
What I need:
Group   Maria  Pedro  average  total
First   4      6      5        10
Second  5      10     7.5      15
Regards
Following the example in the sample data table, to get the total you could add the following measure:
Total By Group = CALCULATE( SUM(AverageExample[Maria]) + SUM(AverageExample[Pedro]))
and for the average:
Average By Group = [Total By Group] / 2
Based on the first three columns, this gives a total of 10 and an average of 5 for First, and a total of 15 and an average of 7.5 for Second.
You have to build a DAX table (or Power Query) and a designated measure.
Matrix Table =
UNION(
    DATATABLE("Detail", STRING, "Detail Order", INTEGER, "Type", STRING, {{"Average", 1000, "Agregate"}, {"Total", 1001, "Agregate"}}),
    SUMMARIZE('Your Names Table', 'Your Names Table'[Name], 'Your Names Table'[Name Order], "Type", "Names")
)
This should give you a table with the list of people plus two more rows for the aggregations.
After that, you create a measure using variables and a switch function.
Matrix Measure =
var ft = FIRSTNONBLANK('Matrix Table'[Type], 0)
var fd = FIRSTNONBLANK('Matrix Table'[Detail], 0)
return
    SWITCH(TRUE,
        ft = "Names", CALCULATE([Total], KEEPFILTERS('Your Names Table'[Name] = fd)),
        fd = "Total", [Your Total Measure],
        fd = "Average", [Your Averagex Measure]
    )
The rest is up to you: adjust the sort orders, add any other aggregate measures, and so on.
Note that the Matrix Table should have no relation with any table from your model.
You can also hide it and the Matrix measure.

DAX - RANKX Using Two Calculations

I have a data table that contains transactions by supplier. Each row of data represents one transaction. Each transaction contains a "QTY" column as well as a "Supplier" column.
I need to rank these suppliers by the count of transactions (Count of rows per unique supplier) then by the SUM of the "QTY" for all of each supplier's transactions. This needs to be in 1 rank formula, not two separate rankings. This will help in breaking any ties in my ranking.
I have tried dozens of formulas and approaches and can't seem to get it right.
See below example:
Suppliers ABC and EFG each have 4 transactions so they would effectively tie for Rank 1, however ABC has a Quantity of 30 and EFG has a QTY of 25 so ABC should rank 1 and EFG should rank 2.
Can anyone assist?
https://i.stack.imgur.com/vCsCA.png
Welcome to SO. You can create a new calculated column -
Rank =
var SumTable = SUMMARIZE(tbl, tbl[Supplier], "CountTransactions", COUNT(tbl[Transaction Number]), "SumQuantity", SUM(tbl[Quantity]))
var ThisSupplier = tbl[Supplier]
var ThisTransactions = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [CountTransactions])
var ThisQuantity = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [SumQuantity])
-- Suppliers ranked at or above this one: more transactions,
-- or the same number of transactions and at least as much quantity
var ThisRank =
    FILTER(SumTable,
        [CountTransactions] > ThisTransactions ||
        ([CountTransactions] = ThisTransactions &&
         [SumQuantity] >= ThisQuantity))
return
    COUNTROWS(ThisRank)
Here's the final result -
I'm curious to see if anyone posts an alternative solution. In the meantime, give mine a try and let me know if it works as expected.
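As an aside, the tie-breaking rule from the question (transaction count first, total quantity second) can be checked outside DAX with a small Python sketch. It is not a replacement for the calculated column; it only illustrates the ordering. The per-transaction quantities and the supplier XYZ are hypothetical, chosen so that ABC totals 30 and EFG totals 25 over 4 transactions each, as in the question.

```python
def rank_suppliers(transactions):
    """Rank suppliers by (transaction count, total quantity), highest first.

    `transactions` is an iterable of (supplier, qty) pairs. A supplier's rank
    is 1 plus the number of suppliers with a strictly greater (count, qty) key.
    """
    totals = {}
    for supplier, qty in transactions:
        count, total = totals.get(supplier, (0, 0))
        totals[supplier] = (count + 1, total + qty)
    # Tuples compare lexicographically: count decides, quantity breaks ties.
    return {
        supplier: 1 + sum(1 for other in totals.values() if other > key)
        for supplier, key in totals.items()
    }


# Hypothetical rows: ABC has 4 transactions totalling 30, EFG has 4 totalling 25,
# XYZ has only 2 transactions but a much larger total quantity.
sample = [
    ("ABC", 10), ("ABC", 10), ("ABC", 5), ("ABC", 5),
    ("EFG", 10), ("EFG", 5), ("EFG", 5), ("EFG", 5),
    ("XYZ", 60), ("XYZ", 40),
]
ranks = rank_suppliers(sample)
```

The XYZ row is there to show that a large quantity alone does not outrank a supplier with more transactions, which is the case a purely "both values greater or equal" comparison would get wrong.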

DAX IF measure - return fixed value

This should be a very simple requirement. But it seems impossible to implement in DAX.
Data model: a User lookup table joined to many "Cards" linked to each user.
I have a measure setup to count rows in CardUser. That is working fine.
<measureA> = count rows in CardUser
I want to create a new measure,
<measureB> = IF(User.boolean = 1,<measureA>, 16)
If User.boolean = 1, I want to return a fixed value of 16. Effectively, bypassing measureA.
I can't simply put User.boolean = 1 in the IF condition, throws an error.
I can modify measureA itself to return 0 if User.boolean = 1
<measureA> =
CALCULATE (
    COUNTROWS(CardUser),
    FILTER ( User.boolean != 1 )
)
This works, but I still can't find a way to return 16 ONLY if User.boolean = 1.
That's easy in DAX, you just need to learn "X" functions (aka "Iterators"):
Measure B =
SUMX(
    VALUES(User.boolean),
    IF(User.Boolean, [Measure A], 16)
)
VALUES function generates a list of distinct user.boolean values (1, 0 in this case). Then, SUMX iterates this list, and applies IF logic to each record.
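For readers more used to procedural code, the evaluation order described above can be mimicked in Python. This is only an analogy for how the iterator behaves, not DAX semantics in full: the user_boolean key and the list-of-dicts shape are hypothetical stand-ins for the CardUser rows visible in the current filter context, and the branch order follows the formula exactly as written in the answer.

```python
def measure_b(card_user_rows):
    """Mimic SUMX(VALUES(User.boolean), IF(User.boolean, [Measure A], 16)).

    `card_user_rows` is a list of dicts, each with a hypothetical
    'user_boolean' key (1 or 0) copied from the related User row.
    """
    # VALUES: the distinct flag values present in the current context.
    distinct_flags = {row["user_boolean"] for row in card_user_rows}
    total = 0
    # SUMX: visit each distinct value and add up the per-value result.
    for flag in distinct_flags:
        if flag:
            # Measure A evaluated with the flag filtered to this value: a row count.
            total += sum(1 for row in card_user_rows if row["user_boolean"] == flag)
        else:
            total += 16  # the fixed value, contributed once per distinct flag
    return total


rows = [{"user_boolean": 1}] * 3 + [{"user_boolean": 0}] * 2
```

With the sample rows, the flagged group contributes its row count and the unflagged group contributes the constant once, which is why a total over both groups is the sum of the two.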

Estimates in subpopulations with weighted data using survey() package

ftp://cran.r-project.org/pub/R/web/packages/survey/vignettes/domain.pdf
The complete dataset is tch2012. However, I am only interested in the subpopulation of tch2012 in which two criteria are met: age <= 5 and gender == "female". And within that subpopulation, I want to compare those with the disease (disease == "1") and without the disease (disease == "0").
This is the code I wrote:
library(survey)
tch2012.tsl.dsgn <- svydesign(id = ~HOSP_KID, strata = ~KID_STRATUM, weights = ~DISCWT, data = tch2012, nest = TRUE)
# create a pointer to the subpopulation of female pediatric patients age 5 years and under
tch2012_f_age5.tsl.dsgn <- subset(tch2012.tsl.dsgn, AGE <= 5 & gender == "female")
# weighted total number with and without the disease in female pediatric patients age 5 years and under
svyby(~count, ~disease, design = tch2012_f_age5.tsl.dsgn, svytotal)
However, I got the error message below when I ran svyby():
Error in sum(sapply(covmats, ncol)) : invalid 'type' (list) of argument
Since I am not very familiar with weighted data, I have no clue how to troubleshoot this.
Thanks in advance for the help!
This code works:
library(survey)
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
x <- subset( dclus1 , sch.wide == 'Yes' )
svyby(~api00, ~stype, design=x, svytotal)
Please edit your question to add a minimal reproducible example (see "How to make a great R reproducible example?").

Regroup By in PigLatin

In PigLatin, I want to group twice, so as to select rows using two different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the people whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age;
-- Then group again to select the richest; this GROUP BY fails:
E = GROUP D BY group; -- or: E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter on the nearest age and the most money at the same time, but it doesn't work, because the richest people can be much older than me.
Another, more realistic example:
You have a demands file like: iddem, idopedem, datedem
You have an operations file like: idope, labelope, dateope, idoftheday, infope
I want to return the operations that match demands, such that:
idopedem matches idope.
dateope must be the nearest to datedem.
If datedem - dateope >= 0, then I must select the operation with the max(idoftheday); otherwise I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH C GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP D BY iddem;
F = FOREACH E GENERATE group, MIN(D.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data for the 2nd example (note that the dates are already Unix timestamps):
Demands file (iddem, idopedem, datedem):
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
Operations file:
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
The result must be:
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
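To pin down the intended logic independently of Pig, here is a small Python sketch of the selection rule applied to the sample data above. It uses the `>= 0` comparison from the script, which is what reproduces the expected result. The function name match_operations, the tuple layout, and the rule that an equally near earlier-or-same date wins over a later one are assumptions made for illustration.

```python
def match_operations(demands, operations):
    """For each demand, pick the operation with the nearest date.

    demands:    (iddem, idopedem, datedem)
    operations: (idope, labelope, dateope, idoftheday, infope)
    Among the nearest operations, take the max idoftheday when
    datedem - dateope >= 0, otherwise the min idoftheday.
    """
    results = []
    for iddem, idopedem, datedem in demands:
        candidates = [op for op in operations if op[0] == idopedem]
        if not candidates:
            continue  # no matching operation for this demand
        min_delta = min(abs(datedem - op[2]) for op in candidates)
        nearest = [op for op in candidates if abs(datedem - op[2]) == min_delta]
        # Assumption: an equally near earlier-or-same date wins over a later one.
        on_or_before = [op for op in nearest if datedem - op[2] >= 0]
        if on_or_before:
            chosen = max(on_or_before, key=lambda op: op[3])
        else:
            chosen = min(nearest, key=lambda op: op[3])
        results.append((iddem,) + chosen)
    return results


demands = [(1, "ctr1", 1359460800000), (2, "ctr2", 1354363200000)]
operations = [
    ("ctr0", "toto", 1359460800000, 1, "blabla0"),
    ("ctr0", "tata", 1359460800000, 2, "blabla1"),
    ("ctr1", "toto", 1359460800000, 1, "blabla2"),
    ("ctr1", "tata", 1359460800000, 2, "blabla3"),
    ("ctr2", "toto", 1359460800000, 1, "blabla4"),
    ("ctr2", "tata", 1359460800000, 2, "blabla5"),
    ("ctr3", "toto", 1359460800000, 1, "blabla6"),
    ("ctr3", "tata", 1359460800000, 2, "blabla7"),
]
matched = match_operations(demands, operations)
```

The two steps (nearest date first, then min or max of idoftheday) correspond to the two GROUP BY passes the Pig script is trying to express.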
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgments. For example, with the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, then modify the formula. Likely a simple quadratic factor won't be sufficient. But with a little experimentation you can hit upon something that works for you.
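The comparisons above are easy to verify by evaluating the same formula outside Pig. The sketch below mirrors the expression money / POW(1+ABS($my_age-age)/10, 2); the function name score and the reference age of 30 are arbitrary choices for illustration.

```python
def score(money, age, my_age):
    """Same formula as the Pig expression: money / (1 + |my_age - age| / 10) ** 2."""
    return money / (1 + abs(my_age - age) / 10) ** 2


MY_AGE = 30  # arbitrary reference age for the comparison

same_age_100k = score(100_000, 30, MY_AGE)     # no age penalty
ten_years_350k = score(350_000, 40, MY_AGE)    # divided by (1 + 1) ** 2 = 4
twenty_years_500k = score(500_000, 50, MY_AGE) # divided by (1 + 2) ** 2 = 9
same_age_50k = score(50_000, 30, MY_AGE)       # no age penalty
```

Trying a few such pairs by hand is a quick way to see whether the penalty term matches your own preferences before running the full ORDER and LIMIT.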
