Estimates in subpopulations with weighted data using survey() package - survey

ftp://cran.r-project.org/pub/R/web/packages/survey/vignettes/domain.pdf
The complete dataset is tch2012. However, I am only interested in the subpopulation of tch2012 in which two criteria are met: age <= 5 and gender == "female". And within that subpopulation, I want to compare those with the disease (disease == "1") and without the disease (disease == "0").
This is the code I wrote:
library(survey)
tch2012.tsl.dsgn <- svydesign(id= ~HOSP_KID, strata= ~KID_STRATUM, weights = ~DISCWT, data = tch2012, nest = TRUE)
# create a pointer to the subpopulation of female pediatric patients age 5 years and under
tch2012_f_age5.tsl.dsgn <- subset(tch2012.tsl.dsgn, AGE <= 5 & gender == "female")
# weighted total number with and without the disease in female pediatric patients age 5 years and under
svyby(~count, ~disease, design=tch2012_f_age5.tsl.dsgn, svytotal)
However, I got the error message below when I ran svyby():
Error in sum(sapply(covmats, ncol)) : invalid 'type' (list) of argument
Since I am not very familiar with weighted data, I have no clue how to troubleshoot this.
Thanks in advance for the help!

This code works:
library(survey)
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
x <- subset( dclus1 , sch.wide == 'Yes' )
svyby(~api00, ~stype, design=x, svytotal)
Please edit your question by adding a minimal reproducible example (see "How to make a great R reproducible example?").

Related

DAX - RANKX Using Two Calculations

I have a data table that contains transactions by supplier. Each row of data represents one transaction. Each transaction contains a "QTY" column as well as a "Supplier" column.
I need to rank these suppliers by the count of transactions (Count of rows per unique supplier) then by the SUM of the "QTY" for all of each supplier's transactions. This needs to be in 1 rank formula, not two separate rankings. This will help in breaking any ties in my ranking.
I have tried dozens of formulas and approaches and can't seem to get it right.
See below example:
Suppliers ABC and EFG each have 4 transactions so they would effectively tie for Rank 1, however ABC has a Quantity of 30 and EFG has a QTY of 25 so ABC should rank 1 and EFG should rank 2.
Can anyone assist?
https://i.stack.imgur.com/vCsCA.png
Welcome to SO. You can create a new calculated column -
Rank =
var SumTable = SUMMARIZE(tbl, tbl[Supplier], "CountTransactions", COUNT(tbl[Transaction Number]), "SumQuantity", SUM(tbl[Quantity]))
var ThisSupplier = tbl[Supplier]
var ThisTransactions = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [CountTransactions])
var ThisQuantity = SUMX(FILTER(SumTable, [Supplier] = ThisSupplier), [SumQuantity])
// Count the suppliers that have more transactions, or the same number of
// transactions and at least as much quantity (this includes the supplier itself).
var ThisRank =
    FILTER(SumTable,
        [CountTransactions] > ThisTransactions ||
        ([CountTransactions] = ThisTransactions &&
         [SumQuantity] >= ThisQuantity))
return
    COUNTROWS(ThisRank)
Here's the final result -
I'm curious to see if anyone posts an alternative solution. In the meantime, give mine a try and let me know if it works as expected.
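For comparison, the same tie-breaking idea (rank by transaction count first, then by total quantity) can be sketched outside DAX. The Python below is only an illustration: the function name `rank_suppliers` and the sample rows are made up, with the rows chosen to match the totals in the question (ABC: 4 transactions, quantity 30; EFG: 4 transactions, quantity 25).

```python
from collections import defaultdict


def rank_suppliers(transactions):
    """transactions: iterable of (supplier, qty) pairs, one per transaction."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for supplier, qty in transactions:
        counts[supplier] += 1
        totals[supplier] += qty
    # Sort by transaction count, then by total quantity, both descending.
    ordered = sorted(counts, key=lambda s: (counts[s], totals[s]), reverse=True)
    return {supplier: position for position, supplier in enumerate(ordered, start=1)}


# Hypothetical sample data.
sample = (
    [("ABC", q) for q in (10, 10, 5, 5)]
    + [("EFG", q) for q in (10, 5, 5, 5)]
    + [("XYZ", q) for q in (100, 100)]
)
ranks = rank_suppliers(sample)
```

Note that in this sketch suppliers that tie on both count and quantity still receive distinct, consecutive ranks.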

Tournament scheduling issues

I'm currently working on a one-day tournament scheduling application. Since the number of participating teams is different each year, I want to automate the scheduling.
Teams are split into 2 groups. Each group plays a single round-robin tournament.
I managed to generate all the games to play, but I'm struggling with the planning.
Additionally, the teams need to compete in 3 different sport disciplines, each with a dedicated field (e.g. football field, volleyball field).
Given:
- games to play
- fields per sport + available timeslots per field (slots of +-15 minutes)
assumptions:
- timeslots are not limited
- 1 field per sport available
- schedule doesn't need to be balanced in 1st iteration
problems:
- the quality of my schedule is not that good: not all timeslots are filled, even when a fuller solution exists, and the 'density' of the schedule depends on the order in which the games are processed.
code snippet:
//algo
while (_games.Any())
{
    gameToPlan = _games.Dequeue();
    var occupiedHomeTeam = GetTimeslotsOccupiedByTeam(gameToPlan.HomeTeam);
    var occupiedAwayTeam = GetTimeslotsOccupiedByTeam(gameToPlan.AwayTeam);
    var occupiedTeams = occupiedHomeTeam.Union(occupiedAwayTeam);
    var availableFields = fields.Where(f => f.AllowedSports.Contains(gameToPlan.Sport))
                                .Where(f => f.Timeslots.Any(t => t.Game == null &&
                                                                 !t.Occupied &&
                                                                 !occupiedTeams.Any(oc => oc.Start == t.Start &&
                                                                                          oc.End == t.End)));
    if (!availableFields.Any())
    {
        _games.Enqueue(gameToPlan);
        continue;
    }
    var field = availableFields.First();
    var timeSlots = field.Timeslots.Where(t => t.Game == null &&
                                               !t.Occupied &&
                                               !occupiedTeams.Any(oc => oc.Start == t.Start &&
                                                                        oc.End == t.End))
                                   .OrderBy(t => t.Start);
    if (!timeSlots.Any())
    {
        _games.Enqueue(gameToPlan);
        continue;
    }
    var ts = timeSlots.First();
    ts.Occupied = true;
    ts.Game = gameToPlan;
    gameToPlan.Timeslot = ts;
    gameToPlan.TimeslotId = ts.Id;
    _uow.Save();
}
Can anyone give me an overview of possible approaches and available algorithms?
Thanks in advance.
This is clearly a discrete optimization problem. For tournament and timetabling problems, you should think about using a constraint programming solver. You need to be familiar with linear/integer programming to do so. For example, you can use Choco Solver, which is written in Java. Fun fact: the latest question on their forum is about tournament scheduling.
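To make the constraints concrete, here is a minimal sketch in Python. It is a plain backtracking search, not a constraint programming solver and not Choco, and it assumes one field per sport as stated in the question. The names `schedule` and `round_robin` are hypothetical. Unlike the greedy loop in the question, it undoes a placement when a later game cannot be placed, so the result does not depend on being lucky with the processing order.

```python
from itertools import combinations


def schedule(games, num_slots):
    """Assign every game a slot index, or return None if num_slots is too few.

    games: list of (sport, home, away) tuples; one field per sport is assumed,
    so a (sport, slot) pair identifies a single field timeslot.
    """
    assignment = {}
    field_busy = set()  # (sport, slot) pairs already used
    team_busy = set()   # (team, slot) pairs already used

    def place(index):
        if index == len(games):
            return True  # every game has a slot
        game = games[index]
        sport, home, away = game
        for slot in range(num_slots):
            keys = {(home, slot), (away, slot)}
            if (sport, slot) in field_busy or keys & team_busy:
                continue
            field_busy.add((sport, slot))
            team_busy.update(keys)
            assignment[game] = slot
            if place(index + 1):
                return True
            # Undo the choice and try the next slot.
            field_busy.remove((sport, slot))
            team_busy.difference_update(keys)
            del assignment[game]
        return False

    return assignment if place(0) else None


def round_robin(teams, sports):
    """Every pair of teams meets once per sport."""
    return [(sport, a, b) for sport in sports for a, b in combinations(teams, 2)]


games = round_robin(["A", "B", "C", "D"], ["football", "volleyball"])
plan = schedule(games, num_slots=6)
```

Calling `schedule` with an increasing `num_slots` until it returns a plan gives the densest schedule for small inputs. Backtracking can become slow as the number of teams grows, which is where a dedicated solver earns its keep.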

How can I increase my python-code speed?

I have a dataframe, df1, that reports courses students have taken, where ID is the student’s id, COURSES is a list of courses taken by the student, and TYPE and MAJOR are student attributes. The dataframe looks like this:
ID  COURSES                                             TYPE      MAJOR
1   ['Intr To Archaeology', 'Statics', 'Circuits I…]    Freshman  EEEL
2   ['Signals & Systems I', 'Instrumentation'…]         Transfer  EEEL
3   ['Keyboard Competence', 'Elementary … ]             Freshman  EEEL
4   ['Cultural Anthro', 'Vector Analysis' … ]           Freshman  EEEL
I created a new dataframe, df2, that reports a dissimilarity measure for each pair of students based on the courses they’ve taken. df2 looks like this:
I created it using the following script, but it runs very slowly (there are thousands of students). Can someone suggest a more efficient way to create df2?
One major problem is that the script below calculates the distance between (student 1 and student 2) and (student 2 and student 1), which is redundant since the distances are the same. However, the condition I created to prevent this:
if (id1 >= id2):
    continue
doesn't work.
Entire script:
import ast

# df is the student dataframe described above (df1)
for id1, student1 in df.iterrows():
    for id2, student2 in df.iterrows():
        if (id1 >= id2):
            continue
        ID_1 = student1["ID"]
        ID_2 = student2["ID"]
        # courses as list strings
        s1 = student1["COURSES"]
        s2 = student2["COURSES"]
        try:
            # courses as sets
            courses1 = set(ast.literal_eval(s1))
            courses2 = set(ast.literal_eval(s2))
            distance = float(len(courses1.symmetric_difference(courses2)))/(len(courses1) + len(courses2))
        except:
            # Some strings seem to have a different format
            distance = -1
        ID_1_Transfer = 1 if student1["TYPE"] == "Transfer" else 0
        ID_2_Transfer = 1 if student2["TYPE"] == "Transfer" else 0
        df2 = df2.append({'ID_1': ID_1, 'ID_2': ID_2, 'Distance': distance, 'ID_1_Transfer': ID_1_Transfer, 'ID_2_Transfer': ID_2_Transfer}, ignore_index=True)
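One possible way to speed this up, sketched under a few assumptions: parse each course string once instead of once per pair, visit each unordered pair exactly once with `itertools.combinations`, and collect plain dictionaries in a list rather than appending to a dataframe inside the loop. The function name `pairwise_distances` and the tuple-based input are hypothetical choices for illustration; the distance formula and the -1 fallback are taken from the script above.

```python
import ast
from itertools import combinations


def pairwise_distances(students):
    """students: iterable of (student_id, courses_string, student_type) tuples."""
    # Parse every course list once, up front.
    parsed = []
    for student_id, courses, student_type in students:
        try:
            course_set = set(ast.literal_eval(courses))
        except (ValueError, SyntaxError, TypeError):
            course_set = None  # malformed string, reported as distance -1 below
        parsed.append((student_id, course_set, int(student_type == "Transfer")))

    rows = []
    # combinations() yields each unordered pair exactly once.
    for (id1, c1, t1), (id2, c2, t2) in combinations(parsed, 2):
        if c1 is None or c2 is None or not (c1 or c2):
            distance = -1.0
        else:
            distance = len(c1 ^ c2) / (len(c1) + len(c2))
        rows.append({"ID_1": id1, "ID_2": id2, "Distance": distance,
                     "ID_1_Transfer": t1, "ID_2_Transfer": t2})
    return rows


# Hypothetical sample input.
rows = pairwise_distances([
    (1, "['A', 'B']", "Freshman"),
    (2, "['B', 'C']", "Transfer"),
    (3, "oops[", "Freshman"),
])
```

The returned list can be turned into a dataframe in a single call, for example `pd.DataFrame(rows)`, which avoids copying the dataframe on every appended row. The number of pairs still grows quadratically with the number of students.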

multiple group by using linq

I need to return just two rows from my query: one with the string Today and the number of cases closed today, and a second with the string Last Week and the number of cases closed last week.
How do I group by a date range?
Sum         Name
----------- ----------
12          Today
33          Last Week
How about this:
var caseCounts = Cases
    .Where(c => c.Date == today || (c.Date >= startOfLastWeek && c.Date <= endOfLastWeek))
    .GroupBy(c => c.Date == today ? "Today" : "Last Week")
    .Select(g => new {
        Name = g.Key, Sum = g.Count()
    });
You would need to define the 3 dates (today, startOfLastWeek, endOfLastWeek) beforehand, but it gives you the results you are after.
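As an illustration of how the three dates might be derived, here is a small sketch in Python rather than C#. It assumes that "last week" means the Monday-to-Sunday week before the current one, which the question does not specify, and the function name `count_closed_cases` is made up.

```python
from collections import Counter
from datetime import date, timedelta


def count_closed_cases(closed_dates, today):
    """Count cases closed today and during the previous calendar week."""
    # Assumption: weeks run Monday to Sunday (date.weekday() is 0 on Monday).
    start_of_this_week = today - timedelta(days=today.weekday())
    start_of_last_week = start_of_this_week - timedelta(days=7)
    end_of_last_week = start_of_this_week - timedelta(days=1)

    counts = Counter()
    for closed in closed_dates:
        if closed == today:
            counts["Today"] += 1
        elif start_of_last_week <= closed <= end_of_last_week:
            counts["Last Week"] += 1
    return counts


# Hypothetical sample: 10 January 2024 is a Wednesday.
sample_counts = count_closed_cases(
    [date(2024, 1, 10), date(2024, 1, 10), date(2024, 1, 3),
     date(2024, 1, 7), date(2024, 1, 8), date(2023, 12, 31)],
    today=date(2024, 1, 10),
)
```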
GROUP BY YEARWEEK(date) should work. Depending on your dbms, you might be able to use another function, or program your own.
http://www.tutorialspoint.com/sql/sql-date-functions.htm#function_yearweek

Regroup By in PigLatin

In PigLatin, I want to group twice, so as to select rows according to two different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the people whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); # group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age
--Then group by as to select the richest, group by fails :
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter on the nearest age and the most money at the same time, but it doesn't work, because the richest people can be much older than I am.
Another, more realistic example:
You have a demands file with the fields: iddem, idopedem, datedem
You have an operations file with the fields: idope, labelope, dateope, idoftheday, infope
I want to return the operations that match the demands, as follows:
idopedem matches idope.
The dateope must be the nearest to datedem.
If datedem - dateope >= 0, then I must select the operation with the max(idoftheday); otherwise I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH E GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP C BY iddem;
F = FOREACH D GENERATE group, MIN(C.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data for the 2nd example (note that the dates are already in Unix format):
You have a demands file like:
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
You have an operations file like:
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
The result must be:
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
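To pin down the selection rule independently of Pig, it can be written as a short Python sketch over the sample data above. The function name `match_operations` is hypothetical. One assumption is labeled in the code: when two operations are equally near but lie on opposite sides of datedem, the single sort key below picks one deterministically, which the description above does not cover.

```python
def match_operations(demands, operations):
    """Pick one operation per demand.

    demands: (iddem, idopedem, datedem) tuples.
    operations: (idope, labelope, dateope, idoftheday, infope) tuples.
    """
    results = []
    for iddem, idopedem, datedem in demands:
        candidates = [op for op in operations if op[0] == idopedem]
        if not candidates:
            continue  # no operation matches this demand

        def sort_key(op):
            _, _, dateope, idoftheday, _ = op
            # Nearest date first; then max idoftheday when datedem - dateope >= 0,
            # else min idoftheday. Mixed-sign ties are an unspecified case (assumption).
            tie_break = -idoftheday if datedem - dateope >= 0 else idoftheday
            return (abs(datedem - dateope), tie_break)

        best = min(candidates, key=sort_key)
        results.append((iddem,) + best)
    return results


demands = [(1, "ctr1", 1359460800000), (2, "ctr2", 1354363200000)]
operations = [
    ("ctr0", "toto", 1359460800000, 1, "blabla0"),
    ("ctr0", "tata", 1359460800000, 2, "blabla1"),
    ("ctr1", "toto", 1359460800000, 1, "blabla2"),
    ("ctr1", "tata", 1359460800000, 2, "blabla3"),
    ("ctr2", "toto", 1359460800000, 1, "blabla4"),
    ("ctr2", "tata", 1359460800000, 2, "blabla5"),
    ("ctr3", "toto", 1359460800000, 1, "blabla6"),
    ("ctr3", "tata", 1359460800000, 2, "blabla7"),
]
matched = match_operations(demands, operations)
```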
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgment. For example, with the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, modify the formula. A simple quadratic factor likely won't be sufficient, but with a little experimentation you can hit upon something that works for you.
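The comparisons above can be checked with a few lines of Python that evaluate the same formula. The helper names `score` and `top_n` and the reference age of 30 are arbitrary choices for illustration.

```python
def score(money, age, my_age):
    """Same formula as the Pig expression: money / (1 + |my_age - age| / 10)^2."""
    return money / (1 + abs(my_age - age) / 10) ** 2


def top_n(people, my_age, n=10):
    """people: (name, age, money) tuples; returns the n best by score."""
    return sorted(people, key=lambda p: score(p[2], p[1], my_age), reverse=True)[:n]


# Hypothetical people matching the cases discussed above, with my_age = 30.
people = [
    ("same age, $50k", 30, 50_000),
    ("20 years older, $500k", 50, 500_000),
    ("10 years older, $350k", 40, 350_000),
    ("same age, $100k", 30, 100_000),
]
ranking = top_n(people, my_age=30)
```

With these inputs the scores are 100,000 for the same-age person with $100,000, 87,500 for the person 10 years older with $350,000, about 55,556 for the person 20 years older with $500,000, and 50,000 for the same-age person with $50,000.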
