Trouble specifying datasummary() formula - modelsummary

I am getting weird output from my datasummary code. The idea is to create a table that shows the mean and SD for numeric variables and the number of observations for the full sample. I also want to display the shares for the two levels of a binary factor variable. Currently, I get the mean and SD for the only numeric variable (which makes sense), but N is also shown only for the numeric variable, and the value shown is not the number of observations but the first number in the numeric variable vector. This is my current code:
age is the numeric variable.
Variables 2 to 4 (education, religion, sex) are factor variables.
obama is a factor variable, and I want the table to show the share for each of its 2 levels.
datasummary(formula = age + (`educated parent` = education) + religion + sex ~
                Heading("Entire sample") * 1 * (Mean + SD + N) + obama * Percent(),
            fmt = 3,
            data = data,
            title = 'Table 1: Votes for Obama in 2012 - Summary statistics',
            notes = c('1 = voted for Obama',
                      'educated parent: 1 = at least one parent has a degree',
                      'Source: General social survey'))
I am getting the warnings
Warning messages:
1: Summary statistic is length 1693
2: Summary statistic is length 1261
3: Summary statistic is length 432
4: Summary statistic is length 335
5: Summary statistic is length 379
6: Summary statistic is length 123
7: Summary statistic is length 856
8: Summary statistic is length 728
9: Summary statistic is length 965
These are the values I want displayed in the "N" column.
The table I get as output looks like this:
Table 1:
                                   0        1
age                           37.507   62.493
educated parent  0            27.998   46.486
                 1             9.510   16.007
religion         None          3.662   16.125
                 Catholic      8.919   13.467
                 Other         1.713    5.552
                 Protestant   23.213   27.348
sex              Male         18.252   24.749
                 Female       19.256   37.744
1 = voted for Obama
educated parent: 1 = at least one parent has a degree
Source: General social survey
The data is taken from gss_sm from the socviz package. I have created a new religion and a new education variable. Religion is a 4 level factor, and education is a 2 level factor.
I have tried writing my own n function,
`n<-function() {
if(class(x)!="numeric"){
n<-length(x)
}
else{
n<-sum(!is.na(x))
}
formatC(n,digits=0)
}
`
and plugging that in in place of "N".
It seems as if it is the N function that isn't working.

Related

Algorithms for optimal student seating arrangements

Say I need to place n=30 students into groups of between 2 and 6, and I collect the following preference data from each student:
Student Name: Tom
Likes to sit with: Jimi, Eric
Doesn't like to sit with: John, Paul, Ringo, George
It's implied that they're neutral about any other student in the overall class that they haven't mentioned.
How might I best run a large number of simulations of many different/random grouping arrangements, to be able to determine a score for each arrangement, through which I could then pick the "most optimal" score/arrangement?
Alternatively, are there any other methods by which I might be able to calculate a solution that satisfies all of the supplied constraints?
I'd like a generic method that can be reused on different class sizes each year, but within each simulation run, the following constants and variables apply:
Constants: Total number of students, Student preferences
Variables: Group sizes, Student Groupings, Number of different group arrangements/iterations to test
Thanks in advance for any help/advice/pointers provided.
I believe you can state this as an explicit mathematical optimization problem.
Define the binary decision variables:
x(p,g) = 1 if person p is assigned to group g
0 otherwise
I used:
I used your data set with 28 persons, and your preference matrix (with -1,+1,0 elements). For groups, I used 4 groups of 6 and 1 group of 4. A solution can look like:
---- 80 PARAMETER solution using MIQP model
group1 group2 group3 group4 group5
aimee 1
amber-la 1
amber-le 1
andrina 1
catelyn-t 1
charlie 1
charlotte 1
cory 1
daniel 1
ellie 1
ellis 1
eve 1
grace-c 1
grace-g 1
holly 1
jack 1
jade 1
james 1
kadie 1
kieran 1
kristiana 1
lily 1
luke 1
naz 1
nibah 1
niko 1
wiki 1
zeina 1
COUNT 6 6 6 6 4
Notes:
This model can be linearized, so it can be fed into a standard MIP solver
I solved this directly as a MIQP model (actually the solver reformulated the model into a MIP). The model solved in a few seconds.
Probably we need to add extra logic to make sure one person is not getting a really bad assignment. We optimize here only the total sum. This overall sum may allow an individual to get a bad deal. It is an interesting exercise to take this into account in the model. There are some interesting trade-offs.
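One way such a model might be written is sketched below. The symbols are assumptions introduced for illustration and are not taken from the answer: w_{pq} is the preference of person p for person q (-1, 0 or +1) and s_g is the chosen size of group g.

```latex
\begin{align*}
\max \quad & \sum_{g} \sum_{p \neq q} w_{pq} \, x_{p,g} \, x_{q,g} \\
\text{s.t.} \quad & \sum_{g} x_{p,g} = 1 && \text{for every person } p \\
& \sum_{p} x_{p,g} = s_g && \text{for every group } g \\
& x_{p,g} \in \{0, 1\}
\end{align*}
```

The objective is quadratic because each term multiplies two binary variables. Replacing each product with an auxiliary binary variable is the usual route to a linear model.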
A first approach would be to create an n x n matrix, where n is the total number of students. Rows and columns are indexed by student, and each row holds that student's preferences for sitting with the others. Fill the cells with 1 = likes to sit with, -1 = the opposite, 0 = neutral. The main diagonal (i,i) is also filled with zeroes.
        Mark  Maria  John  Peter
Mark       0      1    -1      1
Maria      0      0    -1      1
John      -1      1     0      1
Peter      0
Score calculations are based on sums of these values. For example, John likes to sit with Maria (1), but Maria doesn't like to sit with John (-1), so the pair scores 0. The best result for a pair is 2, when both like each other.
Then, for the given group sizes, calculate the score of each possible combination. The bigger the score, the better the arrangement. Combinations exclude the values on the main diagonal, i.e. John grouped with John is not a valid combination.
In a group size of 2, best score is 2
In a group size of 3, best score is 6,
In a group size of 4, best score is 12
In a group size of n, best score would be (n-1)*n
Now, from the list of combinations ordered by score, take the highest-scoring groups first, skipping any group that contains a student who is already placed in a previously chosen group.
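The scoring idea above, combined with the random simulation the question asks about, could be sketched roughly as follows. This is only an illustration: the function names, the dict-of-dicts preference format and the seven-student sample (built from the names in the question) are assumptions, and plain random search is swapped in for the greedy selection described above.

```python
import random

def group_score(prefs, group):
    """Sum the preference values over all ordered pairs a != b in one group."""
    return sum(prefs[a].get(b, 0) for a in group for b in group if a != b)

def random_search(prefs, sizes, iterations=1000, seed=0):
    """Try random partitions with the given group sizes; keep the best total score."""
    rng = random.Random(seed)
    students = list(prefs)
    assert sum(sizes) == len(students), "group sizes must cover every student"
    best_groups, best_score = None, float("-inf")
    for _ in range(iterations):
        rng.shuffle(students)
        groups, start = [], 0
        for size in sizes:
            groups.append(students[start:start + size])  # slice = fresh copy
            start += size
        score = sum(group_score(prefs, g) for g in groups)
        if score > best_score:
            best_groups, best_score = groups, score
    return best_groups, best_score

# Hypothetical sample: only Tom has stated preferences, everyone else is neutral.
names = ["Tom", "Jimi", "Eric", "John", "Paul", "Ringo", "George"]
prefs = {name: {} for name in names}
prefs["Tom"] = {"Jimi": 1, "Eric": 1, "John": -1, "Paul": -1, "Ringo": -1, "George": -1}

best_groups, best_score = random_search(prefs, sizes=[3, 4])
print(best_groups, best_score)
```

Random search scales poorly as the class grows, so for 30 students it is better treated as a baseline to compare other methods against.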
In recent research, particle swarm optimization (PSO) was used to assign students to an unknown number of groups of 4 to 6. PSO performed better than a genetic algorithm (GA). I think that research is what you need.
The paper is: Forming automatic groups of learners using particle swarm optimization for applications of differentiated instruction
You can find the paper here: https://doi.org/10.1002/cae.22191
Perhaps the researchers could guide you via ResearchGate: https://www.researchgate.net/publication/338078753
For optimal seating, you need to specify an objective function based on your specific data.

How to define an algorithm that gives a ranking number for a dentist?

I have some problems defining an algorithm that will calculate a ranking number for a dentist.
Assume, we have three different dentists:
dentist number 1: has 125 patients, and out of the 125 patients the dentist has booked a time with 75 of them. 60% of them got a time.
dentist number 2: has 5 patients, and out of the 5 patients the dentist has booked a time with 4 of them. 80% of them got a time.
dentist number 3: has 25 patients, and out of the 25 patients the dentist has booked a time with 14 of them. 56% of them got a time.
If we use the formula:
patients booked time with / totalpatients * 100
this will not be the right way to calculate the ranking: the higher the percentage, the better the dentist would rank, which is wrong. Calculated that way, the dentists would be ranked:
dentist number 2 would have a ranking of 1. (80% got a time).
dentist number 1 would have a ranking of 2 (60% got a time).
dentist number 3 would have a ranking of 3. (56% got a time).
But, it should be in this way:
dentist number 1 = ranking 1
dentist number 2 = ranking 2
dentist number 3 = ranking 3
I don't know how to make an algorithm that also takes the number of patients into account in the ranking calculation.
It is quite arbitrary how you define what makes a better dentist in terms of number of patients and the percentage of those that have an appointment with them.
Let's call the number of patients P, the number of those that have an appointment A, and the function determining how "good" a dentist is f. So f would be a function of P and A: f(P, A).
One component of f could indeed be what you already calculated: A/P.
Another component would have to be P, but I would think that the effect on f(P, A) of increasing P by 1 should be much higher for a low P than for a high P, so this component should not be a linear function. It would also be practical if this component had a value between 0 and 1, just like the other component.
Taking all this together, I suggest this definition of f, which will give a number between 0 and 1:
f(P,A) = 1/3 * P/(10 + P) + 2/3 * A/P
For the different dentists, this results in:
1: 1/3 * 125/135 + 2/3 * 75/125 = 0.7086419753...
2: 1/3 * 5/15 + 2/3 * 4/5 = 0.6444444444...
3: 1/3 * 25/35 + 2/3 * 14/25 = 0.6114285714...
You could play a bit with the constant factors in the formula, like increasing the term 10. Or you could change the factors 1/3 and 2/3 making sure that their sum is 1.
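As a quick check of the formula, here is a small Python sketch. The function and parameter names are made up for illustration: `k` stands for the constant 10 and `weight_volume` for the factor 1/3.

```python
def dentist_score(patients, appointments, k=10, weight_volume=1/3):
    """f(P, A) = w * P / (k + P) + (1 - w) * A / P, a value between 0 and 1."""
    volume = patients / (k + patients)   # grows with P, but flattens out for large P
    rate = appointments / patients       # share of patients with an appointment
    return weight_volume * volume + (1 - weight_volume) * rate

# (patients, appointments) for the three dentists in the question
dentists = {1: (125, 75), 2: (5, 4), 3: (25, 14)}
scores = {d: dentist_score(p, a) for d, (p, a) in dentists.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Raising `k` makes a large patient count count for more, and changing `weight_volume` shifts the balance between volume and booking rate.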
This is just one way to do it. There are an infinity of other ways...

Summarize different category rankings

I determine the rankings of, e.g., 1000 participants in multiple categories.
The results look something like this:
Participant/Category/Place (lower is better):
A|1|1.
A|2|1.
A|3|1.
A|4|7.
B|1|2.
B|2|2.
B|3|2.
B|4|4.
[...]
Now I want to summarize the rankings. The standard method would be to sum up all places and divide by the number of categories:
Participant A: (1+1+1+7) / 4 = 2.5
Participant B: (2+2+2+4) / 4 = 2.5
But I want to favor participant A, because he has won 3 of 4 categories.
I could define fixed points for all places, e.g.:
Place|Points
1|1000
2|500
3|250
4|125
5|62.5
6|31.25
7|15.625
[...]
Participant A: 1000+1000+1000+15.625 = 3015.625
Participant B: 500+500+500+125 = 1625
The problem now is that I want to give every place some points, so that low places can still be sorted. If I keep halving the available points, the number of decimal places needed exceeds what is available (available points / 2^number of places).
What can I do?
How about using harmonic mean?
4 / (1/1 + 1/1 + 1/1 + 1/7) = 1.272727
4 / (1/2 + 1/2 + 1/2 + 1/4) = 2.285714
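The two values above can be reproduced with Python's standard library. This is only a sketch: the variable names are illustrative, and a lower harmonic mean means a better overall rank.

```python
from statistics import harmonic_mean

# Places per category for the two participants in the question
places = {"A": [1, 1, 1, 7], "B": [2, 2, 2, 4]}

summary = {name: harmonic_mean(ranks) for name, ranks in places.items()}
overall = sorted(summary, key=summary.get)  # best (lowest) first
print(summary, overall)
```

Because the harmonic mean is dominated by the smallest values, three first places outweigh one seventh place, which is the behaviour asked for.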

Find the nearest nice number

Given a base currency of GBP £, and a table of other currencies accepted in a shop:
Currency Symbol Subunits LastToGBPRate
------------------------------------------------------
US Dollars $ 100 0.592662000
Euros € 100 0.810237000
Japanese Yen ¥ 1 0.005834610
Bitcoin ฿ 100000000 301.200000000
We have a working method that converts a given amount in GBP Pence (AKA cents) into Currency X cents. Given a price of 999 (£9.99), for the above currencies it would return:
Currency Amount
---------------------
US Dollars 1686
Euros 1233
Japanese Yen 1755
Bitcoin 3482570
This is all working absolutely fine. We then have a Format Currency method which converts them all into nice looking numbers:
Currency Formatted
---------------------
US Dollars $16.86
Euros €12.33
Japanese Yen ¥1755
Bitcoin ฿0.03482570
Now the problem we want to solve, is to round these amounts to the nearest meaningful pretty number in a general purpose algorithm given the information above.
This serves two important benefits:
Prices for most currencies should appear static for visitors over short-medium term time frames
Presents the visitor with a culturally meaningful price point which encourages sales
A meaningful number is one where the smallest unit displayed isn't smaller than the value of say £0.10, and a pretty number is one which ends in 49 or 99. Example outputs:
Currency Formatted Meaningful and Pretty
-----------------------------------------------------
US Dollars $16.86 $16.99
Euros €12.33 €12.49
Japanese Yen ¥1755 ¥1749
Bitcoin ฿0.03482570 ฿0.0349
I know it is possible to do this with a single algorithm with all the information given, but I'm struggling to work out even where to start. Can anyone show me how to achieve this, or give pointers?
Please note, storing a general formatting rule for each currency is not adequate: if, for example, the price of Bitcoin increases 10x, the formatting rule will need updating. I'm looking for a solution that doesn't need any manual maintenance/checking.
For a given decimal value X, you want to find the integer Y such that YA + B is as close as possible to X, for some given A and B. E.g. in the case of dollars, you have A = .5 and B = .49.
In general, for your problem, A and B can be computed via the formula:
V = value of £0.10 in target currency
K = smallest power of ten (10^k) such that 9*10^k >= V
and k <= -2 (this condition I added based on your examples, but contrary
to your definition)
= 10^min(-2, ceil(log10(V / 9)))
A = 50 * K
B = 49 * K
Note that without the extra condition, since 0.09 dollars is less than 0.10 pounds, we would get 14.9 as the result for 16.86 dollars.
With some transformation we get
Y ~ (X - B) / A
And since Y is integer, we have
Y = round((X - B) / A)
The result is then YA + B.
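The rounding step Y = round((X - B) / A) could be sketched in Python like this. The function name is made up, and the K values in the calls (0.01, 0.01, 1 and 0.0001) are picked by hand to match the example outputs in the question rather than derived from the formula for K.

```python
from decimal import Decimal

def nearest_pretty(x, k):
    """Return Y*A + B closest to x, with A = 50*K and B = 49*K."""
    a = 50 * k
    b = 49 * k
    y = round((x - b) / a)   # Y = round((X - B) / A), an integer
    return y * a + b

usd = nearest_pretty(Decimal("16.86"), Decimal("0.01"))
eur = nearest_pretty(Decimal("12.33"), Decimal("0.01"))
jpy = nearest_pretty(Decimal("1755"), Decimal("1"))
btc = nearest_pretty(Decimal("0.03482570"), Decimal("0.0001"))
print(usd, eur, jpy, btc)
```

Using `decimal.Decimal` instead of floats keeps the results exact, which matters when the output is shown as a price.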
Convert £0.10 to the current currency to determine the smallest displayable digit (SDD)
(bounded by the number of available digits in that currency).
Now we basically have 3 choices of numbers:
... (3rdSDD-1) 9 9 (if 3rdSDD is 0, it will obviously carry from 4thSDD and so on, as subtraction normally works)
We'll pick this when 10*2ndSDD + 1stSDD < 24
... 3rdSDD 4 9
We'll pick this when 24 <= 10*2ndSDD + 1stSDD < 74
... 3rdSDD 9 9
We'll pick this when 74 <= 10*2ndSDD + 1stSDD
It should be trivial to figure it out from here.
Some multiplication and modulus to get you 2ndSDD and 1stSDD.
Basic subtraction to get you ... (3rdSDD-1).
A few if-statements to pick one of the above cases.
Example:
For $16.86, our 3 choices are $15.99, $16.49 and $16.99.
We pick $16.99 since 74 < 86.
For €12.33, our 3 choices are €11.99, €12.49 and €12.99.
We pick €12.49 since 24 <= 33 < 74.
For ¥1755, our 3 choices are ¥1699, ¥1749 and ¥1799.
We pick ¥1749 since 24 <= 55 < 74.
For ฿0.03482570, our 3 choices are ฿0.0299, ฿0.0349 and ฿0.0399.
We pick ฿0.0349 since 24 <= 48 < 74.
And, just to show the carry:
For $100000.23, our 3 choices are $99999.99, $100000.49 and $100000.99.
We pick $99999.99 since 23 < 24.
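The three cases can be sketched in a few lines of Python. The sketch assumes the price has already been converted to an integer count of the smallest displayable digit (for example 1686 for $16.86, or 348 for ฿0.0348 after truncating to four decimals); the function name is hypothetical, and a remainder of exactly 74 goes to the upper case here.

```python
def pretty_price(units):
    """Round an integer price (in smallest-displayable-digit units) to ...49 or ...99."""
    last_two = units % 100        # 10*2ndSDD + 1stSDD
    base = units - last_two       # price with the last two digits zeroed
    if last_two < 24:
        return base - 1           # previous ...99, the carry is handled by the subtraction
    if last_two < 74:
        return base + 49          # ...49
    return base + 99              # ...99

for units in (1686, 1233, 1755, 348, 10000023):
    print(units, "->", pretty_price(units))
```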
Here's an ugly answer:
def retail_round(number):
    """takes a decimal.Decimal and retail rounds it"""
    ending_digits = str(number)[-2:]
    if not ending_digits in ("49","99"):
        rounding_adjust = (99 - int(ending_digits)) % 50
        if rounding_adjust <= 25:
            number = str(number)[:-2]+str(int(ending_digits)+int(rounding_adjust))
        else:
            if str(number)[-3] == '.':
                number = str(int(number) - .01)
            else:
                number = str(int(str(number)[:-2]+"00")-1)
    return decimal.Decimal(number)
>>> import decimal
>>> retail_round(decimal.Decimal("15.50"))
Decimal('14.99')
>>> retail_round(decimal.Decimal("15.51"))
Decimal('14.99')
>>> retail_round(decimal.Decimal("15.75"))
Decimal('15.99')
>>> retail_round(decimal.Decimal("1575"))
Decimal('1599')
>>> retail_round(decimal.Decimal("1550"))
Decimal('1499')
EDIT: this is a bit better solution, using decimal.Decimal
import collections

Currency = collections.namedtuple("Currency",["name","symbol",
                                              "subunits"])
def retail_round(currency, amount):
    """returns a decimal.Decimal amount of the currency, rounded to
    49 or 99."""
    adjusted = ( amount / currency.subunits ) % 100 # last two digits
    print(adjusted)
    if adjusted < 24:
        amount -= (adjusted + 1) * currency.subunits # down to 99
    elif 24 <= adjusted < 74:
        amount -= (adjusted - 49) * currency.subunits # to 49
    else:
        amount -= (adjusted - 99) * currency.subunits # up to 99
    return amount
Calculate the granularity of the price; assume it's something like 0.00001. (You can do that by converting £0.10 to the currency, taking the base-10 log of it, taking the ceiling, and raising 10 to that power.)
Eg: £0.10 = 17.1421309¥
log(17.1421309) = 1.234
ceil(1.234) = 2
10^2 = 100
so
¥174055 will be ¥174900
Adjust the number for the digit, add 1, round to 50, subtract 1:
174055 -> (round((174055/100+1)/50)*50-1)*100 = 174900
Plain and simple.
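A minimal Python sketch of these two steps follows. The function and parameter names are made up for illustration, and `ten_pence_in_currency` is assumed to be the value of £0.10 in the target currency, as in the example.

```python
import math

def round_to_pretty(amount, ten_pence_in_currency):
    """Scale by the display unit, add 1, round to a multiple of 50, subtract 1, scale back."""
    unit = 10 ** math.ceil(math.log10(ten_pence_in_currency))   # e.g. 10^2 = 100
    return (round((amount / unit + 1) / 50) * 50 - 1) * unit

result = round_to_pretty(174055, 17.1421309)
print(result)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a value sitting exactly on a boundary may go the other way than expected.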

How to calculate one certain value from a rolling-window estimation in Stata

I'm using Stata to estimate Value-at-Risk (VaR) with the historical simulation method. Basically, I will create a rolling window of 100 observations to estimate VaR for the next 250 days (repeated 250 times). As far as I know, the rolling command for time series in Stata should be useful in this case. Here is the process:
Input: 350 values
1. Ascending sort the very first 100 values (by magnitude).
2. Then I need to take the 5th smallest for each window.
3. Repeat 250 times.
Output: a list of the 5th values (250 in total).
Sounds simple, but I cannot get it right. This was my attempt:
program his,rclass
sort lnreturn
return scalar actual=lnreturn in 5
end
tsset stt
time variable: stt, 1 to 350
delta: 1 unit
rolling actual=r(actual), window(100) saving(C:\result100.dta, replace) : his
(running his on estimation sample)
And the result is:
Start end actual
1 100 -.047856
2 101 -.047856
3 102 -.047856
4 103 -.047856
.... ..... ......
251 350 -.047856
What I want is 250 different 5th values in the "actual" column, not the same value repeated.
If I understand this correctly, you want the 5th percentile of values in a window of 100. That should yield to summarize, detail or centile. I see no need to write a program.
Your bug is that your program his calculates the same thing each time it is called. There is no communication about windows other than what is explicit in your code. It is like saying
move here: now add 2 + 2
move there: now add 2 + 2
move to New York: now add 2 + 2
The result is invariant to your supposed position.
Note that I doubt that
return scalar actual=lnreturn in 5
really is your code. lnreturn[5] should work.
UPDATE You don't even need rolling here. Looping over data is easy enough. The data in this example are clearly fake.
clear
* sandpit
set obs 500
set seed 2803
gen y = ceil(exp(rnormal(3,2)))
l y in 1/5
* initialise
gen p5 = .
* rolling windows of length 100: 1..100, 2..101, ..., 401..500
quietly forval j = 1/401 {
    local J = `j' + 99
    su y in `j'/`J', detail
    * store the window's 5th percentile in the row where the window starts
    replace p5 = r(p5) in `j'
}
* check first calculation
su y in 1/100, detail
l in 1/5
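For comparison, the procedure described in the question (sort each window of 100 values and take the 5th smallest) could be sketched in Python as below. The function name and the fake descending input are assumptions made for illustration; note that this takes the 5th smallest value rather than the 5th percentile that summarize reports.

```python
def rolling_kth_smallest(values, window=100, k=5):
    """For each window values[i:i+window], return its k-th smallest element."""
    return [
        sorted(values[i:i + window])[k - 1]
        for i in range(len(values) - window + 1)
    ]

# Fake data: 350 values in descending order, so every window has a different answer
values = list(range(350, 0, -1))
result = rolling_kth_smallest(values)
print(len(result), result[:3], result[-1])
```

With 350 inputs and a window of 100 there are 251 complete windows (1..100 up to 251..350), one more than the 250 mentioned in the question.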
