How to randomly sample data in R within a specific number of cases

I'm trying to resample data, but I don't know how to tell the "sample" function to do it within a condition or a specific number of cases.
The objective is to resample a statistic from a Wald test while balancing the number of cases in the two conditions, control vs. treatment, since the original data are unbalanced.
For this I have 48 "control_IDs" (corresponding to 228 observations across time) and 326 "treatment_IDs", corresponding to 4531 observations. The ID identifies the individual, treated or not.
One part of the structure is pretty straightforward: with the code below I'm able to sample 228 of the 4531 "treatment" observations, run the model, and run the Wald test.
The problem is that when I run the code, the 48 IDs of the "control" are fixed, but the ones from the "treatment" are not. So I end up with 228 observations of 48 control_IDs, but 228 observations of a varying number of "treatment_IDs" (from ~60 to ~180) depending on the run.
Any idea how to properly code it so that the 228 random treatment observations come from a random sample of 48 "treatment_IDs"?
Thanks!
library(lme4)   # glmer()
library(car)    # Anova()

# df holds the 4531 treatment observations; df2 the 228 control observations
df2 <- subset(original_df, treatment == "control")

resampling <- replicate(5000, {
  # 0.0504 * 4531 ~ 228 treatment observations, drawn without replacement
  resampled.data <- sample(1:nrow(df), 0.0504 * nrow(df), replace = FALSE)
  x1 <- df[resampled.data, ]
  x2 <- rbind(x1, df2)
  Model <- glmer(response ~ treatment * var1 * var2 + var3 + (1 | ID),
                 data = x2, family = "binomial",
                 control = glmerControl(optimizer = "bobyqa",
                                        optCtrl = list(maxfun = 2e5)))
  Anova(Model)[1, -1:-2]
})
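One way to get there is to make the draw two-stage: first sample 48 treatment IDs, then sample the 228 observations from those IDs only. This is a sketch, not tested against the real data; it assumes df holds only the treatment rows with the column names above, and that any 48 treatment IDs together contribute at least 228 observations.

resampling <- replicate(5000, {
  # stage 1: fix a random set of 48 treatment IDs for this iteration
  sampled_ids <- sample(unique(df$ID), 48, replace = FALSE)
  # assumes these 48 IDs together contribute at least 228 observations
  pool <- df[df$ID %in% sampled_ids, ]
  # stage 2: draw the 228 observations from those 48 IDs only
  x1 <- pool[sample(nrow(pool), 228, replace = FALSE), ]
  x2 <- rbind(x1, df2)
  Model <- glmer(response ~ treatment * var1 * var2 + var3 + (1 | ID),
                 data = x2, family = "binomial",
                 control = glmerControl(optimizer = "bobyqa",
                                        optCtrl = list(maxfun = 2e5)))
  Anova(Model)[1, -1:-2]
})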

Related

Sorting list for most efficient build

I have a problem that I'm having some trouble tackling. I'm playing a game and thought it would be fun to make a calculator that works out the most efficient builds.
Every fleet allows for 30 units, and a fleet's power is the sum of the power of all its units combined.
Every unit gives a certain amount of power, ranging from 1 to 200. The price of these units differs and can vary from unit to unit (a 200-power unit costs around 1.5, and a 50-power unit costs around 0.02).
I would like to calculate the cheapest build for a fleet of x amount of power.
At first I thought I would get the price of all units, get the average price for 1 power, and rank the units based on price per power. This gives me a list sorted by efficiency, from which I would create a list containing the 30 most efficient PpP (Price per Power) units.
I could then check whether the power of the generated fleet is equal to or more than my desired amount, and if not, remove the least powerful unit, replace it with the next most efficient bigger one from my list, and repeat until the 30 have enough power.
The problem is that since the price grows exponentially, units with 1-50 power are always the most efficient.
Does anyone know of an algorithm that I could use or study?
I want to calculate the cheapest way of achieving a fleet of x amount of power with a maximum of 30 units, without it taking weeks to compute.
I have a list of about 30 000 units containing the price and power of each.
It's a hard problem to explain, so apologies if anything is unclear.
EDIT: Sorry, I messed up my description: the price of the units varies, meaning it could be more efficient to mix sizes. For example, for a 3000-power fleet, 15 x 150-power units plus 15 x 50-power units might be better than 30 x 100-power units. Sorry for the confusion.
To improve the efficiency of your algorithm above, you can find the unit such that 30 copies of it come closest to the desired fleet power without exceeding it. After that you use the same method, removing units one by one and replacing each with the next-higher-power unit. Since 30 copies of that next higher unit will always exceed the power required, this part of the algorithm only needs to run once. If you want an exact answer rather than a greedy one, see the dynamic-programming sketch below.
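For an exact answer, this is a knapsack-style problem (a different technique from the greedy pass above), and it becomes small once the 30 000 units are reduced to the cheapest unit at each power level (at most 200 of them). A rough sketch in R, assuming repetition of a unit type is allowed; the data frame units with columns power and price is synthesized here for illustration:

set.seed(1)
units <- data.frame(power = sample(1:200, 30000, replace = TRUE),
                    price = runif(30000, 0.02, 1.5))   # made-up price list
P <- 3000                                              # target fleet power

# keep only the cheapest unit at each power level
types <- aggregate(price ~ power, data = units, FUN = min)

# dp_prev[p + 1] = cheapest cost of reaching at least power p
dp_prev <- c(0, rep(Inf, P))          # with 0 units, only power 0 is reachable
for (c in 1:30) {                     # add one unit per layer, up to 30 units
  dp_cur <- dp_prev                   # using fewer than c units is also allowed
  for (u in seq_len(nrow(types))) {
    pw <- types$power[u]
    # reaching power p with this unit means having had power >= p - pw before
    shifted <- c(rep(dp_prev[1], min(pw, P + 1)),
                 dp_prev[seq_len(max(P + 1 - pw, 0))])
    dp_cur <- pmin(dp_cur, shifted + types$price[u])
  }
  dp_prev <- dp_cur
}
dp_prev[P + 1]                        # cheapest fleet with >= P power, <= 30 units

With P = 3000 this is 30 layers of a vectorised minimum over at most 200 unit types, so it runs in well under a second; recovering which units were chosen would need an extra back-pointer table.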

Executing genetic algorithm per configuration

I'm trying to understand a paper I'm reading regarding genetic algorithms.
They run a GA there with the parameters they presented.
Some of the parameters are:
stop criterion - 50 generations.
Runs per configuration - 30.
After they presented the parameters they said that they executed the algorithm 20 times.
I don't understand two things:
1. What does the second parameter mean? That every configuration runs until it reaches 30 generations?
2. When they execute the algorithm 20 times, does that mean until they reach 20 generations, or that they executed 20 configurations?
thanks.
Question 1
Usually in GAs, parameters such as mutation rates, selection rates, etc., need to be customised for the problem at hand. To get a reasonably good measure of what is a good or bad parameter configuration, each configuration is repeated for a number of runs, after which the mean and standard deviation of the best solution in each of these runs are computed. Note that these runs have nothing to do with the number of generations or specific parameters in the GA; they are simply a way to reduce noise and increase reliability when comparing and choosing among different parameter configurations.
On the other hand, the number of runs can itself be a parameter of study, but it's quite obvious that increasing the total number of runs will increase the chance of finding a really good solution (say, a large deviation from mean GA performance). If studying such a case, it's important that the total number of generations (effectively the total number of function evaluations) over all runs is the same between different parameter settings.
As an example, consider the following parameter configuration:
numberOfRuns = 30
numberOfGenerations = 50
==> a total of 1500 generations analysed
compared with the configuration:
numberOfRuns = 50
numberOfGenerations = 30
==> a total of 1500 generations analysed
A study like this could examine whether it is favourable to have more generations in each GA run (first scenario), or whether it's better to have fewer generations but more runs (second scenario: probably favourable for a GA that converges quickly to local optima but has a large standard deviation).
The following parameter configuration, however, would be meaningless to benchmark against the two above, since it has a larger number of total generations:
numberOfRuns = 50
numberOfGenerations = 50
==> a total of 2500 generations analysed
Question 2
If they say they execute the algorithm 20 times, that would generally be equivalent to 20 runs as described above. Hence, 20 runs, each running over 50 generations.
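To make the bookkeeping concrete, here is a minimal sketch in R of "runs per configuration"; the toy objective, population size of 20, truncation selection, and mutation scale are all illustrative assumptions, not taken from the paper:

run_once <- function(generations = 50) {
  pop <- matrix(runif(20 * 2, -5, 5), ncol = 2)    # population of 20 candidates
  best <- Inf
  for (g in seq_len(generations)) {                # stop criterion: 50 generations
    fitness <- rowSums(pop^2)                      # toy objective: minimise x^2 + y^2
    best <- min(best, min(fitness))
    parents <- pop[order(fitness)[1:10], , drop = FALSE]       # truncation selection
    children <- parents[sample(10, 10, replace = TRUE), , drop = FALSE] +
                matrix(rnorm(10 * 2, sd = 0.1), ncol = 2)      # Gaussian mutation
    pop <- rbind(parents, children)
  }
  best                                             # best solution found in this run
}

# "runs per configuration = 30": repeat the whole GA 30 times,
# then summarise the 30 best-of-run values for this configuration
results <- replicate(30, run_once(50))
c(mean = mean(results), sd = sd(results))

Each call to run_once is one run; the paper's protocol corresponds to reporting that mean and standard deviation for every parameter configuration tried.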

Suitable machine learning algorithm for column selection

I am new to machine learning. In my work I require a machine learning algorithm to select some columns out of the many columns in a 2D matrix, depending on the spread of the data. Below is a sample of the 2D matrix:
400 700 4 1400
410 710 4 1500
416 716 4 1811
..............
410 710 4 1300
Previously I used the standard deviation, compared against some threshold values, as a measure of the spread of data in each column. Observe that the 3rd column is constant and the last column is varying tremendously. The 1st and 2nd columns are also varying, but the spread of their data is small. By applying the standard deviation to each of the columns I get (sigma) = 10, 10, 0, 200 respectively.
I have considered some experimental threshold values to discard columns: if the (sigma) crosses the threshold range, the corresponding column gets discarded. I calculated those threshold values manually. Though this method is very simple, dealing with the threshold values is a very tedious task, as there are many columns.
For this reason I want to use a standard machine learning algorithm, or to somehow make these threshold values adaptive, so that I don't need to hard-code them. Can anyone suggest an appropriate algorithm for this?
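No standard algorithm is named in this thread, but one simple data-driven baseline (a sketch only; the 0.75 quantile cut-off is an assumption to tune, not a rule) is to derive the threshold from the column spreads themselves instead of hard-coding it. In R:

m <- rbind(c(400, 700, 4, 1400),     # sample matrix from the question
           c(410, 710, 4, 1500),
           c(416, 716, 4, 1811),
           c(410, 710, 4, 1300))

sigma <- apply(m, 2, sd)             # spread of each column
threshold <- quantile(sigma, 0.75)   # adaptive cut-off derived from the data itself
selected <- which(sigma > threshold) # keep only the widely varying columns
m[, selected, drop = FALSE]          # here: only the 4th column survives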

Calculating an average metric in GoodData

Based on GoodData's excellent suggestion for implementing Fact tables, I have been able to design a model that meets our client’s requirements for joining different attributes across different tables. The issue I have now is that the model metrics are highly denormalized, with data repeating itself. I am currently trying to figure out a way to dedupe results.
For example, I have two tables—the first is a NAMES table and the second is my fact table:
NAMES
Val2  Name
35    John
36    Bill
37    Sally

FACT
VAL1  VAL2  SCORE  COURSEGRADE
1     35    50     90%
2     35    50     80%
3     35    50     60%
4     36    10     75%
5     37    40     95%
What I am trying to do is write a metric in such a way that we can get an average of SCORE that eliminates the duplicate values. GoodData is excellent in that it can actually give me back the unique results using the COUNT(VARIABLE1,RECORD) metric, but I can’t seem to get the average score to stick when eliminating the breakout information. If I keep all fields (including VAL2), it shows me everything:
VAL2 SCORE(AVG)
35 50
36 10
37 40
AVG: 33.33
But when I remove VAL2, I suddenly lose the "uniqueness" of the record.
SCORE(AVG)
40
What I want is the score of 33.33 we got above.
I’ve tried using a BY statement in my SELECT AVG(SCORE), but this doesn’t seem to work. It’s almost like I need some kind of DISTINCT clause. Any thoughts on how to get that rollup value shown in my first example above?
Happy to help here. I would try the following:
Create an intermediate metric (let's call it Score by Employee):
SELECT MIN( SCORE ) BY ID ALL IN ALL OTHER DIMENSIONS
Then, once you have this metric defined you should be able to create a metric for the average score as follows:
SELECT AVG( Score by Employee )
The reason we create the first metric is to force the table to normalize score around the ID attribute which gets rid of duplicates when we use this in the next metric (we could have used MAX or AVG also, it doesn't matter).
Hopefully this solves your issue, let me know if it doesn't work and I'll be happy to help out more. Also feel free to check out GoodData's Developer Portal for more information about reporting:
https://developer.gooddata.com/docs/reporting
Best,
JT
You should definitely check the "How to build a metric in a metric" presentation by Petr Olmer (http://www.slideshare.net/petrolmer/in10-how-to-build-a-metric-in-a-metric).
It can help you to understand it better.
Cheers,
Peter

How do I measure variability of a benchmark comprised of many sub-benchmarks?

(Not strictly programming, but a question that programmers need answered.)
I have a benchmark, X, which is made up of many sub-benchmarks x1..xn. It's quite a noisy test, with the results being quite variable. To benchmark accurately, I must reduce that "variability", which requires that I first measure it.
I can easily calculate the variability of each sub-benchmark, using perhaps the standard deviation or the variance. However, I'd like to get a single number which represents the overall variability.
My own attempt at the problem is:
sum = 0
foreach i in 1..n
    mean[i] = mean across the 60 runs of x_i
    foreach j in 1..60
        sum += abs(mean[i] - x_i[j])
variability = sum / (n * 60)    # average absolute deviation over all sub-benchmarks
Best idea: ask at the statistics Stack Exchange once it hits public beta (in a week).
In the meantime: you might actually be more interested in the extremes of variability rather than the central tendency (mean, etc.). For many applications, I imagine there's relatively little to be gained by improving the typical user experience, but much to be gained by improving the worst user experiences. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, plot the standard deviations all together. If they're approximately normally distributed, I don't know of any reason why you couldn't just take their mean.
I think you're misunderstanding the standard deviation -- if you run your test 50 times and have 50 different runtimes, the standard deviation will be a single number that describes how tightly or loosely those 50 numbers are distributed around your average. In conjunction with your average run time, the standard deviation will help you see how much spread there is in your results.
Consider the following run times:
12 15 16 18 19 21 12 14
The mean of these run times is 15.875. The sample standard deviation of this set is 3.27. There's a good explanation of what 3.27 actually means (in a normally distributed population, roughly 68% of the samples will fall within one standard deviation of the mean: e.g., between 15.875-3.27 and 15.875+3.27) but I think you're just looking for a way to quantify how 'tight' or 'spread out' the results are around your mean.
Now consider a different set of run times (say, after you compiled all your tests with -O2):
14 16 14 17 19 21 12 14
The mean of these run times is also 15.875. The sample standard deviation of this set is 3.0. (So, roughly 68% of the samples will fall within 15.875-3.0 and 15.875+3.0.) This set is more closely grouped than the first set.
And you have a single number that summarizes how compact or loose a group of numbers is around the mean.
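The two examples above check out in R, where sd() computes the sample standard deviation (the run times are copied from the sets above):

x1 <- c(12, 15, 16, 18, 19, 21, 12, 14)
x2 <- c(14, 16, 14, 17, 19, 21, 12, 14)
mean(x1); sd(x1)   # 15.875, ~3.27
mean(x2); sd(x2)   # 15.875, ~3.00 -- same mean, tighter spread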
Caveats
Standard deviation is built on the assumption of a normal distribution -- but your application may not be normally distributed, so please be aware that standard deviation may be a rough guideline at best. Plot your run-times in a histogram to see if your data looks roughly normal or uniform or multimodal or...
Also, I'm using the sample standard deviation because these are only a sample out of the population space of benchmark runs. I'm not a professional statistician, so even this basic assumption may be wrong. Either population standard deviation or sample standard deviation will give you good enough results in your application IFF you stick to either sample or population. Don't mix the two.
I mentioned that the standard deviation in conjunction with the mean will help you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very dispersed, and perhaps your process is not very repeatable. Interpreting a 3% speedup in the face of a large standard deviation is nearly useless, as you've recognized. And the best judge (in my experience) of the magnitude of the standard deviation is the magnitude of the average.
Last note: yes, you can calculate standard deviation by hand, but it is tedious after the first ten or so. Best to use a spreadsheet or wolfram alpha or your handy high-school calculator.
From the Wikipedia article on Variance:
"the variance of the total group is equal to the mean of the variances of the subgroups, plus the variance of the means of the subgroups."
I had to read that several times, then run it: 464 from this formula == 464, the standard deviation of all the data -- the single number you want.
#!/usr/bin/env python
import sys
import numpy as np

N = 10
exec("\n".join(sys.argv[1:]))  # this.py N= ...
np.set_printoptions(1, threshold=100, suppress=True)  # .1f
np.random.seed(1)
data = np.random.exponential(size=(N, 60)) ** 5  # N rows, 60 cols

row_avs = np.mean(data, axis=-1)   # average of each row
row_devs = np.std(data, axis=-1)   # spread (stddev) of each row about its average
print("row averages:", row_avs)
print("row spreads:", row_devs)
print("average row spread: %.3g" % np.mean(row_devs))

# http://en.wikipedia.org/wiki/Variance:
# variance of the total group
#   = mean of the variances of the subgroups + variance of the means of the subgroups
avvar = np.mean(row_devs ** 2)
varavs = np.var(row_avs)
print("sqrt total variance: %.3g = sqrt( av var %.3g + var avs %.3g )" % (
    np.sqrt(avvar + varavs), avvar, varavs))

var_all = np.var(data)  # std^2 of all N x 60 values about the average of the lot
print("sqrt variance all: %.3g" % np.sqrt(var_all))
row averages: [  49.6  151.4   58.1   35.7   59.7   48.   115.6   69.4  148.1   25. ]
row spreads:  [ 244.7  932.1  251.5   76.9  201.1  280.   513.7  295.9  798.9  159.3]
average row spread: 375
sqrt total variance: 464 = sqrt( av var 2.13e+05 + var avs 1.88e+03 )
sqrt variance all: 464
To see how group variance increases, run the example from the Wikipedia Variance article.
Say we have
60 men of heights 180 +- 10: exactly 30 at 170 and 30 at 190
60 women of heights 160 +- 7: 30 at 153 and 30 at 167.
The average standard deviation is (10 + 7) / 2 = 8.5.
Together though, the heights
-------|||----------|||-|||-----------------|||---
153 167 170 190
spread like 170 +- 13.2, much greater than 170 +- 8.5.
Why? Because we have not only the spreads within the groups, men +- 10 and women +- 7,
but also the spread of the group means 160 / 180 about the common mean 170.
Exercise: compute the spread 13.2 in two ways,
from the formula above, and directly.
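For reference, a worked version of that exercise, using population variances as in the formula quoted above:

mean of the within-group variances = (10^2 + 7^2) / 2 = (100 + 49) / 2 = 74.5
variance of the group means = ((180 - 170)^2 + (160 - 170)^2) / 2 = 100
combined spread = sqrt(74.5 + 100) = sqrt(174.5) ~ 13.2

Directly: the 120 heights are 30 each of 153, 167, 170 and 190, with overall mean 170; the deviations are -17, -3, 0 and +20, so the variance is (17^2 + 3^2 + 0^2 + 20^2) / 4 = 174.5 and the spread is again sqrt(174.5) ~ 13.2.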
This is a tricky problem because benchmarks can have different natural lengths anyway. So the first thing you need to do is convert each of the individual sub-benchmark figures into scale-invariant values (e.g., a "speed-up factor" relative to some believed-good baseline) so that you at least have a chance of comparing different benchmarks.
Then you need to pick a way to combine the figures. Some sort of average. There are, however, many types of average. We can reject the mode and the median here; they throw away too much relevant information. But the different kinds of mean are useful because of the different ways they give weight to outliers. I used to know (but have forgotten) whether it was the geometric mean or the harmonic mean that was most useful in practice (the arithmetic mean is less good here). The geometric mean is basically an arithmetic mean in the log domain, and the harmonic mean is similarly an arithmetic mean in the reciprocal domain. (Spreadsheets make this trivial.)
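For instance, in R (the speed-up factors here are made-up numbers, purely for illustration):

speedups <- c(1.10, 0.95, 2.40, 1.02)   # hypothetical per-benchmark speed-ups vs. baseline
exp(mean(log(speedups)))                # geometric mean: arithmetic mean in the log domain
1 / mean(1 / speedups)                  # harmonic mean: arithmetic mean in the reciprocal domain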
Now that you have a means to combine the values for a run of the benchmark suite into something suitably informative, you've then got to do lots of runs. You might want to have the computer do that while you get on with some other task. :-) Then try combining the values in various ways. In particular, look at the variance of the individual sub-benchmarks and the variance of the combined benchmark number. Also consider doing some of the analyses in the log and reciprocal domains.
Be aware that this is a slow business that is difficult to get right, and it's usually uninformative to boot. A benchmark only performance-tests exactly what's in the benchmark, and that's mostly not how people use the code. It's probably best to strictly time-box your benchmarking work and instead focus on whether users perceive the software as fast enough, or whether the required transaction rates are actually attained in deployment (there are many non-programming ways to screw things up).
Good luck!
You are trying to solve the wrong problem: better to try to minimize the variability itself. The differences can be caused by caching.
Try running the code on a single (same) core, with the SetThreadAffinityMask() function on Windows.
Drop the first measurement.
Increase the thread priority.
Disable hyperthreading.
If you have many conditional jumps, they can introduce visible differences between calls with different input (this could be addressed by giving exactly the same input for the i-th iteration and then comparing the measured times between these iterations).
You can find here some useful hints: http://www.agner.org/optimize/optimizing_cpp.pdf
