Communicate estimates from GEE or linear mixed models to a general audience - transformation

I want to transform the estimates from a GEE model into estimates that are easy to interpret. I am analyzing, in Stata v17 (for Mac), data from a therapeutic intervention in a pilot study with 26 individuals randomized to either the treatment or a placebo. The outcome is a set of inflammatory proteins whose expression has been normalized to an arbitrary log2 scale. There are longitudinal measurements at weeks 0, 1, 2 and 3.
To evaluate the impact of treatment on each protein, I have used GEE models. For each protein, the code looks like this:
xtgee log2_biomarker treatment##c.week, family(gaussian) link(identity) corr(ar 1)
And this is the model output:
GEE population-averaged model                   Number of obs     =        104
Group and time vars:             id week        Number of groups  =         26
Family:                         Gaussian        Obs per group:
Link:                           Identity                      min =          4
Correlation:                       AR(1)                      avg =        4.0
                                                              max =          4
                                                Wald chi2(3)      =      11.38
Scale parameter =               1.018093        Prob > chi2       =     0.0098
----------------------------------------------------------------------------------
log2_biomarker | Coefficient Std. err. z P>|z| [95% conf. interval]
-----------------+----------------------------------------------------------------
treatment |
Placebo | .1010699 .379534 0.27 0.790 -.6428031 .8449428
week | -.1841974 .125257 -1.47 0.141 -.4296967 .0613018
|
treatment#c.week |
Placebo | .3919614 .1771402 2.21 0.027 .044773 .7391497
|
_cons | .063295 .268371 0.24 0.814 -.4627026 .5892926
----------------------------------------------------------------------------------
The positive interaction term "treatment#c.week" indicates that this protein increases over time faster in the placebo arm than in the treatment arm. To put it in context with the estimates from the models for the other proteins, I would like to translate this 0.39 coefficient into something like this:
"Subjects in the placebo arm experience a X % (or X-fold) greater protein increase per week".
But, with a log2-transformed outcome, I am struggling to come up with the correct formula.
Thanks!
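A minimal sketch of the back-transformation, assuming log2_biomarker is log2 of the raw protein level (so one unit on the outcome scale is a doubling); the coefficient and confidence limits are copied from the output above:
# Placebo-vs-treatment difference in weekly slope, on the log2 scale,
# with its 95% CI, taken from the Stata output above.
beta, lo, hi = 0.3919614, 0.044773, 0.7391497
fold_per_week = 2 ** beta                   # ~1.31-fold
pct_per_week = (fold_per_week - 1) * 100    # ~31% greater increase per week
ci_fold = (2 ** lo, 2 ** hi)                # ~(1.03, 1.67)
print(f"Placebo arm: {pct_per_week:.0f}% greater increase per week "
      f"(fold change {fold_per_week:.2f}, 95% CI {ci_fold[0]:.2f} to {ci_fold[1]:.2f})")
Because the outcome is on a log2 scale, exponentiating the slope difference with base 2 gives a ratio of weekly fold changes, which is what the target sentence asks for.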

Related

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty())?

I am working on a dataset to test the association between empirical antibiotics (variable emp; the antibiotics are cefuroxime or ceftriaxone, compared with a reference antibiotic) and 30-day mortality (variable mort30). The data come from patients admitted to 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients at the hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals, there were violations of the proportional hazards assumption, and I tried adding a time transformation (tt) to the model. Unfortunately, coxme() does not offer the possibility of time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and with cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than those from coxme() or frailty().
Does anyone know what the explanation for this is, and which option would provide the most reliable estimates?
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>

How to decide the probability percentage in question

I have the below question:
In the first part of the question, it says the probability that the selected person will be a male is 0.44, which means the number of males is 25 * 0.44 = 11. That's OK.
In the second part, the probability that the selected person will be a male who was born before 1960 is 0.28. Does that mean 0.28 of the total number, which is 25, or of the number of males?
I mean, should the number of males who were born before 1960 equal 25 * 0.28 or 11 * 0.28?
I find it easiest to think of these sorts of problems as contingency tables.
You use a matrix layout to express the distributions in terms of two or more factors or characteristics, each having two or more categories. The table can be constructed either with probabilities (proportions) or with counts, and switching back and forth is easy based on the total count in the table. Entries in the table are the intersections of the categories, corresponding to "and" in a verbal description. The numbers to the right or at the bottom of the table are called marginals, because they're found in the margins of the table, and are always the sum of the table row or column entries in which they occur. The total probability (or count) in the table is found by summing across all the rows and columns. The marginal distribution of gender would be found by summing across rows, and the marginal distribution of birthdays would be found by summing across the columns.
Based on this, you can inferentially determine other values as indicated by the entries in parentheses below. With one more entry, either for gender or in the marginal row for birthdays, you'd be able to fill in the whole table inferentially. (This is related to the concept of degrees of freedom - how many pieces of info can you fill in independently before the others are determined by the known constraint that the totals are fixed or that probability adds to 1.)
Probabilities
Birthday
< 1960 | >= 1960
_______________________
G | | |
e F | | | (0.56)
n __|_________|__________|
d | | |
e M | 0.28 | (0.16) | 0.44
r __|_________|__________|______
? ? | 1.00
Counts
Birthday
< 1960 | >= 1960
_______________________
G | | |
e F | | | (14)
n __|_________|__________|
d | | |
e M | 7 | (4) | 11
r __|_________|__________|_____
? ? | 25
Conditional probability corresponds to limiting yourself to the subset of rows or columns specified in the condition. If you had been asked what is the probability of a birthday < 1960 given the gender is male, i.e., P{birthday < 1960 | M} in relatively standard notation, you'd be restricting your focus to just the M row, so the answer would be 7/11 = 0.28/0.44. Computationally, you take the probabilities or counts in the qualifying table entries and express them as a proportion of the probabilities or counts of the specified (given) marginal entries. This is often written in prob & stats texts as P(A|B) = P(AB)/P(B), where AB is a set shorthand for A and B (intersection).
0.44 = 11 / 25 people are male.
0.28 = 7 / 25 people are male & born before 1960.
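A small sketch of the arithmetic, using only the probabilities given in the question; the derived cells match the parenthesized entries in the tables above:
n = 25
p_male = 0.44                  # P(male)
p_male_pre1960 = 0.28          # P(male and born before 1960)
p_male_post1960 = p_male - p_male_pre1960   # 0.16
p_female = 1 - p_male                       # 0.56
males = round(n * p_male)                   # 11
males_pre1960 = round(n * p_male_pre1960)   # 7
males_post1960 = males - males_pre1960      # 4
females = n - males                         # 14
# Conditional probability: P(born before 1960 | male)
p_pre1960_given_male = p_male_pre1960 / p_male   # 7/11 ~ 0.636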

Determine max slope of slowly descending signal

I have an analog power signal from a motor. The signal ramps up quickly, but powers off slowly over the course of several seconds. The signal looks almost like a series of plateaus on the descent. The problem is that the signal doesn't settle back to zero. It settles back to an unknown intermediate level that varies from motor to motor. See chart below.
I'm trying to find a way to determine when the motor is off and at that intermediate level.
My thought is to find and store the max point, and calculate the slopes thereafter until the slope is steeper than some large negative value like -160 (~ -60 degrees), and declare that the motor must be powering off. The sample points below have all duplicates removed (there are about 5,000 samples typically).
My problem is determining the X values. In the formula (y2 - y1) / (x2 - x1), the x values could be far enough apart in time that the slope never appears steeper than -30 degrees. Picking an absolute spacing like 10 would fix this, but is there a more mathematically correct method?
The table shows me calculating the slope with the method described above and the max of 921, i.e. (y2 - y1) / ((10+1) - 10). In this scheme, at data point 9, I would say the motor is "Off". I'm looking for a more precise means of determining an X value rather than arbitrarily picking 10, for instance.
+---+-----+----------+
| X | Y | Slope |
+---+-----+----------+
| 1 | 65 | 856.000 |
| 2 | 58 | 863.000 |
| 3 | 57 | 864.000 |
| 4 | 638 | 283.000 |
| 5 | 921 | 0.000 |
| 6 | 839 | -82.000 |
| 7 | 838 | -83.000 |
| 8 | 811 | -110.000 |
| 9 | 724 | -197.000 |
+---+-----+----------+
EDIT: A much simpler answer:
Since your motor is either ON or OFF, and ON wattages are strictly higher than OFF wattages, you should be able to discriminate between ON and OFF wattages by maintaining an average wattage, reporting ON if the current measurement is higher than the average and OFF if it is lower.
Count = 0
Average = 500
Whenever a measurement comes in,
Count = Count + 1
Average = Average + (Measurement - Average) / Count
Return Measurement > Average ? ON : OFF
This represents an average of all the values the wattage has ever been. If we want to eventually "forget" the earliest values (before the motor was ever turned on), we could either keep a buffer of recent values and use that for a moving average, or approximate a moving average with an IIR like
Average = (1-X) * Average + X * Measurement
for some X between 0 and 1 (closer to 0 to change more slowly).
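A runnable Python sketch of both detectors described above; the initial average of 500 comes from the pseudocode, and the smoothing factor alpha is an arbitrary placeholder:
class RunningMeanDetector:
    """Report ON when a measurement is above the mean of everything seen so far."""
    def __init__(self, initial_average=500.0):
        self.count = 0
        self.average = initial_average
    def update(self, measurement):
        self.count += 1
        # Incremental mean: Average += (Measurement - Average) / Count
        self.average += (measurement - self.average) / self.count
        return "ON" if measurement > self.average else "OFF"

class EwmaDetector:
    """Same idea, but 'forgets' old samples via an exponential moving average (IIR)."""
    def __init__(self, alpha=0.05, initial_average=500.0):
        self.alpha = alpha          # closer to 0 => the average changes more slowly
        self.average = initial_average
    def update(self, measurement):
        self.average = (1 - self.alpha) * self.average + self.alpha * measurement
        return "ON" if measurement > self.average else "OFF"

samples = [65, 58, 57, 638, 921, 839, 838, 811, 724]   # data points from the question
mean_det, ewma_det = RunningMeanDetector(), EwmaDetector()
for y in samples:
    print(y, mean_det.update(y), ewma_det.update(y))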
Original answer:
You could treat this as an online clustering problem, where you expect three clusters (before the motor turns on, when the motor is on, and when the motor is turned off), or perhaps four (before the motor turns on, peak power, when the motor is running normally, and when the motor turns off). In effect, you're trying to learn what it looks like when a motor is on (or off).
If you don't have any other information about whether the motor is on or off (which could be used to train a model), here's a simple approach:
Define an "Estimate" to contain:
float Value
int Count
Define an "Estimator" to contain:
float TotalError = 0.0
Estimate COLD_OFF = {Value = 0, Count = 1}
Estimate ON = {Value = 1000, Count = 1}
Estimate WARM_OFF = {Value = 500, Count = 1}
a function Update_Estimate(float Measurement)
Find the Estimate E such that E.Value is closest to Measurement
Update TotalError = TotalError + (E.Value - Measurement)*(E.Value - Measurement)
Update E.Value = (E.Value * E.Count + Measurement) / (E.Count + 1)
Update E.Count = E.Count + 1
return E
This takes initial guesses for what the wattages of these stages should be and updates them with the measurements. However, this has some problems. What if our initial guesses are off?
You could initialize some number of Estimators with different possible (e.g. random) guesses for COLD_OFF, ON, and WARM_OFF; after receiving a measurement, let each Estimator update itself and aggregate their values somehow. This aggregation should reward the better estimates. Since you're storing TotalError for each estimate, you could just pick the output of the Estimator that has the lowest TotalError so far, or you could let the Estimators vote (giving each Estimator's vote a weight proportional to 1/(TotalError + 1) or something like that).
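A Python sketch of the Estimate/Estimator scheme above; the initial guesses of 0, 1000 and 500 are the ones from the pseudocode, while the randomized ranges for the bank of estimators are arbitrary placeholders:
import random

class Estimator:
    """Tracks three cluster centers (COLD_OFF, ON, WARM_OFF) and its total squared error."""
    def __init__(self, guesses):
        self.values = dict(guesses)          # state name -> current wattage estimate
        self.counts = {name: 1 for name in guesses}
        self.total_error = 0.0
    def update(self, measurement):
        # Find the state whose current estimate is closest to the measurement.
        state = min(self.values, key=lambda s: abs(self.values[s] - measurement))
        err = self.values[state] - measurement
        self.total_error += err * err
        # Update that state's running mean.
        c = self.counts[state]
        self.values[state] = (self.values[state] * c + measurement) / (c + 1)
        self.counts[state] = c + 1
        return state

# A single estimator with the pseudocode's initial guesses...
single = Estimator({"COLD_OFF": 0.0, "ON": 1000.0, "WARM_OFF": 500.0})
# ...and a bank of estimators with randomized guesses; report the label from
# whichever has the lowest total error so far.
bank = [Estimator({"COLD_OFF": random.uniform(0, 100),
                   "ON": random.uniform(700, 1200),
                   "WARM_OFF": random.uniform(200, 600)}) for _ in range(20)]

for y in [65, 58, 57, 638, 921, 839, 838, 811, 724]:
    bank_labels = [e.update(y) for e in bank]
    best = min(range(len(bank)), key=lambda i: bank[i].total_error)
    print(y, single.update(y), bank_labels[best])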

split quantities algorithm (stock exchanges order)

I have a problem where I have multiple (a few thousand) quantities that I need to split between a set number of recipients, such that each quantity is split into whole numbers and all quantities use the same proportions.
I need to find an algorithm that implements this reliably and efficiently (don't we all? :-) )
This is to solve a problem in financial markets (stock exchange orders) where an order might get thousands of "fills" and at the end of the day must be distributed to a few clients while maintaining the order's average price. Here's an example:
Total Order Quantity 37300
Quantities filled by the Stock Exchange
Execution 1. 16700 shares filled at price 75.84
Execution 2. 5400 shares filled at price 75.85
Execution 3. 4900 shares filled at price 75.86
Execution 4. 10300 shares filled at price 75.87
Total 37300 shares filled at average price = (16700*75.84 + 5400*75.85 + 4900*75.86 + 10300*75.87) / 37300 = 75.85235925
Suppose I need to split these quantities between 3 clients such that :
Client1: 15000 shares
Client2: 10000 shares
Client3: 12300 shares
Each execution must be split individually (I can't just take each client's requested quantity priced at the average price).
My first thought was to split each proportionately :
Client 1 gets 15000/37300=0.402144772
Client 2 gets 10000/37300=0.268096515
Client 3 gets 12300/37300=0.329758713
Which would lead to
Client1 - 15000 Client2 - 10000 Client3 - 12300
Ratio : 0.402144772 Ratio : 0.268096515 Ratio : 0.329758713
Splits (Sorry about the formatting - this was the best I could do in the Post editor)
+-------------+-------------+-------------+
| Client 1 | Client 2 | Client 3 |
+-------------+-------------+-------------+
| 6715.817694 | 4477.211796 | 5506.970509 |
| 2171.581769 | 1447.72118 | 1780.697051 |
| 1970.509383 | 1313.672922 | 1615.817694 |
| 4142.091153 | 2761.394102 | 3396.514745 |
+-------------+-------------+-------------+
| Totals: | | |
| 15000 | 10000 | 12300 |
+-------------+-------------+-------------+
The problem with this is that I can't assign fractional quantities to clients so I need a smart algorithm which adjusts the quantities such that the fractional part of these splits is 0. I understand that this may be impossible in many scenarios so this requirement can be relaxed a little bit so that a certain client gets a little more (or less).
Does anybody know of an algorithm that I can use as a starting point for this problem?
You can round all the numbers (ratio[n] * totalQuantity) except the last one (possibly the smallest). The last one must be totalQuantity minus the sum of the others. This will give you whole-number quantities while keeping a correct total and staying close to the ratios you chose.
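A minimal Python sketch of this round-all-but-the-last approach, applied to each execution separately, using the quantities from the example:
fills = [16700, 5400, 4900, 10300]                       # executed quantities
clients = {"Client1": 15000, "Client2": 10000, "Client3": 12300}
total = sum(clients.values())                            # 37300
ratios = {c: q / total for c, q in clients.items()}
names = list(clients)
split = {c: [] for c in clients}
for fill in fills:
    allocated = 0
    # Round every client's share except the last one...
    for c in names[:-1]:
        qty = round(ratios[c] * fill)
        split[c].append(qty)
        allocated += qty
    # ...and give the last client whatever is left, so each fill adds up exactly.
    split[names[-1]].append(fill - allocated)
for c in names:
    print(c, split[c], "total:", sum(split[c]))
With these numbers every fill still sums exactly, but the per-client totals drift slightly (Client1 ends up with 15001 shares and Client3 with 12299), which is the kind of relaxation the question allows.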
Try to look at this from a different angle. You already know how many shares each client is getting. You want to calculate what's the fair total amount each has to pay and do this without rounding errors.
You therefore want this total dollar amounts to have no rounding issues, i.e. be accurate to the 0.01.
Prices can then be computed using the dollar amounts and displayed to the required precision.
The opposite (calculate prices, then derive amounts) will always yield rounding issues with the dollar amounts.
Assuming price is per 100 units, here's one way to accomplish this:
Calculate total $ for the order (16,700*75.84/100 + 5,400*75.85/100 + 4,900*75.86/100 + 10,300*75.87/100) = $28,292.93
Allocate to all clients except one, based on the ratio of quantity ordered to total quantity filled:
Client 2 = $28,292.93 / 37,300 * 10,000 = $7,585.24
Price = $7,585.24 / 10,000 * 100 = 75.8524.
Client 3 = $28,292.93 / 37,300 * 12,300 = $9,329.84
Price = $9,329.84 / 12,300 * 100 = 75.85235772
Calculate the last client as the remaining $$$:
$28,292.93 - ($7,585.24 + $9,329.84) = $11,377.85.
Price = $11,377.85 / 15,000 * 100 = 75.85233333
Here I arbitrarily picked Client 1, the one with the largest quantity, as the object of the remainder calculation.
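A Python sketch of this amounts-first allocation, with the quantities and prices from the example (price is per 100 units, and Client 1, the largest, absorbs the remainder):
fills = [(16700, 75.84), (5400, 75.85), (4900, 75.86), (10300, 75.87)]
clients = {"Client1": 15000, "Client2": 10000, "Client3": 12300}
total_qty = sum(q for q, _ in fills)                   # 37,300
total_cash = sum(q * p / 100 for q, p in fills)        # $28,292.93
# Allocate cash pro rata, rounded to the cent, for every client except the
# remainder client (Client 1).
amounts = {}
for name in ("Client2", "Client3"):
    amounts[name] = round(total_cash / total_qty * clients[name], 2)
amounts["Client1"] = round(total_cash - sum(amounts.values()), 2)
# Per-client price (per 100 units) follows from the cash amount.
prices = {name: amounts[name] / clients[name] * 100 for name in clients}
for name in clients:
    print(name, amounts[name], round(prices[name], 6))
Because the dollar amounts are fixed first and only then turned into prices, the amounts are exact to the cent and the rounding error shows up only in the displayed per-client prices.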

Math algorithm question

I'm not sure if this can be done without some determining factor... but I wanted to see if someone knew of a way to do this.
I want to create a shifting scale for numbers.
Let's say I have the number 26000. I want the outcome of this algorithm to be 6500; or 25% of the original number. But if I have the number 5000, I want the outcome to be 2500; or 50% of the original number.
The percentages don't have to be exact, this is just an example.
I just want to have like a sine wave sort of thing. As the input number gets higher, the output number is a lower percentage of the input.
Does that make sense?
Plot some points in Excel and use the "show formula" option on the line.
Something like f(x) = x / log x?
x | f(x)
=======================
26000 | 5889 (22.6 %)
5000 | 1351 (27.2 %)
100000 | 20000 (20 %)
1000000 | 166666 (16.6 %)
Just a simple example. You can tweak it by playing with the base of the logarithm, by adding multiplicative constants on the numerator (x) or denominator (log x), by using square roots, squaring (or taking the root of) log x or x etc.
Here's what f(x) = 2*log(x)^2*sqrt(x) gives:
x | f(x)
=======================
26000 | 6285 (24 %)
5000 | 1934 (38 %)
500 | 325 (65 %)
100 | 80 (80 %)
1000000 | 72000 (7.2 %)
100000 | 15811 (15 %)
A suitable logarithmic scale might help.
It may be possible to define the function you want exactly if you specify a third input-output pair in addition to the two you've already mentioned. If you have some specific aim in mind, it's quite likely to fit a well-known mathematical definition which at least one poster could identify for you. It does sound as though you're talking about a logarithmic function. You'll have to be more specific about your requirements to define a useful algorithm, however.
I'd suggest you play with the power law family of functions, c*x^a == c * pow(x,a) where a is the power. If you want an exact fraction of your answer, you would choose a=1 and it would just be a constant fraction. But you want the percentage to slowly decrease, so you could choose a<1. For example, we might choose a = 0.9 and c = 0.2 and get
1 0.2
10 1.59
100 12.6
1000 100.2
So it ranges from 20% at 1 to 10% at 1000. You can pick smaller a to make the fraction decrease more rapidly. (And you can scale everything to fit your range.)
In particular, if c*5000^a = 2500 and c*26000^a = 6500, then by dividing we get (5.2)^a = 2.6, which we can solve as a = log(2.6)/log(5.2) = 0.57957.... Then we plug back in to get c*139.25 = 2500, so c = 17.953...
Now the progression goes like so
1000 984
3000 1859
5000 2500
10000 3736
15000 4726
26000 6500
50000 9495
90000 13349
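A short Python sketch of this power-law fit, solving for a and c from the two anchor points and reproducing the progression above:
import math
# Anchor points: f(5000) = 2500 and f(26000) = 6500, with f(x) = c * x**a.
x1, y1 = 5000, 2500
x2, y2 = 26000, 6500
# Dividing the two equations eliminates c: (x2/x1)**a = y2/y1.
a = math.log(y2 / y1) / math.log(x2 / x1)   # log(2.6)/log(5.2) ~ 0.5796
c = y1 / x1 ** a                            # ~17.95
for x in [1000, 3000, 5000, 10000, 15000, 26000, 50000, 90000]:
    fx = c * x ** a
    print(x, round(fx), f"{fx / x:.1%}")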
This is somewhat similar to simple compression schemes used for analogue audio. See Wikipedia entry for Companding.
