I want to look at the change in the percentage of income saved (measured in dollars) using repeated-measures data for older adults. I want to compare three groups based on the father's occupation while the subject was growing up.
I want to estimate different slopes and intercepts for the three groups. For example: group A is subjects whose fathers were blue-collar; group B is subjects whose fathers were white-collar; group C is subjects whose fathers were other types of workers. I have repeated measures from 10 annual surveys in which subjects reported how much of their income they saved. I want to see if the intercept and trajectory differ for the three groups.
This is what I come up with but I am not sure if this gets at the appropriate verbal explanation.
m1
xtmixed savings time || subjects:, var
m1 looks at the impact of time on savings and estimates intercepts and slopes for each subject.
m2
xtmixed savings time##fathers_occupation || subjects:, var
Does m2 examine the differences in the intercept and slope for the three groups? Or do I need to add fathers_occupation to the right side of || as well?
You don't need it on the right side of ||, unless you think the effect of fathers_occupation on savings differs across individual subjects. m2 estimates a different intercept for each subject, and the time##fathers_occupation term estimates how the intercept and the slope over time differ across the fathers_occupation groups.
I don't know your data, but I think it is very unlikely that the effect of fathers_occupation differs across individual subjects, so you could rely on the simpler random-intercept model with the interaction term, as specified in m2.
I am looking for advice on the proper model notation to test for differences between sex in my data. My goal is to determine whether or not I need to split my data into M and F, or if I can keep my data combined (I hope that I will be able to keep it combined due to sample size).
I am using the glmmTMB package in R for resource selection function analysis, and my plan is to run one model with random intercepts and slopes, without sex, and then compare this model to essentially the same model but with sex as an interaction term. I will compare AIC to determine the most supported model (i.e., if the model with sex is supported, then I will separate my data into M and F and analyze them separately; if the model without sex is supported, then I will keep the data combined).
I am following the code provided in the supplementary materials by Muff et al. 2019 (model M4): https://conservancy.umn.edu/bitstream/handle/11299/204737/Goats_RSF.html?sequence=21&isAllowed=y
For example:
My model without sex looks like this:
glmmTMB(Used_and_Available_Locations ~ Urbanization + (1|AnimalID) + (0 + Urbanization|AnimalID), family = binomial(),...)
My model with sex is where I am confused...How do I account for sex as a random effect when there is an interaction? Should I not account for sex as a random effect?
glmmTMB(Used_and_Available_Locations ~ Sex + Sex*Urbanization + Urbanization + (1|AnimalID) + (0 + Urbanization|AnimalID), family = binomial(),...)
My goal is to determine whether or not I need to split my data into M and F, or if I can keep my data combined (I hope that I will be able to keep it combined due to sample size).
I have never come across a scenario in which it is a good idea to split data along these lines. It results in a massive loss of statistical power, and provides no benefits.
When the effect of a predictor variable differs depending on the level of another predictor, such as Urbanization in your model having a different effect in females than in males, the interaction term will uncover this, without any loss of statistical power. The main thing to be aware of when fitting an interaction is that the main effects of the variables involved are then each conditional on the other variable being at zero (or at its reference level in the case of a categorical variable such as sex).
My model with sex is where I am confused...How do I account for sex as a random effect when there is an interaction? Should I not account for sex as a random effect?
Sex would never be a random effect. It does not make sense as a random intercept because it has only 2 levels (and can't really be considered a random factor for any other reason), and since it does not vary within individuals, it does not make sense as a random slope either.
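To make that concrete, here is a rough sketch (not the code from Muff et al.; the data argument is a placeholder) of how the combined model could be written, with sex entering only as a fixed effect:
library(glmmTMB)

# Sex * Urbanization expands to Sex + Urbanization + Sex:Urbanization,
# so the main effects do not need to be listed again.
m_sex <- glmmTMB(
  Used_and_Available_Locations ~ Sex * Urbanization +
    (1 | AnimalID) +                 # random intercept per animal
    (0 + Urbanization | AnimalID),   # random slope for Urbanization per animal
  family = binomial(),
  data = rsf_data                    # placeholder name for your data set
)
Comparing this model to the one without Sex (by AIC, as you planned) then addresses whether the sexes differ, without ever splitting the data.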
I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (a number between 0 and 1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour))
Denominator = numVotesThisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = numVotesTotal if numVotesTotal >= 10; x = 10 if numVotesTotal < 10
Highest number is hottest
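For concreteness, both scores could be written like this in R (the names mirror the formulas above; the example numbers are made up):
# Score from the first algorithm, with the special case for the denominator.
hot_score_1 <- function(controversial, numVotesThisHour, numVotesTotal) {
  denom <- numVotesTotal - numVotesThisHour
  if (denom == 0) denom <- numVotesThisHour
  controversial * (numVotesThisHour / denom)
}

# Score from the second algorithm, with total votes floored at 10.
# log() is the natural log here; the base is a free choice.
hot_score_2 <- function(controversial, numVotesTotal, timePassed, interval) {
  x <- max(numVotesTotal, 10)
  controversial * log(x) - timePassed / interval
}

hot_score_1(0.8, numVotesThisHour = 10, numVotesTotal = 20)
hot_score_2(0.8, numVotesTotal = 20, timePassed = 3, interval = 1)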
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something? In the first formula you have numVotesTotal in the denominator, so a higher all-time vote count means the post will never be very hot, even if it is not very old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is rated hotter than P2. That doesn't make sense to me.
I think the first algorithm relies too heavily on the instantaneous trend. Think of NASCAR: the current leader could be going 0 mph because he's at a pit stop. The second one uses the notion of an average trend. I think both have their uses.
Consider two posts with the same total votes and controversy rating, where one post receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm, but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1 (useful for comparing across categories); or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort them in descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)
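As a rough sketch of that multiplier idea in R (the multipliers and the max(..., 1) guard against empty timeframes are my own assumptions):
combine_hotness <- function(votes_hour, votes_week, votes_alltime,
                            total_hour, total_week, total_alltime,
                            w = c(hour = 3, week = 2, alltime = 1)) {
  # proportion of all votes the topic got in each timeframe,
  # weighted by the relative importance of that timeframe
  unname(w["hour"]    * votes_hour    / max(total_hour, 1) +
         w["week"]    * votes_week    / max(total_week, 1) +
         w["alltime"] * votes_alltime / max(total_alltime, 1))
}

combine_hotness(votes_hour = 3, votes_week = 40, votes_alltime = 500,
                total_hour = 10, total_week = 300, total_alltime = 20000)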
Let's say I have a list of 500 objects. I need to rate each one out of 10.
At random I select two and present them to a friend. I then ask the friend which they prefer. I then use this comparison (i.e., OBJECT1 is better than OBJECT2) to alter the two objects' ratings out of ten.
I then repeat this random selection and comparison thousands of times with a group of friends until I have a list of 500 objects with a reliable rating out of ten.
I need to figure out an algorithm which takes the two objects' current ratings and alters them depending on which is thought to be better...
Each object's rating could be (number of victories)/(number of contests entered) * 10. So the rating of the winner goes up a bit and the rating of the loser goes down a bit, according to how many contests they've previously entered.
For something more complicated and less sensitive to the luck of the draw with smaller numbers of trials, I'd suggest http://en.wikipedia.org/wiki/Elo_rating_system, but it's not out of 10. You could rescale everyone's scores so that the top score becomes 10, but then a match could affect everyone's rating, not just the rating of the two involved.
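For illustration, a bare-bones Elo-style update could look like this in R (K = 32 and the 400-point scale are conventional choices, not requirements):
elo_update <- function(rating_w, rating_l, K = 32) {
  # expected score of the winner under the Elo model
  expected_w <- 1 / (1 + 10^((rating_l - rating_w) / 400))
  delta <- K * (1 - expected_w)
  c(winner = rating_w + delta, loser = rating_l - delta)
}

# Example: a 1500-rated object beats a 1600-rated one and gains more
# than it would for beating an equal opponent.
elo_update(1500, 1600)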
It all sort of depends what "reliable" means. Different friends' judgements will not be consistent with respect to each other, and possibly not even consistent over time for the same person, so there's no "real" sorted order for you to sanity-check the rankings against.
On a more abstruse point, Arrow's Impossibility Theorem states some nice properties that you'd like to have in a system that takes individual preferences and combines them to form an aggregated group preference. It then proceeds to prove that they're mutually inconsistent - you can't have them all. Any intuitive idea of a "good" overall rating runs a real risk of being unachievable.
I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).
I've been mulling this over for a while, but I'm not a great statistician, so I figured I could maybe get some input here.
Here are the variables available:
Time [date] the article was originally published
Time [date] the article was recommended by editors (if it has been)
Number of votes the article has received from users (total, in the last week, in the last month, in the last year)
Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
Comments on the article (total, in the last week, in the last month, in the last year)
Number of times a user has saved the article to their reading-list (total, in the last week, in the last month, in the last year)
Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (total, in the last week, in the last month, in the last year)
Time [date] the article was dubbed 'article of the week' (if it has been)
Right now I'm doing some weighting on each variable and dividing by the number of times the article has been read. That's pretty much all I could come up with after reading up on weighted means. My biggest problem is that there are some user-submitted articles that are always at the top of the popularity list, probably because the author is "cheating".
I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.
Anyone with a more statistically adept mind than mine willing to help me out?
Thanks!
I think the weighted means approach is a good one. But I think there are two things you need to work out.
How to weigh the criteria.
How to prevent "gaming" of the system
How to weigh the criteria
This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:
      c1    c2    c3   ...
c1     1     4     2
c2   1/4     1   1/2
c3   1/2     2     1
...
This shows that C1 is 4 times as important as C2, which is half as important as C3. Use a finite pool of weightings, say 1.0, since that's easy. Distributing it over the criteria in the ratio 4 : 1 : 2 gives roughly C1 = 4/7, C2 = 1/7, C3 = 2/7. Where discrepancies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means you're inconsistent with your relative rankings, so go back and reexamine them. I forget the exact name of this procedure (it is essentially Saaty's Analytic Hierarchy Process); it is all well documented.
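For illustration, one way to turn that table into weights in R (the eigenvector step is the usual recipe for pairwise-comparison matrices, not something spelled out above):
A <- rbind(c(1,   4,   2),
           c(1/4, 1,   1/2),
           c(1/2, 2,   1))
ev <- Re(eigen(A)$vectors[, 1])   # principal eigenvector
w  <- ev / sum(ev)                # normalize so the weights sum to 1
round(w, 3)                       # 0.571 0.143 0.286, i.e. 4/7, 1/7, 2/7
For a perfectly consistent table this just reproduces the ratio split; for an inconsistent one it gives a compromise, and the gap between the largest eigenvalue and the number of criteria is a measure of that inconsistency.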
Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
Preventing gaming
This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.
You can use outlier detection for spotting anomalies. A very naive way of looking for outliers is to use the Mahalanobis distance. This is a measure that takes into account the spread of your data and calculates the relative distance from the center. It can be interpreted as roughly how many standard deviations the article is from the center. This will, however, also flag genuinely very popular articles, but it gives you a first indication that something is odd.
A second, more general approach is building a model. You could regress the editor-related variables on those that can be manipulated by users. One would expect users and editors to agree to some extent. If they don't, then that's again an indication that something is odd.
In both cases, you'll need to define some threshold and try to find some weighting based on that. A possible approach is to use the square-rooted Mahalanobis distance as an inverse weight: if you're far away from the center, your score will be pulled down. The same can be done using the residuals from the model. Here you could even take the sign into account. If the editor score is lower than what would be expected based on the user score, the residual will be negative; if the editor score is higher than what would be expected based on the user score, the residual is positive and it's very unlikely that the article is gamed. This allows you to define some rules to reweigh the given scores.
An example in R:
Code :
#Test data frame generated at random
test <- data.frame(
quoted = rpois(100,12),
seen = rbinom(100,60,0.3),
download = rbinom(100,30,0.3)
)
#Create some link between user-vars and editorial
test <- within(test,{
editorial = round((quoted+seen+download)/10+rpois(100,1))
})
#add two test cases
test[101,]<-c(20,18,13,0) #bad article, hyped by few spammers
test[102,]<-c(20,18,13,8) # genuinely good article
# mahalanobis distances
mah <- mahalanobis(test,colMeans(test),cov(test))
# simple linear modelling
mod <- lm(editorial~quoted*seen*download,data=test)
# the plots
op <- par(mfrow=c(1,2))
hist(mah,breaks=20,col="grey",main="Mahalanobis distance")
points(mah[101],0,col="red",pch=19)
points(mah[102],0,col="darkgreen",pch=19)
legend("topright",legend=c("high rated by editors","gamed"),
pch=19,col=c("darkgreen","red"))
hist(resid(mod),breaks=20,col="grey",main="Residuals model",xlim=c(-6,4))
points(resid(mod)[101],0,col="red",pch=19)
points(resid(mod)[102],0,col="darkgreen",pch=19)
par(op)
There are any number of ways to do this, and what works for you will depend on your actual dataset and what outcomes you desire for specific articles. As a rough reworking, though, I would suggest moving the number of times it has been read into the weighted numbers and dividing by the age of the article, since the older an article is, the more likely it is to have higher numbers in each category.
For example
// x[i] = any given variable above
// w[i] = weighting for that variable
// age = days since published OR
// days since editor recommendation OR
// average of both OR
// ...
score = (x[1]*w[1] + ... + x[n]*w[n]) / age
Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.
// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)
score = (Fw1(x[1]) + ... + FwN(x[n]))/FwAge(age)
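For concreteness, the linear version of that score might look like this in R (the weights, variable names, and age definition are placeholders, not part of the answer):
weighted_score <- function(x, w, age_days) {
  # x: the variables listed in the question, w: their matching weights
  sum(x * w) / age_days
}

weighted_score(x = c(votes = 120, views = 4000, downloads = 35),
               w = c(0.5, 0.001, 0.2),
               age_days = 30)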
Suppose you were able to keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are ways that could you tell whether the amount of mentions per entity per a given time period was unusual relative to their normal degree of frequency of appearance?
I imagine that for a more popular person like Steve Jobs, an increase of around 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see whether a change falls substantially outside the usual variance.
You could also try some normalization -- one very simple approach would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z), where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name             m      m0     δ       z
Steve Jobs      4950   4500   0.10    495
Steve Ballmer    400    300   0.33    132
Larry Ellison     50     10   4.00    200
Andy Nobody       50     40   0.25   12.5
Here, a 400% change for the relatively unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my 25% spike is still a low 12.5. You could tweak this scheme depending on what you feel are good weights, or use the standard deviation or the rolling average to find whether a value is far away from the "expected" result.
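A quick R sketch of that normalization (the numbers are the made-up ones from the table):
mentions <- data.frame(
  name = c("Steve Jobs", "Steve Ballmer", "Larry Ellison", "Andy Nobody"),
  m    = c(4950, 400, 50, 50),   # mentions this period
  m0   = c(4500, 300, 10, 40)    # mentions last period
)
mentions$delta <- (mentions$m - mentions$m0) / mentions$m0   # percent change
mentions$z     <- mentions$m * mentions$delta                # normalized value
mentions[order(-mentions$z), ]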
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test whether the current value is more than X standard deviations away from the historical mean.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
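A minimal sketch of that check in R (the counts and the choice X = 1 are just placeholders):
is_unusual <- function(current, history, X = 1) {
  abs(current - mean(history)) > X * sd(history)
}

daily_counts <- c(12, 9, 15, 11, 14, 10, 13)            # historical stories per day
is_unusual(current = 40, history = daily_counts, X = 1)  # TRUE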
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that involve their name. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? Searching through articles, how do you grab names? Once you grab a new name, do you search all articles for him? How do you separate Steve Jobs of Apple from Steve Jobs, the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for a while and see what seems relevant to you. The end.
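A rough sketch of that resampling procedure in R (the decay rate, sample size, and counts are all made up):
set.seed(1)
history   <- rpois(365, lambda = 20)        # one year of daily mention counts
age_days  <- (length(history) - 1):0        # 0 = yesterday, the most recent day
weights   <- exp(-age_days / 30)            # recent days get sampled more often
simulated <- sample(history, size = 10000, replace = TRUE, prob = weights)

todays_count <- 45
mean(simulated >= todays_count)   # share of simulated counts at least this large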
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.