How do you incorporate random effects with interactions? - random-effects

I am looking for advice on the proper model notation to test for differences between sex in my data. My goal is to determine whether or not I need to split my data into M and F, or if I can keep my data combined (I hope that I will be able to keep it combined due to sample size).
I am using the glmmTMB package in R for resource selection function analysis. My plan is to run one model with random intercepts and slopes, without sex, and then compare this model to essentially the same model but with sex added as an interaction term. I will compare AIC to determine the most supported model (i.e. if the model with sex is supported, then I will separate my data into M and F and analyze them separately; if the model without sex is supported, then I will keep the data combined).
I am following the code provided in the supplementary materials of Muff et al. 2019 (model M4): https://conservancy.umn.edu/bitstream/handle/11299/204737/Goats_RSF.html?sequence=21&isAllowed=y
For example:
My model without sex looks like this:
glmmTMB(Used_and_Available_Locations ~ Urbanization + (1|AnimalID) + (0 + Urbanization|AnimalID), family = binomial(),...)
My model with sex is where I am confused...How do I account for sex as a random effect when there is an interaction? Should I not account for sex as a random effect?
glmmTMB(Used_and_Available_Locations ~ Sex + Sex*Urbanization + Urbanization + (1|AnimalID) + (0 + Urbanization|AnimalID), family = binomial(),...)

My goal is to determine whether or not I need to split my data into M and F, or if I can keep my data combined (I hope that I will be able to keep it combined due to sample size).
I have never come across a scenario in which it is a good idea to split data along these lines. It results in a massive loss of statistical power, and provides no benefits.
When the effect of a predictor variable differs depending on the level of another predictor, such as Urbanization in your model having a different effect in females than in males, the interaction term will uncover this, without any loss of statistical power. The main thing to be aware of when fitting an interaction is that the main effects of the variables involved are then each conditional on the other variable being at zero (or at its reference level in the case of a categorical variable such as sex).
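To make the expansion concrete, here is a minimal sketch in Python/patsy with a purely hypothetical data frame (your actual model is in R/glmmTMB, but R formulas expand Sex * Urbanization the same way, so listing Sex and Urbanization again alongside Sex*Urbanization is redundant):

import pandas as pd
from patsy import dmatrix

# Hypothetical toy data: a two-level sex factor and a continuous urbanization covariate
toy = pd.DataFrame({
    "Sex": ["F", "F", "M", "M"],
    "Urbanization": [0.2, 0.8, 0.3, 0.9],
})

# Sex * Urbanization expands to both main effects plus their interaction
X = dmatrix("Sex * Urbanization", toy, return_type="dataframe")
print(list(X.columns))
# ['Intercept', 'Sex[T.M]', 'Urbanization', 'Sex[T.M]:Urbanization']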
My model with sex is where I am confused...How do I account for sex as a random effect when there is an interaction? Should I not account for sex as a random effect?
Sex would never be a random effect. It does not make sense as a random intercept because it has only 2 levels (and can't really be considered a random factor for any other reason), and since it does not vary within individuals it does not make sense as a random slope either.

Related

Slow, underlying trend with a random walk or Gaussian process

I'm trying to fit a PyMC3 model to some data regarding sales over time. Here's a brief description:
N salespeople each sell some number of widgets per week
We assume each salesperson sells widgets at a different mean rate per week, and call this beta_i for salesperson i
Our observed data is assumed to be ~Poisson(beta_i).
Weekly average sales data is plotted here in a histogram, with a log-normal fit on top, to give you an idea of the distribution of weekly widget sales by salesperson.
In this first scenario, I get what I think is a reasonable set of betas, although they don't look particularly log-normal:
Because we are hoping to infer something about an underlying trend shared by all salespeople (something analogous to "the economy"), we tried adding something. Our first attempt had "the economy" be just a linear function of time, starting at an intercept value of 1 and having derivative gamma > 0 (gamma was half-normal with sd=0.5). We then had our data ~Poisson(beta_i * (1 + gamma)). In this scenario, betas didn't shift much, and we did infer something about "the economy", though it was a pretty weak effect.
I'm hoping to replace this with a random walk or a Gaussian process, to allow "the economy" to vary somewhat smoothly in time but to have an arbitrary shape. Ideally it would start at a value of 0, and then go wherever it needs to go to capture the underlying trend shared by all salespeople, with the data once again ~Poisson(beta_i * (1 + gamma)). Here's our model.
import pymc3 as pm

with pm.Model() as model:
    # Salesperson base rate of selling widgets
    # (mu_hat and sd_hat were estimated by fitting a log-normal to weekly sales data)
    beta_ = pm.Lognormal("beta", mu=mu_hat, sd=sd_hat, shape=(1, n_salespeople))

    # Economy: latent underlying trend shared by all salespeople
    gamma_ = pm.GaussianRandomWalk("gamma", mu=0, sd=1e-6, shape=(n_weeks, 1))

    # Effects
    base_rate = beta_
    economy = 1 + gamma_

    # Observed (the small constant added to the integer sales counts is the
    # workaround mentioned below for the infinite-probability error)
    lambda_ = base_rate * economy
    y = pm.Poisson("y", mu=lambda_, observed=observed_sales + 1e-7)
where observed_sales is an integer array of the number of sales made, of shape (n_weeks, n_salespeople).
First, I'm not sure I'm specifying this model correctly. Without the "economy", I infer a reasonable set of betas (although they don't look log-normal, as in the second screenshot). Then, the random walk we get back is not at all smooth, no matter how small the sd gets; more often than not, for reasons I'm unsure about, I get a message about "Mass matrix contains zeros on the diagonal." Finally, even at the beginning, I was getting infinite probabilities if I didn't add a small factor to the observed data... Why is that?
So, TL;DR: I'm fairly new to probabilistic programming, and I'm fairly sure something is going wrong, but I'm not sure what. Any input much, much appreciated!

Distribution of the Training Data vs Distribution of the Test/Prediction

Does the distribution represented by the training data need to reflect the distribution of the test data and of the data you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and comparing it to the distribution of the data I am predicting or testing with? Ideally, the training data should be sufficiently representative of the real-world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise), your linear regression will naturally land on a model approximately equal to y = x, and the model will generalize nearly perfectly, even outside of the training range. If your training data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}, the test point 500 will map nicely onto the function, returning 500.
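As a minimal sketch of that noiseless case (using scikit-learn, which is my own choice and not something the question mentions):

import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless training data on y = x for x in 1..10
X_train = np.arange(1, 11).reshape(-1, 1)
y_train = np.arange(1, 11)

model = LinearRegression().fit(X_train, y_train)

# Extrapolating far outside the training range still works here,
# because the fitted model matches the true function exactly
print(model.predict([[500]]))  # ~[500.]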
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena) -- especially nonlinear ones -- extend so cleanly to data outside of the training range. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 to 60 degrees, a linear function will behave brilliantly on the training data. If you then test on 60 to 80 degrees, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
Now let's consider noise. Imagine that you know EXACTLY what the real-world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, and the function you fit maps onto the training data really well. Now, if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the true function and the fitted function diverge and converge based on their relative frequencies. Sometimes the residuals are seemingly fine; sometimes they are dreadful.
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, suppose you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set you could very easily conclude, "only people under 30 get pregnant", while your testing set would demonstrate the exact opposite! The trouble is that your predictions are made in a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fitted model should have access to a similar range as your testing data.
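Here is a contrived sketch of that example in pandas (all numbers made up) showing marginals that match while the joint distribution flips:

import pandas as pd

# Train and test: marginal distributions of sex and age band are identical,
# but which ages go with which sex is flipped
train = pd.DataFrame({"sex": ["F"] * 50 + ["M"] * 50,
                      "age_band": ["20-30"] * 50 + ["30-40"] * 50})
test = pd.DataFrame({"sex": ["F"] * 50 + ["M"] * 50,
                     "age_band": ["30-40"] * 50 + ["20-30"] * 50})

print(train["sex"].value_counts(), test["sex"].value_counts())            # marginals match
print(train["age_band"].value_counts(), test["age_band"].value_counts())  # marginals match
print(pd.crosstab(train["sex"], train["age_band"]))  # joint: all F are 20-30
print(pd.crosstab(test["sex"], test["age_band"]))    # joint: all F are 30-40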
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining the distributions, joint distributions, etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think of the mean and the law of large numbers). Manipulating the data to match distributions (like z-scoring before the train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k-fold cross-validation.
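For reference, a minimal sketch of that stratified k-fold suggestion with scikit-learn (the feature matrix and labels below are hypothetical):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical feature matrix X and imbalanced binary labels y
X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves the ~90/10 class proportions
    print(np.bincount(y[test_idx]))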
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how one can reduce the variance of the training/test scores when performing cross-validation. This is important because, if the variance is high, your confidence in picking the best model goes down. The more representative the test data is of the train data, the less variance you get in your test scores across the cross-validation folds. Stratified cross-validation tackles this issue, especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesn't address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores, which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. There are many ways of doing this, but a very simple approach is to take each column's bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly get very sparse, especially when you have a lot of categorical variables. Also, the column order in which you bucket greatly affects the sampling output. Below is a solution where I bucket on the label first (just as in stratified CV) and then sample on one other feature (the most important feature, called score_percentage, which is known up front).
# Assumes: import numpy as np; import random
# self.fm is a pandas DataFrame of features, self.label holds the labels,
# and self.n_folds is the desired number of folds.
def train_test_folds(self, label_column="label"):
    # train_test is a list of [train, test] pairs of numpy index arrays.
    # The final iterator returns these individual elements separately.
    n_folds = self.n_folds
    label_classes = np.unique(self.label)
    train_test = []
    fmpd_copy = self.fm.copy()
    fmpd_copy[label_column] = self.label
    fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
    fmpd_copy = fmpd_copy.sort_values("score_percentage")
    for lbl in label_classes:
        fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl]
        # Calculate the folds using the label-specific dataset
        if fmpd_label.shape[0] < n_folds:
            raise ValueError("n_folds=%d cannot be greater than the"
                             " number of rows in each class." % n_folds)
        # Get some variance: walk through the score_percentage-sorted rows,
        # shuffling items within buckets of size n_folds
        s = 0
        shuffle_array = fmpd_label["index"].values
        maxS = len(shuffle_array)
        while s < maxS:
            last = min(maxS, s + n_folds) - 1
            for i in range(s, last):
                j = random.randint(i, last)
                if i < j:
                    shuffle_array[i], shuffle_array[j] = shuffle_array[j], shuffle_array[i]
            s = s + n_folds
        fmpd_label["index"] = shuffle_array
        fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
        # Assign each row (within its bucket) to a fold
        fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
            lambda x: x % n_folds)
        print("label ", lbl)
        for n in range(0, n_folds):
            test_set = fmpd_label[fmpd_label["test_set_number"] == n]["index"].values
            train_set = fmpd_label[fmpd_label["test_set_number"] != n]["index"].values
            print("for label ", lbl, " test size is ",
                  test_set.shape, " train size is ", train_set.shape)
            if len(train_test) != n_folds:
                # Split doesn't exist yet for this fold. Add it in.
                train_test.append([train_set, test_set])
            else:
                temp_arr = train_test[n]
                temp_arr[0] = np.append(temp_arr[0], train_set)
                temp_arr[1] = np.append(temp_arr[1], test_set)
                train_test[n] = [temp_arr[0], temp_arr[1]]
    return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift, which is a well-studied area within machine learning. See the link below, or just search Google for covariate shift. The concept is how to detect and ensure that your prediction data has a similar distribution to your training data. This is in the feature space, but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/
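As a rough sketch of one common way to detect covariate shift (sometimes called adversarial validation): train a classifier to separate training rows from prediction rows; if it does much better than chance, the feature distributions differ. The function below is illustrative only, not taken from the linked article:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def covariate_shift_score(X_train, X_pred):
    """AUC of a classifier separating training rows from prediction rows.
    ~0.5 suggests similar feature distributions; near 1.0 suggests strong shift."""
    X = np.vstack([X_train, X_pred])
    origin = np.concatenate([np.zeros(len(X_train), dtype=int),
                             np.ones(len(X_pred), dtype=int)])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, origin, cv=5, scoring="roc_auc").mean()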

Cheapest way to classify HTTP post objects

I can use SciPy to classify text on my machine, but I need to categorize string objects from HTTP POST requests at, or near, real time. What algorithms should I research if my goals are high concurrency, near real-time output, and a small memory footprint? I figured I could get by with the Support Vector Machine (SVM) implementation in Go, but is that the best algorithm for my use case?
Yes, an SVM (with a linear kernel) should be a good starting point. You can use scikit-learn (it wraps liblinear, I believe) to train your model. After the model is learned, it is simply a list of feature:weight pairs for each category you want to classify into. Something like this (suppose you have only 3 classes):
class1[feature1] = weight11
class1[feature2] = weight12
...
class1[featurek] = weight1k ------- for class 1
... different <feature, weight> ------ for class 2
... different <feature, weight> ------ for class 3 , etc
At prediction time, you don't need scikit-learn at all, you can use whatever language you are using on the server backend to do a linear computation. Suppose a specific POST request contains features (feature3, feature5), what you need to do is like this:
linear_score[class1] = 0
linear_score[class1] += lookup weight of feature3 in class1
linear_score[class1] += lookup weight of feature5 in class1
linear_score[class2] = 0
linear_score[class2] += lookup weight of feature3 in class2
linear_score[class2] += lookup weight of feature5 in class2
..... same thing for class3
pick class1, or class2 or class3 whichever has the highest linear_score
One step further: If you could have some way to define the feature weight (e.g., using tf-idf score of tokens), then your prediction could become:
linear_score[class1] += class1[feature3] x feature_weight[feature3]
so on and so forth.
Note that feature_weight[featurek] is usually different for each request.
Since, for each request, the number of active features is much smaller than the total number of considered features (think 50 tokens or features vs. your entire vocabulary of 1MM tokens), the prediction should be very fast. I can imagine that once your model is ready, the prediction step could be implemented on top of a key-value store (e.g., Redis).
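A rough sketch of that prediction step, in Python for illustration (the question mentions Go, and the same structure ports directly; the class names, feature names, and weights below are made up):

# Hypothetical per-class weight tables, e.g. exported from a trained linear model
weights = {
    "class1": {"feature3": 0.7, "feature5": -0.2},
    "class2": {"feature3": -0.1, "feature5": 0.9},
    "class3": {"feature3": 0.05, "feature5": 0.1},
}

def predict(active_features):
    # active_features maps feature name -> per-request weight (e.g. tf-idf),
    # or 1.0 everywhere if the features are unweighted
    scores = {
        cls: sum(w.get(f, 0.0) * fw for f, fw in active_features.items())
        for cls, w in weights.items()
    }
    return max(scores, key=scores.get)

print(predict({"feature3": 1.0, "feature5": 1.0}))  # "class2" with these made-up weights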

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with stack exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profile: Video, so Video profile only contains Attributes of Video category
Attributes, so an Attribute in the Video category may be Genre
Attribute Values, e.g. Comedy, Action, Thriller are all Attribute Values
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user.
Similarity based on All Attribute Values associated with the user.
1. Flat/one level
   Unequal number of attribute values between two users
   Attribute value can only be selected once per user, so no duplicates
   Therefore, binary string/boolean array with Cosine Similarity?
2. 1 + Weight Profiles
   Give each profile a weight (totaling 1?)
   Work out profile similarity, then multiply by weight, and sum?
3. 1 + Weight Attribute Categories and Profiles
   As an attribute belongs to a category, categories can be weighted
   Similarity per category, weighted sum, then same by profile?
   Or merge profile and category weights
4. 3 + Distance between every attribute value
   Table of similarity distance for every possible value vs value
   Rather than similarity by value === value
   'Close' attributes contribute to overall similarity.
   No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to omit them as much as possible. This means that it's highly preferable to keep a direct relation between users and attributes. So, basically, your users should be represented as vectors, where each variable (vector element) represents a single attribute.
If you choose such a representation, make sure all attributes make sense and have an appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algorithms) will treat them incorrectly (e.g. multiplying thriller, represented as 2, by comedy, represented as 5, which makes no sense).
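For illustration, a small pandas sketch (hypothetical data) of turning a single genre column into distinct binary variables:

import pandas as pd

# Hypothetical: one row per user, genre as a single categorical column
users = pd.DataFrame({"user": ["a", "b", "c"],
                      "genre": ["comedy", "thriller", "comedy"]})

# One indicator column per genre, instead of encoding genres as 1..5
one_hot = pd.get_dummies(users["genre"], prefix="genre")
print(one_hot)  # columns: genre_comedy, genre_thriller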
It's ok to use distance between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make things worse? Or should some attributes really have larger importance than others? Depending on these questions, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social group. Say, 1 for "comedy" variable becomes 0.8 for "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok.
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself, instead run linear regression to find them out.
Let me describe the last point in a bit more detail. Instead of simple cosine similarity, which looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (linear regression is the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). However, in your case there's no correct answer for the user vectors themselves; in fact, you can only define a correct answer for their similarity. So why not use that? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
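A rough sketch of that setup with scikit-learn (the user vectors, pairs, and similarity labels below are all made up, and a real training set would need many more labelled pairs):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical user-attribute matrix (rows = users, columns = binary attributes)
users = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1]])

# Training pairs (i, j) with hand-assigned similarity scores
pairs = [(0, 1), (0, 2), (1, 2)]
y = np.array([0.75, 0.40, 0.20])

# Each row of X is the elementwise product of the two user vectors
X = np.array([users[i] * users[j] for i, j in pairs])

reg = LinearRegression(fit_intercept=False).fit(X, y)
w = reg.coef_  # per-attribute weights for the weighted similarity
print(w)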
HTH

What is the difference between a Confusion Matrix and Contingency Table?

I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs the basic data from an m*n matrix like A = {aij}, where aij is the number of data points that are members of class ci and elements of cluster kj.
But there appear to be two of this type of matrix in Introduction to Data Mining (Pang-Ning Tan et al.), one is the Confusion Matrix, the other is the Contingency Table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
Wikipedia's definition:
In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
The confusion matrix should be clear: it basically tells you how many of the predicted results match the actual results. For example, see this confusion matrix:
                predicted c1    predicted c2
actual c1            15               3
actual c2             0               2
It tells that:
Column 1, row 1 means that the classifier predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (correct predictions).
Column 2, row 1 tells us that the classifier predicted that 3 items belong to class c2, but they actually belong to class c1 (wrong predictions).
Column 1, row 2 means that none of the items that actually belong to class c2 were predicted to belong to class c1 (this cell counts wrong predictions, and here there are none).
Column 2, row 2 tells us that 2 items that belong to class c2 were predicted to belong to class c2 (correct predictions).
Now see the formulas for accuracy and error rate in your book (Chapter 4, 4.2), and you should be able to clearly understand what a confusion matrix is. It is used to test the accuracy of a classifier using data with known results. The k-fold method (also described in the book) is one way to estimate that accuracy.
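As a quick sketch, applying those formulas to the example matrix above (Python for illustration):

import numpy as np

# Confusion matrix from the example: rows = actual, columns = predicted
cm = np.array([[15, 3],
               [0, 2]])

accuracy = np.trace(cm) / cm.sum()   # correct predictions / all predictions
error_rate = 1 - accuracy
print(accuracy, error_rate)          # 0.85 0.15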
Now, for Contingency table:
Wikipedia's definition:
In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables.
In data mining, contingency tables are used to show which items appeared together, for example in a transaction or in the shopping cart of a sales analysis. For example (this is the example from the book you mentioned):
          coffee    !coffee    total
tea          150         50      200
!tea         650        150      800
total        800        200     1000
It tells us that, out of 1000 survey responses (about whether people like coffee, tea, both, or neither):
150 people like both tea and coffee
50 people like tea but do not like coffee
650 people do not like tea but like coffee
150 people like neither tea nor coffee
Contingency tables are used to find the Support and Confidence of association rules, basically to evaluate association rules (read Chapter 6, 6.7.1).
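A quick sketch computing support and confidence for the rule {tea} -> {coffee} from the example table (Python for illustration):

# Counts from the example contingency table
n_total = 1000
n_tea = 200              # people who like tea
n_tea_and_coffee = 150   # people who like both

# Association rule {tea} -> {coffee}
support = n_tea_and_coffee / n_total   # 0.15
confidence = n_tea_and_coffee / n_tea  # 0.75
print(support, confidence)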
Now, the difference is that a confusion matrix is used to evaluate the performance of a classifier (how accurate its predictions are), while a contingency table is used to evaluate association rules.
Now, after reading this answer, google a bit (always use Google while you are reading your book), read what is in the book, work through a few examples, and don't forget to solve a few of the exercises given in the book. You should then have a clear concept of both of them, and of what to use in a given situation and why.
Hope this helps.
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.

Resources