Is there a model selection score for GEE models besides the scale? - statsmodels

I need to compare different repeated-measures or random-effects models whose parameters have been estimated with GEE.
However, there is no model-selection statistic besides the scale.
Has anyone implemented the AIC alternative for GEE proposed by Pan (QIC)?
http://50.28.42.145/faculty/wp-content/uploads/2012/11/rr2000-013.pdf
Other options for model selection are welcome.
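For anyone searching later: Pan's criterion is usually called QIC, and recent statsmodels releases expose it on the GEE results object. Here is a minimal sketch with synthetic, purely illustrative data; the qic() call assumes a statsmodels version that provides it:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic repeated-measures data: 20 subjects, 5 visits each (illustrative only)
rng = np.random.default_rng(0)
n_subj, n_visits = 20, 5
subject = np.repeat(np.arange(n_subj), n_visits)
x = rng.normal(size=n_subj * n_visits)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n_subj * n_visits)
df = pd.DataFrame({"y": y, "x": x, "subject": subject})

# Two candidate mean structures with an exchangeable working correlation
m1 = sm.GEE.from_formula("y ~ x", groups="subject", data=df,
                         cov_struct=sm.cov_struct.Exchangeable(),
                         family=sm.families.Gaussian()).fit()
m2 = sm.GEE.from_formula("y ~ 1", groups="subject", data=df,
                         cov_struct=sm.cov_struct.Exchangeable(),
                         family=sm.families.Gaussian()).fit()

# qic() returns Pan's QIC (and QICu); the model with the smaller QIC is preferred
print(m1.qic(), m2.qic())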

Related

Cross Entropy Loss Gets Negative Values In Training Transformer Model

I built a Transformer model for recovering text. In detail, the source text may contain redundant, missing, or wrong words, and my model has to correct as many of these words as possible. Moreover, I just want my model to learn the embedding of the correct sentence, so the sources and the targets are sequences of embeddings. Therefore, my loss function, cross entropy, takes two embedding sequences as input and target. In addition, this model is part of a larger model whose main criterion is negative log-likelihood.
Unfortunately, the cross entropy loss drops below 0.0 after a few epochs, and then the sum of cross entropy and negative log-likelihood is below 0.0 too. This prevents the whole model from converging.
I need help to resolve this issue. Thanks in advance.
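Not a full answer, but a minimal sketch of the likely mechanism, assuming PyTorch's soft-label cross entropy (available since 1.10): the loss is only guaranteed to be non-negative when the target is a class index or a proper probability distribution, so raw embedding values used as targets can push it below zero.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                                 # model output for 4 positions
proper_target = torch.softmax(torch.randn(4, 10), dim=-1)   # rows sum to 1
embedding_target = torch.randn(4, 10)                       # raw embedding used as a target

# Cross entropy over a probability distribution is always >= 0
print(F.cross_entropy(logits, proper_target))
# Over arbitrary real-valued targets it can be negative
print(F.cross_entropy(logits, embedding_target))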

Hidden Markov Model from Score Matrix

I'm taking a course in bioinformatics.
I'm trying to figure out how to build a Hidden Markov Model (HMM) from a Position-Specific Probability Matrix (PSPM).
Is there a clear pattern for how one should model it?
Can someone show me how to model it, with 3 or more states, based on a PSPM?
I will supply an example of a PSPM, but feel free to use your own.
Example taken from MIT course.
The question is not very clear, but maybe this will help.
For these models you can have as many positions as you want, in principle.
Here is an example of what is possible to do in bioinformatics:
Let's imagine you have a DNA alignment (we can call it the "database"). You can build the model like this with hmmer:
hmmbuild <hmm_profile> <alignment>
Then you can check the hmm_profile, which will contain the emission probabilities for each position of the alignment and also the transition probabilities.
After that you can, for example, search new sequences against it. Now imagine you have a set of sequences ("queries") in a file and you want to map them against your "database", so you know which positions each query aligns to best:
nhmmer -o <output_file> --dna <hmm_profile> <sequences_file>
The output file contains the query sequences aligned against the database.
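If you want to see the idea behind a profile-style HMM without hmmer, here is a minimal Python sketch; the PSPM values are made up, and a real profile HMM (such as the one hmmbuild produces) also adds insert and delete states at every position:

import numpy as np

# Hypothetical 4-column PSPM over DNA: rows are A, C, G, T; each column sums to 1
bases = ["A", "C", "G", "T"]
pspm = np.array([
    [0.80, 0.10, 0.10, 0.20],
    [0.10, 0.70, 0.10, 0.20],
    [0.05, 0.10, 0.70, 0.30],
    [0.05, 0.10, 0.10, 0.30],
])

# One match state per PSPM column: emissions come straight from the column,
# and transitions simply chain the match states left to right
n_states = pspm.shape[1]
emissions = {f"M{i+1}": dict(zip(bases, pspm[:, i])) for i in range(n_states)}
transitions = {f"M{i+1}": {f"M{i+2}": 1.0} for i in range(n_states - 1)}

# Log-probability of a sequence that visits every match state once
# (transitions are all 1.0 here, so they contribute nothing to the score)
def score(seq):
    return sum(np.log(emissions[f"M{i+1}"][b]) for i, b in enumerate(seq))

print(score("ACGT"))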

Numeric to nominal filter

When is it compulsory to use the filter to change the data type to nominal? I am doing classification right now, and the results differ by a huge margin if I change the type to nominal compared to leaving it as it is. Thank you in advance.
I don't think your question is well formed, but I will try to answer it anyway.
Nominal and numeric attributes represent different types of attributes and therefore are treated differently by machine learning algorithms.
Nominal attributes are limited to a closed set of values, and they don't have an order or any other relation between them. Usually nominal attributes should have a small number of possible values (a large set of possible values may cause over-fitting). The color of a car is an example of an attribute that would probably be represented as a nominal attribute.
Numeric attributes are usually more common. They represent values on some axis and are not limited to specific values. Usually classification algorithms will try to find a point on that axis that differentiates well between the classes, or use the value to calculate the distance between instances. The salary of an employee is an example of an attribute I would probably use as a numeric attribute.
One more thing you need to take into account is how the classification algorithm treats nominal and numeric attributes. Some algorithms don't handle nominal attributes well. Other algorithms will not work well with several numeric attributes if the values of the attributes are not normalized.
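To make the difference concrete outside Weka (the filter in question is presumably Weka's NumericToNominal), here is a small Python/pandas sketch with made-up data: once the codes are nominal, they stop acting as quantities and become independent categories.

import pandas as pd

# Toy data: 'color_code' is really a nominal attribute stored as numbers
df = pd.DataFrame({"color_code": [1, 2, 3, 1, 2],
                   "price": [10.0, 12.5, 9.0, 11.0, 13.0]})

# Treated as numeric, 3 > 2 > 1 implies an order and distances that don't exist;
# treated as nominal, each code becomes its own indicator column
nominal = pd.get_dummies(df["color_code"].astype("category"), prefix="color")
print(pd.concat([df["price"], nominal], axis=1))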

How to decide whether to convert to a categorical variable or keep it numeric?

This might be a basic or trivial question and the answer might be straightforward. Still, I would like to ask it to clear my doubt once and for all.
Take the example of passenger class in the famous Titanic data. Functionally it is indeed categorical data, so it makes perfect sense to convert it to a categorical variable. Algorithms, as per my understanding, tend to see a pattern specific to each class. But at the same time, if you treat it as a numeric variable, it might also denote a range for a decision tree, say passengers between first class and second class.
It looks like both are correct, and both will affect the machine learning algorithm's output in different ways.
Which one is appropriate, and is there an extensive discussion about it anywhere? Should we use such ambiguous variables as numeric and also keep a copy as a categorical variable, which might prove to be a technique to uncover more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class like "people who were given the drug versus those who weren't" - you can very easily compare the impacts of being in a higher class (e.g. 2 or 3) to being in the most common class, 1, which again would be left out.
The problem with mapping categorical notions to numerical is that some algorithms (e.g. neural networks) will interpret the value itself as having a meaning, i.e. you would get different results if you assign values 1,2,3 to passenger classes than, for example 0,1,2 or 3,2,1. The correspondence between the passenger classes and numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the lower the number, the "better" the class is; however, it's still hard to interpret it as "the first class is twice as good as second class", unless you define some measure of "goodness" that makes the relation between the numbers "1" and "2" sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.
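A tiny pandas sketch of the two treatments discussed above, using made-up Titanic-style rows; with drop_first=True, class 1 becomes the base group mentioned in the interpretation above:

import pandas as pd

# Hypothetical Titanic-style frame: Pclass stored as 1/2/3
df = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1], "Survived": [1, 0, 1, 0, 1]})

# Numeric treatment: a single column, one coefficient per "one-class increase"
numeric_X = df[["Pclass"]]

# Categorical treatment: one indicator per class, class 1 left out as the base group
categorical_X = pd.get_dummies(df["Pclass"].astype("category"),
                               prefix="Pclass", drop_first=True)

print(numeric_X.join(categorical_X))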

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with Stack Exchange you can have many profiles for one user, as they differ per exchange site. In this problem:
Profile: Video, so a Video profile only contains Attributes of the Video category
Attributes, so an Attribute in the Video category may be Genre
Attribute Values, e.g. Comedy, Action, Thriller are all Attribute Values
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user.
Similarity based on All Attribute Values associated with the user.
Flat/one level
Unequal number of attribute values between two users
Attribute value can only be selected once per user, so no duplicates
Therefore, binary string/boolean array with Cosine Similarity?
1 + Weight Profiles
Give each profile a weight (totaling 1?)
Work out profile similarity, then multiply by weight, and sum?
1 + Weight Attribute Categories and Profiles
As an attribute belongs to a category, categories can be weighted
Similarity per category, weighted sum, then same by profile?
Or merge profile and category weights
3 + Distance between every attribute value
Table of similarity distance for every possible value vs value
Rather than similarity by value === value
'Close' attributes contribute to overall similarity.
No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to avoid it as much as possible. This means that it's highly preferable to keep a direct relation between users and attributes. So, basically, your users should be represented as vectors, where each variable (vector element) represents a single attribute.
If you choose such a representation, make sure all attributes make sense and have an appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algorithms) will treat them incorrectly (e.g. multiplying thriller, represented as 2, with comedy, represented as 5, which actually makes no sense).
It's OK to use distance between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: a simple representation of users as vectors of attributes, plus cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
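A minimal sketch of that starting point, with a made-up attribute vocabulary: each user is a binary vector over all attribute values, and cosine similarity compares two users.

import numpy as np

# Hypothetical attribute vocabulary; 1 means the user selected that attribute value
attributes = ["comedy", "action", "thriller", "rock", "jazz"]
user_a = np.array([1, 1, 0, 1, 0], dtype=float)
user_b = np.array([1, 0, 1, 1, 1], dtype=float)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(user_a, user_b))  # 1.0 = identical selections, 0.0 = nothing shared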
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make things worse? Or should some attributes really have larger importance than others? Depending on the answers, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social groups. Say, 1 for the "comedy" variable becomes 0.8 for a "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is OK.
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself; instead, run a linear regression to find them.
Let me describe the last point in a bit more detail. Instead of simple cosine similarity (here written without the normalization by the vector norms), which looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use a weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (a linear one is the most popular). Normally, you collect a dataset (X, y), where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). However, in your case there's no correct answer for the user vectors themselves; in fact, you can only define a correct answer for their similarity. So why not use that? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
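A sketch of that idea in Python (numpy + scikit-learn); the user vectors and the assigned similarities below are random placeholders that you would replace with real attribute vectors and hand-labelled pairs:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_pairs, n_attrs = 50, 5

# Hypothetical training pairs of binary user vectors
users_i = rng.integers(0, 2, size=(n_pairs, n_attrs)).astype(float)
users_j = rng.integers(0, 2, size=(n_pairs, n_attrs)).astype(float)

# Each row of X is the element-wise product of a pair of user vectors,
# y is the similarity you assigned to that pair (random placeholder here)
X = users_i * users_j
y = rng.uniform(0, 1, size=n_pairs)

reg = LinearRegression().fit(X, y)
w = reg.coef_  # learned per-attribute weights

def weighted_similarity(x, v):
    return float(np.dot(w, x * v) + reg.intercept_)

print(weighted_similarity(users_i[0], users_j[0]))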
HTH
