How do you approximate state transitions based on a person's attributes? - probability

Suppose you have a customer attribute classified as high, medium, or low. Different classes of customers are supposed to have different probabilities of state transition. Are you creating micro ML models to get different probabilities, or are you just using relative percentages of past transitions based on historical data?

Using distributions based on historical data is the most typical approach if you have transition data. Nevertheless, depending on the situation, you can also use prediction models to determine which transition the agent will make, as long as you have independent variables you can use to predict it.
This prediction can be done with ML models or statistical models, depending on the situation.
You can also go further and use artificial intelligence if there is constant sequential decision making about where to transition; you can do this if you want to optimize the behavior of these customers. Reinforcement learning is used for that, and you can use your simulation model to generate a policy for the decision-making process of these customers.
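For the historical-data route, here is a minimal sketch of estimating class-specific transition probabilities from past transitions and sampling from them in a simulation (the column and state names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical historical log of observed transitions; column and state names
# are made up for illustration.
history = pd.DataFrame({
    "customer_class": ["high", "high", "medium", "low", "high", "medium"],
    "from_state":     ["A",    "A",    "A",      "A",   "B",    "B"],
    "to_state":       ["B",    "C",    "B",      "A",   "C",    "B"],
})

# Empirical transition probabilities, estimated separately for each class.
transition_probs = (
    history
    .groupby(["customer_class", "from_state"])["to_state"]
    .value_counts(normalize=True)
    .rename("prob")
    .reset_index()
)

def next_state(rng, probs, customer_class, current_state):
    """Sample the next state for an agent of the given class inside the simulation."""
    rows = probs[(probs["customer_class"] == customer_class) &
                 (probs["from_state"] == current_state)]
    return rng.choice(rows["to_state"], p=rows["prob"])

rng = np.random.default_rng(0)
print(next_state(rng, transition_probs, "high", "A"))
```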

Related

How to validate my YOLO model trained on custom data set?

I am doing research on object detection using YOLO, although I am from the civil engineering field and not familiar with computer science. My advisor is asking me to validate my YOLO detection model trained on a custom dataset, but my problem is that I really don't know how to validate my model. So, please kindly point me to how to validate my model.
Thanks in advance.
I think first you need to make sure that all the cases you are interested in (location of objects, their size, general view of the scene, etc.) are represented in your custom dataset; in other words, that the collected data reflects your task. You can discuss it with your advisor. The main rule is to label the data in the same manner and quality as you want to see in the output.
This is really important: garbage in, garbage out. The quality of the output of your trained model is determined by the quality of the input (the labelled data).
Once this is done, it is common practice to split your data into training and test sets. During model training only the training set is used, and you can later validate the quality (generalizing ability, robustness, etc.) on data the model did not see: the test set. It is also important that these two subsets do not overlap; otherwise your model will be overfitted and will not perform the task properly.
Then you can train a few different models (with some architectural changes, for example) on the same training set and validate them on the same test set; this is the regular validation process.
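As a rough illustration of what test-set validation looks like for a detector, here is a simplified sketch that matches predicted boxes to ground-truth boxes by intersection-over-union (IoU) and reports precision and recall. Real YOLO tooling usually reports mAP, which builds on the same idea; the box format and 0.5 threshold here are common conventions rather than anything specific to your setup:

```python
# Minimal sketch of held-out test-set evaluation for an object detector.
# Boxes are assumed to be (x_min, y_min, x_max, y_max).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground truth for one image."""
    matched = set()
    tp = 0
    for pred in predictions:
        best, best_iou = None, iou_threshold
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(pred, gt) >= best_iou:
                best, best_iou = i, iou(pred, gt)
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Example: two predicted boxes, two ground-truth boxes on a test image.
preds = [(10, 10, 50, 50), (60, 60, 90, 90)]
gts = [(12, 12, 48, 48), (200, 200, 240, 240)]
print(precision_recall(preds, gts))  # -> (0.5, 0.5)
```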

Why a "single variable model" overperforms a multivariate model for classification?

I have a dataset with 2 possible outcomes, disease vs. healthy. When looking for biomarkers, there is one variable that yields a higher AUROC than a model built with 5 variables including that same feature.
It is hard to answer your question without more information about the data and model you're using.
Generally speaking, making a model more complex (e.g. by adding additional predictors) increases the risk of overfitting to the training data, which can lead to bad performance on the test data.
Another possible reason for decreasing predictive performance when adding additional predictors is multicollinearity between the predictors. You can check this by looking at the correlations between them or, in regression models, at variance inflation factors.
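As a hedged sketch of that check (using statsmodels; the data here is synthetic and the column names are placeholders for your five biomarkers):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the five predictors; replace with your own DataFrame.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = pd.DataFrame({
    "biomarker_1": base,
    "biomarker_2": base + 0.1 * rng.normal(size=200),  # highly correlated with biomarker_1
    "biomarker_3": rng.normal(size=200),
    "biomarker_4": rng.normal(size=200),
    "biomarker_5": rng.normal(size=200),
})

# 1) Pairwise correlations between predictors.
print(X.corr())

# 2) Variance inflation factors (rule of thumb: values above ~5-10
#    suggest problematic multicollinearity).
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)
```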

Gensim Word2vec model parameter tuning

I am working on a Word2Vec model. Is there any way to get the ideal value for one of its parameters, i.e. iter, like the way we used to do in K-Means (elbow curve plot) to get the K value? Or is there any other way to do parameter tuning for this model?
There's no one ideal set of parameters for a word2vec session – it depends on your intended usage of the word-vectors.
For example, some research has suggested that using a larger window tends to position the final vectors in a way that's more sensitive to topical/domain similarity, while a smaller window value shifts the word-neighborhoods to be more syntactic/functional drop-in replacements for each other. So depending on your particular project goals, you'd want a different value here.
(Similarly, because the original word2vec paper evaluated models, & tuned model meta-parameters, based on the usefulness of the word-vectors to solve a set of English-language analogy problems, many have often tuned their models to do well on the same analogy task. But I've seen cases where the model that scores best on those analogies does worse when contributing to downstream classification tasks.)
So what you really want is a project-specific way to score a set of word-vectors, well-matched to your goals. Then, you run many alternate word2vec training sessions, and pick the parameters that do best on your score.
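A minimal sketch of such a sweep, assuming gensim 4.x parameter names (older versions use size/iter) and a placeholder scoring function you would replace with your own project-specific evaluation:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; replace with your own tokenized sentences.
corpus = [
    ["heat", "demand", "rises", "in", "winter"],
    ["heat", "demand", "falls", "in", "summer"],
    ["customers", "transition", "between", "states"],
] * 50

def project_score(wv):
    # Placeholder scoring function: swap in something tied to your real goal,
    # e.g. accuracy of a downstream classifier built on these vectors, or
    # agreement with a hand-curated list of domain word pairs.
    return wv.similarity("heat", "demand")

results = {}
for window in (2, 5, 10):
    model = Word2Vec(corpus, vector_size=50, window=window, epochs=20,
                     min_count=1, seed=1, workers=1)
    results[window] = project_score(model.wv)

print(results)
print("best window:", max(results, key=results.get))
```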
The case of iter/epochs is special, in that by the logic of the underlying stochastic-gradient-descent optimization method, you'd ideally want to use as many training-epochs as necessary for the per-epoch running 'loss' to stop improving. At that point, the model is plausibly as good as it can be – 'converged' – given its inherent number of free-parameters and structure. (Any further internal adjustments that improve it for some examples worsen it for others, and vice-versa.)
So potentially, you'd watch this 'loss', and choose a number of training-iterations that's just enough to show the 'loss' stagnating (jittering up-and-down in a tight window) for a few passes. However, the loss-reporting in gensim isn't yet quite optimal – see project bug #2617 – and many word2vec implementations, including gensim and going back to the original word2vec.c code released by Google researchers, just let you set a fixed count of training iterations, rather than implement any loss-sensitive stopping rules.
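If you still want to watch the loss per epoch, a rough sketch using gensim's callback mechanism looks like the following (keeping the reporting caveat above in mind; the reported value is a running tally, so the per-epoch loss is taken as the difference between successive readings):

```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    """Print the loss reported by gensim after each epoch."""
    def __init__(self):
        self.epoch = 0
        self.previous = 0.0

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        print(f"epoch {self.epoch}: loss {cumulative - self.previous:.1f}")
        self.previous = cumulative
        self.epoch += 1

# Toy corpus; replace with your own tokenized sentences.
corpus = [["heat", "demand", "rises"], ["heat", "demand", "falls"]] * 100

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=30,
                 compute_loss=True, callbacks=[LossLogger()])
```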

What type of algorithm should I use for forecasting with only very little historic data?

The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household over the next 24 hours, with a time resolution of only a few minutes for the next three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historical data, since I want the algorithm to be usable in different settings from the start. To begin with, I only have very basic inputs such as the assumed yearly heat demand, the current outside temperature, and the time. So it will be quite general and imprecise at the beginning, but it should learn from its errors over time.
The algorithm should be implemented in MATLAB if possible.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time by learning from and adapting to incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is quite general, at least once its generalizations to ensemble filters etc. are taken into account (it is also easy to implement in MATLAB).
https://en.wikipedia.org/wiki/Kalman_filter
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario you could start with a simple prediction from past values and add daily rhythms, the influence of outside temperature, etc. The more (correct) information you put into your model a priori, the better your model should be at prediction.
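As a minimal sketch of the filtering idea (in Python for brevity, but the structure translates directly to MATLAB), here is a scalar Kalman filter that tracks a slowly drifting demand level from noisy readings; a real model would add a daily cycle and an outside-temperature input as extra state components:

```python
import numpy as np

def kalman_1d(observations, process_var=0.5, obs_var=4.0, x0=0.0, p0=10.0):
    """Scalar Kalman filter with a random-walk state model x_k = x_{k-1} + noise."""
    x, p = x0, p0            # state estimate and its variance
    estimates = []
    for z in observations:
        # Predict step: the state drifts, so its uncertainty grows.
        p = p + process_var
        # Update step: blend the prediction with the new observation z.
        k = p / (p + obs_var)          # Kalman gain
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

# Synthetic demand readings: a drifting level plus measurement noise.
rng = np.random.default_rng(0)
true_level = np.cumsum(rng.normal(0, 0.3, 200)) + 20
readings = true_level + rng.normal(0, 2.0, 200)

filtered = kalman_1d(readings)
print(filtered[-5:], true_level[-5:])
```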
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html
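To make the calibration idea concrete, here is a generic least-squares sketch (not tied to OpenTURNS); the model form and parameter names are illustrative assumptions, and the observations are synthetic stand-ins for real meter data:

```python
import numpy as np
from scipy.optimize import least_squares

def heat_demand(params, hour, outside_temp):
    """Toy demand model: base load + temperature dependence + daily cycle."""
    base, temp_coeff, amplitude = params
    daily_cycle = np.cos(2 * np.pi * (hour - 7) / 24)  # assumed morning peak
    return base + temp_coeff * (18.0 - outside_temp) + amplitude * daily_cycle

def residuals(params, hour, outside_temp, observed):
    return heat_demand(params, hour, outside_temp) - observed

# Synthetic observations standing in for two weeks of metered demand.
rng = np.random.default_rng(1)
hour = np.tile(np.arange(24), 14).astype(float)
outside_temp = 5 + 8 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, hour.size)
observed = heat_demand([2.0, 0.3, 1.5], hour, outside_temp) + rng.normal(0, 0.5, hour.size)

# Calibration: tune the parameters so the model best matches the observations.
fit = least_squares(residuals, x0=[1.0, 0.1, 1.0], args=(hour, outside_temp, observed))
print("calibrated parameters:", fit.x)
```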

Validation of hurdle model?

I built a hurdle model, and then used that model to predict from known to unknown data points using the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known outcomes). Thus you can get a general sense of how well your model works. Cross-validation works quite well for a wide variety of models. The downside is that it can get quite computationally heavy.
With large datasets, 10-fold cross-validation is enough. The smaller your dataset is, the more "folds" you have to do (i.e. with very small datasets, you have to do leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take some effort to interpret in more complicated comparisons, but for your general-purpose question of "how to assess the validity of the model", the results should be quite easy to use.
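As a hedged sketch of what this could look like for a two-part (hurdle-style) model, here is a k-fold cross-validation loop built from scikit-learn pieces: a logistic regression for the zero vs. positive part (scored with sensitivity and specificity, as you suggest) and a Poisson regressor for the positive counts. The data is synthetic and the model choices are placeholders for your actual hurdle specification:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression, PoissonRegressor
from sklearn.metrics import recall_score, mean_absolute_error

# Synthetic stand-in data: counts with excess zeros, two predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
is_positive = rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))      # zero vs. non-zero part
counts = np.where(is_positive, rng.poisson(np.exp(0.5 + 0.4 * X[:, 1])), 0)

sens, spec, mae = [], [], []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = counts[train_idx], counts[test_idx]

    # Part 1: binary model for zero vs. positive, scored with sensitivity/specificity.
    clf = LogisticRegression().fit(X_tr, y_tr > 0)
    pred_pos = clf.predict(X_te)
    sens.append(recall_score(y_te > 0, pred_pos))
    spec.append(recall_score(y_te == 0, ~pred_pos))

    # Part 2: count model fitted on the positive observations only.
    pos = y_tr > 0
    reg = PoissonRegressor().fit(X_tr[pos], y_tr[pos])
    combined = np.where(pred_pos, reg.predict(X_te), 0)           # combined prediction
    mae.append(mean_absolute_error(y_te, combined))

print(f"sensitivity {np.mean(sens):.2f}, specificity {np.mean(spec):.2f}, MAE {np.mean(mae):.2f}")
```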
