How to make a simulation from a dataset? - algorithm

I have a relatively big dataset for which I want to run a simulation, for example to see what the consequences of some scenario would be.
I could do a Monte Carlo simulation, but there are many variables, so it would be impossible for the user to specify a probability distribution for each one.
What other ways would you suggest to define scenarios (the user cannot specify an input for every variable) and to run the simulation?
For example, I am thinking of generating a new dataset from the existing one according to some scenario, in a way that remains scientifically sound.
Notes :
- all of this is going to be inside an application.
- the dataset variables are dependent on time
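To illustrate what I mean, here is a rough sketch of the kind of scenario-based resampling I have in mind (the column names, the shock factor and the output metric are all made up):

```python
import numpy as np
import pandas as pd

# Hypothetical historical dataset: one row per time step (names are made up).
df = pd.DataFrame({
    "demand": np.random.normal(100, 10, 500),
    "price": np.random.normal(20, 2, 500),
})

def simulate_scenario(data, shocks, n_runs=1000, block=10, rng=None):
    """Block-bootstrap the historical data and apply scenario shocks.

    `shocks` maps a column name to a multiplicative factor, e.g.
    {"price": 1.2} for a "prices rise by 20%" scenario.  Resampling
    whole blocks keeps some of the time dependence of the original series.
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    outcomes = []
    for _ in range(n_runs):
        # Stitch randomly chosen blocks together into one synthetic history.
        starts = rng.integers(0, n - block, size=n // block)
        sample = pd.concat([data.iloc[s:s + block] for s in starts], ignore_index=True)
        for col, factor in shocks.items():
            sample[col] = sample[col] * factor
        # Whatever summary of the scenario you care about, e.g. total revenue.
        outcomes.append((sample["price"] * sample["demand"]).sum())
    return np.array(outcomes)

revenue = simulate_scenario(df, {"price": 1.2})
print(revenue.mean(), np.percentile(revenue, [5, 95]))
```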

Related

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. I think each of the 10 folds will yield different coefficient values; now which of the 10 should I pick as my final model or parameter estimates?
Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross-validation, then indeed each of the 10 fits will have different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on a single train/test split, which could be unusually unlucky or unusually lucky. CV is more robust because it is very unlikely that all ten splits are lucky or all ten are unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
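A minimal sketch of that workflow with scikit-learn (placeholder data): CV gives the error estimate, and the final fit on the whole training set gives the coefficients you report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 3)                                    # placeholder data
y = X @ np.array([1.5, -2.0, 0.5]) + np.random.randn(200) * 0.1

model = LinearRegression()

# 10-fold CV: ten different fits, ten scores -> a robust error estimate.
scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
print("CV MSE: %.3f +/- %.3f" % (-scores.mean(), scores.std()))

# The final model (and the coefficients you report) comes from fitting
# on the whole training set, not from any single fold.
model.fit(X, y)
print(model.coef_)
```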
Cross-validation is used to see how good your model's predictions are. It makes clever use of the data by testing on multiple splits of the same dataset, which is especially useful if you do not have much training data.
For example, it can be used to make sure you are not overfitting: you evaluate your fitted model with cross-validation, and if you see the error grow a lot somewhere, you go back and tweak the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data in different ways and validate the model's performance on each split. Note that the parameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the parameter combination that gives the best results over a specific split of the data (say 70% train and 30% test). So in this case, across the different parameter combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
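To make the distinction concrete, here is a small scikit-learn sketch (the SVM and the parameter values are just examples):

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Cross-validation: the model (and its parameters) stay fixed,
# only the train/validation split changes.
print(cross_val_score(SVC(C=1.0), X, y, cv=5).mean())

# Grid search over a fixed 70/30 split: the data stays fixed,
# only the parameter combination changes.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
best = max(
    ({"C": C, "gamma": g} for C, g in product([0.1, 1, 10], ["scale", 0.01])),
    key=lambda p: SVC(**p).fit(X_tr, y_tr).score(X_te, y_te),
)
print(best)
```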
Cross Validation is mainly used for the comparison of different models.
For each model, you get the average generalization error over the k validation sets. Then you can choose the model with the lowest average generalization error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to find out which method (SVM, Random Forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, different models are generated and evaluated across the folds, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After identifying the best method (or best parameters) we can train/retrain our model on the full training dataset.
Parameters or coefficients can be determined by grid-search techniques. See grid search
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing. Dividing such a small dataset into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case we do not need a separate validation set, but we still need to hold out the test data.
The model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the validation metric give a sense of how well the model will perform in practice.
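A sketch of Scenario-2 with scikit-learn (placeholder data and classifier): the test set is held out, and CV on the training part yields the mean and standard deviation of the metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = np.random.rand(150, 4), np.random.randint(0, 2, 150)   # placeholder data

# Hold out a test set; CV replaces a separate validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = cross_val_score(RandomForestClassifier(random_state=0), X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Only once all choices are final, evaluate a single time on the held-out test set.
final = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", final.score(X_test, y_test))
```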

using genetic algorithm to generate test sequences based on extended finite state machine

I want to generate test sequences based on an Extended Finite State Machine (EFSM) using a genetic algorithm. EFSM-based testing faces the problem of infeasible paths when using a genetic algorithm. My coverage criterion is transition coverage. I have an EFSM model of a system which has input parameters and guards on the transitions from one state to another. Using this EFSM model, I want to generate test sequences, but I am confused about how to start, i.e. how to generate the initial population.
My research is about EFSM-based test case generation. I have a model of an ATM machine. This model consists of states and transitions. Transitions have guards and actions on the input parameters. Now I want to generate test cases for this machine, i.e. model-based testing. For this task it is essential that there are no infeasible paths; every transition should be covered by a test case. For this purpose I need to generate test sequences. A genetic algorithm is good for path optimization, but I don't know how to use my model specification in the genetic algorithm to generate the test sequences.
Given the ramifications, I would simplify the random creation of the population by using a random walk in the graph of the FSM (not taking the boolean constraints into account for now) - this is like generating examples from a regex (or transforming your FSM into a transducer that produces its input on its output and walking through it). Once you have generated many random examples of sufficient length, you go through a process of validating them using the _E_FSM part. Given that many of them will probably not be valid, you may consider some "fixing" strategy - repairing individuals that do not validate but are not far from being correct (a heuristic you have to come up with on your own). Each individual is then actually a set of sequences (so you evolve a population of sets) and your evaluation metric would be coverage at the set level. Additionally, I would either not use a crossover operator or ensure that only valid points and individuals cross. Mutation would be choosing a point in the graph and randomly walking a different path. That's about it for a sketch of a solution (I successfully solved a similar problem with a GA).
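To make the sketch a bit more concrete, here is a rough Python illustration of the random-walk population generation and a set-level transition-coverage fitness, using a toy EFSM I made up (a single integer `balance` variable with guarded withdraw/deposit transitions); everything here is an assumption, not your actual model:

```python
import random

# Toy EFSM (made up): for each state, a list of transitions
# (name, guard, update, target) guarded on an integer context variable `balance`.
TRANSITIONS = {
    "idle": [("insert_card", lambda b: True,    lambda b: b,       "auth")],
    "auth": [("pin_ok",      lambda b: True,    lambda b: b,       "menu"),
             ("pin_bad",     lambda b: True,    lambda b: b,       "idle")],
    "menu": [("withdraw_50", lambda b: b >= 50, lambda b: b - 50,  "menu"),
             ("deposit_100", lambda b: True,    lambda b: b + 100, "menu"),
             ("eject_card",  lambda b: True,    lambda b: b,       "idle")],
}

def random_walk(length, start="idle"):
    """Random walk over the plain FSM graph, ignoring the guards."""
    state, seq = start, []
    for _ in range(length):
        name, _, _, nxt = random.choice(TRANSITIONS[state])
        seq.append(name)
        state = nxt
    return seq

def covered_transitions(seq, start="idle", balance=0):
    """Replay a sequence against the full EFSM; stop at the first infeasible step."""
    state, covered = start, set()
    for name in seq:
        match = next((t for t in TRANSITIONS[state] if t[0] == name), None)
        if match is None or not match[1](balance):
            break                                  # infeasible from here on
        covered.add((state, name))
        balance, state = match[2](balance), match[3]
    return covered

def fitness(individual):
    """An individual is a *set* of sequences; fitness is transition coverage."""
    total = sum(len(ts) for ts in TRANSITIONS.values())
    covered = set().union(*(covered_transitions(s) for s in individual))
    return len(covered) / total

population = [[random_walk(8) for _ in range(5)] for _ in range(20)]
print(max(fitness(ind) for ind in population))
```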

Binary classification of sensor data

My problem is the following: I need to classify a data stream coming from a sensor. I have managed to get a baseline using the median of a window, and I subtract the values from that baseline (I want to avoid negative peaks, so I only use the absolute value of the difference).
Now I need to distinguish an event (= something triggered the sensor) from the noise near the baseline.
The problem is that I don't know which method to use.
There are several approaches I have thought of:
- sum up the values in a window; if the sum is above a threshold the class should be EVENT ('integrate and dump')
- sum up the differences of the values in a window and take the mean (which gives something like a first derivative); if the value is positive and above a threshold set class EVENT, otherwise NO-EVENT
- a combination of both
(unfortunately these approaches have the drawback that I need to guess the threshold values and the window size)
- an SVM that learns from manually classified data (but I don't know how to set this up properly: which features should I look at - median/mean of a window? integral? first derivative?...)
What would you suggest? Are there better/simpler methods to get this task done?
I know there are a lot of sophisticated algorithms out there, but I'm confused about what the best approach could be - please have a little patience with a newbie who has no machine learning/DSP background :)
Thank you a lot and best regards.
The key to evaluating your heuristic is to develop a model of the behaviour of the system.
For example, what is the model of the physical process you are monitoring? Do you expect your samples, for example, to be correlated in time?
What is the model for the sensor output? Can it be modelled as, for example, a discretized linear function of the voltage? Is there a noise component? Is the magnitude of the noise known or unknown but constant?
Once you've listed your knowledge of the system that you're monitoring, you can then use that to evaluate and decide upon a good classification system. You may then also get an estimate of its accuracy, which is useful for consumers of the output of your classifier.
Edit:
Given the more detailed description, I'd suggest trying some simple models of behaviour that can be tackled using classical techniques before moving to a generic supervised learning heuristic.
For example, suppose:
The baseline, event threshold and noise magnitude are all known a priori.
The underlying process can be modelled as a Markov chain: it has two states (off and on) and the transition times between them are exponentially distributed.
You could then use a hidden Markov Model approach to determine the most likely underlying state at any given time. Even when the noise parameters and thresholds are unknown, you can use the HMM forward-backward training method to train the parameters (e.g. mean, variance of a Gaussian) associated with the output for each state.
If you know even more about the events, you can get by with simpler approaches: for example, if you knew that the event signal always reached a level above the baseline + noise, and that events were always separated in time by an interval larger than the width of the event itself, you could just do a simple threshold test.
Edit:
The classic intro to HMMs is Rabiner's tutorial (a copy can be found here). Relevant also are these errata.
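For example, a minimal sketch with the hmmlearn package, assuming the baseline-subtracted samples sit in a 1-D NumPy array (the signal below is a synthetic placeholder):

```python
import numpy as np
from hmmlearn import hmm

# Placeholder for the baseline-subtracted sensor values, shape (n_samples,).
samples = np.abs(np.random.randn(1000))
samples[300:400] += 3.0                          # a fake "event"

# Two hidden states (no-event / event) with Gaussian emissions.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(samples.reshape(-1, 1))

# Most likely state sequence; the EM (forward-backward) training has already
# estimated the per-state mean and variance for us.
states = model.predict(samples.reshape(-1, 1))
event_state = np.argmax(model.means_.ravel())    # the state with the higher mean = event
is_event = states == event_state
```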
From your description, a correctly parameterized moving average might be sufficient.
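Something along these lines with NumPy (the window size and the threshold are placeholders you would have to tune):

```python
import numpy as np

def classify(values, window=20, threshold=0.5):
    """Moving average of the baseline-subtracted signal, then a simple threshold."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(values, kernel, mode="same")
    return smoothed > threshold          # True = EVENT, False = NO-EVENT
```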
- Try to understand the sensor and its output.
- Make a model and build a simulator that provides mock data covering the expected data, including noise.
- Record lots of real sensor data.
- Visualize the data and verify your assumptions and your model.
- Annotate your sensor data, i.e. generate ground truth (your simulator should do that for the mock data).
- From what you have learned so far, propose one or more algorithms.
- Build a test system that can verify your algorithms against the ground truth and run regressions against previous runs.
- Implement your proposed algorithms and run them against the ground truth.
- Try to understand the false positives and false negatives on the recorded data (and try to adapt your simulator to reproduce them).
- Adapt your algorithm(s).
Some other tips:
- You may implement hysteresis on thresholds to avoid bouncing (see the sketch after this list).
- You may implement delays to avoid bouncing.
- Beware of the delays introduced when implementing debouncers or low-pass filters.
- You may implement multiple algorithms and have them vote.
- To test relative improvements, you can run regression tests on large amounts of unannotated data; then you inspect only the detections that flipped to judge the performance increase/decrease.
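For instance, the hysteresis tip could look roughly like this (the high/low levels are placeholders):

```python
import numpy as np

def hysteresis_classify(values, high=0.6, low=0.4):
    """Enter EVENT only above `high`, leave it only below `low`, so values
    hovering around a single threshold do not make the output bounce."""
    state, out = False, np.zeros(len(values), dtype=bool)
    for i, v in enumerate(values):
        if state and v < low:
            state = False
        elif not state and v > high:
            state = True
        out[i] = state
    return out
```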

Odd correlated posterior traceplots in multilevel model

I'm trying out PyMC3 with a simple multilevel model. When using both fake and real data the traces of the random effect distributions move with each other (see plot below) and appear to be offsets of the same trace. Is this an expected artifact of NUTS or an indication of a problem with my model?
Here is a traceplot on real data:
Here is an IPython notebook of the model and the functions used to create the fake data. Here is the corresponding gist.
I would expect this to happen, in accordance with the group-mean distribution on alpha. If you think about it, when the group mean shifts around it will influence all the alphas to the same degree. You could confirm this by doing a scatter plot of the group-mean trace against some of the alphas. Hierarchical models are in general difficult for most samplers because of these complex interdependencies between the group mean, the group variance, and the individual RVs. See http://arxiv.org/abs/1312.0906 for more information on this.
In your specific case, the trace doesn't look too worrisome to me, especially after iteration 1000. So you could probably just discard those as burn-in and keep in mind that you have some sampling noise but probably got the right posterior overall. In addition, you might want to perform a posterior predictive check to see if the model can reproduce the patterns in your data you are interested in.
Alternatively, you could try to estimate a better hessian using pm.find_hessian(), e.g. https://github.com/pymc-devs/pymc/blob/3eb2237a8005286fee32776c304409ed9943cfb3/pymc/examples/hierarchical.py#L51
I also found this paper which looks interesting (haven't read it yet but might be cool to implement in PyMC3): arxiv-web3.library.cornell.edu/pdf/1406.3843v1.pdf
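For the scatter-plot check suggested above, something along these lines should work, assuming the variables in your model are called group_mean and alpha (adjust to your actual names):

```python
import matplotlib.pyplot as plt

def plot_group_mean_vs_alphas(trace, n_alphas=3):
    """Scatter the group-mean samples against a few individual alphas;
    a strong linear pattern confirms the interdependence described above."""
    fig, axes = plt.subplots(1, n_alphas, figsize=(4 * n_alphas, 4))
    for i, ax in enumerate(axes):
        ax.scatter(trace["group_mean"], trace["alpha"][:, i], s=2, alpha=0.3)
        ax.set_xlabel("group_mean")
        ax.set_ylabel("alpha[%d]" % i)
    plt.tight_layout()
    plt.show()

# plot_group_mean_vs_alphas(trace)   # trace as returned by pm.sample()
```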

Strategy for building best fit multiple regression model with time lagged variables

I am building a multiple regression model - wrapped in a function - with one dependent variable and a dozen independent variables. The reason why I am building a function is that I need to do this analysis with approximately 75 different datasets.
The challenge is that the independent variables correlate better with the dependent variable when they are lagged in time. Unfortunately, not all time lags are the same for each variable, and I would like to determine the combination of time lags per variable that gives the best adjusted R^2 for the multiple regression model. Moreover, after building an initial model I will try to reduce it using step(modelbase, direction="both") on the model.
In my current approach I lag all the independent variables by the same number of weeks. This gives the best possible model in which all independent variables share the same time lag, but I believe (with a valid hypothesis supporting this) that there is a better model out there when the time lag differs per independent variable. My question is: what is the best strategy to determine the best-fitting model without the number of options exploding? If I want to consider time lags between 0 and 20 weeks, in weekly steps, for 12 independent variables, I am quickly up to about 4.096e+15 lag combinations (= 20^12).
I can imagine reducing the problem with the following strategy: start by finding the best-fitting model with one independent variable tried at different time lags. The second step is to add a second independent variable and find the best model with the two variables, trying the second at different time lags while the first is kept fixed. Then add a third variable, again keeping the first two fixed and trying the third at different time lags, and so on. Something tells me this greedy strategy might be a decent approach, but also that there might be a better overall model that contains lags which are not individually optimal for each independent variable. A rough sketch of this strategy is shown below.
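To make the greedy strategy concrete, here is a rough Python sketch of what I have in mind (in R I would do the same with lm() in a loop; the column names and ordering of predictors are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def adjusted_r2(model, X, y):
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def greedy_lag_search(df, target, predictors, max_lag=20):
    """Pick one lag per predictor, greedily: each new variable's lag is chosen
    while the previously selected (variable, lag) pairs are held fixed."""
    chosen, best_score = {}, -np.inf
    for var in predictors:
        best_lag = 0
        for lag in range(max_lag + 1):
            trial = {**chosen, var: lag}
            # Build the lagged design matrix and drop rows lost to shifting.
            X = pd.concat({v: df[v].shift(l) for v, l in trial.items()}, axis=1)
            data = pd.concat([X, df[target]], axis=1).dropna()
            model = LinearRegression().fit(data[list(trial)], data[target])
            score = adjusted_r2(model, data[list(trial)], data[target])
            if score > best_score:
                best_lag, best_score = lag, score
        chosen[var] = best_lag
    return chosen, best_score
```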
Can anybody shine some light on how to tackle this challenge?
