How to organize Ray Tune Trainable class calculations in a k-fold CV setup? - cross-validation

It looks like the Ray Tune docs push you to write a Trainable class with a _train method that does incremental training and reports metrics as a dict, with some persistence of state through the _save/_restore methods.
TL;DR: there doesn't seem to be a documented way to run nested parallel runs based on a config and log the results. Does this pattern break the Tune flow?
If you want to use aggregated scores over different train/test pairs (with different models), is the intended pattern to map (in parallel) over the train/test pairs and models inside the _train method?
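For concreteness, this is roughly the pattern I have in mind: a minimal sketch assuming the function-based trainable API (tune.report, whose exact form differs across Ray versions) and scikit-learn's KFold. load_data() and the Ridge model are just placeholders.

```python
# Minimal sketch (assumes Ray Tune's function-based trainable API and scikit-learn;
# load_data() and the Ridge model are hypothetical placeholders).
import numpy as np
from ray import tune
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def load_data():
    # Placeholder: replace with the real feature matrix / target vector.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)
    return X, y


def cv_trainable(config):
    X, y = load_data()
    scores = []
    # The k folds are looped over *inside* one trial; Tune does not parallelize them.
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=config["alpha"])
        model.fit(X[train_idx], y[train_idx])
        scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    # Report only the aggregated score for this hyperparameter config.
    tune.report(mean_mse=float(np.mean(scores)), std_mse=float(np.std(scores)))


if __name__ == "__main__":
    tune.run(cv_trainable, config={"alpha": tune.grid_search([0.1, 1.0, 10.0])})
```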
If there is a better place for this question about Tune usage let me know.

Related

Time series database - interpolation and filtering

Does any existing time series database contain functionality for interpolation and digital filtering for outlier detection (Hampel, Savitzky-Golay)? Or at least provide interfaces to enable custom query result post-processing?
As far as I know InfluxDB does not offer you anything more than a simple linear interpolation and the basic set of InfluxQL functions.
It looks like everything more complex than that has to be done by hand in the programming language of your choice. Influx has a number of language clients.
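If you do end up post-processing query results client-side, here is a minimal Python sketch of the two filters mentioned in the question: SciPy's savgol_filter for Savitzky-Golay smoothing and a hand-rolled rolling-median Hampel filter (the window sizes and threshold are arbitrary example values).

```python
# Client-side post-processing sketch: Savitzky-Golay smoothing plus a simple
# Hampel outlier filter. Window sizes and the threshold are arbitrary examples.
import numpy as np
from scipy.signal import savgol_filter


def hampel(values, window=5, n_sigmas=3.0):
    """Return a boolean mask marking points that deviate from the rolling
    median by more than n_sigmas * (scaled MAD) within a +/- window."""
    values = np.asarray(values, dtype=float)
    outliers = np.zeros(len(values), dtype=bool)
    k = 1.4826  # scale factor: MAD -> standard deviation for Gaussian data
    for i in range(len(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        window_vals = values[lo:hi]
        med = np.median(window_vals)
        mad = k * np.median(np.abs(window_vals - med))
        if mad > 0 and abs(values[i] - med) > n_sigmas * mad:
            outliers[i] = True
    return outliers


# Example: 'series' would be values fetched from the time-series DB via its client library.
series = np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.05, 200)
series[50] += 3.0  # injected spike
mask = hampel(series)
smoothed = savgol_filter(series, window_length=11, polyorder=3)
print(f"{mask.sum()} outlier(s) flagged")
```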
There is an article on anomaly detection with Prometheus, but that looks like an attempt rather than a capability.
However, there is a tool called Kapacitor for InfluxDB. It is a quite powerful data-processing tool that lets you define user-defined functions (UDFs). Here is an article on how to implement custom anomaly detection with Kapacitor.
The Akumuli time-series database provides some basic outlier detection, for instance EWMA and SMA. You can issue a query that returns the difference between the predicted value (given by EWMA or SMA) and the actual value.

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. Each of the 10 fits will have different coefficient values, so which of the 10 should I pick as my final model or parameter estimates?
Or do we use cross-validation only to find an average error (the average over the 10 models in our case) and compare it against another model?
If you build a linear regression model and go for 10-fold cross-validation, each of the 10 fits will indeed have different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on one train/test split only, which could happen to be an unlucky or a lucky split. CV is more robust because it is very unlikely that all ten splits are lucky or all ten unlucky.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
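A minimal scikit-learn sketch of that workflow on synthetic data: CV is used only to estimate the error, and the final coefficients come from a single fit on all the training data.

```python
# Sketch: use CV for the error estimate, then fit the final model on all data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 10-fold CV: ten different fits, used only to estimate generalization error.
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="neg_mean_squared_error")
print("CV MSE: %.4f +/- %.4f" % (-scores.mean(), scores.std()))

# Final model: trained once on the whole training set; these are the coefficients you keep.
final_model = LinearRegression().fit(X, y)
print("Final coefficients:", final_model.coef_)
```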
Cross-validation is used to see how good your model's predictions are. It is a smart way of running multiple tests on the same data by splitting it, as you probably know (which is especially useful if you don't have much training data).
For example, it can be used to make sure you aren't overfitting. So basically you evaluate your finished model with cross-validation, and if you see that the error grows a lot somewhere, you go back and tweak the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data in different ways and validate its performance on each split. Note that the (hyper)parameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the parameter values that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, for different parameter combinations of the same model, the dataset split remains constant.
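In practice the two are usually combined: grid search proposes parameter candidates and cross-validation scores each one. A small scikit-learn sketch (synthetic data, arbitrary parameter grid):

```python
# Sketch: grid search over hyperparameters, with cross-validation used to
# score each candidate (a common way to combine the two ideas in scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cross-validation alone: hyperparameters fixed, splits vary.
print("CV accuracy with fixed C=1:", cross_val_score(SVC(C=1.0), X, y, cv=5).mean())

# Grid search: hyperparameters vary, each candidate scored on the same CV splits.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_, "best CV score:", search.best_score_)
```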
Read more about cross-validation here.
Cross-validation is mainly used for comparing different models.
For each model, you get the average generalization error over the k validation sets. You can then choose the model with the lowest average generalization error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to determine which method (SVM, Random Forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, several models are generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After identifying the best method (or best parameters), we can train/retrain our model on the training dataset.
Parameters or coefficients can be determined by grid-search techniques. See grid search.
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV will come to the rescue here. In this case we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the metric across folds give a sense of how well the model will perform in practice.
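A short scikit-learn sketch of that setup (the dataset and model choices here are arbitrary): hold out a test set, run k-fold CV on the remainder, and report the mean and standard deviation across folds.

```python
# Sketch: hold out a test set, run k-fold CV on the rest, report mean +/- std.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print("CV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
# The held-out test set is only touched once, at the very end.
final = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", final.score(X_test, y_test))
```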

WEKA PartitionMembership filter

I have a question regarding the supervised PartitionMembership filter in WEKA.
When applying this filter using J48 as partition generator, I am able to achieve a much higher accuracy in combination with the KStar classifier.
What exactly does this filter do? The documentation provided by WEKA is quite limited. And is it valid to use this filter to get increased accuracy?
When applying this filter to my training set, it generates a number of classes. When I try to reapply the model on my test set, the filter generates a different number of classes. Hence, I am not able to use this supervised PartitionMembership filter, trained on the training set, for my test set. How can I use the PartitionMembership filter that was trained on the training set for the test set as well?
You are asking two or three questions here. Regarding the first two (what does the PartitionMembership filter do, and how do I use it?), I don't know how to answer that properly; ultimately you can read the source code to find out.
For the latter question (roughly: how do I get it to evaluate my test set?), use the FilteredClassifier, and choose your filter and your classifier in that classifier's dialog box.
NAME: weka.classifiers.meta.FilteredClassifier
SYNOPSIS: Class for running an arbitrary classifier on data that has been passed through an arbitrary filter. Like the classifier, the structure of the filter is based exclusively on the training data, and test instances will be processed by the filter without changing their structure.

Is this a viable MapReduce use case or even possible to execute?

I am new to Hadoop, MapReduce and Big Data, and am trying to evaluate their viability for a specific use case that is extremely interesting to the project I am working on. I am not sure, however, whether what I would like to accomplish is A) possible or B) recommended with the MapReduce model.
We essentially have a significant volume of widgets (known structure of data) and pricing models (codified in JAR files) and what we want to be able to do is to execute every combination of widget and pricing model to determine the outcomes of the pricing across the permutations of the models. The pricing models themselves will examine each widget and determine pricing based on the decision tree within the model.
In my mind this makes sense from a parallel-processing-on-commodity-infrastructure perspective, but from a technical perspective I do not know whether it is possible to execute external models within the MR jobs, and from a practical perspective whether I am trying to force a use case onto the technology.
The question therefore becomes: is it possible; does it make sense to implement it in this fashion; and if not, what other options or patterns are better suited to this scenario?
EDIT
The volume and variety will grow over time. Assume for the sake of discussion that we currently have a terabyte of widgets and tens of pricing models. We would then expect to grow into multiple terabytes and hundreds of pricing models, and the execution of the permutations would happen frequently as widgets change or are added and as new categories of pricing models are introduced.
You certainly need a scalable, parallelizable solution, and Hadoop can be that. You just have to massage your solution a bit so it fits into the Hadoop world.
First, you'll need to make the models and widgets implement common interfaces (speaking very abstractly here) so that you can apply an arbitrary model to an arbitrary widget without knowing anything about the actual implementation or representation.
Second, you'll have to be able to reference both models and widgets by id. That lets you build objects (Writables) that hold the id of a model and the id of a widget and thus represent one "cell" in the cross product of widgets and models. You distribute these instances across multiple servers and, in doing so, distribute the application of models to widgets across multiple servers. These objects (call the class ModelApply) would hold the results of a specific model-to-widget application and can be processed in the usual way with Hadoop to report on the best applications.
Third, and this is the tricky part, you need to compute the actual cross product of models and widgets. You say the number of models (and therefore model ids) will be at most in the hundreds. This means you could load that list of ids into memory in a mapper and map it onto widget ids. Each call to the mapper's map() method would be passed a widget id and would write out one instance of ModelApply for each model.
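A rough Hadoop Streaming-style sketch of that mapper in Python (the model-id side file, the tab-separated record format, and the follow-on pricing step are all assumptions for illustration):

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming sketch: emit one (widget_id, model_id) "cell"
# of the cross product for every widget record read from stdin.
# Assumes a small side file "model_ids.txt" shipped with the job (e.g. via
# the distributed cache) containing one model id per line.
import sys


def load_model_ids(path="model_ids.txt"):
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def main():
    model_ids = load_model_ids()  # hundreds of ids fit comfortably in memory
    for line in sys.stdin:
        widget_id = line.split("\t", 1)[0].strip()
        if not widget_id:
            continue
        # One output record per model: a reducer (or a follow-on job) applies
        # the pricing model identified by model_id to the widget's data.
        for model_id in model_ids:
            print(f"{widget_id}\t{model_id}")


if __name__ == "__main__":
    main()
```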
I'll leave it at that for now.

How to use BDD to code complex data structures / data layers

I'm new to behavior-driven development and I can't find any examples or guidelines that parallel my current problem.
My current project involves a massive 3D grid with an arbitrary number of slots in each of the discrete cells. The entities stored in these slots have their own slots, so an arbitrary nesting of entities can exist. The final implementation of the object(s) will need to be backed by some kind of persistent data store, which complicates the API a bit (i.e. using words like load/store instead of get/set, and making sure that modifying returned items doesn't modify the corresponding items in the data store itself). Don't worry, my first implementation will simply exist in-memory, but the API is what I'm supposed to be defining behavior against, so the actual implementation doesn't matter right now.
The thing I'm stuck on is the fact that BDD literature focuses on the interactions between objects and how mock objects can help with that. That doesn't seem to apply at all here. My abstract data store's only real "behavior" involves loading and storing data from entities outside those represented by the programming language itself; I can't define or test those behaviors since they're implementation-dependent.
So what can I define/test? The natural alternative is state. Store something. Make sure it loads. Modify the thing I loaded and make sure after I reload it's unmodified. Etc. But I'm under the impression that this is a common pitfall for new BDD developers, so I'm wondering if there's a better way that avoids it.
If I do take the state-testing route, a couple other questions arise. Obviously I can test an empty grid first, then an empty entity at one location, but what next? Two entities in different locations? Two entities in the same location? Nested entities? How deep should I test the nesting? Do I test the Cartesian product of these non-exclusive cases, i.e. two entities in the same location AND with nested entities each? The list goes on forever and I wouldn't know where to stop.
The difference between TDD and BDD is about language. Specifically, BDD focuses on function/object/system behavior to improve design and test readability.
Often when we think about behavior we think in terms of object interaction and collaboration and therefore need mocks to unit test. However, there is nothing wrong with an object whose behavior is to modify the state of a grid, if that is appropriate. State or mock based testing can be used in TDD/BDD alike.
However, for testing complex data structures, you should use matchers (e.g. Hamcrest in Java) to test only the part of the state you are interested in. You should also consider whether you can decompose the complex data into objects that collaborate (but only if that makes sense from an algorithmic/design standpoint).
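As a concrete illustration of the state-based route, here is a small pytest-style sketch against a hypothetical in-memory implementation; InMemoryGrid, Entity and the store/load names are placeholders for the API under discussion, not a prescribed design.

```python
# Sketch: state-based specs for the grid API described above.
# InMemoryGrid / Entity are hypothetical stand-ins for the real abstraction.
import copy


class Entity:
    def __init__(self, name):
        self.name = name
        self.slots = {}


class InMemoryGrid:
    def __init__(self):
        self._cells = {}

    def store(self, position, slot, entity):
        # Keep a copy so callers cannot mutate stored state through their reference.
        self._cells.setdefault(position, {})[slot] = copy.deepcopy(entity)

    def load(self, position, slot):
        return copy.deepcopy(self._cells.get(position, {}).get(slot))


def test_empty_cell_loads_nothing():
    assert InMemoryGrid().load((0, 0, 0), slot=0) is None


def test_stored_entity_can_be_loaded():
    grid = InMemoryGrid()
    grid.store((1, 2, 3), slot=0, entity=Entity("crate"))
    assert grid.load((1, 2, 3), slot=0).name == "crate"


def test_modifying_a_loaded_entity_does_not_change_the_store():
    grid = InMemoryGrid()
    grid.store((1, 2, 3), slot=0, entity=Entity("crate"))
    loaded = grid.load((1, 2, 3), slot=0)
    loaded.name = "barrel"  # mutate the returned copy only
    assert grid.load((1, 2, 3), slot=0).name == "crate"
```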
