Odd correlated posterior traceplots in multilevel model - pymc

I'm trying out PyMC3 with a simple multilevel model. When using both fake and real data the traces of the random effect distributions move with each other (see plot below) and appear to be offsets of the same trace. Is this an expected artifact of NUTS or an indication of a problem with my model?
Here is a traceplot on real data:
Here is an IPython notebook of the model and the functions used to create the fake data. Here is the corresponding gist.

I would expect this to happen, in accordance with the group mean distribution on alpha. If you think about it, when the group mean shifts around it will influence all the alphas to the same degree. You could confirm this by doing a scatter plot of the group-mean trace against some of the alphas. Hierarchical models are in general difficult for most samplers because of these complex interdependencies between the group mean and variance and the individual RVs. See http://arxiv.org/abs/1312.0906 for more information on this.
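For instance (a quick sketch; `trace` is the result of pm.sample(), and `mu_alpha` / `alpha` stand for whatever you named the group mean and the individual intercepts in your model):

    import matplotlib.pyplot as plt

    # Scatter the group-mean samples against one of the individual alphas; a tight
    # linear band means the traces move together because the group mean is shifting.
    plt.scatter(trace['mu_alpha'], trace['alpha'][:, 0], alpha=0.3)
    plt.xlabel('group mean (mu_alpha)')
    plt.ylabel('alpha[0]')
    plt.show()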
In your specific case, the trace doesn't look too worrisome to me, especially after iteration 1000. So you could probably just discard those as burn-in and keep in mind that you have some sampling noise but probably got the right posterior overall. In addition, you might want to perform a posterior predictive check to see if the model can reproduce the patterns in your data you are interested in.
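A rough sketch of such a check (assuming your model context and trace are named `model` and `trace`, and 'y' is the name of your observed variable; depending on the PyMC3 version the function is pm.sample_ppc or pm.sample_posterior_predictive):

    import pymc3 as pm

    with model:
        ppc = pm.sample_ppc(trace, samples=500)   # simulated datasets drawn from the posterior

    # e.g. compare the distribution of simulated dataset means to the mean of your real data
    sim_means = ppc['y'].mean(axis=1)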
Alternatively, you could try to estimate a better hessian using pm.find_hessian(), e.g. https://github.com/pymc-devs/pymc/blob/3eb2237a8005286fee32776c304409ed9943cfb3/pymc/examples/hierarchical.py#L51
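Along the lines of that example (a sketch, not your notebook's code; in that era of PyMC3 the Hessian can be passed to NUTS as the scaling matrix):

    import pymc3 as pm

    with model:
        start = pm.find_MAP()                 # MAP estimate as a starting point
        hess = pm.find_hessian(start)         # Hessian of the log posterior at the MAP
        step = pm.NUTS(scaling=hess)          # scale the sampler with it
        trace = pm.sample(2000, step=step, start=start)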
I also found this paper which looks interesting (haven't read it yet but might be cool to implement in PyMC3): arxiv-web3.library.cornell.edu/pdf/1406.3843v1.pdf

Related

Differences in FP and FN rates between two algorithms

I am conducting binary classification using logistic regression, with and without applying PCA. Applying PCA before logistic regression gives higher accuracy and fewer FNs than logistic regression alone. I would like to find out why this is happening, specifically why PCA produces fewer FNs. I have read that cost-sensitivity analysis could help explain this, but I am not sure if this is correct. Any suggestions?
There is no need for fancy analysis to explain this behavior.
PCA is used here just to "clean" the data by keeping only its highest-variance components. Let me explain this concept with an example, and then I will come back to your question.
In general, in any ML problem, the available samples are never sufficient in number to cover all the possible variety of the sample space. You can never have a dataset with all possible human faces, all possible expressions, etc.
So, instead of using all the available features, you engineer the features (the pixels, in this example) in a way that gives you more meaningful, higher-level features. You can reduce the resolution of the pictures, as an easy example; you will lose the information in the picture backgrounds, but your model will focus better on the most important part of the picture, i.e. the faces.
When you deal with tabular data, a technique similar to lowering the resolution is cutting off part of the original features, and that's what PCA does: it keeps the most important components of the features, the "Principal Components", and drops the less important ones.
So, the model trained with PCA gives better results because, by cutting off part of the features, your model focuses better on the most important part of your samples and gains robustness against overfitting.
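By the way, you can reproduce this effect with a toy experiment (a sketch, not your pipeline; the number of retained components is something you would tune by cross-validation):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic data with many noisy features and only a few informative ones.
    X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

    plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    with_pca = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))

    print("plain   :", cross_val_score(plain, X, y, cv=5).mean())
    print("with PCA:", cross_val_score(with_pca, X, y, cv=5).mean())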
cheers

Which is a better input for an autoencoder, one with correlated features or one with uncorrelated features?

I am trying to visualise my data in 2D in order to detect fraud (outliers); all my features are likely to take bigger values in the case of fraud. But I was careful not to include redundant features.
For example, the features
Activity (a score that is higher for active users who use the service every day) and Money-earned both tend to take higher values in case of fraud, but one can't be deduced from the other.
I figured that choosing features in this way would translate to bigger coordinates in the 2D representation and would make fraudulent points distant from the rest of my data, i.e. stand out.
I also feel like having correlated features would make it easier for the autoencoder to reconstruct the data. But I have read many times that having correlated features isn't efficient in machine learning.
Should I make an effort to make my features less correlated? For example, replacing the Activity score (higher for active users) with the time between two uses (lower for active users)?
Or maybe this isn't important for the autoencoder?
You are right in your understanding that "having correlated features would make it easier for the autoencoder to reconstruct the data".
For example, if all your features were i.i.d. Gaussian noise, data compression would be very difficult for an autoencoder, since there would be no low-dimensional representation of the data to learn.
Please refer to this Stanford UFLDL Tutorial link for details.
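If you want to see this for yourself, here is a toy sketch of my own (a tiny linear autoencoder built from scikit-learn's MLPRegressor, nothing from the tutorial): the correlated data passes through a 1-unit bottleneck with much lower reconstruction error than the independent data.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    n = 2000

    # Correlated features: both driven by one latent factor (think "fraudiness").
    latent = rng.normal(size=(n, 1))
    X_corr = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)),
                        2 * latent + 0.1 * rng.normal(size=(n, 1))])

    # Uncorrelated features: independent noise, nothing to compress.
    X_indep = rng.normal(size=(n, 2))

    def reconstruction_error(X):
        # 2 -> 1 -> 2 linear autoencoder: the network is trained to reproduce its input.
        ae = MLPRegressor(hidden_layer_sizes=(1,), activation='identity',
                          solver='lbfgs', max_iter=2000, random_state=0)
        ae.fit(X, X)
        return np.mean((ae.predict(X) - X) ** 2)

    print("correlated :", reconstruction_error(X_corr))
    print("independent:", reconstruction_error(X_indep))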

Transforming crime rates

I am working on a project using campus crime rates as the independent variable. The data is highly positively skewed. I need to transform the data in order to achieve normal distribution to run OLS. However, I know that if I do a log transformation I will lose all instances where the crime rates are 0 (representing an absence of crime). What are other possible solutions?
While you could avoid the loss of cases by calculating something like log(1 + rate), the nonnegativity bound is likely to cause trouble anyway. You might consider using a generalized linear model (Analyze > Generalized Linear Models) with a gamma response and a log link. This can deal with the right-skew issue as well.
Note, though, that it is the error that carries the normality assumption in OLS regression, not the dependent variable.
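In Python, that suggestion would look roughly like this (a statsmodels sketch with made-up data and predictors standing in for your campus variables; note the gamma family needs strictly positive responses, so exact zeros still require a decision):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Fake, right-skewed "crime rate" data with two hypothetical predictors.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "enrollment": rng.uniform(1, 40, size=200),   # thousands of students
        "urban": rng.integers(0, 2, size=200),        # campus in a city?
    })
    mu = np.exp(0.5 + 0.03 * df["enrollment"] + 0.4 * df["urban"])
    df["crime_rate"] = rng.gamma(shape=2.0, scale=mu / 2.0)

    model = smf.glm(
        "crime_rate ~ enrollment + urban",
        data=df,
        family=sm.families.Gamma(link=sm.families.links.Log()),  # older statsmodels: links.log()
    )
    print(model.fit().summary())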

Binary classification of sensor data

My problem is the following: I need to classify a data stream coming from a sensor. I have managed to get a baseline using the median of a window, and I subtract the values from that baseline (I want to avoid negative peaks, so I only use the absolute value of the difference).
Now I need to distinguish an event (= something triggered the sensor) from the noise near the baseline:
The problem is that I don't know which method to use.
There are several approaches I have thought of:
sum up the values in a window; if the sum is above a threshold, the class should be EVENT ('integrate and dump')
sum up the differences between consecutive values in a window and take the mean (which gives something like the first derivative); if the value is positive and above a threshold, set the class to EVENT, otherwise to NO-EVENT
a combination of both
(unfortunately these approaches have the drawback that I need to guess the threshold values and set the window size)
use an SVM that learns from manually classified data (but I don't know how to set up this algorithm properly: which features should I look at, like the median/mean of a window? the integral? the first derivative? ...)
What would you suggest? Are there better/simpler methods to get this task done?
I know there exist a lot of sophisticated algorithms, but I'm confused about what could be the best way - please have a little patience with a newbie who has no machine learning/DSP background :)
Thank you a lot and best regards.
The key to evaluating your heuristic is to develop a model of the behaviour of the system.
For example, what is the model of the physical process you are monitoring? Do you expect your samples, for example, to be correlated in time?
What is the model for the sensor output? Can it be modelled as, for example, a discretized linear function of the voltage? Is there a noise component? Is the magnitude of the noise known or unknown but constant?
Once you've listed your knowledge of the system that you're monitoring, you can then use that to evaluate and decide upon a good classification system. You may then also get an estimate of its accuracy, which is useful for consumers of the output of your classifier.
Edit:
Given the more detailed description, I'd suggest trying some simple models of behaviour that can be tackled using classical techniques before moving to a generic supervised learning heuristic.
For example, suppose:
The baseline, event threshold and noise magnitude are all known a priori.
The underlying process can be modelled as a Markov chain: it has two states (off and on) and the transition times between them are exponentially distributed.
You could then use a hidden Markov Model approach to determine the most likely underlying state at any given time. Even when the noise parameters and thresholds are unknown, you can use the HMM forward-backward training method to train the parameters (e.g. mean, variance of a Gaussian) associated with the output for each state.
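For instance, a minimal sketch with the hmmlearn package (my choice of tool, not something prescribed above): a 2-state Gaussian HMM fitted to a fake baseline-subtracted signal, where the learned state with the higher mean plays the role of EVENT.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    # Fake rectified signal: noise near the baseline with one event in the middle.
    rng = np.random.default_rng(0)
    signal = np.concatenate([rng.normal(0.0, 0.2, 300),
                             rng.normal(2.0, 0.3, 100),
                             rng.normal(0.0, 0.2, 300)])
    X = signal.reshape(-1, 1)

    hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
    hmm.fit(X)                      # Baum-Welch training (uses forward-backward internally)
    states = hmm.predict(X)         # most likely state sequence; predict_proba() gives
                                    # per-sample posteriors if you prefer those

    event_state = int(np.argmax(hmm.means_.ravel()))   # state with the higher mean = EVENT
    is_event = states == event_state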
If you know even more about the events, you can get by with simpler approaches: for example, if you knew that the event signal always reached a level above the baseline + noise, and that events were always separated in time by an interval larger than the width of the event itself, you could just do a simple threshold test.
Edit:
The classic intro to HMMs is Rabiner's tutorial (a copy can be found here). Relevant also are these errata.
From your description, a correctly parameterized moving average might be sufficient.
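Something in this direction (a sketch with pandas and invented numbers; the window sizes and the threshold are exactly the parameters you would have to tune against labelled data):

    import numpy as np
    import pandas as pd

    # Fake raw sensor stream: a flat baseline with one event bump.
    rng = np.random.default_rng(1)
    raw = np.concatenate([rng.normal(10.0, 0.2, 300),
                          rng.normal(12.0, 0.3, 100),
                          rng.normal(10.0, 0.2, 300)])
    s = pd.Series(raw)

    baseline = s.rolling(window=101, center=True, min_periods=1).median()
    rectified = (s - baseline).abs()                                  # as in the question
    smoothed = rectified.rolling(window=25, center=True, min_periods=1).mean()
    is_event = smoothed > 0.8                                         # threshold: a guess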
Try to understand the sensor and its output. Make a model and build a simulator that provides mock data covering the expected data, with noise and all that stuff.
Get lots of real sensor data recorded.
Visualize the data and verify your assumptions and model.
Annotate your sensor data, i.e. generate ground truth (your simulator should do that for the mock data).
From what you have learned so far, propose one or more algorithms.
Build a test system that can verify your algorithms against the ground truth and run regressions against previous runs.
Implement your proposed algorithms and run them against the ground truth.
Try to understand the false positives and false negatives on the recorded data (and try to adapt your simulator to reproduce them).
Adapt your algorithm(s).
Some other tips:
You may implement hysteresis on thresholds to avoid bouncing (see the sketch after this list).
You may implement delays to avoid bouncing.
Beware of delays if implementing debouncers or low-pass filters.
You may implement multiple algorithms and voting.
For testing relative improvements you may run regression tests on large amounts of unannotated data; then you check only the detections that flipped to find performance increases/decreases.
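A sketch of the hysteresis idea (the thresholds are made up): the detector switches ON above a high threshold and only switches OFF again below a lower one, so it does not bounce while the signal hovers around a single threshold.

    def hysteresis_detect(values, on_threshold=1.0, off_threshold=0.5):
        """Return a list of booleans, True while an event is considered active."""
        active = False
        flags = []
        for v in values:
            if not active and v > on_threshold:
                active = True
            elif active and v < off_threshold:
                active = False
            flags.append(active)
        return flags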

How to get scientific results from non-experimental data (datamining?)

I want to obtain maximum performance out of a process with many variables, many of which cannot be controlled.
I cannot run thousands of experiments, so it'd be nice if I could run hundreds of experiments and
vary many controllable parameters
collect data on many parameters indicating performance
'correct,' as much as possible, for those parameters I couldn't control
Tease out the 'best' values for those things I can control, and start all over again
It feels like this would be called data mining, where you're going through tons of data which doesn't immediately appear to relate, but does show correlation after some effort.
So... Where do I start looking at algorithms, concepts, theory of this sort of thing? Even related terms for purposes of search would be useful.
Background: I like to do ultra-marathon cycling, and keep logs of each ride. I'd like to keep more data, and after hundreds of rides be able to pull out information about how I perform.
However, everything varies - routes, environment (temp, pres., hum., sun load, wind, precip., etc), fuel, attitude, weight, water load, etc, etc, etc. I can control a few things, but running the same route 20 times to test out a new fuel regime would just be depressing, and take years to perform all the experiments that I'd like to do. I can, however, record all these things and more (telemetry on the bicycle FTW).
It sounds like you want to do some regression analysis. You certainly have plenty of data!
Regression analysis is an extremely common modeling technique in statistics and science. (It could be argued that statistics is the art and science of regression analysis.) There are many statistics packages out there to do the computation you'll need. (I'd recommend one, but I'm years out of date.)
Data mining has gotten a bad name because far too often people assume correlation equals causation. I have found that a good technique is to start with variables you know have an influence and build a statistical model around them first. So you know that wind, weight and climb have an influence on how fast you can travel, and statistical software can take your dataset and calculate what the relationship between those factors and your speed is. That will give you a statistical model or linear equation:
speed = x*weight + y*wind + z*climb + constant
When you explore new variables, you will be able to see if the model is improved or not by comparing a goodness of fit metric like R-squared. So you might check if temperature or time of day adds anything to the model.
You may want to apply a transformation to your data. For instance, you might find that you perform better on colder days. But really cold days and really hot days might hurt performance. In that case, you could assign temperatures to bins or segments: < 0°C; 0°C to 40°C; > 40°C, or some such. The key is to transform the data in a way that matches a rational model of what is going on in the real world, not just the data itself.
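A sketch of both ideas together (made-up column names and fake data standing in for your ride log, not a recommendation of these exact predictors):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Fake ride log.
    rng = np.random.default_rng(0)
    n = 300
    rides = pd.DataFrame({
        "weight": rng.normal(85, 5, n),          # kg: rider + bike + gear
        "wind": rng.normal(0, 10, n),            # km/h: headwind (+) / tailwind (-)
        "climb": rng.uniform(0, 3000, n),        # metres of climbing
        "temperature": rng.uniform(-5, 45, n),   # degrees C
    })
    rides["speed"] = (30 - 0.05 * rides["weight"] - 0.1 * rides["wind"]
                      - 0.002 * rides["climb"] + rng.normal(0, 1, n))

    # Base model: the factors you already know matter.
    base = smf.ols("speed ~ weight + wind + climb", data=rides).fit()

    # Candidate variable, binned so that "too cold" and "too hot" can both hurt.
    rides["temp_band"] = pd.cut(rides["temperature"], bins=[-40, 0, 40, 60],
                                labels=["cold", "mild", "hot"])
    extended = smf.ols("speed ~ weight + wind + climb + C(temp_band)", data=rides).fit()

    # Adjusted R-squared is the fairer comparison: plain R-squared never
    # decreases when you add a term.
    print(base.rsquared_adj, extended.rsquared_adj)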
In case someone thinks this is not a programming related topic, notice that you can use these same techniques to analyze system performance.
With that many variables you have too many dimensions and you may want to look at Principal Component Analysis. It takes some of the "art" out of regression analysis and lets the data speak for itself. Some software to do that sort of analysis is shown at the bottom of the link.
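A quick sketch of that with scikit-learn (fake correlated data standing in for the ride log): look at the cumulative explained variance to decide how many components to keep.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Fake data: 15 logged variables driven by only 3 underlying factors.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(300, 3))
    X = latent @ rng.normal(size=(3, 15)) + 0.1 * rng.normal(size=(300, 15))

    pca = PCA().fit(StandardScaler().fit_transform(X))
    print(pca.explained_variance_ratio_.cumsum())   # most of the variance sits in a few components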
I have used the Perl module Statistics::Regression for somewhat similar problems in the past. Be warned, however, that regression analysis is definitely an art. As the warning in the Perl module says, it won't make sense to you if you haven't learned the appropriate math.

Resources