I was assigned a project to do anomaly detection on our company's KPIs. I googled and found AnomalyDetection by Twitter. A colleague suggested doing the anomaly detection on images of the graphs (comparing against the previous week's images to identify anomalous points) instead of on the raw time-series data.
I am not familiar with anomaly detection. Can anyone with experience advise which approach is better (anomaly detection from data or from images) in terms of:
1. Accuracy
2. Storage
3. Processing

Advantages of working on images:
Data-agnostic. In theory it can be run on anything from which one can produce an image or visualization.
Image models are relatively well understood.
Pretrained models are available.
Disadvantages of working on images:
Requires much more data to learn a useful model. The image pixel space is much more complicated than the time series it represents, probably by at least 100x.
Requires much more compute power, both at training time and at prediction time. Probably at least 100x.
Requires much more storage for datasets. Probably at least 100x.
Sensitive to changes in visualization. A change in tick marks or font, for example, would register as an anomaly. Even a change in image compression may have an impact if not controlled for.
Loses explainability. It may be hard to know why a certain image is anomalous, even for simple cases like a mean shift.
Needs a much more complex model setup and infrastructure.
For an application like Anomaly Detection on Time Series on metrics, I would not recommend doing it. I am not even sure I have seen it studied.
I think it is unlikely that a high performing Anomaly Detection system for metrics can be built effectively with image processing on graphs.
Anomalies are typically quite rare, which makes this a "low data" scenario. At the same time, many anomalies are quite simple and can be detected with simple methods; even something as basic as well-chosen thresholds can go a long way. Using image processing does not help with either of these challenges; in fact it is worse in most regards.
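To make the point about well-chosen thresholds concrete, here is a minimal sketch of a threshold detector that works directly on the raw series. The function name `rolling_zscore_anomalies`, the window of 7 points, the threshold of 3 standard deviations and the KPI values are all illustrative assumptions, not part of any particular library or of the asker's data.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=7, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window.

    Hypothetical helper: each point is compared with the mean and standard
    deviation of the `window` points that come before it.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily KPI values with one obvious spike at index 10.
kpi = [100, 102, 99, 101, 100, 98, 103, 101, 100, 99, 160, 101, 100]
print(rolling_zscore_anomalies(kpi))
```

A detector like this stores only the numbers themselves and can report which point crossed the threshold and by how much, which is the explainability that the image approach gives up.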


Differences in FP and FN rates between two algorithms

I am conducting binary classification using logistic regression with and without applying PCA. Applying PCA before logistic regression gives a higher accuracy and fewer FNs than logistic regression alone. I would like to find out why this is happening, specifically why PCA produces fewer FNs. I have read that cost sensitivity analysis could help explain this, but I am not sure if this is correct. Any suggestions?
There is no need for fancy analysis to explain this behavior.
PCA is used here simply to "clean" the data by discarding its low-variance directions. Let me explain this concept with an example, and then I will return to your question.
In general, in any ML problem, the available samples are never numerous enough to cover the full variety of the sample space. Take face images: you can never have a dataset with all possible human faces, with all possible expressions, etc.
So, instead of using all the available features, you engineer the features (the pixels, in this example) so that you get more meaningful, higher-level features. As an easy example, you can reduce the resolution of the pictures; you will lose information about the picture backgrounds, but your model will focus better on the most important part of the picture, i.e. the faces.
When you deal with tabular data, a technique similar to lowering the resolution is cutting off parts of the original features, and that is what PCA does: it keeps the most important components of the features, the "Principal Components", and drops the less important ones.
So, the model trained with PCA gives better results because, with part of the features cut off, it focuses better on the most important part of your samples and so gains robustness against overfitting.
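As a rough illustration of "keeping the most important components", the sketch below projects data onto its top principal components using a plain SVD. It assumes NumPy is available; the helper name `pca_reduce` and the toy data are made up for the example, and this is not the pipeline used in the question.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))
# Toy data: three features that mostly repeat one underlying signal, plus small noise.
X = np.hstack([signal, 2 * signal, -signal]) + 0.05 * rng.normal(size=(200, 3))

X_reduced, explained = pca_reduce(X, n_components=1)
print(X_reduced.shape, explained)
```

In this toy case a single component carries nearly all the variance, so a classifier trained on `X_reduced` sees the shared signal and very little of the noise. Whether the same holds for a real dataset depends on how much of the class information lies in the components that are dropped.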

Keras «Powerful image classification with little data»: disparity between training and validation

I followed this post and first made it work on the dataset «Cats vs dogs». Then I substituted this set with my own images, which show the presence of an object vs the absence of that object. My dataset is even smaller than the one in the post. I only have 496 images containing that object for training and 160 images with that object for validation. For the «absent» class I have numerous samples (without that object in an image).
So far I have not tried class_weight to tackle the imbalanced data problem. I just randomly chose 496 and 160 images without that object for training and validation, respectively. Basically, I am doing two-class image classification with a smaller dataset using the techniques in this post, so I expected worse performance because of the insufficient data. But the actual problem is that the performance does not converge, as shown in the figures.
Could you tell me possible reasons for the lack of convergence? I guess the problem is related to my dataset, as the model works perfectly for «cats vs dogs», but I don't know how to address it. Are there any good techniques to make it converge?
Thank you.
This performance plot is based on VGG16, keeping all layers up to fully connected layer and training a small fully connected layer with 256 neurons.
This performance plot is also based on VGG16, but using 128 neurons instead of 256 neurons. Also I set epochs to 80.
Based on the suggestions provided so far, I'm thinking of building a customized convnet model to fight the overfitting problem. But how do I do this? One of my worries is that a model with fewer layers will degrade the training performance. Are there any guidelines for designing a good model for little data? Thank you.
Now I think I know half of the reason for the convergence problem. I actually only have 100+ images of my own; the rest were downloaded from Flickr. I thought those images, with centered objects and better quality, would help the model. But later on I found that they do not contribute to the accuracy and even worsen the output class probabilities. After removing the downloaded images, the performance went up a little and the convergence problem was gone. Note that I only use 64*2 images for training and 48*2 images for testing. I also found that image augmentation does not improve the performance on my dataset. Without image augmentation, the training accuracy can reach 1, but with some image augmentation it is only around 85%. Has anybody had a similar experience? Why doesn't data augmentation always work? Is it because of our specific dataset? Thank you very much.
Your model is working great, but it's "overfitting". It means it's capable of memorizing all your training data without really "thinking". That leads to great training results and bad test results.
Common ways to avoid overfitting are:
More data - If you have little data, the chance of overfitting increases
Fewer units/layers - Make the model less capable, so it will stop memorizing and start thinking.
Add "dropouts" to your layers (something that randomly discards part of the results to prevent the model from being too powerful)
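To show what "randomly discards part of the results" means in practice, here is a minimal sketch of dropout written in plain Python. It uses the common "inverted dropout" variant, which is an assumption on my part rather than something stated in the answer; the function name and its parameters are illustrative and are not the API of Keras or any other library.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout on a list of activations (illustrative, not a library API).

    During training each value is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    At prediction time the values pass through untouched.
    """
    if not training or rate == 0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [0.5, 1.0, 1.5, 2.0]
print(dropout(layer_output, rate=0.5, rng=rng))         # each entry is either 0.0 or doubled
print(dropout(layer_output, rate=0.5, training=False))  # unchanged at prediction time
```

Because a different random subset is dropped on every training step, the layer cannot rely on any single unit, which makes memorizing individual training images harder.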
Do more layers mean more power and performance?
If by performance you mean capability of learning, yes. (If you mean "speed", no)
Yes, more layers mean more power. But too much power leads to overfitting: the model is so capable that it can memorize training data.
So there is an optimal point:
A model that is not very capable will not give you the proper results (both training and test results will be bad)
A model that is too capable will memorize the training data (excellent training results, but bad test results)
A balanced model will learn the right things (good training and test results)
That's exactly why we use test data: it is data that is not presented during training, so the model cannot learn from it.
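The three cases above can be turned into a small check on the training and test accuracies. This is only a sketch: the function name `diagnose_fit` and the thresholds `good` and `max_gap` are arbitrary values picked for illustration, not standard ones.

```python
def diagnose_fit(train_acc, test_acc, good=0.85, max_gap=0.10):
    """Label a model using the three cases described above.

    `good` and `max_gap` are arbitrary example thresholds, not standard values.
    """
    if train_acc < good:
        return "not capable enough: training results are already poor"
    if train_acc - test_acc > max_gap:
        return "too capable: memorizing the training data"
    return "balanced: good training and test results"


print(diagnose_fit(0.62, 0.60))
print(diagnose_fit(1.00, 0.70))
print(diagnose_fit(0.93, 0.90))
```

A training accuracy near 1 combined with a much lower test accuracy, as in the second call, is the pattern the answer calls overfitting.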

number of layers in convolution neural network

I am a beginner in convolutional networks. I use DIGITS to implement them and have a few doubts.
When trying out a basic image classification problem, how do we decide on the number of layers, i.e. how many conv layers, fully connected layers, etc.?
In DIGITS we have 3 standard papers implemented. For a particular dataset, is there any way to find out which architecture to use, or when we should use our own architecture?
How can the hidden layers be helpful in solving the problem, i.e. what decisions can we take by looking at the results in the hidden layers?
There has never been a clear rule for deciding how many layers or neurons are needed, or which architecture is best for a neural network. The usual procedure is to pick some parameters, measure the performance on the training set and the test set to check that the model neither underfits (high bias) nor overfits the data, and keep the best parameters; alternatively, search for them with another algorithm, such as a genetic algorithm.
In conclusion, either you start from scratch every time and measure the network's performance, or you use approaches that do not need to start from scratch and can build incrementally, by applying transfer learning and fine-tuning to an existing network architecture.
The core philosophy that makes deep learning so democratic and amazing is simple: "Don't be a hero".
What it means is that in most cases the best deep learning models take millions of data points and weeks to train, something most of us cannot achieve with our low-performance PCs (yes, a single-GPU system is low performance). So why would you want to waste your time building and training NN architectures from scratch? Simple: you don't.
Transfer learning is your solution! Try to find models that were trained on data similar to your problem and use their pre-trained weights to fine-tune on your data set. By doing this you get not only an already proven NN architecture but also a major head start in training.
The best place to find pre-trained models is the Caffe Model Zoo, so go have a look at it.

Artificial neural network image transformation

I have pairs of images (input-output), but I don't know the transformation that goes from A (input) to B (output). I want to record image A and get image B. Physically I can change the setup to get A or B, but I want to do it in software.
If I understood correctly, a trained artificial neural network is able to do that: given an input, it can produce the corresponding output. Is that right?
Is there any software/ANN that, just by being "trained" on a number of input-output pairs, will be able to provide the correct output when the input is a new image (but similar to the others)?
If you have a reasonable number of image pairs (input/output) and you don't know the transformation between input and output, you can train an ANN on that training set to imitate the unknown transformation. You will only be able to train your ANN well if you have a sufficient number of training image pairs, and it may be practically impossible when the unknown transformation is complicated.
For example, if the transformation simply increases the intensity values of the input image's pixels by a given value, an ANN will learn to imitate that behavior very quickly. But if the unknown transformation is a complicated convolution, a series of convolutions, or something even more complicated, it will be very hard, nearly impossible, to train an ANN to imitate it. So a more complex transformation needs a bigger training set and a more complex ANN design.
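The simple case described here, a constant increase in pixel intensity, can be sketched with the smallest possible "network": a single linear unit trained by gradient descent. Everything in the block is an illustrative assumption, including the function name `fit_pixel_map`, the learning rate, the number of epochs, the made-up pixel values and the scaling of intensities to [0, 1].

```python
def fit_pixel_map(inputs, outputs, lr=0.5, epochs=2000):
    """Fit output = w * input + b per pixel with plain gradient descent.

    A single linear unit; pixel intensities are assumed to be scaled to [0, 1].
    """
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(inputs, outputs):
            error = (w * x + b) - y
            # Gradient of the mean squared error with respect to w and b.
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up training pairs: the "unknown" transformation brightens each pixel by 0.2.
pixels_a = [i / 10 for i in range(11)]
pixels_b = [p + 0.2 for p in pixels_a]

w, b = fit_pixel_map(pixels_a, pixels_b)
print(round(w, 3), round(b, 3))
```

The learned weight and bias recover the brightening from a handful of pixel pairs. A transformation built from convolutions would need a model with spatial structure and far more training pairs, which is the difficulty described above.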
There are plenty of free open-source ANN libraries implemented in many languages. You could start, for example, with this tutorial: http://www.codeproject.com/Articles/13091/Artificial-Neural-Networks-made-easy-with-the-FANN
What you are asking is possible in principle -- in theory, an ANN with sufficiently many hidden units can learn an arbitrary function to map inputs to outputs. However, as the comments and other answers have mentioned, there may be many technical issues with your particular problem that could make it impractical. I would classify these problems as (a) mapping complexity, (b) model complexity, (c) scaling complexity, and (d) implementation complexity. They are all somewhat related, but hopefully this is a useful way to break things down.
Mapping complexity
As mentioned by Springfield762, there are many possible functions that map from one image to another image. If the relationship between your input images and your output images is relatively simple -- like increasing the intensity of each pixel by a constant amount -- then an ANN would be able to learn this mapping without much difficulty. There are probably many more transformations that would be similarly easy to learn, such as skewing, flipping, rotating, or translating an image -- basically any affine transformation would be easy to learn. Other, nonlinear transformations could also be feasible, such as squaring the intensity of each pixel.
As a general rule, the more complicated the relationship between your input and output images, the more difficult it will be to get a model to learn this mapping for you.
Model complexity
The more complex the mapping from inputs to outputs, the more complex your ANN model will need to be to capture it. Models with many hidden layers have been shown in the past 10 years to perform quite well on tasks that people had previously thought impossible, but often these state-of-the-art models have millions or even billions of parameters and take weeks to train on GPU hardware. A simple model can capture many simple mappings, but if you have a complex input-output map to learn, you'll need a large, complex model.
Scaling complexity
Yves mentioned in the comments that it can be difficult to scale models up to typical image sizes. If your images are relatively small (currently the state of the art is to model images on the order of 100x100 pixels), then you can probably just throw a bunch of raw pixel data at an ANN model and see what happens. But if you're using 6000x4000 images from your shiny Nikon DSLR, it's going to be quite difficult to process those in a reasonable amount of time. You'd be better off compressing your image data somehow (PCA is a common technique) and then trying to learn the mapping in the compressed space.
In addition, larger images will have a larger space of possible mappings between them, so you'll need more of your larger images as training data than you would if you had small images.
Springfield762 also mentioned this: If the mapping between your input and output images is simple, then you'll only need a few examples to learn the mapping successfully. But if you have a complicated mapping, then you'll need much more training data to have a chance at learning the mapping properly.
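A quick back-of-the-envelope calculation shows why the image sizes mentioned under scaling complexity matter so much. The sketch counts the parameters of one fully connected layer fed with raw grayscale pixels; the helper name `dense_layer_weights` and the choice of 256 hidden units are assumptions made only for the comparison.

```python
def dense_layer_weights(width, height, hidden_units, channels=1):
    """Number of weights and biases in one fully connected layer fed raw pixels."""
    inputs = width * height * channels
    return inputs * hidden_units + hidden_units


# 256 hidden units is an arbitrary choice for the comparison.
small = dense_layer_weights(100, 100, hidden_units=256)
large = dense_layer_weights(6000, 4000, hidden_units=256)
print(f"{small:,} parameters for 100x100")
print(f"{large:,} parameters for 6000x4000")
```

The first layer alone grows from about 2.6 million parameters to about 6.1 billion, which is why compressing or downscaling the images first is suggested.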
Implementation complexity
It's unlikely that a tool already exists that would let you just throw image data into an ANN model and have a mapping appear. Most likely you'll need, at a minimum, to implement some code that will pre-process your image data. In addition, if you have lots of large images you'll probably need to write code to handle loading data from disk, etc. (There are a lot of "big data" tools for things like this, but they all require some amount of work to get set up.)
There are many, many open source ANN toolkits out there nowadays. FANN (already mentioned) is a popular one in C++ with bindings in other languages. Caffe is quite popular, and is also implemented in C++ with bindings. There seem to be many toolkits that use Python and Theano or some other GPU acceleration library -- Keras, Lasagne, Hebel, Pylearn2, neon, and Theanets (I wrote this one). Many people use Torch, written in Lua. Matlab has at least one neural network toolbox. I'm less familiar with other ecosystems, but Java seems to have Deeplearning4j, C# has Accord, and even R has darch.
But with any of these neural network toolkits, you're going to have to write some code to load the data, process it into the appropriate input format, construct (or load) a network model, train the model, etc.
The problem you're trying to solve is a canonical classification problem that neural networks can help you solve. You treat the B images as a set of labels that you match to A, and once trained, the neural network will be able to match the B images to new input based on where the network locates new input in a high-dimensional vector space. I assume you'd use some combination of convolutional networks to create your features, and softmax for multinomial classification on the output layer. More here: http://deeplearning4j.org/convolutionalnets.html
Since this was written, there has been a lot of work on cGANs (conditional generative adversarial networks); please refer to:

Artificial Neural Network Question

Generally speaking what do you get out of extending an artificial neural net by adding more nodes to a hidden layer or more hidden layers?
Does it allow for more precision in the mapping, or does it allow for more subtlety in the relationships it can identify, or something else?
There's a very well known result in machine learning that states that a single hidden layer is enough to approximate any smooth, bounded function (the paper was called "Multilayer feedforward networks are universal approximators" and it's now almost 20 years old). There are several things to note, however.
The single hidden layer may need to be arbitrarily wide.
This says nothing about the ease with which an approximation may be found; in general, large networks are hard to train properly and fall victim to overfitting quite frequently (the exception is so-called "convolutional neural networks", which really are only meant for vision problems).
This also says nothing about the efficiency of the representation. Some functions require an exponential number of hidden units if done with one layer but scale much more nicely with more layers (for more discussion of this, read Scaling Learning Algorithms Towards AI).
The problem with deep neural networks is that they're even harder to train. You end up with very very small gradients being backpropagated to the earlier hidden layers and the learning not really going anywhere, especially if weights are initialized to be small (if you initialize them to be of larger magnitude you frequently get stuck in bad local minima). There are some techniques for "pre-training" like the ones discussed in this Google tech talk by Geoff Hinton which attempt to get around this.
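The shrinking gradients can be illustrated with a little arithmetic. The sketch assumes sigmoid activations and simplifies the network to a chain with one unit per layer, so each layer multiplies the backpropagated gradient by the weight times the sigmoid derivative. The function names and the weight of 0.5 are illustrative choices, not taken from the answer.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(depth, weight, pre_activation=0.0):
    """Factor by which a gradient shrinks across `depth` sigmoid layers.

    Simplified to a chain with one unit per layer: each layer multiplies the
    backpropagated gradient by weight * sigmoid'(pre_activation).
    """
    return (weight * sigmoid_derivative(pre_activation)) ** depth


for depth in (1, 5, 10):
    print(depth, gradient_scale(depth, weight=0.5))
```

With a small weight, each layer scales the gradient by 0.125 at best in this setup, so after ten layers almost nothing reaches the earliest layer. That is the effect the pre-training techniques mentioned above try to work around.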
This is a very interesting question, but it's not so easy to answer. It depends on the problem you are trying to solve and on the neural network you are trying to use. There are several types of neural networks.
In general, it's not clear that more nodes means more precision. Research shows that you mostly need only one hidden layer. The number of nodes should be the minimal number required to solve the problem. If you don't have enough of them, you will not reach a solution.
On the other hand, once you have reached a number of nodes that is good enough to solve the problem, you can add more and more of them and you will not see any further progress in the estimated result.
That's why there are so many types of neural networks. They try to solve different types of problems. So you have NNs for static problems, for time-related problems and so on. The number of nodes is not as important as their design.
When you have a hidden layer, you are creating a combined feature of the input. So, is the problem better tackled by more features of the existing input, or through higher-order features that come from combining existing features? This is the trade-off for a standard feed-forward network.
You have a theoretical reassurance that any function can be represented by a neural network with two hidden layers and non-linear activation.
Also, consider using additional resources for boosting, instead of adding more nodes, if you're not certain of the appropriate topology.
Very rough rules of thumb
Generally, more elements per layer for bigger input vectors.
More layers may let you model more non-linear systems.
If the kind of network you are using has delays in propagation, more layers may allow modelling of time series. Take care to have time jitter in the delays or it won't work very well. If this is just gobbledegook to you, ignore it.
More layers let you insert recurrent features. This can be very useful for discrimination tasks. Your ANN implementation may not permit this.
The number of units per hidden layer accounts for the ANN's potential to describe an arbitrarily complex function. Some (complicated) functions may require many hidden nodes, or possibly more than one hidden layer.
When a function can be roughly approximated by a certain number of hidden units, any extra nodes will provide more accuracy... but this is only true if the training samples are sufficient to justify the addition. Otherwise what will happen is "overconvergence". Overconvergence means that your ANN has lost its ability to generalize because it has put too much emphasis on the particular samples.
In general it is best to use as few hidden units as possible, provided the resulting network gives good results. The additional training patterns required to justify more hidden nodes cannot be found easily in most cases, and accuracy is not the NNs' strong point.
