Neural networks and algorithms: predicting future outcomes from past data

I was working on an algorithm where I am given some inputs and the corresponding outputs, and given roughly 3 months of such outputs I need a way to calculate what the future output might be.
This problem can be related to the stock exchange: we are given certain constraints and certain outcomes, and we need to predict the next one.
I stumbled upon neural network stock market prediction; you can Google it, or you can read about it here, here and here.
To get started on the algorithm, I couldn't figure out what the structure of the layers should be.
The given constraints are:
The output will always be an integer.
The output will always be between 1 and 100.
There is no exact input, so to speak; just like the stock market, we only know that the price will fluctuate between 1 and 100, so we might (or might not?) consider this as the only input.
We have records for the last 3 months (or more).
Now, my first question is: how many nodes do I take for the input layer?
The output is just one node, fine. But as I said, should I take 100 nodes for the input layer (given that the stock price will always be an integer and will always be between 1 and 100)?
What about the hidden layer? How many nodes there? Say I take 100 nodes there too; I don't think that would train the network well, because I think that for each input we also need to take into account all previous inputs.
Say we are calculating the output for the 1st day of the 4th month; we should then have 90 nodes in the hidden/middle layer (imagining each month is 30 days for simplicity). Now there are two cases:
Our prediction was correct and the outcome was the same as we predicted.
Our prediction failed, and the outcome was different from what we predicted.
Whatever the case, when we now calculate the output for the 2nd day of the 4th month, we need not only those 90 inputs but also the last actual result (not the prediction, even if they happen to match!), so we now have 91 nodes in our middle/hidden layer.
And so on; the number of nodes would keep increasing each day, AFAICT.
So, my other question is: how do I define/set the number of nodes in the hidden/middle layer if it's dynamically changing?
My last question is: is there any other particular algorithm out there (for this kind of thing) that I am not aware of, and that I should be using instead of messing around with this neural network stuff?
Lastly, are there any caveats I might be missing that could cause the algorithm I am making to predict the output incorrectly, anything that might make it go wrong?

There is much to say in answer to your question. In fact, your question addresses the problem of time series forecasting in general, and the application of neural networks to this task. I'm writing only the most important points here, but after reading this you should dig into Google's results for the query time series prediction neural network. There are many works where the principles are covered in detail. A variety of software implementations (with source code) also exist (here is just one example, with code in C++).
1) I must say that the problem is 99% about data preprocessing and choosing the correct input/output factors, and only 1% about the concrete instrument to use, whether neural networks or something else. Just as a side note, neural networks can internally implement most other data analysis methods. For example, you can use a neural network for Principal Component Analysis (PCA), which is closely related to the SVD mentioned in another answer.
2) It's very rare that input/output values fit strictly within a specific region. Real-life data should be considered unbounded in absolute values (even if its changes seem to form a channel, the channel can break down at any moment), but a neural network can only operate under stable conditions. This is why the data is normally converted into increments first (by calculating the delta x_i - x_{i-1}, or by taking the log of the ratio x_i / x_{i-1}). I suggest you do this with your data anyway, even though you declare it stays inside the [0, 100] region. If you don't, the neural network will most likely degenerate into a so-called naive predictor, which produces a forecast where each next value equals the previous one.
The data is then normalized into [0, 1] or [-1, +1]. The second is appropriate for time series prediction, where +1 denotes a move up and -1 a move down. Use the tanh (hyperbolic tangent) activation function for the neurons in your net.
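To make the preprocessing concrete, here is a minimal sketch in Python (the series values and the scaling choice are illustrative assumptions, not part of the original answer):

    import numpy as np

    prices = np.array([42, 45, 44, 50, 49, 53], dtype=float)   # raw daily values

    # Convert to increments: x_i - x_{i-1} (or use log-returns instead).
    deltas = np.diff(prices)                   # [ 3. -1.  6. -1.  4.]

    # Normalize into [-1, +1]: +1 is the largest move up, -1 the largest down.
    normalized = deltas / np.max(np.abs(deltas))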
3) You should feed the NN with input data obtained from a sliding window of dates. For example, if you have data for a year and every point is a day, you should choose the size of the window - say, a month - and slide it day by day, from the past to the future. The day just at the right bound of the window is the target output for the NN. This is a very simple approach (there are much more complicated ones); I mention it because you ask how to handle data that arrives continuously. The answer is: you don't need to change or enlarge your NN every day. Just use a constant structure with a fixed window size and "forget" (do not provide to the NN) the oldest point. It's important that you do not treat all the data you have as a single input, but divide it into many small vectors and train the NN on them, so the net can generalize the data and find regularities.
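As a rough illustration of the windowing (the window size of 30 and the stand-in series are made-up choices):

    import numpy as np

    def make_windows(series, window):
        """Split a 1-D series into (input vector, next value) training pairs."""
        X, y = [], []
        for i in range(len(series) - window):
            X.append(series[i:i + window])    # the last `window` values...
            y.append(series[i + window])      # ...are used to predict the next one
        return np.array(X), np.array(y)

    series = np.sin(np.linspace(0, 12, 120))  # stand-in for your normalized data
    X, y = make_windows(series, window=30)
    # Each row of X is one fixed-size input vector; the net never grows.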
4) The size of the sliding window is your NN input size. The output size is 1. You should play with the hidden layer size to find the best performance. Start with a value somewhere between the input and output sizes, for example sqrt(in*out).
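One possible realization of this sizing heuristic, using scikit-learn's MLPRegressor purely as an illustrative stand-in for "a feedforward net" (the window size and toy data are assumptions):

    import numpy as np
    from math import sqrt
    from sklearn.neural_network import MLPRegressor

    n_in, n_out = 30, 1
    n_hidden = max(1, round(sqrt(n_in * n_out)))      # sqrt(30*1) -> 5

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (200, n_in))               # 200 toy window vectors
    y = X[:, -1] * 0.5                                # made-up target

    net = MLPRegressor(hidden_layer_sizes=(n_hidden,),
                       activation='tanh',             # matches point 2) above
                       max_iter=2000, random_state=0)
    net.fit(X, y)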
According to recent research, recurrent neural networks seem to perform better on time series forecasting tasks.

I agree with Stan when he says
1) I must say that the problem is 99% about data preprocessing
I've applied Neural Networks for 25+ years to various aerospace applications including helicopter flight control - setting up the input/output data set is everything - all else is secondary.
I'm amazed by smirkman's comment that Neural Networks were quickly dropped "as they produced nothing worthwhile" - that tells me that whoever was working with Neural Networks had little experience with them.
Given that the topic discusses neural network stock market prediction - I'll say that I've made it work. Test results are downloadable from my website at www.nwtai.com.
I don't give away how it was done but there's enough interesting data that should make you want to explore using Neural Networks more seriously.

This kind of problem was particularly well researched by the thousands of people who wanted to win the $1M Netflix Prize.
Earlier submissions were often based on k-Nearest Neighbours. Later submissions used Singular Value Decomposition, Support Vector Machines, and Stochastic Gradient Descent. The winner used a blend of several techniques.
Reading the excellent Community forums will give you many insights about the best methods to predict the future from the past. You'll also find loads of source code for the different methods.
Amusingly, neural networks were quickly dropped, as they produced nothing worthwhile (and I personally have yet to see a non-trivial NN produce anything of value).
If you are starting out, I'd suggest SVD as a first path; it's quite easy to implement and often produces surprising insights into data.
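If you want to try it, here is a minimal sketch of the idea (the small ratings matrix is made up for illustration):

    import numpy as np

    ratings = np.array([[5, 4, 0, 1],      # rows: users, columns: items
                        [4, 5, 1, 0],
                        [1, 0, 5, 4],
                        [0, 1, 4, 5]], dtype=float)

    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

    # Keep the top-k singular values for a low-rank approximation;
    # the reconstruction exposes the dominant structure in the data.
    k = 2
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(approx, 1))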
Good luck!

Related

XGBoost/LightGBM over-fitting despite no indication in cross-validation test scores?

We aim to identify predictors that may influence the risk of a relatively rare outcome.
We are using a semi-large clinical dataset, with data on nearly 200,000 patients.
The outcome of interest is binary (i.e. yes/no), and quite rare (~ 5% of the patients).
We have a large set of nearly 1,200 mostly dichotomized possible predictors.
Our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is quite possible that none of the predictors we are considering have any influence on the risk of developing the condition, so if we were aiming to develop a prediction model it would likely have been a rather bad one. For this work, we use the R implementations of XGBoost/LightGBM.
We have been having difficulties tuning the models. Specifically, when running cross-validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see the figure below for nrounds=600,000 from XGBoost). This is observed even when increasing the learning rate (eta), or when adding some regularization parameters (e.g. max_delta_step, lambda, alpha, gamma, even at high values for these).
As expected, the CV test score is always lower than the train score, but it continues to improve without ever showing a clear sign of overfitting. This is true regardless of the evaluation metric used (the example below is for logloss, but the same is observed for auc/aucpr/error rate, etc.). Relatedly, the same phenomenon is observed when using a grid search to find the optimal value of tree depth (max_depth). CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of overfitting.
Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
Are there situations in which overfitting happens despite continuous improvement in the CV test (or test split) scores? If so, why is that, and how would one choose the optimal values for the hyperparameters?
Relatedly, again, the idea is not to create a prediction model (since it would be a rather bad one, given that we don't know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees are not the optimal method for this, are there others that come to mind? Again, part of the reason we chose boosted trees was to enable the identification of higher (i.e. more than 2) order interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.).
Welcome to Stack Overflow!
In the absence of some code and representative data it is not easy to offer more than general suggestions.
Your descriptive statistics step may give some pointers to a starting model.
What does existing theory (if it exists!) suggest about the cause of the medical condition?
Is there a male/female difference or old/young age difference that could help get your foot in the door?
Your medical data has similarities to the fraud detection problem, where one is trying to predict rare events, usually much rarer than your cases.
It may pay you to check out the use of xgboost/lightgbm in the fraud detection literature.
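As a rough illustration (using the Python API, although the question uses R, so treat this as a sketch only; the synthetic data and parameter values are placeholders), early stopping can pick nrounds for you instead of eyeballing the CV curve:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(5000, 50))          # dichotomized predictors
    y = rng.binomial(1, 0.05, size=5000)             # ~5% positive outcome

    params = {"objective": "binary:logistic",
              "eval_metric": "logloss",
              "eta": 0.1,
              "max_depth": 4}

    cv = xgb.cv(params, xgb.DMatrix(X, label=y),
                num_boost_round=5000,
                nfold=5,
                stratified=True,                     # important for rare outcomes
                early_stopping_rounds=50,            # stop once the test score stalls
                seed=42)
    print("chosen nrounds:", len(cv))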

Simulation Performance Metrics

This is a semi-broad question, but it's one that I feel on some level is answerable or at least approachable.
I've spent the last month or so making a fairly extensive simulation. In order to protect the interests of my employer, I won't state specifically what it does... but an analogy of what it does may be explained by... a high school dance.
A girl or boy enters the dance floor, and based on the selection of free dance partners, an optimal choice is made. After a period of time, two dancers finish dancing and are now free for a new partnership.
I've been making partner selection algorithms designed to maximize average match outcome while not sacrificing wait time for a partner too much.
I want a way to gauge/compare versions of my algorithms in order to select the optimal algorithm for any situation. This is difficult, however, since the inputs of my simulation are extremely large matrices of input parameters (2-5 per dancer), and the simulation takes several minutes to run (a fact that makes it difficult to test a large number of simulation inputs). I have a few output metrics, but linking them to the large number of inputs is extremely hard. I'm also interested in finding which algorithms completely fail under certain input conditions...
Any pro tips / online resources that might help me define input constraints / output variables and give clarity on an optimal algorithm?
I might not understand exactly what you want, but here is my suggestion. Let me know if my solution is inaccurate/irrelevant and I will edit/delete accordingly.
Assume you have a certain metric (say compatibility of the pairs, or waiting time). If you just have the average or total number for this metric over all the users, it is kind of useless. Instead you might want to find the distribution of this metric over all users. If nothing else, you should at least keep track of the variance. Once you have the distribution, you can calculate the probability that a particular algorithm A is better than B for a certain metric.
If you do not have the distribution of the metric within an experiment, you can always run multiple experiments; the number of experiments you need depends on the variance of the metric and the difference between the two algorithms.
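A minimal sketch of what that comparison could look like (the metric samples here are synthetic stand-ins for per-dancer outcomes):

    import numpy as np

    rng = np.random.default_rng(0)
    metric_a = rng.normal(7.0, 1.5, size=200)    # e.g. match quality under algorithm A
    metric_b = rng.normal(6.5, 2.0, size=200)    # the same metric under algorithm B

    # Report the full distribution, not just the mean.
    print("A: mean %.2f, std %.2f" % (metric_a.mean(), metric_a.std()))
    print("B: mean %.2f, std %.2f" % (metric_b.mean(), metric_b.std()))

    # Bootstrap the difference of means to estimate P(A better than B).
    diffs = [rng.choice(metric_a, 200).mean() - rng.choice(metric_b, 200).mean()
             for _ in range(5000)]
    print("P(A > B) ~", np.mean(np.array(diffs) > 0))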

Neural Network Basics

I'm a computer science student and for this year's project, I need to create and apply a Genetic Algorithm to something. I think Neural Networks would be a good thing to apply it to, but I'm having trouble understanding them. I fully understand the concepts, but none of the websites out there really explain the following, which is blocking my understanding:
How the decision is made for how many nodes there are.
What the nodes actually represent and do.
What part the weights and bias actually play in classification.
Could someone please shed some light on this for me?
Also, I'd really appreciate it if you have any similar ideas for what I could apply a GA to.
Thanks very much! :)
Your question is quite complex and I don't think a small answer will fully satisfy you. Let me try, nonetheless.
First of all, there must be at least three layers in your neural network (assuming a simple feedforward one). The first is the input layer, and there will be one neuron per input. The third layer is the output one, and there will be one neuron per output value (if you are classifying, there might be more than one, if you want to assign a "belongs to" meaning to each neuron). The remaining layer is the hidden one, which stands between the input and output. Determining its size is a complex task, as you can see in the following references:
comp.ai faq
a post on stack exchange
Nevertheless, the best way to proceed would be for you to state your problem more clearly (as far as industrial secrecy might allow) and let us think a little more about your context.
The number of input and output nodes is determined by the number of inputs and outputs you have. The number of intermediate nodes is up to you. There is no "right" number.
Imagine a simple network: inputs( age, sex, country, married ), outputs( chance of death this year ). Your network might have 2 "hidden values", one depending on age and sex, the other depending on country and married. You put weights on each. For example, Hidden1 = age * weight1 + sex * weight2, and Hidden2 = country * weight3 + married * weight4. You then make another set of weights connecting Hidden1 and Hidden2 to the output variable.
Then you get data from, say, the census, and run it through your neural network to find out what weights best match the data. You can use genetic algorithms to test different sets of weights. This is useful if you have so many edges you could not try every possible weighting. You need to find good weights without exhaustively trying every possible set, so a GA lets you "evolve" a good set of weights.
Then you test your weights on data from a different census to see how well it worked.
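As a toy illustration of evolving weights with a GA (the network shape, target function, and GA settings are all made-up choices, and it uses mutation only, no crossover, for brevity):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, (100, 4))             # 4 inputs per example
    y = (X[:, 0] * X[:, 1] + X[:, 2]) / 2        # made-up target to learn

    def forward(w, X):
        W1 = w[:8].reshape(4, 2)                 # 4 inputs -> 2 hidden nodes
        W2 = w[8:].reshape(2, 1)                 # 2 hidden nodes -> 1 output
        return np.tanh(X @ W1) @ W2

    def fitness(w):
        return -np.mean((forward(w, X).ravel() - y) ** 2)   # higher is better

    pop = rng.normal(0, 1, (50, 10))             # 50 candidate weight vectors
    for generation in range(200):
        scores = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(scores)[-10:]]              # keep the 10 fittest
        children = parents[rng.integers(0, 10, 40)] + rng.normal(0, 0.1, (40, 10))
        pop = np.vstack([parents, children])                 # next generation

    best = pop[np.argmax([fitness(w) for w in pop])]
    print("best MSE:", -fitness(best))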
... my major barrier to understanding this though is understanding how the hidden layer actually works; I don't really understand how a neuron functions and what the weights are for...
Every node in the middle layer is a "feature detector" -- it will (hopefully) "light up" (i.e., be strongly activated) in response to some important feature in the input. The weights are what emphasize an aspect of the previous layer; that is, the set of input weights to a neuron corresponds to which nodes in the previous layer are important for that feature.
If a weight connecting myInputNode to myMiddleLayerNode is 0, then you can tell that myInputNode is not important to whatever feature myMiddleLayerNode is detecting. If, though, the weight connecting myInputNode to myMiddleLayerNode is very large (either positive or negative), you know that myInputNode is quite important (if it's very negative it means "No, this feature is almost certainly not there", while if it's very positive it means "Yes, this feature is almost certainly there").
So a corollary of this is that you want the number of your middle-layer nodes to have a correspondence to how many features are needed to classify the input: too few middle-layer nodes and it will be hard to converge during training (since every middle-layer node will have to "double up" on its feature-detection) while too many middle-layer nodes may over-fit your data.
So... a possible use of a genetic algorithm would be to design the architecture of your network! That is, use a GA to set the number of middle-layer nodes and initial weights. Some instances of the population will converge faster and be more robust -- these could be selected for future generations. (Personally, I've never felt this was a great use of GAs since I think it's often faster just to trial-and-error your way into a decent NN architecture, but using GAs this way is not uncommon.)
You might find this wikipedia page on NeuroEvolution of Augmenting Topologies (NEAT) interesting. NEAT is one example of applying genetic algorithms to create the neural network topology.
The best way to explain an Artificial Neural Network (ANN) is to describe the biological process that it attempts to simulate - a neural network. The best example of one is the human brain. So how does the brain work (highly simplified for CS)?
The functional unit (for our purposes) of the brain is the neuron. It is a potential accumulator and "disperser". What that means is that after a certain amount of electric potential (think filling a balloon with air) has been reached, it "fires" (balloon pops). It fires electric signals down any connections it has.
How are neurons connected? Synapses. These synapses can have various weights (in real life due to stronger/weaker synapses from thicker/thinner connections). These weights allow a certain amount of a fired signal to pass through.
You thus have a large collection of neurons connected by synapses - the base representation for your ANN. Note that the input/output structures described by the others are an artifact of the type of problem to which ANNs are applied. Theoretically, any neuron can accept input as well. It serves little purpose in computational tasks however.
So now on to ANNs.
NEURONS: Neurons in an ANN are very similar to their biological counterpart. They are modeled either as step functions (that signal out "1" after a certain combined input signal, or "0" at all other times), or slightly more sophisticated firing sequences (arctan, sigmoid, etc) that produce a continuous output, though scaled similarly to a step. This is closer to the biological reality.
SYNAPSES: These are extremely simple in ANNs - just weights describing the connections between neurons. They are used simply to weight the signals from the neurons connected to the current one, but they still play a crucial role: synapses are the cause of the network's output. To clarify: training an ANN with a fixed structure and neuron activation function is simply the modification of the synapse weights. That is it. No other change is made in going from a "dumb" net to one that produces accurate results.
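A tiny sketch of one artificial neuron under both activation styles (the inputs and weights are illustrative):

    import numpy as np

    inputs  = np.array([0.5, -1.0, 0.25])   # signals from connected neurons
    weights = np.array([0.8,  0.2, -0.5])   # the synapse weights

    potential = np.dot(inputs, weights)     # accumulated "electric potential"

    step   = 1.0 if potential > 0 else 0.0  # step neuron: it fires or it doesn't
    smooth = np.tanh(potential)             # sigmoid-style: graded, continuous output
    print(potential, step, smooth)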
STRUCTURE:
There is no "correct" structure for a neural network. The structures are either
a) chosen by hand, or
b) allowed to grow as a result of learning algorithms (a la Cascade-Correlation Networks).
Assuming a hand-picked structure, these are actually chosen through careful analysis of the problem and expected solution. Too few "hidden" neurons/layers, and your structure is not complex enough to approximate a complex function. Too many, and your training time rapidly grows unwieldy. For this reason, the selection of inputs ("features") and the structure of a neural net are, IMO, 99% of the problem. The training and usage of ANNs is trivial in comparison.
To now address your GA concern: it is one of many, many techniques used to train the network by modifying the synapse weights. Why? Because in the end, a neural network's output is simply an extremely high-order surface in N dimensions. ANY surface optimization technique can be used to solve for the weights, and GAs are one such technique. The simple backpropagation method is akin to a dimension-reduced, gradient-based optimization technique.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new data point x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
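For readers working in Python rather than R, here is a rough analogue of the same GMM + BIC recipe using scikit-learn (the two-mode data is synthetic, and the exact API differs from MCLUST):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    history = np.concatenate([rng.normal(10, 1, 200),     # a year of daily numbers
                              rng.normal(40, 2, 165)]).reshape(-1, 1)

    # Pick the number of clusters automatically by minimizing BIC.
    models = [GaussianMixture(n_components=k, random_state=0).fit(history)
              for k in range(1, 6)]
    best = min(models, key=lambda m: m.bic(history))

    # Score a new point: a very low log-likelihood suggests an outlier.
    x_new = np.array([[25.0]])
    print("clusters:", best.n_components,
          "log-likelihood of new point:", best.score_samples(x_new)[0])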
You could try line-fitting prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you can calculate the standard deviation of the points around the line.
If the novel point is within the trend line ± the standard deviation, it should not be regarded as an abnormality. A sketch follows below.
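A minimal sketch of that check (the series and the 2-sigma band are illustrative choices):

    import numpy as np

    days = np.arange(365, dtype=float)
    values = 0.3 * days + np.random.default_rng(0).normal(0, 2, 365)

    slope, intercept = np.polyfit(days, values, deg=1)    # fit the trend line
    residuals = values - (slope * days + intercept)
    sigma = residuals.std()                               # spread around the line

    def is_abnormal(day, value, k=2.0):
        predicted = slope * day + intercept
        return abs(value - predicted) > k * sigma         # outside the +-k*sigma band?

    print(is_abnormal(365, 0.3 * 365 + 10))               # True: deviation ~10 >> 2*sigma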
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning. This is a machine learning approach that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing direction abruptly" => compute u_{n+1} - u_n and expect it to have constant sign, or to fall within some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should follow some distribution P(a, b) (uniform([a, b]) or Beta(a, b), possibly more complex), you put prior laws on a and b, and you adjust them based on successive observations. The posterior likelihood of the information provided by the last point added should then give you some insight into whether it is normal or not. The relative entropy between the posterior and prior laws at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
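As a toy illustration of this Bayesian monitoring applied to the derived quantity u_{n+1} - u_n (the series, prior, and threshold are all made up):

    import numpy as np

    series = np.array([3, 5, 6, 9, 11, 14, 15, 17, 19, 22, 24, 21])
    increments = np.diff(series)           # the derived quantity u_{n+1} - u_n

    a, b = 1.0, 1.0                        # Beta(1, 1): uniform prior on P(up)
    for d in increments:
        p_up = a / (a + b)                 # posterior predictive P(next move is up)
        p_obs = p_up if d > 0 else 1 - p_up    # probability of what we observed
        if p_obs < 0.1:                    # the model finds this point surprising
            print("abnormal increment:", d)    # flags the final -3
        a, b = (a + 1, b) if d > 0 else (a, b + 1)   # conjugate Beta update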
I see little point in complex traditional machine learning stuff (perceptron layers or SVMs, to name just two) if you want to detect outliers. These methods work great when classifying data that is known to be reasonably clean.

Neural Network settings for fast training

I am creating a tool for predicting the time and cost of software projects based on past data. The tool uses a neural network to do this and so far, the results are promising, but I think I can do a lot more optimisation just by changing the properties of the network. There don't seem to be any rules or even many best practices when it comes to these settings, so if anyone with experience could help me I would greatly appreciate it.
The input data is made up of a series of integers that could go as high as the user wants, but most will be under 100,000, I would have thought. Some will be as low as 1. They are details like the number of people on a project and the cost of a project, as well as details about database entities and use cases.
There are 10 inputs in total and 2 outputs (the time and cost). I am using Resilient Propagation to train the network. Currently it has: 10 input nodes, 1 hidden layer with 5 nodes and 2 output nodes. I am training to get under a 5% error rate.
The algorithm must run on a webserver so I have put in a measure to stop training when it looks like it isn't going anywhere. This is set to 10,000 training iterations.
Currently, when I try to train it with some data that is a bit varied, but well within the limits of what we expect users to put into it, it takes a long time to train, hitting the 10,000 iteration limit over and over again.
This is the first time I have used a neural network and I don't really know what to expect. If you could give me some hints on what sort of settings I should be using for the network and for the iteration limit I would greatly appreciate it.
Thank you!
First of all, thanks for providing so much information about your network! Here are a few pointers that should give you a clearer picture.
You need to normalize your inputs. If one node sees a mean value of 100,000 and another just 0.5, you won't see an equal impact from the two inputs, which is why you'll need to normalize them.
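For instance (the column meanings are made up), z-scoring each input column puts them all on a common scale:

    import numpy as np

    raw = np.array([[50_000.0, 0.5, 12.0],      # [cost, ratio, team size]
                    [80_000.0, 0.9,  4.0],
                    [ 1_000.0, 0.1, 25.0]])

    # z-score each column so every input has mean 0 and unit variance.
    normalized = (raw - raw.mean(axis=0)) / raw.std(axis=0)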
Only 5 hidden neurons for 10 input nodes? I remember reading somewhere that you need at least double the number of inputs; try 20+ hidden neurons. This will give your neural network the capacity to model more complex relationships. However, with too many neurons your network will just memorize the training data set.
Resilient backpropagation is fine. Just remember that there are other training algorithms out there like Levenberg-Marquardt.
How many training sets do you have? Neural networks usually need a large dataset to be good at making useful predictions.
Consider adding a momentum factor to your weight-training algorithm to speed things up if you haven't done so already.
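A toy sketch of what a momentum term does to a weight update (the loss and coefficients are illustrative, not tuned recommendations):

    import numpy as np

    target = np.array([1.0, -2.0, 0.5])   # weights we want to recover
    w = np.zeros(3)
    velocity = np.zeros(3)
    lr, momentum = 0.1, 0.9

    for _ in range(200):
        grad = 2 * (w - target)                      # gradient of ||w - target||^2
        velocity = momentum * velocity - lr * grad   # remember the past direction
        w = w + velocity                             # bigger steps along a trend

    print(np.round(w, 3))                            # approaches `target`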
Online training tends to be better for making generalized predictions than batch training. The former updates the weights after each individual training example is run through the network, while the latter updates the network only after the entire data set has been passed through. It's your call.
Is your data discrete or continuous? Neural networks tend to do a better job with 0s and 1s than with continuous values. If it is the former, I'd recommend using the sigmoid activation function. A combination of tanh and linear activation functions for the hidden and output layers tends to do a good job with continuously varying data.
Do you need another hidden layer? It may help if your network is dealing with complex input-output surface mapping.
