Artificial neural network activation function - backpropagation

I've programmed an ANN with the backpropagation algorithm to forecast the number of customers. It has 3 layers: 4 input neurons, 3 hidden neurons and 1 output neuron, so I need a continuous output. What activation functions should I use?

In this case what you can do (and I've seen this work well) is apply a PureLin (linear) function to the input and output layers and use Tanh or Sigmoid in the hidden layer. The rest of the work is done by the weights!
Hope this helps!
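As a rough illustration of that layout, here is a minimal sketch of the forward pass for a 4-3-1 network with a tanh hidden layer and a linear output. The function name `forward` and all weight values are made up for the example; in practice the weights come out of backpropagation.

```python
import math
import random

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 4-3-1 network: tanh hidden layer, linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Linear ("PureLin") output: the weighted sum is returned unchanged,
    # so the prediction is not confined to (0, 1) or (-1, 1).
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Illustrative (untrained) parameters, assumed for this example only.
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
b_hidden = [0.0, 0.0, 0.0]
w_out = [50.0, -20.0, 30.0]
b_out = 120.0

prediction = forward([0.2, 0.5, 0.1, 0.9], w_hidden, b_hidden, w_out, b_out)
print(prediction)
```

Because each tanh output lies in (-1, 1), the range of the prediction is set entirely by the output weights and bias, which is why the output can represent a customer count rather than a value squashed into a fixed interval.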

Related

Usage of validation set in neural networks

I want to describe an algorithm that I used to choose the number of hidden layers and the number of neurons in each of them (I couldn't find any confirmation of this approach anywhere, although it seems very logical to me):
Divide the data into 60% (training), 20% (validation) and 20% (test) parts.
Now I want to check all possibilities for networks with one and two hidden layers. To do so:
I'll train a network with one hidden layer containing 1, 2 or 3 neurons (3 different networks). After that I'll calculate the error on the validation set (RMSE and MAE) for each of those networks.
I'll do exactly the same thing for networks with 2 hidden layers, i.e. I'm going to fit a network for every combination of neuron counts in the two layers (where each layer can have only 1, 2 or 3 neurons). This gives 9 possible architectures: 1 neuron in the first hidden layer and 1 in the second, 2 neurons in the first and 1 in the second, and so on. For each of the 9 architectures I'll calculate RMSE and MAE between the predictions on the validation set and the actual values.
Out of all those runs I'll pick the architecture with the lowest error on the validation set.
Could you please tell me whether this algorithm makes sense? If not, what else can I do with the validation set to choose the best architecture for a neural network?
This is the absolutely standard method of hyperparameter/model selection. The "trick" one relies upon is that you select from a very limited set of models (in your case 12), and thus your estimate of the error will be very tight. You can refer to Vladimir Vapnik's book on statistical learning theory, which gives exact bounds explaining why this sort of approach works well.
That being said, you are working under the assumption that 20% of your data is "big enough" to estimate your performance. If it is not, you can look at cross-validation.
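The selection procedure from the question can be sketched as follows. This is only an outline under stated assumptions: `select_architecture` and the `validation_error` callable are hypothetical names, and the callable is expected to train a network with the given hidden-layer sizes on the training split and return its validation error.

```python
import math
from itertools import product

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# One hidden layer with 1-3 neurons, plus every two-layer combination.
sizes = (1, 2, 3)
candidates = [(n,) for n in sizes] + list(product(sizes, repeat=2))

def select_architecture(candidates, validation_error):
    """Return the candidate with the lowest validation error.

    validation_error is a hypothetical callable: it should train a network
    with the given hidden-layer sizes on the training split and return its
    error (for example RMSE) on the validation split.
    """
    return min(candidates, key=validation_error)

print(len(candidates))  # 3 one-layer + 9 two-layer architectures
```

The test split plays no part in this loop; it is held back so that the chosen architecture can be scored once on data that influenced neither the weights nor the selection.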

Question about activation function of image task in Deep Learning

Let me ask about image tasks in deep learning (here, image classification).
As I understand it, a deep learning network can be divided into three kinds of layers: input layer, intermediate layer, and output layer.
(1) Input layer → Intermediate layer
(2) Intermediate layer → Output layer
I understand that it is normal for both (1) and (2) to use an activation function.
My understanding is as follows.
For (1), the ReLU function or the sigmoid function is used.
For (2), the softmax function is used.
I would like to know why (1) and (2) each use a specific function by convention.
Also, are there cases where other activation functions are used, and are there any results comparing the various functions?
If anyone knows anything about the above, please let me know.
Also, if you can point me to a reference web page or paper, please let me know.
The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.
An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. Many activation functions are nonlinear and may be referred to as the “nonlinearity” in the layer or the network design. Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data.
Hidden Layer
The ReLU (rectified linear unit) activation function is nowadays the most common function used for hidden layers because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh. Specifically, it is less susceptible to vanishing gradients that prevent deep models from being trained, although it can suffer from other problems like saturated or “dead” units.
A general problem with both the sigmoid and tanh functions is that they saturate. This means that large input values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid respectively. Further, the functions are only really sensitive to changes around the mid-point of their input (0.0), where the output is 0.5 for sigmoid and 0.0 for tanh.
The limited sensitivity and saturation of the function happen regardless of whether the summed activation from the node provided as input contains useful information or not. Once saturated, it becomes challenging for the learning algorithm to continue to adapt the weights to improve the performance of the model.
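To make the saturation concrete, here is a small sketch that evaluates the derivatives of sigmoid and tanh at a few inputs. The helper names are chosen for the example; the formulas are the standard ones, sigmoid'(z) = s(1 - s) and tanh'(z) = 1 - tanh(z)^2.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# The gradient is largest at z = 0 and shrinks quickly as |z| grows.
for z in (0.0, 2.0, 10.0):
    print(z, sigmoid_grad(z), tanh_grad(z))
```

Since backpropagation multiplies these derivatives layer by layer, a unit pushed far from zero passes almost no gradient back, which is the difficulty in adapting weights described above.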
Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well.
Because the rectified function is the identity for the positive half of the input domain and zero for the other half, it is nonlinear overall and is referred to as a piecewise linear function or a hinge function. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces.
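For comparison, a minimal sketch of ReLU and its gradient. Using 0 as the gradient at exactly z = 0 is a common convention assumed here, not something the text above specifies.

```python
def relu(z):
    """Rectified linear unit: identity for positive inputs, zero otherwise."""
    return max(0.0, z)

def relu_grad(z):
    # Convention assumed here: gradient 0 at z == 0.
    return 1.0 if z > 0 else 0.0

for z in (-3.0, 0.5, 100.0):
    print(z, relu(z), relu_grad(z))
```

The gradient stays at 1 however large the positive input becomes, so there is no saturation on that side. For negative inputs it is exactly 0, which is how a unit can end up "dead" if its input is negative for every training example.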
Output Layer
Common activation functions to consider for use in the output layer are: Linear, Logistic (Sigmoid) and Softmax.
The linear activation function is also called “identity” (multiplied by 1.0) or “no activation.” This is because the linear activation function does not change the weighted sum of the input in any way and instead returns the value directly.
The softmax function outputs a vector of values that sum to 1.0 that can be interpreted as probabilities of class membership. It is related to the argmax function that outputs a 0 for all options and 1 for the chosen option. Softmax is a “softer” version of argmax that allows a probability-like output of a winner-take-all function. As such, the input to the function is a vector of real values and the output is a vector of the same length with values that sum to 1.0 like probabilities.
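The relationship between softmax and argmax can be sketched in a few lines. The function names are illustrative, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Map real-valued scores to positive values that sum to 1.0."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_one_hot(scores):
    """Winner-take-all: 1 for the largest score, 0 for the rest."""
    best = scores.index(max(scores))
    return [1 if i == best else 0 for i in range(len(scores))]

scores = [2.0, 1.0, 0.1]
print(softmax(scores))
print(argmax_one_hot(scores))
```

Both functions rank the options the same way, but softmax keeps the information about how close the runners-up were, and it is differentiable, which is what gradient-based training needs.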
Choose the activation function for your output layer based on the type of prediction problem that you are solving. Specifically, the type of variable that is being predicted.
For example, you may divide prediction problems into two main groups, predicting a categorical variable (classification) and predicting a numerical variable (regression).
If your problem is a regression problem, you should use a linear activation function.
Regression: One node, linear activation.
If your problem is a classification problem, then there are three main types of classification problems and each may use a different activation function.
Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model will predict the probability of class membership (e.g. probability that an example belongs to each class) that you can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).
If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node for each class and a sigmoid activation function is used.
Binary Classification: One node, sigmoid activation.
Multi-class Classification: One node per class, softmax activation.
Multi-label Classification: One node per class, sigmoid activation.
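Converting the predicted probabilities into crisp labels for these three cases might look like the sketch below. The function names and the 0.5 threshold are assumptions for illustration; in practice the threshold can be tuned.

```python
def binary_label(p, threshold=0.5):
    """One sigmoid output: class 1 if the probability reaches the threshold."""
    return 1 if p >= threshold else 0

def multiclass_label(probs):
    """Softmax outputs: index of the most probable class (argmax)."""
    return max(range(len(probs)), key=probs.__getitem__)

def multilabel_labels(probs, threshold=0.5):
    """One sigmoid output per class: each class is decided independently."""
    return [1 if p >= threshold else 0 for p in probs]

print(binary_label(0.73))
print(multiclass_label([0.1, 0.7, 0.2]))
print(multilabel_labels([0.9, 0.2, 0.6]))
```

The multi-label case can switch on several classes at once, which is why it uses independent sigmoids rather than a softmax whose outputs compete for a total of 1.0.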
The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems where class membership is required on more than two class labels.
The function can be used as an activation function for a hidden layer in a neural network, although this is less common. It may be used when the model internally needs to choose or weight multiple different inputs at a bottleneck or concatenation layer.
Reference: machinelearningmastery.com
ReLU, LeakyReLU and tanh activation functions in the input and hidden layers are used for predicting numeric functions. In my experience LeakyReLU and tanh find the signal better for equations and linear trends, so I used them for linear problems. Sigmoid activation is used for classification and binary cross-entropy problems, although tanh also worked well for a binary cross-entropy problem on credit loan risk.
You can use softmax when you're outputting multiple labels as probabilities. In this example, I use the UFO text description to output the probable shape of the UFO.
https://github.com/dnishimoto/python-deep-learning/blob/master/UFO%20.ipynb
['Egg','Cross','Sphere','Triangle','Disk','Oval','Rectangle','Teardrop']
Softmax returns a probability for each output label.
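# Imports assumed here; the linked notebook may import from keras directly.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
# vocab_size, max_length and LABELS are defined earlier in the notebook.
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
# One output node per label; softmax turns the scores into probabilities.
model.add(Dense(len(LABELS), activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])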
Egg-shaped UFOs are the most common sightings.

How do I intuitively interpret a sigmoidal neural network model?

There are multiple sources, but they explain it at a bit too high a level for me to actually understand.
Here is my knowledge of how this model works:
We feed forward the information in the prior layer's nodes using weight * value. We do NOT use the sigmoid function here. This is because any hidden layers will force the value to be POSITIVE if we use the sigmoid function here. If it is always positive, then subsequent values can never be less than 0.5.
When we have fed forward to the output, we then use the sigmoid function on the output.
So in total we use the sigmoid function on the output layer values only.
I will try to include a hopefully not terrible diagram
https://imgur.com/a/4EzkpH5
I have tested with my own code, and evidently it should not be the sigmoid function on every value and weight, but I am unsure if it is just the sum of weight*value.
So basically you have a set of features for your model. These features are the independent variables that are responsible for producing the output. So the features are the inputs and the predicted values are the outputs. This is indeed a function.
It is easier to understand neural networks if we study them in terms of functions.
First multiply the feature vector by the vector of weights, i.e. compute the dot product of the two vectors.
The dot product is a scalar if you have a single node (neuron). Apply the sigmoid function to this product. The output is the final prediction.
The whole model could be expressed as a single composite function like,
y = sigmoid( dot( w , x ) )
Backpropagation (gradient descent) for a NN also becomes more intuitive if we treat the NN as a function.
In the above function,
sigmoid : applies sigmoid activation function to the argument.
dot : returns the dot product of two vectors.
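Written out as runnable code, the composite function above might look like this minimal sketch. The weight and feature values in the example call are arbitrary.

```python
import math

def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Single-neuron model: y = sigmoid(dot(w, x))."""
    return sigmoid(dot(w, x))

print(predict([0.4, -0.2, 0.1], [1.0, 2.0, 3.0]))
```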
Also, use vector notation as far as possible. It saves you from the confusion related with summations.
Hope it helps.
Activation functions serve an important role in neural network models: they can, given the choice of activation function, grant the network the capability to model non-linear datasets.
The example illustrated in the figure you posted (rendered below) will be limited to modelling linear problems where the output value is between 0 and 1 (the range of the sigmoid function). However, the model would support non-linear datasets if the sigmoid was also applied to the two nodes in the middle. StackOverflow is not the place to discuss the theoretical foundation of why this works; instead I recommend some light reading like this ebook: Neural Networks and Deep Learning (no affiliation).
As a side note: the final output layer of a network is sometimes instantiated as a simple sum, or a ReLU. This widens the range of the network's output.
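One way to see why the middle nodes need an activation: without one, two weight layers collapse into a single linear map. The sketch below checks this numerically for a 2-2-1 network with made-up weights; the helper names are illustrative.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

w1 = [[1.0, -2.0], [0.5, 3.0]]   # input -> hidden (illustrative values)
w2 = [[2.0, -1.0]]               # hidden -> output

x = [0.3, -0.7]
two_layer = matvec(w2, matvec(w1, x))    # no activation in the hidden layer
one_layer = matvec(matmul(w2, w1), x)    # a single equivalent weight matrix

print(two_layer, one_layer)
```

Both computations give the same value up to rounding, so the hidden layer adds no modelling power until a non-linear function such as the sigmoid is applied to its nodes.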

Neural Networks - why is my training error increasing as I add hidden units (neurons)?

I'm trying to optimise the number of hidden units in my MLP.
I'm using k-fold cross validation, with 10 folds - 16200 training points and 1800 validation points in each fold.
When I run the network with hidden units varying from 1:10, I find the minimum error always occurs at 2 (NMSE of about 7).
3 is slightly higher (NMSE of about 11), and with 4 or more hidden units the error remains constant at about 14 or 15 regardless of how many I add.
Why is this?
I find it hard to believe that overfitting is occurring, because of the very large amount of data points being used (with all 10 folds, that's 162000 training points, albeit each repeated 9 times).
Many thanks for any help or advice!
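For reference, the fold sizes described above are consistent with 18000 points in total (a figure inferred from 16200 + 1800). A minimal sketch of how such folds can be generated, with the hypothetical helper `k_fold_indices` and no shuffling:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when k does not divide n_samples.
        stop = start + fold_size if fold < k - 1 else n_samples
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

for train, validation in k_fold_indices(18000, 10):
    print(len(train), len(validation))
```

Every point is used for validation exactly once and for training in the other 9 folds, so the 162000 training points are 18000 distinct points counted 9 times each.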
If the inputs are voltage and current, and the question is about the power generated, then it's just P=V*I. Even if you have some noise, the relationship will still be linear. In this case a simple linear model would do just fine, and would be far nicer to interpret! That's why the simple ANN works best and the more complex ones overfit: they look for non-linear relationships (which are not there, but the network does whatever will minimise the cost function).
To summarise, I would recommend checking a simple linear model. Also, since you have a lot of data points, make a 50-25-25 split for training, test and validation sets. Look at your cost function and see how it changes with the error rate.

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shapes) without you actually telling it to. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as before during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you have separate test data that you never showed it during training. If the child performs horribly on the test data but scores 100% on the training data, you will know that it has learned nothing: it is simply repeating what it was told during training. You trained it too long, and it only memorized your examples without understanding what makes an apple an apple because you gave it too many details. This is called overfitting. To prevent your Perceptron from only (!) recognizing the training data you'll have to stop training at a reasonable time and find a good balance between the sizes of the training and test sets.
How do I know if a point is misclassified?
If the output is different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into single/multi-layer perceptrons and how neural networks of multiple perceptrons work). The network will take your input. How it's coded is irrelevant for this; let's say the input is the string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, you can compare it with the network's output, which will be either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1. A nonzero result means that the network gave a wrong output. Compare this to the delta rule and you will see that this small expression is part of the larger update equation. Whenever the result is nonzero you perform a weight update. If the target and actual values are the same, you will always get 0 and you know that the network didn't misclassify.
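The check described here, together with the resulting weight update, can be sketched as a single step of the perceptron rule. The function names, the step activation at 0 and the learning rate of 1.0 are assumptions for illustration.

```python
def step(z):
    """Binary step activation: 1 if the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def perceptron_update(weights, bias, x, target, learning_rate=1.0):
    """Classify one point and adjust the weights if it was misclassified."""
    actual = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
    error = target - actual   # 0 if correct, +1 or -1 if misclassified
    new_weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias + learning_rate * error
    return new_weights, new_bias, error

# A point of class 0 that the current weights wrongly place in class 1.
print(perceptron_update([1.0, 1.0], 0.0, [1.0, 2.0], target=0))
```

When the error is 0 the update terms vanish and the weights are returned unchanged, so the same expression covers both the correct and the misclassified case.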
How do I go about choosing test points, training points, threshold or a bias?
In practice the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick", namely adding an extra input unit with constant value 1. This means the actual bias value is encoded in this additional unit's weight, and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you learn to do it by experience. At the point where you are, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel the network can work with them as well as you want it to.
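Putting these pieces together, a minimal perceptron training loop on the logical functions mentioned above could look like the following. It is a sketch under assumptions: the name `train_perceptron`, zero initial weights, a threshold at 0 on the weighted sum and a cap of 100 epochs are all illustrative choices. It returns the counts the homework asks for (weight updates and passes over the training set).

```python
def train_perceptron(samples, max_epochs=100, learning_rate=1.0):
    """samples: list of (inputs, target) pairs with targets 0 or 1.

    Returns (weights, updates, epochs, converged).
    """
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)   # last weight plays the role of the bias
    updates = 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in samples:
            x = list(inputs) + [1.0]   # bias trick: constant extra input
            actual = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            error = target - actual
            if error != 0:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                updates += 1
                errors += 1
        if errors == 0:                # a full pass with no mistakes
            return weights, updates, epoch, True
    return weights, updates, max_epochs, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(train_perceptron(AND))
print(train_perceptron(XOR))
```

AND is linearly separable, so the loop stops after an error-free pass. XOR is not, so a single perceptron never completes an error-free pass and the loop runs until the epoch cap.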
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to. A point is misclassified when the class the perceptron predicts differs from that label.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.
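A minimal sketch of such a split, using only the standard library. The function name `split_train_test` and the fixed seed are assumptions for the example; shuffling first avoids putting one class entirely in the test portion when the data is ordered.

```python
import random

def split_train_test(points, test_fraction=0.2, seed=0):
    """Shuffle the labelled points and split them into train and test lists."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 100 made-up labelled points: ((x, y), class)
data = [((i, i % 7), i % 2) for i in range(100)]
train, test = split_train_test(data)
print(len(train), len(test))
```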

Resources