I just started a machine learning class and we went over perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights whenever a point is misclassified, until no points are misclassified anymore. What I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell, I am pretty lost; any help would be much appreciated.
What do I use the test data for and how does that relate to the training data?
Think of a perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red or yellow) and 5 oranges (of different shapes) while telling it what it sees each time ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you actually telling it to. This is what a perceptron does. After you have shown it all the examples, you start again from the beginning; each such pass over the training data is called an epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow or red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you have separate test data that you never showed it during training. If the child performs horribly on the test data but gets 100% on the training data, you know that it has learned nothing general: it is simply repeating what it was told during training. You trained it too long, and it only memorized your examples without understanding what makes an apple an apple, because you gave it too many details. This is called overfitting. To prevent your perceptron from recognizing only (!) the training data, you have to stop training at a reasonable time and find a good balance between the sizes of the training and test sets.
How do I know if a point is misclassified?
If its output is different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading about single-layer and multilayer perceptrons and how neural networks of multiple perceptrons work). The network takes your input. How it's encoded is irrelevant for this; let's say the input is a string "apple". Your training set is then {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, you can compare it with the network's output, which will be either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1. A nonzero result like this means that the network gave a wrong output. Compare this to the delta rule and you will see that this small expression is part of the larger update equation. Whenever the result is nonzero, you perform a weight update. If the target and actual values are the same, you always get 0, and you know that the network didn't misclassify.
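To make the (targetValue - actualValue) check concrete, here is a minimal sketch in Python. The feature values, the starting weights, the learning rate of 1.0 and the function names are all made up for illustration; the 0.5 threshold is just one common choice.

```python
def classify(weights, point, threshold=0.5):
    """Return class 1 if the weighted sum exceeds the threshold, else class 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, point))
    return 1 if weighted_sum > threshold else 0


def delta_update(weights, point, target, learning_rate=1.0):
    """Apply one update step; error is 0 when the point is classified correctly."""
    error = target - classify(weights, point)
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, point)]
    return new_weights, error


weights = [0.0, 0.0]

# Hypothetical "orange" (target class 1): the untrained network outputs 0,
# so error = 1 - 0 = 1 and the weights are changed.
weights, orange_error = delta_update(weights, (3.0, 1.0), target=1)

# Hypothetical "apple" (target class 0): output is already 0,
# so error = 0 and the weights stay as they are.
weights, apple_error = delta_update(weights, (0.1, 0.1), target=0)

print(orange_error, apple_error, weights)
```

Any nonzero error means "misclassified", and counting those gives you the number of weight updates your homework asks for.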
How do I go about choosing test points, training points, threshold or a bias?
In practice, the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick": you add the bias as an additional input unit with a constant value of 1. The actual bias value is then encoded in this additional unit's weight, and the training algorithm learns it for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you learn it by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. (keep in mind that XOR is not linearly separable, so a single perceptron cannot learn it). There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel that the network works with it as well as you want it to.
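Putting the bias trick, the 0.5 threshold and the AND function together, a training loop could look roughly like the sketch below. The function name, the zero initial weights, the learning rate of 0.25 and the epoch limit are my own choices for illustration, not something prescribed by the algorithm.

```python
def train_perceptron(samples, learning_rate=0.25, max_epochs=100, threshold=0.5):
    """Train on (inputs, target) pairs; return (weights, updates, epochs)."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)  # the last weight belongs to the bias input
    updates = 0
    for epoch in range(1, max_epochs + 1):
        misclassified = 0
        for inputs, target in samples:
            x = list(inputs) + [1.0]  # bias trick: extra input fixed at 1
            output = 1 if sum(w * xi for w, xi in zip(weights, x)) > threshold else 0
            error = target - output
            if error != 0:  # misclassified point, so adjust the weights
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                updates += 1
                misclassified += 1
        if misclassified == 0:  # a full pass without errors: training is done
            return weights, updates, epoch
    return weights, updates, max_epochs


def predict(weights, inputs, threshold=0.5):
    x = list(inputs) + [1.0]
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > threshold else 0


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, updates, epochs = train_perceptron(and_samples)
print("weights:", weights, "updates:", updates, "epochs:", epochs)
```

The returned `updates` and `epochs` are the "number of weight updates" and "number of iterations over the training set" from your assignment, and the learned weights give you the decision boundary.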
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to. A point is misclassified when the class your perceptron predicts for it differs from that label.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.
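A rough sketch of such a split in plain Python, assuming your labelled points are in a list. The helper name, the fixed seed and the made-up data set are only for illustration.

```python
import random


def split_train_test(points, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then cut off the test fraction."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 15 made-up labelled 2D points: ((x, y), label)
points = [((x, y), int(x + y > 10)) for x in range(0, 10, 2) for y in range(0, 15, 5)]

train, test = split_train_test(points)
print(len(train), "training points,", len(test), "test points")
```

Shuffling before cutting matters: if the data happens to be sorted by class, an unshuffled split could put only one class in the test set.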
I'm trying to optimise the number of hidden units in my MLP.
I'm using k-fold cross validation, with 10 folds - 16200 training points and 1800 validation points in each fold.
When I run the network with hidden units varying from 1:10, I find the minimum error always occurs at 2 (NMSE of about 7).
3 is slightly higher (NMSE of about 11), and with 4 or more hidden units the error remains constant at about 14 or 15 regardless of how many I add.
Why is this?
I find it hard to believe that overfitting is occurring, because of the very large amount of data points being used (with all 10 folds, that's 162000 training points, albeit each repeated 9 times).
Many thanks for any help or advice!
If the inputs are voltage and current, and the question is about the power generated, then it's just P=V*I. Even if you have some noise, the relationship will still be linear. In this case a simple linear model would do just fine, and would be far easier to interpret! That's why the simple ANN works best and the more complex ones overfit: they look for non-linear relationships (which are not there, but the network does whatever minimises the cost function).
To summarise, I would recommend trying a simple linear model. Also, since you have a lot of data points, make a 50-25-25 split into training, test and validation sets. Look at your cost function and see how it changes with the error rate.
I am trying to use one-class SVM with Python scikit-learn.
But I do not understand the different variables X_outliers, n_error_train, n_error_test, n_error_outliers, etc. which are at this address. Why is X randomly selected instead of being part of a dataset?
Scikit-learn "documentation" did not help me a lot. Also, I found very few examples on the Internet.
Can I use one-class SVM for outlier detection when I have a huge amount of data and I do not know whether there are anomalies in my training set?
One-class SVM is an Unsupervised Outlier Detection (here)
One-class SVM is not an outlier-detection method, but a novelty-detection method (here)
Is this possible?
Ok, so this is not really a Python question, more of an SVM comprehension question, but eh. A typical SVM is two-class, and is an algorithm which has two phases:
First, it will learn relationships between variables and attributes. For example, you show your algorithm tomato pictures and banana pictures, telling him each time if it's a banana or a tomato, and you tell him to count the number of red pixels in each picture. If you do it correctly, the SVM will be trained, meaning he will know that pictures with lots of red pixels are more likely to be tomatoes than bananas.
Then comes the predicting phase. You show him a picture of a tomato or a banana without telling him which it is. And since he has been trained before, he will count the red pixels, and know which it is.
In your case of a one-class SVM, it's a bit simpler: basically, the training phase consists of showing him a bunch of variables which are all supposed to be similar. You show him a bunch of tomato pictures, telling him "these are tomatoes, everything too different from these is not a tomato".
The code you link to is code to test the SVM's capability of learning. You start by creating the X_train variables. Then you generate two other sets: X_test, which is similar to X_train (tomato pictures), and X_outliers, which is very different (banana pictures).
Then you show him the X_train variables and tell your SVM "this is the kind of variables we're looking for" with the line clf.fit(X_train). In my example, this is equivalent to showing him lots of tomato images, and the SVM learning what a "tomato" is.
And then you test your SVM's capability to sort new variables, by showing him your two other sets (X_test and X_outliers), and asking him whether he thinks they are similar to X_train or not. You ask him that with the predict function, and predict will yield for every element in the sets either "1", i.e. "yes, this is a similar element to X_train", or "-1", i.e. "this element is very different".
In an ideal case, the SVM should yield only "1" for X_test and only "-1" for X_outliers. But this code is there to show you that this is not always the case. The n_error_ variables count the mistakes that the SVM makes: misclassifying X_test elements as "not similar to X_train" and X_outliers elements as "similar to X_train". You can see that there are even errors when the SVM is asked to predict on the very set that it has been trained on! (n_error_train)
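To show what the n_error_ counters boil down to, here is a small sketch in plain Python. The prediction lists are invented stand-ins for what predict would return on each set, and count_errors is a helper name I made up; the linked script may compute the counts differently.

```python
def count_errors(predictions, expected_label):
    """Count predictions that differ from the label the whole set should get."""
    return sum(1 for p in predictions if p != expected_label)


# Made-up outputs, standing in for clf.predict(X_train), clf.predict(X_test)
# and clf.predict(X_outliers): 1 = "similar to X_train", -1 = "very different".
y_pred_train = [1, 1, -1, 1, 1, 1, 1, -1, 1, 1]
y_pred_test = [1, 1, 1, -1, 1]
y_pred_outliers = [-1, -1, 1, -1, -1]

n_error_train = count_errors(y_pred_train, 1)         # "-1" on training data is a mistake
n_error_test = count_errors(y_pred_test, 1)           # "-1" on similar test data is a mistake
n_error_outliers = count_errors(y_pred_outliers, -1)  # "1" on an outlier is a mistake

print(n_error_train, n_error_test, n_error_outliers)
```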
Why are there such errors? Welcome to machine learning. The main difficulty of SVMs is setting up the training set such that it enables the SVM to learn efficiently to distinguish between classes. So you need to carefully set the number of images you show him (and what he has to look out for in the images; in my example it was the number of red pixels, in the code it is the value of the variable, but that is a different question).
In the code, the bounded but random initialization of the X sets means that, for example, during one run you could train the SVM on an X_train set with lots of values between -0.3 and 0 even though they are randomly initialized between -0.3 and 0.3 (especially if you have few elements per set, say 5, and you get [-0.2 -0.1 0 -0.1 0.1]). So when you show the SVM an element with a value of 0.2, he will have trouble associating it with X_train, because he will have learned that X_train elements are more likely to have negative values.
This is equivalent to showing your SVM a few yellow-ish tomatoes when you train him, so when you show him a really red tomato afterwards, he will have trouble classifying it as a tomato.
This one-class SVM is a classifier to determine whether entries are similar or dissimilar to entries that the classifier has been trained with.
The script generates three sets:
A training set.
A test-set of entries that are similar to the training set.
A test-set of entries that are dissimilar to the training set.
The error is the number of entries from each of the sets that have been classified wrongly. That is, entries that have been classified as dissimilar to the training set when they were similar (for sets 1 and 2), or that have been classified as similar to the training set when they were dissimilar (set 3).
X_outliers: This is set 3.
n_error_train: The number of classification errors for the elements in the train-set (1).
n_error_test: The number of classification errors for the elements in the test-set (2).
n_error_outliers: The number of classification errors for the elements in the outlier-set (3).
This answer should be complementary to scikit-description but I agree that is a bit technical. I will elaborate some aspects of the One Class SVM algorithm (OCSVM) here. OCSVM is designed to solve the unsupervised anomaly detection problem.
Given unstructured (unlabelled) data, it will find, in an n-dimensional space, a matrix W^T with d columns (T stands for transpose).
The decision function of all SVM-based methods (including OCSVM) is:
$$f(x) = \operatorname{sign}(w^T x + b)$$
where the sign of w^T x, shifted by a bias term b, gives the class (-1 anomalous, 1 nominal).
In the classification problem, the matrix W is associated with the distance (margin) between the 2 classes, but this differs in OCSVM: since there is only 1 class, it maximizes the margin from the origin (the original OCSVM paper demonstrates this).
As you can see, it is a generic algorithm, because SVMs are a family of models that can approximate any nonlinear boundary, much as neural networks can. To achieve something complicated you have to construct your own kernel matrix.
To do this you need to find some convenient mathematical property (suggestions to improve the answer are welcome at this point).
But in most cases the Gaussian kernel is a good choice: it has some quite nice mathematical properties and associated ML theorems, such as the law of large numbers.
The scikit implementation provides a wrapper to LIBSVM implementation for SVM and has 4 such kernels.
-nu parameter is a problem formulation parameter: it lets you tell the model how dirty your sample is.
More formally, it makes the problem an outlier detection problem, where you know your data is mixed (nominal and anomalous), as opposed to the case of pure data, where the problem is different and is called novelty detection.
kernel parameter: One of the most important decisions. Mathematically, the kernel is a big matrix of numbers; multiplying by it projects the data into a higher-dimensional space. A nice read demonstrating the issue is here, while the paper of Schölkopf, who created OCSVM, goes into more detail.
gamma
In the case of the RBF kernel you essentially use a Gaussian projection.
Disclaimer, my interpretation: essentially, with the gamma parameter you describe how big the variance of the normal distribution $N(\mu, \sigma)$ is.
-tolerance
The one-class SVM searches for the margin that best separates the training data from the origin. The tolerance refers to the stopping criterion, i.e. how small the tolerance for satisfying the quadratic optimization of the objective function should be. The objective function is the thing that tells the SVM what the parameters should look like to describe a specific margin (the space between nominal and anomalous).
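Coming back to gamma for a moment: the Gaussian (RBF) kernel is usually written as k(x, y) = exp(-gamma * ||x - y||^2), which corresponds to gamma = 1 / (2 * sigma^2). So a larger gamma means a narrower Gaussian, i.e. a smaller variance. A minimal sketch with helper names of my own (this is not the scikit-learn code):

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel value for two points given as sequences of floats."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gamma_from_sigma(sigma):
    """Convert a Gaussian standard deviation into the equivalent gamma."""
    return 1.0 / (2.0 * sigma ** 2)


near, far = (0.0, 0.0), (1.0, 0.0)
wide = rbf_kernel(near, far, gamma_from_sigma(1.0))    # sigma = 1.0, gamma = 0.5
narrow = rbf_kernel(near, far, gamma_from_sigma(0.5))  # sigma = 0.5, gamma = 2.0

# With the narrower Gaussian, the same two points look less similar.
print(wide, narrow)
```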
Many sklearn examples are based on randomly generated data. If you want to see an example of how OneClassSVM works on a real dataset for outlier detection, you can go through my post: https://justanoderbit.com/outlier-detection/one-class-svm/
Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to run to justify my results? I have already read about some methods using a statistical approach, and my colleagues have suggested some nice mathematical operations, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use to decide that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
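The confidence interval calculation can be sketched in a few lines of Python. The run results below are made up, and runs_needed is just the same formula rearranged for N; it assumes you already have a rough estimate of the standard deviation from some pilot runs.

```python
import math
import statistics


def confidence_interval_95(results):
    """Mean plus/minus 1.96 * standard deviation / sqrt(N)."""
    mean = statistics.mean(results)
    half_width = 1.96 * statistics.stdev(results) / math.sqrt(len(results))
    return mean - half_width, mean + half_width


def runs_needed(std_dev, half_width):
    """Smallest N for which 1.96 * std_dev / sqrt(N) <= half_width."""
    return math.ceil((1.96 * std_dev / half_width) ** 2)


# Made-up outputs of 10 runs with the same parameter value
results = [124.8, 125.2, 125.0, 124.9, 125.1, 125.0, 124.9, 125.1, 125.0, 125.0]
low, high = confidence_interval_95(results)
print(f"95% CI: {low:.3f} to {high:.3f}")

# How many runs for a half-width of 1 if the standard deviation is about 10?
print(runs_needed(std_dev=10, half_width=1))
```

Note that halving the interval width takes four times as many runs, because of the square root.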
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the commands in R to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to see whether the outcome stays the same after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the space of parameters with regard to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the results respectively. It is explained in detail in this paper: Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers which may be relevant to your question and your research.
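For point 1), computing Cv = s / u for a growing number of runs is straightforward. This is only a sketch with invented results; how stable Cv has to be before you stop is something the paper (or your own judgement) has to decide.

```python
import statistics


def coefficient_of_variation(results):
    """Cv = s / u: sample standard deviation divided by the mean."""
    return statistics.stdev(results) / statistics.mean(results)


# Made-up outputs of 12 runs with the same parameter setting
results = [102.0, 97.5, 101.2, 99.8, 100.4, 98.9, 100.9, 99.1, 100.2, 99.6, 100.7, 99.7]

# Watch how Cv settles down as more runs are included.
for n in range(2, len(results) + 1):
    print(n, round(coefficient_of_variation(results[:n]), 4))
```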
Before I ask my question, here's a brief summary of my project:
I'm using OpenCV's built-in function to detect a face in a cam feed.
After that I'm processing the image which contains the face, i.e. converting it to grayscale, resizing it to 40x40 pixels and equalizing its histogram.
The pixel values of the image are then read, normalized (i.e. divided by 256, since FANN works with values between 0 and 1, or -1 and 1, depending on the function used) and saved into an array of 1600 elements. This is the data the neural network works with.
Depending on the data, the ANN decides whether the face is mine, unknown, or not a face at all (false positive).
The neural network then returns an array of 3 elements, and the program decides which group the face belongs to by finding the maximum.
The thing is, apart from the pretty good detection of false positives, my code gives rather inaccurate results.
Some details on my ANN. I'm using the doublefann.h. The network contains 1600 input neurons (obviously), 3 output neurons (even more so), while the single hidden layer contains 1600 neurons, although I did try out other values in the 800-2400 range. I am using 20 samples of my face, 30 samples of unknown faces and 30 samples of random backgrounds for training. I tried both the RPROP (default in FANN, seems to overfit most of the time) and QUICKPROP (gives nice, smoothly decreasing error while training, but the results are inaccurate) training algorithms, as well as the SIGMOID_SYMMETRIC (training is done very fast, but often ends in abrupt fall of MSE to a near-zero value, and the resulting network is over-fit) activation function. I used both MSE and bit fail value as the criteria for stopping. Sadly, any given combination of those resulted in a rather poor face recognition.
So my questions would be, given the nature of my project and data which it handles, what would be the optimal:
Number of hidden neurons
Activation function
Training algorithm
Number of training samples per group (I was told that 20-30 should suffice)
Stopping criteria
Of course, any advice is welcome, it does not need to answer all or any of those five questions. I know that my problem is a pretty complex one, but I spent a good chunk of time running in circles and got tired of reading literature on the subject, and some first-hand knowledge would be highly appreciated.
I'm working on character recognition (and later fingerprint recognition) using neural networks. I'm getting confused with the sequence of events. I'm training the net with 26 letters. Later I will increase this to include 26 clean letters and 26 noisy letters. If I want to recognize one letter say "A", what is the right way to do this? Here is what I'm doing now.
1) Train network with a 26x100 matrix; each row contains a letter from segmentation of the bmp (10x10).
2) However, for the test targets I use my input matrix for "A". I have 25 rows of zeros after the first row so that my input matrix is the same size as my target matrix.
3) I run perform(net, testTargets,outputs) where outputs are the outputs from the net trained with the 26x100 matrix. testTargets is the matrix for "A".
This doesn't seem right though. Is training supposed to be separate from recognizing any character? What I want to happen is as follows.
1) Train the network on an image file that I select (after processing the image into logical arrays).
2) Use this trained network to recognize letters in a different image file.
So train the network to recognize A through Z. Then pick an image, run the network to see what letters are recognized from the picked image.
Okay, so it seems that the question here is more along the lines of "How do I neural networks". I can outline the basic procedure here to try to solidify the idea in your mind, but as far as actually implementing it goes, you're on your own. Personally I believe that proprietary languages (MATLAB) are an abomination, but I always appreciate intellectual zeal.
The basic concept of a neural net is that you have a series of nodes in layers with weights that connect them (depending on what you want to do, you can either just connect each node to the layers above and beneath, or connect every node, or anything in between). Each node has a "work function", a probabilistic function that represents the chance that the given node, or neuron, will evaluate to "on" or 1.
The general workflow starts from whatever top layer neurons/nodes you've got, initializing them to the values of your data (in your case, you would probably start each of these off as the pixel values in your image; normalizing them to be binary would be simplest). Each of those nodes would then be multiplied by a weight and fed down towards your second layer, which would be considered a "hidden layer". The sum of these weighted inputs (either a geometric or an arithmetic sum, depending on your implementation) would be used with the work function to determine the state of your hidden layer.
That last point was a little theoretical and hard to follow, so here's an example. Imagine your first row has three nodes ([1,0,1]), and the weights connecting the three of those nodes to the first node in your second layer are something like ([0.5, 2.0, 0.6]). If you're doing an arithmetic sum that means that the weighting on the first node in your "hidden layer" would be
1*0.5 + 0*2.0 + 1*0.6 = 1.1
If you're using a logistic function as your work function (a very common choice, though tanh is also common) this would make the chance of that node evaluating to 1 approximately 75%.
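Here is that same example as a few lines of Python, in case it helps to see it run. The variable names are mine, and the logistic function is the standard 1 / (1 + e^-z).

```python
import math


def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


inputs = [1, 0, 1]
weights = [0.5, 2.0, 0.6]

# Arithmetic sum of each input times its weight: 1*0.5 + 0*2.0 + 1*0.6
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

# Chance that the hidden node evaluates to 1
activation = logistic(weighted_sum)

print(weighted_sum, round(activation, 3))
```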
You would probably want your final layer to have 26 nodes, one for each letter, but you could add in more hidden layers to improve your model. You would assume that the letter your model predicted would be the final node with the largest weighting heading in.
After you have that up and running you want to train it though, because you probably just randomly seeded your weights, which makes sense. There are a lot of different methods for this, but I'll generally outline back-propagation, which is a very common method of training neural nets. The idea is essentially this: since you know which character the image should have been recognized as, you compare that to the one that your model actually predicted. If your model accurately predicted the character you're fine; you can leave the model as is, since it worked. If it predicted an incorrect character, you want to go back through your neural net and increment the weights that lead from the pixel nodes you fed in to the ending node for the character that should have been predicted. You should also decrement the weights that led to the character it incorrectly returned.
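The increment/decrement idea can be sketched for the simplest case, a single layer of weights going straight from the pixels to the output nodes. Note that this is a simplification of what I described: real back-propagation pushes error gradients back through the hidden layers. The names, the tiny 4-pixel "image" and the learning rate are made up.

```python
def predict(weights, pixels):
    """Return the index of the output node with the largest weighted sum."""
    scores = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def train_step(weights, pixels, target, learning_rate=0.1):
    """Adjust the weights in place if the prediction was wrong; return the prediction."""
    predicted = predict(weights, pixels)
    if predicted != target:
        for i, p in enumerate(pixels):
            weights[target][i] += learning_rate * p     # towards the correct character
            weights[predicted][i] -= learning_rate * p  # away from the wrong one
    return predicted


# Two output nodes (say "A" = 0 and "B" = 1) and a 4-pixel binary image
weights = [[0.0] * 4 for _ in range(2)]
pixels = [1, 0, 1, 0]

first_guess = train_step(weights, pixels, target=1)  # wrong, so the weights change
second_guess = predict(weights, pixels)              # now picks node 1
print(first_guess, second_guess)
```

For your 26 letters you would have 26 rows of 100 weights each, and you would loop over all training images repeatedly until the predictions stop changing.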
Hope that helps, let me know if you have any more questions.