I'm working on character recognition (and later fingerprint recognition) using neural networks. I'm getting confused with the sequence of events. I'm training the net with 26 letters. Later I will increase this to include 26 clean letters and 26 noisy letters. If I want to recognize one letter say "A", what is the right way to do this? Here is what I'm doing now.
1) Train network with a 26x100 matrix; each row contains a letter from segmentation of the bmp (10x10).
2) However, for the test targets I use my input matrix for "A". I had 25 rows of zeros after the first row so that my input matrix is the same size as my target matrix.
3) I run perform(net, testTargets,outputs) where outputs are the outputs from the net trained with the 26x100 matrix. testTargets is the matrix for "A".
This doesn't seem right though. Is training supposed to be separate from recognizing any character? What I want to happen is as follows.
1) Training the network for an image file that I select (after processing the image into logical arrays).
2) Use this trained network to recognize letters in a different image file.
So train the network to recognize A through Z. Then pick an image, run the network to see what letters are recognized from the picked image.
Okay, so the question here seems to be more along the lines of "How do I neural networks?" I can outline the basic procedure here to try to solidify the idea in your mind, but as far as actually implementing it goes, you're on your own. Personally I believe that proprietary languages (MATLAB) are an abomination, but I always appreciate intellectual zeal.
The basic concept of a neural net is that you have a series of nodes in layers with weights that connect them (depending on what you want to do, you can connect each node just to the layers above and beneath, connect every node to every other, or anything in between). Each node has a "work function" (often called an activation function), a probabilistic function that represents the chance that the given node, or neuron, will evaluate to "on", or 1.
The general workflow starts from your top (input) layer of nodes, which you initialize to the values of your data (in your case, probably the pixel values of your image; normalizing them to binary would be simplest). Each of those values is then multiplied by a weight and fed down to your second layer, which would be considered a "hidden layer". The sum of the weighted inputs (either geometric or arithmetic, depending on your implementation) is passed to the work function to determine the state of each hidden node.
That last point was a little theoretical and hard to follow, so here's an example. Imagine your first layer has three nodes ([1,0,1]), and the weights connecting those three nodes to the first node in your second layer are something like ([0.5, 2.0, 0.6]). If you're doing an arithmetic sum, that means the weighting on the first node in your "hidden layer" would be
1*0.5 + 0*2.0 + 1*0.6 = 1.1
If you're using a logistic function as your work function (a very common choice, though tanh is also common) this would make the chance of that node evaluating to 1 approximately 75%.
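To make the arithmetic above concrete, here is a minimal Python sketch of that single hidden node. The function and variable names are my own, and it assumes the arithmetic sum and the logistic work function described above.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the example above.
inputs = [1, 0, 1]
weights = [0.5, 2.0, 0.6]

# Arithmetic (weighted) sum feeding the first hidden node.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

# Chance that the node evaluates to 1.
activation = logistic(weighted_sum)

print(weighted_sum, round(activation, 4))
```

The printed activation is roughly 0.75, which is where the "approximately 75%" comes from.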
You would probably want your final layer to have 26 nodes, one for each letter, but you could add in more hidden layers to improve your model. You would assume that the letter your model predicted would be the final node with the largest weighting heading in.
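Reading off the prediction from a 26-node final layer could look something like the sketch below. `predict_letter` and the sample scores are hypothetical; it assumes the outputs are ordered A through Z.

```python
import string

def predict_letter(output_values):
    """Return the letter whose output node has the largest value.

    Assumes output_values has 26 entries ordered A..Z.
    """
    if len(output_values) != 26:
        raise ValueError("expected one output value per letter (26)")
    best_index = max(range(26), key=lambda i: output_values[i])
    return string.ascii_uppercase[best_index]

# Made-up output values in which the third node (C) is strongest.
scores = [0.01] * 26
scores[2] = 0.9

print(predict_letter(scores))
```

Recognizing a new image is then just a forward pass through the already-trained net followed by this lookup; no target matrix is needed at that stage.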
After you have that up and running you want to train it, because you probably just randomly seeded your weights, which makes sense. There are a lot of different methods for this, but I'll outline back-propagation in general terms, since it is a very common method of training neural nets. The idea is essentially this: since you know which character the image should have been recognized as, you compare that to the one your model actually predicted. If your model accurately predicted the character, you're fine; you can leave the model as is, since it worked. If it predicted an incorrect character, you want to go back through your neural net and increment the weights that lead from the pixel nodes you fed in to the ending node for the character that should have been predicted. You should also decrement the weights that led to the character it incorrectly returned.
Hope that helps, let me know if you have any more questions.
I am using word2vec (and doc2vec) to get embeddings for sentences, but I want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'date')) before it eventually gets to the correct ('apple', ('banana','carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than 'zucchini',
since there are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled in inverse proportion to the distance from the target word to the context word. This also causes an issue, making nearby words seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
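As a rough illustration of that argument (a simulation only, not gensim's actual code), the sketch below draws an effective window uniformly from 1 to `window`, as described above, and counts how often it still spans a whole sentence. The function name and the 26-word sentence length are assumptions for the example.

```python
import random

def full_coverage_rate(window, sentence_len, trials=100_000, seed=0):
    """Estimate how often a randomly shrunk window still spans a whole sentence.

    Assumes the effective window is drawn uniformly from 1..window.
    """
    rng = random.Random(seed)
    farthest = sentence_len - 1  # largest possible distance between two words
    hits = sum(1 for _ in range(trials) if rng.randint(1, window) >= farthest)
    return hits / trials

# A typical small window never reaches from the first word to the last.
small = full_coverage_rate(window=5, sentence_len=26)

# A window a million times the sentence length almost always does.
huge = full_coverage_rate(window=26_000_000, sentence_len=26)

print(small, huge)
```

With the small window the rate is 0; with the giant one it is indistinguishable from 1 at this sample size.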
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs, but if for some reason it weren't, another way to essentially cancel out any nearness effects would be to ensure that the word order within each of your corpus's items is re-shuffled every time the item is accessed as training data. That ensures any nearness advantages will be mixed evenly across all words, especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even just a 1-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
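One way the re-shuffling idea could be sketched is a small re-iterable wrapper around the corpus. `ShuffledSentences` is a hypothetical helper of my own, not part of gensim; the assumption is that an object like this would be passed wherever the training code expects a restartable iterable of token lists.

```python
import random

class ShuffledSentences:
    """Re-iterable corpus: every pass yields each sentence in a fresh random word order."""

    def __init__(self, sentences, seed=None):
        self.sentences = sentences
        self.rng = random.Random(seed)

    def __iter__(self):
        for sentence in self.sentences:
            shuffled = list(sentence)  # copy, so the original order is untouched
            self.rng.shuffle(shuffled)
            yield shuffled

corpus = [
    ["apple", "banana", "carrot", "dates", "elderberry", "zucchini"],
    ["aluminium", "brass", "copper", "zinc"],
]

shuffled_corpus = ShuffledSentences(corpus, seed=1)
first_pass = list(shuffled_corpus)
second_pass = list(shuffled_corpus)
print(first_pass[0])
print(second_pass[0])
```

Because the shuffle happens inside `__iter__`, each training epoch that re-iterates the corpus sees a different word order.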
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)
I am training multiple word2vec models with Gensim. Each of the word2vec models will have the same parameters and dimensions, but will be trained on slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Each word's similarities to other words remain about the same, but the whole vector space seems to be rotated.
Is there any way I can rotate both of the word2vec representations in such a way that the same words occupy the same positions in vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
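The first of these techniques (subcorpus-specific word versions) might be prepared along the lines of this sketch. `tag_corpus`, the tiny corpora, and the choice of words are made up for illustration; the "word(A)" naming follows the example above.

```python
import random

def tag_corpus(sentences, label, words_of_interest):
    """Rename only the words of interest to a subcorpus-specific form, e.g. 'tamale(A)'."""
    return [
        [f"{word}({label})" if word in words_of_interest else word for word in sentence]
        for sentence in sentences
    ]

# Hypothetical subcorpora; "hot" and "cold" stay shared as anchor words.
corpus_a = [["hot", "tamale", "cold"], ["skiing", "cold"]]
corpus_b = [["tamale", "hot"], ["cold", "skiing", "hot"]]
targets = {"tamale", "skiing"}

tagged_a = tag_corpus(corpus_a, "A", targets)
tagged_b = tag_corpus(corpus_b, "B", targets)

# Shuffle everything together before training a single model on it.
combined = tagged_a + tagged_b
random.Random(0).shuffle(combined)
print(combined)
```

After training one model on `combined`, the vectors for "tamale(A)" and "tamale(B)" live in the same space and can be compared directly.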
I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple", "this is an orange"). Assuming the child has perfect memory, it will learn what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shapes) without you actually telling it to. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you have separate test data that you never showed it during training. If the child performs horribly on the test data but scores 100% on the training data, you know that it has learned nothing; it is simply repeating what it was told during training. You trained it too long, and it only memorized your examples without understanding what makes an apple an apple, because you gave it too many details. This is called overfitting. To prevent your Perceptron from recognizing only (!) the training data, you'll have to stop training at a reasonable time and find a good balance between the sizes of the training and testing sets.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into Single/MultiLayer Perceptrons and how Neural Networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant for this; let's say the input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1).....}. You know the class of each training example beforehand, and the network will output either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue-actualValue) = (0-1) = -1. A nonzero result means that the network gave a wrong output. Compare this to the delta rule and you will see that this small equation is part of the larger update equation. Whenever the result is nonzero, you perform a weight update. If the target and actual values are the same, you always get 0 and you know that the network didn't misclassify.
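The check described here can be sketched in a few lines of Python. The function names and the learning rate are my own assumptions; the update shown is the usual perceptron/delta-rule form, in which a zero error leaves the weights untouched.

```python
def error_signal(target, actual):
    """Difference between the known label and the network's output."""
    return target - actual

def is_misclassified(target, actual):
    """A point is misclassified exactly when the error signal is nonzero."""
    return error_signal(target, actual) != 0

def updated_weights(weights, inputs, target, actual, learning_rate=0.1):
    """Perceptron-style update: each weight moves by learning_rate * error * input."""
    err = error_signal(target, actual)
    return [w + learning_rate * err * x for w, x in zip(weights, inputs)]

# "apple1" has label 0, but suppose the network answered 1.
print(error_signal(0, 1), is_misclassified(0, 1))

# A correct answer gives an error of 0 and therefore no weight change.
print(updated_weights([0.5, -0.5], [1.0, 2.0], target=1, actual=1))
```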
How do I go about choosing test points, training points, threshold or a bias?
In practice, the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick": treat the bias as an additional input unit with a constant value of 1. The actual bias value is then encoded in this additional unit's weight, and the training algorithm learns it for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
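Putting the bias trick and the 0.5 threshold together, a simple perceptron might look like the sketch below. It is an illustration under my own assumptions (zero initial weights, learning rate 0.1, the logical AND function as data), not the one required form of the algorithm.

```python
def classify(weights, features, threshold=0.5):
    """Output 1 if the weighted sum reaches the threshold, else 0.

    weights[0] is the bias weight; a constant input of 1 is prepended for it.
    """
    augmented = [1.0] + list(features)
    total = sum(w * x for w, x in zip(weights, augmented))
    return 1 if total >= threshold else 0

def train(samples, epochs=20, learning_rate=0.1):
    """Perceptron training; returns the weights, update count and epochs used."""
    weights = [0.0] * (len(samples[0][0]) + 1)  # initial choice: all zeros
    updates = 0
    for epoch in range(1, epochs + 1):
        mistakes = 0
        for features, target in samples:
            err = target - classify(weights, features)
            if err != 0:  # misclassified point: adjust every weight
                augmented = [1.0] + list(features)
                weights = [w + learning_rate * err * x
                           for w, x in zip(weights, augmented)]
                updates += 1
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: done
            break
    return weights, updates, epoch

# Logical AND is linearly separable, so training should converge.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, updates, epochs_used = train(and_samples)
predictions = [classify(weights, f) for f, _ in and_samples]
print(weights, updates, epochs_used, predictions)
```

The returned update count and epoch count are the kind of numbers the assignment asks you to report.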
Now to your last question about choosing training and test points: this is quite difficult, and you learn it by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). Note that a single perceptron cannot learn XOR, because XOR is not linearly separable. For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel the network works with them as well as you want it to.
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to. A point is misclassified when the class your perceptron predicts for it differs from that label.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.
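A split like that could be sketched as follows. `split_train_test` and the 15 made-up labelled points are assumptions for illustration; shuffling first keeps either class from ending up entirely in one part.

```python
import random

def split_train_test(points, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold out test_fraction of it for testing."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Fifteen labelled 2-D points: (x, y, class), made up for the example.
data = [(x, x % 4, 1 if x > 7 else 0) for x in range(15)]

train_points, test_points = split_train_test(data)
print(len(train_points), len(test_points))
```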
I have a big problem: I want to implement my neural network with 2 output neurons. Something like this:
I want to use the backpropagation algorithm, but I don't know how to calculate the error, because my output layer has 2 neurons. With only one output neuron it's easy to run backpropagation from the single output error, but with two neurons? I'm thinking about calculating the error for every output separately, but then I must run backpropagation separately for the 2 cases and I get "two different hidden layers" (for every neuron in the hidden layer I have weights for the two cases). Maybe someone knows a better solution?
I will be very grateful for any help.
Logically thinking, the first layer of weights should give you a representation (the hidden layer) that is useful for predicting both outputs. So, this layer should be updated based on the error made in both outputs. But the next layer of weights are separate for each output node, so should get separate weight updates.
So, for the second-layer weights, the weight updates are calculated separately based on the respective outputs. For the first layer of weights, I would first calculate the error derivatives by backpropagating from each output separately and then simply combine (sum) them to get the final error derivative. Then apply the learning rate to get the weight updates.
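Here is a small sketch of that scheme for a net with one hidden layer and two outputs. The specifics are my assumptions, not something from the question: logistic hidden units, linear outputs, squared error, no biases, and made-up starting weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return hidden, outputs

def loss(w_hidden, w_out, x, targets):
    _, outputs = forward(w_hidden, w_out, x)
    return 0.5 * sum((y - t) ** 2 for y, t in zip(outputs, targets))

def train_step(w_hidden, w_out, x, targets, lr=0.1):
    hidden, outputs = forward(w_hidden, w_out, x)
    out_errors = [y - t for y, t in zip(outputs, targets)]
    # Shared hidden layer: sum the derivatives coming back from BOTH outputs.
    hidden_deltas = [
        sum(out_errors[k] * w_out[k][j] for k in range(len(w_out)))
        * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    # Output weights: each row is updated from its own output's error only.
    new_w_out = [[w - lr * out_errors[k] * hidden[j] for j, w in enumerate(row)]
                 for k, row in enumerate(w_out)]
    new_w_hidden = [[w - lr * hidden_deltas[j] * xi for w, xi in zip(row, x)]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_out

w_hidden = [[0.1, -0.2], [0.3, 0.4]]   # 2 inputs -> 2 hidden units
w_out = [[0.5, -0.5], [0.2, 0.1]]      # 2 hidden units -> 2 outputs
x, targets = [1.0, 0.5], [0.2, 0.8]

initial_loss = loss(w_hidden, w_out, x, targets)
for _ in range(200):
    w_hidden, w_out = train_step(w_hidden, w_out, x, targets)
final_loss = loss(w_hidden, w_out, x, targets)
print(initial_loss, final_loss)
```

There is still only one hidden layer and one set of hidden weights; the two outputs simply each contribute a term to its gradient.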
Watch out for the dynamic range of your outputs. For example, if one output is producing some real value of range [0,10] and another is producing values in range [-1000,1000] then your updates will be dominated by the one with larger range. You can
add a preprocessing step that would change your data set to have same dynamic range in both outputs. Also, add a postprocessing step to restore the actual range.
formulate the error functions for each output so that they produce error values of same dynamic range.
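The preprocessing/postprocessing option could be as simple as min-max scaling each output's targets to [0, 1] and mapping predictions back afterwards. The helper names below are hypothetical, and the ranges are the ones from the example above.

```python
def scale(value, low, high):
    """Map a value from [low, high] onto [0, 1] (preprocessing for targets)."""
    return (value - low) / (high - low)

def unscale(scaled, low, high):
    """Map a value from [0, 1] back onto [low, high] (postprocessing for outputs)."""
    return scaled * (high - low) + low

# Output 1 lives in [0, 10], output 2 in [-1000, 1000].
ranges = [(0.0, 10.0), (-1000.0, 1000.0)]
raw_targets = [7.5, 250.0]

scaled_targets = [scale(t, lo, hi) for t, (lo, hi) in zip(raw_targets, ranges)]
restored = [unscale(s, lo, hi) for s, (lo, hi) in zip(scaled_targets, ranges)]
print(scaled_targets, restored)
```

Both scaled targets now sit in the same range, so neither output dominates the shared-layer updates merely because of its units.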
I am trying to solve the problem of matching a human generated gesture with a known gesture. The human generated gesture will be represented by a sequence of points that will need to be interpolated into a path and compared to the existing path. The image below shows what I am trying to compare
Can you please help point me in the right direction with resources or concepts that I can read into to build an algorithm to match these two paths? I have no experience in doing this before so any insights will be appreciated.
Receiving input
Measure input on some interval. Every xx milliseconds, measure the coordinates of the user's hand/finger/stylus.
Storing patterns and input
Patterns (expected input)
Modify the pattern. It's currently a continuous "function," but measuring input as such is difficult. Use discrete points at some interval. This interval can be very short, depending on how accurate you require gestures to be. In fact, it should be very short; the more points to compare against, the better (I'll explain this a little better in the next section).
Input (received from user)
When input is measured, the input-measurement interval needs to be short enough that each received consecutive pair of input points is close enough to compare to the expected points.
Imagine that the user performs some gesture very quickly (and completes it in the time your input-reader reads only three frames). The pattern and input cannot be reliably compared:
To avoid this, your input-reader must have a relatively short interval. However, this probably isn't a huge concern, since most hardware can read even the fastest human gestures.
Back to patterns: they should always be detailed enough to include more points than any possible input. More expected points allow for better accuracy. If a user moves slowly, the input will have more points; if they move quickly, the input will have fewer.
Consider this: completing a single gesture gives you half as many input frames as the pattern includes. The user has moved at a "normal" speed, so, to simplify the algorithm, you can "dumb down" your pattern by a factor of 2, then compare input coordinates to pattern coordinates directly.
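"Dumbing down" the pattern could be done by picking evenly spaced points, as in this sketch. `downsample` is a name I made up, and it assumes the pattern is stored as an ordered list of (x, y) tuples.

```python
def downsample(pattern, target_count):
    """Keep target_count points, evenly spaced by index, including first and last."""
    if target_count >= len(pattern):
        return list(pattern)
    if target_count < 2:
        raise ValueError("need at least the first and last point")
    step = (len(pattern) - 1) / (target_count - 1)
    return [pattern[round(i * step)] for i in range(target_count)]

# A 9-point pattern reduced to 5 points (every second point).
pattern = [(i, i * i) for i in range(9)]
reduced = downsample(pattern, 5)
print(reduced)
```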
This method is easier than the alternative that comes to mind (see next section).
Pattern "density" (coordinate frequency)
If you have a small number of expected points, you'll have to make approximations to match input.
Here's an "extreme" example, but it proves the concept. Given this pattern and input:
Point 3r can't be reliably compared with point 2 or point 3, so you'd have to use some function of points 2, 3, and 3r, to determine if 3r is on the correct path. Now consider the same input, but where the pattern has higher density:
Now, you don't have to compromise, since 3r is essentially definitely on the gesture's pattern. A slight reduction in the pattern's density causes it to match input quite well.
Positioning
Relative positioning
Instead of comparing absolute positions (such as on a touchscreen), you probably want the gesture to be allowed anywhere in some plane of space. To that end, you must relate the start point of the input to some coordinate system.
Normalization
To be user-friendly, allow gestures to be done in a range of "sizes". You don't want to compare raw data, because chances are the size of the plane of the input doesn't match the size of the plane of the pattern.
Normalize the input in the x- and y-direction to match the size of your pattern. Do not maintain aspect ratio.
Relate the input to a coordinate system, as per previous bullet
Find the largest horizontal and vertical distance between any two input points (call them RecMaxH and RecMaxV)
Find the largest horizontal and vertical distance between any two pattern points (call them ExpMaxH and ExpMaxV)
Multiply all input points' x-coordinates by ExpMaxH/RecMaxH
Multiply all input points' y-coordinates by ExpMaxV/RecMaxV
You now have two more-similar sets of points that can be compared. Normalization can be much more detailed than this; for instance, you could normalize sets of 3 points at a time to get incredibly similar images (but you would probably have to do this for each pattern, then compare the sum of all differences to find the most likely matching pattern).
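The basic normalization steps listed above might be sketched like this. The names mirror RecMaxH/RecMaxV and ExpMaxH/ExpMaxV; taking the first input point as the origin is my own assumption for the "relate to a coordinate system" step.

```python
def extent(points):
    """Largest horizontal and vertical distance between any two points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)

def normalize(input_points, pattern_points):
    """Shift the input so it starts at the origin, then stretch it to the pattern's size."""
    x0, y0 = input_points[0]
    relative = [(x - x0, y - y0) for x, y in input_points]
    rec_max_h, rec_max_v = extent(relative)
    exp_max_h, exp_max_v = extent(pattern_points)
    # Guard against a perfectly horizontal or vertical input (zero extent).
    scale_x = exp_max_h / rec_max_h if rec_max_h else 1.0
    scale_y = exp_max_v / rec_max_v if rec_max_v else 1.0
    return [(x * scale_x, y * scale_y) for x, y in relative]

pattern_points = [(0, 0), (1, 0), (1, 1)]
input_points = [(10, 10), (30, 10), (30, 50)]  # same shape, bigger and offset
normalized = normalize(input_points, pattern_points)
print(normalized)
```

Since x and y are scaled independently, the aspect ratio is deliberately not preserved.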
I suggest storing all gestures' patterns as graphs of the same size; that reduces computation when measuring the closeness of input to possible pattern matches.
When to measure input
User-driven
Imagine a button that, when clicked/activated, causes your program to begin measuring inputs. This would be similar to Google's Voice Search, which doesn't constantly record and search; instead, you say "Ok Google" or click the handy microphone icon and begin speaking your query.
Benefits:
Simplifies algorithm
Prevents user from unintentionally triggering an event. Imagine if every word you spoke was analyzed and sent to Google as part of a search query. Sometimes you just don't mean to do anything.
Drawbacks:
Less user-friendly. User must go out of his/her way to trigger recording for gestures.
If you're writing, for instance, a gesture-search (ridiculous example), this is probably the better method to implement. Nobody wants every move they make interpreted as an action in your application. However, if you're writing a Kinect-style or gesture-based game, you probably want to be constantly recording and looking for gestures.
Constant
Your program constantly records gesture coordinates at the specified interval (this could be reduced to "records if there's movement, otherwise doesn't store coordinates"). You must make a decision: how many "frames" will you record until deciding that the currently-stored motion is not a recognized gesture?
Store coordinates in a buffer: a queue 1.5 or 2 (to be cautious) times as long as the largest number of frames you're willing to record.
Once you determine that there exists in this buffer a sequence of frames that match a pattern, execute that gesture's result, and clear the queue.
If there's the possibility that the next gesture is an "option" for the most-recent gesture, record the application state as "currently waiting on option for ____ gesture," and wait for the option to appear.
If it's determined that the first x frames in the buffer cannot possibly match a pattern (because of their sequence or positioning), remove them from the queue.
Benefits:
Allows for more dynamic handling of gestures
User input recognized automatically
Drawbacks:
More complex algorithm
Heavier computation
If you're writing a game that runs based on real-time input, this is probably the right choice.
Algorithm
If you're using user-driven recognition:
Record all input in the allowed timeframe (or until the user signifies that they're done)
To evaluate the input, reduce the density of your pattern to match that of the input
Relate the input to a coordinate system
Normalize input
Use a method of function comparison (looseness of this calculation is up to you: standard deviation, variance, total difference in values, etc.), and choose the least-different possibility.
If no possibility is similar enough to meet your required threshold (you must decide this), don't accept the input.
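The last two steps (pick the least-different pattern, reject if nothing is close enough) could be sketched as below. Mean point-to-point distance is just one possible comparison, chosen here for simplicity; the pattern names, points, and threshold are made up, and the input is assumed to be already normalized and matched in density.

```python
import math

def mean_distance(points_a, points_b):
    """Average Euclidean distance between corresponding points."""
    if len(points_a) != len(points_b):
        raise ValueError("point lists must have the same length")
    total = sum(math.hypot(ax - bx, ay - by)
                for (ax, ay), (bx, by) in zip(points_a, points_b))
    return total / len(points_a)

def best_match(input_points, patterns, threshold):
    """Name of the closest pattern, or None if even that one exceeds the threshold."""
    best_name, best_score = None, float("inf")
    for name, pattern_points in patterns.items():
        score = mean_distance(input_points, pattern_points)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else None

patterns = {
    "line": [(0, 0), (1, 0), (2, 0)],
    "diagonal": [(0, 0), (1, 1), (2, 2)],
}
user_input = [(0, 0.1), (1, 0.1), (2, -0.1)]

print(best_match(user_input, patterns, threshold=0.5))
print(best_match(user_input, patterns, threshold=0.05))
```

The first call accepts the input as a "line"; the second uses a stricter threshold and rejects it.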
If you're using constant measuring:
In your buffer, treat the sequence of max_sequence_size (you decide) beginning at every multiple of frame_multiples (you decide) as a possible gesture. For instance, if all of my possible gestures are at most 20 frames long, and I believe every 5 frames a new gesture could be starting (and I won't lose any critical data in those 5 frames), I'll compare each portion of the buffer to all possible gestures (portions from 0-19, 5-24, 10-29, etc.). Computation gets heavier as frame_multiples decreases. For perfect measurement, frame_multiples is 1 (but this is likely not reasonable).
Hope you've enjoyed reading this answer as much as I enjoyed writing it. I've never done this before, but you've piqued my interest in a way that doesn't often happen. Please edit and improve my answer! If there's a portion that seems incomplete, add to it. I'm very curious about (particularly, more-experienced) responses and criticism.
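Slicing the buffer into candidate portions might look like the following sketch. `candidate_windows` is a hypothetical helper; the buffer here is a `collections.deque` holding plain frame numbers instead of real coordinates.

```python
from collections import deque

def candidate_windows(buffer, max_sequence_size, frame_multiples):
    """Return (start_index, frames) for each portion that could hold a gesture."""
    frames = list(buffer)
    last_start = len(frames) - max_sequence_size
    return [
        (start, frames[start:start + max_sequence_size])
        for start in range(0, last_start + 1, frame_multiples)
    ]

# 30 recorded frames, gestures at most 20 frames long, a new start every 5 frames.
buffer = deque(range(30), maxlen=40)
windows = candidate_windows(buffer, max_sequence_size=20, frame_multiples=5)

for start, frames in windows:
    print(start, frames[0], frames[-1])
```

This yields the portions 0-19, 5-24 and 10-29 from the example; each would then be compared against every stored pattern.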