Neural Network-like Data Structure - data-structures

So I'm working on a little side-project for the purpose of experimenting with genetic algorithms. The project involves two classes, Critters and Food. The critters gain hunger each tick and lose hunger when they find food. The Critters can move, and the Food is stationary. Each Critter has a genome which is just a string of randomly generated characters. The goal is that after several generations, the Critters will evolve specialized movement patterns that result in optimal food consumption.
Right now, each Critter is governed by a neural network. The neural network is initialized with weights and biases derived from the Critter's genome. The first input into the neural network is [0,0]. The neural network produces two outputs which dictate the direction of the Critter's x and y movement respectively. This output is used as the input for the neural network at the next tick. For example:
1: [0,0]->NN->[.598.., -.234...] // Critter moves right and up
2: [.598...,-.234...]->NN->[-.409...,-.232...] // Critter moves left and up
3: [-.409...,-.232...]->NN-> etc.
The problem is that, regardless of how the weights are initialized, the neural network is finding a sort of "fixed point." That is, after two or three iterations the output and input are practically the same so the Critter always moves in the same direction. Now I'm not training the neural net and I don't really want to. So what I'm looking for is an alternative method of generating the output.
More specifically, let's say I have n random weights generated by the genome. I need a relation determined by those n weights that can map (in the loosest sense of the word) 2 inputs in the range [-1,1] to two outputs in the same range. The main thing is that I want the weights to have a significant impact on the behavior of the function. I don't want it to be something like y=mx+b where we're only changing m and b.
I know that's a pretty vague description. At first I thought the Neural Network would be perfect, but it seems as though the inputs have virtually no effect on the outputs without training (which is fair since Neural Networks are meant to be trained).
Any advice?

Just an idea.
You have f(genome) -> (w_1, w_2, ..., w_n), where f generates w based on the genome.
You could, for example, use a hash function h and compute [h(w_1, ..., w_(n/2)), h(w_(n/2+1), ..., w_n)].
Normally a hash function should give very distinct outputs for small change in inputs. But not always. You can look for hash functions that are continuous (small change in input, small change in output). This kind of function would be used for similarity search, http://en.wikipedia.org/wiki/Locality_sensitive_hashing might provide some ideas. This way you could actually use the hash directly on the genome.
Otherwise, you could try to split the genome or the weights and give the splits different purposes. Let's say n = 4.
(w_1, w_2) affects x
(w_3, w_4) affects y
You could then compute x as (w_1 + w_2*Random_[-1,1])/2, where Random_[-1,1] is a random number from the interval [-1,1] and assuming that w_i \in [-1,1] for all i.
Similarly for y.
Your genetic algorithm would then optimize how fast and how randomly the Critters move in order to optimally find food. If you have more weights (or longer genome), you could try to come up with a fancier function in a similar spirit.
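The split-weights idea above can be sketched in a few lines of Python. This is only an illustration under the stated assumption that n = 4 and every w_i lies in [-1,1]; the function name movement and the example weight tuples are made up for the sketch.

```python
import random

def movement(weights, rng):
    """Return (dx, dy), each in [-1, 1], from four weights in [-1, 1].

    w1 and w3 act as a fixed drift; w2 and w4 scale a fresh random
    number every tick, so they control how erratic the motion is.
    """
    w1, w2, w3, w4 = weights
    dx = (w1 + w2 * rng.uniform(-1.0, 1.0)) / 2.0
    dy = (w3 + w4 * rng.uniform(-1.0, 1.0)) / 2.0
    return dx, dy

rng = random.Random(42)  # seeded only so the sketch is repeatable

# A genome with no random component always moves the same way...
steady = [movement((0.8, 0.0, -0.4, 0.0), rng) for _ in range(5)]
# ...while one with no drift component wanders at random.
jittery = [movement((0.0, 1.0, 0.0, 1.0), rng) for _ in range(5)]
```

Here the genetic algorithm would effectively be tuning the ratio of drift to jitter per axis, which is a much smaller search space than a full network.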
This actually shows that with genetic algorithms the problem solving shifts to finding a good genome representation and a good fitness function so don't worry if you get stuck on it a bit.

Markov chains and Random walks on top of biological data

I come from a biology background, so I have some difficulty understanding (intuitively?) some of the ideas in that paper. I really tried my best to decipher it step by step with a lot of Google and YouTube, but now I feel it's time to turn to the professionals in the field.
Before flooding you with (unordered) questions, let me lay the whole thing out, introducing the subject and explaining what I have understood so far from my research.
Microarrays
For those who have no idea what this is: you can imagine it as literally an array (matrix) in which each cell contains a probe for a specific gene. To make a long story short, at the end of a microarray experiment you have a matrix (in computational terms) in which each column represents a sample, each row a different gene, and the entries are the expression values of the genes in each sample.
Pathways
In biology, a pathway (or gene set) is a set of genes that interact with each other, forming a small network responsible for a specific function. These pathways are not isolated; they talk to and interact with each other too. What the paper does, first of all, is expand an initial pathway (let us call it the target pathway) by including genes from other pathways that might interact with it.
Procedure
1. Let's assume we have a G x S matrix, where G stands for genes and S for samples. We construct a gene co-expression network (G x G), using as weights the Pearson correlation coefficients between pairs of genes (a). This can also be represented as an undirected weighted graph.
2. For each gene (row OR column) we calculate the weighted degree (d), which is nothing more than the sum of all correlation coefficients of that gene.
3. From the two previous results, they construct the transition matrix, which gives the probabilities (P) of moving from one gene to another, using the formula given in the paper.
Q1. Why do they call this transition probability? Is there any intuitive way to see this as a probability in the biological context?
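To make steps 1 to 3 concrete, here is a small Python sketch. It assumes the usual random-walk normalisation P_ij = a_ij / d_i, which may or may not be exactly the paper's formula, and it uses the absolute value of the correlation as the edge weight so that no probability comes out negative (also an assumption). The toy expression matrix and the function names are invented for illustration.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def transition_matrix(expr):
    """expr: one row per gene, one column per sample."""
    g = len(expr)
    # Step 1: co-expression weights a_ij (|r|, no self-loops).
    a = [[abs(pearson(expr[i], expr[j])) if i != j else 0.0
          for j in range(g)] for i in range(g)]
    # Step 2: weighted degree d_i = sum of the weights of gene i.
    d = [sum(row) for row in a]
    # Step 3: assumed normalisation P_ij = a_ij / d_i.
    p = [[a[i][j] / d[i] for j in range(g)] for i in range(g)]
    return a, d, p

expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 5.9, 8.2],
    [4.0, 3.0, 2.5, 1.0],
    [1.0, 3.0, 2.0, 5.0],
]
a, d, p = transition_matrix(expression)
```

Under this reading, row i of P is non-negative and sums to 1, so it can be interpreted as the distribution of the next gene visited by a walker currently standing on gene i: strongly co-expressed partners are visited more often.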
4. Once we have the whole transition matrix, we can define a subnetwork of the initial network that we want to expand, consisting of, let's say, 15 genes. In this step they use formula number 3 (in the paper), which transforms the values of the initial transition matrix. They set a probability of 1 on the nodes that are part of the selected subnetwork because they define them as absorbing states.
Q2. In that same formula (3), I cannot understand what the second condition does. When should the probability be 0? Intuitively, I would expect all nodes that are not in the subnetwork to keep their P_ij value as the probability.
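For what it is worth, the standard way of making states absorbing in a Markov chain looks like the sketch below. Whether formula (3) of the paper is exactly this construction is an assumption; the function name make_absorbing and the 3 x 3 matrix are illustrative only.

```python
def make_absorbing(p, absorbing):
    """Return a copy of transition matrix p in which every state in
    `absorbing` keeps the walker forever.

    For an absorbing state i: P_ii = 1 and P_ij = 0 for every j != i
    (the row must still sum to 1, so leaving is impossible).
    Rows of all other states are left unchanged.
    """
    out = []
    for i, row in enumerate(p):
        if i in absorbing:
            out.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            out.append(list(row))
    return out

p = [
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
    [0.6, 0.4, 0.0],
]
p_abs = make_absorbing(p, absorbing={0})
```

In this construction the zero case applies only to transitions out of an absorbing node towards any other node; transitions that start from a non-absorbing node keep their original P_ij.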
5. After that, the newly constructed transition matrix is shown in formula (4) of the paper, and I managed to understand it using this excellent article.
6. Here is where everything gets blurrier for me and where I need the most help. What I imagine at this step is that the algorithm starts at a random node and keeps walking around the network. In order to construct a relevance function (what exactly does that mean?), they first calculate a probability called the joint probability of visiting a node/edge E(i,j).
On the other hand, they seem to calculate another probability, called the probability of a walk of length L starting in x.
7. In the next step, they divide the previously calculated probabilities and calculate the number of times a random walk starting in x uses the transition from i to j; I don't really understand what this means.
After that step, I lost track of their reasoning completely :-P.
I'm not expecting an expert to come and open my mind and make me understand the procedure. What I'm hoping for is some guidelines, hints, ideas, useful resources, or more intuitive approaches to understanding the whole procedure. Once I fully understand it, I will try to implement it in R or Python.
So any ideas or criticism are welcome.
Thanks.

What algorithm do I use to calculate voltage across a combination circuit?

I'm trying to programmatically calculate voltage changes over a very large circuit.
*This question may seem geared toward electronics, but it's more about applying an algorithm over a set of data.
To keep things simple, here is a complete circuit, with the voltages already calculated:
I'm originally only given the battery voltage and the resistances:
The issue I have is that voltage is calculated differently among parallel and series circuits.
A somewhat similar question asked on SO.
Some formulas:
When resistors are in parallel:
Rtotal = 1/(1/R1 + 1/R2 + 1/R3 ... + 1/Rn)
When resistors are in series:
Rtotal = R1 + R2 + R3 ... + Rn
Ohm's Law:
V = IR
I = V/R
R = V/I
V is voltage (volts)
I is current (amps)
R is resistance(ohms)
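As a quick illustration of the formulas above, here they are as Python helpers, applied to a hypothetical circuit (a 12 V battery, a 2 ohm resistor in series with a 6 ohm and a 3 ohm resistor in parallel). The component values are invented for the example and are not taken from the circuit in the question.

```python
def series(*resistances):
    """Rtotal = R1 + R2 + ... + Rn"""
    return sum(resistances)

def parallel(*resistances):
    """Rtotal = 1 / (1/R1 + 1/R2 + ... + 1/Rn)"""
    return 1.0 / sum(1.0 / r for r in resistances)

# Hypothetical values: 2 ohm in series with (6 ohm parallel to 3 ohm).
r_total = series(2.0, parallel(6.0, 3.0))

# Ohm's law: I = V / R, then V = I * R for each part.
battery_voltage = 12.0
current = battery_voltage / r_total
drop_series_part = current * 2.0
drop_parallel_part = current * parallel(6.0, 3.0)
```

This is exactly the by-hand grouping approach, which works as long as someone has already decided which parts are in series and which are in parallel.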
Every tutorial I've found on the internet consists of people conceptually grouping together parallel circuits to get the total resistance, and then using that resistance to calculate the resistance in series.
This is fine for small examples, but it's difficult to derive an algorithm out of it for large scale circuits.
My question:
Given a matrix of all complete paths, is there a way for me to calculate all the voltage drops?
I currently have the system as a graph data structure.
All of the nodes are represented by (and can be looked up by) an id number.
So for the example above, if I run the traversals, I'll get back a list of paths like this:
[[0,1,2,4,0]
,[0,1,3,4,0]]
Each number can be used to derive the actual node and its corresponding data. What kind of transformations/algorithms do I need to perform on this set of data?
It's very likely that portions of the circuit will be compound, and those compound sections may find themselves being in parallel or series with other compound sections.
I think my problem is akin to this:
http://en.wikipedia.org/wiki/Series-parallel_partial_order
Some circuits cannot even be analyzed in terms of series and parallel, for example a circuit which includes the edges of a cube (there's some code at the bottom of that web page that might be helpful; I haven't looked at it). Another example that can't be analyzed into series/parallel is a pentagon/pentagram shape.
A more robust solution than thinking about series and parallel is to use Kirchhoff's laws.
1. You need to make variables for the currents in each linear section of the circuit.
2. Apply Kirchhoff's current law (KCL) to nodes where linear sections meet.
3. Apply Kirchhoff's voltage law (KVL) to as many cycles as you can find.
4. Use Gaussian elimination to solve the resulting linear system of equations.
The tricky part is identifying cycles. In the example you give, there are three cycles: through battery and left resistor, battery and right resistor, and through left and right resistors. For planar circuits it's not too hard to find a complete set of cycles; for three dimensional circuits, it can be hard.
You don't actually need all the cycles. In the above example, two would be enough (corresponding to the two bounded regions into which the circuit divides the plane). Then you have three variables (currents in three linear parts of the circuit) and three equations (sum of currents at the top node where three linear segments meet, and voltage drops around two cycles). That is enough to solve the system for currents by Gaussian elimination, then you can calculate voltages from the currents.
If you throw in too many equations (e.g., currents at both nodes in your example, and voltages over three cycles instead of two), things will still work out: Gaussian elimination will just eliminate the redundancies and you'll still get the unique, correct answer. The real problem is if you have too few equations. For example, if you use KCL on the two nodes in your example and KVL around just one cycle, you'll have three equations, but one is redundant, so you'll only really have two independent equations, which is not enough. So I would say throw in every equation you can find and let Gaussian elimination sort it out.
And hopefully you can restrict to planar circuits, for which it is easy to find a nice set of cycles. Otherwise you'll need a graph cycle enumeration algorithm. I'm sure you can find one if you need it.
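A minimal Python sketch of the Kirchhoff approach, for a hypothetical circuit (not necessarily the one pictured in the question): a 12 V battery, a 2 ohm resistor R1 in series, then a 6 ohm resistor R2 and a 3 ohm resistor R3 in parallel. The unknowns are the three branch currents; one KCL equation and two KVL equations give a square system. The solver here handles only square, non-singular systems, so redundant equations would have to be dropped first or handled by a least-squares or rank-revealing solver.

```python
def gaussian_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular: not enough independent equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

V, R1, R2, R3 = 12.0, 2.0, 6.0, 3.0  # hypothetical component values

# Unknowns: i1 (battery and R1), i2 (through R2), i3 (through R3).
equations = [
    [1.0, -1.0, -1.0],  # KCL at the node where the branches split: i1 = i2 + i3
    [R1, R2, 0.0],      # KVL around battery, R1, R2: R1*i1 + R2*i2 = V
    [0.0, R2, -R3],     # KVL around R2, R3: R2*i2 - R3*i3 = 0
]
rhs = [0.0, V, 0.0]

i1, i2, i3 = gaussian_solve(equations, rhs)
drops = {"R1": i1 * R1, "R2": i2 * R2, "R3": i3 * R3}
```

For a graph-based circuit, the rows of the matrix would be generated from the nodes (KCL) and from a set of independent cycles (KVL) instead of being typed in by hand.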
Use a maximum flow algorithm (Dijkstra is your friend).
http://www.cs.princeton.edu/courses/archive/spr04/cos226/lectures/maxflow.4up.pdf
Treat it as a water flow problem (well, actually it IS a flow problem). You have to compute the flow of water on each segment (the current). Then you can easily compute the voltage drop (water pressure) across every resistor.
I think the way to go here would be something like this:
1. Sort all your paths into groups of the same length.
2. While there are more than one group, choose the group with the largest length and:
2a. Find two paths with one item difference.
2b. "Merge" them into a path with the length smaller by one - the merge is dependent on the actual items that are different.
2c. Add the new path into the relevant group.
2d. If there are only paths with more than one item difference, merge the different items so that you have only one different item between the paths.
2e. When there is only one item left, find an item from a "lower" (= length is smaller) with minimum differences, and merge item to match.
3. When there is one group left with more than one item, keep doing #2 until there is one group left with one item.
4. Calculate the value of that item directly.
This is very initial, but I think the main idea is clear.
Any improvements are welcome.

neural network training set

My question is about a training set in a supervised artificial neural network (ANN)
Training set, as some of you probably know, consists of pairs (input, desired output)
Training phase itself is the following
for every pair in a training set
-we input the first value of the pair and calculate the output error i.e. how far is the generated output from the desired output, which is the second value of the pair
-based on that error value we use the backpropagation algorithm to calculate weight gradients and update weights of ANN
end for
Now assume that there are pair1, pair2, ...pair m, ... in the training set
we take pair1, produce some error, update weights, then take pair2, etc.
later we reach pair m, produce some error, and update weights,
My question is, what if the weight update after pair m cancels some earlier weight update, or even several of them?
For example, if pair m is going to eliminate the weight updates that happened after pair1, or pair2, or both, then although ANN will produce a reasonable output for input m, it will kinda forget the updates for pair1 and pair2, and the result for inputs 1 and 2 will be poor,
then what's the point of training?
Unless we train ANN with pair1 and pair2 again, after pair m
For example, if pair m is going to eliminate the weight updates that happened after pair1, or pair2, or both, then although ANN will produce a reasonable output for input m, it will kinda forget the updates for pair1 and pair2, and the result for inputs 1 and 2 will be poor, then what's the point of training?
The aim of training a neural network is to end up with weights that give you the desired output for all possible input values. What you're doing here is traversing the error surface as you back-propagate so that you end up in an area where the error is below the error threshold. Keep in mind that when you backpropagate the error for one set of inputs, it doesn't mean that the neural network automatically recognizes that particular input and immediately produces the exact response when that input is presented again. When you backpropagate, all it means is that you have changed your weights in such a manner that your neural network will get better at recognizing that particular input (that is, the error keeps decreasing).
So if you present pair-1 and then pair-2, it is possible that pair-2 may negate the changes to a certain degree. However, in the long run the neural network's weights will tend towards recognizing all inputs properly. The thing is, you cannot look at the result of a particular training attempt for a particular set of inputs/outputs and be concerned that the changes will be negated. As I mentioned before, when you're training a neural network you are traversing an error surface to find a location where the error is the lowest. Think of it as walking along a landscape that has a bunch of hills and valleys. Imagine that you don't have a map and that you have a special compass that tells you in what direction you need to move, and by what distance. The compass is basically trying to direct you to the lowest point in this landscape. Now this compass doesn't know the landscape well either, and so in trying to send you to the lowest point, it may send you in a slightly wrong direction (i.e., some way up a hill), but it will try to correct itself after that. In the long run, you will eventually end up at the lowest point in the landscape (unless you're stuck in a local minimum, i.e., a low point, but one that is not the lowest point).
Whenever you're doing supervised training, you should run several (or even thousands of) rounds through a training dataset. Each such round through the training dataset is called an epoch.
There are also two different ways of updating the parameters in the neural network during supervised training: stochastic training and batch training. Batch training is one loop through the dataset, accumulating the total error through the set, and updating the parameters (weights) only once, when all the error has been accumulated. Stochastic training is the method you describe, where the weights are adjusted for each (input, desired output) pair.
In almost all cases, where the training data set is relatively representative of the general case, you should prefer stochastic training over batch training. Stochastic training beats batch training in 99 of 100 cases! (Citation needed :-)). (Simple XOR training cases and other toy problems are the exceptions)
Back to your question (which applies to stochastic training): Yes, the second pair could indeed adjust the weights in the opposite direction from the first pair. However, it is not really likely that all weights are adjusted in opposite directions for two cases. And since you will run several epochs through the set, the effect will diminish with each epoch. You should also randomize the order of the pairs for each epoch. (Use some kind of Fisher-Yates algorithm.) This will diminish the effect even more.
Next tip: Keep a benchmark dataset separate from the training data. Every n epochs of training, benchmark the neural network with the benchmark set, that is, calculate the total error over the pairs in this benchmark dataset. When the error no longer decreases, it's time to stop the training.
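The advice above (many epochs, per-pair updates, a fresh shuffle each epoch, a separate benchmark set for stopping) can be sketched as follows. To keep it short the "network" is a single linear neuron y = w*x + b, which is a stand-in and not a real ANN; the learning rate, the patience of 3 epochs and the toy data are all assumptions made for the sketch.

```python
import random

def train(train_pairs, bench_pairs, lr=0.1, max_epochs=500, patience=3, seed=0):
    """Stochastic training of y = w*x + b with early stopping."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    pairs = list(train_pairs)
    best_err, stale, epochs_run = float("inf"), 0, 0
    for epoch in range(max_epochs):
        rng.shuffle(pairs)  # new order every epoch (CPython uses Fisher-Yates)
        for x, target in pairs:
            err = (w * x + b) - target  # output error for this pair
            w -= lr * err * x           # gradient step on the squared error
            b -= lr * err
        epochs_run = epoch + 1
        # Benchmark on data that is never used for weight updates.
        bench_err = sum(((w * x + b) - t) ** 2 for x, t in bench_pairs)
        if bench_err < best_err:
            best_err, stale = bench_err, 0
        else:
            stale += 1
            if stale >= patience:  # error stopped decreasing: stop training
                break
    return w, b, best_err, epochs_run

# Toy data generated from y = 2x + 1 (hypothetical target function).
train_pairs = [(-1.0 + 0.1 * k, 2.0 * (-1.0 + 0.1 * k) + 1.0) for k in range(21)]
bench_pairs = [(x, 2.0 * x + 1.0) for x in (-0.95, -0.45, 0.05, 0.55, 0.85)]

w, b, best_err, epochs_run = train(train_pairs, bench_pairs)
```

Each individual update only nudges the weights towards its own pair, and later pairs do partly undo earlier nudges, but over the epochs the weights settle where the error over all pairs is small.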
Good luck!
If you were performing a stochastic gradient descent (SGD), then this probably wouldn't happen because the parameter updates for pair 1 would take effect before the parameter updates for pair 2 would be computed. That is why SGD may converge faster.
If you are computing your parameter updates using all your data simultaneously (or even a chunk of it) then these two pairs may cancel each other out. However, that is not a bad thing because, clearly, these two pairs of data points are giving conflicting information. This is why batch backprop is typically considered to be more stable.

Frequency determination from sparsely sampled data

I'm observing a sinusoidally-varying source, i.e. f(x) = a sin (bx + d) + c, and want to determine the amplitude a, offset c and period/frequency b - the shift d is unimportant. Measurements are sparse, with each source measured typically between 6 and 12 times, and observations are at (effectively) random times, with intervals between observations roughly between a quarter and ten times the period (just to stress, the spacing of observations is not constant for each source). In each source the offset c is typically quite large compared to the measurement error, while amplitudes vary - at one extreme they are only on the order of the measurement error, while at the other extreme they are about twenty times the error. Hopefully that fully outlines the problem; if not, please ask and I'll clarify.
Thinking naively about the problem, the average of the measurements will be a good estimate of the offset c, while half the range between the minimum and maximum value of the measured f(x) will be a reasonable estimate of the amplitude, especially as the number of measurements increases so that the prospects of having observed the maximum offset from the mean improve. However, if the amplitude is small then it seems to me that there is little chance of accurately determining b, while the prospects should be better for large-amplitude sources even if they are only observed the minimum number of times.
Anyway, I wrote some code to do a least-squares fit to the data for the range of periods, and it identifies best-fit values of a, b and d quite effectively for the larger-amplitude sources. However, I see it finding a number of possible periods, and while one is the 'best' (in as much as it gives the minimum error-weighted residual) in the majority of cases the difference in the residuals for different candidate periods is not large. So what I would like to do now is quantify the possibility that the derived period is a 'false positive' (or, to put it slightly differently, what confidence I can have that the derived period is correct).
Does anybody have any suggestions on how best to proceed? One thought I had was to use a Monte-Carlo algorithm to construct a large number of sources with known values for a, b and c, construct samples that correspond to my measurement times, fit the resultant sample with my fitting code, and see what percentage of the time I recover the correct period. But that seems quite heavyweight, and I'm not sure that it's particularly useful other than giving a general feel for the false-positive rate.
And any advice for frameworks that might help? I have a feeling this is something that can likely be done in a line or two in Mathematica, but (a) I don't know it, and (b) don't have access to it. I'm fluent in Java, competent in IDL and can probably figure out other things...
This looks tailor-made for working in the frequency domain. Apply a Fourier transform and identify the frequency based on where the power is located, which should be clear for a sinusoidal source.
ADDENDUM To get an idea of how accurate your estimate is, I'd try a resampling approach such as cross-validation. I think this is the direction that you're heading with the Monte Carlo idea; lots of work is out there, so hopefully that's a wheel you won't need to re-invent.
The trick here is to do what might seem at first to make the problem more difficult. Rewrite f in the similar form:
f(x) = a1*sin(b*x) + a2*cos(b*x) + c
This is based on the identity for sin(u+v).
Recognize that if b is known, then the problem of estimating {a1, a2, c} is a simple LINEAR regression problem. So all you need to do is use a 1-variable minimization tool, working on the value of b, to minimize the sum of squares of the residuals from that linear regression model. There are many such univariate optimizers to be found.
Once you have those parameters, it is easy to find the parameter a in your original model, since that is all you care about.
a = sqrt(a1^2 + a2^2)
The scheme I have described is called partitioned least squares.
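A minimal Python sketch of this scheme. One deliberate substitution: instead of a univariate optimizer, the outer step simply scans a grid of candidate b values, which also makes it easy to list every candidate period with its residual. The grid range, the observation times and the "true" parameters used to generate the noiseless test data are all invented for the example.

```python
from math import sin, cos, sqrt

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_linear_part(ts, ys, b):
    """For fixed b, linear least squares for a1*sin(b*t) + a2*cos(b*t) + c."""
    rows = [(sin(b * t), cos(b * t), 1.0) for t in ts]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    a1, a2, c = solve(ata, aty)
    ssr = sum((a1 * r[0] + a2 * r[1] + c - y) ** 2 for r, y in zip(rows, ys))
    return (a1, a2, c), ssr

def fit_sinusoid(ts, ys, b_grid):
    """Outer 1-D search over b; inner linear regression for a1, a2, c."""
    best = None
    for b in b_grid:
        try:
            params, ssr = fit_linear_part(ts, ys, b)
        except ValueError:
            continue  # skip frequencies where the regression is degenerate
        if best is None or ssr < best[0]:
            best = (ssr, b, params)
    ssr, b, (a1, a2, c) = best
    return {"a": sqrt(a1 * a1 + a2 * a2), "b": b, "c": c, "ssr": ssr}

# Ten irregularly spaced, noiseless observations of 2*sin(1.0*t + 0.7) + 5.
ts = [0.0, 1.3, 2.9, 4.1, 7.7, 8.2, 11.5, 13.0, 17.4, 19.9]
ys = [2.0 * sin(1.0 * t + 0.7) + 5.0 for t in ts]
grid = [0.2 + 0.01 * k for k in range(281)]  # candidate b from 0.2 to 3.0
fit = fit_sinusoid(ts, ys, grid)
```

With real, noisy data the residual as a function of b usually has several local minima, so a coarse scan followed by a local refinement around the best few candidates is a common compromise.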
If you have a reasonable estimate of the size and the nature of your noise (e.g. white Gaussian with SD sigma), you can
(a) invert the Hessian matrix to get an estimate of the error in your position and
(b) should be able to easily derive a significance statistic for your fit residuals.
For (a), compare http://www.physics.utah.edu/~detar/phys6720/handouts/curve_fit/curve_fit/node6.html
For (b), assume that your measurement errors are independent and thus the variance of their sum is the sum of their variances.

How do you evaluate the efficiency of an algorithm, if the problem space is underspecified?

There was a post on here recently which posed the following question:
You have a two-dimensional plane of (X, Y) coordinates. A bunch of random points are chosen. You need to select the largest possible set of chosen points, such that no two points share an X coordinate and no two points share a Y coordinate.
This is all the information that was provided.
There were two possible solutions presented.
One suggested using a maximum flow algorithm, such that each selected point maps to a path linking (source → X → Y → sink). This runs in O(V^3) time, where V is the number of vertices selected.
Another (mine) suggested using the Hungarian algorithm. Create an n×n matrix of 1s, then set every chosen (x, y) coordinate to 0. The Hungarian algorithm will give you the lowest cost for this matrix, and the answer is the number of coordinates selected which equal 0. This runs in O(n^3) time, where n is the greater of the number of rows or the number of columns.
My reasoning is that, for the vast majority of cases, the Hungarian algorithm is going to be faster; V is equal to n in the case where there's one chosen point for each row or column, and substantially greater for any case where there's more than that: given a 50×50 matrix with half the coordinates chosen, V is 1,250 and n is 50.
The counterargument is that there are some cases, like a 10^9×10^9 matrix with only two points selected, where V is 2 and n is 1,000,000,000. For this case, it takes the Hungarian algorithm a ridiculously long time to run, while the maximum flow algorithm is blindingly fast.
Here is the question: Given that the problem doesn't provide any information regarding the size of the matrix or the probability that a given point is chosen (so you can't know for sure) how do you decide which algorithm, in general, is a better choice for the problem?
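For reference, the selection problem itself can be sketched in Python as a bipartite matching between X values and Y values. This uses a simple augmenting-path matching (often called Kuhn's algorithm), which is the unit-capacity special case of the maximum flow formulation and not the exact algorithm from either answer; the function name max_selection is made up.

```python
def max_selection(points):
    """Size of the largest subset of points with all-distinct x and all-distinct y."""
    by_x = {}
    for x, y in points:
        by_x.setdefault(x, []).append(y)
    match_of_y = {}  # y value -> x value currently paired with it

    def try_assign(x, seen):
        """Try to give x some y, re-routing earlier choices if needed."""
        for y in by_x[x]:
            if y in seen:
                continue
            seen.add(y)
            if y not in match_of_y or try_assign(match_of_y[y], seen):
                match_of_y[y] = x
                return True
        return False

    return sum(try_assign(x, set()) for x in by_x)

example = max_selection([(1, 1), (1, 2), (2, 1)])
```

Its cost depends only on the chosen points, never on the size of the grid, which is the property the counterargument relies on. The recursion makes it unsuitable for very long augmenting paths without an iterative rewrite.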
You can't, it's an imponderable.
You can only define which is better "in general" by defining what inputs you will see "in general". So for example you could whip up a probability model of the inputs, so that the expected value of V is a function of n, and choose the one with the best expected runtime under that model. But there may be arbitrary choices made in the construction of your model, so that different models give different answers. One model might choose co-ordinates at random, another model might look at the actual use-case for some program you're thinking of writing, and look at the distribution of inputs it will encounter.
You can alternatively talk about which has the best worst case (across all possible inputs with given constraints), which has the virtue of being easy to define, and the flaw that it's not guaranteed to tell you anything about the performance of your actual program. So for instance HeapSort is faster than QuickSort in the worst case, but slower in the average case. Which is faster? Depends whether you care about average case or worst case. If you don't care which case, you're not allowed to care which "is faster".
This is analogous to trying to answer the question "what is the probability that the next person you see will have an above (mean) average number of legs?".
We might implicitly assume that the next person you meet will be selected at random with uniform distribution from the human population (and hence the answer is "slightly less than one", since the mean is less than the mode average, and the vast majority of people are at the mode).
Or we might assume that your next meeting with another person is randomly selected with uniform distribution from the set of all meetings between two people, in which case the answer is still "slightly less than one", but I reckon not the exact same value as the first - one-and-zero-legged people quite possibly congregate with "their own kind" very slightly more than their frequency within the population would suggest. Or possibly they congregate less, I really don't know, I just don't see why it should be exactly the same once you take into account Veterans' Associations and so on.
Or we might use knowledge about you - if you live with a one-legged person then the answer might be "very slightly above 0".
Which of the three answers is "correct" depends precisely on the context which you are forbidding us from talking about. So we can't talk about which is correct.
Given that you don't know what each pill does, do you take the red pill or the blue pill?
If there really is not enough information to decide, there is not enough information to decide. Any guess is as good as any other.
Maybe, in some cases, it is possible to divine extra information to base the decision on. I haven't studied your example in detail, but it seems like the Hungarian algorithm might have higher memory requirements. This might be a reason to go with the maximum flow algorithm.
You don't. I think you illustrated that clearly enough. I think the proper practical solution is to spawn off both implementations in different threads, and then take the response that comes back first. If you're more clever, you can heuristically route requests to implementations.
Many algorithms require huge amounts of memory beyond the physical maximum of a machine, and in these cases the algorithm that is less efficient in time but more efficient in space is chosen.
Given that we have distributed parallel computing, I say you just let both horses run and let the results speak for themselves.
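A rough Python sketch of "let both horses run": submit both implementations and take whichever finishes first. The two solver functions here are placeholders that just sleep for different lengths of time, not real implementations of either algorithm. Note also that Python threads cannot be cancelled and share the interpreter lock, so for CPU-bound solvers separate processes would be the more realistic choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def first_result(candidates, *args):
    """Run every candidate on the same input; return the first result to arrive."""
    pool = ThreadPoolExecutor(max_workers=len(candidates))
    futures = [pool.submit(fn, *args) for fn in candidates]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower candidate
    return next(iter(done)).result()

# Placeholder solvers (hypothetical): same answer, different running time.
def slow_solver(points):
    time.sleep(0.5)
    return ("slow", len(points))

def fast_solver(points):
    time.sleep(0.01)
    return ("fast", len(points))

winner, answer = first_result([slow_solver, fast_solver], [(1, 1), (2, 2)])
```

The price is that the total work done is roughly twice that of the faster algorithm, which is acceptable only if the spare cores would otherwise sit idle.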
This is a valid question, but there's no "right" answer — they are incomparable, so there's no notion of "better".
If your interest is practical, then you need to analyze the kinds of inputs that are likely to arise in practice, as well as the practical running times (constants included) of the two algorithms.
If your interest is theoretical, where worst-case analysis is often the norm, then, in terms of the input size, the O(V^3) algorithm is better: you know that V ≤ n^2, but you cannot polynomially bound n in terms of V, as you showed yourself. Of course the theoretical best algorithm is a hybrid algorithm that runs both and stops when whichever one of them finishes first, thus its running time would be O(min(V^3, n^3)).
Theoretically, they are both the same, because you actually compare how the number of operations grows when the size of the problem is increased to infinity.
The way your problem is defined, it has 2 sizes - n and number of points, so this question has no answer.
