Machine Learning Algorithm for Peer-to-Peer Nodes

I want to apply machine learning to a classification problem in a parallel environment. Several independent nodes, each with multiple on/off sensors, can communicate their sensor data with the goal of classifying an event as defined by a heuristic, training data or both.
Each peer will be measuring the same data from their unique perspective and will attempt to classify the result while taking into account that any neighbouring node (or its sensors or just the connection to the node) could be faulty. Nodes should function as equal peers and determine the most likely classification by communicating their results.
Ultimately each node should make a decision based on its own sensor data and its peers' data. If it matters, false positives are OK for certain classifications (albeit undesirable) but false negatives would be totally unacceptable.
Given that each final classification will receive good or bad feedback, what would be an appropriate machine learning algorithm to approach this problem with if the nodes could communicate with each other to determine the most likely classification?

If the sensor data in each individual node is generally sufficient to make a reasonable decision, they could just communicate the result and take a majority vote. If majority vote is not appropriate, you could train an additional classifier that uses the outputs of the nodes as its feature vector.
Since you want on-line supervised learning with feedback, you could use a neural network with backpropagation or an incremental support vector machine that adds the errors to the training set. Look into classifier biasing to deal with the false-positive/false-negative trade-off.
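To make the voting and biasing ideas concrete, here is a minimal sketch (my own illustration; the function name and threshold are made up, not from any specific library). It combines a node's own score with its peers' via the median, so a single faulty peer cannot flip the decision, and uses a deliberately low positive threshold so false negatives stay rare at the cost of more false positives:

```python
import numpy as np

def combine_votes(own_prob, peer_probs, positive_threshold=0.3):
    """Combine this node's class probability with peers' reported
    probabilities and apply a deliberately low threshold so that
    false negatives are rare (at the cost of more false positives)."""
    probs = np.array([own_prob] + list(peer_probs))
    # A faulty peer shows up as an outlier; the median is more
    # robust to a single bad reading than the mean.
    consensus = np.median(probs)
    return consensus >= positive_threshold

# Example: our sensors say 0.4, three peers report their own estimates.
print(combine_votes(0.4, [0.55, 0.1, 0.6]))  # True
```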

In this instance, a neural network could be very appropriate. The inputs to the network would be each of the sensors onboard the node, along with those of its neighbors. You would calculate weights based on your feedback.
Another option (that is simpler, but can achieve good results as well) is a Gossip Algorithm. You would have to look into incorporating feedback though.
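A minimal sketch of the gossip idea, assuming simple push-pull averaging of local classification scores (the scores, round count, and pairing scheme are illustrative, not a specific published protocol):

```python
import random

def gossip_round(estimates):
    """One round of push-pull gossip: each node pairs with a random
    peer and both adopt the average of their current estimates."""
    nodes = list(estimates)
    random.shuffle(nodes)
    for a, b in zip(nodes[::2], nodes[1::2]):
        avg = (estimates[a] + estimates[b]) / 2.0
        estimates[a] = estimates[b] = avg
    return estimates

# Five nodes with local classification scores; repeated rounds move
# every node toward the network-wide mean (0.67 here).
est = {i: s for i, s in enumerate([0.9, 0.8, 0.1, 0.85, 0.7])}
for _ in range(10):
    gossip_round(est)
print(est)
```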

Related

Community Detection in complete and weighted networks

I have a complete network graph where every vertex is connected to every other vertex, and the edges differ only in their weights. An example would be a trade network, where every country is connected to every other country and the edges differ only in trading volume.
The question is how to perform community detection in this kind of network. The usual suspects (algorithms) only perform well on either unweighted or incomplete networks. The main problem is that the geodesic distance is the same everywhere.
Two options came to mind:
Cut the network into smaller pieces by cutting it at a certain "weight-threshold-level"
Or use a hierarchical clustering algorithm to turn the whole network into a blockmodel. But I think the problem of "no variance in geodesic terms" will remain.
Several methods were suggested.
One simple yet effective method was suggested in Fast unfolding of communities in large networks (Blondel et al., 2008). It supports weighted networks. Quoting from the abstract:
We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity.
Quoting from the paper:
We now introduce our algorithm that finds high modularity partitions of large networks in short time and that unfolds a complete hierarchical community structure for the network, thereby giving access to different resolutions of community detection.
So it is supposed to work well for complete graphs, but you had better verify that on your data.
A C++ implementation is available here (now maintained here).
Your other idea, using a weight threshold, may prove a good pre-processing step, especially for algorithms which won't partition complete graphs. I believe it is best to set the threshold to some percentile (e.g. the median) of the weights.
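As a sketch of both suggestions together, here is how this might look in Python with networkx (assuming networkx >= 3.0, which ships a Louvain-style louvain_communities function; the planted two-block weights are made up for illustration):

```python
import networkx as nx
import numpy as np

# Build a small complete weighted graph with two planted groups:
# strong weights within a group, weak weights between groups.
rng = np.random.default_rng(0)
n = 12
G = nx.complete_graph(n)
for u, v in G.edges:
    same_block = (u < n // 2) == (v < n // 2)
    G[u][v]["weight"] = rng.uniform(5, 10) if same_block else rng.uniform(0, 2)

# Optional pre-processing: drop edges below the median weight.
weights = [d["weight"] for _, _, d in G.edges(data=True)]
threshold = np.median(weights)
H = G.copy()
H.remove_edges_from([(u, v) for u, v, d in G.edges(data=True)
                     if d["weight"] < threshold])

# Louvain-style modularity optimization on the weighted graph.
communities = nx.community.louvain_communities(H, weight="weight", seed=0)
print(communities)  # expected: the two planted blocks
```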

Comparing two neural networks (nntool in Matlab)

I'm new to the Neural Network Toolbox (nntool) in Matlab. I have trained two networks using the same data set. One of these networks contains a higher number of neurons than the other one.
Now I'm wondering: how can I compare these networks? How can I say network A is better than network B?
Is it all about the number of correctly classified patterns in my test set? Let's say both networks were shown the same test set and network A classified more patterns correctly. Can I say network A is (in general) better than network B?
Or should I also look at the performance according to my performance function?
Are there any other measures for comparing two networks trained with different parameters?
That mainly depends on what your concern is. In most cases, analyzing the predicted labels or the accuracy of the nets leads to a good decision, especially when your networks have shallow architectures. However, there are some secondary issues that become more important when you look at the nets more broadly.
For example, in the training phase, adding even one hidden unit to the first hidden layer inserts d (the dimension of the input layer) free parameters (weights) into your model that must be estimated. On the other hand, the more free parameters your model has, the more training data is required to arrive at a reliable model. Therefore, bigger networks are acceptable as long as you have enough data to compensate for the added free parameters. As a rule of thumb, inserting more free parameters increases the chance of over-fitting, which has been a vital problem in deep neural networks, and many efforts have been made to resolve it.
Another issue, less important in shallow nets, is the computational cost imposed by extra hidden nodes. Since we are taking the wider view, it is worth mentioning. When your network goes deeper, this computational cost becomes more challenging. The computational cost in the training phase is also an important issue when you use back-propagation to update the parameters.
One other thing that you mainly see in deep neural networks is the memory requirement. As the number of layers or neurons increases, the number of free parameters grows dramatically, such that in deep networks you may see millions of parameters. Clearly, loading this number of parameters requires sufficient hardware.
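As a rough illustration of these trade-offs, here is a sketch in Python with scikit-learn (standing in for nntool; the synthetic dataset and layer sizes are made up) that compares a small and a large network on the same held-out test set by accuracy, log-loss (a stand-in for your performance function), and free-parameter count:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, log_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for hidden in (10, 100):  # "network A" vs the bigger "network B"
    net = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500,
                        random_state=0).fit(X_tr, y_tr)
    # Count the free parameters (weights plus biases) of each net.
    n_params = (sum(w.size for w in net.coefs_)
                + sum(b.size for b in net.intercepts_))
    print(f"{hidden:>4} hidden units | {n_params:>5} parameters | "
          f"test accuracy {accuracy_score(y_te, net.predict(X_te)):.3f} | "
          f"test log-loss {log_loss(y_te, net.predict_proba(X_te)):.3f}")
```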
Hope it helps.

How to preserve "building blocks" of neural net?

I am making a kind of neural network, with neurons and "synapses". It somewhat resembles Turing's type B nets; connections can go anywhere. It starts with a randomly generated net that has random connections between the neurons. There are both electrical and chemical variants with different effects on the neurons. To the point:
A net is basically a series of neurons with connections to other neurons. I can't figure out how to do "crossover" to form new generations of nets based on the best performing parents. More specifically, if I combine them based on single connections, I will break any potential "structure" or function that may have formed from a certain set of neurons and connections.
I considered splitting the network map, say, taking half from one parent and half from the other, but that may still break any potential functions that may have been created.
It is highly likely that I am missing something; I am learning this as I go.
Is there some way of doing this?
If you are evolving the network structure and weights, there is an excellent algorithm called NEAT.
If you are evolving the weights only, you have several possibilities, but the most basic one is to use the weight matrix of the network graph as a genotype. Then crossover can be done using any continuous GA crossover method, such as SBX or BLX-alpha.
The problem of breaking functionality (most often by mutation) is common and can be solved by e.g. fitness sharing (NEAT uses it) or some other mechanism which protects modified individuals for a certain amount of time.
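A minimal sketch of BLX-alpha over flattened weight matrices (my own illustration; alpha = 0.5 is a common default, but tune it for your problem). Each child gene is drawn uniformly from the interval spanned by the two parent genes, extended by alpha times its width on both sides:

```python
import numpy as np

def blx_alpha_crossover(parent1, parent2, alpha=0.5, rng=None):
    """BLX-alpha crossover on flattened weight vectors."""
    rng = rng or np.random.default_rng()
    lo = np.minimum(parent1, parent2)
    hi = np.maximum(parent1, parent2)
    span = hi - lo
    # Sample each child gene from the alpha-extended parent interval.
    return rng.uniform(lo - alpha * span, hi + alpha * span)

# Flatten each parent's weight matrix into a genotype, cross, reshape.
w1 = np.random.randn(4, 4)
w2 = np.random.randn(4, 4)
child = blx_alpha_crossover(w1.ravel(), w2.ravel()).reshape(4, 4)
```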

Neural Network Basics

I'm a computer science student and for this year's project I need to create and apply a genetic algorithm to something. I think neural networks would be a good thing to apply it to, but I'm having trouble understanding them. I fully understand the concepts, but none of the websites out there really explains the following, which is blocking my understanding:
How the decision is made for how many nodes there are.
What the nodes actually represent and do.
What part the weights and bias actually play in classification.
Could someone please shed some light on this for me?
Also, I'd really appreciate it if you have any similar ideas for what I could apply a GA to.
Thanks very much! :)
Your question is quite complex and I don't think a short answer will fully satisfy you. Let me try, nonetheless.
First of all, there must be at least three layers in your neural network (assuming a simple feedforward one). The first is the input layer, with one neuron per input. The third layer is the output layer, with one neuron per output value (if you are classifying, there might be more than one if you want to assign a "belongs to" meaning to each neuron). The remaining layer is the hidden one, which sits between the input and output. Determining its size is a complex task, as you can see in the following references:
comp.ai faq
a post on stack exchange
Nevertheless, the best way to proceed would be for you to state your problem more clearly (as far as industrial secrecy might allow) and let us think a little more about your context.
The number of input and output nodes is determined by the number of inputs and outputs you have. The number of intermediate nodes is up to you. There is no "right" number.
Imagine a simple network: inputs (age, sex, country, married), output (chance of death this year). Your network might have two "hidden values", one depending on age and sex, the other depending on country and married. You put weights on each. For example, Hidden1 = age * weight1 + sex * weight2, and Hidden2 = country * weight3 + married * weight4. You then make another set of weights, weight5 and weight6, connecting Hidden1 and Hidden2 to the output variable.
Then you get data from, say, the census, and run it through your neural network to find out which weights best match the data. You can use genetic algorithms to test different sets of weights. This is useful if you have so many edges that you could not try every possible weighting: you need to find good weights without exhaustively trying every possible set, so a GA lets you "evolve" a good set of weights.
Then you test your weights on data from a different census to see how well it worked.
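A minimal sketch of that recipe (my own illustration in Python/NumPy; the toy data, population size, and mutation scale are all made up): a fixed 4-2-1 network whose six weights are evolved by keeping the fitter half of the population and refilling it with mutated copies:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(weights, X):
    """4 inputs -> 2 hidden values -> 1 output, as in the example above.
    weights: [weight1..weight4 for the hidden sums, weight5, weight6]."""
    h1 = X[:, 0] * weights[0] + X[:, 1] * weights[1]   # age, sex
    h2 = X[:, 2] * weights[2] + X[:, 3] * weights[3]   # country, married
    return h1 * weights[4] + h2 * weights[5]

def fitness(weights, X, y):
    return -np.mean((predict(weights, X) - y) ** 2)  # higher is better

# Toy "census" data: 200 rows of the four inputs and an outcome.
X = rng.random((200, 4))
y = 0.8 * X[:, 0] + 0.1 * X[:, 3]  # hidden ground truth

# A minimal GA: sort by fitness, keep the best half, mutate copies.
pop = rng.normal(size=(50, 6))
for generation in range(100):
    pop = pop[np.argsort([-fitness(w, X, y) for w in pop])]
    pop[25:] = pop[:25] + rng.normal(scale=0.1, size=(25, 6))
best = pop[0]
```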
... my major barrier to understanding this though is understanding how the hidden layer actually works; I don't really understand how a neuron functions and what the weights are for...
Every node in the middle layer is a "feature detector" -- it will (hopefully) "light up" (i.e., be strongly activated) in response to some important feature in the input. The weights are what emphasize an aspect of the previous layer; that is, the set of input weights to a neuron corresponds to which nodes in the previous layer are important for that feature.
If a weight connecting myInputNode to myMiddleLayerNode is 0, then you can tell that myInputNode is not important to whatever feature myMiddleLayerNode is detecting. If, though, the weight connecting myInputNode to myMiddleLayerNode is very large (either positive or negative), you know that myInputNode is quite important (if it's very negative it means "No, this feature is almost certainly not there", while if it's very positive it means "Yes, this feature is almost certainly there").
So a corollary of this is that you want the number of your middle-layer nodes to have a correspondence to how many features are needed to classify the input: too few middle-layer nodes and it will be hard to converge during training (since every middle-layer node will have to "double up" on its feature-detection) while too many middle-layer nodes may over-fit your data.
So... a possible use of a genetic algorithm would be to design the architecture of your network! That is, use a GA to set the number of middle-layer nodes and initial weights. Some instances of the population will converge faster and be more robust -- these could be selected for future generations. (Personally, I've never felt this was a great use of GAs since I think it's often faster just to trial-and-error your way into a decent NN architecture, but using GAs this way is not uncommon.)
You might find this wikipedia page on NeuroEvolution of Augmenting Topologies (NEAT) interesting. NEAT is one example of applying genetic algorithms to create the neural network topology.
The best way to explain an Artificial Neural Network (ANN) is to provide the biological process that it attempts to simulate - a neural network. The best example of one is the human brain. So how does the brain work (highly simplified for CS)?
The functional unit (for our purposes) of the brain is the neuron. It is a potential accumulator and "disperser". What that means is that after a certain amount of electric potential (think filling a balloon with air) has been reached, it "fires" (balloon pops). It fires electric signals down any connections it has.
How are neurons connected? Synapses. These synapses can have various weights (in real life due to stronger/weaker synapses from thicker/thinner connections). These weights allow a certain amount of a fired signal to pass through.
You thus have a large collection of neurons connected by synapses - the base representation for your ANN. Note that the input/output structures described by the others are an artifact of the type of problem to which ANNs are applied. Theoretically, any neuron can accept input as well. It serves little purpose in computational tasks however.
So now on to ANNs.
NEURONS: Neurons in an ANN are very similar to their biological counterpart. They are modeled either as step functions (that signal out "1" after a certain combined input signal, or "0" at all other times), or slightly more sophisticated firing sequences (arctan, sigmoid, etc) that produce a continuous output, though scaled similarly to a step. This is closer to the biological reality.
SYNAPSES: These are extremely simple in ANNs - just weights describing the connections between neurons. They are used simply to weight the neurons connected to the current one, but still play a crucial role: the synapses are the cause of the network's output. To clarify, training an ANN with a fixed structure and neuron activation function is simply the modification of the synapse weights. That is it. No other change is made in going from a "dumb" net to one that produces accurate results.
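To make the neuron and synapse descriptions concrete, here is a minimal sketch (my own illustration) of a step neuron versus a sigmoid neuron over the same weighted sum of inputs:

```python
import numpy as np

def step_neuron(inputs, weights, threshold=0.0):
    """Fires "1" once the combined input signal crosses the threshold."""
    return 1.0 if np.dot(inputs, weights) > threshold else 0.0

def sigmoid_neuron(inputs, weights, bias=0.0):
    """Continuous firing rate: same weighted sum, smooth output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(inputs, weights) + bias)))

x = np.array([0.5, 0.9])
w = np.array([1.0, -2.0])   # the "synapses": one excitatory, one inhibitory
print(step_neuron(x, w), sigmoid_neuron(x, w))  # 0.0 and roughly 0.21
```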
STRUCTURE:
There is no "correct" structure for a neural network. The structures are either
a) chosen by hand, or
b) allowed to grow as a result of learning algorithms (a la Cascade-Correlation Networks).
Assuming a hand-picked structure, these are actually chosen through careful analysis of the problem and expected solution. Too few "hidden" neurons/layers, and your structure is not complex enough to approximate a complex function. Too many, and your training time rapidly grows unwieldy. For this reason, the selection of inputs ("features") and the structure of a neural net are, IMO, 99% of the problem. The training and usage of ANNs is trivial in comparison.
To now address your GA concern: it is one of many, many techniques used to train the network by modifying the synapse weights. Why? Because in the end, a neural network's output is simply an extremely high-order surface in N dimensions. ANY surface optimization technique can be used to solve for the weights, and GAs are one such technique. The simple backpropagation method is akin to a dimension-reduced gradient-based optimization technique.

Continuous vs Discrete artificial neural networks

I realize that this is probably a very niche question, but has anyone had experience working with continuous neural networks? I'm specifically interested in what a continuous neural network may be useful for vs what you normally use discrete neural networks for.
For clarity I will clear up what I mean by a continuous neural network, as I suppose it can be interpreted to mean different things. I do not mean that the activation function is continuous. Rather, I allude to the idea of increasing the number of neurons in the hidden layer to an infinite amount.
So for clarity, here is the architecture of your typical discrete NN:
[figure omitted; source: garamatt at sites.google.com]
The x are the inputs, g is the activation of the hidden layer, v are the weights of the hidden layer, w are the weights of the output layer, b is the bias and apparently the output layer has a linear activation (namely, none).
The difference between a discrete NN and a continuous NN is depicted by this figure:
[figure omitted; source: garamatt at sites.google.com]
That is, you let the number of hidden neurons become infinite so that your final output is an integral. In practice this means that instead of computing a finite sum you must approximate the corresponding integral with quadrature.
Apparently it's a common misconception with neural networks that too many hidden neurons produce over-fitting.
My question is specifically: given this definition of discrete and continuous neural networks, has anyone had experience working with the latter, and what sort of things did they use them for?
Further description on the topic can be found here:
http://www.iro.umontreal.ca/~lisa/seminaires/18-04-2006.pdf
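To make the finite-sum versus integral distinction above concrete, here is a minimal sketch (my own illustration; the one-dimensional input and the particular weight and activation functions are made up) comparing a discrete hidden-layer sum with its continuous limit, approximated with quadrature:

```python
import numpy as np
from scipy import integrate

x = 0.7  # a one-dimensional input, for illustration

def hidden(u, x):
    """One "hidden unit" indexed by the continuous parameter u:
    activation g(v(u) * x) with v(u) = u and output weight w(u) = cos(u)."""
    return np.cos(u) * np.tanh(u * x)

# Discrete NN: a crude Riemann-style average over n hidden units on [0, 1].
n = 10
us = np.linspace(0.0, 1.0, n)
discrete_output = np.sum([hidden(u, x) for u in us]) / n

# Continuous NN: the n -> infinity limit becomes an integral,
# approximated here with adaptive quadrature.
continuous_output, _ = integrate.quad(hidden, 0.0, 1.0, args=(x,))
print(discrete_output, continuous_output)
```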
I think this is either only of interest to theoreticians trying to prove that no function is beyond the approximation power of the NN architecture, or it may be a proposition on a method of constructing a piecewise linear approximation (via backpropagation) of a function. If it's the latter, I think there are existing methods that are much faster, less susceptible to local minima, and less prone to overfitting than backpropagation.
My understanding of NN is that the connections and neurons contain a compressed representation of the data it's trained on. The key is that you have a large dataset that requires more memory than the "general lesson" that is salient throughout each example. The NN is supposedly the economical container that will distill this general lesson from that huge corpus.
If your NN has enough hidden units to densely sample the original function, this is equivalent to saying your NN is large enough to memorize the training corpus (as opposed to generalizing from it). Think of the training corpus as also a sample of the original function at a given resolution. If the NN has enough neurons to sample the function at an even higher resolution than your training corpus, then there is simply no pressure for the system to generalize because it's not constrained by the number of neurons to do so.
Since no generalization is induced nor required, you might as well just memorize the corpus by storing all of your training data in memory and using k-nearest neighbor, which will perform at least as well as any NN, even as the NN's sampling resolution approaches infinity.
The term hasn't quite caught on in the machine learning literature, which explains all the confusion. It seems like this was a one-off paper, an interesting one at that, but it hasn't really led to anything, which may mean several things; the author may have simply lost interest.
I do know that Bayesian neural networks (with countably many hidden units; the 'continuous neural networks' paper extends this to the uncountable case) were successfully employed by Radford Neal (see his thesis, which is all about this stuff) to win the NIPS 2003 Feature Selection Challenge.
In the past I've worked on a few research projects using continuous NNs. Activation was done using a bipolar hyperbolic tan; the network took several hundred floating point inputs and output around one hundred floating point values.
In this particular case the aim of the network was to learn the dynamic equations of a mineral train. The network was given the current state of the train and predicted speed, inter-wagon dynamics and other train behaviour 50 seconds into the future.
The rationale for this particular project was mainly performance. This was being targeted at an embedded device, and evaluating the NN was much more performance-friendly than solving a traditional ODE (ordinary differential equation) system.
In general a continuous NN should be able to learn any kind of function. This is particularly useful when it's impossible or extremely difficult to solve a system using deterministic methods, as opposed to binary networks, which are often used for pattern recognition/classification purposes.
Given their non-deterministic nature, NNs of any kind are touchy beasts; choosing the right kinds of inputs and network architecture can be somewhat of a black art.
Feed forward neural networks are always "continuous" -- it's the only way that backpropagation learning actually works (you can't backpropagate through a discrete/step function because it's non-differentiable at the bias threshold).
You might have a discrete (e.g. "one-hot") encoding of the input or target output, but all of the computation is continuous-valued. The output may be constrained (i.e. with a softmax output layer such that the outputs always sum to one, as is common in a classification setting) but again, still continuous.
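For instance, a minimal softmax (my own illustration) that constrains raw continuous outputs to a probability vector summing to one:

```python
import numpy as np

def softmax(z):
    """Map raw continuous outputs to positive values that sum to one."""
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]
```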
If you mean a network that predicts a continuous, unconstrained target -- think of any prediction problem where the "correct answer" isn't discrete, and a linear regression model won't suffice. Recurrent neural networks have at various times been a fashionable method for various financial prediction applications, for example.
Continuous neural networks are not known to be universal approximators (in the sense of density in $L^p$ or $C(\mathbb{R})$ for the topology of uniform convergence on compacts, i.e.: as in the universal approximation theorem) but only universal interpolators in the sense of this paper:
https://arxiv.org/abs/1908.07838
