Algorithm to remove orphan neurons from a neural network

I'm trying to implement NEAT (Neuro Evolution of Augmenting Topologies).
I have a list of network connections, called "genes". A connection between neuron1 and neuron2 would be gene.from = neuron1, gene.to = neuron2.
My task is to generate a neural network from these genes (the neural network is simply a map from index to neuron; gene.from and gene.to are the keys to the neurons in the map).
I have numPossibleInputs input nodes, so we add those first (indices 0 through numPossibleInputs-1 are input neurons).
I have numOutputs output nodes, so we add those as well.
Then, we sort our genes based on their "to" connection indices.
Finally, we create the hidden-layer neurons based on the genes. Since the network is a map, we just check whether the to or from of a connection is already a neuron, and create a new one if not. This algorithm creates networks just fine.
public void generateNetwork()
{
    neuralNetwork.clear();
    for (int i = 0; i < numPossibleInputs; i++)
    {
        neuralNetwork.put(i, new Neuron());
    }
    for (int i = 0; i < numOutputs; i++)
    {
        neuralNetwork.put(i + numPossibleInputs + numPossibleHidden, new Neuron());
    }
    genes.sort((ConnectionGene g1, ConnectionGene g2) -> Integer.compare(g1.toNeuronIndex, g2.toNeuronIndex));
    for (ConnectionGene gene : getCleanGenes(genes))
    {
        if (gene.enabled)
        {
            if (!neuralNetwork.containsKey(gene.toNeuronIndex))
            {
                neuralNetwork.put(gene.toNeuronIndex, new Neuron());
            }
            neuralNetwork.get(gene.toNeuronIndex).incomingConnections.add(gene); // Add this gene to the incoming of the above neuron
            if (!neuralNetwork.containsKey(gene.fromNeuronIndex))
            {
                neuralNetwork.put(gene.fromNeuronIndex, new Neuron());
            }
        }
    }
}
The problem comes when the evolution algorithm turns "off" some of the genes (note the gene.enabled). For example, consider the following genes (There are others, but they are disabled):
2->4
4->4
13->4
0->13
1->13
5->13
We also have disabled genes, 2->5 and 4->13. These can't be used in the network since they aren't being expressed. (This is why I have to generate a new network every generation; genes can be added, enabled, disabled, etc.)
This is for numPossibleInputs == 3, so 0, 1, and 2 are inputs (2 is the bias). 5 is a hidden-layer node, since 3 <= 5 < 3 + 10 = 13. 13 is an output node: I had numPossibleHidden == 10, so the outputs start at 10 + 3 = 13, and there is just one output.
You can picture it like this:
[input input input hidden*10 output*1] for 3 inputs, 10 hidden, and 1 output
This is a picture of that network, naively generated (image: "Simple Network").
As we can see, the reduced network shouldn't have 4 or 5 at all, since they have no influence on any outputs (In this case only one output, 13). The reduced neural network would just be 0->13 and 1->13.
I had some initial thoughts on how to solve this:
A.
1. Loop over each connection and hash the gene.from ids. These are the neuron ids which are an input to something else
2. After populating the hash, loop again and remove any genes whose gene.to is not in the hash (the gene.to is not an input to anything else if it isn't in the hash).
3. Repeat until we don't remove anything
B. Generate the naive network, then crawl backwards through it from each output until we can't go any further (taking care with recurrent cycles), hashing each node we find. Once the graph search is done, compare the hash of found nodes against the total nodes expressed in the gene list. Only use genes whose neurons are in the set of found nodes, and remake the network.
I was hoping to get some feedback on what might be the best algorithm to do this based on my network representation - I'm thinking my B is better than A, but I was hoping there was a more elegant solution that didn't involve me parsing graph topology. Perhaps something clever I can do by sorting the connections (By to, by from)?
Thanks!

I used my B solution above and tested it with all kinds of different network topologies, and it works fine; that is, the network gets rid of all nodes that do not have a proper path from inputs to outputs. I'll post the code here in case anyone wants to use it:
private List<ConnectionGene> cleanGenes(Map<Integer, Neuron> network)
{
    // For each output, go backwards
    Set<Integer> visited = new HashSet<>();
    for (int i = 0; i < numOutputs; i++)
    {
        visited.add(i + numPossibleInputs + numPossibleHidden);
        cleanGenes(i + numPossibleInputs + numPossibleHidden, network, visited);
    }
    List<ConnectionGene> slimGenes = new ArrayList<>();
    for (ConnectionGene gene : genes)
    {
        // Only keep gene if from/to are nodes we visited
        if (visited.contains(gene.fromNeuronIndex) && visited.contains(gene.toNeuronIndex))
        {
            slimGenes.add(gene);
        }
    }
    return slimGenes;
}

private boolean cleanGenes(int neuronIndex, Map<Integer, Neuron> network, Set<Integer> visited)
{
    int numGoodConnections = 0;
    for (ConnectionGene gene : network.get(neuronIndex).incomingConnections)
    {
        numGoodConnections++;
        if (gene.enabled && !visited.contains(gene.fromNeuronIndex))
        {
            visited.add(gene.fromNeuronIndex);
            if (!cleanGenes(gene.fromNeuronIndex, network, visited))
            {
                numGoodConnections--;
                visited.remove(gene.fromNeuronIndex); // We don't want this node in visited; it has no incoming connections and isn't an input.
            }
        }
    }
    if (numGoodConnections == 0)
    {
        return neuronIndex < numPossibleInputs; // True if an input neuron, false if no incoming connections and not an input
    }
    return true; // Success
}
According to my profiler, the vast majority of time spent in this NEAT algorithm is in the simulation itself. That is, generating the proper network is trivial compared to testing the network against a hard task.

There is a much more efficient way to add neurons. Instead of just adding a new neuron and hoping for it to be connected one day, you can take a random connection, split it into two connections, and add a neuron between them.
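That suggestion is NEAT's standard add-node mutation. Here is a minimal sketch (not the author's code): it reuses the question's ConnectionGene field names, but the weight handling and the lack of innovation-number bookkeeping are simplifying assumptions.

```java
import java.util.*;

// Sketch of the suggested add-node mutation: pick a random enabled connection,
// disable it, and bridge it with a new neuron and two fresh connections.
class ConnectionGene {
    int fromNeuronIndex, toNeuronIndex;
    double weight;
    boolean enabled = true;

    ConnectionGene(int from, int to, double weight) {
        this.fromNeuronIndex = from;
        this.toNeuronIndex = to;
        this.weight = weight;
    }
}

class AddNodeMutation {
    // Returns the index of the newly created neuron, or -1 if nothing could be split.
    static int addNode(List<ConnectionGene> genes, int nextNeuronIndex, Random rng) {
        List<ConnectionGene> enabled = new ArrayList<>();
        for (ConnectionGene g : genes) if (g.enabled) enabled.add(g);
        if (enabled.isEmpty()) return -1;
        ConnectionGene split = enabled.get(rng.nextInt(enabled.size()));
        split.enabled = false;
        // Convention from the NEAT paper: the in-connection gets weight 1.0,
        // the out-connection keeps the old weight, preserving initial behavior.
        genes.add(new ConnectionGene(split.fromNeuronIndex, nextNeuronIndex, 1.0));
        genes.add(new ConnectionGene(nextNeuronIndex, split.toNeuronIndex, split.weight));
        return nextNeuronIndex;
    }
}
```

Because the old connection is only disabled, not deleted, a later mutation can re-enable it, which is exactly why the cleanup pass above has to be re-run every generation.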

Related

How to get level(depth) number of two connected nodes in neo4j

I'm using neo4j as a graph database to store users' connection details. I want to show the level of one user with respect to another user in their connections, like LinkedIn: for example, first-layer connection, second-layer connection, third layer, and above the third layer just show "3+". But I don't know how to do this with neo4j; I searched but couldn't find a solution. If anybody knows, please help me implement this functionality.
To find the shortest "connection level" between 2 specific people, just get the shortest path and add 1:
MATCH path = shortestpath((p1:Person)-[*..]-(p2:Person))
WHERE p1.id = 1 AND p2.id = 2
RETURN LENGTH(path) + 1 AS level
NOTE: You may want to put a reasonable upper bound on the variable-length relationship pattern (e.g., [*..6]) to avoid the query taking too long or running out of memory in a large DB. You should probably ignore very distant connections anyway.
It would be something like this:
// get all persons (or users)
MATCH (p:Person)
// create a set of unique combinations, ensuring that you
// do not do double work
WITH COLLECT(p) AS personList
UNWIND personList AS personA
UNWIND personList AS personB
WITH personA,personB
WHERE id(personA) < id(personB)
// find the shortest path between any two nodes
MATCH path=shortestPath( (personA)-[:LINKED_TO*]-(personB) )
// return the distance ( = path length) between the two nodes
RETURN personA.name AS nameA,
personB.name AS nameB,
CASE WHEN length(path) > 3 THEN '3+'
ELSE toString(length(path))
END AS distance

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170-odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input and output has its own distinct set of values, e.g. [ red, green, yellow ] or [ yes, no ] etc.
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n), and then the entropy is -sum(p[i] * log(p[i]) for i in outputs). (Note: since p[i] < 1, log(p[i]) is a negative number, and therefore the entropy is positive. Also note: if p[i] = 0, then we take p[i] * log(p[i]) to be zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
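To make the entropy bookkeeping above concrete, here is a small sketch (not from the answer): each row holds the input values with the output value last, the conditional entropy is the weighted average over the values of v, and all names are my assumptions.

```java
import java.util.*;

// Sketch of the entropy / information-gain calculation described above.
class GainSketch {
    // Shannon entropy (natural log) of a discrete distribution given by counts.
    static double entropy(Collection<Integer> counts, int total) {
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Entropy of the outputs over all rows; each row = {inputs..., output-last}.
    static double outputEntropy(List<int[]> rows) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] r : rows) counts.merge(r[r.length - 1], 1, Integer::sum);
        return entropy(counts.values(), rows.size());
    }

    // Information gained by knowing variable v:
    // H(outputs) minus the weighted average of H(outputs | v = value).
    static double gain(List<int[]> rows, int v) {
        Map<Integer, List<int[]>> byValue = new HashMap<>();
        for (int[] r : rows) byValue.computeIfAbsent(r[v], k -> new ArrayList<>()).add(r);
        double conditional = 0;
        for (List<int[]> group : byValue.values())
            conditional += (double) group.size() / rows.size() * outputEntropy(group);
        return outputEntropy(rows) - conditional;
    }
}
```

Dividing gain(rows, v) by log(in[v]) then gives the ratio used to pick the switch variable; this is essentially the ID3 decision-tree criterion.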
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer to that is that the analysis that I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 each of your branches on your second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this that and the other, here is the output you always get." If there is a reasonably short set of nested ifs that give the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
A tricky approach for the validation: if you can find fuzzing software written for your language, run it with the goal of teasing out every possible internal execution path for each bucket you find. If the fuzzer can't produce answers different from the one the above approach considers best, then you can probably trust it.
The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then, for each output, we eliminate the inputs that do not matter for that output. As a result, for each output we get a matrix showing the output values for all input combinations, excluding the inputs that don't matter for that output.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
    InputPossibleValues = new object[][]
    {
        new object[] { 1, 2, 3, },      // input #0
        new object[] { 'a', 'b', 'c' }, // input #1
        new object[] { "foo", "bar" },  // input #2
    },
    Converters = new System.Func<object[], object>[]
    {
        input => input[0],                                  // output #0
        input => (int)input[0] + (int)(char)input[1],       // output #1
        input => (string)input[2] == "foo" ? 1 : 42,        // output #2
        input => input[2].ToString() + input[1].ToString(), // output #3
        input => (int)input[0] % 2,                         // output #4
    }
};
The heart of the backward conversion is below. The full code, in the form of a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
    // generate all possible input vectors and record the result for each case,
    // then for each output we can figure out which inputs matter
    object[][] inputs = schema.GenerateInputVectors();
    // reversal path
    for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
    {
        List<int> inputsThatDoNotMatter = new List<int>();
        for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
        {
            // find all groups of input vectors where all other inputs (excluding the current one) are the same;
            // if the outputs are exactly the same across each group, the current input
            // does not matter for the given output
            bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
                .Where(x => x.Distinct().Count() > 1)
                .Any();
            if (!inputMatters)
            {
                inputsThatDoNotMatter.Add(inputIdx);
                Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
            }
        }
        // mapping table (only the inputs that matter)
        var mapping = new List<dynamic>();
        foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
        {
            dynamic record = new ExpandoObject();
            object[] sampleInput = inputGroup.First();
            object output = schema.Convert(sampleInput)[outputIdx];
            for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
            {
                if (inputsThatDoNotMatter.Contains(inputIdx))
                    continue;
                AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
            }
            AddProperty(record, $"Output #{outputIdx}", output);
            mapping.Add(record);
        }
        // input x, ..., input y, output z form is needed
        mapping.Dump();
    }
}

How to get all IP addresses that are not in a given range of IP addresses

I need to be able to output all the ranges of IP addresses that are not in a given list of IP addresses ranges.
There is some sort of algorithm that I can use for this kind of task that I can transform into working code?
Basically I will use Salesforce Apex code, so any JAVA like language will do if a given example is possible.
I think the key to an easy solution is to remember that IP addresses can be treated as numbers of type long, and so they can be sorted.
I assumed the excluded ranges are given in a "nice" way, meaning no overlaps, no partial overlaps with the global range, and so on. You can of course add such input checks later on.
In this example I'll treat all network ranges (global, included, excluded) as instances of a NetworkRange class.
Following is the implementation of NetworkRange. Pay attention to the methods splitByExcludedRange and includes.
public class NetworkRange {
    private long startAddress;
    private long endAddress;

    public NetworkRange(String start, String end) {
        startAddress = addressRepresentationToAddress(start);
        endAddress = addressRepresentationToAddress(end);
    }

    public NetworkRange(long start, long end) {
        startAddress = start;
        endAddress = end;
    }

    public String getStartAddress() {
        return addressToAddressRepresentation(startAddress);
    }

    public String getEndAddress() {
        return addressToAddressRepresentation(endAddress);
    }

    static String addressToAddressRepresentation(long address) {
        String result = String.valueOf(address % 256);
        for (int i = 1; i < 4; i++) {
            address = address / 256;
            result = String.valueOf(address % 256) + "." + result;
        }
        return result;
    }

    static long addressRepresentationToAddress(String addressRep) {
        long result = 0L;
        String[] tokens = addressRep.split("\\.");
        for (int i = 0; i < 4; i++) {
            result += Math.pow(256, i) * Long.parseLong(tokens[3 - i]);
        }
        return result;
    }

    public List<NetworkRange> splitByExcludedRange(NetworkRange excludedRange) {
        if (this.startAddress == excludedRange.startAddress && this.endAddress == excludedRange.endAddress)
            return Arrays.asList();
        if (this.startAddress == excludedRange.startAddress)
            return Arrays.asList(new NetworkRange(excludedRange.endAddress + 1, this.endAddress));
        if (this.endAddress == excludedRange.endAddress)
            return Arrays.asList(new NetworkRange(this.startAddress, excludedRange.startAddress - 1));
        return Arrays.asList(new NetworkRange(this.startAddress, excludedRange.startAddress - 1),
                new NetworkRange(excludedRange.endAddress + 1, this.endAddress));
    }

    public boolean includes(NetworkRange excludedRange) {
        return this.startAddress <= excludedRange.startAddress && this.endAddress >= excludedRange.endAddress;
    }

    public String toString() {
        return "[" + getStartAddress() + "-" + getEndAddress() + "]";
    }
}
Now comes the class that calculates the network ranges that remain included. It accepts the global range in its constructor.
public class RangeProducer {
    private NetworkRange global;

    public RangeProducer(NetworkRange global) {
        this.global = global;
    }

    public List<NetworkRange> computeEffectiveRanges(List<NetworkRange> excludedRanges) {
        List<NetworkRange> effectiveRanges = new ArrayList<>();
        effectiveRanges.add(global);
        List<NetworkRange> effectiveRangesSplitted = new ArrayList<>();
        for (NetworkRange excludedRange : excludedRanges) {
            for (NetworkRange effectiveRange : effectiveRanges) {
                if (effectiveRange.includes(excludedRange)) {
                    effectiveRangesSplitted.addAll(effectiveRange.splitByExcludedRange(excludedRange));
                } else {
                    effectiveRangesSplitted.add(effectiveRange);
                }
            }
            effectiveRanges = effectiveRangesSplitted;
            effectiveRangesSplitted = new ArrayList<>();
        }
        return effectiveRanges;
    }
}
You can run the following example:
public static void main(String[] args) {
    NetworkRange global = new NetworkRange("10.0.0.0", "10.255.255.255");
    NetworkRange ex1 = new NetworkRange("10.0.0.0", "10.0.1.255");
    NetworkRange ex2 = new NetworkRange("10.1.0.0", "10.1.1.255");
    NetworkRange ex3 = new NetworkRange("10.6.1.0", "10.6.2.255");
    List<NetworkRange> excluded = Arrays.asList(ex1, ex2, ex3);
    RangeProducer producer = new RangeProducer(global);
    for (NetworkRange effective : producer.computeEffectiveRanges(excluded)) {
        System.out.println(effective);
    }
}
Output should be:
[10.0.2.0-10.0.255.255]
[10.1.2.0-10.6.0.255]
[10.6.3.0-10.255.255.255]
First, I assume you mean that you get one or more disjoint CIDR ranges as input, and need to produce the list of all CIDR ranges not including any of the ones given as input. For convenience, let's further assume that the input does not include the entire IP address space: i.e. 0.0.0.0/0. (That can be accommodated with a single special case but is not of much interest.)
I've written code analogous to this before and, though I'm not at liberty to share the code, I can describe the methodology. It's essentially a binary search algorithm wherein you bisect the full address space repeatedly until you've isolated the one range you're interested in.
Think of the IP address space as a binary tree: At the root is the full IPv4 address space 0.0.0.0/0. Its children each represent half of the address space: 0.0.0.0/1 and 128.0.0.0/1. Those, in turn, can be sub-divided to create children 0.0.0.0/2 / 64.0.0.0/2 and 128.0.0.0/2 / 192.0.0.0/2, respectively. Continue this all the way down and you end up with 2**32 leaves, each of which represents a single /32 (i.e. a single address).
Now, consider this tree to be the parts of the address space that are excluded from your input list. So your task is to traverse this tree, find each range from your input list in the tree, and cut out all parts of the tree that are in your input, leaving the remaining parts of the address space.
Fortunately, you needn't actually create all 2**32 leaves. A node with prefix length N can be assumed to include all nodes at prefix length N+1 and deeper if no children have been created for it (you'll need a flag to remember that it has already been subdivided, i.e. is no longer a leaf; see below for why).
So, to start, the entire address space is present in the tree, but can all be represented by a single leaf node. Call the tree excluded, and initialize it with the single node 0.0.0.0/0.
Now, take the first input range to consider -- we'll call this trial (I'll use 14.27.34.0/24 as the initial trial value just to provide a concrete value for demonstration). The task is to remove trial from excluded leaving the rest of the address space.
Start with current node pointer set to the excluded root node.
Start:
Compare the trial CIDR with current. If it is identical, you're done (but this should never happen if your input ranges are disjoint and you've excluded 0.0.0.0/0 from input).
Otherwise, if current is a leaf node (has not been subdivided, meaning it represents the entire address space at this CIDR level and below), set its sub-divided flag, and create two children for it: a left pointer to the first half of its address space, and a right pointer to the latter half. Label each of these appropriately (for the root node's children, that will be 0.0.0.0/1 and 128.0.0.0/1).
Determine whether the trial CIDR falls within the left side or the right side of current. For our initial trial value, it's to the left. Now, if the pointer on that side is already NULL, again you're done (though again that "can't happen" if your input ranges are disjoint).
If the trial CIDR is exactly equivalent to the CIDR in the node on that side, then simply free the node (and any children it might have, which again should be none if you have only disjoint inputs), set the pointer to that side NULL and you're done. You've just excluded that entire range by cutting that leaf out of the tree.
If the trial value is not exactly equivalent to the CIDR in the node on that side, set current to that side and start over (i.e. jump to Start label above).
So, with the initial input range of 14.27.34.0/24, you will first split 0.0.0.0/0 into 0.0.0.0/1 and 128.0.0.0/1. You will then drop down on the left side and split 0.0.0.0/1 into 0.0.0.0/2 and 64.0.0.0/2. You will then drop down to the left again to create 0.0.0.0/3 and 32.0.0.0/3. And so forth, until after 23 splits, you will then split 14.27.34.0/23 into 14.27.34.0/24 and 14.27.35.0/24. You will then delete the left-hand 14.27.34.0/24 child node and set its pointer to NULL, leaving the other.
That will leave you with a sparse tree containing 24 leaf nodes (after you dropped the target one). The remaining leaf nodes are marked with *:
(ROOT)
0.0.0.0/0
/ \
0.0.0.0/1 128.0.0.0/1*
/ \
0.0.0.0/2 64.0.0.0/2*
/ \
0.0.0.0/3 32.0.0.0/3*
/ \
0.0.0.0/4 16.0.0.0/4*
/ \
*0.0.0.0/5 8.0.0.0/5
/ \
*8.0.0.0/6 12.0.0.0/6
/ \
*12.0.0.0/7 14.0.0.0/7
/ \
14.0.0.0/8 15.0.0.0/8*
/ \
...
/ \
*14.27.32.0/23 14.27.34.0/23
/ \
(null) 14.27.35.0/24*
(14.27.34.0/24)
For each remaining input range, you will run through the tree again, bisecting leaf nodes when necessary, often resulting in more leaves, but always cutting out some part of the address space.
At the end, you simply traverse the resulting tree in whatever order is convenient, collecting the CIDRs of the remaining leaves. Note that in this phase you must exclude those that have previously been subdivided. Consider for example, in the above tree, if you next processed input range 14.27.35.0/24, you would leave 14.27.34.0/23 with no children, but both its halves have been separately cut out and it should not be included in the output. (With some additional complication, you could of course collapse nodes above it to accommodate that scenario as well, but it's easier to just keep a flag in each node.)
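The bisection step in the walkthrough above (splitting a CIDR node into two children with a one-bit-longer prefix) can be sketched as follows, holding IPv4 addresses in a long; the class and method names are illustrative, not from the answer.

```java
// Sketch of splitting one CIDR block into its two halves.
class CidrSplit {
    // Split base/prefix into its two children, each with prefix + 1.
    // Valid for prefix in [0, 31]; addresses are IPv4 values stored in a long.
    static long[][] split(long base, int prefix) {
        long half = 1L << (31 - prefix); // size of each child block in addresses
        return new long[][] { { base, prefix + 1 }, { base + half, prefix + 1 } };
    }
}
```

Applying this repeatedly from 0.0.0.0/0, always descending into the half containing the trial range, reproduces the 23 splits described for 14.27.34.0/24.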
First, what you describe can be simplified to:
you have intervals of the form x.x.x.x - y.y.y.y
you want to output the intervals that are not yet "taken" in this range.
you want to be able to add or remove intervals efficiently
I would suggest the use of an interval tree, where each node stores an interval, and you can efficiently insert and remove nodes; and query for overlaps at a given point (= IP address).
If you can guarantee that there will be no overlaps, you can instead use a simple TreeSet<String>, where you must however guarantee (for correct sorting) that all strings use the xxx.xxx.xxx.xxx-yyy.yyy.yyy.yyy zero-padded format.
Once your intervals are in a tree, you can then generate your desired output, assuming that no intervals overlap, by performing a depth-first pre-order traversal of your tree, and storing the starts and ends of each visited node in a list. Given this list,
pre-pend 0.0.0.0 at the start
append 255.255.255.255 at the end
remove all duplicate ips (which will forcefully be right next to each other in the list)
take them by pairs (the number will always be even), and there you have the intervals of free IPs, perfectly sorted.
Note that 0.0.0.0 and 255.255.255.255 are not actually valid, routable IPs. You should read the relevant RFCs if you really need to output real-world-aware IPs.
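The pairing walk described above amounts to a single pass over the sorted, non-overlapping "taken" intervals. A hedged sketch, where the long[]{start, end} representation (inclusive bounds) is just an assumption for brevity:

```java
import java.util.*;

// Sketch of computing the free gaps within [lo, hi], given sorted,
// non-overlapping "taken" intervals expressed as long[]{start, end}.
class FreeRanges {
    static List<long[]> free(List<long[]> taken, long lo, long hi) {
        List<long[]> result = new ArrayList<>();
        long cursor = lo; // first address not yet accounted for
        for (long[] t : taken) {
            if (t[0] > cursor) result.add(new long[]{cursor, t[0] - 1}); // gap before this interval
            cursor = Math.max(cursor, t[1] + 1);
        }
        if (cursor <= hi) result.add(new long[]{cursor, hi}); // trailing gap
        return result;
    }
}
```

With lo = 0 and hi = 2^32 - 1 this is exactly the prepend/append/pair-up procedure: the cursor plays the role of the previous interval's end, so no explicit de-duplication step is needed.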

Google search suggestion implementation

In a recent amazon interview I was asked to implement Google "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it by using hashing but could not cover the corner cases. Please let me know your thought on this.
The 4 most common types of errors are:
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, we could swap any letters, not just adjacent ones, but that is not a common typo.
The initial state is the typed word. Run BFS/DFS from the initial vertex. The depth of the search is your choice; remember that increasing the depth leads to a dramatically increasing number of "probable corrections". I think a depth of ~4-5 is a good start.
After generating the "probable corrections", search for each generated word-candidate in a dictionary: binary search in a sorted dictionary, or a search in a trie populated with your dictionary.
A trie is faster, but binary search allows searching in a random-access file without loading the dictionary into RAM. You only have to load a precomputed integer array[], where array[i] gives the number of bytes to skip to access the i-th word. Words in the random-access file should be written in sorted order. If you have enough RAM to store the dictionary, use a trie.
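A minimal trie along the lines described might look like this (restricting to lowercase a-z is an assumption for brevity; a real dictionary would need a wider alphabet):

```java
// Minimal trie sketch for the dictionary lookups described above.
class Trie {
    private final Trie[] next = new Trie[26]; // one slot per letter a-z
    private boolean isWord;                   // true if a word ends at this node

    void add(String word) {
        Trie node = this;
        for (char ch : word.toCharArray()) {
            int i = ch - 'a';
            if (node.next[i] == null) node.next[i] = new Trie();
            node = node.next[i];
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Trie node = this;
        for (char ch : word.toCharArray()) {
            int i = ch - 'a';
            if (node.next[i] == null) return false;
            node = node.next[i];
        }
        return node.isWord;
    }
}
```

The same structure can do the "incremental" checks mentioned later: walking one letter at a time lets you abandon a candidate prefix as soon as it leaves the trie.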
Before suggesting corrections, check the typed word: if it is in the dictionary, suggest nothing.
UPDATE
Generating corrections should be done by BFS: when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't work, since it makes many changes that could be done in one step, for example Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
static class Pair
{
    private String word;
    private byte dist; // dist is byte because dist <= 128; moreover, dist <= 6 in a real application

    public Pair(String word, byte dist)
    {
        this.word = word;
        this.dist = dist;
    }

    public String getWord()
    {
        return word;
    }

    public int getDist()
    {
        return dist;
    }
}
public static void main(String[] args) throws Exception
{
    HashSet<String> usedWords = new HashSet<String>();
    ArrayList<String> corrections = new ArrayList<String>();
    HashSet<String> dict = new HashSet<String>();
    ArrayDeque<Pair> states = new ArrayDeque<Pair>();
    // populate the dictionary. In real usage it should be populated from a prepared file.
    dict.add("Jeniffer");
    dict.add("Jeniffert"); // depth-2 test
    usedWords.add("Jniffer");
    states.add(new Pair("Jniffer", (byte)0));
    while (!states.isEmpty())
    {
        Pair head = states.pollFirst();
        //System.out.println(head.getWord()+" "+head.getDist());
        if (head.getDist() <= 2)
        {
            // checking the reached depth;
            // 4 is the first depth where we don't generate anything
            // swap adjacent letters
            for (int i = 0; i < head.getWord().length() - 1; i++)
            {
                // swap the i-th and (i+1)-th letters
                String newWord = head.getWord().substring(0, i) + head.getWord().charAt(i + 1) + head.getWord().charAt(i) + head.getWord().substring(i + 2);
                // even if i == curWord.length()-2 and thus i+2 == curWord.length(),
                // substring(i+2) doesn't throw an exception and returns an empty string;
                // the same holds for substring(0, i) when i == 0
                if (!usedWords.contains(newWord))
                {
                    usedWords.add(newWord);
                    if (dict.contains(newWord))
                    {
                        corrections.add(newWord);
                    }
                    states.addLast(new Pair(newWord, (byte)(head.getDist() + 1)));
                }
            }
            // insert letters
            for (int i = 0; i <= head.getWord().length(); i++)
                for (char ch = 'a'; ch <= 'z'; ch++)
                {
                    String newWord = head.getWord().substring(0, i) + ch + head.getWord().substring(i);
                    if (!usedWords.contains(newWord))
                    {
                        usedWords.add(newWord);
                        if (dict.contains(newWord))
                        {
                            corrections.add(newWord);
                        }
                        states.addLast(new Pair(newWord, (byte)(head.getDist() + 1)));
                    }
                }
        }
    }
    for (String correction : corrections)
    {
        System.out.println("Did you mean " + correction + "?");
    }
    usedWords.clear();
    corrections.clear();
    // helper data structures must be cleared after each generateCorrections call - must be empty for future usage.
}
The words in the dictionary are Jeniffer and Jeniffert (Jeniffert is just for testing).
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I chose a generation depth of 2. In a real application the depth should be 4-6, but as the number of combinations grows exponentially, I don't go that deep here. There are some optimizations for reducing the number of branches in the search tree, but I haven't thought much about them; I wrote only the main idea.
Also, I used a HashSet for storing the dictionary and for labeling used words. HashSet's constant factor seems too large when it contains a million objects. Maybe you should use a trie both for checking whether a word is in the dictionary and for checking whether a word has been labeled.
I didn't implement the erase-letter and change-letter operations because I only want to show the main idea.

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity: this algorithm can be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and this algorithm is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]), and for i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (there exists j in {2..i} such that S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
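The DP above can be sketched like this, with 0-based indexing and a plain Set standing in for isWord (an assumption; reconstruction stores one winning split point per position):

```java
import java.util.*;

// Sketch of the segmentation DP: canSplit[i] is true when the prefix w[0..i)
// can be segmented into dictionary words; prev[i] remembers a split point
// that made it true, so one solution can be rebuilt at the end.
class Segmenter {
    static List<String> split(String w, Set<String> dict) {
        int n = w.length();
        boolean[] canSplit = new boolean[n + 1];
        int[] prev = new int[n + 1];
        canSplit[0] = true; // empty prefix is trivially segmentable
        for (int i = 1; i <= n; i++)
            for (int j = 0; j < i; j++)
                if (canSplit[j] && dict.contains(w.substring(j, i))) {
                    canSplit[i] = true;
                    prev[i] = j;
                    break;
                }
        if (!canSplit[n]) return null; // no segmentation exists
        Deque<String> words = new ArrayDeque<>();
        for (int i = n; i > 0; i = prev[i]) words.addFirst(w.substring(prev[i], i));
        return new ArrayList<>(words);
    }
}
```

Storing a list of split points per position instead of a single prev[i] yields the enumerate-all-solutions variant mentioned above.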
As mentioned by many people here, this is a standard, easy dynamic programming problem; the best solution is given by Falk Hüffner. Some additional info, though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detailed answers, from university-level lectures with pseudo-code algorithms, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach to deal with words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
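The "only continue the most likely one" idea boils down to a Viterbi-style dynamic program. A sketch under the assumption that each dictionary word carries a log-probability (the scores and names here are hypothetical, not from the answer):

```java
import java.util.*;

public class ViterbiSegment {
    // Finds the segmentation maximizing the sum of per-word log-probabilities.
    // best[i] = best score for the prefix of length i; back[i] = split point.
    static List<String> segment(String s, Map<String, Double> logProb) {
        int n = s.length();
        double[] best = new double[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                Double p = logProb.get(s.substring(j, i));
                if (p != null && best[j] + p > best[i]) {
                    best[i] = best[j] + p;
                    back[i] = j;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) return null; // no segmentation
        LinkedList<String> words = new LinkedList<>();
        for (int i = n; i > 0; i = back[i]) {
            words.addFirst(s.substring(back[i], i));
        }
        return words;
    }
}
```

With frequency-derived scores, a common word like "windows" outweighs the rarer "window" + "steam" split without enumerating every segmentation.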
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings (split at neither, either, or both positions).
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
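The matrix M described above can be precomputed with a straightforward double loop; a sketch (names are mine):

```java
import java.util.*;

public class WordMatrix {
    // M[i][j] is true iff s.substring(i, j + 1) is a dictionary word.
    static boolean[][] buildMatrix(String s, Set<String> dict) {
        int n = s.length();
        boolean[][] M = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                M[i][j] = dict.contains(s.substring(i, j + 1));
            }
        }
        return M;
    }
}
```

Any candidate segmentation (i0, j0, i1, j1, ...) can then be rejected as soon as some M[ik, jk] is false, which is the early pruning the answer alludes to.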
If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
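A rough Java rendering of this lookup (the names are hypothetical, and it assumes the needle actually occurs in the haystack):

```java
import java.util.*;

public class HaystackSearch {
    // Strips spaces, remembers how many were dropped before each position,
    // then maps a match in the preprocessed text back to the original.
    static String findOriginal(String haystack, String needle) {
        StringBuilder pre = new StringBuilder();
        List<Integer> numSpaces = new ArrayList<>();
        int spaces = 0;
        for (char c : haystack.toCharArray()) {
            if (c == ' ') { spaces++; continue; }
            pre.append(Character.toLowerCase(c));
            numSpaces.add(spaces);
        }
        int loc = pre.indexOf(needle);
        if (loc < 0) return null;
        int start = loc + numSpaces.get(loc);
        int end = loc + needle.length() + numSpaces.get(loc + needle.length() - 1);
        return haystack.substring(start, end);
    }
}
```

Using the space count before the last matched character (rather than after it) keeps a trailing space out of the returned phrase.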
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With a fairly-sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.
Create a list of possible words and sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove that prefix and append the word to your sentence with a space. Repeat.
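A sketch of this greedy longest-first matching in Java (my own rendering of the steps above; note that greedy matching can dead-end on inputs where the longest prefix word is the wrong choice):

```java
import java.util.*;

public class GreedySplit {
    // Repeatedly take the first (i.e. longest, given a longest-first list)
    // dictionary word that prefixes the remainder of the string.
    static List<String> split(String s, List<String> wordsLongestFirst) {
        List<String> result = new ArrayList<>();
        String rest = s;
        while (!rest.isEmpty()) {
            String match = null;
            for (String w : wordsLongestFirst) {
                if (rest.startsWith(w)) { match = w; break; }
            }
            if (match == null) return null; // greedy dead end
            result.add(match);
            rest = rest.substring(match.length());
        }
        return result;
    }
}
```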
A simple Java solution with O(n^2) running time (memoized on the remaining suffix):

public class Solution {
    // should contain the list of all words; you can use any other
    // data structure instead (e.g. a Trie)
    private HashSet<String> dictionary;

    public Solution(HashSet<String> dictionary) {
        this.dictionary = dictionary;
    }

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    private String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words, so maybe I could share my optimized solution in pseudocode:

string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);

/** this way, one does not query the dictionary for word validity
 * on every substring; it is queried once and for all,
 * eliminating multiple trips to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();

validwords = validwords.sort(length, desc);

array segments = [];
while (mainword != "") {
    bool matched = false;
    for (x = 0; x < validwords.length; x++) {
        if (mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.remove(validwords[x]);
            matched = true;
            break; // restart the scan from the longest words
        }
    }
    /**
     * remove the first character if none of the valid words match,
     * then start again; you may need to keep the dropped character
     * in the result if you want to
     */
    if (!matched) {
        mainword = mainword.substring(1);
    }
}
string result = segments.join(" ");
