Generate “hash” functions programmatically - algorithm

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170-odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input and output has its own distinct set of possible values, e.g. [ red, green, yellow ] or [ yes, no ].
I’m putting together the entire state table using the existing code, and instead of puzzling through the outputs by hand, I was wondering whether there is an algorithmic way of determining an appropriate function to get from the 10 inputs to each result. Note that not all input columns may be required to determine an individual output column; i.e. r124 might depend only on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )

If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n), and then the entropy is - sum(p[i] * log(p[i]) for i in outputs). (Note: since p[i] < 1, log(p[i]) is a negative number, and therefore the entropy is positive. Also note: if p[i] = 0, then we assume that p[i] * log(p[i]) is also zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
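For concreteness, here is that entropy calculation as a minimal Python sketch (the names are illustrative):

from collections import Counter
from math import log

def entropy(column):
    # p[i] = x[i] / n; entropy = -sum(p[i] * log(p[i])), with 0 * log(0)
    # taken as 0 (zero-count outputs simply never appear in the Counter)
    n = len(column)
    return -sum((c / n) * log(c / n) for c in Counter(column).values())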
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
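A minimal Python sketch of that selection step, reusing the entropy helper above. (This uses the size-weighted average of the per-value entropies, as in standard decision-tree induction; with a full, uniform enumeration that coincides with the plain average. It also approximates log(in[v]) by the number of values actually observed.)

from collections import defaultdict
from math import log

def best_split_variable(rows, output_idx):
    # rows: list of (input_vector, output_vector) pairs obtained by running
    # the legacy code, exhaustively or on a random sample
    total = entropy([out[output_idx] for _, out in rows])
    best, best_ratio = None, 0.0
    for v in range(len(rows[0][0])):
        groups = defaultdict(list)
        for inp, out in rows:
            groups[inp[v]].append(out[output_idx])
        if len(groups) < 2:
            continue  # v never varies here, so it tells us nothing
        # entropy left over after switching on v, weighted by group size
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        # information_gained_from_v / information_to_specify_v
        ratio = (total - remainder) / log(len(groups))
        if ratio > best_ratio:
            best, best_ratio = v, ratio
    return best  # switch on this variable, then recurse per value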
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer is that the analysis I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, you'll come up with fairly good entropies for your first level. Repeat with 10,000 samples for each of your branches at the second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this, that, and the other, here is the output you always get." If there is a reasonably short set of nested ifs that gives the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
A tricky approach for the validation: if you can find fuzzing software written for your language, run it with the goal of trying to tease out every possible internal execution path for each bucket you find. If the fuzzing software decides that you can't get different answers than the one you think is best from the above approach, then you can probably trust it.

The algorithm is pretty straightforward. Given the possible values for each input, we can generate all the possible input vectors. Then, for each output, we can eliminate the inputs that do not matter for it. As a result, for each output we get a matrix showing the output values for all combinations of the inputs that do matter.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
    InputPossibleValues = new object[][]
    {
        new object[] { 1, 2, 3, },      // input #0
        new object[] { 'a', 'b', 'c' }, // input #1
        new object[] { "foo", "bar" },  // input #2
    },
    Converters = new System.Func<object[], object>[]
    {
        input => input[0],                                  // output #0
        input => (int)input[0] + (int)(char)input[1],       // output #1
        input => (string)input[2] == "foo" ? 1 : 42,        // output #2
        input => input[2].ToString() + input[1].ToString(), // output #3
        input => (int)input[0] % 2,                         // output #4
    }
};
Sample output omitted (see the linked snippet for the full tables). The heart of the backward conversion is below; the full code, in the form of a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
    // generate all possible input vectors and record the result for each case;
    // then for each output we can figure out which inputs matter
    object[][] inputs = schema.GenerateInputVectors();

    // reversal path
    for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
    {
        List<int> inputsThatDoNotMatter = new List<int>();
        for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
        {
            // group the input vectors so that, within a group, all inputs except the
            // current one are the same; if the outputs are identical across every group,
            // then the current input does not matter for the given output
            bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
                .Where(x => x.Distinct().Count() > 1)
                .Any();

            if (!inputMatters)
            {
                inputsThatDoNotMatter.Add(inputIdx);
                Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
            }
        }

        // mapping table (only inputs that matter)
        var mapping = new List<dynamic>();
        foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
        {
            dynamic record = new ExpandoObject();
            object[] sampleInput = inputGroup.First();
            object output = schema.Convert(sampleInput)[outputIdx];
            for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
            {
                if (inputsThatDoNotMatter.Contains(inputIdx))
                    continue;
                AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
            }
            AddProperty(record, $"Output #{outputIdx}", output);
            mapping.Add(record);
        }

        // the mapping is in the needed form: input x, ..., input y, output z
        mapping.Dump();
    }
}

Related

Boost Confidence of Overlapping Observations In Apache Spark

I'm fairly new to scala/spark, so forgive me if my question is elementary but I've searched everywhere and can't find the answer.
Problem
I'm trying to boost the confidence scores of a bunch of network router observations (observations of probable router types at different network junctions).
I have a type NetblockObservation that combines a device type seen on a network with an associated netblock and a confidence. The confidence is the confidence that we accurately identified the device we saw.
case class NetblockObservation(
  device_type: String,
  ip_start: Long,
  ip_end: Long,
  confidence_score: Double
)
If the confidence is above some threshold thresh, then I want that observation to be in the returned dataset. If it's below thresh, it should not be.
In addition, if I have two observations with the same device_type and one contains the other, the containee should have its confidence increased by the confidence of the container.
Example
Let's say I have 3 Netblock Observations
// 0.0.0.0/28
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 15, confidence_score: .4)
// 0.0.0.0/29
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 7, confidence_score: .4)
// 0.0.0.0/30
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: .4)
With a confidence threshold of 1, I would expect a single output of NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: 1.2)
Explanation: I am allowed to add the confidence scores of NetblockObservations together if one contains the other and they have the same device_type.
I was allowed to add the confidence score of the 0.0.0.0/29 to the confidence of the 0.0.0.0/30 because it's contained within it.
I was not allowed to add the confidence score of 0.0.0.0/30 to the 0.0.0.0/29 because 0.0.0.0/29 is not contained within 0.0.0.0/30.
My (pitiful) Attempt
Failure reason: Too slow / never completed
I attempted to implement this while simultaneously learning scala/spark so I'm not sure if it's the idea or the implementation which is wrong. I think it would eventually work but after an hour, it hadn't completed on a dataset of size 300,000 (small compared to production scale) so I gave up on it.
The idea is to find the largest netblock and separate the data into netblocks which are contained and netblocks which are not contained. The netblocks which are not contained are recursively passed back into the same function. If the largest netblock has a confidence_score of 1, the entire contained dataset is disregarded and the largest is added to the return dataset. If the confidence_score is less than 1, then its confidence_score is added to everything in the contained dataset and that group is recursively passed back to the same function. Eventually, you should only be left with the data which has a confidence_score greater than 1. This algorithm also has the issue of not taking device_type into account.
def handleDataset(largestInNetData: Option[NetblockObservation], netData: RDD[NetblockObservation]): RDD[NetblockObservation] = {
  if (netData.isEmpty) spark.sparkContext.emptyRDD else largestInNetData match {
    case Some(largest) =>
      val grouped = netData.groupBy(item =>
        if (item.ip_start >= largest.ip_start && item.ip_end <= largest.ip_end) largestInNetData
        else None)
      def lookup(k: Option[NetblockObservation]) = grouped.filter(_._1 == k).flatMap(_._2)
      val nos = handleDataset(None, lookup(None))
      // Threshold is assumed to be 1
      val next = if (largest.confidence_score >= 1) spark.sparkContext.parallelize(Seq(largest)) else
        handleDataset(None, lookup(largestInNetData)
          .filter(x => x != largest)
          .map(x => x.copy(confidence_score = x.confidence_score + largest.confidence_score)))
      nos ++ next
    case None =>
      val largest = netData.reduce((a: NetblockObservation, b: NetblockObservation) => if ((a.ip_end - a.ip_start) > (b.ip_end - b.ip_start)) a else b)
      handleDataset(Option(largest), netData)
  }
}
It is a fairly involved bit of code, so here is a general algorithm that I hope will help:
Forget about Spark for a moment and write a Scala function, probably in the companion object for NetblockObservation, that takes a collection of them and returns a subset of that collection that is contained (a rough sketch of this helper follows after this list). You should unit test the heck out of this function, and again, this is pure Scala.
Moving now to Spark. Do a groupBy on your RDD[NetblockObservation] with device_type as the key producing essentially a map of String to Iterable[NetblockObservation].
Filter out all the entries in the map that have a value of size 1 and have a confidence below thresh.
For the entries that remain, apply your function from the first step to the collections of NetblockObservations with a mapValues.
Do a reduceByKey or similar to simply add up the confidence_scores of the contained values.
Enjoy a refreshing beverage.
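To make the first step concrete, here is a rough sketch of that helper, in Python rather than Scala (hypothetical names; it also folds in the confidence-summing of the later steps, and its quadratic loop is the part you would optimize before running it per group):

from collections import namedtuple

NetblockObservation = namedtuple(
    "NetblockObservation", ["device_type", "ip_start", "ip_end", "confidence_score"])

def boost_contained(observations, thresh):
    # Add each container's confidence to every observation it contains, then
    # keep the observations whose boosted confidence clears thresh.
    # Callers are expected to have grouped by device_type already.
    boosted = []
    for a in observations:
        score = a.confidence_score
        for b in observations:
            if b is not a and b.ip_start <= a.ip_start and a.ip_end <= b.ip_end:
                score += b.confidence_score
        boosted.append(a._replace(confidence_score=score))
    return [o for o in boosted if o.confidence_score >= thresh]

On the question's example (three nested "x" netblocks at confidence 0.4 each, thresh 1), this returns only the innermost observation with confidence_score 1.2.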

Algorithm to remove orphan neurons from a neural network

I'm trying to implement NEAT (Neuro Evolution of Augmenting Topologies).
I have a list of network connections, called "genes". A connection between neuron1 and neuron2 would be gene.from = neuron1, gene.to = neuron2.
My task is to generate a neural network from these genes (The neural network is simply a map from index to neuron, the gene.from and gene.to are the keys to the neurons in the map).
I have numPossibleInputs input nodes, so we add those first (indices 0 through numPossibleInputs-1 are input neurons).
I have numOutputs output nodes, so we add those as well.
Then, we sort our genes based on their "to" connection indices.
Finally, we create the hidden layer neurons based on the genes... As the neural network is a map, we just check if the to or from of a connection is already a neuron, otherwise create a new one. This algorithm creates networks just fine.
public void generateNetwork()
{
    neuralNetwork.clear();
    for (int i = 0; i < numPossibleInputs; i++)
    {
        neuralNetwork.put(i, new Neuron());
    }
    for (int i = 0; i < numOutputs; i++)
    {
        neuralNetwork.put(i + numPossibleInputs + numPossibleHidden, new Neuron());
    }
    genes.sort((ConnectionGene g1, ConnectionGene g2) -> Integer.compare(g1.toNeuronIndex, g2.toNeuronIndex));
    for (ConnectionGene gene : getCleanGenes(genes))
    {
        if (gene.enabled)
        {
            if (!neuralNetwork.containsKey(gene.toNeuronIndex))
            {
                neuralNetwork.put(gene.toNeuronIndex, new Neuron());
            }
            // Add this gene to the incoming connections of the target neuron
            neuralNetwork.get(gene.toNeuronIndex).incomingConnections.add(gene);
            if (!neuralNetwork.containsKey(gene.fromNeuronIndex))
            {
                neuralNetwork.put(gene.fromNeuronIndex, new Neuron());
            }
        }
    }
}
}
The problem comes when the evolution algorithm turns "off" some of the genes (note the gene.enabled). For example, consider the following genes (There are others, but they are disabled):
2->4
4->4
13->4
0->13
1->13
5->13
We also have disabled genes, 2->5 and 4->13. These cannot be used in the network as they aren't being expressed. (This is why I have to generate a new network every generation; genes can be added, enabled, disabled, etc.)
This is for numPossibleInputs == 3, so 0, 1 and 2 are inputs (2 is the bias). 5 is a hidden-layer node since 5 >= 3 but less than 3 + 10 = 13. 13 is an output node; I had numPossibleHidden == 10, so 3 + 10 = 13... just one output.
You can picture it like this:
[input input input hidden*10 output*1] for 3 inputs, 10 hidden, and 1 output
This is a picture of that network naively generated:
[image: Simple Network]
As we can see, the reduced network shouldn't have 4 or 5 at all, since they have no influence on any outputs (In this case only one output, 13). The reduced neural network would just be 0->13 and 1->13.
I had some initial thoughts on how to solve this:
A.
1. Loop over each connection and hash the gene.from ids. These are the neuron ids which are an input to something else.
2. After populating the hash, loop again and remove any genes whose gene.to is not in the hash (the gene.to is not an input to anything else if it isn't in the hash).
3. Repeat until we don't remove anything.
B. Generate the naive network... then crawl backwards in the network from each output until we can't go any further (taking care with recurrent cycles). Hash each node we find. Once our graph search is done, compare our hash of found nodes with the total nodes expressed in our gene list. Only use genes whose neurons are in the hash of found nodes, and remake the network.
I was hoping to get some feedback on what might be the best algorithm to do this based on my network representation - I'm thinking my B is better than A, but I was hoping there was a more elegant solution that didn't involve me parsing graph topology. Perhaps something clever I can do by sorting the connections (By to, by from)?
Thanks!
I used my B solution above, tested it with all kinds of different network topologies, and it works fine: that is, the network will get rid of all nodes which do not have a proper path from inputs to outputs. I'll post the code here in case anyone wants to use it:
private List<ConnectionGene> cleanGenes(Map<Integer, Neuron> network)
{
    // For each output, go backwards
    Set<Integer> visited = new HashSet<>();
    for (int i = 0; i < numOutputs; i++)
    {
        visited.add(i + numPossibleInputs + numPossibleHidden);
        cleanGenes(i + numPossibleInputs + numPossibleHidden, network, visited);
    }
    List<ConnectionGene> slimGenes = new ArrayList<>();
    for (ConnectionGene gene : genes)
    {
        // Only keep the gene if both its from/to are nodes we visited
        if (visited.contains(gene.fromNeuronIndex) && visited.contains(gene.toNeuronIndex))
        {
            slimGenes.add(gene);
        }
    }
    return slimGenes;
}

private boolean cleanGenes(int neuronIndex, Map<Integer, Neuron> network, Set<Integer> visited)
{
    int numGoodConnections = 0;
    for (ConnectionGene gene : network.get(neuronIndex).incomingConnections)
    {
        numGoodConnections++;
        if (gene.enabled && !visited.contains(gene.fromNeuronIndex))
        {
            visited.add(gene.fromNeuronIndex);
            if (!cleanGenes(gene.fromNeuronIndex, network, visited))
            {
                numGoodConnections--;
                // We don't want this node in visited; it has no incoming connections and isn't an input.
                visited.remove(gene.fromNeuronIndex);
            }
        }
    }
    if (numGoodConnections == 0)
    {
        // True if an input neuron; false if no incoming connections and not an input
        return neuronIndex < numPossibleInputs;
    }
    return true; // Success
}
According to my profiler, the vast majority of time spent in this NEAT algorithm is in the simulation itself. That is, generating the proper network is trivial compared to testing the network against a hard task.
There is a much more efficient way to add neurons. Instead of just adding a new neuron and hoping for it to be connected one day, you can take a random connection, split it into two connections, and add a neuron between them.
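That is the standard NEAT "add node" mutation. A minimal Python sketch, with a hypothetical ConnectionGene mirroring the fields from the question:

import random
from dataclasses import dataclass

@dataclass
class ConnectionGene:
    from_idx: int
    to_idx: int
    weight: float
    enabled: bool

def add_neuron_by_splitting(genes, next_neuron_idx):
    # Pick a random enabled connection a->b, disable it, and replace it
    # with a->new and new->b.
    enabled = [g for g in genes if g.enabled]
    if not enabled:
        return next_neuron_idx
    old = random.choice(enabled)
    old.enabled = False
    # Conventionally the incoming link gets weight 1.0 and the outgoing link
    # inherits the old weight, so the network's behaviour is initially unchanged.
    genes.append(ConnectionGene(old.from_idx, next_neuron_idx, 1.0, True))
    genes.append(ConnectionGene(next_neuron_idx, old.to_idx, old.weight, True))
    return next_neuron_idx + 1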

How to choose all possible combinations?

Let's assume that we have the list of loans user has like below:
loan1
loan2
loan3
...
loan10
And we have the function which can accept from 2 to 10 loans:
function(loans).
For ex., the following is possible:
function(loan1, loan2)
function(loan1, loan3)
function(loan1, loan4)
function(loan1, loan2, loan3)
function(loan1, loan2, loan4)
function(loan1, loan2, loan3, loan4, loan5, loan6, loan7, loan8, loan9, loan10)
How to write the code to pass all possible combinations to that function?
RosettaCode has implementations of generating combinations in many languages; choose for yourself.
Here's how we could do it in Ruby:
loans = ['loan1', 'loan2', ... , 'loan10']
def my_function(loans)
  array_of_loan_combinations = (2..loans.length).flat_map { |k| loans.combination(k).to_a }
  array_of_loan_combinations.each do |combination|
    # do something
  end
end
To call:
my_function(loans);
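For comparison, the same enumeration in Python is a few lines with itertools.combinations (process_loans stands in for your function):

from itertools import combinations

def call_with_all_combinations(loans, process_loans):
    # every group size from 2 up to all of the loans
    for k in range(2, len(loans) + 1):
        for combo in combinations(loans, k):
            process_loans(combo)  # combo is a tuple of k loans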
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it might be faster than the link you have found.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to the language of your choice.
To solve your problem, you might want to write a new loans function that takes as input an array of loan objects and works on those objects with the BinCoeff class. In C#, to obtain the array of loans for each unique combination, something like the following example code could be used:
void LoanCombinations(Loan[] Loans)
{
    // The Loans array contains all of the loan objects that need
    // to be handled.
    int LoansCount = Loans.Length;
    // Loop through all possible sizes of combination groups,
    // starting with 2 loan objects, then 3, 4, and so forth.
    for (int K = 2; K <= LoansCount; K++)
    {
        // Create the bin coeff object required to get all
        // the combos for this LoansCount choose K combination.
        BinCoeff<int> BC = new BinCoeff<int>(LoansCount, K, false);
        int NumCombos = BinCoeff<int>.GetBinCoeff(LoansCount, K);
        int[] KIndexes = new int[K];
        // Loop through all the combinations for this N choose K.
        for (int Combo = 0; Combo < NumCombos; Combo++)
        {
            // Get the k-indexes for this combination, which in this case
            // are the indexes to each loan in Loans.
            BC.GetKIndexes(Combo, KIndexes);
            // Create a new array of Loan objects that correspond to
            // this combination group.
            Loan[] ComboLoans = new Loan[K];
            for (int Loop = 0; Loop < K; Loop++)
                ComboLoans[Loop] = Loans[KIndexes[Loop]];
            // Call the ProcessLoans function with the loans to be processed.
            ProcessLoans(ComboLoans);
        }
    }
}
I have not tested the above code, but in general it should solve your problem.

In Stata, how do I manipulate matrix elements by their name?

In Stata, after a regression I know it is possible to call the elements of stored results by name. For example, if I want to manipulate the coefficient on the variable precip, I just type _b[precip]. My question is how do I do the same after the tabstat command? For example, say I want to multiply the coefficient on precip by the sample mean of precip:
reg --variables in regression--
tabstat --variables in regression--
mat X=r(StatTotal)
mat Y=_b[precip]*X[1,precip]
Ah, if only it were that simple. But alas, in the last line X[1, precip] is invalid syntax. Oddly, Stata does recognize display X[1, precip]. And Stata would know what I'm trying to do if instead of precip I used the column number where precip appears in the X vector. If I were just doing this operation once, no problem. But I need to do this operation several times (for several different model specifications) and for several variables which change position in the vector from one model to the next, so I cannot just use the column number.
I am not yet sure I understand exactly what you want to do, but here's my attempt to reproduce what you are doing:
sysuse auto, clear
regress price mpg foreign weight
tabstat mpg foreign weight, save
matrix X = r(StatTotal)
matrix Y = _b[mpg]*X[1, colnumb(X, "mpg") ]
If you need to put this into a cycle, that's doable, too:
matrix bb = e(b)
local explvar : colnames bb
foreach x in `explvar' {
    if "`x'" != "_cons" {
        matrix Y_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
    }
    else {
        matrix Y_`x' = _b[`x']
    }
}
You'd probably want to put this into a program that you will call after each regression model estimation call, e.g.:
program define reg2mat
    syntax , prefix(name)
    if "`e(cmd)'" != "regress" {
        // this will intentionally produce an error
        regress
    }
    tempname bb
    matrix `bb' = e(b)
    local explvar : colnames `bb'
    foreach x in `explvar' {
        if "`x'" != "_cons" {
            matrix `prefix'_`x' = _b[`x'] * X[1, colnumb(X, "`x'")]
        }
        else {
            matrix `prefix'_`x' = _b[`x']
        }
    }
end // of reg2mat
At many levels, it is not ideal, as it manipulates the (global) matrices in Stata's memory; most of the time, that is a bad idea, as programs should only manipulate objects local to them.
I suspect that what you want to do is addressed, in one way or another, by the all-powerful margins command, by an appropriate predict, or by matrix score (which is the low-level version of predict). Attributing effects to a single variable only makes sense when your regressors are orthogonal, which only happens in carefully designed and conducted experiments.

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
Update: for those who think this question is just for curiosity: this algorithm can be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = isWord(w[1..i]) or (S[j-1] and isWord(w[j..i]) for some j in {2..i}).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
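The recurrence transcribes directly into Python; here is a sketch that also stores the winning split (a set of words stands in for isWord):

def split_into_words(w, dictionary):
    # best[i] is a list of words covering w[:i], or None if w[:i] cannot be split
    n = len(w)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and w[j:i] in dictionary:
                best[i] = best[j] + [w[j:i]]
                break
    return best[n]

print(split_into_words("stringintowords", {"string", "into", "words"}))
# ['string', 'into', 'words']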
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters; you only need to continue evaluating the most likely one.
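A sketch of that idea for this problem, assuming a unigram model where word_logprob maps each word to its log probability (hypothetical names):

def viterbi_segment(text, word_logprob, max_word_len=20):
    # best[i] is the log probability of the most likely segmentation of text[:i];
    # back[i] remembers where the final word of that segmentation starts
    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * len(text)
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word_len), i):
            w = text[j:i]
            if w in word_logprob and best[j] + word_logprob[w] > best[i]:
                best[i] = best[j] + word_logprob[w]
                back[i] = j
    if best[len(text)] == neg_inf:
        return None  # no segmentation covers the whole string
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))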
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
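A runnable Python sketch of that mapping (it maps the end of the match through its last character, so trailing whitespace is not picked up):

def find_in_original(needle, haystack):
    # Build the preprocessed string and the NumSpaces table in one pass.
    chars, num_spaces, spaces = [], [], 0
    for ch in haystack:
        if ch.isspace():
            spaces += 1
        else:
            chars.append(ch.lower())
            num_spaces.append(spaces)
    preprocessed = "".join(chars)
    location = preprocessed.find(needle.lower())
    if location == -1:
        return None
    start = location + num_spaces[location]
    last = location + len(needle) - 1   # last matched char, preprocessed index
    end = last + num_spaces[last]       # same char, original index
    return haystack[start:end + 1]

print(find_in_original("isashort", "This is a short phrase"))  # "is a short"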
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of the knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in the Wikipedia article.
With a fairly sized dictionary, this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.
Create a list of possible words, sorted from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove it from the string and append it to your sentence with a space. Repeat this.
A simple Java solution which has O(n^2) running time.
import java.util.HashMap;
import java.util.HashSet;

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not hit the dictionary to check word validity
* on every substring; it is only queried once and for all,
* eliminating multiple trips to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
for(x = 0; x < validwords.length; x++){
if(mainword.startswith(validwords[x])) {
segments.push(validwords[x]);
mainword = mainword.remove(v);
x = 0;
}
}
/**
* remove the first character if any of valid words do not match, then start again
* you may need to add the first character to the result if you want to
*/
mainword = mainword.substring(1);
}
string result = segments.join(" ");
