Related
Question came up in relation to this article:
https://threads-iiith.quora.com/String-Hashing-for-competitive-programming
The author presents this algorithm for hashing a string:
where S is our string, Si is the character at index i, and p is a prime number we've chosen.
He then presents the problem of determining whether a substring of a given string is a palindrome and claims it can be done in logarithmic time through hashing.
He makes the point we can calculate from the beginning of our whole string to the right edge of our substring:
and observes that if we calculate the hash from the beginning to the left edge of our substring (F(L-1)), the difference between this and our hash to our right edge is basically the hash of our substring:
This is all fine, and I think I follow it so far. But he then immediately makes the claim that this allows us to calculate our hash (and thus determine if our substring is a palindrome by comparing this hash with the one generated by moving through our substring in reverse order) in logarithmic time.
I feel like I'm probably missing something obvious but how does this allow us to calculate the hash in logarithmic time?
You already know that you can calculate the difference in constant time. Let me restate the difference (I'll leave the modulo away for clarity):
diff = ∑_{i=L to R} S_i ∗ p^i
Note that this is not the hash of the substring because the powers of p are offset by a constant. Instead, this is (as stated in the article)
diff = Hash(S[L,R])∗p^L
To derive the hash of the substring, you have to multiply the difference with p^-L. Assuming that you already know p^-1 (this can be done in a preprocessing step), you need to calculate (p^-1)^L. With the square-and-multiply method, this takes O(log L) operations, which is probably what the author refers to.
This may become more efficient if your queries are sorted by L. In this case, you could calculate p^-L incrementally.
Today I was asked the following question in an interview:
Given an n by n array of integers that contains no duplicates and values that increase from left to right as well as top to bottom, provide an algorithm that checks whether a given value is in the array.
The answer I provided was similar to the answer in this thread:
Algorithm: efficient way to search an integer in a two dimensional integer array?
This solution is O(2n), which I believed to be the optimal solution.
However, the interviewer then informed me that it is possible to solve this problem in sub linear time. I have racked my brain with how to go about doing this, but I am coming up with nothing.
Is a sub linear solution possible, or is this the optimal solution?
The thing to ask yourself is, what information does each comparison give you? It let's you eliminate the rectangle either "above to the left" or "below to the right".
Suppose you do a comparison at 'x' and it tells you what your are looking for is greater:
XXX...
XXX...
XXx...
......
......
'x' - checked space
'X' - check showed this is not a possible location for your data
'.' - still unknown
You have to use this information in a smart way to check the entire rectangle.
Suppose you do a binary search this way on the middle column...
You'll get a result like
XXX...
XXX...
XXX...
XXXXXX
...XXX
...XXX
Two rectangular spaces are left over of half width and possibly full height. What can you do with this information?
I recommend recurring on the 2 resulting subrectangles of '.'. BUT, now instead of choosing the middle column, you choose the middle row to do your binary search on.
So the resulting run time of an N by M rectangle looks like
T(N, M) = log(N) + T(M/2, N)*2
Note the change in indexes because your recursion stack switches between checking columns and rows. The final run time (I didn't bother solving the recursion) should be something like T(M, N) = log(M) + log(N) (it's probably not exactly this but it will be similar).
In writing some code today, I have happened upon a circumstance that has caused me to write a binary search of a kind I have never seen before. Does this binary search have a name, and is it really a "binary" search?
Motivation
First of all, in order to make the search easier to understand, I will explain the use case that spawned its creation.
Say you have a list of ordered numbers. You are asked to find the index of the number in the list that is closest to x.
int findIndexClosestTo(int x);
The calls to findIndexClosestTo() always follow this rule:
If the last result of findIndexClosestTo() was i, then indices closer to i have greater probability of being the result of the current call to findIndexClosestTo().
In other words, the index we need to find this time is more likely to be closer to the last one we found than further from it.
For an example, imagine a simulated boy that walks left and right on the screen. If we are often querying the index of the boy's location, it is likely he is somewhere near the last place we found him.
Algorithm
Given the case above, we know the last result of findIndexClosestTo() was i (if this is actually the first time the function has been called, i defaults to the middle index of the list, for simplicity, although a separate binary search to find the result of the first call would actually be faster), and the function has been called again. Given the new number x, we follow this algorithm to find its index:
interval = 1;
Is the number we're looking for, x, positioned at i? If so, return i;
If not, determine whether x is above or below i. (Remember, the list is sorted.)
Move interval indices in the direction of x.
If we have found x at our new location, return that location.
Double interval. (i.e. interval *= 2)
If we have passed x, go back interval indices, set interval = 1, go to 4.
Given the probability rule stated above (under the Motivation header), this appears to me to be the most efficient way to find the correct index. Do you know of a faster way?
In the worst case, your algorithm is O((log n)^2).
Suppose you start at 0 (with interval = 1), and the value you seek actually resides at location 2^n - 1.
First you will check 1, 2, 4, 8, ..., 2^(n-1), 2^n. Whoops, that overshoots, so go back to 2^(n-1).
Next you check 2^(n-1)+1, 2^(n-1)+2, ..., 2^(n-1)+2^(n-2), 2^(n-1)+2^(n-1). That last term is 2^n, so whoops, that overshot again. Go back to 2^(n-1) + 2^(n-2).
And so on, until you finally reach 2^(n-1) + 2^(n-2) + ... + 1 == 2^n - 1.
The first overshoot took log n steps. The next took (log n)-1 steps. The next took (log n) - 2 steps. And so on.
So, worst case, you took 1 + 2 + 3 + ... + log n == O((log n)^2) steps.
A better idea, I think, is to switch to traditional binary search once you overshoot the first time. That will preserve the O(log n) worst case performance of the algorithm, while tending to be a little faster when the target really is nearby.
I do not know a name for this algorithm, but I do like it. (By a bizarre coincidence, I could have used it yesterday. Really.)
What you are doing is (IMHO) a version of Interpolation search
In a interpolation search you assume numbers are equally distributed, and then you try to guess the location of a number from first and last number and length of the array.
In your case, you are modifying the interpolation-algo such that you assume the Key is very close to the last number you searched.
Also note that your algo is similar to algo where TCP tries to find the optimal packet size. (dont remember the name :( )
Start slow
Double the interval
if Packet fails restart from the last succeeded packet./ Restart
from default packet size.. 3.
Your routine is typical of interpolation routines. You don't lose much if you call it with random numbers (~ standard binary search), but if you call it with slowly increasing numbers, it won't take long to find the correct index.
This is therefore a sensible default behavior for searching an ordered table for interpolation purposes.
This method is discussed with great length in Numerical Recipes 3rd edition, section 3.1.
This is talking off the top of my head, so I've nothing to back it up but gut feeling!
At step 7, if we've passed x, it may be faster to halve interval, and head back towards x - effectively, interval = -(interval / 2), rather than resetting interval to 1.
I'll have to sketch out a few numbers on paper, though...
Edit: Apologies - I'm talking nonsense above: ignore me! (And I'll go away and have a proper think about it this time...)
I have a fixed array of constant integer values about 300 items long (Set A). The goal of the algorithm is to pick two numbers (X and Y) from this array that fit several criteria based on input R.
Formal requirement:
Pick values X and Y from set A such that the expression X*Y/(X+Y) is as close as possible to R.
That's all there is to it. I need a simple algorithm that will do that.
Additional info:
The Set A can be ordered or stored in any way, it will be hard coded eventually. Also, with a little bit of math, it can be shown that the best Y for a given X is the closest value in Set A to the expression X*R/(X-R). Also, X and Y will always be greater than R
From this, I get a simple iterative algorithm that works ok:
int minX = 100000000;
int minY = 100000000;
foreach X in A
if(X<=R)
continue;
else
Y=X*R/(X-R)
Y=FindNearestIn(A, Y);//do search to find closest useable Y value in A
if( X*Y/(X+Y) < minX*minY/(minX+minY) )
then
minX = X;
minY = Y;
end
end
end
I'm looking for a slightly more elegant approach than this brute force method. Suggestions?
For a possibly 'more elegant' solution see Solution 2.
Solution 1)
Why don't you create all the possible 300*300/2 or (300*299/2) possible exact values of R, sort them into an array B say, and then given an R, find the closest value to R in B using binary search and then pick the corresponding X and Y.
I presume that having array B (with the X&Y info) won't be a big memory hog and can easily be hardcoded (using code to write code! :-)).
This will be reasonably fast: worst case ~ 17 comparisons.
Solution 2)
You can possibly also do the following (didn't try proving it, but seems correct):
Maintain an array of the 1/X values, sorted.
Now given an R, you try and find the closest sum to 1/R with two numbers in the array of 1/Xs.
For this you maintain two pointers to the 1/X array, one at the smallest and one at the largest, and keep incrementing one and decrementing the other to find the one closest to 1/R. (This is a classic interview question: Find if a sorted array has two numbers which sum to X)
This will be O(n) comparisons and additions in the worst case. This is also prone to precision issues. You could avoid some of the precision issues by maintaining a reverse sorted array of X's, though.
Two ideas come to my mind:
1) Since the set A is constant, some pre-processing can be helpful. Assuming the value span of A is not too large, you can create an array of size N = max(A). For each index i you can store the closest value in A to i. This way you can improve your algorithm by finding the closest value in constant time, instead of using a binary search.
2) I see that you omit X<=R, and this is correct. If you define that X<=Y, you can restrict the search range even further, since X>2R will yield no solutions either. So the range to be scanned is R<X<=2R, and this guarantees no symetric solutions, and that X<=Y.
When the size of the input is (roughly) constant, an O(n*log(n)) solution might run faster than a particular O(n) solution.
I would start with the solution that you understand the best, and optimize from there if needed.
I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient by which I can separate individual English words from such compound strings.
I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet' and then finding that both results have a low score, abandon into 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudo code is something like this:
parseWords(compoundWord)
# Make the graph
graph = makeGraph()
N = compoundWord.length
for index = 0 to N
graph.addVertex(i)
# Add the edges for each word
for index = 0 to N - 1
for length = 1 to min(N - index, MAX_WORD_LENGTH)
potentialWord = compoundWord.substr(index, length)
if dictionary.isElement(potentialWord)
graph.addEdge(index, index + length, 1)
# Now find a list of edges which define the shortest path
edges = graph.shortestPath(0, N)
# Change these edges back into words.
result = makeList()
for e in edges
result.add(compoundWord.substr(e.start, e.stop - e.start + 1))
return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: looks as if I am going to have to go to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
from __future__ import with_statement
def opendict(dictionary=r"g:\words\words(3).txt"):
with open(dictionary, "r") as f:
return set(line.strip() for line in f)
if __name__ == '__main__':
s = opendict()
for word in sorted(s):
if len(word) >= 10:
for i in range(4, len(word)-4):
left, right = word[:i], word[i:]
if (left in s) and (right in s):
if right not in ('nesses', ):
print word, left, right
It sounds to me like you want to store you dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spicejet" where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje' 'picejet'
... 26 substrings.
So, find a function to generate all those (slide across your string using strides of 2, 3, 4 ... (len(yourstring) - 1) and then simply check each of those in a set or hash table.
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# python-ish pseudocode
def splitword(word):
# word is a character array indexed from 0..n-1
for i from 1 to n-1:
head = word[:i] # first i characters
tail = word[i:] # everything else
if is_word(head):
if i == n-1:
return [head] # this was the only valid word; return it as a 1-element list
else:
rest = splitword(tail)
if rest != []: # check whether we successfully split the tail into words
return [head] + rest
return [] # No successful split found, and 'word' is not a word.
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
Start with the sorted list of words
to split, and a sorted list of
declined words (dictionary).
Create a result list of objects
which should store: remaining word
and list of matched words.
Fill the result list with the words
to split as remaining words.
Walk through the result array and
the dictionary concurrently --
always increasing the least of the
two, in a manner similar to the
merge algorithm. In this way you can
compare all the possible matching
pairs in one pass.
Any time you find a match, i.e. a
split words word that starts with a
dictionary word, replace the
matching dictionary word and the
remaining part in the result list.
You have to take into account
possible multiples.
Any time the remaining part is empty,
you found a final result.
Any time you don't find a match on
the "left side", in other words,
every time you increment the result
pointer because of no match, delete
the corresponding result item. This
word has no matches and can't be
split.
Once you get to the bottom of the
lists, you will have a list of
partial results. Repeat the loop
until this is empty -- go to point 4.