find keystrokes for On Screen Keyboard scala - algorithm

I am trying to solve a recent interview question using Scala..
You have an on screen keyboard which is a grid of 6 rows , 5 columns each. With alphabets from A to Z and blank space are arranged in the grid row first.
You can use this on screen keyboard to type words.. by using your TV Remote by press Left, Right, Up , Down or OK keys to type each character.
Question: given an input string, find the sequence of keystrokes needed to be pressed on the remote to type the input.
The code implementation can be found at
https://github.com/mradityagoyal/scala/blob/master/OnScrKb/src/main/scala/OnScrKB.scala
I have tried to solve this using three different approaches..
Simple forldLeft.
def keystrokesByFL(input: String, startChar: Char = 'A'): String = {
val zero = ("", startChar)
//(acc, last) + next => (acc+ aToB , next)
def op(zero: (String, Char), next: Char): (String, Char) = zero match {
case (acc, last) => (acc + path(last, next), next)
}
val result = input.foldLeft(zero)(op)
result._1
}
divide and conquer - Uses divide and conquer mechanism. The algorithm is similar to merge sort. * We split the input word into two if the length is > 3 * we recursively call the subroutine to get the path of left and right halves from the split. * In the end.. we add the keystrokes for first + keystrokes from end of first string to start of second string + keystrokes for second. * Essentially we divide the input string in two smaller halves till we get to size 4. for smaller than 4 we use the fold right.
def keystrokesByDnQ(input: String, startChar: Char = 'A'): String = {
def splitAndMerge(in: String, startChar: Char): String = {
if (in.length() < 4) {
//if length is <4 then dont split.. as you might end up with one side having only 1 char.
keystrokesByFL(in, startChar)
} else {
//split
val (x, y) = in.splitAt(in.length() / 2)
splitAndMerge(x, startChar) + splitAndMerge(y, x.last)
}
}
splitAndMerge(input, startChar)
}
Fold - uses the property that the underlying operation is associative (but not commutative). * For eg.. the keystrokes("ABCDEFGHI", startChar = 'A') == keystrokes("ABC", startChar='A')+keystrokes("DEF", 'C') + keystrokes("GHI", 'F')
def keystrokesByF(input: String, startChar: Char = 'A'): String = {
val mapped = input.map { x => PathAcc(text = "" + x, path = "") } // map each character in input to case class PathAcc("CharAsString", "")
val z = PathAcc(text = ""+startChar, path = "") //the starting char.
def op(left: PathAcc, right: PathAcc): PathAcc = {
PathAcc(text = left.text + right.text, path = left.path + path(left.text.last, right.text.head) + right.path)
}
val foldresult = mapped.fold(z)(op)
foldresult.path
}
My questions:
1. Is the divide and conquer approach better than Fold?
are Fold and Divide and conquer better than foldLeft (for this specific problem)
Is there a way i can represent the divide and conquer approach or the Fold approach as a Monad? I can see the associative law being satisfied... but i am not able to figure out if a monoid is present here.. and if yes.. what does it achieve for me?
Is Divide and conquer approach the best one available for this particular problem?
Which approach is better suited for spark?
Any suggestions are welcome..

Here's how I would do it:
def keystrokes(input: String, start: Char): String =
((start + input) zip input).par.map((path _).tupled).fold("")(_ ++ _)
The main point here is using the par method to parallelize the sequence of (Char, Char), so that it can parallelize the map, and take the optimal implementation for fold.
The algorithm simply take the characters in the String two by two (representing the units of path to be walked), computes the path between them, and then concatenates the result. Note that fold("")(_ ++ _) is basically mkString (although mkString on parallel collection is implemented by seq.mkString so it is much less efficient).
What your implementations dearly miss is parallelization of tasks. Even in your divide-and-conquer approach, you never run code in parallel, so you will wait for the first half to be finished before starting the second half (even though they are totally independant).
Assuming you use parallelization, the classical implementation of fold on parallel sequences is precisely the divide-and-conquer algorithm you described, but it may be that it is better optimized (for instance, it may choose another value than 3 for chunk size, I tend to trust the scala-collection implementers on these matters).
Note that fold on String is probably implemented with foldLeft, so there is no added value than what you did with foldLeft, unless you use .par before.
Back to your questions (I'll mostly repeat what I just said):
1) Yes, the divide and conquer is better than fold... on String (but not on parallelized String)
2) Fold can only be better than FoldLeft with some kind of parallelization, in which case it will be as good as (or better than, if there is a better implementation for a particular parallelized collection) divide-and-conquer.
3) I don't see what monads have to do with anything here. the operator and zero for fold must indeed form a monoid (otherwise, you'll have some problems with operation ordering if the operator is not associative, and unwanted noise if zero is not a neutral element).
4) Yes, that I know of, once parallelized
5) Spark is inherently parallel, so the main issue would be to join all the pieces back together in the end. What I mean is that an RDD is not ordered, so you'll need to keep some information on which piece of input should be put where in your cluster. Once you've done that correctly (using partitions and such, this would probably be a whole question itself), using map and fold still works as a charm (Spark was designed to have an API as close as possible to scala-collection, so that's really nice here).

Related

Efficient Algorithm to find combination of numbers for an answer [duplicate]

I'm working on a homework problem that asks me this:
Tiven a finite set of numbers, and a target number, find if the set can be used to calculate the target number using basic math operations (add, sub, mult, div) and using each number in the set exactly once (so I need to exhaust the set). This has to be done with recursion.
So, for example, if I have the set
{1, 2, 3, 4}
and target 10, then I could get to it by using
((3 * 4) - 2)/1 = 10.
I'm trying to phrase the algorithm in pseudo-code, but so far haven't gotten too far. I'm thinking graphs are the way to go, but would definitely appreciate help on this. thanks.
This isn't meant to be the fastest solution, but rather an instructive one.
It recursively generates all equations in postfix notation
It also provides a translation from postfix to infix notation
There is no actual arithmetic calculation done, so you have to implement that on your own
Be careful about division by zero
With 4 operands, 4 possible operators, it generates all 7680 = 5 * 4! * 4^3
possible expressions.
5 is Catalan(3). Catalan(N) is the number of ways to paranthesize N+1 operands.
4! because the 4 operands are permutable
4^3 because the 3 operators each have 4 choice
This definitely does not scale well, as the number of expressions for N operands is [1, 8, 192, 7680, 430080, 30965760, 2724986880, ...].
In general, if you have n+1 operands, and must insert n operators chosen from k possibilities, then there are (2n)!/n! k^n possible equations.
Good luck!
import java.util.*;
public class Expressions {
static String operators = "+-/*";
static String translate(String postfix) {
Stack<String> expr = new Stack<String>();
Scanner sc = new Scanner(postfix);
while (sc.hasNext()) {
String t = sc.next();
if (operators.indexOf(t) == -1) {
expr.push(t);
} else {
expr.push("(" + expr.pop() + t + expr.pop() + ")");
}
}
return expr.pop();
}
static void brute(Integer[] numbers, int stackHeight, String eq) {
if (stackHeight >= 2) {
for (char op : operators.toCharArray()) {
brute(numbers, stackHeight - 1, eq + " " + op);
}
}
boolean allUsedUp = true;
for (int i = 0; i < numbers.length; i++) {
if (numbers[i] != null) {
allUsedUp = false;
Integer n = numbers[i];
numbers[i] = null;
brute(numbers, stackHeight + 1, eq + " " + n);
numbers[i] = n;
}
}
if (allUsedUp && stackHeight == 1) {
System.out.println(eq + " === " + translate(eq));
}
}
static void expression(Integer... numbers) {
brute(numbers, 0, "");
}
public static void main(String args[]) {
expression(1, 2, 3, 4);
}
}
Before thinking about how to solve the problem (like with graphs), it really helps to just look at the problem. If you find yourself stuck and can't seem to come up with any pseudo-code, then most likely there is something that is holding you back; Some other question or concern that hasn't been addressed yet. An example 'sticky' question in this case might be, "What exactly is recursive about this problem?"
Before you read the next paragraph, try to answer this question first. If you knew what was recursive about the problem, then writing a recursive method to solve it might not be very difficult.
You want to know if some expression that uses a set of numbers (each number used only once) gives you a target value. There are four binary operations, each with an inverse. So, in other words, you want to know if the first number operated with some expression of the other numbers gives you the target. Well, in other words, you want to know if some expression of the 'other' numbers is [...]. If not, then using the first operation with the first number doesn't really give you what you need, so try the other ops. If they don't work, then maybe it just wasn't meant to be.
Edit: I thought of this for an infix expression of four operators without parenthesis, since a comment on the original question said that parenthesis were added for the sake of an example (for clarity?) and the use of parenthesis was not explicitly stated.
Well, you didn't mention efficiency so I'm going to post a really brute force solution and let you optimize it if you want to. Since you can have parantheses, it's easy to brute force it using Reverse Polish Notation:
First of all, if your set has n numbers, you must use exactly n - 1 operators. So your solution will be given by a sequence of 2n - 1 symbols from {{your given set}, {*, /, +, -}}
st = a stack of length 2n - 1
n = numbers in your set
a = your set, to which you add *, /, +, -
v[i] = 1 if the NUMBER i has been used before, 0 otherwise
void go(int k)
{
if ( k > 2n - 1 )
{
// eval st as described on Wikipedia.
// Careful though, it might not be valid, so you'll have to check that it is
// if it evals to your target value great, you can build your target from the given numbers. Otherwise, go on.
return;
}
for ( each symbol x in a )
if ( x isn't a number or x is a number but v[x] isn't 1 )
{
st[k] = x;
if ( x is a number )
v[x] = 1;
go(k + 1);
}
}
Generally speaking, when you need to do something recursively it helps to start from the "bottom" and think your way up.
Consider: You have a set S of n numbers {a,b,c,...}, and a set of four operations {+,-,*,/}. Let's call your recursive function that operates on the set F(S)
If n is 1, then F(S) will just be that number.
If n is 2, F(S) can be eight things:
pick your left-hand number from S (2 choices)
then pick an operation to apply (4 choices)
your right-hand number will be whatever is left in the set
Now, you can generalize from the n=2 case:
Pick a number x from S to be the left-hand operand (n choices)
Pick an operation to apply
your right hand number will be F(S-x)
I'll let you take it from here. :)
edit: Mark poses a valid criticism; the above method won't get absolutely everything. To fix that problem, you need to think about it in a slightly different way:
At each step, you first pick an operation (4 choices), and then
partition S into two sets, for the left and right hand operands,
and recursively apply F to both partitions
Finding all partitions of a set into 2 parts isn't trivial itself, though.
Your best clue about how to approach this problem is the fact that your teacher/professor wants you to use recursion. That is, this isn't a math problem - it is a search problem.
Not to give too much away (it is homework after all), but you have to spawn a call to the recursive function using an operator, a number and a list containing the remaining numbers. The recursive function will extract a number from the list and, using the operation passed in, combine it with the number passed in (which is your running total). Take the running total and call yourself again with the remaining items on the list (you'll have to iterate the list within the call but the sequence of calls is depth-first). Do this once for each of the four operators unless Success has been achieved by a previous leg of the search.
I updated this to use a list instead of a stack
When the result of the operation is your target number and your list is empty, then you have successfully found the set of operations (those that traced the path to the successful leaf) - set the Success flag and unwind. Note that the operators aren't on a list nor are they in the call: the function itself always iterates over all four. Your mechanism for "unwinding" the operator sequence from the successful leaf to get the sequence is to return the current operator and number prepended to the value returned by recursive call (only one of which will be successful since you stop at success - that, obviously, is the one to use). If none are successful, then what you return isn't important anyhow.
Update This is much harder when you have to consider expressions like the one that Daniel posted. You have combinatorics on the numbers and the groupings (numbers due to the fact that / and - are order sensitive even without grouping and grouping because it changes precedence). Then, of course, you also have the combinatorics of the operations. It is harder to manage the differences between (4 + 3) * 2 and 4 + (3 * 2) because grouping doesn't recurse like operators or numbers (which you can just iterate over in a breadth-first manner while making your (depth-first) recursive calls).
Here's some Python code to get you started: it just prints all the possible expressions, without worrying too much about redundancy. You'd need to modify it to evaluate expressions and compare to the target number, rather than simply printing them.
The basic idea is: given a set S of numbers, partition S into two subsets left and right in all possible ways (where we don't care about the order or the elements in left and right), such that left and right are both nonempty. Now for each of these partitions, find all ways of combining the elements in left (recursively!), and similarly for right, and combine the two resulting values with all possible operators. The recursion bottoms out when a set has just one element, in which case there's only one value possible.
Even if you don't know Python, the expressions function should be reasonably easy to follow; the splittings function contains some Python oddities, but all it does is to find all the partitions of the list l into left and right pieces.
def splittings(l):
n = len(l)
for i in xrange(2**n):
left = [e for b, e in enumerate(l) if i & 2**b]
right = [e for b, e in enumerate(l) if not i & 2**b]
yield left, right
def expressions(l):
if len(l) == 1:
yield l[0]
else:
for left, right in splittings(l):
if not left or not right:
continue
for el in expressions(left):
for er in expressions(right):
for operator in '+-*/':
yield '(' + el + operator + er + ')'
for x in expressions('1234'):
print x
pusedo code:
Works(list, target)
for n in list
tmp=list.remove(n)
return Works(tmp,target+n) or Works(tmp,target-n) or Works(tmp, n-target) or ...
then you just have to put the base case in. I think I gave away to much.

Computing target number from numbers in a set

I'm working on a homework problem that asks me this:
Tiven a finite set of numbers, and a target number, find if the set can be used to calculate the target number using basic math operations (add, sub, mult, div) and using each number in the set exactly once (so I need to exhaust the set). This has to be done with recursion.
So, for example, if I have the set
{1, 2, 3, 4}
and target 10, then I could get to it by using
((3 * 4) - 2)/1 = 10.
I'm trying to phrase the algorithm in pseudo-code, but so far haven't gotten too far. I'm thinking graphs are the way to go, but would definitely appreciate help on this. thanks.
This isn't meant to be the fastest solution, but rather an instructive one.
It recursively generates all equations in postfix notation
It also provides a translation from postfix to infix notation
There is no actual arithmetic calculation done, so you have to implement that on your own
Be careful about division by zero
With 4 operands, 4 possible operators, it generates all 7680 = 5 * 4! * 4^3
possible expressions.
5 is Catalan(3). Catalan(N) is the number of ways to paranthesize N+1 operands.
4! because the 4 operands are permutable
4^3 because the 3 operators each have 4 choice
This definitely does not scale well, as the number of expressions for N operands is [1, 8, 192, 7680, 430080, 30965760, 2724986880, ...].
In general, if you have n+1 operands, and must insert n operators chosen from k possibilities, then there are (2n)!/n! k^n possible equations.
Good luck!
import java.util.*;
public class Expressions {
static String operators = "+-/*";
static String translate(String postfix) {
Stack<String> expr = new Stack<String>();
Scanner sc = new Scanner(postfix);
while (sc.hasNext()) {
String t = sc.next();
if (operators.indexOf(t) == -1) {
expr.push(t);
} else {
expr.push("(" + expr.pop() + t + expr.pop() + ")");
}
}
return expr.pop();
}
static void brute(Integer[] numbers, int stackHeight, String eq) {
if (stackHeight >= 2) {
for (char op : operators.toCharArray()) {
brute(numbers, stackHeight - 1, eq + " " + op);
}
}
boolean allUsedUp = true;
for (int i = 0; i < numbers.length; i++) {
if (numbers[i] != null) {
allUsedUp = false;
Integer n = numbers[i];
numbers[i] = null;
brute(numbers, stackHeight + 1, eq + " " + n);
numbers[i] = n;
}
}
if (allUsedUp && stackHeight == 1) {
System.out.println(eq + " === " + translate(eq));
}
}
static void expression(Integer... numbers) {
brute(numbers, 0, "");
}
public static void main(String args[]) {
expression(1, 2, 3, 4);
}
}
Before thinking about how to solve the problem (like with graphs), it really helps to just look at the problem. If you find yourself stuck and can't seem to come up with any pseudo-code, then most likely there is something that is holding you back; Some other question or concern that hasn't been addressed yet. An example 'sticky' question in this case might be, "What exactly is recursive about this problem?"
Before you read the next paragraph, try to answer this question first. If you knew what was recursive about the problem, then writing a recursive method to solve it might not be very difficult.
You want to know if some expression that uses a set of numbers (each number used only once) gives you a target value. There are four binary operations, each with an inverse. So, in other words, you want to know if the first number operated with some expression of the other numbers gives you the target. Well, in other words, you want to know if some expression of the 'other' numbers is [...]. If not, then using the first operation with the first number doesn't really give you what you need, so try the other ops. If they don't work, then maybe it just wasn't meant to be.
Edit: I thought of this for an infix expression of four operators without parenthesis, since a comment on the original question said that parenthesis were added for the sake of an example (for clarity?) and the use of parenthesis was not explicitly stated.
Well, you didn't mention efficiency so I'm going to post a really brute force solution and let you optimize it if you want to. Since you can have parantheses, it's easy to brute force it using Reverse Polish Notation:
First of all, if your set has n numbers, you must use exactly n - 1 operators. So your solution will be given by a sequence of 2n - 1 symbols from {{your given set}, {*, /, +, -}}
st = a stack of length 2n - 1
n = numbers in your set
a = your set, to which you add *, /, +, -
v[i] = 1 if the NUMBER i has been used before, 0 otherwise
void go(int k)
{
if ( k > 2n - 1 )
{
// eval st as described on Wikipedia.
// Careful though, it might not be valid, so you'll have to check that it is
// if it evals to your target value great, you can build your target from the given numbers. Otherwise, go on.
return;
}
for ( each symbol x in a )
if ( x isn't a number or x is a number but v[x] isn't 1 )
{
st[k] = x;
if ( x is a number )
v[x] = 1;
go(k + 1);
}
}
Generally speaking, when you need to do something recursively it helps to start from the "bottom" and think your way up.
Consider: You have a set S of n numbers {a,b,c,...}, and a set of four operations {+,-,*,/}. Let's call your recursive function that operates on the set F(S)
If n is 1, then F(S) will just be that number.
If n is 2, F(S) can be eight things:
pick your left-hand number from S (2 choices)
then pick an operation to apply (4 choices)
your right-hand number will be whatever is left in the set
Now, you can generalize from the n=2 case:
Pick a number x from S to be the left-hand operand (n choices)
Pick an operation to apply
your right hand number will be F(S-x)
I'll let you take it from here. :)
edit: Mark poses a valid criticism; the above method won't get absolutely everything. To fix that problem, you need to think about it in a slightly different way:
At each step, you first pick an operation (4 choices), and then
partition S into two sets, for the left and right hand operands,
and recursively apply F to both partitions
Finding all partitions of a set into 2 parts isn't trivial itself, though.
Your best clue about how to approach this problem is the fact that your teacher/professor wants you to use recursion. That is, this isn't a math problem - it is a search problem.
Not to give too much away (it is homework after all), but you have to spawn a call to the recursive function using an operator, a number and a list containing the remaining numbers. The recursive function will extract a number from the list and, using the operation passed in, combine it with the number passed in (which is your running total). Take the running total and call yourself again with the remaining items on the list (you'll have to iterate the list within the call but the sequence of calls is depth-first). Do this once for each of the four operators unless Success has been achieved by a previous leg of the search.
I updated this to use a list instead of a stack
When the result of the operation is your target number and your list is empty, then you have successfully found the set of operations (those that traced the path to the successful leaf) - set the Success flag and unwind. Note that the operators aren't on a list nor are they in the call: the function itself always iterates over all four. Your mechanism for "unwinding" the operator sequence from the successful leaf to get the sequence is to return the current operator and number prepended to the value returned by recursive call (only one of which will be successful since you stop at success - that, obviously, is the one to use). If none are successful, then what you return isn't important anyhow.
Update This is much harder when you have to consider expressions like the one that Daniel posted. You have combinatorics on the numbers and the groupings (numbers due to the fact that / and - are order sensitive even without grouping and grouping because it changes precedence). Then, of course, you also have the combinatorics of the operations. It is harder to manage the differences between (4 + 3) * 2 and 4 + (3 * 2) because grouping doesn't recurse like operators or numbers (which you can just iterate over in a breadth-first manner while making your (depth-first) recursive calls).
Here's some Python code to get you started: it just prints all the possible expressions, without worrying too much about redundancy. You'd need to modify it to evaluate expressions and compare to the target number, rather than simply printing them.
The basic idea is: given a set S of numbers, partition S into two subsets left and right in all possible ways (where we don't care about the order or the elements in left and right), such that left and right are both nonempty. Now for each of these partitions, find all ways of combining the elements in left (recursively!), and similarly for right, and combine the two resulting values with all possible operators. The recursion bottoms out when a set has just one element, in which case there's only one value possible.
Even if you don't know Python, the expressions function should be reasonably easy to follow; the splittings function contains some Python oddities, but all it does is to find all the partitions of the list l into left and right pieces.
def splittings(l):
n = len(l)
for i in xrange(2**n):
left = [e for b, e in enumerate(l) if i & 2**b]
right = [e for b, e in enumerate(l) if not i & 2**b]
yield left, right
def expressions(l):
if len(l) == 1:
yield l[0]
else:
for left, right in splittings(l):
if not left or not right:
continue
for el in expressions(left):
for er in expressions(right):
for operator in '+-*/':
yield '(' + el + operator + er + ')'
for x in expressions('1234'):
print x
pusedo code:
Works(list, target)
for n in list
tmp=list.remove(n)
return Works(tmp,target+n) or Works(tmp,target-n) or Works(tmp, n-target) or ...
then you just have to put the base case in. I think I gave away to much.

How can I randomly iterate through a large Range?

I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:
class Array
def shuffle
ret = dup
j = length
i = 0
while j > 1
r = i + rand(j)
ret[i], ret[r] = ret[r], ret[i]
i += 1
j -= 1
end
ret
end
end
(0..9).to_a.shuffle.each{|x| f(x)}
where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.
My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
x = rand(bigint)
redo if tried[x]
tried[x] = true
f(x) # some function
}
This code is very naive and quickly runs out of memory as tried obtains more entries.
What sort of algorithm can accomplish what I am trying to do?
[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for a N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.
[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x) so memory should remain constant over time.
[Edit3/4/5/6]: Further clarification/fixes.
[Solution]: The following code is based on #bta's solution. For the sake of conciseness, next_prime is not shown. It produces acceptable randomness and only visits each number once. See the actual post for more details.
N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)
x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:
Define the range to be from 0 to
some number N
Generate a random starting point x[0] inside N
Generate an iterator Q less than N
Generate successive points x[n] by adding Q to
the previous point and wrapping around if needed. That
is, x[n+1] = (x[n] + Q) % N
Repeat until you generate a new point equal to the starting point.
The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer the number to the bounds of the range the less 'random' the input). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.
This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.
Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?
As #Turtle answered, you problem doesn't have a solution. #KandadaBoggu and #bta solution gives you random numbers is some ranges which are or are not random. You get clusters of numbers.
But I don't know why you care about double occurence of the same number. If (0..99**99) is your range, then if you could generate 10^10 random numbers per second (if you have a 3 GHz processor and about 4 cores on which you generate one random number per CPU cycle - which is imposible, and ruby will even slow it down a lot), then it would take about 10^180 years to exhaust all the numbers. You have also probability about 10^-180 that two identical numbers will be generated during a whole year. Our universe has probably about 10^9 years, so if your computer could start calculation when the time began, then you would have probability about 10^-170 that two identical numbers were generated. In the other words - practicaly it is imposible and you don't have to care about it.
Even if you would use Jaguar (top 1 from www.top500.org supercomputers) with only this one task, you still need 10^174 years to get all numbers.
If you don't belive me, try
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
x = rand(bigint)
puts "Oh, no!" if tried[x]
tried[x] = true
}
I'll buy you a beer if you will even once see "Oh, no!" on your screen during your life time :)
I could be wrong, but I don't think this is doable without storing some state. At the very least, you're going to need some state.
Even if you only use one bit per value (has this value been tried yes or no) then you will need X/8 bytes of memory to store the result (where X is the largest number). Assuming that you have 2GB of free memory, this would leave you with more than 16 million numbers.
Break the range in to manageable batches as shown below:
def range_walker range, batch_size = 100
size = (range.end - range.begin) + 1
n = size/batch_size
n.times do |i|
x = i * batch_size + range.begin
y = x + batch_size
(x...y).sort_by{rand}.each{|z| p z}
end
d = (range.end - size%batch_size + 1)
(d..range.end).sort_by{rand}.each{|z| p z }
end
You can further randomize solution by randomly choosing the batch for processing.
PS: This is a good problem for map-reduce. Each batch can be worked by independent nodes.
Reference:
Map-reduce in Ruby
you can randomly iterate an array with shuffle method
a = [1,2,3,4,5,6,7,8,9]
a.shuffle!
=> [5, 2, 8, 7, 3, 1, 6, 4, 9]
You want what's called a "full cycle iterator"...
Here is psudocode for the simplest version which is perfect for most uses...
function fullCycleStep(sample_size, last_value, random_seed = 31337, prime_number = 32452843) {
if last_value = null then last_value = random_seed % sample_size
return (last_value + prime_number) % sample_size
}
If you call this like so:
sample = 10
For i = 1 to sample
last_value = fullCycleStep(sample, last_value)
print last_value
next
It would generate random numbers, looping through all 10, never repeating If you change random_seed, which can be anything, or prime_number, which must be greater than, and not be evenly divisible by sample_size, you will get a new random order, but you will still never get a duplicate.
Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:
def gen_random_indices
# Assume your input range is (0..(10**3))
(0..3).sort_by{rand}.each do |a|
(0..3).sort_by{rand}.each do |b|
(0..3).sort_by{rand}.each do |c|
yield "#{a}#{b}#{c}".to_i
end
end
end
end
gen_random_indices do |idx|
run_test_with_index(idx)
end
Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this will require enough memory to store 10 * (number of digits). You will encounter every number in the range (0..(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.
The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a state-ful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?).
[Edit]: Taking into account #klew and #Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.
This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.
Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. I has to do with the numbers being too large to convert to a long. It can probably be fixed:
# partition a range into an array of N equal-sized ranges
def partition(range, n)
ranges = []
first = range.first
last = range.last
length = last - first + 1
step = length / n # integer division
((first + step - 1)..last).step(step) { |i|
ranges << (first..i)
first = i + 1
}
# append any extra onto the last element
ranges[-1] = (ranges[-1].first)..last if last > step * ranges.length
ranges
end
I hope the code comments help shed some light on my original question.
pastebin: full source
Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.
For a prohibitively large space, like
space = -10..1000000000000000000000
You can add this method to Range.
class Range
M127 = 170_141_183_460_469_231_731_687_303_715_884_105_727
def each_random(seed = 0)
return to_enum(__method__) { size } unless block_given?
unless first.kind_of? Integer
raise TypeError, "can't randomly iterate from #{first.class}"
end
sample_size = self.end - first + 1
sample_size -= 1 if exclude_end?
j = coprime sample_size
v = seed % sample_size
each do
v = (v + j) % sample_size
yield first + v
end
end
protected
def gcd(a,b)
b == 0 ? a : gcd(b, a % b)
end
def coprime(a, z = M127)
gcd(a, z) == 1 ? z : coprime(a, z + 1)
end
end
You could then
space.each_random { |i| puts i }
729815750697818944176
459631501395637888351
189447252093456832526
919263002791275776712
649078753489094720887
378894504186913665062
108710254884732609237
838526005582551553423
568341756280370497598
298157506978189441773
27973257676008385948
757789008373827330134
487604759071646274309
217420509769465218484
947236260467284162670
677052011165103106845
406867761862922051020
136683512560740995195
866499263258559939381
596315013956378883556
326130764654197827731
55946515352016771906
785762266049835716092
515578016747654660267
...
With a good amount of randomness so long as your space is a few orders smaller than M127.
Credit to #nick-steele and #bta for the approach.
This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.
As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.
I'll do my best to sort of explain it. The first part is you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n-bits of i from the bottom n-bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is doing a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFF) >> 5) is valid for the 16-bit domain. You can specify that domain with a mask, e.g. mask = 0xFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.
So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.
The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function could look something like this for 32-bit integers (pseudocode):
fun permute(i, length, seed) {
i = hash(i, 0xFFFF, seed)
while(i >= length): i = hash(i, 0xFFFF, seed)
return i
}
It could take a lot of hashes to get to your domain, so Kensler does a simple trick: he keeps the hash within the domain of the next power of two, which makes it require very few iterations (~2 on average), by masking out the unnecessary bits. The final algorithm looks like this:
fun next_pow_2(length) {
# This implementation is for clarity.
# See Kensler's paper for one way to do it fast.
p = 1
while (p < length): p *= 2
return p
}
permute(i, length, seed) {
mask = next_pow_2(length)-1
i = hash(i, mask, seed) & mask
while(i >= length): i = hash(i, mask, seed) & mask
return i
}
And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.

Algorithm for linear pattern matching?

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.
Is there a faster method than doing
foreach (patterns) {
for (i=0; i < listLength; i++)
for(t=0; t < patternlength; t++)
if( list[i+t] != pattern[t] ) {
break;
}
if( t == patternlength - 1 ) {
return i; // pattern found!
}
}
}
}
Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.
Edit: If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.
Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says
The algorithm preprocesses the target
string (key) that is being searched
for, but not the string being searched
in (unlike some algorithms that
preprocess the string to be searched
and can then amortize the expense of
the preprocessing by searching
repeatedly).
That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?
The key for Googling is "multi-pattern" string matching.
Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.
An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.
I suggest to start with reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then decide whether that would make sense in your case. If so, maybe there already is an implementation for your language/runtime available.
Yes.
The Boyer–Moore string search algorithm
See also: Knuth–Morris–Pratt algorithm
You could Build an SuffixArray and search the runtime is crazy : O ( length(pattern) ).
BUT .. you have to build that array.
It's only worth .. when the Text is static and the pattern dynamic .
A solution that could be efficient:
store the patterns in a trie data structure
start searching the list
check if the next pattern_length chars are in the trie, stop on success ( O(1) operation )
step one char and repeat #3
If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.
If your strings need to be flexible, I would also recommend a modified "The Boyer–Moore string search algorithm" as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The model of Boyer-Moore is incredibly efficient for searching a large amount of data for one of multiple strings to match against.
Jacob
If it's a bit array, I suppose doing a rolling sum would be an improvement. If pattern is length n, sum the first n bits and see if it matches the pattern's sum. Store the first bit of the sum always. Then, for every next bit, subtract the first bit from the sum and add the next bit, and see if the sum matches the pattern's sum. That would save the linear loop over the pattern.
It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.
Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.
If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.
The code looks like this:
try:
import psyco
psyco.full()
except ImportError:
pass
def encode(s):
def calc_count(count, c):
return count * (-1 if c == '0' else 1)
result = []
c = s[0]
count = 1
for i in range(1, len(s)):
d = s[i]
if d == c:
count += 1
else:
result.append(calc_count(count, c))
count = 1
c = d
result.append(calc_count(count, c))
return result
def search(encoded_source, targets):
def match(encoded_source, t, max_search_len, len_source):
x = len(t)-1
# Get the indexes of the longest segments and search them first
most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
# Align the signs of the source and target
index = (0 if encoded_source[0] * t[0] > 0 else 1)
unencoded_pos = sum(abs(c) for c in encoded_source[:index])
start_t, end_t = abs(t[0]), abs(t[x])
for i in range(index, len(encoded_source)-x, 2):
if all(t[j] == encoded_source[j+i] for j in most_restrictive):
encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
if start_t <= encoded_start and end_t <= encoded_end:
return unencoded_pos + (abs(encoded_source[i]) - start_t)
unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
if unencoded_pos > max_search_len:
return len_source
return len_source
len_source = sum(abs(c) for c in encoded_source)
i, found, target_index = len_source, None, -1
for j, t in enumerate(targets):
x = match(encoded_source, t, i, len_source)
print "Match at: ", x
if x < i:
i, found, target_index = x, t, j
return (i, found, target_index)
if __name__ == "__main__":
import datetime
def make_source_text(len):
from random import randint
item_len = 8
item_count = 2**item_len
table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))
targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
encoded_targets = [encode(t) for t in targets]
data_len = 10*1000*1000
s = datetime.datetime.now()
source_text = make_source_text(data_len)
e = datetime.datetime.now()
print "Make source text(length %d): " % data_len, (e - s)
s = datetime.datetime.now()
encoded_source = encode(source_text)
e = datetime.datetime.now()
print "Encode source text: ", (e - s)
s = datetime.datetime.now()
(i, found, target_index) = search(encoded_source, encoded_targets)
print (i, found, target_index)
print "Target was: ", targets[target_index]
print "Source matched here: ", source_text[i:i+len(targets[target_index])]
e = datetime.datetime.now()
print "Search time: ", (e - s)
On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.
psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:
Make source text(length 10000000): 0:00:02.277000
Encode source text: 0:00:00.329000
Match at: 2517905
Match at: 494990
Match at: 450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was: 1010010001010100100010
Source matched here: 1010010001010100100010
Search time: 0:00:04.325000
It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.

Palindrome detection efficiency

I got curious by Jon Limjap's interview mishap and started to look for efficient ways to do palindrome detection. I checked the palindrome golf answers and it seems to me that in the answers are two algorithms only, reversing the string and checking from tail and head.
def palindrome_short(s):
length = len(s)
for i in xrange(0,length/2):
if s[i] != s[(length-1)-i]: return False
return True
def palindrome_reverse(s):
return s == s[::-1]
I think neither of these methods are used in the detection of exact palindromes in huge DNA sequences. I looked around a bit and didn't find any free article about what an ultra efficient way for this might be.
A good way might be parallelizing the first version in a divide-and-conquer approach, assigning a pair of char arrays 1..n and length-1-n..length-1 to each thread or processor.
What would be a better way?
Do you know any?
Given only one palindrome, you will have to do it in O(N), yes. You can get more efficiency with multi-processors by splitting the string as you said.
Now say you want to do exact DNA matching. These strings are thousands of characters long, and they are very repetitive. This gives us the opportunity to optimize.
Say you split a 1000-char long string into 5 pairs of 100,100. The code will look like this:
isPal(w[0:100],w[-100:]) and isPal(w[101:200], w[-200:-100]) ...
etc... The first time you do these matches, you will have to process them. However, you can add all results you've done into a hashtable mapping pairs to booleans:
isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}
etc... this will take way too much memory, though. For pairs of 100,100, the hash map will have 2*4^100 elements. Say that you only store two 32-bit hashes of strings as the key, you will need something like 10^55 megabytes, which is ridiculous.
Maybe if you use smaller strings, the problem can be tractable. Then you'll have a huge hashmap, but at least palindrome for let's say 10x10 pairs will take O(1), so checking if a 1000 string is a palindrome will take 100 lookups instead of 500 compares. It's still O(N), though...
Another variant of your second function. We need no check equals of the right parts of normal and reverse strings.
def palindrome_reverse(s):
l = len(s) / 2
return s[:l] == s[l::-1]
Obviously, you're not going to be able to get better than O(n) asymptotic efficiency, since each character must be examined at least once. You can get better multiplicative constants, though.
For a single thread, you can get a speedup using assembly. You can also do better by examining data in chunks larger than a byte at a time, but this may be tricky due to alignment considerations. You'll do even better to use SIMD, if you can examine chunks as large as 16 bytes at a time.
If you wanted to parallelize it, you could divide the string into N pieces, and have processor i compare the segment [i*n/2, (i+1)*N/2) with the segment [L-(i+1)*N/2, L-i*N/2).
There isn't, unless you do a fuzzy match. Which is what they probably do in DNA (I've done EST searching in DNA with smith-waterman, but that is obviously much harder then matching for a palindrome or reverse-complement in a sequence).
They are both in O(N) so I don't think there is any particular efficiency problem with any of these solutions. Maybe I am not creative enough but I can't see how would it be possible to compare N elements in less than N steps, so something like O(log N) is definitely not possible IMHO.
Pararellism might help, but it still wouldn't change the big-Oh rank of the algorithm since it is equivalent to running it on a faster machine.
Comparing from the center is always much more efficient since you can bail out early on a miss but it alwo allows you to do faster max palindrome search, regardless of whether you are looking for the maximal radius or all non-overlapping palindromes.
The only real paralellization is if you have multiple independent strings to process. Splitting into chunks will waste a lot of work for every miss and there's always much more misses than hits.
On top of what others said, I'd also add a few pre-check criteria for really large inputs :
quick check whether tail-character matches
head character
if NOT, just early exit by returning Boolean-False
if (input-length < 4) {
# The quick check just now already confirmed it's palindrome
return Boolean-True
} else if (200 < input-length) {
# adjust this parameter to your preferences
#
# e.g. make it 20 for longer than 8000 etc
# or make it scale to input size,
# either logarithmically, or a fixed ratio like 2.5%
#
reverse last ( N = 4 ) characters/bytes of the input
if that **DOES NOT** match first N chars/bytes {
return boolean-false # early exit
# no point to reverse rest of it
# when head and tail don't even match
} else {
if N was substantial
trim out the head and tail of the input
you've confirmed; avoid duplicated work
remember to also update the variable(s)
you've elected to track the input size
}
[optional 1 : if that substring of N characters you've
just checked happened to all contain the
same character, let's call it C,
then locate the index position, P, for the first
character that isn't C
if P == input-size
then you've already proven
the entire string is a nonstop repeat
of one single character, which, by def,
must be a palindrome
then just return Boolean-True
but the P is more than the half-way point,
you've also proven it cannot possibly be a
palindrome, because second half contains a
component that doesn't exist in first half,
then just return Boolean-False ]
[optional 2 : for extremely long inputs,
like over 200,000 chars,
take the N chars right at the middle of it,
and see if the reversed one matches
if that fails, then do early exit and save time ]
}
if all pre-checks passed,
then simply do it BAU style :
reverse second-half of it,
and see if it's same as first half
With Python, short code can be faster since it puts the load into the faster internals of the VM (And there is the whole cache and other such things)
def ispalin(x):
return all(x[a]==x[-a-1] for a in xrange(len(x)>>1))
You can use a hashtable to put the character and have a counter variable whose value increases everytime you find an element not in table/map. If u check and find element thats already in table decrease the count.
For odd lettered string the counter should be back to 1 and for even it should hit 0.I hope this approach is right.
See below the snippet.
s->refers to string
eg: String s="abbcaddc";
Hashtable<Character,Integer> textMap= new Hashtable<Character,Integer>();
char charA[]= s.toCharArray();
for(int i=0;i<charA.length;i++)
{
if(!textMap.containsKey(charA[i]))
{
textMap.put(charA[i], ++count);
}
else
{
textMap.put(charA[i],--count);
}
if(length%2 !=0)
{
if(count == 1)
System.out.println("(odd case:PALINDROME)");
else
System.out.println("(odd case:not palindrome)");
}
else if(length%2==0)
{
if(count ==0)
System.out.println("(even case:palindrome)");
else
System.out.println("(even case :not palindrome)");
}
public class Palindrome{
private static boolean isPalindrome(String s){
if(s == null)
return false; //unitialized String ? return false
if(s.isEmpty()) //Empty Strings is a Palindrome
return true;
//we want check characters on opposite sides of the string
//and stop in the middle <divide and conquer>
int left = 0; //left-most char
int right = s.length() - 1; //right-most char
while(left < right){ //this elegantly handles odd characters string
if(s.charAt(left) != s.charAt(right)) //left char must equal
return false; //right else its not a palindrome
left++; //converge left to right
right--;//converge right to left
}
return true; // return true if the while doesn't exit
}
}
though we are doing n/2 calculations its still O(n)
this can done also using threads, but calculations get messy, best to avoid it. this doesn't test for special characters and is case sensitive. I have code that does it, but this code can be modified to handle that easily.

Resources