Need help understanding this dynamic programming solution - algorithm

So the problem being asked is:
A message containing letters from A-Z is being encoded to numbers using the following mapping:
'A' -> 1
'B' -> 2
...
'Z' -> 26
Given a non-empty string containing only digits, determine the total number of ways to decode it.
Example 1:
Input: "12"
Output: 2
Explanation: It could be decoded as "AB" (1 2) or "L" (12).
Example 2:
Input: "226"
Output: 3
Explanation: It could be decoded as "BZ" (2 26), "VF" (22 6), or "BBF" (2 2 6).
I solved it very inefficiently and was looking at other solutions and saw that dynamic programming was a good method to approach this problem. Since DP is new to me, I've been reading about it and am now coming back to the solution I saw and I'm trying to understand the logic behind the bottom down approach this guy used.
function numDecodings(s) {
  if (s.length === 0) return 0;
  const N = s.length;
  const dp = Array(N + 1).fill(0);
  dp[0] = 1;                           // empty prefix: one way (decode nothing)
  dp[1] = s[0] === '0' ? 0 : 1;        // a leading '0' cannot be decoded
  for (let i = 2; i <= N; i++) {
    // Take s[i-1] on its own as a single digit (1-9).
    if (s[i-1] !== '0') {
      dp[i] += dp[i-1];
    }
    // Take s[i-2..i-1] together as a two-digit code (10-26).
    if (s[i-2] === '1' || (s[i-2] === '2' && s[i-1] <= '6')) {
      dp[i] += dp[i-2];
    }
  }
  return dp[N];
}

First, let's straighten out some terms:
There are "top-down" and "bottom-up" approaches. "bottom-down" is not a useful term.
DP is a "bottom-up" approach, in that each solution is based on smaller and later ones.
The code has a "memo" array, dp. You may see the term "memoization" in your readings. This means that when we first compute a solution to a particular sub-problem, we will make a memo of it (remember the solution), indexed by the parameters. Thereafter, any time we need the solution, we'll simply look it up instead of recomputing it.
At each position in the string, we remember how many ways there are to code the string up to this point, and then compute how many ways total there are when we add the current character to that prefix.
Very briefly:
If the current character is not 0, then we can count one way to continue any previous string: take this as encoding a letter a-i. In this case, every encoding so far is still valid, so we carry forward that count: dp[i] += dp[i-1].
If the previous and current characters form a legal two-digit encoding (10-26, letters j-z), then we can also take those two digits together, and we carry forward the count from before this 2-character code: dp[i] += dp[i-2].
That's all there is to the algorithm. Note that it does not flag impossible digit sequences specially: if the scan reaches a point where neither rule applies, it simply lets dp[i] stay 0 and keeps going without raising an error. For instance, given the input 12000226, the algorithm finds two ways to encode 12 ("AB" and "L"), narrows that to one way for 120 ("1 20", i.e. "AT"), and then hits "00", where neither rule can fire. From that point on dp[i-1] and dp[i-2] are both 0, so every later count stays 0 and the function returns 0 for the whole string.
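For contrast with the bottom-up loop above, here is a minimal top-down (memoized) sketch of the same recurrence, written in Python for brevity; the function names are illustrative and this is not the code from the answer being discussed:

from functools import lru_cache

def num_decodings(s: str) -> int:
    @lru_cache(maxsize=None)
    def ways(i: int) -> int:
        # ways(i) = number of decodings of the suffix s[i:]
        if i == len(s):
            return 1              # empty suffix: exactly one way
        if s[i] == '0':
            return 0              # no letter is encoded as "0..."
        count = ways(i + 1)       # take s[i] as a single digit
        if i + 1 < len(s) and (s[i] == '1' or (s[i] == '2' and s[i + 1] <= '6')):
            count += ways(i + 2)  # take s[i:i+2] as a two-digit code
        return count
    return ways(0)

# num_decodings("226") == 3, matching Example 2 above.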

Related

24 Game/Countdown/Number Game solver, but without parentheses in the answer

I've been browsing the internet all day for an existing solution to creating an equation out of a list of numbers and operators for a specified target number.
I came across a lot of 24 Game solvers, Countdown solvers and the like, but they are all built around the concept of allowing parentheses in the answers.
For example, for a target of 42, using the numbers 1 2 3 4 5 6, a solution could be:
6 * 5 = 30
4 * 3 = 12
30 + 12 = 42
Note how the algorithm remembers the outcome of a sub-equation and later re-uses it to form the solution (in this case 30 and 12), essentially using parentheses to form the solution (6 * 5) + (4 * 3) = 42.
Whereas I'd like a solution WITHOUT the use of parentheses, which is solved from left to right, for example 6 - 1 + 5 * 4 + 2 = 42, if I'd write it out, it would be:
6 - 1 = 5
5 + 5 = 10
10 * 4 = 40
40 + 2 = 42
I have a list of about 55 numbers (random numbers ranging from 2 to 12), 9 operators (2 of each basic operator + 1 random operator) and a target value (a random number between 0 and 1000). I need an algorithm to check whether or not my target value is solvable (and optionally, if it isn't, how close we can get to the actual value). Each number and operator can only be used once, which means there will be a maximum of 10 numbers you can use to get to the target value.
I found a brute-force algorithm which can be easily adjusted to do what I want (How to design an algorithm to calculate countdown style maths number puzzle), and that works, but I was hoping to find something which generates more sophisticated "solutions", like on this page: http://incoherency.co.uk/countdown/
I wrote the solver you mentioned at the end of your post, and I apologise in advance that the code isn't very readable.
At its heart the code for any solver to this sort of problem is simply a depth-first search, which you imply you already have working.
Note that if you go with your "solution WITHOUT the use of parentheses, which is solved from left to right" then there are input sets which are not solvable. For example, 11,11,11,11,11,11 with a target of 144. The solution is ((11/11)+11)*((11/11)+11). My solver makes this easier for humans to understand by breaking the parentheses up into different lines, but it is still effectively using parentheses rather than evaluating from left to right.
The way to "use parentheses" is to apply an operation to your inputs and put the result back in the input bag, rather than to apply an operation to one of the inputs and an accumulator. For example, if your input bag is 1,2,3,4,5,6 and you decide to multiply 3 and 4, the bag becomes 1,2,12,5,6. In this way, when you recurse, that step can use the result of the previous step. Preparing this for output is just a case of storing the history of operations along with each number in the bag.
I imagine what you mean about more "sophisticated" solutions is just the simplicity heuristic used in my javascript solver. The solver works by doing a depth-first search of the entire search space, and then choosing the solution that is "best" rather than just the one that uses the fewest steps.
A solution is considered "better" than a previous solution (i.e. replaces it as the "answer" solution) if it is closer to the target (note that any state in the solver is a candidate solution, it's just that most are further away from the target than the previous best candidate solution), or if it is equally distant from the target and has a lower heuristic score.
The heuristic score is the sum of the "intermediate values" (i.e. the values on the right-hand-side of the "=" signs), with trailing 0's removed. For example, if the intermediate values are 1, 4, 10, 150, the heuristic score is 1+4+1+15: the 10 and the 150 only count for 1 and 15 because they end in zeroes. This is done because humans find it easier to deal with numbers that are divisible by 10, and so the solution appears "simpler".
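That trailing-zero scoring can be stated in a couple of lines; a sketch of the heuristic as described (my own wording, not the solver's code):

def heuristic_score(intermediate_values):
    # Strip trailing zeros from each intermediate result before summing,
    # so "round" numbers like 10 and 150 count as 1 and 15.
    return sum(int(str(v).rstrip('0') or '0') for v in intermediate_values)

# heuristic_score([1, 4, 10, 150]) == 1 + 4 + 1 + 15 == 21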
The other part that could be considered "sophisticated" is the way that some lines are joined together. This simply joins the result of "5 + 3 = 8" and "8 + 2 = 10" in to "5 + 3 + 2 = 10". The code to do this is absolutely horrible, but in case you're interested it's all in the Javascript at https://github.com/jes/cntdn/blob/master/js/cntdn.js - the gist is that after finding the solution which is stored in array form (with information about how each number was made) a bunch of post-processing happens. Roughly:
convert the "solution list" generated from the DFS to a (rudimentary, nested-array-based) expression tree - this is to cope with the multi-argument case (i.e. "5 + 3 + 2" is not 2 addition operations, it's just one addition that has 3 arguments)
convert the expression tree to an array of steps, including sorting the arguments so that they're presented more consistently
convert the array of steps into a string representation for display to the user, including an explanation of how distant from the target number the result is, if it's not equal
Apologies for the length of that. Hopefully some of it is of use.
James
EDIT: If you're interested in Countdown solvers in general, you may want to take a look at my letters solver as it is far more elegant than the numbers one. It's the top two functions at https://github.com/jes/cntdn/blob/master/js/cntdn.js - to use call solve_letters() with a string of letters and a function to get called for every matching word. This solver works by traversing a trie representing the dictionary (generated by https://github.com/jes/cntdn/blob/master/js/mk-js-dict), and calling the callback at every end node.
I used recursion in Java to do the array combination. The main idea is just using DFS to generate the number combinations and operator combinations.
I use a boolean array to store the visited positions, which prevents the same element from being used again. The temp StringBuilder stores the current equation; if the corresponding result is equal to the target, I put the equation into the result list. Do not forget to restore temp and the visited array to their original state before you select the next array element.
This algorithm will produce some duplicate answers, so it needs to be optimized later.
import java.util.ArrayList;
import java.util.List;

public class EquationFinder { // class name is arbitrary; added so the snippet compiles
    public static void main(String[] args) {
        List<StringBuilder> res = new ArrayList<StringBuilder>();
        int[] arr = {1, 2, 3, 4, 5, 6};
        int target = 42;
        // Try every element as the starting number.
        for (int i = 0; i < arr.length; i++) {
            boolean[] visited = new boolean[arr.length];
            visited[i] = true;
            StringBuilder sb = new StringBuilder();
            sb.append(arr[i]);
            findMatch(res, sb, arr, visited, arr[i], "+-*/", target);
        }
        for (StringBuilder sb : res) {
            System.out.println(sb.toString());
        }
    }

    public static void findMatch(List<StringBuilder> res, StringBuilder temp, int[] nums,
                                 boolean[] visited, int current, String operations, int target) {
        if (current == target) {
            res.add(new StringBuilder(temp));
        }
        for (int i = 0; i < nums.length; i++) {
            if (visited[i]) continue;
            for (char c : operations.toCharArray()) {
                visited[i] = true;
                temp.append(c).append(nums[i]); // assumes single-digit numbers
                if (c == '+') {
                    findMatch(res, temp, nums, visited, current + nums[i], operations, target);
                } else if (c == '-') {
                    findMatch(res, temp, nums, visited, current - nums[i], operations, target);
                } else if (c == '*') {
                    findMatch(res, temp, nums, visited, current * nums[i], operations, target);
                } else if (c == '/') {
                    findMatch(res, temp, nums, visited, current / nums[i], operations, target);
                }
                // Backtrack: undo this choice before trying the next operator/number.
                temp.delete(temp.length() - 2, temp.length());
                visited[i] = false;
            }
        }
    }
}

C/C++/Java/C#: help parsing numbers

I've got a real problem (it's not homework, you can check my profile). I need to parse data whose formatting is not under my control.
The data look like this:
6,852:6,100,752
So there's first a number made of up to 9 digits, followed by a colon.
Then I know for sure that, after the colon:
there's at least one valid combination of numbers that add up to the number before the colon
I know exactly how many numbers add up to the number before the colon (two in this case, but it can go as high as ten numbers)
In this case, 6852 is 6100 + 752.
My problem: I need to find these numbers (in this example, 6100 + 752).
It is unfortunate that in the data I'm forced to parse, the separator between the numbers (the comma) is also the separator used inside the numbers themselves (6100 is written as 6,100).
Once again: that unfortunate formatting is not under my control and, once again, this is not homework.
I need to solve this for up to 10 numbers that need to add up.
Here's an example with three numbers adding up to 6855:
6,855:360,6,175,320
I fear that there are cases where there would be two possible different solutions. HOWEVER if I get a solution that works "in most cases" it would be enough.
How do you typically solve such a problem in a C-style bracket language?
Well, I would start with the brute force approach and then apply some heuristics to prune the search space. Just split the list on the right by commas and iterate over all possible ways to group them into n terms (where n is the number of terms in the solution). You can use the following two rules to skip over invalid possibilities.
(1) You know that any group of 1 or 2 digits must begin a term.
(2) You know that no candidate term in your comma delimited list can be greater than the total on the left. (This also tells you the maximum number of digit groups that any candidate term can have.)
Recursive implementation (pseudo code):
int total; // The total read before the colon

// tokens is the list of digit groups (as integers) still to analyse,
// partialList is the partial list of numbers built so far,
// sum is the sum of the numbers in partialList.
// Aggregate takes 2 ints XXX and YYY and returns XXX,YYY (= XXX*1000 + YYY).
function getNumbers(tokens, sum, partialList)
    if isEmpty(tokens)
        if sum = total return partialList
        else return null // Got to the end with the wrong sum
    // Option 1: the next group stands on its own.
    var result1 = getNumbers(tokens[1:end], sum + tokens[0], Append(partialList, tokens[0]))
    // Option 2: the next two groups form one number (only if two groups remain).
    var result2 = null
    if length(tokens) >= 2
        result2 = getNumbers(tokens[2:end], sum + Aggregate(tokens[0], tokens[1]),
                             Append(partialList, Aggregate(tokens[0], tokens[1])))
    if result1 <> null return result1
    if result2 <> null return result2
    return null // No solution overall
You can do a lot better from different points of view, like tail recursion, pruning (you can have XXX,YYY only if YYY has 3 digits)... but this may work well enough for your app.
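To make the recursion concrete, here is a small runnable Python transcription of that sketch (my own, with illustrative names); it includes the pruning rule just mentioned, that two groups may only be merged when the second group is exactly three digits:

def get_numbers(tokens, total, partial=(), acc=0):
    """Split the comma groups after the colon into numbers summing to `total`.

    tokens: the digit groups after the colon, as strings (e.g. ["6", "100", "752"]).
    Returns a tuple of numbers, or None if no split works.
    """
    if not tokens:
        return partial if acc == total else None
    # Option 1: the first group is a number on its own.
    first = int(tokens[0])
    result = get_numbers(tokens[1:], total, partial + (first,), acc + first)
    if result is not None:
        return result
    # Option 2: the first two groups together form one number
    # (only valid when the second group is exactly three digits).
    if len(tokens) >= 2 and len(tokens[1]) == 3:
        merged = first * 1000 + int(tokens[1])
        return get_numbers(tokens[2:], total, partial + (merged,), acc + merged)
    return None

line = "6,852:6,100,752"                       # sample line from the question
left, right = line.split(":")
total = int(left.replace(",", ""))
print(get_numbers(right.split(","), total))    # (6100, 752)

The same call on "6,855:360,6,175,320" yields (360, 6175, 320), matching the three-number example.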
Divide-and-conquer would make for a nice improvement.
I think you should try all possible ways to parse the string and calculate the sum and return a list of those results that give the correct sum. This should be only one result in most cases unless you are very unlucky.
One thing to note that reduces the number of possibilities is that there is only an ambiguity if you have aa,bbb and bbb is exactly 3 digits. If you have aa,bb there is only one way to parse it.
Reading in C++:
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

std::pair<int, std::vector<int> > read_numbers(std::istream& is)
{
    std::pair<int, std::vector<int> > result;
    if (!(is >> result.first)) throw "foo!";
    for (;;) {
        char ch;
        if (!(is >> ch)) {
            if (is.eof()) return result;
            else throw "bar!";
        }
        if (ch != ',' && ch != ':') throw "foobar!"; // both separators appear in the data
        int i;
        if (!(is >> i)) throw "bar!";
        result.second.push_back(i);
        is >> std::ws;
    }
}

// Left as an exercise for the reader (see below):
std::vector<int> get_winning_combination(int total,
                                         std::vector<int>::const_iterator begin,
                                         std::vector<int>::const_iterator end);

void f()
{
    std::istringstream iss("6,852:6,100,752");
    std::pair<int, std::vector<int> > foo = read_numbers(iss);
    std::vector<int> result = get_winning_combination(foo.first,
                                                      foo.second.begin(),
                                                      foo.second.end());
    for (std::vector<int>::const_iterator i = result.begin(); i != result.end(); ++i)
        std::cout << *i << " ";
}
The actual cracking of the numbers is left as an exercise to the reader. :)
I think your main problem is deciding how to actually parse the numbers. The rest is just rote work with strings->numbers and iteration over combinations.
For instance, in the examples you gave, you could heuristically decide that a single-digit number followed by a three-digit number is, in fact, a four-digit number. Does a heuristic such as this hold true over a larger dataset? If not, you're also likely to have to iterate over the possible input parsing combinations, which means the naive solution is going to have a big polynomial complexity (O(n^x), where x > 4).
Actually checking for which numbers add up is easy to do using a recursive search.
using System.Collections.Generic;
using System.Linq;

List<int> GetSummands(int total, int numberOfElements, IList<int> values)
{
    if (numberOfElements == 0)
    {
        if (total == 0)
            return new List<int>(); // Empty list: nothing left to pick.
        else
            return null;            // Indicate no solution.
    }
    else if (total < 0)
    {
        return null;                // Overshot the total: no solution.
    }
    else
    {
        for (int i = 0; i < values.Count; ++i)
        {
            List<int> summands = GetSummands(
                total - values[i], numberOfElements - 1,
                values.Skip(i + 1).ToList());
            if (summands != null)
            {
                // Found solution.
                summands.Add(values[i]);
                return summands;
            }
        }
        return null;                // No combination of the remaining values works.
    }
}

Edit Distance Algorithm

I have a dictionary of 'n' words and there are 'm' queries to respond to. For each query I want to output the number of dictionary words that are at edit distance 1 or 2 from it. I want to optimize this given that n and m are roughly 3000.
Edit added from answer below:
I will try to word it differently.
Initially there are 'n' words given as a set of dictionary words. Next, 'm' query words are given, and for each query word I need to find whether the word already exists in the dictionary (edit distance 0) or, if not, the total count of dictionary words that are at edit distance 1 or 2 from the query word.
I hope the Question is now Clear.
Well, it times out if the time complexity is (m*n)*n; the naive use of the DP edit-distance algorithm times out. Even calculating only the 2*k+1 diagonal elements (the band around the diagonal) times out, where k is the threshold (k = 3 in the case above).
You want to use the Levenshtein distance between two words, but I assume you know that since that's what the question's tags say.
You would have to iterate through your List (assumption) and compare every word in the list with the current query you're executing. You could build a BK-tree to limit your search space, but that sounds like overkill if you only have ~3000 words.
var upperLimit = 2;
var allWords = GetAllWords();
var matchingWords = allWords
.Where(word => Levenshtein(query, word) <= upperLimit)
.ToList();
Added after edit of original question
Finding cases where distance=0 would be easy Contains-queries if you have a case insensitive dictionary. Those cases where distance <= 2 would require a complete scan of the search space, 3000 comparisons per query word. Assuming an equal amount of query words would result in 9 million comparisons.
You mention that it times out, so I presume you have a timeout configured? Could your speed be due to a poor, or slow, implementation of the Levenshtein calculation?
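Often the culprit is recomputing the full table for every pair even though only distances 0-2 matter. A distance-bounded check can bail out as soon as an entire DP row exceeds the cutoff; here is a minimal sketch of that idea in Python (not from the answer above; names are illustrative):

def within_distance(a: str, b: str, k: int = 2) -> bool:
    # Standard Levenshtein DP, one row at a time, with an early exit
    # once every value in the current row exceeds the cutoff k.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k

# count of dictionary words within distance 2 of a query:
# sum(within_distance(query, w) for w in dictionary)

The early exit is safe because any alignment path passes through every row of the table, so the final distance is at least the minimum of each row.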
(Graph omitted: BK-tree search-space statistics, taken from CLiki: bk-tree; source: itu.edu.tr.)
As seen, using a BK-tree with an edit distance <= 2 would only visit about 1% of the search space, but that assumes a very large input data set, in their case up to half a million words. I would assume similar proportions in your case, but such a small number of inputs wouldn't cause much trouble even if stored in a List/Dictionary.
I will try to word it differently.
Initially there are 'n' words given as a set of Dictionary words.
Next, 'm' query words are given, and for each query word I need to find whether the word already exists in the dictionary (edit distance 0) or, if not, the total count of dictionary words that are at edit distance 1 or 2 from the query word.
I hope the Question is now Clear.
Well, it times out if the time complexity is (m*n)*n; the naive use of the DP edit-distance algorithm times out.
Even calculating only the 2*k+1 diagonal elements times out, where k is the threshold (k = 3 in the case above).
PS: A BK-tree should suffice for the purpose. Any links to an implementation in C++?
public class Solution {
    public int minDistance(String word1, String word2) {
        int[][] table = new int[word1.length() + 1][word2.length() + 1];
        for (int i = 0; i < table.length; ++i) {
            for (int j = 0; j < table[i].length; ++j) {
                if (i == 0)
                    table[i][j] = j;
                else if (j == 0)
                    table[i][j] = i;
                else {
                    if (word1.charAt(i - 1) == word2.charAt(j - 1))
                        table[i][j] = table[i - 1][j - 1];
                    else
                        table[i][j] = 1 + Math.min(Math.min(table[i - 1][j - 1],
                                table[i - 1][j]), table[i][j - 1]);
                }
            }
        }
        return table[word1.length()][word2.length()];
    }
}
Check out this simplified solution, solved using dynamic programming (top-down recursion with memoization):
from functools import cache

class Solution:
    def minDistance(self, word1: str, word2: str) -> int:
        return self.edit_distance(word1, word2)

    @cache
    def edit_distance(self, s, t):
        # Edge conditions
        if len(s) == 0:
            return len(t)
        if len(t) == 0:
            return len(s)
        # If 1st char matches
        if s[0] == t[0]:
            return self.edit_distance(s[1:], t[1:])
        else:
            return min(
                1 + self.edit_distance(s[1:], t),      # delete
                1 + self.edit_distance(s, t[1:]),      # insert
                1 + self.edit_distance(s[1:], t[1:])   # replace
            )

Sorting numbers from 1 to 999,999,999 in words as strings

Interesting programming puzzle:
If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?
To be precise: if the integers from 1 to 999,999,999 are expressed in words (omitting spaces, ‘and’, and punctuation - see note below for format), and sorted alphabetically so that the first six integers are
eight
eighteen
eighteenmillion
eighteenmillioneight
eighteenmillioneighteen
eighteenmillioneighteenthousand
and the last is
twothousandtwohundredtwo
then reading top to bottom, left to right, the 28th letter completes the spelling of the integer “eighteenmillion”.
The 51 billionth letter also completes the spelling of an integer. Which one, and what is the sum of all the integers to that point?
Note: For example, 911,610,034 is written “ninehundredelevenmillionsixhundredtenthousandthirtyfour”; 500,000,000 is written “fivehundredmillion”; 1,709 is written “onethousandsevenhundrednine”.
I stumbled across this on a programming blog 'Occasionally Sane' and couldn't think of a neat way of doing it. The author of the relevant post says his initial attempt ate through 1.5GB of memory in 10 minutes, and he'd only made it up to 20,000,000 ("twentymillion").
Can anyone come up with a novel/clever approach to this?
Edit: Solved!
You can create a generator that outputs the numbers in sorted order. There are a few rules for comparing concatenated strings that I think most of us know implicitly:
a < a+b, where b is non-empty.
a+b < a+c, where b < c.
a+b < c+d, where a < c and a is not a prefix of c.
If you start with a sorted list of the first 1000 numbers, you can easily generate the rest by appending "thousand" or "million" and concatenating another group of 1000.
Here's the full code, in Python:
import heapq

first_thousand = [('', 0), ('one', 1), ('two', 2), ('three', 3), ('four', 4),
                  ('five', 5), ('six', 6), ('seven', 7), ('eight', 8),
                  ('nine', 9), ('ten', 10), ('eleven', 11), ('twelve', 12),
                  ('thirteen', 13), ('fourteen', 14), ('fifteen', 15),
                  ('sixteen', 16), ('seventeen', 17), ('eighteen', 18),
                  ('nineteen', 19)]
tens_name = (None, 'ten', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
             'seventy', 'eighty', 'ninety')

for number in range(20, 100):
    name = tens_name[number // 10] + first_thousand[number % 10][0]
    first_thousand.append((name, number))
for number in range(100, 1000):
    name = first_thousand[number // 100][0] + 'hundred' + first_thousand[number % 100][0]
    first_thousand.append((name, number))
first_thousand.sort()

def make_sequence(base_generator, suffix, multiplier):
    prefix_list = [(name + suffix, number * multiplier)
                   for name, number in first_thousand[1:]]
    prefix_list.sort()
    for prefix_name, base_number in prefix_list:
        for name, number in base_generator():
            yield prefix_name + name, base_number + number
    return

def thousand_sequence():
    for name, number in first_thousand:
        yield name, number
    return

def million_sequence():
    return heapq.merge(first_thousand,
                       make_sequence(thousand_sequence, 'thousand', 1000))

def billion_sequence():
    return heapq.merge(million_sequence(),
                       make_sequence(million_sequence, 'million', 1000000))

def solve(stopping_size=51000000000):
    total_chars = 0
    total_sum = 0
    for name, number in billion_sequence():
        total_chars += len(name)
        total_sum += number
        if total_chars >= stopping_size:
            break
    return total_chars, total_sum, name, number
It took a while to run, about an hour. The 51 billionth character is the last character of sixhundredseventysixmillionsevenhundredfortysixthousandfivehundredseventyfive, and the sum of the integers to that point is 413,540,008,163,475,743.
I'd sort the names of the first 20 integers and the names of the tens, hundreds and thousands, work out how many numbers start with each of those, and go from there.
For example, the first few are [ eight, eighteen, eighthundred, eightmillion, eightthousand, eighty, eleven, ....
The numbers starting with "eight" are 8. With "eighthundred", 800-899, 800,000-899,999, 800,000,000-899,999,999. And so on.
The number of letters in the concatenation of the words for 0 (represented by the empty string) to 99 can be found and totalled; this can be reused, with "thousand" (8 letters) or "million" (7 letters) added, for the higher ranges. The value for 800-899 will be 100 times the length of "eighthundred" plus the total length of 0-99. And so on.
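A rough sketch of that bucket-counting idea (my own illustration, with made-up helper names), sizing a whole block without enumerating it:

UNITS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
         "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
         "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def name(n):  # 0 <= n <= 99; 0 maps to ""
    return UNITS[n] if n < 20 else TENS[n // 10] + UNITS[n % 10]

len_0_99 = sum(len(name(i)) for i in range(100))      # letters in the words for 0..99
block_800s = 100 * len("eighthundred") + len_0_99     # letters contributed by 800..899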
This guy has a solution to the puzzle written in Haskell. Apparently Michael Borgwardt was right about using a Trie for finding the solution.
Those strings are going to have lots and lots of common prefixes - perfect use case for a trie, which would drastically reduce memory usage and probably also running time.
Here's my python solution that prints out the correct answer in a fraction of a second. I'm not a python programmer generally, so apologies for any egregious code style errors.
#!/usr/bin/env python
import sys

ONES = [
    "", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = [
    "zero", "ten", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety",
]

def to_s_h(i):
    if i < 20:
        return ONES[i]
    return TENS[i / 10] + ONES[i % 10]

def to_s_t(i):
    if i < 100:
        return to_s_h(i)
    return ONES[i / 100] + "hundred" + to_s_h(i % 100)

def to_s_m(i):
    if i < 1000:
        return to_s_t(i)
    return to_s_t(i / 1000) + "thousand" + to_s_t(i % 1000)

def to_s_b(i):
    if i < 1000000:
        return to_s_m(i)
    return to_s_m(i / 1000000) + "million" + to_s_m(i % 1000000)

def try_string(s, t):
    global letters_to_go, word_sum
    l = len(s)
    letters_to_go -= l
    word_sum += t
    if letters_to_go == 0:
        print "solved: " + s
        print "sum is: " + str(word_sum)
        sys.exit(0)
    elif letters_to_go < 0:
        print "failed: " + s + " " + str(letters_to_go)
        sys.exit(-1)

def solve(depth, prefix, prefix_num):
    global millions, thousands, ones, letters_to_go, onelen, thousandlen, word_sum
    src = [millions, thousands, ones][depth]
    for x in src:
        num = prefix + x[2]
        nn = prefix_num + x[1]
        try_string(num, nn)
        if x[0] == 0:
            continue
        if x[0] == 1:
            stl = (len(num) * 999) + onelen
            ss = (nn * 999) + onesum
        else:
            stl = (len(num) * 999999) + thousandlen + onelen * 999
            ss = (nn * 999999) + thousandsum
        if stl < letters_to_go:
            letters_to_go -= stl
            word_sum += ss
        else:
            solve(depth + 1, num, nn)

ones = []
thousands = []
millions = []
onelen = 0
thousandlen = 0
onesum = (999 * 1000) / 2
thousandsum = (999999 * 1000000) / 2

for x in range(1, 1000):
    s = to_s_b(x)
    l = len(s)
    ones.append((0, x, s))
    onelen += l
    thousands.append((0, x, s))
    thousands.append((1, x * 1000, s + "thousand"))
    thousandlen += l + (l + len("thousand")) * 1000
    millions.append((0, x, s))
    millions.append((1, x * 1000, s + "thousand"))
    millions.append((2, x * 1000000, s + "million"))

ones.sort(key=lambda x: x[2])
thousands.sort(key=lambda x: x[2])
millions.sort(key=lambda x: x[2])

letters_to_go = 51000000000
word_sum = 0
solve(0, "", 0)
It works by precomputing the length of the numbers from 1..999 and 1..999999 so that it can skip entire subtrees unless it knows that the answer lies somewhere within them.
(The first attempt at this is wrong, but I will leave it up since it's more useful to see mistakes on the way to solving something rather than just the final answer.)
I would first generate the strings from 0 to 999 and store them into an array called thousandsStrings. The 0 element is "", and "" represents a blank in the lists below.
The thousandsString setup uses the following:
Units: "" one two three ... nine
Teens: ten eleven twelve ... nineteen
Tens: "" "" twenty thirty forty ... ninety
The thousandsString setup is something like this:
thousandsString[0] = ""
for (i in 1..10)
thousandsString[i] = Units[i]
end
for (i in 10..19)
thousandsString[i] = Teens[i]
end
for (i in 20..99)
thousandsString[i] = Tens[i/10] + Units[i%10]
end
for (i in 100..999)
thousandsString[i] = Units[i/100] + "hundred" + thousandsString[i%100]
end
Then, I would sort that array alphabetically.
Then, assuming t1 t2 t3 are strings taken from thousandsString, all of the strings have the form
t1
OR
t1 + million + t2 + thousand + t3
OR
t1 + thousand + t2
To output them in the proper order, I would process the individual strings, followed by the millions strings followed by the string + thousands strings.
foreach (t1 in thousandsStrings)
if (t1 == "")
continue;
process(t1)
foreach (t2 in thousandsStrings)
foreach (t3 in thousandsStrings)
process (t1 + "million" + t2 + "thousand" + t3)
end
end
foreach (t2 in thousandsStrings)
process (t1 + "thousand" + t2)
end
end
where process means store the previous sum length and then add the new string length to the sum and if the new sum is >= your target sum, you spit out the results, and maybe return or break out of the loops, whatever makes you happy.
=====================================================================
Second attempt, the other answers were right that you need to use 3k strings instead of 1k strings as a base.
Start with the thousandsString from above, but drop the blank "" for zero. That leaves 999 elements and call this uStr (units string).
Create two more sets:
tStr = the set of all uStr + "thousand"
mStr = the set of all uStr + "million"
Now create two more set unions:
mtuStr = mStr union tStr union uStr
tuStr = tStr union uStr
Order uStr, tuStr, mtuStr
Now the looping and logic here are a bit different than before.
foreach (s1 in mtuStr)
process(s1)
// If this is a millions or thousands string, add the extra strings that can
// go after the millions or thousands parts.
if (s1.contains("million"))
foreach (s2 in tuStr)
process (s1+s2)
if (s2.contains("thousand"))
foreach (s3 in uStr)
process (s1+s2+s3)
end
end
end
end
if (s1.contains("thousand"))
foreach (s2 in uStr)
process (s1+s2)
end
end
end
What I did:
1) Iterate through 1 - 999 and generate the words for each of these.
As we generate:
2) Create 3 data structures where each node has a pointer to children and each node has a character value, and a pointer to Siblings. (A binary tree, in fact, but we don't want to think of it that way necessarily - for me it's easier to conceptualise as a list of siblings with lists of children hanging off, but if you think about it {draw a pic} you'll realise it is in fact a Binary Tree).
These 3 data structures are created concurrently as follows:
a) the first one with the words as generated (i.e. 1-999, sorted alphabetically)
b) all the values in (a) + all the values with 'thousand' appended (i.e. 1-999 and 1,000-999,000 in steps of 1,000; 1,998 values in total)
c) all the values in (b) + all the values in (a) with 'million' appended (2,997 values in total)
3) For every leaf node in(b) add a Child as (a). For every leaf node in (c) add a child as (b).
4) Traverse the tree, counting how many characters we pass and stopping at 51 Billion.
NOTE: This doesn't sum the values (I didn't read that bit when I originally did it), and runs in just over 3 minutes (about 192 secs usually, using c++).
NOTE 2: (in case it isn't obvious) there are only 5,994 values stored, but they are stored in such a way that there are a billion paths through the tree
I did this about a year or two ago when I stumbled across it, and have since realised there are many optimisations (the most time consuming bit is traversing the tree - by a LONG WAY). There are a few optimisations that I think would significantly improve this approach, but I could never be bothered taking it further, other than optimising redundant nodes in the tree slightly, so that they stored strings rather than characters.
I have seen people claim on line that they've solved it in less than 5 seconds....
Weird but fun idea:
Build a sparse list of the lengths of the number words from 0 to 9, then 10-90 by tens, then 100, 1,000, and so on up to a billion; the indexes are the values of the integer parts whose lengths are stored.
Write a function to calculate the length of a number as a string using the table (breaking the number into its parts and looking up the lengths of the parts, never actually creating a string).
Then you're only doing arithmetic as you traverse the numbers, calculating each length from the table and adding it to a running sum.
With the sum and the value of the final integer, figure out the integer that's being spelled, and voila, you're done.
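A minimal sketch of such a length-only function (illustrative Python; the table values are simply the letter counts of the English number words in the puzzle's format):

# Letter counts of "", "one".."nineteen" and of "", "ten".."ninety".
ONES_LEN = [0, 3, 3, 5, 4, 4, 3, 5, 5, 4, 3, 6, 6, 8, 8, 7, 7, 9, 8, 8]
TENS_LEN = [0, 3, 6, 6, 5, 5, 5, 7, 6, 6]

def word_length(n: int) -> int:
    """Length of n written as words, without ever building the string."""
    if n >= 1000000:
        return word_length(n // 1000000) + len('million') + word_length(n % 1000000)
    if n >= 1000:
        return word_length(n // 1000) + len('thousand') + word_length(n % 1000)
    if n >= 100:
        return ONES_LEN[n // 100] + len('hundred') + word_length(n % 100)
    if n >= 20:
        return TENS_LEN[n // 10] + ONES_LEN[n % 10]
    return ONES_LEN[n]

# word_length(911610034) == len("ninehundredelevenmillionsixhundredtenthousandthirtyfour") == 55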
Yes, me again, but a completely different approach.
Simply, rather than storing the "onethousandeleventyseven" words, you write the sort to use that when comparing.
Crude java POC:
import java.util.Arrays;
import java.util.Comparator;

public class BillionsBillions implements Comparator<Long> {
    public int compare(Long a, Long b) {
        String w1 = numberToWords(a); // e.g. 1234 -> "onethousandtwohundredthirtyfour"
        String w2 = numberToWords(b);
        return w1.compareTo(w2);
    }

    public static void main(String args[]) {
        Long numbers[] = new Long[1000000000]; // Bring your 64 bit JVMs
        for (int i = 0; i < 1000000000; i++) {
            numbers[i] = (long) i;
        }
        Arrays.sort(numbers, new BillionsBillions());
        long total = 0;
        for (long n : numbers) {
            String l = numberToWords(n);
            long offset = total + l.length() - 51000000000L;
            if (offset >= 0) {
                String c = l.substring((int) (l.length() - offset - 1),
                                       (int) (l.length() - offset));
                System.out.println(c);
                break;
            }
            total += l.length();
        }
    }

    static String numberToWords(long n) {
        return ""; // left as an exercise for the reader (see below)
    }
}
"numberToWords" is left as an exercise for the reader.
Do you need to save the entire string in memory?
If not, just save how many characters you've appended so far. For each iteration, check the length of the next number's textual representation. If adding it would carry you past the nth letter you are looking for, the letter must be in that string, so extract it by its index, print it, and stop execution. Otherwise, add the string length to the character count and move on to the next number.
All the strings are going to start with either one, ten, two, twenty, three, thirty, four, etc so I'd start with figuring out how many are in each of the buckets. Then you should at least know which bucket you need to look closer at.
Then I'd look at subdividing the buckets further based on the possible prefixes. For example, within ninehundred, you are going to have all the same buckets that you had to start off with, just for numbers starting with 900.
The question is about efficient data storage, not string manipulation. Create an enum to represent the words; the words should appear in sorted order, so that when it comes time to sort it is a simple comparison. Now generate the list and sort. Use the fact that you know how long each word is, in conjunction with the enum, to add up to the character you need.
Code wins...
#!/bin/bash
n=0
while [ $n -lt 1000000000 ]; do
number -l $n | sed -e 's/[^a-z]//g'
let n=n+1
done | sort > /tmp/bignumbers
awk '
BEGIN {
total = 0;
}
{
l = length($0);
offset = total + l - 51000000000;
print total " " offset
if (offset >= 0) {
c = substr($0, l - offset, 1);
print c;
exit;
}
total = total + l;
}' /tmp/bignumbers
Tested for a much smaller range ;-). Requires a LOT of diskspace, a compressed filesystem would be, umm, valuable, but not so much memory.
Sort has options to compress work files as well, and you could toss in gzip to directly compress data.
Not the zippiest solution.
But it does work.
Honestly I would let an RDBMS like SQL Server or Oracle do the work for me.
Insert the billion strings into an indexed table.
Compute a string length column.
Start pulling off the top X records at a time with a SUM, until I get to 51 billion.
Might beat up the server for a while as it would need to do a lot of Disk IO, but overall I think I could find an answer faster than someone who would write a program to do it.
Sometimes just getting it done is what the client really wants, and could care less what fancy design pattern or data structure you used.
Figure out the word lengths for 1-999, and include the length for 0 as 0.
So now you have an array for 0-999, namely uint32 sizes999[1000]
(not going to get into the details of generating this).
You also need an array of the thousand last letters, last_letters999[1000]
(again, not going into the details of generating this, as it is even easier: even hundreds end in 'd', even tens end in 'y' except 10 which ends in 'n', the others cycle through the last letters of 'one' through 'nine'; zero is irrelevant).
uint32 sizes999[1000];
uint64 totallen = 0;
strlen_million = strlen("million");
strlen_thousand = strlen("thousand");
for (uint32 i = 0; i < 1000; ++i) {
    for (uint32 j = 0; j < 1000; ++j) {
        for (uint32 k = 0; k < 1000; ++k) {
            totallen += sizes999[i] + strlen_million +
                        sizes999[j] + strlen_thousand +
                        sizes999[k];
            if (totallen == 51000000000) goto done;
            ASSERT(totallen < 51000000000); // he claimed 51000000000 was not intermediate
        }
    }
}
done:
// now use i, j, k to get the last letter via last_letters999
// think of i, j, k as digits base 1000
// if k == 0 and j == 0 then the letter is the 'n' of "million"
// if only k == 0 then the letter is the 'd' of "thousand"
// otherwise use last_letters999, since the units digit base 1000 (that is, k) is not zero
// for the sum of the numbers: i, j, k are the digits of the number base 1000, so
n = i*1000000 + j*1000 + k;
// represents the number, and use
sum = n*(n+1)/2;
If you need to do it for a number other than 51000000000, then also calculate sums_sizes999 and use it in the natural way.
Total memory: O(1000).
Total time: O(n), where n is the number.
This is what I'd do:
Create an array of 2,997 strings: "one" through "ninehundredninetynine", "onethousand" through "ninehundredninetyninethousand", and "onemillion" through "ninehundredninetyninemillion".
Store the following about each string: length (this can be calculated of course), the integer value represented by the string, and some enum to signify whether it's "ones", "thousands", or "millions".
Sort the 2,997 strings alphabetically.
With this array created, it's straightforward to find all 999,999,999 strings in order alphabetically based on the following observations:
Nothing can follow a "ones" string
Either nothing, or a "ones" string, can follow a "thousands" string
Either nothing, a "ones" string, a "thousands" string, or a "thousands" string then a "ones" string, can follow a "millions" string.
Constructing the words basically involves creating one- to three-letter "words" based on these 2,997 tokens, making sure that the order of the tokens makes a valid number according to the rules above. Given a particular "word", the next "word" is found like this:
Lengthen the "word" by adding the token first alphabetically, if possible.
If this can't be done, advance the rightmost token to the next one alphabetically, if possible.
If this too is not possible, then remove the rightmost token, and advance the second-rightmost token to the next one alphabetically, if possible.
If this too is not possible, you're done.
At each step you can calculate the total length of the string and the sum of the numbers by just keeping two running totals.
It's important to note that there is a lot of overlap and double counting if you iterate over all 999,999,999 possible numbers. It's also important to realize that the number of strings that start with "eight" is the same as the number that start with "nine" or "seven" or "six", etc.
To me, this begs for a dynamic programming solution where the number of strings for tens, hundreds, thousands, etc. are calculated and stored in some type of lookup table. Of course, there will be special cases for one vs eleven, two vs twelve, etc.
I'll update this if I can get a quick running solution.
WRONG!!!!!!!!! I READ THE PROBLEM WRONG. I thought it meant "what's the last letter of the alphabetically last number"
what's wrong with:
public class Nums {
    // if overflows happen, switch to a BigDecimal or something
    // with arbitrary precision
    public static void main(String[] args) {
        System.out.println("last letter: " + lastLetter(1L, 51000000L));
        System.out.println("sum: " + sum(1L, 51000000L));
    }

    static char lastLetter(long start, long end) {
        String last = toWord(start);
        for (long i = start; i < end; i++) {
            String current = toWord(i);
            if (current.compareTo(last) > 0)
                last = current;
        }
        return last.charAt(last.length() - 1);
    }

    static String toWord(long num) {
        // should be relatively easy, but for now ...
        return "one";
    }

    static long sum(long first, long n) {
        return (n * first + n * n) / 2;
    }
}
haven't actually tried this :/ LOL
You have one billion numbers and 51 billion characters - there's a good chance that this is a trick question, as there are an average of 51 characters per number. Sum up the conversions of all the numbers and see if it adds up to 51 billion.
Edit: It adds up to 70,305,000,000 characters, so this is the wrong answer.
I solved this in Java sometime in 2008 as part of an application to work at ITA Software.
The code is long, and it now being three years later, I look at it with a bit of horror... So I'm not going to post it.
But I'll post quotes from some notes that I included with the application.
The problem with this puzzle is of course the size. The naïve approach would be to sort the list in word number order and then to iterate through the sorted list counting characters and summing. With a list of size 999,999,999 this would of course take a rather long time and the sort could likely not be done in memory.
But there are natural patterns in the ordering which allow shortcuts.
Immediately following any entry (say the number is X) ending in "million" will come 999,999 entries starting with the same text, representing all the numbers from X+1 to X+10^6-1.
The sum of all these numbers can be computed by a classic formula (an “arithmetic series”), and the character count can be computed by a similarly simple formula based on the prefix (X above) and a once-computed character count for the numbers from 1 to 999,999. Both depend only on the “millions” part of the number at the base of the range. Thus if the character count for the entire range will keep the entire count below the search goal, the individual entries need not be traversed.
Similar shortcuts apply for “thousand”, and indeed could be applied to “hundred” or “billion” though I didn’t bother with shortcuts at the hundreds level and the billions level is out of range for this problem.
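In code form, the shortcut for one such block looks roughly like this (my own illustration; the helper name and arguments are made up, not taken from the original program):

def block_totals(prefix_name: str, x: int, sub_chars: int, sub_sum: int,
                 sub_count: int = 999999):
    """Character count and integer sum for the 999,999 entries that follow
    an entry X ending in "million", given precomputed totals for 1..999,999."""
    chars = sub_count * len(prefix_name) + sub_chars   # the prefix repeats, plus all suffixes
    numbers = sub_count * x + sub_sum                  # arithmetic series X+1 .. X+999,999
    return chars, numbers

# sub_chars = total letters in the names of 1..999,999
# sub_sum   = 999999 * 1000000 // 2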
In order to apply these shortcuts, my code creates and sorts a list of 2997 objects representing the numbers:
1 to 999 stepping by 1
1000 to 999000 stepping by 1000
1000000 to 999000000 stepping by 1000000
The code iterates through this list, accumulating sums and character counts, recursively creating, sorting and traversing similar but smaller lists as needed.
Explicit counting and adding is only needed near the end.
I didn't get the job, but later used the code as a "code sample" for another job, which I did get.
The Java code using these techniques for skipping much of the explicit counting and adding runs in about 8 seconds.

Algorithm for linear pattern matching?

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.
Is there a faster method than doing
foreach (patterns) {
    for (i = 0; i < listLength; i++) {
        for (t = 0; t < patternlength; t++) {
            if (list[i+t] != pattern[t]) {
                break;
            }
        }
        if (t == patternlength) {
            return i; // pattern found!
        }
    }
}
Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.
Edit: If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.
Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says
The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).
That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?
The key for Googling is "multi-pattern" string matching.
Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.
An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.
I suggest starting by reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then deciding whether either makes sense in your case. If so, maybe an implementation is already available for your language/runtime.
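To make the Aho-Corasick suggestion concrete, here is a compact sketch in Python (my own illustration, not a library API): it builds a trie with failure links over all patterns once, then scans the text a single time and reports the match that ends earliest.

from collections import deque

def build_automaton(patterns):
    # Trie with failure links (Aho-Corasick). Node 0 is the root.
    trans, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in trans[node]:
                trans[node][ch] = len(trans)
                trans.append({})
                fail.append(0)
                out.append([])
            node = trans[node][ch]
        out[node].append(idx)                 # pattern idx ends at this node
    queue = deque(trans[0].values())          # breadth-first over the trie
    while queue:
        node = queue.popleft()
        for ch, nxt in trans[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in trans[f]:
                f = fail[f]
            fail[nxt] = trans[f][ch] if ch in trans[f] else 0
            out[nxt] += out[fail[nxt]]        # inherit matches of the suffix state
    return trans, fail, out

def first_match(text, patterns):
    # Returns (start_index, pattern_index) of the match ending earliest, or (-1, -1).
    trans, fail, out = build_automaton(patterns)
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in trans[node]:
            node = fail[node]
        node = trans[node].get(ch, 0)
        if out[node]:
            idx = out[node][0]
            return i - len(patterns[idx]) + 1, idx
    return -1, -1

# first_match("1110001101101000", ["0001101101", "01010100100", "10100100010"]) -> (3, 0)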
Yes.
The Boyer–Moore string search algorithm
See also: Knuth–Morris–Pratt algorithm
You could build a suffix array and search; the query time is crazy fast: O(length(pattern)).
BUT you have to build that array first.
It's only worth it when the text is static and the patterns are dynamic.
A solution that could be efficient:
store the patterns in a trie data structure
start searching the list
check whether the next pattern_length chars are in the trie, and stop on success (a walk of at most pattern_length nodes)
step one char and repeat #3
If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.
If your strings need to be flexible, I would also recommend a modified "The Boyer–Moore string search algorithm" as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The model of Boyer-Moore is incredibly efficient for searching a large amount of data for one of multiple strings to match against.
Jacob
If it's a bit array, I suppose doing a rolling sum would be an improvement. If the pattern is length n, sum the first n bits and see if it matches the pattern's sum; always keep track of the first bit of the current window. Then, for each next position, subtract that first bit from the sum, add the newly entering bit, and see if the sum matches the pattern's sum. That saves the linear loop over the pattern except at candidate positions.
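A sketch of that rolling-window prefilter (illustrative Python; the full comparison still has to run whenever the popcounts agree):

def find_with_rolling_sum(bits, pattern):
    # bits and pattern are lists of 0/1 ints; returns the first match index or -1.
    n = len(pattern)
    target = sum(pattern)
    window = sum(bits[:n])
    for i in range(len(bits) - n + 1):
        if i > 0:
            window += bits[i + n - 1] - bits[i - 1]  # slide: add new bit, drop old bit
        if window == target and bits[i:i + n] == pattern:
            return i
    return -1

For several patterns, run it once per pattern and take the minimum of the returned indexes.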
It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.
Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.
If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.
The code looks like this:
try:
    import psyco
    psyco.full()
except ImportError:
    pass

def encode(s):
    def calc_count(count, c):
        return count * (-1 if c == '0' else 1)
    result = []
    c = s[0]
    count = 1
    for i in range(1, len(s)):
        d = s[i]
        if d == c:
            count += 1
        else:
            result.append(calc_count(count, c))
            count = 1
            c = d
    result.append(calc_count(count, c))
    return result

def search(encoded_source, targets):
    def match(encoded_source, t, max_search_len, len_source):
        x = len(t)-1
        # Get the indexes of the longest segments and search them first
        most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
        # Align the signs of the source and target
        index = (0 if encoded_source[0] * t[0] > 0 else 1)
        unencoded_pos = sum(abs(c) for c in encoded_source[:index])
        start_t, end_t = abs(t[0]), abs(t[x])
        for i in range(index, len(encoded_source)-x, 2):
            if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                if start_t <= encoded_start and end_t <= encoded_end:
                    return unencoded_pos + (abs(encoded_source[i]) - start_t)
            unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
            if unencoded_pos > max_search_len:
                return len_source
        return len_source

    len_source = sum(abs(c) for c in encoded_source)
    i, found, target_index = len_source, None, -1
    for j, t in enumerate(targets):
        x = match(encoded_source, t, i, len_source)
        print "Match at: ", x
        if x < i:
            i, found, target_index = x, t, j
    return (i, found, target_index)

if __name__ == "__main__":
    import datetime

    def make_source_text(len):
        from random import randint
        item_len = 8
        item_count = 2**item_len
        table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
        return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))

    targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
    encoded_targets = [encode(t) for t in targets]
    data_len = 10*1000*1000

    s = datetime.datetime.now()
    source_text = make_source_text(data_len)
    e = datetime.datetime.now()
    print "Make source text(length %d): " % data_len, (e - s)

    s = datetime.datetime.now()
    encoded_source = encode(source_text)
    e = datetime.datetime.now()
    print "Encode source text: ", (e - s)

    s = datetime.datetime.now()
    (i, found, target_index) = search(encoded_source, encoded_targets)
    print (i, found, target_index)
    print "Target was: ", targets[target_index]
    print "Source matched here: ", source_text[i:i+len(targets[target_index])]
    e = datetime.datetime.now()
    print "Search time: ", (e - s)
On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.
psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:
Make source text(length 10000000): 0:00:02.277000
Encode source text: 0:00:00.329000
Match at: 2517905
Match at: 494990
Match at: 450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was: 1010010001010100100010
Source matched here: 1010010001010100100010
Search time: 0:00:04.325000
It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.
