Project Euler #22 - Incorrect logic?


I'm tackling some of the programming challenges on Project Euler. The challenge is as follows:
Using names.txt (right click and 'Save Link/Target As...'),
a 46K text file containing over five-thousand first names,
begin by sorting it into alphabetical order. Then working out
the alphabetical value for each name, multiply this value by
its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order,
COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list.
So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
So I've written it in CoffeeScript, but I will explain the logic so it's understandable.
fs = require 'fs'
total = 0
fs.readFile './names.txt', (err, names) ->
  names = names.toString().split(',')
  names = names.sort()
  for num in [0..(names.length-1)]
    asc = 0
    for i in [1..names[num].length]
      asc += names[num].charCodeAt(i-1) - 64
    total += num * asc
  console.log total
So basically, I'm reading the file in. I split the names into an array and sort them, then loop through each name. For each name I go through every character and get its charCode (they are all capitals), then subtract 64 to get its position in the alphabet. Finally, I add to the total variable the loop's num multiplied by the sum of the letter positions.
The answer I get is 870873746, however it's incorrect and other answers have a slightly higher number.
Can anyone see why?

total += num * asc
I think this is where it went wrong. The loop for num starts from 0 (that's how arrays are indexed), but the ranking should start from 1, not 0. So when accumulating the total, the code should be:
total += (num+1) * asc
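To make the off-by-one concrete, here is a minimal Python sketch of the same scoring with a 1-based position. The function name `name_score_total` is made up for illustration, and it assumes the names are already stripped of any quotes and are all uppercase A-Z.

```python
def name_score_total(names):
    """Sum of (1-based alphabetical position) * (alphabetical value) per name."""
    total = 0
    # enumerate(..., start=1) gives the ranking 1, 2, 3, ... instead of 0, 1, 2, ...
    for position, name in enumerate(sorted(names), start=1):
        value = sum(ord(ch) - ord("A") + 1 for ch in name)  # A=1, B=2, ...
        total += position * value
    return total


# COLIN alone is the 1st name and is worth 3 + 15 + 12 + 9 + 14 = 53.
print(name_score_total(["COLIN"]))
```

With a 0-based position the first name would always contribute nothing, which is why the original total comes out slightly too low.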

Related

How many times does a zero occur on an odometer

I am trying to work out how many times a zero occurs on an odometer. I count +1 every time I see a zero.
10 -> +1
100 -> +2 because in 100 I see 2 zeros
10004 -> +3 because I see 3 zeros
So I get,
1 - 100 -> +11
1 - 500 -> +91
1 - 501 -> +92
0 - 4294967295-> +3825876150
I used rubydoctest for it. I am not doing anything with begin_number yet. Can anyone explain how to calculate it without a brute force method?
I made many attempts. They work for numbers like 10, 1000, 10.000, 100.000.000, but not for numbers like 522 or 2280. If I run the rubydoctest, it fails on # >> algorithm_count_zero(1, 500)
# doctest: algorithm_count_zero(begin_number, end_number)
# >> algorithm_count_zero(1, 10)
# => 1
# >> algorithm_count_zero(1, 1000)
# => 192
# >> algorithm_count_zero(1, 10000000)
# => 5888896
# >> algorithm_count_zero(1, 500)
# => 91
# >> algorithm_count_zero(0, 4294967295)
# => 3825876150
def algorithm_count_zero(begin_number, end_number)
  power = Math::log10(end_number) - 1
  if end_number < 100
    return end_number/10
  else
    end_number > 100
    count = (9*(power)-1)*10**power+1
  end
  answer = ((((count / 9)+power)).floor) + 1
end
end_number = 20000
begin_number = 10000
puts "Algorithm #{algorithm_count_zero(begin_number, end_number)}"

As noticed in a comment, this is a duplicate to another question, where the solution gives you correct guidelines.
However, if you want to test your own solution for correctness, I'll put in here a one-liner in the parallel array processing language Dyalog APL (which, by the way, I think everyone modelling mathematics and numbers should use).
Using tryapl.org you'll be able to get a correct answer for any integer value as argument. TryAPL is a web page with a backend that executes simple APL statements ("one-liners", which are very typical of the APL language and its extremely compact code).
The APL one-liner is here:
{+/(c×1+d|⍵)+d×(-c←0=⌊(a|⍵)÷d←a×+0.1)+⌊⍵÷a←10*⌽⍳⌈10⍟⍵} 142857
Copy that and paste it into the edit row at tryapl.org, and press Enter. You will quickly see an integer, which is the answer to your problem. In the code row above, the argument is rightmost; it is 142857 this time, but you can change it to any integer.
Once you have pasted the one-liner and executed it with Enter, the easiest way to get it back for editing is to press [Up arrow]. This returns the most recently entered statement; then you can edit the number sitting rightmost (after the curly brace) and press Enter again to get the answer for a different argument.
Pasting the code row above will return 66765: that many zeros occur in the numbers from 1 to 142857.
If you paste the row below, which is 2 characters shorter, you will see the individual components of the result; the sum of these components makes up the final result. You will be able to see a pattern, which possibly makes it easier to understand what happens.
Try for example
{(c×1+d|⍵)+d×(-c←0=⌊(a|⍵)÷d←a×+0.1)+⌊⍵÷a←10*⌽⍳⌈10⍟⍵} 1428579376
0 100000000 140000000 142000000 142800000 142850000 142857000 142857900 142857930 142857937
... and see how the intermediate results contain segments of the argument 1428579376, starting from the left! There are as many intermediate results as there are digits in the argument (10 this time).
The result for 1428579376 will be 1239080767, i.e. the sum of the 10 numbers above. This many zeros appear in all numbers between 1 and 1428579376 :-).

Consider each odometer position separately. The position x places from the far right changes once every 10^x times. By looking at the numbers to its right, you know how long it will be until it next changes. It will then hold each value for 10^x times before changing, until it reaches the end of the range you are considering, when it will hold its value at that time for some number of times that you can work out given the value at the very end of the range.
Now you have a sequence of the form x...0123456789012...y where you know the length and you know the values of x and y. One way to count the number of 0s (or any other digit) within this sequence is to clip off the prefix from x.. to just before the first 0, and clip off the suffix from just after the last 9 to y. Look for 0s in this prefix and suffix, and measure the length of the long sequence between prefix and suffix. This will be of a length divisible by 10, and will contain each digit the same number of times.
Based on this you should be able to work out, for each position, how often within the range it will assume each of its 10 possible values. By summing up the values for 0 from each of the odometer positions you get the answer you want.
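The per-position idea above can be sketched in Python as follows. This is one possible way to do the bookkeeping, not necessarily what the answerer had in mind; `count_zeros` is a name made up here, and it counts the zeros in the usual decimal forms of 1..n (no leading zeros).

```python
def count_zeros(n):
    """Count zero digits written when listing 1..n in decimal."""
    total = 0
    p = 1  # place value of the position being examined (1, 10, 100, ...)
    while n // (p * 10) > 0:  # the leading position of n is never zero
        high = n // (p * 10)   # digits to the left of this position
        cur = (n // p) % 10    # digit at this position in n
        low = n % p            # digits to the right of this position
        if cur == 0:
            # Full cycles for high parts 1..high-1, plus a partial last cycle.
            total += (high - 1) * p + low + 1
        else:
            # Every high part 1..high sees this position hold 0 for p numbers.
            total += high * p
        p *= 10
    return total


print(count_zeros(500))      # matches the 1 - 500 -> +91 case from the question
print(count_zeros(142857))   # can be compared with the APL result above
```

For a range that does not start at 1, `count_zeros(end_number) - count_zeros(begin_number - 1)` should give the count for begin_number..end_number, assuming begin_number is at least 1.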

How does this randomizer work

I've a piece of code to randomize the order of words in a text file, but I'm not sure exactly what it's doing. Here is the code.
Randomize()
For count = 1 To 10
    rand = (Int((10 - count + 1) * Rnd() + count))
    temp = words(count)
    words(count) = words(rand)
    words(rand) = temp
Next
Could somebody please explain this to me? Thanks in advance.

First check the MSDN description of Rnd and note:
The Rnd function returns a value less than 1 but greater than or equal to 0
and
To produce random integers in a given range, use this formula: Int((upperbound - lowerbound + 1) * Rnd + lowerbound)
With this in mind, we can read the algorithm as follows:
1. Set the current word index to 1 (the first word).
2. Pick a random index that is equal to or greater than the current one (i.e. select a word from the rest of the array).
3. Swap the current word with the randomly picked one.
4. Increase the current word index (i.e. shrink the pool of unassigned words).
5. Go to step 2.
You can also describe it slightly differently:
Imagine that you have an unordered set of words. You pick a random one, remove it from the set, and append it to an ordered array; in the end you have a randomly ordered array of the words from the original set.
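For comparison, here is a rough Python version of the same loop. It is only a sketch: indexes are 0-based rather than 1-based, the function name `shuffle_in_place` is invented, and `rng.random()` plays the role of Rnd (a value in [0, 1)).

```python
import random


def shuffle_in_place(words, rng=random):
    """Shuffle a list the same way as the VB loop, but with 0-based indexes."""
    n = len(words)
    for count in range(n):
        # Same formula as Int((upper - lower + 1) * Rnd + lower),
        # with lower = count and upper = n - 1.
        rand = int((n - count) * rng.random() + count)
        # Swap the current word with the randomly picked one.
        words[count], words[rand] = words[rand], words[count]
    return words


print(shuffle_in_place(["a", "b", "c", "d", "e"]))
```

Everything before index `count` is already in its final place, which is why the random pick is restricted to the remaining tail of the list.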

Efficient database lookup based on input where not all digits are significant

I would like to do a database lookup based on a 10 digit numeric value where only the first n digits are significant. Assume that there is no way in advance to determine n by looking at the value.
For example, I receive the value 5432154321. The corresponding entry (if it exists) might have key 54 or 543215 or any value based on n being somewhere between 1 and 10 inclusive.
Is there any efficient approach to matching on such a string short of simply trying all 10 possibilities?
Some background
The value is from a barcode scan. The barcodes are EAN13 restricted circulation numbers so they have the following structure:
02[1234567890]C
where C is a check sum. The 10 digits in between the 02 and the check sum consist of an item identifier followed by an item measure. There might be a check digit after the item identifier.
Since I can't depend on the data to adhere to any single standard, I would like to be able to define on an ad-hoc basis, how particular barcodes are structured which means that the portion of the 10 digit number that I extract, can be any length between 1 and 10.

Just a few ideas here:
1)
Maybe store these numbers in reversed form in your DB.
If you have N = 54321, you store it as N = 12345 in the DB.
Say N is the name of the column you stored it in.
When you read K = 5432154321, reverse this one too;
you get K1 = 1234512345. Now check against the DB column N
(whose value is, let's say, P) whether K1 % 10^s == P,
where s = floor(log10(P)) + 1.
Note: floor(log10(P)) + 1 is a formula for
the count of digits of the number P > 0.
You may also store the value floor(log10(P)) + 1
in the DB as a precomputed one, so that
you don't need to compute it each time.
2) As this 1) is kind of sick (but maybe best of the 3 ideas here),
maybe you just store them in a string column and check it with
'like operator'. But this is trivial, you probably considered it
already.
3) Or ... you store the numbers reversed, but you also
store all their residues mod 10^k for k=1...10.
col1, col2,..., col10
Then you can compare numbers almost directly,
the check will be something like
N % 10 == col1
or
N % 100 == col2
or
...
(N % 10^10) == col10.
Still not very elegant though (and not quite sure
if applicable to your case).
I decided to check my idea 1).
So here is an example
(I did it in SQL Server).
insert into numbers
    (number, cnt_dig)
values
    (1234, 1 + floor(log10(1234)))
insert into numbers
    (number, cnt_dig)
values
    (51234, 1 + floor(log10(51234)))
insert into numbers
    (number, cnt_dig)
values
    (7812334, 1 + floor(log10(7812334)))
select * from numbers
/*
Now we have this in our table:
id  number   cnt_dig
4   1234     4
5   51234    5
6   7812334  7
*/
-- Note that the actual numbers stored here
-- are the reversed ones: 4321, 43215, 4332187.
-- So far so good.
-- Now we read say K = 433218799 on the input
-- We reverse it and we get K1 = 997812334
declare @K1 bigint
set @K1 = 997812334
select * from numbers
where
    @K1 % power(10, cnt_dig) = number
-- So from the last 3 queries,
-- we get this row:
-- id number cnt_dig
-- 6 7812334 7
--
-- meaning we have a match
-- i.e. the actual number 433218799
-- was matched successfully with the
-- actual number (from the DB) 4332187.
So this idea 1) doesn't seem that bad after all.
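The same check can be tried outside the database. Below is a small Python sketch of idea 1), using the three rows from the SQL example; `reverse_digits` and `matching_keys` are names invented here, and the in-memory list stands in for the table.

```python
def reverse_digits(n):
    """Reverse the decimal digits of a non-negative integer."""
    return int(str(n)[::-1])


def matching_keys(scanned, stored):
    """Return the stored reversed keys that are a prefix of the scanned value.

    stored is a list of (reversed_key, digit_count) pairs, like the
    number and cnt_dig columns in the SQL example.
    """
    k1 = reverse_digits(scanned)
    return [key for key, cnt_dig in stored if k1 % 10 ** cnt_dig == key]


stored = [(1234, 4), (51234, 5), (7812334, 7)]
print(matching_keys(433218799, stored))
```

One thing to watch: reversing a key that ends in 0 (say 540) drops the leading zero of the reversed form, so in that case the digit count would have to come from the original key rather than from log10 of the reversed number.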

Range update and querying in a 2D matrix

I don't have a scenario, but here goes the problem. This one is just driving me crazy. There is an n×n boolean matrix whose elements are all initially 0; n <= 10^6 and is given as input.
Next there will be up to 10^5 queries. Each query can be either set all elements of column c to 0 or 1, or set all elements of row r to 0 or 1. There can be another type of query, printing the total number of 1's in column c or row r.
I have no idea how to solve this and any help would be appreciated. Obviously a O(n) solution per query is not feasible.

The idea of using a number to order the modifications is taken from Dukeling's post.
We will need 2 maps and 4 binary indexed tree (BIT, a.k.a. Fenwick Tree): 1 map and 2 BITs for rows, and 1 map and 2 BITs for columns. Let us call them m_row, f_row[0], and f_row[1]; m_col, f_col[0] and f_col[1] respectively.
Map may be implemented with an array, a tree-like structure, or hashing. The 2 maps are used to store the last modification to a row/column. Since there can be at most 10^5 modifications, you may use that fact to save space compared with a simple array implementation.
BIT has 2 operations:
adjust(value, delta_freq), which adjusts the frequency of the value by delta_freq amount.
rsq(from_value, to_value), (rsq stands for range sum query) which finds the sum of the all the frequencies from from_value to to_value inclusive.
Let us declare global variable: version
Let us define numRow to be the number of rows in the 2D boolean matrix, and numCol to be the number of columns in the 2D boolean matrix.
The BITs should have size of at least MAX_QUERY + 1, since it is used to count the number of changes to the rows and columns, which can be as many as the number of queries.
Initialization:
version = 1
# Map should return <0, 0> for rows or cols not yet
# directly updated by query
m_row = m_col = empty map
f_row[0] = f_row[1] = f_col[0] = f_col[1] = empty BIT
Update algorithm:
update(isRow, value, idx):
    if (isRow):
        # Since setting a row/column to a new value will reset
        # everything done to it, we need to erase earlier
        # modification to it.
        # For example, turn on/off on a row a few times, then
        # query some column
        <prevValue, prevVersion> = m_row.get(idx)
        if ( prevVersion > 0 ):
            f_row[prevValue].adjust( prevVersion, -1 )
        m_row.map( idx, <value, version> )
        f_row[value].adjust( version, 1 )
    else:
        <prevValue, prevVersion> = m_col.get(idx)
        if ( prevVersion > 0 ):
            f_col[prevValue].adjust( prevVersion, -1 )
        m_col.map( idx, <value, version> )
        f_col[value].adjust( version, 1 )
    version = version + 1
Count algorithm:
count(isRow, idx):
    if (isRow):
        # If this is row, we want to find number of reverse modifications
        # done by updating the columns
        <value, row_version> = m_row.get(idx)
        count = f_col[1 - value].rsq(row_version + 1, version)
    else:
        # If this is column, we want to find number of reverse modifications
        # done by updating the rows
        <value, col_version> = m_col.get(idx)
        count = f_row[1 - value].rsq(col_version + 1, version)
    if (isRow):
        if (value == 1):
            # A row has numCol cells (same as numRow in an n x n matrix)
            return numCol - count
        else:
            return count
    else:
        if (value == 1):
            # A column has numRow cells
            return numRow - count
        else:
            return count
The complexity is logarithmic in worst case for both update and count.
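Here is a compact Python sketch of the scheme above, to make the moving parts easier to follow. It is an illustration under some assumptions, not a tuned implementation: the class names `Fenwick` and `LineMatrix` are invented, the matrix is taken to be n×n, and a dict stands in for each map.

```python
class Fenwick:
    """Binary indexed tree over version numbers 1..size."""

    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def adjust(self, i, delta):
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix(self, i):
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def rsq(self, lo, hi):
        return self.prefix(hi) - self.prefix(lo - 1) if lo <= hi else 0


class LineMatrix:
    def __init__(self, n, max_queries):
        self.n = n
        self.version = 1
        # is_row -> {index: (value, version)}; missing means (0, 0).
        self.last = {True: {}, False: {}}
        # is_row -> one BIT per value (0 and 1), indexed by version.
        self.bits = {
            kind: [Fenwick(max_queries + 1), Fenwick(max_queries + 1)]
            for kind in (True, False)
        }

    def update(self, is_row, idx, value):
        prev_value, prev_version = self.last[is_row].get(idx, (0, 0))
        if prev_version > 0:
            # Erase the earlier modification of this row/column.
            self.bits[is_row][prev_value].adjust(prev_version, -1)
        self.last[is_row][idx] = (value, self.version)
        self.bits[is_row][value].adjust(self.version, 1)
        self.version += 1

    def count(self, is_row, idx):
        value, ver = self.last[is_row].get(idx, (0, 0))
        # Crossing lines set to the opposite value after this line was set.
        flipped = self.bits[not is_row][1 - value].rsq(ver + 1, self.version)
        return self.n - flipped if value == 1 else flipped


m = LineMatrix(4, 10)
m.update(True, 0, 1)    # set row 0 to 1
m.update(False, 2, 0)   # set column 2 to 0
print(m.count(True, 0))  # row 0 now holds 1 1 0 1
```

Both operations touch a constant number of BIT entries paths, so each costs O(log Q) where Q is the maximum number of queries.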

Take version just to mean a value that gets auto-incremented for each update.
Store the last version and last update value at each row and column.
Store a list of (versions and counts of zeros and counts of ones) for the rows. The same for the columns. So that's only 2 lists for the entire grid.
When a row is updated, we set its version to the current version and insert into the list for rows the version and if (oldRowValue == 0) zeroCount = oldZeroCount else zeroCount = oldZeroCount + 1 (so it's not the number of zero's, rather the number of times a value was updated with a zero). Same for oneCount. Same for columns.
If you do a print for a row, we get the row's version and last value, we do a binary search for that version in the column list (first value greater than). Then:
if (rowValue == 1)
    target = n*rowValue
             - (latestColZeroCount - colZeroCount)
             + (latestColOneCount - colOneCount)
else
    target = (latestColOneCount - colOneCount)
Not too sure whether the above will work.
That's O(1) for update, O(log k) for print, where k is the number of updates.

Sorting numbers from 1 to 999,999,999 in words as strings

Interesting programming puzzle:
If the integers from 1 to 999,999,999
are written as words, sorted
alphabetically, and concatenated, what
is the 51 billionth letter?
To be precise: if the integers from 1
to 999,999,999 are expressed in words
(omitting spaces, ‘and’, and
punctuation - see note below for format), and sorted
alphabetically so that the first six
integers are
eight
eighteen
eighteenmillion
eighteenmillioneight
eighteenmillioneighteen
eighteenmillioneighteenthousand
and the last is
twothousandtwohundredtwo
then reading top to bottom, left to
right, the 28th letter completes the
spelling of the integer
“eighteenmillion”.
The 51 billionth letter also completes
the spelling of an integer. Which one,
and what is the sum of all the
integers to that point?
Note: For example, 911,610,034 is
written
“ninehundredelevenmillionsixhundredtenthousandthirtyfour”;
500,000,000 is written
“fivehundredmillion”; 1,709 is written
“onethousandsevenhundrednine”.
I stumbled across this on a programming blog, 'Occasionally Sane', and couldn't think of a neat way of doing it. The author of the relevant post says his initial attempt ate through 1.5GB of memory in 10 minutes, and he'd only made it up to 20,000,000 ("twentymillion").
Can anyone share with the group a novel/clever approach to this?

Edit: Solved!
You can create a generator that outputs the numbers in sorted order. There are a few rules for comparing concatenated strings that I think most of us know implicitly:
a < a+b, where b is non-null.
a+b < a+c, where b < c.
a+b < c+d, where a < c, and a is not a prefix of c.
If you start with a sorted list of the first 1000 numbers, you can easily generate the rest by appending "thousand" or "million" and concatenating another group of 1000.
Here's the full code, in Python:
import heapq

first_thousand=[('', 0), ('one', 1), ('two', 2), ('three', 3), ('four', 4),
                ('five', 5), ('six', 6), ('seven', 7), ('eight', 8),
                ('nine', 9), ('ten', 10), ('eleven', 11), ('twelve', 12),
                ('thirteen', 13), ('fourteen', 14), ('fifteen', 15),
                ('sixteen', 16), ('seventeen', 17), ('eighteen', 18),
                ('nineteen', 19)]
tens_name = (None, 'ten', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
             'seventy','eighty','ninety')
# // keeps the indexes integral on both Python 2 and Python 3
for number in range(20, 100):
    name = tens_name[number//10] + first_thousand[number%10][0]
    first_thousand.append((name, number))
for number in range(100, 1000):
    name = first_thousand[number//100][0] + 'hundred' + first_thousand[number%100][0]
    first_thousand.append((name, number))
first_thousand.sort()

def make_sequence(base_generator, suffix, multiplier):
    prefix_list = [(name+suffix, number*multiplier)
                   for name, number in first_thousand[1:]]
    prefix_list.sort()
    for prefix_name, base_number in prefix_list:
        for name, number in base_generator():
            yield prefix_name + name, base_number + number
    return

def thousand_sequence():
    for name, number in first_thousand:
        yield name, number
    return

def million_sequence():
    return heapq.merge(first_thousand,
                       make_sequence(thousand_sequence, 'thousand', 1000))

def billion_sequence():
    return heapq.merge(million_sequence(),
                       make_sequence(million_sequence, 'million', 1000000))

def solve(stopping_size = 51000000000):
    total_chars = 0
    total_sum = 0
    for name, number in billion_sequence():
        total_chars += len(name)
        total_sum += number
        if total_chars >= stopping_size:
            break
    return total_chars, total_sum, name, number
It took a while to run, about an hour. The 51 billionth character is the last character of sixhundredseventysixmillionsevenhundredfortysixthousandfivehundredseventyfive, and the sum of the integers to that point is 413,540,008,163,475,743.

I'd sort the names of the first 20 integers and the names of the tens, hundreds and thousands, work out how many numbers start with each of those, and go from there.
For example, the first few are [ eight, eighteen, eighthundred, eightmillion, eightthousand, eighty, eleven, ....
The numbers starting with "eight" are 8. With "eighthundred", 800-899, 800,000-899,999, 800,000,000-899,999,999. And so on.
The number of letters in the concatenation of words for 0 ( represented by the empty string ) to 99 can be found and totalled; this can be multiplied with "thousand"=8 or "million"=7 added for higher ranges. The value for 800-899 will be 100 times the length of "eighthundred" plus the length of 0-99. And so on.
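The counting shortcut described above can be checked on a small scale. The following Python sketch computes the letter totals for 800-899 and for 1-999,999 from smaller totals, and compares them with brute force; the helper names (`spell`, `letters`, `thousand_block`) are made up for this example.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell(n):
    """Spell 0..999,999 without spaces or 'and'; 0 is the empty string."""
    if n >= 1000:
        return spell(n // 1000) + "thousand" + spell(n % 1000)
    if n >= 100:
        return ONES[n // 100] + "hundred" + spell(n % 100)
    if n >= 20:
        return TENS[n // 10] + ONES[n % 10]
    return ONES[n]


def letters(lo, hi):
    """Brute-force letter count for lo..hi inclusive (used only as a check)."""
    return sum(len(spell(n)) for n in range(lo, hi + 1))


letters_0_99 = letters(0, 99)
letters_0_999 = letters(0, 999)

# 800-899: the prefix "eighthundred" repeats 100 times, followed by 0-99.
letters_800_899 = 100 * len("eighthundred") + letters_0_99


def thousand_block(t):
    """Letters in t,000..t,999: 1000 copies of the prefix, then 0-999."""
    return 1000 * len(spell(t) + "thousand") + letters_0_999


# 1..999,999 is the plain 1-999 range plus one block per thousands prefix.
letters_1_999999 = letters_0_999 + sum(thousand_block(t) for t in range(1, 1000))

print(letters_800_899 == letters(800, 899))
print(letters_1_999999 == letters(1, 999999))
```

The same trick extends one level up with "million", which is what lets a solver skip whole ranges without spelling every number in them.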

This guy has a solution to the puzzle written in Haskell. Apparently Michael Borgwardt was right about using a Trie for finding the solution.

Those strings are going to have lots and lots of common prefixes - perfect use case for a trie, which would drastically reduce memory usage and probably also running time.

Here's my python solution that prints out the correct answer in a fraction of a second. I'm not a python programmer generally, so apologies for any egregious code style errors.
#!/usr/bin/env python
import sys

ONES=[
    "", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen","eighteen", "nineteen",
]
TENS=[
    "zero", "ten", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety",
]

# Integer division (//) is used throughout so this runs on Python 2 and 3.
def to_s_h(i):
    if(i<20):
        return(ONES[i])
    return(TENS[i//10] + ONES[i%10])

def to_s_t(i):
    if(i<100):
        return(to_s_h(i))
    return(ONES[i//100] + "hundred" + to_s_h(i%100))

def to_s_m(i):
    if(i<1000):
        return(to_s_t(i))
    return(to_s_t(i//1000) + "thousand" + to_s_t(i%1000))

def to_s_b(i):
    if(i<1000000):
        return(to_s_m(i))
    return(to_s_m(i//1000000) + "million" + to_s_m(i%1000000))

def try_string(s,t):
    global letters_to_go,word_sum
    l=len(s)
    letters_to_go -= l
    word_sum += t
    if(letters_to_go == 0):
        print("solved: " + s)
        print("sum is: " + str(word_sum))
        sys.exit(0)
    elif(letters_to_go < 0):
        print("failed: " + s + " " + str(letters_to_go))
        sys.exit(-1)

def solve(depth,prefix,prefix_num):
    global millions,thousands,ones,letters_to_go,onelen,thousandlen,word_sum
    src=[ millions,thousands,ones ][depth]
    for x in src:
        num=prefix + x[2]
        nn=prefix_num+x[1]
        try_string(num,nn)
        if(x[0] == 0):
            continue
        if(x[0] == 1):
            stl=(len(num) * 999) + onelen
            ss=(nn*999) + onesum
        else:
            stl=(len(num) * 999999) + thousandlen + onelen*999
            ss=(nn*999999) + thousandsum
        # Skip the whole subtree if the target letter cannot be inside it.
        if(stl < letters_to_go):
            letters_to_go -= stl
            word_sum += ss
        else:
            solve(depth+1,num,nn)

ones=[]
thousands=[]
millions=[]
onelen=0
thousandlen=0
onesum=(999*1000)//2
thousandsum=(999999*1000000)//2
for x in range(1,1000):
    s=to_s_b(x)
    l=len(s)
    ones.append( (0,x,s) )
    onelen += l
    thousands.append( (0,x,s) )
    thousands.append( (1,x*1000,s + "thousand") )
    thousandlen += l + (l+len("thousand"))*1000
    millions.append( (0,x,s) )
    millions.append( (1,x*1000,s + "thousand") )
    millions.append( (2,x*1000000,s + "million") )
ones.sort(key=lambda x: x[2])
thousands.sort(key=lambda x: x[2])
millions.sort(key=lambda x: x[2])
letters_to_go=51000000000
word_sum=0
solve(0,"",0)
It works by precomputing the length of the numbers from 1..999 and 1..999999 so that it can skip entire subtrees unless it knows that the answer lies somewhere within them.

(The first attempt at this is wrong, but I will leave it up since it's more useful to see mistakes on the way to solving something rather than just the final answer.)
I would first generate the strings from 0 to 999 and store them into an array called thousandsStrings. The 0 element is "", and "" represents a blank in the lists below.
The thousandsString setup uses the following:
Units: "" one two three ... nine
Teens: ten eleven twelve ... nineteen
Tens: "" "" twenty thirty forty ... ninety
The thousandsString setup is something like this:
thousandsString[0] = ""
for (i in 1..9)
    thousandsString[i] = Units[i]
end
for (i in 10..19)
    thousandsString[i] = Teens[i-10]
end
for (i in 20..99)
    thousandsString[i] = Tens[i/10] + Units[i%10]
end
for (i in 100..999)
    thousandsString[i] = Units[i/100] + "hundred" + thousandsString[i%100]
end
Then, I would sort that array alphabetically.
Then, assuming t1 t2 t3 are strings taken from thousandsString, all of the strings have the form
t1
OR
t1 + million + t2 + thousand + t3
OR
t1 + thousand + t2
To output them in the proper order, I would process the individual strings, followed by the millions strings followed by the string + thousands strings.
foreach (t1 in thousandsStrings)
    if (t1 == "")
        continue;
    process(t1)
    foreach (t2 in thousandsStrings)
        foreach (t3 in thousandsStrings)
            process (t1 + "million" + t2 + "thousand" + t3)
        end
    end
    foreach (t2 in thousandsStrings)
        process (t1 + "thousand" + t2)
    end
end
where process means store the previous sum length and then add the new string length to the sum and if the new sum is >= your target sum, you spit out the results, and maybe return or break out of the loops, whatever makes you happy.
=====================================================================
Second attempt, the other answers were right that you need to use 3k strings instead of 1k strings as a base.
Start with the thousandsString from above, but drop the blank "" for zero. That leaves 999 elements and call this uStr (units string).
Create two more sets:
tStr = the set of all uStr + "thousand"
mStr = the set of all uStr + "million"
Now create two more set unions:
mtuStr = mStr union tStr union uStr
tuStr = tStr union uStr
Order uStr, tuStr, mtuStr
Now the looping and logic here are a bit different than before.
foreach (s1 in mtuStr)
    process(s1)
    // If this is a millions or thousands string, add the extra strings that can
    // go after the millions or thousands parts.
    if (s1.contains("million"))
        foreach (s2 in tuStr)
            process (s1+s2)
            if (s2.contains("thousand"))
                foreach (s3 in uStr)
                    process (s1+s2+s3)
                end
            end
        end
    end
    if (s1.contains("thousand"))
        foreach (s2 in uStr)
            process (s1+s2)
        end
    end
end

What I did:
1) Iterate through 1 - 999 and generate the words for each of these.
As we generate:
2) Create 3 data structures where each node has a pointer to children and each node has a character value, and a pointer to Siblings. (A binary tree, in fact, but we don't want to think of it that way necessarily - for me it's easier to conceptualise as a list of siblings with lists of children hanging off, but if you think about it {draw a pic} you'll realise it is in fact a Binary Tree).
These 3 data structures are created concurrently as follows:
a) first one with the words as generated (i.e. 1-999 sorted alphabetically)
b) all the values in the first + all the values with 'thousand' appended (i.e. 1-999 and 1,000 - 999,000 (step 1000), 1998 values in total)
c) all the values in (b) + all the values in (a) with 'million' appended (2997 values in total)
3) For every leaf node in(b) add a Child as (a). For every leaf node in (c) add a child as (b).
4) Traverse the tree, counting how many characters we pass and stopping at 51 Billion.
NOTE: This doesn't sum the values (I didn't read that bit when I originally did it), and runs in just over 3 minutes (about 192 secs usually, using c++).
NOTE 2: (in case it isn't obvious) there are only 5,994 values stored, but they are stored in such a way that there are a billion paths through the tree
I did this about a year or two ago when I stumbled across it, and have since realised there are many optimisations (the most time-consuming bit is traversing the tree, by a LONG WAY). There are a few optimisations that I think would significantly improve this approach, but I could never be bothered taking it further, other than to optimise redundant nodes in the tree slightly, so they stored strings rather than characters.
I have seen people claim on line that they've solved it in less than 5 seconds....

Weird but fun idea.
Build a sparse list of the lengths of the numbers from 0 to 9, then 10-90 by tens, then 100, 1000, etc., up to billion; the indexes are the values of the integer part whose length is stored.
Write a function to calculate the length of a number's string using the table
(breaking the number into its parts and looking up the lengths of the parts, never actually creating a string).
Then you're only doing math as you traverse the numbers, calculating the length from the
table and afterward summing for your sum.
With the sum, and the value of the final integer, figure out the integer that's being spelled, and voila, you're done.

Yes, me again, but a completely different approach.
Simply, rather than storing the "onethousandeleventyseven" words, you write the sort to use that when comparing.
Crude java POC:
import java.util.Arrays;
import java.util.Comparator;

public class BillionsBillions implements Comparator<Integer> {
    public int compare(Integer a, Integer b) {
        // Compare by the spelled-out form instead of storing the words.
        String w1 = numberToWords(a);
        String w2 = numberToWords(b);
        return w1.compareTo(w2);
    }

    public static void main(String args[]) {
        // Arrays.sort with a Comparator needs an object array, not long[].
        Integer numbers[] = new Integer[1000000000]; // Bring your 64 bit JVMs
        for (int i = 0; i < 1000000000; i++) {
            numbers[i] = i;
        }
        Arrays.sort(numbers, 0, numbers.length, new BillionsBillions());
        long total = 0;
        for (int i : numbers) {
            String l = numberToWords(i);
            long offset = total + l.length() - 51000000000L;
            if (offset >= 0) {
                // 0-based index of the 51 billionth letter within this word
                int index = (int) (l.length() - offset - 1);
                System.out.println(l.charAt(index));
                break;
            }
            total += l.length();
        }
    }
}
"numberToWords" is left as an exercise for the reader.

Do you need to save the entire string in memory?
If not, just save how many characters you've appended so far. For each iteration, you check the length of the next number's textual representation. If adding it reaches or exceeds the nth letter you are looking for, the letter must be in that string, so extract it by its index, print it, and stop execution. Otherwise, add the string length to the character count and move to the next number.
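As a small illustration of that bookkeeping, here is a Python sketch that finds the nth letter of a concatenation without ever building it. The function name `nth_letter` is invented here, and the words are assumed to arrive already in sorted order (producing that order cheaply is the hard part of the puzzle).

```python
def nth_letter(words, n):
    """Return (word, letter) holding the n-th letter (1-based) of the concatenation."""
    seen = 0  # letters consumed by the words before the current one
    for word in words:
        if seen + len(word) >= n:
            # The target falls inside this word; convert to a 0-based index.
            return word, word[n - seen - 1]
        seen += len(word)
    raise ValueError("fewer than n letters in total")


# The 28th letter completes "eighteenmillion", as stated in the puzzle.
print(nth_letter(["eight", "eighteen", "eighteenmillion"], 28))
```

Since `words` can be any iterable, a generator works too, so memory use stays constant no matter how many words are scanned.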

All the strings are going to start with either one, ten, two, twenty, three, thirty, four, etc so I'd start with figuring out how many are in each of the buckets. Then you should at least know which bucket you need to look closer at.
Then I'd look at subdividing the buckets further based on the possible prefixes. For example, within ninehundred, you are going to have all the same buckets that you had to start off with, just for numbers starting with 900.

The question is about efficient data storage, not string manipulation. Create an enum to represent the words. The words should appear in sorted order so that when it comes time to sort, it is a simplish compare. Now generate the list and sort. Use the fact that you know how long each word is, in conjunction with the enum, to add up to the character you need.

Code wins...
#!/bin/bash
n=0
while [ $n -lt 1000000000 ]; do
    number -l $n | sed -e 's/[^a-z]//g'
    let n=n+1
done | sort > /tmp/bignumbers
awk '
BEGIN {
    total = 0;
}
{
    l = length($0);
    offset = total + l - 51000000000;
    print total " " offset
    if (offset >= 0) {
        c = substr($0, l - offset, 1);
        print c;
        exit;
    }
    total = total + l;
}' /tmp/bignumbers
Tested for a much smaller range ;-). Requires a LOT of diskspace, a compressed filesystem would be, umm, valuable, but not so much memory.
Sort has options to compress work files as well, and you could toss in gzip to directly compress data.
Not the zippiest solution.
But it does work.

Honestly I would let an RDBMS like SQL Server or Oracle do the work for me.
Insert the billion strings into an indexed table.
Compute a string length column.
Start pulling off the top X records at a time with a SUM, until I get to 51 billion.
Might beat up the server for a while as it would need to do a lot of Disk IO, but overall I think I could find an answer faster than someone who would write a program to do it.
Sometimes just getting it done is what the client really wants, and could care less what fancy design pattern or data structure you used.

Figure out the lengths for 1-999 and include the length for 0 as 0.
So now you have an array for 0-999, namely uint32 sizes999[1000];
(not going to get into the details of generating this)
You also need an array of a thousand last letters, last_letters[1000]
(again not going to get into the details of generating this, as it is even easier: exact hundreds end in d, exact tens end in y except 10, which ends in n, and the others cycle through the last letters of one through nine; zero is irrelevant)
uint32 sizes999[1000];
uint64 totallen = 0;
strlen_million = strlen("million");
strlen_thousand = strlen("thousand");
for (uint32 i = 0; i<1000;++i){
    for (uint32 j = 0; j<1000;++j){
        for (uint32 k = 0; k<1000;++k){
            totallen += sizes999[i]+strlen_million +
                        sizes999[j]+strlen_thousand +
                        sizes999[k];
            if (totallen == 51000000000) goto done;
            ASSERT(totallen <51000000000);//he claimed 51000000000 was not intermediate
        }
    }
}
done:
//now use i j k to get last letter by using last_letters
//think of i,j,k as digits base 1000
//if k == 0 && j == 0 then the letter is n (million)
//if only k == 0 then the letter is d (thousand)
//otherwise use the array of last_letters since
//the units digit base 1000, that is k, is not zero
//for the sum of the numbers i,j,k are the digits of the number base 1000 so
n = i*1000000 + j*1000 + k;
//represent the number and use
sum = n*(n+1)/2;
if you need to do it for number other than 51000000000 then also calculate sums_sizes999 and use that in the natural way.
total memory: 0(1000);
total time: 0(n) where n is the number
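For what it's worth, the two lookup tables this answer skips over could be generated along these lines. This is only a sketch in Python, assuming the numbers are spelled with no spaces, hyphens or "and" (e.g. "threehundredfortytwo"); the helper name below_1000 is made up for illustration, while sizes999 and last_letters follow the names used above.

```python
# Assumed spelling rules: no spaces, no hyphens, no "and".
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_1000(n):
    """Spell 0..999 as one word; 0 becomes the empty string (hypothetical helper)."""
    hundreds, rest = divmod(n, 100)
    text = (ONES[hundreds] + "hundred") if hundreds else ""
    if rest >= 20:
        text += TENS[rest // 10] + ONES[rest % 10]
    else:
        text += ONES[rest]
    return text


words = [below_1000(n) for n in range(1000)]
sizes999 = [len(w) for w in words]                   # sizes999[0] is 0
last_letters = [w[-1] if w else "" for w in words]   # entry 0 is unused

print(sizes999[342], last_letters[342])  # length and last letter of "threehundredfortytwo"
```

Building the last-letter table from the spelled-out words avoids having to hand-code the special cases for 10-19.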

This is what I'd do:
Create an array of 2,997 strings: "one" through "ninehundredninetynine", "onethousand" through "ninehundredninetyninethousand", and "onemillion" through "ninehundredninetyninemillion".
Store the following about each string: length (this can be calculated of course), the integer value represented by the string, and some enum to signify whether it's "ones", "thousands", or "millions".
Sort the 2,997 strings alphabetically.
With this array created, it's straightforward to find all 999,999,999 strings in order alphabetically based on the following observations:
Nothing can follow a "ones" string
Either nothing, or a "ones" string, can follow a "thousands" string
Either nothing, a "ones" string, a "thousands" string, or a "thousands" string then a "ones" string, can follow a "millions" string.
Constructing the words basically involves creating one- to three-token "words" from these 2,997 tokens, making sure that the order of the tokens makes a valid number according to the rules above. Given a particular "word", the next "word" is found like this:
Lengthen the "word" by adding the token first alphabetically, if possible.
If this can't be done, advance the rightmost token to the next one alphabetically, if possible.
If this too is not possible, then remove the rightmost token, and advance the second-rightmost token to the next one alphabetically, if possible.
If this too is not possible, you're done.
At each step you can calculate the total length of the string and the sum of the numbers by just keeping two running totals.
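A rough sketch of the token walk described above, in Python. It assumes numbers are spelled with no spaces, hyphens or "and", and the names make_tokens, words_in_order and below_1000 are invented for this illustration. The recursion plays the role of the lengthen/advance/remove steps: each word is emitted, then its extensions, then the next token at the same position.

```python
from itertools import islice

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_1000(n):
    """Spell 0..999 as one word; 0 becomes the empty string."""
    hundreds, rest = divmod(n, 100)
    text = (ONES[hundreds] + "hundred") if hundreds else ""
    if rest >= 20:
        text += TENS[rest // 10] + ONES[rest % 10]
    else:
        text += ONES[rest]
    return text


# level 0 = "ones", 1 = "thousands", 2 = "millions"
LEVEL_SUFFIXES = [("", 1), ("thousand", 1_000), ("million", 1_000_000)]


def make_tokens(group_max=999):
    """Return the (text, value, level) tokens sorted alphabetically by text."""
    tokens = [(below_1000(g) + suffix, g * scale, level)
              for level, (suffix, scale) in enumerate(LEVEL_SUFFIXES)
              for g in range(1, group_max + 1)]
    return sorted(tokens)


def words_in_order(tokens, max_level=3, prefix="", value=0):
    """Yield (word, number) pairs in alphabetical order of the word."""
    for text, amount, level in tokens:
        if level >= max_level:
            continue  # e.g. a "millions" token cannot follow a "thousands" token
        yield prefix + text, value + amount
        if level > 0:  # only "thousands" and "millions" tokens can be extended
            yield from words_in_order(tokens, level, prefix + text, value + amount)


# Two running totals over the first few entries only.
letters = total = 0
for word, number in islice(words_in_order(make_tokens()), 5):
    letters += len(word)
    total += number
    print(word, number, letters, total)
```

The generator is lazy, so taking a handful of entries is cheap; walking all 999,999,999 of them one by one this way would still be slow in Python.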

It's important to note that there is a lot of overlap and double counting if you iterate over all one billion possible numbers. The number of strings that start with "eight" is the same as the number of strings that start with "nin", "seven", "six", etc.
To me, this begs for a dynamic programming solution where the number of strings for tens, hundreds, thousands, etc. is calculated and stored in some type of lookup table. Of course, there will be special cases for one vs. eleven, two vs. twelve, etc.
I'll update this if I can get a quick running solution.

WRONG!!!!!!!!! I READ THE PROBLEM WRONG. I thought it meant "what's the last letter of the alphabetically last number"
what's wrong with:
public class Nums {
    // if overflows happen, switch to a BigDecimal or something
    // with arbitrary precision
    public static void main(String[] args) {
        System.out.println("last letter: " + lastLetter(1L, 51000000L));
        System.out.println("sum: " + sum(1L, 51000000L));
    }
    static char lastLetter(long start, long end) {
        String last = toWord(start);
        for (long i = start; i <= end; i++) {
            String current = toWord(i);
            // compareTo returns some positive value when current sorts after last
            if (current.compareTo(last) > 0)
                last = current;
        }
        return last.charAt(last.length() - 1);
    }
    static String toWord(long num) {
        // should be relatively easy, but for now ...
        return "one";
    }
    static long sum(long first, long last) {
        // arithmetic series: first + (first + 1) + ... + last
        return (first + last) * (last - first + 1) / 2;
    }
}
haven't actually tried this :/ LOL

You have one billion numbers and 51 billion characters - there's a good chance that this is a trick question, as there are an average of 51 characters per number. Sum up the conversions of all the numbers and see if it adds up to 51 billion.
Edit: It adds up to 70,305,000,000 characters, so this is the wrong answer.
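The 70,305,000,000 figure can be checked without enumerating a billion numbers. A small Python sketch, assuming the numbers are spelled with no spaces, hyphens or "and" (the helper below_1000 is a made-up name): every three-digit group word appears 10^6 times in each of the three positions, and "million" and "thousand" each appear whenever their group is non-zero, i.e. 999 * 10^6 times.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_1000(n):
    """Spell 0..999 as one word; 0 becomes the empty string."""
    hundreds, rest = divmod(n, 100)
    text = (ONES[hundreds] + "hundred") if hundreds else ""
    if rest >= 20:
        text += TENS[rest // 10] + ONES[rest % 10]
    else:
        text += ONES[rest]
    return text


# Letters used by 1..999 (0 contributes nothing).
group_letters = sum(len(below_1000(n)) for n in range(1, 1000))

# Three group positions, each group value paired with 10**6 combinations of
# the other two groups, plus the scale words for non-zero groups.
total_letters = (3 * 10**6 * group_letters
                 + 999 * 10**6 * (len("million") + len("thousand")))

print(group_letters)   # 18440
print(total_letters)   # 70305000000
```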

I solved this in Java sometime in 2008 as part of an application to work at ITA Software.
The code is long, and it now being three years later, I look at it with a bit of horror... So I'm not going to post it.
But I'll post quotes from some notes that I included with the application.
The problem with this puzzle is of course the size. The naïve approach would be to sort the list in word number order and then to iterate through the sorted list counting characters and summing. With a list of size 999,999,999 this would of course take a rather long time and the sort could likely not be done in memory.
But there are natural patterns in the ordering which allow shortcuts.
Immediately following any entry (say the number is X) ending in “million” will come 999,999 entries starting with the same text, representing all the numbers from X + 1 to X + 10^6 - 1.
The sum of all these numbers can be computed by a classic formula (an “arithmetic series”), and the character count can be computed by a similarly simple formula based on the prefix (X above) and a once-computed character count for the numbers from 1 to 999,999. Both depend only on the “millions” part of the number at the base of the range. Thus if the character count for the entire range will keep the entire count below the search goal, the individual entries need not be traversed.
Similar shortcuts apply for “thousand”, and indeed could be applied to “hundred” or “billion” though I didn’t bother with shortcuts at the hundreds level and the billions level is out of range for this problem.
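To make the shortcut concrete, here is a small Python sketch of the block totals. It is not the Java code from the notes, just an illustration under the assumption that numbers are spelled with no spaces, hyphens or "and"; block_totals, to_words and below_1000 are hypothetical names.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_1000(n):
    """Spell 0..999 as one word; 0 becomes the empty string."""
    hundreds, rest = divmod(n, 100)
    text = (ONES[hundreds] + "hundred") if hundreds else ""
    if rest >= 20:
        text += TENS[rest // 10] + ONES[rest % 10]
    else:
        text += ONES[rest]
    return text


def to_words(n):
    """Spell 1..999,999,999 as one word."""
    millions, rest = divmod(n, 1_000_000)
    thousands, units = divmod(rest, 1000)
    text = (below_1000(millions) + "million") if millions else ""
    if thousands:
        text += below_1000(thousands) + "thousand"
    return text + below_1000(units)


# Once-computed character counts for the tails 1..999 and 1..999,999.
CHARS_1_TO_999 = sum(len(below_1000(n)) for n in range(1, 1000))
# Each group word appears 1000 times in each of the two positions, and
# "thousand" appears in the 999,000 numbers with a non-zero thousands group.
CHARS_1_TO_999999 = 2 * 1000 * CHARS_1_TO_999 + 999 * 1000 * len("thousand")


def block_totals(base, block_size, chars_in_block):
    """Letters and numeric sum for base+1 .. base+block_size-1 in one step."""
    count = block_size - 1
    chars = count * len(to_words(base)) + chars_in_block
    total = count * base + count * block_size // 2  # arithmetic series
    return chars, total


# Everything that follows "sevenmillion" and shares its prefix.
print(block_totals(7_000_000, 1_000_000, CHARS_1_TO_999999))
```

If adding the block's letter count would keep the running count below the target, the whole block can be added to both totals and skipped; otherwise the search has to descend into it.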
In order to apply these shortcuts, my code creates and sorts a list of 2997 objects representing the numbers:
1 to 999 stepping by 1
1000 to 999000 stepping by 1000
1000000 to 999000000 stepping by 1000000
The code iterates through this list, accumulating sums and character counts, recursively creating, sorting and traversing similar but smaller lists as needed.
Explicit counting and adding is only needed near the end.
I didn't get the job, but later used the code as a "code sample" for another job, which I did get.
The Java code using these techniques for skipping much of the explicit counting and adding runs in about 8 seconds.
