How can I generate the Rowland prime sequence idiomatically in J? - tacit-programming

If you're not familiar with the Rowland prime sequence, you can find out about it here. I've created an ugly, procedural monadic verb in J to generate the first n terms in this sequence, as follows:
rowland =: monad define
result =. 0 $ 0
t =. 1 $ 7
while. (# result) < y do.
a =. {: t
n =. 1 + # t
t =. t , a + n +. a
d =. | -/ _2 {. t
if. d > 1 do.
result =. ~. result , d
end.
end.
result
)
This works perfectly, and it indeed generates the first n terms in the sequence. (By n terms, I mean the first n distinct primes.)
Here is the output of rowland 20:
5 3 11 23 47 101 7 13 233 467 941 1889 3779 7559 15131 53 30323 60647 121403 242807
My question is, how can I write this in more idiomatic J? I don't have a clue, although I did write the following function to find the differences between each successive number in a list of numbers, which is one of the required steps. Here it is, although it too could probably be refactored by a more experienced J programmer than I:
diffs =: monad : '}: (|#-/#(2&{.) , $:#}.) ^: (1 < #) y'

While I already marked estanford's answer as the correct one, I've come a long, long way with J since I asked this question. Here's a much more idiomatic way to generate the rowland prime sequence in J:
~. (#~ 1&<) | 2 -/\ (, ({: + ({: +. >:##)))^:(1000 - #) 7
The expression (, ({: + ({: +. >:##)))^:(1000 - #) 7 generates the so-called original sequence up to 1000 members. The first differences of this sequence can be generated by | 2 -/\, i.e., the absolute values of the differences of every two elements. (Compare this to my original, long-winded diffs verb from the original question.)
Lastly, we remove the ones and the duplicate primes ~. (#~ 1&<) to get the sequence of primes.
This is vastly superior to the way I was doing it before. It can easily be turned into a verb to generate n number of primes with a little recursion.

I don't have a full answer yet, but this essay by Roger Hui has a tacit construct you can use to replace explicit while loops. Another (related) avenue would be to make the inner logic of the block into a tacit expression like so:
FUNWITHTACIT =: ] , {: + {: +. 1 + #
rowland =: monad define
result =. >a:
t =. 7x
while. (# result) < y do.
t =. FUNWITHTACIT t
d =. | -/ _2 {. t
result =. ~.result,((d>1)#d)
end.
result
)
(You might want to keep the if block for efficiency, though, since I wrote the code in such a way that result is modified regardless of whether or not the condition was met -- if it wasn't, the modification has no effect. The if logic could even be written back into the tacit expression by using the Agenda operator.)
A complete solution would consist of finding out how to represent all the logic inside the while loop of as a single function, and then use Roger's trick to implement the while logic as a tacit expression. I'll see what I can turn up.
As an aside, I got J to build FUNWITHTACIT for me by taking the first few lines of your code, manually substituting in the functions you declared for the variable values (which I could do because they were all operating on a single argument in different ways), replaced every instance of t with y and told J to build the tacit equivalent of the resulting expression:
]FUNWITHTACIT =: 13 : 'y,({:y)+(1+#y)+.({:y)'
] , {: + {: +. 1 + #
Using 13 to declare the monad is how J knows to take a monad (otherwise explicitly declared with 3 : 0, or monad define as you wrote in your program) and convert the explicit expression into a tacit expression.
EDIT:
Here are the functions I wrote for avenue (2) that I mentioned in the comment:
candfun =: 3 : '(] , {: + {: +. 1 + #)^:(y) 7'
firstdiff =: [: | 2 -/\ ]
triage =: [: /:~ [: ~. 1&~: # ]
rowland2 =: triage #firstdiff #candfun
This function generates the first n-many candidate numbers using the Rowland recurrence relation, evaluates their first differences, discards all first-differences equal to 1, discards all duplicates, and sorts them in ascending order. I don't think it's completely satisfactory yet, since the argument sets the number of candidates to try instead of the number of results. But, it's still progress.
Example:
rowland2 1000
3 5 7 11 13 23 47 101 233 467 941
Here's a version of the first function I posted, keeping the size of each argument to a minimum:
NB. rowrec takes y=(n,an) where n is the index and a(n) is the
NB. nth member of the rowland candidate sequence. The base case
NB. is rowrec 1 7. The output is (n+1,a(n+1)).
rowrec =: (1 + {.) , }. + [: +./ 1 0 + ]
rowland3 =: 3 : 0
result =. >a:
t =. 1 7
while. y > #result do.
ts =. (<"1)2 2 $ t,rowrec t
fdiff =. |2-/\,>(}.&.>) ts
if. 1~:fdiff do.
result =. ~. result,fdiff
end.
t =. ,>}.ts
end.
/:~ result
)
which finds the first y-many distinct Rowland primes and presents them in ascending order:
rowland3 20
3 5 7 11 13 23 47 53 101 233 467 941 1889 3779 7559 15131 30323 60647 121403 242807
Most of the complexity of this function comes from my handling of boxed arrays. It's not pretty code, but it only keeps 4+#result many data elements (which grows on a log scale) in memory during computation. The original function rowland keeps (#t)+(#result) elements in memory (which grows on a linear scale). rowland2 y builds an array of y-many elements, which makes its memory profile almost the same as rowland even though it never grows beyond a specified bound. I like rowland2 for its compactness, but without a formula to predict the exact size of y needed to generate n-many distinct primes, that task will need to be done on a trial-and-error basis and thus potentially use many more cycles than rowland or rowland3 on redundant calculations. rowland3 is probably more efficient than my version of rowland, since FUNWITHTACIT recomputes #t on every loop iteration -- rowland3 just increments a counter, which is less computationally intensive.
Still, I'm not happy with rowland3's explicit control structures. It seems like there should be a way to accomplish this behavior using recursion or something.

Related

Cartesian product in J

I'm trying to reproduce APL code for the game of life function in J. A YouTube video explaining this code can be found after searching "Game of Life in APL". Currently I have a matrix R which is:
0 0 0 0 0 0 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 0 0 1 0 0 0
0 0 0 0 0 0 0
I wrote J code which produces the adjacency list (number of living cells in adjacent squares) which is the following:
+/ ((#:i.4),-(#:1+i.2),(1 _1),.(_1 1)) |. R
And produces:
0 0 1 2 2 1 0
0 1 3 4 3 1 0
0 1 4 5 3 0 0
0 1 3 2 1 0 0
0 0 1 1 0 0 0
My main issue with this code is that it isn't elegant, as ((#:i.4),-(#:1+i.2),(1 _1),.(_1 1)) is needed just to produce:
0 0
0 1
1 0
1 1
0 _1
_1 0
_1 1
1 _1
Which is really just the outer product or Cartesian product between vectors 1 0 _1 and itself. I could not find an easy way to produce this Cartesian product, so my end question is how would I produce the required vector more elegantly?
A Complete Catalog
#Michael Berry's answer is very clear and concise. A sterling example of the J idiom table (f"0/~). I love it because it demonstrates how the subtle design of J has permitted us to generalize and extend a concept familiar to anyone from 3rd grade: arithmetic tables (addition tables, +/~ i. 10 and multiplication tables */~ i.12¹), which even in APL were relatively clunky.
In addition to that fine answer, it's also worth noting that there is a primitive built into J to calculate the Cartesian product, the monad {.
For example:
> { 2 # <1 0 _1 NB. Or i:_1 instead of 1 0 _1
1 1
1 0
1 _1
0 1
0 0
0 _1
_1 1
_1 0
_1 _1
Taking Inventory
Note that the input to monad { is a list of boxes, and the number of boxes in that list determines the number of elements in each combination. A list of two boxes produces an array of 2-tuples, a list of 3 boxes produces an array of 3-tuples, and so on.
A Tad Excessive
Given that full outer products (Cartesian products) are so expensive (O(n^m)), it occurs to one to ask: why does J have a primitive for this?
A similar misgiving arises when we inspect monad {'s output: why is it boxed? Boxes are used in J when, and only when, we want to consolidate arrays of incompatible types and shapes. But all the results of { y will have identical types and shapes, by the very definition of {.
So, what gives? Well, it turns out these two issues are related, and justified, once we understand why the monad { was introduced in the first place.
I'm Feeling Ambivalent About This
We must recall that all verbs in J are ambivalent perforce. J's grammar does not admit a verb which is only a monad, or only a dyad. Certainly, one valence or another might have an empty domain (i.e. no valid inputs, like monad E. or dyad ~. or either valence of [:), but it still exists.
An valence with an empty domain is "real", but its range of valid inputs is empty (an extension of the idea that the range of valid inputs to e.g. + is numbers, and anything else, like characters, produces a "domain error").
Ok, fine, so all verbs have two valences, so what?
A Selected History
Well, one of the primary design goals Ken Iverson had for J, after long experience with APL, was ditching the bracket notation for array indexing (e.g. A[3 4;5 6;2]), and recognizing that selection from an array is a function.
This was an enormous insight, with a serious impact on both the design and use of the language, which unfortunately I don't have space to get into here.
And since all functions need a name, we had to give one to the selection function. All primitive verbs in J are spelled with either a glyph, an inflected glyph (in my head, the ., :, .: etc suffixes are diacritics), or an inflected alphanumeric.
Now, because selection is so common and fundamental to array-oriented programming, it was given some prime real estate (a mark of distinction in J's orthography), a single-character glyph: {².
So, since { was defined to be selection, and selection is of course dyadic (i.e having two arguments: the indices and the array), that accounts for the dyad {. And now we see why it's important to note that all verbs are ambivalent.
I Think I'm Picking Up On A Theme
When designing the language, it would be nice to give the monad { some thematic relationship to "selection"; having the two valences of a verb be thematically linked is a common pattern in J, for elegance and mnemonic purposes.
That broad pattern is also a topic worthy of a separate discussion, but for now let's focus on why catalog / Cartesian product was chosen for monad {. What's the connection? And what accounts for the other quirk, that its results are always boxed?
Bracketectomy
Well, remember that { was introduced to replace -- replace completely -- the old bracketing subscript syntax of APL and (and many other programming languages and notations). This at once made selection easier, more useful, and also simplified J's syntax: in APL, the grammar, and consequently parser, had to have special rules for indexing like:
A[3 4;5 6;2]
The syntax was an anomaly. But boy, wasn't it useful and expressive from the programmer's perspective, huh?
But why is that? What accounts for the multi-dimensional bracketing notation's economy? How is it that we can say so much in such little space?
Well, let's look at what we're saying. In the expression above A[3 4;5 6;2], we're asking for the 3rd and 4th rows, the 5th and 6th columns, and the 2nd plane.
That is, we want
plane 2, row 3, column 5, and
plane 2, row 3, column 6, and
plane 2, row 4, column 5 and
plane 2, row 4, column 6
Think about that a second. I'll wait.
The Moment Ken Blew Your Mind
Boom, right?
Indexing is a Cartesian product.
Always has been. But Ken saw it.
So, now, instead of saying A[3 4;5 6;2] in APL (with some hand-waving about whether []IO is 1 or 0), in J we say:
(3 4;5 6;2) { A
which is, of course, just shorthand, or syntactic sugar, for:
idx =. { 3 4;5 6;2 NB. Monad {
idx { A NB. Dyad {
So we retained the familiar, convenient, and suggestive semicolon syntax (what do you want to bet link being spelled ; is also not a coincidence?) while getting all the benefits of turning { into a first-class function, as it always should have been³.
Opening The Mystery Box
Which brings us back to that other, final, quibble. Why the heck are monad {'s results boxed, if they're all regular in type and shape? Isn't that superfluous and inconvenient?
Well, yes, but remember that an unboxed, i.e. numeric, LHA in x { y only selects items from y.
This is convenient because it's a frequent need to select the same item multiple times (e.g. in replacing 'abc' with 'ABC' and defaulting any non-abc character to '?', we'd typically say ('abc' i. y) { 'ABC','?', but that only works because we're allowed to select index 4, which is '?', multiple times).
But that convenience precludes using straight numeric arrays to also do multidimensional indexing. That is, the convenience of unboxed numbers to select items (most common use case) interferes with also using unboxed numeric arrays to express, e.g. A[17;3;8] by 17 3 8 { A. We can't have it both ways.
So we needed some other notation to express multi-dimensional selections, and since dyad { has left-rank 0 (precisely because of the foregoing), and a single, atomic box can encapsulate an arbitrary structure, boxes were the perfect candidate.
So, to express A[17;3;8], instead of 17 3 8 { A, we simply say (< 17;3;8) { A, which again is straighforward, convenient, and familiar, and allows us to do any number of multi-dimensional selections simultaneously e.g. ( (< 17;3;8) , (<42; 7; 2) { A), which is what you'd want and expect in an array-oriented language.
Which means, of course, that in order to produce the kinds of outputs that dyad { expects as inputs, monad { must produce boxes⁴. QED.
Oh, and PS: since, as I said, boxing permits arbitrary structure in a single atom, what happens if we don't box a box, or even a list of a boxes, but box a boxed box? Well, have you ever wanted a way to say "I want every index except the last" or 3rd, or 42nd and 55th? Well...
Footnotes:
¹ Note that in the arithmetic tables +/~ i.10 and */~ i.12, we can elide the explicit "0 (present in ,"0/~ _1 0 1) because arithmetic verbs are already scalar (obviously)
² But why was selection given that specific glyph, {?
Well, Ken intentionally never disclosed the specific mnemonic choices used in J's orthography, because he didn't want to dictate such a personal choice for his users, but to me, Dan, { looks like a little funnel pointing right-to-left. That is, a big stream of data on the right, and a much smaller stream coming out the left, like a faucet dripping.
Similarly, I've always seen dyad |: like a little coffee table or Stonehenge trilithon kicked over on its side, i.e. transposed.
And monad # is clearly mnemonic (count, tally, number of items), but the dyad was always suggestive to me because it looked like a little net, keeping the items of interest and letting everything else "fall through".
But, of course, YMMV. Which is precisely why Ken never spelled this out for us.
³ Did you also notice that while in APL the indices, which are control data, are listed to the right of the array, whereas in J they're now on the left, where control data belong?
⁴ Though this Jer would still like to see monad { produce unboxed results, at the cost of some additional complexity within the J engine, i.e. at the expense of the single implementer, and to the benefit of every single user of the language
n There is a lot of interesting literature which goes into this material in more detail, but unfortunately I do not have time now to dig it up. If there's enough interest, I may come back and edit the answer with references later. For now, it's worth reading Mastering J, an early paper on J by one of the luminaries of J, Donald McIntyre, which makes mention of the eschewing of the "anomalous bracket notation" of APL, and perhaps a tl;dr version of this answer I personally posted to the J forums in 2014.
,"0/ ~ 1 0 _1
will get you the Cartesian product you ask for (but you may want to reshape it to 9 by 2).
The cartesian product is the monadic verb catalog: {
{ ;~(1 0 _1)
┌────┬────┬─────┐
│1 1 │1 0 │1 _1 │
├────┼────┼─────┤
│0 1 │0 0 │0 _1 │
├────┼────┼─────┤
│_1 1│_1 0│_1 _1│
└────┴────┴─────┘
Ravel (,) and unbox (>) for a 9,2 list:
>,{ ;~(1 0 _1)
1 1
1 0
1 _1
0 1
0 0
0 _1
_1 1
_1 0
_1 _1

Improve a working "Bulgarian Solitaire" J verb

"Bulgarian Solitaire" is a mathematical curiosity. It is played with a deck of 45 (any triangular number will work) unmarked cards. Put them into randomly sized piles. Then, to play a round, remove a card from each pile and create a new pile with the removed cards. Repeating this step eventually yields the configuration 1 2 3 4 5 6 7 8 9 (for 45 cards), which is clearly a fixed point of the game and thus the end of the solitaire. I wanted to simulate this game in J.
After a couple of days thinking about it and some long-awaited insight into J gerunds, I came up with a solution, on which I would like some opinions. It starts with this verb:
bsol =: ((#~ ~:&0) , #)#:(-&1)^:(<_)
Given a vector of positive integers whose sum is triangular, this verb returns a rank 2 array showing the rounds of the solitaire that results. I also came up with this verb to generate an initial configuration, but I'm less happy with it:
t =: 45 & - # (+/) NB. Would work with any triangular number
cards =: (]`(]#,>:#?&t#]))#.(0&<#t)^:_
Given a vector y of positive integers, t returns the defect from 45, i.e., the number 45 - +/ y of cards not accounted for in the piles represented by the argument. Using t, the verb cards appends to such a vector y an integer from >: i. t y repeatedly until the defect is 0.
Expanding t explicitly, I get
cards =: (]`(]#,>:#?&(45 & - # (+/))#]))#.(0&<#(45 & - # (+/)))^:_
I feel like this is not very brief, and maybe overly parenthesized. But it does work, and the complete solution now looks like this:
bsol # cards # >: # ? 44 NB. Choose the first pile randomly from >: i. 44
Without the named verbs:
(((#~ ~:&0) , #)#:(-&1)^:(<_)) #: ((]`(]#,>:#?&(45 & - # (+/))#]))#.(0&<#(45 & - # (+/)))^:_)#>:#? 44
Not knowing much about J idiom, I have the same feeling about this: it's not very brief, certainly redundant (would it be better to use a local verb like t here, since it's repeated, e.g.?), and probably overly parenthesized. What opportunities do I have to improve this program?
You can improve t with
t =: 45 - +/
Using 46 - +/ will spare you some >:.
You can replace cards with a recursive definition:
cards =: }.`(($:#] , -) ?)#.(0&<)
where, now, cards n produces an initial configuration with sum n.
In bsol you don't need -&1 if you remove (-.) the zeroes and rearrange it like:
bsol =: (0 -.~ [: (, #) <:)^:(<_)

Complex Numbers Seemingly Arising from Non-Complex Logarithms

I have a simple program written in TI-BASIC that converts from base 10 to base 2
0->B
1->E
Input "DEC:",D
Repeat D=0
int(round(log(D)/log(2),1))->E
round(E)->E
B+10^E->B
D-2^E->D
End
Disp B
This will sometimes return an the error 'ERR: DATA TYPE'. I checked, and this is because the variable D, will sometimes become a complex number. I am not sure how this happens.
This happens with seemingly random numbers, like 5891570. It happens with this number, but not something close to it like 5891590 Which is strange. It also happens with 1e30, But not 1e25. Another example is 1111111111111111, and not 1111111111111120.
I haven't tested this thoroughly, and don't see any pattern in these numbers. Any help would be appreciated.
The error happens because you round the logarithm to one decimal place before taking the integer part; therefore, if log(D)/log(2) is something like 8.99, you will round E up rather than down, and 2^9 will be subtracted from D instead of 2^8, causing, in the next iteration, D to become negative and its logarithm to be complex. Let's walk through your code when D is 511, which has base-2 logarithm 8.9971:
Repeat D=0 ;Executes first iteration without checking whether D=0
log(D)/log(2 ;8.9971
round(Ans,1 ;9.0
int(Ans ;9.0
round(Ans)->E ;E = 9.0
B+10^E->B ;B = 1 000 000 000
D-2^E->D ;D = 511-512 = -1
End ;loops again, since D≠0
---next iteration:----
log(D ;log(-1) = 1.364i; throws ERR:NONREAL ANS in Real mode
Rounding the logarithm any more severely than nine decimal places (nine digits is the default for round( without a "digits" argument) is completely unnecessary, as on my TI-84+ rounding errors do not accumulate: round(int(log(2^X-1)/log(2)) returns X-1 and round(int(log(2^X)/log(2)) returns X for all integer X≤28, which is high enough that precision would be lost anyway in other parts of the calculation.
To fix your code, simply round only once, and only to nine places. I've also removed the unnecessary double-initialization of E, removed your close-parens (it's still legal code!), and changed the Repeat (which always executes one loop before checking the condition D=0) to a While loop to prevent ERR:DOMAIN when the input is 0.
0->B
Input "DEC:",D
While D
int(round(log(D)/log(2->E
B+10^E->B
D-2^E->D
End
B ;on the last line, so it prints implicitly
Don't expect either your code or my fix to work correctly for D > 213 or so, because your calculator can only store 14 digits in its internal representation of any number. You'll lose the digits while you store the result into B!
Now for a trickier, optimized way of computing the binary representation (still only works for D < 213:
Input D
int(2fPart(D/2^cumSum(binomcdf(13,0
.1sum(Ans10^(cumSum(1 or Ans

Check whether a point is inside a rectangle by bit operator

Days ago, my teacher told me it was possible to check if a given point is inside a given rectangle using only bit operators. Is it true? If so, how can I do that?
This might not answer your question but what you are looking for could be this.
These are the tricks compiled by Sean Eron Anderson and he even put a bounty of $10 for those who can find a single bug. The closest thing I found here is a macro that finds if any integer X has a word which is between M and N
Determine if a word has a byte between m and n
When m < n, this technique tests if a word x contains an unsigned byte value, such that m < value < n. It uses 7 arithmetic/logical operations when n and m are constant.
Note: Bytes that equal n can be reported by likelyhasbetween as false positives, so this should be checked by character if a certain result is needed.
Requirements: x>=0; 0<=m<=127; 0<=n<=128
#define likelyhasbetween(x,m,n) \
((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
This technique would be suitable for a fast pretest. A variation that takes one more operation (8 total for constant m and n) but provides the exact answer is:
#define hasbetween(x,m,n) \
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
It is possible if the number is a finite positive integer.
Suppose we have a rectangle represented by the (a1,b1) and (a2,b2). Given a point (x,y), we only need to evaluate the expression (a1<x) & (x<a2) & (b1<y) & (y<b2). So the problems now is to find the corresponding bit operation for the expression c
Let ci be the i-th bit of the number c (which can be obtained by masking ci and bit shift). We prove that for numbers with at most n bit, c<d is equivalent to r_(n-1), where
r_i = ((ci^di) & ((!ci)&di)) | (!(ci^di) & r_(i-1))
Prove: When the ci and di are different, the left expression might be true (depends on ((!ci)&di)), otherwise the right expression might be true (depends on r_(i-1) which is the comparison of next bit).
The expression ((!ci)&di) is actually equivalent to the bit comparison ci < di. Hence, this recursive relation return true that it compares the bit by bit from left to right until we can decide c is smaller than d.
Hence there is an purely bit operation expression corresponding to the comparison operator, and so it is possible to find a point inside a rectangle with pure bitwise operation.
Edit: There is actually no need for condition statement, just expands the r_(n+1), then done.
x,y is in the rectangle {x0<x<x1 and y0<y<y1} if {x0<x and x<x1 and y0<y and y<y1}
If we can simulate < with bit operators, then we're good to go.
What does it mean to say something is < in binary? Consider
a: 0 0 0 0 1 1 0 1
b: 0 0 0 0 1 0 1 1
In the above, a>b, because it contains the first 1 whose counterpart in b is 0. We are those seeking the leftmost bit such that myBit!=otherBit. (== or equiv is a bitwise operator which can be represented with and/or/not)
However we need some way through to propagate information in one bit to many bits. So we ask ourselves this: can we "code" a function using only "bit" operators, which is equivalent to if(q,k,a,b) = if q[k] then a else b. The answer is yes:
We create a bit-word consisting of replicating q[k] onto every bit. There are two ways I can think of to do this:
1) Left-shift by k, then right-shift by wordsize (efficient, but only works if you have shift operators which duplicate the last bit)
2) Inefficient but theoretically correct way:
We left-shift q by k bits
We take this result and and it with 10000...0
We right-shift this by 1 bit, and or it with the non-right-shifted version. This copies the bit in the first place to the second place. We repeat this process until the entire word is the same as the first bit (e.g. 64 times)
Calling this result mask, our function is (mask and a) or (!mask and b): the result will be a if the kth bit of q is true, other the result will be b
Taking the bit-vector c=a!=b and a==1111..1 and b==0000..0, we use our if function to successively test whether the first bit is 1, then the second bit is 1, etc:
a<b :=
if(c,0,
if(a,0, B_LESSTHAN_A, A_LESSTHAN_B),
if(c,1,
if(a,1, B_LESSTHAN_A, A_LESSTHAN_B),
if(c,2,
if(a,2, B_LESSTHAN_A, A_LESSTHAN_B),
if(c,3,
if(a,3, B_LESSTHAN_A, A_LESSTHAN_B),
if(...
if(c,64,
if(a,64, B_LESSTHAN_A, A_LESSTHAN_B),
A_EQUAL_B)
)
...)
)
)
)
)
This takes wordsize steps. It can however be written in 3 lines by using a recursively-defined function, or a fixed-point combinator if recursion is not allowed.
Then we just turn that into an even larger function: xMin<x and x<xMax and yMin<y and y<yMax

String to Number and back algorithm

This is a hard one (for me) I hope people can help me. I have some text and I need to transfer it to a number, but it has to be unique just as the text is unique.
For example:
The word 'kitty' could produce 12432, but only the word kitty produces that number. The text could be anything and a proper number should be given.
One problem the result integer must me a 32-bit unsigned integer, that means the largest possible number is 2147483647. I don't mind if there is a text length restriction, but I hope it can be as large as possible.
My attempts. You have the letters A-Z and 0-9 so one character can have a number between 1-36. But if A = 1 and B = 2 and the text is A(1)B(2) and you add it you will get the result of 3, the problem is the text BA produces the same result, so this algoritm won't work.
Any ideas to point me in the right direction or is it impossible to do?
Your idea is generally sane, only needs to be developed a little.
Let f(c) be a function converting character c to a unique number in range [0..M-1]. Then you can calculate result number for the whole string like this.
f(s[0]) + f(s[1])*M + f(s[2])*M^2 + ... + f(s[n])*M^n
You can easily prove that number will be unique for particular string (and you can get string back from the number).
Obviously, you can't use very long strings here (up to 6 characters for your case), as 36^n grows fast.
Imagine you were trying to store Strings from the character set "0-9" only in a number (the equivalent of obtaining a number of a string of digits). What would you do?
Char 9 8 7 6 5 4 3 2 1 0
Str 0 5 2 1 2 5 4 1 2 6
Num = 6 * 10^0 + 2 * 10^1 + 1 * 10^2...
Apply the same thing to your characters.
Char 5 4 3 2 1 0
Str A B C D E F
L = 36
C(I): transforms character to number: C(0)=0, C(A)=10, C(B)=11, ...
Num = C(F) * L ^ 0 + C(E) * L ^ 1 + ...
Build a dictionary out of words mapped to unique numbers and use that, that's the best you can do.
I doubt there are more than 2^32 number of words in use, but this is not the problem you're facing, the problem is that you need to map numbers back to words.
If you were only mapping words over to numbers, some hash algorithm might work, although you'd have to work a bit to guarantee that you have one that won't produce collisions.
However, for numbers back to words, that's quite a different problem, and the easiest solution to this is to just build a dictionary and map both ways.
In other words:
AARDUANI = 0
AARDVARK = 1
...
If you want to map numbers to base 26 characters, you can only store 6 characters (or 5 or 7 if I miscalculated), but not 12 and certainly not 20.
Unless you only count actual words, and they don't follow any good countable rules. The only way to do that is to just put all the words in a long list, and start assigning numbers from the start.
If it's correctly spelled text in some language, you can have a number for each word. However you'd need to consider all possible plurals, place and people names etc. which is generally impossible. What sort of text are we talking about? There's usually going to be some existing words that can't be coded in 32 bits in any way without prior knowledge of them.
Can you build a list of words as you go along? Just give the first word you see the number 1, second number 2 and check if a word has a number already or it needs a new one. Then save your newly created dictionary somewhere. This would likely be the only workable solution if you require 100% reliable, reversible mapping from the numbers back to original words given new unknown text that doesn't follow any known pattern.
With 64 bits and a sufficiently good hash like MD5 it's extremely unlikely to have collisions, but for 32 bits it doesn't seem likely that a safe hash would exist.
Just treat each character as a digit in base 36, and calculate the decimal equivalent?
So:
'A' = 0
'B' = 1
[...]
'Z' = 25
'0' = 26
[...]
'9' = 35
'AA' = 36
'AB' = 37
[...]
'CAB' = 46657

Resources