Algorithm for assigning numerics to a string? - algorithm

I'd like to design a dictionary which stores a value for a string, such that when two strings are compared, the two corresponding values can be used to determine which string comes first in a dictionary.
For example, the value for "a" should be less than the value for "ab". The value for "z" should be greater than the value for "az". And so on.
I tried googling for this but I wasn't able to find it :( Is there a good way to implement this? I see it is very similar to a decimal system but in base 26. (For example aaa would be like 111, and aaz would be like 11(26).) But that wouldn't work for the case that "z" > "az", since that would be saying (26) > 1(26).
One solution I came up with was to take the length of the largest word (let's say m), and then assign a value by doing 26^m + 26^(m-1) and so on for each letter. However this requires knowing the length of the largest word. Are there any such algorithms that do not require this?

This is not possible with only natural numbers/integers because between any two strings there are an infinite number of others (ex. "asdf" < "asdfa" < "asdfaa" < ... < "asdg"), but between any two integers there is only a finite number of integers.
However, as suggested in the comments, if you can use real numbers, you can map a string to char1 + char2/27 + char3/27^2+.... However, for long strings, this will hit the max floating point precision and stop working correctly.

Related

Convert string to perfect number

Given a string, we need to find the largest square which can be obtained by replace its characters by digits (leading zeros are not allowed) where same characters always map to the same digits and different characters always map to different digits. If no solution, return -1.
Consider the string "ab" If we replace character a with 8 and b with 1, we get 81, which is a square.
How to find it for given string ? It is given that string length can be at max 11.
Please help me find a suitable and efficient way
Sorry can't comment, not enough reputation for it so I'll answer here.
#mat7 about what you said in your question comments, no you don't have to do it for every letter from a to z. You only have to do it for the letters present in your string (so at max 12 letters, not 26).
The first thing I would even check is how much different letter you have, if it's 11 or 12 different letters you can directly return -1 since you can't have different letters having the same number.
Now, supposing the input string being "fdsadrtas", you take a new array with only each different letter => "fdsadrt"
And with this array you try all possibilities (exclude the obvious mismatching options, if you set 'f' to 4 and 'd' to 5, 's' can only be 12367890 (and f can never be 0)), this way you will exclude lots of possibilities, having as worst case 10! instead of 12^10. (actually 9*9! with the test of the first one never beeing 0 but it's close enough)
EDIT 2 : +1 samgak nice idea !
The last digit can only be 0,1,4,5,6,9 so the worst number of tests drop even to 9*6*8!
10! is by far small enough to be brute tested, keep the higher square value you found and you are done.
EDIT :
Actually It would work (in a finite reasonable amount of time) but it is the wrong approach now that I have thought about it.
You will use less time in looking all the squares numbers that could be a solution for your string, using the exemple I gave above it's a string of length 9, and checking each square who is length 9 if he could be successfully mapped into the string.
For a string of length 12 (the worst case) you will have to check the square values of 316'228 to 999'999, who is way less than the >2 millions check of the previous proposition. The other proposition might become faster if you start accepting long strings but with only 12 you are faster this way.

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character and a dictionary of valid words ( say normal English dictionary ). you have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?
First iterate over your letters and count how many times do you have each of the characters in the English alphabet. Store this in a static, say a char array of size 26 where first cell corresponds to a second to b and so on. Name this original array cnt. Now iterate over all words and for each word form a similar array of size 26. For each of the cells in this array check if you have at least as many occurrences in cnt. If that is the case, you can write the word otherwise you can't. If you can write the word you compute its score and maximize the score in a helper variable.
This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!
First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product
scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}
while True:
query = input("What do you have?")
if not query: break
# make it look like our preprocessed word list
query = anagram(sanitize(query))
results = {}
# all variants for our query
for mask in product((True, False), repeat=len(query)):
# get the variant given the mask
masked = "".join(c for i, c in enumerate(query) if mask[i])
# check if it's valid
if masked in anagrams:
# score it, also getting the word back would be nice
results[sum(scores[c] for c in masked)] = anagrams[masked]
print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat you represent it as aet. It the word is tea, you represent it as aet, bubble is represent as bbbelu etc
Since this is scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to maximum check 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if the are needed twice. You'll still have to check if a word can be formed in case you have a word like "bubble" and the first test only tells you that you have letters b,u,l,e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time
def getValue(word):
return sum(map( lambda x: key[x], word))
if __name__ == '__main__':
v = range(26)
shuffle(v)
key = dict(zip(list(ascii_lowercase), v))
with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
wordDict = f.read().splitlines()
f.close()
valued = map(lambda x: (getValue(x), x), wordDict)
print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed and the score are fixed and that only the letters available will change (as in scrabble) ? Otherwise, I think there is no better than looking up each word of the dictionnary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > .. > A >E >I .. > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the children nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, .., the last child one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node's value is larger than you current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that start with a one point letter as A is discarded, because it will score at most 8x1 which is less than the value of Z). I bet that you will explore only a very few branches each time.

Using a set of integers to generate unique key

Now I have some sets of integers, say:
set1 = {int1, int2, int3};
set2 = {int2, int3, int1};
set3 = {int1, int4, int2};
The order or the numbers is not taken into consideration, so set1 and set2 are the same, while set3 are not with the other two.
Now I want to generate a unique key for these sets to distinguish them, in that way, set1 and set2 should generate the same key.
I think this for a while, thoughts as sum up the integers came to my mind but can be easily proved wrong. Sort the set and do
key = n1 + n2*2^16 + n3*2^32
may be a possible way but I wonder if this can be solved more elegantly.
The key can be either integer or string.
So any one has some idea about solving this as fast as possible? Or any reading material is welcome.
More info:
The numbers are in fact colors so each integer is less than 0xffffff
If these were small integers (all within the range(0,63) for example) then you could represent each set as a bitstring (1 for any integer that's present in the set; 0 for any that's absent). For sparse sets of large integers this would be horrendously expensive in terms of storage/memory).
One other method that comes to mind would be to sort the set and form the key as the concatenation of each number's digital representation (separated by some delimiter). So the set {2,1,3} -> "1/2/3" (using "/" as the delimiter) and {30,1,2,4} => "1/2/4/30"
I suppose you could also use a hybrid approach. All elements < 63 are encoded into a hex string and all others are encoded into a string as described. Then your final resulting key is formed by: HEXxA/B/c ... (with the "x" separating the small int hex string from the larger ints in the set).
If numbers of your set is not so large, I think hashing each set into one string can be one of proper solution.
Then they are lager ones, you can make it small ones by mod function or whatever. And by this, they can be dealed with in the same way.
Hope this will help your solution if there is no better idea.
I think a key of practical size can only be a hash value - there will always be a few pairs of inputs that hash to the same key, but you can make this unlikely.
I think the idea of sorting and then applying a standard hash function is good, but I don't like your hash multipliers. If arithmetic is mod 2^32, then multiplying by 2^32 is multiplying by zero. If it is mod 2^64, then multiplying by 2^32 will lose the top 32 bits of the input.
I would use a hash function like that described in Why chose 31 to do the multiplication in the hashcode() implementation ?, where you keep a running total, multiplying the hash value by some odd number before you add then next item into it. Multiplying by an odd number mod 2^n will at least not lose information immediately. I would suggest 131, but Java has a tradition of using 31.

Is there an algorithm for choosing a few strings from a list so that the number of strings equal the number of different letters in them?

Edit2: I think the solution of David Eisenstat works but I will check it before I call the question solved.
Example list of strings:
1.) "a"
2.) "ab"
3.) "bc"
4.) "dc"
5.) "efa"
6.) "ef"
7.) "gh"
8.) "hi"
You can choose number 1.) there's 1 string and 1 letter in it: "a"
You can also choose 1.) and 2.) these are 2 strings with only two different letters in them "a" and "b"
other valid string combinations:
1.) 2.) 3.)
1.) 5.) 6.)
there's no valid combination with "h" (it would be ideal if cases like this could be proven however you can assume the program only needs to work when there's a valid answer)
There could be an extra condition that the strings you choose must include one specified letter, however simply finding all the possible combinations would solve the problem just as well. eg. specified letter "c" the only solution in this case would be: 1.) 2.) 3.)
[optional information] The purpose of this: I want to make a program which can choose from a big list of equations (probably around 100) which ones can be used to solve for a variable. Each equation gets one string, each letter in the string representing one unknown. The list of equations are all different eg. cannot be derived from each other, so you need as many equations as many unknowns there are in them. Solving for the unknowns will be done in a CAS, so you don't need to worry about it. However I believe the CAS (Maxima) might have a limit on how many equations it can solve simultaneously and it might be too slow if you give it too many unnecessary equations at a time.
As a start I would use an algorithm to reduce the number of strings just to make it faster. First all strings containing specified letter are in the reduced list, then all strings containing the letters from the strings in the reduced list are part of the reduced list until none is added. eg reduced list of "g" would be 7.) "gh" and 8.) "hi" This would only remove some unnecessary strings, but the task would remain the same with the rest.
I think this can be solved by taking away unnecessary strings from the reduced list until all the remaining are needed, however I don't know how to explicitly define which strings would be unnecessary (except for those mentioned in the previous paragraph).
If you work with the extra condition: This is an optimization task. I don't need a perfect solution, only an optimal solution. The program doesn't need to find the absolute minimum number of strings that give a solution. Having a few extra strings in the solution would probably only slow the computer down, but it would be acceptable.
Edit: Optional clarification about the meaning of the strings: Each letter in a string represent an unknown in an equation so the equation a=2 would be represented by "a" because that's the only unknown. The equation a+b=0 would be represented by "ab" and b^2-c=0 by "bc"
I'm not sure what to call this problem. It seems NP-hard, so I'm going to suggest an integer programming formulation, which can be attacked by an off-the-shelf solver.
Let x_i be a 0-1 variable indicating whether equation i is included in the output. Let y_j be a 0-1 variable indicating whether variable j is included in the output. We have constraints
for all equations i, for all variables j in equation i, y_j - x_i >= 0.
We need as many equations as variables in the output.
(sum over all equations i of x_i) - (sum over all variables j of y_j) = 0
As you point out, the empty set needs specifically to be disallowed. Let k be a variable that must appear in the output.
sum over all equations i containing variable k of x_i >= 1
Naturally, the objective is
minimize sum over all equations i of x_i.

Is it possible to create an algorithm which generates an autogram?

An autogram is a sentence which describes the characters it contains, usually enumerating each letter of the alphabet, but possibly also the punctuation it contains. Here is the example given in the wiki page.
This sentence employs two a’s, two c’s, two d’s, twenty-eight e’s, five f’s, three g’s, eight h’s, eleven i’s, three l’s, two m’s, thirteen n’s, nine o’s, two p’s, five r’s, twenty-five s’s, twenty-three t’s, six v’s, ten w’s, two x’s, five y’s, and one z.
Coming up with one is hard, because you don't know how many letters it contains until you finish the sentence. Which is what prompts me to ask: is it possible to write an algorithm which could create an autogram? For example, a given parameter would be the start of the sentence as an input e.g. "This sentence employs", and assuming that it uses the same format as the above "x a's, ... y z's".
I'm not asking for you to actually write an algorithm, although by all means I'd love to see if you know one to exist or want to try and write one; rather I'm curious as to whether the problem is computable in the first place.
You are asking two different questions.
"is it possible to write an algorithm which could create an autogram?"
There are algorithms to find autograms. As far as I know, they use randomization, which means that such an algorithm might find a solution for a given start text, but if it doesn't find one, then this doesn't mean that there isn't one. This takes us to the second question.
"I'm curious as to whether the problem is computable in the first place."
Computable would mean that there is an algorithm which for a given start text either outputs a solution, or states that there isn't one. The above-mentioned algorithms can't do that, and an exhaustive search is not workable. Therefore I'd say that this problem is not computable. However, this is rather of academic interest. In practice, the randomized algorithms work well enough.
Let's assume for the moment that all counts are less than or equal to some maximum M, with M < 100. As mentioned in the OP's link, this means that we only need to decide counts for the 16 letters that appear in these number words, as counts for the other 10 letters are already determined by the specified prefix text and can't change.
One property that I think is worth exploiting is the fact that, if we take some (possibly incorrect) solution and rearrange the number-words in it, then the total letter counts don't change. IOW, if we ignore the letters spent "naming themselves" (e.g. the c in two c's) then the total letter counts only depend on the multiset of number-words that are actually present in the sentence. What that means is that instead of having to consider all possible ways of assigning one of M number-words to each of the 16 letters, we can enumerate just the (much smaller) set of all multisets of number-words of size 16 or less, having elements taken from the ground set of number-words of size M, and for each multiset, look to see whether we can fit the 16 letters to its elements in a way that uses each multiset element exactly once.
Note that a multiset of numbers can be uniquely represented as a nondecreasing list of numbers, and this makes them easy to enumerate.
What does it mean for a letter to "fit" a multiset? Suppose we have a multiset W of number-words; this determines total letter counts for each of the 16 letters (for each letter, just sum the counts of that letter across all the number-words in W; also add a count of 1 for the letter "S" for each number-word besides "one", to account for the pluralisation). Call these letter counts f["A"] for the frequency of "A", etc. Pretend we have a function etoi() that operates like C's atoi(), but returns the numeric value of a number-word. (This is just conceptual; of course in practice we would always generate the number-word from the integer value (which we would keep around), and never the other way around.) Then a letter x fits a particular number-word w in W if and only if f[x] + 1 = etoi(w), since writing the letter x itself into the sentence will increase its frequency by 1, thereby making the two sides of the equation equal.
This does not yet address the fact that if more than one letter fits a number-word, only one of them can be assigned it. But it turns out that it is easy to determine whether a given multiset W of number-words, represented as a nondecreasing list of integers, simultaneously fits any set of letters:
Calculate the total letter frequencies f[] that W implies.
Sort these frequencies.
Skip past any zero-frequency letters. Suppose there were k of these.
For each remaining letter, check whether its frequency is equal to one less than the numeric value of the number-word in the corresponding position. I.e. check that f[k] + 1 == etoi(W[0]), f[k+1] + 1 == etoi(W[1]), etc.
If and only if all these frequencies agree, we have a winner!
The above approach is naive in that it assumes that we choose words to put in the multiset from a size M ground set. For M > 20 there is a lot of structure in this set that can be exploited, at the cost of slightly complicating the algorithm. In particular, instead of enumerating straight multisets of this ground set of all allowed numbers, it would be much better to enumerate multisets of {"one", "two", ..., "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}, and then allow the "fit detection" step to combine the number-words for multiples of 10 with the single-digit number-words.

Resources