Minimum number of char substitutions to get a palindrome - algorithm

I would like to solve this problem from TopCoder, in which a String is given and in each step you have to replace all occurrences of an character (of your choice) with another character (of your choice), so that at the end after all steps you get a palindrome. The problem is to identify the minimum total number of replacements.
Ideas so far:
I can identify that the string after every step is simply a node/vertex in a graph and that the cost of every edge is the number of replacements made in the step, but I don't see how to use greedy for that (it is definitely not the Minimum Spanning Tree problem). I don't think it makes sense to identify all possible nodes & edge costs and to convert the problem in the Shortest Path problem. On the other side, I think in every step it makes sense to replace the character X with the biggest number of conflicts, with the character Y in conflict with X that occurs most in the string.
Anyway, I can't either prove that it works. Also I can't identify any known problems in this. Any ideas?

You need to identify disjunct sets of characters. A disjunct set of characters is a set of characters that will all have to become the same character in order for the string to become a palindrome.
Example:
Let's say we have the string abcdefgfmdebac
It has 3 disjunct sets, abc, de and fgm
Algorithm:
Pick the first character and check all occurences of it picking up other characters in the set.
In the example string we start with a and pick up band c (because they sit on the opposite sides of the two ain our string). We repeat the process for band c, but no new characters are added to the set. So abc is our first disjunct set.
Continue doing this with the remaining characters.
A disjunct set of n characters (counting all characters) needs n-m replacements, where m is the number of occurences of the most frequent character.
So simply sum over the sets.
In our example it takes 4 + 2 + 2 = 8 replacements.

Related

Check if string includes part of Fibonacci Sequence

Which way should I follow to create an algorithm to find out whether fibonacci sequence exists in a given string ?
The string includes only digits with no whitespaces and there may be more than one sequence, I need to find all of them.
If as your comment says the first number must have less than 6 digits, you can simply search for all positions there one of the 25 fibonacci numbers (there are only 25 with less than 6 digits) and than try to expand this 1 number sequence in both directions.
After your update:
You can even speed things up when you are only looking for sequences of at least 3 numbers.
Prebuild all 25 3-number-Strings that start with one of the 25 first fibonnaci-numbers this should give much less matches than the search for the single fibonacci-numbers I suggested above.
Than search for them (like described above and try to expand the found 3-number-sequences).
here's how I would approach this.
The main algorithm could search for triplets then try to extend them to as long a sequence as possible.
This leaves us with the subproblem of finding triplets. So if you are scanning through a string to look for fibonacci numbers, one thing you can take advantage of is that the next number must have the same number of digits or one more digit.
e.g. if you have the string "987159725844" and are considering "[987]159725844" then the next thing you need to look at is "987[159]725844" and "987[1597]25844". Then the next part you would find is "[2584]4" or "[25844]".
Once you have the 3 numbers you can check if they form an arithmetic progression with C - B == B - A. If they do you can now check if they are from the fibonacci sequence by seeing if the ratio is roughly 1.6 and then running the fibonacci iteration backwards down to the initial conditions 1,1.
The overall algorithm would then work by scanning through looking for all triples starting with width 1, then width 2, width 3 up to 6.
I'd say you should first find all interesting Fibonacci items (which, having 6 or less digits, are no more than 30) and store them into an array.
Then, loop every position in your input string, and try to find upon there the longest possible Fibonacci number (that is, you must browse the array backwards).
If some Fib number is found, then you must bifurcate to a secondary algorithm, consisting of merely going through the array from current position to the end, trying to match every item in the following substring. When the matching ends, you must get back to the main algorithm to keep searching in the input string from the current position.
None of these two algorithms is recursive, nor too expensive.
update
Ok. If no tables are allowed, you could still use this approach replacing in the first loop the way to get the bext Fibo number: Instead of indexing, apply your formula.

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character and a dictionary of valid words ( say normal English dictionary ). you have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?
First iterate over your letters and count how many times do you have each of the characters in the English alphabet. Store this in a static, say a char array of size 26 where first cell corresponds to a second to b and so on. Name this original array cnt. Now iterate over all words and for each word form a similar array of size 26. For each of the cells in this array check if you have at least as many occurrences in cnt. If that is the case, you can write the word otherwise you can't. If you can write the word you compute its score and maximize the score in a helper variable.
This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!
First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product
scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}
while True:
query = input("What do you have?")
if not query: break
# make it look like our preprocessed word list
query = anagram(sanitize(query))
results = {}
# all variants for our query
for mask in product((True, False), repeat=len(query)):
# get the variant given the mask
masked = "".join(c for i, c in enumerate(query) if mask[i])
# check if it's valid
if masked in anagrams:
# score it, also getting the word back would be nice
results[sum(scores[c] for c in masked)] = anagrams[masked]
print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat you represent it as aet. It the word is tea, you represent it as aet, bubble is represent as bbbelu etc
Since this is scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to maximum check 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if the are needed twice. You'll still have to check if a word can be formed in case you have a word like "bubble" and the first test only tells you that you have letters b,u,l,e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time
def getValue(word):
return sum(map( lambda x: key[x], word))
if __name__ == '__main__':
v = range(26)
shuffle(v)
key = dict(zip(list(ascii_lowercase), v))
with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
wordDict = f.read().splitlines()
f.close()
valued = map(lambda x: (getValue(x), x), wordDict)
print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed and the score are fixed and that only the letters available will change (as in scrabble) ? Otherwise, I think there is no better than looking up each word of the dictionnary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > .. > A >E >I .. > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the children nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, .., the last child one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node's value is larger than you current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that start with a one point letter as A is discarded, because it will score at most 8x1 which is less than the value of Z). I bet that you will explore only a very few branches each time.

Is it possible to create an algorithm which generates an autogram?

An autogram is a sentence which describes the characters it contains, usually enumerating each letter of the alphabet, but possibly also the punctuation it contains. Here is the example given in the wiki page.
This sentence employs two a’s, two c’s, two d’s, twenty-eight e’s, five f’s, three g’s, eight h’s, eleven i’s, three l’s, two m’s, thirteen n’s, nine o’s, two p’s, five r’s, twenty-five s’s, twenty-three t’s, six v’s, ten w’s, two x’s, five y’s, and one z.
Coming up with one is hard, because you don't know how many letters it contains until you finish the sentence. Which is what prompts me to ask: is it possible to write an algorithm which could create an autogram? For example, a given parameter would be the start of the sentence as an input e.g. "This sentence employs", and assuming that it uses the same format as the above "x a's, ... y z's".
I'm not asking for you to actually write an algorithm, although by all means I'd love to see if you know one to exist or want to try and write one; rather I'm curious as to whether the problem is computable in the first place.
You are asking two different questions.
"is it possible to write an algorithm which could create an autogram?"
There are algorithms to find autograms. As far as I know, they use randomization, which means that such an algorithm might find a solution for a given start text, but if it doesn't find one, then this doesn't mean that there isn't one. This takes us to the second question.
"I'm curious as to whether the problem is computable in the first place."
Computable would mean that there is an algorithm which for a given start text either outputs a solution, or states that there isn't one. The above-mentioned algorithms can't do that, and an exhaustive search is not workable. Therefore I'd say that this problem is not computable. However, this is rather of academic interest. In practice, the randomized algorithms work well enough.
Let's assume for the moment that all counts are less than or equal to some maximum M, with M < 100. As mentioned in the OP's link, this means that we only need to decide counts for the 16 letters that appear in these number words, as counts for the other 10 letters are already determined by the specified prefix text and can't change.
One property that I think is worth exploiting is the fact that, if we take some (possibly incorrect) solution and rearrange the number-words in it, then the total letter counts don't change. IOW, if we ignore the letters spent "naming themselves" (e.g. the c in two c's) then the total letter counts only depend on the multiset of number-words that are actually present in the sentence. What that means is that instead of having to consider all possible ways of assigning one of M number-words to each of the 16 letters, we can enumerate just the (much smaller) set of all multisets of number-words of size 16 or less, having elements taken from the ground set of number-words of size M, and for each multiset, look to see whether we can fit the 16 letters to its elements in a way that uses each multiset element exactly once.
Note that a multiset of numbers can be uniquely represented as a nondecreasing list of numbers, and this makes them easy to enumerate.
What does it mean for a letter to "fit" a multiset? Suppose we have a multiset W of number-words; this determines total letter counts for each of the 16 letters (for each letter, just sum the counts of that letter across all the number-words in W; also add a count of 1 for the letter "S" for each number-word besides "one", to account for the pluralisation). Call these letter counts f["A"] for the frequency of "A", etc. Pretend we have a function etoi() that operates like C's atoi(), but returns the numeric value of a number-word. (This is just conceptual; of course in practice we would always generate the number-word from the integer value (which we would keep around), and never the other way around.) Then a letter x fits a particular number-word w in W if and only if f[x] + 1 = etoi(w), since writing the letter x itself into the sentence will increase its frequency by 1, thereby making the two sides of the equation equal.
This does not yet address the fact that if more than one letter fits a number-word, only one of them can be assigned it. But it turns out that it is easy to determine whether a given multiset W of number-words, represented as a nondecreasing list of integers, simultaneously fits any set of letters:
Calculate the total letter frequencies f[] that W implies.
Sort these frequencies.
Skip past any zero-frequency letters. Suppose there were k of these.
For each remaining letter, check whether its frequency is equal to one less than the numeric value of the number-word in the corresponding position. I.e. check that f[k] + 1 == etoi(W[0]), f[k+1] + 1 == etoi(W[1]), etc.
If and only if all these frequencies agree, we have a winner!
The above approach is naive in that it assumes that we choose words to put in the multiset from a size M ground set. For M > 20 there is a lot of structure in this set that can be exploited, at the cost of slightly complicating the algorithm. In particular, instead of enumerating straight multisets of this ground set of all allowed numbers, it would be much better to enumerate multisets of {"one", "two", ..., "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}, and then allow the "fit detection" step to combine the number-words for multiples of 10 with the single-digit number-words.

minimal cyclic sub string in a bigger cyclic string

I am trying to find an algorithm that culd return the length of the shortest cyclic sub string in a larger cyclic string.
A cyclic string would be defined as a concatenation of tow or more identicle strings, e.g. "abababab", or "aaaa"...
Now in a given for example a string T = "abbcabbcabbcabbc" there is a cycle of the pattern "abbc" but the shortest cyclic sub string would be "bb".
If you're just looking for a substring that appears more than once:
Build a Suffix tree from the string.
While creating the suffix tree, you can count re-occurrences of every substring and save it on the number of occurrences on the node.
Then just do a BFS search on the tree (which will give you a layered search, from shorter to longer strings) and find the first substring which is longer than 1 that occurred more than once.
Total complexity: O(n) where n is the length of the string
Edit:
The paths from the root to the leaves
have a one-to-one relationship with
the suffixes of S
You can implement the tree that each node contains one letter, that will give you better granularity and allow you to see all the substrings by length.
Here's a suffix tree of banana where every node contains one letter, you can see that you have all the substrings there.
If you'll look at the applications section of the suffix tree, you'll see that it is used for exactly this kind of tasks - finding stuff about substrings.
Look at the image from the root, you can see ALL the substrings start from the root (BFS list):
b
a
n
ba
an
na
ban
ana
nan
bana
anan
nana
banan
anana
banana
Let me call "abbc" the generator in your example - i.e. the string that you repeat in order to get the bigger string.
The very first observation is that the smaller string should be made by repeating some substring twice.
It's clear that the smallest string should be smaller than the generator repeated twice (2*generator), because 2*generator is cyclic.
Now note that you only need to consider the string obtained by taking the generator 3 times, when searching for smaller cyclic string. Indeed, if the smallest is not there, but it is in the 4*generator, then it must span at least two generators, but then it wouldn't be the smallest.
So now lets assume the bigger string is 3*generator (or 2*generator).
Also it's clear that if the generator has only different digits, then the answer is 2*generator. If not then you just need to find all pairs of identical characters in the bigger string say at position i and j and check whether the string starting a i, which is 2*(j-i) long is cyclic. If you try them in order of increasing j-i, then you can stop after the first success.

Check if given string can be created by a set of characters cut out from magazine article

"Observe that when you cut a character out of a magazine, the character on the reverse side of the page is also removed. Give an algorithm to determine whether you can generate a given string by pasting cutouts from a given magazine. Assume that you are given a function that will identify the character and its position on the reverse side of the page for any given character position."
How can I do it?
I can do some initial pruning so that if a needed character has only one way of getting picked up, its taken initially before turning the sub-problem for dynamic technique, but what after this initial pruning?
What is the time and space complexity?
As #LiKao suggested, this can be solved using max flow. To construct the network we make two "layers" of vertices: one with all the distinct characters in the input string and one with each position on the page. Make an edge with capacity 1 from a character to a position if that position has that character on one side. Make edges of capacity 1 from each position to the sink, and make edges from the source to each character with capacity equal to the multiplicity of that character in the input string.
For example, let's say we're searching for the word "FOO" on a page with four positions:
pos 1 2 3 4
front F C O Z
back O O K Z
We then generate the following network, ignoring position 4 since it does not provide any of the required characters.
Now, we only need to determine if there is a flow from the source to the sink of length("FOO") = 3 or more.
You can use dynamic programming directly.
We are given string s with n letters. We are given a set of pieces P = {p_1, ..., p_k}. Each piece has one letter in the front p_i.f and one in the back p_i.b.
Denote with f(j, p) the function that returns true if it is feasible to create substring s_1...s_j using pieces in p \subseteq P, and false otherwise.
The following recurrence holds:
f(n, P) = f(n-1, P-p_1) | f(n-1, P-p_2) | ... | f(n-1, P-p_k)
In plain English the feasibility of s using all pieces in P, depends on the feasibility of the substring s_1...s_n-1 given one less piece, and we try removing all possible pieces (of course in practice we do not have to remove all pieces one by one; we only need to remove those pieces for which p_i.f == s_n || p_i.b == s_n).
The initial condition is that f(1, P-p_1) = f(1, P-p2) = ... = true, assuming that we have already checked a-priori (in linear time) that there are enough letters in P to cover all the letters in s.
While this problem can be formulated as a Maxflow problem as shown in the accepted answer, it is simpler and more efficient to formulate it as a maximum cardinality matching problem in a bipartite graph. Maxflow algorithms like Dinic's are slower than the special case algorithms like Hopcroft–Karp algorithm.
The bipartite graph is formed by adding two edges from every character of the given string to a cutout, one edge for each side. We then run Hopcroft–Karp. In the end, we simply check whether the cardinality of the matching is equal to the length of the string.
For a working implementation (in Scala) using JGraphT, see my GitHub.
I'd like to come up with a more efficient DP solution, since Skiena's book has this problem in the DP section, but so far haven't found any.

Resources