Utilizing a trie Data Structure - data-structures

So I am implementing a trie for reading unique words from a file. I was looking online for how to implement it and came across this way of doing it:
//to insert the string in the trie tree
void insert(struct node *head, string str)
{
    int i, j;
    for (i = 0; i < str.size(); ++i) {
        //if the child node is pointing to NULL
        if (head->next_char[str[i] - 'a'] == NULL) {
            struct node *n;
            //initialise the new node
            n = new struct node;
            for (j = 0; j < 26; ++j) {
                n->next_char[j] = NULL;
            }
            n->end_string = 0;
            head->next_char[str[i] - 'a'] = n;
            head = n;
        }
        //if the child node is not NULL
        else head = head->next_char[str[i] - 'a'];
    }
    //to mark the end_string flag for this string
    head->end_string = 1;
}
My confusion arises from the line:
head -> next_char[str[i] - 'a'] == NULL
What is the purpose of subtracting 'a' in all the places this code does it?

A trie makes sense when your input strings consist of characters from some relatively small fixed alphabet.
In this concrete implementation it is assumed that the characters are in the range 'a'..'z', 26 in total.
As in many languages the char type is actually an integer (or byte) type, you can perform arithmetic operations with it; when you do, the character's code is used as the operand.
With the above in mind, it is clear that the easiest way to map chars from some known non-zero-based range to a zero-based range is to subtract the code of the range's first element from the code of the particular character.
For 'a'..'z' range:
when you do ('a' - 'a') you get 0
'b' - 'a' = 1
...
'z' - 'a' = 25

I'll add a small piece of information beside #Aivean's answer, which is perfect.
In this implementation, each node in the trie contains a static array of size 26 pointing to its children.
The goal of this is to find the correct child in constant time, and hence check whether it exists.
To find the correct child (the position in the array of 26) we use current_Char - 'a', as is well explained in #Aivean's answer.

Huffman algorithm inverse matching

I was wondering, given a binary sequence, whether we can check if it matches a string using the Huffman algorithm.
For example, if we have a string "abdcc" and several binary sequences, we can calculate which one is a possible representation of "abdcc" that used Huffman's algorithm.
Interesting puzzle. As mentioned by j_random_hacker in a comment, it's possible to do this using a backtracking search. There are a few constraints to valid Huffman encodings of the string that we can use to narrow the search down:
No two Huffman codes of length n and m can be identical in the first n or m bits (whichever is shorter). This is because otherwise a Huffman decoder wouldn't be able to tell if it had encountered the longer or the shorter code when decoding. And obviously two codes of the same length cannot be identical. (1)
If at any time there are less bits remaining in the bitstream than characters remaining in the string we are matching then the string cannot match. (2)
If we reach the end of the string and there are still bits remaining in the bitstream then the string does not match (3)
If we encounter a character in the string for the second time, and we have already assumed a Huffman code for that same character earlier in the string, then an identical code must be present in the bit stream or the string cannot match. (4)
We can define a function matchHuffmanString that matches a string with Huffman encoded bitstream, with a Huffman code table as part of the global state. To begin with the code table is empty and we call matchHuffmanString, passing the start of the string and the start of the bitstream.
When the function is called, it checks if there are enough bits in the stream to match the string and returns if not. (2)
If the string is empty, then if the bitstream is also empty there is a match and the code table is output. If the string is empty but the bitstream is not, then there is no match, so the function returns. (3)
If characters remain in the string, then the first character is read. The function checks if there is already an entry in the code table for that character, and if so then the same code must be present in the bitstream. If not then there is no match so the function returns (4). If there is then the function calls itself, moving on to the next character and past the matching code in the bitstream.
If there is no matching code for the character, then the possibility that it is represented by a code of every possible length n from 1 bit to 32 bits (an arbitrary limit) is considered. n bits are read from the bitstream and checked to see if such a code would conflict with any existing codes according to rule (1). If no conflict exists then the code is added to the code table, then the function recurses, moving onto the next character and past the assumed code of length n bits. After returning then it backtracks by removing the code from the table.
Simple implementation in C:
#include <stdio.h>

// Huffman table:
// a 01
// b 0001
// c 1
// d 0010
char* string = "abdcc";

// 01 0001 0010 1 1
// reverse bit order (MSB first) and add an extra 0 for padding to stop getBits reading past the end of the array:
#define MESSAGE_LENGTH (12)
unsigned int message[] = {0b110100100010, 0};

// can handle messages of >32 bits, even though the above message is only 12 bits long
unsigned int getBits(int start, int n)
{
    return ((message[start>>5] >> (start&31)) | (message[(start>>5)+1] << (32-(start&31)))) & ((1<<n)-1);
}

unsigned int codes[26];
int code_lengths[26];
int callCount = 0;

void outputCodes()
{
    // output the codes:
    int i, j;
    for(i = 0; i < 26; i++)
    {
        if(code_lengths[i] != 0)
        {
            printf("%c ", i + 'a');
            for(j = 0; j < code_lengths[i]; j++)
                printf("%s", codes[i] & (1 << j) ? "1" : "0");
            printf("\n");
        }
    }
}

void matchHuffmanString(char* s, int len, int startbit)
{
    callCount++;
    if(len > MESSAGE_LENGTH - startbit)
        return; // not enough bits left to encode the rest of the message even at 1 bit per char (2)
    if(len == 0) // no more characters to match
    {
        if(startbit == MESSAGE_LENGTH)
        {
            // (3) we exactly used up all the bits, this stream matches.
            printf("match!\n\n");
            outputCodes();
            printf("\nCall count: %d\n", callCount);
        }
        return;
    }
    // read a character from the string (assume 'a' to 'z'):
    int c = s[0] - 'a';
    // is there already a code for this character?
    if(code_lengths[c] != 0)
    {
        // check if the code in the bit stream matches:
        int length = code_lengths[c];
        if(startbit + length > MESSAGE_LENGTH)
            return; // ran out of bits in stream, no match
        unsigned int bits = getBits(startbit, length);
        if(bits != codes[c])
            return; // bits don't match (4)
        matchHuffmanString(s + 1, len - 1, startbit + length);
    }
    else
    {
        // this character doesn't have a code yet, consider every possible length
        int i, j;
        for(i = 1; i < 32; i++)
        {
            // are there enough bits left for a code this long?
            if(startbit + i > MESSAGE_LENGTH)
                continue;
            unsigned int bits = getBits(startbit, i);
            // does this code conflict with an existing code?
            for(j = 0; j < 26; j++)
            {
                if(code_lengths[j] != 0) // check existing codes only
                {
                    // do the two codes match in the first i or code_lengths[j] bits, whichever is shorter?
                    int length = code_lengths[j] < i ? code_lengths[j] : i;
                    if((bits & ((1 << length)-1)) == (codes[j] & ((1 << length)-1)))
                        break; // there's a conflict (1)
                }
            }
            if(j != 26)
                continue; // there was a conflict
            // add the new code to the codes array and recurse:
            codes[c] = bits; code_lengths[c] = i;
            matchHuffmanString(s + 1, len - 1, startbit + i);
            code_lengths[c] = 0; // clear the code (backtracking)
        }
    }
}

int main(void) {
    int i;
    for(i = 0; i < 26; i++)
        code_lengths[i] = 0;
    matchHuffmanString(string, 5, 0);
    return 0;
}
output:
match!
a 01
b 0001
c 1
d 0010
Call count: 42
The above code could be improved by iterating over the string as long as it encounters characters that already have codes, and only recursing when it finds one that doesn't. Also, it only works for lowercase letters a-z with no spaces and doesn't do any validation. I'd have to test it to be sure, but I think it's a tractable problem even for long strings, because any possible combinatorial explosion only happens when encountering new characters that don't already have codes in the table, and even then it's subject to constraints.

Longest common prefix for n strings

Given n strings of max length m, how can we find the longest common prefix shared by at least two strings among them?
Example: ['flower', 'flow', 'hello', 'fleet']
Answer: fl
I was thinking of building a trie for all the strings and then checking the deepest node (satisfies longest) that branches out to two/more substrings (satisfies commonality). This takes O(n*m) time and space. Is there a better way to do this?
Why use a trie (which takes O(mn) time and O(mn) space)? Just use the basic brute-force way: in a first loop, find the shortest string as minStr, which takes O(n) time; in a second loop, compare every string with minStr and keep a variable that indicates the rightmost matching index of minStr. This loop takes O(mn) time, where m is the shortest length of all strings. The code is like below:
public String longestCommonPrefix(String[] strs) {
    if (strs.length == 0) return "";
    String minStr = strs[0];
    for (int i = 1; i < strs.length; i++) {
        if (strs[i].length() < minStr.length())
            minStr = strs[i];
    }
    int end = minStr.length();
    for (int i = 0; i < strs.length; i++) {
        int j;
        for (j = 0; j < end; j++) {
            if (minStr.charAt(j) != strs[i].charAt(j))
                break;
        }
        if (j < end)
            end = j;
    }
    return minStr.substring(0, end);
}
There is an O(|S|*n) solution to this problem, using a trie. [n is the number of strings, S is the longest string]
(1) put all strings in a trie
(2) do a DFS in the trie, until you find the first vertex with more than 1 "edge".
(3) the path from the root to the node you found at (2) is the longest common prefix.
There is no faster solution [in terms of big O notation]: in the worst case all your strings are identical, and you need to read all of them to know it.
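A minimal sketch of those three steps (my code, not from the answer; I also stop when a whole string ends, since the prefix cannot extend past the shortest string, and I assume lowercase a-z input):

```cpp
#include <cassert>
#include <string>
#include <vector>

// (1) put all strings in a trie, (2) walk down from the root while exactly
// one edge leaves the current node, (3) the walked path is the common prefix.
struct TrieNode {
    TrieNode* next[26] = {nullptr};
    bool end = false;  // a whole string terminates here
};

std::string commonPrefixViaTrie(const std::vector<std::string>& strs) {
    TrieNode root;
    for (const std::string& s : strs) {  // (1) build the trie
        TrieNode* node = &root;
        for (char c : s) {
            if (!node->next[c - 'a']) node->next[c - 'a'] = new TrieNode();
            node = node->next[c - 'a'];
        }
        node->end = true;
    }
    std::string prefix;
    TrieNode* node = &root;
    for (;;) {
        int edges = 0, idx = -1;
        for (int i = 0; i < 26; ++i)
            if (node->next[i]) { ++edges; idx = i; }
        if (edges != 1 || node->end) break;  // (2) stop at the first branch or word end
        prefix += char('a' + idx);           // (3) extend the path from the root
        node = node->next[idx];
    }
    return prefix;  // note: leaks the trie nodes; fine for a sketch
}
```

For example, `commonPrefixViaTrie({"flower", "flow", "flight"})` walks f, then l, then hits a node with two edges (o and i) and returns "fl".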
I would sort them, which you can do in n lg n time. Then any strings with common prefixes will be right next to each other. In fact you should be able to keep a pointer to the index you're currently looking at and work your way down for a pretty speedy computation.
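The sorting idea can be sketched like this (my illustration, not the answerer's code). After sorting, all strings that share a given prefix are contiguous, so the longest prefix shared by at least two strings is always realized by some adjacent pair:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sort, then compare each adjacent pair and keep the longest shared prefix.
std::string longestPrefixOfAnyTwo(std::vector<std::string> strs) {
    std::sort(strs.begin(), strs.end());
    std::string best;
    for (size_t i = 1; i < strs.size(); ++i) {
        size_t j = 0;
        while (j < strs[i - 1].size() && j < strs[i].size() &&
               strs[i - 1][j] == strs[i][j])
            ++j;  // length of the prefix shared by this adjacent pair
        if (j > best.size()) best = strs[i].substr(0, j);
    }
    return best;
}
```

This is O(n lg n) for the sort plus O(n*m) for the scan.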
As a completely different answer from my other answer...
You can, with one pass, bucket every string based on its first letter.
With another pass you can sort each bucket based on its second letter. (This is known as radix sort, which is O(n*m), and O(n) with each pass.) This gives you a baseline prefix of 2.
You can safely remove from your dataset any elements that do not have a prefix of 2.
You can continue the radix sort, removing elements without a shared prefix of p, as p approaches m.
This will give you the same O(n*m) time that the trie approach does, but will always be faster than the trie since the trie must look at every character in every string (as it enters the structure), while this approach is only guaranteed to look at 2 characters per string, at which point it culls much of the dataset.
The worst case is still that every string is identical, which is why it shares the same big O notation, but it will be faster in all other cases since it is guaranteed to use fewer comparisons: in any non-worst case there are characters that never need to be visited.
public String longestCommonPrefix(String[] strs) {
    if (strs == null || strs.length == 0)
        return "";
    char[] c_list = strs[0].toCharArray();
    int len = c_list.length;
    int j = 0;
    for (int i = 1; i < strs.length; i++) {
        for (j = 0; j < len && j < strs[i].length(); j++)
            if (c_list[j] != strs[i].charAt(j))
                break;
        len = j;
    }
    return new String(c_list).substring(0, len);
}
It happens that the bucket sort (radix sort) described by corsiKa can be extended such that all strings are eventually placed alone in a bucket, and at that point the LCP for such a lonely string is known. Further, the shustring of each string is also known; it is one longer than the LCP. The bucket sort is de facto the construction of a suffix array, but only partially so. Those comparisons that are not performed (as described by corsiKa) indeed represent those portions of the suffix strings that are not added to the suffix array. Finally, this method allows for determination of not just the LCP and shustrings, but also of those subsequences that are not present within the string.
Since the world is obviously begging for an answer in Swift, here's mine ;)
func longestCommonPrefix(strings: [String]) -> String {
    var commonPrefix = ""
    var indices = strings.map { $0.startIndex }
    outerLoop:
    while true {
        var toMatch: Character = "_"
        for (whichString, f) in strings.enumerate() {
            let cursor = indices[whichString]
            if cursor == f.endIndex { break outerLoop }
            indices[whichString] = cursor.successor()
            if whichString == 0 { toMatch = f[cursor] }
            if toMatch != f[cursor] { break outerLoop }
        }
        commonPrefix.append(toMatch)
    }
    return commonPrefix
}
Swift 3 Update:
func longestCommonPrefix(strings: [String]) -> String {
    var commonPrefix = ""
    var indices = strings.map { $0.startIndex }
    outerLoop:
    while true {
        var toMatch: Character = "_"
        for (whichString, f) in strings.enumerated() {
            let cursor = indices[whichString]
            if cursor == f.endIndex { break outerLoop }
            indices[whichString] = f.characters.index(after: cursor)
            if whichString == 0 { toMatch = f[cursor] }
            if toMatch != f[cursor] { break outerLoop }
        }
        commonPrefix.append(toMatch)
    }
    return commonPrefix
}
What's interesting to note:
this runs in O(n x m), where n is the number of strings and m is the length of the shortest one.
this uses the String.Index data type and thus deals with Grapheme Clusters which the Character type represents.
And given the function I needed to write in the first place:
/// Takes an array of Strings representing file system objects' absolute
/// paths and turns it into a new array with the minimum number of common
/// ancestors, possibly pushing the root of the tree as many levels downwards
/// as necessary
///
/// In other words, we compute the longest common prefix and remove it
func reify(fullPaths: [String]) -> [String] {
    let lcp = longestCommonPrefix(fullPaths)
    return fullPaths.map {
        return $0[lcp.endIndex ..< $0.endIndex]
    }
}
Here is a minimal unit test:
func testReifySimple() {
    let samplePaths: [String] = [
        "/root/some/file"
        , "/root/some/other/file"
        , "/root/another/file"
        , "/root/direct.file"
    ]
    let expectedPaths: [String] = [
        "some/file"
        , "some/other/file"
        , "another/file"
        , "direct.file"
    ]
    let reified = PathUtilities().reify(samplePaths)
    for (index, expected) in expectedPaths.enumerate() {
        XCTAssert(expected == reified[index], "failed match, \(expected) != \(reified[index])")
    }
}
Perhaps a more intuitive solution: feed the prefix already found in an earlier iteration as one input along with the next string, i.e. [[[w1, w2], w3], w4]... and so on, where [] is the LCP of two strings.
public String findPrefixBetweenTwo(String A, String B) {
    for (int i = 0, j = 0; i < A.length() && j < B.length(); i++, j++) {
        if (A.charAt(i) != B.charAt(j)) {
            return i > 0 ? A.substring(0, i) : "";
        }
    }
    // Either one string is a prefix of the other OR they are the same.
    return (A.length() > B.length()) ? B.substring(0, B.length()) : A.substring(0, A.length());
}

public String longestCommonPrefix(ArrayList<String> A) {
    if (A.size() == 1) return A.get(0);
    String prefix = A.get(0);
    for (int i = 1; i < A.size(); i++) {
        prefix = findPrefixBetweenTwo(prefix, A.get(i)); // chain the earlier prefix
    }
    return prefix;
}

Compaction in an array storing 2 linked lists

An array Arr (size n) can represent a doubly linked list.
[ Say the cells have struct { int val, next, prev; } ]
I have two lists A and B stored in the array.
A has m nodes and B has n - m nodes.
These nodes being scattered, I want to rearrange them such that all nodes of A are from Arr[0] .. Arr[m-1] and rest are filled by nodes of B, in O(m) time.
The solution that occurs to me is to :
Iterate A until a node occurs that is placed beyond Arr[m-1];
then iterate B until a node occurs that is placed before Arr[m];
swap the two (including the manipulation of their next/prev links and those of their neighbours).
However, in this case the total number of iterations is O(n + m). Hence there should be a better answer.
P.S:
This question occurs in Introduction to Algorithms, 2nd edition.
Problem 10.3-5
How about iterating through list A and placing each element in Arr[0] ... Arr[m-1], obviously swapping its position with whatever was there before and updating the prev/next links as well. There will be a lot of swapping but nevertheless it will be O(m) since once you finish iterating through A (m iterations), all of its elements will be located (in order, incidentally) in the first m slots of Arr, and thus B must be located entirely in the rest of Arr.
To add some pseudocode:
a := index of head of A
for i in 0 ... m-1
    swap Arr[i], Arr[a]
    a := index of next element in A
end
I think jw013 is right, but the idea needs some improvement: by swapping you are changing the addresses of elements in the Arr array, so you need to be careful about that!
E.g. let's say we have Arr like:
indices: 0 1 2 3 4
| 2 | empty | 3 | empty | 1 | (assume the linked list is like 1 -> 2 -> 3)
So Arr[4].next is 0 and Arr[0].next is 2,
but when you swap Arr[4] and Arr[0], then Arr[0].next is 0,
which is not what we want to happen, so we should adjust the pointers when swapping.
The code for it is like:
public static void compactify(int List_head, int Free, node[] array) {
    int List_lenght;
    List_lenght = find_listlenght(List_head, array);
    if (List_lenght != 0) { // if the list is not empty
        int a = List_head;
        for (int i = 0; i < List_lenght; i++) {
            swap(array, a, i);
            a = array[i].next;
            print_mem(array);
        }
    }
}
now when calling swap:
private static void swap(node[] array, int a, int i) {
    // adjust the next and prev of both array[a] and array[i]
    int next_a = array[a].next;
    int next_i = array[i].next;
    int prev_a = array[a].prev;
    int prev_i = array[i].prev;
    // if array[a] has a next, adjust array[next_a].prev to i
    if (next_a != -1)
        array[next_a].prev = i;
    // if array[i] has a next, adjust array[next_i].prev to a
    if (next_i != -1)
        array[next_i].prev = a;
    // likewise adjust the pointers of array[prev_a] and array[prev_i]
    if (prev_a != -1)
        array[prev_a].next = i;
    if (prev_i != -1)
        array[prev_i].next = a;
    node temp = array[a];
    array[a] = array[i];
    array[i] = temp;
}

Selecting an optimum set according to ranked criteria

I am given a string, and a set of rules which select valid substrings by a process which isn't important here. Given an enumeration of all valid substrings, I have to find the optimum set of substrings according to a set of ranked criteria, such as:
Substrings may not overlap
All characters must be part of a substring if possible
Use as few different substrings as possible
etc.
For example, given the string abc and the substrings [a, ab, bc], the optimal set of substrings by the preceding rules is [a, bc].
Currently I'm doing this by a standard naive algorithm of enumerating all possible sets of substrings, then iterating over them to find the best candidate. The problem is that as the length of the string and the number of substrings goes up, the number of possible sets increases exponentially. With 50 substrings (well within possibility for this app), the number of sets to enumerate is 2^50, which is extremely prohibitive.
It seems like there should be a way to avoid generating many of the sets that will obviously be losers, or to algorithmically converge on the optimum set without having to blindly generate every candidate. What options are there?
Note that for this application it may be acceptable to use an algorithm that offers a statistical rather than absolute guarantee, such as an n% chance of hitting a non-optimal candidate, where n is suitably small.
Looks to me like a tree structure is needed.
Basically your initial branching is on all the substrings, then all but the one you used in the first round etc all the way to the bottom. You're right in that this branches to 2^50 but if you use ab-pruning to quickly terminate branches that are obviously inferior and then add some memoization to prune situations you've seen before you could speed up considerably.
You'll probably have to do a fair amount of AI learning to get it all but wikipedia pages on ab-pruning and transposition tables will get you a start.
edit:
Yep you're right, probably not clear enough.
Assuming your example "ABABABAB BABABABA" with substrings {"ABAB","BABA"}.
If you set your evaluation function to simply treat wasted characters as bad the tree will go something like this:
ABAB (eval=0)
ABAB (eval=0)
ABAB (eval=2 because we move past/waste a space char and a B)
[missing expansion]
BABA (eval=1 because we only waste the space)
ABAB (eval=2 now have wasted the space above and a B at this level)
BABA (eval=1 still only wasted the space)*
BABA (eval=1 prune here because we already have a result that is 1)
BABA (eval=1 prune here for same reason)
*best solution
I suspect the simple 'wasted chars' measure isn't enough in non-trivial examples, but it does prune half the tree here.
Here's a working solution in Haskell. I have called the unique substrings symbols, and one occurrence of a symbol in the string a placement. I have also interpreted criterion 3 ("Use as few different substrings as possible") as "use as few symbols as possible", as opposed to "use as few placements as possible".
This is a dynamic programming approach; the actual pruning occurs due to the memoization. Theoretically a smart Haskell implementation could memoize for you (but there are other ways, where you wrap makeFindBest); I'd suggest using a bitfield to represent the used symbols and just an integer to represent the remaining string. The optimisation is possible because of the following fact: given optimal solutions for the strings S1 and S2 that both use the same set of symbols, if S1 and S2 are concatenated then the two solutions can be concatenated in a similar manner and the new solution will be optimal. Hence for each partition of the input string, makeFindBest need only be evaluated once on the postfix for each possible set of symbols used in the prefix.
I've also integrated branch-and-bound pruning as suggested in Daniel's answer; this makes use of an evaluation function which becomes worse the more characters are skipped. The cost is monotonic in the number of characters processed, so that if we have found a set of placements that wasted only alpha characters, then we never again try to skip more than alpha characters.
Where n is the string length and m is the number of symbols, the worst case is O(m^n) naively, and m is O(2^n). Note that removing constraint 3 would make things much quicker: the memoization would only need to be parameterized by the remaining string, which is an O(n) cache, as opposed to O(n * 2^m)!
Using a string search/matching algorithm such as the Aho-Corasick string-matching algorithm improves the consume/drop 1 pattern I use here from exponential to quadratic. However, this by itself doesn't avoid the factorial growth in the combinations of the matches, which is where the dynamic programming helps.
Also note that your 4th "etc." criterion could possibly change the problem a lot if it constrains the problem in a way that makes it possible to do more aggressive pruning, or requires backtracking!
module Main where

import List
import Maybe
import System.Environment

type Symbol = String
type Placement = String

-- (remaining, placement or Nothing to skip one character)
type Move = (String, Maybe Placement)

-- (score, usedsymbols, placements)
type Solution = (Int, [Symbol], [Placement])

-- invoke like ./a.out STRING SPACE-SEPARATED-SYMBOLS ...
-- e.g. ./a.out "abcdeafghia" "a bc fg"
-- output is a list of placements
main = do
    argv <- System.Environment.getArgs
    let str = head argv
        symbols = concat (map words (tail argv))
    (putStr . show) $ findBest str symbols
    putStr "\n"

getscore :: Solution -> Int
getscore (sc, _, _) = sc

-- | consume STR SYM consumes SYM from the start of STR. returns (s, SYM)
-- where s is the rest of STR, after the consumed occurrence, or Nothing if
-- SYM isnt a prefix of STR.
consume :: String -> Symbol -> Maybe Move
consume str sym = if sym `isPrefixOf` str
                  then (Just (drop (length sym) str, (Just sym)))
                  else Nothing

-- | addToSoln SYMBOLS P SOL incrementally updates SOL with the new SCORE and
-- placement P
addToSoln :: [Symbol] -> Maybe Placement -> Solution -> Solution
addToSoln symbols Nothing (sc, used, ps) = (sc - (length symbols) - 1, used, ps)
addToSoln symbols (Just p) (sc, used, ps) =
    if p `elem` symbols
    then (sc - 1, used `union` [p], p : ps)
    else (sc, used, p : ps)

reduce :: [Symbol] -> Solution -> Solution -> [Move] -> Solution
reduce _ _ cutoff [] = cutoff
reduce symbols parent cutoff ((s, p) : moves) =
    let sol = makeFindBest symbols (addToSoln symbols p parent) cutoff s
        best = if (getscore sol) > (getscore cutoff)
               then sol
               else cutoff
    in reduce symbols parent best moves

-- | makeFindBest SYMBOLS PARENT CUTOFF STR searches for the best placements
-- that can be made on STR from SYMBOLS, that are strictly better than CUTOFF,
-- and prepends those placements to PARENT's third element.
makeFindBest :: [Symbol] -> Solution -> Solution -> String -> Solution
makeFindBest _ cutoff _ "" = cutoff
makeFindBest symbols parent cutoff str =
    -- should be memoized by (snd parent) (i.e. the used symbols) and str
    let moves = if (getscore parent) > (getscore cutoff)
                then (mapMaybe (consume str) symbols) ++ [(drop 1 str, Nothing)]
                else (mapMaybe (consume str) symbols)
    in reduce symbols parent cutoff moves

-- a solution that makes no placements
worstScore str symbols = -(length str) * (1 + (length symbols))

findBest str symbols =
    (\(_, _, ps) -> reverse ps)
        (makeFindBest symbols (0, [], []) (worstScore str symbols, [], []) str)
This smells like a dynamic programming problem. You can find a number of good sources on it, but the gist is that you generate a collection of subproblems, and then build up "larger" optimal solutions by combining optimal subsolutions.
This is an answer rewritten to use the Aho-Corasick string-matching algorithm and Dijkstra's algorithm, in C++. This should be a lot closer to your target language of C#.
The Aho-Corasick step constructs an automaton (based on a suffix tree) from the set of patterns, and then uses that automaton to find all matches in the input string. Dijkstra's algorithm then treats those matches as nodes in a DAG, and moves toward the end of the string looking for the lowest cost path.
This approach is a lot easier to analyze, as it's simply combining two well-understood algorithms.
Constructing the Aho-Corasick automaton is linear time in the length of the patterns, and then the search is linear in the input string + the cumulative length of the matches.
Dijkstra's algorithm runs in O(|E| + |V| log |V|) assuming an efficient STL. The graph is a DAG, where vertices correspond to matches or to runs of characters that are skipped. Edge weights are the penalty for using an extra pattern or for skipping characters. An edge exists between two matches if they are adjacent and non-overlapping. An edge exists from a match m to a skip if that is the shortest possible skip between m and another match m2 that overlaps with some match m3 starting at the same place as the skip (phew!). The structure of Dijkstra's algorithm ensures that the optimal answer is the first one to be found by the time we reach the end of the input string (it achieves the pruning Daniel suggested implicitly).
#include <iostream>
#include <queue>
#include <vector>
#include <list>
#include <string>
#include <algorithm>
#include <set>

using namespace std;

static vector<string> patterns;
static string input;
static int skippenalty;

struct acnode {
    acnode() : failure(NULL), gotofn(256) {}
    struct acnode *failure;
    vector<struct acnode *> gotofn;
    list<int> outputs; // index into patterns global
};

void
add_string_to_trie(acnode *root, const string &s, int sid)
{
    for (string::const_iterator p = s.begin(); p != s.end(); ++p) {
        if (!root->gotofn[*p])
            root->gotofn[*p] = new acnode;
        root = root->gotofn[*p];
    }
    root->outputs.push_back(sid);
}

void
init_tree(acnode *root)
{
    queue<acnode *> q;
    unsigned char c = 0;
    do {
        if (acnode *u = root->gotofn[c]) {
            u->failure = root;
            q.push(u);
        } else
            root->gotofn[c] = root;
    } while (++c);
    while (!q.empty()) {
        acnode *r = q.front();
        q.pop();
        do {
            acnode *u, *v;
            if (!(u = r->gotofn[c]))
                continue;
            q.push(u);
            v = r->failure;
            while (!v->gotofn[c])
                v = v->failure;
            u->failure = v->gotofn[c];
            u->outputs.splice(u->outputs.begin(), v->gotofn[c]->outputs);
        } while (++c);
    }
}

struct match { int begin, end, sid; };

void
ahocorasick(const acnode *state, list<match> &out, const string &str)
{
    int i = 1;
    for (string::const_iterator p = str.begin(); p != str.end(); ++p, ++i) {
        while (!state->gotofn[*p])
            state = state->failure;
        state = state->gotofn[*p];
        for (list<int>::const_iterator q = state->outputs.begin();
             q != state->outputs.end(); ++q) {
            struct match m = { i - patterns[*q].size(), i, *q };
            out.push_back(m);
        }
    }
}

////////////////////////////////////////////////////////////////////////

bool operator<(const match& m1, const match& m2)
{
    return m1.begin < m2.begin
        || (m1.begin == m2.begin && m1.end < m2.end);
}

struct dnode {
    int usedchars;
    vector<bool> usedpatterns;
    int last;
};

bool operator<(const dnode& a, const dnode& b) {
    return a.usedchars > b.usedchars
        || (a.usedchars == b.usedchars && a.usedpatterns < b.usedpatterns);
}

bool operator==(const dnode& a, const dnode& b) {
    return a.usedchars == b.usedchars
        && a.usedpatterns == b.usedpatterns;
}

typedef priority_queue<pair<int, dnode>,
                       vector<pair<int, dnode> >,
                       greater<pair<int, dnode> > > mypq;

void
dijkstra(const vector<match> &matches)
{
    typedef vector<match>::const_iterator mIt;
    vector<bool> used(patterns.size(), false);
    dnode initial = { 0, used, -1 };
    mypq q;
    set<dnode> last;
    dnode d;
    q.push(make_pair(0, initial));
    while (!q.empty()) {
        int cost = q.top().first;
        d = q.top().second;
        q.pop();
        if (last.end() != last.find(d)) // we've been here before
            continue;
        last.insert(d);
        if (d.usedchars >= input.size()) {
            break; // found optimum
        }
        match m = { d.usedchars, 0, 0 };
        mIt mp = lower_bound(matches.begin(), matches.end(), m);
        if (matches.end() == mp) {
            // no more matches, skip the remaining string
            dnode nextd = d;
            nextd.usedchars = input.size();
            int skip = nextd.usedchars - d.usedchars;
            nextd.last = -skip;
            q.push(make_pair(cost + skip * skippenalty, nextd));
            continue;
        }
        // keep track of where the shortest match ended; we don't need to
        // skip more than this.
        int skipmax = (mp->begin == d.usedchars) ? mp->end : mp->begin + 1;
        while (mp != matches.end() && mp->begin == d.usedchars) {
            dnode nextd = d;
            nextd.usedchars = mp->end;
            int extra = nextd.usedpatterns[mp->sid] ? 0 : 1; // extra pattern
            int nextcost = cost + extra;
            nextd.usedpatterns[mp->sid] = true;
            nextd.last = mp->sid * 2 + extra; // encode used pattern
            q.push(make_pair(nextcost, nextd));
            ++mp;
        }
        if (mp == matches.end() || skipmax <= mp->begin)
            continue;
        // skip
        dnode nextd = d;
        nextd.usedchars = mp->begin;
        int skip = nextd.usedchars - d.usedchars;
        nextd.last = -skip;
        q.push(make_pair(cost + skip * skippenalty, nextd));
    }
    // unwind
    string answer;
    while (d.usedchars > 0) {
        if (0 > d.last) {
            answer = string(-d.last, '*') + answer;
            d.usedchars += d.last;
        } else {
            answer = "[" + patterns[d.last / 2] + "]" + answer;
            d.usedpatterns[d.last / 2] = !(d.last % 2);
            d.usedchars -= patterns[d.last / 2].length();
        }
        set<dnode>::const_iterator lp = last.find(d);
        if (last.end() == lp) return; // should not happen
        d.last = lp->last;
    }
    cout << answer;
}

int
main()
{
    int n;
    cin >> n; // read n patterns
    patterns.reserve(n);
    acnode root;
    for (int i = 0; i < n; ++i) {
        string s;
        cin >> s;
        patterns.push_back(s);
        add_string_to_trie(&root, s, i);
    }
    init_tree(&root);
    getline(cin, input); // eat the rest of the first line
    getline(cin, input);
    cerr << "got input: " << input << endl;
    list<match> matches;
    ahocorasick(&root, matches, input);
    vector<match> vmatches(matches.begin(), matches.end());
    sort(vmatches.begin(), vmatches.end());
    skippenalty = 1 + patterns.size();
    dijkstra(vmatches);
    return 0;
}
Here is a test file with 52 single-letter patterns (compile and then run with the test file on stdin):
52 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz

How can I compute the number of characters required to turn a string into a palindrome?

I recently found a contest problem that asks you to compute the minimum number of characters that must be inserted (anywhere) in a string to turn it into a palindrome.
For example, given the string: "abcbd" we can turn it into a palindrome by inserting just two characters: one after "a" and another after "d": "adbcbda".
This seems to be a generalization of a similar problem that asks for the same thing, except characters can only be added at the end - this has a pretty simple solution in O(N) using hash tables.
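(For reference, the append-at-the-end variant mentioned above can be seen as: the answer is the input length minus the length of its longest palindromic suffix, since everything before that suffix must be mirrored onto the end. A minimal sketch that checks suffixes naively in O(N^2); the hashing mentioned above only serves to make each palindrome check O(1):)

```cpp
#include <cassert>
#include <string>

// Append-only variant: the characters before the longest palindromic
// suffix are exactly the ones that must be mirrored onto the end.
int appendsNeeded(const std::string& s) {
    for (int i = 0; i < (int)s.size(); ++i) {
        int l = i, r = (int)s.size() - 1;
        while (l < r && s[l] == s[r]) { ++l; --r; }
        if (l >= r) return i; // s[i..] is a palindrome
    }
    return 0; // empty input
}
```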
I have been trying to modify the Levenshtein distance algorithm to solve this problem, but haven't been successful. Any help on how to solve this (it doesn't necessarily have to be efficient, I'm just interested in any DP solution) would be appreciated.
Note: This is just a curiosity. Dav proposed an algorithm which can easily be modified into a DP algorithm running in O(n^2) time and O(n^2) space (and perhaps O(n) space with better bookkeeping).
Of course, this 'naive' algorithm might actually come in handy if you decide to change the allowed operations.
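For the record, the O(n^2) interval DP mentioned in the note can be sketched like this (a minimal sketch in C++ for consistency with the rest of the thread): dp[i][j] is the minimum number of insertions needed to turn s[i..j] into a palindrome.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// dp[i][j] = minimum insertions to make s[i..j] a palindrome.
// If the ends match, the answer is whatever the inside needs;
// otherwise we must insert a mirror for one of the two ends.
int minInsertions(const std::string& s) {
    int n = s.size();
    if (n == 0) return 0;
    std::vector<std::vector<int>> dp(n, std::vector<int>(n, 0));
    for (int len = 2; len <= n; ++len) {
        for (int i = 0; i + len - 1 < n; ++i) {
            int j = i + len - 1;
            dp[i][j] = (s[i] == s[j])
                ? dp[i + 1][j - 1]
                : 1 + std::min(dp[i + 1][j], dp[i][j - 1]);
        }
    }
    return dp[0][n - 1]; // e.g. "abcbd" -> 2
}
```

Equivalently, the answer is n minus the length of the longest palindromic subsequence of s, which is why the LCS-with-the-reverse view discussed in this thread gives the same number.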
Here is a 'naive'ish algorithm, which can probably be made faster with clever bookkeeping.
Given a string, we guess the middle of the resulting palindrome and then try to compute the number of inserts required to make the string a palindrome around that middle.
If the string is of length n, there are 2n+1 possible middles (Each character, between two characters, just before and just after the string).
Suppose we consider a middle which gives us two strings L and R (one to left and one to right).
If we are using inserts, I believe the Longest Common Subsequence algorithm (which is a DP algorithm) can now be used to create a 'super' string which contains both L and the reverse of R; see Shortest common supersequence.
Pick the middle which gives you the smallest number of inserts.
I believe this is O(n^3). (Note: I haven't tried to prove it.)
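A sketch of this middle-enumeration idea (an assumption-laden sketch, not the answerer's code): for a given middle, the cost is |L| + |R| - 2*LCS(L, reverse(R)), which follows from the shortest-common-supersequence length |L| + |R| - LCS.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Standard LCS DP table.
int lcs(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> dp(a.size() + 1,
                                     std::vector<int>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                ? dp[i - 1][j - 1] + 1
                : std::max(dp[i - 1][j], dp[i][j - 1]);
    return dp[a.size()][b.size()];
}

// Try all 2n+1 middles; L is the part left of the middle, R the part
// right of it, and the cost is |L| + |R| - 2*LCS(L, reverse(R)).
int minInsertsByMiddle(const std::string& s) {
    int n = s.size(), best = n;
    for (int m = 0; m <= 2 * n; ++m) {
        int lend = m / 2;           // L = s[0..lend)
        int rstart = (m + 1) / 2;   // R = s[rstart..n); odd m skips s[lend]
        std::string L = s.substr(0, lend);
        std::string R(s.rbegin(), s.rbegin() + (n - rstart)); // reverse(R)
        best = std::min(best, lend + (n - rstart) - 2 * lcs(L, R));
    }
    return best;
}
```

With one LCS per middle this is indeed O(n^3) overall, matching the estimate above.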
My C# solution looks for repeated characters in a string and uses them to reduce the number of insertions. In a word like "program", I use the 'r' characters as a boundary. Inside the 'r's, I make a palindrome (recursively). Outside the 'r's, I mirror the characters on the left and the right.
Some inputs have more than one shortest result: "output" can become "toutptuot" or "outuputuo". My solution selects only one of the possibilities.
Some example runs:
radar -> radar, 0 insertions
esystem -> metsystem, 2 insertions
message -> megassagem, 3 insertions
stackexchange -> stegnahckexekchangets, 8 insertions
First I need to check if an input is already a palindrome:
public static bool IsPalindrome(string str)
{
    for (int left = 0, right = str.Length - 1; left < right; left++, right--)
    {
        if (str[left] != str[right])
            return false;
    }
    return true;
}
Then I need to find any repeated characters in the input. There may be more than one. The word message has two most-repeated characters ('e' and 's'):
private static bool TryFindMostRepeatedChar(string str, out List<char> chs)
{
    chs = new List<char>();
    int maxCount = 1;
    var dict = new Dictionary<char, int>();
    foreach (var item in str)
    {
        int temp;
        if (dict.TryGetValue(item, out temp))
        {
            dict[item] = temp + 1;
            // take the max so a later, less frequent char can't lower the count
            maxCount = Math.Max(maxCount, temp + 1);
        }
        else
            dict.Add(item, 1);
    }
    foreach (var item in dict)
    {
        if (item.Value == maxCount)
            chs.Add(item.Key);
    }
    return maxCount > 1;
}
My algorithm is here:
public static string MakePalindrome(string str)
{
    List<char> repeatedList;
    if (string.IsNullOrWhiteSpace(str) || IsPalindrome(str))
    {
        return str;
    }
    //If an input has repeated characters,
    // use them to reduce the number of insertions
    else if (TryFindMostRepeatedChar(str, out repeatedList))
    {
        string shortestResult = null;
        foreach (var ch in repeatedList) //"program" -> { 'r' }
        {
            //find boundaries
            int iLeft = str.IndexOf(ch);      // "program" -> 1
            int iRight = str.LastIndexOf(ch); // "program" -> 4
            //make a palindrome of the inside chars
            string inside = str.Substring(iLeft + 1, iRight - iLeft - 1); // "program" -> "og"
            string insidePal = MakePalindrome(inside); // "og" -> "ogo"
            string right = str.Substring(iRight + 1);  // "program" -> "am"
            string rightRev = Reverse(right);          // "program" -> "ma"
            string left = str.Substring(0, iLeft);     // "program" -> "p"
            string leftRev = Reverse(left);            // "p" -> "p"
            //Shave off extra chars in rightRev and leftRev
            // When input = "message", this loop converts "meegassageem" to "megassagem",
            // ("ee" to "e"), as long as the extra 'e' is an inserted char
            while (left.Length > 0 && rightRev.Length > 0 &&
                   left[left.Length - 1] == rightRev[0])
            {
                rightRev = rightRev.Substring(1);
                leftRev = leftRev.Substring(1);
            }
            //piece together the result
            string result = left + rightRev + ch + insidePal + ch + right + leftRev;
            //find the shortest result for inputs that have multiple repeated characters
            if (shortestResult == null || result.Length < shortestResult.Length)
                shortestResult = result;
        }
        return shortestResult;
    }
    else
    {
        //For inputs that have no repeated characters,
        // just mirror the characters using the last character as the pivot.
        for (int i = str.Length - 2; i >= 0; i--)
        {
            str += str[i];
        }
        return str;
    }
}
Note that you need a Reverse function:
public static string Reverse(string str)
{
    string result = "";
    for (int i = str.Length - 1; i >= 0; i--)
    {
        result += str[i];
    }
    return result;
}
C# recursive solution, adding characters only at the end of the string:
There are two base cases: length 1 and length 2 with equal characters. Recursive case: if the extremes are equal, make the inner string (without the extremes) a palindrome and return it wrapped in the extremes. If the extremes are not equal, append the first character to the end and make the inner string (including the previous last character) a palindrome; return that.
public static string ConvertToPalindrome(string str) // By only adding characters at the end
{
    if (str.Length == 1) return str; // base case 1
    if (str.Length == 2 && str[0] == str[1]) return str; // base case 2
    else
    {
        if (str[0] == str[str.Length - 1]) // keep the extremes and recurse
            return str[0] + ConvertToPalindrome(str.Substring(1, str.Length - 2)) + str[str.Length - 1];
        else // add the first character at the end and recurse
            return str[0] + ConvertToPalindrome(str.Substring(1, str.Length - 1)) + str[0];
    }
}
