Trie data structure - data-structures

You are given N strings. Each string contains only lowercase letters from a−j (both inclusive). The set of N strings is said to be a GOOD SET if no string is a prefix of another string; otherwise, it is a BAD SET.
For example, aab, abcde, aabcd is a BAD SET because aab is a prefix of aabcd.
Print GOOD SET if it satisfies the problem requirement.
Else, print BAD SET and the first string for which the condition fails.
Input Format:
First line contains N, the number of strings in the set.
Then next N lines follow, where ith line contains ith string.
Constraints:
1 ≤ N ≤ 10^5
1 ≤ Length of the string ≤ 60
Output Format:
Output GOOD SET if the set is valid.
Else, output BAD SET followed by the first string for which the condition fails.
Can anyone suggest an approach for this?

Construct a trie by inserting each string one by one, recording a pointer to the node where each inserted string ends. Once done, scan the recorded node pointers for all strings. If any of them points to an internal node, it is a BAD SET. Otherwise (all strings end on distinct leaves), it is a GOOD SET. Time and space complexity are both linear in the total length of all strings.
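The answer above only decides GOOD or BAD after all insertions, while the problem also asks for the first offending string. A minimal sketch of a variation that checks during insertion is shown below; the function name check_prefix_set, the nested-dict trie and the "$" end marker are illustrative assumptions, not part of the original answer.

```python
def check_prefix_set(strings):
    """Return None for a GOOD SET, else the first string that breaks it."""
    END = "$"   # assumed marker key: some inserted string ends at this node
    root = {}   # each trie node is a dict mapping letter -> child node
    for s in strings:
        node = root
        created = False
        for ch in s:
            if END in node:
                return s            # an earlier string is a prefix of s
            if ch not in node:
                node[ch] = {}
                created = True
            node = node[ch]
        if not created:
            return s                # s is a prefix of (or equal to) an earlier string
        node[END] = True
    return None


def report(strings):
    """Format the answer the way the problem statement asks for it."""
    bad = check_prefix_set(strings)
    return "GOOD SET" if bad is None else "BAD SET\n" + bad
```

Each character is visited once, so the running time stays linear in the total length of the input.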

Related

How to Modify a Suffix Array to search multiple strings?

I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned it's 'trivial' to generalize to an entire list of search strings, but I can't see how.
Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.
I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.
In suffix arrays you usually don't use several strings, just one string. That string is the concatenation of the individual strings, each followed by an end token (a different one for every string). The suffix array itself stores pointers (or array indices) that reference the suffixes; only the position of the first character of each suffix is needed.
So the space required is the concatenated string plus one pointer per suffix. (That is just a very simple implementation; you should do more to get better performance.)
In that case you could optimise the sorting algorithm for the suffixes, since each suffix only needs to be compared up to its end token. Everything after the end token can be ignored by the sorting algorithm.
After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.
If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of getting an array of integers, you end up with an array of lists. Each list contains one or more pairs of a string identifier and a starting offset in that string.
The resulting structure is still useful, but not as efficient as a normal suffix array.
From Iowa State University, taken from Prefix.pdf:
Suffix trees and suffix arrays can be generalized to multiple strings.
The generalized suffix tree of a set of strings S = {s1, s2, . . . ,
sk}, denoted GST(S) or simply GST, is a compacted trie of all suffixes
of each string in S. We assume that the unique termination character $
is appended to the end of each string. A leaf label now consists of a
pair of integers (i, j), where i denotes the suffix is from string si
and j denotes the starting position of the suffix in si . Similarly,
an edge label in a GST is a substring of one of the strings. An edge
label is represented by a triplet of integers (i, j, l), where i
denotes the string number, and j and l denote the starting and ending
positions of the substring in si . For convenience of understanding,
we will continue to show the actual edge labels. Note that two strings
may have identical suffixes. This is compensated by allowing leaves in
the tree to have multiple labels. If a leaf is multiply labelled, each
suffix should come from a different string. If N is the total number
of characters (including the $ in each string) of all strings in S,
the GST has at most N leaf nodes and takes up O(N) space. The
generalized suffix array of S, denoted GSA(S) or simply GSA, is a
lexicographically sorted array of all suffixes of each string in S.
Each suffix is represented by an integer pair (i, j) denoting suffix
starting from position j in si . If suffixes from different strings
are identical, they occupy consecutive positions in the GSA. For
convenience, we make an exception for the suffix $ by listing it only
once, though it occurs in each string. The GST and GSA of strings
apple and maple are shown in Figure 1.2.
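To make the definition in the excerpt concrete, here is a naive sketch that builds a generalized suffix array by sorting explicit suffix strings. The function name and the 0-based (i, j) numbering are assumptions made for illustration, and the sketch relies on the terminator "$" sorting before every lowercase letter. It is far slower than a real construction algorithm and is meant only to show the resulting structure.

```python
def generalized_suffix_array(strings, terminator="$"):
    """Return the sorted list of (i, j) pairs: suffix of strings[i] starting at j."""
    entries = []
    for i, s in enumerate(strings):
        t = s + terminator
        for j in range(len(t)):
            # Following the excerpt, list the bare terminator suffix only once.
            if t[j:] == terminator and i > 0:
                continue
            entries.append((t[j:], i, j))
    # Identical suffixes from different strings end up in consecutive positions.
    entries.sort()
    return [(i, j) for _, i, j in entries]


gsa = generalized_suffix_array(["apple", "maple"])
```

For apple and maple the shared suffixes e$, le$ and ple$ each appear as two adjacent entries, one per source string.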
Here you have an article about an algorithm to construct a GSA:
Generalized enhanced suffix array construction in external memory

Minimum number of char substitutions to get a palindrome

I would like to solve this problem from TopCoder, in which a string is given and in each step you have to replace all occurrences of a character (of your choice) with another character (of your choice), so that after all steps you get a palindrome. The problem is to identify the minimum total number of replacements.
Ideas so far:
I can identify that the string after every step is simply a node/vertex in a graph and that the cost of every edge is the number of replacements made in the step, but I don't see how to use a greedy approach for that (it is definitely not the Minimum Spanning Tree problem). I don't think it makes sense to identify all possible nodes and edge costs and to convert the problem into a Shortest Path problem. On the other hand, I think in every step it makes sense to replace the character X with the biggest number of conflicts with the character Y that is in conflict with X and occurs most often in the string.
Anyway, I can't prove that it works either. Also, I can't identify any known problem in this. Any ideas?
You need to identify disjoint sets of characters. A disjoint set of characters is a set of characters that will all have to become the same character in order for the string to become a palindrome.
Example:
Let's say we have the string abcdefgfmdebac
It has 3 disjoint sets: abc, de and fgm
Algorithm:
Pick the first character and check all occurrences of it, adding the characters at the mirrored positions to the set.
In the example string we start with a and pick up b and c (because they sit on the opposite sides of the two a's in our string). We repeat the process for b and c, but no new characters are added to the set. So abc is our first disjoint set.
Continue doing this with the remaining characters.
A disjoint set whose characters occur n times in total (counting all occurrences) needs n-m replacements, where m is the number of occurrences of the most frequent character in the set.
So simply sum over the sets.
In our example it takes 4 + 2 + 2 = 8 replacements.
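The grouping step can be sketched with a union-find structure, which is a substitute for the manual "pick a character and follow its mirrored partners" procedure described above. The function name min_replacements is hypothetical.

```python
from collections import Counter


def min_replacements(s):
    """Minimum number of character replacements to make s a palindrome."""
    parent = {c: c for c in set(s)}

    def find(c):
        # Follow parent links to the representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    n = len(s)
    for i in range(n // 2):
        # Mirrored characters must end up equal, so they share a set.
        a, b = find(s[i]), find(s[n - 1 - i])
        if a != b:
            parent[a] = b

    total = {}  # representative -> occurrences of all characters in the set
    best = {}   # representative -> occurrences of the most frequent character
    for c, k in Counter(s).items():
        r = find(c)
        total[r] = total.get(r, 0) + k
        best[r] = max(best.get(r, 0), k)
    return sum(total[r] - best[r] for r in total)
```

For the example string this gives (6 - 2) + (4 - 2) + (4 - 2) = 8, matching the sum above.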

Find strings which are prefixes of other strings

This is an interview question. Given a number of strings, find those strings which are prefixes of others. For example, given strings = {"a", "aa", "ab", "abb"} the result is {"a", "ab"}.
The simplest solution is just to sort the strings and check each pair of two subsequent strings if the 1st one is a prefix of the 2nd one. The running time of the algorithm is the running time of the sorting.
I guess there is another solution, which uses a trie, and has complexity O(N), where N is the number of strings. Could you suggest such an algorithm?
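The sort-based approach mentioned in the question can be sketched as follows. It works because, in sorted order, all strings that start with w form a contiguous block directly after w. The function name is hypothetical, and duplicates are dropped first, which is an assumption about how repeated strings should be treated.

```python
def prefixes_by_sorting(strings):
    """Return the strings that are proper prefixes of some other string."""
    words = sorted(set(strings))
    # w is a prefix of another string exactly when its successor starts with w.
    return [w for w, nxt in zip(words, words[1:]) if nxt.startswith(w)]
```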
I have the following idea using a trie, with complexity O(N):
You start with an empty trie.
You take the words one by one and add each word to the trie.
After you add a word (let's call it Wi) to the trie, there are two cases to consider:
Wi is a prefix of some of the words you added before.
That statement is true if you didn't add any nodes to the trie while adding word Wi.
In that case, Wi is a prefix and part of our solution.
Some of the words added before are prefixes of Wi.
That statement is true if you passed through a node that represents the end of some word added before (let's call that word Wj). In that case, Wj is a prefix of Wi and part of our solution.
In more details (pseudocode):
for word in words
    add word to trie
    if size of trie did not change then          // first case
        add word to result
    if ending nodes found while adding word      // second case
        add words defined by those nodes to result
return result
Adding new word to Trie:
node = trie.root()
for letter in word
    if node.hasChild(letter) == false then     // if letter doesn't exist, add it
        node.addChild(letter)
    node = node.getChild(letter)               // move to the node for this letter
    if letter is last_letter_of_word then      // if last letter of word, store that info
        node.setIsLastLetterOf(word)
While you are adding a new word, you can also check whether you passed through any nodes that represent the last letters of other words.
The complexity of the algorithm that I described is O(N) word insertions, so the total running time is linear in the combined length of all words.
Another important thing is that this way you can know how many other words Wi is a prefix of, which may be useful.
Example for {aab, aaba, aa}:
Green nodes are nodes detected as case 1.
Red nodes are nodes detected as case 2.
Each column (trie) is one step. At the beginning the trie is empty.
Black arrows show which nodes we visited (or added) in that step.
Nodes that represent the last letter of some word have that word written in parentheses.
In step 1 we add word aab.
In step 2 we add word aaba, recognize one case 2 (word aab) and add word aab to result.
In step 3 we add word aa, recognize case 1 and add word aa to result.
At the end we have result = {aab, aa} which is correct.
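The two cases above can be sketched in Python as follows. The nested-dict trie, the "$" marker key and the function name find_prefix_words are illustrative assumptions, and the sketch assumes the input words are distinct and non-empty.

```python
def find_prefix_words(words):
    """Return the set of words that are prefixes of other words."""
    END = "$"   # assumed marker key: maps to the word that ends at this node
    root = {}
    result = set()
    for word in words:
        node = root
        added_node = False
        for ch in word:
            if END in node:
                result.add(node[END])   # case 2: an earlier word is a prefix
            if ch not in node:
                node[ch] = {}
                added_node = True
            node = node[ch]
        if not added_node:
            result.add(word)            # case 1: the trie did not grow
        node[END] = word
    return result
```

Running it on the example words in the order aab, aaba, aa reproduces the three steps described above.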
The original answer (kept below) addresses a different question, namely whether a string a is a substring of b, because I misread the problem.
Using a trie, you can simply add all strings to it in a first iteration, and in the 2nd iteration read each word w again, following its path in the trie. When you finish reading w you reach some node v in the trie. If v is not a leaf, i.e. something other than the string terminator (usually $) follows it, then other words continue below v.
By doing a DFS from v, you can get all the strings that w is a prefix of.
high level pseudo code:
t <- new trie
for each word w:
    t.add(w)
for each word w:
    node <- t.getLastNode(w)
    if node.val != $
        collection <- DFS(node) (excluding w itself)
        w is a prefix of each word in collection
Note: in order to optimize it, you might need to do some extra work. If a is a prefix of b, and b is a prefix of c, then a is a prefix of c, so when you do the DFS and reach a node that was already searched, just append its strings to those of the current prefix.
Still, since there can be a quadratic number of (prefix, word) pairs (e.g. "a", "aa", "aaa", ...), listing all of them requires quadratic time.
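A minimal sketch of this two-pass idea, without the memoization mentioned in the note, might look like this. The nested-dict trie, the "$" marker key and the function name are assumptions, and the words are assumed to be distinct.

```python
def words_prefixed_by(words):
    """Map each word to the sorted list of other words it is a prefix of."""
    END = "$"   # assumed marker key: maps to the word ending at this node
    root = {}
    for w in words:                      # first pass: build the trie
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w

    result = {}
    for w in words:                      # second pass: DFS below each word
        node = root
        for ch in w:
            node = node[ch]
        stack = [child for key, child in node.items() if key != END]
        found = []
        while stack:
            cur = stack.pop()
            for key, child in cur.items():
                if key == END:
                    found.append(child)  # child is the stored word
                else:
                    stack.append(child)
        if found:
            result[w] = sorted(found)
    return result
```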
Original answer: finding if a is a substring of b:
The suggested solution runs in quadratic time: you need to check every pair of strings, giving you O(n * (n-1) * |S|).
You can build a suffix tree from the strings in the first iteration, and in the 2nd iteration check whether each string is a non-trivial entry (not itself) of another string.
This solution is O(n * |S|).

minimal cyclic sub string in a bigger cyclic string

I am trying to find an algorithm that could return the length of the shortest cyclic substring in a larger cyclic string.
A cyclic string is defined as a concatenation of two or more identical strings, e.g. "abababab" or "aaaa".
For example, in the string T = "abbcabbcabbcabbc" there is a cycle of the pattern "abbc", but the shortest cyclic substring is "bb".
If you're just looking for a substring that appears more than once:
Build a suffix tree from the string.
While creating the suffix tree, you can count the occurrences of every substring and store that count on the corresponding node.
Then just do a BFS on the tree (which gives you a layered search, from shorter to longer strings) and find the first substring longer than one character that occurs more than once.
Total complexity: O(n), where n is the length of the string.
Edit:
The paths from the root to the leaves
have a one-to-one relationship with
the suffixes of S
You can implement the tree so that each node contains one letter; that gives you finer granularity and lets you see all the substrings by length.
Here's a suffix tree of banana where every node contains one letter; you can see that all the substrings are there.
If you look at the applications section on suffix trees, you'll see that they are used for exactly this kind of task: finding things out about substrings.
Looking at the image from the root, you can see that ALL the substrings start from the root (BFS list):
b
a
n
ba
an
na
ban
ana
nan
bana
anan
nana
banan
anana
banana
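The one-letter-per-node variant with occurrence counts and a BFS can be sketched as below. This is an uncompressed suffix trie, so it takes quadratic time and space rather than the O(n) of a real suffix tree. The function name, the min_len parameter and the dict-based node layout are assumptions made for illustration.

```python
from collections import deque


def shortest_repeated_substring(s, min_len=2):
    """Shortest substring of length >= min_len that occurs more than once."""
    root = {"count": 0, "children": {}}
    for start in range(len(s)):          # insert every suffix, counting visits
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1           # number of occurrences of this substring

    queue = deque([(root, "")])          # BFS visits shorter substrings first
    while queue:
        node, text = queue.popleft()
        if len(text) >= min_len and node["count"] > 1:
            return text
        for ch, child in sorted(node["children"].items()):
            queue.append((child, text + ch))
    return None
```

Note that this finds a repeated substring, counting overlapping occurrences, which is not the same as a substring repeated back to back.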
Let me call "abbc" the generator in your example - i.e. the string that you repeat in order to get the bigger string.
The very first observation is that the smallest cyclic string must consist of some substring repeated exactly twice.
It's clear that the smallest string can be no longer than the generator repeated twice (2*generator), because 2*generator is itself cyclic.
Now note that you only need to consider the string obtained by taking the generator 3 times when searching for the smallest cyclic string. Indeed, if the smallest is not there, but it is in the 4*generator, then it must span at least two generators, but then it wouldn't be the smallest.
So now let's assume the bigger string is 3*generator (or 2*generator).
Also it's clear that if all characters of the generator are different, then the answer is 2*generator. If not, you just need to find all pairs of identical characters in the bigger string, say at positions i and j, and check whether the string starting at i that is 2*(j-i) long is cyclic. If you try them in order of increasing j-i, you can stop after the first success.
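The search in the last step can be sketched as a brute-force loop over the distance d = j - i in increasing order. Comparing the two halves directly also covers the check that the characters at i and j match. The function name is hypothetical, and the sketch scans the whole input rather than only 3*generator.

```python
def shortest_cyclic_substring(s):
    """Return the shortest substring of the form xx, or None if there is none."""
    n = len(s)
    for d in range(1, n // 2 + 1):           # d = j - i, tried in increasing order
        for i in range(n - 2 * d + 1):
            if s[i:i + d] == s[i + d:i + 2 * d]:
                return s[i:i + 2 * d]        # first success is the shortest
    return None
```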

efficient way to find matches against two strings

I need to find all equal substrings against two strings. I've tried to use suffix tree to lookup substrings and it works fast, but too memory consuming (inappropriate for my task).
Any other ideas?
Aho-Corasick is a good algorithm for matching any number of strings with little performance overhead. Did you try that?
You could use a sliding window, which needs less memory but takes more time.
The smallest substring is one character (actually, the empty word is one, but let's leave that aside).
Take character 1 of string 1 and save the positions of that character in string 2 in some sort of data structure, like a map or an array.
Then you take the next one (character 2 of string 1) and do the same thing.
Once you've reached the end of string 1, you start over, but this time you take two characters of string 1 at a time and always advance by one character, checking for all positions in string 2.
You repeat this with longer and longer substrings until the substring you're checking is as long as string 1, meaning you compare string 1 and 2 as a whole.
Keep in mind: when string 2 is longer than string 1, you need to advance the whole of string 1 by one character at a time along string 2, since string 1 might be a substring of string 2.
If string 1 is longer than string 2, you can stop checking once your substring is longer than string 2; all other substrings will have been checked by then. Ideally, you'd end up with a map (which in its simplest form is a two-dimensional array) that holds the positions of each substring of string 1 in string 2.
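A direct, unoptimized sketch of this procedure is given below. It builds the map from each substring of string 1 to its start positions in string 2. The function name is hypothetical, and the running time is roughly cubic, so this only illustrates the low-memory idea rather than a fast solution.

```python
def common_substring_positions(s1, s2):
    """Map each substring of s1 that occurs in s2 to its start positions in s2."""
    matches = {}
    max_len = min(len(s1), len(s2))      # longer substrings cannot fit in s2
    for length in range(1, max_len + 1):
        for i in range(len(s1) - length + 1):
            sub = s1[i:i + length]
            if sub in matches:
                continue                 # this substring was already looked up
            positions = [k for k in range(len(s2) - length + 1)
                         if s2.startswith(sub, k)]
            if positions:
                matches[sub] = positions
    return matches
```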
Why do you say that suffix tree is too memory consuming? If implemented properly, it consumes only O(n) memory.
