Counting distinct common subsequences for a given set of strings - algorithm

I was going through a paper about counting the number of distinct common subsequences between two strings, which describes a DP approach for that task. When there are more than two strings whose number of distinct common subsequences must be found, it might take a different approach. What I want to know is whether this task is achievable in less than exponential time, and if so, how.

If you have an alphabet of size k, and m strings of length at most n, then (assuming that all individual math operations are O(1)) this problem is solvable with dynamic programming in time at most O(k·n^(m+1)) and memory O(k·n^m). Those are not tight bounds, and in practice performance and memory should be significantly better than that. But in practice with long strings you will wind up needing big-integer arithmetic, which will make math operations not O(1). Still, it is polynomial.
Here is the trick in an unfortunately confusing sentence. We want to build up a series of tables listing, for each possible length of subsequence and each set of ways to pick one copy of a character from each string, the number of distinct subsequences there are whose minimal expression in each string ends at the chosen spot. If we do that, then the sum of all of those values is our final answer.
Here is an outline of how to do it (which you can do without understanding the above description).
For each string, build a transition table mapping (position in string, character) to the position of the next occurrence of that character. The tables should start with position 0 being before the first character. You can use -1 for running off of the end of the string.
Create a data structure that maps a tuple of integers, one per string, to another integer. This will be the count of subsequences of a fixed length whose shortest representation in each string ends at that set of positions.
Insert as the sole value (0, 0, ..., 0) -> 1 to represent the fact that there is 1 subsequence of length 0 and its shortest representation in each string ends at the start.
Set the total count of common subsequences to 0.
While that map is not empty:
Add the sum of values in that map to the total count of common subsequences.
Create a second map of the same type, with no data.
For each key/value pair in the first map:
For each possible character in your alphabet:
Construct a new vector of integers to be a new key by taking each string, looking at the position, then taking the next position of that character. Of course if you run off of the end of the string, break out of the loop.
If that key is not in your second map, insert it with value 0.
Increase the value for that key in the second map by your current value in the current map. (Basically add the number of subsequences that just had this minimal character transition.)
Copy the second data structure to the first.
The total count of distinct subsequences in common across all of the strings should now be correct.
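The steps above can be sketched in Python. The function name and the explicit alphabet argument are my own choices; note that the total includes the empty subsequence, since the initial map contributes 1 on the first pass of the loop:

```python
def count_common_subsequences(strings, alphabet):
    # Build per-string transition tables: nxt[i][p][c] is the position just
    # past the next occurrence of character c in strings[i], searching from
    # position p (position 0 means "before the first character").
    nxt = []
    for s in strings:
        n = len(s)
        table = [None] * (n + 1)
        table[n] = {}
        last = {}
        for p in range(n - 1, -1, -1):
            last = dict(last)          # each position gets its own map
            last[s[p]] = p + 1         # minimal end position after taking s[p]
            table[p] = last
        nxt.append(table)

    # current maps (end position in each string) -> number of distinct
    # common subsequences of the current length with those minimal ends.
    current = {(0,) * len(strings): 1}
    total = 0
    while current:
        total += sum(current.values())     # count this length's subsequences
        successors = {}
        for key, count in current.items():
            for c in alphabet:
                new_key = []
                for i, p in enumerate(key):
                    q = nxt[i][p].get(c)
                    if q is None:          # ran off the end of strings[i]
                        break
                    new_key.append(q)
                else:                      # c occurs after p in every string
                    k = tuple(new_key)
                    successors[k] = successors.get(k, 0) + count
        current = successors
    return total  # includes the empty subsequence
```

For example, `count_common_subsequences(["ab", "ba"], "ab")` returns 3, counting "", "a" and "b".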

Related

Count palindromic permutations ("mirrors") of an array

I've been trying to find a solution for this question:
Given an array of integers, count the distinct permutations that are palindromes ("mirrors"); that is, find the number of distinct ways that the array's elements can be rearranged so that they read the same way backward as forward. For example:
If the array is [1,1,2], then there is only one distinct palindromic permutation (namely [1,2,1]), so the desired result is 1.
If the array is [1,1,2,2], then there are two distinct palindromic permutations (namely [1,2,2,1] and [2,1,1,2]), so the desired result is 2.
If the array is [2,2,2,3,3], then there are two distinct palindromic permutations (namely [3,2,2,2,3] and [2,3,2,3,2]), so the desired result is 2.
I've been trying to solve this and been stuck for quite a while, and can't find any solution online. Any help will be appreciated (just starting out on algo & ds stuff)
My idea is to find the index of the median of that array (e.g., in example #1, the median is at index 1), move all numbers after it to before it (so, [1,2,1]), and check using two pointers (one at the end, one at the start) whether the paired numbers are equal.
However, this won't work if, let's say, example #1 is arr = [1,2,2], as doing the above would give [1,2,2]. What I should have done in this case is move the 1 in between the 2s (a sort of median from the end, if that makes sense). Sort of like the above method but in reverse (?)
Here is the general idea:
Count the frequency of each unique value.
If the array's length is odd, then exactly one frequency should be odd. If not, there are no mirrors. If so, that value will have to be placed in the center. The number of mirrors is then equal to what you would get for an array with one value less -- that value removed.
Now the array length is even. No frequencies should be odd, or else there are no mirrors. Now halve all those frequencies.
Determine how many permutations can be formed with those values and their (halved) frequencies. The formula is:
๐‘›! / (๐‘›1!๐‘›2!๐‘›3!...๐‘›๐‘˜!)
where ๐‘› is the sum of all (halved) frequencies (i.e. half the size of the array), and the ๐‘›๐‘– is the list of (halved) frequencies.

Minimum number of deletions for a given word to become a dictionary word

Given a dictionary as a hashtable. Find the minimum # of
deletions needed for a given word in order to make it match any word in the
dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is given by |x| - |w|. So certainly one option would be to just iterate over the hash table and, for each word, to see if that word is a subsequence of x, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2^|x|), because there are 2^|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x| · 2^|x|}) if you take the better of the two approaches.
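A sketch of the scan-the-table approach in Python (the helper names are mine; the dictionary is any iterable of words):

```python
def is_subsequence(w, x):
    """True if w can be formed from x by deleting letters."""
    it = iter(x)
    # Each membership test advances the iterator past the matched
    # character, so characters of x are consumed left to right.
    return all(ch in it for ch in w)

def min_deletions(x, dictionary):
    """Fewest deletions turning x into some dictionary word, or None."""
    best = None
    for w in dictionary:
        if len(w) <= len(x) and is_subsequence(w, x):
            cost = len(x) - len(w)
            if best is None or cost < best:
                best = cost
    return best
```

For instance, `min_deletions("abcde", {"ace", "bd"})` is 2, since "ace" is a subsequence of "abcde" reachable with two deletions.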
You can build a trie and then see where your given word fits into it. The difference in the depth of your word and the closest existing parent is the number of deletions required.

get unique words in text stream

At a given instant, can we find the number of unique words seen so far in a text stream?
One naive solution I can think of is using a hashmap to keep word counts.
But this requires keeping every word seen in the hashmap. In the case of a long text stream, that is a lot of words to maintain. Is there a way to reduce the space complexity for this?
You cannot get the number of distinct words exactly without paying the space complexity. However, you can get a reasonably good estimate by using the Flajolet-Martin approach, described on slide 20 of this slide deck.
Assuming the data stream consists of a universe of elements chosen from a set of size N, you can do the following steps, copied from the slides linked above.
Pick a hash function h that maps each of the N elements to at least log_2 (N) bits.
For each stream element a, let r(a) be the number of trailing 0's in h(a).
Record R = the maximum r(a) seen.
Estimated number of distinct elements = 2^R.
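The four steps can be sketched in Python. Here a single SHA-256 hash stands in for h, so this is a high-variance toy estimate; real implementations average over many hash functions (or use a refinement such as HyperLogLog):

```python
import hashlib

def flajolet_martin_estimate(stream):
    """Estimate the number of distinct elements in a stream."""
    max_r = 0
    for item in stream:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest(), "big")
        # r(a) = number of trailing zero bits in h(a); (h & -h) isolates
        # the lowest set bit.
        r = (h & -h).bit_length() - 1 if h else 256
        max_r = max(max_r, r)
    return 2 ** max_r   # 2^R, where R is the maximum r(a) seen
```

Since the estimate depends only on the set of distinct hashes, feeding the same word many times does not change the result, which is the point: memory usage is a single integer rather than a hashmap of all words.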

Determining if a sequence T is a sorting of a sequence S in O(n) time

I know that one can easily determine if a sequence is sorted in O(n) time. However, how can we ensure that some sequence T is indeed the sorting of elements from sequence S in O(n) time?
That is, someone might have an algorithm that outputs some sequence T that is indeed in sorted order, but may not contain any elements from sequence S, so how can we check that T is indeed a sorted sequence of S in O(n) time?
Get the length L of S.
Check the length of T as well. If they differ, you are done!
Let Hs be a hash map with something like 2L buckets of all elements in S.
Let Ht be a hash map (again, with 2L buckets) of all elements in T.
For each element in T, check that it exists in Hs.
For each element in S, check that it exists in Ht.
This will work if the elements are unique in each sequence. See wcdolphin's answer for the small changes needed to make it work with non-unique sequences.
I have NOT taken memory consumption into account. Creating two hashmaps of double the size of each sequence may be expensive. This is the usual tradeoff between speed and memory.
While Emil's answer is very good, you can do slightly better.
Fundamentally, in order for T to be a reordering of S it must contain all of the same elements. That is to say, for every element in T or S, they must occur the same number of times. Thus, we will:
Create a Hash table of all elements in S, mapping from the 'Element' to the number of occurrences.
Iterate through every element in T, decrementing the number of times the current element occurred.
If the number of occurrences is zero, remove it from the hash.
If the current element is not in the hash, T is not a reordering of S.
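The occurrence-counting approach above can be sketched in Python (names are mine; the stdlib Counter plays the role of the hash table, and the O(n) sortedness check from the question is folded in):

```python
from collections import Counter

def is_sorted_version(T, S):
    """Check that T is sorted and is a reordering of S, in O(n)."""
    if len(T) != len(S):
        return False
    # T must be in sorted order.
    if any(T[i] > T[i + 1] for i in range(len(T) - 1)):
        return False
    # T must contain exactly the elements of S, with multiplicity.
    counts = Counter(S)
    for x in T:
        if counts[x] == 0:      # missing keys default to 0 in a Counter
            return False        # x occurs more often in T than in S
        counts[x] -= 1
    return True
```

Because the lengths are equal and every element of T is matched against S's counts, no leftover check is needed at the end.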
Create a hash map of both sequences. Use the character as key, and the count of the character as value. If a character has not been added yet add it with a count of 1. If a character has already been added increase its count by 1.
Verify that for each character in the input sequence that the hash map of the sorted sequence contains the character as key and has the same count as value.
I believe this is an O(n^2) problem, because:
Assume the data structure you use to store elements is a linked list, for cheap removal of an element.
You will be doing an S.contains(element of T) for every element of T, plus one check that they are the same size.
You cannot assume that S is ordered, and therefore need to do an element-by-element comparison for every element.
The worst case would be if S is the reverse of T.
This would mean that for element (0+x) of T you would do (n-x) comparisons, if you remove each successfully matched element.
This results in (n*(n+1))/2 operations, which is O(n^2).
There might be some cleverer algorithm out there, though.

string transposition algorithm

Suppose we are given two strings:
String s1 = "MARTHA"
String s2 = "MARHTA"
Here the positions of T and H are exchanged. I am interested in writing code that counts how many swaps are necessary to transform one string into the other.
There are several edit distance algorithms; the given Wikipedia link has links to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 into s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broken up into permutation cycles in linear time. The cycles in the example are (2,1), (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix" it. Therefore, the number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps = N - C needs a proof... I didn't find one on the web.
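The cycle-decomposition count can be sketched in Python, assuming (as noted in problem 1 above) that no character repeats; the function name is my own:

```python
def min_swaps(s1, s2):
    """Minimum swaps turning s2 into s1, for strings that are
    permutations of each other with no repeated characters."""
    assert sorted(s1) == sorted(s2) and len(set(s1)) == len(s1)
    pos_in_s2 = {c: i for i, c in enumerate(s2)}
    perm = [pos_in_s2[c] for c in s1]   # where s1[i] currently sits in s2
    seen = [False] * len(perm)
    cycles = 0
    for i in range(len(perm)):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:          # walk one permutation cycle
                seen[j] = True
                j = perm[j]
    return len(perm) - cycles           # N - C
```

On the example from the answer, `min_swaps("abcde", "badce")` gives 2, matching the hand count.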
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in equality) between the two strings. You are not looking for just any count but for the smallest count (or at least I suppose so), since otherwise there are infinitely many ways to swap a string to obtain another one.
You should first check which characters are already in place, then, for every character that is not, look whether there is a pair that can be swapped so that the distance between the strings is reduced. Iterate until the process finishes.
If you don't want to actually perform the swaps but just count them, use a bit array in which you have 1 for every well-placed character and 0 otherwise. You are finished when every bit is 1.