Algorithm: Is it possible to form a valid string - algorithm

Assume we have n three-letter substrings. The question is whether it is possible to make a string of length n+2 out of these n substrings by concatenating them (where overlapping letters are written only once). The resulting string must have the form a1,a2,a3,a4...
Two substrings may only be linked if they overlap at two adjacent places: 'yxz' + 'xzw' = 'yxzw', but 'yxz' + 'aby', for example, is not allowed.
Example 1: The n = 3 three-letter substrings are 'abc','cde','bcd'. Output: YES, because 'abc' + 'bcd' + 'cde' = 'abcde' is a valid string with n+2 = 5 letters.
Example 2: The n = 3 three-letter substrings are 'abc','bca','bcd'. Output: NO, because it is not possible to concatenate them all.
How can I find an efficient algorithm for this problem? Trying all possible combinations takes far too long with O(n!).

One popular approach to this kind of problem is to build the overlap graph of the input sequences, whose vertices are your triplets and where an arc a_i -> a_j between two triplets means that the last two letters of a_i are the first two letters of a_j, and then to find a Hamiltonian path in the resulting graph.
A naïve search would of course not outperform the exhaustive search you mention, but the linked Wikipedia article gives some leads on how to do this more efficiently.
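As a rough illustration of the overlap-graph idea, here is a minimal Python sketch. The function name can_chain and the plain depth-first backtracking search are assumptions made for this example; the search is still exponential in the worst case and only shows how the graph is built and explored, not one of the faster methods alluded to above.

```python
from collections import defaultdict

def can_chain(triplets):
    """Return True if all triplets can be chained using 2-letter overlaps."""
    n = len(triplets)
    # Index triplets by their first two letters so arcs are found quickly.
    by_prefix = defaultdict(list)
    for idx, t in enumerate(triplets):
        by_prefix[t[:2]].append(idx)
    # Arc i -> j when the last two letters of i are the first two of j.
    arcs = [[j for j in by_prefix[t[1:]] if j != i]
            for i, t in enumerate(triplets)]

    used = [False] * n

    def extend(i, count):
        # Depth-first search for a path visiting every vertex exactly once.
        if count == n:
            return True
        for j in arcs[i]:
            if not used[j]:
                used[j] = True
                if extend(j, count + 1):
                    return True
                used[j] = False
        return False

    for start in range(n):
        used[start] = True
        if extend(start, 1):
            return True
        used[start] = False
    return False

print(can_chain(['abc', 'cde', 'bcd']))  # Example 1
print(can_chain(['abc', 'bca', 'bcd']))  # Example 2
```

On the two examples from the question this should report YES for the first and NO for the second.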

Related

Why is the total number of possible substrings of a string n^2?

I read that the total number of substrings that can be formed from a given string is n^2 but I don't understand how to count this.
By substrings, I mean, given a string CAT, the substrings would be:
C
CA
CAT
A
AT
T
The total number of (nonempty) substrings is n + C(n,2). The leading n counts the number of substrings of length 1 and C(n,2) counts the number of substrings of length > 1 and is equal to the number of ways to choose 2 indices from the set of n. The standard formula for binomial coefficients yields C(n,2) = n*(n-1)/2. Combining these two terms and simplifying gives that the total number is (n^2 + n)/2. @rici in the comments notes that this is the same as C(n+1,2), which makes sense if you e.g. think in terms of Python string slicing, where substrings of s can always be written in the form s[i:j] where 0 <= i < j <= n (with j being 1 more than the final index). For n = 3 this works out to (9 + 3)/2 = 6.
In the sense of complexity theory the number of substrings is O(n^2), which might be what you read somewhere.
You have a start point and an end point. If each could point anywhere along the word, each would have n possible values, giving n^2 overall, so that's an upper limit.
However, we need a constraint saying that the substring cannot end before it starts, so end - start >= 0. This cuts the possible count roughly in half, but in asymptotic terms it's still O(n^2).
Logically, choosing a substring means selecting 2 of the blank spaces between letters, at least one letter apart.
a| b c | d = substring bc
| a b c |d = substring abc
Now, how many ways can you choose these 2 blank spaces? For an n-letter word there are n+1 blank spaces.
First select one: n+1 ways.
Then select another (not the same one): n ways.
So the total is n(n+1). But you have counted everything twice, so the answer is n*(n+1)/2.
Programmatically, without applying any special algorithms (like the Z algorithm), you can use a map to count the number of distinct substrings (O(n^3)).
You can use a suffix tree to get an O(n^2) substring calculation.
To get a substring of a given string s, you just need to select two different points in the string. Let s contain n characters,
|s[0]|s[1]|...|s[n-1]|
You want to choose two vertical bars to get a substring. How many vertical bars do you have? Exactly n+1. So the number of substrings is C(n+1,2) = n(n+1)/2, which is the number of ways to choose 2 items from n+1. Of course, this can be written as O(n^2).
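The count n(n+1)/2 can be checked with a short Python sketch that enumerates every slice s[i:j] with 0 <= i < j <= n. The helper name substrings is made up for this illustration.

```python
def substrings(s):
    """List every nonempty substring s[i:j] with 0 <= i < j <= len(s)."""
    n = len(s)
    return [s[i:j] for i in range(n) for j in range(i + 1, n + 1)]

subs = substrings("CAT")
print(subs)  # ['C', 'CA', 'CAT', 'A', 'AT', 'T']

n = len("CAT")
print(len(subs) == n * (n + 1) // 2)  # True
```

Note that this counts substrings by position, so a string with repeated letters yields duplicates in the list.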

Balanced Parentheses

I was wondering if anyone can tell me the number of results generated by the backtracking solution to the following problem:
Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses.
For example, given n = 3, a solution set is:
"((()))", "(()())", "(())()", "()(())", "()()()"
There is a related post in stackoverflow: Generate balanced parentheses in java
My question is whether there is a formula that gives the number of valid combinations of parentheses before I compute them.
for example:
- f(n):
- f(1) = 1
- f(2) = 2
- f(3) = 5
and so on..
Thank you.
The number of expressions containing n pairs of correctly matched parentheses can be calculated via the Catalan numbers. Quoting the relevant passage from Wikipedia:
There are many counting problems in combinatorics whose solution is given by the Catalan numbers … Cn is the number of Dyck words of length 2n. A Dyck word is a string consisting of n X's and n Y's such that no initial segment of the string has more Y's than X's … Re-interpreting the symbol X as an open parenthesis and Y as a close parenthesis, Cn counts the number of expressions containing n pairs of parentheses which are correctly matched.
The nth Catalan number is given directly in terms of binomial coefficients by: C_n = C(2n, n) / (n + 1) = (2n)! / ((n + 1)! n!)
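As a quick check against the values in the question, here is a small Python sketch. It uses the closed form C_n = C(2n, n) / (n + 1) and compares it with a straightforward backtracking generator; the names catalan and generate are chosen for this example only.

```python
from math import comb

def catalan(n):
    """n-th Catalan number via the binomial closed form."""
    return comb(2 * n, n) // (n + 1)

def generate(n):
    """Backtracking generation of all well-formed strings of n pairs."""
    out = []

    def build(prefix, opened, closed):
        if len(prefix) == 2 * n:
            out.append(prefix)
            return
        if opened < n:          # still allowed to open a parenthesis
            build(prefix + "(", opened + 1, closed)
        if closed < opened:     # only close what has been opened
            build(prefix + ")", opened, closed + 1)

    build("", 0, 0)
    return out

print([catalan(n) for n in range(1, 4)])  # [1, 2, 5]
print(len(generate(4)) == catalan(4))     # True
```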

Find most frequent combination of numbers in a set

This question is related to the following questions:
How to find most frequent combinations of numbers in a list
Most frequently occurring combinations
My problem is:
Scenario:
I have a set of numbers. Each combination is unique in this set, and each number within a combination appears only once:
Goal:
Find how frequently each combination (of size 2) appears in this set.
Example:
The frequency threshold is 2.
Set = {1,12,13,134,135,235,2345,12345}
The frequencies of the size-2 combinations are (showing all combinations that appear at least 2 times):
13 - appear 4 times
14 - appear 3 times
23 - appear 3 times
12 - appear 2 times
...
The time complexity of exhaustively searching all possible combinations grows exponentially.
Can anyone help me think of an algorithm that can solve this problem faster? (hash table, XOR, tree search....)
Thank you
PS.
Don't worry about the space complexity
Solution and conclusion:
templatetypedef's answer is good when the substring length is more than 3.
If the substring length is 2, btilly's answer is straightforward and easy to implement (and also performs well).
Here is pseudo-code whose running time should be O(n * m * m) where n is the size of the set, and m is the size of the things in that set:
let counts be a hash mapping a pair of characters to a count
foreach number N in list:
    foreach pair P of characters in N:
        if exists counts[P]:
            counts[P] = counts[P] + 1
        else:
            counts[P] = 1
let final be an array of (pair, count)
foreach P in keys of counts:
    if 1 < counts[P]:
        add (P, counts[P]) to final
sort final according to the counts
output final
@templatetypedef's answer is going to eventually be more efficient if you're looking for combinations of 3, 4, etc. characters. But this should be fine for the stated problem.
You can view this problem as a string problem: given a collection of strings, return all substrings of the collection that appear at least k times. Fortunately, there's a polynomial-time algorithm for this problem that uses generalized suffix trees.
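The pair-counting pseudocode above could look roughly like this in Python. This is a sketch under two assumptions: a "pair" means any two digits of a number taken in their written order (not only adjacent digits), and the name frequent_pairs and its threshold parameter are invented for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(numbers, threshold=2):
    """Count every 2-digit combination and keep those seen >= threshold times."""
    counts = Counter()
    for number in numbers:
        digits = str(number)
        # Each digit appears once per number, so every pair is counted once.
        counts.update(combinations(digits, 2))
    frequent = [(a + b, c) for (a, b), c in counts.items() if c >= threshold]
    # Most frequent pairs first.
    frequent.sort(key=lambda item: item[1], reverse=True)
    return frequent

for pair, count in frequent_pairs([1, 12, 13, 134, 135, 235, 2345, 12345]):
    print(pair, count)
```

With m digits per number there are m*(m-1)/2 pairs, which matches the O(n * m * m) bound stated for the pseudocode.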
Start by constructing a generalized suffix tree for the string representations of your numbers, which takes time linear in the number of digits across all numbers. Then, do a DFS and annotate each node with the number of leaf nodes in its subtree (equivalently, the number of times the string represented by the node appears in the input set), and in the course of doing so output each string discovered this way to appear at least k times. The runtime for this operation is O(d + z), where d is the number of total digits in the input and z is the total number of digits produced as output.
Hope this helps!

Finding a set of repeated, non-overlapping substrings of two input strings using suffix arrays

Input: two strings A and B.
Output: a set of repeated, non-overlapping substrings
I have to find all the repeated strings, each of which has to occur in both(!) strings at least once. So for instance, let
A = "xyabcxeeeyabczeee" and B = "yxabcxabee".
Then a valid output would be {"abcx","ab","ee"} but not "eee", since it occurs only in string A.
I think this problem is very related to the "supermaximal repeat" problem. Here is a definition:
Maximal repeated pair:
A pair of identical substrings alpha and beta in S such that extending alpha and beta in either direction would destroy the equality of the two strings. It is represented as a triplet (position1, position2, length).
Maximal repeat:
“A substring of S that occurs in a maximal pair in S”.
Example: abc in S = xabcyiiizabcqabcyrxar.
Note: There can be numerous maximal repeated pairs, but there can be only a limited number of maximal repeats.
Supermaximal repeat:
“A maximal repeat that never occurs as a substring of any other maximal repeat”
Example: abcy in S = xabcyiiizabcqabcyrxar.
An algorithm for finding all supermaximal repeats is described in "Algorithms on strings, trees and sequences", but only for suffix trees.
It works by:
1.) finding all left-diverse nodes using DFS:
For each position i in S, S(i-1) is called the left character of i.
The left character of a leaf in T(S) is the left character of the suffix position represented by that leaf.
An internal node v in T(S) is called left-diverse if at least two leaves in v's subtree have different left characters.
2.) applying theorem 7.12.4 to those nodes:
A left-diverse internal node v represents a supermaximal repeat a if and only if all of v's children are leaves, and each has a distinct left character.
Both strings A and B probably have to be concatenated, and when we check v's leaves in step two we also have to impose an additional constraint: there has to be at least one distinct left character from each of the strings A and B. This can be done by comparing their positions against the length of A. If position(left character) > length(A), then the left character is in B, else in A.
Can you help me solve this problem with suffix + lcp arrays?
It sounds like you are looking for the set intersection of all substrings of your two input strings. In that case, single-letter substrings should also be returned. Let s1 and s2 be your strings, with s1 the shorter of the two. After thinking about this for a while, I don't think you can do much better than the intuitive O(n^3m) or O(n^3) algorithm, where n is the length of s1 and m is the length of s2. I don't think suffix trees can help you here.
for(int i=0 to n-1){
    for(int j=1 to n-i){
        if(contains(s2,substring(s1,i,j))) emit the substring
    }
}
The runtime comes from the (n^2)/2 loop iterations, each doing a worst-case O(nm) contains operation (possibly O(n) depending on the implementation). But it's not really quite this bad, since there will be a constant much smaller than one out front, because the length of the substring actually ranges between 1 and n.
If you don't want single-character matches, you could just initialize j to 2 or something higher.
BTW: Don't actually create new strings with substring; find or create a contains function that takes indices and the original string and just looks at the characters between them, inclusive of i, exclusive of j.
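A minimal Python sketch of this brute-force intersection follows. The name common_substrings and the min_len parameter are assumptions for illustration. For brevity it uses slicing and the in operator, which do create temporary strings, so it ignores the index-based advice above. The early break is an addition of this sketch: once s1[i:j] is missing from s2, no longer substring starting at i can be present.

```python
def common_substrings(s1, s2, min_len=1):
    """Return the set of substrings of s1 (length >= min_len) that occur in s2."""
    found = set()
    for i in range(len(s1)):
        for j in range(i + min_len, len(s1) + 1):
            piece = s1[i:j]
            if piece in s2:
                found.add(piece)
            else:
                # Every longer substring starting at i contains this piece,
                # so it cannot occur in s2 either.
                break
    return found

A = "xyabcxeeeyabczeee"
B = "yxabcxabee"
common = common_substrings(A, B, min_len=2)
print("abcx" in common, "ee" in common, "eee" in common)  # True True False
```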

Optimal solution for non-overlapping maximum scoring sequences

While developing part of a simulator I came across the following problem. Consider a string of length N, and M substrings of this string with a non-negative score assigned to each of them. Of particular interest are the sets of substrings that meet the following requirements:
They do not overlap.
Their total score (by sum, for simplicity) is maximum.
They span the entire string.
I understand the naive brute-force solution is of O(M*N^2) complexity. While the implementation of this algorithm would probably not impose a lot on the performance of the whole project (nowhere near the critical path, can be precomputed, etc.), it really doesn't sit well with me.
I'd like to know if there are any better solutions to this problem and if so, which are they? Pointers to relevant code are always appreciated, but just algorithm description will do too.
This can be thought of as finding the longest path through a DAG. Each position in the string is a node and each substring match is an edge. You can trivially prove through induction that for any node on the optimal path the concatenation of the optimal path from the beginning to that node and from that node to the end is the same as the optimal path. Thanks to that you can just keep track of the optimal paths for each node and make sure you have visited all edges that end in a node before you start to consider paths containing it.
Then you just have the problem of finding all edges that start from a node, or all substrings that match at a given position. If you already know where the substring matches are, then it's as trivial as building a hash table. If you don't, you can still build a hash table if you use Rabin-Karp.
Note that with this you'll still visit all the edges in the DAG, for O(e) complexity. In other words, you'll have to consider once each substring match that's possible in a sequence of connected substrings from the start to the end. You could do better than this by preprocessing the substrings to find ways to rule out some matches. I doubt that any general-case complexity improvement is possible here, and any practical improvement depends heavily on your data distribution.
It is not clear whether the M substrings are given as sequences of characters or as indices into the input string, but the problem doesn't change much because of that.
Let us have an input string S of length N, and M input strings Sj. Let Lj be the length of Sj, and Pj the score given for string Sj. We say that string
This is called Dynamic Programming, or DP. You keep an array res of ints of length N, where the i-th element represents the best score obtainable using only the part of the string starting at the i-th character (for example, if the input is "abcd", then res[2] represents the best score you can get for "cd").
Then you iterate through this array from the end to the beginning and check whether string Sj can start at the i-th character. If it can, then a score of (res[i + Lj] + Pj) is clearly achievable. Iterating over all Sj, res[i] = max(res[i + Lj] + Pj) over all Sj that can be applied at the i-th character.
res[0] will be your final answer.
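The backward DP just described might be sketched in Python as follows. The function name best_cover_score and the dict-of-scores input are assumptions for this example; the array has one extra slot res[N] = 0 for the empty suffix, and None marks positions from which the rest of the string cannot be covered.

```python
def best_cover_score(text, scored):
    """scored maps each candidate substring to its non-negative score.

    Returns the best total score of a non-overlapping cover of text,
    or None when no cover exists.
    """
    n = len(text)
    res = [None] * (n + 1)
    res[n] = 0  # the empty suffix is covered with score 0
    for i in range(n - 1, -1, -1):
        for sub, score in scored.items():
            # sub must be nonempty and match text at position i
            if sub and text.startswith(sub, i):
                rest = res[i + len(sub)]
                if rest is not None and (res[i] is None or rest + score > res[i]):
                    res[i] = rest + score
    return res[0]

scores = {"ab": 3, "cd": 4, "abc": 10, "d": 1, "a": 1, "bcd": 5}
print(best_cover_score("abcd", scores))     # 11, from "abc" + "d"
print(best_cover_score("abcd", {"ab": 3}))  # None, "cd" cannot be covered
```

This simple version tries every candidate at every position; a hash table or Rabin-Karp lookup, as mentioned earlier, would reduce that matching work.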
inputs:
    N, the number of chars in the string
    e[0..N-1]: (b,c) being an element of the set e[a] means [a,b) is a substring with score c.
    (If all substrings are possible, then you could just have c(a,b).)
    By e.g. [1,2) we mean the substring covering the 2nd letter of the string (half-open interval).
    (Empty substrings are not allowed; if they were, then you could handle them properly only if you allow them to be "taken" at most k times.)
outputs:
    s[i] is the score of the best substring covering of [0,i)
    a[i]: [a[i],i) is the last substring used to cover [0,i); else NULL
Algorithm - O(N^2) if the intervals e are not sparse; O(N+E) where E is the total number of allowed intervals. This is effectively finding the best path through an acyclic graph:
for i = 0 to N:
    a[i] <- NULL
    s[i] <- 0
a[0] <- 0
for i = 0 to N-1:
    if a[i] != NULL:
        for (b,c) in e[i]:
            sib <- s[i]+c
            # accept the first covering of [0,b) even if its score is 0
            if a[b] == NULL or sib > s[b]:
                a[b] <- i
                s[b] <- sib
To yield the best covering triples (a,b,c) where the cost of [a,b) is c:
i <- N
if (a[i] == NULL):
    error "no covering"
# walk back until the start of the string is reached
while (i != 0):
    from <- a[i]
    yield (from, i, s[i]-s[from])
    i <- from
Of course, you could store the pair (sib,c) in s[b] and save the subtraction.
O(N+M) solution:
Set f[1..N] = -1
Set f[0] = 0
for a = 0 to N-1
    if f[a] >= 0
        For each substring beginning at a
            Let b be the last index of the substring, and c its score
            If f[a]+c > f[b+1]
                Set f[b+1] = f[a]+c
                Set g[b+1] = [substring number]
Now f[N] contains the answer, or -1 if no set of substrings spans the string.
To get the substrings:
b = N
while b > 0
    Get substring number from g[b]
    Output substring number
    b = b - (length of substring)
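For comparison, here is a hedged Python sketch of the forward O(N+M) pass with reconstruction. It assumes the substrings are given as half-open index triples (start, end, score) with start < end, rather than by last index, and the name best_cover is invented for this example.

```python
def best_cover(n, intervals):
    """intervals: list of (start, end, score), half-open [start, end).

    Returns (best_score, indices of chosen intervals in left-to-right order),
    or (None, []) if no set of intervals spans [0, n).
    """
    # Bucket the intervals by start position.
    starts = [[] for _ in range(n)]
    for idx, (a, b, c) in enumerate(intervals):
        starts[a].append((b, c, idx))

    f = [-1] * (n + 1)    # f[i]: best score covering [0, i), -1 if impossible
    g = [None] * (n + 1)  # g[i]: index of the last interval used to reach i
    f[0] = 0
    for a in range(n):
        if f[a] < 0:
            continue      # position a is not reachable by any covering
        for b, c, idx in starts[a]:
            if f[a] + c > f[b]:
                f[b] = f[a] + c
                g[b] = idx

    if f[n] < 0:
        return None, []
    chosen = []
    b = n
    while b > 0:
        idx = g[b]
        chosen.append(idx)
        b = intervals[idx][0]  # jump to the start of the chosen interval
    chosen.reverse()
    return f[n], chosen

intervals = [(0, 2, 3), (2, 4, 4), (0, 3, 10), (3, 4, 1), (0, 1, 1), (1, 4, 5)]
print(best_cover(4, intervals))  # (11, [2, 3])
```

Using -1 as the "unreachable" marker relies on the scores being non-negative, as stated in the question.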

Resources