Recursive solution to common longest substring between two strings - algorithm

I am trying to return the length of the longest common substring between two strings. I'm well aware of the DP solution; however, I want to be able to solve this recursively just for practice.
I have the solution to find the longest common subsequence...
def get_substring(str1, str2, i, j):
    if i == 0 or j == 0:
        return 0
    elif str1[i-1] == str2[j-1]:
        return 1 + get_substring(str1, str2, i-1, j-1)
    else:
        return max(get_substring(str1, str2, i, j-1), get_substring(str1, str2, i-1, j))
However, I need the longest common substring, not the longest common sequence of letters. I tried altering my code in a couple of ways, one being changing the base case to...
    if i == 0 or j == 0 or str1[i-1] != str2[j-1]:
        return 0
But that did not work, and neither did any of my other attempts.
For example, for the following strings...
X = "AGGTAB"
Y = "BAGGTXAYB"
print(get_substring(X, Y, len(X), len(Y)))
The longest common substring is AGGT, so the expected result is 4.
My recursive skills are not the greatest, so if anybody can help me out that would be very helpful.

package algo.dynamic;
public class LongestCommonSubstring {
    public static void main(String[] args) {
        String a = "AGGTAB";
        String b = "BAGGTXAYB";
        int maxLcs = lcs(a.toCharArray(), b.toCharArray(), a.length(), b.length(), 0);
        System.out.println(maxLcs);
    }

    private static int lcs(char[] a, char[] b, int i, int j, int count) {
        if (i == 0 || j == 0)
            return count;
        if (a[i - 1] == b[j - 1]) {
            count = lcs(a, b, i - 1, j - 1, count + 1);
        }
        count = Math.max(count, Math.max(lcs(a, b, i, j - 1, 0), lcs(a, b, i - 1, j, 0)));
        return count;
    }
}
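Since the question is in Python, here is a rough Python translation of the same idea, where a running count of consecutive matches is carried through the recursion (my sketch, not part of the original answer):
def longest_common_substring_len(a, b, i, j, count=0):
    # count carries the length of the current run of matching characters.
    if i == 0 or j == 0:
        return count
    if a[i - 1] == b[j - 1]:
        count = longest_common_substring_len(a, b, i - 1, j - 1, count + 1)
    return max(count,
               longest_common_substring_len(a, b, i, j - 1, 0),
               longest_common_substring_len(a, b, i - 1, j, 0))

X = "AGGTAB"
Y = "BAGGTXAYB"
print(longest_common_substring_len(X, Y, len(X), len(Y)))  # 4, the length of "AGGT"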

You need to recurse on each possibility separately, which is easier to do if you have multiple recursive functions.
def longest_common_substr_at_both_start(str1, str2):
    if 0 == len(str1) or 0 == len(str2) or str1[0] != str2[0]:
        return ''
    else:
        return str1[0] + longest_common_substr_at_both_start(str1[1:], str2[1:])

def longest_common_substr_at_first_start(str1, str2):
    if 0 == len(str2):
        return ''
    else:
        answer1 = longest_common_substr_at_both_start(str1, str2)
        answer2 = longest_common_substr_at_first_start(str1, str2[1:])
        return answer2 if len(answer1) < len(answer2) else answer1

def longest_common_substr(str1, str2):
    if 0 == len(str1):
        return ''
    else:
        answer1 = longest_common_substr_at_first_start(str1, str2)
        answer2 = longest_common_substr(str1[1:], str2)
        return answer2 if len(answer1) < len(answer2) else answer1

print(longest_common_substr("BAGGTXAYB", "AGGTAB"))

I am so sorry. I didn't have time to convert this into a recursive function, but it was relatively straightforward to compose. If Python had a fold function, writing the recursive version would be much easier; 90% of recursive functions are primitive, which is why fold is so valuable.
I hope the logic in this can help with a recursive version.
(x,y)= "AGGTAB","BAGGTXAYB"
xrng= range(len(x)) # it is used twice
np=[(a+1,a+2) for a in xrng] # make pairs of list index values to use
allx = [ x[i:i+b] for (a,b) in np for i in xrng[:-a]] # make list of len>1 combinations
[ c for i in range(len(y)) for c in allx if c == y[i:i+len(c)]] # run, matching x & y
...producing this list from which to take the longest of the matches
['AG', 'AGG', 'AGGT', 'GG', 'GGT', 'GT']
I didn't realize getting the longest match from the list would be a little involved.
ls= ['AG', 'AGG', 'AGGT', 'GG', 'GGT', 'GT']
ml= max([len(x) for x in ls])
ls[[a for (a,b) in zip(range(len(ls)),[len(x) for x in ls]) if b == ml][0]]
"AGGT"

Related

Calling Functions repeatedly in Mathematica

I have the following Python code that I would like to convert to Mathematica. But I totally get stuck on how to define the functions shouldSwap and findPermutations in Mathematica and then call findPermutations for index + 1. How do I define such functions in Mathematica? Thank you very much for your help!
My python code:
def shouldSwap(string, start, curr):
    for j in range(strat, curr):
        if string[j] == string[curr]:
            return 0
    return 1

def findPermutations(string, index, n):
    if index >= n:
        print(''.join(String))
        return
    for i in range(index, n):
        check = shouldSwap(string index, i)
        if check:
            string[index], string[i] = string[i], string[index]
            findPermutations(string, index + 1, n)
            string[index], string[i] = string[i], string[index]

if __name__ == "__main__":
    string = list("ABCA")
    n = len(string)
    findPermutations(string, 0, n)
My hopelessly failed attempt to convert it to Mathematica:
string = List[A, B, C, A]
n = Length[string]
index = 0;
If[index >= n, string, Return]
If[index < n,
  For[i = index, i < n, i + 1,
    If[
      For[j = index, j < i, j++,
        If[string[j] == string[i], Return,
          Swap[string[index], string[i]] and
          How do I do this??? findPermutations(string, index + 1, n) and
          Swap[string[index], string[i]]
        ]
      ]
    ]
  ]
]
There are several syntax errors in the Python code. Is there a reason for wanting to transliterate it? The WL way to do it:
"ABCA" // Characters // Permutations // Map[StringJoin]
{"ABCA", "ABAC", "ACBA", "ACAB", "AABC", "AACB", "BACA", "BAAC", "BCAA", "CABA", "CAAB", "CBAA"}

Dynamic Programming for shortest subsequence that is not a subsequence of two strings

Problem: Given two sequences s1 and s2 of '0' and '1', return the shortest sequence that is a subsequence of neither of the two sequences.
E.g. s1 = '011' s2 = '1101' Return s_out = '00' as one possible result.
Note that substring and subsequence are different: in a substring the characters are contiguous, but in a subsequence that need not be the case.
My question: How is dynamic programming applied in the "Solution Provided" below and what is its time complexity?
My attempt involves computing all the subsequences for each string, giving sub1 and sub2. Append a '1' or a '0' to each element of sub1 and determine if that new subsequence is not present in sub2. Find the minimum length one. Here is my code:
My Solution
def get_subsequences(seq, index, subs, result):
    if index == len(seq):
        if subs:
            result.add(''.join(subs))
    else:
        get_subsequences(seq, index + 1, subs, result)
        get_subsequences(seq, index + 1, subs + [seq[index]], result)

def get_bad_subseq(subseq):
    min_sub = ''
    length = float('inf')
    for sub in subseq:
        for char in ['0', '1']:
            if len(sub) + 1 < length and sub + char not in subseq:
                length = len(sub) + 1
                min_sub = sub + char
    return min_sub
Solution Provided (not mine)
How does it work, and what is its time complexity?
The solution below looks similar to the one at: http://kyopro.hateblo.jp/entry/2018/12/11/100507
def set_nxt(s, nxt):
    n = len(s)
    idx_0 = n + 1
    idx_1 = n + 1
    for i in range(n, 0, -1):
        nxt[i][0] = idx_0
        nxt[i][1] = idx_1
        if s[i-1] == '0':
            idx_0 = i
        else:
            idx_1 = i
    nxt[0][0] = idx_0
    nxt[0][1] = idx_1

def get_shortest(seq1, seq2):
    len_seq1 = len(seq1)
    len_seq2 = len(seq2)
    nxt_seq1 = [[len_seq1 + 1 for _ in range(2)] for _ in range(len_seq1 + 2)]
    nxt_seq2 = [[len_seq2 + 1 for _ in range(2)] for _ in range(len_seq2 + 2)]
    set_nxt(seq1, nxt_seq1)
    set_nxt(seq2, nxt_seq2)
    INF = 2 * max(len_seq1, len_seq2)
    dp = [[INF for _ in range(len_seq2 + 2)] for _ in range(len_seq1 + 2)]
    dp[len_seq1 + 1][len_seq2 + 1] = 0
    for i in range(len_seq1 + 1, -1, -1):
        for j in range(len_seq2 + 1, -1, -1):
            for k in range(2):
                if dp[nxt_seq1[i][k]][nxt_seq2[j][k]] < INF:
                    dp[i][j] = min(dp[i][j], dp[nxt_seq1[i][k]][nxt_seq2[j][k]] + 1)
    res = ""
    i = 0
    j = 0
    while i <= len_seq1 or j <= len_seq2:
        for k in range(2):
            if dp[i][j] == dp[nxt_seq1[i][k]][nxt_seq2[j][k]] + 1:
                i = nxt_seq1[i][k]
                j = nxt_seq2[j][k]
                res += str(k)
                break
    return res
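To sanity-check the provided solution on the question's example (this line is my addition, not part of the original post; '00' is the expected shortest answer here):
print(get_shortest('011', '1101'))  # '00'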
I am not going to work it through in detail, but the idea of this solution is to create a 2-D array of every combination of positions in the one sequence and the other. It then populates this array with information about the shortest sequences it finds that force you that far.
Just constructing that array takes space (and therefore time) O(len(seq1) * len(seq2)). Filling it in takes a similar time.
This is done with lots of bit twiddling that I don't want to track.
I have another approach that is clearer to me that usually takes less space and less time, but in the worst case could be as bad. But I have not coded it up.
UPDATE:
Here it is, all coded up, with poor choices of variable names. Sorry about that.
# A trivial data class to hold a linked list for the candidate subsequences
# along with information about how far they match in the two sequences.
import collections

SubSeqLinkedList = collections.namedtuple('SubSeqLinkedList', 'value pos1 pos2 tail')

# This finds the position after the first match. No match is treated as off the end of seq.
def find_position_after_first_match(seq, start, value):
    while start < len(seq) and seq[start] != value:
        start += 1
    return start + 1

def make_longer_subsequence(subseq, value, seq1, seq2):
    pos1 = find_position_after_first_match(seq1, subseq.pos1, value)
    pos2 = find_position_after_first_match(seq2, subseq.pos2, value)
    gotcha = SubSeqLinkedList(value=value, pos1=pos1, pos2=pos2, tail=subseq)
    return gotcha

def minimal_nonsubseq(seq1, seq2):
    # We start with one candidate for how to start the subsequence.
    # Namely an empty subsequence. Length 0, matches before the first character.
    candidates = [SubSeqLinkedList(value=None, pos1=0, pos2=0, tail=None)]
    # Now we try to replace candidates with longer maximal ones - nothing of
    # the same length is better at going farther in both sequences.
    # We keep this list ordered by descending how far it goes in sequence1.
    while candidates[0].pos1 <= len(seq1) or candidates[0].pos2 <= len(seq2):
        new_candidates = []
        for candidate in candidates:
            candidate1 = make_longer_subsequence(candidate, '0', seq1, seq2)
            candidate2 = make_longer_subsequence(candidate, '1', seq1, seq2)
            if candidate1.pos1 < candidate2.pos1:
                # Swap them.
                candidate1, candidate2 = candidate2, candidate1
            for c in (candidate1, candidate2):
                if 0 == len(new_candidates):
                    new_candidates.append(c)
                elif new_candidates[-1].pos1 <= c.pos1 and new_candidates[-1].pos2 <= c.pos2:
                    # We have found strictly better.
                    new_candidates[-1] = c
                elif new_candidates[-1].pos2 < c.pos2:
                    # Note, by construction we cannot be shorter in pos1.
                    new_candidates.append(c)
        # And now we throw away the ones we don't want.
        # Those that are on their way to a solution will be captured in the linked list.
        candidates = new_candidates
    answer = candidates[0]
    r_seq = []  # This winds up reversed.
    while answer.value is not None:
        r_seq.append(answer.value)
        answer = answer.tail
    return ''.join(reversed(r_seq))
print(minimal_nonsubseq('011', '1101'))

Divide two strings to form palindrome

Given two strings, A and B, of equal length, find whether it is possible to cut both strings at a common point such that the first part of A and the second part of B form a palindrome.
I've tried brute force; this can be achieved in O(N^2). I'm looking for any kind of optimization. I'm not familiar with backtracking and DP, so can anyone shed some light on whether I should think along those lines?
Here is a possible solution, assuming that we cut the two strings at a common point. It runs in linear time with respect to the string length, i.e. in O(n).
// Palindrome function
function is_pal(str) {
    str_len = len(str)
    result = true
    for i from 0 to 1 + str_len / 2 {
        if str[i] != str[str_len - 1 - i] then {
            result = false
            break
        }
    }
    return result
}

// first phase: iterate on both strings
function solve_pb(A, B) {
    str_len = len(A)
    idx = 0
    while A[idx] == B[str_len - idx - 1] {
        idx += 1
    }
    if idx >= str_len / 2 {
        return str_len / 2
    } else if is_pal(A[idx + 1 ... str_len - idx - 2]) {
        return str_len - idx - 2
    } else if is_pal(B[idx + 1 ... str_len - idx - 2]) {
        return idx
    } else {
        return -1 // no solution possible
    }
}
The principle is the following:
First, we iterate on A, and reverse iterate on B, as long as they are 'symmetric'.
A: aaabcaabb ............ // ->
B: ............ bbaacbaaa // <-
If the strings are symmetric until their respective middle, then the solution is trivial. Otherwise we check whether the 'middle portion' of A or B is itself a palindrome. If it is, we have a solution; otherwise we do not have a solution.
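A compact Python sketch of the same two-pointer idea (my own translation of the pseudocode above; it returns whether a valid cut exists rather than the cut position):
def is_pal(s):
    return s == s[::-1]

def can_form_palindrome(a, b):
    # Match a prefix of A against the mirrored suffix of B; then the leftover
    # middle must be a palindrome entirely inside A or entirely inside B.
    n = len(a)
    i = 0
    while i < n // 2 and a[i] == b[n - 1 - i]:
        i += 1
    return is_pal(a[i:n - i]) or is_pal(b[i:n - i])

print(can_form_palindrome("ulacfd", "jizalu"))  # True: "ula" + "alu" -> "ulaalu"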

DNA subsequence dynamic programming question

I'm trying to solve a DNA problem which is more of an improved(?) version of the LCS problem.
In the problem, there is a string and a semi-substring, which allows part of the string to have one or no letter skipped. For example, the string "desktop" has the semi-substrings {"destop", "dek", "stop", "skop", "desk", "top"}, all of which have one or no letter skipped.
Now, I am given two DNA strings consisting of {a, t, g, c}. I'm trying to find the longest common semi-substring, LSS, and if there is more than one LSS, print out the one that comes first.
For example, the two DNAs {attgcgtagcaatg, tctcaggtcgatagtgac} print out "tctagcaatg",
and aaaattttcccc, cccgggggaatatca print out "aattc".
I'm trying to use the common LCS algorithm but cannot solve it with tables, although I did solve the version with no letter skipped. Any advice?
This is a variation on the dynamic programming solution for LCS, written in Python.
First I'm building up a Suffix Tree for all the substrings that can be made from each string with the skip rule. Then I'm intersecting the suffix trees. Then I'm looking for the longest string that can be made from that intersection tree.
Please note that this is technically O(n^2). Its worst case is when both strings are the same character, repeated over and over again, because you wind up with a lot of what logically is something like, "an 'l' at position 42 in the one string could have matched against the 'l' at position 54 in the other". But in practice it will be O(n).
def find_subtree(text, max_skip=1):
    tree = {}
    tree_at_position = {}

    def subtree_from_position(position):
        if position not in tree_at_position:
            this_tree = {}
            if position < len(text):
                char = text[position]
                # Make sure that we've populated the further tree.
                subtree_from_position(position + 1)
                # If this char appeared later, include those possible matches.
                if char in tree:
                    for char2, subtree in tree[char].iteritems():
                        this_tree[char2] = subtree
                # And now update the new choices.
                for skip in range(max_skip + 1, 0, -1):
                    if position + skip < len(text):
                        this_tree[text[position + skip]] = subtree_from_position(position + skip)
                tree[char] = this_tree
            tree_at_position[position] = this_tree
        return tree_at_position[position]

    subtree_from_position(0)
    return tree

def find_longest_common_semistring(text1, text2):
    tree1 = find_subtree(text1)
    tree2 = find_subtree(text2)
    answered = {}

    def find_intersection(subtree1, subtree2):
        unique = (id(subtree1), id(subtree2))
        if unique not in answered:
            answer = {}
            for k, v in subtree1.iteritems():
                if k in subtree2:
                    answer[k] = find_intersection(v, subtree2[k])
            answered[unique] = answer
        return answered[unique]

    found_longest = {}

    def find_longest(tree):
        if id(tree) not in found_longest:
            best_candidate = ''
            for char, subtree in tree.iteritems():
                candidate = char + find_longest(subtree)
                if len(best_candidate) < len(candidate):
                    best_candidate = candidate
            found_longest[id(tree)] = best_candidate
        return found_longest[id(tree)]

    intersection_tree = find_intersection(tree1, tree2)
    return find_longest(intersection_tree)

print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))
Let g(c, rs, rt) represent the longest common semi-substring of strings S and T, ending at rs and rt, where rs and rt are the ranked occurrences of the character c in S and T, respectively, and K is the number of skips allowed. Then we can form a recursion which we would be obliged to perform on all pairs of occurrences of c in S and T.
JavaScript code:
function f(S, T, K){
  // mapS maps a char to indexes of its occurrences in S
  // rsS maps the index in S to that char's rank (index) in mapS
  const [mapS, rsS] = mapString(S)
  const [mapT, rsT] = mapString(T)
  // h is used to memoize g
  const h = {}

  function g(c, rs, rt){
    if (rs < 0 || rt < 0)
      return 0
    if (h.hasOwnProperty([c, rs, rt]))
      return h[[c, rs, rt]]
    // (We are guaranteed to be on
    // a match in this state.)
    let best = [1, c]
    let idxS = mapS[c][rs]
    let idxT = mapT[c][rt]
    if (idxS == 0 || idxT == 0)
      return best
    for (let i=idxS-1; i>=Math.max(0, idxS - 1 - K); i--){
      for (let j=idxT-1; j>=Math.max(0, idxT - 1 - K); j--){
        if (S[i] == T[j]){
          const [len, str] = g(S[i], rsS[i], rsT[j])
          if (len + 1 >= best[0])
            best = [len + 1, str + c]
        }
      }
    }
    return h[[c, rs, rt]] = best
  }

  let best = [0, '']
  for (let c of Object.keys(mapS)){
    for (let i=0; i<(mapS[c]||[]).length; i++){
      for (let j=0; j<(mapT[c]||[]).length; j++){
        let [len, str] = g(c, i, j)
        if (len > best[0])
          best = [len, str]
      }
    }
  }
  return best
}

function mapString(s){
  let map = {}
  let rs = []
  for (let i=0; i<s.length; i++){
    if (!map[s[i]]){
      map[s[i]] = [i]
      rs.push(0)
    } else {
      map[s[i]].push(i)
      rs.push(map[s[i]].length - 1)
    }
  }
  return [map, rs]
}

console.log(f('attgcgtagcaatg', 'tctcaggtcgatagtgac', 1))
console.log(f('aaaattttcccc', 'cccgggggaatatca', 1))
console.log(f('abcade', 'axe', 1))

Remove consecutive duplicates in a string to make the smallest string

Given a string and the constraint of matching on >= 3 characters, how can you ensure that the result string will be as small as possible?
Edit, with Gassa's requested explicitness:
E.G.
'AAAABBBAC'
If I remove the B's first,
AAAA[BBB]AC -- > AAAAAC, then I can remove all of the A's from the resultant string and be left with:
[AAAAA]C --> C
'C'
If I just remove what is available first (the sequence of A's), I get:
[AAAA]BBBAC -- > [BBB]AC --> AC
'AC'
A tree would definitely get you the shortest string(s).
The tree solution:
Define a State (node) for each current string Input and all its removable sub-strings' int[] Indexes.
Create the tree: For each int index create another State and add it to the parent state State[] Children.
A State with no possible removable sub-strings has no children Children = null.
Get all Descendants State[] of your root State. Order them by their shortest string Input. And that is/are your answer(s).
Test cases:
string result = FindShortest("AAAABBBAC"); // AC
string result2 = FindShortest("AABBAAAC"); // AABBC
string result3 = FindShortest("BAABCCCBBA"); // B
The Code:
Note: of course everyone is welcome to enhance the following code in terms of performance and/or fix any bugs.
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        string result = FindShortest("AAAABBBAC"); // AC
        string result2 = FindShortest("AABBAAAC"); // AABBC
        string result3 = FindShortest("BAABCCCBBA"); // B
    }

    // finds the FIRST shortest string for a given input
    private static string FindShortest(string input)
    {
        // all possible removable strings' indexes
        // for this given input
        int[] indexes = RemovableIndexes(input);
        // each input string and its possible removables are a state
        var state = new State { Input = input, Indexes = indexes };
        // create the tree
        GetChildren(state);
        // get the FIRST shortest
        // i.e. there would be more than one answer sometimes
        // this could be easily changed to get all possible results
        var result =
            Descendants(state)
                .Where(d => d.Children == null || d.Children.Length == 0)
                .OrderBy(d => d.Input.Length)
                .FirstOrDefault().Input;
        return result;
    }

    // simple get all descendants of a node/state in a tree
    private static IEnumerable<State> Descendants(State root)
    {
        var states = new Stack<State>(new[] { root });
        while (states.Any())
        {
            State node = states.Pop();
            yield return node;
            if (node.Children != null)
                foreach (var n in node.Children) states.Push(n);
        }
    }

    // creates the tree
    private static void GetChildren(State state)
    {
        // for each index there is a child
        state.Children = state.Indexes.Select(
            i =>
            {
                var input = RemoveAllAt(state.Input, i);
                return input.Length < state.Input.Length && input.Length > 0
                    ? new State
                    {
                        Input = input,
                        Indexes = RemovableIndexes(input)
                    }
                    : null;
            }).ToArray();
        foreach (var c in state.Children)
            GetChildren(c);
    }

    // find all possible removable strings' indexes
    private static int[] RemovableIndexes(string input)
    {
        var indexes = new List<int>();
        char d = input[0];
        int count = 1;
        for (int i = 1; i < input.Length; i++)
        {
            if (d == input[i])
                count++;
            else
            {
                if (count >= 3)
                    indexes.Add(i - count);
                // reset
                d = input[i];
                count = 1;
            }
        }
        if (count >= 3)
            indexes.Add(input.Length - count);
        return indexes.ToArray();
    }

    // remove all duplicate chars starting from an index
    private static string RemoveAllAt(string input, int startIndex)
    {
        string part1, part2;
        int endIndex = startIndex + 1;
        int i = endIndex;
        for (; i < input.Length; i++)
            if (input[i] != input[startIndex])
            {
                endIndex = i;
                break;
            }
        if (i == input.Length && input[i - 1] == input[startIndex])
            endIndex = input.Length;
        part1 = startIndex > 0 ? input.Substring(0, startIndex) : string.Empty;
        part2 = endIndex <= (input.Length - 1) ? input.Substring(endIndex) : string.Empty;
        return part1 + part2;
    }

    // our node, which is
    // an input string &
    // all possible removable strings' indexes
    // & its children
    public class State
    {
        public string Input;
        public int[] Indexes;
        public State[] Children;
    }
}
I propose an O(n^2) solution with dynamic programming.
Let's introduce notation: the prefix and suffix of length l of string A are denoted by P[l] and S[l], and we call our procedure Rcd.
Rcd(A) = Rcd(Rcd(P[n-1])+S[1])
Rcd(A) = Rcd(P[1]+Rcd(S[n-1]))
Note that the outer Rcd in the RHS is trivial. So that's our optimal substructure. Based on this I came up with the following implementation:
#include <iostream>
#include <string>
#include <vector>
#include <cassert>

using namespace std;

string remdupright(string s, bool allowEmpty) {
    if (s.size() >= 3) {
        auto pos = s.find_last_not_of(s.back());
        if (pos == string::npos && allowEmpty) s = "";
        else if (pos != string::npos && s.size() - pos > 3) s = s.substr(0, pos + 1);
    }
    return s;
}

string remdupleft(string s, bool allowEmpty) {
    if (s.size() >= 3) {
        auto pos = s.find_first_not_of(s.front());
        if (pos == string::npos && allowEmpty) s = "";
        else if (pos != string::npos && pos >= 3) s = s.substr(pos);
    }
    return s;
}

string remdup(string s, bool allowEmpty) {
    return remdupleft(remdupright(s, allowEmpty), allowEmpty);
}

string run(const string in) {
    vector<vector<string>> table(in.size());
    for (int i = 0; i < (int)table.size(); ++i) {
        table[i].resize(in.size() - i);
    }
    for (int i = 0; i < (int)table[0].size(); ++i) {
        table[0][i] = in.substr(i, 1);
    }
    for (int len = 2; len <= (int)table.size(); ++len) {
        for (int pos = 0; pos < (int)in.size() - len + 1; ++pos) {
            string base(table[len - 2][pos]);
            const char suffix = in[pos + len - 1];
            if (base.size() && suffix != base.back()) {
                base = remdupright(base, false);
            }
            const string opt1 = base + suffix;
            base = table[len - 2][pos + 1];
            const char prefix = in[pos];
            if (base.size() && prefix != base.front()) {
                base = remdupleft(base, false);
            }
            const string opt2 = prefix + base;
            const string nodupopt1 = remdup(opt1, true);
            const string nodupopt2 = remdup(opt2, true);
            table[len - 1][pos] = nodupopt1.size() > nodupopt2.size() ? opt2 : opt1;
            assert(nodupopt1.size() != nodupopt2.size() || nodupopt1 == nodupopt2);
        }
    }
    string& res = table[in.size() - 1][0];
    return remdup(res, true);
}

void testRcd(string s, string expected) {
    cout << s << " : " << run(s) << ", expected: " << expected << endl;
}

int main()
{
    testRcd("BAABCCCBBA", "B");
    testRcd("AABBAAAC", "AABBC");
    testRcd("AAAA", "");
    testRcd("AAAABBBAC", "C");
}
Clearly we are not concerned about any block of repeated characters longer than 2 characters. And there is only one way that two blocks of the same character, where at least one of the blocks is less than 3 in length, can be combined: namely, if the sequence between them can be removed.
So (1) look at pairs of blocks of the same character where at least one is less than 3 in length, and (2) determine if the sequence between them can be removed.
We want to decide which pairs to join so as to minimize the total length of blocks less than 3 characters long. (Note that the number of pairs is bound by the size (and distribution) of the alphabet.)
Let f(b) represent the minimal total length of same-character blocks remaining up to the block b that are less than 3 characters in length. Then:
f(b):
  p1 <- previous block of the same character

  if b and p1 can combine:
    if b.length + p1.length > 2:
      f(b) = min(
        // don't combine
        (0 if b.length > 2 else b.length) + f(block before b),
        // combine
        f(block before p1)
      )

    // b.length + p1.length < 3
    else:
      p2 <- block previous to p1 of the same character

      if p1 and p2 can combine:
        f(b) = min(
          // don't combine
          b.length + f(block before b),
          // combine
          f(block before p2)
        )
      else:
        f(b) = b.length + f(block before b)

  // b and p1 cannot combine
  else:
    f(b) = b.length + f(block before b)

  for all p1 before b
The question is how can we efficiently determine if a block can be combined with the previous block of the same character (aside from the obvious recursion into the sub-block-list between the two blocks).
Python code:
import random
import time

def parse(length):
    return length if length < 3 else 0

def f(string):
    chars = {}
    blocks = [[string[0], 1, 0]]
    chars[string[0]] = {'indexes': [0]}
    chars[string[0]][0] = {'prev': -1}
    p = 0  # pointer to current block
    for i in xrange(1, len(string)):
        if blocks[len(blocks) - 1][0] == string[i]:
            blocks[len(blocks) - 1][1] += 1
        else:
            p += 1
            # [char, length, index, f(i), temp]
            blocks.append([string[i], 1, p])
            if string[i] in chars:
                chars[string[i]][p] = {'prev': chars[string[i]]['indexes'][ len(chars[string[i]]['indexes']) - 1 ]}
                chars[string[i]]['indexes'].append(p)
            else:
                chars[string[i]] = {'indexes': [p]}
                chars[string[i]][p] = {'prev': -1}

    #print blocks
    #print
    #print chars
    #print

    memo = [[None for j in xrange(len(blocks))] for i in xrange(len(blocks))]

    def g(l, r, top_level=False):
        #print "(l, r): (%s, %s)" % (l,r)
        if l == r:
            return parse(blocks[l][1])
        if memo[l][r]:
            return memo[l][r]

        result = [parse(blocks[l][1])] + [None for k in xrange(r - l)]

        if l < r:
            for i in xrange(l + 1, r + 1):
                result[i - l] = parse(blocks[i][1]) + result[i - l - 1]

        for i in xrange(l, r + 1):
            #print "\ni: %s" % i
            [char, length, index] = blocks[i]

            #p1 <- previous block of the same character
            p1_idx = chars[char][index]['prev']

            #print "(p1_idx, l, p1_idx >= l): (%s, %s, %s)" % (p1_idx, l, p1_idx >= l)
            if p1_idx < l and index > l:
                result[index - l] = parse(length) + result[index - l - 1]

            while p1_idx >= l:
                p1 = blocks[p1_idx]

                #print "(b, p1, p1_idx, l): (%s, %s, %s, %s)\n" % (blocks[i], p1, p1_idx, l)
                between = g(p1[2] + 1, index - 1)

                #print "between: %s" % between
                #if b and p1 can combine:
                if between == 0:
                    if length + p1[1] > 2:
                        result[index - l] = min(
                            result[index - l],
                            # don't combine
                            parse(length) + (result[index - l - 1] if index - l > 0 else 0),
                            # combine: f(block before p1)
                            result[p1[2] - l - 1] if p1[2] > l else 0
                        )
                    # b.length + p1.length < 3
                    else:
                        #p2 <- block previous to p1 of the same character
                        p2_idx = chars[char][p1[2]]['prev']

                        if p2_idx < l:
                            p1_idx = chars[char][p1_idx]['prev']
                            continue

                        between2 = g(p2_idx + 1, p1[2] - 1)

                        #if p1 and p2 can combine:
                        if between2 == 0:
                            result[index - l] = min(
                                result[index - l],
                                # don't combine
                                parse(length) + (result[index - l - 1] if index - l > 0 else 0),
                                # combine the block, p1 and p2
                                result[p2_idx - l - 1] if p2_idx - l > 0 else 0
                            )
                        else:
                            #f(b) = b.length + f(block before b)
                            result[index - l] = min(
                                result[index - l],
                                parse(length) + (result[index - l - 1] if index - l > 0 else 0)
                            )
                # b and p1 cannot combine
                else:
                    #f(b) = b.length + f(block before b)
                    result[index - l] = min(
                        result[index - l],
                        parse(length) + (result[index - l - 1] if index - l > 0 else 0)
                    )

                p1_idx = chars[char][p1_idx]['prev']

        #print l,r,result
        memo[l][r] = result[r - l]

        """if top_level:
            return (result, blocks)
        else:"""
        return result[r - l]

    if len(blocks) == 1:
        return ([parse(blocks[0][1])], blocks)
    else:
        return g(0, len(blocks) - 1, True)
"""s = ""
for i in xrange(300):
s = s + ['A','B','C'][random.randint(0,2)]"""
print f("abcccbcccbacccab") # b
print
print f("AAAABBBAC"); # C
print
print f("CAAAABBBA"); # C
print
print f("AABBAAAC"); # AABBC
print
print f("BAABCCCBBA"); # B
print
print f("aaaa")
print
The string answers for these longer examples were computed using jdehesa's answer:
t0 = time.time()
print f("BCBCCBCCBCABBACCBABAABBBABBBACCBBBAABBACBCCCACABBCAABACBBBBCCCBBAACBAABACCBBCBBAABCCCCCAABBBBACBBAAACACCBCCBBBCCCCCCCACBABACCABBCBBBBBCBABABBACCAACBCBBAACBBBBBCCBABACBBABABAAABCCBBBAACBCACBAABAAAABABB")
# BCBCCBCCBCABBACCBABCCAABBACBACABBCAABACAACBAABACCBBCBBCACCBACBABACCABBCCBABABBACCAACBCBBAABABACBBABABBCCAACBCACBAABBABB
t1 = time.time()
total = t1-t0
print total
t0 = time.time()
print f("CBBACAAAAABBBBCAABBCBAABBBCBCBCACACBAABCBACBBABCABACCCCBACBCBBCBACBBACCCBAAAACACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBCCCABACABBCABBAAAAABBBBAABAABBCACACABBCBCBCACCCBABCAACBCAAAABCBCABACBABCABCBBBBABCBACABABABCCCBBCCBBCCBAAABCABBAAABBCAAABCCBAABAABCAACCCABBCAABCBCBCBBAACCBBBACBBBCABAABCABABABABCA")
# CBBACCAABBCBAACBCBCACACBAABCBACBBABCABABACBCBBCBACBBABCACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBABACABBCBBCACACABBCBCBCABABCAACBCBCBCABACBABCABCABCBACABABACCBBCCBBCACBCCBAABAABCBBCAABCBCBCBBAACCACCABAABCABABABABCA
t1 = time.time()
total = t1-t0
print total
t0 = time.time()
print f("AADBDBEBBBBCABCEBCDBBBBABABDCCBCEBABADDCABEEECCECCCADDACCEEAAACCABBECBAEDCEEBDDDBAAAECCBBCEECBAEBEEEECBEEBDACDDABEEABEEEECBABEDDABCDECDAABDAEADEECECEBCBDDAEEECCEEACCBBEACDDDDBDBCCAAECBEDAAAADBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDAEEEBBBCEDECBCABDEDEBBBABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCECCCA")
# AADBDBECABCEBCDABABDCCBCEBABADDCABCCEADDACCEECCABBECBAEDCEEBBECCBBCEECBAEBCBEEBDACDDABEEABCBABEDDABCDECDAABDAEADEECECEBCBDDACCEEACCBBEACBDBCCAAECBEDDBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDACEDECBCABDEDEABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCEA
t1 = time.time()
total = t1-t0
print total
Another Scala answer, using memoization and tail-call optimization (partly; updated).
import scala.collection.mutable.HashSet
import scala.annotation._

object StringCondense extends App {

  @tailrec
  def groupConsecutive (s: String, sofar: List[String]): List[String] = s.toList match {
    // def groupConsecutive (s: String): List[String] = s.toList match {
    case Nil => sofar
    // case Nil => Nil
    case c :: str => {
      val (prefix, rest) = (c :: str).span (_ == c)
      // Strings of equal characters, longer than 3, don't make a difference to just 3
      groupConsecutive (rest.mkString(""), (prefix.take (3)).mkString ("") :: sofar)
      // (prefix.take (3)).mkString ("") :: groupConsecutive (rest.mkString(""))
    }
  }

  // to count the effect of memoization
  var count = 0

  // recursively try to eliminate every group of 3 or more, brute forcing
  // but for "aabbaabbaaabbbaabb", many reductions will lead sooner or
  // later to the same result, so we try to detect these and avoid duplicate
  // work
  def moreThan2consecutive (s: String, seenbefore: HashSet [String]): String = {
    if (seenbefore.contains (s)) s
    else
    {
      count += 1
      seenbefore += s
      val sublists = groupConsecutive (s, Nil)
      // val sublists = groupConsecutive (s)
      val atLeast3 = sublists.filter (_.size > 2)
      atLeast3.length match {
        case 0 => s
        case 1 => {
          val res = sublists.filter (_.size < 3)
          moreThan2consecutive (res.mkString (""), seenbefore)
        }
        case _ => {
          val shrinked = (
            for {idx <- (0 until sublists.size)
              if (sublists (idx).length >= 3)
              pre = (sublists.take (idx)).mkString ("")
              post = (sublists.drop (idx+1)).mkString ("")
            } yield {
              moreThan2consecutive (pre + post, seenbefore)
            }
          )
          (shrinked.head /: shrinked.tail) ((a, b) => if (a.length <= b.length) a else b)
        }
      }
    }
  }

  // don't know what Rcd means, adopted from other solution but modified
  // kind of a unit test **update**: forgot to reset count
  def testRcd (s: String, expected: String) : Boolean = {
    count = 0
    val seenbefore = HashSet [String] ()
    val result = moreThan2consecutive (s, seenbefore)
    val hit = result.equals (expected)
    println (s"Input: $s\t result: ${result}\t expected ${expected}\t $hit\t count: $count");
    hit
  }

  // some test values from other users with expected result
  // **upd:** more testcases
  def testgroup () : Unit = {
    testRcd ("baabcccbba", "b")
    testRcd ("aabbaaac", "aabbc")
    testRcd ("aaaa", "")
    testRcd ("aaaabbbac", "c")
    testRcd ("abcccbcccbacccab", "b")
    testRcd ("AAAABBBAC", "C")
    testRcd ("CAAAABBBA", "C")
    testRcd ("AABBAAAC", "AABBC")
    testRcd ("BAABCCCBBA", "B")
    testRcd ("AAABBBAAABBBAAABBBC", "C") // 377 subcalls reported by Yola,
    testRcd ("AAABBBAAABBBAAABBBAAABBBC", "C") // 4913 when preceded with AAABBB
  }
  testgroup

  def testBigs () : Unit = {
    /*
testRcd ("BCBCCBCCBCABBACCBABAABBBABBBACCBBBAABBACBCCCACABBCAABACBBBBCCCBBAACBAABACCBBCBBAABCCCCCAABBBBACBBAAACACCBCCBBBCCCCCCCACBABACCABBCBBBBBCBABABBACCAACBCBBAACBBBBBCCBABACBBABABAAABCCBBBAACBCACBAABAAAABABB",
"BCBCCBCCBCABBACCBABCCAABBACBACABBCAABACAACBAABACCBBCBBCACCBACBABACCABBCCBABABBACCAACBCBBAABABACBBABABBCCAACBCACBAABBABB")
*/
testRcd ("CBBACAAAAABBBBCAABBCBAABBBCBCBCACACBAABCBACBBABCABACCCCBACBCBBCBACBBACCCBAAAACACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBCCCABACABBCABBAAAAABBBBAABAABBCACACABBCBCBCACCCBABCAACBCAAAABCBCABACBABCABCBBBBABCBACABABABCCCBBCCBBCCBAAABCABBAAABBCAAABCCBAABAABCAACCCABBCAABCBCBCBBAACCBBBACBBBCABAABCABABABABCA",
"CBBACCAABBCBAACBCBCACACBAABCBACBBABCABABACBCBBCBACBBABCACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBABACABBCBBCACACABBCBCBCABABCAACBCBCBCABACBABCABCABCBACABABACCBBCCBBCACBCCBAABAABCBBCAABCBCBCBBAACCACCABAABCABABABABCA")
/*testRcd ("AADBDBEBBBBCABCEBCDBBBBABABDCCBCEBABADDCABEEECCECCCADDACCEEAAACCABBECBAEDCEEBDDDBAAAECCBBCEECBAEBEEEECBEEBDACDDABEEABEEEECBABEDDABCDECDAABDAEADEECECEBCBDDAEEECCEEACCBBEACDDDDBDBCCAAECBEDAAAADBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDAEEEBBBCEDECBCABDEDEBBBABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCECCCA",
"AADBDBECABCEBCDABABDCCBCEBABADDCABCCEADDACCEECCABBECBAEDCEEBBECCBBCEECBAEBCBEEBDACDDABEEABCBABEDDABCDECDAABDAEADEECECEBCBDDACCEEACCBBEACBDBCCAAECBEDDBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDACEDECBCABDEDEABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCEA")
*/
  }

  // for generated input, but with fixed seed, to compare the count with
  // and without memoization
  import util.Random
  val r = new Random (31415)

  // generate Strings but with high chances to produce some triples and
  // longer sequences of char clones
  def genRandomString () : String = {
    (1 to 20).map (_ => r.nextInt (6) match {
      case 0 => "t"
      case 1 => "r"
      case 2 => "-"
      case 3 => "tt"
      case 4 => "rr"
      case 5 => "--"
    }).mkString ("")
  }

  def testRandom () : Unit = {
    (1 to 10).map (i => testRcd (genRandomString, "random mode - false might be true"))
  }
  testRandom
  testgroup
  testRandom
  // testBigs
}
Comparing the effect of memoization led to interesting results:
Updated measurements: in the old values, I forgot to reset the counter, which led to much higher results. Now the spread of results is much more impressive, and in total the values are smaller.
No seenbefore:
Input: baabcccbba result: b expected b true count: 4
Input: aabbaaac result: aabbc expected aabbc true count: 2
Input: aaaa result: expected true count: 2
Input: aaaabbbac result: c expected c true count: 5
Input: abcccbcccbacccab result: b expected b true count: 34
Input: AAAABBBAC result: C expected C true count: 5
Input: CAAAABBBA result: C expected C true count: 5
Input: AABBAAAC result: AABBC expected AABBC true count: 2
Input: BAABCCCBBA result: B expected B true count: 4
Input: AAABBBAAABBBAAABBBC res: C expected C true count: 377
Input: AAABBBAAABBBAAABBBAAABBBC r: C expected C true count: 4913
Input: r--t----ttrrrrrr--tttrtttt--rr----result: rr--rr expected ? unknown ? false count: 1959
Input: ttrtt----tr---rrrtttttttrtr--rr result: r--rr expected ? unknown ? false count: 213
Input: tt----r-----ttrr----ttrr-rr--rr-- result: ttrttrrttrr-rr--rr-- ex ? unknown ? false count: 16
Input: --rr---rrrrrrr-r--rr-r--tt--rrrrr result: rr-r--tt-- expected ? unknown ? false count: 32
Input: tt-rrrrr--r--tt--rrtrrr------- result: ttr--tt--rrt expected ? unknown ? false count: 35
Input: --t-ttt-ttt--rrrrrt-rrtrttrr result: --tt-rrtrttrr expected ? unknown ? false count: 35
Input: rrt--rrrr----trrr-rttttrrtttrr result: rrtt- expected ? unknown ? false count: 1310
Input: ---tttrrrrrttrrttrr---tt-----tt result: rrttrr expected ? unknown ? false count: 1011
Input: -rrtt--rrtt---t-r--r---rttr-- result: -rrtt--rr-r--rrttr-- ex ? unknown ? false count: 9
Input: rtttt--rrrrrrrt-rrttt--tt--t result: r--t-rr--tt--t expectd ? unknown ? false count: 16
real 0m0.607s (without testBigs)
user 0m1.276s
sys 0m0.056s
With seenbefore:
Input: baabcccbba result: b expected b true count: 4
Input: aabbaaac result: aabbc expected aabbc true count: 2
Input: aaaa result: expected true count: 2
Input: aaaabbbac result: c expected c true count: 5
Input: abcccbcccbacccab result: b expected b true count: 11
Input: AAAABBBAC result: C expected C true count: 5
Input: CAAAABBBA result: C expected C true count: 5
Input: AABBAAAC result: AABBC expected AABBC true count: 2
Input: BAABCCCBBA result: B expected B true count: 4
Input: AAABBBAAABBBAAABBBC rest: C expected C true count: 28
Input: AAABBBAAABBBAAABBBAAABBBC C expected C true count: 52
Input: r--t----ttrrrrrr--tttrtttt--rr----result: rr--rr expected ? unknown ? false count: 63
Input: ttrtt----tr---rrrtttttttrtr--rr result: r--rr expected ? unknown ? false count: 48
Input: tt----r-----ttrr----ttrr-rr--rr-- result: ttrttrrttrr-rr--rr-- xpe? unknown ? false count: 8
Input: --rr---rrrrrrr-r--rr-r--tt--rrrrr result: rr-r--tt-- expected ? unknown ? false count: 19
Input: tt-rrrrr--r--tt--rrtrrr------- result: ttr--tt--rrt expected ? unknown ? false count: 12
Input: --t-ttt-ttt--rrrrrt-rrtrttrr result: --tt-rrtrttrr expected ? unknown ? false count: 16
Input: rrt--rrrr----trrr-rttttrrtttrr result: rrtt- expected ? unknown ? false count: 133
Input: ---tttrrrrrttrrttrr---tt-----tt result: rrttrr expected ? unknown ? false count: 89
Input: -rrtt--rrtt---t-r--r---rttr-- result: -rrtt--rr-r--rrttr-- ex ? unknown ? false count: 6
Input: rtttt--rrrrrrrt-rrttt--tt--t result: r--t-rr--tt--t expected ? unknown ? false count: 8
real 0m0.474s (without testBigs)
user 0m0.852s
sys 0m0.060s
With tailcall:
real 0m0.478s (without testBigs)
user 0m0.860s
sys 0m0.060s
For some random strings, the difference is bigger than tenfold.
For long Strings with many groups one could, as an improvement, eliminate all groups which are the only group of that character, for instance:
aa bbb aa ccc xx ddd aa eee aa fff xx
The groups bbb, ccc, ddd, eee and fff are unique in the string, so they can't combine with anything else and could all be eliminated, and the order of removal will not matter. This would lead to the intermediate result
aaaa xx aaaa xx
and a fast solution. Maybe I will try to implement it too. However, I guess it will be possible to produce randomly generated strings where this has a big impact, and, with a different form of random generation, distributions where the impact is low.
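A small Python sketch of that preprocessing idea (my own, single pass only, under the assumption stated above that a removable run whose character appears in no other run can always be deleted first):
from itertools import groupby

def drop_unique_runs(s, min_len=3):
    # Split into runs of equal characters, e.g. 'aabbb' -> [('a', 2), ('b', 3)].
    runs = [(c, len(list(g))) for c, g in groupby(s)]
    run_count = {}
    for c, _ in runs:
        run_count[c] = run_count.get(c, 0) + 1
    # Drop a run only if it is removable on its own (>= min_len) and its
    # character occurs in no other run, so removing it cannot enable or
    # prevent any other combination.
    kept = [c * n for c, n in runs if not (n >= min_len and run_count[c] == 1)]
    return ''.join(kept)

print(drop_unique_runs("aabbbaacccxxdddaaeeeaafffxx"))  # aaaaxxaaaaxx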
Here is a Python solution (function reduce_min), not particularly smart but I think fairly easy to understand (excessive amount of comments added for answer clarity):
def reductions(s, min_len):
    """
    Yields every possible reduction of s by eliminating contiguous blocks
    of min_len or more repeated characters.
    For example, reductions('AAABBCCCCBAAC', 3) yields
    'BBCCCCBAAC' and 'AAABBBAAC'.
    """
    # Current character
    curr = ''
    # Length of current block
    n = 0
    # Start position of current block
    idx = 0
    # For each character
    for i, c in enumerate(s):
        if c != curr:
            # New block begins
            if n >= min_len:
                # If previous block was long enough
                # yield reduced string without it
                yield s[:idx] + s[i:]
            # Start new block
            curr = c
            n = 1
            idx = i
        else:
            # Still in the same block
            n += 1
    # Yield reduction without last block if it was long enough
    if n >= min_len:
        yield s[:idx]

def reduce_min(s, min_len):
    """
    Finds the smallest possible reduction of s by successive
    elimination of contiguous blocks of min_len or more repeated
    characters.
    """
    # Current set of possible reductions
    rs = set([s])
    # Current best solution
    result = s
    # While there are strings to reduce
    while rs:
        # Get one element
        r = rs.pop()
        # Find reductions
        r_red = list(reductions(r, min_len))
        # If no reductions are found it is irreducible
        if len(r_red) == 0 and len(r) < len(result):
            # Replace if shorter than current best
            result = r
        else:
            # Save reductions for next iterations
            rs.update(r_red)
    return result

assert reduce_min("BAABCCCBBA", 3) == "B"
assert reduce_min("AABBAAAC", 3) == "AABBC"
assert reduce_min("AAAA", 3) == ""
assert reduce_min("AAAABBBAC", 3) == "C"
EDIT: Since people seem to be posting C++ solutions, here is mine in C++ (again, function reduce_min):
#include <string>
#include <vector>
#include <unordered_set>
#include <iterator>
#include <utility>
#include <cassert>

using namespace std;

void reductions(const string &s, unsigned int min_len, vector<string> &rs)
{
    char curr = '\0';
    unsigned int n = 0;
    unsigned int idx = 0;
    for (auto it = s.begin(); it != s.end(); ++it)
    {
        if (curr != *it)
        {
            auto i = distance(s.begin(), it);
            if (n >= min_len)
            {
                rs.push_back(s.substr(0, idx) + s.substr(i));
            }
            curr = *it;
            n = 1;
            idx = i;
        }
        else
        {
            n += 1;
        }
    }
    if (n >= min_len)
    {
        rs.push_back(s.substr(0, idx));
    }
}

string reduce_min(const string &s, unsigned int min_len)
{
    unordered_set<string> rs { s };
    string result = s;
    vector<string> rs_new;
    while (!rs.empty())
    {
        auto it = rs.begin();
        auto r = *it;
        rs.erase(it);
        rs_new.clear();
        reductions(r, min_len, rs_new);
        if (rs_new.empty() && r.size() < result.size())
        {
            result = move(r);
        }
        else
        {
            rs.insert(rs_new.begin(), rs_new.end());
        }
    }
    return result;
}

int main(int argc, char **argv)
{
    assert(reduce_min("BAABCCCBBA", 3) == "B");
    assert(reduce_min("AABBAAAC", 3) == "AABBC");
    assert(reduce_min("AAAA", 3) == "");
    assert(reduce_min("AAAABBBAC", 3) == "C");
    return 0;
}
If you can use C++17 you can save memory by using string views.
EDIT 2: About the complexity of the algorithm. It is not straightforward to figure out, and as I said the algorithm is meant to be simple more than anything, but let's see. In the end, it is more or less the same as a breadth-first search. Let's say the string length is n, and, for generality, let's say the minimum block length (value 3 in the question) is m. In the first level, we can generate up to n / m reductions in the worst case. For each of these, we can generate up to (n - m) / m reductions, and so on. So basically, at "level" i (loop iteration i) we create up to (n - i * m) / m reductions per string we had, and each of these will take O(n - i * m) time to process. The maximum number of levels we can have is, again, n / m. So the complexity of the algorithm (if I'm not making mistakes) should have the form:
O( sum {i = 0 .. n / m} ( O(n - i * m) * prod {j = 0 .. i} ((n - i * m) / m) ))
|-Outer iters--| |---Cost---| |-Prev lvl-| |---Branching---|
Whew. So this should be something like:
O( sum {i = 0 .. n / m} (n - i * m) * O(n^i / m^i) )
Which in turn would collapse to:
O((n / m)^(n / m))
So yeah, the algorithm is more or less simple, but it can run into exponential cost cases (the bad cases would be strings made entirely of exactly m-long blocks, like AAABBBCCCAAACCC... for m = 3).
