Selecting an optimum set according to ranked criteria

I am given a string, and a set of rules which select valid substrings by a process which isn't important here. Given an enumeration of all valid substrings, I have to find the optimum set of substrings according to a set of ranked criteria, such as:
1. Substrings may not overlap
2. All characters must be part of a substring if possible
3. Use as few different substrings as possible
4. etc.
For example, given the string abc and the substrings [a, ab, bc], the optimal set of substrings by the preceding rules is [a, bc].
Currently I'm doing this by a standard naive algorithm of enumerating all possible sets of substrings, then iterating over them to find the best candidate. The problem is that as the length of the string and the number of substrings goes up, the number of possible sets increases exponentially. With 50 substrings (well within possibility for this app), the number of sets to enumerate is 2^50, which is extremely prohibitive.
It seems like there should be a way to avoid generating many of the sets that will obviously be losers, or to algorithmically converge on the optimum set without having to blindly generate every candidate. What options are there?
Note that for this application it may be acceptable to use an algorithm that offers a statistical rather than absolute guarantee, such as an n% chance of hitting a non-optimal candidate, where n is suitably small.

Looks to me like a tree structure is needed.
Basically, your initial branching is on all the substrings, then on all but the one you used in the first round, and so on all the way to the bottom. You're right that this branches to 2^50, but if you use alpha-beta pruning to quickly terminate branches that are obviously inferior, and add some memoization to prune situations you've seen before, you could speed things up considerably.
You'll probably have to do a fair amount of reading on AI search techniques to get it all, but the Wikipedia pages on alpha-beta pruning and transposition tables will give you a start.
edit:
Yep you're right, probably not clear enough.
Assuming your example "ABABABAB BABABABA" with substrings {"ABAB","BABA"}.
If you set your evaluation function to simply treat wasted characters as bad the tree will go something like this:
ABAB (eval=0)
  ABAB (eval=0)
    ABAB (eval=2 because we move past/waste a space char and a B)
      [missing expansion]
    BABA (eval=1 because we only waste the space)
      ABAB (eval=2 now have wasted the space above and a B at this level)
      BABA (eval=1 still only wasted the space)*
  BABA (eval=1 prune here because we already have a result that is 1)
BABA (eval=1 prune here for same reason)
*best solution
I suspect the simple 'wasted chars' isn't enough in the non trivial example but it does prune half the tree here.
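To make the pruning concrete, here is a minimal C++ sketch of the same idea, assuming the evaluation is just the count of wasted characters and ignoring the other criteria. It is a plain branch-and-bound cutoff rather than full alpha-beta, and the names `search` and `minWasted` are made up for illustration.

```cpp
#include <climits>
#include <string>
#include <vector>

// Depth-first search over "place a substring here" / "waste one character",
// abandoning any branch whose waste already reaches the best complete result.
void search(const std::string& text, const std::vector<std::string>& subs,
            std::size_t pos, int wasted, int& best) {
    if (wasted >= best) return;   // prune: this branch cannot beat the best result
    if (pos == text.size()) {     // reached the end with strictly less waste
        best = wasted;
        return;
    }
    for (const std::string& s : subs) {
        // try every substring that matches at the current position
        if (!s.empty() && text.compare(pos, s.size(), s) == 0)
            search(text, subs, pos + s.size(), wasted, best);
    }
    search(text, subs, pos + 1, wasted + 1, best);  // waste text[pos]
}

// Fewest wasted characters over all non-overlapping placements.
int minWasted(const std::string& text, const std::vector<std::string>& subs) {
    int best = INT_MAX;
    search(text, subs, 0, 0, best);
    return best;
}
```

Because placements are tried before skips, a good complete result tends to be found early, which is what makes the cutoff bite.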

Here's a working solution in Haskell. I have called the unique substrings symbols, and an association of one occurrence of the substrings a placement. I have also interpreted criterion 3 ("Use as few different substrings as possible") as "use as few symbols as possible", as opposed to "use as few placements as possible".
This is a dynamic programming approach; the actual pruning occurs due to the memoization. Theoretically, a smart Haskell implementation could do the memoization for you (there are also ways to do it by wrapping makeFindBest). I'd suggest using a bitfield to represent the used symbols and just an integer to represent the remaining string. The optimisation is possible because of the following fact: given optimal solutions for the strings S1 and S2 that both use the same set of symbols, if S1 and S2 are concatenated then the two solutions can be concatenated in the same way and the new solution will be optimal. Hence, for each partition of the input string, makeFindBest need only be evaluated once on the postfix for each possible set of symbols used in the prefix.
I've also integrated branch-and-bound pruning as suggested in Daniel's answer; this makes use of an evaluation function which becomes worse the more characters skipped. The cost is monotonic in the number of characters processed, so that if we have found a set of placements that wasted only alpha characters, then we never again try to skip more than alpha characters.
Where n is the string length and m is the number of symbols, the worst case is O(m^n) naively, and m is O(2^n). Note that removing constraint 3 would make things much quicker: the memoization would only need to be parameterized by the remaining string which is an O(n) cache, as opposed to O(n * 2^m)!
Using a string search/matching algorithm such as Aho-Corasick improves the consume/drop 1 pattern I use here from exponential to quadratic. However, this by itself doesn't avoid the factorial growth in the combinations of the matches, which is where the dynamic programming helps.
Also note that your 4th "etc." criterion could change the problem a lot if it constrains the problem in a way that makes more aggressive pruning possible, or requires backtracking!
module Main where

import Data.List (isPrefixOf, union)
import Data.Maybe (mapMaybe)
import System.Environment (getArgs)

type Symbol = String
type Placement = String

-- (remaining, placement or Nothing to skip one character)
type Move = (String, Maybe Placement)

-- (score, usedsymbols, placements)
type Solution = (Int, [Symbol], [Placement])

-- invoke like ./a.out STRING SPACE-SEPARATED-SYMBOLS ...
-- e.g. ./a.out "abcdeafghia" "a bc fg"
-- output is a list of placements
main :: IO ()
main = do
  argv <- getArgs
  let str = head argv
      symbols = concat (map words (tail argv))
  (putStr . show) $ findBest str symbols
  putStr "\n"

getscore :: Solution -> Int
getscore (sc, _, _) = sc

-- | consume STR SYM consumes SYM from the start of STR. returns (s, SYM)
-- where s is the rest of STR, after the consumed occurrence, or Nothing if
-- SYM isn't a prefix of STR.
consume :: String -> Symbol -> Maybe Move
consume str sym =
  if sym `isPrefixOf` str
    then Just (drop (length sym) str, Just sym)
    else Nothing

-- | addToSoln SYMBOLS P SOL incrementally updates SOL with the new SCORE and
-- placement P. Skipping a character costs more than using every symbol once.
addToSoln :: [Symbol] -> Maybe Placement -> Solution -> Solution
addToSoln symbols Nothing (sc, used, ps) = (sc - length symbols - 1, used, ps)
addToSoln _ (Just p) (sc, used, ps) =
  -- a symbol only costs a point the first time it is used (the original
  -- tested membership in SYMBOLS, which is always true)
  if p `elem` used
    then (sc, used, p : ps)
    else (sc - 1, used `union` [p], p : ps)

reduce :: [Symbol] -> Solution -> Solution -> [Move] -> Solution
reduce _ _ cutoff [] = cutoff
reduce symbols parent cutoff ((s, p) : moves) =
  let sol = makeFindBest symbols (addToSoln symbols p parent) cutoff s
      best = if getscore sol > getscore cutoff
               then sol
               else cutoff
  in reduce symbols parent best moves

-- | makeFindBest SYMBOLS PARENT CUTOFF STR searches for the best placements
-- that can be made on STR from SYMBOLS, that are strictly better than CUTOFF,
-- and prepends those placements to PARENT's third element.
makeFindBest :: [Symbol] -> Solution -> Solution -> String -> Solution
makeFindBest _ parent _ "" = parent
makeFindBest symbols parent cutoff str =
  -- should be memoized by the used symbols (second element of parent) and str
  let moves = if getscore parent > getscore cutoff
                then mapMaybe (consume str) symbols ++ [(drop 1 str, Nothing)]
                else mapMaybe (consume str) symbols
  in reduce symbols parent cutoff moves

-- the score of a solution that makes no placements
worstScore :: String -> [Symbol] -> Int
worstScore str symbols = negate (length str * (1 + length symbols))

findBest :: String -> [Symbol] -> [Placement]
findBest str symbols =
  (\(_, _, ps) -> reverse ps)
    (makeFindBest symbols (0, [], []) (worstScore str symbols, [], []) str)

This smells like a dynamic programming problem. You can find a number of good sources on it, but the gist is that you generate a collection of subproblems, and then build up "larger" optimal solutions by combining optimal subsolutions.
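As a rough illustration of that, here is a C++ sketch under a simplifying assumption: only the first two criteria are enforced (no overlaps, as few uncovered characters as possible), and the "fewest different substrings" rule is ignored. The subproblem is "fewest uncovered characters in the suffix starting at i"; the function name `minUncovered` is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Fewest characters of `text` left uncovered by non-overlapping
// occurrences of the given substrings (criteria 1 and 2 only).
int minUncovered(const std::string& text, const std::vector<std::string>& subs) {
    const std::size_t n = text.size();
    // best[i] = fewest uncovered characters in text[i..n)
    std::vector<int> best(n + 1, 0);
    for (std::size_t i = n; i-- > 0;) {
        best[i] = best[i + 1] + 1;  // option 1: leave text[i] uncovered
        for (const std::string& s : subs) {
            // option 2: place s at position i if it matches there
            if (!s.empty() && text.compare(i, s.size(), s) == 0)
                best[i] = std::min(best[i], best[i + s.size()]);
        }
    }
    return best[0];
}
```

Each suffix is solved once, so this runs in time proportional to the string length times the total length of the substrings. Adding the third criterion means the state also has to carry the set of substrings used so far, which is where the exponential factor comes back.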

This is an answer rewritten to use the Aho-Corasick string-matching algorithm and Dijkstra's algorithm, in C++. This should be a lot closer to your target language of C#.
The Aho-Corasick step constructs an automaton (a trie of the patterns with failure links) from the set of patterns, and then uses that automaton to find all matches in the input string. Dijkstra's algorithm then treats those matches as nodes in a DAG, and moves toward the end of the string looking for the lowest-cost path.
This approach is a lot easier to analyze, as it's simply combining two well-understood algorithms.
Constructing the Aho-Corasick automaton is linear time in the total length of the patterns, and then the search is linear in the length of the input string plus the number of matches.
Dijkstra's algorithm runs in O(|E| + |V| log |V|) with a Fibonacci heap; with the binary heap behind std::priority_queue, as used here, it is O((|E| + |V|) log |V|). The graph is a DAG, where vertices correspond to matches or to runs of characters that are skipped. Edge weights are the penalty for using an extra pattern or for skipping characters. An edge exists between two matches if they are adjacent and non-overlapping. An edge exists from a match m to a skip if that is the shortest possible skip between m and another match m2 that overlaps with some match m3 starting at the same place as the skip (phew!). The structure of Dijkstra's algorithm ensures that the optimal answer is the first one to be found by the time we reach the end of the input string (it implicitly achieves the pruning Daniel suggested).
#include <algorithm>
#include <functional>
#include <iostream>
#include <list>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>
using namespace std;

static vector<string> patterns;
static string input;
static int skippenalty;

struct acnode {
    acnode() : failure(NULL), gotofn(256) {}
    struct acnode *failure;
    vector<struct acnode *> gotofn;
    list<int> outputs; // index into patterns global
};

void
add_string_to_trie(acnode *root, const string &s, int sid)
{
    for (string::const_iterator p = s.begin(); p != s.end(); ++p) {
        // index by unsigned char so bytes >= 0x80 don't become negative
        unsigned char c = static_cast<unsigned char>(*p);
        if (!root->gotofn[c])
            root->gotofn[c] = new acnode;
        root = root->gotofn[c];
    }
    root->outputs.push_back(sid);
}
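void
init_tree(acnode *root)
{
    queue<acnode *> q;
    unsigned char c = 0;
    // depth 1: children of the root fail back to the root, and missing
    // transitions loop on the root
    do {
        if (acnode *u = root->gotofn[c]) {
            u->failure = root;
            q.push(u);
        } else
            root->gotofn[c] = root;
    } while (++c); // c wraps to 0 after 255, which ends the loop
    // breadth-first over the rest of the trie
    while (!q.empty()) {
        acnode *r = q.front();
        q.pop();
        do {
            acnode *u, *v;
            if (!(u = r->gotofn[c]))
                continue;
            q.push(u);
            v = r->failure;
            while (!v->gotofn[c])
                v = v->failure;
            u->failure = v->gotofn[c];
            // copy rather than splice: splice would move the elements and
            // leave the failure node with no outputs of its own
            u->outputs.insert(u->outputs.end(),
                              u->failure->outputs.begin(),
                              u->failure->outputs.end());
        } while (++c);
    }
}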
struct match { int begin, end, sid; };

void
ahocorasick(const acnode *state, list<match> &out, const string &str)
{
    int i = 1;
    for (string::const_iterator p = str.begin(); p != str.end(); ++p, ++i) {
        unsigned char c = static_cast<unsigned char>(*p);
        while (!state->gotofn[c])
            state = state->failure;
        state = state->gotofn[c];
        for (list<int>::const_iterator q = state->outputs.begin();
             q != state->outputs.end(); ++q) {
            // pattern *q ends just before position i
            struct match m = { i - static_cast<int>(patterns[*q].size()), i, *q };
            out.push_back(m);
        }
    }
}

////////////////////////////////////////////////////////////////////////

// order matches by start position, then by end position
bool operator<(const match& m1, const match& m2)
{
    return m1.begin < m2.begin
        || (m1.begin == m2.begin && m1.end < m2.end);
}
struct dnode {
    int usedchars;
    vector<bool> usedpatterns;
    int last;
};

bool operator<(const dnode& a, const dnode& b) {
    return a.usedchars > b.usedchars
        || (a.usedchars == b.usedchars && a.usedpatterns < b.usedpatterns);
}

bool operator==(const dnode& a, const dnode& b) {
    return a.usedchars == b.usedchars
        && a.usedpatterns == b.usedpatterns;
}

typedef priority_queue<pair<int, dnode>,
                       vector<pair<int, dnode> >,
                       greater<pair<int, dnode> > > mypq;
void
dijkstra(const vector<match> &matches)
{
    typedef vector<match>::const_iterator mIt;
    vector<bool> used(patterns.size(), false);
    dnode initial = { 0, used, -1 };
    mypq q;
    set<dnode> last;
    dnode d;
    q.push(make_pair(0, initial));
    while (!q.empty()) {
        int cost = q.top().first;
        d = q.top().second;
        q.pop();
        if (last.end() != last.find(d)) // we've been here before
            continue;
        last.insert(d);
        if (d.usedchars >= static_cast<int>(input.size())) {
            break; // found optimum
        }
        match m = { d.usedchars, 0, 0 };
        mIt mp = lower_bound(matches.begin(), matches.end(), m);
        if (matches.end() == mp) {
            // no more matches, skip the remaining string
            dnode nextd = d;
            // advance the successor, not d itself, so that skip is positive
            nextd.usedchars = static_cast<int>(input.size());
            int skip = nextd.usedchars - d.usedchars;
            nextd.last = -skip;
            q.push(make_pair(cost + skip * skippenalty, nextd));
            continue;
        }
        // keep track of where the shortest match ended; we don't need to
        // skip more than this.
        int skipmax = (mp->begin == d.usedchars) ? mp->end : mp->begin + 1;
        while (mp != matches.end() && mp->begin == d.usedchars) {
            dnode nextd = d;
            nextd.usedchars = mp->end;
            int extra = nextd.usedpatterns[mp->sid] ? 0 : 1; // extra pattern
            int nextcost = cost + extra;
            nextd.usedpatterns[mp->sid] = true;
            nextd.last = mp->sid * 2 + extra; // encode used pattern
            q.push(make_pair(nextcost, nextd));
            ++mp;
        }
        if (mp == matches.end() || skipmax <= mp->begin)
            continue;
        // skip
        dnode nextd = d;
        nextd.usedchars = mp->begin;
        int skip = nextd.usedchars - d.usedchars;
        nextd.last = -skip;
        q.push(make_pair(cost + skip * skippenalty, nextd));
    }
    // unwind
    string answer;
    while (d.usedchars > 0) {
        if (0 > d.last) {
            answer = string(-d.last, '*') + answer;
            d.usedchars += d.last;
        } else {
            answer = "[" + patterns[d.last / 2] + "]" + answer;
            d.usedpatterns[d.last / 2] = !(d.last % 2);
            d.usedchars -= patterns[d.last / 2].length();
        }
        set<dnode>::const_iterator lp = last.find(d);
        if (last.end() == lp) return; // should not happen
        d.last = lp->last;
    }
    cout << answer;
}
int
main()
{
    int n;
    cin >> n; // read n patterns
    patterns.reserve(n);
    acnode root;
    for (int i = 0; i < n; ++i) {
        string s;
        cin >> s;
        patterns.push_back(s);
        add_string_to_trie(&root, s, i);
    }
    init_tree(&root);
    getline(cin, input); // eat the rest of the first line
    getline(cin, input);
    cerr << "got input: " << input << endl;
    list<match> matches;
    ahocorasick(&root, matches, input);
    vector<match> vmatches(matches.begin(), matches.end());
    sort(vmatches.begin(), vmatches.end());
    skippenalty = 1 + static_cast<int>(patterns.size());
    dijkstra(vmatches);
    return 0;
}
Here is a test file with 52 single-letter patterns (compile and then run with the test file on stdin):
52 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz

Related

String permutation with duplicate characters

I have the string "0011" and want all of its permutations without duplicates.
That means I want every string made up of two '0's and two '1's;
for example: [0011, 0101, 0110, 1001, 1010, 1100]
I tried the following, and the result is exactly what I need.
private void permutation(String result, String str, HashSet<String> hashSet) {
    if (str.length() == 0 && !hashSet.contains(result)) {
        System.out.println(result);
        hashSet.add(result);
        return;
    }
    IntStream.range(0, str.length()).forEach(pos ->
        permutation(result + str.charAt(pos),
                    str.substring(0, pos) + str.substring(pos + 1), hashSet));
}
If I remove the HashSet, this code produces 24 results instead of 6.
But the time complexity of this code is O(n!).
How can I avoid creating duplicate strings and reduce the time complexity?

Something like this can probably be faster than n!, even for small n.
The idea is to count how many 1 bits each resulting item must have, then
iterate through all possible values and keep only those with the same number of set bits. It takes a similar amount of time whether the input has a single 1 or a 50/50 mix of 0s and 1s.
function bitCount(n) {
  n = n - ((n >> 1) & 0x55555555)
  n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
  return ((n + (n >> 4) & 0xF0F0F0F) * 0x1010101) >> 24
}
function perm(inp) {
  const bitString = 2; // radix
  const len = inp.length;
  const target = bitCount(parseInt(inp, bitString));
  // smallest value with `target` bits set: 2^target - 1
  const min = Math.pow(bitString, target) - 1;
  // largest value: the same bits shifted to the top
  const max = min << (len - target);
  const result = [];
  for (let i = min; i < max + 1; i++) {
    if (bitCount(i) === target) {
      result.push(i.toString(bitString).padStart(len, '0'));
    }
  }
  return result;
}
const inp = '0011';
const res = perm(inp);
console.log('result', res);
P.S. My first idea was probably faster than the code above, but the code above is easier to implement.
The first idea was to convert the string to an int
and use a bitwise left shift, but moving only one bit at a time. It still depends on n, and can be slower or faster than the solution above, but the bitwise shift itself is fast.
Example:
const input = '0011'
const len = input.length;
step 1: calculate the number of set bits = 2,
then generate the first element = 3 (decimal), which is '0011' in binary
step 2: move the leftmost set bit one position left with the << operator: '0101'
step 3: move it again: '1001'
step 4: we have reached `len`, so use the next bit: 100'1' : '1010'
step 5: repeat: '1100'
step 6: shift the initial 3 by << 1: '0110'
step 7: repeat the above steps: '1010'
step 8: '1100'
This generates duplicates, so it can probably be improved.
Hope it helps.
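One way to get the shifting idea without the duplicates is the well-known bit trick usually called Gosper's hack, which jumps straight from one value to the next larger value with the same number of set bits. This is a different technique from the steps above, sketched here in C++ under the assumption that the string is shorter than 32 characters; the function name `fixedWeightStrings` is made up.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// All len-character binary strings with exactly `ones` bits set, in
// increasing numeric order. Assumes 0 <= ones <= len and len < 32.
std::vector<std::string> fixedWeightStrings(int len, int ones) {
    std::vector<std::string> out;
    if (ones == 0) {                      // only the all-zero string
        out.push_back(std::string(len, '0'));
        return out;
    }
    const std::uint32_t limit = 1u << len;
    std::uint32_t v = (1u << ones) - 1;   // smallest value with `ones` bits set
    while (v < limit) {
        std::string s(len, '0');
        for (int i = 0; i < len; ++i)
            if (v & (1u << i)) s[len - 1 - i] = '1';
        out.push_back(s);
        // Gosper's hack: next larger value with the same number of set bits
        const std::uint32_t c = v & (~v + 1);  // lowest set bit
        const std::uint32_t r = v + c;         // ripple the carry upward
        v = (((r ^ v) >> 2) / c) | r;          // refill the low bits
    }
    return out;
}
```

For `fixedWeightStrings(4, 2)` this visits 3, 5, 6, 9, 10, 12, i.e. 0011, 0101, 0110, 1001, 1010, 1100, so the work is proportional to the number of results rather than to 2^len.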

The worst-case time complexity cannot be improved, because the string may contain no duplicates at all. However, in the case of a multiset, we can prune a lot of subtrees to prevent duplicates.
The key idea is to permute the string using the traditional backtracking algorithm, but to skip a swap if an equal character has already been swapped into that position.
Here is a C++ code snippet that prevents duplicates and doesn't use any extra memory for lookup.
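#include <iostream>
#include <string>
#include <utility>
using namespace std;

// True if no position in [start, index) holds the same character as
// str[index], i.e. an equal character has not been swapped in yet.
bool shouldSwap(const string& str, size_t start, size_t index) {
    for (auto i = start; i < index; ++i) {
        if (str[i] == str[index])
            return false;
    }
    return true;
}

void permute(string& str, size_t index)
{
    if (index >= str.size()) {
        cout << str << endl;
        return;
    }
    for (size_t i = index; i < str.size(); ++i) {
        if (shouldSwap(str, index, i)) {
            swap(str[index], str[i]);
            permute(str, index + 1);
            swap(str[index], str[i]);
        }
    }
}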
Running demo. Also refer to SO answer here and Distinct permutations for more references.
Also, note that the time complexity of this solution is O(n^2 * n!):
O(n) for printing a string
O(n) for iterating over the string to generate swaps and recurse
O(n!) possible states for the number of permutations
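If writing the backtracking by hand isn't required, the standard library offers an alternative: `std::next_permutation` steps through arrangements in lexicographic order and visits each distinct arrangement once, even when characters repeat. A small sketch (the wrapper name `distinctPermutations` is made up):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Every distinct arrangement of the characters of s, in sorted order.
std::vector<std::string> distinctPermutations(std::string s) {
    std::sort(s.begin(), s.end());   // start from the smallest arrangement
    std::vector<std::string> out;
    do {
        out.push_back(s);
    } while (std::next_permutation(s.begin(), s.end()));
    return out;
}
```

For "0011" this yields the six strings from the question; it collects the results in a vector here only to keep the example easy to check.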

Find all anagrams in a string O(n) solution

Here is the problem:
Given a string s and a non-empty string p, find all the start indices of p's anagrams in s.
Input: s: "cbaebabacd" p: "abc"
Output: [0, 6]
Input: s: "abab" p: "ab"
Output: [0, 1, 2]
Here is my solution
vector<int> findAnagrams(string s, string p) {
    vector<int> res, s_map(26,0), p_map(26,0);
    int s_len = s.size();
    int p_len = p.size();
    if (s_len < p_len) return res;
    for (int i = 0; i < p_len; i++) {
        ++s_map[s[i] - 'a'];
        ++p_map[p[i] - 'a'];
    }
    if (s_map == p_map)
        res.push_back(0);
    for (int i = p_len; i < s_len; i++) {
        ++s_map[s[i] - 'a'];
        --s_map[s[i - p_len] - 'a'];
        if (s_map == p_map)
            res.push_back(i - p_len + 1);
    }
    return res;
}
However, I think it is an O(n^2) solution because I have to compare the vectors s_map and p_map.
Does an O(n) solution exist for this problem?

Let's say p has size n.
Let's say you have an array A of size 26 that is filled with the number of a, b, c, ... that p contains.
Then you create a new array B of size 26 filled with 0.
Let's call the given (big) string s.
First, you initialize B with the number of a, b, c, ... in the first n chars of s.
Then you iterate through each word of size n in s, always updating B to fit this n-sized word.
Whenever B matches A, you have an index where there is an anagram.
To change B from one n-sized word to the next, notice that you just have to remove from B the first char of the previous word and add the new char of the next word.
Look at the example:
Input
s: "cbaebabacd"
p: "abc" n = 3 (size of p)
A = {1, 1, 1, 0, 0, 0, ... } // p contains just 1a, 1b and 1c.
B = {1, 1, 1, 0, 0, 0, ... } // initially, the first n-sized word contains this.
compare(A,B)
for i = n; i < size of s; i++ {
    B[ s[i-n] ]--;
    B[ s[ i ] ]++;
    compare(A,B)
}
and suppose that compare(A,B) prints the index whenever A matches B.
the total complexity will be:
first fill of A = O(size of p)
first fill of B = O(size of s)
first comparison = O(26)
for-loop = |s| * (2 + O(26)) = |s| * O(28) = O(28|s|) = O(size of s)
____________________________________________________________________
2 * O(size of s) + O(size of p) + O(26)
which is linear in size of s.

Your solution is the O(n) solution. The size of the s_map and p_map vectors is a constant (26) that doesn't depend on n. So the comparison between s_map and p_map takes a constant amount of time regardless of how big n is.
Your solution takes about 26 * n integer comparisons to complete, which is O(n).
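If the factor of 26 bothers you anyway, it can be removed by tracking how many letters currently have a mismatched count, so each window is checked in O(1). This is a variation on your code rather than a correction of it; the sketch assumes lowercase 'a'..'z' input as in the question, and the name `findAnagramsCounted` is made up.

```cpp
#include <array>
#include <string>
#include <vector>

// Start indices of p's anagrams in s. Assumes both strings contain only
// the letters 'a'..'z'.
std::vector<int> findAnagramsCounted(const std::string& s, const std::string& p) {
    std::vector<int> res;
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(p.size());
    if (m == 0 || n < m) return res;
    // diff[c] = (count of c in the current window) - (count of c in p)
    std::array<int, 26> diff{};
    for (char c : p) --diff[c - 'a'];
    int nonzero = 0;  // number of letters whose diff is not 0
    for (int d : diff)
        if (d != 0) ++nonzero;
    auto bump = [&](char c, int delta) {
        int& d = diff[c - 'a'];
        if (d == 0) ++nonzero;  // this letter is leaving zero
        d += delta;
        if (d == 0) --nonzero;  // this letter is arriving at zero
    };
    for (int i = 0; i < n; ++i) {
        bump(s[i], +1);                  // character entering the window
        if (i >= m) bump(s[i - m], -1);  // character leaving the window
        if (i >= m - 1 && nonzero == 0) res.push_back(i - m + 1);
    }
    return res;
}
```

The window is an anagram exactly when `nonzero` is 0, so no array comparison is needed after the initial setup.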

// In papers on string searching algorithms, the alphabet is often
// called Sigma, and it is often not considered a constant. Your
// algorithm works in O(Sigma * n) time, where n is the length of the
// longer string. Below is an algorithm that works in O(n) time even
// when Sigma is too large to make an array of size Sigma, as long as
// values from Sigma are a constant number of "machine words".
// This solution works in O(n) time "with high probability", meaning
// that for all c > 2 the probability that the algorithm takes at most
// c*n time is 1-o(n^-c). This is a looser bound than O(n)
// worst-case because it uses hash tables, which depend on randomness.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>
using namespace std;

// Finding a needle in a haystack. This works for any iterable type
// whose members can be stored as keys of an unordered_map.
template <typename T>
vector<size_t> AnagramLocations(const T& needle, const T& haystack) {
    // Think of a contiguous region of an ordered container as
    // representing a function f with the domain being the type of item
    // stored in the container and the codomain being the natural
    // numbers. We say that f(x) = n when there are n x's in the
    // contiguous region.
    //
    // Then two contiguous regions are anagrams when they have the same
    // function. We can track how close they are to being anagrams by
    // subtracting one function from the other, pointwise. When that
    // difference is uniformly 0, then the regions are anagrams.
    unordered_map<remove_const_t<remove_reference_t<decltype(*needle.begin())>>,
                  intmax_t> difference;
    // As we iterate through the haystack, we track the lead (part
    // closest to the end) and lag (part closest to the beginning) of a
    // contiguous region in the haystack. When we move the region
    // forward by one, one part of the function f is increased by +1 and
    // one part is decreased by -1, so the same is true of difference.
    auto lag = haystack.begin(), lead = haystack.begin();
    // To compare difference to the uniformly-zero function in O(1)
    // time, we make sure it does not contain any points that map to
    // 0. Then the property of being uniformly zero is the same as the
    // property of having an empty difference.
    const auto find = [&](const auto& x) {
        difference[x]++;
        if (0 == difference[x]) difference.erase(x);
    };
    const auto lose = [&](const auto& x) {
        difference[x]--;
        if (0 == difference[x]) difference.erase(x);
    };
    vector<size_t> result;
    // First we initialize the difference with the first needle.size()
    // items from both needle and haystack.
    for (const auto& x : needle) {
        // check before dereferencing, so a haystack exactly as long as
        // the needle is still examined and a shorter one is rejected
        if (lead == haystack.end()) return result;
        lose(x);
        find(*lead);
        ++lead;
    }
    if (difference.empty()) result.push_back(0);
    // Now we iterate through the haystack with lead, lag, and i (the
    // start of the region after the move) updating difference in O(1)
    // time at each spot.
    size_t i = 1;
    for (; lead != haystack.end(); ++lead, ++lag, ++i) {
        find(*lead);
        lose(*lag);
        if (difference.empty()) result.push_back(i);
    }
    return result;
}

int main() {
    string needle, haystack;
    cin >> needle >> haystack;
    const auto result = AnagramLocations(needle, haystack);
    for (auto x : result) cout << x << ' ';
}

import java.util.*;

public class FindAllAnagramsInAString_438 {
    public static void main(String[] args) {
        String s = "abab";
        String p = "ab";
        // String s="cbaebabacd";
        // String p="abc";
        System.out.println(findAnagrams(s, p));
    }

    public static List<Integer> findAnagrams(String s, String p) {
        int i = 0;
        int j = p.length();
        List<Integer> list = new ArrayList<>();
        // slide the window [i, j) over s and test each substring
        while (j <= s.length()) {
            //System.out.println("Substring >>"+s.substring(i,j));
            if (isAnagram(s.substring(i, j), p)) {
                list.add(i);
            }
            i++;
            j++;
        }
        return list;
    }

    public static boolean isAnagram(String s, String p) {
        HashMap<Character, Integer> map = new HashMap<>();
        if (s.length() != p.length()) return false;
        for (int i = 0; i < s.length(); i++) {
            char chs = s.charAt(i);
            char chp = p.charAt(i);
            map.put(chs, map.getOrDefault(chs, 0) + 1);
            map.put(chp, map.getOrDefault(chp, 0) - 1);
        }
        // anagrams leave every count at zero
        for (int val : map.values()) {
            if (val != 0) return false;
        }
        return true;
    }
}

Longest Common Substring

We have two strings a and b respectively. The length of a is greater than or equal to b. We have to find out the longest common substring. If there are multiple answers then we have to output the substring which comes earlier in b (earlier as in whose starting index comes first).
Note: The length of a and b can be up to 10^6.
I tried to find the longest common substring using a suffix array (sorting the suffixes using quicksort). For the case when there is more than one answer, I tried pushing onto a stack all the common substrings whose length equals that of the longest common substring.
Is there a faster way to do this?

Build a suffix tree of the string a$b, that is, a concatenated with some character like $ that occurs in neither string, then concatenated with b. A (compressed) suffix tree can be built in O(|a|+|b|) time and memory, and has O(|a|+|b|) nodes.
Now, for each node, we know its depth (the length of the string obtained by starting from the root and traversing the tree down to that node). We also can keep track of two boolean quantities: whether this node was visited during the build phase corresponding to a, and whether it was visited during the build phase corresponding to b (for example, we might as well build the two trees separately and then merge them using pre-order traversal). Now, the task boils down to finding the deepest vertex which was visited during both phases, which can be done by a single pre-order traversal. The case of multiple answers should be easy to handle.
This Wikipedia page contains another (brief) overview of the technique.

This is the longest-substring problem. Are you looking for the version with or without repeating characters?
Please go through this; it might be helpful.
http://www.programcreek.com/2013/02/leetcode-longest-substring-without-repeating-characters-java/

import java.util.Scanner;

public class JavaApplication8 {
    // Returns the length of the longest common substring of s1 and s2.
    public static int find(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        int ans = 0;
        int[] a = new int[m]; // current row
        int[] b = new int[m]; // previous row
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                if (s1.charAt(i) == s2.charAt(j)) {
                    if (i == 0 || j == 0) a[j] = 1;
                    else {
                        a[j] = b[j - 1] + 1;
                    }
                    ans = Math.max(ans, a[j]);
                } else {
                    a[j] = 0; // reset: the reused array still holds an older row
                }
            }
            int[] c = a;
            a = b;
            b = c;
        }
        return ans;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String s1 = sc.next();
        String s2 = sc.next();
        System.out.println(find(s1, s2));
    }
}
Time Complexity O(N * M), where N and M are the lengths of the two strings
Space Complexity O(M)
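The rolling-array DP above returns only the length. As a sketch of how the same table can also recover the substring itself and honour the question's tie-break (the answer that starts earliest in b), here is a C++ version; the function name `longestCommonSubstring` is made up. It still does work proportional to the product of the two lengths, so it is meant for illustration and would be too slow for inputs near 10^6, where the suffix-structure approach is the realistic option.

```cpp
#include <string>
#include <utility>
#include <vector>

// Longest common substring of a and b; among equally long answers, the one
// that starts earliest in b. Returns "" if the strings share no character.
std::string longestCommonSubstring(const std::string& a, const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::size_t> prev(m + 1, 0), cur(m + 1, 0);
    std::size_t bestLen = 0, bestStart = 0;
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // cur[j] = length of the longest common suffix of a[0..i) and b[0..j)
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            const std::size_t start = j - cur[j];  // where that suffix begins in b
            if (cur[j] > bestLen ||
                (cur[j] == bestLen && cur[j] > 0 && start < bestStart)) {
                bestLen = cur[j];
                bestStart = start;
            }
        }
        std::swap(prev, cur);  // every cur[j], j >= 1, is rewritten next row
    }
    return b.substr(bestStart, bestLen);
}
```

For "CLCL" and "LCLC" both "CLC" and "LCL" have length 3; the tie-break picks "LCL" because it starts at index 0 of the second string.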

package main

import "fmt"

func main() {
	fmt.Println(lcs("CLCL", "LCLC"))
}

// lcs returns the length of the longest common substring of s1 and s2,
// and one such substring. It compares bytes, so it assumes ASCII input.
func lcs(s1, s2 string) (max int, str string) {
	// mnMatrix[i][j] is the length of the longest common suffix of
	// s1[:i+1] and s2[:j+1]; sized from the inputs rather than fixed 4x4.
	mnMatrix := make([][]int, len(s1))
	for i := range mnMatrix {
		mnMatrix[i] = make([]int, len(s2))
	}
	for i := 0; i < len(s1); i++ {
		for j := 0; j < len(s2); j++ {
			if s1[i] != s2[j] {
				continue // entry stays 0
			}
			if i == 0 || j == 0 {
				mnMatrix[i][j] = 1
			} else {
				mnMatrix[i][j] = mnMatrix[i-1][j-1] + 1
			}
			// only record a strictly longer match
			if mnMatrix[i][j] > max {
				max = mnMatrix[i][j]
				str = s2[j+1-max : j+1]
			}
		}
	}
	return max, str
}

Longest common prefix for n string

Given n strings of max length m, how can we find the longest common prefix shared by at least two of them?
Example: ['flower', 'flow', 'hello', 'fleet']
Answer: fl
I was thinking of building a trie of all the strings and then finding the deepest node (satisfies longest) that branches out to two or more strings (satisfies commonality). This takes O(n*m) time and space. Is there a better way to do this?

Why use a trie (which takes O(mn) time and O(mn) space)? Just use the basic brute-force way. In the first loop, find the shortest string as minStr, which takes O(n) time. In the second loop, compare each string with minStr character by character, keeping a variable that holds the end index of the common prefix within minStr; this loop takes O(mn), where m is the length of the shortest string. The code is below:
public String longestCommonPrefix(String[] strs) {
    if (strs.length == 0) return "";
    String minStr = strs[0];
    for (int i = 1; i < strs.length; i++) {
        if (strs[i].length() < minStr.length())
            minStr = strs[i];
    }
    int end = minStr.length();
    for (int i = 0; i < strs.length; i++) {
        int j;
        for (j = 0; j < end; j++) {
            if (minStr.charAt(j) != strs[i].charAt(j))
                break;
        }
        if (j < end)
            end = j;
    }
    return minStr.substring(0, end);
}

there is an O(|S|*n) solution to this problem, using a trie. [n is the number of strings, S is the longest string]
(1) put all strings in a trie
(2) do a DFS in the trie, until you find the first vertex with more than 1 "edge".
(3) the path from the root to the node you found at (2) is the longest common prefix.
There is no faster solution possible [in terms of big-O notation]: in the worst case, all your strings are identical, and you need to read all of them to know it.

I would sort them, which you can do in n lg n time. Then any strings with common prefixes will be right next to each other. In fact you should be able to keep a pointer to the index you're currently looking at and work your way down for a pretty speedy computation.
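A minimal C++ sketch of that, taking the question literally as "the longest prefix shared by at least two of the strings": after sorting, the pair with the longest common prefix is always adjacent, so one pass over neighbours is enough. The function name `longestSharedPrefix` is made up.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Longest prefix shared by at least two of the strings ("" if none).
std::string longestSharedPrefix(std::vector<std::string> strs) {
    std::sort(strs.begin(), strs.end());
    std::string best;
    for (std::size_t i = 1; i < strs.size(); ++i) {
        const std::string& x = strs[i - 1];
        const std::string& y = strs[i];
        std::size_t k = 0;  // length of the common prefix of the two neighbours
        while (k < x.size() && k < y.size() && x[k] == y[k]) ++k;
        if (k > best.size()) best = x.substr(0, k);
    }
    return best;
}
```

Note that under this literal reading the question's list gives "flow" (shared by 'flower' and 'flow'), not "fl", which is the prefix shared by three of the four strings; if a different threshold is intended, the comparison would have to span more than two neighbours.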

As a completely different answer from my other answer...
You can, with one pass, bucket every string based on its first letter.
With another pass you can sort each bucket based on its second letter. (This is known as radix sort, which is O(n*m) overall and O(n) for each pass.) This gives you a baseline prefix of 2.
You can safely remove from your dataset any elements that do not have a prefix of 2.
You can continue the radix sort, removing elements without a shared prefix of p, as p approaches m.
This will give you the same O(n*m) time that the trie approach does, but will always be faster than the trie since the trie must look at every character in every string (as it enters the structure), while this approach is only guaranteed to look at 2 characters per string, at which point it culls much of the dataset.
The worst case is still that every string is identical, which is why it shares the same big-O notation, but it will be faster in all other cases, as it is guaranteed to use fewer comparisons: in any non-worst case there are characters that never need to be visited.

public String longestCommonPrefix(String[] strs) {
    if (strs == null || strs.length == 0)
        return "";
    char[] c_list = strs[0].toCharArray();
    int len = c_list.length;
    int j = 0;
    for (int i = 1; i < strs.length; i++) {
        for (j = 0; j < len && j < strs[i].length(); j++)
            if (c_list[j] != strs[i].charAt(j))
                break;
        len = j;
    }
    return new String(c_list).substring(0, len);
}

It happens that the bucket sort (radix sort) described by corsiKa can be extended so that every string eventually ends up alone in a bucket; at that point, the LCP for that lone string is known. Further, the shustring (shortest unique substring) of each string is also known: it is one character longer than the LCP. The bucket sort is de facto the construction of a suffix array, but only partially so. The comparisons that are not performed (as described by corsiKa) represent the portions of the suffix strings that are not added to the suffix array. Finally, this method allows one to determine not just the LCP and shustrings, but also to easily find those subsequences that are not present within the string.

Since the world is obviously begging for an answer in Swift, here's mine ;)
func longestCommonPrefix(strings:[String]) -> String {
    var commonPrefix = ""
    var indices = strings.map { $0.startIndex}
    outerLoop:
    while true {
        var toMatch: Character = "_"
        for (whichString, f) in strings.enumerate() {
            let cursor = indices[whichString]
            if cursor == f.endIndex { break outerLoop }
            indices[whichString] = cursor.successor()
            if whichString == 0 { toMatch = f[cursor] }
            if toMatch != f[cursor] { break outerLoop }
        }
        commonPrefix.append(toMatch)
    }
    return commonPrefix
}
Swift 3 Update:
func longestCommonPrefix(strings:[String]) -> String {
    // An empty input would otherwise loop forever, appending the placeholder
    guard !strings.isEmpty else { return "" }
    var commonPrefix = ""
    var indices = strings.map { $0.startIndex }
    outerLoop:
    while true {
        var toMatch: Character = "_"
        for (whichString, f) in strings.enumerated() {
            let cursor = indices[whichString]
            if cursor == f.endIndex { break outerLoop }
            indices[whichString] = f.characters.index(after: cursor)
            if whichString == 0 { toMatch = f[cursor] }
            if toMatch != f[cursor] { break outerLoop }
        }
        commonPrefix.append(toMatch)
    }
    return commonPrefix
}
What's interesting to note:
this runs in O(n x m), where n is the number of strings and m is the length of the shortest one.
this uses the String.Index data type and thus deals with grapheme clusters, which the Character type represents.
And given the function I needed to write in the first place:
/// Takes an array of Strings representing absolute paths of file system
/// objects and turns it into a new array with the minimum number of common
/// ancestors, pushing the root of the tree as many levels down as necessary.
///
/// In other words, we compute the longest common prefix and remove it
func reify(fullPaths:[String]) -> [String] {
    let lcp = longestCommonPrefix(fullPaths)
    return fullPaths.map {
        return $0[lcp.endIndex ..< $0.endIndex]
    }
}
Here is a minimal unit test:
func testReifySimple() {
    let samplePaths:[String] = [
        "/root/some/file",
        "/root/some/other/file",
        "/root/another/file",
        "/root/direct.file"
    ]
    let expectedPaths:[String] = [
        "some/file",
        "some/other/file",
        "another/file",
        "direct.file"
    ]
    let reified = PathUtilities().reify(samplePaths)
    for (index, expected) in expectedPaths.enumerate() {
        XCTAssert(expected == reified[index], "failed match, \(expected) != \(reified[index])")
    }
}

Perhaps a more intuitive solution: feed the prefix found in the earlier iteration back in as one of the two inputs when processing the next string, i.e. [[[w1, w2], w3], w4] and so on, where [x, y] denotes the LCP of two strings.
public String findPrefixBetweenTwo(String A, String B) {
    for (int i = 0; i < A.length() && i < B.length(); i++) {
        if (A.charAt(i) != B.charAt(i)) {
            return A.substring(0, i);
        }
    }
    // One string is a prefix of the other, or they are equal: return the shorter one.
    return (A.length() > B.length()) ? B : A;
}
public String longestCommonPrefix(ArrayList<String> A) {
    if (A == null || A.isEmpty()) return ""; // avoid A.get(0) on an empty list
    String prefix = A.get(0);
    for (int i = 1; i < A.size(); i++) {
        prefix = findPrefixBetweenTwo(prefix, A.get(i)); // chain the earlier prefix
    }
    return prefix;
}

How can I compute the number of characters required to turn a string into a palindrome?

I recently found a contest problem that asks you to compute the minimum number of characters that must be inserted (anywhere) in a string to turn it into a palindrome.
For example, given the string: "abcbd" we can turn it into a palindrome by inserting just two characters: one after "a" and another after "d": "adbcbda".
This seems to be a generalization of a similar problem that asks for the same thing, except characters can only be added at the end - this has a pretty simple solution in O(N) using hash tables.
I have been trying to modify the Levenshtein distance algorithm to solve this problem, but haven't been successful. Any help on how to solve this (it doesn't necessarily have to be efficient, I'm just interested in any DP solution) would be appreciated.

Note: This is just a curiosity. Dav proposed an algorithm that can easily be turned into a DP algorithm running in O(n^2) time and O(n^2) space (and perhaps O(n) with better bookkeeping).
Of course, this 'naive' algorithm might actually come in handy if you decide to change the allowed operations.
Here is a 'naive'-ish algorithm, which can probably be made faster with clever bookkeeping.
Given a string, we guess the middle of the resulting palindrome and then compute the number of inserts required to make the string a palindrome around that middle.
If the string is of length n, there are 2n+1 possible middles (each character, each gap between two characters, just before the string and just after it).
Suppose we consider a middle which gives us two strings L and R (one to the left and one to the right).
If we are using inserts, I believe the Longest Common Subsequence algorithm (which is a DP algorithm) can now be used to create a 'super' string which contains both L and the reverse of R (see Shortest common supersequence). The number of inserts for that middle is then twice the length of the super string minus the lengths of L and R.
Pick the middle which gives you the smallest number of inserts.
This is O(n^3) I believe. (Note: I haven't tried proving that it is true).
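The middle-guessing idea above can be sketched as follows. This is a minimal Python sketch rather than a tuned implementation; the helper names `lcs_length` and `min_insertions_by_middle` are hypothetical, and it relies on the shortest common supersequence of two strings having length len(a) + len(b) - LCS(a, b).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def min_insertions_by_middle(s):
    """Try all 2n+1 middles and return the smallest number of inserts found."""
    n = len(s)
    # Middles that fall in a gap (including before and after the string).
    splits = [(s[:i], s[i:]) for i in range(n + 1)]
    # Middles that fall on a character, which then needs no partner.
    splits += [(s[:i], s[i + 1:]) for i in range(n)]
    best = n  # safe upper bound
    for left, right in splits:
        rev_right = right[::-1]
        # Half of the palindrome must contain both `left` and reversed `right`.
        scs = len(left) + len(rev_right) - lcs_length(left, rev_right)
        best = min(best, 2 * scs - len(left) - len(right))
    return best
```

For "abcbd", the middle on 'c' gives L = "ab" and R = "bd", whose reverse shares only "b" with L, so two inserts are needed, matching the example in the question.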

My C# solution looks for repeated characters in a string and uses them to reduce the number of insertions. In a word like program, I use the 'r' characters as a boundary. I turn the text between the 'r's into a palindrome (recursively). Outside of the 'r's, I mirror the characters on the left and the right.
Some inputs have more than one shortest result: the input "output" can become toutptuot or outuputuo. My solution only selects one of the possibilities.
Some example runs:
radar -> radar, 0 insertions
esystem -> metsystem, 2 insertions
message -> megassagem, 3 insertions
stackexchange -> stegnahckexekchangets, 8 insertions
First I need to check if an input is already a palindrome:
public static bool IsPalindrome(string str)
{
    for (int left = 0, right = str.Length - 1; left < right; left++, right--)
    {
        if (str[left] != str[right])
            return false;
    }
    return true;
}
Then I need to find any repeated characters in the input. There may be more than one. The word message has two most-repeated characters ('e' and 's'):
private static bool TryFindMostRepeatedChar(string str, out List<char> chs)
{
    chs = new List<char>();
    int maxCount = 1;
    var dict = new Dictionary<char, int>();
    foreach (var item in str)
    {
        int temp;
        if (dict.TryGetValue(item, out temp))
        {
            dict[item] = temp + 1;
            // Keep the highest count seen so far, not just the latest one
            maxCount = Math.Max(maxCount, temp + 1);
        }
        else
            dict.Add(item, 1);
    }
    foreach (var item in dict)
    {
        if (item.Value == maxCount)
            chs.Add(item.Key);
    }
    return maxCount > 1;
}
My algorithm is here:
public static string MakePalindrome(string str)
{
    List<char> repeatedList;
    if (string.IsNullOrWhiteSpace(str) || IsPalindrome(str))
    {
        return str;
    }
    //If an input has repeated characters,
    // use them to reduce the number of insertions
    else if (TryFindMostRepeatedChar(str, out repeatedList))
    {
        string shortestResult = null;
        foreach (var ch in repeatedList) //"program" -> { 'r' }
        {
            //find boundaries
            int iLeft = str.IndexOf(ch); // "program" -> 1
            int iRight = str.LastIndexOf(ch); // "program" -> 4
            //make a palindrome of the inside chars
            string inside = str.Substring(iLeft + 1, iRight - iLeft - 1); // "program" -> "og"
            string insidePal = MakePalindrome(inside); // "og" -> "ogo"
            string right = str.Substring(iRight + 1); // "program" -> "am"
            string rightRev = Reverse(right); // "program" -> "ma"
            string left = str.Substring(0, iLeft); // "program" -> "p"
            string leftRev = Reverse(left); // "p" -> "p"
            //Shave off extra chars in rightRev and leftRev
            // When input = "message", this converts "meegassageem" to "megassagem"
            // ("ee" to "e"). Only the longest overlap between the end of left and the
            // start of rightRev may be shaved, otherwise the result is not a palindrome.
            int overlap = Math.Min(left.Length, rightRev.Length);
            while (overlap > 0 &&
                !left.EndsWith(rightRev.Substring(0, overlap), StringComparison.Ordinal))
            {
                overlap--;
            }
            rightRev = rightRev.Substring(overlap);
            leftRev = leftRev.Substring(overlap);
            //piece together the result
            string result = left + rightRev + ch + insidePal + ch + right + leftRev;
            //find the shortest result for inputs that have multiple repeated characters
            if (shortestResult == null || result.Length < shortestResult.Length)
                shortestResult = result;
        }
        return shortestResult;
    }
    else
    {
        //For inputs that have no repeated characters,
        // just mirror the characters using the last character as the pivot.
        for (int i = str.Length - 2; i >= 0; i--)
        {
            str += str[i];
        }
        return str;
    }
}
Note that you need a Reverse function:
public static string Reverse(string str)
{
    string result = "";
    for (int i = str.Length - 1; i >= 0; i--)
    {
        result += str[i];
    }
    return result;
}

C# recursive solution that mirrors unmatched leading characters onto the end of the string:
There are two base cases: length 0 or 1, and length 2 with equal characters. Recursive case: if the extremes are equal, then
make a palindrome of the inner string (without the extremes) and return that wrapped in the extremes.
If the extremes are not equal, then add the first character to the end and make a palindrome of the
inner string, which now includes the previous last character. Return that.
Note that when the extremes are equal, the recursion can insert characters just before the final character (for example "abca" becomes "abcba"), so the additions are not strictly at the very end.
public static string ConvertToPalindrome(string str) // Mirrors unmatched leading characters toward the end
{
    if (str.Length <= 1) return str; // base case 1 (also covers the empty string)
    if (str.Length == 2 && str[0] == str[1]) return str; // base case 2
    else
    {
        if (str[0] == str[str.Length - 1]) // keep the extremes and call
            return str[0] + ConvertToPalindrome(str.Substring(1, str.Length - 2)) + str[str.Length - 1];
        else //Add the first character at the end and call
            return str[0] + ConvertToPalindrome(str.Substring(1, str.Length - 1)) + str[0];
    }
}

Selecting an optimum set according to ranked criteria - algorithm

This smells like a dynamic programming problem. You can find a number of good sources on it, but the gist is that you generate a collection of subproblems, and then build up "larger" optimal solutions by combining optimal subsolutions.
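As one possible illustration of that idea, here is a minimal Python sketch for the first three criteria in the question. It assumes each valid substring is given as a half-open (start, end) position pair, and that "as few different substrings as possible" can be read as "as few chosen substrings as possible"; both are assumptions, and the name `best_cover` is hypothetical.

```python
def best_cover(text_len, spans):
    """Pick non-overlapping spans covering the most characters, then using the fewest spans."""
    # best[i] = (chars covered, -spans used, chosen spans) for the first i characters.
    best = [(0, 0, ())] * (text_len + 1)
    ends_at = {}
    for start, end in spans:
        ends_at.setdefault(end, []).append(start)
    for i in range(1, text_len + 1):
        candidate = best[i - 1]  # option: leave character i-1 uncovered
        for start in ends_at.get(i, []):
            covered, neg_count, chosen = best[start]
            option = (covered + i - start, neg_count - 1, chosen + ((start, i),))
            # Ranked criteria: more coverage first, then fewer spans.
            if option[:2] > candidate[:2]:
                candidate = option
        best[i] = candidate
    return list(best[text_len][2])
```

For the string abc with substrings a, ab and bc, i.e. spans (0, 1), (0, 2) and (1, 3), this returns [(0, 1), (1, 3)], which corresponds to [a, bc]. Each subproblem is "the best answer for the first i characters", so the work grows with the string length times the number of substrings instead of 2^50.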
