Find if a string can be obtained from a matrix of characters - algorithm

Given a matrix of characters and a string, find whether the string can be obtained from the matrix. From each character in the matrix, we can move up/down/right/left. For example, if the matrix[3][4] is:
o f a s
l l q w
z o w k
and the string is follow, then the function should return true.
The only approach I can think of is a backtracking algorithm that searches whether the word is possible or not. Is there any other faster algorithm to approach this problem?
And suppose I have a lot of queries (on finding whether a word exists or not). Then can there be some preprocessing done to answer the queries faster?

You can solve this using DFS. Let's define a graph for the problem. Each vertex of the graph is a combination of a cell of the matrix and the length of a prefix of the string we are searching for. Being at a given vertex means that all the characters of the specified prefix have been matched so far and that we are currently at the given cell.
We define edges as connecting cells adjacent by a side and making a "valid" transition. That is, the cell we are going to must contain the next character of the string we are searching for.
To solve the problem we do a DFS from all cells that contain the first letter of the string, with prefix length 1 (meaning we've matched this first letter). From there on we continue the search, and on each step we compute the edges going out of the current position (cell/string prefix length combination). We terminate the first time we reach a prefix of length L, the length of the string.
Note that DFS may be considered backtracking, but what is more important is to keep track of the nodes in the graph we've already visited. Thus the overall complexity is bounded by N * M * L, where N and M are the dimensions of the matrix and L is the length of the string. (This bound relies on a cell being allowed to appear more than once in a path; if each cell may be used only once, the visited state also depends on the path taken.)
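The state graph described above can be sketched as follows. This is a minimal Python sketch, not the answerer's code: the name can_spell is made up for illustration, the grid is assumed to be a non-empty list of equal-length strings, and a cell is assumed to be reusable within one path (which is what makes the N * M * L bound hold).

```python
def can_spell(grid, word):
    """Search the graph whose vertices are (row, col, matched prefix length)."""
    if not word:
        return True
    rows, cols = len(grid), len(grid[0])
    # Start from every cell holding the first letter, with prefix length 1.
    stack = [(r, c, 1) for r in range(rows) for c in range(cols)
             if grid[r][c] == word[0]]
    seen = set(stack)
    while stack:
        r, c, length = stack.pop()
        if length == len(word):
            return True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            # A valid transition leads to a neighbour holding the next letter.
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == word[length]:
                state = (nr, nc, length + 1)
                if state not in seen:
                    seen.add(state)
                    stack.append(state)
    return False
```

Each (cell, prefix length) state is pushed at most once, so at most N * M * L states are processed for one query.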

You could of course find all possible strings (start with a character and go as far as you can). This can be done with a recursive function.
grid:
abc
def
ghi
strings:
abcfedghi
abcfehgd
abcfehi
abedghif
abefc
abefighd
abehgd
abehifc
ad...
...
Then sort these strings and when looking for a word use a binary search on the list. (When looking for an n letter word you would of course only consider the first n letters of the strings in the list.) A lot of preparation and much memory needed, but searching will be fast. So if you use the same grid again and again, the preparation may finally pay :-)
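A rough sketch of this preprocessing idea, under some assumptions: a cell is never used twice within one path, the grid is small enough that all maximal paths fit in memory, and the helper names maximal_paths and contains are made up for illustration.

```python
from bisect import bisect_left

def maximal_paths(grid):
    """Collect every string read along a path that cannot be extended further."""
    rows, cols = len(grid), len(grid[0])
    found = set()

    def extend(r, c, used, text):
        extended = False
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in used:
                extended = True
                used.add((nr, nc))
                extend(nr, nc, used, text + grid[nr][nc])
                used.remove((nr, nc))
        if not extended:
            found.add(text)  # dead end: the path is maximal

    for r in range(rows):
        for c in range(cols):
            extend(r, c, {(r, c)}, grid[r][c])
    return sorted(found)

def contains(sorted_paths, word):
    """Binary search for the first stored string >= word, then compare prefixes."""
    pos = bisect_left(sorted_paths, word)
    return pos < len(sorted_paths) and sorted_paths[pos].startswith(word)
```

Every word that can be read from the grid is a prefix of some maximal path, which is why a prefix comparison at the bisection point is enough.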

Below is C code (C99, using variable-length array parameters) for finding if the given string is present in a given matrix. Here visited keeps track of the location of the string in the matrix, and backtracking is used to maintain it. Each cell is used at most once per path. I hope this is helpful.
#include <stdbool.h>
#include <string.h>
// (i, j) is usable if it lies inside the n x m matrix and is not on the current path
bool isSafe(int n, int m, int visited[n][m], int i, int j){
    return i >= 0 && i < n && j >= 0 && j < m && visited[i][j] == 0;
}
bool dfs(int n, int m, char matrix[n][m], int visited[n][m], int i, int j, const char str[], int index){
    // row moves
    const int x[] = {-1, 0, 0, 1};
    // col moves
    const int y[] = {0, -1, 1, 0};
    if(str[index] != matrix[i][j])
        return false;
    // mark given position visited
    visited[i][j] = 1;
    // the last character has just been matched
    if(str[index + 1] == '\0')
        return true;
    // for all the neighbours
    for(int k = 0; k<4; k++){
        int next_x = i + x[k];
        int next_y = j + y[k];
        if(isSafe(n, m, visited, next_x, next_y)){
            if(dfs(n, m, matrix, visited, next_x, next_y, str, index+1))
                return true;
        }
    }
    // backtrack
    visited[i][j] = 0;
    return false;
}
bool isPresent(int n, int m, char matrix[n][m], const char str[]){
    // the empty string is trivially present
    if(str[0] == '\0')
        return true;
    // visited initialized to 0 (a variable-length array cannot use an initializer)
    int visited[n][m];
    memset(visited, 0, sizeof visited);
    for(int i=0;i<n;i++)
        for(int j=0;j<m;j++){
            if(dfs(n, m, matrix, visited, i, j, str, 0))
                return true;
        }
    return false;
}

Related

Majority element of JPEG images using Divide and conquer

An array is said to have a majority element if more than half of its elements are the same. Is there a divide-and-conquer algorithm for determining if an array has a majority element?
I normally do the following, but it is not using divide-and-conquer. I do not want to use the Boyer-Moore algorithm.
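int find(int[] arr, int size) {
    // mElement starts at -1 so that it is definitely assigned before use
    int count = 0, i, mElement = -1;
    // first pass: pair off differing elements to obtain a candidate
    for (i = 0; i < size; i++) {
        if (count == 0) mElement = arr[i];
        if (arr[i] == mElement) count++;
        else count--;
    }
    // second pass: verify that the candidate really is a majority
    count = 0;
    for (i = 0; i < size; i++) {
        if (arr[i] == mElement) count++;
    }
    if (count > size / 2) return mElement;
    return -1;
}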
I can see at least one divide and conquer method.
Start by finding the median, such as with Hoare's Select algorithm. If one value forms a majority of the elements, the median must have that value, so we've just found the value we're looking for.
From there, find (for example) the 25th and 75th percentile items. Again, if there's a majority element, at least one of those would need to have the same value as the median.
Assuming you haven't ruled out there being a majority element yet, you can continue the search. For example, let's assume the 75th percentile was equal to the median, but the 25th percentile wasn't.
We then continue searching for the item halfway between the 25th percentile and the median, as well as the one halfway between the 75th percentile and the end.
Continue finding the median of each partition that must contain the end of the elements with the same value as the median until you've either confirmed or denied the existence of a majority element.
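A minimal sketch of the first step of this idea, assuming a list of comparable values: it finds the median with a simple quickselect and then confirms the candidate with a plain count, rather than with the percentile search described above. The names select and majority_via_median are made up for illustration.

```python
import random

def select(values, k):
    """Return the k-th smallest element (0-based) using quickselect."""
    pivot = random.choice(values)
    lower = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    if k < len(lower):
        return select(lower, k)
    if k < len(lower) + len(equal):
        return pivot
    higher = [v for v in values if v > pivot]
    return select(higher, k - len(lower) - len(equal))

def majority_via_median(values):
    """Return the majority element, or None if there is none."""
    if not values:
        return None
    # A majority element occupies more than half of the sorted order,
    # so it must sit at the middle position.
    candidate = select(values, len(values) // 2)
    return candidate if 2 * values.count(candidate) > len(values) else None
```

Unlike the multiset approach in the next answer, this requires the elements to have an order.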
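As an aside: I don't quite see how Boyer-Moore would be used for this task. Boyer-Moore is a way of finding a substring in a string. (The question presumably means the Boyer-Moore majority vote algorithm, which is a different algorithm from the Boyer-Moore string search and is what the code in the question implements.)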
There is, and it does not require the elements to have an order.
To be formal, we're dealing with multisets (also called bags.) In the following, for a multiset S, let:
v(e;S) be the multiplicity of an element e in S, i.e. the number of times it occurs (the multiplicity is zero if e is not a member of S at all.)
#S be the cardinality of S, i.e. the number of elements in S counting multiplicity.
⊕ be the multiset sum: if S = L ⊕ R then S contains all the elements of L and R counting multiplicity, i.e. v(e;S) = v(e;L) + v(e;R) for any element e. (This also shows that the multiplicity can be calculated by 'divide-and-conquer'.)
[x] be the largest integer less than or equal to x.
The majority element m of S, if it exists, is that element such that 2 v(m;S) > #S.
Let's call L and R a splitting of S if L ⊕ R = S, and an even splitting if |#L - #R| ≤ 1. That is, if n = #S is even, L and R each have exactly half the elements of S, and if n is odd, then one has cardinality [n/2] and the other has cardinality [n/2]+1.
For an arbitrary split of S into L and R, two observations:
If neither L nor R has a majority element, then S cannot: for any element e, 2 v(e;S) = 2 v(e;L) + 2 v(e;R) ≤ #L + #R = #S.
If one of L and R has a majority element m with multiplicity k, then it is the majority element of S only if it has multiplicity r in the other half, with 2(k+r) > #S.
The algorithm majority(S) below returns either a pair (m,k), indicating that m is the majority element with k occurrences, or none:
If S is empty, return none; if S has just one element m, then return (m,1). Otherwise:
Make an even split of S into two halves L and R.
Let (m,k) = majority(L), if not none:
a. Let k' = k + v(m;R).
b. Return (m,k') if 2 k' > n.
Otherwise let (m,k) = majority(R), if not none:
a. Let k' = k + v(m;L).
b. Return (m,k') if 2 k' > n.
Otherwise return none.
Note that the algorithm is still correct even if the split is not an even one. Splitting evenly though is likely to perform better in practice.
Addendum
Made the terminal case explicit in the algorithm description above. Some sample C++ code:
#include <cstddef> // size_t
struct majority_t {
int m; // majority element
size_t k; // multiplicity of m; zero => no majority element
constexpr majority_t(): m(0), k(0) {}
constexpr majority_t(int m_,size_t k_): m(m_), k(k_) {}
explicit operator bool() const { return k>0; }
};
static constexpr majority_t no_majority;
size_t multiplicity(int x,const int *arr,size_t n) {
if (n==0) return 0;
else if (n==1) return arr[0]==x?1:0;
size_t r=n/2;
return multiplicity(x,arr,r)+multiplicity(x,arr+r,n-r);
}
majority_t majority(const int *arr,size_t n) {
if (n==0) return no_majority;
else if (n==1) return majority_t(arr[0],1);
size_t r=n/2;
majority_t left=majority(arr,r);
if (left) {
left.k+=multiplicity(left.m,arr+r,n-r);
if (left.k>r) return left;
}
majority_t right=majority(arr+r,n-r);
if (right) {
right.k+=multiplicity(right.m,arr,r);
if (right.k>r) return right;
}
return no_majority;
}
A simpler divide and conquer algorithm works for the case that there exists more than 1/2 elements which are the same and there are n = 2^k elements for some integer k.
FindMost(A, startIndex, endIndex)
{ // input array A
if (startIndex == endIndex) // base case
return A[startIndex];
x = FindMost(A, startIndex, (startIndex + endIndex - 1)/2);
y = FindMost(A, (startIndex + endIndex - 1)/2 + 1, endIndex);
if (x == null && y == null)
return null;
else if (x == null && y != null)
return y;
else if (x != null && y == null)
return x;
else if (x != y)
return null;
else return x
}
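This algorithm could be modified so that it works for n which is not a power of 2, but boundary cases must be handled carefully. Note also that the candidates are never verified by counting, so the pairwise elimination can discard a true majority. For 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3 (nine 1s out of 16), the left half yields null (its quarters return 2 and 1) and the right half yields null (its quarters return 1 and 3), so the result is null. Counting the occurrences of the surviving candidates at each level avoids this.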
Let's say the array is 1, 2, 1, 1, 3, 1, 4, 1, 6, 1.
If an array of even length contains the same element in more than half of its positions, then there must be a position where two consecutive elements are equal. (For odd lengths this can fail: 1, 2, 1 has a majority element but no two equal neighbours.)
In the above example, 1 occurs in more than half of the positions, and indices 2 and 3 (counting from 0) hold the same element.

dynamic programming reduction of brute force

An emoticon consists of an arbitrary positive number of underscores between two semicolons. Hence, the shortest possible emoticon is ;_;. The strings ;__; and ;_____________; are also valid emoticons.
Given a string containing only the characters ; and _, the problem is to divide the string into one or more emoticons and count how many divisions are possible. Each emoticon must be a subsequence of the message, and each character of the message must belong to exactly one emoticon. Note that the subsequences are not required to be contiguous.
The approach I thought of is to write a recursive method as follows:
countDivision(string s){
//base cases
if(s.empty()) return 1;
if(s.length()<=3){
if(s.length()!=3) return 0;
return s[0]==';' && s[1]=='_' && s[2]==';';
}
result=0;
//subproblems
for each valid emoticon that is a subsequence of s: remove it from s and let the remainder be w
result+=countDivision(w);
return result;
}
The solution above will easily timeout when n is large such as 100. What kind of approach should I use to convert this brute force solution to a dynamic programming solution?
A few examples
1. ";_;;_____;" ans is 2
2. ";;;___;;;" ans is 36
Example 1.
";_;;_____;" Returns: 2
There are two ways to divide this string into two emoticons.
One looks as follows: ;_;|;_____; and the other looks like
this (remember that we pick subsequences, which need not be contiguous): ;_ ;|; _____;
I'll describe an O(n^4)-time and -space dynamic programming solution (that can easily be improved to use just O(n^3) space) that should work for up to n=100 or so.
Call a subsequence "fresh" if it consists of a single ;.
Call a subsequence "finished" if it corresponds to an emoticon.
Call a subsequence "partial" if it has nonzero length and is a proper prefix of an emoticon. (So for example, ;, ;_, and ;___ are all partial subsequences, while the empty string, _, ;; and ;___;; are not.)
Finally, call a subsequence "admissible" if it is fresh, finished or partial.
Let f(i, j, k, m) be the number of ways of partitioning the first i characters of the string into exactly j+k+m admissible subsequences, of which exactly j are fresh, k are finished and m are partial. Notice that any prefix of a valid partition into emoticons determines i, j, k and m uniquely. This means that no prefix of a valid partition will be counted by more than one tuple (i, j, k, m), so if we can guarantee that, for each tuple (i, j, k, m), the partition prefixes within that tuple are all counted once and only once, then we can add together the counts for tuples to get a valid total. Specifically, the answer to the question will then be the sum over all 1 <= k <= n of f(n, 0, k, 0), i.e. over all states in which every subsequence is finished.
If s[i] = "_":
f(i, j, k, m) =
(j+1) * f(i-1, j+1, k, m-1) // Convert any of the j+1 fresh subsequences to partial
+ m * f(i-1, j, k, m) // Add _ to any of the m partial subsequences
Else if s[i] = ";":
f(i, j, k, m) =
f(i-1, j-1, k, m) // Start a fresh subsequence
+ (m+1) * f(i-1, j, k-1, m+1) // Finish any of the m+1 partial subsequences
We also need the base cases
f(0, 0, 0, 0) = 1
f(0, _, _, _) = 0
f(i, j, k, m) = 0 if any of i, j, k or m are negative
My own C++ implementation gives the correct answer of 36 for ;;;___;;; in a few milliseconds, and e.g. for ;;;___;;;_;_; it gives an answer of 540 (also in a few milliseconds). For a string consisting of 66 ;s followed by 66 _s followed by 66 ;s, it takes just under 2s and reports an answer of 0 (probably due to overflow of the long long).
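The recurrence above can be sketched directly as a memoized function. This is an illustrative Python version, not the C++ implementation mentioned in the answer; the name count_divisions is made up, the string is indexed from 1 as in the recurrence, j counts fresh, k finished and m partial subsequences, and the input is assumed to contain only ; and _.

```python
from functools import lru_cache

def count_divisions(s):
    n = len(s)

    @lru_cache(maxsize=None)
    def f(i, j, k, m):
        # j fresh, k finished, m partial subsequences after the first i characters
        if min(i, j, k, m) < 0:
            return 0
        if i == 0:
            return 1 if (j, k, m) == (0, 0, 0) else 0
        if s[i - 1] == "_":
            # convert one of j+1 fresh subsequences, or extend one of m partial ones
            return (j + 1) * f(i - 1, j + 1, k, m - 1) + m * f(i - 1, j, k, m)
        # s[i] is ";": start a fresh subsequence, or finish one of m+1 partial ones
        return f(i - 1, j - 1, k, m) + (m + 1) * f(i - 1, j, k - 1, m + 1)

    # every subsequence must be finished at the end
    return sum(f(n, 0, k, 0) for k in range(1, n + 1))
```

Python integers do not overflow, so long inputs are limited by running time and recursion depth rather than by the size of the counts.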
Here's a fairly straightforward memoized recursion that returns an answer immediately for a string of 66 ;s followed by 66 _s followed by 66 ;s. The function has three parameters: i = index in the string, j = number of accumulating emoticons with only a left semi-colon, and k = number of accumulating emoticons with a left semi-colon and one or more underscores.
An array is also constructed for how many underscores and semi-colons are available to the right of each index, to help decide on the next possibilities.
Complexity is O(n^3) and the problem constrains the search space, where j is at most n/2 and k at most n/4.
Commented JavaScript code:
var s = ';_;;__;_;;';
// record the number of semi-colons and
// underscores to the right of each index
var cs = new Array(s.length);
cs.push(0);
var us = new Array(s.length);
us.push(0);
for (var i=s.length-1; i>=0; i--){
if (s[i] == ';'){
cs[i] = cs[i+1] + 1;
us[i] = us[i+1];
} else {
us[i] = us[i+1] + 1;
cs[i] = cs[i+1];
}
}
// memoize
var h = {};
function f(i,j,k){
// memoization
var key = [i,j,k].join(',');
if (h[key] !== undefined){
return h[key];
}
// base case: a division only counts if no emoticon is left unfinished
if (i == s.length){
return (j == 0 && k == 0) ? 1 : 0;
}
var a = 0,
b = 0;
if (s[i] == ';'){
// if there are still enough semicolons to start an emoticon
if (cs[i] > j + k){
// start a new emoticon
a = f(i+1,j+1,k);
}
// close any of k partial emoticons
if (k > 0){
b = k * f(i+1,j,k-1);
}
}
if (s[i] == '_'){
// if there are still extra underscores
if (j < us[i] && k > 0){
// apply them to partial emoticons
a = k * f(i+1,j,k);
}
// convert started emoticons to partial
if (j > 0){
b = j * f(i+1,j-1,k+1);
}
}
return h[key] = a + b;
}
console.log(f(0,0,0)); // 52

Path of Length N in graph with constraints

I want to find the number of paths of length N in a graph whose vertices can be any natural number. However, two vertices are connected only if their product is less than some natural number P. If the product of two vertices is greater than P, then they are not connected and cannot be reached from one another.
I can obviously run two nested loops (<= P) and create an adjacency matrix, but P can be extremely large and this approach would be extremely slow. Can anyone think of some optimal approach to solve the problem? Can we solve it using Dynamic Programming?
I agree with Ante's recurrence, although I used a slightly simplified version. Note that I'm using the letter P to name the maximum product, as it is used in the original problem statement:
f(1,x) = 1
f(i,x) = sum(f(i-1, y) for y in {1, ..., floor(P/x)})
f(i,x) is the number of sequences of length i that end with x. The answer to the question is then f(n+1, 1).
Of course since P can be up to 10^9 in this task, a straightforward implementation with a DP table is out of the question. However, there are only up to m < 70000 possible different values of floor(P/i). So let's find the maximal segments aj ... bj, where floor(P/aj) = floor(P/bj). We can find those segments in O(number of segments * log P) using binary search.
Imagine the full DP table for f. Since there are only m different values for floor(P/x), every row of f consists of m contiguous ranges that have the same value.
So let's compute the compressed DP table, where we represent the rows as list of (length, value) pairs. We start with f(1) = [(P, 1)] and we can compute f(i+1) from f(i) by processing the segments in increasing order and computing prefix sums of the lengths stored in f(i).
The total runtime of my implementation of this approach is O(m (log P + n)). This is the code I used:
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>
using namespace std;
using ll=long long;
const int mod = 1000000007;
void add(int& x, ll y) { x = (x+y)%mod; }
int main() {
int n, P;
cin >> n >> P;
int x = 1;
vector<pair<int,int>> segments;
while(x <= P) {
int y = x+1, hi = P+1;
while(y<hi) {
int mid = (y+hi)/2;
if (P/mid < P/x) hi=mid;
else y=mid+1;
}
segments.push_back(make_pair(P/x, y-x));
x = y;
}
reverse(begin(segments), end(segments));
vector<pair<int,int>> dp;
dp.push_back(make_pair(P,1));
for (int i = 1; i <= n; ++i) {
int j = 0;
int sum_smaller = 0, cnt_smaller = 0;
vector<pair<int,int>> dp2;
for (auto it : segments) {
int value = it.first, cnt = it.second;
while (j < (int)dp.size() && cnt_smaller + dp[j].first <= value) { // stop before reading past the end of dp
cnt_smaller += dp[j].first;
add(sum_smaller,(ll)dp[j].first*dp[j].second);
j++;
}
int pref_sum = sum_smaller;
if (value > cnt_smaller)
add(pref_sum, (ll)(value - cnt_smaller)*dp[j].second);
dp2.push_back(make_pair(cnt, pref_sum));
}
dp = dp2;
reverse(begin(dp),end(dp));
}
cout << dp[0].second << endl;
}
I needed to do some micro-optimizations with the handling of the arrays to get AC, but those aren't really relevant, so I left them out.
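For comparison, the compressed-table idea from this answer can be sketched in a few lines of Python. This is a simplified variant, not a translation of the C++ code: it finds the segment boundaries with the identity hi = P // (P // lo) instead of binary search, and it relies on the fact that the set of quotients floor(P/x) equals the set of segment right endpoints. The name count_sequences is made up, and two vertices are assumed to be adjacent when their product is at most P, as in the recurrence.

```python
def count_sequences(n, P, mod=1_000_000_007):
    """Count sequences x1..xn of naturals with x_i * x_(i+1) <= P, modulo mod."""
    # Split 1..P into maximal ranges on which P // x is constant.
    lengths = []
    lo = 1
    while lo <= P:
        hi = P // (P // lo)
        lengths.append(hi - lo + 1)
        lo = hi + 1
    count = len(lengths)
    # f(1, x) = 1 for every x, so every segment starts with value 1.
    vals = [1] * count
    for _ in range(n - 1):
        prefix, running = [], 0
        for length, value in zip(lengths, vals):
            running = (running + length * value) % mod
            prefix.append(running)
        # The quotient of segment k is the right endpoint of segment count-1-k,
        # so the needed prefix sum is already available.
        vals = [prefix[count - 1 - k] for k in range(count)]
    return sum(length * value for length, value in zip(lengths, vals)) % mod
```

The returned value is the sum of f(n, x) over all x, which is the same quantity as f(n+1, 1) in the recurrence.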
If the number of vertices is small, then the adjacency matrix (A) can help, since the sum of the elements of A^N is the number of distinct paths of length N if paths are oriented. If they are not, then the number of paths is the sum of the elements / 2. This is because element (i,j) of A^N is the number of paths from vertex i to vertex j.
In this case, the same approach can be done by DP, using the reasoning that the number of paths of length n from vertex v is the sum of the numbers of paths of length n-1 from all its neighbours. The neighbours of vertex i are the vertices from 1 to floor(Q/i), where Q is the maximum allowed product. With that we can construct a function N(vertex, length) which represents the number of paths from a given vertex with a given length:
N(i, 1) = floor(Q/i),
N(i, n) = sum( N(j, n-1) for j in {1, ..., floor(Q/i)} ).
The number of all oriented paths of length n is sum( N(i, n) ) over all vertices i.
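For a small bound, the adjacency-matrix idea can be checked with a short sketch. It assumes vertices 1..P, an edge between i and j whenever i * j <= P (self-loops included, as in the recurrence above), and counts oriented paths with the given number of edges; the name count_walks is made up for illustration.

```python
def count_walks(P, length):
    """Sum of the entries of A**length, where A[i][j] = 1 iff (i+1)*(j+1) <= P."""
    size = P
    A = [[1 if (i + 1) * (j + 1) <= P else 0 for j in range(size)]
         for i in range(size)]

    def multiply(X, Y):
        return [[sum(X[i][t] * Y[t][j] for t in range(size))
                 for j in range(size)] for i in range(size)]

    # exponentiation by squaring, starting from the identity matrix
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    power, e = A, length
    while e:
        if e & 1:
            result = multiply(result, power)
        power = multiply(power, power)
        e >>= 1
    return sum(map(sum, result))
```

This takes on the order of P^3 log N operations, so it is only practical when P is small.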

Minimal cyclic shift algorithm explanation

I recently came across this code, which lacks any comments. It finds the minimal cyclic shift of a word (this code specifically returns its index in the string) and it's called Duval's algorithm. The only info I found describes the algorithm in a few words and has cleaner code. I would appreciate any help in understanding this algorithm. I have always found text algorithms pretty tricky and rather hard to understand.
int minLexCyc(const char *x) {
int i = 0, j = 1, k = 1, p = 1, a, b, l = strlen(x);
while(j+k <= (l<<1)) {
if ((a=x[(i+k-1)%l])>(b=x[(j+k-1)%l])) {
i=j++;
k=p=1;
} else if (a<b) {
j+=k;
k=1;
p=j-i;
} else if (a==b && k!=p) {
k++;
} else {
j+=p;
k=1;
}
}
return i;
}
First, I believe that your code has a bug in it. The last line should be
return p;. I believe that i holds the index of the lexicographically smallest cyclic shift, and p holds the smallest shift that matches. I also think that your stopping condition is too weak, i.e. you are doing too much checking after you have found a match, but I am not sure exactly what it should be.
Note that i and j only advance and that i is always less than j. We are looking for a string that matches the string starting at i, and we are trying to match it with a string that starts at j. We do this by comparing the k'th character of each string while increasing k (as long as they match). Note that we only change i if we determine that the string starting at j is lexicographically less than the string starting at i, and then we set i to j and reset k and p to their initial values.
I do not have time for a detailed analysis, but it looks like
i = the start of the lexicographic smallest cyclic shift
j = the start of the cyclic shift we are matching against the shift starting at i
k = the character in strings i and j currently under consideration (the strings match in positions 1 to k-1)
p = the cyclic shift under consideration (I believe p stands for prefix)
Edit Going further
this section of code
if ((a=x[(i+k-1)%l])>(b=x[(j+k-1)%l])) {
i=j++;
k=p=1;
Moves the start of the comparison to a lexicographically earlier string when we find one and reinitializes everything else.
this section
} else if (a<b) {
j+=k;
k=1;
p=j-i;
is the tricky part. We have found a mismatch that is lexicographically later than our reference string, so we skip to the end of the text matched so far, and start matching from there. We also increase p (our stride). Why can we skip over all the starting points between j and j + k? This is because the string starting with i is the lexicographically smallest seen, and if the tail of the current j string is greater than the string at i, then any suffix of the string at j will be greater than the string at i.
Finally
} else if (a==b && k!=p) {
k++;
} else {
j+=p;
k=1;
this just checks that the string of length p starting at i repeats.
Further edit
We do this by incrementing k until k == p, checking that the k'th character of the string starting at i equals the k'th character of the string starting at j. Once k reaches p we start scanning again at the next supposed occurrence of the string.
Even further edit to attempt to answer jethro's questions.
First: the k != p in else if (a==b && k!=p) Here we have a match in that the k'th and all previous characters in the strings starting at i and j are equal. The variable p represents the length that we think that the repeating string is. When k != p, actually k < p, so we are ensuring that the p characters at the string beginning at i are the same as the p characters of the string beginning at j. When k == p (the final else) we should be at a point where the string starting at j + k looks the same as the string starting at j, so we increase j by p and set k back to 1 and go back to comparing the two strings.
Second: Yes, I believe you are correct, it should return i. I was misunderstanding the meaning of "Minimum Cyclic Shift"
It may be the same as this algorithm, whose explanation can be found here:
int ComputeMaxSufPos(string w)
{
int i = 0, n = w.Length;
for (int j = 1; j < n; ++j)
{
int c, k = 0;
while ((c = w[(i + k) % n].CompareTo(w[(j + k) % n])) == 0 && k != n)
{ k++; }
j += c > 0 ? k / (j - i) * (j - i) : k;
i = c > 0 ? j : i;
}
return i;
}
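For experimenting with small inputs, here is a sketch of a different, commonly used two-pointer formulation of the minimal cyclic shift. It is not a translation of either listing above; the name least_rotation is made up. It uses the same skipping idea: after a mismatch at offset k, none of the k+1 starting positions just compared on the losing side can begin the smallest rotation.

```python
def least_rotation(s):
    """Return a start index of the lexicographically smallest rotation of s."""
    n = len(s)
    i, j, k = 0, 1, 0  # two candidate starts and the length matched so far
    while i < n and j < n and k < n:
        a = s[(i + k) % n]
        b = s[(j + k) % n]
        if a == b:
            k += 1
            continue
        if a > b:
            i += k + 1  # starts i..i+k all lose to the matching start after j
        else:
            j += k + 1  # starts j..j+k all lose to the matching start after i
        if i == j:
            j += 1
        k = 0
    return min(i, j)
```

Comparing its result against a brute-force minimum over all rotations is a quick way to test any of these variants.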

What is the complexity of creating a lexicographic tree

What is the complexity of creating a lexicographic tree?
If you create a prefix tree out of your input, you can perform this query in constant time.
Edit
The query is linear in the length of the search string. I meant that it was constant with regard to the size of the word list.
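A minimal prefix-tree sketch, assuming nested dictionaries for the child links; the class name Trie, its methods and the "$end" marker key are made up for illustration. Inserting a word of length L takes about L dictionary steps, so building the tree is linear in the total number of characters, and a prefix query depends only on the length of the prefix.

```python
class Trie:
    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # hypothetical marker: a complete word ends here

    def has_prefix(self, prefix):
        """True if at least one inserted word starts with prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def has_word(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$end" in node
```

The marker cannot collide with a child link because child keys are single characters.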
The appropriate data structure for this is probably a sorted list. In that case this becomes a bisection search problem, so O(log n).
As Gabe mentioned above, a trie is a good solution, but it's a little bit hard to implement for dictionaries with a large number of words. If an O(n log n) algorithm (sorting the dictionary once, then O(log n) string comparisons per query) is OK for you, you can solve this problem with binary search. Here is code written in C; it assumes that dict is sorted:
char dict[n][m]; // where n is number of words in dictionary and
// m is maximum possible length of word
char word[m]; // it's your word
int l = -1, r = n;
while(l+1 < r) {
int k = (l+r)/2;
if(strcmp(dict[k], word) < 0) l = k;
else r = k;
}
int len = strlen(word);
l++; // l is now the index of the first word that is >= word, or n if there is none
bool matches = (l < n && strlen(dict[l]) >= len);
for(int i = 0; i < len && matches; i++) {
if(word[i] != dict[l][i]) {
matches = 0;
}
}
if(matches) {
printf("given word is prefix of %dth word.", l);
} else {
printf("given word isn't in dictionary.");
}
Just run a simple loop and check whether each word starts with the given prefix.
Almost every language has a built-in function for checking whether one string starts with another.
The complexity is O(n) string comparisons, where n is the number of words in the dictionary.

Resources