Challenging string algorithm on pattern matching from bioinformatics [closed]

A friend told me about the following challenge problem.
Given {A, T, G, C} as our alphabet, we want to know the number of valid phrases of a specified length n matching a pattern given by the following recursive definition:
pat=pat1pat2, i.e. concatenate two patterns together to form a new pattern pat.
pat=(pat1|pat2), i.e. choosing either one of the patterns pat1 or pat2 to form a new pattern pat.
pat=(pat1*), i.e. repeating pattern pat1 any number of times (can be 0) to form a new pattern pat.
A phrase over the alphabet {A, T, G, C} is said to satisfy a pattern if it can be generated by the pattern definition above; its length is its number of letters.
A few examples:
Given a pattern ((A|T|G)*) and n=2, the number of valid phrases is 9, since there are AA, AT, AG, TA, TT, TG, GA, GT, GG.
Given a pattern (((A|T)*)|((G|C)*)) and n=2, the number of valid phrases is 8, since there are AA, AT, TA, TT, GG, GC, CG, CC.
Given a pattern ((A*)C(G*)) and n=3, the number of valid phrases is 3, since there are AAC, ACG, CGG.
Please point me to the source of this problem if you have seen it before, and share your ideas for tackling it.

The choice of letters A, C, G, and T makes me think of DNA base-pair sequences. But as thiton wrote, clearly this problem was lifted from the study of regular languages. Google "regular language enumeration" and you should find plenty of research papers and code to get you started. I'd be surprised if computing the number of matching strings for these patterns were not a #P-complete problem, so expect run-times exponential in n.
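
Under the hood this is counting the words of a given length in a regular language. Here is a minimal Python sketch of the counting step only, assuming the pattern has already been compiled into a DFA (e.g. by Thompson's construction followed by subset construction, not shown here). Determinization is what makes the count exact: in a DFA every matching string corresponds to exactly one accepting path, so a simple dynamic program over path counts gives the number of distinct matching phrases of length n. The hand-built DFA for ((A|T|G)*) below is purely for illustration.

    def count_matches(dfa, start, accepting, n):
        """Count distinct strings of length n accepted by the DFA.

        dfa: dict mapping (state, symbol) -> next state
        start: initial state
        accepting: set of accepting states
        """
        # counts[s] = number of length-k strings that drive the DFA
        # from `start` to state s; begin with k = 0.
        counts = {start: 1}
        for _ in range(n):
            nxt = {}
            for (s, _sym), t in dfa.items():
                if s in counts:
                    nxt[t] = nxt.get(t, 0) + counts[s]
            counts = nxt
        return sum(c for s, c in counts.items() if s in accepting)

    # The DFA for ((A|T|G)*) is a single accepting state with a
    # self-loop on A, T and G (hand-built here for illustration).
    dfa = {(0, c): 0 for c in "ATG"}
    print(count_matches(dfa, 0, {0}, 2))  # -> 9, as in the first example

Note where the costs live: the dynamic program itself is polynomial in n for a fixed DFA, while the subset construction can blow up exponentially in the size of the pattern.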

Naive String Search Algorithm's best time complexity [closed]

While studying this algorithm, I found that the best-case time complexity is O(n), as the book says. But why is it not O(m)? I think the best case is when the pattern string matches at the very first position of the main string, so only m comparisons are needed.
PS: n is the length of the main string and m is the length of the pattern string.
When discussing string search algorithms, the task is most often understood as finding all occurrences. For example, Wikipedia's String-searching algorithm article says:
The goal is to find one or more occurrences of the needle within the haystack.
This is confirmed in Wikipedia's description of the Boyer-Moore string search algorithm, where it states:
The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs upon which the alignment is shifted forward (to the right) according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.
And again, for the Knuth–Morris–Pratt algorithm we find the same:
the Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S [...]
input:
an array of characters, S (the text to be searched)
an array of characters, W (the word sought)
output:
an array of integers, P (positions in S at which W is found)
an integer, nP (number of positions)
So even in your best case scenario the algorithm must continue the search after the initial match.
Yes, when you use bit-based (approximate) methods you can get O(n) complexity, but you cannot search in O(m). Suppose the main string has length 10^10 and all of its characters are 'A', and the pattern string is "B". How could you find "B" in that string in O(m) time, where m = 1?
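
For reference, here is a minimal Python sketch of the naive search that reports all occurrences, which is the task definition used above. Even when the pattern matches at the very first alignment, the outer loop still has to walk the remaining alignments of the text, which is why the best case is O(n) rather than O(m).

    def naive_search_all(text, pattern):
        n, m = len(text), len(pattern)
        positions = []
        for i in range(n - m + 1):       # every candidate alignment
            j = 0
            while j < m and text[i + j] == pattern[j]:
                j += 1
            if j == m:                   # full match at alignment i
                positions.append(i)
        return positions

    print(naive_search_all("abababa", "aba"))  # -> [0, 2, 4]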

What is the right way to calculate sub-sequences, subsets and sub-strings? [closed]

Often, I come across the following terminology in coding interviews.
Given an array or string, find the
sub-array
sub-sequence
sub-string
What differences do they have?
For example, I have seen that an integer array can be split into n*(n+1)/2 sub-arrays. Do they count as subsets as well? Must sub-arrays be contiguous?
For calculating the number of sub-sequences of a string, why do we use 2^str_length - 1?
After searching online, I ended up with this link
https://www.geeksforgeeks.org/subarraysubstring-vs-subsequence-and-programs-to-generate-them/
But I am still unsure: what is the universal term for a part of an array/string, and how do we count them?
In general, arrays and strings are both sequences. The "sequence" part indicates that the order of elements matters. A "substring" is usually contiguous; "sub-array" and "sub-sequence" are more ambiguous. If you're in a job interview and not certain of the interpretation, your first job is to ask. Sometimes, part of the job interview is making sure you can spot and resolve ambiguities.
UPDATE after question update
I find the referenced page quite clear.
First, note that string and array are both specific types of a sequence.
A subsequence is the generic term: elements of the original sequence appearing in the same order as in the original, but not necessarily contiguously. For instance, given the sequence "abcdefg", we have sub-sequences "a", "ag", "bce", etc.
Strings whose elements are repeated or otherwise not in the original order, such as "ga", "bb", "bcfe", etc., are not sub-sequences.
"Subset" is a separate type. In a set, repeated elements do not exist, and ordering does not matter.
Does that clear up your problems?
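
Here is a minimal Python sketch contrasting the two counts discussed above: a sequence of length n has n*(n+1)/2 non-empty contiguous sub-arrays (pick a start and an end), and 2^n - 1 non-empty sub-sequences (each element is either kept or dropped, minus the all-dropped choice).

    def subarrays(seq):
        # all contiguous slices: choose a start i and an end j > i
        return [seq[i:j]
                for i in range(len(seq))
                for j in range(i + 1, len(seq) + 1)]

    def subsequences(seq):
        # each element is independently kept or dropped via a bitmask
        return [[x for k, x in enumerate(seq) if mask >> k & 1]
                for mask in range(1, 2 ** len(seq))]

    s = "abc"
    print(len(subarrays(s)))     # 6 == 3 * 4 / 2
    print(len(subsequences(s)))  # 7 == 2**3 - 1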

Figure out the order of a list of chars [closed]

English has 26 letters (a, b, c, d, ..., z) and they have a fixed order: b comes after a, c comes after b, and so on.
Suppose we have another language with its own set of characters. All the characters have a total order, just like the letters in English.
However, we don't know the total order of all the characters yet.
We are given a list of words; in each word, the characters are already sorted.
Use a suitable data structure and algorithm to deduce the total order of all the characters.
For example, we have the chars #, £, $, %, and we don't know their order in the language.
We are given a list of words
£ %
# %
$ #
£ $
Then we can get the total order £ $ # %.
Construct a directed graph containing all characters as vertices.
Create an edge from each character to each character directly following that character in any word. For example, if you have a word # % ^, you'd have edges # -> % and % -> ^.
Run a topological sort on the graph to get the correct order.
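
A minimal Python sketch of that approach, using Kahn's algorithm for the topological sort; the word list is the one from the question.

    from collections import defaultdict, deque

    def char_order(words):
        graph = defaultdict(set)
        indegree = {c: 0 for w in words for c in w}
        for word in words:
            for a, b in zip(word, word[1:]):   # a directly precedes b
                if b not in graph[a]:
                    graph[a].add(b)
                    indegree[b] += 1
        queue = deque(c for c, d in indegree.items() if d == 0)
        order = []
        while queue:
            c = queue.popleft()
            order.append(c)
            for nxt in graph[c]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    queue.append(nxt)
        # result is shorter than the alphabet if the constraints contain a cycle
        return order

    print(char_order(["£%", "#%", "$#", "£$"]))  # -> ['£', '$', '#', '%']

If several characters have in-degree zero at the same time, the constraints admit more than one total order, and the algorithm returns one of them.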

Finding number of anagrams [closed]

The following question was asked to me in one of my interviews.
We know the anagrams of eat are tea and ate.
The question is:
We have a program. We feed a list of 10 thousand words to this program.
We run the program.
Now, at run time, we provide a word to this program, e.g. "eat".
The program should return the number of anagrams of that word that exist in the list of 10 thousand words. Hence, for the input "eat", it should return 2.
What strategy should we use for storing those 10 thousand words so that finding the number of anagrams becomes easy?
Sort the letters of each word into their minimal (lexicographic) order, i.e. tea becomes aet.
Then simply put these in a (hash) map from canonical words to counts (both tea and ate map to aet, so we'll have (aet, 2) in the map).
Then, when you get a query word, reorder its letters as above and look up the count.
Running time:
Assuming n words in the list, with an average word length of m...
Expected O(nm log m) preprocessing, expected O(m log m) per query.
It's m log m on the assumption we just do a simple sort of the letters of a word.
The time taken per query is expected to be unaffected by the number of words in the list (i.e. hash maps give expected O(1) lookup time).
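
A minimal Python sketch of this scheme: canonicalize each word by sorting its letters, count the canonical forms in a hash map, then answer each query with a single lookup. The sample word list is made up for illustration.

    from collections import Counter

    def build_index(words):
        # canonical form: the word's letters in sorted order
        return Counter("".join(sorted(w)) for w in words)

    def count_anagrams(index, word):
        return index["".join(sorted(word))]   # missing keys count as 0

    index = build_index(["tea", "ate", "tan", "nat"])
    print(count_anagrams(index, "eat"))  # -> 2, as in the question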

Similarity of two whole texts using Levenshtein distance [closed]

I have two text files which I'd like to compare. What I did is:
I split both of them into sentences.
I measured the Levenshtein distance between each sentence from one file and each sentence from the second file.
I'd like to calculate the average similarity between those two text files, but I'm having trouble producing any meaningful value - obviously the arithmetic mean (the sum of all the [normalized] distances divided by the number of comparisons) is a bad idea.
How should I interpret such results?
Edit:
The distance values are normalized.
The Levenshtein distance has a maximum value, namely the length of the longer of the two input strings. It cannot get worse than that. So a normalized similarity index (0 = no similarity, 1 = exact match) for two strings a and b can be calculated as 1 - distance(a, b) / max(a.length, b.length).
Take one sentence from file A. You said you'd compare it to each sentence of file B. I guess you are looking for the sentence in B which has the smallest distance (i.e. the highest similarity index).
Simply calculate the average of all those best-match similarity indexes. This should give you a rough estimate of the similarity of the two texts.
But what makes you think that two texts which are similar would have their sentences shuffled? My personal opinion is that you should also introduce stop-word lists, synonyms and so on.
Nevertheless: please also look into trigram matching, which might be another good approach to what you are looking for.
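
A minimal Python sketch of the scheme described above: a normalized per-pair similarity index, and for each sentence of file A the best match in file B, averaged over A. The sample sentences are made up for illustration.

    def levenshtein(a, b):
        # classic dynamic-programming edit distance, one row at a time
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def similarity(a, b):
        # 1 = exact match, 0 = nothing in common
        if not a and not b:
            return 1.0
        return 1 - levenshtein(a, b) / max(len(a), len(b))

    def text_similarity(sentences_a, sentences_b):
        # average of each A-sentence's best similarity over all of B
        return sum(max(similarity(a, b) for b in sentences_b)
                   for a in sentences_a) / len(sentences_a)

    print(text_similarity(["the cat sat", "dogs bark"],
                          ["the cat sat down", "dogs often bark"]))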
