Pattern matching algorithm for large database of DNA [closed] - algorithm

Are there any preexisting efficient algorithms for pattern matching in a large database of DNA?
There are many algorithms, such as Knuth-Morris-Pratt, Boyer-Moore, and index-based forward-backward multiple-pattern algorithms, but they perform very poorly when the data is large. So please help me learn about efficient algorithms for pattern matching.

Take a look at the BLAST algorithm.

I'm sure this must have been discussed elsewhere.
I guess you'll already store the strings using only two bits per character (rather than eight). Not only does this reduce the storage size by a factor of four (I guess your strings can be as long as hundreds of millions of characters), it also reduces the time needed to transfer the data (e.g. from disk to memory or from memory to the CPU cache).
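As an illustration, here is a minimal Python sketch of the two-bit packing idea; the encoding (A=0, C=1, G=2, T=3) and the function names are arbitrary choices of mine, not a standard.

_ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_DECODE = "ACGT"

def pack_dna(seq):
    """Pack a DNA string into bytes, four bases per byte (the last byte may be padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        b = 0
        for ch in group:
            b = (b << 2) | _ENCODE[ch]
        b <<= 2 * (4 - len(group))   # left-align a short final group
        out.append(b)
    return bytes(out)

def unpack_dna(data, n):
    """Recover the first n bases from the packed form."""
    bases = []
    for b in data:
        for shift in (6, 4, 2, 0):
            bases.append(_DECODE[(b >> shift) & 3])
    return "".join(bases[:n])

assert unpack_dna(pack_dna("ACGTGGA"), 7) == "ACGTGGA"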
The following assumes that there are a lot of queries while the strings to be searched remain the same, so calculating and storing some additional quantities based on the strings is justified.
I suggest you have a look at suffix trees, which can be constructed in linear time using Ukkonen's algorithm.
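Ukkonen's algorithm itself is fairly involved, so as a simpler stand-in here is a naive suffix-array sketch in Python (sort all suffixes, then binary-search for the pattern); the construction is far too slow for genome-scale data, but it shows the kind of substring query a suffix tree or suffix array answers. All names are illustrative.

def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix itself.
    For illustration only; use Ukkonen's, DC3, or SA-IS for large inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text via binary search on the suffix array."""
    n, m = len(sa), len(pattern)

    def lower_bound(target):
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + m] < target:
                lo = mid + 1
            else:
                hi = mid
        return lo

    lo = lower_bound(pattern)
    hi = lo
    while hi < n and text[sa[hi]:sa[hi] + m] == pattern:
        hi += 1
    return sorted(sa[lo:hi])

text = "GATTACAGATTACA"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ATTA"))   # [1, 8]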
If that is not feasible, maybe you should consider a hybrid approach: build a dictionary of all possible words up to a fixed length L and divide the string to be searched into regions.
for each word in the dictionary store the region indices in which they appear (include L-1 characters of the next region when building this list)
when searching for a word, split it into strings of length L and check in which regions these appear. Assuming that the maximum search string length does not exceed the length of a region, your search string can only appear in regions where all parts (of length L) of your search string appear (or are in the preceding/following region)
with a standard string searching algorithm, search only the resulting regions for your search string
(this is probably similar to what a Bloom filter does)
For the second approach, you'll need to tune the parameters (the length L of the dictionary words and the size/number of regions); a rough sketch is given below.
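Here is a rough Python sketch of that region-filter idea; for simplicity it checks every length-L window of the pattern rather than splitting the pattern into disjoint pieces, and verification uses an ordinary substring search. All names and parameter values are illustrative.

from collections import defaultdict

def build_region_index(text, L, region_size):
    """Map every length-L word to the set of regions in which it starts.
    A word is indexed under the region of its starting position, so words that
    spill over a region boundary still count for the region they start in."""
    index = defaultdict(set)
    for i in range(len(text) - L + 1):
        index[text[i:i + L]].add(i // region_size)
    return index

def candidate_regions(index, pattern, L):
    """Regions that could contain the pattern: every length-L window of the
    pattern must occur in the region or in the one immediately after it."""
    pieces = [pattern[i:i + L] for i in range(len(pattern) - L + 1)]
    result = set()
    for r in index.get(pieces[0], set()):
        if all(r in index.get(p, set()) or (r + 1) in index.get(p, set()) for p in pieces):
            result.add(r)
    return result

def search(text, pattern, index, L, region_size):
    """Verify candidates with an ordinary substring search restricted to each region
    (plus enough overlap to catch matches that cross into the next region)."""
    hits = []
    for r in sorted(candidate_regions(index, pattern, L)):
        start = r * region_size
        end = min(len(text), (r + 1) * region_size + len(pattern) - 1)
        pos = text.find(pattern, start, end)
        while pos != -1:
            hits.append(pos)
            pos = text.find(pattern, pos + 1, end)
    return hits

text = "ACGTACGTTTGACGTACCA" * 50
index = build_region_index(text, L=4, region_size=64)
print(search(text, "TTGACGTA", index, L=4, region_size=64))   # positions 8, 27, 46, ...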

Related

Native String Search Algorithm's best time complexity [closed]

When I was studying this algorithm, I found that the best-case time complexity is O(n), as the book says. But why is it not O(m)? I think the best case is: the pattern string matches at the main string's first position, so only m comparisons are needed.
P.S. n is the length of the main string and m is the length of the pattern string.
When discussing string search algorithms, it is most often understood as having to find all occurrences. For example, Wikipedia has in its String-searching algorithm article:
The goal is to find one or more occurrences of the needle within the haystack.
This is confirmed in Wikipedia's description of the Boyer-Moore string search algorithm, where it states:
The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs upon which the alignment is shifted forward (to the right) according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.
And again, for the Knuth–Morris–Pratt algorithm we find the same:
the Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S [...]
input:
an array of characters, S (the text to be searched)
an array of characters, W (the word sought)
output:
an array of integers, P (positions in S at which W is found)
an integer, nP (number of positions)
So even in your best case scenario the algorithm must continue the search after the initial match.
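As a small Python illustration of that point (the function name is mine), finding all occurrences forces the scan to continue past the first hit, so even the best case is proportional to n:

def naive_search_all(text, pattern):
    """Return every position at which pattern occurs in text. Even if the very
    first alignment matches, the loop still visits all n - m + 1 alignments."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            positions.append(i)
    return positions

print(naive_search_all("aaaa", "a"))   # [0, 1, 2, 3] -- a match at 0, but the scan continues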
Yes, when you use bit-based (approximate) matching you can have complexity O(n), but how could you find it in O(m)? Suppose your text is a string of length 10^10 in which every character is 'A', and let the pattern string be "B". How could you determine whether "B" occurs in that string in O(m) time, where m = 1?

Difference between Jaro-Winkler and Levenshtein distance? [closed]

I want to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.
I was not able to understand the difference between the two. It seems Levenshtein gives the number of edits between two strings, while Jaro-Winkler provides a normalized score between 0.0 and 1.0.
My questions:
What are the fundamental differences between the two algorithms?
What is the performance difference between the two algorithms?
Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula
1 - (edit distance / length of the larger of the two strings)
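A minimal dynamic-programming sketch in Python of the edit distance plus that normalization (function names are illustrative):

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalize to [0, 1]: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))           # 3
print(round(similarity("kitten", "sitting"), 3))  # 0.571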
The Jaro algorithm is a measure of characters in common, where matching characters must be no more than half the length of the longer string apart, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end. Jaro and Jaro-Winkler are suited to comparing shorter strings such as words and names.
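For comparison, a hand-rolled Jaro-Winkler sketch in Python, using the usual constants p = 0.1 and a prefix cap of 4; this is an illustration, not a tuned implementation, and the names are mine:

def jaro(s1, s2):
    """Jaro similarity: matching characters must lie within half the length of the
    longer string of each other; out-of-order matches count as transpositions."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    m1 = [c for i, c in enumerate(s1) if match1[i]]
    m2 = [c for j, c in enumerate(s2) if match2[j]]
    transpositions = sum(c1 != c2 for c1, c2 in zip(m1, m2)) / 2
    return (matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for strings that share a common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))   # 0.84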
Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:
A Comparison of Personal Name Matching: Techniques and Practical Issues
According to that paper, the speeds of the four algorithms I've mentioned, from fastest to slowest, are:
Jaro
Jaro-Winkler
Levenshtein
Damerau-Levenshtein
with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.

Uniform-cost search algorithm [closed]

I was just reading about it in a book and on Wikipedia but I still don't understand it 100%.
I would really appreciate it if someone could explain it with an example or two.
Thanks
Say I'm looking at a map, searching for a pizza place near my block in the city. A few different strategies I could use:
Breadth first search (BFS): Look at concentric circles of blocks around my block, farther and farther outward until I find a pizza place. This will give me one of the pizza places which is closest to my block as the crow flies.
Depth first search (DFS): Follow a road until I hit a dead end, then backtrack. Eventually all possible branches will be searched, so if there's a pizza place out there somewhere then I'll find it, but it probably won't be very close to my block.
Uniform cost search (UCS): Say traffic is bad on some streets, and I'm really familiar with the city. For any given location I can say how long it will take me to get there from my block. So, looking at the map, first I search all blocks that will take me 1 minute or less to get to. If I still haven't found a pizza place, I search all blocks that will take me between 1 and 2 minutes to get to. I repeat this process until I've found a pizza place. This will give me one of the pizza places that is the closest drive from my block. Just as BFS looks like concentric circles, UCS looks like a contour map.
Typically you will implement UCS using a priority queue to search nodes in order of least cost.
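A minimal Python sketch of UCS with a priority queue (the graph shape and the names are made up for the pizza example):

import heapq

def uniform_cost_search(graph, start, goal):
    """graph maps node -> list of (neighbour, edge_cost) pairs. Nodes are expanded
    in order of least total path cost, like Dijkstra's algorithm with a goal test."""
    frontier = [(0, start, [start])]   # (cost so far, node, path)
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in explored:
            continue
        explored.add(node)
        for neighbour, step_cost in graph.get(node, []):
            if neighbour not in explored:
                heapq.heappush(frontier, (cost + step_cost, neighbour, path + [neighbour]))
    return None

# City blocks as a tiny weighted graph; edge weights are minutes of driving.
city = {
    "home":  [("a", 2), ("b", 5)],
    "a":     [("b", 1), ("pizza", 6)],
    "b":     [("pizza", 1)],
    "pizza": [],
}
print(uniform_cost_search(city, "home", "pizza"))   # (4, ['home', 'a', 'b', 'pizza'])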
I assume you were looking at this Wikipedia page. What it means is that the time required for a given operation (adding two numbers, comparing two numbers, retrieving data from memory, etc.) is independent of the size of the variables involved. In other words, an 8-bit comparison takes the same amount of time as a 32-bit comparison. Making this assumption allows you to simplify an efficiency analysis and compare algorithms without getting bogged down in implementation details.

How to fill the chessboard with domino? [closed]

How can we fill the chessboard with dominoes when we have some blocks? The chessboard is n x m, and the squares are filled with ordered numbers.
Test:
Answer like this:
The input gives n, m, and k, where k is the number of blocks.
The next k lines each give a block, such as 6 7 or 4 9.
Sorry for my English.
Here's an idea. In your example board, it is immediately obvious that squares 7, 9, and 14 must contain domino 'ends'; that is to say, there must be dominoes covering 2-7, 8-9, and 14-15.
(In case it's not immediately obvious, the rule I used is that a square with 'walls' on three sides dictates the orientation of the domino covering that square.)
If we place those three dominoes, it may then be the case that there are more squares to which the same rule now applies (e.g. 20).
By iterating this process, we can certainly make progress towards our goal, or alternatively get to a place where we know it can't be achieved.
See how far that gets you.
Edit: also note that in your example, the lower-left 2x2 corner (squares 11, 12, 16, 17) is not uniquely determined, since a 90-degree rotation of the depicted arrangement would also work, so you will have to consider such situations. If you are looking for any solution, you must come up with a way of arbitrarily picking one of many possibilities; if you are trying to enumerate all possibilities, you will have to come up with a way of finding them all!
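Here is a rough Python sketch of just the forced-move rule, under my own assumption that the board is given as an n x m grid of (row, column) squares with a set of blocked squares; the representation and names are mine, not the asker's.

def forced_placements(n, m, blocked):
    """Repeatedly place a domino on any free square with exactly one free neighbour
    (a square 'walled in' on three sides fixes the orientation of its domino).
    Returns the forced dominoes and whatever squares remain undetermined."""
    free = {(r, c) for r in range(n) for c in range(m)} - set(blocked)
    placed = []
    changed = True
    while changed:
        changed = False
        for r, c in sorted(free):
            neighbours = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                          if (r + dr, c + dc) in free]
            if len(neighbours) == 1:
                placed.append(((r, c), neighbours[0]))
                free -= {(r, c), neighbours[0]}
                changed = True
                break   # restart the scan: removing two squares changes other counts
    return placed, free

# 2x3 board with the middle of the top row blocked: both top corners force vertical
# dominoes, and the lone remaining square shows the board cannot be tiled completely.
dominoes, remaining = forced_placements(2, 3, blocked=[(0, 1)])
print(dominoes)    # [((0, 0), (1, 0)), ((0, 2), (1, 2))]
print(remaining)   # {(1, 1)}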

Algorithm optimization [closed]

Say you wanted to find which input causes function x to output value y, and you know the (finite) range of possible inputs.
The input and output are both numbers, and positively correlated.
What would be the best way to optimize that?
I'm currently just looping through all of the possible inputs.
Thanks.
One solution would be a binary search over the possible inputs.
Flow:
find the median input x
get the output from function(x)
if the output equals the desired y, you are done
if the output is less than the desired y
start over using the larger half of the possible inputs (the values are positively correlated, so a larger input is needed)
else
start over using the smaller half of the possible inputs
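A short Python sketch of that flow, assuming integer inputs and a monotonically increasing function (names are illustrative):

def find_input(f, lo, hi, target):
    """Binary search over the inputs lo..hi (inclusive) of an increasing function f,
    returning an x with f(x) == target, or None if no such input exists."""
    while lo <= hi:
        mid = (lo + hi) // 2
        value = f(mid)
        if value == target:
            return mid
        elif value < target:
            lo = mid + 1   # output too small: look at larger inputs
        else:
            hi = mid - 1   # output too large: look at smaller inputs
    return None

print(find_input(lambda x: x * x, 0, 1000, 625))   # 25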
A binary search algorithm, perhaps?
http://en.wikipedia.org/wiki/Binary_search_algorithm
If the range is finite and small, a precomputed lookup table might be the fastest way.
If you have some sets of known x data that yield y, you can divide them into training and test sets and use a neural network.
