In BLAST, how to get the HSP corresponding to each word? - bioinformatics

In BLAST, I can only get a lot of sequences using its local and online services, but I cannot get the HSP corresponding to every word (seed) I want. According to the principle of BLAST, we know that the sequence will be divided into multiple words at the beginning, and then locate in the database according to these words, find multiple hits, and finally expand to the left and right sides. My question is how to get the HSP corresponding to a single word, instead of processing the completed result for me like online BLAST.Hope to get your suggestions, thank you very much.Attach the BLAST algorithm flowenter image description here

Related

How do I modify the prefix trie data structure to handle words that are in the middle?

I want to implement simple autocomplete functionality for a website. I first wanted to use the prefix trie data structure for this, and that's how autocomplete usually works, you enter a prefix and you can search the trie for the possible suffixes, however the product owner wants to handle the words that are in the middle as well.
Let me explain what I mean. Imagine I have these product names:
tile for bathroom
tile for living room
kitchen tile
kitchen tile, black
some other tile, green
The user searches for "tile" and they will only see the first 2 results if I use the prefix trie, but I want all those results to pop up, however I don't know about any efficient data structure to handle this. Can you please suggest something? Can a prefix trie be modified to handle this?
I have thought about some modifications, such as inserting all suffixes, etc, but they will give wrong results, for example, I have inserted suffixes for
kitchen tile, black
some other tile, green
and kept the prefixes in the first node for each suffix (kind of like cartesian product), that way I can get the result of "some other tile, black" which doesn't exist. So this solution is bad. Also this solution will use a lot of memory...
The trie data structure indeed works for prefix match operations, not for in the middle text search
The usual data structure to support in the middle text search is the suffix tree: https://en.wikipedia.org/wiki/Suffix_tree
It requires enough space to store about 20 times your list of words in memory, so yes it costs more memory
Suffix array is a space efficient alternative: https://en.wikipedia.org/wiki/Suffix_array
Don't over-think this. Computers are fast. If you're talking on the order of thousands of products in memory, then a sequential search doing a contains check is going to be plenty fast enough: just a few milliseconds, if that.
If you're talking a high-traffic site with thousands of requests per second, or a system with hundreds of thousands of different products, you'll need a better approach. But for a low-traffic site and a few thousand products, do the simple thing first. It's easy to implement and easy to prove correct. Then, if it's not fast enough, you can worry about optimizing it.
I have an approach that will work using simple tries.
Assumption:- User will see Sentence once the whole word is complete
Let's take above example to understand this approach.
1. Take each sentence, say tile for bathroom.
2. Split the sentences into words as - tile, for, bathroom.
3. Create a tuple of [String, String], so for above example we will get three tuples.
(i) [tile, tile for bathroom]
(ii) [for, tile for bathroom]
(iii)[bathroom, tile for bathroom]
4. Now insert the first String of the above tuple into your trie and store the
other tuple (which is the whole sentence) as a String object to the
last character node for the word. i.e when inserting tile, the node at
character e will store the sentence string value.
5. One case to handle here is, like the tile word appears in two strings, so in that case
the last character e will store a List of string having values - tile for bathroom and tile for living room.
Once you have the trie ready based on the above approach, you would be able to search the sentence based on any word being used in that sentence. In short, we are creating each word of the sentence as tag.
Let me know, if you need more clarity on the above approach.
Hope this helps!

Search a string as you type the character

I have contacts stored in my mobile. Lets say my contacts are
Ram
Hello
Hi
Feat
Eat
At
When I type letter 'A' I should get all the matching contacts say "Ram, Feat, Eat, At".
Now I type one more letter T. Now my total string is "AT" now my program should reuse the results of previous search for "A". Now it should return me "Feat, Eat, At"
Design and develop a program for this.
This is interview question at Samsung mobile development
I tried solving with Trie data structures. Could not get good solution for reusing already searched string results. I also tried solution with dictionary data structure, solution has same disadvantage as Trie.
question is how do I search the contacts for each letter typed reusing the search results of earlier searched string? What data structure and algorithm should be used for efficiently solving the problem.
I am not asking for program. So programming language is immaterial for me.
State machine appears to be good solution. Does anyone have suggestion?
Solution should be fast enough for million contacts.
It kind of depends on how many items you're searching. If it's a relatively small list, you can do a string.contains check on everything. So when the user types "A", you search the entire list:
for each contact in contacts
if contact.Name.Contains("A")
Add contact to results
Then the user types "T", and you sequentially search the previous returned results:
for each contact in results
if contact.Name.Contains("AT")
Add contact to new search results
Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.
If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.
You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.
Update after edit to question
If you want to do this with a million contacts, sequential search might not be the best way to go at the start. Although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action. And you can almost certainly to that before the user presses the second key. Especially if you have a background thread doing the search while the foreground thread handles input and writing the first page of matched strings to the display.
Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:
r - ram
a - ram, feat, eat, a
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.
I think you get the idea.
That bigram index gets stored in a dictionary or hash map. There are only 325 possible bigrams in the English language, and of course the 26 letters, so at most your dictionary is going to have 351 entries.
So you have almost instant lookup of 1- and 2-character names. How does this help you?
An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.
The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first, anyway. This is going to cost you an extra 2 million references to the names, in the worst case. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to lookup and display the one-character search results, but not enough to be noticeable by the user.
This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.
Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size this data structure gives asymptotically optimal solution to find all z matches of a substring of length m in a set of strings in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. Also the structure has optimal O(n) space and build time where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably shouldn't have too many matches per contact that have a match, so this may not even be necessary.
What I'm thinking to do is, keeping track of the so far matched string. Suppose in the first step, we identify the strings those have "A" in them and we keep trace of the positions of 'A". Then in the next step we only iterate over these strings and instead of searching them full we only check for occurrence of "T" as the next character to "A" we kept trace in the previous step and so on.

prefix similarity search

I am trying to find a way to build a fuzzy search where both the text database and the queries may have spelling variants. In particular, the text database is material collected from the web and likely would not benefit from full text engine's prep phase (word stemming)
I could imagine using pg_trgm as a starting point and then validate hits by Levenshtein.
However, people tend to do prefix queries E.g, in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, is someone were typing "betho symphony", is there a reasonable way (using postgresql with perhaps tcl or perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)
What I ended up is a simple modification of the common algorithm: normally I would just pick up the last value from the matrix or vector pair. Referring to the "iterative" algorithm in http://en.wikipedia.org/wiki/Levenshtein_distance I put the strings to be probed as first argument and the query string as second one. Now, when the algorithm finishes, the minimum value in the result column gives the proper result
Sample results:
query "fantas", words in database "fantasy", "fantastic" => 0
query "fantas", wor in database "fan" => 3
The input to edit distance are words selected from a "most words" list based on trigram similarity
You can modify edit distance algorithm to give a lower weight to the latter part of the string.
Eg: Match(i,j) = 1/max(i,j)^2 instead of Match(i,j)=1 for every i&j. (i and j are the location of the symbols you are comparing).
What this does is: dist('ABCD', 'ABCE') < dist('ABCD', 'EBCD').

Find Successive words in Article B available from A

There are two articles, A and B, which are very large. Get three or more successive words in A and check if they appear in B, and count how many times they appear. For example, if 'book' 'his' and 'her' appear in A, how many times do they appear in B?
I thought about splitting the entire content of B and then checking all 3 words in A with StringToken, but I am not sure about the algorithm efficiency.
Look into what a Hashtable is, scan your file B for words one by one (you can split if you don't care about memory usage on large files) each word you find into the hashtable (when not found) or increment the number to get of times a word is seen.
Then you just scan. over A, looking for each set of 3 words, with a rolling sliding window. this way you can increase the length of the window later without rewriting anything.
for reference you should really tag homework questions as such.
It is obvious that you need to scan / parse entire content of B once to reach the results. You just cannot avoid doing that. Read it line by line. For every line, search for the given query terms and their counts in the line. Keep adding the counts generated on per line basis to get the final result.
If you want to do such computation many times on content of B for same/different terms, creating an Inverted_index for B would be best way.

Algorithm to find all possible results using text search

I am currently making a web crawler to crawl all the possible characters on a video game site (Final Fantasy XIV Lodestone).
My interface for doing this is using the site's search. http://lodestone.finalfantasyxiv.com/rc/search/characterForm
If the search finds more than 1000 characters it only returns the first 1000. The text search does not seem to understand either *, ? or _.
If a search for the letter a, I'm getting all the characters that have a in their names rather than all characters that start with a.
I'm guessing I could do searches for all character combination aa, ab, ba, etc. But that doesn't guarantee me:
I will never get more than 1000 result
It doesn't seem very efficient has many characters would appear multiple times and would need to be filtered out.
I'm looking for an algorithm on how to construct my search text.
Considered as a practical problem, have you asked Square Enix for some kind of API access or database dump? They might prefer this to having you scrape their search results.
Considered purely in the abstract, it's not clear that any search strategy will succeed in finding all the results. For suppose there were a character called "Ar", how would you find it? If you search for "ar", the results only go as far as at Ak—. If you search for "a" or "r", the situation is even worse. Any other search fails to find this character. (In practice you might be able to find "Ar" by guessing its world and/or main skill, but in theory there might be so many characters with that skill on that world that this remains ineffective.)
Main question here is what are you planning to do with all those characters. What is the purpose of your program? Putting that aside, you can search for single letter, and filter by both main skill and world (using double loop). It is highly unlikely that you will ever have more that 1000 hits that way for any consonant. If you want to search for name starting with vowel then use bigraph vowel-other_letter in a loop that iterates other_letter from A to Z.
Additional optimization is to try to guess at what page the list with needed letter will start. If you have total number of pages (TNOP) then your list will start somewhere near page TNOP * LETTER / 27, where LETTER is the order of the letter in the alphabet.

Resources