BioRuby: how to find regex match coordinates

I am using BioRuby to work with a gene sequence (retrieved through the EMBL API), and I am trying to find the coordinates of a subsequence that I located in it with a Regex match. I need the start and end positions of the query subsequence within the EMBL sequence. Any suggestion would be very welcome.
Thanks
I found the subsequence with Regex/Match. Judging from the documentation alone, I wonder whether Bio::Fasta::Report::Hit::Query could work, but I do not see any examples of its use.
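One possible approach, sketched in plain Ruby: the MatchData object returned by a regex match already carries the coordinates, via MatchData#begin and MatchData#end. The sketch assumes the sequence behaves like an ordinary String (which, as far as I know, holds for Bio::Sequence::NA, but treat that as an assumption). The helper name match_coordinates and the example sequence are made up for illustration.

```ruby
# Collect the 0-based [start, end] coordinates of every regex match.
# The end coordinate is inclusive; add 1 to both values for 1-based positions.
def match_coordinates(sequence, pattern)
  coords = []
  sequence.scan(pattern) do
    m = Regexp.last_match            # MatchData for the current match
    coords << [m.begin(0), m.end(0) - 1]
  end
  coords
end

seq = "atgcttaagcccgaattcaa"         # made-up example sequence

# Single match: MatchData#begin / #end give the coordinates directly.
first = seq.match(/gaattc/)
puts "first match: #{first.begin(0)}..#{first.end(0) - 1}" if first

# All (non-overlapping) matches of a pattern.
match_coordinates(seq, /cttaag|gaattc/).each do |s, e|
  puts "match at #{s}..#{e}: #{seq[s..e]}"
end
```

Note that String#scan only reports non-overlapping matches; if overlapping hits matter, the pattern would need to be wrapped in a lookahead.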

Related

How to traverse an unknown-length path in SurrealDB?

I want to recursively follow related records in SurrealDB, and I can't find the syntax to express it.
The simplest explanation of my goal is Neo4j/Cypher's variable length pattern matches. More generally, I want to start at a record and follow particular relations until I stop (either by number of steps or some other condition), where I don't know how many relation steps are needed between start and end.
The closest I can find is discussed here, in the section on 'No JOINs'. This doesn't fill my need, because the query specifies the number of steps between start and end. I'm imagining something like SELECT {->parent->person REPEATED 1..5} FROM person:tobie, which would find all of tobie's ancestors for 5 generations (person:tobie->parent->person, person:tobie->parent->person->parent->person, etc).
If this isn't part of SurrealQL's features, can you give me hints on other ways to get the same result? I've considered using the scripting functions, which seems powerful but off the beaten path.

In BLAST, how to get the HSP corresponding to each word?

In BLAST, I can only get a lot of sequences using its local and online services, but I cannot get the HSP corresponding to each word (seed) I am interested in. According to the principle of BLAST, the query sequence is first divided into multiple words; these words are then located in the database, producing multiple hits, which are finally extended to the left and right. My question is how to get the HSP corresponding to a single word, instead of the fully processed result that online BLAST gives me. I hope to get your suggestions, thank you very much. Attached is the BLAST algorithm flow chart (image).
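To make the seed-then-extend idea concrete, here is a toy sketch in Ruby. It is not the real BLAST implementation and does not use any BLAST API: it only matches exact words and stops extending at the first mismatch, whereas real BLAST uses scoring matrices, neighbourhood words and a drop-off rule. The function names index_words and seeds_with_extensions are hypothetical.

```ruby
# Toy illustration of BLAST-style seeding; NOT the real BLAST algorithm.

# Map every word of length w in the subject to its start positions.
def index_words(subject, w)
  index = Hash.new { |h, k| h[k] = [] }
  (0..subject.length - w).each { |i| index[subject[i, w]] << i }
  index
end

# For each query word, report its exact seed hits in the subject together
# with the ungapped segment obtained by extending left and right while
# the characters keep matching.
def seeds_with_extensions(query, subject, w)
  index = index_words(subject, w)
  results = []
  (0..query.length - w).each do |q|
    word = query[q, w]
    next unless index.key?(word)
    index[word].each do |s|
      left = 0
      left += 1 while q - left > 0 && s - left > 0 &&
                      query[q - left - 1] == subject[s - left - 1]
      right = 0
      right += 1 while q + w + right < query.length &&
                       s + w + right < subject.length &&
                       query[q + w + right] == subject[s + w + right]
      results << { word: word, query_start: q - left,
                   subject_start: s - left, length: left + w + right }
    end
  end
  results
end

seeds_with_extensions("ACGTTGCA", "TTACGTTGAA", 3).each { |hit| p hit }
```

Each result keeps the word that seeded it, so the extended segment can be traced back to a single seed, which is the per-word view asked about here.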

What is the best data structure for text auto completion?

I have a long list of words and I want to show the words starting with the text entered by the user. As the user enters a character, the application should update the list shown to the user. It should work like AutoCompleteTextView on Android. I am just curious about the best data structure to store the words so that the search is very fast.
A trie could be used: http://en.wikipedia.org/wiki/Trie https://stackoverflow.com/search?q=trie
A nice article: http://www.sarathlakshman.com/2011/03/03/implementing-autocomplete-with-trie-data-structure/
PS: If you have some sub-sequences that "don't branch", then you may save space by using a radix trie, which is a trie implementation that puts several characters in one node when possible: http://en.wikipedia.org/wiki/Radix_tree
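As a rough illustration of the trie idea, here is a minimal sketch in Ruby (chosen only to keep the examples on this page in one language; the structure carries over to Java on Android). The class and method names are made up, and no radix compression is attempted.

```ruby
# Minimal trie for prefix completion; a sketch, not a tuned implementation.
class Trie
  Node = Struct.new(:children, :terminal)

  def initialize
    @root = Node.new({}, false)
  end

  # Add a word, creating one node per character as needed.
  def insert(word)
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= Node.new({}, false)) }
    node.terminal = true
  end

  # Return all stored words that start with the prefix, sorted.
  def complete(prefix)
    node = @root
    prefix.each_char do |ch|
      node = node.children[ch]
      return [] if node.nil?
    end
    results = []
    collect(node, prefix, results)
    results.sort
  end

  private

  # Depth-first walk below a node, gathering complete words.
  def collect(node, prefix, results)
    results << prefix if node.terminal
    node.children.each { |ch, child| collect(child, prefix + ch, results) }
  end
end

trie = Trie.new
%w[car card care cat dog].each { |w| trie.insert(w) }
p trie.complete("car")
```

Walking down to the prefix node costs time proportional to the prefix length, independent of how many words are stored; only the collection step depends on the number of completions.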
You may find this thread interesting:
String similarity algorithms?
It's not exactly what you want, instead it's a slightly extended version of your problem.
To implement an autocomplete feature, ternary search trees (TST) are also used:
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/
However, if you want to find an arbitrary substring within a string, try a generalised suffix tree.
http://en.wikipedia.org/wiki/Generalised_suffix_tree
Tries (and their various varieties) are useful here. A more detailed treatment on this topic is in this paper. Maybe you can implement a completion trie for Android?

identifier splitting to approximately match documentation

Different software projects have different coding conventions; even within the same project, several languages may be used, each with its own conventions. What is a good way to search documentation (which lives outside the source files) using identifier tokens from the source code?
For example if the source has self._def_passwd, or this.defPasswrd, a query on the documentation tree should strive to match default password.
So far I've been trying to sort by Levenshtein distance, which works nicely for small edit distances, but there are too many false positives when I increase the threshold, which is problematic with white spaces in documentation.
8 0.666667 announcement getContent AnnouncementBean.java (Token.Name.Function)
8 0.666667 announcement getPercent DataObservation.java (Token.Name.Function)
8 0.666667 announcement GroupBean GroupBean.java (Token.Name.Class)
where the first value is the Levenshtein distance, second one the distance divided by the length of the word matched.
I'm thinking of the following:
look into the Jaccard and Tanimoto algorithms
IntelliSense/suggest kind of code
Somewhere on SO there were posts on some algorithms that bio guys use for matching sequences
Come up with chained regular expression rules based on http://en.wikipedia.org/wiki/Naming_convention_%28programming%29
the last one being literally the last option. Which other algorithms do you think could give better results for this kind of thing?
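For reference, the naming-convention rules mentioned in the list could start as small as the sketch below. It only handles camelCase boundaries, acronym boundaries and non-alphanumeric separators; the function name split_identifier is hypothetical, and real projects would likely need more rules.

```ruby
# Split a source identifier into lowercase word tokens.
# Sketch only: covers camelCase, acronyms followed by a word, and
# separators such as '.', '_' or '-'.
def split_identifier(identifier)
  identifier
    .gsub(/([A-Z]+)([A-Z][a-z])/, '\1 \2')   # "HTTPServer" -> "HTTP Server"
    .gsub(/([a-z0-9])([A-Z])/, '\1 \2')      # "defPasswrd" -> "def Passwrd"
    .split(/[^A-Za-z0-9]+/)                  # break on '.', '_', spaces, ...
    .reject(&:empty?)
    .map(&:downcase)
end

p split_identifier("self._def_passwd")
p split_identifier("this.defPasswrd")
```

The resulting tokens (for example def and passwd) can then be matched individually against documentation words, which avoids comparing a whole identifier against text containing white space.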
Try using a weighted edit distance; with it you can encode knowledge of usual abbreviations, or of probable typing mistakes based on key distance on the keyboard. For example, you can give zero weight to vowels like [ao], and password will be almost equal to pswrd (only the doubled s still costs an edit). Another option is to build a word-level edit distance and use synonyms there. I have also built EditDistance, which works simultaneously with words and characters.
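A minimal sketch of such a weighted edit distance, assuming the simplest possible weighting: inserting or deleting a vowel is free, everything else costs 1. The cost function and names are illustrative assumptions, not taken from the EditDistance library mentioned above.

```ruby
VOWELS = %w[a e i o u].freeze

# Cost of inserting or deleting a single character (assumed weights).
def char_cost(ch)
  VOWELS.include?(ch) ? 0.0 : 1.0
end

# Levenshtein-style dynamic programme with per-character weights.
def weighted_distance(a, b)
  a = a.downcase
  b = b.downcase
  prev = [0.0]
  b.each_char { |ch| prev << prev.last + char_cost(ch) }
  a.each_char do |ca|
    curr = [prev[0] + char_cost(ca)]
    b.each_char.with_index do |cb, j|
      sub = ca == cb ? 0.0 : [char_cost(ca), char_cost(cb)].max
      curr << [prev[j + 1] + char_cost(ca),  # delete ca
               curr[j] + char_cost(cb),      # insert cb
               prev[j] + sub].min            # match or substitute
    end
    prev = curr
  end
  prev.last
end

p weighted_distance("password", "passwrd")
p weighted_distance("password", "pswrd")
```

With these weights, abbreviations that only drop vowels come out at distance zero, while dropped consonants still count, so the threshold can stay small.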

Algorithm to find all possible results using text search

I am currently making a web crawler to crawl all the possible characters on a video game site (Final Fantasy XIV Lodestone).
My interface for doing this is using the site's search. http://lodestone.finalfantasyxiv.com/rc/search/characterForm
If the search finds more than 1000 characters it only returns the first 1000. The text search does not seem to understand either *, ? or _.
If I search for the letter a, I get all the characters that have an a anywhere in their names rather than all characters whose names start with a.
I'm guessing I could run searches for every character combination: aa, ab, ba, etc. But that has two problems:
There is no guarantee that a search will never return more than 1000 results.
It doesn't seem very efficient, as many characters would appear multiple times and would need to be filtered out.
I'm looking for an algorithm on how to construct my search text.
Considered as a practical problem, have you asked Square Enix for some kind of API access or database dump? They might prefer this to having you scrape their search results.
Considered purely in the abstract, it's not clear that any search strategy will succeed in finding all the results. Suppose there were a character called "Ar": how would you find it? If you search for "ar", the results only go as far as names beginning with "Ak". If you search for "a" or "r", the situation is even worse. Any other search fails to find this character. (In practice you might be able to find "Ar" by guessing its world and/or main skill, but in theory there might be so many characters with that skill on that world that this remains ineffective.)
The main question here is what you are planning to do with all those characters. What is the purpose of your program? Putting that aside, you can search for a single letter and filter by both main skill and world (using a double loop). It is highly unlikely that you will ever get more than 1000 hits that way for any consonant. If you want to search for names containing a vowel, then use the two-letter combination vowel + other_letter in a loop that iterates other_letter from A to Z.
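The enumeration described in this answer could be generated along these lines. The world and skill names passed in are placeholders, since the real filter values would have to be read from the search form; the function names are made up as well.

```ruby
# Search terms as suggested: single consonants, plus every
# vowel + letter pair in place of the single vowels.
def search_terms
  letters = ("a".."z").to_a
  vowels = %w[a e i o u]
  consonants = letters - vowels
  pairs = vowels.product(letters).map(&:join)
  consonants + pairs
end

# One planned query per (term, world, skill) combination.
def search_plan(worlds, skills)
  search_terms.product(worlds, skills).map do |term, world, skill|
    { term: term, world: world, skill: skill }
  end
end

# Placeholder filter values, for illustration only.
plan = search_plan(%w[WorldA WorldB], %w[SkillX])
puts "#{plan.length} queries planned"
```

Because the same character can match several terms, results would still need to be de-duplicated, for instance by keeping them in a Set keyed on the character's identifier.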
An additional optimization is to guess at which page the part of the list for the needed letter will start. If you know the total number of pages (TNOP), then that part will start somewhere near page TNOP * LETTER / 27, where LETTER is the position of the letter in the alphabet.
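As a small worked example of that estimate (the function name is hypothetical, and clamping to page 1 is an added assumption so that the result is always a valid page):

```ruby
# Estimate the page on which names starting with `letter` begin,
# using TNOP * LETTER / 27 with integer division.
def estimated_start_page(total_pages, letter)
  order = letter.downcase.ord - "a".ord + 1   # a => 1 ... z => 26
  [total_pages * order / 27, 1].max
end

puts estimated_start_page(54, "m")
```

The estimate assumes names are spread evenly over the alphabet, so in practice it is only a starting point for stepping forwards or backwards through the pages.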

Resources