I have thousands of protein sequences in FASTA format along with their accession numbers. I want to go back into the whole-genome shotgun database and retrieve all DNA sequences that encode a protein identical to one in my initial list.
I've tried running tblastn limited to <10 results per sequence, one per query, with an e-value cutoff of 1e-100 (and even an e-value of zero), and I'm not getting any results. I would like to automate this entire process.
Is this something that can be done by running blast from the command line and a batch script?
You should get at least one result: the one that encodes the original protein. The others, if any, would be pseudogenes, if I follow you.
Anyway, a bit of programming may help; check out Biopython. BioPerl or BioRuby should have similar features.
In particular, you can run BLAST from Biopython.
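If you have BLAST+ installed locally, the batch loop is straightforward to script from Python. A sketch (the database and file names here are hypothetical; it assumes the WGS data has been downloaded and formatted with makeblastdb):

```python
import subprocess

def tblastn_cmd(query_fasta, db, out_file, evalue="1e-100", max_targets=10):
    """Build a BLAST+ tblastn command line for one query file.

    Assumes BLAST+ is installed; file and database names are hypothetical.
    """
    return [
        "tblastn",
        "-query", query_fasta,
        "-db", db,
        "-out", out_file,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_targets),
        "-outfmt", "6",   # tabular output, easy to filter afterwards
    ]

# Batch over many query files (hypothetical names):
# for q in ["prot1.fasta", "prot2.fasta"]:
#     subprocess.run(tblastn_cmd(q, "wgs_db", q + ".tsv"), check=True)
```

The tabular output (`-outfmt 6`) includes percent identity per hit, so filtering for 100%-identical matches afterwards is a one-liner.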
You might find this link useful:
https://www.biostars.org/p/5403/
A similar question has been asked there, and some reasonable solutions have been posted.
Is there a function that works like a hash code, where a string (or set of bits) is passed in and converted to a number, but such that strings that are more similar to one another map to numbers closer to one another?
i.e.
f("abcdefg") - f("abcdef") < f("lorem ipsum dolor") - f("abcde")
The algorithm doesn't have to be perfect; I'm just trying to turn some descriptions into a numerical representation as one more input for an ML experiment. I know this string data has value to the algorithm; I'm just trying to come up with a simple way to turn it into a number.
What I understand from your post is very similar to a topic of interest to me. There is a great tool for the task you are asking about.
The tool I am referring to is word2vec. It gives a vector representation of each word in a string. It was developed by Google. In this model, each word is assigned a vector based on the words in the vocabulary and its nearby words (the next and previous words). Go through the word2vec material from Google or on YouTube and you will get a clear idea of it.
This tool is powerful enough to do unexpected things. A classic example:
King - Man + Woman = Queen
It is mainly used for semantic analysis.
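word2vec needs a training corpus, so as a quick pure-Python illustration of the property the question asks for (similar strings mapping to nearby representations), here is a much cruder character-trigram sketch; it is not word2vec, just the same underlying intuition:

```python
from collections import Counter
from math import sqrt

def trigrams(s):
    """Bag of character trigrams of a string (padded at the ends)."""
    s = f"  {s} "  # padding so very short strings still yield trigrams
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def similarity(a, b):
    """Cosine similarity between trigram bags: 1.0 = identical, 0.0 = disjoint."""
    ta, tb = trigrams(a), trigrams(b)
    dot = sum(ta[g] * tb[g] for g in ta)
    norm = sqrt(sum(v * v for v in ta.values())) * sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0
```

This gives a similarity score rather than a single number per string, which is usually what you want for ML features anyway; word2vec embeddings work the same way (vectors compared by cosine similarity), just learned from context rather than spelled-out characters.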
How can I predict a word that's missing from a sentence?
I've seen many papers on predicting the next word in a sentence using an n-gram language model with frequency distributions from a set of training data. But instead I want to predict a missing word that's not necessarily at the end of the sentence. For example:
I took my ___ for a walk.
I can't seem to find any algorithms that take advantage of the words after the blank; I guess I could ignore them, but they must add some value. And of course, a bi/trigram model doesn't work for predicting the first two words.
What algorithm/pattern should I use? Or is there no advantage to using the words after the blank?
Tensorflow has a tutorial to do this: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html
Incidentally, it does a bit more and generates word embeddings, but to get there it trains a model to predict the (next/missing) word. The tutorial also shows using only the previous words, but you can apply the same ideas and add the words that follow.
It also has a bunch of suggestions on how to improve the precision (e.g. skip-grams).
Somewhere at the bottom of the tutorial there are links to working source code.
The only thing to worry about is having sufficient training data.
So, when I've worked with bigrams/trigrams, an example query generally looked something like "Predict the missing word in 'Would you ____'". I'd then go through my training data and gather all the sets of three words matching that pattern, and count the things in the blanks. So, if my training data looked like:
would you not do that
would you kindly pull that lever
would you kindly push that button
could you kindly pull that lever
I would get two counts for "kindly" and one for "not", and I'd predict "kindly". All you have to do for your problem is consider the blank in a different place: "____ you kindly" would get two counts for "would" and one for "could", so you'd predict "would". As far as the computer is concerned, there's nothing special about the word order - you can describe whatever pattern you want, from your training data. Does that make sense?
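The counting scheme described above can be sketched directly in Python, using the same example training data:

```python
from collections import Counter

def predict_blank(corpus, pattern):
    """Predict the word at the None position in `pattern` by counting
    matching word windows in the training sentences."""
    blank = pattern.index(None)
    n = len(pattern)
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            # a window matches if every non-blank slot agrees
            if all(p is None or p == w for p, w in zip(pattern, window)):
                counts[window[blank]] += 1
    return counts.most_common(1)[0][0] if counts else None

corpus = [
    "would you not do that",
    "would you kindly pull that lever",
    "would you kindly push that button",
    "could you kindly pull that lever",
]
```

With this corpus, `predict_blank(corpus, ["would", "you", None])` counts "kindly" twice and "not" once, and `predict_blank(corpus, [None, "you", "kindly"])` counts "would" twice and "could" once, matching the walkthrough above.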
I am reading Programming Pearls by Jon Bentley (reference).
Here the author discusses various sorting algorithms, such as merge sort and multipass sort.
Questions:
How does the merge sort algorithm manage to read the input file only once and write the output file only once, using work files in between?
How does the author's 40-pass multipass sort algorithm manage to write the output file only once and use no work files at all?
Can someone explain the above with a simple example, such as having memory for 3 digits while needing to sort 10 digits, e.g. 9,0,8,6,5,4,1,2,3,7?
This is from Chapter 1 of Jon Bentley's
Programming Pearls, 2nd Edn (1999), which is an excellent book. The equivalent example from the first edition is slightly different; the multipass algorithm only made 27 passes over the data (and there was less memory available).
The sort described by Jon Bentley has special setup constraints.
File contains at most 10 million records.
Each record is a 7 digit number.
There is no other data associated with the records.
There is only 1 MiB of memory available when the sort must be done.
Question 1
The single read of the input file slurps as many lines from the input as will fit in memory, sorts that data, and writes it out to a work file. Rinse and repeat until there is no more data in the input file.
Then, complete the process by reading the work files and merging the sorted contents into a single output file. In extreme cases, it might be necessary to create new, bigger work files because the program can't read all the work files at once. If that happens, you arrange for the final pass to have the maximum number of inputs that can be handled, and have the intermediate passes merge appropriate numbers of files.
This is a general purpose algorithm.
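A minimal Python sketch of this scheme, using heapq.merge for the final pass (the file handling is simplified for illustration, and this small example never needs the intermediate merge passes described above):

```python
import heapq
import os
import tempfile

def external_sort(items, chunk_size):
    """External merge sort sketch: read the input once, holding at most
    `chunk_size` items in memory at a time; write each sorted run to a
    temporary work file; then merge all runs in a single output pass."""
    run_files = []
    chunk = []

    def flush():
        if not chunk:
            return
        chunk.sort()                         # in-memory sort of one run
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("".join(f"{x}\n" for x in chunk))
        f.seek(0)
        run_files.append(f)
        chunk.clear()

    for x in items:                          # single read of the input
        chunk.append(x)
        if len(chunk) == chunk_size:
            flush()
    flush()

    runs = ((int(line) for line in f) for f in run_files)
    result = list(heapq.merge(*runs))        # single merged output pass
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return result
```

With the question's example, `external_sort([9,0,8,6,5,4,1,2,3,7], 3)` creates four work files (runs of at most 3 numbers each) and merges them into one sorted output.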
Question 2
This is where the peculiar properties of the data are exploited. Since the numbers are unique and limited in range, the algorithm can read the file the first time, extracting numbers from the first fortieth of the range, sorting and writing those; then it extracts the second fortieth of the range, then the third, ..., then the last fortieth.
This is a special-purpose algorithm, exploiting the nature of the numbers.
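Using the 10-digit example from the question, here is a sketch of the same idea with memory for roughly 3 numbers and 4 passes (callables stand in for re-reading the input file and appending to the output file):

```python
def multipass_sort(read_input, write_output, lo, hi, passes):
    """Multipass sort sketch: make `passes` passes over the input, each
    time keeping only the numbers falling in one slice of the value
    range [lo, hi), sorting that slice in memory, and appending it to
    the output. No work files; the input is read `passes` times and the
    output is written once, already in order.

    `read_input` is a callable returning the input numbers (so the
    "file" can be re-read); `write_output` appends one number."""
    step = (hi - lo + passes - 1) // passes   # ceiling of the slice width
    for p in range(passes):
        slice_lo = lo + p * step
        batch = [x for x in read_input() if slice_lo <= x < slice_lo + step]
        batch.sort()                           # this slice fits in memory
        for x in batch:
            write_output(x)
```

With the range 0..9 split into 4 slices, each pass holds at most 3 of the 10 numbers, mirroring (at toy scale) the book's 40 passes over the 7-digit range with 1 MiB of memory.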
I was looking at some puzzles online to improve my knowledge of algorithms...
I came upon the question below:
"You have a sentence in which the spaces between several words have been removed and each word has had its character order shuffled. You have a dictionary. Write an algorithm to reproduce the sentence with spaces restored and words in normal character order."
I don't know a good way to solve this.
I am new to algorithms, but just looking at the problem, I think I would make the program do what a human mind would do.
Here is something I can think of:
-First, manually pick out common short English words from the dictionary, like "is", "the", "if", etc., and put them in dataset-1.
-Then generate the permutations of the words in dataset-1 (e.g. "si", "eht" or "eth", "fi") and put them in dataset-2.
-Then find which character sequences in the input sentence match words in dataset-2, put those in dataset-3, and insert spaces into the input sentence where they were found.
-For the rest of the words, I would generate permutations to find matches in the dictionary.
I am a newbie to algorithms... is this a bad solution?
This seems like a perfectly fine solution.
In general there are 2 parameters for judging an algorithm:
correctness - does the algorithm give the correct answer?
resources - the time or storage needed to produce an answer.
Usually there is a tradeoff between these two parameters.
So, for example, the size of your dictionary dictates which scrambled sentences you can reconstruct: a bigger dictionary gives you a correct answer for more inputs, but the whole search process takes longer and requires more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
Checking them all is expensive, so a good approach is to do what you suggested: create a small subset of commonly used words and check those first, so that the average case is better.
Note: just saying that you "check the permutation/search" is fine as an outline, but in the end you would need to specify exactly how to do that. Currently what you wrote is an idea for an algorithm, but it would not allow you to take a given input and mechanically work out the output.
Actually, it might be wise to start by partitioning the dictionary by word length.
Then try to find the longest words that can be made using the letters available, instead of the shortest ones. Short words are more common and thus harder to narrow down, e.g. is it really "if", or part of "fig"?
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations, though; just because you found a valid word doesn't mean it's the right word. You would go through all the substrings, of which there should be something like O(c^4*d), where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking, if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words and figure out an ordering that uses all the characters. There might be multiple solutions.
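Instead of generating permutations at all, a common trick is to compare sorted-letter signatures: a scrambled chunk matches a dictionary word iff their sorted letters are equal. Combined with a recursive search with memoization over split points, this avoids the factorial blow-up entirely. A sketch, using a tiny hypothetical dictionary:

```python
from functools import lru_cache

def unscramble(sentence, dictionary):
    """Recover "word word ..." from a concatenated, per-word-scrambled
    string. Two words match iff their sorted letters are equal, so no
    permutations are ever generated."""
    sig_to_word = {"".join(sorted(w)): w for w in dictionary}
    max_len = max(map(len, dictionary))

    @lru_cache(maxsize=None)
    def solve(i):
        # returns a list of words covering sentence[i:], or None
        if i == len(sentence):
            return []
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            word = sig_to_word.get("".join(sorted(sentence[i:j])))
            if word is not None:
                rest = solve(j)
                if rest is not None:
                    return [word] + rest
        return None

    parts = solve(0)
    return " ".join(parts) if parts is not None else None
```

For example, `unscramble("hteactast", {"the", "cat", "sat", "dog", "walk"})` recovers "the cat sat". Note one simplification: several dictionary words can share a signature ("cat"/"act"), so a real solution would keep a list of words per signature and possibly return multiple candidate sentences.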
I've got a wordlist which is 56GB and I would like to remove duplicates.
I've tried to approach this in Java, but I run out of memory on my laptop after 2.5M words.
So I'm looking for an (online) program or algorithm which would allow me to remove all duplicates.
Thanks in advance,
Sir Troll
edit:
What I did in Java was put the words in a TreeSet so they would be ordered and duplicates removed.
I think the problem here is the huge amount of data. As a first step, I would try to split the data into several files: e.g. make a file per starting character, putting words whose first character is 'a' into a.txt, words starting with 'b' into b.txt, ...
a.txt
b.txt
c.txt
...
Afterwards I would try using default sorting algorithms and check whether they work with the size of the files. After sorting, removing duplicates should be easy.
If the files are still too big, you can also split using more than 1 character,
e.g:
aa.txt
ab.txt
ac.txt
...
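The split/sort/dedup scheme above can be sketched like this (in-memory buckets stand in for the a.txt, b.txt, ... files; a real run would stream each bucket to and from disk):

```python
from collections import defaultdict

def dedup_by_prefix(words, prefix_len=1):
    """Sketch of the split/sort/dedup scheme: bucket words by their
    first `prefix_len` characters (each bucket standing in for one of
    the a.txt, b.txt, ... files), then sort and deduplicate each bucket
    independently. Each bucket can fit in memory even when the whole
    list does not, and a duplicate always lands in the same bucket as
    its twin."""
    buckets = defaultdict(list)
    for w in words:
        buckets[w[:prefix_len]].append(w)     # "append to the prefix file"
    out = []
    for prefix in sorted(buckets):            # process one file at a time
        out.extend(sorted(set(buckets[prefix])))
    return out
```

If single-character buckets are still too large, raising `prefix_len` to 2 gives the aa.txt/ab.txt/... split described above.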
Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must have been done before; a quick search on Stack Overflow gave this.
I suggest you use a Bloom Filter for this.
For each word, check if it's already present in the filter; if not, insert it (or rather, some good hash value of it) and emit it.
It should be fairly efficient, and you shouldn't need to give it more than a gigabyte or two for it to have practically no false positives (a Bloom filter never has false negatives, but a false positive would make you discard a word that was actually unique). I leave the math to you.
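A minimal pure-Python sketch of the idea (the sizes here are for illustration only; real values of m and k would be derived from the expected word count and the target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k hash functions
    derived from SHA-256 with different salts."""

    def __init__(self, m_bits=10_000, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, word):
        # k bit positions for this word
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        # True if ALL k bits are set; may be a false positive,
        # but never a false negative
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))
```

To deduplicate, you would stream the wordlist once, emitting a word only when `word not in bf` and then calling `bf.add(word)`. The caveat stands: each false positive silently drops one genuinely unique word.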
I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5 million words, something's going wrong with your original approach. Even if we assume each word is unique within those 2.5 million (which basically rules out that we're talking about text in a natural language), and assuming each word is on average 100 Unicode characters long, we're at 500 MB for storing the unique strings, plus some overhead for the set structure. Meaning: you should be doing fine, since those numbers are heavily overestimated already. Maybe before installing Hadoop, you could try increasing your heap size?