Convert DNA sequences into numerical vectors for R / Weka

I would like to use machine learning techniques such as Naive Bayes and SVM in Weka to identify species from DNA sequence data.
The issue is that I have to convert the DNA sequences into numerical vectors.
My sequences look like this:
------------------------------------------------G
------------------------------------------GGAGATG
------------------------------------------GGAGATG
------------------------------------------GGAGATG
TTATTAATTCGAGCAGAATTAGGAAATCCTGGATCTTTAATTGGTGATG
----------------------------------------------ATG
CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG
-----------------------TCAACCTGGGGCCCTACTCGGAGACG
----TAATCCGAGCAGAATTAAGCCAACCTGGCGCCCTACTAGGGGATG
CTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATG
TTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATA
How can I do this? Any suggestions for other programs for doing ML with DNA sequences besides Weka?

This answer makes use of R.
You can use R's Biostrings package for this.
Install the package first:
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biostrings"))
Convert a character string to a DNAString:
dna1 <- DNAString("------------------------------------------------G------------------------------------------GGAGATG")
Alternatively,
dna2 <- DNAStringSet(c("ACGT", "GTCA", "GCTA"))
alphabetFrequency(dna1)
letterFrequency(dna1, "GC")
...and so on; these frequency counts are the numerical features you can feed to a classifier.
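If it helps to see the idea outside R, here is a rough Python sketch of turning a gapped sequence into a fixed-length k-mer count vector; the function name and the choice of k = 2 are mine, not anything from Biostrings:

```python
from itertools import product

def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Count overlapping k-mers in a DNA sequence, ignoring gap characters.

    Returns a fixed-length vector (one count per possible k-mer), so
    sequences of different lengths map to vectors of the same dimension.
    """
    seq = seq.replace("-", "")            # drop alignment gaps
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:                   # skip ambiguous bases such as N
            vec[index[km]] += 1
    return vec

print(kmer_vector("GGAGATG", k=2))
# -> [0, 0, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0]
```

Each of your aligned sequences then becomes a row of the same width, which is exactly the shape Weka's classifiers expect.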
Then (if you must) you can call Weka functions from R via the RWeka package, e.g. Naive Bayes with
NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
NB(colx ~ ., data = mydata)
Alternatively, convert your data as you wish and/or export it to a file type that Weka understands; the foreign::write.arff() function comes to mind. But I wouldn't use Weka for this.
Needless to say, you can also simply enter these sequences into a website performing a BLAST search and get likely species candidates.
For CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG I get mitochondrial DNA from "banded rock lizard" (Petrosaurus mearnsi) with 91% probability.

Related

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters that should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
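The longest-match step above can be sketched in a few lines of Python; the tiny dictionaries below are stand-ins for a real word list:

```python
def split_marked_word(word, dictionary):
    """Greedily split a run of concatenated letters into dictionary words.

    Repeatedly breaks off the longest dictionary word that prefixes the
    remaining text; anything unmatched is kept as-is.
    """
    words = sorted(dictionary, key=len, reverse=True)  # longest first
    out = []
    rest = word
    while rest:
        for w in words:
            if rest.startswith(w):
                out.append(w)
                rest = rest[len(w):]
                break
        else:                      # no dictionary word matches: give up
            out.append(rest)
            break
    return out

print(split_marked_word("summaryofterminology",
                        ["summary", "of", "terminology"]))
# -> ['summary', 'of', 'terminology']
```

Note that this is exactly the greedy strategy whose failure mode is discussed below: it commits to the longest prefix even when a shorter split would let the rest of the word parse.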
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
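For a flavour of the sophisticated approach, here is a stripped-down Python sketch: instead of a trained sequence model it uses only unigram word frequencies (the numbers below are made up), with plain dynamic programming standing in for Viterbi decoding:

```python
import math

def segment(text, freq):
    """Split `text` into the most probable word sequence under a unigram model.

    `freq` maps words to relative frequencies; dynamic programming finds the
    split maximizing the product of word probabilities (sum of log-probs).
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best log-probability of text[:i]
    back = [0] * (n + 1)             # split point achieving that score
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):   # cap word length at 20
            w = text[j:i]
            if w in freq and best[j] + math.log(freq[w]) > best[i]:
                best[i] = best[j] + math.log(freq[w])
                back[i] = j
    if best[n] == -math.inf:
        return [text]                # no full segmentation found
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

freq = {"play": .02, "for": .05, "forth": .001, "the": .06,
        "fun": .01, "of": .05, "it": .04}
print(segment("playforthefunofit", freq))
# -> ['play', 'for', 'the', 'fun', 'of', 'it']
```

Because "forth" leaves the unparseable remainder "efunofit", the global optimisation recovers the intended split that the greedy approach misses.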
Hope that helps - good luck!

How do I interpret word embedding vectors?

I read the paper "Integrating and Evaluating Neural Word Embeddings in Information Retrieval".
I tried to understand the source code, and when opening the file named:
vectors_ap8889_skipgram_s200_w20_neg20_hs0_sam1e-4_iter5
I found a word vector representation like this:
downtown "-0.465147 -0.049099 -0.023432 0.058986 -0.085395 -0.027324 -0.050315 ................................................"
Please, I need you to explain what these values mean, which term in the corpus they refer to, and how I can obtain them.
The numbers don't have an intrinsic meaning in isolation. They are just an n-dimensional embedding of the given words.
If the embedding is done correctly you should see similar words having embeddings close together. For example "good" should be closer to "awesome" than to "island".
The common way to use them is to convert words into their embedding space and use that as input to some machine learning problem. The advantage is that the embedding is trained on a lot more data than you have for your problem, so the embeddings offer a shortcut to training your model.
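"Close together" is usually measured with cosine similarity. A minimal Python sketch, using made-up 3-dimensional toy vectors (the vectors in your file have 200 dimensions, per the "s200" in its name):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings, invented purely for illustration.
good    = [0.9, 0.1, 0.0]
awesome = [0.8, 0.2, 0.1]
island  = [0.0, 0.1, 0.9]

print(cosine(good, awesome) > cosine(good, island))  # -> True
```

So "downtown" in your file is the term, and the 200 floats after it are its coordinates; you compare words by comparing those coordinate vectors.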

what is the relationship between LCS and string similarity?

I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed that tool, for example the authors said that it is based on the C library GNU diff and analyze.c; maybe it refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation to the article. From what I read, the article presents an algorithm for finding the LCS (longest common subsequence) of a pair of strings, using a modification of the dynamic programming algorithm normally used to solve this problem. The modification is the use of a shortest-path algorithm to find the LCS with the minimum number of modifications.
At this point I am lost, because I do not know how the authors of the tool I first mentioned used the LCS to find how similar two sequences are. They have also put in a limit value of 0.4; what does that mean? Can anybody help me with this? Or have I misunderstood the article?
Thanks
I think the description on the string similarity tool is not being entirely honest, because I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that those characters that are not in the LCS are precisely the characters you would need to add, delete or substitute in order to convert the first string to the second, and the number of these differing characters is usually used as the difference score, as in the Levenshtein Distance.
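To make that connection concrete, here is a small Python sketch (the normalisation is my own; the actual tool may scale differently): the similarity is the fraction of characters accounted for by the LCS:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def similarity(a, b):
    """Normalised similarity in [0, 1]: characters outside the LCS are edits."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))

print(similarity("kitten", "sitting"))  # about 0.615
```

A limit of 0.4 then just means: stop (or report "not similar") as soon as it is clear the score cannot reach 0.4, which saves work on very different strings.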

How does language detection work?

I have been wondering for some time how Google Translate (or maybe a hypothetical translator) detects the language of the string entered in the "from" field. I have been thinking about this, and the only thing I can think of is looking for words in the input string that are unique to a language. Another way could be to check sentence formation or other semantics in addition to keywords. But this seems a very difficult task considering the number of languages and their semantics. I did some research and found that there are ways that use n-gram sequences and statistical models to detect language. I would appreciate a high-level answer too.
Take the Wikipedia in English. Check the probability that after the letter 'a' comes a 'b' (for example), and do that for all combinations of letters; you will end up with a matrix of probabilities.
If you do the same for the Wikipedia in different languages you will get different matrices for each language.
To detect the language, just use all those matrices and the probabilities as a score. Say that in English you'd get these probabilities:
t->h = 0.3, h->e = 0.2
and in the Spanish matrix you'd get:
t->h = 0.01, h->e = 0.3
The word 'the', using the English matrix, would give you a score of 0.3+0.2 = 0.5
and using the Spanish one: 0.01+0.3 = 0.31
The English matrix wins so that has to be English.
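A toy Python version of this scheme; the training sentences below are tiny stand-ins for the Wikipedia dumps:

```python
from collections import defaultdict

def bigram_matrix(corpus):
    """Estimate P(next letter | current letter) from sample text."""
    counts = defaultdict(lambda: defaultdict(int))
    corpus = "".join(c for c in corpus.lower() if c.isalpha())
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def score(word, matrix):
    """Sum of transition probabilities, as in the example above."""
    return sum(matrix.get(a, {}).get(b, 0.0) for a, b in zip(word, word[1:]))

english = bigram_matrix("the three brothers went over there")
spanish = bigram_matrix("los tres hermanos fueron alla")
word = "the"
print("English" if score(word, english) > score(word, spanish) else "Spanish")
# -> English
```

With real corpora you would multiply probabilities (or sum their logs) rather than add them, but the summed score already illustrates the idea.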
If you want to implement a lightweight language guesser in the programming language of your choice, you can use the method of Cavnar and Trenkle '94: "N-Gram-Based Text Categorization". You can find the paper on Google Scholar, and it is pretty straightforward.
Their method builds an n-gram statistic for every language it should later be able to guess, from some text in that language. Then such a statistic is built for the unknown text as well and compared to the previously trained statistics by a simple out-of-place measure.
If you use unigrams+bigrams (possibly +trigrams) and compare the 100-200 most frequent n-grams, your hit rate should be over 95% if the text to guess is not too short.
There was a demo available online, but it doesn't seem to work at the moment.
There are other ways of language guessing, including computing the probability of n-grams and more advanced classifiers, but in most cases the approach of Cavnar and Trenkle should perform sufficiently well.
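A bare-bones Python sketch of their out-of-place measure; the training snippets here are of course far smaller than the real corpora and 100-200 n-gram profiles the paper uses:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=200):
    """Most frequent character n-grams (n = 1..n_max), ranked by frequency."""
    text = " ".join(text.lower().split())
    grams = Counter()
    for n in range(1, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; lower means the profiles are more alike."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)  # maximum displacement for missing n-grams
    return sum(abs(ranks[g] - r) if g in ranks else penalty
               for r, g in enumerate(doc_profile))

english = ngram_profile("the quick brown fox jumps over the lazy dog")
german = ngram_profile("der schnelle braune fuchs springt ueber den zaun")
unknown = ngram_profile("the quick dog jumps over the fox")
print(out_of_place(unknown, english) < out_of_place(unknown, german))
```

The language whose profile yields the smallest out-of-place distance is the guess.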
You don't have to do deep analysis of text to have an idea of what language it's in. Statistics tells us that every language has specific character patterns and frequencies. That's a pretty good first-order approximation. It gets worse when the text is in multiple languages, but still it's not something extremely complex.
Of course, if the text is too short (e.g. a single word, worse, a single short word), statistics doesn't work, you need a dictionary.
An implementation example.
Mathematica is a good fit for implementing this. It has dictionaries for (and so recognizes) words in the following languages:
dicts = DictionaryLookup[All]
{"Arabic", "BrazilianPortuguese", "Breton", "BritishEnglish", \
"Catalan", "Croatian", "Danish", "Dutch", "English", "Esperanto", \
"Faroese", "Finnish", "French", "Galician", "German", "Hebrew", \
"Hindi", "Hungarian", "IrishGaelic", "Italian", "Latin", "Polish", \
"Portuguese", "Russian", "ScottishGaelic", "Spanish", "Swedish"}
I built a little, naive function to calculate the probability of a sentence being in each of those languages:
f[text_] :=
 SortBy[{#[[1]], #[[2]]/Length@k} & /@ (Tally@(First /@
      Flatten[DictionaryLookup[{All, #}] & /@ (k =
          StringSplit[text]), 1])), -#[[2]] &]
So that, just looking for words in dictionaries, you may get a good approximation, also for short sentences:
f["we the people"]
{{BritishEnglish,1},{English,1},{Polish,2/3},{Dutch,1/3},{Latin,1/3}}
f["sino yo triste y cuitado que vivo en esta prisión"]
{{Spanish,1},{Portuguese,7/10},{Galician,3/5},... }
f["wszyscy ludzie rodzą się wolni"]
{{"Polish", 3/5}}
f["deutsch lernen mit jetzt"]
{{"German", 1}, {"Croatian", 1/4}, {"Danish", 1/4}, ...}
You might be interested in the WiLI benchmark dataset for written language identification. The high-level answer, which you can also find in the paper, is the following:
Clean the text: remove things you don't want / need; make Unicode unambiguous by applying a normal form.
Feature extraction: count n-grams, create tf-idf features, something like that.
Train a classifier on the features: neural networks, SVMs, Naive Bayes, ... whatever you think could work.

A function where small changes in input always result in large changes in output

I would like an algorithm for a function that takes n integers and returns one integer. For small changes in the input, the resulting integer should vary greatly. Even though I've taken a number of courses in math, I have not used that knowledge very much and now I need some help...
An important property of this function should be that if it is used with coordinate pairs as input and the result is plotted (as a grayscale value for example) on an image, any repeating patterns should only be visible if the image is very big.
I have experimented with various algorithms for pseudo-random numbers with little success, and finally it struck me that md5 almost meets my criteria, except that it does not operate on numbers (at least not as far as I know). That resulted in something like this Python prototype (for n = 2; it could easily be changed to take a list of integers, of course):
import hashlib
def uniqnum(x, y):
    # encode to bytes first: hashlib requires bytes, not str, in Python 3
    return int(hashlib.md5((str(x) + ',' + str(y)).encode()).hexdigest()[-6:], 16)
But obviously it feels wrong to go over strings when both input and output are integers. What would be a good replacement for this implementation (in pseudo-code, python, or whatever language)?
A "hash" is exactly the tool created to solve the problem you are describing. See Wikipedia's article on hash functions.
Any common hash function will do; hash functions tend to be judged on these criteria:
The degree to which they prevent collisions (two separate inputs producing the same output) - a by-product of this is the degree to which the function minimizes outputs that may never be reached from any input.
The uniformity of the distribution of its outputs given a uniformly distributed set of inputs.
The degree to which small changes in the input create large changes in the output.
(see perfect hash function)
Given how hard it is to create a hash function that maximizes all of these criteria, why not just use one of the most commonly used and relied-on existing hash functions there already are?
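The third criterion, often called the avalanche effect, is easy to check empirically. A quick Python probe using SHA-256 (the choice of hash and input encoding here are mine):

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Hash two adjacent integers and compare the digests bit by bit.
h1 = hashlib.sha256((1000).to_bytes(4, "big")).digest()
h2 = hashlib.sha256((1001).to_bytes(4, "big")).digest()
print(bit_diff(h1, h2))  # roughly half of the 256 output bits flip
```

A change of a single input bit flipping about half the output bits is exactly the "small change in, large change out" behaviour the question asks for.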
From what it seems, turning integers into strings almost seems like another layer of encryption! (which is good for your purposes, I'd assume)
However, your question asks for hash functions that deal specifically with numbers, so here we go.
Hash functions that work over the integers
If you want to borrow already-existing algorithms, you may want to dabble in pseudo-random number generators
One simple one is the middle square method:
Take an n-digit number
Square it
Chop digits off both ends, keeping the middle n digits (the same length as your original number).
ie,
1111 => 01234321 => 2343
so, 1111 would be "hashed" to 2343 in the middle square method.
This way isn't that effective, but for a small number of hashes it has a very low collision rate, a uniform distribution, and great chaos potential (small changes => big changes). But if you have many values, time to look for something else...
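A quick Python sketch of the middle square method (the digit count of four is fixed here just to match the example):

```python
def middle_square(x, digits=4):
    """Square x, zero-pad to 2*digits, and keep the middle `digits` digits."""
    s = str(x * x).zfill(2 * digits)
    start = (len(s) - digits) // 2
    return int(s[start:start + digits])

print(middle_square(1111))  # 1234321 -> 01234321 -> 2343
```

Note that 0 always maps to 0, and short cycles appear quickly, which is part of why this generator is only of historical interest.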
The grand-daddy of all feasibly efficient and simple random number generators is the Mersenne Twister (http://en.wikipedia.org/wiki/Mersenne_twister). In fact, an implementation is probably out there for every programming language imaginable. Your hash "input" is what will be called a "seed" in their terminology.
In conclusion
Nothing wrong with string-based hash functions
If you want to stick with the integers and be fancy, try using your number as a seed for a pseudo-random number generator.
Hashing fits your requirements perfectly. If you really don't want to use strings, find a Hash library that will take numbers or binary data. But using strings here looks OK to me.
Bob Jenkins' mix function is a classic choice, at least when n = 3.
As others point out, hash functions do exactly what you want. Hashes take bytes - not character strings - and return bytes, and converting between integers and bytes is, of course, simple. Here's an example python function that works on 32 bit integers, and outputs a 32 bit integer:
import hashlib
import struct
def intsha1(ints):
    # pack the integers as big-endian 32-bit values
    data = struct.pack('>%di' % len(ints), *ints)
    digest = hashlib.sha1(data).digest()
    # unpack returns a tuple; take the single integer from it
    return struct.unpack('>i', digest[:4])[0]
It can, of course, be easily adapted to work with different length inputs and outputs.
Have a look at this; maybe you can be inspired:
Chaotic system
In chaotic dynamics, small changes in the input vary the results greatly.
An x-bit block cipher will take a number and effectively convert it to another number. You could combine (sum/multiply?) your input numbers and encipher the result, or iteratively encipher each number - similar to a CBC or chained mode. Google 'format preserving encryption'. It is possible to create a 32-bit block cipher (not widely available) and use it to create a 'hashed' output. The main difference between a hash and encryption is that a hash is irreversible.
