Encode digital information to DNA - bioinformatics

DNA is a structure that encodes biological information.
Recently, DNA has also been used to encode digital information, i.e. to translate digital data such as photos or text into a DNA sequence.
What algorithm exactly is used for translating binary data into a DNA sequence?
As Wikipedia claims:
5.5 petabits can be stored in each cubic millimeter of DNA
http://en.wikipedia.org/wiki/DNA_digital_data_storage
so DNA is an efficient way of storing huge amounts of information.
Is there any good reference or tutorial book that teaches how to encode information into DNA efficiently and decode it back into the original information?
Thanks

I'd suggest studying molecular biology. You'd need to understand the biochemistry behind DNA replication and transcription, what nucleotides are, how a cell works, how the nucleus works, and many, many other areas; it's really not a matter of some tutorial or a book (I'm trained in bioinformatics, biophysics, and molecular biology). As to the rest of your question: since DNA uses four different nucleotides, you can encode two bits into one nucleotide, e.g. 00b = A, 01b = T, 10b = C, 11b = G.
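A minimal sketch of that two-bits-per-base mapping in Python (real DNA storage schemes, such as those surveyed in the Wikipedia article above, additionally add error correction and avoid long runs of the same base; this sketch does neither):

BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    # Encode raw bytes as a DNA string, two bits per nucleotide.
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(seq: str) -> bytes:
    # Decode a DNA string produced by bytes_to_dna back into bytes.
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

message = "hi".encode("utf-8")
dna = bytes_to_dna(message)        # "TCCATCCT"
assert dna_to_bytes(dna) == message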

The proof of concept came from here:
Is it possible? Yes. Is it practical? Not yet. Will it replace common hard drives/flash disks? No, not with current technology.
If you want to learn more about DNA technology from the biochemistry perspective, I advise you to have a look at DNA replication and study a bit of DNA chemistry.
If you are interested in current DNA technology, look into sequencing.
And you will probably find a better answer/discussion at Biostar!

Related

What are the Machine Learning Algorithms to use when identifying given words

I'm new to machine learning and I want to identify given words using an algorithm.
As an example,
construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm
construct the square that is 7cm long each side.
In this example I need to identify the words triangle and square.
So it seems like you want the algorithm to be intelligent rather than just identifying a couple of words. In order to do that you should go for natural language processing (NLP). There you can identify the nouns for the different geometrical objects, and if you also want to gather the associated information, i.e. list that AB=7, BAC=60 and AC=5.5, then learn about recurrent neural networks (RNNs). RNN models can remember what was said at the beginning of the sentence and identify which details belong to it.
e.g. - Anna lives in Paris and she's fluent in French.
An RNN can identify the word French from that context.
So just by using a plain machine learning algorithm it is not possible to identify those words and gather the details when provided with such a sentence.
You can read this article for further understanding.
Understanding LSTM Networks
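For the two example sentences, a plain keyword/regex baseline (no machine learning) can already pull out the shape words and the measurements; this is just an illustrative sketch with an assumed shape vocabulary, and it will not generalize the way the NLP/RNN route above would:

import re

SHAPES = {"triangle", "square", "circle", "rectangle"}  # assumed vocabulary

def extract_shapes_and_measurements(sentence):
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    shapes = [t for t in tokens if t in SHAPES]
    # capture assignments such as AB=7cm or BAC=60
    measurements = re.findall(r"([A-Za-z]+)\s*=\s*([\d.]+)\s*(cm)?", sentence)
    return shapes, measurements

print(extract_shapes_and_measurements(
    "construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm"))
# (['triangle'], [('AB', '7', 'cm'), ('BAC', '60', ''), ('AC', '5.5', 'cm')])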

Given a list of words, how to develop an algorithmic way to semantically group them?

I am working with the Google Places API, which provides a list of 97 different location types. I want to reduce this list to a smaller number
of categories, as many of them can be grouped. For example, atm and bank into financial; temple, church, mosque, synagogue into worship; school, university into education; subway_station, train_station, transit_station, gas_station into transportation.
But also, it should not overgeneralize; for example, pet_store, city_hall, courthouse, restaurant into something like buildings.
I tried quite a few methods to do this. First, I downloaded synonyms of each of the 97 words in the list from multiple dictionaries. Then I computed the similarity between two words as the fraction of unique synonyms they share (Jaccard similarity).
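As a rough sketch, the synonym-overlap computation looks like this (the synonym sets below are invented placeholders):

synonyms = {
    "atm":    {"cash machine", "cashpoint", "bank machine"},
    "bank":   {"depository", "bank building", "cashpoint", "savings bank"},
    "temple": {"shrine", "house of worship", "sanctuary"},
}

def jaccard(a, b):
    # fraction of unique synonyms the two place types share
    sa, sb = synonyms[a], synonyms[b]
    return len(sa & sb) / len(sa | sb)

print(jaccard("atm", "bank"))    # 1/6: one shared synonym out of six distinct ones
print(jaccard("atm", "temple"))  # 0.0: no overlap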
But after that, how do I group the words into clusters? Using traditional clustering methods (k-means, k-medoid, hierarchical clustering, and FCM), I am not getting any good clustering (I identified several misclassifications by scanning the results manually).
I even tried the word2vec model trained on Google News data (where each word is expressed as a vector of 300 features), and I do not get good clusters from that either.
You are probably looking for something related to vector space dimensionality reduction. In these techniques, you'll need a corpus of text that uses the locations as words in the text. Dimensionality reduction will then group the terms together. You can do some reading on Latent Dirichlet Allocation and Latent semantic indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of advances are not captured. As you noted, there has been a lot of work such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
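A rough scikit-learn sketch of the LSI idea (the corpus lines are invented placeholders; a real corpus would need to mention the location types in context many times):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "withdraw cash at the atm near the bank branch",
    "the bank approved the loan at its downtown branch",
    "the temple and the church hold services for worship",
    "students attend lectures at the university and the school",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# reduce the sparse term space to a handful of latent dimensions
X_lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# cluster in the reduced space (terms could be clustered the same way via the transpose)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_lsi)
print(labels)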
You need much more data.
No algorithm, without additional data, will ever relate ATM and bank to financial, because that requires knowledge of these terms.
Jaccard similarity doesn't have access to such knowledge, it can only work on the words. And then "river bank" and "bank branch" are very similar.
So don't expect the algorithm to work magic. You need the magic to be in the data...

Efficiently finding all points dominated by a point in 2D

In the OCW Advanced Data Structures course, Prof. E. Demaine mentions a data structure that is able to find all the points dominated by a query point (b2, b3) using O(n) space and O(k) time, provided that a search for point b3 has already been completed, where k is the size of the output.
The solution works by transforming the above problem into a ray stabbing problem, and using a technique similar to fractional cascading, as shown in the following image from the lecture notes:
While the concept itself is intuitive, implementing the actual data structure is not straightforward at all.
Chazelle describes this in a paper as Filtering Search (pp712).
I would like to find additional literature or answers that describe and explain this data structure and algorithm (perhaps with pseudocode and more images, with a focus on implementation).
Additionally, I would also like to know more about whether this structure can be implemented in a way that is not "static". That is, I would like to be able to insert and delete points from the structure as efficiently as possible.
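For reference, the query semantics can be illustrated with a naive baseline; this is not the O(n)-space, O(k)-time structure from the lecture, since it scans the whole x-prefix rather than only the output:

import bisect

class DominanceIndex:
    def __init__(self, points):
        self.points = sorted(points)          # sorted by x (then y)
        self.xs = [p[0] for p in self.points]

    def dominated_by(self, qx, qy):
        # all points p with p.x <= qx and p.y <= qy
        hi = bisect.bisect_right(self.xs, qx)
        return [p for p in self.points[:hi] if p[1] <= qy]

idx = DominanceIndex([(1, 1), (2, 5), (3, 2), (6, 4)])
print(idx.dominated_by(3, 3))   # [(1, 1), (3, 2)]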
The book "Computational Geometry: Algorithms and Applications" covers data structures for questions like these. Each chapter has a nice section describing where to learn more, including more complex structures for answering the same problems that are not covered in the book. There are enough diagrams, but not much pseudocode.
Many structures like this can be dynamized using techniques discussed in the book "The Design of Dynamic Data Structures". Jeff Erickson has some nice notes on the topic. Using fractional cascading with it is discussed in "Cache-Oblivious Streaming B-trees" - see the section about cache-oblivious lookahead arrays.

Clustering algorithms for strings

I have to implement a module in which I need to group sentences (strings) having similar meaning into different clusters. I have read about k-means, EM clustering, etc. But the problem I am facing is that these algorithms are explained with vector points on a graph. I do not see how these algorithms can be applied to sentences (strings) with similar meaning. Please suggest some appropriate approaches.
For example,
let's consider a classroom scenario:
1) Teacher has ample knowledge.
2) Students understand what teacher teaches.
3) Teacher is sometimes punctual in class.
4) Teacher is audible in class.
Let's say we have these 4 sentences. Looking at them, we can say that sentences 1 and 2 have similar meaning, but sentences 3 and 4 are neither related to each other nor to the first two. This is how I need to classify the sentences. So how can it be done?
First of all you should make yourself familiar with the bag of words concept.
The basic idea is to map each word in a sentence to its number of occurrences; e.g., the sentences "hello world" and "hello tanay" would get mapped onto
Hello World Tanay
1 1 0
1 0 1
This allows you to use one of the standard approaches.
Also worthwhile would be having a look at TF-IDF; it is designed to reweight the words in a bag-of-words representation by their importance for distinguishing the documents (or sentences, in your case).
Secondly, you should look at LDA, which was made specifically for clustering words into concepts. Note, however, that it represents each document as a mixture of only a few concepts.
The most promising approach to me sounds like a combination of these: generate the bags of words, reweight them using TF-IDF, run LDA, augment the reweighted bag of words with the LDA concepts, and then use a standard clustering algorithm.
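A rough scikit-learn sketch of that pipeline (the sentences are taken from the question; the number of topics and clusters are arbitrary choices here):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

sentences = [
    "Teacher has ample knowledge.",
    "Students understand what teacher teaches.",
    "Teacher is sometimes punctual in class.",
    "Teacher is audible in class.",
]

counts = CountVectorizer().fit_transform(sentences)           # bag of words
tfidf = TfidfTransformer().fit_transform(counts).toarray()    # reweighted by tf-idf
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

features = np.hstack([tfidf, topics])                         # augmented representation
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)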
Clustering cannot do this.
Because it looks for structure in the data, but you want to cluster by the abstract human concept of meaning, which is hard to capture using statistics...
So you first need to solve the really hard task of making the computer understand language reliably. And not on a "best match" basis, but good enough to quantify similarities.
There are actually some attempts in this direction, usually involving massive data and deep learning. They can do this on some toy examples such as "Paris - France + USA = ?" - sometimes. Google for IBM Watson and Google word2vec.
Good luck. You will need high-performance GPUs and Exabytes of training data.
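If you want to try the word2vec analogy above, pre-trained vectors are enough; a rough sketch with gensim's downloader (the model is a large download, and the exact nearest neighbours will vary):

import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pre-trained Google News vectors
# "Paris - France + USA = ?"
print(vectors.most_similar(positive=["Paris", "USA"], negative=["France"], topn=3))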

Comparing two English strings for similarities

So here is my problem. I have two paragraphs of text and I need to see if they are similar, not in the sense of string metrics but in meaning. The following two paragraphs are related, but I need to find out if they cover the 'same' topic. Any help or direction toward solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You can also have a look on Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking at the similarity of their hidden topic vectors, you can find out whether two given paragraphs are related or not.
Of course, the baseline is to not use LDA at all and instead use the term frequencies (weighted with tf-idf) to measure similarity (the vector space model).
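A rough scikit-learn sketch of the LDA route (in practice you would fit LDA on a much larger collection than just the two paragraphs; the texts below are shortened versions of the ones in the question):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "Fossil fuels are fuels formed by natural processes such as anaerobic "
    "decomposition of buried dead organisms; they include coal, petroleum, and natural gas.",
    "Fossil fuel reforming is a method of producing hydrogen or other useful "
    "products from fossil fuels such as natural gas, using a reformer and steam.",
]

counts = CountVectorizer(stop_words="english").fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_vectors = lda.fit_transform(counts)       # one topic distribution per paragraph

similarity = cosine_similarity(topic_vectors[:1], topic_vectors[1:])[0, 0]
print(similarity)    # values close to 1.0 suggest the paragraphs share topics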
