Data structures for bioinformatics [closed] - data-structures

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
What are some data structures that should be known by somebody involved in bioinformatics? I guess that anyone is supposed to know about lists, hashes, balanced trees, etc., but I expect that there are domain specific data structures. Is there any book devoted to this subject?

The most fundamental data structure used in bioinformatics is the string. There is also a whole range of data structures for representing strings, and algorithms such as string matching depend on these efficient representations.
A comprehensive work on this is Dan Gusfield's Algorithms on Strings, Trees and Sequences

A lot of introductory books on bioinformatics will cover some of the basic structures you'd use. I'm not sure what the standard textbook is, but I'm sure you can find that. It might be useful to look at some of the language-specific books:
Bioinformatics Programming With Python
Beginning Perl for Bioinformatics
I chose those two as examples because they're published by O'Reilly, which, in my experience, publishes good quality books.
I just so happen to have the Python book on my hard drive, and a great deal of it talks about processing strings for bioinformatics using Python. It doesn't seem like bioinformatics uses any fancy special data structures, just existing ones.

Spatial partitioning data structures (for example, kd-trees) are often used for nearest-neighbor queries on arbitrary feature vectors as well as for 3D protein structure analysis.
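As a rough illustration of the kd-tree idea, here is a minimal pure-Python sketch of building a tree and running a nearest-neighbor query. The names `Node`, `build`, and `nearest` are made up for this example, and the points stand in for hypothetical 3D coordinates; real projects would more likely use an existing library implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import dist

Point = tuple[float, ...]


@dataclass
class Node:
    point: Point              # the point stored at this node
    axis: int                 # coordinate index used to split at this node
    left: Node | None = None
    right: Node | None = None


def build(points: list[Point], depth: int = 0) -> Node | None:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def nearest(node: Node | None, target: Point, best: Point | None = None) -> Point | None:
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or dist(node.point, target) < dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best hit so far.
    if abs(diff) < dist(best, target):
        best = nearest(far, target, best)
    return best
```

For example, `nearest(build(points), (9, 2, 5))` returns the stored point closest to `(9, 2, 5)`. The pruning step is what lets a query skip large parts of the tree on well-distributed data.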
Best book for your $$ is Understanding Bioinformatics by Zvelebil because it covers everything from sequence analysis to structure comparison.

In addition to basic familiarity with the structures you mentioned, suffix trees (and suffix arrays), de Bruijn graphs, and interval graphs are used extensively. The Handbook of Computational Molecular Biology is very well written. I've never read the whole thing, but I've used it as a reference.
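To give a feel for suffix arrays, here is a small sketch that builds one naively and uses binary search to find every occurrence of a pattern. The function names are invented for this example, and the construction is deliberately simple (sorting whole suffixes); production tools use much faster construction algorithms and do not materialize the truncated suffixes.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction: fine for short strings, far too slow for genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions at which pattern occurs in text."""
    # Truncating every suffix to len(pattern) preserves the sorted order,
    # so all matching suffixes form one contiguous block found by binary search.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])
```

With `text = "GATTACATTAC"` and `sa = build_suffix_array(text)`, calling `find_occurrences(text, sa, "TTAC")` reports the positions where that substring starts.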

I also highly recommend this book, http://www.comp.nus.edu.sg/~ksung/algo_in_bioinfo/
More recently, Python has become much more widely used in bioinformatics than Perl, so I really suggest you start with Python; it is widely used in my projects.

Many projects in bioinformatics involve combining information from different, semi-structured sources. RDF and ontologies are essential for much of this. See, for example, the bio2RDF project. http://bio2rdf.org/. A good understanding of identifiers is valuable.
Much bioinformatics is exploratory and rapid lightweight tools are often used. See workflow tools such as Taverna where the primary resource is often a set of web services - so HTTP/REST are common.

Whatever your mathematical or computational expertise is, you are likely to find an application for it in computational biology. If not, ask another question on Stack Overflow and you'll be helped :o)
As mentioned in the other answers, somewhat timeless are string comparisons and pattern discovery in 1-dimensional data since sequences are so easy to get. With a renewed interest in medical informatics though you also have two/three-dimensional image analysis that you run e.g. against genomic data. With molecular biochemistry you also have pattern searches on 3D surfaces and molecular simulations. To study drug effects you will work with gene networks and compare those across tissues. Typical challenges for big data and information integration apply. And then, you need statistical descriptions of the likelihood of a pattern or the clinical association of any features identified to be found by chance.


Application of dynamic programming in real world programming [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
I have found that dynamic programming (DP) is tricky and demanding. Since I want to become a competent software engineer, I am wondering in which development scenarios DP is used heavily. In other words, does it have any practical use in development on modern computers?
Compared with design patterns such as the Proxy pattern and dynamic proxies, which are widely used in the Spring framework, DP seems to be useful only in tech interviews.
Also, parallel computing and distributed systems do not seem to lend themselves easily to DP.
Are there any common scenarios where DP is widely used in very practical ways?
Please forgive my ignorance: I haven't met DP in real production-level development, which makes me doubt the value of digging into it.
I agree with @Matt Timmermans.
You don't learn about DP in case you have to use DP someday. By practicing DP, you learn ways of thinking about problems that will make you a better developer. In 10 years, nobody will care about the spring framework, but the techniques you learned from DP will still serve you well.
Now, the answer to your questions, part by parts:
1) Why DP if we have modern computers?
I think you are conflating the power of modern computers with the need for DP: you may wonder why you need DP at all when fast modern processors can run your application.
Not every task can simply be thrown at modern hardware, because storage, network, and compute all come at a cost. As engineers, we should optimize our use of such resources, that is, make code efficient enough to run on minimal system configurations.
Today we have shared service architectures, in which different independent services share resources. In fact, they depend on each other indirectly. Imagine what happens if non-optimized code consumes a lot of memory and compute time: the processors will have difficulty allocating resources to other services or applications.
Put another way: "Why should I buy a whole apartment building if a single multi-bedroom flat meets my needs and leaves others the opportunity to buy a flat in the same building?"
2) DP in tech interviews
What makes DP one of the most challenging topics to master is the number of variations it comes in.
It tests your ability to break a difficult task down into smaller ones so as to avoid repetition, and thus save time, effort, and overall resources.
That is one of the main reasons why DP is part of tech interviews.
Not only does DP teach you to optimize, it also highlights bad coding practices.
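As a small illustration of "breaking a task into smaller ones to avoid repetition", here is a sketch that counts the right/down paths through a grid, first with plain recursion and then with memoization. The problem and the function names are chosen purely for illustration.

```python
from functools import lru_cache


def count_paths_naive(rows: int, cols: int) -> int:
    """Count right/down paths across a rows x cols grid, recomputing subproblems."""
    if rows == 1 or cols == 1:
        return 1
    return count_paths_naive(rows - 1, cols) + count_paths_naive(rows, cols - 1)


@lru_cache(maxsize=None)
def count_paths(rows: int, cols: int) -> int:
    """Same recurrence, but each (rows, cols) subproblem is solved only once."""
    if rows == 1 or cols == 1:
        return 1
    return count_paths(rows - 1, cols) + count_paths(rows, cols - 1)
```

Both functions return the same answers, but the naive version repeats the same subproblems exponentially often, while the memoized one solves only about `rows * cols` distinct subproblems.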
3) Usage of DP in real life
In Google Maps to find the shortest path between source and the series of destinations (one by one) out of the various available paths.
In networking to transfer data from a sender to various receivers in a sequential manner.
Document Distance Algorithms- to identify the extent of similarity between two text documents used by Search engines like Google, Wikipedia, Quora, and other websites
Edit distance algorithm used in spell checkers.
Databases caching common queries in memory: through dedicated cache tiers storing data to avoid DB access, web servers store common data like configuration that can be used across requests. Then multiple levels of caching in code abstractions within every single request that prevents fetching the same data multiple times and save CPU cycles by avoiding recomputation. Finally, caches within your browser or mobile phones that keep the data that doesn't need to be fetched from the server every time.
Git merge. Document diffing is one of the most prominent uses of LCS.
Dynamic programming is used in TeX's system of calculating the right amounts of hyphenations and justifications.
Genetic algorithms.
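To make one item from the list above concrete, here is a sketch of the classic edit distance (Levenshtein) recurrence of the kind a spell checker might use. It is a textbook formulation, not the code of any particular product, and the function name is just for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; only two rows are kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]
```

A spell checker could, for instance, rank dictionary words by `edit_distance(typed_word, candidate)` and suggest the closest ones.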
Also, I found a great answer on Quora which lists the areas in which DP can be used:
Operations research,
Decision making,
Query optimization,
Water resource engineering,
Economics,
Reservoir Operations problems,
Connected speech recognition,
Slope stability analysis,
Using Matlab,
Using Excel,
Unit commitment,
Image processing,
Optimal Inventory control,
Reservoir operational Problems,
Sap Abap,
Sequence Alignment,
Simulation for sewer management,
Finance,
Production Optimization,
Genetic Algorithms for permutation problem,
Haskell,
HTML,
Healthcare,
Hydropower scheduling,
LISP,
Linear space,
XML indexing and querying,
Business,
Bioinformatics

Tutorials For Natural Language Processing [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I recently took a Coursera class on "Natural Language Processing" and learnt a lot about parsing, IR, and other interesting topics such as Q&A. Although I grasped the concepts well, I did not gain any practical knowledge. Can anyone suggest good online tutorials or books for Natural Language Processing?
Thanks
You could read Jurafsky and Martin's Speech and Language Processing (2008 edition), which is the standard textbook in the field. It's long, and has a variety of topics, so I'd suggest reading just the chapters that really apply to your interests.
Further, the best way to learn is almost certainly to actually implement NLP algorithms from scratch. You could pick some standard tasks (language modeling, text classification, POS-tagging, NER, parsing) and implement various algorithms from the ground up (ngram models, HMMs, Naive Bayes, MaxEnt, CKY) to really understand what makes them work. It also shouldn't be too hard to find some free dataset to test your implementations on.
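As an example of what "from scratch" can look like, here is a minimal sketch of a bigram language model with add-one (Laplace) smoothing. The function names, the whitespace tokenization, and the `<s>`/`</s>` boundary markers are simplifying assumptions for illustration only.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count bigrams over whitespace-tokenized, lowercased sentences."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts, vocab


def bigram_prob(counts, vocab, prev, word):
    """Estimate P(word | prev) with add-one smoothing over the vocabulary."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return (following[word] + 1) / (total + len(vocab))
```

After `counts, vocab = train_bigrams(["the cat sat", "the dog sat"])`, the call `bigram_prob(counts, vocab, "the", "cat")` gives a smoothed estimate, and unseen pairs still receive a small non-zero probability.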
Finally, there are lots of tutorials out there for specific NLP algorithms that are excellent. For example, if you want to build an HMM, I suggest Jason Eisner's tutorial which also covers smoothing and unsupervised training with EM. If you want to implement Gibbs sampling for unsupervised Naive Bayes training, I suggest Philip Resnik's tutorial.
Aside from Jurafsky and Martin's book, Christopher D. Manning and Hinrich Schütze's Foundations of Statistical Natural Language Processing is also widely used. For IR, Manning et al. also wrote Introduction to Information Retrieval which can be read or downloaded online at their site.
If you want practical knowledge of how to work with natural language, you should start implementing it.
I suggest using NLTK (the Natural Language Toolkit) with Python. It's easy to implement NLP in Python.
You can refer to this link:
http://nltk.org/
Or you can try it online on
http://cst.dk/online/pos_tagger/uk/
Instead of reading a specific book, diving into the sea of papers might be just as good an idea. http://www.aclweb.org, for example, covers many topics in NLP. Through those papers, you get references to more papers, some of which are the foundations of a certain branch of NLP. And because they were written by different authors, you are unlikely to be influenced too much by one point of view.
If you are a Java developer, there is an extensive list of tutorials on how to build components of NLP systems using LingPipe at http://alias-i.com/lingpipe/demos/tutorial/read-me.html. Full disclosure: I wrote some of those tutorials and one of the books below.
There are a few books that are more industrially oriented:
1) Natural Language Processing with Java by Richard M Reese
This covers how to do some common tasks with a range of open source toolkits (including LingPipe).
2) Natural Language Processing with Java and LingPipe Cookbook by Breck Baldwin and Krishna Dayanidhi
This book is task driven at the level of "get the component built" and covers the major technologies driving most NLP systems that are text driven. It does not cover translation. It goes into more detail than the first book and has broader coverage than the LingPipe tutorials but is sometimes less detailed than the tutorials.
Breck
There is a hub for teaching and learning materials called TeLeMaCo. You can find resources for many aspects of NLP, and you can easily add more materials that you have found on the web.

MapReduce alternatives [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 months ago.
Are there any alternative paradigms to MapReduce (Google, Hadoop)? Is there any other reasonable way how to split & merge big problems?
Definitely. Check out, for example, Bulk Synchronous Parallel. Map/Reduce is in fact a very restricted way of reducing problems; however, that restriction makes it manageable in a framework like Hadoop. The question is whether it is less trouble to press your problem into a Map/Reduce setting, or whether it is easier to create a domain-specific parallelization scheme and take care of all the implementation details yourself. Pig, in fact, is only an abstraction layer on top of Hadoop which automates many standard problem transformations from not-Map-Reduce-y to Map-Reduce-compatible.
Edit 26.1.13: Found a nice up-to-date overview here
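To show how restricted the Map/Reduce shape is, here is a single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It only illustrates the programming model; the function names are invented, and it does not use or imitate the Hadoop API.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Any problem that can be phrased as independent map calls followed by per-key reductions fits this mold; problems that need iteration or arbitrary communication between workers are where the alternatives discussed here become attractive.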
Phil Colella identified seven numerical methods for scientific computation based on the patterns of scattering and gathering of data between processing nodes, and called them 'dwarfs'. Others have since added to them; a list is available at the Dwarf Mine:
Dense Linear Algebra
Sparse Linear Algebra
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
Update (August 2014): Stratosphere is now called Apache Flink (incubating).
Have a look at Stratosphere. It is another Big Data runtime that offers more operators (map, reduce, join, union, cross, iterate, ...). It also lets you define advanced data flow graphs (with Hadoop MR, you would have to chain jobs).
Stratosphere also supports BSP with its graph processing abstraction (called Spargel).
If you like to read scientific papers, have a look at Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing, it explains the theoretical backgrounds of the system.
Another system in the field is Spark, which has its own model (RDDs). Since BSP has been mentioned here, also have a look at GraphLab; they offer an alternative to BSP.
Microsoft's Dryad is claimed to be more general than MapReduce.
The best alternative to MapReduce is Spark, because it is reported to be 10 to 100 times faster than MapReduce on some workloads.
It is also easy to maintain and delivers high performance with less code.

Priority of learning programming craft and other suggestions [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
As I am in the first year of my career in software development (C++ & C#), I now see my flaws and what I am missing in this field. I therefore drew some conclusions and made a plan to fill those gaps and increase my knowledge of software development. But having drawn up the list of tasks, I stumbled on a question whose answer is not obvious to me: what priority should these tasks have? Here are the tasks, numbered by my priority:
Learning:
1) Functional programming (Scala)
2) Data structures & Algorithms (Cormen book to the rescue + TopCoder/ProjectEuler/etc)
3) Design patterns (GOF or Head First)
Do you agree with these tasks and priorities? Or am I missing something? Any suggestions are welcome!
I think you have it backwards. Start with design patterns, which will help you reduce the amount of messy code you produce and better understand code written by other people (particularly libraries written with design patterns in mind).
In addition to the book of four, there are many other design pattern books -- Patterns of Enterprise Application Architecture, for example. It might be worth looking at them after you get a good grounding. But I also highly recommend Domain Driven Design, which I think gives you a way of thinking about how to structure your program, instead of just identifying pieces here and there.
Next you can go with algorithms. I prefer Skiena's The Algorithm Design Manual, whose emphasis is more on getting people to know how to select and use algorithms, as well as building them from well known "parts" than on getting people to know to make proofs about algorithms. It is also available for Kindle, which was useful to me.
Also, get a good data structures book -- people often neglect that. I like the Handbook of Data Structures and Applications, though I'm also looking into Advanced Data Structures.
However, I cannot recommend either TopCoder or Euler for this task. TopCoder is, imho, mostly about writing code fast. Nothing bad about it, but it's hardly likely to make a difference on day-to-day stuff. If you like it, by all means do it. Also, it's excellent preparation for job interviews with the more technically minded companies.
Project Euler, on the other hand, is much more targeted at scientific computing, computer science and functional programming. It will be an excellent training ground when learning functional programming.
There's something that has a bit of design patterns, algorithms and functional programming, which is Elements of Programming. It uses C++ for its examples, which is a plus for you.
As for functional programming, I think it is less urgent than the other two. However, I'd suggest either Clojure or Haskell instead of Scala.
Learning functional programming in Scala is like learning Spanish in a latino neighborhood, while learning functional programming in Clojure is like learning Spanish in Madrid, and learning functional programming in Haskell is like learning Spanish in an isolated monastery in Spain. :-)
Mind you, I prefer Scala as a programming language, but I already knew FP when I came to it.
When you do get to functional programming, get Chris Okasaki's Purely Functional Data Structures, for a good grounding on algorithms and data structures for functional programming.
Beyond that, try to learn a new language every year. Even if not for the language itself, you are more likely to keep up to date with what people are doing nowadays.
Data structures and algorithms will help you no matter what language you use. I'd work on it first. Then design patterns (any OOP language will benefit from them). Functional programming is nice, but not necessarily a top priority.
Depends entirely on what you're doing.
I'd tailor which one you learn first to what would help you the most with your current job.
Write lots of code. Try to do it better every time. Occasionally work with more senior people, who can provide guidance, praise, and gentle correction.
I think that in general the topics that you have picked are very important, and may give you the chance to do something more than the usual boring stuff. However, I believe that the order should be something like this:
1) Data structures & Algorithms
2) Functional programming
3) Software Design
4) Specific technologies you need
My opinion is that algorithms and data structures should come first. It is very hard to study algorithms if you have a lot of other things in your head (good coding practices, lots of programming paradigms, etc.). Also, with time, people tend to become lazier and lose the patience to get into the ideas of this complex matter. On the other hand, missing some fundamental understanding of how things can be represented or operate may lead to serious flaws in understanding anything more sophisticated. So, assuming that you have some ideas about imperative programming (the usual stuff taught in introductory courses), you should enhance your knowledge with algorithms and data structures.
It is important to have at least basic understanding of other paradigms. Functional programming is a good example. You may also consider getting familiar with logic programming. Having basic understanding of Algorithms and Data Structures will help you a lot in understanding how such languages work. I don't know whether Scala is the best language for that purpose, but will probably do. Alternatively, you can pick something more classic like Lisp, or Scheme. Haskell is also an interesting language.
About design patterns: knowing them will help you with object-oriented design, but you should be aware that design patterns are just a set of solutions to popular problems. Knowing design patterns is by no means the same as knowing how to design software. In order to improve your software design skills, you should study other materials too. Good sources for understanding these concepts are the book Code Complete and the MIT course 6.170 (its materials are publicly available).
At some point you will need to get into the details of a specific framework (or frameworks) that you will need for what you do. Keep in mind, that such frameworks change, and you should be able to adapt, and learn new technologies. For instance, knowing ASP.NET MVC now, may be worthless 5 years from now (or may not be, who knows?).
Finally, keep in mind, that no matter what you read, you need to practice a lot, which means solving problems, writing code, designing software, etc. Most of these concepts can not be easily explained, or even expressed with words, so you will need to reach most of them by yourself, (that is, you will need to reinvent the wheel many times).
Good luck with your career!
I would think functional programming would be low in priority, since the languages you use are OO in nature. Spending some time on design patterns and on the specifics of the language itself would be more useful.
I read both GOF and HeadFirst, HeadFirst is probably the easier and more fun of the 2 but much thicker. You should probably look at Enterprise Design Patterns, like Martin Fowler's page http://martinfowler.com/eaaCatalog/
What field do you think you will work in? Games? Web? That will probably decide how important the algorithms part will be for you.
I would say that you first need to understand (even if not remember) the basic algorithms and data structures (use Knuth and Cormen), then learn architecture (this is where design patterns belong).
Functional programming is just one type of programming and is not mandatory. There are many great programmers who do not use functional programming, but I assume that for all kinds of programming you must first know the basics: algorithms and data structures.
I'd say #2 goes first, especially if you are planning to use C++/C# at work, having a good command of data structures and algorithms will give you some edge. I see #1 and #3 as somewhat parallel paths, but I do have a couple of suggestions: start with the Head First book for patterns, the GOF is more like a reference book and also the notation and language may get quite abstruse. As for functional programming, may I suggest Clojure instead of Scala? I'm convinced that a "functional-first" language (like F# or Clojure) will force you to think functional (a good thing) instead of just patching your O-O/imperative skills.

Simple algorithm tutorials? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I want to learn algorithms using some very basic simple tutorials. Are there any out there? I have heard of recursion and stuff and I would like to get good at it. Any help would be appreciated.
I would start out by taking a look at EternallyConfuzzled, which contains great tutorials on basic data structures and algorithms, including linked lists and binary search trees, and sorting and searching algorithms. If you want to learn more after this, I would recommend the following books, in order of increasing complexity, completeness, and required math knowledge:
Algorithms in C (also available in C++ and Java)
Introduction to Algorithms
The Art of Computer Programming
If you want to learn algorithms this book is the best choice.
(source: mcgraw-hill.com)
MIT's OCW has video lectures of their Algorithm course. The professor is one of the authors of the book Introduction to Algorithms, which another poster suggested.
It assumes a basic knowledge of Discrete Maths.
TopCoder has some good algorithm tutorials.
If you're interested in a tutorial, avoid the CLRS book recommended above. It takes a rigorous theoretical approach to the study of algorithms, which is very different from a tutorial approach.
You learn Algorithms by doing them. So find a resource that provides Algorithms problems and guidance in solving them. If you want a textbook, check out the Algorithm Design Manual, which also has an online Algorithm Repository.
If you prefer an online course, Udacity offers a python-based Algorithms course, while Coursera offers general and Java-based ones.
Since the important part is practicing Algorithms, you can skip the video courses and just solve challenges. Other answers suggested sites with challenges you can practice once you're good at Algorithms. In the beginning you'll want more guidance, so find a resource that provides Algorithms challenges and help with solving them. I created Learneroo for this purpose. You can start by learning the fundamentals of Recursion with the Recursion Tutorial.
Recursion really isn't an algorithm. Since you don't have anything specific you're interested in, I'd suggest you read Wikipedia's List of algorithms or, as others have suggested, grab a book.
I would start at the Stony Brook Algorithm Repository. The site has some really good explanations of different types of algorithms, and it references what books and other resources it uses so you can get a taste of what's available.
I suggest that you start with sorting algorithms. Read the related Wikipedia page, skip the O(n log n) stuff, and focus on the implementations of, say, insertion sort, merge sort, and quick sort. Familiarize yourself with binary search. Also, learn about some basic data structures, such as vectors, linked lists, and stacks, their implementation, and what they are useful for. (More often than not, an algorithm to solve a problem goes together with a suitable data structure.) Once you are confident with different algorithms and data structures, you can dive into a more complete treatise such as the book by Cormen et al.
As for recursion, it is not an algorithm in itself. It is instead a technique that some algorithms employ to solve a problem, when the latter can be naturally split into subproblems. The technique of splitting a problem, solving the subproblems separately and then merging their solutions to obtain a solution for the original problem, is called "divide et impera", or "divide and conquer". (Recursion is also the related feature of most programming languages, where it basically means "functions that call themselves".)
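Merge sort is a compact example of the divide-and-conquer scheme just described: split the input, solve each half recursively, then merge the two solutions. The following is a plain sketch for study purposes, not a replacement for Python's built-in `sorted`.

```python
def merge_sort(items: list) -> list:
    """Return a new sorted list using divide and conquer."""
    # Base case: a list of zero or one element is already sorted.
    if len(items) <= 1:
        return list(items)

    # Divide: split the problem into two halves and solve each recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Conquer: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The function calls itself on each half, which is the "functions that call themselves" sense of recursion, while the overall split/solve/merge structure is the divide-and-conquer technique.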
The most cited, the most trivial, and the most useless example of a "recursive algorithm", is the one to compute factorials. Don't mind it. Instead, read about the Tower of Hanoi problem, which admits a simple and elegant recursive solution, and again, study some sorting algorithms, for many of them are indeed recursive.
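For reference, the recursive Tower of Hanoi solution fits in a few lines. This sketch returns the list of moves as (from_peg, to_peg) pairs; the function name and peg labels are arbitrary choices for the example.

```python
def hanoi(n: int, source: str, target: str, spare: str) -> list[tuple[str, str]]:
    """Return the moves that transfer n disks from source to target."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # park the n-1 smaller disks on the spare peg
        + [(source, target)]                   # move the largest disk to its destination
        + hanoi(n - 1, spare, target, source)  # bring the smaller disks back on top of it
    )
```

Calling `hanoi(3, "A", "C", "B")` yields seven moves, and in general the solution for n disks takes 2**n - 1 moves.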
To the various people who have commented that book xyz is not simple, I'd point out that algorithmics is not a simple topic. You need at least university entry level mathematics to understand the concepts plus the ability to reason about computation at a suitably abstract level. If you ever find an "Algorithmics for Dummies" book, don't waste your money!
my choice http://aduni.org/courses/algorithms/
Going through solutions to TopCoder problems is a very good way to pick up algorithms. Reading theory alone won't help.
Khan academy started an excellent interactive self paced course on algorithms - https://www.khanacademy.org/computing/computer-science/algorithms.
Recursion is a language feature, and less an "algorithm" per se. All recursion can be replaced with proper data structures (like a stack).
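To illustrate the point about replacing recursion with an explicit stack, here is a sketch that flattens arbitrarily nested lists in both styles. The task and the function names are chosen only for illustration.

```python
def flatten_recursive(items: list) -> list:
    """Flatten nested lists using the call stack."""
    out = []
    for item in items:
        if isinstance(item, list):
            out.extend(flatten_recursive(item))
        else:
            out.append(item)
    return out


def flatten_iterative(items: list) -> list:
    """Flatten nested lists using an explicit stack of iterators instead of recursion."""
    out = []
    stack = [iter(items)]
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()                # this nesting level is exhausted; resume the parent
            continue
        if isinstance(item, list):
            stack.append(iter(item))   # descend, as a recursive call would
        else:
            out.append(item)
    return out
```

Both functions produce the same result. The iterative version keeps its pending work on a heap-allocated list rather than the call stack, so its depth is not bounded by the interpreter's recursion limit.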
I'd recommend grabbing a book. The problem with algorithms is that it's a relatively progressive topic. You first need to learn simple searches before you can learn sorting, and you need sorting before you can do minimum spanning trees etc. A book will properly order these, and if the text doesn't give you enough information the internet is a great next step. Try Amazon and look at the comments for someone who is new.
Make sure you learn an implementation language before you try this, though. Until you understand how the language works, it's going to be very hard to tell bugs in your logic from a misunderstanding of what happens for a given sequence of commands.
The USA Computing Olympiad has a nice algorithms training site that, so far, anyone can sign up for, and it's almost in a class-like format: read a little, do an exercise, read more, do an exercise, and so on.
One of my favorite lists of algorithm problems is Project Euler. The problems are pretty diverse, you can solve the same problem many times to optimize it, and you will find lots of communities (C++, C#, Python, etc.) posting their benchmarks for every problem.
It is so much fun, geek fun
Solve questions on various sites such as SPOJ, and read books such as Introduction to Algorithms. There are some online courses on Coursera as well.
