How to handle multiple optimal edit paths when implementing the Needleman-Wunsch algorithm? - bioinformatics

I am trying to implement the Needleman-Wunsch algorithm for comparing biological sequences. In some circumstances there exist multiple optimal edit paths.
What is the common practice in sequence-comparison tools for handling this? Is there any priority/preference among substitution, insertion and deletion?
If I want to keep multiple edit paths in memory, is there any recommended data structure? More generally, how should I store paths with branches and merges?
Any comments appreciated.

If two paths have identical scores, their likelihood is the same no matter which kinds of operations they used. Priority for substitutions vs. insertions or deletions has already been handled in arriving at that score. So if two scores are the same, common practice is to break the tie arbitrarily.
You should be able to handle this by recording, in your traceback matrix, all of the cells you could have arrived at the current one from. Then, during traceback, start a separate branch whenever you come to a branching point. To allow for merges too, store some additional data about each cell (how will depend on what language you're using) indicating how many different paths left from it; during traceback, wait at a given cell until that number of paths has arrived back at it, and then merge them into one. You can either follow the different branches with true parallel processing, or simply alternate which one you are advancing.
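To make that concrete, here is a minimal Python sketch of the idea (the match/mismatch/gap scores are arbitrary placeholders, not a recommendation): every cell records all of its optimal predecessors, and the traceback branches on each of them to enumerate every co-optimal alignment.

MATCH, MISMATCH, GAP = 1, -1, -1   # placeholder scoring values

def align_all(a, b):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[set() for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):              # first column: all deletions
        score[i][0] = i * GAP
        back[i][0].add("up")
    for j in range(1, m + 1):              # first row: all insertions
        score[0][j] = j * GAP
        back[0][j].add("left")
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (MATCH if a[i-1] == b[j-1] else MISMATCH)
            up, left = score[i-1][j] + GAP, score[i][j-1] + GAP
            best = max(diag, up, left)
            score[i][j] = best
            # keep *every* direction that achieves the optimum
            if diag == best: back[i][j].add("diag")
            if up == best: back[i][j].add("up")
            if left == best: back[i][j].add("left")

    def trace(i, j):                       # branch on all recorded predecessors
        if i == 0 and j == 0:
            yield "", ""
            return
        for d in back[i][j]:
            if d == "diag":
                for x, y in trace(i - 1, j - 1):
                    yield x + a[i-1], y + b[j-1]
            elif d == "up":
                for x, y in trace(i - 1, j):
                    yield x + a[i-1], y + "-"
            else:
                for x, y in trace(i, j - 1):
                    yield x + "-", y + b[j-1]

    return list(trace(n, m))

print(align_all("AT", "AAT"))   # two co-optimal alignments: ('-AT', 'AAT') and ('A-T', 'AAT')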

Unless you have a reason to prefer one input sequence over the other in advance, it should not matter.
Otherwise you might take seq_a as the vertical axis and seq_b as the horizontal axis, and always choose to step in your preferred direction when there is a tie to break... but I'm not convincing myself that this makes any difference to the alignment, even if one favors one of the starting sequences over the other.

Like many similar algorithms, Needleman-Wunsch is essentially the task of finding the shortest path through a graph (a square grid in this case). So I would use A* to determine an alignment and store the possible paths as a dictionary keyed by the nodes passed.

Related

Understanding thread-safety

Let us use as an example a standard Binary Search Tree, whose nodes hold an encapsulated value and pointers to the left and right children and to the parent. Its primary methods are addNode(), removeNode() and searchNode().
I want to understand how to make my data structure thread-safe. However, I do not know much about thread-safety, and I want to ask whether what I am thinking is correct.
My first problem is one of decision: what do I want to happen when methods run concurrently? Say I am using my BST to manage a map in a game: the position of a player's character is represented by an array of doubles, and I store the players' positions in my BST in lexicographic order. One player destroys a target (and his thread wants to remove it from my map), but another player is initiating an attack and needs a list of all possible targets within a certain distance of him, including the target just removed (and his thread wants to execute a search that will include that target). How do I manage this situation? What about other situations?
My second problem, (I think) my biggest, is one of actuation: how do I make what I want to happen actually happen? I have read about things like atomic operations, linearizability, obstruction-freedom, lock-freedom and wait-freedom, but I have never found in a single place a simple definition of each, with the advantages and disadvantages of that strategy. What are those things? What am I doing if I lock the nodes, or the pointers to other nodes, or the values encapsulated in the nodes?
Am I understanding the problems of thread-safety correctly? Do you have suggestions on problems I should think about, or on papers and PDFs I should read, in order to better understand the problems and their many solutions?

In Reinforcement learning using feature approximation, does one have a single set of weights or a set of weights for each action?

This question is an attempt to reframe this question to make it clearer.
This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions.
These discussions (The Basic Update Rule and Linear Value Function Approximation) show a set of weights for each action.
The reason they are different is that the first slide assumes you can anticipate the result of performing an action and then find features for the resulting states. (Note that the feature functions are functions of both the current state and the anticipated action.) In that case, the same set of weights can be applied to all the resulting features.
But in some cases, one can't anticipate the effect of an action. Then what does one do? Even if one has perfect weights, one can't apply them to the results of applying the actions if one can't anticipate those results.
My guess is that the second pair of slides deals with that problem. Instead of performing an action and then applying weights to the features of the resulting states, compute features of the current state and apply possibly different weights for each action.
Those are two very different ways of doing feature-based approximation. Are they both valid? The first one makes sense in situations, e.g., like Taxi, in which one can effectively simulate what the environment will do at each action. But in some cases, e.g., cart-pole, that's not possible/feasible. Then it would seem you need a separate set of weights for each action.
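To make the distinction concrete, here is a small numpy sketch (the function and variable names are mine, purely illustrative) of the two parameterizations being compared:

import numpy as np

# 1) A single weight vector over state-action features phi(s, a).
#    This requires being able to build features of the (state, action) pair,
#    e.g. by anticipating the action's effect, as in Taxi-like domains.
def q_shared(w, phi_sa, state, action):
    return float(np.dot(w, phi_sa(state, action)))

# 2) One weight vector per action over state-only features phi(s).
#    Nothing about the action's effect is needed, as in cart-pole.
def q_per_action(W, phi_s, state, action):
    return float(np.dot(W[action], phi_s(state)))

# The two forms have the same expressive power: stack the per-action vectors
# into one long w and define phi(s, a) as phi(s) written into action a's slot
# (zeros elsewhere), and form 2 becomes form 1.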
Is this the right way to think about it, or am I missing something?
Thanks.

Algorithms for Minimum resource requirements

I have a question for which I have made some solutions, but I am not happy with their scalability. I'm looking for input on different approaches/algorithms for solving it.
Problem:
Software can run on electronic controllers (ECUs) and requires different resources to run a given feature. It may require a given amount of storage or RAM, or a digital or analog input or output, for instance. If we have multiple features and multiple controller options, we want to find the combination that minimizes the hardware requirements (cost). I'll simplify the resources to letters to make the examples easier to follow.
Example 1:
Feature1(A)
ECU1(A,B,C)
First a trivial example. Let's assume that a feature requires 1 unit of resource A, and the ECU has 1 unit each of resources A, B and C available; it is obvious that the feature will fit in the ECU, with resources B and C left over.
Example 2:
Feature2(A,B)
ECU2(A|B,B,C)
In this example, Feature 2 requires resources A and B, and the ECU has 3 resources, the first of which can be either A or B. You can again see that the feature will fit in the ECU, but only if you check in a certain order: if you assign F(A) to E(A|B) and then F(B) to E(B) it works, but if you assign F(B) to E(A|B) there is no resource left on the ECU for F(A), so it doesn't appear to fit. This leads to the observation that we should prefer non-OR'd resources first to avoid such a conflict.
An example of the above would be an analog input that can also be used as a digital input.
Example 3:
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C)
Now things are a little bit more complicated, but it is still quite obvious to a person that the feature will fit into the ECU.
My problems are simply more scaled-up versions of these examples (i.e. multiple features per ECU, with more ECUs to choose from).
Algorithms
GA
My first approach to this was to use a genetic algorithm. For a given set of features, e.g. F(A,B,C,D), and a list of currently available ECUs, find which single ECU or combination of ECUs fits the requirements.
ECUs would initially be selected at random, and features were checked to see whether they fitted and then added to them. If a feature didn't fit, another ECU was added to the architecture. A population of these architectures was created and ranked on the lowest cost of housing all the features. Architectures could then be mated over successive generations, with mutations and so on, to improve fitness.
This approach worked quite well, but tended to get stuck in local minima (not the cheapest option) when compared with a golden example I had worked out by hand.
Combinatorial / Permutations
My next approach was to work out all of the possible permutations (of the ORs from above) for an ECU, to see if the features fit.
If we go back to example 2 and expand the ORs we get 2 permutations:
Feature2(A,B)
ECU2(A|B,B,C) = (A,B,C), (B,B,C)
From here it is trivial to check that the feature fits in the first permutation, but not the second.
...and for example 3 there are 12 permutations
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C) = (A,B,A), (B,B,A), (C,B,A), (A,C,A), (B,C,A), (C,C,A), (A,B,C), (B,B,C), (C,B,C), (A,C,C), (B,C,C), (C,C,C)
Again it is trivial to check that feature 3 fits in at least one of the permutations (3rd, 5th & 7th).
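For what it's worth, here is a brute-force Python sketch of this permutation idea (the names are illustrative): expand the OR'd slots with itertools.product and test whether the feature's requirements form a sub-multiset of an expansion.

from itertools import product
from collections import Counter

def fits(feature, ecu_slots):
    need = Counter(feature)
    for combo in product(*ecu_slots):        # one letter chosen per slot
        if not need - Counter(combo):        # empty result => every requirement covered
            return True
    return False

# Example 3 from above: 3 * 2 * 2 = 12 expansions, at least one of which fits.
print(fits(["A", "B", "C"], [["A", "B", "C"], ["B", "C"], ["A", "C"]]))   # True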
Based on this approach I was also able to get a solution, but I have ECUs with so many OR'd inputs that there are millions of ECU permutations, which drastically increases the run time (minutes). I can live with this, but I first wanted to see if there was a better way to skin the cat, apart from parallelizing this approach.
So that is the problem...
I have more ideas on how to approach it, but I assume there is a fancy name for such a problem, or a name for an algorithm that has been around for 20+ years that I'm not familiar with, and I was hoping someone could point me in that direction, either to some papers or to the names of the relevant algorithms.
The obvious remark of simply summing the feature resource requirements and creating a new monolithic ECU is not an option. Lastly, no, this is not in any way associated with an assignment or problem set by a school or university.
Sorry for the long question, but hopefully I've sufficiently described what I am trying to do and this piques the interest of someone out there.
Sincerely, Paul.
It looks like plugging an individual feature can be solved as bipartite matching.
You make a bipartite graph:
the left side corresponds to the feature requirements
the right side corresponds to the ECU subnodes
edges connect left- and right-side vertices with common letters
Let me explain using example 2:
Feature2(A,B)
ECU2(A|B,B,C)
Here is how the graph looks:
2 left vertices: L1 (A), L2 (B)
3 right vertices: R1 (A|B), R2 (B), R3 (C)
3 edges: L1-R1 (A-A|B), L2-R1 (B-A|B), L2-R2 (B-B)
Then you find a maximum matching in this bipartite graph. There are a few well-known algorithms for it:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
If the maximum matching covers every feature vertex, we can use it to plug in the feature.
If the maximum matching does not cover every feature vertex, we are short of resources.
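As a sketch of what this looks like in code (illustrative only; any textbook maximum-matching routine would do), here is the standard augmenting-path algorithm applied to one feature and one ECU:

def feature_fits(requirements, slots):
    # Edge i-j exists when slot j can provide required letter i.
    adj = [[j for j, s in enumerate(slots) if r in s] for r in requirements]
    match_right = [-1] * len(slots)          # slot index -> requirement index

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            # Use the slot if it is free, or if its current owner can move elsewhere.
            if match_right[j] == -1 or try_assign(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    matched = sum(try_assign(i, set()) for i in range(len(requirements)))
    return matched == len(requirements)

# Example 2 from the question: Feature2(A,B) vs ECU2(A|B, B, C)
print(feature_fits(["A", "B"], [{"A", "B"}, {"B"}, {"C"}]))   # True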
Unfortunately, this approach works like a greedy algorithm: it does not know about upcoming features and does not tweak the solution to fit more features later. Partial optimization for simple cases can work as you described in the question, but in general it is a dead end; only an algorithm that accounts for every feature in the whole feature set can produce an overall effective solution.
You can try to add several features to one ECU simultaneously: if you want to add a new feature to a given ECU, you can try matching all the already-assigned features plus the candidate feature. In that case a locally optimal solution will be found for the given feature set (if it is possible to plug them all into one ECU).
I don't have enough reputation to comment, so here's what I wanted to propose for your problem:
Like GA, there are some other random-based approaches too, e.g. a Bayesian approach, decision trees, etc.
In my opinion a decision tree would suit your problem, as it shows, for a given input dataset/attributes, a path to each class (in your case ECUs), which helps select the right class/ECU. Train your system with some sample data sets so that it can decide on the right ECU for your actual data set/features.
Check Decision Trees - Machine Learning for more information. Hope it helps!

Intelligent purely functional sets

Set computations composed of unions, intersections and differences can often be expressed in many different ways. Are there any theories or concrete implementations that try to minimize the amount of computation required to reach a given answer?
For example, I first came across a practical application of this when trying to decompose the atoms in a simulation of an amorphous material into neighbor shells, where the first shell is the immediate neighbors of some given origin atom, and each subsequent shell is the set of atoms that are neighbors of the previous shell but lie in neither of the two preceding shells:
nth 0 = singleton i
nth 1 = neighbors i
nth n = reduce union (map neighbors (nth(n-1))) - nth(n-1) - nth(n-2)
There are many different ways to solve this. You can incrementally test for membership in each set while composing the result, or you can compute the union of three neighbor shells and then remove the previous two, leaving the outermost one. In practice, solutions that require the construction of large intermediate sets are slower.
Presumably an intelligent set implementation could build up the expression to be evaluated and then optimize it (e.g. to reduce the size of intermediate sets) before evaluating it, in order to improve performance. Do such set implementations exist?
Your question immediately reminded me of Haskell's stream fusion, described in this paper. The general principle can be summarized quite easily: instead of storing a list, you store a way to build a list. The list transformation functions then operate directly on the list generator, meaning that all the operations fuse into a single generation of the data without any intermediate structures. When you are done composing operations, you run the generator and produce the data.
So I think the answer to your question is that if you wanted some similarly intelligent mechanism that fused computations and eliminated intermediate data structures, you'd need to find a way to transform a set into a "co-structure" (that's what the paper calls it) that generates a set and operate directly on that, then actually generate the set when you are done.
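As a toy illustration of that idea in Python (this is only an analogue, not the mechanism the paper describes), you can represent a set expression as a membership predicate plus a candidate generator, so unions and differences fuse into a single pass with no intermediate sets:

class LazySet:
    def __init__(self, candidates, contains):
        self.candidates = candidates         # iterable of candidate elements
        self.contains = contains             # membership predicate

    @staticmethod
    def lift(s):
        s = frozenset(s)
        return LazySet(s, s.__contains__)

    def __or__(self, other):                 # union: candidates from both sides
        return LazySet(list(self.candidates) + list(other.candidates),
                       lambda x: self.contains(x) or other.contains(x))

    def __sub__(self, other):                # difference: fused into the predicate
        return LazySet(self.candidates,
                       lambda x: self.contains(x) and not other.contains(x))

    def force(self):                         # only now is a real set built
        return {x for x in self.candidates if self.contains(x)}

a, b, c = map(LazySet.lift, [{1, 2}, {2, 3}, {3}])
print(((a | b) - c).force())                 # {1, 2}, computed in one pass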
I think there's a very deep theory behind this concept that the paper hints at but never spells out, and if somebody else here knows what it is, please let me know, because this is very relevant to something else I am doing, too!

How to subdivide a 2d game world for better collision detection

I'm developing a game which features a sizeable square 2D playing area. The playing area is tileless with bounded sides (no wrapping around). I am trying to figure out how best to divide up this world to improve the performance of collision detection. Rather than checking each entity for collision with all other entities, I want to check only nearby entities for collision and obstacle avoidance.
I have a few special concerns for this game world...
I want to be able to use a large number of entities in the game world at once. However, a percentage of entities won't collide with entities of the same type. For example, projectiles won't collide with other projectiles.
I want to be able to use a large range of entity sizes; there should be a very large size difference between the smallest entities and the largest.
There are very few static or non-moving entities in the game world.
I'm interested in using something similar to what's described in the answer here: Quadtree vs Red-Black tree for a game in C++?
My concern is how well a tree subdivision of the world will be able to handle large size differences in entities. To divide the world up finely enough for the smaller entities, the larger ones will need to occupy a large number of regions, and I'm concerned about how that will affect the performance of the system.
My other major concern is how to properly keep the list of occupied areas up to date. Since there are a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
I'm mostly looking for any good algorithms or ideas that will help reduce the number of collision detection and obstacle avoidance calculations.
If I were you I'd start off by implementing a simple BSP (binary space partitioning) tree. Since you are working in 2D, bounding box checks are really fast. You basically need three classes: CBspTree, CBspNode and CBspCut (not really needed).
CBspTree has one root node instance of class CBspNode
CBspNode has an instance of CBspCut
CBspCut symbolizes how you cut a set into two disjoint sets. This can neatly be solved by introducing polymorphism (e.g. CBspCutX or CBspCutY or some other cutting line). CBspCut also has two CBspNodes.
The interface towards the divided world will be through the tree class, and it can be a really good idea to create one more layer on top of that in case you would like to replace the BSP solution with e.g. a quad tree once you're getting the hang of it. But in my experience, a BSP will do just fine.
There are different strategies for how to store your items in the tree. What I mean by that is that you can choose to have e.g. some kind of container in each node that holds references to the objects occupying that area. This does mean (as you are asking yourself) that large items will occupy many leaves, i.e. there will be many references to large objects, while very small items will show up in single leaves.
In my experience this doesn't have that large an impact. Of course it matters, but you'd have to do some testing to check whether it's really an issue. You can get around it by simply leaving such items at branch nodes in the tree, i.e. not storing them at "leaf level". This means you will find those objects quickly while traversing down the tree.
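For what it's worth, here is a rough Python sketch of that layout (the names in the answer are C++-style; here the cut is just an axis plus a coordinate rather than a polymorphic CBspCut, and items that straddle a cut are kept at the branch node as suggested above):

class BspNode:
    def __init__(self, bounds, depth=0, max_depth=6, split_threshold=8):
        self.bounds = bounds                 # (xmin, ymin, xmax, ymax)
        self.items = []                      # (box, obj) pairs stored here
        self.cut = None                      # (axis, coord) once split
        self.front = self.back = None
        self.depth, self.max_depth, self.split_threshold = depth, max_depth, split_threshold

    def insert(self, box, obj):
        if self.cut is None and self.depth < self.max_depth and len(self.items) >= self.split_threshold:
            self._split()
        if self.cut is not None:
            axis, coord = self.cut
            if box[axis + 2] <= coord:       # entirely on the low side of the cut
                return self.back.insert(box, obj)
            if box[axis] >= coord:           # entirely on the high side of the cut
                return self.front.insert(box, obj)
        self.items.append((box, obj))        # straddles the cut, keep it at this branch

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        axis = 0 if (x1 - x0) >= (y1 - y0) else 1        # cut the longer side
        coord = (self.bounds[axis] + self.bounds[axis + 2]) / 2
        lo = (x0, y0, coord, y1) if axis == 0 else (x0, y0, x1, coord)
        hi = (coord, y0, x1, y1) if axis == 0 else (x0, coord, x1, y1)
        self.back = BspNode(lo, self.depth + 1, self.max_depth, self.split_threshold)
        self.front = BspNode(hi, self.depth + 1, self.max_depth, self.split_threshold)
        self.cut = (axis, coord)
        old, self.items = self.items, []
        for b, o in old:                     # redistribute existing items across the new cut
            self.insert(b, o)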
As for your first question: if you are only going to use this subdivision for collision testing and nothing else, I suggest that things which can never collide are never inserted into the tree. A missile, for example, as you say, can't collide with another missile, which would mean that you don't even have to store missiles in the tree.
However, you might want to use the BSP for other things as well; you didn't specify, but keep that in mind (e.g. for picking objects with the mouse). Otherwise I propose that you store everything in the BSP and resolve the collisions later on: just ask the BSP for a list of objects in a certain area to get a limited set of possible collision candidates, and perform the checks after that (assuming objects know what they can collide with, or via some other external mechanism).
If you want to speed things up, you also need to take care of merging and splitting. When things are removed from the tree, a lot of nodes will become empty, or the number of items below some node will fall below some merge threshold; then you want to merge two subtrees into one node containing all the items. Splitting happens when you insert items into the world: when the number of items exceeds some splitting threshold, you introduce a new cut, which splits the world in two. These merge and split thresholds should be two constants that you can use to tune the efficiency of the tree.
Merging and splitting are mainly used to keep the tree balanced and to make sure it works as efficiently as it can according to its specification. This is really what you need to worry about. Moving things from one location to another, and thus updating the tree, is IMO fast. But merging and splitting can become expensive if you do them too often.
This can be avoided by introducing some kind of lazy merge-and-split system, i.e. some kind of dirty flag or modification count. Batch up all operations that can be batched, e.g. moving 10 objects and inserting 5 might be one batch. Once that batch of operations is finished, you check whether the tree is dirty and then do the needed merge and/or split operations.
Post some comments if you want me to explain further.
Cheers !
Edit
There are many things that can be optimized in the tree. But as you know, premature optimization is the root of all evil. So start off simple. For example, you might create a generic callback system that you can use while traversing the tree. That way you don't have to query the tree to get a list of objects that matched the bounding-box "question"; instead you just traverse down the tree and execute the callback each time you hit something: "if this bounding box I'm providing intersects you, then execute this callback with these parameters".
You most definitely want to check out this list of collision detection resources from gamedev.net. It's full of resources on game development conventions.
For things other than collision detection, check their entire list of articles and resources.
My concern is how well a tree subdivision of the world will be able to handle large size differences in entities. To divide the world up finely enough for the smaller entities, the larger ones will need to occupy a large number of regions, and I'm concerned about how that will affect the performance of the system.
Use a quad tree. For objects that exist in multiple areas you have a few options:
Store the object in both branches, all the way down. Everything ends up in leaf nodes but you may end up with a significant number of extra pointers. May be appropriate for static things.
Split the object on the zone border and insert each part in their respective locations. Creates a lot of pain and isn't well defined for a lot of objects.
Store the object at the lowest point in the tree that fully contains it. Sets of objects now exist in both leaf and non-leaf nodes, but each object has only one pointer to it in the tree. Probably best for objects that are going to move.
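A minimal Python sketch of that third option (the parameters and the eagerly built children are just for brevity): each object is stored at the deepest node whose bounds fully contain its bounding box, so it has exactly one entry in the tree.

class QuadNode:
    def __init__(self, bounds, depth=0, max_depth=4):
        self.bounds = bounds                         # (xmin, ymin, xmax, ymax)
        self.items = []                              # objects stored at this level
        self.children = []
        if depth < max_depth:                        # children built eagerly for brevity
            x0, y0, x1, y1 = bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            quads = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
            self.children = [QuadNode(q, depth + 1, max_depth) for q in quads]

    def insert(self, box, obj):
        for child in self.children:
            cx0, cy0, cx1, cy1 = child.bounds
            if cx0 <= box[0] and box[2] <= cx1 and cy0 <= box[1] and box[3] <= cy1:
                return child.insert(box, obj)        # fits entirely inside one child
        self.items.append((box, obj))                # otherwise it lives at this node

    def query(self, box, out):
        for b, o in self.items:
            if b[0] < box[2] and box[0] < b[2] and b[1] < box[3] and box[1] < b[3]:
                out.append(o)                        # overlapping item found
        for child in self.children:
            c = child.bounds
            if c[0] < box[2] and box[0] < c[2] and c[1] < box[3] and box[1] < c[3]:
                child.query(box, out)                # only descend into overlapping quads
        return out

world = QuadNode((0, 0, 1000, 1000))
world.insert((10, 10, 20, 20), "small thing")        # ends up deep in the tree
world.insert((0, 0, 600, 600), "huge thing")         # too big for any child, stays at the root
print(world.query((15, 15, 16, 16), []))             # ['huge thing', 'small thing']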
By the way, the reason to use a quad tree is that it's really, really easy to work with. You don't have any heuristic-based creation like you might with some BSP implementations. It's simple and it gets the job done.
My other major concern is how to properly keep the list of occupied areas up to date. Since there are a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
There will be overhead in keeping your entities in the correct spots in the tree every time they move, yes, and it can be significant. But the whole point is that you're doing much, much less work in your collision code. Even though you're adding some overhead with the tree traversal and updates, it should be much smaller than the overhead you just removed by using the tree at all.
Obviously, depending on the number of objects, the size of the game world, and so on, the trade-off might not be worth it. Usually it turns out to be a win, but it's hard to know without doing it.
There are lots of approaches. I'd recommend setting some specific goals (e.g., x collision tests per second with a ratio of y between the smallest and largest entities) and doing some prototyping to find the simplest approach that achieves those goals. You might be surprised how little work you have to do to get what you need. (Or it might be a ton of work, depending on your particulars.)
Many acceleration structures (e.g., a good BSP) can take a while to set up and thus are generally inappropriate for rapid animation.
There's a lot of literature out there on this topic, so spend some time searching and researching to come up with a list of candidate approaches. Mock them up and profile.
I'd be tempted just to overlay a coarse grid over the play area to form a 2D hash. If the grid cells are at least the size of the largest entity, then you only ever have 9 grid squares to check for collisions, and it's a lot simpler than managing quad-trees or arbitrary BSP trees. The overhead of determining which coarse grid square you're in is typically just 2 arithmetic operations, and when a change is detected the grid just has to remove one reference/ID/pointer from one square's list and add the same to another square's.
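A minimal sketch of that grid (the class and method names are made up for illustration), assuming cells at least as large as the biggest entity so a query only ever touches the 3x3 block of cells around a position:

from collections import defaultdict

class SpatialHash:
    def __init__(self, cell_size):
        self.cell = cell_size
        self.grid = defaultdict(set)                 # (cx, cy) -> set of entity ids

    def _key(self, x, y):
        return int(x // self.cell), int(y // self.cell)

    def insert(self, eid, x, y):
        self.grid[self._key(x, y)].add(eid)

    def move(self, eid, old_xy, new_xy):
        old_key, new_key = self._key(*old_xy), self._key(*new_xy)
        if old_key != new_key:                       # only touch the grid when the cell changes
            self.grid[old_key].discard(eid)
            self.grid[new_key].add(eid)

    def nearby(self, x, y):
        cx, cy = self._key(x, y)
        out = set()
        for dx in (-1, 0, 1):                        # the 9 surrounding cells
            for dy in (-1, 0, 1):
                out |= self.grid.get((cx + dx, cy + dy), set())
        return out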
Further gains can be had by keeping the projectiles out of the grid/tree lookup system entirely: since you can quickly determine where a projectile would be in the grid, you know which grid squares to query for potential collidees. If you check collisions against the environment for each projectile in turn, there's no need for the other entities to then check for collisions against the projectiles in reverse.
