How to represent molecules and compare equality - algorithm

I've seen this question about the representation of molecules in memory, and it makes sense to me (tl;dr represent it as a graph with atoms as nodes and bonds as edges). But now my question is this: how do we check and see if two molecules are equal? This could be generalized as how can we check equality of (acyclic) graphs? For now we'll ignore stereoisomers and cyclical structures, such as the carbon ring in the example given in the first link.
Here's a more detailed description of my problem: For my Molecule class (as of now), I intend to have an array of Atoms and an array of Bonds. Each Bond will point to the two Atoms at either end, and will have a weight (i.e., the number of chemical bonds in that edge). In other words, this will most closely resemble an edge list graph. My first guess is to iterate over the Atoms in one molecule and try to find corresponding Atoms in the other molecule based on the Bonds that contain that Atom, but this is a rather naive approach, and the complexity seems pretty large (best guess is close to O(n!). Yikes.).
Regardless of complexity, this approach seems like it would work in most cases, however it seems to break down for some molecules. Take these for example (notice the different location of the OH group):
    H   H   H   OH  H
    |   |   |   |   |
H - C - C - C - C - C - H   (2-Pentanol)
    |   |   |   |   |
    H   H   H   H   H

    H   H   OH  H   H
    |   |   |   |   |
H - C - C - C - C - C - H   (3-Pentanol)
    |   |   |   |   |
    H   H   H   H   H
If we examine these molecules, for each atom in one molecule there is a unique same-element atom in the other molecule that has the same number and types of bonds, but these two molecules are clearly not the same, nor are they stereoisomers (which I'm not considering now). Instead they are structural isomers. Is there a way that we can check this relative structure as well? Would this be easier with an adjacency list instead of an edge list? Are there any graph equality algorithms out there that I should look into (ideally in Java)? I've looked a bit into graph canonization, but this seems like it could be NP-hard.
Edit: Looking at the Graph Isomorphism Problem Wikipedia Article, it seems as if graphs with bounded degree have polynomial time solutions to this problem. Furthermore, planar graphs also have polynomial solutions (i.e., the edges only intersect at their endpoints). It seems to me that molecules satisfy both of these conditions, so what is this polynomial-time solution to this problem, or where can I find it? My Google searches are letting me down this time.

If the graphs are acyclic, then it is a tree isomorphism problem, which has a pretty straightforward solution.
For now let's assume all internal nodes are carbon and all edges are the same (I'll show later how to relax this restriction).
Represent leaf nodes as numbers - say their atomic number. Represent trees of height 1 as sorted lists of their leaf nodes, so:
    H                  Cl
    |                  |
H - C - H   and   Cl - C - Cl
    |                  |
    H                  H
are [1,1,1,1] and [1,17,17,17] respectively. Obviously two molecules are isomorphic iff the sorted lists are the same.
This generalizes to trees of larger heights: represent a tree of height n as a list of the representations of its subtrees, sorted lexicographically, so
    Cl  H                    H   H
    |   |                    |   |
H - C - C - Cl   and   Cl - C - C - Cl
    |   |                    |   |
    Cl  H                    H   Cl
are both [[1,1,17],[1,17,17]]. Two trees are isomorphic iff their representations are equal.
Note: usually the tree isomorphism algorithms work on rooted trees. Here we just go recursively from leaves towards the center of the graph which sometimes leaves us with two "roots".
    H   H   Cl
    |   |   |
H - C - C - C - H
    |   |   |
    H   H   H
Here, the left C is [1,1,1], the right C is [1,1,17]. The middle C (which is the root here) has these two lists plus two leaves. Sorted lexicographically it's [1,1,[1,1,1],[1,1,17]].
Now for representing internal nodes that aren't C - you can just simulate them by attaching a fake leaf with a special number, so
    H
    |
H - C - O - H
    |
    H
can be encoded as
    H
    |
H - C - C - H
    |   |
    H   Fake
Where the "Fake" can be, say, 511 so that we know it doesn't clash with any existing atom. The whole molecule will thus be [[1,1,1],[1,511]].
So the algorithm is:
Convert both molecules to the recursively lexicographically sorted list form.
Check if the representations are equal.
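The two steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name canonical_form and the input format (a list of element symbols plus a list of index pairs) are made up for the example, the element symbol is written directly into each node's encoding instead of attaching a "fake" leaf, bond weights are ignored, and the root is chosen by stripping leaves until one or two center atoms remain.

```python
from collections import defaultdict

def canonical_form(atoms, bonds):
    """atoms: list of element symbols; bonds: (i, j) index pairs forming a tree."""
    adj = defaultdict(list)
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)

    # Strip leaves layer by layer until only the one or two centers remain.
    degree = {v: len(adj[v]) for v in range(len(atoms))}
    layer = [v for v in degree if degree[v] <= 1]
    remaining = len(atoms)
    while remaining > 2:
        remaining -= len(layer)
        next_layer = []
        for leaf in layer:
            for nb in adj[leaf]:
                degree[nb] -= 1
                if degree[nb] == 1:
                    next_layer.append(nb)
        layer = next_layer
    centers = layer

    def encode(node, parent):
        # A node is its symbol followed by its sorted child encodings.
        children = sorted(encode(c, node) for c in adj[node] if c != parent)
        return "(" + atoms[node] + "".join(children) + ")"

    # With two centers, keep the smaller encoding so the result is unique.
    return min(encode(c, None) for c in centers)

# Heavy-atom skeletons only (hydrogens omitted to keep the example short).
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
two_pentanol = canonical_form(["C"] * 5 + ["O"], chain + [(1, 5)])
three_pentanol = canonical_form(["C"] * 5 + ["O"], chain + [(2, 5)])
relabeled = canonical_form(["O"] + ["C"] * 5,
                           [(1, 2), (2, 3), (3, 4), (4, 5), (4, 0)])
```

Here two_pentanol and relabeled describe the same molecule with different atom numbering, so their encodings should be equal, while three_pentanol should differ.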

@Rafal discussed the case of trees. But what if your graphs are not trees? Here are my two cents:
Mathematica approach
Mathematica has a built-in predicate to check whether two graphs are isomorphic. You can try it for 30 days if you do not have it.
Check nauty
nauty is a graph isomorphism solver that you can download and use to test whether two graphs are isomorphic.
Detect true negatives in advance
You can detect true negatives in advance by simply computing and comparing some numbers/sequences. This includes the degree sequence and the sizes of the vertex and edge sets. If a pair of graphs passes these checks, that does not necessarily mean they are isomorphic, but it will reduce your search space (maybe drastically!).
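As a rough illustration of such a pre-check, here is a small Python sketch. The function name may_be_isomorphic and the input format (element symbols plus index pairs) are assumptions made for the example. It compares the multiset of (element, degree) pairs and the number of bonds; a False result rules isomorphism out, while True only means a full test is still needed.

```python
from collections import Counter

def may_be_isomorphic(atoms_a, bonds_a, atoms_b, bonds_b):
    def signature(atoms, bonds):
        degree = Counter()
        for i, j in bonds:
            degree[i] += 1
            degree[j] += 1
        # Multiset of (element, degree) pairs, plus the edge count.
        pairs = Counter((atoms[v], degree[v]) for v in range(len(atoms)))
        return pairs, len(bonds)

    return signature(atoms_a, bonds_a) == signature(atoms_b, bonds_b)

# Carbon skeletons of butane and isobutane: different degree sequences.
butane = (["C"] * 4, [(0, 1), (1, 2), (2, 3)])
isobutane = (["C"] * 4, [(0, 1), (0, 2), (0, 3)])
rejected = not may_be_isomorphic(*butane, *isobutane)

# Heavy-atom skeletons of 2-pentanol and 3-pentanol: the check passes
# even though the molecules differ, so it is necessary but not sufficient.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
passes = may_be_isomorphic(["C"] * 5 + ["O"], chain + [(1, 5)],
                           ["C"] * 5 + ["O"], chain + [(2, 5)])
```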
Most importantly, there is a recent advance on the problem showing that isomorphism testing is polynomial for graphs of bounded treewidth. Even if your graphs seem general, they may exhibit this property (or you can simply assume it in general).

Related

Dynamic Programming / Subproblems + Transition

I am kind of stuck. I decided to try this problem: https://icpcarchive.ecs.baylor.edu/external/71/7113.pdf
In case the link 404s, here's the basic assignment:
• a hopper only visits arrays with integer entries,
• a hopper always explores a sequence of array elements using the following rules:
  – a hopper cannot jump too far, that is, the next element is always at most D indices away (how far a hopper can jump depends on the length of its legs),
  – a hopper doesn't like big changes in values: the next element differs from the current element by at most M, more precisely the absolute value of the difference is at most M (how big a change in values a hopper can handle depends on the length of its arms), and
  – a hopper never visits the same element twice.
• a hopper will explore the array with the longest exploration sequence.
n is the length of the array (as described above, D is the maximum length of a jump the hopper can make, and M is the maximum difference in values a hopper can handle). The next line contains n integers, the entries of the array. We have 1 ≤ D ≤ 7, 1 ≤ M ≤ 10,000, 1 ≤ n ≤ 1,000 and the integers in the array are between -1,000,000 and 1,000,000.
EDIT: I am doing this out of pure curiosity; this is not an assignment I need to do for any particular reason other than challenging myself.
Basically it's building a sparse graph out of an array.
The graph is undirected and, due to the symmetry of the -d ... d jumps, it's also either a complete graph (all edges are included) or a set of mutually disjoint graph components.
As a first step I tried a simple exhaustive DFS search of the graph, which works but has the infamous O(n!) runtime. The first iteration of this was written in F#, which was horribly slow; the second was in C, which still plateaus pretty fast too.
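For reference, the graph construction and exhaustive search described above might look roughly like this in Python (not the F# or C versions mentioned; the name longest_exploration is made up for the sketch). It is exponential in the worst case, so it is only usable on small arrays.

```python
def longest_exploration(a, d, m):
    n = len(a)
    # neighbours[i]: indices reachable from i in one jump
    # (at most d positions away, values differing by at most m).
    neighbours = [
        [j for j in range(max(0, i - d), min(n, i + d + 1))
         if j != i and abs(a[i] - a[j]) <= m]
        for i in range(n)
    ]
    visited = [False] * n

    def dfs(i):
        # Length of the longest simple path that starts at i
        # and avoids everything already on the current path.
        visited[i] = True
        best = 1
        for j in neighbours[i]:
            if not visited[j]:
                best = max(best, 1 + dfs(j))
        visited[i] = False
        return best

    return max((dfs(i) for i in range(n)), default=0)

wide = longest_exploration([1, 7, 8, 2, 6, 4, 3, 5], 3, 1)
narrow = longest_exploration([1, 7, 8, 2, 6, 4, 3, 5], 2, 1)
```

With D = 3 the whole array can be visited in the order 8, 7, 6, 5, 4, 3, 2, 1; with D = 2 only three elements (3, 4, 5) can be chained.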
So I know the longest path problem is NP-hard, but I thought I would give it a try with dynamic programming.
The next approach was to simply use the common DP solution (bitmasked path) to DFS on the graph, but at this point I had already traversed the array and built the entire graph, which may contain up to 1000 nodes, so it's not feasible.
My next approach was to build a DFS tree (a tree of all the paths), which is a bit faster but then needs to store every entire path in memory for each iteration, which isn't what I really want. I am thinking I can reduce it to substates while already traversing the array.
Next I tried to memoize all paths I've already walked by simply using a bitmask and a simple memoization function, as seen here:
let xf = memoizedEdges (fun r i' p mask ->
    // mark the current index as visited
    let mask' = addBit i' mask
    // candidate offsets -d..-1 and 1..d (@ is list concatenation)
    let nbs =
        [-d .. -1] @ [1 .. d]
        |> Seq.map (fun f ->
            match f with
            | x when (i' + x) < 0 -> None                      // off the left end
            | x when (i' + x) >= a.Length -> None              // off the right end
            | x when (diff a.[i'+x] a.[i']) > m -> None        // value change too big
            | x when (i' + x) = i -> None
            | x when (isSet (i'+x) mask') -> None              // already visited
            | x -> Some (i' + x))
    // recurse into every admissible neighbour
    let ec =
        nbs
        |> Seq.choose id
        |> Seq.toList
        |> List.map (fun f -> r f i' mask')
    max (bitcount mask) (ec |> mxOrZero)
)
So memoizedEdges works with 3 int parameters: the current index (i'), the previous index (p) and the path as a bitmask. On each recursive call the memoizedEdges function itself checks whether it has seen i' and p and the mask, or p and i' and the mask with the i' and p bits flipped to mask the path the other way (basically, whether we have already seen this path coming from the other side).
This works as I would expect, but the assignment states it's up to 1000 indices, which would make the int32 mask too short.
So I've been thinking for days now, and there must be a way to encode each of the -d ... d steps into a start and end vertex and calculate the path for each step in that window based on the previous steps.
I've come up with basically this:
0.) Create a container to hold the start and end vertex as key with the current path length as value
1.) Check neighbors of i
2.) If I have seen this combination either as (from -> to) or (to -> from), then I do not add or increase
3.) Check whether any other predecessors to this node exist and increase the path of those by 1
But this would lead to having all paths stored, and I would basically end up with tuples, and then I am back at my graph with DFS in another form.
I am very thankful for any pointers (I just need some new ideas, I am really stuck right now) on how I could encode each subproblem from -d..d so that I can use just intermediate results for calculating the next step (if this is even possible).
Partial answer
This is a difficult problem. Indeed, on competitive programming problem compendium Kattis it is (at the time of writing) in the top 5 of most difficult problems.
Only you know if this sort of problem is possible for you to solve, but there is a fair chance no one on this site can help you completely, hence this partial answer.
Longest path
What we're asked to do here is solve the longest path problem for a particular graph. This problem is known to be NP-complete in general, even for undirected unweighted graphs as ours is. Because the graph can have 1000 vertices, a (sub-)exponential algorithm in N will not work, and we're likely not asked to prove that P=NP, so the only option we have left is to somehow exploit the structure of the graph.
The most promising avenue is through D. D is at most 7, because of which the maximum degree of the graph is at most 14, and all edges are—in a sense—local.
Now, according to Wikipedia, the longest path problem can be solved in polynomial time on various classes of graphs, such as acyclic ones. Our graph is of course not acyclic, but unfortunately this is largely where my knowledge ends. I am not sufficiently familiar with graph theory to see whether the implied graph of the problem is in any of the classes Wikipedia mentions.
Of particular note is that the longest path problem can be solved in polynomial time given bounded-by-a-constant clique-width (or tree-width, which implies the former). I am unable to confirm or prove that our graph has bounded clique-width because of the bound on D, but perhaps you yourself know more about this, or you could try asking on the math or CS stackexchange, as at this point we're pretty far from any actual programming.
Regardless, if you're able to confirm that the graph is clique-width-bounded, this paper may help you further.
I hope this answer is of some use despite not being entirely fulfilling, and good luck!
Citation for the paper in case of link decay
Fomin, F. V., Golovach, P. A., Lokshtanov, D., & Saurabh, S. (2009, January). Clique-width: on the price of generality. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms (pp. 825-834). Society for Industrial and Applied Mathematics.

Algorithm to find difference between two uniquely labeled general trees

I'm looking for an algorithm that finds the minimal changes (insert, delete, move) to get from tree1 to tree2.
The tree is a general and uniquely labeled tree.
This means that each node's value is some kind of unique identifier, like a UUID.
Each node can have children and the order of those child nodes is important.
Example tree 1
A
- B
- C
  - D
  - E
- F
Example tree 2
A
- B
- F
  - C
    - D
- G
expected changes from tree 1 to tree 2
move(atIndex: 1, inParent: A, toIndex: 0, inParent: F)
delete(atIndex: 1, inParent: C)
insert(atIndex: 2, inParent: A)
I need this for a macOS app to animate changes in an NSOutlineView.
I found some algorithms to find changes between general trees, but only for ordered and labeled trees and none for uniquely labeled trees.
The algorithms for non-uniquely labeled trees seem to be really complex, and I thought that if the labels are unique there must be a simpler and more efficient algorithm.
BTW: NSOutlineView does not have a root node but an array of nodes at the beginning. I don't know if that's important.
You're right -- label uniqueness does simplify the problem.
For each label present in both trees, compute a sequence of operations to edit its children. This involves computing a longest common subsequence and issuing appropriate moves/inserts/deletes for all nodes not in this subsequence. (In case a label is present in both trees but with different parents, we should suppress the delete for the old parent.)
This set of operations constitutes a lower bound, but if we just do them, we may hit a situation where we try to move a node under one of its descendants. Perhaps it's enough to detect when this is about to happen and split it into two moves. If you really need the optimal sequence of moves, then I think there's a feedback arc/node set problem to be solved, though it might not be NP-hard here due to the tree structure.
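A rough Python sketch of the per-node step described above, under several assumptions: trees are given as dicts mapping each label to its ordered list of child labels, the names lcs and diff_trees are made up for the example, and the problem of moving a node under one of its own descendants is not handled at all. Descendants of a deleted node that are also gone get their own delete operations.

```python
def lcs(xs, ys):
    # Classic O(len(xs) * len(ys)) longest-common-subsequence table.
    table = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(len(xs) - 1, -1, -1):
        for j in range(len(ys) - 1, -1, -1):
            if xs[i] == ys[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def diff_trees(old, new):
    old_parent = {c: p for p, cs in old.items() for c in cs}
    new_labels = {c for cs in new.values() for c in cs}
    ops = []
    # Labels that disappeared are deleted from their old parent.
    for child, parent in old_parent.items():
        if child not in new_labels:
            ops.append(("delete", child, parent))
    # Children outside the LCS of a parent's child list are either
    # moved in (label existed before) or inserted (label is new).
    for parent, children in new.items():
        kept = set(lcs(old.get(parent, []), children))
        for index, child in enumerate(children):
            if child not in kept:
                kind = "move" if child in old_parent else "insert"
                ops.append((kind, child, parent, index))
    return ops

tree1 = {"A": ["B", "C", "F"], "C": ["D", "E"]}
tree2 = {"A": ["B", "F", "G"], "F": ["C"], "C": ["D"]}
ops = diff_trees(tree1, tree2)
```

For the example trees from the question this yields one delete (E), one insert (G under A at index 2) and one move (C to index 0 under F).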
Given that labels are unique, I'm not convinced that "minimal" changes are any different than outlining the diff for each node's full path.
If we look at the path from right to left, we can see that the first parent and index of D below didn't change.
It seems like the minimal changes provided in the question description make an assumption that the move of F from index 2 to 1 happens magically when C at 1 moves. But rather the two actions of "move C" and "move F to C's former place," which we would get from just examining the paths, comprise the actual behaviour that would need to occur due to the assumption in your current "move C."
Example tree 1
A        ⊥
- B      A 0
- C      A 1
  - D    A C 0
  - E    A C 1
- F      A 2
Example tree 2
A        ⊥
- B      A 0
- F      *A 1
  - C    *A 1 F 0
    - D  *A 1 F 0 C 0
- G      A 2

Can an m-way B-Tree have m odd?

I read in the CLRS book that we have an m-way B-tree where m is even. But is there a B-tree where m is odd? If there is, how can we change the code given in this book?
By an m-way B-tree I assume you mean a B-tree where each internal node is allowed to have at most m children. According to CLRS's definition of a B-tree:
Nodes have lower and upper bounds on the number of keys they can contain. We express these bounds in terms of a fixed integer t ≥ 2 called the minimum degree of the B-tree: ... an internal node may have at most 2t children.
So the maximum number of children will always be even – by this definition it can not be odd.
However, this is not the only definition of a B-tree. There are many definitions with slight differences that ultimately make little difference to the overall performance. This can cause confusion. Some B-tree definitions allow odd upper bounds and some don't. CLRS's definition does not allow odd upper bounds for the children count of internal nodes.
However, another formal definition of a B-tree is by Knuth [1998] (The Art of Computer Programming, Volume 3 (Second ed.), Addison-Wesley, ISBN 0-201-89685-0). Knuth's definition does allow odd upper bounds. While CLRS enumerates all min-max tree bounds of the form (t, 2t) for t ≥ 2, Knuth enumerates all tree bounds of the form (ceil(k/2), k) for k ≥ 2.
Knuth Order, k | (min,max) | CLRS Degree, t
---------------|-----------|---------------
0              | –         | –
1              | –         | –
2              | –         | –
3              | (2,3)     | –
4              | (2,4)     | t = 2
5              | (3,5)     | –
6              | (3,6)     | t = 3
7              | (4,7)     | –
8              | (4,8)     | t = 4
9              | (5,9)     | –
10             | (5,10)    | t = 5
So for example, a 2-3 tree, (2,3), is a B-tree with Knuth order 3. But it is not a valid CLRS tree because it has an odd upper bound.
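The rows of the table can be reproduced with a few lines of Python. This is just a sketch of the two formulas quoted above; the function names knuth_bounds and clrs_degree are made up for the example.

```python
from math import ceil

def knuth_bounds(k):
    """(min, max) children of an internal node for Knuth order k."""
    return ceil(k / 2), k

def clrs_degree(k):
    """CLRS minimum degree t with 2t == k, or None if no such t >= 2 exists."""
    return k // 2 if k % 2 == 0 and k >= 4 else None

rows = [(k, knuth_bounds(k), clrs_degree(k)) for k in range(3, 11)]
```

Odd values of k get a Knuth bound such as (2, 3) but no CLRS degree, which matches the point about 2-3 trees.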
Changing the code will not be easy, as B-trees have a lot of code depending on the variable t. One of the biggest changes would be inside B-TREE-SPLIT-CHILD(x,i): you'd need to find a way to split a child with an odd number of children (an even number of keys) into nodes y and z. One of these two resulting nodes will have one more key than the other. If you're looking for code, I'd recommend searching the Internet for an implementation of a B-tree that uses a definition similar to Knuth's (e.g. search for "Knuth Order B-tree").

What algorithm can I use to verify that a list of nodes can be connected, given some constraints?

I'm building a game where the player is given a random set of nodes and attempts to build the longest list they can by placing the nodes in a certain order. Each node has zero or more connections on the sides that have to match with at least one connection on the side of the next node in the list. For example, a node might look like this:
                 +--+
left connections A  B right connections
                 B  C
                 +--+
The above node (example node) could be connected with any of these nodes:
+--+
C  |   This node can connect to the right side of the example node (matches C)
D  |
+--+

+--+
B  K   This node can connect to the left side of the example node (matches A)
L  A   This node can connect to the right side of the example node (matches B)
+--+
So, given those three nodes, the player could match them up in a list like so:
+--+     +--+     +--+
B  K     A  B     C  |
L  A -A- B  C -C- D  |
+--+     +--+     +--+
I need to validate the player's choices. The player doesn't have to select the nodes in the correct order at first, but their resulting final selections must be able to connect into a contiguous, linear list.
So, given an array of unordered nodes (the players selection), I need to form the nodes into a valid list like above, or show an error to the player.
I can brute force the validation, but I was hoping to find a more elegant solution.
After some hashing and precomputation the problem may look like this:
Given a graph determine whether it has a path traversing all nodes
which is exactly the Hamiltonian path problem. You may read the research on this topic or analyze the particular structure of your graph (for some special graphs it has simple solutions), but in the general case the best solution I know is exponential.
The straightforward brute force solution is to go through all permutations (n!) and check whether each forms a proper path (*n). This approach results in O(n*n!) time. Effectively it means that n should be about 12 at maximum for a sub-second check, and for n=15 it will take several hours to check in the worst case.
A slight optimisation, forming the path gradually and checking on every new vertex, results in O(n!) time in the worst case, so it will be possible to check n=13 in a couple of seconds in the worst case, and even faster on average, as a lot of false paths will be cut at early stages.
You may go further and take advantage of dynamic programming. Let us define isProperPath[S][i] (where S is a bitmask of a subset of nodes and i is some node of this subset) as true when there exists a path made of exactly the nodes of S that ends at node i. Then it is easy to compute isProperPath[S][i] from the values for subsets with fewer elements than S:
isProperPath[S][i] = (S == {i});   // a single node is trivially a path
for (j in S \ {i}) {
    if (isProperPath[S \ {i}][j] && hasConnection(j, i))
        isProperPath[S][i] = true;
}
By traversing all pairs of S and i in order of ascending size of S we will compute all values. The answer is true if and only if isProperPath[A][i] = true for some node i, where A is the whole given set.
Time complexity is O(2^n*n^2) and space complexity is O(2^n*n), as there are 2^n*n values and it takes O(n) to compute each value from previous values. This algorithm makes it possible to check sets of size up to 24 in sub-second time, using about 400M bits, or 50 MB.
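A minimal Python sketch of this dynamic programme follows. It assumes each node is a pair (left, right) of sets of connection labels, that nodes cannot be flipped, and that node j may sit directly to the left of node i when j's right side shares a label with i's left side. The name can_form_chain is made up, and each row isProperPath[S] is packed into a single integer bitmask.

```python
def can_form_chain(nodes):
    """nodes: list of (left, right) pairs of sets of connection labels."""
    n = len(nodes)
    if n == 0:
        return True
    # connects[j][i]: node j can be placed directly to the left of node i.
    connects = [[bool(nodes[j][1] & nodes[i][0]) for i in range(n)]
                for j in range(n)]
    # ends[S]: bitmask of nodes i such that subset S forms a chain ending in i.
    ends = [0] * (1 << n)
    for i in range(n):
        ends[1 << i] = 1 << i
    # Numeric order visits every proper subset of S before S itself.
    for subset in range(1, 1 << n):
        for i in range(n):
            if not subset & (1 << i):
                continue
            rest = subset ^ (1 << i)
            if rest == 0:
                continue
            for j in range(n):
                if (ends[rest] >> j) & 1 and connects[j][i]:
                    ends[subset] |= 1 << i
                    break
    return ends[(1 << n) - 1] != 0

# The three nodes from the question, given in a shuffled order.
example = ({"A", "B"}, {"B", "C"})
right_end = ({"C", "D"}, set())
left_end = ({"B", "L"}, {"K", "A"})
valid = can_form_chain([example, right_end, left_end])
invalid = can_form_chain([right_end, ({"D"}, set())])
```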

Algorithm to detect and remove least number of inconsistent facts (probably in PROLOG)?

How do you program the following algorithm?
Imagine a list of "facts" like this, where the letters represent variables bound to numeric values:
x = 1
y = 2
z = 3
a = 1
b = 2
c = 3
a = x
b = z
c = z
These "facts" clearly must not all be "true": it's not possible for b=z while b=2 and z=3. If either b=z or b=2 were removed, all facts would be consistent. If z=3 and either b=z or c=z were removed, then all facts would be consistent but there would be one fewer fact than above. This set contains many such consistent subsets. For instance, a=1, b=2, c=3 is a consistent subset, and so are many others.
Two consistent subsets are larger than any other consistent subset in this example:
x = 1
y = 2
z = 3
a = 1
b = 2
c = 3
a = x
c = z
and
x = 1
y = 2
z = 3
a = 1
c = 3
a = x
b = z
c = z
Using an appropriate programming language (I'm thinking PROLOG, but maybe I'm wrong) how would you process a large set containing consistent and inconsistent facts, and then output the largest possible sub-set of consistent facts (or multiple subsets as in the example above)?
This is very closely related to the NP-hard multiway cut problem. In the (unweighted) multiway cut problem, we have an undirected graph and a set of terminal vertices. The goal is to remove as few edges as possible so that each terminal vertex lies in its own connected component.
For this problem, we can interpret each variable and each constant as a vertex, and each equality as an edge from its left-hand side to its right-hand side. The terminal vertices are those associated with constants.
For only two terminals, the multiway cut problem is the polynomial-time solvable s-t minimum cut problem. We can use minimum cuts to get a polynomial-time 2-approximation to the multiway cut problem, by finding the cheapest cut separating two terminals, deleting the edges involved, and then recursing on the remaining connected components. Several approximation algorithms with better ratios have been proposed in the theoretical literature on multiway cut.
Practically speaking, multiway cut arises in applications to computer vision, so I would expect that there has been some work on obtaining exact solutions. I don't know what's out there, though.
Prolog could serve as a convenient implementation language, but some thought about algorithms suggests that a specialized approach may be advantageous.
Among statements of these kinds (equalities between two variables or between one variable and one constant) the only inconsistency that can arise is a path connecting two distinct constants.
Thus, if we find all "inconsistency" paths that connect pairs of distinct constants, it is necessary and sufficient to find a set of edges (original equalities) that disconnect all of those paths.
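The consistency criterion above can be checked with a union-find structure, and for small inputs the largest consistent subsets can then be found by brute force. The Python sketch below assumes facts are (lhs, rhs) pairs in which variables are strings and constants are ints; the function names are made up, and the search is exponential, so it only illustrates the criterion and is not a practical algorithm for large sets.

```python
from itertools import combinations

def is_consistent(facts):
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for lhs, rhs in facts:
        parent[find(lhs)] = find(rhs)
    # Inconsistent iff some component contains two distinct constants.
    constant_of = {}
    for v in list(parent):
        if isinstance(v, int):
            if constant_of.setdefault(find(v), v) != v:
                return False
    return True

def largest_consistent_subsets(facts):
    # Try removing 0 facts, then 1, then 2, ... and stop at the first success.
    for removed in range(len(facts) + 1):
        found = []
        for drop in combinations(range(len(facts)), removed):
            subset = [f for k, f in enumerate(facts) if k not in drop]
            if is_consistent(subset):
                found.append(subset)
        if found:
            return found
    return []

facts = [("x", 1), ("y", 2), ("z", 3), ("a", 1), ("b", 2), ("c", 3),
         ("a", "x"), ("b", "z"), ("c", "z")]
best = largest_consistent_subsets(facts)
dropped = {f for subset in best for f in facts if f not in subset}
```

On the facts from the question this finds two subsets of eight facts, one without b = 2 and one without b = z.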
It is tempting to think a greedy algorithm is optimal here: always select an edge to remove which is common to the largest number of remaining "inconsistency" paths. Thus I propose:
1) Find all simple paths P connecting two different constants (without passing through any third constant), and construct a linkage structure between these paths and their edges.
2) Count the frequencies of edges E appearing along those "inconsistency" paths P.
3) Find a sufficient number of edges to remove by pursuing a greedy strategy, removing the next edge that appears most frequently and updating the counts of edges in remaining paths accordingly.
4) Given that upper bound on edges necessary to remove (to leave a consistent subset of statements), apply a backtracking strategy to determine whether any smaller number of edges would suffice.
As applied to the example in the Question, there prove to be exactly two "inconsistency" paths:
2 -- b -- z -- 3
2 -- b -- z -- c -- 3
Removing either of the two edges, 2 -- b or b -- z, common to both of these paths suffices to disconnect both "inconsistency" paths (removing all inconsistencies among remaining statements).
Moreover it is evident that no other single edge's removal would suffice to accomplish that.

Resources