Counting isomeric n-carbon aliphatic alkanes - algorithm

An n-carbon aliphatic alkane is an unrooted tree consisting of n nodes where the degree of each node is at most 4. As an example, see this for a list of the enumeration of some low values of n.
I am looking for an algorithm to compute the number of such n-carbon aliphatic alkanes, given an n.
I have seen this in chemistry stackexchange already. I have also thought of dynamic programming, i.e., building larger graphs from smaller components, but I cannot deal with overcounting the same isomers.
Clarification: The carbons are just a metaphor. I do not wish to take into account the instability of C16 and C17, nor do I care about stereoisomers.

So the standard approach is to use the Redfield–Pólya Theorem also known as the Pólya enumeration theorem. However it is not very 'algorithmic' - you have code like this (the Mathematica, Haskell, or one of the Python versions).
The rosettacode page also describes a more direct approach using canonical checking to avoid duplicates. The algorithm is a specialised form of orderly generation (I think) that only works for trees without vertex or edge colors and a maximum valence of 4.
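For the counting question itself, the overcounting problem from the dynamic-programming attempt can be sidestepped by rooting every tree at its centroid. The following is a minimal sketch of that idea, not the rosettacode code: it first counts rooted trees in which every node has at most 3 children (alkyl radicals), then counts unrooted trees as either a single centroid with up to 4 branches, each smaller than n/2, or a pair of equal halves joined by an edge. The function names `multiset_counts` and `count_alkanes` are made up for this example.

```python
from math import comb

def multiset_counts(rooted, max_items, max_size, total):
    """Count multisets of at most max_items rooted trees, each of size
    1..max_size, whose sizes add up to total. rooted[s] is the number of
    distinct rooted trees with s nodes."""
    # ways[k][w]: multisets of exactly k trees with total size w
    ways = [[0] * (total + 1) for _ in range(max_items + 1)]
    ways[0][0] = 1
    for size in range(1, max_size + 1):
        kinds = rooted[size]
        updated = [row[:] for row in ways]
        for k in range(max_items + 1):
            for w in range(total + 1):
                if ways[k][w] == 0:
                    continue
                # pick j trees of this size, with repetition allowed
                for j in range(1, max_items - k + 1):
                    if w + j * size > total:
                        break
                    updated[k + j][w + j * size] += ways[k][w] * comb(kinds + j - 1, j)
        ways = updated
    return sum(ways[k][total] for k in range(max_items + 1))

def count_alkanes(n):
    """Number of unrooted trees on n nodes with maximum degree 4."""
    rooted = [0, 1]  # rooted[s]: radicals with s carbons; index 0 unused
    for size in range(2, n + 1):
        # the root has at most 3 children, which together hold size - 1 nodes
        rooted.append(multiset_counts(rooted, 3, size - 1, size - 1))
    # one centroid: up to 4 branches, each with fewer than n / 2 nodes
    centroid = multiset_counts(rooted, 4, (n - 1) // 2, n - 1)
    # two centroids (n even): an unordered pair of radicals of size n / 2
    bicentroid = 0
    if n % 2 == 0:
        half = rooted[n // 2]
        bicentroid = half * (half + 1) // 2
    return centroid + bicentroid

print([count_alkanes(n) for n in range(1, 11)])
```

Because every tree has either exactly one centroid or exactly two adjacent ones, each isomer is counted once. The sketch recomputes the table for every size, which is wasteful but keeps the code short.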

Related

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted avg of the entropies of potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's of the samples within the node that takes that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote:"For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list. "
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
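To make the two procedures concrete, here is a small sketch that runs both on one toy node. Everything in it is an assumption for illustration: the per-category counts, the helper names, and the use of weighted binary entropy as the score. It only compares the two searches on this one node and says nothing by itself about the general case.

```python
from itertools import combinations
from math import log2

def entropy(pos, total):
    """Binary entropy of a group with `pos` ones out of `total` samples."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def weighted_entropy(left, counts):
    """Weighted average entropy after sending the categories in `left`
    to one branch and all remaining categories to the other."""
    left_pos = sum(counts[c][0] for c in left)
    left_n = sum(counts[c][1] for c in left)
    all_pos = sum(p for p, _ in counts.values())
    all_n = sum(n for _, n in counts.values())
    right_pos, right_n = all_pos - left_pos, all_n - left_n
    return (left_n * entropy(left_pos, left_n)
            + right_n * entropy(right_pos, right_n)) / all_n

def best_exhaustive(counts):
    """Try all (2^m - 2) / 2 splits; the first category is pinned to the
    left branch so that mirror-image splits are not tried twice."""
    cats = sorted(counts)
    first, rest = cats[0], cats[1:]
    scores = [weighted_entropy((first,) + extra, counts)
              for r in range(len(rest))          # never put everything left
              for extra in combinations(rest, r)]
    return min(scores)

def best_sorted(counts):
    """Sort categories by fraction of ones, then try the m - 1 cut points."""
    order = sorted(counts, key=lambda c: counts[c][0] / counts[c][1])
    return min(weighted_entropy(order[:i], counts) for i in range(1, len(order)))

# category -> (number of ones, number of samples); made-up data
counts = {"a": (1, 10), "b": (4, 8), "c": (9, 10), "d": (3, 12), "e": (6, 7)}
print(best_exhaustive(counts), best_sorted(counts))
```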
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: both procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting, that would usually not be the optimal result in terms of generalization.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
In order to improve on the greedy result, one often further applies pruning (which can be seen as another greedy and heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees, so that the deviation of a single tree becomes less important.

Exploration Algorithm

Massively edited this question to make it easier to understand.
Given an environment with arbitrary dimensions and arbitrary positioning of an arbitrary number of obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the four cardinal directions of NSEW, one cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (yellow guy) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.
http://imagizer.imageshack.us/a/img913/9274/qRsazT.jpg
What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I would seek to maximise the sum of utility of all cells SEEN (not visited) divided by the path length, although if that is too complex for any suitable algorithm then the number of cells seen will suffice. There is a maximum path length but it is generally in the hundreds or higher. (The actual test environments used on my agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would thus increase accordingly)
I consider BFS and DFS to be intractable, A* to be non-optimal given a lack of suitable heuristics, and Dijkstra's inappropriate in generating a single unbroken path. Is there any algorithm you can think of? Also, I need help with loop detection, as this is the first time I have allowed revisits and I have never done loop detection before.
One approach I have considered is to reduce the map into a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:
http://imagizer.imageshack.us/a/img910/3050/HGu40d.jpg
In the resultant tree, the agent can go from a node to any adjacent nodes that are 0-1 turn away at intersections. This is as far as my thinking has gotten right now. A solution generated using this tree may not be optimal, but it should at least be near-optimal with much fewer cells being processed by the algorithm, so if that would make the algorithm more likely to be tractable, then I guess that is an acceptable trade-off. I'm still stuck with thinking how exactly to generate a path for this however.
Your problem is very similar to a canonical Reinforcement Learning (RL) problem, the Grid World. I would formalize it as a standard Markov Decision Process (MDP) and use any RL algorithm to solve it.
The formalization would be:
States s: your NxM discrete grid.
Actions a: UP, DOWN, LEFT, RIGHT.
Reward r: the value of the cells that the agent can see from the destination cell s', i.e. r(s,a,s') = sum(value(seen(s'))).
Transition function: P(s' | s, a) = 1 if s' is not out of the boundaries or a black cell, 0 otherwise.
Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has cost one, so you could subtract 1 from the immediate reward r at each time step, but this would not add anything since you will already average by the number of steps.
Since the problem is discrete the policy could be a simple softmax (or Gibbs) distribution.
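A softmax policy over the four actions might look like the sketch below. The function name and the idea of feeding it a list of per-action preferences (for example Q-values) are assumptions for illustration.

```python
import math

def softmax_policy(preferences, temperature=1.0):
    """Turn per-action preferences into action probabilities.
    A high temperature gives a near-uniform policy, a low one a near-greedy policy."""
    top = max(preferences)  # subtract the maximum for numerical stability
    weights = [math.exp((p - top) / temperature) for p in preferences]
    total = sum(weights)
    return [w / total for w in weights]

print(softmax_policy([1.0, 2.0, 0.5, 0.0], temperature=0.5))
```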
As a solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if your grid is too big (and you said that there is no limit) I would suggest policy search algorithms, like policy gradient or relative entropy (although they guarantee convergence only to local optima). You can find something about Q-learning basically everywhere on the Internet. For a recent survey on policy search I suggest this.
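For reference, the core of tabular Q-learning is a one-line update. The sketch below assumes a dictionary keyed by (state, action) pairs and treats missing entries as 0; the function name and the parameters `alpha` (learning rate) and `gamma` (discount factor) are illustrative choices, not tuned values.

```python
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    """One Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_b Q(s', b) - Q(s, a))."""
    old = q.get((state, action), 0.0)
    best_next = max(q.get((next_state, b), 0.0) for b in ACTIONS)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

q = {}
print(q_update(q, (0, 0), "UP", 5.0, (0, 0), alpha=0.5))
```

With a discount factor of exactly 1, the values stay bounded only because the episode length is capped, as it is in the question (maximum path length).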
The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and will try to maximize the cumulative long term reward as described by your MDP. So usually you initialize your policy with a high exploration (e.g., a complete random policy) and by trial and error the algorithm will make it deterministic and converge to the optimal one (however, sometimes also a stochastic policy is optimal).
The main difference between all the RL algorithms is how they perform the update of the policy at each iteration and manage the tradeoff exploration-exploitation (how much should I explore VS how much should I exploit the information I already have).
As suggested by Demplo, you could also use Genetic Algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation...).
I have also tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the max number of steps and if the initial position is fixed or random) I can test them more precisely.

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables means (in particular "a total of p sub-tree branches").
How does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4 "However, as mentioned earlier
B-Trees nodes do not facilitate radix searching. Instead we chose
the initial invariant of allowing the node sizes to range between m
and m - 1. This defines a family of balanced trees starting with
well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This
invariant ensures balancing and achieves radix branch search in the
majority of cases. Occasionally a few step linear search is needed
after the radix search to find the correct branch.
The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
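The "radix guess, then a few linear steps" lookup can be sketched as follows. This is an illustration, not the paper's code: it assumes each relaxed node stores a table of cumulative sizes for its slots and that a child at this level holds at most 2^shift values. The name `find_slot` is made up.

```python
def find_slot(cumulative_sizes, index, shift):
    """Locate the child slot holding `index` in a relaxed node.

    cumulative_sizes[i] is the number of values stored under slots 0..i.
    Since no child holds more than 2**shift values, index >> shift can
    only be too small, never too large, so we only ever scan to the right."""
    slot = index >> shift                    # radix guess
    while cumulative_sizes[slot] <= index:   # the few extra linear steps
        slot += 1
    before = cumulative_sizes[slot - 1] if slot > 0 else 0
    return slot, index - before              # slot and index within that child

# three children holding 3, 4 and 3 values; each could hold up to 4 (shift = 2)
print(find_slot([3, 7, 10], 7, shift=2))
```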
@mcdowella is correct that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2 m) - 1 because this allows nodes to be split into 2 when they get too big, or 2 nodes joined into one when they are too small without ever needing to change a third node. So it's a typo that the "2" is missing in "2 m" in the paper. Jean Niklas L’orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
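The bit-shift and bit-mask indexing for strict nodes looks roughly like this. It is a sketch using nested Python lists as nodes, with the branching factor passed in as `bits` so that a tiny tree can be shown; Clojure's PersistentVector uses 5 bits, i.e. 32-way nodes. The function name is made up.

```python
def strict_get(root, index, height, bits=5):
    """Index into a strict (fully radix-balanced) tree without any search.

    height is the number of internal levels above the leaves; every node
    has 2**bits slots and the tree is packed to the left."""
    mask = (1 << bits) - 1
    node = root
    for level in range(height, 0, -1):
        node = node[(index >> (level * bits)) & mask]  # pick the child by its digit
    return node[index & mask]                          # position inside the leaf

# 16 values in a 4-way tree (bits = 2) with one internal level
root = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(strict_get(root, 9, height=1, bits=2))
```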
Their tree also has an even height - all leaf nodes are equal distance from the root. I think it would still work if relaxed trees had to be within, say, one level of one-another, though not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector because the RRB Tree has one inside it. The two differences between that and the RRB Tree are 1. the RRB Tree allows "relaxed" nodes and 2. the RRB Tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These 2 differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.

How can you compute a shortest addition chain for an arbitrary n <= 600 within one second?

How can you compute a shortest addition chain (sac) for an arbitrary n <= 600 within one second?
Notes
This is the programming competition on codility for this month.
Addition chains are numerically very important, since they are the most economical way to compute x^n (by consecutive multiplications).
Knuth's Art of Computer Programming, Volume 2, Seminumerical Algorithms has a nice introduction to addition chains and some interesting properties, but I didn't find anything that enabled me to fulfill the strict performance requirements.
What I've tried (spoiler alert)
Firstly, I constructed a (highly branching) tree (with the start 1-> 2 -> ( 3 -> ..., 4 -> ...)) such that for each node n, the path from the root to n is a sac for n. But for values >400, the runtime is about the same as for making a coffee.
Then I used that program to find some useful properties for reducing the search space. With that, I'm able to build all solutions up to 600 while making a coffee. But for n, I need to compute all solutions up to n. Unfortunately, codility measures the class initialization's runtime, too...
Since the problem is probably NP-hard, I ended up hard-coding a lookup table. But since codility asked to construct the sac, I don't know if they had a lookup table in mind, so I feel dirty and like a cheater. Hence this question.
Update
If you think a hard-coded, full lookup table is the way to go, can you give an argument why you think a full computation/partly computed solutions/heuristics won't work?
I have just got my Golden Certificate for this problem. I will not provide a full solution because the problem is still available on the site. I will instead give you some hints:
You might consider doing a depth-first search.
There exists a minimal star-chain for each n < 12509
You need to know how to prune your search space.
You need a good lower bound for the length of the chain you are looking for.
You need a good lower bound for the length of the chain you are looking for.
Remember that you need just one solution, not all.
Good luck.
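Put together, the hints suggest something like the sketch below: an iterative-deepening depth-first search over star chains, where every new element is the last element plus some earlier one. This is not the certificate-winning solution, only a plain version of the idea. The pruning rule used here (the last element can at best double at every remaining step) and the function names are this sketch's own choices, and it makes no claim about meeting the one-second limit for all n <= 600.

```python
def shortest_star_chain(n):
    """Shortest star chain for n, found by iterative deepening."""
    if n == 1:
        return [1]

    def dfs(chain, limit):
        last = chain[-1]
        if last == n:
            return list(chain)
        steps_left = limit - (len(chain) - 1)
        # prune: even doubling at every remaining step cannot reach n
        if steps_left == 0 or (last << steps_left) < n:
            return None
        for x in reversed(chain):      # try the largest summands first
            nxt = last + x
            if nxt > n:
                continue
            chain.append(nxt)
            found = dfs(chain, limit)
            chain.pop()
            if found:
                return found           # one solution is enough
        return None

    limit = (n - 1).bit_length()       # lower bound: ceil(log2(n)) additions
    while True:
        found = dfs([1], limit)
        if found:
            return found
        limit += 1

print(shortest_star_chain(23))
```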
Addition chains are numerically very important, since they are the
most economical way to compute x^n (by consecutive multiplications).
This is not true. They are not always the most economical way to compute x^n. Graham et al. proved that:
If each step in addition chain is assigned a cost equal to the product
of the numbers at that step, "binary" addition chains are shown to
minimize the cost.
Situation changes dramatically when we compute x^n (mod m), which is a common case, for example in cryptography.
Now, to answer your question. Apart from hard-coding a table with answers, you could try a Brauer chain.
A Brauer chain (aka star-chain) is an addition chain where each new element is formed as the sum of the previous element and some element (possibly the same). For every n < 12509 there is a Brauer chain that is also a sac. Quoting Daniel J. Bernstein:
Brauer's algorithm is often called "the left-to-right 2^k-ary method",
or simply "2^k-ary method". It is extremely popular. It is easy to
implement; constructing the chain for n is a simple matter of
inspecting the bits of n. It does not require much storage.
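A minimal Python sketch of the left-to-right 2^k-ary method described in the quote follows (Python rather than C/C++, purely for brevity). It precomputes the small values up to the largest base-2^k digit of n, then for each further digit doubles k times and adds the digit. Values that already appear among the precomputed ones are skipped, so the result is a valid addition chain but not necessarily a shortest one, and not always a strict star chain. The names `kary_chain` and `is_addition_chain` are made up.

```python
def kary_chain(n, k=2):
    """Addition chain for n built with the left-to-right 2^k-ary method."""
    base = 1 << k
    digits = []
    m = n
    while m:
        digits.append(m % base)
        m //= base
    digits.reverse()                   # most significant digit first

    chain = list(range(1, max(digits) + 1))   # precomputed small values
    present = set(chain)

    def push(value):
        if value not in present:       # skip values we already have
            chain.append(value)
            present.add(value)

    value = digits[0]
    for d in digits[1:]:
        for _ in range(k):             # k doublings shift by one digit
            value *= 2
            push(value)
        if d:
            value += d                 # add the precomputed digit
            push(value)
    return chain

def is_addition_chain(chain):
    """Every element after the first is a sum of two earlier elements."""
    return chain[0] == 1 and all(
        any(chain[i] - a in chain[:i] for a in chain[:i])
        for i in range(1, len(chain)))

print(kary_chain(23, k=2))
```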
BTW. Does anybody know a decent C/C++ implementation of Brauer's chain computation? I'm working partially on a comparison of exponentiation times using binary and Brauer's chains for both cases: x^n and x^n (mod m).

backtracking algorithm for set cover

Can someone provide me with a backtracking algorithm to solve the "set cover" problem to find the minimum number of sets that cover all the elements in the universe?
The greedy approach almost always selects more sets than the optimal number of sets.
This paper uses Linear Programming Relaxation to solve covering problems.
Basically, the LP relaxation yields good bounds, and can be used to identify solutions that are optimum in many cases. Incidentally, when I last looked at open source LP solvers (~2003) I wasn't impressed (some gave incorrect results), but there seem to be some decent open source LP solvers now.
Your problem needs a little more clarification - it seems that you are given a family of subsets $$S_1,\ldots,S_n$$ of a set A, such that the union of the subsets equals A, and you want a minimum number of subsets whose union is still A.
The basic approach is branch and bound with some heuristics. E.g., if a particular element of A is in only one subset $$S_i$$, then you must select $$S_i$$. Similarly, if $$S_k$$ is a subset of $$S_j$$, then there's no reason to consider $$S_k$$; if element $$a_i$$ is in every subset that $$a_j$$ is in, then you need not consider $$a_i$$, because covering $$a_j$$ covers it as well.
For branch and bound you need good bounding heuristics. Lower bounds can come from independent sets: if there are k elements $$i_1,\ldots,i_k$$ in A such that, for $$p \neq q$$, every subset containing $$i_p$$ is disjoint from every subset containing $$i_q$$, then any cover needs at least k subsets. Better lower bounds come from the LP relaxation described above.
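A bare-bones backtracking version of this, without the LP bound, could look like the sketch below. It branches on the uncovered element that appears in the fewest subsets, which handles the "element in only one subset" rule automatically, and prunes any branch that cannot beat the best cover found so far. The function name and the choice of branching rule are assumptions of this example.

```python
def min_set_cover(universe, subsets):
    """Return indices of a minimum-size family of subsets covering universe."""
    universe = frozenset(universe)
    subsets = [frozenset(s) & universe for s in subsets]
    if frozenset().union(*subsets) != universe:
        raise ValueError("the subsets do not cover the universe")

    best = list(range(len(subsets)))   # taking every subset always works

    def search(uncovered, chosen):
        nonlocal best
        if not uncovered:
            if len(chosen) < len(best):
                best = list(chosen)
            return
        # bound: at least one more subset is needed on this branch
        if len(chosen) + 1 >= len(best):
            return
        # branch on the element with the fewest candidate subsets
        elem = min(uncovered, key=lambda e: sum(e in s for s in subsets))
        candidates = [i for i, s in enumerate(subsets) if elem in s]
        # try subsets that cover the most uncovered elements first
        candidates.sort(key=lambda i: -len(subsets[i] & uncovered))
        for i in candidates:
            chosen.append(i)
            search(uncovered - subsets[i], chosen)
            chosen.pop()

    search(universe, [])
    return best

cover = min_set_cover(range(1, 6), [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}])
print(cover)
```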
The Espresso logic minimization system from Berkeley has a very high quality set covering engine.
