Heuristics for estimating the efficiency of Reduced Ordered Binary Decision Diagrams?

Reduced Ordered Binary Decision Diagrams (ROBDD) are an efficient data structure for boolean functions of multiple variables f(x1,x2,...,xn). I would like to get an intuition for how efficient they are.
For instance, for data compression, we know that data with low entropy (some symbols appearing more often than others, many repetitions) can be compressed very well while completely random data cannot be compressed.
Is there an analogous intuition for estimating how efficiently ROBDDs can represent a given boolean formula? Any literature on this subject (preferably online)?

The Wikipedia article cites a paper, "Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams", which gives lower and upper bounds for certain function classes (symmetric functions, functions representing binary arithmetic). I think that in the average case 2n*log n >= 2^k holds, where n is the number of nodes in the diagram and k is the number of variables of the function. The upper bound is n <= 2^(k+1) - 1, achieved with the full binary tree.
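To get a feel for how much the size can vary, here is a minimal brute-force sketch that counts the internal nodes of the reduced diagram for a small function under a given variable ordering. The name `robdd_internal_nodes` and the convention of passing the function an index-to-bit dictionary are assumptions made for this illustration; the enumeration is exponential in k, so it is only usable for a handful of variables.

```python
def robdd_internal_nodes(func, num_vars, order):
    """Count internal (non-terminal) nodes of the ROBDD of func.

    func takes a dict {variable index: 0 or 1}; order lists the variable
    indices from the root downwards. Brute force: visits all 2^k assignments.
    """
    unique = {}  # (variable, low child id, high child id) -> node id

    def build(level, assignment):
        if level == num_vars:
            return int(bool(func(assignment)))  # terminal ids are 0 and 1
        var = order[level]
        lo = build(level + 1, {**assignment, var: 0})
        hi = build(level + 1, {**assignment, var: 1})
        if lo == hi:            # redundant test: skip this node
            return lo
        key = (var, lo, hi)
        if key not in unique:   # share isomorphic subgraphs
            unique[key] = len(unique) + 2
        return unique[key]

    build(0, {})
    return len(unique)


def f(a):
    # x1*x2 + x3*x4 + x5*x6, with variables indexed from 0
    return (a[0] & a[1]) | (a[2] & a[3]) | (a[4] & a[5])


good = robdd_internal_nodes(f, 6, [0, 1, 2, 3, 4, 5])
bad = robdd_internal_nodes(f, 6, [0, 2, 4, 1, 3, 5])
print(good, bad)
```

The same function needs far fewer nodes when paired variables are adjacent in the ordering, which suggests that any intuition about ROBDD size has to account for the ordering as well as the function.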

Related

Is there an algorithm better than O(N²) to determine if matrix is symmetric?

Algorithm requirements
Input is an arbitrary square matrix M of size N×N, which just fits in memory.
The algorithm's output must be true if M[i,j] = M[j,i] for all j≠i, false otherwise.
Obvious solutions
Check if the transpose equals the matrix itself (M^T = M). Easiest to program in many environments, but (usually) consumes twice the memory and requires N² comparisons in the worst case. Therefore, this is O(N²) and has high peak memory.
Check if the lower triangular part equals the upper triangular part. Of course, the algorithm returns on the first inequality found. This would make the worst case (worst case being, the matrix is indeed symmetric) require N(N-1)/2 comparisons, since the diagonal does not need to be checked. So although it is better than option 1, this is still O(N²).
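The second option can be sketched as follows. This is a minimal illustration that assumes the matrix is stored as a list of rows; the name `is_symmetric` is made up for the example.

```python
def is_symmetric(matrix):
    """Compare each element above the diagonal with its mirror image.

    Returns on the first mismatch; performs at most N(N-1)/2 comparisons
    and needs no extra memory.
    """
    n = len(matrix)
    for i in range(n):
        row = matrix[i]
        for j in range(i + 1, n):   # strictly above the diagonal
            if row[j] != matrix[j][i]:
                return False
    return True


print(is_symmetric([[1, 2, 3], [2, 5, 6], [3, 6, 9]]))
print(is_symmetric([[1, 2], [3, 4]]))
```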
Question
Although it's hard to see how it would be possible (the N² elements will all have to be compared somehow), is there an algorithm doing this check that is better than O(N²)?
Or, provided there is a proof of non-existence of such an algorithm: how to implement this most efficiently for a multi-core CPU (Intel or AMD) taking into account things like cache-friendliness, optimal branch prediction, other compiler-specific specializations, etc.?
This question stems mostly from academic interest, although I imagine a practical use could be to determine what solver to use if the matrix describes a linear system AX=b...
Since you will have to examine all the elements except the diagonal, the complexity IMO can't be better than O(n^2).
For a dense matrix, the answer is a definite "no", because any uninspected (non-diagonal) elements could be different from their transposed counterparts.
For standard representations of a sparse matrix, the same reasoning indicates that you can't generally do better than the input size.
However, the same reasoning doesn't apply to arbitrary matrix representations. For example, you could store sparse representations of the symmetric and antisymmetric components of your matrix, which can easily be checked for symmetry in O(1) time by checking whether the antisymmetric component has any entries at all...
I think you can take a probabilistic approach here.
I think it is unlikely to be a coincidence if p randomly picked lower-triangular elements all match their upper-triangular counterparts. In that case the chance is high that the matrix is indeed symmetric.
So instead of going through all n(n-1)/2 off-diagonal pairs, you can check p random pairs and say that the matrix is symmetric with confidence:
p / (n(n-1)/2)
You can then decide on a threshold above which you believe that the matrix must be symmetric. Note that this ratio is only the fraction of pairs inspected: a matrix with a single asymmetric pair still passes the test with probability of roughly 1 minus that fraction.
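A minimal sketch of this sampling idea, assuming a list-of-rows matrix; the function name and the `seed` parameter are made up for the illustration. The answer is one-sided: `False` is definite, `True` only means that no mismatch was found among the sampled pairs.

```python
import random


def is_probably_symmetric(matrix, p, seed=None):
    """Check p randomly chosen off-diagonal pairs (sampled with replacement)."""
    n = len(matrix)
    if n < 2:
        return True  # nothing off the diagonal to compare
    rng = random.Random(seed)
    for _ in range(p):
        i, j = rng.sample(range(n), 2)   # two distinct indices
        if matrix[i][j] != matrix[j][i]:
            return False                 # definitely not symmetric
    return True                          # no counterexample found


symmetric = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
print(is_probably_symmetric(symmetric, p=10, seed=1))
```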

Efficient data structures for disjoint integer intervals

I have a set of disjoint integer intervals and want to check whether a given integer lies in one of these intervals. Of course, this can be achieved by means of a binary search in logarithmic time. However, the vast majority of the queries return false, i.e., only very few integers lie in any interval. To speed up the application, I'm looking for a probabilistic, constant-time algorithm (some sort of hash function) that tells me whether a given integer is definitely not, or maybe, in an interval. Here is a sketch of the intended algorithm, where magic_data_structure is initialized with the intervals stored in tree:
    x = some_integer;
    if (!magic_data_structure.find(x))
        return false;        // definitely not in any interval
    return tree.find(x);     // binary search on tree
Any ideas or hints for literature? Thank you very much in advance for your help!
P.S.: Does anybody know improvements of interval trees for non-overlapping intervals which (unlike the ones described above) may include other intervals?
This is a naive solution, but constant.
If you are not dealing with extremely large quantities of numbers, you could just use a hash table where the keys are the numbers and the values are a pointer to the set they're in. But of course if there is a lot of data it might take too long (and too much memory) to index it this way.
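A minimal sketch of this hash-table idea, assuming closed intervals given as `(lo, hi)` pairs. The names `build_point_table` and `find_interval` are made up for the illustration, and the memory use is proportional to the total number of integers covered.

```python
def build_point_table(intervals):
    """Map every covered integer to the (lo, hi) interval that contains it."""
    table = {}
    for lo, hi in intervals:
        for x in range(lo, hi + 1):   # closed interval [lo, hi]
            table[x] = (lo, hi)
    return table


def find_interval(table, x):
    """Expected O(1) lookup: the containing interval, or None."""
    return table.get(x)


table = build_point_table([(3, 5), (10, 12)])
print(find_interval(table, 4))
print(find_interval(table, 7))
```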
Looks like there are various disjoint-set data structures and algorithms to store/search them, but I doubt if any of them have constant times.

Algorithms with superexponential runtime?

I was talking with a student the other day about the common complexity classes of algorithms, like O(n), O(n^k), O(n lg n), O(2^n), O(n!), etc. I was trying to come up with an example of a problem whose best known solution has super-exponential runtime, such as O(2^(2^n)), but which is still decidable (e.g. not the halting problem!). The only example I know of is satisfiability of Presburger arithmetic, which I don't think any intro CS students would really understand or be able to relate to.
My question is whether there is a well-known problem whose best known solution has runtime that is superexponential; at least ω(n!) or ω(n^n). I would really hope that there is some "reasonable" problem meeting this description, but I'm not aware of any.
Maximum Parsimony is the problem of finding an evolutionary tree connecting n DNA sequences (representing species) that requires the fewest single-nucleotide mutations. The n given sequences are constrained to appear at the leaves; the tree topology and the sequences at internal nodes are what we get to choose.
In more CS terms: We are given a bunch of length-k strings that must appear at the leaves of some tree, and we have to choose a tree, plus a length-k string for each internal node in the tree, so as to minimise the sum of Hamming distances across all edges.
When a fixed tree is also given, the optimal assignment of sequences to internal nodes can be determined very efficiently using the Fitch algorithm. But in the usual case, a tree is not given (i.e. we are asked to find the optimal tree), and this makes the problem NP-hard, so no polynomial-time algorithm is known and an exact method may in principle have to try every tree. Even though an evolutionary tree has a root (representing the hypothetical ancestor), we only need to consider distinct unrooted trees, since the minimum number of mutations required is not affected by the position of the root. For n species there are 3 * 5 * 7 * ... * (2n-5) leaf-labelled unrooted binary trees. (There is just one such tree with 3 species, which has a single internal vertex and 3 edges; the 4th species can be inserted at any of the 3 edges to produce a distinct 5-edge tree; the 5th species can be inserted at any of these 5 edges, and so on -- this process generates all trees exactly once.) This is sometimes written (2n-5)!!, with !! meaning "double factorial".
In practice, branch and bound is used, and on most real datasets this manages to avoid evaluating most trees. But highly "non-treelike" random data requires all, or almost all (2n-5)!! trees to be examined -- since in this case many trees have nearly equal minimum mutation counts.
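To see how quickly the search space grows, the tree count can be computed directly from the insertion argument above. This is a small sketch; the function name is made up for the illustration.

```python
def num_unrooted_binary_trees(n):
    """Number of leaf-labelled unrooted binary trees on n >= 3 species.

    Follows the insertion argument: species number m (for m = 4..n) can be
    attached to any of the 2m - 5 edges of the tree built so far.
    """
    if n < 3:
        raise ValueError("need at least 3 species")
    count = 1                      # exactly one tree on 3 species
    for m in range(4, n + 1):
        count *= 2 * m - 5
    return count


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```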
Listing all permutations of a string of length n takes n! steps, finding a Hamiltonian cycle by brute force takes n!, and there is also minimum graph coloring, ....
Edit: the Ackermann function grows even faster. In fact, it eventually outgrows every primitive recursive function.
A(x,y) = y+1 (if x = 0)
A(x,y) = A(x-1,1) (if y=0)
A(x,y) = A(x-1, A(x,y-1)) otherwise.
From Wikipedia:
A(4,3) = 2^(2^65536) - 3, ...
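The definition translates directly into code. The sketch below uses an explicit stack instead of recursion so that Python's recursion limit is not an issue; it is only meant for tiny arguments, since anything like A(4,3) is far out of reach.

```python
def ackermann(x, y):
    """Evaluate A(x, y) using an explicit stack of pending first arguments."""
    stack = [x]
    while stack:
        x = stack.pop()
        if x == 0:
            y = y + 1                 # A(0, y) = y + 1
        elif y == 0:
            stack.append(x - 1)       # A(x, 0) = A(x - 1, 1)
            y = 1
        else:
            stack.append(x - 1)       # outer call A(x - 1, ...)
            stack.append(x)           # inner call A(x, y - 1) is evaluated first
            y = y - 1
    return y


print(ackermann(2, 3), ackermann(3, 3))
```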
Do algorithms to compute real numbers to a certain precision count? The formula for the area of the Mandelbrot set converges extremely slowly; 10^118 terms for two digits, 10^1181 terms for three.
This is not a practical everyday problem, but it's a way to construct relatively straightforward problems of increasing complexity.
The Kolmogorov complexity K(x) is the size of the smallest program that outputs the string x on a pre-determined universal computer U. It's easy to show that most strings cannot be compressed at all (since there are more strings of length n than programs of length less than n).
If we give U a maximum running time (say some polynomial function P), we get a time-bounded Kolmogorov complexity. The same counting argument holds: there are some strings that are incompressible under this time-bounded Kolmogorov complexity. Let's call the first such string (of some length n) x_P.
Since the time-bounded Kolmogorov complexity is computable, we can test all strings and find x_P.
Finding x_P can't be done in polynomial time, or we could use this algorithm to compress it, so finding it must be a super-polynomial problem. We do know we can find it in exp(P) time, though. (Jumping over some technical details here.)
So now we have a time bound E = exp(P). We can repeat the procedure to find x_E, and so on.
This approach gives us a decidable super-F problem for every time-constructible function F: find the first string of length n (some large constant) that is incompressible under time-bound F.

Why are Fibonacci numbers significant in computer science?

Fibonacci numbers have become a popular introduction to recursion for Computer Science students and there's a strong argument that they persist within nature. For these reasons, many of us are familiar with them.
They also show up elsewhere in Computer Science, in surprisingly efficient data structures and algorithms based upon the sequence.
There are two main examples that come to mind:
Fibonacci heaps, which have better amortized running time than binomial heaps.
Fibonacci search, which shares O(log N) running time with binary search on an ordered array.
Is there some special property of these numbers that gives them an advantage over other numerical sequences? Is it a spatial quality? What other possible applications could they have?
It seems strange to me as there are many natural number sequences that occur in other recursive problems, but I've never seen a Catalan heap.
The Fibonacci numbers have all sorts of really nice mathematical properties that make them excellent in computer science. Here's a few:
1. They grow exponentially fast. One interesting data structure in which the Fibonacci series comes up is the AVL tree, a form of self-balancing binary tree. The intuition behind this tree is that each node maintains a balance factor so that the heights of the left and right subtrees differ by at most one. Because of this, the minimum number of nodes necessary to get an AVL tree of height h is defined by a recurrence that looks like N(h + 2) ~= N(h) + N(h + 1), which looks a lot like the Fibonacci series. If you work out the math, you can show that the number of nodes necessary to get an AVL tree of height h is F(h + 2) - 1. Because the Fibonacci series grows exponentially fast, this means that the height of an AVL tree is at most logarithmic in the number of nodes, giving you the O(lg n) lookup time we know and love about balanced binary trees. In fact, if you can bound the size of some structure with a Fibonacci number, you're likely to get an O(lg n) runtime on some operation. This is the real reason that Fibonacci heaps are called Fibonacci heaps: the proof that the number of heaps after a dequeue-min is logarithmic involves bounding the number of nodes you can have in a certain depth with a Fibonacci number.
2. Any positive integer can be written as a sum of distinct Fibonacci numbers. This property of the Fibonacci numbers is critical to getting Fibonacci search working at all; if you couldn't add together distinct Fibonacci numbers to reach any possible number, this search wouldn't work. Contrast this with a lot of other series, like 3^n or the Catalan numbers. This is also partially why a lot of algorithms like powers of two, I think.
3. The Fibonacci numbers are efficiently computable. If the series couldn't be generated extremely efficiently (you can get the first n terms in O(n) or any arbitrary term in O(lg n)), then a lot of the algorithms that use them wouldn't be practical. Generating Catalan numbers is pretty computationally tricky, IIRC. On top of this, the Fibonacci numbers have a nice property where, given any two consecutive Fibonacci numbers, let's say F(k) and F(k + 1), we can easily compute the next or previous Fibonacci number by adding the two values (F(k) + F(k + 1) = F(k + 2)) or subtracting them (F(k + 1) - F(k) = F(k - 1)). This property is exploited in several algorithms, in conjunction with property (2), to break apart numbers into the sum of Fibonacci numbers. For example, Fibonacci search uses this to locate values in memory, while a similar algorithm can be used to quickly and efficiently compute logarithms.
4. They're pedagogically useful. Teaching recursion is tricky, and the Fibonacci series is a great way to introduce it. You can talk about straight recursion, about memoization, or about dynamic programming when introducing the series. Additionally, the amazing closed form for the Fibonacci numbers is often taught as an exercise in induction or in the analysis of infinite series, and the related matrix equation for Fibonacci numbers is commonly introduced in linear algebra as a motivation behind eigenvectors and eigenvalues. I think that this is one of the reasons that they're so high-profile in introductory classes.
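The decomposition into distinct Fibonacci numbers mentioned above can be sketched with a simple greedy loop. This is a minimal illustration, with a made-up function name, which relies on the fact that repeatedly taking the largest Fibonacci number that still fits yields such a decomposition.

```python
def fibonacci_decomposition(n):
    """Write a positive integer as a sum of distinct Fibonacci numbers.

    Greedy: repeatedly subtract the largest Fibonacci number that fits.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]                       # start at 1, 2 so that 1 appears once
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts


print(fibonacci_decomposition(100))
```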
I'm sure there are more reasons than just this, but I'm sure that some of these reasons are the main factors. Hope this helps!
The greatest common divisor is another bit of magic; see this for many more. But Fibonacci numbers are easy to calculate, and the sequence has a specific name. For example, the natural numbers 1, 2, 3, 4, 5 have plenty of structure too: all primes are among them, the sum of 1..n is computable, each one can be produced from the others, ... but nobody pays special attention to them :)
One important thing I forgot about is the Golden Ratio, which has a very important impact in real life (for example, you like wide monitors :)
If you have an algorithm that can be successfully explained in a simple and concise manner with understandable examples in CS and nature, what better teaching tool could someone come up with?
Fibonacci sequences are indeed found everywhere in nature/life. They're useful at modeling growth of animal populations, plant cell growth, snowflake shape, plant shape, cryptography, and of course computer science. I've heard it being referred to as the DNA pattern of nature.
Fibonacci heaps have already been mentioned; the number of children of each node in the heap is at most O(log n). Also, the subtree rooted at a node with m children has at least F(m+2) nodes, where F(m+2) is the (m+2)th Fibonacci number.
Torrent-like protocols which use a system of nodes and supernodes use a Fibonacci sequence to decide when a new supernode is needed and how many subnodes it will manage. They do node management based on the Fibonacci spiral (golden ratio). See the photo below for how nodes are split/merged (partitioned from one large square into smaller ones and vice versa). See photo: http://smartpei.typepad.com/.a/6a00d83451db7969e20115704556bd970b-pi
Some occurrences in nature
http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/sneezewort.GIF
http://img.blogster.com/view/anacoana/post-uploads/finger.gif
http://jwilson.coe.uga.edu/EMAT6680/Simmons/6690Pictures/pinecone3yellow.gif
http://2.bp.blogspot.com/-X5II-IhjXuU/TVbHrpmRnLI/AAAAAAAAABU/nv73Y9Ylkkw/s320/amazing_fun_featured_2561778790105101600S600x600Q85_200907231856306879.jpg
I don't think there's a definitive answer, but one possibility is that the operation of dividing a set S into two partitions S1 and S2, one of which is then divided into two sub-partitions S11 and S12, one of which has the same size as S2, is a likely approach in many algorithms, and that can sometimes be numerically described as a Fibonacci sequence.
Let me add another data structure to yours: Fibonacci trees. They are interesting because the calculation of the next position in the tree can be done by mere addition of the previous nodes:
http://xw2k.nist.gov/dads/html/fibonacciTree.html
It ties well in with the discussion by templatetypedef on AVL-trees (an AVL tree can at worst have fibonacci structure). I've also seen buffers extended in fibonacci-steps rather than powers of two in some cases.
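The link between worst-case AVL trees and Fibonacci numbers can be checked numerically. The sketch below assumes that height counts levels (an empty tree has height 0, a single node has height 1) and that F(1) = F(2) = 1; the function names are made up for the illustration.

```python
def min_avl_nodes(h):
    """Fewest nodes in an AVL tree of height h (height counted in levels).

    The sparsest tree of height h has a root, a sparsest subtree of
    height h - 1 and a sparsest subtree of height h - 2.
    """
    prev, cur = 0, 1          # N(0) = 0, N(1) = 1
    if h == 0:
        return prev
    for _ in range(h - 1):
        prev, cur = cur, cur + prev + 1
    return cur


def fib(n):
    """F(1) = F(2) = 1, computed iteratively."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


for h in range(1, 8):
    print(h, min_avl_nodes(h), fib(h + 2) - 1)
```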
Just to add a piece of trivia about this, Fibonacci numbers describe the breeding of rabbits. You start with (1, 1), two rabbits, and then their population grows exponentially.
Their computation as a power of [[0,1],[1,1]] matrix can be considered as the most primitive problem of Operational Research (sort of like Prisoner's Dilemma is the most primitive problem of Game Theory).
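A minimal sketch of the matrix-power computation, using exponentiation by squaring so that only O(lg n) matrix multiplications are needed. It relies on the identity that the n-th power of [[0,1],[1,1]] is [[F(n-1), F(n)], [F(n), F(n+1)]]; the helper names are made up for the illustration.

```python
def mat_mul(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def fib_matrix(n):
    """Return F(n), with F(0) = 0 and F(1) = 1, as the top-right entry of M^n."""
    result = ((1, 0), (0, 1))        # identity matrix
    base = ((0, 1), (1, 1))
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)   # square the base at every step
        n >>= 1
    return result[0][1]


print([fib_matrix(i) for i in range(11)])
```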
Symbols with frequencies that are successive fibonacci numbers create maximum depth huffman trees, which trees correspond to source symbols being encoded with maximum length binary codes. Non-fibonacci source symbol frequencies create more balanced trees, with shorter codes. The code length has direct implications in the description complexity of the finite state machine that is responsible for decoding a given huffman code.
Conjecture: The 1st(fib) image will be compressed to 38bits, while the 2nd(uniform) with 50bits. It seems that the closer your source symbol frequencies are to fibonacci numbers the shorter the final binary sequence, the better the compression, maybe optimal in the huffman model.
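The claim about tree depth can be tried out with a small Huffman construction. This sketch only computes code lengths, not the codes themselves, and the function name is made up for the illustration; it makes no claim about the bit counts in the conjecture above.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return the Huffman code length of each symbol, in input order."""
    if len(freqs) == 1:
        return [1]
    tie = count()   # unique tie-breaker so the symbol lists are never compared
    heap = [(w, next(tie), [i]) for i, w in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:            # every merged symbol moves one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), s1 + s2))
    return lengths


print(huffman_code_lengths([1, 1, 2, 3, 5, 8]))   # Fibonacci frequencies
print(huffman_code_lengths([1] * 8))              # uniform frequencies
```

With Fibonacci frequencies the tree degenerates into a chain, so the longest code has one bit fewer than the number of symbols, while uniform frequencies give a balanced tree.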
Further Reading:
Buro, M. (1993). On the maximum length of Huffman codes. Information Processing Letters, 45(5), 219-223. doi:10.1016/0020-0190(93)90207-p
For me, this is about order and space coordinates.
The Fibonacci sequence can be used as a clock.
The Fibonacci sequence allows the golden number to be calculated decimal by decimal.
The golden number multiplied by itself gives exactly the golden number + 1.
So we can certainly cut an integer into a series of integers, of units, by using for example the indexes.
I made a first naive version in Python (a proof of concept; the code is still to be updated).
https://gitlab.com/numbers/Numbers/-/blob/main/ranging.py
So we can frame, count and coordinate the calculation steps and the memory spaces to this perfectly periodic reference frame (in time) and thus make it a kind of universal multiplication table equivalent. For me it is explicitly a mapping.
The idea is to eventually propose a ternary code with explicit management of the memory spaces according to the Fibonacci calculation step, and then to find all our numbers there.
Once done, the idea is to use this mapping, this universal table, this filter, to check the concordance, the consistency, and the periodicity of complex computable operations, such as the Wheeler experiment, sine, gravity etc...
It sounds pretentious when you say it like that. It is not. Nobody created the golden number or Fibonacci. They are here; they are given like fruits on a tree.

Algorithm for generating a size k error-correcting code on n bits

I want to generate a code on n bits for k different inputs that I want to classify. The main requirement of this code is the error-correcting criterion: that the minimum pairwise distance between any two encodings of different inputs is maximized. I don't need it to be exact - approximate will do, and ease of use and speed of computational implementation are a priority too.
In general, n will be in the hundreds, k in the dozens.
Also, is there a reasonably tight bound on the minimum hamming distance between k different n-bit binary encodings?
The problem of finding the exact best error-correcting code for given parameters is very hard; even approximately best codes are hard to find. On top of that, some codes don't have any decent decoding algorithms, while for others the decoding problem is quite tricky.
However, you're asking about a particular range of parameters where n ≫ k, where if I understand correctly you want a k-dimensional code of length n. (So that k bits are encoded in n bits.) In this range, first, a random code is likely to have very good minimum distance. The only problem is that decoding is anywhere from impractical to intractable, and actually calculating the minimum distance is not that easy either.
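For the sizes in the question (k code words, with k in the dozens and n in the hundreds), a random code is easy to generate and its minimum distance can simply be measured by brute force over all pairs. The sketch below reads k as the number of code words, as in the question, not as a dimension; the function names and the `seed` parameter are made up for the illustration.

```python
import random
from itertools import combinations


def random_code(n, k, seed=0):
    """Draw k random n-bit code words, each stored as a Python integer."""
    rng = random.Random(seed)
    return [rng.getrandbits(n) for _ in range(k)]


def min_distance(codewords):
    """Minimum pairwise Hamming distance, by checking all pairs."""
    return min(bin(a ^ b).count("1") for a, b in combinations(codewords, 2))


code = random_code(n=200, k=30)
print(min_distance(code))
```

Decoding such a code by nearest-neighbour search over k code words is also cheap when k is only in the dozens; one could redraw the code a few times and keep the best one.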
Second, if you want an explicit code for the case n ≫ k, then you can do reasonably well with a BCH code with q=2. As the Wikipedia page explains, there is a good decoding algorithm for BCH codes.
Concerning upper bounds for the minimum Hamming distance, in the range n ≫ k you should start with the Hamming bound, also known as the volume bound or the sphere packing bound. The idea of the bound is simple and beautiful: If the minimum distance is t, then the code can correct errors up to distance floor((t-1)/2). If you can correct errors out to some radius, it means that the Hamming balls of that radius don't overlap. On the other hand, the total number of possible words is 2^n, so if you divide that by the number of points in one Hamming ball (which in the binary case is a sum of binomial coefficients), you get an upper bound on the number of code words. It is possible to improve on this bound, but for large minimum distance it's not easy. In this regime it's a very good bound.
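The bound is straightforward to evaluate. The sketch below computes the largest number of code words the Hamming bound permits for a given length and minimum distance, and from that the largest minimum distance the bound does not rule out for a given number of code words (assumed to be at least 2). The function names are made up for the illustration, and the result is only an upper bound, not a guarantee that such a code exists.

```python
from math import comb


def hamming_bound(n, d):
    """Upper bound on the number of n-bit code words with minimum distance d."""
    radius = (d - 1) // 2
    ball = sum(comb(n, i) for i in range(radius + 1))   # points in one Hamming ball
    return 2 ** n // ball


def max_distance_allowed(n, num_codewords):
    """Largest d for which the Hamming bound still permits num_codewords words."""
    d = 1
    while d < n and hamming_bound(n, d + 1) >= num_codewords:
        d += 1
    return d


print(hamming_bound(7, 3))
print(max_distance_allowed(200, 30))
```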
