Hashing of a Bitstring to Sort by Similarities - algorithm

Problem description:
We have a lot of bitstrings of the same size. The number and the size of bitstrings is huge. E.g.: 10100101 and 00001111.
Now there is a distance function that counts the positions at which both bitstrings have a set bit. In this example the distance is 2, because the third-last and the last bits are set in both bitstrings.
-> Now we can make a tour through the bitstrings with the maximal distance, because every bitstring can be converted to a vertex which is connected to all other bitstrings (by the distance function).
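A minimal sketch of the distance function as described above, assuming the bitstrings are given as Python strings of '0' and '1'. The names `distance` and `all_pair_distances` are illustrative only. The second helper shows where the O(N²) cost comes from: every pair of bitstrings is one edge of the complete graph.

```python
from itertools import combinations


def distance(a: str, b: str) -> int:
    """Count the positions at which both bitstrings have a set bit."""
    return bin(int(a, 2) & int(b, 2)).count("1")


def all_pair_distances(bitstrings):
    """Edge weights of the complete graph: one entry per unordered pair."""
    return {(a, b): distance(a, b) for a, b in combinations(bitstrings, 2)}


if __name__ == "__main__":
    print(distance("10100101", "00001111"))  # 2, as in the example above
    edges = all_pair_distances(["10100101", "00001111", "11110000"])
    print(len(edges))  # 3 pairs for 3 bitstrings, N*(N-1)/2 in general
```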
Goal
However, this has a complexity of O(N²). My idea is to use a hashing function that preserves the similarities and then do a simple sort on the hash values. This should result in a near-maximum tour. Of course it is not the best result, but it should be a reasonably good one.
Current Problem
My own hashing function rates the left bits higher than the right bits, so they have a more significant effect on the sort.
Actual Question
Does such an algorithm exist?
Is it possible to use Locality-Sensitive Hashing (LSH) for that aim and, if so, can you formulate the respective algorithm? (I don't understand the algorithm yet.)
Thank you, guys!

Related

Fewest subsets with sum less than N

I have a specific sub-problem for which I am having trouble coming up with an optimal solution. This problem is similar to the subset sum group of problems as well as space filling problems, but I have not seen this specific problem posed anywhere. I don't necessarily need the optimal solution (as I am relatively certain it is NP-hard), but an effective and fast approximation would certainly suffice.
Problem: Given a list of positive valued integers find the fewest number of disjoint subsets containing the entire list of integers where each subset sums to less than N. Obviously no integer in the original list can be greater than N.
In my application I have many lists and I can concatenate them into columns of a matrix as long as they fit in the matrix together. For downstream purposes I would like to have as little "wasted" space in the resulting ragged matrix, hence the space filling similarity.
Thus far I am employing a greedy-like approach, processing from the largest integers down and finding the largest integer that fits into the current subset under the limit N. Once the smallest integer no longer fits into the current subset I proceed to the next subset similarly until all numbers are exhausted. This almost certainly does not find the optimal solution, but was the best I could come up with quickly.
BONUS: My application actually requires batches, where there is a limit on the number of subsets in each batch (M). Thus the larger problem is to find the fewest batches where each batch contains M subsets and each subset sums to less than N.
Straight from Wikipedia (with some bold amendments):
In the bin packing problem, objects [Integers] of different volumes [values] must be
packed into a finite number of bins [sets] or containers each of volume V [summation of the subset < V] in
a way that minimizes the number of bins [sets] used. In computational
complexity theory, it is a combinatorial NP-hard problem.
https://en.wikipedia.org/wiki/Bin_packing_problem
As far as I can tell, this is exactly what you are looking for.
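As a rough illustration, here is a sketch of first-fit decreasing, a standard bin-packing heuristic that is not taken from the question or the quote above. It assumes each subset may sum to at most `capacity`; for the strict "less than N" condition with integers, pass `capacity = N - 1`. The function names and the batching helper for the bonus part are hypothetical.

```python
def first_fit_decreasing(values, capacity):
    """Pack values into as few bins as the heuristic manages.

    Each value goes into the first bin that still has room; values are
    processed from largest to smallest.
    """
    bins, sums = [], []
    for v in sorted(values, reverse=True):
        if v > capacity:
            raise ValueError(f"{v} does not fit into any bin")
        for i, s in enumerate(sums):
            if s + v <= capacity:
                bins[i].append(v)
                sums[i] += v
                break
        else:  # no existing bin had room, so open a new one
            bins.append([v])
            sums.append(v)
    return bins


def into_batches(bins, m):
    """Group the bins into batches of at most m bins each (bonus part)."""
    return [bins[i:i + m] for i in range(0, len(bins), m)]


if __name__ == "__main__":
    packed = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
    print(packed)                   # [[8, 2], [4, 4, 1, 1]]
    print(into_batches(packed, 1))  # one bin per batch
```

The result is not guaranteed to be optimal, but it is fast and usually close.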

What is complexity measured against? (bits, number of elements, ...)

I've read that the naive approach to testing primality has exponential complexity because you judge the algorithm by the size of its input. Mysteriously, people insist that when discussing primality of an integer, the appropriate measure of the size of the input is the number of bits (not n, the integer itself).
However, when discussing an algorithm like Floyd's, the complexity is often stated in terms of the number of nodes without regard to the number of bits required to store those nodes.
I'm not trying to make an argument here. I honestly don't understand the reasoning. Please explain. Thanks.
Traditionally speaking, the complexity is measured against the size of input.
In case of numbers, the size of input is log of this number (because it is a binary representation of it), in case of graphs, all edges and vertices must be represented somehow in the input, so the size of the input is linear in |V| and |E|.
For example, naive primality test that runs in linear time of the number itself, is called pseudo-polynomial. It is polynomial in the number, but it is NOT polynomial in the size of the input, which is log(n), and it is in fact exponential in the size of the input.
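A small sketch that makes this concrete, assuming the naive test is trial division by every d from 2 to n - 1 (the function name is illustrative). For a prime n the number of divisions is about n, which is about 2^b for a b-bit input.

```python
def is_prime_naive(n: int):
    """Trial division by 2..n-1; returns (is_prime, divisions_performed)."""
    if n < 2:
        return False, 0
    divisions = 0
    for d in range(2, n):
        divisions += 1
        if n % d == 0:
            return False, divisions
    return True, divisions


if __name__ == "__main__":
    # Primes with 4, 8 and 16 bits: the work grows with n, i.e. exponentially
    # in the number of bits needed to write n down.
    for p in (13, 251, 65521):
        prime, steps = is_prime_naive(p)
        print(p.bit_length(), "bits:", steps, "divisions, prime =", prime)
```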
As a side note, it does not matter whether you measure the size of the input in bits, bytes, or any other unit that differs by a CONSTANT factor, because constant factors are discarded anyway in asymptotic notation.
The main difference is that when discussing algorithms we keep in the back of our mind a hardware that is able to perform operations on the data used in O(1) time. When being strict, or when considering data that does not fit into the processor's registers, taking the number of bits into account becomes important.
Although the size of input is measured in the number of bits, in many cases we can use a shortcut that lets us divide out a constant number of bits. This constant factor is embedded in the representation that we choose for our data structure.
When discussing graph algorithms, we assume that each vertex and each edge has a fixed cost of representation in terms of the number of bits, which does not depend on the number of vertices and edges. This assumption requires that weights associated with vertices and edges have a fixed size in terms of the number of bits (e.g. all integers, all floats, etc.)
With this assumption in place, adjacency list representation has fixed size per edge or vertex, because we need one pointer per edge and one pointer per vertex, in addition to the weights, which we presume to be of constant size as well.
Same goes for adjacency matrix representation, because we need W(V² + V) bits for the matrix (V² entries for the edge weights plus V for the vertex weights), where W is the number of bits required to store the weight.
In rare situations when weights themselves are dependent on the number of vertices or edges the assumption of fixed weight no longer holds, so we must go back to counting the number of bits.

Maximum two-dimensional subset-sum

I've been given the task of writing an algorithm to compute the maximum two-dimensional subset of a matrix of integers. However, I'm not interested in help with such an algorithm; I'm more interested in knowing the best worst-case complexity with which this can possibly be solved.
Our current algorithm is O(n^3).
I've been considering something like divide and conquer: splitting the matrix into a number of sub-matrices, simply by adding up the elements within the matrices, and thereby limiting the number of matrices one has to consider in order to find an approximate solution.
Worst case (exhaustive search) is definitely no worse than O(n^3). There are several descriptions of this on the web.
Best case can be far better: O(1). If all of the elements are non-negative, then the answer is the matrix itself. If the elements are non-positive, the answer is the element that has its value closest to zero.
Likewise if there are entire rows/columns on the edges of your matrix that are nothing but non-positive integers, you can chop these off in your search.
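For reference, a sketch of one well-known O(n^3) approach for an n x n matrix, assuming the task is the maximum-sum contiguous submatrix. The thread does not show the asker's own algorithm, so this is only an illustration: fix a top and a bottom row, collapse the rows between them into column sums, and run Kadane's algorithm on that one-dimensional array.

```python
def max_submatrix_sum(matrix):
    """Largest sum over all non-empty contiguous submatrices."""
    rows, cols = len(matrix), len(matrix[0])
    best = matrix[0][0]
    for top in range(rows):
        col_sums = [0] * cols
        for bottom in range(top, rows):
            # col_sums[c] is the sum of column c over rows top..bottom
            for c in range(cols):
                col_sums[c] += matrix[bottom][c]
            # Kadane's algorithm on the collapsed rows
            current = col_sums[0]
            best = max(best, current)
            for x in col_sums[1:]:
                current = max(x, current + x)
                best = max(best, current)
    return best


if __name__ == "__main__":
    print(max_submatrix_sum([[1, 2], [3, 4]]))      # 10: the whole matrix
    print(max_submatrix_sum([[-3, -1], [-2, -5]]))  # -1: element closest to zero
```

The two calls correspond to the all-non-negative and all-non-positive cases mentioned above.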
I've figured that there isn't a better way to do it. - At least not known to man yet.
And I'm going to stick with the solution I've got, mainly because it's simple.

Why are Fibonacci numbers significant in computer science?

Fibonacci numbers have become a popular introduction to recursion for Computer Science students and there's a strong argument that they persist within nature. For these reasons, many of us are familiar with them.
They also exist within Computer Science elsewhere too; in surprisingly efficient data structures and algorithms based upon the sequence.
There are two main examples that come to mind:
Fibonacci heaps which have better amortized running time than binomial heaps.
Fibonacci search which shares O(log N) running time with binary search on an ordered array.
Is there some special property of these numbers that gives them an advantage over other numerical sequences? Is it a spatial quality? What other possible applications could they have?
It seems strange to me as there are many natural number sequences that occur in other recursive problems, but I've never seen a Catalan heap.
The Fibonacci numbers have all sorts of really nice mathematical properties that make them excellent in computer science. Here's a few:
1. They grow exponentially fast. One interesting data structure in which the Fibonacci series comes up is the AVL tree, a form of self-balancing binary tree. The intuition behind this tree is that each node maintains a balance factor so that the heights of the left and right subtrees differ by at most one. Because of this, the minimum number of nodes necessary to get an AVL tree of height h is defined by a recurrence that looks like N(h + 2) ~= N(h) + N(h + 1), which looks a lot like the Fibonacci series. If you work out the math, you can show that the minimum number of nodes necessary to get an AVL tree of height h is F(h + 2) - 1. Because the Fibonacci series grows exponentially fast, this means that the height of an AVL tree is at most logarithmic in the number of nodes, giving you the O(lg n) lookup time we know and love about balanced binary trees. In fact, if you can bound the size of some structure with a Fibonacci number, you're likely to get an O(lg n) runtime on some operation. This is the real reason that Fibonacci heaps are called Fibonacci heaps: the proof that bounds the number of heaps after a dequeue-min involves bounding the number of nodes you can have at a certain depth with a Fibonacci number.
2. Any positive integer can be written as a sum of distinct Fibonacci numbers. This property of the Fibonacci numbers is critical to getting Fibonacci search working at all; if you couldn't add together distinct Fibonacci numbers into any possible number, this search wouldn't work. Contrast this with a lot of other series, like 3^n or the Catalan numbers. This is also partially why a lot of algorithms like powers of two, I think.
3. The Fibonacci numbers are efficiently computable. If the series couldn't be generated extremely efficiently (you can get the first n terms in O(n) or any arbitrary term in O(lg n)), then a lot of the algorithms that use them wouldn't be practical. Generating Catalan numbers is pretty computationally tricky, IIRC. On top of this, the Fibonacci numbers have a nice property where, given any two consecutive Fibonacci numbers, let's say F(k) and F(k + 1), we can easily compute the next or previous Fibonacci number by adding the two values (F(k) + F(k + 1) = F(k + 2)) or subtracting them (F(k + 1) - F(k) = F(k - 1)). This property is exploited in several algorithms, in conjunction with property (2), to break apart numbers into the sum of Fibonacci numbers. For example, Fibonacci search uses this to locate values in memory, while a similar algorithm can be used to quickly and efficiently compute logarithms.
4. They're pedagogically useful. Teaching recursion is tricky, and the Fibonacci series is a great way to introduce it. You can talk about straight recursion, about memoization, or about dynamic programming when introducing the series. Additionally, the amazing closed form for the Fibonacci numbers is often taught as an exercise in induction or in the analysis of infinite series, and the related matrix equation for Fibonacci numbers is commonly introduced in linear algebra as a motivation behind eigenvectors and eigenvalues. I think that this is one of the reasons that they're so high-profile in introductory classes.
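The first three properties above can be checked with a short sketch. It assumes the conventions F(0) = 0, F(1) = 1 and an AVL height where a single node has height 1; the exact AVL recurrence used here, N(h) = N(h - 1) + N(h - 2) + 1, includes the "+ 1" for the root that the approximate form leaves out. All function names are illustrative.

```python
def fib_pair(n: int):
    """Return (F(n), F(n + 1)) by fast doubling, using O(lg n) steps."""
    if n == 0:
        return 0, 1
    a, b = fib_pair(n // 2)   # a = F(k), b = F(k + 1) with k = n // 2
    c = a * (2 * b - a)       # F(2k)
    d = a * a + b * b         # F(2k + 1)
    return (d, c + d) if n % 2 else (c, d)


def fib(n: int) -> int:
    return fib_pair(n)[0]


def min_avl_nodes(h: int) -> int:
    """Minimum node count of an AVL tree of height h (empty tree: h = 0)."""
    a, b = 0, 1               # N(0), N(1)
    for _ in range(h):
        a, b = b, a + b + 1
    return a


def fibonacci_sum(n: int):
    """Greedily split a positive integer into distinct Fibonacci numbers."""
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts


if __name__ == "__main__":
    print(fib(10))                                    # 55
    print(min_avl_nodes(4), fib(6) - 1)               # 7 7
    print(fibonacci_sum(100))                         # [89, 8, 3]
```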
I'm sure there are more reasons than just this, but I'm sure that some of these reasons are the main factors. Hope this helps!
The greatest common divisor is another piece of magic; see this for many more. But Fibonacci numbers are easy to calculate, and the sequence also has a specific name. For example, the natural numbers 1, 2, 3, 4, 5 have plenty of structure too: all primes are among them, the sum of 1..n is computable, each one can be produced from the others, ... but no one pays special attention to them :)
One important thing I forgot about it is Golden Ratio, which has very important impact in real life (for example you like wide monitors :)
If you have an algorithm that can be successfully explained in a simple and concise manner with understandable examples in CS and nature, what better teaching tool could someone come up with?
Fibonacci sequences are indeed found everywhere in nature/life. They're useful at modeling growth of animal populations, plant cell growth, snowflake shape, plant shape, cryptography, and of course computer science. I've heard it being referred to as the DNA pattern of nature.
Fibonacci heaps have already been mentioned; the number of children of each node in the heap is at most log(n). Also, the subtree rooted at a node with m children has at least as many nodes as the (m+2)th Fibonacci number.
Torrent like protocols which use a system of nodes and supernodes use a fibonacci to decide when a new super node is needed and how many subnodes it will manage. They do node management based on the fibonacci spiral (golden ratio). See the photo below how nodes are split/merged (partitioned from one large square into smaller ones and vice versa). See photo: http://smartpei.typepad.com/.a/6a00d83451db7969e20115704556bd970b-pi
Some occurrences in nature
http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/sneezewort.GIF
http://img.blogster.com/view/anacoana/post-uploads/finger.gif
http://jwilson.coe.uga.edu/EMAT6680/Simmons/6690Pictures/pinecone3yellow.gif
http://2.bp.blogspot.com/-X5II-IhjXuU/TVbHrpmRnLI/AAAAAAAAABU/nv73Y9Ylkkw/s320/amazing_fun_featured_2561778790105101600S600x600Q85_200907231856306879.jpg
I don't think there's a definitive answer, but one possibility is that the operation of dividing a set S into two partitions S1 and S2, one of which is then divided into two sub-partitions S11 and S12, one of which has the same size as S2, is a likely approach to many algorithms, and that can sometimes be numerically described as a Fibonacci sequence.
Let me add another data structure to yours: Fibonacci trees. They are interesting because the calculation of the next position in the tree can be done by mere addition of the previous nodes:
http://xw2k.nist.gov/dads/html/fibonacciTree.html
It ties well in with the discussion by templatetypedef on AVL-trees (an AVL tree can at worst have fibonacci structure). I've also seen buffers extended in fibonacci-steps rather than powers of two in some cases.
Just to add a piece of trivia about this, Fibonacci numbers describe the breeding of rabbits. You start with (1, 1), two rabbits, and then their population grows exponentially.
Their computation as a power of [[0,1],[1,1]] matrix can be considered as the most primitive problem of Operational Research (sort of like Prisoner's Dilemma is the most primitive problem of Game Theory).
Symbols with frequencies that are successive fibonacci numbers create maximum depth huffman trees, which trees correspond to source symbols being encoded with maximum length binary codes. Non-fibonacci source symbol frequencies create more balanced trees, with shorter codes. The code length has direct implications in the description complexity of the finite state machine that is responsible for decoding a given huffman code.
Conjecture: The 1st(fib) image will be compressed to 38bits, while the 2nd(uniform) with 50bits. It seems that the closer your source symbol frequencies are to fibonacci numbers the shorter the final binary sequence, the better the compression, maybe optimal in the huffman model.
Further Reading:
Buro, M. (1993). On the maximum length of Huffman codes. Information
Processing Letters, 45(5), 219-223. doi:10.1016/0020-0190(93)90207-p
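The claim about tree depth can be checked with a small sketch that computes Huffman code lengths using Python's `heapq`. The function name and the two example frequency lists are illustrative; they are not the images referred to above.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return the Huffman code length of each symbol, in input order."""
    # Heap entries: (weight, tie-breaker, indices of the symbols in the subtree)
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    counter = len(freqs)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:      # every symbol below the merge gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, counter, s1 + s2))
        counter += 1
    return lengths


if __name__ == "__main__":
    print(huffman_code_lengths([1, 1, 2, 3, 5, 8]))  # [5, 5, 4, 3, 2, 1]
    print(max(huffman_code_lengths([1] * 6)))        # 3
```

With six symbols, the Fibonacci frequencies give the maximum possible code length of 5 (one less than the number of symbols), while uniform frequencies give a balanced tree with a maximum length of 3.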
For me, this is about order and space coordinates.
The Fibonacci sequence can be used as a clock.
The Fibonacci sequence lets you calculate the golden number decimal by decimal.
The golden number multiplied by itself gives exactly the golden number + 1.
So we can certainly cut an integer into a series of integers, of units, by using, for example, the indexes.
I made a first naive version in Python (a proof of concept; the code is to be updated).
https://gitlab.com/numbers/Numbers/-/blob/main/ranging.py
So we can frame, count and coordinate the calculation steps and the memory spaces to this perfectly periodic reference frame (in time) and thus make it a kind of universal multiplication table equivalent. For me it is explicitly a mapping.
The idea is to eventually propose a ternary code with explicit management of the memory spaces according to the Fibonacci calculation step, and then to find all our numbers there.
Once done, this mapping, this universal table, this filter can be used to check the concordance, the consistency, and the periodicity of complex computable operations, such as the Wheeler experiment, sine, gravity, etc.
It sounds pretentious when you say it like that. It is not. Nobody created the golden number or Fibonacci. They are here; they are given like fruits on a tree.

Algorithm for generating a size k error-correcting code on n bits

I want to generate a code on n bits for k different inputs that I want to classify. The main requirement of this code is the error-correcting criteria: that the minimum pairwise distance between any two encodings of different inputs is maximized. I don't need it to be exact - approximate will do, and ease of use and speed of computational implementation is a priority too.
In general, n will be in the hundreds, k in the dozens.
Also, is there a reasonably tight bound on the minimum hamming distance between k different n-bit binary encodings?
The problem of finding the exact best error-correcting code for given parameters is very hard; even approximately best codes are hard to find. On top of that, some codes don't have any decent decoding algorithms, while for others the decoding problem is quite tricky.
However, you're asking about a particular range of parameters where n ≫ k, where if I understand correctly you want a k-dimensional code of length n. (So that k bits are encoded in n bits.) In this range, first, a random code is likely to have very good minimum distance. The only problem is that decoding is anywhere from impractical to intractable, and actually calculating the minimum distance is not that easy either.
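A minimal sketch of the random-code idea. Note the assumption: it follows the question's reading (k codewords of n bits each), not the k-dimensional reading used in this answer. With only dozens of codewords, the minimum distance can simply be checked by brute force over all pairs. The function names and the seed parameter are illustrative.

```python
import random
from itertools import combinations


def random_code(k: int, n: int, seed: int = 0):
    """Draw k random n-bit codewords, stored as Python integers."""
    rng = random.Random(seed)
    return [rng.getrandbits(n) for _ in range(k)]


def min_distance(code):
    """Minimum pairwise Hamming distance, by brute force over all pairs."""
    return min(bin(a ^ b).count("1") for a, b in combinations(code, 2))


if __name__ == "__main__":
    code = random_code(k=32, n=256)
    print(min_distance(code))  # typically not far below n / 2 for random codewords
```

One could draw several random codes with different seeds and keep the one with the largest minimum distance.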
Second, if you want an explicit code for the case n ≫ k, then you can do reasonably well with a BCH code with q=2. As the Wikipedia page explains, there is a good decoding algorithm for BCH codes.
Concerning upper bounds for the minimum Hamming distance, in the range n ≫ k you should start with the Hamming bound, also known as the volume bound or the sphere packing bound. The idea of the bound is simple and beautiful: If the minimum distance is t, then the code can correct errors up to distance floor((t-1)/2). If you can correct errors out to some radius, it means that the Hamming balls of that radius don't overlap. On the other hand, the total number of possible words is 2^n, so if you divide that by the number of points in one Hamming ball (which in the binary case is a sum of binomial coefficients), you get an upper bound on the number of error-free code words. It is possible to beat this bound, but for large minimum distance it's not easy. In this regime it's a very good bound.
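The bound can be turned around to answer the question directly: given n and the number of codewords, find the largest minimum distance that the Hamming bound still permits. A sketch, with illustrative function names and assuming binary codes:

```python
from math import comb


def hamming_ball_size(n: int, r: int) -> int:
    """Number of n-bit words within Hamming distance r of a fixed word."""
    return sum(comb(n, i) for i in range(r + 1))


def max_distance_by_hamming_bound(n: int, num_codewords: int) -> int:
    """Largest minimum distance d not ruled out by the Hamming bound.

    A code with minimum distance d has disjoint balls of radius
    floor((d - 1) / 2), so num_codewords * ball_size must not exceed 2^n.
    """
    radius = 0
    while (radius + 1 <= n
           and num_codewords * hamming_ball_size(n, radius + 1) <= 2 ** n):
        radius += 1
    # Every d with floor((d - 1) / 2) <= radius passes, i.e. d <= 2 * radius + 2
    return min(n, 2 * radius + 2)


if __name__ == "__main__":
    print(max_distance_by_hamming_bound(7, 16))    # 4
    print(max_distance_by_hamming_bound(256, 32))  # parameters in the asker's range
```

This is only an upper bound: no code can do better, but a code achieving it need not exist.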

Resources