Data structure for iterating over subsets of a collection

This is a general question about algorithms/data structures.
There is no specific programming language.
I'm working with arrays of boolean values.
The size of the arrays is always 50.
I want to have a collection of these arrays.
I will need to iterate over my collection multiple times.
To improve performance, I would like to limit each iteration to a subset of the collection rather than the whole collection.
For instance: iterate only over the arrays that have FALSE in the 4th and 13th positions.
I will NOT need to search for TRUE values, only for FALSE values at certain positions of the array.
Note that the possible subsets can share elements without one being included in the other.
Is there any kind of data structure that could help me?

There is a short piece on this in Knuth Volume III "Sorting and Searching" section 6.5. There are a few simple schemes laid out there with somewhat limited effectiveness, which suggests to me that there are no magic answers to this hard problem of searching multi-dimensional spaces.
One approach described there (with parameters adjusted to your problem) is to divide your 50 bits into five 10-bit chunks and create five hash tables, each mapping a value of one of those 10-bit chunks to a list of all the records that have that value in the given position. Given values for particular bit positions, choose whichever of the 10-bit chunks contains the most known bits, and use it to check only 512, 256, or 128 of the 1024 lists in that chunk's table, depending on whether you know 1, 2, or 3 of the bit positions in that chunk.
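A minimal sketch of that chunked index, assuming each record is a list of 50 booleans (Python; the names build_index and candidates, and the list-of-indices values, are illustrative assumptions, not from the answer):
CHUNKS = 5
CHUNK_BITS = 10                 # 5 chunks * 10 bits = 50 positions

def chunk_value(record, c):
    """Pack bits c*10 .. c*10+9 of a record into an integer 0..1023."""
    value = 0
    for i in range(CHUNK_BITS):
        value = (value << 1) | int(record[c * CHUNK_BITS + i])
    return value

def build_index(records):
    """One hash table per chunk: chunk value -> list of record indices."""
    tables = [dict() for _ in range(CHUNKS)]
    for idx, rec in enumerate(records):
        for c in range(CHUNKS):
            tables[c].setdefault(chunk_value(rec, c), []).append(idx)
    return tables

def candidates(tables, false_positions):
    """Indices of records that may have FALSE at all the given positions.
    Only the chunk containing the most known bits is consulted, so the caller
    still has to check the constraints that fall in other chunks."""
    by_chunk = {}
    for pos in false_positions:
        by_chunk.setdefault(pos // CHUNK_BITS, []).append(pos % CHUNK_BITS)
    best = max(by_chunk, key=lambda c: len(by_chunk[c]))    # most known bits
    result = []
    for value, idxs in tables[best].items():
        # keep a list only if this chunk value has 0 at every constrained bit
        if all(((value >> (CHUNK_BITS - 1 - off)) & 1) == 0 for off in by_chunk[best]):
            result.extend(idxs)
    return result
For the question's example (FALSE in the 4th and 13th positions, i.e. 0-based positions 3 and 12), candidates(tables, [3, 12]) scans the lists of just one chunk's table; the full check of the remaining positions then runs only on those candidate records.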

Related

Algorithm to compare two arrays with user-defined criteria

I want to compare two float arrays' values, but the criterion is different from the usual ones. Here is how I define which array is the best.
Say we have two arrays named a and b. First, we compare the max value of each array, and the array with the smaller max value wins. If they have the same max value, we divide each array into two parts: the first part is a[1:max_loc(a)-1] and the second is a[max_loc(a)+1:len(a)], and b is split similarly. Then we apply the same criterion to a[1:max_loc(a)-1] and b[1:max_loc(b)-1] to see which has the smaller max value. If they have the same max value on these intervals, we divide them into smaller arrays and repeat the comparison. We do the same for a[max_loc(a)+1:len(a)] and b[max_loc(b)+1:len(b)]. As soon as one array has the smaller max value on corresponding intervals, the program ends and prints out the best array.
What is the algorithm to implement this comparison?
P.S. The two arrays may have different lengths.
Most of the time, what you are searching for is already somewhere on the Internet:
https://www.ics.uci.edu/~eppstein/161/960118.html
Here you get two examples with full explanations that follow the divide-and-conquer idea (MergeSort and QuickSort).
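For concreteness, a minimal recursive sketch of the comparison described in the question (Python; the handling of empty sub-arrays and the left-before-right tie-breaking order are my assumptions, since the question leaves them open):
def compare(a, b):
    """Return -1 if a wins (smaller max), 1 if b wins, 0 if tied."""
    if not a and not b:
        return 0                       # both segments empty: tie (assumption)
    if not a:
        return -1                      # assumption: an empty segment wins
    if not b:
        return 1
    max_a, max_b = max(a), max(b)
    if max_a != max_b:
        return -1 if max_a < max_b else 1
    ia, ib = a.index(max_a), b.index(max_b)     # first max location
    left = compare(a[:ia], b[:ib])              # compare left parts first
    if left != 0:
        return left
    return compare(a[ia + 1:], b[ib + 1:])      # right parts break the tie
For example, compare([1, 3, 2], [2, 3, 1]) returns -1: the maxima tie at 3, so the left parts [1] and [2] decide it in favour of the first array.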

Best algorithm to find N unique random numbers in VERY large array

I have an array with, for example, 1000000000000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique within the whole array, not just within the list of N (3 in my example) picked elements.
I read about reservoir sampling, but it only provides a way to pick random elements, which may not be unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
import random

# counts[i] holds how many times the value MININT + i occurs in Elements
counts = [0] * (MAXINT - MININT + 1)
for value in Elements:
    counts[value - MININT] += 1
# keep only the values that occur exactly once, then sample 3 of them
uniques = [MININT + i for i, c in enumerate(counts) if c == 1]
result = random.sample(uniques, 3)
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for an element that occurs exactly once among its kept candidates. If it has already been picked for another hash algorithm, find another. (This should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
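A sketch of the simplified one-hash-function variant from the parenthetical note (Python; the cutoff x, the salt, and the use of Python's built-in hash are illustrative assumptions):
import heapq

def sample_unique(elements, x=1000, n=3, salt=12345):
    """Keep the x values with the lowest salted hash codes plus occurrence
    counts, then return the n lowest-hashed values that occur exactly once."""
    heap = []       # max-heap via negated hashes: the x lowest-hashed values
    counts = {}     # candidate value -> occurrences seen so far

    for value in elements:                # single pass over the data
        if value in counts:
            counts[value] += 1
            continue
        hv = hash((salt, value))
        if len(heap) < x:
            heapq.heappush(heap, (-hv, value))
            counts[value] = 1
        elif -heap[0][0] > hv:            # beats the worst current candidate
            _, evicted = heapq.heapreplace(heap, (-hv, value))
            del counts[evicted]
            counts[value] = 1

    uniques = sorted((-neg, v) for neg, v in heap if counts[v] == 1)
    return [v for _, v in uniques[:n]]
The full three-hash version would run three such candidate sets in parallel during the same pass and resolve the rare collisions between them, as described in step 3.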

Best data structure to store lots of one-bit data

I want to store lots of data so that:
they can be accessed by an index,
each datum is just yes or no (so one bit is probably enough for each).
I am looking for the data structure with the highest performance that occupies the least space.
Probably storing the data in flat memory, one bit per datum, is not a good choice; on the other hand, the various tree structures still use lots of memory (pointers are required in each node to build the tree, even though each node holds just one bit of data).
Does anyone have any idea?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient)?
Well, in Java the BitSet might be a good choice: http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly, you should store them in an unsigned integer where you assign each value to a bit of the integer (a flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
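For example, a minimal sketch in Python (the flag names are arbitrary):
# Three yes/no values packed into one integer as flags.
FIRST, SECOND, THIRD = 1, 2, 4

flags = 0
flags |= FIRST                        # first value on
flags |= THIRD                        # third value on -> flags == 5

second_is_on = bool(flags & SECOND)   # False
third_is_on = bool(flags & THIRD)     # True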
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then return the result of arr[n / n_bits] & (1 << (n % n_bits))
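A minimal sketch of that packed indexing (Python; the BitArray class and its method names are illustrative, with a bytearray standing in for the flat block of memory):
class BitArray:
    """Packed bit storage: one bit per entry, eight entries per byte."""
    def __init__(self, n):
        self.data = bytearray((n + 7) // 8)    # flat block of memory

    def get(self, i):
        return (self.data[i // 8] >> (i % 8)) & 1

    def set(self, i, value):
        if value:
            self.data[i // 8] |= 1 << (i % 8)
        else:
            self.data[i // 8] &= ~(1 << (i % 8))

bits = BitArray(1_000_000)    # about 122 KB instead of at least one byte per entry
bits.set(42, True)
assert bits.get(42) == 1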
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-). Quoting the wiki page above:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
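A minimal Bloom filter sketch following the quoted description (Python; the choices of m and k, and salting Python's built-in hash to simulate k different hash functions, are my assumptions):
class BloomFilter:
    def __init__(self, m=8192, k=4):
        # m is assumed to be a multiple of 8 here to keep the sketch short
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)           # m bits, all set to 0

    def _positions(self, item):
        # simulate k "different" hash functions by salting one hash with 0..k-1
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)   # set the k bits to 1

    def __contains__(self, item):
        # any 0 bit -> definitely not in the set; all 1 -> probably in the set
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))
A lookup that returns False is always correct; a lookup that returns True may be a false positive, exactly as the fine print above warns.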

Most efficient way to count occurrences?

I'm looking to calculate entropy and mutual information a huge number of times in performance-critical code. As an intermediate step, I need to count the number of occurrences of each value. For example:
uint[] myArray = [1,1,2,1,4,5,2];
uint[] occurrences = countOccurrences(myArray);
// Occurrences == [3, 2, 1, 1] or some permutation of that.
// 3 occurrences of 1, 2 occurrences of 2, one each of 4 and 5.
Of course, the obvious ways to do this are either to use an associative array or to sort the input array using a "standard" sorting algorithm like quicksort. For small integers, like bytes, the code is currently specialized to use a plain old array.
Is there any clever algorithm to do this more efficiently than a hash table or a "standard" sorting algorithm will offer, such as an associative array implementation that heavily favors updates over insertions or a sorting algorithm that shines when your data has a lot of ties?
Note: Non-sparse integers are just one example of a possible data type. I'm looking to implement a reasonably generic solution here, though since integers and structs containing only integers are common cases, I'd be interested in solutions specific to these if they are extremely efficient.
Hashing is generally more scalable, as another answer indicates. However, for many possible distributions (and many real-life cases, where subarrays just happen to be often sorted, depending on how the overall array was put together), timsort is often "preternaturally good" (closer to O(N) than to O(N log N)) -- I hear it's probably going to become the standard/default sorting algorithm in Java at some reasonably close future date (it's been the standard sorting algorithm in Python for years now).
There's no really good way to address such problems except to benchmark on a selection of cases that are representative of the real-life workload you expect to be experiencing (with the obvious risk that you may choose a sample that actually happened to be biased/non-representative -- that's not a small risk if you're trying to build a library that will be used by many external users outside of your control).
Please tell more about your data.
How many items are there?
What is the expected ratio of unique items to total items?
What is the distribution of actual values of your integers? Are they usually small enough to use a simple counting array? Or are they clustered into reasonably narrow groups? Etc.
In any case, I suggest the following idea: a mergesort modified to count duplicates.
That is, you work in terms of pairs (number, frequency) rather than plain numbers (you might use some clever memory-efficient representation for that, for example two arrays instead of an array of pairs, etc.).
You start with [(x1,1), (x2,1), ...] and do a mergesort as usual, but when you merge two lists that start with the same value, you put that value into the output list with the sum of their occurrence counts. On your example:
[1:1,1:1,2:1,1:1,4:1,5:1,2:1]
Split into [1:1, 1:1, 2:1] and [1:1, 4:1, 5:1, 2:1]
Recursively process them; you get [1:2, 2:1] and [1:1, 2:1, 4:1, 5:1]
Merge them: (first / second / output)
[1:2, 2:1] / [1:1, 2:1, 4:1, 5:1] / [] - we add up 1:2 and 1:1 and get 1:3
[2:1] / [2:1, 4:1, 5:1] / [1:3] - we add up 2:1 and 2:1 and get 2:2
[] / [4:1, 5:1] / [1:3, 2:2]
[1:3, 2:2, 4:1, 5:1]
This might be improved greatly by using some clever tricks to do an initial reduction of the array (obtain an array of value:occurrence pairs that is much smaller than the original, but where the sum of 'occurrence' for each 'value' equals the number of occurrences of 'value' in the original array). For example, split the array into contiguous blocks where values differ by no more than 256 or 65536 and use a small array to count occurrences inside each block. Actually this trick can be applied at later merging phases, too.
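A sketch of the merge-and-count mergesort (Python; count_merge_sort is an illustrative name, and the plain list-of-pairs representation ignores the memory tricks mentioned above):
def count_merge_sort(values):
    """Mergesort on (value, count) pairs; equal values are merged by summing
    their counts, so the result lists each distinct value once, sorted."""
    pairs = [(v, 1) for v in values]

    def merge(left, right):
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            lv, lc = left[i]
            rv, rc = right[j]
            if lv == rv:
                out.append((lv, lc + rc)); i += 1; j += 1
            elif lv < rv:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    def sort(p):
        if len(p) <= 1:
            return p
        mid = len(p) // 2
        return merge(sort(p[:mid]), sort(p[mid:]))

    return sort(pairs)

# count_merge_sort([1, 1, 2, 1, 4, 5, 2]) -> [(1, 3), (2, 2), (4, 1), (5, 1)],
# matching the trace above.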
With an array of integers like in the example, the most efficient way is to have an array of ints and index it using your values (as you appear to be doing already).
If you can't do that, I can't think of a better alternative than a hashmap. You just need to have a fast hashing algorithm. You can't get better than O(n) performance if you want to use all your data. Is it an option to use only a portion of the data you have?
(Note that sorting and counting is asymptotically slower (O(n*log(n))) than using a hashmap based solution (O(n)).)

Generate sequence of integers in random order without constructing the whole list upfront [duplicate]

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)
This has been determined to be a duplicate of this question.
A very simple generator: 1 + ((power(r, x) - 1) mod p) will lie between 1 and p for values of x from 1 to p and will appear random, where r and p are prime numbers and r <> p.
Not the whole list technically, but you could use a bit mask to decide if a number has already been selected. This has a lot less storage than the number list itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruent methods to select a number from 1 to N.
if that number has already been used, find the next highest unused (0 bit), with wrap.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
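A sketch of that scheme (Python; random.randrange stands in for the linear congruential generator, and the helper names are illustrative):
import random

def random_order(n):
    """Yield 1..n, each exactly once, using an n-bit mask of used numbers."""
    used = bytearray((n + 7) // 8)        # all n bits start at 0

    def is_used(i):
        return (used[i // 8] >> (i % 8)) & 1

    def mark(i):
        used[i // 8] |= 1 << (i % 8)

    for _ in range(n):
        i = random.randrange(n)           # random candidate in 0..n-1
        while is_used(i):                 # next highest unused bit, with wrap
            i = (i + 1) % n
        mark(i)
        yield i + 1                       # report numbers as 1..n
As noted above, the wrap-to-next-unused rule skews the distribution slightly, so the result is "relatively random" rather than a uniform shuffle.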
It might help to specify the language you are looking for a solution in.
You could use a dynamic list in which you store the generated numbers, since you will need a record of which numbers you have already created. Every time you create a new number, check whether it is contained in the list; if it is, throw it away and try again.
The only possible way without such a list would be to use a number space so large that duplicates are unlikely, like a UUID, assuming the generator works correctly - but this doesn't guarantee that no duplicate is generated; it is just highly unlikely.
You will need at least half of the total list's memory, just to remember what you did already.
If you are in tough memory conditions, you may try this:
Keep the results generated so far in a tree: generate a random number and insert it into the tree. If you cannot insert it (it is already there), generate another number and try again, and so on, until the tree fills halfway.
When the tree fills halfway, you invert it: construct a tree holding the numbers that you haven't used yet, then pick them from it in random order.
It has some overhead for keeping the tree structure, but it may help when your pointers are considerably smaller in size than your data is.
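A sketch of the two-phase idea (Python; a built-in set stands in for the tree, and the halfway threshold follows the description above):
import random

def random_order_halfway(n):
    """Phase 1: rejection-sample fresh numbers into a 'used' set until half
    of 1..n has been emitted; phase 2: build the complement, shuffle, emit."""
    used = set()                           # stands in for the tree
    while len(used) < n // 2:
        x = random.randint(1, n)
        if x not in used:                  # insertion succeeds; otherwise retry
            used.add(x)
            yield x
    remaining = [x for x in range(1, n + 1) if x not in used]
    used = None                            # the first structure is no longer needed
    random.shuffle(remaining)
    for x in remaining:
        yield x
Peak memory is about n/2 entries, matching the "at least half of the total list's memory" observation above.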
