Eigen3 - accessing a (non contiguous) subset of vector elements - eigen

Suppose I have a VectorXf exampleVector with arbitrary float values and I want to select out some elements according to their values.
I can efficiently get a logical vector of true/false values according to my criterion
e.g. boolArray = exampleVector.array() < 1;
But now I want to make a new vector (of a smaller dimension) that contains only those elements that meet my criterion.
How can I do this efficiently in eigen3?
In R I could use reducedVector=exampleVector[boolArray]
Thanks in advance

Since VectorXf stores its values in a contiguous memory range, you will have to copy out the values that you want. I am sure R does it the same way, so you won't lose efficiency. However, there is no way that I know of to do it as conveniently as in R, so you will have to loop through and copy out the relevant values.
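A minimal sketch of the loop-and-copy approach described above. To keep it free of external dependencies it uses std::vector<float> as a stand-in for VectorXf and std::vector<bool> for the mask; both substitutions, and the name selectByMask, are assumptions made for illustration. The same loop carries over to Eigen by indexing the vector and writing into a pre-sized result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy out the elements of `values` whose mask entry is true.
// std::vector<float> stands in for Eigen's VectorXf (assumption).
std::vector<float> selectByMask(const std::vector<float>& values,
                                const std::vector<bool>& mask) {
    assert(values.size() == mask.size());  // one mask entry per element

    std::vector<float> selected;
    selected.reserve(values.size());  // upper bound, avoids reallocation
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (mask[i]) {
            selected.push_back(values[i]);
        }
    }
    return selected;
}
```

The result keeps the original order of the selected elements, which matches what exampleVector[boolArray] does in R.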

Related

Algorithm for selecting the most frequent object during factorization

I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just containing a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to take advantage of this.
My idea is to represent sets as stacks (i.e. singly linked lists), where the bottom parts can be shared across different sets. It can also be viewed as a tree in which each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows using the most common subsets of objects as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}
BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;
    trgObj = the most frequent object from sets;
    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;
    newSets = empty;
    for (each set in sets that contains the trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;
        else
            newSets.Insert(set);
    }
    BuildSubChains(sets, pParent);
    BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written recursively, but in practice naive recursion should not be used, because the splitting at each step is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of a loop and recursion, where the recursion is always invoked on the smaller part.
So the idea is to select the most common object each time and create a "subset" that inherits its parent subset; all the sets that include this object, as well as all the predecessors selected so far, should be based on this subset.
Now I'm trying to figure out an efficient way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and select only the sets that contain (or don't contain) it, derive the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, which costs O(N*M). But this also looks inferior: in a degenerate case, where each object exists in either almost all or almost none of the sets, we may need to repeat this O(N) times. OTOH, for those specific cases in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than only the count. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because back then it seemed to me that adjusting the histogram after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven, creating a new histogram for the smaller part is too expensive. But restricting it to only existing elements is a really good idea.
So we remove the sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarification about how to keep the big histogram sorted.
One idea, which I thought about first, was an immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort order is broken, swap the histogram element with its neighbor until the order is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we just do a "micro-bubble" sort.
However, when removing a large number of objects it seems better to just remove all the objects and then re-sort the array via quicksort.
So, do you have a better idea regarding this?
Update2:
I am thinking about the following: the histogram should be a binary search tree (self-balancing, of course), where each element of the tree contains the object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it currently contains, where each "object" has a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements currently affected, which is smaller than N if only a few are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of the sets with T. The solution I present works in time O(T log T log N).
For clarity, I use "set" for the initial sets and "family" for a collection of these sets.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements that are present in the sets at the moment, sorted by frequency. It may be something like a std::set of (frequency, value) pairs, maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element and one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram on the fly. Now you have a histogram for both families, built in time O(min(T', T'') log N), where the log N comes from the std::set operations.
At first glance it seems that this works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element directly participates in no more than log T removals. So there will be O(T log T) histogram operations in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element) you'll be able to efficiently select all sets that contain the most frequent element.
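The histogram described above could be sketched roughly as follows. This is only an illustration under some assumptions: object IDs are plain ints, the class name Histogram and its member names are made up, and the cross-reference from value to frequency is a std::map. Because std::set orders pairs ascending, the most frequent element is read from the back rather than the front; entries whose count drops to zero are erased, so the histogram only ever contains elements still present in the family.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

// Frequency histogram that never stores zero counts.
// Object IDs are ints here (assumption for illustration).
class Histogram {
public:
    // Count one more occurrence of `object`: O(log N).
    void add(int object) {
        int& count = counts_[object];  // 0 if the object is new
        if (count > 0) {
            ordered_.erase(std::make_pair(count, object));
        }
        ++count;
        ordered_.emplace(count, object);
    }

    // Remove one occurrence of `object`: O(log N).
    void remove(int object) {
        auto it = counts_.find(object);
        assert(it != counts_.end());  // object must be present
        ordered_.erase(std::make_pair(it->second, object));
        if (--it->second == 0) {
            counts_.erase(it);  // keep the histogram free of zeroes
        } else {
            ordered_.emplace(it->second, object);
        }
    }

    bool empty() const { return ordered_.empty(); }

    // Most frequent object; ties are broken by the larger ID.
    int mostFrequent() const {
        assert(!ordered_.empty());
        return ordered_.rbegin()->second;
    }

private:
    std::map<int, int> counts_;              // object -> frequency
    std::set<std::pair<int, int>> ordered_;  // (frequency, object), ascending
};
```

When a family is split, one would call remove for every element of every set in the smaller subfamily and add on a fresh Histogram for that subfamily, which is the "smaller half" trick from the analysis above.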

algorithm to accomplish comparing two arrays with user define criteria

I want to compare the values of two float arrays, but my criterion may be different from the usual ones. Here is how I define which array is the best.
Say we have two arrays named a and b. First, we compare the max values of the two arrays, and the array with the smaller max value wins. If they have the same max value, we divide each array into two parts around its maximum: a[1:max_loc(a)-1] and a[max_loc(a)+1:len(a)], and similarly for b. Then we apply the same criterion to a[1:max_loc(a)-1] and b[1:max_loc(b)-1] to see which has the smaller max value. If the max values on these intervals are also equal, we divide them into smaller arrays and repeat the comparison. We do the same for a[max_loc(a)+1:len(a)] and b[max_loc(b)+1:len(b)]. As soon as one array has a smaller max value on corresponding intervals, the program ends and prints the best array.
What's the algorithm to fulfill this comparison?
P.S. these two arrays may have different length.
Most of the time, what you are searching for is already somewhere on the Internet:
https://www.ics.uci.edu/~eppstein/161/960118.html
There you have two examples with full explanations that follow the divide-and-conquer idea (MergeSort and QuickSort).
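For what it's worth, the comparison described in the question maps fairly directly onto a recursive divide-and-conquer function. The sketch below is one possible reading, not a definitive answer: the names compareRanges and compareArrays are made up, an empty interval on either side is treated as a tie (the question does not say what should happen there), and when the maximum occurs more than once the first occurrence is used as the split point.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Iter = std::vector<float>::const_iterator;

// Returns <0 if the first range wins (smaller max), >0 if the second
// range wins, and 0 if the criterion cannot separate them.
int compareRanges(Iter aBegin, Iter aEnd, Iter bBegin, Iter bEnd) {
    assert(aBegin <= aEnd && bBegin <= bEnd);  // ranges must be well-formed

    // Assumption: an empty interval on either side counts as a tie.
    if (aBegin == aEnd || bBegin == bEnd) {
        return 0;
    }
    Iter aMax = std::max_element(aBegin, aEnd);  // first maximum
    Iter bMax = std::max_element(bBegin, bEnd);
    if (*aMax < *bMax) return -1;
    if (*bMax < *aMax) return 1;

    // Equal maxima: compare the parts left of the maxima first,
    // then the parts to the right.
    int left = compareRanges(aBegin, aMax, bBegin, bMax);
    if (left != 0) {
        return left;
    }
    return compareRanges(aMax + 1, aEnd, bMax + 1, bEnd);
}

int compareArrays(const std::vector<float>& a, const std::vector<float>& b) {
    return compareRanges(a.begin(), a.end(), b.begin(), b.end());
}
```

Each call scans its interval once, so in the worst case (e.g. sorted input with all maxima equal) this is quadratic with linear recursion depth; for balanced splits it behaves like the O(n log n) recursions in the linked notes.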

Is there any intersection find algorithm similar to union find when sets are not disjoint?

I want to find the intersection of sets containing integer values.
What is the most efficient way to do it if, say, you have 4-5 lists with 2k-4k integers?
In many languages, for example C++, sets are implemented as balanced binary trees, so you can directly evaluate a set intersection in O(N log M), where N is the size of the smaller set, by looking up each of its elements in the other set in O(log M).
Optimization:
Since you want this for multiple sets, you can apply the optimization used in Huffman coding:
Use a priority queue of sets that selects the smallest set first.
Select the two smallest sets, evaluate their intersection, and add it to the queue.
Repeat until you get an empty intersection or only one set (the intersection) remains.
Note: use std::set if using C++.
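A rough sketch of the smallest-first idea with std::set. It departs slightly from the priority-queue description: since an intersection is never larger than its smaller input, the running result is always the smallest set after the first step, so sorting the sets by size once and folding over them gives the same order of work. The function names intersectTwo and intersectAll are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Intersect by looking up each element of the smaller set in the larger
// one: O(|smaller| * log |larger|).
std::set<int> intersectTwo(const std::set<int>& smaller,
                           const std::set<int>& larger) {
    assert(smaller.size() <= larger.size());
    std::set<int> result;
    for (int value : smaller) {
        if (larger.count(value) != 0) {
            // Values arrive in ascending order, so end() is a valid hint.
            result.insert(result.end(), value);
        }
    }
    return result;
}

// Intersect all sets, starting from the smallest ones.
std::set<int> intersectAll(std::vector<std::set<int>> sets) {
    if (sets.empty()) {
        return {};
    }
    std::sort(sets.begin(), sets.end(),
              [](const std::set<int>& a, const std::set<int>& b) {
                  return a.size() < b.size();
              });
    std::set<int> result = sets.front();
    // Stop early once the intersection is empty.
    for (std::size_t i = 1; i < sets.size() && !result.empty(); ++i) {
        result = intersectTwo(result, sets[i]);
    }
    return result;
}
```

For sets that are already sorted, std::set_intersection from <algorithm> is an alternative that merges two ranges in linear time instead of doing lookups.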
If you have memory to spare:
Create a structure that holds the number of occurrences of each value.
For each integer I in each of your sets, increment the number of occurrences of I.
Extract the integers whose number of occurrences equals the number of sets.
This is theoretically in O(sum of all set cardinalities + retrieval),
where retrieval is proportional to either the range of your integers (if you're using a raw array) or the cardinality of the union of your sets (if you're using a hash table to enumerate the values for which an occurrence is defined).
If the bounds of your sets are known and small, you can implement it with a simple array of integers big enough to hold the max number of sets (typically an 8-bit char for up to 255 sets).
Otherwise you'll need some kind of hash table, which should still theoretically be in O(n).
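The counting approach might look like this with a hash table. It is a sketch under one important assumption: no list contains the same value twice, otherwise the counts are meaningless. The function name intersectByCounting is made up, and the final sort is only there to make the output order deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Values that occur in every list. Assumes each list is duplicate-free.
std::vector<int> intersectByCounting(
        const std::vector<std::vector<int>>& lists) {
    std::unordered_map<int, std::size_t> occurrences;
    for (const std::vector<int>& list : lists) {
        for (int value : list) {
            ++occurrences[value];
        }
    }

    std::vector<int> common;
    for (const auto& entry : occurrences) {
        // A count above the number of lists means some list had duplicates.
        assert(entry.second <= lists.size());
        if (entry.second == lists.size()) {
            common.push_back(entry.first);
        }
    }
    std::sort(common.begin(), common.end());  // deterministic output order
    return common;
}
```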

Best data structure to store lots of one-bit data

I want to store lots of data so that
they can be accessed by an index,
each datum is just yes or no (so one bit is probably enough for each).
I am looking for the data structure that has the highest performance and occupies the least space.
Storing the data in flat memory, one bit per datum, is probably not a good choice; on the other hand, the various tree structures still use lots of memory (e.g. pointers are required in each node to build the tree, even though each node holds just one bit of data).
Does anyone have any idea?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient) ?
Well in Java the BitSet might be a good choice http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then return the result of arr[n / n_bits] & (1 << (n % n_bits))
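The packed indexing described above (word index n / n_bits, bit index n % n_bits) could be wrapped up like this in C++. The class name PackedBits and its interface are made up for illustration; in real code std::vector<bool> or std::bitset already provide the same packing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Flat block of memory with one bit per entry, packed into 32-bit words.
class PackedBits {
public:
    explicit PackedBits(std::size_t bitCount)
        : size_(bitCount), words_((bitCount + kBits - 1) / kBits, 0u) {}

    bool get(std::size_t n) const {
        assert(n < size_);
        return ((words_[n / kBits] >> (n % kBits)) & 1u) != 0;
    }

    void set(std::size_t n, bool value) {
        assert(n < size_);
        const std::uint32_t mask = std::uint32_t{1} << (n % kBits);
        if (value) {
            words_[n / kBits] |= mask;
        } else {
            words_[n / kBits] &= ~mask;
        }
    }

    std::size_t size() const { return size_; }

private:
    static constexpr std::size_t kBits = 32;  // bits per storage word
    std::size_t size_;
    std::vector<std::uint32_t> words_;
};
```

Both get and set are O(1), and the storage is one bit per entry rounded up to a multiple of 32 bits.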
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-): Quote from the above wiki page.
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
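To make the quoted description concrete, here is a minimal Bloom filter sketch. It is illustrative only: the class name and interface are made up, keys are std::string, and the k positions are derived by double hashing from two std::hash values (the second one hashes the key with a '#' appended, which is an arbitrary choice, not something the quote prescribes). Note that a Bloom filter answers "is this key in the set?", so it only fits the question if the indices can be treated as keys and false positives are acceptable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    BloomFilter(std::size_t bitCount, std::size_t hashCount)
        : bits_(bitCount, false), hashCount_(hashCount) {
        assert(bitCount > 0 && hashCount > 0);
    }

    // Set the k bits that the key maps to.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < hashCount_; ++i) {
            bits_[position(key, i)] = true;
        }
    }

    // false: definitely not added; true: added, or a false positive.
    bool mightContain(const std::string& key) const {
        for (std::size_t i = 0; i < hashCount_; ++i) {
            if (!bits_[position(key, i)]) {
                return false;
            }
        }
        return true;
    }

private:
    // Double hashing: position_i = (h1 + i * h2) mod m.
    // The way h2 is derived here is an assumption for illustration.
    std::size_t position(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + '#') | 1u;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t hashCount_;
};
```

As the quote explains, there is no remove operation: clearing a bit could introduce false negatives for other keys.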

Data structure to represent piecewise continuous range?

Say that I have an integer-indexed array of length 400, and I want to drop out a few elements from the beginning, lots from the end, and something from the middle too, but without actually altering the original array. That is, instead of looping through the array using indices {0...399}, I want to use a piecewise continuous range such as
{3...15} ∪ {18...243} ∪ {250...301} ∪ {305...310}
What is a good data structure to describe this kind of index range? An obvious solution is to make another "index mediator" array, containing mappings from continuous zero-based indexing to the new coordinates above, but it feels quite wasteful, since almost all elements in it would be simply sequential numbers, with just a few occasional "jumps". Besides, what if I find that, oh, I want to modify the range a bit? The whole index array would have to be rebuilt. Not nice.
A few points to note:
The ranges never overlap. If a new range is added to the data structure, and it overlaps with existing ranges, the whole thing should get merged. That is, if I add to the above example the range {300... 308}, it should instead replace the last two ranges with {250...310}.
It should be quite cheap to simply loop through the whole range.
It should also be relatively cheap to query a value directly: "Give me the original index corresponding to the 42nd index in the mapped coordinates".
It should be possible (though maybe not quite cheap) to work the other way round: "Give me the mapped coordinate corresponding to 42 in the original coordinates, or tell me if it's not mapped at all."
Before rolling my own solution, I'd like to know if there exists a well-known data structure that solves this class of problems elegantly.
Thanks!
Seems like an array or list of integer pairs would be the best data structure. It's your choice whether the second integer of the pair is an end point or a count from the first integer.
Edit: On further reflection, this problem is exactly what a database index has to do. If the integer pairs don't have to be in numeric order, you can handle splits more easily. If the number sequence has to remain in order, you need a data structure that allows you to add integer pairs to the middle of the array or list.
A split would be having to change the (6, 12) integer pair to (6, 9) (11, 12), when 10 is removed, as an example.
Besides, what if I find that, oh, I want to modify the range a bit? The whole index array would have to be rebuilt. Not nice.
True. Perhaps one integer pair needs to change. Worst case, you'd have to rebuild the entire array or list.
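A sketch of the "list of integer pairs" idea, kept sorted so that both lookups from the question work. Everything here is illustrative: the class name RangeSet and its methods are made up, the second integer of each pair is an inclusive end point, only overlapping ranges are merged (adjacent ones such as {3...15} and {16...17} are left separate), and -1 is used as the "not mapped" answer. Lookups are linear in the number of ranges, which is usually tiny compared to the number of indices.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class RangeSet {
public:
    struct Range { int first; int last; };  // inclusive end points

    // Add [first, last], merging it with every range it overlaps.
    void add(int first, int last) {
        assert(first <= last);
        std::vector<Range> merged;
        bool placed = false;
        for (const Range& r : ranges_) {
            if (r.last < first) {
                merged.push_back(r);  // entirely before the new range
            } else if (last < r.first) {
                if (!placed) {
                    merged.push_back(Range{first, last});
                    placed = true;
                }
                merged.push_back(r);  // entirely after the new range
            } else {
                first = std::min(first, r.first);  // overlap: absorb r
                last = std::max(last, r.last);
            }
        }
        if (!placed) {
            merged.push_back(Range{first, last});
        }
        ranges_ = merged;
    }

    // Mapped (zero-based, gap-free) index -> original index, or -1.
    int toOriginal(int mapped) const {
        if (mapped < 0) return -1;
        for (const Range& r : ranges_) {
            const int length = r.last - r.first + 1;
            if (mapped < length) return r.first + mapped;
            mapped -= length;
        }
        return -1;
    }

    // Original index -> mapped index, or -1 if it is not mapped.
    int toMapped(int original) const {
        int offset = 0;
        for (const Range& r : ranges_) {
            if (original < r.first) return -1;
            if (original <= r.last) return offset + (original - r.first);
            offset += r.last - r.first + 1;
        }
        return -1;
    }

    const std::vector<Range>& ranges() const { return ranges_; }

private:
    std::vector<Range> ranges_;  // sorted, non-overlapping
};
```

With the ranges from the question, mapped index 42 corresponds to original index 47, and original index 42 sits at mapped index 37. If the number of ranges grows large, storing a running offset per range and using binary search would bring both lookups down to O(log R).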
