Imagine an "item" structure (represented as a JSON hash)
{
id: 1,
value: 5
}
Now imagine I have a set of 100,000 items, and I need to perform calculations on the value associated with each. At the end of the calculation, I update each item with the new value.
To do this quickly, I have been using GSL vector libraries, loading each value as an element of the vector.
For example, the items:
{ id: 1, value: 5 }
{ id: 2, value: 6 }
{ id: 3, value: 7 }
Becomes:
GSL::Vector[5, 6, 7]
Element 1 corresponds to item id 1, element 2 corresponds to item id 2, etc. I then proceed to perform element-wise calculations on each element in the vector, multiplying, dividing etc.
While this works, it bothers me that I have to depend on the list of items being sorted by ID.
Is there another structure that acts like a hash (allowing me to say with certainty a particular result value corresponds to a particular item), but allows me to do fast, memory efficient element-wise operations like a vector?
I'm using Ruby and the GSL bindings, but willing to re-write this in another language if necessary.
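For concreteness, here is a minimal sketch of that workflow in Python/NumPy rather than Ruby/GSL (the names are mine): the values are loaded into a vector for fast element-wise math, with a parallel array of ids kept in the same order, so each result can be written back to the item it came from regardless of how the items were sorted.

import numpy as np

items = [{"id": 1, "value": 5}, {"id": 2, "value": 6}, {"id": 3, "value": 7}]

ids = np.array([it["id"] for it in items])                     # parallel array of ids
values = np.array([it["value"] for it in items], dtype=float)  # the vector of values

values = values * 2.0 + 1.0                                    # fast element-wise calculation

for it, v in zip(items, values):                               # write each result back to its item
    it["value"] = float(v)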
Related
I have p items (let's assume p=5, items={0,1,2,3,4}). I need to be able to iterate over them in a random order, but without repeating any (unless all have been visited), while maintaining only a small, seed-like piece of metadata between iterations. The generator is otherwise stateless. It would be used like this:
Initialization (metadata is long in this example, but it could be anything "small"):
long metadata = randomLong()
Usage:
(metadata, result) = generator.generate(metadata)
return(result)
If it works properly, it should continuously return something like 3, 1, 0, 4, 2, 3, 1, 0, 4, 2, 3...
Is that possible?
I know I could easily pre-generate the sequence; then the metadata would contain this whole sequence plus an index, but that's not viable for me, as the sequence will have thousands of items and the metadata must be slim.
I also found this, which resembles what I am trying to achieve, but it's either too brief or too math-y for me.
Added: I am aware that for p=1000 there are 1000! ways of ordering the sequence, which would definitely not fit into a long, but both "metadata somewhat bigger than a long" and "the generator being unable to produce some sequences" are OK for me.
As a base, I would use the Fisher-Yates algorithm.
It can construct a random permutation of a given ordered list of elements in O(n).
The trick could then be to construct an iterator that shuffles an internal list of elements and iterates through it; when this internal iteration ends, it shuffles again and iterates over the result...
Something like:
function next() -> element {
    internal data:
        i: an integer;
        d: an array of elements;
    code:
        if i equals d.length { shuffle(d); i <-- 0; }
        return d[i++];
}
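A minimal runnable version of that pseudocode in Python (random.shuffle performs a Fisher-Yates shuffle; the class name is mine):

import random

class ReshufflingIterator:
    """Cycle through the elements in random order; once every element has been
    returned, reshuffle and start a new cycle, so nothing repeats mid-cycle."""

    def __init__(self, elements):
        self.d = list(elements)        # internal array of elements
        self.i = len(self.d)           # forces a shuffle on the first call

    def next(self):
        if self.i == len(self.d):      # cycle finished: reshuffle and restart
            random.shuffle(self.d)     # Fisher-Yates, O(n)
            self.i = 0
        value = self.d[self.i]
        self.i += 1
        return value

gen = ReshufflingIterator([0, 1, 2, 3, 4])
print([gen.next() for _ in range(10)])   # two full passes, each a fresh permutation of 0..4

Note that, unlike the sequence shown in the question, each pass here is a fresh permutation rather than a repeat of the previous one, and the iterator keeps the whole array as state rather than a small seed.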
I have an array and a map. The array contains a list of numbers; the map contains (integer key, boolean value) pairs telling us which items have been removed from the list.
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
And a map telling us which items have been removed:
{ 3: true, 4: true, 5: true, 6: true, 7: true, 8: true, 9: true }
The items remain in the array, but are not counted toward the index when looking up an item in the array by index.
For example, given the above, index 2 would return 10:
[1, 2, x, x, x, x, x, x, x, 10]
-0--1------------------------2-
We can loop through each item in the list to see if it is in the map, but in the worst case the complexity would be O(n), and if there were one billion items in the list this would be a problem. Is there a better way to determine the correct index? I have thought of using a batch-type binary tree, where each node would hold a sequential range (if a sequential range exists) or a single number, but even then, if every other item was removed, the worst case would be O(n) since there would be no sequential ranges.
You can use a binary tree with an extra property: each node holds the number of removed items in its subtree. Update that value when you insert or remove an item.
To find the index of an item, find it in the tree, and for every right turn in the search add the number of removed items in the left subtree.
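A minimal sketch of that idea in Python, assuming the map keys are the removed item values as in the example above. Instead of the answerer's exact structure it uses a segment tree over array positions, in which each node stores how many not-removed items its range contains; finding the k-th remaining item is then one O(log n) walk down the tree. Class and method names are mine.

class RemainingIndex:
    def __init__(self, items, removed):
        self.n = len(items)
        self.items = items
        # tree[v] = number of remaining (not removed) items in the range covered by node v
        self.tree = [0] * (4 * self.n)
        self._build(1, 0, self.n - 1, removed)

    def _build(self, v, lo, hi, removed):
        if lo == hi:
            self.tree[v] = 0 if removed.get(self.items[lo], False) else 1
            return
        mid = (lo + hi) // 2
        self._build(2 * v, lo, mid, removed)
        self._build(2 * v + 1, mid + 1, hi, removed)
        self.tree[v] = self.tree[2 * v] + self.tree[2 * v + 1]

    def lookup(self, k):
        """Return the k-th (0-based) item that has not been removed."""
        if k >= self.tree[1]:
            raise IndexError(k)
        v, lo, hi = 1, 0, self.n - 1
        while lo != hi:
            mid = (lo + hi) // 2
            if k < self.tree[2 * v]:           # answer lies in the left half
                v, hi = 2 * v, mid
            else:                              # skip the left half, adjust k
                k -= self.tree[2 * v]
                v, lo = 2 * v + 1, mid + 1
        return self.items[lo]

items = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
removed = {3: True, 4: True, 5: True, 6: True, 7: True, 8: True, 9: True}
print(RemainingIndex(items, removed).lookup(2))   # -> 10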
Okay, so I have a huge array of unsorted elements of an unknown data type (all elements are of the same type, obviously; I just can't make assumptions, as they could be numbers, strings, or any type of object that overloads the < and > operators). The only assumption I can make about those objects is that no two of them are the same, and comparing them (A < B) tells me which one should come first if the array were sorted. The "smallest" should be first.
I receive this unsorted array (type std::vector, but honestly it's more of an algorithm question so no language in particular is expected), a number of objects per "group" (groupSize), and the group number that the sender wants (groupNumber).
I'm supposed to return an array containing groupSize elements, or fewer if the group requested is the last one. (Examples: 17 results with a groupSize of 5 would only return two of them if you ask for the fourth group. Also, the fourth group is group number 3, because groups are zero-indexed.)
Example:
Received Array: {1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20}
Received groupSize: 3
Received groupNumber: 2
If the array was sorted, it would be: {-14, -1, 1, 2, 5, 6, 6.5, 8, 19, 20}
If it was split in groups of size 3: {{-14, -1, 1}, {2, 5, 6}, {6.5, 8, 19}, {20}}
I have to return the third group (groupNumber 2, since groups are zero-indexed): {6.5, 8, 19}
The biggest problem is the fact that it needs to be lightning fast. I can't sort the array because it has to be faster than O(n log n).
I've tried several methods, but can never get under O(n log n).
I'm aware that I should be looking for a solution that doesn't fill in all the other groups and skips a large part of the steps shown in the example above, creating only the requested group before returning it, but I can't figure out a way to do that.
You can find the value of the smallest element s in the group in linear time using the standard C++ std::nth_element function (because you know its index in the sorted array). You can find the largest element S in the group in the same way. After that, you need a linear pass to find all elements x such that s <= x <= S and return them. The total time complexity is O(n).
Note: this answer is not C++ specific. You just need an implementation of the k-th order statistics in linear time.
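A minimal sketch of the same idea in Python (the answer uses C++ std::nth_element; here a simple randomized quickselect stands in as the expected-linear-time k-th order statistic, and the function names are mine):

import random

def quickselect(a, k):
    """Return the k-th smallest element (0-based) of a, in expected O(n) time."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                          # Hoare-style partition around pivot
            while a[i] < pivot:
                i += 1
            while pivot < a[j]:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if k <= j:                             # k-th element is in the left part
            hi = j
        elif k >= i:                           # ... or in the right part
            lo = i
        else:                                  # ... or sits between the two parts
            return a[k]

def group_slice(items, group_size, group_number):
    first = group_number * group_size
    if first >= len(items):
        return []
    last = min(first + group_size - 1, len(items) - 1)
    s = quickselect(items, first)              # smallest element of the group
    S = quickselect(items, last)               # largest element of the group
    group = [x for x in items if s <= x <= S]  # linear pass
    group.sort()                               # cheap: the group itself is small
    return group

print(group_slice([1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20], 3, 2))   # -> [6.5, 8, 19]

Sorting only the returned group costs O(groupSize log groupSize), which is negligible next to the O(n) passes.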
Assume we have an array of objects of length N (all objects have the same set of fields).
And we have a second array of length N containing values of one of the objects' fields (e.g. an array of numbers representing IDs).
Now we want to sort the array of objects by that field, in the same order as the second array.
For example, here are 2 arrays (as in description) and expected result:
A = [ {id: 1, color: "red"}, {id: 2, color: "green"}, {id: 3, color: "blue"} ]
B = [ "green", "blue", "red"]
sortByColorByExample(A, B) ==
[ {id: 2, color: "green"}, {id: 3, color: "blue"}, {id: 1, color: "red"} ]
How to efficiently implement this 'sort-by-example' function? I can't come up with anything better than O(N^2).
This is assuming you have a bijection from the elements of B to the elements of A.
Build a map (say M) from B's elements to their positions (O(N)).
For each element of A (O(N)), access the map to find where to put it in the sorted array (O(log N) with an efficient implementation of the map).
Total complexity: O(N log N) time and O(N) space.
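A minimal sketch of those steps in Python, assuming the values in B are unique (the bijection mentioned above); with a hash-based dict the per-lookup cost is expected O(1), whereas the O(log N) quoted above assumes a tree-based map. The names sort_by_example and key_of are mine.

def sort_by_example(A, B, key_of):
    pos = {b: i for i, b in enumerate(B)}      # step 1: element of B -> target position, O(N)
    out = [None] * len(A)
    for a in A:                                # step 2: place each element of A directly
        out[pos[key_of(a)]] = a
    return out

A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
B = ['green', 'blue', 'red']
print(sort_by_example(A, B, lambda a: a['color']))
# -> [{'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}, {'id': 1, 'color': 'red'}]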
Suppose we are sorting on an item's colour. Then create a dictionary d that maps each colour to a list of the items in A that have that colour. Then iterate across the colours in the list B, and for each colour c output (and remove) a value from the list d[c]. This runs in O(n) time with O(n) extra space for the dictionary.
Note that you have to decide what to do if A cannot be sorted according to the examples in B: do you raise an error? Choose the order that maximizes the number of matches? Or what?
Anyway, here's a quick implementation in Python:
from collections import defaultdict

def sorted_by_example(A, B, key):
    """Return a list consisting of the elements from the sequence A in the
    order given by the sequence B. The function key takes an element
    of A and returns the value that is used to match elements from B.
    If A cannot be sorted by example, raise IndexError.
    """
    d = defaultdict(list)
    for a in A:
        d[key(a)].append(a)
    return [d[b].pop() for b in B]
>>> A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
>>> B = ['green', 'blue', 'red']
>>> from operator import itemgetter
>>> sorted_by_example(A, B, itemgetter('color'))
[{'color': 'green', 'id': 2}, {'color': 'blue', 'id': 3}, {'color': 'red', 'id': 1}]
Note that this approach handles the case where there are multiple identical values in the sequence B, for example:
>>> A = 'proper copper coffee pot'.split()
>>> B = 'ccpp'
>>> ' '.join(sorted_by_example(A, B, itemgetter(0)))
'coffee copper pot proper'
Here, when there are multiple identical values in B, we get the corresponding elements of A in reverse order, but this is just an artefact of the implementation: by using a collections.deque instead of a list (and popleft instead of pop), we could arrange to get the corresponding elements of A in their original order, if that were preferred.
Make an array of arrays, call it C, of size B.length.
Loop through A. If an item has the color 'green', put it in C[0]; if it has the color 'blue', put it in C[1]; if it has the color 'red', put it in C[2].
When you're done, go through C and flatten it out into your original structure.
Wouldn't something along the lines of a merge sort be better? Create B.length arrays, one for each element of B, go through A and place each item in the appropriate smaller array, and when it's all done merge the arrays together. It should be around O(2n).
Iterate through the first array and build a HashMap from the key field to the list of objects with that value: O(n) [assuming there can be duplicate values of the key field].
For example, the key "green" will map to all objects whose field value is green.
Now iterate through the second array, get the list of objects from the HashMap, and store them in another array: O(k), where k is the number of distinct values of the field.
The total running time is O(n), but it requires some additional memory for the map and an auxiliary array.
In the end you will get the array sorted as per your requirements.
Given n integer id's, I wish to link all possible sets of up to k id's to a constant value. What I'm looking for is a way to translate sets (e.g. {1, 5}, {1, 3, 5} and {1, 2, 3, 4, 5, 6, 7}) to unique values.
Guarantees:
n < 100 and k < 10 (again: set sizes will range in [1, k]).
The order of id's doesn't matter: {1, 5} == {5, 1}.
All combinations are possible, but some may be excluded.
All sets and values are constant and made only once. No deletes or inserts, no value updates.
Once generated, the only operations taking place will be look-ups.
Look-ups will be frequent and one-directional (given set, look up value).
There is no need to sort (or otherwise organize) the values.
Additionally, it would be nice (but not obligatory) if "neighboring" sets (drop one id, add one id, swap one id, etc) are easy to reach, as well as "all sets that include at least this set".
Any ideas?
Enumerate using the product of primes.
a -> 2
b -> 3
c -> 5
d -> 7
et cetera
Now hash(ab) := 6, and hash(abc) := 30.
And a nice side effect is that, if "ab" is a subset of "abc", then:
hash(abc) % hash(ab) == 0
and
hash(abc) / hash(ab) == hash(c)
The bad news: you might run into overflow. The 100th prime is 541, and 64 bits cannot accommodate 541**10. This will not affect the functioning as a hash function; only the subset trick will fail to work. (See also: the same method applied to anagrams.)
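A minimal sketch of the prime-product hash in Python, where big integers sidestep the overflow issue just mentioned (at the cost of larger keys); the id-to-prime table and the names are mine.

def first_primes(count):
    """The first `count` primes by trial division (fine for count < 100)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

prime_for = {i: p for i, p in enumerate(first_primes(100), start=1)}   # one prime per id

def set_key(ids):
    """Order-independent key: the product of the primes assigned to the ids."""
    key = 1
    for i in ids:
        key *= prime_for[i]
    return key

value_of = {set_key({1, 5}): "foo", set_key({1, 3, 5}): "bar"}   # the lookup table
print(value_of[set_key({5, 1})])                                 # -> foo ({1, 5} == {5, 1})
print(set_key({1, 3, 5}) % set_key({1, 5}) == 0)                 # -> True: subset test via divisibility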
The other option is Zobrist hashing. It is equivalent to the primes method, but instead of primes you use a fixed set of (random) numbers, and instead of multiplying you use XOR.
For a fixed, small set like yours (it needs well under ~70 bits), it might be possible to tune the Zobrist tables to totally avoid collisions (yielding a perfect hash).
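A minimal Zobrist-style sketch in Python (the fixed table of random 64-bit numbers and the names are mine; with untuned random numbers collisions are merely improbable, not impossible):

import random

rng = random.Random(42)                                    # fixed seed -> a stable table
zobrist = {i: rng.getrandbits(64) for i in range(1, 100)}  # one random number per id (n < 100)

def zobrist_key(ids):
    key = 0
    for i in ids:
        key ^= zobrist[i]          # XOR is order-independent
    return key

print(zobrist_key({1, 5}) == zobrist_key({5, 1}))                       # -> True
# "Neighboring" sets are one XOR away: adding or dropping an id just XORs its number.
print((zobrist_key({1, 5}) ^ zobrist[3]) == zobrist_key({1, 3, 5}))     # -> True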
And the final (and simplest) way is to use a (100-bit) bitmap and treat that as the hash value (maybe after taking it modulo the table size).
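A minimal sketch of the bitmap option in Python, assuming ids fall in [1, 100): each set becomes an integer with one bit per id, used directly as a dictionary key (Python hashes it for you). bitmap_key is my name.

def bitmap_key(ids):
    key = 0
    for i in ids:
        key |= 1 << i                          # one bit per id
    return key

table = {bitmap_key({1, 5}): "foo", bitmap_key({1, 2, 3, 4, 5, 6, 7}): "bar"}
print(table[bitmap_key({5, 1})])               # -> foo (order-independent)
# "All sets that include at least this set" is a bitwise superset test:
small, big = bitmap_key({1, 5}), bitmap_key({1, 2, 3, 4, 5, 6, 7})
print((big & small) == small)                  # -> True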
And a totally unrelated method is to just build a decision tree on the bits of the bitmap (the tree would have a maximal depth of k), or a related kD-tree on bit values.
Maybe not the best solution, but you can do the following:
Sort the set from lowest to highest with a simple IntegerComparator.
Add each item of the set to a String.
So if you have {2,5,9,4}: first step -> {2,4,5,9}; second -> "2459".
This way you will get a unique String from a unique set. If you really need to map them to an integer value, you can hash the string after that.
A second way I can think of is to store them in a Java Set and simply use it as a key in a HashMap.
Calculate a 'diff' from each set {1, 6, 87, 89} = {1,5,81,2,0,0,...}
{1,2,3,4} = { 1,1,1,1,0,0,0,0... };
Then binary encode each number with a variable length encoding and concatenate the bits.
It's hard to compare the sets (except for the first few equal bits), but because there can't be many large gaps in a set, all possible values just might fit into 64 bits (with at least 16 bits of slack...).
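A minimal sketch of that encoding in Python, using sorted deltas and a simple 7-bits-per-byte varint (byte-aligned rather than bit-packed, for simplicity); encode_set is my name.

def encode_set(ids):
    out = bytearray()
    prev = 0
    for i in sorted(ids):                      # sort so the encoding is order-independent
        delta = i - prev                       # store the gap, not the id itself
        prev = i
        while True:                            # LEB128-style varint: 7 bits per byte
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)        # set the continuation bit
            else:
                out.append(byte)
                break
    return bytes(out)

print(encode_set({1, 6, 87, 89}))                                # -> b'\x01\x05Q\x02' (deltas 1, 5, 81, 2)
print(encode_set({1, 6, 87, 89}) == encode_set({89, 1, 6, 87}))  # -> True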