A set with compound value and fast search for value elements - data-structures

I'm looking for a data structure similar to a set that stores compound values consisting of 4 integers: i1, i2, i3, i4. It should have fast lookup, but it should also allow fast deletion of all members with a particular i3 or i4. So delete_a(x) should delete all the members with i3 = x, and delete_b(x) should delete all the members with i4 = x. The most critical operation is member lookup, so I'd like it to be O(1) if possible. The values of i1, i2, i3, and i4 are rather large, so I cannot use a 4-dimensional array because it would take too much memory. I thought that maybe some combination of a hash table and auxiliary lists could solve this problem.

If you want absolute speed while maintaining some memory efficiency, I would do this:
HashMap_a:
Key: i3
Value: List of Hash(i1,i2,i3,i4)
HashMap_b:
Key: i4
Value: List of Hash(i1,i2,i3,i4)
HashMap:
Key: Hash(i1,i2,i3,i4)
Value: (i1,i2,i3,i4)
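delete_a is simply a matter of getting the list from HashMap_a and removing from HashMap the elements whose keys are in the list. Likewise for delete_b. Note that after delete_a the lists in HashMap_b may still mention removed elements; either drop them from those lists as well, or ignore keys that are no longer in HashMap.
Look-ups are O(1) from HashMap.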
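A minimal sketch of this layout, assuming Python and using the 4-tuple itself as the key rather than a separately computed Hash(i1,i2,i3,i4), so that hash collisions are handled by the built-in set. The class name QuadSet and the attributes members, by_i3 and by_i4 are illustrative names, not part of the answer above.

```python
from collections import defaultdict


class QuadSet:
    """Set of (i1, i2, i3, i4) tuples with bulk deletion by i3 or i4."""

    def __init__(self):
        self.members = set()           # plays the role of HashMap
        self.by_i3 = defaultdict(set)  # plays the role of HashMap_a
        self.by_i4 = defaultdict(set)  # plays the role of HashMap_b

    def add(self, item):
        self.members.add(item)
        self.by_i3[item[2]].add(item)
        self.by_i4[item[3]].add(item)

    def __contains__(self, item):
        return item in self.members    # O(1) on average

    def delete_a(self, x):
        """Remove every member whose i3 equals x."""
        for item in self.by_i3.pop(x, ()):
            self.members.discard(item)
            self._drop(self.by_i4, item[3], item)

    def delete_b(self, x):
        """Remove every member whose i4 equals x."""
        for item in self.by_i4.pop(x, ()):
            self.members.discard(item)
            self._drop(self.by_i3, item[2], item)

    @staticmethod
    def _drop(index, key, item):
        # Keep the other index free of stale entries.
        bucket = index.get(key)
        if bucket is not None:
            bucket.discard(item)
            if not bucket:
                del index[key]
```

Each delete costs time proportional to the number of members removed, and membership tests stay O(1) on average.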

Related

Which data structure supports given operations efficiently

I need to think of a data structure, which supports the following operations efficiently:
1) Add an integer x
2) Delete an integer with maximum frequency (if there is more than one element with the same maximum frequency, delete all of them).
I am thinking of implementing a segment tree where each node stores the index of its child having largest frequency.
Any ideas or suggestions on how to approach this problem or how should it be implemented would be kindly appreciated.
We can use a combination of data structures: a hash_map to maintain the frequency mappings, where the key is the integer and the value is a pointer to a "frequency" node that holds the frequency value and the set of integers having that frequency. The frequency nodes will be maintained in a list ordered by the values of the frequencies.
The Frequency node can be defined as
class Freq {
int frequency;
Set<Integer> values_with_frequency;
Freq prev;
Freq next;
}
The elements HashMap would then contain entries of the form
Entry<Integer, Freq>
So, for a snapshot of the dataset such as
a,b,c,b,d,d,a,e,a,f,b where the letters denote integers, the data structure would look as follows.
c --+
e --+--> (1, [c, e, f])
f --+
d -----> (2, [d])
a --+
b --+--> (3, [a, b])
The Freq nodes would be maintained in a linked list, say freq_nodes, sorted by the frequency value. Note that, as explained below, there wouldn't be any log(n) operation needed for keeping the list sorted on the add/delete operations.
The way the add(x), and delete_max_freq() operations could be implemented is as follows
add(x) :
If x is not found in the elements map, check whether the first node of freq_nodes is the Freq object with frequency 1. If so, add x to that object's values_with_frequency set. Otherwise, create a new Freq object with frequency 1 whose values_with_frequency set contains only x, and insert it at the head of the list.
Otherwise (i.e. if x is already in the elements map), follow the pointer stored in x's entry in elements to its Freq object in freq_nodes, and note x's current frequency, elements.get(x).frequency (call it F). Remove x from that object's values_with_frequency set. If the next Freq node in freq_nodes has frequency F+1, add x to that node's values_with_frequency set; otherwise create a new Freq node with frequency F+1 containing only x and insert it right after the current node. Finally, if the removal left the current node's values_with_frequency set empty, delete that node from freq_nodes.
Finally, add the entry (x, Freq) to the elements map.
Note that this whole add(x) operation is going to be O(1) in time.
Here's an example of a sequence of add() operations with the subsequent state of the data structure.
add(a)
a -> N1 : freq_nodes : |N1 (1, {a}) | ( N1 is actually a Freq object)
add(b)
a -> N1 : freq_nodes : |N1 (1, {a, b}) |
b -> N1
add(a)
At this point ‘a’ points to N1, but its frequency is now 2, so we remove ‘a’ from N1’s values_with_frequency set {a,b} and insert a new node N2 right after N1 in the doubly linked list (DLL)
a -> N2 : freq_nodes : |N1 (1, {b}) | --> |N2 (2, {a}) |
b -> N1
The interesting thing to note here is that any time we increase the frequency of an existing element from F to say F+1, we need to do the following
if (next node has a higher frequency than F+1 or we have reached the end of the list):
    create a new Freq node with frequency equal to F+1 (as is done above)
    and insert it next to the current node
else :
    add ‘a’ (the input to the add() operation) to the values_with_frequency set of the next node
The delete_max_freq() operation would just involve removing the last entry of the linked list freq_nodes, and iterating over the keys in the wrapped set values_with_frequency to remove the corresponding keys from the elements map. This operation would take O(k) time where k is the number of elements with maximum frequency.
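The scheme described above can be sketched as follows, assuming Python. The two sentinel nodes (a head with frequency 0 and a tail with infinite frequency) and the names FreqNode and MaxFreqBag are conveniences introduced here to avoid special cases at the ends of the list; they are not part of the original description.

```python
class FreqNode:
    def __init__(self, frequency):
        self.frequency = frequency
        self.values = set()   # values_with_frequency
        self.prev = None
        self.next = None


class MaxFreqBag:
    def __init__(self):
        self.elements = {}                   # value -> its FreqNode
        self.head = FreqNode(0)              # sentinel: "frequency 0"
        self.tail = FreqNode(float("inf"))   # sentinel after the maximum
        self.head.next = self.tail
        self.tail.prev = self.head

    def _insert_after(self, node, new):
        new.prev, new.next = node, node.next
        node.next.prev = new
        node.next = new

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def add(self, x):
        """O(1): move x from its node (frequency F) to the F+1 node."""
        cur = self.elements.get(x, self.head)
        target = cur.frequency + 1
        if cur.next.frequency != target:
            self._insert_after(cur, FreqNode(target))
        nxt = cur.next
        nxt.values.add(x)
        self.elements[x] = nxt
        if cur is not self.head:
            cur.values.discard(x)
            if not cur.values:
                self._unlink(cur)

    def delete_max_freq(self):
        """O(k): remove and return all values with the maximum frequency."""
        node = self.tail.prev
        if node is self.head:
            return set()
        self._unlink(node)
        for x in node.values:
            del self.elements[x]
        return node.values
```

Feeding it the sequence a,b,c,b,d,d,a,e,a,f,b from the example and calling delete_max_freq() repeatedly returns {a, b}, then {d}, then {c, e, f}.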
Assuming "efficient" refers to the way the complexity of those operations scale, big-O style, I'd consider something consisting of:
a hashmap with the integers as keys and their frequencies as values
a tree structure (e.g. a binary search tree) in which each node holds a frequency and a hashset of the numbers that have that frequency.
When a number is inserted:
1. Look up the number in the hashmap to find its frequency. (O(1))
2. Look up the frequency in the tree (O(log N)). Remove the number from its collection (O(1)). If the collection is empty, remove the frequency from the tree (O(log N)).
3. Increment the number's frequency. Set that value in the hashmap (O(1)).
4. Look up its new frequency in the tree (O(log N)). If it's there, add the number to the collection there (O(1)). If not, add a new node with the number in its collection (O(log N)).
When deleting items with the maximum frequency:
1. Remove the highest-valued node from the tree (O(log N)).
2. For each number in that node's collection, remove that number's entry from the hashmap (O(1) for each number removed).
If you have N numbers to add and remove, your worst-case scenario should be O(N log N) regardless of the actual distribution of frequencies or the order in which numbers are added and removed.
If you know of any assumptions you can make about the numbers being added, it's possible you could make further enhancements like using an indexed array rather than an ordered tree. But if your inputs are fairly unbounded, this seems like a pretty good structure to handle all the operations you want without getting into O(n²) territory.
My thoughts:
You will need 2 maps.
Map 1: Integer as key with frequency as value.
Map 2: Have a map of frequencies as keys and list of integers as values.
Add Integer: Add the integer to map 1. Get the frequency. Add it to the list of frequency key in map 2.
Delete Integer : We can obviously maintain maximum frequency in a variable across these operations. Now, remove the key from map2 which has this max frequency and decrement max frequency.
So, adding and deleting performance should be O(1) on average.
In the above scenario, integers removed by a delete still have entries in map 1, holding a stale frequency. We fix this on demand: when such an integer is added again, if it is not present in map 2 under the frequency recorded in map 1, it must have been deleted, so we reset its frequency to 1.
Implementation:
import java.util.*;

class Foo {
    Map<Integer, Integer> map1;       // integer -> last recorded frequency (may be stale after delete())
    Map<Integer, Set<Integer>> map2;  // frequency -> integers currently at that frequency
    int max_freq;

    Foo() {
        map1 = new HashMap<>();
        map2 = new HashMap<>();
        max_freq = 0;
    }

    public void add(int x) {
        int curr_f = map1.getOrDefault(x, 0);
        Set<Integer> curr_set = map2.get(curr_f);
        if (curr_set != null && curr_set.remove(x)) {
            curr_f++;   // x is live: it was in its frequency set, so move it up by one
        } else {
            curr_f = 1; // x is new, or its map1 entry is stale because x was deleted
        }
        map1.put(x, curr_f);
        map2.computeIfAbsent(curr_f, k -> new HashSet<>()).add(x); // add to current frequency set
        max_freq = Math.max(max_freq, curr_f);
        printState();
    }

    public List<Integer> delete() {
        Set<Integer> top = map2.remove(max_freq);
        List<Integer> ls = (top == null) ? new ArrayList<>() : new ArrayList<>(top);
        // lower max_freq past frequencies that no longer hold any integer
        while (max_freq > 0 && (!map2.containsKey(max_freq) || map2.get(max_freq).isEmpty())) max_freq--;
        printState();
        return ls;
    }

    public void printState() {
        System.out.println(map1.toString());
        System.out.println("Maximum frequency: " + max_freq);
        for (Map.Entry<Integer, Set<Integer>> m : map2.entrySet()) {
            System.out.println(m.getKey() + " " + m.getValue().toString());
        }
        System.out.println("----------------------------------------------------");
    }
}
Demo: https://ideone.com/tETHKV
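Note: The cost of delete() is amortized: the loop that lowers max_freq can run at most as many times in total as add() has raised it, and copying out the deleted set costs O(k) for the k integers removed.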

Data Structure with fast access to nth of elements satisfying condition

I'm filling a stack/vector (a dynamically sized container with fast random access by index with insertion only at the end) with composite data (a struct, class, tuple…). For a specific attribute with a small set of possible values, I will want to access the nth of all elements in the stack where this attribute satisfies a condition. To achieve this, additional information can be stored along each composite or in a separate data structure.
Note that the vector is large and that the compared attribute has a small value range but is compared to a set of allowed values. Also the attributes aren't distributed evenly throughout composites in the vector.
Pseudocode of an O(n) naïve approach. How can I improve this?
enum Fruit { apple, orange, banana, potato };
struct c {
Fruit a;
Data d;
}
// Let's assume v has a length of many thousand and that the distribution of fruits is *not* completely random e.g. that maybe potato only rarely occurs or that bananas tend to come in packs
c getFruit(vector<c> v, set<Fruit> s, int n) {
int counter=0;
// iterate over all of v's indices
for(int i=0 ; i<v.length; i+=1) {
if(v[i].a in s) {
if(n==counter) {
return v[i];
}
counter+=1;
}
}
}
// note: The attribute is compared to a set (arbitrary combination of fruits)!
getFruit(largeVector, set{apple, orange, potato}, 15234)
Another approach would be to create a vector for each possible set of fruits which would be super fast O(1) but not so memory efficient.
(Although I do have to implement this now, I'm really just asking out of curiosity because my data is small enough to just go with the naïve approach.)
Any argument why there doesn't seem to be a more efficient way is very much appreciated as well.
Edit: It should be noted that new elements may be appended between queries for indices using the algorithm in question so any caches have to grow with the vector and both growing the vector and this filtered access should be fast.
For each index of the vector, store the preceding number of each fruit.
Then you can do a binary search to find the first index where the sum of the desired fruit counts is sufficient.
If you don't want to use that much memory, then store the counts in separate arrays, and only store them for every 16th index or so of the main array. Your binary search will then get you an index within 16 positions of the desired answer, and you can do a linear scan from there.
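A minimal sketch of the full (unsampled) variant, assuming Python. The class name FruitIndex and its methods are illustrative; it stores one prefix-count array per fruit and binary-searches for the smallest prefix that contains n + 1 matches.

```python
class FruitIndex:
    """Append-only list with one prefix-count array per category."""

    def __init__(self, categories):
        self.items = []
        # counts[c][i] = number of items of category c among items[0:i]
        self.counts = {c: [0] for c in categories}

    def append(self, category, data):
        self.items.append((category, data))
        for c, prefix in self.counts.items():
            prefix.append(prefix[-1] + (1 if c == category else 0))

    def nth(self, allowed, n):
        """Return the n-th (0-based) item whose category is in `allowed`."""
        def matches_before(i):
            return sum(self.counts[c][i] for c in allowed)

        if n < 0 or matches_before(len(self.items)) <= n:
            raise IndexError("fewer than n + 1 matching items")
        # Find the smallest i with matches_before(i) > n; items[i - 1] is the answer.
        lo, hi = 1, len(self.items)
        while lo < hi:
            mid = (lo + hi) // 2
            if matches_before(mid) > n:
                hi = mid
            else:
                lo = mid + 1
        return self.items[lo - 1]
```

A query costs O(|allowed| log n) and an append costs O(number of categories), so the structure keeps growing cheaply between queries.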

How to augment a skip list such that we can extract max value of a specific segment of the skiplist efficiently? [Skiplist not sorted by value]

I have a problem I'm struggling with.
I have a skiplist with elements:
element = (date,value)
The dates are the keys of the skiplist, and hence the skiplist is sorted by date.
How can I augment the skiplist such that the function
Max(d1,d2) -> returns largest value between dates d1 and d2
is as efficient as possible?
The values are integers.
The most efficient way is to iterate over each item from d1 to d2 and select the maximum item. Because the skip list is ordered by date, you cannot assume anything about the order of values: they might as well be randomly ordered. So you'll have to look at each one.
So it's O(log n) (on average: this is a skip list, after all) to find d1, and then it's O(range) to find the maximum element, where range is the number of items between d1 and d2, inclusive.
How you'd implement this is to add a function to the skip list that will allow you to iterate the list starting at an arbitrary element. You almost certainly already have a function that will iterate over the entire list in order, so all you have to do is create a function that will iterate over a range of keys (i.e. from a start key to an end key).
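The range scan can be sketched as follows, assuming Python and using a sorted list with the standard bisect module as a stand-in for the skip list (both give an O(log n) search for d1 followed by an in-order walk). The function name range_max is illustrative.

```python
from bisect import bisect_left


def range_max(entries, d1, d2):
    """entries: list of (date, value) tuples sorted by date.

    Returns the largest value with d1 <= date <= d2, or None if the
    range is empty.
    """
    # (d1,) sorts before every (d1, value), so this finds the first date >= d1.
    i = bisect_left(entries, (d1,))
    best = None
    while i < len(entries) and entries[i][0] <= d2:
        value = entries[i][1]
        if best is None or value > best:
            best = value
        i += 1
    return best
```

Dates can be any mutually comparable keys here, for example integers or datetime.date objects.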

Check if a value belongs to a hash

I'm not sure if this is actually possible, so I'm asking here. Does anyone know of an algorithm that would allow something like this?
const values = ['a', 'b', 'c', 'd'];
const hash = createHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('b'); // => true
hash.includes('v'); // => false
This snippet first creates some sort of hash from a list of values, then checks whether a certain value belongs to that hash.
Hash functions in general
The primary idea of hash functions is to reduce the space; that is, the functions are not injective because they map from a bigger domain to a smaller one.
So they produce collisions. That is, there are different elements x and y that get mapped to the same hash value:
h(x) = h(y)
So basically you lose information about the given argument x.
However, in order to answer the question whether all values are contained you would need to keep all information (or at least all non-duplicates). This is obviously not possible for nearly all practical hash-functions.
One possible hash function would be the identity function:
h(x) = x for all x
but this doesn't reduce the space, so it is not practical.
A natural idea would be to compute hash values of the individual elements and then concatenate them, like
h(a, b, c) = (h(a), h(b), h(c))
But this again doesn't reduce the space: the hash values are as long as the message, so it is not practical.
Another possibility is to drop all duplicates, so given values [a, b, c, a, b] we only keep [a, b, c]. But in most examples this reduces the space only marginally, so again it is not practical.
But no matter what you do, you cannot reduce the input below the set of non-duplicates. Otherwise you wouldn't be able to answer the question for some values. For example, if we use [a, b, c, a] but only keep [a, b], we are unable to answer "was c contained" correctly.
Perfect hash functions
However, there is the field of perfect hash functions (Wikipedia). Those are hash-functions that are injective, they don't produce collisions.
In some areas they are of interest.
For those you may be able to answer that question, for example if computing the inverse is easy.
Cryptographic hash functions
If you talk about cryptographic hash functions, the answer is no.
Those need to have three properties (Wikipedia):
Pre-image resistance - Given h it should be difficult to find m : hash(m) = h
Second pre-image resistance - Given m it should be difficult to find m' : hash(m) = hash(m')
Collision resistance - It should be difficult to find (m, m') : hash(m) = hash(m')
Informally you have especially:
A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.
If you had a hash value that supported such membership queries, you could easily reconstruct the original values by asking which values are contained. With that you could construct collisions on purpose, and so on.
Details would however depend on the specific hash algorithm.
For a toy-example let's use the previous algorithm that simply removes all duplicates:
[a, b, c, a, a] -> [a, b, c]
In that case we find messages like
[a, b, c]
[a, b, c, a]
[a, a, b, b, c]
...
that all map to the same hash value.
If the hash function produces collisions (as almost all hash functions do), this is not possible.
Think about it this way: if, for example, h('abc') = x and h('abd') = x, how can you decide based on x whether the original string contains 'd'?
You could arguably decide to use the identity as a hash function, which would do the job.
A trivial solution would be simple hash concatenation:
func createHash(values) {
    var hash = "";      // start from an empty string
    foreach (v in values)
        hash += MD5(v); // append the digest of each value
    return hash;
}
Can it be done with fixed length hash and variable input? I'd bet it's impossible.
In case of string hash (such as used in HashMaps), because it is additive, I think we can match partially (prefix match but not suffix).
const values = ['a', 'b', 'c', 'd'];
const hash = createStringHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('a'); // => true
hash.includes('a', 'b'); // => true
hash.includes('a', 'b', 'v'); // => false
Bit arrays
If you don't care what the resulting hash looks like, I'd recommend just using a bit array.
Take the range of all possible values
Map this to the range of integers starting from 0
Let each bit in our hash indicate whether or not this value appears in the input
This will require 1 bit for every possible value (which could be a lot of bits for large ranges).
Note: this representation is optimal in terms of the number of bits used, assuming there's no limit on the number of elements you can have (beyond 1 of each possible value) - if it were possible to use any fewer bits, you'd have an algorithm that's capable of providing guaranteed compression of any data, which is impossible by the pigeonhole principle.
For example:
If your range is a-z, you can map this to 0-25, then [a,d,g,h] would map to:
10010011000000000000000000 = 38535168 = 0x24c0000
(abcdefghijklmnopqrstuvwxyz)
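A minimal sketch of the a-z example, assuming Python. The names create_hash and includes mirror the question's pseudocode, and ALPHABET and POSITION are helpers introduced here; the first letter of the alphabet is mapped to the most significant bit, as in the example above.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
POSITION = {ch: i for i, ch in enumerate(ALPHABET)}


def create_hash(values):
    """Pack membership of each letter into one integer, 'a' being the top bit."""
    bits = 0
    for v in values:
        bits |= 1 << (len(ALPHABET) - 1 - POSITION[v])
    return bits


def includes(bits, value):
    """True if the bit for `value` is set; values outside the range are never included."""
    if value not in POSITION:
        return False
    return bool((bits >> (len(ALPHABET) - 1 - POSITION[value])) & 1)


h = create_hash(["a", "d", "g", "h"])
print(h, hex(h))  # 38535168 0x24c0000
```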
More random-looking hashes
If you care what the hash looks like, you could take the output from the above and perform a perfect hash on it to map it either to the same length hash or a longer hash.
One trivial example of such a map would be to increment the resulting hash by a randomly chosen but deterministic value (i.e. it's the same for every hash we convert) - you can also do this for each byte (with wrap-around) if you want (e.g. byte0 = (byte0+5)%256, byte1 = (byte1+18)%256).
To determine whether an element appears, the simplest approach would be to reverse the above operation (subtract instead of add) and then just check if the corresponding bit is set. Depending on what you did, it might also be possible to only convert a single byte.
Bloom filters
If you don't mind false positives, I might recommend just using a bloom filter instead.
In short, this sets multiple bits for each value, and then checks each of those bits to check whether a value is in our collection. But the bits that are set for one value can overlap with the bits for other values, which allows us to significantly reduce the number of bits required at the cost of a few false positives (assuming the total number of elements isn't too large).
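A minimal Bloom filter sketch, assuming Python. The sizes (1024 bits, 3 hash functions) and the way bit positions are derived from seeded SHA-256 digests are arbitrary choices made for illustration, not tuned parameters.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def __contains__(self, value):
        # May return True for a value that was never added (false positive),
        # but never returns False for a value that was added.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


bf = BloomFilter()
for v in ["a", "b", "c", "d"]:
    bf.add(v)
print("b" in bf)  # True
```

A lookup of a value that was never added usually returns False, but it can return True when all of its bit positions happen to be set by other values.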

Find number of common elements in given vectors from an array of vectors

There is a given array of vectors of different sizes and the total number of elements in all vectors won't exceed 10^4. Each vector contains at least 1 and at most 10^4 unique integers, each integer being in the range 1 to 10^4.
There will be 10^5 queries where each query asks to find the number of common integers in some given vectors (at most 4).
For example:
4 vectors:
1 2 5
3 5 6
1 3 6
6 7
1 Query:
2 3 (vectors indexed 2 and 3)
Ans:
2 (2 common integers {3,6})
I am unable to come up with an efficient solution for this problem. What algorithm / data structure will be most suitable for this problem? Any references would be very helpful.
EDIT: No integer will occur in more than 4 vectors
If your vectors are sorted, you can scan them in parallel. Take the largest of the vectors' current elements as the candidate X (no common element can be smaller than it), and advance every vector until its current element is at least X. If all current elements are then equal, you have found a common element: count it and move past it. Otherwise repeat with the next candidate.
Let v1, ..., v4 denote the four chosen vectors.
Let i1 = i2 = i3 = i4 = 0 and count = 0
While (i1 < v1.length and i2 < v2.length and i3 < v3.length and i4 < v4.length)
    Let X = max(v1[i1], v2[i2], v3[i3], v4[i4])
    Increase i1, i2, i3, i4 until v1[i1] >= X, v2[i2] >= X, v3[i3] >= X, v4[i4] >= X (stop if an index runs past the end of its vector)
    If v1[i1] = v2[i2] = v3[i3] = v4[i4]
        count++
        i1++
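The parallel scan can be sketched as follows, assuming Python and generalising from four vectors to any number of sorted vectors. The function name count_common_sorted is illustrative.

```python
def count_common_sorted(vectors):
    """Count integers present in every vector.

    vectors: non-empty list of ascending lists of unique integers.
    """
    idx = [0] * len(vectors)
    count = 0
    while all(i < len(v) for i, v in zip(idx, vectors)):
        # No common element can be smaller than the largest current element.
        target = max(v[i] for i, v in zip(idx, vectors))
        for k, v in enumerate(vectors):
            while idx[k] < len(v) and v[idx[k]] < target:
                idx[k] += 1
        if all(i < len(v) and v[i] == target for i, v in zip(idx, vectors)):
            count += 1
            idx = [i + 1 for i in idx]
    return count


print(count_common_sorted([[3, 5, 6], [1, 3, 6]]))  # 2
```

Every index only moves forward, so a query costs time linear in the total length of the chosen vectors.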
Store each vector in a set data structure that allows you to do a lookup in O(1) and iterate over all the elements in O(N).
For each query, just iterate over the elements of one of the vectors (ideally the smallest) and, for each of those elements, check whether it exists in each of the other vectors.
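A minimal sketch of the set-based approach, assuming Python and 0-based vector indices in the query (the example in the question uses 1-based indices). The function name count_common is illustrative.

```python
def count_common(vector_sets, query):
    """vector_sets: list of sets; query: indices of the vectors to intersect."""
    chosen = sorted((vector_sets[i] for i in query), key=len)
    smallest, rest = chosen[0], chosen[1:]
    # Iterate over the smallest set and probe the others in O(1) each.
    return sum(1 for x in smallest if all(x in s for s in rest))


sets = [{1, 2, 5}, {3, 5, 6}, {1, 3, 6}, {6, 7}]
print(count_common(sets, [1, 2]))  # 2
```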
