I've got a stream with more than 20 million values, each of which comes with its corresponding key (more than 10 million keys). Each key is linked to one or more values (at most 50,000). Example:
... (key1, val1), (key2,val2), (key1, val3), (key2, val4), (key1, val6), (key3,val5)...
I store this stream as follows:
key1 : val1, val3, val6
key2 : val2, val4
key3 : val5
Each time I receive a new value in the stream, I first check whether this value appears in the list of its corresponding key:
If it's not there, I add the value at the end of the list.
If the value is already in the last place of the list, then I do nothing.
Finally, if the value is already in the list, but not in the last place, I raise a flag.
My question is: what's the most efficient data structure or tool to perform this process? (I want to raise the flag as fast as possible.) I thought of a hash table associated with linked lists (as in the example above), but scanning the whole linked list each time I add a value does not sound right. Recall that I do need this notion of the LAST value.
Thank you
Checking if the new value is in the list is not optimal - it takes O(n) time to check.
You can use a hashtable instead. You can store the last value separately and update it on insert.
So you have a hashtable, where the values are pairs. Each pair consists of a hashtable (used as a set) and an element (the last element in the set).
Your example looks like this:
(key1 -> (val6, (val1->1, val3->1, val6->1)))
(key2 -> (val4, (val2->1, val4->1)))
(key3 -> (val5, (val5->1)))
You can optimize the cases when the set only contains one element, by not storing the last value explicitly.
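Here is a minimal Python sketch of that structure (function and variable names are mine): each key maps to a pair of the last value and a set of everything seen so far, so both checks run in O(1) on average.

def process(store, key, value):
    """store maps key -> [last, seen]; returns True when the flag should fire."""
    entry = store.get(key)
    if entry is None:
        store[key] = [value, {value}]    # first value seen for this key
        return False
    last, seen = entry
    if value == last:                    # already in the last place: do nothing
        return False
    if value in seen:                    # seen before, but not last: raise the flag
        return True
    entry[0] = value                     # genuinely new: append and update "last"
    seen.add(value)
    return False

store = {}
for k, v in [("key1", "val1"), ("key1", "val3"), ("key1", "val1")]:
    if process(store, k, v):
        print("flag:", k, v)             # fires on the second ("key1", "val1")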
We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve hash table conflicts (that is, to put or get an item whose hash value collides with another one's), you fall back on the data structure that backs each bucket, which is generally a linked list. A collision is the worst case for a hash table: you end up with an O(n) scan to reach the correct item in the linked list. That is exactly the loop you describe, searching for the item with the matching key. However, if the bucket is backed by a structure that supports faster search, like a balanced tree, lookup can take O(log n) time, as in the Java 8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
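A rough Python illustration of that idea (my own sketch, not the JDK code): a bucket keeps a plain list of (hash, key, value) entries and, past a threshold, switches to a representation with O(log n) search. A list kept sorted by hash and searched with bisect stands in for the balanced tree here.

import bisect

TREEIFY_THRESHOLD = 8    # in the spirit of the JDK constant; the exact value is illustrative

class Bucket:
    def __init__(self):
        self.entries = []        # (hash, key, value) triples
        self.treeified = False   # sorted list + bisect stands in for the tree

    def put(self, h, key, value):
        self.entries.append((h, key, value))   # duplicate keys not handled in this sketch
        if len(self.entries) > TREEIFY_THRESHOLD:
            self.treeified = True
        if self.treeified:
            # a real balanced tree inserts in O(log n); re-sorting keeps the sketch short
            self.entries.sort(key=lambda e: e[0])

    def get(self, h, key):
        if not self.treeified:                 # plain chain: O(n) scan
            for eh, k, v in self.entries:
                if eh == h and k == key:
                    return v
            return None
        i = bisect.bisect_left(self.entries, h, key=lambda e: e[0])  # Python 3.10+
        while i < len(self.entries) and self.entries[i][0] == h:
            if self.entries[i][1] == key:      # equal hashes still require key comparison
                return self.entries[i][2]
            i += 1
        return None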
I strongly suggest always looking at an existing implementation. To name one, you could look at the Java 7 implementation. That will improve your code-reading skills, which is something you do almost more often than writing code. I know it is more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that the check if ((e.hash == hash) && e.key.equals(key)) is what finds the correct item with the matching key.
And here is the full source code: Hashtable.java
I have a hashtable. For instance, I have two entities like
john = { 1stname: john, 2ndname: johnson },
eric = { 1stname: eric, 2ndname: ericson }
Then I put them in hashtable:
ht["john"] = john;
ht["eric"] = eric;
Let's imagine there is a collision and the hashtable uses chaining to fix it. As a result there should be a linked list containing these two entities.
How does the hashtable understand which entity should be returned for a key? The hash values are the same, and it knows nothing about the entities' structure. For instance, if I write var val = ht["john"];, how does the hashtable (having only the key value and its hash) find out that the value should be the john record and not eric?
I think what you are confused about is what is stored at each location in the hashtable's bucket lists. It seems like you assume that only the value is stored. In fact, the data in each list node is a tuple (key, value).
Once you ask for ht['john'], the hashtable finds the list associated with hash('john'), and if the list is not empty, it searches for the key 'john' in that list. If the key is found as the first element of a tuple, then the value (the second element of the tuple) is returned. If the key is not found, then the element is not in the hashtable.
To summarize, the key's hash is used to quickly identify the cell in which the element should be stored if present. Actual key equality is then tested to decide whether the key exists or not.
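A minimal sketch of that lookup in Python (the bucket count and helper name are arbitrary):

def ht_get(buckets, key):
    # buckets is a list of lists; each bucket holds (key, value) tuples
    index = hash(key) % len(buckets)    # the hash only picks the bucket
    for k, v in buckets[index]:         # walk the chain
        if k == key:                    # actual key equality decides the match
            return v
    raise KeyError(key)

buckets = [[] for _ in range(5)]
for key, record in [("john", {"1stname": "john", "2ndname": "johnson"}),
                    ("eric", {"1stname": "eric", "2ndname": "ericson"})]:
    buckets[hash(key) % len(buckets)].append((key, record))
print(ht_get(buckets, "john"))          # the john record, even on a collision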
Is this what you are asking for? I have already put this in the comments, but it seems to me you did not follow the link.
Collision Resolution in the Hashtable Class
Recall that when inserting an item into or retrieving an item from a hash table, a collision can occur. When inserting an item, an open slot must be found. When retrieving an item, the actual item must be found if it is not in the expected location. Earlier we briefly examined two collision resolution strategies:
Linear probing
Quadratic probing
The Hashtable class uses a different technique referred to as rehashing. (Some sources refer to rehashing as double hashing.)
Rehashing works as follows: there is a set of different hash functions, H1 ... Hn, and when inserting or retrieving an item from the hash table, initially the H1 hash function is used. If this leads to a collision, H2 is tried instead, and onwards up to Hn if needed. The previous section showed only one hash function, which is the initial hash function (H1). The other hash functions are very similar to this function, differing only by a multiplicative factor. In general, the hash function Hk is defined as:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
Mathematical Note: With rehashing it is important that each slot in the hash table is visited exactly once when hashsize probes are made. That is, for a given key you don't want Hi and Hj to hash to the same slot in the hash table. With the rehashing formula used by the Hashtable class, this property is maintained if the result of (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize are relatively prime. (Two numbers are relatively prime if they share no common factors.) These two numbers are guaranteed to be relatively prime if hashsize is a prime number.
Rehashing provides better collision avoidance than either linear or quadratic probing.
sources here
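A small Python sketch of the probe sequence defined by the Hk formula above (get_hash stands in for the class's GetHash; hashsize is assumed to be prime):

def probe_sequence(key, hashsize, get_hash=hash):
    # yields the slots H1(key), H2(key), ..., Hhashsize(key)
    h = get_hash(key) & 0x7FFFFFFF                  # non-negative base hash
    step = 1 + (((h >> 5) + 1) % (hashsize - 1))    # per-key increment
    for k in range(1, hashsize + 1):
        yield (h + k * step) % hashsize             # Hk(key)

# With a prime hashsize, every slot is visited exactly once:
print(sorted(probe_sequence("john", 7)) == list(range(7)))   # True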
I have a dictionary inside a dictionary and I wish to print the whole dictionary, but sorted by a value in the sub-dictionary.
Lesson = {Name:{'Rating':Rating, 'Desc':Desc, 'TimeLeftTask':Timeleft}}
or
Lesson = {'Math':{'Rating':11, 'Desc':'Exercises 14 and 19 page 157', 'TimeLeftTask':7}, 'English':{'Rating':23, 'Desc':'Exercise 5 page 204', 'TimeLeftTask':2}}
I want to print this dict for example but sorted by 'Rating' (high numbers at the top)
I have read this post but I don't fully understand it.
If you could keep it simple it would be great.
And yes, I'm making a program to sort and deal with my homework.
Thanks in advance
def sort_by_subdict(dictionary, subdict_key):
    # reverse=True puts the highest ratings at the top, as requested
    return sorted(dictionary.items(), key=lambda k_v: k_v[1][subdict_key], reverse=True)
Lesson = {'Math':{'Rating':11, 'Desc':'Exercises 14 and 19 page 157', 'TimeLeftTask':7}, 'English':{'Rating':23, 'Desc':'Exercise 5 page 204', 'TimeLeftTask':2}}
print(sort_by_subdict(Lesson, 'Rating'))
As there is no notion of order in a dictionary, we need to represent the dictionary as a list of (key, value) tuples to preserve the sorted order.
The SO question you mention sorts the dictionary using the sorted function so that it returns a list of (k, v) tuples (where k means key and v means value) of the top-level dictionary, sorted by the desired value of the sub-dictionary v.
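For example, printing one line per lesson (expected output shown as comments):

for name, info in sort_by_subdict(Lesson, 'Rating'):
    print(name, info['Rating'], info['Desc'], sep=' - ')
# English - 23 - Exercise 5 page 204
# Math - 11 - Exercises 14 and 19 page 157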
I'm a beginner in writing map-reduces and I'm not sure about some reduce function properties.
So, reduce gets (key, list of values) as an input parameter...
is it guaranteed that the list of input values always contains at least 2 members? So a unique key emitted by the mapper would never be passed to the reducer?
or, if there is just one item in the input list, is it guaranteed that the key is unique?
can reduce emit more values than the input values list size?
I have a large list of strings. I need to find all of those which are not unique. Can I do it with just one map/reduce? The only way I see is to count all the unique strings with one map/reduce and then select those which are not unique with another map/reduce.
Thanks
The list of input values to the reduce() method may have one or more members, but never zero.
All of the values mapped to a unique key value are passed as a list to the reducer along with the key value. If that list contains one member, then you can assume that the key value was mapped to only one value (or occurred once, if you're counting).
Your reducer can write any number, including zero, of key value pairs for a given input key and list of values. The types of the input key/values may be different from the types of the output key/value pairs.
You can solve your problem with one map/reduce step.
So, for the problem with the strings, pseudocode:
map(string s) {
    emit(s, 0);
}
reduce(string key, list values) {
    if (values.size() > 1) { emit(key, 1); return; }
    if (values.contains(1)) { emit(key, 1); return; }
}
right?
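For reference, a runnable Python simulation of the same one-pass idea, grouping values by key the way the framework would (counting 1s instead of the 0/1 markers above):

from collections import defaultdict

def find_duplicates(strings):
    groups = defaultdict(list)
    for s in strings:                      # map phase: emit (s, 1)
        groups[s].append(1)
    duplicates = []
    for key, values in groups.items():     # reduce phase: one call per unique key
        if len(values) > 1:                # more than one value: not unique
            duplicates.append(key)
    return duplicates

print(find_duplicates(["a", "b", "a", "c", "b", "a"]))   # ['a', 'b']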
Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here's what I exactly need to do:
I have two sets of data:
point information, which is stored as (tile_number, point_id:point_info); these are 1:n key-value pairs, meaning that for every tile_number there might be several point_id:point_info pairs
Line information, which is stored as (tile_number, line_id:line_info); these are again 1:m key-value pairs, and for every tile_number there might be more than one line_id:line_info pair
As you can see, the tile_numbers are the same between the two datasets. Now what I really need is to join these two datasets based on each tile_number. In other words, for every tile_number we have n point_id:point_info pairs and m line_id:line_info pairs. What I want to do is to join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number.
In order to clarify, here's an example:
For point pairs:
(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)
for line pairs:
(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)
what I want is as following:
for tile 0:
(tile0, point0:line0)
(tile0, point0:line1)
(tile0, point1:line0)
(tile0, point1:line1)
for tile 1:
(tile1, point1:line2)
(tile1, point1:line3)
(tile1, point2:line2)
(tile1, point2:line3)
Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values. For instance, you can use a special character (even though a binary approach would be much better).
So the map output will be something like:
tile0, _point0
tile1, _point0
tile2, _point1
...
tileX, *lineL
tileY, *lineK
...
Then, at the reducer, your input will have this structure:
tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
and you will have to take the values, separate the points from the lines, do a cross product, and output each pair of the cross product, like this:
tileX (lineK, pointP)
tileX (lineK, pointR)
...
If you can already easily differentiate between the point values and the line values (depending on your application specifications), you don't need the special characters (*, _).
Regarding the cross-product which you have to do in the reducer:
You first iterate through the entire values list and separate them into 2 lists:
List<String> points;
List<String> lines;
Then do the cross-product using 2 nested for loops.
Then iterate through the resulting list and for each element output:
tile(current key), element_of_the_resulting_cross_product_list
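A compact Python sketch of that reducer (tag characters as in the answer above; the function name is mine):

def join_reduce(tile, values):
    # values are tagged strings: '_' marks points, '*' marks lines
    points = [v[1:] for v in values if v.startswith('_')]
    lines = [v[1:] for v in values if v.startswith('*')]
    for p in points:                # cross product via two nested loops
        for ln in lines:
            yield tile, p + ':' + ln

for pair in join_reduce("tile0", ["_point0", "_point1", "*line0", "*line1"]):
    print(pair)                     # ('tile0', 'point0:line0'), ..., ('tile0', 'point1:line1')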
So basically you have two options here: reduce-side join or map-side join.
Here your group key is "tile". In a single reducer you are going to get all the output from the point pairs and the line pairs, but you will have to cache either the point pairs or the line pairs in an array. If either side (points or lines) is so large that it cannot fit in the temporary array in memory for a single group key (each unique tile), then this method will not work for you. Remember that you don't have to hold both sides in memory for a single group key ("tile"); one will be sufficient.
If both sides for a single group key are large, then you will have to try a map-side join. It has some peculiar requirements, but you can fulfill them by pre-processing your data through map/reduce jobs that run an equal number of reducers for both datasets.