I created an unordered_multimap and inserted the following elements:
typedef std::unordered_multimap<std::string,std::string> stringMap;
stringMap mymap;
mymap.insert( {
{"house","maison"},
{"apple","pomme"},
{"tree","arbre"},
{"book","livre"},
{"door","porte"},
{"grapefruit","pamplemousse"}
} );
When I checked the buckets, I found that there were 7 buckets.
This is what I read:
The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket.
But when I printed the hash code of the keys, I found that there are elements with different hash codes present in the same bucket.
#include <iostream>
#include <string>
#include <unordered_map>
int main ()
{
typedef std::unordered_multimap<std::string,std::string> stringMap;
stringMap mymap;
mymap.insert( {
{"house","maison"},
{"apple","pomme"},
{"tree","arbre"},
{"book","livre"},
{"door","porte"},
{"grapefruit","pamplemousse"},
} );
unsigned n = mymap.bucket_count();
unsigned s = mymap.size();
std::cout << "mymap has " << n << " buckets.\n";
std::cout << "mymap size " << s << " keys.\n";
stringMap::hasher fn = mymap.hash_function();
for (unsigned i=0; i<n; ++i)
{
std::cout << "bucket #" << i << " contains: " << std::endl;;
for (auto it = mymap.begin(i); it!=mymap.end(i); ++it)
{
std::cout << "[" << it->first << ":" << it->second << "] ";
std::cout << "KEY HASH VALUE: " << fn (it->first) << std::endl;
}
std::cout << "\n";
}
return 0;
}
Could anyone please tell me if I'm missing anything, and why elements with different hash codes are present in the same bucket?
Results:
mymap has 7 buckets.
mymap size 6 keys.
bucket #0 contains:
[book:livre] KEY HASH VALUE: 4190142187
[house:maison] KEY HASH VALUE: 4227651036
bucket #1 contains:
bucket #2 contains:
bucket #3 contains:
[grapefruit:pamplemousse] KEY HASH VALUE: 3375607049
[tree:arbre] KEY HASH VALUE: 335777326
bucket #4 contains:
bucket #5 contains:
[apple:pomme] KEY HASH VALUE: 2758877147
bucket #6 contains:
[door:porte] KEY HASH VALUE: 3658195372
Thanks
That's normal. If you have a 32-bit hash code, you don't want 2^32 buckets. Instead, the hash code is mapped to the index of a bucket. For example, if you have 7 buckets, an item might use bucket #(hash % 7). So the items with hash codes 0, 7, 14, 21, ... and so on all appear in the same bucket.
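For instance, here is a minimal sketch of that idea (assuming an implementation such as libstdc++ that maps the hash to a bucket with a plain modulo; the standard only guarantees that bucket() returns something in [0, bucket_count())):
#include <iostream>
#include <string>
#include <unordered_map>
int main()
{
    std::unordered_multimap<std::string, std::string> mymap{ {"house","maison"}, {"book","livre"} };
    auto fn = mymap.hash_function();
    for (const auto& p : mymap)
    {
        // On libstdc++ the bucket index equals hash % bucket_count();
        // other implementations may map the hash to a bucket differently.
        std::cout << p.first << ": hash " << fn(p.first)
                  << ", hash % bucket_count = " << fn(p.first) % mymap.bucket_count()
                  << ", bucket() = " << mymap.bucket(p.first) << "\n";
    }
    return 0;
}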
Related
I have a class named Foo that is privately nothing more than a 4-byte int. If I return its value as an 8-byte size_t, am I going to be screwing up unordered_map<> or anything else? I could fill all bits with something like return foo + foo << 32;. Would that be better, or would it be worse as all hashes are now multiples of 0x100000001? Or how about return ~foo + foo << 32; which would use all 64 bits and also not have a common factor?
namespace std {
template<> struct hash<MyNamespace::Foo> {
typedef size_t result_type;
typedef MyNamespace::Foo argument_type;
size_t operator() (const MyNamespace::Foo& f ) const { return (size_t) f.u32InternalValue; }
};
}
An incremental uint32_t key converted to uint64_t works well
unordered_map will reserve space for the hash-table incrementally.
The less significant bits of the key are used to determine the bucket position; in an example with 4 buckets, the 2 least significant bits are used.
Elements whose keys map to the same bucket (keys that differ by a multiple of the number of buckets) are chained in a linked list. This is where the concept of load factor comes in.
// 4 Buckets example
******** ******** ******** ******** ******** ******** ******** ******XX
bucket 00 would contain keys like {0, 256, 200000, ...}
bucket 01 would contain keys like {1, 513, 4008001, ...}
bucket 10 would contain keys like {2, 130, 10002, ...}
bucket 11 would contain keys like {3, 259, 1027, 20003, ...}
If you try to save an additional value and the load factor goes over the limit, the table is resized (e.g. when you try to save a 5th element in a 4-bucket table with max_load_factor = 1.0).
Consequently:
Having a uint32_t or a uint64_t key will make little difference until the hash table reaches 2^32 elements.
Would that be better, or would it be worse as all hashes are now multiples of 0x100000001?
This will have no impact until the hash table grows past the 32-bit overflow point (2^32 entries).
Good key conversion between incremental uint32_t and uint64_t:
key64 = static_cast<uint64_t>(key32);
Bad key conversion between incremental uint32_t and uint64_t:
key64 = static_cast<uint64_t>(key32) << 32;
It is best to keep the keys spread as evenly as possible, avoiding hashes that share the same factor again and again. E.g. in the code below, keys that are all multiples of 7 would collide until the table is resized to 17 buckets.
https://onlinegdb.com/r1N7TNySv
#include <iostream>
#include <unordered_map>
using namespace std;
// Print to std output the internal structure of an unordered_map.
template <typename K, typename T>
void printMapStruct(unordered_map<K, T>& map)
{
cout << "The map has " << map.bucket_count()<<
" buckets and max load factor: " << map.max_load_factor() << endl;
for (size_t i=0; i< map.bucket_count(); ++i)
{
cout << " Bucket " << i << ": ";
for (auto it=map.begin(i); it!=map.end(i); ++it)
{
cout << it->first << " ";
}
cout << endl;
}
cout << endl;
}
// Print the sequence of bucket counts used by this implementation as the map grows
void printMapResizes()
{
cout << "Map bucket counts:"<< endl;
unordered_map<size_t, size_t> map;
size_t lastBucketSize=0;
for (size_t i=0; i<1024*1024; ++i)
{
if (lastBucketSize!=map.bucket_count())
{
cout << map.bucket_count() << " ";
lastBucketSize = map.bucket_count();
}
map.emplace(i,i);
}
cout << endl;
}
int main()
{
unordered_map<size_t,size_t> map;
printMapStruct(map);
map.emplace(0,0);
map.emplace(1,1);
printMapStruct(map);
map.emplace(72,72);
map.emplace(17,17);
printMapStruct(map);
map.emplace(7,7);
map.emplace(14,14);
printMapStruct(map);
printMapResizes();
return 0;
}
A note on the bucket count:
In the above example, the bucket count is as follows:
1 3 7 17 37 79 167 337 709 1493 3209 6427 12983 26267 53201 107897 218971 444487 902483 1832561
This seems to purposely follow a series of prime numbers (minimizing collisions). I am not aware of the exact function behind it.
I already went through this post: Deleting elements from STL set while iterating.
Still, I want to understand why the code below produces the wrong result.
#include <iostream>
#include <unordered_set>
using namespace std;
int main() {
unordered_set<int> adjacency;
adjacency.insert(1);
adjacency.insert(2);
for (const auto& n : adjacency) {
adjacency.erase(n);
}
cout <<"After removing all elements: " << endl;
for (const auto& n : adjacency) {
cout << n << " ";
}
cout << endl;
return 0;
}
The adjacency set contains 1 and 2. After erasing all elements through the for-loop, it still contains element 1. Why?
I am using version (2) of the erase function below, so the rule "Versions (1) and (3) return an iterator pointing to the position immediately following the last of the elements erased" does not apply?
UPDATE: the reason for not using clear() is that I need to remove the elements one by one in order to do some other processing.
by position (1)
iterator erase ( const_iterator position );
by key (2)
size_type erase ( const key_type& k );
range (3)
iterator erase ( const_iterator first, const_iterator last );
Version (2) returns the number of elements erased; in unordered_set containers (which hold unique values), this is 1 if an element with the value k existed (and was therefore erased), and zero otherwise.
Versions (1) and (3) return an iterator pointing to the position immediately following the last of the elements erased.
Thanks!
Range-based for-loops use iterators under the hood, so what you wrote leads to undefined behaviour.
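Roughly speaking, your loop behaves like this (a sketch of the desugared form; __begin and __end are just illustrative names for the hidden iterators):
auto __begin = adjacency.begin();
auto __end = adjacency.end();
for (; __begin != __end; ++__begin)
{
    const auto& n = *__begin;
    adjacency.erase(n); // invalidates __begin; the subsequent ++__begin is undefined behaviour
}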
If you need to process all elements and then remove some of them based on some criteria, there is a way to do that which works on all containers:
for(auto it = adjacency.begin(); it != adjacency.end();)
{
Process(*it);
if (Condition(*it))
it = adjacency.erase(it);
else
++it;
}
If you need to process all items and then remove them all, do this:
std::for_each(adjacency.begin(), adjacency.end(), &Process);
adjacency.clear();
You are pulling the rug out from underneath your own feet, as Raymond pointed out.
#include <iostream>
#include <unordered_set>
using namespace std;
int main()
{
typedef unordered_set<int> adjacency_t;
typedef adjacency_t::iterator adjacencyIt_t;
adjacency_t adjacency;
adjacency.insert(1);
adjacency.insert(2);
cout <<"Before: " << endl;
for (const auto& n : adjacency) {
cout << n << " ";
}
cout << endl;
for (adjacencyIt_t i = adjacency.begin(); i!=adjacency.end(); /*empty*/)
{
// Do some processing on *i here.
adjacency.erase(i++); // Don't erase the old iterator before using it to move to the next in line.
}
cout <<"After removing all elements: " << endl;
for (const auto& n : adjacency) {
cout << n << " ";
}
cout << endl;
return 0;
}
I am confused about how I should call MurmurHash3_x86_128() when I have a lot of key values. The MurmurHash3 code can be found at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp. The method definition is given below.
void MurmurHash3_x86_128 ( const void * key, const int len,
uint32_t seed, void * out )
I am passing different key values using a for loop as shown below, but the hash value returned is still the same. If I remove the for loop and pass an individual key value, then the value is different. What am I doing wrong?
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>
#include "MurmurHash3.h" // from the smhasher repository linked above
using namespace std;
int main()
{
uint64_t seed = 100;
vector <string> ex;
ex.push_back("TAA");
ex.push_back("ATT");
for (size_t i = 0; i < ex.size(); i++)
{
uint64_t hash_otpt[2]= {};
cout<< hash_otpt << "\t" << endl;
const char *key = ex[i].c_str();
cout << key << endl;
MurmurHash3_x64_128(key, strlen(key), seed, hash_otpt); // 0xb6d99cf8
cout << hash_otpt << endl;
}
return 0;
}
The line
cout << hash_otpt << endl;
is emitting the address of hash_otpt, not its contents.
It should be
cout << hash_otpt[0] << hash_otpt[1] << endl;
Basically the 128-bit hash is split and stored in two 64-bit unsigned integers (the MSBs in one and the LSBs in another). On combining them, you get the complete hash.
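If you want the whole 128-bit hash as a single hex string instead of two decimal numbers run together, something like this inside the loop works (a sketch; it needs #include <iomanip>, and whether hash_otpt[0] holds the high or the low half is up to the implementation):
std::cout << std::hex << std::setfill('0')
          << std::setw(16) << hash_otpt[1]
          << std::setw(16) << hash_otpt[0]
          << std::dec << std::setfill(' ') << std::endl;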
I have a std::unordered_map which I have initialized with a bucket count of 10. When I change the max_load_factor, the code gives a segfault while accessing bucket_size(). I am using g++ as the compiler on Linux. I want to increase the load factor so that elements collide.
1) Why am I getting a segfault?
2) What is the correct way to set max_load_factor for an unordered_map? As far as I know, the constructor of std::unordered_map does not accept a load factor as an argument.
Code without setting max_load_factor gives no problem:
// unordered_map::bucket_size
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;
int main ()
{
std::unordered_map<int, std::string> mymap(10);
unsigned nbuckets = mymap.bucket_count();
std::cout << "mymap has " << nbuckets << " buckets:\n";
std::cout << "mymap load factor " << mymap.max_load_factor() << endl;
for (unsigned i=0; i<nbuckets; ++i) {
std::cout << "bucket #" << i << " has " << mymap.bucket_size(i) << " elements.\n";
}
return 0;
}
Output
$ g++ -std=c++11 map.cpp && ./a.out
mymap has 11 buckets:
mymap load factor 1
bucket #0 has 0 elements.
bucket #1 has 0 elements.
bucket #2 has 0 elements.
bucket #3 has 0 elements.
bucket #4 has 0 elements.
bucket #5 has 0 elements.
bucket #6 has 0 elements.
bucket #7 has 0 elements.
bucket #8 has 0 elements.
bucket #9 has 0 elements.
bucket #10 has 0 elements.
Now, once I introduce the code to change the max_load_factor, I get a segfault.
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;
int main ()
{
std::unordered_map<int, std::string> mymap(10);
unsigned nbuckets = mymap.bucket_count();
std::cout << "mymap has " << nbuckets << " buckets:\n";
mymap.max_load_factor(10);
std::cout << "mymap load factor " << mymap.max_load_factor() << endl;
for (unsigned i=0; i<nbuckets; ++i) {
std::cout << "bucket #" << i << " has " << mymap.bucket_size(i) << " elements.\n";
}
return 0;
}
Output
$ g++ -std=c++11 map.cpp && ./a.out
mymap has 11 buckets:
mymap load factor 10
bucket #0 has 0 elements.
bucket #1 has 0 elements.
bucket #2 has 0 elements.
Segmentation fault
I suspect there's no guarantee that the container keeps the same bucket count after a change to max_load_factor. If that's the case, you may be iterating over an invalid number of buckets, since nbuckets was captured before the call.
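A minimal way to make the loop safe regardless of what the implementation does is to re-read the bucket count after changing max_load_factor (a sketch of just the changed part of your main()):
mymap.max_load_factor(10);
std::cout << "mymap load factor " << mymap.max_load_factor() << endl;
// Re-read bucket_count() here: it may have changed after max_load_factor().
for (unsigned i = 0; i < mymap.bucket_count(); ++i) {
    std::cout << "bucket #" << i << " has " << mymap.bucket_size(i) << " elements.\n";
}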
I have 2 std::vectors:
to the first vector, I emplace an instance
to the second vector, I want to store the address of the just-emplaced instance
But it does not work, i.e., the stored address differs from the emplaced instance's address.
If it matters at all, I'm on Linux and using g++ 5.1 and clang 3.6 with -std=c++11.
Here's a working example to illustrate the problem.
#include <iostream>
#include <vector>
struct Foo {
Foo(int a1, int a2) : f1(a1), f2(a2) {}
int f1;
int f2;
};
int main(int, char**) {
std::vector<Foo> vec1;
std::vector<Foo*> vec2;
int num = 10;
for (int i = 0; i < num; ++i) {
vec1.emplace_back(i, i * i);
// I want to store the address of *emplaced* instance...
vec2.push_back(&vec1.back());
}
// same
std::cout << "size 1: " << vec1.size() << std::endl;
std::cout << "size 2: " << vec2.size() << std::endl;
// same for me
std::cout << "back 1: " << &vec1.back() << std::endl;
std::cout << "back 2: " << vec2.back() << std::endl;
// typically differ ?
std::cout << "front 1: " << &vec1.front() << std::endl;
std::cout << "front 2: " << vec2.front() << std::endl;
for (int i = 0; i < num; ++i) {
std::cout << i + 1 << "th" << std::endl;
// same for last several (size % 4) for me
std::cout << "1: " << &vec1[i] << std::endl;
std::cout << "2: " << vec2[i] << std::endl;
}
}
Questions
Is this correct behavior? I guess it's caused by storing the address of a temporary instance, but I want to know whether it's permitted by the standard (just curious).
If the above is true, how do I work around this? I resolved it by changing the first one to vector<unique_ptr<Foo>>, but is there a more idiomatic way?
Two options:
1) You can simply fix your test. You just need to preallocate enough memory in your test first with
vec1.reserve(10);
Well, this is an implementation detail of std::vector. As more and more items are added to a std::vector, it needs to get more space for them, and this space must be contiguous. So when there is not enough space for a new element, std::vector allocates a bigger block of memory, copies the existing elements into it, adds the new element, and finally frees the block of memory it used before. As a result, the addresses you stored in vec2 may become invalid.
However, if you preallocate enough memory for 10 elements, then your code is correct.
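For reference, option 1 applied to the loop from the question would look like this (a sketch; only the reserve call is new):
std::vector<Foo> vec1;
std::vector<Foo*> vec2;
int num = 10;
vec1.reserve(num); // capacity for all 10 elements up front, so no reallocation below
for (int i = 0; i < num; ++i) {
    vec1.emplace_back(i, i * i);
    vec2.push_back(&vec1.back()); // this address now stays valid
}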
Or, since reserving memory is a somewhat tricky thing to do,
2) use std::deque, since insertion and deletion at either end of a deque never invalidate pointers or references to the rest of the elements (http://en.cppreference.com/w/cpp/container/deque), and forget about the problem of invalidated addresses. There is no need to reserve memory.