Changing max_load_factor() causing segfault in std::unordered_map - c++11

I have a std::unordered_map which I have initialized with a bucket count hint of 10. When I change the max_load_factor, the code gives a segfault while accessing bucket_size(). I am using g++ as the compiler on Linux. I want to increase the load factor so that elements collide.
1> Why am I getting a segfault?
2> What is the correct way to set max_load_factor for an unordered_map? As far as I know, the constructor of std::unordered_map does not accept a load factor as an argument.
The code without setting max_load_factor gives no problem:
// unordered_map::bucket_size
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;
int main ()
{
  std::unordered_map<int, std::string> mymap(10);
  unsigned nbuckets = mymap.bucket_count();
  std::cout << "mymap has " << nbuckets << " buckets:\n";
  std::cout << "mymap load factor " << mymap.max_load_factor() << endl;
  for (unsigned i = 0; i < nbuckets; ++i) {
    std::cout << "bucket #" << i << " has " << mymap.bucket_size(i) << " elements.\n";
  }
  return 0;
}
Output
$ g++ -std=c++11 map.cpp && ./a.out
mymap has 11 buckets:
mymap load factor 1
bucket #0 has 0 elements.
bucket #1 has 0 elements.
bucket #2 has 0 elements.
bucket #3 has 0 elements.
bucket #4 has 0 elements.
bucket #5 has 0 elements.
bucket #6 has 0 elements.
bucket #7 has 0 elements.
bucket #8 has 0 elements.
bucket #9 has 0 elements.
bucket #10 has 0 elements.
Now, once I introduce the code to change the max_load_factor, I get a segfault.
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;
int main ()
{
  std::unordered_map<int, std::string> mymap(10);
  unsigned nbuckets = mymap.bucket_count();
  std::cout << "mymap has " << nbuckets << " buckets:\n";
  mymap.max_load_factor(10);
  std::cout << "mymap load factor " << mymap.max_load_factor() << endl;
  for (unsigned i = 0; i < nbuckets; ++i) {
    std::cout << "bucket #" << i << " has " << mymap.bucket_size(i) << " elements.\n";
  }
  return 0;
}
Output
$ g++ -std=c++11 map.cpp && ./a.out
mymap has 11 buckets:
mymap load factor 10
bucket #0 has 0 elements.
bucket #1 has 0 elements.
bucket #2 has 0 elements.
Segmentation fault

There's no guarantee that the container keeps the same bucket count after a change to max_load_factor(); the implementation is free to rehash at that point. If the bucket count shrinks, you end up iterating over a stale, too-large number of buckets, since nbuckets was captured earlier in the code, and calling bucket_size() with an out-of-range index is undefined behavior, hence the segfault.
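A minimal sketch of the fix (the exact rehashing behavior is implementation-specific): query bucket_count() only after changing max_load_factor, so the loop bound always matches the live bucket array.

#include <iostream>
#include <string>
#include <unordered_map>

int main ()
{
  std::unordered_map<int, std::string> mymap(10);
  mymap.max_load_factor(10);
  // Re-read the bucket count after the change; the container may have rehashed.
  unsigned nbuckets = mymap.bucket_count();
  std::cout << "mymap has " << nbuckets << " buckets:\n";
  std::cout << "mymap load factor " << mymap.max_load_factor() << '\n';
  for (unsigned i = 0; i < nbuckets; ++i) {
    std::cout << "bucket #" << i << " has " << mymap.bucket_size(i) << " elements.\n";
  }
  return 0;
}

As for the second question: there is indeed no C++11 constructor parameter for it; calling max_load_factor(z) right after construction, as above, is the standard way to set it.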

Related

Recursive function doesn't return the vector<int> list correctly

I'm trying to create a list which contains 10 unique random numbers between 1 and 20 by using a recursive function. Here is the code.
Compiler: GNU g++ 10.2.0 on Windows
Compiler flags: -DDEBUG=9 -ansi -pedantic -Wall -std=c++11
#include <iostream>
#include <vector>
#include <algorithm>
#include <time.h>
using namespace std;
vector<int> random (int size, int range, int randnum, vector<int> randlist) {
  if (size < 1) {
    cout << "returning...(size=" << size << ")" << endl;
    return randlist;
  }
  else {
    if (any_of(randlist.begin(), randlist.end(), [randnum](int elt) { return randnum == elt; })) {
      cout << "repeating number: " << randnum << endl;
      random(size, range, rand() % range + 1, randlist);
      return randlist;
    }
    else {
      cout << "size " << size << " randnum " << randnum << endl;
      randlist.push_back(randnum);
      random(size-1, range, rand() % range + 1, randlist);
      return randlist;
    }
  }
}
int main (int argc, char *argv[]) {
  srand (time(NULL));
  vector<int> dummy{};
  vector<int> uniqrandnums = random(10, 20, (rand() % 20) + 1, dummy);
  cout << "here is my unique random numbers list: ";
  for_each(uniqrandnums.begin(), uniqrandnums.end(), [](int n){ cout << n << ' '; });
}
To keep track of the unique random numbers, I've added 2 cout lines inside the recursive function random. The recursive function seems to operate correctly, but it can't return the resulting vector<int> list randlist correctly; it seems to return a list with just the first random number it found.
Note: Reckoning that the function would finally return from here:
if (size < 1) {
  cout << "returning...(size=" << size << ")" << endl;
  return randlist;
}
I initially hadn't added the last 2 return randlist; lines inside the recursive function, but as is, it gave the compilation warning control reaches end of non-void function [-Wreturn-type]. That's why I added those 2 return statements, but they just made the warning go away; they didn't make the function operate correctly.
Question: How to arrange the code so the recursive function random returns the full list in a correct manner?
The issue is that you are discarding the result of the recursive calls to random(). In the two places where you call:
random(..., randlist);
return randlist;
Replace that with:
return random(..., randlist);
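Applied to the question's code, a fixed version would look like this (a sketch; the only substantive change is returning the recursive calls' results at the two call sites, which also flattens the redundant else branches):

#include <iostream>
#include <vector>
#include <algorithm>
#include <ctime>
#include <cstdlib>
using namespace std;

vector<int> random (int size, int range, int randnum, vector<int> randlist) {
  if (size < 1) {
    cout << "returning...(size=" << size << ")" << endl;
    return randlist;
  }
  if (any_of(randlist.begin(), randlist.end(), [randnum](int elt) { return randnum == elt; })) {
    cout << "repeating number: " << randnum << endl;
    return random(size, range, rand() % range + 1, randlist);    // was: random(...); return randlist;
  }
  cout << "size " << size << " randnum " << randnum << endl;
  randlist.push_back(randnum);
  return random(size - 1, range, rand() % range + 1, randlist);  // was: random(...); return randlist;
}

int main () {
  srand(time(NULL));
  vector<int> uniqrandnums = random(10, 20, (rand() % 20) + 1, {});
  for_each(uniqrandnums.begin(), uniqrandnums.end(), [](int n){ cout << n << ' '; });
  cout << endl;
}

This also resolves the -Wreturn-type warning naturally: every control path now returns a value because each recursive call's result is returned directly.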

hash function for a 64-bit OS/compiler, for an object that's really just a 4-byte int

I have a class named Foo that is privately nothing more than a 4-byte int. If I return its value as an 8-byte size_t, am I going to be screwing up unordered_map<> or anything else? I could fill all the bits with something like return foo + foo << 32;. Would that be better, or would it be worse, as all hashes are now multiples of 0x100000001? Or how about return ~foo + foo << 32;, which would use all 64 bits and also not have a common factor?
namespace std {
  template<> struct hash<MyNamespace::Foo> {
    typedef size_t result_type;
    typedef MyNamespace::Foo argument_type;
    size_t operator() (const MyNamespace::Foo& f) const { return (size_t) f.u32InternalValue; }
  };
}
An incremental uint32_t key converted to uint64_t works well
unordered_map will reserve space for the hash table incrementally.
The less significant bits of the key are used to determine the bucket position; in an example with 4 entries/buckets, the 2 least significant bits are used.
Elements whose keys fall in the same bucket (keys congruent modulo the number of buckets) are chained in a linked list. This is where the concept of load factor comes in.
// 4 Buckets example
******** ******** ******** ******** ******** ******** ******** ******XX
bucket 00 would contain keys like {0, 256, 200000 ...}
bucket 01 would contain keys like {1, 513, 4008001 ...}
bucket 10 would contain keys like {2, 130, 10002 ...}
bucket 11 would contain keys like {3, 259, 1027, 20003, ...}
If you try to save an additional value in a bucket and the load factor goes over the limit, the table is resized (e.g., you try to save a 5th element in a 4-bucket table with load_factor=1.0).
Consequently:
Having a uint32_t or a uint64_t key will have little impact until the hash table reaches 2^32 elements.
Would that be better, or would it be worse as all hashes are now multiples of 0x100000001?
This will have no impact until the hash table grows past 2^32 buckets (32-bit overflow), since the low 32 bits of key * 0x100000001 are just the key itself.
Good key conversion between an incremental uint32_t and uint64_t:
key64 = static_cast<uint64_t>(key32);
Bad key conversion between an incremental uint32_t and uint64_t:
key64 = static_cast<uint64_t>(key32) << 32;
The best approach is to keep the keys as evenly spread as possible, avoiding hashes that keep sharing a common factor. E.g., in the code below, keys that are all multiples of 7 would collide until the table is resized to 17 buckets.
https://onlinegdb.com/r1N7TNySv
#include <iostream>
#include <unordered_map>
using namespace std;

// Print to std output the internal structure of an unordered_map.
template <typename K, typename T>
void printMapStruct(unordered_map<K, T>& map)
{
  cout << "The map has " << map.bucket_count()
       << " buckets and max load factor: " << map.max_load_factor() << endl;
  for (size_t i = 0; i < map.bucket_count(); ++i)
  {
    cout << " Bucket " << i << ": ";
    for (auto it = map.begin(i); it != map.end(i); ++it)
    {
      cout << it->first << " ";
    }
    cout << endl;
  }
  cout << endl;
}

// Print the sequence of bucket counts used by this implementation as it grows.
void printMapResizes()
{
  cout << "Map bucket counts:" << endl;
  unordered_map<size_t, size_t> map;
  size_t lastBucketSize = 0;
  for (size_t i = 0; i < 1024*1024; ++i)
  {
    if (lastBucketSize != map.bucket_count())
    {
      cout << map.bucket_count() << " ";
      lastBucketSize = map.bucket_count();
    }
    map.emplace(i, i);
  }
  cout << endl;
}

int main()
{
  unordered_map<size_t, size_t> map;
  printMapStruct(map);
  map.emplace(0, 0);
  map.emplace(1, 1);
  printMapStruct(map);
  map.emplace(72, 72);
  map.emplace(17, 17);
  printMapStruct(map);
  map.emplace(7, 7);
  map.emplace(14, 14);
  printMapStruct(map);
  printMapResizes();
  return 0;
}
Note on the bucket count:
In the above example, the bucket count is as follows:
1 3 7 17 37 79 167 337 709 1493 3209 6427 12983 26267 53201 107897 218971 444487 902483 1832561
This seems to purposely follow a series of prime numbers (minimizing collisions); GCC's libstdc++ appears to pick these from a precomputed table of primes in its rehash policy, though I am not aware of the exact function behind it.

How to use OpenMP to deal with two for loops with

I am new to OpenMP... Please help me with this dumb question. Thank you :)
Basically, I want to use OpenMP to speed up two for loops. But I do not know why it keeps saying: invalid controlling predicate for the for loop.
By the way, my GCC version is gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005, and the OS I am using is Ubuntu 16.10.
Basically, I generate a toy data that has a typical Key-Value style, like this:
Data = {
"0": ["100","99","98","97",..."1"];
"1": ["100","99","98","97",..."1"];
...
"999":["100","99","98","97",..."1"];
}
Then, for each key, I want to compare its value with the values of all the other keys. Here, I just sum the sizes via user1_list.size() + user2_list.size();. Since the computation for each key is totally independent of the other keys, this should be a good fit for parallelism.
Here is my toy example code.
#include <map>
#include <vector>
#include <string>
#include <iostream>
#include "omp.h"
using namespace std;
int main(){
  // Create Data
  map<string, vector<string>> data;
  for (int i = 0; i != 1000; i++){
    vector<string> list;
    for (int j = 100; j != 0; j--){
      list.push_back(to_string(j));
    }
    data[to_string(i)] = list;
  }
  cout << "Data Total size: " << data.size() << endl;
  int count = 1;
  #pragma omp parallel for private(count)
  for (auto it = data.begin(); it != data.end(); it++){
    //cout << "Evoke Thread: " << omp_get_thread_num();
    cout << " count: " << count << " / " << data.size() << endl;
    count++;
    string user1 = it->first;
    vector<string> user1_list = it->second;
    for (auto it2 = data.begin(); it2 != data.end(); it2++){
      string user2 = it2->first;
      vector<string> user2_list = it2->second;
      cout << "u1:" << user1 << " u2:" << user2;
      int total_size = user1_list.size() + user2_list.size();
      cout << " total size: " << total_size << endl;
    }
  }
  return 0;
}
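No answer is attached above, but the diagnostic itself is mechanical: OpenMP's canonical loop form requires a relational controlling predicate (<, <=, >, >=) over a loop variable OpenMP can step through, and std::map iterators are bidirectional, so it != data.end() does not qualify. A sketch of one common workaround (my own suggestion, not from the thread): gather the iterators into a vector up front and parallelize an integer index loop; a reduction also replaces the racy shared count.

#include <map>
#include <vector>
#include <string>
#include <iostream>
#include <omp.h>
using namespace std;

int main(){
  // Same toy data as in the question.
  map<string, vector<string>> data;
  for (int i = 0; i != 1000; i++){
    vector<string> list;
    for (int j = 100; j != 0; j--)
      list.push_back(to_string(j));
    data[to_string(i)] = list;
  }

  // Collect iterators so the parallel loop can use an integer index
  // with a '<' predicate, which satisfies the canonical loop form.
  vector<map<string, vector<string>>::iterator> items;
  for (auto it = data.begin(); it != data.end(); ++it)
    items.push_back(it);

  long long grand_total = 0;
  #pragma omp parallel for reduction(+:grand_total)
  for (long long i = 0; i < (long long)items.size(); ++i){
    const vector<string>& user1_list = items[i]->second;
    // Reading data concurrently is safe; nothing here mutates the map.
    for (auto it2 = data.begin(); it2 != data.end(); ++it2)
      grand_total += user1_list.size() + it2->second.size();
  }
  cout << "grand total: " << grand_total << endl;
  return 0;
}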

Keys with different hash code present in same bucket. C++

I created an unordered_multimap and inserted the following elements:
typedef std::unordered_multimap<std::string, std::string> stringMap;
stringMap mymap;
mymap.insert( {
  {"house","maison"},
  {"apple","pomme"},
  {"tree","arbre"},
  {"book","livre"},
  {"door","porte"},
  {"grapefruit","pamplemousse"}
} );
When I checked the buckets, I found that there were 7 buckets.
This is what I read:
The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket.
But when I printed the hash code of the keys, I found that there are elements present in a bucket with different hash codes.
#include <iostream>
#include <string>
#include <unordered_map>
int main ()
{
  typedef std::unordered_multimap<std::string, std::string> stringMap;
  stringMap mymap;
  mymap.insert( {
    {"house","maison"},
    {"apple","pomme"},
    {"tree","arbre"},
    {"book","livre"},
    {"door","porte"},
    {"grapefruit","pamplemousse"},
  } );
  unsigned n = mymap.bucket_count();
  unsigned s = mymap.size();
  std::cout << "mymap has " << n << " buckets.\n";
  std::cout << "mymap size " << s << " keys.\n";
  stringMap::hasher fn = mymap.hash_function();
  for (unsigned i = 0; i < n; ++i)
  {
    std::cout << "bucket #" << i << " contains: " << std::endl;
    for (auto it = mymap.begin(i); it != mymap.end(i); ++it)
    {
      std::cout << "[" << it->first << ":" << it->second << "] ";
      std::cout << "KEY HASH VALUE: " << fn(it->first) << std::endl;
    }
    std::cout << "\n";
  }
  return 0;
}
Could anyone please tell me if I'm missing anything, and why elements with different hash codes are present in the same bucket?
Results:
mymap has 7 buckets.
mymap size 6 keys.
bucket #0 contains:
[book:livre] KEY HASH VALUE: 4190142187
[house:maison] KEY HASH VALUE: 4227651036
bucket #1 contains:
bucket #2 contains:
bucket #3 contains:
[grapefruit:pamplemousse] KEY HASH VALUE: 3375607049
[tree:arbre] KEY HASH VALUE: 335777326
bucket #4 contains:
bucket #5 contains:
[apple:pomme] KEY HASH VALUE: 2758877147
bucket #6 contains:
[door:porte] KEY HASH VALUE: 3658195372
Thanks
That's normal. If you have a 32-bit hash code, you don't want 2^32 buckets. Instead, the hash code is mapped to the index of a bucket. For example, if you have 7 buckets, an item might use bucket #(hash % 7). So items with hash codes 0, 7, 14, 21, and so on all appear in the same bucket.
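You can verify this against your own output with a small sketch (this assumes the implementation maps a hash code to a bucket with a plain modulo, as GCC's libstdc++ does; the standard only requires some deterministic mapping):

#include <iostream>
#include <string>
#include <unordered_map>

int main ()
{
  std::unordered_multimap<std::string, std::string> mymap {
    {"house","maison"}, {"apple","pomme"}, {"tree","arbre"},
    {"book","livre"}, {"door","porte"}, {"grapefruit","pamplemousse"}
  };
  auto fn = mymap.hash_function();
  for (const auto& kv : mymap)
  {
    // bucket() should agree with hash % bucket_count() on this implementation.
    std::cout << kv.first
              << " hash%buckets=" << fn(kv.first) % mymap.bucket_count()
              << " bucket()=" << mymap.bucket(kv.first) << "\n";
  }
  return 0;
}

In your results, 4190142187 % 7 and 4227651036 % 7 are both 0, which is exactly why book and house share bucket #0 despite having different hash codes.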

Address of an instance emplaced to std::vector is invalid

I have 2 std::vectors:
to the first vector, I emplace an instance
to the second vector, I want to store the address of the instance just emplaced
But it does not work, i.e., the stored address differs from the emplaced instance's address.
If it matters at all, I'm on Linux and using g++ 5.1 and clang 3.6 with -std=c++11.
Here's a working example to illustrate the problem.
#include <iostream>
#include <vector>
struct Foo {
  Foo(int a1, int a2) : f1(a1), f2(a2) {}
  int f1;
  int f2;
};
int main(int, char**) {
  std::vector<Foo> vec1;
  std::vector<Foo*> vec2;
  int num = 10;
  for (int i = 0; i < num; ++i) {
    vec1.emplace_back(i, i * i);
    // I want to store the address of *emplaced* instance...
    vec2.push_back(&vec1.back());
  }
  // same
  std::cout << "size 1: " << vec1.size() << std::endl;
  std::cout << "size 2: " << vec2.size() << std::endl;
  // same for me
  std::cout << "back 1: " << &vec1.back() << std::endl;
  std::cout << "back 2: " << vec2.back() << std::endl;
  // typically differ ?
  std::cout << "front 1: " << &vec1.front() << std::endl;
  std::cout << "front 2: " << vec2.front() << std::endl;
  for (int i = 0; i < num; ++i) {
    std::cout << i + 1 << "th" << std::endl;
    // same for last several (size % 4) for me
    std::cout << "1: " << &vec1[i] << std::endl;
    std::cout << "2: " << vec2[i] << std::endl;
  }
}
Questions
Is this correct behavior? I guess it's caused by storing the address of a temporary instance, but I want to know whether it's permitted by the standard (just curious).
If the above is true, how do I work around this? I resolved it by changing the first vector to vector<unique_ptr<Foo>>, but is there a more idiomatic way?
Two options:
1) You can simply fix your test. You just need to preallocate enough memory in your test first with
vec1.reserve(10);
This is an implementation detail of std::vector. As more and more items are added to a std::vector, it needs to get more space for them, and this space must be contiguous. So when there is not enough space for a new element, std::vector allocates a bigger block of memory, copies the existing elements to it, adds the new element, and finally frees the block of memory it used before. As a result, the addresses you stored in vec2 might become invalid.
However, if you preallocate enough memory for 10 elements, then your code is correct.
Or, since reserving memory is a somewhat tricky thing to get right:
2) Use std::deque, since insertion and deletion at either end of a deque never invalidates pointers or references to the rest of the elements (http://en.cppreference.com/w/cpp/container/deque), and forget about the problem of invalidated addresses. There is no need to reserve memory.
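A minimal sketch of option 2, reusing Foo from the question: growing a std::deque at the back never moves existing elements, so every stored pointer stays valid.

#include <deque>
#include <iostream>
#include <vector>

struct Foo {
  Foo(int a1, int a2) : f1(a1), f2(a2) {}
  int f1;
  int f2;
};

int main() {
  std::deque<Foo> storage;    // emplace_back at the end never invalidates element addresses
  std::vector<Foo*> pointers;
  for (int i = 0; i < 10; ++i) {
    storage.emplace_back(i, i * i);
    pointers.push_back(&storage.back());
  }
  // Every stored pointer still refers to its element; both print "true".
  std::cout << std::boolalpha
            << (pointers.front() == &storage.front()) << " "
            << (pointers.back() == &storage.back()) << std::endl;
}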
