C++ how to subtract two time variables - c++11

I have a bunch of time variables in the format "day-month-year H:M", for example, 14-03-15 15:25. How do I correctly measure the difference between two time variables and output a duration type?
std::string t1 = "14-03-15 15:25";
std::string t2 = "19-05-15 7:32";
template <typename Duration>
auto diff(std::string& t1, std::string& t2) {
    // How to do from here?
}
auto ds = diff<std::chrono::seconds>(t1, t2);
auto dm = diff<std::chrono::minutes>(t1, t2);

This question is interesting because it reveals how ambiguous questions about time can be.
Do t1 and t2 represent UTC times? Or times in time zones? If the latter, the computer's current local time zone, or some other time zone?
Although not likely to matter for these inputs, for the seconds-precision computation, do leap seconds matter?
The answer to each of these questions will impact the result. And the current C and C++ std API is inadequate to address the totality of these questions.
The date library (used below) can give you the correct result, but it can give different answers depending upon the desired interpretation of the input according to the above questions.
For the very simplest interpretation: these times represent UTC and leap seconds don't matter:
#include "date/date.h"
#include <chrono>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
std::string t1 = "14-03-15 15:25";
std::string t2 = "19-05-15 7:32";
template <typename Duration>
auto
diff(const std::string& t1, const std::string& t2)
{
    using namespace std;
    using namespace date;
    istringstream in{t1};
    sys_time<Duration> d1;
    in >> parse("%d-%m-%y %H:%M", d1);
    if (in.fail())
        throw runtime_error("didn't parse " + t1);
    in.clear();
    in.str(t2);
    sys_time<Duration> d2;
    in >> parse("%d-%m-%y %H:%M", d2);
    if (in.fail())
        throw runtime_error("didn't parse " + t2);
    return d2 - d1;
}
int
main()
{
    using date::operator<<;
    auto ds = diff<std::chrono::seconds>(t1, t2);
    std::cout << ds << '\n';
    auto dm = diff<std::chrono::minutes>(t1, t2);
    std::cout << dm << '\n';
}
And the output is:
5674020s
94567min
Computing instead under the assumption that the input is in "America/New_York", the output would be different. And if t1 fell prior to 01-01-15, then the matter of leap seconds could impact the output as well.
If these details of interpretation of input are important, you are practically guaranteed to get the computation wrong (but only slightly and rarely -- the most dangerous kind) with the existing C/C++ API.
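For the curious, here is a minimal sketch of what the time-zone-aware variant might look like, using the same library's "date/tz.h" extension (this assumes the tz database is available; diff_ny is a name made up for illustration, and error checking is omitted):
#include "date/tz.h"
#include <chrono>
#include <sstream>
#include <string>

template <typename Duration>
auto
diff_ny(const std::string& t1, const std::string& t2)
{
    using namespace date;
    std::istringstream in{t1};
    local_time<Duration> l1;   // wall-clock time, no zone attached yet
    in >> parse("%d-%m-%y %H:%M", l1);
    in.clear();
    in.str(t2);
    local_time<Duration> l2;
    in >> parse("%d-%m-%y %H:%M", l2);
    // Pair each local time with the zone, then subtract the UTC equivalents,
    // so that any UTC-offset change (e.g. daylight saving) is accounted for.
    auto zone = locate_zone("America/New_York");
    return make_zoned(zone, l2).get_sys_time()
         - make_zoned(zone, l1).get_sys_time();
}
Note that an ambiguous or nonexistent local time (around a daylight-saving transition) will make make_zoned throw unless you pass an explicit choose policy.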

After solving the problem of converting a string to a time point:
std::tm tm = {};
std::stringstream ss(t1);
ss >> std::get_time(&tm, "%d-%m-%y %H:%M");
auto tp1 = std::chrono::system_clock::from_time_t(std::mktime(&tm));
Or
std::tm tm = {};
strptime(t1.c_str(), "%d-%m-%y %H:%M", &tm);
auto tp1 = std::chrono::system_clock::from_time_t(std::mktime(&tm));
you can subtract the resulting time points:
// floating-point duration: no duration_cast needed
std::chrono::duration<double, typename Duration::period> duration = tp2 - tp1;
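Putting those pieces together, a minimal end-to-end sketch (parse_tp is an illustrative helper, not part of any library; note that std::mktime interprets the struct tm as local time, so unlike the UTC interpretation above, the result depends on the machine's time zone):
#include <chrono>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

std::chrono::system_clock::time_point parse_tp(const std::string& s) {
    std::tm tm = {};
    tm.tm_isdst = -1; // let mktime figure out daylight saving
    std::istringstream ss(s);
    ss >> std::get_time(&tm, "%d-%m-%y %H:%M");
    return std::chrono::system_clock::from_time_t(std::mktime(&tm));
}

int main() {
    auto tp1 = parse_tp("14-03-15 15:25");
    auto tp2 = parse_tp("19-05-15 7:32");
    std::chrono::duration<double> secs = tp2 - tp1; // seconds, as double
    std::cout << secs.count() << " s\n";
}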

Related

Is there an overhead in my code that makes my threading slower [C++]

I have created two programs that find the determinants of two matrices, one using threads and one without, and then recorded the time taken to complete the calculation. The threaded version appears to be slower than the one without threads, yet I cannot see anything that would create overhead issues. Any help is appreciated, thanks.
Thread script:
#include <iostream>
#include <ctime>
#include <thread>

void determinant(int matrix[3][3]){
    int a = matrix[0][0]*((matrix[1][1]*matrix[2][2])-(matrix[1][2]*matrix[2][1]));
    int b = matrix[0][1]*((matrix[1][0]*matrix[2][2])-(matrix[1][2]*matrix[2][0]));
    int c = matrix[0][2]*((matrix[1][0]*matrix[2][1])-(matrix[1][1]*matrix[2][0]));
    int determinant = a-b+c;
}

int main() {
    int matrix[3][3]= {
        {11453, 14515, 1399954},
        {13152, 11254, 11523},
        {11539994, 51821, 19515}
    };
    int matrix2[3][3] = {
        {16392, 16999942, 18682},
        {5669, 466999832, 1429},
        {96989, 10962, 63413}
    };
    const clock_t c_start = clock();
    std::thread mat_thread1(determinant, matrix);
    std::thread mat_thread2(determinant, matrix2);
    mat_thread1.join();
    mat_thread2.join();
    const clock_t c_end = clock();
    std::cout << "\nOperation takes: " << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << "ms of CPU time";
}
Script with no other thread than the main one:
#include <iostream>
#include <ctime>
#include <thread>

void determinant(int matrix[3][3]){
    int a = matrix[0][0]*((matrix[1][1]*matrix[2][2])-(matrix[1][2]*matrix[2][1]));
    int b = matrix[0][1]*((matrix[1][0]*matrix[2][2])-(matrix[1][2]*matrix[2][0]));
    int c = matrix[0][2]*((matrix[1][0]*matrix[2][1])-(matrix[1][1]*matrix[2][0]));
    int determinant = a-b+c;
}

int main() {
    int matrix[3][3]= {
        {11453, 14515, 1399954},
        {13152, 11254, 11523},
        {11539994, 51821, 19515}
    };
    int matrix2[3][3] = {
        {16392, 16999942, 18682},
        {5669, 466999832, 1429},
        {96989, 10962, 63413}
    };
    const clock_t c_start = clock();
    determinant(matrix);
    determinant(matrix2);
    const clock_t c_end = clock();
    std::cout << "\nOperation takes: " << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << "ms of CPU time";
}
PS - the 1st script took 0.293ms on the last run and the second script took 0.002ms
Thanks again,
wndlbh
The difference seems to be the creation of two threads and the joins. I expect that the time to do this (create and join) is way more than the time to do 9 multiplications and 5 additions.
The start-up (and tear down) cost of a new thread is enormous, and in this case drowns the real work.
I seem to remember times between 1 ms and 1 s depending on your setup. Adding threads only helps if the time saved on the work exceeds the cost of creating the threads; in this case you would need thousands of calculations to save that much.
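To get a feel for how large that start-up cost is on your machine, you can time a no-op thread in isolation. A rough sketch (it uses a wall-clock timer, since clock() sums CPU time across all threads, which also inflates the timing of the threaded version):
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    using clk = std::chrono::steady_clock;
    const auto t0 = clk::now();
    std::thread t([]{}); // empty body: whatever we measure is pure overhead
    t.join();
    const auto t1 = clk::now();
    std::cout << "create+join of a no-op thread: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
}
Compare that number against the time the determinant itself takes, and it becomes clear why the threaded version loses.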

Sorting multiple arrays using CUDA/Thrust

I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5 3 4 2 1 6 9 8 7 10 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
One way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub-arrays with keys and provide a custom functor: an ordinary thrust sort functor that also orders the sub-arrays by their key.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
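As a rough sketch of that back-to-back approach (d_vals and d_segs are hypothetical device vectors holding the data and each element's sub-array index, like the keys built in the benchmark below): stable-sort with the values as keys first, then stable-sort by segment index; stability preserves the value order within each segment.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void segmented_sort(thrust::device_vector<float>& d_vals,
                    thrust::device_vector<int>&   d_segs) {
    // pass 1: order by value, carrying the segment indices along
    thrust::stable_sort_by_key(d_vals.begin(), d_vals.end(), d_segs.begin());
    // pass 2: order by segment index; stability keeps values sorted per segment
    thrust::stable_sort_by_key(d_segs.begin(), d_segs.end(), d_vals.begin());
}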
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However, I think it's unlikely that any of the thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, and the pure sort method probably qualifies in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL

const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;

unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

struct my_sort_functor
{
    template <typename T, typename T2>
    __host__ __device__
    bool operator()(T t1, T2 t2){
        if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
        if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
        if (thrust::get<0>(t1) > thrust::get<0>(t2)) return false;
        return true;
    }
};

// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
    typename Key,
    int BLOCK_THREADS,
    int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key *d_in,  // Tile of input
    Key *d_out) // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage load;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}

int main(){
    const int ds = num_blocks*block_size;
    thrust::host_vector<mt> data(ds);
    thrust::host_vector<int> keys(ds);
    for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
    thrust::device_vector<int> d_keys = keys;
    for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
    thrust::device_vector<mt> d_data = data;
    thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
    thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long os = dtime_usec(0);
    thrust::sort(d1.begin(), d1.end()); // ordinary sort
    cudaDeviceSynchronize();
    os = dtime_usec(os);
    thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long ss = dtime_usec(0);
    thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
    cudaDeviceSynchronize();
    ss = dtime_usec(ss);
    if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
    std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
    thrust::device_vector<mt> d3(ds);
    cudaDeviceSynchronize();
    unsigned long long cs = dtime_usec(0);
    BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
    cudaDeviceSynchronize();
    cs = dtime_usec(cs);
    if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
    std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post; anyone using it does so at their own risk. I merely claim to have attempted to address the questions in the original posting and to provide some explanation thereof. I am not claiming the code is defect-free or that it is suitable for any particular purpose.

BGL bundled edge properties containing vector [duplicate]

I have a boost graph with multiple weights for each edge (imagine one set of weights per hour of the day). Each of those weight values is stored in a propretyEdge class:
class propretyEdge {
    std::map<std::string,double> weights; // Date indexed
};
I created a graph with those properties, and then filled it with the right values.
The problem is now that I want to launch the Dijkstra algorithm over a particular set of weights on the graph: for example, a function that could be:
void Dijkstra (string date, parameters ... )
That would use the
weights[date]
value for each Edge of the graph.
I have read the documentation over and over, and I couldn't get a clear picture of what I have to do. I surely need to write something like this, but I have no idea where to start:
boost::dijkstra_shortest_paths (
    (*graph_m),
    vertex_origin_num_l,
    // weight_map (get (edge_weight, (*graph_m)))
    // predecessor_map(boost::make_iterator_property_map(predecessors.begin(), get(boost::vertex_index, (*graph_m)))).
    // distance_map(boost::make_iterator_property_map(distances.begin (), get(vertex_index,(*graph_m) )))
    predecessor_map(predecessorMap).
    distance_map(distanceMap)
);
Thank you for your help.
Edit
Thanks to the wonderful answer from Sehe, I was able to do exactly what I wanted on MacOS and on Ubuntu.
But when we tried to compile this piece of code on Visual Studio 2012, it appeared that VS wasn't very good at understanding the function-based property map from Boost. So we replaced this part of Sehe's answer:
auto dated_weight_f = [&](Graph::edge_descriptor ed) {
    return g[ed].weights.at(date);
};
auto dated_weight_map = make_function_property_map<Graph::edge_descriptor, double>(dated_weight_f);
with:
class dated_weight_f {
public:
    dated_weight_f(Graph* graph_p, std::string date_p){
        graph_m = graph_p;
        date_m = date_p;
    }
    typedef double result_type;
    result_type operator()(Edge edge_p) const{
        return (*graph_m)[edge_p].weights.at(date_m);
    }
private:
    Graph* graph_m;
    std::string date_m;
};
const auto dated_weight_map = make_function_property_map<Edge>(dated_weight_f(graph_m,date_l));
This has the advantage of not relying on a lambda-based function property map.
Since it's apparently not immediately clear that this question is answered in the other answer, I'll explain.
All you really need is a custom weight_map parameter that is "stateful" and can select a certain value for a given date.
You can make this as complicated as you wish ¹, so you could even interpolate/extrapolate a weight given an unknown date ², but let's for the purpose of this demonstration keep it simple.
Let's define the graph type (roughly) as above:
struct propretyEdge {
    std::map<std::string, double> weights; // Date indexed
};
using Graph = adjacency_list<vecS, vecS, directedS, no_property, propretyEdge>;
Now, let's generate a random graph, with random weights for 3 different dates:
int main() {
    Graph g;
    std::mt19937 prng { std::random_device{}() };
    generate_random_graph(g, 8, 12, prng);
    uniform_real<double> weight_dist(10,42);
    for (auto e : make_iterator_range(edges(g)))
        for (auto&& date : { "2014-01-01", "2014-02-01", "2014-03-01" })
            g[e].weights[date] = weight_dist(prng);
And, jumping to the goal:
    for (std::string const& date : { "2014-01-01", "2014-02-01", "2014-03-01" }) {
        Dijkstra(date, g, 0);
    }
}
Now how do you implement Dijkstra(...)? Gleaning from the documentation sample, you'd do something like
void Dijkstra(std::string const& date, Graph const& g, int vertex_origin_num_l = 0) {
    // magic postponed ...
    std::vector<Graph::vertex_descriptor> p(num_vertices(g));
    std::vector<double> d(num_vertices(g));
    std::vector<default_color_type> color_map(num_vertices(g));

    boost::typed_identity_property_map<Graph::vertex_descriptor> vid; // T* property maps were deprecated

    dijkstra_shortest_paths(g, vertex_origin_num_l,
        weight_map(dated_weight_map).
        predecessor_map(make_iterator_property_map(p.data(), vid)).
        distance_map(make_iterator_property_map(d.data(), vid)).
        color_map(make_iterator_property_map(color_map.data(), vid))
    );
}
Now the only unclear bit here should be dated_weight_map.
Enter Boost Property Maps
As I showed in the linked Is it possible to have several edge weight property maps for one graph BOOST?, you can have all kinds of property maps ³, including invocation of user-defined functions. This is the missing piece:
auto dated_weight_f = [&](Graph::edge_descriptor ed) {
    return g[ed].weights.at(date);
};
auto dated_weight_map = make_function_property_map<Graph::edge_descriptor, double>(dated_weight_f);
Voilà: done
I hope that by now, the correspondence in the question as well as the answer of the linked question is clear. All that's left to do is post the full live sample and the outcome in a pretty picture:
Live On Coliru
#include <boost/property_map/property_map.hpp>
#include <boost/property_map/function_property_map.hpp>
#include <boost/property_map/property_map_iterator.hpp>
#include <random>
#include <boost/graph/random.hpp>
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/dijkstra_shortest_paths.hpp>
#include <fstream>

using namespace boost;

struct propretyEdge {
    std::map<std::string, double> weights; // Date indexed
};

using Graph = adjacency_list<vecS, vecS, directedS, no_property, propretyEdge>;

void Dijkstra(std::string const& date, Graph const& g, int vertex_origin_num_l = 0) {
    auto dated_weight_f = [&](Graph::edge_descriptor ed) {
        return g[ed].weights.at(date);
    };
    auto dated_weight_map = make_function_property_map<Graph::edge_descriptor, double>(dated_weight_f);

    std::vector<Graph::vertex_descriptor> p(num_vertices(g));
    std::vector<double> d(num_vertices(g));
    std::vector<default_color_type> color_map(num_vertices(g));

    boost::typed_identity_property_map<Graph::vertex_descriptor> vid; // T* property maps were deprecated

    dijkstra_shortest_paths(g, vertex_origin_num_l,
        weight_map(dated_weight_map).
        predecessor_map(make_iterator_property_map(p.data(), vid)).
        distance_map(make_iterator_property_map(d.data(), vid)).
        color_map(make_iterator_property_map(color_map.data(), vid))
    );

    std::cout << "distances and parents for '" + date + "':" << std::endl;
    for (auto vd : make_iterator_range(vertices(g)))
    {
        std::cout << "distance(" << vd << ") = " << d[vd] << ", ";
        std::cout << "parent(" << vd << ") = " << p[vd] << std::endl;
    }
    std::cout << std::endl;

    std::ofstream dot_file("dijkstra-eg-" + date + ".dot");
    dot_file << "digraph D {\n"
        "  rankdir=LR\n"
        "  size=\"6,4\"\n"
        "  ratio=\"fill\"\n"
        "  graph[label=\"shortest path on " + date + "\"];\n"
        "  edge[style=\"bold\"]\n"
        "  node[shape=\"circle\"]\n";

    for (auto ed : make_iterator_range(edges(g))) {
        auto u = source(ed, g),
             v = target(ed, g);
        dot_file
            << u << " -> " << v << "[label=\"" << get(dated_weight_map, ed) << "\""
            << (p[v] == u?", color=\"black\"" : ", color=\"grey\"")
            << "]";
    }
    dot_file << "}";
}

int main() {
    Graph g;
    std::mt19937 prng { std::random_device{}() };
    generate_random_graph(g, 8, 12, prng);
    uniform_real<double> weight_dist(10,42);
    for (auto e : make_iterator_range(edges(g)))
        for (auto&& date : { "2014-01-01", "2014-02-01", "2014-03-01" })
            g[e].weights[date] = weight_dist(prng);

    for (std::string const& date : { "2014-01-01", "2014-02-01", "2014-03-01" }) {
        Dijkstra(date, g, 0);
    }
}
Output: each run writes a dijkstra-eg-<date>.dot file, which Graphviz renders as a picture of that date's shortest paths (black edges are on the shortest-path tree, grey edges are not).
¹ As long as you keep the invariants required by the algorithm you're invoking. In particular, you must return the same weight consistently during the execution, given the same edge. Also, some algorithms don't support negative weight etc.
² I'd highly suggest using a Boost ICL interval_map in such a case but I digress
³ see also map set/get requests into C++ class/structure changes

Faster way to read/write a std::unordered_map from/to a file

I am working with some very large std::unordered_maps (hundreds of millions of entries) and need to save and load them to and from a file. The way I am currently doing this is by iterating through the map and reading/writing each key and value pair one at a time:
std::unordered_map<unsigned long long int, char> map;

void save(){
    std::unordered_map<unsigned long long int, char>::iterator iter;
    FILE *f = fopen("map", "wb");
    for(iter=map.begin(); iter!=map.end(); iter++){
        fwrite(&(iter->first), 8, 1, f);
        fwrite(&(iter->second), 1, 1, f);
    }
    fclose(f);
}

void load(){
    FILE *f = fopen("map", "rb");
    unsigned long long int key;
    char val;
    while(fread(&key, 8, 1, f)){
        fread(&val, 1, 1, f);
        map[key] = val;
    }
    fclose(f);
}
But with around 624 million entries, reading the map from a file took 9 minutes. Writing to a file was faster but still took several minutes. Is there a faster way to do this?
C++ unordered_map implementations must all use chaining. There are a variety of really good reasons why you might want to do this for a general purpose hash table, which are discussed here.
This has enormous implications for performance. Most importantly, it means that the entries of the hash table are likely to be scattered throughout memory in a way which makes accessing each one an order of magnitude (or so) less efficient than would be the case if they could somehow be accessed serially.
Fortunately, you can build hash tables that, when nearly full, give near-sequential access to adjacent elements. This is done using open addressing.
Since your hash table is not general purpose, you could try this.
Below, I've built a simple hash table container with open addressing and linear probing. It assumes a few things:
Your keys are already somehow randomly distributed. This obviates the need for a hash function (though decent hash functions are fairly simple to build, even if great hash functions are difficult).
You only ever add elements to the hash table; you do not delete them. If this were not the case you'd need to change the used vector into something that could hold three states: USED, UNUSED, and TOMBSTONE, where TOMBSTONE is the state of a deleted element, used to continue a linear search probe or halt a linear insert probe.
That you know the size of your hash table ahead of time, so you don't need to resize/rehash it.
That you don't need to traverse your elements in any particular order.
Of course, there are probably all kinds of excellent implementations of open addressing hash tables online which solve many of the above issues. However, the simplicity of my table allows me to convey the important point.
The important point is this: my design allows all the hash table's information to be stored in three vectors. That is: the memory is contiguous.
Contiguous memory is fast to allocate, fast to read from, and fast to write to. The effect of this is profound.
Using the same test setup as my previous answer, I get the following times:
Save. Save time = 82.9345 ms
Load. Load time = 115.111 ms
This is a 95% decrease in save time (22x faster) and a 98% decrease in load time (62x faster).
Code:
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <iostream>
#include <limits>    // for std::numeric_limits, used in main
#include <random>
#include <stdexcept> // for std::runtime_error, thrown in get()
#include <vector>

const int TEST_TABLE_SIZE = 10000000;

template<class K, class V>
class SimpleHash {
 public:
  int usedslots = 0;

  std::vector<K> keys;
  std::vector<V> vals;
  std::vector<uint8_t> used;

  //size0 should be a prime and about 30% larger than the maximum number needed
  SimpleHash(int size0){
    vals.resize(size0);
    keys.resize(size0);
    used.resize(size0/8+1,0);
  }

  //If the key values are already uniformly distributed, using a hash gains us
  //nothing
  uint64_t hash(const K key){
    return key;
  }

  bool isUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    return used[used_loc]&used_bit;
  }

  void setUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    used[used_loc] |= used_bit;
  }

  void insert(const K key, const V val){
    uint64_t loc = hash(key)%keys.size();
    //Use linear probing. Can create infinite loops if table too full.
    while(isUsed(loc)){ loc = (loc+1)%keys.size(); }
    setUsed(loc);
    usedslots++; //keep the occupancy count in sync so usedSize() is accurate
    keys[loc] = key;
    vals[loc] = val;
  }

  V& get(const K key) {
    uint64_t loc = hash(key)%keys.size();
    while(true){
      if(!isUsed(loc))
        throw std::runtime_error("Item not present!");
      if(keys[loc]==key)
        return vals[loc];
      loc = (loc+1)%keys.size();
    }
  }

  uint64_t usedSize() const {
    return usedslots;
  }

  uint64_t size() const {
    return keys.size();
  }
};

typedef SimpleHash<uint64_t, char> table_t;

void SaveSimpleHash(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "wb");
  uint64_t size = map.size();
  fwrite(&size, 8, 1, f);
  fwrite(map.keys.data(), 8, size, f);
  fwrite(map.vals.data(), 1, size, f);
  fwrite(map.used.data(), 1, size/8+1, f);
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

table_t LoadSimpleHash(){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  uint64_t size;
  fread(&size, 8, 1, f);
  table_t map(size);
  fread(map.keys.data(), 8, size, f);
  fread(map.vals.data(), 1, size, f);
  fread(map.used.data(), 1, size/8+1, f);
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
  return map;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  table_t map(1.3*TEST_TABLE_SIZE);
  std::cout<<"Created table of size "<<map.size()<<std::endl;

  std::cout<<"Generating test data..."<<std::endl;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map.insert(key_rand(),(char)val_rand()); //Low chance of collisions, so we get quite close to the desired size

  map.insert(23,42);
  assert(map.get(23)==42);

  SaveSimpleHash(map);
  auto newmap = LoadSimpleHash();

  //Ensure that the load worked
  for(int i=0;i<map.keys.size();i++)
    assert(map.keys.at(i)==newmap.keys.at(i));
  for(int i=0;i<map.vals.size();i++)
    assert(map.vals.at(i)==newmap.vals.at(i));
  for(int i=0;i<map.used.size();i++)
    assert(map.used.at(i)==newmap.used.at(i));
}
(Edit: I've added a new answer to this question which achieves a 95% decrease in wall-times.)
I made a Minimum Working Example that illustrates the problem you are trying to solve. This is something you should always do in your questions.
I then eliminated the unsigned long long int stuff and replaced it with uint64_t from the cstdint library. This ensures that we are operating on the same data size, since unsigned long long int can mean almost anything depending on what computer/compiler you use.
The resulting MWE looks like:
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <functional>
#include <iostream>
#include <limits> // for std::numeric_limits, used in main
#include <random>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<uint64_t, char> table_t;

const int TEST_TABLE_SIZE = 10000000;

void Save(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "wb");
  for(auto iter=map.begin(); iter!=map.end(); iter++){
    fwrite(&(iter->first), 8, 1, f);
    fwrite(&(iter->second), 1, 1, f);
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values to save time
void SaveLookup(const table_t &map){
  std::cout<<"SaveLookup. ";
  const auto start = std::chrono::steady_clock::now();

  //Create a lookup table
  std::vector< std::deque<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);

  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }

  //Save values
  for(const auto &row: lookup)
    for(const auto &val: row)
      fwrite(&val, 8, 1, f);

  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values and contiguous memory to
//save time
void SaveLookupVector(const table_t &map){
  std::cout<<"SaveLookupVector. ";
  const auto start = std::chrono::steady_clock::now();

  //Create a lookup table
  std::vector< std::vector<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);

  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }

  //Save values
  for(const auto &row: lookup)
    fwrite(row.data(), 8, row.size(), f);

  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

void Load(table_t &map){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

void Load2(table_t &map){
  std::cout<<"Load with Reserve. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values to save time
void LoadLookup(table_t &map){
  std::cout<<"LoadLookup. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");

  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);

  uint64_t key;
  for(int i=0;i<256;i++){
    const char val = i-128;
    for(int v=0;v<inpsizes.at(i);v++){
      fread(&key, 8, 1, f);
      map[key] = val;
    }
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values and contiguous memory to save time
void LoadLookupVector(table_t &map){
  std::cout<<"LoadLookupVector. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");

  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);

  for(int i=0;i<256;i++){
    const char val = i-128;
    std::vector<uint64_t> keys(inpsizes[i]);
    fread(keys.data(), 8, inpsizes[i], f);
    for(const auto &key: keys)
      map[key] = val;
  }
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  std::cout<<"Generating test data..."<<std::endl;
  //Generate a test table
  table_t map;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map[key_rand()] = (char)val_rand(); //Low chance of collisions, so we get quite close to the desired size

  Save(map);
  { table_t map2; Load (map2); }
  { table_t map2; Load2(map2); }
  SaveLookup(map);
  SaveLookupVector(map);
  { table_t map2; LoadLookup (map2); }
  { table_t map2; LoadLookupVector(map2); }
}
On the test data set I use, this gives me a write time of 1982ms and a read time (using your original code) of 7467ms. The read time seemed to be the biggest bottleneck, so I created a new function Load2 which reserves sufficient space for the unordered_map prior to reading. This dropped the read time to 4700ms (a 37% savings).
Edit 1
Now, I note that the values of your unordered_map can only take 256 distinct values. Thus, I can easily convert the unordered_map into a kind of lookup table in RAM. That is, rather than having:
123123 1
234234 0
345345 1
237872 1
I can rearrange the data to look like:
0 234234
1 123123 345345 237872
What's the advantage of this? It means that I no longer have to write the value to disk. That saves 1 byte per table entry. Since each table entry consists of 8 bytes for the key and 1 byte for the value, this should give me an 11% savings in both read and write time minus the cost of rearranging the memory (which I expect to be low, because RAM).
Finally, once I've done the above rearrangement, if I have a lot of spare RAM on the machine, I can pack everything into a vector and read/write the contiguous data to disk.
Doing all this gives the following times:
Save. Save time = 1836.52 ms
Load. Load time = 7114.93 ms
Load with Reserve. Load time = 4277.58 ms
SaveLookup. Save time = 1688.73 ms
SaveLookupVector. Save time = 1394.95 ms
LoadLookup. Load time = 3927.3 ms
LoadLookupVector. Load time = 3739.37 ms
Note that the transition from Save to SaveLookup gives an 8% speed-up and the transition from Load with Reserve to LoadLookup gives an 8% speed-up as well. This is right in line with our theory!
Using contiguous memory as well gives a total of a 24% speed-up over your original save time and a total of a 47% speed-up over your original load time.
Since your data seems to be static, and given the number of items, I would certainly consider using a custom structure in a binary file and then using memory mapping on that file.
Opening would be instant (just mmap the file).
If you write the values in sorted order, you can use binary search on the mapped data.
If that is not good enough, you could split your data in buckets and store a list with offsets at the beginning of the file - or maybe even use some hash key.
If your keys are all unique and somewhat contiguous, you could even get a smaller file by storing only the char values at file position [key] (and using a special value for missing keys). Of course that wouldn't work for the full uint64 range, but depending on the data the keys could be grouped together in buckets containing an offset.
Using mmap this way would also use a lot less memory.
For faster access you could create your own hash map on disk (still with 'instant load').
For example, say you have 1 million hashes (in your case there would be a lot more): you could write 1 million uint64 filepos values at the beginning of the file (the hash value would be the position of the uint64 containing the filepos). Each location would point to a block with one or more key/value pairs, and each of those blocks would start with a count.
If the blocks are aligned on 2 or 4 bytes, a uint32 filepos could be used instead (multiply the position by 2 or 4).
Since the data is static you don't have to worry about possible insertions or deletions, which makes it rather easy to implement.
This has the advantage that you still can mmap the whole file and all the key/value pairs with the same hash are close together which brings them in the L1 cache (as compared to say linked lists)
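To make the idea concrete, here is a minimal POSIX sketch (the record layout, file name, and find helper are illustrative, not taken from the question): fixed-width records sorted by key, mapped read-only and binary-searched in place.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#pragma pack(push, 1)
struct Record { uint64_t key; char val; }; // 9 bytes, matching the 8+1 format
#pragma pack(pop)

// classic binary search over the mapped records
const Record* find(const Record* recs, size_t n, uint64_t key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        const size_t mid = lo + (hi - lo) / 2;
        if (recs[mid].key < key) lo = mid + 1; else hi = mid;
    }
    return (lo < n && recs[lo].key == key) ? &recs[lo] : nullptr;
}

int main() {
    int fd = open("map.bin", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    // "loading" is already done: the kernel pages data in on demand
    const size_t n = st.st_size / sizeof(Record);
    if (const Record* r = find(static_cast<const Record*>(p), n, 123123ULL))
        std::printf("found: %d\n", r->val);
    munmap(p, st.st_size);
    close(fd);
}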
I assume you need the map so that the values are written to the file in key order. It would be better to load the values only once into a container (possibly a std::deque, since the amount is large), use std::sort once, and then iterate through the deque to write the values. You would gain cache performance, and the run-time complexity of std::sort is N*log(N), which would be better than balancing your map ~624 million times or paying for cache misses in an unordered map. A sketch of this follows.
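A minimal sketch of that suggestion (save_sorted is an illustrative name; it keeps the question's 8+1-byte record format):
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <unordered_map>
#include <utility>

void save_sorted(const std::unordered_map<uint64_t, char>& map) {
    // copy once into a random-access container, then sort once (N log N)
    std::deque<std::pair<uint64_t, char>> items(map.begin(), map.end());
    std::sort(items.begin(), items.end()); // orders by key
    FILE* f = fopen("map", "wb");
    for (const auto& kv : items) {
        fwrite(&kv.first, 8, 1, f);
        fwrite(&kv.second, 1, 1, f);
    }
    fclose(f);
}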
Perhaps a prefix-ordered traversal during save would help to reduce the amount of internal reordering during load?
Of course, you don't have visibility of the internal structure of the STL map containers, so the best you could do would be to simulate that by binary-chopping the iterator as if it were linear. Given that you know the total of N nodes, save node N/2, then N/4, N*3/4, and so on.
This can be done algorithmically by visiting every odd N/(2^p) node in each pass p: N/2, N*1/4, N*3/4, N*1/8, N*3/8, N*5/8, N*7/8, and so on, though you need to ensure that the series maintains step sizes such that N*4/8 = N/2, without resorting to step sizes of 2^(P-p), and that in the last pass you visit every remaining node. You may find it advantageous to pre-calculate the highest pass number (~log2(N)) and the float value S = N/(2^P) such that 0.5 < S <= 1.0, and then scale that back up for each p.
But as others have said, you need to profile it first to see if this is your issue, and profile again to see if this approach helps.

Wrong results using auto with Eigen

I got different results using auto and using Vector3 when summing two vectors.
My code:
#include "stdafx.h"
#include <iostream>
#include "D:\externals\eigen_3_1_2\include\Eigen\Geometry"
typedef Eigen::Matrix<double, 3, 1> Vector3;
void foo(const Vector3& Ha, volatile int j)
{
const auto resAuto = Ha + Vector3(0.,0.,j * 2.567);
const Vector3 resVector3 = Ha + Vector3(0.,0.,j * 2.567);
std::cout << "resAuto = " << resAuto <<std::endl;
std::cout << "resVector3 = " << resVector3 <<std::endl;
}
int main(int argc, _TCHAR* argv[])
{
Vector3 Ha(-24.9536,-29.3876,65.801);
Vector3 z(0.,0.,2.567);
int j = 7;
foo(Ha,j);
return 0;
}
The results:
resAuto = -24.9536, -29.3876,65.801
resVector3 = -24.9536,-29.3876,83.77
Press any key to continue . . .
I understand that Eigen does internal optimizations that generate different results. But it looks like a bug in Eigen and C++11.
The auto keyword tells the compiler to "guess" the best object based on the right-hand side of the =. You can check the results by adding
std::cout << typeid(resAuto).name() <<std::endl;
std::cout << typeid(resVector3).name() <<std::endl;
to foo (don't forget to include <typeinfo>).
In this case, after constructing the temporary Vector3, the operator+ method is called, which creates a CwiseBinaryOp object. This object is part of Eigen's lazy evaluation (which can increase performance). Crucially, the expression object refers to the temporary Vector3, which is destroyed at the end of the statement, so evaluating resAuto later reads a dangling reference; that is why the printed result is wrong. If you want to force eager evaluation (and therefore type determination), you could use
const auto resAuto = (Ha + Vector3(0.,0.,j * 2.567)).eval();
instead of your line in foo.
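For instance, a small sketch of that type check (the exact names printed are implementation-defined and vary by compiler):
#include <Eigen/Core>
#include <iostream>
#include <typeinfo>

int main() {
    Eigen::Vector3d a(1, 2, 3), b(4, 5, 6);
    const auto lazy = a + b;            // a CwiseBinaryOp expression, not a vector
    const auto eager = (a + b).eval();  // forces evaluation: eager is a Vector3d
    std::cout << typeid(lazy).name()  << '\n'
              << typeid(eager).name() << '\n';
}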
A few side notes:
Vector3 is identical to the Vector3d typedef defined in Eigen
You can use #include <Eigen/Core> instead of #include <Eigen/Geometry> to include most of the Eigen headers, plus it ensures certain things that should be defined get defined.
