Lock-free hash table with keys of size exceeding 32 bits - parallel-processing

I'm working on a rather big set of data. I have a hash table with keys which are tuples (size_t, size_t, int) and values which are single ints. My current approach is to calculate the keys and values on GPU, then to copy them to host and insert into the hash table. I would like to speed up the process of adding data to the hash table by utilizing the GPU (and at the same time get rid of copying the data back and forth between the CPU and GPU). I tried using a hash table with separate chaining with spinlocks set on buckets but that solution was unacceptably slow (in fact slower than the CPU version of the hash table). While doing some research on GPU-friendly hash tables, I came across cuckoo hashing. It is implemented for example in CUDPP. Now, the problem is, the hash table in CUDPP can only deal with keys and values that are unsigned ints. That is because it internally uses atomic functions which are limited to 64-bit words. It seems to me that any lock-free hash table has to use atomic operations and thus suffers this restriction on key and value size (see another example here). Is there any way to use a cuckoo hashing based hash table (or any lock-free hash table for that matter) on GPU with keys whose sizes exceed the limit of atomic operations?

Related

Partition index to reduce buffer busy waits?

From time to time our Oracle response times decrease significally for a minute or two, without having extra load.
we were able to identify an insert statement, which produces a lot of buffer busy waits.
From the ADDM report, we got the following hint:
Consider partitioning the INDEX "IDX1" with object
ID 4711 in a manner that will evenly distribute concurrent DML across
multiple partitions.
To be honest: I am not sure what that means. I don't know what a partitioned index is. I only can Image that it means to create a Partition with a local index.
Can you help me out here?
There is a very high frequency of reading and writing to that table. no updates or deletes are used.
Thanks,
E.
I am not sure what that means.
Oracle is telling you that there is a lot of concurrent ("at the same time") activity on a very small part of your index. This happens a lot.
Consider an index column TAB1_PK on table TAB1 whose values are inserted from a sequence TAB1_S. Suppose you have 5 database sessions all inserting into TAB1 at the same time.
Because TAB1_PK is indexed, and because the sequence is generating values in numeric order, what happens is that all those sessions have to read and update the same blocks of the index at the same time.
This can cause a lot of contention -- way more than you would expect, due to the way indexes work with multi-version read consistency. I mean, in some rare situations (depending on how the transaction logic is written and the number of concurrent sessions), it can really be crippling.
The (really) old way to avoid this problem was to use a reverse key index. That way, the sequential column values did not all go to the same index blocks.
However, that is a two-edged sword. On the one hand, you get less contention because you're inserting all over the index (good). On the other hand, your rows are going all over the index, meaning you cannot cache them all. You've just turned a big logical I/O problem into a physical I/O problem!
Nowadays, we have a better solution -- a GLOBAL HASH PARTITION on the index.
With a GHP, you can specify the number of hash buckets and use that to trade-off between how much contention you need to handle vs how compact you want the index updates (for better buffer caching). The more index hash partitions you use, the better your concurrency but the worse your index block buffer caching will be.
I find a number (of global hash partitions) around 16 is pretty good.

unordered map in C++ // load factor // variance of entries per bucket

I have a question regarding key/bucket statistics for a hash table. The load factor value (occupied entries in the hash table / the number of buckets) does not seem to be very beneficial. Keeping the load factor below some bound guarantees an appropriate load of the hash table, but does not take potential collisions into account. Having a very low load factor, it might happen that one bucket contains over 70% of the whole data (e.g. due to bad hashing).
Am I right that you should take care of both the load factor and variance of number of entries per bucket by yourself? Does C++11 provide any advanced tools for checking both at the same time? How shall one decide what is a good combination of the load and variance?
Thanks!
Best,
Alexey

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold up to the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the value could vary strongly: it could be anything from a few bytes up to tens of Mbs. Yet, the majority of values will have a length between 500 to 1000 bytes.
Having read a bit about ChronicleMap, I am amazed about its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages of using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in the sorted order by key.
Limitation of the current implementation is that you need to specify the number of elements that are going to be stored in the map in advance, and if the actual number isn't close to the specified number, you are going to overuse memory and disk (not very severely though, on Linux systems), but if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, and the performance hit grows linearly with the number of entries growing further. See https://github.com/OpenHFT/Chronicle-Map/issues/105

Smart chunking from a huge table

I have a huge table in a data warehouse (Vertica). I am accessing this table in chunks for optimization purposes. The way I am deciding my chunks is pretty straightforward. I have a primary key column say A and I take a MAX(A). I have a chunk size of 20000 and I have now created (A/20000)+1 chunks. I frame query for each chunk and retrieve the data .
There problem with this approach is as follows:
My number of chunks is dependent on MAX(A) and MAX(A) is growing very fast and thereby my number of chunks increases with it as well.
I have decided on number 20000 because that is what gives me optimal performance. But distribution of primary key within the chunks of 20000 is so scattered. For example the 0-20000 might contain only 3 elements and range 20000-40000 might contain 500 elements and no ranges come close to 20000.
I am trying to figure whether there are any good approximation algorithm for this problem which minimizes the number of chunks and fill in close to 20000 primary keys in one chunk.
Any pointers towards the solution is appreciated.
I'm not sure what optimization purposes means, but I think the best approach would be to create a timestamp column, or use an eligible timestamp column to partition on. You could then partition on a larger frame of reference so there isn't a wide range between the partitions.
If the table is partitioned, it will be able to benefit from partition pruning. This means that Vertica can eliminate the storage containers during query execution which do not match on the timestamp predicate.
Otherwise, you can look at the segmentation clause and use the max/min from the storage containers. This could be slightly more complicated.

Time for updating a sqlite3 index fluctuates too much

I have a large-ish sqlite3 (3.6.22) database (about 1 GB, 5 million rows) with a single table indexed on one column. The problem is that the time to do a typical INSERT transaction fluctuates widely. I insert about 10000 rows at a time (wrapped in a transaction of course). Often it takes about 1.5 seconds, but about every fifth transaction it suddenly takes several minutes for the very same transaction to complete. I've done a lot of experimentation, and I've discovered that the phenomena only occurs if there is an index, which makes me think it is updating the index which takes a lot of time.
I need more consistent performance. A bit higher average insertion times times would be ok, if I can only avoid that some transactions suddenly takes 200x as long as the previous one... What should I do?
Here's the schema. The strings in blocks.md5 are always exactly 32 bytes long and likely unique. The rolling.value column will contain very large 64-bit integers.
CREATE TABLE blocks (blob char(32) NOT NULL,
offset long NOT NULL,
md5 char(32) NOT NULL,
row_md5 char(32));
CREATE TABLE rolling (value INT NOT NULL);
CREATE INDEX index_md5 ON blocks (md5);
CREATE UNIQUE INDEX index_rolling ON rolling (value);
I don't know exactly how sqlite indexes are implemented, but I'd expect the behavior you describe if they were storing the index on disk or reordering the data.
Imagine a scenario where when they are allocating blocks for the index, they start some page with N slots for data. When the page fills up, they have to allocate another and split the data between them.
When you're inserting your data, the ordering of the MD5 will be as random as it gets, so every page will fill up independently. There isn't any reasonable way for the indexing strategy to know that.
Other databases will even recommend using different indexing strategies than normal for strings, especially in the case of something like random MD5s.
Trying to do this in an all memory database would tell you whether its algorithmic or disk access.
I've only really tried to avoid this in an offline system where I could sort data before inserting. After it was all inserted I would index it and that was as fast as I could find. If you're doing 10k at a time, that might be your use case, though I don't know.

Resources