Why was Segment dropped in the Java 8 ConcurrentHashMap? [duplicate] - java-8

Can any concurrency expert explain which concurrency features of ConcurrentHashMap improved in Java 8 compared with previous JDKs?

Well, the ConcurrentHashMap has been entirely rewritten. Before Java 8, each ConcurrentHashMap had a "concurrency level" which was fixed at construction time. For compatibility reasons, there is still a constructor accepting such a level, though it is no longer used in the original way. The map was split into as many segments as its concurrency level, each of them having its own lock, so in theory there could be up to that many concurrent updates, provided they all happened to target different segments, which depends on the hashing.
In Java 8, each hash bucket can be updated individually, so as long as there are no hash collisions, there can be as many concurrent updates as its current capacity. This is in line with new features like the compute methods, which guarantee atomic updates and hence lock at least the hash bucket being updated. In the best case, they indeed lock only that single bucket.
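For illustration, here is a minimal sketch of such a per-key atomic update using the Java 8 compute/merge methods (the class and variable names are just for the example):

import java.util.concurrent.ConcurrentHashMap;

public class ComputeDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

        // merge() updates a single key atomically; at most the bin holding
        // "apple" is locked while the remapping function runs.
        counts.merge("apple", 1, Integer::sum);
        counts.merge("apple", 1, Integer::sum);

        // compute() likewise works on one key at a time.
        counts.compute("pear", (k, v) -> v == null ? 1 : v + 1);

        System.out.println(counts); // e.g. {apple=2, pear=1} (iteration order not guaranteed)
    }
}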
Further, the ConcurrentHashMap benefits from the general hash improvements applied to all kinds of hash maps. When there are many hash collisions in a certain bucket, the implementation resorts to a sorted, map-like structure (a red-black tree) within that bucket, so searching the bucket degrades to O(log(n)) rather than the O(n) of the old implementation.

I think there are several changes compared with JDK 7:
Lazy initialization: in JDK 8, the memory for the internal table is allocated only when the first entry is added to the map. In JDK 7, this is done when the map is created.
New methods were added in JDK 8, such as forEach, reduce, and search (an example follows this list).
Internal structure change: a TreeBin (red-black tree) is used in JDK 8 to improve search efficiency within a bucket.
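A small example of those JDK 8 bulk operations (the first argument of each call is the parallelism threshold; 1 just asks the map to parallelize whenever it can):

import java.util.concurrent.ConcurrentHashMap;

public class BulkOpsDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        // forEach: visit every entry, possibly in parallel.
        map.forEach(1, (k, v) -> System.out.println(k + "=" + v));

        // reduceValues: fold all values into one result.
        Integer sum = map.reduceValues(1, Integer::sum);
        System.out.println("sum=" + sum);

        // search: return the first non-null result of the search function (any match may win).
        String found = map.search(1, (k, v) -> v > 1 ? k : null);
        System.out.println("found=" + found);
    }
}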

Related

Concurrent Rehashing. What is "rehashed segment" in multilevel rehashing?

Looking for an algorithm for implementing concurrent hash tables that doesn't rehash the entire table each time it needs to resize, I found the following paper:
Per-bucket concurrent rehashing algorithms
The "Multilevel rehashing" concept got my attention. The paper says:
2.3 Multilevel rehashing
The drawbacks of recursive hashing may be addressed by extending
buckets with a more complex control field that indicates the number of
segments rehashed. So, any operation accessing an "old" bucket can
detect whether rehashing is necessary and move the data directly into
all the new buckets available at once as shown in Figure 3.
Nevertheless, recursive rehashing may be a better choice when resizing
is a rare event and there is no space for additional field in the
bucket.
But I don't get it. What are rehashed segments?
The paper later says:
For multilevel rehashing, this check is trivial using information
about rehashed segments stored in the bucket, which in fact is current
capacity effective for the given bucket. The difficulty here is rather
in algorithm of detecting a correct root parent while other threads
can rehash it concurrently (not discussed in the paper).
Again, what are rehashed segments?

Why Redis SortedSet uses Skip List instead of Balanced Tree?

The Redis documentation says:
ZSETs are ordered sets using two data structures to hold the same elements
in order to get O(log(N)) INSERT and REMOVE operations into a sorted
data structure.
The elements are added to a hash table mapping Redis objects to
scores. At the same time the elements are added to a skip list
mapping scores to Redis objects (so objects are sorted by scores in
this "view").
I cannot understand this very well. Could someone give me a detailed explanation?
Antirez explained this at https://news.ycombinator.com/item?id=1171423:
There are a few reasons:
They are not very memory intensive. It's up to you basically. Changing parameters about the probability of a node to have a given number of levels will make them less memory intensive than btrees.
A sorted set is often target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kind of balanced trees.
They are simpler to implement, debug, and so forth. For instance thanks to the skip list simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required little changes to the code.
About the Append Only durability & speed, I don't think it is a good idea to optimize Redis at cost of more code and more complexity for a use case that IMHO should be rare for the Redis target (fsync() at every command). Almost no one is using this feature even with ACID SQL databases, as the performance hit is big anyway.
About threads: our experience shows that Redis is mostly I/O bound. I'm using threads to serve things from Virtual Memory. The long term solution to exploit all the cores, assuming your link is so fast that you can saturate a single core, is running multiple instances of Redis (no locks, almost fully scalable linearly with number of cores), and using the "Redis Cluster" solution that I plan to develop in the future.
First of all, I think I get the idea of what the Redis documentation says. A Redis ordered set maintains the order of elements by the score the user specifies for each element. But when the user calls some Redis ZSET APIs, they pass only the member (element), not its score. For example:
ZREM key member [member ...]
ZINCRBY key increment member
...
Redis needs to know the score associated with that member (element), so it uses a hash table to maintain the mapping, just as the documentation says:
The elements are added to a hash table mapping Redis objects to
scores.
When it receives a member, it finds the member's score through the hash table and then performs the operation on the skip list to maintain the order of the set. Redis uses two data structures to maintain this double mapping and satisfy the needs of the different APIs.
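To make the double mapping concrete, here is a toy Java sketch (the names and structure are purely illustrative; Redis itself is written in C and uses its own dict and zskiplist types):

import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

public class ToyZSet {
    // Hash table: member -> score, so commands that name only a member
    // (ZREM, ZINCRBY, ...) can find the current score in O(1).
    private final Map<String, Double> scoreByMember = new HashMap<>();

    // Skip list: (member, score) pairs kept in score order (ties broken by
    // member name) for ZRANGE-style ordered traversal.
    private final NavigableSet<Map.Entry<String, Double>> byScore =
            new ConcurrentSkipListSet<Map.Entry<String, Double>>((a, b) -> {
                int c = Double.compare(a.getValue(), b.getValue());
                return c != 0 ? c : a.getKey().compareTo(b.getKey());
            });

    public void zadd(String member, double score) {
        Double old = scoreByMember.put(member, score);
        if (old != null) {
            byScore.remove(new SimpleImmutableEntry<>(member, old)); // drop the stale ordering entry
        }
        byScore.add(new SimpleImmutableEntry<>(member, score));
    }

    public void zincrby(String member, double delta) {
        // The hash map answers "what is this member's score?"; the skip
        // list is then updated to keep the ordered view consistent.
        zadd(member, scoreByMember.getOrDefault(member, 0.0) + delta);
    }
}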
I read William Pugh's paper, Skip Lists: A Probabilistic Alternative to Balanced Trees, and found the skip list very elegant and easier to implement than tree rotation.
Also, I think a general balanced binary tree is able to do this work at the same time cost. In case I've missed something, please point it out.

Is a huge array faster than a hash-map for look-up?

I'm receiving "order update" from stock exchange. Each order id is between 1 and 100 000 000, so I can use 100 million array to store 100 million orders and when update is received I can look-up order from array very fast just accessing it by index arrray[orderId]. I will spent several gigabytes of memory but this is OK.
Alternatively I can use hashmap, and because at any moment the number of "active" orders is limited (to, very roughly, 100 000), look-up will be pretty fast too, but probaly a little bit slower then array.
The question is - will hashmap be actually slower? Is it reasonably to create 100 millions array?
I need latency and nothing else, I completely don't care about memory, what should I choose?
Whenever considering performance issues, one experiment is worth a thousand expert opinions. Test it!
That said, I'll take a wild stab in the dark: it's likely that if you can convince your OS to keep your multi-gigabyte array resident in physical memory (this isn't necessarily easy - consider looking at the mlock and munlock syscalls), you'll have relatively better performance. Any such performance gain you notice (should one exist) will likely be by virtue of bypassing the cost of the hashing function, and avoiding the overheads associated with whichever collision-resolution and memory allocation strategies your hashmap implementation uses.
It's also worth cautioning that many hash table implementations have non-constant complexity for some operations (e.g., separate chaining can degrade to O(n) in the worst case). Given that you are optimizing for latency, an array with very aggressive signaling to the OS memory manager (e.g., madvise and mlock) is likely to give the closest to constant-latency lookups that you can easily get on a microprocessor.
While the only way to objectively answer this question is with performance tests, I will argue for using a Hashtable Map. (Caching and memory access can be so full of surprises; I do not have the expertise to speculate on which one will be faster, and when. Also consider that localized performance differences may be marginalized by other code.)
My first reason for "initially choosing" a hash is based on the observation that there are 100M distinct keys but only 0.1M active records. This means that if using an array, index utilization will be only 0.1% - a very sparse array.
If the data is stored as values in the array then it needs to be relatively small or the array size will balloon. If the data is not stored in the array (e.g. array is of pointers) then the argument for locality of data in the array is partially mitigated. Either way, the simple array approach requires lots of unused space.
Since all the keys are already integers, the distribution (hash) function can be implemented efficiently - there is no need to hash a complex type/sequence, so the "cost" of this function should approach zero.
So, my simple proposed hash (a rough code sketch follows the list):
Use linear probing backed by contiguous memory. It is simple, has good locality (especially during the probe), and avoids any form of dynamic allocation.
Pick a suitable initial bucket size; say, 2x (or 0.2M buckets, ideally prime-sized). Don't even give the hash a chance to resize. Note that this suggested bucket-array size is only 0.2% of the size of the simple array approach and could be reduced further, as the size vs. collision rate can be tuned.
Create a good distribution function for the hash. It can also exploit knowledge of the ID range.
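Here is a rough sketch of that fixed-size, linear-probing table for integer order IDs (the capacity, the EMPTY sentinel, and the use of Object as the value type are assumptions made up for this example):

public class OrderTable {
    private static final int CAPACITY = 262_147; // roughly 2.5x the ~100K active orders; pick a prime in practice
    private static final int EMPTY = -1;         // order IDs start at 1, so -1 safely means "no entry"

    private final int[] keys = new int[CAPACITY];
    private final Object[] values = new Object[CAPACITY];

    public OrderTable() {
        java.util.Arrays.fill(keys, EMPTY);
    }

    // Linear probe: start at orderId % CAPACITY and walk forward until we
    // find either the key or an empty slot.
    private int indexOf(int orderId) {
        int i = orderId % CAPACITY;
        while (keys[i] != EMPTY && keys[i] != orderId) {
            i = (i + 1) % CAPACITY;
        }
        return i;
    }

    public void put(int orderId, Object order) {
        int i = indexOf(orderId);
        keys[i] = orderId;
        values[i] = order;
    }

    public Object get(int orderId) {
        int i = indexOf(orderId);
        return keys[i] == orderId ? values[i] : null;
    }
}

Removal is omitted for brevity; a real open-addressing table would need tombstones or backward-shift deletion to evict filled orders.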
While I've presented specialized hashtable rules "optimized" for the given case, I would start with a normal Map implementation (be it a hashtable or tree) and test it; if a standard implementation works suitably well, why not use it?
Now, test different candidates under expected and extreme loads - and pick the winner.
This seems to depend on the clustering of the IDs.
If the active IDs are clustered suitably already then, without hashing, the OS and/or L2 cache have a fair shot at holding on to the good data and keeping it low-latency.
If they're completely random then you're going to suffer just as soon as the number of active transactions exceeds the number of available cache lines or the size of those transactions exceeds the size of the cache (it's not clear which is likely to happen first in your case).
However, if the active IDs work out to have some unfortunate pattern which causes a high rate of contention (eg., it's a bit-pack of different attributes, and the frequently-varying attribute hits the hardware where it hurts), then you might benefit from using a 1:1 hash of the index to get back to the random case, even though that's usually considered a pretty bad case on its own.
As far as hashing for compaction goes: noting that some people are concerned about worst-case fallback behaviour for a hash collision, you might simply implement a cache of the full-sized table in contiguous memory, since that has a reasonably constrained worst case. Simply keep the busiest entry in the map, and fall back to the full table on collisions. Move the other entry into the map if it's more active (if you can find a suitable algorithm to decide this).
Even so, it's not clear that the necessary hash table size is sufficient to reduce the working set to being cacheable. How big are your orders?
The overhead of a hashmap vs. an array is almost none. I would bet on a hashmap of 100,000 records over an array of 100,000,000, without a doubt.
Remember also that, while you "don't care about memory", this also means you'd better have the memory to back it up - an array of 100,000,000 integers will take up 400 MB, even if all of them are empty. You run the risk of your data being swapped out. If your data gets swapped out, you will get a performance hit of several orders of magnitude.
You should test and profile, as others have said. My random stab in the dark, though: A high-load-factor hash table will be the way to go here. One huge array is going to cost you a TLB miss and then a last-level cache miss per access. This is expensive. A hash table, given the working set size you mentioned, is probably only going to cost some arithmetic and an L1 miss.
Again, test both alternatives on representative examples. We're all just stabbing in the dark.
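In that spirit of "test it", here is a crude, single-threaded timing harness (not a rigorous benchmark; use JMH and representative order data for real numbers, and note the array variant needs a couple of GB of heap):

import java.util.HashMap;
import java.util.Random;

public class ArrayVsMap {
    public static void main(String[] args) {
        final int idSpace = 100_000_000, active = 100_000, lookups = 10_000_000;
        Random rnd = new Random(42);

        int[] activeIds = new int[active];
        long[] bigArray = new long[idSpace];            // ~800 MB; value 0 means "no order"
        HashMap<Integer, Long> map = new HashMap<>(active * 2);

        for (int i = 0; i < active; i++) {
            int id = 1 + rnd.nextInt(idSpace - 1);
            activeIds[i] = id;
            bigArray[id] = id;                          // stand-in for an order record
            map.put(id, (long) id);
        }

        long sum = 0, t0 = System.nanoTime();
        for (int i = 0; i < lookups; i++) sum += bigArray[activeIds[i % active]];
        long tArray = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < lookups; i++) sum += map.get(activeIds[i % active]);
        long tMap = System.nanoTime() - t0;

        System.out.printf("array: %d ms, map: %d ms (checksum %d)%n",
                tArray / 1_000_000, tMap / 1_000_000, sum);
    }
}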

Data Structure Parallel Add Serial Remove Needed

I'm working on a dynamically branching particle system on the GPU. I need a parallel data structure with the following properties:
One thread must be able to remove elements one by one, in constant time. The element returned isn't important to the algorithm--just so long as some element is returned when nonempty. For extra awesomeness, change to any number of threads.
Any number of threads must be able to add elements to the data structure in constant time. Note that some locking is allowed (and necessary), but it must still scale without depending on the number of threads. I.e., more threads shouldn't slow it down.
Basic synchronization primitives (mutexes, semaphores), and anything that can be implemented using them, are available.
I had toyed with the idea of a linked list, but this violates condition two (since adding would be O(m) for m threads once locking is taken into account). I'm not sure such a data structure exists - but I thought I would ask.
Without knowing more about how you want your data organized (sorted? FIFO? LIFO?), I'm not sure whether I can give you an exact answer. However, what you're describing sounds like the definition of a lock-free structure. Lock-free implementations of stacks and queues exist which support O(1) insertions and deletions even when many threads modify the structure concurrently. They're usually based on atomic compare-and-swap operations.
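As a sketch of that idea, here is the classic Treiber stack in Java, built on an atomic compare-and-set loop (on a GPU you would express the same retry loop with the platform's atomic intrinsics rather than java.util.concurrent):

import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    // Any number of threads may push concurrently; each retries only if the
    // top moved underneath it, so there is no lock to queue up behind.
    public void push(T value) {
        Node<T> node = new Node<>(value);
        do {
            node.next = top.get();                         // snapshot the current top
        } while (!top.compareAndSet(node.next, node));     // retry if it changed
    }

    // Pops "some element" (the most recently pushed one still present), or null if empty.
    public T pop() {
        Node<T> head;
        do {
            head = top.get();
            if (head == null) return null;
        } while (!top.compareAndSet(head, head.next));
        return head.value;
    }
}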
If locks are okay and you just want a highly-concurrent data structure that's sorted, consider looking into concurrent skip lists, which provide O(log n) sorted insertion and deletion with multiple active threads.
Hope this helps!

Do any hashtables (in-memory, non-distributed) use consistent hashing?

I'm not talking about distributed key/value systems, such as typically used with memcached, which use consistent hashing to make adding/removing nodes a relatively cheap procedure.
I'm talking about your standard in-memory hashtable like python's dict or perl's hash.
It would seem like the benefits of using consistent hashing would also apply to these standard data structures, by lowering the cost of resizing the hashtable. Real-time systems (and other latency-sensitive systems) would benefit from / require hashtables optimized for low-cost growth, even if overall throughput declines slightly.
Wikipedia alludes to "incremental resizing" but basically talks about a hot/cold replacement approach to resizing; there is a separate article about "extendible hashing" that uses a trie for bucket lookup to accomplish cheap rehashing.
Just curious if anyone's heard of in-core, single-node hashtables that use consistent hashing to lower growth cost. Or is this requirement better met using some other approach (à la the two Wikipedia approaches listed above)?
or ... is my whole question misguided? Do memory paging considerations make the complexity not worth it? That is, the extra indirection of consistent hashing lets you rehash only a fraction of the total keys, but perhaps that doesn't matter because you'll probably have to read from each existing page, so memory latency is your primary factor, and whether you rehash some or all of the keys doesn't matter compared to the cost of the memory access.... but on the other hand, with consistent hashing, all of your key remaps have the same destination page, so there's going to be less memory thrashing than if your keys remap to any of the existing pages.
EDIT: added "data-structures" tag, clarified final sentence to say "page" instead of "bucket".
I haven't heard of this in the wild, but it may be a good idea if you choose the right consistent hash implementation. Specifically, Jump Consistent Hashing from Google (Lamping and Veach). First I'll go into why Jump, then I'll go into how it can be useful in a local data structure.
Jump Consistent Hashing
Jump Consistent Hashing (which I'll shorten to Jump) is great for this space for a few reasons. Jump assumes that nodes don't fail, which is great for local data structures because they, well, don't fail! This allows Jump to merely be a mapping to a range of numbers [0, numBuckets), requiring only 2-4 bytes of space.
Further, the implementation is simple and fast. And it is even faster if we remove the reference implementation's floating-point divides and replace them with half as many integer divides. (Which we can, by the way.)
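For reference, here is the paper's algorithm transcribed to Java (this is the floating-point reference version, before the integer-divide tweak mentioned above):

public final class JumpHash {
    // Maps a 64-bit key to a bucket in [0, numBuckets). Growing numBuckets
    // from n to n+1 moves only ~1/(n+1) of the keys, all into the new bucket.
    public static int jumpConsistentHash(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            key = key * 2862933555777941757L + 1;          // 64-bit LCG step from the paper
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }
}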
All this can be used for a variation on...
ConcurrentHashMap
But first, Java's Concurrent Hash Map at a high-level.
Java's ConcurrentHashMap is parameterized by a number of buckets. This sharding factor is constant through the life of the map. Each of these buckets is itself a hash map with its own lock.
When inserting a key-value pair into the map, the key is hashed into one of the buckets. The lock for that bucket is taken, and the item is inserted into the bucket's hash map before releasing the lock. Whilst inserting into bucket x, another thread can be inserting concurrently into bucket y, but it will wait for the lock if inserting into bucket x. Thus Java's ConcurrentHashMap has n-way concurrency, where n is the bucket parameter of the constructor.
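Here is a minimal sketch of that "fixed number of locked buckets" design (illustrative only; not the real java.util.concurrent implementation):

import java.util.HashMap;
import java.util.Map;

public class StripedMap<K, V> {
    private final Map<K, V>[] buckets;
    private final Object[] locks;

    @SuppressWarnings("unchecked")
    public StripedMap(int concurrency) {
        buckets = new Map[concurrency];
        locks = new Object[concurrency];
        for (int i = 0; i < concurrency; i++) {
            buckets[i] = new HashMap<>();
            locks[i] = new Object();
        }
    }

    private int bucketFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public V put(K key, V value) {
        int b = bucketFor(key);
        synchronized (locks[b]) {   // threads hitting other buckets proceed in parallel
            return buckets[b].put(key, value);
        }
    }

    public V get(K key) {
        int b = bucketFor(key);
        synchronized (locks[b]) {
            return buckets[b].get(key);
        }
    }
}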
Just like any hash map, a bucket in ConcurrentHashMap can fill up and need to grow. Just like the regular hash map it does this by doubling its size and rehashing everything in the bucket back into its bigger self. Except that 'its bigger self' is only the bucket's 'self'. If a bucket is a hot spot and gets more than its fair share of keys, the bucket will grow disproportionately compared to the other buckets. And each time a bucket grows it takes longer and longer to rehash into itself. This last point is not only a problem for hot spots, but when the hash table plain old gets more keys.
Imagine if we could grow the number of buckets as the number of keys grows. With this we could dampen the amount of growth each individual bucket grows.
Enter consistent hashing, which allows us to add more buckets!
ConcurrentHashMap take 2: Consistent Hashing Style
We can get ConcurrentHashMap to grow its number of buckets in two easy steps.
First, replace the function that maps a key to a bucket with the jump consistent hash function. So far everything should work the same.
Second, split off a new bucket when a bucket is filled; also grow the filled bucket. Actually, only split off a new bucket if the filled bucket becomes the biggest in terms of capacity. That can be calculated without iterating the buckets.
With consistent hashing the split will only direct keys into the new bucket and not backwards into any of the old buckets.
End notes
I'm sure there can be improvements on this scheme. To wit, splitting off a bucket requires a full table scan to move keys into the new bucket. This is surely no worse than a vanilla hash map, and likely better, but it is at a disadvantage to the ConcurrentHashMap implementation which likely doesn't have to do a full scan.