What is distributed Atomic lock in caches drivers? - caching

I just want to know what is distributed Atomic lock means in caches drivers?

Distributed locks are well documented, in multiple sources.
The atomic attribute refers to the indivisible test-and-set that should be part of the lock request. Otherwise, two contenders may test at the same time, and then both set and walk away thinking they got exclusivity on the resource.
Since it is a must, you often find the term simply as distributed lock.
Now, some sources:
Antirez (Redis creator) criticized must client implementations while making a good analysis of the challenges of a distributed lock. He called his solution Redlock. Distributed locks with Redis
Then Martin Kleppmann, author of Designing Data-Intensive Applications, criticized Redlock and proposed his solution in How to do distributed locking
Then Antirez replied, in Is Redlock safe?
Going through these three articles will give you a strong sense of how to implement a distributed lock.

Related

Cache coherence for a data race free programs

All examples I've ever seen on when cache coherence is relevant are code examples that are data races (two cores simultaneously write to the same memory location).
When it comes to memory consistency, hardware vendors have decided not to provide serial consistency guarantees and e.g. C++11 has adopted the SC for DRF memory model, which basically says, if you want serial consistency, make sure your program doesn't have data races.
Why isn't the same approach applied to cache coherency? That is, a data race free program doesn't need transparent cache coherency and cache lines are synchronized exactly where the programmer/compiler inserted a synchronization barrier.
Or put in another way: Why worry about cache coherence if it is only relevant for race conditions?
All examples I've ever seen on when cache coherence is relevant are code examples that are data races (two cores simultaneously write to the same memory location).
A concurrent write/write or a concurrent read/write to the same memory location is a data-race; not just a concurrent write/write.
When it comes to memory consistency, hardware vendors have decided not to provide serial consistency guarantees
You mean sequential consistency.
Or put in another way: Why worry about cache coherence if it is only relevant for race conditions?
I guess you mean data-races. Race conditions are a higher-level problem.
I'm not sure what your question is. You need to understand that coherence equivalent to sequential consistency for a single location. So any system that is sequentially consistent, is automatically coherent.

Caffeine versus Guava cache

According to these micro benchmarks it turns out that Caffeine is a way faster than Guava cache in both read and write operations.
What is the secret of Caffeine implementation? How it differs from the Guava Cache?
Am I right that in case of timed expiration Caffeine use a scheduled executor to perform appropriate maintenance operations in background?
The main difference is because Caffeine uses ring buffers to record & replay events, whereas Guava uses ConcurrentLinkedQueue. The intent was always to migrate Guava over and it made sense to start simpler, but unfortunately there was never interest in accepting those changes. The ring buffer approach avoids allocation, is bounded (lossy), and cheaper to operate against.
The remaining costs are due to a design mismatch. The original author of MapMaker was enthusiastic about soft references as the solution to caching problems by deferring it to the GC. Unfortunately while that can seem fast in microbenchmarks, it has horrible performance in practice due to causing stop-the-world GC thrashing. The size-based solution had to be adapted into this work and that is not ideal. Caffeine optimizes for size-based and also gains an improved hash table, whereas Guava handles reference caching more elegantly.
Caffeine doesn't create its own threads for maintenance or expiration. It does defer the cost to the commonPool, which slightly improves user-facing latencies but not throughput. A future version might leverage CompletableFuture.delayedExecutor to schedule the next expiration event without directly creating threads (for users who have business logic depending on prompt removal notifications).
ConcurrentLinkedHashMap and MapMaker were written at the same time and CLHM has similar performance to Caffeine. I believe the difference is due to what scenarios the designers favored and optimized for, which impacted how other features would be implemented. There is low hanging fruit to allow Guava to have similar performance profile, but there isn't an internal champion to drive that (and even less so with Caffeine as a favored alternative).

Does Vulkan involve clustered computing?

We know Vulkan can utilize Multi-GPU resources well, but does Vulkan involve clustered computing (distributing work across many machines with one GPU, not one machine with multiple GPUs)?
Nothing in the Vulkan specification explicitly mentions clustered computing, or explicitly disallows it. The documentation does refer to the 'host' frequently, with the connotation that execution occurs on a single physical box.
It is hard to imagine that a cluster execution environment would be viable for Vulkan, as it is meant as a high performance graphics API. The implications queue synchronization and memory transfer of GPUs/memory over a network would be an extreme bottleneck. The only situations which would make sense would be operations that do not need to be synchronized very often (for example, some very long computation done on the GPU). However, in that case, the synchronization of results could be implemented outside of Vulkan itself.

When are page frame specific cache management policies useful?

I'm reading the O'Reilly Linux Kernel book and one of the things that was pointed out during the chapter on paging is that the Pentium cache lets the operating system associate a different cache management policy with each page frame. So I get that there could be scenarios where a program has very little spacial/temporal locality and memory accesses are random/infrequent enough that the probability of cache hits is below some sort of threshold.
I was wondering whether this mechanism is actually used in practice today? Or is it more of a feature that was necessary back when caches where fairly small and not as efficient as they are now? I could see it being useful for an embedded system with little overhead as far as system calls are necessary, are there other applications I am missing?
Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), MMIO regions, or through special instructions (e.g. streaming loads/stores, non-temporal prefetches, etc..). The use-cases also vary a lot, whether you're trying to map an external I/O device into virtual memory (and don't want CPU caching to impact its coherence), or whether you want to define a writethrough region for better integrity management of some database, or just want plain writeback to maximize the cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time.
By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since increasing them requires you to locate them further away from the core and pay in latency. So cache management is still, and is going to be for a long while, one of the most important things for performance critical systems and applications

How to avoid bottleneck performance?

A distributed system is described as scalable if it remains effective when there is a significant increase the number of resources and the number of users. However, these systems sometimes face performance bottlenecks. How can these be avoided?
The question is pretty broad, and depends entirely on what the system is doing.
Here are some things I've seen in systems to reduce bottlenecks.
Use caches, reducing network and disk bottlenecks. But remember that knowing when to evict from a cache eviction can a hard problem in some situations.
Use message queues to decouple components in the system. This way you can add more hardware to specific parts of the system that need it.
Delay computation when possible (often by using message queues). This takes the heat off the system during high-processing times.
Of course, design the system for parallel processing wherever possible. One host doing processing is NOT scalable. Note: most relational databases fall into the one-host bucket, this is why NoSQL has become suddenly popular; but not always appropriate (theoretically).
Use eventual consistency if possible. Strong consistency is much harder to scale.
Some are proponents for CQRS and DDD. Though I have never seen or designed a "CQRS system" nor a "DDD system," those have definitely affected the way I design systems.
There is a lot of overlap in the points above; some the techniques may use some of the others.
But, experience (your own and others) eventually teaches you about scalable systems. I keep up-to-date by reading about designs from google, amazon, twitter, facebook, and the like. Another good starting point is the high-scalability blog.
Just to build on a point discussed in the abover post, I would like to add that what you need for your distributed system is a distributed cache, so that when you intend on scaling your application the distributed cache acts like a "elastic" data-fabric meaning that you can increase storage capacity of the cache without compromising on performance and at the same time giving you a relaible platform that is accessible to multiple applications.
One such distributed caching solution is NCache. Do take a look!

Resources