Ehcache and CacheWriter (write-behind) relation - caching

Suppose we have a Cache configured with a write-behind CacheWriter. Let's assume we put some object in the cache and later on the object is removed because of an eviction policy.
What's is guaranteed regarding writing? More precisely, is write() event guaranteed to happen for that object, even though it was removed before it "had a chance" to be written?
Thanks!

No, write() is not guaranteed to happen. In a write-behind case, all writes are stored in a queue while some background threads read from that queue to update the underlying SoR (System of Records, i.e.: your database). That queue can be read or modified by other threads concurrently reading or modifying the same cache.
For instance, if a put() happens on a certain key, write() enqueues the command. If before one of the background thread had the chance to consume the write command before remove() happens on that same key, the write command can be removed from the queue (note the 'can' here). There are other similar optimizations that can take place ('can' again), those can change and new ones can be added in any minor version as this is all considered an implementation detail, as long as the data served by Ehcache follows its general visibility guarantees.
This means Write-Behind, and more generally all CacheWriters must not be used for any form of accounting, if that's the use-case you had in mind.

Related

Is this Redis Race Condition Scenario Possible?

I'm debugging an issue in an application and I'm running into a scneario where I'm out of ideas, but I suspect a race condition might be in play.
Essentially, I have two API routes - let's call them A and B. Route A generates some data and Route B is used to poll for that data.
Route A first creates an entry in the redis cache under a given key, then starts a background process to generate some data. The route immediately returns a polling ID to the caller, while the background data thread continues to run. When the background data is fully generated, we write it to the cache using the same cache key. Essentially, an overwrite.
Route B is a polling route. We simply query the cache using that same cache key - we expect one of 3 scenarios in this case:
The object is in the cache but contains no data - this indicates that the data is still being generated by the background thread and isn't ready yet.
The object is in the cache and contains data - this means that the process has finished and we can return the result.
The object is not in the cache - we assume that this means you are trying to poll for an ID that never existed in the first place.
For the most part, this works as intended. However, every now and then we see scenario 3 being hit, where an error is being thrown because the object wasn't in the cache. Because we add the placeholder object to the cache before the creation route ever returns, we should be able to safely assume this scenario is impossible. But that's clearly not the case.
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying? That is, is it possible that even though the call to add the cache entry has completed but the data would briefly not be returned by queries? It seems the be the only thing that can explain the behavior we are seeing.
If that is a possibility, how can I avoid this scenario? Is there some way to force Redis to wait until the data is available for query before returning?
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying?
Yes and it may depend on your Redis topology and on your network configuration. Only standalone Redis servers provides strong consistency, albeit with some considerations - see below.
Redis replication
While using replication in Redis, the writes which happen in a master need some time to propagate to its replica(s) and the whole process is asynchronous. Your client may happen to issue read-only commands to replicas, a common approach used to distribute the load among the available nodes of your topology. If that is the case, you may want to lower the chance of an inconsistent read by:
directing your read queries to the master node; and/or,
issuing a WAIT command right after the write operation, and ensure all the replicas acknowledged it: while the replication process would happen to be synchronous from the client standpoint, this option should be used only if absolutely needed because of its bad performance.
There would still be the (tiny) possibility of an inconsistent read if, during a failover, the replication process promotes a replica which did not receive the write operation.
Standalone Redis server
With a standalone Redis server, there is no need to synchronize data with replicas and, on top of that, your read-only commands would be always handled by the same server which processed the write commands. This is the only strongly consistent option, provided you are also persisting your data accordingly: in fact, you may end up having a server restart between your write and read operations.
Persistence
Redis supports several different persistence options; in your scenario, you may want to configure your server so that it
logs to disk every write operation (AOF) and
fsync every query.
Of course, every configuration setting is a trade off between performance and durability.

KStreams: implementing session window with pocessor API

I need to implement a logic similar to session windows using processor API in order to have a full control over state store. Since processor API doesn't provide windowing abstraction, this needs to be done manually. However, I fail to find the source code for KStreams session window logic, to get some initial ideas (specifically regarding session timeouts).
I was expecting to use punctuate method, but it's a per processor timer rather than per key timer. Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
[UPDATE]
As an example, assume processor instance is processing K1 and stream time is incremented which causes the session for K2 to timeout. K2 may or may not exist at all. How do you know that there exists a specific key (like K2 when stream time is incremented (while processing a different key)? In other words when stream time is incremented, how do you figure out which windows are expired (because you don't know those keys exists)?
This is the DSL code: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java -- hope it helps.
It's unclear what your question is though -- it's mostly statements. So let me try to give some general answer.
In the DSL, sessions are close based on "stream time" progress. Only relying on the input data makes the operation deterministic. Using wall-clock time would introduce non-determinism. Hence, using a Punctuation is not necessary in the DSL implementation.
Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
Sessions in the DSL are based on keys and thus it's sufficient to scan the store on a per-key basis over a time range (as done via findSessions(...)).
Update:
In the DSL, each time a session window is updated, as corresponding update event is sent downstream immediately. Hence, the DSL implementation does not wait for "stream time" to advance any further but publishes the current (potentially intermediate) result right away.
To obey the grace period, the record timestamp is compared to "stream time" and if the corresponding session window is already closed, the record is skipped (cf. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java#L146). I.e., closing a window is just a logical step (not an actually operation); the session will still be stored and if a window is closed no additional event needs to be sent downstream because the final result was sent downstream in the last update to the window already.
Retention time itself must not be handled by the Processor implementation because it's a built-in feature of the SessionStore: internally, the session store maintains so-called "segments" that store sessions for a certain time period. Each time a put() is done, the store checks if old segments can be dropped (based on the timestamp provided by put()). I.e., old sessions are deleted lazily and as bulk deletes (i.e., all session of the whole segment will be deleted at once) as it's more efficient than individual deletes.

Spring Kafka and Re-balancing in the Middle of Processing

In this link https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html the section entitled, "Consuming Records with Specific Offsets" makes reference to a strategy of effectively updating the topic partition offsets in an external store "as you go", and then on partition revocation (e.g. a re-balancing) simply commit any interrupted transaction to the external store.
Now, I'm assuming that this strategy means that in the partition revocation callback that I don't need to process the passed-in TopicPartition collection for the offsets as any "in progress" transaction that was interrupted will be persisted and will contain the partition offsets that need to be committed/saved.
(If I'm wrong on this, please correct me.)
So, then, given that this is Spring Kafka and I'm making use of an #Transactional service to persist the necessary data, is the above strategy relevant/doable? In other words, I'm unsure of how I'd resume/commit anything marked as #Transactional since the transaction manager, boundary, etc. is all taken care of under the hood.
Is this even an issue? If so, what would be the best way to achieve this strategy? Manually track transactions (which sounds horrible across methods and callbacks)?
Or should I just go through the TopicPartition collection on partition revocation and update the partition offsets anyway?
Hopefully this makes sense as I'd like to make sure I get this right.
Thanks in advance.
Released September 2017
That book is quite old in Kafka terms; with modern versions; keeping the offsets in Kafka is much simpler; just make sure your consumer can process all the records returned by a poll() within max.poll.interval.ms in order to avoid a rebalance altogether.

boost.asio - do i need to use locks if sharing database type object between different async handlers?

I'm making a little server for a project, I have a log handler class which contains a log implemented as a map and some methods to act on it (add entry, flush to disk, commit etc..)
This object is instantiated in the server Class, and I'm passing the address to the session so each session can add entries to it.
The sessions are async, the log writes will happen in the async_read callback. I'm wondering if this will be an issue and if i need to use locks?
The map format is map<transactionId map<sequenceNum, pair<head, body>>, each session will access a different transactionId, so there should be no clashes as far as i can figure. Also hypothetically, if they were all writing to the same place in memory -- something large enough that the operation would not be atomic; would i need locks? As far as I understand each async method dispatches a thread to handle the operation, which would make me assume yes. At the same time I read that one of the great uses of async functions is the fact that synchronization primitives are not needed. So I'm a bit confused.
First time using ASIO or any type of asynchronous functions altogether, and i'm not a very experienced coder. I hope the question makes sense! The code seems to run fine so far, but i'm curios if it's correct.
Thank you!
Asynchronous handlers will only be invoked in application threads processing the io_service event loop via run(), run_one(), poll(), or poll_one(). The documentation states:
Asynchronous completion handlers will only be called from threads that are currently calling io_service::run().
Hence, for a non-thread safe shared resource:
If the application code only has one thread, then there is neither concurrency nor race conditions. Thus, no additional form of synchronization is required. Boost.Asio refers to this as an implicit strand.
If the application code has multiple threads processing the event-loop and the shared resource is only accessed within handlers, then synchronization needs to occur, as multiple threads may attempt to concurrently access the shared resource. To resolve this, one can either:
Protect the calls to the shared resource via a synchronization primitive, such as a mutex. This question covers using mutexes within handlers.
Use the same strand to wrap() the ReadHandlers. A strand will prevent concurrent invocation of handlers dispatched through it. For more details on the usage of strands, particularly for composed operations, such as async_read(), consider reading this answer.
Rather than posting the entire ReadHandler into the strand, one could limit interacting with the shared resource to a specific set of functions, and these functions are posted as CompletionHandlers to the same strand. This subtle difference between this and the previous solution is the granularity of synchronization.
If the application code has multiple threads and the shared resource is accessed from threads processing the event loop and from threads not processing the event loop, then synchronization primitives, such as a mutex, needs to be used.
Also, even if a shared resource is small enough that writes and reads are always atomic, one should prefer using explicit and proper synchronization. For example, although the write and read may be atomic, without proper memory fencing to guarantee memory visibility, a thread may not observe a chance in memory even though the actual memory has chanced. Boost.Asio's will perform the proper memory barriers to guarantee visibility. For more details, on Boost.Asio and memory barriers, consider reading this answer.

Low loading into cache speed

I'm using Infinispan 6.0.0 in a 3-node setup (distributed caching with 2 replicas for each entry, no writes into persistent store) and I'm just reading the file line-by-line and storing that lines' contents into the cache. The speed seems a bit low to me (I can achieve more writes onto the SSD (persistent storage) than into RAM with Infinispan), but there isn't any obvious bottleneck in the test code (I'm using buffered input streams, and their limits certainly aren't reached. As for now, I'm able to write 100K entries each ~45 seconds and that doesn't satisfy me. Assume simplified code snippet:
while ((s = reader.readLine()) != null) {
cache.put(s.substring(0,2), s.substring(2,5));
}
And CacheManager is created as follows:
return new DefaultCacheManager(
GlobalConfigurationBuilder.defaultClusteredBuilder()
.transport().addProperty("configurationFile", "jgroups.xml").build(),
new ConfigurationBuilder()
.clustering().cacheMode(CacheMode.DIST_ASYNC).hash().numOwners(2)
.transaction().transactionMode(TransactionMode.TRANSACTIONAL).lockingMode(LockingMode.OPTIMISTIC)
.build());
What could I be possibly doing wrong?
I am not fully aware of all the asynchronous mode specialities, but I'd afraid that something in the two-phase commit (Prepare and Commit) might force some blocking RPC => waiting for network latency => slow down.
Do you need transactional behaviour? If not, switch them off. If you really need it, you may disable just the autocommit feature and load the cluster via non-transactional operations. Or, you may try one phase commits.
Another option could be mass loading via putAll (with tens or hundreds of entries, depends on your entry size), but routing of this message is not really smart. In transactional mode it could behave a bit better, I guess.
The last option if you just want to load the cluster fast and then operate on it could be transferring the bulk data to each node without Infinispan (using your own JGroups channel, or just with sockets), and loading all nodes with the CACHE_MODE_LOCAL flag.
By default Infinispan follows the Map.put() contract of returning the previous value, so even though you are using the DIST_ASYNC cache mode you're still implicitly performing a synchronous cache.get() for every put.
You can avoid this in two ways:
configurationBuilder.unsafe().unreliableReturnValues(true) will suppress the remote lookup for all the operations on the cache.
cache.getAdvancedCache().withFlags(Flag.IGNORE_RETURN_VALUES).put(k, v) will suppress the remote lookup for a single operation.

Resources