I am using a Caffeine cache and I'm looking for a way to update a value in the cache without changing its expiry time.
The scenario is that I am using the cache to speed up data loading. A 5-second delay in data changes is acceptable, while I expect reads to be fast. Besides that, I want these cache entries to expire one day after their first hit, to avoid unnecessary memory use.
Thus, I want every cached key to last for one day while its value is updated every 5 seconds.
The refreshAfterWrite method seems close, but the first value returned after the refresh duration is still the old one. This is not ideal for me because the gap between two hits can be hours; in that case I still want a relatively fresh result (no more than 5 seconds old).
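For reference, such a refreshAfterWrite configuration would look roughly like the sketch below (the Key/Value types and loadValue are just placeholders); after the 5-second refresh interval, the next read still returns the stale value while the reload happens asynchronously.
LoadingCache<Key, Value> cache = Caffeine.newBuilder()
    .refreshAfterWrite(5, TimeUnit.SECONDS)
    .expireAfterWrite(24, TimeUnit.HOURS)
    .build(key -> loadValue(key));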
So I am trying to manually update each key.
First, I built a cache with a 24-hour expiry like this:
cache = Caffeine.newBuilder()
    .expireAfterWrite(24, TimeUnit.HOURS)
    .build();
Then I wrote a scheduled task that runs every 5 seconds, iterates over the keys in the cache, and does the following:
cache.asMap().computeIfPresent(key, mapperFunction);
Then I checked the age of the key:
cache.policy().expireAfterWrite().get().ageOf(key)
However, the age is not growing as expected. I think the computeIfPresent method is considered a "write", so the expiry time is reset as well.
Is there a way to update a value without changing its expiry time in Caffeine?
Or is there any other approach for my scenario?
A write is the creation or update of a mapping, so expireAfterWrite is not a good fit for you. Instead you can set a custom expiration policy that sets the initial duration and does nothing on a read or update. This is done using expireAfter(Expiry), such as
LoadingCache<Key, Graph> graphs = Caffeine.newBuilder()
    .expireAfter(new Expiry<Key, Graph>() {
      public long expireAfterCreate(Key key, Graph graph, long currentTime) {
        return TimeUnit.HOURS.toNanos(24);
      }
      public long expireAfterUpdate(Key key, Graph graph,
          long currentTime, long currentDuration) {
        return currentDuration;
      }
      public long expireAfterRead(Key key, Graph graph,
          long currentTime, long currentDuration) {
        return currentDuration;
      }
    })
    .build(key -> createExpensiveGraph(key));
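To tie this back to the question's 5-second refresh: because expireAfterUpdate returns currentDuration, a periodic computeIfPresent no longer resets the remaining lifetime. A minimal sketch of that refresh loop (the scheduler wiring here is illustrative, not part of Caffeine):
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    for (Key key : graphs.asMap().keySet()) {
        // counts as an update, so the entry keeps its original 24-hour deadline
        graphs.asMap().computeIfPresent(key, (k, oldGraph) -> createExpensiveGraph(k));
    }
}, 5, 5, TimeUnit.SECONDS);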
Related
Is there a way to remove Caffeine entries based on a timestamp condition? E.g.,
at time T1 I have the following entries:
K1 -> V1
K2 -> V2
K3 -> V3
At time T2 I update only K2 and K3. (I don't know if both entries will have the exact same timestamp; K2 might have T2 while K3 might be T2 plus some nanos. But for the sake of this question, let's assume they do.)
Now I want Caffeine to invalidate the entry K1 -> V1 because T1 < T2.
One way to do this is to iterate over the entries and check whether their write timestamp is < T2, collect such keys, and in the end call invalidateAll(keys).
Maybe there is a non-iterative way?
If you are using expireAfterWrite, then you can obtain a snapshot of entries in timestamp order. As this call requires obtaining the eviction lock, it provides an immutable snapshot rather than an iterator. That is messy, e.g. you have to provide a limit which might not be correct, and it only works if expireAfterWrite is configured.
Duration maxAge = Duration.ofMinutes(1);
cache.policy().expireAfterWrite().ifPresent(policy -> {
  Map<K, V> oldest = policy.oldest(1_000);
  for (K key : oldest.keySet()) {
    // Remove everything written more than 1 minute ago
    policy.ageOf(key)
        .filter(duration -> duration.compareTo(maxAge) > 0)
        .ifPresent(duration -> cache.invalidate(key));
  }
});
If you maintain the timestamp yourself, then an unordered iteration is possible using the cache.asMap() view. That's likely the simplest and fastest option.
long cutoff = ...
var keys = cache.asMap().entrySet().stream()
    .filter(entry -> entry.getValue().timestamp() < cutoff)
    .collect(toList());
cache.invalidateAll(keys);
An approach that won't work, but is worth mentioning to explain why, is variable expiration, expireAfter(Expiry). You can set a new duration on every read based on the prior setting. This takes effect after the entry is returned to the caller, so while you can expire an entry immediately, it will still serve K1 (at least) once.
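For illustration, a sketch of that variable-expiration idea, assuming a Value type with the timestamp() accessor used in the snippets here; a read of a stale entry sets its duration to zero, but that read still hands the stale value back to the caller, which is exactly the shortcoming described above.
Duration maxAge = Duration.ofMinutes(1);
Cache<String, Value> cache = Caffeine.newBuilder()
    .expireAfter(new Expiry<String, Value>() {
      public long expireAfterCreate(String key, Value value, long currentTime) {
        return maxAge.toNanos();
      }
      public long expireAfterUpdate(String key, Value value,
          long currentTime, long currentDuration) {
        return maxAge.toNanos();
      }
      public long expireAfterRead(String key, Value value,
          long currentTime, long currentDuration) {
        long cutoff = System.currentTimeMillis() - maxAge.toMillis();
        // expire immediately if stale; the current read still returns this value
        return (value.timestamp() < cutoff) ? 0 : currentDuration;
      }
    })
    .build();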
Otherwise you could validate at retrieval time outside of the cache and rely on size eviction. The flaw with this approach is that it pollutes the cache with dead entries.
V value = cache.get(key);
if (value.timestamp() < cutoff) {
  cache.asMap().remove(key, value);
  return cache.get(key); // load a new value
}
return value;
Or you could maintain your own write-order queue, etc. All of these get messier the fancier you get. For your case, a full iteration is likely the simplest and least error-prone approach.
Scenario
We have over 5 million documents in a bucket, and all of them are nested JSON with a simple UUID key. We want to add one extra field to ALL of the documents.
Example
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
"field1":"data1",
"field2":{
"nested_field1":"data2"
}
}
After adding extra field
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
"field1":"data1",
"field3":"data3",
"field2":{
"nested_field1":"data2"
}
}
It has only one primary index: CREATE PRIMARY INDEX idx ON bucket.
Problem
It takes ages. We tried it with N1QL: UPDATE bucket SET field3 = "data3". We also tried sub-document mutation. But all of it takes hours. It's written in Go, so we could put it into goroutines, but it still takes too much time.
Question
Is there any solution to reduce that time?
Since you need to add a new field rather than modify an existing one, it is better to use the SDK's sub-document (SUBDOC) API than a N1QL UPDATE (a N1QL UPDATE rewrites the whole document and requires fetching it first).
The best option is to use N1QL to get the document keys and then use the SDK SUBDOC API to add the field you need. You can use the reactive API (asynchronously).
You have 5M documents and a primary index, so use keyset pagination along these lines (see the sketch after the link below):
val = ""
In a loop:
    SELECT RAW META().id FROM mybucket WHERE META().id > $val LIMIT 10000;
    SDK SUBDOC update for each returned key
    val = last key returned by the SELECT
https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/
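The application in the question is written in Go, but here is a rough sketch of the same keyset-pagination plus sub-document pattern using the Couchbase Java SDK 3.x; the connection details, bucket name, and batch size are placeholders, and the reactive API (collection.reactive()) could be used to issue the mutateIn calls concurrently.
import java.util.List;

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.MutateInSpec;
import com.couchbase.client.java.query.QueryOptions;
import com.couchbase.client.java.query.QueryResult;

Cluster cluster = Cluster.connect("couchbase://localhost", "user", "password");
Collection collection = cluster.bucket("mybucket").defaultCollection();

String lastKey = "";
while (true) {
    // keyset pagination over the primary index: fetch the next batch of keys
    QueryResult result = cluster.query(
        "SELECT RAW META().id FROM mybucket WHERE META().id > $val ORDER BY META().id LIMIT 10000",
        QueryOptions.queryOptions().parameters(JsonObject.create().put("val", lastKey)));
    List<String> ids = result.rowsAs(String.class);
    if (ids.isEmpty()) {
        break; // no more documents
    }
    for (String id : ids) {
        // sub-document mutation: only field3 is written, the document is never fetched
        collection.mutateIn(id, List.of(MutateInSpec.upsert("field3", "data3")));
    }
    lastKey = ids.get(ids.size() - 1); // continue after the last key seen
}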
The Eventing Service can be quite performant for this sort of enrichment task. Even a low-end system should be able to do 5M rows in under two (2) minutes.
// Note: src_bkt is an alias to the source bucket for your handler
// in read+write mode, supported in version 6.5.1+. This uses DCP
// and can be 100X more performant than N1QL.
function OnUpdate(doc, meta) {
    // optional filter to be more selective
    // if (!doc.type || doc.type !== "mytype") return;
    // test whether we already have the field we want to add
    if (doc.field3) return;
    doc.field3 = "data3";
    src_bkt[meta.id] = doc;
}
For more details on Eventing refer to https://docs.couchbase.com/server/current/eventing/eventing-overview.html. I typically enrich 3/4 of a billion documents. The Eventing function will also run faster (enrich more documents per second) if you increase the number of workers in your Eventing function's settings from, say, 3 to 16, provided you have 8+ physical cores on your Eventing node.
I tested the above Eventing function and it enriched 5M documents (modeled on your example) on my non-MDS single-node Couchbase test system (12 cores at 2.2 GHz) in just 72 seconds. Obviously if you have a real multi-node cluster it will be faster (maybe all 5M docs in just 5 seconds).
Need some opinion/help around one use case of KStream/KTable usage.
Scenario:
I have 2 topics with a common key, requestId.
input_time(requestId,StartTime)
completion_time(requestId,EndTime)
The data in input_time is populated at time t1 and the data in completion_time is populated at t1+n (n being the time taken for the process to complete).
Objective
To compare the time taken by a request by joining data from the two topics, and raise an alert in case a threshold time is breached.
It may happen that the process fails and the data never arrives on the completion_time topic for that request.
In that case we intend to check whether the current time is well past a specific threshold (let's say 5 s) since the start time.
input_time(req1,100) completion_time(req1,104) --> no alert to be raised, as 104-100 < 5 (configured value)
input_time(req2,100) completion_time(req2,108) --> alert to be raised with (req2,108), as 108-100 > 5
input_time(req3,100) completion_time: no record --> if the current time is beyond 105, raise an alert with (req3, currentSysTime), as currentSysTime - 100 > 5
Options Tried.
1) Tried both KTable-KTable and KStream-KStream outer joins, but the third case always fails.
final KTable<String, Long> startTimeTable =
    builder.table("input_time", Consumed.with(Serdes.String(), Serdes.Long()));
final KTable<String, Long> completionTimeTable =
    builder.table("completion_time", Consumed.with(Serdes.String(), Serdes.Long()));
KTable<String, Long> thresholdBreached = startTimeTable.outerJoin(completionTimeTable,
    new MyValueJoiner());
thresholdBreached.toStream().filter((k, v) -> v != null)
    .to("finalTopic", Produced.with(Serdes.String(), Serdes.Long()));
Joiner
public Long apply(Long startTime, Long endTime) {
    // if the input record itself is not available then we can't do any alerting
    if (null == startTime) {
        log.info("AlertValueJoiner check: the start time itself is null so returning null");
        return null;
    }
    // current processing time is the time used
    long currentTime = System.currentTimeMillis();
    log.info("Checking startTime {} end time {} sysTime {}", startTime, endTime, currentTime);
    if (null == endTime && currentTime - startTime > 5000) {
        log.info("Alert: no corresponding record from file completion yet currentTime {} startTime {}",
                currentTime, startTime);
        return currentTime - startTime;
    } else if (null != endTime && endTime - startTime > 5000) {
        log.info("Alert: threshold breach for file completion startTime {} endTime {}",
                startTime, endTime);
        return endTime - startTime;
    }
    return null;
}
2) Tried the custom-logic approach recommended in the thread
How to manage Kafka KStream to Kstream windowed join?
This approach stopped working for scenarios 2 and 3.
Is there any way of handling all three scenarios using the DSL or the Processor API?
I'm not sure if we can use some kind of punctuator to listen for when the window changes, check the stream records in the current window, and, if there is no matching record found, produce a result with the system time.
Due to the nature of the logic involved, it had to be done with a combination of the DSL and the Processor API:
Used a custom transformer and state store to compare against the configured values (cases 1 and 2).
Added a punctuator based on wall-clock time to handle the 3rd case.
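Below is a minimal sketch of that combination, not the exact production code: it assumes epoch-millisecond timestamps, the 5-second threshold from the question, and a state store named "start-times". One transformer records start times and uses a wall-clock punctuator to alert on requests that never complete; a second transformer checks completions against the same store.
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("start-times"), Serdes.String(), Serdes.Long()));

KStream<String, Long> starts =
        builder.stream("input_time", Consumed.with(Serdes.String(), Serdes.Long()));
KStream<String, Long> completions =
        builder.stream("completion_time", Consumed.with(Serdes.String(), Serdes.Long()));

// Case 3: remember each start time; a wall-clock punctuator raises an alert for
// requests that have not completed within the threshold.
starts.transform(() -> new Transformer<String, Long, KeyValue<String, Long>>() {
    private ProcessorContext context;
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, Long>) context.getStateStore("start-times");
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            List<String> overdue = new ArrayList<>();
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (now - entry.value > 5000) {
                        context.forward(entry.key, now - entry.value);
                        overdue.add(entry.key);
                    }
                }
            }
            overdue.forEach(store::delete); // alert only once per request
        });
    }

    @Override
    public KeyValue<String, Long> transform(String requestId, Long startTime) {
        store.put(requestId, startTime);
        return null; // nothing is emitted for the start record itself
    }

    @Override
    public void close() { }
}, "start-times").to("finalTopic", Produced.with(Serdes.String(), Serdes.Long()));

// Cases 1 and 2: compare the completion time against the stored start time.
completions.transform(() -> new Transformer<String, Long, KeyValue<String, Long>>() {
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.store = (KeyValueStore<String, Long>) context.getStateStore("start-times");
    }

    @Override
    public KeyValue<String, Long> transform(String requestId, Long endTime) {
        Long startTime = store.get(requestId);
        store.delete(requestId); // completed, stop tracking it
        return (startTime != null && endTime - startTime > 5000)
                ? KeyValue.pair(requestId, endTime - startTime) : null;
    }

    @Override
    public void close() { }
}, "start-times").to("finalTopic", Produced.with(Serdes.String(), Serdes.Long()));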
My question is about a best practice.
I'm sending every user page view to Redis. What I want to do is, for every user:
There should be a list of pages that user has viewed.
That list needs to have a limit of at most 20 items.
Lastly, that list needs to have an expiration time (key expiration).
The implementation is (I'm using StackExchange.Redis in the applications):
var transaction = _cache.CreateTransaction();
transaction.ListLeftPushAsync(key, JsonConvert.SerializeObject(value), When.Always, CommandFlags.FireAndForget);
transaction.KeyExpireAsync(key, TimeSpan.FromDays(Constants.Constants.TopnUserCacheDurationinDays), CommandFlags.FireAndForget);
if (new Random().Next(7) == 6)
{
    transaction.ListTrimAsync(key, 0, Constants.Constants.TopNUserHistoryLimit, CommandFlags.FireAndForget);
}
return transaction.ExecuteAsync();
The question is: is this implementation good for my needs? I feel that something is not right, because there are ~300 requests per second, so I'm sending one request with three commands 300 times a second.
Any suggestions?
I have a document structure, JobData, that stores time-based data starting from time 0 to time t, in ticks. Usually the data is one document per second.
public class JobData
{
    public long Ticks { get; set; }
    public double JobValue { get; set; }
}
For simplicity I am showing only one parameter, JobValue, but in reality it is a complex graph of data. My question is: given an input time in ticks, what kind of query would be best for finding the last JobData at or before that tick?
So if the database has a document at 1000 ticks and the next one at 2000 ticks, and the user wants the state at 1500 ticks, he/she should get the JobData at 1000 ticks as the answer.
The query I am using now is:
var jobData = documentSession.Query<JobData>().Where(t => t.Ticks <= 1500).OrderByDescending(t => t.Ticks).FirstOrDefault();
Is this the right and most efficient query? I have thousands of these JobData documents and just want to get to the one that is closest.
Thanks!
Ahmad,
Yes, that is the way to go about it. And it would be very fast.