Is there a way to remove Caffeine entries based on a timestamp condition? Eg.,
At T1 I have following entries
K1 -> V1
K2 -> V2
K3 -> V3
At time T2 I update only K2 and K3. (I don't know if both entries will have exactly the same timestamp; K2 might have T2 while K3 might have T2 plus some nanos. But for the sake of this question, let's assume they do.)
Now I want caffeine to invalidate entry K1 -> V1 because T1 < T2.
One way to do this is to iterate over the entries and check whether their write timestamp is < T2, collect such keys, and in the end call invalidateAll(keys).
Maybe there is a non-iterative way?
If you are using expireAfterWrite, then you can obtain a snapshot of the entries in timestamp order. As this call requires obtaining the eviction lock, it provides an immutable snapshot rather than an iterator. That is messy, though: you have to provide a limit, which might not be large enough, and it only works if that expiration policy is configured.
Duration maxAge = Duration.ofMinutes(1);
cache.policy().expireAfterWrite().ifPresent(policy -> {
  Map<K, V> oldest = policy.oldest(1_000);
  for (K key : oldest.keySet()) {
    // Remove everything written more than 1 minute ago
    policy.ageOf(key)
        .filter(duration -> duration.compareTo(maxAge) > 0)
        .ifPresent(duration -> cache.invalidate(key));
  }
});
If you maintain the timestamp yourself, then an unordered iteration is possible using the cache.asMap() view. That's likely the simplest approach, and it is fast.
long cutoff = ...
var keys = cache.asMap().entrySet().stream()
    .filter(entry -> entry.getValue().timestamp() < cutoff)
    .map(Map.Entry::getKey)
    .collect(toList());
cache.invalidateAll(keys);
An approach that won't work, but is worth mentioning to explain why, is variable expiration, expireAfter(expiry). You can set a new duration on every read based on the prior setting. This takes effect after the entry has been returned to the caller, so while you can expire it immediately, it will still serve K1 (at least) once.
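For completeness, a minimal sketch of that variable-expiration setup, assuming a hypothetical TimestampedValue type that carries its own write timestamp (the type, key type, and cutoff are illustrative, not from the question):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;

// Hypothetical value type that carries its own write timestamp.
record TimestampedValue(long timestamp, String data) {}

long cutoff = System.currentTimeMillis(); // stand-in for T2; illustrative only

Cache<String, TimestampedValue> cache = Caffeine.newBuilder()
    .expireAfter(new Expiry<String, TimestampedValue>() {
      @Override public long expireAfterCreate(String key, TimestampedValue value, long currentTime) {
        return Long.MAX_VALUE; // no fixed lifetime when the entry is written
      }
      @Override public long expireAfterUpdate(String key, TimestampedValue value,
          long currentTime, long currentDuration) {
        return Long.MAX_VALUE; // an update makes the entry fresh again
      }
      @Override public long expireAfterRead(String key, TimestampedValue value,
          long currentTime, long currentDuration) {
        // Expire immediately if stale, but this read has already returned the stale value.
        return (value.timestamp() < cutoff) ? 0L : currentDuration;
      }
    })
    .build();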
Otherwise you could validate at retrieval time, outside of the cache, and rely on size eviction. The flaw with this approach is that it pollutes the cache with dead entries.
V value = cache.get(key);
if (value.timestamp() < cutoff) {
  cache.asMap().remove(key, value);
  return cache.get(key); // load a new value
}
return value;
Or you could maintain your own write-order queue, etc. All of these get messier the fancier you get. For your case, a full iteration is likely the simplest and least error-prone approach.
I need some opinions/help on one use case of KStream/KTable usage.
Scenario:
I have 2 topics with a common key, requestId:
input_time(requestId,StartTime)
completion_time(requestId,EndTime)
The data in input_time is populated at time t1 and the data in completion_time is populated at t1+n (n being the time taken for the process to complete).
Objective
To compare the time taken for a request by joining data from the two topics, and raise an alert in case of a breach of a threshold time.
It may happen that the process fails and the data never arrives on the completion_time topic for that request.
In that case we intend to check whether the current time is already well past a specific threshold (let's say 5s) after the start time.
input_time(req1,100), completion_time(req1,104) --> no alert to be raised, as 104 - 100 < 5 (the configured value)
input_time(req2,100), completion_time(req2,108) --> alert to be raised with (req2,108), as 108 - 100 > 5
input_time(req3,100), no completion_time record --> if the current time is beyond 105, raise an alert with (req3,currentSysTime), as currentSysTime - 100 > 5
Options Tried.
1) Tried both KTable-KTable and KStream-KStream outer joins, but the third case always fails.
final KTable<String, Long> startTimeTable =
    builder.table("input_time", Consumed.with(Serdes.String(), Serdes.Long()));
final KTable<String, Long> completionTimeTable =
    builder.table("completion_time", Consumed.with(Serdes.String(), Serdes.Long()));
KTable<String, Long> thresholdBreached = startTimeTable.outerJoin(completionTimeTable,
    new MyValueJoiner());
thresholdBreached.toStream().filter((k, v) -> v != null)
    .to("finalTopic", Produced.with(Serdes.String(), Serdes.Long()));
Joiner
public Long apply(Long startTime, Long endTime) {
    // If the input record itself is not available then we can't alert at all.
    if (null == startTime) {
        log.info("AlertValueJoiner check: the start time itself is null so returning null");
        return null;
    }
    // Current processing time is the time used.
    long currentTime = System.currentTimeMillis();
    log.info("Checking startTime {} end time {} sysTime {}", startTime, endTime, currentTime);
    if (null == endTime && currentTime - startTime > 5000) {
        log.info("Alert: no corresponding record from file completion yet, currentTime {} startTime {}",
                currentTime, startTime);
        return currentTime - startTime;
    } else if (null != endTime && endTime - startTime > 5000) {
        log.info("Alert: threshold breach for file completion startTime {} endTime {}",
                startTime, endTime);
        return endTime - startTime;
    }
    return null;
}
2) Tried the custom logic approach recommended in the thread
How to manage Kafka KStream to Kstream windowed join?
-- this approach stopped working for scenarios 2 and 3.
Is there any way of handling all three scenarios using the DSL or the Processor API?
I'm not sure whether we could use some kind of punctuator to listen for when the window changes, check the stream records in the current window, and, if no matching record is found, produce a result with the system time.
Due to the nature of the logic involved, this had to be done with a combination of the DSL and the Processor API:
Used a custom transformer and a state store to compare against the configured values (cases 1 & 2).
Added a punctuator based on wall-clock time to handle the 3rd case.
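For illustration, here is a rough, hypothetical Java sketch of that combination, not the actual implementation: it assumes the transformer runs on the completion_time stream, that a sibling step populates a state store named "start-times" with requestId -> startTime from input_time, and that the 5s threshold is hard-coded.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Hypothetical sketch: emits (requestId, duration) whenever the threshold is breached.
public class ThresholdAlertTransformer implements Transformer<String, Long, KeyValue<String, Long>> {

    private static final long THRESHOLD_MS = 5_000L; // illustrative configured value

    private ProcessorContext context;
    private KeyValueStore<String, Long> startTimes;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.startTimes = (KeyValueStore<String, Long>) context.getStateStore("start-times");

        // Wall-clock punctuator covers case 3: a start time with no completion record at all.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            List<KeyValue<String, Long>> breached = new ArrayList<>();
            try (KeyValueIterator<String, Long> it = startTimes.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    long elapsed = System.currentTimeMillis() - entry.value;
                    if (elapsed > THRESHOLD_MS) {
                        breached.add(KeyValue.pair(entry.key, elapsed));
                    }
                }
            }
            for (KeyValue<String, Long> alert : breached) {
                context.forward(alert.key, alert.value); // raise the alert
                startTimes.delete(alert.key);            // avoid alerting again
            }
        });
    }

    @Override
    public KeyValue<String, Long> transform(String requestId, Long endTime) {
        Long startTime = startTimes.get(requestId);
        if (startTime == null) {
            return null; // completion without a known start; nothing to compare
        }
        startTimes.delete(requestId);
        long duration = endTime - startTime;
        // Cases 1 & 2: forward an alert only if the configured threshold is breached.
        return duration > THRESHOLD_MS ? KeyValue.pair(requestId, duration) : null;
    }

    @Override
    public void close() {
    }
}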
I want to get a DistributionSummary over some domain data that does not change very frequently, so it is not about monitoring requests or something like that.
Let's take the number of seats in an office as an example. The value for each office can change from time to time, and offices can be added and removed.
So now I need the current DistributionSummary over all offices, which I think needs to be recalculated every time (similar to a Gauge).
I have a Spring Boot 2 app with Micrometer, collect the metrics with Prometheus, and display them in Grafana.
What I tried so far:
When I register a DistributionSummary, I can record all the values once during startup. This gives me the distribution, but calculated values like max get lost over time, and I cannot update the DistributionSummary (recording new offices would work, but not changing existing ones).
// during startup
seatsInOffice = DistributionSummary.builder("office.seats")
    .publishPercentileHistogram()
    .sla(1, 5, 20, 50)
    .register(meterRegistry);
officeService.getAllOffices().forEach(o -> seatsInOffice.record(o.getNumberOfSeats()));
I also tried using a @Scheduled task to remove and completely rebuild the DistributionSummary. This seems to work, but it feels wrong somehow. Would that be a recommended approach? It would also probably need some synchronisation so that the metrics are not collected between removing and recalculating the distribution.
@Scheduled(fixedRate = 5 * 60 * 1000)
public void recalculateMetrics() {
    if (seatsInOffice != null) {
        meterRegistry.remove(seatsInOffice);
    }
    seatsInOffice = DistributionSummary.builder("office.seats")
        .publishPercentileHistogram()
        .sla(1, 5, 20, 50)
        .register(meterRegistry);
    officeService.getAllOffices().forEach(o -> seatsInOffice.record(o.getNumberOfSeats()));
}
Another problem I just recognized with this approach: the /actuator/prometheus endpoint still returns the values for the old (removed) metrics, so everything is there multiple times.
For something like the SLA boundaries I could also use some gauges to provide the values (by calculating them myself), but that would not give me quantiles. Is it possible to create a new DistributionSummary without registering it and just provide the values it collected somehow?
meterRegistry.gauge("office.seats", Tags.of("le", "1"), officeService,
    x -> x.getAllOfficesWithLessThanXSeats(1).size());
meterRegistry.gauge("office.seats", Tags.of("le", "5"), officeService,
    x -> x.getAllOfficesWithLessThanXSeats(5).size());
meterRegistry.gauge("office.seats", Tags.of("le", "20"), officeService,
    x -> x.getAllOfficesWithLessThanXSeats(20).size());
meterRegistry.gauge("office.seats", Tags.of("le", "50"), officeService,
    x -> x.getAllOfficesWithLessThanXSeats(50).size());
I would like to have a DistributionSummary that takes a lambda or something like that to get the values. But maybe these tools are not made for this use case and I should use something else. Can you recommend something?
DistributionSummary has a configuration option, distributionStatisticExpiry, that controls how recorded data is rotated out; it is a workaround.
But PrometheusDistributionSummary won't use this field:
case Prometheus:
    histogram = new TimeWindowFixedBoundaryHistogram(clock, DistributionStatisticConfig.builder()
            .expiry(Duration.ofDays(1825)) // effectively never roll over
            .bufferLength(1)
            .build()
            .merge(distributionStatisticConfig), true);
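For registries that do honour these settings (i.e. not Prometheus, as shown above), configuring the rotation on the builder would look roughly like this sketch; the 5-minute window and buffer length are illustrative, and meterRegistry is assumed from the surrounding context:

import java.time.Duration;
import io.micrometer.core.instrument.DistributionSummary;

DistributionSummary seatsInOffice = DistributionSummary.builder("office.seats")
    .publishPercentileHistogram()
    .sla(1, 5, 20, 50)
    .distributionStatisticExpiry(Duration.ofMinutes(5))  // rotate recorded values out
    .distributionStatisticBufferLength(1)
    .register(meterRegistry);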
I have a Spark stream into which records are flowing, and the batch interval is 1 second.
I want to union all the data in the stream, so I have created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I am expecting this empty RDD to have all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();

JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
    .transform(rdd -> rdd.union(records));

transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset<Row> ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have transformedMessages = messages (leaving aside the flatMap function, which is not relevant for this discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema);, records is still empty. That does not change anywhere in the flow of the program, so it will remain empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
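In Java that could look roughly like the following sketch; note that records would have to become a mutable instance field, because a Java lambda cannot reassign a captured local variable:

// Rough Java equivalent of the Scala snippet above; 'records' is assumed to be an
// instance field of type JavaRDD<Row> rather than a local variable.
transformedMessages.foreachRDD(rdd -> {
    records = rdd.union(records); // accumulate every micro-batch into 'records'
});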
That said, please note that this process iteratively adds to the lineage of the records RDD and will accumulate all data over time. It is not a job that can run stably for a long period: eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach does not seem to be scalable or sustainable.
I have a pattern like this... pseudo-code, but I think it makes sense...
type K                  // key, function of records in B
class A                 // compact data structure
val a: RDD[(K, A)]      // many records

class B {               // massive data structure
  def funcIter          // does full O(n) scans of huge data structure
}
val b: RDD[(K, B)]      // comparatively few records
val emptyB = new B("", Nil, etc.)

val C: RDD[(A, B)] = {
  a
    .leftOuterJoin(b, numPartitions)  // numPartitions ~ 1.5x increase in partitions
    .map { case (k, (val_a, option_b)) => (val_a, option_b.getOrElse(emptyB)) }
    .map { case (val_a, val_b) => (val_a, val_b.funcIter(val_a.attributes)) }
}
My problem is that the records in val b vary enormously in size, with some quite enormous, and since it's a leftOuterJoin, each of those records is replicated thousands or tens of thousands of times to join to val a. So it's not just that there are large values in b to handle, but that the worst-case records in b end up copied many times into one partition after the join. The worst partitions are almost exclusively made up of many copies of only the worst-case values from b, so my last few partitions take ages to work through while most of my enormous cluster sits idle, draining my wallet.
Is there anything I can do to modify this pattern? I could try broadcasting b and joining with a in place (it's probably too big), or splitting partitions after the join, perhaps separating copies of the worst b values into different partitions without doing another shuffle (like the opposite of a coalesce), so that at least multiple executors on the same core instance (I have 3 executors per core instance) can work on those records in parallel.
Thanks for any advice.
I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs:
val a: RDD[(Int, Foo)]
val b: RDD[(Int, Foo)]
val aStructure = a.reduceByKey(/* reduce into large data structure */)
b.mapPartitions { iter =>
  val usefulItem = aStructure(samePartitionKey)
  iter.map(/* process iterator */)
}
How could I go about setting up the partitioning such that the specific data structure I need will be present for the mapPartitions, without the extra overhead of sending over all values (which would happen if I were to use a broadcast variable)?
One thought I have been having is to store the objects in HDFS but I'm not sure if that would be a suboptimal solution.
Another thought I am currently exploring is whether there is some way I can create a custom Partitioner that could hold the data structure (although that might get too complicated and become problematic).
thank you for your help!
edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a MapPartitions within the RDD of vectors where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition and doing a join would cause me to have a lot of copies of that index.
val vectors: RDD[(Int, SparseVector)]
val invertedIndexes: RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions { iter =>
  val invIndex = invertedIndexes(samePartitionKey)
  iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given a key, returns the partition that element belongs to. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned in a specific way while doing the reduce.
See the example:
// create example data
val a = sc.parallelize(List((1, 1), (1, 2), (2, 3), (2, 4)))

// create a simple sample partitioner - 2 partitions, one for odd
// and one for even key.hashCode. You should put your partitioning logic here
val p = new Partitioner {
  def numPartitions: Int = 2
  def getPartition(key: Any): Int = key.hashCode % 2
}

// your reduceByKey function. Sample: just add
val f = (a: Int, b: Int) => a + b

val rdd = a.reduceByKey(p, f)

// here your rdd will be partitioned the way you want, with the number
// of partitions you want
rdd.partitions.size
// res8: Int = 2

rdd.map(/* ... */) // go on with your processing