Kafka Streams - Creating Windowed State Store - apache-kafka-streams

The code below "works", but I am confused about the meaning of the values passed to Stores.persistentWindowStore(). I found the documentation (https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/state/Stores.html#persistentWindowStore-java.lang.String-long-int-long-boolean-) but the definitions of the args are not clear to me.
Should the windowBy() value always match windowSize in persistentWindowStore()?
What should the retention period be set to? The retention policy of the source topic?
What do the number of segments do?
What is retaining duplicates for? The documentation seems to indicate it should be set to true for joins?
long windowSize = TimeUnit.MINUTES.toMillis(15);
long retentionPeriod = windowSize * 4 * 6; // 6 hours
int numSegments = 2;
boolean retainDuplicates = false;
bdrStream.groupByKey().windowedBy(TimeWindows.of(windowSize))
    .aggregate(() -> Lists.newArrayList(),
        (aggKey, newValue, aggValue) -> {
            BdrData d = new BdrData();
            d.setCharge(newValue.getBdr().getCost());
            aggValue.add(d);
            return aggValue;
        },
        Materialized.<String, ArrayList<BdrData>>as(
                Stores.persistentWindowStore("store5",
                        retentionPeriod,
                        numSegments,
                        windowSize,
                        retainDuplicates))
            .withKeySerde(Serdes.String())
            .withValueSerde(listBdrDataSerde))
    .toStream()
    .process(() -> new WindowAggregatorProcessor());

Should the windowBy() value always match windowSize in persistentWindowStore()?
Yes.
What should the retention period be set to? The retention policy of the source topic?
It should match the retention period of the windows, which you can specify via Windows#until() (the default is 1 day).
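For example, the window definition could be aligned with the store's retention like this (a minimal sketch against the pre-2.1 API, reusing the windowSize and retentionPeriod variables from the question):

import org.apache.kafka.streams.kstream.TimeWindows;

// Window retention (until) matches the store's retentionPeriod.
TimeWindows windows = TimeWindows.of(windowSize).until(retentionPeriod);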
What do the number of segments do?
The number of segments determines how coarsely or finely old windows are expired. The segment size will be "retention-period / (#segments + 1)"; with the 6-hour retention period and 2 segments from your code, each segment covers 2 hours. Note that more segments give you more fine-grained data expiration but increase overhead (each segment uses its own RocksDB instance).
What are retaining duplicates for? Document seems to indicate to set to true for joins?
By default, keys must be unique. If you enable retain duplicates, you can store the same key multiple times. Enabling duplicates comes with a performance hit.
Note:
This part of the API was reworked and simplified in the upcoming 2.1 release. See KIP-319 and KIP-328 for details.
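For example, on 2.1 the same store might be declared roughly like this (a sketch, assuming the Duration-based overload those KIPs introduce):

import java.time.Duration;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowBytesStoreSupplier;

// 2.1-style declaration: no segments parameter, sizes given as Durations.
WindowBytesStoreSupplier supplier = Stores.persistentWindowStore(
        "store5",
        Duration.ofHours(6),     // retention period
        Duration.ofMinutes(15),  // window size
        false);                  // retain duplicates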

Related

Apache Flink 1.6.0 - StateTtlConfig and ListState

I am in the process of implementing a proof-of-concept stream processing system using Apache Flink 1.6.0 and am storing a list of received events, partitioned by key, in a ListState. (Don't worry about why I am doing this, just work with me here.) I have a StateTtlConfig set on the corresponding ListStateDescriptor. Per the documentation:
"All state collection types support per-entry TTLs. This means that list elements and map entries expire independently."
"Currently, expired values are only removed when they are read out explicitly, e.g. by calling ValueState.value()."
Question 1
Which of the following constitutes a read of the ListState:
Requesting the iterator but not using it: myListState.get()
Actually using the iterator: for (MyItem i : myListState.get()) { ... }
Question 2
What does "per-entry TTL" actually mean? Specifically, what I'm asking about is the following:
Assume I have a specific instance of ListState<Character>. The descriptor has a TTL of 10 seconds. I insert 'a'. Two seconds later, I insert 'b'. Nine seconds later, I insert 'c'. If I iterate over this ListState, which items will be returned?
In other words:
ListState<Character> ls = getRuntimeContext().getListState(myDescriptor);
ls.add('a');
// ...two seconds later...
ls.add('b');
// ...nine seconds later...
ls.add('c');
// Does this iterate over 'a', 'b', 'c'
// or just 'b' and 'c'?
for (Character myChar : ls.get()) { ... }
Answer 1
The answer is 1. For ListState, the pruning is done when myListState.get() is called.
Answer 2
"per-entry TTL" means the timeout is applied to a single entry rather than whole collection. For your example assuming at the point of reading 10 seconds passed since inserting the a it will iterate over b and c. a is going to be pruned in this case.

Algorithm to time-sort N data streams

So I've got N asynchronous, timestamped data streams. Each stream has a fixed-ish rate. I want to process all of the data, but the catch is that I must process the data in order as close to the time that the data arrived as possible (it is a real-time streaming application).
So far, my implementation has been to create a fixed window of K messages which I sort by timestamp using a priority queue. I then process the entirety of this queue in order before moving on to the next window. This is okay, but it's less than ideal because it creates lag proportional to the size of the buffer, and will also sometimes drop messages that arrive just after the end of the buffer has been processed. It looks something like this:
// Priority queue keeping track of the data in timestamp order.
ThreadSafePriorityQueue<Data> q;

// Fixed buffer size
int K = 10;

// The last successfully processed data timestamp
time_t lastTimestamp = -1;

// Called for each of the N data streams asynchronously
void receiveAsyncData(const Data& dat) {
    q.push(dat.timestamp, dat);
    if (q.size() > K) {
        processQueue();
    }
}

// Process all the data in the queue.
void processQueue() {
    while (!q.empty()) {
        const auto& data = q.top();
        // If the data is too old, drop it.
        if (data.timestamp < lastTimestamp) {
            LOG("Dropping message. Too old.");
            q.pop();
            continue;
        }
        // Otherwise, process it.
        processData(data);
        lastTimestamp = data.timestamp;
        q.pop();
    }
}
Information about the data: they're guaranteed to be sorted within their own stream. Their rates are between 5 and 30 Hz. They consist of images and other bits of data.
Some examples of why this is harder than it appears. Suppose I have two streams, A and B both running at 1 Hz and I get the data in the following order:
(stream, time)
(A, 2)
(B, 1.5)
(A, 3)
(B, 2.5)
(A, 4)
(B, 3.5)
(A, 5)
See how, if I processed the data in order of when I received them, B would always get dropped? That's what I wanted to avoid. Now in my algorithm, B would get dropped every 10th frame, and I would process the data with a lag of 10 frames into the past.
I would suggest a producer/consumer structure. Have each stream put data into the queue, and a separate thread reading the queue. That is:
// your asynchronous update:
void receiveAsyncData(const Data& dat) {
    q.push(dat.timestamp, dat);
}

// separate thread that processes the queue
void processQueue()
{
    while (!stopRequested)
    {
        data = q.pop();
        if (data.timestamp >= lastTimestamp)
        {
            processData(data);
            lastTimestamp = data.timestamp;
        }
    }
}
This prevents the "lag" that you see in your current implementation when you're processing a batch.
The processQueue function is running in a separate, persistent thread. stopRequested is a flag that the program sets when it wants to shut down, forcing the thread to exit. Some people would use a volatile flag for this. I prefer to use something like a manual reset event.
To make this work, you'll need a priority queue implementation that allows concurrent updates, or you'll need to wrap your queue with a synchronization lock. In particular, you want to make sure that q.pop() waits for the next item when the queue is empty. Or that you never call q.pop() when the queue is empty. I don't know the specifics of your ThreadSafePriorityQueue, so I can't really say exactly how you'd write that.
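Since the specifics of your ThreadSafePriorityQueue aren't shown, here is the same producer/consumer structure sketched in Java for illustration, where PriorityBlockingQueue provides both the synchronization and a take() that waits on an empty queue (the Data class is a stand-in):

import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class StreamMerger {
    // Minimal stand-in for the question's Data type (illustrative only).
    static class Data {
        final long timestamp;
        Data(long timestamp) { this.timestamp = timestamp; }
    }

    // Thread-safe priority queue ordered by timestamp.
    private final PriorityBlockingQueue<Data> q = new PriorityBlockingQueue<>(
            64, Comparator.comparingLong((Data d) -> d.timestamp));
    private volatile boolean stopRequested = false;
    private long lastTimestamp = -1;

    // Called by each stream's thread.
    void receiveAsyncData(Data dat) {
        q.put(dat);
    }

    // Runs on a dedicated consumer thread.
    void processQueue() throws InterruptedException {
        while (!stopRequested) {
            Data data = q.take();          // blocks while the queue is empty
            if (data.timestamp >= lastTimestamp) {
                processData(data);
                lastTimestamp = data.timestamp;
            }
        }
    }

    void processData(Data data) { /* application-specific */ }
}

PriorityBlockingQueue.take() waits while the queue is empty, which addresses the "wait for the next item" concern above.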
The timestamp check is still necessary because it's possible for a later item to be processed before an earlier item. For example:
Event received from data stream 1, but thread is swapped out before it can be added to the queue.
Event received from data stream 2, and is added to the queue.
Event from data stream 2 is removed from the queue by the processQueue function.
Thread from step 1 above gets another time slice and item is added to the queue.
This isn't unusual, just infrequent. And the time difference will typically be on the order of microseconds.
If you regularly get updates out of order, then you can introduce an artificial delay. For example, in your updated question you show messages coming in out of order by 500 milliseconds. Let's assume that 500 milliseconds is the maximum tolerance you want to support. That is, if a message comes in more than 500 ms late, then it will get dropped.
What you do is add 500 ms to the timestamp when you add the thing to the priority queue. That is:
q.push(AddMs(dat.timestamp, 500), dat);
And in the loop that processes things, you don't dequeue something before its timestamp. Something like:
while (true)
{
    if (q.peek().timestamp <= currentTime)
    {
        data = q.pop();
        if (data.timestamp >= lastTimestamp)
        {
            processData(data);
            lastTimestamp = data.timestamp;
        }
    }
}
This introduces a 500 ms delay in the processing of all items, but it prevents dropping "late" updates that fall within the 500 ms threshold. You have to balance your desire for "real time" updates with your desire to prevent dropping updates.
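As an illustration of the same idea in Java, java.util.concurrent.DelayQueue holds each element back until its delay expires, which implements this "don't dequeue before its time" pattern directly (the wrapper below and its 500 ms tolerance are illustrative):

import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class DelayedData implements Delayed {
    static final long TOLERANCE_MS = 500;
    final long timestampMs;

    DelayedData(long timestampMs) { this.timestampMs = timestampMs; }

    // Item becomes available once timestamp + tolerance has passed.
    @Override
    public long getDelay(TimeUnit unit) {
        long remainingMs = timestampMs + TOLERANCE_MS - System.currentTimeMillis();
        return unit.convert(remainingMs, TimeUnit.MILLISECONDS);
    }

    // Orders the queue by remaining delay (equivalently, by timestamp).
    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                            other.getDelay(TimeUnit.MILLISECONDS));
    }

    // Consumer: take() blocks until the earliest item's tolerance expires.
    static void consume(DelayQueue<DelayedData> q) throws InterruptedException {
        long lastTimestamp = -1;
        while (true) {
            DelayedData data = q.take();
            if (data.timestampMs >= lastTimestamp) {
                lastTimestamp = data.timestampMs;
                // processData(data);  // application-specific
            }
        }
    }
}

The head of a DelayQueue is always the element whose delay expires first, so items come out in timestamp order provided they arrive within the tolerance.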
There will always be a lag, and that lag will be determined by how long you're willing to wait for your slowest "fixed-ish rate" stream.
Suggestion:
keep the buffer
keep an array of bool flags with the meaning: "if position ix is true, the buffer holds at least one sample originating from stream ix"
sort/process as soon as you have all flags set to true
Not foolproof (each buffer will be sorted, but from one buffer to another you may have timestamp inversions), but perhaps good enough?
Playing around with the count of "satisfied" flags to trigger the processing (at step 3) may be used to make the lag smaller, but with the risk of more inter-buffer timestamp inversions. In extreme, accepting the processing with only one satisfied flag means "push a frame as soon as you receive it, timestamp sorting be damned".
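A minimal sketch of this buffer-plus-flags idea in Java, assuming each datum carries the index of the stream it came from (all names are illustrative):

import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class FlaggedBuffer {
    static class Data {
        final int streamIx;
        final long timestamp;
        Data(int streamIx, long timestamp) { this.streamIx = streamIx; this.timestamp = timestamp; }
    }

    private final PriorityQueue<Data> buffer =
            new PriorityQueue<>(Comparator.comparingLong((Data d) -> d.timestamp));
    private final boolean[] seen;  // seen[ix]: buffer holds a sample from stream ix
    private int satisfied = 0;

    FlaggedBuffer(int numStreams) { seen = new boolean[numStreams]; }

    synchronized void receive(Data dat) {
        buffer.add(dat);
        if (!seen[dat.streamIx]) {
            seen[dat.streamIx] = true;
            satisfied++;
        }
        // Step 3: process once every stream has contributed a sample.
        if (satisfied == seen.length) {
            while (!buffer.isEmpty()) process(buffer.poll());
            Arrays.fill(seen, false);
            satisfied = 0;
        }
    }

    void process(Data d) { /* application-specific */ }
}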
I mentioned this to support my feeling that the lag/timestamp-inversion balance is inherent to your problem: except for absolutely equal frame rates, there will be no perfect solution in which neither side is sacrificed.
Since a "solution" will be an act of balancing, any solution will require gathering/using extra information to help decisions (e.g. that "array of flags"). If what I suggested sounds silly for your case (it may well be; the details you chose to share aren't many), start thinking about what metrics will be relevant for your targeted level of "quality of experience" and use additional data structures to help gather/process/use those metrics.

Random unique string against a blacklist

I want to create a random string of a fixed length (8 chars in my use case), and the generated string has to be case-sensitive and unique against a blacklist. I know this sounds like a UUID, but I have a specific requirement that prevents me from utilizing them:
some characters are disallowed, i.e. I, l and 1 are lookalikes, as are O and 0
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
    $chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
    $uuid = null;
    while (true) {
        $uuid = substr(str_shuffle($chars), 0, 8);
        if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first()) {
            break;
        }
    }
    return $uuid;
}
Please spare me the critique; we live in an agile world, and this implementation is functional and quick to code.
With a small set of data it works beautifully. However, if I have 10 million entries in the blacklist and try to create 1000 more, it falls flat, taking 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community can point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate it so it fits in the bandwidth of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(h)
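As an illustrative Java sketch of this first option, hashing a unique seed and mapping the first 8 hash bytes onto the question's 56-character alphabet (note the modulo introduces a slight bias, and collisions still have to be checked against the table):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedCode {
    static final String CHARS =
            "ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789";

    // Hash a unique seed and map the first 8 hash bytes onto the alphabet.
    static String code(String uniqueSeed) throws NoSuchAlgorithmException {
        byte[] h = MessageDigest.getInstance("SHA-256")
                .digest(uniqueSeed.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(8);
        for (int i = 0; i < 8; i++) {
            sb.append(CHARS.charAt((h[i] & 0xFF) % CHARS.length()));
        }
        return sb.toString();
    }
}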
Generate five of the digits from an incrementing counter and three randomly. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
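And a sketch of the second option, encoding a database counter as five base-56 characters followed by three random ones (illustrative; the counter is assumed to be a non-negative value below 56^5 taken from a DB sequence):

import java.security.SecureRandom;

public class CounterCode {
    static final String CHARS =
            "ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789";
    static final SecureRandom RNG = new SecureRandom();

    // Five base-56 chars from the counter, padded, plus three random chars.
    static String code(long counter) {
        char[] fixed = new char[5];
        for (int i = 4; i >= 0; i--) {
            fixed[i] = CHARS.charAt((int) (counter % CHARS.length()));
            counter /= CHARS.length();
        }
        StringBuilder sb = new StringBuilder(new String(fixed));
        for (int i = 0; i < 3; i++) {
            sb.append(CHARS.charAt(RNG.nextInt(CHARS.length())));
        }
        return sb.toString();
    }
}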
I ended up modifying the approach: instead of checking for uuid existence on every loop iteration (e.g. 50K DB checks), I now split the generated codes into chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the number of affected rows equals the number of items (1000 in this case), I know there wasn't a collision and I can commit the transaction. Otherwise I need to roll back the chunk and generate another 1000 codes.
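For concreteness, the chunked check might look roughly like this in Java/JDBC (sketch only; the codes table and uuid column come from the question, and MySQL's INSERT IGNORE semantics are assumed):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChunkInserter {
    // Insert a chunk of candidate codes; true means no collisions, committed.
    static boolean insertChunk(Connection conn, List<String> codes) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT IGNORE INTO codes (uuid) VALUES (?)")) {
            for (String code : codes) {
                ps.setString(1, code);
                ps.addBatch();
            }
            int inserted = 0;
            for (int n : ps.executeBatch()) {
                if (n > 0) inserted++;  // a duplicate reports 0 affected rows
            }
            if (inserted == codes.size()) {
                conn.commit();          // no collision in this chunk
                return true;
            }
            conn.rollback();            // collision: caller regenerates the chunk
            return false;
        }
    }
}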

Algorithm - Implement two functions that assign/release unique id's from a pool

I am trying to find a good solution for this question -
Implement two functions that assign/release unique id's from a pool. Memory usage should be minimized and the assign/release should be fast, even under high contention.
alloc() returns available ID
release(id) releases previously assigned ID
The first thought was to maintain a map of IDs and availability (as a boolean). Something like this:
Map<Integer, Boolean> availabilityMap = new HashMap<>();

public Integer alloc() {
    for (Map.Entry<Integer, Boolean> es : availabilityMap.entrySet()) {
        if (!es.getValue()) {      // false = available
            es.setValue(true);     // mark as assigned
            return es.getKey();
        }
    }
    return null;                   // no free ID in the map
}

public void release(Integer id) {
    availabilityMap.put(id, false);
}
However this is not ideal for multiple threads and "Memory usage should be minimized and the assign/release should be fast, even under high contention."
What would be a good way to optimize both memory usage and speed?
For memory usage, I think the map should be replaced with some other data structure, but I am not sure which. Something like a bitmap or bit set? How can I maintain id and availability in this case?
For concurrency I will have to use locks, but I am not sure how I can effectively handle contention. Maybe put available ids in separate chunks so that each of them can be accessed independently? Any good suggestions?
First of all, you do not want to scan the entire map in order to find an available ID.
So you can maintain two sets of IDs: the first for available IDs, and the second for allocated IDs.
That will make allocation/release pretty easy and fast.
Also, you can use ConcurrentMap for both containers (sets); it will reduce the contention.
Edit: Changed bottom sentinel, fixed a bug
First, don't iterate the entire map to find an available ID. You should only need constant time to do it.
What you could do to make it fast is to do this:
Create an int index = 1; for your counter. This is technically the number of IDs generated + 1, and is always > 0.
Create an ArrayDeque<Integer> free = new ArrayDeque<>(); to house the free IDs. Guaranteed constant access.
When you allocate an ID, if the free ID queue is empty, you can just return the counter and increment it (i.e. return index++;). Otherwise, grab its head and return that.
When you release an ID, push the previously used ID to the free deque.
Remember to synchronize your methods.
This guarantees O(1) allocation and release, and it also keeps allocation quite low (literally once per free). Although it's synchronized, it's fast enough that it shouldn't be a problem.
An implementation might look like this:
import java.util.ArrayDeque;

public class IDPool {
    int index = 1;
    ArrayDeque<Integer> free = new ArrayDeque<>();

    public synchronized int acquire() {
        if (free.isEmpty()) return index++;
        else return free.pop();
    }

    public synchronized void release(int id) {
        free.push(id);
    }
}
Additionally, if you want to ensure the free ID list is unique (as you should for anything important) as well as persistent, you can do the following:
Use a HashMap<Integer, Integer> (mapping id -> prev) to hold all generated IDs. Remember, it doesn't need to be ordered or even iterated.
This is technically going to be a stack encoded inside a hash map.
Highly efficient implementations of this exist.
In reality, any unordered int -> int map will do here.
Track the top ID of the free ID stack. Remember that 1 can represent "nothing" and 0 "used", so you don't have to box anything. (IDs are always positive.) Initially, this would just be int top = 1;
When allocating an ID, if there are free IDs (i.e. top >= 2), do the following:
Set the new top to the old head's value in the free map.
Set the old top's value in the map to 0, marking it used.
Return the old top.
When releasing an old ID, do this instead:
If the old ID is already in the pool, return early, so we don't corrupt it.
Set the ID's value in the map to the old top.
Set the new top to the ID, since it's always the last one to use.
The optimized implementation would end up looking like this:
import java.util.HashMap;

public class IDPool {
    int index = 2;
    int top = 1;
    HashMap<Integer, Integer> pool = new HashMap<>();

    public synchronized int acquire() {
        int id = top;
        if (id == 1) return index++;
        top = pool.replace(id, 0);  // 0 marks the ID as used
        return id;
    }

    public synchronized void release(int id) {
        // A non-zero entry means the ID is already in the free pool;
        // return early so we don't corrupt the stack.
        if (pool.getOrDefault(id, 0) != 0) return;
        pool.put(id, top);
        top = id;
    }
}
If need be, you could use a growable integer array instead of the hash map (the IDs are always contiguous), and realize significant performance gains. Matter of fact, that is how I'd likely implement it. It'd just require a minor amount of bit twiddling, because I'd maintain the array's size rounded up to the next power of 2.
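A sketch of that array-backed variant, keeping the same sentinel conventions as the map version above (the power-of-two growth is the bit twiddling mentioned; names are illustrative):

import java.util.Arrays;

public class ArrayIDPool {
    int index = 2;
    int top = 1;
    int[] pool = new int[16];  // pool[id] = next free link; 0 = in use

    public synchronized int acquire() {
        int id = top;
        if (id == 1) {
            id = index++;
            ensureCapacity(id);
            pool[id] = 0;      // mark as in use
            return id;
        }
        top = pool[id];
        pool[id] = 0;
        return id;
    }

    public synchronized void release(int id) {
        // Reject IDs never handed out or already in the free stack.
        if (id < 2 || id >= index || pool[id] != 0) return;
        pool[id] = top;
        top = id;
    }

    // Grow to the next power of two, as suggested above.
    private void ensureCapacity(int id) {
        if (id < pool.length) return;
        pool = Arrays.copyOf(pool, Integer.highestOneBit(id) << 1);
    }
}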
Yeah... I had to write a similar pool in JavaScript because I needed moderately fast IDs in Node.js for potentially high-frequency, long-lived IPC communication.
The good thing about this is that it generally avoids allocations (worst case being once per acquired ID when none are released), and it's very amenable to later optimization where necessary.

Redis maxmemory-policy: performances of volatile-lru vs allkeys-lru

Assuming all keys in a Redis instance have an expire set, volatile-lru and allkeys-lru are similar. But is there a significant performance difference between the two when a key is removed?
Bonus question:
between 2 distinct instances configured with the allkeys-lru policy, having the same content and same configuration, except:
Instance A has all its keys with an expire set (different values of expire)
Instance B has no keys with an expire set
Aside from the memory overhead in instance A due to the expire bits, is there a performance difference between the two when a key is removed by the allkeys-lru algorithm?
In both cases, I'm talking about instances of Redis 2.4.x on 64-bit Linux, with maxmemory = 3 GB and 4-5000 keys when maxmemory is reached (most of the keys are hashes).
Thanks
redis.c, line 2311, unstable branch:
/* volatile-lru and allkeys-lru policy */
else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
         server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
{
    for (k = 0; k < server.maxmemory_samples; k++) {
        sds thiskey;
        long thisval;
        robj *o;

        de = dictGetRandomKey(dict);
        thiskey = dictGetKey(de);
        /* When policy is volatile-lru we need an additional lookup
         * to locate the real key, as dict is set to db->expires. */
        if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
            de = dictFind(db->dict, thiskey);
        o = dictGetVal(de);
        thisval = estimateObjectIdleTime(o);

        /* Higher idle time is better candidate for deletion */
        if (bestkey == NULL || thisval > bestval) {
            bestkey = thiskey;
            bestval = thisval;
        }
    }
}
It seems like, all things being equal, allkeys-lru would be strictly speaking faster, since volatile-lru performs one extra dictFind() lookup per sampled key, but not by a significant magnitude. Chances are we are talking about not much more than a fraction of a microsecond.
The second question was pretty much answered already, but just in case: it looks like it makes no difference to allkeys-lru how many keys are set to expire, or whether any are. Both Instance A and B in your example would see the same performance when a key is purged by the LRU algorithm.
