I am in the process of implementing a proof-of-concept stream processing system using Apache Flink 1.6.0 and am storing a list of received events, partitioned by key, in a ListState. (Don't worry about why I am doing this, just work with me here.) I have a StateTtlConfig set on the corresponding ListStateDescriptor. Per the documentation:
"All state collection types support per-entry TTLs. This means that list elements and map entries expire independently."
"Currently, expired values are only removed when they are read out explicitly, e.g. by calling ValueState.value()."
Question 1
Which of the following constitutes a read of the ListState:
1. Requesting the iterator but not using it: myListState.get();
2. Actually using the iterator: for (MyItem i : myListState.get()) { ... }
Question 2
What does "per-entry TTL" actually mean? Specifically, what I'm asking about is the following:
Assume I have a specific instance of ListState<Character>. The descriptor has a TTL of 10 seconds. I insert an 'a'. Two seconds later, I insert 'b'. Nine seconds later, I insert 'c'. If I iterate over this ListState, which items will be returned?
In other words:
ListState<Character> ls = getRuntimeContext().getListState(myDescriptor);
ls.add('a');
// ...two seconds later...
ls.add('b');
// ...nine seconds later...
ls.add('c');
// Does this iterate over 'a', 'b', 'c'
// or just 'b' and 'c'?
for (Character myChar : ls.get()) { ... }
Answer 1
The answer is option 1: for ListState, pruning happens when myListState.get() is called; you do not need to actually consume the iterator.
Answer 2
"per-entry TTL" means the timeout is applied to a single entry rather than whole collection. For your example assuming at the point of reading 10 seconds passed since inserting the a it will iterate over b and c. a is going to be pruned in this case.
Related
I am learning the leaky bucket algorithm and want to get my hands dirty by writing some simple code with Redis plus a Go HTTP server.
When I searched here with the keywords redis, leaky, and bucket, I found many similar questions, as shown in [1], which is nice. However, even after going through those threads and the wiki [2], I still have trouble understanding the entire logic. I suppose there is something I do not understand and am not even aware of. So I would like to rephrase it here; please correct me if I get it wrong.
The pseudo code:
key := "ip address, token or anything that can be the representative of a client"
redis_queue_size := 5
interval_between_each_request := 7
request := obtain_http_request_from_somewhere()
if check_current_queue_size() < redis_queue_size:
    if is_queue_empty():
        // e.g. zadd "ip1" now() now() -- now() is in seconds, milliseconds or nanoseconds, e.g. t = 1
        add_request_to_the_queue()
        process_request(request)
    else:
        now := get_current_time()
        // Retrieve the first element in the queue and compute the timestamp at which the
        // new request is allowed to execute,
        // e.g. zadd "ip1" <time of the first element in the queue + interval_between_each_request> now
        add_request_to_redis_queue_with_timestamp(now, interval_between_each_request) // zadd "ip1" <allowed execution time as score> <arrival time as member>
        // check_the_time_left...() computes how long the current request has to wait.
        // For instance, the first request was stored in the queue with
        //   zadd "ip1" 1 1   // t = 1
        // and the second request arrives at t = 4 but is only allowed to execute at t = 8:
        //   zadd "ip1" 8 4   // where 4 := now, 8 := 1 + interval_between_each_request
        // so N will be 4
        N := check_the_time_left_for_the_current_request_to_execute(now, interval_between_each_request)
        sleep(N) // the request waits 4 seconds before being processed
        process_request(http_request_obj)
else:
    return // discard the request
I understand the part where the queue is full: subsequent requests are discarded. What I think I am misunderstanding is, when the queue is not full, how the incoming requests are reshaped so that they execute at a fixed rate.
I appreciate any suggestions.
[1]. https://stackoverflow.com/search?q=redis+leaky+bucket+&s=aa2eaa93-a6ba-4e31-9a83-68f791c5756e
[2]. https://en.wikipedia.org/wiki/Leaky_bucket#As_a_queue
If this is for simple rate limiting, the sliding-window approach using a sorted set is what we see implemented by most Redis users: https://github.com/Redislabs-Solution-Architects/RateLimitingExample/blob/sliding_window/app.py
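A rough sketch of that sliding-window idea, written in Java with the Jedis client purely for illustration (the linked example is Python; the key prefix, window length and limit below are made-up values, and a production version would pipeline or MULTI the Redis calls):
import redis.clients.jedis.Jedis;

public class SlidingWindowLimiter {
    private static final int LIMIT = 10;            // max requests per window
    private static final long WINDOW_MS = 10_000L;  // window length in milliseconds

    // Returns true if the request is allowed, false if the client is over its limit.
    public static boolean allow(Jedis jedis, String clientId) {
        String key = "rate:" + clientId;
        long now = System.currentTimeMillis();
        // Drop entries that have slid out of the window, then count what is left.
        jedis.zremrangeByScore(key, 0, now - WINDOW_MS);
        if (jedis.zcard(key) >= LIMIT) {
            return false;
        }
        // Record this request; the member should really be unique (e.g. timestamp plus a nonce).
        jedis.zadd(key, now, String.valueOf(now));
        jedis.pexpire(key, WINDOW_MS);               // let idle keys expire on their own
        return true;
    }
}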
If you are set on a leaky bucket, you might consider using a Redis stream per consumerID (apiToken, IP address, etc.) as follows:
request comes in for consumerID
XADD requests-[consumerID] MAXLEN [BUCKET SIZE]
spawn a goroutine if necessary for that consumerID
get current time
if XLEN of requests-[consumerID] is 0, exit the goroutine
XREAD COUNT [number_of_requests_per_period] BLOCK [time period - 1 ms] STREAMS requests-[consumerID]
get the current time and sleep for the remainder of the time period
https://redis.io/commands#stream details how streams work
There are several ways you can implement a leaky bucket, but there should be two separate parts to the process: one that puts things in the bucket, and another that removes them at a set interval if there is anything to remove.
You can use a separate goroutine that consumes the messages at a set interval. This simplifies your code, since one code path only has to look at the queue size and drop packets, while the other code path just consumes whatever is there.
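To make that two-part structure concrete, here is a minimal, Redis-free sketch (in Java rather than Go, purely for illustration): one part offers requests to a bounded bucket and reports when it is full so the caller can discard them, the other leaks one request per fixed interval.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LeakyBucket {
    private final BlockingQueue<Runnable> bucket;
    private final ScheduledExecutorService drainer =
            Executors.newSingleThreadScheduledExecutor();

    public LeakyBucket(int capacity, long intervalMillis) {
        bucket = new ArrayBlockingQueue<>(capacity);
        // Part 2: the consumer, which leaks exactly one request per interval (if any is waiting).
        drainer.scheduleAtFixedRate(() -> {
            Runnable request = bucket.poll();
            if (request != null) {
                request.run();
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Part 1: the producer path only checks capacity; a false return means "discard the request".
    public boolean offer(Runnable request) {
        return bucket.offer(request);
    }
}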
The problem happens when I try to keep my cached results in a List and compute a new DataFrame from all the data in that list on each iteration. However, even though I use an empty DataFrame and get an empty result each time, the function suddenly gets very slow after about 8~12 rounds.
Here is my code:
testLoop(Nil)

def testLoop(lastDfList: List[DataFrame]) {
  // do some dummy transformation like union, and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache

  // always get 0, of course
  println(resultDf.count)

  // benchmark action
  benchmark(resultDf.count)

  testLoop(resultDf :: lastDfList)
}
The benchmark results:
1~6 round : < 200ms
7 round : 367ms
8 round : 918ms
9 round : 2476ms
10 round : 7833ms
11 round : 24231ms
I don't think GC or block eviction is the problem in my case since I already use an empty DataFrame, but I don't know what the cause is. Do I misunderstand the meaning of cache, or is it something else?
Thanks!
After reading ImDarrenG's solution, I changed my code to the following:
spark.sparkContext.setCheckpointDir("/tmp")

testLoop(Nil)

def testLoop(lastDfList: List[DataFrame]) {
  // do some dummy transformation like union, and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache

  resultDf.checkpoint()

  // always get 0, of course
  println(resultDf.count)

  // benchmark action
  benchmark(resultDf.count)

  testLoop(resultDf :: lastDfList)
}
But it still becomes very slow after a few iterations.
Here you create a list of DataFrames by adding resultDf to the beginning of lastDfList and pass that to the next iteration of testLoop:
testLoop(resultDf::lastDfList)
So lastDfList gets longer each pass.
This line creates a new DataFrame by unioning each member of lastDfList:
val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache
Each member of lastDfList is a union of its predecessors; therefore, Spark is maintaining a lineage that becomes exponentially larger with each pass of testLoop.
I expect that the increase in time is caused by the housekeeping of the DAG. Caching the DataFrames removes the need to repeat the transformations, but the lineage must still be maintained by Spark.
Cached data or not, it looks like you are building a really complex DAG by unioning each DataFrame with all of its predecessors on each pass of testLoop.
You could use checkpoint to trim the lineage, and introduce some check to prevent infinite recursion.
According to the API docs and the source code, checkpoint returns a new Dataset instead of modifying the original Dataset.
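That is exactly why the modified loop above is still slow: the Dataset returned by checkpoint() is thrown away, so resultDf keeps its full lineage. A minimal sketch of the fix (written in Java here for illustration; in the Scala loop above the equivalent change is to assign the result of checkpoint() and recurse on that value):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// assumes an existing SparkSession `spark` and a Dataset<Row> `resultDf`
spark.sparkContext().setCheckpointDir("/tmp");

// checkpoint() is eager by default and returns a NEW Dataset with a truncated lineage
Dataset<Row> trimmed = resultDf.checkpoint();
System.out.println(trimmed.count());
// pass `trimmed`, not `resultDf`, to the next iteration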
I want to create a random string of a fixed length (8 chars in my use case), and the generated string has to be case sensitive and unique against a blacklist. I know this sounds like a UUID, but I have a specific requirement that prevents me from utilizing them:
some characters are disallowed; for example I, l and 1 are lookalikes, as are O and 0
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
    $chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
    $uuid = null;

    while (true) {
        $uuid = substr(str_shuffle($chars), 0, 8);

        if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first()) {
            break;
        }
    }

    return $uuid;
}
Please spare me the critique; we live in an agile world, and this implementation is functional and quick to code.
With a small set of data it works beautifully. However, if I have 10 million entries in the blacklist and try to create 1,000 more, it falls flat, as it takes 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community can point me in the right direction.
Best,
Nikola
Two options:
1. Just use a hash of something unique, and truncate it so it fits in the bandwidth of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(h)
2. Generate five of the characters from an incrementing counter and three randomly. A thief will have worse than a 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
I ended up modifying the approach: instead of checking for uuid existence on every loop, e.g. 50K DB checks, I now split the generated codes into multiple chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the affected rows are as many as the items (1,000 in this case), I know there wasn't a collision and I can commit the transaction. Otherwise I need to roll back the chunk and generate another 1,000 codes.
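For anyone who wants to reproduce this outside Laravel, here is a rough sketch of the same idea in plain JDBC from Java (the table and column names, chunk handling and connection management are assumptions): build one multi-row INSERT IGNORE per chunk, compare the affected-row count with the chunk size, and commit or roll back accordingly.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Returns true if every code in the chunk was new; false if at least one collided,
// in which case the transaction is rolled back and the caller regenerates the chunk.
static boolean insertChunk(Connection conn, List<String> chunk) throws SQLException {
    StringBuilder sql = new StringBuilder("INSERT IGNORE INTO codes (uuid) VALUES ");
    for (int i = 0; i < chunk.size(); i++) {
        sql.append(i == 0 ? "(?)" : ", (?)");
    }

    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
        for (int i = 0; i < chunk.size(); i++) {
            ps.setString(i + 1, chunk.get(i));
        }
        int affected = ps.executeUpdate();  // duplicates are silently skipped by IGNORE
        if (affected == chunk.size()) {
            conn.commit();                  // no collisions in this chunk
            return true;
        }
        conn.rollback();                    // a collision happened somewhere in the chunk
        return false;
    }
}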
I need to find the next available ID (or key) from a fixed list of possible IDs. In this case valid IDs are from 1 to 9999, inclusive. When finding the next available ID we start looking just after the last assigned ID, wrap around at the end - only once, of course - and need to check if each ID is taken before we return it as an available ID.
I have some code that does this but I think it is neither elegant nor efficient and am interested in a simpler way to accomplish the same thing. I'm using Ruby but my question is not specific to the language, so if you'd like to write an answer using any other language I will be just as appreciative of your input!
I have elided some details about checking whether an ID is available and such, so just take it as a given that the functions incr_last_id, id_taken?(id), and set_last_id(id) exist. (incr_last_id adds 1 to the last assigned ID in a data store (Redis) and returns the result. id_taken?(id) returns a boolean indicating whether the ID is already taken. set_last_id(id) updates the data store with the new last ID.)
MaxId = 9999

def next_id
  id = incr_last_id
  # if this ID is taken or out of range, find the next available id
  if id > MaxId || id_taken?(id)
    id += 1 while id < MaxId && id_taken?(id)
    # wrap around if we've exhausted the ID space
    if id > MaxId
      id = 1
      id += 1 while id < MaxId && id_taken?(id)
    end
    raise NoAvailableIdsError if id > MaxId || id_taken?(id)
    set_last_id(id)
  end
  id
end
I'm not really interested in solutions that require me to build up a list of all possible IDs and then get the set or list difference between the assigned IDs and the available IDs. That doesn't scale. I realize that this is a linear operation no matter how you slice it and that part is fine, I just think the code can be simplified or improved. I don't like the repetition caused by having to wrap around but perhaps there's no way around that.
Is there a better way? Please show me!
Since you've already searched from incr_last_id to MaxId in the first iteration, there isn't really a need to repeat it again.
Searching from 1 to incr_last_id on the second round at least reduces the search to exactly O(n) instead of a worst case of O(2n).
If you want to do it in a single loop, use modulo:
MaxId = 9999

def next_id
  id = ((incr_last_id - 1) % MaxId) + 1   # normalise the starting point into 1..MaxId
  start = id
  # advance with wrap-around until we find a free id or arrive back where we started
  while id_taken?(id)
    id = (id % MaxId) + 1                 # MaxId wraps back to 1
    raise NoAvailableIdsError if id == start
  end
  set_last_id(id)
  id
end
Using a database table (MySQL in this example):
SELECT id FROM sequences WHERE sequence_name = ? FOR UPDATE
UPDATE sequences SET id = id + 1 WHERE sequence_name = ?
The FOR UPDATE gains an exclusive lock on the selected row, ensuring that you are the only process performing this operation at a given time.
Using an in-memory fixed list:
# somewhere global, done once
@lock = Mutex.new
@ids = (0..9999).to_a

def next_id
  @lock.synchronize { @ids.shift }
end
Using redis:
LPOP list_of_ids
Or just:
INCR some_id
Redis takes care of the concurrency concerns for you.
The usual answer to improve this sort of algorithm is to keep a list of "free objects" handy; you could use just a single object in the list, if you don't really want the extra effort of maintaining a list. (This would reduce the effectiveness of the free object cache, but the overhead of managing a large list of free objects might grow to be a burden. It Depends.)
Because you're wrapping your search around when you've hit MaxId, I presume there is a function give_up_id that returns an id to the free pool. Instead of simply putting a freed id back into the big pool, you keep track of it with a new variable @most_recently_free or append it to a list @free_ids.
When you need a new id, take one off the list, if the list has one. If the list doesn't have one, begin your search as you currently do.
Here's a sketch in pseudo-code:
def give_up_id(id)
  @free_ids.push(id)
end

def next_id
  if @free_ids.empty?
    id = old_next_id()
  else
    id = @free_ids.pop
  end
  id
end
If you allow multiple threads of execution to interact with your id allocation / free routines, you'll need to protect these routines too, of course.
I can't seem to find a definitive answer on this and I want to make sure I understand this to the "n'th level" :-)
a = { "a" => "Hello", "b" => "World" }
a.count # 2
a.size # 2
a.length # 2
a = [ 10, 20 ]
a.count # 2
a.size # 2
a.length # 2
So which to use? If I want to know if a has more than one element then it doesn't seem to matter but I want to make sure I understand the real difference. This applies to arrays too. I get the same results.
Also, I realize that count/size/length have different meanings with ActiveRecord. I'm mostly interested in pure Ruby (1.9.2) right now, but if anyone wants to chime in on the difference AR makes, that would be appreciated as well.
Thanks!
For arrays and hashes size is an alias for length. They are synonyms and do exactly the same thing.
count is more versatile - it can take an element or predicate and count only those items that match.
> [1,2,3].count{|x| x > 2 }
=> 1
In the case where you don't provide a parameter to count it has basically the same effect as calling length. There can be a performance difference though.
We can see from the source code for Array that they do almost exactly the same thing. Here is the C code for the implementation of array.length:
static VALUE
rb_ary_length(VALUE ary)
{
    long len = RARRAY_LEN(ary);
    return LONG2NUM(len);
}
And here is the relevant part from the implementation of array.count:
static VALUE
rb_ary_count(int argc, VALUE *argv, VALUE ary)
{
    long n = 0;

    if (argc == 0) {
        VALUE *p, *pend;

        if (!rb_block_given_p())
            return LONG2NUM(RARRAY_LEN(ary));
        // etc..
    }
}
The code for array.count does a few extra checks but in the end calls the exact same code: LONG2NUM(RARRAY_LEN(ary)).
Hashes (source code) on the other hand don't seem to implement their own optimized version of count so the implementation from Enumerable (source code) is used, which iterates over all the elements and counts them one-by-one.
In general I'd advise using length (or its alias size) rather than count if you want to know how many elements there are altogether.
Regarding ActiveRecord, on the other hand, there are important differences. Check out this post:
Counting ActiveRecord associations: count, size or length?
There is a crucial difference for applications which make use of database connections.
When you are using many ORMs (ActiveRecord, DataMapper, etc.) the general understanding is that .size will generate a query that requests all of the items from the database ('select * from mytable') and then give you the number of items resulting, whereas .count will generate a single query ('select count(*) from mytable') which is considerably faster.
Because these ORMs are so prevalent, I follow the principle of least astonishment: in general, if I already have something in memory, I use .size, and if my code will generate a request to a database (or an external service via an API), I use .count.
In most cases (e.g. Array or String) size is an alias for length.
count normally comes from Enumerable and can take an optional predicate block. Thus enumerable.count {cond} is [roughly] (enumerable.select {cond}).length -- it can of course bypass the intermediate structure as it just needs the count of matching predicates.
Note: I am not sure if count forces an evaluation of the enumeration if the block is not specified or if it short-circuits to the length if possible.
Edit (and thanks to Mark's answer!): count without a block (at least for Arrays) does not force an evaluation. I suppose without formal behavior it's "open" for other implementations, if forcing an evaluation without a predicate ever even really makes sense anyway.
I found a good answer at http://blog.hasmanythrough.com/2008/2/27/count-length-size
In ActiveRecord, there are several ways to find out how many records are in an association, and there are some subtle differences in how they work.

post.comments.count - Determine the number of elements with an SQL COUNT query. You can also specify conditions to count only a subset of the associated elements (e.g. :conditions => {:author_name => "josh"}). If you set up a counter cache on the association, #count will return that cached value instead of executing a new query.

post.comments.length - This always loads the contents of the association into memory, then returns the number of elements loaded. Note that this won't force an update if the association had been previously loaded and then new comments were created through another way (e.g. Comment.create(...) instead of post.comments.create(...)).

post.comments.size - This works as a combination of the two previous options. If the collection has already been loaded, it will return its length just like calling #length. If it hasn't been loaded yet, it's like calling #count.
Also I have a personal experience:
<%= h(params.size.to_s) %> # works_like_that !
<%= h(params.count.to_s) %> # does_not_work_like_that !
We have several ways to find out how many elements are in an array, such as .length, .count and .size. However, it's better to use array.size rather than array.count, because .size performs better.
Adding more to Mark Byers' answer: in Ruby, the method array.size is an alias for the Array#length method. There is no technical difference between the two methods, and you will likely see no difference in performance either. However, array.count also does the same job, with some extra functionality (Array#count).
It can be used to get the total number of elements matching some condition. count can be called in three ways:
Array#count # Returns the number of elements in the array
Array#count n # Returns the number of elements having value n in the array
Array#count { |i| i.even? } # Returns the count of elements for which the block returns true
array = [1,2,3,4,5,6,7,4,3,2,4,5,6,7,1,2,4]
array.size # => 17
array.length # => 17
array.count # => 17
Here all three methods do the same job. However, here is where count gets interesting.
Let us say I want to find how many elements the array contains with the value 2:
array.count 2 # => 3
The array has a total of three elements with the value 2.
Now I want to count all the array elements greater than 4:
array.count { |i| i > 4 } # => 6
The array has a total of 6 elements which are greater than 4.
I hope this gives some info about the count method.