Kafka Streams interactive query - duplicate windows

I am trying to query a 30-second windowed store like this:
ReadOnlyWindowStore<String, Long> queryableStore
...
KeyValueIterator<Windowed<String>, Long> iter = queryableStore.fetch("A", "E", now - (60 * 1000 * 5), now);
Is there a scenario in which I would get duplicate windows in the output? I was expecting this to return at most 50 unique windows (5 keys x 10 windows). What I observe is that I get more than 50 windows and some of them are duplicates. The duplicates occur towards the last entries in the iterator. Is it recommended to manually remove the duplicates in this case?
Thanks in advance!

Related

Redis pipeline, dealing with cache misses

I'm trying to figure out the best way to implement Redis pipelining. We use redis as a cache on top of MySQL to store user data, product listings, etc.
I'm using this as a starting point: https://joshtronic.com/2014/06/08/how-to-pipeline-with-phpredis/
My question is: assuming you have an array of IDs, properly sorted, and you loop through the Redis pipeline like this:
$redis = new Redis();
// Opens up the pipeline
$pipe = $redis->multi(Redis::PIPELINE);
// Loops through the data and performs actions
foreach ($users as $user_id => $username)
{
    // Increment the number of times the user record has been accessed
    $pipe->incr('accessed:' . $user_id);
    // Pulls the user record
    $pipe->get('user:' . $user_id);
}
// Executes all of the commands in one shot
$users = $pipe->exec();
What happens when $pipe->get('user:' . $user_id); is not available, because it hasn't been requested before or has been evicted by Redis, etc? Assuming it's result # 13 from 50, how do we a) find out that we weren't able to retrieve that object and b) keep the array of users properly sorted?
Thank you
I will answer the question in terms of the Redis protocol; how it works in a particular language binding is more or less the same.
First of all, let's check how a Redis pipeline works:
It is just a way to send multiple commands to the server, execute them, and get multiple replies. There is nothing special about it: you get an array with a reply for each command in the pipeline.
Pipelines are much faster because the round-trip time for each command is saved, i.e. for 100 commands there is only one round trip instead of 100. In addition, Redis executes commands one at a time, so with 100 separate commands each one potentially has to wait its turn to be picked up by the server, whereas a pipeline is treated as one long command and only has to be picked up once.
You can read more about pipelining here: https://redis.io/topics/pipelining. One more note: because each pipelined batch runs uninterrupted (as far as Redis is concerned), it makes sense to send the commands in manageable chunks, i.e. don't send 100k commands in a single pipeline, as that might block Redis for a long period of time; split them into chunks of 1k or 10k commands.
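As a rough illustration of that chunking advice, here is a sketch in Python with redis-py (rather than phpredis); the helper name and chunk size are just illustrative:
import redis

r = redis.Redis()

def pipeline_in_chunks(user_ids, chunk_size=1000):
    # Illustrative helper: send the INCR/GET pairs in bounded pipelines
    # so one huge batch doesn't occupy Redis for too long.
    replies = []
    for start in range(0, len(user_ids), chunk_size):
        pipe = r.pipeline(transaction=False)  # plain pipeline, no MULTI/EXEC
        for user_id in user_ids[start:start + chunk_size]:
            pipe.incr('accessed:%s' % user_id)
            pipe.get('user:%s' % user_id)
        replies.extend(pipe.execute())        # one round trip per chunk
    return replies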
In your case, you run the following fragment in the loop:
// Increment the number of times the user record has been accessed
$pipe->incr('accessed:' . $user_id);
// Pulls the user record
$pipe->get('user:' . $user_id);
The question is: what is put into the pipeline? Let's say you update data for the user ids u1, u2, u3 and u4. The pipeline of Redis commands will then look like:
INCR accessed:u1
GET user:u1
INCR accessed:u2
GET user:u2
INCR accessed:u3
GET user:u3
INCR accessed:u4
GET user:u4
Let's say:
u1 was accessed 100 times before,
u2 was accessed 5 times before,
u3 was not accessed before and
u4 and its accompanying data do not exist.
The result in that case will be an array of Redis replies containing:
101
u1 string data stored at user:u1
6
u2 string data stored at user:u2
1
u3 string data stored at user:u3
1
NIL
As you can see, INCR treats a missing key as 0 and increments it, which is why the counters for u3 and u4 come back as 1. Also, nothing is sorted by Redis: the replies come back in exactly the order the commands were sent.
The language binding, i.e. the Redis driver, just parses that protocol for you and gives you a view of the parsed data. Without preserving the order of commands it would be impossible for the driver to work correctly, or for you as a programmer to deduce anything. Just keep in mind that the request is not echoed in the reply, i.e. you will not receive the key for u1 or u2 when doing GET, only the data stored at that key. So your implementation must remember that position 1 (zero-based) holds the result of GET for u1, position 3 the result of GET for u2, and so on; a NIL at one of those positions is your cache miss.
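To make that concrete for the original question, here is a rough sketch in Python with redis-py (the question uses phpredis, so treat the names as illustrative): pair the replies back up by position and treat a NIL (None) GET reply as a cache miss to fetch from MySQL.
import redis

r = redis.Redis()

def fetch_users(user_ids):
    pipe = r.pipeline(transaction=False)
    for user_id in user_ids:
        pipe.incr('accessed:%s' % user_id)
        pipe.get('user:%s' % user_id)
    replies = pipe.execute()  # replies arrive in the order the commands were queued

    users, misses = {}, []
    for i, user_id in enumerate(user_ids):
        cached = replies[2 * i + 1]   # INCR reply at 2*i, GET reply at 2*i + 1
        if cached is None:            # NIL reply: key never set or evicted
            misses.append(user_id)    # fall back to MySQL for these ids
        else:
            users[user_id] = cached
    return users, misses
Because user_ids drives the iteration, you can rebuild the result in the original order and know exactly which ids were misses.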

Spark Streaming: how to sum up all results for several DStreams?

I am using Spark Streaming + Kafka to build my message processing system, but I have a small technical problem, which I will describe below:
For example, I want to do a word count for every 10 minutes, so in my earliest code I set the batch interval to 10 minutes. The code looks like this:
val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
val ssc = new StreamingContext(sparkConf, Minutes(10))
But I don't think this is a very good solution, because 10 minutes is a long time and accumulates more data than my memory can hold. So I want to reduce the batch interval to 1 minute, like this:
val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
val ssc = new StreamingContext(sparkConf, Minutes(1))
Then the problem comes: how can I sum up the results of ten 1-minute batches into one 10-minute result? I think this work can only be done in the driver instead of the worker program; what can I do?
I am a new learner of Spark Streaming. Can anyone give me a hand?
Maybe I have an idea: in this situation I should use a stateful function like updateStateByKey(), because what I want is a global 10-minute result but what I get is only the intermediate result of each 1-minute batch. So before each 10-minute period ends, I have to record the state of each 1-minute batch, e.g. its word count result, and add these up minute by minute.
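For illustration only, a minimal PySpark sketch of that updateStateByKey idea (the question uses Scala; the socket source, checkpoint path and function names here are placeholders, and this keeps a running total rather than a per-10-minute total):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RunningWordCount")
ssc = StreamingContext(sc, 60)                   # 1-minute batches
ssc.checkpoint("/tmp/wordcount-checkpoint")      # required for stateful operations

lines = ssc.socketTextStream("localhost", 9999)  # stand-in for the Kafka stream
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

def update_count(new_values, running_total):
    # Fold each 1-minute partial count into the state kept by Spark
    return sum(new_values) + (running_total or 0)

running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()

ssc.start()
ssc.awaitTermination()
To emit and reset a total every 10 minutes you would still need something like the window operations described in the next answer.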
Posting here as I had a similar issue and came across the Window Operations section of Spark Streaming. In the poster's original case, they want a count for the past 10 minutes, done every 10 minutes, although their program calculates counts every 1 minute. Assuming we have counts defined and calculated as the standard word count (i.e. at a 1-minute batch duration, with tuples (word, count)), we could follow the linked guide and define something along the lines of
// Reduce/count the last 10 minutes' worth of data, every 10 minutes
val windowedWordCounts = counts.reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(10))
where _+_ is a sum function.

Pop multiple values from Redis data structure atomically?

Is there a Redis data structure, which would allow atomic operation of popping (get+remove) multiple elements, which it contains?
There are the well-known SPOP and RPOP, but they always return a single value. Therefore, when I need the first N values from a set/list, I need to call the command N times, which is expensive. Let's say the set/list contains millions of items. Is there anything like SPOPM "setName" 1000, which would return and remove 1000 random items from the set, or RPOPM "listName" 1000, which would return and remove the 1000 right-most items from the list?
I know there are commands like SRANDMEMBER and LRANGE, but they do not remove the items from the data structure; the items can only be deleted separately. However, if multiple clients are reading from the same data structure, some items can be read more than once and some can be deleted without ever being read! Therefore, atomicity is what my question is about.
Also, I am fine with such an operation having a higher time complexity. I doubt it will be more expensive than issuing N (let's say 1000, N from the previous example) separate requests to the Redis server.
I also know about separate transaction support. However, this sentence from Redis docs discourages me from using it for parallel processes modifying the set (destructively reading from it):
When using WATCH, EXEC will execute commands only if the watched keys were not modified, allowing for a check-and-set mechanism.
Use LRANGE with LTRIM in a MULTI/EXEC transaction (a transactional pipeline); the pair will then run as one atomic unit. Your worry above about WATCH and EXEC does not apply here, because you run the LRANGE and LTRIM as one transaction, without the possibility of commands from any other clients coming between them. Try it out.
To expand on Eli's response with a complete example for list collections, using lrange and ltrim builtins instead of Lua:
127.0.0.1:6379> lpush a 0 1 2 3 4 5 6 7 8 9
(integer) 10
127.0.0.1:6379> lrange a 0 3 # read 4 items off the top of the stack
1) "9"
2) "8"
3) "7"
4) "6"
127.0.0.1:6379> ltrim a 4 -1 # remove those 4 items
OK
127.0.0.1:6379> lrange a 0 999 # remaining items
1) "5"
2) "4"
3) "3"
4) "2"
5) "1"
6) "0"
If you wanted to make the operation atomic, you would wrap the lrange and ltrim in multi and exec commands.
Also, as noted elsewhere, you should probably ltrim by the number of returned items, not the number of items you asked for, e.g. if you did lrange a 0 99 but got 50 items, you would ltrim a 50 -1, not ltrim a 100 -1.
To implement queue semantics instead of a stack, replace lpush with rpush.
Starting from Redis 3.2, the command SPOP has a [count] argument to retrieve multiple elements from a set.
See http://redis.io/commands/spop#count-argument-extension
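For example, with redis-py 3.x against a Redis server at 3.2 or later (the key name is arbitrary):
import redis

r = redis.Redis()
r.sadd('setName', *range(10))
popped = r.spop('setName', 3)   # atomically removes and returns 3 random members
print(popped)                   # e.g. [b'7', b'2', b'9']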
Here is a python snippet that can achieve this using redis-py and pipeline:
from redis import StrictRedis

client = StrictRedis()

def get_messages(q_name, prefetch_count=100):
    pipe = client.pipeline()
    pipe.lrange(q_name, 0, prefetch_count - 1)  # Get msgs (w/o pop)
    pipe.ltrim(q_name, prefetch_count, -1)      # Trim (pop) list to new value
    messages, trim_success = pipe.execute()
    return messages
I was thinking that I could just do a for loop of pops, but that would not be efficient, even with a pipeline, especially if the list/queue is smaller than prefetch_count. I have a full RedisQueue class implemented here if you want to look. Hope it helps!
If you want a Lua script, this one should be fast and easy:
local result = redis.call('lrange',KEYS[1],0,ARGV[1]-1)
redis.call('ltrim',KEYS[1],ARGV[1],-1)
return result
Then you don't have to loop on the client.
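For example, you could invoke it from redis-py like this (a sketch; the list name and count are arbitrary):
import redis

r = redis.Redis()
script = """
local result = redis.call('lrange', KEYS[1], 0, ARGV[1] - 1)
redis.call('ltrim', KEYS[1], ARGV[1], -1)
return result
"""
pop_n = r.register_script(script)               # wraps SCRIPT LOAD / EVALSHA for you
items = pop_n(keys=['listName'], args=[1000])   # pops up to 1000 items atomically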
Update:
I tried to do this with srandmember (in 2.6) with the following script:
local members = redis.call('srandmember', KEYS[1], ARGV[1])
redis.call('srem', KEYS[1], unpack(members))
return members
but I get an error:
error: -ERR Error running script (call to f_6188a714abd44c1c65513b9f7531e5312b72ec9b):
Write commands not allowed after non deterministic commands
I don't know whether future versions will allow this, but I assume not; I think it would be a problem for replication.
Starting from Redis 6.2 you can use the count argument to determine how many elements you want popped from the list. count is available for both LPOP and RPOP. This is the pull request that implements the count feature.
redis> rpush foo a b c d e f g
(integer) 7
redis> lrange foo 0 -1
1) "a"
2) "b"
3) "c"
4) "d"
5) "e"
6) "f"
7) "g"
redis> lpop foo
"a"
redis> lrange foo 0 -1
1) "b"
2) "c"
3) "d"
4) "e"
5) "f"
6) "g"
redis> lpop foo 3
1) "b"
2) "c"
3) "d"
redis> lrange foo 0 -1
1) "e"
2) "f"
3) "g"
redis> rpop foo 2
1) "g"
2) "f"
redis>
Redis 4.0+ now supports modules which add all kinds of new functionality and data types with much faster and safer processing than Lua scripts or multi/exec pipelines.
Redis Labs, the current sponsor behind Redis, has a useful set of extension modules called redex here: https://github.com/RedisLabsModules/redex
The rxlists module adds several list operations including LMPOP and RMPOP so you can atomically pop multiple values from a Redis list. The logic is still O(n) (basically doing a single pop in a loop) but all you have to do is install the module once and just send that custom command. I use it on lists with millions of items and thousands popped at once generating 500MB+ of network traffic without issue.
I think you should look at Lua support in Redis. If you write a Lua script and execute it on Redis, it is guaranteed to be atomic (because Redis is single-threaded). No other queries will be served before the end of your Lua script (i.e. don't implement a big task in Lua or Redis will get slow).
So, in this script you add your SPOP and RPOP calls, append the result of each Redis command to a Lua array, for instance, and then return the array to your Redis client.
What the documentation is saying about MULTI is that it is optimistic locking: you retry the MULTI block with WATCH until the watched keys are not modified in between. If there are many writes to the watched value, this will be slower than 'pessimistic' locking (as in many SQL databases: PostgreSQL, MySQL...), which in some manner 'stops the world' so that the query is executed first. Pessimistic locking is not implemented in Redis; you could implement it yourself, but it is complex and you probably don't need it (if there are not that many writes to this value, optimistic locking should be quite enough).
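If you do want the WATCH-based optimistic approach instead of Lua, a rough redis-py sketch of the retry loop looks like this (the helper name is illustrative):
import redis
from redis.exceptions import WatchError

r = redis.Redis()

def pop_n_optimistic(key, n):
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                     # optimistic lock on the list
                items = pipe.lrange(key, 0, n - 1)  # immediate-mode read
                pipe.multi()
                pipe.ltrim(key, len(items), -1)     # trim exactly what was read
                pipe.execute()                      # fails if the key changed since WATCH
                return items
            except WatchError:
                continue                            # another client wrote; retry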
You can probably try a Lua script (script.lua) like this:
local result = {}
for i = 1, tonumber(ARGV[1]) do
    -- pop up to ARGV[1] items; stop adding once the list is empty
    local val = redis.call('RPOP', KEYS[1])
    if val then
        table.insert(result, val)
    end
end
return result
You can call it this way:
redis-cli eval "$(cat script.lua)" 1 "listName" 1000

Can Cube (js metrics framework) return more than 1000 events?

The Cube software (https://github.com/square/cube) allows you to retrieve events.
I want to retrieve a lot of events, but it appears that I am capped at 1000. There are well over 9000 in MongoDB in the collection and time range I am querying.
Example HTTP GET queries I issue:
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&start=2012-02-02&stop=2013-07-03
# 7 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&limit=7
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&limit=9999
It appears that the limit is pinned at line 166 of https://github.com/square/cube/blob/28dad4af27a6680deb46077b16952590f2c21cad/lib/cube/event.js, based on the batchSize=1000 setting.
Is it possible that you can 'page' through the data in some way? Or is this just a hard limit?
Looks like there is a hard cap on results in three places that need to be updated for large domains:
event.js - line 166
metric.js - line 11
metric.js - line 12
In addition, I was unable to find any query-string APIs for these parameters. Ideally, we can leave the cap at 1000 (to avoid server bloat for people not tuning their queries correctly) and allow the consumer to define override behavior.

Limitation in retrieving rows from MongoDB from Ruby code

I have code which gets all the records from a MongoDB collection and then performs some computations.
My program takes too much time, because the "coll_id.find().each do |eachitem|......." loop returns only 300 records at a time.
If I place a counter inside the loop and check, it prints 300 records and then pauses for around 3 to 4 seconds before printing the counter values for the next set of 300 records.
coll_id.find().each do |eachcollectionitem|
puts "counter value for record " + counter.to_s
counter=counter +1
---- My computations here -----
end
Is this a limitation of the Ruby MongoDB API, or does some configuration need to be done so that the code can access all the records at once?
How large are your documents? It's possible that the deserialization is taking a long time. Are you using the C extensions (bson_ext)?
You might want to try passing a logger when you connect. That could help sort out what's going on. Alternatively, can you paste in the MongoDB log? What's happening there during the pause?
