Need help conceptualizing in Redis/NoSQL - data-structures

I think I have a good grasp of the Redis commands, but I'm having difficulty figuring out the best way to use them. I'm designing a customer notification system that will notify customers via their preferred method (Email, SNMP, Syslog) when there's an alarm on any of their circuits.
So, I get a device name and a port. I need to associate that with a single customer, and then associate that customer with a delivery method. With a relational DB, it would probably look something like this:
Device name: Los_Angeles
Port: 11
SELECT Customer_ID, Customer_name from device_info where device_port = 'Los_Angeles:11'
SELECT Customer_protocol, SNMP_destination, Syslog_destination from CUSTOMER
where Customer_ID = <customer_id from above>
(Greatly simplified example).
I can see how to do this programmatically with a hash of lists or a hash of hashes. But I guess what I'm having trouble with in Redis is that those more complex data structures aren't available to me (as far as I know). So, how do I associate multiple pieces of information with a single key? I can think of a few ways to do it, but they all seem to involve multiple steps, and I'd like some input from current Redis programmers about the "best" way to do this.

You are correct that only simple data structures are available with Redis, and they cannot be composed by value (like you could do with a document oriented database such as CouchDB or MongoDB). However, it is possible to compose data structures by reference, and this is a very common pattern.
For instance, the items contained in a set can be the keys for other objects (lists, hash tables, other sets, etc ...). Let's try to apply this to your example.
To model a relationship between customers and device+port, you can use sets containing customer IDs. To store information about the customers, one hash table per customer is fine.
Here are the customers:
hmset c:1 name Smith protocol tcp snmp_dest 127.0.0.1 syslog_dest 127.0.0.2
hmset c:2 name Jackson protocol udp snmp_dest 127.0.0.1 syslog_dest 127.0.0.2
hmset c:3 name Davis protocol tcp snmp_dest 127.0.0.3 syslog_dest 127.0.0.4
The keys of these records are c:ID
Let's associate two of them to a device and port:
sadd d:Los_Angeles:11 2 3
The key of this set is d:device:port. The c: and d: prefixes are just a convention.
One set per device/port should be created. A given customer could belong to several sets (and therefore be associated with several device/port combinations).
Now to find the customers with delivery methods attached to this device/port, we just have to retrieve the content of the set.
smembers d:Los_Angeles:11
1) "2"
2) "3"
Then the corresponding customer information can be retrieved by pipelining a number of hgetall commands:
hgetall c:2
hgetall c:3
1) "name"
2) "Jackson"
3) "protocol"
4) "udp"
5) "snmp_dest"
6) "127.0.0.1"
7) "syslog_dest"
8) "127.0.0.2"
1) "name"
2) "Davis"
3) "protocol"
4) "tcp"
5) "snmp_dest"
6) "127.0.0.3"
7) "syslog_dest"
8) "127.0.0.4"
Don't be afraid of the number of commands. They are very fast, and most Redis clients have the capability to pipeline the queries so that only a minimum number of roundtrips are needed. By just using one smembers and several hgetall, the problem can be solved with just two roundtrips.
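For illustration, here is a minimal redis-py sketch of that two-roundtrip pattern (the client setup and function name are assumptions; the key names follow the c:/d: convention above):
import redis

r = redis.Redis()  # assumes a local Redis instance

def customers_for(device, port):
    # Roundtrip 1: the customer IDs attached to this device/port
    ids = r.smembers(f"d:{device}:{port}")
    # Roundtrip 2: one pipelined HGETALL per customer ID
    pipe = r.pipeline(transaction=False)
    for cid in ids:
        pipe.hgetall(f"c:{cid.decode()}")
    return pipe.execute()

print(customers_for("Los_Angeles", 11))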
Now, it is possible to optimize a bit further, thanks to the ubiquitous SORT command. This is probably the most complex command in Redis, and it can be used to save a roundtrip here.
sort d:Los_Angeles:11 by nosort get c:*->name get c:*->protocol get c:*->snmp_dest get c:*->syslog_dest
1) "Jackson"
2) "udp"
3) "127.0.0.1"
4) "127.0.0.2"
5) "Davis"
6) "tcp"
7) "127.0.0.3"
8) "127.0.0.4"
In one command, it retrieves the content of a device/port set and fetches the corresponding customer information.
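If you are calling this from Python, redis-py exposes the same query through its sort() wrapper; a sketch, reusing the keys created above (the regrouping step is just for readability):
import redis

r = redis.Redis()
flat = r.sort(
    "d:Los_Angeles:11",
    by="nosort",  # BY nosort: skip the sort, we only want the GET clauses
    get=["c:*->name", "c:*->protocol", "c:*->snmp_dest", "c:*->syslog_dest"],
)
# flat is the same 8-element reply shown above; regroup it 4 fields at a time
customers = [flat[i:i + 4] for i in range(0, len(flat), 4)]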
This example was trivial, but more generally, while you can represent complex data structures with Redis, it does not come for free. You need to think carefully about the model, both in terms of structure and of data access (i.e. at design time, stick to your data AND your use cases).

Related

Managing shared resources with threads in Huey

I have to update many rows (incrementing one value in each row) in a peewee database (SqliteDatabase). Some objects may not have been created yet, so I have to create them with default values before working with them. I would use the approach from the peewee docs (Atomic updates), but I couldn't figure out how to mix model.get_or_create() and in [my_array].
So I decided to run the queries in a transaction and commit once at the end (I hope that's what it does).
The reason I'm writing on Stack Overflow is that I don't know how to use db.atomic() with threading (I tested with 4 workers) in Huey, because .atomic() locks the connection (peewee.OperationalError: database is locked). I've tried @huey.lock_task, but as far as I can tell it doesn't solve my problem.
Code of my class:
class Article(Model):
    name = CharField()
    mention_number = IntegerField(default=0)

    class Meta:
        database = db
Code of my task:
@huey.task(priority=30)
def update(names):  # "names" is a list of strings
    with db.atomic():
        for name in names:
            article, success = Article.get_or_create(name=name)
            article.mention_number += 1
            article.save()
Well, if you're using a recent version of Sqlite (3.24 or newer) you can use Postgres-style upsert queries. This is well supported by Peewee: http://docs.peewee-orm.com/en/latest/peewee/api.html#Insert.on_conflict
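A minimal sketch of what that upsert could look like with the models from the question (this assumes Article.name is declared with unique=True, which the original model does not do; without a unique index there is no conflict target to latch onto):
@huey.task(priority=30)
def update(names):  # "names" is a list of strings
    with db.atomic():
        for name in names:
            (Article
             .insert(name=name, mention_number=1)
             .on_conflict(
                 conflict_target=[Article.name],  # the unique column
                 update={Article.mention_number: Article.mention_number + 1})
             .execute())
Each statement either inserts the row with mention_number=1 or bumps the existing counter, so the separate get_or_create/save round trip disappears.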
To answer the other question about shared resources, it's not clear from your example what you would like to happen... Sqlite only allows one write transaction at a time. So if you are running several threads, only one of them may be writing at any given time.
Peewee stores database connections in a thread local, so Peewee databases can be safely used in multithreaded applications.
You didn't mention why huey lock_task wouldn't work.
Another suggestion is to try using WAL-mode with Sqlite, as WAL-mode allows multiple reader transactions to co-exist with a single writer.
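With peewee, WAL mode can be enabled through the pragmas argument; a short sketch (the database filename is an assumption):
from peewee import SqliteDatabase

# journal_mode=wal lets readers proceed while a single writer holds the lock
db = SqliteDatabase('app.db', pragmas={'journal_mode': 'wal'})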

Redis pipeline, dealing with cache misses

I'm trying to figure out the best way to implement Redis pipelining. We use redis as a cache on top of MySQL to store user data, product listings, etc.
I'm using this as a starting point: https://joshtronic.com/2014/06/08/how-to-pipeline-with-phpredis/
My question is this: assume you have an array of ids, properly sorted, and you loop through the redis pipeline like this:
$redis = new Redis();

// Opens up the pipeline
$pipe = $redis->multi(Redis::PIPELINE);

// Loops through the data and performs actions
foreach ($users as $user_id => $username)
{
    // Increment the number of times the user record has been accessed
    $pipe->incr('accessed:' . $user_id);

    // Pulls the user record
    $pipe->get('user:' . $user_id);
}

// Executes all of the commands in one shot
$users = $pipe->exec();
What happens when $pipe->get('user:' . $user_id); is not available, because it hasn't been requested before or has been evicted by Redis, etc? Assuming it's result # 13 from 50, how do we a) find out that we weren't able to retrieve that object and b) keep the array of users properly sorted?
Thank you
I will answer the question with reference to the Redis protocol; how it works in a particular language binding is more or less the same.
First of all, let's check how a Redis pipeline works:
It is just a way to send multiple commands to the server, execute them, and get multiple replies. There is nothing special about it: you simply get back an array with one reply for each command in the pipeline.
Pipelines are much faster because the round-trip time is paid once per batch instead of once per command, i.e. for 100 commands there is a single round trip instead of 100. In addition, Redis processes commands one at a time, so 100 separate commands may each have to wait their turn among other clients' commands; a pipelined batch is handled as one long sequence, so its commands only queue up once.
You can read more about pipelining here: https://redis.io/topics/pipelining. One more note: because the server works through a pipelined batch back to back and buffers all of its replies in memory, it makes sense to send the commands in manageable chunks, i.e. don't send 100k commands in a single pipeline, which could tie up Redis for a long time; split them into chunks of 1k or 10k commands.
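As an illustration of that chunking advice, a small redis-py sketch (the chunk size, key names, and ids are arbitrary):
import redis

r = redis.Redis()
CHUNK = 1000  # keep each pipeline small enough not to monopolize the server

user_ids = [f"u{i}" for i in range(100_000)]
replies = []
for start in range(0, len(user_ids), CHUNK):
    pipe = r.pipeline(transaction=False)
    for uid in user_ids[start:start + CHUNK]:
        pipe.incr(f"accessed:{uid}")
        pipe.get(f"user:{uid}")
    replies.extend(pipe.execute())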
In your case you run in the loop the following fragment:
// Increment the number of times the user record has been accessed
$pipe->incr('accessed:' . $user_id);
// Pulls the user record
$pipe->get('user:' . $user_id);
The question is what gets put into the pipeline. Let's say you update data for the user ids u1, u2, u3, u4. The pipeline of Redis commands will then look like:
INCR accessed:u1
GET user:u1
INCR accessed:u2
GET user:u2
INCR accessed:u3
GET user:u3
INCR accessed:u4
GET user:u4
Let's say:
u1 was accessed 100 times before,
u2 was accessed 5 times before,
u3 was not accessed before and
u4 and its accompanying data do not exist.
The result will be in that case an array of Redis replies having:
101
u1 string data stored at user:u1
6
u2 string data stored at user:u2
1
u3 string data stored at user:u3
1
NIL
As you can see, when the key for an INCR is missing, Redis treats its value as 0 and increments it to 1. Also, Redis does not sort anything: the replies come back in the same order as the requests.
The language binding, i.e. the Redis driver, just parses that protocol for you and gives you a view of the parsed data. Without keeping the order of commands it would be impossible for the driver to work correctly, or for you as a programmer to make sense of the replies. Just keep in mind that the request is not duplicated in the reply, i.e. you will not receive the key for u1 or u2 when doing GET, only the data stored at that key. So your implementation must remember that position 1 (zero-based) holds the result of GET for u1.
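In other words, keep your own mapping from reply position back to id. A sketch of the miss handling in Python/redis-py (phpredis behaves the same way, except a missed GET comes back as false instead of None; the MySQL backfill is only hinted at in a comment):
import redis

r = redis.Redis()

def fetch_users(user_ids):
    pipe = r.pipeline(transaction=False)
    for uid in user_ids:
        pipe.incr(f"accessed:{uid}")
        pipe.get(f"user:{uid}")
    replies = pipe.execute()

    users, misses = {}, []
    for i, uid in enumerate(user_ids):
        data = replies[2 * i + 1]  # every odd slot is a GET reply
        if data is None:           # cache miss: never cached, or evicted
            misses.append(uid)
        else:
            users[uid] = data
    # load `misses` from MySQL here and backfill the cache
    return [users.get(uid) for uid in user_ids], misses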

Dealing with Memcached Race Conditions

I have two different sources of data which I need to marry together. Data set A will have a foo_key attribute which can map to data set B's bar_key attribute in a one-to-many relationship.
Data set A:
[{ foo_key: 12345, other: 'blahblah' }, ...]
Data set B:
[{ bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, ...]
Data set A is coming from a SQS queue and any relationships with data set B will be available as I poll A.
Data set B is coming from a separate SQS queue that I am trying to dump into a memcached cache to do quick look ups on when an object drops into data set A.
Originally I was planning on setting the memcached key to bar_key from the objects in data set B, but then I realized the value could get overwritten since many objects can share the same bar_key. Then I thought I could make the key bar_key and the value an array of the SQS messages. But since I have multiple hosts polling the SQS queue, it's possible that while one host checks whether the key is in memcached, reads it, appends the new message, and sets it back, another host could be performing the same operation, and the first host's append would simply be overwritten.
I've looked around at memcached key locking but I'm not sure I understand it entirely. Would the solution be that when I get the key/value pair from memcached I create a temporary dummy lock on a new key called bar_key_dummy that expires in x seconds, and if I try to fetch a key that has a bar_key_dummy lock active I just send the SQS message back to the queue without deleting to try again in x seconds?
Here's some pseudocode for what I have going on in my head. Does this make any sense?
store = MemCache.new(host)

sqs_messages.poll do |message|
  dummy_key = "#{message.bar_key}_dummy"
  sqs.dont_delete_message && next unless store.get(dummy_key).nil?

  # set dummy_key in memcache with a value of 1 for 3 seconds
  store.set(dummy_key, 1, 3)

  temp_data = store.get(message.bar_key) || []
  temp_data << message
  store.set(message.bar_key, temp_data, 300)

  # delete dummy key when done in case shorter than x seconds
  store.delete(dummy_key)
end
Thanks for any help!
Memcached has a special operation for this: cas (Compare and Swap).
The gets command returns an item along with its unique CAS value.
You can then modify the data and issue the update with the cas command, passing back that original unique CAS value.
If the CAS value changed between the two commands (i.e. another client modified the key in the meantime), the update fails with an EXISTS error, and you simply re-read and retry.
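For example, with a client that exposes gets/cas, the append becomes an optimistic retry loop instead of a dummy-lock key. A sketch in Python using pymemcache (the library choice, key layout, and JSON serialization are assumptions; Ruby memcache clients expose the same gets/cas primitives):
import json
from pymemcache.client.base import Client

store = Client(("127.0.0.1", 11211))

def append_message(bar_key, message, ttl=300):
    # assumes `message` is JSON-serializable
    while True:
        raw, cas_token = store.gets(bar_key)
        if cas_token is None:
            # Key absent: add() only succeeds if no other host created it first
            if store.add(bar_key, json.dumps([message]), expire=ttl, noreply=False):
                return
            continue  # another host created it; re-read and go through cas
        data = json.loads(raw)
        data.append(message)
        # cas() returns False if the key changed since our gets(); retry
        if store.cas(bar_key, json.dumps(data), cas_token, expire=ttl):
            return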

Syncing local state and remote state (Parse) with poor connectivity

GOAL: 1) Enable users to play my game regardless of poor connectivity, and 2) have ~reliable user state stored on Parse for customer support and stats.
My Approach: I am using local client storage as the master (so that net connectivity is not required), and I am using Parse as a secondary synced storage so that I can address customer issues and maintain stats.
The game has a sequence of fixed levels.
User state for each level is stored locally in UserDefaults as the SOT (yes, a bit ugly).
I use Parse as a secondary replicated store for customer support issues and stats.
Thus, I also store the level state on Parse (i.e. PFObject="UserLevel": userId, level #, status, high score, #wins, #losses, etc.).
There should be only one (or zero) PFObject per level per user.
==> When the network connectivity is poor, I often create multiple PFObjects for the same level.
e.g. A typical 5 minute game session:
User unlocks a level:
==> Create PFObject "UserLevel": uid=currentUser, level=#, status=UNLOCKED, wins=0, etc.
User plays this level and loses:
==> Query (async) Parse for PFObject (match userId & level #):
If one matching object found ==> ++losses... (saveEventually)
If >1 matching objects found ==> object[0]:++losses... (saveEventually) (ignore all but one!)
Else (no matching objects found) ==> Create new PFObject. (saveEventually)
User plays again and wins:
==> Query (async) Parse for PFObject (match userId & level #)
If one matching object found: status=COMPLETED, ++wins... (saveEventually)
If >1 matching objects found, status=COMPLETED, ++wins... (saveEventually) (ignore all but one!)
Else (no matching objects found): create the PFObject. (saveEventually)
... etc...
As you can guess, if the network connectivity is slow, and the various queries and saveEventually's have not completed yet, this can easily result in duplicate PFObjects for the same level.
My gut tells me not to create the PFObject during an update if it does not exist. Instead, let the Parse state be sloppy/behind and clean it up at app start (i.e. a clean sync).
I'm assuming this is a very common design pattern and that I'm missing some basic CS fundamentals. (I'm a newb to backend coding.)
The simple solution, since there is a fixed number of levels, is to pre-create each "UserLevel" and save them; then everywhere else you just do updates instead of create-or-update.

Pop multiple values from Redis data structure atomically?

Is there a Redis data structure, which would allow atomic operation of popping (get+remove) multiple elements, which it contains?
There are the well-known SPOP and RPOP, but they always return a single value. Therefore, when I need the first N values from a set/list, I need to call the command N times, which is expensive. Let's say the set/list contains millions of items. Is there anything like SPOPM "setName" 1000, which would return and remove 1000 random items from the set, or RPOPM "listName" 1000, which would return the 1000 right-most items from the list?
I know there are commands like SRANDMEMBER and LRANGE, but they do not remove the items from the data structure. They could be deleted separately. However, if multiple clients are reading from the same data structure, some items could be read more than once and some could be deleted without ever being read! Therefore, atomicity is what my question is about.
Also, I am fine if the time complexity of such an operation is higher. I doubt it would be more expensive than issuing N (let's say 1000, the N from the previous example) separate requests to the Redis server.
I also know about separate transaction support. However, this sentence from Redis docs discourages me from using it for parallel processes modifying the set (destructively reading from it):
When using WATCH, EXEC will execute commands only if the watched keys were not modified, allowing for a check-and-set mechanism.
Use LRANGE with LTRIM inside a MULTI/EXEC transaction (most clients expose this as a transactional pipeline), so that the read and the trim run as one atomic unit. Your worry above about WATCH/EXEC is not applicable here, because you are running the LRANGE and LTRIM as a single transaction, with no possibility for commands from other clients to come between them. Try it out.
To expand on Eli's response with a complete example for list collections, using lrange and ltrim builtins instead of Lua:
127.0.0.1:6379> lpush a 0 1 2 3 4 5 6 7 8 9
(integer) 10
127.0.0.1:6379> lrange a 0 3 # read 4 items off the top of the stack
1) "9"
2) "8"
3) "7"
4) "6"
127.0.0.1:6379> ltrim a 4 -1 # remove those 4 items
OK
127.0.0.1:6379> lrange a 0 999 # remaining items
1) "5"
2) "4"
3) "3"
4) "2"
5) "1"
6) "0"
If you wanted to make the operation atomic, you would wrap the lrange and ltrim in multi and exec commands.
Also, as noted elsewhere, you should probably ltrim by the number of returned items, not the number of items you asked for, e.g. if you did lrange a 0 99 but only got 50 items back, you would ltrim a 50 -1, not ltrim a 100 -1.
To implement queue semantics instead of a stack, replace lpush with rpush.
Starting from Redis 3.2, the command SPOP has a [count] argument to retrieve multiple elements from a set.
See http://redis.io/commands/spop#count-argument-extension
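From redis-py this is just the optional count argument; a short sketch:
import redis

r = redis.Redis()
r.sadd("setName", *range(10_000))
batch = r.spop("setName", 1000)  # atomically removes and returns up to 1000 members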
Here is a python snippet that can achieve this using redis-py and pipeline:
from redis import StrictRedis

client = StrictRedis()

def get_messages(q_name, prefetch_count=100):
    pipe = client.pipeline()
    pipe.lrange(q_name, 0, prefetch_count - 1)  # Get msgs (w/o pop)
    pipe.ltrim(q_name, prefetch_count, -1)      # Trim (pop) list to new value
    messages, trim_success = pipe.execute()
    return messages
I was thinking that I could just do a for loop of pops, but that would not be efficient, even with a pipeline, especially if the list/queue is smaller than prefetch_count. I have a full RedisQueue class implemented here if you want to look. Hope it helps!
If you want a Lua script, this should be fast and easy.
local result = redis.call('lrange',KEYS[1],0,ARGV[1]-1)
redis.call('ltrim',KEYS[1],ARGV[1],-1)
return result
then you don't have to loop.
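If you are calling it from Python, redis-py can register the script once and invoke it by its SHA afterwards; a sketch (the key name and count are placeholders):
import redis

r = redis.Redis()

pop_n = r.register_script("""
local result = redis.call('lrange', KEYS[1], 0, ARGV[1] - 1)
redis.call('ltrim', KEYS[1], ARGV[1], -1)
return result
""")

items = pop_n(keys=["listName"], args=[1000])  # pops up to 1000 items atomically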
update:
I tried to do this with srandmember (in 2.6) with the following script:
local members = redis.call('srandmember', KEYS[1], ARGV[1])
redis.call('srem', KEYS[1], unpack(members))
return members
but I get an error:
error: -ERR Error running script (call to f_6188a714abd44c1c65513b9f7531e5312b72ec9b):
Write commands not allowed after non deterministic commands
I don't know if future versions will allow this, but I assume not; I think it would be a problem for replication.
Starting from Redis 6.2 you can use the count argument to specify how many elements you want popped from the list. count is available for both LPOP and RPOP. This is the pull request that implemented the count feature.
redis> rpush foo a b c d e f g
(integer) 7
redis> lrange foo 0 -1
1) "a"
2) "b"
3) "c"
4) "d"
5) "e"
6) "f"
7) "g"
redis> lpop foo
"a"
redis> lrange foo 0 -1
1) "b"
2) "c"
3) "d"
4) "e"
5) "f"
6) "g"
redis> lpop foo 3
1) "b"
2) "c"
3) "d"
redis> lrange foo 0 -1
1) "e"
2) "f"
3) "g"
redis> rpop foo 2
1) "g"
2) "f"
redis>
Redis 4.0+ now supports modules which add all kinds of new functionality and data types with much faster and safer processing than Lua scripts or multi/exec pipelines.
Redis Labs, the current sponsor behind Redis, has a useful set of extension modules called redex here: https://github.com/RedisLabsModules/redex
The rxlists module adds several list operations including LMPOP and RMPOP so you can atomically pop multiple values from a Redis list. The logic is still O(n) (basically doing a single pop in a loop) but all you have to do is install the module once and just send that custom command. I use it on lists with millions of items and thousands popped at once generating 500MB+ of network traffic without issue.
I think you should look at Lua support in Redis. If you write a Lua script and execute it on Redis, it is guaranteed to be atomic (because Redis executes commands on a single thread). No other queries will be served before your Lua script finishes (which also means you shouldn't implement a big task in Lua, or Redis will get slow).
So, in this script you put your SPOP or RPOP calls, append the result of each Redis command to a Lua array, and then return the array to your Redis client.
What the documentation is saying about MULTI is that it provides optimistic locking: you retry the MULTI/EXEC block with WATCH until the watched value is not modified during the transaction. If there are many writes to the watched value, this will be slower than 'pessimistic' locking (as in many SQL databases: PostgreSQL, MySQL...), which in some manner 'stops the world' so that the query executes first. Pessimistic locking is not built into Redis; you can implement it yourself, but it is complex and you probably don't need it (with not so many writes on this value, optimistic locking should be quite enough).
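For completeness, here is roughly what that optimistic-locking retry looks like in redis-py (a sketch of the WATCH/MULTI/EXEC pattern; the key and count are placeholders):
import redis
from redis.exceptions import WatchError

r = redis.Redis()

def pop_n(key, n):
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                      # start optimistic lock
                items = pipe.lrange(key, 0, n - 1)   # immediate mode while watching
                pipe.multi()                         # queue the destructive part
                pipe.ltrim(key, len(items), -1)
                pipe.execute()                       # raises if `key` changed meanwhile
                return items
            except WatchError:
                continue                             # someone touched the key; retry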
You can probably try a Lua script (script.lua) like this:
local result = {}
for i = 1, tonumber(ARGV[1]) do
    local val = redis.call('RPOP', KEYS[1])
    if val then
        table.insert(result, val)
    end
end
return result
You can call it this way:
redis-cli eval "$(cat script.lua)" 1 "listName" 1000
