Redis embedding value in the key vs json - data-structures

I'm planning to store room availability in a Redis database. The JSON object looks like this:
{
BuildingID: "RE0002439",
RoomID: "UN0002384391290",
SentTime: 1572616800,
ReceivedTime: 1572616801,
Status: "Occupied",
EstimatedAvailableFrom: 1572620400000,
Capacity: 20,
Layout: "classroom"
}
This is going to be reported by both devices and apps (a tablet outside the room, a sensor inside some rooms, users, etc.) and will vary widely, as we have hundreds of buildings and over 1000 rooms.
My intention is to use a simple key value structure in Redis. The main query would be which room is available now, but other queries are possible.
Because of that I was thinking that the key should look like
RoomID,Status,Capacity
My question is: since this is the main query we expect, is it a correct assumption that all of these should be in the key? Should there be other fields in the key too, or should the key be just a number generated with Redis increment, as if it were SQL?
There are plenty of questions I could find about hierarchy but my object has no hierarchy really.

Unless you use the Redis instance exclusively for this, using keys with pattern matching for common queries is not a good idea. KEYS is O(N) in the number of keys in the database, and so is SCAN when called repeatedly to traverse the whole keyspace.
Consider the RediSearch module; it would give you a lot of power for this use case.
If RediSearch is not an option:
You can use a single hash key to store all rooms, but then you have to store the whole JSON string as the value, and whenever you want to modify a field, you need to get, modify, and then set.
You are probably better off using multiple data structures. Here is an idea to get you started:
Store each room as a hash key. If RoomID is unique you can use it as key, or pair it with building id if needed. This way, you can edit a field value in one operation.
HSET UN0002384391290 BuildingID RE0002439 Capacity 20 ...
Keep a set with all room IDs. SADD AllRooms UN0002384391290
Use sets and sorted sets as indexes for the rest:
A set of available rooms: Use SADD AvailableRooms UN0002384391290 and SREM AvailableRooms UN0002384391290 to mark rooms as available or not. This way your common query of all rooms available is as fast as it gets. You can use this in place of Status inside the room data. Use SISMEMBER to test if a given room is available now.
A sorted set with capacity: Use ZADD RoomsByCapacity 20 UN0002384391290. So now you can start doing nice queries like ZRANGEBYSCORE RoomsByCapacity 15 +inf WITHSCORES to get all rooms with a capacity >=15. You then can intersect with available rooms.
Sets by layout: SADD RoomsByLayout:classroom UN0002384391290. Then you can intersect by layout, like SINTER AvailableRooms RoomsByLayout:classroom to get all available classrooms.
Sets by building: SADD RoomsByBuilding:RE0002439 UN0002384391290. Then you can intersect by buildings too, like SINTER AvailableRooms RoomsByLayout:classroom RoomsByBuilding:RE0002439 to get all available classrooms in a building.
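You can mix sets with sorted sets, like ZINTERSTORE Available:RE0002439:ByCap 3 RoomsByBuilding:RE0002439 RoomsByCapacity AvailableRooms AGGREGATE MAX to get all available rooms scored by capacity in building RE0002439. Members of plain sets count as having score 1, so AGGREGATE MAX keeps the capacity as the resulting score. Before Redis 6.2, sorted sets only offered ZINTERSTORE and ZUNIONSTORE, which write their result to a destination key, so you need to clean up after your queries; Redis 6.2 added ZINTER and ZUNION, which return the result without storing it.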
You can avoid sorted sets by using sets with capacity buckets, like Rooms:Capacity:1-5, Rooms:Capacity:6-10, etc.
Consider adding coordinates to your buildings, so your users can query by proximity. See GEOADD and GEORADIUS (or GEOSEARCH, which supersedes GEORADIUS from Redis 6.2 on).
You may want to allow reservations and availability queries into the future. See the question "Date range overlap on Redis?".
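To make the index layout above concrete, here is a minimal in-memory sketch in Python. It is not a Redis client: plain Python sets and dicts stand in for the Redis sets, hashes, and sorted set, and the class name RoomIndex and its method names are made up for illustration. The comments name the Redis command each step corresponds to.

```python
from collections import defaultdict


class RoomIndex:
    """In-memory model of the hash + set + sorted-set layout (not a Redis client)."""

    def __init__(self):
        self.rooms = {}                      # one hash per room: HSET <RoomID> field value ...
        self.available = set()               # SADD / SREM AvailableRooms <RoomID>
        self.by_layout = defaultdict(set)    # SADD RoomsByLayout:<layout> <RoomID>
        self.by_building = defaultdict(set)  # SADD RoomsByBuilding:<BuildingID> <RoomID>
        self.capacity = {}                   # ZADD RoomsByCapacity <capacity> <RoomID>

    def add_room(self, room_id, building_id, capacity, layout):
        self.rooms[room_id] = {
            "BuildingID": building_id,
            "Capacity": capacity,
            "Layout": layout,
        }
        self.by_layout[layout].add(room_id)
        self.by_building[building_id].add(room_id)
        self.capacity[room_id] = capacity

    def set_available(self, room_id, is_available):
        if is_available:
            self.available.add(room_id)
        else:
            self.available.discard(room_id)

    def find_available(self, building_id=None, layout=None, min_capacity=0):
        # Set intersection plays the role of SINTER / ZINTERSTORE.
        result = set(self.available)
        if building_id is not None:
            result &= self.by_building.get(building_id, set())
        if layout is not None:
            result &= self.by_layout.get(layout, set())
        # The capacity filter plays the role of ZRANGEBYSCORE <min> +inf.
        return sorted(r for r in result if self.capacity[r] >= min_capacity)


index = RoomIndex()
index.add_room("UN0002384391290", "RE0002439", 20, "classroom")
index.set_available("UN0002384391290", True)
matches = index.find_available(building_id="RE0002439", layout="classroom", min_capacity=15)
```

The point of the layout is that a status change touches only the availability set, while the per-attribute indexes change only when a room itself is added or edited.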

Related

RocksDB: range query on numbers

Is it possible to use RocksDB efficiently for range queries on numbers?
For example if I have billions of tuples (price, product_id) can I use RocksDB to retrieve all products that have 10 <= price <= 100? Or it can't be used for that?
I am confused because I can't find any specific docs about number keys and range queries. However I also read that RocksDB is used as a database engine for many DBMS and that suggests that it's possible to query it efficiently for this case.
What is the recommended way to organize the above tuples in a key-value store like RocksDB in order to get arbitrary ranges (not known in advance)?
What kind of keys would you use? What type of queries would you use?
Yes, RocksDB supports efficient range queries, even for arbitrary ranges that are not known in advance.
On range queries, see https://github.com/facebook/rocksdb/wiki/Prefix-Seek
On number keys: there are no docs on how to model your data like that. If you don't already know how to model it, you shouldn't be using RocksDB in the first place, as it is too low level.
What is the recommended way to organize the above tuples in a key-value store like RocksDB in order to get arbitrary ranges (not known in advance)?
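In your example, that means creating an index on price to look up the product id.
So you would encode the price as a byte array and use that as the key, with the product id as a byte array as the value. RocksDB's default comparator orders keys bytewise, so the encoding has to sort in numeric order: use a fixed-width big-endian integer (or zero-padded digits), not the plain decimal text, where "100" sorts before "20".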
Example format
key => value
priceIndex:<price>#<productId> => <productId>
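Then you will:
Create an iterator
Seek to the lower bound of your price [priceIndex:10 in this case]
Set the upper bound on the read options [priceIndex:100 in this case]. The iterator's upper bound is exclusive, so to include price 100 itself, use the first key past it (priceIndex:101 for integer prices)
Loop while the iterator is valid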
This will give you all the key value pairs that are in the range - which in your case would be all the price, product id tuples that are within the price range
Care must be taken since many products can have the same price and rocksdb keys are unique - so you can suffix the price with the product id as well to make the key unique
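The key layout and scan described above can be sketched in Python as follows. This does not call RocksDB: a sorted list of byte strings stands in for the key space, and bisect stands in for the iterator's Seek. The function names, and the assumption that prices are non-negative integers (for example cents), are mine for illustration.

```python
import bisect
import struct

PREFIX = b"priceIndex:"


def encode_key(price: int, product_id: str) -> bytes:
    # ">Q" is an unsigned 64-bit big-endian integer, so bytewise order
    # of the encoded keys matches numeric order of the prices.
    return PREFIX + struct.pack(">Q", price) + b"#" + product_id.encode("utf-8")


def decode_key(key: bytes):
    body = key[len(PREFIX):]
    (price,) = struct.unpack(">Q", body[:8])
    return price, body[9:].decode("utf-8")  # skip the 8 price bytes and the "#"


def range_scan(sorted_keys, low: int, high: int):
    """Return (price, product_id) pairs with low <= price <= high."""
    start = PREFIX + struct.pack(">Q", low)       # seek target, inclusive
    stop = PREFIX + struct.pack(">Q", high + 1)   # upper bound, exclusive
    i = bisect.bisect_left(sorted_keys, start)    # "Seek"
    found = []
    while i < len(sorted_keys) and sorted_keys[i] < stop:  # "while Valid()"
        found.append(decode_key(sorted_keys[i]))
        i += 1                                    # "Next"
    return found


# Hypothetical data: several products may share a price.
items = [(5, "a"), (10, "b"), (10, "c"), (100, "d"), (101, "e"), (20, "f")]
keys = sorted(encode_key(price, pid) for price, pid in items)
in_range = range_scan(keys, 10, 100)
```

Because the product id is part of the key, two products with the same price produce two distinct keys that sit next to each other in the scan.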

Redis - Sorted Dictionary

Redis has the data structure sorted sets, which allows you to create a set that is sorted by some score value.
There are several problems that I'm trying to solve
I need to store similar members but with different scores (not possible with sets). One solution is to concatenate the score with the original value and store it as the value, but it's kinda ugly.
I need to have only one member per score and I need a way to enforce it.
I need to be able to update or remove member by score, like in a dictionary.
The best example of what I'm looking for is an order book
I need to be able to set amount of a certain price, remove price and retrieve amounts and prices sorted by prices
SET orderBook_buy 1.17 30000
SET orderBook_buy 1.18 40000
SET orderBook_buy 1.19 40000
SET orderBook_buy 1.17 35000 // Override the previous value of 1.17
DEL orderBook_buy 1.18 // 1.18 was sold out
I think it can be done if I combine sorted sets and hash tables
I keep the prices in sorted sets
ZADD orderBook_buy_prices 1.17 1.17
...
ZREM orderBook_buy_prices 1.18
And the amounts in hash tables, by prices
HSET orderBook_buy 1.17 35000
...
HDEL orderBook_buy 1.18
It could work, but I have to do 2 reads and 2 writes every time, and also make sure that the writes are inside a transaction.
Is there a data structure in Redis that supports sorted dictionaries out of the box (could be a module)?
Thank you.
It could work, but I have to do 2 reads and 2 writes every time, and also make sure that the writes are inside a transaction.
You also want to do the reads in a transaction, unless you don't care about possible read consistency problems.
Is there a data structure in Redis that supports sorted dictionaries out of the box (could be a module)?
Sorted Sets are just that, but what you're looking for is a single data structure that is a sort of two-way dictionary with ordering (albeit only on one subset of keys/values, depending on the direction you're coming from).
Your approach of "welding" together two existing structures is perfectly valid, with the constraints you've pointed out about two keys and transactionality. You could use a Lua script to wrap the logic and not worry about transactions, but you'd still have it touch 2 keys via two ops.
AFAIK there is no Redis module ATM that implements this data structure (although it should be possible to write one).
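As a rough illustration of the "two structures welded together" idea, here is an in-memory Python sketch. It does not talk to Redis: a sorted list plays the role of the sorted set of prices and a dict plays the role of the hash of amounts. The class and method names are hypothetical. In Redis, each method body would correspond to the pair of commands that has to run atomically (MULTI/EXEC or a Lua script, as discussed above).

```python
import bisect
from decimal import Decimal


class OrderBookSide:
    """One side of an order book: at most one amount per price, ordered by price."""

    def __init__(self):
        self._prices = []   # sorted, unique prices (role of the sorted set)
        self._amounts = {}  # price -> amount       (role of the hash)

    def set(self, price, amount):
        # ZADD orderBook_buy_prices <price> <price> + HSET orderBook_buy <price> <amount>
        if price not in self._amounts:
            bisect.insort(self._prices, price)
        self._amounts[price] = amount

    def remove(self, price):
        # ZREM orderBook_buy_prices <price> + HDEL orderBook_buy <price>
        if price in self._amounts:
            del self._amounts[price]
            self._prices.pop(bisect.bisect_left(self._prices, price))

    def levels(self):
        # ZRANGE over the prices, then HMGET for the amounts.
        return [(p, self._amounts[p]) for p in self._prices]


book = OrderBookSide()
book.set(Decimal("1.17"), 30000)
book.set(Decimal("1.18"), 40000)
book.set(Decimal("1.19"), 40000)
book.set(Decimal("1.17"), 35000)   # overrides the previous amount at 1.17
book.remove(Decimal("1.18"))       # 1.18 was sold out
```

Prices are kept as Decimal here so that a price used as a dictionary key compares exactly; that choice is an assumption of the sketch, not something Redis requires.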

Using scoring to find customers

I have a site where customers purchase items that are tagged with a variety of taxonomy terms. I want to create a group of customers who might be interested in the same items by considering the tags associated with purchases they've made. Rather than comparing a list of tags for each customer each time I want to build the group, I'm wondering if I can use some type of scoring to solve the problem.
The way I'm thinking about it, each tag would have some unique number assigned to it. When I perform a scoring operation it would render a number that could only be achieved by combining a specific set of tags.
I could update a customer's "score" periodically so that it remains relevant.
Am I on the right track? Any ideas?
Your description of the problem looks much more like a clustering or recommendation problem. I am not sure whether those tags alone are enough information for clustering or recommendation, though.
Your idea of the score doesn't look promising to me, because the same sum could be reached in several ways if the numbers aren't chosen carefully enough.
What I would suggest:
Store the tags for each user. When a user purchases a new item, add the item's tags to that user's tags. Periodically, update the user profiles. Say we have users A and B. If at the time of the update the similarity between A and B is greater than some threshold, add a relation between them indicating that the two users are similar. If it is lower, remove the relation (if they were previously related). The similarity could be either the number of common tags or num_common_tags / num_of_tags_assigned_either_in_A_or_B (the Jaccard index).
Later on, when you want to get users with a particular set of tags, you just run a query that checks which users have that set of tags. You can also find users similar to a given user by looking up which users are linked to that user.
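A minimal Python sketch of the similarity check described in this answer, using the ratio of common tags to all tags held by either user. The function names, the sample users, and the threshold value are made up for illustration.

```python
def jaccard(tags_a, tags_b):
    """num_common_tags / num_of_tags_assigned_either_in_A_or_B."""
    union = tags_a | tags_b
    if not union:
        return 0.0  # two users without tags are treated as not similar
    return len(tags_a & tags_b) / len(union)


def similar_pairs(user_tags, threshold):
    """Return the set of (user, user) pairs whose similarity exceeds the threshold."""
    users = sorted(user_tags)
    pairs = set()
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            if jaccard(user_tags[a], user_tags[b]) > threshold:
                pairs.add((a, b))
    return pairs


# Hypothetical purchase history, already reduced to tag sets per user.
user_tags = {
    "A": {"red", "blue", "green"},
    "B": {"red", "blue"},
    "C": {"yellow"},
}
related = similar_pairs(user_tags, threshold=0.5)
```

Comparing every pair is quadratic in the number of users, which is one reason to run it as the periodic update the answer suggests rather than on every query.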
If you assign a unique power of two to each tag, then you can sum the values corresponding to the tags, and users with the exact same sets of tags will get identical values.
red = 1
green = 2
blue = 4
yellow = 8
For example, only customers who have the set of { red, blue } will have a value of 5.
This is essentially using a bitmap to represent a set. The drawback is that if you have many tags, you'll quickly run out of integers. For example, if your (unsigned) integer type is four bytes, you'd be limited to 32 tags. There are libraries and classes that let you represent much larger bitsets, but, at that point, it's probably worth considering other approaches.
Another problem with this approach is that it doesn't help you cluster members that are similar but not identical.
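The power-of-two scheme can be sketched in Python like this. The tag values are the ones listed above; the function names are hypothetical. Python integers are arbitrary precision, so the 32-tag limit of a four-byte integer does not bite in this particular language, but the other drawbacks still apply.

```python
TAG_BITS = {"red": 1, "green": 2, "blue": 4, "yellow": 8}


def tag_score(tags):
    """Combine tags into one integer; OR-ing distinct powers of two equals summing them."""
    score = 0
    for tag in tags:
        score |= TAG_BITS[tag]
    return score


def has_all(score, required_tags):
    """True if the customer's score contains every required tag."""
    required = tag_score(required_tags)
    return (score & required) == required


customer_score = tag_score({"red", "blue"})
```

An exact match on the score finds customers with identical tag sets, and has_all finds customers whose tags are a superset of a given set; neither gives a graded notion of "similar".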

Sorting and merging in Stata on categorical variables

I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. HOWEVER, several of the categorical variables have more categories present in one data set over the other. I have been careful enough to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, Green, Blue, and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here, as this is purely about data management in Stata, so I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it, so I will do that now.
What you describe to generate identifiers is not how to think of merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set, generate an identifier that is based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
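To see why an identifier built from observation numbers is not a merge key, here is a small sketch in Python rather than Stata, with made-up data. Data set A lacks category 2, which data set B has, so after sorting both the same way the row positions no longer line up.

```python
# Hypothetical data: "color" is the shared categorical variable.
data_a = [{"color": 1, "x": 10}, {"color": 3, "x": 30}, {"color": 4, "x": 40}]
data_b = [{"color": 1, "y": 0.1}, {"color": 2, "y": 0.2},
          {"color": 3, "y": 0.3}, {"color": 4, "y": 0.4}]


def merge_by_position(a_rows, b_rows):
    """Sort both the same way, then pair row 1 with row 1, row 2 with row 2, ..."""
    a_sorted = sorted(a_rows, key=lambda r: r["color"])
    b_sorted = sorted(b_rows, key=lambda r: r["color"])
    return [(ra["color"], rb["color"]) for ra, rb in zip(a_sorted, b_sorted)]


def merge_by_key(a_rows, b_rows):
    """Pair rows only when the key value itself matches."""
    b_by_color = {r["color"]: r for r in b_rows}
    return [(r["color"], b_by_color[r["color"]]["color"])
            for r in a_rows if r["color"] in b_by_color]


by_position = merge_by_position(data_a, data_b)
by_key = merge_by_key(data_a, data_b)
```

In by_position, the second row of A (color 3) is paired with the second row of B (color 2); in by_key every pair has matching colors.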
I think I may have solved my problem - I figured I would post an answer specifically relating to the problem in case anybody has the same issue.
I have two data sets: One containing information about the amount of time IT help spent at a customer and another data set with how much product a customer purchased. Both data sets contain unique ID numbers for each company and the fiscal quarter and year that link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total sum of time spent at a given company regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES not OBSERVATIONS - that is, I wish to horizontally elongate not vertically elongate my final data set.
Stata 12 has several kinds of merge, including one to one, many to one, and one to many. Supposing that I treat my purchases data set as my master and my collapsed IT time data set as my using data set, I would perform a "m:1" or many to one merge on the company ID and fiscal quarter (with the roles swapped, so that the IT time data set is the master, it would be "1:m" instead). This is because I have MANY purchases corresponding to one observation per quarter per company.
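The collapse-then-merge steps can be sketched as follows, in Python rather than Stata and with invented variable names (company_id, quarter, hours, amount). In Stata the corresponding commands would be along the lines of collapse (sum) hours, by(company_id quarter) followed by merge m:1 company_id quarter using the collapsed file, with the purchases data as master.

```python
from collections import defaultdict


def collapse_it_time(it_rows):
    """Total IT hours per (company, quarter), regardless of which IT person was there."""
    totals = defaultdict(float)
    for row in it_rows:
        totals[(row["company_id"], row["quarter"])] += row["hours"]
    return dict(totals)


def merge_many_to_one(purchases, it_totals):
    """Attach the single IT total to each of the many purchases with the same key."""
    merged = []
    for purchase in purchases:
        key = (purchase["company_id"], purchase["quarter"])
        merged.append({**purchase, "it_hours": it_totals.get(key)})  # None if unmatched
    return merged


# Hypothetical sample rows.
it_rows = [
    {"company_id": 1001, "quarter": "2012Q1", "tech_id": 7, "hours": 2.5},
    {"company_id": 1001, "quarter": "2012Q1", "tech_id": 9, "hours": 1.5},
]
purchases = [
    {"company_id": 1001, "quarter": "2012Q1", "purchase_id": 1, "amount": 500},
    {"company_id": 1001, "quarter": "2012Q1", "purchase_id": 2, "amount": 250},
    {"company_id": 1002, "quarter": "2012Q1", "purchase_id": 3, "amount": 75},
]
merged = merge_many_to_one(purchases, collapse_it_time(it_rows))
```

The merge adds a variable (it_hours) to every purchase row and leaves the number of rows unchanged, which is the "more variables, not more observations" outcome described above.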

Best data structure for a given set of operations - Add, Retrieve Min/Max and Retrieve a specific object

I am looking for the optimal (in time and space) data structure for supporting the following operations:
Add Persons (name, age) to a global data store of persons
Fetch Person with minimum and maximum age
Search for Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects minPerson and maxPerson for Person with min and max age. Update this if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), I think it may not be the best way if there are many collisions in the hash. Also, addition of a Person would mean an overhead of adding to the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
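A minimal sketch of the hashtable plus min/max approach in Python. It assumes names are unique and that persons are only ever added, never removed or updated; the class and method names are made up for illustration.

```python
class PersonStore:
    """Add-only store: O(1) expected add, O(1) min/max, O(1) expected lookup by name."""

    def __init__(self):
        self._age_by_name = {}   # hashtable: name -> age
        self._youngest = None    # (name, age) with minimum age seen so far
        self._oldest = None      # (name, age) with maximum age seen so far

    def add(self, name, age):
        if name in self._age_by_name:
            raise ValueError(f"duplicate name: {name!r}")  # assumption: names are unique
        self._age_by_name[name] = age
        if self._youngest is None or age < self._youngest[1]:
            self._youngest = (name, age)
        if self._oldest is None or age > self._oldest[1]:
            self._oldest = (name, age)

    def youngest(self):
        return self._youngest

    def oldest(self):
        return self._oldest

    def age_of(self, name):
        return self._age_by_name[name]


store = PersonStore()
store.add("Ann", 31)
store.add("Bob", 19)
store.add("Cy", 64)
```

If deletions were needed, the cached minimum and maximum could become stale, and an ordered structure on age would be the safer choice.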
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings that's provided by your programming language of choice is going to do pretty well out of the box.
It looks like you need a data structure that supports fast inserts and also fast queries on 2 different keys (name and age).
I would suggest keeping two data structures: a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, and a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree would provide O(log(n)) inserts and max/min queries, while the hashtable would give us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
It sounds like you're expecting the name to be the unique identifier; otherwise your operation 3 is ambiguous (what is the correct return result if you have two entries for John Smith?).
Assuming that the uniqueness of a name is guaranteed, I would go with a plain hashtable keyed by names. Operation 1 and 3 are trivial to execute. Operation 2 could be done in O(N) time if you want to search through the data structure manually, or you can do like you suggest and keep track of the min/max and update it as you add/delete entries in the hash table.

Resources