Redis - Sorted Dictionary

Redis has a sorted set data structure, which lets you create a set whose members are ordered by a score value.
There are several problems that I'm trying to solve:
I need to store similar members but with different scores (not possible with sets). One solution is to concatenate the score with the original value and store it as the value, but it's kinda ugly.
I need to have only one member per score, and I need a way to enforce it.
I need to be able to update or remove a member by score, like in a dictionary.
The best example of what I'm looking for is an order book:
I need to be able to set the amount at a given price, remove a price, and retrieve amounts and prices sorted by price.
SET orderBook_buy 1.17 30000
SET orderBook_buy 1.18 40000
SET orderBook_buy 1.19 40000
SET orderBook_buy 1.17 35000 // Override the previous value of 1.17
DEL orderBook_buy 1.18 // 1.18 was sold out
I think it can be done if I combine sorted sets and hashes.
I keep the prices in a sorted set:
ZADD orderBook_buy_prices 1.17 1.17
...
ZREM orderBook_buy_prices 1.18
And the amounts in a hash, keyed by price:
HSET orderBook_buy 1.17 35000
...
HDEL orderBook_buy 1.17
It could work, but I have to do 2 reads and 2 writes every time, and also make sure that the writes are inside a transaction.
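As an aside, here is a minimal sketch of what those paired writes look like wrapped in a MULTI/EXEC transaction, assuming redis-py and the key names above (the helper names are made up):

import redis

r = redis.Redis()

def set_amount(price, amount):
    # Both writes are queued and applied atomically (MULTI/EXEC).
    pipe = r.pipeline(transaction=True)
    pipe.zadd("orderBook_buy_prices", {str(price): price})
    pipe.hset("orderBook_buy", str(price), amount)
    pipe.execute()

def remove_price(price):
    pipe = r.pipeline(transaction=True)
    pipe.zrem("orderBook_buy_prices", str(price))
    pipe.hdel("orderBook_buy", str(price))
    pipe.execute()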
Is there a data structure in Redis that supports sorted dictionaries out of the box (could be a module)?
Thank you.

It could work, but I have to do 2 reads and 2 writes every time, and also make sure that the writes are inside a transaction.
You also want to do the reads in a transaction, unless you don't care about possible read consistency problems.
Is there a data structure in Redis that supports sorted dictionaries out of the box (could be a module)?
Sorted Sets are just that, but what you're looking for is a single data structure that is a sort of two-way dictionary with ordering (albeit only on one subset of keys/values, depending on the direction you're coming from).
Your approach of "welding" together two existing structures is perfectly valid, with the constraints you've pointed out about two keys and transactionality. You could use a Lua script to wrap the logic and not worry about transactions, but you'd still have it touch 2 keys via two ops.
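For instance, a hedged sketch of that Lua wrapping via redis-py's register_script (the script body and names are illustrative, not an existing module); the script executes atomically on the server, so no explicit transaction is needed:

import redis

r = redis.Redis()

# Lua: KEYS[1] = sorted set of prices, KEYS[2] = hash of price -> amount;
# ARGV[1] = price, ARGV[2] = amount.
set_amount = r.register_script("""
    redis.call('ZADD', KEYS[1], ARGV[1], ARGV[1])
    redis.call('HSET', KEYS[2], ARGV[1], ARGV[2])
    return redis.status_reply('OK')
""")

set_amount(keys=["orderBook_buy_prices", "orderBook_buy"], args=[1.17, 35000])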
AFAIK there is no Redis module ATM that implements this data structure (although it should be possible to write one).

Related

Any reference to definition or use of the data structuring technique "hash linking"?

I would like more information about a data structure - or perhaps it is better described as a data structuring technique - that was called hash linking when I read about it in an IBM Research Report a long time ago - in the 70s or early 80s. (The RR may have been from the 60s.)
The idea was to be able to (more) compactly store a table (array, vector) of values when most values fit in a (relatively) small compact range but some values (may) fall outside that range, being unusually large (or small). Instead of making each element of the table wider to hold the entire range, you would store in the table only those values that fit in the small compact range and put all other entries that didn't fit into a hash table.
One use case I remember being mentioned was for bank accounts - you might determine that 98% of the accounts in your bank had balances under $10,000.00 so they would nicely fit in a 6-digit (decimal) field. To handle the very few accounts $10,000.00 or over you would hash-link them.
There were two ways to arrange it. Both involved a table (array, vector, whatever) where each entry would have enough space to fit the 95-99% case of your data values, plus a hash table where you would put the ones that didn't fit, as a key-value pair (the key was the index into the table, the value was the item's value), where the value field could really fit the entire range of the values.
The first way: you would pick a sentinel value, depending on your data type. It might be 0, or it might be the largest representable value. If the value you were trying to store didn't fit in the table, you'd stick the sentinel in there and put the (index, actual value) pair into the hash table. To retrieve, you'd get the value by its index, check if it was the sentinel, and if it was, look it up in the hash table.
The second way: you have no reasonable sentinel value. No problem: you just store the exceptional values in your hash table, and on retrieval you always look in the hash table first. If the index you're trying to fetch isn't there, you're good: just get it out of the table itself.
The benefit was said to be saving a lot of storage while increasing access time by only a small constant factor in either case (due to the properties of a hash table).
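A minimal sketch of the sentinel variant as I understand it (my own reconstruction with made-up constants, not code from the report):

# Balances under $10,000.00 (in cents) fit the compact table;
# anything else is hash-linked.
SENTINEL = -1            # assumed never to occur as a real balance
LIMIT = 1_000_000        # $10,000.00 in cents

table = [0] * 100        # compact per-account storage
overflow = {}            # index -> actual value, for out-of-range balances

def store(i, cents):
    if 0 <= cents < LIMIT:
        table[i] = cents
        overflow.pop(i, None)
    else:
        table[i] = SENTINEL
        overflow[i] = cents

def fetch(i):
    v = table[i]
    return overflow[i] if v == SENTINEL else v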
(A related technique is to work it the other way around if most of your values were a single value and only a few were not that value: Keep a fast searchable table of index-value pairs of the ones that were not the special value and a set of the indexes of the ones that were the very-much-most-common-value. Advantage would be that the set would use less storage: it wouldn't actually have to store the value, only the indexes. But I don't remember if that was described in this report or I read about that elsewhere.)
The answer I'm looking for is a pointer to the original IBM report (though my search on the IBM research site turned up nothing), or to any other information describing this technique or using this technique to do anything. Or maybe it is a known technique under a different name, that would be good to know!
Reason I'm asking: I'm using the technique now and I'd like to credit it properly.
N.B.: This is not a question about:
anything related to hash tables as hash tables, especially not linking entries or buckets in hash tables via pointer chains (which is why I specifically did not add the tag hashtable),
an "anchor hash link" - using a # in a URL to point to an anchor tag - which is what "hash link" gets you when you search for it on the intertubes,
hash consing which is a different way to save space, for much different use cases.
Full disclosure: There's a chance it wasn't in fact an IBM report where I read it. During the 70s and 80s I was reading a lot of TRs from IBM and other corporate labs, and MIT, CMU, Stanford and other university departments. It was definitely in a TR (not a journal or ACM SIG publication) and I'm nearly 100% sure it was IBM (I've got this image in my head ...) but maybe, just maybe, it wasn't ...

Can two items in a hashmap be in different locations but have the same hashcode?

I'm new to hashing, and I've recently learned about hashmaps. I was wondering whether two objects with the same hashcode can possibly go to different locations in a hashmap?
I'm not completely sure and would appreciate any help.
As @Dai pointed out in the comments, this will depend on what kind of hash table you're using. (Turns out, there's a bunch of different ways to make a hash table, and no one data structure is "the" way that hash tables work!)
One of the more common hash table designs uses a strategy called closed addressing. In closed addressing, every item is mapped to a bucket based on its hash code and stored with all other items that also end up in that bucket. Lookups are then done by finding which bucket to look in, then inspecting all the items in that bucket. In that case, any two items with the same hash code will end up in the same bucket. (They can't literally occupy the same spot within that bucket, though.)
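A tiny closed-addressing (chaining) sketch to make that concrete (hypothetical class, not any particular library): two keys with equal hash codes always share a bucket.

class ChainedTable:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for pair in bucket:
            if pair[0] == key:          # key already present: replace value
                pair[1] = value
                return
        bucket.append([key, value])     # same-hash keys pile up in one bucket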
Another strategy for building hash tables uses an approach called open addressing. This is a family of different methods that are all based on the following idea. We require that each slot in the table store at most one element. As before, to do an insertion, we use the element's hash code to figure out which slot to put it in. If the slot is empty, great! We put the element there. If that slot is full, we can't put the item there. Instead, using some predictable strategy, we start looking at other slots until we find a free one, then put the item there. (The simplest way of doing this, linear probing, works by trying the next slot after the desired slot, then the next one, etc., wrapping around if need be.) In this system, since we can't store multiple items in the same spot, no, two elements with the same hash code don't have to (and in fact, can't!) occupy the same spot.
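And a linear-probing sketch of the same insert (again hypothetical; it also assumes the table never fills up, since there's no resizing here): a second key with the same hash code gets pushed to the next free slot.

class ProbingTable:
    def __init__(self, size=8):
        self.slots = [None] * size      # at most one (key, value) per slot

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # probe next slot, wrapping around
        self.slots[i] = (key, value)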
A more recent hashing strategy that's becoming more popular is cuckoo hashing. In cuckoo hashing, we maintain some small number of separate hash tables (typically, two or three), where each slot can only hold one item. To insert an element, we try placing it in the first table at a spot determined by its hash code. If that spot is free, great! We put the item there. If not, we kick out the item there and try putting that item in the next table. This process repeats until eventually everything comes to rest or we get caught in a loop. Like open addressing, this system prevents multiple items from being stored in the same slot, so two elements with the same hash code might go to different places. (There are variations on cuckoo hashing in which each table slot can store a fixed but small number of items, in which case you could have two items with the same hash code in the same spot. But it's not guaranteed.)
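A bare-bones cuckoo insert with two tables, for flavor (illustrative only; a real implementation rebuilds the tables with new hash functions when the kick limit is hit):

def cuckoo_insert(key, t1, t2, h1, h2, max_kicks=32):
    for _ in range(max_kicks):
        i = h1(key) % len(t1)
        if t1[i] is None:
            t1[i] = key
            return True
        key, t1[i] = t1[i], key     # kick out table 1's occupant
        j = h2(key) % len(t2)
        if t2[j] is None:
            t2[j] = key
            return True
        key, t2[j] = t2[j], key     # kick out table 2's occupant, loop again
    return False                    # likely a cycle: caller must rehash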
There are some other hashing schemes I didn't describe here. FKS perfect hashing works by using two layers of hash tables, along the lines of closed addressing, but periodically rebuilds the whole table to ensure that no one bucket is too overloaded. Extendible hashing uses a trie-like structure to grow overflowing buckets once they become too full. Hopscotch hashing is a hybrid between linear probing and chained hashing and plays well with concurrency. But hopefully this gives you a sense of how the type of hash table you use influences the answer to your question!

Redis embedding value in the key vs json

I'm planning to store room availability in a Redis database. The JSON object looks like this:
{
    BuildingID: "RE0002439",
    RoomID: "UN0002384391290",
    SentTime: 1572616800,
    ReceivedTime: 1572616801,
    Status: "Occupied",
    EstimatedAvailableFrom: 1572620400000,
    Capacity: 20,
    Layout: "classroom"
}
This is going to be reported by both devices and apps (a tablet outside the room, a sensor within the room in some rooms, by users, etc.) and will vary widely, as we have hundreds of buildings and over 1000 rooms.
My intention is to use a simple key value structure in Redis. The main query would be which room is available now, but other queries are possible.
Because of that I was thinking that the key should look like
RoomID,Status,Capacity
My question is: since this is the main query we expect, is it a correct assumption to have all of these in the key? Should there be other fields in the key too, or should the key be just a number from a Redis increment, as if it were SQL?
I could find plenty of questions about key hierarchy, but my object really has no hierarchy.
Unless you will use the Redis instance exclusively for this, using keys with pattern matching for common queries is not a good idea. KEYS is O(N), and so is SCAN when called repeatedly to traverse the whole keyspace.
Consider the RediSearch module; it would give you a lot of power for this use case.
If RediSearch is not an option:
You can use a single hash key to store all rooms, but then you have to store the whole JSON string as the value, and whenever you want to modify a field you need to get, then modify, then set.
You are probably better off using multiple data structures; here is an idea to get you started (a combined sketch follows the list):
Store each room as a hash key. If RoomID is unique you can use it as key, or pair it with building id if needed. This way, you can edit a field value in one operation.
HSET UN0002384391290 BuildingID RE0002439 Capacity 20 ...
Keep a set with all room IDs. SADD AllRooms UN0002384391290
Use sets and sorted sets as indexes for the rest:
A set of available rooms: Use SADD AvailableRooms UN0002384391290 and SREM AvailableRooms UN0002384391290 to mark rooms as available or not. This way your common query of all rooms available is as fast as it gets. You can use this in place of Status inside the room data. Use SISMEMBER to test if a given room is available now.
A sorted set with capacity: Use ZADD RoomsByCapacity 20 UN0002384391290. So now you can start doing nice queries like ZRANGEBYSCORE RoomsByCapacity 15 +inf WITHSCORES to get all rooms with a capacity >=15. You then can intersect with available rooms.
Sets by layout: SADD RoomsByLayout:classroom UN0002384391290. Then you can intersect by layout, like SINTER AvailableRooms RoomsByLayout:classroom to get all available classrooms.
Sets by building: SADD RoomsByBuilding:RE0002439 UN0002384391290. Then you can intersect by buildings too, like SINTER AvailableRooms RoomsByLayout:classroom RoomsByBuilding:RE0002439 to get all available classrooms in a building.
You can mix sets with sorted sets, like ZINTERSTORE Available:RE0002439:ByCap 3 RoomsByBuilding:RE0002439 RoomsByCapacity AvailableRooms AGGREGATE MAX to get all available rooms scored by capacity in building RE0002439. Sorted sets only allow ZINTERSTORE and ZUNIONSTORE, so you need to clean up after your queries.
You can avoid sorted sets by using sets with capacity buckets, like Rooms:Capacity:1-5, Rooms:Capacity:6-10, etc.
Consider adding coordinates to your buildings, so your users can query by proximity. See GEOADD and GEORADIUS.
You may want to allow reservations and availability queries into the future. See Date range overlap on Redis?.
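Tying the indexes above together, a minimal redis-py sketch (key names follow the examples; the helper names and connection details are assumptions):

import redis

r = redis.Redis()

def add_room(room_id, building_id, capacity, layout):
    r.hset(room_id, mapping={"BuildingID": building_id,
                             "Capacity": capacity, "Layout": layout})
    r.sadd("AllRooms", room_id)
    r.zadd("RoomsByCapacity", {room_id: capacity})
    r.sadd(f"RoomsByLayout:{layout}", room_id)
    r.sadd(f"RoomsByBuilding:{building_id}", room_id)

def set_available(room_id, available):
    (r.sadd if available else r.srem)("AvailableRooms", room_id)

# All available classrooms in one building:
rooms = r.sinter("AvailableRooms", "RoomsByLayout:classroom",
                 "RoomsByBuilding:RE0002439")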

Storing large matrices in Apache Ignite

I have a large matrix of integers that I want to be able to slice and run analytics on. I'm prototyping this with Apache Ignite.
The matrix is 50000 columns x 5 million rows. I want to be able to run the following operations on this matrix:
1. Fetch all data for a single column.
2. Fetch all data for some random subset of rows and columns.
3. Compute a correlation coefficient for one row against every other row.
I'm trying to satisfy 1. and 2. right now, but I can't figure out how to store a matrix. I was thinking of storing the matrix like this:
row1 {
    col1: val
    col2: val
    col3: val
    ...
    col50000: val
}
row2 { ... }
But I'm not sure if I can have complex data types like this in Ignite, or if I can only have a single key:value pair. The documentation is not clear. When I try to insert a dictionary using pyignite (my Java is a little rusty, so I'm sticking to Python right now), the data comes back as an array:
>>> test.put('row2', { "col1": 50, "col2":0 })
>>> test.get('cell2')
['gene1', 'gene2']
I'm new to Apache Ignite, but the documentation doesn't seem to detail how to do this, or if it would even be performant.
I think you need to store 5 million KV pairs, using the row as the key and a 50000-element array of column values as the value.
Better to stick to primitive types. I'm not sure how best to map that to Python.
From a thin client perspective, Ignite caches are flat, not nested. You can put arrays, sequences, dictionaries, or any combination of the above as a value in an Ignite cache, but you cannot traverse values inside the cache afterwards. You can only retrieve the whole value and look into it.
cache.get(row)[column] will work, but it will retrieve the whole row of 50000 elements from the cache as a Python list, and then address the single element in this list. I think in your case it will be sub-optimal.
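For illustration, a minimal pyignite sketch of the row-as-key, array-as-value layout suggested above (host, port, and cache name are assumptions):

from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)
cache = client.get_or_create_cache('matrix')

cache.put('row1', [0] * 50000)   # one row stored as a flat list of ints

row = cache.get('row1')          # the whole 50000-element row comes back
value = row[42]                  # the column is then addressed client-side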
If I got your question right, JSON-oriented databases (like MongoDB or PostgreSQL's JSONB) have the features you describe. Don't know if they are fast enough for data analysis though.

If I have two physically equivalent but logically different types of data, then how many hashes should I use to store them?

Say I have two sets of items, which are similar except for their logical purpose in the program. Is it better programming practice to assign two hash tables to them, or should I use only one hash table for both?
If you store them in the same hash table, you run the (perhaps small or non-existent) risk of overwriting one with another. Say for example you are storing first names and last names (both strings). There could conceivably be one person with first name "Jones" and another with last name "Jones".
If the above is not possible, there's no technical reason why you could not use a single hash table. Items that hash to the same value will be stored in the same bucket, along with other items whose different hash values map to the same bucket; as long as you check for actual equality after a hash collision, you're okay.
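To make the overwriting risk concrete, a quick illustration with a plain dictionary (the names are hypothetical):

shared = {}
shared["Jones"] = "first-name record"
shared["Jones"] = "last-name record"    # silently clobbers the first entry

# Namespacing the key (or using two tables) keeps them distinct:
shared = {}
shared[("first", "Jones")] = "first-name record"
shared[("last", "Jones")] = "last-name record"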
That being said, I would still prefer to separate logical items into their own hash tables unless there is a very strong reason to combine them:
The code dealing with them will probably be easier to write and maintain.
It will be easier to debug issues.
Smaller hash tables will likely have fewer items per bucket and improve performance slightly.
If the sets of items are the same, the hashes should be the same as well.
It is like saying that because you can use a wrench to tighten a bolt or to break open a window, it should behave like 2 different objects. That isn't true, because it is your way of using it that differs, not the object itself.
