How can I enumerate all keys in our Redis database? - ruby

We have a huge Redis database containing about 100 million keys, mapping phone numbers to hashes of data.
Once in a while all this data needs to be aggregated and saved to an SQL database. During aggregation we need to iterate over all the stored keys and look at those hashes.
Using Redis.keys is not a good option because it retrieves and stores the whole list of keys in memory, and it takes a loooong time to complete. We need something that gives back an enumerator that can be used to iterate over all the keys, like so:
redis.keys_each { |k| agg(k, redis.hgetall(k)) }
Is this even possible with Redis?
This would prevent Ruby from constructing an array of 100 million elements in memory, and would probably be way faster. Profiling shows us that using the Redis.keys command makes Ruby hog the CPU at 100%, but the Redis process seems to be idle.
I know that using keys is discouraged in favor of maintaining a set of the keys, but even if we build such a set and retrieve it using smembers, we'll have the same problem.

Incremental enumeration of all the keys is not possible with Redis versions before 2.8 (see the note on SCAN at the end of this answer).
Instead of trying to extract all the keys from a live Redis instance, you could just dump the database (BGSAVE) and convert the resulting dump to a JSON file, to be processed with any Ruby tool you want.
See https://github.com/sripathikrishnan/redis-rdb-tools
Alternatively, you can use the redis-rdb-tools API directly to write a parser in Python and extract the required data (without generating a JSON file).
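For completeness: Redis 2.8 introduced the SCAN command for exactly this kind of incremental iteration, and the Ruby redis gem wraps it as scan_each, which yields keys lazily. A minimal sketch, reusing the agg helper from the question:

require "redis"

redis = Redis.new
# scan_each drives SCAN under the hood, walking the keyspace one cursor
# page at a time, so neither Ruby nor Redis materializes all 100 million
# keys at once.
redis.scan_each(match: "*") do |key|
  agg(key, redis.hgetall(key))
end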

Related

How to store two different cache "tables" in Redis under the same database/index?

Trying to build a data set of two cache tables (which are currently stored in SQL Server) - one is the actual cache table (CacheTBL); the other is the staging table (CacheTBL_Staging).
The table structure has two columns - "key", "value"
So I'm wondering how to implement this in Redis as I'm a total noob to this NoSQL stuff. Should I use a SET or LIST? Or something else?
Thank you so much in advance!
You need to decide whether you want separate Redis keys for all entries (using SET and GET), or to put them into hashes (using HSET and HGET). With the first approach, your keys should include a prefix to distinguish between main and staging. With hashes this is not necessary, because the hash name itself distinguishes them; a sketch of both approaches follows below. You probably also need to decide how you want to check for cache validity, and what your cache-flushing strategy should be. This normally requires some additional data structures in Redis.
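A minimal Ruby sketch of both options (the key and hash names are made up for illustration):

require "redis"

redis = Redis.new

# Option 1: plain keys, with a prefix telling main and staging apart.
redis.set("cache:main:some_key", "some value")
redis.set("cache:staging:some_key", "some value")
redis.get("cache:main:some_key")        # => "some value"

# Option 2: one hash per "table"; the hash name does the separating.
redis.hset("cache_main", "some_key", "some value")
redis.hset("cache_staging", "some_key", "some value")
redis.hget("cache_staging", "some_key") # => "some value"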

Is there a limit in number of keys in a Redis SET UNION operation?

I have a scenario where I am dumping a huge amount of data from Google BigQuery into the Redis SET data structure to get better response times. I need a SET UNION operation over millions of keys. I have tested with a few thousand keys and it works fine. The question is: is there any limit on the number of keys that can be supplied to a SUNION command at a time? Is it really SUNION Key1 Key2 Key3 ..... KeyN?
Consider I have enough system capacity.
[...] over millions of keys
There's no statement in Redis' documentation about a limitation on how many keys can be provided in a single sunion command.
BTW, I doubt that doing such an operation in Redis is a good idea. Remember that Redis is single-threaded: it will be blocked for the duration of the operation, and no other command will be executed until the sunion ends.
My best advice would be to split the work into many sunionstore commands, and later read the results from the generated sets, as if each set were one page of the result of a sunion over millions of keys:
sunionstore key:pages:1 key1 keyN
...and later you would use some iterator in your application layer to iterate over all generated pages.
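A rough Ruby sketch of that pagination idea (the batch size and key names are made up):

require "redis"

redis = Redis.new
source_keys = redis.scan_each(match: "key*").to_a

# Union the source sets in batches, storing each batch's union in its
# own "page" set instead of issuing one giant, blocking SUNION.
source_keys.each_slice(10_000).with_index(1) do |batch, page|
  redis.sunionstore("key:pages:#{page}", *batch)
end

# The application layer then walks key:pages:1, key:pages:2, ... and
# deduplicates across pages as needed, since members may repeat.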

Using hashes as IDs in key-value stores

I'm wondering whether it would be a good idea to use hashes (CityHash, Murmur and the like) as keys in a key-value store like Hazelcast. I'm expecting to have about 2,000,000,000 records (URLs) in the database, so collisions could happen. It wouldn't be super critical to lose some data through hash collisions, but of course it would be best to avoid them.
A record contains the URL, a time stamp, and a status code. The main operations are inserting and looking up whether a URL already exists.
So, what would you suggest, given speed is relevant:
using an ID generator, or
using a hash algorithm like CityHash or Murmur, or
using the relevant string itself, a URL in this case?
Hazelcast does not rely on the hashCode/equals methods of the key object; instead, it uses the Murmur hash of the key's binary (serialized) representation.
In short, you should not really worry about hash collisions: just use the URL string itself as the key and let Hazelcast do the hashing.

Firebase with classic data structures

Firebase allows you to store data in remote JSON tree and it can be nested up to 32 levels.
It's cool, but is there any way (or service) to store your data in lists, sets or hashes like Redis does, but remotely, like Firebase does?
A list is a collection of ordered data? If so: see Firebase's documentation on saving lists of data. If you're used to arrays, you might want to read these two blog posts on arrays in Firebase and real-time synchronized arrays too.
In JSON (and thus in Firebase) any associative array is essentially a set: each key can appear only once, with one value associated with it. So I'd map sets onto regular associative arrays, written with Firebase's set() operation, as sketched below.
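A tiny sketch of that mapping, with a Ruby hash standing in for the JSON tree (the names are made up): membership becomes key existence, and true is just a placeholder value.

# A Redis-style set modeled as an associative array: the members are
# the keys, the value is a throwaway placeholder.
topics = {
  "redis"    => true,
  "firebase" => true,
}

topics.key?("redis")        # membership test => true
topics["memcached"] = true  # add a member
topics.delete("redis")      # remove a member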
As you can see, there are quite a few links to the Firebase guide in the above.

Is there anything like memcached, but for sorted lists?

I have a situation where I could really benefit from a system like memcached, but with the ability to store (per key) a sorted list of elements, and to modify the list by adding values.
For example:
something.add_to_sorted_list( 'topics_list_sorted_by_title', 1234, 'some_title')
something.add_to_sorted_list( 'topics_list_sorted_by_title', 5436, 'zzz')
something.add_to_sorted_list( 'topics_list_sorted_by_title', 5623, 'aaa')
Which I then could use like this:
something.get_list_size( 'topics_list_sorted_by_title' )
// returns 3
something.get_list_elements( 'topics_list_sorted_by_title', 1, 10 )
// returns: 5623, 1234, 5436
The required system would allow me to easily get the item count of every list, and to fetch any number of values from the list, with the values sorted by the attached value (the title string, in the example above).
I hope that the description is clear. And the question is relatively simple: is there any such system?
Take a look at MongoDB. It uses memory-mapped files, so it is incredibly fast and should perform at a level comparable to memcached.
MongoDB is a schema-less database that should support what you're looking for (indexing/sorting).
Redis supports both lists and sets, and its sorted sets are exactly the "list sorted by an attached value" you describe (see the sketch below). You can disable disk saving and use it like memcached, instead of going for MongoDB, which saves data to disk.
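A minimal Ruby sketch using a Redis sorted set (redis gem; the names come from the question). Scores in a sorted set must be numeric, so giving every member the same score makes ZRANGE fall back to lexicographic ordering of the members; putting the title first in the member therefore yields ordering by title:

require "redis"

redis = Redis.new

# With equal scores, members order lexicographically, so "title:id"
# members come back sorted by title.
redis.zadd("topics_list_sorted_by_title", 0, "some_title:1234")
redis.zadd("topics_list_sorted_by_title", 0, "zzz:5436")
redis.zadd("topics_list_sorted_by_title", 0, "aaa:5623")

redis.zcard("topics_list_sorted_by_title")
# => 3
redis.zrange("topics_list_sorted_by_title", 0, 9)
# => ["aaa:5623", "some_title:1234", "zzz:5436"]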
MongoDB will fit. What's important is that it has indexes, so you can add an index on title for the topics collection and then retrieve items sorted by the index:
db.topics.ensureIndex({"title": 1})
db.topics.find().sort({"title": 1})
Why not just store an array in memcached? At least in Python and PHP, the memcached APIs support this (I think Python uses pickle, but I don't recall for sure).
If you need permanent data storage or backup, MemcacheDB uses the same API.
A basic Python example:

# Get previously stored data from the cache.
stored = cache.get(storedDataName)

# Initialize the dict if nothing has been stored previously.
if stored is None:
    stored = {}

# Look up an already-stored item.
try:
    alreadyHaveItem = stored[itemKey]
except KeyError:
    print('no result in cache')

# Add the new items.
for item in newItems:
    stored[item] = newItems[item]

# A dict is not kept sorted; sort by the attached value at read time.
topics_sorted_by_title = sorted(stored.items(), key=lambda kv: kv[1])

# Save the result back to the cache.
cache.set(storedDataName, stored, TTL)
