Caching sortable/filterable data in Redis - sorting

I have a variety of data that I've got cached in a standard Redis hashmap, and I've run into a situation where I need to respond to client requests for ordering and filtering. Order rankings for name, average rating, and number of reviews can change regularly (multiple times a minute, possibly). Can anyone advise me on a proper strategy for attacking this problem? Consider the following example to help understand what I'm looking for:
Client makes an API request to /api/v1/cookbooks?orderBy=name&limit=20&offset=0
I should respond with the first 20 entries, ordered by name
Strategies I've considered thus far:
for each type of hashmap store (cookbooks, recipes, etc), creating a sorted set for each ordering scheme (alphabetical, average rating, etc) from a Postgres ORDER BY; then pulling out ZRANGE slices based on limit and offset
storing ordering data directly into the JSON string data for each key.
hitting Postgres with a SELECT id FROM table ORDER BY _, and using the ids to pull directly from the hashmap store
Any additional thoughts or advice on how to best address this issue? Thanks in advance.

So, as mentioned in a comment below, Sorted Sets are a great way to implement sorting and filtering functionality in a cache. Take the following example as an idea of how one might solve the issue of needing to order objects in a hash:
Given a hash called "movies" with the scheme bucket:objectId -> object, where each object is a JSON string representation (it is worth reading up on "bucketing" your hashes for performance):
Create a sorted set called "movieRatings", where each member is an objectId from your "movies" hash, and its score is an average of all rating values (computed by the database). Just use a numerical representation of whatever you're trying to sort, and Redis gives you a lot of flexibility on how you can extract the slices you need.
This simple scheme has a lot of flexibility in what can be achieved - you simply ask your sorted set for a set of keys that fit your requirements, and look up those keys with HMGET from your "movies" hash. Two swift Redis calls, problem solved.
Rinse and repeat for whatever type of ordering you need, such as "number of reviews", "alphabetically", "actor count", etc. Filtering can also be done in this manner, but normal sets are probably quite sufficient for that purpose.
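As a rough illustration, here is a minimal sketch of the ZRANGE-then-HMGET pattern using the redis-rb gem; the key names follow the "movies"/"movieRatings" example above, and the limit/offset values are just placeholders:

require "redis"
require "json"

redis = Redis.new

limit  = 20
offset = 0

# 1) Ask the sorted set for the slice of ids we need, already ordered by score.
#    Use zrevrange instead if you want highest-rated first.
ids = redis.zrange("movieRatings", offset, offset + limit - 1)

# 2) Look those ids up in the "movies" hash in a single call.
#    (This assumes one "movies" hash; with bucketed hashes you would group
#    the ids by bucket first.)
movies = redis.hmget("movies", *ids).map { |json| JSON.parse(json) }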

This depends on your needs. Each of your strategies could work.
Your first approach of storing an auxiliary sorted set for each way you want to order is the best way to do this if you have a very big hash and/or you run your order queries frequently. This approach will require a lot of RAM if your hash is big, but it will also scale well in terms of time complexity as your hash gets bigger and you start running order queries more frequently. On the other hand, it introduces complexity in your data structures, and it feels like you're trying to use Redis for something a typical DB like Postgres, MySQL, or Mongo would be better at.
Storing ordering data directly in your keys means you need to pull your entire hash every time you do an order query. Maybe that's not so bad if your hash is very small, or if you don't do ordered queries very often, but this won't scale at all.
If you're already hitting Postgres to get keys, why not just store the values in Postgres as well? That would be much cheaper than hitting Postgres and then hitting Redis, and it would have your code depend on fewer things. IMO, this is probably your best option and would work most naturally. Do this unless you have some really good reason not to store values in Postgres, or some really big speed concerns, in which case go with your first strategy.
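For reference, here is a rough sketch of what that hybrid third strategy (ordered ids from Postgres, values from the Redis hash) looks like with the pg and redis gems; the table, hash, and database names come from the cookbooks example or are placeholders:

require "pg"
require "redis"
require "json"

pg    = PG.connect(dbname: "myapp")   # hypothetical database name
redis = Redis.new

limit  = 20
offset = 0

# Round trip 1: let Postgres do the ordering and pagination.
ids = pg.exec_params(
  "SELECT id FROM cookbooks ORDER BY name LIMIT $1 OFFSET $2",
  [limit, offset]
).map { |row| row["id"] }

# Round trip 2: fetch the cached JSON blobs for those ids from the hash.
cookbooks = redis.hmget("cookbooks", *ids).map { |json| JSON.parse(json) }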

Related

Efficient creation time descending lookup in Riak

I'm learning to use Riak, the NoSQL engine. Given that I have a user "timeline" with posts, and that the posts may range from millions to billions, how can I take the last N posts from the Riak bucket? I mean, the most recently created ones.
I read that when using a Secondary Index, Riak will return posts ordered by key. So I decided to use a UUID1 for post keys and to have a Secondary Index on the post author, so that I can take all posts from that author using its key.
However the posts are sorted ASCENDING! I also want to use the max_results parameter as the SQL LIMIT.
This query however returns the FIRST N posts of that user, not the last. Given that I have already seen some StackOverflow posts on this, and that the proposed solution, MapReduce, is not efficient for big buckets, how would you model the data or write the query?
Thanks
When coming from a SQL environment it is easy to treat a bucket as a table and store small individual records there, often relying on secondary indexes to get the data out. As Riak is a key-value store that uses consistent hashing, this is however often not the most efficient or scalable approach.
A lookup based on key in Riak allows the partitions holding the data to be directly identified, and the coordinating node can directly query these partitions. When querying a secondary index, Riak does not know on which partitions data that may match the index will reside. It will therefore need to send the query to a large number of partitions in order to ensure that all matching objects can be found. This is known as a 'coverage query' and means that, assuming n_val of 3 is used for the bucket, at least 1/3 of all partitions need to be queried. This generally leads to higher load on the cluster and does not scale as well as direct key lookups. Latencies also tend to be higher.
When using Riak it is therefore often recommended that you structure your data so that you can use direct key lookups as much as possible, e.g. through de-normalization.
If your messages/posts can be grouped in some way, e.g. by user or conversation, it may make sense to store them in a single object representing this grouping instead of as separate objects.
If we assume that your posts can consist of either text or images and are linked to a conversation thread, you could create an object representing the conversation thread. This would contain information about the conversation as well as a list of posts. This list of posts can e.g. contain the id of the poster, a timestamp and the key of the record containing the post. If the post is a reasonably short text message it may even contain the entire post, reducing the number of records that will need to be fetched.
As posts come in to this conversation, the record is updated and the list of posts gets longer. It may be wise to set allow_mult to true in order to enable siblings, as this will allow you to handle concurrent writes. This approach allows you to always get the conversation as well as the latest posts through a single direct key lookup.
Riak works best when the size of objects is kept below a couple of MB. You will therefore need to move the oldest posts off to a separate object at some point to keep the size in check. If you keep a list of these related objects in the main conversation object, possibly together with some information about the time interval they cover, you can easily access them through direct key lookups as well should you need to scroll back over older posts.
As the most common query usually is for the most recent entries, this can always be fulfilled through the main conversation object.
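To make the shape of that conversation object concrete, here is a minimal sketch using the riak-client Ruby gem; the bucket name, keys, and fields are all hypothetical, and sibling resolution is left out:

require "riak"

client = Riak::Client.new                 # defaults to a local node
bucket = client.bucket("conversations")   # hypothetical bucket name

# One object per conversation: metadata plus a list of recent posts.
convo = bucket.new("conversation:42")
convo.content_type = "application/json"
convo.data = {
  "participants" => ["alice", "bob"],
  "posts" => [
    { "author" => "alice", "ts" => Time.now.to_i,
      "text" => "short message inlined here", "post_key" => nil }
  ],
  "archives" => []   # keys of objects holding older posts, moved out over time
}
convo.store

# Appending a post later: fetch, mutate, store (one direct key lookup each way).
convo = bucket.get("conversation:42")
convo.data["posts"] << { "author" => "bob", "ts" => Time.now.to_i,
                         "text" => "a reply", "post_key" => nil }
convo.store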
I would also like to point out that we have a very active mailing list where these kinds of issues are discussed quite frequently.
I know it's probably too late to help you, but I found this post while wondering about the same thing. The workaround I have come up with and been using to good effect is to create two secondary indexes: one with the real timestamp, and another with (MAX_DATE - timestamp). Lookups on the first index return ascending results, and lookups on the second index return descending results (once you do the math to turn the values back into real dates). The maximum date value, as given in the JavaScript specification and reported on MDN, is 8640000000000000. I can't speak to how performant it is under really heavy load, but I can tell you that for my purposes it has been blazingly fast and I'm very satisfied. I just came here hoping to find a less hacky way to do it.
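As a rough sketch of that workaround with the riak-client Ruby gem (the bucket, keys, and index names are hypothetical; the constant is the JavaScript maximum mentioned above):

require "riak"
require "securerandom"

MAX_DATE_MS = 8_640_000_000_000_000   # JavaScript's maximum date value, in ms

client = Riak::Client.new
bucket = client.bucket("posts")        # hypothetical bucket name

# When writing a post, store both the real timestamp and its "reversed" twin.
post = bucket.new("post:#{SecureRandom.uuid}")
post.content_type = "application/json"
ts_ms = (Time.now.to_f * 1000).to_i
post.indexes["created_int"]      << ts_ms
post.indexes["created_desc_int"] << (MAX_DATE_MS - ts_ms)
post.data = { "author" => "alice", "body" => "hello" }
post.store

# Ascending: query created_int. Descending (newest first): query the
# reversed index and convert the values back when needed.
newest = bucket.get_index("created_desc_int", 0..MAX_DATE_MS, max_results: 20)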

How to design database to store and retrieve large item/skill lists in ruby

I'm planning a role-playing game where characters are supposed to carry/use items and train skills. When it comes to storing the (possibly numerous) items/skills possessed by characters, I can't think of a better way than putting a row for every possible item and skill for each character instantiated. However, this seems like overkill to me.
To be clear, if this were an exercise or a small game where the total number of items/skills is ~30, I would add an items hash and a skills hash to the character class, along with methods to add and remove them, like:
def initialize
  # Default the counts to 0 so += works for items/skills not seen before.
  @inventory = Hash.new(0)
  @skills = Hash.new(0)
end

def add_item(item, number)
  @inventory[item] += number
end
Given that I would like to store the number of each item and the level of each skill, what else can I try in order to handle ~1000 possible items (with maybe ~150 of them in an inventory) and possibly 100 skills?
Plan for Data Retrieval
Generally, it's a good idea to design your database around how you plan to look up and retrieve your data, rather than how you want to store it. A bad design makes your data very expensive to collect from the database.
In your example, having a separate model for each inventory item or skill would be hugely expensive in terms of lookups whenever you want to load a character. Do you really want to do 1,000 lookups every time you load someone's inventory? Probably not.
Denormalize for Speed
You typically want to normalize data that needs to be consistent, and denormalize data that needs to be retrieved/updated quickly. One option might be to serialize your character attributes.
For example, it should be faster to store a serialized Character#inventory_items field than to update 100 separate records through a has_many :through or has_and_belongs_to_many relationship. There are certainly trade-offs involved with denormalization in general and serialization in particular, but it might be a good fit for your specific use case.
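If you happen to be on ActiveRecord, a minimal sketch of that serialized field might look like the following; the column and method names are hypothetical, and the exact serialize signature varies between Rails versions:

# Assumes a characters table with a text column named inventory_items.
class Character < ActiveRecord::Base
  # Stores the whole inventory hash ({item_id => quantity}) in one column,
  # so loading a character is a single row read rather than ~1000 joins.
  serialize :inventory_items, Hash

  def add_item(item_id, number)
    self.inventory_items ||= {}
    inventory_items[item_id] = inventory_items.fetch(item_id, 0) + number
  end
end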
Consider a Document Database
Character sheets are documents. Unless you need the relational power of a SQL database, a document-oriented database might be a better fit for the data you want to manage. CouchDB seems particularly well-suited for this example, but you should certainly evaluate all your NoSQL options to see if any offer the features you need. Your mileage will definitely vary.
Always Benchmark
Don't take my word for what's optimal. Try a design. Benchmark it. See what the design does with your data. In the end, that's the only thing that matters.
I can't think of a better way than putting a row for every possible item and skill to each character instantiated.
Do characters evolve independently?
Assuming yes, there is no other choice but to have each and every relevant combination physically represented in the database.
If not, then you can "reuse" the same set of items/skills for multiple characters, but this is probably not what is going on here.
In any case, relational databases are very good at managing huge amounts of data and the numbers you mentioned don't even qualify as "huge". By correctly utilizing techniques such as clustering, you can ensure that a lookup of all items/skills for a given character is done in a minimal number of I/O operations, i.e. very fast.
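For what it's worth, the per-combination representation described above is just a join table per collection; here is a hedged sketch of what the items side might look like as an ActiveRecord migration (the table, column names, and migration version are hypothetical):

class CreateCharacterItems < ActiveRecord::Migration[6.1]
  def change
    # One row per (character, item) combination the character actually owns.
    create_table :character_items do |t|
      t.references :character, null: false
      t.references :item,      null: false
      t.integer    :quantity,  null: false, default: 0
    end
    # Indexing/clustering on character_id keeps "all items for a character"
    # lookups down to a handful of I/O operations.
    add_index :character_items, [:character_id, :item_id], unique: true
  end
end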

Best approaches to reduce the number of searches across FileNet object stores to find a document based on its creation time?

For example, there are 5 object stores. I am thinking of inserting documents into them, but not in sequential order. Initially it might be sequential, but if I could insert using some ranking method it would be easier to know which object store to search to find a document. The goal is to reduce the number of object store searches. This can only be achieved if the insertion uses some intelligent algorithm.
One method I found useful is using the current year MOD N (the number of object stores) to determine where a document goes. Are there better approaches to this?
If you want fast access there are a couple of criteria:
The hash function has to be reproducible based on the data which is queried. This means a lot depends on the queries you expect.
You usually want to distribute your objects as evenly across stores as possible. If you want to go parallel, you want to access the documents for a given query from different stores, so they will not block each other. Hence your hashing function should spread similar documents across different stores as much as possible. If you expect documents related to the same query to be from the same year, do not use the year directly.
This assumes you want fast queries that can be parallelised. If you instead have a system in which you first have to open a potentially expensive connection to the store, then most documents related to the same query should go in the same store, and you should not take my advice above.
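As a tiny illustration of the "reproducible but evenly spread" idea, here is a sketch in Ruby; the store count and key format are hypothetical. Hashing a stable document attribute, rather than the year, spreads related documents across stores:

require "digest"

STORE_COUNT = 5   # hypothetical number of object stores

# Reproducible: the same document key always maps to the same store,
# so a lookup only ever has to search one store.
# Evenly spread: MD5 mixes the key, unlike year MOD N, which lumps a
# whole year's documents into a single store.
def store_for(document_key)
  Digest::MD5.hexdigest(document_key).to_i(16) % STORE_COUNT
end

store_for("INV-2014-000123")   # => an index between 0 and 4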
Your criteria for "what goes in a FileNet object store?" is basically "what documents logically belong together?".

Does someone really sort terabytes of data?

I recently spoke to someone who works for Amazon, and he asked me how I would go about sorting terabytes of data using a programming language.
I'm a C++ guy and, of course, we spoke about merge sort; one of the possible techniques is to split the data into smaller chunks, sort each of them, and finally merge them.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
Yes. Last time I checked Google processed over 20 petabytes of data daily.
Why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it? That's my question in a nutshell.
EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
Consider log data from servers: Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. Thus if you want it sorted by product, you would need to sort the whole data set.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower-network data using merge sort. The implementation I used was by Prof Dan Lemire:
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted by userid, and each line contained a userid followed by the userid of a person following them. However, in my case I wanted data about who follows whom, so I had to sort it again by the second userid in each line.
However, for sorting 1 TB I would use MapReduce with Hadoop.
Sorting is the default step after the map function. Thus I would choose the map function to be identity, NONE as the reduce function, and set up streaming jobs.
Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map task per block. After the map function is run, the map output is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data, I would emit that element as the key and the whole line as the value in the output of the map.
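To make that last point concrete, here is a rough sketch of what such a streaming mapper could look like in Ruby; the delimiter and field position are hypothetical. The shuffle phase then sorts the records by the emitted key before they reach the reducer:

#!/usr/bin/env ruby
# Streaming mapper: emit the column to sort by as the key and the whole
# line as the value. Assumes tab-delimited input and that we want to sort
# by the second column (e.g. the second userid in the follower data).
SORT_FIELD = 1

STDIN.each_line do |raw|
  line = raw.chomp
  next if line.empty?
  fields = line.split("\t")
  puts "#{fields[SORT_FIELD]}\t#{line}"
end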
Yes, certain companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
Yes. Some companies do. Or maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years, which is every change in the price offering, real deal prices (trades, AKA prints), etc. For highly volatile instruments, such as stocks, futures and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments for the last couple of years. Not to mention news that they correlate with the market, weather conditions and even moon phases. So, yes, there are guys who sort terabytes of data. Maybe not every day, but still, they do.
Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.
Big companies do sort tera- and petabytes of data regularly. I've worked for more than one such company. As Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but key extraction, enriching, etc.) at massive scale. Despite all that, there might be situations where you need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps.
For security/privacy policy reasons, certain fields in the log files needed to be encrypted before the data could be moved over for further processing. That meant that for each row, a custom encryption algorithm was applied. However, since the same field value appears hundreds of times in the file, it was more efficient to sort the file first, encrypt each distinct value once, and cache the result for the repeated values.
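A minimal sketch of that idea in Ruby; the file layout, field position, and encrypt routine are all hypothetical stand-ins. Once the file has been sorted on the sensitive field, repeated values sit next to each other and each distinct value only has to be encrypted once:

require "digest"

# Stand-in for the custom (and relatively expensive) encryption routine.
def encrypt(value)
  Digest::SHA256.hexdigest("secret-salt:#{value}")
end

cache = {}   # distinct plaintext value => encrypted value

# Assumes the file was externally sorted on column 3 (tab-delimited),
# so cache hits dominate and each distinct value is encrypted only once.
File.foreach("events.sorted.tsv") do |raw|
  fields = raw.chomp.split("\t")
  fields[2] = (cache[fields[2]] ||= encrypt(fields[2]))
  puts fields.join("\t")
end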

What's the performance difference between DBI's fetchall_hashref and fetchall_arrayref?

I am writing some Perl scripts to manipulate large amounts of data (about 42 million rows in total, though it won't be done in one hit) in two PostgreSQL databases.
For some of my queries it makes good sense to use fetchall_hashref because I have synthetic keys. However, in other instances I'm going to have to use an array of three columns as the unique key.
This has got me wondering about the performance differences between fetchall_arrayref and fetchall_hashref. I know that in both cases everything is going into memory, so selecting several GB of data probably isn't a good idea, but other than that there appears to be very little guidance in the documentation when it comes to performance.
My googling has been unsuccessful so if anyone can point me in the direction of some general performance studies I'd be grateful.
(I know I could benchmark this myself, but unfortunately for dev purposes I don't have access to a machine with hardware identical to production, which is why I'm looking for general guidelines or even best practices.)
Most of the choices between fetch methods depend on what format you want the data to end up in and how much of the work for that you want DBI to do for you.
My recollection is that iterating with fetchrow_arrayref and using bind_columns is the fastest (least DBI overhead) way to read through returned data.
First question is whether you really need to use a fetchall in the first place. If you don't need all 42 million rows in memory at once, then don't read them all in at once! bind_columns and fetchrow_arrayref are generally the way to go whenever possible, as ysth already pointed out.
Assuming that fetchall really is needed, my gut intuition is that fetchall_arrayref will be marginally faster, since an array is a simpler data structure and doesn't need to compute hashes of the inserted keys, but the savings in time would be dwarfed by database read times, so it's unlikely to be significant.
Memory requirements are another matter entirely, though. The structure returned by fetchall_hashref is a hash of id => row, with each row being represented as a hash of field name => field value. If you get 42 million rows, that means your list of field names is repeated in each of 42 million sets of hash keys... That's going to require a good deal more memory to store than the array of arrays returned by fetchall_arrayref. (Unless DBI is doing some magic with tie to optimize the fetchall_hashref structure, I suppose.)
