How does GetItem/BatchGetItem compare to Querying and Scanning a DynamoDB table in terms of efficiency? - performance

Specifically, when is it better to use one or the other? I am using BatchGetItem now and it seems pretty damn slow.

For retrieving a single item whose partition key (and sort key, if the table uses one) you already know, GetItem is more efficient than querying or scanning. BatchGetItem is a convenient way of retrieving a batch of items whose partition/sort keys you know; its only efficiency gain over individual GetItem calls is the saving in network round trips.
However, if you only have partial information about an item then you can't use GetItem/BatchGetItem and you have to either Scan or Query for the item(s) that you care about. In such cases Query will be more efficient than Scanning since with a query you're already narrowing down the table space to a single partition key value. Filter Expressions don't really contribute all that much to the efficiency but they can save you some network traffic.
There is also the case where you need to retrieve a large number of items. If you need lots of items that share the same partition key, a Query becomes more efficient than multiple GetItem (or BatchGetItem) calls. And if you need to retrieve items making up a significant portion of your table, a Scan is the way to go.
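To make the trade-off concrete, here is a minimal sketch using boto3 in Python; the table name "Posts" and its key attributes are invented for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table "Posts" with partition key "user_id" and sort key "created_at".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Posts")

# GetItem: cheapest option when you know the full primary key.
item = table.get_item(Key={"user_id": "u123", "created_at": "2015-06-01T12:00:00Z"})

# BatchGetItem: same per-item cost as GetItem, but fewer network round trips.
batch = dynamodb.batch_get_item(
    RequestItems={
        "Posts": {
            "Keys": [
                {"user_id": "u123", "created_at": "2015-06-01T12:00:00Z"},
                {"user_id": "u123", "created_at": "2015-06-02T08:30:00Z"},
            ]
        }
    }
)

# Query: use when you only know the partition key; it reads a single partition.
recent = table.query(
    KeyConditionExpression=Key("user_id").eq("u123"),
    ScanIndexForward=False,  # newest first
    Limit=20,
)
```

A Query with a known partition key and a Limit is typically the sweet spot when you need "the latest N items for this key".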

Related

Efficient creation-time-descending lookup in Riak

I'm learning to use Riak, the NoSQL engine. Given that I have a user "timeline" with posts, and that the posts may range from millions to billions, how can I take the last N posts from the Riak bucket? I mean, the most recently created.
I read that when using a Secondary Index, Riak will return posts ordered by key. So I decided to use a UUID1 for post keys and to have a Secondary Index on the post author, so that I can take all posts from that author using its key.
However the posts are sorted ASCENDING! I also want to use the max_results parameter as the SQL LIMIT.
This query, however, returns the FIRST N posts of that user, not the last. Given that I have already seen some Stack Overflow posts, and that the proposed solution, MapReduce, is not efficient for big buckets, how would you model the data or write the query?
Thanks
When coming from a SQL environment it is easy to treat a bucket as a table and store small individual records there, often relying on secondary indexes to get the data out. As Riak is a key-value store that uses consistent hashing, this is however often not the most efficient or scalable approach.
A lookup based on key in Riak allows the partitions holding the data to be directly identified, and the coordinating node can directly query these partitions. When querying a secondary index, Riak does not know on which partitions data that may match the index will reside. It will therefore need to send the query to a large number of partitions in order to ensure that all matching objects can be found. This is known as a 'coverage query' and means that, assuming n_val of 3 is used for the bucket, at least 1/3 of all partitions need to be queried. This generally leads to higher load on the cluster and does not scale as well as direct key lookups. Latencies also tend to be higher.
When using Riak it is therefore often recommended that you structure your data so that you can use direct key lookups as much as possible, e.g. through de-normalization.
If your messages/posts can be grouped some way, e.g. by user or conversation, it may make sense to store them in a single object representing this grouping instead of as separate objects.
If we assume that your posts can consist of either text or images and are linked to a conversation thread, you could create an object representing the conversation thread. This would contain information about the conversation as well as a list of posts. This list of posts can e.g. contain the id of the poster, a timestamp and the key of the record containing the post. If the post is a reasonably short text message it may even contain the entire post, reducing the number of records that will need to be fetched.
As posts come in to this conversation, the record is updated and the list of posts gets longer. It may be wise to set allow_mult to true in order to enable siblings, as this will allow you to handle concurrent writes. This approach allows you to always get the conversation as well as the latest posts through a single direct key lookup.
Riak works best when the size of objects is kept below a couple of MB. You will therefore need to move the oldest posts off to a separate object at some point to keep the size in check. If you keep a list of these related objects in the main conversation object, possibly together with some information about the time interval they cover, you can easily access them through direct key lookups as well should you need to scroll back over older posts.
As the most common query usually is for the most recent entries, this can always be fulfilled through the main conversation object.
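As a rough sketch of that pattern with the official Python Riak client (bucket and key names here are invented, and sibling resolution is only hinted at in a comment):

```python
import riak

# One object per conversation that embeds the most recent posts, so the common
# read is a single key lookup.
client = riak.RiakClient()
conversations = client.bucket("conversations")

conv_key = "conversation:42"
conv = conversations.new(conv_key, data={
    "participants": ["alice", "bob"],
    "posts": [],            # newest posts kept inline
    "archive_keys": [],     # keys of older, rolled-off post batches
})
conv.store()

# Appending a post: fetch, modify, store. With allow_mult enabled on the
# bucket you would resolve siblings here instead of blindly overwriting.
conv = conversations.get(conv_key)
conv.data["posts"].append({
    "author": "alice",
    "ts": "2014-09-01T10:15:00Z",
    "text": "short posts can live inline; larger ones get their own key",
})
conv.store()

# The common query -- "latest posts in this conversation" -- is one key lookup.
latest = conversations.get(conv_key).data["posts"][-20:]
```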
I would also like to point out that we do have a very active mailing list where these kinds of issues are discussed quite frequently.
I know it's probably too late to help you, but I found this post while wondering about the same thing. The workaround I have come up with and been using to good effect is to create two secondary indexes, one with the real timestamp, and another with (MAX_DATE - timestamp). Lookups on the first index return ascending results, and lookups on the second return descending results (once you do the math to turn it back into a real date). You can find the max date value in the JavaScript specification, as reported on MDN: 8640000000000000. I can't speak to how performant it is under really heavy load, but I can tell you that for my purposes it has been blazingly fast and I'm very satisfied. I just came here hoping to find a less hacky way to do it.
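For what it's worth, the arithmetic behind that workaround is simple; a small sketch, assuming millisecond timestamps:

```python
# "Reverse index" trick: index on (MAX_DATE - timestamp) so that an ascending
# 2i range query returns newest-first results.
MAX_DATE = 8640000000000000  # max JS Date value in milliseconds, as noted above

def reverse_index_value(timestamp_ms):
    return MAX_DATE - timestamp_ms

def original_timestamp(reverse_value):
    return MAX_DATE - reverse_value

ts = 1409566500000                      # some post's creation time (ms since epoch)
rev = reverse_index_value(ts)           # store this in the second secondary index
assert original_timestamp(rev) == ts    # "do the math" to get the real date back
```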

How does Oracle manage a hash partition

I understand the concept of range partitioning. If I have a date column and I partition on that column by month, then if my query's WHERE clause filters for just one month, I can hit a particular partition and get my data without hitting the full table.
In the Oracle docs I read that if a logical partitioning key like 'month' is not available (e.g. you partition on a column called customer ID), then you should use hash partitioning. So how does this work? Does Oracle randomly divide the data, assign it to different partitions and assign a hash code to each partition?
But in this situation, when new data comes in, how does Oracle know in which partition to put the new data? And when I query data, it seems there is no way to avoid hitting multiple partitions?
"how does oracle know in which partition to put the new data?"
From the documentation:
Oracle Database uses a linear hashing algorithm and to prevent data from clustering within specific partitions, you should define the number of partitions by a power of two (for example, 2, 4, 8).
As for your other question ...
"when i query data, it seems there is no way to avoid hitting multiple
partitions?"
If you're searching for a single Customer ID then no. Oracle's hashing algorithm is consistent, so records with the same partition key end up in the same partition (obviously). But if you are searching for, say, all the new customers from the last month then yes. Oracle's hashing algorithm will strive to distribute records evenly so the latest records will be spread across the whole table.
So the real question is, why do we choose to partition a table? Performance is often the least compelling reason to partition. Better reasons include
Availability: each partition can reside in a different tablespace. Hence a problem with a tablespace will take out a slice of the table's data instead of the whole thing.
Management: partitioning provides a mechanism for splitting whole-table jobs into clear batches. Partition exchange can make it easier to bulk load data.
As for performance, physical co-location of records can speed up some queries: those which search for records by a defined range of keys. However, any query which doesn't match the grain of the partitioning won't perform faster (and may even perform slower) than against a non-partitioned table.
Hash partitioning is unlikely to provide performance benefits, precisely because it shuffles the keys across the whole table. It will provide the availability and manageability benefits of partitioning (but is obviously not particularly amenable to partition exchange).
A hash is not random; it divides the data in a repeatable (but perhaps difficult-to-predict) fashion, so that the same ID will always map to the same partition.
Oracle uses a hash algorithm that should usually spread the data evenly between partitions.
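Oracle's internal hash function isn't exposed, but the principle is easy to illustrate with a toy sketch in Python (this is not Oracle's actual algorithm):

```python
# Toy illustration of the principle: the partition for a row is derived
# deterministically from its key, so the same customer ID always lands in
# the same partition, while distinct IDs spread across all partitions.
NUM_PARTITIONS = 8  # a power of two, per the documentation's advice

def partition_for(customer_id):
    return hash(customer_id) % NUM_PARTITIONS

print(partition_for(1001))    # always the same partition for customer 1001...
print(partition_for(1001))    # ...on every insert and every lookup
print({partition_for(i) for i in range(10000)})  # IDs spread across all 8 partitions
```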

Caching sortable/filterable data in Redis

I have a variety of data that I've got cached in a standard Redis hashmap, and I've run into a situation where I need to respond to client requests for ordering and filtering. Order rankings for name, average rating, and number of reviews can change regularly (multiple times a minute, possibly). Can anyone advise me on a proper strategy for attacking this problem? Consider the following example to help understand what I'm looking for:
Client makes an API request to /api/v1/cookbooks?orderBy=name&limit=20&offset=0
I should respond with the first 20 entries, ordered by name
Strategies I've considered thus far:
for each type of hashmap store (cookbooks, recipes, etc), creating a sorted set for each ordering scheme (alphabetical, average rating, etc) from a Postgres ORDER BY; then pulling out ZRANGE slices based on limit and offset
storing ordering data directly into the JSON string data for each key.
hitting Postgres with a SELECT id FROM table ORDER BY _, and using the ids to pull directly from the hashmap store
Any additional thoughts or advice on how to best address this issue? Thanks in advance.
So, as mentioned in a comment below, Sorted Sets are a great way to implement sorting and filtering functionality in a cache. Take the following example as an idea of how one might solve the issue of needing to order objects in a hash:
Given a hash called "movies" with the scheme bucket:objectId -> object, where each object is a JSON string representation (read about "bucketing" your hashes for performance here).
Create a sorted set called "movieRatings", where each member is an objectId from your "movies" hash, and its score is an average of all rating values (computed by the database). Just use a numerical representation of whatever you're trying to sort, and Redis gives you a lot of flexibility on how you can extract the slices you need.
This simple scheme has a lot of flexibility in what can be achieved - you simply ask your sorted set for a set of keys that fit your requirements, and look up those keys with HMGET from your "movies" hash. Two swift Redis calls, problem solved.
Rinse and repeat for whatever type of ordering you need, such as "number of reviews", "alphabetically", "actor count", etc. Filtering can also be done in this manner, but normal sets are probably quite sufficient for that purpose.
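A minimal redis-py sketch of that two-call pattern, reusing the "movies"/"movieRatings" names above (the data and scores are invented, and the hash is left un-bucketed for brevity):

```python
import json
import redis

r = redis.Redis()

# Cache each object as JSON in a hash, and keep one sorted set per ordering
# scheme whose scores are computed by the database.
r.hset("movies", "movie:1", json.dumps({"title": "Alpha", "avg_rating": 4.7}))
r.hset("movies", "movie:2", json.dumps({"title": "Beta", "avg_rating": 3.9}))
r.zadd("movieRatings", {"movie:1": 4.7, "movie:2": 3.9})

# "orderBy=rating&limit=20&offset=0": slice the sorted set, then hydrate.
ids = r.zrevrange("movieRatings", 0, 19)          # highest rated first
page = [json.loads(doc) for doc in r.hmget("movies", ids)]
```

For offset-based paging you would slice with zrevrange(key, offset, offset + limit - 1).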
This depends on your needs. Each of your strategies could work.
Your first approach of storing an auxiliary sorted set for each way you want to order is the best way to do this if you have a very big hash and/or you run your order queries frequently. This approach will require a lot of RAM if your hash is big, but it will also scale well in terms of time complexity as your hash gets bigger and you start running order queries more frequently. On the other hand, it introduces complexity in your data structures, and feels like you're trying to use Redis for something a typical DB like Postgres, MySQL, or Mongo would be better at.
Storing ordering data directly into your keys means you need to pull your entire hash every time you do an order query. Maybe that's not so bad if your hash is very small, or you don't do ordered queries very often, but this won't scale at all.
If you're already hitting Postgres to get keys, why not just store the values in Postgres as well? That would be much cheaper than hitting Postgres and then hitting Redis, and would have your code depend on fewer things. IMO, this is probably your best option and would work most naturally. Do this unless you have some really good reason not to store values in Postgres, or some really big speed concerns, in which case go with your first strategy.

Best approaches to reduce the number of searches across FileNet object stores to find a document based on its creation time?

For example, there are 5 object stores. I am thinking of inserting documents into them, but not in sequential order. Initially it might be sequential, but if I could insert using some ranking method, it would be easier to know which object store to search to find a document. The goal is to reduce the number of object store searches. This can only be achieved if the insertion uses some intelligent algorithm.
One method I found useful is using the current year MOD N (number of object stores) to determine where a document goes. Could we have some better approaches to this?
If you want fast access there are a couple of criteria:
The hash function has to be reproducible from the data that is queried. This means a lot depends on the queries you expect.
You usually want to distribute your objects as evenly across stores as possible. If you want to go parallel, you want the documents for a given query to come from different stores, so they will not block each other. Hence your hashing function should spread similar documents across different stores as much as possible. If you expect documents related to the same query to be from the same year, do not use the year directly.
This assumes you want fast queries that can be parallelised. If you instead have a system in which you first have to open a potentially expensive connection to a store, then most documents related to the same query should go into the same store, and you should not take my advice above.
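As a toy illustration of that trade-off (store names and the choice of MD5 are arbitrary): hashing the document ID spreads related documents across stores, whereas hashing the year keeps a year's documents together.

```python
import hashlib

STORES = ["OS1", "OS2", "OS3", "OS4", "OS5"]   # hypothetical object store names

def store_for(doc_id):
    # A reproducible hash of the document ID; unlike Python's built-in hash()
    # for strings, MD5 gives the same result across processes and restarts.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return STORES[int(digest, 16) % len(STORES)]

# The same ID always maps to the same store, so lookups know where to go,
# and distinct IDs spread across all five stores.
assert store_for("DOC-2014-00042") == store_for("DOC-2014-00042")
print({store_for(f"DOC-2014-{i:05d}") for i in range(1000)})
```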
Your criteria for "what goes in a FileNet object store?" is basically "what documents logically belong together?".

How to efficiently search large datasets by location and date range?

I have a MongoDB collection containing attributes such as:
longitude, latitude, start_date, end_date, price
I have over 500 million documents.
My question is how to search by lat/long, date range and price as efficiently as possible?
As I see it my options are:
Create a geospatial index on lat/long and use MongoDB's proximity search... and then filter the results by date range and price.
I have yet to test this, but I am worried that the amount of data would be too much to search quickly, when we have around 1 search a second.
Have you had experience with how MongoDB would react under these circumstances?
Split the data into multiple collections by location. i.e. by cities like london_collection, paris_collection, new_york_collection.
I would then have to query by lat/long first, find the nearest city collection and then do a MongoDB spatial search on the subset of data in that collection with date and price filters.
I would have uneven distribution of documents as some cities would have more documents than others.
Create collections by dates instead of location. Same as above, but each document is allocated to a collection based on its date range.
problem with searches that have a date range that straddles multiple collections.
Create unique ids based on city_start_date_end_date for each document.
Again I would have to use my lat/long query to find the nearest city, then append the date range to access the key. This seems pretty fast, but I don't really like the city lookup aspect... it seems a bit ugly.
I am in the process of experimenting with option 1, but would really like to hear your ideas before I go too far down one particular path.
How do search engines split up and manage their data... this must be a similar kind of problem?
Also, I do not have to use MongoDB; I'm open to other options.
Many thanks.
Indexing and data-access performance is a deep and complex subject. A lot of factors can affect the most efficient solution, including the size of your data sets, the read-to-write ratio, the relative performance of your IO and backing store, etc.
While I can't give you a concrete answer, I can suggest investigating Morton numbers (also known as Z-order curves) as an efficient way of querying on multiple similar numeric values like lat/longs by combining them into a single sortable key.
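For illustration, a minimal sketch of building a Morton key from a lat/long pair; the 16-bit precision is arbitrary, and points that are close in space usually (though not always, near quantisation boundaries) share a long common prefix of bits:

```python
# Quantise each coordinate to a 16-bit integer, then interleave the bits so
# the pair becomes a single sortable number.
BITS = 16

def quantise(value, lo, hi):
    return int((value - lo) / (hi - lo) * ((1 << BITS) - 1))

def interleave(x, y):
    z = 0
    for i in range(BITS):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions <- x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions  <- y
    return z

def morton(lat, lon):
    return interleave(quantise(lat, -90.0, 90.0), quantise(lon, -180.0, 180.0))

print(morton(51.5074, -0.1278))   # a single number you can index and range-scan
```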
Why do you think option 1 would be too slow? Is this the result of a real world test or is this merely an assumption that it might eventually not work out?
MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.
Don't reinvent the wheel, but try to optimize the most obvious path (1) first:
Set up geo indexes
Use explain to make sure your queries actually use the index
Make sure your indexes fit into RAM
Profile the database using the built-in profiler
Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
If possible, avoid geoNear and stick to the faster (but not perfectly spherical) near queries
If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.
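A minimal pymongo sketch of that path (the collection name and the "loc" field are assumptions; the documents are assumed to store their position as GeoJSON, while start_date, end_date and price come from the question):

```python
from datetime import datetime
import pymongo

client = pymongo.MongoClient()
coll = client.mydb.listings          # hypothetical database/collection names

# Geo index on the GeoJSON position field.
coll.create_index([("loc", pymongo.GEOSPHERE)])

# Proximity search near a point, filtered by date range and price.
cursor = coll.find({
    "loc": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-0.1278, 51.5074]},
            "$maxDistance": 5000,    # metres
        }
    },
    "start_date": {"$lte": datetime(2015, 7, 1)},
    "end_date": {"$gte": datetime(2015, 7, 1)},
    "price": {"$lte": 150},
}).limit(20)

print(cursor.explain())   # confirm the 2dsphere index is actually used
```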
