How are RethinkDB joins implemented?

I've been poking at RethinkDB lately, and was quite alarmed when I reached the joins section of the documentation. From what I know, RethinkDB stores data in shards, which may be distributed across servers (and that is, as far as I know, practically a huge NO for joins). So how does RethinkDB perform join queries? Does it basically download all the data onto one node (which would render existing indexes useless, wouldn't it?), or does it use a more sophisticated algorithm?

In RethinkDB 2.2 and before, an eqJoin performs an indexed getAll operation on the right-hand table for each document in the left-hand input.
This operation is initiated on each of the shards that are hosting the left-hand input of the eqJoin command.
As you point out, performing the getAll might require going over the network to reach a shard of the right-hand table on a different server. However, indexes are still being used.
(You can find the implementation of eqJoin here: https://github.com/rethinkdb/rethinkdb/blob/v2.2.x/src/rdb_protocol/terms/rewrites.cc#L121 ; it is just a rewrite to other operations.)
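In other words, the 2.2 behaviour is conceptually similar to writing the join by hand with concatMap and getAll. Here is a rough sketch in JavaScript driver syntax (the posts and authors tables and the author_id field are made up for illustration):

```
const r = require("rethinkdb");

// For every document on the left side, do an indexed lookup on the right
// side (getAll uses the primary key by default) and pair the results up.
const query = r.table("posts").concatMap(function (post) {
  return r.table("authors")
    .getAll(post("author_id"))
    .map(function (author) {
      return { left: post, right: author };
    });
});
```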
Starting with the upcoming RethinkDB 2.3, eqJoin uses batched getAll operations. This means that it reads a bunch of results (e.g. up to 1 MB) from the left-hand input, and then issues a single getAll to the shards of the right-hand table. Once it gets the data back from those shards, it combines it with the data it had previously read from the left input and passes it on to the user. Then it repeats this until all data from the left input has been processed.
This approach requires significantly fewer network roundtrips between the servers, and is usually significantly faster. You can find some more details about the new implementation at https://github.com/rethinkdb/rethinkdb/issues/5115 .
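From the client's point of view the query itself does not change between versions. For illustration, an indexed eqJoin with the JavaScript driver looks roughly like this (table names are made up, and the snippet assumes it runs inside an async function):

```
const r = require("rethinkdb");

const conn = await r.connect({ host: "localhost", port: 28015 });

const cursor = await r.table("posts")
  .eqJoin("author_id", r.table("authors"))  // or { index: "some_secondary_index" }
  .zip()                                     // merge each left/right pair into one document
  .run(conn);

const joined = await cursor.toArray();
```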
Finally, the other available join operations (innerJoin and outerJoin) are not indexed and shouldn't be used for data sets of any significant size, as the documentation also points out.

Related

How to achieve Data Sharding in Endeca (data partitioning)

Currently Oracle Commerce Guided Search (Endeca) supports only language-specific partitions (i.e., one MDEX per language). For systems with a huge data volume (say ~100 million records across ~200 stores), has anyone successfully implemented data partitioning (sharding) based on a logical group of data (i.e., one MDEX per group of stores), so that the large set of data can be divided into smaller sets of data?
If so, what precautions should be taken while indexing the data, and what strategies should be used for querying the Assembler?
I don't think this is possible. Endeca used to support the Adgidx, which allowed you to split or shard the MDEX, but that is no longer supported. Oracle's justification for removing it is that, with multithreading and multi-core processors, it is no longer necessary. Apache Solr, however, supports sharding.
The large set of data can be broken into smaller sets, where each set is identified by a property, say record.type, which distinguishes the different sets. So, basically, we are normalizing the records in the Endeca index.
Now, while querying Endeca, we can use record relationship navigation (RRN) queries, applying a relationship filter over record-to-record relationships to bring back records of different types.
However, you might have to obtain an RRN license to enable the RRN feature in the MDEX engine.

Does Datomic retrieve all data to the local system (Peer) before running a query?

"Datomic queries run in application process space" : does that mean that all the data the query has to run on has to be local, too? Let's say I am running a join on two tables, each of which are 1 GB in size, does Datomic first retrieve 2 GB of data to the Peer on which the query is going to run?
Excuse me if this question is already answered in the documentation and I should RTFM.
In my understanding, only the live index is needed for the query to run. With the help of the index, only the relevant data needs to be fetched from the storage service, and only if it is not yet available in the local cache.
The data does not reside on the peers, only the indexes. When you run a query, the peer traverses the most suitable index to find the nodes that need to be retrieved from the storage service. Thus the actual request from the peer to the storage service asks only for the IDs that were reached while traversing the index. The index sizes can be quite large depending on how much data you have stored, but the peer will only retrieve from the storage service the data it needs.
Datomic does not have the notion of table joins, so I'm interested to know exactly what you mean here; different partitions or databases?
The short answer is: No.
Datomic maintains several indexes, each sorted by different criteria. Each of these indexes is split into segments, with every segment containing thousands of data items (datoms). That's right, the data itself is contained in the index.
When doing a query, Datomic will retrieve only those index segments that it needs to perform the query. As indexes are sorted, Datomic can figure out which segments it needs. Since it retrieves index data in segment units, a segment will always contain some data that you are not interested in - but this is a pretty good tradeoff to keep management and communication overhead down, and it boosts performance in practice.
In all typical queries, no full database scan is necessary. In cases where it is necessary, the peer will indeed have to pull in all data to the local system. However, this does not mean that all data will reside in memory at the same time - unless your query result contains all the data - because, if memory is scarce, Datomic will garbage-collect segments once they have been processed and are no longer needed.
That said, the order of where clauses in queries is important for performance, although I can't say if the order affects the number of index segments retrieved.
More on indexes can be found on the Datomic indexes page and in Nikita Prokopov's Unofficial guide to Datomic internals.

max number of couchbase views per bucket

How many views per bucket are too many, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies to only one document type? Or, asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into Couchbase buckets, and what the performance implications of the views required on the data are. My data consists of more than a dozen relational DBs, at least half of which have hundreds of millions of rows in a number of tables.
The "Using document types" section of http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html seems to imply that having multiple document types in the same bucket is not ideal, because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on; allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDBMS world. There will be multiple tables/schemas within that "database", and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that shares some common configuration parameters (RAM quota, replica count, etc.); you should only need to split your data into multiple buckets when certain datasets must be controlled separately. Other reasons are very different workloads against different datasets, or the desire to track the workload of those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
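To illustrate that point, here is a minimal map function sketch (the type and username fields are hypothetical; in Couchbase 2.0 views the map function is JavaScript) that bails out immediately for documents the view does not cover:

```
function (doc, meta) {
  // Documents of other types fall through here almost for free.
  if (doc.type !== "user") {
    return;
  }
  emit(doc.username, null);
}
```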
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had a fairly large number of views (around 8 design documents, and some design documents with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries, can make a single index that may appear crowded at first, but is actually very flexible.
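As a sketch of that key-prefixing idea (the order document type and its fields are invented for illustration), a single view can emit composite keys whose first element says which lookup the row serves, and each "query" then selects its own slice with a startkey/endkey range:

```
function (doc, meta) {
  if (doc.type !== "order") {
    return;
  }
  // One index, several access patterns.
  emit(["by_customer", doc.customer_id], null);
  emit(["by_status", doc.status, doc.created_at], null);
}
```

Querying with startkey=["by_status","shipped"] and endkey=["by_status","shipped",{}] would then return only the rows emitted for that purpose.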
The fewer design documents and views you have, the less disk space, IO, and CPU you will need. There's never going to be a magic bullet or hard-and-fast guideline number, unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide "universal" rules that give you an exact number.
But based on the best-practices document that you have used and some discussion (here), you should be able to design your database/views properly.
Let's start with the last question:
YES, the reason why Couchbase advises having a small number of buckets is performance - and, more importantly, resource consumption. I invite you to read these blog posts, which help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done per bucket.
So let's now look at the original question:
Yes, most of the time you will organize the design documents and views by document type.
It is NOT a problem to have all the document "types" in a single bucket (or a few buckets); this is in fact the way you work with Couchbase.
The most important things to look at are the size of your documents (to see how long parsing the JSON will take) and how often documents are created/updated, and also deleted, since the JS code of the views is ONLY executed when you create or change a document.
So what you should do:
- use a single bucket
- decide how many design documents you need (how many document types do you have?)
- decide how many views you will have in each design document
In fact, the most expensive part is not the indexing or the querying; it is when you have to rebalance the data and indexes between nodes (adding, removing, or failure of nodes).
Finally (though it looks like you already know it), this chapter is quite good for understanding how views work (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

Is it feasible to use a distributed cache for queryable data sets?

My scenario is as follows. I have a data table with a million rows of tuples (say first name and last name), and a client that needs to retrieve a small subset of rows whose first name or last name begins with the query string. Caching this seems like a catch-22, because:
On the one hand, I can't store and retrieve the entire data set on every request (would overwhelm the network)
On the other hand, I can't just store each row individually, because then I'd have no way to run a query.
Storing ranges of values in the cache, with a local "index" or directory, would work... except that you'd have to essentially duplicate the data for each index, which defeats the purpose of even using a distributed cache.
What approach is advisable for this kind of thing? Is it possible to get the benefits of using a distributed cache, or is it simply not feasible for this kind of scenario?
Distributed caching is feasible for queryable data sets.
But for this scenario, a native database function or stored procedure would give much faster results. If scopes such as session or application are not possible, a lot of iteration would be required on the server side to fetch the data for each request.
Indexing on the server side rather than in the database is never a good idea.
If there are still network issues, you could go for a document-oriented or column-oriented NoSQL DB, if feasible.

Follow-up question on [Segmenting Redis By Database]

This is a follow-up question to Segmenting Redis By Database.
I originally asked about the time complexity of the Redis KEYS operation in different databases within one Redis instance. The reason I was asking is that I am attempting to implement a cache where there are x multi-segment keys, each of which may have y actual data instances, resulting in x*y total keys.
However, I would like to support wild-card search of the primary keys, and it seems that the only wild-card query over key names implemented in Redis is the KEYS command, the use of which is discouraged. It seemed to me to be a decent compromise to put the x keys in a separate database, where the lower number of keys would make the KEYS operation perform satisfactorily.
Can anyone suggest a better alternative ?
Thanks.
I still think using KEYS is really not scalable with Redis, whatever clever scheme you put in place to work around the linear complexity.
Partitioning is one such scheme; it is commonly used in traditional RDBMSs to reduce the cost of table scans on flat tables. Your idea is actually an adaptation of this concept to Redis.
But there is an important difference compared to traditional RDBMS providing this facility (Oracle, MySQL, ...): Redis is a single-threaded event loop. So a scan cannot be done concurrently with any other activity (like serving other client connections for instance). When Redis scans data, it is blocked for all connections.
You would have to set up a huge number of partitions (i.e. of databases) to get good performance - something like 1/1000 or 1/10000 of the global number of keys. And this is why it is not scalable: Redis is not designed to handle such a number of databases. You will likely have issues with internal mechanisms iterating over all the databases. Here is a list extracted from the source code:
automatic rehashing
item expiration management
database status logging (every 5 secs)
INFO command
maxmemory management
You would likely have to limit the number of databases, which also limits the scalability. If you set up 1000 databases, it will work fine for, say, 1M items, be slower for 10M items, and be unusable with 100M items.
If you still want to stick to linear scans to implement this facility, you will be better served by other stores supporting concurrent scans (like MySQL, MongoDB, etc ...). With the other stores, the critical point will be to implement item expiration in an efficient way.
If you really have to use Redis, you can easily segment the data without relying on multiple databases. For instance, you could use the method I have described here. With this strategy, the list of keys is retrieved in an incremental way, and the search is actually done on client-side. The main benefit is you can have a large number of partitions, so that Redis would not block.
Now, AFAIK no storage engine provides the capability to efficiently search data with an arbitrary regular expression (i.e. avoiding a linear scan). However, this feature is provided by some search engines, typically using n-gram indexing.
Here is a good article about it from Russ Cox: http://swtch.com/~rsc/regexp/regexp4.html
This indexing mechanism could probably be adapted to Redis (you would use Redis to store a trigram index of your keys), but it represents a lot of code to write.
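As a rough sketch of what such an adaptation could look like (using the ioredis client; the trigram:* set names and the indexKey/searchKeys helpers are invented, and item expiration is ignored):

```
import Redis from "ioredis";

const redis = new Redis();

// Split a key into overlapping trigrams: "U:SMITH" -> ["U:S", ":SM", "SMI", "MIT", "ITH"].
function trigrams(s: string): string[] {
  const grams: string[] = [];
  for (let i = 0; i + 3 <= s.length; i++) {
    grams.push(s.slice(i, i + 3));
  }
  return grams;
}

// Indexing: register the data key in one set per trigram it contains.
async function indexKey(key: string): Promise<void> {
  for (const gram of trigrams(key)) {
    await redis.sadd(`trigram:${gram}`, key);
  }
}

// Searching: intersect the sets of the trigrams that any match must contain
// (derived from a literal fragment of the pattern), then confirm each
// candidate against the real regular expression on the client side.
async function searchKeys(literal: string, pattern: RegExp): Promise<string[]> {
  const sets = trigrams(literal).map((gram) => `trigram:${gram}`);
  const candidates = await redis.sinter(...sets);
  return candidates.filter((key) => pattern.test(key));
}
```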
You could also imagine restricting the regular expressions to prefix searches. For instance, U:SMITH:(.*) is actually a search with the prefix U:SMITH:
In that case, you can use a zset to index your keys, and perform the linear search on client side once the range of keys you are interested in has been retrieved. The score of the items in the zset is calculated from the keys on client-side, so that the score order corresponds to the lexicographic order of the keys.
With such a zset, it is possible to retrieve the range of keys you have to scan, chunk by chunk, with a combination of zscore and zrange commands. The consequences are that the number of keys to scan is limited (by the prefix), the search occurs on the client side, and the approach is friendly to Redis's concurrency model. The drawbacks are the complexity (especially to handle item expiration) and the network bandwidth consumption.
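Here is a minimal sketch of that zset approach with the ioredis client (the key-index zset name and the helper functions are invented; only the first 6 characters of each key contribute to the score, which is small enough for a double to preserve their lexicographic order, and the final match against the full prefix happens on the client side):

```
import Redis from "ioredis";

const redis = new Redis();

// Map the first 6 characters of a key (assumed single-byte) to an integer
// whose numeric order matches the lexicographic order of the keys.
function keyScore(key: string): number {
  let score = 0;
  for (let i = 0; i < 6; i++) {
    score = score * 256 + (i < key.length ? key.charCodeAt(i) : 0);
  }
  return score;
}

// Lowest and highest scores any key starting with `prefix` can have.
function prefixRange(prefix: string): [number, number] {
  const head = prefix.slice(0, 6);
  let lo = 0;
  let hi = 0;
  for (let i = 0; i < 6; i++) {
    lo = lo * 256 + (i < head.length ? head.charCodeAt(i) : 0x00);
    hi = hi * 256 + (i < head.length ? head.charCodeAt(i) : 0xff);
  }
  return [lo, hi];
}

async function indexKey(key: string): Promise<void> {
  await redis.zadd("key-index", keyScore(key), key);
}

async function prefixSearch(prefix: string): Promise<string[]> {
  const [lo, hi] = prefixRange(prefix);
  // Candidates share their first 6 characters with the prefix; the exact
  // prefix match is done client-side.
  const candidates = await redis.zrangebyscore("key-index", lo, hi);
  return candidates.filter((key) => key.startsWith(prefix));
}
```

(More recent Redis versions also provide ZRANGEBYLEX, which can express such prefix ranges directly when all members of the zset share the same score.)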
