Querying a view using multiple keys

Given the following view for the gamesim-sample example:
function (doc, meta) {
  if (doc.jsonType == "player" && doc.experience) {
    emit([doc.experience, meta.id], doc.id);
  }
}
I would like to query the leaderboard for only those users who belong to a specific group (the grouping data is maintained in an external system).
For example, if the view has users "orange", "purple", "green", "blue", and "red", I would like the leaderboard to give me the rankings of only "orange" and "purple" without having to query their respective current experience points first.
...view/leaderboard?keys=[[null,"orange"],[null,"purple"]]
The following works fine, but it requires additional queries to find the experience points of "orange" and "purple" beforehand. However, this does not scale for obvious reasons.
...view/leaderboard?keys=[[1,"orange"],[5,"purple"]]
Thanks in advance!

Some NoSQL vs. SQL Background
First, you have to remember that specifically with Couchbase, the advantage is the super-fast storage and retrieval of records. Indices were added later as a way to make storage a little more useful and less error-prone (think of them more as an automated inventory), and their design really constrains you to move away from SQL-style thinking. Your query above is a perfect example:
select *
from leaderboard
where id in ('orange','purple')
order by experience
This is a retrieval, computation, and filter all in one shot. This is exactly what NoSQL databases are optimized not to do (and conversely, what SQL databases are optimized to do, which often makes them hopelessly complex, but that is another topic).
So this leads to the primary difference between SQL and NoSQL databases: NoSQL is optimized for storage while SQL is optimized for querying. Taken together, this causes one to adjust how one thinks about the role of the database, which in my opinion should be more the former than the latter.
The creators of Couchbase originally focused purely on the storage aspect of the database. However, storage makes a lot more sense when you know what it is you have stored, and indices were added later as a feature (originally you had to keep track of your own stuff - it was not much fun!). They also added map-reduce in a way that takes advantage of Couchbase's ability to store and retrieve massive quantities of records simultaneously. Neither of these features was really intended to solve complex query problems (even though this query is simple, it is a perfect example because of this). This is the function of your application logic.
Addressing Your Specific Issue
So, now on to your question. The query itself appears to be a simple one, and indeed it is. However,
select * from leaderboard
is not actually simple. It is instead a 2-layer deep query, as your definition of leaderboard implies a sorted list from largest to smallest player experience. Therefore, this query, expanded out, becomes:
select * from players order by experience desc
Couchbase supports the above natively in the index mechanism (remember, it inventories your objects), and you have accurately described in your question how to leverage views to achieve this output. What Couchbase does not support is the third-level query, which represents your where clause. Typically, a where in Couchbase is executed in either the view "map" definition or the index selection parameters. You can't do it in "map" because you don't always want the same selection, and you can't do it in the index selection parameter because the index is sorted on experience level first.
Method 1
Let's assume that you are displaying this to a user on a web page. You can easily implement this filter client-side (or in your web service) by pulling the data as-is and throwing out values that you don't want. Use the limit and skip parameters to ask for more as the user scrolls down (or clicks more pages, or whatever).
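A rough sketch of that client-side filter in JavaScript (the group list is a hypothetical stand-in for whatever your external grouping system returns; the rows argument is the page of view results your Couchbase client call gives back):

// Hypothetical group membership pulled from the external system.
const wantedPlayers = new Set(["orange", "purple"]);

// rows: one page of the leaderboard view queried with descending=true,
// limit and skip, e.g. [{ key: [experience, id], value: ... }, ...].
function filterLeaderboardRows(rows) {
  // The view key is [experience, meta.id], so key[1] is the player id.
  return rows.filter((row) => wantedPlayers.has(row.key[1]));
}

// Example with rows shaped the way the view returns them (highest experience first):
const page = filterLeaderboardRows([
  { key: [9, "green"], value: null },
  { key: [5, "purple"], value: null },
  { key: [1, "orange"], value: null },
]);
console.log(page.map((row) => row.key[1])); // [ 'purple', 'orange' ]

Because the filter runs after the fetch, a page can come back short; keep increasing skip and fetching until you have enough rows to display.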
Method 2
Reverse the order of your index, and sort by "group" (aka color) first, then experience level. Run separate queries to select the top 'N' users of each color, then merge and sort on the client side. This will take longer to load up-front but will give you a larger in-memory data set to work with if you need it for that reason. This method may not work well if you have a very uneven distribution of categories, in which case 'N' would need to be tailored to match the statistical distribution(s) within the categories.
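A sketch of that reordered view, keeping the question's document shape (here each player id doubles as its own "group", so one group's rows are contiguous and can be selected with a startkey/endkey range, then merged and re-sorted by experience on the client):

function (doc, meta) {
  if (doc.jsonType == "player" && doc.experience) {
    // Group (player id) first, experience second, so e.g.
    // ?startkey=["orange"]&endkey=["orange",{}] selects one group's rows.
    emit([meta.id, doc.experience], null);
  }
}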
Bottom Line
One parting thought is that NoSQL databases were designed to deal with highly dynamic data sets. This requires some statistical thinking, because there is no longer a single "right" answer. Some degree of inconsistency and error is to be expected (as there always is in the real world). You can't expect a NoSQL database to return a perfect query result, because there is no perfection. You have to settle for "good enough", which is often much better than what is needed anyway.

Related

Business Intelligence Datasource Performance - Large Table

I use Tableau and have a table with 140 fields. Due to the size/width of the table, performance is poor. I would like to remove fields to increase reading speed, but my user base is so large that at least one person uses each of the fields, while 90% use the same ~20 fields.
What is the best solution to this issue? (Tableau is our BI tool, BigQuery is our database)
What I have done thus far:
In Tableau, it isn't clear how to use dynamic data sources that change based on the fields selected. Ideally, I would like to have smaller views OR denormalized tables. As the user makes their selections in Tableau, the underlying data source would update to the table or view containing those fields.
I have tried a simple version of a large view, but it performed worse than my large table and read significantly more data (remember, I am on BigQuery, so I care very much about bytes read due to costs).
Suggestion 1: Extract your data.
Especially when it comes to data sources which are pay-per-query-byte (BigQuery, Athena, etc.), extracts make a great deal of sense. How feasible this is depends on how 'fresh' the data must be for the users. (Of course all users will say 'live is the only way to go', but dig into this a little and see what it might actually be.) Refreshes can be scheduled for as little as every 15 minutes. The real power of refreshes comes in the form of 'incremental refreshes', whereby only new records are added (along an index of int or date). This is a great way to reduce costs - if your BigQuery database is partitioned (which it should be). Since Tableau extracts are contained within .hyper files, a structure of Tableau's own design/control, they are extremely fast and optimized for use in Tableau.
Suggestion 2: Create 3 Data Sources (or more). Certify these data sources after validating that they provide correct information. Provide users with clear descriptions.
1. Original large dataset.
2. Subset of ~20 fields for the 90%.
3. Remainder of fields for the 10%.
4. Extract of 1.
5. Extract of 2.
6. Extract of 3.
Importantly, if field names match in each data source (i.e. never changed manually), then it should be easy for a user to 'scale up' to larger datasets as needed. This means that they could generally always start out with a small subset of data to begin their exploration, and then use the 'replace datasource' feature to switch to a different data source while keeping their same views. (This wouldn't work as well, if at all, for scaling down, though.)

How does CouchDB 1.6 inherently take advantage of map reduce when it is a single-server database?

I am new to CouchDB. While going through the documentation of CouchDB 1.6, I came to know that it is a single-server DB, so I was wondering how map reduce inherently takes advantage of it.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS?
I came to know that CouchDB 2.0 is planning to bring a clustering feature, but could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop using the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer to https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B-tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least once your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need, not on how much work it takes to find it. So you might have unexpected performance problems that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say: I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor == ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregation purposes, like counting or averaging across multiple documents.)
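As a rough illustration, reusing the hypothetical employee documents from above, a count-per-supervisor view could pair that map function with a reduce like this (CouchDB's built-in _count reduce would do the same job):

// Map: one row per employee, keyed by supervisor, with 1 as the value.
function (doc) {
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier, 1);
}

// Reduce: sum the values. CouchDB also calls this on partial results
// (rereduce), and summation works the same way in both phases.
function (keys, values, rereduce) {
  return sum(values);
}

Querying the view with ?group=true then returns one total per supervisor.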
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of CouchDB views so much as being "MapReduce" (in the stylized sense), but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the "reduce" function is limited in the size of its output, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
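In plain JavaScript, with invented documents, the analogy looks something like this; a CouchDB view essentially persists the intermediate output of such a pipeline in a B-tree keyed by whatever the map step emits:

// Invented documents standing in for a CouchDB database.
const docs = [
  { isEmployeeRecord: true, supervisor: { identifier: "abc" } },
  { isEmployeeRecord: true, supervisor: { identifier: "abc" } },
  { isEmployeeRecord: true, supervisor: { identifier: "xyz" } },
  { type: "memo" }, // ignored by the map step
];

// "Map": one { key, value } row per matching document, like emit(key, value).
const rows = docs
  .filter((doc) => doc.isEmployeeRecord)
  .map((doc) => ({ key: doc.supervisor.identifier, value: 1 }));

// "Reduce": aggregate the emitted values, here a simple count.
const total = rows.reduce((acc, row) => acc + row.value, 0);

console.log(rows.length, total); // 3 3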
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer as to what the actual advantage and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So if you have designed an app to work in CouchDB 1.x, it may scale in the newer version without further intervention on your part.

How to use Redis for sorting and filtering at the same time?

Imagine: someone has a huge website selling, let's say, T-shirts.
We want to show paginated, sorted listings of offers, with options to filter by parameters, let's say T-shirt colour.
Offers should be sortable by any of 5 properties (creation date, price, etc.).
Important requirement 1: we have to give a user the ability to browse all 15 million offers, not just the "top-N".
Important requirement 2: they must be able to jump to a random page at any time, not just flick through pages sequentially.
We use some sort of traditional data storage (MongoDB, to be precise).
The problem is that MongoDB (as well as other traditional databases) performs poorly when it comes to big offsets. Imagine a user wanting to fetch a page of results somewhere in the middle of this huge list, sorted by creation date, with some additional filters (for instance, by colour).
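For instance (collection and field names are illustrative), a deep page in the MongoDB shell ends up looking like the query below, and the server still has to walk past and discard every one of the skipped entries before returning the page:

// Hypothetical offers collection: even with an index on { colour, createdAt },
// skip(500000) forces MongoDB to scan and discard half a million entries.
db.offers
  .find({ colour: "red" })
  .sort({ createdAt: -1 })
  .skip(500000)
  .limit(50);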
There is an article describing this kind of problem:
http://openmymind.net/Paging-And-Ranking-With-Large-Offsets-MongoDB-vs-Redis-vs-Postgresql/
Okay, so we are told that Redis is a solution to this kind of problem: you "just" need to prepare certain data structures and search them instead of your primary storage.
The question is:
What kind of structures and approaches would you suggest using in order to solve this with Redis?
Sorted Sets, paging through with ZRANGE.
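A minimal sketch of that approach, assuming ioredis and invented key names: keep one sorted set per sortable property, one plain set per filter value, intersect them into a temporary sorted set, and page through it with ZRANGE (which makes jumping to an arbitrary page cheap):

const Redis = require("ioredis");
const redis = new Redis();

// Index an offer: one sorted set per sortable property, one set per filter value.
async function indexOffer(offer) {
  await redis.zadd("offers:by_price", offer.price, offer.id);
  await redis.zadd("offers:by_created", offer.createdAt, offer.id);
  await redis.sadd(`offers:colour:${offer.colour}`, offer.id);
}

// One page of red offers sorted by price: ZINTERSTORE the sort index with the
// filter set (WEIGHTS 1 0 keeps the price as the score), then ZRANGE a slice.
async function redOffersByPrice(page, pageSize) {
  const dest = "tmp:red_by_price";
  await redis.zinterstore(
    dest, 2, "offers:by_price", "offers:colour:red", "WEIGHTS", 1, 0
  );
  await redis.expire(dest, 60); // keep the intersection around briefly
  const start = page * pageSize;
  return redis.zrange(dest, start, start + pageSize - 1);
}

The intersection is the expensive step, so in practice you would cache it (hence the EXPIRE) and rebuild it only when the underlying sets change.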

How to design a database to store and retrieve large item/skill lists in Ruby

I am planning a role-playing game where characters are supposed to carry/use items and train skills. When it comes to storing the (possibly numerous) items/skills possessed by the characters, I can't think of a better way than putting a row for every possible item and skill for each character instantiated. However, this seems like overkill to me.
To be clear, if this were an exercise or a small game where the total number of items/skills is ~30, I would add an items and a skills hash to the character class, plus methods to add and remove them, like:
def initialize
  # Hashes with a default of 0 so += works for entries not seen before
  @inventory = Hash.new(0)
  @skills = Hash.new(0)
end

def add_item(item, number)
  @inventory[item] += number
end
Given that I would like to store the number of the items and the levels of the skills, what else can I try in order to handle ~1000 items (with ~150 of them in an inventory) and possibly 100 skills?
Plan for Data Retrieval
Generally, it's a good idea to design your database around how you plan to look up and retrieve your data, rather than how you want to store it. A bad design makes your data very expensive to collect from the database.
In your example, having a separate model for each inventory item or skill would be hugely expensive in terms of lookups whenever you want to load a character. Do you really want to do 1,000 lookups every time you load someone's inventory? Probably not.
Denormalize for Speed
You typically want to normalize data that needs to be consistent, and denormalize data that needs to be retrieved/updated quickly. One option might be to serialize your character attributes.
For example, it should be faster to store a serialized Character#inventory_items field than to update 100 separate records through a has_many :through or has_and_belongs_to_many relationship. There are certainly trade-offs involved with denormalization in general and serialization in particular, but it might be a good fit for your specific use case.
Consider a Document Database
Character sheets are documents. Unless you need the relational power of a SQL database, a document-oriented database might be a better fit for the data you want to manage. CouchDB seems particularly well-suited for this example, but you should certainly evaluate all your NoSQL options to see if any offer the features you need. Your mileage will definitely vary.
Always Benchmark
Don't take my word for what's optimal. Try a design. Benchmark it. See what the design does with your data. In the end, that's the only thing that matters.
I can't think of a better way than putting a row for every possible item and skill for each character instantiated.
Do characters evolve independently?
Assuming yes, there is no other choice but to have each and every relevant combination physically represented in the database.
If not, then you can "reuse" the same set of items/skills for multiple characters, but this is probably not what is going on here.
In any case, relational databases are very good at managing huge amounts of data and the numbers you mentioned don't even qualify as "huge". By correctly utilizing techniques such as clustering, you can ensure that a lookup of all items/skills for a given character is done in a minimal number of I/O operations, i.e. very fast.

Pitfalls in prototype database design (for performance viability testing)

Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle 10.2.0.3.0 running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.
There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
be aware of architectures, best practices and patterns
be aware of how the database works
spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact on different statements of adding indexes? What's the impact of indexing a 256-byte varchar? Will reporting queries even use your indexes? Etc.
Now that you've got the right approach and generally understand how 90% of your model will perform, you're often done forecasting performance with most small to medium size databases. If you've got a huge one, there's a ton at stake, you need more precision (you might need to order more hardware), or there are a few wacky spots in the design - then focus your tests on just those areas. Generate reasonable test data - and (importantly) the stats that you'd see in production. And look to see what the database will do with that data. Unless you've got real data and real prod-sized servers you'll still have to extrapolate, but you should at least be able to get reasonably close.
Running performance tests against various putative implementations of a conceptual model is not naive so much as heroically forward-thinking. Alas, I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual SELECT * FROM my_table WHERE blah, but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
edit
Having read the question which you linked to, I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer's perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a collection of Animals will perform much more slowly than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.
Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good sound database design practices in order to keep your design simple and sound. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct and relational (provided you're building a relational database) before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!
