I've created a table on Vertica, and I want to create an index on that table. I can't see how to create an index on Vertica, though. Is it possible? If so, how can I do that?
Vertica's speed hinges on using columnar projections, not indexes. Please see:
https://my.vertica.com/docs/6.1.x/HTML/index.htm#12037.htm
So, in fact, Vertica doesn't have the ability to create an index. You will have to use a projection to achieve good performance.
kimbo's answer is correct.
I try to explain it to people a few ways. But basically, the table itself is a construct, much like a view. Unlike in traditional databases, the table itself isn't stored on disk and then indexed in different ways. Projections handle the sorting, indexing, layout on disk, etc.
I also use the analogy of a deck of cards. A table can be considered a deck of cards, and you ask for particular hands. Projections are like particular shuffles: some may be sorted by suit, some by face value. What you ask for determines which projection (in this analogy, which shuffle) you query.
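To make the projection advice concrete, here is a rough sketch of creating one particular "shuffle". The table and column names are made up, and it assumes the vertica-python DB-API driver; the important part is the CREATE PROJECTION statement, whose ORDER BY plays the role an index would play elsewhere.

import vertica_python  # assumes the vertica-python DB-API driver is installed

conn_info = {
    "host": "localhost", "port": 5433,
    "user": "dbadmin", "password": "", "database": "mydb",  # placeholder credentials
}

# Hypothetical projection: pre-sort a sales table by date so range scans on
# sale_date are fast without any separate index structure.
ddl = """
CREATE PROJECTION sales_by_date (sale_id, sale_date, amount)
AS SELECT sale_id, sale_date, amount
   FROM sales
   ORDER BY sale_date            -- the sort order does the work of an index
   SEGMENTED BY HASH(sale_id) ALL NODES;
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(ddl)
    cur.execute("SELECT START_REFRESH();")  # populate the projection for existing rows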
Given the following view for the gamesim-sample example:
function (doc, meta) {
  if (doc.jsonType == "player" && doc.experience) {
    emit([doc.experience, meta.id], doc.id);
  }
}
I would like to query the leaderboard for users who belong to a specific group only (the grouping data is maintained in an external system).
For example, if the view has users "orange", "purple", "green", "blue" and "red", I would like the leaderboard to give me the rankings of only "orange" and "purple", without having to query their respective current experience points first:
...view/leaderboard?keys=[[null,"orange"],[null,"purple"]]
The following works fine, but it requires additional queries beforehand to find the experience points of "orange" and "purple", and that does not scale for obvious reasons:
...view/leaderboard?keys=[[1,"orange"],[5,"purple"]]
Thanks in advance!
Some NoSQL vs. SQL Background
First, you have to remember that specifically with Couchbase, the advantage is the super-fast storage and retrieval of records. Indices were added later as a way to make storage a little more useful and less error-prone (think of them more as an automated inventory), and their design really constrains you to move away from SQL-style thinking. Your query above is a perfect example:
select *
from leaderboard
where id in ('orange','purple')
order by experience
This is a retrieval, computation, and filter all in one shot. This is exactly what NoSQL databases are optimized not to do (and, conversely, what SQL databases are optimized for, which often makes them hopelessly complex, but that is another topic).
So this leads to the primary difference between SQL and NoSQL databases: NoSQL is optimized for storage, while SQL is optimized for querying. Together, this forces you to adjust how you think about the role of the database, which in my opinion should be more the former than the latter.
The creators of Couchbase originally focused purely on the storage aspect of the database. However, storage makes a lot more sense when you know what it is you have stored, and indices were added later as a feature (originally you had to keep track of your own stuff - it was not much fun!). They also added map-reduce in a way that takes advantage of CB's ability to store and retrieve massive quantities of records simultaneously. Neither of these features was really intended to solve complex query problems (even though this query is simple, it is a perfect example because of this). That is the function of your application logic.
Addressing Your Specific Issue
So, now on to your question. The query itself appears to be a simple one, and indeed it is. However,
select * from leaderboard
is not actually simple. It is instead a 2-layer deep query, as your definition of leaderboard implies a sorted list from largest to smallest player experience. Therefore, this query, expanded out, becomes:
select * from players order by experience desc
Couchbase supports the above natively in the index mechanism (remember, it inventories your objects), and you have accurately described in your question how to leverage views to achieve this output. What Couchbase does not support is the third-level query, which represents your where clause. Typically, a where in Couchbase is executed in either the view "map" definition or the index selection parameters. You can't do it in "map" because you don't always want the same selection, and you can't do it in the index selection parameter because the index is sorted on experience level first.
Method 1
Let's assume that you are displaying this to a user on a web page. You can easily implement this filter client-side (or in your web service) by pulling the data as-is and throwing out values that you don't want. Use the limit and skip parameters to ask for more as the user scrolls down (or clicks more pages, or whatever).
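As a rough sketch of Method 1 (the host, port, bucket and design document names are placeholders; the view is the one from the question, which emits [experience, meta.id]):

import requests  # assumes the 'requests' package

VIEW_URL = "http://localhost:8092/gamesim-sample/_design/players/_view/leaderboard"  # placeholder
WANTED = {"orange", "purple"}

def filtered_leaderboard(wanted, page_size=100, max_rows=1000):
    """Pull the view highest-experience-first and drop unwanted users client-side."""
    results, skip = [], 0
    while len(results) < max_rows:
        resp = requests.get(VIEW_URL, params={
            "descending": "true",   # highest experience first
            "limit": page_size,
            "skip": skip,
        })
        rows = resp.json().get("rows", [])
        if not rows:
            break
        # Each row's key is [experience, meta.id]; keep only the wanted users.
        results.extend(r for r in rows if r["key"][1] in wanted)
        skip += page_size
    return results[:max_rows]

print(filtered_leaderboard(WANTED))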
Method 2
Reverse the order of your index, and sort by "group" (aka color) first, then experience level. Run separate queries to select the top 'N' users of each color, then merge and sort on the client side. This will take longer to load up-front but will give you a larger in-memory data set to work with if you need it for that reason. This method may not work well if you have a very uneven distribution of categories, in which case 'N' would need to be tailored to match the statistical distribution(s) within the categories.
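And a sketch of Method 2 under the same assumptions, where a hypothetical second view emits [meta.id, experience] so each group can be range-scanned with startkey/endkey, and the partial results are merged client-side:

import json
import requests

BY_USER_VIEW_URL = "http://localhost:8092/gamesim-sample/_design/players/_view/by_user"  # placeholder

def top_n_per_group(groups, n=50):
    """Query the top n entries for each group, then merge and rank by experience."""
    merged = []
    for g in groups:
        resp = requests.get(BY_USER_VIEW_URL, params={
            # With descending=true the startkey is the "high" end of the range;
            # [g, {}] collates after every [g, <number>] key.
            "descending": "true",
            "startkey": json.dumps([g, {}]),
            "endkey": json.dumps([g]),
            "limit": n,
        })
        merged.extend(resp.json().get("rows", []))
    # key is [user id, experience]; rank across groups, highest experience first.
    merged.sort(key=lambda r: r["key"][1], reverse=True)
    return merged

leaders = top_n_per_group(["orange", "purple"])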
Bottom Line
One parting thought is that NoSQL databases were designed to deal with highly dynamic data sets. This requires some statistical thinking, because there is no longer a single "right" answer. Some degree of inconsistency and error is to be expected (as there always is in the real world). You can't expect a NoSQL database to return a perfect query result, because there is no perfection. You have to settle for "good enough", which is often much better than what is needed anyway.
I did a bit of R&D on fact tables, looking at whether they are normalized or de-normalized.
I came across some findings which left me confused.
According to Kimball:
Dimensional models combine normalized and denormalized table structures. The dimension tables of descriptive information are highly denormalized with detailed and hierarchical roll-up attributes in the same table. Meanwhile, the fact tables with performance metrics are typically normalized. While we advise against a fully normalized approach with snowflaked dimension attributes in separate tables (creating blizzard-like conditions for the business user), a single denormalized big wide table containing both metrics and descriptions in the same table is also ill-advised.
The other finding, which I also think is OK, is by fazalhp at GeekInterview:
The main idea of a DW is de-normalizing the data for faster access by the reporting tool... so if you're building a DW, 90% of it has to be de-normalized and of course the fact table has to be de-normalized...
So my question is: are fact tables normalized or de-normalized? Whichever it is, how and why?
From the point of view of relational database design theory, dimension tables are usually in 2NF and fact tables anywhere between 2NF and 6NF.
However, dimensional modelling is a methodology unto itself, tailored to:
one use case, namely reporting
mostly one basic type (pattern) of query
one main user category -- business analyst, or similar
row-store RDBMSs like Oracle, SQL Server, Postgres ...
one independently controlled load/update process (ETL); all other clients are read-only
There are other DW design methodologies out there, like
Inmon's -- data structure driven
Data Vault -- data structure driven
Anchor modelling -- schema evolution driven
The main thing is not to mix up database design theory with a specific design methodology. You may look at a certain methodology from a database design theory perspective, but you have to study each methodology separately.
Most people working with a data warehouse are familiar with transactional RDBMSs and apply various levels of normalization, so those concepts end up being used to describe working with a star schema. What dimensional modelling is really doing is trying to get you to unlearn all those normalization habits. This can get confusing because there is a tendency to focus on what "not" to do.
The fact table(s) will probably be the most normalized, since they usually contain just numerical values along with various ids for linking to dimensions. The key with fact tables is how granular you need to get with your data. An example for purchases could be specific line items by product in an order, or data aggregated at a daily, weekly, or monthly level.
My suggestion is to keep searching and studying how to design a warehouse based on your needs. Don't aim for high levels of normalization. Think more about the reports you want to generate and the analysis capabilities you want to give your users.
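To make the fact/dimension split above concrete, here is a small illustrative sketch. The table and column names are invented, and SQLite is used only so the snippet is self-contained: the fact table is narrow and normalized (keys plus measures at a chosen grain), while the dimension table is deliberately wide and denormalized.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension: brand/category/department roll-ups in one table.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand        TEXT,
    category     TEXT,
    department   TEXT
);

-- Normalized fact table: just keys and measures, one row per order line item.
CREATE TABLE fact_sales (
    date_key    INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key),
    order_id    INTEGER,
    quantity    INTEGER,
    amount      REAL
);
""")

# A typical report: join the thin fact table to the wide dimension.
report = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY revenue DESC
""").fetchall()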
I have two MySQL tables, one containing a set of 6,000 users and the other a set of 10,000 ratings they have provided for products. I'd like to build a matrix of feature vectors with one row per user, containing a 1 or 0 depending on whether that user has rated a particular product (or even the rating value itself). What is the best way to accomplish this, given also that the matrix will be sparse?
I'm curious as to what implementations I can test out with tools at my disposal (like MySQL or MATLAB) - the end purpose is to perform clustering of similar users. Somehow I think a 10,000 column MySQL table won't make my db admin happy... at all.
The obvious way of storing a sparse matrix in SQL is to use three columns, where user and product together are the primary key, and the extra column is the rating.
It does not make sense to do the actual processing inside the SQL database. That is just a huge overhead and makes things slow. Just get the data out into a primitive and fast data structure, do the analysis, then translate the output into whatever format you need.
SQL is good when you only need part of the data, or have to perform updates, need locking, and so on. But I'd never run the computation directly on the database, because unless you can load your low-level linear algebra libraries into your database, it will be slow.
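A rough sketch of that workflow, assuming the three-column ratings table described above (table, column and connection details are placeholders) and the scipy/scikit-learn stack:

import MySQLdb  # any DB-API driver would do; this particular import is an assumption
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="shop")  # placeholders
cur = conn.cursor()
cur.execute("SELECT user_id, product_id, rating FROM ratings")
triples = cur.fetchall()

# Map arbitrary ids onto contiguous row/column indices.
users = {u: i for i, u in enumerate(sorted({t[0] for t in triples}))}
products = {p: j for j, p in enumerate(sorted({t[1] for t in triples}))}

rows = np.array([users[u] for u, _, _ in triples])
cols = np.array([products[p] for _, p, _ in triples])
vals = np.array([r for _, _, r in triples], dtype=float)  # or np.ones(...) for a 0/1 matrix

# Sparse user x product matrix; missing ratings cost no memory at all.
X = csr_matrix((vals, (rows, cols)), shape=(len(users), len(products)))

# MiniBatchKMeans accepts sparse input directly, so no densification is needed.
labels = MiniBatchKMeans(n_clusters=10, random_state=0).fit_predict(X)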
I have a MongoDB collection containing attributes such as:
longitude, latitude, start_date, end_date, price
I have over 500 million documents.
My question is how to search by lat/long, date range and price as efficiently as possible?
As I see it my options are:
Create a geospatial index on lat/long and use MongoDB's proximity search... and then filter the results based on date range and price.
I have yet to test this, but I am worried that the amount of data would be too much to search quickly, given that we have around 1 search per second.
Have you had experience with how MongoDB reacts under these circumstances?
Split the data into multiple collections by location. i.e. by cities like london_collection, paris_collection, new_york_collection.
I would then have to query by lat/long first to find the nearest city collection, and then do a MongoDB spatial search on the subset of data in that collection with date and price filters.
I would have uneven distribution of documents as some cities would have more documents than others.
Create collections by dates instead of location. Same as above, but each document is allocated to a collection based on its date range.
The problem here is searches whose date range straddles multiple collections.
Create unique ids based on city_start_date_end_date for each document.
Again I would have to use my lat/long query to find the nearest city, then append the date range to build the key. This seems to be pretty fast, but I don't really like the city look-up aspect... it seems a bit ugly.
I am in the process of experimenting with option 1, but I would really like to hear your ideas before I go too far down one particular path.
How do search engines split up and manage their data? This must be a similar kind of problem.
Also, I do not have to use MongoDB; I'm open to other options.
Many thanks.
Indexing and data access performance is a deep and complex subject. A lot of factors can affect the most efficient solution, including the size of your data sets, the read-to-write ratio, the relative performance of your IO and backing store, etc.
While I can't give you a concrete answer, I can suggest investigating Morton numbers as an efficient way of pulling together multiple similar numeric values like lat/longs.
Morton number
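For illustration, here is a tiny sketch of how a Morton (Z-order) code interleaves two quantized coordinates into one number, so points that are close in 2-D tend to get numerically close codes:

def quantize(value, lo, hi, bits=16):
    """Map a float in [lo, hi] onto an integer grid of 2**bits cells."""
    return int((value - lo) * ((2 ** bits - 1) / (hi - lo)))

def morton_code(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Example: encode a lat/long pair as one indexable number.
lat, lng = 51.5074, -0.1278
key = morton_code(quantize(lng, -180, 180), quantize(lat, -90, 90))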
Why do you think option 1 would be too slow? Is this the result of a real world test or is this merely an assumption that it might eventually not work out?
MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.
Don't reinvent the wheel, but try to optimize the most obvious path (1) first (a short pymongo sketch of these steps follows below):
Set up geo indexes
Use explain to make sure your queries actually use the index
Make sure your indexes fit into RAM
Profile the database using the built-in profiler
Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
If possible, try not to use geoNear, and stick to the faster (but not perfectly spherical) near queries
If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.
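Here is a rough pymongo sketch of that path. The database, collection and field names are assumptions based on the question, and it presumes the coordinates are stored as a GeoJSON point in a field called loc:

from datetime import datetime
from pymongo import MongoClient, GEOSPHERE, ASCENDING

coll = MongoClient()["mydb"]["listings"]  # placeholder database/collection names

# Geospatial index on the location, with date and price as additional keys.
coll.create_index([("loc", GEOSPHERE), ("start_date", ASCENDING), ("price", ASCENDING)])

day = datetime(2014, 6, 1)
query = {
    "loc": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [-0.1278, 51.5074]},  # [lng, lat]
        "$maxDistance": 5000,  # metres
    }},
    "start_date": {"$lte": day},
    "end_date": {"$gte": day},
    "price": {"$lte": 200},
}

cursor = coll.find(query).limit(50)
print(cursor.explain())  # confirm the geo index is actually used before tuning further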
I'm not really trying to compress a database. This is more of a logical problem. Is there any algorithm that will take a data table with lots of columns and repeated data and find a way to organize it into many tables with IDs, in such a way that in total there are as few cells as possible, and that these tables can then be joined with a query to replicate the original one?
I don't care about any particular database engine or language. I just want to see if there is a logical way of doing it. If you post code, I like C# and SQL, but you can use any language.
I don't know of any automated algorithms, but what you really need to do is heavily normalize your database. This means looking at your actual functional dependencies and breaking tables apart wherever it makes sense.
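As a tiny hand-done illustration (hypothetical table and column names; SQLite is used only to keep the snippet self-contained), the functional dependency customer_email -> (customer_name, customer_city) is what justifies splitting the customer columns into their own table behind an id:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Before: one wide table, customer columns repeated on every order row.
CREATE TABLE orders_flat (
    order_id       INTEGER PRIMARY KEY,
    customer_email TEXT,
    customer_name  TEXT,
    customer_city  TEXT,
    amount         REAL
);

-- After: the dependent columns are broken off into their own table with an id.
CREATE TABLE customers (
    customer_id    INTEGER PRIMARY KEY,
    customer_email TEXT UNIQUE,
    customer_name  TEXT,
    customer_city  TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL
);
""")

# Joining the two smaller tables reproduces the original wide rows.
rows = conn.execute("""
    SELECT o.order_id, c.customer_email, c.customer_name, c.customer_city, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()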
The problem with trying to do this in a computer program is that it isn't always clear whether your current set of stored data represents all possible cases. You can't just go by the number of distinct values, either: it makes little sense to break booleans off into their own table because they have only two values, for example, and this is only the tip of the iceberg.
I think that, at this point, nothing is going to beat good old patient, hand-crafted normalization. Any computer algorithm will either make a total mess of things or make you define the relationships in such detail that you might as well do it all yourself.