I'm working in a distributed database environment (Greenplum), and there are times when I want to join to the same (very large) dimension on different keys, so I want parallel copies of the dimension with different distribution keys to enable local (co-located) joins.
My questions are:
Is this a terrible idea?
What should I call these alternative distributions?
How can I maintain data integrity, i.e. ensure that the data is replicated reliably and efficiently?
My answers so far are:
No.
I have no good ideas - subdim_xxx? dim_xxx_2, dim_xxx_3?
dim_xxx_by_distkey?
There should be one primary dimension table distributed by the dimension key. Each subsidiary distribution should then be maintained by looking in the primary dimension for new records and inserting them into the subsidiary copy. This dimension is never updated, so I don't need to worry about propagating updates just yet.
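A rough sketch of what I have in mind, with made-up table and column names:

-- Primary copy, distributed by the dimension key.
CREATE TABLE dim_customer (
    customer_key  bigint,
    account_key   bigint,
    customer_name text
) DISTRIBUTED BY (customer_key);

-- Subsidiary copy of the same dimension, distributed by the alternative join key.
CREATE TABLE dim_customer_by_account (LIKE dim_customer)
DISTRIBUTED BY (account_key);

-- Periodic refresh: copy across any rows the subsidiary does not have yet.
INSERT INTO dim_customer_by_account
SELECT p.*
FROM   dim_customer p
WHERE  NOT EXISTS (
    SELECT 1
    FROM   dim_customer_by_account s
    WHERE  s.customer_key = p.customer_key
);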
We have a multi-tenant system whose data we're starting to migrate to CockroachDB.
We will often need to join data belonging to a single tenant across multiple tables. Less often, we will also read data belonging to multiple (or all) tenants, but since that happens rarely it is less important from a performance perspective.
We thought table interleaving might have been part of a good solution, i.e. have a tenant "parent" table and interleave the other tables into it, to increase the chance that data belonging to the same tenant ends up in the same range. Whether that was a good idea or not no longer matters, as interleaving has been deprecated.
Creating a separate schema or database per tenant might be one solution. But as we already have one schema per microservice, that would lead to an explosion of schemas (we have thousands of tenants). So a design with a "discriminator" column telling which tenant a given row belongs to is preferable.
What would be a good design that is both nice to work with and performs well? For example, would it help to use composite keys with the tenant ID as the first part of the key? I guess that might at least ensure that data from the same tenant in a single table is stored next to each other, and is thus more likely to end up in the same range. That won't help when doing joins across tables, though (which is what I had expected interleaving to help with).
Creating a separate schema or database per tenant might be one solution. But as we already have one schema per microservice, that would lead to an explosion of schemas (we have thousands of tenants). So a design with a "discriminator" column telling which tenant a given row belongs to is preferable.
Either of these approaches will be appropriate in terms of performance. In fact, they'll result in very similar key encodings under the hood in CRDB's distributed key-value store.
That said, CockroachDB has a soft limit of around 10k databases, so with thousands of tenants already, a database per tenant may be inadvisable.
So including a tenant-id as the prefix of each table's primary key and secondary indexes is likely the best approach. It will ensure that data from the same tenant in a single table is located next to each other for the purposes of optimizing scans and multi-row read-write transactions.
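For illustration, a sketch of what that could look like (the table and column names here are invented, not from your system):

-- The tenant id leads the primary key and any secondary indexes.
CREATE TABLE invoices (
    tenant_id  UUID NOT NULL,
    invoice_id UUID NOT NULL DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    amount     DECIMAL NOT NULL,
    PRIMARY KEY (tenant_id, invoice_id),
    INDEX invoices_by_created (tenant_id, created_at)
);

-- A per-tenant read then scans one contiguous key range.
SELECT *
FROM   invoices
WHERE  tenant_id = '9f3c2a6e-1a2b-4c3d-8e9f-0a1b2c3d4e5f'
AND    created_at > now() - INTERVAL '30 days';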
What is the best practice for including Created By, Created Timestamp, Modified By, Modified Timestamp into a dimensional model?
The first two never change. The last two will change slowly for some data elements but rapidly for other data elements. However, I'd prefer a consistent approach so that reporting users become familiar with it.
Assume that I really only care about the most recent value; I don't need history.
Is it best to put them into a dimension knowing that, for highly-modified data, that dimension is going to change often? Or, is it better to put them into the fact table, treating the unchanging Created information much the same way a sales order number becomes a degenerate dimension?
In my answer I will assume that these ADDITIONAL columns do NOT define the validity of the dimension record, and that you are talking about a slowly changing dimension type 1.
So we are in fact talking about dimensional metadata here, about who / which process created or modified the dimensional row.
I would always put this kind of metadata in the dimension because it:
Is related to changes in the dimension, and these changes happen independently of the fact table.
In general it is advisable to keep fact tables as small as possible. If your fact table contained 5 dimensions, this would mean adding 5*4=20 extra columns, which would seriously bloat the fact table and hurt performance.
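As a rough sketch, with invented table and column names, the audit columns would sit on the dimension like this:

CREATE TABLE dim_product (
    product_key        bigint PRIMARY KEY,
    product_name       varchar(200),
    created_by         varchar(50),
    created_timestamp  timestamp,
    modified_by        varchar(50),
    modified_timestamp timestamp   -- overwritten in place (SCD type 1, no history)
);

-- The fact table carries only the dimension keys and the measures.
CREATE TABLE fact_sales (
    date_key     bigint,
    product_key  bigint REFERENCES dim_product (product_key),
    customer_key bigint,
    sales_amount decimal(12,2)
);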
I understand the concept of range partitioning. If I have a date column and I partition on that column by month, then a query whose WHERE clause filters on a single month can hit one particular partition and get its data without scanning the full table.
In the Oracle docs I read that if a logical partitioning key like 'month' is not available (e.g. you partition on a column such as customer ID), then you should use hash partitioning. So how does this work? Does Oracle randomly divide the data, assign it to different partitions, and assign a hash code to each partition?
But in this situation, when new data comes in, how does Oracle know which partition to put the new data in? And when I query data, it seems there is no way to avoid hitting multiple partitions?
"how does oracle know in which partition to put the new data?"
From the documentation:
Oracle Database uses a linear hashing algorithm and to prevent data from clustering within specific partitions, you should define the number of partitions by a power of two (for example, 2, 4, 8).
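For illustration, a hash-partitioned table with a power-of-two partition count might be declared like this (the table and column names are invented):

CREATE TABLE customers (
    customer_id NUMBER PRIMARY KEY,
    cust_name   VARCHAR2(100),
    signup_date DATE
)
PARTITION BY HASH (customer_id)
PARTITIONS 8;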
As for your other question ...
"when i query data, it seems there is no way to avoid hitting multiple
partitions?"
If you're searching for a single Customer ID, then no: Oracle's hashing algorithm is consistent, so records with the same partition key always land in the same partition, and the query only needs to hit that one. But if you are searching for, say, all the new customers from the last month, then yes: the hashing algorithm strives to distribute records evenly, so the latest records will be spread across every partition.
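Using the sketch table above, the difference looks roughly like this:

-- Equality on the hash key can be pruned to a single partition.
SELECT * FROM customers WHERE customer_id = 42;

-- A predicate on any other column has to visit every partition.
SELECT * FROM customers WHERE signup_date >= DATE '2024-01-01';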
So the real question is, why do we choose to partition a table? Performance is often the least compelling reason to partition. Better reasons include
Availability: each partition can reside in a different tablespace. Hence a problem with a tablespace will take out only a slice of the table's data instead of the whole thing.
Management: partitioning provides a mechanism for splitting whole-table jobs into clear batches. Partition exchange can make it easier to bulk load data.
As for performance, physical co-location of records can speed up some queries: those which search for records by a defined range of keys. However, any query which doesn't match the grain of the partitioning won't perform faster (and may even perform slower) than against a non-partitioned table.
Hash partitioning is unlikely to provide performance benefits, precisely because it shuffles the keys across the whole table. It will provide the availability and manageability benefits of partitioning (but is obviously not particularly amenable to partition exchange).
A hash is not random; it divides the data in a repeatable (but perhaps difficult-to-predict) fashion, so the same ID will always map to the same partition.
Oracle uses a hash algorithm that should usually spread the data evenly between partitions.
I have two MySQL tables, one containing a set of 6000 users and another containing 10000 ratings they have provided for products. I'd like to build a matrix of feature vectors with one row per user, containing a 1 or 0 depending on whether the user has rated a particular product (or even the rating value itself). What is the best way to accomplish this, given too that the matrix will be sparse?
I'm curious as to what implementations I can test out with tools at my disposal (like MySQL or MATLAB) - the end purpose is to perform clustering of similar users. Somehow I think a 10,000 column MySQL table won't make my db admin happy... at all.
The obvious way of storing a sparse matrix in SQL is to use three columns, where user and product together are the primary key, and the extra column is the rating.
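A rough sketch, with invented names:

-- One row per (user, product) rating; absent combinations cost nothing.
CREATE TABLE ratings (
    user_id    INT NOT NULL,
    product_id INT NOT NULL,
    rating     TINYINT NOT NULL,
    PRIMARY KEY (user_id, product_id)
);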
It does not make sense to do the actual processing in the SQL database. That just adds a huge overhead and makes things slow. Get the data out into a primitive, fast data structure, do the analysis there, and then translate the output into whatever format you need.
SQL is good when you need only part of the data, or when you have to perform changes, need locking, and so on. But I'd never run the computation directly on the database: unless you can load your low-level linear algebra libraries into the database, it will be slow.
Here's another one I've been thinking about lately.
We have concluded in earlier discussions: 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier, I saw that Hibernate by default creates one sequence for all tables. At first I was puzzled by this: why would you do that? But later I saw the advantage: it makes linking parents and children foolproof. Because no two tables share a primary key value, accidentally joining a parent to a table that is not its child simply returns no results.
Does anyone see any downsides to this approach? I only see one: you cannot have more than 999999999999999999999999999 records in your database.
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only one or a few threads request new values, there will be no locking issues, but a bad implementation could cause contention.
Another problem is rolling back transactions: Sequences don't get rolled back (because someone else might have requested a higher value already), so you can have large gaps which will eat your number space much more quickly than you might expect. OTOH, it will take some time to eat 2 or 4 billion IDs (if you "only" use 32 bit (signed) ints), so it's rarely an issue in practice.
Lastly, you can't easily reset the sequence if you have to. But if you need to have a restarting sequence (say, number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
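As a rough illustration in Oracle syntax (the table names are invented):

CREATE SEQUENCE hibernate_sequence;

CREATE TABLE customer (
    id   NUMBER PRIMARY KEY,          -- value taken from hibernate_sequence
    name VARCHAR2(100)
);

CREATE TABLE orders (
    id          NUMBER PRIMARY KEY,   -- value taken from the same sequence
    customer_id NUMBER REFERENCES customer (id)
);

-- Because no two rows anywhere in the schema share an id, accidentally
-- joining an orders id against customer.id simply returns no rows.
INSERT INTO customer (id, name)
VALUES (hibernate_sequence.NEXTVAL, 'Acme');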
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling. If you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables when you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much slower.
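A rough sketch of that setup (the sequence names are invented, and the exact restart syntax depends on the database version):

CREATE SEQUENCE customer_seq;   -- master table: never restarted
CREATE SEQUENCE trace_seq;      -- high-volume tracing table

-- After archiving or deleting the old trace rows:
-- (Oracle 18c+ syntax; PostgreSQL uses ALTER SEQUENCE trace_seq RESTART WITH 1;
--  older Oracle versions require dropping and recreating the sequence.)
ALTER SEQUENCE trace_seq RESTART START WITH 1;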
I see little value in having only one sequence for all tables. The arguments given so far do not convince me.
There are a couple of disadvantages to using a single sequence:
Reduced concurrency: handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem.
Oracle has special code when maintaining B-tree indexes to detect monotonically increasing values and balance the tree appropriately.
The CBO might have an easier time estimating range queries on the index (if you ever did this) if most values were filled in.
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally I find the ability to assign a truly unique identifier to a row, effectively making each id column a database-wide unique key, to be enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches INSERT statements, this will not be a performance bottleneck unless massive numbers of records are vying for the same db resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails, though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored