What happens to the surrogate keys of a transactional system when converting it to a dimensional schema? - dimensional-modeling

Our OLTP systems use several surrogate keys. Now we want to create a dimensional model of our system for analysis. Should we keep the OLTP surrogate keys and natural keys and also create one more datamart surrogate key? Or should we ignore the OLTP surrogate key and just keep the natural key from the OLTP system plus the datamart surrogate key?

The dimensional model's surrogate keys are specific to the dimensional model and independent of any source keys you might have. You should definitely keep the natural keys and create a datamart surrogate key. Whether it is also worth bringing in the OLTP system's surrogate key as a back reference depends on how useful it is for identifying rows back in the OLTP system, i.e. how important that OLTP surrogate key is. Normally I'd stick with just the new surrogate key and the natural key in the dimension, but sometimes the OLTP surrogate key serves as the natural key too.
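As a minimal sketch of that layout (the table and column names here are hypothetical, and the identity syntax varies by DBMS), a dimension row would carry the new datamart surrogate key as its primary key, the natural key from the source, and optionally the OLTP surrogate key as a plain attribute:

CREATE TABLE dim_customer (
    customer_key       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- datamart surrogate key, minted in the ETL
    customer_number    VARCHAR(20) NOT NULL,                            -- natural/business key from the OLTP system
    source_customer_id BIGINT,                                          -- optional back reference: the OLTP surrogate key
    customer_name      VARCHAR(100) NOT NULL
);

Fact tables would then reference customer_key only; the natural key (and the optional source key) stay on the dimension for tracing rows back to the source.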

Related

Hash Table vs Dictionary

As far as I know, a hash table uses a hash of the key to store items, whereas a dictionary uses a simple key-value pair to store items. That would mean a dictionary is a lot faster than a hash table (which is what I think; please correct me if I am wrong).
Does this mean I should never use a hash table?
The answer is "it depends".
A Dictionary is merely a way to map a key to a value. You can either use a library or implement one yourself.
A hash table is a specific way to implement a dictionary, where the key is run through a hash function. This function is usually based on modulo arithmetic, which means two distinct keys may end up with the same hash value and therefore collide. It is then up to you (or whoever implements the hash table) to determine how to resolve the collision: you could chain the values stored at the same slot, re-hash into a sub-hash table, or even start over with a new hash function (which would be expensive).
So the underlying implementation of the dictionary (e.g. a hash table) is what determines your lookup performance.

oracle composite primary key vs index

I'm designing a table which has multiple foreign keys. What I did was create an extra column as the primary key, which works more as a sequential ID, but I could also make the foreign keys a composite primary key.
So my question is about performance: is it better (at least for Oracle) to have a composite primary key than an index? Which is better for my case?
Thanks!
As @Sylvain_Leroux points out, the term "better" is ambiguous here because there are trade-offs to both approaches; it depends on your goals.
Ensure Composite Key is Actually Unique
First of all, if you want to use a composite primary key out of the foreign keys, then you must be sure that the combination of the foreign keys will be truly unique for each record. Otherwise, of course, you won't be able to use them as a primary key. If instead you are describing using a composite key made up of the foreign keys plus a surrogate key, that's kind of the worst of both worlds and is generally frowned upon.
ETL Back Room Considerations
The choice you are considering is a common one in OLAP, where a designer must choose whether to use a surrogate key for the fact table or a composite key made up of the keys of the dimension tables. This advice from page 487 of Ralph Kimball's The Data Warehouse Toolkit, Third Edition, therefore applies to your situation (you can consider your table analogous to what he describes as a fact table, with the foreign keys pointing to tables he refers to as dimensions):
Fact table surrogate keys have a number of uses in the ETL back room. First, as previously described, they can be used as the basis for backing out or resuming an interrupted load. Second, they provide immediate and unambiguous identification of a single fact row without needing to constrain multiple dimensions to fetch a unique row. Third, updates to fact table rows can be replaced by inserts plus deletes because the fact table surrogate key is now the actual key for the fact table. Thus, a row containing updated columns can now be inserted into the fact table without overwriting the row it is to replace. When all such insertions are complete, then the underlying old rows can be deleted in a single step. Fourth, the fact table surrogate key is an ideal parent key to be used in a parent/child design. The fact table surrogate key appears as a foreign key in the child, along with the parent's dimension foreign key.
Performance Considerations
From a performance perspective, if the table stores its rows physically in primary key order (an index-organized table in Oracle terms), reads that look up by the leading key columns are faster, but writes can be slower when they insert records at points other than the end. This is because the DBMS has to physically move records to make room (this is slightly oversimplified, because the DBMS has schemes to combat this, but they are overwhelmed if the inserts are numerous enough).
If you were to use a surrogate key, the insert problem wouldn't be an issue, but of course in situations where you are looking up by foreign keys, you wouldn't get the advantage of having your data in order physically on the disk. Assuming you would put an index on each foreign key, then that would add some overhead to insert tasks because the DBMS has to update multiple indices.
All of this is only noticeable with large amounts of data and will not make much of a difference for a relatively small amount of data.
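To make the two options concrete, here is a rough sketch in Oracle-style DDL (table and column names are hypothetical, and the identity syntax requires Oracle 12c or later; use a sequence on older versions):

-- Option 1: composite primary key built from the foreign keys
CREATE TABLE order_item (
    order_id   NUMBER NOT NULL REFERENCES orders (order_id),
    product_id NUMBER NOT NULL REFERENCES products (product_id),
    quantity   NUMBER NOT NULL,
    CONSTRAINT order_item_pk PRIMARY KEY (order_id, product_id)
);

-- Option 2: surrogate primary key plus separate indexes on the foreign keys
CREATE TABLE order_item (
    order_item_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id      NUMBER NOT NULL REFERENCES orders (order_id),
    product_id    NUMBER NOT NULL REFERENCES products (product_id),
    quantity      NUMBER NOT NULL
);
CREATE INDEX order_item_order_ix   ON order_item (order_id);
CREATE INDEX order_item_product_ix ON order_item (product_id);

With option 1 the primary key index already covers lookups by order_id (the leading column); with option 2 every insert has to maintain the primary key index plus the two foreign key indexes.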

What is the point of using a Partitioner for Secondary Sorting in MapReduce?

If you need the values sorted for a given key when they are passed to the reduce phase, such as for a moving average, or to mimic the LAG/LEAD analytic functions in SQL, you need to implement a secondary sort in MapReduce.
After searching around on Google, the common suggestion is to:
A) Emit a composite key, which includes the value to sort on, in the map phase.
B) Create a "composite key comparator" class, whose purpose is the secondary sort: it compares the values to sort on after comparing the natural key, so that the Iterable passed to the reducer is sorted.
C) Create a "natural key grouping comparator" class, whose purpose is the grouping: it compares only the natural key, so that the Iterable passed to the reducer contains all of the values belonging to a given key.
D) Create a "natural key partitioner" class, whose purpose I do not know; it is the subject of my question.
From here:
The natural key partitioner uses the natural key to partition the data to the reducer(s). Again, note that here, we only consider the “natural” key.
By natural key he of course means the actual key, not the composite key + value.
From here:
The default partitioner will calculate a hash over the entire key, resulting in different hashes and the potential that the records are sent to separate reducers. To ensure that both records are sent to the same reducer, let's implement a custom partitioner.
From here:
In a real Hadoop cluster, there are many reducers running in different nodes. If the data for the same zone and day don’t land in the same reducer after the map reduce shuffle, we are in trouble. The way to ensure that is taking charge of defining our own partitioning logic.
Every source I've presented, plus all the others I've seen, recommends that the partitioner class be written according to the following pseudocode:
naturalKey = compositeKey.getNaturalKey()
return naturalKey.hashCode() % NUMBER_OF_REDUCERS
Now, I was under the impression that Hadoop guarantees that, for a given key, ALL the values corresponding to that key will be directed to the same reducer.
Is the reason we create a custom Partitioner the same as the reason we created the "natural key grouping comparator" class: to prevent MapReduce from partitioning on the composite key instead of the natural key?
The question is almost as good as an answer :). Everything you mentioned above is correct; I guess a different way of explaining the concept should help.
So let me give it a shot.
Let's assume that our secondary sort is on a composite key made up of Last Name and First Name.
With the composite key out of the way, let's now look at the secondary sorting mechanism.
The partitioner and the group comparator use only the natural key: the partitioner uses it to channel all records with the same natural key to a single reducer. This partitioning happens in the map phase; data from the various map tasks are received by the reducers, where they are grouped and then sent to the reduce method. This grouping is where the group comparator comes into the picture: if we had not specified a custom group comparator, Hadoop would have used the default implementation, which would have considered the entire composite key and led to incorrect results.

UUID as PK good idea?

At the moment I am looking at a table with 210 million records. The primary key is a 36-character alphanumeric key (a UUID).
Would it be better, storage-wise, to use a sequential number as the PK and the UUID as a normal column?
It would be more compact, but the data would be harder to move between systems while maintaining data integrity.
So if you have to move the data, use UUIDs. Otherwise, I see no benefit and some disadvantages.
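A sketch of that more compact layout (hypothetical names; the identity syntax is SQL-standard but varies by DBMS, e.g. Oracle 12c+ or PostgreSQL 10+):

CREATE TABLE big_table (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- compact 8-byte surrogate key used internally
    uuid CHAR(36) NOT NULL UNIQUE                          -- original UUID kept for identifying rows across systems
);

Internal foreign keys would then point at the 8-byte id, while the UUID stays available (and indexed via the unique constraint) whenever data has to be matched up across systems.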

How to sort by counter in Cassandra?

Let's assume I have forum software, and I would like to sort the threads by the number of views each has. The views would be stored in a counter.
Having experience with relational databases, I thought this would be simple to solve; it turns out it's not. I have thought about creating one massive row with the columns being counters (and thus sorted), but as a single row can only be stored on a single node, this does not seem feasible (it defeats the point of using Cassandra).
How can I sort by counter column in Cassandra?
You can't sort big data arbitrarily. That's one of the fundamental assumptions.
The only things you can sort by in Cassandra are the things Cassandra uses to store its data: the row key and the column key.
Moving to NoSQL from normal SQL, you have to drop the notion of being able to sort/join data freely. It's just (generally) not possible in big data implementations.
To update this question:
Korya is correct, but you cannot assume that ALL NoSQL/big data stores are unable to sort (MongoDB can sort, and it is NoSQL).
Regarding Cassandra itself: you can sort on any elements of your primary key that come AFTER the partition key inside a composite key.
Example:
PRIMARY KEY ((A), B, C, D)
A is your partition key.
B, C, and D are the clustering columns of your composite key, and can be sorted ASC (the default) or DESC. If you want something naturally latest-first (e.g. by time), you would specify the clustering order in your schema, for example:
WITH CLUSTERING ORDER BY (media_type_id ASC, media_id ASC);
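Put together, a complete table definition showing where that clause attaches might look like this (the account_id and title columns are hypothetical; media_type_id and media_id come from the clause above):

CREATE TABLE media (
    account_id    uuid,   -- partition key (the "A" above)
    media_type_id int,    -- first clustering column
    media_id      uuid,   -- second clustering column
    title         text,
    PRIMARY KEY ((account_id), media_type_id, media_id)
) WITH CLUSTERING ORDER BY (media_type_id ASC, media_id ASC);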
As far as the question about counters goes:
You cannot sort by the counter inside Cassandra, because to sort by it the counter would need to be part of the key, and a counter column cannot be a key column.
As pointed out by Martin, the solution referenced in the eBay whitepaper example is to use two tables to keep track.
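A rough sketch of that two-table idea in CQL (table and column names are hypothetical): one table holds the live counters, and a second table is periodically rewritten by the application with a materialized view count as a clustering column, so threads can be read back already ordered by views.

-- live counters, incremented on every page view
CREATE TABLE thread_views (
    thread_id uuid PRIMARY KEY,
    views counter
);

UPDATE thread_views SET views = views + 1 WHERE thread_id = ?;

-- ranking table, rewritten periodically from the counters
CREATE TABLE threads_by_views (
    forum_id  uuid,
    views     bigint,
    thread_id uuid,
    PRIMARY KEY ((forum_id), views, thread_id)
) WITH CLUSTERING ORDER BY (views DESC, thread_id ASC);

SELECT thread_id, views FROM threads_by_views WHERE forum_id = ? LIMIT 25;

The ordering is only as fresh as the last rewrite: the application reads the counters on some schedule, deletes each thread's old row in threads_by_views, and inserts a new one with the updated count.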
