Database design: Same table structure but different table - performance

My latest project deals with a lot of "staging" data.
Like when a customer registers, the data is stored in "customer_temp" table, and when he is verified, the data is moved to "customer" table.
Before I start shooting e-mails, go on a rampage on how I think this is wrong and you should just put a flag on the row, there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand if table 2 will store archival, non seldom used data.
But I can't understand if table 2 stores live data that can changes constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?

If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.

There seems to be several explanations about why would you want "customer_temp".
As you noted would be for archival purposes. To allow analyzing data but in that case the historical data should be aggregated according to some interesting query. However it using live data does not sound plausible
As oded noted, there could be a certain business logic that differentiates between customer and potential customer.
Or it could be a security feature which requires logging all attempts to register a customer in addition to storing approved customers.

Any time I see a permenant table names "customer_temp" I see a red flag. This typically means that someone was working through a problem as they were going along and didn't think ahead about it.
As for the structure you describe there are some advantages. For example the tables could be indexed differently or placed on different File locations for performance.
But typically these advantages aren't worth the cost cost of keeping the structures in synch for changes (adding a column to different tables searching for two sets of dependencies etc. )
If you really need them to be treated differently then its better to handle that by adding a layer of abstraction with a view rather than creating two separate models.

I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For, example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table that are permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled, and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.

Related

insert data from one table to two tables group by for Oracle

I have a situation where I need a large amount of data (9+ billion per day) data being collected in a loading table that has fields like
-TABLE loader
first_seen,request,type,response,hits
1232036346,mydomain.com,A,203.11.12.1,200
1332036546,ogm.com,A,103.13.12.1,600
1432039646,mydomain.com,A,203.11.12.1,30
that need to split into two tables (de-duplicated)
-TABLE final
request,type,response,hitcount,id
mydomain.com,A,203.11.12.1,230,1
ogm.com,A,103.13.12.1,600,2
and
-TABLE timestamps
id,times_seen
1,1232036346
2,1432036546
1,1432039646
I can create the schemas and do the select like
select request,type,response,sum(hitcount) from loader group by request,type,response;
get data into the final table. for best performance I want to see if I can use "insert all" to move data from the loader to these two tables and perhaps use triggers in the database to try to achieve this. Any ideas and recommendations on the best ways to solve this?
"9+ billion per day"
That's more than just a large number of rows: that's a huge number, and it will require special engineering to handle it.
For starters, you don't just need INSERT statements. The requirement to maintain the count for existing (request,type,response) tuples points to UPDATE too. The need to generate and return a synthetic key is problematic in this scenario. It rules out MERGE, the easiest way of implementing upserts (because the MERGE syntax doesn't support the RETURNING clause).
Beyond that, attempting to handle nine billion rows in a single transaction is a bad idea. How long will it take to process? What happens if it fails halfway through? You need to define a more granular unit of work.
Although, that raises some business issues. What do the users only want to see the whole picture, after the Close-Of-Day? Or would they derive benefit from seeing Intra-day results? If yes, how to distinguish Intra-day from Close-Of-Day results? If no, how to hide partially processed results whilst the rest is still in flight? Also, how soon after Close-Of-Day do they want to see those totals?
Then there are the architectural considerations. These figure mean processing over one hundred thousand (one lakh) rows every second. That requires serious crunch and expensive licensing extras. Obviously Enterprise Edition for parallel processing but also Partitioning and perhaps RAC options.
By now you should have an inkling why nobody answered your question straight-away. This is a consultancy gig not a StackOverflow question.
But let's sketch a solution.
We must have continuous processing of incoming raw data. So we stream records for loading into FINAL and TIMESTAMP tables alongside the LOADER table, which becomes an audit of the raw data (or else perhaps we get rid of the LOADER table altogether).
We need to batch the incoming records to leverage set-based operations. Depending on the synthetic key implementation we should aim for pure SQL, otherwise Bulk PL/SQL.
Keeping the thing going is vital so we need to pay attention to Bulk Error Handling.
Ideally the target tables can be partitioned, so we can load into offline tables and use Partition Exchange to bring the cleaned data online.
For the synthetic key I would be tempted to use a hash key based on the (request,type,response) tuple rather than a sequence, as that would give us the option to load TIMESTAMP and FINAL independently. (Collisions are extremely unlikely.)
Just to be clear, this is a bagatelle not a serious architecture. You need to experiment and benchmark various approaches against realistic volumes of data on Production-equivalent hardware.

Database efficiency: references/pointers from table to table

I am working on learning databases and am unsure about something that doesn't seem to make any sense to me. In the relational model you are able to combine through references but always require a global sort of key in each table to be able to combine this information. That is obviously required in most cases, but I feel like in a perfect tree hierarchy set up of a database this is inefficient.
To explain this better I shall use the example of storing products in a database. Products have main categories and sub categories and these are very clear. (ie. Milk is a subcategory of Dairy which is a subcategory of Food, etc.)
I thought in cases like this the ability to store single or a list of references/pointers to tables in fields would take away a lot of search querying and storage requirements.
Here is a link to a simple pain layout I made to illustrate this:
Image (the table entry could have some command character like '|' after which it knows the following entry is a file directory so when the database initiates it knows to make a pointer there)
Since I am only learning to work with databases now I understand that I may just be missing some knowledge on the subject, but I don't seem to find anything when I try googling this problem. Any help explaining where to start or any confirmation that this may improve efficiency and where I could learn how to write this myself would be great.
The concept of "pointer" is useful only if the object you want to point to has a well-defined address that is at least as permanent as the pointer itself. If the address is less permanent, you could end up with a "dangling" pointer.
A row in the database does not necessarily have a permanent address.1 By referencing the row through a logical value (instead of the physical address), the reference stays valid even when the row physically moves.2 And to ensure that the value identifies exactly one row, it must be unique.3
As for storing the list of values (be it "pointers" or anything else) inside a single field, this violates the principle of atomicity and therefore the 1NF. There are very good reasons to avoid violating the 1NF, including the ability to maintain the referential integrity and utilize indexing. That being said, there are DBMSes that support arrays or even sub-tables within a single field, which may be useful on rare occasions.
1 For example, Oracle ROWID is constant as long as the row is not physically moved on disk, but that can happen in many situations that are part of the normal database operation. So aside from putting severe restrictions on how your database is used, you couldn't rely on the ROWID staying constant over the lifetime of the rows that reference it (which could be as long as the lifetime of the database itself).
2 I suppose it would be theoretically possible for a DBMS to keep track of all the pointers and update them when the row physically moves. However, I'm not aware of any DBMS that actually supports such "updatable" pointers in practice, probably because the underlying mechanism needed for that wouldn't be any more efficient than the standard "value-based" referencing.
3 And must obviously be non-NULL. Saying that the attribute (or combination thereof) is "non-NULL and unique", is synonymous to saying it's a "key". Ideally, the key should also be immutable (so there is no need for a cascading referential action such as ON UPDATE CASCADE).

database for enterprise level using oracle - normalization and duplication

I am developing an enterprise application with an Oracle backend. I am designing a core part of the DB architecture now and im having some questions on it.
First and most important thing is, most of my tables needs to preserve old data. For example
Consider a table with the fields
Contract No, Contract Name, Contract Person, Contract Email
I have a records like
12, xxx, yyy, xxx#zzz.ccc
and some one modifies it to
12, xxx, zzz, xxx#zzz.ccc
at any point of time i need to display the new record while still have copy of the old record.
So what i thought was to put a duplicate record of the old data and update the fields that was changed and have a flag to keep track of active records with something like "is active" as 1.
The downside is that this creates redundancy in the table and seems like a bad design. But any other model seems unnecessarily complex and this seems cleaner to me. Also i dont see any performance issues having a duplicate record too. So please let me know if this is ok or am i missing something here.
Some times where there is a one to many relationship my assumption is to have a mapping table where i map the multiple entity in individual records by repeating master ID and changing child ID in each record. Is this a right way to do it or is there a better way to do it.
Is there a book on database best practices.
Thanks.
The database im dealing with is Oracle 11g on a two node RAC cluster
Also i dont see any performance issues having a duplicate record too.
Assume you have a row that, over time, has 15 updates to it. If you don't store any temporal data (if you don't store different versions of the row), you end up storing one row. If you do store temporal data, you end up storing 15 rows.
You also need more indexes, because the id number is no longer sufficient to identify a single row.
If you have only relatively small tables, you probably won't see any performance difference. (There will be one, but it probably won't be noticeable to users.) But a table that has 10 million rows will perform differently than a table that has 150 million rows. (15 versions per row, times 10 million rows.)
Some times where there is a one to many relationship my assumption is
to have a mapping table where i map the multiple entity in individual
records by repeating master ID and changing child ID in each record.
Is this a right way to do it or is there a better way to do it.
You probably need to know which child rows belong to which parent rows. So you need more than a single master id for the key. The master id alone doesn't tell you which version of that row in the parent table applies to a given child row.
Is there a book on database best practices.
There are books on temporal databases. The first one that I know of is Snodgrass's Developing Time-Oriented Database Applications in SQL. It's available in several formats, and it's free. It's also kind of old, but the information in it is important to understand if you're going to be building a temporal database. Also, think about reading Date's book Temporal Data and the Relational Model.
Wikipedia has an article that summarizes the ideas behind temporal databases.
Is normalization completely mandatory.
That's a meaningless question. You will have different issues with tables normalized to 2NF than you'll have with tables normalized to 5NF or 6NF.
I would keep the old/history records in a separate table. Create an upd/del trigger to populate your audit/history table for you, and keep only the most current data in your main table.
See here for an example. Many other similar examples exists in SO.

Stumped and Seeking Input Re: Database Design

We have an Oracle database here that's been around for about 10 years. It's passed through a lot of hands. In the course of those years, it's grown quite large, and there are some interesting anomalies in its design that have me perplexed.
Now, I'm historically a SQL Server developer. I used to steam and fume about the differences between The Microsoft Way(tm) and The Oracle Way(R). Now, I realize, they're just different. I also used to yank my hair out and slam my head against the desk thinking that the people who came before me were blind, deaf mutes jacked up on Jolt and Red Bull, who wrote code in Tourette's.NET.
(Yes, I'm going somewhere.)
As time passed, I realized that neither database platform was inherently better than the other. They're just different. Further, I also realized that the developers who came before me often had compelling reasons for designing and writing things the way they did. Just because I wasn't privy to it didn't make it untrue. Sure, the documentation could have been better, but still.
So here's where all this leads me:
We have a few tables in the database that have two separate owners. Both owners define identical primary key constraints on the table. This has me perplexed. Why would a table have multiple owners? And why would each owner define separate yet identical primary keys?
These guys designed a pretty well-layed out database with lots of primary keys. But they didn't make a lot of use of indexes. When they did use indexes, they tended to make one large index instead of many distinct indexes. Is there some compelling performance gain to be had from that?
We also avoided foreign key constraints like the plague. Not sure why we would have done that. Is there a reason to avoid them in Oracle? I can see a lot of reasons to use them to enforce data integrity between tables, and we're just not using them. I'm assuming that there's a compelling reason, and I'm just not privy to it.
Finally, is there a compelling reason to avoid the use of triggers (aside from the obvious pitfall that lies in performance hits)? We don't seem to be using those much either.
For the record, we're still using Oracle 9i.
Again, thanks for your patience, everyone. I'm an old Microsoft hand, so bending my brain around the Oracle Way is challenging at times. It's a big beast, with tons to learn, and sometimes, finding that information on the Web is a chore.
Thank His Noodliness for StackOverflow.
Salient Post-Post Points
Historically, we haven't used sequences, except in very rare cases.
Historically, we haven't used stored procedures or functions, except in very rare cases.
There are some references in very old documents to ERWIN. (Thanks to the poster below for bringing it to my memory.) Chances are, the bulk of the design was the product of an ORM, and the natural design flowed from that.
The vast majority of the SQL appears hard-coded in the application, and there's a lot of it.
I'm doing everything in my power to move us away from hard-coded SQL, and to get the SQL into the database where it belongs. But I'm trying to do that in a way that makes sense, is practical, and doesn't break the business in the process. (Read: On new software only.)
We have a few tables in the database that have two separate owners. Both owners define identical primary key constraints on the table. This has me perplexed. Why would a table have multiple owners? And why would each owner define separate yet identical primary keys?
You cannot define two PRIMARY KEY's on one table in Oracle. You can define one PRIMARY KEY and one UNIQUE key on the same column set. I can see no point in such a design.
These guys designed a pretty well-layed out database with lots of primary keys. But they didn't make a lot of use of indexes. When they did use indexes, they tended to make one large index instead of many distinct indexes. Is there some compelling performance gain to be had from that?
In Oracle, an index cannot be used for RANGE SCANS on something that doesn't constitute a leftmost prefix of this index.
A composite index on (col1, col2, col3) cannot be used to do a plain RANGE SCAN on col2 alone or col3 alone.
We also avoided foreign key constraints like the plague. Not sure why we would have done that. Is there a reason to avoid them in Oracle? I can see a lot of reasons to use them to enforce data integrity between tables, and we're just not using them. I'm assuming that there's a compelling reason, and I'm just not privy to it.
If you make all interaction with the database through a set of well-defined procedures, a MERGE statement can yield far better performance than a FOREIGN KEY with ON DELETE CASCADE. You, though, should be very very careful and get used to this programming paradigma.
Finally, is there a compelling reason to avoid the use of triggers (aside from the obvious pitfall that lies in performance hits)? We don't seem to be using those much either.
I personally don't use triggers at all. Not every business rule can be expressed in terms of cascading inserts or updates, and any two-pass DML operation will lead to mutating tables. If all interaction with the database is done via stored procedures (or packages), triggers become useless.
Using triggers means in fact using SQL statements inside CURSOR loops, which every SQL cheechako knows to be a bad thing.
You don't want to be seen using cursors instead of set-based operations, do you?
FOREIGN KEY's are not as bad as triggers (as long as you don't define CASCADE operations on them), since they just don't let you do wrong things at the expense of some performance loss.
But when your database grows large, you will notice that the rules for integrity checking are far more complex than just verifying that the values being inserted into one table exist in another one.
You will have to check newly inserted values against aggregates, complex joins, etc., and all will checks will imply having a corresponding value in other table, and failing these checks compromises your database integrity just as good as violating the FOREIGN KEY's
So it will turn out that these FOREIGN KEY's are double and triple checked anyway, and there is no point to keep data integrity rules scattered all around the database rather than having them in one place (a stored procedure that is always used for updating the data).
How can the same table belong to two schemas. It doesn't make any sense.
That given there is nothing inherently bad practice in the questions you have asked.
I develop a large .net application with Oracle database and we have an excellent Oracle DBA in our team. We have used Foreign key constraints wherever possible for data integrity. Triggers are used only to get a new value from sequence or for auditing purpose and not for any business logic. We have used multicolumn unique indexes for data integrity and single column non-unique indexes.
"In Oracle, an index cannot be used for RANGE SCANS on something that doesn't constitute a leftmost prefix."
I believe this is not true anymore since Oracle 10g.
"When they did use indexes, they tended to make one large index instead of many distinct indexes. Is there some compelling performance gain to be had from that?"
You create indexes to speed up queries. If you query on "surname = 'Smith' and given_name = 'john'", then it is better to have a single index on (surname, given_name) than two separate indexes.
If no-one is complaining about performance, you probably don't need to worry about indexes.
Lots of primary keys.
We also avoided foreign key constraints.
Avoid the use of triggers.
Sounds like they used an ORM to fetch objects out of the database. That means fewer ultra-complex joins and SELECT statements and more simple SELECTS. It means constraints in the code, not the database. Similarly, "trigger"-like behavior is in the code.
Doesn't sound Oracle-specific. Sounds like the application has an ORM.
A lot of people, including me, don't like triggers because it makes it a lot harder to troubleshoot.
This pretty much sums up my opinion
I did Oracle database design for a large organization, and we used triggers as much as we could due to the fact that we had business rules that had to be enforced when data was coming from several directions (the application's GUI, and SQL scripts used for data migration). The business rules we enforced were pretty simple (date checking, checking for existence of rows in another table, etc...). If we tried to make them to complex, we got the dreaded "mutating table" error, which basically means you're trying to inspect the table that is currently changing. So triggers can be useful in some situations, but can cause headaches.
As far as indexes go, in my opinion it is -very- important to have indexes on the columns that are used for joining tables together. That's an easy way to increase performance.
About the foreign keys: since the database changed hands so much, I wonder if the foreign keys could have been dropped accidentally, somewhere along the line. I used PL-SQL developer and some seemingly-innocent operations (like adding/removing a column I think, but I'm not sure) caused the foreign keys to all be deleted.
They may have avoided using foreign constraints for performance. I'm told it can be very slow. They also make it difficult to bulk load data which may be inaccurate when loaded but will be corrected programatically.
"We have a few tables in the database that have two separate owners. Both owners define identical primary key constraints on the table. This has me perplexed. Why would a table have multiple owners? And why would each owner define separate yet identical primary keys?"
A SQL Server database corresponds more to an Oracle user/schema. So you can have multiple tables in the same Oracle database belonging to different schemas/users. These are DIFFERENT tables (ie with different data inside, and potentially different columns/indexes...).
Sometimes bits of a business want a snaphot of the data (eg at month or year end). Sometimes, before a datafix, a DBA will create a copy of a table (possibly with a different name or in a different schema) just in case the datafix goes horribly wrong.
Either way, where you have copies of a table, one is probably out of date (intentionally).
Assuming that you are not in a data warehousing situation here -
Foreign keys ensure referential integrity and are absolutely vital. I can't think of a situation when you would not want them.
Indexes again are very important tools to ensure query performance.
Not sure why they would define PKs without Indexes - PKs are usually implemented via a unique index.
Using large indexes, I assume you mean indexes that compound multiple columns
Using ERWIN-engineered Oracle database need not result in such a design - so what you have is not an ERWIN artifact.
If I had to hazard a guess - I am thinking the designer was overly, un-necessarily trying to design for performance - he avoided indexes for update performance, he also avoided FK constraints for a similar 'imagined' performance.
Unless the database is being used for a unique kind of application in a very special way, there really is no grounds for omitting FKs, and Indices.
Regarding triggers, other posters have already weighed in - triggers will be useful for capturing business rules in one central-place (same for Stored Procedures - good for encapsulating Business Logic).

How to stop thinking "relationally"

At work, we recently started a project using CouchDB (a document-oriented database). I've been having a hard time un-learning all of my relational db knowledge.
I was wondering how some of you overcame this obstacle? How did you stop thinking relationally and start think documentally (I apologise for making up that word).
Any suggestions? Helpful hints?
Edit: If it makes any difference, we're using Ruby & CouchPotato to connect to the database.
Edit 2: SO was hassling me to accept an answer. I chose the one that helped me learn the most, I think. However, there's no real "correct" answer, I suppose.
I think, after perusing about on a couple of pages on this subject, it all depends upon the types of data you are dealing with.
RDBMSes represent a top-down approach, where you, the database designer, assert the structure of all data that will exist in the database. You define that a Person has a First,Last,Middle Name and a Home Address, etc. You can enforce this using a RDBMS. If you don't have a column for a Person's HomePlanet, tough luck wanna-be-Person that has a different HomePlanet than Earth; you'll have to add a column in at a later date or the data can't be stored in the RDBMS. Most programmers make assumptions like this in their apps anyway, so this isn't a dumb thing to assume and enforce. Defining things can be good. But if you need to log additional attributes in the future, you'll have to add them in. The relation model assumes that your data attributes won't change much.
"Cloud" type databases using something like MapReduce, in your case CouchDB, do not make the above assumption, and instead look at data from the bottom-up. Data is input in documents, which could have any number of varying attributes. It assumes that your data, by its very definition, is diverse in the types of attributes it could have. It says, "I just know that I have this document in database Person that has a HomePlanet attribute of "Eternium" and a FirstName of "Lord Nibbler" but no LastName." This model fits webpages: all webpages are a document, but the actual contents/tags/keys of the document vary soo widely that you can't fit them into the rigid structure that the DBMS pontificates from upon high. This is why Google thinks the MapReduce model roxors soxors, because Google's data set is so diverse it needs to build in for ambiguity from the get-go, and due to the massive data sets be able to utilize parallel processing (which MapReduce makes trivial). The document-database model assumes that your data's attributes may/will change a lot or be very diverse with "gaps" and lots of sparsely populated columns that one might find if the data was stored in a relational database. While you could use an RDBMS to store data like this, it would get ugly really fast.
To answer your question then: you can't think "relationally" at all when looking at a database that uses the MapReduce paradigm. Because, it doesn't actually have an enforced relation. It's a conceptual hump you'll just have to get over.
A good article I ran into that compares and contrasts the two databases pretty well is MapReduce: A Major Step Back, which argues that MapReduce paradigm databases are a technological step backwards, and are inferior to RDBMSes. I have to disagree with the thesis of the author and would submit that the database designer would simply have to select the right one for his/her situation.
It's all about the data. If you have data which makes most sense relationally, a document store may not be useful. A typical document based system is a search server, you have a huge data set and want to find a specific item/document, the document is static, or versioned.
In an archive type situation, the documents might literally be documents, that don't change and have very flexible structures. It doesn't make sense to store their meta data in a relational databases, since they are all very different so very few documents may share those tags. Document based systems don't store null values.
Non-relational/document-like data makes sense when denormalized. It doesn't change much or you don't care as much about consistency.
If your use case fits a relational model well then it's probably not worth squeezing it into a document model.
Here's a good article about non relational databases.
Another way of thinking about it is, a document is a row. Everything about a document is in that row and it is specific to that document. Rows are easy to split on, so scaling is easier.
In CouchDB, like Lotus Notes, you really shouldn't think about a Document as being analogous to a row.
Instead, a Document is a relation (table).
Each document has a number of rows--the field values:
ValueID(PK) Document ID(FK) Field Name Field Value
========================================================
92834756293 MyDocument First Name Richard
92834756294 MyDocument States Lived In TX
92834756295 MyDocument States Lived In KY
Each View is a cross-tab query that selects across a massive UNION ALL's of every Document.
So, it's still relational, but not in the most intuitive sense, and not in the sense that matters most: good data management practices.
Document-oriented databases do not reject the concept of relations, they just sometimes let applications dereference the links (CouchDB) or even have direct support for relations between documents (MongoDB). What's more important is that DODBs are schema-less. In table-based storages this property can be achieved with significant overhead (see answer by richardtallent), but here it's done more efficiently. What we really should learn when switching from a RDBMS to a DODB is to forget about tables and to start thinking about data. That's what sheepsimulator calls the "bottom-up" approach. It's an ever-evolving schema, not a predefined Procrustean bed. Of course this does not mean that schemata should be completely abandoned in any form. Your application must interpret the data, somehow constrain its form -- this can be done by organizing documents into collections, by making models with validation methods -- but this is now the application's job.
may be you should read this
http://books.couchdb.org/relax/getting-started
i myself just heard it and it is interesting but have no idea how to implemented that in the real world application ;)
One thing you can try is getting a copy of firefox and firebug, and playing with the map and reduce functions in javascript. they're actually quite cool and fun, and appear to be the basis of how to get things done in CouchDB
here's Joel's little article on the subject : http://www.joelonsoftware.com/items/2006/08/01.html

Resources