Database efficiency: references/pointers from table to table - performance

I am working on learning databases and am unsure about something that doesn't seem to make any sense to me. In the relational model you are able to combine through references but always require a global sort of key in each table to be able to combine this information. That is obviously required in most cases, but I feel like in a perfect tree hierarchy set up of a database this is inefficient.
To explain this better I shall use the example of storing products in a database. Products have main categories and sub categories and these are very clear. (ie. Milk is a subcategory of Dairy which is a subcategory of Food, etc.)
I thought in cases like this the ability to store single or a list of references/pointers to tables in fields would take away a lot of search querying and storage requirements.
Here is a link to a simple pain layout I made to illustrate this:
Image (the table entry could have some command character like '|' after which it knows the following entry is a file directory so when the database initiates it knows to make a pointer there)
Since I am only learning to work with databases now I understand that I may just be missing some knowledge on the subject, but I don't seem to find anything when I try googling this problem. Any help explaining where to start or any confirmation that this may improve efficiency and where I could learn how to write this myself would be great.

The concept of "pointer" is useful only if the object you want to point to has a well-defined address that is at least as permanent as the pointer itself. If the address is less permanent, you could end up with a "dangling" pointer.
A row in the database does not necessarily have a permanent address.1 By referencing the row through a logical value (instead of the physical address), the reference stays valid even when the row physically moves.2 And to ensure that the value identifies exactly one row, it must be unique.3
As for storing the list of values (be it "pointers" or anything else) inside a single field, this violates the principle of atomicity and therefore the 1NF. There are very good reasons to avoid violating the 1NF, including the ability to maintain the referential integrity and utilize indexing. That being said, there are DBMSes that support arrays or even sub-tables within a single field, which may be useful on rare occasions.
1 For example, Oracle ROWID is constant as long as the row is not physically moved on disk, but that can happen in many situations that are part of the normal database operation. So aside from putting severe restrictions on how your database is used, you couldn't rely on the ROWID staying constant over the lifetime of the rows that reference it (which could be as long as the lifetime of the database itself).
2 I suppose it would be theoretically possible for a DBMS to keep track of all the pointers and update them when the row physically moves. However, I'm not aware of any DBMS that actually supports such "updatable" pointers in practice, probably because the underlying mechanism needed for that wouldn't be any more efficient than the standard "value-based" referencing.
3 And must obviously be non-NULL. Saying that the attribute (or combination thereof) is "non-NULL and unique", is synonymous to saying it's a "key". Ideally, the key should also be immutable (so there is no need for a cascading referential action such as ON UPDATE CASCADE).

Related

Android room insert/update result

I've been using room for a while. I'm from a mysql background where you have to check the values of queries and stuff. In room, I find this a bit complicated because so far I can either declare the dao insert query as void or as long returning the rowId
If I return a long, I have to write a listener to notify the UI of success/failure
My question is, is this necessary? Do I need the return value of inserts/updates/deletes or are these queries guaranteed to succeed?
My question is, is this necessary?
This depends and is hopefully better explained (a least a little anyway) below.
Do I need the return value of inserts/updates/deletes or are these
queries guaranteed to succeed?
There is no guarantee that they will succeed. However, you may be able to assume they have or use CONFLICT handling.
Much could depend upon how the Entities are coded, for example say you had a simple table (Entity) with an id and a name and for simplicity that you have autogenerate = true and you never allow the id to be specified when inserting. Unless the database is massive (beyond storage device capacity) or tweaked. A unique id will always result.
If the name needs to be UNIQUE then you are introducing a facet that makes it more likely that the insert will not succeed. If you had the onConflictStrategy of the insert as IGNORE, then a duplicate wouldn't fail but you'd may want to know if nothing was inserted (-1 returned).
This is just one facet. The answer is really that you need to consider the design of the database and of the app itself. Personally I'd always go with informing the user at least of the abnormal/unexpected which probably then means yes it is necessary (typically it is easier to supress code than add new code).

Why no primary key

I have inherited a datababase with tables that lack primary keys. It's an OLTP database. One of the tables in question has ~300k records, and has no primary key implemented, even though examining the rest of the schema tells me one column is used AS a primary key, ie being replicated in another table, with identical name, etc. ie. This is not an 'end of line' table
This database also does not implement FKs.
My question is - is there ANY valid reason for a table (in Oracle for that matter) NOT to have a primary key?
I think PK is mandatory for almost all cases. Lots of reasons will exist but I'll treat some of them.
prevent to insert duplicate rows
rows will be referenced, so it must have a key for it
I saw very few cases make tables without PK (e.g. table for logs).
Not specific to Oracle but I recall reading about one such use-case where mysql was highly customized for a dam (electricity generation) project, I think. The input data from sensors were in the order 100-1000 per second or something. They were using timestamps for each record so didn't need a primary key (like with logs/logging mentioned in another answer here).
So good reasons would be:
Overhead, in the case of high frequency transactions
Necessity or Un-necessity in that case
"Uniqueness" maintained or inferred by application, not by db
In a normalized table, if every record needs to be unique and every field is referenced in other tables, then having a PK additionally adds an index overhead and if the PK would never actually be used in any SQL query (imho, I disagree with this but it's possible). But it should still have a unique index encompassing all the fields.
Bad reasons are infinite :-)
The most frequent bad reason which is actually responsible for the lack of a primary key is when DBs are designed by application/code-developers with little or no DB experience, who want to (or think they should) handle all data constraints in the application.
Any valid reason? I'd say "No"--I'm a database guy--but there are places that insist on using the database as a dumb data store. They usually implement all integrity "constraints" in application code.
Putting integrity constraints into application code isn't usually done to improve performance. In fact, if you built one database that enforces all the known constraints, and you built another with functionally identical constraints only in application code, the first one would almost certainly run rings around the second one.
Instead, application-level constraints usually hope to increase flexibility. (And, in the process, some of the known constraints are usually dropped, which appears to improve performance.) If it becomes inconvenient to enforce certain constraints in order to bulk load some scruffy data, an application programmer can just side-step the application-level constraints for a little while, then clean up the data when it's more convenient.
I'm not a db expert but I remember a conversation with a friend who worked in the Oracle apps dept. who told me that this was done to handle emergencies. If there was a problem in some report being generated which you could fix by putting in a row, db level constraints often stand in your way. They generally implemented things like unique primary keys in the application rather than the database. It was inefficient but enough and for them and much more manageable in case of a disaster recovery scenario.
You need a primary key to enforce uniqueness for a subset of its columns (useful if you need to refer to individual rows). It also speeds up certain queries because of the index associated to it.
If you do not need that index, or that uniqueness constraint, then you may not need a primary key (the index does not come free).
An example that comes to mind are logging tables, that just record some data (that is never updated or queried for individual records).
There is a small overhead when inserting to a table with an index and you need an index if you have a primary key. Downside of course is that finding a row is very costly.

Database design: Same table structure but different table

My latest project deals with a lot of "staging" data.
Like when a customer registers, the data is stored in "customer_temp" table, and when he is verified, the data is moved to "customer" table.
Before I start shooting e-mails, go on a rampage on how I think this is wrong and you should just put a flag on the row, there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand if table 2 will store archival, non seldom used data.
But I can't understand if table 2 stores live data that can changes constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?
If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.
There seems to be several explanations about why would you want "customer_temp".
As you noted would be for archival purposes. To allow analyzing data but in that case the historical data should be aggregated according to some interesting query. However it using live data does not sound plausible
As oded noted, there could be a certain business logic that differentiates between customer and potential customer.
Or it could be a security feature which requires logging all attempts to register a customer in addition to storing approved customers.
Any time I see a permenant table names "customer_temp" I see a red flag. This typically means that someone was working through a problem as they were going along and didn't think ahead about it.
As for the structure you describe there are some advantages. For example the tables could be indexed differently or placed on different File locations for performance.
But typically these advantages aren't worth the cost cost of keeping the structures in synch for changes (adding a column to different tables searching for two sets of dependencies etc. )
If you really need them to be treated differently then its better to handle that by adding a layer of abstraction with a view rather than creating two separate models.
I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For, example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table that are permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled, and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.

Best-performing method for associating arbitrary key/value pairs with a table row in a Postgres DB?

I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data on a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?
That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
Indexes support "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.
hstore is generally good solution for that, but personally I prefer to use plain key:value tables. One table with definitions, other table with values and relation to bind values to definition, and relation to bind values to particular record in other table.
Why I'm against hstore? Because it's like a registry pattern. Often mentioned as example of anti pattern. You can put anything there, it's hard to easy validate if it's still needed, when loading a whole row (in ORM especially), the whole hstore is loaded which can have much junk and very little sense. Not mentioning that there is need to convert hstore data type into your language type and convert back again when saved. So you get some overhead of type conversion.
So actually I'm trying to convert all hstores in company I'm working for into simple key:value tables. It's not that hard task though, because structures kept here in hstore are huge (or at least big), and reading/writing an object crates huge overhead of function calls. Thus making a simple task like that "select * from base_product where id = 1;" is making a server sweat and hits performance badly. Want to point that performance issue is not because db, but because python has to convert several times results received from postgres. While key:value is not requiring such conversion.
As you do not control data then do not try to overcomplicate this.
create table sometable_attributes (
sometable_id int not null references sometable(sometable_id),
attribute_key varchar(50) not null check (length(attribute_key>0)),
attribute_value varchar(5000) not null,
primary_key(sometable_id, attribute_key)
);
This is like EAV, but without attribute_keys table, which has no added value if you do not control what will be there.
For speed you should periodically do "cluster sometable_attributes using sometable_attributes_idx", so all attributes for one row will be physically close.

Stumped and Seeking Input Re: Database Design

We have an Oracle database here that's been around for about 10 years. It's passed through a lot of hands. In the course of those years, it's grown quite large, and there are some interesting anomalies in its design that have me perplexed.
Now, I'm historically a SQL Server developer. I used to steam and fume about the differences between The Microsoft Way(tm) and The Oracle Way(R). Now, I realize, they're just different. I also used to yank my hair out and slam my head against the desk thinking that the people who came before me were blind, deaf mutes jacked up on Jolt and Red Bull, who wrote code in Tourette's.NET.
(Yes, I'm going somewhere.)
As time passed, I realized that neither database platform was inherently better than the other. They're just different. Further, I also realized that the developers who came before me often had compelling reasons for designing and writing things the way they did. Just because I wasn't privy to it didn't make it untrue. Sure, the documentation could have been better, but still.
So here's where all this leads me:
We have a few tables in the database that have two separate owners. Both owners define identical primary key constraints on the table. This has me perplexed. Why would a table have multiple owners? And why would each owner define separate yet identical primary keys?
These guys designed a pretty well-layed out database with lots of primary keys. But they didn't make a lot of use of indexes. When they did use indexes, they tended to make one large index instead of many distinct indexes. Is there some compelling performance gain to be had from that?
We also avoided foreign key constraints like the plague. Not sure why we would have done that. Is there a reason to avoid them in Oracle? I can see a lot of reasons to use them to enforce data integrity between tables, and we're just not using them. I'm assuming that there's a compelling reason, and I'm just not privy to it.
Finally, is there a compelling reason to avoid the use of triggers (aside from the obvious pitfall that lies in performance hits)? We don't seem to be using those much either.
For the record, we're still using Oracle 9i.
Again, thanks for your patience, everyone. I'm an old Microsoft hand, so bending my brain around the Oracle Way is challenging at times. It's a big beast, with tons to learn, and sometimes, finding that information on the Web is a chore.
Thank His Noodliness for StackOverflow.
Salient Post-Post Points
Historically, we haven't used sequences, except in very rare cases.
Historically, we haven't used stored procedures or functions, except in very rare cases.
There are some references in very old documents to ERWIN. (Thanks to the poster below for bringing it to my memory.) Chances are, the bulk of the design was the product of an ORM, and the natural design flowed from that.
The vast majority of the SQL appears hard-coded in the application, and there's a lot of it.
I'm doing everything in my power to move us away from hard-coded SQL, and to get the SQL into the database where it belongs. But I'm trying to do that in a way that makes sense, is practical, and doesn't break the business in the process. (Read: On new software only.)
We have a few tables in the database that have two separate owners. Both owners define identical primary key constraints on the table. This has me perplexed. Why would a table have multiple owners? And why would each owner define separate yet identical primary keys?
You cannot define two PRIMARY KEY's on one table in Oracle. You can define one PRIMARY KEY and one UNIQUE key on the same column set. I can see no point in such a design.
These guys designed a pretty well-layed out database with lots of primary keys. But they didn't make a lot of use of indexes. When they did use indexes, they tended to make one large index instead of many distinct indexes. Is there some compelling performance gain to be had from that?
In Oracle, an index cannot be used for RANGE SCANS on something that doesn't constitute a leftmost prefix of this index.
A composite index on (col1, col2, col3) cannot be used to do a plain RANGE SCAN on col2 alone or col3 alone.
We also avoided foreign key constraints like the plague. Not sure why we would have done that. Is there a reason to avoid them in Oracle? I can see a lot of reasons to use them to enforce data integrity between tables, and we're just not using them. I'm assuming that there's a compelling reason, and I'm just not privy to it.
If you make all interaction with the database through a set of well-defined procedures, a MERGE statement can yield far better performance than a FOREIGN KEY with ON DELETE CASCADE. You, though, should be very very careful and get used to this programming paradigma.
Finally, is there a compelling reason to avoid the use of triggers (aside from the obvious pitfall that lies in performance hits)? We don't seem to be using those much either.
I personally don't use triggers at all. Not every business rule can be expressed in terms of cascading inserts or updates, and any two-pass DML operation will lead to mutating tables. If all interaction with the database is done via stored procedures (or packages), triggers become useless.
Using triggers means in fact using SQL statements inside CURSOR loops, which every SQL cheechako knows to be a bad thing.
You don't want to be seen using cursors instead of set-based operations, do you?
FOREIGN KEY's are not as bad as triggers (as long as you don't define CASCADE operations on them), since they just don't let you do wrong things at the expense of some performance loss.
But when your database grows large, you will notice that the rules for integrity checking are far more complex than just verifying that the values being inserted into one table exist in another one.
You will have to check newly inserted values against aggregates, complex joins, etc., and all will checks will imply having a corresponding value in other table, and failing these checks compromises your database integrity just as good as violating the FOREIGN KEY's
So it will turn out that these FOREIGN KEY's are double and triple checked anyway, and there is no point to keep data integrity rules scattered all around the database rather than having them in one place (a stored procedure that is always used for updating the data).
How can the same table belong to two schemas. It doesn't make any sense.
That given there is nothing inherently bad practice in the questions you have asked.
I develop a large .net application with Oracle database and we have an excellent Oracle DBA in our team. We have used Foreign key constraints wherever possible for data integrity. Triggers are used only to get a new value from sequence or for auditing purpose and not for any business logic. We have used multicolumn unique indexes for data integrity and single column non-unique indexes.
"In Oracle, an index cannot be used for RANGE SCANS on something that doesn't constitute a leftmost prefix."
I believe this is not true anymore since Oracle 10g.
"When they did use indexes, they tended to make one large index instead of many distinct indexes. Is there some compelling performance gain to be had from that?"
You create indexes to speed up queries. If you query on "surname = 'Smith' and given_name = 'john'", then it is better to have a single index on (surname, given_name) than two separate indexes.
If no-one is complaining about performance, you probably don't need to worry about indexes.
Lots of primary keys.
We also avoided foreign key constraints.
Avoid the use of triggers.
Sounds like they used an ORM to fetch objects out of the database. That means fewer ultra-complex joins and SELECT statements and more simple SELECTS. It means constraints in the code, not the database. Similarly, "trigger"-like behavior is in the code.
Doesn't sound Oracle-specific. Sounds like the application has an ORM.
A lot of people, including me, don't like triggers because it makes it a lot harder to troubleshoot.
This pretty much sums up my opinion
I did Oracle database design for a large organization, and we used triggers as much as we could due to the fact that we had business rules that had to be enforced when data was coming from several directions (the application's GUI, and SQL scripts used for data migration). The business rules we enforced were pretty simple (date checking, checking for existence of rows in another table, etc...). If we tried to make them to complex, we got the dreaded "mutating table" error, which basically means you're trying to inspect the table that is currently changing. So triggers can be useful in some situations, but can cause headaches.
As far as indexes go, in my opinion it is -very- important to have indexes on the columns that are used for joining tables together. That's an easy way to increase performance.
About the foreign keys: since the database changed hands so much, I wonder if the foreign keys could have been dropped accidentally, somewhere along the line. I used PL-SQL developer and some seemingly-innocent operations (like adding/removing a column I think, but I'm not sure) caused the foreign keys to all be deleted.
They may have avoided using foreign constraints for performance. I'm told it can be very slow. They also make it difficult to bulk load data which may be inaccurate when loaded but will be corrected programatically.
"We have a few tables in the database that have two separate owners. Both owners define identical primary key constraints on the table. This has me perplexed. Why would a table have multiple owners? And why would each owner define separate yet identical primary keys?"
A SQL Server database corresponds more to an Oracle user/schema. So you can have multiple tables in the same Oracle database belonging to different schemas/users. These are DIFFERENT tables (ie with different data inside, and potentially different columns/indexes...).
Sometimes bits of a business want a snaphot of the data (eg at month or year end). Sometimes, before a datafix, a DBA will create a copy of a table (possibly with a different name or in a different schema) just in case the datafix goes horribly wrong.
Either way, where you have copies of a table, one is probably out of date (intentionally).
Assuming that you are not in a data warehousing situation here -
Foreign keys ensure referential integrity and are absolutely vital. I can't think of a situation when you would not want them.
Indexes again are very important tools to ensure query performance.
Not sure why they would define PKs without Indexes - PKs are usually implemented via a unique index.
Using large indexes, I assume you mean indexes that compound multiple columns
Using ERWIN-engineered Oracle database need not result in such a design - so what you have is not an ERWIN artifact.
If I had to hazard a guess - I am thinking the designer was overly, un-necessarily trying to design for performance - he avoided indexes for update performance, he also avoided FK constraints for a similar 'imagined' performance.
Unless the database is being used for a unique kind of application in a very special way, there really is no grounds for omitting FKs, and Indices.
Regarding triggers, other posters have already weighed in - triggers will be useful for capturing business rules in one central-place (same for Stored Procedures - good for encapsulating Business Logic).

Resources