Here's another one I've been thinking about lately.
We concluded in earlier discussions: 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier, I saw that by default it creates one sequence for all tables. At first I was puzzled by this: why would you do that? But later I saw the advantage: it makes linking parents and children foolproof. Because no two tables share a primary key value, accidentally linking a parent with a table that is not its child gives no results.
Does anyone see any downsides to this approach? I only see one: you cannot have more than 999999999999999999999999999 records in your database.
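To make the "foolproof linking" point concrete, here is a minimal sketch in plain Python. It is not how Hibernate or Oracle implement sequences: `itertools.count` stands in for a database sequence, the two dicts stand in for tables, and the names `make_tables`, `customers` and `orders` are made up for illustration.

```python
import itertools


def make_tables(shared_sequence: bool):
    """Build two toy 'tables' (dicts keyed by primary key).

    itertools.count is a stand-in for a database sequence (an assumption
    for illustration only).
    """
    if shared_sequence:
        # One sequence feeds every table, so no key value is reused.
        customer_seq = order_seq = itertools.count(1)
    else:
        # One sequence per table, so both tables start at 1.
        customer_seq = itertools.count(1)
        order_seq = itertools.count(1)
    customers = {next(customer_seq): name for name in ("alice", "bob")}
    orders = {next(order_seq): item for item in ("book", "lamp")}
    return customers, orders


# Mistake under test: looking up a customer id in the orders table.
customers, orders = make_tables(shared_sequence=True)
wrong_lookup_shared = orders.get(1)      # nothing comes back

customers, orders = make_tables(shared_sequence=False)
wrong_lookup_per_table = orders.get(1)   # an unrelated row comes back
```

With the shared sequence the mistaken lookup finds nothing, so the bug is visible immediately. With per-table sequences it silently returns an unrelated row.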
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only one or a few threads request new values, there will be no locking issues. But a bad implementation could cause congestion.
Another problem is rolling back transactions: Sequences don't get rolled back (because someone else might have requested a higher value already), so you can have large gaps which will eat your number space much more quickly than you might expect. OTOH, it will take some time to eat 2 or 4 billion IDs (if you "only" use 32 bit (signed) ints), so it's rarely an issue in practice.
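How long "some time" is depends entirely on the allocation rate, so it can be worth doing the arithmetic for your own system. The sketch below assumes a constant rate; the rates used are hypothetical, and every value drawn from the sequence counts, including those lost to rolled-back transactions.

```python
SECONDS_PER_DAY = 86_400

INT32_MAX = 2**31 - 1   # largest signed 32-bit ID
INT64_MAX = 2**63 - 1   # largest signed 64-bit ID


def days_until_exhausted(max_id: int, ids_per_second: float) -> float:
    """Days until a sequence starting near zero reaches max_id.

    ids_per_second is the total rate at which values are drawn from the
    sequence across all tables, gaps from rollbacks included.
    """
    return max_id / ids_per_second / SECONDS_PER_DAY


# Hypothetical rates, for illustration only.
days_32bit = days_until_exhausted(INT32_MAX, 100)        # 100 IDs/second
days_64bit = days_until_exhausted(INT64_MAX, 1_000_000)  # 1M IDs/second
```

At a steady 100 IDs per second, a signed 32-bit space lasts roughly 249 days; a 64-bit space at a million IDs per second lasts on the order of hundreds of thousands of years.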
Lastly, you can't easily reset the sequence if you have to. But if you need to have a restarting sequence (say, number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling. If you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables when you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much more slowly.
I see little value in having only one sequence for all tables. The arguments made so far do not convince me.
There are a couple of disadvantages to using a single sequence:
Reduced concurrency. Handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem.
Oracle has special code for maintaining B-tree indexes that detects monotonically increasing values and balances the tree appropriately.
The CBO might have a better time estimating range queries on the index (if you ever ran them) if most values were filled in.
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally I find the ability to assign a truly unique identifier to a row, in effect making each id column a database-wide unique identifier, to be enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches INSERT statements, this will not be a performance bottleneck unless a massive number of records are vying for the same db resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails, though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored
We need to stress test our Oracle database with about 5 million row inserts. According to our DBA, the only columns that need to be different are the Primary or foreign key...all other columns can be the same. He said if we do that, then Oracle will not do any sort of caching when inserting the data.
I just want to make sure that he is right and that by doing this, the stress testing results would be nearly as accurate as using random data. Thank you for your help.
In a very narrow set of circumstances, the DBA is correct. If ALL your queries are lookups based upon primary and foreign keys, then he may be right. In the past, when the rule-based optimizer was king, the data didn't matter so much. Record counts, yes, but not really the data.
In the real world, though, this is not the case. Do you have any other indexes? Then the data matters. Do you join against things other than primary/foreign keys? Then the data matters. Are your strings all 1 byte or null? I doubt it, and the size of these variable-length fields may affect the amount of IO. Basically, for any non-trivial schema in a non-trivial application, having "realistic" data can be significant. The Oracle optimizer takes into account a large variety of statistics when determining how to perform a query.
Are you REALLY only doing inserts in this load test? That's kinda silly. 5 million records is chump change by modern standards. Desktops do that in seconds, typically. Even simple applications will perform some select to do a lookup, or get a set of records based upon a non-key value.
You seem to be smart enough to evaluate the DBA's statement. If you can get him to put that in writing, sign off on it, and have the responsibility fall on him when his idea of a load test doesn't work as expected, then that's great. It sounds like you're the one responsible for this test, though.
If I were in your shoes, I would want to load test with the most accurate data possible. Copying from a production system or known test set of data is a much better option than "random" and light-years better than "nulls except for the primary key" approach.
What happens when more than one user inserts data into a database (MySQL, Postgres) at exactly the same time? How does it decide which record is inserted first and which one later? If the answer is specific to the application or program, I am asking with reference to web applications.
In general, two things never happen at exactly the same time. There's a queue of work and at some level one thing always happens before the other.
However, there are cases where an overall transaction may take multiple steps -- and if two of these kinds of transactions begin at nearly the same time, they may overlap in time. This can cause problems.
For example, imagine a person buys something in a shopping cart and the steps include both creating an order record for them and decrementing an inventory count. If two people begin this process at nearly the same time, they could both potentially buy the item before the inventory is decremented to show the item out of stock.
In cases where things like this can occur, Postgres (and other modern databases) provide ways for programs to protect themselves. These include both transactions and locking.
With transactions (see postgres docs here), groups of statements are run as a single unit -- and if one of the later steps fails, all steps are 'rolled back'. (For example, if decrementing inventory isn't possible because the item is now out of stock, the order creation can be rolled back.)
With locking (see postgres docs here), tables (or even individual rows in a table) are locked so that any other process wanting to access them either waits or is timed out. This would prevent two processes from updating the same data at nearly the same time.
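As a rough sketch of the shopping-cart case, the snippet below uses Python's built-in `sqlite3` module rather than Postgres so that it runs anywhere; the table names, column names and the `place_order` helper are all made up for illustration. The order is created first, and if the stock cannot be decremented the whole transaction, order included, is rolled back.

```python
import sqlite3


def place_order(conn: sqlite3.Connection, item_id: int, customer: str) -> bool:
    """Create an order and decrement stock as one unit; False if out of stock."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO orders (item_id, customer) VALUES (?, ?)",
                (item_id, customer),
            )
            # Conditional decrement: touches no row when stock is already 0.
            cur = conn.execute(
                "UPDATE inventory SET qty = qty - 1 "
                "WHERE item_id = ? AND qty > 0",
                (item_id,),
            )
            if cur.rowcount == 0:
                raise LookupError("out of stock")  # triggers the rollback
        return True
    except LookupError:
        return False


# Hypothetical schema with exactly one unit of item 1 in stock.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, qty INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item_id INTEGER, customer TEXT)"
)
conn.execute("INSERT INTO inventory VALUES (1, 1)")
conn.commit()
```

Calling `place_order(conn, 1, "alice")` and then `place_order(conn, 1, "bob")` leaves a single order in the table: the second call's order row is rolled back when the decrement finds no stock.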
In general, the vast majority of applications don't require either of these approaches. Unless you're working in an environment such as at a bank where the tables involved contain financial transactions, you probably won't have to worry about it.
It's never exactly the same time. One will happen before the other.
Which one will, unless you implement your own prioritisation mechanism, is indeterminate, and you should never rely on it.
As to what will happen, well that depends.
For two inserts to the same table, if data integrity is dependent on the order in which they are executed, your database design has a horrendous flaw.
For collisions (two updates to the same record, for instance), there are two implementations.
Pessimistic locking. Assume there will be a significant number of updates to the same data, so take a lock around it. If the lock is already held, fail the update (e.g. the second one, if the first hasn't finished) with some suitable message.
Optimistic locking. Assume collisions will rarely happen. The usual way of doing this is to add a timestamp field to the record that changes on every update. When you read the data you get the timestamp, and when you write the data you only do so if the timestamp you have matches the one that's there now, updating said timestamp as part of the write. If it does not match, you show the "Someone else has changed this data" message.
There is a compromise position, where you try to merge two updates (for instance, you change the name and I change the address). You need to really think about that though: it's messy, it gets very complicated very quickly, and getting it wrong runs a real risk of messing up the data.
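The optimistic variant described above can be sketched roughly as follows. Two things here are my own substitutions: an integer `version` column is used in place of the timestamp field (same idea, but it avoids clock-resolution problems), and Python's built-in `sqlite3` stands in for the real database. The `person` table and `update_name` helper are hypothetical.

```python
import sqlite3


def update_name(conn, person_id, new_name, version_read):
    """Write only if the row still carries the version we read earlier.

    Returns True if the update was applied, False if someone else
    changed the row in the meantime.
    """
    with conn:
        cur = conn.execute(
            "UPDATE person SET name = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_name, person_id, version_read),
        )
    # No matching row means the version moved on since we read it.
    return cur.rowcount == 1


# Hypothetical table: one row, currently at version 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, name TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO person VALUES (1, 'Ann', 1)")
conn.commit()
```

If two users both read version 1, the first `update_name(conn, 1, "Anne", 1)` succeeds and bumps the version to 2; the second user's `update_name(conn, 1, "Anna", 1)` returns False, which is the cue for the "Someone else has changed this data" message.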
People with far larger IQs than mine spend a lot of time on this stuff, personally I like to keep it like me, simple...
In a JPA/Hibernate application using native Oracle sequences, is it better to use a single global sequence like hibernate_sequence, or to define a separate sequence per table?
It seems like a single sequence is easier to maintain, but could make troubleshooting or ad hoc queries harder by making IDs longer.
Although caching alleviates it, a sequence can cause contention when multiple sessions require nextvals.
If you have one sequence serving all tables then all inserts on all tables will contend for the same sequence. If you are after performance, one sequence for each table will give less contention.
A single sequence means no matching ids in two tables. In practice I like this because you can never get something when you accidentally query the wrong table. Particularly useful with deletes. It is a little thing but I find it useful.
I would recommend you use a sequence per table. It is just a little cleaner in my book. The standard at my current place of employment is sequence per table.
I have a database (running on Postgres, to be precise) with the following structure:
user1 (schema)
|
- cars (table)
- airplanes (table, again)
...
user2
|
- cars
- airplanes
...
It's clearly not structured the way a classic relational database should be, but it "just works" as it is now. As you can see, schemas are used like primary keys to identify entries.
In terms of performance, and nothing else, is it worth rebuilding it so that it has traditional primary keys (of type varchar) and clustered indexes instead of schemas?
From a Performance Perspective, actually from any perspective surely this is a NIGHTMARE, REBUILD!
Without knowing any more about your situation, I guess the answer would be YES, this would affect performance. Ordinarily simple queries would not only be much more complicated to write and maintain, but the db would also produce query plans that were significantly more costly to execute.
Edit: I've worked with, and designed, DBs to handle a lot of data in high-workload environments (banking and medical) and I have never seen anything like it; well, not in the modern world!
So it looks like each user just has their own schema? Often large, large data sets are split up close to this (more often by customer in a lot of business scenarios). It's often a premature optimization because it introduces additional complexity to your application and a single table with a user column would scale to a reasonable number of rows.
However, whether or not you'll gain any performance from combining into a single schema really depends on whether you do many cross-user queries (in other words, queries that have to cross schemas/tables) and whether the data in each set of tables is exclusive to that user. If you're replicating data from one user's tables to another's, then you need to at least redesign those tables into a common schema.
I personally try to avoid a per-schema approach under normal circumstances (due to additional maintenance overhead and app complexity), but it has its place. And I'd hardly call this a "nightmare" unless I'm not understanding something correctly.
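For comparison, the "single table with a user column" layout might look roughly like this. It is only a sketch: it uses Python's built-in `sqlite3` instead of Postgres, and the `owner` and `model` columns, the index name and the two helper functions are invented for illustration, since the question does not show the real table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars ("
    " id INTEGER PRIMARY KEY,"
    " owner TEXT NOT NULL,"    # replaces the per-user schema name
    " model TEXT NOT NULL)"    # hypothetical payload column
)
# An index on the user column keeps per-user lookups cheap.
conn.execute("CREATE INDEX cars_owner_idx ON cars (owner)")
conn.executemany(
    "INSERT INTO cars (owner, model) VALUES (?, ?)",
    [("user1", "Golf"), ("user1", "Civic"), ("user2", "Golf")],
)
conn.commit()


def cars_for(owner):
    """Per-user query: what used to be SELECT ... FROM user1.cars."""
    rows = conn.execute(
        "SELECT model FROM cars WHERE owner = ? ORDER BY model", (owner,)
    )
    return [model for (model,) in rows]


def owners_of(model):
    """Cross-user query: one statement instead of one per schema."""
    rows = conn.execute(
        "SELECT owner FROM cars WHERE model = ? ORDER BY owner", (model,)
    )
    return [owner for (owner,) in rows]
```

The per-user query stays a simple indexed lookup, while the cross-user query no longer needs to visit every schema in turn. Whether that matters depends on how often you actually run cross-user queries.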
We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have detrimental effect on performance, and thus one of the indices needs to have the first two fields switched.
I would prefer to avoid change if it is not necessary, since they didn't back up their assertion with any facts/reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes: updating five indexes takes noticeably longer than updating just one, and you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of the indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean, you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation, having five indices with the leading columns all the same would not cause a "performance problem" beyond that which you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with recent versions of Sybase, but in general, with all SQL servers, the main (and almost the only) performance impact indexes have is on INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table per se (or the clustered index) to be updated, as well as all the indexes.
With regards to SELECT queries, having "too many" indexes may have a minor performance impact for example by introducing competing hard disk pages for cache. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, and assuming a generally monotonic progression of the date value, is a positive thing (with regard to CRUD operations), for it will keep the need for splitting/balancing the indexes to a minimum (since most inserts are at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation with it to assess performance issues in a more systematic fashion could probably be done relatively safely and easily without interfering much with production. (Unless the 10k or so records are very wide or the queries-per-second rate is high, etc.)