There are almost 3 million financial transaction records in my database. These records are loaded from external files containing the following fields, which are mapped to the table's columns:
Account, Date, Amount, Particulars/Description/Details/Narration
Now there is a need to maintain uniqueness across the already loaded and future records.
Since there was no uniqueness in the external files that have already been loaded, I think we have to update the existing records by building a unique key from the given fields; however, it is quite clear that the fields in an external file may contain duplicates.
How can we maintain uniqueness so that we can tell whether a transaction from a file has already been loaded? All suggestions are welcome.
Edit 1
The currently loaded records are confirmed to be valid; the need to maintain uniqueness has only just come up because some missing records from older files (or entire missing files) now have to be loaded.
Edit 2
Existing records may contain duplicates based on the given 4 fields, i.e. the same values for Account, Date, Amount and Particulars for two or more valid transactions, but it is certain that these records are valid even with duplicate values.
Now, when loading the missing records, we need to identify whether a record is already loaded so that we don't load it twice. To me it looks very hard to determine whether a record is already loaded based on these fields alone; I see it as beyond what these fields can tell us.
Edit 3
The situation has changed and this is no longer an open question, but it seems better to keep it here for others. It has been agreed to add a unique key to the records and to check against this key for duplicates.
Note - following some clarification from the OP this answer is not relevant to their scenario. The problem is a political or business problem rather than a technical one. I will leave this answer as a solution to a hypothetical question because it may still be of use to some future seekers.
My other response addresses the OP's actual situation.
It seems like you need a compound unique key:
alter table your_table add constraint your_table_uk
unique (Account, Date, Amount, Particulars)
using index
particulars seems a bit woolly as a source of uniqueness, but presumably an account can have more than one transaction for the same amount on any given day, so you need all four columns to guarantee uniqueness of the row.
Or perhaps, as #ypercube suggests, only (Account, Date, Particulars) are necessary.
I have suggested a unique key rather than a primary key constraint because composite primary keys are bad news when it comes to enforcing foreign keys. In this case I would suggest you add a synthetic primary key, populated with a sequence.
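For illustration, a minimal sketch of that synthetic key (the table and column names here are placeholders, not taken from the question):

-- add a surrogate key column and populate it from a sequence
alter table your_table add (txn_id number);

create sequence your_table_seq;

update your_table
   set txn_id = your_table_seq.nextval;

alter table your_table add constraint your_table_pk primary key (txn_id);

(On 12c and later an identity column could serve the same purpose as the manual sequence.)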
You say the loaded records have proven validity, but if that is not the case, change the ALTER TABLE statement to use the EXCEPTIONS INTO clause to find the duplicated rows. You will need a special table to capture the constraint violations. Find out more.
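A rough sketch of how that might look (the exceptions table matches the one created by Oracle's utlexcpt.sql script; the transaction table and column names are placeholders, and I've used txn_date rather than Date because DATE is a reserved word):

-- exceptions table, as created by $ORACLE_HOME/rdbms/admin/utlexcpt.sql
create table exceptions (
    row_id     rowid,
    owner      varchar2(30),
    table_name varchar2(30),
    constraint varchar2(30)
);

-- the ALTER TABLE still fails if duplicates exist, but the offending
-- rowids are written to the EXCEPTIONS table
alter table your_table add constraint your_table_uk
    unique (account, txn_date, amount, particulars)
    exceptions into exceptions;

-- inspect the duplicated rows
select *
  from your_table
 where rowid in (select row_id from exceptions);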
"Existing records may have duplicate records based on given 4 fields
i.e. same values for Account, Date, Amount and Particulars for two or
more valid transactions, but it is sure that these records are valid
even with duplicate values."
But how can anybody tell, if there is no token of uniqueness in the loaded data or the source files? What does validity even mean?
"Now for loading missing records we need to identify if a record is
already loaded or not so that we don't load a record which is already
loaded."
Without an existing source of uniqueness you cannot do this. If you have two rows for a given combination of (Account, Date, Amount, Particulars) and that's okay, what are the rules for determining whether a third instance of (Account, Date, Amount, Particulars) is a record which has already been loaded, hence invalid, or a record which has not been loaded, hence valid?
"So, to me, it looks very hard to know if a record is already loaded
based on these fields. I see it as beyond the limits of these fields"
You're right to say that the solution cannot be found in the data as you describe it. But the solution is actually very simple. You go to the people who have asserted the validity of the loaded records and present them with a list of these additional records. They'll be able to use their skill and judgement to tell you which records are valid, and you load those.
" it is my duty to find the solution"
No it is not your duty. Right now the duty lies on the shoulders of the data owner to define their data set accurately, and that includes identifying a business key. They are the ones abrogating their responsibilities.
Under the circumstances you have three choices:
Refuse to load any further records until the data owner does their duty.
Load all the records presented to you for loading, without any validation.
Use the horrible NOVALIDATE syntax.
NOVALIDATE is a way of enforcing validation rules for future rows but ignoring violations in the existing data. Basically it's a technical kludge for a political problem.
SQL> select * from t23
/
      COL1 COL2
---------- --------------------
         1 MR KNOX
         1 MR KNOX
         2 FOX IN SOCKS
         2 FOX IN SOCKS
SQL> create index t23_idx on t23(col1,col2)
/
Index created.
SQL> alter table t23 add constraint t23_uk
unique (col1,col2) novalidate
/
Table altered.
SQL> insert into t23 values (2, 'FOX IN SOCKS')
/
insert into t23 values (2, 'FOX IN SOCKS')
*
ERROR at line 1:
ORA-00001: unique constraint (APC.T23_UK) violated
SQL>
Note that you need to pre-create a non-unique index before adding the constraint. If you don't do that the database will build a unique index and that will override the NOVALIDATE clause.
I describe the NOVALIDATE approach as horrible because it is. It bakes data corruption into the database. But it is the closest thing you'll get to a solution.
This approach completely ignores the notion of "validity". So it will reject records which perhaps should have loaded because they represent a "valid" nth occurrence of (Account, Date, Amount, Particulars). This is unavoidable. The good news is, nobody will be able to tell, because there are no defined rules for establishing validity.
Whatever option you choose, it is crucial that you explain it clearly to your boss, the data owner, the data owner's boss and whoever else you think fit, and get their written assent to go ahead. Otherwise, sometime down the line people will discover that the database is full of duplicate rows or somebody will complain that a "valid" record hasn't been loaded, and it will all be your fault ... unless you have a signed piece of paper with authorisation from the appropriate top brass.
Good luck
Haki's suggestion of using MERGE has the same effect as NOVALIDATE, because it would load new records and suppress all duplicates. However, it is even more of a kludge: it doesn't address the notion of uniqueness at all. Anybody who had INSERT or UPDATE access would still be able to add any rows they liked. So this approach would only work if you could completely lock down privileges on that table so that its data can only be manipulated through MERGE and no other DML. It depends whether ongoing uniqueness matters. Again, a business decision.
Sounds like you need an upsert - or, as Oracle calls it, MERGE.
A MERGE operation between two tables allows you to handle two common situations:
The record already exists in the target table and I need to do something with it - either update it or do nothing.
The record does not exist in the target table - insert it.
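For the transaction table in the question, a hedged sketch of such a MERGE might look like this (table and column names are assumptions, the incoming file is assumed to be loaded into a staging table first, and Date is renamed txn_date because DATE is a reserved word):

-- insert only those staged rows not already present in the target;
-- rows that already exist are simply skipped
merge into transactions t
using staging_transactions s
   on (    t.account     = s.account
       and t.txn_date    = s.txn_date
       and t.amount      = s.amount
       and t.particulars = s.particulars)
 when not matched then
      insert (account, txn_date, amount, particulars)
      values (s.account, s.txn_date, s.amount, s.particulars);

Note that, like NOVALIDATE, this silently skips any "valid" second occurrence of the same four values.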
I'm updating a table that was originally poorly designed. The table currently has a primary key that is the name of the vendor. This serves as a foreign key to many other tables. This has led to issues with the Vendor name initially being entered incorrectly or with typos that need to be fixed. Since it's the foreign key to relationships, this is more complicated than it's worth.
Current Schema:
Vendor_name(pk) Vendor_contact comments
Desired Schema:
id(pk) Vendor_name Vendor_contact comments
I want to update the primary key to be an auto-generated numeric key. The vendor name field needs to persist but no longer be the key. I'll also need to update the value of the foreign key on other tables and on join tables.
Is the best way to do this to create a new numeric id column on my Vendor table, crosswalk the id to the vendor names, add a new foreign key column for that id on the referencing tables, drop the vendor-name foreign key on those tables (per this post), and then somehow mark the id as the primary key and unmark the vendor name?
Or is there a more streamlined way of doing this that isn't so broken out?
It's important to note that only 5 users can access this table so I can easily shut them out for a period of time while these updates are made - that's not an issue.
I'm working with SQLDeveloper and Python/Django.
The biggest problem you have is all the application code which references VENDOR_NAME in the dependent tables. Not just using it to join to the parent table, but also relying on it to display the name without joining to VENDOR.
So, although having a natural key as a foreign key is a PITN, changing this situation is likely to generate a whole lot of work, with a marginal overall benefit. Be sure to get buy-in from all the stakeholders before starting out.
The way I would approach it is this:
Do a really thorough impact analysis
Ensure you have complete regression tests for all the functions which rely on the Vendor data
Create VENDOR_ID as a unique key on VENDOR
Add VENDOR_ID to all the dependent tables
Create a second foreign key on all the dependent tables referencing VENDOR_ID
Ensure that the VENDOR_ID is populated whenever the VENDOR_NAME is.
That last point can be tackled either by fixing the insert and update statements on the dependent tables, or with triggers. Which approach you take will depend on your application design and also the number of tables involved. Obviously you want to avoid the performance hit of all those triggers if you can.
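A minimal sketch of the middle steps for a single hypothetical dependent table (PURCHASE_ORDERS and all column names here are invented for illustration):

create sequence vendor_seq;

-- create VENDOR_ID as a unique key on VENDOR
alter table vendor add (vendor_id number);
update vendor set vendor_id = vendor_seq.nextval;
alter table vendor add constraint vendor_id_uk unique (vendor_id);

-- add VENDOR_ID to a dependent table and backfill it from the natural key
alter table purchase_orders add (vendor_id number);
update purchase_orders po
   set vendor_id = (select v.vendor_id
                      from vendor v
                     where v.vendor_name = po.vendor_name);

-- second foreign key alongside the existing VENDOR_NAME one
alter table purchase_orders add constraint po_vendor_id_fk
    foreign key (vendor_id) references vendor (vendor_id);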
At this point you have an infrastructure which will support the new primary key but which still uses the old one. Why would you want to do this? Because you could go into Production like this without changing the application code. It gives you the option to move the application code to use VENDOR_ID across a broader time frame. Obviously, if developers have been keen on coding SELECT * FROM you will have issues that need addressing immediately.
Once you've fixed all the code you can drop VENDOR_NAME from all the dependent tables, and switch VENDOR_NAME to unique key and VENDOR_ID to primary key on the master table.
If you're on 11g you should check out Edition-Based Redefinition. It's designed to make this sort of exercise an awful lot easier. Find out more.
I would do it this way:
create your new sequence
create table temp as select your_sequence.nextval as id, vendor_name, vendor_contact, comments from vendor.
rename the original table to something like vendor_old
add the primary key and other constraints to the new table
rename the new table to the old name
Testing is essential and you must ensure no one is working on the database except you when this is done.
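For illustration, one way these steps might look in SQL (all names and column sizes are placeholders; this sketch creates the new table first and fills it with INSERT ... SELECT):

-- new sequence for the surrogate key
create sequence vendor_seq;

-- build the replacement table and populate it
create table vendor_tmp (
    id             integer,
    vendor_name    varchar2(200 char),
    vendor_contact varchar2(200 char),
    comments       varchar2(4000 char)
);

insert into vendor_tmp (id, vendor_name, vendor_contact, comments)
select vendor_seq.nextval, vendor_name, vendor_contact, comments
  from vendor;

-- swap the tables and add the constraints
alter table vendor rename to vendor_old;
alter table vendor_tmp add constraint vendor_pk primary key (id);
alter table vendor_tmp rename to vendor;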
I have a database which I've opened in phpMyAdmin. I clicked the "Insert" button, which has an icon showing one row being inserted between two others.
When I actually try to insert a row, I get the following error:
1062 - Duplicate entry '294' for key 'PRIMARY'
How do I get phpMyAdmin to insert a row (presumably by increasing all the higher-numbered rows by 1) as the icon and the term "Insert" implies? It only seems to want to "Add" a row to the end, not "Insert" it.
As I said, the icon specifically shows one row being inserted between two others, and this is what I want to do. How do I get it to do what it claims it will do?
First, "INSERT" is standard SQL terminology for putting something in the database; it doesn't specifically mean "putting it between two existing values". I see how the icon can be a bit confusing, but when "insertting" data there is no difference between putting something at the end or in the middle of the database. For that matter, there's no real inherent order to data stored in a database; you can select many different ways to sort it when you display the data (and phpMyAdmin generally does a good job of guessing what's reasonable), but data just exists. You can select to sort it by the primary key or alphabetically by user name or any means you wish.
Second, your primary key shouldn't change. It's the key that holds your data together; if you start changing that your references from other tables will be messed up (see below). So don't change that.
Third, if you have your primary key set up with auto_increment (the A_I checkbox in phpMyAdmin), then you shouldn't ever need to set it or worry about it yourself. It's all managed by MySQL. If you aren't happy with the order and want to move 294 to 295 so you can insert something else at 294, then your database design needs tweaking because that's not how auto_incrementing primary keys are designed to work. As a simple solution, you may wish to create another field called "sort_value" or something that you can change.
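If you go down that route, a minimal sketch might look like this (MySQL; sort_value and id are invented names for illustration):

-- leave the primary key alone and add a separate ordering column
alter table your_table add column sort_value int;

-- seed it from the existing key, then renumber freely
update your_table set sort_value = id;
update your_table set sort_value = sort_value + 1 where sort_value >= 294;

-- display in whatever order you want
select * from your_table order by sort_value;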
Which all brings me to the root cause of your trouble: you're trying to create a new row while reusing an existing auto_increment value, and MySQL is smart enough to know this is a bad idea.
So as I said above, changing your primary key (whether or not it's auto-generated) is a bad idea, but it may not be obvious why if you only have one table. Relational databases are designed so that you can reference tables from other tables; for instance, a customer database might have tables for "customers", "products", and "purchases", where the purchases table references the primary key IDs from both customers and products. Imagine the carnage your data would see if you then changed the value of those keys in the customer table: you'd show customers associated with some other customer's purchases. So it might not make sense in your database, but overall that's the best way to handle things.
If you really, really don't want to change your database structure, don't reference that key from any other tables, and don't want to listen to my advice, you should be able to simply turn off the auto_increment function on your primary key and reorder them however you wish.
My question may seem rather general, but the only answer I have got so far is from SO itself. I have a customer information table with 47 fields in it. Some of the fields are optional. I would like to split that table into two: customer_info and customer_additional_info. One of its columns stores a file in byte format. Is there any advantage in splitting the table? I have seen that the JOIN will slow down query execution. Can I have more pros and cons of splitting a table into two?
I don't see much advantage in splitting the table unless some of the columns are very infrequently accessed and fairly large. There's a theoretical advantage to keeping rows small as you're going to get more of them in a cached block, and you improve the efficiency of a full table scan and of the buffer cache. Based on that I'd be wary of storing this file column in the customer table if it was more than a very small size.
Other than that, I'd keep it in a single table.
I can think of only 2 arguments in favor of splitting the table:
If all the columns in customer_additional_info are related, you could potentially get the benefit of additional declarative data integrity that you couldn't get with a single table. For instance, let's say your additional table was CustomerAddress. Your business logic may dictate that a customer address is optional, but that once you have a customer Zip code, the AddressL1, City and State become required fields. You could set these columns to NOT NULL in a CustomerAddress table. You couldn't do that if they existed directly in the customer table.
If you were doing some object-relational mapping and had a Customer class with many subclasses, and you didn't want to use Single Table Inheritance. Sometimes STI creates problems when similar properties of various subclasses require different storage layouts. Because all subclasses have to use the same table, you might get name clashes. The alternative is Class Table Inheritance, where you have a table for the superclass and an additional table for each subclass. This is a similar scenario to the one you described in your question.
As for cons, the join makes things harder and slower. You also run the risk of accidentally creating a 1-to-many relationship, i.e. you create 2 addresses in the CustomerAddress table and now you don't know which one is valid.
EDIT:
Let me explain the declarative ref integrity point further.
If your business rules are such that a customer address is optional, and you embed AddressL1, AddressL2, City, State, and Zip in your customer table, you would need to make each of these fields nullable. That would allow someone to insert a customer with a City but no State. You could write a table-level check constraint to cover this situation, but that isn't as easy as simply making the AddressL1, City, State and Zip columns in a CustomerAddress table not nullable. To be clear, I am NOT advocating the multi-table approach; you asked for pros and cons, and I'm just pointing out that this aspect falls on the pro side of the ledger.
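A small sketch of that contrast (all names and sizes are illustrative, not from the question, and a CUSTOMER table with a CUSTOMER_ID key is assumed to exist):

-- embedded in CUSTOMER, every address column must be nullable because the
-- address as a whole is optional; in a child table the row is optional
-- but its columns need not be
create table customer_address (
    customer_id integer            not null references customer (customer_id),
    address_l1  varchar2(100 char) not null,
    city        varchar2(60 char)  not null,
    state       varchar2(30 char)  not null,
    zip         varchar2(10 char)  not null,
    constraint customer_address_pk primary key (customer_id)
);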
I second what David Aldridge said, I'd just like to add a point about the file column (presumably BLOB)...
BLOBs are stored up to approx. 4000 bytes in-line1. If a BLOB is used rarely, you can specify DISABLE STORAGE IN ROW to store it out-of-line, removing the "cache pollution" without the need to split the table.
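For example, something along these lines (table and column names are placeholders; check the exact clause for your Oracle version):

-- rebuild the LOB segment so the BLOB is always stored out of line
alter table customer_info move lob (file_content) store as (disable storage in row);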
But whatever you do, measure the effects on realistic amounts of data before you make the final decision.
1 That is, in the row itself.
Background: http://jeffkemponoracle.com/2011/03/11/handling-unique-constraint-violations-by-hibernate
Our table is:
BOND_PAYMENTS (BOND_PAYMENT_ID, BOND_NUMBER, PAYMENT_ID)
There is a Primary key constraint on BOND_PAYMENT_ID, and a Unique constraint on (BOND_NUMBER, PAYMENT_ID).
The application uses Hibernate, and allows a user to view all the Payments linked to a particular Bond; and it allows them to create new links, and delete existing links. Once they’ve made all their desired changes on the page, they hit “Save”, and Hibernate does its magic to run the required SQL on the database. Apparently, Hibernate works out which records need to be deleted, which need to be inserted, and leaves the rest untouched. Unfortunately, it does the INSERTs first, then it does the DELETEs.
If the user deletes a link to a payment, then changes their mind and re-inserts a link to the same payment, Hibernate quite happily tries to insert it then delete it. Since these inserts/deletes are running as separate SQL statements, Oracle validates the constraint immediately on the first insert and issues ORA-00001 unique constraint violated.
We know of only two options:
Make the constraint deferrable
Remove the unique constraint
Option 2 is not very palatable, because the constraint provides excellent protection from nasty application bugs that might allow inconsistent data to be saved. We went with option 1.
ALTER TABLE bond_payments ADD
CONSTRAINT bond_payment_uk UNIQUE (bond_number, payment_id)
DEFERRABLE INITIALLY DEFERRED;
The downside is that the index created to police this constraint is now a non-unique index, so may be somewhat less efficient for queries. We have decided this is not as great a detriment for this particular case. Another downside (advised by Gary) is that it may suffer from a particular Oracle bug - although I believe we will be immune (at least, mostly) due to the way the application works.
Are there any other options we should consider?
From the problem you described, it's not clear if you have an entity BondPayment or if you have a Bond linked directly to a Payment. For now, I suppose you have the link between Payment and Bond through BondPayment. In this case, Hibernate is doing the right thing, and you'll need to add some logic in your app to retrieve the link and remove it (or change it). Something like this:
bond.getBondPayment().setPayment(newPayment);
You are probably doing something like this:
BondPayment bondPayment = new BondPayment();
bondPayment.setPayment(newPayment);
bondPayment.setBond(bond);
bond.setBondPayment(bondPayment);
In the first case, the BondPayment.id is kept, and you are just changing the payment for it. In the second case, it's a brand new BondPayment, and it will conflict with an existing record in the database.
I said that Hibernate is doing the right thing because it treats BondPayment as a "regular" entity whose lifecycle is defined by your app. It's the same as having a User with a unique constraint on login, and trying to insert a second record with a duplicate login: Hibernate will accept it (it doesn't know whether the login already exists in the database) and your database will refuse it.
I am creating a laboratory database which analyzes a variety of samples from a variety of locations. Some locations want their own reference number (or other attributes) kept with the sample.
How should I represent the columns which only apply to a subset of my samples?
Option 1:
Create a separate table for each unique set of attributes?
SAMPLE_BOILER: sample_id (FK), tank_number, boiler_temp, lot_number
SAMPLE_ACID: sample_id (FK), vial_number
This option seems too tedious, especially as the system grows.
Option 1a: Class table inheritance (link): Tree with common fields in internal node/table
Option 1b: Concrete table inheritance (link): Tree with common fields in leaf node/table
Option 2: Put every attribute which applies to any sample into the SAMPLE table.
Most columns of each entry would most likely be NULL, however all of the fields are stored together.
Option 3: Create _VALUE_ tables for each Oracle data type used.
This option is far more complex. Getting all of the attributes for a sample requires accessing all of the tables below. However, the system can expand dynamically without separate tables for each new sample type.
SAMPLE:
sample_id*
sample_template_id (FK)
SAMPLE_TEMPLATE:
sample_template_id*
version*
status
date_created
name
SAMPLE_ATTR_OF:
sample_template_id* (FK)
sample_attribute_id* (FK)
SAMPLE_ATTRIBUTE:
sample_attribute_id*
name
description
SAMPLE_NUMBER:
sample_id* (FK)
sample_attribute_id (FK)
value
SAMPLE_DATE:
sample_id* (FK)
sample_attribute_id (FK)
value
Option 4: (Add your own option)
To help with Googling, your third option looks a little like the Entity-Attribute-Value pattern, which has been discussed on StackOverflow before although often critically.
As others have suggested, if at all possible (e.g. if, once the system is up and running, few new attributes will appear), you should use your relational database in a conventional manner, with tables as types and columns as attributes - your option 1. The initial setup pain will be worth it later as your database gets to work the way it was designed to.
Another thing to consider: are you tied to Oracle? If not, there are non-relational databases out there like CouchDB that aren't constrained by up-front schemas in the same way as relational databases are.
Edit: you've asked about handling new attributes under option 1 (now 1a and 1b in the question)...
If option 1 is a suitable solution, then there are sufficiently few new attributes that the overhead of altering the database schema to accommodate them is acceptable, so you'll be writing database scripts to alter tables and add columns, and the provision of a default value can be handled easily in those scripts.
Of the two option 1 variants (1a, 1b), my personal preference would be concrete table inheritance (1b):
It's the simplest thing that works;
It requires fewer joins for any given query;
Updates are simpler as you only write to one table (no FK relationship to maintain).
Either of these first options is a better solution than the others, though, and there's nothing wrong with the class table inheritance method if that's what you'd prefer.
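For illustration, a concrete table inheritance sketch using the sample types from the question (the column datatypes are assumptions):

-- each leaf table repeats the common sample fields and adds its own
create table sample_boiler (
    sample_id    integer primary key,
    name         varchar2(120 char),
    date_created date,
    status       varchar2(30 char),
    tank_number  integer,
    boiler_temp  number,
    lot_number   varchar2(30 char)
);

create table sample_acid (
    sample_id    integer primary key,
    name         varchar2(120 char),
    date_created date,
    status       varchar2(30 char),
    vial_number  varchar2(30 char)
);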
It all comes down to how often genuinely new attributes will appear.
If the answer is "rarely" then the occasional schema update can cope.
If the answer is "a lot" then the relational DB model (which has fixed schemas baked-in) isn't the best tool for the job, so solutions that incorporate it (entity-attribute-value, XML columns and so on) will always seem a little laboured.
Good luck, and let us know how you solve this problem - it's a common issue that people run into.
Option 1, except that it's not a separate table for each set of attributes: create a separate table for each sample source.
i.e. from your examples: samples from a boiler will have tank number, boiler temp, lot number; acid samples have vial number.
You say this is tedious, but I suggest that the more work you put into gathering and encoding the meaning of the data now, the bigger the dividends it will pay later - you'll save in the long term because your reports will be easier to write, understand and maintain. Those guys from the boiler room will ask "we need to know the total of X for each tank, grouped by this set of boiler temperature ranges" and you'll say "no prob, give me half an hour", because you've done the hard yards already.
Option 2 would be my fall-back option if Option 1 turns out to be overkill. You'll still want to analyse what fields are needed, what their datatypes and constraints are.
Option 4 is to use a combination of options 1 and 2. You may find some attributes are shared among a lot of sample types, and it might make sense for these attributes to live in the main sample table; whereas other attributes will be very specific to certain sample types.
You should really go with Option 1. Although it is more tedious to create, Options 2 and 3 will bite you back when you try to query your data. The queries will become more complex.
In fact, the most important part of storing the data, is querying it. You haven't mentioned how you are planning to use the data, and this is a big factor in the database design.
As far as I can see, the first option will be the easiest to query. If you plan on using reporting tools or an ORM, they will prefer it as well, so you are keeping your options open.
In fact, if you find building the tables tedious, try using an ORM from the start. Good ORMs will help you with creating the tables from the get-go.
I would base your decision on how you usually see the data changing. For instance, if you get 5-6 new attributes per day, you're never going to be able to keep up by adding new columns. In that case you should create columns for 'standard' attributes and add a key/value layout similar to your 'Option 3'.
If you don't expect to see this, I'd go with Option 1 for now, and modify your design to 'Option 3' only if you get to the point that it is turning into too much work. It could end up that you have 25 attributes added in the first few weeks and then nothing for several months. In which case you'll be glad you didn't do the extra work.
As for Option 2, I generally advise against this as Null in a relational database means the value is 'Unknown', not that it 'doesn't apply' to a specific record. Though I have disagreed on this in the past with people I generally respect, so I wouldn't start any wars over it.
Whatever you do, option 3 is horrible: every query will have to join the data back together just to reconstruct a SAMPLE.
It sounds like you have some generic SAMPLE fields which need to be joined with more specific data for the type of sample. Have you considered some user-defined fields?
Example:
SAMPLE_BASE: sample_id(PK), version, status, date_create, name, userdata1, userdata2, userdata3
SAMPLE_BOILER: sample_id (FK), tank_number, boiler_temp, lot_number
This might be a dumb question but what do you need to do with the attribute values? If you only need to display the data then just store them in one field, perhaps in XML or some serialised format.
You could always use a template table to define a sample 'type' and the available fields you display for the purposes of a data entry form.
If you need to filter on them, the only efficient model is option 2. As everyone else is saying the entity-attribute-value style of option 3 is somewhat mental and no real fun to work with. I've tried it myself in the past and once implemented I wished I hadn't bothered.
Try to design your database around how your users need to interact with it (and thus how you need to query it), rather than just modelling the data.
If the set of sample attributes was relatively static then the pragmatic solution that would make your life easier in the long run would be option #2 - these are all attributes of a SAMPLE so they should all be in the same table.
Ok - you could put together a nice object hierarchy of base attributes with various extensions but it would be more trouble than it's worth. Keep it simple. You could always put together a few views of subsets of sample attributes.
I would only go for a variant of your option #3 if the list of sample attributes was very dynamic and you needed your users to be able to create their own fields.
In terms of implementing dynamic user-defined fields then you might first like to read through Tom Kyte's comments to this question. Now, Tom can be pretty insistent in his views but I take from his comments that you have to be very sure that you really need the flexibility for your users to add fields on the fly before you go about doing it. If you really need to do it, then don't create a table for each data type - that's going too far - just store everything in a varchar2 in a standard way and flag each attribute with an appropriate data type.
create table sample (
sample_id integer,
name varchar2(120 char),
constraint pk_sample primary key (sample_id)
);
create table attribute (
attribute_id integer,
name varchar2(120 char) not null,
data_type varchar2(30 char) not null,
constraint pk_attribute primary key (attribute_id)
);
create table sample_attribute (
sample_id integer,
attribute_id integer,
value varchar2(4000 char),
constraint pk_sample_attribute primary key (sample_id, attribute_id)
);
Now... that just looks evil doesn't it? Do you really want to go there?
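To make the point concrete, here is what reading a couple of attributes back out of that model looks like - a join plus a manual pivot for every query (the attribute names are invented):

select s.name,
       max(case when a.name = 'tank_number' then sa.value end) as tank_number,
       max(case when a.name = 'boiler_temp' then sa.value end) as boiler_temp
  from sample s
  join sample_attribute sa on sa.sample_id = s.sample_id
  join attribute a on a.attribute_id = sa.attribute_id
 group by s.sample_id, s.name;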
I work on both a commercial and a home-made system where users have the ability to create their own fields/controls dynamically. This is a simplified version of how it works.
Tables:
Pages
Controls
Values
A page is just a container for one or more controls. It can be given a name.
Controls are linked to pages and represents user input controls.
A control contains what datatype it is (int, string etc) and how it should be represented to the user (textbox, dropdown, checkboxes etc).
Values are the actual data that the users have typed into the controls, a value contains one column for every datatype that it can represent (int, string, etc) and depending on the control type, the relevant column is set with the user input.
There is an additional column in Values which specifies which group the value belong to.
Each time a user fills in a form of controls and clicks save, the values typed into the controls are saved into the same group so that we know that they belong together (incremental counter).
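A rough sketch of that layout (every name and datatype below is an assumption based on the description, not taken from the actual system):

create table pages (
    page_id integer primary key,
    name    varchar2(120 char)
);

create table controls (
    control_id   integer primary key,
    page_id      integer references pages (page_id),
    name         varchar2(120 char),
    data_type    varchar2(30 char),    -- int, string, date ...
    control_type varchar2(30 char)     -- textbox, dropdown, checkbox ...
);

create table control_values (
    value_id     integer primary key,
    control_id   integer references controls (control_id),
    group_id     integer,              -- one "save" of a form's controls
    int_value    integer,
    string_value varchar2(4000 char),
    date_value   date
);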
CodeSpeaker,
I like your answer; it's pointing me in the right direction for a similar problem.
But how would you handle drop-down list values?
I am thinking of a lookup table of values so that many lookups link to one UserDefinedField.
But I also have another problem to add to the mix: each field must have multiple linked languages, so each value must link to the equivalent value for each of those languages.
Maybe I'm thinking too hard about this, as I've got about 6 tables so far.