I am creating a laboratory database which analyzes a variety of samples from a variety of locations. Some locations want their own reference number (or other attributes) kept with the sample.
How should I represent the columns which only apply to a subset of my samples?
Option 1:
Create a separate table for each unique set of attributes?
SAMPLE_BOILER: sample_id (FK), tank_number, boiler_temp, lot_number
SAMPLE_ACID: sample_id (FK), vial_number
This option seems too tedious, especially as the system grows.
Option 1a: Class table inheritance (link): Tree with common fields in internal node/table
Option 1b: Concrete table inheritance (link): Tree with common fields in leaf node/table
Option 2: Put every attribute which applies to any sample into the SAMPLE table.
Most columns of each entry would most likely be NULL, however all of the fields are stored together.
Option 3: Create _VALUE_ tables for each Oracle data type used.
This option is far more complex. Getting all of the attributes for a sample requires accessing all of the tables below. However, the system can expand dynamically without separate tables for each new sample type.
SAMPLE:
sample_id*
sample_template_id (FK)
SAMPLE_TEMPLATE:
sample_template_id*
version *
status
date_created
name
SAMPLE_ATTR_OF
sample_template_id* (FK)
sample_attribute_id* (FK)
SAMPLE_ATTRIBUTE:
sample_attribute_id*
name
description
SAMPLE_NUMBER:
sample_id* (FK)
sample_attribute_id (FK)
value
SAMPLE_DATE:
sample_id* (FK)
sample_attribute_id (FK)
value
Option 4: (Add your own option)
To help with Googling, your third option looks a little like the Entity-Attribute-Value pattern, which has been discussed on StackOverflow before although often critically.
As others have suggested, if at all possible (eg: once the system is up and running, few new attributes will appear), you should use your relational database in a conventional manner with tables as types and columns as attributes - your option 1. The initial setup pain will be worth it later as your database gets to work the way it was designed to.
Another thing to consider: are you tied to Oracle? If not, there are non-relational databases out there like CouchDB that aren't constrained by up-front schemas in the same way as relational databases are.
Edit: you've asked about handling new attributes under option 1 (now 1a and 1b in the question)...
If option 1 is a suitable solution, there are sufficiently few new attributes that the overhead of altering the database schema to accommodate them is acceptable, so...
you'll be writing database scripts to alter tables and add columns, so the provision of a default value can be handled easily in these scripts.
Of the two 1 options (1a, 1b), my personal preference would be concrete table inheritance (1b):
It's the simplest thing that works;
It requires fewer joins for any given query;
Updates are simpler as you only write to one table (no FK relationship to maintain).
Although either of these first options is a better solution than the others, and there's nothing wrong with the class table inheritance method if that's what you'd prefer.
It all comes down to how often genuinely new attributes will appear.
If the answer is "rarely" then the occasional schema update can cope.
If the answer is "a lot" then the relational DB model (which has fixed schemas baked-in) isn't the best tool for the job, so solutions that incorporate it (entity-attribute-value, XML columns and so on) will always seem a little laboured.
Good luck, and let us know how you solve this problem - it's a common issue that people run into.
Option 1, except that it's not a separate table for each set of attributes: create a separate table for each sample source.
i.e. from your examples: samples from a boiler will have tank number, boiler temp, lot number; acid samples have vial number.
You say this is tedious; but I suggest that the more work you put into gathering and encoding the meaning of the data now will pay off huge dividends later - you'll save in the long term because your reports will be easier to write, understand and maintain. Those guys from the boiler room will ask "we need to know the total of X for tank grouped by this set of boiler temperature ranges" and you'll say "no prob, give me half an hour" because you've done the hard yards already.
Option 2 would be my fall-back option if Option 1 turns out to be overkill. You'll still want to analyse what fields are needed, what their datatypes and constraints are.
Option 4 is to use a combination of options 1 and 2. You may find some attributes are shared among a lot of sample types, and it might make sense for these attributes to live in the main sample table; whereas other attributes will be very specific to certain sample types.
You should really go with Option 1. Although it is more tedious to create, Option 2 and 3 will bite you back when trying to query you data. The queries will become more complex.
In fact, the most important part of storing the data, is querying it. You haven't mentioned how you are planning to use the data, and this is a big factor in the database design.
As far as I can see, the first option will be most easy to query. If you plan on using reporting tools or an ORM, they will prefer it as well, so you are keeping your options open.
In fact, if you find building the tables tedious, try using an ORM from the start. Good ORMs will help you with creating the tables from the get-go.
I would base your decision on the how you usually see the data. For instance, if you get 5-6 new attributes per day, you're never going to be able to keep up adding new columns. In this case you should create columns for 'standard' attributes and add a key/value layout similar to your 'Option 3'.
If you don't expect to see this, I'd go with Option 1 for now, and modify your design to 'Option 3' only if you get to the point that it is turning into too much work. It could end up that you have 25 attributes added in the first few weeks and then nothing for several months. In which case you'll be glad you didn't do the extra work.
As for Option 2, I generally advise against this as Null in a relational database means the value is 'Unknown', not that it 'doesn't apply' to a specific record. Though I have disagreed on this in the past with people I generally respect, so I wouldn't start any wars over it.
Whatever you do option 3 is horrible, every query will have join the data to create a SAMPLE.
It sounds like you have some generic SAMPLE fields which need to be join with more specific data for the type of sample. Have you considered some user_defined fields.
Example:
SAMPLE_BASE: sample_id(PK), version, status, date_create, name, userdata1, userdata2, userdata3
SAMPLE_BOILER: sample_id (FK), tank_number, boiler_temp, lot_number
This might be a dumb question but what do you need to do with the attribute values? If you only need to display the data then just store them in one field, perhaps in XML or some serialised format.
You could always use a template table to define a sample 'type' and the available fields you display for the purposes of a data entry form.
If you need to filter on them, the only efficient model is option 2. As everyone else is saying the entity-attribute-value style of option 3 is somewhat mental and no real fun to work with. I've tried it myself in the past and once implemented I wished I hadn't bothered.
Try to design your database around how your users need to interact with it (and thus how you need to query it), rather than just modelling the data.
If the set of sample attributes was relatively static then the pragmatic solution that would make your life easier in the long run would be option #2 - these are all attributes of a SAMPLE so they should all be in the same table.
Ok - you could put together a nice object hierarchy of base attributes with various extensions but it would be more trouble than it's worth. Keep it simple. You could always put together a few views of subsets of sample attributes.
I would only go for a variant of your option #3 if the list of sample attributes was very dynamic and you needed your users to be able to create their own fields.
In terms of implementing dynamic user-defined fields then you might first like to read through Tom Kyte's comments to this question. Now, Tom can be pretty insistent in his views but I take from his comments that you have to be very sure that you really need the flexibility for your users to add fields on the fly before you go about doing it. If you really need to do it, then don't create a table for each data type - that's going too far - just store everything in a varchar2 in a standard way and flag each attribute with an appropriate data type.
create table sample (
sample_id integer,
name varchar2(120 char),
constraint pk_sample primary key (sample_id)
);
create table attribute (
attribute_id integer,
name varchar2(120 char) not null,
data_type varchar2(30 char) not null,
constraint pk_attribute primary key (attribute_id)
);
create table sample_attribute (
sample_id integer,
attribute_id integer,
value varchar2(4000 char),
constraint pk_sample_attribute primary key (sample_id, attribute_id)
);
Now... that just looks evil doesn't it? Do you really want to go there?
I work on both a commercial and a home-made system where users have the ability to create their own fields/controls dynamically. This is a simplified version of how it works.
Tables:
Pages
Controls
Values
A page is just a container for one or more controls. It can be given a name.
Controls are linked to pages and represents user input controls.
A control contains what datatype it is (int, string etc) and how it should be represented to the user (textbox, dropdown, checkboxes etc).
Values are the actual data that the users have typed into the controls, a value contains one column for every datatype that it can represent (int, string, etc) and depending on the control type, the relevant column is set with the user input.
There is an additional column in Values which specifies which group the value belong to.
Each time a user fills in a form of controls and clicks save, the values typed into the controls are saved into the same group so that we know that they belong together (incremental counter).
CodeSpeaker,
I like you answer, it's pointing me in the right direction for a similar problem.
But how would you handle drop-downlist values?
I am thinking of a Lookup table of values so that many lookups link to one UserDefinedField.
But I also have another problem to add to the mix. Each field must have multiple linked languages so each value must link to the equivilant value for multiple languages.
Maybe I'm thinking too hard about this as I've got about 6 tables so far.
Related
I have a use case where I need to model reference data for e.g. different flavors of ice cream. Say I have 50 flavors of ice cream :-
20 attributes e.g. freezing-temp, creaminess will be shared across all flavors
every flavor of ice cream would have 20-30 attributes that will not be shared with other flavors e.g. :-
Strawberry ice cream might track tartness, fruit percentage etc.
Chocolate ice cream might track bitterness, cocoa level etc.
How would I model this data neatly in a database model, purely from a storage / retrieval point of view?
The options I can think of :-
One table per flavor. This will need 50 tables, and each table will have 20 columns that will overlap with each other, and another 20-30 attributes that will be unique to the flavor.
Pros : models the data of each flavor quite well
Cons : column overlap and large number of tables needed
One table for all flavors. This will only need 1 table, but will require 1000+ columns most of which would be empty.
Pros : models the data of ice cream in general, quite well
Cons : large number of columns and large amount of 'wasted' space
One key-value table for all flavors, with flavor Id, attribute name and attribute value.
Pros : simplest to create and insert data
Cons : harder to extract, not really a data model per se, difficult to form constraints for attributes, or for attributes related to other attributes
Never store a value in the wrong type.
Whatever design you choose, make sure that values are stored in their natural format. Use NUMBER, DATE, VARCHAR2, CLOB, XMLTYPE, CLOB (IS JSON), TIMESTAMP, etc. Trying to cram everything in a string will cause many problems. You lose validation, convenience, performance, and type safety.
For example, here is a common type safety problem. Imagine this simple query to find ice cream that is more than 25% fruit:
select *
from ice_cream_flavor_attribute
where attribute_name = 'Fruit Percentage'
and attribute_value > 25;
Do you see the bug? Do you see how the same query, with the same data, may work one day and fail the next with ORA-01722: invalid number?
It's difficult to write a query that forces Oracle to evaluate conditions in a specific order. Re-ordering the predicates won't help (99.9% of the time). Adding an inline view won't help (99.9% of the time). Using a CASE statement will work but not 100% of the time. Using hints will work but is tricky. Using an inline view and a ROWNUM is my preferred way of solving the problem but it looks odd and is difficult to understand.
If you must use an Entity Attribute Value model (and if you have more than 1000 attributes it may be unavoidable), at least use the right types.
Don't worry about space - a null column uses at most 1 byte.
Don't worry about complaints like "but then our queries are more complicated, we always need to know which column to use!" - realistically there is almost nothing useful you can do with a value without knowing its type. Every time you read or write a value you must already be thinking about the type.
I'd have one table with all the common attributes, then another for the non-shared attributes. For example:
CREATE TABLE ICE_CREAM_FLAVOR
(FLAVOR VARCHAR2(100) PRIMARY KEY,
FREEZING_TEMP NUMBER,
CREAMINESS NUMBER,
ETC VARCHAR2(25),
BLAH NUMBER);
CREATE TABLE ICE_CREAM_FLAVOR_ATTRIBUTE
(ID_ICF_ATTRIBUTE NUMBER, -- should be populated by an insert trigger
FLAVOR VARCHAR2(100)
NOT NULL
REFERENCES ICE_CREAM_FLAVOR(FLAVOR),
ATTRIBUTE_NAME VARCHAR2(25),
ATTRIBUTE_VALUE VARCHAR2(100));
Your mileage may vary.
Share and enjoy.
I would like to suggest, You can create 3 different tables.
Ice Cream Flavor: You can store all the flavors of ice cream. It will be icecream_flavor_master table. Let say if you have 50 flavors than 50 rows will create, like Strawberry,Chocolate etc.
Ice Cream Attributes: You can store all the attributes of ice cream. It will icecream_attribute_master table. Let say if you have 50 attributes than 50 rows will create, like tartness,bitterness,fruit percentage, cocoa level etc.
Ice Cream Flavor Attributes: You can store primary key of icecream_flavor_master and icecream_attribute_master in this table, to make the relation between flavor and attribute of icecream.
Let me know for further information.
You might be able to group flavors into classes of flavors, ones that share certain attributes. This lends itself to classes and subclasses that extend other classes.
If you want to do ER modeling on this, look up "generalization/specialization" on the web. Some websites will call this a feature of "Extended ER modeling" or EER.
If you want to design relational tables to implement the ER design, look into two patterns: Single Table Inheritance and Class Table Inheritance.
https://stackoverflow.com/tags/single-table-inheritance/info
https://stackoverflow.com/tags/class-table-inheritance/info
Also, look into Martin Fowler's treatment on this subject on the web, or in one of his textbooks.
What big vendors are doing for huge data in ECM (enterprise content management), where you have a quite similar scenario (many custom classes with custom attributes, some of them might be the same, having various types over attributes):
One key-value table for all flavors, with flavor Id, attribute name and attribute value.
They use one key-value table per type (string, number, date etc.).
For performance optimization, they allow to define dedicated tables for attributes, in order to keep index small and not crowded with other attributes.
Dedicated tables make sense for:
Massive usage (having many rows)
Bad histograms (like flags)
Otherwise Oracle index could be tricked, and full table access is the fastest access, which would be really bad.
So think early about performance when having huge amount of data.
My question may seems more general. But only answer I got so far is from the SO itself. My question is, I have a table customer information. I have 47 fields in it. Some of the fields are optional. I would like to split that table into two customer_info and customer_additional_info. One of its column is storing a file in byte format. Is there any advantage by splitting the table. I saw that the JOIN will slow down the query execution. Can I have more PROs and CONs of splitting a table into two?
I don't see much advantage in splitting the table unless some of the columns are very infrequently accessed and fairly large. There's a theoretical advantage to keeping rows small as you're going to get more of them in a cached block, and you improve the efficiency of a full table scan and of the buffer cache. Based on that I'd be wary of storing this file column in the customer table if it was more than a very small size.
Other than that, I'd keep it in a single table.
I can think of only 2 arguments in favor of splitting the table:
If all the columns in Customer_Addition_info are related, you could potentially get the benefit of additional declarative data integrity that you couldn't get with a single table. For instance, lets say your addition table was CustomerAddress. Your business logic may dictate that a customer address is optional, but once you have a customer Zip code, the addressL1, City and State become required fields. You could set these columns to non null if they exist in a customerAddress table. You couldn't do that if they existed directly in the customer table.
If you were doing some Object-relational mapping and your had a customer class with many subclasses and you didn't want to use Single Table Inheritance. Sometimes STI creates problems when you have similar properties of various subclasses that require different storage layout. Being that all subclasses have to use the same table, you might have name clashes. The alternative is Class Table inheritance where you have a table for the superclass, and an addition table for each subclass. This is a similar scenario to the one you described in your question.
As for CONS, The join makes things harder and slower. You also run the risk of accidentally creating a 1 to many relationship. I.E. You create 2 addresses in the CustomerAddress table and now you don't know which one is valid.
EDIT:
Let me explain the declarative ref integrity point further.
If your business rules are such that a customer address is optional, and you embed addressL1, addressL2, City, State, and Zip in your customer table, you would need to make each of these fields Nullable. That would allow someone to insert a customer with a City but no state. You could write a table level check constraint to cover this situation. But that isn't as easy as simply setting the AddressL1, City, State and Zip columns in the CustomerAddress table not nullable. To be clear, I am NOT advocating using the multi-table approach. However you asked for Pros and Cons, and I'm just pointing out this aspect falls on the pro side of the ledger.
I second what David Aldridge said, I'd just like to add a point about the file column (presumably BLOB)...
BLOBs are stored up to approx. 4000 bytes in-line1. If a BLOB is used rarely, you can specify DISABLE STORAGE IN ROW to store it out-of-line, removing the "cache pollution" without the need to split the table.
But whatever you do, measure the effects on realistic amounts of data before you make the final decision.
1 That is, in the row itself.
Does anyone have an example on how to create an Hbase table with a nested entity?
Example
UserName (string)
SSN (string)
+ Books (collection)
The books collection would look like this for example
Books
isbn
title
etc...
I cannot find a single example are how to create a table like this. I see many people talk about it, and how it is a best practice in certain scenarios, but I cannot find an example on how to do it anywhere.
Thanks...
Nested entities isn't an official feature of HBase; it's just a way some people talk about one usage pattern. In this pattern, you use the fact that "columns" in HBase are really just a big map (a bunch of key/value pairs) to let you to model a dimension of cardinality inside the row by adding one column per "row" of the nested entity.
Schema-wise, you don't need to do much on the table itself; when you create a table in HBase, you just specify the name & column family (and associated properties), like so (in hbase shell):
hbase:001:0> create 'UserWithBooks', 'cf1'
Then, it's up to you what you put in it, column wise. You could insert values like:
hbase:002:0> put 'UsersWithBooks', 'userid1234', 'cf1:username', 'my username'
hbase:003:0> put 'UsersWithBooks', 'userid1234', 'cf1:ssn', 'my ssn'
hbase:004:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase:005:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'
The column names are totally up to you, and there's no limit to how many you can have (within reason: see the HBase Reference Guide for more on this). Of course, doing this, you have to do your own legwork re: putting in and getting out values (and you'd probably do it with the java client in a more sophisticated way than I'm doing with these shell commands, they're just for explanatory purposes). And while you can efficiently scan just a portion of the columns in a table by key (using a column pagination filter), you can't do much with the contents of the cells other than pull them and parse them elsewhere.
Why would you do this? Probably just if you wanted atomicity around all the nested rows for one parent row. It's not very common, your best bet is probably to start by modeling them as separate tables, and only move to this approach if you really understand the tradeoffs.
There are some limitations to this. First, this technique only works to
one level deep: your nested entities can’t themselves have nested entities. You can still
have multiple different nested child entities in a single parent, and the column qualifier is their identifying attributes.
Second, it’s not as efficient to access an individual value stored as a nested column
qualifier inside a row, as compared to accessing a row in another table, as you learned
earlier in the chapter.
Still, there are compelling cases where this kind of schema design is appropriate. If
the only way you get at the child entities is via the parent entity, and you’d like to have transactional protection around all children of a parent, this can be the right way to go.
How to do databse driveen jsp page,
Suppose i have 5 text fields,if user wants to put one of the form field as select box.JSp should identify and return the select box if it define in db as select box.
I dont know how to achieve this,can anyone suggest this.
Regards,
Raju komaturi
There are multiple tasks if you want to do this completely. The world at large has not gone this way and so there are not many tools (if any) for this. But basically here are the main ideas.
1) You want a "data dictionary", a collection of meta-data that tells you what the types and sizes of each column are, and the primary and foreign keys are.
2) For your example of "knowing" that a field should be a drop-down, this almost always means that column value is a foreign key to another table. Your code detects this and builds a listbox out of the values in the parent table.
3) You can go so far as to create a complete form generator for simple tables, where all of the HTML is generated, but you always need a way to override this for the more complex forms. If you do this, your data dictionary should also have column descriptions or captions.
There are many many more ideas, but this is the starting point for what you describe.
Recently I've been requested to add on something for the administrator of a site where he can 'feature' something.
For this discussion let's say it's a 'featured article'.
So naturally we already have a database model of 'articles' and it has ~20 columns as it is so I really do not feel like bloating it anymore than it already is.
My options:
Tack on a 'featured' bool (or int) and realize that only one thing will be featured at any given time
Create a new model to hold this and any other feature-creep items that might pop up.
I take your suggestions! ;)
What do you do in this instance? I come across this every now and then and I just hate having to tack on one more column to something. This information DOES need to be persisted.
I'd probably just add a simple two-column table that's basically a key-value store. Then add a new column with values like (featured_article_id, 45) or whatever the first featured ID is.
Edit: as pointed out in the comments by rmeador, it should be noted that this is only a good solution as long as things stay relatively simple. If you need to store more complex data, consider figuring out a more flexible solution.
If only one article can be featured at a time it is a waste to add a bool column. You should go up a level and add a column for the FeaturedArticleID. Do you have a Site_Settings table?
You could use an extensible model like having a table of attributes, and then a linking table to form a many-to-many relationship between articles and attributes. This way, these sorts of features do not require the schema to be modified.
Have some kind of global_settings table with a parameter_name and parameter_value columns. Put featured article id here.
For quick-and-dirty stuff like this, I like to include some sort of Settings table:
CREATE TABLE Settings (
SettingName NVARCHAR(250) NOT NULL,
SettingValue NVARCHAR(250)
)
If you need per-user or per-customer settings, instead of global ones, you could add a column to identify it to that specific user/customer. Then, you could just add a row for "FeaturedArticle" and parse the ID from a string. It's not super optimized, but plaintext is very flexible, which sounds like exactly what you need.