Constraint with Query - Oracle

There are two tables
City (Name, Country_code, Population)
Country (Name, Code, Population)
The task is:
The sum of the populations of all cities in a country should be less than or equal to the population of that country.
Create a constraint and an assertion.
Create a trigger using the constraint and assertion, or propose your own trigger syntax.
I tried to create a constraint on the country table, but I get an error because of the subquery:
ALTER TABLE country
ADD CONSTRAINT check_pop_sum
CHECK (population <= ANY(SELECT SUM(POPULATION)
FROM CITY
GROUP BY COUNTRY_CODE));

You can do this using a trigger. Check this:
CREATE TRIGGER check_population
BEFORE INSERT
ON CITY
FOR EACH ROW
DECLARE
    POPULATION_AMOUNT_CITY    NUMBER;
    POPULATION_AMOUNT_COUNTRY NUMBER;
BEGIN
    -- Total population of the cities already stored for this country
    -- (NVL covers the first city, when SUM returns NULL)
    SELECT NVL(SUM(POPULATION), 0) INTO POPULATION_AMOUNT_CITY
    FROM CITY WHERE COUNTRY_CODE = :NEW.Country_code;

    -- Declared population of the country itself
    SELECT Population INTO POPULATION_AMOUNT_COUNTRY
    FROM COUNTRY WHERE CODE = :NEW.Country_code;

    IF (POPULATION_AMOUNT_CITY + :NEW.POPULATION) > POPULATION_AMOUNT_COUNTRY THEN
        RAISE_APPLICATION_ERROR(-20000, 'Population exceeded');
    END IF;
END;
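One caveat: because the row-level trigger reads CITY, a multi-row insert such as INSERT ... SELECT into CITY will hit ORA-04091 (mutating table); single-row INSERT ... VALUES statements should be fine. As a quick sanity check, with hypothetical data already in place:

-- Assuming COUNTRY holds ('Utopia', 'UTP', 1000000)
-- and CITY already holds ('Alpha', 'UTP', 900000):
INSERT INTO CITY (Name, Country_code, Population)
VALUES ('Beta', 'UTP', 200000);
-- fails with ORA-20000: Population exceeded (900000 + 200000 > 1000000)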

The situation you describe is not a legitimate data integrity issue, so the solution is not a constraint, no matter how it is implemented. Data integrity is concerned primarily with the validity of the data, not accuracy. Accuracy is not a concern of the database at all.
Data integrity can fit into two categories: context-free integrity and contextual integrity. Context-free integrity is when you can verify the validity of a datum without referring to any other data: trying to write an integer value into a date field, for example (domain checking), setting an integer field to "3" instead of 3, or setting it to 3 when the range is defined as "between 100 and 2000".
Contextual integrity is when the validity must be considered as part of a group. Any foreign key, for example. The value 3 may be perfectly valid in and of itself, but can fail validity checking if the proper row in a different table doesn't exist.
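Both kinds of integrity map onto ordinary declarative constraints. A minimal sketch, with hypothetical table and column names:

-- Context-free integrity: the value can be checked on its own
CREATE TABLE reading (
    reading_id INTEGER PRIMARY KEY,
    amount     INTEGER NOT NULL CHECK (amount BETWEEN 100 AND 2000)
);

-- Contextual integrity: validity depends on a row in another table
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    reading_id     INTEGER NOT NULL REFERENCES reading (reading_id)
);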
Accuracy, on the other hand, is completely different. Let's look again at the integer field constrained to a range of between 100 and 2000. Suppose the value 599 is written to that field. Is it valid? Yes, it meets all applicable constraints. Is it accurate? There is no way to tell. Accuracy of the data, as the data itself, originates from outside the database.
But doesn't the ability to add up all the city populations within a country and compare the total to the overall country population mean that we can check for accuracy here?
Not really, or not in a significant way. Upon inserting a new city or updating a city population value, we can test to see if the total of all city populations exceeds the country population. We can alert the user to a problem but we can't tell the user where the problem is. Is the error in the insert or update? Is there a too-large population value in an existing city that was entered earlier? Are there several such too-large values for many cities? Are all city population values correct but the country population too small?
From within the database, there is no way to tell. The best we can do is verify the incorrect total and warn the user. But we can't say "The population of city XYZ is too large" because we can't tell if that is the problem. The best we can do is warn that the total of all cities within the country exceeds the population defined for the country as a whole. It is up to the data owners to determine where the problem actually occurs.
This may seem like a small difference but a constraint determines that the data is invalid and doesn't allow the operation to continue ("Data Integrity: preventing bogus data from entering the database in the first place").
In the case of a city population, the data is not invalid. We can't even be sure it is wrong; it could well be absolutely correct. There is no reason to prevent the operation from completing and the data entering the database.
It is nice if there can be some ability to verify accuracy, but notice that this is not even such a case. As city data is entered into the database, the population value for most of the cities could be wildly erroneous; you aren't aware of a problem until the country population is finally exceeded. Some check is better than none, I suppose, but it only alerts you if the inaccuracies result in a too-large total. There could just as well be inaccuracies that result in too small a total. So some sort of accuracy check must be implemented from the get-go that will test for any inaccuracies -- too large or too small.
That is what will maintain the accuracy of the data, not some misplaced operation within the database.

Related

Tableau - Aggregate and non-aggregate error for divide formula

I used COUNT(CUST_ID) as the measure value to come up with [Total No of Customer]. When I created a new measure for [Average Profit per customer] with the formula [Total Profit] / [Total No of Customer], I got the aggregate and non-aggregate error.
DB level:
Cust ID    Profit
123        100
234        500
345        350
567        505
You must be looking for the AVG aggregate function.
Select cust_id, avg(profit)
From your_table
Group by cust_id;
Cheers!!
In your database table, you appear to have one data row per customer. Customer ID is serving as a unique primary key. The level of detail (or granularity) of the database table is the customer.
Given that, the simplest solution to your question is to display AVG([Profit]) -- without having [Cust ID] in the view (i.e. not on any shelf)
If the assumptions mentioned above are not correct, then you may need to employ other methods depending on how you define your question. I suggest making sure you understand what COUNT() actually does compared to COUNTD(). The behavior is not what people tend to assume. LOD calculations may prove useful. All described in the online help.
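For reference, and assuming a table shaped like the one above, Tableau's COUNT() behaves roughly like SQL's COUNT(column) (it counts non-NULL rows) while COUNTD() behaves like COUNT(DISTINCT column); the difference matters as soon as a customer can appear on more than one row:

SELECT COUNT(cust_id)          AS row_count,          -- what COUNT([Cust ID]) counts
       COUNT(DISTINCT cust_id) AS distinct_customers  -- what COUNTD([Cust ID]) counts
FROM your_table;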
Put the calculations directly in the calculated field as:
SUM([Profit])/COUNT([CUST_ID])
This way both sides of the division are aggregates, so the aggregate/non-aggregate mismatch goes away.
If you want to show average profit keyed by [CUST_ID], you can use an LOD expression:
{FIXED [CUST_ID] : AVG([Profit])}

How to normalize this UNF table to 3NF given these FDs - staff, patient, date, time, surgery

The following table is not normalized:
Assuming the following functional dependencies are in place, how would we normalize this table?:
I can't seem to find a way to normalize the table while following all of the functional dependencies as well. I have the following (Modeled in Oracle SQL Developer Data Modeler):
What can I do to fully normalize the original table?
So Wikipedia’s entry for functional dependency includes this explanation:
a dependency FD: X → Y means that the values of Y are determined by the values of X. Two tuples sharing the same values of X will necessarily have the same values of Y.
So FD1 says that if you know the appointment date time and the staffer, you can determine the individual patient, and likewise for FD5: if you know the appointment date time and the patient, you can determine the staffer.
FD2 is pretty obvious: a staffer ID needs to map to an individual dentist. That is why you have IDs.
Then it gets weird. FD3 indicates that from a patient number you can determine a single procedure. So if you’re required to abide by that, the surgery can go on the patient entity. Which is stupid, of course.
FD4 is puzzling too because it says that a staffer can perform only one type of procedure on a given day. When you create data models in real life, this is the kind of business rule you would not try to enforce through table design; you'd use a constraint, or enforce it with application code. If you did enforce this with tables you would get a weird intersection table with staffer ID, date, and procedure, roughly as sketched below.
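A minimal sketch of that intersection table, with hypothetical names, just to show the shape of what FD4 would force on you:

-- One row per staffer per day; pinning the procedure in this row means a
-- staffer cannot be scheduled for two different procedure types on one day.
CREATE TABLE staff_daily_procedure (
    staff_id       INTEGER NOT NULL REFERENCES staff (staff_id),
    work_date      DATE NOT NULL,
    procedure_type VARCHAR(50) NOT NULL,
    PRIMARY KEY (staff_id, work_date)
);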
Assignments are not going to be totally realistic, but this seems far enough off you should check with your instructor about whether you are on the right track or not.

How do I specify invalid time in a time dimension?

I am building a time dimension for only time in my data warehouse. I already have a date dimension.
How do I denote an unknown time? In my DimDate dimension, I marked 01/01/1753 as being reserved for unknown dates, but I think a time will be a bit harder. We don't allow NULLs in our fact tables. How do I do this, and what might that row look like?
You state that "We don't allow NULLs in our fact tables" but ask "How do I denote an unknown time?"
Assuming your fact table uses a TIME data type and enforces a NOT NULL constraint on data arriving from the source system, you simply cannot insert an unknown/invalid time into the fact table, and hence you should have no problem.
The obvious exception to the above is a business-wise invalid value reported by the source system, such as the '00:59:59.9999999' Sunil proposed, but this is an uncommon and unstable solution for obvious reasons (changing requirements can easily turn this value into a valid one).
If you choose to allow records with NULL values or invalid dates from your source system to enter the fact table (and I hope you do), then the best practice is to use surrogate keys on your DimTime and insert them as FKs into your fact tables - this easily lets you represent both valid and invalid values in your dimension.
This approach can also easily support the business-wise invalid value ('00:59:59.9999999'); such a value simply gets FK_DimTime = -1.
I strongly advise allowing specific types of garbage from source systems (i.e., invalid/missing/NULL date/time values) to enter the fact tables, as long as you clearly mark them in the relevant dimensions, since this tends to drive users to improve data quality in the source systems.
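A minimal sketch of that surrogate-key approach (table and column names here are illustrative; the -1 row is the "unknown" member of DimTime):

-- Reserved dimension member for unknown/invalid times
INSERT INTO dbo.DimTime (TimeID, TimeValue, DisplayTime)
VALUES (-1, NULL, 'Unknown');

-- During ETL, fall back to the unknown member when the source time is missing
INSERT INTO dbo.FactEvent (EventID, FK_DimTime, Amount)
SELECT s.EventID,
       COALESCE(t.TimeID, -1),
       s.Amount
FROM staging.Event AS s
LEFT JOIN dbo.DimTime AS t
       ON t.TimeValue = s.EventTime;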
Here is some background on the matter
https://www.kimballgroup.com/1997/07/its-time-for-time/
https://www.kimballgroup.com/2004/02/design-tip-51-latest-thinking-on-time-dimension-tables/
It can look like anything you want. Most dimensions have a 'display name' of some kind, so your dimensions could look something like this:
create table dbo.DimDate (DateID int, DateValue date, DisplayDate nvarchar(20))
go
-- this is an unknown date; 1753-01-01 is only there because we need some valid date value
insert into dbo.DimDate values (1, '1753-01-01', 'Unknown')
go
-- this is the real date 1 Jan 1753
insert into dbo.DimDate values (2, '1753-01-01', '01 Jan 1753')
go
create table dbo.DimTime (TimeID int, TimeValue time, DisplayTime nvarchar(20))
go
-- this is an unknown time; 00:00 is only there because we need some valid time value
insert into dbo.DimTime values (1, '00:00', 'Unknown')
go
-- this is the real time value for midnight
insert into dbo.DimTime values (2, '00:00', 'Midnight')
go
Of course, this assumes that your reporting tool and users use the DisplayDate and DisplayTime columns for filtering instead of the DateValue and TimeValue columns directly, but that's simply a matter of training and standards, and whatever solution you adopt needs to be understood anyway.
There are other alternatives such as a flag column for unknown values, or a convention that a negative TimeID indicates an unknown value. But those are less obvious and harder to maintain than an explicit row value, in my opinion.
Just create a DimTime record with a -1 technical surrogate key and populate the time column with the value '00:59:59.9999999'. Since that exact time (accurate to the last digit) is unlikely ever to be captured by your DWH, it will always equate to "unknown" in your reports or queries when you want to apply a filter like:
EventTime < #ReportTime AND EventTime <> '00:59:59.9999999'
Hope this is a viable solution to your problem.

Teradata: How to design table to be normalized with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns are going to need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table since this would be an atrocious repeat of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, and Weekly, and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large number of foreign key fields and am fairly new to Teradata. Before I go down the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type        Value
---  ----------  ----------
1    Interval    Daily
2    Interval    Monthly
3    Interval    Weekly
4    TimeFrame   24x7
5    TimeFrame   8x5
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type        Value
---  ----------  ----------
0    Unknown     Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
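A rough sketch of the whole arrangement (hypothetical table and column names; the exact identity-column options vary by Teradata release), including the 0/Unknown row so the view can use inner joins:

CREATE TABLE lookup_primitive (
    id         INTEGER GENERATED BY DEFAULT AS IDENTITY,
    prim_type  VARCHAR(20) NOT NULL,
    prim_value VARCHAR(20) NOT NULL,
    PRIMARY KEY (id)
);

-- Reserved row so the main table's lookup columns can default to 0 instead of NULL
INSERT INTO lookup_primitive (id, prim_type, prim_value) VALUES (0, 'Unknown', 'Unknown');

CREATE TABLE schedule_config (
    config_id    INTEGER NOT NULL PRIMARY KEY,
    interval_id  INTEGER DEFAULT 0 NOT NULL,  -- id of an 'Interval' row (0 = Unknown)
    timeframe_id INTEGER DEFAULT 0 NOT NULL   -- id of a 'TimeFrame' row (0 = Unknown)
    -- ...roughly 30 such lookup columns in the real table
);

-- Reporting view that resolves the keys back to 'Daily', '24x7', etc.
CREATE VIEW schedule_config_v AS
SELECT c.config_id,
       i.prim_value AS interval_value,
       t.prim_value AS timeframe_value
FROM schedule_config c
JOIN lookup_primitive i ON i.id = c.interval_id
JOIN lookup_primitive t ON t.id = c.timeframe_id;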

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has the capacity to hold a million rows, with a primary key as the ID (random and unique).
b) What algorithm can be used to arrive at an ID that has not been used so far? This number will be used to insert another row into table BIGTABLE.
Updated the question with more details:
c) This table already has about 100K rows, and the primary key is not set as an identity column.
d) Currently, a random number is generated as the primary key and a row is inserted into this table; if the insert fails, another random number is generated. The problem is that it sometimes goes into a loop: the numbers generated are pretty random, but unfortunately they already exist in the table. If we retry the random number generation after some time, it works.
e) The Sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement, was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) RETURNS bigint AS $$
    SELECT ($1::bigint << 31) + $2
$$ LANGUAGE sql;

CREATE FUNCTION random_pos_int() RETURNS integer AS $$
    SELECT floor((lift(1,0) - 1) * random())::integer
$$ LANGUAGE sql;

ALTER TABLE client ALTER COLUMN id SET DEFAULT
    lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
Why is the unique ID random? Why not use IDENTITY?
How was the ID chosen for the existing rows?
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random Identifier: make your PK two fields. An Identity Field and then a random number.
If you simply can't change the table's structure, checking to see if the ID exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
    id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
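For reference, the standard birthday-problem approximation (assuming n keys drawn uniformly at random from a key space of size N) is:

P(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)}

With n = 1,000,000 rows and a 64-bit key space (N = 2^64), that works out to a collision probability on the order of 3 x 10^-8.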
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
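A minimal sketch of that pre-generated pool idea in Sybase/T-SQL-style syntax (table and column names are hypothetical; concurrent callers would additionally need locking around the pool read):

-- Pool of keys, filled offline by a separate process that only inserts
-- random values not already present in BIGTABLE.
CREATE TABLE id_pool (
    id INT NOT NULL PRIMARY KEY
)

-- At insert time: take one key out of the pool and use it
DECLARE @new_id INT
BEGIN TRANSACTION
    SELECT @new_id = MIN(id) FROM id_pool
    DELETE FROM id_pool WHERE id = @new_id
    INSERT INTO BIGTABLE (id, payload) VALUES (@new_id, 'new row')
COMMIT TRANSACTION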
Make the key field UNIQUE and IDENTITY and you won't have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: Is this a planned database or an already functional one? If it already has data inside, then the answer by bmdhacks is correct. If it is a planned database, here is the second question:
Does your primary key really need to be random? If the answer is yes, then use a function to create a random ID from a known seed, plus a counter to know how many IDs have been created. Each ID created will increment the counter.
If you keep the seed secret (i.e., have the seed called and declared private) then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
Depending on your database, you might have the option of using either a sequence (Oracle) or an autoincrement column (MySQL, MS SQL, etc.). As a last resort, do a SELECT MAX(id) + 1 as the new ID - just be careful with concurrent requests so you don't end up with the same max ID twice - wrap it in a lock together with the upcoming insert statement.
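A couple of concrete forms of that, with hypothetical table and column names:

-- Oracle: a sequence supplies the next key
CREATE SEQUENCE bigtable_seq;
INSERT INTO bigtable (id, payload) VALUES (bigtable_seq.NEXTVAL, 'new row');

-- MySQL: the column generates its own key
CREATE TABLE bigtable (
    id      BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload VARCHAR(100)
);
INSERT INTO bigtable (payload) VALUES ('new row');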
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot of strain on your app and database. And it could lead to two processes picking the same id.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to keep a value in another table: lock that table, read the value, calculate the next id, write the next id back, and release the lock (roughly as sketched below). You can then use the id you read with confidence that your current process is the only one holding that unique value. Not sure how well it scales.
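A minimal sketch of that counter-table idea in Sybase/T-SQL-style syntax (hypothetical names; the UPDATE's exclusive lock, held until COMMIT, is what serializes concurrent callers):

-- One-row table that hands out ids
CREATE TABLE id_counter (
    next_id INT NOT NULL
)
INSERT INTO id_counter (next_id) VALUES (1)

-- Allocating an id
DECLARE @my_id INT
BEGIN TRANSACTION
    UPDATE id_counter SET next_id = next_id + 1
    SELECT @my_id = next_id - 1 FROM id_counter   -- the value just reserved
    INSERT INTO BIGTABLE (id, payload) VALUES (@my_id, 'new row')
COMMIT TRANSACTION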
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
if the "hole" is found then save the ID in another table and you can use that as the starting point for the next case to save looping.
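A sketch of the hole-hunt expressed as a single query (it scans the whole table rather than working in chunks, so treat it as illustrative only; the ID column name is taken from the question):

-- Smallest ID whose successor is unused
SELECT MIN(t.ID) + 1 AS candidate_id
FROM BIGTABLE t
WHERE NOT EXISTS (
    SELECT 1 FROM BIGTABLE t2 WHERE t2.ID = t.ID + 1
);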
Skipping the reasoning of the task itself, the only algorithm that
- will give you an ID not in the table,
- will be used to insert a new line in the table, and
- will result in a table still having random unique IDs
is generating a random number and then checking if it's already used.
The best algorithm in that case is to generate a random number and do a select to see if it exists, or just try to add it if your database errors out sanely. Depending on the range of your key versus how many records there are, this could take a small amount of time. It also has the ability to spike and isn't consistent at all.
Would it be possible to run some queries on the BigTable and see if there are any ranges that could be exploited? i.e., between 100,000 and 234,000 there are no IDs yet, so we could add IDs there?
Why not append the current date in seconds to your random number generator's output? That way, the only way to get an identical ID is if two users are created in the same second and are given the same random number by your generator.
