Oracle index for LIKE query

I'm fighting a case of customer stupidity / stubbornness here. We have an application to look up retail shoppers by various criteria. The most common variety we see is some combination of (partial) last name and (partial) postal code.
When they enter the full postal code, it works remarkably well. The problem is they sometimes choose to enter, effectively, postal code like '3%'.
Any miracle out there to overcome our customer stupidity?
ETA: There are two tables involved in this particular dog of an operation: customers and addresses. I'm a DBA involved in supporting this application, rather than on the development side. I have no ability to change the code (though I can pass on suggestions in that vein) but I have some leeway on improving indexing.
Customers has 22 million rows; addresses has 23 million.
"Stupidity" may be a harsh word, but I don't understand why you would ever try to look up a customer by postal code like '3%'. I mean, how much effort is it to type in their full zip or postal code?

A difficulty is that
WHERE postal_code LIKE '3%'
AND last_name LIKE 'MC%'
can usually only benefit from either an index on postal_code or an index on last_name. A composite index on both is no help (beyond the leading column).
Consider this as a possible solution (assuming your table name is RETAIL_RECORDS):
alter table retail_records
add postal_code_first_1 VARCHAR2(2)
GENERATED ALWAYS AS ( substr(postal_code, 1,1) );
alter table retail_records
add last_name_first_1 VARCHAR2(2)
GENERATED ALWAYS AS ( substr(last_name, 1,1) );
create index retail_records_n1
on retail_records ( postal_code_first_1, last_name_first_1, postal_code );
create index retail_records_n2
on retail_records ( postal_code_first_1, last_name_first_1, last_name );
Then, in situations where postal_code and/or last_name conditions are given to you, also include a condition on the appropriate ..._first_1 column.
So,
WHERE postal_code LIKE :p1
AND last_name LIKE :p2
AND postal_code_first_1 = SUBSTR(:p1,1,1)
AND last_name_first_1 = SUBSTR(:p2,1,1)
That's going to allow Oracle to, on average, search through about 1/260th of the data (1/10th for the first digit of the postal code times 1/26th for the first letter of the last name). OK, there are a lot more last names starting with "M" than with "Z", so that's a little generous. But even for a high-frequency combination (say postal_code like '1%' and last_name like 'M%'), it still shouldn't have to look through more than 1% of the rows.
I expect that you'll have to tweak this once you see what Oracle's Cost-Based Optimizer is actually doing, but I think the basic principle of the idea should be sound.
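To see what the CBO actually picks once the virtual columns and indexes above exist, a quick sanity check with EXPLAIN PLAN and DBMS_XPLAN looks like this (literal values stand in for the binds):
EXPLAIN PLAN FOR
SELECT *
  FROM retail_records
 WHERE postal_code LIKE '3%'
   AND last_name LIKE 'MC%'
   AND postal_code_first_1 = SUBSTR('3%', 1, 1)
   AND last_name_first_1 = SUBSTR('MC%', 1, 1);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);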

Related

Tableau - Aggregate and non-aggregate error for divide formula

I used COUNT(CUST_ID) as a measure to come up with [Total No of Customer]. When I created a new measure for [Average Profit per Customer] with the formula [Total Profit] / [Total No of Customer], the aggregate and non-aggregate error was prompted.
DB level:
Cust ID    Profit
123        100
234        500
345        350
567        505
You must be looking for the AVG aggregate function.
SELECT cust_id, AVG(profit)
FROM your_table
GROUP BY cust_id;
Cheers!!
In your database table, you appear to have one data row per customer. Customer ID is serving as a unique primary key. The level of detail (or granularity) of the database table is the customer.
Given that, the simplest solution to your question is to display AVG([Profit]) -- without having [Cust ID] in the view (i.e. not on any shelf)
If the assumptions mentioned above are not correct, then you may need to employ other methods depending on how you define your question. I suggest making sure you understand what COUNT() actually does compared to COUNTD(). The behavior is not what people tend to assume. LOD calculations may prove useful. All described in the online help.
Put the calculations directly in the calculated field as:
SUM([Profit])/COUNT([CUST_ID])
This way both sides of the division are aggregates.
If you want to show average profit using a key like [CUST_ID], you can use an LOD expression:
{ FIXED [CUST_ID] : AVG([Profit]) }

Constraint with Query

There are two tables
City (Name, Country_code, Population)
Country (Name, Code, Population)
The task is:
The sum of the populations of all cities in a country should be less than or equal to the population of that country.
Create a constraint and an assertion.
Create a trigger using the constraint and assertion, or propose your own trigger syntax.
I tried to create a constraint on the country table, but I get an error because of the subquery:
ALTER TABLE country
ADD CONSTRAINT check_pop_sum
CHECK (population <= ANY(SELECT SUM(POPULATION)
FROM CITY
GROUP BY COUNTRY_CODE));
You can do this using a trigger. Check this:
CREATE TRIGGER check_population
BEFORE INSERT
ON CITY
FOR EACH ROW
DECLARE
    POPULATION_AMOUNT_CITY    NUMBER;
    POPULATION_AMOUNT_COUNTRY NUMBER;
BEGIN
    -- Total population of the cities already recorded for this country (0 if none yet)
    SELECT NVL(SUM(POPULATION), 0)
      INTO POPULATION_AMOUNT_CITY
      FROM CITY
     WHERE Country_code = :NEW.Country_code;

    -- Declared population of the country itself
    SELECT Population
      INTO POPULATION_AMOUNT_COUNTRY
      FROM COUNTRY
     WHERE Code = :NEW.Country_code;

    IF (POPULATION_AMOUNT_CITY + :NEW.POPULATION) > POPULATION_AMOUNT_COUNTRY THEN
        RAISE_APPLICATION_ERROR(-20000, 'Population exceeded');
    END IF;
END;
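A quick way to exercise the trigger, using hypothetical sample data (a country with code 'XX' and population 1000 that presumably doesn't exist in your real tables):
INSERT INTO COUNTRY (Name, Code, Population) VALUES ('Testland', 'XX', 1000);
INSERT INTO CITY (Name, Country_code, Population) VALUES ('Alpha', 'XX', 600); -- allowed: 0 + 600 <= 1000
INSERT INTO CITY (Name, Country_code, Population) VALUES ('Beta', 'XX', 500);  -- raises ORA-20000: 600 + 500 > 1000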
The situation you describe is not a legitimate data integrity issue, so the solution is not a constraint, no matter how it is implemented. Data integrity is concerned primarily with the validity of the data, not accuracy. Accuracy is not a concern of the database at all.
Data integrity can fit into two categories: context-free integrity and contextual integrity. Context-free integrity is when you can verify the validity of the datum without referring to any other data. Examples are trying to write an integer value into a date field (domain checking), setting an integer field to "3" instead of 3, or setting it to 3 when the range is defined as "between 100 and 2000".
Contextual integrity is when the validity must be considered as part of a group. Any foreign key, for example. The value 3 may be perfectly valid in and of itself, but can fail validity checking if the proper row in a different table doesn't exist.
Accuracy, on the other hand, is completely different. Let's look again at the integer field constrained to a range of between 100 and 2000. Suppose the value 599 is written to that field. Is it valid? Yes, it meets all applicable constraints. Is it accurate? There is no way to tell. Accuracy of the data, as the data itself, originates from outside the database.
But doesn't the ability to add up the populations of all cities within a country and compare the total to the overall country population mean that we can check for accuracy here?
Not really, or not in a significant way. Upon inserting a new city or updating a city population value, we can test to see if the total of all city populations exceeds the country population. We can alert the user to a problem but we can't tell the user where the problem is. Is the error in the insert or update? Is there a too-large population value in an existing city that was entered earlier? Are there several such too-large values for many cities? Are all city population values correct but the country population too small?
From within the database, there is no way to tell. The best we can do is report the incorrect total and warn the user. But we can't say "The population of city XYZ is too large" because we can't tell if that is the problem. The best we can do is warn that the total of all cities within the country exceeds the population defined for the country as a whole. It is up to the data owners to determine where the problem actually occurs.
This may seem like a small difference but a constraint determines that the data is invalid and doesn't allow the operation to continue ("Data Integrity: preventing bogus data from entering the database in the first place").
In the case of a city population, the data is not invalid. We can't even be sure if it is wrong, it could well be absolutely correct. There is no reason to prevent the operation from completing and the data entering the database.
It is nice if there can be some ability to verify accuracy, but notice that this is not even such a case. As city data is entered into the database, the population values for most of them could be wildly erroneous. You aren't aware of a problem until the country population is finally exceeded. Some check is better than none, I suppose, but it only alerts if the inaccuracies result in a too-large value. There could just as well be inaccuracies that result in too small a value. So some sort of accuracy check must be implemented from the get-go that will test for any inaccuracies -- too large or too small.
That is what will maintain the accuracy of the data, not some misplaced operation within the database.

(var)char as the type of the column for performance?

I have a column called "status" in PostgreSQL. At first it was "status_id" of type integer. The values were kept on the client, so there was no table on the server called statuses where I'd keep those statuses and then do an inner join with the first table.
I used to send the ids of the statuses from the client (they had the names on the client). However, at some point I realized I'd better make the server hold those statuses, not in a separate table but in the first one, and I want to make them strings. So the initial table will have a status column of type string (varchar, to be more specific). I read it wouldn't be that slow.
In general, is it a good idea? I suppose it is, because doing an inner join (in case I kept statuses in a separate table) each time is expensive, as is sending ids from the client.
1) The only concern I have is whether the status column should be of type char, not varchar. That should make it more efficient, I suppose. Is that so?
2) If the first point is correct, then I'm not sure I'll be able to name all the statuses using exactly the same number of characters, say 5. Some of them might be longer, some shorter. How can I solve this?
UPDATE:
It's not denormalization because I'm talking about one single table. There is not, and has never been, a second table called Statuses with the fields (id, status_name).
What I'm trying to convey is that I could use char(n) for the status column and also add an index on it. Then it should be fast enough. However, it might or might not be possible to name all the statuses with a certain number (n) of characters, and that's the only concern.
I don't think using char or varchar instead of an integer is a good idea. It is hard to predict how much slower it will be than an integer PK, but this design will be slower, and the impact will be worse when you join larger tables. If you can, use ENUM types instead.
http://www.postgresql.org/docs/9.2/static/datatype-enum.html
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
name text,
current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
name | current_mood
------+--------------
Moe | happy
(1 row)
PostgreSQL's varchar and char types are very similar. The internal implementation is the same; char can (paradoxically) be a little slower due to the padding with spaces.
I'd go one step further. Never use the outdated data type char(n), unless you know you have to (for compatibility or some rare exotic reason). The type is utterly useless in a modern database. Padding strings with blank characters is nonsense, and if you have to do it, you can do it in a cheaper fashion with rpad() on data retrieval.
SELECT rpad('short', 10) AS char_10_string;
varchar is basically the same as text and allows a length specifier: varchar(n). I generally use just text. If I need to limit the length, I use a CHECK constraint instead; one example is sketched below.
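A minimal sketch of that approach, using a hypothetical orders table (the name and the 20-character cap are purely illustrative):
-- plain text column; the length cap lives in a CHECK constraint, not in the type
CREATE TABLE orders (
    id     serial PRIMARY KEY,
    status text NOT NULL,
    CONSTRAINT status_len CHECK (char_length(status) <= 20)
);
If the cap ever needs to change, you just drop and re-add the constraint rather than altering the column type.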
Whenever you can use a simple integer (or enum) instead, that's a bit smaller and faster in every respect. Consider Pavel's answer above for enum.
As for:
because doing inner join (...) each time is expensive
Well, it carries a small cost, but it's generally cheaper than redundantly saving the text representation of the status instead of a much cheaper integer in the main table. That kind of rumor is spread by people who have problems understanding the concept of database normalization. The enum type is a compromise here, for relatively static sets of values.
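For comparison, a minimal sketch of the normalized layout that worry refers to, with hypothetical table names (statuses, tickets) not taken from the question:
-- small lookup table; the main table stores only the integer key
CREATE TABLE statuses (
    status_id   int  PRIMARY KEY,
    status_name text NOT NULL UNIQUE
);

CREATE TABLE tickets (
    id        serial PRIMARY KEY,
    status_id int NOT NULL REFERENCES statuses (status_id)
);

-- the join is cheap: the lookup table is tiny and stays cached
SELECT t.id, s.status_name
FROM tickets t
JOIN statuses s USING (status_id);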

Having more than 50 columns in a SQL table

I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
Please let me know, if the same kind of situation arises and you don't want to keep so many columns in a single table, what the other options are.
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is ensuring most columns have valid default values (null, today, etc.). That will simplify inserts.
Also ensure your code always specifies the columns (i.e. no "select *"). Any kind of future optimization will include indexes with a subset of the columns.
One approach we used once is to split your table into two tables. Both of these tables get the primary key of the original table. In the first table you put your most frequently used columns, and in the second table you put the lesser-used columns. Generally the first one should be smaller. You can now speed things up in the first table with various indices. In our design, we even had the first table running on the memory engine (RAM), since we only had read queries. If you need a combination of columns from table1 and table2, you need to join both tables on the primary key.
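A sketch of that split, with hypothetical table and column names since the question doesn't list its own:
-- "hot" columns, queried constantly
CREATE TABLE orders_main (
    order_id INT PRIMARY KEY,
    status   VARCHAR(20),
    total    DECIMAL(10, 2)
);

-- rarely used columns, sharing the same primary key
CREATE TABLE orders_extra (
    order_id  INT PRIMARY KEY REFERENCES orders_main (order_id),
    notes     VARCHAR(4000),
    misc_flag CHAR(1)
);

-- when columns from both halves are needed, join on the shared key
SELECT m.order_id, m.status, e.notes
FROM orders_main m
JOIN orders_extra e ON e.order_id = m.order_id;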
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
This means that your table is in third normal form (or perhaps BCNF). That being the case, it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key, but there may also be dependencies between columns: e.g., there are multiple values of COL42 associated with each value of COL23. A join dependency means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF. And it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking, because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
1        NAME  Joe
1        AGE   44
1        WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact, it kind of behaves like a column-organized DB, since you only pull in the rows you need (less I/O) rather than an entire wide row when you only need a few columns (more I/O).
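When a query does need several items at once, the pairs can be pivoted back into a wide row. A sketch against the IRP_TABLE above, assuming VALUE is stored as text:
SELECT key_col,
       MAX(CASE WHEN item = 'NAME' THEN value END) AS name,
       MAX(CASE WHEN item = 'AGE'  THEN value END) AS age,
       MAX(CASE WHEN item = 'WGT'  THEN value END) AS wgt
FROM irp_table
GROUP BY key_col;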

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender": with only "Male" and "Female" as possible values, there is no point to an index). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant amount (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something' then there is no point in an index. If you think of an index as specifying a particular order for rows, then sorting by x when you're searching for "%something" is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT *
FROM (
    SELECT TOP 10
        ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
        sq.*
    FROM (
        SELECT
            c.*,
            p.providername,
            p.school,
            p.website,
            p.type
        FROM
            cpd_COURSES c, cpd_PROVIDERS p
        WHERE
            c.providerid = p.providerid AND
            c.activatedYN = 'Y' AND
            (
                c.name like '%accounting%' OR
                c.title like '%accounting%' OR
                c.keywords like '%accounting%'
            )
    ) sq
) AS temp
WHERE
    Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid in which case you don't need an index, because it'll already have one.
Additionally, the activatedYN column is a Y/N column and (according to my rule above about restricting the possible values by only 50%) such a column should not be indexed, either.
Finally, because you're searching with an x LIKE '%accounting%' query, you don't need an index on name, title or keywords either, because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
SQL Server Specific
Because you're using SQL Server, Management Studio has a number of tools to help you decide where you need to put indexes. If you use the "Index Tuning Wizard", it is actually usually pretty good at telling you what will give you good performance improvements. You just cut and paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid+keyword?
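A rough T-SQL sketch of that table; the column names, the varchar(24) size, and the assumption that cpd_COURSES exposes a courseid key are all illustrative rather than taken from the actual schema:
CREATE TABLE Course_Keywords (
    courseid INT         NOT NULL,
    keyword  VARCHAR(24) NOT NULL
);

CREATE CLUSTERED INDEX IX_Course_Keywords
    ON Course_Keywords (courseid, keyword);

-- exact-match keyword lookup instead of LIKE '%accounting%'
SELECT c.*
FROM cpd_COURSES c
WHERE EXISTS (
    SELECT 1
    FROM Course_Keywords ck
    WHERE ck.courseid = c.courseid
      AND ck.keyword  = 'accounting'
);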
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order along with the page in the phonebook that the person is listed, it would be a lot faster. These would be the unclustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the Zip code lookup. My experience is that IN statements work great for small lists, but the larger they get, the better performance you get out of EXISTS, as the query engine will stop searching for a specific value at the first instance it runs into.
<CFIF zipcodes is not "">
    EXISTS (
        SELECT zipcode
        FROM cpd_CODES_ZIPCODES
        WHERE zipcode = p.zipcode
          AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
                           (COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
                            COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
    )
</CFIF>
