fact table is being populated with too many records - oracle

There are 62,000 records in my fact table, which is not correct because I only have six records in my time dim, 240 records in my student dim and 140 records in my placement dim. Does it have something to do with my WHERE clause? Any help would be much appreciated.
INSERT INTO fact_placements (
report_id,
no_of_placements,
no_of_students,
fk1_time_id,
fk2_placement_id,
fk3_student_id )
SELECT
fact_seq.nextval,
no_of_placements,
no_of_students,
time_id,
placement_id,
student_id
FROM
time_dim,
placement_dim,
student_dim
WHERE
placement_dim.year = time_dim.year AND
student_dim.year = time_dim.year;

Unless you do a cartesian join, i.e. a join without any WHERE clause, you will get fewer than 140 (placement) * 240 (student) * 6 (time) = 201,600 fact records. Your current SQL joins the three tables on the year column, which filters the result down to the 62,000 records you are getting.
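To make the arithmetic concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for Oracle. Only the table and column names come from the question; the sample rows (two years, a handful of rows per dimension) are invented for illustration. Joining on year yields, for each year, the product of the matching row counts in the three dimensions, which is smaller than the full cartesian product but still grows multiplicatively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim      (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE placement_dim (placement_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE student_dim   (student_id INTEGER PRIMARY KEY, year INTEGER);
""")

# Invented sample data: 2 time rows, 5 placements, 7 students over two years.
conn.executemany("INSERT INTO time_dim (year) VALUES (?)", [(2012,), (2013,)])
conn.executemany("INSERT INTO placement_dim (year) VALUES (?)",
                 [(2012,)] * 3 + [(2013,)] * 2)
conn.executemany("INSERT INTO student_dim (year) VALUES (?)",
                 [(2012,)] * 4 + [(2013,)] * 3)

# No join condition at all: every combination of the three tables.
cartesian_rows = conn.execute(
    "SELECT COUNT(*) FROM time_dim, placement_dim, student_dim"
).fetchone()[0]

# Same join condition as the question, written with explicit joins.
year_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM time_dim
    INNER JOIN placement_dim ON placement_dim.year = time_dim.year
    INNER JOIN student_dim   ON student_dim.year   = time_dim.year
""").fetchone()[0]

print(cartesian_rows, year_join_rows)
```

With this sample data the year join gives 1 * 3 * 4 rows for 2012 plus 1 * 2 * 3 rows for 2013, so every student of a year is still paired with every placement of that year.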
Your question title says that even this is "too many". If that is the case, you need to understand the granularity of your dimensions and of the fact table before joining them on any criteria. Are they all at the "year" level? If so, do you have one record per year in each of these tables, with no duplicate years?
If not, you might need to rethink the fact table's granularity, or alternatively join only unique records per year from each dimension to get the smaller number of records you are expecting. This can also be done by summarizing these tables by year.
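One possible reading of the "summarizing" suggestion, sketched in Python with SQLite standing in for Oracle: count the rows of each dimension per year first, then join the per-year summaries, so the result has one row per year. The sample data is invented, time_dim is assumed to hold one row per year, and treating no_of_placements and no_of_students as per-year counts is also an assumption, since the question does not say where those measures come from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim      (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE placement_dim (placement_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE student_dim   (student_id INTEGER PRIMARY KEY, year INTEGER);
    INSERT INTO time_dim (year) VALUES (2012), (2013);
    INSERT INTO placement_dim (year) VALUES (2012), (2012), (2012), (2013), (2013);
    INSERT INTO student_dim (year)
        VALUES (2012), (2012), (2012), (2012), (2013), (2013), (2013);
""")

# Summarize each dimension by year before joining, so the join
# cannot multiply placements by students within a year.
rows = conn.execute("""
    SELECT t.year, p.no_of_placements, s.no_of_students
    FROM time_dim t
    INNER JOIN (SELECT year, COUNT(*) AS no_of_placements
                FROM placement_dim GROUP BY year) p ON p.year = t.year
    INNER JOIN (SELECT year, COUNT(*) AS no_of_students
                FROM student_dim GROUP BY year) s ON s.year = t.year
    ORDER BY t.year
""").fetchall()

print(rows)
```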
Ideally the fact table contains the combinations of the dimension keys plus additional columns for the factual metrics (in this case no_of_placements and no_of_students). Depending on the available data, though, not all combinations will be present in the fact table.
Also, you might want to change the SQL to use explicit INNER JOIN clauses instead of the implicit joins written with commas between table names in the FROM clause, as shown below:
FROM time_dim
INNER JOIN placement_dim
ON placement_dim.year = time_dim.year
INNER JOIN student_dim
ON placement_dim.year = student_dim.year

There's no relationship between placement and student in your query; that's why you have so many records.
Your query is saying: give me all the students and all the placements where the year is the same.
I'm not sure that's what you want. What is really strange here is that you are loading a fact table from dimension tables.

Related

Table joins with between clause performances

I want to fetch a column from one table using a BETWEEN condition, as in the query below. I joined the tables, but it takes a lot of time when the tables have 100k records. Is there any way to rewrite this?
I need a.grade in the result if my value lies between a.low and a.high. There can be many matches for one value.
Select a.grade, b.val
From tbl1 a, tbl2 b
Where b.val between a.low and a.high;
I also have an index on (low, high), but the optimiser is not using it.

MonetDB simple join performance on 2 tables

Let's assume I have two tables of the same row count. Both tables contain a column that allows for 1-1 join between them.
If those tables were turned into one table instead and thus JOIN statement eliminated from the query, would there be any performance benefit of that?
Another example... Let's assume I have table with 10 columns. From that table I created new table but only taking one column. If I issue statement selecting that one column with WHERE predicate on the same column would there be any performance difference in executing this query on both tables?
What I'm trying to get to is if performance is the same in above cases is it safe to say tables are only containers wrapping number of columns together?
I did run a couple of tests, but with inconclusive results.
Let's assume I have two tables of the same row count. Both tables
contain a column that allows for 1-1 join between them. If those
tables were turned into one table instead and thus JOIN statement
eliminated from the query, would there be any performance benefit of
that?
Performing that join for every query is of course more expensive than materializing the table once and then reading it. So yes, there would be a performance benefit.
Another example... Let's assume I have table with 10 columns. From
that table I created new table but only taking one column. If I issue
statement selecting that one column with WHERE predicate on the same
column would there be any performance difference in executing this
query on both tables?
No, there would be no difference, since tables are represented as collections of columns, which are each stored in their own file.
What I'm trying to get to is if performance is the same in above cases
is it safe to say tables are only containers wrapping number of
columns together?
That is indeed safe to say.
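As a rough mental model only (this is a toy in Python, not MonetDB's actual storage code), a column-oriented table can be pictured as a set of independent arrays, one per column. The class and method names below are hypothetical. Filtering on one column touches only that column's array, whether the table has one column or ten.

```python
class ColumnTable:
    """Toy column-oriented table: each column is its own independent list."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same row count")
        self.columns = dict(columns)
        self.reads = []  # names of the columns touched by scans

    def select_where(self, column, predicate):
        """Return the values of one column that satisfy the predicate."""
        self.reads.append(column)  # only this column is ever read
        return [v for v in self.columns[column] if predicate(v)]


# A 10-column table and a 1-column table holding the same c0 values.
wide = ColumnTable({f"c{i}": list(range(100)) for i in range(10)})
narrow = ColumnTable({"c0": list(range(100))})

wide_result = wide.select_where("c0", lambda v: v < 5)
narrow_result = narrow.select_where("c0", lambda v: v < 5)

print(wide_result == narrow_result, wide.reads)
```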

Slow query on view when using single column order by clause

I have a view that joins 24 tables (all but 1 are left outer joins) and exposes 83 columns. When I select * from the view without an ORDER BY clause, it returns 27k rows with all columns in about 4:27 seconds. If I run the same select with an 'order by requestId' clause added, it takes 83 minutes to complete.
The column being ordered by is indexed in the original table.
I've tried wrapping it in a Select * from (.......) order by requestId, but I get the same results.
Any suggestions on where to look?
EXPLAIN might tell you more, if you have the time to wade through it. My guess is that it's doing a full sort of all 27,000 rows because it can't find a useful ordered index to avoid the extra sort.
It will be hard to spot amongst what you have but a simple scenario would be
TableA KeyColumn,DataColumn, where key column is the primary key
Select * From TableA Order By KeyColumn, will use the PK index which is in order, so no sort required.
select * From TableA Order By DataColumn, will read the table and then do a sort.
Add an Index for Datacolumn, and the sort won't be required.
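The simple scenario above can be checked with a small sketch in Python. It uses SQLite's EXPLAIN QUERY PLAN as a stand-in for whatever database the view lives on, so the plan wording is SQLite-specific and other engines may plan differently. The table and index names are made up to match the scenario. SQLite reports an explicit sort as a "USE TEMP B-TREE FOR ORDER BY" step, so the helper just looks for that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_a (key_column INTEGER PRIMARY KEY, data_column TEXT)"
)


def plan(sql):
    """Return the detail strings of SQLite's query plan for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def needs_sort(sql):
    """True if the plan contains an explicit sort step for ORDER BY."""
    return any("TEMP B-TREE" in step for step in plan(sql))


# Ordering by the primary key can follow the table's own order.
by_key = needs_sort("SELECT * FROM table_a ORDER BY key_column")

# Ordering by an unindexed column requires a separate sort.
by_data_before = needs_sort("SELECT * FROM table_a ORDER BY data_column")

# After adding an index on that column, the sort step should disappear.
conn.execute("CREATE INDEX idx_data ON table_a (data_column)")
by_data_after = needs_sort("SELECT * FROM table_a ORDER BY data_column")

print(by_key, by_data_before, by_data_after)
```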
As soon as you get to more complex scenarios, it might be that you have a useful index for ordering, but it's not the best one for joining, so the database does the join fast and then spends all its time ordering.
If I were looking at this and what to do didn't leap out at me (e.g. no index at all for requestId), I'd start chopping tables out of the query until I stopped getting the undesirable behaviour. Then I'd put one back in to get the behaviour back, run EXPLAIN on this hopefully less arduous query, and see whether I could add a useful index or restate the query to use a more useful index.
Best of luck.
If you order by a column, then for an index to help with the sort, that column should be part of either:
- An index of its own where only that column exists
- Or an index that has the fields in the WHERE clauses and then the fields in the ORDER BY clause, in that exact order.
Best if you show me the query. Then I can brainstorm with you.

How can I speed up a diff between tables?

I am working on doing a diff between tables in postgresql, it takes a long time, as each table is ~13GB...
My current query is:
SELECT * FROM tableA EXCEPT SELECT * FROM tableB;
and
SELECT * FROM tableB EXCEPT SELECT * FROM tableA;
When I do a diff on the two (unindexed) tables it takes 1:40 hours (1 hour and 40 minutes). In order to get both the new and the removed rows I need to run the query twice, bringing the total time to 3:20 hours.
I ran the Postgresql EXPLAIN query on it to see what it was doing. It looks like it is sorting the first table, then the second, then comparing them. Well that made me think that if I indexed the tables they would be presorted and the diff query would be much faster.
Indexing each table took 45 minutes. Once Indexed, each Diff took 1:35 hours.
Why do the indexes only shave off 5 minutes off the total diff time? I would assume that it would be more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice)
Since one of these tables will not be changing much, it will only need to be indexed once, the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index, plus 2x 1:35 for the diff, giving a total of 3:55 hours, almost 4hours.
What am I doing wrong here, I can't possibly see why with the index my net diff time is larger than without it?
This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries
EDIT:
Here is the schema for the two tables, they are identical except the table name.
CREATE TABLE bulk.blue
(
"partA" text NOT NULL,
"type" text NOT NULL,
"partB" text NOT NULL
)
WITH (
OIDS=FALSE
);
In the statements above you are not using the indexes.
You could do something like:
SELECT * FROM tableA a
FULL OUTER JOIN tableB b ON a.someID = b.someID
You could then use the same statement to show which tables had missing values
SELECT * FROM tableA a
FULL OUTER JOIN tableB b ON a.someID = b.someID
WHERE a.someID IS NULL OR b.someID IS NULL
This should give you the rows that were missing in table A OR table B
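A minimal runnable sketch of the same idea, in Python with an in-memory SQLite database standing in for PostgreSQL. The tables, the some_id key and the payload column are hypothetical, as the question's schema has no ID column. PostgreSQL supports FULL OUTER JOIN directly; the sketch emulates it with two LEFT JOINs only because SQLite versions before 3.39 lack it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (some_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE table_b (some_id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO table_a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO table_b VALUES (2, 'y'), (3, 'z'), (4, 'w');
""")

# Rows whose key has no partner on the other side, labelled by side.
diff = conn.execute("""
    SELECT 'only_in_a' AS side, a.some_id
    FROM table_a a
    LEFT JOIN table_b b ON a.some_id = b.some_id
    WHERE b.some_id IS NULL
    UNION ALL
    SELECT 'only_in_b' AS side, b.some_id
    FROM table_b b
    LEFT JOIN table_a a ON a.some_id = b.some_id
    WHERE a.some_id IS NULL
    ORDER BY 1, 2
""").fetchall()

print(diff)
```

Note that matching on the key alone reports only rows that are missing on one side. A row with the same some_id but a different payload would not show up, unlike with EXCEPT, which compares every column.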
Confirm that your indexes are being used (they likely are not in such a generic EXCEPT statement). You are not joining on any specified columns, so the lack of an explicit join will likely not make for an optimized query:
http://www.postgresql.org/docs/9.0/static/indexes-examine.html
This will help you view the explain analyze more clearly:
http://explain.depesz.com
Also, make sure you run ANALYZE on the table after you create the index if you want it to perform well right away.
The queries as specified require a comparison of every column of the tables.
For example if tableA and tableB each have five columns then the query is having to compare tableA.col1 to tableB.col1, tableA.col2 to tableB.col2, . . . tableA.col5 to tableB.col5
If just a few columns uniquely identify a record, rather than all the columns in the table, then joining the tables on those specific columns will improve your performance.
The above statement assumes that a primary key has not been created. If a primary key has been defined to indicate which columns uniquely identify a record, then I believe the EXCEPT statement would take that into consideration.
What kind of index did you apply? Indexes mainly help when a query filters or joins on the indexed columns. If you're doing a select * with no such condition, you're grabbing all the fields, and the index is probably not doing anything except taking up space and adding a little more work behind the scenes for the database engine.
Instead of SELECT *, you can try selecting your unique fields and create an index for those unique fields
You can also use an OUTER JOIN to show results from both tables that did not match on the unique fields
You may also want to consider clustering your tables.
What version of Postgres are you running?
When was the last time you vacuumed?
Other than the above, 13GB is pretty large, so you'll want to check your config settings. It shouldn't take hours to run that, unless you don't have enough memory on your system.

Query does cartesian join unless

I've got a query that's supposed to return 2 rows. However, it returns 48 rows. It's acting like one of the tables that's being joined isn't there. But if I add a column from that table to the select clause, with no changes to the from or where parts of the query, it returns 2 rows.
Here's what "Explain plan" says without the "m.*" in the select:
Here it is again after adding m.* in the select:
Can anybody explain why it should behave this way?
Update: We only had this problem on one system and not another. The DBA verified that the one with the problem is running optimizer_features_enable set to 10.2.0.5, and the one where it doesn't happen is running optimizer_features_enable set to 10.2.0.4. Unfortunately the customer site is running 10.2.0.5.
It's about a join elimination that was introduced in 10gR2:
Table elimination (alternately called
"join elimination") removes redundant
tables from a query. A table is
redundant if its columns are only
referenced to in join predicates, and
it is guaranteed that those joins
neither filter nor expand the
resulting rows. There are several
cases where Oracle will eliminate a
redundant table.
This may be a related bug. Have a look at this article.
Looks like a bug. What are the constraints ?
Logically, if all rows in MASTERSOURCE_FUNCTION had the function NON-OSDA then that wouldn't exclude any rows (or if none had that value, then all rows would be excluded).
Going one step further, if every row in MASTERSOURCE had one or zero NON-OSDA rows in MASTERSOURCE_FUNCTION, then it should be a candidate for exclusion. But there would also need to be a one-to-one between the MASTERSOURCE ID and NAME.
I'd pull the ROWIDs from ACCOUNTSOURCE for the 48 rows, then track the MASTERSOURCE ID and NAME and see on what grounds those rows are being duplicated or not excluded. That is, are there 12 duplicate names in MASTERSOURCE where it is expected to be unique through a NOVALIDATE constraint.
