How can I speed up a diff between tables?

I am working on a diff between two tables in PostgreSQL. It takes a long time, as each table is ~13GB.
My current query is:
SELECT * FROM tableA EXCEPT SELECT * FROM tableB;
and
SELECT * FROM tableB EXCEPT SELECT * FROM tableA;
When I diff the two (unindexed) tables, each query takes 1:40 (1 hour and 40 minutes). To get both the new and the removed rows I need to run the query twice, bringing the total time to about 3:20 hours.
I ran EXPLAIN on the query to see what it was doing. It looks like it sorts the first table, then the second, then compares them. That made me think that if I indexed the tables they would already be sorted, and the diff query would be much faster.
Indexing each table took 45 minutes. Once indexed, each diff took 1:35.
Why do the indexes shave only 5 minutes off each diff? I would assume the saving would be more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice).
Since one of these tables will not change much, it will only need to be indexed once; the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index plus 2 x 1:35 for the diffs, giving a total of 3:55 hours, almost 4 hours.
What am I doing wrong here? I can't see why my net diff time is larger with the index than without it.
This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries
EDIT:
Here is the schema for the two tables; they are identical except for the table name.
CREATE TABLE bulk.blue
(
  "partA" text NOT NULL,
  "type" text NOT NULL,
  "partB" text NOT NULL
)
WITH (
  OIDS=FALSE
);

In the statements above you are not using the indexes.
You could do something like:
SELECT * FROM tableA a
FULL OUTER JOIN tableB b ON a.someID = b.someID
You could then extend the same statement to show which rows are missing from either table:
SELECT * FROM tableA a
FULL OUTER JOIN tableB b ON a.someID = b.someID
WHERE a.someID IS NULL OR b.someID IS NULL
This should give you the rows that are missing in table A or table B.
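Adapted to the posted schema, which has no single ID column, a sketch might join on all three columns (this assumes the combination of "partA", "type" and "partB" uniquely identifies a row, which the question does not confirm; bulk.green is a made-up name for the second table):
SELECT a.*, b.*
FROM bulk.blue a
FULL OUTER JOIN bulk.green b
ON  a."partA" = b."partA"
AND a."type"  = b."type"
AND a."partB" = b."partB"
WHERE a."partA" IS NULL -- row only in bulk.green (added)
OR    b."partA" IS NULL; -- row only in bulk.blue (removed)
Because all three columns are declared NOT NULL, a NULL here can only mean the row had no match on the other side.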

Confirm your indexes are being used (they likely are not with such a generic EXCEPT statement). Also, since you are not joining on a specified column or columns, the lack of an explicit join will likely prevent an optimized query:
http://www.postgresql.org/docs/9.0/static/indexes-examine.html
This will help you read the EXPLAIN ANALYZE output more clearly:
http://explain.depesz.com
Also, make sure you run ANALYZE on the table after you create the index if you want it to perform well right away.
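For example (a minimal sketch; the index name is made up, and note that EXPLAIN ANALYZE actually executes the query):
CREATE INDEX tablea_key_idx ON tableA ("partA", "type", "partB"); -- columns from the posted schema
ANALYZE tableA; -- refresh planner statistics so the new index is costed correctly
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tableA EXCEPT SELECT * FROM tableB;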

The queries as specified require a comparison of every column of the tables.
For example, if tableA and tableB each have five columns, then the query has to compare tableA.col1 to tableB.col1, tableA.col2 to tableB.col2, ... tableA.col5 to tableB.col5.
If just a few columns uniquely identify a record, rather than all the columns in the table, then joining the tables on those specific columns will improve your performance.
The above assumes that no primary key has been created. If a primary key has been defined to indicate which columns uniquely identify a record, then I believe the EXCEPT statement would take that into consideration.
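A minimal sketch of declaring such a key (this assumes "partA", "type" and "partB" together uniquely identify a record, which the question does not confirm):
ALTER TABLE bulk.blue
ADD CONSTRAINT blue_pk PRIMARY KEY ("partA", "type", "partB");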

What kind of index did you apply? Indexes are mainly useful for improving WHERE conditions. If you're doing a SELECT *, you're grabbing all the fields, and the index is probably not doing anything except taking up space and adding a little more processing behind the scenes for the db engine to compare the query to the index cache.
Instead of SELECT *, try selecting only your unique fields, and create an index on those unique fields.
You can also use an OUTER JOIN to show results from both tables that did not match on the unique fields.
You may also want to consider clustering your tables (see the sketch at the end of this answer).
What version of Postgres are you running?
When was the last time you vacuumed?
Other than the above, 13GB is pretty large, so you'll want to check your config settings. It shouldn't take hours to run that, unless you don't have enough memory on your system.
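A minimal sketch of the clustering step suggested above (the index name and column list are illustrative, not from the question; note that CLUSTER rewrites the table and locks it while it runs):
CREATE INDEX tablea_key_idx ON tableA ("partA", "type", "partB");
CLUSTER tableA USING tablea_key_idx; -- physically reorder the table rows in index order
VACUUM ANALYZE tableA;               -- reclaim space and refresh planner statistics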

Related

Fact table is being populated with too many records

There are 62000 records in my fact table, which is not correct, because I only have six records in my time dim, 240 records in my student dim, and 140 records in my placement dim. Does it have something to do with my WHERE clause? Any help would be much appreciated.
INSERT INTO fact_placements (
report_id,
no_of_placements,
no_of_students,
fk1_time_id,
fk2_placement_id,
fk3_student_id )
SELECT
fact_seq.nextval,
no_of_placements,
no_of_students,
time_id,
placement_id,
student_id
FROM
time_dim,
placement_dim,
student_dim
WHERE
placement_dim.year = time_dim.year AND
student_dim.year = time_dim.year;
Unless you do a Cartesian join, i.e. without any WHERE clause, you will get fewer than 140 (placement) * 240 (student) * 6 (time) = 201600 fact records. Your current SQL joins on the year column of the 3 tables, which filters the result down to the 62000 records you are getting.
Your question title says that even this is "too many". If that is the case, then you need to understand the granularity of your dimensions and the fact before joining them on any criteria. Are these all at the "year" level? If so, do you have one record per year in each of these tables and no duplicates based on year?
If not, you might need to re-think the fact table's granularity, or alternatively join unique records based on year in each dimension to get the actual (smaller) number of records you are expecting, which can also be done by summarizing these tables based on year, as sketched below.
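A rough illustration of summarizing each dimension to one row per year before joining (the MIN() picks are arbitrary placeholders; which record should represent a year depends on your data):
SELECT t.time_id, p.placement_id, s.student_id
FROM time_dim t
INNER JOIN (SELECT year, MIN(placement_id) AS placement_id
            FROM placement_dim GROUP BY year) p ON p.year = t.year
INNER JOIN (SELECT year, MIN(student_id) AS student_id
            FROM student_dim GROUP BY year) s ON s.year = t.year;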
Ideally the fact table contains the combinations of the dimension keys plus the factual metrics as additional columns (in this case no_of_placements and no_of_students). But depending on the available data, not all combinations will be present in the fact table.
Also, you might want to change the SQL syntax to use the INNER JOIN clause instead of the implicit joins with commas between the table names in the FROM clause, as shown below:
FROM time_dim
INNER JOIN placement_dim
        ON placement_dim.year = time_dim.year
INNER JOIN student_dim
        ON student_dim.year = time_dim.year
There's no relationship between placement and student; that's why you have so many records.
Your query is saying: give me all the students and all the placements where the year is the same.
I'm not sure that's what you want. What is really strange here is that you are loading a fact table from dimension tables.

Optimizing a delete... where query with rownum

I'm working with an application that has a large amount of outdated data clogging up a table in my database. Ideally, I'd want to delete all entries in the table whose reference date is too old:
delete outdatedTable where referenceDate < :deletionCutoffDate
If this statement were to be run, it would take ages to complete, so I'd rather break it up into chunks with the following:
delete outdatedTable where referenceDate < :deletionCutoffDate and rownum <= 10000
In testing, this works surprisingly slowly. The following query, however, runs dramatically faster:
delete outdatedTable where rownum <= 10000
I've been reading through multiple blogs and similar questions on Stack Overflow, but I haven't yet found a straightforward description of how (or whether) rownum affects the Oracle optimizer when there are other WHERE conditions in the query. In my case, it seems as if Oracle checks
referenceDate < :deletionCutoffDate
on every single row, executes a massive select of all matching rows, and only then filters out the top 10000 rows to return. Is this in fact the case? If so, is there any clever way to make Oracle stop checking the WHERE clause as soon as it has found enough matching rows?
How about a different approach with less DML on the table? As a permanent solution for the future, you could go for table partitioning.
Create a new table with required partition(s).
Move ONLY the required rows from your existing table to the new partitioned table.
Once the new table is populated, add the required constraints and indexes.
Drop the old table.
In future, you would just need to DROP the old partitions.
CTAS (CREATE TABLE AS SELECT) is another way; however, if you want the new table to be partitioned, you would have to use the exchange-partition approach.
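A minimal sketch of that exchange idea (all names are made up; EXCHANGE PARTITION swaps a standalone table with a partition's segment almost instantly):
-- 1. copy only the rows you want to keep into a plain staging table
CREATE TABLE outdated_keep AS
SELECT * FROM outdatedTable WHERE referenceDate >= :deletionCutoffDate;
-- 2. swap the staging table into the matching partition of the new partitioned table
ALTER TABLE outdated_part EXCHANGE PARTITION p_current WITH TABLE outdated_keep;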
First of all, you should read about SQL execution plans and learn how to use EXPLAIN PLAN. It will help you find answers to questions like this.
Generally, a single delete is more efficient than several chunked ones. Its main disadvantage is heavy use of the undo tablespace.
If you wish to delete most of the rows in a table, a much faster way is usually this trick:
create table new_table as select * from old_table where date >= :date_limit;
drop table old_table;
rename new_table to old_table;
... recreate indexes and other stuff ...
If you wish to do this more than once, partitioning is a much better way. If the table is partitioned by date, you can select current data quickly, and you can drop a partition of outdated data in milliseconds.
Finally, partitioning is a way to do away with 'deleting outdated records' altogether. Sometimes we need old data, and it's a shame if we have deleted it with our own hands. With partitioning you can archive outdated partitions outside of the database, and reattach them when you need to access the old data.
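A minimal sketch of such a date-partitioned layout (names, columns, and partition bounds are illustrative):
CREATE TABLE outdated_part (
  referenceDate DATE NOT NULL,
  payload       VARCHAR2(100)
)
PARTITION BY RANGE (referenceDate) (
  PARTITION p2011 VALUES LESS THAN (DATE '2012-01-01'),
  PARTITION p2012 VALUES LESS THAN (DATE '2013-01-01')
);
-- dropping a whole period of outdated data is then a near-instant metadata operation:
ALTER TABLE outdated_part DROP PARTITION p2011;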
This is an old request, but I'd like to show another approach (also using partitions).
Depending on what you consider old, you could create corresponding partitions (optimally exactly two: one current, one old; but you could just as well make more), e.g.:
PARTITION BY LIST ( MOD(EXTRACT(YEAR FROM referenceDate), 2) ) -- in Oracle, via a virtual column
(
  PARTITION year_odd VALUES (1),
  PARTITION year_even VALUES (0)
);
This could just as well be months (Jan, Feb, ... Dec), decades (XX0X, XX1X, ... XX9X), half years (first_half, second_half), etc. Anything cyclic.
Then whenever you want to get rid of old data, truncate:
ALTER TABLE mytable TRUNCATE PARTITION year_even;
delete from your_table
where PK not in
(select PK from your_table where rownum <= ...) -- these are the records you want to leave in place

Is there any use in creating an index on all the table columns in Oracle?

In one of our production databases, we have a 4-column table with no PK or UK constraints on it, only one NOT NULL constraint on one column. Inserts into this table are slow, and when I checked the indexes, there is one index built on all four columns.
It is a normal table, not an IOT. I really don't see the need for an all-column index, but I wonder why the developers created it.
I'd appreciate your thoughts.
It might be useful: if you (mainly) query all columns, Oracle doesn't have to access the table at all, but can get all the data from the index. However, inserts take longer because a larger index has to be maintained by the DBMS on every modification.
One case where it could be useful:
Say, for example, you are trying to check the existence of records in this table, and to do that you have to join on all four columns. In such a case, if you have written a correlated query like the one below, the all-column index can satisfy it:
SELECT <something>
FROM table_1 t1
WHERE EXISTS
(SELECT 1 FROM table_2 t2
 WHERE t1.c1 = t2.c1 AND t1.c2 = t2.c2
   AND t1.c3 = t2.c3 AND t1.c4 = t2.c4)
Apart from a case like the above, it looks like an error on the developers' side to me.
Indexes help query optimization but slow down updates and inserts, because each index needs to be updated on every modification.
If the table is mostly queried and inserts happen only in specific periods, like a batch at the beginning or the end of the day, then you can drop the indexes before loading the table and recreate them afterwards, as sketched below.
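A minimal sketch of that drop/load/recreate cycle (the index name and column list are illustrative, not taken from the question):
DROP INDEX idx_all_cols;
-- ... run the batch inserts here ...
CREATE INDEX idx_all_cols ON my_table (c1, c2, c3, c4);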
In addition, all the queries against these tables need to be analysed to see which indexes are useful and which are not.
Anyway, you need to ask the developers before removing these indexes.

Slow query on view when using single column order by clause

I have a view that joins 24 tables (all but one are left outer joins) and returns 83 columns. When I SELECT * from the view without an ORDER BY clause, it returns 27k rows (all columns) in about 4:27 seconds. If I do the same SELECT but add an 'ORDER BY requestId' clause, it takes 83 minutes to complete.
The column being ordered by is indexed in the original table.
I've tried wrapping it in a SELECT * FROM (.......) ORDER BY requestId, but I get the same results.
Any suggestions on where to look?
EXPLAIN might tell you more, if you have the time to wade through it, but at a guess I'd say it's doing a full sort of all 27,000 rows because it can't find a usefully ordered index that would avoid the extra sort.
It will be hard to spot amongst what you have, but a simple scenario would be:
TableA (KeyColumn, DataColumn), where KeyColumn is the primary key.
SELECT * FROM TableA ORDER BY KeyColumn will use the PK index, which is already in order, so no sort is required.
SELECT * FROM TableA ORDER BY DataColumn will read the table and then do a sort.
Add an index on DataColumn, and the sort won't be required (see the sketch below).
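A minimal sketch of that index (the names come from the illustrative scenario above, not from the asker's schema):
CREATE INDEX ix_tablea_datacolumn ON TableA (DataColumn);
-- ORDER BY DataColumn can now walk the index instead of sorting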
As soon as you get to more complex scenarios, it might be that you have a useful index for ordering, but it's not the best one for joining, so the engine does the join fast and then spends all the time ordering.
If I were looking at this and the fix didn't leap out at me (e.g. no index at all on requestId), then I'd start chopping tables out of the query until the undesirable behaviour stopped, then put one back in to bring it back. Then I'd take this hopefully less arduous query, run EXPLAIN on it, and see if I could add a useful index or restate the query to use a more useful one.
Best of luck.
If you have an ORDER BY on a column, then that column should be part of either:
- an index of its own, where only that column exists, or
- an index that has the fields from the WHERE clause first and then the fields from the ORDER BY clause, in that exact order (see the sketch below).
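A minimal sketch of the second form (table, column, and index names are hypothetical):
CREATE INDEX ix_requests_status_requestid
ON requests (status, requestId); -- WHERE column(s) first, then the ORDER BY column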
Best if you show me the query. Then I can brainstorm with you.

T-SQL Performance dilemma when looping through a Big table (details inside)

Let's say I have a Big and a Bigger table.
I need to cycle through the Big table, which is indexed but not sequential (since it is a filter of the sequentially indexed Bigger table).
For this example, let's say I needed to cycle through about 20000 rows.
Should I do 20000 of these
set @currentID = (select min(ID) from myData where ID > @currentID)
or
creating a (big) temporary, sequentially indexed table (a copy of the Big table) and doing 20000 of
@Row = @Row + 1
?
I imagine that doing 20000 filters of the Bigger table just to fetch the next ID is heavy, but so must be filling a big (Big sized) temporary table just to add a dummy identity column.
Is the solution somewhere else?
For example, if I could loop through the results of the SELECT statement (the filter of the Bigger table that produces the "table", actually a result set, Big) without needing to create temporary tables, that would be ideal; but I seem to be unable to add something like an IDENTITY(1,1) dummy column to the results.
Thanks!
You may want to consider finding out how to do your work set-based instead of RBAR (row by agonizing row). That said, for very big tables you may want to avoid a temp table, so that you are sure you are working with live data if you suspect the proc may run for a while in production: if the proc fails, you'll be able to pick up where you left off. If you use a temp table and the proc crashes, you could lose track of the work that hasn't been completed yet.
You need to provide more information on what your end result is. It is only very rarely necessary to do row-by-row processing, and it is almost always the worst possible choice from a performance perspective. This article will get you started on how to do many tasks in a set-based manner:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
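As a generic illustration of the set-based idea (the table names and the processed flag are hypothetical, not from the question):
-- RBAR: loop 20000 times, processing one row per iteration
-- set-based: one statement that handles all qualifying rows at once
UPDATE b
SET b.processed = 1
FROM TableBig b
JOIN TableBigger bb ON bb.ID = b.ID
WHERE b.processed = 0;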
If you just want a temp table with an identity, here are two methods:
create table #temp (test varchar(10), id int identity)
insert #temp (test)
select test from mytable
Or, in one step with SELECT ... INTO:
select test, identity(int) as id into #temp from mytable
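And a minimal sketch of looping over that temp table by its identity column (using the #temp from the first method; the per-row work goes where the comment is):
declare @Row int, @MaxRow int
set @Row = 1
select @MaxRow = max(id) from #temp
while @Row <= @MaxRow
begin
    -- process the row where id = @Row here
    set @Row = @Row + 1
end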
I think a join will serve your purposes better.
SELECT BIG.*, BIGGER.* -- add additional calcs here involving BIG and BIGGER
FROM TableBig BIG (NOLOCK)
JOIN TableBigger BIGGER (NOLOCK)
ON BIG.ID = BIGGER.ID
This will limit the set you are working with. But again, it comes down to the specifics of your solution.
Remember, you can do bulk inserts and bulk updates in this manner too.
