I have simple SQL table structure where there is three tables: entry (~70k rows), tag (~27k rows) and entry_tag (~2.5M rows). entry_tag -table contains relations between entries and tags (n-n relation) like this entry_tag (entry_id, tag_id, ...).
When I first tried to fetch all the entry_ids with particular tag, I made query like
SELECT entry_id FROM entry_tag et, tag t
WHERE t.id=et.tag_id AND t.name LIKE 'Something';
For no apparent reason query took over 6 secs on Adobe Air app on Win7 x64 (i5-2500K, 8GB RAM, 7200rpm hdd), which is way too slow (particularly with more complex INTERSECT queries). In SQLite command line tool query returned results immediately. I compared EXPLAIN and EXPLAIN QUERY PLAN between cli and Air (using Air based SQLite Sorcerer) and the results where identical. Both queries were using indices.
After I restructured the query by switching the order of the tables, performance got multiplied in Air to 250ms, which is satisfactory:
SELECT entry_id FROM tag t, entry_tag et
WHERE et.tag_id=t.id AND t.name LIKE 'Something';
So I was left wondering, why the cli tool was so much faster than Adobe Air environment? For background information I have generated the db file with PHP (all columns INTEGER or TEXT).
Here is the query plan for the original slow query:
sele order from deta
---- ------------- ---- ----
0 0 0 SCAN TABLE entry_tag AS et USING INDEX idx_et_eid (~1000000 rows)
0 1 1 SEARCH TABLE tag AS t USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
(~1000000 rows looks anyways bad ;)
Related
I work on a project to transfer data from an Oracle database to a PostgreSQL database to create a datawarehouse with bash & SQL scripts. To access to the Oracle database, I use the PostgreSQL extension oracle-fdw.
One of my scripts import data from a massive table (~ 100 000 000 new rows/day). This table is partitioned and each partition contains 1 day of data. The query I use to import data looks like that :
INSERT INTO postgre_target_table (some_fields)
SELECT some_aggregated_fields -- (~150 fields)
FROM oracle_source_table
WHERE partition_id = :v_partition_id AND some_others_filters
GROUP BY primary_key;
On DEV server, the query works fine (there is much less data on this server) but in PREPROD, it returns the error ORA-01406: fetched column value was truncated.
In some posts, people say that the output fields may be too small but if I try to send a simple SELECT query without INSERT or GROUP BY I have the same error.
Another idea I found in another post is to create an Oracle side view but in my query I use multiple parameters that I cannot use in a view.
The last idea I found is to create an Oracle stored procedure that fills a table with aggregated data and then import data from this table but the Oracle database is critical and my customer prefers to avoid adding more data on it.
Now, I'm starting to think there's no solution and it's not good...
PostgreSQL version : 12.4 / Oracle version : 11.2
UPDATE
It seems my problem is more complecated than I thought.
After applying the modification given by Laurenz Albe, the query runs correctly on PGAdmin but the problem still appears when I use psql command.
Moreover, another query seems to have the same problem. This other query does not use the same source table as the first query, it uses 4 joined tables without any partition. The common point between these queries is the structure.
The detail I omit to specify in the original post is that the purpose of both queries is to pivot a table. They look like that :
SELECT osr.id,
MIN(CASE osr.category
WHEN 123 THEN
1
END) AS field1,
MIN(CASE osr.category
WHEN 264 THEN
1
END) AS field2,
MIN(CASE osr.category
WHEN 975 THEN
1
END) AS field3,
...
FROM oracle_source_table osr
WHERE osr.category IN (123, 264, 975, ...)
GROUP BY osr.id;
Now that I have detailed what the queries look like, I can give you some results I had with the second one without changing the value of max_long (this query is lighter than the first one) :
Sometimes it works (~10%), sometimes it failed (~90%) on PGadmin but it never works with psql command
If I delete the WHERE, it always works
I don't understand why deleting the WHERE change something, the field used in this clause is a NUMBER(6, 0) between 0 and 2500 and it is still used in the SELECT clause... Oh and in the 4 Oracle tables used by this query, there is no LONG datatype, only NUMBER datatype is used.
Among 20 queries I have, only these two have a problem, their structure is similar and I don't believe in coincidences.
Don't despair!
Set the max_long option on the foreign table big enough that all your oversized data fit.
The documentation has the details:
max_long (optional, defaults to "32767")
The maximal length of any LONG, LONG RAW and XMLTYPE columns in the Oracle table. Possible values are integers between 1 and 1073741823 (the maximal size of a bytea in PostgreSQL). This amount of memory will be allocated at least twice, so large values will consume a lot of memory.
If max_long is less than the length of the longest value retrieved, you will receive the error message
ORA-01406: fetched column value was truncated
Example:
ALTER FOREIGN TABLE my_tab OPTIONS (ADD max_long '1000000');
There is a query having multiple inner joins. It involves two views, of which one view is based on four tables, and total there are four tables(including two views).
The same query with the same amount of data in the source tables runs in both, Oracle and DB2. In DB2, surprisingly, it takes 2 minutes to load 3 million records. While in Oracle, it is taking two hours. Same indexes are on all source tables in both the environments. Is the behavior of views (when used in joins) different in both environments (Oracle vs DB2)?
a dummy query I am sharing :-
INSERT INTO TABLE_A
SELECT
adf.column1,
adf.column2,
dd.column3,
SUM(otl.column4) column4,
SUM(otl.column5) column5,
(Case when SUM(otl.column5) = 0 then 0
else round(CAST(SUM(otl.column4) AS DECIMAL(19,2)) /abs(CAST(SUM(otl.column4) AS DECIMAL(18,2))),4)
end) taxl_unrlz_cgl_pct
FROM
view_a adf
INNER JOIN table_b hr on hr.hh_ref_id = adf.hh_ref_id
AND hr.col_typ_cd = 'FIRM'
AND hr.col_end_dt = TO_DATE('1/1/2900','MM/DD/YYYY')
INNER JOIN dw.table_c ar on ar.colb_id = adf.colb_id
AND ar.col_cd = '#'
AND ar.col_num BETWEEN 10000000 AND 89999999
AND ar.col_dt IS NULL
INNER JOIN table_d dd on dd.col_id = adf.col_id
INNER JOIN view2 otl ON otl.cola_id = ar.cola_id
GROUP BY adf.column1, adf.column2, dd.column3;
Technically, both DB2 and Oracle will try to rewrite the query in most efficient way possible using the base query that you have coded. But one of the common (but not frequent) issues that I have seen when using multi-table view is DBMS not being able to rewrite the query using underlying tables. So depending on complexity of the view itself and sometime the additional joins, DBMS may not be able to rewrite the query to use the underlying tables properly and hence resulting in not being able to use the indexes on the underlying tables used in the view. When this happens, the view itself acts like a materialized table (work table) and query goes for table scan on the materialized table.
There is no consistent pattern on when such issue can happen. So you will need to check on a case by case basis.
Since you are mentioning about 2 hrs vs 2 minutes, in most probability that might be the case. So you will need to check the access path on both Oracle and DB2. But you will also need to make sure that stats are updated and access path is based on latest stats on DBMS. Else it won't be apples to apples compare.
I’m a longtime MSSQL developer who finds himself back in PL/SQL for the first time since Oracle 7. I’m looking for some tuning advice re a large export stored procedure, which is sporadically and not very reproducably running slow at certain points. This happens around some static working tables which it truncates, fills and uses as part of the export. The code in outline typically looks like this:
create or replace Procedure BigMultiPurposeExport as (
-- about 2000 lines of other code
INSERT WORK_TABLE_5 SELECT WHATEVER1 FROM WHEREVER1;
INSERT WORK_TABLE_5 SELECT WHATEVER2 FROM WHEREVER2;
INSERT WORK_TABLE_5 SELECT WHATEVER3 FROM WHEREVER3;
INSERT WORK_TABLE_5 SELECT WHATEVER4 FROM WHEREVER4;
-- WORK_TABLE_5 now has 0 to ~500k rows whose content can vary drastically from run to run
-- e.g. one hourly run exports 3 whale sightings, next exports all tourist visits to Kenya this decade
-- about 1000 lines of other code
INSERT OUTPUT_TABLE_3
SELECT THIS, THAT, THE_OTHER
FROM BUSINESS_TABLE_1 BT1
INNER JOIN BUSINESS_TABLE_2 ON etc -- typical join on indexed columns
INNER JOIN BUSINESS_TABLE_3 ON etc -- typical join on indexed columns
INNER JOIN BUSINESS_TABLE_4 ON etc -- typical join on indexed columns
LEFT OUTER JOIN WORK_TABLE_1 ON etc -- typical join on indexed columns
LEFT OUTER JOIN WORK_TABLE_2 ON etc -- typical join on indexed columns
LEFT OUTER JOIN WORK_TABLE_3 ON etc -- typical join on indexed columns
LEFT OUTER JOIN WORK_TABLE_4 ON etc -- typical join on indexed columns
LEFT OUTER JOIN WORK_TABLE_5 WT5 ON BT1.ID = WT5.BT1_ID AND WT5.RECORD_TYPE = 21
-- join above is now supported by indexes on BUSINESS_TABLE_1 (ID) and WORK_TABLE_5 (BT1_ID, RECORD_TYPE), originally wasn't
LEFT OUTER JOIN WORK_TABLE_6 ON etc -- typical join on indexed columns
LEFT OUTER JOIN WORK_TABLE_7 ON etc -- typical join on indexed columns
-- about 4000 lines of other code
)
That final insert into OUTPUT_TABLE_3 usually runs in under 10 seconds, but once in a while on certain customer servers it times out at our default 99 minutes. Then we have them take the tiemout off and run it on Friday night, and it finishes but takes 16 hours.
I narrowed the problem down to the join to WORK_TABLE_5, which had no index support, and put an index on the join terms. The next run took 4 seconds. But success has been intermittent, the customer occasionally gets some slow runs when they drastically change their export selection (i.e. drastically change the data in WORK_TABLE_5). And if we update statistics and rebuild indexes after a timed out export, it runs fine at the next attempt.
So, I am wondering about how best to handle truncating/filling static work tables with static indexes, statistics updated overnight, and a stored procedure compiled when the statistics are nothing like runtime.
I have a few general questions about things I'd like to understand better:
Is the nature of the data in the work table going to substantially effect the query plan? Does Oracle form its query plan when you compile the stored procedure? Could we get a highly inappropriate query plan if we compile the stored procedure with the table empty then use a table with 500k rows at runtime?
I expect that if this were an ad-hoc script then updating statistics on the problem table just before selecting from it would eliminate the sporadic slowdowns. But what if I were to update statistics inside the stored procedure, which is compiled with different statistics from runtime?
Anything else you'd like to add...
Thanks for any advice. I hope my MSSQL preconceptions haven't made me too far off base.
This is happening in Oracle 11g, but the code is deployed to assorted customers using Oracle 10 through 12 and I'd like to cater to all of those if possible.
-- Joel
Huge differences in table or index sizes can most definitely cause performance problems. The solution is to add statistics gathering to the procedure instead of relying on the default statistics jobs.
If you've been away from Oracle since version 7, the most important new feature is the Cost Based Optimizer. Oracle now builds query execution plans based on the optimizer statistics of tables, indexes, columns, expressions, system statistics, outlines, directives, dynamic sampling, etc. If you're a full time Oracle developer you should probably spend a day reading about optimizer statistics. Start with Managing Optimizer Statistics and DBMS_STATS in the official documentation.
Eventually the stored procedure should look like this:
--1: Insert into working tables.
insert into work_table...
--2: Gather statistics on working tables.
dbms_stats.gather_table_stats('SCHEMA_NAME', 'WORK_TABLE', ...);
--3: Use working tables.
insert into other_table select * from work_table...
There are so many statistics features it's hard to know exactly what parameters to use in that second step above. Here are some guesses about some features you might find useful:
DEGREE - One reason people avoid gathering statistics inside a process is the time is takes. You can significantly improve the run time by setting the degree. Although this also uses significantly more resources.
NO_INVALIDATE - It can be tricky to know when exactly are the statistics "set" for a query. Gathering statistics usually quickly invalidates execution plans that were based on old statistics. But not always. If you want to be 100% sure that the next query is using the latest statistics you want to set NO_INVALIDATE=>FALSE.
ESTIMATE_PERCENT In 11g and above you definitely want to use the default, which uses a faster algorithm. In 10g and below you may need to set the value to something low to make the gathering fast enough.
Although Oracle 10g and above comes with default statistics gathering jobs you cannot rely on them for a few reasons:
They are scheduled and may not run at the right time. If a process significantly changes the data then new stats are needed right away, not at 10 PM. If there are a lot of tables that need to be analyzed the job may not get to them all in one day.
Many DBAs disable the jobs. This is ridiculous and almost always a mistake. But you'll find many DBAs that disabled the job because they think they can do it better. Instead of working with the auto tasks, and settings preferences, many DBAs like to throw the whole thing out and replace it with a custom procedure that rots over time.
I have the following query (generated by Entity Framework with standard paging. This is the inner query and I added the TOP 438 part):
SELECT TOP 438 [Extent1].[Id] AS [Id],
[Extent1].[MemberType] AS [MemberType],
[Extent1].[FullName] AS [FullName],
[Extent1].[Image] AS [Image],
row_number() OVER (ORDER BY [Extent1].[FullName] ASC) AS [row_number]
FROM [dbo].[ShowMembers] AS [Extent1]
WHERE 3 = CAST( [Extent1].[MemberType] AS int)
ShowMembers table has about 11K rows, but only 438 with MemberType == 3. The 'Image' column is of type nvarchar(2000) that holds the URL to the image on a CDN. If I include this column in the query (only in SELECT part), the query chokes up somehow and generates result in a range between 2-30 seconds (it differs in different runs). If I comment out that column, query runs fast as expected. If I include the 'Image' column, but comment out the row_number column, query also runs fast as expected.
Obviously, I've been too liberal with the size of the URL, so I started playing around with the size. I found out that if I set the Image column to nvarchar(884), then the query runs fast as expected. If I set it up to 885 it's slow again.
This is not bound to one column, but to the size of all columns in the SELECT statement. If I just increase the size by one, performance differences are obvious.
I am not a DB expert, so any advice is welcomed.
PS In local SQL Server 2012 Express there are no performance issues.
PPS Running the query with OFFSET 0 ROWS FETCH NEXT 438 ROWS ONLY (without the row_count column of course) is also slow.
Row_number has to sort all the rows to get you things in the order you want. Adding a larger column into the result set implies that it all get sorted and thus is much slower/does more IO. You can see this, btw, if you enable "set statistics io on" and "set statistics time on" in SSMS when debugging problems like this. It will give you some insight into the number of IOs and other operations happening at runtime in the query:
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-statistics-io-transact-sql?view=sql-server-2017
In terms of what you can do to make this query go faster, I encourage you to think about some things that may change the design of your database schema a bit. First, consider whether you actually need the rows sorted in a specific order at all. If you don't need things in order, it will be cheaper to iterate over them without the row_number (by a measurable amount). So, if you just want to conceptually iterate over each entry once, you can do that by doing an order by something more static that is still monotonic such as the identity column. Second, if you do need to have things in sorted order, then consider whether they are changing frequently/infrequently. If it is infrequent, it may be possible to just compute and persist a column value into each row that has the relative order that you want (and update it each time you modify the table). In this model, you could index the new column and then request things in that order (in the top-level order by in the query - row_number not needed). If you do need things dynamically computed like you are doing and you need things in an exact order all the time, your final option is to move the URL to a second table and join with it after the row_number. This will avoid the sort being "wide" in the computation of row_number.
Best of luck to you
I have four tables; two for current data, two for archive data. One of the archive tables has tens of millions of rows. All tables have a couple narrow indexes and are very similar.
Given the following queries:
SELECT (SELECT COUNT(*) FROM A)
UNION SELECT (SELECT COUNT(*) FROM B)
UNION SELECT (SELECT COUNT(*) FROM C_LargeTable)
UNION SELECT (SELECT COUNT(*) FROM D);
A, B and D perform index scans. C_LargeTable uses a seq scan and the query takes about 20 seconds to execute. Table D has millions of rows as well, but is only about 10% of the size of C_LargeTable
If I then modify my query to execute using the following logic, which sufficiently narrows counts, I still get the same results, the index is used and the query takes about 5 seconds, or 1/4th of the time
...
SELECT (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col < 'G')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col BETWEEN 'G' AND 'Q')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col > 'Q')
...
It does not makes sense to me to have the I/O overhead of a full table scan for a count when perfectly good indexes exist and there is a covering primary key which would ensure uniqueness. My understanding of postgres is that a PRIMARY KEY isn't like a SQL Server clustering index in that it determines a sort, but it implicitly creates a btree index to ensure uniqueness, which I assume should require significantly less I/O than a full table scan.
Is this potentially an indication of an optimization that I may need to perform to organize data within C_LargeTable?
There isn't a covering index on the primary key because PostgreSQL doesn't support them (true up to and including 9.4 anyway).
The heap scan is required because of MVCC visibility. The index doesn't contain visibility information. Pg can do an index scan, but it still has to check visibility info from the heap, and with an index scan that'd be random I/O to read the whole table, so a seqscan will be much faster.
Make sure you run 9.2 or newer, and that autovacuum is configured to run frequently on the table. You should then be able to do an index-only scan where the visibility map is used. This only works under limited circumstances as Horse notes; see the wiki page on count and on index-only scans. If you aren't letting autovacuum run regularly enough the visibility map will be outdated and Pg won't be able to do an index-only scan.
In future, make sure you post explain or preferably explain analyze output with any queries.