Can QAbstractItemModel.beginInsertRows() be given an overestimate of the number of rows to be inserted? - pyside

I am using a custom QSqlTableModel class to edit an SQL table, which has certain constraints (e.g. a UNIQUE constraint). Because of these SQL constraints, at the time .insertRows() is called on a new batch of rows, it is not known which (or how many) of those rows will be successfully inserted. Therefore, when reimplementing the .insertRows() function, is it kosher to give .beginInsertRows() the upper limit of the number of rows which will be inserted, or will this cause problems down the line?
See https://doc.qt.io/qtforpython/PySide6/QtCore/QAbstractItemModel.html#PySide6.QtCore.PySide6.QtCore.QAbstractItemModel.beginInsertRows

Related

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in a full table scan, while the same query in Oracle with actual values results in a significantly lower cost.
Is this expected behaviour from Oracle?
Is there a way to tell Oracle not to scan the full table?
The execution plan shows a full table scan when bind variables are used, while the cost reduces significantly with actual values.
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
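To make the contrast concrete, here is a rough sketch of the two variants of a classic Oracle pagination query; the table name, sort column and bind names are placeholders, not the poster's actual query.
-- literal offsets: the optimizer knows only 11 rows are needed
SELECT *
  FROM (SELECT t.*, ROWNUM rn
          FROM (SELECT * FROM my_table ORDER BY created_date DESC) t
         WHERE ROWNUM <= 11)
 WHERE rn > 0;
-- bind variables: the optimizer cannot tell how many rows will be fetched
SELECT *
  FROM (SELECT t.*, ROWNUM rn
          FROM (SELECT * FROM my_table ORDER BY created_date DESC) t
         WHERE ROWNUM <= :a1 + :a2)
 WHERE rn > :a1;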
Is this expected behaviour from Oracle?
Now that you understand this, you can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
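As a hedged sketch only (the surrounding query and names are placeholders, and Oracle silently ignores hints it cannot apply), the hint would be placed like this:
-- suggest to the optimizer that roughly 100 rows are expected
SELECT /*+ cardinality (100) */ *
  FROM my_table
 ORDER BY created_date DESC;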
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Problems with a primary key sequence

When adding new data through a form, my primary key sequence increases by 1.
However, if I delete some data and replace it with new data, the sequence carries on from where it left off.
So, for example, my primary keys go 1, 2, 3, 4, 5, 6, 10 because of previously deleted rows.
I hope that makes sense.
SEQUENCE values in Oracle are guaranteed to be unique, but you cannot expect the values to form a contiguous sequence without any gaps.
Even if you would never delete any rows from the table, you're likely to see gaps at some point, because sequence values are cached (pre-reserved) between different transactions.
It is a SEQUENCE of numbers; it doesn't care whether you have used the "current value" or not.
As opposed to MySQL, in Oracle the sequence is not tied to a column; it is a separate object that you can ask for a value (through your_sequence.nextval). To preserve uniqueness, it does not take back values and offer them again.
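A minimal sketch of that (the sequence and table names are made up for illustration):
-- the sequence is its own object; values are handed out once and may be cached
CREATE SEQUENCE my_table_seq START WITH 1 INCREMENT BY 1 CACHE 20;
INSERT INTO my_table (id, name)
VALUES (my_table_seq.NEXTVAL, 'some value');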
If you always want a dense sequence of IDs even through deletions, you would have to either
rearrange the IDs (read: change the IDs of the rows newer than the deleted one), or
keep the stored IDs as they are and compute the dense numbering when you query.
Without knowing your task, I would suggest the second option: use the DENSE_RANK analytic function for querying your dataset, separating the real (in-table) IDs from the ranking of the rows, as sketched below.
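A minimal sketch of that second option, assuming a table called my_table with a numeric id column:
-- gap-free numbering computed at query time; the stored IDs are left untouched
SELECT id,
       DENSE_RANK() OVER (ORDER BY id) AS position_in_list
  FROM my_table;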

How to predict table sizes Oracle?

I'm trying to do a growth prediction on some tables I have, and for that I've got to do some calculations on my row sizes, how many rows I generate per day and, well... the maths.
I'm calculating the average size of each row in my table as the sum of the average sizes of its fields. So basically:
SELECT 'COL1' , avg(vsize(COL1)) FROM TABLE union
SELECT 'COL2' , avg(vsize(COL2)) FROM TABLE
Sum that up, multiply by the number of entries of a day and work the predictions from there.
It turns out that for one of the tables I've looked at, the resulting size is a lot smaller than I thought it would be, which got me wondering whether my method was right.
Also, I did not consider indexes sizes for my predictions - and of course I should.
My questions are:
Is this method I'm using reliable?
Tips on how could I work the predictions for the Indexes?
I've done my googling, but the methods I find are all about segments and extents, or else calculations based on the whole table. I need the per-row step to do the predictions (I have to analyse the data in the table in order to figure out how many records are added per day).
And finally, this is an approximation. I know I'm missing some bytes here and there with overheads and stuff. I just want to make sure I'm only missing bytes and not gigas :)
1) Your method is sound to calculate the average size of a row. (Though be aware that if your column contains null, you should use avg(nvl(vsize(col1), 0)) instead of avg(vsize(COL1))). However, it doesn't take into account the physical arrangement of rows.
First of all, it doesn't take into account the header info (from both blocks and rows): you can't fit 8k data into 8k blocks. See the documentation on data block format for more information.
Then, rows are not always stored neatly packed. Oracle leaves some space in each block so that rows can grow when they are updated (governed by the pctfree parameter). Also, when rows are deleted the empty space is not reclaimed right away (if you're not using ASSM with locally managed tablespaces, the amount of free space required for a block to return to the list of available blocks depends on pctused).
If you already have some representative data in your table, you can estimate the amount of extra space you will need by comparing the space physically used (all_tables.blocks*block_size after having gathered statistics) to the average row length.
By the way Oracle can easily give you a good estimate of the average row length: gather statistics on the table and query all_tables.avg_row_len.
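For example (MY_TABLE is a placeholder, and EXEC is the SQL*Plus shorthand), something along these lines:
-- gather statistics, then read Oracle's own row-length and block figures
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MY_TABLE');
SELECT num_rows, avg_row_len, blocks
  FROM all_tables
 WHERE owner = USER
   AND table_name = 'MY_TABLE';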
2) Most of the time (read: unless there is a bug or you fall into an atypical use of the index), the index will grow proportionally to the number of rows.
If you have representative data, you can have a good estimation of its future size by multiplying its actual size by the relative growth of the number of rows.
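A hedged sketch of that calculation (the index name is a placeholder):
-- current physical size of the index; scale it by the expected growth in row count
SELECT segment_name, ROUND(bytes / 1024 / 1024, 1) AS size_mb
  FROM user_segments
 WHERE segment_type = 'INDEX'
   AND segment_name = 'MY_INDEX';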
The last time Oracle published their formulae for estimating the size of schema objects was in Oracle 8.0, which means the linked document is ten years out of date. However, I don't expect very much has changed in how Oracle reserves segment header, block header, or row header information.

Index needed for max(col)?

I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which then must be loaded. There are no other updates on the warehouse tables. To load only new items for a certain table, I'm currently doing the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indices from the tables (they are only needed for production, not in the warehouse). But that way the retrieval of the max value takes some time... so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the stats, but I don't know how to handle columns of type 'timestamp with time zone'. Disabling the index before the load and recreating it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of indexing the row(s) you need to do a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
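A minimal sketch of such a summary table (all names are hypothetical):
-- one row per source table, holding the last key that has been loaded
CREATE TABLE load_progress (
  table_name   VARCHAR2(30) PRIMARY KEY,
  last_max_id  NUMBER NOT NULL
);
-- after loading rows 1-1000 from table A
UPDATE load_progress SET last_max_id = 1000 WHERE table_name = 'A';
-- the next run reads the stored value and starts at 1001
SELECT last_max_id FROM load_progress WHERE table_name = 'A';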
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are why no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics'; they're an approximation to the values that apply. But when you request 'SELECT MAX(somecol) FROM sometable', you aren't asking for statistical maximum; you're asking for the actual current maximum.
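For illustration (sometable and somecol are the placeholder names used above), with an index in place the query can be answered from the index alone:
CREATE INDEX sometable_somecol_ix ON sometable (somecol);
-- Oracle can satisfy this with an index min/max scan instead of a full table scan
SELECT MAX(somecol) FROM sometable;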
Have the process that creates the extract file also extract a single row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so shouldn't be too much to ask to add min/max calcs to that script ;)
If not, just do a full scan. A million rows isn't much really, especially in a data warehouse environment.
This code was written for Oracle, but should be compatible with most SQL dialects.
It gets the key of the max(high_val) in the table for the selected range:
select high_val, my_key
  from (select high_val, my_key
          from mytable
         where something = 'avalue'
         order by high_val desc)
 where rownum <= 1
What this says is: sort mytable by high_val descending for rows where something = 'avalue', then grab only the top row, which gives you the max(high_val) in the selected range and the corresponding my_key.

Sybase: Does the column order in a non-clustered index affect insert performance?

To be more specific (since the general answer to the subject is likely "yes"):
We have a table with lots of data in Sybase.
One of the columns is "date of insertion" (DATE, datetime type).
The clustered index on the table starts with the "DATE".
Question: For another, non-clustered index, does the order of columns (more specifically, whether "DATE" is the first or a second index column) affect the insert query performance?
Assume everything else is equal, e.g. the order of the second non-clustered index does not affect select query performance (even if it does, I don't care for the purposes of this question).
What locking scheme is in place on the table, all-pages or data-pages? (You can find out by selecting lockscheme('table_name').) The (application-observed) performance of index maintenance is much better with the data-pages locking scheme.
An index is ordered. The insert time depends on the cost of maintaining that order. If you insert rows with monotonically increasing values for the columns which are indexed then the index will grow 'at the end' and performance will be great (modulo any concurrency issues due to multiple concurrent updaters). The index tree will have to be re-balanced from time to time but I believe that is a rapid operation.
If the inserts do not arrive in the same order as an index, then that index will have entries inserted 'into the middle', which is likely to cause page splits (unless sufficient space has been left unallocated by setting a fill factor, as mentioned in the other answer) and also 'index fragmentation'.
Anyway, the answer, as always, is to conduct some experiments and measure elapsed time and IO activity. You might also want to look at the optdiag output.
pjjH
I believe it is largely determined by the fill factor on the indexes and, to a much lesser extent, by how selective the columns are.
