Apache Ignite SQL query performance degrades when index columns overlap (H2)

I have a cache (ITEMS) in Ignite with 10 columns and millions of records in it.
The key is the ID (built from 3 of the 10 columns) and the value is the Java object holding these 10 fields, which map to the columns.
columns: col0, col1, col2... col9
-> col0 partitions the records in the cache into sets.
For example, if the cache has 10 million records, then col0 has 10 unique values (DS1, DS2, DS3... DS10) and each of them covers 1 million records.
I have 3 indexes on the cache:
INDEX_1. ID
INDEX_2. col0,col1,col2,col3
INDEX_3. col0,col1,col2,col3,col4 (it has overlapping columns with INDEX_2, plus 1 extra column)
I have a query which looks like:
SELECT ID from ITEMS I USE INDEX(INDEX_3) WHERE col0=? and col1>? and col2<0 and col3>? and col4=0 order by col0, col3 ASC LIMIT 10000;
The issue is that if I keep all 3 indexes, the query takes ~3-4 seconds to execute.
-> If I remove just INDEX_1 I get good query performance, ~200ms (I cannot remove INDEX_1 as it is the primary index and is used in other queries).
-> Likewise, if I remove just INDEX_2 I get good query performance, ~200ms (INDEX_2 is used in another query flow and cannot be removed either).
I can't understand this behavior of the H2 layer. I was under the impression that having more indexes affects the write performance of the cache; how is it affecting read performance? Can anyone help me understand this, or guide me to an appropriate cache config?
Thank you in advance.

I don't know why you're seeing that behaviour, but I can say that you don't need INDEX_2. Ignite uses "leading edge" indexes, which means a lookup for columns 0, 1, 2, and 3 can use INDEX_3.
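As a sketch, assuming the cache is exposed through Ignite SQL DDL (index and column names follow the question; the exact DDL is illustrative):

-- INDEX_3 alone serves both query shapes, since its leading columns
-- (col0..col3) are exactly what INDEX_2 indexed
CREATE INDEX INDEX_3 ON ITEMS (col0, col1, col2, col3, col4);

-- uses all five columns of INDEX_3
SELECT ID FROM ITEMS WHERE col0=? AND col1>? AND col2<? AND col3>? AND col4=?;

-- uses the col0..col3 leading edge of INDEX_3, so no INDEX_2 is needed
SELECT ID FROM ITEMS WHERE col0=? AND col1=? AND col2=? AND col3=?;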

Related

Why does the optimizer choose the higher-cost execution plan?

This is a recurring problem for me. I have statements that work well for a while, and then the optimizer decides to choose another execution plan. This even happens when I query for exactly one (composite) primary key.
When I look up the execution plan in dba_hist_sql_plan, it shows me a cost of 20 for the query using the primary key index and a cost of 270 for the query doing a full table scan.
plan_hash_value  Id  Operation         Options               Cost  Search_Columns
2550672280        0  SELECT STATEMENT                          20
2550672280        1  PARTITION HASH    SINGLE                  20
2550672280        2  TABLE ACCESS      BY LOCAL INDEX ROWID    20
2550672280        3  INDEX             RANGE SCAN              19   1
3908080950        0  SELECT STATEMENT                         270
3908080950        1  PARTITION HASH    SINGLE                 270
3908080950        2  TABLE ACCESS      FULL                   270
I already noticed that the optimizer only uses the first column of the primary key index and then does a range scan. But my real question is: why does the optimizer choose the higher-cost execution plan? It's not that both execution plans are used at the same time; I notice a switch within one snapshot, and then it stays like that for several hours/days. So it can't be an issue of bind peeking.
Our current solution is that I call our DBA and he flushes the statement cache, but this is not really sustainable.
EDIT:
The SQL looks something like this: select * from X where X.id1 = ? and X.id2 = ? and X.id3 = ?
with (id1,id2,id3) being the composite primary key (with a unique index) on the table.
Maybe it's related to a bug in Oracle 11g:
Bug 18377553 : POOR CARDINALITY ESTIMATE WITH HISTOGRAMS AND VALUES > 32 BYTES
When your data is like:
AAAAAAAAAAAAAAAAAAAAmyvalue
AAAAAAAAAAAAAAAAAAAAsomeohtervalue
AAAAAAAAAAAAAAAAAAAAandsoon
B1234
Histograms do not work well.
The solution is to disable the histograms on the primary key columns, and everything will start working smoothly.
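One way to do that with DBMS_STATS, as a hedged sketch assuming the table and key columns from the question (X with id1, id2, id3):

begin
  -- SIZE 1 means one bucket, i.e. no histogram on these columns
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'X',
    method_opt => 'FOR COLUMNS SIZE 1 ID1, ID2, ID3'
  );
end;
/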
Most likely the clustering factor and blevel of the index are very high. Check the blevel by querying dba_indexes; if blevel is greater than 3, try rebuilding the index.
Also check whether the index created for the primary key is unique or not. According to the plan it is using a range scan instead of a unique scan, so most likely the index is not unique.
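A quick check for both, as a sketch (table name taken from the question's example):

-- blevel > 3 suggests rebuilding; UNIQUENESS shows whether a
-- unique scan is even possible
select index_name, blevel, clustering_factor, uniqueness
from dba_indexes
where table_name = 'X';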
Apparently the optimizer doesn't correctly display costs regarding type conversions. The root cause of this problem was an incorrect type mapping for a date value. While the column in the database is of type DATE, the JDBC type was incorrectly java.sql.Timestamp. To compare a DATE column with a Timestamp search parameter, all values in the table need to be converted to Timestamp first, which is additional cost and renders the index unusable.
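To illustrate the shape of the problem, a hedged sketch (date_col stands in for the DATE column; the actual fix here was on the JDBC side, binding the correct type):

-- implicit conversion: the DATE column is promoted to TIMESTAMP,
-- so an index on date_col can't be used for this predicate
select * from X where date_col = :ts;

-- casting the bind instead leaves date_col, and its index, untouched
select * from X where date_col = cast(:ts as date);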

Postgres not using index on SELECT COUNT(*) for a large table

I have four tables; two for current data, two for archive data. One of the archive tables has tens of millions of rows. All tables have a couple narrow indexes and are very similar.
Given the following queries:
SELECT (SELECT COUNT(*) FROM A)
UNION SELECT (SELECT COUNT(*) FROM B)
UNION SELECT (SELECT COUNT(*) FROM C_LargeTable)
UNION SELECT (SELECT COUNT(*) FROM D);
A, B and D perform index scans. C_LargeTable uses a seq scan, and the query takes about 20 seconds to execute. Table D has millions of rows as well, but is only about 10% of the size of C_LargeTable.
If I then modify my query to use the following logic, which sufficiently narrows the counts, I still get the same results, but the index is used and the query takes about 5 seconds, a quarter of the time:
...
SELECT (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col < 'G')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col BETWEEN 'G' AND 'Q')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col > 'Q')
...
It does not make sense to me to have the I/O overhead of a full table scan for a count when perfectly good indexes exist and there is a covering primary key which would ensure uniqueness. My understanding of Postgres is that a PRIMARY KEY isn't like a SQL Server clustered index in that it doesn't determine a sort order, but it implicitly creates a btree index to ensure uniqueness, which I assume should require significantly less I/O than a full table scan.
Is this potentially an indication of an optimization that I may need to perform to organize data within C_LargeTable?
There isn't a covering index on the primary key because PostgreSQL doesn't support them (true up to and including 9.4 anyway).
The heap scan is required because of MVCC visibility. The index doesn't contain visibility information. Pg can do an index scan, but it still has to check visibility info from the heap, and with an index scan that'd be random I/O to read the whole table, so a seqscan will be much faster.
Make sure you run 9.2 or newer, and that autovacuum is configured to run frequently on the table. You should then be able to do an index-only scan where the visibility map is used. This only works under limited circumstances as Horse notes; see the wiki page on count and on index-only scans. If you aren't letting autovacuum run regularly enough the visibility map will be outdated and Pg won't be able to do an index-only scan.
In future, make sure you post explain or preferably explain analyze output with any queries.
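A minimal sketch of the check, assuming the table name from the question; relallvisible (9.2+) shows how much of the table the visibility map covers, which is what makes an index-only scan cheap:

-- refresh the visibility map
vacuum analyze c_largetable;

-- how many pages are all-visible vs. total?
select relpages, relallvisible from pg_class where relname = 'c_largetable';

-- should now show an index-only scan when conditions are met
explain analyze select count(*) from c_largetable;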

Different columns in a select query result in different costs

There is an index at table invt_item_d on (item_id & branch_id & co_id) columns.
The plan results for the first query are TABLE ACCESS FULL with a cost of 528;
the results for the second query are INDEX FAST FULL SCAN (my index) with a cost of 27.
The only difference is, as you can see, that the selected column is used in the index in the second query.
Is there something wrong with this? And please, can you tell me what I should do to fix this at the DB administration level?
select d.qty
from invt_item_d d
where d.item_id = 999
and d.branch_id = 888
and d.co_id = 777
select d.item_id
from invt_item_d d
where d.item_id = 999
and d.branch_id = 888
and d.co_id = 777
EDIT:
I made a new query, and this query's cost is 529, with TABLE ACCESS FULL:
select qty from invt_item_d
So it doesn't matter if I use an index or not. Some say this is normal; is this really normal behaviour?
In the first case, the table must be accessed, since the "qty" column is only stored in the table.
In the second case, all the columns used in the query can be read from the index, skipping the table read altogether.
You can add another index on columns (item_id, branch_id, co_id, qty) and it will most probably be used in the first query.
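As a sketch of that suggestion (the index name is illustrative):

-- with qty appended, the first query can also be answered
-- entirely from the index, skipping the table read
create index invt_item_d_cov on invt_item_d (item_id, branch_id, co_id, qty);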
From the Oracle documentation: http://docs.oracle.com/cd/E11882_01/server.112/e25789/indexiot.htm
A fast full index scan is a full index scan in which the database accesses the data in the index itself without accessing the table, and the database reads the index blocks in no particular order.
Fast full index scans are an alternative to a full table scan when both of the following conditions are met:
The index must contain all columns needed for the query.
A row containing all nulls must not appear in the query result set. For this result to be guaranteed, at least one column in the index must have either:
A NOT NULL constraint
A predicate applied to it that prevents nulls from being considered in the query result set
This is exactly the main purpose of using an index -- making searches faster.
Querying columns with indexes is faster than querying columns without indexes.
It's basic Oracle knowledge.
I am adding another answer because it seems to be more convenient.
First:
" i doesn't hit the index because there are 34000 rows, not millions". This is COMPLETELY WRONG and a dangerous understanding.
What I meant was, if there are a few thousand rows, and the index is not hit(oracle engine does a full table scan(TABLE ACCESS FULL) then), its not a big deal. Oracle is fast enough to read few thousand rows in a matter of a second(even without indexes) , and hence you wont feel the difference.The query is still slower(than the occasion when there is an index) , but its is so minimally slower that you wont feel the difference.
But, if there are millions of rows, the execution of the query will be much, much slower without index ( as this time it will scan millions of rows in a full table scan)and your performance will be hit.
Second: Why on earth do you have to loop over a table with 34000 rows, that too 4000 times???
Thats a terrible approach. Avoid loops as much as possible.There has to be a better approach!
Third:
You can force the Oracle optimiser to hit the index by using an index hint. You will need to know the name of the index for that. Note that when the table has an alias, the hint must reference the alias, not the table name:
select /*+ index(d <index_name>) */
d.qty
from invt_item_d d
where d.item_id = 999
and d.branch_id = 888
and d.co_id = 777
Here is the link to a stack overflow question on index hint

Is it OK to create several indexes on a table if they're really needed

I have a table with 7 columns.
It's going to contain lots and lots of data - something like more than 1.7 million records will be added every month.
Of those 7 columns 5 are the ones that I'll be using in the WHERE clause of my queries against this table in different combinations.
Is it OK to create different indexes for those possible combinations?
I'm asking this question because if I do that, there'll be more than 10 indexes on this table and I'm not sure if this is a good idea.
On the other hand, I'm afraid of querying a table with this big amount of data without indexes.
Here's the table:
CREATE TABLE AG_PAYMENTS_TO_BE
(
PAYMENTID NUMBER(15, 0) NOT NULL
, DEPARTID NUMBER(3,0)
, PENSIONERID NUMBER(11, 0) NOT NULL
, AMOUNT NUMBER(6, 2)
, PERIOD CHAR(6 CHAR)
, PAYMENTTYPE NUMBER(1,0)
, ST NUMBER(1, 0) DEFAULT 0
, CONSTRAINT AG_PAYMENTS_TO_BE_PK PRIMARY KEY
(
PAYMENTID
)
ENABLE
);
Possible queries:
SELECT AMOUNT FROM AG_PAYMENTS_TO_BE WHERE ST=0 AND DEPARTID=112 AND PERIOD='201207';
SELECT AMOUNT FROM AG_PAYMENTS_TO_BE WHERE ST=0 AND PENSIONERID=123456 AND PERIOD='201207';
SELECT AMOUNT FROM AG_PAYMENTS_TO_BE WHERE ST=0 AND PENSIONERID=123456 AND PERIOD='201207' AND PAYMENTTYPE=1;
SELECT AMOUNT FROM AG_PAYMENTS_TO_BE WHERE ST=0 AND DEPARTID=112 AND ST=0;
SELECT AMOUNT FROM AG_PAYMENTS_TO_BE WHERE ST=0 AND PENSIONERID=123456;
and so on.
Ignoring index skip scans* for the moment, in order for a query to use an index:
The leading index columns must be listed in the query
They must be compared using equality (i.e. using =, not <, > or like)
For example, a table with a composite index on (a, b) could use the index in the following queries:
a = :b1 and b >= :b2
a = :b1
but not:
b = :b2
because column b is listed second in the index.
* In some cases it's possible for the index to be used via an index skip scan, where the leading column in the index is skipped. There need to be relatively few distinct values for the first column however, which doesn't happen often (in my experience).
Note that a "larger" index can be used by queries which only use some of the leading columns from it. So in the example above, an index on just a is redundant because the queries shown can use the index on a, b. An index on just b may be useful however.
The more indexes you add, the slower your inserts/updates/deletes will be, because the indexes have to be maintained at the same time as the table. Therefore you should aim to keep the number of indexes down, unless there's significant query benefits to adding a new one. This is something you'll have to measure in your environment to determine the exact cost/benefit.
Note that having multiple indexes with similar columns can lead to the wrong index being selected. So there is a potential downside for selects when you have many similar indexes. There is also a slight overhead in parse times, as Oracle has more options to consider when selecting the execution plan.
Looking at your queries I believe you only need indexes on:
st, departid, period
st, pensionerid, period
You may wish to add amount at the end of these as well, so your queries can be fully answered from the index, saving you a table lookup. You may also need further indexes if these columns are foreign keys to other tables, to prevent locking issues.
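Sketched as DDL, with amount appended as suggested (the index names are illustrative):

create index ag_pay_st_dep_per on AG_PAYMENTS_TO_BE (st, departid, period, amount);
create index ag_pay_st_pen_per on AG_PAYMENTS_TO_BE (st, pensionerid, period, amount);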
This decision would greatly depend on expected number of distinct values in each column, and thus selectivity of each possible index.
Things I would consider while making decisions:
Obviously, the PAYMENTTYPE and ST fields hold at most 10-19 distinct values each, which is pretty unselective given your expected volume of data (~400M rows), so they won't help you much.
However, they probably could become good candidates for list partitioning instead.
I would also think of switching PERIOD CHAR(6 CHAR) to DATE and making a composite range-list partition on period+st/paymenttype.
DEPARTID - if you have hundreds of departments, then it's probably an indexing candidate, but if only dozens, a full scan would probably perform way faster.
PENSIONERID seems to be a high-selectivity field, so I would consider creating a separate index on it, and including it in a composite index on PERIOD+PENSIONERID (in that field order).
I think you should create a few combined indexes (like (ST, PERIOD) and (ST, PENSIONERID)). That will speed up most of your sample queries...

Inserts are 4x slower if the table has lots of records (400K) vs. if it's empty

(Database: Oracle 10G R2)
It takes 1 minute to insert 100,000 records into a table. But if the table already contains some records (400K), then it takes 4 minutes and 12 seconds; CPU wait also jumps up and "Free Buffer Waits" become really high (from dbconsole).
Do you know what's happening here? Is this because of frequent table extents? The extent size for these tables is 1,048,576 bytes. I have a feeling the DB is trying to extend the table storage.
I am really confused about this. So any help would be great!
This is the insert statement:
begin
   for i in 1 .. 100000 loop
      insert into customer
         (id, business_name, address1,
          address2, city,
          zip, state, country, fax,
          phone, email)
      values
         (customer_seq.nextval, dbms_random.string ('A', 20), dbms_random.string ('A', 20),
          dbms_random.string ('A', 20), dbms_random.string ('A', 20),
          trunc (dbms_random.value (10000, 99999)), 'CA', 'US', '798-779-7987',
          '798-779-7987', 'asdfasf#asfasf.com');
   end loop;
end;
Here is the dstat output (CPU, IO, memory, net) for:
Empty Table inserts: http://pastebin.com/f40f50dbb
Table with 400K records: http://pastebin.com/f48d8ebc7
Output from v$buffer_pool_statistics
ID: 3
NAME: DEFAULT
BLOCK_SIZE: 8192
SET_MSIZE: 4446
CNUM_REPL: 4446
CNUM_WRITE: 0
CNUM_SET: 4446
BUF_GOT: 1407656
SUM_WRITE: 1244533
SUM_SCAN: 0
FREE_BUFFER_WAIT: 93314
WRITE_COMPLETE_WAIT: 832
BUFFER_BUSY_WAIT: 788
FREE_BUFFER_INSPECTED: 2141883
DIRTY_BUFFERS_INSPECTED: 1030570
DB_BLOCK_CHANGE: 44445969
DB_BLOCK_GETS: 44866836
CONSISTENT_GETS: 8195371
PHYSICAL_READS: 930646
PHYSICAL_WRITES: 1244533
UPDATE
I dropped the indexes off this table and performance improved drastically, even when inserting 100K into a 600K-record table (which took 47 seconds with no CPU wait - see dstat output http://pastebin.com/fbaccb10).
Not sure if this is the same in Oracle, but in SQL Server the first thing I'd check is how many indexes you have on the table. If it's a lot, the DB has to do a lot of work reindexing the table as records are inserted. It's more difficult to reindex 500k rows than 100k.
The indices are some form of tree, which means the time to insert a record is going to be O(log n), where n is the size of the tree (≈ number of rows for the standard unique index).
The fastest way to insert them is going to be dropping/disabling the index during the insert and recreating it after, as you've already found.
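As a sketch of that approach (the index name is illustrative; note this only works for non-unique indexes, since inserts into a table with an unusable unique index will fail):

-- mark the index unusable so the bulk insert skips index maintenance
alter index customer_name_ix unusable;
alter session set skip_unusable_indexes = true;

-- ... run the bulk insert here ...

-- rebuild once, after all rows are in
alter index customer_name_ix rebuild;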
Even with indexes, 4 minutes to insert 100,000 records seems like a problem to me.
If this database has I/O problems, you haven't fixed them and they will appear again. I would recommend that you identify the root cause.
If you post the index DDL, I'll time it for a comparison.
I added indexes on id and business_name. Doing 10 iterations in a loop, the average time per 100,000 rows was 25 seconds. This was on my home PC/server all running on a single disk.
Another trick to improve performance is to turn on or raise the cache on your sequence (customer_seq). This allows Oracle to pre-allocate a range of sequence values in memory instead of hitting the data dictionary for each insert.
Be careful with this one, though: in some situations it will cause your sequence to have gaps between values.
More information here:
Oracle/PLSQL: Sequences (Autonumber)
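For example, as a sketch (the cache size is illustrative; cached values that are never used, e.g. after an instance restart, are where the gaps come from):

alter sequence customer_seq cache 1000;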
Sorted inserts always take longer the more entries there are in the table.
You don't say which columns are indexed. If you had indexes on fax, phone or email, you would have had a LOT of duplicates (ie every row).
Oracle 'pretends' to have non-unique indexes. In reality every index entry is unique with the rowid of the actual table row being the deciding factor. The rowid is made up of the file/block/record.
It is possible that, once you hit a certain number of records, the new ones were getting rowids which meant they had to be fitted into the middle of existing indexes, with a lot of index re-writing going on.
If you supply full table and index creation statements, others would be able to reproduce the experience which would have allowed for more evidence based responses.
I think it has to do with extending the internal structure of the file, as well as building database indexes for the added information - I believe the database arranges the data in a non-linear fashion that helps speed up data retrieval on selects.
