If you query for a range of dates together with another condition, is having the date column as the leading edge of your index a bad thing?
I'm using PostgreSQL, but assume this applies to all B-tree indexes.
Let's say I queried for records where the created date was 2013-01-02 or later and the status is Active. I'm fuzzy on how a B-tree index would organize dates, but here's how I imagine it. If the index was on (created, status), it would be structured roughly like this:
created status
------------------------
2013-01-01 Active
2013-01-01 Inactive
2013-01-02 Active <-- This record is selected
2013-01-02 Inactive
2013-01-03 Active <-- This non-adjacent record is selected (SLOW)
2013-01-03 Inactive
If the index was on (status, created):
status created
------------------------
Active 2013-01-01
Active 2013-01-02 <-- This record is selected
Active 2013-01-03 <-- This adjacent record is selected (FAST)
Inactive 2013-01-01
Inactive 2013-01-02
Inactive 2013-01-03
So in my mind, if you use a date as the leading edge and query for a range of those dates, the records you want end up scattered throughout the index, leading to poorer performance. It's even worse with a datetime.
I think your best bet here is to use a partial index. It sounds like you will mostly be running queries such as:
select * from my_table where status='Active' and created > whatever
If that is the case, you would likely see the best performance by creating the index on creation date, filtered by status:
CREATE INDEX active_status_created_idx on my_table(created) WHERE status='Active'
That will result in a significantly smaller index that can be used by any query that includes the WHERE status='Active' clause.
See:
https://devcenter.heroku.com/articles/postgresql-indexes#expression-indexes and
http://www.postgresql.org/docs/9.2/static/indexes-expressional.html
Your assumptions are correct, as far as I can tell. You should pick your index according to the types of queries you're going to run most often.
If you're doing a lot of where status = ? order by created limit 10 or order by status, created limit 10, then an index on (status, created) is usually in order.
If you're doing a lot of where created = ? order by status limit 10 or order by created, status limit 10, then you'll typically want an index on (created, status) instead.
Note that Postgres allows explicit sorting for indexes too, e.g. (created, status desc). The docs provide a lengthy discussion on why this is sometimes desirable. (I can't recall where exactly, but I'm sure you've found it already considering how you phrased your question.)
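For illustration, the corresponding index definitions might look like this (a sketch using the table and column names from the question; the index names are made up):
CREATE INDEX my_table_status_created_idx ON my_table (status, created);
CREATE INDEX my_table_created_status_idx ON my_table (created, status);
-- With an explicit per-column sort order, e.g. to match ORDER BY created, status DESC:
CREATE INDEX my_table_created_status_desc_idx ON my_table (created, status DESC);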
Also note the limit in each case. Whether the index is used for the ordering clause depends on the number of rows you're retrieving. Fetch enough rows and Postgres may prefer to ignore your carefully created index altogether and do a top-N sort of rows retrieved through other means instead.
Lastly, note that Postgres is quite good, especially in recent versions, at combining multiple independent single-column indexes. In fact, the manual's chapters on indexes discuss precisely this point.
If you have an index on (created) and another on (status), it will know to do a bitmap index scan for queries such as where status = ? and/or created = ? when both predicates are selective enough. Along the same lines, it will know to simply use the index on (created) for queries such as where status = ? order by created limit 10, filtering out rows where the status doesn't have the right value.
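A minimal sketch of that scenario, assuming the table and columns from the question (index names are illustrative):
CREATE INDEX my_table_created_idx ON my_table (created);
CREATE INDEX my_table_status_idx  ON my_table (status);

-- If both predicates are selective enough, the planner can combine the two
-- single-column indexes with a bitmap AND instead of needing a composite index:
EXPLAIN
SELECT * FROM my_table
WHERE status = 'Active' AND created = DATE '2013-01-02';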
Related
I have a table with an enumerated column named "status". I am implementing an endpoint to get statistics about active and inactive entries. It will return a response like this:
{ "activeCount" : 10, "inactiveCount" : 10 }
There are 4 possible statuses for each entry (active, inactive, awaitingApproval, suspicious). activeCount = number of entries with active status.
inactiveCount = number of entries with inactive/awaitingApproval/suspicious status.
I am using the controller-service-repository pattern and an H2 in-memory database. I need this to be as fast as possible. Also assume that this table will hold a massive amount of data in the future, so loading all entries into memory and calculating the status statistics is not an option.
What are your best practice suggestions?
Thanks in advance for your help.
Just use a query like select e.status, count(*) from Entity e group by e.status. If this is not fast enough for you, you will have to maintain a current count per group in a dedicated table and query that instead. That obviously requires you to adjust the counts accordingly on every status change, insert, or delete. Usually, this can be done with triggers.
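As a sketch, assuming the entries live in a table named entry with a string status column (both names are illustrative), the whole response can be computed in a single pass:
-- Illustrative only: table and column names are assumed.
SELECT
    SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS activeCount,
    SUM(CASE WHEN status <> 'active' THEN 1 ELSE 0 END) AS inactiveCount
FROM entry;
An index on status keeps this cheap, and the grouped variant can be mapped to the two counts in the service layer just as easily.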
We have a time series table with the following definition
CREATE TABLE timeseries.mytable
(
`ts` DateTime('UTC'),
`src_ip` String,
`dst_ip` String,
`col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip,ts,src_ip)
SETTINGS index_granularity = 8192
SELECT count(*) FROM timeseries.mytable;
# Elapsed: 0.004 sec. The table has 383M records.
SELECT count(*) FROM timeseries.mytable WHERE dst_ip = 'a.b.c.d';
# Elapsed: 0.085 sec.
SELECT count(*) FROM timeseries.mytable WHERE src_ip = 'a.b.c.d';
# Elapsed: 53.031 sec
As can be seen above, filtering the data using the first sorted column (dst_ip) is very quick.
How can I make the select using the third sorted column (src_ip) faster?
Some remarks:
the third query (WHERE src_ip = 'a.b.c.d') is slow because the index is not used and ClickHouse falls back to a full scan. There is no good way to make it faster short of redesigning the primary key or, if the query only calculates aggregates, maintaining an additional AggregatingMergeTree table
the use cases you provided look artificial, because counting rows across the whole dataset is not a key use case for time-series data. Why is the result not restricted by dst_ip and ts?
consider the ClickHouse AggregatingMergeTree approach when you need to calculate aggregated values (such as count in your case)
designing the primary key requires understanding how ClickHouse uses it in query optimization (see Primary Keys and Indexes in Queries, More Secrets of ClickHouse Query Performance)
those resources recommend using a monotonic index
to choose the best index, you need to run a series of tests to find the one that fits your concrete use cases
I would suggest the following primary keys:
/* Least convincing option: remove the date column (it makes all date-range queries with a range shorter than 'Daily' much slower). */
ORDER BY (dst_ip, src_ip)
/* Define the granularity of the date. Any interval shorter than 'Daily' (where 'Daily' is defined by the partition key) can be used instead of toStartOfHour. */
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
/* Move the date to the first position (this speeds up date-range queries that do not filter on dst_ip and gains the monotonic-index advantages). */
ORDER BY (toStartOfHour(ts), dst_ip, src_ip)
For each primary key, you also need to choose the most effective index_granularity value.
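As a sketch, the second option applied to the original table definition would look like this (the new table name is illustrative, and index_granularity still needs tuning):
CREATE TABLE timeseries.mytable_v2
(
    `ts` DateTime('UTC'),
    `src_ip` String,
    `dst_ip` String,
    `col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
SETTINGS index_granularity = 8192;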
As of 2022, the solution is to use a data skipping index (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes) for src_ip.
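A sketch of what that could look like on the existing table (the index name, bloom filter parameter, and granularity are illustrative and need tuning):
ALTER TABLE timeseries.mytable
    ADD INDEX src_ip_idx src_ip TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for parts that already exist:
ALTER TABLE timeseries.mytable MATERIALIZE INDEX src_ip_idx;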
You should test different column orders in the ORDER BY clause depending on the value cardinality of your columns. In this case, try bringing src_ip before ts in the ORDER BY clause.
In the MergeTree engine, rows are sorted by the ORDER BY keys within each partition.
After that, you can decide the final arrangement of columns in the ORDER BY clause based on how your application will query the data most of the time.
You can find a similar discussion here.
I would like to add a compressed index to the Oracle Applications workflow table hr.pqh_ss_transaction_history in order to access specific types of workflows (process_name) and workflows for specific people (selected_person_id).
There are lots of repeating values in process_name, although the data is skewed. I would, however, want to access the TFG_HR_NEW_HIRE_PLACE_JSP_PRC and TFG_HR_TERMINATION_JSP_PRC process types.
"PROCESS_NAME","CNT"
"HR_GENERIC_APPROVAL_PRC",40347
"HR_PERSONAL_INFO_JSP_PRC",39284
"TFG_HR_NEW_HIRE_PLACE_JSP_PRC",18117
"TFG_HREMPSTS_TERMS_CHG_JSP_PRC",14076
"TFG_HR_TERMINATION_JSP_PRC",8764
"HR_ADV_INDIVIDUAL_COMP_PRC",4907
"TFG_HR_SIT_NOAPP",3979
"TFG_YE_TAX_PROV",2663
"HR_TERMINATION_JSP_PRC",1310
"HR_CHANGE_PAY_JSP_PRC",953
"TFG_HR_SIT_EXIT_JSP_PRC",797
"HR_SIT_JSP_PRC",630
"HR_QUALIFICATION_JSP_PRC",282
"HR_CAED_JSP_PRC",250
"TFG_HR_EMP_TERM_JSP_PRC",211
"PER_DOR_JSP_PRC",174
"HR_AWARD_JSP_PRC",101
"TFG_HR_SIT_REP_MOT",32
"TFG_HR_SIT_NEWPOS_NIB_JSP_PRC",30
"TFG_HR_SIT_NEWPOS_INBU_JSP_PRC",28
"HR_NEW_HIRE_PLACE_JSP_PRC",22
"HR_NEWHIRE_JSP_PRC",6
selected_person_id would obviously be more selective. Unfortunately there are 3774 nulls for this column, and the highest count after that is 73 for one person. A lot of people would only have 1 row. The total row count is 136,963.
My query would be in this format:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, :p_person_id) = :p_person_id
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I am on Oracle 12c release 1.
I assume it would be a good idea to put a non-compressed B-tree index on selected_person_id, since the values returned would fall into the less-than-5%-of-total-rows scenario. But how do you handle the NULLs in the column, which would not go into the index, when you select using nvl(psth.selected_person_id, :p_person_id) = :p_person_id? Is there a more efficient way to write the SQL, and how should you create this index?
For process_name I would like to use a compressed b-tree index. I am assuming that the statement is
CREATE INDEX idxname ON pqh_ss_transaction_history(process_name) COMPRESS
where there would be an implicit second column for the rowid. Is it safe to rely on the rowid here, since normally using rowids is not advised? Is the skewed data an issue (most of the time I would be selecting on the high-volume side)? I don't understand how compressed indexes would be efficient: for B-tree indexes you would normally want to return less than about 5% of the data, otherwise a full table scan is actually more efficient. How does the compressed index return so many rowids and then look up the rows in the table using those rowids faster than a full table scan?
Or, since the optimizer will only be able to use one of the two indexes, should I rather create an uncompressed function-based index with selected_person_id and process_name concatenated?
Perhaps you could create this index, wrapping selected_person_id in NVL so that rows where it is NULL still get an index entry:
CREATE INDEX idxname ON pqh_ss_transaction_history
(process_name, NVL(selected_person_id,-1)) COMPRESS 1
Then change your query to:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id,-1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I was wondering if there are any performance implications of adding a rowversion column to a table in a SQL Server database.
There are few performance implications from the column itself; rowversion is just a new name for the old timestamp datatype, so your database simply needs to store the additional binary field. Your performance will suffer much more when you run queries on this data such as:
SELECT *
FROM MyTable
WHERE rowVersion > @rowVersion
which is the common way to get the list of items updated since the last known @rowVersion value.
This looks fine and will work perfectly for a table with, say, 10,000 rows. But when you get to 1M rows you will quickly discover that it has been doing a table scan all along, and the performance hit appears because your table no longer fits entirely within the RAM of the server.
This is the common problem encountered with a rowVersion column: it is not magically indexed on its own. Also, when you do index a rowVersion column, you have to accept that the index will often become very fragmented over time, because the newly updated values always land at the end of the index, leaving gaps throughout it as you update existing items.
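If you do query by rowversion, the usual mitigation is an explicit nonclustered index on the column; a minimal sketch, reusing the table and column names from the example above:
-- Lets the "everything changed since @rowVersion" query seek instead of scanning
-- the whole table; expect the fragmentation described above over time.
CREATE NONCLUSTERED INDEX IX_MyTable_rowVersion
    ON MyTable (rowVersion);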
Edit: If you're not going to use the rowVersion field for checking for updated items, and instead you're going to use it for consistency, to ensure that the record hasn't been updated since you last read it, then this is a perfectly acceptable use and will not impact performance:
UPDATE MyTable SET MyField = @myField
WHERE Key = @key AND rowVersion = @rowVersion
We are facing a strange performance problem with SQL Server Express 2005 in a very simple situation.
We have a table with [timestamp], [id], and [value] columns,
and only one primary unique index on [timestamp]+[id].
The table contains around 68,000,000 records.
The request is:
SELECT TOP 1 timestamp FROM table WHERE id=1234 ORDER BY timestamp
If there is at least one record for this id, the result is returned in a few milliseconds.
If there is no record for this id, the result takes at least 30 SECONDS!!!
We tried many other simple, similar requests, and as soon as there are no matching records for the id, the processing time is awfully long.
Do you have any explanation, and any idea how to avoid this?
TOP 1 ORDER BY what? If it finds one matching record it can stop, but if there is none it must scan the entire table to be sure, since you don't have an index on id.
If you did have one, but wanted ORDER BY timestamp, it could still table scan, because it doesn't know the id is unique within the timestamp index (even though it might seem obvious to you, perhaps because the id is declared unique: is it? What if it isn't a unique index of its own or the first field in a multicolumn index? Or perhaps because both columns increase monotonically: do they?).
If the id is a unique id, then your ORDER BY isn't needed, and an index on just that field would be enough.
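A minimal sketch of such an index, assuming the table and column names from the question (the real table name was not given, and the index name is made up):
-- Lets WHERE id = ... ORDER BY [timestamp] be answered with an index seek,
-- whether or not a matching row exists.
CREATE INDEX IX_table_id_timestamp
    ON [table] (id, [timestamp]);

SELECT TOP 1 [timestamp]
FROM [table]
WHERE id = 1234
ORDER BY [timestamp];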