LOW_VALUE and HIGH_VALUE in USER_TAB_COLUMNS - performance

I have a question regarding the columns LOW_VALUE and HIGH_VALUE in the view USER_TAB_COLUMNS (or equivalent).
I was just wondering whether these values are always correct. For example, if you have a column with 500k rows with a value of 1, 500k rows with a value of 5, and 1 row with a value of 1000, LOW_VALUE should be 1 (after you convert the raw figure) and HIGH_VALUE should be 1000 (after you convert the raw figure). However, are there any circumstances where Oracle would 'miss' this outlier value and instead report 5 for HIGH_VALUE?
Also, what is the purpose of these 2 values?
Thanks

As with all optimizer-related statistics, these values are estimates with varying degrees of accuracy from whenever statistics were gathered on the table. As such, it is entirely expected that they would be close but not completely accurate and entirely possible that they would be wildly incorrect.
When you gather statistics, you specify a percentage of the rows (or blocks) that should be sampled. It is possible to specify a 100% sample size, in which case Oracle examines every row, but it is relatively rare to ask for a sample that large. It is much more efficient to ask for a much smaller sample (either explicitly or by letting Oracle determine the sample size automatically). If your sample of rows happens not to include the one row with a value of 1000, the HIGH_VALUE would not be 1000; it would be 5, assuming that is the largest value the sample saw.
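For example, the sample size can be controlled explicitly when gathering statistics (a sketch; "SOME_TABLE" is a placeholder name):

```sql
-- Gather statistics with a 100% sample so LOW_VALUE/HIGH_VALUE reflect
-- every row ("SOME_TABLE" is a placeholder name).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'SOME_TABLE',
    estimate_percent => 100
  );
END;
/

-- Or let Oracle choose the sample size automatically:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'SOME_TABLE',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE
  );
END;
/
```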
Statistics are also a snapshot in time. By default, 11g gathers statistics every night on objects that have undergone enough change since the last gathering to warrant refreshing them, though you can disable that job or change its parameters. So if you gather statistics today with a 100% sample size and get a HIGH_VALUE of 1000, then insert one row with a value of 3000 and never modify the table again, it is likely that Oracle would never gather statistics on that table again (unless you explicitly requested it) and that the HIGH_VALUE would remain 1000 forever.
Assuming that there is no histogram on the column (which is another whole discussion), Oracle uses the LOW_VALUE and HIGH_VALUE to estimate how selective a particular predicate would be. If the LOW_VALUE is 1, the HIGH_VALUE is 1000, there are 1,000,000 rows in the table, there is no histogram on the column, and you run a query like
SELECT *
FROM some_table
WHERE column_name BETWEEN 100 and 101
Oracle will guess that the data is uniformly distributed between 1 and 1000 so that this query would return 1,000 rows (multiplying the number of rows in the table (1 million) by the fraction of the range the query covers (1/1000)). This selectivity estimate, in turn, would drive the optimizer's determination of whether it would be more efficient to use an index or to do a table scan, what join methods to use, what order to evaluate the various predicates, etc. If you have a non-uniform distribution of data, however, you'll likely end up with a histogram on the column which gives Oracle more detailed information about the distribution of data in the column than the LOW_VALUE and HIGH_VALUE provide.
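The RAW figures the question mentions can be decoded with DBMS_STATS.CONVERT_RAW_VALUE; a minimal sketch for a NUMBER column (the table and column names are placeholders):

```sql
-- Decode LOW_VALUE/HIGH_VALUE for a NUMBER column ("SOME_TABLE" and
-- "COLUMN_NAME" are placeholder names).
SET SERVEROUTPUT ON
DECLARE
  v_low  NUMBER;
  v_high NUMBER;
BEGIN
  FOR r IN (SELECT low_value, high_value
              FROM user_tab_columns
             WHERE table_name  = 'SOME_TABLE'
               AND column_name = 'COLUMN_NAME')
  LOOP
    DBMS_STATS.CONVERT_RAW_VALUE(r.low_value,  v_low);
    DBMS_STATS.CONVERT_RAW_VALUE(r.high_value, v_high);
    DBMS_OUTPUT.PUT_LINE('LOW_VALUE=' || v_low || ', HIGH_VALUE=' || v_high);
  END LOOP;
END;
/
```

CONVERT_RAW_VALUE is overloaded for VARCHAR2 and DATE as well, so the same pattern works for other column types.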

Related

What to do when the SQL index does not solve the problem of performance?

This query ONE
SELECT * FROM TEST_RANDOM WHERE EMPNO >= '236400' AND EMPNO <= '456000';
in the Oracle Database is running with cost 1927.
And this query TWO :
SELECT * FROM TEST_RANDOM WHERE EMPNO = '236400';
is running with cost 1924.
This table TEST_RANDOM has 1,000,000 rows. I created it like this:
Create table test_normal (empno varchar2(10), ename varchar2(30), sal number(10), faixa varchar2(10));
Begin
For i in 1..1000000
Loop
Insert into test_normal values(
to_char(i), dbms_random.string('U',30),
dbms_random.value(1000,7000), 'ND'
);
If mod(i, 10000) = 0 then
Commit;
End if;
End loop;
End;
Create table test_random
as
select /*+ append */ * from test_normal order by dbms_random.random;
I created a B-Tree index in the field EMPNO so:
CREATE INDEX IDX_RANDOM_1 ON TEST_RANDOM (EMPNO);
After this, the query TWO improved, and the cost changed to 4.
But query ONE did not improve: the optimizer ignored the index, for some reason deciding that an execution plan using the index was not worth it...
My question is: what can we do to improve the performance of query ONE? The index did not solve the problem, and its cost remains high...
For this query, Oracle does not use an index because the optimizer correctly estimated the number of rows and correctly decided that a full table scan would be faster or more efficient.
B-Tree indexes are generally only useful when they can be used to return a small percentage of rows, and your first query returns about 25% of the rows. It's hard to say what the ideal percentage of rows is, but 25% is almost always too large. On my system, the execution plan changes from full table scan to index range scan when the query returns 1723 rows - but that number will likely be different for you.
There are several reasons why full table scans are better than indexes for retrieving a large percentage of rows:
Single-block versus multi-block: In Oracle, like in almost all computer systems, it can be significantly faster to retrieve multiple chunks of data at a time (sequential access) instead of retrieving one random chunk of data at a time (random access).
Clustering factor: Oracle stores all rows in blocks, which are usually 8KB and are analogous to pages. If the index is very inefficient (for example, if it is built on randomly sorted data so that two consecutive index entries rarely point to the same block), then reading 25% of the rows through the index may still require reading 100% of the table's blocks.
Algorithmic complexity: A full table scan reads the data as a simple heap, which is O(N). A single index access is much faster, at O(LOG(N)). But as the number of index accesses increases, the benefit wears off, until eventually using the index is O(N * LOG(N)).
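You can see how scattered the indexed values are by checking the clustering factor in the data dictionary, a sketch using the index name from the question:

```sql
-- Compare the clustering factor to the number of table blocks and rows.
-- A clustering factor near the block count is good; one near the row
-- count is bad (as with an index built on randomly ordered data).
SELECT i.index_name,
       i.clustering_factor,
       t.blocks,
       t.num_rows
FROM   user_indexes i
JOIN   user_tables  t ON t.table_name = i.table_name
WHERE  i.index_name = 'IDX_RANDOM_1';
```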
Some things you can do to improve performance without indexes:
Partitioning: Partitioning is the ideal solution for retrieving a large percentage of data from a table (but the option must be licensed). With partitioning, Oracle splits the logical table into multiple physical tables, and the query can read from only the required partitions. This keeps the benefit of multi-block reads while still limiting the amount of data scanned.
Parallelism: Make Oracle work harder instead of smarter. But parallelism probably isn't worth the trouble for such a small table.
Materialized views: Create tables that only store exactly what you need.
Ordering the data: Improve the index clustering factor by sorting the table data by the relevant column instead of doing it randomly. In your case, replace order by dbms_random.random with order by empno. Depending on your version and platform, you may be able to use a materialized zone map to keep the table sorted.
Compression: Shrink the table to make it faster to read the whole thing.
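The "ordering" option above could look like this sketch, rebuilding the table sorted by the query column (the new table name is a placeholder):

```sql
-- Rebuild the table sorted by EMPNO so that rows with nearby EMPNO
-- values share blocks, improving the index clustering factor.
CREATE TABLE test_random_sorted AS
SELECT * FROM test_random ORDER BY empno;
```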
That's quite a lot of information for what is possibly a minor performance problem. Before you go down this rabbit hole, it might be worth asking whether you actually have an important performance problem, as measured by a clock or by resource consumption, or whether you are just fishing for performance problems by looking at the somewhat meaningless cost metric.

Clickhouse: Should i optimize MergeTree table manually?

I have a table like:
create table test (id String, timestamp DateTime, somestring String) ENGINE = MergeTree ORDER BY (id, timestamp)
I inserted 100 records, then inserted another 100 records, and ran a select query.
When I run select * from test, ClickHouse returns 2 parts, each 100 rows long and ordered within itself. Then I ran optimize table test, and it started to return 1 part of 200 ordered rows. So should I run an optimize query after every insert, and does it improve the performance of a select query like select count(*) from test where id = 'foo'?
Merges are eventual and may never happen. It depends on the number of inserts that happened afterwards, the number of parts in the partition, and the sizes of the parts. If the total size of the input parts is greater than the maximum part size, they will never be merged.
It is very unreasonable to constantly merge down to one part.
The merger does not have that goal. On the contrary, its goal is to reach the minimum number of parts with the smallest number of merges. Merges consume huge amounts of disk and processor resources.
It makes no sense to merge two 300GB parts into one 600GB part over 3 hours. The merger has to read and decompress 600GB, merge it, compress it, and write it back, and after all that the performance of selects will not grow at all, or will grow only minimally.
Usually not; you can rely on ClickHouse background merges.
Also, ClickHouse does not try to merge all the data in a partition into one part, because such "over-optimization" can affect performance too.
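If you really do want to force a full merge for a one-off test (not as a routine step after every insert), ClickHouse supports the FINAL modifier on OPTIMIZE:

```sql
-- Force a merge of all parts in the table into one.
-- Expensive; not recommended as a routine post-insert step.
OPTIMIZE TABLE test FINAL;
```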

Why does the query run fast with a small number of rows, but takes hours for many rows

Oracle version: 10.2.0.4.0
table: va_edges_detail_temp
The fields are the following:
source_label: varchar2
target_label: varchar2
edge_weight: number
The following query:
select v.*, level
from va_edges_detail_temp v
start with v.source_label = 'smith'
connect by nocycle prior v.target_label = v.source_label
order by level;
When there are 552 rows in the table it only takes 0.005 seconds.
When there are 6600 rows in the table, execution never finishes. I waited for hours, but it does not finish, returns no result but shows no error either.
What's the matter?
Well, it's too broad a question.
In general it depends on your data, and on the number of rows produced by connecting the rows in va_edges_detail_temp. It may be n^2, n^4, or even n!.
In any case it may grow dramatically, or it may not.
Another part of performance is memory size. If the resulting row set fits into RAM, Oracle does the work in memory. If not, Oracle will spill the data to disk, which is generally a time-expensive operation.
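One way to confirm that combinatorial growth is the problem is to cap the traversal depth and watch how the row count explodes as the cap grows, a sketch based on the query from the question:

```sql
-- Cap the depth at 5 levels; raise the cap step by step and watch how
-- quickly the number of generated rows grows.
SELECT COUNT(*)
FROM (
  SELECT v.*, LEVEL AS lvl
  FROM va_edges_detail_temp v
  START WITH v.source_label = 'smith'
  CONNECT BY NOCYCLE PRIOR v.target_label = v.source_label
         AND LEVEL <= 5
);
```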

What is the reason of high CPU cost but low IO cost?

I am running a query on a table which only has 283 records. The query is going for Full table scan as no indexed column value is passed in predicate.
The cost is only 12, but the CPU cost is very high: 475,189.
What is the reason for the high CPU cost even though the table has so few records?
What is the difference between Cost & CPU Cost?
PL/SQL Developer is used as the IDE.
Query -:
SELECT qmh_client, qmh_ip_timestamp, qmh_plant, qmh_key_tsklst_grp,
qmh_grp_cntr, qmh_valid_dt, qmh_tdc_desc, qmh_cert_std,
qmh_tsklst_usage, qmh_statu, qmh_lot_size_from, qmh_lot_size_to,
qmh_tl_usage_uom, qmh_ctyp, qmh_cp_no, qmh_cp_version, qmh_tdform,
qmh_ref_tdc, qmh_licn_no, qmh_guege_len, qmh_ip_activity,
qmh_cp_activity, qmh_ip_sts_code, qmh_cp_sts_code, qmh_ltext_act,
qmh_ltxt_sts_code, qph_ip_id, qmh_ip_mess, qmh_cp_id, qmh_cp_mess,
qmh_rfd, qmh_smtp_addr, qmh_crt_time, qmh_crt_date, qmh_crt_by,
qmh_ip_upd_time, qmh_ip_upd_date, qmh_ip_upd_by, qmh_cp_upd_time,
qmh_upd_date, qmh_cp_upd_by, qmh_clas_sts_code, qmh_clas_id,
qmh_clas_mess, qmh_clas_upd_time, qmh_clas_upd_date,
qmh_clas_upd_by, qmh_prd_ind, qmh_tdc_type, qmh_pi_status
FROM ipdmdm.t_qm_insp_hdr
WHERE qmh_pi_status = 'N'
FOR UPDATE
According to the manual CPU_COST and IO_COST are measured in different ways. IO_COST is "proportional to the number of data blocks read by the operation", CPU_COST is "proportional to the number of machine cycles required for the operation".
The difference between the costs should not be too surprising since many database operations require orders of magnitude more CPU cycles than disk reads. A simple test on my PC produced similar results:
create table test1(a char(1000));
insert into test1 select level from dual connect by level <= 283;
begin
dbms_stats.gather_table_stats(user, 'TEST1');
end;
/
explain plan set statement_id = 'cost test' for select * from test1 for update;
select cpu_cost, io_cost from plan_table where statement_id = 'cost test' and id = 0;
CPU_COST IO_COST
-------- -------
348672 13
Even though it's called the Cost Based Optimizer, the cost is usually not a helpful metric when evaluating execution plans. The "Operation" and "Rows" columns are much more useful.
Also, if you're interested in explain plans, stop using the IDE's crippled view of them and use the text version that Oracle supplies. Use explain plan for select ... and select * from table(dbms_xplan.display);. PL/SQL Developer is a great tool, but its explain plan window is missing critical information (the Notes section) and has some bugs (it does not include session settings).
Check this
COST: Cost of the operation as estimated by the optimizer’s query
approach. Cost is not determined for table access operations. The
value of this column does not have any particular unit of measurement;
it is merely a weighted value used to compare costs of execution
plans. The value of this column is a function of the CPU_COST and
IO_COST columns.
CPU_COST: CPU cost of the operation as estimated by the query
optimizer’s approach. The value of this column is proportional to the
number of machine cycles required for the operation. For statements
that use the rule-based approach, this column is null.
You can refer to this article to understand what the cost column in an explain plan is:
Depending on your release and the setting for the hidden parameter
_optimizer_cost_model (cpu or io), the cost is taken from the cpu_cost
and io_cost columns in the plan table (which are, in turn, estimates
derived from sys.aux_stats$). The "cost" column does not have any
particular unit of measurement; it is a weighted average of the costs
derived from the cost-based decision tree generated while the SQL
statement is being processed. The cost column is essentially an
estimate of the run-time for a given operation.

What are the disadvantages of choosing higher timestamp precision in Oracle?

Oracle allows to specify precision of TIMESTAMP type in a table - the number of digits in the fractional part of the SECOND datetime field. Are there any disadvantages of specifying maximal precision TIMESTAMP(9)?
One reason I could think is that this information may be used for prettier output by Oracle tools.
Maximum of 9 digits suggests that the field is stored as a 4 byte integer so it should not have any performance implications, please correct if I'm wrong here.
There are no disadvantages, use timestamp(9) if it makes sense.
Timestamp(9) and timestamp(1) use the same amount of space, and their performance is identical. I could only find one case where there was a performance difference, and in that case timestamp(9) was actually faster than timestamp(1).
(I'll spare you the many lines of boring code inserting into timestamp(1) and timestamp(9) columns and comparing different
operations on them.)
This demonstrates that they use the same amount of space (inserting many values and comparing dba_segments):
--Create tables with timestamps and populate them with the same data (with different precision)
--Set initial and next to a low value so we can closely check the segment size)
create table timestamp1 (t1 timestamp(1), t2 timestamp(1), t3 timestamp(1), t4 timestamp(1), t5 timestamp(1))
storage(initial 65536 next 65536);
insert into timestamp1
select current_timestamp(1), current_timestamp(1), current_timestamp(1), current_timestamp(1), current_timestamp(1)
from dual connect by level <= 100000;
create table timestamp9 (t1 timestamp(9), t2 timestamp(9), t3 timestamp(9), t4 timestamp(9), t5 timestamp(9))
storage(initial 65536 next 65536);
insert into timestamp9
select current_timestamp(9), current_timestamp(9), current_timestamp(9), current_timestamp(9), current_timestamp(9)
from dual connect by level <= 100000;
--Segment size is identical
select segment_name, bytes from dba_segments where segment_name in ('TIMESTAMP1', 'TIMESTAMP9');
--SEGMENT_NAME BYTES
--TIMESTAMP1 8388608
--TIMESTAMP9 8388608
This is where timestamp(9) is faster, when using current_timestamp, which you'll probably need to use at some point to generate the data. But we're only talking about the difference between about 0.175 and 0.25 seconds on my slow desktop to generate 100K timestamps. I'm not sure why timestamp(9) is faster, maybe timestamps are always generated as timestamp(9) and then rounded to other precisions?
--current_timestamp(9) is slightly faster than current_timestamp(1)
select count(*) from
(
select *
from dual
--where current_timestamp(9) = current_timestamp(9)
where current_timestamp(1) = current_timestamp(1)
connect by level <= 100000
);
EDIT: The performance difference exists in 10g but not 11g.
The problem is performance. You must trade it against precision. Smaller numbers are read and written in fewer CPU instructions. A CPU instruction takes less than a nanosecond, but if your server serves millions of transactions you might see some performance decrease, which suggests adopting less precision, or even no precision (rounding all timestamps to the second is quite acceptable in most scenarios, even in banking).
But if, for some reason (e.g. real-time system logging), you need more precision, you are forced to use a higher precision and thus accept a performance decrease. If your server doesn't process a great number of transactions per second you'll see almost no performance impact, but if you don't need the precision you're wasting memory.
Hope this helps. If you share your DB requirements with us, we might help you choose the best compromise.
The difference is not in the technical use of the Timestamp data type, but the application. FERC and NERC often require a certain precision when used in applications labeled critical infrastructure and as such they will use the highest precision made available.
Of course, making the suits happy with their sequence of events records often requires doing more than laid out by CIP-002 through CIP-009
No disadvantages if you are always going to use the data as a date/timestamp datatype inside Oracle and in the middle tier; however, you have to look at how your entire application/solution uses that column.
Are you truncating the data before you display it ?
Is it a requirement for compliance and it is mainly read ?
Are you converting that column to a string to compare it to another column ?
Is it a requirement for auditing or for order capturing?
Don't worry too much about the differences in read and write performance; they are negligible. Evaluate your overall requirements as a whole, from storage to UI.
