Column statistics in oracle 19? - oracle

Does an extended metadata table exist in oracle 19 holding column statistics from all tables? I know there is the table ALL_TAB_COL_STATISTICS which stores histogram, min, max, num distinct values etc. but I need additional statistics such as mean, median, or percentiles? If there is such a statistics table how and when gets the table updated? I do not want to calculate the statistics myself but want to extract them as metadata.
Thanks for your help.

I believe you are looking for something that Oracle never intended to. Oracle Statistics are information useful for the CBO ( Cost Based Optimizer ) in order to determine the best execution plan for any SQL statement. It stores information regarding the density and cardinality of the data stored there, but you want analytical data of it, which is not the purpose of these statistics.
You always might rely in Oracle Functions to retrieve those:
select avg(column) as mean_rating , MEDIAN(column) OVER () AS median_value from
yourtable
Regarding percentiles, I would check PERCENTIL_CONT and PERCENTILE_DISC. But like I said, you won't find those in any metadata dictionary table.

Related

Move Range Interval partition data from one table to history table in other database

We have a primary table that is Range partitioned by date with a 1-month interval. It's also a list sub-partitioned with 4 distinct values. So essentially it is one month partition having 4 sub-partitions.
Database: Oracle 19c
I need advice on how to effectively move the partition/sub-partition data from active schema to historical schema in another database.
Also, there are about 30 tables that are referenced partitioned on the primary table for which the data needs to be moved as well. Overall I'm looking to move about 2500 subpartitions
I'm not sure if an exchange partition would be the right approach in this scenario?
TIA
You could use exchange to get the data rapidly out of your active table, but you would still then to send that table over the wire to the remote history database to load it in.
In which case, using "exchange" probably is just adding more steps to the process for little gain. (There are still potential uses here depending on how you want to handle indexing etc).
But simplest is perhaps just transferring the data over, assuming a common structure between the two tables, ie
insert /*+ APPEND */ into history_table#remote_db
select * from active_table partition ( myparname )
I can't remember if partition naming syntax is supported over a db link, but if not, then the appropriate date predicates will do the same trick, and then just follow up with:
alter table active_table truncate partition myparname;

How to create an index so that the following statement has an index

I have table A(id,name,code)
I have sql statement:
Select * from A where upper(code || name) like upper('%<search text>%');
How to create an index so that the following statement has an index?
Question for two option: table partitioned, and table not partitioned
Thanks & BR
Do you have a performance issue or is this just a hypothetical question?
An index is unlikely to help with this example: a full table scan will probably be the quickest solution. Why? Your table has 3 columns. The best index would be one that avoided looking in the table at all e.g.
create index ai on a (code, name, id);
But that needs to contain all the same data as the table plus a ROWID for each table row - so it is going to be bigger than the table and take longer to scan. You could try putting the columns in the index with the least selective first and using compression:
create index ai on a (code, name, id) compress;
Now the index may be smaller than the table - it depends on how selective the code and name columns are. If it is small enough, the optimizer might decide to use it instead of the table. It still contains all the IDs and ROWIDs so the reduction in size probably won't be dramatic. In the test case I set up the compressed index is about half the size of the table, yet Explain Plan shows the query has a higher cost if I use a hint to force it to use the index - maybe due to overheads of compression, I don't know.
You could look into Oracle Text and the CONTAINS expression - but then you would be writing a different query, not using LIKE.

Hive query optimisation

Have to perform incremental load into an internal table from an external table in hive when the source data file is appended with new records, on a daily basis. The new records can be filtered out based on the timestamp(column load_ts in the table) at which they were loaded. Trying to achieve this by selecting the records from source table whose load_ts is greater than the current max(load_ts) in the target table as given below:
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms
JOIN (select max(load_ts) max_load_ts from target_temp) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;
But the above query does not give the desired output. Takes very long time for execution (should not be the case with Map-Reduce paradigm).
Tried other scenarios also like passing the max(load_ts) as a variable, instead of joining. Still no improvement in the performance. Would be very helpful if anyone can give their insights as to what is possibly incorrect in this approach, with any alternate solutions.
First of all, the map/reduce model does not guarantee that your queries will take less. The main idea is that its performance will scale linearly with the number of nodes, but you have to still think about how you're doing things, more so than in normal SQL.
First thing to check is if the source table is partitioned by time. If not, it should as you'd be reading the whole table every single time.
Second, you're calculating the max as well every time, also, on the whole destination table. You could make it a lot faster if you just calculate the max on the last partition, so change this
JOIN (select max(load_ts) max_load_ts from target_temp) mt
to this (you didn't write the partition column so I am going to assume it's called 'dt'
JOIN (select max(load_ts) max_load_ts from target_temp WHERE dt=PREVIOUS_DATA_DT) mt
since we know the max load_ts is going to be in the last partition.
Otherwise, it's hard to help without knowing the structure of the source table, and, like somebody else commented, the sizes of the two tables.
JOIN is slower than variable in the WHERE clause. But the main problem with performance here is that your query performs full scan of target table and source table. I would recommend:
Query only the latest partition for max(load_ts).
Enable statistics gathering and usage
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.autogather=true;
Compute statistics on both tables for columns.
Statistics will make queries like selecting MAX(partition) or max(ts) executing faster
Try to put source partition files into target partition folder instead of INSERT if applicable (target and source tables partitioning and storage format should enable this). It works fine for example for textfile storage format and if source table partition contain only rows>max(target_partition). You can combine both copy files method(for those source partitions that exactly contain rows to be inserting without filtering) and INSERT(for partitions containing mixed data that need to be filtering).
Hive may be merging your files during INSERT. This merge phase takes additional time and adds additional stage job. Check hive.merge.mapredfiles option and try to switch it off.
And of course use pre-calculated variable instead of join.
Use Cost-Based Optimisation Technique by enabling below properties
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;
Also analyze the table
ANALYZE TABLE temp_db.source_temp COMPUTE STATISTICS [comma_separated_column_list];
ANALYZE TABLE target_temp PARTITION(DATA_DT) COMPUTE STATISTICS;

What type of index should be used in Oracle

I have a large table of 7 column in Oracle 11G. Total size of the table is more than 3GB and total row in this table is 1876823. Query we are using
select doc_mstr_id from index_mstr where page_con1 like('%sachin%') it is taking almost a minute. please help me to optimize the query as well as proper indexing for this table. Please let me know if partitioned is required for this table.
Below are the column description
INDEX_MSTR_ID NUMBER
DOC_MSTR_ID NUMBER
PAGE_NO NUMBER
PAGE_PART NUMBER
PAGE_CON1 VARCHAR2(4000)
FILE_MODIFIED_DATE DATE
CREATED_DATE DATE
This query is always going to result in a full table scan. Your only filter cannot use a B-TREE index, due to the leading wildcard:
where page_con1 like('%sachin%')
If you want to do lots of queries of this nature you need to build a Text index on that column. From its datatype page_con1 appears to hold text fragments rather than full documents so you should use a CTXCAT index. This type of index has the advantage of being transactional, rather than requiring background maintenance. Find out more.
Your query would then look like this:
select doc_mstr_id from index_mstr
WHERE CATSEARCH(page_con1, 'sachin') > 0;

Improving DELETE and INSERT times on a large table that has an index structure

Our application manages a table containing a per-user set of rows that is the
result of a computationally-intensive query. Storing this result in a table
seems a good way of speeding up further calculations.
The structure of that table is basically the following:
CREATE TABLE per_user_result_set
( user_login VARCHAR2(N)
, result_set_item_id VARCHAR2(M)
, CONSTRAINT result_set_pk PRIMARY KEY(user_login, result_set_item_id)
)
;
A typical user of our application will have this result set computed 30 times a
day, with a result set consisting of between 1 single items and 500,000 items.
A typical customer will declare about 500 users into the production database.
So, this table will typically consist of 5 million rows.
The typical query that we use to update this table is:
BEGIN
DELETE FROM per_user_result_set WHERE user_login = :x;
INSERT INTO per_user_result_set(...) SELECT :x, ... FROM ...;
END;
/
After having run into performance issues (the DELETE part would take much time)
we decided to have a GLOBAL TEMPORARY TABLE (on commit delete rows) to hold a
“delta” of rows to suppress from the table and rows to insert into it:
BEGIN
INSERT INTO _tmp
SELECT ... FROM ...
MINUS SELECT result_set_item_id
FROM per_user_result_set
WHERE user_login = :x;
DELETE FROM per_user_result_set
WHERE user_login = :x
AND result_set_item_id NOT IN (SELECT result_set_item_id
FROM _tmp
);
INSERT INTO per_user_result_set
SELECT :x, result_set_item_id
FROM _tmp;
COMMIT;
END;
/
This has improved performance a bit, but still this is not satisfactory. So
we're exploring ways to speed up that process and here are the issues that
we experience:
We would have loved to use table partitioning (partitioning by user_login).
But partitioning is not always available (on our test databases we hit
ORA-00439). Our customers cannot all afford Oracle Enterprise Edition with
paid additional features.
We could make the per_user_result_set table GLOBAL TEMPORARY, so that it
is isolated and we can TRUNCATE it for example… but our application
sometimes loses connection to Oracle due to network problems, and will
automatically reconnect. By that time we lose the contents of our
computation.
We could split that table into a certain number of buckets, make a view that
UNIONs ALL all those buckets, and triggers INSTEAD OF UPDATE and DELETE on
that view, and repart rows according to ORA_HASH(user_login) % num_buckets.
But we are afraid this could make SELECT operations much slower.
This would result in a constant number of tables, with smaller indexes
affected in DELETE or INSERT operations. In short, “partioning table for the
poor”.
We've tried to ALTER TABLE per_user_result_set NOLOGGING. This does not
improve things much.
We've tried to CREATE TABLE ... ORGANIZATION INDEX COMPRESS 1. This speeds
things up by a ratio of 1:5.
We've tried to have one table per user_login. That's exactly what we could
have by partitioning using a number of partitions equal to the number of
distinct user_logins and a well-chosen hash function. Performance factor is
1:10. But I would really like to avoid this solution: have to maintain a
huge number of indexes, tables, views, on a per-user basis. This would be
an interesting performance gain for the users, but not for us maintainers of
the systems.
Since the users work at the same time there is no way that we create a new
table and swap it with the old one.
What could you please suggest in complement to these approaches?
Note. Our customers run Oracle Databases from 9i to 11g, and XE editions to
Enterprise edition. That's a wide variety of versions that we need to be
compatible with.
Thanks.
We've tried to have one table per user_login. That's exactly what we
could have by partitioning using a number of partitions equal to the
number of distinct user_logins and a well-chosen hash function.
Performance factor is 1:10. But I would really like to avoid this
solution: have to maintain a huge number of indexes, tables, views, on
a per-user basis. This would be an interesting performance gain for
the users, but not for us maintainers of the systems.
Can you then make a stored procedure to generate these table on a per-user basis? Or, better yet, have this stored procedure do the most appropriate thing depending on the licensure of Oracle being supported?
If Partitioning option
then create or truncate user-specific list partition
Else
drop user-specific result table
Create user-specific result table
as Select from template result table
create indexes
create constraints
perform grants
end if
Perform insert
If all your users were on 11g Enterprise Edition I would recommend you to use Oracle's built-in result-set caching rather than trying to roll your own. But that is not the case, so let's move on.
Another attractive option might be to use PL/SQL collections rather than tables. Being in memory these are faster to retrieve and require less maintenance. They are also supported in all the versions you need. However, they are session variables, so if you have lots of users with big result sets that would put stress on your PGA allocations. Also their data would be lost when the network connection drops. So that's probably not the solution you're looking for.
The core of your problem is this statement:
DELETE FROM per_user_result_set WHERE user_login = :x;
It's not a problem in itself but you have extreme variations in data distribution. Bluntly, the deletion of a single row is going to have a very different performance profile from the deletion of half a million rows. And because your users are constantly refreshing their data there is no way you can handle that, except by giving your users their own tables.
You say you don't want to have a table per user because
"[it] would be an interesting performance gain for the users, but not
for us maintainers of the systems,"
Systems exist for the benefit of our users. Convenience for us is great as long as it helps us to provide better service to them. But their need for a good working experience trumps ours: they pay the bills.
But I question whether having individual tables for each user really increases the work load. I presume each user has their own account, and hence schema.
I suggest you stick with index-organized tables. You only need columns which are in the primary key and maintaining a separate index is unnecessary overhead (for both inserting and deleting). The big advantage of having a table per user is that you can use TRUNCATE TABLE in the refresh process, which is a lot faster than deletion.
So your refresh procedure will look like this:
BEGIN
TRUNCATE TABLE per_user_result_set REUSE STORAGE;
INSERT INTO per_user_result_set(...)
SELECT ... FROM ...;
DBMS_STATS.GATHER_TABLE_STATS(user
, 'PER_USER_RESULT_SET'
, estimate_percent=>10);
COMMIT;
END;
/
Note that you don't need to include the USER column any more, so yur table will just have the single column of result_set_item_id (another indication of the suitability of IOT.
Gathering the table stats isn't mandatory but it is advisable. You have a wide variability in the size of result sets, and you don't want to be using an execution plan devised for 500000 rows when the table has only one row, or vice versa.
The only overhead is the need to create the table in the user's schema. But presumably you already have some set-up for a new user - creating the account, granting privileges, etc - so this shouldn't be a big hardship.

Resources