Insert to Oracle DB is slow - JDBC [duplicate]

I am working on a file loader program.
Its purpose is to take an input file, apply some conversions to its data, and then upload the data into an Oracle database.
The problem I am facing is that I need to optimize the insertion of very large input data into Oracle.
I am uploading data into a table, let's say ABC.
I am using the OCI library provided by Oracle in my C++ program.
Specifically, I am using the OCI connection pool for multi-threaded loading into Oracle (http://docs.oracle.com/cd/B28359_01/appdev.111/b28395/oci09adv.htm).
The following are the DDL statements that have been used to create the table ABC –
CREATE TABLE ABC(
seq_no NUMBER NOT NULL,
ssm_id VARCHAR2(9) NOT NULL,
invocation_id VARCHAR2(100) NOT NULL,
calc_id VARCHAR2(100) NULL,
analytic_id VARCHAR2(100) NOT NULL,
analytic_value NUMBER NOT NULL,
override VARCHAR2(1) DEFAULT 'N' NOT NULL,
update_source VARCHAR2(255) NOT NULL,
last_chg_user CHAR(10) DEFAULT USER NOT NULL,
last_chg_date TIMESTAMP(3) DEFAULT SYSTIMESTAMP NOT NULL
);
CREATE UNIQUE INDEX ABC_indx ON ABC(seq_no, ssm_id, invocation_id, analytic_id);
/
CREATE SEQUENCE ABC_seq;
/
CREATE OR REPLACE TRIGGER ABC_insert
BEFORE INSERT ON ABC
FOR EACH ROW
BEGIN
SELECT ABC_seq.nextval INTO :new.seq_no FROM DUAL;
END;
/
I am currently using the following query pattern to upload the data into the database. I am sending data in batches of 500 rows per statement via the various threads of the OCI connection pool.
Sample of the SQL insert query used -
insert into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source)
select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual
EXECUTION PLAN by Oracle for the above query -
-----------------------------------------------------------------------------
| Id | Operation | Name|Rows| Cost (%CPU) | Time |
-----------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 4 | 8 (0) | 00:00:01 |
| 1 | LOAD TABLE CONVENTIONAL | ABC | | | |
| 2 | UNION-ALL | | | | |
| 3 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 4 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 5 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 6 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
The run times of the program loading 1 million rows -
Batch Size = 500
Number of threads    Execution Time (m:ss)
10 4:19
20 1:58
30 1:17
40 1:34
45 2:06
50 1:21
60 1:24
70 1:41
80 1:43
90 2:17
100 2:06
Average Run Time = 1:57 (Roughly 2 minutes)
I need to optimize and reduce this time further. The problem arises when I load 10 million rows.
The average run time for 10 million rows came out to be 21 minutes.
(My target is to reduce this time to below 10 minutes.)
So I tried the following steps as well -
[1]
Partitioned the table ABC on the basis of seq_no.
Used 30 partitions.
Tested with 1 million rows - the performance was very poor, almost 4 times worse than with the unpartitioned table.
[2]
Partitioned the table ABC on the basis of last_chg_date instead.
Used 30 partitions.
2.a) Tested with 1 million rows - the performance was almost equal to the unpartitioned table. The difference was very small, so it was not considered further.
2.b) Tested the same with 10 million rows - again the performance was almost equal to the unpartitioned table, with no noticeable difference.
The following DDL commands were used to achieve the partitioning -
CREATE TABLESPACE ts1 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts2 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts3 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts4 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts5 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts6 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts7 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts8 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts9 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts10 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts11 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts12 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts13 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts14 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts15 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts16 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts17 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts18 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts19 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts20 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts21 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts22 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts23 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts24 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts25 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts26 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts27 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts28 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts29 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts30 DATAFILE AUTOEXTEND ON;
CREATE TABLE ABC(
seq_no NUMBER NOT NULL,
ssm_id VARCHAR2(9) NOT NULL,
invocation_id VARCHAR2(100) NOT NULL,
calc_id VARCHAR2(100) NULL,
analytic_id VARCHAR2(100) NOT NULL,
ANALYTIC_VALUE NUMBER NOT NULL,
override VARCHAR2(1) DEFAULT 'N' NOT NULL,
update_source VARCHAR2(255) NOT NULL,
last_chg_user CHAR(10) DEFAULT USER NOT NULL,
last_chg_date TIMESTAMP(3) DEFAULT SYSTIMESTAMP NOT NULL
)
PARTITION BY HASH(last_chg_date)
PARTITIONS 30
STORE IN (ts1, ts2, ts3, ts4, ts5, ts6, ts7, ts8, ts9, ts10, ts11, ts12, ts13,
ts14, ts15, ts16, ts17, ts18, ts19, ts20, ts21, ts22, ts23, ts24, ts25, ts26,
ts27, ts28, ts29, ts30);
The code that I am using in the thread function (written in C++, using OCI) -
void OracleLoader::bulkInsertThread(std::vector<std::string> const & statements)
{
    try
    {
        INFO("ORACLE_LOADER_THREAD","Entered Thread = %1%", m_env);
        string useOraUsr = "some_user";
        string useOraPwd = "some_password";
        int user_name_len = useOraUsr.length();
        int passwd_name_len = useOraPwd.length();
        text* username((text*)useOraUsr.c_str());
        text* password((text*)useOraPwd.c_str());

        if(! m_env)
        {
            CreateOraEnvAndConnect();
        }
        OCISvcCtx *m_svc = (OCISvcCtx *)0;
        OCIStmt *m_stm = (OCIStmt *)0;
        checkerr(m_err, OCILogon2(m_env,
                                  m_err,
                                  &m_svc,
                                  (CONST OraText *)username,
                                  user_name_len,
                                  (CONST OraText *)password,
                                  passwd_name_len,
                                  (CONST OraText *)poolName,
                                  poolNameLen,
                                  OCI_CPOOL));
        OCIHandleAlloc(m_env, (dvoid **)&m_stm, OCI_HTYPE_STMT, (size_t)0, (dvoid **)0);
        ////////// Executes queries in the format of - ///////////////////
        // insert into pm_own.sec_analytics (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value, override, update_source)
        // select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
        // union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
        // union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
        // union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual
        //////////////////////////////////////////////////////////////////
        size_t startOffset = 0;
        const int batch_size = PCSecAnalyticsContext::instance().getBatchCount();
        while (startOffset < statements.size())
        {
            int remaining = (startOffset + batch_size < statements.size()) ? batch_size : (statements.size() - startOffset);
            // Break the statement vector up to meet the batch size
            std::vector<std::string> items(statements.begin() + startOffset,
                                           statements.begin() + startOffset + remaining);
            //! Preparing the query
            std::string insert_query = "insert into ";
            insert_query += Context::instance().getUpdateTable();
            insert_query += " (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value, override, update_source)\n";
            std::vector<std::string>::const_iterator i3 = items.begin();
            insert_query += *i3;
            for (i3 = items.begin() + 1; i3 != items.end(); ++i3)
                insert_query += "union all " + *i3;   // "union all", so duplicate rows are not collapsed
            // Preparing the statement and then executing it
            text *txtQuery((text *)(insert_query).c_str());
            checkerr(m_err, OCIStmtPrepare(m_stm, m_err, txtQuery, strlen((char *)txtQuery), OCI_NTV_SYNTAX, OCI_DEFAULT));
            checkerr(m_err, OCIStmtExecute(m_svc, m_stm, m_err, (ub4)1, (ub4)0, (OCISnapshot *)0, (OCISnapshot *)0, OCI_DEFAULT));
            startOffset += batch_size;
        }
        // Commit once, at the end of each thread.
        checkerr(m_err, OCITransCommit(m_svc, m_err, (ub4)0));
        checkerr(m_err, OCIHandleFree((dvoid *)m_stm, OCI_HTYPE_STMT));
        checkerr(m_err, OCILogoff(m_svc, m_err));
        INFO("ORACLE_LOADER_THREAD","Thread Complete. Leaving Thread.");
    }
    catch(AnException &ex)
    {
        ERROR("ORACLE_LOADER_THREAD", "Oracle query failed with : %1%", std::string(ex.what()));
        throw AnException(string("Oracle query failed with : ") + ex.what());
    }
}
While this post was being answered, several methods were suggested to me to optimize my INSERT query.
I chose and used QUERY I in my program, for the following reasons that I discovered while testing the various INSERT queries.
On running the SQL queries that were suggested to me -
QUERY I -
insert into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source)
select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual
EXECUTION PLAN by Oracle for Query I -
--------------------------------------------------------------------------
| Id | Operation | Name| Rows | Cost (%CPU) | Time |
--------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 4 | 8 (0) | 00:00:01 |
| 1 | LOAD TABLE CONVENTIONAL | ABC | | | |
| 2 | UNION-ALL | | | | |
| 3 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 4 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 5 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 6 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
QUERY II -
insert all
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','b',NULL, 'test', 123 , 'N', 'asdf')
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','e',NULL, 'test', 123 , 'N', 'asdf')
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','r',NULL, 'test', 123 , 'N', 'asdf')
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','t',NULL, 'test', 123 , 'N', 'asdf')
select 1 from dual
EXECUTION PLAN by Oracle for Query II -
-----------------------------------------------------------------------------
| Id | Operation | Name| Rows | Cost (%CPU) | Time |
-----------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 2 (0) | 00:00:01 |
| 1 | MULTI-TABLE INSERT | | | | |
| 2 | FAST DUAL | | 1 | 2 (0) | 00:00:01 |
| 3 | INTO | ABC | | | |
| 4 | INTO | ABC | | | |
| 5 | INTO | ABC | | | |
| 6 | INTO | ABC | | | |
As per these experiments, Query I is faster.
Here I tested in Oracle SQL Developer, and I also sent the insert queries from my C++ program (FILELOADER).
On reading further about it, I found that the cost shown by the execution plan is an estimate of the resources (CPU and I/O) the query will use to process itself.
That tells me that Oracle expects to use more resources to process the first query, which is why its cost comes out as 8.
Even when using the same insert pattern via my application, I found that its performance is almost 1.5 times better.
I need some insight on how I can improve the performance even further.
All the things that I have tried are summarized in this question.
If I find or discover anything relevant, I will add it to this question.
My target is to bring the upload time of 10 million rows under 10 minutes.

I know others have mentioned this and you don't want to hear it but use SQL*Loader or external tables. My average load time for tables of approximately the same width is 12.57 seconds for just over 10m rows. These utilities have been explicitly designed to load data into the database quickly and are pretty good at it. This may incur some additional time penalties depending on the format of your input file, but there are quite a few options and I've rarely had to change files prior to loading.
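By way of illustration, here is a hedged sketch of the external-table route (the directory path, file name, and input format are assumptions; the column list mirrors the ABC DDL in the question):
CREATE OR REPLACE DIRECTORY load_dir AS '/data/loads';   -- hypothetical path

CREATE TABLE abc_ext (
  ssm_id         VARCHAR2(9),
  invocation_id  VARCHAR2(100),
  calc_id        VARCHAR2(100),
  analytic_id    VARCHAR2(100),
  analytic_value NUMBER,
  override       VARCHAR2(1),
  update_source  VARCHAR2(255)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY load_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','          -- assumed input format
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('abc_data.csv')           -- hypothetical file name
);

-- The file is then readable like any table:
INSERT INTO abc (seq_no, ssm_id, invocation_id, calc_id, analytic_id,
                 analytic_value, override, update_source)
SELECT abc_seq.nextval, e.* FROM abc_ext e;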
If you're unwilling to do this then you don't have to upgrade your hardware yet; you need to remove every possible impediment to loading this quickly. To enumerate them, remove:
The index
The trigger
The sequence
The partition
With all of these you're obliging the database to perform more work and because you're doing this transactionally, you're not using the database to its full potential.
Load the data into a separate table, say ABC_LOAD. After the data has been completely loaded, perform a single INSERT statement into ABC.
insert into abc
select abc_seq.nextval, a.*
from abc_load a
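For completeness, a sketch of how the bare ABC_LOAD staging table itself might be created (a hypothetical CTAS that copies the column definitions but deliberately carries no index, trigger, sequence, or partitions):
CREATE TABLE abc_load AS
SELECT ssm_id, invocation_id, calc_id, analytic_id,
       analytic_value, override, update_source
FROM abc
WHERE 1 = 0;   -- copies structure only, no rows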
When you do this (and even if you don't) ensure that the sequence cache size is correct; to quote:
When an application accesses a sequence in the sequence cache, the
sequence numbers are read quickly. However, if an application accesses
a sequence that is not in the cache, then the sequence must be read
from disk to the cache before the sequence numbers are used.
If your applications use many sequences concurrently, then your
sequence cache might not be large enough to hold all the sequences. In
this case, access to sequence numbers might often require disk reads.
For fast access to all sequences, be sure your cache has enough
entries to hold all the sequences used concurrently by your
applications.
This means that if you have 10 threads concurrently writing 500 records each using this sequence then you need a cache size of 5,000. The ALTER SEQUENCE document states how to change this:
alter sequence abc_seq cache 5000
If you follow my suggestion I'd up the cache size to something around 10.5m.
Look into using the APPEND hint (see also Oracle Base); this instructs Oracle to use a direct-path insert, which appends data directly to the end of the table rather than searching for existing space to put it in. You won't be able to use this if your table has indexes, but you could use it on ABC_LOAD:
insert /*+ append */ into ABC (SSM_ID, invocation_id , calc_id, ... )
select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual
If you use the APPEND hint, I'd add a TRUNCATE of ABC_LOAD after you've inserted into ABC; otherwise this table will grow indefinitely. This should be safe, as you will have finished using the table by then. The overall flow is sketched below.
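Putting the pieces together, the end-of-load sequence could look like this sketch (hedged; note that a direct-path insert must be committed before the loaded segment can be queried again, and the column list is spelled out so the defaulted audit columns are filled in by Oracle):
-- Threads direct-path insert into the bare staging table...
INSERT /*+ append */ INTO abc_load (ssm_id, invocation_id, calc_id, analytic_id,
                                    analytic_value, override, update_source)
SELECT 'c','b',NULL, 'test', 123, 'N', 'asdf' FROM dual
UNION ALL SELECT 'a','b',NULL, 'test', 123, 'N', 'asdf' FROM dual;
COMMIT;

-- ...then one statement moves everything into ABC and the staging table is emptied.
INSERT INTO abc (seq_no, ssm_id, invocation_id, calc_id, analytic_id,
                 analytic_value, override, update_source)
SELECT abc_seq.nextval, a.* FROM abc_load a;
COMMIT;
TRUNCATE TABLE abc_load;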
You don't mention what version or edition of Oracle you're using. There are a number of extra little tricks you can use:
Oracle 12c
This version supports identity columns; you could get rid of the sequence completely.
CREATE TABLE ABC(
seq_no NUMBER GENERATED AS IDENTITY (increment by 5000)
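A minimal sketch of what the full 12c definition could look like (the CACHE value is an assumption, sized per the sequence advice above):
CREATE TABLE ABC(
  seq_no         NUMBER GENERATED ALWAYS AS IDENTITY (CACHE 5000),
  ssm_id         VARCHAR2(9) NOT NULL,
  invocation_id  VARCHAR2(100) NOT NULL,
  calc_id        VARCHAR2(100),
  analytic_id    VARCHAR2(100) NOT NULL,
  analytic_value NUMBER NOT NULL,
  override       VARCHAR2(1) DEFAULT 'N' NOT NULL,
  update_source  VARCHAR2(255) NOT NULL,
  last_chg_user  CHAR(10) DEFAULT USER NOT NULL,
  last_chg_date  TIMESTAMP(3) DEFAULT SYSTIMESTAMP NOT NULL
);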
Oracle 11g r2
If you keep the trigger; you can assign the sequence value directly.
:new.seq_no := ABC_seq.nextval;
Oracle Enterprise Edition
If you're using Oracle Enterprise you can speed up the INSERT from ABC_LOAD by using the PARALLEL hint:
insert /*+ parallel */ into abc
select abc_seq.nextval, a.*
from abc_load a
This can cause its own problems (too many parallel processes etc.), so test. It might help for the smaller batch inserts, but it's less likely, as you'll lose time computing which thread should process what. One session-level caveat is sketched below.
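Parallel DML is disabled by default at the session level, so without enabling it the INSERT part of the statement runs serially even with the hint:
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ parallel */ INTO abc
SELECT abc_seq.nextval, a.*
FROM abc_load a;

COMMIT;   -- required before the session can read the table again after parallel (direct-path) DML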
tl;dr
Use the utilities that come with the database.
If you can't use them then get rid of everything that might slow the insert down and do it in bulk, 'cause that's what the database is good at.

If you have a text file you should try SQL*Loader with direct path. It is really fast and it is designed for this kind of massive data load. Have a look at the options that can improve the performance.
As a secondary advantage for ETL, your file in clear text will be smaller and easier to audit than 10^7 inserts.
If you need to make some transformations, you can do them afterwards in Oracle.
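A hedged sketch of what a control file for this data could look like (file name, delimiter, and options are assumptions; DIRECT=TRUE selects the direct path, and loading a bare staging table such as ABC_LOAD from the answer above sidesteps the trigger and unique-index restrictions of direct-path loads):
-- abc.ctl (hypothetical control file)
OPTIONS (DIRECT=TRUE, ERRORS=1000)
LOAD DATA
INFILE 'abc_data.csv'
APPEND
INTO TABLE abc_load
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(ssm_id, invocation_id, calc_id, analytic_id, analytic_value, override, update_source)
-- invoked as: sqlldr some_user/some_password control=abc.ctl log=abc.log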

You should try bulk inserting your data. For that purpose, you can use OCI*ML. The discussion of it is here. A notable article is here.
Or you may try the Oracle SQL bulk loader SQLLDR itself to increase your upload speed. To do that, serialize the data into a CSV file and call SQLLDR passing the CSV as an argument.
Another possible optimization is the transaction strategy. Try inserting all data in one transaction per thread/connection.
Another approach is to use a multi-table INSERT ALL:
INSERT ALL
INTO ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source ) VALUES ('c','b',NULL, 'test', 123 , 'N', 'asdf')
INTO ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source ) VALUES ('a','b',NULL, 'test', 123 , 'N', 'asdf')
INTO ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source ) VALUES ('b','b',NULL, 'test', 123 , 'N', 'asdf')
SELECT 1 FROM DUAL;
instead of INSERT ... UNION ALL.
Your sample data looks interdependent, which leads to inserting one significant row and then extending it into four rows with a post-insert SQL query.
Also, turn off all indexes before the insert batch (or drop them and re-create them once the bulk load is done); a sketch follows below. A table index reduces insert performance while you are not actually using it (it computes an entry for every inserted row and performs the corresponding maintenance operations).
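As referenced above, a sketch of handling the question's unique index around the load; note that SKIP_UNUSABLE_INDEXES does not let DML bypass a UNIQUE index that is marked unusable, so dropping and re-creating it is the simpler route:
DROP INDEX ABC_indx;

-- ... run the bulk load here ...

CREATE UNIQUE INDEX ABC_indx ON ABC(seq_no, ssm_id, invocation_id, analytic_id);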
Using prepared statement syntax should also speed up the upload routine, since the server will already have a parsed, cached statement; a sketch of the statement text follows.
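A minimal sketch of that statement text: prepared once with placeholders and re-executed with fresh bind values (or, via OCI array binding, a whole batch of values per execute), instead of concatenating a new UNION ALL statement for every batch:
INSERT INTO abc (ssm_id, invocation_id, calc_id, analytic_id,
                 analytic_value, override, update_source)
VALUES (:1, :2, :3, :4, :5, :6, :7);
Bound this way, each 500-row batch is one execute call against an already-parsed statement, rather than a brand-new 500-branch statement that must be parsed first.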
Then, optimize your C++ code:
move operations out of the loop:
//! Prepare the query text once, outside the loop
std::string insert_query = "insert into ";
insert_query += Context::instance().getUpdateTable();
insert_query += " (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value, override, update_source)\n";
while (startOffset < statements.size())
{ ... }

By the way, did you try increasing the number of physical clients, not just threads? By running in a cloud on several VMs, or on several physical machines. I recently read comments, I think from the Aerospike developers, where they explain that many people are unable to reproduce their results simply because they don't realize it's not that easy to make a client actually send that many queries per second (above 1M per second in their case). For instance, for their benchmark they had to run 4 clients in parallel. Maybe this particular Oracle driver is just not fast enough to support more than 7-8 thousand requests per second on a single machine?

Related

Oracle: How to estimate the size of a view?

In Oracle I can get the size of a table. I would like to estimate the size of a (non-materialized) view. Is it possible?
I know that views don't have any data per se, but we are moving the data to our data lake and would like to estimate it. Knowing the size, we will be able to optimize our resources and speed up the process.
You can use EXPLAIN PLAN to estimate the number of bytes and rows that will be returned by reading the entire view. But keep in mind that these numbers are only estimates, they depend on having current statistics, and they will be less accurate for more complicated queries.
For example, on my system, EXPLAIN PLAN estimates that a somewhat complicated metadata view will return 34 MB and 75,590 rows. Whereas the actual values are roughly 14 MB and 85,402 rows.
Commands:
explain plan for select * from dba_objects;
select * from table(dbms_xplan.display);
Results:
Plan hash value: 3423780594
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 75590 | 34M| 134K (1)| 00:00:06 |
| 1 | VIEW | DBA_OBJECTS | 75590 | 34M| 134K (1)| 00:00:06 |
| 2 | UNION-ALL | | | | | |
...
Estimate multiple views in a single query
With a few tricks you can create estimates for multiple views all within a single query. This solution requires Oracle 12.1 or higher. The WITH FUNCTION syntax is a bit odd, and some IDEs struggle with it, so you might have to play around with the semicolon and slash at the end.
--Create sample views from data dictionary views.
create or replace view view1 as select * from all_tables;
create or replace view view2 as select * from all_tab_privs;
create or replace view view3 as select * from all_objects;
--Get the estimated size of each query. The actual values will differ for your database.
with function get_bytes(p_view_name varchar2) return number is
v_bytes number;
--(Needed because "explain plan" is technically DML, which normally shouldn't be executed inside a query.)
pragma autonomous_transaction;
begin
--Create an explain plan for reading everything from the view.
execute immediate replace(q'[explain plan set statement_id = '#VIEW_NAME#' for select * from #VIEW_NAME#]', '#VIEW_NAME#', p_view_name);
--Get the size in bytes.
--Latest plan information. (In case the explain plan was generated multiple times.)
select max(bytes)
into v_bytes
from
(
--Plan information.
select bytes, timestamp, max(timestamp) over () latest_timestamp
from plan_table
where statement_id = p_view_name and id = 0
)
where latest_timestamp = timestamp;
--As part of the AUTONOMOUS_TRANSACTION feature, the function must either commit or rollback.
rollback;
return v_bytes;
end;
select view_name, round(get_bytes(view_name) / 1024 / 1024, 1) mb
from user_views
order by mb desc, view_name;
/
Results:
VIEW_NAME MB
------------ ----------
VIEW3 2.4
VIEW1 .8
VIEW2 .7

Oracle - Column Histograms Showing NONE even after GATHER_TABLE_STATS

I am trying to do performance tuning on a SQL query in Oracle 12c which uses a window partition. There is an index on HUB_POL_KEY, PIT_EFF_START_DT on the table PIT. While running the explain plan with the /*+ gather_plan_statistics */ hint, I observed a WINDOW SORT step in the plan with an estimated row count of 5000K and an actual row count of 1100. I executed DBMS_STATS.GATHER_TABLE_STATS on the table. When I check the USER_TAB_COLUMNS view, I see there is no histogram generated for HUB_POL_KEY or PIT_EFF_START_DT. However, histograms exist for all the other columns.
SQL Query
SELECT
PIT.HUB_POL_KEY,
NVL(LEAD(PIT.PIT_EFF_START_DT) OVER (PARTITION BY PIT.HUB_POL_KEY ORDER BY PIT.PIT_EFF_START_DT) ,TO_DATE('31.12.9999', 'DD.MM.YYYY')) EFF_END_DT
FROM PIT
1st Try:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT');
2nd Try:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS SIZE 254 (HUB_POL_KEY,PIT_EFF_START_DT)'));
Checking Histogram:
SELECT HISTOGRAM FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT') --NONE
Table Statistics:
SELECT COUNT(*) FROM PIT --5570253
SELECT COLUMN_NAME,NUM_DISTINCT,NUM_BUCKETS,HISTOGRAM FROM USER_TAB_COL_STATISTICS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT')
+------------------+--------------+-------------+-----------+
| COLUMN_NAME | NUM_DISTINCT | NUM_BUCKETS | HISTOGRAM |
+------------------+--------------+-------------+-----------+
| HUB_POL_KEY | 4703744 | 1 | NONE |
| PIT_EFF_START_DT | 154416 | 1 | NONE |
+------------------+--------------+-------------+-----------+
What am I missing here? Why is NUM_BUCKETS 1 even when I am running the GATHER_TABLE_STATS procedure with a METHOD_OPT specifying a size?
The correct syntax as per the Oracle documentation should be method_opt=>('FOR COLUMNS (HUB_POL_KEY,PIT_EFF_START_DT) SIZE 254'). Trying it did not create the histogram stats as expected though (maybe a bug ¯\_(ツ)_/¯).
On the other side, using method_opt=>('FOR ALL COLUMNS SIZE 254') or method_opt=>('FOR COLUMNS <column_name> SIZE 254') works fine.
A workaround would then probably be to gather stats for the columns separately:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS HUB_POL_KEY SIZE 254'));
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS PIT_EFF_START_DT SIZE 254'));

Understanding characteristics of a query for which an index makes a dramatic difference

I am trying to come up with an example showing that indexes can have a dramatic (orders of magnitude) effect on query execution time. After hours of trial and error I am still at square one; namely, the speed-up is not large even when the execution plan shows the index being used.
Since I realized that I better have a large table for the index to make a difference, I wrote the following script (using Oracle 11g Express):
CREATE TABLE many_students (
student_id NUMBER(11),
city VARCHAR(20)
);
DECLARE
nStudents NUMBER := 1000000;
nCities NUMBER := 10000;
curCity VARCHAR(20);
BEGIN
FOR i IN 1 .. nStudents LOOP
curCity := ROUND(DBMS_RANDOM.VALUE()*nCities, 0) || ' City';
INSERT INTO many_students
VALUES (i, curCity);
END LOOP;
COMMIT;
END;
I then tried quite a few queries, such as:
select count(*)
from many_students M
where M.city = '5467 City';
and
select count(*)
from many_students M1
join many_students M2 using(city);
and a few other ones.
I have seen this post and think that my queries satisfy the requirements stated in the replies there. However, none of the queries I tried showed dramatic improvement after building an index: create index myindex on many_students(city);
Am I missing some characteristic that distinguishes a query for which an index makes a dramatic difference? What is it?
The test case is a good start but it needs a few more things to get a noticeable performance difference:
Realistic data sizes. One million rows of two small values is a small table. With a table that small the performance difference between a good and a bad execution plan may not matter much.
The script below will double the table size until it gets to 64 million rows. It takes about 20 minutes on my machine. (To make it go quicker for larger sizes, you could make the table NOLOGGING and add an /*+ append */ hint to the insert.)
--Increase the table to 64 million rows. This took 20 minutes on my machine.
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
commit;
--The table has about 1.375GB of data. The actual size will vary.
select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MANY_STUDENTS';
Gather statistics. Always gather statistics after large table changes. The optimizer cannot do its job well unless it has table, column, and index statistics.
begin
dbms_stats.gather_table_stats(user, 'MANY_STUDENTS');
end;
/
Use hints to force a good and bad plan. Optimizer hints should usually be avoided. But to quickly compare different plans they can be helpful to fix a bad plan.
For example, this will force a full table scan:
select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
But you'll also want to verify the execution plan:
explain plan for select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
select * from table(dbms_xplan.display);
Flush the cache. Caching is probably the main culprit behind the index and full table scan queries taking the same amount of time. If the table fits entirely in memory then the time to read all the rows may be almost too small to measure. The number could be dwarfed by the time to parse the query or to send a simple result across the network.
This command will force Oracle to remove almost everything from the buffer cache. This will help you test a "cold" system. (You probably do not want to run this statement on a production system.)
alter system flush buffer_cache;
However, that won't flush the operating system or SAN cache. And maybe the table really would fit in memory on production. If you need to test a fast query it may be necessary to put it in a PL/SQL loop.
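For example, a rough sketch of such a loop (the iteration count is arbitrary, and SERVEROUTPUT must be on to see the result):
set serveroutput on
declare
  v_count number;
  v_start number;
begin
  v_start := dbms_utility.get_time;   -- hundredths of a second
  for i in 1 .. 1000 loop
    select count(*) into v_count
    from many_students
    where city = '5467 City';
  end loop;
  dbms_output.put_line('centiseconds per execution: ' ||
                       (dbms_utility.get_time - v_start) / 1000);
end;
/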
Multiple, alternating runs. There are many things happening in the background, like caching and other processes. It's easy to get bad results because something unrelated changed on the system.
Maybe the first run takes extra long to put things in a cache. Or maybe some huge job was started between queries. To avoid those issues, alternate running the two queries. Run them five times, throw out the highs and lows, and compare the averages.
For example, copy and paste the statements below five times and run them. (If using SQL*Plus, run set timing on first.) I already did that and posted the times I got in a comment before each line.
--Seconds: 0.02, 0.02, 0.03, 0.234, 0.02
alter system flush buffer_cache;
select count(*) from many_students M where M.city = '5467 City';
--Seconds: 4.07, 4.21, 4.35, 3.629, 3.54
alter system flush buffer_cache;
select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
Testing is hard. Putting together decent performance tests is difficult. The above rules are only a start.
This might seem like overkill at first. But it's a complex topic. And I've seen so many people, including myself, waste a lot of time "tuning" something based on a bad test. Better to spend the extra time now and get the right answer.
An index really shines when the database doesn't need to go to every row in a table to get your results. So COUNT(*) isn't the best example. Take this for example:
alter session set statistics_level = 'ALL';
create table mytable as select * from all_objects;
select * from mytable where owner = 'SYS' and object_name = 'DUAL';
---------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 300 |00:00:00.01 | 12 |
| 1 | TABLE ACCESS FULL| MYTABLE | 1 | 19721 | 300 |00:00:00.01 | 12 |
---------------------------------------------------------------------------------------
So, here, the database does a full table scan (TABLE ACCESS FULL), which means it has to visit every row in the table, which means it has to load every block from disk. Lots of I/O. The optimizer guessed that it was going to find 19,721 rows, but only 300 actually matched.
Compare that with this:
create index myindex on mytable( owner, object_name );
select * from mytable where owner = 'SYS' and object_name = 'JOB$';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 3 | 2 |
| 1 | TABLE ACCESS BY INDEX ROWID| MYTABLE | 1 | 2 | 1 |00:00:00.01 | 3 | 2 |
|* 2 | INDEX RANGE SCAN | MYINDEX | 1 | 1 | 1 |00:00:00.01 | 2 | 2 |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='SYS' AND "OBJECT_NAME"='JOB$')
Here, because there's an index, it does an INDEX RANGE SCAN to find the rowids for the table that match our criteria. Then, it goes to the table itself (TABLE ACCESS BY INDEX ROWID) and looks up only the rows we need and can do so efficiently because it has a rowid.
And even better, if you happen to be looking for something that is entirely in the index, the scan doesn't even have to go back to the base table. The index is enough:
select count(*) from mytable where owner = 'SYS';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 46 | 46 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 46 | 46 |
|* 2 | INDEX RANGE SCAN| MYINDEX | 1 | 8666 | 9294 |00:00:00.01 | 46 | 46 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='SYS')
Because my query involved the owner column and that's contained in the index, it never needs to go back to the base table to look anything up there. So the index scan is enough, then it does an aggregation to count the rows. This scenario is a little less than perfect, because the index is on (owner, object_name) and not just owner, but it's definitely better than doing a full table scan on the main table.

pl sql procedure for update

This procedure updates phone number rows with dashes and adds a default area code if the phone number does not have one. I do not want to use a cursor.
CREATE OR REPLACE procedure pro
AS
begin
update judge set phone# = substr(Phone#, 1, 3) || '-' || substr(Phone#, 4,3) || '-' ||
substr(Phone#, 7, 4) where length(trim(phone#))=10;
update judge set phone# = substr(Phone#, 0, 0) || '309-298' ||
substr(Phone#, 1, 5)
where length(trim(phone#))=5;
END;
/
I want to add dashes only if the phone number length is 10, and add the area code if the length is 5.
This code is working, but is there any more efficient way of doing it?
is there any more efficient way of doing it.
Yes, there may be a faster method, but it depends on how big the table is, and what percentage of records in the table will be changed.
If the entire table is small - let's say it has fewer than 100~500 records - then creating indexes will most likely not give you any profit; a simple full table scan will be fast enough. In this case use only ONE update command instead of two separate ones - this way the table will be read and updated only once instead of twice, and the execution time will be shorter by about half:
update judge set phone# =
CASE length(trim(phone#))
WHEN 10
THEN substr(Phone#,1,3) || '-' || substr(Phone#,4,3) || '-' || substr(Phone#,7,4)
WHEN 5
THEN '309-298' || substr(Phone#,1,5)
ELSE phone#
END
where length(trim(phone#)) in (5,10);
If the entire table is big (thousands or millions of records), but the number of records with lengths of 5 and 10 is relatively small (say, less than 10~15% of all records), then create a function-based index:
CREATE INDEX some_name_ix ON judge( length(trim(phone#)) );
and then, after creating the index, refresh the statistics:
exec DBMS_STATS.gather_table_stats( user, 'judge' );
After the above steps, check whether Oracle is willing to use this index by generating explain plans for the three update commands below:
EXPLAIN PLAN FOR
UPDATE judge SET phone#= '123'
WHERE length( trim( phone# ) ) in ( 5, 10 );
SELECT * FROM table( dbms_xplan.display );
EXPLAIN PLAN FOR
UPDATE judge SET phone#= '123'
WHERE length( trim( phone# ) ) = 5;
SELECT * FROM table( dbms_xplan.display );
EXPLAIN PLAN FOR
UPDATE judge SET phone#= '123'
WHERE length( trim( phone# ) ) = 10;
SELECT * FROM table( dbms_xplan.display );
NOTE: the SET phone#= '123' clause doesn't matter here; Oracle will not update the table. What's important to us - and what we are checking using the EXPLAIN PLAN command - is how Oracle will execute the query for the different WHERE clauses.
For each of the above commands you may see something like below - the keyword TABLE ACCESS FULL means that Oracle is going to use the full table scan method for this update and ignores our index:
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 53421 | 6468K| 36512 (1)| 00:00:02 |
|* 1 | TABLE ACCESS FULL| JUDGE | 53421 | 6468K| 36512 (1)| 00:00:02 |
--------------------------------------------------------------------------------
You may also see something like below - TABLE ACCESS BY INDEX ROWID ... plus INDEX RANGE SCAN ... index name - which means that Oracle is willing to use the index for this update:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | 124 | 5 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| JUDGE | 1 | 124 | 5 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | MY_INDEX_IX | 1 | | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------
There is a Cost (%CPU) column in these plans, which shows the relative cost of each update (low cost = fast, high cost = slow).
Finally, depending on the results you get in these explain plans, you may decide to:
use only one single update in the procedure
or keep two separate updates
If Oracle uses the index in none of these 3 cases, then the index is useless and you can drop it using:
DROP INDEX some_name_ix;
For example, you can try combining the updates:
update judge
set phone# = decode( length( trim( phone# ) ),
5, '309-298' || substr( Phone#, 1, 5 ),
10, substr( Phone#, 1, 3 ) || '-' || substr( Phone#, 4, 3 ) || '-' || substr( Phone#, 7, 4 ),
phone# )
where length( trim( phone# ) ) in ( 5, 10 );
Also, using a regexp may be more flexible than searching by length (a sketch follows), and second, of course, it is better to insert data that is already prepared and formatted.
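For instance, a sketch of the 10-digit case rewritten with regular expressions (the 5-digit case would follow the same WHERE/REGEXP_LIKE pattern):
update judge
set phone# = regexp_replace( trim( phone# ), '^(\d{3})(\d{3})(\d{4})$', '\1-\2-\3' )
where regexp_like( trim( phone# ), '^\d{10}$' );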

Use Oracle unnested VARRAY's instead of IN operator

Let's say users have 1 - n accounts in a system. When they query the database, they may choose to select from m accounts, with m between 1 and n. Typically the SQL generated to fetch their data is something like
SELECT ... FROM ... WHERE account_id IN (?, ?, ..., ?)
So depending on the number of accounts a user has, this will cause a new hard-parse in Oracle, and a new execution plan, etc. Now there are a lot of queries like that and hence a lot of hard-parses, and maybe the cursor/plan cache will fill up quite early, resulting in even more hard-parses.
Instead, I could also write something like this
-- use any of these
CREATE TYPE numbers AS VARRAY(1000) of NUMBER(38);
CREATE TYPE numbers AS TABLE OF NUMBER(38);
SELECT ... FROM ... WHERE account_id IN (
SELECT column_value FROM TABLE(?)
)
-- or
SELECT ... FROM ... JOIN (
SELECT column_value FROM TABLE(?)
) ON column_value = account_id
And use JDBC to bind a java.sql.Array (i.e. an oracle.sql.ARRAY) to the single bind variable. Clearly, this will result in fewer hard-parses and fewer cursors in the cache for functionally equivalent queries. But is there any general performance drawback, or any other issue that I might run into?
E.g: Does bind variable peeking work in a similar fashion for varrays or nested tables? Because the amount of data associated with every account may differ greatly.
I'm using Oracle 11g in this case, but I think the question is interesting for any Oracle version.
I suggest you try a plain old join like in
SELECT Col1, Col2
FROM ACCOUNTS ACCT,
TABLE TAB
WHERE ACCT.User = :ParamUser
AND TAB.account_id = ACCT.account_id;
An alternative could be a table subquery
SELECT Col1, Col2
FROM (
SELECT account_id
FROM ACCOUNTS
WHERE User = :ParamUser
) ACCT,
TABLE TAB
WHERE TAB.account_id = ACCT.account_id;
or a where subquery
SELECT Col1, Col2
FROM TABLE TAB
WHERE TAB.account_id IN
(
SELECT account_id
FROM ACCOUNTS
WHERE User = :ParamUser
);
The first one should be better for performance, but you had better check them all with EXPLAIN PLAN.
Looking at V$SQL_BIND_CAPTURE in a 10g database, I have a few rows where the datatype is VARRAY or NESTED_TABLE; the actual bind values were not captured. In an 11g database, there is just one such row, but it also shows that the bind value is not captured. So I suspect that bind value peeking essentially does not happen for user-defined types.
In my experience, the main problem you run into using nested tables or varrays in this way is that the optimizer does not have a good estimate of the cardinality, which could lead it to generate bad plans. But, there is an (undocumented?) CARDINALITY hint that might be helpful. The problem with that is, if you calculate the actual cardinality of the nested table and include that in the query, you're back to having multiple distinct query texts. Perhaps if you expect that most or all users will have at most 10 accounts, using the hint to indicate that as the cardinality would be helpful. Of course, I'd try it without the hint first, you may not have an issue here at all.
(I also think that perhaps Miguel's answer is the right way to go.)
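Returning to the CARDINALITY hint mentioned above, a sketch of it applied to the collection join (table and column names are placeholders; the numbers type is the one declared in the question, and 10 is an assumed account count):
SELECT /*+ cardinality(t 10) */ m.col1, m.col2
FROM my_table m,
     TABLE(CAST(:ids AS numbers)) t
WHERE t.column_value = m.account_id;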
For a medium-sized list (several thousand items) I would use this approach:
First, generate a prepared statement with an XMLTABLE joined to your main table.
For instance:
String myQuery = "SELECT ..."
    + " FROM ACCOUNTS A,"
    + " XMLTABLE('tab/row' passing XMLTYPE(?) COLUMNS id NUMBER path 'id') t"
    + " WHERE A.account_id = t.id";
then loop through your data and build a StringBuffer with this content:
StringBuffer idList = new StringBuffer("<tab><row><id>101</id></row><row><id>907</id></row> ... </tab>");
eventually, prepare and submit your statement, then fetch the results.
PreparedStatement stmt = connection.prepareStatement(myQuery);  // connection: an open java.sql.Connection
stmt.setString(1, idList.toString());
ResultSet rs = stmt.executeQuery();
while (rs.next()) { ... }
Using this approach it is also possible to pass multi-valued lists, as in the select statement
SELECT * FROM TABLE t WHERE (t.COL1, t.COL2) in (SELECT X.COL1, X.COL2 FROM X);
In my experience performances are pretty good, and the approach is flexible enough to be used in very complex query scenarios.
The only limit is the size of the string passed to the DB, but I suppose it is possible to use a CLOB in place of String for an arbitrarily long XML wrapper around the input list.
This problem of binding a variable number of items into an IN list seems to come up a lot in various forms. One option is to concatenate the IDs into a comma-separated string and bind that, then use a bit of a trick to split it into a table you can join against, e.g.:
with bound_inlist
as
(
select
substr(txt,
instr (txt, ',', 1, level ) + 1,
instr (txt, ',', 1, level+1) - instr (txt, ',', 1, level) -1 )
as token
from (select ','||:txt||',' txt from dual)
connect by level <= length(:txt)-length(replace(:txt,',',''))+1
)
select *
from bound_inlist a, actual_table b
where a.token = b.token
Bind variable peeking is going to be a problem though.
Does the query plan actually change for a larger number of accounts, i.e. would it be more efficient to move from an index to a full table scan in some cases, or is it borderline? As someone else suggested, you could use the CARDINALITY hint to indicate how many IDs are being bound; the following test case proves this actually works:
create table actual_table (id integer, padding varchar2(100));
create unique index actual_table_idx on actual_table(id);
insert into actual_table
select level, 'this is just some padding for '||level
from dual connect by level <= 1000;
explain plan for
with bound_inlist
as
(
select /*+ CARDINALITY(10) */
substr(txt,
instr (txt, ',', 1, level ) + 1,
instr (txt, ',', 1, level+1) - instr (txt, ',', 1, level) -1 )
as token
from (select ','||:txt||',' txt from dual)
connect by level <= length(:txt)-length(replace(:txt,',',''))+1
)
select *
from bound_inlist a, actual_table b
where a.token = b.id;
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 840 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | | | | |
| 2 | NESTED LOOPS | | 10 | 840 | 2 (0)| 00:00:01 |
| 3 | VIEW | | 10 | 190 | 2 (0)| 00:00:01 |
|* 4 | CONNECT BY WITHOUT FILTERING| | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 6 | INDEX UNIQUE SCAN | ACTUAL_TABLE_IDX | 1 | | 0 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID | ACTUAL_TABLE | 1 | 65 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Another option is to always use n bind variables in every query. Use null for m+1 to n.
Oracle ignores repeated items in the expression_list. Your queries will perform the same way and there will be fewer hard parses. But there will be extra overhead to bind all the variables and transfer the data. Unfortunately I have no idea what the overall effect on performance would be; you'd have to test it. A sketch follows.
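A sketch of that fixed-shape statement with, say, eight slots (table and column names are placeholders); the slots beyond a user's m real accounts are bound to NULL, which can never equal account_id:
SELECT t.col1, t.col2
FROM some_table t
WHERE t.account_id IN (:1, :2, :3, :4, :5, :6, :7, :8);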
