Performance Improvement while joining one huge table with one small table - performance

I am trying to join two tables: table A has 400 million records and table B has around 50 records. The join takes a long time. The reason is that table A has a PPI on a non-primary-index column.
How can I improve performance in this case?

Related

What to do when the SQL index does not solve the problem of performance?

This query ONE
SELECT * FROM TEST_RANDOM WHERE EMPNO >= '236400' AND EMPNO <= '456000';
in the Oracle Database is running with cost 1927.
And this query TWO :
SELECT * FROM TEST_RANDOM WHERE EMPNO = '236400';
is running with cost 1924.
The table TEST_RANDOM has 1,000,000 rows. I created it like this:
Create table test_normal (empno varchar2(10), ename varchar2(30), sal number(10), faixa varchar2(10));
Begin
For i in 1..1000000
Loop
Insert into test_normal values(
to_char(i), dbms_random.string('U',30),
dbms_random.value(1000,7000), 'ND'
);
If mod(i, 10000) = 0 then
Commit;
End if;
End loop;
End;
Create table test_random
as
select /*+ append */ * from test_normal order by dbms_random.random;
I created a B-Tree index on the EMPNO column like this:
CREATE INDEX IDX_RANDOM_1 ON TEST_RANDOM (EMPNO);
After this, query TWO improved, and its cost dropped to 4.
But query ONE did not improve, because Oracle ignored the index. For some reason the optimizer decided that an execution plan using the index was not worthwhile.
My question is: what can we do to improve the performance of query ONE? The index did not help, and the cost of the query is still high.
For this query, Oracle does not use an index because the optimizer correctly estimated the number of rows and correctly decided that a full table scan would be faster or more efficient.
B-Tree indexes are generally only useful when they can be used to return a small percentage of rows, and your first query returns about 25% of the rows. It's hard to say what the ideal percentage of rows is, but 25% is almost always too large. On my system, the execution plan changes from full table scan to index range scan when the query returns 1723 rows - but that number will likely be different for you.
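One way to see this decision for yourself is to compare the optimizer's default plan with a plan that is forced to use the index. This is only a sketch: the table, index, and predicate come from the question, while the alias t and the INDEX hint are added for illustration, and the costs you see will differ from system to system.

```sql
-- Default plan: the optimizer is expected to choose a full table scan here.
EXPLAIN PLAN FOR
SELECT *
FROM test_random
WHERE empno >= '236400' AND empno <= '456000';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Same query, hinted to use the index (the alias t exists only for the hint).
EXPLAIN PLAN FOR
SELECT /*+ INDEX(t idx_random_1) */ *
FROM test_random t
WHERE empno >= '236400' AND empno <= '456000';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the hinted plan reports a higher cost than the full table scan, the optimizer's choice is at least consistent with its own estimates.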
There are several reasons why full table scans are better than indexes for retrieving a large percentage of rows:
Single-block versus multi-block: In Oracle, like in almost all computer systems, it can be significantly faster to retrieve multiple chunks of data at a time (sequential access) instead of retrieving one random chunk of data at a time (random access).
Clustering factor: Oracle stores all rows in blocks, which are usually 8KB large and are analogous to pages. If the index is very inefficient, like if the index is built on randomly sorted data and two sequential reads rarely read from the same block, then reading 25% of all the rows from an index may still require reading 100% of the table blocks.
Algorithmic complexity: A full table scan reads the data as a simple heap, which is O(N). A single index access is much faster, at O(LOG(N)). But as the number of index accesses increases, the benefit wears off, until eventually using the index is O(N * LOG(N)).
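The clustering factor mentioned above can be read from the data dictionary. The following sketch assumes the table and index from the question live in your own schema and that optimizer statistics are reasonably current; the column aliases are made up for readability.

```sql
-- Compare the index clustering factor with the table's block and row counts.
SELECT i.index_name,
       i.clustering_factor,
       t.blocks   AS table_blocks,
       t.num_rows AS table_rows
FROM user_indexes i
JOIN user_tables t
  ON t.table_name = i.table_name
WHERE i.index_name = 'IDX_RANDOM_1';
```

A clustering factor close to the number of table blocks suggests the table rows are stored in roughly the same order as the index. A value close to the number of rows, which is what a randomly ordered table tends to produce, suggests that almost every index entry points to a different block than the previous one.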
Some things you can do to improve performance without indexes:
Partitioning: Partitioning is the ideal solution for retrieving a large percentage of data from a table (but the option must be licensed). With partitioning, Oracle splits the logical table into multiple physical tables, and the query can read from only the required partitions. This keeps the benefit of multi-block reads while still limiting the amount of data scanned.
Parallelism: Make Oracle work harder instead of smarter. But parallelism probably isn't worth the trouble for such a small table.
Materialized views: Create tables that only store exactly what you need.
Ordering the data: Improve the index clustering factor by sorting the table data by the relevant column instead of doing it randomly. In your case, replace order by dbms_random.random with order by empno. Depending on your version and platform, you may be able to use a materialized zone map to keep the table sorted.
Compression: Shrink the table to make it faster to read the whole thing.
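As a sketch of the "ordering the data" option, the table can be rebuilt in EMPNO order and indexed again. The names test_sorted and idx_sorted_1 are hypothetical, chosen so that the original TEST_RANDOM table stays available for comparison.

```sql
-- Rebuild the data sorted by the indexed column instead of randomly.
-- EMPNO is a VARCHAR2, so this is a string sort, which matches the index order.
CREATE TABLE test_sorted
AS
SELECT * FROM test_normal ORDER BY empno;

CREATE INDEX idx_sorted_1 ON test_sorted (empno);

-- Check whether the range query now gets a different plan.
EXPLAIN PLAN FOR
SELECT *
FROM test_sorted
WHERE empno >= '236400' AND empno <= '456000';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Whether the optimizer switches to an index range scan for a range this wide still depends on its estimates. The sorted copy mainly lowers the clustering factor, which makes index access cheaper for smaller ranges.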
That's quite a lot of information for what is possibly a minor performance problem. Before you go down this rabbit hole, it might be worth asking if you actually have an important performance problem as measured by a clock or by resource consumption, or are you just fishing for performance problems by looking at the somewhat meaningless cost metric?

In Oracle, when inserting a large number of rows, Is dropping and recreating indexes faster?

I have an indexed table to which I have to add about 150000 records daily. This takes a lot of time using insert statements. Is it faster to
first drop the index
then insert the records
then create the index again?

How to insert 1 Billion rows in google-bigquery?

I need to insert 1 billion rows into a Google BigQuery table, but I don't have the data readily available.
I will have to make several million asynchronous HTTP requests (1 asynchronous request = 1000 rows of ordered data). Each row has a column called ID, and I need the billion rows in the BigQuery table ordered by ID once all requests are completed, because it is time-series data.
The challenge is that asynchronous calls don't complete in any particular order. (I will also parallelize across multiple CPUs.) If I insert rows as they arrive and wait until all billion are inserted, I am afraid that sorting a billion rows at the end might take a lot of time.
One naïve way is to create the ID column with a billion integers beforehand, create empty columns for the other fields, and insert data by searching for the ID, which I think is also inefficient.
Is there an efficient way of achieving this?

Fastest way to take a random sample of 100000 rows from each partition of a hive table

I have a table partitioned daywise with each partition containing almost 80M rows.
I want to take a random sample of 100000 rows from each partition for a particular month.
Currently I'm doing it using rank within each partition, ordering by rand() and then filtering on the rank but it takes almost 45-60 mins.
Is there a faster way to do the same thing without compromising on the quality of the sample?
EDIT
My table is not bucketed

Performance tuning - Query on partitioned vs non-partitioned table

I have two queries that are identical except that one uses a partitioned table and the other uses the non-partitioned equivalent of that table. The original query (on the non-partitioned table) performs better than its partitioned counterpart, and I am not sure how to isolate the problem. Looking at the execution plans, I find that both queries use the same indexes and that the new query shows a PARTITION RANGE operation in its plan, meaning that partition pruning is taking place. The query has the following form:
Select rownum, <some columns>
from partTabA
inner join tabB on condition1
inner join tabC on condition2
where partTabA.column1=<value> and <other conditions>
and partTabA.column2 in (select columns from tabD where conditions)
where partTabA is the partitioned table and partTabA.column1 is the partitioning key (range partition). In the original query, this is replaced by the non-partitioned equivalent of the same table. What parameters should I look at to find out why the new query performs badly? The tool I have is Oracle SQL Developer.
PARTITION RANGE ITERATOR does not necessarily mean that partition pruning is happening.
You'll also want to look at the Pstart and Pstop in the explain plan, to see which partitions are being used.
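A minimal way to see those columns is to explain a simplified version of the query and display the plan. In this sketch the bind variable :val stands in for the `<value>` placeholder from the question, and the joins are left out to keep the plan short. Both are simplifications, not the original query.

```sql
-- :val is a stand-in for the <value> placeholder in the original query.
EXPLAIN PLAN FOR
SELECT *
FROM partTabA
WHERE column1 = :val;

-- The Pstart and Pstop columns appear in the plan for partitioned tables.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Numeric values in Pstart and Pstop show the partitions chosen when the plan was built, and KEY means the partitions are determined at run time. If the range runs from 1 to the highest partition number, the query is reading every partition.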
There are several potential reasons the partitioned query will be slower, even though it's reading the same data. (Assuming that the partitioned query isn't properly pruning, and is reading from the whole table.)
Reading from multiple local indexes may be much less efficient than reading from a single, larger index.
There may be a lot of wasted space from large initial segment sizes, a large number of partitions, etc. Compare the segment sizes with this: select * from dba_segments where segment_name in ('PARTTABA', 'TABA'); If that's the issue, you may want to look into your tablespace settings, or using deferred segment creation.
I believe you're dealing with partitioning overhead: with a partitioned table, Oracle first has to determine which partitions to scan.
Could you post both execution plans here? How large are the tables? How selective are the indexes used here?
Did you try gathering statistics?
You could also look at a trace file to see what's going on.
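The last two suggestions could look roughly like this. The sketch assumes that the partitioned table is in the current schema under the uppercase name PARTTABA and that you are allowed to trace your own session. Adjust both assumptions to your environment.

```sql
-- Refresh optimizer statistics for the table and, via cascade, its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'PARTTABA',
    cascade => TRUE
  );
END;
/

-- Turn on SQL trace (including wait events) for the current session.
BEGIN
  DBMS_SESSION.SESSION_TRACE_ENABLE(waits => TRUE, binds => FALSE);
END;
/

-- ... run the slow query here ...

-- Turn tracing off again.
BEGIN
  DBMS_SESSION.SESSION_TRACE_DISABLE;
END;
/
```

The resulting trace file is written on the database server and can be formatted with the tkprof utility to compare where the two queries spend their time.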
