Oracle Partitioned table is taking long time to fetch

Oracle Partitioned table is taking long time to fetch - oracle

I have a partioned table based on date in oracle db, where each partition has crores of records. The front end application is build to search the data based on a date range (meanining it scans through multiple partitions). What is the best logic to get the data in quickest time?

You should create local indexes which work on partitions.
Normally we go for global indexes which work on whole table while local index is specific to partition which will make partition search faster.
Check this link to see how local indexes work: http://docs.oracle.com/cd/E11882_01/server.112/e25523/partition.htm#i461446
If local indexes don't work then query tuning might help. If that doesn't help then you shld look to redesign schema.
EDIT:
Having said all that, just one basic check to ensure that your query is not scanning all partitions. This can be achieved by including partition criteria [date in your case] as part of where clause.

Interval partitioning may help. It makes partition management much
easier, which then makes it reasonable to have thousands of partitions instead of just dozens or hundreds.
For example, if the current table is partitioned by month, a query for a week will need to read a lot of extra data. But if the table is partitioned by day
then almost no extra data will be scanned.
create table partition_test(a number primary key, b date)
partition by range (b) interval (interval '1' day)
(
partition p1 values less than (date '2000-01-01')
);
But even if this reduces the data per partition from crores to lakhs, that's still a lot of data for an application. Local indexes, as #loki suggested, may help.

Related

What to do when the SQL index does not solve the problem of performance?

This query ONE
SELECT * FROM TEST_RANDOM WHERE EMPNO >= '236400' AND EMPNO <= '456000';
in the Oracle Database is running with cost 1927.
And this query TWO :
SELECT * FROM TEST_RANDOM WHERE EMPNO = '236400';
is running with cost 1924.
This table TEST_RANDOM has 1.000.000 rows, I created this table so:
Create table test_normal (empno varchar2(10), ename varchar2(30), sal number(10), faixa varchar2(10));
Begin
For i in 1..1000000
Loop
Insert into test_normal values(
to_char(i), dbms_random.string('U',30),
dbms_random.value(1000,7000), 'ND'
);
If mod(i, **10000)** = 0 then
Commit;
End if;
End loop;
End;
Create table test_random
as
select /*+ append */ * from test_normal order by dbms_random.random;
I created a B-Tree index in the field EMPNO so:
CREATE INDEX IDX_RANDOM_1 ON TEST_RANDOM (EMPNO);
After this, the query TWO improved, and the cost changed to 4.
But the query ONE did not improve, because Oracle Database ignored it, for some reason Oracle Database understood that this query is not worth it to use the plan execution with the index...
My question is: What could we do to improve this query ONE performance? Because the solution of the index did not solve and its cost continues to be expensive...

For this query, Oracle does not use an index because the optimizer correctly estimated the number of rows and correctly decided that a full table scan would be faster or more efficient.
B-Tree indexes are generally only useful when they can be used to return a small percentage of rows, and your first query returns about 25% of the rows. It's hard to say what the ideal percentage of rows is, but 25% is almost always too large. On my system, the execution plan changes from full table scan to index range scan when the query returns 1723 rows - but that number will likely be different for you.
There are several reasons why full table scans are better than indexes for retrieving a large percentage of rows:
Single-block versus multi-block: In Oracle, like in almost all computer systems, it can be significantly faster to retrieve multiple chunks of data at a time (sequential access) instead of retrieving one random chunk of data at a time (random access).
Clustering factor: Oracle stores all rows in blocks, which are usually 8KB large and are analogous to pages. If the index is very inefficient, like if the index is built on randomly sorted data and two sequential reads rarely read from the same block, then reading 25% of all the rows from an index may still require reading 100% of the table blocks.
Algorithmic complexity: A full table scan reads the data as a simple heap, which is O(N). A single index access is much faster, at O(LOG(N)). But as the number of index accesses increases, the benefit wears off, until eventually using the index is O(N * LOG(N)).
Some things you can do to improve performance without indexes:
Partitioning: Partitioning is the idea solution for retrieving a large percentage of data from a table (but the option must be licensed). With partitioning, Oracle splits the logical table into multiple physical tables, and the query can only read from the required partitions. This can create the benefit of multi-block reads, but still limits the amount of data scanned.
Parallelism: Make Oracle work harder instead of smarter. But parallelism probably isn't worth the trouble for such a small table.
Materialized views: Create tables that only store exactly what you need.
Ordering the data: Improve the index clustering factor by sorting the table data by the relevant column instead of doing it randomly. In your case, replace order by dbms_random.random with order by empno. Depending on your version and platform, you may be able to use a materialized zone map to keep the table sorted.
Compression: Shrink the table to make it faster to read the whole thing.
That's quite a lot of information for what is possibly a minor performance problem. Before you go down this rabbit hole, it might be worth asking if you actually have an important performance problem as measured by a clock or by resource consumption, or are you just fishing for performance problems by looking at the somewhat meaningless cost metric?

vertica how restrict database size

Could you please help me with the following issue?
I have installed vertica cluster. I can't understand how I can restrict database size in time or in size. For example, data in a database must be deleted older than 30 days or when a database size of 100 GB is reached (what comes first)

There is no automated way of doing this, and no logical way of "restricting database size". You can't just trim "data" from a "database".
What are you talking about (in terms of limiting data outside of 30 days old) needs to be done on the table level. You would need some kind of date field and delete anything older than 30 days. However, I would advice against deleting rows in this way. It is non-performant and can cause queries against the table to be slow: see DELETE and UPDATE Performance Considerations. The best way of doing this would be to partition the table by day, and create an automated script (bash, python, etc) to each day drop the partition that corresponds with the date 30 days ago: see Dropping Partitions.
As for deleting data—if the size of the "database" goes above 100GB—this requirement is extremely vague and would be impossible to enforce. Let's say you have 50 tables, and the size of several of those tables grows so that the total size of the database is over 100GB, how would you decide which table to prune? This also must be done on a table by table level (or in this case—technically—on a projection level, since that is where the data is actually stored).
To see the compressed size (size on disk) of the database you can use this query:
SELECT SUM(used_bytes) / ( 1024^3 ) AS database_size_gb
FROM projection_storage;
However, since data can only be deleted with a DELETE or DROP PARTITION statement on a table, it would also be helpful to see the size of each table. You can do this by using this query:
SELECT projection_schema, anchor_table_name, SUM(used_bytes) / ( 1024^3 ) AS table_size_gb
FROM projection_storage
GROUP BY 1, 2
ORDER BY 3 DESC;
From the results you can decide which tables you want to prune.
A couple of notes (as a Vertica DBA):
Data is stored in projections. Having too many projections on a single table can not only cause queries to be slow but will also increase the overall data footprint. Avoid using too many projections (especially too many superprojections, don't have more than two per table, and most tables will only need one). Use the database designer or follow the guidelines in the documentation for creating custom projections: Design Fundamentals.
Also, another trick to keep database size down is to use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function. Unless your projections are created with the database designer, they will likely only contain the auto encoding. Using the DESIGNER_DESIGN_PROJECTION_ENCODINGS function will help you to pick the most optimal encoding for each column. I have seen properly encoded projections take up a mere 2% disk size compared to the previously un-optimized projection. That is rare, but in my experience you will still see at least a 20-40% reduction in size. Do not be afraid to use this function liberally. It is one of my favorite tools as a Vertica DBA.
Also

How to append the data to existing hive table without partition

I have created hive table which contains historical stock data of past 10 years. From now i have to append the data on daily bases.
I thought of creating the partition based on date but it leads many partitions approximately 3000 plus a new partition for every new date, i think this is not feasible.
Can any one suggest a best approach to store all the historical data in the table and append the new data as it comes.

As for every partitioned table, the decision on how to partition your table depends primarily on how you are going to query the table.
Another consideration is how much data you're going to have per partition, as partitions should not bee too small. Each one should be at least at as an absolute minimum as big as one HDFS block since it would otherwise take too many directories.
This said, I don't think 3000 partitions would be a problem. At a previous job we had a huge table with one partition per hour, each hour was about 20Gbytes, and we had 6 months of data, so about 4000 partitions, and it worked just fine.
In our case, most people care the most about the last week and the last day.
I suggest as first thing you research how the table is going to be used, that is, are all the 10 years be used, or just mostly the most recent data ?
As second thing, study how big is the data, consider if it may grow in size with the new loads, and see how big each partition is going to be.
Once you've determined these 2 points, you can make a decision, you could just use daily partitions (which could be fine, 3000 partitions is not bad), or you could do weekly, or monthly.

You can use this command
LOAD DATA LOCAL INPATH '<FILE_PATH>' INTO TABLE <TABLE_NAME>;
It will create new files under HDFS directory mapped to table name. Even though there are not too many partitions with it, you will still run into too many files issue.
Periodically, you need to do this:
Create stage table
Move data by running LOAD command from target table to stage table
You can run insert command into target table selecting from stage table
Now it will load data with number of files equal to number of reducers.
You can delete stage table
You can run this process at regular intervals (probably once in a month).

Performance tuning - Query on partitioned vs non-partitioned table

I have two queries, one of which involves a partitioned table in the query while the other query is the same except that it involves the non-partitioned equivalent table. The original (non-partitioned table) query is performing better than the partitioned counter-part. I am not sure how to isolate the problem for this. Looking at the execution plan, I find that the indexes used are the same b/w the two queries and that the new query shows the PARTITION RANGE clause in its execution plan meaning that partition pruning is taking place. The query is of the following form:-
Select rownum, <some columns>
from partTabA
inner join tabB on condition1
inner join tabC on condition2
where partTabA.column1=<value> and <other conditions>
and partTabA.column2 in (select columns from tabD where conditions)
where partTabA is the partitioned table and partTabA.column1 is the partitioning key(range partition). In the original query, this gets replaced by the non-partitioned equivalent of the same table. What parameters should I look at to find out why the new query performs badly. Tool that I have is Oracle SQL Developer.

PARTITION RANGE ITERATOR does not necessarily mean that partition pruning is happening.
You'll also want to look at the Pstart and Pstop in the explain plan, to see which partitions are being used.
There are several potential reasons the partitioned query will be slower, even though it's reading the same data. (Assuming that the partitioned query isn't properly pruning, and is reading from the whole table.)
Reading from multiple local indexes may be much less efficient than reading from a single, larger index.
There may be a lot of wasted space from large initial segment sizes, a large number of partitions, etc. Compare the segment sizes with this: select * from dba_segments where segment_name in ('PARTTABA', 'TABA'); If that's the issue, you may want to look into your tablespace settings, or using deferred segment creation.

I believe that you're dealing with partitioning overhead, if you have partitioned table then oracle has to find which partition to scan first.
Could you paste here both execution plans? How large are the tables? How selective are indexes used here?
Did you try to gather statistics?
You may also try to look into trace file to see what's going on.

TSQL Merge Performance

Scenario:
I have a table with roughly 24 million records. The table has pricing history related to individual customers and is computed daily. There are on average 6 million records for each day. Every morning a the price list is generated and a merge statement is ran to reflect the changes in their pricing.
The merge statement begins with the previous day's previous data being inserted into a variable table, that table is then merged into the actual table. The main problem is that the merge statement takes pretty long.
My real question centers around the performance of using a variable table vs physical table vs temp table. What is the best practice for large merges like this?

Thoughts
I'd consider a temp table: these have statistics which will help. A table variable is always assumed to have one row. Also, the IO can be shunted onto separate drives (assuming you have tempdb separately)
If a single transaction is not required, I'd split the MERGE too into a DELETE, UPDATE, INSERT sequence to reduce the amount of work needed in each action (which reduces the amount of rollback info needed and the amount of locking etc

Temp tables often perform better than table variables for large data sets. Additionally you can put the data into the temp table and then index it.

Check if you indexes on the tables. Indexes would be updated every time you add/delete records on that table.
Try removing the indexes before merging the records and then re-create it again after the merge.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio