How does Oracle manage a hash partition?

I understand the concept of range partitioning. If I have a date column and I partition on that column by month, then a query whose WHERE clause filters on a single month can hit one particular partition and get its data without scanning the full table.
In the Oracle docs I read that if a logical partitioning key like 'month' is not available (e.g. you partition on a column called customer ID), then you should use hash partitioning. So how does this work? Does Oracle randomly divide the data, assign it to different partitions, and assign a hash code to each partition?
But in that case, when new data comes in, how does Oracle know in which partition to put the new data? And when I query data, it seems there is no way to avoid hitting multiple partitions?

"how does oracle know in which partition to put the new data?"
From the documentation:
Oracle Database uses a linear hashing algorithm and to prevent data
from clustering within specific partitions, you should define the
number of partitions by a power of two (for example, 2, 4, 8).
As for your other question ...
"when i query data, it seems there is no way to avoid hitting multiple
partitions?"
If you're searching for a single customer ID, then no: Oracle's hashing algorithm is consistent, so records with the same partition key end up in the same partition (obviously), and the query can be pruned to just that partition. But if you are searching for, say, all the new customers from the last month, then yes: Oracle's hashing algorithm strives to distribute records evenly, so the latest records will be spread across the whole table.
So the real question is: why do we choose to partition a table? Performance is often the least compelling reason to partition. Better reasons include:
availability: each partition can reside in a different tablespace, so a problem with one tablespace takes out a slice of the table's data instead of the whole thing.
management: partitioning provides a mechanism for splitting whole-table jobs into clean batches, and partition exchange can make it easier to bulk-load data.
As for performance, physical co-location of records can speed up some queries, namely those which search for records by a defined range of keys. However, any query which doesn't match the grain of the partitioning won't perform faster (and may even perform slower) than against a non-partitioned table.
Hash partitioning is unlikely to provide performance benefits, precisely because it shuffles the keys across the whole table. It will provide the availability and manageability benefits of partitioning (but is obviously not particularly amenable to partition exchange).

A hash is not random; it divides the data in a repeatable (though perhaps difficult-to-predict) fashion, so the same ID will always map to the same partition.
Oracle uses a hash algorithm that should usually spread the data evenly between partitions.
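For illustration, here is a minimal sketch of a hash-partitioned table; the table and column names are assumptions, not from the question. Note the power-of-two partition count recommended by the documentation quoted above:

CREATE TABLE customers
( customer_id NUMBER
, name        VARCHAR2(100)
)
PARTITION BY HASH (customer_id)
PARTITIONS 8;  -- a power of two, per the documentation

-- an equality predicate on the partition key can be pruned to one partition ...
SELECT * FROM customers WHERE customer_id = 42;

-- ... while a predicate on any other column must probe all eight partitions
SELECT * FROM customers WHERE name = 'Smith';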


Oracle partitioning recommendations

Due to being locked down by Corona, I don't have easy access to my more knowledgeable colleagues, so I'm hoping for a few possible recommendations here.
We do quarterly and yearly "freezes" of a number of statistical entities with a large number (1-200) of columns. Everyone then uses these "frozen" versions as a common basis for all statistical releases in Denmark. Currently, we simply create a new table for each version.
There's a demand to test if we can consolidate these several hundred tables to 26 entity-based tables to make programming against them easier, while not harming performance too much.
A "freeze" is approximately 1 million rows and consists of: Year + Period + Type + Version.
For example:
2018_21_P_V1 = Preliminary Data for 2018 first quarter version 1
2019_41_F_V2 = Final Data for 2019 yearly version 2
I am simply not very experienced in the world of partitions. My initial thought was to partition on Year + Period and subpartition on Type + Version, but I am no longer sure this is the right approach, nor do I have a clear picture of which partitioning type would solve the problem best.
I am hoping someone can recommend an approach as it would help me tremendously and save me a lot of time "brute force" testing a lot of different combinations.
Based on the situation you describe, I highly recommend partitioning. No doubt about it.
It's highly effective and easy to use. You can read the Oracle documentation about partitioning, or search the web, to understand how to get started.
In general, when you partition a table, Oracle treats each partition as a separate table, so don't worry about the speed of fetching data.
The most important step is to choose the best field(s) to base your partitions on. I used dates in numeric form (e.g. 20190506) for daily partitions, or (201907) for monthly ones. You should design and test it.
The next step is to decide about subpartitions. In some cases you don't really need them; it depends on your data structure and what you expect from the data. What do you want to do with the data? Which fields are more important (used in WHERE clauses, ...)?
Then create indexes for each partition (local indexes). Very important.
Another important point is that partitioning may change the way you code in PL/SQL. For example, you cannot name two or more partitions of the same table in a single query with the PARTITION clause; to read several specific partitions explicitly, you select and fetch from them one by one (or filter on the partition key and let the optimizer prune).
And don't worry about 1 million records. I have used partitioning for tables far larger than this, and it works fine.
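As a concrete starting point for the Year + Period + Type + Version layout described above, here is a minimal sketch of a composite range-list partitioned table with a local index. All table and column names are assumptions for illustration, and the partition boundaries would need to match your actual period encoding:

CREATE TABLE frozen_entity
( stat_year   NUMBER(4)
, stat_period NUMBER(2)
, freeze_type VARCHAR2(1)     -- 'P' = preliminary, 'F' = final
, version_no  NUMBER(2)
, payload     VARCHAR2(4000)  -- stand-in for the 1-200 real columns
)
PARTITION BY RANGE (stat_year, stat_period)
SUBPARTITION BY LIST (freeze_type)
SUBPARTITION TEMPLATE
( SUBPARTITION sp_prelim VALUES ('P')
, SUBPARTITION sp_final  VALUES ('F')
)
( PARTITION p2018 VALUES LESS THAN (2019, 0)
, PARTITION p2019 VALUES LESS THAN (2020, 0)
, PARTITION p_max VALUES LESS THAN (MAXVALUE, MAXVALUE)
);

-- a local index is equipartitioned with the table
CREATE INDEX frozen_entity_ix
ON frozen_entity (stat_year, stat_period, version_no) LOCAL;

-- filtering on the partition key lets the optimizer prune to one partition
SELECT * FROM frozen_entity WHERE stat_year = 2018 AND stat_period = 21;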
Good luck.

Oracle table and index partitioning - risk and disadvantages

I have large unpartitioned tables in the database (100GB+), and to improve performance I am thinking about partitioning them, or maybe just their indexes. Data comes in on a regular basis and is selected by date, so I think range partitioning by month of the creation date would be a good option.
I have been reading about Oracle table and index partitioning, and it looks quite promising.
But I have two questions for which I cannot find answers (I think my Google skills are going downhill).
First one is:
What are the risks and disadvantages of creating partitioned tables and indexes in Oracle, in particular on such large and active tables? Is there something I should know about?
Second:
How to create partition on existing and unpartitioned table or index?
Besides the outage (see below) needed to partition your data, the main risk I see is that if you partition your table and indexes with local indexes, performance will not be great for queries that don't rely on the partition key (the date). But you can use global indexes in that case and get back to similar performance.
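For illustration, minimal sketches of the two index choices mentioned above, with assumed table and column names:

-- local index: partitioned along with the table; queries that do not
-- filter on created_date must probe every index partition
CREATE INDEX big_table_date_ix ON big_table (created_date) LOCAL;

-- nonpartitioned (global) index: keeps lookups on other columns fast,
-- at the cost of extra maintenance during partition operations
CREATE INDEX big_table_cust_ix ON big_table (customer_id);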
The simplest way, by far, to create a partitioned table from an unpartitioned one is to use CREATE TABLE ... AS SELECT with a new name and all the partition storage details, then drop the unpartitioned table and rename the new table to the old name. Obviously, this requires careful preparation, and an outage that can last a few minutes :)
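A minimal sketch of that CTAS approach, assuming a table named big_table range-partitioned by month of a created_date column; note that indexes, constraints and grants must be recreated on the new table as part of the preparation:

CREATE TABLE big_table_part
PARTITION BY RANGE (created_date)
( PARTITION p2023_01 VALUES LESS THAN (DATE '2023-02-01')
, PARTITION p2023_02 VALUES LESS THAN (DATE '2023-03-01')
, PARTITION p_max    VALUES LESS THAN (MAXVALUE)
)
AS SELECT * FROM big_table;

-- the outage: nothing may write to the table between these steps
DROP TABLE big_table;
RENAME big_table_part TO big_table;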

How does GetItem/BatchGetItem compare to Querying and Scanning a DynamoDB table in terms of efficiency?

Specifically, when is it better to use one or the other? I am using BatchGetItem now and it seems pretty damn slow.
In terms of efficiency for retrieving a single item for which you know the partition key (and the sort key, if the table uses one), GetItem is more efficient than querying or scanning. BatchGetItem is a convenient way of retrieving a bunch of items whose partition/sort keys you know, and it is only more efficient in terms of network traffic savings.
However, if you only have partial information about an item, then you can't use GetItem/BatchGetItem and have to either Scan or Query for the item(s) you care about. In such cases Query will be more efficient than Scan, since a query already narrows the search down to a single partition key value. Filter expressions don't contribute much to efficiency, but they can save you some network traffic.
There is also the case where you need to retrieve a large number of items. If you need many items with the same partition key, a Query becomes more efficient than multiple GetItem (or BatchGetItem) calls. And if you need to retrieve items making up a significant portion of your table, a Scan is the way to go.

Asking for opinions : One sequence for all tables

Here's another one I've been thinking about lately.
We have concluded in earlier discussions: 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier, I saw that Hibernate by default creates one sequence for all tables. At first I was puzzled by this: why would you do that? But later I saw the advantage: it makes linking parents and children foolproof. Because no two tables ever share a primary key value, accidentally joining a parent to a table that is not its child gives no results.
Does anyone see any downsides to this approach? I only see one: you cannot have more than 999999999999999999999999999 records in your database.
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only a few threads (or just one) request new values, there will be no locking issues, but a bad implementation could cause contention.
Another problem is rolling back transactions: sequences don't get rolled back (because someone else might already have requested a higher value), so you can have large gaps, which will eat your number space much more quickly than you might expect. OTOH, it will take some time to burn through 2 or 4 billion IDs (if you "only" use 32-bit (signed) ints), so it's rarely an issue in practice.
Lastly, you can't easily reset the sequence if you have to. But if you need a restarting sequence (say, the number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
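For concreteness, a minimal sketch of the single-sequence setup under discussion, with assumed names; a generous CACHE setting is the usual mitigation for the contention mentioned above:

CREATE SEQUENCE global_id_seq CACHE 1000;  -- large cache to reduce contention

-- every table draws its primary key from the same sequence,
-- so no two rows anywhere in the schema share an ID
INSERT INTO customers (id, name)        VALUES (global_id_seq.NEXTVAL, 'Acme');
INSERT INTO orders    (id, customer_id) VALUES (global_id_seq.NEXTVAL, 1);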
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling: if you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables once you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much more slowly.
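A minimal sketch of that restart, assuming a hypothetical trace_seq sequence behind the tracing table; dropping and recreating the sequence works on any Oracle version:

-- after archiving or deleting the obsolete tracing rows:
DROP SEQUENCE trace_seq;
CREATE SEQUENCE trace_seq START WITH 1;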
I see little value in having only one sequence for all tables. The arguments put forward so far do not convince me.
There are a couple of disadvantages to using a single sequence:
reduced concurrency: handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem.
Oracle has special code for maintaining B-tree indexes that detects monotonically increasing values and balances the tree appropriately.
the CBO might have an easier time estimating range queries on the index (if you ever did this) if most values were filled in.
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally, I find the ability to assign a truly unique identifier to a row, making each id column effectively a UUID, enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches INSERT statements, this will not be a performance bottleneck unless massive numbers of records are vying for the same DB resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails, though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored

Oracle Hierarchical Query Performance

We're looking at using Oracle hierarchical queries to model potentially very large tree structures (potentially infinitely wide, with depths of 30+). My understanding is that hierarchical queries provide a way to write recursively joining SQL, but they do not provide any real performance enhancement over manually writing an equivalent query... is this the case? What sort of experiences have people had, performance-wise, with Oracle hierarchical queries?
Well, the short answer is that without the hierarchical extension (CONNECT BY) you couldn't write a recursive query at all; you could only programmatically issue many queries that were recursively linked.
The rule of thumb with everything database-related, especially Oracle, is that if you can produce your result in a single query, it will almost always be faster than doing it programmatically.
My experience has been with much smaller sets, so I can't speak for how well hierarchical queries perform on large sets.
When doing these tree retrievals, you typically have these options:
Query everything and assemble the tree on the client side.
Perform one query for each level of the tree, building on what you know you need from the previous query's results.
Use the built-in support Oracle provides (START WITH, CONNECT BY PRIOR).
Doing it all in the database reduces unnecessary round trips and wasteful queries that pull too much data.
Try partitioning the data within your hierarchical table and then limiting the partitions included in the query:
CREATE TABLE loopy
( key      NUMBER
, key_hier NUMBER
, info     VARCHAR2(100)  -- VARCHAR2 requires a length in table DDL
, part     NUMBER
)
PARTITION BY RANGE (part)
( PARTITION low  VALUES LESS THAN (1000)
, PARTITION mid  VALUES LESS THAN (10000)
, PARTITION high VALUES LESS THAN (MAXVALUE)
);

SELECT info
FROM   loopy PARTITION (mid)
CONNECT BY PRIOR key = key_hier  -- PRIOR is required for the hierarchical walk
START WITH key = <some value>;
The interesting problem now becomes your partitioning strategy. Oracle provides several options.
I've seen that using CONNECT BY can be slow, but compared to what? There isn't really another option, except building a result set using recursive PL/SQL calls (slower) or doing it on the client side.
You could try separating your data into a mapping table (the hierarchy definition) and a lookup table (the display data) and then joining them back together. I wouldn't expect much of a gain, assuming you are getting the hierarchy data from indexed fields, but it's worth a try.
Have you tried it with CONNECT BY yet? I'm a big fan of trying different variations.
