Best way to save large column data in a data warehouse - Oracle

I have a table that stores the changes to a transaction; all the changes are captured into this table. One of the columns that comes as part of the transaction can contain many comma-separated values. The number of occurrences cannot be predicted, and the field is not mandatory, so it can also be null.
The total number of transactions I have in the table is around 100M. Of those, about 1M records have this value populated, and of that 1M, roughly 37K records have a value longer than 4000 characters.
I mention the length 4000 because in my Oracle table the column that stores this value is defined as varchar2(4000).
I checked in a few places and found that if I have to save something of unknown length I should define the column as a CLOB. But a CLOB is expensive for me, since only a very small amount of the data has length > 4000. If I snowflake my star schema and create another table to store the values, then even the transactions whose values are much shorter than 4000 would end up being saved in the CLOB column. This would be expensive both in terms of storage and performance.
Can someone suggest an approach to solve this problem?
Thanks
S

You could create a master-detail pair of tables to store the comma-separated values; then you would have one row per value rather than all of the values packed into a single column. The two tables would be linked by a foreign key on a pseudo (surrogate) key in the master table.
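A minimal sketch of that design, with all names (txn_change, txn_change_value, etc.) made up for illustration:
create table txn_change (
  change_id number primary key,   -- pseudo/surrogate key
  txn_id    number not null,
  change_ts timestamp
);
create table txn_change_value (
  change_id number not null references txn_change (change_id),
  value_seq number not null,      -- preserves the original order of the comma-separated values
  value_txt varchar2(4000),
  constraint pk_txn_change_value primary key (change_id, value_seq)
);
With this layout a single value never needs to exceed varchar2(4000), and only the ~1M transactions that actually carry values produce detail rows.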

Here's one option.
Create two columns, e.g.
create table storage
(id number primary key,
long_text_1 varchar2(4000),
long_text_2 varchar2(4000)
);
Store values like
insert into storage (id, long_text_1, long_text_2)
values (seq.nextval,
substr(input_value, 1, 4000),
substr(input_value, 4001, 4000)
);
When retrieving them from the table, concatenate them:
select id,
long_text_1 || long_text_2 as long_text
from storage
where ...

You might benefit from using inline SecureFile CLOBs. With inline CLOBs, up to about 4000 bytes of data can be stored in the row like a regular VARCHAR2, and only the larger values are stored in a separate CLOB segment. With SecureFiles, Oracle can significantly improve CLOB performance. (For example, import and export of SecureFiles are much faster than with the old-fashioned BasicFile LOB format.)
Depending on your version, parameters, and table DDL, your database may already store CLOBs as inline SecureFiles. Ensure that your COMPATIBLE setting is 11.2 or higher, and that DB_SECUREFILE is one of "permitted", "always", or "preferred":
select name, value from v$parameter where name in ('compatible', 'db_securefile') order by 1;
Use a query like this to ensure that your tables were set up correctly and that nobody overrode the system settings:
select dbms_metadata.get_ddl('TABLE', 'YOUR_TABLE_NAME') from dual;
You should see something like this in the results:
... LOB ("CLOB_NAME") STORE AS SECUREFILE (... ENABLE STORAGE IN ROW ...) ...
One of the main problems with CLOBs is that they are stored in a separate segment, and a LOB index must be traversed to map each row in the table to a value in another segment. The demo below creates two tables to show that LOB segments do not need to be used when the data is small and stored inline.
--drop table clob_test_inline;
--drop table clob_test_not_in;
create table clob_test_inline(a number, b clob) lob(b) store as securefile (enable storage in row);
create table clob_test_not_in(a number, b clob) lob(b) store as (disable storage in row);
insert into clob_test_inline select level, lpad('A', 900, 'A') from dual connect by level <= 10000;
insert into clob_test_not_in select level, lpad('A', 900, 'A') from dual connect by level <= 10000;
commit;
The inline table segment is large, because it holds all the data. The out of line table segment is small, because all of its data is held elsewhere.
select segment_name, bytes/1024/1024 mb_inline
from dba_segments
where segment_name like 'CLOB_TEST%'
order by 1;
SEGMENT_NAME       MB_INLINE
----------------   ---------
CLOB_TEST_INLINE   27
CLOB_TEST_NOT_IN   0.625
Looking at the LOB segments, the sizes are reversed. The inline table doesn't store anything in the LOB segment.
select table_name, bytes/1024/1024 mb_out_of_line
from dba_segments
join dba_lobs
on dba_segments.owner = dba_lobs.owner
and dba_segments.segment_name = dba_lobs.segment_name
where dba_lobs.table_name like 'CLOB_TEST%'
order by 1;
TABLE_NAME         MB_OUT_OF_LINE
----------------   --------------
CLOB_TEST_INLINE   0.125
CLOB_TEST_NOT_IN   88.1875
Despite the above, I can't promise that CLOBs will work well for you. All I can say is that it's worth testing your data with CLOBs. You'll still need to look out for a few things: CLOBs store text in a different encoding (UCS2 instead of UTF8), which may take up more space depending on your character sets, so check the segment sizes. But also beware that segment sizes can lie when they are small; there's a lot of auto-allocated overhead for sample data, so you'll want to test with realistic data volumes.
Finally, as Raul pointed out, storing non-atomic values in a field is usually a terrible mistake. That said, there are rare times when data warehouses need to break the rules for performance, and data needs to be stored as compactly as possible. Before you store the data this way, ensure that you will never need to join based on those values, or query for individual values. If you think dealing with 100M rows is tough, just wait until you try to split 100M values and then join them to another table.
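To get a sense of what "splitting values" means in practice, a query to un-nest a comma-separated column usually ends up looking something like this (a sketch; my_table, id and csv_col are assumed names, and it arbitrarily caps each row at 100 values):
select t.id,
       regexp_substr(t.csv_col, '[^,]+', 1, n.pos) as single_value
from my_table t
join (select level as pos from dual connect by level <= 100) n
  on n.pos <= regexp_count(t.csv_col, ',') + 1;
Running that over 100M rows, and then joining the result to another table, is exactly the kind of workload you want to rule out before choosing this design.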

Related

What is the data length of CLOB in oracle?

Could you please let me know the data length of the second column (CLOB data type) in the Employee table below? I see some blogs where it says the maximum data length is (4GB - 1) * (database block size).
I'm new to this data designing.
Table : Employee
Column_Name    Data_Type    Nullable    Column_Id
Emp_ID         NUMBER       No          1
Emp_details    CLOB         No          2
Please help me.
To get CLOB size for a given column in a given row, use DBMS_LOB.GETLENGTH function:
select dbms_lob.getlength(emp_details) from employee where emp_id = 1;
To get the CLOB size for a given column in a given table as actually allocated in the tablespace, you need to identify both segments implementing the LOB (the LOB data segment and the LOB index).
You can compare both sizes with the following query:
select v1.col_size, v2.seg_size from
(select sum(dbms_lob.getlength(emp_details)) as col_size from employee) v1,
(select sum(bytes) as seg_size from user_segments where segment_name in
(
(select segment_name from user_lobs where table_name='EMPLOYEE' and column_name='EMP_DETAILS')
union
(select index_name from user_lobs where table_name='EMPLOYEE' and column_name='EMP_DETAILS')
)
) v2
;
LOBs are not stored in the table but outside of it, in a dedicated structure called a LOB segment, with an accompanying LOB index. As @pifor explains, you can inspect those structures in the dictionary view user_lobs.
The LOB segment uses blocks of usually 8192 bytes (check the tablespace in user_lobs), so the minimum size allocated for a single LOB is 8K. For 10,000 bytes you need two 8K blocks, and so on.
Please note that if your database character set is Unicode (as in most modern Oracle databases), the size of a CLOB is roughly twice what you would expect, because CLOBs are stored in a 16-bit encoding.
This gets a bit better if you compress the LOBs, but your Oracle license needs to cover "Advanced Compression".
For very small LOBs (less than about 4000 bytes), you can avoid the 8K overhead and store them in the table block together with the other columns (enable storage in row).
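For reference, a table declared that way might look like the following (a sketch only; the compress option is commented out because it requires the Advanced Compression license):
create table employee (
  emp_id      number primary key,
  emp_details clob
)
lob (emp_details) store as securefile (
  enable storage in row      -- LOBs up to ~4000 bytes stay in the table block
  -- compress medium         -- optional, Advanced Compression only
);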

Trying to figure out max length of Rowid in Oracle

As per my design I want to fetch rowid as in
select rowid r from table_name;
into a C variable. I was wondering what the maximum size/length in characters of the rowid is.
Currently, in one of the biggest tables in my DB, the maximum rowid length is 18, and it is 18 throughout the table.
Thanks in advance.
Edit:
Currently the block of code below is iterated over and used for multiple tables, hence, in order to keep the code flexible without having to define every table's PK in the query, we use ROWID.
select rowid from table_name ... where ....;
delete from table_name where rowid = selectedrowid;
I think that since the rowid is picked and used then and there, without being stored for future use, it is safe to use in this particular scenario.
Please refer to below answer:
Is it safe to use ROWID to locate a Row/Record in Oracle?
I'd say no. This could be safe if, for instance, the application stores ROWIDs only temporarily (say, to generate a list of selectable items, each identified by ROWID, where the list is routinely regenerated and not stored). But if a ROWID is used in any persistent way, it's not safe.
A physical ROWID has a fixed size in a given Oracle version, it does not depend on the number of rows in a table. It consists of the number of the datafile, the number of the block within this file, and the number of the row within this block. Therefore it is unique in the whole database and allows direct access to the block and row without any further lookup.
As things in the IT world continue to grow, it is safe to assume that the format will change in future.
Besides volume there are also structural changes, like the advent of transportable tablespaces, which made it necessary to store the object number (= internal number of the table/partition/subpartition) inside the ROWID.
Or the advent of index-organized tables (mentioned by @ibre5041), which look like tables but are in reality just an index without such a physical address (because things move constantly within an index). This made it necessary to introduce UROWIDs, which can store both physical and index-based ROWIDs.
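You can look at those components yourself with the DBMS_ROWID package (table_name is a placeholder):
select rowid,
       dbms_rowid.rowid_object(rowid)       as data_object_id,
       dbms_rowid.rowid_relative_fno(rowid) as file_no,
       dbms_rowid.rowid_block_number(rowid) as block_no,
       dbms_rowid.rowid_row_number(rowid)   as row_no
from table_name
where rownum <= 5;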
Please be aware that a ROWID can change, for instance if the row moves from one table partition to another one, or if the table is defragmented to fill the holes left by many DELETEs.
According to the documentation, a ROWID has a length of 10 bytes:
Rowids of Row Pieces
A rowid is effectively a 10-byte physical address of a row.
Every row in a heap-organized table has a rowid unique to this table
that corresponds to the physical address of a row piece. For table
clusters, rows in different tables that are in the same data block can
have the same rowid.
Oracle also documents the (current) format; see Rowid Format.
In general you could use the ROWID in your application, provided the affected rows are locked!
Thus your statement may look like this:
CURSOR ... IS
select rowid from table_name ... where .... FOR UPDATE;
delete from table_name where rowid = selectedrowid;
see SELECT FOR UPDATE and FOR UPDATE Cursors
Oracle even provides a shortcut. Instead of where rowid = selectedrowid you can use WHERE CURRENT OF ...
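A minimal sketch of that pattern (the table name and filter are placeholders):
declare
  cursor c_rows is
    select rowid as rid
    from table_name
    where some_column = 'X'
    for update;              -- locks the selected rows until commit
begin
  for r in c_rows loop
    delete from table_name
    where current of c_rows; -- shortcut for "where rowid = <the fetched rowid>"
  end loop;
  commit;                    -- releases the locks
end;
/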

Oracle - how to see how many blocks have been used in a table

I have very limited experience with Oracle, and am after a rather simple query, I imagine. I have a table which contains 1 million rows. I'm trying to prove that compressing the data uses less space, however I'm not sure how to do this. Based on the table creation below, could someone please show me what I need to write to see the blocks used before/after?
CREATE TABLE OrderTableCompressed(OrderID, StaffID, CustomerID, TotalOrderValue)
as (select level, ceil(dbms_random.value(0, 1000)),
ceil(dbms_random.value(0,10000)),
round(dbms_random.value(0,10000),2)
from dual
connect by level <= 1000000);
ALTER TABLE OrderTableCompressed ADD CONSTRAINT OrderID_PKC PRIMARY KEY (OrderID);
--QUERY HERE THAT SHOWS BLOCKS USED/TIME TAKEN
SELECT COUNT(ORDERID) FROM OrderTableCompressed;
ALTER TABLE OrderTableCompressed COMPRESS;
--QUERY HERE THAT SHOWS BLOCKS USED/TIME TAKEN WHEN COMPRESSED
SELECT COUNT(ORDERID) FROM OrderTableCompressed;
I know how the compression works etc.; it's just about applying the code to prove my theory. Thanks for any help.
--QUERY HERE THAT SHOWS BLOCKS USED
SELECT blocks, bytes/1024/1024 as MB
FROM user_segments
where segment_name = 'ORDERTABLECOMPRESSED';
Now compress the table. (Note the MOVE: without it you only change an attribute of the table, and only subsequent direct-path inserts will create compressed blocks.)
ALTER TABLE OrderTableCompressed MOVE COMPRESS;
Verify blocks:
--QUERY HERE THAT SHOWS BLOCKS USED TAKEN WHEN COMPRESSED
SELECT blocks, bytes/1024/1024 as MB
FROM user_segments
where segment_name = 'ORDERTABLECOMPRESSED';

Oracle 11G - Performance effect of indexing at insert

Objective
Verify whether it is true that inserting records without a PK/index and creating them later is faster than inserting with the PK/index in place.
Note
The point here is not that indexing takes more time (that is obvious), but whether the total cost of (insert without index + create index) is higher than (insert with index). I was taught to insert without the index and create the index later, as it should be faster.
Environment
Windows 7 64 bit on DELL Latitude core i7 2.8GHz 8G memory & SSD HDD
Oracle 11G R2 64 bit
Background
I was taught that insert records without PK/Index and create them after insert would be faster than insert with PK/Index.
However, inserting 1 million records with the PK/index in place was actually faster than creating the PK/index afterwards: approx. 4.5 seconds vs 6 seconds in the experiments below. Increasing the number of records to 3 million (999000 -> 2999000) gave the same result.
Conditions
The table DDL is below. One bigfile tablespace for both data and index.
(A separate index tablespace was also tested, with the same result and inferior overall performance.)
Flush the buffer cache/shared pool before each run.
Run the experiment 3 times each and make sure the results are similar.
SQL to flush:
ALTER SYSTEM CHECKPOINT;
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
Question
Is it actually true that "insert without PK/Index + PK/Index creation later" is faster than "insert with PK/Index"?
Did I make any mistakes or miss some conditions in the experiment?
Insert records with PK/Index
TRUNCATE TABLE TBL2;
ALTER TABLE TBL2 DROP CONSTRAINT PK_TBL2_COL1 CASCADE;
ALTER TABLE TBL2 ADD CONSTRAINT PK_TBL2_COL1 PRIMARY KEY(COL1) ;
SET timing ON
INSERT INTO TBL2
SELECT i+j, rpad(TO_CHAR(i+j),100,'A')
FROM (
WITH DATA2(j) AS (
SELECT 0 j FROM DUAL
UNION ALL
SELECT j+1000 FROM DATA2 WHERE j < 999000
)
SELECT j FROM DATA2
),
(
WITH DATA1(i) AS (
SELECT 1 i FROM DUAL
UNION ALL
SELECT i+1 FROM DATA1 WHERE i < 1000
)
SELECT i FROM DATA1
);
commit;
1,000,000 rows inserted.
Elapsed: 00:00:04.328 <----- Insert records with PK/Index
Insert records without PK/Index and create them after
TRUNCATE TABLE TBL2;
ALTER TABLE &TBL_NAME DROP CONSTRAINT PK_TBL2_COL1 CASCADE;
SET TIMING ON
INSERT INTO TBL2
SELECT i+j, rpad(TO_CHAR(i+j),100,'A')
FROM (
WITH DATA2(j) AS (
SELECT 0 j FROM DUAL
UNION ALL
SELECT j+1000 FROM DATA2 WHERE j < 999000
)
SELECT j FROM DATA2
),
(
WITH DATA1(i) AS (
SELECT 1 i FROM DUAL
UNION ALL
SELECT i+1 FROM DATA1 WHERE i < 1000
)
SELECT i FROM DATA1
);
commit;
ALTER TABLE TBL2 ADD CONSTRAINT PK_TBL2_COL1 PRIMARY KEY(COL1) ;
1,000,000 rows inserted.
Elapsed: 00:00:03.454 <---- Insert without PK/Index
table TBL2 altered.
Elapsed: 00:00:02.544 <---- Create PK/Index
Table DDL
CREATE TABLE TBL2 (
"COL1" NUMBER,
"COL2" VARCHAR2(100 BYTE),
CONSTRAINT "PK_TBL2_COL1" PRIMARY KEY ("COL1")
) TABLESPACE "TBS_BIG" ;
The current test case is probably good enough for you to overrule the "best practices". There are too many variables involved to make a blanket statement that "it's always best to leave the indexes enabled". But you're probably close enough to say it's true for your environment.
Below are some considerations for the test case. I've made this a community wiki in the hopes that others will add to the list.
Direct-path inserts. Direct-path writes use different mechanisms and may work completely differently. Direct-path inserts can often be significantly faster than regular inserts, although they have some complicated restrictions (for example, triggers must be disabled) and disadvantages (the data is not immediately backed-up). One particular way it affects this scenario is that NOLOGGING for indexes only applies during index creation. So even if a direct-path insert is used, an enabled index will always generate REDO and UNDO.
Parallelism. Large insert statements often benefit from parallel DML. Usually it's not worth worrying about the performance of bulk loads until it takes more than several seconds, which is when parallelism starts to be useful.
Bitmap indexes are not meant for large DML. Inserts or updates to a table with a bitmap index can lock the whole table and lead to disastrous performance. It might be helpful to limit the test case to b-tree indexes.
Add alter system switch logfile;? Log file switches can sometimes cause performance issues. The tests would be somewhat more consistent if they all started with empty logfiles.
Move the data generation logic into a separate step. Hierarchical queries are useful for generating data, but they can have their own performance issues. It might be better to create an intermediate table to hold the results, and then only test inserting from the intermediate table into the final table (a sketch follows this list).
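For instance, the generation step from the question could be split out roughly like this (a sketch using the same TBL2 layout; TBL2_SRC is a made-up staging table):
-- generate the rows once, outside the timed test
create table tbl2_src as
select i + j as col1, rpad(to_char(i + j), 100, 'A') as col2
from (select (level - 1) * 1000 as j from dual connect by level <= 1000),
     (select level as i from dual connect by level <= 1000);
-- the timed part is then just a plain insert
insert into tbl2 select col1, col2 from tbl2_src;
commit;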
It's true that it is faster to modify a table if you do not also have to modify one or more indexes and possibly perform constraint checking as well, but it is also largely irrelevant if you then have to add those indexes. You have to consider the complete change to the system that you wish to effect, not just a single part of it.
Obviously if you are adding a single row into a table that already contains millions of rows then it would be foolish to drop and rebuild indexes.
However, even if you have a completely empty table into which you are going to add several million rows it can still be slower to defer the indexing until afterwards.
The reason for this is that such an insert is best performed with the direct path mechanism, and when you use direct path inserts into a table with indexes on it, temporary segments are built that contain the data required to build the indexes (data plus rowids). If those temporary segments are much smaller than the table you have just loaded then they will also be faster to scan and to build the indexes from.
The alternative, if you have five indexes on the table, is to incur five full table scans after you have loaded it in order to build the indexes.
Obviously there are huge grey areas involved here, but well done for:
Questioning authority and general rules of thumb, and
Running actual tests to determine the facts in your own case.
Edit:
Further considerations: suppose you run a backup while the indexes are dropped. Then, following an emergency restore, you have to have a script that verifies that all indexes are back in place, while the business is breathing down your neck to get the system back up.
Also, if you are absolutely determined not to maintain indexes during a bulk load, do not drop the indexes -- disable them instead. This preserves the metadata for the indexes' existence and definition, and allows a simpler rebuild process. Just be careful that you do not accidentally re-enable the indexes by truncating the table, as this will render disabled indexes usable again.
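For regular B-tree indexes, the usual mechanics for that (a sketch, not a complete procedure; note that an unusable unique index backing an enabled constraint will still block DML) look like:
alter index my_big_index unusable;
alter session set skip_unusable_indexes = true;  -- default in recent versions
-- ... bulk load ...
alter index my_big_index rebuild nologging;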
Oracle has to do more work while inserting data into a table that has an index. In general, inserting without an index is faster than inserting with one.
Think in this way,
Inserting rows into a regular heap-organized table with no particular row order is simple: find a table block with enough free space and put the rows in wherever they fit.
But when there are indexes on the table, there is much more work to do. Adding a new entry to the index is not that simple: Oracle has to traverse the index blocks to find the correct leaf node, since the new entry cannot be placed in just any block. Once the correct leaf node is found, it checks for enough free space and then makes the new entry. If there is not enough space, it has to split the node and distribute the entries between the old and the new node. All this work is overhead and consumes more time overall.
Let's see a small example,
Database version :
SQL> SELECT banner FROM v$version where ROWNUM =1;
BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
OS : Windows 7, 8GB RAM
With Index
SQL> CREATE TABLE t(A NUMBER, CONSTRAINT PK_a PRIMARY KEY (A));
Table created.
SQL> SET timing ON
SQL> INSERT INTO t SELECT LEVEL FROM dual CONNECT BY LEVEL <=1000000;
1000000 rows created.
Elapsed: 00:00:02.26
So, it took 00:00:02.26. Index details:
SQL> column index_name format a10
SQL> column table_name format a10
SQL> column uniqueness format a10
SQL> SELECT index_name, table_name, uniqueness FROM user_indexes WHERE table_name = 'T';
INDEX_NAME TABLE_NAME UNIQUENESS
---------- ---------- ----------
PK_A T UNIQUE
Without Index
SQL> DROP TABLE t PURGE;
Table dropped.
SQL> CREATE TABLE t(A NUMBER);
Table created.
SQL> SET timing ON
SQL> INSERT INTO t SELECT LEVEL FROM dual CONNECT BY LEVEL <=1000000;
1000000 rows created.
Elapsed: 00:00:00.60
So, it took only 00:00:00.60 which is faster compared to 00:00:02.26.

How do I UPDATE a large table in Oracle PL/SQL in batches to avoid running out of undo space?

I have a very large table (5 million records). I'm trying to obfuscate the table's VARCHAR2 columns with random alphanumerics for every record in the table. My procedure executes successfully on smaller datasets, but it will eventually be used on a remote DB whose settings I can't control, so I'd like to EXECUTE the UPDATE statement in batches to avoid running out of undo space.
Is there some kind of option I can enable, or a standard way to do the update in chunks?
I'll add that there won't be any distinguishing features of the records that haven't been obfuscated so my one thought of using rownum in a loop won't work (I think).
If you are going to update every row in a table, you are better off doing a Create Table As Select, then drop/truncate the original table and re-append with the new data. If you've got the partitioning option, you can create your new table as a table with a single partition and simply swap it with EXCHANGE PARTITION.
Inserts require a LOT less undo, and a direct-path insert with nologging (the /*+ APPEND */ hint) won't generate much redo either.
With either mechanism, there would probably still be 'forensic' evidence of the old values (e.g. preserved in undo or in "available" space allocated to the table due to row movement).
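As a very rough sketch of the exchange-partition variant mentioned above (my_table, id, varchar_col and obfuscate_fn are placeholders; the partitioning option is required, and indexes/constraints must match or be rebuilt afterwards):
create table my_table_new nologging
partition by range (id) (partition p_all values less than (maxvalue))
as
select id, obfuscate_fn(varchar_col) as varchar_col
from my_table;

alter table my_table_new exchange partition p_all with table my_table;
-- my_table now holds the obfuscated rows; my_table_new holds the originals
drop table my_table_new purge;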
The following is untested, but should work:
declare
  l_fetchsize number := 10000;   -- rows per batch; tune to your undo capacity

  cursor cur_getrows is
    select rowid, random_function(my_column)   -- random_function = your obfuscation function
    from my_table;

  type rowid_tbl_type     is table of urowid;
  type my_column_tbl_type is table of my_table.my_column%type;

  rowid_tbl     rowid_tbl_type;
  my_column_tbl my_column_tbl_type;
begin
  open cur_getrows;
  loop
    -- fetch the next batch of rowids and pre-computed replacement values
    fetch cur_getrows bulk collect
      into rowid_tbl, my_column_tbl
      limit l_fetchsize;
    exit when rowid_tbl.count = 0;

    -- apply the whole batch in one round trip, addressing rows by rowid
    forall i in rowid_tbl.first .. rowid_tbl.last
      update my_table
      set my_column = my_column_tbl(i)
      where rowid = rowid_tbl(i);

    commit;   -- commit per batch so undo stays bounded
  end loop;
  close cur_getrows;
end;
/
This isn't optimally efficient -- a single update would be -- but it'll do smaller, user-tunable batches, using ROWID.
I do this by mapping the primary key to an integer (mod n), and then performing the update for each x, where 0 <= x < n.
For example, maybe you are unlucky and the primary key is a string. You can hash it with your favorite hash function, and break it into three partitions:
UPDATE myTable SET a=doMyUpdate(a) WHERE MOD(ORA_HASH(ID), 3)=0
UPDATE myTable SET a=doMyUpdate(a) WHERE MOD(ORA_HASH(ID), 3)=1
UPDATE myTable SET a=doMyUpdate(a) WHERE MOD(ORA_HASH(ID), 3)=2
You may have more partitions, and may want to put this into a loop (with some commits).
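Put into a loop, that approach might look like this (a sketch reusing the names from the statements above):
declare
  c_buckets constant pls_integer := 3;   -- n partitions
begin
  for x in 0 .. c_buckets - 1 loop
    update myTable
    set a = doMyUpdate(a)
    where mod(ora_hash(ID), c_buckets) = x;
    commit;                              -- keep each transaction's undo small
  end loop;
end;
/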
If I had to update millions of records I would probably opt NOT to update.
I would more likely create a temp table and then insert the data from the old table, since an insert doesn't take up a lot of redo space and takes less undo.
CREATE TABLE new_table as select <do the update "here"> from old_table;
index new_table
grant on new_table
add constraints on new_table
etc on new_table
drop table old_table
rename new_table to old_table;
You can do that using parallel query, with NOLOGGING on most operations generating very little redo and no undo at all -- in a fraction of the time it would take to update the data.
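A rough sketch of that combination (obfuscate_fn and the column list are placeholders; NOLOGGING means the new data is not recoverable from redo until the next backup):
create table new_table nologging parallel 4 as
select /*+ parallel(t 4) */ obfuscate_fn(col1) as col1, col2
from old_table t;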
