Calculate how much space take 100 rows in a Table - oracle

I have a Table with 5000 rows. Is there any way to find out how much space is used by the first 100 rows?
EDIT I found this script on the internet:
WITH table_size AS
(SELECT owner, segment_name, SUM (BYTES) total_size
FROM dba_extents
WHERE segment_type = 'TABLE'
GROUP BY owner, segment_name)
SELECT table_name, avg_row_len, num_rows * avg_row_len actual_size_of_data,
b.total_size
FROM dba_tables a, table_size b
WHERE a.owner = UPPER ('&&ENTER_OWNER_NAME')
AND a.table_name = UPPER ('&&ENTER_TABLE_NAME')
AND a.owner = b.owner
AND a.table_name = b.segment_name;
I don't know if it gives the desired result. It calculates the Average Row Length and I multiply that with 100.

Oracle separates the logical storage from the physical by packing rows into blocks. So, while it's not straight forward and it is also not exact, it is possible.
You have to determine the sum of the bytes used by a row (this will be an educated guess if you have one or more varchar2 columns) then, based on block size, determine how many rows will fit in a block. Oracle always allocates a full block even if it only has two store on byte in it so the total storage will be a factor of block size.

Related

Oracle maximum number of expression issue

I have this query
#Query("SELECT b from BillInfoDetails b where b.masterAcctCode in :masterAccountList and b.msisdn in :msisdnList") List<BillInfoDetails>findAllByMsisdnAndMasterAcctList(#Param("masterAccountList")List<String> masterAccountList, #Param("msisdnList") List<String> msisdnList);
and then find this error ORA-01795: maximum number of expressions in a list is 1000.
This error has some manual solution that I have split the list manually 1 to 999 and then 1000 to 1999 and so on. But this will not the good one for me cause in this msisdnList there could be 1500 or 18000 or some other more values. Moreover I want a dynamic solution actually where any dynamic value whatever it is, it will work properly
One option is to store all those values into a table; then you'd be able to use it as a join (or a subquery). For example:
select b.*
from billinfodetails b join new_table n on n.masteracctcode = b.masteracctcode
or
select b.*
from billinfodetails b
where b.masteracctcode in (select n.masteracctcode
from new_table n)
or
select b.*
from billinfodetails b
where exists (select null
from new_table n
where n.masteracctcode = b.masteracctcode)

How to Improve Cross Join Performance in Hive TEZ?

I have a hive table with 5 billion records. I want each of these 5 billion records to be joined with a hardcoded 52 records.
For achieving this I am doing a cross join like
select *
from table1 join table 2
ON 1 = 1;
This is taking 5 hours to run with the highest possible memory parameters.
Is there any other short or easier way to achieve this in less time ?
Turn on map-join:
set hive.auto.convert.join=true;
select *
from table1 cross join table2;
The table is small (52 records) and should fit into memory. Map-join operator will load small table into the distributed cache and each reducer container will use it to process data in memory, much faster than common-join.
Your query is slow because a cross-join(Cartesian product) is processed by ONE single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner-join, so as to utilize map-side join optimization.
with t1 as (
selct col1, col2,..., 0 as k from table1
)
,t2 as (
selct col3, col4,..., 0 as k from table2
)
selct
*
from t1 join t2
on t1.k = t2.k
Now each table (CTE) has a fake column called k with identical value 0. So it works just like a cross-join while only a map-side join operation takes place.

How to define if table is a good candidate for a clustered columnstore index?

I have read (here,here and here) about clustered columnstore indexes introduced in SQL Server 2014. Basically, now:
Column store indexes can be updatable
Table schema can be modified (without drop column store indexes)
Structure of the base table can be columnar
Space saved by compression effects (with a column store index, you
can save between 40 to 50 percent of initial space used for the
table)
In addition, they support:
Row mode and Batch mode processing
BULK INSERT statement
More data types
AS I have understood there are some restrictions, like:
Unsupported data types
Other indexes cannot be created
But as it is said:
With a clustered column store index, all filter possibilities are
already covered; Query Processor, using Segment Elimination, will be
able to consider only the segments required by the query clauses. On
the columns where it cannot apply the Segment Elimination, all scans
will be faster than B-Tree index scans because data are compressed so
less I/O operations will be required.
I am interested in the following:
Does the statement above say that a clustered column store index is always better for extracting data than a B-Tree index when a lot of duplicated values exist?
What about the performance between clustered column store index and non-clustered B-Tree covering index, when the table has many columns for example?
Can I have a combination of clustered and non-clustered columnstores indexes on one table?
And most importantly, can anyone tell how to determine whether a table is a good candidate for a columned stored index?
It is said that the best candidates are tables for which update/delete/insert operations are not performed often. For example, I have a table with storage size above 17 GB (about 70 millions rows) and new records are inserted and deleted constantly. On the other hand, a lot of queries using its columns are performed. Or I have a table with storage size about 40 GB (about 60 millions rows) with many inserts performed each day - it is not queried often but I want to reduce its size.
I know the answer is mostly in running production tests but before that I need to pick the better candidates.
One of the most important restrictions for Clustered Columnstore is their locking, you can find some details over here: http://www.nikoport.com/2013/07/07/clustered-columnstore-indexes-part-8-locking/
Regarding your questions:
1) Does the statement above say that a clustered column store index is always better for extracting data then a B-Tree index when a lot of duplicated values exist
Not only duplicates are faster scanned by Batch Mode, but for data reading the mechanisms for Columnstore Indexes are more effective, when reading all data out of a Segment.
2) What about the performance between clustered column store index and non-clustered B-Tree covering index, when the table has many columns for example
Columnstore Index has a significantly better compression than Page or Row, available for the Row Store, Batch Mode shall make the biggest difference on the processing side and as already mentioned even reading of the equally-sized pages & extents should be faster for Columnstore Indexes
3) Can I have a combination of clustered and non clustered columnstores indexes on one table
No, at the moment this is impossible.
4) ... can anyone tell how to define if a table is a good candidate for a columned stored index?
Any table which you are scanning & processing in big amounts (over 1 million rows), or maybe even whole table with over 100K scanned entirely might be a candidate to consider.
There are some restrictions on the used technologies related to the table where you want to build Clustered Columnstore indexes, here is a query that I am using:
select object_schema_name( t.object_id ) as 'Schema'
, object_name (t.object_id) as 'Table'
, sum(p.rows) as 'Row Count'
, cast( sum(a.total_pages) * 8.0 / 1024. / 1024
as decimal(16,3)) as 'size in GB'
, (select count(*) from sys.columns as col
where t.object_id = col.object_id ) as 'Cols Count'
, (select count(*)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id and
UPPER(tp.name) in ('VARCHAR','NVARCHAR')
) as 'String Columns'
, (select sum(col.max_length)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id
) as 'Cols Max Length'
, (select count(*)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id and
(UPPER(tp.name) in ('TEXT','NTEXT','TIMESTAMP','HIERARCHYID','SQL_VARIANT','XML','GEOGRAPHY','GEOMETRY') OR
(UPPER(tp.name) in ('VARCHAR','NVARCHAR') and (col.max_length = 8000 or col.max_length = -1))
)
) as 'Unsupported Columns'
, (select count(*)
from sys.objects
where type = 'PK' AND parent_object_id = t.object_id ) as 'Primary Key'
, (select count(*)
from sys.objects
where type = 'F' AND parent_object_id = t.object_id ) as 'Foreign Keys'
, (select count(*)
from sys.objects
where type in ('UQ','D','C') AND parent_object_id = t.object_id ) as 'Constraints'
, (select count(*)
from sys.objects
where type in ('TA','TR') AND parent_object_id = t.object_id ) as 'Triggers'
, t.is_tracked_by_cdc as 'CDC'
, t.is_memory_optimized as 'Hekaton'
, t.is_replicated as 'Replication'
, coalesce(t.filestream_data_space_id,0,1) as 'FileStream'
, t.is_filetable as 'FileTable'
from sys.tables t
inner join sys.partitions as p
ON t.object_id = p.object_id
INNER JOIN sys.allocation_units as a
ON p.partition_id = a.container_id
where p.data_compression in (0,1,2) -- None, Row, Page
group by t.object_id, t.is_tracked_by_cdc, t.is_memory_optimized, t.is_filetable, t.is_replicated, t.filestream_data_space_id
having sum(p.rows) > 1000000
order by sum(p.rows) desc

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assume a table with 100 rows, how I can randomly return 20 of those records from the entire 100 rows.
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the Table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two ways were introduced
1) using order by DBMS_RANDOM.VALUE clause
2) using sample([%]) function
The first way has advantage in 'CORRECTNESS' which means you will never fail get result if it actually exists, while in the second way you may get no result even though it has cases satisfying the query condition since information is reduced during sampling.
The second way has advantage in 'EFFICIENT' which mean you will get result faster and give light load to your database.
I was given an warning from DBA that my query using the first way gives loads to the database
You can choose one of two ways according to your interest!
In case of huge tables standard way with sorting by dbms_random.value is not effective because you need to scan whole table and dbms_random.value is pretty slow function and requires context switches. For such cases, there are 3 additional methods:
1: Use sample clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
ie get 1% of all blocks, then sort them randomly and return just 1 row.
2: if you have an index/primary key on the column with normal distribution, you can get min and max values, get random value in this range and get first row with a value greater or equal than that randomly generated value.
Example:
--big table with 1 mln rows with primary key on ID with normal distribution:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: get random table block, generate rowid and get row from the table by this rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select/*+ rule */ file#,block#,objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows I think you'd be better off selecting the lot of them randomly ordered and selecting the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick a random sample out of each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How to find Random 50% records from table ?
when we want percent wise randomly data
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;

Oracle disk space used by a table

I have a table in an Oracle db that gets a couple of million new rows every month. Each row has a column which states the date when it was created.
I'd like to run a query that gets the disk space growth over the last 6 months. In other words, the result would be a table with two columns where each row would have the month's name and disk space used during that month.
Thanks,
This article reports a method of getting the table growth: http://www.dba-oracle.com/t_table_growth_reports.htm
column "Percent of Total Disk Usage" justify right format 999.99
column "Space Used (MB)" justify right format 9,999,999.99
column "Total Object Size (MB)" justify right format 9,999,999.99
set linesize 150
set pages 80
set feedback off
select * from (select to_char(end_interval_time, 'MM/DD/YY') mydate, sum(space_used_delta) / 1024 / 1024 "Space used (MB)", avg(c.bytes) / 1024 / 1024 "Total Object Size (MB)",
round(sum(space_used_delta) / sum(c.bytes) * 100, 2) "Percent of Total Disk Usage"
from
dba_hist_snapshot sn,
dba_hist_seg_stat a,
dba_objects b,
dba_segments c
where begin_interval_time > trunc(sysdate) - &days_back
and sn.snap_id = a.snap_id
and b.object_id = a.obj#
and b.owner = c.owner
and b.object_name = c.segment_name
and c.segment_name = '&segment_name'
group by to_char(end_interval_time, 'MM/YY'))
order by to_date(mydate, 'MM/YY');
DBA_TABLES (or the equivalent) gives an AVG_ROW_LEN, so you could simply multiply that by the number of rows created per month.
The caveats to that are, it assumes that the row length of new rows is similar to that of existing rows. If you've got a bunch of historical data that were 'small' (eg 50 bytes) but new rows are larger (150 bytes), then the estimates will be too low.
Also, how do updates figure into things ? If a row starts at 50 bytes and grows to 150 two months later, how do you account for those 100 bytes ?
Finally, tables don't grow for each row insert. Every so often the allocated space will fill up and it will go and allocate another chunk. Depending on the table settings, that next chunk may be, for example, 50% of the existing table size. So you might not physically grow for three months and then have a massive jump, then not grow for another six months.

Resources