MD5 of entire row in Oracle

Below is the query to get the MD5 of an entire row in Snowflake:
SELECT MD5(TO_VARCHAR(ARRAY_CONSTRUCT(*))) FROM t;
taken from here
What is the equivalent query in Oracle to achieve this without having to list all the column names manually?

You may use the packaged function dbms_sqlhash.gethash as described below, but remember:
- the package was removed from the documentation (I guess in 11g), so in recent releases this is an undocumented feature
- if you calculate the hash code from more than one row, you must define an order (ORDER BY on a unique key or keys); otherwise the calculated hash is not deterministic (this was probably the reason for the removal)
- columns with data types other than VARCHAR2 are converted to strings before the hash calculation, so the result depends on the NLS settings; you must stabilize them to get reproducible results, e.g. with alter session set nls_date_format='dd-mm-yyyy hh24:mi:ss';
- the columns must be concatenated with some special delimiter (one that does not occur in the data) to avoid collisions: 'A' || null is the same as null || 'A' (see the sketch after this list); these are unknown internals, so it is rather hard to compare the resulting MD5 hash with a hash calculated on other (non-Oracle) data
- you need an extra grant to execute the package
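A quick illustration of the delimiter caveat: NULL simply disappears in Oracle string concatenation, so without a delimiter the rows ('A', NULL) and (NULL, 'A') feed the same string to the hash:
-- both expressions evaluate to the string 'A'
select case when 'A' || null = null || 'A'
            then 'same string, same hash'
       end as collision_demo
from dual;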
Some additional info
Example
select * from tab where x=1;
         X Y Z
---------- - -------------------
         1 a 13.10.2021 00:00:00
select
  dbms_sqlhash.gethash(
    sqltext     => 'select * from tab where x=1',
    digest_type => 2 /* dbms_crypto.hash_md5 */
  ) MD5
from dual;
MD5
--------------------------------
215A9C4642A3691F951DD8060877D191
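Putting the caveats together, a multi-row hash might look like the following sketch (assuming x is a unique key of tab):
alter session set nls_date_format = 'dd-mm-yyyy hh24:mi:ss';

select
  dbms_sqlhash.gethash(
    sqltext     => 'select * from tab order by x',  -- ORDER BY on a unique key makes the result deterministic
    digest_type => 2 /* dbms_crypto.hash_md5 */
  ) MD5
from dual;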
Order Independent Hash Code of a Table
Contrary to a file (where the order matters), in a database table the order of rows is not relevant. It would therefore be meaningful to have the possibility to calculate an order-independent hash code of a table.
Unfortunately this feature is currently not available in Oracle, but it was implemented as a prototype as described here
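One way to sketch the idea: hash each row individually and combine the row hashes with an order-insensitive aggregate such as SUM. Here ora_hash is used purely for illustration (it is far weaker than MD5), and '|' is assumed not to occur in the data:
-- order-independent: SUM does not care in which order the rows arrive
select sum(ora_hash(x || '|' || y || '|' || to_char(z, 'dd-mm-yyyy hh24:mi:ss'))) as table_hash
from tab;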

Related

Insertion of characters into number column

I have a table with several number columns that are inserted through an ASP.NET application using bind variables.
Since an upgrade of the Oracle client to 19c and a server change, instead of raising an error on the insert of invalid data, the code inserts garbage and the application crashes afterwards.
Any help is appreciated in finding the root cause.
SELECT trial1,
       DUMP(trial1, 17),
       DUMP(trial1, 1016),
       trial3,
       DUMP(trial3, 17),
       DUMP(trial3, 1016)
Result in SQL Navigator:
(screenshot of query results)
Oracle 12c
Oracle client 19
My DBA found this on Oracle Support, and it led us to find the error on the application side:
NaN is a specific IEEE754 value. However, Oracle NUMBER is not IEEE754 compliant. Therefore, if you force data representing NaN into a NUMBER column, the results are unpredictable.
SOLUTION: If you can put a value in a C float, double, int etc., you can load it into the database, as no checks are undertaken - just as with the Oracle NUMBER datatype, it's up to the application to ensure the data is valid. If you use the proper IEEE754 compliant type, e.g. BINARY_FLOAT, then NaN is recognised and handled correctly.
You have bad data: you have tried to store a double-precision NaN value in a NUMBER column rather than a BINARY_DOUBLE column.
We can duplicate the bad data with this function (never use this in a production environment):
CREATE FUNCTION createNumber(
  hex VARCHAR2
) RETURN NUMBER DETERMINISTIC
IS
  n NUMBER;
BEGIN
  DBMS_STATS.CONVERT_RAW_VALUE( HEXTORAW( hex ), n );
  RETURN n;
END;
/
Then, we can duplicate your bad values using the hexadecimal values from your DUMP output:
CREATE TABLE table_name (trial1 NUMBER, trial3 NUMBER);

INSERT INTO table_name (trial1, trial3) VALUES (
  createNumber('FF65'),
  createNumber('FFF8000000000000')
);
Then:
SELECT trial1,
       DUMP(trial1, 16) AS t1_hexdump,
       trial3,
       DUMP(trial3, 16) AS t3_hexdump
FROM   table_name;
Replicates your output:
TRIAL1  T1_HEXDUMP          TRIAL3  T3_HEXDUMP
------  ------------------  ------  ------------------------------
~       Typ=2 Len=2: ff,65  null    Typ=2 Len=8: ff,f8,0,0,0,0,0,0
Any help is appreciated in finding the root cause.
You need to go back through your application, work out where the bad data came from, determine what the original data was, and debug the steps it went through, to work out whether it was:
- always bad data, in which case you need to add validation to your application so the bad data does not get propagated; or
- good data that a bug in your code changed, in which case you need to fix the bug.
As for the existing bad data, you either need to correct it (if you know what it should be) or delete it.
We cannot help with any of that as we do not have visibility of your application nor do we know what the correct data should have been.
If you want to store that data as a floating point then you need to change from using a NUMBER to using a BINARY_DOUBLE data type:
CREATE TABLE table_name (value BINARY_DOUBLE);
INSERT INTO table_name(value) VALUES (BINARY_DOUBLE_INFINITY);
INSERT INTO table_name(value) VALUES (BINARY_DOUBLE_NAN);
Then:
SELECT value,
       DUMP(value, 16)
FROM   table_name;
Outputs:
VALUE  DUMP(VALUE,16)
-----  --------------------------------
Inf    Typ=101 Len=8: ff,f0,0,0,0,0,0,0
Nan    Typ=101 Len=8: ff,f8,0,0,0,0,0,0
The BINARY_DOUBLE_NAN dump exactly matches the binary value in your column: you have tried to insert a Not-A-Number value into a NUMBER column (which does not support it) in the format expected for a BINARY_DOUBLE column (which would support it).
The issue was a division by zero on the application side that was inserted as infinity into the database; Oracle behaves unpredictably with these values.
Please see original post above for all the details.

How to use Scalar Subquery on a single table for lob column?

I have the below query in Oracle, which returns duplicate rows, where file_data is a BLOB column.
SELECT attachsysfilename, file_seq, version, file_size, lastupddttm, lastupdoprid, file_data
from PS_LP_EX_FILEATTCH
I want to apply a DISTINCT clause on top of it to get unique records, but I am unable to do so because of the BLOB column.
Can someone please help in this regard?
How can I use a scalar subquery on the file_data column to get the DISTINCT records from the table?
Assuming you have a primary key for the PS_LP_EX_FILEATTCH table's rows, you could try using a subquery that aggregates the related primary key:
select t.*, ps.file_data
from (
  select min(pk) my_id,
         attachsysfilename,
         file_seq,
         version,
         file_size,
         lastupddttm,
         lastupdoprid
  from PS_LP_EX_FILEATTCH
  group by attachsysfilename,
           file_seq,
           version,
           file_size,
           lastupddttm,
           lastupdoprid
) t
inner join PS_LP_EX_FILEATTCH ps on t.my_id = ps.pk
You could use a hash of the BLOB values and group by that hash along with all the other (non-BLOB) columns, select one pk (or rowid, see discussion below) from each group, for example min(pk) or min(rowid), and then select the corresponding rows from the table.
For hashing you could use ora_hash, but that is only for school work. If this is a serious project, you probably need to use dbms_crypto.hash.
Whether this is a correct solution depends on the possibility of collisions when hashing the BLOB values. In Oracle 11.1 - 11.2 you can use SHA-1 hashes (160 bits); perhaps this is enough to distinguish between your BLOB values. In higher Oracle versions, longer hashes (up to 512 bits in my version, 12.2) are available. Obviously, the longer the hashes, the slower the query - but also the higher the likelihood that you won't incorrectly identify different BLOB values as "duplicates" due to collisions.
Other responders asked about or mentioned a primary key (pk) column or columns in your table. If you have one, you can use it instead of the rowid in my query below - but rowid should work OK for this. (Still, pk is preferred if your table has one.)
dbms_crypto.hash takes an integer argument (1, 2, 3, etc.) for the hashing algorithm to be used. These are defined as named constants in the package. Alas, in SQL you can't reference package constants; you need to find the values beforehand. (Or, in Oracle 12.1 or higher, you can do it on the fly, by including a function in a with clause - but let's keep it simple.)
So, to cover Oracle 11.1 and higher, I'll assume we want to use the SHA-1 algorithm. To find its integer value from the package, I can do this:
begin
dbms_output.put_line(dbms_crypto.hash_sh1);
end;
/
3
PL/SQL procedure successfully completed.
If your Oracle version is higher, you can check for the value of hash_sh256, for example; on my system, it's 4. Remember this number, since we will use it below.
The query is:
select {whatever columns you need, including the BLOB}
from {your table}
where rowid in (
select min(rowid)
from {your table}
group by {the non-BLOB columns},
dbms_crypto.hash({BLOB column}, 3)
)
;
Notice the number 3 used in the hash function - that's the value of dbms_crypto.hash_sh1, which we found earlier.
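Applied to the table from the question, the template becomes something like this sketch (SHA-1, algorithm number 3, as found above):
select attachsysfilename, file_seq, version, file_size,
       lastupddttm, lastupdoprid, file_data
from   ps_lp_ex_fileattch
where  rowid in (
         select min(rowid)
         from   ps_lp_ex_fileattch
         group by attachsysfilename, file_seq, version, file_size,
                  lastupddttm, lastupdoprid,
                  dbms_crypto.hash(file_data, 3)
       );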
I used the below query to get the distinct rows including the BLOB column:
select * from (
  select
    attachsysfilename,
    file_seq,
    version,
    lastupddttm,
    lastupdoprid,
    file_data,
    ROW_NUMBER() OVER (
      PARTITION BY attachsysfilename, file_seq, version, lastupddttm, lastupdoprid
      ORDER BY attachsysfilename, file_seq, version, lastupddttm DESC, lastupdoprid
    ) RNK
  from ps_lp_ex_fileattch a
) WHERE RNK=1

Process SQL result set entirely

I need to work with a SQL result set in order to do some processing for each column (medians, standard deviations, several control statements included).
The SQL is dynamic, so I don't know the number of columns or rows.
First I tried to use temporary tables, views, etc. to store the results; however, I did not manage to overcome Oracle's 30-character limit on column names when using the SQL below:
create table (or view or global temporary table) as select * from (
  SELECT
    DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE,
    SUM(DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI_CHZ + DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI) <-- exceeds the 30 character limit
  FROM DMTTBF_MAT_MATURATO_BILL_POS
  WHERE DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE >= '201301'
  GROUP BY DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE
)
My second choice was to use PL/SQL types to store the entire table contents, so I could address it like in other programming languages (e.g. a matrix result[i][j]), but I could not find anything similar.
A third variant, using files for reading and writing, I have not tried yet; I am still hoping for a more elegant PL/SQL solution.
It's possible that I have the wrong approach here, so any advice is more than welcome.
UPDATE: Modifying the input SQL is not an option. The program has to accept any SELECT statement.
Note that you can alias both tables and fields. Using a table alias keeps references to it from producing walls of text in the query. Using one for a field gives it a new name in the output.
SELECT A.LONG_FIELD_NAME_HERE AS SHORTNAME
FROM REALLY_LONG_TABLE_NAME_HERE A
Auto-naming adds _1, _2, etc. to differentiate the same column name coming from different table references; this often pushes a field that was already borderline over the limit. Giving the fields names yourself bypasses this.
You can also put the alias in dynamic SQL; note that the generated alias must be a double-quoted identifier, because it contains characters that are illegal in ordinary names:
sqlstr := 'create table (or view or global temporary table) as select * from (
  SELECT
    DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE,
    SUM(DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI_CHZ + DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI) AS "'
  || SUBSTR('SUM(DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI_CHZ + DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI)', 1, 30)
  || '" FROM DMTTBF_MAT_MATURATO_BILL_POS
  WHERE DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE >= ''201301''
  GROUP BY DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE
)';
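For the broader goal of processing an arbitrary result set (the matrix-like access the question mentions), one option is the DBMS_SQL package, which can describe the columns of a dynamic query at run time. A minimal sketch, assuming every column can be fetched implicitly as text:
declare
  c       integer := dbms_sql.open_cursor;
  col_cnt integer;
  cols    dbms_sql.desc_tab;
  val     varchar2(4000);
  status  integer;
begin
  dbms_sql.parse(c, 'select * from user_tables', dbms_sql.native);  -- any query
  dbms_sql.describe_columns(c, col_cnt, cols);
  for i in 1 .. col_cnt loop
    dbms_sql.define_column(c, i, val, 4000);  -- fetch every column as text
  end loop;
  status := dbms_sql.execute(c);
  while dbms_sql.fetch_rows(c) > 0 loop
    for i in 1 .. col_cnt loop
      dbms_sql.column_value(c, i, val);
      -- process cell (cols(i).col_name, val): medians, standard deviations, etc.
      null;
    end loop;
  end loop;
  dbms_sql.close_cursor(c);
end;
/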

Datatype difference in procedure parameter and sql query inside it

In my back-end procedure I have a VARCHAR2 parameter, and I am using it in a SQL query to search against a NUMBER column. Will this cause any kind of performance issue?
For example:
Proc (a varchar2)
is
begin
  select * from table where deptno = a;
end;
Here deptno is a NUMBER column in the table and a is a VARCHAR2.
It might do. The database will resolve the differences in datatype by casting DEPTNO to a VARCHAR2. This will prevent the optimizer from using any (normal) index you have on that column. Depending on the data volumes and distribution, an indexed read may not always be the most efficient access path, in which case the data conversion doesn't matter.
So it does depend. But what are your options if it does matter (you have a highly selective index on that column)?
One solution would be to apply an explicit data conversion in your query:
select * from table
where deptno = to_number(a);
This will cause the query to fail if A contains a value which won't convert to a number.
A better solution would be to change the datatype of A so that the calling program can only pass a numeric value. This throws the responsibility for duff data where it properly belongs.
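For illustration, a minimal sketch of that preferred fix against a hypothetical emp table; with a NUMBER parameter, callers can no longer pass non-numeric data:
create or replace procedure get_dept_rows (p_deptno in number)  -- hypothetical name
is
begin
  for r in (select * from emp where deptno = p_deptno) loop
    null;  -- process each row here
  end loop;
end;
/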
The least attractive solution is to keep the procedure's signature and the query as is, and build a function-based index on the column:
create index emp_deptchar_fbi on emp(to_char(deptno));
Read the documentation to find out more about function-based indexes.

Why does the Execution Plan include a user-defined function call for a computed column that is persisted?

I have a table with two computed columns, both of which have "Is Persisted" set to true. However, when I use them in a query, the execution plan shows the UDF used to compute the columns as part of the plan. Since the column data is calculated by the UDF when the row is added or updated, why would the plan include it?
The query is incredibly slow (>30s) when these columns are included, and lightning fast (<1s) when they are excluded. This leads me to conclude that the query is actually calculating the column values at run time, which shouldn't be the case since they are persisted.
Am I missing something here?
UPDATE: Here's a little more info regarding our reasoning for using the computed column.
We are a sports company and have a customer who stores full player names in a single column. They require us to allow them to search player data by first and/or last name separately. Thankfully they use a consistent format for player names - LastName, FirstName (NickName) - so parsing them is relatively easy. I created a UDF that calls into a CLR function to parse the name parts using a regular expression. So obviously calling the UDF, which in turn calls a CLR function, is very expensive. But since it is only used on a persisted column I figured it would only be used during the few times a day that we import data into the database.
The reason is that the query optimizer does not do a very good job at costing user-defined functions. It decides, in some cases, that it would be cheaper to completely re-evaluate the function for each row, rather than incur the disk reads that might be necessary otherwise.
SQL Server's costing model does not inspect the structure of the function to see how expensive it really is, so the optimizer does not have accurate information in this regard. Your function could be arbitrarily complex, so it is perhaps understandable that costing is limited this way. The effect is worst for scalar and multi-statement table-valued functions, since these are extremely expensive to call per-row.
You can tell whether the optimizer has decided to re-evaluate the function (rather than using the persisted value) by inspecting the query plan. If there is a Compute Scalar iterator with an explicit reference to the function name in its Defined Values list, the function will be called once per row. If the Defined Values list references the column name instead, the function will not be called.
My advice is generally not to use functions in computed column definitions at all.
The reproduction script below demonstrates the issue. Notice that the PRIMARY KEY defined for the table is nonclustered, so fetching the persisted value would require a bookmark lookup from the index, or a table scan. The optimizer decides it is cheaper to read the source column for the function from the index and re-compute the function per row, rather than incur the cost of a bookmark lookup or table scan.
Indexing the persisted column speeds the query up in this case. In general, the optimizer tends to favour an access path that avoids re-computing the function, but the decision is cost-based so it is still possible to see a function re-computed for each row even when indexed. Nevertheless, providing an 'obvious' and efficient access path to the optimizer does help to avoid this.
Notice that the column does not have to be persisted in order to be indexed. This is a very common misconception; persisting the column is only required where it is imprecise (it uses floating-point arithmetic or values). Persisting the column in the present case adds no value and expands the base table's storage requirement.
Paul White
-- An expensive scalar function
CREATE FUNCTION dbo.fn_Expensive(@n INTEGER)
RETURNS BIGINT
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @sum_n BIGINT;
    SET @sum_n = 0;
    WHILE @n > 0
    BEGIN
        SET @sum_n = @sum_n + @n;
        SET @n = @n - 1;
    END;
    RETURN @sum_n;
END;
GO
-- A table that references the expensive
-- function in a PERSISTED computed column
CREATE TABLE dbo.Demo
(
    n INTEGER PRIMARY KEY NONCLUSTERED,
    sum_n AS dbo.fn_Expensive(n) PERSISTED
);
GO
-- Add 8000 rows to the table
-- with n from 1 to 8000 inclusive
WITH Numbers AS
(
    SELECT TOP (8000)
        n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
    FROM master.sys.columns AS C1
    CROSS JOIN master.sys.columns AS C2
    CROSS JOIN master.sys.columns AS C3
)
INSERT dbo.Demo (n)
SELECT
    N.n
FROM Numbers AS N
WHERE
    N.n >= 1
    AND N.n <= 8000;
-- This is slow
-- Plan includes a Compute Scalar with:
-- [dbo].[Demo].sum_n = Scalar Operator([dbo].[fn_Expensive]([dbo].[Demo].[n]))
-- QO estimates calling the function is cheaper than the bookmark lookup
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the computed column
-- Notice the actual plan also calls the function for every row, and includes:
-- [dbo].[Demo].sum_n = Scalar Operator([dbo].[fn_Expensive]([dbo].[Demo].[n]))
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Query now uses the index, and is fast
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Drop the index
DROP INDEX uq1 ON dbo.Demo;
GO
-- Don't persist the column
ALTER TABLE dbo.Demo
ALTER COLUMN sum_n DROP PERSISTED;
GO
-- Slow again, as you would expect
-- QO has no option but to call the function for each row
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the non-persisted column
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Fast again
-- Persisting the column bought us nothing
-- and used extra space in the table
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Clean up
DROP TABLE dbo.Demo;
DROP FUNCTION dbo.fn_Expensive;
GO
