Implement zstd compression with Oracle Advanced Compression - oracle

I am new to Oracle Database and am using version 19c. I need to know whether the zstd algorithm can be used together with Oracle Advanced Compression. I am able to enable the zstd algorithm at the RMAN level. Is there any way to choose the compression algorithm when using Advanced Compression in Oracle? Thanks in advance!
RMAN COMMAND OUTPUT FOR ENABLING ZSTD:
RMAN> CONFIGURE COMPRESSION ALGORITHM 'ZSTD'
2> ;
new RMAN configuration parameters:
CONFIGURE COMPRESSION ALGORITHM 'ZSTD' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE;
new RMAN configuration parameters are successfully stored
RMAN> show COMPRESSION ALGORITHM
2> ;
RMAN configuration parameters for database with db_unique_name DB9ZX are:
CONFIGURE COMPRESSION ALGORITHM 'ZSTD' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE;
SYNTAX I WAS EXPECTING:
SQL> alter table xtbl row store compress advanced;
Table altered.
SQL> alter table xtbl row store compress advanced zstd;
alter table xtbl row store compress advanced zstd
*
ERROR at line 1:
ORA-01735: invalid ALTER TABLE option
SQL> alter table xtbl row store compress zstd advanced;
alter table xtbl row store compress zstd advanced
*
ERROR at line 1:
ORA-01735: invalid ALTER TABLE option

Compression of RMAN backup files is quite different from compression of live tables and rows. You can't choose a specific algorithm like ZSTD for table compression.
The Oracle Database Administrator's Guide has a section on table compression which covers 4 types of compression and when you might prefer each one.
When you use basic table compression, warehouse compression, or archive compression, compression only occurs when data is bulk loaded or array inserted into a table.
Advanced row compression is intended for OLTP applications and compresses data manipulated by any SQL operation.
Warehouse and archive compression use Hybrid Columnar Compression (which is only available on certain Oracle storage platforms, such as Exadata). It dynamically chooses among different (unspecified) compression algorithms based on the data type and other factors, and it is optimized for storage savings rather than performance.
Oracle Database Concepts also has a section on table compression that goes into a little more detail about advanced row compression, which is optimized for OLTP performance. The relevant part for your question is that Oracle implemented its own simple compression algorithm (replacing duplicate values within a data block with references to a symbol table). You can't configure your own compression algorithm.
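For reference, here is a minimal sketch of the table compression clauses the 19c SQL syntax actually accepts; note that none of them lets you name an algorithm such as ZSTD, and the column-store variants assume qualifying Oracle storage (sales_hist is a made-up table name):
-- Basic table compression: only direct-path / bulk-loaded data is compressed
ALTER TABLE xtbl ROW STORE COMPRESS BASIC;
-- Advanced row compression (Advanced Compression option): compresses data
-- changed by any DML, but the algorithm itself is not configurable
ALTER TABLE xtbl ROW STORE COMPRESS ADVANCED;
-- Hybrid Columnar Compression (Exadata and other qualifying Oracle storage only)
ALTER TABLE sales_hist COLUMN STORE COMPRESS FOR QUERY HIGH;
ALTER TABLE sales_hist COLUMN STORE COMPRESS FOR ARCHIVE LOW;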

Related

Compression of a table in Impala

I want to compress a table with Parquet compression in Impala. Is there any method to compress that table, as there are thousands of files in HDFS for that particular table?
Parquet is an encoding, not a compression format. Snappy is a compression format commonly used with Parquet.
It's not clear what your original file types are, but typically simply running an INSERT OVERWRITE query will cause the files to be rewritten and "compacted" into a smaller number of files.
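A minimal sketch of that compaction in Impala, assuming a hypothetical table named web_logs and that Snappy is the codec you want (it is also Impala's default for Parquet):
-- Codec for files written by this session
SET COMPRESSION_CODEC=snappy;
-- Rewriting the table onto itself coalesces the data into fewer, compressed Parquet files
INSERT OVERWRITE TABLE web_logs SELECT * FROM web_logs;
-- Refresh statistics afterwards so the planner sees the new file layout
COMPUTE STATS web_logs;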

Oracle to PostgreSQL: database size reduces

Why does the database size reduce in PostgreSQL after migration from an Oracle schema having LOB, CLOB and BLOB datatypes?
The main reason is that Postgres by default compresses values of variable-length data types that are bigger than (approximately) 2000 bytes - these are mainly the text, varchar and bytea types.
Oracle will only compress the content of LOB columns if you are using the Enterprise Edition and enable the compression when defining the LOB column (the most important part is to use SecureFile instead of BasicFile).
Most probably your LOB columns were defined without compression in Oracle and contain many values bigger than 2000 bytes; that's why you see a reduction in size due to Postgres' automatic compression.
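For comparison, a minimal sketch of what the Oracle side would need for the LOB data to be stored compressed (table and column names are made up; SecureFiles COMPRESS also requires the Advanced Compression option). PostgreSQL, by contrast, applies its TOAST compression automatically with no extra DDL:
-- Without SECUREFILE ... COMPRESS, the CLOB content is stored uncompressed
CREATE TABLE documents (
  doc_id NUMBER PRIMARY KEY,
  body   CLOB
)
LOB (body) STORE AS SECUREFILE (
  COMPRESS MEDIUM   -- LOW / MEDIUM / HIGH; Enterprise Edition + Advanced Compression
);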

Hive parquet snappy compression not working

I am creating a table skeleton using the table property
TBLPROPERTIES('PARQUET.COMPRESSION'='SNAPPY')
(as the files are in Parquet format) and setting a few parameters before creating the table:
set hive.exec.dynamic.partition.mode=nonstrict;
set parquet.enable.dictionary=false;
set hive.plan.serialization.format=javaXML;
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
set avro.output.codec=snappy;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
add jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1168.923/lib/sentry/lib/hive-metastore.jar;
Still the table is not getting compressed. Could you please let me know why the table is not getting compressed?
Thanks in advance for your inputs.
The solution is using TBLPROPERTIES ('parquet.compression'='SNAPPY') (and the case matters) in the DDL instead of TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY').
You can also achieve the compression using the following property in Hive:
set parquet.compression=SNAPPY;
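Putting those two pieces together, a minimal sketch (table, column and source names are made up):
-- Note the lower-case property key
CREATE TABLE test_parquet_snappy (
  customer_id INT,
  name        STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Or set the codec at session level before loading the data
SET parquet.compression=SNAPPY;
INSERT OVERWRITE TABLE test_parquet_snappy
SELECT customer_id, name FROM staging_table;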
Your Parquet table is probably compressed, but you're not seeing that directly. In Parquet files, the compression is baked into the format. Instead of the whole file being compressed, individual segments are compressed using the specified algorithm. Thus a compressed Parquet file looks from the outside the same as an uncompressed one (they normally don't carry a suffix such as .gz, as you cannot decompress them with the usual tools).
Having the compression baked into the format is one of the many advantages of Parquet. It makes the files (Hadoop-)splittable independently of the compression algorithm, and it enables fast access to specific segments of a file without decompressing the whole file. For a query engine processing a query on top of Parquet files, this means it often only needs to read the small, uncompressed file metadata, see which segments are relevant for the query, and then decompress only those sections.
Set the parameters below and after that perform these steps:
SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE Test_Parquet (
  customerID int, name string, ..etc
) STORED AS PARQUET LOCATION '';
INSERT INTO Test_Parquet SELECT * FROM Test;
If not, how do I identify a Parquet table with Snappy compression and a Parquet table without Snappy compression?
describe formatted tableName;
Note: you will always see the compression as "No", because the compression format is not stored in the table's metadata. The best way is to run dfs -ls -R on the table location and look at the files themselves for compression.
Note: currently the default compression for Impala Parquet tables is Snappy.
If your issue is still not resolved after these steps, please post all the steps you are performing.
I see this mistake being made several times; here is what needs to be done (this will only work for Hive, NOT with Spark):
OLD PROPERTY :
TBLPROPERTIES('PARQUET.COMPRESSION'='SNAPPY')
CORRECT PROPERTY :
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY')
I have recently created some tables stored as Parquet files with Snappy compression and used the following commands:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;

Set ORC file name

I'm currently implementing ETL (Talend) of monitoring data to HDFS and a Hive table.
I am now facing concerns about duplicates. More specifically, if we need to run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
The solution to that in an RDBMS would have been to store the input file name and to "DELETE WHERE file name=..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like to have your advice on how to handle this. I envisage two solutions:
Currently, the ETL is putting CSV files into HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that, with this operation, I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of this created ORC file? If yes, I would be able to search the data by its file name and delete it before launching the ETL.
I'm not familiar with Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID with Hive? Will I be able to "DELETE WHERE" with it?
Feel free to propose any other solution you may have.
Best,
Orlando
If the data volume in the target table is not too large, I would advise:
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
  (SELECT 1
   FROM trg x
   WHERE x.key = src.key
     AND <<additional filter on target to reduce data volume>>
  )
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys in the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap fits in the RAM available for the mappers' heap (check your default conf files, and increase it with a set command in the Hive script if necessary), the performance will be sub-optimal compared with a plain INSERT, but you can be pretty sure that you will not get any duplicates.
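For illustration, these are the kinds of settings involved; the values are placeholders, so check your own cluster defaults before changing anything:
-- Allow Hive to convert the correlated sub-query / join into a MapJoin
SET hive.auto.convert.join=true;
-- Size threshold (bytes) under which the "small" side is loaded into the in-memory HashMap
SET hive.mapjoin.smalltable.filesize=25000000;
-- Mapper container memory and JVM heap, if the HashMap does not fit
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;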
And in your actual use case you don't have to check each key but only a "batch ID", more precisely the original file name; the way I did it in my previous job was:
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM trg x
   WHERE x.original_file_name = src.INPUT__FILE__NAME
     AND <<additional filter on target to reduce data volume>>
  )
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters -- so the overhead stays low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would do it automatically at execution time, but Hive does not (not yet), so you have to force it. Note also that the "1" is just a dummy value required by "SELECT" semantics; again, a mature DBMS would allow a dummy "null", but some versions of Hive would crash (e.g. with Tez in 0.14), so "1" or "'A'" is safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering myself; I found a solution:
I partitioned my table by (date, input_file_name) (note: I can get the input file name with SELECT INPUT__FILE__NAME in Hive).
Once I did this, before running the ETL I can send Hive an ALTER TABLE ... DROP IF EXISTS PARTITION (input_file_name=...) so that the folder containing the input data is deleted if that input file has already been loaded into the ORC table. A sketch of this approach follows.
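With made-up table and column names (note that INPUT__FILE__NAME returns the full HDFS path, so store or strip it consistently):
-- Target table partitioned by load date and originating file name
CREATE TABLE monitoring_orc (
  metric_name  STRING,
  metric_value DOUBLE
)
PARTITIONED BY (load_date STRING, input_file_name STRING)
STORED AS ORC;

-- Before (re)loading a file, drop its partition if it already exists
ALTER TABLE monitoring_orc DROP IF EXISTS PARTITION (load_date='2016-01-15', input_file_name='batch_01.csv');

-- Load with dynamic partitioning, carrying the virtual column through
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE monitoring_orc PARTITION (load_date, input_file_name)
SELECT metric_name, metric_value, load_date, INPUT__FILE__NAME
FROM monitoring_csv;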
Thank you everyone for your help.
Cheers,
Orlando

Compress Oracle table

I need to compress a table. I used alter table tablename compress to compress the table. After doing this the table size remained the same.
How should I be compressing the table?
To compress the old blocks of the table use:
alter table table_name move compress;
This will reinsert the records into new blocks, compressed, and discard the old blocks, so you'll gain space. It also invalidates the indexes, so you will need to rebuild them, as sketched below.
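A minimal sketch of that sequence, with a made-up index name (list yours from USER_INDEXES):
-- Rewrite the existing rows into compressed blocks
ALTER TABLE tablename MOVE COMPRESS;
-- The move marks the table's indexes UNUSABLE; find and rebuild them
SELECT index_name FROM user_indexes WHERE table_name = 'TABLENAME' AND status = 'UNUSABLE';
ALTER INDEX tablename_pk REBUILD;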
Compress does not affect already stored rows. Please, check the official documentation:
" You specify table compression with the COMPRESS clause of
the CREATE TABLE statement. You can enable compression for an existing
table by using this clause in an ALTER TABLE statement. In this case,
the only data that is compressed is the data inserted or updated after
compression is enabled..."
ALTER TABLE t MOVE COMPRESS is a valid answer. But if you use different non-default options, especially with big data volumes, do regression tests before using ALTER TABLE ... MOVE.
There were historically more problems (performance degradations and bugs) with it. If you have access, look in the Oracle bug database to see if there are known problems for the features and version you use.
You are on the safer side if you: create a new table, insert the data from the original (old) table, drop the old table, and rename the new table to the old table name, as sketched below.
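A sketch of that safer path (names are placeholders; indexes, constraints, triggers and grants must be re-created on the new table):
-- Build a compressed copy of the data
CREATE TABLE tablename_new COMPRESS AS SELECT * FROM tablename;
-- Swap it in
DROP TABLE tablename;
RENAME tablename_new TO tablename;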
