Is there a way to recover deleted data in Azure Databricks? - azure-databricks

Without realizing that Shift+Enter runs a cell, I was writing a DELETE FROM statement on a table and pressed Shift+Enter, which deleted all of the data in the table.

In a Delta Lake table, a DELETE is just another transaction; the data is only marked for deletion, not immediately removed. Using the Time Travel feature, you can view the table's transaction history, select from the version prior to the DELETE, and insert the result back into the same table to restore your data.
To restore the data:
DESCRIBE HISTORY <table>
Note down the version number prior to the delete.
INSERT INTO <table> SELECT * FROM <table> VERSION AS OF <version from history>
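For example, a minimal sketch assuming a table named events, where the history shows the DELETE as version 12 (so version 11 is the last good version):
DESCRIBE HISTORY events;
-- version 12 is the DELETE, so restore from version 11
INSERT INTO events
SELECT * FROM events VERSION AS OF 11;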
Let me add that, as of DBR 7.4, the RESTORE command is also available:
RESTORE [TABLE] table_identifier [TO] <time_travel_version>
Per Azure Databricks Delta Lake Docs
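A sketch of the same recovery with RESTORE, again assuming the events table and that version 11 is the last good version:
RESTORE TABLE events TO VERSION AS OF 11;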

It comes down to where the data was stored. If it was in the default DBFS location then it’s gone, I’m afraid; that uses a blob storage account with no backup features.
If you mounted your own Blob/Data Lake storage and enabled soft delete or snapshots, you can get it back by going to that resource in the Azure portal.
If it’s a relational database source then you may have backups.
But chances are it is gone, I’m afraid.

Related

Deleting data in table without Rollback in oracle

I have deleted data in a table and did not roll back. Will the data in that table be deleted permanently in Oracle SQL Developer?
Will those rows be deleted permanently?
It depends.
If you didn't roll back, but you didn't commit either, then the rows are deleted only within your own session; any other user will still see all those rows.
Check SQL Developer's preferences (Database > Advanced) and see whether the Autocommit checkbox is turned on. If so, then although you didn't explicitly commit, the tool did it for you. In my opinion, autocommit is generally a bad idea.
Even if you did commit, if the flashback option is enabled in your database, then you can restore those rows.
Certainly, you should perform regular backups and be able to restore data (that won't save you from losing rows that were inserted/updated between two backups, though).
Basically, there are various options that could and should help you restore deleted rows.
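As a rough illustration of that flashback option, a Flashback Query can re-insert committed deletes while the undo data is still available; a minimal sketch (the emp table name and the 15-minute window are assumptions):
-- re-insert rows that existed 15 minutes ago but are no longer present
INSERT INTO emp
SELECT * FROM emp AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '15' MINUTE)
MINUS
SELECT * FROM emp;
COMMIT;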
You can recover deleted table records in SQL Server with the following steps (a small illustrative query for steps 4 and 5 follows the list):
Step 1: Create a Database. ...
Step 2: Insert Data into Table. ...
Step 3: Delete Rows from Table. ...
Step 4: Get Information about Deleted Rows. ...
Step 5: Get Log Sequence Number of the LOP_BEGIN_XACT Log Record. ...
Step 6: Recover Deleted Records in SQL Server.
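A sketch of steps 4 and 5 using the undocumented fn_dblog function (the dbo.Employee table and the transaction ID value are illustrative; the output columns vary by SQL Server version):
-- Step 4: find the delete operations and the transaction they belong to
SELECT [Current LSN], [Transaction ID], Operation, AllocUnitName
FROM fn_dblog(NULL, NULL)
WHERE Operation = 'LOP_DELETE_ROWS'
  AND AllocUnitName = 'dbo.Employee';
-- Step 5: get the LSN of the LOP_BEGIN_XACT record for that transaction
SELECT [Current LSN], [Begin Time]
FROM fn_dblog(NULL, NULL)
WHERE [Transaction ID] = '0000:000004c5'  -- value taken from the previous query
  AND Operation = 'LOP_BEGIN_XACT';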

After TRUNCATE TABLE, how can I recover the data?

I need to recover data from a truncated Oracle table.
Is there any folder in Linux where the truncated data is stored?
Is there any table which stores the information of the table after truncating?
I am not a DBA.
If you have not backed up the table (for example, by using RMAN, EXPDP or EXP) or created a RESTORE POINT, then your data is lost.
From the Oracle documentation:
Caution:
You cannot roll back a TRUNCATE TABLE statement, nor can you use a FLASHBACK TABLE statement to retrieve the contents of a table that has been truncated.
You can check if you have an RMAN backup by logging into RMAN (rather than into the database) and using the LIST command.
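For example, from an RMAN session (not SQL*Plus):
RMAN> LIST BACKUP SUMMARY;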
You can check if you have a restore point (from a database user with the appropriate permissions) using:
SELECT name,
       guarantee_flashback_database,
       pdb_restore_point,
       clean_pdb_restore_point,
       pdb_incarnation#,
       storage_size
FROM   v$restore_point;
You are looking for a restore point where guarantee_flashback_database is YES.
(Assuming that the RESTORE POINT was created after the table was created and populated.)
Note:
If you restore from a backup or to a restore point then all changes made since that backup or creating the restore point will be lost.
To answer your additional questions:
Is there any folder in Linux where the truncated data is stored?
No
Is there any table which store the information of table after truncating?
No
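For the future, a guaranteed restore point created before a risky operation does let you rewind past a TRUNCATE; a minimal sketch (the restore point name is an assumption, and FLASHBACK DATABASE must be run with the database in MOUNT mode and the appropriate privileges):
-- before the risky operation
CREATE RESTORE POINT before_cleanup GUARANTEE FLASHBACK DATABASE;
-- if something goes wrong
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
FLASHBACK DATABASE TO RESTORE POINT before_cleanup;
ALTER DATABASE OPEN RESETLOGS;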

Delete from temporary tables takes 100% CPU for a long time

I have a pretty complex query where we make use of a temporary table (this is in Oracle running on AWS RDS service).
INSERT INTO TMPTABLE (inserts about 25,000 rows in no time)
SELECT FROM X JOIN TMPTABLE (joins with temp table also in no time)
DELETE FROM TMPTABLE (takes no time in a copy of the production database, up to 10 minutes in the production database)
If I change the delete to a truncate it is as fast as in development.
So this change I will of course deploy. But I would like to understand why this occurs. AWS team has been quite helpful but they are a bit biased on AWS and like to tell me that my 3000 USD a month database server is not fast enough (I don't think so). I am not that fluent in Oracle administration but I have understood that if the redo logs are constantly filled, this can cause issues. I have increased the size quite substantially, but then again, this doesn't really add up.
This is a fairly standard issue when deleting large amounts of data. The delete operation has to modify each and every row individually. Each row gets deleted, added to a transaction log, and is given an LSN.
truncate, on the other hand, skips all that and simply deallocates the data in the table.
You'll find this behavior is consistent across various RDBMS solutions: Oracle, MSSQL, PostgreSQL, and MySQL all have the same issue.
I suggest you use an Oracle global temporary table. They are fast, and the rows don't need to be explicitly deleted: they are cleared automatically at commit or at the end of the session, depending on the ON COMMIT clause.
For example:
CREATE GLOBAL TEMPORARY TABLE TMP_T
(
  ID NUMBER(32)
)
ON COMMIT DELETE ROWS;
See https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#ADMIN11633
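A small usage sketch of the table above, showing that the rows disappear at commit because of ON COMMIT DELETE ROWS (values are illustrative):
INSERT INTO TMP_T (ID) VALUES (1);
SELECT COUNT(*) FROM TMP_T;  -- 1 inside the current transaction
COMMIT;
SELECT COUNT(*) FROM TMP_T;  -- 0: the rows were cleared at commit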

Removing bloat from greenplum table

I have created some tables in Greenplum and perform insert, update and delete operations on them. I am also running VACUUM regularly, but I still found bloat in the tables. I found a solution to remove the bloat: https://discuss.pivotal.io/hc/en-us/articles/206578327-What-are-the-different-option-to-remove-bloat-from-a-table
However, if I truncate the table and reinsert the data, it removes the bloat. Is it good practice to truncate the data from the table?
If you are performing UPDATE and DELETE statements on a heap table (default storage) and running VACUUM regularly, you will get some bloat by design. Heap storage, which is similar to the default PostgreSQL storage mechanism, provides read consistency using Multi-Version Concurrency Control (MVCC).
When you UPDATE or DELETE a record, the old value is still in the table and is able to be read by transactions that are still inflight and started before you issued the UPDATE or DELETE command. This provides the read consistency to the table.
When you execute a VACUUM statement, the database will mark the stale rows as available to be overwritten. It doesn't shrink the files. It just marks rows so they can be overwritten. The next time you execute an INSERT or UPDATE, the stale rows are now able to be used for the new data.
So if you UPDATE or DELETE 10% of a table between running VACUUM, you will probably have about 10% bloat.
Greenplum also has Append-Optimized (AO) storage which doesn't use MVCC and uses a visibility map instead. The files are a bit smaller too, so you should get better performance. The stale rows are hidden with the visibility map and VACUUM won't do anything until you hit the gp_appendonly_compaction_threshold percentage. The default is 10%. When you have 10% bloat in an AO table and execute VACUUM, the table will automatically get rebuilt for you.
Append-Optimized is called "appendonly" for backwards compatibility reasons but it does allow UPDATE and DELETE. Here is an example of an AO table:
CREATE TABLE sales
(txn_id int, qty int, date date)
WITH (appendonly=true)
DISTRIBUTED BY (txn_id);
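As a sketch of the VACUUM behavior described above (sales is the table from the example; 10 is the default threshold):
SHOW gp_appendonly_compaction_threshold;  -- 10 by default
-- once more than 10% of the rows are stale, a plain VACUUM rebuilds the AO table
VACUUM sales;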
Instead of TRUNCATE, it is better to drop the table, recreate it, and then insert the data.

Can you using joins with direct path inserts?

I have tried to find examples, but they are all simple, with a single WHERE clause. Here is the situation: I have a bunch of legacy data transferred from another database, and I also have the "good" tables in that same database. I need to transfer (data-conversion) data from the legacy tables to the new tables. Because this is a different set of tables, the data conversion requires complex joins to put the old data into the new tables correctly.
So, old tables old data.
New tables must have the old data but it requires lots of joins to get that old data into the new tables correctly.
Can I use direct path with lots of joins like this: INSERT ... SELECT (lots of joins)?
Does direct path apply to tables that are already in the same database (transfer between tables), or is it only for loading tables from, say, a text file?
Thank you.
The query in your SELECT can be as complex as you'd like with a direct-path insert. The direct-path refers only to the destination table. It has nothing to do with the way that data is read or processed.
If you're doing a direct-path insert, you're asking Oracle to insert the new data above the high water mark of the table so you bypass the normal code that reuses space in existing blocks for new rows to be inserted. It also has to block other inserts since you can't have the high water mark of the table change during a direct-path insert. This probably isn't a big deal if you've got a downtime window in which to do the load but it would be quite problematic if you wanted the existing tables to be available for other applications during the load.
No, on the contrary, it means you need to take a backup after a NOLOGGING load, not that you can't back up the database.
Allow me to elaborate a bit. Normally, when you do DML in Oracle, the before images of the changes you are making get logged in UNDO, and all the changes (including the UNDO changes) are first written to REDO. This is how Oracle manages transactions, instance recovery, and database recovery. If a transaction is aborted or rolled back, Oracle uses the information in UNDO to undo the changes your transaction made. If the instance crashes, then on instance restart, Oracle will use the information in REDO and UNDO to recover up to the last committed transaction. First, Oracle will read the REDO and roll forward, then, use UNDO to roll back all the transactions that were not committed at the time of the crash. In this way, Oracle is able to recover up to the last committed transaction.
Now, when you specify an APPEND hint on an insert statement, Oracle will execute the INSERT with direct load. This means that data is loaded into brand new, never before used blocks, from above the highwater mark. Because the blocks being loaded are brand new, there is no "before image", so Oracle can avoid writing UNDO, which improves performance. If the database is in NOARCHIVELOG mode, then Oracle will also not write REDO. On a database in ARCHIVELOG mode, Oracle will still write REDO, unless, before you do the insert /*+ append */, you set the table to NOLOGGING (i.e. alter table tab_name nologging;). In that case, REDO logging is disabled for the table.
However, this is where you could run into backup/recovery implications. If you do a NOLOGGING direct load, then suffer a media failure, and the datafile containing the segment with the NOLOGGING operation is restored from a backup taken before the NOLOGGING load, then the redo log will not contain the changes required to recover that segment. So, what happens? Well, when you do a NOLOGGING load, Oracle writes extent invalidation records to the redo log, instead of the actual changes. Then, if you use that redo in recovery, those data blocks will be marked logically corrupt. Any subsequent queries against that segment will get an ORA-26040 error.
So, how to avoid this? Well, you should always take a backup imediately following any NOLOGGING direct load. If you restore/recover from a backup taken after the nologging load, there is no problem, because the data will be in the datablocks in the file that was restored.
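A rough sketch of that workflow (target_table is the destination from the question; the legacy table and column names are made up for illustration):
-- disable redo logging for the target, do the direct-path load, then re-enable it
ALTER TABLE target_table NOLOGGING;
INSERT /*+ APPEND */ INTO target_table (id, customer_name)
SELECT o.id, c.name
FROM legacy_orders o
JOIN legacy_customers c ON c.id = o.customer_id;
COMMIT;
ALTER TABLE target_table LOGGING;
-- take an RMAN backup of the affected datafiles immediately after the load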
Hope that's clear,
-Mark
Yes, there should not be any arbitrary limits on query complexity.
If you do
insert /*+ APPEND */ into target_table select .... from source1, source2..., sourceN where
It should work fine. Consider, though, that the performance of the load will be limited by the performance of that query, so be sure it's well tuned if you're expecting good performance.
Finally, consider whether setting NOLOGGING on the target table would improve performance significantly. But, also consider the backup recovery implications, if you decide to implement NOLOGGING.
Hope that helps,
-Mark
