Unfortunately, I truncated a table in Hive and the trash got cleaned up. Is there any way to get the data back? Thanks.
There is a way to recover deleted file(s), but it's not recommended, and if not done properly it can affect the cluster as well. Use this procedure with caution on a production system.
Here's a step-by-step description of how to recover accidentally deleted files:
https://community.hortonworks.com/articles/26181/how-to-recover-accidentally-deleted-file-in-hdfs.html
I am trying to build a Hive table and automate it through Oozie. The data in the table should be no more than 30 days old.
An action in the workflow would run every day. It would first purge data older than 30 days, then insert today's data: a sliding window with a 30-day interval.
Can someone show an example of how to achieve this?
Hive stores data in HDFS files, and these files are immutable.
Well, actually, with recent Hadoop releases HDFS files can be appended to, or even truncated, but only with a low-level API; Hive has no generic way to modify data files for text/AVRO/Parquet/ORC/whatever format, so for practical purposes HDFS files are immutable for Hive.
One workaround is to use transactional ORC tables that create/rewrite an entire data file on each "transaction" -- requiring a background process for periodic compaction of the resulting mess (e.g. another step of rewriting small files into bigger files).
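As a rough illustration of that first workaround, here is a minimal sketch of a transactional ORC table (the table and column names are made up, and it assumes the Hive transaction manager and compactor are already configured):
-- hypothetical transactional ORC table (Hive 0.14+); names are placeholders
create table events (id bigint, payload string, event_date date)
clustered by (id) into 8 buckets
stored as orc
tblproperties ('transactional'='true');
Deletes against such a table then rely on the background compactor to merge the resulting delta files into bigger files.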
Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.
> simple design
make sure that you will have no INSERT or SELECT running until the purge is over
create a new partitioned table with the same structure plus a dummy partitioning column
copy all the data-to-keep to that dummy partition
run an EXCHANGE PARTITION command
drop the partitioned table
now the older data is gone, and you can resume INSERTs
> alternative design, allows INSERTs while purge is running
rebuild your table with a dummy partitioning key, and make sure that all INSERTs always go into the "current" partition
at purge time, rename the "current" partition to "to_be_purged" (and make sure that you will run no SELECT until the purge is over, otherwise you may get duplicates)
copy all the data-to-keep from "to_be_purged" to "current"
drop partition "to_be_purged"
now the older data is gone (see the sketch below)
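A rough HiveQL sketch of this alternative design (assuming a hypothetical events table with columns id, payload and event_date, partitioned by a dummy string column part_key; the 30-day filter is just an example, and current_date requires Hive 1.2+):
-- rename the "current" partition so new INSERTs land in a fresh "current"
alter table events partition (part_key='current') rename to partition (part_key='to_be_purged');
-- copy the rows worth keeping back into "current"
insert into table events partition (part_key='current')
select id, payload, event_date from events
where part_key='to_be_purged' and event_date >= date_sub(current_date, 30);
-- drop the old data
alter table events drop partition (part_key='to_be_purged');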
But it would be so much simpler if your table was partitioned by month, in ISO format (i.e. YYYY-MM). In that case you could just get the list of partitions and drop all that have a key "older" than (current month - 1), with a plain bash script. Believe me, it's simple and rock-solid.
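With monthly partitions the purge itself can be a one-liner; a hedged sketch (assuming a string partition column month_key in YYYY-MM format, with the cutoff value computed by the calling bash/Oozie job, and a Hive version that accepts comparison operators in the DROP PARTITION spec):
-- drop every monthly partition older than the computed cutoff
alter table events drop if exists partition (month_key < '2017-04');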
As already answered in How to delete and update a record in Hive (by user ashtonium), Hive version 0.14 is available with ACID support. So, create an .hql script and use a simple DELETE + WHERE condition (for the current date use unix_timestamp()) followed by an INSERT. The INSERT should be used in a bulk fashion, not in an OLTP one.
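A minimal sketch of such an .hql script (the table and column names are made up; it assumes an ACID table with an epoch-seconds column load_ts and a staging table holding today's rows):
-- purge rows older than 30 days, then append today's data in one bulk INSERT
delete from events_acid where load_ts < unix_timestamp() - 30*24*60*60;
insert into table events_acid select id, payload, load_ts from staging_events;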
I noticed in one application that a concurrent READ (with INVALIDATE METADATA) and an OVERWRITE of a table causes the underlying files to become corrupt.
Is this a known scenario? I expected that while the table is being overwritten, a concurrent read would simply fail; it shouldn't be able to corrupt the underlying files of the table.
Help will be appreciated!
If the files become corrupt, it shouldn't be caused by concurrent reads and writes. HDFS is a read/append-only filesystem and Impala will always write new files. When you insert, files are written to a staging directory which Impala will not read from until files are complete, at which point they are moved into the table/partition directory.
A few things to consider: if you run the insert independently of the select, are the files OK? What do you mean by corrupt? Does it work in Hive? What version of Impala are you running?
I need to delete a large amount of data from my database on a regular basis. The process generates a huge volume of archive logs. We had a database crash at one point because there was no storage space available on the archive destination. How can I avoid generating logs while I delete data?
The data to be deleted is already marked as inactive in the database. Application code ignores inactive data. I do not need the ability to rollback the operation.
I cannot partition the data in such a way that inactive data falls in one partition that can be dropped. I have to delete the data with delete statements.
I can ask DBAs to set certain configuration at table level/schema level/tablespace level/server level if needed.
I am using Oracle 11g.
What proportion of the data in the table would be deleted, and what volume? Are there any referential integrity constraints to manage, or is this table childless?
Depending on the answers, you might consider:
"CREATE TABLE keep_data UNRECOVERABLE AS SELECT * FROM ... WHERE
[keep condition]"
Then drop the original table
Then rename keep_table to original table
Rebuild the indexes (again with UNRECOVERABLE to prevent redo), constraints, etc.
The problem with this approach is that it's a multi-step DDL process, which you will have a job to make fault-tolerant and reversible; a rough sketch of the idea follows below.
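A hedged sketch of those steps in Oracle SQL (table, column, and index names are placeholders; NOLOGGING is the modern spelling of UNRECOVERABLE):
-- 1. keep only the rows you want, without generating redo for the copy
create table keep_data nologging as
  select * from big_table where status = 'ACTIVE';   -- your [keep condition]
-- 2. swap the tables
drop table big_table;
rename keep_data to big_table;
-- 3. rebuild indexes, constraints, grants, triggers, etc. as on the original
create index big_table_ix1 on big_table (some_col) nologging;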
A safer option might be to use data-pump to:
Data-pump expdp to extract the "Keep" data
TRUNCATE the table
Data-pump impdp import of data from step 1, with direct-path
At this point I suggest you read the Oracle manual on Data Pump, particularly the section on Direct Path Loads to be sure this will work for you.
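Roughly, the Data Pump route might look like this (the directory object, file names, and the keep predicate are all placeholders; check the QUERY quoting rules for your platform):
# keep.par -- hypothetical expdp parameter file
tables=big_table
query=big_table:"WHERE status = 'ACTIVE'"
directory=dp_dir
dumpfile=keep.dmp
logfile=keep_exp.log

$ expdp myuser/mypass parfile=keep.par
SQL> truncate table big_table;
$ impdp myuser/mypass tables=big_table directory=dp_dir dumpfile=keep.dmp table_exists_action=append
impdp chooses direct path automatically where it can, which is why the manual section on Direct Path Loads is worth reading first.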
My preferred option would be partitioning.
Of course, the best way would be TenG's solution (CTAS, drop and rename table), but it seems that's impossible for you.
Your only problems are the amount of archive logs and the database crash. In this case, maybe you could split your delete into batches (for example, 10,000 rows at a time).
Something like:
declare
  e number;
  f number;
begin
  select count(*) into e from myTable where [delete condition];
  f := trunc(e / 10000) + 1;
  for i in 1 .. f
  loop
    delete from myTable where [delete condition] and rownum <= 10000;
    commit;
    dbms_lock.sleep(600); -- pause so old archive logs can be purged, if possible
  end loop;
end;
After this operation, you should reorganize your table, which is surely fragmented.
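For the reorganization, something along these lines (the index name is a placeholder):
alter table myTable move;
-- moving the table invalidates its indexes, so rebuild them
alter index myTable_ix1 rebuild;
-- refresh optimizer statistics afterwards
exec dbms_stats.gather_table_stats(user, 'MYTABLE');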
Alter the table to set NOLOGGING, delete the rows, then turn logging back on.
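A minimal sketch of that suggestion (table name and predicate are placeholders); note that, as discussed in the direct-path answers further down, NOLOGGING only suppresses redo for direct-path operations, so a conventional DELETE will still generate redo and undo:
alter table my_table nologging;
delete from my_table where status = 'INACTIVE';  -- conventional DML: still logged
commit;
alter table my_table logging;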
The SQL command TRUNCATE in Oracle is faster than DELETE FROM table, in that the TRUNCATE command first drops the specified table in its entirety and then creates a new table with the same structure (correct me if I'm wrong). Since TRUNCATE is a DDL command, it implicitly issues a COMMIT before being executed and after the completion of execution. If that is the case, then the table dropped by the TRUNCATE command would be lost permanently, along with its entire structure in the data dictionary. In such a scenario, how is the TRUNCATE command able to first drop the table and then recreate it with the same structure?
(Note that I work for Sybase in SQL Anywhere engineering and my answer comes from my knowledge of how truncate is implemented there, but I imagine it's similar in Oracle as well.)
I don't believe the table is actually dropped and re-created; the contents are simply thrown away. This is much faster than delete from <table> because no triggers need to be executed, and rather than deleting a row at a time (both from the table and the indexes), the server can simply throw away all pages that contain rows for that table and any indexes.
I thought a truncate (amongst other things) simply reset the High Water Mark.
see: http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/statements_10007.htm#SQLRF01707
However, in
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2816964500346433991
it is clear that the data segment changes after a truncate.
I have tried to find examples, but they are all simple with a single WHERE clause. Here is the situation. I have a bunch of legacy data transferred from another database. I also have the "good" tables in that same database. I need to transfer (data-conversion) the data from the legacy tables to the new tables. Because this is a different set of tables, the data conversion requires complex joins to put the old data into the new tables correctly.
So: the old tables hold the old data.
New tables must have the old data but it requires lots of joins to get that old data into the new tables correctly.
Can I use direct path with lots of joins like this? INSERT SELECT (lots of joins)
Does direct path apply to tables that are already in the same database (transfer between tables)? Or is it only for loading tables from, say, a text file?
Thank you.
The query in your SELECT can be as complex as you'd like with a direct-path insert. The direct-path refers only to the destination table. It has nothing to do with the way that data is read or processed.
If you're doing a direct-path insert, you're asking Oracle to insert the new data above the high water mark of the table so you bypass the normal code that reuses space in existing blocks for new rows to be inserted. It also has to block other inserts since you can't have the high water mark of the table change during a direct-path insert. This probably isn't a big deal if you've got a downtime window in which to do the load but it would be quite problematic if you wanted the existing tables to be available for other applications during the load.
No, on the contrary, it means you need to do a backup after a NOLOGGING load, not that you can't back up the database.
Allow me to elaborate a bit. Normally, when you do DML in Oracle, the before images of the changes you are making get logged in UNDO, and all the changes (including the UNDO changes) are first written to REDO. This is how Oracle manages transactions, instance recovery, and database recovery. If a transaction is aborted or rolled back, Oracle uses the information in UNDO to undo the changes your transaction made. If the instance crashes, then on instance restart, Oracle will use the information in REDO and UNDO to recover up to the last committed transaction. First, Oracle will read the REDO and roll forward, then use UNDO to roll back all the transactions that were not committed at the time of the crash. In this way, Oracle is able to recover up to the last committed transaction.
Now, when you specify an APPEND hint on an insert statement, Oracle will execute the INSERT with direct load. This means that data is loaded into brand new, never-before-used blocks, from above the high water mark. Because the blocks being loaded are brand new, there is no "before image", so Oracle can avoid writing UNDO, which improves performance. If the database is in NOARCHIVELOG mode, then Oracle will also not write REDO. On a database in ARCHIVELOG mode, Oracle will still write REDO, unless, before you do the insert /*+ append */, you set the table to NOLOGGING (i.e. alter table tab_name nologging;). In that case, REDO logging is disabled for the table. However, this is where you could run into backup/recovery implications. If you do a NOLOGGING direct load, and then you suffer a media failure, and the datafile containing the segment with the NOLOGGING operation is restored from a backup taken before the NOLOGGING load, then the redo log will not contain the changes required to recover that segment. So, what happens? Well, when you do a NOLOGGING load, Oracle writes extent invalidation records to the redo log, instead of the actual changes. Then, if you use that redo in recovery, those data blocks will be marked logically corrupt. Any subsequent queries against that segment will get an ORA-26040 error.
So, how to avoid this? Well, you should always take a backup immediately following any NOLOGGING direct load. If you restore/recover from a backup taken after the NOLOGGING load, there is no problem, because the data will be in the data blocks in the file that was restored.
Hope that's clear,
-Mark
Yes, there should not be any arbitrary limits on query complexity.
If you do
insert /*+ APPEND */ into target_table select .... from source1, source2..., sourceN where
It should work fine. Consider though, that the performance of the load will be limited by the performance of that query, so, be sure it's well-tuned, if you're expecting good performance.
Finally, consider whether setting NOLOGGING on the target table would improve performance significantly. But, also consider the backup recovery implications, if you decide to implement NOLOGGING.
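If you do go the NOLOGGING route, the shape of it would be roughly this (column names are placeholders), followed immediately by a fresh backup as described above:
alter table target_table nologging;
insert /*+ append */ into target_table
select s1.id, s1.col_a, s2.col_b
from source1 s1
join source2 s2 on s2.id = s1.id;
commit;
alter table target_table logging;
-- then take a backup of the affected datafiles/tablespace right away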
Hope that helps,
-Mark