My case is a source table with the Null engine and two materialized views on it with the ReplicatedMergeTree engine. Is that possible?
I've read this manual from Den Crane: https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
He only describes the case where the source table also has the same engine with replication, which is why I'm asking the question below.
You can use Null + Replicated MV.
The tables which store data for an MV are not related to the MV itself or to the source table, so their engine can be chosen independently.
https://youtu.be/1LVJ_WcLgF8?list=PLO3lfQbpDVI-hyw4MyqxEk3rDHw95SzxJ&t=7597
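A minimal sketch of that setup, assuming hypothetical table and column names and the usual {shard}/{replica} macros (a second MV can be attached to the same Null table in exactly the same way):

CREATE TABLE events_null
(
    ts DateTime,
    user_id UInt64
)
ENGINE = Null;

-- The MV's storage uses ReplicatedMergeTree even though the source is Null.
CREATE MATERIALIZED VIEW events_mv
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_mv', '{replica}')
ORDER BY ts
AS SELECT ts, user_id FROM events_null;

-- Rows inserted into events_null are discarded by the Null engine,
-- but each attached MV still receives and stores the inserted block.
INSERT INTO events_null VALUES (now(), 1);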
Related
We have a materialized view (MV) on Oracle 19c. Its data is updated/refreshed by a scheduler job every day.
Because of some maintenance work that may last for more than 6 months, I would like to update some data inside it manually. However, manual updates are forbidden on an MV.
It may sound stupid but I am planning to:
Disable the scheduler job
Turn the MV into a table (DROP MATERIALIZED VIEW .. PRESERVE TABLE;)
Then we can update the table data manually for our maintenance work.
After the maintenance work, I would:
Turn the table back to MV
Re-enable the scheduler job to refresh the data
So the question is... how do I turn the table back to MV SAFELY in my case? It is easy to turn an MV into a table, but I have never heard of anyone doing it the other way round.
By safely, I mean that the reverted MV behaves as before, without loss of behaviours/properties.
If I turn the MV into a table and then back into an MV, would the indexes still work for both the table and the MV without being affected?
Similarly, if we already have a synonym for the MV, would this synonym still work after converting to table and back to MV again?
Do I need to re-grant any user privileges on the table, and later on the MV again?
Note: I am aware that after turning the table back to MV, the data gets refreshed and our manual changes would be lost. That is acceptable for us because we just want the manual data to stay in place during the maintenance period.
If there are other suggestions/alternatives, I am happy to hear.
Combining a synonym, a patch table and views might be a good solution for a temporary situation, as suggested.
You can safely recreate the materialized view with
CREATE MATERIALIZED VIEW testmv ON PREBUILT TABLE ...
Indexes can be used until the refresh. You don't need to re-grant user privileges.
Synonyms that were previously created for the MV still work after creating the MV from the prebuilt table.
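A hedged sketch of the round trip, with assumed object names (orders_mv, orders) and an assumed defining query, not taken from the question:

-- Step 1: turn the MV into a plain table, keeping the data.
DROP MATERIALIZED VIEW orders_mv PRESERVE TABLE;

-- Step 2: manual updates on the preserved table during the maintenance window.
UPDATE orders_mv SET status = 'MAINT' WHERE order_id = 42;
COMMIT;

-- Step 3: recreate the MV on the prebuilt table with its original defining query.
CREATE MATERIALIZED VIEW orders_mv
ON PREBUILT TABLE
REFRESH COMPLETE ON DEMAND
AS SELECT order_id, status FROM orders;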
What is the reason to update some data inside the mview? Maybe you need a table, not a mview. Just refresh the materialized view to update the data, or change the related SQL to reflect the dataset you want. Otherwise it would be in an inconsistent state.
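For reference, a manual refresh can be triggered like this (the mview name is an assumption):

BEGIN
  -- 'C' requests a complete refresh; use 'F' for a fast refresh where supported.
  DBMS_MVIEW.REFRESH('orders_mv', method => 'C');
END;
/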
Turn the table back to MV
If you understand that after that step the MV is refreshed with the live data and your changes to the underlying table are lost, it's OK.
Another way to do the job is to access the MV through a VIEW and change the definition of the VIEW for the maintenance work (conceptually "select from MV where not exists (select from PATCH) UNION ALL select from PATCH"), then put it back to "select from MV" when done.
(And note that truncating PATCH will have the same effect, so you don't even have to change the VIEW back...)
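A hedged sketch of that view-plus-patch-table approach; all object and key names are assumptions:

-- Empty patch table with the same shape as the MV.
CREATE TABLE orders_patch AS SELECT * FROM orders_mv WHERE 1 = 0;

-- Maintenance definition: rows in the patch table override MV rows with the same key.
CREATE OR REPLACE VIEW orders_v AS
SELECT * FROM orders_mv m
WHERE NOT EXISTS (SELECT 1 FROM orders_patch p WHERE p.order_id = m.order_id)
UNION ALL
SELECT * FROM orders_patch;

-- When maintenance ends, either restore the plain definition ...
CREATE OR REPLACE VIEW orders_v AS SELECT * FROM orders_mv;
-- ... or simply truncate the patch table and keep the combined view.
TRUNCATE TABLE orders_patch;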
I want to apply an archive and purge mechanism on Hive tables, which include internal and external tables, both partitioned and non-partitioned.
I have a site_visitors table and it is partitioned by visit_date.
I want to archive the site_visitors data for users who have not visited my site in the last year. At the same time, I don't want to keep this archived data in the same table directory; the archived data can live in some other specific location.
You can handle that at the partition level in the HDFS directory; below is one of the ways you can achieve it.
Your internal/main table will be sitting on top of HDFS and its directory will look something like below:
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-01
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-02
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-03
You can create an archive table on top of HDFS, or if you are just looking to archive the data you can dump the partitions to another location in HDFS. Either way, your HDFS location will look something like below.
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-01
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-02
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-03
You can run a UNIX shell script, a JavaScript, or a script in any other language used in your environment to move the files from the table's HDFS location to the archive HDFS location based on the partition dates.
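If you go that route, here is a hedged sketch of registering the moved partition directories under an external archive table; the table name, columns, file format and paths are assumptions, not from the question:

CREATE EXTERNAL TABLE IF NOT EXISTS site_visitors_archive (
  user_id STRING,
  page    STRING
)
PARTITIONED BY (visit_date STRING)
STORED AS ORC   -- should match the main table's file format (assumed here)
LOCATION 'hdfs://namenode/hdfs_location/site_visitors';

-- After moving a partition directory, register it in the metastore.
ALTER TABLE site_visitors_archive
ADD IF NOT EXISTS PARTITION (visit_date = '2017-01-01');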
You can also use the approach below, where you load the data into an archive table and then drop the partitions from the original table.
#!/bin/bash
# Assumes CONN_URL, SCHEMA and TABLE_NAME are already set in the environment.
ARCHIVE=$1
now=$(date +%Y-%m-%d)
StartDate=$now
# archive_dt is derived from the ARCHIVE offset and will be used for the alterations and the load
archive_dt=$(date --date="${now} - ${ARCHIVE} day" +%Y-%m-%d)
EndDate=$archive_dt
# You can use hive, beeline or impala to insert the data into the archive table; I'm using beeline in this example
beeline -u ${CONN_URL} -e "insert into table ${SCHEMA}.archive_table partition (visit_date) select * from ${SCHEMA}.${TABLE_NAME} where visit_date < '${archive_dt}'"
# After the data has been loaded into the archive table, drop the partitions in the original table
beeline -u ${CONN_URL} -e "ALTER TABLE ${SCHEMA}.main_table DROP PARTITION(visit_date < '${archive_dt}')"
# Repair the tables to sync the metadata after the alterations
beeline -u ${CONN_URL} -e "MSCK REPAIR TABLE ${SCHEMA}.main_table; MSCK REPAIR TABLE archiveSchema.archive_table"
In Hive, how can I delete duplicate records? Below is my case.
First, I load data from the products table into products_rcfileformat. There are 25 rows in the products table.
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;
Second, I load data from the products table into products_rcfileformat again. There are still 25 rows in the products table, but this time I'm NOT using the OVERWRITE clause.
FROM products INSERT INTO TABLE products_rcfileformat
SELECT *;
When I query the data it gives me a total of 50 rows, which is right.
Checking from HDFS, it seems HDFS makes another copy of the file, xxx_copy_1, instead of appending to 000000_0.
Now I want to remove those records that came from xxx_copy_1. How can I achieve this with a Hive command? If I'm not mistaken, I can remove the xxx_copy_1 file by using the hdfs dfs -rm command followed by rerunning the insert overwrite command. But I want to know whether this can be done with a Hive command, for example a delete statement?
Partition your data such that the rows you want to drop (use the window function row_number to identify them) are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.
More detail about partitioning:
www.tutorialspoint.com/hive/hive_partitioning.htm
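A hedged sketch of that idea, assuming a staging table products_dedup partitioned by a keep_flag column and hypothetical column names:

CREATE TABLE products_dedup (
  product_id   INT,
  product_name STRING
)
PARTITIONED BY (keep_flag STRING)
STORED AS RCFILE;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Duplicates (rn > 1) land in the 'dup' partition, which is then dropped.
INSERT OVERWRITE TABLE products_dedup PARTITION (keep_flag)
SELECT product_id,
       product_name,
       CASE WHEN rn = 1 THEN 'keep' ELSE 'dup' END AS keep_flag
FROM (
  SELECT product_id, product_name,
         row_number() OVER (PARTITION BY product_id ORDER BY product_id) AS rn
  FROM products_rcfileformat
) t;

ALTER TABLE products_dedup DROP IF EXISTS PARTITION (keep_flag = 'dup');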
Checking from HDFS, it seems HDFS makes another copy of the file, xxx_copy_1, instead of appending to 000000_0
The reason is that files in HDFS are write-once and cannot be edited in place; the Hive warehouse files (or whatever the location may be) live in HDFS, so Hive has to create a second file instead of appending to the existing one.
Now I want to remove those records that came from xxx_copy_1. How can I achieve this with a Hive command?
Please check this post - Removing DUPLICATE rows in hive based on columns.
Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries but may not work in the way you want.
I'm currently implementing ETL (Talend) of monitoring data to HDFS and a Hive table.
I am now facing concerns about duplicates. More specifically, if we need to run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
The solution to that in an RDBMS would have been to store the input file name and to "DELETE WHERE file name = ..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like to have your advice on how to handle this. I envisage two solutions:
Actually, the ETL is putting CSV files onto HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that, with this operation, I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of this created ORC file? If yes, I would be able to search the data by its file name and delete it before launching the ETL.
I'm not used to Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID with Hive? Will I be able to "DELETE WHERE" with it?
Feel free to propose any other solution to that.
Bests,
Orlando
If the data volume in the target table is not too large, I would advise:
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
 (SELECT 1
  FROM trg x
  WHERE x.key = src.key
    AND <<additional filter on target to reduce data volume>>
 )
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys from the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap can fit in the RAM available for the mappers' heap (check your default conf files, and increase it with a set command in the Hive script if necessary) the performance will be sub-optimal, but you can be pretty sure that you will not have any duplicates.
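For instance, the mapper memory could be raised from the Hive script if the HashMap of target keys grows large; the values below are purely illustrative:

SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;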
And in your actual use case you don't have to check each key but only a "batch ID", more precisely the original file name; the way I did it in my previous job was:
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
 (SELECT DISTINCT 1
  FROM trg x
  WHERE x.original_file_name = src.INPUT__FILE__NAME
    AND <<additional filter on target to reduce data volume>>
 )
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters -- so the overhead would stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would do this automatically at execution time, but Hive does not (yet), so you have to force it. Note also that the "1" is just a dummy value required by "SELECT" semantics; again, a mature DBMS would allow a dummy "null", but some versions of Hive would crash (e.g. with Tez in V0.14), so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering myself. I found a solution:
I partitioned my table by (date, input_file_name) (note: I can get the input_file_name with SELECT INPUT__FILE__NAME in Hive).
Once I did this, before running the ETL, I can send Hive an ALTER TABLE DROP IF EXISTS PARTITION (file_name=...) so that the folder containing the input data is deleted if this input file has already been sent to the ORC table.
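A hedged sketch of that partition-per-input-file approach; table, column names and file names are assumptions, not from the original setup:

CREATE TABLE monitoring_orc (
  metric STRING,
  value  DOUBLE
)
PARTITIONED BY (load_date STRING, input_file_name STRING)
STORED AS ORC;

-- Before (re)loading a file, drop its partition if it was already loaded.
ALTER TABLE monitoring_orc DROP IF EXISTS
  PARTITION (load_date = '2017-01-01', input_file_name = 'monitoring_2017-01-01.csv');

-- Reload, carrying the source file name into the partition column
-- (INPUT__FILE__NAME is the full HDFS path; trim it first if you prefer a bare file name).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE monitoring_orc PARTITION (load_date, input_file_name)
SELECT metric, value, '2017-01-01' AS load_date, INPUT__FILE__NAME AS input_file_name
FROM monitoring_csv;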
Thank you everyone for your help.
Cheers,
Orlando
I need to delete a large amount of data from my database on a regular basis. The process generates a huge volume of archive logs. We had a database crash at one point because there was no storage space available at the archive destination. How can I avoid generating logs while I delete data?
The data to be deleted is already marked as inactive in the database. Application code ignores inactive data. I do not need the ability to rollback the operation.
I cannot partition the data in such a way that inactive data falls in one partition that can be dropped. I have to delete the data with delete statements.
I can ask DBAs to set certain configuration at table level/schema level/tablespace level/server level if needed.
I am using Oracle 11g.
What proportion of the data on the table would be deleted, what volume? Are there any referential integrity constraints to manage or is this table childless?
Depending on the answers, you might consider:
"CREATE TABLE keep_data UNRECOVERABLE AS SELECT * FROM ... WHERE
[keep condition]"
Then drop the original table
Then rename keep_data to the original table name
Rebuild the indexes (again with UNRECOVERABLE to prevent redo), constraints, etc.
The problem with this approach is that it's a multi-step DDL process, which you will have a job making fault tolerant and reversible.
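A hedged sketch of those steps, with assumed table, column and index names, and NOLOGGING as the modern spelling of UNRECOVERABLE:

CREATE TABLE keep_data NOLOGGING AS
  SELECT * FROM big_table WHERE status = 'ACTIVE';   -- the [keep condition]

DROP TABLE big_table PURGE;

ALTER TABLE keep_data RENAME TO big_table;

-- Rebuild indexes/constraints; NOLOGGING keeps redo for the index build minimal.
CREATE INDEX big_table_status_ix ON big_table (status) NOLOGGING;
ALTER TABLE big_table ADD CONSTRAINT big_table_pk PRIMARY KEY (id);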
A safer option might be to use data-pump to:
Data-pump expdp to extract the "Keep" data
TRUNCATE the table
Data-pump impdp import of data from step 1, with direct-path
At this point I suggest you read the Oracle manual on Data Pump, particularly the section on Direct Path Loads to be sure this will work for you.
My preferred option would be partitioning.
Of course, the best way would be TenG's solution (CTAS, drop and rename the table), but it seems that's impossible for you.
Your only problems are the amount of archive logs and the database crash risk. In this case, maybe you could batch your delete statement (for example, 10,000 rows at a time).
Something like:
declare
  e number;
  f number;
begin
  select count(*) into e from myTable where [delete condition];
  f := trunc(e / 10000) + 1;
  for i in 1 .. f
  loop
    delete from myTable where [delete condition] and rownum <= 10000;
    commit;
    dbms_lock.sleep(600); -- purge old archive logs here if possible
  end loop;
end;
After this operation, you should reorganize your table, which will surely be fragmented.
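For example, a typical reorganization (table and index names are assumptions):

ALTER TABLE myTable MOVE;          -- rebuilds the table segment and reclaims space
ALTER INDEX myTable_ix1 REBUILD;   -- indexes are left UNUSABLE by the MOVE and must be rebuilt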
Alter the table to set NOLOGGING, delete the rows, then turn logging back on.