In our application, we are planning to go for Incremental Backup due to the excess time it takes. Now we have two dump files:one is full backup and the other is incremental backup since the previous full or incremental backup.My problem is i need to merge these two dump files to get the latest data which i can then import. But i am not able to get how to merge these two backups(full backup and incremental backup).I have read about RMAN but did not get clear idea on the syntax of Restore command in RMAN.Please help me on this soon.
Exactly what do you mean when you say you have an "incremental backup"?
You talk about having "two dump files" which implies that you have the output of two different calls to the export utility. Potentially, the second export call could have used the INCTYPE parameter. That is not what most people would mean when they talk about a backup or an incremental backup. An incremental export will do a complete export of every table where any data changed between the last export and the "incremental" export. That is almost never what people want (or think they're getting) from an incremental export. If you have the output of two calls to the export utility, there is no way to merge them. You'd need to import the full export and then the incremental export (which would completely re-load all the data in most if not all of the tables). And dump files cannot be used with the RMAN utility.
When you talk about your "two dump files", it's also possible, I suppose, that you are referring to an actual RMAN full backup and a RMAN incremental backup. That would almost certainly involve more than two files and wouldn't normally be called a "dump file" but you would at least be able to restore the backups using RMAN. Can you post the RMAN backup command you used to create the backups (if you did, indeed, create physical backups using RMAN)?
Related
my requirement is to read that and generate another set of parquet data into another ADLS folder.
do i need this into spark dataframes and perform upserts ?
Parquet is like any other file format. You have to overwrite the files to perform insert, updates and deletes. It does not have ACID properties like a database.
1 - We can use SET properties with the spark dataframe to accomplish what you want. However, it compares at both the row and column level. Not as nice as an ANSI SQL.
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html
2 - We can save the data in the target directory as a DELTA file. Most people are using DELTA since it has ACID properties like a database. Please see the merge statement. It allows for updates and inserts.
https://docs.delta.io/latest/delta-update.html
Additionally we can soft delete when reversing the match.
The nice thing about a delta file (table) is we can partition by date for a daily file load. Thus we can use time travel to see what happen yesterday versus today.
https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
3 - If you do not care about history and soft deletes, the easiest way to accomplish this task is to archive the old files in the target directory, then copy over the new files from the source directory to the target directory.
We have small array of greenplum DCA V1 and V3.
Trying to conduct backup/restore process steps between them.
As novice to DCA Appliances.banging my head against the wall to understand the parallel backup process in logical way.
We tried
Trying to conduct parallel backup.
using gpcrondump/gpdbrestore. But did not understand working process how it execute
on Master host
on segment host
Question is :
How parallel backup works in master-segment DCA env from version to version.
gpcrondump executes a backup in parallel. It basically coordinates the backups across all segments. By default, each segment will create a db_dumps directory in each segment's $PGDATA directory and a sub-directory under that with a date format.
For example, let's say you have 4 segments per host and hosts sdw1-4. The dumps will be created in:
/data1/gpseg0/db_dumps/20161111/
/data1/gpseg1/db_dumps/20161111/
/data2/gpseg2/db_dumps/20161111/
/data2/gpseg3/db_dumps/20161111/
This repeats across all segments.
The segment will dump only its data to this dump location. grcrondump will name the files, make sure it completes successfully, etc as each segment dumps data independently of the other segments. Thus, it is done in parallel.
The master will also have a backup directory created but there isn't much data in this location. It is mainly metadata about the backup that was executed.
The metadata for each backup is pretty important. It contains the segment id and the content id for the backup.
gpdbrestore restores a backup created by gpcrondump. It reads the files and loads it into the database. It reads those backup files and makes sure the segment id and content id match the target. So, the number of segments from a backup must match the number of segments to restore to. It also has to have the same mapping of segment id to content id.
Migration from one cluster can be done multiple ways. One way is to do a backup and then restore. This requires the same configuration in both clusters. You have to copy all of the backup files from one cluster to the other as well. Alternatively, you could backup and restore from a backup device like DataDomain.
You can also use a built-in tool call gptransfer. This doesn't use a backup but instead, uses external tables to transfer from one cluster to another. The configuration of the two clusters doesn't have to be the same when using this tool but if you are going from a larger cluster to a smaller cluster, it will not be done in parallel.
I highly recommend you reach out to your Pivotal Account Rep to get some assistance. More than likely, you have already paid for services when buying the new DCA that will cover part or all of the migration work. You will have to configure networking between the two clusters which requires some help from EMC too.
Good luck!!
I am using Hadoop and facing the dreaded problem of large numbers of small files. I need to be able to create har archives out of existing hive partitions and query them at the same time. However, Hive apparently supports archiving partitions only in managed tables and not external tables - which is pretty sad. I am trying to find a workaround for this, by manually archiving the files inside a partition's directory, using hadoop's archive tool. I now need to configure hive to be able to query the data stored in these archives, along with the unarchived data stored in other partition directories. Please note that we only have external tables in use.
The namespace for accessing the files in the created partition-har corresponds to the hdfs path of the partition dir.
For example, For example, a file in hdfs:
hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt
can after archiving be accessed as:
har:///user/user1/data/db1/tab1/ds=2016_01_01.har/f1.txt
Would it be possible for hive to query the har archives from the external table? Please suggest a way if yes.
Best Regards
In practice, the line between "managed" and "external" tables is very thin.
My suggestion:
create a "managed" table
add explicitly partitions for some days in the future, but with ad hoc locations -- i.e. the directories your external process expects to use
let the external process dump its file directly at HDFS level -- they are automagically exposed in Hive queries, "managed" or not(the Metastore does not track individual files and blocks, they are detected on each query; as a side note, you can run backup & restore operations at HDFS level if you wish, as long as you don't mess with the directory structure)
when a partition is "cold" and you are pretty sure there will never be another file dumped there, you can run a Hive command to archive the partition i.e. move small files in a single HAR + flag the partition as "archived" in the Metastore
Bonus: it's easy to unarchive your partition within Hive (whereas there is no hadoop unarchive command AFAIK).
Caveat: it's a "managed" table so remember not to DROP anything unless you have safely moved your data out of the Hive-managed directories.
I'm looking to extract some data from an Oracle database and transferring it to a remote HDFS file system. There appears to be a couple of possible ways of achieving this:
Use Sqoop. This tool will extract the data, copy it across the network and store it directly into HDFS
Use SQL to read the data and store in on the local file system. When this has been completed copy (ftp?) the data to the Hadoop system.
My question will the first method (which is less work for me) cause Oracle to lock tables for longer than required?
My worry is that that Sqoop might take out a lock on the database when it starts to query the data and this lock isn't going to be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency) the lock will remain longer than would otherwise be required.
Sqoop issues usual select queries on the Oracle batabase, so it does
the same locks as the select query would. No extra additional locking
is performed by Sqoop.
Data will be transferred in several concurrent tasks(mappers). Any
expensive function call will put a significant performance burden on
your database server. Advanced functions could lock certain tables,
preventing Sqoop from transferring data in parallel. This will
adversely affect transfer performance.
For efficient advanced filtering, run the filtering query on your
database prior to import, save its output to a temporary table and
run Sqoop to import the temporary table into Hadoop without the —where parameter.
Sqoop import has nothing to do with copy of data accross the network.
Sqoop stores at one location and based on the Replication Factor of
the cluster HDFS replicates the data
Oracle has two options of backuping database, and documentation on them is very brief.
To back up to disk as image copies, use BACKUP AS COPY as shown in
BACKUP AS COPY
DEVICE TYPE DISK
DATABASE;
To back up your data
into backup sets, use the AS BACKUPSET clause. You can allow backup
sets to be created on the configured default device, or direct them
specifically to disk or tape
BACKUP AS BACKUPSET
DATABASE;
BACKUP AS BACKUPSET
DEVICE TYPE DISK
DATABASE;
What is the difference between the two, why there are these multiple options?
To put it simply, back up as copy makes a simple copy of database files(the same way Linux cp command does), whereas backup sets is a logical entity to backup pieces as a tablespace to data files. Backup pieces are in an RMAN specific binary format.
why there are these multiple options?
To give the opportunity to perform backup and recovery more effectively and efficiently. For example, you can simply switch to an image copy of a data file avoiding, possibly time consuming, restoration process. But you cannot perform incremental backups with image copies as you be able to do so with backup sets, etc.
The choice of options, of course depends on your B&R strategy.
Find out more