Greenplum DCA - How to backup & restore Version V2 to V3

We have a small array of Greenplum DCA V1 and V3 appliances and are trying to work out the backup/restore process steps between them.
As a novice to DCA appliances, I'm banging my head against the wall trying to understand the parallel backup process in a logical way.
We tried to run a parallel backup using gpcrondump/gpdbrestore, but did not understand how the process executes
on the master host
on the segment hosts
The question is: how does a parallel backup work in a master-segment DCA environment, from version to version?

gpcrondump executes a backup in parallel. It basically coordinates the backups across all segments. By default, each segment creates a db_dumps directory in its $PGDATA directory, with a date-formatted sub-directory under that.
For example, let's say you have 4 segments per host and hosts sdw1-4. The dumps will be created in:
/data1/gpseg0/db_dumps/20161111/
/data1/gpseg1/db_dumps/20161111/
/data2/gpseg2/db_dumps/20161111/
/data2/gpseg3/db_dumps/20161111/
This repeats across all segments.
Each segment dumps only its own data to this dump location. gpcrondump names the files, makes sure each dump completes successfully, etc., as each segment dumps its data independently of the other segments. Thus, the backup is done in parallel.
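For reference, a typical invocation from the master host looks something like the sketch below (the database name is a placeholder, and the exact flags can vary between Greenplum versions):
# Full parallel dump of database "mydb"; each segment writes its own files
# under <segment_data_dir>/db_dumps/<YYYYMMDD>/
# -a = do not prompt, -g = copy config files, -G = dump global objects
gpcrondump -x mydb -a -g -G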
The master will also have a backup directory created but there isn't much data in this location. It is mainly metadata about the backup that was executed.
The metadata for each backup is pretty important. It contains the segment id and the content id for the backup.
gpdbrestore restores a backup created by gpcrondump. It reads the backup files and loads them into the database, checking that the segment id and content id in the files match the target. So the number of segments in a backup must match the number of segments you are restoring to, and the mapping of segment id to content id must be the same.
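As a sketch (the timestamp key and flags below are illustrative), restoring that dump onto a cluster with a matching segment configuration might look like:
# Restore the dump identified by its 14-digit timestamp key;
# -e drops and recreates the target database first, -a suppresses prompts
gpdbrestore -t 20161111093000 -e -a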
Migration from one cluster to another can be done multiple ways. One way is to do a backup and then a restore. This requires the same configuration in both clusters, and you have to copy all of the backup files from one cluster to the other as well. Alternatively, you could backup and restore through a backup device like Data Domain.
You can also use a built-in tool called gptransfer. This doesn't use a backup; instead, it uses external tables to transfer data from one cluster to another. The configuration of the two clusters doesn't have to be the same when using this tool, but if you are going from a larger cluster to a smaller cluster, the transfer will not be done in parallel.
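For illustration only (the host name and batch size are made up, and the exact options depend on your gptransfer version), a full cluster-to-cluster copy is kicked off roughly like this, typically from the source cluster's master:
# Copy all user databases to the new cluster's master host;
# --batch-size controls how many tables are transferred concurrently
gptransfer --full --dest-host=new-mdw --batch-size=4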
I highly recommend you reach out to your Pivotal Account Rep to get some assistance. More than likely, you have already paid for services when buying the new DCA that will cover part or all of the migration work. You will have to configure networking between the two clusters which requires some help from EMC too.
Good luck!!

Related

Spark EMR S3 - Processing a Large Number of Files

I have around 15000 ORC files in S3, where each file contains a few minutes' worth of data and the size of each file varies between 300-700 MB.
Since recursively looping through a directory laid out in YYYY/MM/DD/HH24/MIN format is expensive, I am creating a file which contains a list of all S3 files for a given day (objects_list.txt) and passing this file as input to the Spark read API.
// objects_list.txt (bundled as a resource) holds one S3 object path per line
import scala.collection.mutable

val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))
val paths: mutable.Set[String] = mutable.Set[String]()
for (line <- file_list.getLines()) {
  if (line.length > 0 && line.contains("part"))
    paths.add(line.trim)
}

// Load every ORC file in a single call and expose the result as a temp view
val eventsDF = spark.read.format("orc")
  .option("spark.sql.orc.filterPushdown", "true")
  .load(paths.toSeq: _*)
eventsDF.createOrReplaceTempView("events")
The cluster consists of 10 r3.4xlarge worker machines (each node: 120 GB RAM and 16 cores) and an m3.2xlarge master.
The problem I am facing is that the Spark read runs endlessly; I see only the driver working and all the other nodes doing nothing, and I am not sure why the driver is opening each S3 file for reading. AFAIK Spark works lazily, so no reading should happen until an action is called; I think it is listing each file and collecting some metadata associated with it.
But why is only the driver working while all the other nodes do nothing, and how can I make this operation run in parallel on all worker nodes?
I have come across these articles https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec, but there the entire file contents are read as an RDD. In my use case, depending on the columns being referred to, only those blocks/columns of data should be fetched from S3 (columnar access, given that ORC is my storage format). The files in S3 have around 130 columns, but only 20 fields are referred to and processed using the DataFrame APIs.
Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading
You can see below that only one executor is running (the driver program, on one of the task nodes, since this is cluster mode) and CPU is at 0% on all the other nodes (i.e. the workers); even after 3-4 hours of processing the situation is the same, given the huge number of files to be processed.
Any pointers on how I can avoid this issue, i.e. speed up the load and processing?
There is a solution based on AWS Glue that can help you.
You have a lot of files partitioned in S3, and the partitions are based on a timestamp. Using Glue, you can use your objects in S3 like "hive tables" from your EMR cluster.
First you need to create an EMR cluster with release 5.8+ and, when configuring it, check the options to use the AWS Glue Data Catalog for table metadata. This will allow access to the AWS Glue Data Catalog.
After this you need to add your root folder to the AWS Glue Data Catalog. The fastest way to do that is with a Glue Crawler. This tool will crawl your data and create the catalog entries you need.
I suggest you take a look at the Glue Crawler documentation.
After the crawler runs, the metadata for your table will be in the catalog, and you can see it in AWS Athena.
In Athena you can check whether your data was properly identified by the crawler.
This solution will make your Spark job behave as if it were working against a real HDFS, because the metadata will already be in the Data Catalog, so the time your app currently spends "indexing" the files goes away and the jobs run faster.
Working with this, I was able to improve my queries, and working with partitions was much better with Glue. So give it a try; it will probably help with performance.
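As a rough sketch of what the read side then looks like (the database, table and column names here are hypothetical, and it assumes the EMR cluster was created with the Glue Data Catalog enabled and the crawler registered the ORC data with y/m/d/h/min as partition columns):
// Spark resolves the table through the Glue catalog, so it only lists the
// partitions matching the filter instead of opening all 15000 files up front
val eventsDF = spark.table("events_db.events")
  .where("y = '2017' AND m = '09' AND d = '20'")
  .select("user_id", "event_type")   // pick only the ~20 columns you actually use
eventsDF.createOrReplaceTempView("events")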

Way to export all the data from Cassandra cluster to file(s)

I need to export a Cassandra schema and data to a file in order to quickly set up an identical cluster when needed.
Identical here means the same topology, number of nodes and replication factor.
In the case of NetworkTopologyStrategy, a simple file backup / sstable snapshot is not helpful because peer IPs are recorded along with the other data; after a restore on another node, it tries to reach the source cluster's seeds.
I was surprised that there is almost no ready-made solution for such a task.
I suppose I have to use DESC SCHEMA; then parse the output for all the tables, back them up with COPY keyspace.table TO '/backup/keyspace.table.csv'; and later use sstableloader to restore on another node.
Any better solutions?
You can use the solution you've specified.
Or you can use the snapshot option (it looks easier to me). Here's a doc describing how to copy snapshots between clusters:
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html
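For reference, a minimal sketch of the snapshot route (keyspace, table, tag and host names below are placeholders; the linked doc covers the full procedure, including schema and tokens):
# On each source node, take a snapshot of the keyspace
nodetool snapshot -t mybackup mykeyspace
# Copy the snapshot directories (.../mykeyspace/mytable-*/snapshots/mybackup/)
# to a machine that can reach the new cluster, laid out as mykeyspace/mytable/,
# then stream the sstables in:
sstableloader -d new-node1,new-node2 /path/to/mykeyspace/mytable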

Dataset for Hadoop Dev environment?

I am learning Hadoop. I want to understand how the dataset/database is set up for environments like Dev, Test and Pre-prod.
Of course, in the PROD environment we will be dealing with terabytes of data, but keeping the same replica of those terabytes in the other environments is, I think, not possible.
For the other environments, how are the datasets replicated? Are only certain portions of the data loaded and used in these non-prod environments? If so, how is that done?
As for how it is replicated: researching the basic HDFS concepts of NameNodes and DataNodes will help. When you create a new file, the request goes to the NameNode, which updates its metadata and gives you a block id; when you write, the block goes to the nearest DataNode based on rack location. That first DataNode replicates the block to a second DataNode, which then replicates it to a third, and so forth. Your client only writes to the very first node, and the HDFS framework handles the subsequent replication.
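As a small illustration of the non-prod side of your question (cluster addresses and paths here are made up), one common pattern is to copy just a slice of the production data into the dev cluster and keep fewer replicas of it:
# Copy a single day's partition from the prod cluster into dev
hadoop distcp hdfs://prod-nn:8020/data/events/2016/11/11 hdfs://dev-nn:8020/data/events/2016/11/11
# Keep 2 replicas instead of the default 3 to save space in dev
hdfs dfs -setrep -w 2 /data/events/2016/11/11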

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a Node server that essentially accepts a payload of log entries, saves them to their respective log files (as write-stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space-separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on reading the documentation, I think this format is a good fit for something like Hadoop. The way I think this works is: I will store these logs on a server, then run a cron job that periodically moves the files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR, and from there I could query them with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dumps that change periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (as you probably know).
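For instance (the bucket path and table name are hypothetical), an external table over the space-separated player logs described above could be declared like this, with no concatenation needed:
-- One row per log line; ts/user/mediatype/event map to the space-separated fields
CREATE EXTERNAL TABLE player_events (
  ts BIGINT,
  `user` STRING,
  mediatype STRING,
  event STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://my-log-bucket/player-events/';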
As you 'append' new log events to Hive, part files will be produced, so you don't have to concatenate them before loading them into Hive.
(On day 0, the logs are in some delimited form and get loaded into Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs are appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them, but the joins are going to be slow. So you may want to define fillers/placeholders in your schema.
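If you later do add a metric as a new trailing field in the logs, a sketch of evolving such a table (the column name is made up) would be:
-- Old rows simply return NULL for the new column, so previous data is not broken
ALTER TABLE player_events ADD COLUMNS (session_id STRING);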
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.

Backup COPY vs BACKUPSET

Oracle has two options for backing up a database, and the documentation on them is very brief.
To back up to disk as image copies, use BACKUP AS COPY, as shown below:
BACKUP AS COPY
DEVICE TYPE DISK
DATABASE;
To back up your data into backup sets, use the AS BACKUPSET clause. You can allow backup sets to be created on the configured default device, or direct them specifically to disk or tape:
BACKUP AS BACKUPSET
DATABASE;
BACKUP AS BACKUPSET
DEVICE TYPE DISK
DATABASE;
What is the difference between the two, and why are there these multiple options?
To put it simply, BACKUP AS COPY makes a plain copy of the database files (the same way the Linux cp command does), whereas a backup set is a logical entity made up of backup pieces that contain the data from your tablespaces and data files. Backup pieces are in an RMAN-specific binary format.
Why are there these multiple options?
To give you the opportunity to perform backup and recovery more effectively and efficiently. For example, you can simply switch to an image copy of a data file, avoiding a possibly time-consuming restore process. But you cannot perform incremental backups with image copies the way you can with backup sets, etc.
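As a small illustration of that trade-off (a sketch only; the datafile number is made up), switching to an image copy versus taking an incremental backup as backup sets looks like:
# Use the image copy of datafile 4 directly instead of restoring it, then recover it
SWITCH DATAFILE 4 TO COPY;
RECOVER DATAFILE 4;
# Incremental backups, on the other hand, are taken as backup sets
BACKUP INCREMENTAL LEVEL 1 DATABASE;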
The choice of options, of course depends on your B&R strategy.
