How to build a Vertica slave?

I have a Vertica instance running in our prod environment. Currently, we take regular backups of the database. I want to build a Master/Slave configuration for Vertica so that I always have the latest backup available in case something goes wrong. I tried to Google this but did not find much on the topic. Your help will be much appreciated.

There is no concept of a Master/Slave in Vertica. It seems that you are after a DR solution which would give you a standby instance if your primary goes down.
The standard practice with Vertica is a dual-load solution that streams data into both your primary and DR instances. The backup-based option you're currently using would require an identical standby system and takes time to restore from backup. Your other option is storage replication, which is more expensive.
Take a look at the best practices for disaster recovery in the documentation.
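If you do stay with the backup-based approach, Vertica's vbr utility is the usual tool; here is a rough sketch only, where the paths and config file names are placeholders and the copycluster task assumes an identically configured standby cluster:
# initialize the backup location defined in the config file, then take a full backup
vbr --task init --config-file /home/dbadmin/backup.ini
vbr --task backup --config-file /home/dbadmin/backup.ini
# copy the entire database to an identical standby cluster for DR
vbr --task copycluster --config-file /home/dbadmin/copycluster.ini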

Related

How to perform GreenPlum 6.x Backup & Recovery

I am using Greenplum 6.x and facing issues while performing backup and recovery. Do we have any tool to take a physical backup of the whole cluster, like pgBackRest for Postgres? Also, how can we purge the WAL of the master and each segment, since we can't take a pg_basebackup of the whole cluster?
Are you using open source Greenplum 6 or a paid version? If paid, you can download the gpbackup/gprestore parallel backup utilities (separate from the database software itself), which will back up the whole cluster with a wide variety of options. If you are using open source, your options are pretty much limited to pg_dump/pg_dumpall.
There is no way to purge the WAL logs that I am aware of. In Greenplum 6, the WAL logs are used to keep all the individual postgres engines in sync throughout the cluster. You would not want to purge these individually.
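As a rough sketch of how gpbackup/gprestore are typically run (the database name, directory, and timestamp below are placeholders):
# full parallel backup of one database across the whole cluster
gpbackup --dbname mydb --backup-dir /data/backups
# restore a specific backup, identified by its timestamp
gprestore --timestamp 20240101010101 --backup-dir /data/backups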
Jim McCann
VMware Tanzu Data Engineer
I would like to better understand the issues you are facing when performing your backup and recovery.
For open source users of the Greenplum Database, the gpbackup/gprestore utilities can be downloaded from the Releases page on the GitHub repo:
https://github.com/greenplum-db/gpbackup/releases
v1.19.0 is the latest.
There currently isn't a pg_basebackup / WAL-based backup/restore solution for Greenplum Database 6.x.
WAL logs are periodically purged (as they get replicated to the mirror and flushed) from the master and each segment individually, so no manual purging is required. Have you looked into why the WAL logs are not getting purged? One of the reasons could be that a mirror in the cluster is down. If that happens, WAL will keep accumulating on the primary and won't get purged. Run select * from pg_replication_slots; on the master or the segment where WAL is building up to learn more.
If the WAL buildup is caused by a replication slot because a mirror is down for some reason, you can use the GUC max_slot_wal_keep_size to configure the maximum size the WAL files may consume; once that limit is reached, the replication slot is disabled and no longer consumes additional disk space for WAL.
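A hedged sketch of the checks and the GUC change described above; the 10GB value is only an example, and the exact syntax may vary by Greenplum 6.x minor version:
-- run on the master, or against the segment whose WAL is growing
select slot_name, slot_type, active, restart_lsn from pg_replication_slots;
# set the cap cluster-wide with gpconfig, then reload the configuration
gpconfig -c max_slot_wal_keep_size -v '10GB'
gpstop -u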

How to copy hbase table from hbase-0.94 cluster to hbase-0.98 cluster

We have an hbase-0.94 cluster with hadoop-1.0.1. We don't want any downtime for this cluster while upgrading to hbase-0.98 with hadoop-2.5.1.
I have provisioned another hbase-0.98 cluster with hadoop-2.5.1 and want to copy the hbase-0.94 tables to hbase-0.98. HBase CopyTable does not seem to work for this purpose.
Please suggest a way to perform the above task.
These are the available options from which you can choose.
You can use the org.apache.hadoop.hbase.mapreduce.Export tool to export tables to HDFS, then use hadoop distcp to move the data to the other cluster. Once the data is in place on the second cluster, you can use the org.apache.hadoop.hbase.mapreduce.Import tool to import the tables. Please look at http://hbase.apache.org/book.html#export.
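A hedged sketch of that Export / distcp / Import sequence; the table name, paths, and namenode addresses are placeholders, and since the Hadoop versions differ, distcp is usually run on the newer cluster reading from the old one over hftp/webhdfs:
# on the source (0.94) cluster: dump the table to HDFS
hbase org.apache.hadoop.hbase.mapreduce.Export my_table /tmp/my_table_export
# on the destination (hadoop-2.5.1) cluster: pull the export across
hadoop distcp hftp://source-namenode:50070/tmp/my_table_export hdfs://dest-namenode:8020/tmp/my_table_export
# on the destination (0.98) cluster: create the target table first, then import
hbase org.apache.hadoop.hbase.mapreduce.Import my_table /tmp/my_table_export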
The second option is to use the CopyTable tool; please look at:
http://hbase.apache.org/book.html#copytable
The third option is to enable HBase snapshots, create table snapshots, and then use the ExportSnapshot tool to move them to the second cluster. Once the snapshots are on the second cluster you can clone tables from them. Please look at http://hbase.apache.org/book.html#ops.snapshots
HBase snapshots allow you to take a snapshot of a table without much impact on the Region Servers. Snapshot, clone, and restore operations don't involve data copying, and exporting a snapshot to another cluster does not impact the Region Servers either.
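A hedged sketch of the snapshot route; the snapshot/table names, mapper count, and destination namenode address are placeholders:
# hbase shell on the source cluster
snapshot 'my_table', 'my_table_snapshot'
# export the snapshot files to the destination cluster's HDFS
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my_table_snapshot -copy-to hdfs://dest-namenode:8020/hbase -mappers 16
# hbase shell on the destination cluster
clone_snapshot 'my_table_snapshot', 'my_table'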
I have used options 1 and 3 for moving data between clusters, and in my case option 3 was the better solution.
Also, have a look at my answer posted
Run the command below on the source cluster; make sure you have cross-cluster authentication enabled.
/usr/bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Ddfs.nameservices=nameservice1,devnameservice -Ddfs.ha.namenodes.devnameservice=devnn1,devnn2 -Ddfs.namenode.rpc-address.devnameservice.devnn1=<destination_namenode01_host>:<destination_namenode01_port> -Ddfs.namenode.rpc-address.devnameservice.devnn2=<destination_namenode02_host>:<destination_namenode02_port> -Ddfs.client.failover.proxy.provider.devnameservice=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider -Dmapred.map.tasks.speculative.execution=false --peer.adr=<destination_zookeeper host>:<port>:/hbase --versions=<n> <table_name>

GlusterFS as the backend for Hadoop

I've seen that Red Hat has come up with one possible solution, with GlusterFS working as the backend for Hadoop. In this case you can get rid of the namenode/datanode architecture and replace it with GlusterFS, while still keeping Hadoop MapReduce API compatibility.
Just wondering how the performance compares against native HDFS? Is it really production ready? Does it support the rest of the Hadoop ecosystem as well, e.g. Solr Cloud, Spark, Impala, etc.?
Disclaimer: I work for a storage vendor.
Well, I don't know much about GlusterFS in particular, but I can speak about Lustre, since it's POSIX at the end of the day. It's a parallel filesystem, and the benchmarks I looked into recently showed it outperforming HDFS. It's definitely a production-ready alternative that offers a single namespace for your data (no more HDFS ingestion).
What does work from the Hadoop ecosystem today?
What I've seen in production today is Spark, Hive, and HBase. Impala looks to me like it requires certain parts of HDFS, which is why it doesn't work with a POSIX filesystem and isn't HCFS-compatible. I did a quick test and was able to create the database and everything, but I wasn't able to fetch any rows.
Let me know if you need further help.

What to use: Impala on HDFS, Impala on HBase, or just HBase?

I am working on a proof-of-concept task.
The task is to implement a feature of our product using Hadoop technology.
The feature is quite simple: we have a UI which lets you enter details about a "Network Issue".
All details about such an issue are captured and inserted into a table in an Oracle DB.
We then process the data in this table and calculate a Health Score.
I have to use Hadoop instead of a traditional DB, so my question is what to go for:
Impala on HDFS? or
Impala on HBase? or
HBase?
I am using a Cloudera VM for the POC implementation.
As per my understanding, HBase is a distributed NoSQL database, which is actually a layer on top of HDFS and provides Java APIs to access data.
Impala is a tool which also provides JDBC access to data stored in HBase or directly in HDFS.
I am very new to Hadoop; can someone please help?
Well, it depends on several things, like the kind of processing you are going to perform, the desired response time, etc. But based on what you have written here, HBase seems fine. I don't see any need for Impala as of now. The HBase API is good and will serve most of your needs.
IMHO, it's better to keep things simple initially and add a tool only if it is really required. The same holds true here. If you reach a point where you find that the HBase API cannot serve the purpose, you can definitely add Impala to your stack.
That being said, there is one thing you should keep in mind. HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminology, so you might find it a bit strange initially. Keep this in mind as you proceed, because you have to design the schema in a way that is quite different from the RDBMS style of schema design.
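To make that concrete, here is a minimal hbase shell sketch of the kind of access the HBase API gives you; the table, column family, row keys, and values are made up for illustration:
# in the hbase shell
create 'network_issue', 'details'
put 'network_issue', 'issue-001', 'details:severity', '3'
put 'network_issue', 'issue-001', 'details:description', 'link flap on router r1'
get 'network_issue', 'issue-001'
scan 'network_issue', {LIMIT => 10}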

Hadoop Disaster Recovery

Could anyone help me understand Hadoop disaster recovery?
Should I replicate data from one cluster to another cluster as a backup using distcp?
Or can I use copyToLocal to copy my data to my local machine?
Any ideas about it?
A DR plan goes beyond just the technology, and the requirements can greatly affect the solution.
For instance, if you can't afford to lose any data, you'd want an active/active setup and send data to two Hadoop clusters simultaneously. On the other side of the spectrum, Hadoop's replication (the default is 3 copies, but you can change that) and rack awareness can give you a copy on a secondary rack. In between, you can use tools like distcp, which you mention, to copy data from cluster to cluster.
Additionally, you might want to follow the Falcon project, which is a new initiative for Hadoop data lifecycle management.
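As a hedged illustration of the distcp option mentioned above (the cluster addresses, paths, and scheduling are placeholders):
# mirror a directory from the active cluster to the DR cluster;
# -update copies only changed files, -delete removes files no longer on the source
hadoop distcp -update -delete hdfs://active-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse
# typically scheduled with cron, Oozie, or Falcon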
