openGauss distributed deployment?

Does openGauss support distributed deployment? I read that Primary/Standby can be on different machines, but can I install more than one primary?

That's a good question.
openGauss currently only supports primary/standby deployment, but fortunately you have two options:
Turn to GaussDB for openGauss, a superset of openGauss that supports fully distributed functionality.
Turn to a proxy approach such as ShardingSphere, which also supports deploying openGauss in a distributed, sharded style.
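Since openGauss speaks the PostgreSQL protocol, you can at least verify that a primary/standby pair is replicating before reaching for either option. A minimal sketch, assuming a psycopg2-compatible driver (openGauss ships a patched psycopg2 build) and placeholder host and credentials:

```python
# Minimal sketch: confirm a primary/standby pair is replicating.
# Assumes a PostgreSQL-compatible driver (openGauss ships a patched
# psycopg2) and placeholder connection details -- adjust to your setup.
import psycopg2

conn = psycopg2.connect(
    host="primary.example.com",  # hypothetical primary node
    port=5432,
    user="monitor",
    password="***",
    dbname="postgres",
)
with conn.cursor() as cur:
    # pg_stat_replication lists each connected standby and its sync state.
    cur.execute("SELECT client_addr, state, sync_state FROM pg_stat_replication;")
    for client_addr, state, sync_state in cur.fetchall():
        print(f"standby {client_addr}: state={state}, sync={sync_state}")
conn.close()
```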

You can install multiple primary nodes on different machines, each with its own standby node, but note that open-source openGauss does not coordinate them as a single distributed database; for that you need one of the options above (GaussDB for openGauss or a sharding proxy).

Related

Multi-node Hadoop cluster with Docker

I am in the planning phase of a multi-node Hadoop cluster in a Docker-based environment, so it should be based on a lightweight, easy-to-use virtualized system.
The current architecture (according to the documentation) contains 1 master and 3 slave nodes. The host machine uses the HDFS filesystem and KVM for virtualization.
The whole cloud is managed by Cloudera Manager. There are several Hadoop modules installed on this cluster. There is also a NodeJS data upload service.
This time I should make architecture Docker based.
I have read several tutorials and have some opinions, but also open questions.
A. What do you think, is https://github.com/Lewuathe/docker-hadoop-cluster a good base for my project? I have also found an official image, but it is single-node.
B. How will the system requirements change if I make this a single container? That would be great, because this architecture should work in different locations, so changes could easily be transferred between them. Synchronization between these so-called clones would be important.
C. Do you have some other ideas, maybe best practices?
As of September 2016 there is no quick answer.
https://github.com/Lewuathe/docker-hadoop-cluster does not seem like a good starting point, since your base should be universal enough to also cover your option B.
Keep an eye on https://github.com/sequenceiq/hadoop-docker and https://github.com/kiwenlau/hadoop-cluster-docker
To address your question C., you may want to check out BlueData's software platform: http://www.bluedata.com/blog/2015/06/docker-containers-big-data-clusters
It's designed to run multi-node Hadoop clusters in a Docker-based environment and there is a free version available for download (you can also run it in an AWS EC2 instance).
This work has already been done for you, actually:
https://hub.docker.com/r/cloudera/clusterdock/
It includes a pre-packaged multi-node CDH cluster, with Cloudera Manager as an optional component for cluster management and more.
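If you go the multi-container route, the kiwenlau repository above drives everything with shell scripts; a rough equivalent using the Docker SDK for Python might look like the sketch below (the image tag and container names are assumptions modeled on that repo, not verified):

```python
# Minimal sketch: a 1-master/3-slave Hadoop cluster on one Docker host,
# mirroring the layout in the question. Requires `pip install docker`.
# The image tag "kiwenlau/hadoop:1.0" is an assumption taken from the
# linked repository; substitute whatever image you actually build.
import docker

client = docker.from_env()
client.networks.create("hadoop", driver="bridge")

client.containers.run("kiwenlau/hadoop:1.0", name="hadoop-master",
                      hostname="hadoop-master", network="hadoop",
                      detach=True, tty=True)
for i in range(1, 4):
    client.containers.run("kiwenlau/hadoop:1.0", name=f"hadoop-slave{i}",
                          hostname=f"hadoop-slave{i}", network="hadoop",
                          detach=True, tty=True)
```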

ZooKeeper and HBase OR HBase including ZooKeeper

I configured a multi-node Hadoop environment on AWS (1 master/3 slaves running on Ubuntu 14.04). Now I am planning to install and configure other Apache bricks (not sure exactly which ones yet). I decided to start with HBase.
Here is my dilemma: should I install ZooKeeper standalone and then HBase (taking into consideration future bricks like Pig, Hive, ...), or should I use the bundled ZooKeeper/HBase?
How might this choice affect subsequent architecture design?
Thanks for sharing your views/personal experiences!
It doesn't really matter all that much from a capabilities point of view. If you install the bundled HBase+ZK, you'll still be able to use ZK later on to support other bricks. Since installing the bundle is likely to be the quickest path to a working HBase, it is probably the best option for you.
A ZK ensemble is recommended to run on separate machines (an odd number of them) in any production environment.
For learning and experimenting, it can co-exist on one machine.
More information: https://zookeeper.apache.org/doc/r3.3.2/zookeeperAdmin.html
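Whichever way you install it, other bricks reach the same ensemble over the standard client port. As a minimal sketch, assuming the kazoo Python client and hypothetical ensemble hostnames, you can check that HBase has registered itself:

```python
# Minimal sketch: verify the ZooKeeper ensemble is reachable and that
# HBase has published its znodes. Hostnames are hypothetical; a bundled
# HBase-managed ZK works the same way from the client's point of view.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# HBase keeps its coordination state under /hbase by default.
if zk.exists("/hbase"):
    print("HBase znodes:", zk.get_children("/hbase"))

zk.stop()
```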

Why is DCOS required if we can deploy a Mesos cluster directly

I have a query: if we can set up a Mesos cluster by directly installing master and slave nodes, then why do we need DCOS? Does DCOS provide additional support on top of the Mesos cluster? Please elaborate on this part.
Depends on your needs :-):
Here is what, in my opinion, the Community Edition of DCOS (the Enterprise Edition includes more proprietary features such as security) adds over a self-setup of Mesos:
Easy setup, including Marathon and MesosDNS.
Command-line interface with one-click installs from the Universe. I especially like how simple it is to install services such as HDFS or Cassandra in your cluster. Note: as with the above, you could probably configure such a setup yourself with some effort, as both projects are on GitHub.
Very nice UI
So overall, I would summarize that DCOS provides a very easy and well-tested, best-practice setup of Mesos and its ecosystem (see the Marathon sketch below).
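For a concrete feel of what "including Marathon" buys you: once Marathon is up, deploying an app is a single REST call. A minimal sketch with a placeholder Marathon URL and an illustrative app definition:

```python
# Minimal sketch: deploy an app via Marathon's REST API (which DCOS
# sets up for you). The URL and app definition are placeholders.
import requests

app = {
    "id": "/hello",
    "cmd": "python3 -m http.server $PORT0",
    "cpus": 0.1,
    "mem": 64,
    "instances": 2,
}
resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
resp.raise_for_status()
print("deployment accepted:", resp.json()["id"])
```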
Hope this helps!

Is HortonWorks Sandbox VM preferred in production environment?

The Hortonworks HDP could be implemented in two ways:
Sandbox (VM)
Manual Installation.
I would like to understand whether the HDP Sandbox or the manual installation is preferred in a production environment. The choice could be made for obvious reasons like performance, but are there any other considerations?
The Hortonworks Sandbox allows you to try out the features and functionality of Hadoop and its ecosystem of projects. That's all.
If you want to go to production, you have three installation types:
Automated with Ambari (a small Ambari API sketch follows this answer)
Manual
Cloud with Cloudbreak
Regards,
Alain
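As a small illustration of the Ambari route, you can confirm what Ambari manages through its REST API. A minimal sketch, assuming the requests library, a placeholder host, and the out-of-the-box default credentials:

```python
# Minimal sketch: list the clusters Ambari manages. The host is a
# placeholder and admin/admin is only the default login -- change both.
import requests

resp = requests.get(
    "http://ambari.example.com:8080/api/v1/clusters",
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
)
resp.raise_for_status()
for item in resp.json()["items"]:
    print("cluster:", item["Clusters"]["cluster_name"])
```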
Performance: Hadoop is about parallel processing. You can't do that with a single node.
Storage: Hadoop uses a distributed file system. With a single node, your storage space is very limited.
Redundancy: if this node dies, everything is gone. A normal Hadoop configuration includes a replication factor (3 by default) so that when some nodes or disks go down, all of the data is still reachable. Similarly with a standby NameNode. (A quick capacity calculation is sketched below.)
There are a few other points, but these are the main ones IMO.
Single-node Hadoop only makes sense for proof of concept and experimentation, not for providing production-level value.
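To make the storage point concrete: with the default replication factor of 3, usable HDFS capacity is roughly a third of raw capacity. A quick back-of-the-envelope sketch (all numbers are illustrative):

```python
# Rough HDFS capacity estimate. With a replication factor of 3, every
# block is stored three times, so usable space is about raw space / 3.
# All figures below are made-up examples.
nodes = 4
disk_per_node_tb = 8.0
replication_factor = 3

raw_tb = nodes * disk_per_node_tb
usable_tb = raw_tb / replication_factor
print(f"raw: {raw_tb:.0f} TB, usable at {replication_factor}x replication: {usable_tb:.1f} TB")
```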

Cloudera installation Doubts?

I am new to Cloudera. I installed Cloudera on my system successfully, and I have two doubts:
Consider a machine with some nodes already using Hadoop, with some data. Can we install Cloudera to use the existing Hadoop without making any changes or modifications to the data stored in the existing Hadoop?
I installed Cloudera on my machine, and I have another three machines to add to the cluster. Do I need to install Cloudera on those three machines before adding them to the cluster, or can we add a node to the cluster without installing Cloudera on that particular node?
Thanks in advance. Can anyone please give some information about the above questions?
Answers to your questions:
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link.
Excerpt:
Overview
The migration process does require a moderate understanding of Linux system administration. You should make a plan before you start. You will be restarting some critical services such as the name node and job tracker, so some downtime is necessary. Given the value of the data on your cluster, you'll also want to be careful to take recent backups of any mission-critical data sets as well as the name node meta-data.
Backing up your data is most important if you're upgrading from a version of Hadoop based on an Apache Software Foundation release earlier than 0.20.
2. The CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by using a tool that copies out data in parallel, such as the DistCp tool offered in CDH4.
Other sources
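DistCp itself is a command-line tool; if you script the migration, the invocation can be as simple as the sketch below (the NameNode addresses and paths are hypothetical placeholders):

```python
# Minimal sketch: copy data between clusters in parallel with DistCp.
# Source and destination namenode addresses/paths are made-up examples.
import subprocess

subprocess.run(
    ["hadoop", "distcp",
     "hdfs://old-namenode:8020/user/data",
     "hdfs://new-namenode:8020/user/data"],
    check=True,
)
```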
Regarding your second question,
Again, from the manual page:
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless the cluster is large (more than a few tens of nodes), and the master host (or hosts) should not run the Secondary NameNode (if used), DataNode or TaskTracker services. In a large cluster, it is especially important that the Secondary NameNode (if used) runs on a separate machine from the NameNode. Each node in the cluster except the master host(s) should run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager, it will automatically do all the necessary setup, i.e., install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referring to the manual properly. Have a close look at it; it answers most of these questions.
Answering your second question:
You can add them directly, after installing a few prerequisites such as openssh-clients, firewall configuration, and Java.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
You should be connected to the internet while adding the nodes.
I hope this helps you :)
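Before adding the nodes, it can help to confirm that passwordless SSH actually works to each of them. A minimal sketch, assuming the paramiko library and hypothetical hostnames:

```python
# Minimal sketch: check key-based SSH to each candidate node before
# adding it to the cluster. Hostnames and the username are placeholders;
# key-based auth must already be configured for this to succeed.
import paramiko

for host in ["node1.example.com", "node2.example.com", "node3.example.com"]:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username="root", timeout=5)  # uses local SSH keys
        print(f"{host}: passwordless SSH OK")
    except Exception as exc:
        print(f"{host}: failed ({exc})")
    finally:
        client.close()
```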
