Synchronizing Ambari cluster configurations - hadoop

We have been exploring Apache Ambari with HDP 2.2 to set up a cluster. Our backend features three environments: testing, staging and production, which is standard practice in our industry.
When we deploy a cluster in the testing environment with Ambari, what is the easiest way to replicate the same cluster configuration in the staging and, later, the production environment?
The initial step seems easy: you create a cluster in the testing environment using the UI, export the configuration as a blueprint, and then use that blueprint to create a new cluster in the other environments. So far, so good.
Inevitably, we will need to change our Ambari configuration (e.g. deploy a new service, increase heap size for the JVMs, ...). I was hoping we could simply update the blueprint (via the UI or by hand) and then use the updated blueprint to update the different clusters as well. However, this does not seem possible unless you destroy and recreate the cluster, which seems a bit harsh (we don't want to lose our data).
Alternatively, we could use Ambari's REST API to apply specific updates to the configuration, but as configuration changes relative to the initial blueprint will inevitably accumulate, I am afraid this will become unwieldy and unmaintainable over time.
Can you suggest a better solution for this use case?

I believe the easiest way is to dump each service's configuration to a file and then import those configurations into the other clusters. This can be done with the Ambari API or with the script Ambari ships for updating configurations (/var/lib/ambari-server/resources/scripts/configs.sh).
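To make this concrete, here is a minimal sketch of that dump-and-import approach using the Ambari REST API from Python (my own illustration, not part of the original answer). The hostnames, cluster names and credentials are placeholders; in practice you would loop over every config type and restart the affected services after applying the new configuration.

# Sketch: copy one config type (e.g. core-site) from a source cluster to a
# target cluster through the Ambari REST API. Hostnames, credentials and
# cluster names below are placeholders; adjust them to your environments.
import time
import requests

AUTH = ("admin", "admin")                # Ambari credentials (placeholder)
HEADERS = {"X-Requested-By": "ambari"}   # required by Ambari for write calls

def get_config(ambari_url, cluster, config_type):
    """Fetch the currently active properties for one config type."""
    # Find the tag of the active (desired) version of this config type.
    tag = requests.get(
        f"{ambari_url}/api/v1/clusters/{cluster}",
        params={"fields": "Clusters/desired_configs"},
        auth=AUTH,
    ).json()["Clusters"]["desired_configs"][config_type]["tag"]
    # Fetch the properties stored under that tag.
    items = requests.get(
        f"{ambari_url}/api/v1/clusters/{cluster}/configurations",
        params={"type": config_type, "tag": tag},
        auth=AUTH,
    ).json()["items"]
    return items[0]["properties"]

def set_config(ambari_url, cluster, config_type, properties):
    """Create a new config version on the target cluster with the given properties."""
    body = {
        "Clusters": {
            "desired_config": {
                "type": config_type,
                "tag": f"version{int(time.time())}",  # any unique tag works
                "properties": properties,
            }
        }
    }
    requests.put(
        f"{ambari_url}/api/v1/clusters/{cluster}",
        json=body, auth=AUTH, headers=HEADERS,
    ).raise_for_status()

if __name__ == "__main__":
    props = get_config("http://ambari-test:8080", "test_cluster", "core-site")
    set_config("http://ambari-staging:8080", "staging_cluster", "core-site", props)

The configs.sh script mentioned above performs essentially the same get/set calls against the API, so either route keeps the environments in sync.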

Related

Elasticsearch on Kubernetes - 'Elastic Cloud (ECK)' vs 'Helm charts'

For the purpose of log file aggregation, I'm looking to set up a production Elasticsearch instance on an on-premise (vanilla) Kubernetes cluster.
There seem to be two main options for deployment:
Elastic Cloud (ECK) - https://github.com/elastic/cloud-on-k8s
Helm Charts - https://github.com/elastic/helm-charts
I've used the old (soon to be deprecated) helm charts successfully but just discovered ECK.
What are the benefits and disadvantages of both of these options? Any constraints or limitations that could impact long-term use?
The main difference is that the Helm Charts are pretty unopinionated while the Operator is opinionated: it has a lot of best practices built in, like a hard requirement on using security. Also, the Operator Framework is built on the reconciliation loop and will continuously check whether your cluster is in the desired state or not. Helm Charts are more like a package manager where you run specific commands (install a cluster in version X with Y nodes, now add 2 more nodes, now upgrade to version Z, ...).
If ECK is Cloud-on-Kubernetes, you can think of the Helm charts as Stack-on-Kubernetes. They're a way of defining exact specifications for running our Docker images in a Kubernetes environment.
Another difference is that the Helm Charts are open source while the Operator is free but uses the Elastic License (the main limitation being that you can't use it to run a paid Elasticsearch service).
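To illustrate the operator's declarative model (my own sketch, not part of the original answer): with ECK you submit an Elasticsearch custom resource describing the desired state and the operator's reconciliation loop drives the cluster toward it. The example below assumes the ECK operator is already installed and uses the official Kubernetes Python client; field names follow the ECK elasticsearch.k8s.elastic.co/v1 API, and the namespace and version are placeholders.

# Sketch: declare a 3-node Elasticsearch cluster as an ECK custom resource.
# The operator then creates and maintains the pods, secrets and services to
# match this declared state.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod

elasticsearch_cr = {
    "apiVersion": "elasticsearch.k8s.elastic.co/v1",
    "kind": "Elasticsearch",
    "metadata": {"name": "logging", "namespace": "elastic-system"},
    "spec": {
        "version": "8.13.0",          # placeholder stack version
        "nodeSets": [{
            "name": "default",
            "count": 3,               # scale by changing this and re-applying
            "config": {"node.store.allow_mmap": False},
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="elasticsearch.k8s.elastic.co",
    version="v1",
    namespace="elastic-system",
    plural="elasticsearches",
    body=elasticsearch_cr,
)

The Helm-chart route, by contrast, is imperative: you add the elastic Helm repository, run helm install, and later issue helm upgrade commands to change the cluster.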
1. Elastic Cloud (ECK):
ADVANTAGES
document oriented (JSON)
multilingual - the ICU plugin is used to index and tokenize multilingual content; it is an Elasticsearch plugin based on the Lucene implementation of the Unicode text segmentation standard
managing and monitoring multiple clusters
upgrading to new stack versions with ease
scaling cluster capacity up and down
changing cluster configuration
dynamically scaling local storage (includes Elastic Local Volume, a local storage driver)
scheduling backups
secure by default - clusters have encryption enabled and are protected with a strong default password right at creation time
free features - Canvas, Maps, Uptime
hot-warm-cold and custom topologies
official GKE support
free tier
DISADVANTAGES
it is not as good at being a data store as some other options like MongoDB, Hadoop, etc.; for smaller use cases it will perform fine, but if you are streaming TBs of data every day, you will find that it either chokes or loses data
its learning curve is much steeper
it is hard to justify when you can't or won't create a production-worthy setup for economic reasons; for test and dev a single node will work fine, but when you move to production you should have no less than a 3-node/2-replica setup
You can find more information here: ECK.
2. Elastic Stack Kubernetes Helm Charts:
ADVANTAGES
huge community
easy to deploy and use in Kubernetes
each component in the stack takes care of a different step in the logging pipeline, and together they all provide a comprehensive and powerful logging solution for Kubernetes
rich analysis capabilities
DISADVANTAGES
difficult to maintain at scale
You can find more information here: open-source-monitoring-tools-for-kubernetes.

Automate NiFi Deployment

I am looking for the best approach for deploying NiFi flows from my DEV environment to TEST/PROD environments.
The links below give an overview of how this can be achieved; basically, they explain that we have to make use of the NiFi CLI to automate the deployment.
https://pierrevillard.com/2018/04/09/automate-workflow-deployment-in-apache-nifi-with-the-nifi-registry/
https://bryanbende.com/development/2018/01/19/apache-nifi-how-do-i-deploy-my-flow
But I was wondering whether there is an option to create a general script that can be used to deploy different types of flows. Since the variables that need to be set differ from one processor to another, I am not sure how this could be done.
Any help is appreciated.
I am the primary maintainer of NiPyAPI, a Python client for working with Apache NiFi. I have an example script covering the steps you are requesting, though it is not part of the official Apache project.
https://github.com/Chaffelson/nipyapi/blob/master/nipyapi/demo/fdlc.py
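To give a feel for what a general deployment script could look like, here is a rough sketch of my own (not taken from the fdlc.py demo above) built on NiPyAPI. It promotes a versioned flow from the registry into a target environment and then applies per-environment variables read from a JSON file. The hostnames, bucket/flow names and variables file are placeholders, and the function signatures are based on NiPyAPI's versioning and canvas modules, so verify them against the version you have installed.

# Sketch: promote a versioned flow into a target NiFi environment and apply
# environment-specific variables. All names and hosts are placeholders.
import json
import nipyapi

# Point the client at the target environment's NiFi and NiFi Registry.
nipyapi.config.nifi_config.host = "http://nifi-prod:8080/nifi-api"
nipyapi.config.registry_config.host = "http://registry:18080/nifi-registry-api"

def deploy_flow(registry_client_name, bucket_name, flow_name, variables_file):
    """Import the latest version of a flow and set its variables."""
    reg_client = nipyapi.versioning.get_registry_client(registry_client_name)
    bucket = nipyapi.versioning.get_registry_bucket(bucket_name)
    flow = nipyapi.versioning.get_flow_in_bucket(bucket.identifier, flow_name)

    # Instantiate the flow under the root process group (latest version).
    pg = nipyapi.versioning.deploy_flow_version(
        parent_id=nipyapi.canvas.get_root_pg_id(),
        location=(0, 0),
        bucket_id=bucket.identifier,
        flow_id=flow.identifier,
        reg_client_id=reg_client.id,
    )

    # Apply per-environment variables, e.g. {"db.url": "...", "kafka.brokers": "..."}.
    with open(variables_file) as f:
        env_vars = json.load(f)
    nipyapi.canvas.update_variable_registry(pg, list(env_vars.items()))
    return pg

if __name__ == "__main__":
    deploy_flow("prod-registry", "my-bucket", "my-flow", "prod-variables.json")

The fdlc.py demo linked above walks through a fuller dev-to-prod promotion cycle.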

Any open-source software for me to manage a big-data cluster including hadoop/hive/spark?

I am looking for an open-source system to manage my big-data cluster, which is composed of 50+ machines and includes components like Hadoop, HDFS, Hive, Spark, Oozie, HBase, ZooKeeper and Kylin.
I want to manage them in a web system. By "manage" I mean:
I can restart a component with a single click, one machine at a time; for example, when I click the "restart" button, the zookeeper component is restarted on one machine after another.
I can deploy a component with a single click; for example, when I deploy a new zookeeper, I prepare a compiled zookeeper on one machine, click "deploy", and it is deployed to all machines automatically.
I can upgrade a component with a single click; for example, when I want to upgrade a zookeeper cluster, I put the updated zookeeper on one machine, click "upgrade", and the updated zookeeper replaces all the old zookeeper versions on the other machines.
All in all, what I want is a management system for my big-data cluster that covers restart, deploy, upgrade, viewing logs, modifying configuration and so on, or at least some of these.
I have considered Ambari, but it can only be used to deploy my whole system from scratch, and my big-data cluster has already been running for a year.
Any suggestions?
Ambari is what you want. It's the only open source solution for managing Hadoop stacks that meets your listed requirements. You are correct that it doesn't work with already provisioned clusters; this is because, to achieve such tight integration with all those services, it must know how they were provisioned, where everything is, and what configurations exist for each. The only way Ambari can know that is if it was used to provision those services.
Investing the time to recreate your cluster with Ambari may feel painful, but in the long run it will pay off thanks to the added ability to upgrade and manage services so easily going forward.

How do I install components such as Apache Drill and Apache Hue in IBM Bluemix BigInsights Apache Hadoop

I am new to the IBM Bluemix platform and am exploring its BigInsights service. I can see pre-configured components such as Pig, Hive, HBase and others, but I want to know how I can install services like Drill or, say, Hue, which are not configured by default. Also, ssh to the cluster nodes allows only restricted access with no sudo rights, in case one needs to run yum commands. Does Bluemix allow root access? I cannot see any. Thanks in advance.
As far as I know, it is not possible.
But you can use http://www.softlayer.com/ to build your own IOP (IBM Open Platform) Cluster in the cloud.
If you are interested in IBM's value-adds and just want to try them out:
https://www.youtube.com/watch?v=4p7LDeu_qQQ is a nice tutorial for setting up your own cluster via Docker.
This tutorial should be still valid for Hue:
https://developer.ibm.com/hadoop/2015/06/02/deploying-hue-on-ibm-biginsights/
Installing Drill doesn't look complicated:
https://drill.apache.org/docs/installing-drill-in-distributed-mode/
In conclusion: you need to move away from Bluemix if you want a more customised BigInsights. But there are options: SoftLayer, AWS, .. or just your local computer (if you have sufficient resources, since some components like HBase need a minimum number of nodes).

Configuring AWS cluster using automation script

We are looking into the possibility of an automation script to which we can specify how many master and data nodes we need, and it would configure a cluster, probably taking the credentials from a properties file.
Currently our approach is to log in to the console and configure the Hadoop cluster manually. It would be great if there were an automated way to do this.
I've seen this done very nicely using Foreman, Chef, and Ambari Blueprints. Foreman was used to provision the VMs, and Chef scripts were used to install Ambari, configure the Ambari blueprint, and create the cluster from the blueprint.
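For reference, the cluster-creation step that such scripts automate boils down to two Ambari REST calls: registering a blueprint and then posting a cluster-creation template that maps hosts to host groups. Below is a minimal sketch of my own in Python; the hostname, credentials, blueprint name and the two JSON files are placeholders.

# Sketch: create a cluster from an Ambari blueprint via the REST API.
# "my-blueprint.json" is the blueprint (e.g. exported from an existing
# cluster); "cluster-template.json" maps concrete hosts to its host groups.
import json
import requests

AMBARI = "http://ambari-server:8080"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}

# 1. Register the blueprint under a name.
with open("my-blueprint.json") as f:
    blueprint = json.load(f)
requests.post(
    f"{AMBARI}/api/v1/blueprints/my-blueprint",
    json=blueprint, auth=AUTH, headers=HEADERS,
).raise_for_status()

# 2. Create the cluster from the blueprint. The template lists the hosts per
#    host group, so changing the node count means editing this file only.
with open("cluster-template.json") as f:
    template = json.load(f)  # {"blueprint": "my-blueprint", "host_groups": [...]}
requests.post(
    f"{AMBARI}/api/v1/clusters/my-cluster",
    json=template, auth=AUTH, headers=HEADERS,
).raise_for_status()

Scaling the number of master and data nodes then becomes a matter of generating the host_groups section of the template from your properties file.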
