Hadoop and its technologies setup

For a study project, I am selecting the following technologies because the source of the data is SQL Server.
The initial data size is 100 GB, with about 10% growth per quarter.
Information
Hadoop – multi-node cluster (1 NameNode + 3 DataNodes)
Hadoop 3.1.2
Apache Maven 3.6.0
Ubuntu 18.04
Ambari
The above setup is ready; the following items remain:
Sqoop 1.4.7
Hive 2.3.5
Oozie 5.0.0
Should they all be installed on separate machines?
What is the deployment strategy once development is completed?

If you have the hardware available, then yes, every master service should be on a separate machine for fault-tolerance purposes.
Meaning, the Oozie server, Hive server, and Hive metastore are all separate.
Sqoop and the Hive client are only clients and can run on any NodeManager node.
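For the SQL Server source mentioned in the question, a minimal Sqoop-into-Hive import might look like the sketch below; the host, database, table, and user names are hypothetical, and the Microsoft SQL Server JDBC driver jar needs to be in Sqoop's lib directory first:

# hypothetical host, database, table, and user; the table needs a primary key
# (or add --split-by <column>) for a parallel import
sqoop import \
  --connect "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=sales" \
  --username etl_user -P \
  --table orders \
  --hive-import --hive-table staging.orders \
  --num-mappers 4

Because Sqoop is only a client, a command like this can be run from any node that has the Hadoop and Hive client configs.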

Related

How do I integrate a Presto cluster with a Hadoop cluster?

We have a Hadoop cluster based on Ambari.
Since the Thrift server has poor performance, we decided to replace it with Presto.
Our current Hadoop cluster has the following machines:
960 DataNode machines (running Red Hat 7)
A few words about Presto:
Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), and relational sources.
We installed the new Presto cluster as follows.
First we installed the OS (Red Hat 7) on a total of 13 machines:
1 machine for the Presto coordinator
and 12 machines for Presto workers.
After installing the OS, we successfully installed Presto (coordinator + workers).
Now we are stuck on how to integrate the Presto cluster with the Hadoop cluster.
I will give a short example about the Hive connector (hive.properties).
We have the following property:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Since these files are located on the DataNode machines, and of course not on the Presto worker machines, I assume we need to copy them from one of the DataNode machines to the Presto worker machines.
Am I right here?
You normally do not need to configure hive.config.resources to allow Presto to talk to your HDFS cluster. Try using Presto without that configuration. Only configure it if you have special requirements such as Hadoop KMS.
To configure it, copy the appropriate Hadoop config file(s) to your Presto machines (coordinator and workers), then set hive.config.resources to point to those file(s).
See the Hive connector documentation for more details.
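As a rough illustration, a minimal hive.properties for the Presto Hive connector could look like this; the metastore hostname is hypothetical, and hive.config.resources stays commented out unless you actually need it (e.g. an HA NameNode or KMS):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083
# only if required, after copying the XMLs to the same path on every Presto node:
# hive.config.resources=/etc/presto/hadoop/core-site.xml,/etc/presto/hadoop/hdfs-site.xml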

In a Hadoop cluster, should Hive or Pig be installed on all nodes?

I am new to Hadoop / Pig and I have just started reading the docs.
There are lots of blogs on installing Hadoop in cluster mode.
I know that Pig runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes; should I also install Pig on all the cluster nodes, or only on the master node?
You would want to install the Hive Metastore and Hive Server on two different nodes. By default, Hive uses a Derby database, but most people choose to go with MySQL, so there will also be a MySQL server daemon.
So as not to confuse you any further:
Install HiveServer and WebHCat Server on one node.
Install the Hive Metastore and MySQL server on another node.
This is the best practice. If you have any other doubts, you can ask!
I cannot tell if the question is about Hive or Pig, but there's a difference between clients and servers.
For Hive, the master services are the Metastore and HiveServer2. You can install these daemons on the same server to improve network traffic between the metastore and the Hive query compiler. You only need one client to communicate with those masters.
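For example, a single Beeline client can reach HiveServer2 from any machine that can see it over the network; the hostname and user below are hypothetical:

beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -n etl_user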
For Pig, it communicates directly with YARN and HDFS (and optionally Hive, if you use HCatalog). Again, it's only a client, so only one host needs it.
It is generally preferred to have a dedicated set of machines for Hive and the backing RDBMS for the metastore (MySQL or Postgres being the more popular options).
You also don't need to "install Pig in the cluster". For example, I could grab the Hadoop XML configs and run some Pig code against the YARN cluster from any outside computer after downloading Pig locally (same applies to Spark)
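As a rough sketch of that, assuming Pig was unpacked on an edge machine and the cluster's *-site.xml files were copied to /etc/hadoop/conf there:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export PIG_CLASSPATH=$HADOOP_CONF_DIR      # lets Pig pick up the cluster configs
pig -x mapreduce wordcount.pig             # hypothetical script; runs as MapReduce jobs on the cluster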

How to set up Spark on multi-node Hadoop cluster?

I would like to install Hadoop HDFS and Spark on multi-node cluster.
I was able to successfully install and configure Hadoop on multi-node cluster. I have also installed and configured Spark on master node.
My doubt is: do I also have to configure Spark on the slave nodes?
No, you do not. You're done. You have already done more than you needed to in order to submit Spark applications to Hadoop YARN (which I take to be your cluster manager).
Spark is a library for distributed computations on massive datasets and as such it belongs solely to your Spark applications (not any cluster you may use).
Time to spark-submit Spark applications!
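A minimal submission could look like the sketch below, assuming the Hadoop client configs live under /etc/hadoop/conf on the machine you submit from and my_app.py is your own application:

export HADOOP_CONF_DIR=/etc/hadoop/conf    # lets Spark find the YARN and HDFS configs
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  my_app.py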

Existing Cluster monitoring by Hortonworks Ambari

I have an existing 10-node cluster on RHEL 6.6 that was set up with plain Apache Hadoop configuration XMLs. Now I would like to check the cluster status with Ambari. Is it possible to install Hortonworks Ambari just to monitor the cluster, without installing Hadoop through it?
No, Ambari must provision the cluster it's monitoring.
Ambari is designed around a Stack concept where each stack consists of several services. A stack definition is what allows Ambari to install, manage and monitor the services in the cluster.
In order for you to use Ambari with the Hadoop core that you built, you would have to provide your own Ambari stack definition.
Specifically, in your case, your existing Hadoop installation would not have the necessary alerts.json descriptors that Ambari uses to provide alerts for any given service.

How to install Apache Spark on HortonWorks HDP 2.2 (built using Ambari)

I successfully built a 5 node cluster of HortonWorks HDP 2.2 using Ambari.
However I don't see Apache Spark in the installed services list.
I did some research and found that Ambari does not install certain components like Hue etc. (Spark was not in that list, but I guess it's not installed).
How do I do a manual install of Apache Spark on my 5-node HDP 2.2 cluster?
Or should I delete my cluster and perform a fresh install without using Ambari?
Hortonworks support for Spark is arriving but not fully complete (details and blog).
Instructions for how to integrate Spark with HDP can be found here.
You could build your own Ambari Stack for Spark. I recently did just that, but I cannot share that code :(
What I can do is share a tutorial I did on how to do any stack for Ambari, including Spark. There are many interesting issues with Spark that need to be addressed and are not covered through the tutorial. Anyways hope it helps. http://bit.ly/1HDBgS6
There is also a guide from the Ambari people here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38571133.
1) Ambari 1.7x does not install Accumulo, Hue, Ranger, or Solr services for the HDP 2.2 Stack.
For installing Accumulo, Hue, Knox, Ranger, and Solr services, install HDP manually.
2) Apache Spark 1.2.0 on YARN with HDP 2.2: here.
3) Spark and Hadoop: Working Together:
Standalone deployment: With the standalone deployment one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can then run arbitrary Spark jobs on her HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
Hadoop Yarn deployment: Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark in their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of Spark.
Spark In MapReduce : For the Hadoop users that are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it! This tremendously lowers the barrier of deployment, and lets virtually everyone play with Spark.
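For the YARN option on an HDP 2.2 / Spark 1.2 cluster, a quick smoke test could look like the following; the examples jar path is a placeholder and varies by install:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  /path/to/spark-examples.jar 10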
