How do I integrate a Presto cluster with a Hadoop cluster?

We have a Hadoop cluster based on Ambari.
Since the Thrift server has poor performance, we decided to replace it with Presto.
Our current Hadoop cluster has the following machines:
960 data node machines (based on Red Hat 7).
A few words about Presto:
Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), and relational sources.
We installed the new Presto cluster as follows.
First we installed the OS (Red Hat 7) on 13 machines in total:
1 machine for the Presto coordinator
12 machines for Presto workers
After installing the OS, we successfully installed Presto (coordinator + workers).
Now we are stuck on how to integrate the Presto cluster with the Hadoop cluster.
Here is a short example from the Hive connector configuration (hive.properties),
which has the following property:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Since these files are located on the data node machines, and of course not on the Presto worker machines, I assume we need to copy them from one of the data node machines to the Presto worker machines.
Am I right here?

You normally do not need to configure hive.config.resources to allow Presto to talk to your HDFS cluster. Try using Presto without that configuration. Only configure it if you have special requirements such as Hadoop KMS.
To configure it, copy the appropriate Hadoop config file(s) to your Presto machines (coordinator and workers), then set hive.config.resources to point to those file(s).
See the Hive connector documentation for more details.
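For reference, a minimal catalog file sketch, assuming the Hive Metastore of the Ambari cluster is reachable over Thrift (the hostnames below are placeholders, not taken from the question):

# etc/catalog/hive.properties on every Presto node (coordinator and workers)
connector.name=hive-hadoop2
# point Presto at the Hive Metastore Thrift service of the Hadoop cluster
hive.metastore.uri=thrift://metastore-host:9083
# only uncomment for special setups (e.g. Hadoop KMS); the referenced files
# must then exist locally on each Presto node
#hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

Restart the Presto service on all nodes after changing catalog files.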

Related

hadoop and its technologies setup

For a study project requirement, I am selecting the following technologies because the source of data is SQL Server.
Initial data size is 100 GB, with about 10% growth per quarter.
Information
Hadoop – multi-node cluster (1 NameNode + 3 DataNodes)
Hadoop 3.1.2,
Apache Maven 3.6.0
Ubuntu 18.04
Ambari
The above setup is ready; the following items remain:
Sqoop: 1.4.7
Hive: 2.3.5
Oozie 5.0.0
Should they all be installed on separate machines?
What is the deployment strategy once development completed?
If you have the hardware available, then yes, every master service should be on separate machines for fault tolerance purposes.
Meaning, the Oozie server, HiveServer and the Hive Metastore are all separate.
Sqoop and the Hive client are only clients and can run on any NodeManager node.
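As a rough sketch of the Sqoop side of that setup (hostname, credentials, database, table and target directory are placeholder values, not from the question):

# run from any machine that has the Sqoop client and the Hadoop client configs;
# the SQL Server JDBC driver jar must be available under Sqoop's lib directory
sqoop import \
  --connect "jdbc:sqlserver://sqlserver-host:1433;databaseName=studydb" \
  --username loader -P \
  --table source_table \
  --target-dir /data/source_table \
  --num-mappers 4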

How to import data from HDFS (Hadoop) into ElasticSearch?

We have a big Hadoop cluster and recently installed Elastic Search for evaluation.
Now we want to bring data from HDFS to ElasticSearch.
ElasticSearch is installed in a different cluster, and so far we can run a Beeline or HDFS script to extract data from Hadoop into a file and then bulk load it into ElasticSearch from that local file.
Wondering if there is a direct connection from HDFS to ElasticSearch.
I start reading about it here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
But since our team is not DevOps (we do not configure or manage the Hadoop cluster) and can only access Hadoop via Kerberos/user/pass, I am wondering whether it is possible to configure this (and how) without involving the whole DevOps team that manages the Hadoop cluster to install/set up all these libraries before the direct connect.
How can we do it from the client side?
Thanks.
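(For reference, the elasticsearch-hadoop library linked above is typically used from Hive roughly as sketched below; the jar path, table layout, index name and node address are placeholders:)

-- in Beeline/Hive; the jar must be readable by HiveServer2 (a path on that host or an hdfs:// path)
ADD JAR /path/to/elasticsearch-hadoop.jar;
-- an external Hive table backed by an Elasticsearch index
CREATE EXTERNAL TABLE logs_es (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs/_doc', 'es.nodes' = 'es-host:9200');
-- writing into the external table ships the rows to Elasticsearch
INSERT INTO TABLE logs_es SELECT id, message FROM logs_hdfs;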

In a hadoop cluster, should hive be installed on all nodes? Install Pig

I am new to Hadoop / Pig and I have just started reading the docs.
There are lots of blogs on installing Hadoop in cluster mode.
I know that Pig runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes.
Should I also install Pig on all the cluster nodes or only on the master node?
You would want to install the Hive Metastore and HiveServer on 2 different nodes. By default, Hive uses the Derby database, but most people choose to go with MySQL, so there will be a MySQL server daemon as well.
So, not to confuse you any more:
Install HiveServer and WebHcat Server on one node
Install Hive Metastore and MySQL server on another node.
This is the best practice. If you have any other doubt you can ask!
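For example, that split typically shows up in hive-site.xml along these lines (the hostnames are placeholders):

<!-- on the metastore node: where the metastore keeps its data -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/metastore</value>
</property>
<!-- on HiveServer2/WebHCat and clients: where to find the metastore service -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>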
I cannot tell if the question is about Hive or Pig, but there's a difference between clients and servers.
For Hive, the master services are the Metastore and HiveServer2. You can install these daemons on the same server to improve network traffic between the metastore and the Hive query compiler. You only need one client to communicate with those masters.
For Pig, it talks directly to YARN and HDFS (and optionally Hive, if you use HCatalog). Again, it's only a client, so only one host needs it.
It is generally preferred to have a dedicated set of machines for Hive and the backing RDBMS for the metastore (MySQL or Postgres being the more popular options).
You also don't need to "install Pig in the cluster". For example, I could grab the Hadoop XML configs and run some Pig code against the YARN cluster from any outside computer after downloading Pig locally (same applies to Spark)
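A minimal sketch of that client-only setup (paths and the script name are placeholders):

# on any client machine, after downloading and unpacking Pig locally
export HADOOP_CONF_DIR=/path/to/copied/hadoop/conf   # core-site.xml, hdfs-site.xml, yarn-site.xml
export PATH=$PATH:/path/to/pig/bin
# run a Pig Latin script against the remote YARN cluster
pig -x mapreduce wordcount.pig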

Pivotal: HDFS-HAWQ - Migration to New Hardware

We have a PHD 3.0 Hadoop cluster with 5 nodes, installed with Ambari on Rackspace. We have to migrate it to Google Cloud (GCP).
But we cannot find the steps for how to conduct the following:
Q1: How do I migrate data, metadata and configuration from the old machines to the new machines? (Both the old and the target machines run CentOS 6.5.)
Q2: What components and folders should be backed up? What would be the commands?
Q3: How do I back up the namenode and datanodes?
Q4: Do we need to take a backup of the Ambari database as well?
Any help on this would be much appreciated.
I would personally prefer to provision a Hadoop cluster in GCP and move the data to the new cluster using distcp.
For HAWQ-managed tables, move the data to HDFS first and then run distcp.
On some occasions on AWS, I have moved data to S3 and imported it back into Hadoop.
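A rough sketch of the distcp step (hostnames, paths and the bucket name are placeholders; the gs:// form assumes the Google Cloud Storage connector is installed on the source cluster):

# copy HDFS data directly from the old cluster to the new one
hadoop distcp hdfs://old-namenode:8020/apps/hive/warehouse hdfs://new-namenode:8020/apps/hive/warehouse
# or stage the data in a Google Cloud Storage bucket first
hadoop distcp /apps/hive/warehouse gs://my-migration-bucket/hive-warehouse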

In a hadoop cluster, should hive be installed on all nodes?

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is basically used for processing structured and semi-structured data in Hadoop. With Hive we can also analyze large datasets stored in HDFS or in the Amazon S3 filesystem. To query the data, Hive provides a query language known as HiveQL, which is similar to SQL, so one can easily run ad-hoc queries for data analysis. With Hive we don't need to write complex MapReduce jobs; we just submit SQL queries, and Hive converts them into MapReduce jobs.
In the end, Hive SQL gets converted into MapReduce jobs, and just as we don't have to submit MapReduce jobs from every node in a Hadoop cluster, we don't need Hive to be installed on every node of the cluster.
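For example, a client/edge machine only needs something along these lines to query Hive (hostname, user and table are placeholders):

# connect to HiveServer2 on the master/edge node; nothing is installed on the worker nodes
beeline -u "jdbc:hive2://hiveserver2-host:10000/default" -n myuser -e "SELECT COUNT(*) FROM my_table;"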
