Installing Mahout on a Hadoop cluster

I have configured a Hadoop cluster using Cloudera CDH 5.3.
I created a 4-node cluster and now want to install Mahout on it.
I added the Mahout .jar file to the cluster, but I don't know what to do next.
Should I install it on all 4 nodes?
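Since Mahout is a client-side library whose jobs run as ordinary MapReduce on the cluster, one common approach is to install it on a single client/edge node only. A minimal sketch, assuming the CDH package repositories are configured; the HDFS paths in the example job are hypothetical:

```bash
# Install Mahout on one client node only; the MapReduce jobs it submits
# execute on the whole cluster, so the other nodes need nothing extra.
sudo yum install mahout   # CDH-packaged Mahout (assumes CDH repos are set up)

# Example job launched from that node (HDFS paths are hypothetical):
mahout seqdirectory -i /user/me/docs -o /user/me/docs-seq
```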

Related

Presto: building a Presto cluster to join an existing Hadoop cluster

We have a Hadoop cluster that contains all the relevant components/services:
HDFS
YARN
MapReduce
Hive
Tez
Pig
ZooKeeper
The Hadoop cluster contains 3 master machines, 12 data-node machines, and 3 Kafka machines.
Now we want to use Presto to run queries against our data sources (the Hadoop cluster / Hive),
so we built a new Presto cluster as follows:
1 Presto coordinator
8 Presto workers
All Presto cluster machines run Red Hat 7.2.
Now we want to install Presto on all of them,
but we are not sure whether Presto can be installed immediately on a fresh Linux OS,
or whether we need to install something in between, after the OS and before Presto.
The only requirement for Presto is a Java Virtual Machine (JVM). We recommend installing the latest OpenJDK 11 version, currently 11.0.2. After that, follow the Presto deployment instructions.
Python is required for the launcher (the script that starts the JVM), but this is normally available on a typical Linux distribution.
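A minimal pre-flight sketch for each Presto machine; `bin/launcher` is Presto's standard start script, and the snippet assumes the presto-server tarball is already unpacked with its `etc/` configs written:

```bash
# Verify the only real prerequisites on a fresh Red Hat box:
java -version      # expect an OpenJDK 11 build, e.g. 11.0.2
python --version   # the launcher script needs Python

# From the unpacked presto-server directory:
bin/launcher start   # start the Presto daemon in the background
```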

How to check the Hadoop distribution used in my cluster?

How can I tell whether my cluster has been set up using Hortonworks, Cloudera, or a plain installation of the Hadoop components?
Also, how can I find the port numbers of the various services?
It is difficult to identify the Hadoop distribution from port numbers, since the Apache, Hortonworks, and Cloudera distros use different port numbers.
Another option is to check for cluster-management service agents: Cloudera Manager's agent startup script is /etc/init.d/cloudera-scm-agent, and Hortonworks' Ambari agent startup script is /etc/init.d/ambari-agent. Vanilla Apache Hadoop will not have any such agents on the server; see the check below.
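A quick sketch of that check, using the startup-script paths named above:

```bash
# Presence of a management agent hints at the distro:
[ -f /etc/init.d/cloudera-scm-agent ] && echo "Cloudera Manager agent found"
[ -f /etc/init.d/ambari-agent ]       && echo "Ambari (Hortonworks) agent found"
# No hit on either usually means vanilla Apache Hadoop.
```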
Another option is to check the Hadoop classpath; the command below prints it.
`hadoop classpath`
Most Hadoop distributions include the distro name in the classpath. If the classpath does not contain either of the keywords below, the setup is a plain Apache installation.
`hdp` (Hortonworks)
`cdh` (Cloudera)
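A one-line sketch of that classpath check, assuming the distro keyword appears in the jar paths as described:

```bash
# Split the classpath into one entry per line and look for distro keywords:
hadoop classpath | tr ':' '\n' | grep -iE 'cdh|hdp' | head -3
# No matches suggests a plain Apache installation.
```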
The simplest way is to run the `hadoop version` command; the output shows which version of Hadoop you have, plus which distribution and distribution version you are running. If you find the word `cdh`, the distribution is Cloudera; `hdp` means Hortonworks.
For example, I am running Cloudera, and the `hadoop version` command produces output like the example below.
The first line shows the Hadoop version followed by the Hadoop distribution and its version.
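A rough illustration of such output on a CDH 5.3 cluster (the version strings here are assumptions, not the original poster's actual output):

```
Hadoop 2.5.0-cdh5.3.0
Subversion http://github.com/cloudera/hadoop -r <commit-hash>
Compiled by jenkins on <date>
```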
Hope this helps.
The `hdfs version` command will also give you the Hadoop version and its distribution.

Apache HAWQ installation built on top of HDFS

I would like to install Apache HAWQ on top of Hadoop.
Before installing HAWQ, I should install Hadoop and configure all my nodes.
I have the four nodes below, and my question follows.
Should I install a Hadoop distribution on hawq-master?
1. hadoop-master // NameNode, Secondary NameNode, ResourceManager, HAWQ Standby
2. hawq-master // HAWQ Master
3. datanode01 // DataNode, HAWQ Segment
4. datanode02 // DataNode, HAWQ Segment
I wrote the role of each node next to it above.
In my opinion, I should install Hadoop on hadoop-master, datanode01, and datanode02, setting hadoop-master as the namenode (master) and the others as datanodes (slaves). Then I will install Apache HAWQ on all the nodes, setting hawq-master as the master node, hadoop-master as the HAWQ Standby, and the other two nodes as HAWQ segments.
What I want is to install HAWQ on top of Hadoop. So I think hawq-master should be built on top of Hadoop, yet it has no connection with hadoop-master.
If I proceed with the above procedure, I think I don't have to install a Hadoop distribution on hawq-master. Is my thinking right for successfully installing HAWQ on top of Hadoop?
If Hadoop should be installed on hawq-master, then which is correct?
1. `hawq-master` should be set as `namenode`.
2. `hawq-master` should be set as `datanode`.
Any help will be appreciated.
Honestly, there are no strict constraints on how Hadoop and HAWQ are installed, as long as they are configured correctly.
Regarding your concern, "I think the hawq-master should be built on top of hadoop, but there are no connection with hadoop-master": IMO it should read "HAWQ should be built on top of Hadoop". We configure the hawq-master conf file (hawq-site.xml) so that HAWQ can connect to Hadoop; see the sketch below.
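A sketch of that connection: `hawq_dfs_url` is the documented hawq-site.xml property that points HAWQ at the HDFS namenode; the install path and the host/port/directory value below are assumptions matching the question's layout:

```bash
# Verify where HAWQ points at HDFS (the install path is an assumption):
grep -A1 '<name>hawq_dfs_url</name>' /usr/local/hawq/etc/hawq-site.xml
# Expected value for this layout: <value>hadoop-master:8020/hawq_default</value>
```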
Usually we could give each of the HAWQ master and Hadoop master components its own node, but some of them can be colocated to save nodes. HDFS datanodes and HAWQ segments, however, are usually installed together. Taking the workload of each machine into account, we could lay them out as below:
```
node            hadoop role          hawq role
hadoop-master   namenode             hawq standby
hawq-master     secondarynamenode    hawq master
other nodes     datanode             segment
```
If you configure HAWQ with YARN integration, there will also be a ResourceManager and NodeManagers in the cluster:
```
node            hadoop role                         hawq role
hadoop-master   namenode                            hawq standby
hawq-master     secondarynamenode, resourcemanager  hawq master
other nodes     datanode, nodemanager               segment
```
Installing them together does not mean they have connections; it's your config files that let them reach each other.
You can install all the master components together, but that may be too heavy for one machine. Read more information about Apache HAWQ at http://incubator.apache.org/projects/hawq.html and the docs at http://hdb.docs.pivotal.io/211/hdb/index.html.
Besides, you can subscribe to the dev and user mailing lists: send email to dev-subscribe@hawq.incubator.apache.org / user-subscribe@hawq.incubator.apache.org to subscribe, and send emails to dev@hawq.incubator.apache.org / user@hawq.incubator.apache.org to ask questions.

Can open-source HBase work on the Cloudera distribution of Hadoop?

I have a Cloudera Distribution installed as a 5-node cluster. I do not want to use the HBase parcel that comes with Cloudera;
instead, I want to use only HDFS from the Cloudera setup, with an open-source version of HBase.
So my question is: will this work, or will I have to install the plain open-source version of Apache Hadoop for HDFS and then run the open-source version of Apache HBase on top of it?
As long as the Hadoop version on the cluster matches the version of the Hadoop client bundled with your HBase release, it should all work.
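A quick sketch of that compatibility check; `$HBASE_HOME` is an assumption for wherever the open-source HBase tarball is unpacked:

```bash
# Cluster side: which Hadoop does CDH provide?
hadoop version | head -1                 # e.g. "Hadoop 2.x.y-cdh5.z"

# HBase side: which Hadoop client does this HBase release bundle?
ls $HBASE_HOME/lib/hadoop-common-*.jar   # the version suffix names the bundled client
```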

How to install Impala on an already-running Hadoop cluster

I have an already up-and-running 5-node Hadoop cluster. I want to install Impala on the HDFS cluster without Cloudera Manager. Can anyone guide me through the process, or point me to a link that does?
Thanks.
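A hedged sketch of the usual package-based route, assuming Cloudera's Impala packages are reachable from the OS package manager; the daemon names are the standard Impala services:

```bash
# Install the packages, then start the statestore and catalog once each,
# and an impala-server on every node that runs an HDFS DataNode:
sudo yum install impala impala-server impala-state-store impala-catalog impala-shell
sudo service impala-state-store start    # one node
sudo service impala-catalog start        # one node
sudo service impala-server start         # every DataNode
```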
