Installing Spark through Ambari - hadoop

I've configured a cluster of VMs via Ambari and am now trying to install Spark.
In all the tutorials (e.g. here) it looks pretty simple: installing Spark works just like installing any other service.
But it appears that in my Ambari instance there is simply no such entry.
How can I add a Spark entry to the Ambari services?

There should be a SPARK folder under the /var/lib/ambari-server/resources/stacks/HDP/2.2/services directory. Additionally, there should be spark folders, identified by their version number, under /var/lib/ambari-server/resources/common-services/SPARK. Either someone modified your environment or it's a bad and/or non-standard install of Ambari.
I would suggest re-installing, as it is hard to say exactly what you need to add when it's unclear what else may be missing from the environment.
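A quick way to check is to look for the service definition on the Ambari server host. This is a minimal sketch, assuming the default install layout described above; the exact stack version folder (2.2 here) depends on your cluster.

# look for the Spark service definition in the stack and in common-services
ls /var/lib/ambari-server/resources/stacks/HDP/2.2/services/SPARK
ls /var/lib/ambari-server/resources/common-services/SPARK

# if you restore the SPARK folders (e.g. from a clean Ambari install of the same version),
# restart the server so it re-reads the stack definitions
ambari-server restart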

Related

How do we install Apache BigTop with Ambari?

I am trying to find out how to deploy a Hadoop cluster with Ambari by using Apache Bigtop.
According to the latest release, Bigtop 1.5:
https://blogs.apache.org/bigtop/
my understanding is that a Bigtop Mpack was added as a new feature, which enables users to deploy Bigtop components via Apache Ambari.
I am able to install the Bigtop components via the command line, but cannot find any documentation on how to install these Bigtop Hadoop components via Ambari.
Can someone please point me to documentation that explains how to install the various Hadoop components (Bigtop packages) via Ambari?
Thanks,
I'm from the Bigtop community. I don't have a comprehensive answer, but the Bigtop user mailing list recently had a discussion with several technical details that may answer your question:
https://lists.apache.org/thread.html/r8c5d8dfdee9b7d72164504ff2f2ea641ce39aa02364d60917eaa9fa5%40%3Cuser.bigtop.apache.org%3E
OTOH, you are always welcome to join the mailing list and ask questions. Our community is active and happy to answer questions.
Build a repo of Bigtop.
To install that repo with Ambari, you have to register the stack/version. You will need to create a version file; I found an example of one here. (A rough sketch of the registration step follows below.)
Complete the installation like you would with a normal build.
This is highly theoretical (I haven't done this before). I have worked with a Bigtop Mpack before that took care of some of this work, but it's not production ready yet and works with an old version of Ambari, not the newest (I was able to install/stop/start HDFS/Hive). The instructions above should work with any version of Ambari.
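As an illustration of the registration step, here is a minimal sketch using Ambari's REST API. The endpoint and payload are from memory of the Ambari/HDP docs, so verify them against your release; the host names, credentials, and version-file URL are placeholders.

# register a version definition file (VDF) that points at your Bigtop repo
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d '{"VersionDefinition": {"version_url": "http://repo-host/bigtop_version_definition.xml"}}' \
  http://ambari-host:8080/api/v1/version_definitions

Registering the same file through the versions page of the Ambari admin UI accomplishes the same thing.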
I have been able to test Matt Andruff's theory with a VM. Here was my process and where I stopped (a rough command sketch follows below):
Built a repo of Apache Bigtop 1.5.0
Built Bigtop using gradlew
Installed Apache Ambari 2.6.1 on my system
Enabled the BigInsights build version XML file and modified the package version numbers to match my Bigtop build
Note: you can also build your own version file if you want, as Matt mentioned
Set up a webserver to host the package repo
Pointed the repo URLs in the XML version file at my local webserver
From there you can complete the installation of your packages as you would normally.
I have only done this with a single VM thus far and will be trying to spin up a small cluster using AWS in the coming weeks.
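For reference, the build-and-host part of that process might look roughly like this. It is a sketch assuming a Bigtop 1.5.0 source checkout; the component list and web server are arbitrary choices, and the gradlew task names are worth confirming with ./gradlew tasks.

# build rpm packages for the components you want (run from the Bigtop source tree)
./gradlew hadoop-rpm zookeeper-rpm spark-rpm

# turn the build output into a yum repo and serve it over HTTP
createrepo output/
cd output && python3 -m http.server 8000

# the version file's repo base URL then points at http://<this-host>:8000/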

How to build deb/rpm repos from open source Hadoop or publicly available HDP source code to be installed by Ambari

I am trying to install open source Hadoop, or build HDP from source, to be installed by Ambari. I can see that it is possible to build the Java packages for each component with the documentation available in the Apache repos, but how can I use those to build the rpm/deb packages that Hortonworks provides for the HDP distribution, so that they can be installed by Ambari?
@ShivamKhandelwal Building Ambari from source is a challenge, but one that can be accomplished with some persistence. In this post I have shared the commands I recently used to build Ambari 2.7.5 on CentOS (a compressed sketch also follows below):
Ambari 2.7.5 installation failure on CentOS 7
"Building HDP from source" is a very big task, as it requires building each component separately and creating your own public/private repo containing all the component repos or rpms for each operating system flavor. This is a monumental task which was previously managed by many employees and component contributors at Hortonworks.
When you install Ambari from HDP, it comes out of the box with Hortonworks' repos, including their HDP stack (HDFS, YARN, MR, Hive, etc). When you install Ambari from source, there is no stack. The only solution is to build your own stack, which is something I am an expert at doing.
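To give a flavor of the two halves of that work, here is a compressed sketch. The Ambari build commands are from memory of the Ambari build documentation, so treat the versions, flags, and output paths as assumptions; the repo step assumes you already have rpms built for each stack component, and the paths are placeholders.

# build Ambari itself from the source tarball (CentOS; needs maven, jdk 8, rpm-build)
tar xzf apache-ambari-2.7.5-src.tar.gz && cd apache-ambari-2.7.5-src
mvn versions:set -DnewVersion=2.7.5.0.0
mvn -B clean install rpm:rpm -DnewVersion=2.7.5.0.0 -DskipTests -Dpython.ver="python >= 2.6"
# the server and agent rpms land under ambari-server/target/rpm and ambari-agent/target/rpm

# packaging a stack repo: collect the component rpms in one directory and generate yum metadata
mkdir -p /var/www/html/mystack/
cp /path/to/component/rpms/*.rpm /var/www/html/mystack/
createrepo /var/www/html/mystack/
# a version definition file (and any .repo file) then points at http://<host>/mystack/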
I am currently building a DDP stack as an example to share with the public. I started this project by reverse engineering an HDF management pack, which includes the stack structure (files/folders) to roll out NiFi, Kafka, Zookeeper, and more. I have customized it to be my own stack with my own services and components (NiFi, Hue, Elasticsearch, etc).
My goal with DDP is to eventually make my own repos for the components and services I want, with the versions I want to install in my cluster. Next I will copy some HDP components like HDFS, YARN, and Hive from the HDP stack directly into my DDP stack, using the last free public HDP stack (HDP 3.1.5).
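To make the "stack structure" concrete, here is a sketch of the skeleton Ambari expects, modeled on the layout HDP uses under /var/lib/ambari-server/resources/stacks/. The DDP/1.0 and NIFI names are just the example from above, and the comments describe the usual role of each file rather than a guaranteed layout for every Ambari version.

# skeleton of a custom stack definition on the Ambari server host
mkdir -p /var/lib/ambari-server/resources/stacks/DDP/1.0/repos
mkdir -p /var/lib/ambari-server/resources/stacks/DDP/1.0/services/NIFI

# DDP/1.0/metainfo.xml                 -> marks the stack version (and whether it is active)
# DDP/1.0/repos/repoinfo.xml           -> base URLs of the package repos for each OS
# DDP/1.0/services/NIFI/metainfo.xml   -> the service definition (or a pointer into common-services)

ambari-server restart    # Ambari re-reads the stack definitions on restart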

How to install CM over an existing non-CDH cluster

Is it possible to install CM over an existing non-CDH cluster?
For example, I have manually installed Hadoop and other services to my VMs.
Can I install CM and force it to manage my cluster?
It is doubtful you could do this, since CM expects Hadoop to be installed from either parcels or CDH packages. If you really wanted to try, it would be easier to install CM + CDH using packages and then overwrite the specific artifacts in the packages, but this could be very tedious.
It is not possible to install Cloudera Manager on a non-CDH cluster. One reason is that Cloudera expects the installation to be carried out using the CDH packages or the CDH parcels (and even CDH packages and CDH parcels can't coexist). Another reason is that the jars bundled by Cloudera are different from the jars available in the native distributions of components such as Hive. So it is not going to work.
Don't waste time attempting it.

Can I use Spark without Hadoop for development environment?

I'm very new to the concepts of Big Data and related areas, so sorry if I've made any mistakes or typos.
I would like to understand Apache Spark and use it only on my computer, in a development/test environment. Since Hadoop includes HDFS (Hadoop Distributed File System) and other software that only matters for distributed systems, can I discard it? If so, where can I download a version of Spark that doesn't need Hadoop? Here I can find only Hadoop-dependent versions.
What I need:
Run all of Spark's features without problems, but on a single computer (my home computer).
Everything I build on my computer with Spark should later run on a cluster without problems.
Is there any reason to use Hadoop or any other distributed file system for Spark if I will only run it on my computer for testing purposes?
Note that "Can apache spark run without hadoop?" is a different question from mine, because I do want to run Spark in a development environment.
Yes, you can install Spark without Hadoop.
Go through the official Spark documentation: http://spark.apache.org/docs/latest/spark-standalone.html
Rough steps (a command sketch follows further below):
Download a precompiled Spark build, or download the Spark source and build it locally
Extract the tar
Set the required environment variables
Run the start scripts
Spark (without Hadoop) is available on the Spark download page.
URL: https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
If this URL does not work, get it from the Spark download page.
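Putting those steps together, a minimal sketch for the spark-2.2.0-bin-hadoop2.7 build named above (a newer version works the same way; no Hadoop cluster or HDFS is involved):

tar xzf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7

# local mode: everything runs in one JVM, and local[*] uses all cores on the machine
./bin/spark-shell --master "local[*]"

# or start a single-machine standalone "cluster" and point the shell at it
./sbin/start-master.sh
./sbin/start-slave.sh spark://$(hostname):7077
./bin/spark-shell --master spark://$(hostname):7077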
This is not a proper answer to the original question; sorry, my fault.
If you want to run the "without Hadoop" Spark distribution tar.gz, there is an environment variable to set. This spark-env.sh worked for me:
#!/bin/sh
# the "without Hadoop" build needs an existing Hadoop installation's jars on the classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Hadoop CouchDB Elastic Search

I have already installed CouchDB (ver 1.1.0) and Elasticsearch (0.17.6) on my Fedora machine. I now want to install Hadoop MapReduce (http://hadoop.apache.org/mapreduce/) and Hadoop DFS (http://hadoop.apache.org/hdfs/) on this machine, but I wonder whether there are any conflicts or problems between them. Will Elasticsearch and CouchDB still function properly?
Thanks for your answers
I see no reason for conflict. I wouldn't put all of these on one production machine, because of the performance issues, but if it's your development box, then go ahead.
CouchDB is a project written in Erlang that uses Mozilla's SpiderMonkey for executing JavaScript queries.
Hadoop is pure Java and will not conflict with the above in any way.
Elasticsearch and Lucene are also Java, and they won't conflict with Hadoop because their startup scripts define specific classpaths, so multiple installed versions of the same libraries shouldn't create an issue.
