Can I use Spark without Hadoop for a development environment?

I'm very new to Big Data and related areas, so apologies for any mistakes or typos.
I would like to understand Apache Spark and use it only on my computer, in a development / test environment. Since Hadoop includes HDFS (Hadoop Distributed File System) and other software that only matters for distributed systems, can I leave it out? If so, where can I download a version of Spark that doesn't need Hadoop? Here I can only find Hadoop-dependent versions.
What I need:
Run all Spark features without problems, but on a single computer (my home computer).
Everything I build on my computer with Spark should run on a future cluster without problems.
Is there any reason to use Hadoop or another distributed file system with Spark if I will only run it on my computer for testing purposes?
Note that "Can apache spark run without hadoop?" is a different question from mine, because I want to run Spark in a development environment.

Yes, you can install Spark without Hadoop.
See the official Spark documentation: http://spark.apache.org/docs/latest/spark-standalone.html
Rough steps:
Download a precompiled Spark package, or download the Spark source and build it locally.
Extract the TAR archive.
Set the required environment variables.
Run the start script.
Spark (without Hadoop) is available on the Spark download page.
URL: https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Note that this particular archive is pre-built with the Hadoop 2.7 client libraries bundled, so it runs without a separate Hadoop installation.
If this URL does not work, get the package from the Spark download page.
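The rough steps above can be sketched as a small shell helper. This is only a minimal sketch, not an official procedure: install_spark is a hypothetical function name, and it assumes the archive has already been downloaded and unpacks into a directory named after the archive (as the pre-built Spark packages do).

```shell
#!/bin/sh
# Sketch: unpack a pre-built Spark archive and point SPARK_HOME at it.
# install_spark is a hypothetical helper, not something shipped with Spark.
install_spark() {
    archive="$1"   # path to an already downloaded .tgz, e.g. spark-2.2.0-bin-hadoop2.7.tgz
    dest="$2"      # directory to unpack into

    mkdir -p "$dest" || return 1
    tar -xzf "$archive" -C "$dest" || return 1

    # Assumption: the archive unpacks into a directory named like the file.
    SPARK_HOME="$dest/$(basename "$archive" .tgz)"
    export SPARK_HOME
}
```

After calling it, for example `install_spark spark-2.2.0-bin-hadoop2.7.tgz "$HOME/opt"`, running `"$SPARK_HOME/bin/spark-shell" --master "local[*]"` starts Spark in local mode on a single machine, with no cluster or HDFS involved.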

This is not a proper answer to the original question; sorry, that is my fault.
If you want to run the "without Hadoop" Spark distribution (tar.gz), there is an environment variable you need to set. This spark-env.sh worked for me:
#!/bin/sh
# Put the jars of the locally installed Hadoop on Spark's classpath.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"

Related

Switching Spark versions and distributing jars to all nodes - Yarn v Standalone

I have an environment set up with both Spark 2.0.1 and 2.2.0. Both run in standalone mode and contain a master and 3 slaves. They sit on the same servers and are configured in exactly the same way. I only ever want to run one at a time, and to do so I set the SPARK_HOME environment variable to the location of the Spark version I wish to start and run start-master.sh and start-slaves.sh in the sbin folder of that particular version.
I have a jar file which I wish to use for all Spark programs, regardless of version. I'm aware I could just pass it in the spark-submit --jars parameter, but I don't want to have to account for any transfer time in the job execution, so I am currently placing the jar file in the jars folder of each of the master and slave nodes prior to startup. This is a regular task, as the jar file gets updated quite often.
If I wish to switch Spark versions, I must run stop-slaves.sh and stop-master.sh in the sbin folder of the version I wish to stop, then go through the above process again.
The key things I wish to achieve are that I can separate the transfer of jars from execution timings and that I can easily switch versions. I am able to do this with my current setup, but it's all done manually and I'm looking at automating it. However, I don't want to spend time doing this if there's already a solution that will do what I need.
Is there a better way of doing this? I'm currently looking at Yarn to see if it can offer anything.
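For what it's worth, the manual procedure described above could be scripted along these lines. This is only a sketch under assumptions: switch_spark is a made-up helper name, it assumes the control scripts live in each version's sbin directory (as in stock Spark 2.x), and it only stages the jar on the local node. Copying the jar to the worker nodes (for example with scp or rsync) is left out.

```shell
#!/bin/sh
# Sketch: stop one standalone Spark install, stage a shared jar, start another.
# switch_spark is a hypothetical helper; adjust paths to your own layout.
switch_spark() {
    old_home="$1"   # install directory of the version to stop
    new_home="$2"   # install directory of the version to start
    jar="$3"        # shared jar to place in the new version's jars folder

    # Stop the running version: workers first, then the master.
    "$old_home/sbin/stop-slaves.sh" || return 1
    "$old_home/sbin/stop-master.sh" || return 1

    # Stage the jar before startup so no transfer happens during a job.
    # Assumption: this covers only the local node.
    cp "$jar" "$new_home/jars/" || return 1

    SPARK_HOME="$new_home"
    export SPARK_HOME
    "$SPARK_HOME/sbin/start-master.sh" || return 1
    "$SPARK_HOME/sbin/start-slaves.sh"
}
```

Keeping the jar copy as a separate step before startup preserves the separation between transfer time and execution time that the question asks for.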

Installing Spark through Ambari

I've configured a cluster of VMs via Ambari.
Now I'm trying to install Spark.
In all tutorials (e.g. here) it's pretty simple; Spark installation is similar to that of other services.
But it appears that in my Ambari instance there is simply no such entry.
How can I add a Spark entry to the Ambari services?
There should be a SPARK folder under the /var/lib/ambari-server/resources/stacks/HDP/2.2/services directory. Additionally, there should be Spark folders identified by their version number under /var/lib/ambari-server/resources/common-services/SPARK. Either someone modified your environment, or it's a bad and/or non-standard install of Ambari.
I would suggest re-installing, as it is hard to say exactly what you need to add when it's unclear what else may be missing in the environment.
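Before re-installing, the two locations mentioned above can be checked quickly. The following is a minimal sketch: check_spark_stack is a hypothetical helper name, and the default root is simply the path quoted above, so adjust it if your Ambari resources live elsewhere.

```shell
#!/bin/sh
# Sketch: report whether the Spark service definitions exist on the Ambari server.
# check_spark_stack is a hypothetical helper, not an Ambari command.
check_spark_stack() {
    root="${1:-/var/lib/ambari-server/resources}"
    missing=0
    for d in "$root/stacks/HDP/2.2/services/SPARK" "$root/common-services/SPARK"; do
        if [ -d "$d" ]; then
            echo "found: $d"
        else
            echo "missing: $d"
            missing=1
        fi
    done
    # Non-zero exit status if at least one directory is absent.
    return "$missing"
}
```

Running `check_spark_stack` with no argument on the Ambari server prints one line per directory and returns a non-zero status if either is missing.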

Design and Evaluation of Network-Levitated Merge for Hadoop Acceleration

I need to develop a project on "Design and Evaluation of Network-Levitated Merge for Hadoop Acceleration", but I am new to Hadoop and have no idea about Hadoop projects or how to combine Hadoop functionality with a GUI.
Please guide me on this.
It would help to get an idea of what a Hadoop project looks like.
Any simple upload and download project with Hadoop and GUI functionality will do for my task.
Actually, this is my M.Tech project. For this project you need to make changes to Hadoop 0.20.2, build it again, and check the performance of Hadoop using TeraSort. The author uses an InfiniBand network. The author provided me with the code for this project, but it was compiled on an older version of Linux that no longer receives updates, so I was not able to compile it. If you have an InfiniBand network, install OFED on your system. The patch and manual are available at mellanox.com; download the manual and follow it.
If it works, let me know. Otherwise, let it go.

Hadoop 2.7.1 Eclipse plugin creation

After reading almost all the previous posts and web links, I feel I need to write this post. I am unable to find any directory named ${YOUR_HADOOP_HOME}/src/contrib/eclipse-plugin in my Windows-based build of Hadoop 2.7.1.
I have downloaded an already-compiled build, but for the sake of learning I want to build it myself. Is there any other way to get the source files for creating the Hadoop 2.7.1 Eclipse plugin? Or did I miss something when building my own Windows-based Hadoop? Please explain and, if possible, provide a source for a Windows 7 build environment.
Thanks

Getting started with Hadoop and Eclipse

I'm following a couple of tutorials for setting up Hadoop with Eclipse.
This one is from Cloudera: http://v-lad.org/Tutorials/Hadoop/05%20-%20Setup%20SSHD.html
But this seems to focus on checking out the latest code from Hadoop and tweaking it.
Surely that is rarely needed, though; won't the latest release of Hadoop suffice for most users' needs?
Whereas this tutorial seems to focus on setting up and running Hadoop:
http://v-lad.org/Tutorials/Hadoop/05%20-%20Setup%20SSHD.html
I just want to run some basic MapReduce jobs to get started. I don't think I need to use the latest Hadoop code, as Cloudera specifies in the first link above, just to get started?
Here is a blog entry and screencast on developing/debugging applications in Eclipse. The procedure works with most versions of Hadoop.
You may try this tutorial on installing the Hadoop plugin for Eclipse: http://bigsonata.com/?p=168

Resources