Building Spark without any Hadoop dependencies - hadoop

I found some references to the -Phadoop-provided flag for building Spark without Hadoop libraries, but I cannot find a good example of how to use it. How can I build Spark from source and make sure it does not add any of its own Hadoop dependencies? It looks like when I built the latest Spark, it included a bunch of Hadoop 2.8.x jars, which conflict with my cluster's Hadoop version.

Spark has download options for "pre-built with user-provided Hadoop", which are accordingly named spark-VERSION-bin-without-hadoop.tgz.
If you would really like to build it yourself, then run this from the project root:
./build/mvn -Phadoop-provided -DskipTests clean package
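Note that a "Hadoop free" build like this still needs to be told where your cluster's Hadoop jars live at runtime; per Spark's docs for user-provided-Hadoop builds, you point SPARK_DIST_CLASSPATH at the output of hadoop classpath. A minimal sketch for conf/spark-env.sh:
# conf/spark-env.sh -- use the Hadoop already installed on the cluster
export SPARK_DIST_CLASSPATH=$(hadoop classpath)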

Related

Is it possible to build Apache Spark against Hadoop 2.5.1

After compiling Hadoop 2.5.1 with Maven, hadoop version reports Hadoop 2.5.1. I then tried to compile Apache Spark using the following command:
mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.1 -Pdeb -DskipTests clean package
But apparently there is no hadoop-2.5 profile.
My question is: what should I do?
Rebuild Hadoop 2.4?
Or compile Spark with the hadoop-2.4 profile?
Or is there another solution?
Looks like this was answered on the mailing list after the poster asked there:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-with-Hadoop-2-5-0-td15827.html
"The hadoop-2.4 profile is really intended to be "Hadoop 2.4+". It should compile and run fine with Hadoop 2.5 as far as I know. CDH 5.2 is Hadoop 2.5 + Spark 1.1, so there is evidence it works."
Just changing the profile name worked for me (command below). Thanks for the answers.
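For the record, the working command implied by that advice is simply the original one with the profile name changed (a sketch, keeping the other flags as posted):
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -Pdeb -DskipTests clean package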

Can't get Hadoop to see Snappy

I'm on RHEL 7, 64-bit. I managed (apparently) to build the Hadoop 2.4.1 distribution from source. Before that, I built Snappy from source and installed it. Then I built the Hadoop dist with
mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy
Yet when I look at $HADOOP_HOME/lib/native I see hdfs and hadoop libs, but not snappy. So when I run hadoop checknative it says that I don't have Snappy installed. Furthermore, I downloaded hadoop-snappy, compiled that, and it generated the Snappy libs. I copied those over to $HADOOP_HOME/lib/native and to $HADOOP_HOME/lib for extra measure. STILL, hadoop checknative doesn't see it!
Found the non-obvious solution in an obscure place: http://lucene.472066.n3.nabble.com/Issue-with-loading-the-Snappy-Codec-td3910039.html
I needed to add -Dcompile.native=true. This was not highlighted in the Apache build doc, nor was it in any build guide I've come across! The full command is below.
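For completeness, this is the build command from above with the missing flag added, followed by a check that should now report Snappy:
mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Dcompile.native=true
hadoop checknative -a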

Install Oozie on Hadoop 2.2

I need some guidance on installing Oozie on Hadoop 2.2. The Quick Start docs page indicates that:
IMPORTANT: By default it builds against Hadoop 1.1.1. It's possible to build against Hadoop 2.x versions as well, but it is strongly recommended to use a Bigtop distribution if using Hadoop 2.x, because the Oozie sharelibs built from the tarball distribution will not work with it.
I haven't been able to get Bigtop to work.
I tried following some guidance from here, but it only tells me to edit the pom.xml files, not what to edit in them.
I have Pig and Maven installed.
Thanks in advance
This is a problem with the releases resolving shared libraries with Maven, and it has since been fixed in git master. I had this problem, so hopefully this solution will work for the Oozie version you are building from.
The advice here is of use. Similar to the blog post you linked, the grep command will indicate the offending files:
$ grep -l "2.2.0-SNAPSHOT" `find . -name "pom.xml"`
./hadooplibs/hadoop-2/pom.xml
./hadooplibs/hadoop-distcp-2/pom.xml
./hadooplibs/hadoop-test-2/pom.xml
./pom.xml
Any mention of 2.2.0-SNAPSHOT in these files should be replaced with 2.2.0.
I would suggest removing the -SNAPSHOT part using the following command:
$ grep -l "2.2.0-SNAPSHOT" `find . -name "pom.xml"` | xargs sed -i 's|2.2.0-SNAPSHOT|2.2.0|g'
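Re-running the original grep should then return no files, confirming the replacement took:
$ grep -l "2.2.0-SNAPSHOT" `find . -name "pom.xml"`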
UPDATE: If you don't have Hadoop JARs built from when you built Hadoop itself, then you will need to add the option -DincludeHadoopJars.
And then build the package:
$ mvn clean package assembly:single -Dhadoop.version=2.2.0 -DskipTests
Or if you're using JDK7 and/or targeting Java 7 (as I did):
$ mvn clean package assembly:single -Dhadoop.version=2.2.0 -DjavaVersion=1.7 -DtargetJavaVersion=1.7 -DskipTests
Documentation on building Oozie (version 4 docs) is available here.
The above worked for building release-4.0.0 with Hadoop 2.2 and JDK 7.
The distro can then be found in distro/target.

Where can I find a tutorial for installing and running cascading.jruby?

I have Hadoop installed and testing fine, but I am unable to find any instructions for a n00b on
how to set up Cascading and cascading.jruby. Where do I place the Cascading jars, and how do I configure jading to build the Ruby assemblies correctly?
Is anyone using Jenkins to build this automatically?
Edit: more details
I'm trying to build the example word count job from https://github.com/etsy/cascading.jruby
I've:
installed Hadoop, and run the tests successfully
installed JRuby
run gem install cascading.jruby
installed jading - https://github.com/etsy/jading
installed Ant
created the wordcount sample wc.rb
Then I ran jade to compile wc.rb to a jar:
jade wc.rb
I get the following compile error
Buildfile: build.xml does not exist!
Build failed
RuntimeError: Ant retrieve failed
(root) at /usr/bin/hjade:89
This makes sense looking at the jade code, but it isn't covered in the example usage. What am I missing here?
Sorry for the delay; this is my first answer here.
The issue you describe, Jading not being able to locate its Ant build script when called through a symlink, is indeed an issue. I'd recommend just adding your Jading clone to your PATH rather than creating symlinks, as sketched below (or submit a pull request to fix the issue!).
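A minimal sketch (the clone location here is hypothetical):
# put the jading clone itself on PATH instead of symlinking the jade script
export PATH="$PATH:$HOME/src/jading"
jade wc.rb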
To address some of your other concerns, I've created a Getting Started page in the Jading wiki which may be of some help. It walks you through getting up and running with local and remote cascading.jruby jobs without installing anything besides prereqs (Java, Ant, JRuby, and the Hadoop client+config). It now includes a full example wordcount script that should work both locally and on a Hadoop cluster, and has been tested on Etsy's own internal cluster.
And backing up further still to address your question about Jenkins: yes, at Etsy we use Jenkins to build and deploy cascading.jruby (and Scalding) to our cluster. However, that build process does not currently use Jading to produce the job jar. Our build predates Jading; Jading was an attempt to release a cleaner version of the process we go through to build that jar. Our build could easily use Jading (and the original examples came from actual uses in our code), but we have slightly different requirements for the artifacts produced by our build.
If you have any other issues with Jading, feel free to submit issues or pull requests to the GitHub project.
If you are using JRuby, you are probably using Bundler as well. In that case you can add cascading.jruby as a dependency in your Gemfile, as sketched below.
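A minimal sketch of the Bundler route (assuming a Gemfile in your project root):
# append the gem to your Gemfile and let Bundler install it
echo "gem 'cascading.jruby'" >> Gemfile
bundle install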
Alternatively, you could try installing it from your project folder:
gem install cascading.jruby
Hope this helps.
I've got it working end to end now.
I had created symlinks to the hadoop and jading binaries in /usr/local/bin.
The scripts need to be run from their own directories in order to find their supporting files.
That is, the following works (assuming the cascading.jruby example is in ~/dev/cascading.jruby.demo/wc.rb):
cd /usr/local/jading
./jade ~/dev/cascading.jruby.demo/wc.rb
# creates a jade.jar locally in jading folder
cd /usr/local/hadoop
./bin/hadoop jar /usr/local/jading/jade.jar ~/dev/cascading.jruby.demo/wc.rb ~/dev/cascading.jruby.demo/sampledata/in.txt

Build a Hadoop Eclipse Library from CDH4 jar files

I am trying to build a Hadoop library of all the jar files that I need to build a map/reduce job in Eclipse.
Which .jar files do I need, and from which folders of a single-node CDH4 install of Hadoop on Ubuntu?
Assuming you've downloaded the CDH4 tarball distro from https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs:
Unpack the tarball.
Locate the build.properties file in the unpacked directory:
hadoop-2.0.0-cdh4.0.0/src/hadoop-mapreduce-project/src/contrib/eclipse-plugin
Add a property to this file for your Eclipse installation directory:
eclipse.home=/opt/eclipse/jee-indigo-SR2
Finally, run ant from the hadoop-2.0.0-cdh4.0.0/src/hadoop-mapreduce-project directory to build the jar (see the sketch below).
You'll now have a jar in the hadoop-2.0.0-cdh4.0.0/src/hadoop-mapreduce-project/build/contrib/eclipse-plugin/ folder.
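Putting those steps together as shell commands (using the example eclipse.home path from above; adjust to your setup):
cd hadoop-2.0.0-cdh4.0.0/src/hadoop-mapreduce-project
echo "eclipse.home=/opt/eclipse/jee-indigo-SR2" >> src/contrib/eclipse-plugin/build.properties
ant
# the plugin jar lands in build/contrib/eclipse-plugin/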
To finally answer your question, the dependency jars are now in:
hadoop-2.0.0-cdh4.0.0/src/hadoop-mapreduce-project/build/contrib/eclipse-plugin/
And to be really verbose, if you want the full list, see this pastebin.
