Maven-built Hadoop RPMs have no files in them - maven

I am trying to build an HDP 3.1.4 cluster by replacing the Cloudera-provided HDFS and YARN RPMs with the Apache Hadoop ones.
To do that, I am building the Hadoop 3.1.1 RPMs using the Maven method with the command below, which completes successfully.
`mvn -B clean install rpm:rpm -DnewVersion=3.1.1 -DskipTests`
However, when I list the files in the RPMs using rpm -qlp, it says "contains no files":
rpm -qlp hadoop-main-3.1.1-1.noarch.rpm
(contains no files)
Any help will be appreciated.
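For reference, hadoop-main is the aggregator (parent) module, so its RPM containing no files can be expected; any payload normally lands in the per-module RPMs. A quick way to inspect everything the build produced (the target/rpm output layout is the rpm-maven-plugin default, assumed here):
# list every RPM under the build tree and show the files each one packages
$ find . -path '*/target/rpm/*' -name '*.rpm' -exec rpm -qlp {} \;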

Related

What changes in Hadoop usage between Giraph built with -Phadoop_2 and Giraph built with -Phadoop_yarn?

I understand that the giraph-dist-1.2.0-hadoop2-bin.tar.gz binary distribution is built with the following Maven command and is officially supported with hadoop-2.5.1:
"mvn -Phadoop_2 clean install"
I successfully used giraph-dist-1.2.0-hadoop2-bin.tar.gz in pseudo-distributed mode on hadoop-2.5.1, in which I configured YARN.
Now I have downloaded giraph-dist-1.2.0-hadoop2-src.tar.gz and successfully built Giraph with YARN support, using the command and patches taken from Building Giraph with Hadoop, i.e.:
"mvn -Phadoop_yarn -Dhadoop.version=2.5.1 clean package -DskipTests"
Since I have already configured YARN in Hadoop 2.5.1, I don't understand whether, and what, I have to change in the Hadoop 2.5.1 configuration (mapred-site.xml and yarn-site.xml) in order to use Giraph with YARN support.
I think the main question is: what changes in Hadoop usage between Giraph built with -Phadoop_2 and Giraph built with -Phadoop_yarn?
The only documentation that I have found is the following:
Apache Hadoop 2 (latest version: 2.5.1)
This is the latest version of Hadoop 2 (supporting YARN in addition to MapReduce) that Giraph can use. You may tell Maven to use this version with "mvn -Phadoop_2".

Failed dependencies when installing the PXF service

When I install the PXF service RPM in HAWQ, I get these errors:
error: Failed dependencies:
hadoop >= 2.6.0 is needed by pxf-service-0:3.0.0-root.noarch
hadoop-hdfs >= 2.6.0 is needed by pxf-service-0:3.0.0-root.noarch
What's your advice here?
Please make sure the PXF RPM matches the OS architecture and version. For example, if the PXF RPM is built for RHEL6 and you are installing on RHEL7, you may see some dependency issues.
Please check the version of Hadoop you are running in the cluster; you might be running a lower version. You have to run at least Hadoop 2.6 to run the current version of PXF.
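If it helps, both points can be checked quickly from the shell; the RPM filename below is inferred from the error message and may differ on your system:
# version of Hadoop actually running on the cluster
$ hadoop version
# architecture the PXF package was built for, and its declared dependencies
$ rpm -qp --qf '%{ARCH}\n' pxf-service-3.0.0-root.noarch.rpm
$ rpm -qpR pxf-service-3.0.0-root.noarch.rpm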
The wiki here uses the Bigtop Hadoop RPMs:
https://cwiki.apache.org/confluence/display/HAWQ/Build+Package+and+Install+with+RPM
That means that if I install with the RPM (HAWQ 2.2.0), the other ways (using a binary Hadoop install without RPMs, like a tarball) are not supported.
If I install Hadoop from a tarball, I must build HAWQ from source code for now.
Please refer to:
https://issues.apache.org/jira/browse/HAWQ-1568

How to install custom Spark version in Cloudera

I am new to Spark, Hadoop, and Cloudera. We need to use a specific version (1.5.2) of Spark, and we are also required to use Cloudera for cluster management, including for Spark.
However, CDH 5.5 comes with Spark 1.5.0, which cannot be changed very easily.
People mention "just downloading" a custom version of Spark manually. But how can Cloudera manage this "custom" Spark version so that I can distribute it across the cluster? Or does it need to be operated and provisioned completely separately from Cloudera?
Thanks for any help and explanation.
Yes, it is possible to run any Apache Spark version!
Things to make sure of before doing it:
You have YARN configured in Cloudera Manager. After that, you can run your application as a YARN application with spark-submit; it will work like any other YARN application.
It is not mandatory to install Spark; you can just run your application.
Under YARN, you can run any application, with any version of Spark. After all, Spark is a bunch of libraries, so you can pack your JAR with your dependencies and send it to YARN. However, there are some additional small tasks to be done.
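As a sketch of that idea (the application class and jar name are hypothetical; the flags are standard spark-submit options):
# submit a fat jar that bundles its own Spark libraries to an existing YARN cluster
$ ./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MyApp \
    my-app-assembly.jar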
In the following link, dlb8 provides a list of tasks to be done to run Spark 2.0 in an installation with a previous version. Just change version/paths accordingly.
Find the version of CDH and Hadoop running on your cluster using
$ hadoop version
Hadoop 2.6.0-cdh5.4.8
Download Spark and extract the sources. Pre-built Spark binaries should work out of the box with most CDH versions, unless there are custom fixes in your CDH build, in which case you can use spark-2.0.0-bin-without-hadoop.tgz.
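If you go with the without-hadoop binary, Spark has to be pointed at the cluster's Hadoop jars at runtime. The documented mechanism is the SPARK_DIST_CLASSPATH variable; appending it to conf/spark-env.sh as below is one common way to set it:
# make the "without-hadoop" Spark build pick up the cluster's Hadoop jars
$ echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> conf/spark-env.sh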
(Optional) You can also build Spark by opening the distribution directory in the shell and running the following command using the CDH and Hadoop version from step 1
$ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
Note: With Spark 2.0 the default build uses Scala version 2.11. If you need to stick to Scala 2.10, use the -Dscala-2.10 property or
$ ./dev/change-scala-version.sh 2.10
Note that -Phadoop-provided enables the profile to build the assembly without including Hadoop-ecosystem dependencies provided by Cloudera.
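For example, the build command from the previous step with that profile added (shown only as a sketch):
$ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Phadoop-provided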
Extract the tgz file.
$ tar -xvzf /path/to/spark-2.0.0-bin-hadoop2.6.tgz
cd into the custom Spark distribution and configure it by copying the configuration from your current Spark version:
$ cp -R /etc/spark/conf/* conf/
$ cp /etc/hive/conf/hive-site.xml conf/
Change SPARK_HOME to point to the folder with the Spark 2.0 distribution:
$ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
Change spark.master from yarn-client to yarn in spark-defaults.conf:
$ sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-defaults.conf
Delete spark.yarn.jar from spark-defaults.conf
$ sed -i '/spark.yarn.jar/d' conf/spark-defaults.conf
Finally test your new Spark installation:
$ ./bin/run-example SparkPi 10 --master yarn
$ ./bin/spark-shell --master yarn
$ ./bin/pyspark
Update log4j.properties to suppress annoying warnings. Add the following to conf/log4j.properties
echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties
However, it can be adapted to the opposite, since the bottom line is "to use a Spark version on an installation with a different version".
It's even simpler if you don't have to deal with 1.x to 2.x version changes, because then you don't need to pay attention to the Scala version change or the different assembly approach.
I tested it on a CDH 5.4 installation to set up Spark 1.6.3, and it worked fine. I did it with the "spark.yarn.jars" option:
#### set "spark.yarn.jars"
$ cd $SPARK_HOME
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs:///nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf

How to install Spark and Hadoop from tarballs separately [Cloudera]

I want to install Cloudera distribution of Hadoop and Spark using tarball.
I have already set up Hadoop in pseudo-distributed mode on my local machine and successfully ran a YARN example.
I have downloaded the latest CDH 5.3.x tarballs from here.
But the folder structure of the Spark downloaded from Cloudera is different from that of the Apache website. This may be because Cloudera provides its own version, maintained separately.
So far I have not found any documentation on installing Spark separately from this Cloudera tarball.
Could someone help me to understand how to do it?
Spark can be extracted to any directory. You just need to run the ./bin/spark-submit command (available in the extracted Spark directory) with the required parameters to submit a job. To start the Spark interactive shell, use ./bin/spark-shell.
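A minimal smoke test, assuming the tarball was extracted to /opt/spark and YARN is already configured; the path, the yarn-client master (the Spark 1.x form), and the examples jar location under lib/ are assumptions about the CDH 5.3 tarball layout:
$ cd /opt/spark
# run the bundled Pi-estimation example on YARN
$ ./bin/spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi \
    lib/spark-examples-*.jar 10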

How to download CDH4 setup manually

How can I download the CDH4 setup manually?
I mean I want to download the setup without using apt-get from the ubuntu command prompt.
CDH4 is Cloudera's distribution of Apache Hadoop. CDH is a collection of Apache Hadoop and several components of its ecosystem.
Assuming that you are asking for the sources, each component can be downloaded as a tarball from the following location:
CDH Packaging and Tarball Information
