Setting up Nutch 1.3 and Hadoop 0.20.2 - hadoop

I have a multi-node cluster running on UEC (Ubuntu Enterprise Cloud) and I thought it would be a good idea to set up Nutch with it.
However, I found this tutorial, http://wiki.apache.org/nutch/NutchHadoopTutorial, unhelpful, as it targets Nutch release 0.8 and doesn't cover the latest versions. Can somebody tell me how I can configure Nutch 1.3 with Hadoop?

Related

Can Hadoop 3.2 HDFS client be used to work with Hadoop 2.x HDFS nodes?

I am trying to build a Java program using the Hadoop 3.2 client. Will it be able to work with Hadoop 2.x clusters, or is that not supported? Thank you for sharing your experience.
With Hadoop and most Apache-licensed projects, compatibility is only guaranteed between minor version numbers, so you should not expect a 3.2 client to work with a 2.x Hadoop cluster.
Cloudera's blog post Upgrading your clusters and workloads from Apache Hadoop 2 to Apache Hadoop 3, written by Suma Shivaprasad, also mentions the following:
Compatibility with Hadoop 2
Wire compatibility
Hadoop 3 preserves wire compatibility with Hadoop 2 clients
Distcp/WebHDFS compatibility is preserved
API compatibility
Hadoop 3 doesn’t preserve full API level compatibility due to the following changes
Classpath – Dependency version bumps like guava
Removal of deprecated APIs and tools
Shell script rewrites
Incompatible bug fixes
But it also states:
Migrating Workloads
MapReduce applications
MapReduce is fully binary compatible and workloads should run as is without any changes required.
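For reference, the kind of HDFS client code the question is about looks roughly like the sketch below. It is built only against the public FileSystem API, with a placeholder NameNode address; whether it actually works across the 3.2-client/2.x-cluster gap is exactly the wire-compatibility question discussed above.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; point this at the 2.x cluster under test.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // A simple directory listing exercises the client <-> NameNode RPC path,
        // which is where wire (in)compatibility would show up.
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```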

Where to find CDH and all its software versions?

I want to know which CDH version is currently the most widely used, and the detailed versions of all the software it bundles. For example: if it is CDH 5.6, what are the MapReduce, Hive, Impala, Sqoop, etc. versions in that package?
The most used? You're not going to be able to find that information unless Cloudera collected it and published the CDH versions their clients use.
Click the respective CDH version here for version information
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball.html
For example, for CDH 5.6, Hadoop is at 2.6.0 and Sqoop is at 1.4.6.
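If you already have access to a CDH node, you can also check programmatically which Hadoop build is on the classpath; a minimal sketch using Hadoop's built-in VersionInfo class (the CDH-suffixed version string in the comment is only an example):
```java
import org.apache.hadoop.util.VersionInfo;

public class PrintHadoopVersion {
    public static void main(String[] args) {
        // On CDH the version string carries the CDH suffix, e.g. "2.6.0-cdh5.6.0".
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Built from:     " + VersionInfo.getUrl()
                + " (rev " + VersionInfo.getRevision() + ")");
    }
}
```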

Installation of Spark 2.1.0 with Ambari 2.4.2.0

I am relatively new to cluster installations of Spark with Ambari. Recently, I was given the task of installing Spark 2.1.0 on a cluster that came pre-installed with Ambari, Spark 1.6.2, and HDFS & YARN 2.7.3.
My task is to have Spark 2.1.0 installed, since it is the newest version and has better compatibility with RSpark, among other things. I searched the internet for a couple of days and only found installation guides for either AWS or Spark 2.1.0 alone,
such as the following:
http://data-flair.training/blogs/install-deploy-run-spark-2-x-multi-node-cluster-step-by-step-guide/
and http://spark.apache.org/docs/latest/building-spark.html.
But none of them mentions how different versions of Spark might interfere with each other. Since I need to keep this cluster running, I would like to know the potential risks to the cluster.
Is there a proper way to do this installation? Thanks a lot!
If you want your SPARK2 installation to be managed by Ambari, then SPARK2 must be provisioned by Ambari.
HDP 2.5.3 does NOT support Spark 2.1.0; it does, however, come with a technical preview of Spark 2.0.0.
Your options are:
Install Spark 2.1.0 manually and don't have it managed by Ambari (a sketch for checking which version a job picks up follows this list)
Use Spark 2.0.0, which is provided by HDP 2.5.3, instead of Spark 2.1.0
Use a different stack, i.e. IBM Open Platform (IOP) 4.3, slated for release in 2017; it will ship with Spark 2.1.0 support. You can get started with it today using the technical preview release.
Upgrade to HDP 2.6, which supports Spark 2.1.
Extend the HDP 2.5 stack to support Spark 2.1.0. You can see how to customize and extend Ambari stacks on the wiki. This would let you use Spark 2.1.0 and have it managed by Ambari. However, it would be a lot of work to implement, and since you're new to Ambari it would be rather difficult.
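If you do end up with two Spark versions installed side by side (e.g. via the manual install option), it is worth confirming at runtime which one a job actually picks up. A minimal sketch, using a local master purely so the example is self-contained:
```java
import org.apache.spark.sql.SparkSession;

public class SparkVersionCheck {
    public static void main(String[] args) {
        // local[*] is used only to keep the example self-contained; on the cluster
        // you would submit this with the spark-submit that ships with Spark 2.x
        // and --master yarn instead.
        SparkSession spark = SparkSession.builder()
                .appName("version-check")
                .master("local[*]")
                .getOrCreate();

        // Prints the version of the Spark libraries the job is really running against.
        System.out.println("Running on Spark " + spark.version());

        spark.stop();
    }
}
```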

How to install Apache Storm on Windows 7

Can anyone tell me how I can install Apache Storm on Windows 7?
I am new to big data, so I need a little help.
Please explain what does not work. Storm is Java + Python, so setting it up on Windows should not be a problem. ZooKeeper runs on Windows just fine. There are many Vagrant / Docker implementations that will work as well. So what problem are you trying to resolve?
BTW, if you are trying to set it up for development, you don't need a cluster. You can run it with local-cluster settings (check the Storm documentation).
The general steps are:
Download Zookeeper
Untar it and configure a single-node cluster
Download Storm
Follow Storm documentation and configure Nimbus/Supervisor settings
Follow Storm documentation to start Nimbus, Supervisor, Storm UI and Log Viewer.
Make sure you read the documentation for both 0.10.0 and 1.0.x. These releases are not compatible, and some of the libraries you may want to use will not work with the new Storm release.
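To illustrate the local-cluster route mentioned above, a topology can be run entirely in-process for development, with no separate ZooKeeper, Nimbus or Supervisor installation. The sketch below assumes the Storm 1.0.x package names (org.apache.storm; in 0.10.0 the same classes live under backtype.storm) and the TestWordSpout sample spout that ships with storm-core:
```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class LocalDemoTopology {

    // Trivial bolt that prints every word it receives.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getString(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout());
        builder.setBolt("print", new PrinterBolt()).shuffleGrouping("words");

        // LocalCluster simulates a Storm cluster in-process, so nothing
        // needs to be installed on Windows for this to run.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("local-demo", new Config(), builder.createTopology());

        Thread.sleep(10_000);   // let the topology run for a bit
        cluster.shutdown();
    }
}
```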

Impala on Hadoop 1.0.4

I am trying to work with Impala on my Linux box. Mine is not a Cloudera distribution; I installed Hadoop, Hive, HBase and the other components individually.
Here are the versions
Hadoop - 1.0.4
HBase - 0.94.8
Hive - 0.9.0
Impala - 1.2.3
I installed Impala using RPM, as mine is a Red Hat Linux box.
I am not able to configure the Impala server on my machine (indeed, I'm not able to find the site.xml files).
In the research I did, I came to learn that Impala will only work with Hadoop 2.x. Is that true? If it is, I need to migrate to 2.x rather than waste time on 1.x.
Could someone confirm the same? Thanks in advance.
I suggest using the latest CDH 4.x:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_prereqs.html
