I have a three-node cluster running Hadoop 2.2.0 and HBase 0.98.1, and I need to use a Nutch 2.2.1 crawler on top of that. But Nutch only supports Hadoop versions from the 1.x branch. So far I am able to submit a Nutch job to my cluster, but it fails with a java.lang.NumberFormatException.
So my question is pretty simple: how do I make Nutch work in my environment?
At the moment it's impossible to integrate Nutch 2.2.1 (Gora 0.3) with HBase 0.98.x.
See: https://issues.apache.org/jira/browse/GORA-304
The official Nutch tutorial recommends only the HBase 0.90.x branch:
http://wiki.apache.org/nutch/Nutch2Tutorial
You can also download the HBase 0.94.24-hadoop-2.5.0 build, which I created and tested today:
https://github.com/dobromyslov/hbase/releases/tag/0.94.24-hadoop-2.5.0
Note that Nutch 2.2.1 does not support HBase 0.94.x, so you have to get the latest Nutch 2.x from the Git branch: https://github.com/apache/nutch/tree/2.x
I'm running Hive 2.1.1 and Hadoop 2.7.3 on Ubuntu 16.04.
Hive on Spark: Getting Started says:
Install/build a compatible version. Hive root pom.xml's <spark.version> defines what version of Spark it was built/tested with.
I checked the pom.xml, and it shows that the Spark version is 1.6.0:
<spark.version>1.6.0</spark.version>
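The same property can also be read from the POM programmatically. The sketch below is a minimal Python illustration, assuming a standard Maven POM with a top-level properties block; `local_name`, `read_pom_property`, and the sample POM are hypothetical names introduced here for illustration and are not part of Hive.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{uri}name' down to 'name'."""
    return tag.rsplit("}", 1)[-1]


def read_pom_property(pom_xml, name):
    """Return the text of <properties>/<name> in a POM document, or None."""
    root = ET.fromstring(pom_xml)
    for child in root:
        if local_name(child.tag) != "properties":
            continue
        for prop in child:
            if local_name(prop.tag) == name:
                return (prop.text or "").strip()
    return None


# Hypothetical, trimmed-down POM used only to exercise the function.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <spark.version>1.6.0</spark.version>
  </properties>
</project>"""

print(read_pom_property(SAMPLE_POM, "spark.version"))
```

To try it against a real checkout, the contents of Hive's root pom.xml (read in binary mode) could be passed in place of `SAMPLE_POM`.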
But Hive on Spark: Getting Started also says:
Prior to Spark 2.0.0: ./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
Since Spark 2.0.0: ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
So now I'm confused because I am running hadoop 2.7.3. Do I have to downgrade my hadoop to 2.4?
Which version of Spark should I use? 1.6.0 or 2.0.0?
Thank you!
I am currently using Spark 2.0.2 with Hadoop 2.7.3 and Hive 2.1, and it works fine. I think Hive supports both Spark 1.6.x and 2.x, but I suggest going with Spark 2.x since it is the latest version.
Some links on why to use Spark 2.x:
https://docs.cloud.databricks.com/docs/latest/sample_applications/04%20Apache%20Spark%202.0%20Examples/03%20Performance%20Apache%20(Spark%202.0%20vs%201.6).html
Apache Spark vs Apache Spark 2
The current version of Spark 2.x is not compatible with Hive 2.1 and Hadoop 2.7 because of a major bug:
JavaSparkListener is not available, so Hive crashes on execution
https://issues.apache.org/jira/browse/SPARK-17563
You can try to build Hive 2.1 with Hadoop 2.7 and Spark 1.6 with:
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
If you look at the command for Spark 2.0 and later, the difference is that make-distribution.sh is inside the dev/ folder.
If it does not work for Hadoop 2.7.x, I can confirm that I was able to build it successfully with Hadoop 2.6, using:
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"
This was with Scala 2.10.5.
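The two command variants differ only in the script path and the Hadoop profile, so they can be generated from one small helper. This is a sketch in Python, assuming only what the commands above show (the script moves under dev/ from Spark 2.0.0 onward); `make_distribution_command` is a hypothetical name, and the helper only builds the command line, it does not run the build.

```python
import shlex


def make_distribution_command(spark_version, hadoop_profile):
    """Build the make-distribution argument list for a Spark source tree.

    spark_version  -- e.g. "1.6.0" or "2.0.2"
    hadoop_profile -- Maven profile name, e.g. "hadoop-2.6" or "hadoop-2.7"
    """
    major = int(spark_version.split(".")[0])
    # From Spark 2.0.0 onward the script lives under dev/.
    script = "./dev/make-distribution.sh" if major >= 2 else "./make-distribution.sh"
    profiles = "-Pyarn,hadoop-provided,{},parquet-provided".format(hadoop_profile)
    return [script, "--name", "hadoop2-without-hive", "--tgz", profiles]


for version, profile in (("1.6.0", "hadoop-2.6"), ("2.0.2", "hadoop-2.7")):
    # shlex.join (Python 3.8+) renders the list as a shell-safe string.
    print(shlex.join(make_distribution_command(version, profile)))
```

Returning a list rather than a string keeps the result usable with `subprocess.run` from inside the Spark source directory, if running the build from Python is wanted.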
I want to use Elasticsearch on Hadoop. Can anyone suggest a step-by-step installation and configuration of Elasticsearch on Hadoop? Is there a version dependency between Elasticsearch and Hadoop?
Installation:
http://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
Configuration:
http://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
Hadoop 1.x (ideally the latest stable version in the 1.x line,
currently 1.2.1) or 2.x (ideally the latest stable version, currently
2.2.0). elasticsearch-hadoop is tested daily against Apache Hadoop. Any distro compatible with Apache Hadoop should work just fine.
I have Hadoop 2.5.1 installed on three nodes (1 master, 2 slave nodes) and I want to know which versions of HBase and Hive are compatible with it.
Also, are there any alternatives for this Hadoop+HBase+Hive integration, or any guides explaining the installation of Hadoop 2.5.1 with compatible HBase and Hive versions?
Currently I am trying Apache Ambari for the above integration, and it's still ongoing.
Environment:
Jdk version: 1.7.0_67
RHEL 5
64 bit architecture
Any leads will be much appreciated!
With Hadoop 2.5.1, the supported HBase versions are:
HBase-0.98.x (Support for Hadoop 1.1+ is deprecated.)
HBase-1.0.x (Hadoop 1.x is NOT supported)
HBase-1.1.x
HBase-1.2.x
Here is the link : http://hbase.apache.org/book.html#configuration
Warning: only hive 1.2.1 can work with Hbase 2.x.
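The HBase list above can be captured as a small lookup when several candidate versions need checking. This is a sketch in Python that encodes only the release lines named in this answer for Hadoop 2.5.1; it is not an authoritative compatibility matrix, and the names `HBASE_LINES_FOR_HADOOP_2_5_1`, `release_line`, and `is_listed_for_hadoop_2_5_1` are hypothetical.

```python
# HBase release lines listed in this answer for Hadoop 2.5.1.
HBASE_LINES_FOR_HADOOP_2_5_1 = ("0.98", "1.0", "1.1", "1.2")


def release_line(version):
    """Reduce a version such as '0.98.1' or '1.1.x' to its 'major.minor' line."""
    return ".".join(version.split(".")[:2])


def is_listed_for_hadoop_2_5_1(hbase_version):
    """True if the HBase version falls in one of the listed release lines."""
    return release_line(hbase_version) in HBASE_LINES_FOR_HADOOP_2_5_1


for candidate in ("0.94.24", "0.98.1", "1.2.0"):
    print(candidate, is_listed_for_hadoop_2_5_1(candidate))
```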
I have no option other than to install HBase 0.90.6, as it is the only recommended stable version for Nutch (web crawler) other than 0.90.4.
My question: which Hadoop version is recommended for HBase 0.90.6 to work in pseudo-distributed mode?
I figured out that Hadoop 0.20.205.0 is the compatible version.
I tried Hadoop 1.2.1, but it doesn't seem to work well with HBase 0.90.6.
I have a Hadoop cluster running version 1.2.1, and I recently downloaded HBase 0.94.11 to try it out. I was able to set up HBase to run in distributed mode, but when I checked the web GUI status, it reported the Hadoop version as 1.0.4. I noticed that this is because HBase uses the hadoop-core-1.0.4.jar file that ships with HBase. So my question is: should I replace this jar file with hadoop-core-1.2.1.jar so that HBase uses the latest hadoop-core jar? And does it matter?
You don't have to do that if 1.0.4 works for you. The newest version may bring other problems, and just replacing hadoop-core.jar is unsafe. If you want to upgrade HBase, please follow the official guide.
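Before deciding anything, it can help to confirm which hadoop-core jar HBase actually ships with. The sketch below is a minimal Python illustration that scans a directory for files named hadoop-core-<version>.jar; that the jar sits in a lib directory under the HBase install is an assumption, and `bundled_hadoop_core_versions` is a hypothetical name. The demo uses a temporary directory with an empty placeholder file rather than a real install.

```python
import re
import tempfile
from pathlib import Path

# Matches names like hadoop-core-1.0.4.jar and captures the version part.
JAR_PATTERN = re.compile(r"^hadoop-core-(\d+(?:\.\d+)*)\.jar$")


def bundled_hadoop_core_versions(lib_dir):
    """Return the versions of hadoop-core-<version>.jar files found in lib_dir."""
    versions = []
    for path in sorted(Path(lib_dir).glob("hadoop-core-*.jar")):
        match = JAR_PATTERN.match(path.name)
        if match:
            versions.append(match.group(1))
    return versions


# Demo against a throwaway directory standing in for the HBase lib folder.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "hadoop-core-1.0.4.jar").touch()
    (Path(demo_dir) / "unrelated.jar").touch()
    print(bundled_hadoop_core_versions(demo_dir))
```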
Hope it helps.