How to run Nutch in Hadoop installed in pseudo-distributed mode - hadoop

I have Nutch 1.13 installed on my Ubuntu machine. I can run a crawl in standalone mode; it runs successfully and produces the desired results, but I have no idea how to run it in Hadoop now. I have Hadoop installed in pseudo-distributed mode and I want to run a Nutch crawl with Hadoop and monitor it. How can I do that? There are a lot of tutorials for running it in standalone mode, but I couldn't find any clear instructions on how I can run it in Hadoop, except that I have to use the Nutch .job file after I build it with ant.
Thanks for your help.

Make sure you have built Nutch from source, i.e. don't use the binary release, which works only in local mode. Once you've compiled with
ant clean runtime
go to runtime/deploy/bin and run the scripts as usual.
NB you need to modify the conf files prior to recompiling.
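As a minimal sketch (the urls/ and crawl/ HDFS paths, the seed file name and the number of rounds are placeholders you would adapt), assuming the Hadoop daemons are already running in pseudo-distributed mode:
cd runtime/deploy
hadoop fs -mkdir -p urls
hadoop fs -put seed.txt urls/
bin/crawl urls crawl 2
In deploy mode the crawl script submits each phase as a MapReduce job built from the .job file, so you can monitor progress in the ResourceManager web UI (http://localhost:8088 by default on YARN).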

Related

Can I use Spark prebuilt without hadoop on Windows?

I'm using Spark 3.1.3 prebuilt without Hadoop on a production unix based server. Spark is running in standalone mode. I'm using local filesystem rather than a distributed filesystem such as Hadoop.
I'd ideally like to replicate my production environment locally but unfortunately I'm restricted to using Windows.
Typically, I am able to run Spark on Windows by using Spark 3.1.3 prebuilt for Hadoop Y and using the winutils tool provided here: https://github.com/steveloughran/winutils
It's my understanding that winutils is simulating Hadoop rather than a unix FS.
Am I able to use the exact same Spark binaries in production and on my Windows development machine? Or am I restricted to using Spark prebuilt for Hadoop locally?
Can you explain why either solution works?
I tried running my Spark scripts locally using the version prebuilt without Hadoop but I'm unable to start my scripts. (Will provide some logs and edit this when I'm back on my Windows machine)
"Without" only refers to the scripts/libraries in the downloaded tarball. The more correct term would be "bring your own Hadoop". You will still need HADOOP_CONF_DIR + HADOOP_HOME set, as well as HDFS client JAR libraries to use a local FS.
Yes, you can use Spark on Windows by including the correct version of Winutils. Or you can use WSL2 and download Spark within a full Unix environment.
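For the winutils route, the usual setup is something like this in a cmd prompt (the folder is an example; it just needs to contain bin\winutils.exe from the repository above):
set HADOOP_HOME=C:\hadoop
set PATH=%HADOOP_HOME%\bin;%PATH%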

Running Spark-Shell on Windows

I have downloaded spark, sbt, scala, and git onto my Windows computer. When I try and run spark-shell in my command prompt, I get "Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this program."
I tried to follow this guide: https://x86x64.wordpress.com/2015/04/29/installing-spark-on-windows/, but I don't have a build subfolder, so I am not sure if that is the problem.
Any help would be appreciated.
That's an old guide for Spark 1.3.
Please use this guide to set up Spark on Windows:
http://www.ics.uci.edu/~shantas/Install_Spark_on_Windows10.pdf
This guide uses Maven and you are going to use sbt, but nevertheless you will be able to execute spark-shell by following it.
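Whichever guide you follow, the sanity check at the end is roughly this (the package version is an example; any prebuilt-for-Hadoop download from spark.apache.org works the same way):
set SPARK_HOME=C:\spark-1.6.3-bin-hadoop2.6
set PATH=%SPARK_HOME%\bin;%PATH%
spark-shell
The original error usually means spark-shell was started from a source checkout where no assembly JAR had been built yet; the prebuilt packages already ship with it.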

Running Apache Spark on Windows 7

I am trying to run Apache Spark on Windows 7. First I installed sbt from the msi, then extracted the files from spark-1.0.0 into Program Files with 7-Zip. In the command line, from the Spark directory, I ran:
sbt/sbt assembly
After a few seconds of processing, I got errors like:
-server access error: connection timed out
-could not retrieve jansi 1.1
-error during sbt execution: error retrieving required libraries
-unresolved dependency, jansi 1.1 not found
Could you please give me some advice about running Spark on Windows? I am looking for the right way to do it because I am completely new to this technology. Regards.
You could use the pre-built Spark from here.
The scripts inside the bin folder work on Windows 7.
You need to set the HADOOP_HOME variable in your environment.
See spark on windows for more information.
If you are using the build-with-sbt approach, then you'll need git as well.
Install Scala, sbt and git on your machine. Download the Spark source code and run the following command:
sbt assembly
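As a rough sketch of that route (the directory name matches the extracted source from the question; adjust for your version):
cd spark-1.0.0
sbt assembly
The assembly step builds the single assembly JAR that the scripts in bin\ expect to find. The connection-timeout and unresolved-dependency errors in the question typically mean sbt could not download its libraries, which usually points to a network or proxy problem rather than a Spark problem.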
In case you use the prebuilt release, here is the step-by-step process:
How to run Apache Spark on Windows7 in standalone mode

Hadoop cluster setup with cygwin

I am going to set up a Hadoop cluster for my project (3 nodes). My doubt is whether we can continue using Cygwin or whether I should have a Linux OS on my machines to set up the cluster.
In other words, does setting up a cluster using Cygwin lead to pseudo-distributed mode with a single node, or is it like a normal distributed cluster?
Please help me to understand.
Thanks.
I tried to set up a Hadoop cluster (CDH 5.0.2 distribution) in pseudo-distributed mode with Cygwin and it was hell. I had problems with classpaths; Cygwin was unable to parse some paths from the Hadoop files, so I had to rewrite some Hadoop code. So I do not recommend using Hadoop with Cygwin.
Generally, if you just install Hadoop with Cygwin it is in pseudo-distributed mode (see this), but of course you can install Hadoop in distributed mode; you just need more than one node. A good write-up about it is here: Running Hadoop on Windows
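A quick way to check which mode a given install is actually in (this works from a Cygwin shell too) is to ask Hadoop for its filesystem URI:
hdfs getconf -confKey fs.defaultFS
In pseudo-distributed mode this returns something like hdfs://localhost:8020, i.e. everything runs on one node; in a real distributed cluster it points at the NameNode host that the other nodes share.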

Pig 0.12.0 over Hadoop 2.3.0 on Windows 2008 r2 x64

I've built Hadoop 2.3.0 from source successfully on Windows 2008 R2 x64. The NameNode, DataNode, ResourceManager and NodeManager all work fine now.
I recompiled the Pig 0.12.0 source using Apache Ant with the parameters ant clean jar-withouthadoop -Dhadoopversion=23.
Then I copied the recompiled Pig dir to the Hadoop home dir, and set all the environment variables.
I used Mintty (MSYS) to open a bash shell window, and typed pig -x local or pig.
The screen shows lots of [INFO]... messages and finally the grunt> shell.
But whenever I type any Pig command, it doesn't react.
Has anyone run the Pig runtime successfully over Hadoop 2.3.0 on Windows?
Any possible suggestions?
Finally I solved my problem by directly using bash.exe instead of mintty /bin/bash -l.
Both the local mode and the mapreduce mode of the Pig runtime work fine now.
I note it here in case someone runs into the same situation.
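For anyone hitting the same thing, the working invocation looked roughly like this (the MSYS install path is an example):
C:\msys\1.0\bin\bash.exe --login
pig -x local
i.e. start bash.exe from a plain cmd window instead of going through mintty, and then run Pig from inside that shell.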
