Hadoop cluster setup with cygwin - hadoop

I am going to set up a Hadoop cluster (3 nodes) for my project. Can I keep using Cygwin, or do I need Linux on my machines to set up the cluster?
In other words, does setting up a cluster with Cygwin only give pseudo-distributed mode on a single node, or does it behave like a normal distributed cluster?
Please help me understand.
Thanks.

I tried to set up a Hadoop cluster (CDH5.0.2 distribution) in pseudo-distributed mode with Cygwin and it was hell. I had problems with classpaths, and Cygwin was unable to parse some paths from the Hadoop files, so I had to rewrite some Hadoop code. So I do not recommend using Hadoop with Cygwin.
Generally, if you just install Hadoop with Cygwin it runs in pseudo-distributed mode (see this), but of course you can install Hadoop in distributed mode; you just need more than one node. A good write-up on this is Running Hadoop on Windows.

Related

Can I use Spark prebuilt without hadoop on Windows?

I'm using Spark 3.1.3 prebuilt without Hadoop on a production Unix-based server. Spark is running in standalone mode. I'm using the local filesystem rather than a distributed filesystem such as HDFS.
I'd ideally like to replicate my production environment locally but unfortunately I'm restricted to using Windows.
Typically, I am able to run Spark on Windows by using Spark 3.1.3 prebuilt for Hadoop Y and using the winutils tool provided here: https://github.com/steveloughran/winutils
It's my understanding that winutils is simulating Hadoop rather than a unix FS.
Am I able to use the exact same Spark binaries in production and on my Windows development machine? Or am I restricted to using Spark prebuilt for Hadoop locally?
Can you explain why either solution works?
I tried running my Spark scripts locally using the version prebuilt without Hadoop but I'm unable to start my scripts. (Will provide some logs and edit this when I'm back on my Windows machine)
"Without" only refers to the scripts/libraries in the downloaded tarball. The more correct term would be "bring your own Hadoop". You will still need HADOOP_CONF_DIR + HADOOP_HOME set, as well as HDFS client JAR libraries to use a local FS.
Yes, you can use Spark on Windows by including the correct version of Winutils. Or you can use WSL2 and download Spark within a full Unix environment.
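The environment requirements mentioned above can be checked before launching Spark. The following is a minimal sketch, not an official tool: the function name check_hadoop_env is hypothetical, the location of winutils.exe under HADOOP_HOME\bin is the usual convention, and the SPARK_DIST_CLASSPATH check is an assumption based on how Spark's "Hadoop free" builds are normally pointed at Hadoop JARs.

```python
import os
from pathlib import Path


def check_hadoop_env(env=None):
    """Return a list of problems found; an empty list means the basics look fine.

    `env` is a mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if env is None else env
    problems = []

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    else:
        # On Windows, winutils.exe is conventionally placed in HADOOP_HOME\bin.
        winutils = Path(hadoop_home) / "bin" / "winutils.exe"
        if not winutils.is_file():
            problems.append(f"winutils.exe not found at {winutils}")

    conf_dir = env.get("HADOOP_CONF_DIR")
    if not conf_dir:
        problems.append("HADOOP_CONF_DIR is not set")
    elif not Path(conf_dir).is_dir():
        problems.append(f"HADOOP_CONF_DIR is not a directory: {conf_dir}")

    # Assumption: a "without Hadoop" build finds Hadoop JARs via this variable.
    if not env.get("SPARK_DIST_CLASSPATH"):
        problems.append("SPARK_DIST_CLASSPATH is not set")

    return problems


if __name__ == "__main__":
    for problem in check_hadoop_env():
        print("WARNING:", problem)
```

Running the script on the Windows machine prints one warning per missing piece, which may help narrow down why the scripts fail to start.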

How to run Nutch in Hadoop installed in pseudo-distributed mode

I have Nutch 1.13 installed on Ubuntu. I can run a crawl in standalone mode; it runs successfully and produces the desired results, but I have no idea how to run it in Hadoop. I have Hadoop installed in pseudo-distributed mode and I want to run a Nutch crawl with Hadoop and monitor it. How can I do that? There are a lot of tutorials for running it in standalone mode, but I couldn't find any clear instructions on how to run it in Hadoop, except that I have to use the "Nutch Job" after I build it with ant.
Thanks for your help.
Make sure you have built Nutch from source, i.e. don't use the binary release, which works only in local mode. Once you've compiled with
ant clean runtime
go to runtime/deploy/bin and run the scripts as usual.
NB: you need to modify the conf files before recompiling.

How to uninstall the Hadoop on Mac Completely

I installed Hadoop 2.5.1 on my MacBook Pro through the terminal, but now I want to uninstall it completely.
Please let me know the process.
Thank you in advance.
If you installed Hadoop by downloading and extracting the tarball, then you just have to delete the extracted directory (its path depends on where you extracted the tarball) using a command-line utility such as rm.
Also, if you changed the NameNode or DataNode data directories from their defaults (by configuring them in hdfs-site.xml), then you have to delete those directories as well.
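To find out whether any data directories were configured, hdfs-site.xml can be inspected before removing anything. The sketch below is an illustration only: the function name configured_data_dirs is hypothetical, and it assumes the Hadoop 2.x property names dfs.namenode.name.dir and dfs.datanode.data.dir. It only lists directories; it does not delete them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumed Hadoop 2.x property names for NameNode and DataNode storage.
DATA_DIR_PROPERTIES = ("dfs.namenode.name.dir", "dfs.datanode.data.dir")


def configured_data_dirs(hdfs_site_path):
    """Return the directories listed under the storage properties of hdfs-site.xml."""
    root = ET.parse(hdfs_site_path).getroot()
    dirs = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name not in DATA_DIR_PROPERTIES or not value:
            continue
        # A value may hold several comma-separated paths or file:// URIs.
        for entry in value.split(","):
            entry = entry.strip()
            if not entry:
                continue
            parsed = urlparse(entry)
            # Strip the file: scheme; leave anything else untouched
            # (including unexpanded ${...} variables).
            dirs.append(parsed.path if parsed.scheme == "file" else entry)
    return dirs
```

For example, print(configured_data_dirs("etc/hadoop/hdfs-site.xml")), with the path adjusted to the actual install location, shows the candidates to review and remove by hand. An empty list suggests the defaults were never overridden.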

Pig 0.12.0 over Hadoop 2.3.0 on Windows 2008 r2 x64

I've built Hadoop 2.3.0 from source successfully on Windows 2008 R2 x64. The NameNode, DataNode, ResourceManager and NodeManager all work fine now.
I recompiled the Pig-0.12.0 Src using Apache Ant with ant clean jar-withouthadoop -Dhadoopversion=23 parameters.
Then I copied the recompiled Pig dir to Hadoop Home dir, and set all the environment variables.
I used Mintty (MSYS) to open a Bash shell window and typed pig -x local or pig.
The screen shows lots of [INFO]... messages and finally the grunt> shell.
But whenever I type any Pig command, it doesn't react.
Has anyone run the Pig runtime successfully over Hadoop 2.3.0 on Windows?
Any suggestions?
I finally solved my problem by using bash.exe directly instead of mintty /bin/bash -l.
Both the local mode and the mapreduce mode of the Pig runtime work fine now.
I'm noting it here in case someone runs into the same situation.

Could not extract Cloudera Hadoop VM archive

I am new to Cloudera. I have worked on hadoop previously, now I want to try Cloudera Hadoop. For this I started with Cloudera Hadoop VM.
I downloaded the file in 7zip format, about 2GB in size. When I try to extract it, it shows this error:
Can not open file cloudera-quickstart-vm-4.4.0-1-vmware.7z as archive.
All other files are extracting properly but this single file is not extracting. I have downloaded the file three times but got the same error. Is there any specific way to extract this file?
Any help would be appreciated.
You don't need to do anything special, but I had to download the Standard QuickStart VirtualBox VM 3 times before the archive was complete. The final file that worked for me was actually ~2.6G in size.
If you are on Windows, you need to have WinRAR installed. This seems to be a common problem when trying to use or install the Cloudera QuickStart VM.
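Since the working file was larger than the one in the question, a quick look at the download can help tell an incomplete file from a tooling problem. This is a rough sketch with a hypothetical helper name, inspect_download. It reports the file size and whether the file starts with the 7z signature bytes. A truncated download would still pass the signature check, so the size has to be compared against what the download page reports.

```python
from pathlib import Path

# The six-byte signature at the start of a 7z archive.
SEVEN_ZIP_SIGNATURE = b"7z\xbc\xaf\x27\x1c"


def inspect_download(path):
    """Return (size_in_bytes, has_7z_signature) for a downloaded file."""
    p = Path(path)
    with p.open("rb") as fh:
        header = fh.read(len(SEVEN_ZIP_SIGNATURE))
    return p.stat().st_size, header == SEVEN_ZIP_SIGNATURE


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        size, looks_like_7z = inspect_download(sys.argv[1])
        print(f"size: {size / 1e9:.2f} GB, 7z signature: {looks_like_7z}")
```

If the signature is missing, the file is probably not a 7z archive at all (for example, an error page saved under the archive's name). If the signature is present but the size is well below the expected one, the download was most likely cut short.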
