I'm using Spark 3.1.3, prebuilt without Hadoop, on a production Unix-based server. Spark is running in standalone mode, and I'm using the local filesystem rather than a distributed filesystem such as HDFS.
I'd ideally like to replicate my production environment locally but unfortunately I'm restricted to using Windows.
Typically, I am able to run Spark on Windows by using Spark 3.1.3 prebuilt for Hadoop Y together with the winutils tool provided here: https://github.com/steveloughran/winutils
It's my understanding that winutils simulates Hadoop rather than a Unix FS.
Am I able to use the exact same Spark binaries in production and on my Windows development machine? Or am I restricted to using Spark prebuilt for Hadoop locally?
Can you explain why either solution works?
I tried running my Spark scripts locally using the version prebuilt without Hadoop, but I'm unable to start them. (I'll provide some logs and edit this when I'm back on my Windows machine.)
"Without" only refers to the scripts/libraries missing from the downloaded tarball; the more accurate term would be "bring your own Hadoop". You will still need HADOOP_CONF_DIR and HADOOP_HOME set, as well as the HDFS client JARs, even to use a local FS.
Yes, you can use Spark on Windows by including the correct version of Winutils. Or you can use WSL2 and download Spark within a full Unix environment.
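For the "bring your own Hadoop" build, the documented hook is `SPARK_DIST_CLASSPATH` in `conf/spark-env.sh`. A minimal sketch, assuming Hadoop is unpacked at `/opt/hadoop` (adjust the paths to your actual layout):

```shell
# conf/spark-env.sh -- point the "Hadoop-free" Spark build at your own Hadoop install.
# /opt/hadoop is an assumed location; substitute yours.
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR="${HADOOP_HOME}/etc/hadoop"

# Put the Hadoop client JARs on Spark's classpath;
# `hadoop classpath` prints the full list for you.
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)
```

With this in place the same "without Hadoop" binaries can find the Hadoop client libraries at runtime instead of bundling them.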
Related
I have Nutch 1.13 installed on my Ubuntu machine. I can run a crawl in standalone mode; it runs successfully and produces the desired results, but I have no idea how to run it on Hadoop. I have Hadoop installed in pseudo-distributed mode, and I want to run a Nutch crawl with Hadoop and monitor it. How can I do it? There are a lot of tutorials for running it in standalone mode, but I couldn't find any clear instructions on how to run it on Hadoop, except that I have to use the Nutch job after I build it with ant.
Thanks for your help.
Make sure you have built Nutch from source, i.e. don't use the binary release, which works only in local mode. Once you've compiled it with
ant clean runtime
go to runtime/deploy/bin and run the scripts as usual.
NB you need to modify the conf files prior to recompiling.
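The steps above, as a sketch (the seed directory, crawl directory, and round count passed to `bin/crawl` are illustrative values; adjust them to your setup):

```shell
# Edit conf/nutch-site.xml etc. BEFORE building -- the conf files
# are baked into the generated .job file.
cd apache-nutch-1.13
ant clean runtime

# The deploy runtime submits the .job file to your Hadoop cluster.
cd runtime/deploy
bin/crawl urls/ crawl/ 2   # <seed dir> <crawl dir> <number of rounds>
```

You can then monitor the crawl like any other MapReduce job, e.g. through the Hadoop web UI.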
I have downloaded Spark, sbt, Scala, and git onto my Windows computer. When I try to run spark-shell in my command prompt, I get "Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this program."
I tried to follow this guide: https://x86x64.wordpress.com/2015/04/29/installing-spark-on-windows/ , but I don't have a build subfolder, so I am not sure if that is the problem.
Any help would be appreciated.
That's an old guide, for Spark 1.3.
Please use this guide to set up Spark on Windows instead:
http://www.ics.uci.edu/~shantas/Install_Spark_on_Windows10.pdf
This guide uses Maven while you are going to use sbt, but you will nevertheless be able to execute spark-shell by following it.
I am trying to run Apache Spark on Windows 7. First I installed sbt via the msi, then extracted the files from spark-1.0.0 into Program Files with 7-Zip. On the command line, from the Spark directory, I ran:
sbt/sbt assembly
After a few seconds of processing, I got errors like:
-server access error: connection timed out
-could not retrieve jansi 1.1
-error during sbt execution: error retrieving required libraries
-unresolved dependency, jansi 1.1 not found
Could you please give me some advice about running Spark on Windows? I am looking for the right way because I am completely new to this technology. Regards.
You could use the pre-built Spark from here.
The scripts inside the bin folder work on Windows 7.
You need to set the HADOOP_HOME variable in your path.
See "spark on windows" for more information.
If you are using the build-with-sbt approach, then you'll need git as well.
Install Scala, sbt and git on your machine, download the Spark source code, and run the following command:
sbt assembly
In case you use a prebuilt release, here is the step-by-step process:
How to run Apache Spark on Windows7 in standalone mode
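Either way, the usual prerequisite on Windows is a HADOOP_HOME directory containing winutils.exe. A sketch for the Windows Command Prompt (C:\hadoop is an assumed location; use wherever you placed winutils):

```shell
:: Windows cmd -- assumes winutils.exe has been copied to C:\hadoop\bin
setx HADOOP_HOME C:\hadoop
setx PATH "%PATH%;C:\hadoop\bin"

:: Open a NEW command prompt (setx doesn't affect the current one),
:: then launch the shell from the Spark directory:
bin\spark-shell
```

Without winutils.exe on the path, Spark's Hadoop libraries fail with errors about missing native binaries even when you never touch HDFS.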
I am going to set up a Hadoop cluster in my project (3 nodes). My doubt is: can we continue using Cygwin, or should I have a Linux OS on my machine to set up the cluster?
In other words, does setting up a cluster using Cygwin lead to pseudo-distributed mode with a single node, or is it like a normal distributed cluster?
Please help me to understand.
Thanks.
I tried to set up a Hadoop cluster (CDH5.0.2 distribution) in pseudo-distributed mode with Cygwin and it was hell. I had problems with classpaths; Cygwin was unable to parse some paths from the Hadoop files, so I had to rewrite some Hadoop code. So I do not recommend using Hadoop with Cygwin.
Generally, if you just install Hadoop with Cygwin it runs in pseudo-distributed mode (see this), but of course you can install Hadoop in distributed mode; you just need more than one node. A good write-up on the subject is Running Hadoop on Windows.
I am new to Cloudera. I have worked on hadoop previously, now I want to try Cloudera Hadoop. For this I started with Cloudera Hadoop VM.
The downloaded file is in 7z format and about 2 GB in size. When I try to extract it, it shows an error:
Can not open file cloudera-quickstart-vm-4.4.0-1-vmware.7z as archive.
All other archives extract properly, but this single file does not. I have downloaded the file three times but got the same error. Is there any specific way to extract this file?
Any help would be appreciated.
You don't need to do anything special, but I had to download the Standard QuickStart VirtualBox VM 3 times before the archive was complete. The final file that worked for me was actually ~2.6G in size.
On Windows, you need to have WinRAR installed. This seems to be a common problem when trying to use/install the Cloudera QuickStart VM.
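A quick way to tell a truncated download from a broken extractor is to test the archive and check its size on disk (this sketch assumes p7zip's `7z` command is available):

```shell
# Test archive integrity without extracting;
# a truncated or corrupt download fails this test too.
7z t cloudera-quickstart-vm-4.4.0-1-vmware.7z

# Compare the on-disk size against what the download page reports.
ls -l cloudera-quickstart-vm-4.4.0-1-vmware.7z
```

If `7z t` reports errors on every download attempt, the problem is the download (or the mirror), not the extraction tool.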