Downloading spark-csv in Windows

I am a beginner in the Spark world and want to run my machine learning algorithms using SparkR.
I installed Spark in standalone mode on my laptop (Win 7 64-bit) and I am able to run Spark (1.6.1), PySpark and start SparkR in Windows following this helpful guide: link . Once I started SparkR I began with the famous flights example:
#Set proxy
Sys.setenv(http_proxy="http://user:password@proxy.companyname.es:8080/")
#Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/Users/amartinezsistac/spark-1.6.1-bin-hadoop2.4")
#Load SparkR and its library
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R", "lib"), .libPaths()))
library(SparkR)
#Set Spark Context and SQL Context
sc = sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
#Read Data
link <- "s3n://mortar-example-data/airline-data"
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header= "true")
Nevertheless, I receive the following error message after the last line:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.ch
It seems the reason is that I do not have the spark-csv package installed; it can be downloaded from this page (Github link). Both on Stack Overflow and on the spark-packages.org website (link), the advice is to run: $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 which is for a Linux installation.
My question is: how can I run this command from the Windows 7 cmd in order to download this package?
I also tried an alternate solution for my error message (Github) without success:
#In master you don't need spark-csv.
#CSV data source is built into SparkSQL. Just use it as follows:
flights <- read.df(sqlContext, "out/data.txt", source = "com.databricks.spark.csv", delimiter="\t", header="true", inferSchema="true")
Thanks in advance to everyone.

It is the same for Windows. When you start spark-shell from the bin directory, start it this way:
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
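If you are launching SparkR from an R session rather than through spark-shell, a hedged alternative (the same approach shown in the com.databricks.spark.csv answer further down) is to pass the --packages flag through SPARKR_SUBMIT_ARGS before initializing the context. A sketch, reusing the coordinates from the question; adjust the Scala and package versions to match your Spark build:
# Sketch: ask the SparkR backend to fetch spark-csv at startup (coordinates assumed, adjust as needed)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "s3n://mortar-example-data/airline-data",
                   source = "com.databricks.spark.csv", header = "true")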

Related

Version mismatch between H2O and R package, where to get the right one?

I am using H2O (the basic version) and it works well. I want to try Deep Water for GPU support, so I carefully followed the instructions at
https://www.h2o.ai/deep-water/#try
to install Deep Water. However, it failed to run and showed this error:
Error in h2o.init(nthreads = -1, port = 54323, startH2O = FALSE) :
Version mismatch! H2O is running version 3.15.0.393 but h2o-R package is version 3.13.0.369.
Install the matching h2o-R version from - http://h2o-release.s3.amazonaws.com/h2o/(HEAD detached at c46596cad)
Where do I get the right version?
According to the deep-water link, it wants you to use 3.13.0, and your error message says you are using the 3.13.0.369 R package.
So, I think the problem is that you have 3.15.0.393 already running on this machine. Kill it and try again.
From inside your current R session, h2o.shutdown() might work. If not, and you are using Unix, do something like ps auxw | grep h2o to find its PID and kill it; if you are using Windows, search for h2o in the Task Manager. Or, cleanest of all, if you know you have an R (or Python, etc.) client where you started that 3.15.0 version of H2O, go and close that client.
Alternatively, you can force the connection:
h2o.init(ip=Cluster_ip, port = Cluster_port,
strict_version_check = FALSE,
startH2O = FALSE)
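If you prefer the clean route, here is a minimal sketch of the "kill it and try again" approach from R (assuming the old instance is still reachable on the same port):
library(h2o)
# Attach to the already-running, mismatched instance just long enough to shut it down
h2o.init(port = 54323, startH2O = FALSE, strict_version_check = FALSE)
h2o.shutdown(prompt = FALSE)
# Start a fresh H2O that matches the installed h2o-R package
h2o.init(nthreads = -1, port = 54323)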

Error while connecting sparklyr to remote sparkR in Rstudio

I tried the following command in my local RStudio session to connect to SparkR:
sc <- spark_connect(master = "spark://x.x.x.x:7077",
spark_home = "/home/hduser/spark-2.0.0-bin-hadoop2.7", version="2.0.0", config = list())
But I am getting the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory '/home/hduser/spark-2.0.0-bin-hadoop2.7' not found
Any help?
Thanks in advance
May I ask, have you actually installed Spark into that folder?
Can you show the result of the ls command in the /home/ubuntu/ folder?
And of sessionInfo() in R?
Let me share with you how I am using a custom folder structure.
It is on Windows, not Ubuntu, but I guess that won't make much of a difference.
Using the most recent dev edition
If you check GitHub, the RStudio folks are updating sparklyr almost every day, fixing numerous reported bugs:
devtools::install_github("rstudio/sparklyr")
In my case, only the installation of sparklyr_0.4.12 resolved the problem with Spark 2.0 under Windows.
Checking Spark availability
Please check that the version you are after is available:
spark_available_versions()
You should see something like the line below, which indicates that the version you intend to use is actually available for your sparklyr package.
[13] 2.0.0 2.7 spark_install(version = "2.0.0", hadoop_version = "2.7")
Installation of Spark
To keep things tidy, you may like to install Spark somewhere other than the home folder of the RStudio cache:
options(spark.install.dir = "c:/spark")
Once you are sure the desired version is available, it is time to install Spark:
spark_install(version = "2.0.0", hadoop_version = "2.7")
I'd check that it installed correctly (use ls instead of dir if you are on Ubuntu):
cd c:/spark
dir (in Win) | ls (in Ubuntu)
Now specify the location of the edition you want to use:
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-bin-hadoop2.7')
And finally, enjoy creating the connection:
sc <- spark_connect(master = "local")
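As a quick sanity check (an optional sketch), confirm the version and close the connection when you are done:
spark_version(sc)    # should report 2.0.0
spark_disconnect(sc) # close the connection when finished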
I hope it helps.

Unable to run SparkR in Rstudio

I can't use SparkR in RStudio because I'm getting this error: Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
I have tried to search for a solution but can't find one. Here is how I have tried to set up SparkR:
Sys.setenv(SPARK_HOME="C/Users/alibaba555/Downloads/spark") # The path to your spark installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library("SparkR", lib.loc="C/Users/alibaba555/Downloads/spark/R") # The path to the lib folder in the spark location
library(SparkR)
sparkR.session(master="local[*]", sparkConfig=list(spark.driver.memory="2g"))
Now execution starts with a message:
Launching java with spark-submit command
C/Users/alibaba555/Downloads/spark/bin/spark-submit2.cmd
sparkr-shell
C:\Users\ALIBAB~1\AppData\Local\Temp\Rtmp00FFkx\backend_port1b90491e4622
And finally, after a few minutes, it returns an error message:
Error in sparkR.sparkContext(master, appName, sparkHome,
sparkConfigMap, : JVM is not ready after 10 seconds
Thanks!
It looks like the path to your spark library is wrong. It should be something like: library("SparkR", lib.loc="C/Users/alibaba555/Downloads/spark/R/lib")
I'm not sure if that will fix your problem, but it could help. Also, what versions of Spark/SparkR and Scala are you using? Did you build from source?
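Putting that suggestion together, a hedged sketch of the corrected setup (paths are the ones from the question; the only changes are the lib.loc folder and the closing parenthesis of sparkR.session):
Sys.setenv(SPARK_HOME="C/Users/alibaba555/Downloads/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library("SparkR", lib.loc="C/Users/alibaba555/Downloads/spark/R/lib")
sparkR.session(master="local[*]", sparkConfig=list(spark.driver.memory="2g"))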
What seemed to be causing my issues boiled down to the working directory of our users being a network-mapped drive.
Changing the working directory fixed the issue.
If by chance you are also using databricks-connect make sure that the .databricks-connect file is copied into the %HOME% of each user who will be running Rstudio or set up databricks-connect for each of them.

Loading com.databricks.spark.csv via RStudio

I have installed Spark-1.4.0. I have also installed its R package SparkR, and I am able to use it via the Spark shell and via RStudio; however, there is one difference I cannot resolve.
When launching the SparkR-shell
./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3
I can read a .csv file as follows:
flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")
Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME) I get the following error message:
15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I know I should load com.databricks:spark-csv_2.10:1.0.3 in some way, but I have no idea how to do this. Could someone help me?
This is the right syntax (after hours of trying):
(Note: you have to focus on the first line. Notice the double quotes.)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
library(SparkR)
library(magrittr)
# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
My colleagues and I found the solution. We have initialized the sparkContext like this:
sc <- sparkR.init(appName="SparkR-Example",sparkEnvir=list(spark.executor.memory="1g"),sparkJars="spark-csv-assembly-1.1.0.jar")
We did not find out how to load a remote jar, hence we downloaded spark-csv_2.11-1.0.3.jar. Including this one in sparkJars, however, does not work, since it does not find its dependencies locally. You can add a list of jars as well, but we built an assembly jar containing all dependencies. When loading this jar, it is possible to load the .csv file as desired:
flights <- read.df(sqlContext, "data/nycflights13.csv","com.databricks.spark.csv",header="true")
I downloaded Spark-1.4.0; via the command line I went to the directory Spark-1.4.0/R, where I built the SparkR package located in the subdirectory pkg as follows:
R CMD build --resave-data pkg
This gives you a .tar file which you can install in RStudio (with devtools, you should be able to install the package in pkg as well).
In RStudio, you should set your path to Spark as follows:
Sys.setenv(SPARK_HOME="path_to_spark/spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
And you should be ready to go. I can only speak from Mac experience; I hope it helps.
If you have tried Pragith's solution above and are still having the issue, it is very possible the CSV file you want to load is not in the current RStudio working directory. Use getwd() to check the RStudio directory and make sure the CSV file is there.
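For example (a sketch; replace the path with wherever your file actually lives):
getwd()                          # where RStudio is currently looking
file.exists("nycflights13.csv")  # FALSE means read.df will not find it here
setwd("path/to/your/csv/folder") # or pass an absolute path to read.df instead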

Pig without Hadoop on Windows 7

I am trying to run PigUnit tests on a Windows 7 machine before running the actual Pig script on an Ubuntu cluster, and I am starting to think that my understanding of "withouthadoop" is not correct.
Do I need to install Hadoop to locally run a PigUnit test on a Windows 7 machine?
I installed:
eclipse Juno & ant
cygwin
I set up:
JAVA_HOME=C:\Program Files\Java\jdk1.6.0_39
PIG_HOME=C:\Users\john.doe\Java\eclipse\pig
PIG_CLASSPATH=%PIG_HOME%\bin
Using Eclipse's Ant builder (targets jar-all and pigunit-jar) I created:
pig.jar
pig-withouthadoop.jar
pigunit.jar
Still when I type pig -x local in cygwin I get:
$./pig -x local
cygpath: can't convert empty path
Exception in thread "main" java.io.IOException: Error opening job jar: /usr/lib/pig/pig-withouthadoop.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:135)
Caused by: java.io.FileNotFoundException: \usr\lib\pig\pig-withouthadoop.jar (the system cannot find the given path)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:127)
at java.util.jar.JarFile.<init>(JarFile.java:136)
at java.util.jar.JarFile.<init>(JarFile.java:73)
at org.apache.hadoop.util.RunJar.main(RunJar.java:133)
When I try to run the test from http://pig.apache.org/docs/r0.10.0/test.html#pigunit from within eclipse using the option "Run as JUnit", I get:
java.io.IOException
at org.apache.pig.pigunit.pig.PigServer.registerScript(PigServer.java:62)
at org.apache.pig.pigunit.PigTest.registerScript(PigTest.java:171)
at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:267)
at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:262)
at da.utils.pigunit.PigUnitExample.testTop2Queries(PigUnitExample.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
I am starting to think that I missed some crucial basic information needed to run Pig on Windows. I should also say that I am not an experienced user of Windows 7 and Cygwin; I come from the Unix world.
Don't fight it. Install Hadoop HDInsight server on Windows from the Web Platform installer:
http://www.microsoft.com/web/downloads/platform.aspx
It doesn't take long or take up that much space, and the whole shebang is just set up and running for you. I can't get Pig scripts to take parameters, and there's no HBase, but you get HDFS, Pig and Hive. You can even get a whole local cluster going if you just follow: http://social.msdn.microsoft.com/Forums/en-US/hdinsight/thread/885efc22-fb67-4df8-8648-4ff38098dac6/
I have installed Pig 0.12 in Cygwin (I run Windows 7 64-bit) without installing Hadoop. As far as I can see, the steps I followed were:
Install Cygwin64 (with Perl package)
Download pig-0.12.1.tar.gz, copy to home folder
Extract to home folder in cygwin:
$ tar xzf pig-0.12.1.tar.gz
Export JAVA_HOME:
$ export JAVA_HOME=/cygdrive/c/Program\ Files/Java/jre6/
Add pig to path:
$ export PATH=~/pig-0.12.1/bin/:$PATH
Rename pig-0.12.1.jar to pig.jar:
$ cp pig-0.12.1.jar pig.jar
Export PIG_CLASSPATH:
$ export PIG_CLASSPATH=~/pig-0.12.1/pig.jar
Run pig in local mode (start Grunt):
$ pig -x local
There will be a warning:
"cygpath: cannot create short name of C:\cygwin64\home\xxx\pig-0.12.1\logs"
We can remove it simply by running:
$ mkdir logs
Like you, I'm trying at the moment to get a functioning Pig installation on a Windows PC using Cygwin in order to learn Pig Latin using small datasets on a single JVM. Not a huge ask, you would have thought, but the pain is almost unbearable. I come from a Windows background and the UNIX part is the steep learning curve for me. The pig-withouthadoop jar doesn't contain Hadoop, so Hadoop needs to be already installed on your machine to use it; pig.jar contains Pig's own version of Hadoop and so is the one to use if Hadoop is not already installed on your machine. This is the way I understand it, and it seems to be borne out by dumping a list of the contents of each .jar to a text file and viewing the results in Notepad++.
When you type pig -x local at Cygwin's dollar prompt, the bash command script 'pig' is invoked and run. Have a look at it (from your PIG_HOME) with $ cd bin and then $ cat pig. I've been right through it these last few days with vim (!), and near the end of the code is a little fork for Cygwin users, in order to cast environment variables that up until now have been in Unix format into a form that the Windows version of java.exe will understand when 'exec java ...' is called right at the end of the script. Without this conversion, the Windows java.exe won't understand its parameters:
# cygwin path translation
if $cygwin; then
    CLASSPATH=`cygpath -p -w "$CLASSPATH"`
    PIG_HOME=`cygpath -d "$PIG_HOME"`
    PIG_LOG_DIR=`cygpath -d "$PIG_LOG_DIR"`
fi
Cygpath is a cygwin utility that converts UNIX-style file paths into Windows-style file paths, and vice versa. The error message: "cygpath: can't convert empty path" must come from here, I think. Check that CLASSPATH, PIG_HOME and PIG_LOG_DIR aren't empty, perhaps by placing your own echo commands in the script.
On my machine and installation, there was an error generated here, but not the same as yours. I found that replacing -w and -d with -m, which makes cygpath use the C:/Program Files/Java... syntax conversion, worked. But then other problems appear, which I shall leave for my own question.
According to this note [1], it is not possible to use the Hadoop native libraries on windows 7 directly using Cygwin:
Hadoop native library is supported only on *nix platforms only. Unfortunately it is known not to work on Cygwin [...]
I have traced the error message cygpath: can't convert empty path down to the line JAVA_LIBRARY_PATH=`cygpath -w "$JAVA_LIBRARY_PATH"` in hadoop-config.sh, which I commented out following the advice from [2]:
Comment some translation in hadoop-config.sh.
#if $cygwin; then
#HADOOP_PREFIX=`cygpath -w "$HADOOP_PREFIX"`
#HADOOP_LOG_DIR=`cygpath -w "$HADOOP_LOG_DIR"`
#JAVA_LIBRARY_PATH=`cygpath -w "$JAVA_LIBRARY_PATH"`
#fi
Now I get the following error:
Error before Pig is launched -- ERROR 2999: Unexpected internal error.
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
So the conclusion I draw from this is that Pig, even in local mode, requires HDFS, HDFS requires the Hadoop native libraries, and the native libraries are known not to work on Cygwin. Hence, IMHO, Pig cannot run under Cygwin as it is.
To run PigUnit on Windows 7, I don't install Cygwin.
Thanks to Konstantin Kudryavtsev: I use his FixHadoopOnWindows.runFix() from
http://simpletoad.blogspot.com/2013/05/pigunit-issue-on-windows.html
I call runFix in my setUp, for example:
private static PigTest test;

@BeforeClass
public static void setUp() throws IOException, ParseException {
    try {
        FixHadoopOnWindows.runFix();
        // TODO: load pig script properly
        test = new PigTest("src/pig/example.pig");
        //test.override(...);
    }
    catch (Exception e) {
    }
}
Using Maven, you need the following dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>0.15.0</version>
</dependency>
<dependency>
<groupId>org.jboss.forge</groupId>
<artifactId>forge-javassist</artifactId>
<version>2</version>
</dependency>
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pigunit</artifactId>
<version>0.15.0</version>
<scope>test</scope>
</dependency>
