Loading com.databricks.spark.csv via RStudio - rstudio

I have installed Spark 1.4.0. I have also installed its R package SparkR, and I am able to use it via the Spark shell and via RStudio; however, there is one difference I cannot solve.
When launching the SparkR-shell
./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3
I can read a .csv file as follows:
flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")
Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME) I get the following error message:
15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I know I should load com.databricks:spark-csv_2.10:1.0.3 somehow, but I have no idea how to do this. Could someone help me?

This is the right syntax (after hours of trying):
(Note: pay attention to the first line, in particular the double quotes.)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
library(SparkR)
library(magrittr)
# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
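As a quick sanity check (assuming the load succeeded), you can inspect the result:
printSchema(flights)  # columns inferred from the CSV header
head(flights)         # first rows collected as a local data.frame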

My colleagues and I found the solution. We have initialized the sparkContext like this:
sc <- sparkR.init(appName="SparkR-Example",sparkEnvir=list(spark.executor.memory="1g"),sparkJars="spark-csv-assembly-1.1.0.jar")
We did not find a way to load a remote jar, so we downloaded spark-csv_2.11-1.0.3.jar. Including this one in sparkJars on its own does not work, since its dependencies are not found locally. You can pass a list of jars as well, but we built an assembly jar containing all dependencies. When loading this jar, it is possible to read the .csv file as desired:
flights <- read.df(sqlContext, "data/nycflights13.csv","com.databricks.spark.csv",header="true")
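For completeness, the sqlContext used in that call is created from the same context; a minimal sketch:
sqlContext <- sparkRSQL.init(sc)  # sc is the context created with sparkJars above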

I downloaded Spark 1.4.0 and, via the command line, went to the directory Spark-1.4.0/R, where I built the SparkR package located in the subdirectory pkg as follows:
R CMD build --resave-data pkg
This gives you a .tar.gz file which you can install in RStudio (with devtools, you should be able to install the package from pkg directly as well).
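For example, from R the built tarball can be installed like this (the exact file name produced by R CMD build is an assumption):
install.packages("SparkR_1.4.0.tar.gz", repos = NULL, type = "source")  # assumed tarball name
# or, with devtools, directly from the pkg directory:
# devtools::install("pkg")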
In RStudio, you should set your path to Spark as follows:
Sys.setenv(SPARK_HOME="path_to_spark/spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
And you should be ready to go. I can only speak from Mac experience; I hope it helps.

If you have tried Pragith's solution above and are still having the issue, it is very possible the CSV file you want to load is not in the current RStudio working directory. Use getwd() to check the RStudio working directory and make sure the CSV file is there, as in the example below.
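For example (file name taken from the question above; the setwd path is just a placeholder):
getwd()                               # the directory RStudio is currently using
file.exists("nycflights13.csv")       # TRUE means read.df will find the file by this relative path
# setwd("C:/path/to/your/data")       # adjust the working directory if needed (path assumed)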

Related

How can I prevent a package from opening when starting a RStudio project

I have the lmerTest package installed, but it masks some of the lme4 functions and it loads automatically when I start a specific project.
I have tried deleting the package, but now I just get an error:
Loading required package: lmerTest
Error in .requirePackage(package) : unable to find required package ‘lmerTest’
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
  there is no package called ‘lmerTest’
Every time I open RStudio it attempts to load the lmerTest package, and the error message repeats ten times.
I have checked the Rprofile file but it is not in there.
This seems to be the same problem:
https://support.rstudio.com/hc/en-us/community/posts/200798587-How-can-I-prevent-a-package-from-trying-to-load-at-startup-
The quick and dirty solution here is to delete (or 'hide') the .RData file from the folder.
In the terminal you could do the following, assuming the terminal is open in the working directory.
mv ./.RData ./.RData_old
This should get RStudio to open without that package loading, but all your other objects will not be there either. If you still have the code used to create the objects, you just need to run through it again to recreate them.
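If you prefer to stay inside R, the same rename can be done with file.rename (assuming the .RData file sits in the project directory):
file.rename(".RData", ".RData_old")   # hide the workspace image so it is not restored on startup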

Error while connecting sparklyr to remote sparkR in Rstudio

I tried the following command in my local RStudio session to connect to SparkR:
sc <- spark_connect(master = "spark://x.x.x.x:7077",
spark_home = "/home/hduser/spark-2.0.0-bin-hadoop2.7", version="2.0.0", config = list())
But I am getting the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory '/home/hduser/spark-2.0.0-bin-hadoop2.7' not found
Any help?
Thanks in advance
May I ask, have you actually installed Spark into that folder?
Can you show the result of the ls command in the /home/ubuntu/ folder?
And sessionInfo() in R?
Let me share how I am using a custom folder structure.
It is on Windows, not Ubuntu, but I guess it won't make much of a difference.
Using the most recent dev edition
If you check on GitHub, the RStudio folks are updating sparklyr almost every day, fixing numerous reported bugs:
devtools::install_github("rstudio/sparklyr")
In my case, only installation of sparklyr_0.4.12 resolved the problem with Spark 2.0 under Windows.
Checking Spark availability
Please check if the version you need is available:
spark_available_versions()
You should see something like the line below, which indicates that the version you intend to use is actually available to your sparklyr package.
[13] 2.0.0 2.7 spark_install(version = "2.0.0", hadoop_version = "2.7")
Installation of Spark
Just to keep things tidy, you may like to install Spark in another location rather than the RStudio cache in your home folder.
options(spark.install.dir = "c:/spark")
Once you are sure the desired version is available, it is time to install Spark:
spark_install(version = "2.0.0", hadoop_version = "2.7")
I'd check if it is installed correctly (switch to a shell ls if needed):
cd c:/spark
dir (on Windows) or ls (on Ubuntu)
Now specify the location of the edition you want to use:
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-bin-hadoop2.7')
And finally enjoy the creation of connection:
sc <- spark_connect(master = "local")
I hope it helps.
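Once SPARK_HOME points at a valid local installation, the original remote connection from the question should go through as well (a sketch; the master URL is the placeholder from the question):
library(sparklyr)
# Connect to the remote standalone master using the locally installed Spark 2.0.0
sc <- spark_connect(master = "spark://x.x.x.x:7077",
                    spark_home = Sys.getenv("SPARK_HOME"),
                    version = "2.0.0")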

Unable to run SparkR in Rstudio

I can't use SparkR in RStudio because I'm getting this error: Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
I have tried to search for a solution but can't find one. Here is how I have tried to set up SparkR:
Sys.setenv(SPARK_HOME="C/Users/alibaba555/Downloads/spark") # The path to your spark installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library("SparkR", lib.loc="C/Users/alibaba555/Downloads/spark/R") # The path to the lib folder in the spark location
library(SparkR)
sparkR.session(master="local[*]", sparkConfig=list(spark.driver.memory="2g"))
Now execution starts with a message:
Launching java with spark-submit command
C/Users/alibaba555/Downloads/spark/bin/spark-submit2.cmd
sparkr-shell
C:\Users\ALIBAB~1\AppData\Local\Temp\Rtmp00FFkx\backend_port1b90491e4622
And finally after a few minutes it returns an error message:
Error in sparkR.sparkContext(master, appName, sparkHome,
sparkConfigMap, : JVM is not ready after 10 seconds
Thanks!
It looks like the path to your spark library is wrong. It should be something like: library("SparkR", lib.loc="C/Users/alibaba555/Downloads/spark/R/lib")
I'm not sure if that will fix your problem, but it could help. Also, what versions of Spark/SparkR and Scala are you using? Did you build from source?
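For reference, a corrected setup might look like the sketch below (it assumes Spark is unpacked at C:/Users/alibaba555/Downloads/spark; note the drive letter also needs a colon):
Sys.setenv(SPARK_HOME = "C:/Users/alibaba555/Downloads/spark")  # assumed install location
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))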
What seemed to be causing my issues boiled down to the users' working directory being a network-mapped drive.
Changing the working directory fixed the issue.
If by chance you are also using databricks-connect, make sure that the .databricks-connect file is copied into the %HOME% of each user who will be running RStudio, or set up databricks-connect for each of them.
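For example, a sketch of that copy step from R (both paths are assumptions):
file.copy(from = "C:/Users/admin/.databricks-connect",
          to   = "C:/Users/otheruser/.databricks-connect")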

Building RXTX with --disable-locks

I need to build RXTX (http://rxtx.qbang.org/wiki/index.php/Main_Page) for a 64-bit platform with --disable-locks. (The target platform is Ubuntu Snappy, so there is a problem with permissions and lock files.)
The ./configure --disable-locks and make steps seemed to run OK.
I got a new .jar file and librxtxSerial.so as a result.
However, when I installed them, I got the following error:
java.lang.UnsatisfiedLinkError: gnu.io.RXTXCommDriver.nativeGetVersion()Ljava/lang/String; thrown while loading gnu.io.RXTXCommDriver
java.lang.NoClassDefFoundError: Could not initialize class gnu.io.RXTXCommDriver thrown while loading gnu.io.RXTXCommDriver
Is there something else I need to do?
Are there other object files I need to copy over?
Thanks in advance.
I solved the problem myself.
The problem was that the configure script was not expecting a Java version higher than 1.5, as shown in the extract:
case $JAVA_VERSION in
1.2*|1.3*|1.4*|1.5*)
#fix_parameters $JPATH/jre/lib/javax.comm.properties
CLASSPATH=".:\$(TOP):\$(TOP)/src:"find $JPATH/ -name RXTXcomm.jar |head -n1
RXTX_PATH="\$(JPATH)/jre/lib/\$(OS_ARCH)"
JHOME=$JPATH/"jre/lib/ext"
So the paths were not being set up correctly for me.
I changed it to
case $JAVA_VERSION in
1.2*|1.3*|1.4*|1.5*|1.7*)
Then it worked ok.

Downloading spark-csv in Windows

I am a beginner in the Spark world and want to run my machine learning algorithms using SparkR.
I installed Spark in standalone mode on my laptop (Win 7 64-bit) and I am able to run Spark (1.6.1), PySpark, and start SparkR on Windows following this effective guide: link . Once I started SparkR I began with the famous flights example:
#Set proxy
Sys.setenv(http_proxy="http://user:password#proxy.companyname.es:8080/")
#Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/Users/amartinezsistac/spark-1.6.1-bin-hadoop2.4")
#Load SparkR and its library
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R", "lib"), .libPaths()))
library(SparkR)
#Set Spark Context and SQL Context
sc = sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
#Read Data
link <- "s3n://mortar-example-data/airline-data"
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header= "true")
Nevertheless, I receive the following error message after the last line:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.ch
It seems like the reason is that I do not have the spark-csv package installed, which can be downloaded from this page (Github link). Both on Stack Overflow and on the spark-packages.org website (link), the advice is to run $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0, which is for a Linux installation.
My question is: How could I run this code line from Windows 7 cmd in order to download this package?
I also tried an alternate solution for my error message (Github) without success:
#In master you don't need spark-csv.
#CSV data source is built into SparkSQL. Just use it as follows:
flights <- read.df(sqlContext, "out/data.txt", source = "com.databricks.spark.csv", delimiter="\t", header="true", inferSchema="true")
Thanks in advance to everyone.
It is the same for Windows. When you start spark-shell from the bin directory, start it this way:
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
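The same --packages argument can also be passed from within R before initializing SparkR, mirroring the SPARKR_SUBMIT_ARGS approach shown earlier (a sketch; the package coordinates and Scala suffix should match your Spark build):
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
# link is the S3 path defined in the question above
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header = "true")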
