I tried the following command in my local RStudio session to connect to SparkR:
sc <- spark_connect(master = "spark://x.x.x.x:7077",
                    spark_home = "/home/hduser/spark-2.0.0-bin-hadoop2.7",
                    version = "2.0.0", config = list())
But I am getting the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory '/home/hduser/spark-2.0.0-bin-hadoop2.7' not found
Any help?
Thanks in advance
May I ask, have you actually installed Spark into that folder?
Can you show the result of the ls command in the /home/hduser/ folder?
And sessionInfo() in R?
Let me share how I am using a custom folder structure. It is on Windows, not Ubuntu, but I guess that won't make much of a difference.
Use the most recent dev version
If you check on GitHub, the RStudio folks are updating sparklyr almost every day, fixing numerous reported bugs:
devtools::install_github("rstudio/sparklyr")
In my case, only installing sparklyr_0.4.12 resolved the problem with Spark 2.0 under Windows.
Checking Spark availability
Please check whether the version you are asking for is available:
spark_available_versions()
You should see something like the line below, which indicates that the version you intend to use is actually available to your sparklyr package:
[13] 2.0.0 2.7 spark_install(version = "2.0.0", hadoop_version = "2.7")
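If you prefer to check programmatically rather than scanning the printed list, here is a minimal sketch; it assumes a recent sparklyr where the result has a spark column, so adjust to what your version actually returns:
library(sparklyr)
avail <- spark_available_versions()    # every Spark/Hadoop combination sparklyr can install
avail[grepl("^2\\.0", avail$spark), ]  # keep only the 2.0.x rows (column layout may differ by sparklyr release)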
Installation of Spark
To keep things tidy, you may like to install Spark in a location other than the home folder of the RStudio cache:
options(spark.install.dir = "c:/spark")
Once you are sure the desired version is available, it is time to install Spark:
spark_install(version = "2.0.0", hadoop_version = "2.7")
I'd check that it installed correctly (change to ls in a shell if needed):
cd c:/spark
dir (in Win) | ls (in Ubuntu)
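If you would rather stay inside R for that check, a rough equivalent (assuming the install dir set above) is:
list.files(getOption("spark.install.dir", "c:/spark"))  # folders such as spark-2.0.0-bin-hadoop2.7 should appear
sparklyr::spark_installed_versions()                    # sparklyr's own view of what is installed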
Now specify the location of the version you want to use:
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-bin-hadoop2.7')
And finally, enjoy creating the connection:
sc <- spark_connect(master = "local")
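As a quick sanity check of the new connection, something like the sketch below should work (the iris copy is just an illustration, not part of your setup):
library(dplyr)
spark_version(sc)              # should report 2.0.0 for the install above
iris_tbl <- copy_to(sc, iris)  # push a small local data frame into Spark
head(iris_tbl)
spark_disconnect(sc)           # close the connection when you are done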
I hope it helps.
Related
When running Flutter on Windows Server 2016, flutter doctor shows Visual Studio as not installed. If someone could point me to where in the source Flutter detects IDEs (my understanding is that it uses default locations, not environment variables), that would be helpful.
Thanks to @pskink for locating it; sorry it took me so long to respond.
Location as of v1.15.9: GitHub Flutter
I found that Flutter calls vswhere.exe to resolve location and version information.
Example:
$vsw_path = "C:\Program Files (x86)\Microsoft Visual Studio\Installer\vswhere.exe";
$vsw_raw_data = (& $vsw_path -version 16);   # query vswhere for VS 2019 (major version 16)
$vsw_sdata = @{};                            # hashtable for the parsed "key: value" lines
$vsw_ins_start = 3;                          # skip vswhere's banner lines
$vsw_ins_stop = $vsw_raw_data.Count - $vsw_ins_start;
do {
    $k,$v = (($vsw_raw_data[$vsw_ins_start]) -Split ':\s');
    $vsw_sdata.add($k,$v); ++$vsw_ins_start
} while ($vsw_ins_start -ne $vsw_ins_stop);
$vsw_sdata.productPath                       # e.g. the path to devenv.exe
Additionally, it passes '-requires' parameters to verify that certain packages are installed, etc.: GitHub vswhere
I am using H2O (basic version) and it works well. I want to try Deep Water for GPU support, so I carefully followed the instructions at:
https://www.h2o.ai/deep-water/#try
to install Deep Water. However, it failed to run and showed this error:
Error in h2o.init(nthreads = -1, port = 54323, startH2O = FALSE) :
Version mismatch! H2O is running version 3.15.0.393 but h2o-R package is version 3.13.0.369.
Install the matching h2o-R version from - http://h2o-release.s3.amazonaws.com/h2o/(HEAD detached at c46596cad)
Where do I get the right version?
According to the deep-water link, it wants you to use 3.13.0. And your error message is saying you are using the 3.13.0.369 R package.
So, I think the problem is that you have 3.15.0.393 already running on this machine. Kill it and try again.
From inside your current R session, h2o.shutdown() might work. If not, and you are using Unix, do something like ps auxw | grep h2o to find its PID and kill it; if you are using Windows, search for h2o in Task Manager. Or, cleanest, if you know you have an R (or Python, etc.) client where you started that 3.15.0 version of H2O, go and close that client.
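For example, a minimal sketch of that cleanup from R (assuming your session can still reach the stale cluster on the same port) might be:
library(h2o)
h2o.shutdown(prompt = FALSE)           # stop the running 3.15.0 cluster, if reachable from this session
# once the old instance is gone, start a fresh one that matches the installed h2o R package
h2o.init(nthreads = -1, port = 54323)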
You can force the connection:
h2o.init(ip=Cluster_ip, port = Cluster_port,
strict_version_check = FALSE,
startH2O = FALSE)
I can't use SparkR in RStudio because I'm getting this error:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
I have tried to search for a solution but can't find one. Here is how I have tried to set up SparkR:
Sys.setenv(SPARK_HOME = "C:/Users/alibaba555/Downloads/spark") # The path to your Spark installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library("SparkR", lib.loc = "C:/Users/alibaba555/Downloads/spark/R") # The path to the lib folder in the Spark location
library(SparkR)
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
Now execution starts with a message:
Launching java with spark-submit command
C:/Users/alibaba555/Downloads/spark/bin/spark-submit2.cmd
sparkr-shell
C:\Users\ALIBAB~1\AppData\Local\Temp\Rtmp00FFkx\backend_port1b90491e4622
And finally after a few minutes it returns an error message:
Error in sparkR.sparkContext(master, appName, sparkHome,
sparkConfigMap, : JVM is not ready after 10 seconds
Thanks!
It looks like the path to your Spark library is wrong. It should be something like: library("SparkR", lib.loc="C:/Users/alibaba555/Downloads/spark/R/lib")
I'm not sure if that will fix your problem, but it could help. Also, what versions of Spark/SparkR and Scala are you using? Did you build from source?
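As a hedged way to verify the paths before loading (reusing the folder from the question, which is an assumption here):
spark_home <- "C:/Users/alibaba555/Downloads/spark"
dir.exists(file.path(spark_home, "R", "lib", "SparkR"))   # should be TRUE if lib.loc is right
library(SparkR, lib.loc = file.path(spark_home, "R", "lib"))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))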
What was causing my issue boiled down to our users' working directory being a network-mapped drive.
Changing the working directory fixed the issue.
If by chance you are also using databricks-connect, make sure that the .databricks-connect file is copied into the %HOME% of each user who will be running RStudio, or set up databricks-connect for each of them.
I am a beginner in the Spark world, and want to do my Machine Learning algorithms using SparkR.
I installed Spark in standalone mode on my laptop (Windows 7, 64-bit) and I am able to run Spark (1.6.1) and PySpark, and to start SparkR in Windows, following this effective guide: link. Once I started SparkR, I began with the famous flights example:
#Set proxy
Sys.setenv(http_proxy="http://user:password@proxy.companyname.es:8080/")
#Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/Users/amartinezsistac/spark-1.6.1-bin-hadoop2.4")
#Load SparkR and its library
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R", "lib"), .libPaths()))
library(SparkR)
#Set Spark Context and SQL Context
sc = sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
#Read Data
link <- "s3n://mortar-example-data/airline-data"
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header= "true")
Nevertheless, I receive the following error message after the last line:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.ch
It seems the reason is that I have not installed the spark-csv package, which can be downloaded from this page (GitHub link). As on Stack and on the spark-packages.org website (link), the advice is to run: $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0, which is for a Linux installation.
My question is: How could I run this code line from Windows 7 cmd in order to download this package?
I also tried an alternate solution for my error message (Github) without success:
#In master you don't need spark-csv.
#CSV data source is built into SparkSQL. Just use it as follows:
flights <- read.df(sqlContext, "out/data.txt", source = "com.databricks.spark.csv", delimiter="\t", header="true", inferSchema="true")
Thanks in advance to everyone.
It is the same for Windows. When you start spark-shell from the bin directory, start it this way:
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
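If you want the same package from SparkR inside RStudio rather than from spark-shell, a hedged sketch (reusing the coordinates and data path from the question) is to set SPARKR_SUBMIT_ARGS before SparkR is loaded:
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--packages com.databricks:spark-csv_2.11:1.4.0 sparkr-shell")
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "s3n://mortar-example-data/airline-data",
                   source = "com.databricks.spark.csv", header = "true")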
I have installed Spark-1.4.0. I have also installed its R package SparkR and I am able to use it via spark-shell and via RStudio; however, there is one difference I cannot solve.
When launching the SparkR-shell
./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3
I can read a .csv-file as follows
flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")
Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME) I get the following error message:
15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I know I should load com.databricks:spark-csv_2.10:1.0.3 somehow, but I have no idea how to do this. Could someone help me?
This is the right syntax (after hours of trying):
(Note: focus on the first line, and notice the double quotes.)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
library(SparkR)
library(magrittr)
# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
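A short follow-up check that the load worked (what you see depends on the CSV itself):
printSchema(flights)   # the header row should have become column names
head(flights)          # first rows, collected back to R
count(flights)         # total number of rows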
My colleagues and I found the solution. We initialized the SparkContext like this:
sc <- sparkR.init(appName="SparkR-Example",sparkEnvir=list(spark.executor.memory="1g"),sparkJars="spark-csv-assembly-1.1.0.jar")
We did not find out how to load a remote jar, so we downloaded spark-csv_2.11-1.0.3.jar. Including this one jar in sparkJars, however, does not work, since its dependencies are not found locally. You can add a list of jars as well (see the sketch after the example below), but we built an assembly jar containing all dependencies. When loading this jar, it is possible to load the .csv file as desired:
flights <- read.df(sqlContext, "data/nycflights13.csv","com.databricks.spark.csv",header="true")
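For the list-of-jars variant mentioned above, a rough sketch could look like this; the jar names are illustrative, must exist locally on the driver, and commons-csv is given here as an example of a dependency spark-csv pulls in:
sc <- sparkR.init(appName = "SparkR-Example",
                  sparkEnvir = list(spark.executor.memory = "1g"),
                  sparkJars = "spark-csv_2.11-1.0.3.jar,commons-csv-1.1.jar")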
I downloaded Spark-1.4.0, went via the command line to the directory Spark-1.4.0/R, and built the SparkR package located in the subdirectory pkg as follows:
R CMD build --resave-data pkg
This gives you a .tar.gz file which you can install in RStudio (with devtools, you should be able to install the package from pkg as well).
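For instance, installing the built tarball from R might look like this; the exact file name depends on what R CMD build produced, so treat it as a placeholder:
install.packages("SparkR_1.4.0.tar.gz", repos = NULL, type = "source")
# or, with devtools, straight from the package source directory (run from Spark-1.4.0/R)
devtools::install("pkg")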
In RStudio, you should set your path to Spark as follows:
Sys.setenv(SPARK_HOME="path_to_spark/spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
And you should be ready to go. I can only speak from Mac experience; I hope it helps.
If you tried Pragith's solution above and you are still having the issue, it is very possible that the CSV file you want to load is not in the current RStudio working directory. Use getwd() to check the RStudio working directory and make sure the CSV file is there.
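A small check along those lines (the file name is taken from the examples above):
getwd()                              # where RStudio is currently looking
file.exists("nycflights13.csv")      # FALSE means read.df will not find it by a relative path
# setwd("path/to/folder/with/csv")   # or pass an absolute path to read.df instead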