Change the max-ppsize in R and SparkR

In R and SparkR I want to change the 'max-ppsize' option.
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
On page 93/105 there is a description of the command.
How exactly do I type it in once I have R or SparkR open on Ubuntu? I couldn't find a clear description for it elsewhere.
I have looked at the answer on
How to set max ppsize in R?
I can do it in R successfully, but I can't do it in SparkR.
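For reference, in plain R the option is passed on the command line when R is started; a minimal example (the value is just an illustration):
R --max-ppsize=500000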
I launch SparkR on Ubuntu like this:
cd /home/.../R/spark-1.4.0
and then I type
./bin/sparkR --max-ppsize=90000
But then I get this message: "log4j:WARN No appenders could be found for logger".

How to change the spark.r.backendConnectionTimeout value in RStudio?

I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. According to Spark's documentation on SparkR (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is 6000s. I would like to change this value to something larger so that my connection doesn't time out after the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
There is a similar post, but it involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: it is to modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
Important note: restart the Hadoop and YARN services, then try connecting to Spark with SparkR as usual:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the settings took effect at http://localhost:4040/environment/
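You can also read the effective value back from inside the session; a minimal sketch, assuming SparkR 2.0 or later (where sparkR.conf() is available):
sparkR.conf("spark.r.backendConnectionTimeout")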
I hope this is useful for other people.

Error: Could not find or load main class org.apache.hadoop.util.VersionInfo

Hi, I have just started learning Hadoop. I am following the book "Hadoop: The Definitive Guide". While setting up Hadoop on Windows 7 following Appendix A of the book, I am getting the error
"Error: Could not find or load main class org.apache.hadoop.util.VersionInfo"
when I am running the command
hadoop version
Does anyone have an idea how to correct this, or can you tell me what I am missing?
I am using Cygwin64.
If you add the following line at the end of hadoop-config.sh (available in HADOOP_INSTALL\libexec), it should work (cygpath converts the POSIX-style classpath to a Windows-style path so the JVM can find the classes):
CLASSPATH=`cygpath -wp "$CLASSPATH"`
I solved this issue on Oracle Linux by adding HADOOP_PREFIX=

Input path does not exist: file:/D:/pigsample_1749383998_1377684507424

I am facing a weird issue.
I am running Pig 0.11 on a Windows 7 / 64-bit machine with the latest version of Cygwin.
I have a weblog which I want to order by username, so that all the activities for the same user are grouped together for the next stage of processing.
I start a command prompt, run cygwin.bat, go to D:/ in the Cygwin console, run pig, and type the following script at the grunt shell (local mode).
(Note: I've set PIG_HOME and PIG_CLASSPATH correctly.)
The script is:
USERACTIVITIES = LOAD '/D:/path/of/logs/useractivities' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') AS (datetimeUnProcessed:chararray, username:chararray, request:chararray);
USERACTIVITIES_ORDERED = ORDER USERACTIVITIES by username;
STORE USERACTIVITIES_ORDERED INTO '/D:/readyfornextinput/useractivities' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
When I run illustrate USERACTIVITIES_ORDERED, everything looks fine.
But when I do a store/dump, I hit a weird issue.
It fails with:
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/pigsample_1749383998_1377684507424
When I searched for this pigsample_* file, I found it in:
D:/tmp//mapred/local/localRunner
I am not sure how this is happening.
I am not sure whether it is a Windows/Cygwin-related issue or whether anyone has seen this on Linux as well.
For reference, the stack trace is attached here:
2013-08-28 15:38:28,863 [Thread-46] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/pigsample_1749383998_1377684507424
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:157)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:677)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/pigsample_1288777582_1377684802262
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:126)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:131)
... 6 more
Any help on this will be useful.
It looks like this is reproducible only in the Cygwin environment.
I've documented the root cause and the solution here.

Hadoop and Hive Homes in CDH4

I'm trying to configure RHive in the CDH4 environment.
When loading the 'RHive' package in R, the error below is returned.
I'm guessing that's due to wrong homes.
If so, what would be the correct ones?
Or, if that's not the reason, what is going wrong?
Any help would be much appreciated.
Thanks.
> Sys.setenv(HIVE_HOME="/etc/hive")
> Sys.setenv(HADOOP_HOME="/etc/hadoop")
> library(RHive)
Loading required package: rJava
Loading required package: Rserve
This is RHive 0.0-7. For overview type '?RHive'.
HIVE_HOME=/etc/hive
[1] "there is no slaves file of HADOOP. so you should pass hosts argument when you call rhive.connect()."
Error : .onLoad failed in loadNamespace() for 'RHive', details:
call: .jnew("org/apache/hadoop/conf/Configuration")
error: java.lang.ClassNotFoundException
In addition: Warning message:
In file(file, "rt") :
cannot open file '/etc/hadoop/conf/slaves': No such file or directory
Error: package/namespace load failed for 'RHive'
I had the same problem but solved it. The downside is that I have to keep track of a bunch of symlinks.
After struggling to install RHive_0.0-7.tar.gz on CDH 4.7.x and getting:
Warning in file(file, "rt") :
cannot open file '/etc/hadoop/conf/slaves': No such file or directory
[1] "there is no slaves file of HADOOP. so you should pass hosts argument when you call rhive.connect()."
In /etc/hadoop/conf I added the following symlink:
ln -s /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/etc/hadoop/conf.empty/slaves slaves
(Why Cloudera CDH 4.7 installs into /opt without creating the proper symlinks from /usr/lib is puzzling.)
I also defined the following in /usr/lib64/R/etc/Renviron:
## set hive paths
HIVE_HOME='/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hive'
HADOOP_HOME='/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop'
LD_LIBRARY_PATH='/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop'
At a shell prompt I ran R CMD INSTALL RHive_0.0-7.tar.gz
Installation Happiness!!
++++++
Inside R-Studio (server)
>
> library(RHive)
Loading required package: rJava
Loading required package: Rserve
This is RHive 0.0-7. For overview type ‘?RHive’.
HIVE_HOME=/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hive
call rhive.init() because HIVE_HOME is set.
rhive.init()
>
+++++++
You should set the HADOOP_CONF_DIR separately.
Try export HADOOP_CONF_DIR=/etc/hadoop/conf/conf.pseudo
The conf.pseudo directory has the slaves file.
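If you prefer to set this from inside R before loading RHive, a minimal sketch using the same directory (base R only, no extra packages assumed):
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf/conf.pseudo")  # same path as the export above
library(RHive)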
Though I'd be curious to see if you can make RHive work with CDH4.

Loading SparkR within RStudio: Err [could not find function "launchBackend"]

I'm having trouble loading SparkR in RStudio. When I try to create a Spark context, I get the error:
Could not find function "launchBackend"
I can get sparkR working in my terminal shell, but that's also because I don't have to create a Spark context there (it seems like one is already set up). Any help?
Sys.setenv(SPARK_HOME="~/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
One thing which might help is to type in the full path; R doesn't expand the tilde here.
Instead of:
Sys.setenv(SPARK_HOME="~/spark-1.4.1")
Something like this:
Sys.setenv(SPARK_HOME="/home/spark-1.4.1")
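Alternatively, you can let R expand the tilde for you with path.expand() from base R; a minimal sketch:
Sys.setenv(SPARK_HOME = path.expand("~/spark-1.4.1"))  # expands "~" to the user's home directory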
