When creating a Spark context in PySpark, I typically use the following code:
conf = (SparkConf().setMaster("yarn-client").setAppName(appname)
.set("spark.executor.memory", "10g")
.set("spark.executor.instances", "7")
.set("spark.driver.memory", "5g")
.set("spark.shuffle.service.enabled","true")
.set("spark.dynamicAllocation.enabled","true")
.set("spark.dynamicAllocation.minExecutors","5")
)
sc = SparkContext(conf=conf)
However, this puts it in the default queue, which is almost always over capacity. We have several less busy queues available, so my question is - how do I set my Spark context to use another queue?
Edit: To clarify - I'm looking to set the queue for interactive jobs (e.g., exploratory analysis in a Jupyter notebook), so I can't set the queue with spark-submit.
You can use the following argument in your spark-submit command:
--queue queue_name
You can also set this property in your code: spark.yarn.queue
Hope this will help.
Thanks
Try to use spark.yarn.queue rather than queue.
conf = pyspark.SparkConf().set("spark.yarn.queue", "your_queue_name")
sc = pyspark.SparkContext(conf=conf)
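For the notebook case in the question, another route I know of is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. Below is a minimal sketch, not a definitive recipe: the helper name build_submit_args and the queue name "analytics" are made up for illustration. The variable has to be set before the SparkContext is created, and as far as I know the value needs to end with pyspark-shell.

```python
import os


def build_submit_args(queue, extra_args=()):
    """Build a PYSPARK_SUBMIT_ARGS value that selects a YARN queue.

    `queue` and `extra_args` are illustrative parameters of this helper,
    not part of any Spark API.
    """
    parts = ["--queue", queue, *extra_args, "pyspark-shell"]
    return " ".join(parts)


# "analytics" is a placeholder queue name; replace it with a real queue.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args("analytics")

# After this point, create the context as usual, e.g.:
#   sc = SparkContext(conf=conf)
```

Setting spark.yarn.queue on the SparkConf, as in the answer above, does the same job; the environment variable is mainly handy when the context is created by notebook startup code you don't control.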
I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. According to Spark's configuration documentation (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is 6000s. I would like to change this value to something large enough that my connection doesn't time out before the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
A similar post is around, but that involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: it is to modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
IMPORTANT note - restart hadoop and yarn services, and try connecting to Spark with SparkR normally:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the setting took effect at http://localhost:4040/environment/
I hope this is useful for other people.
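Before restarting anything, it can also help to confirm that the line really is in the file and parses the way you expect. Here is a rough sketch of a reader for spark-defaults.conf style text. It is a simplified parser written for illustration (the function name read_spark_defaults is my own), not Spark's actual loader, and it only handles "key value" and "key = value" lines.

```python
import re


def read_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict.

    Handles 'key value' and 'key = value' lines and skips blank lines
    and # comments. Simplified on purpose; not Spark's own loader.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split once, on the first '=' (with optional spaces) or whitespace run.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


# Sample contents for demonstration; in practice, read the real file instead.
sample = """
# example contents
spark.master local
spark.r.backendConnectionTimeout = 6000000
"""
settings = read_spark_defaults(sample)
print(settings.get("spark.r.backendConnectionTimeout"))
```

To run it against a real installation, pass in the contents of the conf/spark-defaults.conf file under your SPARK_HOME (that location is an assumption about your setup).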
I have an Elasticsearch Docker image listening on 127.0.0.1:9200. I tested it using Sense and Kibana, and it works fine: I am able to index and query documents. Now, when I try to write to it from a Spark app:
val sparkConf = new SparkConf().setAppName("ES").setMaster("local")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.nodes", "127.0.0.1")
sparkConf.set("es.port", "9200")
sparkConf.set("es.resource", "spark/docs")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val rdd = sc.parallelize(Seq(numbers, airports))
rdd.saveToEs("spark/docs")
It fails to connect, and keeps on retrying
16/07/11 17:20:07 INFO HttpMethodDirector: I/O exception (java.net.ConnectException) caught when processing request: Operation timed out
16/07/11 17:20:07 INFO HttpMethodDirector: Retrying request
I tried using the IPAddress given by docker inspect for the Elasticsearch image, but that does not work either. However, when I use a native installation of Elasticsearch, the Spark app runs fine. Any ideas?
Also, set the config es.nodes.wan.only to true, as mentioned in this answer, if you are having issues writing to ES.
A couple of things I would check:
The Elasticsearch-Hadoop Spark connector version you are working with. Make sure that it is not a beta release; there was a bug related to IP resolution that has since been fixed.
Since 9200 is the default port, you may remove this line: sparkConf.set("es.port", "9200") and check.
Check that there is no proxy configured in your Spark environment or config files.
I assume that you run Elasticsearch and Spark on the same machine. Can you try configuring your machine's IP address instead of 127.0.0.1?
Hope this helps! :)
Had the same problem and a further issue was that the confs set using sparkConf.set() didn't have an effect. But supplying the confs with the saving function worked, like this:
rdd.saveToEs("spark/docs", Map("es.nodes" -> "127.0.0.1", "es.nodes.wan.only" -> "true"))
I am running a Spark job with the following configuration in spark-defaults.conf, which I changed on the name node. I have 1 data node, and I am working on 2 GB of data.
spark.master spark://master:7077
spark.executor.memory 5g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
But I am getting an error saying "GC overhead limit exceeded".
Here is the code I am working on.
import os
import sys
import unicodedata
from operator import add

try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print("Error importing Spark Modules", e)
    sys.exit(1)

# delimiter function
def findDelimiter(text):
    sD = text[1]
    eD = text[2]
    return (eD, sD)

def tokenize(text):
    sD = findDelimiter(text)[1]
    eD = findDelimiter(text)[0]
    arrText = text.split(sD)
    text = ""
    seg = arrText[0].split(eD)
    arrText = ""
    senderID = seg[6].strip()
    yield (senderID, 1)

conf = SparkConf()
sc = SparkContext(conf=conf)
textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")
rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(lambda a, b: a + b)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")
I even tried groupByKey instead of reduceByKey, but I get the same error. When I remove the reduceByKey or groupByKey, I do get output. Can someone help me with this error?
Should I also increase the memory available for GC in Hadoop? As I said earlier, I have set driver.memory to 5 GB, but I did that on the name node. Should I do that on the data node as well?
Try adding the settings below to your spark-defaults.conf:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC
Tuning JVM garbage collection can be tricky, but G1GC seems to work pretty well. Worth trying!
The code you have should have worked with your configuration. As suggested earlier, try using G1GC.
Also try reducing the storage memory fraction. By default it's 60%; try reducing it to 40% or less.
You can set it by adding spark.storage.memoryFraction 0.4
I was able to solve the problem. I was running Hadoop as the root user on the master node, but I had configured Hadoop under a different user on the data nodes. Once I configured it under the root user on the data node as well and increased the executor and driver memory, it worked fine.
I'm new to Storm and am trying to use debugging.
I set topology.debug: true in storm.yaml,
but after submitting the topology I couldn't find the debug output.
I noticed in the Storm UI that topology.debug is false!
Why didn't it pick up my changes?
Each node/machine in your cluster has its own storm.yaml file, so changes to your local storm.yaml do not have any effect. However, you can override this value via a topology configuration that is provided when you submit the topology:
Config cfg = new Config();
cfg.setDebug(true);
StormSubmitter.submitTopology("myTopology", cfg, builder.createTopology());
You will find the log files on the nodes in your cluster in your_storm_dir/logs/
The problem below the line is solved but I am facing another problem.
I am doing this :
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(new URI
("hdfs:/user/hadoop/harsh/libnative1.so"),conf.getConfiguration());
and in the mapper :
System.loadLibrary("libnative1.so");
I also tried:
System.loadLibrary("libnative1");
System.loadLibrary("native1");
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I have no idea what I should set java.library.path to.
I tried setting it to /home and copied every .so from the distributed cache to /home/, but it still didn't work :(
Any suggestions / solutions please?
----------------------------------------
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried :
Runtime run = Runtime.getRuntime();
Process pr=run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using hadoop version 0.20.2 which has Job & Configuration instead of JobConf.
I am unable to find any such variable, and this is also not a Hadoop specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance..
Why don't you export this variable on all nodes of the cluster?
Anyway, use the Configuration class as shown below when submitting the job:
Configuration conf = new Configuration();
conf.set("mapred.map.child.env",<string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2
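To make the k1=v1,k2=v2 format concrete, here is a small sketch that builds such a string from a dictionary. It is written in Python purely to illustrate the string format, even though the job itself is Java. The helper name format_child_env and the FOO variable are made up for illustration, and the check that rejects commas is my own conservative guard (since the comma is the separator), not documented Hadoop behaviour.

```python
def format_child_env(env):
    """Join a dict of environment variables as 'k1=v1,k2=v2'.

    Rejects keys or values that would make the result ambiguous,
    since ',' separates entries and the first '=' separates key and value.
    """
    for key, value in env.items():
        if "," in key or "=" in key or "," in value:
            raise ValueError(f"cannot encode entry {key!r}={value!r}")
    return ",".join(f"{key}={value}" for key, value in env.items())


# LD_LIBRARY_PATH comes from the question; FOO=bar is a hypothetical extra entry.
value = format_child_env({"LD_LIBRARY_PATH": "/usr/local/", "FOO": "bar"})
print(value)
```

The resulting string is what you would pass as the second argument to conf.set(...) in the snippet above.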