I've got an issue with my Spark cluster. The cluster is set up on 3 EC2 machines, and everything works when I submit jobs from within the cluster.
What I'm trying to do is submit from my Windows 7 machine using Eclipse.
When I run the Pi.py example locally in Eclipse it works fine.
I'll show you my code so that it's easier to understand:
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.conf import SparkConf
from pyspark import SparkContext

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    conf = SparkConf()
    # "52.50.19.132" is the master's public IP, as the private one
    # didn't allow connection (don't know why... ssh?)
    conf.setMaster("spark://52.50.19.132:7077")
    conf.setAppName("PythonPi")
    conf.set("spark.executor.memory", "500m")
    sc = SparkContext(conf=conf)

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
When I run this from Eclipse I can connect to the master and I can see from the Spark UI that the application is running. But I keep on getting
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have sufficient resources
At first I thought it was a problem of closed ports, so I opened all of them, but nothing changed.
I also suspected an SSH connection problem, but I can connect to the EC2 machines via PuTTY without issues; if anything, I suspect the problem is in the other direction, i.e. the EC2 machines connecting back to the Windows machine.
I also checked via
netstat -an
whether anything looked wrong, and I found connections stuck in SYN_SENT between the EC2 workers and my Windows machine.
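If the workers do need to connect back to the driver, one thing to try is pinning the driver's callback endpoints so specific firewall rules can be opened for them. A minimal sketch of that idea (the IP and port values are placeholders; spark.driver.host, spark.driver.port and spark.blockManager.port are the standard Spark properties for this):
from pyspark.conf import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster("spark://52.50.19.132:7077")
conf.setAppName("PythonPi")
conf.set("spark.executor.memory", "500m")
# Placeholder values: advertise an address the workers can actually reach and
# fix the callback ports so they can be opened explicitly in the firewall.
conf.set("spark.driver.host", "203.0.113.10")   # the Windows machine, as seen from the workers
conf.set("spark.driver.port", "7078")
conf.set("spark.blockManager.port", "7079")
sc = SparkContext(conf=conf)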
I hope you can help
Related
Trying to start an h2o cluster on (MapR) hadoop via python
# startup hadoop h2o cluster
import os
import sys
import subprocess
import h2o
import shlex
import re
from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """
    Function for communicating streaming text lines from separate thread.
    see https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
                             shell=False,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read line without blocking
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait()  # or q.get(timeout=.1)
    except Empty:
        continue
    else:  # got line
        print line
        # check for first instance connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
This frequently throws a timeout error:
Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) and searching them for the word "timeout", I was unable to find anything that helped (e.g. extending the timeout via hadoop jar h2odriver.jar -timeout <some time> did nothing but delay the point at which the timeout error popped up).
I have noticed that this often happens when another h2o cluster instance is already up and running (which I don't understand, since I would think YARN could support multiple instances), yet it also sometimes happens when no other cluster is initialized.
Does anyone know anything else that could be tried to solve this problem, or how to get more debugging info beyond the error message thrown by h2o?
UPDATE:
Trying to recreate the problem from the commandline, getting
[me@mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.18.4.62]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
mapreduce.map.java.opts: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and noticing these lines later in the output:
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
I am confused by the reported 0 GB of memory and 0 vcores, because there are no other applications running on the cluster, and the cluster details in the YARN RM web UI show per-node memory and vcore availability [screenshot omitted; I could not find a single place in the log files with this info, and I don't know why the memory availability is so uneven despite there being no other running applications]. I should also mention that I don't have much experience tinkering with or examining YARN configs, so it's difficult for me to find the relevant information.
Could it be that I am starting the h2o cluster with -mapperXmx 6g, but (as shown in the screenshot) one of the nodes only has 5 GB of memory available, so that if this node is randomly selected to contribute to the initialized h2o application, it does not have enough memory to support the requested mapper size? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping multiple times without error seems to support this theory (though I need to check further to determine whether I'm interpreting things correctly).
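For reference, the container sizing printed by the driver lines up with this theory; a small worked calculation using only numbers from the output above:
mapper_xmx_gb = 6                                        # -mapperXmx 6g
extra_fraction = 0.10                                    # "Extra memory percent: 10"
nodes = 4                                                # -nodes 4
per_container_gb = mapper_xmx_gb * (1 + extra_fraction)  # 6.6 GB (the 6758 MB mapreduce.map.memory.mb)
total_request_gb = per_container_gb * nodes              # 26.4 GB, matching the WARNING line
# 6.6 GB per container is more than the 5.0 GB reported free on mnode01,
# so that node cannot host one of the requested mappers.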
This is most likely because your Hadoop cluster is busy and there just isn't room to start new YARN containers.
If you ask for N nodes, then you either get all N nodes, or the launch process times out like you are seeing. You can optionally use the -timeout command line flag to increase the timeout.
I have a Hadoop cluster which uses the company's Active Directory as its Kerberos realm. The nodes and the end-user Linux workstations are all Ubuntu 16.04. They are joined to the same domain using PowerBroker PBIS, so SSH logons between the workstations and the grid nodes are single sign-on. End-users run long-running scripts from their workstations; these scripts repeatedly use SSH to launch Spark / YARN jobs on the cluster and then keep track of their progress, and they have to keep running overnight and on weekends, well beyond the 10-hour lifetime of a Kerberos ticket.
I'm looking for a way to install permanent, service-style, Kerberos keytabs for the users, relieving them of the need to deal with kinit. I understand this would imply anyone with shell access to the grid as a particular user would be able to authenticate as that user.
I've also noticed that performing non-SSO SSH logins with a password automatically creates a new ticket valid from the time of the login. If this behaviour could be enabled for SSO logins, that would solve my problem.
You just have to ask users to add --principal and --keytab arguments to their Spark jobs. Then Spark (actually YARN) code will renew tickets for you automatically. We have jobs that run for weeks using this approach.
See for example https://spark.apache.org/docs/latest/security.html#yarn-mode
For long-running apps like Spark Streaming apps to be able to write to HDFS, it is possible to pass a principal and keytab to spark-submit via the --principal and --keytab parameters respectively. The keytab passed in will be copied over to the machine running the Application Master via the Hadoop Distributed Cache (securely - if YARN is configured with SSL and HDFS encryption is enabled). The Kerberos login will be periodically renewed using this principal and keytab and the delegation tokens required for HDFS will be generated periodically so the application can continue writing to HDFS.
You can see in Spark driver logs when Yarn renews a Kerberos ticket.
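For example (the principal and keytab path below are placeholders for your environment):
spark-submit --master yarn --deploy-mode cluster \
  --principal alice@EXAMPLE.COM \
  --keytab /etc/security/keytabs/alice.headless.keytab \
  my_job.py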
If you are accessing Hive/HBase or any other component that needs a Kerberos ticket, then make your Spark code re-login when the ticket expires. You have to update the ticket to use the keytab rather than relying on a TGT already existing in the cache. This is done using the UserGroupInformation class from the Hadoop security package. Add the snippet below to your long-running Spark job:
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod

val configuration = new Configuration
configuration.addResource("/etc/hadoop/conf/hdfs-site.xml")
UserGroupInformation.setConfiguration(configuration)
UserGroupInformation.getCurrentUser.setAuthenticationMethod(AuthenticationMethod.KERBEROS)
UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "hadoop.kerberos.principal", "path of hadoop.kerberos.keytab file")
  .doAs(new PrivilegedExceptionAction[Unit]() {
    override def run(): Unit = {
      // hbase/hive connection logic goes here
    }
  })
Above we specify the name of our service principal and the path to the keytab file we generated. As long as that keytab is valid our program will use the desired service principal for all actions, regardless of whether or not the user running the program has already authenticated and received a TGT.
If nothing other than Spark itself is being accessed, then you don't need the code above. Simply provide the keytab and principal in your spark-submit command:
spark-submit --master yarn-cluster --keytab "xxxxxx.keytab" --principal "svc-xxxx@xxxx.COM" xxxx.jar
I took the suggestion above to use the --keytab argument to specify a custom keytab on the grid node from which I submit to Spark. I create my own per-user keytab using the script below; it holds until the user changes their password.
Note that the script makes the simplifying assumption that the Kerberos realm is the same as the DNS domain and the LDAP directory where users are defined. This holds for my setup; use with care on yours. It also expects the users to be sudoers on that grid node. A more refined script might separate keytab generation from installation.
#!/usr/bin/python2.7
from __future__ import print_function

import os
import sys
import stat
import getpass
import subprocess
import collections
import socket
import tempfile

def runSudo(cmd, pw):
    try:
        subprocess.check_call("echo '{}' | sudo -S -p '' {}".format(pw, cmd), shell = True)
        return True
    except subprocess.CalledProcessError:
        return False

def testPassword(pw):
    subprocess.check_call("sudo -k", shell = True)
    if not runSudo("true", pw):
        print("Incorrect password for user {}".format(getpass.getuser()), file = sys.stderr)
        sys.exit(os.EX_NOINPUT)

class KeytabFile(object):
    def __init__(self, pw):
        self.userName = getpass.getuser()
        self.pw = pw
        self.targetPath = "/etc/security/keytabs/{}.headless.keytab".format(self.userName)
        self.tempFile = None

    KeytabEntry = collections.namedtuple("KeytabEntry", ("kvno", "principal", "encryption"))

    def LoadExistingKeytab(self):
        if not os.access(self.targetPath, os.R_OK):
            # Note: the assumption made here, that the Kerberos realm is same as the DNS domain,
            # may not hold in other setups
            domainName = ".".join(socket.getfqdn().split(".")[1:])
            encryptions = ("aes128-cts-hmac-sha1-96", "arcfour-hmac", "aes256-cts-hmac-sha1-96")
            return [
                self.KeytabEntry(0, "@".join((self.userName, domainName)), encryption)
                for encryption in encryptions]

        def parseLine(keytabLine):
            tokens = keytabLine.strip().split(" ")
            return self.KeytabEntry(int(tokens[0]), tokens[1], tokens[2].strip("()"))

        cmd = "klist -ek {} | tail -n+4".format(self.targetPath)
        entryLines = subprocess.check_output(cmd, shell = True).splitlines()
        return map(parseLine, entryLines)

    class KtUtil(subprocess.Popen):
        def __init__(self):
            subprocess.Popen.__init__(self, "ktutil",
                stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr = subprocess.PIPE, shell = True)

        def SendLine(self, line, expectPrompt = True):
            self.stdin.write(bytes(line + "\n"))
            self.stdin.flush()
            if expectPrompt:
                self.stdout.readline()

        def Quit(self):
            self.SendLine("quit", False)
            rc = self.wait()
            if rc != 0:
                raise subprocess.CalledProcessError(rc, "ktutil")

    def InstallUpdatedKeytab(self):
        fd, tempKt = tempfile.mkstemp(suffix = ".keytab")
        os.close(fd)
        entries = self.LoadExistingKeytab()
        ktutil = self.KtUtil()
        for entry in entries:
            cmd = "add_entry -password -p {} -k {} -e {}".format(
                entry.principal, entry.kvno + 1, entry.encryption)
            ktutil.SendLine(cmd)
            ktutil.SendLine(self.pw)
        os.unlink(tempKt)
        ktutil.SendLine("write_kt {}".format(tempKt))
        ktutil.Quit()
        if not runSudo("mv {} {}".format(tempKt, self.targetPath), self.pw):
            os.unlink(tempKt)
            print("Failed to install the keytab to {}.".format(self.targetPath), file = sys.stderr)
            sys.exit(os.EX_CANTCREAT)
        os.chmod(self.targetPath, stat.S_IRUSR)
        # TODO: Also change group to 'hadoop'

if __name__ == '__main__':
    def main():
        userPass = getpass.getpass("Please enter your password: ")
        testPassword(userPass)
        kt = KeytabFile(userPass)
        kt.InstallUpdatedKeytab()
    main()
I have a Spark/YARN cluster with 3 slaves set up on AWS.
I spark-submit a job like this:
~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster my.py
The final result is a file containing the hostnames from the slaves that ran tasks. I was expecting a mix of hostnames in the output file; however, I only see one hostname. That means YARN never utilizes the other slaves in the cluster.
Am I missing something in the configuration?
I have also included my spark-env.sh settings below.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop/
SPARK_EXECUTOR_INSTANCES=3
SPARK_WORKER_CORES=3
my.py
import socket
import time
from pyspark import SparkContext, SparkConf
def get_ip_wrap(num):
    return socket.gethostname()
conf = SparkConf().setAppName('appName')
sc = SparkContext(conf=conf)
data = [x for x in range(1, 100)]
distData = sc.parallelize(data)
result = distData.map(get_ip_wrap)
result.saveAsTextFile('hby%s'% str(time.time()))
After I updated the following settings in spark-env.sh, all slaves were utilized.
SPARK_EXECUTOR_INSTANCES=3
SPARK_EXECUTOR_CORES=8
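As a sanity check after the change, here is a minimal sketch (it just reuses the same trick as my.py; the partition and element counts are arbitrary) that prints the distinct worker hostnames that actually execute tasks:
import socket
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName('hostCheck'))
# Run many small tasks and collect the set of hostnames that executed them.
hosts = (sc.parallelize(range(1000), 24)
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print(hosts)  # with all three slaves utilized, this should list three hostnames
sc.stop()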
I have an Elasticsearch Docker image listening on 127.0.0.1:9200. I tested it using Sense and Kibana, and it works fine; I am able to index and query documents. Now when I try to write to it from a Spark app:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark._

val sparkConf = new SparkConf().setAppName("ES").setMaster("local")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.nodes", "127.0.0.1")
sparkConf.set("es.port", "9200")
sparkConf.set("es.resource", "spark/docs")

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val rdd = sc.parallelize(Seq(numbers, airports))
rdd.saveToEs("spark/docs")
It fails to connect, and keeps on retrying
16/07/11 17:20:07 INFO HttpMethodDirector: I/O exception (java.net.ConnectException) caught when processing request: Operation timed out
16/07/11 17:20:07 INFO HttpMethodDirector: Retrying request
I tried using the IP address given by docker inspect for the Elasticsearch image; that also does not work. However, when I use a native installation of Elasticsearch, the Spark app runs fine. Any ideas?
Also, set the config es.nodes.wan.only to true, as mentioned in this answer, if you are having issues writing to ES.
A couple of things I would check:
The Elasticsearch-Hadoop Spark connector version you are working with. Make sure it is not a beta; there was a bug related to IP resolution that has since been fixed.
Since 9200 is the default port, you may remove this line: sparkConf.set("es.port", "9200") and check.
Check that there is no proxy configured in your Spark environment or config files.
I assume you run Elasticsearch and Spark on the same machine. Can you try configuring your machine's IP address instead of 127.0.0.1?
Hope this helps! :)
I had the same problem, and a further issue was that the confs set using sparkConf.set() didn't have any effect. But supplying the confs to the saving function worked, like this:
rdd.saveToEs("spark/docs", Map("es.nodes" -> "127.0.0.1", "es.nodes.wan.only" -> "true"))
I am running a Spark job and I am setting the following configurations in spark-defaults.sh on the name node. I have 1 data node, and I am working on 2 GB of data.
spark.master spark://master:7077
spark.executor.memory 5g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
But I am getting an error saying GC overhead limit exceeded.
Here is the code I am working on.
import os
import sys
import unicodedata
from operator import add

try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print("Error importing Spark Modules", e)
    sys.exit(1)

# delimiter function
def findDelimiter(text):
    sD = text[1]
    eD = text[2]
    return (eD, sD)

def tokenize(text):
    sD = findDelimiter(text)[1]
    eD = findDelimiter(text)[0]
    arrText = text.split(sD)
    text = ""
    seg = arrText[0].split(eD)
    arrText = ""
    senderID = seg[6].strip()
    yield (senderID, 1)

conf = SparkConf()
sc = SparkContext(conf=conf)

textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")
rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(lambda a, b: a + b)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")
I even tried groupByKey instead of reduceByKey, but I get the same error. However, when I remove the reduceByKey or groupByKey, I do get output. Can someone help me with this error?
Should I also increase the GC size in Hadoop? And as I said earlier, I set driver.memory to 5g on the name node; should I do that on the data node as well?
Try adding the settings below to your spark-defaults.sh:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC
Tuning JVM garbage collection can be tricky, but G1GC seems to work pretty well. Worth trying!
The code you have should have worked with your configuration. As suggested earlier, try using G1GC.
Also try reducing the storage memory fraction. By default it is 60%; try reducing it to 40% or less.
You can set it by adding spark.storage.memoryFraction 0.4 to spark-defaults.sh.
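If you prefer to keep these tweaks in the job itself rather than in spark-defaults.sh, here is a minimal PySpark sketch applying both suggestions (the values mirror the ones above; driver JVM options still have to go in spark-defaults.sh or --driver-java-options, because the driver JVM is already running by the time this code executes, and spark.storage.memoryFraction is a legacy option in Spark 1.6+):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  # executor-side G1GC
        .set("spark.storage.memoryFraction", "0.4"))             # down from the 0.6 default
sc = SparkContext(conf=conf)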
I was able to solve the problem. I was running Hadoop as the root user on the master node, but had configured Hadoop under a different user on the data nodes. Once I configured it under the root user on the data node as well and increased the executor and driver memory, it worked fine.