Spark node is not starting in DSE cluster - hadoop

The Analytics node appears to be down:
Datacenter: Analytics
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns
DN 4.44 MB 1 ?
Datacenter: Cassandra
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns
UN 281.94 GB 1 ?
UN 281.21 GB 1 ?
UN 281.23 GB 1 ?
Datacenter: Solr
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns
UN 277.97 GB 1 ?
UN 286.75 GB 1 ?
So I logged in to that node and tried to start Spark:
dse cassandra -k
But I get the following exception:
Exception encountered during startup: null
INFO 07:15:58 DSE shutting down...
INFO 07:15:58 All plugins are stopped.


YARN complains No route to host (Host unreachable)

I am attempting to run H2O on an HDP 3.1 cluster and running into an error that appears to be about YARN resource capacity:
[ml1user@HW04 h2o-]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address:]
[Possible callback IP address:]
[Possible callback IP address:]
Using mapper->driver callback IP address and port:
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Looking at the YARN configs in the Ambari UI, these properties are nowhere to be found. But checking the logs for the killed application in the YARN ResourceManager UI, I see what appear to be unreachable-host errors:
Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
No route to host (Host unreachable)
at Method)
at water.hadoop.EmbeddedH2OConfig$
End of LogType:stderr
Note the "No route to host (Host unreachable)" message. However, all the nodes can reach and ping each other, so I am not sure what is going on here. Any suggestions for debugging or fixing this?
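As a side check on the capacity hint in the driver output above, the requested container size can be reproduced by hand. This is a rough sketch, assuming the driver simply adds the "extra memory percent" on top of -mapperXmx (which matches the 11264 printed in the log); the function name is made up for illustration.

```python
def container_size_mb(mapper_xmx_gb, extra_mem_percent):
    """Heap size plus the extra-memory overhead, in MB (assumed formula)."""
    heap_mb = mapper_xmx_gb * 1024
    return heap_mb * (100 + extra_mem_percent) // 100

requested_mb = container_size_mb(10, 10)   # -mapperXmx 10g, extra memory percent 10
node_capacity_mb = 15 * 1024               # each NodeManager reports 15.0 GB

print(requested_mb)                        # 11264
print(requested_mb <= node_capacity_mb)    # True
```

By this estimate one 11 GB container fits on each of the three 15 GB nodes, provided the scheduler's maximum allocation is not set lower than that, which suggests memory may not be the real cause here.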
I think I found the problem. TL;DR: firewalld was still running (the nodes run CentOS 7), when it should be disabled on HDP clusters.
From another community post:
For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:
systemctl disable firewalld
service firewalld stop
So apparently iptables and firewalld need to be disabled across the whole cluster (supporting docs can be found here); I had only disabled them on the Ambari installation node. After stopping these services across the cluster (I recommend using clush), I was able to run the YARN job without incident.
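A successful ping does not prove that a TCP port is reachable: a firewall that rejects connections can surface as "No route to host" even when ICMP echo works. The following is a minimal sketch for probing a single port from one node to another; the function name and the example host and port are placeholders, not values taken from the cluster above.

```python
import errno
import socket

def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect and return 'open' or a short description of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host"
        if exc.errno == errno.ECONNREFUSED:
            return "connection refused"
        return "error: %s" % exc

if __name__ == "__main__":
    # Placeholder target: replace with a worker host and the port the job needs.
    print(probe_tcp("127.0.0.1", 8042))
```

Running this from every node against every other node, before and after stopping the firewall, should show whether the firewall is what blocks the callback connection.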
Normally, this problem is due to bad DNS configuration, a firewall, or network unreachability. To quote this official doc:
The hostname of the remote machine is wrong in the configuration files
The client's host table /etc/hosts has an invalid IPAddress for the target host.
The DNS server's host table has an invalid IPAddress for the target host.
The client's routing tables (In Linux, iptables) are wrong.
The DHCP server is publishing bad routing information.
Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it is to deliberately lock down the Hadoop cluster.
The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6.
The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs
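Several of the causes listed above come down to what a hostname resolves to on each machine. Here is a small sketch, using only the standard library, that collects the IPv4 and IPv6 addresses a name resolves to so the results can be compared across nodes; the function name is illustrative.

```python
import socket

def resolve(hostname):
    """Return the distinct IPv4 and IPv6 addresses that hostname resolves to."""
    result = {"ipv4": [], "ipv6": []}
    families = {socket.AF_INET: "ipv4", socket.AF_INET6: "ipv6"}
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(hostname, None):
        key = families.get(family)
        # sockaddr[0] is the address string for both IPv4 and IPv6 entries.
        if key and sockaddr[0] not in result[key]:
            result[key].append(sockaddr[0])
    return result
```

Calling resolve with the ResourceManager's or the driver's hostname on each node, and comparing the output, is one way to spot a stale /etc/hosts entry or an unexpected IPv6 address.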
For me, the problem was that the driver was running inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not on the same subnet. The solution, as given in this answer, was to set the following configurations:
<container's host IP accessible by the workers>
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
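The property name in front of the first value did not survive in the text above. As a sketch only, and assuming it was spark.driver.host (a standard Spark property for the address the driver advertises to executors), the settings could be assembled into spark-submit arguments like this. The helper name, the example IP, and the port numbers are made up for illustration.

```python
def driver_conf_args(host_ip, driver_port, block_manager_port):
    """Build --conf arguments for a driver running behind forwarded ports."""
    conf = {
        # Assumption: the key lost from the answer is spark.driver.host.
        "spark.driver.host": host_ip,
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", "%s=%s" % (key, value)]
    return args

# Hypothetical values: the Docker host's IP and two ports published by the container.
print(" ".join(driver_conf_args("192.0.2.10", 40000, 40001)))
```

Depending on the Spark version, spark.driver.bindAddress may also be needed so that the driver binds to an address that exists inside the container; that is not part of the answer above.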

Cassandra - unable to connect via cqlsh

I have a problem connecting to Cassandra via cqlsh. I've deployed a cluster of 3 nodes on CentOS 7. I can see that the nodes are connecting to each other. The nodetool status output is below:
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN ${SEED2} 226.47 KiB 1 60,3% <hash> rack1
UN ${SEED} 190.77 KiB 1 50,9% <hash> rack1
UN ${IP} 157.62 KiB 1 88,7% <hash> rack1
But connecting via cqlsh doesn't work. I've tried connecting to localhost and to the node IP. Here is the output of the cqlsh command:
[root@node02 default.conf]# cqlsh
Connection error: ('Unable to connect to any servers', {'':
error(111, "Tried connecting to [('', 9042)]. Last error:
Connection refused")})
[root@node02 default.conf]# cqlsh ${IP}
connection error: ('Unable to connect to any servers', {'${IP}':
ConnectionShutdown('Connection to ${IP} was closed',)})
It's not obvious to me why 'Connection to ... was closed' is printed when connecting to the rpc_address, but 'Connection refused' when connecting to localhost.
Does anyone know the cause of this problem?
The cassandra.yaml file is below:
# Cassandra storage config YAML
cluster_name: '${NAME}'
hinted_handoff_enabled: true
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
- /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
hints_directory: /var/lib/cassandra/hints
key_cache_size_in_mb: 2
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
saved_caches_directory: /var/lib/cassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
concurrent_reads: 32
concurrent_writes: 32
storage_port: 7000
ssl_storage_port: 7001
rpc_port: 9042
start_rpc: true
rpc_keepalive: true
rpc_server_type: sync
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
listen_address: ${IP}
rpc_address: ${IP}
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: ${IP},${SEED}
Found the issue: you set rpc_port to 9042. I think you're confusing RPC (Thrift) with the native (CQL) transport. RPC is the old interface, which is deprecated in later releases, while cqlsh connects over the native transport, whose native_transport_port defaults to 9042. I would recommend setting start_rpc to false and setting rpc_port back to its default value, 9160.
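The clash described above can be illustrated with a small check. This is a hypothetical helper, not part of Cassandra; it takes the port settings as a plain dictionary rather than parsing the YAML, and it assumes native_transport_port is left at its default because the file in the question does not set it.

```python
# Defaults for the two client-facing ports in cassandra.yaml.
DEFAULTS = {"rpc_port": 9160, "native_transport_port": 9042}

def port_conflicts(config):
    """Return groups of settings in config that resolve to the same port."""
    ports = {}
    for key in ("storage_port", "ssl_storage_port", "rpc_port", "native_transport_port"):
        port = config.get(key, DEFAULTS.get(key))
        if port is not None:
            ports.setdefault(port, []).append(key)
    return [tuple(keys) for keys in ports.values() if len(keys) > 1]

# Port-related values from the cassandra.yaml in the question.
question_config = {"storage_port": 7000, "ssl_storage_port": 7001, "rpc_port": 9042}
print(port_conflicts(question_config))   # [('rpc_port', 'native_transport_port')]
```

With rpc_port back at 9160 the function returns an empty list, which matches the recommendation in the answer.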

Spark streaming from Kafka returns result on local but Not working on Yarn

I am using Cloudera's VM CDH 5.12, Spark v1.6, Kafka v0.10 (installed by yum), Python 2.66, and Scala 2.10.
Below is a simple Spark application that I am running. It takes events from Kafka and prints them after a map-reduce step.
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: <zk> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval

    zkQuorum, topic = sys.argv[1:]
    # Receiver-based stream: consumer group name, then {topic: number of threads}
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])  # keep only the message value
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()  # print the word counts of each batch

    ssc.start()
    ssc.awaitTermination()
When I submit the above code with the following command (local), it runs fine:
spark-submit --master local[2] --jars /usr/lib/spark/lib/spark-examples.jar <ZKhostname>:2181 <kafka-topic>
But when I submit the same code with the following command (YARN), it doesn't work:
spark-submit --master yarn --deploy-mode client --jars /usr/lib/spark/lib/spark-examples.jar <ZKhostname>:2181 <kafka-topic>
Here is the log generated when run on YARN (cut short; the logs may differ from the Spark settings mentioned above):
INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host:
ApplicationMaster RPC port: 0
queue: root.cloudera
start time: 1515766709025
final status: UNDEFINED
tracking URL: http://quickstart.cloudera:8088/proxy/application_1515761416282_0010/
user: cloudera
40 INFO YarnClientSchedulerBackend: Application application_1515761416282_0010 has started running.
40 INFO Utils: Successfully started service '' on port 53694.
40 INFO NettyBlockTransferService: Server created on 53694
53 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
54 INFO BlockManagerMasterEndpoint: Registering block manager quickstart.cloudera:56220 with 534.5 MB RAM, BlockManagerId(1, quickstart.cloudera, 56220)
07 INFO ReceiverTracker: Starting 1 receivers
07 INFO ReceiverTracker: ReceiverTracker started
07 INFO PythonTransformedDStream: metadataCleanupDelay = -1
07 INFO KafkaInputDStream: metadataCleanupDelay = -1
07 INFO KafkaInputDStream: Slide time = 10000 ms
07 INFO KafkaInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
07 INFO KafkaInputDStream: Checkpoint interval = null
07 INFO KafkaInputDStream: Remember duration = 10000 ms
07 INFO KafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka.KafkaInputDStream#7137ea0e
07 INFO PythonTransformedDStream: Slide time = 10000 ms
07 INFO PythonTransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
07 INFO PythonTransformedDStream: Checkpoint interval = null
07 INFO PythonTransformedDStream: Remember duration = 10000 ms
07 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream#de77734
10 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.8 KB, free 534.5 MB)
10 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.5 KB, free 534.5 MB)
20 INFO JobScheduler: Added jobs for time 1515766760000 ms
30 INFO JobScheduler: Added jobs for time 1515766770000 ms
40 INFO JobScheduler: Added jobs for time 1515766780000 ms
After this, the job just keeps repeating the following lines (after the delay set by the streaming context) and doesn't print Kafka's stream, whereas the job on master local with the exact same code does.
Interestingly, it prints the following line every time a Kafka event occurs (the picture shows increased Spark memory settings)
Note that:
Data is in Kafka and I can see it in the consumer console
I have also tried increasing the executor's memory (3g) and the network timeout (800s), but with no success
Can you see the application's stdout logs through the YARN ResourceManager UI?
Follow your YARN ResourceManager link (http://localhost:8088).
Find your application in the list of running applications and follow the application's link (http://localhost:8088/application_1396885203337_0003/).
Open the "stdout : Total file length is xxxx bytes" link to see the log file in the browser.
Hope this helps.
In local mode the application runs on a single machine, so you see all the prints from the code. When run on a cluster, everything is distributed and runs on different machines/cores, so you will not see that printed output in your console.
Try to get the logs generated by Spark using the command yarn logs -applicationId <application ID>
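For repeated debugging runs it can help to wrap that command. A minimal sketch, assuming the yarn CLI is on the PATH of the machine where it runs; the function names are made up, and the application ID is the one from the log in the question. Aggregated logs are typically only available once the application has finished and log aggregation is enabled.

```python
import re
import shutil
import subprocess

APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")

def yarn_logs_command(application_id):
    """Return the argv list for fetching the logs of one application."""
    if not APP_ID_PATTERN.match(application_id):
        raise ValueError("not a YARN application ID: %r" % application_id)
    return ["yarn", "logs", "-applicationId", application_id]

def fetch_logs(application_id):
    """Run the command and return its stdout, or None if yarn is not on PATH."""
    if shutil.which("yarn") is None:
        return None
    result = subprocess.run(yarn_logs_command(application_id),
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout

print(yarn_logs_command("application_1515761416282_0010"))
```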
It's possible that your <ZKhostname> is an alias that is not defined on the YARN nodes, or is not resolved on the YARN nodes for other reasons.

Cassandra Node communication issue

I have a two-node cluster on AWS. Everything was working fine until yesterday.
Today I ran into a problem: when I run nodetool status, the following output appears.
Node1 thinks Node2 is down and vice versa.
From ip2
ip2$ nodetool status
Datacenter: DC1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN <ip1> ? 256 ? 27c91f95-4b58-492b-a16e-d9b99867a505 r1
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN <ip2> 9.11 GiB 256 ? e628324d-34dd-4c9c-a53d-99abfacb54af rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
From ip1
ip1$ nodetool status
Datacenter: DC1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN <ip2> ? 256 ? e628324d-34dd-4c9c-a53d-99abfacb54af r1
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN <ip1> 9.14 GiB 256 ? 27c91f95-4b58-492b-a16e-d9b99867a505 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
According to the last line there is some problem with the replication settings, but I am not able to figure it out. Please suggest.
WARN [OptionalTasks:1] 2017-08-08 15:33:37,223 - CassandraRoleManager skipped default role setup: some nodes were not ready
INFO [OptionalTasks:1] 2017-08-08 15:33:37,223 - Setup task failed with error, rescheduling
INFO [HANDSHAKE-/] 2017-08-08 15:33:37,340 - Handshaking version with /
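One detail in the two nodetool outputs above is easy to miss: each node lists itself under datacenter1/rack1 but lists its peer under DC1/r1. A rough sketch of a parser that makes this visible by extracting the datacenter, state, address, and rack of every row; it assumes the column layout shown above, and the function name is made up.

```python
STATES = {"UN", "UL", "UJ", "UM", "DN", "DL", "DJ", "DM"}

def parse_nodetool_status(text):
    """Return a list of (datacenter, state, address, rack) tuples."""
    rows = []
    datacenter = None
    for line in text.splitlines():
        fields = line.split()
        if line.strip().startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
        elif datacenter and len(fields) >= 3 and fields[0] in STATES:
            # The first column is the state, the second the address, the last the rack.
            rows.append((datacenter, fields[0], fields[1], fields[-1]))
    return rows

# The output seen from ip2, copied from the question.
sample = """Datacenter: DC1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN <ip1> ? 256 ? 27c91f95-4b58-492b-a16e-d9b99867a505 r1
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN <ip2> 9.11 GiB 256 ? e628324d-34dd-4c9c-a53d-99abfacb54af rack1
"""
for row in parse_nodetool_status(sample):
    print(row)
```

Comparing the rows from both nodes shows the same host ID appearing under two different datacenter and rack names, which may be worth checking against the snitch configuration on each node.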

spring-xd yarn admin yarn-container fails

Version: spring-xd-1.0.1
Distributed mode: yarn
Hadoop version: cdh5
I have modified config/servers.yml to point to the right applicationDir, ZooKeeper, HDFS, ResourceManager, Redis, and MySQL DB.
However, after the push, when I start the admin, it is killed by YARN after some time.
I do not understand why the container would consume 31G of memory.
Please point me in the right direction to debug this problem. Also, how do I increase the log level?
The following error is observed in the logs:
Got ContainerStatus=[container_id { app_attempt_id { application_id { id: 432 cluster_timestamp: 1415816376410 } attemptId: 1 } id: 2 } state: C_COMPLETE diagnostics: "Container [pid=19374,containerID=container_1415816376410_0432_01_000002] is running beyond physical memory limits. Current usage: 1.2 GB of 1 GB physical memory used; 31.7 GB of 2.1 GB virtual memory used. Killing container.\nDump of the process-tree for container_1415816376410_0432_01_000002 :\n\t|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE\n\t|- 19381 19374 19374 19374 (java) 3903 121 33911242752 303743 /usr/java/jdk1.7.0_45-cloudera/bin/java -DxdHomeDir=./ -Dxd.module.config.location=file:./ -Dspring.config.location=./servers.yml org.springframework.xd.dirt.server.AdminServerApplication \n\t|- 19374 24125 19374 19374 (bash) 0 0 110804992 331 /bin/bash -c /usr/java/jdk1.7.0_45-cloudera/bin/java -DxdHomeDir=./ -Dxd.module.config.location=file:./ -Dspring.config.location=./servers.yml org.springframework.xd.dirt.server.AdminServerApplication 1>/var/log/hadoop-yarn/container/application_1415816376410_0432/container_1415816376410_0432_01_000002/Container.stdout 2>/var/log/hadoop-yarn/container/application_1415816376410_0432/container_1415816376410_0432_01_000002/Container.stderr \n\nContainer killed on request. Exit code is 143\nContainer exited with a non-zero exit code 143\n" exit_status: 143
Yes, with the current version 1.1.0/1.1.1 you don't need to run the admin explicitly. The containers and admin will be instantiated by yarn when you submit the application.
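The diagnostics line in the question can be unpacked with a little arithmetic. The sketch below sums the per-process figures from the process-tree dump; it assumes a page size of 4 KiB (typical on x86-64 Linux, but not stated in the log).

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
GIB = 1024 ** 3

# (VMEM_USAGE in bytes, RSSMEM_USAGE in pages) for the java and bash processes
processes = [(33911242752, 303743), (110804992, 331)]

virtual_gib = sum(vmem for vmem, _ in processes) / GIB
physical_gib = sum(pages for _, pages in processes) * PAGE_SIZE / GIB

print(round(virtual_gib, 1))    # 31.7
print(round(physical_gib, 1))   # 1.2
```

So the 31.7 GB figure is virtual address space, while resident memory is about 1.2 GB against a 1 GB container, which is what the "running beyond physical memory limits" message reports. The 2.1 GB virtual figure corresponds to YARN's default virtual-to-physical ratio of 2.1 (yarn.nodemanager.vmem-pmem-ratio) applied to the 1 GB container.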
