Storm supervisor can't connect to ZooKeeper - apache-storm

The error message in supervisor.log says the supervisor can't create stormClusterState.
At the same time, the /storm/supervisor directory in ZooKeeper is empty. The nimbus process can be started, but the supervisor cannot. Why?
The error message in supervisor.log:
java.lang.Error: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /storm
at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:663)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:667)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.utils.Utils.lambda$createDefaultUncaughtExceptionHandler$2(Utils.java:1047)[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.utils.Utils$$Lambda$17/00000000F826AC00.uncaughtException(Unknown Source)[storm-client-2.3.0.jar:2.3.0]
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:868) [?:1.8.0_242]
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:866) [?:1.8.0_242]
at java.lang.Thread.uncaughtException(Thread.java:1335) [?:1.8.0_242]
Caused by: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /storm
at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:493)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:147)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:288)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.zookeeper.ClientZookeeper.mkdirs(ClientZookeeper.java:70)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ZKStateStorage.<init>(ZKStateStorage.java:65)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ZKStateStorageFactory.mkStore(ZKStateStorageFactory.java:30)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ClusterUtils.mkStateStorageImpl(ClusterUtils.java:318)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ClusterUtils.mkStormClusterStateImpl(ClusterUtils.java:301)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ClusterUtils.mkStormClusterState(ClusterUtils.java:286)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.daemon.supervisor.Supervisor.<init>(Supervisor.java:160)~[storm-server-2.3.0.jar:2.3.0]
at org.apache.storm.daemon.supervisor.Supervisor.<init>(Supervisor.java:127)~[storm-server-2.3.0.jar:2.3.0]
at org.apache.storm.daemon.supervisor.Supervisor.main(Supervisor.java:200)~[storm-server-2.3.0.jar:2.3.0]
Caused by: org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /storm
at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:54)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1111)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)~[storm-shaded-deps
at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254)~[storm-shaded-deps-2.3.0.jar:2.3
at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35)~[storm-shaded-deps-2.3.0.jar:2.3.0]
at org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:144)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:288)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.zookeeper.ClientZookeeper.mkdirs(ClientZookeeper.java:70)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ZKStateStorage.<init>(ZKStateStorage.java:65)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ZKStateStorageFactory.mkStore(ZKStateStorageFactory.java:30)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ClusterUtils.mkStateStorageImpl(ClusterUtils.java:318)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ClusterUtils.mkStormClusterStateImpl(ClusterUtils.java:301)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.cluster.ClusterUtils.mkStormClusterState(ClusterUtils.java:286)~[storm-client-2.3.0.jar:2.3.0]
at org.apache.storm.daemon.supervisor.Supervisor.<init>(Supervisor.java:160)~[storm-server-2.3.0.jar:2.3.0]
at org.apache.storm.daemon.supervisor.Supervisor.<init>(Supervisor.java:127)~[storm-server-2.3.0.jar:2.3.0]
at org.apache.storm.daemon.supervisor.Supervisor.main(Supervisor.java:200)~[storm-server-2.3.0.jar:2.3.0]

A /storm node already exists in ZooKeeper, so the Storm cluster cannot reconnect to it. You can log in with the ZooKeeper client, delete the /storm node, and then start the Storm cluster processes again.
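For reference, a minimal ZooKeeper CLI session for that cleanup might look like the sketch below. The host, port, and script path are placeholders, and the exact delete command depends on the ZooKeeper version, so treat this as illustrative rather than as the exact commands from the answer. Also note that deleting /storm removes all Storm state kept in ZooKeeper, so running topologies will have to be resubmitted.
# Connect to one of the ZooKeeper servers configured in storm.yaml (host/port are examples)
bin/zkCli.sh -server zk-host:2181
# Inspect the Storm root znode first
ls /storm
# Remove the stale tree, then restart nimbus and the supervisors
deleteall /storm    # ZooKeeper 3.5+; on older 3.4.x CLIs use: rmr /storm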

Related

Flink in YARN + Checkpointing in HDFS - recurring error org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException

Flink YARN Cluster High Availability:
high-availability: zookeeper
high-availability.storageDir: hdfs://hann/user/flink/recovery
high-availability.zookeeper.quorum: XXX:2181
high-availability.zookeeper.path.root: /flink
state.backend: rocksdb
state.checkpoints.dir: hdfs://hann/user/flink/checkpoints
state.checkpoints.num-retained: 5
+ Streaming job (Kafka source -> Flink -> Some sinks)
StreamExecutionEnvironment:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(<interval>);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(<interval>);
env.getCheckpointConfig().setCheckpointTimeout(<interval>);
env.setRestartStrategy(<restartStrategies>);
It works well without checkpointing, but with it there are periodic crashes:
2018-06-29 07:15:56,429 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 444 @ 1530245743320 for job cf58d818c629f8297c6331b4130db1f9.
2018-06-29 07:16:16,638 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 444 of job cf58d818c629f8297c6331b4130db1f9 expired before completing.
2018-06-29 07:16:16,796 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 445 @ 1530245776638 for job cf58d818c629f8297c6331b4130db1f9.
2018-06-29 07:16:24,596 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Kafka (5/6) (5d1bb37e21bd68a04a752e62323c6d88) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 444 for operator Source: Kafka (5/6).}
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Could not materialize checkpoint 444 for operator Source: Kafka (5/6).
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://hann/user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-444/8ec33328-eb51-4c74-8b1b-dfc0ef185bfd in order to obtain the stream state handle
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://hann/user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-444/8ec33328-eb51-4c74-8b1b-dfc0ef185bfd in order to obtain the stream state handle
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:447)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:352)
at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
... 7 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-444/8ec33328-eb51-4c74-8b1b-dfc0ef185bfd (inode 97646080): File does not exist. Holder DFSClient_NONMAPREDUCE_-2015925738_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3752)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3839)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3809)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:748)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:248)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:551)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2220)
at org.apache.hadoop.ipc.Client.call(Client.java:1470)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.complete(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:443)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.complete(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2251)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2233)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:311)
... 12 more
At the same time in checkpoints dir:
~ # hdfs dfs -ls /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/
Found 6 items
drwxr-xr-x - flink flink 0 2018-06-29 07:15 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-441
drwxr-xr-x - flink flink 0 2018-06-29 07:15 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-442
drwxr-xr-x - flink flink 0 2018-06-29 07:15 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-443
drwxr-xr-x - flink flink 0 2018-06-29 07:16 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-445
drwxr-xr-x - flink flink 0 2018-06-29 02:48 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/shared
drwxr-xr-x - flink flink 0 2018-06-29 02:48 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/taskowned
There is no chk-444 folder in the checkpoints directory.
I'm stuck =(
I tried FsStateBackend and RocksDBStateBackend and there is no difference - I get this error every 5-6 hours.
P.S.
Flink 1.5.0
Hadoop 2.6.0

Is an S3NativeFileSystem call killing my PySpark application on AWS EMR 4.6.0?

My Spark application is failing when it has to access numerous CSV files (~1000, each ~63 MB) from S3 and pipe them into a Spark RDD. The actual process of splitting up the CSVs seems to work, but an extra call to S3NativeFileSystem seems to be causing an error and the job to crash.
To begin, the following is my PySpark Application:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
import time
startTime = float(time.time())
dataPath = 's3://PATHTODIRECTORY/'
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "MYKEY")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
def buildSchemaDF(tableName, columnList):
    currentRDD = sc.textFile(dataPath + tableName).map(lambda line: line.split("|"))
    currentDF = currentRDD.toDF(columnList)
    return currentDF
loadStartTime = float(time.time())
lineitemDF = buildSchemaDF('lineitem*', ['l_orderkey','l_partkey','l_suppkey','l_linenumber','l_quantity','l_extendedprice','l_discount','l_tax','l_returnflag','l_linestatus','l_shipdate','l_commitdate','l_receiptdate','l_shipinstruct','l_shipmode','l_comment'])
lineitemDF.registerTempTable("lineitem")
loadTimeElapsed = float(time.time()) - loadStartTime
queryStartTime = float(time.time())
qstr = """
SELECT
lineitem.l_returnflag,
lineitem.l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_discount) as sum_disc,
sum(l_tax) as sum_tax,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(l_orderkey) as count_order
FROM
lineitem
WHERE
l_shipdate <= '19981001'
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus
"""
tpch1DF = sqlContext.sql(qstr)
queryTimeElapsed = float(time.time()) - queryStartTime
totalTimeElapsed = float(time.time()) - startTime
tpch1DF.show()
queryResults = [qstr, loadTimeElapsed, queryTimeElapsed, totalTimeElapsed]
distData = sc.parallelize(queryResults)
distData.saveAsTextFile(dataPath + 'queryResults.csv')
print 'Load Time: ' + str(loadTimeElapsed)
print 'Query Time: ' + str(queryTimeElapsed)
print 'Total Time: ' + str(totalTimeElapsed)
To take it step by step I start off by spinning up a Spark EMR Cluster with the following AWS CLI command (carriage returns added for readability):
aws emr create-cluster --name "Big TPCH Spark cluster2" --release-label emr-4.6.0
--applications Name=Spark --ec2-attributes KeyName=blazing-test-aws
--log-uri s3://aws-logs-132950491118-us-west-2/elasticmapreduce/j-1WZ39GFS3IX49/
--instance-type m3.2xlarge --instance-count 6 --use-default-roles
After the EMR cluster finishes provisioning I then copy over my Pyspark application onto the master node at '/home/hadoop/pysparkApp.py'. With it copied over I'm able to add the Step for spark-submit.
aws emr add-steps --cluster-id j-1DQJ8BDL1394N --steps
Type=spark,Name=SparkTPCHTests,Args=[--deploy-mode,cluster,
--conf,spark.yarn.submit.waitAppCompletion=true,--num-executors,5,
--executor-cores,5,--executor-memory,20g,/home/hadoop/tpchSpark.py],
ActionOnFailure=CONTINUE
Now if I run this step over only a few of the aforementioned CSV files the final results will be generated, but the script will still claim to have failed.
I think it's associated with an extra call to S3NativeFileSystem, but I'm not certain. These are the Yarn log messages I'm getting which lead me to that conclusion. The first call appears to work just fine:
16/05/15 23:18:00 INFO HadoopRDD: Input split: s3://data-set-builder/splitLineItem2/lineitemad:0+64901757
16/05/15 23:18:00 INFO latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[ED8011CE4E1F6F18], ServiceEndpoint=[https://data-set-builder.s3-us-west-2.amazonaws.com], HttpClientPoolLeasedCount=0, RetryCapacityConsumed=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=2, ClientExecuteTime=[77.956], HttpRequestTime=[77.183], HttpClientReceiveResponseTime=[20.028], RequestSigningTime=[0.229], CredentialsRequestTime=[0.003], ResponseProcessingTime=[0.128], HttpClientSendRequestTime=[0.35],
While the second one does not seem to execute properly, resulting in "Partial Results" (206 Error):
16/05/15 23:18:00 INFO S3NativeFileSystem: Opening 's3://data-set-builder/splitLineItem2/lineitemad' for reading
16/05/15 23:18:00 INFO latency: StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[10BDDE61AE13AFBE], ServiceEndpoint=[https://data-set-builder.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RetryCapacityConsumed=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=2, ClientExecuteTime=[296.86], HttpRequestTime=[295.801], HttpClientReceiveResponseTime=[293.667], RequestSigningTime=[0.204], CredentialsRequestTime=[0.002], ResponseProcessingTime=[0.34], HttpClientSendRequestTime=[0.337],
16/05/15 23:18:02 INFO ApplicationMaster: Waiting for spark context initialization ...
I'm lost as to why it's even making the second call to S3NativeFileSystem when the first one appears to have responded effectively and even split the file. Is this something that is a product of my EMR configuration? I know S3Native has file limit issues and that a straight S3 call is optimal, which is what I've tried to do, but this call seems to be there no matter what I do. Please help!
Also, here are a few other error messages from my YARN log in case they are relevant.
1)
16/05/15 23:19:22 ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
16/05/15 23:19:22 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
2)
16/05/15 23:19:22 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401
java.io.FileNotFoundException: /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:226)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/15 23:19:22 ERROR BypassMergeSortShuffleWriter: Error while deleting file /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401
16/05/15 23:19:22 WARN TaskMemoryManager: leak 32.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap#762be8fe
16/05/15 23:19:22 ERROR Executor: Managed memory leak detected; size = 33816576 bytes, TID = 14
16/05/15 23:19:22 ERROR Executor: Exception in task 13.0 in stage 1.0 (TID 14)
java.io.FileNotFoundException: /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/3a/temp_shuffle_b9001fca-bba9-400d-9bc4-c23c002e0aa9 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The order of precedence for Spark configurations is:
SparkContext (code/application) > spark-submit > spark-defaults.conf
So a couple of things to point out here:
Use YARN as the master, with cluster deploy mode, in your spark-submit command:
spark-submit --deploy-mode cluster --master yarn ...
OR
spark-submit --master yarn-cluster ...
Remove "local" string from line sc = SparkContext("local", "Simple App") in your code. Use conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf) to initialize Spark context.
Ref - http://spark.apache.org/docs/latest/programming-guide.html
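Putting both points together, a minimal sketch of the driver initialization might look like this (the app name is illustrative, and the master is supplied by spark-submit/YARN rather than hard-coded):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
# No hard-coded "local" master: spark-submit decides where the driver runs,
# e.g. spark-submit --master yarn --deploy-mode cluster tpchSpark.py
conf = SparkConf().setAppName("SparkTPCHTests")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)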

Failed to connect to hadoop cluster when accessing file from pyspark

I'm running the following code:
conf = SparkConf().setAppName("basicRegressionUbuntu").setMaster("spark://MyCUSTOMIP:7077")
sc = SparkContext(conf=conf)
rdd = sc.textFile("hdfs://MYHADOOPMASTERNODE:8020/sampleData/Sacramentorealestatetransactions.csv")
It throws the following:
16/03/25 10:01:11 WARN security.UserGroupInformation: PriviledgedActionException as:hduser (auth:SIMPLE) cause:java.io.IOException: Failed to connect to /10.0.2.15:42939
Exception in thread "main" java.io.IOException: Failed to connect to /10.0.2.15:42939
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection timed out: /10.0.2.15:42939
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
I know that file path exists because when I SSH into MYHADOOPMASTERNODE and do an hdfs dfs -ls /sampleData/ it shows me the file.
Any help would be much appreciated!

Spring-XD Curator Connection Timeout

In Spring-XD the Curator Connection times out:
WARN ConnectionStateManager-0 curator.ConnectionState - Connection
attempt unsuccessful after 63021 (greater than max timeout of 60000).
Resetting connection and trying again with a new connection.
Curator tries to re-establish the connection but fails. Please check the logs below. Has anyone faced a similar issue? Please let me know if you know of any way to resolve it, or of any workarounds.
Also, the default Curator connection timeout is 60000 ms. Is there a way to increase it? Does Spring XD expose a property which can be set?
2014-12-10 01:24:41,003 WARN ConnectionStateManager-0
server.ContainerRegistrar - >>> disconnected container:
1c8a234d-4b8d-4d65-b374-xxxxe8619 2014-12-10 01:24:41,004 INFO
DeploymentsPathChildrenCache-0 server.ContainerRegistrar - Path cache
event: null, type: CONNECTION_SUSPENDED 2014-12-10 01:24:41,005 INFO
ConnectionStateManager-0 server.ContainerRegistrar - Undeploying
module [ModuleDescriptor#350920b1 moduleName = 'rabbit', moduleLabel =
'rabbit', group = 'xxx-ingestion-2', sourceChannelName = [null],
sinkChannelName = [null], sinkChannelName = [null], index = 0, type =
source, parameters = map['vhost' -> 'xxx_virtual_host', 'requeue' ->
'false', 'outputType' -> 'text/plain', 'queues' -> 'xx.xxx.queue',
'addresses' -> 'xxxmq.xx.xxxx.com'], children = list[[empty]]]
2014-12-10 01:24:46,022 ERROR pool-22-thread-1
connection.CachingConnectionFactory - Channel shutdown: clean
connection shutdown; protocol method:
method<connection.close>(reply-code=200, reply-text=OK, class-id=0, method-id=0)
2014-12-10 01:24:56,007 **ERROR CuratorFramework-0
curator.ConnectionState - Connection timed out for connection string
(514.xx.93.xxx:2181,504.58.xxx.xx:2181) and timeout (15000) / elapsed**
(15004) org.apache.curator.CuratorConnectionLossException:
KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744) 2014-12-10 01:24:56,161
ERROR main-EventThread curator.ConnectionState - Connection timed out
for connection string (514.xx.93.xxx:2181,504.58.xxx.xx:2181) and
timeout (15000) / elapsed (15159)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474)
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
at org.springframework.xd.dirt.server.ContainerRegistrar$StreamModuleWatcher.process(ContainerRegistrar.java:744)
at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2014-12-10 01:25:03,014 ERROR CuratorFramework-0 imps.CuratorFrameworkImpl - **Background retry gave up**
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Is this reproducible? Are you running in clustered or single node mode?
The Curator connection timeout (in milliseconds) can be set via system property curator-default-connection-timeout.
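As a rough illustration only (the launch script and the JAVA_OPTS handling below are assumptions about a typical XD install, not something stated in this thread), the property can be passed to the container JVM like this:
# Hypothetical example: raise the Curator connection timeout to 30 s
# before starting the XD container; adjust the value and script path to your setup.
export JAVA_OPTS="-Dcurator-default-connection-timeout=30000"
xd/bin/xd-container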

DSE 4.5.1 hadoop node stopped working

I have a 5-node cluster running DSE 4.5, and it is up. Out of the 5 nodes, 1 node is Hadoop-enabled. But suddenly the Hadoop node stopped working.
Logs:
INFO [JOB-TRACKER-INIT] 2014-08-19 08:18:44,196 CassandraFileSystem.java (line 68) CassandraFileSystem.uri : cfs://54.xx.xx.xx/
INFO [JOB-TRACKER-INIT] 2014-08-19 08:18:44,196 CassandraFileSystem.java (line 69) Default block size: 67108864
INFO [JOB-TRACKER-INIT] 2014-08-19 08:18:44,196 CassandraFileSystemThriftStore.java (line 309) Consistency level for reads from cfs: LOCAL_QUORUM
INFO [JOB-TRACKER-INIT] 2014-08-19 08:18:44,196 CassandraFileSystemThriftStore.java (line 310) Consistency level for writes into cfs: LOCAL_QUORUM
ERROR [JOB-TRACKER-INIT] 2014-08-19 08:18:44,197 UserGroupInformation.java (line 1124) PriviledgedActionException as:cassandra cause:java.io.IOException: UnavailableException()
INFO [JOB-TRACKER-INIT] 2014-08-19 08:18:44,197 JobTracker.java (line 2430) problem cleaning system directory: null
java.io.IOException: UnavailableException()
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.mutateINode(CassandraFileSystemThriftStore.java:905)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.storeINode(CassandraFileSystemThriftStore.java:827)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystem.mkdir(CassandraFileSystem.java:157)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystem.mkdirs(CassandraFileSystem.java:140)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystem.initialize(CassandraFileSystem.java:74)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at org.apache.hadoop.mapred.JobTracker$3.run(JobTracker.java:2373)
at org.apache.hadoop.mapred.JobTracker$3.run(JobTracker.java:2371)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2371)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2195)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2189)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:303)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:294)
at com.datastax.bdp.hadoop.mapred.JobTrackerRunner.initService(JobTrackerRunner.java:84)
at com.datastax.bdp.hadoop.mapred.JobTrackerRunner.initService(JobTrackerRunner.java:31)
at com.datastax.bdp.hadoop.mapred.ServiceRunner.run(ServiceRunner.java:127)
at java.lang.Thread.run(Thread.java:744)
Caused by: UnavailableException()
at org.apache.cassandra.thrift.ThriftConversion.rethrow(ThriftConversion.java:57)
at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:1079)
at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:1061)
at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:954)
at com.datastax.bdp.server.DseServer.batch_mutate(DseServer.java:576)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.mutateINode(CassandraFileSystemThriftStore.java:897)
... 23 more
Can anyone help with this issue? I'm not able to run Hive.
Thanks
