Setting up a RocketMQ cluster: slave not visible & not replicating

I'm trying to set up a RocketMQ cluster, with a single name server, 1 master and 2 slaves. But, I'm running into some problems.
The version I'm running was downloaded from GitHub (rocketmq-all-4.1.0-incubating.zip).
The brokers are run using mqbroker -c broker.conf, where broker.conf
differs for master and slave. For the master I have:
listenPort=10911
brokerName=mybroker
brokerClusterName=mybrokercluster
brokerId=0
deleteWhen=04
fileReservedTime=48
brokerRole=SYNC_MASTER
flushDiskType=ASYNC_FLUSH
And for slaves:
listenPort=10911
brokerName=mybroker
brokerClusterName=mybrokercluster
brokerId=1
deleteWhen=04
fileReservedTime=48
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH
The second slave has brokerId=2.
The brokers start up fine. Here are some parts of the log for one of the slaves:
2017-10-02 20:31:35 INFO main - brokerRole=ASYNC_MASTER
2017-10-02 20:31:35 INFO main - flushDiskType=ASYNC_FLUSH
(...)
2017-10-02 20:31:35 INFO main - Replace, key: brokerId, value: 0 -> 1
2017-10-02 20:31:35 INFO main - Replace, key: brokerRole, value:
ASYNC_MASTER -> SLAVE
(...)
2017-10-02 20:31:37 INFO main - Set user specified name server address:
172.22.1.38:9876
2017-10-02 20:31:37 INFO ShutdownHook - Shutdown hook was invoked, 1
2017-10-02 20:31:37 INFO ShutdownHook - shutdown thread
PullRequestHoldService interrupt false
2017-10-02 20:31:37 INFO ShutdownHook - join thread PullRequestHoldService
eclipse time(ms) 0 90000
2017-10-02 20:31:37 WARN ShutdownHook - unregisterBroker Exception,
172.22.1.38:9876
org.apache.rocketmq.remoting.exception.RemotingConnectException: connect to
<172.22.1.38:9876> failed
at
org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:359)
~[rocketmq-remoting-4.1.0-incubating.jar:4.1.0-incubating]
at
org.apache.rocketmq.broker.out.BrokerOuterAPI.unregisterBroker(BrokerOuterAPI.java:221)
~[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at
org.apache.rocketmq.broker.out.BrokerOuterAPI.unregisterBrokerAll(BrokerOuterAPI.java:198)
~[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at
org.apache.rocketmq.broker.BrokerController.unregisterBrokerAll(BrokerController.java:623)
[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at
org.apache.rocketmq.broker.BrokerController.shutdown(BrokerController.java:589)
[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at org.apache.rocketmq.broker.BrokerStartup$1.run(BrokerStartup.java:218)
[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_141]
2017-10-02 20:31:37 INFO ShutdownHook - Shutdown hook over, consuming total
time(ms): 25
2017-10-02 20:31:45 INFO BrokerControllerScheduledThread1 - dispatch behind
commit log 0 bytes
2017-10-02 20:31:45 INFO BrokerControllerScheduledThread1 - Slave fall
behind master: 0 bytes
2017-10-02 20:31:45 INFO BrokerControllerScheduledThread1 - register broker
to name server 172.22.1.38:9876 OK
2017-10-02 20:32:15 INFO BrokerControllerScheduledThread1 - register broker
to name server 172.22.1.38:9876 OK
I suspect the broker is trying to connect to the name server, which isn't running yet at that point, so it retries and eventually succeeds?
However, when I later run clusterList, I only see one broker listed, which happens to be a slave (172.22.1.17) that has brokerId=2 in its configuration (although here it's listed as 0):
$ ./mqadmin clusterList -n 172.22.1.38:9876
#Cluster Name    #Broker Name  #BID  #Addr              #Version          #InTPS(LOAD)  #OutTPS(LOAD)  #PCWait(ms)  #Hour      #SPACE
mybrokercluster  mybroker      0     172.22.1.17:10911  V4_1_0_SNAPSHOT   0.00(0,0ms)   0.00(0,0ms)    0            418597.80  -1.0000
Moreover, when sending messages to the master, I get SLAVE_NOT_AVAILABLE.
Why is that? Are the brokers configured properly? If so, why does clusterList report them incorrectly?

You should change the slave port; 10911 is already used by another process (the master node). Each slave should use a different TCP port (e.g. 10921, 10931, and so on).
Tip: my cluster is deployed on a single machine, so I changed the TCP ports and startup succeeded. If your master and slaves are deployed on different machines and startup still fails, check the RocketMQ error log for more information.
Note: if one master has more than one slave, each slave's brokerId must be different.
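For example, under this answer's single-machine assumption, the first slave's broker.conf might look like the sketch below (10921 is just an illustrative free port; the point is that listenPort and brokerId differ per broker):
# slave 1: same cluster/broker name as the master, unique listenPort and brokerId
listenPort=10921
brokerName=mybroker
brokerClusterName=mybrokercluster
brokerId=1
deleteWhen=04
fileReservedTime=48
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH
The second slave would then use another free port (e.g. listenPort=10931) together with brokerId=2.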

Related

The oozie job does not run with the message [AM container is launched, waiting for AM container to Register with RM]

I ran a shell job from the Oozie examples.
However, the YARN application is not executed.
Detailed information from the YARN UI & logs:
https://docs.google.com/document/d/1N8LBXZGttY3rhRTwv8cUEfK3WkWtvWJ-YV1q_fh_kks/edit
The YARN application status is:
Application Priority: 0 (Higher Integer value indicates higher priority)
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
Queue: default
FinalStatus Reported by AM: Application has not completed yet.
Finished: N/A
Elapsed: 20mins, 30sec
Tracking URL: ApplicationMaster
Log Aggregation Status: DISABLED
Application Timeout (Remaining Time): Unlimited
Diagnostics: AM container is launched, waiting for AM container to Register with RM
The application attempt status is:
Application Attempt State: FAILED
Elapsed: 13mins, 19sec
AM Container: container_1607273090037_0001_02_000001
Node: N/A
Tracking URL: History
Diagnostics Info: ApplicationMaster for attempt appattempt_1607273090037_0001_000002 timed out
Node Local Request Rack Local Request Off Switch Request
Num Node Local Containers (satisfied by) 0
Num Rack Local Containers (satisfied by) 0 0
Num Off Switch Containers (satisfied by) 0 0 1
nodemanager log
2020-12-07 01:45:16,237 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_1607273090037_0001_01_000001]
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1607273090037_0001_01_000001 transitioned from SCHEDULED to RUNNING
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1607273090037_0001_01_000001
2020-12-07 01:45:16,272 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop-oozie/nm-local-dir/usercache/oozie/appcache/application_1607273090037_0001/container_1607273090037_0001_01_000001/default_container_executor.sh]
2020-12-07 01:45:17,301 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1607273090037_0001_01_000001's ip = 127.0.0.1, and hostname = localhost.localdomain
2020-12-07 01:45:17,345 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1607273090037_0001_01_000001 since CPU usage is not yet available.
2020-12-07 01:45:48,274 INFO logs: Aliases are enabled
2020-12-07 01:54:50,242 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 496756, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2020-12-07 01:58:10,071 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1607273090037_0001_000001 (auth:SIMPLE)
2020-12-07 01:58:10,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1607273090037_0001_01_000001
What is the problem?

Kafka on Mac: There are no in-flight requests for node -1 while starting console consumer

I am new to Kafka and am trying to install and run the console consumer, but I am getting the error java.lang.IllegalStateException: There are no in-flight requests for node -1.
The environments I tried are as below:
Kafka version kafka_2.13-2.6.0
macOS, Java 11: fails
macOS, Java 1.8: fails
Windows 10, Java 11: succeeds
Below are the steps in detail that I am performing. The same steps work on Windows.
STEP 1 Download KAFKA
I downloaded Kafka from https://www.apache.org/dyn/closer.cgi?path=/kafka/2.6.0/kafka_2.13-2.6.0.tgz
STEP 2 Start the ZooKeeper service, which works fine.
I am starting ZooKeeper with the command below:
bin % zookeeper-server-start.sh ../config/zookeeper.properties
Here is the ZooKeeper log:
[2020-08-30 14:28:52,234] INFO Server environment:java.io.tmpdir=/var/folders/q7/khp8p9k14rzfs_zl52m57hlr0000gn/T/ (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:java.compiler=<NA> (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:os.name=Mac OS X (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:os.arch=x86_64 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:os.version=10.15.6 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:user.name=jigarnaik (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:user.home=/Users/jigarnaik (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:user.dir=/Users/jigarnaik/Documents/kafka_2.13-2.6.0/bin (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:os.memory.free=496MB (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:os.memory.max=512MB (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,234] INFO Server environment:os.memory.total=512MB (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,235] INFO minSessionTimeout set to 6000 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,235] INFO maxSessionTimeout set to 60000 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,236] INFO Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000 datadir /tmp/zookeeper/version-2 snapdir /tmp/zookeeper/version-2 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-08-30 14:28:52,243] INFO Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory (org.apache.zookeeper.server.ServerCnxnFactory)
[2020-08-30 14:28:52,247] INFO Configuring NIO connection handler with 10s sessionless connection timeout, 2 selector thread(s), 32 worker threads, and 64 kB direct buffers. (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2020-08-30 14:28:52,254] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2020-08-30 14:28:52,268] INFO zookeeper.snapshotSizeFactor = 0.33 (org.apache.zookeeper.server.ZKDatabase)
[2020-08-30 14:28:52,271] INFO Snapshotting: 0x0 to /tmp/zookeeper/version-2/snapshot.0 (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
[2020-08-30 14:28:52,273] INFO Snapshotting: 0x0 to /tmp/zookeeper/version-2/snapshot.0 (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
[2020-08-30 14:28:52,286] INFO Using checkIntervalMs=60000 maxPerMinute=10000 (org.apache.zookeeper.server.ContainerManager)
load: 1.58 cmd: java 17764 waiting 0.68u 0.11s
STEP 3 Start Kafka; the logs look fine to me.
After that I start Kafka using the command below:
bin % kafka-server-start.sh ../config/server.properties
Kafka startup looks fine, with the following log tail:
[2020-08-30 14:33:41,708] INFO [TransactionCoordinator id=0] Starting up. (kafka.coordinator.transaction.TransactionCoordinator)
[2020-08-30 14:33:41,709] INFO [Transaction Marker Channel Manager 0]: Starting (kafka.coordinator.transaction.TransactionMarkerChannelManager)
[2020-08-30 14:33:41,709] INFO [TransactionCoordinator id=0] Startup complete. (kafka.coordinator.transaction.TransactionCoordinator)
[2020-08-30 14:33:41,728] INFO [ExpirationReaper-0-AlterAcls]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2020-08-30 14:33:41,757] INFO [/config/changes-event-process-thread]: Starting (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2020-08-30 14:33:41,770] INFO [SocketServer brokerId=0] Starting socket server acceptors and processors (kafka.network.SocketServer)
[2020-08-30 14:33:41,773] INFO [SocketServer brokerId=0] Started data-plane acceptor and processor(s) for endpoint : ListenerName(PLAINTEXT) (kafka.network.SocketServer)
[2020-08-30 14:33:41,773] INFO [SocketServer brokerId=0] Started socket server acceptors and processors (kafka.network.SocketServer)
[2020-08-30 14:33:41,775] INFO Kafka version: 2.6.0 (org.apache.kafka.common.utils.AppInfoParser)
[2020-08-30 14:33:41,775] INFO Kafka commitId: 62abe01bee039651 (org.apache.kafka.common.utils.AppInfoParser)
[2020-08-30 14:33:41,775] INFO Kafka startTimeMs: 1598769221773 (org.apache.kafka.common.utils.AppInfoParser)
[2020-08-30 14:33:41,776] INFO [KafkaServer id=0] started (kafka.server.KafkaServer)
[2020-08-30 14:33:41,808] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(word-count-output-0, word-count-input-0) (kafka.server.ReplicaFetcherManager)
[2020-08-30 14:33:41,815] INFO [Partition word-count-output-0 broker=0] Log loaded for partition word-count-output-0 with initial high watermark 0 (kafka.cluster.Partition)
[2020-08-30 14:33:41,821] INFO [Partition word-count-input-0 broker=0] Log loaded for partition word-count-input-0 with initial high watermark 0 (kafka.cluster.Partition)
STEP 4 Create the topic; the topic is created successfully.
After that I create the topic:
bin % kafka-topics.sh \
--create \
--zookeeper localhost:2181 \
--replication-factor 1 \
--partitions 1 \
--topic my-topic
Created topic my-topic.
List the topics:
bin % kafka-topics.sh \
--list \
--zookeeper localhost:2181
my-topic
word-count-input
word-count-output
STEP 5 Run the console producer; it fails with the error below.
When I try to start the console producer using the command below, I get the exception java.lang.IllegalStateException: There are no in-flight requests for node -1. The same step works fine when I run it on Windows; I am getting this issue only on Mac.
bin % kafka-console-producer.sh \
--topic word-count-input \
--bootstrap-server localhost:9092
>[2020-08-30 14:35:17,902] ERROR [Producer clientId=console-producer] Uncaught error in kafka producer I/O thread: (org.apache.kafka.clients.producer.internals.Sender)
java.nio.BufferUnderflowException
at java.nio.Buffer.nextGetIndex(Buffer.java:509)
at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:364)
at org.apache.kafka.common.protocol.ByteBufferAccessor.readInt(ByteBufferAccessor.java:43)
at org.apache.kafka.common.message.ResponseHeaderData.read(ResponseHeaderData.java:102)
at org.apache.kafka.common.message.ResponseHeaderData.<init>(ResponseHeaderData.java:70)
at org.apache.kafka.common.requests.ResponseHeader.parse(ResponseHeader.java:66)
at org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:717)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:834)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:553)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240)
at java.lang.Thread.run(Thread.java:748)
[2020-08-30 14:35:17,904] ERROR [Producer clientId=console-producer] Uncaught error in kafka producer I/O thread: (org.apache.kafka.clients.producer.internals.Sender)
java.lang.IllegalStateException: There are no in-flight requests for node -1
at org.apache.kafka.clients.InFlightRequests.requestQueue(InFlightRequests.java:62)
at org.apache.kafka.clients.InFlightRequests.completeNext(InFlightRequests.java:70)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:833)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:553)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240)
at java.lang.Thread.run(Thread.java:748)
[2020-08-30 14:35:17,938] ERROR Uncaught exception in thread 'kafka-producer-network-thread | console-producer': (org.apache.kafka.common.utils.KafkaThread)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:113)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:447)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:397)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:678)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:580)
at org.apache.kafka.common.network.Selector.poll(Selector.java:485)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240)
at java.lang.Thread.run(Thread.java:748)
Finally I found the issue: the port was being used by SonarQube running on my local machine, but surprisingly no error was reported when I started Kafka.
Honestly, I don't know how the firewall/antivirus played a role, but somehow it is linked to this problem.
In my case, the consumer was able to consume messages only once, but it threw a BufferUnderflowException when I started the consumer again.
Turning off the firewall/antivirus wasn't an option for me, so I added/enabled advertised.listeners=PLAINTEXT://localhost:9092 in the Kafka server's server.properties, and afterwards it worked as expected.
Kafka version: 2.6.0
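For reference, a minimal sketch of the relevant lines in config/server.properties under this fix (the listeners value shown is the stock default; replace localhost with whatever hostname clients should use):
# where the broker binds
listeners=PLAINTEXT://:9092
# the address the broker advertises to clients in metadata responses
advertised.listeners=PLAINTEXT://localhost:9092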
My Norton antivirus firewall settings were causing this issue. I disabled the firewall settings and everything started working fine.
I got the same issue. It was happening because SonarQube also uses port 9092. To avoid this, I suggest using the SonarQube Docker image instead of running it locally.
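Since two of the answers above trace the problem to another process (SonarQube) already sitting on the port, it can help to check what is listening on 9092 before starting Kafka. A quick check on macOS or Linux (a sketch, assuming lsof is installed):
# print the process, if any, currently listening on TCP port 9092
lsof -nP -iTCP:9092 -sTCP:LISTEN
If this prints SonarQube (or any other process), stop it or move one of the two to a different port before starting the broker.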

Changing the Mesos master leader causes Marathon to shut down?

Env:
Zookeeper on computer A,
Mesos master on computer B as Leader,
Mesos master on computer C,
Marathon on computer B singleton.
Action:
Kill the Mesos master task on computer B to force a Mesos cluster leader change.
Result:
The Mesos cluster leader changes to the Mesos master on computer C,
but the Marathon task on computer B automatically shuts down with the following logs.
Question:
Can somebody help me understand why Marathon went down, and how to fix it?
Logs:
I1109 12:19:10.010197 11287 detector.cpp:152] Detected a new leader: (id='9')
I1109 12:19:10.010646 11291 group.cpp:699] Trying to get '/mesos/json.info_0000000009' in ZooKeeper
I1109 12:19:10.013425 11292 zookeeper.cpp:262] A new leading master (UPID=master@10.4.23.55:5050) is detected
[2017-11-09 12:19:10,015] WARN Disconnected (mesosphere.marathon.MarathonScheduler:Thread-23)
I1109 12:19:10.018977 11292 sched.cpp:2021] Asked to stop the driver
I1109 12:19:10.019161 11292 sched.cpp:336] New master detected at master@10.4.23.55:5050
I1109 12:19:10.019892 11292 sched.cpp:1203] Stopping framework d52cbd8c-1015-4d94-8328-e418876ca5b2-0000
[2017-11-09 12:19:10,020] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Abdicating leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Stopping the election service (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,029] INFO backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
[2017-11-09 12:19:10,061] INFO Session: 0x15f710ffb010058 closed (org.apache.zookeeper.ZooKeeper:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,061] INFO EventThread shut down for session: 0x15f710ffb010058 (org.apache.zookeeper.ClientCnxn:pool-3-thread-1-EventThread)
[2017-11-09 12:19:10,063] INFO Stopping MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,063] INFO Lost leadership (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,066] INFO All actors suspended:
* Actor[akka://marathon/user/offerMatcherStatistics#-1904211014]
* Actor[akka://marathon/user/reviveOffersWhenWanted#-238627718]
* Actor[akka://marathon/user/expungeOverdueLostTasks#608979053]
* Actor[akka://marathon/user/launchQueue#803590575]
* Actor[akka://marathon/user/offersWantedForReconciliation#598482724]
* Actor[akka://marathon/user/offerMatcherLaunchTokens#813230776]
* Actor[akka://marathon/user/offerMatcherManager#1205401692]
* Actor[akka://marathon/user/instanceTracker#1055980147]
* Actor[akka://marathon/user/killOverdueStagedTasks#-40058350]
* Actor[akka://marathon/user/taskKillServiceActor#-602552505]
* Actor[akka://marathon/user/rateLimiter#-911383474]
* Actor[akka://marathon/user/deploymentManager#2013376325] (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-10)
I1109 12:19:10.069551 11272 sched.cpp:2021] Asked to stop the driver
[2017-11-09 12:19:10,068] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,069] INFO Stopped MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,070] INFO Terminating due to leadership abdication or failure (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,071] INFO Call postDriverRuns callbacks on (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,074] INFO Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-12)
[2017-11-09 12:19:10,074] INFO Suspending scheduler actor (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
[2017-11-09 12:19:10,083] INFO Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,084] INFO ExpungeOverdueLostTasksActor has stopped (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-9)
[1]+ Exit 137
I think there is a wrong configuration in the ZooKeeper cluster. Use a 3-node ZooKeeper cluster with 2 Mesos masters and multiple slaves. Ref: https://www.google.co.in/amp/s/beingasysadmin.wordpress.com/2014/08/16/managing-ha-docker-cluster-using-multiple-mesos-masters/amp/
Did you set the master reference in the Marathon conf? Can you run:
cat /etc/marathon/conf/master
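For a setup where the Mesos leader can change, the master file normally points Marathon at the ZooKeeper ensemble rather than at a single master address, so that Marathon can follow the failover. A sketch of what /etc/marathon/conf/master could contain (the hostname is a placeholder for the ZooKeeper machine, computer A in this description):
zk://computer-a:2181/mesos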

Pig keeps trying to connect to job history server (and fails)

I'm running a Pig job that fails to connect to the Hadoop job history server.
The task (usually any task with a GROUP BY) runs for a while and then starts printing messages like:
2015-04-21 19:05:22,825 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-04-21 19:05:26,721 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-04-21 19:05:29,721 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
It then continues retrying the connection for a while. Sometimes it proceeds further with the job; other times it throws this exception:
2015-04-21 19:05:55,822 [main] WARN org.apache.pig.tools.pigstats.mapreduce.MRJobStats - Unable to get job counters
java.io.IOException: java.io.IOException: java.net.NoRouteToHostException: No Route to Host from cluster-01/10.10.10.11 to 0.0.0.0:10020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getCounters(HadoopShims.java:132)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addCounters(MRJobStats.java:284)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
I found this question here but in my case the job history server is started. If I run netstat, I find:
tcp 0 0 0.0.0.0:10020 0.0.0.0:* LISTEN 12073/java off (0.00/0/0)
Where 12073 is ...
12073 pts/4 Sl 0:07 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Dproc_historyserver -Xmx1000m -Djava.library.path=/data/hadoop/hadoop/lib -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/hadoop/hadoop-2.3.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/data/hadoop/hadoop-2.3.0 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/hadoop/hadoop/logs -Dhadoop.log.file=mapred-hadoop-historyserver-cluster-01.log -Dhadoop.root.logger=INFO,RFA -Dmapred.jobsummary.logger=INFO,JSA -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
I tried opening port 10020 in case it was a firewall issue:
ACCEPT tcp -- anywhere anywhere tcp dpt:10020
... but no luck.
After a few minutes, some of the tasks just arbitrarily continue to the next part.
I'm using Hadoop 2.3 and Pig 0.14.
My question is:
1) What are the possible reasons why Pig cannot connect to the job history server (JHS) given that the JHS is running on the same port that Pig looks for it?
... or failing that ...
2) Is there any way to just tell Pig to stop trying to connect to the JHS and continue with the task?
It seems that most Hadoop installation/configuration guides neglect to mention configuring the Job History Server. It seems that Pig, in particular, relies on this server. It also seems like the default (local) settings for the JHS won't work in a multi-node cluster.
The solution was to add the hostname of the server to the configuration in mapred-site.xml, to make sure it could be accessed from the other machines. (In my version of the file, the lines had to be added as "new" ... there were no previous settings.)
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>cm:10020</value>
  <description>Host and port for Job History Server (default 0.0.0.0:10020)</description>
</property>
Then restart the job history server:
mr-jobhistory-daemon.sh stop historyserver
mr-jobhistory-daemon.sh start historyserver
If you get a bind exception (port in use), it means the stop didn't work. Either
Use ps ax | grep -e JobHistory to get the process and kill it manually with kill -9 [pid]. Then call the start command above again. Or
Use a different port in the configuration
Pig should pick up the new settings automatically. Run a Pig script and hope for the best.
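To double-check that the server is now reachable on the configured address (a sketch, assuming netstat and telnet are available; cm is the host used in the mapred-site.xml snippet above):
# on the history server host: confirm the JHS is bound as configured
netstat -tlnp | grep 10020
# from another node in the cluster: confirm the host/port is reachable
telnet cm 10020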
Start the history server from the Hadoop bin directory using the command below:
bin$ ./mr-jobhistory-daemon.sh start historyserver
Then run Pig using the command below:
$ pig
Configure mapreduce.jobhistory.address in hadoop/etc/hadoop/mapred-site.xml,
then:
mapred --daemon start historyserver
The solution was that the history server was not running:
[user@vm9 sbin]$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/user/hadoop-2.7.7/logs/mapred-user-historyserver-vm9.out
[user@vm9 sbin]$ jps
5683 NameNode
6309 NodeManager
5974 SecondaryNameNode
8075 RunJar
6204 ResourceManager
8509 JobHistoryServer
5821 DataNode
8542 Jps
[user@vm9 sbin]$
Now Pig can run properly: it connects to the job history server and the dump command works fine.

Storm worker not starting

I am trying to run a Storm topology, but the Storm worker refuses to start. When I run the java command that invokes the worker process, I get the following error:
Exception: java.lang.StackOverflowError thrown from the UncaughtExceptionHandler in thread "main"
I am not able to find what is causing this. Has anyone faced a similar issue?
Edit:
When I run the worker process with the -V flag, I get the following error:
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:java.library.path=/usr/local/lib:/opt/local/lib:/usr/lib
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:java.io.tmpdir=/tmp
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:java.compiler=<NA>
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:os.name=Linux
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:os.arch=amd64
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:os.version=3.5.0-23-generic
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:user.name=storm
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:user.home=/home/storm
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:user.dir=/home/storm/storm-0.9.0.1
797 [main] ERROR org.apache.zookeeper.server.NIOServerCnxn - Thread Thread[main,5,main] died
PS: When I run the same topology in a local cluster it works fine; it only fails to start when I deploy it in cluster mode.
I just found the issue. The jar I created to upload to the Storm cluster was kept in the Storm base directory. This somehow created a conflict that was not shown in the log file; in fact, the log file never got created.
Make sure no external jars are present in the base Storm folder from which you start Storm. It is a really tricky error, with no indication of the cause until you work around it.
I hope the Storm developers add this to the logs, so that users hitting this issue can pinpoint exactly what is happening.
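A quick way to spot this before submitting is to check the Storm install root for stray jars (a sketch; the install path is taken from the logs above, and the topology jar/class names are placeholders):
# list any jar files sitting directly in the Storm base directory
ls /home/storm/storm-0.9.0.1/*.jar
# keep the topology jar elsewhere and submit it from there
storm jar /path/to/mytopology.jar com.example.MyTopology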
